Table of contents

  • Preliminary
    • Linear Algebra
  • Hyperplanes
    • Hyperplanes as linear discriminants
  • Augmented vector
  • Margin
    • Margin loss

37  Linear Discriminant

Preliminary

Linear Algebra

  • Affine space Section 11.4

Hyperplanes

Recall that a hyperplane \mathcal{H} in \mathbb{R}^{d} is an affine space that can be expressed as

\mathcal{H} = \left\{ \mathbf{x} \mid \mathbf{w}^{T} \mathbf{x} + b = 0 \right\},

which can be viewed as the subspace \mathbf{w}^{\perp} translated by the vector \mathbf{x}_{0}

\mathcal{H} = \left\{ \mathbf{u} + \mathbf{x}_{0} \mid \mathbf{u} \in \mathbf{w}^{\perp} \right\},

where

  • \mathbf{w}^{\perp} is the subspace of all vectors orthogonal to \mathbf{w},

  • \mathbf{x}_{0} = - \frac{b}{\mathbf{w}^{T} \mathbf{w}} \mathbf{w} is a particular point on \mathcal{H} (a quick numerical check follows below).
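The following sketch checks these two facts numerically: \mathbf{x}_{0} lies on \mathcal{H}, and any vector orthogonal to \mathbf{w}, translated by \mathbf{x}_{0}, also lies on \mathcal{H}. The particular values of \mathbf{w}, b, and \mathbf{u} are arbitrary choices for illustration.

```python
import numpy as np

# Illustrative (assumed) hyperplane parameters.
w = np.array([3.0, -1.0, 2.0])
b = 4.0

# x0 = -b / (w^T w) * w is a particular point on H.
x0 = -b / (w @ w) * w
print(np.isclose(w @ x0 + b, 0.0))        # True: x0 lies on H

# Any u orthogonal to w, translated by x0, also lies on H.
u = np.array([1.0, 1.0, -1.0])            # w @ u = 3 - 1 - 2 = 0
print(np.isclose(w @ (u + x0) + b, 0.0))  # True
```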

Also recall that given a vector \mathbf{x} \in \mathbb{R}^{d}, its orthogonal projection onto \mathcal{H} is

\mathbf{p} = \mathbf{x} - \frac{ \mathbf{w}^{T} \mathbf{x} + b }{ \mathbf{w}^{T} \mathbf{w} } \mathbf{w}.

Therefore, the distance between \mathbf{x} and \mathcal{H} is the length of the vector that connects \mathbf{p} and \mathbf{x}

\begin{aligned} \lVert \mathbf{x} - \mathbf{p} \rVert & = \left\lVert \frac{ \mathbf{w}^{T} \mathbf{x} + b }{ \mathbf{w}^{T} \mathbf{w} } \mathbf{w} \right\rVert \\ & = \left\lvert \frac{ \mathbf{w}^{T} \mathbf{x} + b }{ \mathbf{w}^{T} \mathbf{w} } \right\rvert \lVert \mathbf{w} \rVert \\ & = \frac{ \lvert \mathbf{w}^{T} \mathbf{x} + b \rvert }{ \lVert \mathbf{w} \rVert^{2} } \lVert \mathbf{w} \rVert \\ & = \frac{ \lvert \mathbf{w}^{T} \mathbf{x} + b \rvert }{ \lVert \mathbf{w} \rVert }. \end{aligned}
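A minimal numerical sketch of the projection and of the distance formula, using arbitrary assumed values for \mathbf{w}, b, and \mathbf{x}:

```python
import numpy as np

# Illustrative (assumed) hyperplane parameters and query point.
w = np.array([3.0, -1.0, 2.0])
b = 4.0
x = np.array([1.0, 2.0, -1.0])

# Orthogonal projection of x onto H.
p = x - (w @ x + b) / (w @ w) * w
print(np.isclose(w @ p + b, 0.0))  # True: p lies on H

# Distance from x to H, computed two equivalent ways.
d1 = np.linalg.norm(x - p)
d2 = abs(w @ x + b) / np.linalg.norm(w)
print(np.isclose(d1, d2))          # True
```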

Hyperplanes as linear discriminants

The linear function defined by

f (\mathbf{x}) = \mathbf{w}^{T} \mathbf{x} + b

divides the points in \mathbb{R}^{d} into three regions

  • f (\mathbf{x}) = 0: the points on the hyperplane \mathcal{H}.

  • f (\mathbf{x}) > 0: the points on the positive side of f (\mathbf{x}), which is the side that \mathbf{w} points to.

  • f (\mathbf{x}) < 0: the points on the negative side of f (\mathbf{x}).

Therefore, f (\mathbf{x}) is a linear discriminant if we classify the points in \mathbb{R}^{d} based on the following decision rule

g (\mathbf{x}) = \text{sign} (f (\mathbf{x})) = \begin{cases} 1 & f (\mathbf{x}) > 0 \\ -1 & f (\mathbf{x}) < 0 \\ \end{cases}
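A short sketch of this decision rule on a toy 2-dimensional problem; the weights, bias, and points below are assumptions chosen only so that one point falls in each of the three regions:

```python
import numpy as np

# Illustrative (assumed) linear discriminant.
w = np.array([1.0, -2.0])
b = 0.5

def f(X):
    """Apply f(x) = w^T x + b to each row of X."""
    return X @ w + b

X = np.array([[2.0, 0.0],    # f > 0: positive side (the side w points to)
              [0.0, 1.0],    # f < 0: negative side
              [1.0, 0.75]])  # f = 0: on the hyperplane

print(f(X))           # [ 2.5 -1.5  0. ]
print(np.sign(f(X)))  # [ 1. -1.  0.]
```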

Augmented vector

A trick that is often used in the analysis of various linear discriminant algorithms is to absorb the bias term b into the weight vector

f (\mathbf{x}) = \mathbf{w}^{T} \mathbf{x} + b = \hat{\mathbf{w}}^{T} \hat{\mathbf{x}},

where \hat{\mathbf{w}} is defined by appending b to \mathbf{w} as the last element

\hat{\mathbf{w}} = \begin{bmatrix} w_{1} \\ \vdots \\ w_{d} \\ b \\ \end{bmatrix},

and \hat{\mathbf{x}} is formed by appending a 1 to \mathbf{x} as the last element

\hat{\mathbf{x}} = \begin{bmatrix} x_{1} \\ \vdots \\ x_{d} \\ 1 \\ \end{bmatrix}.
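The sketch below verifies the identity \mathbf{w}^{T} \mathbf{x} + b = \hat{\mathbf{w}}^{T} \hat{\mathbf{x}} on assumed values of \mathbf{w}, b, and \mathbf{x}:

```python
import numpy as np

# Illustrative (assumed) weights, bias, and input.
w = np.array([1.0, -2.0, 0.5])
b = -1.5
x = np.array([2.0, 1.0, 4.0])

w_hat = np.append(w, b)    # [w_1, ..., w_d, b]
x_hat = np.append(x, 1.0)  # [x_1, ..., x_d, 1]

print(np.isclose(w @ x + b, w_hat @ x_hat))  # True
```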

Margin

The margin of the instance \mathbf{x} is its signed distance from the hyperplane f

\gamma (\mathbf{x}) = \frac{\hat{y} f (\mathbf{x})}{\lVert \mathbf{w} \rVert},

where \hat{y} \in \{1, -1\} is the label of \mathbf{x}, with the negative class encoded as -1 instead of 0.

The margin of a hyperplane f is the smallest instance margin over the training set, which equals the distance from the hyperplane to the closest training point when every instance is correctly classified

\gamma = \min_{i} \gamma (\mathbf{x}_{i}) = \min_{i} \frac{ \hat{y}_{i} f (\mathbf{x}_{i}) }{ \lVert \mathbf{w} \rVert }.

A linear discriminant with a positive margin classifies all training instances correctly

\gamma > 0 \Leftrightarrow \hat{y}_{i} (\mathbf{w}^{T} \mathbf{x}_{i} + b) > 0, \quad \forall i.
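A sketch that computes the instance margins and the margin of the hyperplane on a toy training set; the weights, bias, points, and labels are assumptions for illustration:

```python
import numpy as np

# Illustrative (assumed) hyperplane and training set, labels in {1, -1}.
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[ 2.0, 2.0],
              [ 3.0, 0.5],
              [ 0.0, 0.0],
              [-1.0, 0.5]])
y = np.array([1, 1, -1, -1])

f = X @ w + b
inst_margins = y * f / np.linalg.norm(w)  # gamma(x_i): signed distances
gamma = inst_margins.min()                # margin of the hyperplane

print(inst_margins)                  # all positive: every point is classified correctly
print(gamma > 0, np.all(y * f > 0))  # True True, matching the equivalence above
```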

Margin loss

The margin loss is a loss function that is defined in terms of the margin of the instance

L (y, g (\mathbf{x})) = \phi (\gamma (\mathbf{x})),

where \phi is typically a non-increasing function of the margin, so that instances with small or negative margins incur larger losses.
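As a concrete sketch, the hinge function \phi(\gamma) = \max(0, 1 - \gamma) is one common choice of \phi (an assumption here, not something prescribed by this section):

```python
import numpy as np

# Margin loss with phi chosen as the hinge function phi(gamma) = max(0, 1 - gamma);
# this particular phi is an assumption, the definition only requires a function
# of the instance margin.
def margin_loss(y, f_x, w):
    gamma = y * f_x / np.linalg.norm(w)  # instance margin gamma(x)
    return np.maximum(0.0, 1.0 - gamma)  # phi(gamma)

w = np.array([1.0, 1.0])
print(margin_loss(np.array([1, -1]), np.array([3.0, 0.5]), w))
# Large positive margin -> zero loss; negative margin -> positive loss.
```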
