30 Bayesian Classifier
The Bayes classifier is the particular hypothesis h^{*} that minimizes the risk
h^{*} = \argmin_{h} R (h)
and the risk that the Bayes decision rule achieves is called the Bayes risk,
R^{*} = R (h^{*})
which is the minimum risk that any hypothesis can achieve when the true distribution \mathbb{P}_{X, Y} is known.
MAP rule
Since Lemma 29.1 shows that R(h) = \mathbb{E}_{X} \left[ \mathbb{E}_{Y \mid X} \left[ L (h (X), Y) \right] \right], the risk is minimized by minimizing the inner conditional expectation separately for every x, so the Bayes classifier can also be written pointwise as
h^{*} (x) = \argmin_{\hat{y}} \mathbb{E}_{Y \mid X} \left[ L (\hat{y}, Y) \mid X = x \right],
which can be further simplified to the maximum a posteriori (MAP) rule if the loss function is the 0-1 loss and there are m labels y \in [1, m]
h^{*} (x) = \argmax_{y \in [1, m]} \mathbb{P}_{Y \mid X} (y \mid x)
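To see why the 0-1 loss yields the MAP rule, note that the conditional risk of predicting a candidate label \hat{y} at x is one minus its posterior probability,
\mathbb{E}_{Y \mid X} \left[ \mathbb{1} \left[ \hat{y} \neq Y \right] \mid X = x \right] = \sum_{y \in [1, m]} \mathbb{1} \left[ \hat{y} \neq y \right] \mathbb{P}_{Y \mid X} (y \mid x) = 1 - \mathbb{P}_{Y \mid X} (\hat{y} \mid x),
so minimizing the conditional risk over \hat{y} is equivalent to maximizing the posterior probability.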
According to Bayes' theorem,
\begin{aligned} \arg\max_{y} \mathbb{P}_{Y \mid \mathbf{X}}(y \mid \mathbf{x}) & = \arg\max_{y} \frac{\mathbb{P}_{\mathbf{X} \mid Y}(\mathbf{x} \mid y) \mathbb{P}_{Y}(y)}{\mathbb{P}_{\mathbf{X}}(\mathbf{x})} \\ & = \arg\max_{y} \mathbb{P}_{\mathbf{X} \mid Y}(\mathbf{x} \mid y) \mathbb{P}_{Y}(y) & [\mathbb{P}_{\mathbf{X}}(\mathbf{x}) \text{ doesn't depend on } y], \\ \end{aligned}
The MAP rule can thus be computed from the class-conditional probability (likelihood) and the class probability (prior), which is more practical because these quantities are usually easier to estimate from data than the posterior probability.
Using the log trick, the Bayes decision rule (BDR) for 0-1 loss is often computed as:
\begin{aligned} \arg\max_{y} \ln \mathbb{P}_{Y \mid \mathbf{X}}(y \mid \mathbf{x}) & = \arg\max_{y} \ln \left[ \mathbb{P}_{\mathbf{X} \mid Y}(\mathbf{x} \mid y) \mathbb{P}_{Y}(y) \right] \\ & = \arg\max_{y} \ln \mathbb{P}_{\mathbf{X} \mid Y}(\mathbf{x} \mid y) + \ln \mathbb{P}_{Y}(y). \\ \end{aligned}
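As an illustration, if one assumes Gaussian class-conditional densities with a shared covariance \sigma^{2} I, say \mathbb{P}_{\mathbf{X} \mid Y}(\mathbf{x} \mid y) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_{y}, \sigma^{2} I) (an assumption made here purely for illustration), the log form of the BDR becomes a comparison of squared distances to the class means shifted by the log-priors,
\arg\max_{y} \ln \mathbb{P}_{\mathbf{X} \mid Y}(\mathbf{x} \mid y) + \ln \mathbb{P}_{Y}(y) = \arg\max_{y} \left[ - \frac{ \lVert \mathbf{x} - \boldsymbol{\mu}_{y} \rVert^{2} }{ 2 \sigma^{2} } + \ln \mathbb{P}_{Y}(y) \right],
because the Gaussian normalizing constant does not depend on y.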
MAP rule for binary classification
Since there are only two labels in the binary classification problem, the MAP rule simplifies to
\begin{aligned} h^{*} (x) & = \argmax_{y \in \{0, 1\}} \mathbb{P}_{Y \mid X} (y \mid x) \\ & = \mathbb{1} \left[ \mathbb{P}_{Y \mid X} (1 \mid x) > \mathbb{P}_{Y \mid X} (0 \mid x) \right] \\ & = \begin{cases} 1, \quad \mathbb{P}_{Y \mid X} (1 \mid x) > \frac{ 1 }{ 2 } \\ 0, \quad \mathbb{P}_{Y \mid X} (1 \mid x) \leq \frac{ 1 }{ 2 }. \end{cases} \end{aligned}
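For example, if \mathbb{P}_{Y \mid X} (1 \mid x) = 0.7 at some point x (a hypothetical value), the rule predicts h^{*} (x) = 1 and the conditional probability of error at that x is 1 - 0.7 = 0.3.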
Regression function
The conditional distribution of Y given X = x can be modeled with a Bernoulli distribution, Y \mid X = x \sim \mathrm{Ber} (\eta (x)), where \eta (x) is the regression function
\eta (x) = \mathbb{P}_{Y \mid X} (1 \mid x) = \mathbb{E} \left[ Y \mid X = x \right].
Lemma 30.1 For any hypothesis h, we can write its risk function with 0-1 loss for binary classification as
R (h) = \mathbb{E}_{X} \left[ \eta (X) (1 - h (X)) + (1 - \eta (X)) h (X) \right].
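A short justification: conditionally on X, the 0-1 loss \mathbb{1} \left[ h (X) \neq Y \right] equals 1 - h (X) when Y = 1 (which happens with probability \eta (X)) and h (X) when Y = 0 (probability 1 - \eta (X)), so
\mathbb{E}_{Y \mid X} \left[ \mathbb{1} \left[ h (X) \neq Y \right] \right] = \eta (X) \left( 1 - h (X) \right) + \left( 1 - \eta (X) \right) h (X),
and taking the expectation over X gives the lemma.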
Theorem 30.1 The risk of the Bayes classifier for binary classification with 0-1 loss is the expectation of the minimum of \eta (X) and 1 - \eta (X)
R (h^{*}) = \mathbb{E}_{X} \left[ \min \left[ \eta (X), 1 - \eta (X) \right] \right]
and is at most \frac{ 1 }{ 2 }
\mathbb{E}_{X} \left[ \min \left[ \eta (X), 1 - \eta (X) \right] \right] \leq \frac{ 1 }{ 2 }.
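A sketch of the argument: by Lemma 30.1 the conditional risk at X equals \eta (X) when h (X) = 0 and 1 - \eta (X) when h (X) = 1, and the Bayes classifier h^{*} (X) = \mathbb{1} \left[ \eta (X) > \frac{ 1 }{ 2 } \right] always picks the option with the smaller conditional risk, so
R (h^{*}) = \mathbb{E}_{X} \left[ \min \left[ \eta (X), 1 - \eta (X) \right] \right] \leq \mathbb{E}_{X} \left[ \tfrac{ 1 }{ 2 } \right] = \frac{ 1 }{ 2 },
since \min \left[ a, 1 - a \right] \leq \frac{ 1 }{ 2 } for every a \in [0, 1].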
Excess risk
For any hypothesis h, we are interested in the difference between its risk R (h) and the Bayes risk R (h^{*}), which is called the excess risk of h
\mathcal{E} (h) = R (h) - R (h^{*}).
Theorem 30.2 For any hypothesis h, the excess risk satisfies
\mathcal{E} (h) = \mathbb{E}_{X} \left[ \lvert 2 \eta (X) - 1 \rvert \times \mathbb{1} \left[ h (X) \neq h^{*} (X) \right] \right]
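One way to see this: by Lemma 30.1 the contributions to R (h) and R (h^{*}) agree at every point where h (X) = h^{*} (X); where the two classifiers disagree, h^{*} incurs conditional risk \min \left[ \eta (X), 1 - \eta (X) \right] while h incurs \max \left[ \eta (X), 1 - \eta (X) \right], and the gap between the two is \lvert 2 \eta (X) - 1 \rvert, so
\mathcal{E} (h) = \mathbb{E}_{X} \left[ \left( \max \left[ \eta (X), 1 - \eta (X) \right] - \min \left[ \eta (X), 1 - \eta (X) \right] \right) \mathbb{1} \left[ h (X) \neq h^{*} (X) \right] \right] = \mathbb{E}_{X} \left[ \lvert 2 \eta (X) - 1 \rvert \times \mathbb{1} \left[ h (X) \neq h^{*} (X) \right] \right].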