ML Q & A
Machine Learning Interview Questions
ML Basics
What is the trade-off between bias and variance?
- We want a model with both low bias and low variance, but in practice reducing one tends to increase the other.
- Bias is the average difference between the model's predictions and the true labels.
- Variance refers to the sensitivity of the model to the fluctuations (noise) in the training set; equivalently, it describes how much the model's predictions would change if it were trained on a different sample from the same distribution.
- Since the training set is only a sample from the overall distribution and we cannot tell signal from noise, training the model to reduce bias will generally increase variance at the same time.
- High bias and low variance lead to under-fitting, where the model is too simple to capture the regularities in the data. This results in both low training accuracy and low testing accuracy.
- Low bias and high variance lead to over-fitting, where the model is so complex that it captures all the patterns in the training data, including the noise. This results in high training accuracy but low testing accuracy.
Explain over- and under-fitting and how to combat them?
- Under-fitting is when your model is too simple to fit the training set well. This results in low training accuracy and low testing accuracy, assuming the training set is a reasonable sample of the overall distribution.
- To combat under-fitting, we can first choose a more complex model: if a decision tree under-performs on a dataset, we can try a random forest. Second, we can increase the number of parameters the model has, for example the number of trees or the maximum depth of a decision tree. Last, for models that use an iterative learning procedure, such as neural networks, we can increase the training time.
- Over-fitting is when your model is so complex that it fits the training set almost perfectly. This means it also fits the noise, which results in high training accuracy but low testing accuracy.
- To combat over-fitting, we can first reduce the number of parameters. Second, using an ensemble of the current models is known to reduce over-fitting. Third, without changing the model, we can gather more data or use data augmentation so that the training set better represents the actual distribution.
Difference between supervised and unsupervised learning?
- A supervised learning task is provided with a labeled dataset, and the goal is to predict the correct label for a given instance.
- Unsupervised learning is given only the instances themselves, without labels. The goal is to learn patterns from the unlabeled data.
- Supervised learning includes classification and regression. Classification predicts a categorical label while regression predicts a continuous value.
- Unsupervised learning includes clustering and dimensionality reduction. Clustering groups similar instances together and dimensionality reduction reduces the number of features while preserving the important structure in the data.
What is the “Curse of Dimensionality” and how to combat it?
- Dimensionality refers to the number of features in your dataset.
- As the number of features grows, it becomes harder for a model to search the space: the number of training instances required to achieve the same accuracy grows exponentially with the number of features. Since in practice the number of training instances is fixed, model performance typically decreases as the dimensionality increases.
- We can use feature selection (for example, manual selection by a human) or feature extraction techniques such as PCA to reduce the dimensionality. The difference is that selection keeps a subset of the original features while extraction derives a set of new features from the data.
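As an illustration (not part of the original notes), the following NumPy sketch shows one symptom of the curse of dimensionality: for random points, the nearest and farthest distances from a reference point become almost equal as the number of features grows, so distance-based comparisons lose meaning. The point counts and dimensions are arbitrary choices.

```python
import numpy as np

# A rough illustration of distance concentration in high dimensions:
# nearest and farthest neighbors of a random point become almost equidistant.
rng = np.random.default_rng(0)
n_points = 500

for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n_points, d))            # random points in the unit hypercube
    ref = X[0]                                     # pick one reference point
    dists = np.linalg.norm(X[1:] - ref, axis=1)    # distances to all other points
    print(f"d={d:5d}  min/max distance ratio = {dists.min() / dists.max():.3f}")
```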
What is a confusion matrix?
- The confusion matrix is used to evaluate the performance of a supervised classifier on a dataset. It has N rows and N columns, where N is the number of classes in the dataset.
- Each row of the matrix counts the predictions made by the classifier for a specific label and each column counts the actual labels. For a binary classification dataset, the first row gives the number of true positives and false positives and the second row gives the number of false negatives and true negatives.
- Single-number measures such as accuracy or F1-score can be misleading on an unbalanced dataset. For example, for a dataset with 95 dogs and 5 cats and a classifier that always predicts dog, we get 95% accuracy and about 97.4% F1-score, which suggests the classifier is very good. The confusion matrix gives the whole picture: true positives, true negatives, false positives and false negatives. Adding up the true positives and false positives in the matrix shows that the classifier only ever predicts the positive label.
Notes:
1. [Confusion matrix terms]: For a binary classification problem, we have the following terms:
   1. True positive (TP): the classifier gives the positive class and the prediction is correct.
   2. False positive (FP): the classifier gives the positive class and the prediction is incorrect.
   3. True negative (TN): the classifier gives the negative class and the prediction is correct.
   4. False negative (FN): the classifier gives the negative class and the prediction is incorrect.
2. [Performance measures]: For a binary classification problem, we have several performance measures:
   1. Sensitivity (true positive rate): number of positive instances correctly labeled / number of instances with positive labels. \mathrm{\frac{TP}{TP + FN}}
   2. False positive rate: number of negative instances incorrectly labeled positive / number of instances with negative labels. \mathrm{\frac{FP}{FP + TN}}
   3. Specificity (true negative rate): number of negative instances correctly labeled / number of instances with negative labels. \mathrm{\frac{TN}{TN + FP}}
   4. Precision: number of positive instances correctly predicted / number of positive predictions. \mathrm{\frac{TP}{TP + FP}}
   5. Recall: same as sensitivity. \mathrm{\frac{TP}{TP + FN}}
   6. F1-score: harmonic mean of the precision and recall. \mathrm{2 \cdot \frac{precision \cdot recall}{precision + recall} = \frac{TP}{TP + \frac{1}{2}(FP + FN)}}
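To make the dog/cat example above concrete, here is a small Python sketch (not from the original notes) that computes the confusion-matrix counts and the derived metrics for the always-predict-dog classifier.

```python
# Dog = positive class. 95 dogs, 5 cats; the classifier always predicts "dog".
y_true = ["dog"] * 95 + ["cat"] * 5
y_pred = ["dog"] * 100

tp = sum(t == "dog" and p == "dog" for t, p in zip(y_true, y_pred))
fp = sum(t == "cat" and p == "dog" for t, p in zip(y_true, y_pred))
fn = sum(t == "dog" and p == "cat" for t, p in zip(y_true, y_pred))
tn = sum(t == "cat" and p == "cat" for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")        # TP=95 FP=5 FN=0 TN=0
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")    # high scores despite a useless classifier
```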
What are the ROC curve and AUC?
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
1. An ROC curve is a graph showing the performance of a classification model at all classification thresholds. The x axis is the false positive rate and the y axis is the true positive rate.
2. An ROC curve plots TPR vs. FPR at different classification thresholds. For models that first produce a score and then threshold it to give the classification, different thresholds mean different numbers of positive and negative predictions. Lowering the classification threshold classifies more items as positive, which increases both false positives and true positives, so the curve moves up and to the right as the threshold is lowered.
3. A perfect model has a horizontal line at y = 1 while a perfectly wrong model has a horizontal line at y = 0. A random-guessing model has a diagonal line, which means the model has no class-separation capacity.
4. AUC stands for area under the curve and measures the area under the ROC curve. It summarizes a model's performance across all possible classification thresholds.
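A hedged sketch of computing an ROC curve and AUC with scikit-learn (assuming scikit-learn is available; the synthetic dataset and logistic-regression scorer are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Fit a simple scoring model on synthetic binary-classification data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The scores (probabilities) are thresholded at many values to trace the ROC curve.
scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC =", roc_auc_score(y_test, scores))   # area under the ROC curve
```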
What is exponential moving average (exponential weighted moving average)?
- Calculating a moving average of a set of data points means creating a series of averages over different subsets of the dataset. The dataset to be analyzed usually contains time-series data (data indexed by timestamps). Exponential moving average (EMA) is one type of moving average.
- The EMA value at timestamp t summarizes the data from the beginning up to timestamp t. It is computed by adding the weighted value of the data at timestamp t to the weighted EMA value at the previous timestamp t-1.
- The weight is called the smoothing factor and is a hyper-parameter between 0 and 1. A higher smoothing factor means the current data value is weighted more and the previous EMA value is weighted less in the calculation of the new EMA value.
- This technique is used by gradient-based optimizers such as RMSprop and Adam to average past gradients and squared gradients.
Notes: 1. [EMA equation]: If the current timestamp is t, then the equation for EMA is: y_{t} = \alpha x_{t} + (1 - \alpha)y_{t-1} where y_{t} is the moving average at timestamp t, x_{t} is the data point at timestamp t, and \alpha is the smoothing factor, which is in the range [0, 1].
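A minimal Python sketch of the EMA equation above, seeding the series with the first data point (a common convention, not specified in the notes):

```python
def exponential_moving_average(data, alpha):
    """Compute the EMA series y_t = alpha * x_t + (1 - alpha) * y_{t-1}."""
    ema = []
    for t, x in enumerate(data):
        if t == 0:
            ema.append(x)                              # seed with the first value
        else:
            ema.append(alpha * x + (1 - alpha) * ema[-1])
    return ema

# Higher alpha tracks the latest point more closely; lower alpha smooths more.
print(exponential_moving_average([1, 2, 3, 4, 100], alpha=0.5))
```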
What are cross-validation and nested cross-validation?
TODO
Neural Network
What is gradient descent?
- Gradient descent is an optimization algorithm that iteratively minimizes a target function to find a local minimum. If the target function is convex, gradient descent can also find its global minimum.
- First, we calculate the derivatives of the target function with respect to the parameters of the model. This is the gradient.
- Second, we update the parameters in the opposite direction of the gradient to decrease the value of the target function. This is the descent.
- Gradient descent minimizes the target function, while gradient ascent, which updates the parameters in the same direction as the gradient, maximizes the target function.
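A toy sketch of gradient descent on the one-dimensional convex function f(x) = (x - 3)^2 (the function, learning rate, and step count are arbitrary choices for illustration):

```python
# Minimize f(x) = (x - 3)^2, whose gradient is f'(x) = 2 * (x - 3).
def grad(x):
    return 2 * (x - 3)

x = 0.0                 # initial parameter value
learning_rate = 0.1
for step in range(100):
    x = x - learning_rate * grad(x)   # move against the gradient (the "descent")

print(x)                # converges close to the minimizer x = 3
```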
What is back-propagation?
- TODO
Differences between gradient descent, stochastic gradient descent and mini-batch gradient descent.
https://ruder.io/optimizing-gradient-descent/index.html#fn4 https://cs231n.github.io/neural-networks-3/#update
1. Normal gradient descent is batch gradient descent. One batch here means one complete pass over the training set, so we need to evaluate all instances of the training set before a single gradient update. Batch gradient descent is slower, but it is guaranteed to converge to a local minimum (given a suitable learning rate).
2. Stochastic gradient descent performs one gradient update for each instance evaluated. This approach converges faster and can be used online as new instances come in, but it causes the target function to fluctuate.
3. Mini-batch gradient descent is a combination of the two. Instead of the whole training set or a single instance, we take a subset of the training set as the mini-batch and evaluate it to compute the gradient for a single update. This combines the benefits of the two methods above.
Notes:
1. [Different gradient descent optimization algorithms]
   1. SGD: \theta_{t+1} = \theta_{t} - \lambda \partial_{t}(\theta) where \lambda is the learning rate and \partial_{t}(\theta) is the gradient of the loss function w.r.t. the parameter \theta at time t.
   2. SGD with Momentum: it improves the convergence speed of SGD because it reduces the oscillations of SGD near local minima, which is done by building up velocity in the direction that has consistent gradients. v_{t} = \mu v_{t-1} - \lambda\cdot\partial_{t}(\theta) \quad \theta_{t+1} = \theta_{t} + v_{t} where \mu is the momentum parameter, typically 0.9, and v_{t} is the accumulated gradient direction at time t.
   3. Adagrad: it reduces the learning rates of the weights that receive high gradients and increases the learning rates of the weights that receive small or infrequent updates, so we don't have to manually tune the learning rates during training. However, since there is no way to reduce the accumulated squared gradients in the denominator of the learning rate, the monotonically decreasing learning rate eventually stops the training. g_{t} = g_{t-1} + \partial_{t}^{2}(\theta) \quad \theta_{t+1} = \theta_{t} - \frac{\lambda}{\sqrt{g_{t} + \epsilon}} \partial_{t}(\theta) where g_{t} is the accumulation of the squared gradient for \theta until time t and \epsilon is a small value used to prevent division by 0.
   4. RMSprop: it improves on Adagrad by replacing the accumulation of the past squared gradients with an exponential moving average of the past squared gradients. This solves Adagrad's issue of monotonically decreasing learning rates. e_{t} = \beta e_{t-1} + (1-\beta)\partial_{t}^{2}(\theta) \quad \theta_{t+1} = \theta_{t} - \frac{\lambda}{\sqrt{e_{t} + \epsilon}} \partial_{t}(\theta) where e_{t} is the exponential moving average of the squared gradient for \theta until time t and \beta controls the degree of weighting decay, like momentum, and is usually set to 0.9.
   5. Adam: it improves on RMSprop by replacing the raw gradient with an exponential moving average of the past gradients in the update step. It thus combines the benefits of RMSprop and SGD with momentum. v_{t} = \beta_{1} v_{t-1} + (1-\beta_{1})\partial_{t}(\theta) \quad e_{t} = \beta_{2} e_{t-1} + (1-\beta_{2})\partial_{t}^{2}(\theta) \quad \hat{v}_{t} = \frac{v_{t}}{1-\beta_{1}^{t}} \quad \hat{e}_{t} = \frac{e_{t}}{1-\beta_{2}^{t}} \quad \theta_{t+1} = \theta_{t} - \frac{\lambda \hat{v}_{t}}{\sqrt{\hat{e}_{t} + \epsilon}} where \hat{v}_{t} and \hat{e}_{t} are the bias-corrected versions of v_{t} and e_{t} and \beta^{t} means \beta raised to the power of t.
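As a rough translation of the Adam equations above into NumPy (the toy quadratic loss, hyperparameter values, and step count are illustrative assumptions):

```python
import numpy as np

def adam_step(theta, grad, v, e, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following the equations above (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * grad         # EMA of gradients
    e = beta2 * e + (1 - beta2) * grad ** 2    # EMA of squared gradients
    v_hat = v / (1 - beta1 ** t)               # bias correction
    e_hat = e / (1 - beta2 ** t)
    theta = theta - lr * v_hat / np.sqrt(e_hat + eps)
    return theta, v, e

# Toy usage: minimize the quadratic loss ||theta - 5||^2 whose gradient is 2 * (theta - 5).
theta = np.zeros(3)
v = np.zeros_like(theta)
e = np.zeros_like(theta)
for t in range(1, 5001):
    grad = 2 * (theta - 5.0)
    theta, v, e = adam_step(theta, grad, v, e, t, lr=0.01)
print(theta)   # approaches [5, 5, 5]
```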
What is vanishing gradient?
- Vanishing gradient happens to the parameters in the earlier layers of a deep neural network when the gradients become so small during back-propagation that the weights are barely updated.
- The primary reason for this problem is the choice of activation functions such as the sigmoid or hyperbolic tangent, whose gradients are small and always much less than 1. If multiple layers with such activations are stacked together, the loss gradient is multiplied by these small factors many times on its way back to the earlier layers of the network. Each layer shrinks the original loss gradient by a fraction, so in the end the gradient reaching the earlier layers is very small. Therefore, a large number of layers is also an important cause of the vanishing gradient problem.
- One effective solution to the problem is to use other activation functions such as ReLU.
Why is ReLU better and more often used than sigmoid in Neural Networks?
- It can help solve the vanishing gradient problem. The gradient of the ReLU for inputs larger than 0 is 1 and thus the gradient won’t be reduced in the back-propagation process.
- ReLU is computationally more efficient because it only needs to clamp negative inputs to 0.
- Historically speaking, ReLU has simply proven good enough for neural networks to be trained stably.
What is the difference between L_1 and L_2 regularization?
- L_1 regularization is also called Lasso regularization. It adds the sum of the absolute value of all weights in the neural network as a penalty term to the loss function.
- L_2 regularization is also called Ridge regularization. It adds the sum of the squared value of all weights in the neural network as a penalty term to the loss function.
- Both are used to reduce the over-fitting of large neural networks.
- The key difference is the gradient of each penalty. The gradient of the Lasso penalty is 1 or -1 depending on the sign of each weight, while the gradient of the Ridge penalty is 2 times the value of the weight. With Lasso the weights can shrink to exactly 0, while with Ridge the weights only shrink to very small values instead of exactly 0, because the gradients also decrease as the weights decrease. Thus Lasso can be used to train sparse neural networks.
What is dropout in neural networks?
- Dropout randomly drops neurons of the network during training to avoid over-fitting. When a neuron is dropped, neither the data nor the gradient goes through that neuron in the forward or backward pass.
- Typically dropout is applied per layer, and we can set a probability p for each layer that indicates what fraction of neurons we want to drop in that layer. p is usually set to 0.5 for hidden layers, but to a smaller value such as 0.1 for the input layer, because randomly dropping an entire column of input data is risky.
- The set of dropped neurons changes every instance or mini-batch, depending on the type of gradient descent we are using. We only apply dropout during training; at test time we use all neurons to get a consistent output.
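A minimal NumPy sketch of the commonly used "inverted" dropout variant, which rescales the surviving activations by 1/(1-p) during training so that no extra scaling is needed at test time (this scaling detail is an assumption beyond the notes above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p, training):
    """Inverted dropout: drop each neuron with probability p during training."""
    if not training or p == 0.0:
        return activations                        # use all neurons at test time
    mask = rng.random(activations.shape) >= p     # keep a neuron with probability 1 - p
    return activations * mask / (1.0 - p)         # rescale so the expected value is unchanged

a = np.ones((2, 4))
print(dropout(a, p=0.5, training=True))    # roughly half the entries zeroed, survivors scaled to 2.0
print(dropout(a, p=0.5, training=False))   # unchanged at test time
```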
What is batch normalization?
- Batch normalization normalizes the inputs to each layer over each mini-batch. Batch normalization has been shown to help neural network training converge faster.
- Batch normalization first normalizes the inputs to a layer by subtracting the mini-batch mean from each value and then dividing by the mini-batch standard deviation. This gives the inputs zero mean and unit variance. Then we scale and shift the normalized values into a desirable range. The scaling and shifting coefficients are two additional parameters that are learned during back-propagation.
- The challenge that batch normalization tries to address is called internal covariate shift. Since each layer's output is fed into the next layer's input, a change of the weights in an earlier layer after a mini-batch update changes its output distribution. Internal covariate shift slows down training because the learning in the next iteration needs to accommodate this change.
Notes:
1. [Batch Normalization Layer]: A batch normalization layer can be inserted before each layer and outputs the same dimension as its inputs.
   1. Get the mean \mu_{B} and standard deviation \sigma_{B} of the inputs in a mini-batch: \mu_{B}=\frac{1}{m}\sum_{i=1}^{m}x_i \quad \sigma_{B}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_{B})^{2}}
   2. Normalize the input: \hat{x}_i=\frac{x_i-\mu_{B}}{\sigma_{B}}
   3. Scale and shift back, where \gamma is the scaling parameter and \beta is the shifting parameter: y_{i} = \gamma\hat{x}_{i}+\beta
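The three steps above translate roughly into the following NumPy sketch for a mini-batch of fully connected activations. The small epsilon for numerical stability is an addition not shown in the equations, and gamma and beta are fixed here although they would normally be learned:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                       # per-feature mini-batch mean
    var = x.var(axis=0)                       # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learned scale and shift

x = np.random.default_rng(0).normal(loc=10.0, scale=3.0, size=(32, 4))  # mini-batch of 32
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```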
What is Xavier initialization?
https://www.deeplearning.ai/ai-notes/initialization/index.html
1. Xavier initialization is a specific way of initializing the weights and biases of a neural network such that the variance of the activations is roughly the same across all layers.
2. With Xavier initialization, the biases are initialized to 0 and the weights are randomly sampled from a normal distribution with mean 0 and variance equal to 1 over the number of neurons in the previous layer.
3. If we initialize all weights to the same value, all neurons will have the same activations. The same activations mean that all neurons will have the same gradients and will evolve identically throughout training.
4. If we initialize the weights to be too small or too large, the activations of each layer in the first several iterations will also be very small or very large. Since the gradients of each layer's weights are calculated based on the activations, large weights result in exploding gradients and small weights result in vanishing gradients, both of which prevent the network from learning efficiently.
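A hedged sketch of the variant described above (zero biases, weights from N(0, 1/n_in)); actual framework implementations differ in details such as using both fan-in and fan-out:

```python
import numpy as np

def xavier_init(n_in, n_out, seed=0):
    """Weights sampled from N(0, 1/n_in); biases initialized to zero."""
    rng = np.random.default_rng(seed)
    w = rng.normal(loc=0.0, scale=np.sqrt(1.0 / n_in), size=(n_in, n_out))
    b = np.zeros(n_out)
    return w, b

w, b = xavier_init(n_in=256, n_out=128)
print(w.var())   # close to 1/256, keeping activation variance roughly constant across layers
```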
Unsupervised learning
https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-unsupervised-learning#clustering
K-means
https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
1. K-means is used to cluster the unlabeled instances of a dataset into K groups that are defined by their centroids. The points in the same group can be further labeled or analyzed.
2. We first randomly choose K centroids in the space. Then we assign each data point to its nearest centroid, where distance is measured as the sum of squared differences (squared Euclidean distance). After all data points have been assigned to a cluster, we recompute the centroid of each cluster by taking the average of all the data points that belong to it. We then reassign the data points based on the new centroids and repeat the process until the centroids barely change.
3. K-means is only guaranteed to find a local minimum, not the global minimum, so the initialization of the centroids matters to the outcome.
Notes
1. [K-means algorithm]:
   1. Initialize cluster centroids \mu_{1}, ..., \mu_{k} randomly.
   2. Repeat until convergence:
      1. Assign each instance x_{i} to its nearest centroid: c_{i} = \underset{j}{\operatorname{argmin}} \lVert x_{i}-\mu_{j} \rVert^{2}
      2. Update each centroid based on its assigned instances: \mu_{j} = \frac{\sum_{i}^{m}1_{\{c_i=j\}}x_{i}}{\sum_{i}^{m}1_{\{c_i=j\}}}
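A minimal NumPy implementation of the two repeated steps above (the random data and cluster count are illustrative; empty clusters are not handled in this sketch):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: assign points to nearest centroids, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid (squared distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 5, 10)])  # three blobs
centroids, labels = kmeans(X, k=3)
print(centroids)   # roughly one centroid per blob
```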
PCA (Principal Component Analysis)
https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c
1. Principal component analysis is an unsupervised learning algorithm used to reduce the dimensionality of the training set. PCA selects multiple orthogonal directions that preserve the maximum variance in the training set and optionally projects the training instances onto these directions.
2. To do PCA, we first normalize the training set by subtracting each column's mean from its values and dividing by the column's standard deviation. Then we compute the covariance matrix of the normalized training set by multiplying its transpose with itself; the covariance matrix describes how the variables of the training set relate to each other. We then use eigendecomposition to decompose the covariance matrix into its eigenvalues and corresponding eigenvectors. The eigenvectors are the orthogonal components and the eigenvalues indicate the importance of the corresponding components. Finally, we sort the eigenvalues in decreasing order and select the first few eigenvalues and their corresponding eigenvectors as the principal components. We obtain the transformed dataset by multiplying the training set with the selected eigenvectors.
Notes:
1. [Eigenvectors, Eigenvalues]: Given a matrix A\in\mathbb{R}^{n\times n}, \lambda is said to be an eigenvalue of A if there exists an eigenvector z\in\mathbb{R}^n, z \neq 0, such that: Az = \lambda z
2. [Eigendecomposition (spectral decomposition)]: Let M be a real symmetric d \times d matrix with eigenvalues \lambda_{1}, ... , \lambda_{d} and corresponding orthonormal eigenvectors u_{1}, ..., u_{d}. Then: M = Q \Lambda Q^T where \Lambda is a diagonal matrix with \lambda_{1}, ... , \lambda_{d} on the diagonal and 0 elsewhere and Q is the matrix with u_{1}, ..., u_{d} as columns.
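A rough NumPy sketch of the PCA procedure described above (normalize, covariance, eigendecomposition, projection); the random example data is only for shape checking:

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix, as described above."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)        # normalize each column
    cov = X_std.T @ X_std / (len(X_std) - 1)            # covariance matrix of the features
    eigenvalues, eigenvectors = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(eigenvalues)[::-1]               # sort by decreasing eigenvalue
    components = eigenvectors[:, order[:n_components]]  # top principal directions
    return X_std @ components                           # project onto the components

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, n_components=2).shape)   # (100, 2)
```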
Supervised Learning
Naive Bayes Classifier (NB)
- [Bayes’ theorem]: the conditional probability of event A given that event B is true, P(A|B), can be computed as: P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} which in Bayesian terms is written as: \mathrm{Posterior} = \frac{\mathrm{Likelihood} \cdot \mathrm{Prior}}{\mathrm{Evidence}} If we think of A as a label and B as a set of features:
- P(A|B) is the posterior probability of a label given a set of features.
- P(B|A) is the likelihood which is the probability of a set of features given a label.
- P(A) is the prior probability of a label.
- P(B) is the evidence probability of a set of features.
- [Naive Bayes]: Naive Bayes is a classifier that selects the label \hat{y} from all possible labels y \in Y that has the maximum conditional probability given the instance \mathbf{x}. \hat{y} = \underset{y \in Y}{\operatorname{argmax}} P(y|\mathbf{x}) Applying Bayes’ theorem, we have: P(y|\mathbf{x}) = \frac{P(\mathbf{x}|y) \cdot P(y)}{P(\mathbf{x})} Since P(\mathbf{x}) is a constant and does not depend on y, we can simply drop it: P(y|\mathbf{x}) \propto P(\mathbf{x}|y) \cdot P(y) If we assume that the features are independent of each other (the naive conditional independence assumption), the probability that the feature values all occur in \mathbf{x} is the product of their individual probabilities: P(\mathbf{x}|y) = \prod_{i}^{m}P(x_{i}|y) Putting these together, we have: \hat{y} = \underset{y \in Y}{\operatorname{argmax}} \prod_{i}^{m}P(x_{i}|y) \cdot P(y)
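A small sketch of a categorical Naive Bayes classifier following the formula above, using raw frequency counts with no Laplace smoothing; the tiny weather dataset is hypothetical and only for illustration:

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate P(y) and the per-feature value counts for P(x_i | y)."""
    priors = {c: cnt / len(y) for c, cnt in Counter(y).items()}
    likelihood_counts = defaultdict(Counter)   # (class, feature index) -> value counts
    for features, label in zip(X, y):
        for i, value in enumerate(features):
            likelihood_counts[(label, i)][value] += 1
    return priors, likelihood_counts, Counter(y)

def predict(priors, likelihood_counts, class_counts, x):
    """argmax over classes of P(c) * prod_i P(x_i | c), with no smoothing."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= likelihood_counts[(c, i)][value] / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical (outlook, temperature) -> play data, for illustration only.
X = [("sunny", "hot"), ("sunny", "cool"), ("rainy", "cool"), ("rainy", "hot")]
y = ["no", "yes", "yes", "no"]
priors, likelihood_counts, class_counts = train_naive_bayes(X, y)
print(predict(priors, likelihood_counts, class_counts, ("sunny", "cool")))   # "yes"
```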
Linear regression and logistic regression (LR)
https://ml-cheatsheet.readthedocs.io/en/latest/linear_regression.html
1. Linear regression is a supervised machine learning model that fits a linear relationship between the input variables and the label. The output can be an arbitrary continuous value, so it is used for regression.
2. The model has a weight vector and a bias as parameters. The output of the model is the dot product of the weight vector and the input vector, plus the bias value.
3. The linear regression model is trained by solving an optimization problem defined by a cost function that evaluates the difference between the model's outputs and the correct labels. The cost function for linear regression is the mean squared error, the mean of the squared error of each prediction. MSE for linear regression is proven to be convex, so solving it with convex optimization or gradient descent reaches the global minimum.
4. Logistic regression is similar to linear regression, but the output is a probability value between 0 and 1, so it is used for binary classification instead of regression.
5. A sigmoid (logistic) function is applied to the output of linear regression so that logistic regression outputs a probability. Instead of MSE, the cost function is changed to binary cross entropy, so that the loss grows rapidly (without bound) as the prediction moves away from the label.
Notes:
1. [Mean Squared Error (MSE)]: \mathrm{MSE} = \frac{1}{n}\sum_{i}^{n}(y_{i}-\hat{y}_{i})^2 where y_{i} is the actual label and \hat{y}_{i} is the prediction given by the classifier.
2. [Binary cross entropy (BCE)]: only works if the labels y_{i} are 0 or 1 and the predictions \hat{y}_{i} are between 0 and 1. \mathrm{BCE} = -\frac{1}{n}\sum_{i}^{n}(y_{i}\log(\hat{y}_{i}) + (1-y_{i})\log(1-\hat{y}_{i})) which can be decomposed into two cases for each prediction and label pair: -\log(\hat{y}_{i}) \;\mathrm{if}\; y_{i}=1 and -\log(1 - \hat{y}_{i}) \;\mathrm{if}\; y_{i}=0
3. [Sigmoid (logistic) function and logit function]: \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \quad \mathrm{logit}(x) = \log(\frac{x}{1 - x}) The inverse of the sigmoid function is the logit function:
\begin{alignat}{2}
x &= \frac{1}{1 + e^{-y}} \\
\frac{1}{x} &= 1 + e^{-y} \\
e^{-y} &= \frac{1 - x}{x} \\
e^{y} &= \frac{x}{1 - x} \\
y &= \log(\frac{x}{1 - x}) \\
\end{alignat}
4. [How to solve the parameters for linear regression and logistic regression]
   1. Linear regression can be solved analytically by setting the partial derivative of the loss w.r.t. each weight to 0 (the bias is dropped for simplicity): \frac{\partial f}{\partial w_{k}} = \frac{2}{N} \sum_{i}^{N} x_{i,k} \bigg(\sum_{j}^{D}w_{j}x_{i,j} - y_{i} \bigg) = \frac{2}{N} \sum_{i}^{N} \bigg( x_{i, k} \sum_{j}^{D}w_{j}x_{i,j} - x_{i, k}y_{i} \bigg) = \frac{2}{N} \sum_{i}^{N} \bigg( x_{i, k} \sum_{j}^{D}w_{j}x_{i,j} \bigg) - \frac{2}{N} \sum_{i}^{N} x_{i,k}y_{i} = \frac{2}{N} \sum_{j}^{D} w_{j} \bigg( \sum_{i}^{N} x_{i,j}x_{i,k} \bigg) - \frac{2}{N} \sum_{i}^{N} x_{i,k}y_{i} where y_{i} is the actual label, following the MSE notation above.
   2. Gradient descent can be applied to solve both linear regression and logistic regression. Logistic regression doesn't have a closed-form solution because of the non-linearity that the sigmoid function introduces.
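A hedged sketch of fitting logistic regression by gradient descent on the binary cross entropy loss, as described above; the synthetic data, learning rate, and iteration count are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=2000):
    """Fit weights and bias by gradient descent on the BCE loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)          # predicted probabilities
        grad_w = X.T @ (p - y) / n      # gradient of BCE w.r.t. the weights
        grad_b = (p - y).mean()         # gradient of BCE w.r.t. the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy linearly separable data: the label is 1 when the feature sum is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)
w, b = train_logistic_regression(X, y)
pred = (sigmoid(X @ w + b) > 0.5).astype(float)
print("training accuracy:", (pred == y).mean())
```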
Generalized linear models (GLM) and generalized additive models (GAM)
https://christophm.github.io/interpretable-ml-book/extend-lm.html
1. Generalized linear models build on linear regression to predict a non-Gaussian output distribution. They keep the weighted sum of the features from linear regression, but connect the weighted sum to the expected mean of the output distribution through a possibly nonlinear link function.
2. For example, logistic regression is a type of generalized linear model: it assumes a Bernoulli distribution for the outcome and links the expected mean to the weighted sum using the logit function.
3. Generalized additive models further relax the restriction that the relationship must be a simple weighted sum, and instead assume that the outcome can be modeled by a sum of arbitrary functions of each feature. This allows modeling potentially non-linear relations between the features and the output.
Notes:
1. [Assumptions of linear regression]:
   1. The input features are independent of each other (no interactions between the features).
   2. The output y given the input features X follows a Gaussian distribution. This follows from the theorem: let X_1, ..., X_n be n mutually independent normal random variables with means \mu_1, ..., \mu_n and variances \sigma_1^2, ... \sigma_n^2. If the random variable Y is a linear combination of the X_i with coefficients w_1, ..., w_n: Y=\sum_{i=1}^{n}w_iX_i, then Y is Gaussian with mean \mathrm{E}[Y] = \sum_{i=1}^{n}w_i\mu_i and variance \mathrm{Var}[Y] = \sum_{i=1}^{n}w_i^2\sigma_i^2.
   3. The true relationship between each feature X_i and y is linear.
2. [Components of GLM]
   1. Random component: the probability distribution of the output variable Y. Its expected value (mean) is \mathrm{E}(Y).
   2. Systematic component: the weighted sum \sum_{1}^{n}w_ix_i + w_0.
   3. Link function: the function g relating the random component to the systematic component. g(\mathrm{E}(Y)) = \sum_{1}^{n}w_ix_i + w_0
3. [Equation of GAM] g(\mathrm{E}(Y)) = \sum_{1}^{n}f_i(x_i) where each f_i() can be an arbitrarily defined function.
Support vector machine (SVM)
https://shuzhanfan.github.io/2018/05/understanding-mathematics-behind-support-vector-machines/ https://cse.iitkgp.ac.in/~dsamanta/courses/da/resources/slides/10SupportVectorMachine.pdf
1. The objective of a support vector machine is to find a hyperplane in an N-dimensional space that separates two classes. Similar to linear regression, an SVM therefore has a weight vector and a bias as parameters.
2. To find the parameters, we first assume the training instances are linearly separable. Then a convex optimization problem is solved to find the weights and bias such that the hyperplane has the maximum distance from the support vectors, the training instances that are closest to the hyperplane.
3. If the training set contains noise points that make it linearly non-separable, we can add a slack variable for each training instance to the constraints of the optimization problem, which permits some training instances to be on the wrong side of the hyperplane. Large slack variables allow more misclassified training instances, and their sum is added to the objective function to be minimized.
4. A hyperparameter C determines how heavily the slack variables are penalized. A very large C means that we want the SVM to separate the two classes in the training set perfectly, while a moderate value means that we allow some errors during training (see the notes below).
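A hedged sketch of a soft margin linear SVM trained by sub-gradient descent on the equivalent hinge-loss objective \frac{1}{2}\lVert w \rVert^{2} + C\sum_{i}\max(0, 1 - y_{i}(w x_{i} + b)), rather than the dual quadratic program derived in the notes below; the toy data and hyperparameters are illustrative assumptions:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.001, n_iters=5000):
    """Sub-gradient descent on (1/2)||w||^2 + C * sum(max(0, 1 - y_i (w x_i + b)))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        violated = margins < 1                 # instances inside the margin or misclassified
        grad_w = w - C * (y[violated, None] * X[violated]).sum(axis=0)
        grad_b = -C * y[violated].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy two-blob data with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, b = train_linear_svm(X, y)
print("training accuracy:", (np.sign(X @ w + b) == y).mean())
```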
Notes:
1. [SVM without slacks (hard margin SVM)]: Given a dataset with n instances x_{i} \in R^{d} and n labels y_{i} \in \{-1, 1\}, a hard margin SVM model is a linear function (hyperplane) defined by a set of weights w \in R^{d} and a bias b \in R that has the largest distance to the support vectors. We get the hyperplane by solving the following optimization problem: \begin{alignat}{2} \min \quad & \frac{1}{2} \lVert w \rVert^{2} \\ \text{s.t. } \quad & y_{i}(w x_{i} + b) \geq 1, \quad i = 1, \dots n \\ \end{alignat}
   1. Solving the above optimization problem gives us two parallel hyperplanes (w x + b = 1 and w x + b = -1) that strictly separate the positive and negative training instances and at the same time have the maximum gap in between.
   2. The objective maximizes the distance between the parallel hyperplanes by minimizing a quantity proportional to its inverse; the distance is defined as \frac{\lvert b_{2} - b_{1} \rvert}{\lVert w \rVert} = \frac{\lvert (b + 1) - (b - 1) \rvert}{\lVert w \rVert} = \frac{2}{\lVert w \rVert}
   3. The constraints specify that the instances must be on the correct side of the two hyperplanes: w x_{i} + b \geq 1 \;\mathrm{if}\; y_{i} = 1 and w x_{i} + b \leq -1 \;\mathrm{if}\; y_{i} = -1, and y_{i}(w x_{i} + b) \geq 1 summarizes the two conditions.
2. [SVM with slacks (soft margin SVM)]: In case the instances cannot be linearly separated, we can use slack variables in the formulation to tolerate a small number of non-separable training instances. \begin{alignat}{2} \min \quad & \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i}^{n} \xi_{i} \\ \text{s.t. } \quad & y_{i}(w x_{i} + b) \geq 1 - \xi_{i}, \quad i = 1, \dots n \\ \quad & \xi_{i} \geq 0, \quad i = 1, \dots n \\ \end{alignat} where \xi_{i} is the slack variable for the instance x_{i} and C is a hyperparameter that penalizes the misclassification of x_{i}.
   1. If \xi_{i} is nonzero for x_{i}, it means that x_{i} is on the misclassified side of w x_{i} + b = 1 (or w x_{i} + b = -1) and the distance is \xi_{i}.
   2. If C = 0, \xi_{i} can be arbitrarily large for each x_{i}. If C \to \infty, it is the same as the hard margin SVM because any misclassification induces an infinite loss.
- [Solving hard margin SVM]
- Rewrite the primal program for easier Lagrangian computation below: \begin{alignat}{2} \min \quad & \frac{1}{2} ww \\ \text{s.t. } \quad & -(y_{i}(w x_{i} + b) - 1) \leq 0, \quad i = 1, \dots n \\ \end{alignat}
- We can derive the Lagrangian primal function from the primal program: \begin{alignat}{2} L(w, b, \alpha) & = f(w, b) + \sum_{i}^{n} \alpha h_{i}(w, b) \\ & = \frac{1}{2} ww - \sum_{i}^{n} \alpha_{i}(y_{i}(w x_{i} + b) - 1) \\ \end{alignat} where \alpha is a new variable called Lagrangian multiplier.
- Then we can write and solve Lagrangian dual function: \begin{alignat}{2} g(\alpha) & = \min_{w, b} L(w, b, \alpha) \\ & = \min_{w, b} \frac{1}{2} ww - \sum_{i}^{n} \alpha_{i}(y_{i}(w x_{i} + b) - 1) \\ \end{alignat} Taking the derivation of L(w, b, \alpha) over w: \begin{alignat}{2} \frac{\partial L}{\partial w} & = 0 \\ w - \sum_{i}{n} \alpha_{i}y_{i}x_{i} & = 0 \\ w & = \sum_{i}^{n} \alpha_{i}y_{i}x_{i} \\ \end{alignat} Taking the derivation of L(w, b, \alpha) over b: \begin{alignat}{2} \frac{\partial L}{\partial b} & = 0 \\ \sum_{i}^{n} \alpha_{i}y_{i} & = 0 \\ \end{alignat} Plug in w = \sum_{i}^{n} \alpha_{i}y_{i}x_{i} back to g(\alpha): \begin{alignat}{2} g(\alpha) & = \min_{w, b} \frac{1}{2} ww - \sum_{i}^{n} \alpha_{i}(y_{i}(w x_{i} + b) - 1) \\ & = \min_{w, b} \frac{1}{2} \left( \sum_{i}^{n} \alpha_{i}y_{i}x_{i} \right) \left( \sum_{j}^{n} \alpha_{j}y_{j}x_{j} \right) - \sum_{i}^{n} \alpha_{i} \left( y_{i} \left( \left( \sum_{j}^{n} \alpha_{j}y_{j}x_{j} \right) x_{i} + b \right) - 1 \right) \\ & = \min_{w, b} \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) - \sum_{i}^{n} \alpha_{i}y_{i}\left( \left( \sum_{j}^{n} \alpha_{j} y_{j} x_{j} \right) x_{i} + b \right) + \sum_{i}^{n}\alpha_{i} \\ & = \min_{w, b} \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) - \sum_{i}^{n} \alpha_{i}y_{i} \left( \sum_{j}^{n} \alpha_{j} y_{j} x_{j} \right) x_{i} + b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} \\ & = \min_{w, b} \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) - \left( \sum_{i}^{n} \alpha_{i}y_{i}x_{i} \right) \left( \sum_{j}^{n} \alpha_{j} y_{j} x_{j} \right) + b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} \\ & = \min_{w, b} \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) - \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} \\ & = \min_{w, b} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} \\ \end{alignat} Since we know that \alpha_{i}y_{i} = 0, then b\sum_{i}^{n} \alpha_{i}y_{i} = 0, and thus the final Lagrange dual function is: g(\alpha) = \sum_{i}^{n}\alpha_{i} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j})
- The Lagrange dual problem is written as: \begin{alignat}{2} \max \quad & \sum_{i}^{n}\alpha_{i} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) \\ \text{s.t. } \quad & \alpha_{i} \geq 0, \quad i = 1, \dots n \\ \quad & \sum_{i}^{n}\alpha_{i}y_{i} = 0 \\ \end{alignat} Notice that \sum_{i}^{n}\alpha_{i}y_{i} = 0 is added as part of the constraints.
- Since strong duality holds for hard margin SVM and also soft margin SVM, solving dual problem has the same solution as the primal problem. The benefits of solving its dual problem are:
- The Lagrange dual problem only involves the multipliers \alpha_{i}, while the primal problem has w and b, which typically means many more parameters.
- The Lagrange dual problem allows application of kernel trick in the computation process, but the primal problem doesn’t.
- [Solving soft margin SVM]
- Similar as hard margin SVM, we can write Lagrangian dual function as: \begin{alignat}{2} g(\alpha, \beta) & = \min_{w, b} \frac{1}{2} ww - \sum_{i}^{n} \alpha_{i}\left( y_{i}(w x_{i} + b) - 1 + \xi_{i} \right) - \sum_{i}^{n}\beta_{i}\xi_{i} \\ \end{alignat} where a new Lagrange multiplier is introduced for the constraint \xi_{i} \geq 0.
- Similar as hard margin SVM, we can solve Lagrangian dual function by taking the derivatives over the w, b, and \xi_i: \begin{alignat}{2} \frac{\partial L}{\partial w} = 0 & \Rightarrow w - \sum_{i}{n} \alpha_{i}y_{i}x_{i} = 0 \Rightarrow w = \sum_{i}^{n} \alpha_{i}y_{i}x_{i} \\ \frac{\partial L}{\partial b} = 0 & \Rightarrow \sum_{i}^{n} \alpha_{i}y_{i} = 0 \\ \frac{\partial L}{\partial \xi_{i}} = 0 & \Rightarrow C - \alpha_{i} - \beta_{i} = 0 \Rightarrow C = \alpha_{i} + \beta_{i} \\ \end{alignat} and plug the w = \sum_{i}^{n} \alpha_{i}y_{i}x_{i} and C = \alpha_{i} + \beta_{i} back in g(\alpha, \beta). \begin{alignat}{2} g(\alpha, \beta) & = \min_{w, b} \frac{1}{2} ww + C\sum_{i}^{n}\xi_{i} - \sum_{i}^{n} \alpha_{i}\left( y_{i}(w x_{i} + b) - 1 + \xi_{i} \right) - \sum_{i}^{n}\beta_{i}\xi_{i} \\ & = \min_{w, b} \frac{1}{2} \left( \sum_{i}^{n} \alpha_{i}y_{i}x_{i} \right) \left( \sum_{j}^{n} \alpha_{j}y_{j}x_{j} \right) + \sum_{i}^{n}(\alpha_{i} + \beta_{i})\xi_{i} - \sum_{i}^{n} \alpha_{i} \left( y_{i} \left( \left( \sum_{j}^{n} \alpha_{j}y_{j}x_{j} \right) x_{i} + b \right) - 1 + \xi_{i} \right) - \sum_{i}^{n} \beta_{i} \xi_{i} \\ & = \min_{w, b} \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + \sum_{i}^{n}(\alpha_{i} + \beta_{i})\xi_{i} - \sum_{i}^{n} \alpha_{i}y_{i}\left( \left( \sum_{j}^{n} \alpha_{j} y_{j} x_{j} \right) x_{i} + b \right) + \sum_{i}^{n}\alpha_{i} - \sum_{i}^{n} \alpha_{i} \xi_{i} - \sum_{i}^{n} \beta_{i} \xi_{i} \\ & = \min_{w, b} \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + \sum_{i}^{n}(\alpha_{i} + \beta_{i})\xi_{i} - \sum_{i}^{n} \alpha_{i}y_{i} \left( \sum_{j}^{n} \alpha_{j} y_{j} x_{j} \right) x_{i} + b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} - \sum_{i}^{n} \alpha_{i} \xi_{i} - \sum_{i}^{n} \beta_{i} \xi_{i} \\ & = \min_{w, b} \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + \sum_{i}^{n}(\alpha_{i} + \beta_{i})\xi_{i} - \left( \sum_{i}^{n} \alpha_{i}y_{i}x_{i} \right) \left( \sum_{j}^{n} \alpha_{j} y_{j} x_{j} \right) + b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} - \sum_{i}^{n} \alpha_{i} \xi_{i} - \sum_{i}^{n} \beta_{i} \xi_{i} \\ & = \min_{w, b} \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + \sum_{i}^{n}(\alpha_{i} + \beta_{i})\xi_{i} - \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} - \sum_{i}^{n} \alpha_{i} \xi_{i} - \sum_{i}^{n} \beta_{i} \xi_{i} \\ & = \min_{w, b} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + \sum_{i}^{n}(\alpha_{i} + \beta_{i})\xi_{i} + b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} - \sum_{i}^{n} \alpha_{i} \xi_{i} - \sum_{i}^{n} \beta_{i} \xi_{i} \\ & = \min_{w, b} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + \sum_{i}^{n} \alpha_{i}\xi_{i} + \sum_{i}^{n} \beta_{i}\xi_{i} + b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} - \sum_{i}^{n} \alpha_{i} \xi_{i} - \sum_{i}^{n} \beta_{i} \xi_{i} \\ & = \min_{w, b} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} \\ & = \min_{w, b} \sum_{i}^{n}\alpha_{i} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) \\ \end{alignat} which has exactly the same form as Lagrangian dual function of hard margin SVM.
- The Lagrange dual problem is written as: \begin{alignat}{2} \max \quad & \sum_{i}^{n}\alpha_{i} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) \\ \text{s.t. } \quad & \alpha_{i} \geq 0, \quad i = 1, \dots n \\ \quad & \beta_{i} \geq 0, \quad i = 1, \dots n \\ \quad & \sum_{i}^{n}\alpha_{i}y_{i} = 0 \\ \end{alignat} Since we know C = \alpha_{i} + \beta_{i} \Rightarrow \beta_{i} = C - \alpha_{i}, the constraint \beta_{i} \geq 0 can be removed by merging it into \alpha_{i} \geq 0: \begin{alignat}{2} \max \quad & \sum_{i}^{n}\alpha_{i} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) \\ \text{s.t. } \quad & C \geq \alpha_{i} \geq 0, \quad i = 1, \dots n \\ \quad & \sum_{i}^{n}\alpha_{i}y_{i} = 0 \\ \end{alignat} The only difference from the Lagrange dual problem of the hard margin SVM is the additional constraint C \geq \alpha_{i}.
- [Kernel trick]
- Kernel trick
- [Duality and KKT conditions]
- The Lagrangian dual problem:
- Given a minimization primal problem: \begin{alignat}{2} \min_{x} \quad & f(x) \\ \text{s.t. } \quad & h_{i}(x) \leq 0, \quad i = 1, \dots, n \\ \quad & l_{j}(x) = 0, \quad j = 1, \dots, m \\ \end{alignat}
- The Lagrangian is defined as: L(x, u, v) = f(x) + \sum_{i}^{n} u_{i}h_{i}(x) + \sum_{j}^{m} v_{j}l_{j}(x) where u_{i} and v_{j} are new variables called Lagrangian multipliers.
- The Lagrange dual function is: g(u, v) = \min_{x} L(x, u, v)
- The Lagrange dual problem is: \begin{alignat}{2} \max_{u, v} \quad & g(u, v) \\ \text{s.t. } \quad & u \geq 0 \\ \end{alignat}
- The properties of dual problem:
- The dual problem is always convex even if the primal problem is not convex.
- For any primal problem and its dual problem, the weak duality always holds (the optimal value of the primal problem is always greater or equal to the optimal value of the dual problem).
- Karush-Kuhn-Tucker (KKT) conditions
- Given the Lagrange dual problem stated above, the KKT conditions are:
- Stationarity condition: 0 \in \partial \left( f(x) + \sum_{i=1}^{n} u_{i} h_{i}(x) + \sum_{j=1}^{m} v_{j}l_{j}(x) \right)
- Complementary Slackness: u_{i}h_{i}(x) = 0, \quad i = 1, \dots, n
- Primal feasibility: h_{i}(x) \leq 0, \quad i = 1, \dots, n l_{j}(x) = 0, \quad j = 1, \dots, m
- Dual feasibility: u_{i} \geq 0, \quad i = 1, \dots, n
- If a strong duality (the primal optimal objective and the dual optimal objective are equal) holds, the x^{*} and u^{*}, v^{*} are primal and dual solutions if and only if x^{*} and u^{*}, v^{*} satisfy the KKT conditions.
Decision Tree (DT)
https://victorzhou.com/blog/intro-to-random-forests/
1. A decision tree is a tree structure that consists of many decision nodes and can be used for both classification and regression. Each internal node of the tree splits on a certain value of a feature to create different decision branches, and the leaf nodes are the predicted labels. To make a prediction, we start from the root node and follow the path that matches our instance until we reach a leaf node, which gives the label for the instance.
2. To train a classification decision tree, we greedily split on the feature value that has the maximum uncertainty gain among all possible splitting choices. The uncertainty gain is calculated using the Gini gain or information gain, which measure how much uncertainty is removed from the dataset by the split. After the split, two or more new child nodes are created. For each new node, we apply the same algorithm again with the subset of training instances that follows that decision path. We stop splitting when only one class is left in the remaining training instances; that node becomes a leaf node labeled by the remaining instances.
3. A decision tree is interpretable and very efficient to learn, but it suffers from over-fitting because the tree can become so complex that a slight difference in an instance changes the label. We can apply post-pruning or set a maximum depth to reduce over-fitting.
Notes
1. [Gini Index and Information Entropy]: Both apply to a dataset (instances with labels) to measure its uncertainty, and both become 0 when there is only one class in the set. Gini Index (Gini impurity) measures the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled according to the class distribution in the dataset. G(\mathcal{D}) = \sum_{c=1}^{C} \textrm{P}(c)(1-\textrm{P}(c)) = 1 - \sum_{c=1}^{C} \textrm{P}(c)^2 Information Entropy can be roughly thought of as the dataset's variance. E(\mathcal{D}) = -\sum_{c=1}^{C} \textrm{P}(c)\log_2\textrm{P}(c) In both cases, \mathcal{D} is the dataset to be evaluated, C is the total number of classes in \mathcal{D} and \textrm{P}(c) is the probability of picking an instance with class c (the fraction of instances with class c in \mathcal{D}).
2. [Gini Gain and Information Gain]: Both measure the uncertainty difference (in Gini Index or Information Entropy) between before and after a split of the dataset. G(\mathcal{D}, S) = M(\mathcal{D}) - \sum_{s\in S}\frac{\lvert s \rvert}{\lvert \mathcal{D} \rvert} M(s) where \mathcal{D} is the dataset before the split, S is the set of subsets of \mathcal{D} created by the split, M is the Gini Index (G) or Information Entropy (E), and \lvert \cdot \rvert gives the number of items in a set.
3. [Decision tree training algorithm]: We consider a binary classification decision tree. Given a dataset \mathcal{D},
   1. Identify all possible splits among all features. For a categorical feature, each discrete value is a possible split. For a numerical feature, we can either a) treat it as a categorical feature by discretizing it or b) sort all training values of the feature in ascending order and treat each interval between two consecutive numbers as a possible split.
   2. Calculate the uncertainty difference (Gini Gain or Information Gain) for all possible splits and select the split with the maximum uncertainty difference.
   3. Once a node splits into two children, compute the data points that satisfy each of the two branches. For each branch, return to step 1 with the new sub-dataset.
   4. The splitting stops when no further split can be made (the dataset contains only one class).
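A small Python sketch of the Gini impurity and Gini gain formulas above, applied to a toy set of labels (the example splits are made up for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of labels: 1 - sum over classes of P(c)^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent_labels, split_subsets):
    """Uncertainty removed by a split: parent impurity minus the weighted child impurities."""
    n = len(parent_labels)
    weighted_children = sum(len(s) / n * gini(s) for s in split_subsets)
    return gini(parent_labels) - weighted_children

labels = ["dog"] * 5 + ["cat"] * 5
perfect_split = [["dog"] * 5, ["cat"] * 5]                               # each child is pure
useless_split = [["dog", "cat"] * 2 + ["dog"], ["dog", "cat"] * 2 + ["cat"]]
print(gini(labels))                        # 0.5 for a balanced binary set
print(gini_gain(labels, perfect_split))    # 0.5: all uncertainty removed
print(gini_gain(labels, useless_split))    # close to 0: the split tells us little
```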
Random Forest (RF)
https://victorzhou.com/blog/intro-to-random-forests/
1. A random forest contains many decision trees and combines their outputs to give a final decision.
2. A particular goal in training a random forest is to make each tree in the forest different from the others. The first step is bootstrapping, which means that each decision tree is trained on a different dataset randomly sampled with replacement from the original dataset. To inject further randomness, a random subset of the features, instead of all features, is considered at each split of a decision tree. The final output of the random forest is the majority vote (or the average) of the trees' outputs.
3. The goal of randomizing the decision trees and aggregating their results is to reduce variance and thus prevent the over-fitting of a single decision tree. By averaging many randomized predictions, some errors cancel out. However, using multiple trees makes the random forest a black box, and the explanation for a prediction is hard for users to understand.
Notes: 1. [Bagging]: Bagging involves two procedures: bootstrapping and aggregating. Bootstrapping means training each model on a subset of the dataset sampled with replacement. Aggregating means combining the models' outputs in some specific way to give the final output.
Adaboost (Ada)
https://koalaverse.github.io/machine-learning-in-R/gradient-boosting-machines.html
https://arxiv.org/pdf/1403.1452.pdf
1. Adaboost, or boosting in general, combines a series of weak learners into a strong learner. A weak learner is any classifier that is slightly better than random guessing (>50% accuracy), which means that it has some basic understanding of the underlying distribution of the dataset. The output of the final strong learner is a combination of the weighted outputs of the weak learners.
2. Adaboost works by repeatedly fitting a base model on the training instances with different weights. We first initialize an equal weight for each training instance and then run M iterations. In each iteration, we fit the base model on the training instances with the current weights and compute the error rate, the weighted fraction of incorrectly classified instances. The error rate is used to compute the classifier coefficient, which increases as the error rate decreases. At the end of each iteration, we update the weight of each instance so that misclassified instances get larger weights and correctly classified instances get smaller weights. After the iterations, we have M classifiers and their coefficients. To make a prediction with the strong learner, we get the outputs from the M classifiers, sum up the products of the outputs and their coefficients, and take the sign of the value as the final output.
3. Adaboost assumes that the weak learner always has training accuracy larger than 50% and that the output classes are 1 and -1. A very short decision tree called a decision stump is usually used as the weak learner.
Notes:
1. [Adaboost algorithm] Here we show the Adaboost algorithm for binary classification problems (y \in \{-1, 1\}).
   1. For a dataset with N instances, initialize the observation weight for each instance: w_i=\frac{1}{N}, i=1,2, ... ,N.
   2. For m = 1 ... M,
      1. Fit a classifier G_m(x) to the training instances with weights w_i.
      2. Compute the weighted error rate E_m=\frac{\sum_{i=1}^{N} w_i \mathcal{1}(y_i\neq G_m(x_i))}{\sum_{i=1}^{N}w_i}
      3. Compute the classifier coefficient \alpha_m = \log(\frac{1-E_m}{E_m})
      4. Update the weights so that misclassified instances are weighted more: w_i \gets w_i \cdot e^{\alpha_m \mathcal{1}(y_i \neq G_m(x_i))}
   3. Final output of Adaboost: G(x) = \textrm{sign} (\sum_{m=1}^M \alpha_m G_m(x))
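A hedged sketch of the Adaboost algorithm above, using scikit-learn decision stumps (max_depth=1 trees) as the weak learners; the synthetic data and the number of rounds M are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=20):
    """Discrete Adaboost for labels in {-1, +1}, following the steps in the notes."""
    n = len(y)
    w = np.full(n, 1.0 / n)                          # 1. equal initial weights
    learners, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()           # 2.2 weighted error rate
        alpha = np.log((1 - err) / (err + 1e-10))    # 2.3 classifier coefficient
        w = w * np.exp(alpha * (pred != y))          # 2.4 up-weight misclassified instances
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    scores = sum(a * clf.predict(X) for clf, a in zip(learners, alphas))
    return np.sign(scores)                           # 3. weighted vote of the weak learners

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)           # a target that single stumps only partly capture
learners, alphas = adaboost_fit(X, y)
print("training accuracy:", (adaboost_predict(learners, alphas, X) == y).mean())
```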
Gradient Boosting (GB)
Gradient boosting can be seen as a generalized version of boosting, i.e. Adaboost is a special case of gradient boosting.
GB is a generalized additive model of n weak learners. G(x) = g_{1}(x) + \dots + g_{n}(x) where G(x) is the final gradient boosting model and g(x) is one type of weak learner.
The weak learner g(x) can be any regression model (one that outputs a real number). The regression tree is the most commonly used weak learner in gradient boosting.
g_{1}(x), \dots, g_{n}(x) are the same type of weak learner (regression tree) trained on different training sets.
Given a loss function L(\cdot), a training set X = \{\mathbf{x_{i}}\}, \mathbf{y} = \{y_{i}\}, a learning rate \alpha, and a number of iterations M, the algorithm to train a gradient boosted regression tree (GBRT) model is as follows:
1. Initialize G(x) by fitting a CART on the training set.
2. For m = 1 \dots M,
   1. Evaluate the loss over the current G(x).
   2. Calculate the negative gradient of the loss w.r.t. the current predictions to get the residuals \tilde{\mathbf{y}}: \tilde{\mathbf{y}} = -\frac{\partial L(\mathbf{y}, G(X))}{\partial G(X)} Note that \tilde{\mathbf{y}} has the same shape as \mathbf{y}.
   3. Use X and the residuals \tilde{\mathbf{y}} as the new training set to train a CART g(x).
   4. Add the new weak learner to the current model: G(x) = G(x) + \alpha g(x)
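A hedged sketch of gradient boosting for regression with squared-error loss, where the negative gradient is simply the residual y - G(x); scikit-learn regression trees serve as the weak learners, and the data, tree depth, and learning rate are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, M=100, learning_rate=0.1):
    """Gradient boosting for regression with squared-error loss (residual fitting)."""
    prediction = np.full(len(y), y.mean())   # initialize with a constant model
    trees = []
    for _ in range(M):
        residuals = y - prediction           # negative gradient of squared error w.r.t. predictions
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def gradient_boosting_predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(tree.predict(X) for tree in trees)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)
base, trees = gradient_boosting_fit(X, y)
pred = gradient_boosting_predict(base, trees, X)
print("training MSE:", ((pred - y) ** 2).mean())
```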