This paper proposes a very simple yet effective learning rate scheduling technique. It alternates between a cosine annealing phase, in which the learning rate gradually decreases following a cosine curve, and a warm restart step, in which the learning rate jumps back up to a high value.
For SGD with momentum, a traditional gradient-based optimization algorithm, the main hyper-parameter to schedule is the learning rate:

v_{t} = \mu v_{t-1} - \lambda \cdot \partial_{t}(\theta)

\theta_{t+1} = \theta_{t} + v_{t}

where \lambda is the learning rate, \partial_{t}(\theta) is the gradient of the loss function w.r.t. the parameters \theta at time t, \mu is the momentum coefficient (typically 0.9), and v_{t} is the accumulated update direction (velocity) at time t.
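A minimal sketch of this momentum update in plain NumPy; the function name, the toy quadratic loss, and the chosen values of lr and mu are illustrative assumptions, not from the paper:

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, lr=0.1, mu=0.9):
    """One update: v_t = mu * v_{t-1} - lr * grad, then theta_{t+1} = theta_t + v_t."""
    v = mu * v - lr * grad
    theta = theta + v
    return theta, v

# Toy example: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
for t in range(100):
    theta, v = sgd_momentum_step(theta, v, grad=theta)
print(theta)  # close to the minimum at [0, 0]
```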
We generally want to decrease the learning rate as training progresses: a large learning rate lets us approach the target quickly at first, and smaller steps later keep us from overshooting around the local minimum.
Instead of the traditional step-wise or linear decay, SGDR decreases the learning rate following a cosine curve.
Another paper was the first to propose that periodically decreasing and increasing the learning rate benefits neural network training; it explains the intuition and empirically demonstrates the effectiveness. The intuition is that the model usually gets stuck at saddle points rather than at local minima, and raising the learning rate at the right time helps the model jump out of a saddle point and traverse it quickly.
Instead of gradually increasing the learning rate, SGDR “restarts” the learning rate by directly setting it to a high value at some epochs.
Cosine annealing
At a given epoch t, the learning rate l is calculated as follows:

l = l_{\mathrm{min}} + \frac{1}{2}(l_{\mathrm{max}} - l_{\mathrm{min}})\left(1 + \cos\left(\frac{T_{\mathrm{cur}}}{T}\pi\right)\right)

where T_{\mathrm{cur}} is the number of epochs performed since the last restart, l_{\mathrm{min}} is the minimum learning rate, l_{\mathrm{max}} is the maximum learning rate, and T is the period length, i.e., how many epochs pass between restarts.
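A direct translation of this formula into Python; the function and argument names, and the default lr_min/lr_max values, are my own choices for illustration:

```python
import math

def cosine_annealing_lr(t_cur, period, lr_min=0.0, lr_max=0.1):
    """Learning rate after t_cur epochs since the last restart (period T = `period` epochs)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / period))

print(cosine_annealing_lr(0, 10))   # 0.1  (= lr_max at the start of a period)
print(cosine_annealing_lr(5, 10))   # 0.05 (halfway through the period)
print(cosine_annealing_lr(10, 10))  # 0.0  (= lr_min at the end of the period)
```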
Warm restart
When T_{\mathrm{cur}} = 0, l = l_{\mathrm{max}}, and when T_{\mathrm{cur}} = T, l = l_{\mathrm{min}}. Whenever T_{\mathrm{cur}} reaches T, we reset T_{\mathrm{cur}} to 0, which sets l directly back to l_{\mathrm{max}}.
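In a training loop the restart is just a reset of T_{\mathrm{cur}}; a rough sketch, with illustrative values (T = 10 epochs, l_{\mathrm{max}} = 0.1):

```python
import math

lr_min, lr_max, T = 0.0, 0.1, 10
t_cur = 0
for epoch in range(30):
    lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / T))
    # ... run one epoch of training with this learning rate ...
    t_cur += 1
    if t_cur == T:  # warm restart: reset the cosine phase so lr jumps back to lr_max
        t_cur = 0
```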
Notes
l_{\mathrm{min}}, l_{\mathrm{max}} and T are hyper-parameters. Typically l_{\mathrm{min}} is set to 0 and l_{\mathrm{max}} is set to the initial learning rate.
The figure below shows what the function looks like if we set l_{\mathrm{max}} = 1, l_{\mathrm{min}} = 0, and T = 1, that is, we gradually decrease the learning rate from 1 to 0 over a period of 1 epoch.
T_{\mathrm{cur}} can also take fractional values, updated at the batch level: if an epoch has 10 batches, T_{\mathrm{cur}} increases by 0.1 after each batch.
Instead of using a fixed period T, the authors suggest defining another hyper-parameter T_{\mathrm{mult}} that makes T grow after each restart: T_{n+1} = T_{n} \times T_{\mathrm{mult}}, where n is the restart index.
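A small sketch of how the restart points spread out under this rule; the values T_0 = 10 and T_mult = 2 are just an example:

```python
def restart_epochs(t_0, t_mult, n_restarts):
    """Epochs at which warm restarts happen when each period is t_mult times the previous one."""
    epochs, period, total = [], t_0, 0
    for _ in range(n_restarts):
        total += period
        epochs.append(total)
        period *= t_mult
    return epochs

print(restart_epochs(10, 2, 4))  # [10, 30, 70, 150]
```

PyTorch ships this schedule as torch.optim.lr_scheduler.CosineAnnealingWarmRestarts, whose T_0, T_mult, and eta_min arguments correspond to T, T_{\mathrm{mult}}, and l_{\mathrm{min}} here.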
l_{\mathrm{min}} and l_{\mathrm{max}} can also be changed at each restart, but the authors suggest keeping them fixed to reduce the number of hyper-parameters involved.
Code
from IPython.display import IFrame

# Interactive Desmos plot of the cosine annealing curve described above.
IFrame('https://www.desmos.com/calculator/9hrbpo2ajf?embed', width=500, height=500)