SSGD
This page contains my reading notes on the proposed SSGD (sparsity-promoting stochastic gradient descent) framework.
Proposed SSGD framework
Typically, the training objective for a neural network takes the following form:
\underset{\theta}{\min} J({\boldsymbol{\theta}}) + \lambda G(\boldsymbol{\theta})
Here, J(\boldsymbol{\theta}) is the accuracy loss and G(\boldsymbol{\theta}) is the regularization term.
In the SSR literature, G(·) is usually referred to as the general diversity measure that serves as an alternative to the l_0 “norm” for encouraging sparsity.
We further define a separable diversity measure that has the form
G(\boldsymbol{\theta}) = \sum_i g(\theta_i)
where g(·) has the following properties:
- g(u) is symmetric.
- g(\lvert u \rvert) is monotonically increasing with \lvert u \rvert.
- g(u) is finite.
- g(u) is strictly concave in \lvert u \rvert or u^2.
Any function satisfying the above properties is a candidate for an effective SSR algorithm, and thus a candidate for the proposed SSGD framework.
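To make this concrete, here is a minimal NumPy sketch of a separable diversity measure plugged into the objective above. It uses the log-sum penalty (covered below) as the plug-in g; the function names and the epsilon/lambda defaults are my own illustrative choices, not values from the paper.

```python
import numpy as np

def g_logsum(u, eps=1e-3):
    # Symmetric, increasing in |u|, finite, and strictly concave in u^2.
    return np.log(u ** 2 + eps)

def diversity(theta, g=g_logsum):
    # Separable diversity measure: G(theta) = sum_i g(theta_i).
    return np.sum(g(theta))

def objective(theta, accuracy_loss, lam=1e-4):
    # Regularized training objective: J(theta) + lambda * G(theta).
    return accuracy_loss(theta) + lam * diversity(theta)
```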
Iterative reweighting frameworks
The iteratively reweighted l_2 and l_1 frameworks are two popular approaches in the SSR literature for solving the above problem.
At each iteration t, the objectives to minimize are, for the l_2 and l_1 frameworks respectively:
\underset{\theta}{\min} J({\boldsymbol{\theta}}) + \lambda \lVert \boldsymbol{\Omega}\boldsymbol{\theta} \rVert_2^2
\underset{\theta}{\min} J({\boldsymbol{\theta}}) + \lambda \lVert \boldsymbol{\Omega}\boldsymbol{\theta} \rVert_1
where \boldsymbol{\Omega} is the vector of weights w_i, pre-computed from the current \theta_i as discussed below, and \boldsymbol{\theta} contains the variables being solved for.
For the l_2 and l_1 frameworks respectively, g needs to satisfy g(u)=f(u^2) and g(u)=f(\lvert u \rvert), where f(z) is concave for z \in \mathbb{R}_+.
In each iteration, after solving for \theta_i, we use it to update w_i. The update rules for the l_2 and l_1 frameworks are:
w_i = \bigg( \frac{\textrm{d}f(z)}{\textrm{d}z} \bigg) ^{-\frac{1}{2}} \mathrm{where} \; z=\theta_i^2
w_i = \bigg( \frac{\textrm{d}f(z)}{\textrm{d}z} \bigg) ^{-1} \mathrm{where} \; z=\lvert \theta_i \rvert
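In code, the two weight updates only need the derivative f'(z) of the chosen f. A minimal sketch with my own helper names; `f_prime` is assumed to be supplied by whatever diversity measure is in use:

```python
import numpy as np

def l2_weights(theta, f_prime):
    # Reweighted l2: w_i = (df/dz)^(-1/2) evaluated at z = theta_i^2.
    return f_prime(theta ** 2) ** (-0.5)

def l1_weights(theta, f_prime):
    # Reweighted l1: w_i = (df/dz)^(-1) evaluated at z = |theta_i|.
    return f_prime(np.abs(theta)) ** (-1.0)

def l2_penalty(theta, w):
    # || Omega theta ||_2^2 with the weights applied element-wise.
    return np.sum((w * theta) ** 2)

def l1_penalty(theta, w):
    # || Omega theta ||_1 with the weights applied element-wise.
    return np.sum(np.abs(w * theta))
```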
Log-sum as the diversity measure
Log-sum penalties for the reweighted l_2 and l_1 frameworks:
g(\theta_i) = \log(\theta_i^2 + \epsilon)
g(\theta_i) = \log(\lvert \theta_i \rvert + \epsilon)
where \epsilon > 0 and smaller \epsilon induces stronger sparsity.
Their w_i can be computed as:
w_i = (\theta_i^2 + \epsilon) ^ \frac{1}{2}
w_i = \lvert \theta_i \rvert + \epsilon
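For the log-sum measure these weights have closed forms (f(z) = \log(z + \epsilon), so f'(z) = 1/(z + \epsilon)). A small sketch; the epsilon default is illustrative:

```python
import numpy as np

def logsum_l2_weights(theta, eps=1e-3):
    # w_i = (theta_i^2 + eps)^(1/2) for the reweighted l2 framework.
    return np.sqrt(theta ** 2 + eps)

def logsum_l1_weights(theta, eps=1e-3):
    # w_i = |theta_i| + eps for the reweighted l1 framework.
    return np.abs(theta) + eps
```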
SSGD: Sparsity-promoting Stochastic Gradient Descent
The sparsity-promoting matrix stores the coefficients that are multiplied with each parameter's gradient during the gradient-descent update to promote sparsity.
Each coefficient s_i is obtained from w_i by
s_i = \frac{w_i^2}{\frac{1}{|\mathcal{I}^k|} \sum_{j\in \mathcal{I}^k} w_j^2} \mathrm{, for} \; i \in \mathcal{I}^k
where \mathcal{I}^k is the set of parameters (weights) in layer k and |\mathcal{I}^k| is the number of parameters in layer k.
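A minimal sketch of this per-layer normalization, assuming the w_i of one layer are stored in a single array:

```python
import numpy as np

def scaling_factors(w_layer):
    # s_i = w_i^2 / ((1 / |I^k|) * sum_j w_j^2), computed within one layer.
    w_sq = w_layer ** 2
    return w_sq / np.mean(w_sq)
```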
Complete SSGD training algorithm for DNN:
Apply one of the general diversity measures above as G(\boldsymbol{\theta}) in the loss function.
Then, for each gradient update:
1. Compute the weights w_i from the current \theta_i.
2. Compute s_i by the equation above.
3. Update the parameters: \theta_i = \theta_i - \mathrm{lr} \times s_i \times \nabla_i, where \nabla_i is the gradient for \theta_i.
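Putting the pieces together, here is a minimal NumPy sketch of one SSGD update, using the log-sum l1 weights for illustration. `params` and `grads` are assumed to be lists of per-layer arrays, with the gradients coming from the usual backward pass over the regularized loss; the learning rate and epsilon are illustrative.

```python
import numpy as np

def ssgd_step(params, grads, lr=0.01, eps=1e-3):
    # One SSGD update over all layers (in place).
    for theta, grad in zip(params, grads):
        # 1. Weights w_i from the current theta_i (log-sum, l1 variant).
        w = np.abs(theta) + eps
        # 2. Per-layer scaling factors s_i = w_i^2 / mean_j(w_j^2).
        w_sq = w ** 2
        s = w_sq / np.mean(w_sq)
        # 3. Scaled gradient-descent update.
        theta -= lr * s * grad
    return params
```

Note that the s_i average to 1 within each layer, so the overall step size per layer stays comparable to plain SGD while individual parameters are scaled up or down.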