SSGD
This page contains my reading notes on the proposed SSGD (sparsity-promoting stochastic gradient descent) framework.
Proposed SSGD framework
Typically, the training objective for a neural network takes the following form:
\underset{\theta}{\min} J({\boldsymbol{\theta}}) + \lambda G(\boldsymbol{\theta})
Here, J(\boldsymbol{\theta}) is the accuracy loss and G(\boldsymbol{\theta}) is the regularization term.
In the SSR literature, G(·) is usually referred to as the general diversity measure that serves as an alternative to the l_0 “norm” for encouraging sparsity.
We further define a separable diversity measure that has the form
G(\boldsymbol{\theta}) = \sum_i g(\theta_i)
where g(·) has the following properties:
- g(u) is symmetric.
- g(\lvert u \rvert) is monotonically increasing with \lvert u \rvert.
- g(u) is finite.
- g(u) is strictly concave in \lvert u \rvert or u^2.
Any function satisfying the above properties is a candidate for an effective SSR algorithm, and thus a candidate for the proposed SSGD framework.
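To make this concrete, here is a minimal NumPy sketch of a separable diversity measure plugged into the objective above. It uses the log-sum penalty (covered below) as the plug-in g; the function names and the epsilon/lambda defaults are my own illustrative choices, not values from the paper.

```python
import numpy as np

def g_logsum(u, eps=1e-3):
    # Symmetric, increasing in |u|, finite, and strictly concave in u^2.
    return np.log(u ** 2 + eps)

def diversity(theta, g=g_logsum):
    # Separable diversity measure: G(theta) = sum_i g(theta_i).
    return np.sum(g(theta))

def objective(theta, accuracy_loss, lam=1e-4):
    # Regularized training objective: J(theta) + lambda * G(theta).
    return accuracy_loss(theta) + lam * diversity(theta)
```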
Iterative reweighting frameworks
The iteratively reweighted l_2 and l_1 frameworks are two popular approaches in the SSR literature for solving the above problem.
At each iteration t, the objectives to minimize are, for the l_2 and l_1 frameworks respectively:
\underset{\theta}{\min} J({\boldsymbol{\theta}}) + \lambda \lVert \boldsymbol{\Omega}\boldsymbol{\theta} \rVert_2^2
\underset{\theta}{\min} J({\boldsymbol{\theta}}) + \lambda \lVert \boldsymbol{\Omega}\boldsymbol{\theta} \rVert_1
where \boldsymbol{\Omega} is the vector of weights w_i, pre-computed from the current \theta_i as discussed below, and \boldsymbol{\theta} contains the variables being solved for.
For the l_2 and l_1 frameworks respectively, g needs to satisfy g(u)=f(u^2) and g(u)=f(\lvert u \rvert), where f(z) is concave for z \in \mathbb{R}_+.
In each iteration, after solving for \theta_i, we use it to update w_i. The update rules for the l_2 and l_1 frameworks are:
w_i = \bigg( \frac{\textrm{d}f(z)}{\textrm{d}z} \bigg) ^{-\frac{1}{2}} \mathrm{where} \; z=\theta_i^2
w_i = \bigg( \frac{\textrm{d}f(z)}{\textrm{d}z} \bigg) ^{-1} \mathrm{where} \; z=\lvert \theta_i \rvert
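In code, the two weight updates only need the derivative f'(z) of the chosen f. A minimal sketch with my own helper names; `f_prime` is assumed to be supplied by whatever diversity measure is in use:

```python
import numpy as np

def l2_weights(theta, f_prime):
    # Reweighted l2: w_i = (df/dz)^(-1/2) evaluated at z = theta_i^2.
    return f_prime(theta ** 2) ** (-0.5)

def l1_weights(theta, f_prime):
    # Reweighted l1: w_i = (df/dz)^(-1) evaluated at z = |theta_i|.
    return f_prime(np.abs(theta)) ** (-1.0)

def l2_penalty(theta, w):
    # || Omega theta ||_2^2 with the weights applied element-wise.
    return np.sum((w * theta) ** 2)

def l1_penalty(theta, w):
    # || Omega theta ||_1 with the weights applied element-wise.
    return np.sum(np.abs(w * theta))
```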
Log-sum as the diversity measure
Log-sum penalties for the reweighted l_2 and l_1 frameworks:
g(\theta_i) = \log(\theta_i^2 + \epsilon)
g(\theta_i) = \log(\lvert \theta_i \rvert + \epsilon)
where \epsilon > 0 and smaller \epsilon induces stronger sparsity.
Their w_i can be computed as:
w_i = (\theta_i^2 + \epsilon) ^ \frac{1}{2}
w_i = \lvert \theta_i \rvert + \epsilon
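For the log-sum measure these weights have closed forms (f(z) = \log(z + \epsilon), so f'(z) = 1/(z + \epsilon)). A small sketch; the epsilon default is illustrative:

```python
import numpy as np

def logsum_l2_weights(theta, eps=1e-3):
    # w_i = (theta_i^2 + eps)^(1/2) for the reweighted l2 framework.
    return np.sqrt(theta ** 2 + eps)

def logsum_l1_weights(theta, eps=1e-3):
    # w_i = |theta_i| + eps for the reweighted l1 framework.
    return np.abs(theta) + eps
```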
SSGD: Sparsity-promoting Stochastic Gradient Descent
The sparsity-promoting matrix stores the coefficients that are multiplied with each parameter's gradient during the gradient-descent update to promote sparsity.
Each coefficient s_i is obtained from w_i by
s_i = \frac{w_i^2}{\frac{1}{|\mathcal{I}^k|} \sum_{j\in \mathcal{I}^k} w_j^2} \mathrm{, for} \; i \in \mathcal{I}^k
where \mathcal{I}^k is the set of parameters (weights) in layer k and |\mathcal{I}^k| is the number of parameters in layer k.
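A minimal sketch of this per-layer normalization, assuming the w_i of one layer are stored in a single array:

```python
import numpy as np

def scaling_factors(w_layer):
    # s_i = w_i^2 / ((1 / |I^k|) * sum_j w_j^2), computed within one layer.
    w_sq = w_layer ** 2
    return w_sq / np.mean(w_sq)
```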
Complete SSGD training algorithm for DNN:
Apply one of the general diversity measures above as G(\boldsymbol{\theta}) in the loss function.
Then, for each gradient update:
1. Compute the weights w_i from the current \theta_i.
2. Compute s_i by the equation above.
3. Update the parameters: \theta_i = \theta_i - \mathrm{lr} \times s_i \times \nabla_i, where \nabla_i is the gradient for \theta_i.
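Putting the pieces together, here is a minimal NumPy sketch of one SSGD update, using the log-sum l1 weights for illustration. `params` and `grads` are assumed to be lists of per-layer arrays, with the gradients coming from the usual backward pass over the regularized loss; the learning rate and epsilon are illustrative.

```python
import numpy as np

def ssgd_step(params, grads, lr=0.01, eps=1e-3):
    # One SSGD update over all layers (in place).
    for theta, grad in zip(params, grads):
        # 1. Weights w_i from the current theta_i (log-sum, l1 variant).
        w = np.abs(theta) + eps
        # 2. Per-layer scaling factors s_i = w_i^2 / mean_j(w_j^2).
        w_sq = w ** 2
        s = w_sq / np.mean(w_sq)
        # 3. Scaled gradient-descent update.
        theta -= lr * s * grad
    return params
```

Note that the s_i average to 1 within each layer, so the overall step size per layer stays comparable to plain SGD while individual parameters are scaled up or down.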