Given a vector x of length n (a matrix can also be viewed as a vector by stacking its rows or columns), the common vector norms are the following (a short numpy sketch is given after the definitions):
L_{0} norm:
\sum_{i=1}^{n} \mathbb{1}[x_{i} \neq 0]
L_{1} norm:
\sum_{i=1}^{n} \lvert x_{i} \rvert
which is used as lasso regularization in neural networks.
L_{2} norm:
\sqrt{\sum_{i=1}^{n} x_{i}^{2}}
whose square, \sum_{i=1}^{n} x_{i}^{2}, is used as ridge regularization (weight decay) in neural networks.
L_{\infty} norm:
\max_{1 \le i \le n} \lvert x_{i} \rvert
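A quick numpy sketch of these four norms (the example vector is arbitrary):

```python
import numpy as np

x = np.array([0.0, -1.5, 2.0, 0.0, 3.0])

l0 = np.count_nonzero(x)        # L0: number of non-zero entries -> 3
l1 = np.sum(np.abs(x))          # L1: sum of absolute values -> 6.5
l2 = np.sqrt(np.sum(x ** 2))    # L2: Euclidean length -> sqrt(15.25)
linf = np.max(np.abs(x))        # L_infinity: largest absolute entry -> 3.0
```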
The usual way to prune the connections of a neural network is to apply L_{1} or L_{2} regularization during training to drive the weights close to 0 (but not exactly 0), and then directly set all weights whose magnitude is below a threshold to 0 (a small sketch follows below).
L_{0} regularization is not used because counting the number of non-zero entries is not a differentiable operation.
However, L_{0} regularization is still desirable because it only penalizes whether a weight is zero, so it does not shrink the magnitudes of the surviving weights during pruning.
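A minimal sketch of the magnitude-based pruning step mentioned above (assuming numpy; the function name and threshold value are illustrative, not from the original):

```python
import numpy as np

def magnitude_prune(theta, threshold=1e-3):
    # After training with L1/L2 regularization, zero out every weight whose
    # magnitude fell below the threshold.
    return np.where(np.abs(theta) < threshold, 0.0, theta)
```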
General recipe of L_{0} regularization
The loss function used to train a neural network with L_{0} regularization is:

$$ \mathcal{R}(\theta) = \mathcal{L}(f(x, \theta), y) + \lambda \lVert \theta \rVert_{0} = \mathcal{L}(f(x, \theta), y) + \lambda \sum_{i=1}^{\lvert \theta \rvert} \mathbb{1}[\theta_{i} \neq 0] $$

where:
\mathcal{L} is a standard loss function (e.g. cross-entropy/softmax loss)
\theta are the parameters in the network
x, y are training instances
\lambda is a hyper-parameter that balances the loss and the regularization term.
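As an illustration, a minimal numpy sketch of this objective (the function name and the flattened parameter vector are assumptions for the example):

```python
import numpy as np

def l0_objective(data_loss, theta, lam):
    # L(f(x, theta), y) + lambda * ||theta||_0, where ||theta||_0 counts
    # the non-zero entries of the (flattened) parameter vector theta.
    return data_loss + lam * np.count_nonzero(theta)
```

The count of non-zero entries is piecewise constant in \theta, which is why this objective cannot be minimized with gradient descent as-is.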
If we attach a binary random variable (gate) z_{i} to each element \theta_{i} of the model parameters, then the weights used in the feed-forward operation of the neural network can be replaced by \theta \odot z. The loss function then becomes:

$$ \mathcal{R}(\theta, z) = \mathcal{L}(f(x, \theta \odot z), y) + \lambda \sum_{i=1}^{\lvert \theta \rvert} \mathbb{1}[z_{i} \neq 0] $$

where:
z \in \{0, 1\}^{\lvert \theta \rvert} is randomly sampled in each forward propagation according to some distribution.
\odot corresponds to the elementwise product.
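A minimal numpy sketch of this gated forward pass for a single linear layer (the layer shapes and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_forward(x, theta, pi):
    # Sample one binary gate z_i ~ Bern(pi_i) per parameter, then use theta * z.
    z = rng.binomial(n=1, p=pi, size=theta.shape)   # z in {0, 1}^{|theta|}
    return x @ (theta * z)                          # elementwise product, then matmul

x = rng.normal(size=(8, 16))        # batch of inputs
theta = rng.normal(size=(16, 4))    # weight matrix; every entry gets a gate
pi = np.full(theta.shape, 0.9)      # keep-probability for every weight
y_hat = gated_forward(x, theta, pi)
```

A fresh mask z is drawn on every call, so different subsets of weights are active in different forward passes.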
If we model each z_{i} as a binary random variable following a Bernoulli distribution with parameter \pi_{i}, i.e. z_{i} \sim \mathrm{Bern}(\pi_{i}), then the loss function becomes:

$$ \mathcal{R}(\theta, \pi) = \mathbb{E}_{z \sim \mathrm{Bern}(\pi)} \big[ \mathcal{L}(f(x, \theta \odot z), y) \big] + \lambda \sum_{i=1}^{\lvert \theta \rvert} \pi_{i} $$
where \mathbb{E}_{z \sim \mathrm{Bern}(\pi)} [\cdot] denotes the expectation over z sampled from the Bernoulli distribution with parameters \pi.
This reformulation of the loss function is valid because:
Since the minimum of a function is upper bounded by its expectation, minimizing \mathbb{E}_{z \sim \mathrm{Bern}(\pi)} \big[ \mathcal{L}(f(x, \theta \odot z), y) \big] minimizes an upper bound on \mathcal{L}(f(x, \theta \odot z), y).
According to the definition of the Bernoulli distribution, \pi_{i} gives the probability of z_{i} being 1 (non-zero). Thus, minimizing \pi_{i} increases the probability of z_{i} being 0, i.e. of the corresponding weight being pruned.
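Concretely, the expected L_{0} penalty under the Bernoulli gates reduces to the sum of the \pi_{i}:

$$ \mathbb{E}_{z \sim \mathrm{Bern}(\pi)} \left[ \sum_{i=1}^{\lvert \theta \rvert} \mathbb{1}[z_{i} \neq 0] \right] = \sum_{i=1}^{\lvert \theta \rvert} \mathbb{P}(z_{i} = 1) = \sum_{i=1}^{\lvert \theta \rvert} \pi_{i} $$

which is exactly the second term of the loss function above.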
In the equation above,
\pi_{i} is a parameter that we want to learn using gradient descent.
Thus, the second term \lambda \sum_{i=1}^{\lvert \theta \rvert} \pi_{i} can be minimized directly to regularize \pi, because its gradient with respect to \pi_{i} is easy to compute (it is simply \lambda).
However, the first term \mathbb{E}_{z \sim \mathrm{Bern}(\pi)} \big[ \mathcal{L}(f(x, \theta \odot z), y) \big] is still problematic: z is a discrete random variable, and the sampling operation z \sim \mathrm{Bern}(\pi) is not differentiable with respect to \pi.
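A small PyTorch sketch of this asymmetry (assuming PyTorch is the training framework; the numbers are illustrative):

```python
import torch

pi = torch.full((5,), 0.5, requires_grad=True)   # Bernoulli parameters to be learned
lam = 0.01

# Second term: lambda * sum(pi) is differentiable; its gradient w.r.t. each pi_i is lambda.
reg = lam * pi.sum()
reg.backward()
print(pi.grad)                                   # every entry equals lam

# First term: the discrete sample z blocks any gradient path back to pi.
z = torch.bernoulli(pi.detach())                 # z in {0, 1}; sampling is not differentiable
```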
Hard concrete distribution
Binary Concrete distribution
The binary concrete distribution can be seen as a continuous approximation of the Bernoulli distribution. It has 2 parameters:
location \alpha: plays the role of the probability parameter of the Bernoulli distribution (it acts as the log-odds of sampling a value near 1).
temperature \beta: controls how closely the binary concrete distribution approximates the Bernoulli distribution (the smaller \beta, the closer the approximation).
Using the reparametrization trick, a sample s from the binary concrete distribution can be written as:
s = \operatorname{sigmoid} \left( \frac{\alpha + l}{\beta} \right)
where l is a sample from the standard logistic distribution, i.e. l = \log u - \log(1 - u) with u \sim \mathrm{Uniform}(0, 1).
Code
```python
from ipywidgets import interact

def plot_logit(size=100000, **kwargs):
    # Draw `size` samples from the logistic distribution and plot their
    # empirical density.
    logit_samples = [logit_dist() for _ in range(size)]
    plot_probability([logit_samples])

interact(plot_logit);
```
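Building on the logistic samples, here is a minimal numpy sketch of drawing s via the reparametrization above (`sample_binary_concrete` is an illustrative name, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_binary_concrete(alpha, beta, size):
    # s = sigmoid((alpha + l) / beta), where l = log(u) - log(1 - u)
    # is a sample from the standard logistic distribution.
    u = rng.uniform(size=size)
    l = np.log(u) - np.log(1.0 - u)
    return 1.0 / (1.0 + np.exp(-(alpha + l) / beta))
```

With a small temperature \beta, the samples concentrate near 0 and 1, mimicking a Bernoulli distribution with probability \operatorname{sigmoid}(\alpha).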
We cannot use s (binary concrete distribution) to directly replace z (Bernoulli distribution):
- The range of s is (0, 1) and never touches 0 or 1.
- However, we want z to be exactly 0 or 1.
A simple trick to solve this problem is
First “stretch” the binary concrete distribution from interval (0, 1) to interval (\gamma, \zeta) with \gamma < 0 and \zeta > 1
\bar{s} = s(\zeta - \gamma) + \gamma
Then clip the stretched binary concrete distribution to the range [0, 1]:
\bar{z} = \mathrm{clip}(\bar{s}, 0, 1)
where
\mathrm{clip}(x, \mathrm{min}, \mathrm{max}) = \min(\max(x, \mathrm{min}), \mathrm{max}), i.e. it restricts x to the range [\mathrm{min}, \mathrm{max}].
\bar{z} is a random variable that follows the hard concrete distribution and it can be used to replace z.
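A minimal numpy sketch of sampling the hard concrete gate \bar{z} (the values \gamma = -0.1 and \zeta = 1.1 are illustrative choices satisfying \gamma < 0 and \zeta > 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hard_concrete(alpha, beta, gamma, zeta, size):
    # Binary concrete sample via the reparametrization trick.
    u = rng.uniform(size=size)
    l = np.log(u) - np.log(1.0 - u)
    s = 1.0 / (1.0 + np.exp(-(alpha + l) / beta))
    # Stretch (0, 1) -> (gamma, zeta), then clip to [0, 1].
    s_bar = s * (zeta - gamma) + gamma
    return np.clip(s_bar, 0.0, 1.0)

z_bar = sample_hard_concrete(alpha=0.0, beta=0.5, gamma=-0.1, zeta=1.1, size=10000)
```

Because \gamma < 0 and \zeta > 1, a non-negligible fraction of the samples lands exactly at 0 or 1, which is what lets \bar{z} act as a (mostly) binary gate.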
Also, we cannot use \alpha (the location parameter of the binary concrete distribution) to replace \pi (the probability parameter of the Bernoulli distribution) in the regularization term of the loss function.
- Remember that the regularization term above measures the sum of the probabilities of z (Bernoulli distribution) being non-zero:
$$ \sum_{i=1}^{\lvert \theta \rvert} \pi_{i} $$
Measuring the probability of \bar{z} being non-zero is the same as measuring the probability of \bar{s} being positive, because all negative values of \bar{s} are clipped to 0.
Since the total probability of a random variable taking any of its possible values is 1, the probability of \bar{s} being positive can be written as: