
Exponential Learning Rates

via blog and (Li & Arora, 2019)

Two key properties of SOTA nets: normalisation of parameters within layers (Batch Norm); and weight decay (i.e. $l_2$ regulariser, [[explicit-regularization]]). For some reason I never thought of [[batch-norm]] as falling in the category of normalisations (see Effectiveness of Normalised Quantities).
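
To make the scale-invariance concrete, here is a minimal sketch (mine, not from the paper) of a single linear layer followed by batch normalisation with a squared-error loss: rescaling the layer's weights leaves the loss essentially unchanged (exactly unchanged if BN's eps is zero).

```python
import numpy as np

def batch_norm(z, eps=1e-8):
    # normalise each output over the batch dimension
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def loss(W, X, y):
    # squared error on the batch-normalised outputs of a linear layer
    return np.mean((batch_norm(X @ W) - y) ** 2)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 5)), rng.normal(size=(32, 3))
W = rng.normal(size=(5, 3))

# scaling W by any c > 0 barely changes the loss (only via eps): L(cW) ≈ L(W)
print(loss(W, X, y), loss(10.0 * W, X, y))
```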

Weight decay combined with BN implies strange dynamics in parameter space, and several experimental papers (van Laarhoven, 2017; Hoffer et al., 2018a; Zhang et al., 2019) noticed that the combination can be viewed as increasing the learning rate (LR).

What they show is the following:

(Informal Theorem) Weight Decay + Constant LR + BN + Momentum is equivalent (in function space) to ExpLR + BN + Momentum

The proof holds for any loss function satisfying scale invariance:

$L(c \cdot \theta) = L(\theta)$ for all $c > 0$
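
As a quick numerical sanity check of the momentum-free case (my own sketch derived from scale invariance, not the paper's exact construction), the two runs below stay exact positive rescalings of each other at every step, so they compute the same function. The loss $L(\theta) = \tfrac{1}{2} u^\top A u$ with $u = \theta / |\theta|$ and the schedule $\tilde{\eta}_t = \eta (1 - \eta\lambda)^{-2t-1}$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
A = rng.normal(size=(d, d))
A = A + A.T  # fixed symmetric matrix defining a scale-invariant loss

def grad(theta):
    # gradient of L(theta) = 0.5 * u^T A u with u = theta/|theta|;
    # note <grad, theta> = 0 and the gradient shrinks as |theta| grows
    n = np.linalg.norm(theta)
    u = theta / n
    return (A @ u - (u @ A @ u) * u) / n

eta, lam, T = 0.1, 0.01, 200
theta = rng.normal(size=d)      # run 1: constant LR + weight decay
theta_exp = theta.copy()        # run 2: exponential LR, no weight decay

for t in range(T):
    theta = (1 - eta * lam) * theta - eta * grad(theta)
    eta_t = eta * (1 - eta * lam) ** (-(2 * t + 1))
    theta_exp = theta_exp - eta_t * grad(theta_exp)

# the runs coincide up to the predicted rescaling (1 - eta*lam)^(-T),
# i.e. they point in the same direction and compute the same function
alpha = (1 - eta * lam) ** (-T)
print(np.linalg.norm(theta_exp - alpha * theta) / np.linalg.norm(theta_exp))
```

The printed relative deviation should be tiny (floating-point error only); the gradient function above also already exhibits the two facts proved in the Lemma below.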

Here's an important Lemma:

Lemma: A scale-invariant loss $L$ satisfies:

\begin{align} \langle \nabla_{\theta} L, \theta \rangle &= 0 \\ \nabla_{\theta} L \mid_{\theta = \theta_0} &= c \nabla_{\theta} L \mid_{\theta = c \theta_0} \end{align}

Proof: Taking derivatives of $L(c \cdot \theta) = L(\theta)$ wrt $c$, and then setting $c=1$, gives the first result. Taking derivatives wrt $\theta$ gives the second result.
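
Spelled out (nothing beyond the chain rule):

\begin{align}
0 = \frac{d}{dc} L(c \cdot \theta) = \langle \nabla_{\theta} L \mid_{c \theta}, \theta \rangle &\;\overset{c = 1}{\Longrightarrow}\; \langle \nabla_{\theta} L, \theta \rangle = 0 \\
\nabla_{\theta} \left[ L(c \cdot \theta) \right] = c \, \nabla_{\theta} L \mid_{c \theta} = \nabla_{\theta} L \mid_{\theta} &\;\Longrightarrow\; \nabla_{\theta} L \mid_{\theta = \theta_0} = c \, \nabla_{\theta} L \mid_{\theta = c \theta_0}
\end{align}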

(Figure: illustration of the Lemma.)

The first result, if you think of it geometrically, says the gradient is always orthogonal to $\theta$, which ensures that $|\theta|$ is increasing under gradient descent. The second result shows that while the loss is scale-invariant, the gradients have a sort of corrective factor such that larger parameters have smaller gradients.
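
Concretely, for a plain gradient-descent step $\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L(\theta_t)$ (ignoring weight decay and momentum), orthogonality gives

\begin{align}
|\theta_{t+1}|^2 &= |\theta_t|^2 - 2 \eta \langle \nabla_{\theta} L(\theta_t), \theta_t \rangle + \eta^2 |\nabla_{\theta} L(\theta_t)|^2 = |\theta_t|^2 + \eta^2 |\nabla_{\theta} L(\theta_t)|^2 \geq |\theta_t|^2
\end{align}

so without weight decay the norm can only grow, shrinking the effective step size.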

Thoughts

The paper itself is more interested in learning rates. What I think is interesting here is the preoccupation with scale-invariance. There seems to be something self-correcting about it that makes it ideal for neural network training. Also, I wonder if there is any way to use the above scale-invariance facts in our proofs.

They also deal with learning rates, but the rates are uniform across all parameters, which makes the analysis much easier than for Adam, where each parameter has its own adaptive rate.


  1. Li, Z. & Arora, S., 2019. An Exponential Learning Rate Schedule for Deep Learning. arXiv.org.