
Why Momentum Really Works

[Interactive hero figure: the trajectory of momentum gradient descent on a quadratic, from the starting point to the solution, with sliders for the step-size (α = 0.02) and momentum (β = 0.99).]

We often think of momentum as a means of dampening oscillations and speeding up the iterations, leading to faster convergence. But it has other interesting behavior. It allows a larger range of step-sizes to be used, and creates its own oscillations. What is going on?

Gabriel Goh, UC Davis
April 4, 2017
Citation: Goh, 2017

Here’s a popular story about momentum [1, 2, 3] : gradient descent is a man walking down a hill. He follows the steepest
path downwards; his progress is slow, but steady. Momentum is a heavy ball rolling down the same hill. The added inertia
acts both as a smoother and an accelerator, dampening oscillations and causing us to barrel through narrow valleys, small
humps and local minima.

This standard story isn’t wrong, but it fails to explain many important behaviors of momentum. In fact, momentum can be
understood far more precisely if we study it on the right model.

One nice model is the convex quadratic. This model is rich enough to reproduce momentum’s local dynamics in real
problems, and yet simple enough to be understood in closed form. This balance gives us powerful traction for
understanding this algorithm.

We begin with gradient descent. The algorithm has many virtues, but speed is not one of them. It is simple — when optimizing a smooth function f, we take a small step in the direction of the negative gradient:

$$w^{k+1} = w^k - \alpha\nabla f(w^k).$$

For a step-size small enough, gradient descent makes a monotonic improvement at every iteration. It always converges, albeit to a local minimum. And under a few weak curvature conditions it can even get there at an exponential rate.
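In code, the iteration is just a few lines. Here is a minimal sketch in NumPy; the quadratic test problem and the parameter values are illustrative choices, not taken from the article's figures:

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha, steps):
    """Iterate w_{k+1} = w_k - alpha * grad_f(w_k)."""
    w = w0.copy()
    for _ in range(steps):
        w = w - alpha * grad_f(w)
    return w

# Illustrative quadratic f(w) = 0.5 w^T A w - b^T w, whose gradient is A w - b.
A = np.diag([1.0, 0.01])   # one well-scaled direction, one shallow one
b = np.array([1.0, 1.0])
w = gradient_descent(lambda w: A @ w - b, np.zeros(2), alpha=1.0, steps=2000)
print(w, "vs optimum", np.linalg.solve(A, b))
```

Note how many iterations even this tiny two-dimensional problem takes: the shallow direction, with eigenvalue 0.01, converges very slowly. This is the sluggishness the next paragraphs diagnose.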

But the exponential decrease, though appealing in theory, can often be infuriatingly small. Things often begin quite well
— with an impressive, almost immediate decrease in the loss. But as the iterations progress, things start to slow down. You
start to get a nagging feeling you’re not making as much progress as you should be. What has gone wrong?
The problem could be the optimizer's old nemesis, pathological curvature. Pathological curvature is, simply put, regions of f which aren't scaled properly. The landscapes are often described as valleys, trenches, canals and ravines. The iterates either jump between valleys, or approach the optimum in small, timid steps. Progress along certain directions grinds to a halt. In these unfortunate regions, gradient descent fumbles.

Momentum proposes the following tweak to gradient descent. We give gradient descent a short-term memory:

$$z^{k+1} = \beta z^k + \nabla f(w^k)$$
$$w^{k+1} = w^k - \alpha z^{k+1}$$

The change is innocent, and costs almost nothing. When β = 0, we recover gradient descent. But for β = 0.99 (sometimes 0.999, if things are really bad), this appears to be the boost we need. Our iterations regain the speed and boldness they lost, speeding to the optimum with a renewed energy.

Optimizers call this minor miracle “acceleration”.
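In code, the tweak is equally small. A sketch, with the same illustrative quadratic as above and example values of α and β:

```python
import numpy as np

def momentum(grad_f, w0, alpha, beta, steps):
    """z_{k+1} = beta * z_k + grad_f(w_k);  w_{k+1} = w_k - alpha * z_{k+1}.
    Setting beta = 0 recovers plain gradient descent."""
    w, z = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        z = beta * z + grad_f(w)
        w = w - alpha * z
    return w

A = np.diag([1.0, 0.01])   # same illustrative quadratic as before
b = np.array([1.0, 1.0])
w = momentum(lambda w: A @ w - b, np.zeros(2), alpha=0.1, beta=0.99, steps=2000)
print(w, "vs optimum", np.linalg.solve(A, b))
```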

The new algorithm may seem at first glance like a cheap hack. A simple trick to get around gradient descent’s more
aberrant behavior — a smoother for oscillations between steep canyons. But the truth, if anything, is the other way round.
It is gradient descent which is the hack. First, momentum gives up to a quadratic speedup on many functions. 1 This is no
small matter — this is similar to the speedup you get from the Fast Fourier Transform, Quicksort, and Grover’s Algorithm.
When the universe gives you quadratic speedups, you should start to pay attention.

But there’s more. A lower bound, courtesy of Nesterov [5] , states that momentum is, in a certain very narrow and technical
sense, optimal. Now, this doesn’t mean it is the best algorithm for all functions in all circumstances. But it does satisfy
some curiously beautiful mathematical properties which scratch a very human itch for perfection and closure. But more
on that later. Let’s say this for now — momentum is an algorithm for the book.

First Steps: Gradient Descent


We begin by studying gradient descent on the simplest model possible which isn’t trivial — the convex quadratic,

$$f(w) = \frac{1}{2}w^TAw - b^Tw, \qquad w \in \mathbb{R}^n.$$

Assume A is symmetric and invertible; then the optimal solution w^⋆ occurs at

$$w^\star = A^{-1}b.$$

Simple as this model may be, it is rich enough to approximate many functions (think of A as your favorite model of
curvature — the Hessian, Fisher Information Matrix [6] , etc) and captures all the key features of pathological curvature.
And more importantly, we can write an exact closed formula for gradient descent on this function.

This is how it goes. Since ∇f(w) = Aw − b, the iterates are

$$w^{k+1} = w^k - \alpha(Aw^k - b).$$

Here’s the trick. There is a very natural space to view gradient descent where all the dimensions act independently — the
eigenvectors of A.

Every symmetric matrix A has an eigenvalue decomposition

$$A = Q\,\mathrm{diag}(\lambda_1, \ldots, \lambda_n)\,Q^T, \qquad Q = [q_1, \ldots, q_n],$$

and, as per convention, we will assume that the λ_i's are sorted, from smallest λ_1 to biggest λ_n. If we perform a change of basis, x^k = Q^T(w^k − w^⋆), the iterations break apart, becoming

$$x_i^{k+1} = x_i^k - \alpha\lambda_i x_i^k = (1 - \alpha\lambda_i)x_i^k = (1 - \alpha\lambda_i)^{k+1}x_i^0.$$

Moving back to our original space w, we can see that

$$w^k - w^\star = Qx^k = \sum_i^n x_i^0(1 - \alpha\lambda_i)^k q_i$$

and there we have it — gradient descent in closed form.
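The closed form can be checked numerically. A sketch, using a randomly generated positive definite A (an illustrative choice), comparing the raw iterates against the formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, k = 4, 0.01, 100
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)            # a random symmetric positive definite matrix
b = rng.standard_normal(n)

w_star = np.linalg.solve(A, b)
lam, Q = np.linalg.eigh(A)         # A = Q diag(lam) Q^T, eigenvalues ascending
x0 = Q.T @ (np.zeros(n) - w_star)  # initial error in the eigenbasis

w = np.zeros(n)
for _ in range(k):
    w = w - alpha * (A @ w - b)    # plain gradient descent

# Closed form: w^k - w^star = sum_i x0_i (1 - alpha*lam_i)^k q_i
closed = w_star + Q @ (x0 * (1 - alpha * lam) ** k)
print(np.allclose(w, closed))      # True
```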

Decomposing the Error

The above equation admits a simple interpretation. Each element of x0​ is the component of the error in the initial guess in
the Q-basis. There are n such errors, and each of these errors follows its own, solitary path to the minimum, decreasing
exponentially with a compounding rate of 1 − αλ​i . The closer that number is to 1, the slower it converges.

For most step-sizes, the eigenvectors with largest eigenvalues converge the fastest. This triggers an explosion of progress
in the first few iterations, before things slow down as the smaller eigenvectors’ struggles are revealed. By writing the
contributions of each eigenspace’s error to the loss

$$f(w^k) - f(w^\star) = \sum_i (1 - \alpha\lambda_i)^{2k}\,\lambda_i\,[x_i^0]^2$$

we can visualize the contributions of each error component to the loss.

[Interactive figure: optimization can be seen as a combination of several component problems, shown here as components 1, 2, and 3 with eigenvalues λ_1 = 0.01, λ_2 = 0.1, and λ_3 = 1 respectively. The plot shows f(w^k) − f(w^⋆) over the iterations, with a slider for the step-size. At the initial point, the error in each component is equal. At the optimal step-size, the rates of convergence of the largest and smallest eigenvalues equalize.]
Choosing a Step-size

The above analysis gives us immediate guidance as to how to set a step-size α. In order to converge, each ∣1 − αλi​ ∣ must
be strictly less than 1. All workable step-sizes, therefore, fall in the interval

$$0 < \alpha\lambda_i < 2.$$

The overall convergence rate is determined by the slowest error component, which must be either λ_1 or λ_n:

$$\mathrm{rate}(\alpha) = \max_i\,|1 - \alpha\lambda_i| = \max\left\{|1 - \alpha\lambda_1|,\ |1 - \alpha\lambda_n|\right\}$$

This overall rate is minimized when the rates for λ_1 and λ_n are the same — this mirrors our informal observation in the previous section that the optimal step-size causes the first and last eigenvectors to converge at the same rate. If we work this through we get:

$$\text{optimal } \alpha = \operatorname*{argmin}_\alpha\,\mathrm{rate}(\alpha) = \frac{2}{\lambda_1 + \lambda_n}$$

$$\text{optimal rate} = \min_\alpha\,\mathrm{rate}(\alpha) = \frac{\lambda_n/\lambda_1 - 1}{\lambda_n/\lambda_1 + 1}$$

Notice the ratio λ_n/λ_1 determines the convergence rate of the problem. In fact, this ratio appears often enough that we give it a name, and a symbol — the condition number.

$$\text{condition number} := \kappa := \frac{\lambda_n}{\lambda_1}$$
The condition number means many things. It is a measure of how close to singular a matrix is. It is a measure of how robust A^{-1}b is to perturbations in b. And, in this context, the condition number gives us a measure of how poorly gradient descent will perform. A ratio of κ = 1 is ideal, giving convergence in one step (of course, the function is trivial). Unfortunately the larger the ratio, the slower gradient descent will be. The condition number is therefore a direct measure of pathological curvature.
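In code, the optimal step-size and its rate follow directly from the extremal eigenvalues. A sketch, reusing the eigenvalues from the figure above:

```python
import numpy as np

lam = np.array([0.01, 0.1, 1.0])   # eigenvalues from the figure
kappa = lam[-1] / lam[0]           # condition number, here 100

alpha = 2 / (lam[0] + lam[-1])     # the optimal step-size derived above
rate = max(abs(1 - alpha * lam[0]), abs(1 - alpha * lam[-1]))
print(alpha, rate)                 # rate ~ 0.9802
print((kappa - 1) / (kappa + 1))   # equals the optimal rate
```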

Example: Polynomial Regression


The above analysis reveals an insight: all errors are not made equal. Indeed, there are different kinds of errors, n to be
exact, one for each of the eigenvectors of A. And gradient descent is better at correcting some kinds of errors than others.
But what do the eigenvectors of A mean? Surprisingly, in many applications they admit a very concrete interpretation.

Let's see how this plays out in polynomial regression. Given 1D data, ξ_i, our problem is to fit the model

$$\text{model}(\xi) = w_1p_1(\xi) + \cdots + w_np_n(\xi), \qquad p_i = \xi \mapsto \xi^{i-1}$$


to our observations, d​i . This model, though nonlinear in the input ξ , is linear in the weights, and therefore we can write
the model as a linear combination of monomials, like:

[Interactive figure: scrub the weights to see the fit; for example, model = 2.65 p_1 − 4.09 p_2 − 26.0 p_3 + 53.7 p_4 + 53.4 p_5 − 96.1 p_6.]
Because of the linearity, we can fit this model to our data ξi​ using linear regression on the model mismatch

$$\operatorname*{minimize}_w\ \tfrac{1}{2}\sum_i\left(\text{model}(\xi_i) - d_i\right)^2 = \tfrac{1}{2}\|Zw - d\|^2$$

where

$$Z = \begin{pmatrix} 1 & \xi_1 & \xi_1^2 & \dots & \xi_1^{n-1} \\ 1 & \xi_2 & \xi_2^2 & \dots & \xi_2^{n-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \xi_m & \xi_m^2 & \dots & \xi_m^{n-1} \end{pmatrix}.$$

The path of convergence, as we know, is elucidated when we view the iterates in the space of Q (the eigenvectors of Z T​ Z ).
So let’s recast our regression problem in the basis of Q. First, we do a change of basis, by rotating w into Qw , and counter-
rotating our feature maps p into eigenspace, ​p̄ . We can now conceptualize the same regression as one over a different
polynomial basis, with the model

$$\text{model}(\xi) = x_1\bar{p}_1(\xi) + \cdots + x_n\bar{p}_n(\xi), \qquad \bar{p}_i = \sum_j q_{ij}p_j.$$

This model is identical to the old one. But these new features ​p̄ (which I call “eigenfeatures”) and weights have the
pleasing property that each coordinate acts independently of the others. Now our optimization problem breaks down,
really, into n small 1D optimization problems. And each coordinate can be optimized greedily and independently, one at a
time in any order, to produce the final, global, optimum. The eigenfeatures are also much more informative:

[Interactive figure: the same fit expressed in the eigenfeature basis; for example, model = 0.458 p̄_1 + 0.903 p̄_2 + 6.13 p̄_3 + 12.4 p̄_4 + 128 p̄_5 + 261 p̄_6.]

[Figure: the eigenfeatures. The data comes in 2 clusters; the first 2 eigenfeatures capture the variations between the clusters. Next there are smooth variations within the clusters, and finally, jagged polynomials, which differ wildly on neighboring points, drag peaks to fit the data.]

The observations in the above diagram can be justified mathematically. From a statistical point of view, we would like a model which is, in some sense, robust to noise. Our model cannot possibly be meaningful if the slightest perturbation to the observations changes the entire model dramatically. And the eigenfeatures, the principal components of the data, give us exactly the decomposition we need to sort the features by their sensitivity to perturbations in the d_i's. The most robust components appear in the front (with the largest eigenvalues), and the most sensitive components in the back (with the smallest eigenvalues).

This measure of robustness, by a rather convenient coincidence, is also a measure of how easily an eigenspace converges.
And thus, the “pathological directions” — the eigenspaces which converge the slowest — are also those which are most
sensitive to noise! So starting at a simple initial point like 0 (by a gross abuse of language, let’s think of this as a prior), we
track the iterates till a desired level of complexity is reached. Let’s see how this plays out in gradient descent.

[Interactive figure: gradient descent on the eigenfeature coefficients, beginning at x = w = 0; at step k = 49, for example, model = −0.389 p̄_1 − 2.47 p̄_2 − 0.481 p̄_3 − 0.540 p̄_4 − 0.0128 p̄_5 + 0.0543 p̄_6. When an eigenspace has converged to three significant digits, its bar greys out. Drag the observations to change the fit.]
This effect is harnessed with the heuristic of early stopping : by stopping the optimization early, you can often get better
generalizing results. Indeed, the effect of early stopping is very similar to that of more conventional methods of
regularization, such as Tikhonov Regression. Both methods try to suppress the components of the smallest eigenvalues
directly, though they employ different methods of spectral decay.2 But early stopping has a distinct advantage. Once the
step-size is chosen, there are no regularization parameters to fiddle with. Indeed, in the course of a single optimization, we
have the entire family of models, from underfitted to overfitted, at our disposal. This gift, it seems, doesn’t come at a price.
A beautiful free lunch [7] indeed.
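A sketch of this experiment, with made-up data and snapshot iterations chosen purely for illustration: run gradient descent on the least squares objective and watch the eigenfeature coefficients fill in, from the most robust components to the most sensitive:

```python
import numpy as np

rng = np.random.default_rng(1)
xi = np.sort(rng.uniform(-1, 1, 12))                 # made-up 1D inputs
d = np.sin(3 * xi) + 0.1 * rng.standard_normal(12)   # made-up noisy observations
n = 6
Z = np.vander(xi, n, increasing=True)                # columns are p_i(x) = x^(i-1)

lam, Q = np.linalg.eigh(Z.T @ Z)                     # the eigenfeature basis
alpha = 1.0 / lam[-1]                                # a stable step-size

w = np.zeros(n)                                      # the "prior": start at 0
for k in range(1, 201):
    w = w - alpha * (Z.T @ (Z @ w - d))              # gradient descent on the loss
    if k in (10, 50, 200):                           # snapshots, under- to over-fit
        print(k, np.round(Q.T @ w, 3))               # coefficients on eigenfeatures
```

Stopping at an early snapshot keeps the sensitive, small-eigenvalue coefficients near zero, which is exactly the regularizing effect described above.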

The Dynamics of Momentum


Let's turn our attention back to momentum. Recall that the momentum update is

$$z^{k+1} = \beta z^k + \nabla f(w^k)$$
$$w^{k+1} = w^k - \alpha z^{k+1}.$$

Since ∇f(w^k) = Aw^k − b, the update on the quadratic is

$$z^{k+1} = \beta z^k + (Aw^k - b)$$
$$w^{k+1} = w^k - \alpha z^{k+1}.$$

Following [8] , we go through the same motions, with the change of basis x^k = Q^T(w^k − w^⋆) and y^k = Q^T z^k, to yield the update rule

$$y_i^{k+1} = \beta y_i^k + \lambda_i x_i^k$$
$$x_i^{k+1} = x_i^k - \alpha y_i^{k+1},$$

in which each component acts independently of the other components (though x_i^k and y_i^k are coupled). This lets us rewrite our iterates as 3

$$\begin{pmatrix} y_i^k \\ x_i^k \end{pmatrix} = R^k\begin{pmatrix} y_i^0 \\ x_i^0 \end{pmatrix}, \qquad R = \begin{pmatrix} \beta & \lambda_i \\ -\alpha\beta & 1 - \alpha\lambda_i \end{pmatrix}.$$

There are many ways of taking a matrix to the kth power. But for the 2 × 2 case there is an elegant and little known formula [9] in terms of the eigenvalues of R, σ_1 and σ_2:

$$R^k = \begin{cases} \sigma_1^kR_1 - \sigma_2^kR_2 & \sigma_1 \neq \sigma_2 \\ \sigma_1^k\left(kR/\sigma_1 - (k-1)I\right) & \sigma_1 = \sigma_2 \end{cases}, \qquad R_j = \frac{R - \sigma_jI}{\sigma_1 - \sigma_2}$$

This formula is rather complicated, but the takeaway here is that it plays the exact same role that the individual convergence rates, 1 − αλ_i, do in gradient descent. But instead of one geometric series, we have two coupled series, which may have real or complex values. The convergence rate is therefore the slower of the two rates, max{|σ_1|, |σ_2|} 4. By plotting this out, we see there are distinct regions of the parameter space which reveal a rich taxonomy of convergence behavior [10] :

[Interactive figure: a plot of max{|σ_1|, |σ_2|} over the step-size α and momentum β reveals distinct regions, each with its own style of convergence:]

- Ripples: R's eigenvalues are complex, and the iterates display low frequency ripples. Surprisingly, the convergence rate √β is independent of α and λ_i.
- Monotonic Decrease: R's eigenvalues are both real, are positive, and have norm less than one. The behavior here resembles gradient descent.
- 1-Step Convergence: when α = 1/λ_i and β = 0, we converge in one step. This is a very special point, and kills the error in the eigenspace completely.
- Monotonic Oscillations: when α > 1/λ_i, the iterates flip between + and − at each iteration. These are often referred to as 'oscillations' in gradient descent.
- Divergence: when max{|σ_1|, |σ_2|} > 1, the iterates diverge.

For what values of α and β does momentum converge? Since we need both σ_1 and σ_2 to converge, our convergence criterion is now max{|σ_1|, |σ_2|} < 1. The range of available step-sizes works out 5 to be

$$0 < \alpha\lambda_i < 2 + 2\beta \qquad \text{for} \qquad 0 \leq \beta < 1.$$

We recover the previous result for gradient descent when β = 0. But notice an immediate boon we get: momentum allows us to crank up the step-size by a factor of 2 before diverging.

The Critical Damping Coefficient

The true magic happens, however, when we find the sweet spot of α and β . Let us try to first optimize over β .

Momentum admits an interesting physical interpretation when α is small [11] : it is a discretization of a damped harmonic oscillator. Consider a physical simulation operating in discrete time (like a video game).

$$y_i^{k+1} = \beta y_i^k + \lambda_i x_i^k$$

We can think of −y_i^k as velocity, which is dampened at each step and perturbed by an external force field.

$$x_i^{k+1} = x_i^k - \alpha y_i^{k+1}$$

And x is our particle's position, which is moved at each step by a small amount in the direction of the velocity y_i^{k+1}.
We can break this equation apart to see how each component affects the dynamics of the system. Here we plot, for 150
iterates, the particle’s velocity (the horizontal axis) against its position (the vertical axis), in a phase diagram.

[Interactive figure: phase diagrams across damping values β = 1, 0.985, 0.95, 0.9, 0.85, 0.8 (horizontal axis of the grid) and external force λ = 0, 0.02, 0.1, 0.25 (vertical axis), from the initial point x = 1, y = 1. When λ_i = 0 and β = 1, the object moves at constant speed. As β goes down, the particle decelerates, losing a proportion of its energy at each tick. The external force causes the particle to return to the origin. Combining damping and the force field, the particle behaves like a damped harmonic oscillator, returning lazily to equilibrium.]

This system is best imagined as a weight suspended on a spring. We pull the weight down by one unit, and we study the path it follows as it returns to equilibrium. In the analogy, the spring is the source of our external force λ_i x_i^k, and equilibrium is the state when both the position x_i^k and the speed y_i^k are 0. The choice of β crucially affects the rate of return to equilibrium.

[Interactive figure: the weight on a spring across a range of momentum values β, from 1.0 down to 0.0.]

- Underdamping: when β is too large we're under-damping. Here the resistance is too small, and the spring oscillates up and down forever, missing the optimal value over and over.
- Critical Damping: the best value of β lies in the middle of the two extremes. This sweet spot happens when the eigenvalues of R are repeated, when β = (1 − √(αλ_i))².
- Overdamping: when β is too small (e.g. in gradient descent, β = 0), we're over-damping. The particle is immersed in a viscous fluid which saps it of its kinetic energy at every timestep.

The critical value of β = (1 − √(αλ_i))² gives us a convergence rate (in eigenspace i) of 1 − √(αλ_i). A square root improvement over gradient descent, 1 − αλ_i! Alas, this only applies to the error in the ith eigenspace, with α fixed.

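A quick numerical check of the critical damping claim, with illustrative values of α and λ:

```python
import numpy as np

def momentum_rate(alpha, beta, lam):
    R = np.array([[beta, lam], [-alpha * beta, 1 - alpha * lam]])
    return np.abs(np.linalg.eigvals(R)).max()

alpha, lam = 0.1, 1.0                        # illustrative values
beta_crit = (1 - np.sqrt(alpha * lam)) ** 2  # the critical damping coefficient
print(momentum_rate(alpha, beta_crit, lam))  # ~0.684 = 1 - sqrt(alpha * lam)
print(1 - alpha * lam)                       # 0.9, gradient descent's rate
```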
Optimal parameters

To get a global convergence rate, we must optimize over both α and β. This is a more complicated affair, 6 but they work out to be

$$\alpha = \left(\frac{2}{\sqrt{\lambda_1} + \sqrt{\lambda_n}}\right)^2 \qquad \beta = \left(\frac{\sqrt{\lambda_n} - \sqrt{\lambda_1}}{\sqrt{\lambda_n} + \sqrt{\lambda_1}}\right)^2$$

Plug this into the convergence rate, and you get

$$\underbrace{\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}}_{\substack{\text{convergence rate,}\\ \text{momentum}}} \qquad\qquad \underbrace{\frac{\kappa - 1}{\kappa + 1}}_{\substack{\text{convergence rate,}\\ \text{gradient descent}}}$$
With barely a modicum of extra effort, we have essentially square rooted the condition number! These gains, in principle, require explicit knowledge of λ_1 and λ_n. But the formulas reveal a simple guideline. When the problem's conditioning is poor, the optimal α is approximately twice that of gradient descent, and the momentum term is close to 1. So set β as close to 1 as you can, and then find the highest α which still converges. Being at the knife's edge of divergence, like in gradient descent, is a good place to be.
[Interactive figure: the same error decomposition, now for momentum, with eigenvalues λ_1 = 0.01, λ_2 = 0.1, and λ_3 = 1, plotting f(w^k) − f(w^⋆) over the iterations with sliders for the step-size α and momentum β and a button for the optimal parameters. Though the decrease is no longer monotonic, it is significantly faster. Note that the optimal parameters do not necessarily imply the fastest convergence, only the fastest asymptotic convergence rate.]

While the loss function of gradient descent had a graceful, monotonic curve, optimization with momentum displays clear
oscillations. These ripples are not restricted to quadratics, and occur in all kinds of functions in practice. They are not
cause for alarm, but are an indication that extra tuning of the hyperparameters is required.
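The optimal parameters translate directly into code. A sketch, with illustrative extremal eigenvalues:

```python
import numpy as np

def optimal_params(lam1, lamn):
    """Optimal (alpha, beta) for momentum on a quadratic whose extremal
    eigenvalues are lam1 <= lamn, per the formulas above."""
    alpha = (2 / (np.sqrt(lam1) + np.sqrt(lamn))) ** 2
    beta = ((np.sqrt(lamn) - np.sqrt(lam1)) / (np.sqrt(lamn) + np.sqrt(lam1))) ** 2
    return alpha, beta

lam1, lamn = 0.01, 1.0                              # condition number kappa = 100
kappa = lamn / lam1
alpha, beta = optimal_params(lam1, lamn)
print(alpha, beta)
print((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))  # momentum:         ~0.818
print((kappa - 1) / (kappa + 1))                    # gradient descent: ~0.980
```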

Example: The Colorization Problem


Let’s look at how momentum accelerates convergence with a concrete example. On a grid of pixels let G be the graph with
vertices as pixels, E be the set of edges connecting each pixel to its four neighboring pixels, and D be a small set of a few
distinguished vertices. Consider the problem of minimizing
$$\operatorname*{minimize}\ \tfrac{1}{2}\sum_{i\in D}(w_i - 1)^2 + \tfrac{1}{2}\sum_{i,j\in E}(w_i - w_j)^2.$$

The first term is the colorizer, which pulls the distinguished pixels towards 1; the second is the smoother, which spreads out the color.
The optimal solution to this problem is a vector of all 1's 7. An inspection of the gradient iteration reveals why we take a long time to get there. The gradient step, for each component, is some form of weighted average of the current value and its neighbors:

$$w_i^{k+1} = w_i^k - \alpha\sum_{j\in N(i)}\left(w_i^k - w_j^k\right) - \begin{cases} \alpha(w_i^k - 1) & i \in D \\ 0 & i \notin D \end{cases}$$

This kind of local averaging is effective at smoothing out local variations in the pixels, but poor at taking advantage of
global structure. The updates are akin to a drop of ink, diffusing through water. Movement towards equilibrium is made
only through local corrections and so, left undisturbed, its march towards the solution is slow and laborious. Fortunately,
momentum speeds things up significantly.
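A sketch of this experiment on a path graph rather than the article's 2D grid (a simplification for illustration; the graph size, step-size, and momentum value are all assumptions):

```python
import numpy as np

n = 50                                    # a path graph stands in for the grid
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1                   # the Laplacian of the path
D = np.zeros(n); D[0] = 1.0               # one distinguished "colorizer" node

def grad(w):                              # gradient of the objective above
    return L @ w + D * (w - 1)

def run(beta, alpha=0.3, steps=2000):
    w, z = np.zeros(n), np.zeros(n)
    for _ in range(steps):
        z = beta * z + grad(w)
        w = w - alpha * z
    return np.abs(w - 1).max()            # distance from the all-ones solution

print("gradient descent:", run(beta=0.0))    # still far from the solution
print("momentum:        ", run(beta=0.95))   # orders of magnitude closer
```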

[Interactive figure: momentum on the colorization problem, with sliders for the step-size and momentum and a button for the optimal parameters. Each square represents a node, colored by its weight; edges connect each square to the four neighboring squares. The eigenvectors of the colorization problem form a generalized Fourier basis for R^n. The smallest eigenvalues have low frequencies, hence gradient descent corrects high frequency errors well but not low frequency ones.]
In vectorized form, the colorization problem is

$$\operatorname*{minimize}\ \tfrac{1}{2}x^TL_Gx + \tfrac{1}{2}\sum_{i\in D}\left(x^Te_ie_i^Tx - e_i^Tx\right)$$

The smoother's quadratic form is the Graph Laplacian, and the colorizer is a small low rank correction with a linear term (e_i is the i-th unit vector). The Laplacian matrix, L_G, 8 which dominates the behavior of the optimization problem, is a valuable bridge between linear algebra and graph theory. This is a rich field of study, but one fact is pertinent to our discussion here. The conditioning of L_G, here defined as the ratio of the second eigenvalue to the last (the first eigenvalue is always 0, with eigenvector equal to the vector of all 1's), is directly connected to the connectivity of the graph.

[Figure: small world graphs, like expanders and dense graphs, have excellent conditioning. The conditioning of grids improves with their dimensionality. And long, wiry graphs, like paths, condition poorly.]

These observations carry through to the colorization problem, and the intuition behind it should be clear. Well connected
graphs allow rapid diffusion of information through the edges, while graphs with poor connectivity do not. And this
principle, taken to the extreme, furnishes a class of functions so hard to optimize they reveal the limits of first order
optimization.
The Limits of Descent
Let’s take a step back. We have, with a clever trick, improved the convergence of gradient descent by a quadratic factor
with the introduction of a single auxiliary sequence. But is this the best we can do? Could we improve convergence even
more with two sequences? Could one perhaps choose the α’s and β ’s intelligently and adaptively? It is tempting to ride
this wave of optimism - to the cube root and beyond!

Unfortunately, while improvements to the momentum algorithm do exist, they all run into a certain, critical, almost
inescapable lower bound.

Adventures in Algorithmic Space

To understand the limits of what we can do, we must first formally define the algorithmic space in which we are
searching. Here’s one possible definition. The observation we will make is that both gradient descent and momentum can
be “unrolled”. Indeed, since

$$w^1 = w^0 - \alpha\nabla f(w^0)$$
$$w^2 = w^1 - \alpha\nabla f(w^1) = w^0 - \alpha\nabla f(w^0) - \alpha\nabla f(w^1)$$
$$\vdots$$
$$w^{k+1} = w^0 - \alpha\nabla f(w^0) - \cdots - \alpha\nabla f(w^k)$$

we can write gradient descent as

$$w^{k+1} = w^0 - \alpha\sum_i^k\nabla f(w^i).$$

A similar trick can be done with momentum:

$$w^{k+1} = w^0 - \alpha\sum_i^k\frac{1 - \beta^{k+1-i}}{1 - \beta}\nabla f(w^i).$$
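This identity can be verified numerically. A sketch, on an illustrative quadratic:

```python
import numpy as np

A = np.diag([0.1, 1.0]); b = np.array([1.0, 1.0])   # illustrative quadratic
grad = lambda w: A @ w - b
alpha, beta, k = 0.5, 0.9, 20

w, z = np.zeros(2), np.zeros(2)           # the recursive form of momentum
grads = []
for i in range(k + 1):
    grads.append(grad(w))
    z = beta * z + grads[-1]
    w = w - alpha * z

# The unrolled form, with w^0 = 0:
# w^{k+1} = w^0 - alpha * sum_i (1 - beta^{k+1-i}) / (1 - beta) * grad f(w^i)
unrolled = -alpha * sum((1 - beta ** (k + 1 - i)) / (1 - beta) * g
                        for i, g in enumerate(grads))
print(np.allclose(w, unrolled))           # True
```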

In fact, all manner of first order algorithms, including the Conjugate Gradient algorithm, AdaMax, Averaged Gradient and more, can be written (though not quite so neatly) in this unrolled form. Therefore the class of algorithms for which

$$w^{k+1} = w^0 + \sum_i^k\gamma_i^k\nabla f(w^i) \qquad \text{for some } \gamma_i^k$$

contains momentum, gradient descent and a whole bunch of other algorithms you might dream up. This is what is
assumed in Assumption 2.1.4 [5] of Nesterov. But let’s push this even further, and expand this class to allow different step-
sizes for different directions.

$$w^{k+1} = w^0 + \sum_i^k\Gamma_i^k\nabla f(w^i) \qquad \text{for some diagonal matrices } \Gamma_i^k.$$

This class of methods covers most of the popular algorithms for training neural networks, including ADAM and AdaGrad.
We shall refer to this class of methods as “Linear First Order Methods”, and we will show a single function all these
methods ultimately fail on.

The Resisting Oracle

Earlier, when we talked about the colorizer problem, we observed that wiry graphs cause bad conditioning in our
optimization problem. Taking this to its extreme, we can look at a graph consisting of a single path — a function so badly
conditioned that Nesterov called a variant of it “the worst function in the world”. The function follows the same structure
as the colorizer problem, and we shall call this the Convex Rosenbrock,

$$f^n(w) = \frac{1}{2}(w_1 - 1)^2 + \frac{1}{2}\sum_{i=1}^{n-1}(w_i - w_{i+1})^2 + \frac{2}{\kappa - 1}\|w\|^2,$$

with a colorizer of one node, strong couplings of adjacent nodes in the path, and a small regularization term.
The optimal solution of this problem is

$$w_i^\star = \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^i$$

and the condition number of the problem f^n approaches κ as n goes to infinity. Now observe the behavior of the momentum algorithm on this function, starting from w^0 = 0.

[Interactive figure: the first 50 iterates of momentum on the Convex Rosenbrock for n = 25, showing the weights and the error at each iteration, with sliders for the step-size and momentum and a button for the optimal parameters. The behavior here is similar to that of any Linear First Order Algorithm. The triangle of components ahead of the moving front is a "dead zone" of our iterates: those components are always 0, no matter what the parameters. The remaining expanding space is the "light cone" of our iterate's influence. Momentum does very well here with the optimal parameters.]

The observations made in the above diagram are true for any Linear First Order algorithm. Let us prove this. First observe that each component of the gradient depends only on the values directly before and after it:

$$\nabla f(w)_i = 2w_i - w_{i-1} - w_{i+1} + \frac{4}{\kappa - 1}w_i, \qquad i \neq 1.$$
Therefore the fact that we start at 0 guarantees that each component must remain stoically there until an element either before or after it turns nonzero. And therefore, by induction, for any linear first order algorithm,

$$w^0 = [\,0,\ 0,\ 0,\ \ldots,\ 0,\ 0,\ \ldots,\ 0\,]$$
$$w^1 = [\,w_1^1,\ 0,\ 0,\ \ldots,\ 0,\ 0,\ \ldots,\ 0\,]$$
$$w^2 = [\,w_1^2,\ w_2^2,\ 0,\ \ldots,\ 0,\ 0,\ \ldots,\ 0\,]$$
$$\vdots$$
$$w^k = [\,w_1^k,\ w_2^k,\ w_3^k,\ \ldots,\ w_k^k,\ 0,\ \ldots,\ 0\,].$$
Think of this restriction as a "speed of light" of information transfer. Error signals will take at least k steps to move from w^0 to w^k. We can therefore sum up the errors which cannot have changed yet 9:

$$\|w^k - w^\star\|_\infty \geq \max_{i \geq k+1}\{|w_i^\star|\} = \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{k+1} = \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{k}\|w^0 - w^\star\|_\infty.$$

As n gets large, the condition number of f^n approaches κ. And the gap therefore closes; the convergence rate that momentum promises matches the best any linear first order algorithm can do. And we arrive at the disappointing conclusion that on this problem, we cannot do better.
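The "light cone" is easy to observe numerically. A sketch, using the reconstruction of the Convex Rosenbrock above (the parameter values are illustrative):

```python
import numpy as np

n, kappa = 25, 100.0

def grad(w):
    """Gradient of the Convex Rosenbrock, per the formula above."""
    g = 4.0 / (kappa - 1) * w             # from the regularization term
    g[0] += w[0] - 1                      # from the colorizer on the first node
    g[:-1] += w[:-1] - w[1:]              # couplings between adjacent nodes
    g[1:] += w[1:] - w[:-1]
    return g

w, z = np.zeros(n), np.zeros(n)
alpha, beta = 0.5, 0.8                    # illustrative parameter values
for k in range(1, 11):
    z = beta * z + grad(w)
    w = w - alpha * z
    assert np.all(w[k:] == 0)             # beyond index k the iterates are exactly 0
print(np.round(w, 3))
```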

Like many such lower bounds, this result must not be taken literally, but spiritually. It, perhaps, gives a sense of closure
and finality to our investigation. But this is not the final word on first order optimization. This lower bound does not
preclude the possibility, for example, of reformulating the problem to change the condition number itself! There is still
much room for speedups, if you understand the right places to look.

Momentum with Stochastic Gradients


There is a final point worth addressing. All the discussion above assumes access to the true gradient — a luxury seldom
afforded in modern machine learning. Computing the exact gradient requires a full pass over all the data, the cost of
which can be prohibitively expensive. Instead, randomized approximations of the gradient, like minibatch sampling, are
often used as a plug-in replacement of ∇f (w). We can write the approximation in two parts,

$$\nabla f(w) + \text{error}(w),$$

the true gradient plus an approximation error (which, if the estimator is unbiased, satisfies E[error(w)] = 0).
It is helpful to think of our approximate gradient as the injection of a special kind of noise into our iteration. And using the machinery developed in the previous sections, we can deal with this extra term directly. On a quadratic, the error term cleaves cleanly into a separate term, where 10

$$\begin{pmatrix} y_i^k \\ x_i^k \end{pmatrix} = R^k\begin{pmatrix} y_i^0 \\ x_i^0 \end{pmatrix} + \sum_{j=1}^k\epsilon_i^j\,R^{k-j}\begin{pmatrix} 1 \\ -\alpha \end{pmatrix}$$

The noisy iterates are a sum of the noiseless, deterministic iterates and a decaying sum of the errors, where ϵ^k = Q · error(w^k).

The error term, ϵ^k, with its dependence on the w^k, is a fairly hairy object. Following [10] , we model this as independent 0-mean Gaussian noise. In this simplified model, the objective also breaks into two separable components, a sum of a deterministic error and a stochastic error 11, visualized here.

[Interactive figure: we decompose the expected value of the objective, E f(w) − f(w^⋆), into a deterministic part and a stochastic part, with sliders for the step-size and momentum. The small black dots are a single run of stochastic gradient, and the plot splits into a transient phase and a fine-tuning phase.]
As [1] observes, the optimization has two phases. In the initial transient phase the magnitude of the noise is smaller than the magnitude of the gradient, and momentum still makes good progress. In the second, stochastic phase, the noise overwhelms the gradient, and momentum is less effective.

Note that there are a set of unfortunate tradeoffs which seem to pit the two components of error against each other. Lowering the step-size, for example, decreases the stochastic error, but also slows down the rate of convergence. And increasing momentum, contrary to popular belief, causes the errors to compound. Despite these undesirable properties, stochastic gradient descent with momentum has still been shown to have competitive performance on neural networks. As [1] has observed, the transient phase seems to matter more than the fine-tuning phase in machine learning. And in fact, it has been recently suggested [12] that this noise is a good thing — it acts as an implicit regularizer, which, like early stopping, prevents overfitting in the fine-tuning phase of optimization.
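A sketch of the simplified model, with a made-up noise scale and illustrative parameters: momentum on a quadratic, with Gaussian noise added to each gradient, shows fast transient progress followed by a noise floor:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([0.1, 1.0]); b = np.array([1.0, 1.0])   # illustrative quadratic
w_star = np.linalg.solve(A, b)
f = lambda w: 0.5 * w @ A @ w - b @ w

def noisy_momentum(alpha, beta, sigma, steps=300):
    """Momentum where each gradient is corrupted by 0-mean Gaussian noise."""
    w, z = np.zeros(2), np.zeros(2)
    losses = []
    for _ in range(steps):
        g = (A @ w - b) + sigma * rng.standard_normal(2)
        z = beta * z + g
        w = w - alpha * z
        losses.append(f(w) - f(w_star))
    return losses

losses = noisy_momentum(alpha=0.2, beta=0.9, sigma=0.1)
print(losses[10], losses[-1])   # rapid early progress, then a noise floor
```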

Onwards and Downwards


The study of acceleration is seeing a small revival within the optimization community. If the ideas in this article excite
you, you may wish to read [13] , which fully explores the idea of momentum as the discretization of a certain differential
equation. But other, less physical, interpretations exist. There is an algebraic interpretation of momentum in terms of
approximating polynomials [3, 14] . Geometric interpretations are emerging [15, 16] , connecting momentum to older
methods, like the Ellipsoid method. And finally, there are interpretations relating momentum to duality [17] , perhaps providing a clue as to how to accelerate second order methods and Quasi-Newton methods (for a first step, see [18] ). But like the
proverbial blind men feeling an elephant, momentum seems like something bigger than the sum of its parts. One day,
hopefully soon, the many perspectives will converge into a satisfying whole.

Acknowledgments
I am deeply indebted to the editorial contributions of Shan Carter and Chris Olah, without which this article would be greatly impoverished. Shan Carter
provided complete redesigns of many of my original interactive widgets, a visual coherence for all the figures, and valuable optimizations to the page’s
performance. Chris Olah provided impeccable editorial feedback at all levels of detail and abstraction - from the structure of the content, to the alignment
of equations.
I am also grateful to Michael Nielsen for providing the title of this article, which really tied the article together. Marcos Ginestra provided editorial input for
the earliest drafts of this article, and spiritual encouragement when I needed it the most. And my gratitude extends to my reviewers, Matt Hoffman and
Anonymous Reviewer B for their astute observations and criticism. I would like to thank Reviewer B, in particular, for pointing out two non-trivial errors in
the original manuscript (discussion here). The contour plotting library for the hero visualization is the joint work of Ben Frederickson, Jeff Heer and Mike
Bostock.
Many thanks to the numerous pull requests and issues filed on github. Thanks in particular, to Osemwaro Pedro for spotting an off by one error in one of
the equations. And also to Dan Schmidt who did an editing pass over the whole project, correcting numerous typographical and grammatical errors.
Discussion and Review
Reviewer A - Matt Hoffman
Reviewer B - Anonymous
Discussion with User derifatives
Footnotes
1. It is possible, however, to construct very specific counterexamples where momentum does not converge, even on convex functions. See [4] for a
counterexample.
2. In Tikhonov Regression we add a quadratic penalty to the regression, minimizing

$$\operatorname*{minimize}\ \tfrac{1}{2}\|Zw - d\|^2 + \tfrac{\eta}{2}\|w\|^2 = \tfrac{1}{2}w^T\left(Z^TZ + \eta I\right)w - (Z^Td)^Tw$$

Recall that Z^TZ = Q diag(λ_1, …, λ_n) Q^T. The solution to Tikhonov Regression is therefore

$$\left(Z^TZ + \eta I\right)^{-1}\left(Z^Td\right) = Q\,\mathrm{diag}\left(\frac{1}{\lambda_1 + \eta}, \cdots, \frac{1}{\lambda_n + \eta}\right)Q^T\left(Z^Td\right)$$

We can think of regularization as a function which decays the largest eigenvalues, as follows:

$$\text{Tikhonov Regularized}\ \lambda_i = \frac{1}{\lambda_i + \eta} = \frac{1}{\lambda_i}\left(1 - \left(1 + \lambda_i/\eta\right)^{-1}\right).$$

Gradient descent can be seen as employing a similar decay, but with the decay rate

$$\text{Gradient Descent Regularized}\ \lambda_i = \frac{1}{\lambda_i}\left(1 - \left(1 - \alpha\lambda_i\right)^k\right)$$

instead. Note that this decay is dependent on the step-size.
3. This is true as we can write the updates in matrix form as

$$\begin{pmatrix} 1 & 0 \\ \alpha & 1 \end{pmatrix}\begin{pmatrix} y_i^{k+1} \\ x_i^{k+1} \end{pmatrix} = \begin{pmatrix} \beta & \lambda_i \\ 0 & 1 \end{pmatrix}\begin{pmatrix} y_i^k \\ x_i^k \end{pmatrix}$$

which implies, by inverting the matrix on the left,

$$\begin{pmatrix} y_i^{k+1} \\ x_i^{k+1} \end{pmatrix} = \begin{pmatrix} \beta & \lambda_i \\ -\alpha\beta & 1 - \alpha\lambda_i \end{pmatrix}\begin{pmatrix} y_i^k \\ x_i^k \end{pmatrix} = R^{k+1}\begin{pmatrix} y_i^0 \\ x_i^0 \end{pmatrix}$$
4. We can write out the convergence rates explicitly. The eigenvalues are

$$\sigma_1 = \frac{1}{2}\left(1 - \alpha\lambda + \beta + \sqrt{(1 - \alpha\lambda + \beta)^2 - 4\beta}\right)$$
$$\sigma_2 = \frac{1}{2}\left(1 - \alpha\lambda + \beta - \sqrt{(1 - \alpha\lambda + \beta)^2 - 4\beta}\right)$$

When (1 − αλ + β)² − 4β is less than zero, the roots are complex and the convergence rate is

$$|\sigma_1| = |\sigma_2| = \frac{1}{2}\sqrt{(1 - \alpha\lambda + \beta)^2 + \left|(1 - \alpha\lambda + \beta)^2 - 4\beta\right|} = \sqrt{\beta},$$

which is, surprisingly, independent of the step-size and the eigenvalue. When the roots are real, the convergence rate is

$$\max\{|\sigma_1|, |\sigma_2|\} = \frac{1}{2}\max\left\{\left|1 - \alpha\lambda + \beta \pm \sqrt{(1 - \alpha\lambda + \beta)^2 - 4\beta}\right|\right\}$$
5. This can be derived by reducing the inequalities for all 4 + 1 cases in the explicit form of the convergence rate above.
6. We must optimize over

$$\min_{\alpha,\beta}\ \max\left\{\left\|\begin{pmatrix} \beta & \lambda_1 \\ -\alpha\beta & 1 - \alpha\lambda_1 \end{pmatrix}\right\|,\ \left\|\begin{pmatrix} \beta & \lambda_n \\ -\alpha\beta & 1 - \alpha\lambda_n \end{pmatrix}\right\|\right\}$$

(∥ ⋅ ∥ here denotes the magnitude of the maximum eigenvalue). The minimum occurs when the roots of the characteristic polynomial are repeated for the matrices corresponding to the extremal eigenvalues.
7. The above optimization problem is bounded from below by 0, and the vector of all 1's achieves this bound.
8. This can be written explicitly as

$$[L_G]_{ij} = \begin{cases} \text{degree of vertex } i & i = j \\ -1 & i \neq j,\ (i,j) \text{ or } (j,i) \in E \\ 0 & \text{otherwise} \end{cases}$$

9. We use the infinity norm to measure our error; similar results can be derived for the 1 and 2 norms.
10. The momentum iterations are

$$z^{k+1} = \beta z^k + (Aw^k - b) + \text{error}(w^k)$$
$$w^{k+1} = w^k - \alpha z^{k+1},$$

which, after a change of variables, become

$$\begin{pmatrix} 1 & 0 \\ \alpha & 1 \end{pmatrix}\begin{pmatrix} y_i^{k+1} \\ x_i^{k+1} \end{pmatrix} = \begin{pmatrix} \beta & \lambda_i \\ 0 & 1 \end{pmatrix}\begin{pmatrix} y_i^k \\ x_i^k \end{pmatrix} + \begin{pmatrix} \epsilon_i^k \\ 0 \end{pmatrix}$$

Inverting the 2 × 2 matrix on the left, and applying the formula recursively, yields the final solution.
11. On the 1D function f(x) = (λ/2)x², the objective value is

$$Ef(x^k) = \frac{\lambda}{2}E\left[(x^k)^2\right]$$

$$= \frac{\lambda}{2}E\left[\left(e_2^TR^k\begin{pmatrix} y^0 \\ x^0 \end{pmatrix} + e_2^T\sum_{i=1}^k\epsilon^iR^{k-i}\begin{pmatrix} 1 \\ -\alpha \end{pmatrix}\right)^2\right]$$

$$= \frac{\lambda}{2}\left(e_2^TR^k\begin{pmatrix} y^0 \\ x^0 \end{pmatrix}\right)^2 + \frac{\lambda}{2}E\left[\left(\sum_{i=1}^k\epsilon^ie_2^TR^{k-i}\begin{pmatrix} 1 \\ -\alpha \end{pmatrix}\right)^2\right]$$

$$= \frac{\lambda}{2}\left(e_2^TR^k\begin{pmatrix} y^0 \\ x^0 \end{pmatrix}\right)^2 + \frac{\lambda}{2}E\left[(\epsilon^k)^2\right]\sum_{i=1}^k\gamma_i^2, \qquad \gamma_i = e_2^TR^{k-i}\begin{pmatrix} 1 \\ -\alpha \end{pmatrix}$$

The third equality uses the fact that Eϵ^k = 0 (so the cross term vanishes), and the fourth uses the fact that the errors are uncorrelated.
References
1. On the importance of initialization and momentum in deep learning. [PDF]
Sutskever, I., Martens, J., Dahl, G.E. and Hinton, G.E., 2013. ICML (3), Vol 28, pp. 1139—1147.
2. Some methods of speeding up the convergence of iteration methods [PDF]
Polyak, B.T., 1964. USSR Computational Mathematics and Mathematical Physics, Vol 4(5), pp. 1—17. Elsevier. DOI: 10.1016/0041-5553(64)90137-5
3. Theory of gradient methods
Rutishauser, H., 1959. Refined iterative methods for computation of the solution and the eigenvalues of self-adjoint boundary value problems, pp. 24—
49. Springer. DOI: 10.1007/978-3-0348-7224-9_2
4. Analysis and design of optimization algorithms via integral quadratic constraints [PDF]
Lessard, L., Recht, B. and Packard, A., 2016. SIAM Journal on Optimization, Vol 26(1), pp. 57—95. SIAM.
5. Introductory lectures on convex optimization: A basic course
Nesterov, Y., 2013. Vol 87. Springer Science & Business Media. DOI: 10.1007/978-1-4419-8853-9
6. Natural gradient works efficiently in learning [link]
Amari, S., 1998. Neural computation, Vol 10(2), pp. 251—276. MIT Press. DOI: 10.1162/089976698300017746
7. Deep Learning, NIPS′2015 Tutorial [PDF]
Hinton, G., Bengio, Y. and LeCun, Y., 2015.
8. Adaptive restart for accelerated gradient schemes [PDF]
O’Donoghue, B. and Candes, E., 2015. Foundations of computational mathematics, Vol 15(3), pp. 715—732. Springer. DOI: 10.1007/s10208-013-9150-3
9. The Nth Power of a 2x2 Matrix. [PDF]
Williams, K., 1992. Mathematics Magazine, Vol 65(5), pp. 336. MAA. DOI: 10.2307/2691246
10. From Averaging to Acceleration, There is Only a Step-size. [PDF]
Flammarion, N. and Bach, F.R., 2015. COLT, pp. 658—695.
11. On the momentum term in gradient descent learning algorithms [PDF]
Qian, N., 1999. Neural networks, Vol 12(1), pp. 145—151. Elsevier. DOI: 10.1016/s0893-6080(98)00116-6
12. Understanding deep learning requires rethinking generalization [PDF]
Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O., 2016. arXiv preprint arXiv:1611.03530.
13. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights [PDF]
Su, W., Boyd, S. and Candes, E., 2014. Advances in Neural Information Processing Systems, pp. 2510—2518.
14. The Zen of Gradient Descent [HTML]
Hardt, M., 2013.
15. A geometric alternative to Nesterov’s accelerated gradient descent [PDF]
Bubeck, S., Lee, Y.T. and Singh, M., 2015. arXiv preprint arXiv:1506.08187.
16. An optimal first order method based on optimal quadratic averaging [PDF]
Drusvyatskiy, D., Fazel, M. and Roy, S., 2016. arXiv preprint arXiv:1604.06543.
17. Linear coupling: An ultimate unification of gradient and mirror descent [PDF]
Allen-Zhu, Z. and Orecchia, L., 2014. arXiv preprint arXiv:1407.1537.
18. Accelerating the cubic regularization of Newton’s method on convex problems [PDF]
Nesterov, Y., 2008. Mathematical Programming, Vol 112(1), pp. 159—181. Springer. DOI: 10.1007/s10107-006-0089-x
Updates and Corrections
View all changes to this article since it was first published. If you see a mistake or want to suggest a change, please create an issue on GitHub.
Citations and Reuse
Diagrams and text are licensed under Creative Commons Attribution CC-BY 2.0, unless noted otherwise, with the source available on GitHub. The figures
that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.
For attribution in academic contexts, please cite this work as
Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006

BibTeX citation
@article{goh2017why,
author = {Goh, Gabriel},
title = {Why Momentum Really Works},
journal = {Distill},
year = {2017},
url = {http://distill.pub/2017/momentum},
doi = {10.23915/distill.00006}
}
