
ML807: Distributed and Federated Learning

SGD in ML, its many Variants and Convergence Analysis

Instructors: Samuel Horvath, Eduard Gorbunov, Praneeth Vepakomma


Semester: Spring 2024

Machine Learning, Mohamed bin Zayed University of Artificial Intelligence


Stochastic Gradient Descent (SGD)

• Stochastic gradient descent (SGD) is at the core of almost all state-of-the-art machine learning.
• In the last lectures, we discussed:
  • basic concepts such as convexity, smoothness, gradients, the Hessian, and GD
  • SGD in linear regression, optimizing the empirical risk function
• In today's lecture and the next lectures, we will discuss:
  • The backpropagation algorithm (the special case of the chain rule for NNs)
  • Variants of SGD: variance reduction, momentum, random reshuffling, asynchronous and adaptive methods
  • Their convergence analysis
Outline

1. SGD in Neural Network training

2. Gradient Descent

3. SGD-like Methods

SGD in Neural Network training
Multi-layer Neural Network

• $w_{ij}$: weight connecting node $i$ in layer $\ell-1$ to node $j$ in layer $\ell$.
• $b_j$, $b_k$: biases for nodes $j$ and $k$, respectively.
• $u_j$, $u_k$: inputs to nodes $j$ and $k$ (where $u_j = b_j + \sum_i x_i w_{ij}$).
• $g_j(\cdot)$, $g_k(\cdot)$: activation functions for node $j$ (applied to $u_j$) and node $k$.
• $y_j = g_j(u_j)$, $z_k = g_k(u_k)$: outputs/activations of nodes $j$ and $k$.
• $t_k$: target value for node $k$ in the output layer.
Activation Function: Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

• Squashing-type non-linearity: pushes outputs into the range $(0, 1)$.
• Problem: near-constant value across most of the domain, strongly sensitive only when $z$ is close to zero.
• Saturation makes gradient-based learning difficult.
Activation Function: ReLU

• Approximates the softplus function $\log(1 + e^z)$.
• The ReLU activation function is $g(z) = \max\{0, z\}$.
• Similar to linear units. Easy to optimize!
• Gives large and consistent gradients when active.
• Problem: the gradient is zero for negative inputs.
• Solution: Leaky ReLU $g(z) = \max\{a z, z\}$, where $0 < a < 1$ is a small constant.
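As a quick numerical illustration (a minimal NumPy sketch of our own, not from the slides; the choice $a = 0.01$ is arbitrary), the activations and their derivatives can be written as:

```python
import numpy as np

def sigmoid(z):
    # squashes to (0, 1); saturates for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(z.dtype)

def leaky_relu(z, a=0.01):
    # the small slope a keeps a nonzero gradient for z < 0
    return np.maximum(a * z, z)

z = np.linspace(-5, 5, 11)
print(sigmoid_grad(z))   # near zero at both ends: saturation
print(relu_grad(z))      # constant 1 when active, 0 otherwise
```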
How do you perform inference using a trained neural network?

Expressing the outputs $z$ in terms of the inputs $x$ is called forward-propagation. Express:

• the inputs to the hidden layer in terms of $x$: $u_j = \sum_i w_{ij} x_i + b_j$,
• the outputs of the hidden layer in terms of $x$: $y_j = g_j\big(\sum_i w_{ij} x_i + b_j\big)$,
• the inputs to the final layer in terms of $x$: $u_k = \sum_j w_{jk} y_j + b_k$,
• the outputs of the final layer in terms of $x$: $z_k = g_k\big(\sum_j w_{jk} y_j + b_k\big)$.
Learning Parameters

How do we learn the parameters?

Choose the right loss function:

• Regression: least-squares loss
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \big(F(x_i, \theta) - y_i\big)^2$$
• Classification: cross-entropy loss
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n -\log F_{y_i}(x_i, \theta)$$

This is a hard optimization problem because $F(x, \theta)$ (the output of the neural network) is a complicated function of $x$. Typically, we use SGD with many optimization tricks.
SGD for Neural Networks

• Randomly pick a data point $(x_i, y_i)$.
• Compute the gradient using only this data point, for example,
$$g_i = \frac{\partial \big(F(x_i, \theta) - y_i\big)^2}{\partial \theta}.$$
• Update the parameters: $\theta \leftarrow \theta - \eta g_i$.
• Iterate the process until some (pre-specified) stopping criterion is met.
Updating the parameter values

Back-propagate the error. Given parameters $\theta = \{w, b\}$:

1. Forward-propagate to find $z_k$ in terms of the input (the "feed-forward signals").
2. Calculate the output error $E$ by comparing the predicted output $z_k$ to its true value $t_k$.
3. Back-propagate $E$ by weighting it by the gradients of the associated activation functions and the weights in previous layers.
4. Calculate the gradients $\frac{\partial E}{\partial w}$ and $\frac{\partial E}{\partial b}$ for the parameters $w, b$ at each layer, based on the backpropagated error signal and the feedforward signals from the inputs.
5. Update the parameters using the calculated gradients: $w \leftarrow w - \eta \frac{\partial E}{\partial w}$, $b \leftarrow b - \eta \frac{\partial E}{\partial b}$, where $\eta$ is the step size.
Illustrative example

Recall the notation from the multi-layer network above: weights $w_{ij}$, $w_{jk}$, biases $b_j$, $b_k$, node inputs $u_j$, $u_k$, activations $y_j = g_j(u_j)$, $z_k = g_k(u_k)$, and targets $t_k$.
Illustrative example (steps 1 and 2)

• Step 1: Forward-propagate for each output $z_k$:
$$z_k = g_k(u_k) = g_k\Big(b_k + \sum_j y_j w_{jk}\Big) = g_k\Big(b_k + \sum_j g_j\Big(b_j + \sum_i x_i w_{ij}\Big) w_{jk}\Big)$$
• Step 2: Find the error. Let's assume that the error function is half the sum of the squared differences between the target values $t_k$ and the network outputs $z_k$: $E = \frac{1}{2}\sum_k (z_k - t_k)^2$.
Illustrative example (step 3, output layer)

Step 3: Backpropagate the error. Let's start at the output layer with weight $w_{jk}$, recalling that $E = \frac{1}{2}\sum_k (z_k - t_k)^2$ and $u_k = b_k + \sum_j y_j w_{jk}$:
$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial u_k}\frac{\partial u_k}{\partial w_{jk}} = (z_k - t_k)\, g'_k(u_k)\, y_j = \delta_k y_j,$$
where $\delta_k = (z_k - t_k)\, g'_k(u_k)$ is called the error in $u_k$.
Illustrative example (step 3, hidden layer)

Step 3 (cont'd): Now let's consider $w_{ij}$ in the hidden layer, recalling $u_j = b_j + \sum_i x_i w_{ij}$, $u_k = b_k + \sum_j y_j w_{jk}$, $z_k = g_k(u_k)$:
$$\frac{\partial E}{\partial w_{ij}} = \sum_k \frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial u_k}\frac{\partial u_k}{\partial y_j}\frac{\partial y_j}{\partial u_j}\frac{\partial u_j}{\partial w_{ij}} = \Big(\sum_k \delta_k w_{jk}\Big)\, g'_j(u_j)\, x_i = \delta_j x_i,$$
where we substituted $\delta_j = g'_j(u_j)\sum_k \delta_k w_{jk}$, the error in $u_j$.
Illustrative example (steps 3 and 4)

Step 3 (cont'd): We similarly find that $\frac{\partial E}{\partial b_k} = \delta_k$ and $\frac{\partial E}{\partial b_j} = \delta_j$ (Exercise).

Step 4: Calculate the gradients. We have found that
$$\frac{\partial E}{\partial w_{ij}} = \delta_j x_i, \quad \text{and} \quad \frac{\partial E}{\partial w_{jk}} = \delta_k y_j,$$
where $\delta_j = g'_j(u_j)\sum_k \delta_k w_{jk}$ and $\delta_k = (z_k - t_k)\, g'_k(u_k)$.

Now, since we know $z_k$, $y_j$, $x_i$, $u_k$ and $u_j$ for a given set of parameter values $w, b$, we can use these expressions to calculate the gradients at each iteration and update the parameters.
Illustrative example (steps 4 and 5)

Step 4: Calculate the gradients. We have found that
$$\frac{\partial E}{\partial w_{ij}} = \delta_j x_i, \quad \text{and} \quad \frac{\partial E}{\partial w_{jk}} = \delta_k y_j,$$
where $\delta_j = g'_j(u_j)\sum_k \delta_k w_{jk}$ and $\delta_k = (z_k - t_k)\, g'_k(u_k)$.

Step 5: Update the weights and biases with learning rate $\eta$. For example,
$$w_{jk} \leftarrow w_{jk} - \eta \frac{\partial E}{\partial w_{jk}}, \quad \text{and} \quad w_{ij} \leftarrow w_{ij} - \eta \frac{\partial E}{\partial w_{ij}}.$$
High-level Procedure: Can be Used with More Hidden Layers

Final layer
• The error in each of its outputs is $z_k - t_k$.
• The error in the input $u_k$ to the final layer is $\delta_k = (z_k - t_k)\, g'_k(u_k)$.

Hidden layer
• The error in the output $y_j$ is $\sum_k \delta_k w_{jk}$.
• The error in the input $u_j$ of the hidden layer is $\delta_j = g'_j(u_j)\sum_k \delta_k w_{jk}$.

The gradient w.r.t. $w_{ij}$ is $x_i \delta_j$.
Vectorized Implementation

Much faster than implementing a loop over all neurons in each layer.

Forward-Propagation
• Represent the weights between layers $l-1$ and $l$ as a matrix $W^l$.
• The outputs of layer $l-1$ are in a row vector $y^{l-1}$. Then we have $u^l = y^{l-1} W^l$.
• The outputs of layer $l$ are in the row vector $y^l = g_l(u^l)$.
Vectorized Implementation

Back-Propagation
• For each layer $l$, find $\Delta^l$, the vector of errors in $u^l$, in terms of the final error.
• Compute the gradient w.r.t. the parameters of the current layer $l$.
• Recursively find $\Delta^{l-1}$ in terms of $\Delta^l$.

Exercise: How would you implement this? (One possible sketch follows below.)
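One possible NumPy answer to the exercise (a minimal sketch of our own, assuming a single hidden layer, sigmoid activations, and the squared error from the example; dimensions, data, and the step size $\eta = 0.5$ are arbitrary illustrative choices). It follows the slides' row-vector convention $u^l = y^{l-1} W^l + b^l$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one hidden layer: x (1 x d) -> y1 (1 x h) -> z (1 x m)
d, h, m = 4, 8, 3
W1, b1 = 0.1 * rng.standard_normal((d, h)), np.zeros((1, h))
W2, b2 = 0.1 * rng.standard_normal((h, m)), np.zeros((1, m))

x = rng.standard_normal((1, d))
t = rng.standard_normal((1, m))          # target values t_k

# forward pass: u^l = y^{l-1} W^l + b^l, y^l = g(u^l)
u1 = x @ W1 + b1; y1 = sigmoid(u1)
u2 = y1 @ W2 + b2; z = sigmoid(u2)

# backward pass for E = 0.5 * ||z - t||^2; Delta^l = errors in u^l
delta2 = (z - t) * z * (1 - z)           # delta_k = (z_k - t_k) g'(u_k)
dW2, db2 = y1.T @ delta2, delta2
delta1 = (delta2 @ W2.T) * y1 * (1 - y1) # recursion: Delta^{l-1} from Delta^l
dW1, db1 = x.T @ delta1, delta1

eta = 0.5
for P, G in [(W1, dW1), (b1, db1), (W2, dW2), (b2, db2)]:
    P -= eta * G                         # gradient step on every parameter
```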
Mini-batch SGD

• Recall the empirical risk loss function that we considered for the back-propagation discussion:
$$E = \frac{1}{2n}\sum_{i=1}^n \big(F(x_i, \theta) - y_i\big)^2$$
• For large training datasets (large $n$), computing gradients w.r.t. every data point is expensive. For example,
$$\frac{\partial E}{\partial w_{jk}} = \frac{1}{n}\sum_{i=1}^n \delta_k^{(i)} y_j^{(i)}$$
for back-propagation.
• Therefore we use stochastic gradient descent (SGD), where we choose a random data point $x_i$ and use $\tilde{E} = \frac{1}{2}\big(F(x_i, \theta) - y_i\big)^2$ instead of the entire sum.
• Mini-batch SGD is between these two extremes.
• In each iteration $t$, we choose a set $S_t$ of $b$ samples uniformly at random from the $n$ training samples and use
$$\tilde{L}(w_t) = \sum_{i \in S_t} \big(w_t^\top x_i - y_i\big)^2$$
for back-propagation.
• Small $b$ saves per-iteration computing cost, but increases the noise in the gradients and yields worse error convergence.
• Large $b$ reduces gradient noise and gives better error convergence, but increases the computing cost per iteration.
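A minimal mini-batch SGD sketch (ours, on a synthetic least-squares problem; the batch size $b = 32$, step size, and iteration count are arbitrary; we average the batch gradient by $1/b$ rather than summing, which only rescales the step size):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.01 * rng.standard_normal(n)

w = np.zeros(d)
eta, b = 0.1, 32                               # step size and batch size
for t in range(500):
    S = rng.choice(n, size=b, replace=False)   # uniform mini-batch S_t
    resid = X[S] @ w - y[S]
    g = X[S].T @ resid / b                     # averaged mini-batch gradient
    w -= eta * g

print(np.linalg.norm(w - w_true))              # small, up to the noise floor
```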
Questions?

Gradient Descent

Objective: find $x^\star \in \arg\min_{x \in \mathbb{R}^d} f(x)$

Assumptions
• $f$ is $L$-smooth, i.e.,
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d,$$
• $f$ is $\mu$-convex (i.e., $\mu$-strongly convex), i.e.,
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

Algorithm: $x_{t+1} = x_t - \eta \nabla f(x_t)$ for a small learning rate $\eta > 0$.

Intuition: A person standing on a hill covered in fog who wants to find their way down the hill. They can feel the slope with their feet and move along the direction of steepest descent.
Gradient Descent as MM Algorithm

Majorization-Minimization (MM) Algorithm
• Minimize an upper bound $g(y) \ge f(y)$ such that $g(x) = f(x)$ at the current iterate $x$.
• $\implies$ guaranteed decrease.

Gradient Descent as MM
• $f$ is smooth, i.e.,
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d.$$
• Let $g(y) \overset{\text{def}}{=} f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$.
• Then, $x - \frac{1}{L}\nabla f(x) = \arg\min_{y \in \mathbb{R}^d} g(y)$.
• This is GD with $\eta = \frac{1}{L}$.
• Therefore, $f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2$ for $x_{t+1} = x_t - \frac{1}{L}\nabla f(x_t)$.
Gradient for µ-convex functions

By strong convexity, we have
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|x - y\|^2.$$
Equivalently,
$$f(y) \ge f(x) + \langle \nabla f(y), y - x \rangle - \langle \nabla f(x) - \nabla f(y), y - x \rangle + \frac{\mu}{2}\|y - x\|^2.$$
Since $2\langle a, b \rangle + \mu\|b\|^2 \ge -\frac{1}{\mu}\|a\|^2$,
$$f(y) \ge f(x) + \langle \nabla f(y), y - x \rangle - \frac{1}{2\mu}\|\nabla f(x) - \nabla f(y)\|^2.$$
Let $y = x^\star$ (so that $\nabla f(x^\star) = 0$) and $f^\star = f(x^\star)$; then
$$f^\star \ge f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2 \implies \|\nabla f(x)\|^2 \ge 2\mu\big(f(x) - f^\star\big).$$

If we can make the gradient small, we converge to an optimal solution.
Gradient Descent Convergence

We have
$$\|\nabla f(x)\|^2 \ge 2\mu\big(f(x) - f^\star\big), \quad \forall x \in \mathbb{R}^d.$$
Combining this with $f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2$, we obtain
$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2 \le f(x_t) - \frac{\mu}{L}\big(f(x_t) - f^\star\big).$$
Subtracting $f^\star$ from both sides leads to
$$f(x_{t+1}) - f^\star \le \Big(1 - \frac{\mu}{L}\Big)\big(f(x_t) - f^\star\big).$$
Let $\kappa = \frac{L}{\mu}$ be the condition number; then applying the above recursively yields
$$f(x_T) - f^\star \le \Big(1 - \frac{1}{\kappa}\Big)^T \big(f(x_0) - f^\star\big).$$
GD: Theorem

Convergence of GD
Let $f$ be $L$-smooth and $\mu$-convex; then GD with $\eta = \frac{1}{L}$ converges as follows:
$$f(x_T) - f^\star \le \Big(1 - \frac{1}{\kappa}\Big)^T \big(f(x_0) - f^\star\big).$$

• Exercise: Show that a similar recursion holds for any $\eta \le \frac{1}{L}$.
• Question: How many iterations do we need to achieve $f(x_T) - f^\star \le \varepsilon$?
• Answer: $T = O\big(\kappa \log \frac{1}{\varepsilon}\big)$.
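A quick numerical check of the theorem (a sketch of our own on a synthetic strongly convex quadratic; the dimension and spectrum are arbitrary, chosen so that $\mu = 1$, $L = 10$, $\kappa = 10$):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
# quadratic f(x) = 0.5 x^T A x with eigenvalues in [1, 10], so f* = 0 at x* = 0
evals = np.linspace(1.0, 10.0, d)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(evals) @ Q.T
mu, L = evals[0], evals[-1]

f = lambda x: 0.5 * x @ A @ x
x = rng.standard_normal(d)
f0, eta, T = f(x), 1.0 / L, 100
for _ in range(T):
    x -= eta * (A @ x)                 # gradient step with eta = 1/L

print(f(x) <= (1 - mu / L) ** T * f0)  # True: matches the (1 - 1/kappa)^T rate
```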
Complexity Bound

We want to find $T$ for which
$$\Big(1 - \frac{1}{\kappa}\Big)^T \big(f(x_0) - f^\star\big) \le \varepsilon.$$
This implies that
$$T \log\Big(1 - \frac{1}{\kappa}\Big) \le \log\Big(\frac{\varepsilon}{f(x_0) - f^\star}\Big) \iff T \ge \log\Big(\frac{f(x_0) - f^\star}{\varepsilon}\Big) \Big/ \log\Big(1 + \frac{1}{\kappa - 1}\Big).$$
Since $\log\big(1 + \frac{1}{\kappa - 1}\big) \ge \frac{1}{\kappa}$ (using $\log(1 + x) \ge \frac{x}{1 + x}$),
$$T \ge \kappa \log\Big(\frac{f(x_0) - f^\star}{\varepsilon}\Big) \implies f(x_T) - f^\star \le \varepsilon.$$
Exercise: Complete the proof.
GD: Convergence in Iterates

Convergence of GD iterates
Let $f$ be $L$-smooth and $\mu$-convex; then GD with $\eta = \frac{1}{L}$ converges as follows:
$$\|x_T - x^\star\|^2 \le \Big(1 - \frac{1}{\kappa}\Big)^T \|x_0 - x^\star\|^2.$$

Easy proof of a weaker result: reuse the previous result. By $\mu$-convexity,
$$f(x) \ge f^\star + \frac{\mu}{2}\|x - x^\star\|^2.$$
Therefore (using $L$-smoothness at $x_0$ in the last step),
$$\|x_T - x^\star\|^2 \le \frac{2}{\mu}\big(f(x_T) - f^\star\big) \le \Big(1 - \frac{1}{\kappa}\Big)^T \frac{2}{\mu}\big(f(x_0) - f^\star\big) \le \Big(1 - \frac{1}{\kappa}\Big)^T \kappa\,\|x_0 - x^\star\|^2.$$
This is not optimal; we lost a factor of $\kappa$.
GD: Proof

Let us use the recursive update directly:
$$\|x_{t+1} - x^\star\|^2 = \|x_t - \eta\nabla f(x_t) - x^\star\|^2 = \|x_t - x^\star\|^2 - 2\eta\langle \nabla f(x_t), x_t - x^\star \rangle + \eta^2\|\nabla f(x_t)\|^2$$
$$= \|x_t - x^\star\|^2 - 2\eta\langle \nabla f(x_t) - \nabla f(x^\star), x_t - x^\star \rangle + \eta^2\|\nabla f(x_t)\|^2.$$

How to show decrease? Use the middle term: split it in half, bounding one half by $\mu$-convexity and the other by cocoercivity, $\langle \nabla f(x_t) - \nabla f(x^\star), x_t - x^\star \rangle \ge \frac{1}{L}\|\nabla f(x_t) - \nabla f(x^\star)\|^2$:
$$\|x_{t+1} - x^\star\|^2 \le \|x_t - x^\star\|^2 - \eta\mu\|x_t - x^\star\|^2 - \eta\langle \nabla f(x_t) - \nabla f(x^\star), x_t - x^\star \rangle + \eta^2\|\nabla f(x_t)\|^2$$
$$\le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\Big(\frac{1}{L} - \eta\Big)\|\nabla f(x_t)\|^2.$$
Setting $\eta = \frac{1}{L}$ concludes the proof.
SGD-like Methods

Stochastic Gradient Descent

Objective: find $x^\star \in \arg\min_{x \in \mathbb{R}^d} \Big[ f(x) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n f_i(x) \Big]$

Assumptions
• Each $f_i$ is $L$-smooth, i.e.,
$$f_i(y) \le f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{L}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d,$$
• Each $f_i$ is $\mu$-convex, i.e.,
$$f_i(y) \ge f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

Algorithm: $x_{t+1} = x_t - \eta g_t$, where $g_t = \nabla f_{i_t}(x_t)$ and $i_t$ is selected uniformly at random from $[n] = \{1, 2, \ldots, n\}$.
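A minimal sketch of this update (ours, on a synthetic least-squares problem where $f_i(x) = \frac{1}{2}(a_i^\top x - y_i)^2$; the step size and iteration count are arbitrary). With a constant step size, the iterates settle in a neighbourhood of $x^\star$, as the theory below predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
# grad f_i(x) = a_i (a_i^T x - y_i)
grad = lambda i, x: X[i] * (X[i] @ x - y[i])

x = np.zeros(d)
eta = 0.01                               # constant step: converges to a
for t in range(20000):                   # neighbourhood of size O(eta sigma_*^2 / mu)
    i = rng.integers(n)                  # i_t uniform on {0, ..., n-1}
    x -= eta * grad(i, x)

x_star = np.linalg.lstsq(X, y, rcond=None)[0]   # exact minimizer of f
print(np.linalg.norm(x - x_star))        # small but not zero: the SGD floor
```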
SGD: One Step Analysis

Let $E_t[\cdot]$ denote $E[\cdot \mid x_t]$; then
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] = E_t\big[\|x_t - \eta g_t - x^\star\|^2\big] = \|x_t - x^\star\|^2 - 2\eta\langle E_t[g_t], x_t - x^\star \rangle + \eta^2 E_t\big[\|g_t\|^2\big].$$
Since the gradient is sampled uniformly at random, we have
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] = \|x_t - x^\star\|^2 - 2\eta\Big\langle \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_t), x_t - x^\star \Big\rangle + \eta^2\,\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t)\|^2$$
$$= \|x_t - x^\star\|^2 - \eta\langle \nabla f(x_t), x_t - x^\star \rangle - \eta\Big\langle \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_t), x_t - x^\star \Big\rangle + \eta^2\,\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t)\|^2.$$
The first two terms are exactly the same as for gradient descent $\implies$ we use $\mu$-convexity.
SGD: One Step Analysis

$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\Big\langle \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_t), x_t - x^\star \Big\rangle + \eta^2\,\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t)\|^2$$

What about the second term?

For GD, we used $\langle \nabla f(x) - \nabla f(x^\star), x - x^\star \rangle \ge \frac{1}{L}\|\nabla f(x) - \nabla f(x^\star)\|^2$; can we use it here? Yes:
$$\langle \nabla f_i(x_t) - \nabla f_i(x^\star), x_t - x^\star \rangle \ge \frac{1}{L}\|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2.$$
We have almost what we need. We need to further decompose $\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t)\|^2$, e.g., by Young's inequality ($\|a + b\|^2 \le (1 + \alpha)\|a\|^2 + (1 + \frac{1}{\alpha})\|b\|^2$, here with $\alpha = 1$):
$$\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t)\|^2 \le \frac{2}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2 + \frac{2}{n}\sum_{i=1}^n \|\nabla f_i(x^\star)\|^2.$$
SGD: One Step Analysis

Let us denote $\sigma_\star^2 = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star)\|^2$. Putting it all together, we obtain
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\Big(\frac{1}{2L} - \eta\Big)\frac{2}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2 + 2\eta^2\sigma_\star^2.$$
Let us take $\eta \le \frac{1}{2L}$; therefore
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 + 2\eta^2\sigma_\star^2.$$
SGD: Rate of Convergence

Can we guarantee a rate for $E\big[\|x_T - x^\star\|^2\big]$ using the obtained recursion?
$$E\big[\|x_T - x^\star\|^2\big] = E\Big[E_{T-1}\big[\|x_T - x^\star\|^2\big]\Big] \le (1 - \eta\mu)\,E\big[\|x_{T-1} - x^\star\|^2\big] + 2\eta^2\sigma_\star^2.$$
Recursively applying the above leads to
$$E\big[\|x_T - x^\star\|^2\big] \le (1 - \eta\mu)^T\|x_0 - x^\star\|^2 + 2\eta^2\sigma_\star^2 \sum_{t=0}^{T-1}(1 - \eta\mu)^t$$
$$\le (1 - \eta\mu)^T\|x_0 - x^\star\|^2 + 2\eta^2\sigma_\star^2 \sum_{t=0}^{\infty}(1 - \eta\mu)^t = (1 - \eta\mu)^T\|x_0 - x^\star\|^2 + \frac{2\eta\sigma_\star^2}{\mu}.$$
SGD: Theorem

Convergence of SGD
Let $\{f_i\}_{i=1}^n$ be $L$-smooth and $\mu$-convex; then SGD with $\eta \le \frac{1}{2L}$ converges as follows:
$$E\big[\|x_T - x^\star\|^2\big] \le (1 - \eta\mu)^T\|x_0 - x^\star\|^2 + \frac{2\eta\sigma_\star^2}{\mu},$$
where $\sigma_\star^2 = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star)\|^2$.

Questions
• How does this compare to GD? (Hint: Think about the smoothness constant $L$.)
• What is the rate of convergence in terms of $E\big[\|x_T - x^\star\|^2\big] \le \varepsilon$ if $\eta = \frac{1}{2L}$?
• Can we remove the neighbourhood?
SGD: Complexity

If we can make
$$(1 - \eta\mu)^T\|x_0 - x^\star\|^2 \le \frac{\varepsilon}{2} \quad \text{and} \quad \frac{2\eta\sigma_\star^2}{\mu} \le \frac{\varepsilon}{2},$$
then $E\big[\|x_T - x^\star\|^2\big] \le \varepsilon$. For the first term, recall GD:
$$T \ge \frac{1}{\eta\mu}\log\Big(\frac{2\|x_0 - x^\star\|^2}{\varepsilon}\Big) \implies (1 - \eta\mu)^T\|x_0 - x^\star\|^2 \le \frac{\varepsilon}{2}.$$
For the second term, $\eta \le \frac{\varepsilon\mu}{4\sigma_\star^2}$.

Putting it all together with $\eta \le \frac{1}{2L}$:
$$T \ge 4\max\Big\{\kappa, \frac{\sigma_\star^2}{\mu^2\varepsilon}\Big\}\log\Big(\frac{2\|x_0 - x^\star\|^2}{\varepsilon}\Big) \implies E\big[\|x_T - x^\star\|^2\big] \le \varepsilon.$$
SGD: Complexity

SGD Complexity
Let $\{f_i\}_{i=1}^n$ be $L$-smooth and $\mu$-convex; then SGD has the following worst-case complexity:
$$T = O\Big(\Big(\kappa + \frac{\sigma_\star^2}{\mu^2\varepsilon}\Big)\log\frac{1}{\varepsilon}\Big) \implies E\big[\|x_T - x^\star\|^2\big] \le \varepsilon.$$

Questions
• When and why is SGD a good method?
• Can we do better?
Problem with SGD

What is the main bottleneck of the SGD analysis? We had
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\Big\langle \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_t), x_t - x^\star \Big\rangle + \eta^2\,\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t)\|^2,$$
while we wanted to apply
$$\langle \nabla f_i(x_t) - \nabla f_i(x^\star), x_t - x^\star \rangle \ge \frac{1}{L}\|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2.$$
Can we adjust the update so that we can use the above directly, without using Young's inequality?
SGD-Star

Objective: find $x^\star \in \arg\min_{x \in \mathbb{R}^d} \Big[ f(x) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n f_i(x) \Big]$

Assumptions
• Each $f_i$ is $L$-smooth and $\mu$-convex (as before).

Algorithm: $x_{t+1} = x_t - \eta g_t$, where $g_t = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(x^\star)$ and $i_t$ is selected uniformly at random from $[n] = \{1, 2, \ldots, n\}$.
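SGD-Star is a thought experiment, since it needs the $\nabla f_{i}(x^\star)$'s; but on a synthetic problem we can compute $x^\star$ exactly and verify that the shifted update has no variance floor (a sketch of our own; data, step size, and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

x_star = np.linalg.lstsq(X, y, rcond=None)[0]   # "oracle" access to x*
grad = lambda i, x: X[i] * (X[i] @ x - y[i])    # grad f_i for least squares

x = np.zeros(d)
eta = 0.01
for t in range(20000):
    i = rng.integers(n)
    x -= eta * (grad(i, x) - grad(i, x_star))   # shifted gradient g_t

print(np.linalg.norm(x - x_star))               # -> 0: no noise floor
```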
SGD-Star: One Step Analysis

Let $E_t[\cdot]$ denote $E[\cdot \mid x_t]$; then
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] = E_t\big[\|x_t - \eta g_t - x^\star\|^2\big] = \|x_t - x^\star\|^2 - 2\eta\langle E_t[g_t], x_t - x^\star \rangle + \eta^2 E_t\big[\|g_t\|^2\big].$$
Since the gradient is sampled uniformly at random, we have
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] = \|x_t - x^\star\|^2 - 2\eta\Big\langle \frac{1}{n}\sum_{i=1}^n \big(\nabla f_i(x_t) - \nabla f_i(x^\star)\big), x_t - x^\star \Big\rangle + \eta^2\,\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2$$
$$= \|x_t - x^\star\|^2 - \eta\langle \nabla f(x_t), x_t - x^\star \rangle - \eta\Big\langle \frac{1}{n}\sum_{i=1}^n \big(\nabla f_i(x_t) - \nabla f_i(x^\star)\big), x_t - x^\star \Big\rangle + \eta^2\,\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2.$$
SGD-Star: One Step Analysis

We can apply the same analysis as for GD:
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\Big(\frac{1}{L} - \eta\Big)\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2.$$
Therefore, for $\eta \le \frac{1}{L}$, we have
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2.$$
SGD-Star: Theorem

Convergence of SGD-Star
Let $\{f_i\}_{i=1}^n$ be $L$-smooth and $\mu$-convex; then SGD-Star with $\eta \le \frac{1}{L}$ converges as follows:
$$E\big[\|x_T - x^\star\|^2\big] \le (1 - \eta\mu)^T\|x_0 - x^\star\|^2.$$

Questions
• How does this compare to GD/SGD?
• What is the rate of convergence in terms of $E\big[\|x_T - x^\star\|^2\big] \le \varepsilon$? $O\big(\kappa \log \frac{1}{\varepsilon}\big)$.
• What about the $\nabla f_i(x^\star)$'s?
• What is the variance of SGD-Star at the optimum?
Variance Reduction: Learning the Optimal Shift

Problem with SGD-Star
• The $\nabla f_i(x^\star)$'s are unknown!

Properties of the SGD-Star estimator $g_t$
• $E_t[g_t] = \nabla f(x_t)$.
• If $x_t = x^\star$, then $E_t[g_t] = \nabla f(x^\star) = 0$ and $V_t(g_t) = 0$.

Can we construct a practical $g_t$ with similar properties to the SGD-Star update?

Control Variate
• $g_t = \nabla f_{i_t}(x_t) + c_t$,
• $E_t[c_t] \to 0$ and $V_t(g_t) \to 0$ as $t \to \infty$.
Variance Reduction: Learning the Optimal Shift

How do we construct $c_t$?

Stochastic Variance Reduced Gradient (SVRG) [Johnson and Zhang, 2013]
$$g_t = \nabla f_{i_t}(x_t) - \big(\nabla f_{i_t}(y_t) - \nabla f(y_t)\big)$$

When is $g_t$ a control variate? How do we pick $y_t$?

Other approaches
• SAG [Schmidt et al., 2017], SAGA [Defazio et al., 2014], SARAH [Nguyen et al., 2017], SPIDER [Fang et al., 2018].
Loopless SVRG (L-SVRG)

Objective: find $x^\star \in \arg\min_{x \in \mathbb{R}^d} \Big[ f(x) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n f_i(x) \Big]$

Assumptions
• Each $f_i$ is $L$-smooth and $\mu$-convex (as before).

Algorithm: $x_{t+1} = x_t - \eta g_t$, where $g_t = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(y_t) + \nabla f(y_t)$, $i_t$ is selected uniformly at random from $[n]$, $y_0 = x_0$, and
$$y_{t+1} = \begin{cases} x_t & \text{w.p. } p, \\ y_t & \text{otherwise.} \end{cases}$$
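A minimal L-SVRG sketch (ours, on the same synthetic least-squares setup; step size and iteration count are arbitrary illustrative choices, with $p = 1/n$ as suggested by the theorem below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

grad = lambda i, x: X[i] * (X[i] @ x - y[i])    # grad f_i
full_grad = lambda x: X.T @ (X @ x - y) / n     # grad f

x = np.zeros(d)
y_ref = x.copy()                                # anchor point y_t
g_ref = full_grad(y_ref)                        # stored grad f(y_t)
eta, p = 0.005, 1.0 / n                         # p = 1/n: O(1) expected cost
for t in range(40000):
    i = rng.integers(n)
    g = grad(i, x) - grad(i, y_ref) + g_ref     # control-variate estimator g_t
    x = x - eta * g
    if rng.random() < p:                        # coin flip: refresh the anchor
        y_ref, g_ref = x.copy(), full_grad(x)

x_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(x - x_star))               # variance reduced: no noise floor
```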
L-SVRG: One Round Progress in $x_t$

Let $E_t[\cdot]$ denote $E[\cdot \mid x_t]$; then
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] = E_t\big[\|x_t - \eta g_t - x^\star\|^2\big] = \|x_t - x^\star\|^2 - 2\eta\langle E_t[g_t], x_t - x^\star \rangle + \eta^2 E_t\big[\|g_t\|^2\big].$$
Since $E_t[g_t] = \nabla f(x_t)$, we have
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] = \|x_t - x^\star\|^2 - 2\eta\langle \nabla f(x_t), x_t - x^\star \rangle + \eta^2 E_t\big[\|g_t\|^2\big].$$
By $\mu$-convexity,
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\langle \nabla f(x_t), x_t - x^\star \rangle + \eta^2 E_t\big[\|g_t\|^2\big].$$
What about $g_t$?
L-SVRG: Bound on $g_t$

By definition,
$$E_t\big[\|g_t\|^2\big] = E_t\big[\|\nabla f_{i_t}(x_t) - \nabla f_{i_t}(y_t) + \nabla f(y_t)\|^2\big].$$
Similarly to the SGD analysis, we want to obtain $\|\nabla f_{i_t}(x_t) - \nabla f_{i_t}(x^\star)\|^2$ so that we can apply smoothness. Therefore, we apply Young's inequality:
$$E_t\big[\|g_t\|^2\big] \le 2E_t\big[\|\nabla f_{i_t}(x_t) - \nabla f_{i_t}(x^\star)\|^2\big] + 2E_t\big[\|\nabla f_{i_t}(x^\star) - \nabla f_{i_t}(y_t) + \nabla f(y_t)\|^2\big]$$
$$= \frac{2}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2 + \frac{2}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f_i(y_t) + \nabla f(y_t)\|^2.$$
We denote $D_t = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f_i(y_t) + \nabla f(y_t)\|^2$, which is the error due to the imprecise shift.
L-SVRG: One Round Progress in $D_t$

Due to the probabilistic update of $y_{t+1}$, i.e.,
$$y_{t+1} = \begin{cases} x_t & \text{w.p. } p, \\ y_t & \text{otherwise,} \end{cases}$$
we have
$$E_t[D_{t+1}] = \frac{p}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f_i(x_t) + \nabla f(x_t)\|^2 + (1 - p)D_t.$$
Since $V(X) \le E\big[\|X\|^2\big]$ (applied here to $X = \nabla f_i(x_t) - \nabla f_i(x^\star)$ with $i$ uniform, whose mean is $\nabla f(x_t)$), we have
$$E_t[D_{t+1}] \le \frac{p}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f_i(x_t)\|^2 + (1 - p)D_t.$$
How can we combine these results?
L-SVRG: One Step Analysis

Summing the bounds on $g_t$ and $D_{t+1}$ with a weight $r_{t+1} \in \mathbb{R}_+$, we obtain
$$E_t\big[\|x_{t+1} - x^\star\|^2 + r_{t+1} D_{t+1}\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\langle \nabla f(x_t), x_t - x^\star \rangle$$
$$+ \eta^2\Big(\frac{2}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2 + 2D_t\Big) + \frac{p\,r_{t+1}}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f_i(x_t)\|^2 + (1 - p)\,r_{t+1} D_t.$$
Applying $L$-smoothness of the $f_i$'s (via cocoercivity on the inner-product term) leads to
$$E_t\big[\|x_{t+1} - x^\star\|^2 + r_{t+1} D_{t+1}\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 + \Big(\frac{2\eta^2}{r_{t+1}} + (1 - p)\Big) r_{t+1} D_t$$
$$- \eta\Big(\frac{1}{L} - 2\eta - \frac{p\,r_{t+1}}{\eta}\Big)\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2.$$
How should we select $p$, $\eta$ and the $r_t$'s?
L-SVRG: One Step Analysis

$$E_t\big[\|x_{t+1} - x^\star\|^2 + r_{t+1} D_{t+1}\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 + \Big(\frac{2\eta^2}{r_{t+1}} + (1 - p)\Big) r_{t+1} D_t$$
$$- \eta\Big(\frac{1}{L} - 2\eta - \frac{p\,r_{t+1}}{\eta}\Big)\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2.$$
If $\frac{1}{2L} \ge 2\eta$ and $\frac{1}{2L} \ge \frac{p\,r_{t+1}}{\eta}$, then
$$E_t\big[\|x_{t+1} - x^\star\|^2 + r_{t+1} D_{t+1}\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 + \Big(\frac{2\eta^2}{r_{t+1}} + (1 - p)\Big) r_{t+1} D_t.$$
If $\frac{p}{2} \ge \frac{2\eta^2}{r_{t+1}}$ and $r_{t+1} = r$ for all $t$, then
$$E_t\big[\|x_{t+1} - x^\star\|^2 + r D_{t+1}\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 + \Big(1 - \frac{p}{2}\Big) r D_t \le \max\Big\{1 - \eta\mu,\, 1 - \frac{p}{2}\Big\}\big(\|x_t - x^\star\|^2 + r D_t\big).$$
L-SVRG: Choice of Parameters

We require the following inequalities to hold:
$$\frac{1}{2L} \ge 2\eta, \qquad \frac{1}{2L} \ge \frac{pr}{\eta}, \qquad \frac{p}{2} \ge \frac{2\eta^2}{r}.$$
To obtain the best rate for $E\big[\|x_T - x^\star\|^2\big]$, we want $r$ to be as small as possible. Therefore, let us select $r = \frac{4\eta^2}{p}$.
Combining the second inequality with $r = \frac{4\eta^2}{p}$ leads to
$$\frac{1}{2L} \ge \frac{p \cdot 4\eta^2}{p\,\eta} = 4\eta.$$
Therefore, $\eta \le \frac{1}{8L}$, $r = \frac{4\eta^2}{p}$ and $p \in (0, 1]$ satisfy all three inequalities.
L-SVRG: Theorem

Convergence of L-SVRG
Let $\{f_i\}_{i=1}^n$ be $L$-smooth and $\mu$-convex; then L-SVRG with $\eta \le \frac{1}{8L}$ and $p \in (0, 1]$ converges as follows:
$$E\big[\|x_T - x^\star\|^2\big] \le \max\Big\{1 - \eta\mu,\, 1 - \frac{p}{2}\Big\}^T\Big(\|x_0 - x^\star\|^2 + \frac{1}{16L^2 p}\,D_0\Big),$$
where $D_0 = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f_i(x_0) + \nabla f(x_0)\|^2$.

Questions
• How does this compare to GD/SGD?
• What is the rate of convergence in terms of $E\big[\|x_T - x^\star\|^2\big] \le \varepsilon$? $O\big((\frac{1}{p} + \kappa)\log\frac{1}{p\varepsilon}\big)$
• What is the expected computational cost per iteration? $pn + (1 - p)$
• What are good choices of $p$? In practice $p = \frac{1}{n}$ is a popular choice, with total complexity $O\big((n + \kappa)\log\frac{n}{\varepsilon}\big)$.
Comparison

Figure 1: Stochastic vs. deterministic methods

Different Type of Variance Reduction

Why does SVRG-type variance reduction not work well for deep learning? This is an open problem, partially explained in [Defazio and Bottou, 2019].

Momentum SGD
$$g_t = \beta g_{t-1} + (1 - \beta)\nabla f_{i_t}(x_t), \qquad g_0 = 0.$$
Variance reduction via averaging:
$$g_t = (1 - \beta)\sum_{k=0}^{t} \beta^{t-k}\,\nabla f_{i_k}(x_k).$$
Question: What is the sum of the weights? $1 - \beta^{t+1}$.

Problem:
$$E_t[g_t] \ne \nabla f(x_t).$$
In the deterministic case, this is called Polyak's heavy ball method, which enjoys an accelerated rate. A minimal sketch follows below.
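A sketch of the momentum update (ours, on the same synthetic least-squares setup; $\eta = 0.01$ and $\beta = 0.9$ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
grad = lambda i, x: X[i] * (X[i] @ x - y[i])

x, g = np.zeros(d), np.zeros(d)              # g_0 = 0
eta, beta = 0.01, 0.9
for t in range(20000):
    i = rng.integers(n)
    g = beta * g + (1 - beta) * grad(i, x)   # exponential moving average of
    x = x - eta * g                          # past stochastic gradients

x_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(x - x_star))            # converges to a neighbourhood
```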
Curvature Adaptive Methods

Question: What is a good measure of curvature? The Hessian.

Newton's Method
$$f(x_{t+1}) \approx f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{1}{2}(x_{t+1} - x_t)^\top \nabla^2 f(x_t)\,(x_{t+1} - x_t).$$
Minimizing the right-hand side leads to
$$x_{t+1} = x_t - \nabla^2 f(x_t)^{-1}\nabla f(x_t).$$

Drawbacks
• Expensive to evaluate the inverse.
• Requires a full gradient computation.
• Only local convergence is guaranteed.
Curvature Adaptive Methods

ADAM
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\nabla f_{i_t}(x_t) \quad \text{(first moment)}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\big(\nabla f_{i_t}(x_t)\big)^2 \quad \text{(second moment, coordinate-wise)}$$
Update:
$$x_{t+1} = x_t - \frac{\eta}{\sqrt{v_t/(1 - \beta_2^t)} + \varepsilon} \cdot \frac{m_t}{1 - \beta_1^t}.$$

Similar Methods
• AdaGrad [Duchi et al., 2011] ($\beta_1 = 0$, $v_t = \sum_{k=1}^t (\nabla f_{i_k}(x_k))^2$),
• AdaDelta [Zeiler, 2012] ($\beta_1 = 0$).

ADAM can diverge even for simple convex problems [Reddi et al., 2019]!
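A sketch of the ADAM update (ours, on the same synthetic least-squares setup; the hyperparameters are the commonly used defaults, and the iteration count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
grad = lambda i, x: X[i] * (X[i] @ x - y[i])

x = np.zeros(d)
m, v = np.zeros(d), np.zeros(d)
eta, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
for t in range(1, 20001):
    g = grad(rng.integers(n), x)
    m = beta1 * m + (1 - beta1) * g               # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2          # second moment (coordinate-wise)
    m_hat = m / (1 - beta1 ** t)                  # bias corrections
    v_hat = v / (1 - beta2 ** t)
    x = x - eta * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate step sizes

x_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(x - x_star))
```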
Practical Sampling: Random Reshuffling

Objective: find $x^\star \in \arg\min_{x \in \mathbb{R}^d} \Big[ f(x) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n f_i(x) \Big]$

Assumptions
• Each $f_i$ is $L$-smooth and $\mu$-convex.

Algorithm 1 Random Reshuffling (RR)
1: Input: $\eta > 0$, $x_0 = x_0^0 \in \mathbb{R}^d$
2: for $t = 1, \ldots$ do
3:   Sample a random permutation $\{\pi_i\}_{i=0}^{n-1}$ of $[n]$
4:   for $i = 0$ to $n - 1$ do
5:     $x_t^{i+1} = x_t^i - \eta\,\nabla f_{\pi_i}(x_t^i)$
6:   end for
7:   $x_{t+1} = x_t^n$
8: end for
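A sketch of Algorithm 1 (ours, on the same synthetic least-squares setup; 100 epochs and $\eta = 0.01$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
grad = lambda i, x: X[i] * (X[i] @ x - y[i])

x = np.zeros(d)
eta = 0.01
for epoch in range(100):
    perm = rng.permutation(n)        # fresh permutation each epoch
    for i in perm:                   # one pass over every data point
        x -= eta * grad(i, x)

x_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(x - x_star))
```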
Random Reshuffling

Strengths
• In each epoch, we see all data points.
• Cheaper to sample.

Limitations
• Bias, i.e., $E_{i,t}\big[\nabla f_{\pi_i}(x_t^i)\big] \ne \nabla f(x_t^i)$ (the update is unbiased only for $i = 0$).

Convergence Example
See whiteboard.
RR: Handling Bias

Auxiliary Sequence
A new notion to handle the bias:
$$x_\star^0 \overset{\text{def}}{=} x^\star, \qquad x_\star^{i+1} \overset{\text{def}}{=} x_\star^i - \eta\,\nabla f_{\pi_i}(x^\star).$$
What is $x_\star^n$?
$$x_\star^n = x^\star - \eta\sum_{i=0}^{n-1}\nabla f_{\pi_i}(x^\star) = x^\star - \eta\, n\,\nabla f(x^\star) = x^\star.$$
RR: One Step Analysis

Let $E_{i,t}[\cdot]$ denote $E[\cdot \mid x_t^i]$; then
$$E_{i,t}\big[\|x_t^{i+1} - x_\star^{i+1}\|^2\big] = E_{i,t}\big[\|x_t^i - \eta\nabla f_{\pi_i}(x_t^i) - x_\star^i + \eta\nabla f_{\pi_i}(x^\star)\|^2\big]$$
$$= \|x_t^i - x_\star^i\|^2 - 2\eta\,E_{i,t}\big[\langle \nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x^\star), x_t^i - x_\star^i \rangle\big] + \eta^2 E_{i,t}\big[\|\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x^\star)\|^2\big].$$

Three point identity (with the Bregman divergence $D_h(x, y) \overset{\text{def}}{=} h(x) - h(y) - \langle \nabla h(y), x - y \rangle$):
$$\langle \nabla h(a) - \nabla h(b), b - c \rangle = D_h(c, a) - D_h(c, b) - D_h(b, a).$$
Applying the three point identity to the middle term leads to
$$E_{i,t}\big[\|x_t^{i+1} - x_\star^{i+1}\|^2\big] = \|x_t^i - x_\star^i\|^2 - 2\eta\,E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x_t^i) + D_{f_{\pi_i}}(x_t^i, x^\star)\big]$$
$$+ \eta^2 E_{i,t}\big[\|\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x^\star)\|^2\big] + 2\eta\,E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x^\star)\big].$$
RR: One Step Analysis

$$E_{i,t}\big[\|x_t^{i+1} - x_\star^{i+1}\|^2\big] = \|x_t^i - x_\star^i\|^2 - 2\eta\,E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x_t^i)\big] - 2\eta\,E_{i,t}\big[D_{f_{\pi_i}}(x_t^i, x^\star)\big]$$
$$+ \eta^2 E_{i,t}\big[\|\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x^\star)\|^2\big] + 2\eta\,E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x^\star)\big].$$
Using $L$-smoothness and $\mu$-convexity of $f_{\pi_i}$, we have
$$E_{i,t}\big[\|x_t^{i+1} - x_\star^{i+1}\|^2\big] \le (1 - \eta\mu)\|x_t^i - x_\star^i\|^2 - \eta\Big(\frac{1}{L} - \eta\Big)E_{i,t}\big[\|\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x^\star)\|^2\big] + 2\eta\,E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x^\star)\big].$$
If $\eta \le \frac{1}{L}$ and $\sigma_{\text{shuffle}}^2 = \max_{i \in [n]} E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x^\star)\big]$, then
$$E_{i,t}\big[\|x_t^{i+1} - x_\star^{i+1}\|^2\big] \le (1 - \eta\mu)\|x_t^i - x_\star^i\|^2 + 2\eta\,\sigma_{\text{shuffle}}^2.$$
RR: Theorem

Convergence of RR
Let $\{f_i\}_{i=1}^n$ be $L$-smooth and $\mu$-convex; then RR with $\eta \le \frac{1}{L}$ converges as follows:
$$E\big[\|x_T - x^\star\|^2\big] \le (1 - \eta\mu)^{nT}\|x_0 - x^\star\|^2 + \frac{2\sigma_{\text{shuffle}}^2}{\mu},$$
where $\sigma_{\text{shuffle}}^2 = \max_{i \in [n]} E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x^\star)\big]$.

Questions
• How does this compare to SGD?
• Can we remove the neighbourhood?
RR: Variance

By $L$-smoothness,
$$\sigma_{\text{shuffle}}^2 = \max_{i \in [n]} E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x^\star)\big] \le \frac{L}{2}\max_{i \in [n]} E\big[\|x_\star^i - x^\star\|^2\big] = \frac{\eta^2 L}{2}\max_{i \in [n]} E\Bigg[\Big\|\sum_{j=0}^{i-1}\nabla f_{\pi_j}(x^\star)\Big\|^2\Bigg]$$
$$= \frac{\eta^2 L}{2}\max_{i \in [n]} \frac{i(n - i)}{n - 1}\,\sigma_\star^2 \le \frac{\eta^2 n L}{4}\,\sigma_\star^2.$$
The last equality follows from the formula for sampling without replacement [Mishchenko et al., 2020, Lemma 1] (we will discuss sampling techniques in detail later).
RR: Updated Theorem

Convergence of RR
Let $\{f_i\}_{i=1}^n$ be $L$-smooth and $\mu$-convex; then RR with $\eta \le \frac{1}{L}$ converges as follows:
$$E\big[\|x_T - x^\star\|^2\big] \le (1 - \eta\mu)^{nT}\|x_0 - x^\star\|^2 + \frac{\eta^2 n L\,\sigma_\star^2}{2\mu},$$
where $\sigma_\star^2 = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star)\|^2$.

Questions
• Complexity? Exercise: $O\Big(\Big(\kappa + \sqrt{\frac{nL}{\mu^2\varepsilon}}\Big)\log\frac{1}{\varepsilon}\Big)$.
• How does this compare to SGD in terms of $\varepsilon$?
Literature Review

Related Papers
• GD [Nesterov, 2003]
• SGD [Gower et al., 2019]
• SGD-Star [Gorbunov et al., 2020]
• L-SVRG [Kovalev et al., 2020]
• Momentum [Liu et al., 2020]
• Adam [Kingma and Ba, 2014], Yogi [Reddi et al., 2019]
• RR [Mishchenko et al., 2020]

Questions?

References i

Defazio, A., Bach, F., and Lacoste-Julien, S. (2014).


SAGA: A fast incremental gradient method with support for
non-strongly convex composite objectives.
Advances in neural information processing systems, 27.
Defazio, A. and Bottou, L. (2019).
On the ineffectiveness of variance reduced optimization for
deep learning.
Advances in Neural Information Processing Systems, 32.
Duchi, J., Hazan, E., and Singer, Y. (2011).
Adaptive subgradient methods for online learning and
stochastic optimization.
Journal of machine learning research, 12(7).
References ii

Fang, C., Li, C. J., Lin, Z., and Zhang, T. (2018).


SPIDER: Near-optimal non-convex optimization via stochastic
path-integrated differential estimator.
Advances in Neural Information Processing Systems, 31.
Gorbunov, E., Hanzely, F., and Richtárik, P. (2020).
A unified theory of SGD: Variance reduction, sampling,
quantization and coordinate descent.
In International Conference on Artificial Intelligence and
Statistics, pages 680–690. PMLR.
Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., and
Richtárik, P. (2019).
SGD: General analysis and improved rates.
In International Conference on Machine Learning, pages
5200–5209. PMLR.
References iii

Johnson, R. and Zhang, T. (2013).


Accelerating stochastic gradient descent using predictive
variance reduction.
Advances in neural information processing systems, 26.
Kingma, D. P. and Ba, J. (2014).
Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980.
Kovalev, D., Horváth, S., and Richtárik, P. (2020).
Don’t jump through hoops and remove those loops: SVRG and
Katyusha are better without the outer loop.
In Algorithmic Learning Theory, pages 451–467. PMLR.
References iv

Liu, Y., Gao, Y., and Yin, W. (2020).


An improved analysis of stochastic gradient descent with
momentum.
Advances in Neural Information Processing Systems,
33:18261–18271.
Mishchenko, K., Khaled, A., and Richtárik, P. (2020).
Random reshuffling: Simple analysis with vast improvements.
Advances in Neural Information Processing Systems,
33:17309–17320.
Nesterov, Y. (2003).
Introductory lectures on convex optimization: A basic course,
volume 87.
Springer Science & Business Media.
References v

Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. (2017).


SARAH: A novel method for machine learning problems using
stochastic recursive gradient.
In International Conference on Machine Learning, pages
2613–2621. PMLR.
Reddi, S. J., Kale, S., and Kumar, S. (2019).
On the convergence of Adam and beyond.
arXiv preprint arXiv:1904.09237.
Schmidt, M., Le Roux, N., and Bach, F. (2017).
Minimizing finite sums with the stochastic average gradient.
Mathematical Programming, 162(1):83–112.
Zeiler, M. D. (2012).
AdaDelta: An adaptive learning rate method.
arXiv preprint arXiv:1212.5701.
