ML807_Distributed_and_Federated_Learning_Slides_2
Outline
1. SGD in Neural Network Training
2. Gradient Descent
3. SGD-like Methods
SGD in Neural Network training
Multi-layer Neural Network
Activation Function: Sigmoid
σ(z) = 1 / (1 + e^{−z})
Activation Function: ReLU
Learning Parameters
SGD for Neural Network
Updating the parameter values
Illustrative example
• Step 2: Find the error. Let's assume that the error function is the sum of the squared differences between the target values t_k and the network outputs z_k: E = ½ Σ_k (z_k − t_k)².
Illustrative example (step 3, output layer)
Step 3: Backpropagate the error. Let’s start at the output layer with
weight w_jk, recalling that E = ½ Σ_k (z_k − t_k)² and u_k = b_k + Σ_j y_j w_jk:

∂E/∂w_jk = (∂E/∂z_k)(∂z_k/∂u_k)(∂u_k/∂w_jk) = (z_k − t_k) g′_k(u_k) y_j = δ_k y_j.
Illustrative example (step 3, hidden layer)
Step 3 (cont’d): Now let’s consider wij in the hidden layer, recalling
u_j = b_j + Σ_i x_i w_ij, u_k = b_k + Σ_j y_j w_jk, z_k = g_k(u_k):

∂E/∂w_ij = Σ_k (∂E/∂z_k)(∂z_k/∂u_k)(∂u_k/∂y_j)(∂y_j/∂u_j)(∂u_j/∂w_ij) = (Σ_k δ_k w_jk) g′_j(u_j) x_i = δ_j x_i,

where we substituted δ_j = g′_j(u_j) Σ_k δ_k w_jk = g′_j(u_j) Σ_k (z_k − t_k) g′_k(u_k) w_jk, the error in u_j.
Illustrative example (steps 3 and 4)
Final layer
• Error in each of its outputs is z_k − t_k.
• Error in the input u_k to the final layer is δ_k = (z_k − t_k) g′_k(u_k).
High-level Procedure: Can be Used with More Hidden Layers
Final layer
• Error in each of its outputs is z_k − t_k.
• Error in the input u_k to the final layer is δ_k = (z_k − t_k) g′_k(u_k).
Hidden layer
• Error in output y_j is Σ_k δ_k w_jk.
• Error in the input u_j of the hidden layer is g′_j(u_j) Σ_k δ_k w_jk.
Vectorized Implementation
Much faster than implementing a loop over all neurons in each layer.
Forward-Propagation
• Represent the weights between layers l − 1 and l as a matrix W_l.
• Collect the outputs of layer l − 1 in a row vector y_{l−1}. Then we have u_l = y_{l−1} W_l.
• The outputs of layer l are the row vector y_l = g_l(u_l).
Vectorized Implementation
Much faster than implementing a loop over all neurons in each layer.
Back-Propagation
• For each layer l, find ∆_l, the vector of errors in u_l, in terms of the final error.
• Compute the gradient w.r.t. the parameters of the current layer l.
• Recursively find ∆_{l−1} in terms of ∆_l.
Exercise: How to implement this?
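As a concrete illustration, the vectorized forward and backward passes can be sketched in NumPy. This is a minimal example, not from the slides: a two-layer sigmoid network with squared error E = ½∥z − t∥², checked against a finite difference.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, Ws, bs):
    """Forward pass: u_l = y_{l-1} W_l + b_l, y_l = sigmoid(u_l)."""
    ys = [x]
    for W, b in zip(Ws, bs):
        ys.append(sigmoid(ys[-1] @ W + b))
    return ys

def backward(ys, t, Ws):
    """Back-propagation for E = 0.5 * ||z - t||^2.
    delta holds the vector of errors in u_l for the current layer l."""
    z = ys[-1]
    delta = (z - t) * z * (1 - z)               # delta_k = (z_k - t_k) g'(u_k)
    gWs, gbs = [], []
    for l in reversed(range(len(Ws))):
        gWs.append(np.outer(ys[l], delta))      # dE/dW_l
        gbs.append(delta)                       # dE/db_l
        if l > 0:
            y = ys[l]
            delta = (delta @ Ws[l].T) * y * (1 - y)  # propagate error one layer back
    return gWs[::-1], gbs[::-1]

rng = np.random.default_rng(0)
x, t = rng.standard_normal(3), rng.standard_normal(2)
Ws = [rng.standard_normal((3, 4)), rng.standard_normal((4, 2))]
bs = [rng.standard_normal(4), rng.standard_normal(2)]
ys = forward(x, Ws, bs)
gWs, gbs = backward(ys, t, Ws)

# Finite-difference check of dE/dW_0[0, 0].
def loss(Ws_):
    return 0.5 * np.sum((forward(x, Ws_, bs)[-1] - t) ** 2)
eps = 1e-6
Wp = [W.copy() for W in Ws]
Wp[0][0, 0] += eps
fd = (loss(Wp) - loss(Ws)) / eps
```

Each ∆_l is obtained from ∆_{l+1} with a single matrix product, instead of a loop over neurons.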
Mini-batch SGD
• Recall the empirical risk loss function that we considered for back-propagation:
E = (1/2n) Σ_{i=1}^n (F(x_i, θ) − y_i)².
• Computing its full gradient requires a back-propagation pass over all n data points, which is expensive for large n.
• Therefore, we use stochastic gradient descent (SGD): choose a random data point x_i and use Ẽ = ½ (F(x_i, θ) − y_i)² instead of the entire sum.
Mini-batch SGD
• In mini-batch SGD, we instead use a randomly chosen batch of b data points for back-propagation.
• Small b saves per-iteration computing cost, but increases noise in the gradients and yields worse error convergence.
• Large b reduces gradient noise and gives better error convergence, but increases computing cost per iteration.
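A minimal mini-batch SGD sketch on a least-squares model (the problem, step size, and batch size are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, b = 200, 5, 16                   # n data points, dimension d, batch size b
A = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = A @ theta_true                     # noiseless targets, so the minimum loss is 0

def loss(theta):
    return 0.5 / n * np.sum((A @ theta - y) ** 2)

theta = np.zeros(d)
eta = 0.05
for _ in range(5000):
    idx = rng.choice(n, size=b, replace=False)       # sample a mini-batch
    grad = A[idx].T @ (A[idx] @ theta - y[idx]) / b  # mini-batch gradient
    theta -= eta * grad

final_loss = loss(theta)
```

Larger b makes each `grad` closer to the full gradient at b times the per-iteration cost.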
Questions?
Gradient Descent
Gradient Descent
Gradient Descent as MM Algorithm
Gradient Descent as MM
• f is L-smooth, i.e.,
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)∥y − x∥², ∀x, y ∈ R^d.
• Let g(y) := f(x) + ⟨∇f(x), y − x⟩ + (L/2)∥y − x∥².
• Then, x − (1/L)∇f(x) = arg min_{y∈R^d} g(y).
• This is GD with η = 1/L.
• Therefore, f(x_{t+1}) ≤ f(x_t) − (1/2L)∥∇f(x_t)∥², for x_{t+1} = x_t − (1/L)∇f(x_t).
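The MM view can be checked numerically. The sketch below (an illustrative quadratic, not from the slides) verifies that x − (1/L)∇f(x) minimizes the majorizer g and that the descent inequality f(x⁺) ≤ f(x) − (1/2L)∥∇f(x)∥² holds:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.standard_normal((d, d))
Q = M.T @ M + np.eye(d)                  # positive definite, so f is convex
L = np.linalg.eigvalsh(Q).max()          # smoothness constant of f

def f(x):
    return 0.5 * x @ Q @ x

def grad(x):
    return Q @ x

x = rng.standard_normal(d)

def g(y):
    # Majorizer: g(y) = f(x) + <grad f(x), y - x> + (L/2) ||y - x||^2
    return f(x) + grad(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2)

y_mm = x - grad(x) / L                   # candidate minimizer of g
perturbed = [g(y_mm + 1e-3 * rng.standard_normal(d)) for _ in range(100)]
mm_is_min = all(p >= g(y_mm) for p in perturbed)

# Descent inequality: f(x_{t+1}) <= f(x_t) - ||grad f(x_t)||^2 / (2L)
descent_ok = f(y_mm) <= f(x) - np.sum(grad(x) ** 2) / (2 * L) + 1e-12
```

Since g majorizes f and touches it at x, minimizing g can only decrease f; that is the MM principle.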
Gradient for µ-convex functions
We have
∥∇f(x)∥² ≥ 2µ(f(x) − f⋆), ∀x ∈ R^d.
Gradient Descent Convergence
We have
∥∇f(x)∥² ≥ 2µ(f(x) − f⋆), ∀x ∈ R^d.
Combining this with f(x_{t+1}) ≤ f(x_t) − (1/2L)∥∇f(x_t)∥², we obtain
f(x_{t+1}) ≤ f(x_t) − (1/2L)∥∇f(x_t)∥² ≤ f(x_t) − (µ/L)(f(x_t) − f⋆).
Subtracting f⋆ from both sides leads to
f(x_{t+1}) − f⋆ ≤ (1 − µ/L)(f(x_t) − f⋆).
Let κ = L/µ be the condition number; applying the above recursively yields
f(x_T) − f⋆ ≤ (1 − 1/κ)^T (f(x_0) − f⋆).
GD: Theorem
Convergence of GD
Let f be L-smooth and µ-convex. Then GD with η = 1/L converges as follows:
f(x_T) − f⋆ ≤ (1 − 1/κ)^T (f(x_0) − f⋆).
Complexity Bound
Since (1 − 1/κ)^T ≤ e^{−T/κ}, choosing T ≥ κ log((f(x_0) − f⋆)/ε) implies f(x_T) − f⋆ ≤ ε.
Exercise: Complete the proof.
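The theorem can be sanity-checked on a strongly convex quadratic (an illustrative setup, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 10
M = rng.standard_normal((d, d))
Q = M.T @ M + 0.5 * np.eye(d)            # f(x) = 0.5 x^T Q x, minimized at x* = 0
eigs = np.linalg.eigvalsh(Q)
L, mu = eigs.max(), eigs.min()
kappa = L / mu

def f(x):
    return 0.5 * x @ Q @ x

x0 = rng.standard_normal(d)
x, T = x0.copy(), 200
for _ in range(T):
    x -= (1.0 / L) * (Q @ x)             # GD step with eta = 1/L

# Theorem: f(x_T) - f* <= (1 - 1/kappa)^T (f(x_0) - f*), with f* = 0 here.
bound = (1 - 1 / kappa) ** T * f(x0)
bound_holds = f(x) <= bound + 1e-12
```

On a quadratic, each eigen-mode contracts at least as fast as the worst-case factor 1 − 1/κ, so the bound holds with room to spare.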
GD: Convergence in Iterates
Convergence of GD iterates
Let f be L-smooth and µ-convex. Then GD with η = 1/L converges as follows:
∥x_T − x⋆∥² ≤ (1 − 1/κ)^T ∥x_0 − x⋆∥².
GD: Proof
SGD-like Methods
Stochastic Gradient Descent
Objective: find x⋆ ∈ arg min_{x∈R^d} f(x), where f(x) := (1/n) Σ_{i=1}^n f_i(x).
Assumptions
• f_i is L-smooth, i.e.,
f_i(y) ≤ f_i(x) + ⟨∇f_i(x), y − x⟩ + (L/2)∥y − x∥², ∀x, y ∈ R^d,
• f_i is µ-convex, i.e.,
f_i(y) ≥ f_i(x) + ⟨∇f_i(x), y − x⟩ + (µ/2)∥y − x∥², ∀x, y ∈ R^d.
SGD: One Step Analysis
E_t ∥x_{t+1} − x⋆∥² = ∥x_t − x⋆∥² − η⟨∇f(x_t), x_t − x⋆⟩
− η⟨(1/n) Σ_{i=1}^n ∇f_i(x_t), x_t − x⋆⟩ + η² (1/n) Σ_{i=1}^n ∥∇f_i(x_t)∥².
The first two terms are exactly the same as for gradient descent ⇒ we use µ-convexity.
SGD: One Step Analysis
E_t ∥x_{t+1} − x⋆∥² ≤ (1 − ηµ)∥x_t − x⋆∥²
− η⟨(1/n) Σ_{i=1}^n ∇f_i(x_t), x_t − x⋆⟩ + η² (1/n) Σ_{i=1}^n ∥∇f_i(x_t)∥².
Let us denote σ⋆² = (1/n) Σ_{i=1}^n ∥∇f_i(x⋆)∥². Putting it all together, we obtain
E_t ∥x_{t+1} − x⋆∥² ≤ (1 − ηµ)∥x_t − x⋆∥²
− η (1/(2L) − η) (2/n) Σ_{i=1}^n ∥∇f_i(x_t) − ∇f_i(x⋆)∥² + 2η²σ⋆².
SGD: Rate of Convergence
Can we guarantee a rate for E∥x_T − x⋆∥² using the obtained one-step bound?
E∥x_T − x⋆∥² = E[E_{T−1}∥x_T − x⋆∥²] ≤ (1 − ηµ) E∥x_{T−1} − x⋆∥² + 2η²σ⋆².
SGD: Theorem
Convergence of SGD
Let {f_i}_{i=1}^n be L-smooth and µ-convex. Then SGD with η ≤ 1/(2L) converges as follows:
E∥x_T − x⋆∥² ≤ (1 − ηµ)^T ∥x_0 − x⋆∥² + 2ησ⋆²/µ,
where σ⋆² = (1/n) Σ_{i=1}^n ∥∇f_i(x⋆)∥².
Questions
• How does this compare to GD? (Hint: think about the smoothness constant L.)
• What is the rate of convergence in terms of E∥x_T − x⋆∥² ≤ ε if η = 1/(2L)?
• Can we remove the neighbourhood?
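The two terms of the bound, linear convergence plus a neighbourhood of size 2ησ⋆²/µ, can be observed numerically. The sketch below runs SGD on a least-squares problem (an illustrative setup; each f_i here is convex but not individually µ-convex, so it mirrors the theorem rather than instantiating its exact assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 3
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)  # inconsistent system
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)           # minimizer of f

# f_i(x) = 0.5 (a_i^T x - b_i)^2, so L = max_i ||a_i||^2 and
# sigma_*^2 = (1/n) sum_i ||grad f_i(x*)||^2.
L = max(np.sum(A ** 2, axis=1))
eta = 1 / (2 * L)
sigma2_star = np.mean((A @ x_star - b) ** 2 * np.sum(A ** 2, axis=1))
mu = np.linalg.eigvalsh(A.T @ A / n).min()

x = np.zeros(d)
dists = []
for t in range(20000):
    i = rng.integers(n)
    x -= eta * A[i] * (A[i] @ x - b[i])                  # SGD step
    if t >= 15000:                                       # average over the tail
        dists.append(np.sum((x - x_star) ** 2))

avg_sq_dist = float(np.mean(dists))
neighborhood = 2 * eta * sigma2_star / mu                # neighbourhood size in the bound
```

With a constant step size, the iterates stop making progress once they reach this noise-dominated neighbourhood; shrinking η shrinks the neighbourhood but slows the linear phase.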
SGD: Complexity
If we can make
(1 − ηµ)^T ∥x_0 − x⋆∥² ≤ ε/2 and 2ησ⋆²/µ ≤ ε/2,
then E∥x_T − x⋆∥² ≤ ε. For the first term, recall from GD that
T ≥ (1/(ηµ)) log(2∥x_0 − x⋆∥²/ε) ⇒ (1 − ηµ)^T ∥x_0 − x⋆∥² ≤ ε/2.
SGD: Complexity
SGD Complexity
Let {f_i}_{i=1}^n be L-smooth and µ-convex. Then SGD has the following worst-case complexity:
T = O((κ + 1/(µ²ε)) log(1/ε)) ⇒ E∥x_T − x⋆∥² ≤ ε.
Questions
• When and why is SGD a good method?
• Can we do better?
Problem with SGD
SGD-Star
Objective: find x⋆ ∈ arg min_{x∈R^d} f(x), where f(x) := (1/n) Σ_{i=1}^n f_i(x).
Assumptions
• f_i is L-smooth, i.e.,
f_i(y) ≤ f_i(x) + ⟨∇f_i(x), y − x⟩ + (L/2)∥y − x∥², ∀x, y ∈ R^d,
• f_i is µ-convex, i.e.,
f_i(y) ≥ f_i(x) + ⟨∇f_i(x), y − x⟩ + (µ/2)∥y − x∥², ∀x, y ∈ R^d.
SGD-Star: One Step Analysis
E_t ∥x_{t+1} − x⋆∥² = ∥x_t − x⋆∥² − η⟨∇f(x_t), x_t − x⋆⟩
− η⟨(1/n) Σ_{i=1}^n (∇f_i(x_t) − ∇f_i(x⋆)), x_t − x⋆⟩
+ η² (1/n) Σ_{i=1}^n ∥∇f_i(x_t) − ∇f_i(x⋆)∥².
SGD-Star: Theorem
Convergence of SGD-Star
Let {f_i}_{i=1}^n be L-smooth and µ-convex. Then SGD-Star with η ≤ 1/L converges as follows:
E∥x_T − x⋆∥² ≤ (1 − ηµ)^T ∥x_0 − x⋆∥².
Questions
• How does this compare to GD/SGD?
• What is the rate of convergence in terms of E∥x_T − x⋆∥² ≤ ε? O(κ log(1/ε)).
• What about the ∇f_i(x⋆)'s?
• What is the variance of SGD-Star at the optimum?
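SGD-Star removes the neighbourhood by shifting each stochastic gradient by ∇f_{i_t}(x⋆), i.e. g_t = ∇f_{i_t}(x_t) − ∇f_{i_t}(x⋆), the estimator appearing in the one-step analysis above. The sketch below (illustrative least-squares setup, not from the slides) shows the resulting linear convergence to x⋆; of course, knowing the ∇f_i(x⋆)'s in advance is unrealistic, which motivates learning the shift:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 50, 3
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
grads_at_star = A * (A @ x_star - b)[:, None]    # row i: grad f_i(x*)

L = max(np.sum(A ** 2, axis=1))
eta = 1 / L

x = np.zeros(d)
for _ in range(5000):
    i = rng.integers(n)
    g = A[i] * (A[i] @ x - b[i]) - grads_at_star[i]  # shifted stochastic gradient
    x -= eta * g

dist2 = float(np.sum((x - x_star) ** 2))
```

At x = x⋆ the shifted estimator is exactly zero, so the method converges to the optimum itself, not to a neighbourhood.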
Variance Reduction: Learning Optimal Shift
Properties of g_t for SGD-Star
• E_t[g_t] = ∇f(x_t).
• If x_t = x⋆, then E_t[g_t] = ∇f(x⋆) = 0 and V_t(g_t) = 0.
Variance Reduction: Learning Optimal Shift
Control Variate
• g_t = ∇f_{i_t}(x_t) + c_t,
• E_t[c_t] → 0 and V_t(g_t) → 0 as t → ∞.
How to construct c_t?
Stochastic Variance Reduced Gradient (SVRG) [Johnson and Zhang, 2013]
Loopless SVRG (L-SVRG)
Objective: find x⋆ ∈ arg min_{x∈R^d} f(x), where f(x) := (1/n) Σ_{i=1}^n f_i(x).
Assumptions
• f_i is L-smooth, i.e.,
f_i(y) ≤ f_i(x) + ⟨∇f_i(x), y − x⟩ + (L/2)∥y − x∥², ∀x, y ∈ R^d,
• f_i is µ-convex, i.e.,
f_i(y) ≥ f_i(x) + ⟨∇f_i(x), y − x⟩ + (µ/2)∥y − x∥², ∀x, y ∈ R^d.
Algorithm: x_{t+1} = x_t − ηg_t, where g_t = ∇f_{i_t}(x_t) − ∇f_{i_t}(y_t) + ∇f(y_t), i_t is selected uniformly at random from [n], y_0 = x_0, and
y_{t+1} = x_t with probability p, y_{t+1} = y_t otherwise.
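A minimal L-SVRG sketch following the algorithm above (the least-squares problem and parameter values are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 50, 3
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

def grad_i(x, i):
    return A[i] * (A[i] @ x - b[i])     # grad f_i for f_i(x) = 0.5 (a_i^T x - b_i)^2

def full_grad(x):
    return A.T @ (A @ x - b) / n

L = max(np.sum(A ** 2, axis=1))
eta, p = 1 / (8 * L), 1 / n             # step size and reference-update probability

x = np.zeros(d)
y = x.copy()
gy = full_grad(y)                       # stored full gradient at the reference point y
for _ in range(40000):
    i = rng.integers(n)
    g = grad_i(x, i) - grad_i(y, i) + gy
    x -= eta * g
    if rng.random() < p:                # y_{t+1} = x_t with probability p
        y = x.copy()
        gy = full_grad(y)

dist2 = float(np.sum((x - x_star) ** 2))
```

Only the full gradient at y is stored; the expensive full-gradient computation happens on average once every 1/p iterations, replacing SVRG's outer loop with a coin flip.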
L-SVRG: One Round Progress xt
By µ-convexity,
E_t ∥x_{t+1} − x⋆∥² ≤ (1 − ηµ)∥x_t − x⋆∥² − η⟨∇f(x_t), x_t − x⋆⟩ + η² E_t ∥g_t∥².
What about g_t?
L-SVRG: Bound on gt
By definition,
E_t ∥g_t∥² = E_t ∥∇f_{i_t}(x_t) − ∇f_{i_t}(y_t) + ∇f(y_t)∥².
Similarly to the SGD analysis, we want to obtain ∥∇f_{i_t}(x_t) − ∇f_{i_t}(x⋆)∥² in order to apply smoothness. Therefore, we apply Young's inequality:
E_t ∥g_t∥² ≤ 2 E_t[∥∇f_{i_t}(x_t) − ∇f_{i_t}(x⋆)∥² + ∥∇f_{i_t}(x⋆) − ∇f_{i_t}(y_t) + ∇f(y_t)∥²]
= (2/n) Σ_{i=1}^n ∥∇f_i(x_t) − ∇f_i(x⋆)∥² + (2/n) Σ_{i=1}^n ∥∇f_i(x⋆) − ∇f_i(y_t) + ∇f(y_t)∥².
We denote D_t = (1/n) Σ_{i=1}^n ∥∇f_i(x⋆) − ∇f_i(y_t) + ∇f(y_t)∥², the error due to the imprecise shift.
L-SVRG: One Round Progress Dt
Since V(X) ≤ E∥X∥², we have
E_t[D_{t+1}] ≤ (p/n) Σ_{i=1}^n ∥∇f_i(x⋆) − ∇f_i(x_t)∥² + (1 − p) D_t.
L-SVRG: One Step Analysis
E_t[∥x_{t+1} − x⋆∥² + r_{t+1} D_{t+1}] ≤ (1 − ηµ)∥x_t − x⋆∥² + (2η²/r_{t+1} + (1 − p)) r_{t+1} D_t
− η (1/L − 2η − p r_{t+1}/η) (1/n) Σ_{i=1}^n ∥∇f_i(x_t) − ∇f_i(x⋆)∥².
If 1/(2L) ≥ 2η and 1/(2L) ≥ p r_{t+1}/η, then
E_t[∥x_{t+1} − x⋆∥² + r_{t+1} D_{t+1}] ≤ (1 − ηµ)∥x_t − x⋆∥² + (2η²/r_{t+1} + (1 − p)) r_{t+1} D_t.
If p/2 ≥ 2η²/r_{t+1} and r_{t+1} = r for all t, then
E_t[∥x_{t+1} − x⋆∥² + r D_{t+1}] ≤ (1 − ηµ)∥x_t − x⋆∥² + (1 − p/2) r D_t
≤ max{1 − ηµ, 1 − p/2} (∥x_t − x⋆∥² + r D_t).
L-SVRG: Choice of Parameters
1/(2L) ≥ 2η,  1/(2L) ≥ pr/η,  p/2 ≥ 2η²/r.
To obtain the best rate for E∥x_T − x⋆∥², we might want r to be as small as possible. Therefore, let us select r = 4η²/p.
Combining the second inequality with r = 4η²/p leads to
1/(2L) ≥ p · 4η²/(pη) = 4η.
Therefore, η ≤ 1/(8L), r = 4η²/p, and p ∈ (0, 1] satisfy all three inequalities.
L-SVRG: Theorem
Convergence of L-SVRG
Let {f_i}_{i=1}^n be L-smooth and µ-convex. Then L-SVRG with η ≤ 1/(8L) and p ∈ (0, 1] converges as follows:
E∥x_T − x⋆∥² ≤ max{1 − ηµ, 1 − p/2}^T (∥x_0 − x⋆∥² + (1/(16L²p)) D_0),
where D_0 = (1/n) Σ_{i=1}^n ∥∇f_i(x⋆) − ∇f_i(x_0) + ∇f(x_0)∥².
Questions
• How does this compare to GD/SGD?
• What is the rate of convergence in terms of E∥x_T − x⋆∥² ≤ ε? O((1/p + κ) log(1/(pε))).
• What is the expected computational cost per iteration? pn + (1 − p).
• What are good choices of p? In practice, p = 1/n is a popular choice, with total complexity O((n + κ) log(n/ε)).
Comparison
Different Type of Variance Reduction
Why does SVRG-type variance reduction not work well for deep learning?
This is an open problem, partially explained in [Defazio and Bottou, 2019].
Momentum SGD
Variance reduction view:
g_t = (1 − β) Σ_{k=0}^t β^{t−k} ∇f_{i_k}(x_k).
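The closed-form expression for g_t above is exactly the exponential moving average used in practice. The sketch below (illustrative, with random stand-in gradients) checks that the recursion g_t = β g_{t−1} + (1 − β)∇f_{i_t}(x_t) reproduces it:

```python
import numpy as np

rng = np.random.default_rng(7)
beta, T, d = 0.9, 50, 4
grads = rng.standard_normal((T + 1, d))     # stand-ins for the gradients at x_0..x_T

# Recursive form used in practice: g_t = beta * g_{t-1} + (1 - beta) * grad_t
g = np.zeros(d)
for t in range(T + 1):
    g = beta * g + (1 - beta) * grads[t]

# Closed form from the slide: g_T = (1 - beta) * sum_k beta^(T - k) * grad_k
g_closed = (1 - beta) * sum(beta ** (T - k) * grads[k] for k in range(T + 1))

max_diff = float(np.max(np.abs(g - g_closed)))
```

The averaging damps the gradient noise, but unlike SVRG-type estimators it also averages over past iterates x_k, which introduces a bias.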
Curvature Adaptive Methods
f(x_{t+1}) ≈ f(x_t) + ⟨∇f(x_t), x_{t+1} − x_t⟩ + ½ (x_{t+1} − x_t)^⊤ ∇²f(x_t) (x_{t+1} − x_t).
Minimizing the above leads to Newton's method: x_{t+1} = x_t − [∇²f(x_t)]^{−1} ∇f(x_t).
Drawbacks
• Expensive to evaluate the inverse.
• Requires full gradient computation.
• Only local convergence is guaranteed.
Curvature Adaptive Methods
ADAM
Update:
x_{t+1} = x_t − η / (√(v_t/(1 − β₂^t)) + ε) · m_t/(1 − β₁^t).
Similar Methods
• AdaGrad [Duchi et al., 2011] (β₁ = 0, v_t = Σ_{k=1}^t (∇f_{i_k}(x_k))²),
• AdaDelta [Zeiler, 2012] (β₁ = 0).
ADAM can diverge even for simple convex problems [Reddi et al., 2019]!
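A minimal sketch of the ADAM update above, with the standard moment recursions m_t = β₁m_{t−1} + (1 − β₁)∇f_{i_t}(x_t) and v_t = β₂v_{t−1} + (1 − β₂)(∇f_{i_t}(x_t))². The test problem and step size are illustrative choices, not from the slides:

```python
import numpy as np

def adam_step(x, grad, m, v, t, eta, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update with bias-corrected first and second moments."""
    m = beta1 * m + (1 - beta1) * grad          # first moment m_t
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment v_t
    x = x - eta / (np.sqrt(v / (1 - beta2 ** t)) + eps) * m / (1 - beta1 ** t)
    return x, m, v

# Minimize f(x) = 0.5 ||x||^2, so grad f(x) = x (deterministic for simplicity).
x = np.array([1.0, -2.0])
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 5001):                        # t starts at 1 for bias correction
    x, m, v = adam_step(x, x, m, v, t, eta=0.01)

final_norm = float(np.linalg.norm(x))
```

The division by √(v̂_t) rescales each coordinate by an estimate of its gradient magnitude, a cheap diagonal stand-in for the curvature information Newton's method obtains from the Hessian.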
Practical Sampling: Random Reshuffling
Objective: find x⋆ ∈ arg min_{x∈R^d} f(x), where f(x) := (1/n) Σ_{i=1}^n f_i(x).
Assumptions
• Each f_i is L-smooth and µ-convex.
Random Reshuffling
Strengths
• In each epoch, we see all data points.
• Cheaper to sample.
Limitations
• Bias, i.e., E_{i,t}[∇f_{π_i}(x_t^i)] ≠ ∇f(x_t^i) (equality holds only for i = 0).
Convergence Example
See whiteboard.
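A minimal random-reshuffling sketch on least squares (illustrative problem and step size, not from the slides; note the single shuffle per epoch, versus independent sampling in SGD):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 40, 3
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

L = max(np.sum(A ** 2, axis=1))
eta = 1 / (2 * L)

x = np.zeros(d)
for epoch in range(500):
    pi = rng.permutation(n)                  # reshuffle once per epoch
    for i in pi:                             # then sweep over all n data points
        x -= eta * A[i] * (A[i] @ x - b[i])  # step with grad f_{pi_i}(x)

dist2 = float(np.sum((x - x_star) ** 2))
```

Within an epoch the samples are drawn without replacement, so each inner step is biased; the analysis below handles this with an auxiliary sequence.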
RR: Handling Bias
Auxiliary Sequence
New notion to handle the bias:
x⋆^0 := x⋆,
x⋆^{i+1} := x⋆^i − η ∇f_{π_i}(x⋆).
What is x⋆^n?
x⋆^n = x⋆, since Σ_{i=0}^{n−1} ∇f_{π_i}(x⋆) = n∇f(x⋆) = 0.
RR: One Step Analysis
Let E_{i,t}[·] denote E[· | x_t^i]. Then
E_{i,t}[∥x_t^{i+1} − x⋆^{i+1}∥²] = E_{i,t}[∥x_t^i − η∇f_{π_i}(x_t^i) − x⋆^i + η∇f_{π_i}(x⋆)∥²]
= ∥x_t^i − x⋆^i∥² − 2η ⟨E_{i,t}[∇f_{π_i}(x_t^i) − ∇f_{π_i}(x⋆)], x_t^i − x⋆^i⟩
+ η² E_{i,t}[∥∇f_{π_i}(x_t^i) − ∇f_{π_i}(x⋆)∥²].
RR: One Step Analysis
E_{i,t}[∥x_t^{i+1} − x⋆^{i+1}∥²] = ∥x_t^i − x⋆^i∥² − 2η E_{i,t}[D_{f_{π_i}}(x⋆^i, x_t^i)]
− 2η E_{i,t}[D_{f_{π_i}}(x_t^i, x⋆)] + η² E_{i,t}[∥∇f_{π_i}(x_t^i) − ∇f_{π_i}(x⋆)∥²]
+ 2η E_{i,t}[D_{f_{π_i}}(x⋆^i, x⋆)].
Convergence of RR
Let {f_i}_{i=1}^n be L-smooth and µ-convex. Then RR with η ≤ 1/L converges as follows:
E∥x_T − x⋆∥² ≤ (1 − ηµ)^{nT} ∥x_0 − x⋆∥² + 2σ²_shuffle/µ,
where σ²_shuffle = max_{i∈[n]} E_{i,t}[D_{f_{π_i}}(x⋆^i, x⋆)].
RR: Theorem
Convergence of RR
Let {f_i}_{i=1}^n be L-smooth and µ-convex. Then RR with η ≤ 1/L converges as follows:
E∥x_T − x⋆∥² ≤ (1 − ηµ)^{nT} ∥x_0 − x⋆∥² + 2σ²_shuffle/µ,
where σ²_shuffle = max_{i∈[n]} E_{i,t}[D_{f_{π_i}}(x⋆^i, x⋆)].
Questions
• How does this compare to SGD?
• Can we remove the neighbourhood?
RR: Variance
By L-smoothness,
σ²_shuffle = max_{i∈[n]} E_{i,t}[D_{f_{π_i}}(x⋆^i, x⋆)]
≤ (L/2) max_{i∈[n]} E[∥x⋆^i − x⋆∥²]
= (η²L/2) max_{i∈[n]} E[∥Σ_{j=0}^{i−1} ∇f_{π_j}(x⋆)∥²]
= (η²L/2) max_{i∈[n]} (i(n − i)/(n − 1)) σ⋆² ≤ (η²nL/4) σ⋆².
The last equality follows from the formula for sampling without replacement [Mishchenko et al., 2020, Lemma 1] (we will discuss sampling techniques in detail later).
RR: Updated Theorem
Convergence of RR
Let {f_i}_{i=1}^n be L-smooth and µ-convex. Then RR with η ≤ 1/L converges as follows:
E∥x_T − x⋆∥² ≤ (1 − ηµ)^{nT} ∥x_0 − x⋆∥² + η²nLσ⋆²/(2µ),
where σ⋆² = (1/n) Σ_{i=1}^n ∥∇f_i(x⋆)∥².
Questions
• Complexity? Exercise: O((κ + √(nL/(µ²ε))) log(1/ε)).
• How does this compare to SGD in terms of ε?
Literature Review
Related Papers
• GD [Nesterov, 2003]
• SGD [Gower et al., 2019]
• SGD-Star [Gorbunov et al., 2020]
• L-SVRG [Kovalev et al., 2020]
• Momentum [Liu et al., 2020]
• Adam [Kingma and Ba, 2014], Yogi [Reddi et al., 2019]
• RR [Mishchenko et al., 2020]
Questions?