
ML807: Distributed and Federated Learning

SGD in ML, its many Variants and Convergence Analysis

Instructors: Samuel Horvath, Eduard Gorbunov, Praneeth Vepakomma


Semester: Spring 2024

Machine Learning, Mohamed bin Zayed University of Artificial Intelligence


Stochastic Gradient Descent (SGD)

• Stochastic gradient descent (SGD) is at the core of almost all state-of-the-art machine learning.
• In the last lectures, we discussed:
  • basic concepts such as convexity, smoothness, gradients, the Hessian, and GD
  • SGD in linear regression, optimizing the empirical risk function
• In today's lecture and the next lectures, we will discuss:
  • The backpropagation algorithm (the special case of the chain rule for NNs)
  • Variants of SGD: variance reduction, momentum, random reshuffling, asynchronous and adaptive methods
  • Their convergence analysis
Outline

1. SGD in Neural Network training

2. Gradient Descent

3. SGD-like Methods

SGD in Neural Network training
Multi-layer Neural Network

• $w_{ij}$: weight connecting node $i$ in layer $\ell-1$ to node $j$ in layer $\ell$.
• $b_j$, $b_k$: biases for nodes $j$ and $k$, respectively.
• $u_j$, $u_k$: inputs to nodes $j$ and $k$ (where $u_j = b_j + \sum_i x_i w_{ij}$).
• $g_j(\cdot)$, $g_k(\cdot)$: activation functions for node $j$ (applied to $u_j$) and node $k$.
• $y_j = g_j(u_j)$, $z_k = g_k(u_k)$: outputs/activations of nodes $j$ and $k$.
• $t_k$: target value for node $k$ in the output layer.
Activation Function: Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

• Squashing-type non-linearity: pushes outputs into the range $(0, 1)$.
• Problem: near-constant value across most of the domain, strongly sensitive only when $z$ is close to zero.
• Saturation makes gradient-based learning difficult.
Activation Function: ReLU

• Approximates the softplus function $\log(1 + e^z)$.
• The ReLU activation function is $g(z) = \max\{0, z\}$.
• Similar to linear units. Easy to optimize!
• Gives large and consistent gradients when active.
• Problem: the gradient is zero for negative inputs.
• Solution: Leaky ReLU $g(z) = \max\{a z, z\}$, where $0 < a < 1$ is a small constant.
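As a quick numerical illustration (a minimal NumPy sketch of our own, not from the slides; the choice $a = 0.01$ is arbitrary), the activations and their derivatives can be written as:

```python
import numpy as np

def sigmoid(z):
    # squashes to (0, 1); saturates for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(z.dtype)

def leaky_relu(z, a=0.01):
    # the small slope a keeps a nonzero gradient for z < 0
    return np.maximum(a * z, z)

z = np.linspace(-5, 5, 11)
print(sigmoid_grad(z))   # near zero at both ends: saturation
print(relu_grad(z))      # constant 1 when active, 0 otherwise
```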
How do you perform inference using a trained neural network?

Expressing the outputs $z$ in terms of the inputs $x$ is called forward-propagation. Express:

• the inputs to the hidden layer in terms of $x$: $u_j = \sum_i w_{ij} x_i + b_j$,
• the outputs of the hidden layer in terms of $x$: $y_j = g_j\big(\sum_i w_{ij} x_i + b_j\big)$,
• the inputs to the final layer in terms of $x$: $u_k = \sum_j w_{jk} y_j + b_k$,
• the outputs of the final layer in terms of $x$: $z_k = g_k\big(\sum_j w_{jk} y_j + b_k\big)$.
Learning Parameters

How do we learn the parameters?

Choose the right loss function:

• Regression: least-squares loss
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \big(F(x_i, \theta) - y_i\big)^2$$
• Classification: cross-entropy loss
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n -\log F_{y_i}(x_i, \theta)$$

This is a hard optimization problem because $F(x, \theta)$ (the output of the neural network) is a complicated function of $x$. Typically, we use SGD with many optimization tricks.
SGD for Neural Networks

• Randomly pick a data point $(x_i, y_i)$.
• Compute the gradient using only this data point, for example,
$$g_i = \frac{\partial \big(F(x_i, \theta) - y_i\big)^2}{\partial \theta}.$$
• Update the parameters: $\theta \leftarrow \theta - \eta g_i$.
• Iterate the process until some (pre-specified) stopping criterion is met.
Updating the parameter values

Back-propagate the error. Given parameters $\theta = \{w, b\}$:

1. Forward-propagate to find $z_k$ in terms of the input (the "feed-forward signals").
2. Calculate the output error $E$ by comparing the predicted output $z_k$ to its true value $t_k$.
3. Back-propagate $E$ by weighting it by the gradients of the associated activation functions and the weights in previous layers.
4. Calculate the gradients $\frac{\partial E}{\partial w}$ and $\frac{\partial E}{\partial b}$ for the parameters $w, b$ at each layer, based on the backpropagated error signal and the feedforward signals from the inputs.
5. Update the parameters using the calculated gradients: $w \leftarrow w - \eta \frac{\partial E}{\partial w}$, $b \leftarrow b - \eta \frac{\partial E}{\partial b}$, where $\eta$ is the step size.
Illustrative example

Recall the notation from the multi-layer network above: weights $w_{ij}$, $w_{jk}$, biases $b_j$, $b_k$, node inputs $u_j$, $u_k$, activations $y_j = g_j(u_j)$, $z_k = g_k(u_k)$, and targets $t_k$.
Illustrative example (steps 1 and 2)

• Step 1: Forward-propagate for each output $z_k$:
$$z_k = g_k(u_k) = g_k\Big(b_k + \sum_j y_j w_{jk}\Big) = g_k\Big(b_k + \sum_j g_j\Big(b_j + \sum_i x_i w_{ij}\Big) w_{jk}\Big)$$
• Step 2: Find the error. Let's assume that the error function is half the sum of the squared differences between the target values $t_k$ and the network outputs $z_k$: $E = \frac{1}{2}\sum_k (z_k - t_k)^2$.
Illustrative example (step 3, output layer)

Step 3: Backpropagate the error. Let's start at the output layer with weight $w_{jk}$, recalling that $E = \frac{1}{2}\sum_k (z_k - t_k)^2$ and $u_k = b_k + \sum_j y_j w_{jk}$:
$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial u_k}\frac{\partial u_k}{\partial w_{jk}} = (z_k - t_k)\, g'_k(u_k)\, y_j = \delta_k y_j,$$
where $\delta_k = (z_k - t_k)\, g'_k(u_k)$ is called the error in $u_k$.
Illustrative example (step 3, hidden layer)

Step 3 (cont'd): Now let's consider $w_{ij}$ in the hidden layer, recalling $u_j = b_j + \sum_i x_i w_{ij}$, $u_k = b_k + \sum_j y_j w_{jk}$, $z_k = g_k(u_k)$:
$$\frac{\partial E}{\partial w_{ij}} = \sum_k \frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial u_k}\frac{\partial u_k}{\partial y_j}\frac{\partial y_j}{\partial u_j}\frac{\partial u_j}{\partial w_{ij}} = \Big(\sum_k \delta_k w_{jk}\Big)\, g'_j(u_j)\, x_i = \delta_j x_i,$$
where we substituted $\delta_j = g'_j(u_j)\sum_k \delta_k w_{jk}$, the error in $u_j$.
Illustrative example (steps 3 and 4)

Step 3 (cont'd): We similarly find that $\frac{\partial E}{\partial b_k} = \delta_k$ and $\frac{\partial E}{\partial b_j} = \delta_j$ (Exercise).

Step 4: Calculate the gradients. We have found that
$$\frac{\partial E}{\partial w_{ij}} = \delta_j x_i, \quad \text{and} \quad \frac{\partial E}{\partial w_{jk}} = \delta_k y_j,$$
where $\delta_j = g'_j(u_j)\sum_k \delta_k w_{jk}$ and $\delta_k = (z_k - t_k)\, g'_k(u_k)$.

Now, since we know $z_k$, $y_j$, $x_i$, $u_k$ and $u_j$ for a given set of parameter values $w, b$, we can use these expressions to calculate the gradients at each iteration and update the parameters.
Illustrative example (steps 4 and 5)

Step 4: Calculate the gradients. We have found that
$$\frac{\partial E}{\partial w_{ij}} = \delta_j x_i, \quad \text{and} \quad \frac{\partial E}{\partial w_{jk}} = \delta_k y_j,$$
where $\delta_j = g'_j(u_j)\sum_k \delta_k w_{jk}$ and $\delta_k = (z_k - t_k)\, g'_k(u_k)$.

Step 5: Update the weights and biases with learning rate $\eta$. For example,
$$w_{jk} \leftarrow w_{jk} - \eta \frac{\partial E}{\partial w_{jk}}, \quad \text{and} \quad w_{ij} \leftarrow w_{ij} - \eta \frac{\partial E}{\partial w_{ij}}.$$
High-level Procedure: Can be Used with More Hidden Layers

Final layer
• The error in each of its outputs is $z_k - t_k$.
• The error in the input $u_k$ to the final layer is $\delta_k = (z_k - t_k)\, g'_k(u_k)$.

Hidden layer
• The error in the output $y_j$ is $\sum_k \delta_k w_{jk}$.
• The error in the input $u_j$ of the hidden layer is $\delta_j = g'_j(u_j)\sum_k \delta_k w_{jk}$.

The gradient w.r.t. $w_{ij}$ is $x_i \delta_j$.
Vectorized Implementation

Much faster than implementing a loop over all neurons in each layer.

Forward-Propagation
• Represent the weights between layers $l-1$ and $l$ as a matrix $W^l$.
• The outputs of layer $l-1$ are in a row vector $y^{l-1}$. Then we have $u^l = y^{l-1} W^l$.
• The outputs of layer $l$ are in the row vector $y^l = g_l(u^l)$.
Vectorized Implementation

Back-Propagation
• For each layer $l$, find $\Delta^l$, the vector of errors in $u^l$, in terms of the final error.
• Compute the gradient w.r.t. the parameters of the current layer $l$.
• Recursively find $\Delta^{l-1}$ in terms of $\Delta^l$.

Exercise: How would you implement this? (One possible sketch follows below.)
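One possible NumPy answer to the exercise (a minimal sketch of our own, assuming a single hidden layer, sigmoid activations, and the squared error from the example; dimensions, data, and the step size $\eta = 0.5$ are arbitrary illustrative choices). It follows the slides' row-vector convention $u^l = y^{l-1} W^l + b^l$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one hidden layer: x (1 x d) -> y1 (1 x h) -> z (1 x m)
d, h, m = 4, 8, 3
W1, b1 = 0.1 * rng.standard_normal((d, h)), np.zeros((1, h))
W2, b2 = 0.1 * rng.standard_normal((h, m)), np.zeros((1, m))

x = rng.standard_normal((1, d))
t = rng.standard_normal((1, m))          # target values t_k

# forward pass: u^l = y^{l-1} W^l + b^l, y^l = g(u^l)
u1 = x @ W1 + b1; y1 = sigmoid(u1)
u2 = y1 @ W2 + b2; z = sigmoid(u2)

# backward pass for E = 0.5 * ||z - t||^2; Delta^l = errors in u^l
delta2 = (z - t) * z * (1 - z)           # delta_k = (z_k - t_k) g'(u_k)
dW2, db2 = y1.T @ delta2, delta2
delta1 = (delta2 @ W2.T) * y1 * (1 - y1) # recursion: Delta^{l-1} from Delta^l
dW1, db1 = x.T @ delta1, delta1

eta = 0.5
for P, G in [(W1, dW1), (b1, db1), (W2, dW2), (b2, db2)]:
    P -= eta * G                         # gradient step on every parameter
```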
Mini-batch SGD

• Recall the empirical risk loss function that we considered for the back-propagation discussion:
$$E = \frac{1}{2n}\sum_{i=1}^n \big(F(x_i, \theta) - y_i\big)^2$$
• For large training datasets (large $n$), computing gradients w.r.t. every data point is expensive. For example,
$$\frac{\partial E}{\partial w_{jk}} = \frac{1}{n}\sum_{i=1}^n \delta_k^{(i)} y_j^{(i)}$$
for back-propagation.
• Therefore we use stochastic gradient descent (SGD), where we choose a random data point $x_i$ and use $\tilde{E} = \frac{1}{2}\big(F(x_i, \theta) - y_i\big)^2$ instead of the entire sum.
• Mini-batch SGD is between these two extremes.
• In each iteration $t$, we choose a set $S_t$ of $b$ samples uniformly at random from the $n$ training samples and use
$$\tilde{L}(w_t) = \sum_{i \in S_t} \big(w_t^\top x_i - y_i\big)^2$$
for back-propagation.
• Small $b$ saves per-iteration computing cost, but increases the noise in the gradients and yields worse error convergence.
• Large $b$ reduces gradient noise and gives better error convergence, but increases the computing cost per iteration.
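A minimal mini-batch SGD sketch (ours, on a synthetic least-squares problem; the batch size $b = 32$, step size, and iteration count are arbitrary; we average the batch gradient by $1/b$ rather than summing, which only rescales the step size):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.01 * rng.standard_normal(n)

w = np.zeros(d)
eta, b = 0.1, 32                               # step size and batch size
for t in range(500):
    S = rng.choice(n, size=b, replace=False)   # uniform mini-batch S_t
    resid = X[S] @ w - y[S]
    g = X[S].T @ resid / b                     # averaged mini-batch gradient
    w -= eta * g

print(np.linalg.norm(w - w_true))              # small, up to the noise floor
```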
Questions?

Gradient Descent

Objective: find $x^\star \in \arg\min_{x \in \mathbb{R}^d} f(x)$

Assumptions
• $f$ is $L$-smooth, i.e.,
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d,$$
• $f$ is $\mu$-convex (i.e., $\mu$-strongly convex), i.e.,
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

Algorithm: $x_{t+1} = x_t - \eta \nabla f(x_t)$ for a small learning rate $\eta > 0$.

Intuition: A person standing on a hill covered in fog who wants to find their way down the hill. They can feel the slope with their feet and move along the direction of steepest descent.
Gradient Descent as MM Algorithm

Majorization-Minimization (MM) Algorithm
• Minimize an upper bound $g(y) \ge f(y)$ such that $g(x) = f(x)$ at the current iterate $x$.
• $\implies$ guaranteed decrease.

Gradient Descent as MM
• $f$ is smooth, i.e.,
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d.$$
• Let $g(y) \overset{\text{def}}{=} f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$.
• Then, $x - \frac{1}{L}\nabla f(x) = \arg\min_{y \in \mathbb{R}^d} g(y)$.
• This is GD with $\eta = \frac{1}{L}$.
• Therefore, $f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2$ for $x_{t+1} = x_t - \frac{1}{L}\nabla f(x_t)$.
Gradient for µ-convex functions

By strong convexity, we have
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|x - y\|^2.$$
Equivalently,
$$f(y) \ge f(x) + \langle \nabla f(y), y - x \rangle - \langle \nabla f(x) - \nabla f(y), y - x \rangle + \frac{\mu}{2}\|y - x\|^2.$$
Since $2\langle a, b \rangle + \mu\|b\|^2 \ge -\frac{1}{\mu}\|a\|^2$,
$$f(y) \ge f(x) + \langle \nabla f(y), y - x \rangle - \frac{1}{2\mu}\|\nabla f(x) - \nabla f(y)\|^2.$$
Let $y = x^\star$ (so that $\nabla f(x^\star) = 0$) and $f^\star = f(x^\star)$; then
$$f^\star \ge f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2 \implies \|\nabla f(x)\|^2 \ge 2\mu\big(f(x) - f^\star\big).$$

If we can make the gradient small, we converge to an optimal solution.
Gradient Descent Convergence

We have
$$\|\nabla f(x)\|^2 \ge 2\mu\big(f(x) - f^\star\big), \quad \forall x \in \mathbb{R}^d.$$
Combining this with $f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2$, we obtain
$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2 \le f(x_t) - \frac{\mu}{L}\big(f(x_t) - f^\star\big).$$
Subtracting $f^\star$ from both sides leads to
$$f(x_{t+1}) - f^\star \le \Big(1 - \frac{\mu}{L}\Big)\big(f(x_t) - f^\star\big).$$
Let $\kappa = \frac{L}{\mu}$ be the condition number; then applying the above recursively yields
$$f(x_T) - f^\star \le \Big(1 - \frac{1}{\kappa}\Big)^T \big(f(x_0) - f^\star\big).$$
GD: Theorem

Convergence of GD
Let $f$ be $L$-smooth and $\mu$-convex; then GD with $\eta = \frac{1}{L}$ converges as follows:
$$f(x_T) - f^\star \le \Big(1 - \frac{1}{\kappa}\Big)^T \big(f(x_0) - f^\star\big).$$

• Exercise: Show that a similar recursion holds for any $\eta \le \frac{1}{L}$.
• Question: How many iterations do we need to achieve $f(x_T) - f^\star \le \varepsilon$?
• Answer: $T = O\big(\kappa \log \frac{1}{\varepsilon}\big)$.
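A quick numerical check of the theorem (a sketch of our own on a synthetic strongly convex quadratic; the dimension and spectrum are arbitrary, chosen so that $\mu = 1$, $L = 10$, $\kappa = 10$):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
# quadratic f(x) = 0.5 x^T A x with eigenvalues in [1, 10], so f* = 0 at x* = 0
evals = np.linspace(1.0, 10.0, d)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(evals) @ Q.T
mu, L = evals[0], evals[-1]

f = lambda x: 0.5 * x @ A @ x
x = rng.standard_normal(d)
f0, eta, T = f(x), 1.0 / L, 100
for _ in range(T):
    x -= eta * (A @ x)                 # gradient step with eta = 1/L

print(f(x) <= (1 - mu / L) ** T * f0)  # True: matches the (1 - 1/kappa)^T rate
```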
Complexity Bound

We want to find $T$ for which
$$\Big(1 - \frac{1}{\kappa}\Big)^T \big(f(x_0) - f^\star\big) \le \varepsilon.$$
This implies that
$$T \log\Big(1 - \frac{1}{\kappa}\Big) \le \log\Big(\frac{\varepsilon}{f(x_0) - f^\star}\Big) \iff T \ge \log\Big(\frac{f(x_0) - f^\star}{\varepsilon}\Big) \Big/ \log\Big(1 + \frac{1}{\kappa - 1}\Big).$$
Since $\log\big(1 + \frac{1}{\kappa - 1}\big) \ge \frac{1}{\kappa}$ (using $\log(1 + x) \ge \frac{x}{1 + x}$),
$$T \ge \kappa \log\Big(\frac{f(x_0) - f^\star}{\varepsilon}\Big) \implies f(x_T) - f^\star \le \varepsilon.$$
Exercise: Complete the proof.
GD: Convergence in Iterates

Convergence of GD iterates
Let $f$ be $L$-smooth and $\mu$-convex; then GD with $\eta = \frac{1}{L}$ converges as follows:
$$\|x_T - x^\star\|^2 \le \Big(1 - \frac{1}{\kappa}\Big)^T \|x_0 - x^\star\|^2.$$

Easy proof of a weaker result: reuse the previous result. By $\mu$-convexity,
$$f(x) \ge f^\star + \frac{\mu}{2}\|x - x^\star\|^2.$$
Therefore (using $L$-smoothness at $x_0$ in the last step),
$$\|x_T - x^\star\|^2 \le \frac{2}{\mu}\big(f(x_T) - f^\star\big) \le \Big(1 - \frac{1}{\kappa}\Big)^T \frac{2}{\mu}\big(f(x_0) - f^\star\big) \le \Big(1 - \frac{1}{\kappa}\Big)^T \kappa\,\|x_0 - x^\star\|^2.$$
This is not optimal; we lost a factor of $\kappa$.
GD: Proof

Let us use the recursive update directly:
$$\|x_{t+1} - x^\star\|^2 = \|x_t - \eta\nabla f(x_t) - x^\star\|^2 = \|x_t - x^\star\|^2 - 2\eta\langle \nabla f(x_t), x_t - x^\star \rangle + \eta^2\|\nabla f(x_t)\|^2$$
$$= \|x_t - x^\star\|^2 - 2\eta\langle \nabla f(x_t) - \nabla f(x^\star), x_t - x^\star \rangle + \eta^2\|\nabla f(x_t)\|^2.$$

How to show decrease? Use the middle term: split it in half, bounding one half by $\mu$-convexity and the other by cocoercivity, $\langle \nabla f(x_t) - \nabla f(x^\star), x_t - x^\star \rangle \ge \frac{1}{L}\|\nabla f(x_t) - \nabla f(x^\star)\|^2$:
$$\|x_{t+1} - x^\star\|^2 \le \|x_t - x^\star\|^2 - \eta\mu\|x_t - x^\star\|^2 - \eta\langle \nabla f(x_t) - \nabla f(x^\star), x_t - x^\star \rangle + \eta^2\|\nabla f(x_t)\|^2$$
$$\le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\Big(\frac{1}{L} - \eta\Big)\|\nabla f(x_t)\|^2.$$
Setting $\eta = \frac{1}{L}$ concludes the proof.
SGD-like Methods

Stochastic Gradient Descent

Objective: find $x^\star \in \arg\min_{x \in \mathbb{R}^d} \Big[ f(x) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n f_i(x) \Big]$

Assumptions
• Each $f_i$ is $L$-smooth, i.e.,
$$f_i(y) \le f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{L}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d,$$
• Each $f_i$ is $\mu$-convex, i.e.,
$$f_i(y) \ge f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

Algorithm: $x_{t+1} = x_t - \eta g_t$, where $g_t = \nabla f_{i_t}(x_t)$ and $i_t$ is selected uniformly at random from $[n] = \{1, 2, \ldots, n\}$.
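A minimal sketch of this update (ours, on a synthetic least-squares problem where $f_i(x) = \frac{1}{2}(a_i^\top x - y_i)^2$; the step size and iteration count are arbitrary). With a constant step size, the iterates settle in a neighbourhood of $x^\star$, as the theory below predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
# grad f_i(x) = a_i (a_i^T x - y_i)
grad = lambda i, x: X[i] * (X[i] @ x - y[i])

x = np.zeros(d)
eta = 0.01                               # constant step: converges to a
for t in range(20000):                   # neighbourhood of size O(eta sigma_*^2 / mu)
    i = rng.integers(n)                  # i_t uniform on {0, ..., n-1}
    x -= eta * grad(i, x)

x_star = np.linalg.lstsq(X, y, rcond=None)[0]   # exact minimizer of f
print(np.linalg.norm(x - x_star))        # small but not zero: the SGD floor
```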
SGD: One Step Analysis

Let $E_t[\cdot]$ denote $E[\cdot \mid x_t]$; then
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] = E_t\big[\|x_t - \eta g_t - x^\star\|^2\big] = \|x_t - x^\star\|^2 - 2\eta\langle E_t[g_t], x_t - x^\star \rangle + \eta^2 E_t\big[\|g_t\|^2\big].$$
Since the gradient is sampled uniformly at random, we have
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] = \|x_t - x^\star\|^2 - 2\eta\Big\langle \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_t), x_t - x^\star \Big\rangle + \eta^2\,\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t)\|^2$$
$$= \|x_t - x^\star\|^2 - \eta\langle \nabla f(x_t), x_t - x^\star \rangle - \eta\Big\langle \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_t), x_t - x^\star \Big\rangle + \eta^2\,\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t)\|^2.$$
The first two terms are exactly the same as for gradient descent $\implies$ we use $\mu$-convexity.
SGD: One Step Analysis

$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\Big\langle \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_t), x_t - x^\star \Big\rangle + \eta^2\,\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t)\|^2$$

What about the second term?

For GD, we used $\langle \nabla f(x) - \nabla f(x^\star), x - x^\star \rangle \ge \frac{1}{L}\|\nabla f(x) - \nabla f(x^\star)\|^2$; can we use it here? Yes:
$$\langle \nabla f_i(x_t) - \nabla f_i(x^\star), x_t - x^\star \rangle \ge \frac{1}{L}\|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2.$$
We have almost what we need. We need to further decompose $\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t)\|^2$, e.g., by Young's inequality ($\|a + b\|^2 \le (1 + \alpha)\|a\|^2 + (1 + \frac{1}{\alpha})\|b\|^2$, here with $\alpha = 1$):
$$\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t)\|^2 \le \frac{2}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2 + \frac{2}{n}\sum_{i=1}^n \|\nabla f_i(x^\star)\|^2.$$
SGD: One Step Analysis

Let us denote $\sigma_\star^2 = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star)\|^2$. Putting it all together, we obtain
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\Big(\frac{1}{2L} - \eta\Big)\frac{2}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2 + 2\eta^2\sigma_\star^2.$$
Let us take $\eta \le \frac{1}{2L}$; therefore
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 + 2\eta^2\sigma_\star^2.$$
SGD: Rate of Convergence

Can we guarantee a rate for $E\big[\|x_T - x^\star\|^2\big]$ using the obtained recursion?
$$E\big[\|x_T - x^\star\|^2\big] = E\Big[E_{T-1}\big[\|x_T - x^\star\|^2\big]\Big] \le (1 - \eta\mu)\,E\big[\|x_{T-1} - x^\star\|^2\big] + 2\eta^2\sigma_\star^2.$$
Recursively applying the above leads to
$$E\big[\|x_T - x^\star\|^2\big] \le (1 - \eta\mu)^T\|x_0 - x^\star\|^2 + 2\eta^2\sigma_\star^2 \sum_{t=0}^{T-1}(1 - \eta\mu)^t$$
$$\le (1 - \eta\mu)^T\|x_0 - x^\star\|^2 + 2\eta^2\sigma_\star^2 \sum_{t=0}^{\infty}(1 - \eta\mu)^t = (1 - \eta\mu)^T\|x_0 - x^\star\|^2 + \frac{2\eta\sigma_\star^2}{\mu}.$$
SGD: Theorem

Convergence of SGD
Let $\{f_i\}_{i=1}^n$ be $L$-smooth and $\mu$-convex; then SGD with $\eta \le \frac{1}{2L}$ converges as follows:
$$E\big[\|x_T - x^\star\|^2\big] \le (1 - \eta\mu)^T\|x_0 - x^\star\|^2 + \frac{2\eta\sigma_\star^2}{\mu},$$
where $\sigma_\star^2 = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star)\|^2$.

Questions
• How does this compare to GD? (Hint: Think about the smoothness constant $L$.)
• What is the rate of convergence in terms of $E\big[\|x_T - x^\star\|^2\big] \le \varepsilon$ if $\eta = \frac{1}{2L}$?
• Can we remove the neighbourhood?
SGD: Complexity

If we can make
$$(1 - \eta\mu)^T\|x_0 - x^\star\|^2 \le \frac{\varepsilon}{2} \quad \text{and} \quad \frac{2\eta\sigma_\star^2}{\mu} \le \frac{\varepsilon}{2},$$
then $E\big[\|x_T - x^\star\|^2\big] \le \varepsilon$. For the first term, recall GD:
$$T \ge \frac{1}{\eta\mu}\log\Big(\frac{2\|x_0 - x^\star\|^2}{\varepsilon}\Big) \implies (1 - \eta\mu)^T\|x_0 - x^\star\|^2 \le \frac{\varepsilon}{2}.$$
For the second term, $\eta \le \frac{\varepsilon\mu}{4\sigma_\star^2}$.

Putting it all together with $\eta \le \frac{1}{2L}$:
$$T \ge 4\max\Big\{\kappa, \frac{\sigma_\star^2}{\mu^2\varepsilon}\Big\}\log\Big(\frac{2\|x_0 - x^\star\|^2}{\varepsilon}\Big) \implies E\big[\|x_T - x^\star\|^2\big] \le \varepsilon.$$
SGD: Complexity

SGD Complexity
Let $\{f_i\}_{i=1}^n$ be $L$-smooth and $\mu$-convex; then SGD has the following worst-case complexity:
$$T = O\Big(\Big(\kappa + \frac{\sigma_\star^2}{\mu^2\varepsilon}\Big)\log\frac{1}{\varepsilon}\Big) \implies E\big[\|x_T - x^\star\|^2\big] \le \varepsilon.$$

Questions
• When and why is SGD a good method?
• Can we do better?
Problem with SGD

What is the main bottleneck of the SGD analysis? We had
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\Big\langle \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_t), x_t - x^\star \Big\rangle + \eta^2\,\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t)\|^2,$$
while we wanted to apply
$$\langle \nabla f_i(x_t) - \nabla f_i(x^\star), x_t - x^\star \rangle \ge \frac{1}{L}\|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2.$$
Can we adjust the update so that we can use the above directly, without using Young's inequality?
SGD-Star

Objective: find $x^\star \in \arg\min_{x \in \mathbb{R}^d} \Big[ f(x) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n f_i(x) \Big]$

Assumptions
• Each $f_i$ is $L$-smooth and $\mu$-convex (as before).

Algorithm: $x_{t+1} = x_t - \eta g_t$, where $g_t = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(x^\star)$ and $i_t$ is selected uniformly at random from $[n] = \{1, 2, \ldots, n\}$.
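SGD-Star is a thought experiment, since it needs the $\nabla f_{i}(x^\star)$'s; but on a synthetic problem we can compute $x^\star$ exactly and verify that the shifted update has no variance floor (a sketch of our own; data, step size, and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

x_star = np.linalg.lstsq(X, y, rcond=None)[0]   # "oracle" access to x*
grad = lambda i, x: X[i] * (X[i] @ x - y[i])    # grad f_i for least squares

x = np.zeros(d)
eta = 0.01
for t in range(20000):
    i = rng.integers(n)
    x -= eta * (grad(i, x) - grad(i, x_star))   # shifted gradient g_t

print(np.linalg.norm(x - x_star))               # -> 0: no noise floor
```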
SGD-Star: One Step Analysis

Let $E_t[\cdot]$ denote $E[\cdot \mid x_t]$; then
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] = E_t\big[\|x_t - \eta g_t - x^\star\|^2\big] = \|x_t - x^\star\|^2 - 2\eta\langle E_t[g_t], x_t - x^\star \rangle + \eta^2 E_t\big[\|g_t\|^2\big].$$
Since the gradient is sampled uniformly at random, we have
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] = \|x_t - x^\star\|^2 - 2\eta\Big\langle \frac{1}{n}\sum_{i=1}^n \big(\nabla f_i(x_t) - \nabla f_i(x^\star)\big), x_t - x^\star \Big\rangle + \eta^2\,\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2$$
$$= \|x_t - x^\star\|^2 - \eta\langle \nabla f(x_t), x_t - x^\star \rangle - \eta\Big\langle \frac{1}{n}\sum_{i=1}^n \big(\nabla f_i(x_t) - \nabla f_i(x^\star)\big), x_t - x^\star \Big\rangle + \eta^2\,\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2.$$
SGD-Star: One Step Analysis

We can apply the same analysis as for GD:
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\Big(\frac{1}{L} - \eta\Big)\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2.$$
Therefore, for $\eta \le \frac{1}{L}$, we have
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2.$$
SGD-Star: Theorem

Convergence of SGD-Star
Let $\{f_i\}_{i=1}^n$ be $L$-smooth and $\mu$-convex; then SGD-Star with $\eta \le \frac{1}{L}$ converges as follows:
$$E\big[\|x_T - x^\star\|^2\big] \le (1 - \eta\mu)^T\|x_0 - x^\star\|^2.$$

Questions
• How does this compare to GD/SGD?
• What is the rate of convergence in terms of $E\big[\|x_T - x^\star\|^2\big] \le \varepsilon$? $O\big(\kappa \log \frac{1}{\varepsilon}\big)$.
• What about the $\nabla f_i(x^\star)$'s?
• What is the variance of SGD-Star at the optimum?
Variance Reduction: Learning the Optimal Shift

Problem with SGD-Star
• The $\nabla f_i(x^\star)$'s are unknown!

Properties of the SGD-Star estimator $g_t$
• $E_t[g_t] = \nabla f(x_t)$.
• If $x_t = x^\star$, then $E_t[g_t] = \nabla f(x^\star) = 0$ and $V_t(g_t) = 0$.

Can we construct a practical $g_t$ with similar properties to the SGD-Star update?

Control Variate
• $g_t = \nabla f_{i_t}(x_t) + c_t$,
• $E_t[c_t] \to 0$ and $V_t(g_t) \to 0$ as $t \to \infty$.
Variance Reduction: Learning the Optimal Shift

How do we construct $c_t$?

Stochastic Variance Reduced Gradient (SVRG) [Johnson and Zhang, 2013]
$$g_t = \nabla f_{i_t}(x_t) - \big(\nabla f_{i_t}(y_t) - \nabla f(y_t)\big)$$

When is $g_t$ a control variate? How do we pick $y_t$?

Other approaches
• SAG [Schmidt et al., 2017], SAGA [Defazio et al., 2014], SARAH [Nguyen et al., 2017], SPIDER [Fang et al., 2018].
Loopless SVRG (L-SVRG)

Objective: find $x^\star \in \arg\min_{x \in \mathbb{R}^d} \Big[ f(x) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n f_i(x) \Big]$

Assumptions
• Each $f_i$ is $L$-smooth and $\mu$-convex (as before).

Algorithm: $x_{t+1} = x_t - \eta g_t$, where $g_t = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(y_t) + \nabla f(y_t)$, $i_t$ is selected uniformly at random from $[n]$, $y_0 = x_0$, and
$$y_{t+1} = \begin{cases} x_t & \text{w.p. } p, \\ y_t & \text{otherwise.} \end{cases}$$
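A minimal L-SVRG sketch (ours, on the same synthetic least-squares setup; step size and iteration count are arbitrary illustrative choices, with $p = 1/n$ as suggested by the theorem below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

grad = lambda i, x: X[i] * (X[i] @ x - y[i])    # grad f_i
full_grad = lambda x: X.T @ (X @ x - y) / n     # grad f

x = np.zeros(d)
y_ref = x.copy()                                # anchor point y_t
g_ref = full_grad(y_ref)                        # stored grad f(y_t)
eta, p = 0.005, 1.0 / n                         # p = 1/n: O(1) expected cost
for t in range(40000):
    i = rng.integers(n)
    g = grad(i, x) - grad(i, y_ref) + g_ref     # control-variate estimator g_t
    x = x - eta * g
    if rng.random() < p:                        # coin flip: refresh the anchor
        y_ref, g_ref = x.copy(), full_grad(x)

x_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(x - x_star))               # variance reduced: no noise floor
```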
L-SVRG: One Round Progress in $x_t$

Let $E_t[\cdot]$ denote $E[\cdot \mid x_t]$; then
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] = E_t\big[\|x_t - \eta g_t - x^\star\|^2\big] = \|x_t - x^\star\|^2 - 2\eta\langle E_t[g_t], x_t - x^\star \rangle + \eta^2 E_t\big[\|g_t\|^2\big].$$
Since $E_t[g_t] = \nabla f(x_t)$, we have
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] = \|x_t - x^\star\|^2 - 2\eta\langle \nabla f(x_t), x_t - x^\star \rangle + \eta^2 E_t\big[\|g_t\|^2\big].$$
By $\mu$-convexity,
$$E_t\big[\|x_{t+1} - x^\star\|^2\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\langle \nabla f(x_t), x_t - x^\star \rangle + \eta^2 E_t\big[\|g_t\|^2\big].$$
What about $g_t$?
L-SVRG: Bound on $g_t$

By definition,
$$E_t\big[\|g_t\|^2\big] = E_t\big[\|\nabla f_{i_t}(x_t) - \nabla f_{i_t}(y_t) + \nabla f(y_t)\|^2\big].$$
Similarly to the SGD analysis, we want to obtain $\|\nabla f_{i_t}(x_t) - \nabla f_{i_t}(x^\star)\|^2$ so that we can apply smoothness. Therefore, we apply Young's inequality:
$$E_t\big[\|g_t\|^2\big] \le 2E_t\big[\|\nabla f_{i_t}(x_t) - \nabla f_{i_t}(x^\star)\|^2\big] + 2E_t\big[\|\nabla f_{i_t}(x^\star) - \nabla f_{i_t}(y_t) + \nabla f(y_t)\|^2\big]$$
$$= \frac{2}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2 + \frac{2}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f_i(y_t) + \nabla f(y_t)\|^2.$$
We denote $D_t = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f_i(y_t) + \nabla f(y_t)\|^2$, which is the error due to the imprecise shift.
L-SVRG: One Round Progress in $D_t$

Due to the probabilistic update of $y_{t+1}$, i.e.,
$$y_{t+1} = \begin{cases} x_t & \text{w.p. } p, \\ y_t & \text{otherwise,} \end{cases}$$
we have
$$E_t[D_{t+1}] = \frac{p}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f_i(x_t) + \nabla f(x_t)\|^2 + (1 - p)D_t.$$
Since $V(X) \le E\big[\|X\|^2\big]$ (applied here to $X = \nabla f_i(x_t) - \nabla f_i(x^\star)$ with $i$ uniform, whose mean is $\nabla f(x_t)$), we have
$$E_t[D_{t+1}] \le \frac{p}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f_i(x_t)\|^2 + (1 - p)D_t.$$
How can we combine these results?
L-SVRG: One Step Analysis

Summing the bounds on $g_t$ and $D_{t+1}$ with a weight $r_{t+1} \in \mathbb{R}_+$, we obtain
$$E_t\big[\|x_{t+1} - x^\star\|^2 + r_{t+1} D_{t+1}\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 - \eta\langle \nabla f(x_t), x_t - x^\star \rangle$$
$$+ \eta^2\Big(\frac{2}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2 + 2D_t\Big) + \frac{p\,r_{t+1}}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f_i(x_t)\|^2 + (1 - p)\,r_{t+1} D_t.$$
Applying $L$-smoothness of the $f_i$'s (via cocoercivity on the inner-product term) leads to
$$E_t\big[\|x_{t+1} - x^\star\|^2 + r_{t+1} D_{t+1}\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 + \Big(\frac{2\eta^2}{r_{t+1}} + (1 - p)\Big) r_{t+1} D_t$$
$$- \eta\Big(\frac{1}{L} - 2\eta - \frac{p\,r_{t+1}}{\eta}\Big)\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2.$$
How should we select $p$, $\eta$ and the $r_t$'s?
L-SVRG: One Step Analysis

$$E_t\big[\|x_{t+1} - x^\star\|^2 + r_{t+1} D_{t+1}\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 + \Big(\frac{2\eta^2}{r_{t+1}} + (1 - p)\Big) r_{t+1} D_t$$
$$- \eta\Big(\frac{1}{L} - 2\eta - \frac{p\,r_{t+1}}{\eta}\Big)\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_t) - \nabla f_i(x^\star)\|^2.$$
If $\frac{1}{2L} \ge 2\eta$ and $\frac{1}{2L} \ge \frac{p\,r_{t+1}}{\eta}$, then
$$E_t\big[\|x_{t+1} - x^\star\|^2 + r_{t+1} D_{t+1}\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 + \Big(\frac{2\eta^2}{r_{t+1}} + (1 - p)\Big) r_{t+1} D_t.$$
If $\frac{p}{2} \ge \frac{2\eta^2}{r_{t+1}}$ and $r_{t+1} = r$ for all $t$, then
$$E_t\big[\|x_{t+1} - x^\star\|^2 + r D_{t+1}\big] \le (1 - \eta\mu)\|x_t - x^\star\|^2 + \Big(1 - \frac{p}{2}\Big) r D_t \le \max\Big\{1 - \eta\mu,\, 1 - \frac{p}{2}\Big\}\big(\|x_t - x^\star\|^2 + r D_t\big).$$
L-SVRG: Choice of Parameters

We require the following inequalities to hold:
$$\frac{1}{2L} \ge 2\eta, \qquad \frac{1}{2L} \ge \frac{pr}{\eta}, \qquad \frac{p}{2} \ge \frac{2\eta^2}{r}.$$
To obtain the best rate for $E\big[\|x_T - x^\star\|^2\big]$, we want $r$ to be as small as possible. Therefore, let us select $r = \frac{4\eta^2}{p}$.
Combining the second inequality with $r = \frac{4\eta^2}{p}$ leads to
$$\frac{1}{2L} \ge \frac{p \cdot 4\eta^2}{p\,\eta} = 4\eta.$$
Therefore, $\eta \le \frac{1}{8L}$, $r = \frac{4\eta^2}{p}$ and $p \in (0, 1]$ satisfy all three inequalities.
L-SVRG: Theorem

Convergence of L-SVRG
Let $\{f_i\}_{i=1}^n$ be $L$-smooth and $\mu$-convex; then L-SVRG with $\eta \le \frac{1}{8L}$ and $p \in (0, 1]$ converges as follows:
$$E\big[\|x_T - x^\star\|^2\big] \le \max\Big\{1 - \eta\mu,\, 1 - \frac{p}{2}\Big\}^T\Big(\|x_0 - x^\star\|^2 + \frac{1}{16L^2 p}\,D_0\Big),$$
where $D_0 = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star) - \nabla f_i(x_0) + \nabla f(x_0)\|^2$.

Questions
• How does this compare to GD/SGD?
• What is the rate of convergence in terms of $E\big[\|x_T - x^\star\|^2\big] \le \varepsilon$? $O\big((\frac{1}{p} + \kappa)\log\frac{1}{p\varepsilon}\big)$
• What is the expected computational cost per iteration? $pn + (1 - p)$
• What are good choices of $p$? In practice $p = \frac{1}{n}$ is a popular choice, with total complexity $O\big((n + \kappa)\log\frac{n}{\varepsilon}\big)$.
Comparison

Figure 1: Stochastic vs. deterministic methods

Different Type of Variance Reduction

Why does SVRG-type variance reduction not work well for deep learning? This is an open problem, partially explained in [Defazio and Bottou, 2019].

Momentum SGD
$$g_t = \beta g_{t-1} + (1 - \beta)\nabla f_{i_t}(x_t), \qquad g_0 = 0.$$
Variance reduction via averaging:
$$g_t = (1 - \beta)\sum_{k=0}^{t} \beta^{t-k}\,\nabla f_{i_k}(x_k).$$
Question: What is the sum of the weights? $1 - \beta^{t+1}$.

Problem:
$$E_t[g_t] \ne \nabla f(x_t).$$
In the deterministic case, this is called Polyak's heavy ball method, which enjoys an accelerated rate. A minimal sketch follows below.
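A sketch of the momentum update (ours, on the same synthetic least-squares setup; $\eta = 0.01$ and $\beta = 0.9$ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
grad = lambda i, x: X[i] * (X[i] @ x - y[i])

x, g = np.zeros(d), np.zeros(d)              # g_0 = 0
eta, beta = 0.01, 0.9
for t in range(20000):
    i = rng.integers(n)
    g = beta * g + (1 - beta) * grad(i, x)   # exponential moving average of
    x = x - eta * g                          # past stochastic gradients

x_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(x - x_star))            # converges to a neighbourhood
```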
Curvature Adaptive Methods

Question: What is a good measure of curvature? The Hessian.

Newton's Method
$$f(x_{t+1}) \approx f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{1}{2}(x_{t+1} - x_t)^\top \nabla^2 f(x_t)\,(x_{t+1} - x_t).$$
Minimizing the right-hand side leads to
$$x_{t+1} = x_t - \nabla^2 f(x_t)^{-1}\nabla f(x_t).$$

Drawbacks
• Expensive to evaluate the inverse.
• Requires a full gradient computation.
• Only local convergence is guaranteed.
Curvature Adaptive Methods

ADAM
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\nabla f_{i_t}(x_t) \quad \text{(first moment)}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\big(\nabla f_{i_t}(x_t)\big)^2 \quad \text{(second moment, coordinate-wise)}$$
Update:
$$x_{t+1} = x_t - \frac{\eta}{\sqrt{v_t/(1 - \beta_2^t)} + \varepsilon} \cdot \frac{m_t}{1 - \beta_1^t}.$$

Similar Methods
• AdaGrad [Duchi et al., 2011] ($\beta_1 = 0$, $v_t = \sum_{k=1}^t (\nabla f_{i_k}(x_k))^2$),
• AdaDelta [Zeiler, 2012] ($\beta_1 = 0$).

ADAM can diverge even for simple convex problems [Reddi et al., 2019]!
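A sketch of the ADAM update (ours, on the same synthetic least-squares setup; the hyperparameters are the commonly used defaults, and the iteration count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
grad = lambda i, x: X[i] * (X[i] @ x - y[i])

x = np.zeros(d)
m, v = np.zeros(d), np.zeros(d)
eta, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
for t in range(1, 20001):
    g = grad(rng.integers(n), x)
    m = beta1 * m + (1 - beta1) * g               # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2          # second moment (coordinate-wise)
    m_hat = m / (1 - beta1 ** t)                  # bias corrections
    v_hat = v / (1 - beta2 ** t)
    x = x - eta * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate step sizes

x_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(x - x_star))
```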
Practical Sampling: Random Reshuffling

Objective: find $x^\star \in \arg\min_{x \in \mathbb{R}^d} \Big[ f(x) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n f_i(x) \Big]$

Assumptions
• Each $f_i$ is $L$-smooth and $\mu$-convex.

Algorithm 1 Random Reshuffling (RR)
1: Input: $\eta > 0$, $x_0 = x_0^0 \in \mathbb{R}^d$
2: for $t = 1, \ldots$ do
3:   Sample a random permutation $\{\pi_i\}_{i=0}^{n-1}$ of $[n]$
4:   for $i = 0$ to $n - 1$ do
5:     $x_t^{i+1} = x_t^i - \eta\,\nabla f_{\pi_i}(x_t^i)$
6:   end for
7:   $x_{t+1} = x_t^n$
8: end for
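A sketch of Algorithm 1 (ours, on the same synthetic least-squares setup; 100 epochs and $\eta = 0.01$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
grad = lambda i, x: X[i] * (X[i] @ x - y[i])

x = np.zeros(d)
eta = 0.01
for epoch in range(100):
    perm = rng.permutation(n)        # fresh permutation each epoch
    for i in perm:                   # one pass over every data point
        x -= eta * grad(i, x)

x_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(x - x_star))
```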
Random Reshuffling

Strengths
• In each epoch, we see all data points.
• Cheaper to sample.

Limitations
• Bias, i.e., $E_{i,t}\big[\nabla f_{\pi_i}(x_t^i)\big] \ne \nabla f(x_t^i)$ (the update is unbiased only for $i = 0$).

Convergence Example
See whiteboard.
RR: Handling Bias

Auxiliary Sequence
A new notion to handle the bias:
$$x_\star^0 \overset{\text{def}}{=} x^\star, \qquad x_\star^{i+1} \overset{\text{def}}{=} x_\star^i - \eta\,\nabla f_{\pi_i}(x^\star).$$
What is $x_\star^n$?
$$x_\star^n = x^\star - \eta\sum_{i=0}^{n-1}\nabla f_{\pi_i}(x^\star) = x^\star - \eta\, n\,\nabla f(x^\star) = x^\star.$$
RR: One Step Analysis

Let $E_{i,t}[\cdot]$ denote $E[\cdot \mid x_t^i]$; then
$$E_{i,t}\big[\|x_t^{i+1} - x_\star^{i+1}\|^2\big] = E_{i,t}\big[\|x_t^i - \eta\nabla f_{\pi_i}(x_t^i) - x_\star^i + \eta\nabla f_{\pi_i}(x^\star)\|^2\big]$$
$$= \|x_t^i - x_\star^i\|^2 - 2\eta\,E_{i,t}\big[\langle \nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x^\star), x_t^i - x_\star^i \rangle\big] + \eta^2 E_{i,t}\big[\|\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x^\star)\|^2\big].$$

Three point identity (with the Bregman divergence $D_h(x, y) \overset{\text{def}}{=} h(x) - h(y) - \langle \nabla h(y), x - y \rangle$):
$$\langle \nabla h(a) - \nabla h(b), b - c \rangle = D_h(c, a) - D_h(c, b) - D_h(b, a).$$
Applying the three point identity to the middle term leads to
$$E_{i,t}\big[\|x_t^{i+1} - x_\star^{i+1}\|^2\big] = \|x_t^i - x_\star^i\|^2 - 2\eta\,E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x_t^i) + D_{f_{\pi_i}}(x_t^i, x^\star)\big]$$
$$+ \eta^2 E_{i,t}\big[\|\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x^\star)\|^2\big] + 2\eta\,E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x^\star)\big].$$
RR: One Step Analysis

$$E_{i,t}\big[\|x_t^{i+1} - x_\star^{i+1}\|^2\big] = \|x_t^i - x_\star^i\|^2 - 2\eta\,E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x_t^i)\big] - 2\eta\,E_{i,t}\big[D_{f_{\pi_i}}(x_t^i, x^\star)\big]$$
$$+ \eta^2 E_{i,t}\big[\|\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x^\star)\|^2\big] + 2\eta\,E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x^\star)\big].$$
Using $L$-smoothness and $\mu$-convexity of $f_{\pi_i}$, we have
$$E_{i,t}\big[\|x_t^{i+1} - x_\star^{i+1}\|^2\big] \le (1 - \eta\mu)\|x_t^i - x_\star^i\|^2 - \eta\Big(\frac{1}{L} - \eta\Big)E_{i,t}\big[\|\nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(x^\star)\|^2\big] + 2\eta\,E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x^\star)\big].$$
If $\eta \le \frac{1}{L}$ and $\sigma_{\text{shuffle}}^2 = \max_{i \in [n]} E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x^\star)\big]$, then
$$E_{i,t}\big[\|x_t^{i+1} - x_\star^{i+1}\|^2\big] \le (1 - \eta\mu)\|x_t^i - x_\star^i\|^2 + 2\eta\,\sigma_{\text{shuffle}}^2.$$
RR: Theorem

Convergence of RR
Let $\{f_i\}_{i=1}^n$ be $L$-smooth and $\mu$-convex; then RR with $\eta \le \frac{1}{L}$ converges as follows:
$$E\big[\|x_T - x^\star\|^2\big] \le (1 - \eta\mu)^{nT}\|x_0 - x^\star\|^2 + \frac{2\sigma_{\text{shuffle}}^2}{\mu},$$
where $\sigma_{\text{shuffle}}^2 = \max_{i \in [n]} E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x^\star)\big]$.

Questions
• How does this compare to SGD?
• Can we remove the neighbourhood?
RR: Variance

By $L$-smoothness,
$$\sigma_{\text{shuffle}}^2 = \max_{i \in [n]} E_{i,t}\big[D_{f_{\pi_i}}(x_\star^i, x^\star)\big] \le \frac{L}{2}\max_{i \in [n]} E\big[\|x_\star^i - x^\star\|^2\big] = \frac{\eta^2 L}{2}\max_{i \in [n]} E\Bigg[\Big\|\sum_{j=0}^{i-1}\nabla f_{\pi_j}(x^\star)\Big\|^2\Bigg]$$
$$= \frac{\eta^2 L}{2}\max_{i \in [n]} \frac{i(n - i)}{n - 1}\,\sigma_\star^2 \le \frac{\eta^2 n L}{4}\,\sigma_\star^2.$$
The last equality follows from the formula for sampling without replacement [Mishchenko et al., 2020, Lemma 1] (we will discuss sampling techniques in detail later).
RR: Updated Theorem

Convergence of RR
Let $\{f_i\}_{i=1}^n$ be $L$-smooth and $\mu$-convex; then RR with $\eta \le \frac{1}{L}$ converges as follows:
$$E\big[\|x_T - x^\star\|^2\big] \le (1 - \eta\mu)^{nT}\|x_0 - x^\star\|^2 + \frac{\eta^2 n L\,\sigma_\star^2}{2\mu},$$
where $\sigma_\star^2 = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^\star)\|^2$.

Questions
• Complexity? Exercise: $O\Big(\Big(\kappa + \sqrt{\frac{nL}{\mu^2\varepsilon}}\Big)\log\frac{1}{\varepsilon}\Big)$.
• How does this compare to SGD in terms of $\varepsilon$?
Literature Review

Related Papers
• GD [Nesterov, 2003]
• SGD [Gower et al., 2019]
• SGD-Star [Gorbunov et al., 2020]
• L-SVRG [Kovalev et al., 2020]
• Momentum [Liu et al., 2020]
• Adam [Kingma and Ba, 2014], Yogi [Reddi et al., 2019]
• RR [Mishchenko et al., 2020]

Questions?

References i

Defazio, A., Bach, F., and Lacoste-Julien, S. (2014).


SAGA: A fast incremental gradient method with support for
non-strongly convex composite objectives.
Advances in neural information processing systems, 27.
Defazio, A. and Bottou, L. (2019).
On the ineffectiveness of variance reduced optimization for
deep learning.
Advances in Neural Information Processing Systems, 32.
Duchi, J., Hazan, E., and Singer, Y. (2011).
Adaptive subgradient methods for online learning and
stochastic optimization.
Journal of machine learning research, 12(7).
References ii

Fang, C., Li, C. J., Lin, Z., and Zhang, T. (2018).


SPIDER: Near-optimal non-convex optimization via stochastic
path-integrated differential estimator.
Advances in Neural Information Processing Systems, 31.
Gorbunov, E., Hanzely, F., and Richtárik, P. (2020).
A unified theory of SGD: Variance reduction, sampling,
quantization and coordinate descent.
In International Conference on Artificial Intelligence and
Statistics, pages 680–690. PMLR.
Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., and
Richtárik, P. (2019).
SGD: General analysis and improved rates.
In International Conference on Machine Learning, pages
5200–5209. PMLR.
References iii

Johnson, R. and Zhang, T. (2013).


Accelerating stochastic gradient descent using predictive
variance reduction.
Advances in neural information processing systems, 26.
Kingma, D. P. and Ba, J. (2014).
Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980.
Kovalev, D., Horváth, S., and Richtárik, P. (2020).
Don’t jump through hoops and remove those loops: SVRG and
Katyusha are better without the outer loop.
In Algorithmic Learning Theory, pages 451–467. PMLR.
References iv

Liu, Y., Gao, Y., and Yin, W. (2020).


An improved analysis of stochastic gradient descent with
momentum.
Advances in Neural Information Processing Systems,
33:18261–18271.
Mishchenko, K., Khaled, A., and Richtárik, P. (2020).
Random reshuffling: Simple analysis with vast improvements.
Advances in Neural Information Processing Systems,
33:17309–17320.
Nesterov, Y. (2003).
Introductory lectures on convex optimization: A basic course,
volume 87.
Springer Science & Business Media.
References v

Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. (2017).


SARAH: A novel method for machine learning problems using
stochastic recursive gradient.
In International Conference on Machine Learning, pages
2613–2621. PMLR.
Reddi, S. J., Kale, S., and Kumar, S. (2019).
On the convergence of Adam and beyond.
arXiv preprint arXiv:1904.09237.
Schmidt, M., Le Roux, N., and Bach, F. (2017).
Minimizing finite sums with the stochastic average gradient.
Mathematical Programming, 162(1):83–112.
Zeiler, M. D. (2012).
AdaDelta: An adaptive learning rate method.
arXiv preprint arXiv:1212.5701.
