Lecture 7 - Optimization Part I
Optimization - I
Dr. Rajdip Nayek
Block V, Room 418D
Department of Applied Mechanics
Indian Institute of Technology Delhi
E-mail: [email protected]
Course website: https://www.csccm.in/courses/deep-learning-for-mechanics
Key components of any ML algorithm
Overview
• Training examples: $(\boldsymbol{x}^{(1)}, t^{(1)}), (\boldsymbol{x}^{(2)}, t^{(2)}), \ldots, (\boldsymbol{x}^{(N)}, t^{(N)})$
[Figure: loss surface plotted over the weights 𝑤1 and 𝑤2]
Basics: Visualizing a loss surface
• Minima are points that achieve the lowest loss within a small neighbourhood
• Some of these are only local minima
• One is the global minimum: the point that achieves the lowest loss over all values of 𝜽
[Figure: loss surface over 𝑤1 and 𝑤2 with a local minimum and the global minimum marked]
Basics: Contour plots
• Each contour is the set of parameter values at which the loss function takes one particular value
• A small distance between the contours indicates a steep slope along
that direction
• A large distance between the contours indicates a gentle slope along that
direction
Basics: Contour plots
[Figure: contour plot of the loss over 𝑤1 and 𝑤2]
• The magnitude of the gradient cannot be determined from the contour plot
• The gradient direction can: the gradient is always orthogonal (perpendicular) to the contours
• Recall: gradient descent moves in the direction opposite to the gradient
Overview
• Training examples: $(\boldsymbol{x}^{(1)}, t^{(1)}), (\boldsymbol{x}^{(2)}, t^{(2)}), \ldots, (\boldsymbol{x}^{(N)}, t^{(N)})$
• Compute the total gradient ∇𝜽 ℒ by averaging over all individual gradients for
every training example, and then update the parameters
• The algorithm goes over the entire data once before updating the parameters
• This is known as batch gradient descent (BGD), since we treat the entire training
set as a batch
Batch Gradient Descent
$$\nabla_{\boldsymbol{\theta}} \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\boldsymbol{\theta}} \ell^{(i)}, \qquad \boldsymbol{\theta}_{new} \leftarrow \boldsymbol{\theta}_{old} - \eta \, \nabla_{\boldsymbol{\theta}} \mathcal{L} \,\Big|_{\boldsymbol{\theta} = \boldsymbol{\theta}_{old}}$$
• Batch gradient descent treats the entire training set as a single batch
• Updates the parameter vector after each full pass (epoch) over the entire dataset
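Below is a minimal sketch of batch gradient descent in PyTorch; the linear model, squared-error loss, synthetic data, and learning rate are all illustrative assumptions, not taken from the slides:

```python
import torch

# Illustrative setup: a linear model, squared-error loss, and a synthetic dataset
model = torch.nn.Linear(2, 1)
criterion = torch.nn.MSELoss()
X = torch.randn(100, 2)               # N = 100 training inputs
T = torch.randn(100, 1)               # N = 100 targets
eta = 0.01                            # learning rate

for epoch in range(50):
    model.zero_grad()
    loss = criterion(model(X), T)     # loss averaged over the entire training set
    loss.backward()                   # gradient of the average loss
    with torch.no_grad():
        for p in model.parameters():
            p -= eta * p.grad         # one parameter update per full pass (epoch)
```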
Stochastic Gradient Descent
• Stochastic gradient descent: we can use a noisy (or stochastic) estimate of the
gradient from a single training example to update the parameter vector 𝜽
$$\nabla_{\boldsymbol{\theta}} \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\boldsymbol{\theta}} \ell^{(i)}, \qquad \boldsymbol{\theta}_{new} \leftarrow \boldsymbol{\theta}_{old} - \eta \, \nabla_{\boldsymbol{\theta}} \ell^{(i)} \,\Big|_{\boldsymbol{\theta} = \boldsymbol{\theta}_{old}}$$
• The algorithm updates the parameters for every single data point
• Pros: SGD can make significant progress before it has even looked at all the data!
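The same illustrative setup rewritten as stochastic gradient descent: the only change is that each update now uses the gradient of a single example.

```python
import torch

model = torch.nn.Linear(2, 1)
criterion = torch.nn.MSELoss()
X, T = torch.randn(100, 2), torch.randn(100, 1)
eta = 0.01

for epoch in range(50):
    for i in torch.randperm(len(X)).tolist():         # visit examples in random order
        model.zero_grad()
        loss = criterion(model(X[i:i+1]), T[i:i+1])   # loss on a single example
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= eta * p.grad                     # N parameter updates per epoch
```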
Stochastic Gradient Descent
• We see many fluctuations. Why? Because we are making greedy decisions
• Each data point is trying to push the parameters in a direction most favorable to
it (without being aware of how the parameter update affects other points)
• A parameter update which is locally favorable to one point may harm other points (it's almost as if the data points are competing with each other)
• There is no guarantee that each local greedy move will reduce the global error
Mini-batch Gradient Descent
• Compute the gradients on a medium-sized set of training examples, called a
mini-batch
• Note that the algorithm updates the parameters after it sees 𝐵 data points, where 𝐵 is the mini-batch size
• The stochastic gradient estimates here are less noisy than SGD's single-example estimates
Mini-batch Gradient Descent performance
Algorithm        Batch size   Iterations in 1 epoch
Batch GD         𝑁            1
SGD              1            𝑁
Mini-batch GD    𝐵            𝑁/𝐵
PyTorch code: mini-batch SGD
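A minimal sketch of mini-batch SGD using PyTorch's `DataLoader` and `torch.optim.SGD`; the linear model and synthetic data are illustrative assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative synthetic dataset: N = 100 examples with 2 features each
X, T = torch.randn(100, 2), torch.randn(100, 1)
loader = DataLoader(TensorDataset(X, T), batch_size=10, shuffle=True)  # B = 10

model = torch.nn.Linear(2, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(50):
    for x_batch, t_batch in loader:                # N/B = 10 mini-batches per epoch
        optimizer.zero_grad()
        loss = criterion(model(x_batch), t_batch)  # loss averaged over the mini-batch
        loss.backward()
        optimizer.step()                           # one parameter update per mini-batch
```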
Gradient descent-based optimizers
Local Optima
Local optima
• If a function is convex, it has no spurious local minima: every local minimum is also a global minimum
[Figure: a convex loss ℒ(𝑤) with a single minimum vs. a non-convex loss with multiple minima]
Local optima
• Unfortunately, the loss of a neural network with hidden units is non-convex, mainly because of permutation symmetries
• Permutation symmetry: we can re-order the hidden units, together with their incoming and outgoing weights, and obtain a network with exactly the same loss function
Local optima
• The loss of a multilayer neural network with hidden units will have multiple local minima
• Workaround: one can try to mitigate the issue by using random restarts
• Random restarts: initialize the training from several random locations, run
the training procedure from each one, and pick whichever result has the
lowest cost
• Random restart is sometimes done in neural net training, but more often we
just ignore the problem
• In practice, the local optima are usually fine, so we think about training in
terms of converging faster to a local minimum, rather than finding the global
minimum
Gradient descent-based optimizers
Symmetry
Symmetry
• We start our optimization by initializing the values of weights and biases
• Suppose we initialize all the weights and biases of a neural network to all zeros
• All the hidden activations will be identical (indistinguishable features), and all the weights feeding into a given hidden unit will have zero derivatives
• No learning will occur
• If the initial weights are zero, every backpropagated error signal gets multiplied by a zero weight, so the weight gradients are zero. With zero gradients, the weights never change
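A small sketch that makes this concrete, using an illustrative two-layer network: with every weight and bias zeroed, the hidden activations are identical and the weight gradients vanish.

```python
import torch

# Illustrative two-layer network with all weights and biases set to zero
net = torch.nn.Sequential(
    torch.nn.Linear(3, 4), torch.nn.Tanh(),
    torch.nn.Linear(4, 1),
)
with torch.no_grad():
    for p in net.parameters():
        p.zero_()

x = torch.randn(8, 3)
print(torch.tanh(net[0](x)))   # all hidden activations are identical (all zero)

loss = ((net(x) - 1.0) ** 2).mean()
loss.backward()
print(net[0].weight.grad)      # all zeros: no learning signal reaches these weights
print(net[2].weight.grad)      # all zeros as well
```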
Saddle points
Saddle point
• A saddle point has $\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}} = \mathbf{0}$, even though we are not at a minimum
• We are at a location that is a minimum with respect to some directions and a maximum with respect to others (e.g., the origin of $\mathcal{L}(w_1, w_2) = w_1^2 - w_2^2$)
• When would saddle points be a problem?
• If we’re exactly on the saddle point, then we’re stuck
• If we’re slightly to the side, then we can get unstuck
Initialization strategies
• Don’t initialize all the weights and biases to zero!
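A minimal sketch of one common alternative, small random weights with zero biases, using standard `torch.nn.init` helpers (the scale 0.01 is an illustrative choice):

```python
import torch

layer = torch.nn.Linear(3, 4)
torch.nn.init.normal_(layer.weight, mean=0.0, std=0.01)  # small random weights break the symmetry
torch.nn.init.zeros_(layer.bias)                         # biases can start at zero once weights are random
```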
Gradient descent-based optimizers
Plateaus
Saturated and dead units
• A flat region of the loss surface is called a plateau
[Figure: loss surface over 𝑤1 and 𝑤2 with a flat plateau region]
• Caused by
• Saturated units whose activations are always near the ends of their dynamic
range/possible values
• Dead units whose activations are always close to zero
• Examples: the logistic activation (which saturates near 0 and 1) and the ReLU activation (which can die for negative inputs)
• ReLU units don’t saturate for positive 𝑧, which is convenient. Unfortunately, they can
die if 𝑧 is consistently negative, so it helps to initialize the biases to a small positive
value (such as 0.1)
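A one-line way to follow this suggestion in PyTorch, on an illustrative layer:

```python
import torch

layer = torch.nn.Linear(32, 32)           # layer feeding into ReLU units
torch.nn.init.constant_(layer.bias, 0.1)  # small positive bias keeps most ReLU inputs positive at the start
```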
Gradient descent-based optimizers
Ill-conditioning
Ill-conditioning
• Suppose that we have the following dataset for linear regression
𝑦 = 𝑤1 𝑥1 + 𝑤2 𝑥2 + 𝑏
𝑥1       𝑥2       𝑡
0.0023   1002.3   5.12
0.0025   1005.2   4.25
0.0016   988.7    3.45
0.0013   945.9    3.13
⋮        ⋮        ⋮

[Figure: loss contours over 𝑤1 and 𝑤2]
• 𝑤1 will take larger values, whereas 𝑤2 will take very small values
Ill-conditioning
• Which weight, 𝑤1 or 𝑤2 , will receive a larger gradient descent update?
• 𝑤2 receives a much larger update: its gradient scales with 𝑥2 (≈ 1000), while the gradient for 𝑤1 scales with 𝑥1 (≈ 0.002)
• But, here we want to take smaller steps in the steeper direction and larger steps in the
flatter direction
Ill-conditioning: Workarounds
• We want larger gradient descent updates in the direction of 𝑤1 and smaller in
the direction of 𝑤2
• One workaround: Normalization
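A minimal sketch of this normalization, standardizing each input feature to zero mean and unit variance (using the four rows of the table above as the data):

```python
import torch

X = torch.tensor([[0.0023, 1002.3],
                  [0.0025, 1005.2],
                  [0.0016,  988.7],
                  [0.0013,  945.9]])

mu, sigma = X.mean(dim=0), X.std(dim=0)
X_norm = (X - mu) / sigma      # each feature now has zero mean and unit variance
print(X_norm.mean(dim=0), X_norm.std(dim=0))
```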
Ill-conditioning: Workarounds
• Not just the inputs: the inputs to the hidden layers need to be centered too
• Hidden units may also have non-centered activations, and those are even
harder to deal with
• One trick: replace logistic units (which range from 0 to 1) with tanh units (which
range from -1 to 1)
[Figure: network schematic with hidden units ℎ1, ℎ2, …, ℎ32 and outputs 𝑦1, …, 𝑦10]
https://www.machinecurve.com/index.php/2021/03/29/batch-normalization-with-pytorch/
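Centering the hidden-layer inputs is what batch normalization automates; a minimal sketch with `torch.nn.BatchNorm1d` (the layer sizes are illustrative):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(2, 32),
    torch.nn.BatchNorm1d(32),   # re-centers and re-scales the 32 pre-activations over the mini-batch
    torch.nn.Tanh(),            # tanh keeps activations centered around zero, unlike the logistic
    torch.nn.Linear(32, 10),
)
y = model(torch.randn(16, 2))   # BatchNorm needs a mini-batch to compute statistics
```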
Recap of what we have seen
• Ill-conditioning (hard): addressed by batch normalization, momentum, Adam