week 06 - Deep Feedforward Networks - Optimization
Paolo Favaro
2
Contents
• Optimization in Feedforward Neural Networks
Note
3
Optimization
• We use optimization for training neural networks
Note
4
We discussed this aspect already in the Machine Learning review when introducing the generalization error. We would like to optimize the true error (taken over the full data
probability distribution), but we only have access to a finite amount of data (the empirical probability distribution).
5
Overfitting might be due to: high capacity models, too little data or data that is not a good representative of the true data distribution. This may lead to the model
memorizing the training set, for example.
Some other effects of interest are that sometimes the validation or test error might continue to decrease although the training error has reached zero (so it is not decreasing
anymore).
The continual minimization of the training loss keeps pushing the decision boundaries away from the classes. This results in a more robust classifier (which then generalizes
better and this means a lower and lower error in the test set).
6
Training algorithms do not fully minimize the surrogate function (i.e., they do not necessarily stop when the gradient of the 0-1 loss or of other surrogate losses is zero). They
typically stop earlier due to other criteria, such as early stopping (see regularization). An important criterion for stopping the minimization is the error on the validation
set. This is used to avoid overfitting.
7
Another example of loss functions written as a sum over the samples is shown in the next slide.
8
Note
10
Recall that we do not aim to minimize the training set error, but rather the validation/test set errors. Therefore, the computation of the gradient on a batch is not
necessarily a poor computational choice. Indeed, in practice one can observe that convergence is faster when using minibatches than when using the full gradient (when
keeping the overall execution time fixed).
Another interesting aspect is that the minibatch gradient can also be seen as an approximation of the gradient of the generalization error.
11
Purely stochastic methods are used with data streams while the others are used with fixed datasets.
12
Note
13
Note
14
Shuffling
• An unbiased estimate of the expected gradient
requires independent samples
In practice one goes through the same dataset multiple times. The time it takes to visit the whole dataset is called an epoch. The improvement of the model when training for
multiple epochs prevails over the loss due to the bias of seeing the same sample multiple times.
15
Challenges
Note
16
Ill-Conditioning
• Consider a second-order Taylor expansion of the cost
function at the current parameter estimate
J(θ − εg) ≈ J(θ) + ∇J(θ)⊤(θ − εg − θ) + ½ (θ − εg − θ)⊤ H (θ − εg − θ)
= J(θ) − ε g⊤g + ½ ε² g⊤Hg,   with g = ∇J(θ)
• The cost function decreases if
g⊤g > ½ ε g⊤Hg
which depends on the Hessian. Gradient descent may
not reach a critical point due to its ill-conditioning
The ill-conditioning of the Hessian may manifest itself at a point of high curvature. There, the second-order term ½ε²g⊤Hg may exceed the first-order term εg⊤g and cause an
increase of the cost instead of a decrease.
Also, the cost function may stop to decrease (or increase) when the two terms cancel each other, although the parameter has not reached any critical point
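To make this concrete, here is a small numerical check (my own illustration in plain NumPy, not from the slides): for a quadratic cost with an ill-conditioned Hessian, the step −εg stops decreasing the cost as soon as ½ε²g⊤Hg exceeds εg⊤g.

```python
import numpy as np

# Quadratic cost J(theta) = 0.5 * theta^T A theta with an ill-conditioned Hessian A.
A = np.diag([1.0, 100.0])          # condition number 100
theta = np.array([1.0, 1.0])
g = A @ theta                      # gradient of J at theta
H = A                              # Hessian of J (constant for a quadratic)

for eps in (1e-3, 1e-2, 2e-2, 5e-2):
    decrease = eps * g @ g                      # first-order term  eps * g^T g
    penalty = 0.5 * eps**2 * g @ H @ g          # curvature term    0.5 * eps^2 * g^T H g
    J_old = 0.5 * theta @ A @ theta
    J_new = 0.5 * (theta - eps * g) @ A @ (theta - eps * g)
    print(f"eps={eps:.3f}  eps*g'g={decrease:9.4f}  0.5*eps^2*g'Hg={penalty:9.4f}  "
          f"J: {J_old:.2f} -> {J_new:.2f}")
```

For the larger step sizes the curvature penalty dominates and the cost increases even though the gradient is far from zero.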
17
Ill-Conditioning
On the left, the norm of the gradient keeps increasing although the error rate on the validation set (in classification) keeps decreasing.
18
[Figure: a 1D cost function f(x) with a poor local minimum that should be avoided and the global minimum we would ideally reach, which might not be possible.]
Figure 4.3: Optimization algorithms may fail to find a global minimum when there are multiple local minima or plateaus present. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to significantly low values of the cost function.
Note
Critical points are points where every element of the gradient is equal to zero.
The directional derivative in direction u (a unit vector) is the slope of the function f in direction u. In other words, the directional derivative is the derivative
of the function f(x + αu) with respect to α, evaluated at α = 0. Using the chain rule, we can see that ∂/∂α f(x + αu) evaluates to u⊤∇_x f(x) when α = 0.
To minimize f, we would like to find the direction in which f decreases the fastest. We can do this using the directional derivative:
min_{u, u⊤u=1} u⊤∇_x f(x) = min_u ‖u‖₂ ‖∇_x f(x)‖₂ cos θ
where θ is the angle between u and the gradient. Substituting ‖u‖₂ = 1 and ignoring factors that do not depend on u, this simplifies to min_u cos θ. This is minimized when u points in the opposite direction of the gradient.
19
Local Minima
• Neural networks have infinitely many equivalent parameter solutions (i.e., they yield the same output)
In each layer we can apply any of the n! permutations of its units, as long as the weights of the adjacent layers are permuted consistently so that the composition of permutations cancels out at the output. Since we have m layers, this already gives (n!)^m equivalent parameter settings.
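A small sketch (my own example, not from the slides) of this weight-space symmetry: permuting the hidden units of a layer, together with the corresponding rows of its incoming weights and columns of its outgoing weights, leaves the network output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                   # number of hidden units
W1 = rng.normal(size=(n, 3))            # input -> hidden weights
b1 = rng.normal(size=n)
W2 = rng.normal(size=(1, n))            # hidden -> output weights

def net(x, W1, b1, W2):
    h = np.maximum(0.0, W1 @ x + b1)    # ReLU hidden layer
    return W2 @ h

x = rng.normal(size=3)
perm = rng.permutation(n)               # one of the n! possible permutations

# Permute the hidden units consistently on both sides of the layer.
out_original = net(x, W1, b1, W2)
out_permuted = net(x, W1[perm], b1[perm], W2[:, perm])
print(np.allclose(out_original, out_permuted))   # True: same input/output mapping
```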
20
Local Minima
Note
21
Local Minima
• Local minima can be problematic if they often have
a higher cost than the global minimum
Even if the gradient is small, it might not be possible to directly conclude that we reached a local minimum.
22
Saddle Points
[Figure: surface plot of f(x₁, x₂) = x₁² − x₂² showing a saddle point at the origin; axis ticks run from −15 to 15 for x₁, x₂ and from −500 to 500 for f.]
Figure 4.5: A saddle point containing both positive and negative curvature. The function in this example is f(x) = x₁² − x₂². Along the axis corresponding to x₁, the function curves upward. This axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to x₂, the function curves downward. This direction is an eigenvector of the Hessian with negative eigenvalue. The name “saddle point” derives from the saddle-like shape of this function. This is the quintessential example of a function with a saddle point. In more than one dimension, it is not necessary to have an eigenvalue of 0 in order to get a saddle point: it is only necessary to have both positive and negative eigenvalues. We can think of a saddle point with both signs of eigenvalues as being a local maximum within one cross section and a local minimum within another cross section.
At a saddle point the Hessian has both negative and positive eigenvalues: along one cross-section the cost decreases away from the saddle point, and along another cross-section it increases.
23
Saddle Points
Note
24
Saddle Points
• Given a function f : ℝⁿ → ℝ, the expected ratio between the number of saddle points and local minima grows exponentially with n
Note
25
Saddle Points
• Other interesting properties of random functions
are that
Note
26
Saddle Points
Note
27
Saddle Points
Note
28
Saddle Points
Note
29
Theoretical Limits
• The optimization of neural networks is largely an
open problem
Note
30
Weight Initialization
Strategies
• Since the optimization problem is non-convex, initialization determines the quality of the solution
Units with the same activation function and connected to the same inputs must have different initializations, otherwise a deterministic algorithm will update them in exactly the same way.
31
Weight Initialization
Strategies
• Two hidden units with the same activation function
and inputs should have different initial parameters
Note
32
Weight Initialization
Strategies
• The magnitude of the random variable matters
Note
33
Weight Initialization
Strategies
• Initialization of weights of a fully connected layer
with m inputs and n outputs
• Heuristic #1: Sample from U(−1/√m, 1/√m)
• Heuristic #2: Sample from U(−√(6/(m+n)), √(6/(m+n)))
These heuristics try to equalize the variance of the activations across the layers as well as the variance of the gradients. In the second heuristic the scale depends on both the number of inputs and outputs, so units with a different set of inputs/outputs get a different variance.
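A minimal NumPy sketch of the two heuristics for a fully connected layer with m inputs and n outputs (function names are mine):

```python
import numpy as np

def init_heuristic_1(m, n, rng):
    """Heuristic #1: sample W from U(-1/sqrt(m), 1/sqrt(m))."""
    bound = 1.0 / np.sqrt(m)
    return rng.uniform(-bound, bound, size=(n, m))

def init_heuristic_2(m, n, rng):
    """Heuristic #2: sample W from U(-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    bound = np.sqrt(6.0 / (m + n))
    return rng.uniform(-bound, bound, size=(n, m))

rng = np.random.default_rng(0)
W = init_heuristic_2(m=512, n=256, rng=rng)
print(W.std())   # roughly sqrt(2/(m+n)) by construction
```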
Weight Initialization
Strategies
• Scaling factors can also be tuned to guarantee optimization properties: e.g., that the total number of iterations to convergence is independent of depth, or that orthogonal initialization can be avoided
Some recommendations are to use orthogonal matrices as initialization. However, careful tuning may remove this need.
The usual principle is to preserve the scale of a signal. This criterion might not be ideal in general.
35
Weight Initialization
Strategies
• A drawback of fixing the scaling is that large layers
shrink their individual weights
Sparse initialization may have issues with networks that require a coordinated training of the weights (e.g., maxout networks).
Gradient descent might take some time to decrease incorrect weights.
A good rule of thumb for choosing initial weights (hyperparameter tuning): compute the standard deviation of the activations on a single minibatch of data.
Identify the layer with the smallest activations and increase its weights (until the variance of the output of the layer is 1). This is the layer-sequential unit variance (LSUV)
method of Mishkin and Matas (2016).
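A rough sketch of this rule of thumb (my own rendering, not the exact LSUV procedure), assuming a stack of ReLU layers and a single minibatch x:

```python
import numpy as np

def rescale_to_unit_variance(weights, x, tol=0.05, n_iters=10):
    """Forward one minibatch layer by layer and rescale each weight matrix
    until the std of its ReLU activations is close to 1 (weights are modified in place)."""
    h = x
    for W in weights:
        for _ in range(n_iters):
            a = np.maximum(0.0, h @ W.T)      # activations of this layer on the minibatch
            std = a.std()
            if abs(std - 1.0) < tol:
                break
            W /= std                          # grow/shrink weights toward unit output variance
        h = np.maximum(0.0, h @ W.T)          # feed the rescaled activations forward
    return weights

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 100))                               # one minibatch
weights = [rng.normal(size=(100, 100)) * 0.01 for _ in range(3)]
rescale_to_unit_variance(weights, x)
```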
36
For example, in ReLU we may choose a bias of 0.1 to avoid being in the saturating regime.
37
Learning Based
Initialization Strategies
This relates to Transfer Learning. We learn to solve a task and then transfer the learned knowledge to solve other tasks.
Basic Algorithms
39
• [SGD pseudocode: while the stopping criterion is not met, sample a minibatch, compute the gradient estimate, and update the parameters; end while]
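Only the closing "end while" of the pseudocode survived here; assuming the slide showed the basic minibatch SGD loop, a minimal sketch (with a hypothetical grad_loss(params, x_batch, y_batch) function returning the minibatch gradient) could look like this:

```python
import numpy as np

def sgd(params, grad_loss, data_x, data_y, lr=0.01, batch_size=32, n_epochs=10):
    """Minimal SGD sketch: sample a minibatch, estimate the gradient, update."""
    rng = np.random.default_rng(0)
    n = len(data_x)
    for epoch in range(n_epochs):                  # stands in for "while stopping criterion not met"
        order = rng.permutation(n)                 # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            g = grad_loss(params, data_x[idx], data_y[idx])
            params = params - lr * g               # gradient step
    return params                                  # "end while"
```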
Note
40
Note
41
Note
42
Note
43
• Excess error: J(θ̂) − min_θ J(θ)
• In a convex problem the excess error is O(1/√k)
• In a strongly convex problem it is instead O(1/k)
Let us analyze the convergence rates of SGD. To do so we can use the excess error.
The excess error is the difference between the achieved error and the minimum possible error.
44
Note
45
Momentum
• The method of momentum aims at accelerating
learning, especially with high curvature (Hessian),
small and noisy gradients
The convergence rates we just mentioned are in the ideal/best case. In practice, there are many scenarios where SGD can be quite slow and one can provide some useful
speed ups. One such speed up comes from the method of momentum.
46
v ← α v − ε ∇_θ [ (1/m) ∑_{i=1}^m L(f(x_i; θ), y_i) ]
θ ← θ + v
with α ∈ [0, 1)
When α = 0 the method reduces to the usual SGD. With a non-zero α (but less than one) the method accumulates gradients over time, giving more relevance
to recent gradients.
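A minimal sketch of this update (assuming the same hypothetical grad_loss minibatch-gradient function as in the SGD sketch above):

```python
def sgd_momentum(params, grad_loss, batches, lr=0.01, alpha=0.9):
    """Momentum sketch: v accumulates an exponentially weighted sum of past gradients."""
    v = 0.0 * params                       # velocity, initialized to zero
    for x_batch, y_batch in batches:
        g = grad_loss(params, x_batch, y_batch)
        v = alpha * v - lr * g             # v <- alpha*v - eps*g
        params = params + v                # theta <- theta + v
    return params
```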
47
The red path is the one followed by the momentum method. The black arrows show the positions that SGD would have followed instead. As the momentum method
averages gradients over iteration time, it tends to reduce the oscillations and thus results in a faster convergence.
48
Nesterov Momentum
• A small variant of the momentum method
v ← α v − ε ∇_θ [ (1/m) ∑_{i=1}^m L(f(x_i; θ + αv), y_i) ]
θ ← θ + v
with α ∈ [0, 1)
The Nesterov method improves conventional (batch) gradient descent from an O(1/k) to an O(1/k²) convergence rate for convex problems.
However, this improvement does not transfer to SGD.
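The same sketch with the Nesterov correction, i.e., the gradient is evaluated at the look-ahead point θ + αv:

```python
def sgd_nesterov(params, grad_loss, batches, lr=0.01, alpha=0.9):
    """Nesterov momentum sketch: evaluate the gradient at the look-ahead point."""
    v = 0.0 * params
    for x_batch, y_batch in batches:
        g = grad_loss(params + alpha * v, x_batch, y_batch)   # look-ahead gradient
        v = alpha * v - lr * g
        params = params + v
    return params
```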
49
Note
50
Delta-bar-delta Algorithm
• A heuristic to choose individual learning rates for
each model parameter during training
θ ← θ − ε ∇_θ J(θ),   ε ← ε + Δε   (one learning rate per parameter)
Δε = +κ if δ̄(t−1) δ(t) > 0;   −φ ε if δ̄(t−1) δ(t) < 0;   0 otherwise
δ(t) = ∇_θ J(θ),   δ̄(t) = (1 − γ) δ(t) + γ δ̄(t−1)
• Increases linearly but decreases exponentially fast (κ and φ are small positive constants)
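A sketch of this heuristic with per-parameter learning rates; the increase constant κ and the decay factor φ are hyperparameters I introduce here for illustration, since their values are not given on the slide:

```python
import numpy as np

def delta_bar_delta(params, grad_J, lr0=0.01, kappa=1e-4, phi=0.1, gamma=0.9, n_steps=100):
    """Per-parameter learning rates: grow linearly while the current gradient agrees
    in sign with its running average, shrink multiplicatively when it disagrees."""
    eps = np.full_like(params, lr0)        # one learning rate per parameter
    delta_bar = np.zeros_like(params)      # running average of past gradients
    for _ in range(n_steps):
        delta = grad_J(params)
        agree = delta_bar * delta
        eps = np.where(agree > 0, eps + kappa, eps)       # linear increase
        eps = np.where(agree < 0, eps * (1 - phi), eps)   # exponential decrease
        params = params - eps * delta
        delta_bar = (1 - gamma) * delta + gamma * delta_bar
    return params
```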
Note
51
AdaGrad
• Adapts the learning rate of each parameter by scaling it inversely proportionally to the square root of the accumulated past squared gradients
Note
52
AdaGrad
• With g the gradient of the loss function on a
minibatch
r ← r + g ⊙ g
Δθ ← −(ε / (δ + √r)) ⊙ g
θ ← θ + Δθ
where r = 0 initially, ε is the global learning rate, δ = 10⁻⁷ (to avoid numerical problems), and ⊙ denotes the element-wise product.
One issue is that accumulating gradients from the very beginning of training can make the learning rate decrease too fast. AdaGrad performs well, but not on all
deep learning models.
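A minimal sketch of the AdaGrad update (same assumptions as the earlier sketches):

```python
import numpy as np

def adagrad(params, grad_loss, batches, lr=0.01, delta=1e-7):
    """AdaGrad sketch: accumulate squared gradients, scale each step by 1/sqrt(r)."""
    r = np.zeros_like(params)
    for x_batch, y_batch in batches:
        g = grad_loss(params, x_batch, y_batch)
        r = r + g * g                                   # r <- r + g ⊙ g
        params = params - lr / (delta + np.sqrt(r)) * g
    return params
```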
53
RMSProp
• A modification of AdaGrad for non-convex problems
Note
54
RMSProp
• With g the gradient of the loss function on a
minibatch
r ← ρ r + (1 − ρ) g ⊙ g
Δθ ← −(ε / √(δ + r)) ⊙ g
θ ← θ + Δθ
ρ is an additional parameter that induces an exponential forgetting of the past squared gradients.
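The corresponding sketch; the only change with respect to AdaGrad is the exponentially decaying accumulator:

```python
import numpy as np

def rmsprop(params, grad_loss, batches, lr=0.001, rho=0.9, delta=1e-6):
    """RMSProp sketch: exponentially decaying average of squared gradients."""
    r = np.zeros_like(params)
    for x_batch, y_batch in batches:
        g = grad_loss(params, x_batch, y_batch)
        r = rho * r + (1 - rho) * g * g
        params = params - lr / np.sqrt(delta + r) * g
    return params
```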
55
Adam
Note
56
Adam
• With g the gradient of the loss function on a
minibatch
r ← ρ₂ r + (1 − ρ₂) g ⊙ g
s ← ρ₁ s + (1 − ρ₁) g
r̂ ← r / (1 − ρ₂^t),   ŝ ← s / (1 − ρ₁^t),   t ← t + 1
Δθ ← −ε ŝ / (δ + √r̂)
θ ← θ + Δθ
The two intermediate steps with r̂ and ŝ correct for the bias of the exponentially decaying moment estimates, which are initialized at zero.
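A minimal sketch of the Adam update with the bias-corrected moments:

```python
import numpy as np

def adam(params, grad_loss, batches, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam sketch: first/second moment estimates with bias correction."""
    s = np.zeros_like(params)   # first moment (mean of gradients)
    r = np.zeros_like(params)   # second moment (mean of squared gradients)
    t = 0
    for x_batch, y_batch in batches:
        t += 1
        g = grad_loss(params, x_batch, y_batch)
        s = rho1 * s + (1 - rho1) * g
        r = rho2 * r + (1 - rho2) * g * g
        s_hat = s / (1 - rho1 ** t)            # bias-corrected first moment
        r_hat = r / (1 - rho2 ** t)            # bias-corrected second moment
        params = params - lr * s_hat / (delta + np.sqrt(r_hat))
    return params
```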
57
Note
58
Optimization Strategies
• We now explore more specific techniques
• Batch normalization
• Coordinate descent
• Polyak averaging
• Supervised pretraining
Note
59
Batch Normalization
• The gradient of the cost function tells us how much
the cost changes as each parameter (in isolation)
changes
Notes
60
Batch Normalization
• Example
Let us illustrate the issues of the compositionality of neural networks by examining what happens to the loss function of 3 equivalent networks (with different
parametrizations).
When we pick the weights so that the I/O mapping is the same for the three networks, we expect the gradient update to do the same job in all 3 cases. However, this does
not happen.
Notice that the 1-layer case gives the optimal update and we use it as a reference.
61
Batch Normalization
• Example
Here we show the updated weights in the three cases and substitute the parameters that make the three networks identical.
We can see that the updated weights are different in the three cases.
62
Batch Normalization
• Example (general case)
The gradient is g_i^t = ∇_{w_i} ŷ = x ∏_{j≠i} w_j
Gradient descent provides the update for each parameter to improve the loss given that all the other parameters are fixed. However we update all parameters
simultaneously.
63
Batch Normalization
ŷ^{t+1}(x) = x ∏_{i=1}^k w_i^{t+1} = x ∏_{i=1}^k (w_i^t − ε g_i^t)
= x ∏_{i=1}^k ( w_i^t − εx ∏_{j≠i} w_j^t ) = x ∏_{i=1}^k ( w_i^t − (εx / w_i^t) ∏_j w_j^t )
= x ∏_{i=1}^k [ (w_i^t)² − εx ∏_j w_j^t ] / w_i^t = (x / ∏_i w_i^t) ∏_{i=1}^k [ (w_i^t)² − εx ∏_j w_j^t ]
= x ( ∏_i w_i^t − εx ∑_i ∏_{j≠i} (w_j^t)² + ··· + (−1)^k (εx)^k (∏_j w_j^t)^{k−1} )
i j6=i
The output of the network depends on high-order terms in ε, which makes the choice of ε extremely difficult (the variation of the gradient magnitude depends on the
current values of the weights).
We do not know whether the first-order or the higher-order terms will be the largest. So setting ε is not trivial (it depends on the values of the weights), or we need
to be conservative and set it to a very small value.
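A tiny numerical check of this point (my own illustration): for ŷ = x ∏_i w_i, the actual output after the simultaneous update quickly departs from the first-order prediction ŷ − ε ∑_i g_i² unless ε is very small.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.0
k = 10                                   # depth of the linear chain
w = rng.uniform(0.8, 1.2, size=k)        # weights of the k scalar layers

y = x * np.prod(w)
g = np.array([x * np.prod(np.delete(w, i)) for i in range(k)])   # g_i = d yhat / d w_i

for eps in (1e-4, 1e-2, 1e-1):
    y_new = x * np.prod(w - eps * g)            # actual output after the simultaneous update
    y_lin = y - eps * np.sum(g * g)             # first-order prediction of the new output
    print(f"eps={eps:6.4f}  actual={y_new: .4f}  first-order={y_lin: .4f}")
```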
64
Batch Normalization
• Batch normalization proposes a re-parametrization
that does not compromise the original capacity but
changes the learning dynamics
h_i = h_{i−1} w_i
Notes
65
Batch Normalization
• Proposed re-parametrization
Ĥ = γ H′ + β = γ (H − μ(H)) / σ(H) + β
At run-time the mean and standard deviations are fixed based on the training set (otherwise one would need to compute statistics on a single input sample — Instance
Normalization).
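A minimal sketch of this re-parametrization for one minibatch of activations H (rows are examples); γ and β are the learnable parameters, and at test time μ(H) and σ(H) would be replaced by statistics fixed from the training set:

```python
import numpy as np

def batch_norm(H, gamma, beta, eps=1e-5):
    """Normalize each unit over the minibatch, then rescale and shift."""
    mu = H.mean(axis=0)                       # per-unit mean over the batch
    sigma = np.sqrt(H.var(axis=0) + eps)      # per-unit std over the batch
    H_prime = (H - mu) / sigma                # H' has zero mean and unit variance
    return gamma * H_prime + beta             # learnable scale and shift
```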
66
Batch Normalization
• Recall previous 3-network example
1. 1-layer:          H′ = ( xz − (1/m) ∑_j x_j z ) / √( (1/m) ∑_i ( x_i z − (1/m) ∑_j x_j z )² ) = (x − μ_x) / σ_x
2. 2-layer (shared): H′ = ( xw² − (1/m) ∑_j x_j w² ) / √( (1/m) ∑_i ( x_i w² − (1/m) ∑_j x_j w² )² ) = (x − μ_x) / σ_x
3. 2-layer:          H′ = ( xw₁w₂ − (1/m) ∑_j x_j w₁w₂ ) / √( (1/m) ∑_i ( x_i w₁w₂ − (1/m) ∑_j x_j w₁w₂ )² ) = (x − μ_x) / σ_x
All three cases are independent of the original (linear) parameters. Since the new mapping is invariant to the weights, their gradient will be zero. Basically, these weights
will not receive any update.
67
Batch Normalization
• In all 3 networks we have the same mapping
Ĥ = γ H′ + β
Ĥ ← (γ − ε H′) H′ + (β − ε)
The new parametrization makes the updates in all three cases identical.
The gradient step is computed with respect to γ and β.
68
Coordinate Descent
Note
69
Coordinate Descent
• Example
Note
70
Polyak Averaging
• Average the parameters estimated during gradient
descent
θ̂^t = (1/t) ∑_{i=1}^t θ^i
θ̂^t = α θ̂^{t−1} + (1 − α) θ^t,   α ∈ (0, 1)
Relates to Momentum
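Both variants in a couple of lines (my own sketch):

```python
import numpy as np

def polyak_average(theta_history):
    """Plain average of the iterates (convex setting)."""
    return np.mean(theta_history, axis=0)

def running_average(theta_new, theta_avg, alpha=0.99):
    """Running (exponential) average of the parameters, used in the non-convex setting."""
    return alpha * theta_avg + (1 - alpha) * theta_new
```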
71
Supervised Pretraining
Note
72
Supervised Pretraining
Note
73
Supervised Pretraining
• Example of greedy supervised pretraining
[Diagram: start with a network with one hidden layer, x → h₁ (weights w₁) and output y (weights u₁); then add the next hidden layer h₂ (weights w₂) with its own output head (weights u₂); repeat until the last layer hₖ (weights wₖ, output head uₖ).]
Note
74
Supervised Pretraining
• It relates to transfer learning
• Example
Note
75
Supervised Pretraining
• First, train an easy-to-train teacher network (low depth and large width)
Note
76
Designing Models to
Aid Optimization
• Rather than improving the optimization by working on
the algorithms, design the model so that optimization
is easier
Note
77
Continuation Methods
• Improving the local estimates (e.g., Adam,
RMSProp, AdaGrad) has a limited effect
Note
78
Continuation Methods
• Build a sequence of objective functions {J_i}_{i=0,…,n} on the same set of variables, such that one solves objectives of increasing difficulty, with J_n = J the original objective function
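A schematic sketch of the idea (purely illustrative), where each stage is warm-started from the previous solution:

```python
def continuation(objectives, minimize, theta0):
    """objectives = [J_0, ..., J_n], ordered from easy to hard, with J_n the true cost.
    Each stage is warm-started from the solution of the previous one."""
    theta = theta0
    for J in objectives:
        theta = minimize(J, theta)     # e.g., a few epochs of SGD on this objective
    return theta
```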
Note
79
Continuation Methods
Note
80
Curriculum Learning
• A special type of continuation method that imitates
the way humans learn
In stochastic curriculum learning, the proportions of easy and hard examples are changed over time: initially the mix contains few hard examples, and later many more.
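A possible sketch of such a stochastic curriculum (my own illustration): the fraction of hard examples per minibatch grows linearly over training.

```python
import numpy as np

def curriculum_batches(easy, hard, n_steps, batch_size=32, seed=0):
    """Yield minibatches whose fraction of hard examples grows from 0 to 1."""
    rng = np.random.default_rng(seed)
    for t in range(n_steps):
        p_hard = t / max(n_steps - 1, 1)                  # schedule for hard examples
        n_hard = int(round(p_hard * batch_size))
        idx_hard = rng.integers(len(hard), size=n_hard)
        idx_easy = rng.integers(len(easy), size=batch_size - n_hard)
        yield np.concatenate([easy[idx_easy], hard[idx_hard]])
```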