UNIT IV
Optimization for Training Deep Models
P Jyothi, Asst. Prof., CSE Dept.
Introduction
This unit focuses on one particular case of optimization: finding the parameters θ of a neural network that significantly reduce a cost function J(θ), which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms. The cost or loss function therefore needs to be optimized (minimized).
The goal of a machine learning algorithm is to reduce the expected generalization error, a quantity known as the risk. If we knew the true data distribution, risk minimization would be an optimization task solvable by an optimization algorithm.
The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. The training process based on minimizing this average training error is known as empirical risk minimization.
Rather than optimizing the risk directly, we optimize the empirical risk and hope that the risk decreases significantly as well. A variety of theoretical results establish conditions under which the true risk can be expected to decrease by various amounts.
However, empirical risk minimization is prone to overfitting: models with high capacity can simply memorize the training set. In many cases, empirical risk minimization is not really feasible. The most effective modern optimization algorithms are instead based on gradient descent.
Optimization algorithms that use the entire training set
are called batch or deterministic gradient methods,
because they process all of the training examples
simultaneously in a large batch.
Challenges in Neural Network Optimization
Optimization in general is an extremely difficult task.
Traditionally, machine learning has avoided the difficulty
of general optimization by carefully designing the
objective function and constraints to ensure that the
optimization problem is convex.
When training neural networks, we must confront the
general non-convex case. Even convex optimization is not
without its complications. In this section, we summarize
several of the most prominent challenges involved in
optimization for training deep models.
(i) Ill-Conditioning:
The ill-conditioning problem (a Hessian matrix of the cost function with a poor condition number) is generally believed to be present in neural network training problems. Ill-conditioning can cause gradient descent to get stuck, in the sense that even very small steps increase the cost function.
Figure 8.1: Gradient descent often does not arrive at a
critical point of any kind.
In this example, the gradient norm increases throughout
training of a convolutional network used for object
detection.
(Left)A scatterplot showing how the norms of individual
gradient evaluations are distributed over time.
To improve legibility, only one gradient norm is
plotted per epoch. The running average of all
gradient norms is plotted as a solid curve. The
gradient norm clearly increases over time, rather
than decreasing as we would expect if the training
process converged to a critical point. (Right) Despite
the increasing gradient, the training process is
reasonably successful. The validation set
classification error decreases to a low level.
(ii) Local Minima:
One of the most prominent features of a convex
optimization problem is that it can be reduced to
the problem of finding a local minimum.
Any local minimum is guaranteed to be a global
minimum. Some convex functions have a flat
region at the bottom rather than a single global
minimum point, but any point within such a flat
region is an acceptable solution.
With non-convex functions, such as neural networks, it is
possible to have many local minima. Indeed, nearly any
deep model is essentially guaranteed to have an
extremely large number of local minima.
Local minima can be problematic if they have high cost in
comparison to the global minimum.
If local minima with high cost are common, this could pose a serious problem for gradient-based optimization algorithms.
(iii) Plateaus, Saddle Points and Other Flat Regions:
A saddle point (SP) is another type of point with zero gradient, where some neighboring points have higher value and others have lower value. Intuitively, this means that a saddle point acts as a local minimum along some directions and a local maximum along others. Thus, the Hessian at an SP has both positive and negative eigenvalues. (For a function to curve purely upwards or downwards around a point, as in the case of a local minimum or a local maximum, the eigenvalues must all have the same sign: positive for a local minimum and negative for a local maximum.)
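A tiny NumPy check makes this eigenvalue picture concrete. The function f(x, y) = x² − y² used here is an illustrative choice, not from the slides:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point at the origin: the gradient
# is zero there, but the Hessian has mixed-sign eigenvalues.
H = np.array([[2.0, 0.0],    # d2f/dx2 = 2  (curves up along x)
              [0.0, -2.0]])  # d2f/dy2 = -2 (curves down along y)
eigvals = np.linalg.eigvalsh(H)  # sorted ascending
print(eigvals)  # one negative and one positive eigenvalue -> saddle point
```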
Plateaus:
All local search algorithms use a function that evaluates the
quality of assignment, for example the number of
constraints violated by the assignment. This amount is
called the cost of the assignment. The aim of local search is
that of finding an assignment of minimal cost, which is a
solution if any exists.
Two classes of local search algorithms exist. The first one is that
of greedy or non-randomized algorithms. These algorithms
proceed by changing the current assignment by always trying to
decrease (or at least, non-increase) its cost.
The main problem of these algorithms is the possible presence
of plateaus, which are regions of the space of assignments
where no local move decreases cost. The second class of local search algorithms was invented to solve this problem: these escape plateaus by making random moves, and are called randomized local search algorithms.
(iv) Cliffs and Exploding Gradients:
Neural networks (NNs) might sometimes have extremely steep regions resembling cliffs, which arise from the repeated multiplication of weights. Consider, for instance, a 3-layer (input-hidden-output) neural network in which all the activation functions are linear: the output is then a product of weight matrices.
Deep neural networks thus involve multiplication of a large number of parameters, leading to sharp non-linearities in the parameter space. These non-linearities give rise to very high gradients in some places. At the edge of such a cliff, an update step might throw the parameters extremely far.
The cliff can be dangerous whether we approach it from above or from below, but fortunately its most serious consequences can be avoided using the gradient clipping (GC) heuristic. The gradient indicates only the direction in which to make the update; if the gradient descent update proposes making a very large step, GC intervenes to reduce the step size.
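A minimal sketch of norm-based gradient clipping (the threshold `max_norm` is an illustrative hyperparameter, not from the slides):

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Norm-based gradient clipping: rescale the gradient so its L2 norm
    never exceeds max_norm, while preserving its direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])           # a "cliff" gradient with norm 50
clipped = clip_gradient(g, max_norm=5.0)
print(clipped)  # [3. 4.] -> same direction, norm reduced to 5
```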
(v)Long-Term Dependencies:
Another difficulty that neural network optimization
algorithms must overcome arises when the computational
graph becomes extremely deep. Feedforward networks
with many layers have such deep computational graphs.
So do recurrent networks, which construct very deep
computational graphs by repeatedly applying the same
operation at each time step of a long temporal sequence.
Repeated application of the same parameters gives rise to
especially pronounced difficulties.
Recurrent networks use the same matrix W at each time
step, but feedforward networks do not, so even very deep
feedforward networks can largely avoid the vanishing and
exploding gradient problem.
(vi) Inexact Gradients:
Most optimization algorithms use a noisy or biased estimate of the gradient, either because the estimate is based on sampling or because the true gradient is intractable. For example, when training a Restricted Boltzmann Machine (RBM), an approximation of the gradient is used: the contrastive divergence algorithm gives a technique for approximating the gradient of the RBM's intractable log-likelihood.
Neural Networks might not end up at any critical
point at all and such critical points might not even
necessarily exist.
A lot of the problems might be avoided if there
exists a space connected reasonably directly to the
solution by a path that local descent can follow and
if we are able to initialize learning within that well-
behaved space. Thus, choosing good initial points
should be studied.
Basic Algorithms
Gradient descent is an algorithm that follows the gradient of an entire training set downhill. Training may be accelerated considerably by using stochastic gradient descent to follow the gradient of randomly selected minibatches downhill.
1. Stochastic Gradient Descent:
Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general and for deep learning in particular.
The learning rate ϵ is a very important parameter for
SGD. ϵ should be reduced after each epoch in general.
This is because the random sampling of minibatches acts as a source of noise, which might make SGD keep oscillating around the minimum without actually reaching it, whereas the true gradient of the total cost function (involving the entire dataset) actually becomes 0 when we reach the minimum.
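A minimal SGD loop with the common linear learning-rate decay schedule can be sketched as follows. The toy quadratic objective, the data, and the decay constants (`eps0`, `eps_tau`, `tau`) are illustrative assumptions, not from the slides:

```python
import numpy as np

def sgd(grad_fn, theta, data, eps0=0.1, eps_tau=0.001, tau=100, steps=200):
    """SGD with linear learning-rate decay: eps_k interpolates from eps0
    down to eps_tau over the first tau steps, then stays at eps_tau."""
    rng = np.random.default_rng(0)
    for k in range(steps):
        a = min(k / tau, 1.0)
        eps = (1 - a) * eps0 + a * eps_tau
        x = rng.choice(data)                 # sample one example (minibatch of 1)
        theta = theta - eps * grad_fn(theta, x)
    return theta

# Toy problem: minimize E[(theta - x)^2] over data centered at 3.0.
data = np.array([2.5, 3.0, 3.5])
grad = lambda theta, x: 2.0 * (theta - x)
theta_hat = sgd(grad, theta=0.0, data=data)
print(theta_hat)  # close to the data mean, 3.0
```

As the learning rate decays, the noise injected by minibatch sampling shrinks, so the iterates stop oscillating and settle near the minimum.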
P Jyothi,Asst. Prof., CSE Dept.
Basic Algorithms conti..
The following conditions on the learning rates ϵ_k guarantee convergence under convexity assumptions in the case of SGD:
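These are the standard Robbins-Monro conditions on the learning-rate schedule; written out:

```latex
\sum_{k=1}^{\infty} \epsilon_k = \infty ,
\qquad
\sum_{k=1}^{\infty} \epsilon_k^{2} < \infty .
```

The first condition ensures the iterates can travel arbitrarily far, while the second forces the sampling noise to die out.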
Setting it too low makes the training proceed slowly which
might lead to the algorithm being stuck at a high cost value.
Setting it too high would lead to large oscillations which
might even push the learning outside the optimal region.
The best way is to monitor the first several iterations and
set the learning rate to be higher than the best performing
one, but not too high to cause instability.
A big advantage of SGD is that the time taken to compute a
weight update doesn’t grow with the number of training
examples as each update is computed after observing a
batch of samples which is independent of the total number
of training examples. Theoretically, for a convex problem, batch gradient descent (BGD) reduces the excess error to O(1/k) after k iterations, whereas SGD only reduces it to O(1/√k). However, SGD compensates for this with its other advantages after a few iterations, along with its ability to make rapid updates in the case of a large training set.
2. Momentum:
The momentum algorithm accumulates an exponentially decaying moving average of past gradients (called the velocity) and uses it as the direction in which to take the next step. Momentum is mass times velocity, which equals velocity if we use unit mass. The momentum update is given by:

v ← α·v − ϵ·∇θ J(θ)
θ ← θ + v
The step size (earlier equal to learning rate * gradient) now
depends on how large and aligned the sequence of
gradients are. If the gradients at each iteration point in the
same direction (say g), it will lead to a higher value of the
step size, as the gradients just keep accumulating. Once it reaches a constant (terminal) velocity, the step size becomes ϵ||g|| / (1 − α). Thus, using α = 0.9 makes the terminal step size 10 times that of plain gradient descent. Common values of α are 0.5, 0.9 and 0.99.
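The momentum update v ← α·v − ϵ·g, θ ← θ + v and its terminal velocity can be sketched and verified numerically (the constant gradient g is an illustrative choice):

```python
import numpy as np

def momentum_update(theta, v, grad, eps=0.01, alpha=0.9):
    """One step of classical momentum:
    v <- alpha*v - eps*grad;  theta <- theta + v."""
    v = alpha * v - eps * grad
    theta = theta + v
    return theta, v

# With a constant gradient g, the step size approaches eps*||g||/(1-alpha).
g = np.array([1.0, 0.0])
theta, v = np.zeros(2), np.zeros(2)
for _ in range(200):
    theta, v = momentum_update(theta, v, g, eps=0.01, alpha=0.9)
print(np.linalg.norm(v))  # ~ 0.01 * 1 / (1 - 0.9) = 0.1 (terminal velocity)
```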
3. Nesterov Momentum:
Sutskever et al. (2013) introduced a variant of the momentum algorithm that was inspired by Nesterov's accelerated gradient method (Nesterov, 1983, 2004). The update rules in this case are given by:

v ← α·v − ϵ·∇θ J(θ + α·v)
θ ← θ + v
where the parameters α and ϵ play a similar role as in the standard momentum method. The difference between
Nesterov momentum and standard momentum is where
the gradient is evaluated. With Nesterov momentum the
gradient is evaluated after the current velocity is applied.
Thus one can interpret Nesterov momentum as attempting
to add a correction factor to the standard method of
momentum.
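The look-ahead gradient evaluation is the only change from classical momentum, as this sketch shows (the quadratic test objective is an illustrative assumption):

```python
import numpy as np

def nesterov_update(theta, v, grad_fn, eps=0.01, alpha=0.9):
    """One step of Nesterov momentum: evaluate the gradient at the
    look-ahead point theta + alpha*v (i.e. after the current velocity
    is applied), then update the velocity and the parameters."""
    g = grad_fn(theta + alpha * v)   # gradient at the interim point
    v = alpha * v - eps * g
    theta = theta + v
    return theta, v

# Minimize f(theta) = ||theta||^2 / 2, whose gradient is theta itself.
grad = lambda t: t
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(300):
    theta, v = nesterov_update(theta, v, grad, eps=0.01, alpha=0.9)
print(np.linalg.norm(theta))  # driven very close to the minimum at the origin
```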
Parameter Initialization Strategies
Training algorithms for deep learning models are iterative in nature and require the specification of an initial point. This is crucial: the initial point often decides whether or not the algorithm converges and, if it does, whether it converges to a point with high cost or low cost.
We have limited understanding of neural network
optimization but the one property that we know with
complete certainty is that the initialization should break
symmetry.
This means that if two hidden units are connected to the
same input units, then these should have different
initialization or else the gradient would update both the
units in the same way and we don’t learn anything new by
using an additional unit. The idea of having each unit learn
something different motivates random initialization of
weights which is also computationally cheaper.
Biases are often chosen heuristically (zero mostly) and only
the weights are randomly initialized, almost always from a
Gaussian or uniform distribution. The scale of the
distribution is of utmost concern. Large weights might have
better symmetry-breaking effect but might lead to chaos
(extreme sensitivity to small perturbations in the input) and
exploding values during forward & back propagation. As an
example of how large weights might lead to chaos, consider
that there’s a slight noise adding ϵ to the input.
Now, if we did just a simple linear transformation like W·x, the ϵ noise would add a factor of W·ϵ to the output. If the weights are large, this ends up making a significant contribution to the output.
SGD and its variants tend to halt in areas near the
initial values, thereby expressing a prior that the
path to the final parameters from the initial values
is discoverable by steepest descent algorithms.
Various suggestions have been made for appropriate initialization of the parameters. The most commonly used ones include sampling the weights of each fully-connected layer having m inputs and n outputs uniformly from the following distributions:
• U(−1/√m, 1/√m)
• U(−√(6/(m+n)), √(6/(m+n))) (the Glorot/Xavier initialization)
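The second distribution above can be sampled directly; a minimal sketch (the seed and layer sizes are illustrative):

```python
import numpy as np

def glorot_uniform(m, n, seed=0):
    """Sample an m x n weight matrix from the Glorot/Xavier uniform
    distribution U(-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

W = glorot_uniform(300, 100)
print(W.shape)                                 # (300, 100)
print(np.abs(W).max() <= np.sqrt(6.0 / 400))   # True: every weight in range
```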
These initializations have already been incorporated into the most commonly used deep learning frameworks, so that you can just specify which initializer to use and the framework takes care of sampling appropriately. For example, Keras, a very popular deep learning framework, has a module called initializers, in which the second distribution above is available as glorot_uniform.
One drawback of using 1 / √m as the standard
deviation is that the weights end up being small
when a layer has too many input/output units.
Motivated by the idea of making the total amount of input to each unit independent of the number of input units m, sparse initialization sets each unit to have exactly k non-zero weights. However, it takes a long time for gradient descent to correct incorrect large values, so this initialization might cause problems.
If the weights are too small, the range of activations across the mini-batch will shrink as the activations propagate forward through the network. By repeatedly identifying the first layer with unacceptably small activations and increasing its weights, it is possible to eventually obtain a network with reasonable initial activations throughout.
The biases are relatively easier to choose. Setting the biases to zero is compatible with most weight initialization schemes, except in a few cases: for example, when a bias is used for an output unit, to prevent saturation at initialization, or when a unit acts as a gate for making a decision.
Algorithms with Adaptive Learning Rates
Neural network researchers have long realized that the
learning rate was reliably one of the hyperparameters
that is the most difficult to set because it has a significant
impact on model performance.
1. AdaGrad: It is important to incrementally decrease the learning rate for faster convergence. Instead of manually reducing the learning rate after each epoch (or every few epochs), a better approach is to adapt the learning rate as training progresses. AdaGrad does this by scaling the learning rate of each model parameter individually, inversely proportional to the square root of the sum of the historical squared values of its gradient.
In the parameter update r ← r + g ⊙ g, θ ← θ − ϵ·g / (δ + √r), the accumulation variable r is initialized to 0 and the multiplication in the update step happens element-wise. Since the gradient value differs across parameters, the learning rate is scaled differently for each parameter too.
Parameters with a large gradient get a large decrease in learning rate: a high learning rate there might cause oscillations, or cause the update to jump over the minimum even as it is being approached, so the learning rate should be decreased for better convergence. Parameters with small gradients get only a small decrease: they might already have approached their respective minima and should not be pushed away, and even if they have not, reducing their learning rate too much would shrink their updates even further, leading to slower learning.
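The per-parameter scaling can be sketched in a few lines; the initial gradient values below are illustrative:

```python
import numpy as np

def adagrad_update(theta, r, grad, eps=0.1, delta=1e-7):
    """One AdaGrad step: accumulate squared gradients in r, then scale
    each parameter's learning rate by 1/sqrt(r), element-wise."""
    r = r + grad * grad
    theta = theta - eps * grad / (delta + np.sqrt(r))
    return theta, r

# Parameters with very different gradient scales end up taking
# steps of roughly the same size.
theta, r = np.zeros(2), np.zeros(2)
g = np.array([10.0, 0.1])
theta, r = adagrad_update(theta, r, g)
print(theta)  # both coordinates move by roughly eps = 0.1
```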
However, accumulation of squared gradients from the very beginning can lead
to excessive and premature decrease in the learning rate. Consider that we had
a model with only 2 parameters (for simplicity) and both the initial gradients are
1000. After some iterations, the gradient of one of the parameters has reduced
to 100 but that of the other parameter is still around 750. However, because of
the accumulation at each update, the accumulated gradient would still have
almost the same value.
2. RMSProp: RMSProp addresses the problem caused by accumulated gradients in AdaGrad. It modifies the gradient accumulation step into an exponentially weighted moving average, in order to discard history from the extreme past. The RMSProp update is given by:

r ← ρ·r + (1 − ρ)·g ⊙ g
θ ← θ − ϵ·g / √(δ + r)
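A runnable sketch of this update; the gradient sequence below is an illustrative assumption showing how the moving average forgets the distant past:

```python
import numpy as np

def rmsprop_update(theta, r, grad, eps=0.01, rho=0.9, delta=1e-6):
    """One RMSProp step: exponentially weighted moving average of
    squared gradients instead of AdaGrad's unbounded sum."""
    r = rho * r + (1 - rho) * grad * grad
    theta = theta - eps * grad / np.sqrt(delta + r)
    return theta, r

# Unlike AdaGrad, old gradients decay away: after many steps with a
# small gradient, r forgets the early huge gradients.
theta, r = np.zeros(1), np.zeros(1)
for g in [np.array([100.0])] * 5 + [np.array([0.1])] * 200:
    theta, r = rmsprop_update(theta, r, g)
print(r)  # decayed toward ~0.01 despite the early gradients of 100
```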
This allows the algorithm to converge rapidly after finding a convex bowl, as if it were an instance of AdaGrad initialized within that bowl. To see why, consider the figure below. The region marked 1 indicates the usual RMSProp parameter updates given by the update equation, which are nothing but exponentially averaged AdaGrad updates. Once the optimization process lands on point A, it essentially lands at the top of a convex bowl. At this point, intuitively, all the updates before A are forgotten due to the exponential averaging, and it is as if (exponentially averaged) AdaGrad updates start from point A onwards.
3. Adam: The name is adapted from "adaptive moments"; the algorithm combines RMSProp and momentum. Firstly, it views momentum as an estimate of the first-order moment of the gradient and RMSProp as an estimate of the second moment. The weight update for Adam is given by:

s ← ρ1·s + (1 − ρ1)·g
r ← ρ2·r + (1 − ρ2)·g ⊙ g
ŝ = s / (1 − ρ1^t),  r̂ = r / (1 − ρ2^t)
θ ← θ − ϵ·ŝ / (√r̂ + δ)
Secondly, since s and r are initialized as zeros, the authors
observed a bias during the initial steps of training thereby
adding a correction term for both the moments to account
for their initialization near the origin.
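The full Adam step, including the bias correction, fits in a short function. The defaults ρ1 = 0.9, ρ2 = 0.999 are the commonly cited ones from the Adam paper; the single-step example is illustrative:

```python
import numpy as np

def adam_update(theta, s, r, t, grad, eps=0.001,
                rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam step: biased first/second moment estimates s and r,
    bias-corrected by (1 - rho^t), then an RMSProp-style scaled update."""
    s = rho1 * s + (1 - rho1) * grad           # first moment (momentum)
    r = rho2 * r + (1 - rho2) * grad * grad    # second moment (RMSProp)
    s_hat = s / (1 - rho1 ** t)                # correct init-at-zero bias
    r_hat = r / (1 - rho2 ** t)
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

# On the very first step the bias correction makes the update size about
# eps, regardless of the raw gradient magnitude.
theta, s, r = np.zeros(1), np.zeros(1), np.zeros(1)
theta, s, r = adam_update(theta, s, r, t=1, grad=np.array([50.0]))
print(theta)  # approximately -0.001
```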
Approximate Second-Order Methods
The optimization algorithms that we’ve looked at till
now involved computing only the first derivative.
But there are many methods which involve higher
order derivatives as well.
The main problem with these algorithms is that they are not practically feasible in their vanilla form, so certain methods are used to approximate the required quantities. We explain three such methods, all of which use the empirical risk as the objective function:
Newton's Method: This is the most commonly used higher-order derivative method. It makes use of the curvature of the loss function, via its second-order derivative, to arrive at the optimal point. Using the second-order Taylor series expansion to approximate J(θ) around a point θ₀ and ignoring derivatives of order greater than 2 (as discussed in previous chapters), we get:

J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ∇θJ(θ₀) + ½(θ − θ₀)ᵀH(θ − θ₀)

Solving for the critical point of this approximation yields the Newton update rule: θ* = θ₀ − H⁻¹∇θJ(θ₀).
For quadratic surfaces (i.e. where cost function is quadratic), this directly gives
the optimal result in one step whereas gradient descent would still need to
iterate. However, for surfaces that are not quadratic, as long as the Hessian
remains positive definite, we can obtain the optimal point through a 2-step
iterative process — 1) Get the inverse of the Hessian and 2) update the
parameters.
Saddle points are problematic for Newton's method. If not all the eigenvalues of the Hessian are positive, Newton's method might cause the updates to move in the wrong direction. A way to avoid this is to add regularization to the Hessian:

θ* = θ₀ − [H + αI]⁻¹∇θJ(θ₀)
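A sketch of a regularized Newton step, adding αI to the Hessian as described above (the quadratic cost and matrix A are illustrative; α = 0 recovers the plain Newton step):

```python
import numpy as np

def newton_step(theta, grad, hessian, alpha=0.0):
    """Regularized Newton update: theta - (H + alpha*I)^{-1} grad."""
    H_reg = hessian + alpha * np.eye(len(theta))
    return theta - np.linalg.solve(H_reg, grad)

# For a quadratic cost J(theta) = 0.5 * theta^T A theta, one plain
# Newton step from anywhere lands exactly on the minimum at the origin.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
theta0 = np.array([4.0, -7.0])
theta1 = newton_step(theta0, grad=A @ theta0, hessian=A)
print(theta1)  # [0. 0.] up to floating-point error
```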
However, if there is strong negative curvature, i.e. some eigenvalues are large in magnitude and negative, α needs to be sufficiently large to offset them, in which case the Hessian becomes dominated by the αI diagonal term.
Another problem restricting the use of Newton’s method is
the computational cost. It takes O(k³) time to calculate the
inverse of the Hessian where k is the number of
parameters. It’s not uncommon for Deep Neural Networks
to have about a million parameters and since the
parameters are updated every iteration, this inverse needs
to be calculated at every iteration, which is not
computationally feasible.
Conjugate Gradients: One weakness of the method of steepest
descent (i.e. GD) is that line searches happen along the direction
of the gradient. Suppose the previous search direction is d(t-1).
Once the search terminates (which it does when the gradient
along the current gradient direction vanishes) at the minimum,
the next search direction, d(t) is given by the gradient at that
point, which is orthogonal to d(t-1) (because if it’s not orthogonal,
it’ll have some component along d(t-1) which cannot be true as
at the minimum, the gradient along d(t-1) has vanished).
Upon getting the minimum along the current search direction, the
minimum along the previous search direction is not preserved,
undoing, in a sense, the progress made in previous search
direction.
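Conjugate gradients fix this by choosing each new direction as the negative gradient plus a correction along the previous direction, so the previous minimization is preserved. A minimal sketch of the linear conjugate gradient method on a quadratic objective (the matrix A and vector b are illustrative assumptions):

```python
import numpy as np

def conjugate_gradient(A, b, x):
    """Linear CG for min 0.5*x^T A x - b^T x. Each new search direction
    d is the residual (negative gradient) plus a Fletcher-Reeves
    correction beta*d_prev, keeping the directions A-conjugate."""
    r = b - A @ x                 # residual = negative gradient
    d = r.copy()
    for _ in range(len(b)):       # exact solution in n steps (exact arithmetic)
        step = (r @ r) / (d @ A @ d)         # exact line search along d
        x = x + step * d
        r_new = r - step * (A @ d)
        beta = (r_new @ r_new) / (r @ r)     # Fletcher-Reeves coefficient
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b, x=np.zeros(2))
print(np.allclose(A @ x, b))  # True: solved in n = 2 iterations
```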
BFGS: The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm tries to bring the advantages of Newton's method without the additional computational burden, by approximating the inverse of H with a matrix M(t) that is iteratively refined using low-rank updates. A line search is then conducted along the direction M(t)g(t). However, BFGS requires storing the matrix M(t), which takes O(n²) memory, making it infeasible for large models. An approach called Limited-Memory BFGS (L-BFGS) tackles this by computing M(t) using the same method as BFGS but starting from the assumption that M(t−1) is the identity matrix.
Optimization Strategies and Meta-Algorithms
Many optimization techniques are not exactly
algorithms, but rather general templates that can
be specialized to yield algorithms, or subroutines
that can be incorporated into many different
algorithms.
Batch Normalization: Batch normalization (BN) is one of the most exciting
innovations in Deep learning that has significantly stabilized the learning
process and allowed faster convergence rates. The intuition behind batch
normalization is as follows: Most of the Deep Learning networks are
compositions of many layers (or functions) and the gradient with respect to one
layer is taken considering the other layers to be constant.
However, in practice all the layers are updated simultaneously, and this can lead to unexpected results. For example, let y* = x W¹ W² … W¹⁰. Here y* is a linear function of x but not a linear function of the weights. Suppose the gradient with respect to the weights is g and we intend to reduce y* by 0.1. Using a first-order Taylor series approximation, taking a step of ϵg would reduce y* by ϵgᵀg, so ϵ should be 0.1/(gᵀg) based on first-order information alone. However, higher-order effects also creep in, since the updated y* contains terms involving the products of the updates to several layers at once.
Batch normalization takes care of this problem by providing an efficient reparameterization of almost any deep network. Given a minibatch matrix of activations H, the normalization is H' = (H − μ) / σ, where the subtraction and division are broadcast across the rows.
Coordinate Descent: Generally, a single weight update is made by taking the gradient with respect to every parameter. However, in cases where some of the parameters are independent of the rest (discussed below), it might be more efficient to take the gradient with respect to those independent sets of parameters separately when making updates. Let me clarify that with an example. Suppose we have the following cost function:

J(H, W) = Σᵢⱼ |Hᵢⱼ| + Σᵢⱼ (X − WᵀH)ᵢⱼ²
This cost function describes the learning problem called sparse coding. Here, H refers to the sparse representation of X and W is the set of weights used to linearly decode H to retrieve X. Why does this cost function enforce the learning of a sparse representation of X? The first term penalizes values of H far from 0 (positive or negative, because of the modulus operator |H|), pushing most of the values toward 0 and thereby enforcing sparsity. The second term compensates for the difference between X and H linearly transformed by W, enforcing them to take close values. In this way, H is learned as a sparse "representation" of X.
The cost function generally consists of additionally a regularization term
like weight decay, which has been avoided for simplicity. Here, we can
divide the entire list of parameters into two sets, W and H. Minimizing
the cost function with respect to any of these sets of parameters is a
convex problem.
Coordinate descent (CD) refers to minimizing the cost function with respect to only one parameter at a time. It has been shown that by repeatedly cycling through all the parameters, we are guaranteed to arrive at a local minimum. If, instead of one parameter, we take a set of parameters as we did above with W and H, it is called block coordinate descent (the interested reader should explore Alternating Minimization). CD makes sense if either the parameters are clearly separable into independent groups, or if optimizing with respect to a certain set of parameters is more efficient than with respect to the others.
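A tiny sketch of block coordinate descent on a convex quadratic (the function f and its closed-form per-coordinate minimizers are illustrative assumptions, not the sparse coding problem itself):

```python
# Block coordinate descent on the convex quadratic
#   f(x, y) = x^2 + y^2 + x*y - 2x - 4y,
# alternately minimizing exactly over x (holding y fixed) and over y.
x, y = 10.0, -10.0
for _ in range(50):
    x = (2.0 - y) / 2.0   # argmin_x f(x, y): solves 2x + y - 2 = 0
    y = (4.0 - x) / 2.0   # argmin_y f(x, y): solves 2y + x - 4 = 0
print(x, y)  # converges to the joint minimum (0, 2)
```

Each sweep solves one coordinate's subproblem exactly; because the coupling term x·y is weak relative to the diagonal curvature, the alternation converges to the joint minimizer.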
Polyak Averaging: Polyak averaging consists of averaging several points in the parameter space that the optimization algorithm traverses. So, if the algorithm visits the points θ(1), θ(2), … during optimization, the output of Polyak averaging is θ̂(t) = (1/t) Σᵢ θ(i).
Most optimization problems in deep learning are non-convex, and the path taken by the optimization algorithm is quite complicated: a point visited in the distant past might be quite far from the current point in parameter space. Including such a point in the average might not be useful, which is why an exponentially decaying running average is taken instead. This scheme, in which recent iterates are weighted more than past ones, is called Polyak-Ruppert averaging: θ̂(t) = α·θ̂(t−1) + (1 − α)·θ(t).
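A runnable sketch of this exponentially decaying running average; the noisy iterates oscillating around a fixed point are an illustrative simulation:

```python
import numpy as np

def polyak_ruppert(theta_hat, theta, alpha=0.99):
    """Exponentially decaying running average of iterates:
    theta_hat <- alpha * theta_hat + (1 - alpha) * theta."""
    return alpha * theta_hat + (1 - alpha) * theta

# Iterates oscillating around 5.0: the running average smooths them out.
rng = np.random.default_rng(0)
theta_hat = 0.0
for t in range(2000):
    theta_t = 5.0 + rng.normal(scale=1.0)   # noisy iterate around 5
    theta_hat = polyak_ruppert(theta_hat, theta_t)
print(theta_hat)  # close to 5.0, far less noisy than any single iterate
```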
Supervised Pretraining: Sometimes it is hard to train directly to solve a specific task. It might instead be better to train on a simpler task and use the result as an initialization point for training on the more challenging task. By analogy, learning a simpler task first puts you in a better position to understand a more complex one. This strategy of training on a simpler task before facing the harder one is called pretraining.
A particular type, called greedy supervised pretraining, first breaks a given supervised learning problem into simpler supervised problems and solves for the optimal version of each component in isolation. The hypothesis as to why this works is that it gives better guidance to the intermediate layers of the network and helps with both generalization and optimization. More often than not, greedy pretraining is followed by a fine-tuning stage, in which all the parts are jointly optimized to search for the optimal solution to the full problem. As an example, the figure below shows how each hidden layer is trained one at a time, where the input to the hidden layer being learned is the output of the previously trained hidden layer.
P Jyothi,Asst. Prof., CSE Dept.
[Figure: greedy supervised pretraining — each hidden layer is trained one at a time on the output of the previously trained layer]
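The layer-at-a-time scheme can be sketched as follows. This is a toy NumPy illustration, not any particular paper's recipe: each stage trains one hidden layer (with a throwaway linear head) on the supervised target, freezes it, and feeds its output to the next stage.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def train_one_stage(X, y, n_hidden, steps=200, lr=0.02):
    """Train a single hidden layer plus a temporary linear head by plain
    gradient descent on squared error; return the hidden layer weights."""
    n = len(X)
    W = rng.normal(0.0, 0.5, (X.shape[1], n_hidden))
    v = rng.normal(0.0, 0.5, (n_hidden, 1))
    for _ in range(steps):
        H = relu(X @ W)
        err = H @ v - y                          # prediction error
        g_v = H.T @ err / n                      # head gradient
        g_W = X.T @ ((err @ v.T) * (H > 0)) / n  # backprop through ReLU
        v -= lr * g_v
        W -= lr * g_W
    return W

def greedy_pretrain(X, y, layer_sizes):
    """Greedy supervised pretraining: each hidden layer is trained in
    isolation, then frozen; its output becomes the next stage's input."""
    weights, features = [], X
    for n_hidden in layer_sizes:
        W = train_one_stage(features, y, n_hidden)
        weights.append(W)
        features = relu(features @ W)            # frozen layer's output
    return weights, features
```

In practice the returned weights would initialize the full network, and a fine-tuning stage would then optimize all layers jointly.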
Optimization Strategies and Meta-
Algorithms conti..
Also, FitNets shows an alternative way to guide the training
process. Deep networks are hard to train mainly because
the deeper the model gets, the more non-linearities are
introduced. The authors propose first training a shallower
and wider teacher network. Then a second network, which is
thinner and deeper, called the student network, is trained to
predict not only the final outputs but also the intermediate
layers of the teacher network. For those who might not be
clear on what deep, shallow, wide and thin mean here, refer
to the following diagram:
[Figure: a shallow, wide network vs. a deep, thin network]
Optimization Strategies and Meta-
Algorithms conti..
The idea is that predicting the intermediate layers of the
teacher network provides hints as to how the layers of the
student network should be used, and aids the optimization
procedure. It was shown that without the hints to the hidden
layers, the student network performs poorly on both the
training and test data.
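The hint term can be sketched as a simple regression loss between intermediate activations. A minimal sketch with made-up shapes (the regressor is needed because the thin student layer and the wide teacher layer have different widths):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hint_loss(student_hidden, teacher_hidden, regressor_W):
    """Hint objective: the student's intermediate activations, mapped
    through a small learned regressor, should match the teacher's."""
    mapped = student_hidden @ regressor_W      # (n, d_student) -> (n, d_teacher)
    return np.mean((mapped - teacher_hidden) ** 2)

# Illustrative shapes: thin student (width 4) guided by wide teacher (width 16)
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))
teacher_hidden = relu(X @ rng.normal(size=(8, 16)))
student_hidden = relu(X @ rng.normal(size=(8, 4)))
W_r = rng.normal(size=(4, 16))
loss = hint_loss(student_hidden, teacher_hidden, W_r)
```

During training this term would be minimized jointly with the usual output-matching loss, pulling the student's hidden layers toward representations the teacher found useful.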
Optimization Strategies and Meta-
Algorithms conti..
Designing Models to Aid Optimization: Much of the work in
deep learning has gone towards making models easier to
optimize rather than designing a more powerful optimization
algorithm. This is evident from the fact that Stochastic Gradient
Descent, which is primarily used for training deep models today,
has been in use since the 1960s. Many current design choices
lean towards using linear transformations between layers and
activation functions like ReLU [max(0, x)], which are linear for
the most part and enjoy large gradients compared to sigmoidal
units, which saturate easily. Also, linear functions increase
consistently in a particular direction; thus, if there's an error,
there's a clear direction in which the output should move to
minimize it.
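The saturation claim is easy to check numerically: the ReLU derivative is 1 wherever the unit is active, while the sigmoid derivative s(z)(1 − s(z)) collapses toward zero for large |z| (the sample inputs below are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.5, 10.0])

relu_grad = (z > 0).astype(float)             # 1 wherever the unit is active
sigmoid_grad = sigmoid(z) * (1 - sigmoid(z))  # shrinks to ~0 for large |z|

print(relu_grad)     # [0. 0. 1. 1.]
print(sigmoid_grad)  # roughly [4.5e-05, 0.197, 0.235, 4.5e-05]
```

The sigmoid gradient never exceeds 0.25 and is essentially zero at |z| = 10, which is why stacked sigmoidal layers starve lower layers of gradient signal.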
Optimization Strategies and Meta-
Algorithms conti..
Residual connections reduce the length of the shortest path
from the output to the lower layers, thereby allowing a
larger gradient to flow through and hence tackling
the vanishing gradient problem. Similarly, GoogLeNet
attached multiple auxiliary copies of the output head to
intermediate layers so that a larger gradient flows to those
layers. Once training is complete, the auxiliary heads are
removed.
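A residual block computes y = x + F(x): the identity "skip" path gives gradients a short route around the weight layers. A minimal NumPy sketch (a two-layer ReLU branch; real blocks add normalization, convolutions, etc.):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """y = x + F(x): the skip connection carries the signal forward
    unchanged, and in backprop carries the gradient straight back."""
    return x + relu(x @ W1) @ W2

# With the residual branch zeroed out, the block is exactly the identity —
# the skip path alone preserves the signal.
x = np.arange(6.0).reshape(2, 3)
y = residual_block(x, np.zeros((3, 4)), np.zeros((4, 3)))  # y == x
```

Because ∂y/∂x = I + ∂F/∂x, the gradient reaching lower layers always contains an undamped identity term, which is precisely the "shorter path" argument above.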
Optimization Strategies and Meta-
Algorithms conti..
Continuation Methods and Curriculum Learning: The book's own
treatment of this topic is brief and exactly to the point, so the
interested reader is encouraged to refer to it directly; what follows
is only a short overview.
Optimization Strategies and Meta-
Algorithms conti..
To give a very brief idea, Continuation Methods are a
family of strategies where, instead of a single cost function,
a series of cost functions is optimized, so that the algorithm
spends most of its time in well-behaved regions of the
parameter space. The series is designed so that the cost
functions become progressively harder to solve: the first
cost function is the easiest, the final one (the cost function
we actually want to minimize) is the toughest, and the
solution to each easier cost function serves as a good
initialization point for the next, harder-to-optimize cost
function.
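A minimal sketch of the idea, using a made-up one-dimensional cost x² + 2·sin(5x): Gaussian-blurring it with std σ damps the wiggly term by exp(−(5σ)²/2), so large σ gives an easy, nearly quadratic cost, and each solution initializes descent on the next, less-smoothed cost.

```python
import numpy as np

K = 5.0  # frequency of the wiggly term in the toy cost x**2 + 2*sin(K*x)

def smoothed_grad(x, sigma):
    """Gradient of the Gaussian-blurred cost: blurring multiplies
    sin(K*x) by exp(-(K*sigma)**2 / 2) and leaves the quadratic
    term's gradient unchanged."""
    damp = np.exp(-(K * sigma) ** 2 / 2)
    return 2 * x + 2 * damp * K * np.cos(K * x)

def continuation_descent(x0, sigmas=(1.0, 0.5, 0.2, 0.0), steps=200, lr=0.02):
    """Minimize the easiest (most blurred) cost first; each solution
    initializes gradient descent on the next, harder cost."""
    x = x0
    for sigma in sigmas:          # progressively less smoothing
        for _ in range(steps):
            x -= lr * smoothed_grad(x, sigma)
    return x

x_star = continuation_descent(2.0)  # settles near x ≈ -0.30, the deepest valley
```

Started at x = 2, plain gradient descent on the unblurred cost would get trapped in a nearby shallow valley; the smoothed sequence steers it into the deep one instead.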
Optimization Strategies and Meta-
Algorithms conti..
Curriculum Learning can be interpreted as a type of
continuation method where the learning process is
planned so that the simpler concepts are learned
first, followed by the harder ones that depend on those
simpler concepts. This idea has been shown to
accelerate progress in animal training. One simple way
to use it in the context of machine learning is to initially
show the algorithm more of the easier images and fewer
of the harder images. As learning progresses, the
proportion of easier images is reduced and that of the
harder images is increased, so that the model can
ultimately learn the harder task.
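The schedule above can be sketched as a batch sampler whose share of hard examples grows with training progress. A toy illustration (the easy/hard split and the linear ramp are assumptions for the demo, not a prescribed schedule):

```python
import numpy as np

rng = np.random.default_rng(0)

def curriculum_batch(easy, hard, progress, batch_size=8):
    """Sample a training batch where the fraction of hard examples
    grows linearly with progress (0.0 = start, 1.0 = end)."""
    n_hard = int(round(batch_size * progress))
    n_easy = batch_size - n_hard
    idx_e = rng.choice(len(easy), n_easy, replace=True)
    idx_h = rng.choice(len(hard), n_hard, replace=True)
    return np.concatenate([easy[idx_e], hard[idx_h]])

# Toy data: label easy examples 0, hard examples 1
easy = np.zeros(100)
hard = np.ones(100)
print(curriculum_batch(easy, hard, 0.0))  # all easy at the start
print(curriculum_batch(easy, hard, 1.0))  # all hard at the end
```

In a real training loop, `progress` would be the current step divided by the total number of steps, so early batches teach the simple concepts the later, harder batches build on.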