
Lecture 11

Math Foundations Team


Introduction

▶ We will take a deeper look at some challenges in non-linear optimization, continuing on from the previous lecture.
▶ First we will address the problem of flat regions and local optima.
▶ Second we will look at the problem of different levels of curvature in different directions.
▶ We will need to figure out strategies to deal with the above two situations to design a good optimization algorithm.
Local optima and flat regions

▶ Consider the following one-dimensional function:
  F(x) = (x − 1)²((x − 3)² − 1).
▶ Taking the derivative and setting it to zero, we have
  F′(x) = 2(x − 1)((x − 1)(x − 3) + (x − 3)² − 1) = 0.
▶ The solutions to this equation are x = 1, x = (5 − √3)/2 ≈ 1.634, and x = (5 + √3)/2 ≈ 3.366.
▶ We can show that the first and third roots are minima, since F′′(x) > 0 at these points, while the second point is a maximum, since F′′(x) < 0.
▶ Evaluating the function values at these points, we find F(1) = 0, F((5 − √3)/2) ≈ 0.348, and F((5 + √3)/2) ≈ −4.848.
Local optima and flat regions

If we start gradient descent from any point less than 1.634, we will arrive only at the local minimum x = 1 and miss the global minimum at x ≈ 3.366, as the sketch below illustrates.
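A minimal sketch of this behaviour (not from the lecture; the learning rate, step count, and starting points are illustrative assumptions):

```python
# Gradient descent on F(x) = (x - 1)^2 ((x - 3)^2 - 1): the starting
# point determines which minimum we reach.

def F(x):
    return (x - 1) ** 2 * ((x - 3) ** 2 - 1)

def F_prime(x):
    return 2 * (x - 1) * ((x - 1) * (x - 3) + (x - 3) ** 2 - 1)

def gradient_descent(x0, lr=0.01, steps=5000):
    x = x0
    for _ in range(steps):
        x -= lr * F_prime(x)
    return x

for x0 in (0.0, 1.5, 2.0, 4.0):
    x_star = gradient_descent(x0)
    print(f"start {x0:4.1f} -> x* = {x_star:.3f}, F(x*) = {F(x_star):.3f}")

# Starts below 1.634 converge to the local minimum x = 1 (F = 0);
# starts above it reach the global minimum x ≈ 3.366 (F ≈ -4.848).
```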
Local optima and flat regions

▶ We might never arrive at a global minimum if we keep choosing wrong starting points; we will get stuck at a local minimum, not knowing that a better solution exists.
▶ The problem becomes worse with high dimensionality.
▶ Consider an objective function consisting of the sum of d univariate functions in the variables x1, x2, . . . , xd, each a function of a different variable, say F(x1, x2, . . . , xd) = A1(x1) + A2(x2) + . . . + Ad(xd).
▶ Let Ai(xi) have ki local minima. Setting ∂F/∂xi = 0 for all i, we note that any point (x1*, x2*, . . . , xd*), where xi* is a local minimum of the function Ai(xi), is a solution to ∂F/∂xi = 0.
▶ This is because ∂F/∂xi evaluated at (x1*, x2*, . . . , xd*) equals A′i(xi*) = 0, since xi* is a local minimum of Ai(xi).
Local optima and flat regions

▶ There are therefore k1 k2 · · · kd local minima for the function F(x1, x2, . . . , xd), which is a very large number of points; for example, with d = 10 and ki = 3 for all i, there are 3¹⁰ = 59049 local minima. Gradient descent could get stuck at any one of these points, which might be far from the global optimum.
▶ Another problem to contend with is the presence of flat
regions where the gradient is close to zero. An example of this
situation is shown in the next slide.
▶ Flat regions are problematic because the speed of descent depends on the magnitude of the gradient, given a fixed learning rate. The optimization process will take a long time to cross a flat region of space, which makes convergence slow.
Local optima and flat regions
Differential curvature

▶ In multi-dimensional settings, the components of the gradient with respect to different parameters can vary widely. As mentioned on a previous slide, this causes convergence problems, since the update step oscillates with respect to some components while making steady movement with respect to others.
▶ Consider the simplest possible case of a bowl-like convex, quadratic objective function with a single global minimum: L = x² + y² represents a perfectly circular bowl, while L = x² + 4y² represents an elliptical one.
▶ We shall show contour plots of both functions and how
gradient descent performs on finding the minimum of the two
functions.
Differential curvature

▶ What is the qualitative difference between L = x² + y² and L = x² + 4y²? Intuitively, one is symmetric in x and y, while the other is not.
▶ The second loss function is more sensitive to changes in y than to changes in x - it looks like an elliptical bowl. The specific sensitivity depends on the position (x, y).
▶ Looking at the second-order derivatives, we can see that for the second function ∂²L/∂x² = 2 and ∂²L/∂y² = 8 are very different.
Differential curvature

▶ The second-order derivative measures the rate of change of the gradient - a high second-order derivative means high curvature.
▶ From the point of view of gradient descent, we want moderate curvature in all dimensions, as this means the gradient does not change too much in some dimensions compared to others. We can then take gradient-descent steps of large sizes.
Differential curvature

▶ In the next slide we show contour plots of the perfect and elliptical bowls discussed previously. We see that in the case of the perfect bowl, a sufficiently large step size from any point can take us directly to the optimum of the function in one step, since the gradient at any point points towards the optimum. This is not true for the elliptical bowl: the gradient at any point does not point towards the optimum of the function.
▶ Note that the gradient at any point is orthogonal to the contour line at that point. This is because the dot product of the gradient ∇F and a small displacement δx along the contour line gives the change in the value of the function along that displacement. Since the function remains constant along the contour line, ∇F · δx = 0.
Contour plots
Differential curvature

▶ A closer look at the contour plot for the elliptical bowl shows oscillatory movement in the y-direction, as in each step we correct the overshoot made in the previous step; the gradient component along the y-direction is larger than the component along the x-direction (see the sketch below).
▶ Along the x-direction, we make small movements towards the optimal x-value.
▶ Overall, after many training steps we find that we have made little progress towards the optimum.
▶ It needs to be kept in mind that the path of steepest descent in most objective functions is only an instantaneous direction of best improvement, and is not the correct direction of descent in the longer term.
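A minimal sketch contrasting the two bowls (the learning rate 0.2, step count, and starting point are illustrative assumptions):

```python
# Gradient descent on the circular bowl L = x^2 + y^2 versus the
# elliptical bowl L = x^2 + 4y^2, starting from the same point.
import numpy as np

def descend(grad, start, lr=0.2, steps=6):
    p = np.array(start, dtype=float)
    path = [p.copy()]
    for _ in range(steps):
        p -= lr * grad(p)          # p <- p - lr * gradient at p
        path.append(p.copy())
    return np.array(path)

grad_circle  = lambda p: np.array([2 * p[0], 2 * p[1]])  # grad of x^2 + y^2
grad_ellipse = lambda p: np.array([2 * p[0], 8 * p[1]])  # grad of x^2 + 4y^2

print(descend(grad_circle,  (4.0, 1.0)))  # both coordinates shrink steadily
print(descend(grad_ellipse, (4.0, 1.0)))  # y overshoots and flips sign each
                                          # step, while x shrinks only slowly
# With lr = 0.5 the circular bowl reaches (0, 0) in a single step, since
# its gradient points straight at the optimum; the same rate makes the
# elliptical bowl diverge in y (1 - 0.5 * 8 = -3).
```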
Revisiting feature normalization

We show how to address, in some measure, the differential curvature problem by feature normalization. Consider the following toy dataset, where the two input attributes are Guns and Butter respectively, and the output attribute is Happiness.
Guns Butter Happiness
(number per capita) (ounces per capita) (index)
0.1 25 7
0.8 10 1
0.4 10 4
We intend to find a relationship of the form y = w1 x1 + w2 x2 from
the data. The coefficients w1 and w2 are found using gradient
descent on the loss function computed from the above data.
Revisiting feature normalization

▶ From the given three examples we can set up the loss function as follows:
  J(w) = (0.1w1 + 25w2 − 7)² + (0.8w1 + 10w2 − 1)² + (0.4w1 + 10w2 − 4)².
▶ We note that this objective function is much more sensitive to w2 than to w1, since the coefficients of w2 in the expression above are much larger than those of w1.
▶ One way to get around this issue is to standardize each column to zero mean and unit variance; the coefficients for w1 and w2 then become much more similar, and the differential curvature is reduced (a short sketch follows).
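A minimal sketch of this standardization on the toy dataset above (the use of NumPy is an illustrative choice):

```python
# Standardize each feature column to zero mean and unit variance so the
# loss has comparable curvature in w1 and w2.
import numpy as np

X = np.array([[0.1, 25.0],
              [0.8, 10.0],
              [0.4, 10.0]])    # columns: Guns, Butter
y = np.array([7.0, 1.0, 4.0])  # Happiness

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std)  # both columns now have zero mean and unit variance, so the
              # coefficients multiplying w1 and w2 are on a similar scale
```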
Difficult topologies - cliffs

▶ Two examples of high-curvature surfaces are cliffs and valleys.
▶ An example of a cliff is shown below:
Difficult topologies - cliffs

▶ Considering the previous figure, we see that the partial derivative with respect to x changes drastically as we move along the x-axis.
▶ A modest learning rate will cause minimal reduction in the value of the objective function in the gently sloping regions.
▶ The same modest learning rate in the steeply sloping regions will cause us to overshoot the optimal value.
▶ The problem is caused by the nature of the curvature - the first-order gradient does not carry any information that would help control the size of the update.
▶ We need to look at second-order derivatives in this case.
Difficult topologies - valleys

▶ Consider the figure below, depicting a valley with a gentle slope along the y-direction and a U-shape in the x-direction.
▶ The gradient descent method will bounce violently between the steep sides of the valley while not making much progress along the gently sloping direction.
Difficult topologies - valleys
Adjusting first-order derivatives for descent

▶ We need to find ways of magnifying movement along consistent directions of the gradient, to avoid bouncing around the optimal solution.
▶ We could use second-order information to modify the first-order derivatives, taking curvature into account as we modify components of the gradient. However, obtaining second-order information is computationally expensive.
▶ A computationally less expensive way to handle the problem is to use different learning rates for different parameters, because parameters with large partial derivatives show oscillatory behaviour whereas those with small partial derivatives show consistent behaviour.
Momentum-based learning

▶ Momentum-based methods attack the issues of flat regions, cliffs and valleys by emphasizing medium-term to long-term directions of consistent movement.
▶ An aggregated measure of feedback is used to reinforce movement along certain directions and speed up gradient descent.
▶ The concept of momentum can be illustrated by a marble rolling down a hill that has a number of "local" distortions like potholes and ditches. The momentum of the marble causes it to navigate local distortions and emerge out of them.
▶ The normal update procedure for gradient descent can be written as w ← w + v, where v ← −α ∂J/∂w.
Momentum-based learning

▶ The gradient-descent procedure is modified in momentum-based learning in the sense that the update vector v inherits a fraction β ∈ (0, 1) of its velocity from the previous step: v ← βv − α ∂J/∂w (see the sketch after these bullets).
▶ When β = 0, we recover the standard gradient descent approach.
▶ When β is close to 1, we end up reinforcing a consistent velocity in the correct direction.
▶ β is referred to as the momentum parameter or the friction parameter.
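A minimal sketch of the momentum update (grad_J is a hypothetical gradient function; the hyperparameters are illustrative assumptions):

```python
# Momentum update: v <- beta * v - alpha * dJ/dw, then w <- w + v.
import numpy as np

def momentum_descent(grad_J, w0, alpha=0.01, beta=0.9, steps=1000):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - alpha * grad_J(w)  # inherit a fraction of the velocity
        w = w + v
    return w

# On the elliptical bowl L = x^2 + 4y^2 from earlier:
grad = lambda w: np.array([2 * w[0], 8 * w[1]])
print(momentum_descent(grad, (4.0, 1.0)))  # close to the optimum (0, 0)
```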
Momentum-based learning

▶ The figure below shows how momentum-based learning compares to gradient descent, and how momentum allows it to navigate potholes.
Momentum-based learning

▶ Momentum-based learning accelerates gradient descent, since the algorithm moves quicker in the direction of the optimal solution.
▶ The useless sideways oscillations are damped, as they get cancelled out during the averaging process.
▶ In the figure on the next slide, it should be clear that momentum increases the relative component of the gradient in the correct direction.
Momentum-based learning
AdaGrad

▶ Here we keep track of the aggregated squared magnitude of the partial derivative with respect to each parameter over the course of the algorithm.
▶ Mathematically, the aggregated squared partial derivative for the ith parameter is stored in Ai and updated as Ai ← Ai + (∂J/∂wi)², ∀i.
▶ The actual update step becomes wi ← wi − (α/√Ai) ∂J/∂wi, ∀i (see the sketch below).
▶ Sometimes we might need to avoid ill-conditioning; this is done by adding a small ϵ = 10⁻⁸ to Ai in the expression above.
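A minimal sketch of AdaGrad as described above (grad_J is a hypothetical gradient function; the learning rate is an illustrative assumption):

```python
# AdaGrad: per-parameter learning rates scaled by the accumulated
# squared gradients.
import numpy as np

def adagrad(grad_J, w0, alpha=0.5, eps=1e-8, steps=1000):
    w = np.array(w0, dtype=float)
    A = np.zeros_like(w)           # running sum of squared gradients
    for _ in range(steps):
        g = grad_J(w)
        A += g ** 2                # A_i <- A_i + (dJ/dw_i)^2
        w -= alpha * g / np.sqrt(A + eps)
    return w

grad = lambda w: np.array([2 * w[0], 8 * w[1]])  # L = x^2 + 4y^2
print(adagrad(grad, (4.0, 1.0)))  # both coordinates shrink at a balanced rate
```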
AdaGrad

▶ Ai measures only the historical magnitude of the gradient rather than its sign.
▶ If the gradient alternately takes values +100 and −100, Ai will become pretty large, and the update step along the parameter in question will be pretty small. On the other hand, if the gradient consistently takes the value 0.1, Ai will not be as large as before, and the update step will be comparatively larger than in the oscillating case.
▶ The update step along the consistent gradient will be
emphasized relative to the update step along the oscillatory
component.
▶ With the passage of time, however, absolute movements along
all components will slow down because Ai is monotonically
increasing with time.
RMSProp

▶ AdaGrad suffers from the problem of not making much progress after a while, and from the fact that Ai is aggregated over the entire history of partial derivatives, which may make the method stale.
▶ Instead of simply adding squared gradients to estimate Ai, RMSProp uses exponential averaging, so the scaling factor Ai does not constantly increase. Ai is updated according to Ai ← ρAi + (1 − ρ)(∂J/∂wi)², where ρ ∈ (0, 1).
▶ The update step is wi ← wi − (α/√Ai) ∂J/∂wi, ∀i (see the sketch below).
▶ The key idea that differentiates RMSProp from AdaGrad is that the importance of older gradients decays exponentially with time, since the gradient from t steps before is weighted by ρ^t.
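A minimal sketch of RMSProp (grad_J is a hypothetical gradient function; ρ = 0.9 and the learning rate are illustrative assumptions):

```python
# RMSProp: exponential averaging of squared gradients, so A_i does not
# grow without bound and older gradients decay by rho^t.
import numpy as np

def rmsprop(grad_J, w0, alpha=0.05, rho=0.9, eps=1e-8, steps=1000):
    w = np.array(w0, dtype=float)
    A = np.zeros_like(w)
    for _ in range(steps):
        g = grad_J(w)
        A = rho * A + (1 - rho) * g ** 2  # exponential average of g^2
        w -= alpha * g / np.sqrt(A + eps)
    return w

grad = lambda w: np.array([2 * w[0], 8 * w[1]])  # L = x^2 + 4y^2
print(rmsprop(grad, (4.0, 1.0)))
```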
Adam

▶ The Adam method uses a similar normalization as AdaGrad and RMSProp.
▶ It also incorporates momentum into the update and addresses the initialization bias present in RMSProp.
▶ Ai is exponentially averaged the same way as in RMSProp, i.e., Ai ← ρAi + (1 − ρ)(∂J/∂wi)², where ρ ∈ (0, 1).
▶ An exponentially smoothed gradient, whose ith component is Fi, is also maintained. Smoothing is performed with a decay parameter ρf: Fi ← ρf Fi + (1 − ρf) ∂J/∂wi.
▶ The following update is used at the tth iteration: wi ← wi − (αt/√Ai) Fi.
Adam

▶ There are two key differences from the RMSProp algorithm. The first is that the gradient is replaced with its exponentially smoothed value in order to incorporate momentum.
▶ The second is that the learning rate depends on the iteration index t, and is defined as αt = α · √(1 − ρ^t) / (1 − ρf^t).
▶ Both Fi and Ai are initialized to zero, which causes a bias in early iterations. The two quantities are affected differently, which accounts for the form of αt (see the sketch below).
▶ As t → ∞, ρ^t → 0 and ρf^t → 0, so αt → α, since ρ, ρf ∈ (0, 1). The default suggested values for ρf and ρ are 0.9 and 0.999 respectively.
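A minimal sketch of Adam as described above (grad_J is a hypothetical gradient function; the base learning rate is an illustrative assumption, with the suggested defaults ρf = 0.9 and ρ = 0.999):

```python
# Adam: exponentially smoothed gradient F plus RMSProp-style scaling A,
# with the bias-corrected rate alpha_t = alpha * sqrt(1 - rho^t) / (1 - rho_f^t).
import numpy as np

def adam(grad_J, w0, alpha=0.05, rho_f=0.9, rho=0.999, eps=1e-8, steps=2000):
    w = np.array(w0, dtype=float)
    F = np.zeros_like(w)  # smoothed gradient (momentum term)
    A = np.zeros_like(w)  # smoothed squared gradient
    for t in range(1, steps + 1):
        g = grad_J(w)
        F = rho_f * F + (1 - rho_f) * g
        A = rho * A + (1 - rho) * g ** 2
        alpha_t = alpha * np.sqrt(1 - rho ** t) / (1 - rho_f ** t)
        w -= alpha_t * F / (np.sqrt(A) + eps)
    return w

grad = lambda w: np.array([2 * w[0], 8 * w[1]])  # L = x^2 + 4y^2
print(adam(grad, (4.0, 1.0)))
```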
