Lecture 11
If we start gradient descent from any point less than 1.634, we will
reach only a local minimum, not the global one.
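The dependence on the starting point can be sketched in a few lines. The objective below, f(x) = x^4 - 4x^2 + x, is a hypothetical stand-in and not the function from the lecture; the learning rate and step count are also assumptions. It has a shallower minimum near x = 1.347 and a deeper one near x = -1.473.

```python
def f(x):
    # Hypothetical objective with two minima (not the lecture's function).
    return x**4 - 4 * x**2 + x


def grad_f(x):
    # Derivative of f.
    return 4 * x**3 - 8 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


x_right = gradient_descent(2.0)   # settles in the shallower (local) basin
x_left = gradient_descent(-2.0)   # settles in the deeper (global) basin

print("start  2.0 ->", round(x_right, 3), "f =", round(f(x_right), 3))
print("start -2.0 ->", round(x_left, 3), "f =", round(f(x_left), 3))
```

Both runs stop at a point with zero gradient, but only the run started on the left finds the lower value of f.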
Local optima and flat regions
▶ There are therefore $\prod_{i=1}^{d} k_i$ local minima for the function
$F(x_1, x_2, \ldots, x_d)$, which is a very large number of points.
Gradient descent could be stuck at any one of these points
which might be far from the global optimum.
▶ Another problem to contend with is the presence of flat
regions where the gradient is close to zero. An example of this
situation is shown in the next slide.
▶ Flat regions are problematic because the speed of descent
depends on the magnitude of the gradient, given a fixed
learning rate. The optimization process will take a long time
to cross a flat region of space which will make convergence
slow.
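The effect of a flat region on a fixed learning rate can be illustrated with a small sketch. The objective f(x) = -exp(-x^2) is an assumed example, chosen because it is nearly flat far from its minimum at x = 0; the learning rate, tolerance, and starting points are likewise assumptions.

```python
import math


def grad(x):
    # Gradient of the hypothetical objective f(x) = -exp(-x**2),
    # which is nearly flat far from its minimum at x = 0.
    return 2 * x * math.exp(-x**2)


def steps_to_converge(x0, lr=0.1, tol=0.1, max_steps=100_000):
    """Count fixed-learning-rate steps until |x| < tol."""
    x, steps = x0, 0
    while abs(x) >= tol and steps < max_steps:
        x -= lr * grad(x)   # step size is proportional to the gradient
        steps += 1
    return steps


steps_steep = steps_to_converge(1.0)  # starts where the gradient is sizeable
steps_flat = steps_to_converge(3.0)   # starts on the flat plateau

print("steps from x0 = 1.0:", steps_steep)
print("steps from x0 = 3.0:", steps_flat)
```

The start on the plateau needs orders of magnitude more steps, since each update there moves x by only a tiny amount.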
Local optima and flat regions
Differential curvature
▶ A closer look at the contour plot for the elliptical bowl case
shows oscillatory movement in the y-direction, because in each
step we correct the overshoot made in the previous step. The
gradient component along the y-direction is larger than the
component along the x-direction.
▶ Along the x-direction, we make small movements towards the
optimum x-value.
▶ Overall, after many training steps we find that we have made
little progress toward the optimum.
▶ It needs to be kept in mind that the path of steepest descent
in most objective functions is only an instantaneous direction
of best improvement, and is not the correct direction of
descent in the longer term.
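The oscillation described above can be reproduced with a minimal sketch. The bowl f(x, y) = x^2 + 20y^2, the learning rate 0.045, and the starting point (10, 1) are all assumptions chosen for illustration. With these values each step multiplies y by 1 - 0.045 * 40 = -0.8 and x by 1 - 0.045 * 2 = 0.91.

```python
import numpy as np


def grad(p):
    # Gradient of the hypothetical elliptical bowl f(x, y) = x**2 + 20 * y**2.
    x, y = p
    return np.array([2.0 * x, 40.0 * y])


lr = 0.045                   # assumed fixed learning rate
p = np.array([10.0, 1.0])    # assumed starting point
path = [p.copy()]
for _ in range(10):
    p = p - lr * grad(p)     # steepest-descent update
    path.append(p.copy())
path = np.array(path)

xs, ys = path[:, 0], path[:, 1]
print("x:", np.round(xs, 3))  # shrinks slowly, never changes sign
print("y:", np.round(ys, 3))  # flips sign every step (overshoot, then correction)
```

Lowering the learning rate removes the oscillation in y but slows the movement in x even further, which is the trade-off that differential curvature creates.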
Revisiting feature normalization
▶ From the given three examples we can set up the loss function
as follows:
$$J(w) = (0.1 w_1 + 25 w_2 - 7)^2 + (0.8 w_1 + 10 w_2 - 1)^2 + (0.4 w_1 + 10 w_2 - 4)^2.$$
▶ We note that this objective function is much more sensitive to
$w_2$ than $w_1$ since the coefficients for $w_2$ in the expression
above are much larger than those for $w_1$.
▶ One way to get around this issue is to standardize each
column to zero mean and unit variance. The coefficients for $w_1$
and $w_2$ then become much more similar in magnitude, and
differential curvature is reduced.
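A rough numerical check of this claim, using the three examples' feature values from the loss above. For a sum-of-squares loss the Hessian is 2 XᵀX, and its condition number (ratio of largest to smallest eigenvalue) is used here as an assumed measure of differential curvature. The targets do not affect the Hessian, so they are omitted, and the sketch compares curvature only (a zero-mean model would normally also need an intercept term).

```python
import numpy as np

# Feature columns taken from the coefficients of w1 and w2 in J(w).
X = np.array([[0.1, 25.0],
              [0.8, 10.0],
              [0.4, 10.0]])


def hessian(features):
    # Hessian of sum((features @ w - y)**2) with respect to w.
    return 2.0 * features.T @ features


H_raw = hessian(X)

# Standardize each column to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
H_std = hessian(Z)

cond_raw = np.linalg.cond(H_raw)
cond_std = np.linalg.cond(H_std)

print("curvature along w1, w2 (raw):", np.diag(H_raw))
print("condition number, raw:         ", round(cond_raw, 1))
print("condition number, standardized:", round(cond_std, 1))
```

The raw Hessian has a curvature of 1.62 along w1 against 1650 along w2; after standardization the two diagonal entries are equal and the condition number drops by roughly two orders of magnitude.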
Difficult topologies - cliffs