Machine Learning - Lecture 2
(521289S)
FROM PREVIOUS LECTURE:
Optimizing a model for linear regression: cost function minimization
• Parameters to be optimized are the Slope and Intercept of the fitting line
– Parameter space is 2-dimensional with axes Slope and Intercept
• Different values are tried for the parameters until an optimal solution is found
• The goodness of a model is evaluated by computing the cost function value g()
– A smaller value of g() indicates a better model
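As a concrete illustration (a minimal sketch with made-up data and function names, not the course code), the cost g() for a line fit can be written as the mean squared error over the data points and evaluated for different slope/intercept pairs:

    import numpy as np

    def cost(slope, intercept, x, y):
        # Mean squared error of the fitting line over the data set
        predictions = slope * x + intercept
        return np.mean((predictions - y) ** 2)

    # Toy data (hypothetical): points roughly on the line y = 2x + 1
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.1, 2.9, 5.2, 6.8])

    # Trying different parameter values: a smaller g() indicates a better model
    print(cost(2.0, 1.0, x, y))   # close to the data -> small cost
    print(cost(0.5, 0.0, x, y))   # poor fit -> large cost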
Zero-order optimality condition
• Our aim is to search for a minimum of the cost function g()
• The minimization problem can be stated as: minimize g(w) over the parameter space
• Global minimum: search for a point w* in the parameter space such that g(w*) ≤ g(w) for all w
• Local minimum: sometimes there are several minima in the cost function and we may be
  satisfied with any of them, or with one that represents sufficient accuracy / low cost in a
  small neighborhood of the parameter space: g(w*) ≤ g(w) for all w close to w*
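As an illustration of the zero-order idea (a minimal sketch with a made-up cost function, not the course code), one can probe the parameter space using cost evaluations only and keep the best point seen so far:

    import numpy as np

    def g(w):
        # Hypothetical 2-D cost function with its minimum at w = (1, -2)
        return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

    rng = np.random.default_rng(0)
    best_w, best_cost = None, np.inf
    for _ in range(1000):
        w = rng.uniform(-5.0, 5.0, size=2)   # random candidate point in the parameter space
        if g(w) < best_cost:                 # keep the point with the smallest cost so far
            best_w, best_cost = w, g(w)
    print(best_w, best_cost)                 # best point found is near (1, -2)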
Local optimization techniques
• Aim: take a randomly selected initial point w0 in the parameter space and
  start refining it sequentially (= iteration) to search for a minimum of the
  cost function
  – In each iteration round k, always select the next point wk such that we go
    downwards on the cost function
  – While iterating K rounds we get a sequence
    w0, w1, w2, w3, …, wK such that: g(w0) > g(w1) > g(w2) > … > g(wK)
First-order optimality condition
• At a minimum of a 1-dimensional cost function g(w), the derivative of the function
  is zero: dg(w)/dw = 0
• For an N-dimensional parameter vector w, all partial derivatives must vanish at a
  minimum, i.e. the gradient is the zero vector: ∇g(w) = 0
Solving the minimum point analytically
• The first-order system of equations can be very difficult
or virtually impossible to resolve algebraically in a general case
when the cost function is very complex
– Solvable in some special cases only
– The function may contain several minima, maxima and
  saddle points, all of which satisfy the first-order system of equations
• Therefore, the solution is searched with iterative techniques in practice
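For simple cost functions, however, the first-order condition can still be solved in closed form; a minimal sketch (with a made-up one-parameter cost) using sympy:

    import sympy as sp

    w = sp.Symbol('w')
    g = (w - 3) ** 2 + 1                            # simple convex cost with its minimum at w = 3

    stationary_points = sp.solve(sp.diff(g, w), w)  # solve the first-order condition dg/dw = 0
    print(stationary_points)                        # [3]
    print(sp.diff(g, w, 2))                         # second derivative 2 > 0, so w = 3 is a minimum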
Steepest ascent and descent directions
• The gradient computed at a particular point w of the cost function yields a vector
that points to the direction of steepest ascent of the cost function: the fastest
increase direction
• When searching for the minimum point of the cost function, we need to move in the
  steepest descent direction instead, i.e. in the opposite direction (the negative gradient)
• Note: the ascent/descent direction is defined in the parameter space as we are
searching for good increments for the parameter values which will eventually
minimize the cost function
[Figure: the gradient points uphill ('up'), the negative gradient points downhill ('down') on the cost surface]
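This can be checked numerically; a minimal sketch (hypothetical quadratic cost with a hand-coded gradient) showing that a small step along the gradient increases the cost while a step against it decreases the cost:

    import numpy as np

    def g(w):
        # Hypothetical 2-D cost function
        return w[0] ** 2 + 3.0 * w[1] ** 2

    def gradient(w):
        # Analytic gradient of g
        return np.array([2.0 * w[0], 6.0 * w[1]])

    w = np.array([1.0, 1.0])
    step = 0.01 * gradient(w) / np.linalg.norm(gradient(w))
    print(g(w + step) > g(w))   # True: moving along the gradient (ascent) increases the cost
    print(g(w - step) < g(w))   # True: moving against the gradient (descent) decreases the cost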
Gradient descent algorithm
• The basic structure of the algorithm is iterative:
– Set an initial point for parameter vector: w0
– Iterate parameter values until a halting condition is met: wk = wk-1 - α ∇g(wk-1),
  where α is the steplength
[Figure: in the (w1, w2) parameter plane, the descent direction d = -gradient points opposite to the gradient vector]
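A minimal gradient descent sketch (hypothetical convex cost, made-up steplength and starting point; not the course code):

    import numpy as np

    def g(w):
        # Hypothetical convex cost with its minimum at w = (1, -2)
        return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2

    def grad_g(w):
        # Analytic gradient of g
        return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)])

    alpha = 0.1                                 # fixed steplength
    w = np.array([4.0, 3.0])                    # initial point w0
    for k in range(200):
        d = -grad_g(w)                          # steepest descent direction d = -gradient
        w = w + alpha * d                       # update step
        if np.linalg.norm(grad_g(w)) < 1e-6:    # halting condition: gradient has nearly vanished
            break
    print(w)                                    # close to the minimum at (1, -2)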
Gradient descent in nonconvex cost functions
• If the cost function has a complex shape with more than one stationary point,
  which one of them the algorithm finds depends on the random initial point
  – Minima, maxima, saddle points
  – Nonconvex cost functions (convex = an ’upwards opening bowl’)
• Thus, the algorithm must be run several times with different
  initial points w0 to find a satisfactory solution
• With a convex cost function, the algorithm will find the
  optimum if it keeps iterating long enough
[Figure: a nonconvex cost function with several stationary points vs. a convex cost function with a single minimum]
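A minimal sketch of running the search several times from random initial points (hypothetical nonconvex 1-D cost, made-up steplength):

    import numpy as np

    def g(w):
        # Hypothetical nonconvex 1-D cost with several local minima
        return np.sin(3.0 * w) + 0.1 * w ** 2

    def grad_g(w):
        return 3.0 * np.cos(3.0 * w) + 0.2 * w

    rng = np.random.default_rng(0)
    best_w, best_cost = None, np.inf
    for restart in range(10):              # several runs with different initial w0
        w = rng.uniform(-5.0, 5.0)
        for k in range(500):
            w = w - 0.01 * grad_g(w)       # gradient descent step
        if g(w) < best_cost:               # keep the best of the local minima found
            best_w, best_cost = w, g(w)
    print(best_w, best_cost)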
Steplength α control
• The previously shown choices can be applied: fixed steplength and
diminishing steplength (α = 1/k)
• The choice of a fixed steplength must be made experimentally
• The cost function history plot below demonstrates that if α is not chosen
  properly, the search can be slow or the optimum may not be found
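A minimal sketch of the two steplength rules (hypothetical 1-D cost; the fixed value 1.2 is deliberately too large), recording the cost function history:

    import numpy as np

    def g(w):
        # Hypothetical 1-D convex cost with its minimum at w = 0
        return w ** 2

    def grad_g(w):
        return 2.0 * w

    def run(steplength_rule, w0=5.0, K=20):
        w, history = w0, []
        for k in range(1, K + 1):
            w = w - steplength_rule(k) * grad_g(w)
            history.append(g(w))          # cost function history
        return history

    fixed = run(lambda k: 1.2)        # fixed steplength, chosen too large: cost grows, minimum not found
    dimin = run(lambda k: 1.0 / k)    # diminishing steplength alpha = 1/k: the iteration settles down
    print(fixed[-1], dimin[-1])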
An example comparing fixed and diminishing steplength control
• The cost function history plot may indicate that the solution misses the minimum or even oscillates
An example of nonconvex cost function optimization
• Depending on the α value, the cost function history plot may
  show fast convergence, slow convergence, oscillation, or
  halting at a local minimum
Halting conditions
• A halting condition should be tested at each iteration to decide when to
stop the search
– One should halt near a stationary point (minimum, maximum, saddle point)
  where the cost function no longer changes much, e.g. when the gradient magnitude
  falls below a small threshold or a maximum number of iterations is reached
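A minimal sketch of such a test (made-up helper name and thresholds) combining the common criteria:

    import numpy as np

    def should_halt(cost_new, cost_old, grad, k, tol=1e-6, max_iter=1000):
        small_gradient = np.linalg.norm(grad) < tol    # near a stationary point
        small_change = abs(cost_new - cost_old) < tol  # cost no longer changes much
        out_of_budget = k >= max_iter                  # iteration budget exhausted
        return small_gradient or small_change or out_of_budget

    # Example call with made-up values
    print(should_halt(cost_new=1.2000001, cost_old=1.2000002, grad=np.array([1e-8, 0.0]), k=42))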
About gradient direction
• The gradient is always perpendicular to the local contour of the cost function
• Thus, the search algorithm always proceeds downhill towards the nearest minimum
• Thus, if there are many local minima, the search will end at one of them
Zig-zagging behavior of gradient descent (1/2)
• Very often in real applications, a minimum is located in an area of parameter space
in which the cost function shape is a slowly varying, long and narrow valley
surrounded by steep walls
• Because the gradient direction is always perpendicular to the local contour of the cost
  function, the trajectory may not proceed directly towards the minimum but
  ’zig-zags’ between the valley walls as it makes update steps in the iteration
Zig-zagging behavior of gradient descent (2/2)
• A popular solution to the problem is to apply momentum-accelerated
  gradient descent (0 < β < 1)
  – Basically, it smoothes the descent direction trajectory by
    exponential smoothing of the negative gradients
[Figure: cost function optimization with and without momentum acceleration]
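A minimal sketch of one common parameterization (assumed here; the course material may write the update slightly differently): an exponentially smoothed direction dk = β dk-1 - (1 - β) ∇g(wk-1) and the update wk = wk-1 + α dk, applied to a hypothetical narrow-valley cost:

    import numpy as np

    def grad_g(w):
        # Gradient of a hypothetical narrow-valley cost g(w) = w1^2 + 10*w2^2
        return np.array([2.0 * w[0], 20.0 * w[1]])

    alpha, beta = 0.08, 0.7
    w = np.array([10.0, 1.0])                      # start far out along the valley
    d = np.zeros(2)                                # smoothed descent direction
    for k in range(100):
        d = beta * d - (1.0 - beta) * grad_g(w)    # exponential smoothing of the negative gradient
        w = w + alpha * d                          # momentum-accelerated update
    print(w)                                       # close to the minimum at (0, 0)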
Slow-crawling behavior of gradient descent (1/2)
• As the gradient magnitude vanishes close to a stationary point, the iteration
  becomes very slow and little progress is made
Slow-crawling behavior of gradient descent (2/2)
• A popular solution to the problem is to apply normalized gradient descent
• Let’s keep just the gradient direction and normalize the gradient length to unity:
  wk = wk-1 - α ∇g(wk-1) / ‖∇g(wk-1)‖
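A minimal 1-D sketch (hypothetical flat-bottomed cost where the plain gradient step would crawl):

    def grad_g(w):
        # Gradient of a hypothetical cost g(w) = w^4; the gradient vanishes rapidly near w = 0
        return 4.0 * w ** 3

    alpha = 0.01
    w = 2.0
    for k in range(500):
        grad = grad_g(w)
        if abs(grad) < 1e-12:                # avoid dividing by zero exactly at a stationary point
            break
        w = w - alpha * grad / abs(grad)     # unit-length step along the descent direction
    print(w)                                 # within one steplength of the minimum at w = 0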
CHAPTER 4:
SECOND-ORDER OPTIMIZATION
TECHNIQUES
The basic idea
• The local curvature (2nd derivative) of the cost function at the point wk is also
  taken into account when deciding on the best direction and steplength to use in the
  optimization process
• This is the idea behind Newton’s method, sketched below
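A minimal 1-D sketch of the Newton update wk = wk-1 - g'(wk-1) / g''(wk-1) (hypothetical cost with hand-coded derivatives; in N dimensions the second derivative becomes the Hessian matrix):

    def g(w):
        # Hypothetical 1-D convex cost
        return w ** 4 + 2.0 * w ** 2 + 1.0

    def g1(w):
        # First derivative of g
        return 4.0 * w ** 3 + 4.0 * w

    def g2(w):
        # Second derivative (local curvature) of g
        return 12.0 * w ** 2 + 4.0

    w = 3.0
    for k in range(20):
        w = w - g1(w) / g2(w)   # Newton step: the curvature scales the steplength
    print(w)                    # close to the minimum at w = 0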
Second-order optimality condition
• One can test what kind of stationary point is at hand by checking
  the second derivative (1-D case): g''(w) > 0 indicates a minimum, g''(w) < 0
  a maximum, and g''(w) = 0 leaves the case undecided (e.g. an inflection point)
Weakness 1 of Newton’s method
• The basic form of Newton’s algorithm may halt at a local maximum or an inflection
point if the cost function is nonconvex
Weakness 1 of Newton’s method
• However, by using the version with a regularization term ε > 0 (added to the second
  derivative, or as εI added to the Hessian in N dimensions), the cost function is locally
  transformed into a convex shape (proof skipped here)
  – Then Newton’s method will continue towards the minimum point!
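A minimal sketch of the regularized Newton step (hypothetical 2-D cost g(w) = w1^4 - 2 w1^2 + w2^2, which is nonconvex around w1 = 0; ε is chosen large enough here to make the regularized Hessian positive definite near the starting point):

    import numpy as np

    def grad_g(w):
        # Gradient of the hypothetical cost
        return np.array([4.0 * w[0] ** 3 - 4.0 * w[0], 2.0 * w[1]])

    def hessian_g(w):
        # Hessian (matrix of second derivatives) of the same cost
        return np.array([[12.0 * w[0] ** 2 - 4.0, 0.0],
                         [0.0,                    2.0]])

    eps = 5.0                                  # regularization term eps > 0
    w = np.array([0.1, 1.0])                   # plain Newton from here would head to the local maximum at w1 = 0
    for k in range(50):
        H = hessian_g(w) + eps * np.eye(2)     # regularized Hessian
        w = w - np.linalg.solve(H, grad_g(w))  # Newton step (solving instead of explicitly inverting)
    print(w)                                   # close to the local minimum at (1, 0)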
Weakness 2 of Newton’s method
• The method suffers from a scaling limitation: computational complexity
grows very fast with the number of parameters N
– The problem is the computation of the inverse of the Hessian matrix of
  second derivatives
• Hessian-free optimization methods have been developed to address the
  issue
  – The Hessian matrix is replaced with a close approximation that does not suffer from the issue
  – Subsampling the Hessian: only a fraction of the matrix entries is used
  – Quasi-Newton methods: the Hessian is replaced with a low-rank approximation that can be
    computed efficiently
• The details of these methods are omitted here (an interested reader can
  check the textbook appendix for more details)
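For context, ready-made quasi-Newton implementations exist in standard libraries; a minimal sketch using scipy's BFGS routine (the cost function here is a made-up example, not from the course material):

    import numpy as np
    from scipy.optimize import minimize

    def g(w):
        # Hypothetical cost with a long, narrow, curved valley
        return (1.0 - w[0]) ** 2 + 100.0 * (w[1] - w[0] ** 2) ** 2

    # BFGS is a quasi-Newton method: it builds an approximation of the (inverse) Hessian
    # from gradient information instead of forming and inverting the true Hessian
    result = minimize(g, x0=np.array([-1.0, 2.0]), method='BFGS')
    print(result.x)   # close to the minimum at (1, 1)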
About laboratory assignments
• Currently, software is available for automatic differentiation of
functions
• The functions are parsed into elementary functions automatically,
  which enables the automatic differentiation
• The numerical methods that are used have been developed by
highly skilled professional programmers and mathematicians
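A small example of automatic differentiation in Python, using the autograd package as an assumed illustration (the lab assignments may use a different tool):

    import autograd.numpy as np             # autograd's thin wrapper around numpy
    from autograd import grad

    def g(w):
        # Hypothetical cost function built from elementary operations
        return np.tanh(w[0]) ** 2 + np.exp(0.5 * w[1]) + w[0] * w[1]

    grad_g = grad(g)                        # the gradient function is derived automatically
    print(grad_g(np.array([0.5, -1.0])))    # gradient evaluated at a sample point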