Machine Learning

(521289S)
Lecture 2

Prof. Tapio Seppänen


Center for Machine Vision and Signal Analysis
University of Oulu

1
FROM PREVIOUS LECTURE:

2
Optimizing a model for linear regression: cost
function minimization
• Parameters to be optimized are the Slope and Intercept of the fitting line
– Parameter space is 2-dimensional with axes Slope and Intercept
• Different values are tried for the parameters until an optimal solution is found
• The goodness of a model is evaluated by computing the cost function value g()
– A smaller value of g() indicates a better model

• An example with two model candidates:
– Left: An optimal solution (g() has the smallest value)
– Right: A non-optimal solution
– Red dot shows the parameter values

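As a concrete illustration (not from the slides), a minimal least-squares cost g() for a fitting line in Python/NumPy; the data arrays x and y below are made-up placeholders:

import numpy as np

# Toy data (made up for illustration only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

def g(slope, intercept):
    """Least-squares cost of the line y_hat = slope*x + intercept."""
    residuals = slope * x + intercept - y
    return np.mean(residuals ** 2)

# Two model candidates: the smaller g() value indicates the better model
print(g(2.0, 1.0))   # a near-optimal candidate
print(g(0.5, 3.0))   # a non-optimal candidate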
3
Zero-order optimality condition
• Our aim is to search for a minimum in the cost function g()
• The minimization problem can be stated as:

    \min_{w_1, w_2, \dots, w_N} g(w_1, w_2, \dots, w_N)

– The function arguments w_i are the model parameters

• By stacking all the parameters into an N-dimensional vector w,
we can express the minimization problem as:

    \min_{w} g(w)

– The vector w represents a point in the parameter space

• Global minimum: search for a point w* in the parameter space such that

    g(w^*) \le g(w)  for all  w

• Local minimum: sometimes there are several minima in the cost function, and we may be
satisfied with any of them, or with one that gives sufficient accuracy / low cost in a
small neighborhood of the parameter space:

    g(w^*) \le g(w)  for all  w  in a small neighborhood of  w^*

4
Local optimization techniques
• Aim: take a randomly selected initial point w0 in the parameter space and
start sequentially refining it (= iteration) to search for a minimum in the
cost function
– In each iteration round k, always select the next point wk such that we go
downwards on the cost function
– While iterating K rounds we get a sequence w0, w1, w2, w3, …, wK such that:

    g(w^0) \ge g(w^1) \ge g(w^2) \ge \dots \ge g(w^K)

• With complex-shaped cost functions, one may find a local minimum instead of
the global minimum
– If the particular local optimum has too high cost for the application, then restart from
a different initial point and repeat the procedure
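As an illustration of the general idea (this specific scheme is not from the lecture), a simple zero-order local search in Python/NumPy that keeps proposing nearby points and only accepts them when the cost decreases; all names and values are illustrative:

import numpy as np

def random_local_search(g, w0, step=0.5, K=1000):
    """Refine w iteratively: accept a nearby random point only if the cost goes down."""
    w = np.asarray(w0, dtype=float)
    for k in range(1, K + 1):
        candidate = w + step * np.random.randn(*w.shape)   # random nearby point
        if g(candidate) < g(w):                            # go downwards on the cost
            w = candidate
    return w

# Example: minimize g(w) = w1^2 + 2*w2^2 starting from a random initial point
g = lambda w: w[0] ** 2 + 2.0 * w[1] ** 2
print(random_local_search(g, w0=np.random.uniform(-5, 5, size=2)))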
5
CHAPTER 3:
FIRST-ORDER OPTIMIZATION TECHNIQUES

6
First-order optimality condition
• At a minimum of a 1-dimensional function, the derivative of the function
g(w) has the value of zero:

    \frac{d}{dw} g(v) = 0

– The slope of the tangent is zero at the minimum point w = v

• In an N-dimensional function, the gradient is zero at a minimum of the function:

    \nabla g(v) = 0

– All partial derivatives are zero!
– The slope of the tangent hyperplane is zero at the minimum point w = v
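A small numerical check of this condition (illustrative only, assuming NumPy): estimate the partial derivatives of a simple cost with central differences and verify that they vanish at the minimum point:

import numpy as np

def numerical_gradient(g, w, h=1e-5):
    """Estimate the gradient of g at w with central differences."""
    grad = np.zeros_like(w)
    for n in range(len(w)):
        e = np.zeros_like(w)
        e[n] = h
        grad[n] = (g(w + e) - g(w - e)) / (2.0 * h)
    return grad

g = lambda w: w[0] ** 2 + 2.0 * w[1] ** 2   # minimum at v = (0, 0)
print(numerical_gradient(g, np.array([0.0, 0.0])))   # ≈ [0, 0]
print(numerical_gradient(g, np.array([1.0, 1.0])))   # nonzero away from the minimum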

7
Solving the minimum point analytically
• The first-order system of equations can be very difficult
or virtually impossible to resolve algebraically in a general case
when the cost function is very complex
– Solvable in some special cases only
– The function may include several minima, maxima and
saddle points which all are solutions to the minimization problem
• Therefore, the solution is searched with iterative techniques in practice

8
Steepest ascent and descent directions
• The gradient computed at a particular point w of the cost function yields a vector
that points to the direction of steepest ascent of the cost function: the fastest
increase direction
• When searching for the minimum point of the cost function, we need to go in the
steepest descent direction instead, i.e. in the opposite direction!
• Note: the ascent/descent direction is defined in the parameter space as we are
searching for good increments for the parameter values which will eventually
minimize the cost function

[Figure: the gradient points "up" (steepest ascent); the negative gradient points "down" (steepest descent)]

9
Gradient descent algorithm
• The basic structure of the algorithm is iterative:
– Set an initial point for the parameter vector: w0
– Iterate parameter values wk until a halting condition is met:

    w^k = w^{k-1} + \alpha \, d^{k-1}

• The direction vector d^{k-1} is now the negative gradient at the current point
in the parameter space:

    d^{k-1} = -\nabla g(w^{k-1})

• The update rule is thus:

    w^k = w^{k-1} - \alpha \, \nabla g(w^{k-1})

[Figure: contours in the (w1, w2) parameter plane; the gradient points uphill and d = -gradient points downhill]

10
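A minimal sketch of the update rule in Python/NumPy (the example cost and all parameter values are illustrative, not from the lecture):

import numpy as np

def gradient_descent(grad, w0, alpha=0.1, K=100):
    """Basic gradient descent with a fixed steplength alpha."""
    w = np.asarray(w0, dtype=float)
    for k in range(1, K + 1):
        w = w - alpha * grad(w)          # w^k = w^(k-1) - alpha * grad g(w^(k-1))
    return w

# Example: g(w) = w1^2 + 2*w2^2, whose gradient is (2*w1, 4*w2)
grad_g = lambda w: np.array([2.0 * w[0], 4.0 * w[1]])
print(gradient_descent(grad_g, w0=[3.0, -2.0], alpha=0.1, K=50))   # approaches the minimum at (0, 0)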
Gradient descent in nonconvex cost
functions
• If the cost function has a complex shape with more than one extremum
point, the algorithm finds one of them by random chance
– Minimum, maximum, saddle points
– Nonconvex cost functions (convex = an ’upwards opening bowl’)
• Thus, the algorithm must be run several times with different initial w0
to find satisfactory solutions
• With convex cost functions, the algorithm will find the optimum
if it keeps iterating long enough

[Figure: a nonconvex function with several stationary points vs. a convex function]
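A sketch of the restart strategy (illustrative, using a fixed-steplength gradient descent as in the earlier sketch; the cost, gradient and ranges are made up):

import numpy as np

def best_of_random_restarts(g, grad, n_restarts=10, alpha=0.1, K=200, scale=5.0):
    """Run gradient descent from several random initial points and keep the best result."""
    best_w, best_cost = None, np.inf
    for _ in range(n_restarts):
        w = np.random.uniform(-scale, scale, size=2)   # random initial point w0
        for k in range(K):
            w = w - alpha * grad(w)                    # basic gradient descent steps
        if g(w) < best_cost:
            best_w, best_cost = w.copy(), g(w)
    return best_w, best_cost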

11
Steplength α control
• The previously shown choices can be applied: fixed steplength and
diminishing steplength (α = 1/k)
• The value of a fixed steplength must be chosen experimentally
• The cost function history plot below demonstrates that if α is not chosen
properly, the search can be slow or the optimum may not be found

[Figure: cost function optimization with three α values, and the cost function history plots showing the convergence of the search]
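A sketch of the diminishing-steplength variant (illustrative; it uses the same grad convention as the earlier gradient descent sketch):

import numpy as np

def gradient_descent_diminishing(grad, w0, K=100):
    """Gradient descent with the diminishing steplength alpha = 1/k."""
    w = np.asarray(w0, dtype=float)
    for k in range(1, K + 1):
        alpha = 1.0 / k                  # steplength shrinks as the iterations proceed
        w = w - alpha * grad(w)
    return w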

12
An example of comparing the fixed
and diminishing steplength control
• The cost function history plot may indicate that the search misses the minimum
or even oscillates

13
An example of nonconvex cost
function optimization
• Depending on the α value, the cost function history plot may
show fast convergence, slow convergence, oscillation, and
halting at a local minimum

14
Halting conditions
• A halting condition should be tested at each iteration to decide when to
stop the search
– One should halt near a stationary point (minimum, maximum, saddle point)
where the cost function no longer changes much: \nabla g(w^k) \approx 0

• Alternative implementations of halting conditions at iteration k:

– The magnitude of the gradient is sufficiently small: \|\nabla g(w^k)\| < \varepsilon

– The change to the parameter vector is sufficiently small: \|w^k - w^{k-1}\| < \varepsilon

– The change to the cost function value is sufficiently small: |g(w^k) - g(w^{k-1})| < \varepsilon

– A fixed maximum number of iterations Kmax has been completed
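A sketch combining these halting conditions with the earlier update rule (illustrative; names, tolerances and defaults are assumptions):

import numpy as np

def gradient_descent_with_halting(g, grad, w0, alpha=0.1, eps=1e-6, K_max=10000):
    """Gradient descent that stops as soon as any halting condition is met."""
    w = np.asarray(w0, dtype=float)
    for k in range(1, K_max + 1):
        w_new = w - alpha * grad(w)
        if (np.linalg.norm(grad(w_new)) < eps          # gradient magnitude is small
                or np.linalg.norm(w_new - w) < eps     # parameter change is small
                or abs(g(w_new) - g(w)) < eps):        # cost change is small
            return w_new, k
        w = w_new
    return w, K_max                                    # K_max iterations completed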

15
About gradient direction
• The gradient is always perpendicular to the local contour of the cost
function
• Thus, the search algorithm always proceeds towards the closest minimum
downhill
• Thus, if there are many local minima, the search will end at one of them

16
Zig-zagging behavior of gradient
descent (1/2)
• Very often in real applications, a minimum is located in an area of parameter space
in which the cost function shape is a slowly varying, long and narrow valley
surrounded by steep walls
• Because the gradient direction is always perpendicular to the local contour
of the cost function, the trajectory may not proceed directly towards the minimum
but 'zig-zags' when making update steps in the iteration

[Figure: optimization trajectories with three different cost functions, showing zig-zagging]

17
Zig-zagging behavior of gradient
descent (2/2)
• A popular solution to the problem is to apply momentum-accelerated
gradient descent (0 < β < 1)
– Basically, it smoothes the direction trajectory with exponential smoothing:

    d^{k-1} = \beta \, d^{k-2} - (1 - \beta) \nabla g(w^{k-1}),  \qquad  w^k = w^{k-1} + \alpha \, d^{k-1}

[Figure: cost function optimization with basic gradient descent (β = 0.0) and with
momentum-accelerated gradient descent using β = 0.2 and β = 0.7]
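A sketch of this exponentially smoothed descent direction (illustrative; the exact momentum form shown above is an assumption, and NumPy is used as in the earlier sketches):

import numpy as np

def momentum_gradient_descent(grad, w0, alpha=0.1, beta=0.7, K=100):
    """Gradient descent with an exponentially smoothed (momentum) direction."""
    w = np.asarray(w0, dtype=float)
    d = np.zeros_like(w)                              # initial direction
    for k in range(1, K + 1):
        d = beta * d - (1.0 - beta) * grad(w)         # exponential smoothing of the direction
        w = w + alpha * d
    return w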

18
Slow-crawling behavior of gradient
descent (1/2)
• As the gradient magnitude starts vanishing close to a stationary point, the iteration
becomes very slow and not much progress is made

19
Slow-crawling behavior of gradient
descent (2/2)
• A popular solution to the problem is to apply normalized gradient descent
• Let’s keep just the gradient direction and normalize the gradient length to unity:

    w^k = w^{k-1} - \alpha \, \frac{\nabla g(w^{k-1})}{\|\nabla g(w^{k-1})\|}

• Here, only α determines the steplength (fixed length, diminishing length, …)

• Often, the following modification is used instead:

    w^k = w^{k-1} - \alpha \, \frac{\nabla g(w^{k-1})}{\|\nabla g(w^{k-1})\| + \varepsilon}

– A small ε > 0 is used to avoid division by zero
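A sketch of normalized gradient descent (illustrative; parameter defaults are assumptions):

import numpy as np

def normalized_gradient_descent(grad, w0, alpha=0.1, eps=1e-8, K=100):
    """Gradient descent that uses only the gradient direction (unit length)."""
    w = np.asarray(w0, dtype=float)
    for k in range(1, K + 1):
        grad_w = grad(w)
        w = w - alpha * grad_w / (np.linalg.norm(grad_w) + eps)   # eps avoids division by zero
    return w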

20
CHAPTER 4:
SECOND-ORDER OPTIMIZATION
TECHNIQUES

21
The basic idea
• Also the local curvature (2nd derivative) of the cost function at point wk is
considered when deciding on the best direction and steplength to proceed in the
optimization process

• Often, Newton's method is used

– 1-D case (1 parameter only):

    w^k = w^{k-1} - \frac{g'(w^{k-1})}{g''(w^{k-1})}

– N-D case (multiple parameters):

    w^k = w^{k-1} - \left(\nabla^2 g(w^{k-1})\right)^{-1} \nabla g(w^{k-1})

– The second-order gradient \nabla^2 g(w) is the N×N Hessian matrix

• The Hessian appears inverted in the update rule!
– Fast convergence in many problems
– If the cost function g() is a polynomial of order 2 (a quadratic),
the algorithm solves the optimization in one step
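A sketch of the N-dimensional Newton update (illustrative; the quadratic example cost and all names are assumptions):

import numpy as np

def newtons_method(grad, hessian, w0, K=20):
    """Newton's method: the Hessian appears inverted in the update rule."""
    w = np.asarray(w0, dtype=float)
    for k in range(1, K + 1):
        step = np.linalg.solve(hessian(w), grad(w))   # solve H * step = gradient instead of inverting H
        w = w - step
    return w

# For the quadratic g(w) = w1^2 + 2*w2^2 a single Newton step reaches the minimum
grad_g = lambda w: np.array([2.0 * w[0], 4.0 * w[1]])
hess_g = lambda w: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newtons_method(grad_g, hess_g, w0=[3.0, -2.0], K=1))   # ≈ (0, 0)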
22
Second-order optimality condition
• An iterative search algorithm halts at a stationary point (local optimum) which
represents a minimum, maximum or an inflection point on the cost function
– The gradient is then equal to the zero vector!
• If the cost function is convex, the algorithm will eventually halt at the minimum
• However, in nonconvex cases, the algorithm can also halt at a local maximum or an
inflection point

23
Second-order optimality condition
• One can test which case applies by checking the second derivative (1-D case):

    g''(v) > 0: minimum,   g''(v) < 0: maximum,   g''(v) = 0: possibly an inflection point

• In the N-dimensional case, the condition is defined with the second-order
gradient (Hessian) instead
– The eigenvalues of the Hessian matrix are considered: all positive → minimum,
all negative → maximum, mixed signs → saddle point
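A small sketch of the eigenvalue test (illustrative; the function name and classification labels are assumptions):

import numpy as np

def classify_stationary_point(hessian_at_v):
    """Classify a stationary point from the eigenvalues of the (symmetric) Hessian."""
    eigvals = np.linalg.eigvalsh(hessian_at_v)
    if np.all(eigvals > 0):
        return "minimum"
    if np.all(eigvals < 0):
        return "maximum"
    return "saddle point or degenerate case"

print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, 4.0]])))   # "minimum"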

24
Weakness 1 of Newton’s method
• The basic form of Newton’s algorithm may halt at a local maximum or an inflection
point if the cost function is nonconvex

25
Weakness 1 of Newton’s method
• However, by using the version with a regularization term ε > 0, the cost function is
locally transformed into a convex shape (proof skipped here)
– Then, Newton's method will continue towards the minimum point!

    w^k = w^{k-1} - \left(\nabla^2 g(w^{k-1}) + \varepsilon I\right)^{-1} \nabla g(w^{k-1})

– ε > 0 is a regularization term which also stabilizes the update rule
by avoiding division by zero
– I is the identity matrix, with all 1's on the main diagonal

[Figure: convexification of a cost function]
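A sketch of the regularized Newton update (illustrative; the value of eps and all names are assumptions):

import numpy as np

def regularized_newton(grad, hessian, w0, eps=1e-3, K=20):
    """Newton's method with the regularization term eps * I added to the Hessian."""
    w = np.asarray(w0, dtype=float)
    I = np.eye(len(w))
    for k in range(1, K + 1):
        step = np.linalg.solve(hessian(w) + eps * I, grad(w))
        w = w - step
    return w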

26
Weakness 2 of Newton’s method
• The method suffers from a scaling limitation: computational complexity
grows very fast with the number of parameters N
– The problem is the computation of the inverse of the Hessian matrix of second derivatives
• Hessian-free optimization methods have been developed to solve the issue
– The Hessian matrix is replaced with a close approximation that does not suffer from the issue
– Subsampling the Hessian: uses only a fraction of the matrix entries
– Quasi-Newton methods: the Hessian is replaced with a low-rank approximation that can be
computed efficiently
• The details of these methods are omitted here (an interested reader can
check the textbook appendix for more details)
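As an off-the-shelf illustration (not part of the lecture material), SciPy's BFGS optimizer is a quasi-Newton method that builds an approximation of the Hessian from gradient evaluations; the toy cost below is an assumption:

import numpy as np
from scipy.optimize import minimize

g = lambda w: w[0] ** 2 + 2.0 * w[1] ** 2
grad_g = lambda w: np.array([2.0 * w[0], 4.0 * w[1]])

# BFGS avoids computing and inverting the exact Hessian
result = minimize(g, x0=np.array([3.0, -2.0]), jac=grad_g, method="BFGS")
print(result.x)   # ≈ (0, 0)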

27
About laboratory assignments
• Currently, software is available for automatic differentiation of functions
• The functions are parsed into elementary functions automatically,
which enables automatic differentiation
• The numerical methods that are used have been developed by
highly skilled professional programmers and mathematicians
• The latest Matlab version includes functionality for this
• In the laboratory exercises, automatic differentiation will be used in the
machine learning assignments
– The programmer defines the cost function and automatic differentiation
yields the gradients etc.
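To illustrate the idea in Python (the course exercises use Matlab; JAX is used here only as an example library, and the data is made up):

import jax.numpy as jnp
from jax import grad

# Toy data (made up)
x = jnp.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = jnp.array([1.1, 2.9, 5.2, 7.1, 8.8])

# The programmer only defines the cost function...
def g(w):
    return jnp.mean((w[0] * x + w[1] - y) ** 2)   # least-squares cost of a line

# ...and automatic differentiation yields the gradient function
grad_g = grad(g)
print(grad_g(jnp.array([2.0, 1.0])))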

28
