Machine Learning

(521289S)
Lecture 2

Prof. Tapio Seppänen


Center for Machine Vision and Signal Analysis
University of Oulu

1
FROM PREVIOUS LECTURE:

2
Optimizing a model for linear regression: cost
function minimization
• Parameters to be optimized are the Slope and Intercept of the fitting line
– Parameter space is 2-dimensional with axes Slope and Intercept
• Different values are tried for the parameters until an optimal solution is found
• The goodness of a model is evaluated by computing the cost function value g()
– A smaller value of g() indicates a better model

• An example with two model candidates:
– Left: An optimal solution (g() has the smallest value)
– Right: A non-optimal solution
– Red dot shows the parameter values

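As a concrete illustration (not from the slides), a minimal least-squares cost g() for a fitting line in Python/NumPy; the data arrays x and y below are made-up placeholders:

import numpy as np

# Toy data (made up for illustration only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

def g(slope, intercept):
    """Least-squares cost of the line y_hat = slope*x + intercept."""
    residuals = slope * x + intercept - y
    return np.mean(residuals ** 2)

# Two model candidates: the smaller g() value indicates the better model
print(g(2.0, 1.0))   # a near-optimal candidate
print(g(0.5, 3.0))   # a non-optimal candidate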
3
Zero-order optimality condition
• Our aim is to search for a minimum in the cost function g()
• The minimization problem can be stated as:

    \min_{w_1, w_2, \dots, w_N} g(w_1, w_2, \dots, w_N)

– The function arguments w_i are the model parameters

• By stacking all the parameters into an N-dimensional vector w,
we can express the minimization problem as:

    \min_{w} g(w)

– The vector w represents a point in the parameter space

• Global minimum: search for a point w* in the parameter space such that

    g(w^*) \le g(w)  for all  w

• Local minimum: sometimes there are several minima in the cost function, and we may be
satisfied with any of them, or with one that gives sufficient accuracy / low cost in a
small neighborhood of the parameter space:

    g(w^*) \le g(w)  for all  w  in a small neighborhood of  w^*

4
Local optimization techniques
• Aim: take a randomly selected initial point w0 in the parameter space and
start sequentially refining it (= iteration) to search for a minimum in the
cost function
– In each iteration round k, always select the next point wk such that we go
downwards on the cost function
– While iterating K rounds we get a sequence w0, w1, w2, w3, …, wK such that:

    g(w^0) \ge g(w^1) \ge g(w^2) \ge \dots \ge g(w^K)

• With complex-shaped cost functions, one may find a local minimum instead of
the global minimum
– If the particular local optimum has too high cost for the application, then restart from
a different initial point and repeat the procedure
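As an illustration of the general idea (this specific scheme is not from the lecture), a simple zero-order local search in Python/NumPy that keeps proposing nearby points and only accepts them when the cost decreases; all names and values are illustrative:

import numpy as np

def random_local_search(g, w0, step=0.5, K=1000):
    """Refine w iteratively: accept a nearby random point only if the cost goes down."""
    w = np.asarray(w0, dtype=float)
    for k in range(1, K + 1):
        candidate = w + step * np.random.randn(*w.shape)   # random nearby point
        if g(candidate) < g(w):                            # go downwards on the cost
            w = candidate
    return w

# Example: minimize g(w) = w1^2 + 2*w2^2 starting from a random initial point
g = lambda w: w[0] ** 2 + 2.0 * w[1] ** 2
print(random_local_search(g, w0=np.random.uniform(-5, 5, size=2)))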
5
CHAPTER 3:
FIRST-ORDER OPTIMIZATION TECHNIQUES

6
First-order optimality condition
• At a minimum of a 1-dimensional function, the derivative of the function
g(w) has the value of zero:

    \frac{d}{dw} g(v) = 0

– The slope of the tangent is zero at the minimum point w = v

• In an N-dimensional function, the gradient is zero at a minimum of the function:

    \nabla g(v) = 0

– All partial derivatives are zero!
– The slope of the tangent hyperplane is zero at the minimum point w = v
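A small numerical check of this condition (illustrative only, assuming NumPy): estimate the partial derivatives of a simple cost with central differences and verify that they vanish at the minimum point:

import numpy as np

def numerical_gradient(g, w, h=1e-5):
    """Estimate the gradient of g at w with central differences."""
    grad = np.zeros_like(w)
    for n in range(len(w)):
        e = np.zeros_like(w)
        e[n] = h
        grad[n] = (g(w + e) - g(w - e)) / (2.0 * h)
    return grad

g = lambda w: w[0] ** 2 + 2.0 * w[1] ** 2   # minimum at v = (0, 0)
print(numerical_gradient(g, np.array([0.0, 0.0])))   # ≈ [0, 0]
print(numerical_gradient(g, np.array([1.0, 1.0])))   # nonzero away from the minimum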

7
Solving the minimum point analytically
• The first-order system of equations can be very difficult
or virtually impossible to resolve algebraically in a general case
when the cost function is very complex
– Solvable in some special cases only
– The function may include several minima, maxima and
saddle points which all are solutions to the minimization problem
• Therefore, the solution is searched with iterative techniques in practice

8
Steepest ascent and descent directions
• The gradient computed at a particular point w of the cost function yields a vector
that points to the direction of steepest ascent of the cost function: the fastest
increase direction
• When searching for the minimum point of the cost function, we need to go in the
steepest descent direction instead, i.e. in the opposite direction!
• Note: the ascent/descent direction is defined in the parameter space as we are
searching for good increments for the parameter values which will eventually
minimize the cost function

[Figure: the gradient points "up" (steepest ascent); the negative gradient points "down" (steepest descent)]

9
Gradient descent algorithm
• The basic structure of the algorithm is iterative:
– Set an initial point for the parameter vector: w0
– Iterate parameter values wk until a halting condition is met:

    w^k = w^{k-1} + \alpha \, d^{k-1}

• The direction vector d^{k-1} is now the negative gradient at the current point
in the parameter space:

    d^{k-1} = -\nabla g(w^{k-1})

• The update rule is thus:

    w^k = w^{k-1} - \alpha \, \nabla g(w^{k-1})

[Figure: contours in the (w1, w2) parameter plane; the gradient points uphill and d = -gradient points downhill]

10
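A minimal sketch of the update rule in Python/NumPy (the example cost and all parameter values are illustrative, not from the lecture):

import numpy as np

def gradient_descent(grad, w0, alpha=0.1, K=100):
    """Basic gradient descent with a fixed steplength alpha."""
    w = np.asarray(w0, dtype=float)
    for k in range(1, K + 1):
        w = w - alpha * grad(w)          # w^k = w^(k-1) - alpha * grad g(w^(k-1))
    return w

# Example: g(w) = w1^2 + 2*w2^2, whose gradient is (2*w1, 4*w2)
grad_g = lambda w: np.array([2.0 * w[0], 4.0 * w[1]])
print(gradient_descent(grad_g, w0=[3.0, -2.0], alpha=0.1, K=50))   # approaches the minimum at (0, 0)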
Gradient descent in nonconvex cost
functions
• If the cost function has a complex shape with more than one extremum
point, the algorithm finds one of them by random chance
– Minimum, maximum, saddle points
– Nonconvex cost functions (convex = an ’upwards opening bowl’)
• Thus, the algorithm must be run several times with different initial w0
to find satisfactory solutions
• With convex cost functions, the algorithm will find the optimum
if it keeps iterating long enough

[Figure: a nonconvex function with several stationary points vs. a convex function]
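A sketch of the restart strategy (illustrative, using a fixed-steplength gradient descent as in the earlier sketch; the cost, gradient and ranges are made up):

import numpy as np

def best_of_random_restarts(g, grad, n_restarts=10, alpha=0.1, K=200, scale=5.0):
    """Run gradient descent from several random initial points and keep the best result."""
    best_w, best_cost = None, np.inf
    for _ in range(n_restarts):
        w = np.random.uniform(-scale, scale, size=2)   # random initial point w0
        for k in range(K):
            w = w - alpha * grad(w)                    # basic gradient descent steps
        if g(w) < best_cost:
            best_w, best_cost = w.copy(), g(w)
    return best_w, best_cost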

11
Steplength α control
• The previously shown choices can be applied: fixed steplength and
diminishing steplength (α = 1/k)
• The value of a fixed steplength must be chosen experimentally
• The cost function history plot below demonstrates that if α is not chosen
properly, the search can be slow or the optimum may not be found

[Figure: cost function optimization with three α values, and the cost function history plots showing the convergence of the search]
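A sketch of the diminishing-steplength variant (illustrative; it uses the same grad convention as the earlier gradient descent sketch):

import numpy as np

def gradient_descent_diminishing(grad, w0, K=100):
    """Gradient descent with the diminishing steplength alpha = 1/k."""
    w = np.asarray(w0, dtype=float)
    for k in range(1, K + 1):
        alpha = 1.0 / k                  # steplength shrinks as the iterations proceed
        w = w - alpha * grad(w)
    return w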

12
An example of comparing the fixed
and diminishing steplength control
• The cost function history plot may indicate that the search misses the minimum
or even oscillates

13
An example of nonconvex cost
function optimization
• Depending on the α value, the cost function history plot may
show fast convergence, slow convergence, oscillation, and
halting at a local minimum

14
Halting conditions
• A halting condition should be tested at each iteration to decide when to
stop the search
– One should halt near a stationary point (minimum, maximum, saddle point)
where the cost function no longer changes much: \nabla g(w^k) \approx 0

• Alternative implementations of halting conditions at iteration k:

– The magnitude of the gradient is sufficiently small: \|\nabla g(w^k)\| < \varepsilon

– The change to the parameter vector is sufficiently small: \|w^k - w^{k-1}\| < \varepsilon

– The change to the cost function value is sufficiently small: |g(w^k) - g(w^{k-1})| < \varepsilon

– A fixed maximum number of iterations Kmax has been completed
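A sketch combining these halting conditions with the earlier update rule (illustrative; names, tolerances and defaults are assumptions):

import numpy as np

def gradient_descent_with_halting(g, grad, w0, alpha=0.1, eps=1e-6, K_max=10000):
    """Gradient descent that stops as soon as any halting condition is met."""
    w = np.asarray(w0, dtype=float)
    for k in range(1, K_max + 1):
        w_new = w - alpha * grad(w)
        if (np.linalg.norm(grad(w_new)) < eps          # gradient magnitude is small
                or np.linalg.norm(w_new - w) < eps     # parameter change is small
                or abs(g(w_new) - g(w)) < eps):        # cost change is small
            return w_new, k
        w = w_new
    return w, K_max                                    # K_max iterations completed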

15
About gradient direction
• The gradient is always perpendicular to the local contour of the cost
function
• Thus, the search algorithm always proceeds towards the closest minimum
downhill
• Thus, if there are many local minima, the search will end at one of them

16
Zig-zagging behavior of gradient
descent (1/2)
• Very often in real applications, a minimum is located in an area of parameter space
in which the cost function shape is a slowly varying, long and narrow valley
surrounded by steep walls
• Because the gradient direction is always perpendicular to the local contour
of the cost function, the trajectory may not proceed directly towards the minimum
but 'zig-zags' when making update steps in the iteration

[Figure: optimization trajectories with three different cost functions, showing zig-zagging]

17
Zig-zagging behavior of gradient
descent (2/2)
• A popular solution to the problem is to apply momentum-accelerated
gradient descent (0 < β < 1)
– Basically, it smoothes the direction trajectory with exponential smoothing:

    d^{k-1} = \beta \, d^{k-2} - (1 - \beta) \nabla g(w^{k-1}),  \qquad  w^k = w^{k-1} + \alpha \, d^{k-1}

[Figure: cost function optimization with basic gradient descent (β = 0.0) and with
momentum-accelerated gradient descent using β = 0.2 and β = 0.7]
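A sketch of this exponentially smoothed descent direction (illustrative; the exact momentum form shown above is an assumption, and NumPy is used as in the earlier sketches):

import numpy as np

def momentum_gradient_descent(grad, w0, alpha=0.1, beta=0.7, K=100):
    """Gradient descent with an exponentially smoothed (momentum) direction."""
    w = np.asarray(w0, dtype=float)
    d = np.zeros_like(w)                              # initial direction
    for k in range(1, K + 1):
        d = beta * d - (1.0 - beta) * grad(w)         # exponential smoothing of the direction
        w = w + alpha * d
    return w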

18
Slow-crawling behavior of gradient
descent (1/2)
• As the gradient magnitude starts vanishing close to a stationary point, the iteration
becomes very slow and not much progress is made

19
Slow-crawling behavior of gradient
descent (2/2)
• A popular solution to the problem is to apply normalized gradient descent
• Let’s keep just the gradient direction and normalize the gradient length to unity:

    w^k = w^{k-1} - \alpha \, \frac{\nabla g(w^{k-1})}{\|\nabla g(w^{k-1})\|}

• Here, only α determines the steplength (fixed length, diminishing length, …)

• Often, the following modification is used instead:

    w^k = w^{k-1} - \alpha \, \frac{\nabla g(w^{k-1})}{\|\nabla g(w^{k-1})\| + \varepsilon}

– A small ε > 0 is used to avoid division by zero
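A sketch of normalized gradient descent (illustrative; parameter defaults are assumptions):

import numpy as np

def normalized_gradient_descent(grad, w0, alpha=0.1, eps=1e-8, K=100):
    """Gradient descent that uses only the gradient direction (unit length)."""
    w = np.asarray(w0, dtype=float)
    for k in range(1, K + 1):
        grad_w = grad(w)
        w = w - alpha * grad_w / (np.linalg.norm(grad_w) + eps)   # eps avoids division by zero
    return w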

20
CHAPTER 4:
SECOND-ORDER OPTIMIZATION
TECHNIQUES

21
The basic idea
• Also the local curvature (2nd derivative) of the cost function at point wk is
considered when deciding on the best direction and steplength to proceed in the
optimization process

• Often, Newton's method is used

– 1-D case (1 parameter only):

    w^k = w^{k-1} - \frac{g'(w^{k-1})}{g''(w^{k-1})}

– N-D case (multiple parameters):

    w^k = w^{k-1} - \left(\nabla^2 g(w^{k-1})\right)^{-1} \nabla g(w^{k-1})

– The second-order gradient \nabla^2 g(w) is the N×N Hessian matrix

• The Hessian appears inverted in the update rule!
– Fast convergence in many problems
– If the cost function g() is a polynomial of order 2 (a quadratic),
the algorithm solves the optimization in one step
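A sketch of the N-dimensional Newton update (illustrative; the quadratic example cost and all names are assumptions):

import numpy as np

def newtons_method(grad, hessian, w0, K=20):
    """Newton's method: the Hessian appears inverted in the update rule."""
    w = np.asarray(w0, dtype=float)
    for k in range(1, K + 1):
        step = np.linalg.solve(hessian(w), grad(w))   # solve H * step = gradient instead of inverting H
        w = w - step
    return w

# For the quadratic g(w) = w1^2 + 2*w2^2 a single Newton step reaches the minimum
grad_g = lambda w: np.array([2.0 * w[0], 4.0 * w[1]])
hess_g = lambda w: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newtons_method(grad_g, hess_g, w0=[3.0, -2.0], K=1))   # ≈ (0, 0)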
22
Second-order optimality condition
• An iterative search algorithm halts at a stationary point (local optimum) which
represents a minimum, maximum or an inflection point on the cost function
– The gradient is then equal to the zero vector!
• If the cost function is convex, the algorithm will eventually halt at the minimum
• However, in nonconvex cases, the algorithm can also halt at a local maximum or an
inflection point

23
Second-order optimality condition
• One can test which case applies by checking the second derivative (1-D case):

    g''(v) > 0: minimum,   g''(v) < 0: maximum,   g''(v) = 0: possibly an inflection point

• In the N-dimensional case, the condition is defined with the second-order
gradient (Hessian) instead
– The eigenvalues of the Hessian matrix are considered: all positive → minimum,
all negative → maximum, mixed signs → saddle point
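A small sketch of the eigenvalue test (illustrative; the function name and classification labels are assumptions):

import numpy as np

def classify_stationary_point(hessian_at_v):
    """Classify a stationary point from the eigenvalues of the (symmetric) Hessian."""
    eigvals = np.linalg.eigvalsh(hessian_at_v)
    if np.all(eigvals > 0):
        return "minimum"
    if np.all(eigvals < 0):
        return "maximum"
    return "saddle point or degenerate case"

print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, 4.0]])))   # "minimum"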

24
Weakness 1 of Newton’s method
• The basic form of Newton’s algorithm may halt at a local maximum or an inflection
point if the cost function is nonconvex

25
Weakness 1 of Newton’s method
• However, by using the version with a regularization term ε > 0, the cost function is
locally transformed into a convex shape (proof skipped here)
– Then, Newton's method will continue towards the minimum point!

    w^k = w^{k-1} - \left(\nabla^2 g(w^{k-1}) + \varepsilon I\right)^{-1} \nabla g(w^{k-1})

– ε > 0 is a regularization term which also stabilizes the update rule
by avoiding division by zero
– I is the identity matrix, with all 1's on the main diagonal

[Figure: convexification of a cost function]
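A sketch of the regularized Newton update (illustrative; the value of eps and all names are assumptions):

import numpy as np

def regularized_newton(grad, hessian, w0, eps=1e-3, K=20):
    """Newton's method with the regularization term eps * I added to the Hessian."""
    w = np.asarray(w0, dtype=float)
    I = np.eye(len(w))
    for k in range(1, K + 1):
        step = np.linalg.solve(hessian(w) + eps * I, grad(w))
        w = w - step
    return w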

26
Weakness 2 of Newton’s method
• The method suffers from a scaling limitation: computational complexity
grows very fast with the number of parameters N
– The problem is the computation of the inverse of the Hessian matrix of second derivatives
• Hessian-free optimization methods have been developed to solve the issue
– The Hessian matrix is replaced with a close approximation that does not suffer from the issue
– Subsampling the Hessian: uses only a fraction of the matrix entries
– Quasi-Newton methods: the Hessian is replaced with a low-rank approximation that can be
computed efficiently
• The details of these methods are omitted here (an interested reader can
check the textbook appendix for more details)
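As an off-the-shelf illustration (not part of the lecture material), SciPy's BFGS optimizer is a quasi-Newton method that builds an approximation of the Hessian from gradient evaluations; the toy cost below is an assumption:

import numpy as np
from scipy.optimize import minimize

g = lambda w: w[0] ** 2 + 2.0 * w[1] ** 2
grad_g = lambda w: np.array([2.0 * w[0], 4.0 * w[1]])

# BFGS avoids computing and inverting the exact Hessian
result = minimize(g, x0=np.array([3.0, -2.0]), jac=grad_g, method="BFGS")
print(result.x)   # ≈ (0, 0)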

27
About laboratory assignments
• Currently, software is available for automatic differentiation of functions
• The functions are parsed into elementary functions automatically,
which enables automatic differentiation
• The numerical methods that are used have been developed by
highly skilled professional programmers and mathematicians
• The latest Matlab version includes functionality for this
• In the laboratory exercises, automatic differentiation will be used in the
machine learning assignments
– The programmer defines the cost function and automatic differentiation
yields the gradients etc.
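To illustrate the idea in Python (the course exercises use Matlab; JAX is used here only as an example library, and the data is made up):

import jax.numpy as jnp
from jax import grad

# Toy data (made up)
x = jnp.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = jnp.array([1.1, 2.9, 5.2, 7.1, 8.8])

# The programmer only defines the cost function...
def g(w):
    return jnp.mean((w[0] * x + w[1] - y) ** 2)   # least-squares cost of a line

# ...and automatic differentiation yields the gradient function
grad_g = grad(g)
print(grad_g(jnp.array([2.0, 1.0])))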

28
