
Optimization

Ngoc-Hoang Luong

University of Information Technology (UIT)


Vietnam National University - Ho Chi Minh City (VNU-HCM)

Math for CS, Fall 2021



References

The contents of this document are taken mainly from the following source:
Kevin P. Murphy. Probabilistic Machine Learning: An Introduction.
https://probml.github.io/pml-book/book1.html
Table of Contents

1 Introduction

2 Matrix calculus

3 Positive definite matrices

4 Optimality conditions

5 Constrained vs unconstrained optimization

6 Convex vs nonconvex optimization

7 Smooth vs nonsmooth optimization

8 First-order methods





Introduction

The core problem in ML is parameter estimation (model fitting).
We need to solve an optimization problem: i.e., trying to find the values for a set of variables θ ∈ Θ that minimize a scalar-valued loss function or cost function L : Θ → R:

θ* = argmin_{θ ∈ Θ} L(θ)    (1)

The parameter space is given by Θ ⊆ R^D, where D is the number of variables being optimized.
We focus on continuous optimization.
To maximize a score function or reward function R(θ), we can minimize L(θ) = −R(θ).
The term objective function refers to a function we want to either maximize or minimize.
An algorithm to find an optimum of an objective function is called a solver.
Local versus global optimization

A point that satisfies Equation 1 is called a global optimum. Finding such a point is called global optimization.
In general, finding global optima is computationally intractable. We will try to find a local optimum instead.
For continuous problems, a local optimum is a point θ* which has lower (or equal) cost than "nearby" points:

∃δ > 0 such that ∀θ ∈ Θ with ∥θ − θ*∥ < δ: L(θ*) ≤ L(θ)    (2)

[Figure: a 1D function with a local minimum and the global minimum marked.]
Local versus global optimization

A local minimum could be surrounded by other local minima with the same objective value; this is known as a flat local minimum.
A point is said to be a strict local minimum if its cost is strictly lower than that of neighboring points:

∃δ > 0 such that ∀θ ∈ Θ, θ ≠ θ*, with ∥θ − θ*∥ < δ: L(θ*) < L(θ)    (3)

We can define a (strict) local maximum analogously.
2 Matrix calculus


Derivatives

The topic of calculus concerns computing "rates of change" of functions as we vary their inputs.
Consider a scalar-argument function f : R → R. Its derivative at a point x is the quantity

f'(x) ≜ lim_{h→0} [f(x + h) − f(x)] / h

assuming the limit exists.
This measures how quickly the output changes when we move a small distance in the input space away from x (i.e., the "rate of change").


Derivatives

f'(x) can be seen as the slope of the tangent line to f at the point (x, f(x)):

f(x + ∆x) ≈ f(x) + f'(x)∆x

for small ∆x.


Derivatives

We can compute a finite difference approximation to the derivative by using a finite step size h:

f'(x) = lim_{h→0} [f(x + h) − f(x)] / h            (forward difference)
      = lim_{h→0} [f(x + h/2) − f(x − h/2)] / h    (central difference)
      = lim_{h→0} [f(x) − f(x − h)] / h            (backward difference)

The smaller the step size h, the better the estimate.
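
As a small illustration (not part of the original slides), the central difference can be coded in a few lines of Python; the function name and the default step size h below are arbitrary choices:

    def central_difference(f, x, h=1e-5):
        # Approximate f'(x) by the central difference (f(x + h/2) - f(x - h/2)) / h.
        return (f(x + h / 2) - f(x - h / 2)) / h

    # Example: f(x) = x**2 has exact derivative f'(3) = 6.
    f = lambda x: x ** 2
    print(central_difference(f, 3.0))  # approximately 6.0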


Derivatives

We can think of differentiation as an operator that maps functions to functions, D(f) = f'.
f'(x) computes the derivative at x (assuming the derivative exists at that point).
The prime symbol f' used to denote the derivative is Lagrange notation.
The second derivative function, which measures how quickly the gradient is changing, is denoted by f''.
The n'th derivative function is denoted f^(n).
We can use Leibniz notation if we denote the function by y = f(x), and its derivative by dy/dx or (d/dx) f(x).
To denote the evaluation of the derivative at a point a, we write df/dx |_{x=a}.


Gradients

We extend the notion of derivatives to handle vector-argument functions, f : R^n → R, by defining the partial derivative of f with respect to xi to be

∂f/∂xi = lim_{h→0} [f(x + h ei) − f(x)] / h

where ei is the i'th unit vector, ei = (0, . . . , 1, . . . , 0), with the i'th element equal to 1 and all the other elements equal to 0.
The gradient of f at a point x is the vector of its partial derivatives:

g = ∂f/∂x = ∇f = (∂f/∂x1, . . . , ∂f/∂xn)^T = (∂f/∂x1) e1 + . . . + (∂f/∂xn) en

To emphasize the point at which the gradient is evaluated, we write

g(x*) ≜ ∂f/∂x |_{x = x*}
Gradients

Example:

f(x1, x2) = x1² + x1x2 + 3x2²

∇f(x1, x2) = (∂f/∂x1, ∂f/∂x2)^T = (2x1 + x2, x1 + 6x2)^T

The nabla operator ∇ maps a function f : R^n → R to another function g : R^n → R^n.
Since g(·) is a vector-valued function, it is known as a vector field.
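
A quick numerical check of this example (a sketch, not from the slides; the helper name numerical_gradient and the step size are arbitrary):

    import numpy as np

    def numerical_gradient(f, x, h=1e-6):
        # Approximate the gradient of f: R^n -> R by a central difference in each coordinate.
        g = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = 1.0
            g[i] = (f(x + h * e) - f(x - h * e)) / (2 * h)
        return g

    # f(x1, x2) = x1^2 + x1*x2 + 3*x2^2, whose analytic gradient is (2*x1 + x2, x1 + 6*x2).
    f = lambda x: x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2
    x = np.array([1.0, 2.0])
    print(numerical_gradient(f, x))                      # approximately [ 4. 13.]
    print(np.array([2 * x[0] + x[1], x[0] + 6 * x[1]]))  # exact:        [ 4. 13.]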


Directional derivative

The directional derivative measures how much the function f : R^n → R changes along a direction v in space:

D_v f(x) = lim_{h→0} [f(x + hv) − f(x)] / h

We can approximate this numerically using 2 function calls to f, regardless of n.
By contrast, a numerical approximation to the standard gradient vector takes n + 1 calls (or 2n if using central differences).
The directional derivative along v is the scalar product of the gradient g and the vector v:

D_v f(x) = ∇f(x) · v


Directional derivative

Example: Let f(x, y) = x²y. Find the derivative of f in the direction (1,2) at the point (3,2).
The gradient ∇f(x, y) is:

∇f(x, y) = (∂f/∂x, ∂f/∂y)^T = (2xy, x²)^T

∇f(3, 2) = (12, 9)^T = 12e1 + 9e2

Let u = u1e1 + u2e2 be a unit vector. The derivative of f in the direction of u at (3,2) is:

D_u f(3, 2) = ∇f(3, 2) · u = (12e1 + 9e2) · (u1e1 + u2e2) = 12u1 + 9u2


Directional derivative

Example (cont.)
The unit vector in the direction of vector (1,2) is:

u = (1, 2)/∥(1, 2)∥ = (1, 2)/√(1² + 2²) = (1, 2)/√5 = (1/√5, 2/√5)

The directional derivative at (3,2) in the direction of (1,2) is:

D_u f(3, 2) = 12u1 + 9u2 = 12/√5 + 18/√5 = 30/√5

We normalize the vector (1,2) so that the directional derivative is independent of its magnitude and depends only on its direction.


Directional derivative

Example 2: Let f(x, y) = x²y. Find the derivative of f in the direction of (2,1) at the point (3,2).
The unit vector in the direction of (2,1) is:

u = (2, 1)/√5 = (2/√5, 1/√5)

The directional derivative of f at (3,2) in the direction of (2,1) is:

D_u f(3, 2) = 12u1 + 9u2 = 24/√5 + 9/√5 = 33/√5
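
The two examples above can be checked numerically with two calls to f per direction (a sketch, not from the slides; the helper name and step size are arbitrary):

    import numpy as np

    def directional_derivative(f, x, v, h=1e-6):
        # Approximate D_v f(x) along the unit vector in the direction of v.
        u = v / np.linalg.norm(v)   # normalize so only the direction matters
        return (f(x + h * u) - f(x - h * u)) / (2 * h)

    # f(x, y) = x^2 * y at the point (3, 2).
    f = lambda p: p[0] ** 2 * p[1]
    x = np.array([3.0, 2.0])
    print(directional_derivative(f, x, np.array([1.0, 2.0])))  # approx 30/sqrt(5) = 13.42
    print(directional_derivative(f, x, np.array([2.0, 1.0])))  # approx 33/sqrt(5) = 14.76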


Directional derivative

Questions:
At a point a, in which direction u is the directional derivative D_u f(a) maximal?
What is the directional derivative in that direction?
The relationship between the gradient and the directional derivative:

D_u f(a) = ∇f(a) · u
         = ∥∇f(a)∥∥u∥ cos θ    [θ is the angle between u and the gradient]
         = ∥∇f(a)∥ cos θ       [u is a unit vector]

The maximal value of D_u f(a) occurs when u and ∇f(a) point in the same direction (i.e., θ = 0).


Directional derivative

D_u f(a) = ∇f(a) · u
         = ∥∇f(a)∥∥u∥ cos θ    [θ is the angle between u and the gradient]
         = ∥∇f(a)∥ cos θ       [u is a unit vector]

When θ = 0, the directional derivative D_u f(a) = ∥∇f(a)∥.
When θ = π, the directional derivative D_u f(a) = −∥∇f(a)∥.
For what value of θ is D_u f(a) = 0?


Jacobian

Consider a function that maps a vector to another vector, f : R^n → R^m. The Jacobian matrix of this function is an m × n matrix of partial derivatives:

J_f(x) = ∂f/∂x^T ≜
    [ ∂f1/∂x1  ...  ∂f1/∂xn ]   [ ∇f1(x)^T ]
    [    ⋮      ⋱      ⋮    ] = [     ⋮    ]
    [ ∂fm/∂x1  ...  ∂fm/∂xn ]   [ ∇fm(x)^T ]

We lay out the results in the same orientation as the output f. This is called the numerator layout of the Jacobian formulation.
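
As a hedged illustration of the numerator layout (not from the slides), a finite-difference Jacobian can be built one column at a time; the helper below and its name are choices made for this sketch:

    import numpy as np

    def numerical_jacobian(f, x, h=1e-6):
        # Approximate the m x n Jacobian of f: R^n -> R^m; column j holds d f / d x_j.
        fx = np.asarray(f(x), dtype=float)
        J = np.zeros((fx.size, x.size))
        for j in range(x.size):
            e = np.zeros_like(x)
            e[j] = 1.0
            J[:, j] = (np.asarray(f(x + h * e)) - np.asarray(f(x - h * e))) / (2 * h)
        return J

    # f(x1, x2) = (x1*x2, x1 + x2^2) has Jacobian [[x2, x1], [1, 2*x2]].
    f = lambda x: np.array([x[0] * x[1], x[0] + x[1] ** 2])
    print(numerical_jacobian(f, np.array([2.0, 3.0])))  # approximately [[3. 2.] [1. 6.]]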


Hessian

For a function f : R^n → R that is twice differentiable, the Hessian matrix is the (symmetric) n × n matrix of second partial derivatives:

H_f = ∂²f/∂x² = ∇²f =
    [ ∂²f/∂x1²     ...  ∂²f/∂x1∂xn ]
    [     ⋮         ⋱       ⋮      ]
    [ ∂²f/∂xn∂x1   ...  ∂²f/∂xn²   ]

The Hessian is the Jacobian of the gradient.


Hessian

Example: Find the Hessian of f(x, y) = x²y + y²x at the point (1,1).
First, compute the gradient (i.e., the first-order partial derivatives):

∇f(x, y) = (∂f/∂x, ∂f/∂y)^T = (2xy + y², x² + 2yx)^T

Second, compute the Hessian (i.e., the second-order partial derivatives):

H_f(x, y) = [ ∂²f/∂x²   ∂²f/∂x∂y ] = [ 2y        2x + 2y ]
            [ ∂²f/∂y∂x  ∂²f/∂y²  ]   [ 2x + 2y   2x      ]

Finally, evaluate the Hessian matrix at the point (1,1):

H_f(1, 1) = [ 2  4 ]
            [ 4  2 ]
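
The worked example can be verified with a finite-difference Hessian (a sketch, not part of the slides; the helper name and step size are arbitrary):

    import numpy as np

    def numerical_hessian(f, x, h=1e-4):
        # Approximate the Hessian of f: R^n -> R with central second differences.
        n = x.size
        H = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                ei, ej = np.zeros(n), np.zeros(n)
                ei[i], ej[j] = h, h
                H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                           - f(x - ei + ej) + f(x - ei - ej)) / (4 * h ** 2)
        return H

    # f(x, y) = x^2*y + y^2*x; the Hessian at (1, 1) should be [[2, 4], [4, 2]].
    f = lambda p: p[0] ** 2 * p[1] + p[1] ** 2 * p[0]
    print(np.round(numerical_hessian(f, np.array([1.0, 1.0])), 3))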


Geometric meaning

If we follow the direction d from x, we can define a uni-dimensional function g(α):

g(α) = f(x + αd)
g'(α) = d^T ∇f(x + αd)
g''(α) = d^T ∇²f(x + αd) d

Interpretation:

g'(0) = d^T ∇f(x)        [directional derivative]
g''(0) = d^T ∇²f(x) d    [directional curvature]

If g''(0) is non-negative for a certain d: f is convex in direction d.
If g''(0) is non-negative for all d: ∇²f(x) is positive semidefinite → f is convex at x.
3 Positive definite matrices


Definitions

We say that a symmetric n × n matrix A is:
positive semidefinite (A ⪰ 0) if x^T A x ≥ 0 for all x,
positive definite (A ≻ 0) if x^T A x > 0 for all x ≠ 0,
negative semidefinite (A ⪯ 0) if x^T A x ≤ 0 for all x,
negative definite (A ≺ 0) if x^T A x < 0 for all x ≠ 0,
indefinite if none of the above apply.
The expression x^T A x is a function of x called the quadratic form associated to A. (It is made up of terms like xi² and xixj.)
We make these definitions for a symmetric matrix A, i.e., A^T = A. Hessian matrices are symmetric.


Diagonal matrices

For a diagonal matrix

D = [ d1  0   ...  0  ]
    [ 0   d2  ...  0  ]
    [ ⋮   ⋮    ⋱   ⋮  ]
    [ 0   0   ...  dn ]

the quadratic form x^T D x is just d1x1² + d2x2² + . . . + dnxn².
Diagonal matrices

If d1, . . . , dn are all nonnegative, then d1x1² + d2x2² + . . . + dnxn² must be nonnegative for any x, so D ⪰ 0: D is positive semidefinite.
If d1, . . . , dn are all positive, then d1x1² + d2x2² + . . . + dnxn² can only be 0 if x = 0, so D ≻ 0: D is positive definite.
If d1, . . . , dn ≤ 0, then D ⪯ 0, and if d1, . . . , dn < 0, then D ≺ 0.
D is indefinite if the signs of d1, . . . , dn are mixed.

Example: Consider the function f(x, y) = x² + 2y².
The gradient is ∇f(x, y) = (2x, 4y).
The Hessian matrix of f is:

H_f(x, y) = [ 2  0 ]
            [ 0  4 ]

For an arbitrary x ∈ R², we have

x^T [ 2  0 ; 0  4 ] x = 2x1² + 4x2² > 0 for all x ≠ 0.

So H_f(x, y) ≻ 0 for all (x, y) ∈ R²: H_f(x, y) is positive definite.
Positive definiteness and eigenvalues

For an n × n matrix A, if a nonzero vector x ∈ R^n satisfies

Ax = λx

for some scalar λ ∈ R, we call λ an eigenvalue of A and x its associated eigenvector.
If A is an n × n symmetric matrix, then it can be factored as

A = Q^T Λ Q = Q^T [ λ1  0   ...  0  ]
                  [ 0   λ2  ...  0  ]
                  [ ⋮   ⋮    ⋱   ⋮  ] Q
                  [ 0   0   ...  λn ]

where λ1, . . . , λn are the eigenvalues of A and the rows of Q (equivalently, the columns of Q^T) are the corresponding orthonormal eigenvectors.


Positive definiteness and eigenvalues

Applying this to the quadratic form x^T A x, we get

x^T A x = x^T Q^T Λ Q x = (Qx)^T Λ (Qx)

If we substitute y = Qx (converting to a different basis), the quadratic form becomes diagonal:

x^T A x = y^T Λ y = λ1y1² + λ2y2² + . . . + λnyn²

We can classify the matrix A by looking at its eigenvalues:
A ⪰ 0 if λ1, λ2, . . . , λn ≥ 0
A ≻ 0 if λ1, λ2, . . . , λn > 0
A ⪯ 0 if λ1, λ2, . . . , λn ≤ 0
A ≺ 0 if λ1, λ2, . . . , λn < 0
A is indefinite if it has both positive and negative eigenvalues.
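
This classification is easy to check numerically (a sketch, not from the slides; the function name and tolerance are arbitrary choices):

    import numpy as np

    def classify_symmetric(A, tol=1e-10):
        # Classify a symmetric matrix by the signs of its eigenvalues.
        lam = np.linalg.eigvalsh(A)   # eigenvalues of a symmetric matrix
        if np.all(lam > tol):
            return "positive definite"
        if np.all(lam >= -tol):
            return "positive semidefinite"
        if np.all(lam < -tol):
            return "negative definite"
        if np.all(lam <= tol):
            return "negative semidefinite"
        return "indefinite"

    print(classify_symmetric(np.array([[2.0, 0.0], [0.0, 4.0]])))  # positive definite
    print(classify_symmetric(np.array([[2.0, 4.0], [4.0, 2.0]])))  # indefinite (eigenvalues 6, -2)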
4 Optimality conditions


Optimality conditions for local vs global optima

For continuous, twice differentiable functions, we can characterize the points which correspond to local optima.
Let g(θ) = ∇L(θ) be the gradient vector, and H(θ) = ∇²L(θ) be the Hessian matrix.
Consider a point θ* ∈ R^D, and let g* = g(θ)|_{θ*} be the gradient at that point, and H* = H(θ)|_{θ*} be the corresponding Hessian.
Necessary conditions: If θ* is a local minimum, then we must have g* = 0 (i.e., θ* must be a stationary point), and H* must be positive semi-definite.
Sufficient conditions: If g* = 0 and H* is positive definite, then θ* is a local minimum.


Optimality conditions for local vs global optima

1 Necessary conditions: If θ* is a local minimum, then we must have g* = 0 (i.e., θ* must be a stationary point), and H* must be positive semi-definite.
Suppose we were at a point θ* at which the gradient is non-zero. At such a point, we could decrease the function by following the negative gradient a small distance, so this would not be optimal. So the gradient must be zero.
2 Sufficient conditions: If g* = 0 and H* is positive definite, then θ* is a local minimum.
Why is a zero gradient not sufficient on its own? The stationary point could be a local minimum, a local maximum, or a saddle point.


Global optimizers

We classify a stationary point of a function f : R^n → R as a global minimizer if the Hessian matrix of f is positive semidefinite everywhere, and as a global maximizer if the Hessian matrix is negative semidefinite everywhere.
If the Hessian matrix is positive definite, or negative definite, the minimizer or maximizer (respectively) is strict.


Example

Let f(x1, x2) = (x1² + x2² − 1)² + (x2² − 1)².
The gradient is

∇f(x) = 4 [ (x1² + x2² − 1)x1               ]
          [ (x1² + x2² − 1)x2 + (x2² − 1)x2 ]

The stationary points are (0,0), (1,0), (-1,0), (0,1), (0,-1).
The Hessian is

∇²f(x) = 4 [ 3x1² + x2² − 1   2x1x2          ]
           [ 2x1x2            x1² + 6x2² − 2 ]

Since ∇²f(0, 0) = 4 [ −1  0 ; 0  −2 ] ≺ 0, it follows that (0,0) is a strict local maximum point.
By the fact that f(x1, 0) = (x1² − 1)² + 1 → ∞ as x1 → ∞, the function is not bounded above, and thus (0,0) is not a global maximum point.


Example

∇²f(1, 0) = ∇²f(−1, 0) = 4 [ 2  0 ; 0  −1 ], which is an indefinite matrix.
Hence (1,0) and (-1,0) are saddle points.
∇²f(0, 1) = ∇²f(0, −1) = 4 [ 0  0 ; 0  4 ], which is positive semidefinite.
The fact that the Hessian matrices of f at (0,1) and (0,-1) are positive semidefinite is not enough to conclude that these are local minimum points; they might be saddle points.
However, in this case, since f(0, 1) = f(0, −1) = 0 and the function is lower bounded by zero, (0,1) and (0,-1) are global minimum points.
Because there are two global minimum points, they are nonstrict global minima, but they are strict local minimum points, since each has a neighborhood in which it is the unique minimizer.
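
The eigenvalues of the Hessian at each stationary point can be checked numerically (a sketch, not from the slides, reusing the Hessian derived above):

    import numpy as np

    # Hessian of f(x1, x2) = (x1^2 + x2^2 - 1)^2 + (x2^2 - 1)^2, from the example above.
    def hessian(x1, x2):
        return 4 * np.array([[3 * x1**2 + x2**2 - 1, 2 * x1 * x2],
                             [2 * x1 * x2,           x1**2 + 6 * x2**2 - 2]])

    for point in [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]:
        print(point, np.linalg.eigvalsh(hessian(*point)))
    # (0, 0)  -> eigenvalues (-8, -4): negative definite, strict local maximum
    # (±1, 0) -> eigenvalues (-4, 8): indefinite, saddle points
    # (0, ±1) -> eigenvalues (0, 16): positive semidefinite, inconclusive from the Hessian alone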




5 Constrained vs unconstrained optimization


Constrained vs unconstrained optimization

In unconstrained optimization, we find any value in the parameter space Θ that minimizes the loss.
We can also have a set of constraints C on the allowable values.
We partition the set of constraints C into:
Inequality constraints: gj(θ) ≤ 0 for j ∈ I.
Equality constraints: hk(θ) = 0 for k ∈ E.
The feasible set is the subset of the parameter space that satisfies the constraints:

C = {θ : gj(θ) ≤ 0, j ∈ I; hk(θ) = 0, k ∈ E} ⊆ R^D

Our constrained optimization problem is

θ̂ = argmin_{θ ∈ C} L(θ)

If C = R^D, it is called unconstrained optimization.
Constrained vs unconstrained optimization

Constraints can change the number of optima of a function.
A function that was unbounded (with no well-defined global maximum or minimum) can acquire multiple maxima or minima when we add constraints.
The task of finding any point (regardless of its cost) in the feasible set is called the feasibility problem.


6 Convex vs nonconvex optimization


Convex sets

In convex optimization, the objective is a convex function defined over a convex set.
In such problems, every local minimum is also a global minimum.
Many models are designed so that their training objectives are convex.
We say S is a convex set if, for any x, x′ ∈ S, we have

λx + (1 − λ)x′ ∈ S, ∀λ ∈ [0, 1]

If we draw a line from x to x′, all points on the line lie inside the set.


Convex functions

f is a convex function if its epigraph (the set of points above the function) defines a convex set.


Convex functions

f(x) is called a convex function if it is defined on a convex set S and if, for any x, y ∈ S and for any 0 ≤ λ ≤ 1, we have:

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)

A function is strictly convex if the inequality is strict for x ≠ y and 0 < λ < 1.
A function is concave if −f(x) is convex.
A function can be neither convex nor concave.
Some examples of 1d convex functions: x², e^{ax}, −log(x), x^a (a > 1, x > 0), |x|^a (a ≥ 1), x log x (x > 0).
Convex functions

Theorem
Suppose f : R^n → R is twice differentiable over its domain. Then f is convex iff H = ∇²f(x) is positive semi-definite for all x ∈ dom(f). Furthermore, f is strictly convex if H is positive definite.

For example, consider the quadratic form

f(x) = x^T A x

This is convex if A is positive semi-definite.
This is strictly convex if A is positive definite.
It is neither convex nor concave if A has eigenvalues of mixed sign.
Intuitively, a convex function is shaped like a bowl.


Convex functions

The quadratic form f(x) = x^T A x in 2d:
(a) A is positive definite, so f is strictly convex.
(b) A is negative definite, so f is strictly concave.
(c) A is positive semi-definite, but singular, so f is convex.
(d) A is indefinite, so f is neither convex nor concave.
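
A quick numerical illustration of case (d) (a sketch, not from the slides): for an indefinite A, the defining inequality of convexity fails along some segment.

    import numpy as np

    def quad(A, x):
        # Quadratic form f(x) = x^T A x.
        return x @ A @ x

    A = np.array([[1.0, 0.0], [0.0, -1.0]])            # indefinite: eigenvalues +1 and -1
    x, y, lam = np.array([0.0, 1.0]), np.array([0.0, -1.0]), 0.5

    lhs = quad(A, lam * x + (1 - lam) * y)              # f at the midpoint: 0.0
    rhs = lam * quad(A, x) + (1 - lam) * quad(A, y)     # chord value: -1.0
    print(lhs, rhs)  # lhs > rhs, so f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) fails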
7 Smooth vs nonsmooth optimization


Smooth vs nonsmooth optimization

In smooth optimization, the objective and constraints are continuously differentiable functions.
In nonsmooth optimization, there are some points where the gradient of the objective or the constraints is not well-defined.
In some problems, we partition the objective into a part that contains the smooth terms and a part that contains the nonsmooth terms:

L(θ) = L_s(θ) + L_r(θ)

where L_s is smooth (differentiable), and L_r is nonsmooth ("rough").
In ML, L_s is the training loss, and L_r is a regularizer, such as the ℓ1 norm of θ.
Smooth vs nonsmooth optimization
For smooth functions, we can quantify the degree of smoothness
using the Lipschitz constant.
In the 1d case, this is defined as any constant L ≥ 0 such that, for all
real x1 and x2 , we have:
|f (x1 ) − f (x2 ) ≤ L|x1 − x2 |

Given a constant L, the function output cannot change by more than


L if we change the function input by 1 unit.

University of Information Technology (UIT) Math for CS CS115 49 / 60
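The sketch below (illustrative only) empirically tests the condition |f (x1 ) − f (x2 )| ≤ L|x1 − x2 | on random pairs of points; f = sin and the candidate constants are assumptions made for the example, with L = 1 valid because |sin′(x)| ≤ 1.

import numpy as np

def satisfies_lipschitz(f, L, n_pairs=10000, low=-10.0, high=10.0, seed=0):
    # Empirically check |f(x1) - f(x2)| <= L * |x1 - x2| on random pairs of points.
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(low, high, n_pairs)
    x2 = rng.uniform(low, high, n_pairs)
    return bool(np.all(np.abs(f(x1) - f(x2)) <= L * np.abs(x1 - x2) + 1e-12))

print(satisfies_lipschitz(np.sin, L=1.0))   # True: the slope of sin is bounded by 1
print(satisfies_lipschitz(np.sin, L=0.5))   # likely False: near x = 0 the slope is about 1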




Subgradients

We generalize the notion of a derivative to work with functions which


have local discontinuities.
For a convex function f : Rn → R, we say g ∈ Rn is a subgradient
of f at x ∈ dom(f ) if for all vectors z ∈ dom(f ),

f (z) ≥ f (x) + g T (z − x)

At x1 , f is differentiable, and g1 is the unique subgradient at x1 .


At x2 , f is not differentiable, and there are many subgradients at x2 .
University of Information Technology (UIT) Math for CS CS115 51 / 60
Subgradients

A function f is called subdifferentiable at x if there is at least one
subgradient at x.
The set of such subgradients is called the subdifferential of f at x,
denoted as ∂f (x).
For example, consider f (x) = |x|. Its subdifferential is given by

∂f (x) = {−1} if x < 0;  [−1, 1] if x = 0;  {+1} if x > 0

where [−1, 1] here means any value between -1 and 1 (inclusive).

University of Information Technology (UIT) Math for CS CS115 52 / 60
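A minimal Python sketch of this example (names are ad hoc): one helper returns ∂f (x) for f (x) = |x| as an interval, and another checks the defining inequality f (z) ≥ f (x) + g(z − x) for a candidate subgradient g at sample points.

import numpy as np

def subdifferential_abs(x):
    # Subdifferential of f(x) = |x|, returned as the interval [lo, hi].
    if x < 0:
        return (-1.0, -1.0)   # the singleton {-1}
    if x > 0:
        return (1.0, 1.0)     # the singleton {+1}
    return (-1.0, 1.0)        # the whole interval [-1, 1] at x = 0

def is_subgradient(f, g, x, zs):
    # Check f(z) >= f(x) + g * (z - x) at the sample points zs.
    zs = np.asarray(zs)
    return bool(np.all(f(zs) >= f(x) + g * (zs - x)))

zs = np.linspace(-5.0, 5.0, 1001)
print(subdifferential_abs(0.0))              # (-1.0, 1.0)
print(is_subgradient(np.abs, 0.3, 0.0, zs))  # True: 0.3 lies in [-1, 1]
print(is_subgradient(np.abs, 1.5, 0.0, zs))  # False: 1.5 is not a subgradient at 0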




Table of Contents

1 Introduction

2 Matrix calculus

3 Positive definite matrices

4 Optimality conditions

5 Constrained vs unconstrained optimization

6 Convex vs nonconvex optimization

7 Smooth vs nonsmooth optimization

8 First-order methods

University of Information Technology (UIT) Math for CS CS115 54 / 60




First-order methods

We consider iterative optimization methods that leverage first order


derivatives of the objective function.
They compute which directions point “downhill”, but ignore
curvature information.
All these algorithms require the user to specify a starting point θ 0 .
At each iteration t, an update is performed

θ t+1 = θ t + ρt dt

where ρt is the step size or learning rate, and dt is a descent


direction, e.g., the negative of the gradient, given by g t = ∇θ L(θ)|θt .
The update steps are continued until a stationary point is reached,
where the gradient is zero.

University of Information Technology (UIT) Math for CS CS115 55 / 60
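A minimal gradient-descent sketch of this update rule (an illustration, not the slides' code): constant step size, dt = −g t , and a gradient-norm stopping test; the quadratic loss used at the end is an arbitrary example whose minimizer is (1, 2).

import numpy as np

def gradient_descent(grad, theta0, rho=0.1, max_iter=1000, tol=1e-6):
    # Iterate theta_{t+1} = theta_t + rho * d_t with d_t = -grad(theta_t).
    theta = np.asarray(theta0, dtype=float)
    for t in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:   # (approximately) stationary point reached
            break
        theta = theta - rho * g       # step along the negative gradient
    return theta

# Illustrative loss L(theta) = 0.5 * ||theta - (1, 2)||^2, whose gradient is theta - (1, 2)
grad = lambda theta: theta - np.array([1.0, 2.0])
print(gradient_descent(grad, theta0=[0.0, 0.0]))   # approaches [1. 2.]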




Descent direction

A direction d is a descent direction if there is a small enough (but


nonzero) amount ρ that we can move in direction d and be
guaranteed to decrease the function value.
We require that there exists a ρmax > 0 such that

L(θ + ρd) < L(θ)

for all 0 < ρ < ρmax .


The gradient at the current iterate,

g t ≜ ∇L(θ)|θt = ∇L(θ t ) = g(θ t )

points in the direction of maximal increase in L, so the negative


gradient is a descent direction.

University of Information Technology (UIT) Math for CS CS115 56 / 60
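The short sketch below illustrates the definition numerically for a made-up loss: moving a small distance ρ along d = −g from the current point decreases the loss for every step size tried (the loss, point, and step sizes are assumptions for the example).

import numpy as np

# Illustrative loss and gradient (assumed for this example)
L = lambda th: 0.5 * np.sum(th ** 2) + np.sum(th)
grad = lambda th: th + 1.0

theta = np.array([2.0, -3.0])
d = -grad(theta)                 # the negative gradient as descent direction

for rho in [1e-3, 1e-2, 1e-1]:   # small enough steps are guaranteed to decrease L
    print(rho, bool(L(theta + rho * d) < L(theta)))   # True for each rho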




Descent direction

Any direction d is also a descent direction if the angle θ between d


and −g t is less than 90 degrees and satisfies

dT g t = ∥d∥∥g t ∥ cos(θ) < 0

The best choice would be to pick dt = −g t .


This is the direction of steepest descent.

University of Information Technology (UIT) Math for CS CS115 57 / 60
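To make the condition dT g t < 0 concrete, the sketch below (illustrative values only) evaluates the dot product for a few candidate directions against a fixed gradient and reports which of them are descent directions.

import numpy as np

g = np.array([3.0, -2.0])        # gradient g_t at the current iterate (assumed)

candidates = {
    "steepest descent -g": -g,
    "partial descent":     np.array([-1.0, 0.0]),
    "ascent +g":           g,
    "orthogonal":          np.array([2.0, 3.0]),
}

for name, d in candidates.items():
    # d is a descent direction iff d^T g < 0 (its angle with -g is below 90 degrees)
    print(f"{name:>20}: d^T g = {d @ g:6.2f}, descent = {bool(d @ g < 0)}")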




Step size (learning rate)

The sequence of step sizes {ρt } is called the learning rate schedule.
The simplest method is to use constant step size, ρt = ρ.
However, if it is too large, the method may fail to converge. If it is
too small, the method will converge, but very slowly.
Example:
L(θ) = 0.5(θ1² − θ2 )² + 0.5(θ1 − 1)²
Pick our descent direction dt = −g t . Consider ρt = 0.1 vs ρt = 0.6:

University of Information Technology (UIT) Math for CS CS115 58 / 60
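A sketch of this comparison (the starting point and iteration count are arbitrary choices): run the constant-step-size update θ t+1 = θ t − ρ g t on this loss with ρ = 0.1 and ρ = 0.6 and compare where the iterates end up; the minimum is at (1, 1).

import numpy as np

def L(th):
    return 0.5 * (th[0] ** 2 - th[1]) ** 2 + 0.5 * (th[0] - 1.0) ** 2

def grad(th):
    r = th[0] ** 2 - th[1]
    return np.array([2.0 * th[0] * r + (th[0] - 1.0), -r])

for rho in [0.1, 0.6]:
    theta = np.array([0.0, 0.0])           # arbitrary starting point
    for t in range(50):
        theta = theta - rho * grad(theta)  # constant step size, d_t = -g_t
    print(f"rho = {rho}: theta = {theta}, L = {L(theta):.4f}")
# A small step size moves steadily toward the minimum at (1, 1);
# a large one can overshoot and oscillate instead of settling.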


Line search

The optimal step size can be found by finding the value that
maximally decreases the objective along the chosen direction by
solving the 1d minimization problem

ρt = argmin_{ρ>0} ϕt (ρ) = argmin_{ρ>0} L(θ t + ρdt )

This is line search: we are searching along the line defined by dt .


If the loss is convex, this subproblem is also convex, because
ϕt (ρ) = L(θ t + ρdt ) is a convex function of an affine function of ρ,
for fixed θ t and dt .

University of Information Technology (UIT) Math for CS CS115 59 / 60
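A sketch of solving this 1d subproblem numerically (assuming SciPy is available; the loss, iterate, and search interval are illustrative): minimize ϕt (ρ) = L(θ t + ρdt ) over a bounded range of step sizes.

import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative loss, current iterate, and steepest-descent direction
L = lambda th: 0.5 * (th[0] ** 2 - th[1]) ** 2 + 0.5 * (th[0] - 1.0) ** 2
theta_t = np.array([0.0, 0.0])
g_t = np.array([-1.0, 0.0])     # gradient of L at theta_t
d_t = -g_t

# phi_t(rho) = L(theta_t + rho * d_t) is a 1d function of the step size
phi = lambda rho: L(theta_t + rho * d_t)

res = minimize_scalar(phi, bounds=(0.0, 2.0), method="bounded")
print("rho_t =", res.x, " phi(rho_t) =", res.fun)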


Line search
Example: consider the quadratic loss
L(θ) = (1/2) θ T Aθ + bT θ + c
Computing the derivative of ϕ(ρ) = L(θ + ρd) gives
dϕ(ρ)/dρ = d/dρ [ (1/2)(θ + ρd)T A(θ + ρd) + bT (θ + ρd) + c ]
         = dT A(θ + ρd) + dT b
         = dT (Aθ + b) + ρ dT Ad
Solving dϕ(ρ)/dρ = 0 gives
ρ = − dT (Aθ + b) / (dT Ad)
This is exact line search. There are several methods, such as the
Armijo backtracking method, that try to ensure reduction in the
objective function without spending too much time trying to solve
this subproblem precisely.
University of Information Technology (UIT) Math for CS CS115 60 / 60
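The closed-form step size is easy to check numerically; the sketch below (with an arbitrary positive definite A, b, θ, and d) computes ρ = −dT (Aθ + b)/(dT Ad) and verifies that no other step size along d gives a lower loss.

import numpy as np

# Illustrative quadratic loss L(theta) = 0.5 * theta^T A theta + b^T theta + c
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite (assumed)
b = np.array([-1.0, 1.0])
c = 0.0
L = lambda th: 0.5 * th @ A @ th + b @ th + c

theta = np.array([1.0, 1.0])
d = -(A @ theta + b)                     # steepest-descent direction at theta

# Exact line search step size from the formula above
rho_star = -d @ (A @ theta + b) / (d @ A @ d)

# Compare against a dense grid of alternative step sizes along d
rhos = np.linspace(0.0, 2.0, 2001)
best_on_grid = min(L(theta + r * d) for r in rhos)
print("rho* =", rho_star, " optimal on grid:", bool(L(theta + rho_star * d) <= best_on_grid + 1e-12))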
