
Optimization

Ngoc-Hoang Luong

University of Information Technology (UIT)


Vietnam National University - Ho Chi Minh City (VNU-HCM)

Math for CS, Fall 2021



References

The contents of this document are taken mainly from the following source:
Kevin P. Murphy. Probabilistic Machine Learning: An Introduction.
https://probml.github.io/pml-book/book1.html
Table of Contents

1 Introduction

2 Matrix calculus

3 Positive definite matrices

4 Optimality conditions

5 Constrained vs unconstrained optimization

6 Convex vs nonconvex optimization

7 Smooth vs nonsmooth optimization

8 First-order methods





Introduction

The core problem in ML is parameter estimation (model fitting).
We need to solve an optimization problem: i.e., trying to find the values for a set of variables θ ∈ Θ that minimize a scalar-valued loss function or cost function L : Θ → R:

θ* = argmin_{θ ∈ Θ} L(θ)    (1)

The parameter space is given by Θ ⊆ R^D, where D is the number of variables being optimized.
We focus on continuous optimization.
To maximize a score function or reward function R(θ), we can minimize L(θ) = −R(θ).
The term objective function refers to a function we want to either maximize or minimize.
An algorithm to find an optimum of an objective function is called a solver.
Local versus global optimization

A point that satisfies Equation 1 is called a global optimum. Finding such a point is called global optimization.
In general, finding global optima is computationally intractable. We will try to find a local optimum instead.
For continuous problems, a local optimum is a point θ* which has lower (or equal) cost than "nearby" points:

∃δ > 0 such that ∀θ ∈ Θ with ∥θ − θ*∥ < δ: L(θ*) ≤ L(θ)    (2)

[Figure: a 1D function with a local minimum and the global minimum marked.]
Local versus global optimization

A local minimum could be surrounded by other local minima with the same objective value; this is known as a flat local minimum.
A point is said to be a strict local minimum if its cost is strictly lower than that of neighboring points:

∃δ > 0 such that ∀θ ∈ Θ, θ ≠ θ*, with ∥θ − θ*∥ < δ: L(θ*) < L(θ)    (3)

We can define a (strict) local maximum analogously.
2 Matrix calculus


Derivatives

The topic of calculus concerns computing "rates of change" of functions as we vary their inputs.
Consider a scalar-argument function f : R → R. Its derivative at a point x is the quantity

f'(x) ≜ lim_{h→0} [f(x + h) − f(x)] / h

assuming the limit exists.
This measures how quickly the output changes when we move a small distance in the input space away from x (i.e., the "rate of change").


Derivatives

f'(x) can be seen as the slope of the tangent line to f at the point (x, f(x)):

f(x + ∆x) ≈ f(x) + f'(x)∆x

for small ∆x.


Derivatives

We can compute a finite difference approximation to the derivative by using a finite step size h:

f'(x) = lim_{h→0} [f(x + h) − f(x)] / h            (forward difference)
      = lim_{h→0} [f(x + h/2) − f(x − h/2)] / h    (central difference)
      = lim_{h→0} [f(x) − f(x − h)] / h            (backward difference)

The smaller the step size h, the better the estimate.
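
As a small illustration (not part of the original slides), the central difference can be coded in a few lines of Python; the function name and the default step size h below are arbitrary choices:

    def central_difference(f, x, h=1e-5):
        # Approximate f'(x) by the central difference (f(x + h/2) - f(x - h/2)) / h.
        return (f(x + h / 2) - f(x - h / 2)) / h

    # Example: f(x) = x**2 has exact derivative f'(3) = 6.
    f = lambda x: x ** 2
    print(central_difference(f, 3.0))  # approximately 6.0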


Derivatives

We can think of differentiation as an operator that maps functions to functions, D(f) = f'.
f'(x) computes the derivative at x (assuming the derivative exists at that point).
The prime symbol f' used to denote the derivative is Lagrange notation.
The second derivative function, which measures how quickly the gradient is changing, is denoted by f''.
The n'th derivative function is denoted f^(n).
We can use Leibniz notation if we denote the function by y = f(x), and its derivative by dy/dx or (d/dx) f(x).
To denote the evaluation of the derivative at a point a, we write df/dx |_{x=a}.


Gradients

We extend the notion of derivatives to handle vector-argument functions, f : R^n → R, by defining the partial derivative of f with respect to xi to be

∂f/∂xi = lim_{h→0} [f(x + h ei) − f(x)] / h

where ei is the i'th unit vector, ei = (0, . . . , 1, . . . , 0), with the i'th element equal to 1 and all the other elements equal to 0.
The gradient of f at a point x is the vector of its partial derivatives:

g = ∂f/∂x = ∇f = (∂f/∂x1, . . . , ∂f/∂xn)^T = (∂f/∂x1) e1 + . . . + (∂f/∂xn) en

To emphasize the point at which the gradient is evaluated, we write

g(x*) ≜ ∂f/∂x |_{x = x*}
Gradients

Example:

f(x1, x2) = x1² + x1x2 + 3x2²

∇f(x1, x2) = (∂f/∂x1, ∂f/∂x2)^T = (2x1 + x2, x1 + 6x2)^T

The nabla operator ∇ maps a function f : R^n → R to another function g : R^n → R^n.
Since g(·) is a vector-valued function, it is known as a vector field.
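
A quick numerical check of this example (a sketch, not from the slides; the helper name numerical_gradient and the step size are arbitrary):

    import numpy as np

    def numerical_gradient(f, x, h=1e-6):
        # Approximate the gradient of f: R^n -> R by a central difference in each coordinate.
        g = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = 1.0
            g[i] = (f(x + h * e) - f(x - h * e)) / (2 * h)
        return g

    # f(x1, x2) = x1^2 + x1*x2 + 3*x2^2, whose analytic gradient is (2*x1 + x2, x1 + 6*x2).
    f = lambda x: x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2
    x = np.array([1.0, 2.0])
    print(numerical_gradient(f, x))                      # approximately [ 4. 13.]
    print(np.array([2 * x[0] + x[1], x[0] + 6 * x[1]]))  # exact:        [ 4. 13.]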


Directional derivative

The directional derivative measures how much the function f : R^n → R changes along a direction v in space:

D_v f(x) = lim_{h→0} [f(x + hv) − f(x)] / h

We can approximate this numerically using 2 function calls to f, regardless of n.
By contrast, a numerical approximation to the standard gradient vector takes n + 1 calls (or 2n if using central differences).
The directional derivative along v is the scalar product of the gradient g and the vector v:

D_v f(x) = ∇f(x) · v


Directional derivative

Example: Let f(x, y) = x²y. Find the derivative of f in the direction (1,2) at the point (3,2).
The gradient ∇f(x, y) is:

∇f(x, y) = (∂f/∂x, ∂f/∂y)^T = (2xy, x²)^T

∇f(3, 2) = (12, 9)^T = 12e1 + 9e2

Let u = u1e1 + u2e2 be a unit vector. The derivative of f in the direction of u at (3,2) is:

D_u f(3, 2) = ∇f(3, 2) · u = (12e1 + 9e2) · (u1e1 + u2e2) = 12u1 + 9u2


Directional derivative

Example (cont.)
The unit vector in the direction of vector (1,2) is:

u = (1, 2)/∥(1, 2)∥ = (1, 2)/√(1² + 2²) = (1, 2)/√5 = (1/√5, 2/√5)

The directional derivative at (3,2) in the direction of (1,2) is:

D_u f(3, 2) = 12u1 + 9u2 = 12/√5 + 18/√5 = 30/√5

We normalize the vector (1,2) so that the directional derivative is independent of its magnitude and depends only on its direction.


Directional derivative

Example 2: Let f(x, y) = x²y. Find the derivative of f in the direction of (2,1) at the point (3,2).
The unit vector in the direction of (2,1) is:

u = (2, 1)/√5 = (2/√5, 1/√5)

The directional derivative of f at (3,2) in the direction of (2,1) is:

D_u f(3, 2) = 12u1 + 9u2 = 24/√5 + 9/√5 = 33/√5
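
The two examples above can be checked numerically with two calls to f per direction (a sketch, not from the slides; the helper name and step size are arbitrary):

    import numpy as np

    def directional_derivative(f, x, v, h=1e-6):
        # Approximate D_v f(x) along the unit vector in the direction of v.
        u = v / np.linalg.norm(v)   # normalize so only the direction matters
        return (f(x + h * u) - f(x - h * u)) / (2 * h)

    # f(x, y) = x^2 * y at the point (3, 2).
    f = lambda p: p[0] ** 2 * p[1]
    x = np.array([3.0, 2.0])
    print(directional_derivative(f, x, np.array([1.0, 2.0])))  # approx 30/sqrt(5) = 13.42
    print(directional_derivative(f, x, np.array([2.0, 1.0])))  # approx 33/sqrt(5) = 14.76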


Directional derivative

Questions:
At a point a, in which direction u is the directional derivative D_u f(a) maximal?
What is the directional derivative in that direction?
The relationship between the gradient and the directional derivative:

D_u f(a) = ∇f(a) · u
         = ∥∇f(a)∥∥u∥ cos θ    [θ is the angle between u and the gradient]
         = ∥∇f(a)∥ cos θ       [u is a unit vector]

The maximal value of D_u f(a) occurs when u and ∇f(a) point in the same direction (i.e., θ = 0).


Directional derivative

D_u f(a) = ∇f(a) · u
         = ∥∇f(a)∥∥u∥ cos θ    [θ is the angle between u and the gradient]
         = ∥∇f(a)∥ cos θ       [u is a unit vector]

When θ = 0, the directional derivative D_u f(a) = ∥∇f(a)∥.
When θ = π, the directional derivative D_u f(a) = −∥∇f(a)∥.
For what value of θ is D_u f(a) = 0?


Jacobian

Consider a function that maps a vector to another vector, f : R^n → R^m. The Jacobian matrix of this function is an m × n matrix of partial derivatives:

J_f(x) = ∂f/∂x^T ≜
    [ ∂f1/∂x1  ...  ∂f1/∂xn ]   [ ∇f1(x)^T ]
    [    ⋮      ⋱      ⋮    ] = [     ⋮    ]
    [ ∂fm/∂x1  ...  ∂fm/∂xn ]   [ ∇fm(x)^T ]

We lay out the results in the same orientation as the output f. This is called the numerator layout of the Jacobian formulation.
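
As a hedged illustration of the numerator layout (not from the slides), a finite-difference Jacobian can be built one column at a time; the helper below and its name are choices made for this sketch:

    import numpy as np

    def numerical_jacobian(f, x, h=1e-6):
        # Approximate the m x n Jacobian of f: R^n -> R^m; column j holds d f / d x_j.
        fx = np.asarray(f(x), dtype=float)
        J = np.zeros((fx.size, x.size))
        for j in range(x.size):
            e = np.zeros_like(x)
            e[j] = 1.0
            J[:, j] = (np.asarray(f(x + h * e)) - np.asarray(f(x - h * e))) / (2 * h)
        return J

    # f(x1, x2) = (x1*x2, x1 + x2^2) has Jacobian [[x2, x1], [1, 2*x2]].
    f = lambda x: np.array([x[0] * x[1], x[0] + x[1] ** 2])
    print(numerical_jacobian(f, np.array([2.0, 3.0])))  # approximately [[3. 2.] [1. 6.]]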


Hessian

For a function f : R^n → R that is twice differentiable, the Hessian matrix is the (symmetric) n × n matrix of second partial derivatives:

H_f = ∂²f/∂x² = ∇²f =
    [ ∂²f/∂x1²     ...  ∂²f/∂x1∂xn ]
    [     ⋮         ⋱       ⋮      ]
    [ ∂²f/∂xn∂x1   ...  ∂²f/∂xn²   ]

The Hessian is the Jacobian of the gradient.


Hessian

Example: Find the Hessian of f(x, y) = x²y + y²x at the point (1,1).
First, compute the gradient (i.e., the first-order partial derivatives):

∇f(x, y) = (∂f/∂x, ∂f/∂y)^T = (2xy + y², x² + 2yx)^T

Second, compute the Hessian (i.e., the second-order partial derivatives):

H_f(x, y) = [ ∂²f/∂x²   ∂²f/∂x∂y ] = [ 2y        2x + 2y ]
            [ ∂²f/∂y∂x  ∂²f/∂y²  ]   [ 2x + 2y   2x      ]

Finally, evaluate the Hessian matrix at the point (1,1):

H_f(1, 1) = [ 2  4 ]
            [ 4  2 ]
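
The worked example can be verified with a finite-difference Hessian (a sketch, not part of the slides; the helper name and step size are arbitrary):

    import numpy as np

    def numerical_hessian(f, x, h=1e-4):
        # Approximate the Hessian of f: R^n -> R with central second differences.
        n = x.size
        H = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                ei, ej = np.zeros(n), np.zeros(n)
                ei[i], ej[j] = h, h
                H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                           - f(x - ei + ej) + f(x - ei - ej)) / (4 * h ** 2)
        return H

    # f(x, y) = x^2*y + y^2*x; the Hessian at (1, 1) should be [[2, 4], [4, 2]].
    f = lambda p: p[0] ** 2 * p[1] + p[1] ** 2 * p[0]
    print(np.round(numerical_hessian(f, np.array([1.0, 1.0])), 3))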


Geometric meaning

If we follow the direction d from x, we can define a uni-dimensional function g(α):

g(α) = f(x + αd)
g'(α) = d^T ∇f(x + αd)
g''(α) = d^T ∇²f(x + αd) d

Interpretation:

g'(0) = d^T ∇f(x)        [directional derivative]
g''(0) = d^T ∇²f(x) d    [directional curvature]

If g''(0) is non-negative for a certain d: f is convex in direction d.
If g''(0) is non-negative for all d: ∇²f(x) is positive semidefinite → f is convex at x.
3 Positive definite matrices


Definitions

We say that a symmetric n × n matrix A is:
positive semidefinite (A ⪰ 0) if x^T A x ≥ 0 for all x,
positive definite (A ≻ 0) if x^T A x > 0 for all x ≠ 0,
negative semidefinite (A ⪯ 0) if x^T A x ≤ 0 for all x,
negative definite (A ≺ 0) if x^T A x < 0 for all x ≠ 0,
indefinite if none of the above apply.
The expression x^T A x is a function of x called the quadratic form associated to A. (It is made up of terms like xi² and xixj.)
We make these definitions for a symmetric matrix A, i.e., A^T = A. Hessian matrices are symmetric.


Diagonal matrices

For a diagonal matrix

D = [ d1  0   ...  0  ]
    [ 0   d2  ...  0  ]
    [ ⋮   ⋮    ⋱   ⋮  ]
    [ 0   0   ...  dn ]

the quadratic form x^T D x is just d1x1² + d2x2² + . . . + dnxn².
Diagonal matrices

If d1, . . . , dn are all nonnegative, then d1x1² + d2x2² + . . . + dnxn² must be nonnegative for any x, so D ⪰ 0: D is positive semidefinite.
If d1, . . . , dn are all positive, then d1x1² + d2x2² + . . . + dnxn² can only be 0 if x = 0, so D ≻ 0: D is positive definite.
If d1, . . . , dn ≤ 0, then D ⪯ 0, and if d1, . . . , dn < 0, then D ≺ 0.
D is indefinite if the signs of d1, . . . , dn are mixed.

Example: Consider the function f(x, y) = x² + 2y².
The gradient is ∇f(x, y) = (2x, 4y).
The Hessian matrix of f is:

H_f(x, y) = [ 2  0 ]
            [ 0  4 ]

For an arbitrary x ∈ R², we have

x^T [ 2  0 ; 0  4 ] x = 2x1² + 4x2² > 0 for all x ≠ 0.

So H_f(x, y) ≻ 0 for all (x, y) ∈ R²: H_f(x, y) is positive definite.
Positive definiteness and eigenvalues

For an n × n matrix A, if a nonzero vector x ∈ R^n satisfies

Ax = λx

for some scalar λ ∈ R, we call λ an eigenvalue of A and x its associated eigenvector.
If A is an n × n symmetric matrix, then it can be factored as

A = Q^T Λ Q = Q^T [ λ1  0   ...  0  ]
                  [ 0   λ2  ...  0  ]
                  [ ⋮   ⋮    ⋱   ⋮  ] Q
                  [ 0   0   ...  λn ]

where λ1, . . . , λn are the eigenvalues of A and the rows of Q (equivalently, the columns of Q^T) are the corresponding orthonormal eigenvectors.


Positive definiteness and eigenvalues

Applying this to the quadratic form x^T A x, we get

x^T A x = x^T Q^T Λ Q x = (Qx)^T Λ (Qx)

If we substitute y = Qx (converting to a different basis), the quadratic form becomes diagonal:

x^T A x = y^T Λ y = λ1y1² + λ2y2² + . . . + λnyn²

We can classify the matrix A by looking at its eigenvalues:
A ⪰ 0 if λ1, λ2, . . . , λn ≥ 0
A ≻ 0 if λ1, λ2, . . . , λn > 0
A ⪯ 0 if λ1, λ2, . . . , λn ≤ 0
A ≺ 0 if λ1, λ2, . . . , λn < 0
A is indefinite if it has both positive and negative eigenvalues.
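
This classification is easy to check numerically (a sketch, not from the slides; the function name and tolerance are arbitrary choices):

    import numpy as np

    def classify_symmetric(A, tol=1e-10):
        # Classify a symmetric matrix by the signs of its eigenvalues.
        lam = np.linalg.eigvalsh(A)   # eigenvalues of a symmetric matrix
        if np.all(lam > tol):
            return "positive definite"
        if np.all(lam >= -tol):
            return "positive semidefinite"
        if np.all(lam < -tol):
            return "negative definite"
        if np.all(lam <= tol):
            return "negative semidefinite"
        return "indefinite"

    print(classify_symmetric(np.array([[2.0, 0.0], [0.0, 4.0]])))  # positive definite
    print(classify_symmetric(np.array([[2.0, 4.0], [4.0, 2.0]])))  # indefinite (eigenvalues 6, -2)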
4 Optimality conditions


Optimality conditions for local vs global optima

For continuous, twice differentiable functions, we can characterize the points which correspond to local optima.
Let g(θ) = ∇L(θ) be the gradient vector, and H(θ) = ∇²L(θ) be the Hessian matrix.
Consider a point θ* ∈ R^D, and let g* = g(θ)|_{θ*} be the gradient at that point, and H* = H(θ)|_{θ*} be the corresponding Hessian.
Necessary conditions: If θ* is a local minimum, then we must have g* = 0 (i.e., θ* must be a stationary point), and H* must be positive semi-definite.
Sufficient conditions: If g* = 0 and H* is positive definite, then θ* is a local minimum.


Optimality conditions for local vs global optima

1 Necessary conditions: If θ* is a local minimum, then we must have g* = 0 (i.e., θ* must be a stationary point), and H* must be positive semi-definite.
Suppose we were at a point θ* at which the gradient is non-zero. At such a point, we could decrease the function by following the negative gradient a small distance, so this would not be optimal. So the gradient must be zero.
2 Sufficient conditions: If g* = 0 and H* is positive definite, then θ* is a local minimum.
Why is a zero gradient not sufficient on its own? The stationary point could be a local minimum, a local maximum, or a saddle point.


Global optimizers

We classify a stationary point of a function f : R^n → R as a global minimizer if the Hessian matrix of f is positive semidefinite everywhere, and as a global maximizer if the Hessian matrix is negative semidefinite everywhere.
If the Hessian matrix is positive definite, or negative definite, the minimizer or maximizer (respectively) is strict.


Example

Let f(x1, x2) = (x1² + x2² − 1)² + (x2² − 1)².
The gradient is

∇f(x) = 4 [ (x1² + x2² − 1)x1               ]
          [ (x1² + x2² − 1)x2 + (x2² − 1)x2 ]

The stationary points are (0,0), (1,0), (-1,0), (0,1), (0,-1).
The Hessian is

∇²f(x) = 4 [ 3x1² + x2² − 1   2x1x2          ]
           [ 2x1x2            x1² + 6x2² − 2 ]

Since ∇²f(0, 0) = 4 [ −1  0 ; 0  −2 ] ≺ 0, it follows that (0,0) is a strict local maximum point.
By the fact that f(x1, 0) = (x1² − 1)² + 1 → ∞ as x1 → ∞, the function is not bounded above, and thus (0,0) is not a global maximum point.


Example

∇²f(1, 0) = ∇²f(−1, 0) = 4 [ 2  0 ; 0  −1 ], which is an indefinite matrix.
Hence (1,0) and (-1,0) are saddle points.
∇²f(0, 1) = ∇²f(0, −1) = 4 [ 0  0 ; 0  4 ], which is positive semidefinite.
The fact that the Hessian matrices of f at (0,1) and (0,-1) are positive semidefinite is not enough to conclude that these are local minimum points; they might be saddle points.
However, in this case, since f(0, 1) = f(0, −1) = 0 and the function is lower bounded by zero, (0,1) and (0,-1) are global minimum points.
Because there are two global minimum points, they are nonstrict global minima, but they are strict local minimum points, since each has a neighborhood in which it is the unique minimizer.
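
The eigenvalues of the Hessian at each stationary point can be checked numerically (a sketch, not from the slides, reusing the Hessian derived above):

    import numpy as np

    # Hessian of f(x1, x2) = (x1^2 + x2^2 - 1)^2 + (x2^2 - 1)^2, from the example above.
    def hessian(x1, x2):
        return 4 * np.array([[3 * x1**2 + x2**2 - 1, 2 * x1 * x2],
                             [2 * x1 * x2,           x1**2 + 6 * x2**2 - 2]])

    for point in [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]:
        print(point, np.linalg.eigvalsh(hessian(*point)))
    # (0, 0)  -> eigenvalues (-8, -4): negative definite, strict local maximum
    # (±1, 0) -> eigenvalues (-4, 8): indefinite, saddle points
    # (0, ±1) -> eigenvalues (0, 16): positive semidefinite, inconclusive from the Hessian alone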




5 Constrained vs unconstrained optimization


Constrained vs unconstrained optimization

In unconstrained optimization, we find any value in the parameter space Θ that minimizes the loss.
We can also have a set of constraints C on the allowable values.
We partition the set of constraints C into:
Inequality constraints: gj(θ) ≤ 0 for j ∈ I.
Equality constraints: hk(θ) = 0 for k ∈ E.
The feasible set is the subset of the parameter space that satisfies the constraints:

C = {θ : gj(θ) ≤ 0, j ∈ I; hk(θ) = 0, k ∈ E} ⊆ R^D

Our constrained optimization problem is

θ̂ = argmin_{θ ∈ C} L(θ)

If C = R^D, it is called unconstrained optimization.
Constrained vs unconstrained optimization

Constraints can change the number of optima of a function.
A function that was unbounded (with no well-defined global maximum or minimum) can acquire multiple maxima or minima when we add constraints.
The task of finding any point (regardless of its cost) in the feasible set is called the feasibility problem.


6 Convex vs nonconvex optimization


Convex sets

In convex optimization, the objective is a convex function defined over a convex set.
In such problems, every local minimum is also a global minimum.
Many models are designed so that their training objectives are convex.
We say S is a convex set if, for any x, x′ ∈ S, we have

λx + (1 − λ)x′ ∈ S, ∀λ ∈ [0, 1]

If we draw a line from x to x′, all points on the line lie inside the set.


Convex functions

f is a convex function if its epigraph (the set of points above the function) defines a convex set.


Convex functions

f(x) is called a convex function if it is defined on a convex set S and if, for any x, y ∈ S and for any 0 ≤ λ ≤ 1, we have:

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)

A function is strictly convex if the inequality is strict for x ≠ y and 0 < λ < 1.
A function is concave if −f(x) is convex.
A function can be neither convex nor concave.
Some examples of 1d convex functions: x², e^{ax}, −log(x), x^a (a > 1, x > 0), |x|^a (a ≥ 1), x log x (x > 0).
Convex functions

Theorem
Suppose f : R^n → R is twice differentiable over its domain. Then f is convex iff H = ∇²f(x) is positive semi-definite for all x ∈ dom(f). Furthermore, f is strictly convex if H is positive definite.

For example, consider the quadratic form

f(x) = x^T A x

This is convex if A is positive semi-definite.
This is strictly convex if A is positive definite.
It is neither convex nor concave if A has eigenvalues of mixed sign.
Intuitively, a convex function is shaped like a bowl.


Convex functions

The quadratic form f(x) = x^T A x in 2d:
(a) A is positive definite, so f is strictly convex.
(b) A is negative definite, so f is strictly concave.
(c) A is positive semi-definite, but singular, so f is convex.
(d) A is indefinite, so f is neither convex nor concave.
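
A quick numerical illustration of case (d) (a sketch, not from the slides): for an indefinite A, the defining inequality of convexity fails along some segment.

    import numpy as np

    def quad(A, x):
        # Quadratic form f(x) = x^T A x.
        return x @ A @ x

    A = np.array([[1.0, 0.0], [0.0, -1.0]])            # indefinite: eigenvalues +1 and -1
    x, y, lam = np.array([0.0, 1.0]), np.array([0.0, -1.0]), 0.5

    lhs = quad(A, lam * x + (1 - lam) * y)              # f at the midpoint: 0.0
    rhs = lam * quad(A, x) + (1 - lam) * quad(A, y)     # chord value: -1.0
    print(lhs, rhs)  # lhs > rhs, so f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) fails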
7 Smooth vs nonsmooth optimization


Smooth vs nonsmooth optimization

In smooth optimization, the objective and constraints are continuously differentiable functions.
In nonsmooth optimization, there are some points where the gradient of the objective or the constraints is not well-defined.
In some problems, we partition the objective into a part that contains the smooth terms and a part that contains the nonsmooth terms:

L(θ) = L_s(θ) + L_r(θ)

where L_s is smooth (differentiable), and L_r is nonsmooth ("rough").
In ML, L_s is the training loss, and L_r is a regularizer, such as the ℓ1 norm of θ.
Smooth vs nonsmooth optimization
For smooth functions, we can quantify the degree of smoothness
using the Lipschitz constant.
In the 1d case, this is defined as any constant L ≥ 0 such that, for all
real x1 and x2 , we have:
|f (x1 ) − f (x2 ) ≤ L|x1 − x2 |

Given a constant L, the function output cannot change by more than


L if we change the function input by 1 unit.

University of Information Technology (UIT) Math for CS CS115 49 / 60
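The sketch below (illustrative only) empirically tests the condition |f (x1 ) − f (x2 )| ≤ L|x1 − x2 | on random pairs of points; f = sin and the candidate constants are assumptions made for the example, with L = 1 valid because |sin′(x)| ≤ 1.

import numpy as np

def satisfies_lipschitz(f, L, n_pairs=10000, low=-10.0, high=10.0, seed=0):
    # Empirically check |f(x1) - f(x2)| <= L * |x1 - x2| on random pairs of points.
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(low, high, n_pairs)
    x2 = rng.uniform(low, high, n_pairs)
    return bool(np.all(np.abs(f(x1) - f(x2)) <= L * np.abs(x1 - x2) + 1e-12))

print(satisfies_lipschitz(np.sin, L=1.0))   # True: the slope of sin is bounded by 1
print(satisfies_lipschitz(np.sin, L=0.5))   # likely False: near x = 0 the slope is about 1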




Subgradients

We generalize the notion of a derivative to work with functions which


have local discontinuities.
For a convex function f : Rn → R, we say g ∈ Rn is a subgradient
of f at x ∈ dom(f ) if for all vectors z ∈ dom(f ),

f (z) ≥ f (x) + g T (z − x)

At x1 , f is differentiable, and g1 is the unique subgradient at x1 .


At x2 , f is not differentiable, and there are many subgradients at x2 .
University of Information Technology (UIT) Math for CS CS115 51 / 60
Subgradients

A function f is called subdifferentiable at x if there is at least one
subgradient at x.
The set of such subgradients is called the subdifferential of f at x,
denoted as ∂f (x).
For example, consider f (x) = |x|. Its subdifferential is given by

∂f (x) = {−1} if x < 0;  [−1, 1] if x = 0;  {+1} if x > 0

where [−1, 1] here means any value between -1 and 1 (inclusive).

University of Information Technology (UIT) Math for CS CS115 52 / 60
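A minimal Python sketch of this example (names are ad hoc): one helper returns ∂f (x) for f (x) = |x| as an interval, and another checks the defining inequality f (z) ≥ f (x) + g(z − x) for a candidate subgradient g at sample points.

import numpy as np

def subdifferential_abs(x):
    # Subdifferential of f(x) = |x|, returned as the interval [lo, hi].
    if x < 0:
        return (-1.0, -1.0)   # the singleton {-1}
    if x > 0:
        return (1.0, 1.0)     # the singleton {+1}
    return (-1.0, 1.0)        # the whole interval [-1, 1] at x = 0

def is_subgradient(f, g, x, zs):
    # Check f(z) >= f(x) + g * (z - x) at the sample points zs.
    zs = np.asarray(zs)
    return bool(np.all(f(zs) >= f(x) + g * (zs - x)))

zs = np.linspace(-5.0, 5.0, 1001)
print(subdifferential_abs(0.0))              # (-1.0, 1.0)
print(is_subgradient(np.abs, 0.3, 0.0, zs))  # True: 0.3 lies in [-1, 1]
print(is_subgradient(np.abs, 1.5, 0.0, zs))  # False: 1.5 is not a subgradient at 0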




Table of Contents

1 Introduction

2 Matrix calculus

3 Positive definite matrices

4 Optimality conditions

5 Constrained vs unconstrained optimization

6 Convex vs nonconvex optimization

7 Smooth vs nonsmooth optimization

8 First-order methods

University of Information Technology (UIT) Math for CS CS115 54 / 60




First-order methods

We consider iterative optimization methods that leverage first order


derivatives of the objective function.
They compute which directions point “downhill”, but ignore
curvature information.
All these algorithms require the user to specify a starting point θ 0 .
At each iteration t, an update is performed

θ t+1 = θ t + ρt dt

where ρt is the step size or learning rate, and dt is a descent


direction, e.g., the negative of the gradient, given by g t = ∇θ L(θ)|θt .
The update steps are continued until a stationary point is reached,
where the gradient is zero.

University of Information Technology (UIT) Math for CS CS115 55 / 60
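A minimal gradient-descent sketch of this update rule (an illustration, not the slides' code): constant step size, dt = −g t , and a gradient-norm stopping test; the quadratic loss used at the end is an arbitrary example whose minimizer is (1, 2).

import numpy as np

def gradient_descent(grad, theta0, rho=0.1, max_iter=1000, tol=1e-6):
    # Iterate theta_{t+1} = theta_t + rho * d_t with d_t = -grad(theta_t).
    theta = np.asarray(theta0, dtype=float)
    for t in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:   # (approximately) stationary point reached
            break
        theta = theta - rho * g       # step along the negative gradient
    return theta

# Illustrative loss L(theta) = 0.5 * ||theta - (1, 2)||^2, whose gradient is theta - (1, 2)
grad = lambda theta: theta - np.array([1.0, 2.0])
print(gradient_descent(grad, theta0=[0.0, 0.0]))   # approaches [1. 2.]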




Descent direction

A direction d is a descent direction if there is a small enough (but


nonzero) amount ρ that we can move in direction d and be
guaranteed to decrease the function value.
We require that there exists a ρmax > 0 such that

L(θ + ρd) < L(θ)

for all 0 < ρ < ρmax .


The gradient at the current iterate,

g t ≜ ∇L(θ)|θt = ∇L(θ t ) = g(θ t )

points in the direction of maximal increase in L, so the negative


gradient is a descent direction.

University of Information Technology (UIT) Math for CS CS115 56 / 60
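The short sketch below illustrates the definition numerically for a made-up loss: moving a small distance ρ along d = −g from the current point decreases the loss for every step size tried (the loss, point, and step sizes are assumptions for the example).

import numpy as np

# Illustrative loss and gradient (assumed for this example)
L = lambda th: 0.5 * np.sum(th ** 2) + np.sum(th)
grad = lambda th: th + 1.0

theta = np.array([2.0, -3.0])
d = -grad(theta)                 # the negative gradient as descent direction

for rho in [1e-3, 1e-2, 1e-1]:   # small enough steps are guaranteed to decrease L
    print(rho, bool(L(theta + rho * d) < L(theta)))   # True for each rho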




Descent direction

Any direction d is also a descent direction if the angle θ between d


and −g t is less than 90 degrees and satisfies

dT g t = ∥d∥∥g t ∥ cos(θ) < 0

The best choice would be to pick dt = −g t .


This is the direction of steepest descent.

University of Information Technology (UIT) Math for CS CS115 57 / 60
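To make the condition dT g t < 0 concrete, the sketch below (illustrative values only) evaluates the dot product for a few candidate directions against a fixed gradient and reports which of them are descent directions.

import numpy as np

g = np.array([3.0, -2.0])        # gradient g_t at the current iterate (assumed)

candidates = {
    "steepest descent -g": -g,
    "partial descent":     np.array([-1.0, 0.0]),
    "ascent +g":           g,
    "orthogonal":          np.array([2.0, 3.0]),
}

for name, d in candidates.items():
    # d is a descent direction iff d^T g < 0 (its angle with -g is below 90 degrees)
    print(f"{name:>20}: d^T g = {d @ g:6.2f}, descent = {bool(d @ g < 0)}")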




Step size (learning rate)

The sequence of step sizes {ρt } is called the learning rate schedule.
The simplest method is to use constant step size, ρt = ρ.
However, if it is too large, the method may fail to converge. If it is
too small, the method will converge, but very slowly.
Example:
L(θ) = 0.5(θ1² − θ2 )² + 0.5(θ1 − 1)²
Pick our descent direction dt = −g t . Consider ρt = 0.1 vs ρt = 0.6:

University of Information Technology (UIT) Math for CS CS115 58 / 60
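A sketch of this comparison (the starting point and iteration count are arbitrary choices): run the constant-step-size update θ t+1 = θ t − ρ g t on this loss with ρ = 0.1 and ρ = 0.6 and compare where the iterates end up; the minimum is at (1, 1).

import numpy as np

def L(th):
    return 0.5 * (th[0] ** 2 - th[1]) ** 2 + 0.5 * (th[0] - 1.0) ** 2

def grad(th):
    r = th[0] ** 2 - th[1]
    return np.array([2.0 * th[0] * r + (th[0] - 1.0), -r])

for rho in [0.1, 0.6]:
    theta = np.array([0.0, 0.0])           # arbitrary starting point
    for t in range(50):
        theta = theta - rho * grad(theta)  # constant step size, d_t = -g_t
    print(f"rho = {rho}: theta = {theta}, L = {L(theta):.4f}")
# A small step size moves steadily toward the minimum at (1, 1);
# a large one can overshoot and oscillate instead of settling.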


Line search

The optimal step size can be found by finding the value that
maximally decreases the objective along the chosen direction by
solving the 1d minimization problem

ρt = argmin_{ρ>0} ϕt (ρ) = argmin_{ρ>0} L(θ t + ρdt )

This is line search: we are searching along the line defined by dt .


If the loss is convex, this subproblem is also convex, because
ϕt (ρ) = L(θ t + ρdt ) is a convex function of an affine function of ρ,
for fixed θ t and dt .

University of Information Technology (UIT) Math for CS CS115 59 / 60
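A sketch of solving this 1d subproblem numerically (assuming SciPy is available; the loss, iterate, and search interval are illustrative): minimize ϕt (ρ) = L(θ t + ρdt ) over a bounded range of step sizes.

import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative loss, current iterate, and steepest-descent direction
L = lambda th: 0.5 * (th[0] ** 2 - th[1]) ** 2 + 0.5 * (th[0] - 1.0) ** 2
theta_t = np.array([0.0, 0.0])
g_t = np.array([-1.0, 0.0])     # gradient of L at theta_t
d_t = -g_t

# phi_t(rho) = L(theta_t + rho * d_t) is a 1d function of the step size
phi = lambda rho: L(theta_t + rho * d_t)

res = minimize_scalar(phi, bounds=(0.0, 2.0), method="bounded")
print("rho_t =", res.x, " phi(rho_t) =", res.fun)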


Line search
Example: consider the quadratic loss
L(θ) = (1/2) θ T Aθ + bT θ + c
Computing the derivative of ϕ(ρ) = L(θ + ρd) gives
dϕ(ρ)/dρ = d/dρ [ (1/2)(θ + ρd)T A(θ + ρd) + bT (θ + ρd) + c ]
         = dT A(θ + ρd) + dT b
         = dT (Aθ + b) + ρ dT Ad
Solving dϕ(ρ)/dρ = 0 gives
ρ = − dT (Aθ + b) / (dT Ad)
This is exact line search. There are several methods, such as the
Armijo backtracking method, that try to ensure reduction in the
objective function without spending too much time trying to solve
this subproblem precisely.
University of Information Technology (UIT) Math for CS CS115 60 / 60
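The closed-form step size is easy to check numerically; the sketch below (with an arbitrary positive definite A, b, θ, and d) computes ρ = −dT (Aθ + b)/(dT Ad) and verifies that no other step size along d gives a lower loss.

import numpy as np

# Illustrative quadratic loss L(theta) = 0.5 * theta^T A theta + b^T theta + c
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite (assumed)
b = np.array([-1.0, 1.0])
c = 0.0
L = lambda th: 0.5 * th @ A @ th + b @ th + c

theta = np.array([1.0, 1.0])
d = -(A @ theta + b)                     # steepest-descent direction at theta

# Exact line search step size from the formula above
rho_star = -d @ (A @ theta + b) / (d @ A @ d)

# Compare against a dense grid of alternative step sizes along d
rhos = np.linspace(0.0, 2.0, 2001)
best_on_grid = min(L(theta + r * d) for r in rhos)
print("rho* =", rho_star, " optimal on grid:", bool(L(theta + rho_star * d) <= best_on_grid + 1e-12))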
