06 Optimization
Optimization
Textbook §4.1-4.4
September 20, 2022
Announcements
1. Recap
5. Optimization Properties
Recap: Estimators
Exercise: Making your
own optimization algorithm
• Imagine I told you that you need to find the minimum of a function
• Pretend you have never heard of gradient descent. What algorithm might you design to find this?
[Figure: a curve with local minima, a saddlepoint, and the global minimum]
Convex functions
f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2) for all x1, x2 and all t ∈ [0, 1]
* from Wikipedia
Identifying the type of
the stationary point
• If the function curves upwards (convex) locally, then local minimum
• If the function curves downwards (concave) locally, then local maximum
• If the function is flat locally, then it might be a saddlepoint, but could also be a local min or local max
• Locally, we cannot distinguish between a local min and a global min (it's a global property of the surface)
[Figure: local minima, a saddlepoint, and the global minimum]
Second derivative reflects curvature
A corresponding definition is a concave function, which is precisely the opposite: the function lies above the line between any two points. For any convex function c, the negative of that function, −c, is a concave function.
Second derivative test
The second derivative test tells us locally if a stationary point w0 is a local minimum, a local maximum, or if the test is inconclusive. Namely, the test is
1. If c′′(w0) > 0 then w0 is a local minimum.
2. If c′′(w0) < 0 then w0 is a local maximum.
3. If c′′(w0) = 0 then the test is inconclusive: we cannot say which type of stationary point we have, and it could be any of the three.
To understand this test, notice that the second derivative reflects the local curvature of the function. It tells us how the derivative is changing. If the slope of the derivative c′(w0) is positive at w0, namely c′′(w0) > 0, then we know that the derivative is increasing, and vice versa.
Let us consider an example to understand this better. Consider a sin curve sin(w) and the point halfway between the bottom and top of the hill. At one of these in-between points, say w = 0, the derivative is maximally positive: it is cos(0) = +1. As we increase w, the derivative starts to decrease until it is zero at the top of the hill, at w = π/2. Then it flips sign.
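As an aside (not from the slides), the second derivative test can be checked numerically. The helper name and the central-difference scheme below are my own choices for illustration:

```python
import math

def second_derivative_test(c, w0, h=1e-4, tol=1e-6):
    # Central-difference estimate of c''(w0), used to classify
    # a stationary point w0 of the function c.
    c2 = (c(w0 + h) - 2 * c(w0) + c(w0 - h)) / h**2
    if c2 > tol:
        return "local minimum"
    if c2 < -tol:
        return "local maximum"
    return "inconclusive"

# sin(w) has stationary points at w = pi/2 (hilltop) and w = -pi/2 (valley)
print(second_derivative_test(math.sin, math.pi / 2))   # local maximum
print(second_derivative_test(math.sin, -math.pi / 2))  # local minimum
print(second_derivative_test(lambda w: w**3, 0.0))     # inconclusive
```

The cubic w³ shows case 3 of the test: its stationary point at 0 has zero second derivative, and the test cannot classify it.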
Testing optimality without the
second derivative test
Convex functions have a global minimum at every stationary point
c is convex ⟺ c(tw1 + (1 − t)w2) ≤ tc(w1) + (1 − t)c(w2) for all w1, w2 and all t ∈ [0, 1]
Procedure
• Find a stationary point, namely w0 such that c′(w0) =0
• Sometimes we can do this analytically (closed form solution, namely an
explicit formula for w0)
• Find the solution to the optimization problem min_{w ∈ ℝ} (w − 2)² + (w − 3)²
Show that all of these have the same set of stationary points, namely points w where c′(w) = 0
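A quick check of the exercise, as a sketch with the derivative computed by hand:

```python
def c(w):
    return (w - 2) ** 2 + (w - 3) ** 2

def c_prime(w):
    # c'(w) = 2(w - 2) + 2(w - 3) = 4w - 10
    return 4 * w - 10

# Setting c'(w) = 0 gives the single stationary point w0 = 10/4 = 2.5.
# c''(w) = 4 > 0 everywhere, so c is convex and w0 is the global minimum.
w0 = 2.5
print(c_prime(w0), c(w0))  # 0.0 0.5
```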
Numerical Optimization
• So a simple recipe for optimizing a function is to find its stationary points; one of those must be the minimum (as long as the domain is unbounded)
• But we will almost never be able to analytically compute the minimum of the functions that we want to optimize
✴ (Linear regression is an important exception)
• Instead, we must try to find the minimum numerically
• Main techniques: first-order and second-order gradient descent
Taylor Series
Definition: A Taylor series is a way of approximating a function c in a small neighbourhood around a point a:
c(w) ≈ c(a) + c′(a)(w − a) + c′′(a)/2 (w − a)² + ⋯ + c⁽ᵏ⁾(a)/k! (w − a)ᵏ
     = c(a) + Σᵢ₌₁ᵏ c⁽ⁱ⁾(a)/i! (w − a)ⁱ
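A minimal sketch of the truncated series above; the helper name and the choice of sin as the example function are mine:

```python
import math

def taylor_approx(derivs, a, w):
    # k-th order Taylor approximation of c at w around a, given the
    # list derivs = [c(a), c'(a), c''(a), ..., c^(k)(a)].
    return sum(d * (w - a) ** i / math.factorial(i)
               for i, d in enumerate(derivs))

# Derivatives of sin at a = 0 cycle through 0, 1, 0, -1, ...
derivs = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
print(taylor_approx(derivs, 0.0, 0.5))  # close to math.sin(0.5)
```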
Taylor Series Visualization
Taylor Series Visualization (2)
f(x) in the neighborhood of a point x₀ can be approximated using the series
f(x) = Σₙ₌₀^∞ f⁽ⁿ⁾(x₀)/n! (x − x₀)ⁿ
[Hand-drawn sketch: notice that c(w*) < c(wt)]
Second-Order Gradient Descent
1. Approximate the target function with a second-order Taylor series around the current guess wt:
   ĉ(w) = c(wt) + c′(wt)(w − wt) + c′′(wt)/2 (w − wt)²
2. Find the stationary point of the approximation:
   wt+1 ← wt − c′(wt)/c′′(wt)
3. If the stationary point of the approximation is a (good enough) stationary point of the objective, then stop. Else, go to 1.

Derivation of step 2 (writing a for wt):
0 = d/dw [ c(a) + c′(a)(w − a) + c′′(a)/2 (w − a)² ]
  = c′(a) + 2 · c′′(a)/2 (w − a)
  = c′(a) + c′′(a)(w − a)
⟺ −c′(a) = c′′(a)(w − a)
⟺ (w − a) = −c′(a)/c′′(a)
⟺ w = a − c′(a)/c′′(a)
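The three steps above can be sketched in code; this is a minimal 1D illustration with my own names and stopping rule, not the slides' implementation:

```python
def second_order_gd(c_prime, c_double_prime, w0, tol=1e-8, max_iter=100):
    # Second-order gradient descent in 1D: repeatedly jump to the
    # stationary point of the local quadratic (Taylor) model.
    w = w0
    for _ in range(max_iter):
        w = w - c_prime(w) / c_double_prime(w)
        if abs(c_prime(w)) < tol:  # good enough stationary point
            break
    return w

# Minimize c(w) = (w - 2)^2 + (w - 3)^2, so c'(w) = 4w - 10, c''(w) = 4.
w_star = second_order_gd(lambda w: 4 * w - 10, lambda w: 4.0, w0=0.0)
print(w_star)  # 2.5, reached in a single step since c is quadratic
```

Because the objective is exactly quadratic, the second-order model is exact and the method lands on the minimum in one update.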
(First-Order) Gradient Descent
• We can run second-order GD whenever we have access to both the first and second derivatives of the target function
• Often we want to use only the first derivative
  • Not obvious yet why, but for the multivariate case second-order is computationally intensive
• First-order gradient descent: replace the second derivative with a constant 1/η (where η is the step size) in the approximation:
  ĉ(w) = c(wt) + c′(wt)(w − wt) + c′′(wt)/2 (w − wt)²
  becomes
  ĉ(w) = c(wt) + c′(wt)(w − wt) + 1/(2η) (w − wt)²
• By exactly the same derivation as before:
  wt+1 ← wt − η c′(wt)
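The first-order update can be sketched the same way; again a minimal 1D illustration with my own names:

```python
def gradient_descent(c_prime, w0, eta=0.1, tol=1e-8, max_iter=10_000):
    # First-order gradient descent: w_{t+1} = w_t - eta * c'(w_t).
    w = w0
    for _ in range(max_iter):
        g = c_prime(w)
        if abs(g) < tol:
            break
        w = w - eta * g
    return w

# Same objective as before: c(w) = (w - 2)^2 + (w - 3)^2, c'(w) = 4w - 10.
w_star = gradient_descent(lambda w: 4 * w - 10, w0=0.0, eta=0.1)
print(w_star)  # converges to roughly 2.5
```

Unlike the second-order method, this takes many small steps rather than one exact jump, since 1/η is only a crude stand-in for the true curvature.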
1st and 2nd order
wt+1 ← wt − η c′(wt)    OR    wt+1 ← wt − c′(wt)/c′′(wt)
Definition:
The partial derivative ∂f/∂xi(x1, …, xd) of a function f(x1, …, xd) at (x1, …, xd) with respect to xi is g′(xi), where g(y) = f(x1, …, xi−1, y, xi+1, …, xd) is f viewed as a function of the i-th coordinate alone.
Definition:
The gradient ∇f(x) of a function f : ℝᵈ → ℝ at x ∈ ℝᵈ is the vector of all the partial derivatives of f at x:
∇f(x) = ( ∂f/∂x1(x), ∂f/∂x2(x), …, ∂f/∂xd(x) )ᵀ
Multivariate Gradient Descent
First-order gradient descent for multivariate functions c : ℝᵈ → ℝ is just:
wt+1 ← wt − η ∇c(wt)
i.e., for each component j:
wt+1,j = wt,j − η ∂c/∂wj(wt)
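A sketch of the multivariate update using plain Python lists (names and stopping rule are my own; the gradient below is the hand-computed gradient of c(w) = (w1 − 2)² + (w2 − 3)²):

```python
def multivariate_gd(grad, w0, eta=0.1, tol=1e-8, max_iter=10_000):
    # First-order gradient descent for c: R^d -> R,
    # applying w_{t+1} = w_t - eta * grad c(w_t) componentwise.
    w = list(w0)
    for _ in range(max_iter):
        g = grad(w)
        if sum(gi * gi for gi in g) ** 0.5 < tol:  # ||grad|| small: stop
            break
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# c(w) = (w1 - 2)^2 + (w2 - 3)^2, so grad c(w) = [2(w1 - 2), 2(w2 - 3)]
grad = lambda w: [2 * (w[0] - 2), 2 * (w[1] - 3)]
print(multivariate_gd(grad, [0.0, 0.0]))  # converges to roughly [2, 3]
```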
Extending to stepsize per timestep
First-order gradient descent for multivariate functions c : ℝᵈ → ℝ is just:
wt+1 ← wt − ηt ∇c(wt)
• If the step size is too small, gradient descent will "work", but take forever
• Too big, and we can overshoot the optimum
• There are some heuristics that we can use to adaptively guess good values for ηt
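The too-small / too-big behavior is easy to see on c(w) = w², whose minimum is at 0; this small experiment (my own construction) runs a fixed number of steps at three step sizes:

```python
def gd_path(c_prime, w0, eta, steps=20):
    # Run a fixed number of gradient-descent steps, recording iterates.
    w, path = w0, [w0]
    for _ in range(steps):
        w = w - eta * c_prime(w)
        path.append(w)
    return path

cp = lambda w: 2 * w  # derivative of c(w) = w^2
print(gd_path(cp, 1.0, eta=0.01)[-1])  # tiny step: still far from 0
print(gd_path(cp, 1.0, eta=0.45)[-1])  # good step: essentially at 0
print(gd_path(cp, 1.0, eta=1.1)[-1])   # too big: overshoots and diverges
```

Each update multiplies w by (1 − 2η), so η = 0.01 shrinks slowly, η = 0.45 shrinks fast, and η = 1.1 gives a factor of −1.2 whose magnitude exceeds 1, hence divergence.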
[Hand-drawn sketch: notice that c(w*) < c(wt)]
• Why only a scalar stepsize?
• Or can we use a different stepsize per dimension? And why would we?
[Hand-drawn contour sketch over weights w₁ and w₂: stepsize η₁ should be small, stepsize η₂ should be big]
Now what if we have constraints?
• For this course, we almost always only deal with unconstrained problems
Visualizing the effect of constraints
Summary
1. Maximizing c(w) is the same as minimizing −c(w):