
CMPUT 267: Basics of Machine Learning

Optimization
Textbook §4.1-4.4

September 20, 2022
Announcements

• Assignment 1 due this Friday.


• Quiz 1 will be on eClass (remote), open from today at 2pm to Wednesday
2pm. You will have 30 minutes to complete it.
• TA Office hours moved to CAB 313 now.
Outline

1. Recap

2. Optimization by Gradient Descent

3. Multivariate Gradient Descent

4. Adaptive Step Sizes

5. Optimization Properties
Recap: Estimators

• An estimator is a random variable representing a procedure for estimating the value of an unobserved quantity based on observed data
• Concentration inequalities let us bound the probability of a given estimator being at least ϵ away from the estimated quantity
• An estimator is consistent if it converges in probability to the estimated quantity
Recap: Sample Complexity

• Sample complexity is the number of samples needed to attain a desired error bound ϵ with a desired probability 1 − δ
• The mean squared error of an estimator decomposes into bias (squared) and variance
• A biased estimator can have lower error than an unbiased estimator
• We bias the estimator based on some prior information
• But this only helps if the prior information is correct
• We cannot reduce error by adding arbitrary bias
Optimization

We often want to find the argument w* that minimizes an objective function c:

w* = arg min_w c(w)

Example: Using linear regression to fit a dataset {(xi, yi)}_{i=1}^n

• Estimate the targets by ŷ = f(x) = w0 + w1x
• Each vector w specifies a particular f
• Objective is the total error c(w) = Σ_{i=1}^n (f(xi) − yi)²

[Figure: data points (x1, y1) and (x2, y2) with a fitted line f(x); the error e1 = f(x1) − y1 is marked at x1, with y on the vertical axis and x on the horizontal axis]
Exercise: Making your own optimization algorithm

• Imagine I told you that you need to find

w* = arg min_{w∈ℝᵈ} c(w)

• Pretend you have never heard of gradient descent. What algorithm might you design to find this?

• Now what if I told you that w ∈ 𝒲 = {1,2,3,...,1000}. Now how would you solve

w* = arg min_{w∈𝒲} c(w)

(See the sketch below for one answer to the discrete case.)
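One answer for the discrete case is exhaustive search: evaluate c at every candidate in 𝒲 and keep the best. A minimal Python sketch, using a hypothetical objective c purely for illustration:

```python
# Exhaustive search over a finite candidate set W = {1, 2, ..., 1000}.

def c(w):
    # Hypothetical objective, just so the sketch runs; any function works.
    return (w - 4) ** 2

W = range(1, 1001)

# Evaluate c at every candidate and keep the minimizer.
w_star = min(W, key=c)
print(w_star, c(w_star))  # 4 0
```

This works for any c but needs |𝒲| evaluations, which is exactly why the continuous case calls for something smarter.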
Optimization Properties

1. Maximizing c(w) is the same as minimizing −c(w):

arg max_w c(w) = arg min_w −c(w)

2. Equivalence under constant shifts: Adding, subtracting, or multiplying by a positive constant does not change the minimizer of a function:

arg min_w c(w) = arg min_w c(w)+k = arg min_w c(w)−k = arg min_w kc(w)  ∀k ∈ ℝ⁺
Example

arg min_{w∈ℝ} (w − 2)²
= arg min_{w∈ℝ} 2(w − 2)²
= arg min_{w∈ℝ} (w − 2)² + 1
= arg max_{w∈ℝ} −(w − 2)²
= 2

[Figure: plots of (w − 2)², 2(w − 2)², (w − 2)² + 1, and −(w − 2)², all with their optimum at w = 2]
Stationary Points

• Every minimum of an everywhere-differentiable function c(w) must occur at a stationary point: a point at which c′(w) = 0
✴ Question: What is the exception?

• However, not every stationary point is a minimum

• Every stationary point is either:
• A local minimum
• A local maximum
• A saddlepoint

• The global minimum is either a local minimum or a boundary point

[Figure: a surface with local minima, a saddlepoint, and the global minimum labelled]

Let's assume for now that w is unconstrained (i.e., w ∈ ℝ rather than w ≥ 0 or w ∈ [0,1])

Convex functions

f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)  for all x1, x2 and all t ∈ [0,1]

Convex = shaped like a bowl

Concave = shaped like an upside-down bowl

[Figure: a convex function with a chord lying above the graph, from Wikipedia]
Identifying the type of the stationary point

• If the function is curved upwards (convex) locally, then it is a local minimum
• If the function is curved downwards (concave) locally, then it is a local maximum
• If the function is flat locally, then it might be a saddlepoint, but could also be a local min or local max
• Locally, we cannot distinguish between a local min and the global min (that is a global property of the surface)

[Figure: the same surface as before, with local minima, a saddlepoint, and the global minimum labelled]

Second derivative reflects curvature
A corresponding definition is a concave function, which is precisely the opposite: all points lie above the line. For any convex function c, the negative of that function −c is a concave function.

Second derivative test

The second derivative test tells us locally if the stationary point is a local minimum, local maximum or if it is inconclusive. Namely, the test is
1. If c′′(w0) > 0 then w0 is a local minimum.
2. If c′′(w0) < 0 then w0 is a local maximum.
3. If c′′(w0) = 0 then the test is inconclusive: we cannot say which type of stationary point we have and it could be any of the three.

To understand this test, notice that the second derivative reflects the local curvature of the function. It tells us how the derivative is changing. If the slope of the derivative c′(w0) is positive at w0, namely c′′(w0) > 0, then we know that the derivative is increasing, and vice versa.

Let us consider an example to understand this better. Consider a sin curve sin(w) and the point halfway between the bottom and top of the hill. At one of these in-between points, say w = 0, the derivative is maximally positive: it is cos(0) = +1. As we increase w, the derivative starts to decrease until it is zero at the top of the hill, at w = π/2. Then it flips ...
Testing optimality without the second derivative test

Convex functions have a global minimum at every stationary point:

c is convex ⟺ c(tw1 + (1 − t)w2) ≤ tc(w1) + (1 − t)c(w2)  for all w1, w2 and all t ∈ [0,1]

Procedure
• Find a stationary point, namely w0 such that c′(w0) = 0
• Sometimes we can do this analytically (a closed-form solution, namely an explicit formula for w0)

• Reason about whether it is optimal
• Check if your function is convex
• If you have only one stationary point and it is a local minimum, then it is a global minimum
• Otherwise, if the second derivative test says it is a local min, we can only conclude that it is a local min
Exercise

• Find the solution to the optimization problem min_{w∈ℝ} (w − 2)² + (w − 3)²

• Recall that the procedure is:

• 1. Find a stationary point, namely w0 such that c′(w0) = 0

• 2. Do the second derivative test (or reason about whether this function is convex)
Solution

• c(w) = (w − 2)² + (w − 3)²
• c′(w) = 2(w − 2) + 2(w − 3) = 4w − 10
• c′′(w) = 4
• c′(w0) = 4w0 − 10 = 0 → w0 = 10/4 = 2.5
• c′′(w0) = 4 > 0, so it is a local min. There is only one stationary point, so it is a global min (see the check below).
Exercise: Prove equivalence under constant shifts

Equivalence under constant shifts: Adding, subtracting, or multiplying by a positive constant does not change the minimizer of a function:

arg min_w c(w) = arg min_w c(w)+k = arg min_w c(w)−k = arg min_w kc(w)  ∀k ∈ ℝ⁺

Show that all of these have the same set of stationary points, namely points w where c′(w) = 0
Numerical Optimization

• So a simple recipe for optimizing a function is to find its stationary points; one of those must be the minimum (as long as the domain is unbounded)
• But we will almost never be able to analytically compute the minimum of the functions that we want to optimize
✴ (Linear regression is an important exception)
• Instead, we must try to find the minimum numerically
• Main techniques: First-order and second-order gradient descent
Taylor Series

Definition: A Taylor series is a way of approximating a function c in a small neighbourhood around a point a:

c(w) ≈ c(a) + c′(a)(w − a) + (c′′(a)/2)(w − a)² + ⋯ + (c⁽ᵏ⁾(a)/k!)(w − a)ᵏ
     = c(a) + Σ_{i=1}^k (c⁽ⁱ⁾(a)/i!)(w − a)ⁱ
Taylor Series Visualization

A function f(x) in the neighborhood of a point x0 can be approximated using the series

f(x) = Σ_{n=0}^∞ (f⁽ⁿ⁾(x0)/n!)(x − x0)ⁿ,

where f⁽ⁿ⁾(x0) is the n-th derivative of f(x) evaluated at the point x0, and f is assumed to be infinitely differentiable. For practical reasons, we often approximate the function using the first three terms of the series as

f(x) ≈ f(x0) + (x − x0)f′(x0) + ((x − x0)²/2) f′′(x0).

The minimum of this approximation can be found by finding the first derivative and setting it to zero (technically, one should check the second derivative as well).

[Figure: approximating the sin function at the point x0 = 0 with Taylor polynomials of degree 1, 3, 5, 7, 9, 11 and 13. (How can you tell?)]
Taylor Series

Definition: A Taylor series is a way of approximating a function c in a small neighbourhood around a point a:

c(w) ≈ c(a) + c′(a)(w − a) + (c′′(a)/2)(w − a)² + ⋯ + (c⁽ᵏ⁾(a)/k!)(w − a)ᵏ
     = c(a) + Σ_{i=1}^k (c⁽ⁱ⁾(a)/i!)(w − a)ⁱ

• Intuition: Following the tangent line of the function approximates how it changes
• i.e., following a function with the same first derivative
• Following a function with the same first and second derivatives is a better approximation; with the same first, second, and third derivatives is even better; etc.
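To see the "more derivatives, better approximation" intuition concretely, here is a small Python sketch that approximates sin around a = 0 with Taylor polynomials of increasing degree, mirroring the figure above:

```python
import math

# k-th order Taylor approximation of sin around a = 0.
# The derivatives of sin at 0 cycle through 0, 1, 0, -1.
def sin_taylor(x, k):
    derivs = [0.0, 1.0, 0.0, -1.0]  # sin(0), cos(0), -sin(0), -cos(0)
    return sum(derivs[i % 4] / math.factorial(i) * x ** i
               for i in range(k + 1))

x = 1.0
for k in (1, 3, 5, 7):
    err = abs(sin_taylor(x, k) - math.sin(x))
    print(k, err)  # the error shrinks as the degree k grows
```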
Second-Order Gradient Descent (Newton-Raphson Method)

1. Approximate the target function with a second-order Taylor series around the current guess wt:

ĉ(w) = c(wt) + c′(wt)(w − wt) + (c′′(wt)/2)(w − wt)²

2. Find the stationary point of the approximation:

wt+1 ← wt − c′(wt)/c′′(wt)

[Figure: hand-drawn sketch of c and its quadratic approximation at wt; notice that the new guess satisfies c(wt+1) < c(wt)]
Second-Order Gradient Descent

1. Approximate the target function with a second-order Taylor series around the current guess wt:

ĉ(w) = c(wt) + c′(wt)(w − wt) + (c′′(wt)/2)(w − wt)²

2. Find the stationary point of the approximation:

0 = d/dw [c(a) + c′(a)(w − a) + (c′′(a)/2)(w − a)²]
  = c′(a) + c′′(a)(w − a)
⟺ −c′(a) = c′′(a)(w − a)
⟺ (w − a) = −c′(a)/c′′(a)
⟺ w = a − c′(a)/c′′(a)

wt+1 ← wt − c′(wt)/c′′(wt)

3. If the stationary point of the approximation is a (good enough) stationary point of the objective, then stop. Else, goto 1.
(First-Order) Gradient Descent

• We can run second-order GD whenever we have access to both the first and second derivatives of the target function
• Often we want to only use the first derivative
• Not obvious yet why, but for the multivariate case second-order is computationally intensive

First-order gradient descent: Replace the second derivative with a constant 1/η (where η is the step size) in the approximation:

ĉ(w) = c(wt) + c′(wt)(w − wt) + (c′′(wt)/2)(w − wt)²
ĉ(w) = c(wt) + c′(wt)(w − wt) + (1/(2η))(w − wt)²

• By exactly the same derivation as before:

wt+1 ← wt − ηc′(wt)
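The same example with first-order gradient descent, where the curvature is replaced by the constant 1/η (again a sketch, with η picked by hand):

```python
# First-order gradient descent: w <- w - eta * c'(w).

def c_prime(w):
    return 4 * w - 10   # derivative of (w - 2)^2 + (w - 3)^2

eta = 0.1               # fixed step size; too large a value diverges
w = 0.0
for _ in range(100):
    w = w - eta * c_prime(w)
print(w)  # close to 2.5
```

With η = 0.1 the iterates contract toward 2.5; with, say, η = 0.6 they diverge, which previews why step-size choice matters.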
1st and 2nd order

[Figure: a 2nd-order update, where the step is scaled by the curvature, vs a 1st-order update, where the distance is controlled by the stepsize]
Partial Derivatives

• So far: Optimizing a univariate function c : ℝ → ℝ
• But actually: Optimizing a multivariate function c : ℝᵈ → ℝ
• d is typically huge (d ≫ 10,000 is not uncommon)
• The first derivative of a multivariate function is a vector of partial derivatives

Definition:
The partial derivative ∂f/∂xi (x1, …, xd) of a function f(x1, …, xd) at x1, …, xd with respect to xi is g′(xi), where

g(y) = f(x1, …, xi−1, y, xi+1, …, xd)
Example

• c(w1, w2) = (2w1 + 4w2 − 7)²

• ∂c/∂w1 (w1, w2) = 4(2w1 + 4w2 − 7)

• Then we query at a particular point, e.g., (w1, w2) = (1, −1), giving

∂c/∂w1 (1, −1) = 4(2 − 4 − 7) = −36

• Equivalently, let f(w1) = c(w1, −1) for this fixed w2.
Then f′(w1) = ∂c/∂w1 (w1, −1), i.e., f′(1) = ∂c/∂w1 (1, −1) = −36
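One way to double-check a partial derivative like this is a finite-difference approximation, holding the other coordinate fixed (a sketch; the step h is just a small constant we pick):

```python
# Finite-difference check of dc/dw1 at (1, -1).

def c(w1, w2):
    return (2 * w1 + 4 * w2 - 7) ** 2

h = 1e-6
w1, w2 = 1.0, -1.0
numeric = (c(w1 + h, w2) - c(w1 - h, w2)) / (2 * h)
analytic = 4 * (2 * w1 + 4 * w2 - 7)
print(numeric, analytic)  # both approximately -36
```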
Gradients

The multivariate analog to a first derivative is called a gradient.

Definition:
The gradient ∇f(x) of a function f : ℝᵈ → ℝ at x ∈ ℝᵈ is a vector of all the partial derivatives of f at x:

∇f(x) = [∂f/∂x1 (x), ∂f/∂x2 (x), …, ∂f/∂xd (x)]ᵀ
Multivariate Gradient Descent

First-order gradient descent for multivariate functions c : ℝᵈ → ℝ is just:

wt+1 ← wt − η∇c(wt)

or, written out componentwise:

wt+1,j = wt,j − η ∂c/∂wj (wt)  for j = 1, …, d
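A minimal multivariate sketch in Python with NumPy, reusing the two-dimensional example c(w1, w2) = (2w1 + 4w2 − 7)² from the partial-derivatives slide (the step size and iteration count are assumptions):

```python
import numpy as np

def grad_c(w):
    # Gradient of c(w) = (2*w[0] + 4*w[1] - 7)^2: the vector of partials.
    r = 2 * w[0] + 4 * w[1] - 7
    return np.array([4 * r, 8 * r])

eta = 0.01
w = np.zeros(2)
for _ in range(1000):
    w = w - eta * grad_c(w)

print(w, (2 * w[0] + 4 * w[1] - 7) ** 2)  # the objective is driven near 0
```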
Extending to stepsize per timestep

First-order gradient descent for multivariate functions c : ℝᵈ → ℝ is just:

wt+1 ← wt − ηt∇c(wt)

• Notice the t subscript on η

• We can choose a different ηt for each iteration
• Indeed, for univariate functions, Newton-Raphson can be understood as first-order gradient descent that chooses a step size of ηt = 1/c′′(wt) at each iteration.
• Choosing a good step size is crucial to efficiently using first-order gradient descent
Adaptive Step Sizes

[Figure: hand-drawn sketches of gradient descent with a too-small step size (slow progress) and a too-large step size (overshooting)]

• If the step size is too small, gradient descent will "work", but take forever
• Too big, and we can overshoot the optimum
• There are some heuristics that we can use to adaptively guess good values for ηt

• Ideally, we would choose ηt = arg min_{η∈ℝ⁺} c(wt − η∇c(wt))

• But that's another optimization!
Line Search

A simple heuristic: line search

1. Try some largest-reasonable step size ηt(0) = ηmax

2. Is c(wt − ηt(s)∇c(wt)) < c(wt)?
If yes, wt+1 ← wt − ηt(s)∇c(wt)

3. Otherwise, try ηt(s+1) = τηt(s) (for τ < 1) and goto 2
• Typically τ ∈ [0.5, 0.9]

Intuition:
• Big step sizes are better so long as they don't overshoot
• Try a big step size! If it increases the objective, we must have overshot, so try a smaller one.
• Keep trying smaller ones until you decrease the objective; then start iteration t + 1 from ηmax again.
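A sketch of this heuristic in Python (ηmax, τ, and the stopping guard are assumptions; the slides only specify the shrink-until-decrease loop):

```python
import numpy as np

def line_search_step(c, grad_c, w, eta_max=1.0, tau=0.7):
    # Shrink the step size from eta_max until the objective decreases.
    g = grad_c(w)
    eta = eta_max
    while c(w - eta * g) >= c(w):
        eta = tau * eta
        if eta < 1e-12:      # guard: gradient is (numerically) zero
            return w
    return w - eta * g

# Usage on the running example c(w) = (2*w1 + 4*w2 - 7)^2:
c = lambda w: (2 * w[0] + 4 * w[1] - 7) ** 2
grad_c = lambda w: np.array([4.0, 8.0]) * (2 * w[0] + 4 * w[1] - 7)

w = np.zeros(2)
for _ in range(20):
    w = line_search_step(c, grad_c, w)
print(c(w))  # near 0
```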
Adaptive stepsize algorithms

• Stepsize selection is very important, and so there is a vast array of algorithms for adaptive stepsizes
• Line search is a bit onerous to use, and not common with something called stochastic gradient descent (which is what we will use later)
• We will see smarter stepsize algorithms then, and in your assignment
Do we have to use a scalar stepsize?

• Or can we use a different stepsize per dimension? And why would we?

[Figure: hand-drawn surface over weights w1 and w2; for fixed w2 the function changes quickly in w1, so stepsize η1 should be small, while for fixed w1 it changes slowly in w2, so stepsize η2 should be big]
Now what if we have constraints?

• For this course, we almost always deal with unconstrained problems

• When we do have constraints, we will only consider constraints like w ≥ 0 or w ∈ [a, b]

• Then the procedure is:
• 1. Find a stationary point
• 2. Verify that it is the only stationary point, and a local min according to the second derivative test
• 3. Additionally check if the boundary points have a smaller value (see the sketch below)
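A minimal sketch of this procedure (the interval [a, b] and the objective are assumptions, reused from the earlier exercise):

```python
# Minimize c over w in [a, b]: compare the interior stationary point
# (if feasible) against the boundary values.

def c(w):
    return (w - 2) ** 2 + (w - 3) ** 2

a, b = 0.0, 1.0   # hypothetical constraint w in [a, b]
w0 = 2.5          # stationary point from c'(w0) = 0; c''(w0) = 4 > 0

candidates = ([w0] if a <= w0 <= b else []) + [a, b]
w_star = min(candidates, key=c)
print(w_star, c(w_star))  # 1.0 5.0: w0 is infeasible, so the boundary wins
```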
Visualizing the effect of constraints

[Figure: hand-drawn sketch showing that constraining w can move the minimizer from the unconstrained stationary point to a boundary point]
Summary

1. Maximizing c(w) is the same as minimizing −c(w):

arg max_w c(w) = arg min_w −c(w)

2. Convex functions have a global minimum at every stationary point:

c is convex ⟺ c(tw1 + (1 − t)w2) ≤ tc(w1) + (1 − t)c(w2)  for all w1, w2 and all t ∈ [0,1]

3. Identifiability: Sometimes we want the actual global minimum; other times we want a good-enough minimizer (i.e., a local minimum might be OK).

4. Equivalence under constant shifts: Adding, subtracting, or multiplying by a positive constant does not change the minimizer of a function:

arg min_w c(w) = arg min_w c(w)+k = arg min_w c(w)−k = arg min_w kc(w)  ∀k ∈ ℝ⁺
Summary

• We often want to find the argument w* that minimizes an objective function c:

w* = arg min_w c(w)

• Every interior minimum is a stationary point, so check the stationary points
• Stationary points are usually identified numerically
• Typically, by gradient descent
• Choosing the step size is important for efficiency and correctness
• Common approach: Adaptive step size
• E.g., by line search
