
Subgradient Method

Ryan Tibshirani
Convex Optimization 10-725
Last time: gradient descent
Consider the problem
$$\min_x \; f(x)$$
for $f$ convex and differentiable, $\mathrm{dom}(f) = \mathbb{R}^n$. Gradient descent: choose initial $x^{(0)} \in \mathbb{R}^n$, repeat:
$$x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
Step sizes $t_k$ chosen to be fixed and small, or by backtracking line search

If $\nabla f$ is Lipschitz, gradient descent has convergence rate $O(1/\epsilon)$.


Downsides:
• Requires f differentiable — addressed this lecture
• Can be slow to converge — addressed next lecture

Subgradient method

Now consider $f$ convex, having $\mathrm{dom}(f) = \mathbb{R}^n$, but not necessarily differentiable

Subgradient method: like gradient descent, but replacing gradients with subgradients. Initialize $x^{(0)}$, repeat:
$$x^{(k)} = x^{(k-1)} - t_k \cdot g^{(k-1)}, \quad k = 1, 2, 3, \ldots$$
where $g^{(k-1)} \in \partial f(x^{(k-1)})$, any subgradient of $f$ at $x^{(k-1)}$

Subgradient method is not necessarily a descent method, thus we keep track of best iterate $x^{(k)}_{\mathrm{best}}$ among $x^{(0)}, \ldots, x^{(k)}$ so far, i.e.,
$$f(x^{(k)}_{\mathrm{best}}) = \min_{i=0,\ldots,k} f(x^{(i)})$$

Outline

Today:
• How to choose step sizes
• Convergence analysis
• Intersection of sets
• Projected subgradient method

Step size choices

• Fixed step sizes: $t_k = t$ for all $k = 1, 2, 3, \ldots$
• Diminishing step sizes: choose to meet conditions
$$\sum_{k=1}^{\infty} t_k^2 < \infty, \quad \sum_{k=1}^{\infty} t_k = \infty,$$
i.e., square summable but not summable. Important here that step sizes go to zero, but not too fast

There are several other options too, but the key difference from gradient descent: step sizes are pre-specified, not adaptively computed

Convergence analysis

Assume that $f$ convex, $\mathrm{dom}(f) = \mathbb{R}^n$, and also that $f$ is Lipschitz continuous with constant $G > 0$, i.e.,
$$|f(x) - f(y)| \le G \|x - y\|_2 \quad \text{for all } x, y$$

Theorem: For a fixed step size $t$, subgradient method satisfies
$$\lim_{k \to \infty} f(x^{(k)}_{\mathrm{best}}) \le f^\star + G^2 t / 2$$

Theorem: For diminishing step sizes, subgradient method satisfies
$$\lim_{k \to \infty} f(x^{(k)}_{\mathrm{best}}) = f^\star$$

Basic inequality

Can prove both results from same basic inequality. Key steps:
• Using definition of subgradient (this step is spelled out below the list),
$$\|x^{(k)} - x^\star\|_2^2 \le \|x^{(k-1)} - x^\star\|_2^2 - 2 t_k \big( f(x^{(k-1)}) - f(x^\star) \big) + t_k^2 \|g^{(k-1)}\|_2^2$$
• Iterating last inequality,
$$\|x^{(k)} - x^\star\|_2^2 \le \|x^{(0)} - x^\star\|_2^2 - 2 \sum_{i=1}^k t_i \big( f(x^{(i-1)}) - f(x^\star) \big) + \sum_{i=1}^k t_i^2 \|g^{(i-1)}\|_2^2$$
• Using $\|x^{(k)} - x^\star\|_2 \ge 0$, and letting $R = \|x^{(0)} - x^\star\|_2$,
$$0 \le R^2 - 2 \sum_{i=1}^k t_i \big( f(x^{(i-1)}) - f(x^\star) \big) + G^2 \sum_{i=1}^k t_i^2$$
• Introducing $f(x^{(k)}_{\mathrm{best}}) = \min_{i=0,\ldots,k} f(x^{(i)})$, and rearranging, we have the basic inequality
$$f(x^{(k)}_{\mathrm{best}}) - f(x^\star) \le \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}$$
For different step size choices, convergence results can be directly obtained from this bound, e.g., previous theorems follow
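To spell out the first step above: substituting the update $x^{(k)} = x^{(k-1)} - t_k g^{(k-1)}$, expanding the square, and applying the subgradient inequality $f(x^\star) \ge f(x^{(k-1)}) + (g^{(k-1)})^T (x^\star - x^{(k-1)})$ gives
$$\begin{aligned}
\|x^{(k)} - x^\star\|_2^2 &= \|x^{(k-1)} - x^\star\|_2^2 - 2 t_k \, (g^{(k-1)})^T (x^{(k-1)} - x^\star) + t_k^2 \|g^{(k-1)}\|_2^2 \\
&\le \|x^{(k-1)} - x^\star\|_2^2 - 2 t_k \big( f(x^{(k-1)}) - f(x^\star) \big) + t_k^2 \|g^{(k-1)}\|_2^2
\end{aligned}$$
The $G^2$ in the last bullet then comes from $\|g^{(i-1)}\|_2 \le G$, which follows from Lipschitz continuity of $f$.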

Convergence rate

The basic inequality tells us that after $k$ steps, we have
$$f(x^{(k)}_{\mathrm{best}}) - f(x^\star) \le \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}$$
With fixed step size $t$, this gives
$$f(x^{(k)}_{\mathrm{best}}) - f^\star \le \frac{R^2}{2kt} + \frac{G^2 t}{2}$$
For this to be $\le \epsilon$, let's make each term $\le \epsilon/2$. So we can choose $t = \epsilon/G^2$, and $k = R^2/t \cdot 1/\epsilon = R^2 G^2 / \epsilon^2$

That is, subgradient method has convergence rate $O(1/\epsilon^2)$ ... note that this is slower than $O(1/\epsilon)$ rate of gradient descent
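A quick numeric sanity check of this arithmetic (the values of $R$, $G$, $\epsilon$ below are made up for illustration):

```python
# With t = eps/G^2 fixed, the bound R^2/(2kt) + G^2 t/2 is about eps
# once k = R^2 G^2 / eps^2 -- note the 1/eps^2 dependence.
R, G, eps = 1.0, 10.0, 1e-3
t = eps / G**2                          # fixed step size: 1e-05
k = R**2 * G**2 / eps**2                # iterations needed: 1e+08
bound = R**2 / (2 * k * t) + G**2 * t / 2
print(f"t = {t:.1e}, k = {k:.1e}, bound = {bound:.6f}")  # bound ~ 0.001
```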

Example: regularized logistic regression
Given $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$ for $i = 1, \ldots, n$, the logistic regression loss is
$$f(\beta) = \sum_{i=1}^n \Big( -y_i x_i^T \beta + \log\big(1 + \exp(x_i^T \beta)\big) \Big)$$
This is a smooth and convex function with
$$\nabla f(\beta) = \sum_{i=1}^n \big( p_i(\beta) - y_i \big) x_i$$
where $p_i(\beta) = \exp(x_i^T \beta) / \big(1 + \exp(x_i^T \beta)\big)$, $i = 1, \ldots, n$. Consider the regularized problem:
$$\min_\beta \; f(\beta) + \lambda \cdot P(\beta)$$
where $P(\beta) = \|\beta\|_2^2$, ridge penalty; or $P(\beta) = \|\beta\|_1$, lasso penalty
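A runnable sketch of the lasso case (this is not the lecture's code; the synthetic data, $\lambda$, step sizes, and iteration count are illustrative assumptions, chosen to mirror the $n = 1000$, $p = 20$ example below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 1000, 20, 1.0
X = rng.standard_normal((n, p))
beta_true = np.concatenate([rng.standard_normal(5), np.zeros(p - 5)])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def obj(beta):
    z = X @ beta                                 # logistic loss + l1 penalty
    return np.sum(-y * z + np.logaddexp(0, z)) + lam * np.abs(beta).sum()

beta = np.zeros(p)
beta_best, f_best = beta.copy(), obj(beta)
for k in range(1, 201):
    probs = 1 / (1 + np.exp(-X @ beta))          # p_i(beta)
    g = X.T @ (probs - y) + lam * np.sign(beta)  # grad of loss + subgrad of penalty
    beta = beta - (0.001 / k) * g                # diminishing steps t_k = t0/k
    if obj(beta) < f_best:
        beta_best, f_best = beta.copy(), obj(beta)
```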

Ridge: use gradients; lasso: use subgradients. Example here has
n = 1000, p = 20:

[Figure: $f - f^\star$ versus iteration $k$ (log scale on the vertical axis) over $k = 0, \ldots, 200$, for gradient descent on the ridge problem with $t = 0.001$, and the subgradient method on the lasso problem with fixed step $t = 0.001$ and diminishing steps $t = 0.001/k$]

Step sizes hand-tuned to be favorable for each method (of course comparison is imperfect, but it reveals the convergence behaviors)
Polyak step sizes

Polyak step sizes: when the optimal value $f^\star$ is known, take
$$t_k = \frac{f(x^{(k-1)}) - f^\star}{\|g^{(k-1)}\|_2^2}, \quad k = 1, 2, 3, \ldots$$
Can be motivated from first step in subgradient proof:
$$\|x^{(k)} - x^\star\|_2^2 \le \|x^{(k-1)} - x^\star\|_2^2 - 2 t_k \big( f(x^{(k-1)}) - f(x^\star) \big) + t_k^2 \|g^{(k-1)}\|_2^2$$
Polyak step size minimizes the right-hand side

With Polyak step sizes, can show subgradient method converges to optimal value. Convergence rate is still $O(1/\epsilon^2)$
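A minimal sketch of the Polyak rule in code (assumes $f^\star$ is known; the $\ell_1$ objective with $f^\star = 0$ is an illustrative choice, not from the slides):

```python
import numpy as np

f = lambda x: np.abs(x).sum()          # example objective with known f_star = 0
subgrad = lambda x: np.sign(x)         # a valid subgradient of the l1 norm
f_star = 0.0

x = np.array([3.0, -1.0, 2.0])
for k in range(100):
    g = subgrad(x)
    if f(x) == f_star:                 # already optimal (for this f, also g = 0)
        break
    t = (f(x) - f_star) / (g @ g)      # Polyak step: minimizes the RHS above
    x = x - t * g
```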

Example: intersection of sets

Suppose we want to find $x^\star \in C_1 \cap \cdots \cap C_m$, i.e., find a point in intersection of closed, convex sets $C_1, \ldots, C_m$

First define
$$f_i(x) = \mathrm{dist}(x, C_i), \quad i = 1, \ldots, m, \qquad f(x) = \max_{i=1,\ldots,m} f_i(x)$$
and now solve
$$\min_x \; f(x)$$
Check: is this convex?

Note that $f^\star = 0 \iff x^\star \in C_1 \cap \cdots \cap C_m$

Recall the distance function $\mathrm{dist}(x, C) = \min_{y \in C} \|y - x\|_2$. Last time we computed its gradient
$$\nabla \mathrm{dist}(x, C) = \frac{x - P_C(x)}{\|x - P_C(x)\|_2}$$
where $P_C(x)$ is the projection of $x$ onto $C$

Also recall subgradient rule: if $f(x) = \max_{i=1,\ldots,m} f_i(x)$, then
$$\partial f(x) = \mathrm{conv}\Big( \bigcup_{i : f_i(x) = f(x)} \partial f_i(x) \Big)$$
So if $f_i(x) = f(x)$ and $g_i \in \partial f_i(x)$, then $g_i \in \partial f(x)$

Put these two facts together for intersection of sets problem, with $f_i(x) = \mathrm{dist}(x, C_i)$: if $C_i$ is farthest set from $x$ (so $f_i(x) = f(x)$), and
$$g_i = \nabla f_i(x) = \frac{x - P_{C_i}(x)}{\|x - P_{C_i}(x)\|_2}$$
then $g_i \in \partial f(x)$

Now apply subgradient method, with Polyak size $t_k = f(x^{(k-1)})$ (this is the Polyak step, since $f^\star = 0$ and $\|g_i\|_2 = 1$). At iteration $k$, with $C_i$ farthest from $x^{(k-1)}$, we perform update
$$x^{(k)} = x^{(k-1)} - f(x^{(k-1)}) \cdot \frac{x^{(k-1)} - P_{C_i}(x^{(k-1)})}{\|x^{(k-1)} - P_{C_i}(x^{(k-1)})\|_2} = P_{C_i}(x^{(k-1)})$$
where the last equality uses $f(x^{(k-1)}) = \|x^{(k-1)} - P_{C_i}(x^{(k-1)})\|_2$: the update simply projects onto the farthest set

For two sets, this is the famous alternating projections algorithm¹, i.e., just keep projecting back and forth

[Figure: alternating projections bouncing between two convex sets, from Boyd's lecture notes]

¹ von Neumann (1950), “Functional operators, volume II: The geometry of orthogonal spaces”
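For two sets this is easy to run. A self-contained sketch (the choice of sets, a unit ball and a halfspace, and all constants are illustrative assumptions):

```python
import numpy as np

# Alternating projections to find a point in C1 ∩ C2.
# C1 = unit Euclidean ball, C2 = halfspace {x : a^T x >= 1}.
a = np.array([1.0, 1.0])

def proj_ball(x):                       # P_{C1}: shrink to the unit ball
    nrm = np.linalg.norm(x)
    return x if nrm <= 1 else x / nrm

def proj_halfspace(x):                  # P_{C2}: move onto the boundary if outside
    slack = a @ x - 1
    return x if slack >= 0 else x - slack * a / (a @ a)

x = np.array([5.0, -3.0])
for _ in range(100):                    # just keep projecting back and forth
    x = proj_halfspace(proj_ball(x))
print(x)  # approaches a point in the (here nonempty) intersection
```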
Projected subgradient method

To optimize a convex function $f$ over a convex set $C$,
$$\min_x \; f(x) \quad \text{subject to} \quad x \in C$$
we can use the projected subgradient method. Just like the usual subgradient method, except we project onto $C$ at each iteration:
$$x^{(k)} = P_C\big( x^{(k-1)} - t_k \cdot g^{(k-1)} \big), \quad k = 1, 2, 3, \ldots$$
Assuming we can do this projection, we get the same convergence guarantees as the usual subgradient method, with the same step size choices
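A minimal sketch, assuming a set we can project onto (here the unit Euclidean ball) and an illustrative nonsmooth objective $f(x) = \|x - c\|_1$:

```python
import numpy as np

# Projected subgradient: minimize ||x - c||_1 over C = {x : ||x||_2 <= 1}.
# The target c and the step schedule are illustrative choices.
c = np.array([2.0, -2.0, 0.5])

def proj_C(x):                          # projection onto the unit ball
    nrm = np.linalg.norm(x)
    return x if nrm <= 1 else x / nrm

x = np.zeros(3)
x_best, f_best = x.copy(), np.abs(x - c).sum()
for k in range(1, 501):
    g = np.sign(x - c)                  # a subgradient of ||x - c||_1
    x = proj_C(x - (1.0 / k) * g)       # subgradient step, then project
    fx = np.abs(x - c).sum()
    if fx < f_best:
        x_best, f_best = x.copy(), fx
```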

What sets $C$ are easy to project onto? Lots, e.g.,
• Affine images: $\{Ax + b : x \in \mathbb{R}^n\}$
• Solution set of linear system: $\{x : Ax = b\}$
• Nonnegative orthant: $\mathbb{R}^n_+ = \{x : x \ge 0\}$
• Some norm balls: $\{x : \|x\|_p \le 1\}$ for $p = 1, 2, \infty$
• Some simple polyhedra and simple cones
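For concreteness, a few of these projections in closed form (hypothetical helpers, not from the slides; the affine case assumes $A$ has full row rank):

```python
import numpy as np

def proj_nonneg(x):                     # R^n_+ : clip negatives to zero
    return np.maximum(x, 0)

def proj_linf_ball(x):                  # {x : ||x||_inf <= 1} : clip coordinates
    return np.clip(x, -1, 1)

def proj_affine(x, A, b):               # {x : Ax = b}, A full row rank:
    # argmin ||y - x||_2 s.t. Ay = b  is  x - A^T (A A^T)^{-1} (A x - b)
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)
```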

Warning: it is easy to write down a seemingly simple set $C$ for which $P_C$ turns out to be very hard! E.g., it is generally hard to project onto an arbitrary polyhedron $C = \{x : Ax \le b\}$

Note: projected gradient descent works too, more next time ...

Can we do better?
Upside of the subgradient method: broad applicability. Downside: $O(1/\epsilon^2)$ convergence rate over problem class of convex, Lipschitz functions is really slow

Nonsmooth first-order methods: iterative methods updating $x^{(k)}$ in
$$x^{(0)} + \mathrm{span}\{g^{(0)}, g^{(1)}, \ldots, g^{(k-1)}\}$$
where subgradients $g^{(0)}, g^{(1)}, \ldots, g^{(k-1)}$ come from a weak oracle

Theorem (Nesterov): For any $k \le n - 1$ and starting point $x^{(0)}$, there is a function in the problem class such that any nonsmooth first-order method satisfies
$$f(x^{(k)}) - f^\star \ge \frac{RG}{2(1 + \sqrt{k+1})}$$

Improving on the subgradient method

In words, we cannot do better than the $O(1/\epsilon^2)$ rate of subgradient method (unless we go beyond nonsmooth first-order methods)

So instead of trying to improve across the board, we will focus on minimizing composite functions of the form
$$f(x) = g(x) + h(x)$$
where $g$ is convex and differentiable, $h$ is convex and nonsmooth but “simple”

For a lot of problems (i.e., functions $h$), we can recover the $O(1/\epsilon)$ rate of gradient descent with a simple algorithm, having important practical consequences

References and further reading

• S. Boyd, Lecture notes for EE 264B, Stanford University, Spring 2010-2011
• Y. Nesterov (1998), “Introductory lectures on convex optimization: a basic course”, Chapter 3
• B. Polyak (1987), “Introduction to optimization”, Chapter 5
• L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012
