Subgradient Method
Ryan Tibshirani
Convex Optimization 10-725
Last time: gradient descent
Consider the problem
$$\min_x \; f(x)$$
Subgradient method

Now consider $f$ convex, with $\mathrm{dom}(f) = \mathbb{R}^n$, but not necessarily differentiable. Subgradient method: like gradient descent, but replacing gradients with subgradients, i.e., initialize $x^{(0)}$, then repeat
$$x^{(k)} = x^{(k-1)} - t_k \cdot g^{(k-1)}, \quad k = 1, 2, 3, \dots$$
where $g^{(k-1)} \in \partial f(x^{(k-1)})$ is any subgradient of $f$ at $x^{(k-1)}$. Since this is not necessarily a descent method, we keep track of the best iterate $x_{\mathrm{best}}^{(k)}$ so far.
Outline
Today:
• How to choose step sizes
• Convergence analysis
• Intersection of sets
• Projected subgradient method
Step size choices
• Fixed step sizes: $t_k = t$ for all $k = 1, 2, 3, \dots$
• Diminishing step sizes: choose $t_k$ satisfying
$$\sum_{k=1}^\infty t_k^2 < \infty, \quad \sum_{k=1}^\infty t_k = \infty$$
i.e., square summable but not summable

There are several other options too, but the key difference to gradient descent is that step sizes are pre-specified, not adaptively computed
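As a sketch, the two step size rules can be compared on a small illustrative problem (not from the slides): minimizing $f(x) = \|x\|_1$, whose subgradient at $x$ is the coordinate-wise sign. The starting point and step size constants below are arbitrary choices.

```python
import numpy as np

def subgrad_method(x0, step, n_iter=500):
    """Subgradient method on f(x) = ||x||_1 (illustrative test problem).

    step(k) returns the pre-specified step size t_k for iteration k.
    This is not a descent method, so we track the best value seen.
    """
    x = np.array(x0, dtype=float)
    f_best = np.sum(np.abs(x))
    for k in range(1, n_iter + 1):
        g = np.sign(x)                       # a subgradient of ||.||_1 at x
        x = x - step(k) * g
        f_best = min(f_best, np.sum(np.abs(x)))
    return f_best

x0 = [3.0, -2.0, 1.5]
f_fixed = subgrad_method(x0, step=lambda k: 0.01)              # fixed: t_k = t
f_dimin = subgrad_method(x0, step=lambda k: 0.5 / np.sqrt(k))  # diminishing
```

With the fixed step size, iterates eventually just oscillate within a band of width $t$ around the minimizer; the diminishing rule shrinks that band over time.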
Convergence analysis
Assume $f$ convex, $\mathrm{dom}(f) = \mathbb{R}^n$, and $f$ Lipschitz continuous with constant $G > 0$, i.e., $|f(x) - f(y)| \le G \|x - y\|_2$ for all $x, y$ (equivalently, $\|g\|_2 \le G$ for all subgradients $g$).

Theorem: For fixed step sizes $t$, the subgradient method satisfies $\lim_{k \to \infty} f(x_{\mathrm{best}}^{(k)}) \le f^\star + G^2 t / 2$

Theorem: For diminishing step sizes, the subgradient method satisfies $\lim_{k \to \infty} f(x_{\mathrm{best}}^{(k)}) = f^\star$
Basic inequality
Can prove both results from the same basic inequality. Key steps:

• Using the definition of a subgradient,
$$\|x^{(k)} - x^\star\|_2^2 \le \|x^{(k-1)} - x^\star\|_2^2 - 2 t_k \big( f(x^{(k-1)}) - f(x^\star) \big) + t_k^2 \|g^{(k-1)}\|_2^2$$

• Iterating the last inequality,
$$\|x^{(k)} - x^\star\|_2^2 \le \|x^{(0)} - x^\star\|_2^2 - 2 \sum_{i=1}^k t_i \big( f(x^{(i-1)}) - f(x^\star) \big) + \sum_{i=1}^k t_i^2 \|g^{(i-1)}\|_2^2$$
• Using $\|x^{(k)} - x^\star\|_2 \ge 0$, letting $R = \|x^{(0)} - x^\star\|_2$, and letting $G$ bound the subgradient norms, $\|g^{(i)}\|_2 \le G$ for all $i$,
$$0 \le R^2 - 2 \sum_{i=1}^k t_i \big( f(x^{(i-1)}) - f(x^\star) \big) + G^2 \sum_{i=1}^k t_i^2$$
• Introducing $f(x_{\mathrm{best}}^{(k)}) = \min_{i=0,\dots,k} f(x^{(i)})$, and rearranging, we have the basic inequality
$$f(x_{\mathrm{best}}^{(k)}) - f(x^\star) \le \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}$$
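The basic inequality can be checked numerically. A sketch on an illustrative problem (not from the slides): $f(x) = \|x\|_1$ with $x^\star = 0$, $f^\star = 0$, and $G = \sqrt{n}$, since every subgradient here is a sign vector.

```python
import numpy as np

# Numerically check the basic inequality on f(x) = ||x||_1, where
# x* = 0, f* = 0, and G = sqrt(n) bounds ||g||_2 (g is a sign vector).
rng = np.random.default_rng(0)
n, k, t = 5, 200, 0.05                 # fixed step sizes t_i = t
x = 2.0 * rng.standard_normal(n)
R = np.linalg.norm(x)                  # R = ||x^(0) - x*||_2
G = np.sqrt(n)

f_best = np.sum(np.abs(x))
for _ in range(k):
    x = x - t * np.sign(x)             # subgradient step
    f_best = min(f_best, np.sum(np.abs(x)))

# right-hand side of the basic inequality, with t_i = t for all i
bound = (R**2 + G**2 * k * t**2) / (2 * k * t)
```

The inequality is a theorem, so `f_best` must land below `bound` for any starting point and step size.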
Convergence rate
The basic inequality
$$f(x_{\mathrm{best}}^{(k)}) - f(x^\star) \le \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}$$
with a fixed step size $t_i = t$ gives
$$f(x_{\mathrm{best}}^{(k)}) - f^\star \le \frac{R^2}{2kt} + \frac{G^2 t}{2}$$

For this to be $\le \epsilon$, let's make each term $\le \epsilon/2$. So we can choose $t = \epsilon/G^2$, and $k = R^2/t \cdot 1/\epsilon = R^2 G^2 / \epsilon^2$

That is, the subgradient method has convergence rate $O(1/\epsilon^2)$ ... note that this is slower than the $O(1/\epsilon)$ rate of gradient descent
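Plugging in illustrative numbers (my choices, $R = G = 1$ and $\epsilon = 0.01$) confirms the arithmetic: each term of the bound equals $\epsilon/2$.

```python
# Choosing the fixed step size from the accuracy target eps:
# t = eps/G^2 and k = R^2 G^2 / eps^2 make both bound terms <= eps/2.
R, G, eps = 1.0, 1.0, 0.01             # illustrative values
t = eps / G**2
k = int(R**2 * G**2 / eps**2)          # 10,000 iterations for eps = 0.01
bound = R**2 / (2 * k * t) + G**2 * t / 2
```

Note how the iteration count scales as $1/\epsilon^2$: halving the target accuracy quadruples the work.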
Example: regularized logistic regression
Given $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$ for $i = 1, \dots, n$, the logistic regression loss is
$$f(\beta) = \sum_{i=1}^n \Big( -y_i x_i^T \beta + \log\big(1 + \exp(x_i^T \beta)\big) \Big)$$
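As a sketch, the loss and its gradient can be written in NumPy (the data below are randomly generated for illustration). The loss is smooth, so its only subgradient is the gradient $X^T(\sigma(X\beta) - y)$; a finite-difference check guards against sign errors.

```python
import numpy as np

def logistic_loss(beta, X, y):
    """f(beta) = sum_i [ -y_i x_i^T beta + log(1 + exp(x_i^T beta)) ]"""
    z = X @ beta
    return float(np.sum(-y * z + np.log1p(np.exp(z))))

def logistic_grad(beta, X, y):
    """Gradient X^T (sigmoid(X beta) - y); since the loss is
    differentiable, the subdifferential is just {gradient}."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - y)

rng = np.random.default_rng(0)
n, p_dim = 50, 3                        # illustrative sizes
X = rng.standard_normal((n, p_dim))
y = (rng.random(n) < 0.5).astype(float)
beta = 0.1 * rng.standard_normal(p_dim)

# central finite differences as a sanity check on the gradient formula
h = 1e-6
g = logistic_grad(beta, X, y)
g_fd = np.array([(logistic_loss(beta + h * e, X, y)
                  - logistic_loss(beta - h * e, X, y)) / (2 * h)
                 for e in np.eye(p_dim)])
```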
Ridge: use gradients; lasso: use subgradients. Example here has
n = 1000, p = 20:
[Figure: objective gap $f - f^\star$ versus iteration $k$ for each penalty, comparing step sizes including $t = 0.001$ and $t_k = 0.001/k$.]
Polyak step size

When the optimal value $f^\star$ is known, take
$$t_k = \frac{f(x^{(k-1)}) - f^\star}{\|g^{(k-1)}\|_2^2}, \quad k = 1, 2, 3, \dots$$
This choice minimizes the right-hand side of the bound
$$\|x^{(k)} - x^\star\|_2^2 \le \|x^{(k-1)} - x^\star\|_2^2 - 2 t_k \big( f(x^{(k-1)}) - f(x^\star) \big) + t_k^2 \|g^{(k-1)}\|_2^2$$
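A sketch of the Polyak step size in action, on the illustrative problem $f(x) = \|x - a\|_1$ (my choice, picked because $f^\star = 0$ is known exactly):

```python
import numpy as np

# Polyak steps t_k = (f(x) - f*) / ||g||_2^2 on f(x) = ||x - a||_1,
# an illustrative problem where the optimal value f* = 0 is known.
a = np.array([1.0, -2.0, 0.5])
x = np.zeros_like(a)
f_star = 0.0
for _ in range(100):
    r = x - a
    f = np.sum(np.abs(r))
    if f == 0.0:                        # already at the minimizer
        break
    g = np.sign(r)                      # subgradient of ||. - a||_1 at x
    t = (f - f_star) / (g @ g)          # Polyak step size
    x = x - t * g
gap = float(np.sum(np.abs(x - a)))
```

On this problem the objective gap contracts geometrically, much faster than with pre-specified step sizes.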
Example: intersection of sets
Suppose we want to find a point $x^\star \in C_1 \cap \cdots \cap C_m$, the intersection of closed, convex sets $C_1, \dots, C_m$. First define
$$f_i(x) = \mathrm{dist}(x, C_i), \quad i = 1, \dots, m$$
$$f(x) = \max_{i=1,\dots,m} f_i(x)$$
and now solve $\min_x f(x)$. Note that $f^\star = 0 \iff x^\star \in C_1 \cap \cdots \cap C_m$
Recall the distance function $\mathrm{dist}(x, C) = \min_{y \in C} \|y - x\|_2$. Last time we computed its gradient (for $x \notin C$):
$$\nabla \mathrm{dist}(x, C) = \frac{x - P_C(x)}{\|x - P_C(x)\|_2}$$
where $P_C(x)$ is the projection of $x$ onto $C$
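The gradient formula can be sanity-checked numerically. A sketch with $C$ the unit $\ell_2$ ball (an assumed example, chosen because $P_C$ has a one-line closed form):

```python
import numpy as np

def proj_ball(x):
    """Projection onto the unit l2 ball: scale x down if ||x||_2 > 1."""
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

def dist_ball(x):
    """dist(x, C) = ||x - P_C(x)||_2 for C the unit l2 ball."""
    return float(np.linalg.norm(x - proj_ball(x)))

x = np.array([3.0, 4.0])                # ||x||_2 = 5, so dist(x, C) = 4
g = (x - proj_ball(x)) / np.linalg.norm(x - proj_ball(x))

# compare the formula against central finite differences
h = 1e-6
g_fd = np.array([(dist_ball(x + h * e) - dist_ball(x - h * e)) / (2 * h)
                 for e in np.eye(2)])
```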
Put these two facts together for the intersection of sets problem, with $f_i(x) = \mathrm{dist}(x, C_i)$: if $C_i$ is the farthest set from $x$ (so that $f_i(x) = f(x)$), and
$$g_i = \nabla f_i(x) = \frac{x - P_{C_i}(x)}{\|x - P_{C_i}(x)\|_2}$$
then $g_i \in \partial f(x)$
For two sets, this is the famous alternating projections algorithm¹, i.e., just keep projecting back and forth

¹ von Neumann (1950), "Functional operators, volume II: The geometry of orthogonal spaces"
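A minimal sketch of alternating projections, assuming two simple sets of my choosing: the hyperplane $C_1 = \{x : a^T x = b\}$ and the nonnegative orthant $C_2 = \mathbb{R}^n_+$, both with closed-form projections.

```python
import numpy as np

a = np.array([1.0, 1.0, 1.0])
b = 1.0

def proj_hyperplane(x):
    """Projection onto C1 = {x : a^T x = b}."""
    return x - ((a @ x - b) / (a @ a)) * a

def proj_orthant(x):
    """Projection onto C2 = R^n_+ (the nonnegative orthant)."""
    return np.maximum(x, 0.0)

# keep projecting back and forth
x = np.array([5.0, -3.0, 2.0])
for _ in range(200):
    x = proj_orthant(proj_hyperplane(x))

plane_gap = abs(float(a @ x - b))                  # violation of C1
orthant_gap = float(np.max(np.maximum(-x, 0.0)))   # violation of C2
```

The limit point satisfies both constraints, i.e., it lies in $C_1 \cap C_2$.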
Projected subgradient method
To minimize a convex function $f$ over a convex set $C$,
$$\min_x \; f(x) \;\; \text{subject to} \;\; x \in C$$
we can use the projected subgradient method. Just like the usual subgradient method, except we project onto $C$ at each iteration:
$$x^{(k)} = P_C\big( x^{(k-1)} - t_k \cdot g^{(k-1)} \big), \quad k = 1, 2, 3, \dots$$
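A hedged sketch of the projected subgradient method on an illustrative problem of my choosing, $\min_{x \ge 0} \|x - a\|_1$, whose exact solution $x_i = \max(a_i, 0)$ is known in closed form:

```python
import numpy as np

# Projected subgradient method for min ||x - a||_1 subject to x >= 0.
# The exact solution is x_i = max(a_i, 0), so f* = sum_{a_i < 0} |a_i|.
a = np.array([2.0, -1.0, 0.5, -3.0])
f_star = float(np.sum(np.abs(a[a < 0.0])))    # = 4.0 here

t = 0.01                                      # fixed step size
x = np.zeros_like(a)
f_best = float(np.sum(np.abs(x - a)))
for _ in range(2000):
    g = np.sign(x - a)                        # subgradient of ||. - a||_1
    x = np.maximum(x - t * g, 0.0)            # step, then project onto C
    f_best = min(f_best, float(np.sum(np.abs(x - a))))
```

The projection here is just a coordinate-wise clip, which is why the nonnegative orthant counts as an "easy" set.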
What sets $C$ are easy to project onto? Lots, e.g.,
• Affine images: $\{Ax + b : x \in \mathbb{R}^n\}$
• Solution set of linear system: $\{x : Ax = b\}$
• Nonnegative orthant: $\mathbb{R}^n_+ = \{x : x \ge 0\}$
• Some norm balls: $\{x : \|x\|_p \le 1\}$ for $p = 1, 2, \infty$
• Some simple polyhedra and simple cones

Note: projected gradient descent works too, more next time ...
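Closed-form projections for several of the sets above can be sketched as follows (function names are mine, not from the slides; the linear-system formula assumes $A$ has full row rank):

```python
import numpy as np

def proj_linear_system(x, A, b):
    """Projection onto {x : Ax = b} for A with full row rank:
    x - A^T (A A^T)^{-1} (A x - b)."""
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

def proj_orthant(x):
    """Projection onto the nonnegative orthant R^n_+."""
    return np.maximum(x, 0.0)

def proj_l2_ball(x):
    """Projection onto {x : ||x||_2 <= 1}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

def proj_linf_ball(x):
    """Projection onto {x : ||x||_inf <= 1}: clip each coordinate."""
    return np.clip(x, -1.0, 1.0)
```

(Projection onto the $\ell_1$ ball has no single formula but can be computed in $O(n \log n)$ time via sorting.)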
Can we do better?
Upside of the subgradient method: broad applicability. Downside: the $O(1/\epsilon^2)$ convergence rate over the problem class of convex, Lipschitz functions is really slow
Improving on the subgradient method
For a lot of problems (i.e., functions $h$), we can recover the $O(1/\epsilon)$ rate of gradient descent with a simple algorithm, having important practical consequences