
Subgradients

Ryan Tibshirani
Convex Optimization 10-725
Last time: gradient descent
Consider the problem

min_x f(x)

for f convex and differentiable, dom(f) = R^n. Gradient descent:


choose initial x^(0) ∈ R^n, repeat

x^(k) = x^(k−1) − tk · ∇f(x^(k−1)), k = 1, 2, 3, . . .

Step sizes tk chosen to be fixed and small, or by backtracking line
search

If ∇f is Lipschitz, gradient descent has convergence rate O(1/ε), i.e., it needs O(1/ε) iterations to reach ε-suboptimality.
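For concreteness, a minimal sketch of this iteration in Python with a fixed step size (the names grad_f, x0, and the stopping rule are illustrative assumptions, not part of the slides):

import numpy as np

def gradient_descent(grad_f, x0, t=0.1, max_iter=1000, tol=1e-8):
    # repeat x^(k) = x^(k-1) - t * grad f(x^(k-1)) with a fixed step size t
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:      # stop once the gradient is (nearly) zero
            break
        x = x - t * g
    return x

# example: minimize f(x) = (1/2)||x||_2^2, whose gradient is x
x_min = gradient_descent(lambda x: x, x0=[3.0, -4.0], t=0.5)    # converges to [0, 0]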


Downsides:
• Requires f differentiable
• Can be slow to converge

2
Outline

Today: crucial mathematical underpinnings!


• Subgradients
• Examples
• Properties
• Optimality characterizations

3
Subgradients

Recall that for convex and differentiable f ,

f(y) ≥ f(x) + ∇f(x)^T (y − x) for all x, y

That is, linear approximation always underestimates f

A subgradient of a convex function f at x is any g ∈ R^n such that

f(y) ≥ f(x) + g^T (y − x) for all y

• Always exists¹
• If f differentiable at x, then g = ∇f (x) uniquely
• Same definition works for nonconvex f (however, subgradients
need not exist)

¹ On the relative interior of dom(f)
4
Examples of subgradients

Consider f : R → R, f (x) = |x|

[Plot: f(x) = |x| over x ∈ [−2, 2]]

• For x ≠ 0, unique subgradient g = sign(x)
• For x = 0, subgradient g is any element of [−1, 1]
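A quick numerical check of the subgradient inequality for f(x) = |x| at x = 0 (a sketch; the grid of test points and the choices of g are arbitrary):

import numpy as np

f = np.abs                        # f(x) = |x|
ys = np.linspace(-2.0, 2.0, 401)  # test points

# any g in [-1, 1] satisfies f(y) >= f(0) + g*(y - 0) for all y
for g in (-1.0, -0.3, 0.0, 0.7, 1.0):
    assert np.all(f(ys) >= f(0.0) + g * ys)

# a slope outside [-1, 1] violates the inequality somewhere
assert np.any(f(ys) < f(0.0) + 1.5 * ys)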

5
Consider f : R^n → R, f(x) = ‖x‖_2

[Plot: the surface f(x) = ‖x‖_2 over (x1, x2)]

• For x ≠ 0, unique subgradient g = x/‖x‖_2
• For x = 0, subgradient g is any element of {z : ‖z‖_2 ≤ 1}
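A small sketch of these two cases in code (the tolerance used to detect x = 0 and the random choice at 0 are arbitrary):

import numpy as np

def subgrad_l2_norm(x, rng=np.random.default_rng(0)):
    # returns one subgradient of f(x) = ||x||_2
    nrm = np.linalg.norm(x)
    if nrm > 1e-12:
        return x / nrm                    # unique subgradient away from 0
    z = rng.standard_normal(len(x))       # at 0: any point of the unit ball works
    return z / max(np.linalg.norm(z), 1.0)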

6
Consider f : R^n → R, f(x) = ‖x‖_1

[Plot: the surface f(x) = ‖x‖_1 over (x1, x2)]

• For xi ≠ 0, unique ith component gi = sign(xi)
• For xi = 0, ith component gi is any element of [−1, 1]
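A componentwise sketch in code; choosing 0 for the zero components is just one valid option in [−1, 1]:

import numpy as np

def subgrad_l1_norm(x):
    # one subgradient of f(x) = ||x||_1: sign(xi) where xi != 0,
    # and 0 (an arbitrary element of [-1, 1]) where xi == 0
    return np.sign(x)

g = subgrad_l1_norm(np.array([1.5, 0.0, -2.0]))   # -> [ 1.,  0., -1.]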

7
Consider f(x) = max{f1(x), f2(x)}, for f1, f2 : R^n → R convex,
differentiable

[Plot: f(x) = max{f1(x), f2(x)} in one variable, with f1 and f2 crossing]

• For f1(x) > f2(x), unique subgradient g = ∇f1(x)
• For f2(x) > f1(x), unique subgradient g = ∇f2(x)
• For f1(x) = f2(x), subgradient g is any point on line segment
  between ∇f1(x) and ∇f2(x)
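A sketch for two concrete smooth functions (f1, f2, and the tie-breaking weight t are illustrative choices, not from the slides):

import numpy as np

a = np.array([1.0, -2.0])
f1,  f2  = (lambda x: x @ x), (lambda x: a @ x + 1.0)     # f1(x) = ||x||_2^2, f2 affine
df1, df2 = (lambda x: 2.0 * x), (lambda x: a)

def subgrad_max(x, t=0.5):
    # subgradient of f(x) = max{f1(x), f2(x)}
    if f1(x) > f2(x):
        return df1(x)
    if f2(x) > f1(x):
        return df2(x)
    return t * df1(x) + (1.0 - t) * df2(x)    # any point on the segment at a tie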

8
Subdifferential

Set of all subgradients of convex f is called the subdifferential:

∂f(x) = {g ∈ R^n : g is a subgradient of f at x}

• Nonempty (only for convex f)
• ∂f(x) is closed and convex (even for nonconvex f)
• If f is differentiable at x, then ∂f(x) = {∇f(x)}
• If ∂f(x) = {g}, then f is differentiable at x and ∇f(x) = g

9
Connection to convex geometry

Convex set C ⊆ R^n, consider indicator function I_C : R^n → R,

I_C(x) = I{x ∈ C} = 0 if x ∈ C, and ∞ if x ∉ C

For x ∈ C, ∂I_C(x) = N_C(x), the normal cone of C at x; recall

N_C(x) = {g ∈ R^n : g^T x ≥ g^T y for any y ∈ C}

Why? By definition of subgradient g,

I_C(y) ≥ I_C(x) + g^T (y − x) for all y


• For y ∉ C, I_C(y) = ∞, so the inequality holds trivially
• For y ∈ C, this means 0 ≥ g^T (y − x), i.e., g^T x ≥ g^T y
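A numerical sketch of this for C the unit ℓ2 ball (a hypothetical choice of C): at a boundary point x, any nonnegative multiple of x lies in N_C(x), which we check against sampled members of C:

import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.6, 0.8])      # boundary point of C = {z : ||z||_2 <= 1}
g = 3.0 * x                   # candidate normal vector: a nonnegative multiple of x

# sample many y in C and verify g^T x >= g^T y
ys = rng.standard_normal((1000, 2))
ys /= np.maximum(np.linalg.norm(ys, axis=1, keepdims=True), 1.0)   # map into the ball
assert np.all(g @ x >= ys @ g - 1e-12)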

10

[Figure: normal cones at points on the boundary of a convex set]

11
Subgradient calculus

Basic rules for convex functions:


• Scaling: ∂(af ) = a · ∂f provided a > 0
• Addition: ∂(f1 + f2 ) = ∂f1 + ∂f2
• Affine composition: if g(x) = f(Ax + b), then

  ∂g(x) = A^T ∂f(Ax + b)

  (see the code sketch after the norm rule below)

• Finite pointwise maximum: if f(x) = max_{i=1,...,m} fi(x), then

  ∂f(x) = conv( ∪_{i : fi(x) = f(x)} ∂fi(x) )

  the convex hull of the union of subdifferentials of active functions at x

12
• General composition: if

  f(x) = h(g(x)) = h(g1(x), . . . , gk(x))

  where g : R^n → R^k, h : R^k → R, f : R^n → R, h is convex
  and nondecreasing in each argument, g is convex, then

  ∂f(x) ⊆ {p1 q1 + · · · + pk qk : p ∈ ∂h(g(x)), qi ∈ ∂gi(x), i = 1, . . . , k}

• General pointwise maximum: if f(x) = max_{s∈S} fs(x), then

  ∂f(x) ⊇ cl( conv( ∪_{s : fs(x) = f(x)} ∂fs(x) ) )

  Under some regularity conditions (on S, fs), we get equality

13
• Norms: important special case. To each norm ‖·‖, there is a
  dual norm ‖·‖_∗ such that

  ‖x‖ = max_{‖z‖_∗ ≤ 1} z^T x

  (For example, ‖·‖_p and ‖·‖_q are dual when 1/p + 1/q = 1.)

  In fact, for f(x) = ‖x‖ (and fz(x) = z^T x), we get equality:

  ∂f(x) = cl( conv( ∪_{z : fz(x) = f(x)} ∂fz(x) ) )

  Note that ∂fz(x) = {z}. And if z1, z2 each achieve the max at
  x, which means that z1^T x = z2^T x = ‖x‖, then by linearity, so
  will tz1 + (1 − t)z2 for any t ∈ [0, 1]. Thus

  ∂f(x) = argmax_{‖z‖_∗ ≤ 1} z^T x
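As a concrete use of these rules, here is a sketch that combines the affine composition rule above with the ℓ1 subdifferential to produce one subgradient of f(x) = ‖Ax + b‖_1, then verifies the subgradient inequality at random points (A, b, and the test data are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
f = lambda x: np.abs(A @ x + b).sum()     # f(x) = ||Ax + b||_1

def subgrad(x):
    # affine composition: A^T v is a subgradient of f at x
    # whenever v is a subgradient of ||.||_1 at Ax + b
    v = np.sign(A @ x + b)                # one valid choice of v
    return A.T @ v

x = rng.standard_normal(3)
g = subgrad(x)
for _ in range(100):
    y = rng.standard_normal(3)
    assert f(y) >= f(x) + g @ (y - x) - 1e-9   # f(y) >= f(x) + g^T (y - x)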

14
Optimality condition

For any f (convex or not),

f(x⋆) = min_x f(x) ⇐⇒ 0 ∈ ∂f(x⋆)

That is, x⋆ is a minimizer if and only if 0 is a subgradient of f at
x⋆. This is called the subgradient optimality condition

Why? Easy: g = 0 being a subgradient means that for all y

f(y) ≥ f(x⋆) + 0^T (y − x⋆) = f(x⋆)

Note the implication for a convex and differentiable function f,
with ∂f(x) = {∇f(x)}: the condition reduces to ∇f(x⋆) = 0

15
Derivation of first-order optimality

Example of the power of subgradients: we can use what we have
learned so far to derive the first-order optimality condition. Recall

min_x f(x) subject to x ∈ C

is solved at x, for f convex and differentiable, if and only if

∇f(x)^T (y − x) ≥ 0 for all y ∈ C

Intuitively: this says that f is nondecreasing, to first order, as we
move from x toward any other point of C

How to prove it? First recast the problem as

min_x f(x) + I_C(x)

Now apply subgradient optimality: 0 ∈ ∂(f(x) + I_C(x))

16
Observe

0 ∈ ∂(f(x) + I_C(x))
⇐⇒ 0 ∈ {∇f(x)} + N_C(x)
⇐⇒ −∇f(x) ∈ N_C(x)
⇐⇒ −∇f(x)^T x ≥ −∇f(x)^T y for all y ∈ C
⇐⇒ ∇f(x)^T (y − x) ≥ 0 for all y ∈ C

as desired

Note: the condition 0 ∈ ∂f(x) + N_C(x) is a fully general condition
for optimality in convex problems. But it's not always easy to work
with (KKT conditions, later, are easier)
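A small numerical sketch of this condition: minimize f(x) = (1/2)‖x − c‖_2^2 over the unit ℓ2 ball, whose constrained minimizer is the projection of c onto the ball, and check ∇f(x⋆)^T (y − x⋆) ≥ 0 on sampled feasible points (the data c is an arbitrary illustration):

import numpy as np

rng = np.random.default_rng(0)
c = np.array([2.0, 1.0])                   # f(x) = (1/2)||x - c||_2^2, and ||c||_2 > 1
grad_f = lambda x: x - c

# C = unit l2 ball; the constrained minimizer is the projection of c onto C
x_star = c / np.linalg.norm(c)

# first-order optimality: grad f(x*)^T (y - x*) >= 0 for all y in C
ys = rng.standard_normal((1000, 2))
ys /= np.maximum(np.linalg.norm(ys, axis=1, keepdims=True), 1.0)
assert np.all((ys - x_star) @ grad_f(x_star) >= -1e-9)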

17
Example: lasso optimality conditions
Given y ∈ R^n, X ∈ R^(n×p), lasso problem can be parametrized as

min_β (1/2) ‖y − Xβ‖_2^2 + λ‖β‖_1
where λ ≥ 0. Subgradient optimality:
0 ∈ ∂( (1/2) ‖y − Xβ‖_2^2 + λ‖β‖_1 )
⇐⇒ 0 ∈ −X^T (y − Xβ) + λ ∂‖β‖_1
⇐⇒ X^T (y − Xβ) = λv

for some v ∈ ∂‖β‖_1, i.e.,

vi ∈ {1}       if βi > 0
     {−1}      if βi < 0      for i = 1, . . . , p
     [−1, 1]   if βi = 0

18
Write X1, . . . , Xp for columns of X. Then our condition reads:

Xi^T (y − Xβ) = λ · sign(βi)    if βi ≠ 0
|Xi^T (y − Xβ)| ≤ λ             if βi = 0

Note: subgradient optimality conditions don't lead to a closed-form
expression for a lasso solution ... however they do provide a way to
check lasso optimality

They are also helpful in understanding the lasso estimator; e.g., if
|Xi^T (y − Xβ)| < λ, then βi = 0 (used by screening rules, later?)
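A sketch of such an optimality check in code, given a candidate solution beta_hat (X, y, lam, and the tolerance tol are placeholders supplied by the user):

import numpy as np

def check_lasso_optimality(X, y, beta_hat, lam, tol=1e-6):
    # Xi^T (y - X beta) = lam * sign(beta_i)  if beta_i != 0
    # |Xi^T (y - X beta)| <= lam              if beta_i == 0
    r = X.T @ (y - X @ beta_hat)
    active = beta_hat != 0
    ok_active = np.all(np.abs(r[active] - lam * np.sign(beta_hat[active])) <= tol)
    ok_inactive = np.all(np.abs(r[~active]) <= lam + tol)
    return bool(ok_active and ok_inactive)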

19
Example: soft-thresholding
Simplified lasso problem with X = I:

min_β (1/2) ‖y − β‖_2^2 + λ‖β‖_1

This we can solve directly using subgradient optimality. Solution is
β = S_λ(y), where S_λ is the soft-thresholding operator:

[S_λ(y)]i = yi − λ    if yi > λ
            0         if −λ ≤ yi ≤ λ      for i = 1, . . . , n
            yi + λ    if yi < −λ

Check: from last slide, subgradient optimality conditions are

yi − βi = λ · sign(βi)    if βi ≠ 0
|yi − βi| ≤ λ             if βi = 0

20
Now plug in β = Sλ (y) and check these are satisfied:
• When yi > λ, βi = yi − λ > 0, so yi − βi = λ = λ · 1
• When yi < −λ, argument is similar
• When |yi | ≤ λ, βi = 0, and |yi − βi | = |yi | ≤ λ
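A sketch of the operator together with a direct check of these conditions (the test vector y and λ are arbitrary):

import numpy as np

def soft_threshold(y, lam):
    # [S_lam(y)]_i = yi - lam if yi > lam, 0 if |yi| <= lam, yi + lam if yi < -lam
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([3.0, 0.4, -1.2, -0.1])
lam = 0.5
beta = soft_threshold(y, lam)

nz = beta != 0
assert np.allclose(y[nz] - beta[nz], lam * np.sign(beta[nz]))   # active components
assert np.all(np.abs(y[~nz] - beta[~nz]) <= lam)                # zero components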

Soft-thresholding in one variable: [plot of S_λ(y) against y on [−1, 1]]

21
Example: distance to a convex set

Recall the distance function to a closed, convex set C:

dist(x, C) = min_{y∈C} ‖y − x‖_2

This is a convex function. What are its subgradients?

Write dist(x, C) = ‖x − P_C(x)‖_2, where P_C(x) is the projection of
x onto C. It turns out that when dist(x, C) > 0,

∂dist(x, C) = { (x − P_C(x)) / ‖x − P_C(x)‖_2 }

The subdifferential has only one element, so in fact dist(x, C) is
differentiable and this is its gradient
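A sketch for C taken to be the unit ℓ2 ball, which has a closed-form projection (the choice of C and the test point are illustrative):

import numpy as np

def project_unit_ball(x):
    # projection onto C = {z : ||z||_2 <= 1}
    return x / max(np.linalg.norm(x), 1.0)

def dist_and_grad(x):
    # dist(x, C) and its gradient (x - P_C(x)) / ||x - P_C(x)||_2, for x outside C
    u = project_unit_ball(x)
    d = np.linalg.norm(x - u)
    return d, (x - u) / d

d, g = dist_and_grad(np.array([3.0, 4.0]))   # d = 4.0, g = [0.6, 0.8]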

22
We will only show one direction, i.e., that

(x − P_C(x)) / ‖x − P_C(x)‖_2 ∈ ∂dist(x, C)

Write u = P_C(x). Then by first-order optimality conditions for a
projection,

(x − u)^T (y − u) ≤ 0 for all y ∈ C

Hence

C ⊆ H = {y : (x − u)^T (y − u) ≤ 0}

Claim:

dist(y, C) ≥ (x − u)^T (y − u) / ‖x − u‖_2 for all y

Check: first, for y ∈ H, the right-hand side is ≤ 0, while dist(y, C) ≥ 0

23
Now for y ∉ H, we have (x − u)^T (y − u) = ‖x − u‖_2 ‖y − u‖_2 cos θ,
where θ is the angle between x − u and y − u. Thus

(x − u)^T (y − u) / ‖x − u‖_2 = ‖y − u‖_2 cos θ = dist(y, H) ≤ dist(y, C)

as desired

Using the claim, we have for any y

dist(y, C) ≥ (x − u)^T (y − x + x − u) / ‖x − u‖_2

           = ‖x − u‖_2 + ((x − u) / ‖x − u‖_2)^T (y − x)

Hence g = (x − u)/‖x − u‖_2 is a subgradient of dist(x, C) at x

24
References and further reading

• S. Boyd, Lecture notes for EE 264B, Stanford University,
Spring 2010-2011
• R. T. Rockafellar (1970), “Convex analysis”, Chapters 23–25
• L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring
2011-2012

25
