
Lecture Notes on Convex Analysis and Iterative Algorithms
İlker Bayram
[email protected]
About These Notes
These are the lecture notes of a graduate course I offered in the Dept. of Electronics and Telecommunications Engineering at Istanbul Technical University. My goal was to get students acquainted with methods of convex analysis, to make them more comfortable in following arguments that appear in the recent signal processing literature, and to help them understand and analyze the proximal point algorithm, along with its many variants. In the first half of the course, convex analysis is introduced at a level suitable for graduate students in electrical engineering (i.e., assuming some familiarity with the notions of convex sets and convex functions from other courses). Several other algorithms are then derived from the proximal point algorithm, such as the Douglas-Rachford algorithm, ADMM, and some applications to saddle point problems. There are no references in this version. I hope to add some in the future.

İlker Bayram
December, 2018
Contents
1 Convex Sets
  1.1 Basic Definitions
  1.2 Operations Preserving Convexity of Sets
  1.3 Projections onto Closed Convex Sets
  1.4 Separation and Normal Cones
  1.5 Tangent and Normal Cones

2 Convex Functions
  2.1 Operations That Preserve the Convexity of Functions
  2.2 First Order Differentiation
  2.3 Second Order Differentiation

3 Conjugate Functions

4 Duality
  4.1 A General Discussion of Duality
  4.2 Lagrangian Duality
  4.3 Karush-Kuhn-Tucker (KKT) Conditions

5 Subdifferentials
  5.1 Motivation, Definition, Properties of Subdifferentials
  5.2 Connection with the KKT Conditions
  5.3 Monotonicity of the Subdifferential

6 Applications to Algorithms
  6.1 The Proximal Point Algorithm
  6.2 Firmly-Nonexpansive Operators
  6.3 The Dual PPA and the Augmented Lagrangian
  6.4 The Douglas-Rachford Algorithm
  6.5 Alternating Direction Method of Multipliers
  6.6 A Generalized Proximal Point Algorithm

1 Convex Sets
This first chapter introduces convex sets and discusses some of their properties.
Having a solid understanding of convex sets is very useful for convex analysis
of functions because we can and will regard a convex function as a special
representation of a convex set, namely its epigraph.

1.1 Basic Definitions


Definition 1. A set C ⊆ Rn is said to be convex if x ∈ C, x′ ∈ C implies that αx + (1 − α)x′ ∈ C for any α ∈ [0, 1]. 

Consider the sets below. Each pair (x, x′ ) we select in the set on the left defines a line segment which lies inside the set. However, this is not the case for the set on the right. Even though we are able to find line segments with endpoints inside the set (as in (x, x′ )), this is not true in general, as exemplified by (y, y′ ).

[Figure: a convex set (left) and a non-convex set (right), with pairs (x, x′ ) and (y, y′ ) marked.]

For the examples below, decide if the set is convex or not and prove whatever
you think is true.
Example 1 (Hyperplane). For given s ∈ Rn , r ∈ R, consider the set Hs,r = {x ∈ Rn : ⟨s, x⟩ = r}. Notice that this is a subspace for r = 0.
Example 2 (Affine Subspace). This is a set V ⊆ Rn such that if x ∈ V and x′ ∈ V , then αx + (1 − α)x′ ∈ V for all α ∈ R.

Example 3 (Half Space). For given s ∈ Rn , r ∈ R, consider the set H−s,r = {x ∈ Rn : ⟨s, x⟩ ≤ r}.
Example 4 (Cone). A cone K ⊆ Rn is a set such that if x ∈ K, then αx ∈ K for all α > 0. Note that a cone may be convex or non-convex. See below for an example of a convex (left) and a non-convex (right) cone.
Exercise 1. Let K be a cone. Show that K is convex if and only if x, y ∈ K
implies x + y ∈ K. 

1.2 Operations Preserving Convexity of Sets


Proposition 1 (Intersection of Convex Sets). Let C1 , C2 , . . . Ck be convex
sets. Show that C = ∩i Ci is convex.

Proof. Exercise!
Question 1. What happens if the intersection is empty? Is the empty set
convex? 

This simple result is useful for characterizing linear systems of equations or


inequalities.
Example 5. For a matrix A, the solution set of Ax = b is an intersection of
hyperplanes. Therefore it is convex.

For the example above, we can in fact say more, thanks to the following vari-
ation of Prop. 1.
Exercise 2. Show that the intersection of a finite collection of affine subspaces
is an affine subspace. 

Let us continue with systems of linear inequalities.


Example 6. For a matrix A, the solution set of Ax ≤ b is an intersection of
half spaces. Therefore it is convex.
Proposition 2 (Cartesian Products of Convex Sets). Suppose C1 ,. . . , Ck are
convex sets in Rn . Then the Cartesian product C1 × · · · × Ck is a convex set
in Rn×···×n .

Proof. Exercise!

Given an operator F and a set C, we can apply F to elements of C to obtain


the image of C under F . We will denote that set as F C. If F is linear then
it preserves convexity.
Proposition 3 (Linear Transformations of Convex Sets). Let M be a matrix.
If C is convex, then M C is also convex.

Consider now an operator that just adds a vector d to its operand. This is a translation operator. Geometrically, it should be obvious that translation preserves convexity. It is a good exercise to translate this mental picture into an algebraic expression and show the following.

Proposition 4 (Translation). If C is convex, then the set C + d = {x : x = v + d for some v ∈ C} is also convex.

Given C1 , C2 , consider the set of points of the form v = v1 + v2 , where vi ∈ Ci .


This set is denoted by C1 + C2 and is called the Minkowski sum of C1 and C2 .
We have the following result concerning Minkowski sums.

Proposition 5 (Minkowski Sums of Convex Sets). If C1 and C2 are convex


then C1 + C2 is also convex.

Proof. Observe that

C1 + C2 = [ I  I ] (C1 × C2 ).

Thus it follows by Prop. 2 and Prop. 3 that C1 + C2 is convex.

Example 7. The Minkowski sum of a rectangle and a disk in R2 is shown below.

[Figure: a rectangle plus a disk gives a rectangle with rounded corners.]

Definition 2 (Convex Combination). Consider a finite collection of points x1 , . . . , xk . x is said to be a convex combination of the xi ’s if x satisfies

x = α1 x1 + · · · + αk xk

for some αi such that

αi ≥ 0 for all i,  and  Σ_{i=1}^k αi = 1.

Definition 3 (Convex Hull). The set of all convex combinations of a set C is


called the convex hull of C and is denoted as Co(C). 

Below are two examples, showing the convex hulls of the sets C = {x1 , x2 } and C′ = {y1 , y2 , y3 }.

[Figure: the convex hull of two points is the segment between them; the convex hull of three points is a triangle.]

Notice that in the definition of the convex hull, the set C does not have to be convex (in fact, C is not convex in the examples above). Also, regardless of the dimension of C, when constructing Co(C), we can consider convex combinations of any finite number of elements chosen from C. In fact, if we denote the set of all convex combinations of k elements from C as Ck , then we can show that Ck ⊂ Ck+m for m ≥ 0. An interesting result, which we will not use in this course, is the following.

Exercise 3 (Caratheodory’s Theorem). Show that, if C ⊆ Rn , then Co(C) = Cn+1 . 

The following proposition justifies the term ‘convex’ in the definition of the
convex hull.

Proposition 6. The convex hull of a set C is convex.

Proof. Exercise!

The convex hull of C is the smallest convex set that contains C. More precisely,
we have the following.

Proposition 7. If D = Co(C), and if E is a convex set with C ⊂ E, then


D ⊂ E.

Proof. The idea is to show that for any integer k, E contains all convex combinations involving k elements from C. For this, we will present an argument based on induction.
We start with k = 2. Suppose x1 , x2 ∈ C. This implies x1 , x2 ∈ E. Since E is convex, we have αx1 + (1 − α)x2 ∈ E, for all α ∈ [0, 1]. Since x1 , x2 were arbitrary elements of C, it follows that E contains all convex combinations of any two elements from C.
Suppose now that E contains all convex combinations of any k − 1 elements from C. That is, if x1 , . . . , xk−1 are in C, then for Σ_{i=1}^{k−1} αi = 1 and αi ≥ 0, we have Σ_{i=1}^{k−1} αi xi ∈ E. Suppose we pick a k-th element, say xk , from C. Also, let α1 , . . . , αk be on the unit simplex, with αk ≠ 1 (if αk = 1, the combination reduces to xk ∈ E, and we have nothing to prove). Observe that

Σ_{i=1}^{k} αi xi = αk xk + Σ_{i=1}^{k−1} αi xi = αk xk + (1 − αk ) Σ_{i=1}^{k−1} [ αi / (1 − αk ) ] xi .

Notice that

Σ_{i=1}^{k−1} αi / (1 − αk ) = 1,  and  αi / (1 − αk ) ≥ 0 for all i.

Therefore,

y = Σ_{i=1}^{k−1} [ αi / (1 − αk ) ] xi

is an element of E, since it is a convex combination of k − 1 elements of C. But then,

Σ_{i=1}^{k} αi xi = αk xk + (1 − αk ) y

is an element of E (why?). We are done.

Another operation of interest is the affine hull. For that, let us introduce affine
combinations.

Definition 4 (Affine Combination). Consider a finite collection of points x1 ,


. . . , xk . x is said to be an affine combination of xi ’s if x satisfies

x = α 1 x1 + · · · + α k xk

for some αi such that


Σ_{i=1}^k αi = 1.

Definition 5 (Affine Hull). The set of all affine combinations of a set C is


called the affine hull of C. 

The affine hull of two points x1 , x2 is the line passing through the two points, whereas their convex hull is the segment between them.

[Figure: the line through x1 and x2 , extending beyond both points.]

Exercise 4. Consider a set C ⊂ R2 , composed of two points C = {x1 , x2 }.


What is the difference between the affine and convex hull of C? 

Let us end this discussion with some definitions.


Definition 6 (Interior). x is said to be in the interior of C if there exists an open set B such that x ∈ B and B ⊂ C. The interior of C is denoted as int(C).

Definition 7 (Boundary). The boundary of a set C is defined to be C ∩ (int(C))ᶜ , i.e., the points of C that are not interior points.


1.3 Projections onto Closed Convex Sets


Definition 8 (Projection). Let C be a set. For a given point y (inside or outside C), x ∈ C is said to be a projection of y onto C if it is a minimizer of the following problem:

min_{z∈C} ‖y − z‖2 .

In general, projections may not exist, or may not be uniquely defined.

Example 8. Suppose D denotes the open unit disk in R2 and let y be such that ‖y‖2 > 1. Then, the projection of y onto D does not exist. Notice that D is convex but not closed. 
Example 9. Let C be the unit circle. Can you find a point y ∉ C that has infinitely many projections? Can you find a point which does not have a projection onto C? Notice in this case that C is not convex, but closed. 

The two examples above imply that projections are not always guaranteed to
be well-defined. In fact, convexity alone is not sufficient. We will see later that
convexity and closedness together ensure the existence of a unique minimizer.
The following provides a useful characterization of the projection.
Proposition 8. Let C be a convex set. x ∈ C is a projection of y onto C if and only if

⟨z − x, y − x⟩ ≤ 0, for all z ∈ C.

It is useful to understand what this proposition means geometrically. Consider the figure below. If x is the projection of y onto the convex set, then Prop. 8 implies that the angle between the vectors z − x and y − x is not acute.

[Figure: a convex set containing x and z, with y outside; the angle at x between z − x and y − x is obtuse.]

Proof of Prop. 8. (⇒) Suppose x = PC (y) but there exists z ∈ C such that ⟨z − x, y − x⟩ > 0.
The idea is as follows. Consider the figure below.

[Figure: points x, z ∈ C and y outside; a point t on the segment from x to z makes an angle β with y larger than the angle α at x.]

Pick t on the segment between x and z such that β > α. But then

‖y − t‖2 < ‖y − x‖2 .

Thus, x cannot be the projection.
Let us now provide an algebraic proof. Consider

‖y − (αz + (1 − α) x)‖2² = ‖y − x + α(x − z)‖2² = ‖y − x‖2² + α² ‖x − z‖2² + 2α ⟨x − z, y − x⟩.

Write d = ‖x − z‖2² and c = ⟨z − x, y − x⟩, so that the right hand side equals ‖y − x‖2² + α² d − 2αc. Since c > 0 by assumption, the polynomial α² d − 2αc is negative in the interval (0, 2c/d). Thus we can find α ∈ (0, 1) such that ‖y − (αz + (1 − α) x)‖2² is strictly less than ‖y − x‖2² . But this contradicts the assumption that x = PC (y).
(⇐) Suppose x ∈ C satisfies ⟨z − x, y − x⟩ ≤ 0 for all z ∈ C, but x ≠ PC (y). Let z = PC (y). Consider

‖y − z‖2² = ‖y − x + x − z‖2² = ‖y − x‖2² + ‖x − z‖2² + 2 ⟨x − z, y − x⟩.

The last inner product is nonnegative by assumption, and ‖x − z‖2² > 0 since x ≠ z. Therefore ‖y − x‖2 < ‖y − z‖2 . But this is a contradiction, since z = PC (y).
Corollary 1. If C is closed, convex then PC (y) is a unique point.

Proof. Suppose x1 , x2 ∈ C and

kx1 − yk2 = kx2 − yk2 ≤ kz − yk2 for all z ∈ C.

Then, we have, by Prop. 8 that

hy − x1 , x2 − x1 i ≤ 0,
hy − x2 , x1 − x2 i ≤ 0.

Adding these inequalities, we obtain

hx1 − x2 , x1 − x2 i ≤ 0,

which implies x1 = x2 .

Projection operators enjoy useful properties. One of them is the following,


which we will refer to as ‘firm nonexpansivity’ (to be properly defined later).
Proposition 9. kPC (x1 ) − PC (x2 )k22 ≤ hPC (x1 ) − PC (x2 ), x1 − x2 i.

Proof. Let pi = PC (xi ). Then, by Prop. 8, we have

⟨x1 − p1 , p2 − p1 ⟩ ≤ 0,
⟨x2 − p2 , p1 − p2 ⟩ ≤ 0.

Summing these, we have

⟨(p2 − p1 ) + (x1 − x2 ), p2 − p1 ⟩ ≤ 0.

Rearranging, we obtain the desired inequality.

Consider the following figure. The proposition states that the inner product of the two vectors x1 − x2 and PC (x1 ) − PC (x2 ) is at least the squared length of the latter.

[Figure: points x1 , x2 outside a convex set C and their projections PC (x1 ), PC (x2 ).]

As a first corollary of this proposition, we have:
Corollary 2. For a closed, convex C, we have hPC (x1 ) − PC (x2 ), x1 − x2 i ≥ 0.

Applying the Cauchy-Schwarz inequality, we obtain from Prop. 9 that projec-


tion operators are ‘non-expansive’ (also, to be discussed later).

Corollary 3. For a closed, convex C, we have kPC (x1 )−PC (x2 )k2 ≤ kx1 −x2 k2 .
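As a quick numerical illustration of Prop. 9 and Corollaries 2-3 (a minimal sketch assuming Python with NumPy, not part of the original notes; the unit ℓ2 ball is chosen because its projection has a simple closed form):

    import numpy as np

    def project_unit_ball(x):
        # projection onto the closed unit l2 ball: scale x down if it lies outside
        n = np.linalg.norm(x)
        return x if n <= 1.0 else x / n

    rng = np.random.default_rng(0)
    for _ in range(5):
        x1, x2 = rng.normal(size=3), rng.normal(size=3)
        p1, p2 = project_unit_ball(x1), project_unit_ball(x2)
        gap = np.dot(p1 - p2, x1 - x2) - np.linalg.norm(p1 - p2) ** 2
        assert gap >= -1e-12                                               # Prop. 9 (firm nonexpansivity)
        assert np.linalg.norm(p1 - p2) <= np.linalg.norm(x1 - x2) + 1e-12  # Corollary 3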

1.4 Separation and Normal Cones


Proposition 10. Let C be a closed convex set and x ∉ C. Then, there exists s such that

⟨s, x⟩ > sup_{y∈C} ⟨s, y⟩.

Proof. To outline the idea of the proof, consider the left figure below.

[Figure: left, a closed convex set C, a point x ∉ C, its projection p, and the hyperplane H through p normal to x − p; right, a configuration where H fails to separate, so that some z ∈ C satisfies ‖x − z‖ < ‖x − p‖.]

The hyperplane H, normal to x − p and touching C at p, should separate x and C. If this were not the case, the situation would resemble the right figure above, and we would have ‖x − z‖ < ‖x − p‖ for some z ∈ C, contradicting p = PC (x).
Algebraically, ⟨x − PC (x), z − PC (x)⟩ ≤ 0, for all z ∈ C. Set s = x − PC (x). Then, we have

⟨s, z − x + s⟩ ≤ 0, ∀z ∈ C
⇐⇒ ⟨s, x⟩ ≥ ⟨s, s⟩ + ⟨s, z⟩, ∀z ∈ C.

Since s ≠ 0, we have ⟨s, s⟩ > 0, and the claim follows.

As a generalization, we have the following result.

Proposition 11. Let C, D be disjoint closed convex sets. Suppose also that
C − D is closed. Then, there exists s such that hs, xi > hs, yi, for all x ∈ C,
y ∈ D.

Proof. The idea is to consider the segment between the closest points of C and
D, and construct a separating hyperplane that is orthogonal to this segment,
as visualized below.

[Figure: disjoint convex sets C and D, separated by a hyperplane orthogonal to the segment joining their closest points.]
We want to find s such that

⟨s, y − x⟩ < 0, ∀x ∈ C, y ∈ D.

Note that y − x ∈ D − C. Also, since C ∩ D = ∅, we have 0 ∉ D − C. Since D − C is closed and convex, Prop. 10 gives an s such that ⟨s, 0⟩ > ⟨s, z⟩ for all z ∈ D − C. This s satisfies

⟨s, y − x⟩ < 0, ∀ y ∈ D, x ∈ C.

Definition 9. An affine hyperplane Hs,r is said to support the set C if ⟨s, x⟩ ≤ r for all x ∈ C. Notice that this is equivalent to C ⊂ H−s,r , where H−s,r = {x : ⟨s, x⟩ ≤ r} denotes the corresponding half space. 

For a given set C, let ΣC denote the set of pairs (s, r) such that C ⊂ H−s,r .
Proposition 12. If C is closed and convex, then

C = ∩(s,r)∈ΣC H−s,r .

Proof. The proof follows by showing that the sets on the two sides are subsets of each other.
Obviously,

C ⊂ ∩(s,r)∈ΣC H−s,r .

Let us now show the other inclusion. Take x ∉ C. By Prop. 10, there exists p such that

⟨p, x⟩ > ⟨p, z⟩, ∀ z ∈ C.

Set q = sup_{z∈C} ⟨p, z⟩. Then H−p,q ⊃ C, and x ∉ H−p,q . Since (p, q) ∈ ΣC , we find that x ∉ ∩(s,r)∈ΣC H−s,r . Thus,

C ⊃ ∩(s,r)∈ΣC H−s,r .

We also state, without proof, the following result, that will be useful in the
discussion of duality.
Proposition 13. Suppose C is a convex set. There exists a supporting hy-
perplane for any x ∈ bd(C).

1.5 Tangent and Normal Cones
Definition 10. Let C be a closed convex set. The direction s ∈ Rn is said to
be normal to C at x when

hs, y − xi ≤ 0, ∀ y ∈ C.

According to the definition, for any y ∈ C, the angle between the vectors s and y − x shown below is obtuse.

[Figure: a convex set C, a point x on its boundary, a normal direction s at x, and another point y ∈ C.]

Notice that if s is normal to C at x, then αs is also normal, for α ≥ 0. The


set of normal directions is therefore a cone.

Definition 11. The cone mentioned above is called the normal cone of C at
x, and is denoted as NC (x). 

Proposition 14. NC (x) is a convex cone.

Proof. If s1 , s2 are in NC (x), then

hαs1 + (1 − α)s2 , y − xi = αhs1 , y − xi + (1 − α)hs2 , y − xi ≤ 0,

implying that αs1 + (1 − α)s2 ∈ NC (x).

Below are two examples.

[Figure: two convex sets C and D with normal cones attached at boundary points: x1 + NC (x1 ), x2 + NC (x2 ), and z + ND (z).]

From the definition, we directly obtain the following results on normal cones.

Proposition 15. Let C be closed, convex. If s ∈ NC (x), then

PC (x + αs) = x, ∀ α ≥ 0.

Normal cones will be of interest when we discuss constrained minimization,
and subdifferentials of functions.
Let us now define a related cone through ‘polarity’.

Definition 12. Given a closed cone C, the polar cone of C is the set of p such
that

hp, si ≤ 0, ∀ s ∈ C.

In the figure below, the dot indicates the origin, and D is the polar cone of C.
Note also that C is the polar cone of D.

[Figure: a cone C and its polar cone D, both with vertex at the origin.]

Definition 13. The polar cone of NC (x) is called the tangent cone of C at x,
and is denoted by TC (x). 

The figures below show the tangent cones of the sets at the origin (the origin
is indicated by a dot).

[Figure: two sets with their tangent cones TC (0) and normal cones NC (0) at the origin.]

Proposition 16. For a convex C, we have x + TC (x) ⊃ C.

Proof. If p ∈ C, then for every s ∈ NC (x) we have ⟨p − x, s⟩ ≤ 0. Hence p − x belongs to the polar cone of NC (x), that is, p − x ∈ TC (x).
Proposition 17. Suppose C is closed and convex. Then

∩x∈C x + TC (x) = C.

Proof. Let us denote the set on the lhs as D. Note that, by the previous
proposition, D ⊃ C.
To see the converse inclusion, take z ∈
/ C. Let x = PC (z). Then z − x ∈ NC (x)
and z − x ∈/ TC (x). Thus z ∈/ x + TC (x), and thus z ∈
/ D.

Proposition 18. TC (x) is the closure of the cone generated by C − x.


Exercise 5. Show that a closed set C is convex if and only if ½ (x + y) ∈ C, for all x, y ∈ C.


Proposition 19. Let x ∈ bd(C), where C is convex. There exists s such that

hs, xi ≤ hs, yi, ∀y ∈ C.

Proof. Consider a sequence {xk }k with xk ∈ / C and limk xk = x. Also, let


y ∈ C. We can find a sequence {sk }k with ksk k2 = 1 such that hsk , xk i ≤ hsk , yi
for all k. Now, extract a convergent subsequence ski with limit s (such a
subsequence exists by the Bolzano-Weierstrass theorem, since sk are bounded,
and are in Rn .). Then, we must have

hski , xki i ≤ hski , yi, ∀i.

Taking limits, we obtain hs, xi ≤ hs, yi.

Alternative Proof (Sketch). Assume NC (x) ≠ ∅. Take z ∈ NC (x). Then ⟨z, y − x⟩ ≤ 0 for all y ∈ C. This is equivalent to ⟨−z, x⟩ ≤ ⟨−z, y⟩.
2 Convex Functions
The standard definition is as follows.
Definition 14. f : Rn → R is said to be convex if for all x, y ∈ Rn , and
α ∈ [0, 1], we have

f (αx + (1 − α)y) ≤ αf (x) + (1 − α) f (y).

If the inequality is strict whenever x ≠ y and α ∈ (0, 1), the function is said to be strictly convex.
The inequality is demonstrated in the figure below.

[Figure: the graph of f (·) lies below the chord joining (x, f (x)) and (y, f (y)).] 

In order to link convex functions and sets, let us also introduce the following.
Definition 15. Given f : Rn → R, the epigraph of f is the subset of Rn+1 defined as

epi(f ) = { (x, r) ∈ Rn × R : r ≥ f (x) }.

[Figure: the epigraph of a convex function, the region above its graph.]


Proposition 20. f is convex if and only if its epigraph is a convex set in Rn+1 .

Proof. (⇒) Suppose f is convex. Pick (x1 , r1 ), (x2 , r2 ) from epi(f ). Then,

r1 ≥ f (x1 )
r2 ≥ f (x2 ).

Using the convexity of f , we have

f ( ½ x1 + ½ x2 ) ≤ ½ f (x1 ) + ½ f (x2 ) ≤ ½ (r1 + r2 ).

Therefore

½ (x1 , r1 ) + ½ (x2 , r2 ) ∈ epi(f ),

and so epi(f ) is convex (the same computation with weights α, 1 − α handles general convex combinations).
(⇐) Suppose epi(f ) is convex. Notice that since (x1 , f (x1 )), (x2 , f (x2 )) ∈
epi(f ), we have

αx1 + (1 − α)x2 , αf (x1 ) + (1 − α)f (x2 ) ∈ epi(f ).

But this means that



f αx1 + (1 − α)x2 ≤ αf (x1 ) + (1 − α) f (x2 ).

Thus, f is convex.

Definition 16. The domain of f : Rn → R is the set

dom(f ) = {x ∈ Rn : f (x) < ∞}.

Example 10. Let C be a set in Rn . Consider the function

iC (x) = 0, if x ∈ C;  iC (x) = ∞, if x ∉ C.

iC (x) is called the indicator function of the set C. Its domain is the set C.

Exercise 6. Show that iC (x) is convex if and only if C is convex. 

Exercise 7. Consider the function

uC (x) = 0, if x ∈ C;  uC (x) = 1, if x ∉ C.

Determine if uC (x) is convex. If so, under what conditions? 

Proposition 21 (Jensen’s inequality). Let f : Rn → R be convex and x1 , . . . , xk be points in its domain. Also, let α1 , . . . , αk ∈ [0, 1] with Σi αi = 1. Then,

f ( Σi αi xi ) ≤ Σi αi f (xi ). (2.1)

Proof. Notice that (xi , f (xi )) ∈ epi(f ), for i = 1, . . . , k. Since epi(f ) is convex, it follows that

( Σi αi xi , Σi αi f (xi ) ) ∈ epi(f ).

But this is equivalent to (2.1).
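A one-line numerical check of Jensen's inequality (a sketch assuming Python with NumPy, not part of the original notes; f (x) = exp(x) is just an example of a convex function):

    import numpy as np

    f = np.exp                             # a convex function on R
    rng = np.random.default_rng(1)
    x = rng.normal(size=6)                 # points x_1, ..., x_k
    a = rng.random(size=6); a /= a.sum()   # weights on the unit simplex
    assert f(np.dot(a, x)) <= np.dot(a, f(x)) + 1e-12   # inequality (2.1)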

Definition 17. f is said to be concave if −f is convex. 

Below are some examples of convex functions.


Example 11. Affine functions : f (x) = hs, xi+b, for some s and b. In relation
with this, determine if f (x, y) = xy is convex.
Example 12 (Norms). Suppose k·k is a norm. Then by the triangle inequality,
and the homogeneity of the norm, we have

kαx + (1 − α)yk ≤ αkxk + (1 − α)kyk.

In particular, for 1 ≤ p ≤ ∞, the ℓp norm is defined as

‖x‖p = ( Σi |xi |^p )^{1/p} .

Show that ‖x‖p is actually a norm. What happens if p < 1?


Example 13 (Quadratic Forms). f (x) = xT Qx is convex if Q + QT is positive
semidefinite. Show this!
Exercise 8. Show that f (x) above is not convex if Q + QT is not positive
semi-definite. 
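The criterion of Example 13 and Exercise 8 is easy to test numerically: f (x) = xᵀQx is convex exactly when the symmetric part of Q has no negative eigenvalues. A minimal sketch (assuming Python with NumPy; the matrices below are arbitrary examples, not from the notes):

    import numpy as np

    def is_convex_quadratic(Q, tol=1e-10):
        # f(x) = x^T Q x is convex iff (Q + Q^T)/2 is positive semidefinite
        return np.linalg.eigvalsh((Q + Q.T) / 2.0).min() >= -tol

    print(is_convex_quadratic(np.array([[2.0, 1.0], [0.0, 1.0]])))  # True
    print(is_convex_quadratic(np.array([[1.0, 3.0], [0.0, 1.0]])))  # False (symmetric part indefinite)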

2.1 Operations That Preserve the Convexity of Functions
1. Translation by an arbitrary amount, multiplication with a non-negative
constant.
2. Dilation : If f (x) is convex, so is f (αx), for α ∈ R. (Follows by consid-
ering the epigraph.)
3. Pre-Composition with a matrix : If f (x) is convex, so is f (Ax). Show
this!
4. Post-Composition with an increasing convex function : Suppose g : R → R is an increasing convex function and f : Rn → R is convex. Then, g(f (·)) is convex.

Proof. Since g is increasing and convex, we have

g( f (αx1 + (1 − α)x2 ) ) ≤ g( αf (x1 ) + (1 − α)f (x2 ) ) ≤ α g( f (x1 ) ) + (1 − α) g( f (x2 ) ).
5. Pointwise supremum of convex functions : Suppose f1 (·), . . . fk (·) are
convex functions, and define
g(x) = max fi (x).
i

Then g is convex.

Proof. Notice that

epi(g) = ∩i epi(fi ).

Since intersections of convex sets are convex, epi(g) is convex, and so g


is convex.
Below is a visual demonstration of the proof.

[Figure: two convex functions f1 , f2 and the epigraph of their pointwise maximum, the intersection of their epigraphs.]

2.2 First Order Differentiation


Proposition 22. Suppose f : Rn → R is a differentiable function. Then, f is
convex if and only if
f (y) ≥ f (x) + h∇f (x), y − xi ∀x, y ∈ Rn .

Proof. (⇒) Suppose f is convex. Then,



f (x + α(y − x)) ≤ (1 − α)f (x) + αf (y), for 0 ≤ α ≤ 1.

Rearranging, we obtain

[ f (x + α(y − x)) − f (x) ] / α ≤ f (y) − f (x), for 0 < α ≤ 1.
Letting α → 0, the left hand side converges to h∇f (x), y − xi.
(⇐) Consider the function gy (x) = f (y) + ⟨∇f (y), x − y⟩. Then gy (x) ≤ f (x) and gy (y) = f (y) (see below). Also, gy (x) is convex (in fact, affine).

[Figure: the affine function gy (·) supporting the graph of f (·) from below, touching at y.]
Now set

h(x) = sup_y gy (x).

But, by the two properties of gy (·) above, it follows that h(x) = f (x). But
since h(x) is the supremum of a collection of convex functions, it is convex.
Thus, it follows that f (x) is convex.

We remark that a similar construction as in the second part of the proof will
lead to the conjugate of f (x), which will be discussed later.
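The construction in the second part of the proof can also be checked numerically: every tangent line minorizes f , and their pointwise supremum recovers f . A small sketch (assuming Python with NumPy, not part of the original notes; f (x) = x² is our example):

    import numpy as np

    f, df = lambda x: x ** 2, lambda x: 2.0 * x     # a smooth convex function and its derivative
    xs = np.linspace(-2.0, 2.0, 401)

    for x0 in np.linspace(-2.0, 2.0, 21):
        tangent = f(x0) + df(x0) * (xs - x0)        # g_{x0}(x) = f(x0) + <f'(x0), x - x0>
        assert np.all(tangent <= f(xs) + 1e-12)     # tangents minorize f (Prop. 22)

    # h(x) = sup_y g_y(x) recovers f on the grid
    h = np.max(f(xs)[None, :] + df(xs)[None, :] * (xs[:, None] - xs[None, :]), axis=1)
    print(np.max(np.abs(h - f(xs))))                # essentially zero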
Note that if f : R → R is differentiable and convex, then f ′ (·) is a monotonically increasing function. To see this, suppose that f ′ (x0 ) = 0. Then, for y > x0 , we can show that f ′ (y) ≥ f ′ (x0 ) = 0. Indeed, f (y) ≥ f (x0 ) because of the proposition. If f ′ (y) < 0, then

f (x0 ) ≥ f (y) + f ′ (y)(x0 − y) > f (x0 ),

since f (y) ≥ f (x0 ) and f ′ (y)(x0 − y) > 0 (both factors are negative). This is a contradiction. Therefore, we must have f ′ (y) ≥ 0.

To generalize this argument to an arbitrary point x0 , apply it to

h_{x0} (x) = f (x) − f ′ (x0 ) x.

Observe that h′_{x0} (x0 ) = 0, and h′_{x0} (x) = f ′ (x) − f ′ (x0 ).


Question 2. How does the foregoing discussion generalize to convex functions
defined on Rn ? 
Definition 18. An operator M : Rn → Rn is said to be monotone if

hM (x) − M (y), x − yi ≥ 0, for all x, y ∈ Rn .

Below is an instance of this relation. The two vectors x − y and M (x) − M (y)
approximately point in the same direction.
[Figure: points x, y and their images M (x), M (y); the vectors x − y and M (x) − M (y) point roughly in the same direction.]

Proposition 23. Suppose f : Rn → R is differentiable. Then, f is convex if


and only if ∇f is monotone.

Proof. (⇒) Suppose f is convex. This implies the following two inequalities.

f (y) ≥ f (x) + h∇f (x), y − xi,


f (x) ≥ f (y) + h∇f (y), x − yi,

Rearranging these inequalities, we obtain

0 ≥ h∇f (x) − ∇f (y), y − xi.

(⇐) Suppose ∇f is monotone. For y, x ∈ Rn , let z = y − x. Then

f (y) − f (x) = ∫₀¹ ⟨∇f (x + αz), z⟩ dα.

Rearranging,

f (y) − f (x) − ∫₀¹ ⟨∇f (x), z⟩ dα = ∫₀¹ ⟨∇f (x + αz) − ∇f (x), z⟩ dα.

But the right hand side is non-negative by the monotonicity of ∇f . It follows that

f (y) ≥ f (x) + ⟨∇f (x), y − x⟩.

It then follows by Prop. 22 that f is convex.

2.3 Second Order Differentiation


We have seen that f : R → R is convex if and only if its first derivative is monotonically increasing. But the latter property is equivalent to the second derivative being non-negative. Therefore, we also have an additional equivalent condition that involves the second derivative. This generalizes as follows.

Proposition 24. Let f : Rn → R be a twice-differentiable function. Then, f


is convex if and only if ∇2 f is positive semi-definite.

Proof. (⇒) If f is convex, then F = ∇f : Rn → Rn is a monotone mapping,


by Prop. 23. Let d ∈ Rn . Then,

hF (x + αd) − F (x), αdi ≥ 0 for all α > 0.

This implies

⟨ [ F (x + αd) − F (x) ] / α , d ⟩ ≥ 0 for all α > 0.

Taking limits (which exist because f is twice differentiable), we obtain ⟨G(x) d, d⟩ ≥ 0, where G(x) = ∇²f is the Hessian matrix with entries [G(x)]ij = ∂i ∂j f (x).

(⇐) Conversely, assume that G(x) = ∇F is positive semi-definite. We will


show that F = ∇f is monotone. Notice that

F (x + d) − F (x) = ∫₀¹ G(x + αd) d dα.

This implies

⟨F (x + d) − F (x), d⟩ = ∫₀¹ ⟨G(x + αd) d, d⟩ dα ≥ 0,

since each integrand is non-negative.

Therefore F is monotone.

3 Conjugate Functions
This chapter introduces the notion of a conjugate function, along with some
basic properties.
Suppose epi(f ) is a closed set. We will now consider a dual representation of
this set.
Consider a point in epi(f ): (x0 , f (x0 )) ∈ Rn+1 , where f : Rn → R.

[Figure: the point (x0 , f (x0 )) on the boundary of epi(f ).]

Notice that this point is on the boundary of epi(f ). Therefore we can find a
point (z, c) such that

hz, x0 i + cf (x0 ) ≥ hz, yi + cr for all (y, r) ∈ epi(f ). (3.2)

Notice that here c ≤ 0 since if y ∈ dom(f ), r can be arbitrarily large.


Now, if c ≠ 0, we can find s and r such that fs,r (x) = ⟨s, x⟩ + r minorizes f (x) at x0 . That is,

fs,r (x0 ) = f (x0 ),
fs,r (x) ≤ f (x) for all x.

To be concrete, dividing (3.2) by c < 0, we obtain s and r as follows:

⟨z/c, x0 ⟩ + f (x0 ) ≤ ⟨z/c, x⟩ + f (x)
⇐⇒ ⟨−z/c, x⟩ + [ ⟨z/c, x0 ⟩ + f (x0 ) ] ≤ f (x),

so that s = −z/c and r = ⟨z/c, x0 ⟩ + f (x0 ).

The remaining question is : can we always find a (z, c) pair with c 6= 0 such
that (3.2) holds?
Fortunately, the answer is yes, and it is easier to see if we assume dom(f ) = Rn .
Note that, in this case, if (z, c) 6= (0, 0) and (3.2) holds, then c = 0 implies
that

hz, x0 i ≥ hz, yi for all y ∈ Rn .

But this inequality is not valid for y = 2zkx0 k2 /kzk2 . Thus, we must have
c 6= 0. The general case is considered below.
Lemma 1. Suppose f : Rn → R is closed, convex. Then, there exist (z, c) with z ≠ 0, c ≠ 0 such that (3.2) holds.
Proof. To see that we can find a pair (z, c) with c 6= 0, let xk ∈ relint(dom(f ))
with xk → x. Then, we can find (sk , ck ) such that
hsk , xk i + ck f (xk ) ≤ hsk , yi + ck f (y).
Here, if ck = 0, then
hsk , xk − yi ≤ 0, ∀y ∈ dom(f ).
But since xk ∈ relint(dom(f )), xk + α(xk − y) ∈ dom(f ) for sufficiently small
α > 0. This implies hsk , y − xk i ≤ 0, which is a contradiction (here, we
assume that sk is included in the subspace parallel to aff(dom(f ))). Therefore,
ck 6= 0.

The foregoing discussion implies that, for a closed convex f : Rn → R, given


any x0 ∈ Rn , we can find z ∈ Rn and r ∈ R such that the following two
conditions hold:
hz, x0 i + r = f (x0 )
hz, xi + r ≤ f (x).

This is depicted below.


hz, ·i + r f (·)

x0

Notice that there is a maximum value of r, associated with a z, so that the


above two conditions are valid. How can we find this value?
Consider the following figure.

[Figure: the graph of f (·) and the linear function ⟨z, ·⟩; the maximum r is determined by the smallest vertical gap between them, attained near x0 .]

In order to find the maximum r, we can look for the minimum vertical distance
between f (x) and hz, xi. That is, we set
r = inf_x [ f (x) − ⟨z, x⟩ ].
Observe that this definition implies the two conditions above.
In order to work with sup rather than inf, and to emphasize the dependence on z, we define

f ∗ (z) = −r = sup_x [ ⟨z, x⟩ − f (x) ].

The function f ∗ (z) is called the Fenchel conjugate of f , and thanks to the supremum operation, it is convex with respect to z – note that in the definition of f ∗ , x acts like an index. In addition to convexity, f ∗ is also closed, because its epigraph is the intersection of closed sets (half spaces). The following inequality follows immediately from the definition.
Proposition 25 (Fenchel’s inequality). For a closed convex f : Rn → R, we have

f (x) + f ∗ (z) ≥ ⟨z, x⟩, for all x, z.

Consider now the conjugate of the conjugate:

f ∗∗ (x) = sup_z [ ⟨z, x⟩ − f ∗ (z) ].

Since, for each z, the affine function x ↦ ⟨z, x⟩ − f ∗ (z) minorizes f , we have f ∗∗ (x) ≤ f (x). However, by the previous discussion, we also know that for any x∗ , there is a pair (z ∗ , r∗ ) with r∗ = −f ∗ (z ∗ ) such that ⟨z ∗ , x∗ ⟩ + r∗ = f (x∗ ). Therefore, we deduce that f ∗∗ (x) = f (x). Thus, we have shown the following.

Proposition 26. If f : Rn → R is a closed convex function, then

f (x) = sup_z [ ⟨z, x⟩ − f ∗ (z) ].

Example 14. Let f (x) = ½ xᵀQ x, where Q is positive definite. Then,

f ∗ (z) = sup_x [ ⟨z, x⟩ − ½ xᵀQ x ].

The maximum is achieved at x∗ such that z = Q x∗ . Plugging in x∗ = Q⁻¹ z, we obtain

f ∗ (z) = zᵀQ⁻¹z − ½ zᵀQ⁻ᵀQ Q⁻¹z = ½ zᵀQ⁻¹z.
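A quick numerical check of Example 14 (a sketch assuming Python with NumPy, not part of the original notes; Q and z are arbitrary test values): the closed form ½ zᵀQ⁻¹z should match a brute-force maximization of ⟨z, x⟩ − ½ xᵀQx over a grid.

    import numpy as np

    Q = np.array([[3.0, 1.0], [1.0, 2.0]])          # positive definite
    z = np.array([0.5, -1.0])

    closed_form = 0.5 * z @ np.linalg.solve(Q, z)   # (1/2) z^T Q^{-1} z

    g = np.linspace(-3.0, 3.0, 301)
    X = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
    values = X @ z - 0.5 * np.einsum('ij,jk,ik->i', X, Q, X)
    print(closed_form, values.max())                # the two numbers agree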

Example 15. Let C be a convex set. Recall that the indicator function is defined as iC (x) = 0 if x ∈ C, and ∞ if x ∉ C. Its conjugate is

σC (z) = sup_x [ ⟨z, x⟩ − iC (x) ] = sup_{x∈C} ⟨z, x⟩.

The function σC is called the support function of C.
Some Properties of the Conjugate Function

(i) If g(x) = f (x − s), then

g ∗ (z) = sup_x [ ⟨z, x⟩ − f (x − s) ] = sup_y [ ⟨z, y + s⟩ − f (y) ] = f ∗ (z) + ⟨z, s⟩.

(ii) If g(x) = t f (x), with t > 0, then

g ∗ (z) = sup_x [ ⟨z, x⟩ − tf (x) ] = t sup_x [ ⟨z/t, x⟩ − f (x) ] = t f ∗ (z/t).
(iii) If g(x) = f (tx), with t ≠ 0, then

g ∗ (z) = sup_y [ ⟨z, y/t⟩ − f (y) ] = f ∗ (z/t).
(iv) If A is invertible, and g(x) = f (Ax), then

g ∗ (z) = sup_x [ ⟨z, x⟩ − f (Ax) ] = sup_y [ ⟨A⁻ᵀz, y⟩ − f (y) ] = f ∗ (A⁻ᵀz).

(v) If g(x) = f (x) + ⟨s0 , x⟩, then

g ∗ (z) = sup_x [ ⟨z − s0 , x⟩ − f (x) ] = f ∗ (z − s0 ).

(vi) If g(x1 , x2 ) = f1 (x1 ) + f2 (x2 ), then

g ∗ (z1 , z2 ) = sup_{x1 ,x2} [ ⟨z1 , x1 ⟩ + ⟨z2 , x2 ⟩ − f1 (x1 ) − f2 (x2 ) ] = f1∗ (z1 ) + f2∗ (z2 ).

(vii) If g(x) = f1 (x) + f2 (x), then

g ∗ (z) = sup_x [ ⟨z, x⟩ − f1 (x) − sup_y ( ⟨y, x⟩ − f2∗ (y) ) ]
       = sup_x inf_y [ ⟨z, x⟩ − f1 (x) − ⟨y, x⟩ + f2∗ (y) ].

We will later discuss that whenever there is a ‘saddle point’, we can exchange the order of inf and sup. Doing so, we obtain

g ∗ (z) = inf_y sup_x [ ⟨z − y, x⟩ − f1 (x) + f2∗ (y) ]
       = inf_y [ f1∗ (z − y) + f2∗ (y) ].

The operation that appears in the last line is called infimal convolution.

(viii) More generally, if g(x) = f1 (x) + f2 (x) + . . . + fn (x), then

g ∗ (z) = inf_{z1 ,...,zn : z = z1 +...+zn} [ f1∗ (z1 ) + . . . + fn∗ (zn ) ].

Example 16. The last property will be useful when we consider multiple constraints. In particular, let C = A ∩ B, where A, B are convex sets. Then we have that

σC (x) = sup_{z∈C} ⟨z, x⟩ = inf_y [ σA (x − y) + σB (y) ].

To see this, note that σC∗ = iC . But i_{A∩B} (z) = iA (z) + iB (z). Computing the conjugate using property (viii), we obtain

i∗_C (x) = σC (x) = inf_y [ σA (x − y) + σB (y) ].
4 Duality
This chapter introduces the notion of duality. We start with a general discus-
sion of duality. We then pass to Lagrangian duality, and finally consider the
Karush-Kuhn-Tucker conditions of optimality.

4.1 A General Discussion of Duality


Consider a minimization problem like

min_{x∈C} f (x), where C is a closed convex set.

Suppose there exists a function K(x, λ), which is

(i) convex for fixed λ, as a function of x,
(ii) concave for fixed x, as a function of λ,
(iii) such that, for some closed convex set D,

max_{λ∈D} K(x, λ) = f (x), if x ∈ C, and ∞, if x ∉ C.
Example 17. Consider the problem

min_x [ ½ ‖y − x‖2² + ‖x‖2 ], (4.4)

for x, y ∈ Rn (so C = Rn ). Here, by the Cauchy-Schwarz inequality, we can write

‖x‖2 = max_{λ∈B2} ⟨x, λ⟩,

where B2 is the unit ℓ2 ball of Rn . In this case, the problem (4.4) is equivalent to

min_x max_{λ∈B2} [ ½ ‖y − x‖2² + ⟨x, λ⟩ ],

and the expression in brackets is K(x, λ).

In short, “min_{x∈C} f (x)” is equivalent to

min_{x∈C} max_{λ∈D} K(x, λ).

We call (x∗ , λ∗ ) a solution of this problem if

K(x∗ , λ) ≤ K(x∗ , λ∗ ) ≤ K(x, λ∗ ) for all x ∈ C, λ ∈ D.

If such an (x∗ , λ∗ ) exists, it is called a saddle point of K(x, λ). See below for a depiction.

[Figure: a saddle-shaped surface K(x, λ); away from the saddle point, K increases in the x direction and decreases in the λ direction.]

This approach is useful especially if, for fixed λ, K(x, λ) is easy to minimize with respect to x. In that case, if λ = λ∗ , then minimizing K(x, λ∗ ) over x ∈ C is equivalent to solving the problem.
Example 18. Suppose λ is fixed. Then, to minimize

½ ‖y − x‖2² + ⟨λ, x⟩

over x, set the gradient to zero. This gives

x − y + λ = 0 ⇔ x = y − λ.

The question now is, how do we obtain λ∗ ? For that, define the function

g(λ) = min_{x∈C} K(x, λ).

Notice that

g(λ) ≤ K(x, λ) for all x ∈ C.

Consider now the problem

max_{λ∈D} g(λ). (4.5)

Exercise 9. Show that g(λ) is concave for λ ∈ D. 

Now let λ̄ ∈ arg max_{λ∈D} g(λ). Also, let x̄ ∈ arg min_{x∈C} K(x, λ̄). Then,

K(x̄, λ̄) ≤ K(x, λ̄) for x ∈ C,

and

g(λ̄) = K(x̄, λ̄) ≥ g(λ) ≥ K(x̄, λ) for λ ∈ D.

Combining these, we obtain

K(x̄, λ) ≤ K(x̄, λ̄) ≤ K(x, λ̄) for x ∈ C, λ ∈ D.

Therefore, (x̄, λ̄) is a saddle point, and x̄ solves the problem

min f (x).
x∈C

We have shown that if a saddle point exists, we can obtain it either by solving
a min − max or a max − min problem.
Here, (4.4) is called the primal problem and (4.5) is called the dual problem.
Note that there might be different dual problems depending on how we choose
K(x, λ). Notice also that if a saddle point exists, we have

d∗ = max_{λ∈D} g(λ) = min_{x∈C} f (x) = p∗ .

Example 19. Consider the problem

min_x [ ½ ‖y − x‖2² + ‖x‖2 ].

An equivalent problem is

min_x max_{λ∈B2} [ ½ ‖y − x‖2² + ⟨x, λ⟩ ],

where B2 is the unit ball of the ℓ2 norm. Let us define

g(λ) = min_x [ ½ ‖y − x‖2² + ⟨x, λ⟩ ].

Carrying out the minimization (the minimizer is x = y − λ), we find

g(λ) = ½ ‖λ‖2² + ⟨y − λ, λ⟩ = − ½ ‖y − λ‖2² + c,

for some constant c that does not depend on λ. Therefore, the dual problem is

max_{λ∈B2} [ −‖y − λ‖2² ],

or, equivalently,

min_{λ∈B2} ‖y − λ‖2².

The minimizer is the projection of y onto B2 , given by λ∗ = y if ‖y‖2 ≤ 1, and λ∗ = y/‖y‖2 if ‖y‖2 > 1. The solution of the primal problem is

x∗ = y − λ∗ = y − PB2 (y),

which equals 0 if ‖y‖2 ≤ 1, and ( (‖y‖2 − 1)/‖y‖2 ) y if ‖y‖2 > 1.
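The primal and dual solutions of Example 19 are simple enough to code directly. A minimal sketch (assuming Python with NumPy, not part of the original notes; the test vector y is arbitrary):

    import numpy as np

    def prox_l2_norm(y):
        # solution of min_x (1/2)||y - x||^2 + ||x||_2 : shrink y toward 0
        r = np.linalg.norm(y)
        return np.zeros_like(y) if r <= 1.0 else (1.0 - 1.0 / r) * y

    def project_unit_ball(y):
        r = np.linalg.norm(y)
        return y if r <= 1.0 else y / r

    y = np.array([3.0, -4.0])                   # ||y||_2 = 5
    x_star = prox_l2_norm(y)                    # primal solution
    lam_star = project_unit_ball(y)             # dual solution
    print(x_star, lam_star, x_star + lam_star)  # x* + lambda* = y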


Notice that this discussion depends heavily on the existence of a saddle point.
However, even if such a point does not exist, we can define a dual problem,
but in this case, the maximum of the dual d∗ and the minimum of the primal
problem p∗ are not necessarily equal. Instead, we will have d∗ ≤ p∗ . To see
this, note that

g(λ) ≤ K(x, λ) ≤ f (x) for all x, λ.

Therefore,

d∗ = g(λ∗ ) ≤ K(x, λ∗ ) ≤ f (x∗ ) = p∗ .

Therefore, p∗ − d∗ is always nonnegative. This difference is called the duality


gap.
The following proposition is a summary of the foregoing discussion. We provide
a proof for the sake of completeness.

Proposition 27. Let K(x, λ) be convex-concave, and define

f (x) = sup_λ K(x, λ),    g(λ) = inf_x K(x, λ).

Then,

K(x∗ , λ) ≤ K(x∗ , λ∗ ) ≤ K(x, λ∗ ) (4.6)

if and only if

x∗ ∈ arg min_x f (x),  λ∗ ∈ arg max_λ g(λ),  and  inf_x sup_λ K(x, λ) = sup_λ inf_x K(x, λ). (4.7)
Proof. (⇒) Suppose (4.6) holds. Then, since K(x∗ , λ∗ ) = f (x∗ ) = g(λ∗ ), and
since f (x) ≥ g(λ) for arbitrary x, λ, it follows that (4.7) holds too.
(⇐) Suppose (4.7) holds. Then, by the last equality in (4.7), we have f (x∗ ) =
g(λ∗ ). Now by the definition of f (·), we have

f (x∗ ) ≥ K(x∗ , λ), for any λ.

Similarly, by the definition of g(·), we obtain

g(λ∗ ) ≤ K(x, λ∗ ) for any x.

Using the definition of f , and g once again, we can write,

f (x∗ ) ≥ K(x∗ , λ∗ ) ≥ g(λ∗ ).

Since f (x∗ ) = g(λ∗ ), we therefore obtain the desired inequality (4.6) by com-
bining these inequalities.

4.2 Lagrangian Duality


We now discuss a specific dual, associated with a constrained minimization problem:

min_x f (x) subject to gi (x) ≤ 0, i = 1, . . . , m, (4.8)

where all of the functions are closed, convex, and defined on Rn .


In this setting, we define the Lagrangian function as

L(x, λ) = f (x) + λ1 g1 (x) + λ2 g2 (x) + · · · + λm gm (x), if λi ≥ 0 for all i, and L(x, λ) = −∞ if λi < 0 for at least one i.

Notice that

max_{λ≥0} L(x, λ) = f (x), if gi (x) ≤ 0 for all i, and ∞ otherwise.

Therefore, (4.8) is equivalent to

min_x max_{λ≥0} L(x, λ).
Also, notice that for fixed x, L(x, λ) is concave (in fact affine) with respect to λ. It follows from the previous discussion that if (x∗ , λ∗ ) is a saddle point of L(x, λ), then x∗ solves (4.8), and

x∗ ∈ arg min_x L(x, λ∗ ) = arg min_x [ f (x) + λ∗1 g1 (x) + λ∗2 g2 (x) + . . . + λ∗m gm (x) ].

Notice that the problem is transformed into an unconstrained problem, with the help of λ∗ . To obtain λ∗ , we consider the Lagrangian dual, defined as

g(λ) = min_x L(x, λ).

The dual problem is

max_{λ≥0} g(λ).

If a saddle point exists, λ∗ is the solution of this dual problem. In that case, we obtain a minimizer (which need not be unique) as

x∗ ∈ arg min_x L(x, λ∗ ).

Example 20. Consider the problem

min x s.t. x² + 2x ≤ 0.

The dual function is

g(λ) = min_x [ x + λ(x² + 2x) ],

where the expression in brackets is L(x, λ) for λ ≥ 0. At the minimum we must have

1 + λ (2x + 2) = 0 ⇐⇒ x = −1/(2λ) − 1.

Plugging in, we obtain

g(λ) = −1/(2λ) − 1 + λ [ 1/(4λ²) + 1/λ + 1 − 1/λ − 2 ] = −λ − 1/(4λ) − 1.

λ∗ satisfies

−1 + 1/(2λ∗)² = 0.

Solving for λ∗ , and taking the positive root (since λ∗ ≥ 0), we find λ∗ = 1/2. Therefore,

x∗ = arg min_x [ x + ½ (x² + 2x) ],

which can be easily solved to give x∗ = −2. Note that we can see this easily if we sketch the constraint function. 
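A small numerical check of Example 20 (a sketch assuming Python with NumPy, not part of the original notes; the grids are arbitrary): maximizing the dual on a grid recovers λ∗ ≈ 1/2, and minimizing the Lagrangian at λ∗ recovers x∗ ≈ −2 with the constraint active.

    import numpy as np

    g = lambda x: x ** 2 + 2.0 * x                       # constraint g(x) <= 0
    dual = lambda lam: -lam - 1.0 / (4.0 * lam) - 1.0    # dual function from the example

    lams = np.linspace(1e-3, 3.0, 100001)
    lam_star = lams[np.argmax(dual(lams))]               # approx. 0.5

    xs = np.linspace(-4.0, 2.0, 100001)
    x_star = xs[np.argmin(xs + lam_star * g(xs))]        # approx. -2
    print(lam_star, x_star, g(x_star))                   # g(x*) approx. 0 (constraint active)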

So far, the discussion relied on the assumption that the Lagrangian has a saddle
point, so that the duality gap is zero. A natural question is to ask when this
can be guaranteed. The conditions that ensure the existence of a saddle point
are called constraint qualification. A simple one to state is Slater’s condition.
Proposition 28 (Slater’s condition). Suppose f (·) and gi (·) for i = 1, 2, . . . , n are convex functions. Consider the problem

min_x f (x), s.t. gi (x) ≤ 0, ∀i. (4.9)

Suppose there exists x̄ such that gi (x̄) ≤ 0 for all i, and gj (x̄) < 0 for some j. Then, x∗ solves (4.9) if and only if there exists λ∗ ≥ 0 such that (x∗ , λ∗ ) is a saddle point of the function

L(x, λ) = f (x) + Σ_{i=1}^n λi gi (x), if λi ≥ 0 for all i, and L(x, λ) = −∞ otherwise.

Proof. For simplicity, we assume that n = 1, so there is only one constraint


function g(·).
Assume that x∗ solves (4.9). Consider the sets

A = { (z1 , z2 ) : z1 ≥ f (x) and z2 ≥ g(x) for some x },
B = { (z1 , z2 ) : z1 < f (x∗ ) and z2 < 0 }.

[Figure: the sets A and B in the (g, f ) plane, touching at the point (0, f (x∗ )).]
It can be shown that both A and B are convex (show this!). Further, we have
A ∩ B = ∅ (show this!). Therefore, there exists µ1 , µ2 such that
µ1 z1 + µ2 z2 ≤ µ1 t1 + µ2 t2 for all z ∈ B, t ∈ A.
Note here that µ2 ≥ 0 since z2 can be taken as small as desired. Similarly,
µ1 ≥ 0, since z1 can be taken as small as desired. Also notice that µ1 6= 0,
since otherwise we would have 0 ≤ g(x) for all x, but we already know that
g(x̄) < 0. Therefore, for λ∗ = µ2 /µ1 , we can write
z1 + λ∗ z2 ≤ t1 + λ∗ t2 , for all z ∈ B, t ∈ A.
Consequently,

f (x∗ ) ≤ f (x) + λ∗ g(x) for all x and λ∗ ≥ 0.

In particular, we have that f (x∗ ) ≤ f (x∗ ) + λ∗ g(x∗ ). Since g(x∗ ) ≤ 0, and


λ∗ ≥ 0, it follows that λ∗ g(x∗ ) = 0. So, we have

f (x) + λ∗ g(x) ≥ f (x∗ ) = f (x∗ ) + λ∗ g(x∗ ) ≥ f (x∗ ) + λ g(x∗ ) for all λ ≥ 0.

Thus, (x∗ , λ∗ ) is a saddle point.

4.3 Karush-Kuhn-Tucker (KKT) Conditions


Suppose now that Slater’s conditions are satisfied and x∗ solves the problem, so that (x∗ , λ∗ ) is a saddle point of the Lagrangian L(x, λ). Notice that in this case, x∗ is a minimizer for the problem

min f (x) + λ∗1 g1 (x) + λ∗2 g2 (x) + . . . + λ∗m gm (x)


x

If all of the functions above are differentiable, we can write

∇f (x∗ ) + λ∗1 ∇g1 (x∗ ) + λ∗2 ∇g2 (x∗ ) + . . . + λ∗m ∇gm (x∗ ) = 0.

Moreover, since (x∗ , λ∗ ) is a saddle point and λ∗i ≥ 0, if gi (x∗ ) < 0 then we must have λ∗i = 0. Therefore, λ∗i gi (x∗ ) = 0 for all i. Collected together, these conditions are known as the KKT conditions.

λ∗i ≥ 0,
gi (x∗ ) ≤ 0,
λ∗i gi (x∗ ) = 0, (‘complementary slackness’)
∇f (x∗ ) + Σi λ∗i ∇gi (x∗ ) = 0. (4.10a)

By the above discussion, these conditions are necessary for optimality. It turns out that they are also sufficient. To see this, first observe that, since L(·, λ∗ ) is convex, (4.10a) implies that x∗ minimizes L(·, λ∗ ), so that

g(λ∗ ) = L(x∗ , λ∗ ).

But since λ∗i gi (x∗ ) = 0, we have

L(x∗ , λ∗ ) = f (x∗ ) + Σi λ∗i gi (x∗ ) = f (x∗ ).

Therefore, g(λ∗ ) = L(x∗ , λ∗ ) = f (x∗ ), i.e., (x∗ , λ∗ ) is a saddle point of L(x, λ).
Thus x∗ solves the primal problem.

5 Subdifferentials
A convex function does not have to be differentiable. However, even when it
is not differentiable, there is considerable structure in how it varies locally.
Subdifferentials generalize derivatives (or gradients) and capture this structure
for convex functions. In addition, they enjoy a certain calculus, which proves
very useful in deriving minimization algorithms. We introduce, and discuss
some basic properties of subdifferentials in this chapter. We start with some
motivations underlying definitions and basic properties. We then explore con-
nections with KKT conditions. Finally, we discuss the notion of monotonicity,
a fundamental property of subdifferentials.

5.1 Motivation, Definition, Properties of Subdifferen-


tials
Recall that if f : Rn → R is differentiable and convex, then

f (x) ≥ f (y) + hx − y, ∇f (y)i

for all x, y. In fact, s = ∇f (y) is the unique vector s that satisfies the inequality below:

f (x) ≥ f (y) + ⟨x − y, s⟩ for all x. (5.11)

[Figure: the tangent line f (y) + ⟨x − y, ∇f (y)⟩ supporting the graph of f (·) at y.]

This useful observation has the shortcoming that it requires f to be differ-


entiable. In general, a convex function may not be differentiable – consider
for instance f (x) = |x|. We define the subdifferential by making use of the
inequality (5.11).

Definition 19. Let f : Rn → R be convex. The subdifferential of f at y is


the set of s that satisfy

f (x) ≥ f (y) + hx − y, si for all x.

This set is denoted by ∂f (y). 

Since for every y ∈ dom(f ), we can find r, s such that

(i) r + ⟨s, y⟩ = f (y),
(ii) r + ⟨s, x⟩ ≤ f (x) for all x,

it follows that ∂f (y) is non-empty. ∂f (·) has other interesting properties as


well.
Proposition 29. For every y ∈ dom(f ), ∂f (y) is a convex set.

Proof. Suppose s1 , s2 ∈ ∂f (y). Then, we have

f (y) + ⟨x − y, s1 ⟩ ≤ f (x),
f (y) + ⟨x − y, s2 ⟩ ≤ f (x).

Taking a convex combination of these inequalities, we find, for α ∈ [0, 1],

f (y) + ⟨x − y, αs1 + (1 − α)s2 ⟩ ≤ f (x),

so that αs1 + (1 − α)s2 ∈ ∂f (y).
Example 21. Let f (x) = |x|. Then

∂f (x) = {−1} if x < 0,  [−1, 1] if x = 0,  {1} if x > 0.
Notice that in the example above, wherever the function is differentiable, the subdifferential coincides with the gradient. This holds in general.
Proposition 30. If f : Rn → R is differentiable at x, then ∂f (x) = {∇f (x)}.

Proof. Suppose s ∈ ∂f (x). In this case, for any d ∈ Rn and t > 0, we have

f (x + t d) ≥ f (x) + ⟨s, t d⟩ ⇐⇒ ⟨s, d⟩ ≤ [ f (x + td) − f (x) ] / t.
Now let t → 0. The inequality above implies

hs, di ≤ h∇f (x), di, for all d. (5.12)

Now notice that


f (x − td) − f (x)
hs, −di ≤ for all t, d.
t
Therefore,

hs, −di ≤ h∇f (x), −di, for all d. (5.13)

Taken together, (5.12), (5.13) imply hs, di = h∇f, di for all d. Thus, s =
∇f (x).

The subdifferential can be used to characterize the minimizers of convex func-
tions.

Proposition 31. Let f : Rn → R be convex. Then, x∗ ∈ arg min_x f (x) if and only if 0 ∈ ∂f (x∗ ).

Proof. This follows from the equivalences

x∗ ∈ arg min_x f (x)
⇐⇒ f (x∗ ) ≤ f (x) for all x,
⇐⇒ f (x) ≥ f (x∗ ) + ⟨0, x − x∗ ⟩ for all x,
⇐⇒ 0 ∈ ∂f (x∗ ).

Proposition 32. Let f , g : Rn → R be convex functions. Also let h = f + g.


If x ∈ dom(f ) ∩ dom(g), then

∂h(x) = ∂f (x) + ∂g(x).

Proof. Write A = ∂f (x), B = ∂g(x), C = ∂h(x). Let s1 ∈ A, s2 ∈ B. Then,

[ f (x) + ⟨y − x, s1 ⟩ ] + [ g(x) + ⟨y − x, s2 ⟩ ] ≤ f (y) + g(y).

Therefore s1 + s2 ∈ C. Since s1 and s2 are arbitrary members of A, B, it follows that A + B ⊂ C.
For the converse (i.e., C ⊂ A + B), we need to show that for any z ∈ ∂h(x) we can find u ∈ ∂f (x) such that z − u ∈ ∂g(x). This is equivalent to saying that we can find u ∈ ∂f (x), v ∈ ∂g(x) such that z = u + v. We take a detour to show this result.

Let us now study the link between conjugate functions and subdifferentials. This will complete the remaining part of the proof of Prop. 32.

Proposition 33. Let f : Rn → R be a closed, convex function and f ∗ denote its conjugate. Then s ∈ ∂f (x) if and only if

f (x) + f ∗ (s) = ⟨s, x⟩. (5.14)
Proof. Since f (x) = sup_z [ ⟨z, x⟩ − f ∗ (z) ], we have

f (x) + f ∗ (z) ≥ ⟨z, x⟩, for all z, x. (5.15)

Consider now the following chain of equivalences:

s ∈ ∂f (x)
⇐⇒ f (x) − ⟨s, x⟩ ≤ f (y) − ⟨s, y⟩ for all y
⇐⇒ f (x) − ⟨s, x⟩ ≤ inf_y [ f (y) − ⟨s, y⟩ ] = − sup_y [ ⟨s, y⟩ − f (y) ] = −f ∗ (s). (5.16)

Combining the two inequalities (5.15), (5.16), we obtain (5.14).

Corollary 4. x ∈ ∂f ∗ (s) if and only if f (x) + f ∗ (s) = hx, si.

Corollary 5. 0 ∈ ∂f (x) if and only if x ∈ ∂f ∗ (0).

Proof. Consider the following chain of equivalences:

0 ∈ ∂f (x) ⇐⇒ f (x) + f ∗ (0) = ⟨0, x⟩ ⇐⇒ x ∈ ∂f ∗ (0).

Example 22. Recall the definition of a support function : for a closed convex
set C, let σC (x) = supz∈C hz, xi. Recall also that σC∗ (s) = iC (s). Therefore,

s ∈ ∂σC (x) ⇐⇒ σC (x) + iC (s) = ⟨s, x⟩
⇐⇒ σC (x) = ⟨s, x⟩ and s ∈ C.

In words, ∂σC (x) is the set of s ∈ C for which σC (x) = hs, xi.
[Insert Fig. on p37] 

Example 23. Consider now a closed convex set C, and let us obtain a de-
scription of the subdifferential of the characteristic function of C at x, i.e.,
∂iC (x).
Observe that

s ∈ ∂iC (x) ⇐⇒ iC (y) ≥ iC (x) + hs, y − xi, for all y.

If x ∉ C, the inequality cannot be satisfied for any s (take y ∈ C). In this case, ∂iC (x) = ∅.
If x ∈ C, then there are two cases to consider

(i) If y ∈
/ C, then iC (y) = ∞, and the inequality is always satisfied.

(ii) If y ∈ C, then

0 ≥ hs, y − xi, ∀y ∈ C ⇐⇒ s ∈ NC (x).

In summary, ∂iC (x) = NC (x) if x ∈ C, and ∂iC (x) = ∅ if x ∉ C.

See below for when C is a rectangular set.

[Figure: a rectangle C with the sets x + ∂iC (x) = x + NC (x) drawn at several points x (interior, edge, corner).]

We now go back to Prop. 32 and complete the proof. Recall that what remains is to show that ∂h(x) ⊂ ∂f (x) + ∂g(x), where h(x) = f (x) + g(x) for convex functions f (x), g(x).

Proof of Prop. 32, cont’d: Let u ∈ ∂h(x). This implies

h(x) + h∗ (u) = ⟨x, u⟩.

Plugging in the definition of h, and using property (vii) of conjugates, we can write

f (x) + g(x) + inf_z [ f ∗ (z) + g ∗ (u − z) ] = ⟨x, u⟩.

Suppose the infimum is achieved when z = s. Then

[ f (x) + f ∗ (s) − ⟨x, s⟩ ] + [ g(x) + g ∗ (u − s) − ⟨u − s, x⟩ ] = 0.

Both terms in square brackets are non-negative by Fenchel’s inequality. Therefore, for equality to hold, we need both terms to be zero. Thus, we can write

f (x) + f ∗ (s) = ⟨x, s⟩,
g(x) + g ∗ (u − s) = ⟨u − s, x⟩.

Thus, s ∈ ∂f (x) and u − s ∈ ∂g(x), so that

u = s + (u − s) ∈ ∂f (x) + ∂g(x).

Since u was arbitrary, this implies ∂h(x) ⊂ ∂f (x) + ∂g(x).

From this discussion, we obtain the following corollary concerning constrained minimization.
Corollary 6. Consider the problem

min f (x),
x∈C

where f is a convex function, C is a convex set. x∗ is a solution of this problem


if and only if

0 ∈ ∂f (x∗ ) + NC (x∗ ).

5.2 Connection with the KKT Conditions


Let us now study what the condition stated in Corollary 6 means. Let us
consider the problem

min_x f (x) subject to g1 (x) ≤ 0, g2 (x) ≤ 0,

where gi are convex and differentiable. Let Ci = {x : gi (x) ≤ 0}. Note that
both Ci ’s are convex sets. Suppose gi (x) < 0. Then NCi (x) = {0}. However,
if gi (x) = 0, then

NCi (x) = {s : ⟨s, y − x⟩ ≤ 0 for all y with gi (y) ≤ 0}.

That is, NCi (x) = {α∇gi (x) : α ≥ 0}. Also, if gi (x) > 0, then NCi (x) = ∅.
Therefore, the condition

0 ∈ ∇f (x) + NC1 (x) + NC2 (x)

is equivalent to

0 = ∇f (x) + α1 ∇g1 (x) + α2 ∇g2 (x), where αi ≥ 0, αi gi (x) = 0, and gi (x) ≤ 0 for i = 1, 2.

These are precisely the KKT conditions.

5.3 Monotonicity of the Subdifferential


Recall that for a convex differentiable f , we had

h∇f (x) − ∇f (y), x − yi ≥ 0, for all x, y.

A similar property holds for the subdifferential.

Proposition 34. Suppose f : Rn → R is convex. If s ∈ ∂f (x) and z ∈ ∂f (y),
then
hs − z, x − yi ≥ 0. (5.17)

Proof. Observe that


s ∈ ∂f (x) =⇒ f (y) ≥ f (x) + hs, y − xi
z ∈ ∂f (y) =⇒ f (x) ≥ f (y) + hz, x − yi.
Summing the inequalities, we obtain (5.17).
n
Definition 20. An operator T : Rn → 2R is said to be monotone if hs−z, x−
yi ≥ 0, for all x, y, and s ∈ T (x), z ∈ T (y). 

It is useful to think of set-valued operators in terms of their graphs.


n
Definition 21. The graph of T : Rn → 2R is the set of u, v such that v ∈ T (u).


Notice that for a convex function, the graph of ∂f is the set of (x, u) such that
x ∈ dom(f ) and u ∈ ∂f (x).
A curious property satisfied by ∂f (x) is ‘maximal’ monotonicity.
n
Definition 22. T : Rn → 2R is said to be maximal monotone if there is no
monotone operator F , such that the graph of T is a strict subset of the graph
of F . 
Example 24. For f (x) = |x|, the graph of ∂f (x) is shown below.
[Insert Fig on p.41]. 

We state the following fact without proof. See Rockafellar’s book for a proof.
Proposition 35. If T = ∂f for a closed convex function f , then T is maximal
monotone.

Let us now revisit a minimization problem like minx f (x), for a convex f .
In terms of the subdifferential of f , this is equivalent to looking for x such
that 0 ∈ T (x), where T = ∂f . An equivalent problem is to find x such that
x ∈ x + λ T (x), for λ > 0.
n
Definition 23. Given S : Rn → 2R , S −1 is the operator whose graph is the
set of (u, v) where u ∈ S(v). 

Note that, by the foregoing discussion, minx f (x) is equivalent to finding x


such that x = (I + λT )−1 x. In the following, we study the properties of
Jλ T = (I + λT )−1 .
Let us first write down our previous observation as a proposition.

Proposition 36. If f (z) ≤ f (x) for all x, then JλT (z) = z.

Definition 24. An operator U : Rn → Rn is said to be firmly-nonexpansive if

kU (x) − U (y)k22 + k(I − U )(x) − (I − U )(y)k22 ≤ kx − yk22 .

Proposition 37. If T is monotone, then (I + T )−1 is firmly non-expansive.

Proof. Note that,

k(I − J)x − (I − J)yk22 = kx − yk22 + kJx − Jyk22 − 2hJx − Jy, x − yi.

Therefore, if we can show that

hJx − Jy, x − yi ≥ kJx − Jyk22 , (5.18)

we are done.
Now suppose

x = u + v, with v ∈ T (u),
y = z + t, with t ∈ T (z).

Then, J(x) = u and J(y) = z. This implies

⟨Jx − Jy, x − y⟩ = ⟨u − z, (u − z) + (v − t)⟩ = ‖u − z‖2² + ⟨u − z, v − t⟩.

The last term is non-negative by the monotonicity of T . Also, since ‖Jx − Jy‖2² = ‖u − z‖2², the inequality (5.18) follows.

Alternative Proof :

T is monotone
⇐⇒ ⟨x′ − x, y′ − y⟩ ≥ 0 for all (x, y), (x′ , y′ ) in the graph of T,
⇐⇒ ⟨(x′ + y′ ) − (x + y), x′ − x⟩ ≥ ‖x′ − x‖2² for all (x, y), (x′ , y′ ) in the graph of T.

Proposition 38. Suppose f : Rn → R is differentiable, convex and

k∇f (x) − ∇f (y)k2 ≤ Lkx − yk2 , (5.19)

for any x, y pair. Then,

f (x) ≤ f (y) + ⟨∇f (y), x − y⟩ + (L/2) ‖x − y‖2².

Proof. Consider the function g(s) = f (y + s (x − y)). Observe that g(0) = f (y), g(1) = f (x), and

g(1) − g(0) = ∫₀¹ g′(s) ds
            = ∫₀¹ ⟨∇f (y + s (x − y)), x − y⟩ ds
            = ∫₀¹ [ ⟨∇f (y + s (x − y)) − ∇f (y), x − y⟩ + ⟨∇f (y), x − y⟩ ] ds
            ≤ ∫₀¹ [ ‖∇f (y + s (x − y)) − ∇f (y)‖2 ‖x − y‖2 + ⟨∇f (y), x − y⟩ ] ds
            ≤ ∫₀¹ [ L s ‖x − y‖2² + ⟨∇f (y), x − y⟩ ] ds
            = (L/2) ‖x − y‖2² + ⟨∇f (y), x − y⟩,

where we applied the Cauchy-Schwarz inequality and (5.19) to obtain the two inequalities.

Now suppose we want to minimize f and it satisfies the hypothesis of Prop. 38,
namely (5.19). We will derive an algorithm which will start from some initial
point x0 and produce a sequence that converges to the minimizer.
Suppose that at the k-th step, we have xk , and we set xk+1 as

xk+1 = arg min_x { Fk (x) = f (xk ) + ⟨∇f (xk ), x − xk ⟩ + (α/2) ‖x − xk ‖2² }.
Observe that

(i) Fk (xk ) = f (xk ),
(ii) Fk (x) ≥ f (x) for all x, provided α ≥ L (by Prop. 38).

[Figure: the quadratic majorizer Fk (·) lying above f (·), touching it at xk .]
Therefore,

f (xk+1 ) ≤ Fk (xk+1 ) ≤ Fk (xk ) = f (xk ).

In words, at each iteration, we reduce the value of f . Let us derive what xk+1 is. Observe that

0 = ∇f (xk ) + α(xk+1 − xk ).

This is equivalent to

xk+1 = xk − (1/α) ∇f (xk ).
This is an instance of the steepest-descent algorithm with a fixed step-size.
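For a concrete instance (a minimal sketch assuming Python with NumPy, not part of the original notes; the quadratic objective and the choice α = L are just an example):

    import numpy as np

    # minimize f(x) = (1/2) x^T A x - b^T x, whose gradient is Ax - b
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    grad = lambda x: A @ x - b

    alpha = np.linalg.eigvalsh(A).max()    # a Lipschitz constant of the gradient
    x = np.zeros(2)
    for _ in range(200):
        x = x - grad(x) / alpha            # x_{k+1} = x_k - (1/alpha) grad f(x_k)

    print(x, np.linalg.solve(A, b))        # the iterate approaches the minimizer A^{-1} b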

6 Applications to Algorithms
We now consider the application of the foregoing discussion of convex analysis
to algorithms. We start with the proximal point algorithm. Although the prox-
imal point algorithm in its first form is almost never used in minimization
algorithms, its convergence proof contains key ideas that are relevant for more
complicated algorithms. In fact, the algorithms discussed in the sequel can be
written as proximal point algorithms. We then discuss firmly non-expansive
operators, which may be regarded as building blocks for developing convergent
algorithms. Following this, we discuss the augmented Lagrangian, the Douglas-
Rachford algorithm, and the alternating direction method of multipliers algo-
rithm (ADMM). The last section considers a generalization of the proximal
point algorithm, along with an application to a saddle point problem. I hope to
add more algorithms to this collection...

6.1 The Proximal Point Algorithm


Consider the minimization of a convex function f (x). The proximal point al-
gorithm constructs a sequence xk that converges to a minimizer. The sequence
is defined as

xk+1 = arg min_x [ (1/(2α)) ‖x − xk ‖2² + f (x) ], (6.20)

and we denote the minimized function by Fk (x).

The function Fk (x) is a perturbation of f (x) around xk . The quadratic term


ensures that xk+1 is close to xk .
In fact, this algorithm may actually be regarded as a majorization-minimization (MM) algorithm, since

• Fk (xk ) = f (xk ),
• Fk (x) ≥ f (x) for all x.

It follows immediately that


f (xk+1 ) ≤ Fk (xk+1 ) ≤ Fk (xk ) = f (xk )

In terms of subdifferentials, we have


0 ∈ (xk+1 − xk ) + α ∂f (xk+1 ).
Equivalently, we can write
xk ∈ (I + α∂f ) xk+1 .
Thus, at the k th step of PPA, we are essentially computing the inverse of
(I + α ∂f ) at xk . Notice that, since Fk (x) in (6.20) is strictly convex, xk+1 is
uniquely defined. Therefore, (I + α ∂f )−1 is a single-valued operator.

Definition 25. For a convex f , the operator Jf = (I + ∂f )−1 is called the
proximity operator of f . 
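Two standard proximity operators, stated here without derivation as a sketch (assuming Python with NumPy; not part of the original notes): the proximity operator of α|·| is the soft-threshold, and the proximity operator of the indicator of a set is the projection onto it.

    import numpy as np

    def prox_abs(x, alpha):
        # proximity operator of alpha*|.| (soft threshold), applied componentwise
        return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

    def prox_indicator_box(x, lo, hi):
        # proximity operator of the indicator of the box [lo, hi]^n = projection onto the box
        return np.clip(x, lo, hi)

    x = np.array([2.0, -0.3, 0.7])
    print(prox_abs(x, 0.5))                  # [1.5, -0., 0.2] (the middle entry shrinks to zero)
    print(prox_indicator_box(x, -1.0, 1.0))  # [1., -0.3, 0.7]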

The proximity operator of a convex function acts like a projection operator in


the following sense.

Proposition 39. For a proximity operator Jαf , we have

hJαf (x) − Jαf (y), x − yi ≥ kJαf (x) − Jαf (y)k22 . (6.21)

Proof. Let x′ = Jαf (x) and y′ = Jαf (y). Then, we have

x ∈ (I + α ∂f ) x′ ⇐⇒ (x − x′ ) ∈ α∂f (x′ ),
y ∈ (I + α ∂f ) y′ ⇐⇒ (y − y′ ) ∈ α∂f (y′ ).

Thanks to the monotonicity of α∂f , we can therefore write

⟨x′ − y′ , (x − x′ ) − (y − y′ )⟩ ≥ 0
⇐⇒ ⟨x′ − y′ , x − y⟩ ≥ ‖x′ − y′ ‖2²,

which is what we wanted to show.

The property in (6.21) is a key observation so we give it a name.

Definition 26. An operator F is said to be firmly-nonexpansive if

hF x − F y, x − yi ≥ kF x − F yk22 .

Thus, proximity operators are firmly-nonexpansive.


Let us now make another observation regarding the proximity operator. Sup-
pose x∗ minimizes f . For α > 0, this is equivalent to

0 ∈ ∂f (x∗ )
⇐⇒ 0 ∈ α ∂f (x∗ )
⇐⇒ x∗ ∈ x∗ + α ∂f (x∗ )
⇐⇒ x∗ = Jαf (x∗ ).

The last equality is a very special one.

Definition 27. A point x is said to be a fixed point of an operator F if


x = F (x). 

We have thus shown the following.

Proposition 40. A point x∗ minimizes the convex function f if and only if
x∗ = Jαf (x∗ ) for all α > 0.

The following theorem is known as the Krasnoselskii-Mann theorem.

Theorem 1 (Krasnoselskii-Mann). Suppose F is a firmly-nonexpansive map-


ping on RN and its set of fixed points is non-empty. Let x0 ∈ RN . If the
sequence xk is defined as xk+1 = F xk , then xk converges to a fixed point of F .

Before we prove the theorem, let us state an auxiliary result of interest, that
provides an alternative definition of firm-nonexpansivity.

Lemma 2. For an operator F , the following conditions are equivalent.

• hF x − F y, x − yi ≥ kF x − F yk22
• kF x − F yk22 + k(I − F )x − (I − F )yk22 ≤ kx − yk22

Proof. Exercise!
Hint : Expand the expression

kF x − F yk22 + k(x − y) − (F x − F y)k22 .

We are now ready for the proof of the Krasnoselskii-Mann Theorem.

Proof of the Krasnoselskii-Mann Theorem. Pick some z such that z = F z. We


have,

    \|Fx^k − z\|_2^2 = \|Fx^k − Fz\|_2^2 ≤ \|x^k − z\|_2^2 − \|(I − F)x^k − \underbrace{(I − F)z}_{=0}\|_2^2 .

Therefore,

    \|x^{k+1} − z\|_2^2 ≤ \|x^k − z\|_2^2 − \|x^k − x^{k+1}\|_2^2 .    (6.22)

Summing this inequality from k = 0 to n, and rearranging, we obtain


    \|x^{n+1} − z\|_2^2 ≤ \|x^0 − z\|_2^2 − \sum_{k=0}^{n} \|x^k − x^{k+1}\|_2^2 .

From this inequality, we obtain


    \sum_{k=0}^{n} \|x^k − x^{k+1}\|_2^2 ≤ \|x^0 − z\|_2^2 .

Therefore, kxk − xk+1 k2 → 0, which implies (I − F )xk → 0.
Also, since kxn − zk2 ≤ kx0 − zk2 for any n, the sequence xk is bounded.
Therefore, we can pick a convergent subsequence {xkn }n with limit, say x∗ .
By the continuity of I − F , we have,
    0 = \lim_{n→∞} (I − F)x^{k_n} = (I − F)x^∗ .

Therefore x∗ = F x∗ . If we now plug in z = x∗ in (6.22), we have


kxkn +m − x∗ k2 ≤ kxkn − x∗ k2 for all m > 0.
Since the kxkn −x∗ k2 can be made arbitrarily small by choosing n large enough,
it follows that xk → x∗ .

An immediate corollary of this general result (which we will refer to later) is


the convergence of PPA.
Corollary 7. Provided f has a minimizer, the sequence x^k constructed by the
proximal point algorithm in (6.20) converges to a minimizer of f for any α > 0.

Thanks to the Krasnoselskii-Mann theorem, firmly-nonexpansive operators


play a central role in the convergence study of a number of algorithms. We
now provide a brief study of these operators. We state the results in their most
general form, for later reference.

6.2 Firmly-Nonexpansive Operators


We already saw that proximity operators are firmly-nonexpansive. However,
not all firmly-nonexpansive operators are proximity operators. Firmly-nonexpansive
operators can be generated from monotone, or maximal monotone operators.
Specifically, suppose T is a monotone operator with domain RN and consider
SαT = (I + αT )−1 . Let us name this operator, for it has interesting properties
that will be useful later.
Definition 28. For a monotone operator T , the operator ST = (I + T )−1 is
called the resolvent of T . 

We pose two questions regarding resolvent operators.

• Is SαT defined everywhere?


• Is SαT single-valued?

We noted earlier that for a convex function f , if T = ∂f , then the answer to


both questions is affirmative. However, for a general monotone operator, SαT
may not be defined everywhere. Let us consider an example to demonstrate
this.

Example 25. Suppose we define T : R → R as the unit step function, i.e.,

    T(x) = \begin{cases} 0, & \text{if } x ≤ 0, \\ 1, & \text{if } x > 0. \end{cases}

Consider the operator I + T . Observe that

    (I + T)(x) = \begin{cases} x, & \text{if } x ≤ 0, \\ x + 1, & \text{if } x > 0. \end{cases}

Notice that the interval (0, 1] is not included in the range of I + T . Therefore,
ST = (I + T)^{−1} is not defined for any point in (0, 1]. 
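A tiny numerical illustration (added here, not in the original notes): evaluating x + T(x) on a fine grid shows that no value in the interval (0, 1] is ever attained, so the inverse cannot be defined there.

```python
import numpy as np

# The unit step operator of Example 25 and the map (I + T).
xs = np.linspace(-2.0, 2.0, 100001)
vals = xs + (xs > 0).astype(float)          # (I + T)(x)
hit = np.any((vals > 0.0) & (vals <= 1.0))  # any value in the interval (0, 1]?
print(hit)   # False: (0, 1] is missing from the range of I + T
```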

Nevertheless, in the range of I + α T , the inverse (I + α T )−1 is single-valued.

Proposition 41. Suppose T is a monotone mapping and α > 0. Then, for


any x in the range of the operator I + αT , we can find a unique z such that
(I + α T ) z = x. That is, (I + αT )−1 is a single-valued operator on its domain.

Proof. First, observe that for α > 0, the operator T is monotone if and only
if αT is monotone. Therefore, without loss of generality, we take α = 1.
We first show that for w ∈ T u and z ∈ T y, if u + w = y + z, then u = y. To
see this, observe that

u+w−y−z =0

implies

k(u − y) + (w − z)k22 = 0,

which is equivalent to

ku − yk22 + kw − zk22 + 2hu − y, w − zi = 0. (6.23)

But the inner product term in (6.23) is non-negative thanks to the monotonic-
ity of T . Therefore, in order for equality to hold in (6.23), we must have u = y
and w = z.
Now, if x is in the range of I + αT , then there is a point v and u ∈ αT v such
that u + v = x. But the observation above implies that if this is the case, then
v is unique and, in fact, v = (I + αT)^{−1} x.

We noted that for a monotone operator T , the range of I + αT may be a


strict subset of Rn . This restricts the domain of SαT . For maximal monotone
operators, the range of I + αT is in fact the whole RN . This non-trivial result
(which we will not prove) is known as Minty’s theorem.

Theorem 2 (Minty’s Theorem). Suppose T is a maximal monotone operator
defined on RN and α > 0. Then, the range of I + αT is RN .

To demonstrate Minty’s theorem, let us consider the maximal monotone op-


erator whose graph is a superset of the graph of the operator in Example 25.

Example 26. Suppose the set-valued operator T̃ is defined as

    T̃(x) = \begin{cases} \{0\}, & \text{if } x < 0, \\ [0, 1], & \text{if } x = 0, \\ \{1\}, & \text{if } x > 0. \end{cases}

Consider now the operator I + T̃ . Observe that

    (I + T̃)(x) = \begin{cases} \{x\}, & \text{if } x < 0, \\ [0, 1], & \text{if } x = 0, \\ \{x + 1\}, & \text{if } x > 0. \end{cases}

Notice that for any y ∈ R^N , we can find x such that y ∈ (I + T̃)x. Therefore,
the range of I + T̃ is the whole R^N .

Minty’s theorem ensures that the domain of SαT is the whole space if T is
maximal monotone. Therefore, we can state the following corollary.

Corollary 8. If T is a maximal monotone operator on RN , then SαT is single


valued and defined for all x ∈ RN .

We have the following generalization of Prop. 39 in this case.

Proposition 42. Suppose T is maximal monotone and α > 0. Then, SαT is


firmly-nonexpansive.

Proof. Exercise!
(Hint : Consider the argument used in the proof of Prop. 39.)

We remark that a subdifferential is maximal monotone. Therefore, a proximity


operator is in fact the resolvent of a maximal monotone operator and Prop. 39
is a special case of Prop. 42.
We introduce a final object called the ‘reflected resolvent’ that will be of in-
terest in the convergence analysis of algorithms that will follow. Let us first
state a result to motivate the definition.

Proposition 43. An operator S is firmly non-expansive if and only if 2S − I


is non-expansive.

Proof. Suppose S is firmly nonexpansive and let N = 2S − I. Observe that

    \|Nx − Ny\|_2^2 = \|x − y\|_2^2 + \big\{ 4\|Sx − Sy\|_2^2 − 4\langle Sx − Sy, x − y\rangle \big\}.

But the term inside the curly brackets above is non-positive, thanks to the firm-
nonexpansivity of S. Thus we obtain

    \|Nx − Ny\|_2 ≤ \|x − y\|_2 .

Conversely, suppose N = 2S − I is non-expansive. Then, S = \tfrac{1}{2} I + \tfrac{1}{2} N .
Observe also that I − S = \tfrac{1}{2} I − \tfrac{1}{2} N . We compute

    \|Sx − Sy\|_2^2 + \|(I − S)x − (I − S)y\|_2^2 = \tfrac{1}{2}\|x − y\|_2^2 + \tfrac{1}{2}\|Nx − Ny\|_2^2 ≤ \|x − y\|_2^2 .

Thus, S is firmly nonexpansive by Lemma 2.


Definition 29. Suppose T is maximal monotone and α > 0. The operator
NT = 2 ST − I is called the reflected resolvent of T . 

Thus, the reflected resolvent of a maximal monotone operator is non-expansive.


Let us state Prop. 43 from another viewpoint for later reference.
 
Corollary 9. N is nonexpansive if and only if \tfrac{1}{2} I + \tfrac{1}{2} N is firmly-nonexpansive.

Another useful observation is the following.


Corollary 10. If f is a convex function and α > 0, then (2Jαf − I) is non-
expansive.

The following is a useful property to know concerning reflected resolvents


(which, in fact, justifies the term ‘reflected’).
Proposition 44. Suppose T is maximal monotone. Then, NT (x + y) = x − y
if and only if y ∈ T x.

Proof. The claim follows from the following chain of equivalences.

NT (x + y) = x − y
⇐⇒ 2 ST (x + y) − (x + y) = x − y
⇐⇒ ST (x + y) = x
⇐⇒ x + y ∈ (I + T ) x
⇐⇒ y ∈ T x.

6.3 The Dual PPA and the Augmented Lagrangian
Consider now a problem of the form

    \min_x f(x) \quad \text{subject to} \quad Ex = d,    (6.24)

where E is a matrix.
In order to solve this problem, we will apply PPA to the dual problem. For
this, let us derive a dual problem through the use of Lagrangians. Let

    L(x, λ) = f(x) + \langle λ, Ex − d\rangle.

Then, (6.24) can be expressed as

    \min_x \max_λ \; f(x) + \langle λ, Ex − d\rangle.

In order to obtain the dual problem, we change the order of min and max.
This gives us the dual problem

    \max_λ \; g(λ),

where

    g(λ) = \min_x \; f(x) + \langle λ, Ex − d\rangle.

Recall that if

    λ^∗ ∈ \arg\max_λ g(λ),

then

    x^∗ ∈ \arg\min_x L(x, λ^∗)

is a solution of the original problem (6.24). To find λ^∗ , we apply PPA to the
dual problem and define a sequence as
    λ^{k+1} = \arg\max_λ \; g(λ) − \frac{1}{2α}\|λ − λ^k\|_2^2 .

To find λ^{k+1} , we need to solve

    \max_λ \min_x \; f(x) + \langle λ, Ex − d\rangle − \frac{1}{2α}\|λ − λ^k\|_2^2 .    (6.25)

Let (x^{k+1}, λ^{k+1}) denote the solution of this saddle point problem. To find this
point, suppose we first tackle the maximization. Observe that, for fixed x, the
optimality condition for the maximization part implies

    0 = (Ex − d) − \frac{1}{α}(λ^∗ − λ^k)
    ⇐⇒ λ^∗ = λ^k + α(Ex − d).

To find x^{k+1} , plug this expression into the saddle point problem (6.25):

    x^{k+1} = \arg\min_x \; f(x) + \langle λ^k + α(Ex − d), Ex − d\rangle − \frac{1}{2α}\|λ^k + α(Ex − d) − λ^k\|_2^2
            = \arg\min_x \; f(x) + \langle λ^k , Ex − d\rangle + \frac{α}{2}\|Ex − d\|_2^2 .
To summarize, the dual PPA algorithm is

    x^{k+1} = \arg\min_x L_A(x, λ^k),
    λ^{k+1} = λ^k + α(Ex^{k+1} − d),

where

    L_A(x, λ) = f(x) + \langle λ, Ex − d\rangle + \frac{α}{2}\|Ex − d\|_2^2 .

The function L_A is called the augmented Lagrangian. Notice that L_A is similar
to the Lagrangian L but contains the additional quadratic penalty term \frac{α}{2}\|Ex − d\|_2^2 ,
justifying the expression ‘augmented’.
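To make the iterations concrete, here is a minimal numerical sketch (added here, not from the original notes) of the dual PPA / augmented Lagrangian method for the particular choice f(x) = 0.5‖x − c‖², for which the x-update reduces to a linear solve; E, d, c and α below are arbitrary illustrative data.

```python
import numpy as np

# Minimal sketch of the augmented Lagrangian iterations for
#   min_x 0.5 * ||x - c||_2^2   subject to   E x = d.
rng = np.random.default_rng(1)
E = rng.normal(size=(3, 6))
d = rng.normal(size=3)
c = rng.normal(size=6)
alpha = 1.0

lam = np.zeros(3)
A = np.eye(6) + alpha * E.T @ E        # system matrix of the x-update
for _ in range(200):
    # x-update: minimize L_A(x, lam) in closed form
    x = np.linalg.solve(A, c - E.T @ lam + alpha * E.T @ d)
    # multiplier update
    lam = lam + alpha * (E @ x - d)

print("constraint residual:", np.linalg.norm(E @ x - d))
```

The printed residual tends to zero, in agreement with the convergence discussion that follows.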
Let us now consider the convergence of this algorithm. Since the λ^k are produced
by applying PPA to the dual problem, the sequence λ^k is convergent, so that
λ^{k+1} − λ^k → 0. On the other hand, the multiplier update gives

    Ex^{k+1} − d = \frac{1}{α}(λ^{k+1} − λ^k),

so that

    \lim_k \; Ex^k − d = 0.

Therefore, if x^∗ is a limit point of x^k , then Ex^∗ = d. Moreover, if x̄ is an arbitrary
vector such that E x̄ = d, then, passing to the limit in the x-update, x^∗ minimizes
L_A(·, λ^∗), and we have

    f(x^∗) = L_A(x^∗ , λ^∗) ≤ L_A(x̄, λ^∗) = f(x̄).

Therefore x^∗ is a solution.
Note that this argument is not necessarily a convergence proof, because it
depends on the assumption that x^∗ is a limit point of the sequence of x^k ’s.

6.4 The Douglas-Rachford Algorithm


Consider now a problem of the form

    \min_x \; f(x) + g(x),    (6.26)

where both f and g are convex. Let F = ∂f and G = ∂g. If x is a solution
of this problem, we should have

    0 ∈ Fx + Gx.    (6.27)

Let us now derive equivalent expressions of this inclusion, to obtain a fixed
point iteration. Suppose we fix α > 0. Then, (6.27) and the following state-
ments are equivalent.

    There exist u ∈ Fx, z ∈ Gx such that 0 = u + z
    ⇐⇒ There exist u ∈ Fx, z ∈ Gx such that x + αz = x − αu
    ⇐⇒ There exists u ∈ Fx such that x = (I + αG)^{−1}(x − αu)    (6.28)
    ⇐⇒ x ∈ (I + αG)^{−1}(I − αF) x    (6.29)

At this point, observe that if f is differentiable, then F is single-valued. In that
case, the inclusion in (6.29) is actually an equality (at the risk of confusing matters:
this statement is correct if we regard the range of F as R^N rather than 2^{R^N}). This
suggests the following fixed point iteration, known as the forward-backward
splitting algorithm:

    x^{k+1} = (I + αG)^{−1}(I − αF) x^k .

We will discuss this algorithm later.
We would like to derive an algorithm that employs Jαf = (I +αF )−1 . For this,
let us now backtrack to (6.28) and write down another equivalent statement.
⇐⇒ There exist u ∈ F x such that x+αu = (I +αG)−1 (x−αu)+αu (6.30)
If we now define a new variable t = x + αu, we have the following useful
equalities :
x = (I + α F )−1 t = Jαf (t)
αu = t − x = (I − Jαf ) t
Plugging these in (6.30), we obtain the following proposition.
Proposition 45. Suppose f and g are convex and the minimization problem
(6.26) has at least one solution. A point x is a solution of (6.26) if and only if

    x = J_{αf}(t), \quad \text{for some } t \text{ that satisfies } t = \big( J_{αg}(2J_{αf} − I) + (I − J_{αf}) \big)(t).    (6.31)

Thus, if we can obtain t that satisfies the fixed point equation in (6.31), we can
obtain the solution to our minimization problem as Jαf (t). A natural choice
is to consider the fixed point iterations

    t^{k+1} = \big( J_{αg}(2J_{αf} − I) + (I − J_{αf}) \big)(t^k).


These constitute the Douglas-Rachford iterations. By a little algebra, we can


put them in a form that is easier to interpret. For this, observe that

    J_{αg}(2J_{αf} − I) + (I − J_{αf}) = J_{αg}(2J_{αf} − I) − \tfrac{1}{2}(2J_{αf} − I) + \tfrac{1}{2} I
                                       = \tfrac{1}{2} I + \tfrac{1}{2}(2J_{αg} − I)(2J_{αf} − I).    (6.32)

The convergence of the algorithm is easier to see in this form. Recall that,
Corollary 10 implies the non-expansivity of (2Jαf − I) and (2Jαg − I). But
composition of non-expansive operators is also non-expansive. Therefore, the
composite operator (2Jαg − I) (2Jαf − I) is non-expansive. Finally, by Corol-
lary 9, we can conclude that the operator in (6.32) is firmly-nonexpansive.
Combining this observation with Prop. 45, we obtain the following convergence
result as a consequence of the Krasnoselskii-Mann theorem (i.e., Thm. 1).
Proposition 46. Suppose f and g are convex and the minimization problem
(6.26) has at least one solution. The sequence constructed as

    t^{k+1} = \Big( \tfrac{1}{2} I + \tfrac{1}{2}(2J_{αg} − I)(2J_{αf} − I) \Big)(t^k),    (6.33)
is convergent. If t∗ denotes the limit of this sequence, then x∗ = Jαf (t∗ ) is a
solution of (6.26).
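The following minimal sketch (added here, not from the original notes) runs the iteration (6.33) for the simple problem min_x 0.5‖x − a‖² + ‖x‖₁, where both proximity operators are available in closed form; the data a and the parameter α are arbitrary illustrative choices.

```python
import numpy as np

a = np.array([3.0, -0.2, 1.5, -4.0])
lam, alpha = 1.0, 1.0

def prox_f(v, alpha):
    # prox of f(x) = 0.5 * ||x - a||^2
    return (v + alpha * a) / (1.0 + alpha)

def prox_g(v, alpha):
    # prox of g(x) = lam * ||x||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)

t = np.zeros_like(a)
for _ in range(200):
    r = 2 * prox_f(t, alpha) - t                   # (2 J_{alpha f} - I) t
    t = 0.5 * t + 0.5 * (2 * prox_g(r, alpha) - r) # iteration (6.33)
x = prox_f(t, alpha)                               # solution estimate J_{alpha f}(t)
print(x)  # approximately the soft-threshold of a: [2., 0., 0.5, -3.]
```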

Generalized Douglas-Rachford Algorithm


In this section, we obtain a small variation on the Douglas-Rachford iterations
in (6.33). Let us start with the following general observation. Suppose T is a
single-valued operator and z = \big( (1 − β^∗)I + β^∗ T \big) z for some β^∗ ≠ 0. Then,
actually z = \big( (1 − β)I + βT \big) z for any β. Therefore, we have the following
generalization of Prop. 45.
Proposition 47. Suppose f and g are convex and the minimization problem
(6.26) has at least one solution. A point x is a solution of (6.26) if and only if
x = J_{αf}(t) for some t that satisfies

    t = \big( (1 − β)I + β(2J_{αg} − I)(2J_{αf} − I) \big)\, t, \quad \text{for all } β.

Although this proposition is true for any β ∈ R, we will specifically be inter-


ested in β ∈ (0, 1), because that is the interval for which we can construct a
convergent sequence that can be used to obtain a minimizer of (6.26). The
generalized DR iterations are as follows.
Proposition 48. Suppose f and g are convex and the minimization problem
(6.26) has at least one solution. Also, let β ∈ (0, 1). Then, the sequence
constructed as

    t^{k+1} = \big( (1 − β)I + β(2J_{αg} − I)(2J_{αf} − I) \big)(t^k),    (6.34)

is convergent. If t∗ denotes the limit of this sequence, then x∗ = Jαf (t∗ ) is a


solution of (6.26).

In order to prove this result, we need a generalization of the Krasnoselskii-


Mann theorem. Let us start with a definition.

Definition 30. An operator T is said to be β-averaged with β ∈ (0, 1) if T
can be written as

T = (1 − β)I + βN,

for a non-expansive operator N . 

Specifically, a firmly-nonexpansive operator is 12 -averaged. Notice also that if


T is β-averaged and β < β 0 , then T is also β 0 -averaged.
Let us now consider how we can generalize the Krasnoselskii-Mann theorem.
Recall that, in the proof of that theorem, a key inequality used in the beginning
of the proof was of the form

kT x − T yk22 + k(I − T )x − (I − T )yk22 ≤ kx − yk22 , (6.35)

where T is firmly nonexpansive. However, this depends heavily on the firm


non-expansivity of T . If T is only β averaged, with β > 1/2, then (6.35) does
not hold any more. Let us now see if we can come up with an alternative. So,
let T = (1 − β)I + βN , for a nonexpansive N . Then, we have,

    \|Tx − Ty\|_2^2 = (1 − β)^2\|x − y\|_2^2 + β^2\|Nx − Ny\|_2^2 + 2β(1 − β)\langle Nx − Ny, x − y\rangle,
    \|(I − T)x − (I − T)y\|_2^2 = β^2\|x − y\|_2^2 + β^2\|Nx − Ny\|_2^2 − 2β^2\langle Nx − Ny, x − y\rangle.

In order to cancel the inner product terms, let us consider a weighted sum:

    \|Tx − Ty\|_2^2 + \frac{1 − β}{β}\|(I − T)x − (I − T)y\|_2^2
        = \big[(1 − β)^2 + β(1 − β)\big]\|x − y\|_2^2 + \big[β^2 + β(1 − β)\big]\|Nx − Ny\|_2^2
        ≤ \|x − y\|_2^2 .
≤ kx − yk22 .

Thus, we have shown the following.


Proposition 49. Suppose T is β-averaged. Then,
    \|Tx − Ty\|_2^2 + \frac{1 − β}{β}\|(I − T)x − (I − T)y\|_2^2 ≤ \|x − y\|_2^2 .    (6.36)

We can now state a generalization of Theorem 1, which we also refer to as the


Krasnoselskii-Mann theorem.
Theorem 3 (Krasnoselskii-Mann (General Statement)). Suppose F is a β-
averaged mapping on RN and its set of fixed points is non-empty. Given an
initial x0 , suppose we define a sequence as xk+1 = F xk . Then, xk converges
to a fixed point of F .

Proof. Exactly the same arguments as in the proof of Thm 1 work, except that
we now start with the inequality (6.36).

Proposition 48 now follows as a corollary of Theorem 3 because the operator


in (6.34) is β-averaged with β ∈ (0, 1).

6.5 Alternating Direction Method of Multipliers


Consider the problem

    \min_x \; f(x) + g(Mx).    (6.37)

In order to solve this problem, we will apply the Douglas-Rachford algorithm


on the dual problem. The resulting algorithm is known as the alternating
direction method of multipliers algorithm (ADMM). We will assume that M
is full column rank.
We first split variables and write (6.37) as a saddle point problem

    \min_{x,z} \max_λ \; f(x) + g(z) + \langle λ, Mx − z\rangle.

The dual problem is then


    \max_λ \; \underbrace{\Big[\min_x \; f(x) + \langle λ, Mx\rangle\Big]}_{−d_1(λ)} + \underbrace{\Big[\min_z \; g(z) − \langle λ, z\rangle\Big]}_{−d_2(λ)} .

Equivalently, the dual problem can be expressed as,

    \min_λ \; d_1(λ) + d_2(λ),    (6.38)

for

    d_1(λ) = \max_x \; \langle −M^T λ, x\rangle − f(x) = f^∗(−M^T λ),    (6.39a)
    d_2(λ) = \max_z \; \langle λ, z\rangle − g(z) = g^∗(λ).    (6.39b)

For this problem, the Douglas-Rachford algorithm, starting from some y 0 , can
be written as follows.

    ȳ^k = N_{αd_2}(y^k),
    ŷ^k = N_{αd_1}(ȳ^k),
    y^{k+1} = \tfrac{1}{2} y^k + \tfrac{1}{2} ŷ^k,
    λ^{k+1} = J_{αd_2}(y^{k+1}).    (6.40)

Recall that the sequence of λk ’s converge to a solution of (6.38). Let us now
find expressions for the terms above in terms of f and g.
Suppose y k = λk + α z k , with z k ∈ ∂d2 (λk ). This implies (also recall Prop. 44)
that λk = Jαd2 (y k ), so that
    ȳ^k = 2J_{αd_2}(y^k) − y^k = 2λ^k − (λ^k + αz^k) = λ^k − αz^k .
Now observe that
    J_{αd_1}(ȳ^k) = \arg\min_y \; \frac{1}{2α}\|y − ȳ^k\|_2^2 + d_1(y)
                 = \arg\min_y \max_x \; \frac{1}{2α}\|y − ȳ^k\|_2^2 − f(x) − \langle M^T y, x\rangle .

Changing the order of min and max, we find that J_{αd_1}(ȳ^k) must satisfy

    J_{αd_1}(ȳ^k) = ȳ^k + αMx^{k+1},

where

    x^{k+1} := \arg\max_x \; \frac{1}{2α}\|αMx\|_2^2 − f(x) − \langle ȳ^k + αMx, Mx\rangle
             = \arg\max_x \; \frac{1}{2α}\|αMx\|_2^2 − f(x) − \langle λ^k − αz^k + αMx, Mx\rangle
             = \arg\min_x \; f(x) + \langle λ^k , Mx\rangle + \frac{α}{2}\|Mx − z^k\|_2^2 .

Therefore,

    ŷ^k = 2J_{αd_1}(ȳ^k) − ȳ^k = ȳ^k + 2αMx^{k+1} = λ^k − αz^k + 2αMx^{k+1} .

We also have

    y^{k+1} = \tfrac{1}{2} y^k + \tfrac{1}{2} ŷ^k = λ^k + αMx^{k+1} .

Let us finally show that y k+1 can be expressed as y k+1 = λk+1 +α z k+1 for some
z k+1 ∈ ∂d2 (λk+1 ), so that the assumption stated in the beginning of the k th
iteration is also valid at the (k + 1)st iteration, when we define z k+1 properly.
We have

    λ^{k+1} = \arg\min_y \; \frac{1}{2α}\|y − y^{k+1}\|_2^2 + d_2(y)
            = \arg\min_y \max_z \; \frac{1}{2α}\|y − y^{k+1}\|_2^2 + \langle y, z\rangle − g(z).

Changing the order of min and max, we find

    λ^{k+1} = y^{k+1} − αz^{k+1} = λ^k + α(Mx^{k+1} − z^{k+1}),

where

    z^{k+1} = \arg\max_z \; \frac{1}{2α}\|αz\|_2^2 + \langle y^{k+1} − αz, z\rangle − g(z)
            = \arg\max_z \; \frac{1}{2α}\|αz\|_2^2 + \langle λ^k + αMx^{k+1} − αz, z\rangle − g(z)
            = \arg\min_z \; g(z) − \langle λ^k , z\rangle + \frac{α}{2}\|Mx^{k+1} − z\|_2^2 .
The optimality conditions for the last equation are

0 ∈ ∂g(z k+1 ) − λk + α(z k+1 − M xk+1 )


⇐⇒ λk − α(z k+1 − M xk+1 ) ∈ ∂g(z k+1 )
⇐⇒ λk+1 ∈ ∂g(z k+1 )
⇐⇒ z k+1 ∈ ∂g ∗ (λk+1 ) = ∂d2 (λk+1 ).

ADMM in Terms of the Primal Variables


We can rewrite the iterations solely in terms of xk , z k , λk . This produces the
following algorithm, referred to as the ADMM.
    x^{k+1} = \arg\min_x \; f(x) + \langle λ^k , Mx\rangle + \frac{α}{2}\|Mx − z^k\|_2^2 ,
    z^{k+1} = \arg\min_z \; g(z) − \langle λ^k , z\rangle + \frac{α}{2}\|Mx^{k+1} − z\|_2^2 ,
    λ^{k+1} = λ^k + α(Mx^{k+1} − z^{k+1}).
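Before turning to convergence, here is a minimal numerical sketch (added here, not from the original notes) of these ADMM iterations for the particular choice f(x) = 0.5‖y − x‖² and g(z) = reg·‖z‖₁, i.e., the problem min_x 0.5‖y − x‖² + reg‖Mx‖₁; the data M, y and the parameters reg, α are arbitrary illustrative choices, with M of full column rank.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(8, 5))
y = rng.normal(size=5)
reg, alpha = 0.5, 1.0

x = np.zeros(5)
z = np.zeros(8)
lam = np.zeros(8)
A = np.eye(5) + alpha * M.T @ M            # system matrix of the x-update
for _ in range(300):
    # x-update: quadratic problem solved in closed form
    x = np.linalg.solve(A, y - M.T @ lam + alpha * M.T @ z)
    # z-update: soft-thresholding
    v = M @ x + lam / alpha
    z = np.sign(v) * np.maximum(np.abs(v) - reg / alpha, 0.0)
    # multiplier update
    lam = lam + alpha * (M @ x - z)

print("||Mx - z|| =", np.linalg.norm(M @ x - z))
```

The splitting residual ‖Mx^k − z^k‖ tends to zero, consistent with M x^∗ = z^∗ in the convergence discussion below.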

Convergence of ADMM
Recall that we derived ADMM as an instance of Douglas-Rachford algorithm
applied on the dual problem in (6.38). The iterations we started with, and
their relation to xk , z k , are given below.
    ȳ^k = N_{αd_2}(y^k)                           \big[\, ȳ^k = λ^k − αz^k \,\big]    (6.42a)
    ŷ^k = N_{αd_1}(ȳ^k)                           \big[\, ŷ^k = λ^k − αz^k + 2αMx^{k+1} \,\big]    (6.42b)
    y^{k+1} = \tfrac{1}{2} y^k + \tfrac{1}{2} ŷ^k   \big[\, y^{k+1} = λ^k + αMx^{k+1} \,\big]    (6.42c)
    λ^{k+1} = J_{αd_2}(y^{k+1})                    \big[\, λ^{k+1} = λ^k + α(Mx^{k+1} − z^{k+1}) \,\big]    (6.42d)

The convergence results on the Douglas-Rachford iterations therefore ensure
that in (6.40), y k is convergent. Since the operators Nαd2 , Nαd1 , Jαd2 are con-
tinuous, this in turn implies that ȳ k , ŷ k and λk are also convergent sequences.
Thanks to the relations in (6.42), and the full column rank property of M
we then obtain that xk and z k are also convergent. Let λ∗ , x∗ , z ∗ denote the
corresponding limits.
First, notice that (6.42d) implies

λ∗ = λ∗ + α(M x∗ − z ∗ ).

Thus, we have M x∗ = z ∗ .
Now, using M x∗ = z ∗ , from (6.42c), and (6.42a) we obtain,

Nαd2 (λ∗ + αz ∗ ) = λ∗ − αz ∗ .

But by Prop. 44, this implies that z ∗ ∈ ∂d2 (λ∗ ). In view of (6.39b), equivalently
λ∗ ∈ ∂g(z ∗ ).
By a similar argument, we obtain from (6.42a) that ȳ ∗ = λ∗ − αM x∗ , from
(6.42b) that ŷ = λ∗ + αM x∗ , and

Nαd1 (λ∗ − αM x∗ ) = λ∗ + αM x∗ .

This time, Prop. 44, implies −M x∗ ∈ ∂d1 (λ∗ ). In view of (6.39a), equivalently
−M T λ∗ ∈ ∂f (x∗ ). Using this and the previous observation λ∗ ∈ ∂g(z ∗ ), we
thus can write

0 ∈ ∂f (x∗ ) + M T λ∗
⇐⇒ 0 ∈ ∂f (x∗ ) + M T ∂g(z ∗ )
⇐⇒ 0 ∈ ∂f (x∗ ) + M T ∂g (M x∗ ).


In words, x∗ is a solution of the primal problem (6.37).

6.6 A Generalized Proximal Point Algorithm


Recall that, given a maximal monotone T , the proximal point algorithm con-
sists of

xk+1 = (I + αT )−1 xk .

The sequence xk converges to some x∗ such that 0 ∈ T (x∗ ). Recall that the
convergence of PPA depended on the firm-nonexpansivity of S = (I + αT )−1 ,
which is equivalent to

hS x1 − S x2 , x1 − x2 i ≥ kS x1 − S x2 k22 .

To derive a generalized PPA, suppose M is a positive definite matrix, and
consider the following train of equivalences

0 ∈ T (x),
⇐⇒ M x ∈ M x + αT (x),
⇐⇒ x = (M + αT )−1 M x.

Note that, the last line assumes that (M + αT ) has an inverse. This is indeed
the case, since M can be written as M = cI + U for some positive definite
matrix U and c > 0. Thanks to positive definiteness, U is maximal monotone.
Consequently, U + αT is also maximal monotone and we can then resort to
Minty’s theorem and Prop. 42 to conclude that (M + αT ) = c I + (U + αT )
has a well-defined inverse. We also remark at this point that M does not have
to be symmetric.
We will study the operator (M + αT )−1 M in a modified norm.

Lemma 3. Suppose M is a positive definite matrix. Then, the mapping


hx, yiM = hx, M yi defines an inner product.

Proof. To be added...
In the following, we denote the induced norm \sqrt{\langle \cdot, \cdot\rangle_M} by \|\cdot\|_M .

Proposition 50. Suppose M is positive definite and T is maximal monotone


with respect to the inner product h·, ·iI . Then, the operator S = (M +αT )−1 M
is firmly-nonexpansive with respect to the inner product h·, ·iM . That is,

hSx − Sy, x − yiM ≥ kSx − Syk2M .

Proof. Without loss of generality, take α = 1. Since the range of M + T is the


whole space, and M is invertible, for any yi we can find xi and ui ∈ T (xi ) such
that M yi = M xi + ui . Notice that xi = Syi . But then,

hSy1 − Sy2 , y1 − y2 iM = hx1 − x2 , x1 − x2 iM + hx1 − x2 , M −1 (u1 − u2 )iM


= kx1 − x2 k2M + hx1 − x2 , u1 − u2 iI
≥ kx1 − x2 k2M .

To be added : generalization of the Krasnoselskii-Mann theorem...


In summary, the generalized PPA consists of the following iterations

xk+1 = (M + αT )−1 M xk .

Application to a Saddle Point Problem
Consider a problem of the form

    \min_x \max_{z ∈ B} \; f(x) + \langle Ax, z\rangle,

where B is a closed, convex set, and f is a convex function. Rewriting, we


obtain an equivalent problem as,

    \min_x \max_z \; L(x, z) = f(x) + \langle Ax, z\rangle − i_B(z).

Observe that L(x, z) is a convex-concave function. For such functions, the
following operator replaces the subdifferential:

    T(x, z) = \begin{bmatrix} ∂_x L(x, z) \\ ∂_z\big(−L(x, z)\big) \end{bmatrix}
            = \begin{bmatrix} ∂f(x) + A^T z \\ ∂i_B(z) − Ax \end{bmatrix} .    (6.43)

Proposition 51. The operator T defined in (6.43) is maximal monotone.

Proof. Let us first show monotonicity. Observe that

    \left\langle T(x_1, z_1) − T(x_2, z_2), \begin{bmatrix} x_1 \\ z_1 \end{bmatrix} − \begin{bmatrix} x_2 \\ z_2 \end{bmatrix} \right\rangle
      = \left\langle \begin{bmatrix} ∂f(x_1) \\ ∂i_B(z_1) \end{bmatrix} − \begin{bmatrix} ∂f(x_2) \\ ∂i_B(z_2) \end{bmatrix},
                     \begin{bmatrix} x_1 − x_2 \\ z_1 − z_2 \end{bmatrix} \right\rangle
        + \underbrace{\begin{bmatrix} x_1 − x_2 \\ z_1 − z_2 \end{bmatrix}^T
                      \begin{bmatrix} 0 & A^T \\ −A & 0 \end{bmatrix}
                      \begin{bmatrix} x_1 − x_2 \\ z_1 − z_2 \end{bmatrix}}_{=0}
      ≥ 0.

Proof of maximality to be added...

If we apply PPA for obtaining a zero of T (x, z) defined in (6.43), the resulting
iterations are complicated. Consider now a generalized PPA (GPPA) with the
choice

    M = \begin{bmatrix} I & αA^T \\ αA & I \end{bmatrix} .

Observe that M is positive definite if α2 σ(AT A) < 1. In order to apply GPPA,


we need an expression for (M + αT)^{−1} . Notice that

    (M + αT)\begin{bmatrix} x \\ z \end{bmatrix}
      = \begin{bmatrix} I & αA^T \\ αA & I \end{bmatrix}\begin{bmatrix} x \\ z \end{bmatrix}
        + α\begin{bmatrix} ∂f & A^T \\ −A & ∂i_B \end{bmatrix}\begin{bmatrix} x \\ z \end{bmatrix}
      = \begin{bmatrix} I + α∂f & 2αA^T \\ 0 & I + α∂i_B \end{bmatrix}\begin{bmatrix} x \\ z \end{bmatrix} .

Thus,

    \begin{bmatrix} x \\ z \end{bmatrix} ∈ (M + αT)\begin{bmatrix} x̂ \\ ẑ \end{bmatrix}

means

    x ∈ (I + α∂f)x̂ + 2αA^T ẑ,    (6.44a)
    z ∈ (I + α∂i_B)ẑ.    (6.44b)

Solving (6.44b) and plugging the result back into (6.44a), we obtain the expressions
for x̂ and ẑ as

    ẑ = P_B(z),
    x̂ = J_{αf}(x − 2αA^T ẑ),

where P_B = J_{αi_B} denotes the projection operator onto B and J_{αf} is the proximity
operator of f .
Observe also that

    M\begin{bmatrix} x \\ z \end{bmatrix} = \begin{bmatrix} x + αA^T z \\ αAx + z \end{bmatrix} .

Therefore, the GPPA for this problem is

    z^{k+1} = P_B(z^k + αAx^k),
    x^{k+1} = J_{αf}\big(x^k + αA^T z^k − 2αA^T z^{k+1}\big) = J_{αf}\big(x^k − αA^T(2z^{k+1} − z^k)\big).

By the analysis of GPPA, we can state that this algorithm converges if α^2 σ(A^T A) < 1.

Application to an Analysis Prior Problem


Consider now the problem
    \min_x \; \frac{1}{2}\|y − Hx\|_2^2 + λ\|Ax\|_1 .
By making use of the dual expression of the `1 norm, we can express this
problem as a saddle point problem. For this, recall that,

    \|x\|_1 = \max_{z ∈ B_∞} \langle x, z\rangle = \max_z \; \langle x, z\rangle − i_{B_∞}(z),

where B∞ denotes the unit ball of the `∞ norm. The equivalent saddle problem
is,
    \min_x \max_z \; \underbrace{\frac{1}{2}\|y − Hx\|_2^2}_{f(x)} + λ\langle Ax, z\rangle − λ\, i_{B_∞}(z).

This is exactly the same form considered above. The choice

    M = \begin{bmatrix} I & −αA^T \\ −αA & I \end{bmatrix}

leads to the iterations

    x^{k+1} = J_{αf}(x^k − αA^T z^k),
    z^{k+1} = P_{B_∞}\big(z^k + αA(2x^{k+1} − x^k)\big).


These are the iterations proposed by Chambolle and Pock.
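Here is a minimal numerical sketch (added here, not from the original notes) of these primal-dual iterations for a one-dimensional denoising instance: H = I, A a first-difference matrix, and the regularization weight absorbed into the analysis operator (λ‖Ax‖₁ = ‖(λA)x‖₁), so K = λA plays the role of A in the iterations. All data and parameters are arbitrary illustrative choices.

```python
import numpy as np

n = 50
y = np.concatenate([np.zeros(25), np.ones(25)]) \
    + 0.1 * np.random.default_rng(3).normal(size=n)   # noisy piecewise-constant signal
A = np.diff(np.eye(n), axis=0)                         # first differences, shape (n-1, n)
lam = 0.5
K = lam * A                                            # weight absorbed into the operator
alpha = 0.9 / np.linalg.norm(K, 2)                     # ensures alpha^2 * sigma(K^T K) < 1

def prox_f(v, alpha):
    # J_{alpha f} for f(x) = 0.5 * ||y - x||^2 (H = I):
    # minimizer of (1/(2*alpha)) * ||x - v||^2 + 0.5 * ||y - x||^2
    return (v + alpha * y) / (1.0 + alpha)

x = np.zeros(n)
z = np.zeros(n - 1)
for _ in range(500):
    x_new = prox_f(x - alpha * K.T @ z, alpha)
    z = np.clip(z + alpha * K @ (2 * x_new - x), -1.0, 1.0)  # projection onto B_inf
    x = x_new

print("data residual:", np.linalg.norm(x - y))
```

The reflection step 2x^{k+1} − x^k in the z-update is exactly the term (2x^{k+1} − x^k) appearing in the iterations above.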
