
EE364b Prof. S. Boyd

EE364b Homework 2
1. Subgradient optimality conditions for nondifferentiable inequality constrained optimiza-
tion. Consider the problem
minimize f0 (x)
subject to fi (x) ≤ 0, i = 1, . . . , m,
with variable x ∈ Rⁿ. We do not assume that f0, . . . , fm are convex. Suppose that x̃
and λ̃ ⪰ 0 satisfy primal feasibility,
fi(x̃) ≤ 0, i = 1, . . . , m,
dual feasibility,
0 ∈ ∂f0(x̃) + ∑_{i=1}^m λ̃i ∂fi(x̃),
and the complementarity condition
λ̃i fi(x̃) = 0, i = 1, . . . , m.
Show that x̃ is optimal, using only a simple argument and the definition of subgradient.
Recall that we do not assume the functions f0 , . . . , fm are convex.
Solution. Let g be defined by g(x) = f0(x) + ∑_{i=1}^m λ̃i fi(x). Then 0 ∈ ∂g(x̃). By
definition of subgradient, this means that for any y,
g(y) ≥ g(x̃) + 0ᵀ(y − x̃).
Thus, for any y,
f0(y) ≥ f0(x̃) − ∑_{i=1}^m λ̃i (fi(y) − fi(x̃)).

For each i, complementarity implies that either λ̃i = 0 or fi(x̃) = 0. Hence, for any
feasible y (for which fi (y) ≤ 0), each λ̃i (fi (y) − fi (x̃)) term is either zero or negative.
Therefore, any feasible y also satisfies f0 (y) ≥ f0 (x̃), and x̃ is optimal.
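The argument can be sanity-checked numerically on a small instance of our own (not part of the assignment): minimize f0(x) = |x| subject to f1(x) = 1 − x ≤ 0, with candidate point x̃ = 1 and multiplier λ̃ = 1.

```python
import numpy as np

# A tiny instance for illustration: minimize f0(x) = |x|
# subject to f1(x) = 1 - x <= 0.  The optimum is x~ = 1.
f0 = lambda x: abs(x)
f1 = lambda x: 1.0 - x

xt, lam = 1.0, 1.0          # candidate point x~ and multiplier lambda~

# At x~ = 1: df0(x~) = {1} and df1(x~) = {-1}, so with lam = 1:
assert f1(xt) <= 0                    # primal feasibility
assert 1.0 + lam * (-1.0) == 0.0      # dual feasibility: 0 in df0 + lam*df1
assert lam * f1(xt) == 0.0            # complementarity

# The conclusion of the proof: f0(y) >= f0(x~) for every feasible y (y >= 1).
ys = np.linspace(1.0, 10.0, 1000)
assert min(f0(y) for y in ys) >= f0(xt)
```

Note that neither f0 nor the argument uses convexity of the constraint set beyond the three conditions themselves.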
2. Optimality conditions and coordinate-wise descent for ℓ1 -regularized minimization. We
consider the problem of minimizing
φ(x) = f(x) + λ‖x‖₁,
where f : Rⁿ → R is convex and differentiable, and λ ≥ 0. The number λ is the
regularization parameter, and is used to control the trade-off between small f and
small ‖x‖₁. When ℓ1-regularization is used as a heuristic for finding a sparse x for
which f(x) is small, λ controls (roughly) the trade-off between f(x) and the cardinality
(number of nonzero elements) of x.

(a) Show that x = 0 is optimal for this problem (i.e., minimizes φ) if and only if
‖∇f(0)‖∞ ≤ λ. In particular, for λ ≥ λmax = ‖∇f(0)‖∞, ℓ1 regularization yields
the sparsest possible x, the zero vector.
Remark. The value λmax gives a good reference point for choosing a value of the
penalty parameter λ in ℓ1 -regularized minimization. A common choice is to start
with λ = λmax /2, and then adjust λ to achieve the desired sparsity/fit trade-off.
Solution. A necessary and sufficient condition for optimality of x = 0 is that
0 ∈ ∂φ(0). Now ∂φ(0) = ∇f(0) + λ∂‖0‖₁ = ∇f(0) + λ[−1, 1]ⁿ. In other words,
x = 0 is optimal if and only if −∇f(0) ∈ [−λ, λ]ⁿ, which is equivalent to ‖∇f(0)‖∞ ≤ λ.
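For a concrete check (our own example, not part of the assignment): with f(x) = ‖Ax − b‖₂² we have ∇f(0) = −2Aᵀb, so λmax = 2‖Aᵀb‖∞. A proximal-gradient (ISTA) solver should then return x = 0 exactly whenever λ ≥ λmax. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
b = rng.standard_normal(40)

# f(x) = ||Ax - b||_2^2, so grad f(0) = -2 A^T b and lambda_max = 2 ||A^T b||_inf.
lam_max = 2 * np.abs(A.T @ b).max()

def ista(lam, steps=2000):
    """Proximal gradient (soft-thresholding) for ||Ax - b||^2 + lam*||x||_1."""
    L = 2 * np.linalg.norm(A, 2) ** 2      # Lipschitz constant of grad f
    x = np.zeros(20)
    for _ in range(steps):
        g = 2 * A.T @ (A @ x - b)
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0)
    return x

assert np.allclose(ista(1.01 * lam_max), 0)   # lam >= lam_max: x = 0 is optimal
assert np.any(ista(0.5 * lam_max) != 0)       # below lam_max: entries activate
```

Starting from x = 0, the very first soft-threshold step already keeps x at zero when λ ≥ λmax, which is exactly the subgradient condition above.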
(b) Coordinate-wise descent. In the coordinate-wise descent method for minimizing
a convex function g, we first minimize over x1 , keeping all other variables fixed;
then we minimize over x2 , keeping all other variables fixed, and so on. After
minimizing over xn , we go back to x1 and repeat the whole process, repeatedly
cycling over all n variables.
Show that coordinate-wise descent fails for the function

g(x) = |x1 − x2 | + 0.1(x1 + x2 ).

(In particular, verify that the algorithm terminates after one step at the point
(x2^(0), x2^(0)), while inf_x g(x) = −∞.) Thus, coordinate-wise descent need not work,
for general convex functions.
Solution. We first minimize over x1, with x2 fixed at x2^(0). The optimal choice is
x1 = x2^(0), since the derivative on the left is −0.9, and on the right it is 1.1. We
then arrive at the point (x2^(0), x2^(0)). We now optimize over x2. But it is already
optimal, with the same left and right derivatives, so x is unchanged. We're now at a
fixed point of the coordinate-descent algorithm.
On the other hand, taking x = (−t, −t) and letting t → ∞, we see that g(x) =
−0.2t → −∞.
It’s good to visualize coordinate-wise descent for this function, to see why x gets
stuck at the crease along x1 = x2 . The graph looks like a folded piece of paper,
with the crease along the line x1 = x2 . The bottom of the crease has a small
tilt in the direction (−1, −1), so the function is unbounded below. Moving along
either axis increases g, so coordinate-wise descent is stuck. But moving in the
direction (−1, −1), for example, decreases the function.
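The stuck iterate is easy to reproduce. The sketch below (ours, not part of the solution) performs exact coordinate minimization of g by scanning a fine grid, and shows the iterate pinned to the crease even though g is unbounded below along (−1, −1):

```python
import numpy as np

def g(x):
    return abs(x[0] - x[1]) + 0.1 * (x[0] + x[1])

def coord_min(x, i, grid):
    """Exactly minimize g over coordinate i by scanning a fine grid."""
    best = min(grid, key=lambda t: g([t if j == i else x[j] for j in range(2)]))
    x = x.copy()
    x[i] = best
    return x

x = np.array([3.0, 1.0])             # start with x2^(0) = 1
grid = np.linspace(-10, 10, 20001)   # fine enough to include the kink at x1 = x2
x = coord_min(x, 0, grid)            # minimize over x1: lands on the crease x1 = x2
x = coord_min(x, 1, grid)            # minimize over x2: no improvement
assert np.allclose(x, [1.0, 1.0], atol=1e-6)   # stuck at (1, 1)

# Yet g is unbounded below in the direction (-1, -1):
assert g([-100, -100]) < g(x)
```

Each one-dimensional scan finds the crease point exactly as the derivation predicts (left derivative −0.9, right derivative 1.1), so no single-coordinate move can escape.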
(c) Now consider coordinate-wise descent for minimizing the specific function φ de-
fined above. Assuming f is strongly convex (say) it can be shown that the iterates
converge to a fixed point x̃. Show that x̃ is optimal, i.e., minimizes φ.
Thus, coordinate-wise descent works for ℓ1 -regularized minimization of a differ-
entiable function.
Solution. For each i, x̃i minimizes the function φ over xi, with all other variables kept
fixed. It follows that
0 ∈ ∂xi φ(x̃) = ∂f/∂xi(x̃) + λIi, i = 1, . . . , n,
where Ii is the subdifferential of | · | at x̃i: Ii = {−1} if x̃i < 0, Ii = {+1} if
x̃i > 0, and Ii = [−1, 1] if x̃i = 0.
But this is the same as saying 0 ∈ ∇f(x̃) + λ∂‖x̃‖₁, which means that x̃ minimizes
φ.
The subtlety here lies in the general formula that relates the subdifferential of
a function to its partial subdifferentials with respect to its components. For a
separable function h : R² → R, we have
∂h(x) = ∂x1 h(x) × ∂x2 h(x),
but this is false in general.
(d) Work out an explicit form for coordinate-wise descent for ℓ1 -regularized least-
squares, i.e., for minimizing the function
‖Ax − b‖₂² + λ‖x‖₁.
You might find the deadzone function
ψ(u) = u − 1 if u > 1,  0 if |u| ≤ 1,  u + 1 if u < −1
useful. Generate some data and try out the coordinate-wise descent method.
Check the result against the solution found using CVX, and produce a graph
showing convergence of your coordinate-wise method.
Solution. At each step we choose an index i, and minimize ‖Ax − b‖₂² + λ‖x‖₁
over xi, while holding all other xj, with j ≠ i, constant.
Selecting the optimal xi for this problem is equivalent to selecting the optimal xi
in the problem
minimize a xi² + c xi + |xi|,
where a = (AᵀA)ii/λ and c = (2/λ)(∑_{j≠i} (AᵀA)ij xj − (Aᵀb)i). Using the theory
discussed above, any minimizer xi will satisfy 0 ∈ 2a xi + c + ∂|xi|. Now we note
that a is positive, so the minimizer of the above problem will have sign opposite
to that of c. From there we deduce that the (unique) minimizer x⋆i will be
x⋆i = 0 if c ∈ [−1, 1], and x⋆i = (1/(2a))(−c + sign(c)) otherwise,
where sign(u) = −1 for u < 0, sign(0) = 0, and sign(u) = 1 for u > 0.

Finally, we make use of the deadzone function ψ defined above and write
x⋆i = −ψ(c) / ((2/λ)(AᵀA)ii),
with c = (2/λ)(∑_{j≠i} (AᵀA)ij xj − (Aᵀb)i) as above.
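The scalar update can be sanity-checked against brute-force grid minimization. A NumPy sketch of our own (not part of the original solution), verifying both the case formula and its deadzone form:

```python
import numpy as np

def psi(u):
    """Deadzone function: u-1 for u > 1, 0 for |u| <= 1, u+1 for u < -1."""
    return np.sign(u) * np.maximum(np.abs(u) - 1, 0)

def xi_star(a, c):
    """Closed-form minimizer of a*x^2 + c*x + |x|, a > 0 (equals -psi(c)/(2a))."""
    return 0.0 if -1 <= c <= 1 else (-c + np.sign(c)) / (2 * a)

grid = np.linspace(-20, 20, 400001)
for a, c in [(0.5, 3.0), (2.0, -0.4), (1.3, -5.2), (0.7, 1.0)]:
    brute = grid[np.argmin(a * grid**2 + c * grid + np.abs(grid))]
    assert abs(brute - xi_star(a, c)) < 1e-3                   # matches brute force
    assert abs(xi_star(a, c) - (-psi(c) / (2 * a))) < 1e-12    # deadzone form agrees
```

The (a, c) pairs exercise all three regimes: c > 1, c in the deadzone, and c < −1.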

Coordinate descent was implemented in Matlab for a random problem instance
with A ∈ R^(400×200). When solving to within 0.1% accuracy, the iterative method
required only a third of the time CVX took. Sample code appears below, followed by
a graph showing the coordinate-wise descent method's function value converging
to the CVX function value.

% Generate a random problem instance.
randn('state', 10239); m = 400; n = 200;
A = randn(m, n); ATA = A'*A;
b = randn(m, 1);
l = 0.1;
TOL = 0.001;
xcoord = zeros(n, 1);

% Solve in CVX as a benchmark.
cvx_begin
    variable xcvx(n);
    minimize(sum_square(A*xcvx - b) + l*norm(xcvx, 1));
cvx_end

% Solve using coordinate-wise descent, sweeping over the coordinates until
% the objective is within TOL of the CVX optimal value.
while abs(cvx_optval - (sum_square(A*xcoord - b) + ...
        l*norm(xcoord, 1)))/cvx_optval > TOL
    for i = 1:n
        % With xcoord(i) zeroed out, c is the scalar from the derivation above.
        xcoord(i) = 0; ei = zeros(n,1); ei(i) = 1;
        c = 2/l*ei'*(ATA*xcoord - A'*b);
        % Closed-form coordinate minimizer: -psi(c) / ((2/l)*ATA(i,i)).
        xcoord(i) = -sign(c)*pos(abs(c) - 1)/(2*ATA(i,i)/l);
    end
end
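For readers without Matlab or CVX, the same sweep translates directly to NumPy. The sketch below (ours; a smaller instance, with a convergence history in place of the CVX benchmark) mirrors the variable names above:

```python
import numpy as np

rng = np.random.default_rng(10239)
m, n, lam = 40, 20, 0.1              # smaller than the Matlab instance
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
ATA = A.T @ A
ATb = A.T @ b

def obj(x):
    """phi(x) = ||Ax - b||_2^2 + lam * ||x||_1."""
    return np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()

x = np.zeros(n)
history = [obj(x)]
for sweep in range(50):
    for i in range(n):
        x[i] = 0.0                                   # so ATA[i] @ x sums over j != i
        c = (2 / lam) * (ATA[i] @ x - ATb[i])
        # Closed-form coordinate minimizer: -psi(c) / ((2/lam) * (A^T A)_ii).
        x[i] = -np.sign(c) * max(abs(c) - 1, 0) / ((2 / lam) * ATA[i, i])
    history.append(obj(x))

# Exact coordinate minimization never increases the objective.
assert all(h2 <= h1 + 1e-9 for h1, h2 in zip(history, history[1:]))
```

Because each inner step is an exact minimization over one coordinate, the objective is nonincreasing sweep by sweep, and by part (c) the fixed point it settles at is a global minimizer.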

[Figure: convergence of the coordinate-wise descent function value to the CVX value;
vertical axis from 10^1 down to 10^-7 (log scale), horizontal axis 0 to 30 iterations.]
