The stochastic subgradient method uses the standard update

    x(k+1) = x(k) − αk g̃(k),

where x(k) is the kth iterate, αk > 0 is the kth step size, and g̃(k) is a noisy subgradient of f at x(k), i.e.,

    E(g̃(k) | x(k)) = g(k) ∈ ∂f(x(k)).
Even more so than with the ordinary subgradient method, we can have f (x(k) ) increase
during the algorithm, so we keep track of the best point found so far, and the associated
function value
    f_best^(k) = min{f(x(1)), . . . , f(x(k))}.

The sequences x(k), g̃(k), and f_best^(k) are, of course, stochastic processes.
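As an aside, the basic iteration is simple enough to state in a few lines of code. Here is a minimal Python sketch (our illustration, not part of the original notes; f and noisy_subgrad stand in for an objective and a noisy subgradient oracle satisfying the condition above):

    import numpy as np

    def stoch_subgrad(f, noisy_subgrad, x1, K):
        # Minimal stochastic subgradient method with step size alpha_k = 1/k,
        # tracking the best objective value found so far.
        x = np.asarray(x1, dtype=float).copy()
        f_best = f(x)
        for k in range(1, K + 1):
            gt = noisy_subgrad(x)        # E(gt | x) is a subgradient of f at x
            x = x - (1.0 / k) * gt       # alpha_k = 1/k: square summable, not summable
            f_best = min(f_best, f(x))   # f(x(k)) need not decrease; keep the best value
        return x, f_best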
3 Convergence
We’ll prove a very basic convergence result for the stochastic subgradient method, using step
sizes that are square-summable but not summable,
    αk ≥ 0,    Σ_{k=1}^∞ αk² = ‖α‖₂² < ∞,    Σ_{k=1}^∞ αk = ∞.
We assume there is an x⋆ that minimizes f, and a G for which E ‖g̃(k)‖₂² ≤ G² for all k. We also assume that R satisfies E ‖x(1) − x⋆‖₂² ≤ R².
We will show that

    E f_best^(k) → f⋆

as k → ∞, i.e., we have convergence in expectation. We also have convergence in probability: for any ε > 0,

    lim_{k→∞} Prob(f_best^(k) ≥ f⋆ + ε) = 0.
Conditioning on x(k) and expanding the update, we have

    E(‖x(k+1) − x⋆‖₂² | x(k))
        = E(‖x(k) − αk g̃(k) − x⋆‖₂² | x(k))
        = ‖x(k) − x⋆‖₂² − 2αk E(g̃(k) | x(k))ᵀ(x(k) − x⋆) + αk² E(‖g̃(k)‖₂² | x(k))
        ≤ ‖x(k) − x⋆‖₂² − 2αk (f(x(k)) − f⋆) + αk² E(‖g̃(k)‖₂² | x(k)),

where the inequality holds almost surely, and follows because E(g̃(k) | x(k)) ∈ ∂f(x(k)).
Now we take expectation over x(k) (using iterated expectation and E ‖g̃(k)‖₂² ≤ G²) to get

    E ‖x(k+1) − x⋆‖₂² ≤ E ‖x(k) − x⋆‖₂² − 2αk (E f(x(k)) − f⋆) + αk² G².

Applying this inequality recursively gives

    E ‖x(k+1) − x⋆‖₂² ≤ E ‖x(1) − x⋆‖₂² − 2 Σ_{i=1}^k αi (E f(x(i)) − f⋆) + G² Σ_{i=1}^k αi².

Using E ‖x(1) − x⋆‖₂² ≤ R², E ‖x(k+1) − x⋆‖₂² ≥ 0, and Σ_{i=1}^k αi² ≤ ‖α‖₂², we have

    2 Σ_{i=1}^k αi (E f(x(i)) − f⋆) ≤ R² + G² ‖α‖₂².
Therefore we have

    min_{i=1,...,k} (E f(x(i)) − f⋆) ≤ (R² + G² ‖α‖₂²) / (2 Σ_{i=1}^k αi).

The righthand side converges to zero as k → ∞, since Σ_{i=1}^k αi → ∞, so min_{i=1,...,k} E f(x(i)) → f⋆. Since the minimum is a concave function, Jensen's inequality gives E f_best^(k) ≤ min_{i=1,...,k} E f(x(i)), and since E f_best^(k) ≥ f⋆ we conclude E f_best^(k) → f⋆. Convergence in probability then follows from Markov's inequality applied to the nonnegative random variable f_best^(k) − f⋆.
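As a quick sanity check of this bound (our own aside), consider the step size rule αk = 1/k used in the example below: ‖α‖₂² = Σ_{k=1}^∞ 1/k² = π²/6 is finite, while Σ_{i=1}^k αi ≈ log k grows without bound, so the righthand side of the bound tends to zero roughly like 1/log k. Convergence is guaranteed, but can be quite slow.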
4 Example
We consider the problem of minimizing a piecewise-linear function,

    minimize f(x) = max_{i=1,...,m} (aiᵀx + bi),

with variable x ∈ Rⁿ. At each step, we evaluate a noisy subgradient of the form g̃(k) = g(k) + v(k), where g(k) ∈ ∂f(x(k)) and v(k) are independent zero-mean random variables.
We illustrate the stochastic subgradient method on the same problem instance used in the notes on subgradient methods, with n = 20 variables, m = 100 terms, and problem data ai and bi generated from a unit normal distribution. We start with x(1) = 0, and use the square-summable-but-not-summable step size rule αk = 1/k. To report f(x(k)) − f⋆, we compute the optimal value f⋆ using linear programming.
The noises v(k) are IID N(0, 0.5I). Since the norm of the vectors ai is on the order of 4 or 5 (the RMS value is √20 ≈ 4.5), the subgradient noise is around 25% compared to the true subgradient.
Figure 1 shows the convergence of the stochastic subgradient method for two realizations
of the noisy subgradient process, together with the noise-free case for comparison. This
figure shows that convergence is only a bit slower with subgradient noise.
We carried out 100 realizations, and show the (sample) mean and standard deviation of f_best^(k) − f⋆ for k in multiples of 250, in figure 2. (The error bars show the mean plus and minus one standard deviation.) Figure 3 shows the empirical distribution (over the 100 realizations) of f_best^(k) − f⋆ at iterations k = 250, k = 1000, and k = 5000.
Figure 1: The value of f_best^(k) − f⋆ versus iteration number k, for the subgradient method with step size αk = 1/k. The plot shows a noise-free realization, and two realizations with subgradient noise.
Figure 2: Average and one standard deviation error bars for f_best^(k) − f⋆ versus iteration number k, computed using 100 realizations, every 250 iterations.
Figure 3: Empirical distributions of f_best^(k) − f⋆ at k = 250, k = 1000, and k = 5000 iterations, based on 100 realizations.
5 Stochastic programming
A stochastic programming problem has the form

    minimize    E f0(x, ω)
    subject to  E fi(x, ω) ≤ 0,   i = 1, . . . , m,        (1)
where x ∈ Rn is the optimization variable, and ω is a random variable. If fi (x, ω) is convex
in x for each ω, the problem is a convex stochastic programming problem. In this case the
objective and constraint functions are convex.
Stochastic programming can be used to model a variety of robust design or decision
problems with uncertain data. Although the basic form above involves only expectation
or average values, some tricks can be used to capture other measures of the probability
distributions of fi (x, ω). We can replace an objective or constraint term E f (x, ω) with
E Φ(f (x, ω)), where Φ is a convex increasing function. For example, with Φ(u) = max(u, 0),
we can form a constraint of the form
    E fi(x, ω)₊ ≤ ε,

where ε is a positive parameter, and (·)₊ denotes positive part. Here E fi(x, ω)₊ has a simple
interpretation as the expected violation of the ith constraint. It’s also possible to combine
the constraints, using a single constraint of the form
    E max(f1(x, ω)₊, . . . , fm(x, ω)₊) ≤ ε.
The lefthand side here can be interpreted as the expected worst violation (over all constraints).
Constraints of the form Prob(fi (x, ω) ≤ 0) ≥ η, which require a constraint to hold with
a probability (or reliability) exceeding η, are called chance constraints. These constraints
cannot be directly handled using the simple trick above, but an expected violation constraint
can often give a good approximation for a chance constraint. (Some chance constraints can
be handled exactly, e.g., when f (x, ω) = (a + ω)T x − b, ω is Gaussian, and η ≥ 0.5.)
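To see why the Gaussian case works out (a standard reformulation, not derived in these notes; here we assume ω ∼ N(0, Σ)): with f(x, ω) = (a + ω)ᵀx − b, the quantity (a + ω)ᵀx is Gaussian with mean aᵀx and standard deviation ‖Σ^{1/2}x‖₂, so Prob(f(x, ω) ≤ 0) ≥ η is equivalent to

    aᵀx + Φ⁻¹(η) ‖Σ^{1/2}x‖₂ ≤ b,

where Φ is the standard normal CDF. For η ≥ 0.5 we have Φ⁻¹(η) ≥ 0, so this is a convex second-order cone constraint.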
Recall that Jensen’s inequality tells us
E fi (x, ω) ≥ fi (x, E ω).
Now consider the problem

    minimize    f0(x, E ω)
    subject to  fi(x, E ω) ≤ 0,   i = 1, . . . , m,        (2)
obtained by replacing the random variable in each function by its expected value. This
problem is sometimes called the certainty equivalent of the original stochastic programming
problem (1), even though the two are equivalent only in very special cases. By Jensen's inequality, the constraint set for the certainty equivalent problem is larger than that of the original stochastic problem (1), and its objective is smaller. It follows that the optimal value of the certainty equivalent problem gives a lower bound on the optimal value of the stochastic problem (1). (It can be a poor bound, of course.)
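As a simple illustration (our own example) of how weak this lower bound can be: take f0(x, ω) = (x − ω)², with E ω = 0 and var ω = σ². The certainty equivalent problem minimizes x², with optimal value 0, while the true objective is E(x − ω)² = x² + σ², with optimal value σ². The gap between the bound and the true optimal value can thus be arbitrarily large.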
5.1 Noisy subgradient of expected function value
Suppose F : Rⁿ × Rᵖ → R, and F(x, w) is convex in x for each w. We define

    f(x) = E F(x, w) = ∫ F(x, w) p(w) dw,

where p is the density of w. (The integral is over Rᵖ.) The function f is convex. We'll show how to compute a noisy unbiased subgradient of f at x.
The function f comes up in many applications. We can think of x as some kind of
design variable to be chosen, and w as some kind of parameter that is random, i.e., subject
to statistical fluctuation. The function F tells us the cost of choosing x when w takes a
particular value; the function f , which is deterministic, gives the average cost of choosing
x, taking the statistical variation of w into account. Note that the dimension of w, how it
enters F , and the distribution are not restricted; we only require that for each value of w,
F is convex in x.
Except in some very special cases, we cannot easily compute f (x) exactly. However,
we can approximately compute f using Monte Carlo methods, if we can cheaply generate
samples of w from its distribution. (This depends on the distribution.) We generate M
independent samples w1 , . . . , wM , and then take
    f̂(x) = (1/M) Σ_{i=1}^M F(x, wi)

as our estimate of f(x). We hope that if M is large enough, we get a good estimate. In fact, f̂(x) is a random variable with E f̂(x) = f(x), and variance c/M, where c is the variance of F(x, w). If we know or can bound c, then we can at least bound the probability of a given level of error, i.e., Prob(|f̂(x) − f(x)| ≥ ε). In many cases it's possible to carry out a
much more sophisticated analysis of the error in Monte Carlo methods, but we won’t pursue
that here. A summary for our purposes is: we cannot evaluate f (x) exactly, but we can get
a good approximation, with (possibly) much effort.
Let G : Rn × Rp → Rn be a function that satisfies
G(x, w) ∈ ∂x F (x, w)
for each x and w. In other words, G(x, w) selects a subgradient for each value of x and w.
If F (x, w) is differentiable in x then we must have G(x, w) = ∇x F (x, w).
We claim that

    g = E G(x, w) = ∫ G(x, w) p(w) dw ∈ ∂f(x).

To see this, note that for any z we have

    F(z, w) ≥ F(x, w) + G(x, w)ᵀ(z − x),

since G(x, w) ∈ ∂x F(x, w). Multiplying this by p(w), which is nonnegative, and integrating gives

    ∫ F(z, w) p(w) dw ≥ ∫ (F(x, w) + G(x, w)ᵀ(z − x)) p(w) dw = f(x) + gᵀ(z − x),

i.e., f(z) ≥ f(x) + gᵀ(z − x) for all z, which shows that g ∈ ∂f(x).
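The construction above translates directly into code. Here is a minimal Python sketch (illustrative only; F, G, and sample_w are assumed, user-supplied functions for F(x, w), a subgradient selection G(x, w) ∈ ∂x F(x, w), and sampling from the distribution of w):

    import numpy as np

    def mc_value_and_subgrad(F, G, sample_w, x, M):
        # Monte Carlo estimates: fhat is unbiased for f(x) = E F(x, w), with
        # variance c/M; gtilde is a noisy unbiased subgradient, E gtilde in df(x).
        fhat = 0.0
        gtilde = np.zeros_like(x, dtype=float)
        for _ in range(M):
            w = sample_w()
            fhat += F(x, w) / M
            gtilde += G(x, w) / M
        return fhat, gtilde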
5.2 Example
We consider the problem of minimizing the expected value of a piecewise-linear convex function with random coefficients,

    minimize f(x) = E max_{i=1,...,m} (aiᵀx + bi),

with variable x ∈ Rⁿ. The data ai ∈ Rⁿ and bi ∈ R are random with some given distribution. We can compute an (unbiased) estimate of f(x), and a noisy unbiased subgradient g̃ (with E g̃ ∈ ∂f(x)), using the Monte Carlo methods described above.
We consider a problem instance with n = 20 variables and m = 100 terms. We assume that ai ∼ N(āi, 5I) and b ∼ N(b̄, 5I). The mean values āi and b̄ are generated from unit normal distributions (and are the same as the constant values ai and bi used in the previous examples). We take x(1) = 0 as the starting point and use the step size rule αk = 1/k.
We first compare the solution of the stochastic problem xstoch (obtained from the stochastic subgradient method) with xce, the solution of the certainty equivalent problem

    minimize fce(x) = max_{i=1,...,m} (āiᵀx + b̄i),

and with xheur, the solution of the heuristic problem

    minimize max_{i=1,...,m} (āiᵀx + b̄i + λ‖x‖₂),
where λ is a positive parameter. The extra terms are meant to account for the variation in aiᵀx + bi caused by variation in ai; the problem above can be cast as an SOCP and easily solved. We chose λ = 1, after some experimentation.
The certainty equivalent values for the three points are fce (xce ) = 1.12, fce (xheur ) = 1.23,
and fce (xstoch ) = 1.44. We (approximately) evaluate f (x) for these three points, based on
1000 Monte Carlo samples, obtaining f (xce ) ≈ 2.12, f (xheur ) ≈ 1.88, and f (xstoch ) ≈ 1.83.
The empirical distributions of maxi(aiᵀx + bi), for x = xstoch, x = xheur, and for x = xce, are shown in figure 4.

Figure 4: Empirical distributions of maxi(aiᵀx + bi) for the certainty equivalent solution xce, the heuristic solution xheur, and the stochastic optimal solution xstoch. The dark lines show the value of f(x), and the dotted lines show the value of fce(x).
In summary, the heuristic finds a point that is good, but not quite as good as the
stochastic optimal. Both of these points are much better than the certainty equivalent
point.
Now we show the convergence of the stochastic subgradient method, evaluating noisy
subgradients with M = 1, M = 10, M = 100, and M = 1000 samples at each step. For
M = 1000, we are computing a fairly accurate subgradient; for M = 1, the variance in our computed subgradient is large.

Figure 5: The value of f_best^(k) − f⋆ versus iteration number k, for the stochastic subgradient method with step size rule αk = 1/k. The plot shows one realization for noisy subgradients evaluated using M = 1, M = 10, M = 100, and M = 1000.
We (approximately) evaluate the function f(x) using M = 1000 samples at each iteration, and keep track of the best value f_best^(k). We estimate f⋆ by running the stochastic subgradient
algorithm for a large number of iterations. Figure 5 shows the convergence for one realization
for each of the four values of M . As expected, convergence is faster with M larger, which
yields a more accurate subgradient. Assuming the cost of an iteration is proportional to M ,
M = 1 seems to be the best choice. In any case, there seems little advantage (at least in
this example) to using a value of M larger than 10.
6 More examples
6.1 Minimizing expected maximum violation
The vector x ∈ Rⁿ is to be chosen subject to some (deterministic) linear inequalities Fx ⪯ g. These can represent manufacturing limits, cost limits, etc. Another set of random inequalities is given as Ax ⪯ b, where A and b come from some distribution. We would like these inequalities to hold, but because A and b vary, there may be no choice of x that results in Ax ⪯ b almost surely. We still would like to choose x so that Ax ⪯ b holds often, and when it is violated, the violation is small.
Perhaps the most natural problem formulation is to maximize the yield, defined as Prob(Ax ⪯ b). In some very special cases (e.g., A is deterministic and b has log-concave density), this can be converted to a convex problem; but in general it is not convex. Also, yield is not sensitive to how much the inequalities Ax ⪯ b are violated; a small violation is the same as a large violation, as far as the yield is concerned.
Instead, we will work with the expected maximum violation of the inequalities. We define the maximum violation as

    max((Ax − b)₊) = max_i (aiᵀx − bi)₊,

where aiᵀ are the rows of A. The maximum violation is zero if and only if Ax ⪯ b; it is positive otherwise. The expected maximum violation is

    E max((Ax − b)₊) = E max_i (aiᵀx − bi)₊.
It is a complicated but convex function of x, and gives a measure of how often, and by how much, the constraints Ax ⪯ b are violated. It is convex no matter what the distribution of A and b is.
As an aside, we note that many other interesting measures of violation could also be used, such as the expected total (or average) violation, E 1ᵀ(Ax − b)₊, or the expected sum of the squares of the individual violations, E ‖(Ax − b)₊‖₂². We can even measure violation by the expected distance from the desired polyhedron, E dist(x, {z | Az ⪯ b}), which we call the expected violation distance.
Back to our main story, and using expected (maximum) violation, we have the problem

    minimize    E max((Ax − b)₊) = E max_i (aiᵀx − bi)₊
    subject to  Fx ⪯ g.

The data are F, g, and the distribution of A and b.
We'll use a stochastic projected subgradient method to solve this problem. Given a point x that satisfies Fx ⪯ g, we need to generate a noisy unbiased subgradient g̃ of the objective. To do this we first generate a sample of A and b. If Ax ⪯ b, we can take g̃ = 0. (Note: if a subgradient is zero, it means we're done; but that's not the case here.) In the stochastic subgradient algorithm, g̃ = 0 means we don't update the current value of x. But we don't stop the algorithm, as we would in a deterministic projected subgradient method.

If Ax ⪯ b is violated, we choose j for which ajᵀx − bj = max_i (aiᵀx − bi)₊. Then we can take g̃ = aj. We then set xtemp = x − αk aj, where αk is the step size. Finally, we project xtemp back to the feasible set to get the updated value of x. (This can be done by solving a QP.)
We can generate a number of samples of A and b, and use the method above to find an
unbiased noisy subgradient for each case. We can use the average of these as g̃. If we take
enough samples, we can simultaneously estimate the expected maximum violation for the
current value of x.
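One step of this scheme might look as follows in Python, with the projection QP solved by cvxpy (a sketch under our own naming; sample_Ab is a hypothetical stand-in for a sampler of (A, b), not something defined in these notes):

    import numpy as np
    import cvxpy as cp

    def sps_step(x, k, F, g, sample_Ab):
        # One stochastic projected subgradient step for minimizing
        # E max((Ax - b)_+) subject to F x <= g, using a single sample of (A, b).
        A, b = sample_Ab()
        viol = A @ x - b
        if np.all(viol <= 0):
            return x                       # gtilde = 0: no update, but do not stop
        j = int(np.argmax(viol))           # worst violated inequality
        x_temp = x - (1.0 / k) * A[j]      # subgradient step with gtilde = a_j
        z = cp.Variable(x.size)            # project back onto {z | Fz <= g} (a QP)
        cp.Problem(cp.Minimize(cp.sum_squares(z - x_temp)), [F @ z <= g]).solve()
        return z.value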
A reasonable starting point is a point well inside the certainty equivalent inequalities (E A)x ⪯ E b, that also satisfies Fx ⪯ g. Simple choices include the analytic, Chebyshev, or maximum volume ellipsoid centers, or the point that maximizes the margin min(E b − (E A)x), subject to Fx ⪯ g.
6.2 On-line learning and adaptive signal processing
We suppose that (x, y) ∈ Rn × R have some joint distribution. Our goal is to find a weight
vector w ∈ Rn for which wT x is a good estimator of y. (This is linear regression; we can add
an extra component to x that is always one to get an affine estimator of the form wT x + v.)
We'd like to choose a weight vector that minimizes the expected absolute prediction error

    J(w) = E |wᵀx − y|.

At each step we are given a fresh sample (x, y) from the joint distribution; a noisy unbiased subgradient of J at w(k) is then

    g̃(k) = sign(w(k)ᵀx(k+1) − y(k+1)) x(k+1).

We take αk = 1/k. (In adaptive signal processing, this update is called a sign algorithm.)
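In code, one pass of the resulting on-line method is just a few lines of Python (a sketch; sample_xy is a hypothetical stand-in for drawing a fresh pair (x, y) from the joint distribution):

    import numpy as np

    def sign_algorithm(sample_xy, n, K):
        # On-line sign algorithm: w <- w - (1/k) sign(w'x - y) x.
        w = np.zeros(n)
        for k in range(1, K + 1):
            x, y = sample_xy()
            err = w @ x - y                        # prediction error
            w = w - (1.0 / k) * np.sign(err) * x   # noisy subgradient of E|w'x - y|
        return w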
We compute w⋆ using the stochastic subgradient algorithm, which we run for 5000 iterations. Figure 6 shows the behavior of the prediction error w(k)ᵀx(k+1) − y(k+1) for the first 300 iterations of the run. Figure 7 shows the empirical distribution (over 1000 realizations) of the prediction errors for w⋆.

Figure 6: The prediction error w(k)ᵀx(k+1) − y(k+1) versus iteration number k, for the first 300 iterations of one run.

Figure 7: Empirical distribution of the prediction errors for w = w⋆.
6.3 Minibatch stochastic subgradients
A common situation is minimizing an empirical average of convex losses,

    f(x) = (1/m) Σ_{i=1}^m F(x; (ai, bi)),

where for each pair (a, b), the objective F(x; (a, b)) is convex in x ∈ Rⁿ. In this situation, if we choose any m0 ≤ m as a batch size, we may construct a stochastic subgradient by taking a subsample of indices i1, . . . , im0 uniformly at random, either with or without replacement, from {1, . . . , m}, then setting

    g̃ ∈ (1/m0) Σ_{j=1}^{m0} ∂x F(x; (aij, bij)).
It is clear that E g̃ ∈ ∂f(x) with this choice. The interesting, though obvious, aspect of this is that if m0 ≪ m, then it is m/m0 times more efficient to compute the noisy unbiased subgradient g̃ than to compute a full subgradient of the objective. The power of stochastic subgradient methods comes from the fact that, since we often care only about obtaining a moderately accurate solution, we can take m0 ≪ m when computing stochastic subgradients, use these in the iterations of the method, and achieve substantial computational savings.
Here we work through one such instance in reasonable detail. We generate data pairs as follows. We fix a vector w⋆ ∈ Rⁿ, chosen uniformly from the unit sphere (so that ‖w⋆‖₂ = 1), and then generate vectors ai ∼ N(0, I), i.i.d., from the standard normal distribution on Rⁿ. We then set bi = aiᵀw⋆ + ξi, with ξi drawn according to a Cauchy distribution. For our objective, we take the absolute error

    F(x; (a, b)) = |aᵀx − b|.

With this setup, our objective becomes f(x) = (1/m)‖Ax − b‖₁, the mean ℓ₁-error in our predictions, which is a natural choice for heavy-tailed data.
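A Python sketch of the data generation and the minibatch subgradient for this objective (our reconstruction, not the authors' code; the dimension n below is a placeholder, while m = 200 matches the experiment described next):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 20, 200
    wstar = rng.standard_normal(n)
    wstar /= np.linalg.norm(wstar)            # ||w*||_2 = 1
    A = rng.standard_normal((m, n))           # rows a_i ~ N(0, I)
    b = A @ wstar + rng.standard_cauchy(m)    # heavy-tailed Cauchy noise

    def minibatch_subgrad(x, m0):
        # Noisy unbiased subgradient of f(x) = (1/m) ||Ax - b||_1: average the
        # subgradients a_i sign(a_i' x - b_i) over a random subsample of size m0.
        idx = rng.choice(m, size=m0, replace=False)
        return A[idx].T @ np.sign(A[idx] @ x - b[idx]) / m0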
Figure 8: The optimality gap f(x(k)) − f(x⋆) versus iteration number k, averaged over 50 experiments, for batch sizes m0 = 1, m0 = 10, and m0 = 200.
With this in mind, we can now compare a projected stochastic subgradient method for this problem with m = 200 and m0 chosen from 1, 10, and 200. Each iteration updates

    x(k+1) = Π(x(k) − αk g̃(k)),

where Π denotes projection onto the ℓ₂-ball of radius 1. We give two figures. In the first (Fig. 8), we show the optimality gap f(x(k)) − f(x⋆) versus iteration, averaged over 50 different experiments. For each method, we used the step size αk = α/√k, where the scalar α was chosen to yield the best performance in terms of optimality gap. On a per-iteration basis, it is clear that the subgradient method using the full sample size m0 = 200 has the best convergence as a function of the iteration number.
On the other hand, when we normalize by the amount of actual computation performed, the story is quite different. In Figure 9 we plot the optimality gap f(x(k)) − f⋆ against the total amount of computation each method requires, measured by the total number of individual subgradient evaluations ai sign(aiᵀx − bi) ∈ ∂x F(x; (ai, bi)), each of which requires O(n) time. From this figure, we see that in the time the full subgradient method requires to make even a small amount of progress, both subsampled methods have essentially made all of the progress they will make.
Figure 9: The optimality gap f(x(k)) − f⋆ versus the total number of individual subgradient evaluations, for batch sizes m0 = 1, m0 = 10, and m0 = 200.
Acknowledgments
May Zhou helped with a preliminary version of this document. We thank Abbas El Gamal
for helping us simplify the convergence proof. Trevor Hastie suggested the on-line learning
example.