
Stochastic Subgradient Methods

Stephen Boyd and Almir Mutapcic, with additions by John Duchi


Notes for EE364b, Stanford University, Spring 2017–2018
April 12, 2018

1 Noisy unbiased subgradient


Suppose f : Rn → R is a convex function. We say that a random vector g̃ ∈ Rn is a noisy
(unbiased) subgradient of f at x ∈ dom f if g = E g̃ ∈ ∂f (x), i.e., we have

f (z) ≥ f (x) + (E g̃)T (z − x)

for all z. Thus, g̃ is a noisy unbiased subgradient of f at x if it can be written as g̃ = g + v, where g ∈ ∂f(x) and v has zero mean.
If x is also a random variable, then we say that g̃ is a noisy subgradient of f at x (which
is random) if
∀z f (z) ≥ f (x) + E(g̃|x)T (z − x)
holds almost surely. We can write this compactly as E(g̃|x) ∈ ∂f (x). (‘Almost surely’ is to
be understood here.)
The noise can represent (presumably small) error in computing a true subgradient, er-
ror that arises in Monte Carlo evaluation of a function defined as an expected value, or
measurement error.
Some references for stochastic subgradient methods are [Sho98, §2.4], [Pol87, Chap. 5].
Some books on stochastic programming in general are [BL97, Pre95, Mar05].

2 Stochastic subgradient method


The stochastic subgradient method is essentially the subgradient method, but using noisy
subgradients and a more limited set of step size rules. In this context, the slow convergence
of subgradient methods helps us, since the many steps help ‘average out’ the statistical errors
in the subgradient evaluations.
We’ll consider the simplest case, unconstrained minimization of a convex function f :
Rn → R. The stochastic subgradient method uses the standard update

x(k+1) = x(k) − αk g̃ (k) ,

where x(k) is the kth iterate, αk > 0 is the kth step size, and g̃ (k) is a noisy subgradient of f
at x(k) ,
E(g̃ (k) |x(k) ) = g (k) ∈ ∂f (x(k) ).
Even more so than with the ordinary subgradient method, we can have f (x(k) ) increase
during the algorithm, so we keep track of the best point found so far, and the associated
function value
fbest(k) = min{f(x(1)), . . . , f(x(k))}.

The sequences x(k), g̃(k), and fbest(k) are, of course, stochastic processes.
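
A minimal sketch of the method in Python (NumPy) might look as follows; here f and sg_noisy are assumed callables: sg_noisy(x) returns a noisy unbiased subgradient of f at x, and f(x) is used only to keep track of the best point found so far.

    import numpy as np

    def stoch_subgrad(f, sg_noisy, x0, num_iters=5000):
        # Stochastic subgradient method with step sizes alpha_k = 1/k.
        x = np.asarray(x0, dtype=float).copy()
        f_best, x_best = f(x), x.copy()
        for k in range(1, num_iters + 1):
            alpha = 1.0 / k                   # square summable but not summable
            x = x - alpha * sg_noisy(x)       # noisy subgradient step
            fx = f(x)
            if fx < f_best:                   # keep the best point found so far
                f_best, x_best = fx, x.copy()
        return x_best, f_best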

3 Convergence
We’ll prove a very basic convergence result for the stochastic subgradient method, using step
sizes that are square-summable but not summable,

αk ≥ 0,    Σ_{k=1}^∞ αk² = ‖α‖₂² < ∞,    Σ_{k=1}^∞ αk = ∞.

We assume there is an x⋆ that minimizes f, and a G for which E ‖g̃(k)‖₂² ≤ G² for all k. We also assume that R satisfies E ‖x(1) − x⋆‖₂² ≤ R².
We will show that

E fbest(k) → f⋆

as k → ∞, i.e., we have convergence in expectation. We also have convergence in probability: for any ǫ > 0,

lim_{k→∞} Prob(fbest(k) ≥ f⋆ + ǫ) = 0.

(More sophisticated methods can be used to show almost sure convergence.)


We have

E( ‖x(k+1) − x⋆‖₂² | x(k) ) = E( ‖x(k) − αk g̃(k) − x⋆‖₂² | x(k) )
                            = ‖x(k) − x⋆‖₂² − 2αk E( g̃(k)T (x(k) − x⋆) | x(k) ) + αk² E( ‖g̃(k)‖₂² | x(k) )
                            = ‖x(k) − x⋆‖₂² − 2αk E(g̃(k) | x(k))T (x(k) − x⋆) + αk² E( ‖g̃(k)‖₂² | x(k) )
                            ≤ ‖x(k) − x⋆‖₂² − 2αk (f(x(k)) − f⋆) + αk² E( ‖g̃(k)‖₂² | x(k) ),

where the inequality holds almost surely, and follows because E(g̃ (k) |x(k) ) ∈ ∂f (x(k) ).
Now we take expectation to get

E ‖x(k+1) − x⋆‖₂² ≤ E ‖x(k) − x⋆‖₂² − 2αk (E f(x(k)) − f⋆) + αk² G²,

using E ‖g̃(k)‖₂² ≤ G². Recursively applying this inequality yields


E ‖x(k+1) − x⋆‖₂² ≤ E ‖x(1) − x⋆‖₂² − 2 Σ_{i=1}^k αi (E f(x(i)) − f⋆) + G² Σ_{i=1}^k αi².

Using E ‖x(1) − x⋆‖₂² ≤ R², E ‖x(k+1) − x⋆‖₂² ≥ 0, and Σ_{i=1}^k αi² ≤ ‖α‖₂², we have

2 Σ_{i=1}^k αi (E f(x(i)) − f⋆) ≤ R² + G² ‖α‖₂².

Therefore we have

min_{i=1,...,k} (E f(x(i)) − f⋆) ≤ (R² + G² ‖α‖₂²) / (2 Σ_{i=1}^k αi),

which shows that min_{i=1,...,k} E f(x(i)) converges to f⋆.


Finally, we note that by Jensen's inequality and concavity of the minimum function, we have

E fbest(k) = E min_{i=1,...,k} f(x(i)) ≤ min_{i=1,...,k} E f(x(i)),

so E fbest(k) also converges to f⋆.
To show convergence in probability, we use Markov's inequality to obtain, for ǫ > 0,

Prob(fbest(k) − f⋆ ≥ ǫ) ≤ E(fbest(k) − f⋆) / ǫ.
The righthand side goes to zero as k → ∞, so the lefthand side does as well.

4 Example
We consider the problem of minimizing a piecewise-linear function,
minimize f (x) = maxi=1,...,m (aTi x + bi ),
with variable x ∈ Rn . At each step, we evaluate a noisy subgradient of the form g̃ (k) =
g (k) + v (k) , where g (k) ∈ ∂f (x(k) ) and v (k) are independent zero mean random variables.
We illustrate the stochastic subgradient method with the same problem instance used
in the notes on subgradient methods, with n = 20 variables, m = 100 terms, and problem data ai and bi generated from a unit normal distribution. We start with x(1) = 0,
and use the square summable but not summable step rule αk = 1/k. To report f (x(k) ) − f ⋆ ,
we compute the optimal value f ⋆ using linear programming.
The noises v(k) are IID N(0, 0.5I). Since the norm of the vectors ai is on the order of 4 or 5 (the RMS value is √20), the subgradient noise is roughly 25% of the size of a true subgradient.
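
As a rough sketch of this setup (the exact problem data used for the figures are not reproduced here, so the data A, b and the random seed below are illustrative stand-ins), the objective and a noisy subgradient oracle can be written as follows.

    import numpy as np

    n, m = 20, 100
    rng = np.random.default_rng(0)
    A = rng.standard_normal((m, n))     # rows are the a_i
    b = rng.standard_normal(m)

    def f(x):                           # f(x) = max_i (a_i^T x + b_i)
        return np.max(A @ x + b)

    def sg_noisy(x):                    # g + v with g in the subdifferential, v ~ N(0, 0.5 I)
        j = np.argmax(A @ x + b)        # an active term; a_j is a subgradient at x
        return A[j] + np.sqrt(0.5) * rng.standard_normal(n)

Plugging f and sg_noisy into the update loop sketched earlier, with αk = 1/k, should reproduce the qualitative behavior shown in figure 1.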
Figure 1 shows the convergence of the stochastic subgradient method for two realizations
of the noisy subgradient process, together with the noise-free case for comparison. This
figure shows that convergence is only a bit slower with subgradient noise.
We carried out 100 realizations, and show the (sample) mean and standard deviation of fbest(k) − f⋆ for k in multiples of 250, in figure 2. (The error bars show the mean plus and minus one standard deviation.) Figure 3 shows the empirical distribution (over the 100 realizations) of fbest(k) − f⋆ at iterations k = 250, k = 1000, and k = 5000.

Figure 1: The value of fbest(k) − f⋆ versus iteration number k, for the subgradient method with step size αk = 1/k. The plot shows a noise-free realization, and two realizations with subgradient noise.

Figure 2: Average and one standard deviation error bars for fbest(k) − f⋆ versus iteration number k, computed using 100 realizations, every 250 iterations.

Figure 3: Empirical distributions of fbest(k) − f⋆ at k = 250, k = 1000, and k = 5000 iterations, based on 100 realizations.

5 Stochastic programming
A stochastic programming problem has the form
minimize    E f0(x, ω)
subject to  E fi(x, ω) ≤ 0,   i = 1, . . . , m,        (1)
where x ∈ Rn is the optimization variable, and ω is a random variable. If fi (x, ω) is convex
in x for each ω, the problem is a convex stochastic programming problem. In this case the
objective and constraint functions are convex.
Stochastic programming can be used to model a variety of robust design or decision
problems with uncertain data. Although the basic form above involves only expectation
or average values, some tricks can be used to capture other measures of the probability
distributions of fi (x, ω). We can replace an objective or constraint term E f (x, ω) with
E Φ(f (x, ω)), where Φ is a convex increasing function. For example, with Φ(u) = max(u, 0),
we can form a constraint of the form
E fi (x, ω)+ ≤ ǫ,
where ǫ is a positive parameter, and (·)+ denotes positive part. Here E fi (x, ω)+ has a simple
interpretation as the expected violation of the ith constraint. It’s also possible to combine
the constraints, using a single constraint of the form
E max(f1 (x, ω)+ , . . . , fm (x, ω)+ ) ≤ ǫ.
The lefthand side here can be interpreted as the expected worst violation (over all constraints).
Constraints of the form Prob(fi (x, ω) ≤ 0) ≥ η, which require a constraint to hold with
a probability (or reliability) exceeding η, are called chance constraints. These constraints
cannot be directly handled using the simple trick above, but an expected violation constraint
can often give a good approximation for a chance constraint. (Some chance constraints can
be handled exactly, e.g., when f (x, ω) = (a + ω)T x − b, ω is Gaussian, and η ≥ 0.5.)
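To make the parenthetical concrete: assuming, say, ω ∼ N(0, Σ), the chance constraint Prob((a + ω)T x ≤ b) ≥ η is equivalent to

aT x + Φ⁻¹(η) ‖Σ^{1/2} x‖₂ ≤ b,

where Φ is the standard normal cumulative distribution function. This is a second-order cone constraint, and it is convex exactly when η ≥ 0.5, since then Φ⁻¹(η) ≥ 0.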
Recall that Jensen’s inequality tells us
E fi (x, ω) ≥ fi (x, E ω).
Now consider the problem
minimize    f0(x, E ω)
subject to  fi(x, E ω) ≤ 0,   i = 1, . . . , m,        (2)
obtained by replacing the random variable in each function by its expected value. This
problem is sometimes called the certainty equivalent of the original stochastic programming
problem (1), even though the two are equivalent only in very special cases. By Jensen's inequality, the constraint set of the certainty equivalent problem is larger than that of the original stochastic problem (1), and its objective is smaller. It follows that the optimal value of the certainty equivalent problem gives a lower bound on the optimal value of the stochastic problem (1). (It can be a poor bound, of course.)

5.1 Noisy subgradient of expected function value
Suppose F : Rn × Rp → R, and F (x, w) is convex in x for each w. We define
f(x) = E F(x, w) = ∫ F(x, w) p(w) dw,

where p is the density of w. (The integral is in Rp .) The function f is convex. We’ll show
how to compute a noisy unbiased subgradient of f at x.
The function f comes up in many applications. We can think of x as some kind of
design variable to be chosen, and w as some kind of parameter that is random, i.e., subject
to statistical fluctuation. The function F tells us the cost of choosing x when w takes a
particular value; the function f , which is deterministic, gives the average cost of choosing
x, taking the statistical variation of w into account. Note that the dimension of w, how it
enters F , and the distribution are not restricted; we only require that for each value of w,
F is convex in x.
Except in some very special cases, we cannot easily compute f (x) exactly. However,
we can approximately compute f using Monte Carlo methods, if we can cheaply generate
samples of w from its distribution. (This depends on the distribution.) We generate M
independent samples w1 , . . . , wM , and then take
fˆ(x) = (1/M) Σ_{i=1}^M F(x, wi)

as our estimate of f (x). We hope that if M is large enough, we get a good estimate. In fact,
fˆ(x) is a random variable with E fˆ(x) = f (x), and a variance equal to c/M , where c is the
variance of F(x, w). If we know or bound c, then we can at least bound the probability of a
given level of error, i.e., Prob(|fˆ(x) − f (x)| ≥ ǫ). In many cases it’s possible to carry out a
much more sophisticated analysis of the error in Monte Carlo methods, but we won’t pursue
that here. A summary for our purposes is: we cannot evaluate f (x) exactly, but we can get
a good approximation, with (possibly) much effort.
Let G : Rn × Rp → Rn be a function that satisfies

G(x, w) ∈ ∂x F (x, w)

for each x and w. In other words, G(x, w) selects a subgradient for each value of x and w.
If F (x, w) is differentiable in x then we must have G(x, w) = ∇x F (x, w).
We claim that

g = E G(x, w) = ∫ G(x, w) p(w) dw ∈ ∂f(x).

To see this, note that for each w and any z we have

F (z, w) ≥ F (x, w) + G(x, w)T (z − x),

since G(x, w) ∈ ∂x F(x, w). Multiplying this by p(w), which is nonnegative, and integrating gives

∫ F(z, w) p(w) dw ≥ ∫ ( F(x, w) + G(x, w)T (z − x) ) p(w) dw
                   = f(x) + gT (z − x).

Since the lefthand side is f (z), we’ve shown g ∈ ∂f (x).


Now we can explain how to compute a noisy unbiased subgradient of f at x. Generate
independent samples w1 , . . . , wM . We then take
g̃ = (1/M) Σ_{i=1}^M G(x, wi).

In other words, we evaluate a subgradient of F, at x, for M random samples of w, and take g̃ to be the average. At the same time we can also compute fˆ(x), the Monte Carlo estimate of f(x). We have E g̃ = E G(x, w) = g, which we showed above is a subgradient of f at x. Thus, g̃ is a noisy unbiased subgradient of f at x.
This result is independent of M . We can even take M = 1. In this case, g̃ = G(x, w1 ).
In other words, we simply generate one sample w1 and use the subgradient of F for that
value of w. In this case g̃ could hardly be called a good approximation of a subgradient of
f , but its mean is a subgradient, so it is a valid noisy unbiased subgradient.
On the other hand, we can take M to be large. In this case, g̃ is a random vector with
mean g and very small variance, i.e., it is a good estimate of g, a subgradient of f at x.
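
The construction translates directly into code. A sketch, assuming callables F(x, w), a subgradient selector G(x, w), and a sampler sample_w() are available:

    import numpy as np

    def mc_value_and_subgrad(F, G, sample_w, x, M=100):
        # Monte Carlo estimate of f(x) = E F(x, w) and a noisy unbiased subgradient.
        x = np.asarray(x, dtype=float)
        f_hat, g_tilde = 0.0, np.zeros_like(x)
        for _ in range(M):
            w = sample_w()
            f_hat   += F(x, w) / M                           # Monte Carlo estimate of f(x)
            g_tilde += np.asarray(G(x, w), dtype=float) / M  # average of sampled subgradients
        return f_hat, g_tilde    # E g_tilde is a subgradient of f at x, for any M, even M = 1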

5.2 Example
We consider the problem of minimizing the expected value of a piecewise-linear convex
function with random coefficients,

minimize f (x) = E maxi=1,...,m (aTi x + bi ),

with variable x ∈ Rn . The data vectors ai ∈ Rn and bi ∈ R are random with some given
distribution. We can compute an (unbiased) approximation of f (x), and a noisy unbiased
subgradient g ∈ ∂f (x), using Monte Carlo methods.
We consider a problem instance with n = 20 variables and m = 100 terms. We assume
that ai ∼ N(āi, 5I) and b ∼ N(b̄, 5I). The mean values āi and b̄ are generated from unit normal distributions (and are the same as the constant values ai and bi used in the previous examples). We take x(1) = 0 as the starting point and use the step size rule αk = 1/k.
We first compare the solution of the stochastic problem xstoch (obtained from the stochas-
tic subgradient method) with xce , the solution of the certainty equivalent problem

minimize fce (x) = maxi=1,...,m (E aTi x + E bi )

Figure 4: Empirical distributions of maxi (aTi x + bi) for the certainty equivalent solution xce, heuristic-based solution xheur, and the stochastic optimal solution xstoch. The dark lines show the value of f(x), and the dotted lines show the value of fce(x).

(which is the same piecewise-linear minimization problem considered in earlier examples).


We also compare it with xheur , the solution of the problem

minimize    fheur(x) = maxi=1,...,m (E aTi x + E bi + λ‖x‖₂),

where λ is a positive parameter. The extra term is meant to account for the variation in aTi x + bi caused by variation in ai; the problem above can be cast as an SOCP and easily solved. We chose λ = 1, after some experimentation.
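
As an aside, the heuristic problem is easy to pose in a modeling tool such as CVXPY; the sketch below uses randomly generated stand-ins for the means āi and b̄, since the actual problem data are not reproduced here.

    import numpy as np
    import cvxpy as cp

    n, m, lam = 20, 100, 1.0
    rng = np.random.default_rng(0)
    A_bar = rng.standard_normal((m, n))   # stand-ins for the means a_bar_i (rows)
    b_bar = rng.standard_normal(m)        # stand-ins for the means b_bar_i

    x = cp.Variable(n)
    # max_i(a_bar_i^T x + b_bar_i + lam*||x||_2) = max_i(a_bar_i^T x + b_bar_i) + lam*||x||_2,
    # since the norm term does not depend on i
    objective = cp.max(A_bar @ x + b_bar) + lam * cp.norm(x, 2)
    cp.Problem(cp.Minimize(objective)).solve()
    x_heur = x.value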
The certainty equivalent values for the three points are fce (xce ) = 1.12, fce (xheur ) = 1.23,
and fce (xstoch ) = 1.44. We (approximately) evaluate f (x) for these three points, based on
1000 Monte Carlo samples, obtaining f (xce ) ≈ 2.12, f (xheur ) ≈ 1.88, and f (xstoch ) ≈ 1.83.
The empirical distributions of maxi (aTi x + bi ), for x = xstoch , x = xheur , and for x = xce , are
shown in figure 4.
In summary, the heuristic finds a point that is good, but not quite as good as the
stochastic optimal. Both of these points are much better than the certainty equivalent
point.
Now we show the convergence of the stochastic subgradient method, evaluating noisy
subgradients with M = 1, M = 10, M = 100, and M = 1000 samples at each step. For

Figure 5: The value of fbest(k) − f⋆ versus iteration number k, for the stochastic subgradient method with step size rule αk = 1/k. The plot shows one realization for noisy subgradients evaluated using M = 1, M = 10, M = 100, and M = 1000.

M = 1000, we are computing a fairly accurate subgradient; for M = 1, the variance in our
computed subgradient is large.
We (approximately) evaluate the function f(x) using M = 1000 samples at each iteration, and keep track of the best value fbest(k). We estimate f⋆ by running the stochastic subgradient
algorithm for a large number of iterations. Figure 5 shows the convergence for one realization
for each of the four values of M . As expected, convergence is faster with M larger, which
yields a more accurate subgradient. Assuming the cost of an iteration is proportional to M ,
M = 1 seems to be the best choice. In any case, there seems little advantage (at least in
this example) to using a value of M larger than 10.

6 More examples
6.1 Minimizing expected maximum violation
The vector x ∈ Rn is to be chosen subject to some (deterministic) linear inequalities F x ⪯ g. These can represent manufacturing limits, cost limits, etc. Another set of random inequalities is given as Ax ⪯ b, where A and b come from some distribution. We would like these inequalities to hold, but because A and b vary, there may be no choice of x that results in Ax ⪯ b almost surely. We still would like to choose x so that Ax ⪯ b holds often, and when it is violated, the violation is small.

Perhaps the most natural problem formulation is to maximize the yield, defined as Prob(Ax ⪯ b). In some very special cases (e.g., A is deterministic and b has log-concave density), this can be converted to a convex problem; but in general it is not convex. Also, yield is not sensitive to how much the inequalities Ax ⪯ b are violated; a small violation is the same as a large violation, as far as the yield is concerned.
Instead, we will work with the expected maximum violation of the inequalities. We define the maximum violation as

max((Ax − b)+) = maxi (aTi x − bi)+,

where ai are the rows of A. The maximum violation is zero if and only if Ax ⪯ b; it is positive otherwise. The expected maximum violation is

E max((Ax − b)+) = E maxi (aTi x − bi)+.

It is a complicated but convex function of x, and gives a measure of how often, and by how much, the constraints Ax ⪯ b are violated. It is convex no matter what the distribution of A and b is.
As an aside, we note that many other interesting measures of violation could also be used, such as the expected total (or average) violation, E 1T(Ax − b)+, or the expected sum of the squares of the individual violations, E ‖(Ax − b)+‖₂². We can even measure violation by the expected distance from the desired polyhedron, E dist(x, {z | Az ⪯ b}), which we call the expected violation distance.
Back to our main story, and using expected (maximum) violation, we have the problem

minimize    E max((Ax − b)+) = E maxi (aTi x − bi)+
subject to  F x ⪯ g.

The data are F, g, and the distribution of A and b.
We'll use a stochastic projected subgradient method to solve this problem. Given a point x that satisfies F x ⪯ g, we need to generate a noisy unbiased subgradient g̃ of the objective. To do this we first generate a sample of A and b. If Ax ⪯ b, we can take g̃ = 0. (If a true subgradient were zero, it would mean we are done; that is not the case here, since g̃ is only a noisy subgradient.) In the stochastic subgradient algorithm, g̃ = 0 means we don't update the current value of x. But we don't stop the algorithm, as we would in a deterministic projected subgradient method.
If Ax ⪯ b is violated, choose j for which aTj x − bj = maxi (aTi x − bi)+. Then we can take g̃ = aj. We then set xtemp = x − αk aj, where αk is the step size. Finally, we project xtemp back to the feasible set to get the updated value of x. (This can be done by solving a QP.)
We can generate a number of samples of A and b, and use the method above to find an
unbiased noisy subgradient for each case. We can use the average of these as g̃. If we take
enough samples, we can simultaneously estimate the expected maximum violation for the
current value of x.
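
A sketch of one iteration of this stochastic projected subgradient method, with the projection computed by solving a small QP in CVXPY; the data F, g and a sampler sample_Ab() for (A, b) are assumed to be given.

    import numpy as np
    import cvxpy as cp

    def project_onto_polyhedron(x_temp, F, g):
        # Euclidean projection of x_temp onto {z : F z <= g}, computed by solving a QP.
        z = cp.Variable(len(x_temp))
        cp.Problem(cp.Minimize(cp.sum_squares(z - x_temp)), [F @ z <= g]).solve()
        return z.value

    def stoch_proj_subgrad_step(x, alpha_k, sample_Ab, F, g):
        # One iteration: sample (A, b), form a noisy subgradient of E max((Ax - b)+), project.
        A, b = sample_Ab()
        viol = A @ x - b
        if np.max(viol) <= 0:              # Ax <= b for this sample: g_tilde = 0, no update
            return x
        j = np.argmax(viol)                # most violated inequality
        x_temp = x - alpha_k * A[j]        # a_j is the sampled subgradient
        return project_onto_polyhedron(x_temp, F, g)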
A reasonable starting point is a point well inside the certainty equivalent inequalities E Ax ⪯ E b, that also satisfies F x ⪯ g. Simple choices include the analytic, Chebyshev, or maximum volume ellipsoid centers, or the point that maximizes the margin min(E b − E Ax), subject to F x ⪯ g.

6.2 On-line learning and adaptive signal processing
We suppose that (x, y) ∈ Rn × R have some joint distribution. Our goal is to find a weight
vector w ∈ Rn for which wT x is a good estimator of y. (This is linear regression; we can add
an extra component to x that is always one to get an affine estimator of the form wT x + v.)
We’d like to choose a weight vector that minimizes

J(w) = E l(wT x − y),

where l : R → R is a convex loss function. For example, if l(u) = u2, J is the mean-square error; if l(u) = |u|, J is the mean-absolute error. Other interesting loss functions include
the Huber function, or a function with dead zone, as in l(u) = max{|u| − 1, 0}. For the
special case l(u) = u2 , there is an analytic formula for the optimal w, in terms of the mean
and covariance matrix of (x, y). But we consider here the more general case. Of course J is
convex, so minimizing J over w is a convex optimization problem.
We consider an on-line setting, where we do not know the distribution of (x, y). At each
step (which might correspond to time, for example), we are given a sample (x(i) , y (i) ) from
the distribution. After k steps, we could use the k samples to estimate the distribution; for
example, we could choose w to minimize the average loss under the empirical distribution.
This approach requires that we store all past samples, and we need to solve a large problem
each time a new sample comes in.
Instead we’ll use the stochastic subgradient method, which is really simple. In particular,
it requires essentially no storage (beyond the current value of w), and very little computation.
The disadvantage is that it is slow.
We will carry out a stochastic subgradient step each time a new sample arrives. Suppose
we are given a new sample (x(k+1) , y (k+1) ), and the current weight value is w(k) . Form

l′ (w(k)T x(k+1) − y (k+1) )x(k+1) ,

where l′ is the derivative of l. (If l is nondifferentiable, substitute any subgradient of l in place of l′.) This is a noisy unbiased subgradient of J.
Our on-line algorithm is very simple: when (x(k+1), y(k+1)) becomes available, we update w(k) as follows:

w(k+1) = w(k) − αk l′(w(k)T x(k+1) − y(k+1)) x(k+1).        (3)

Note that w(k)T x(k+1) − y(k+1) is the prediction error for the (k + 1)st sample, using the weight from step k. We can choose (for example) αk = 1/k.
Updates of the form (3) are widely used in adaptive signal processing, typically with l(u) = u2. In this case the update (3) is called the LMS (least mean-square) algorithm.
We illustrate the method with a simple example with n = 10, with (x, y) ∼ N (0, Σ),
where Σ is chosen randomly, and l(u) = |u|. (For the problem instance, we have E(y 2 ) ≈ 12.)
The update has the simple form

w(k+1) = w(k) − αk sign(w(k)T x(k+1) − y (k+1) )x(k+1) .

We take αk = 1/k. (In adaptive signal processing, this is called a sign algorithm.)
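
In code, the on-line sign algorithm is only a few lines; here sample_xy() is an assumed source of samples from the joint distribution of (x, y).

    import numpy as np

    def sign_algorithm(sample_xy, n, num_iters=5000):
        # On-line sign algorithm: stochastic subgradient method for J(w) = E |w^T x - y|.
        w = np.zeros(n)
        for k in range(1, num_iters + 1):
            x, y = sample_xy()                        # a new sample arrives
            err = w @ x - y                           # prediction error with current weights
            w = w - (1.0 / k) * np.sign(err) * x      # update (3) with l(u) = |u|, alpha_k = 1/k
        return w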
We compute w⋆ using the stochastic subgradient algorithm, which we run for 5000 iter-
ations. Figure 6 shows the behaviour of the prediction error w(k)T x(k+1) − y (k+1) for the first
300 iterations of the run. Figure 7 shows the empirical distribution (over 1000 samples) of the prediction errors for w⋆.

6.3 Large datasets and Monte Carlo methods


As an extension of the preceding section, we may revisit the stochastic programming prob-
lem (1) and consider performing the stochastic (sub)gradient method in this situation. It
is in these problems that stochastic subgradient methods, and more generally subgradient
methods, really become effective approaches to the solution of large-scale optimization prob-
lems. This is perhaps to be expected: while the subgradient method by
itself is quite slow, our convergence arguments suggest that its performance barely degrades
even when there is substantial noise in the computation of approximate subgradients g̃.
As a particular special case, consider the situation in which we have a large sample, which
we represent as pairs (ai , bi ), and have objective
f(x) = (1/m) Σ_{i=1}^m F(x; (ai, bi)),

where for each pair (a, b), the objective F (x; (a, b)) is convex in x ∈ Rn . In this situation, if
we choose any m0 ≤ m as a batch size, we may construct a stochastic subgradient by taking
a subsample of indices i1 , . . . , im0 uniformly at random, either with or without replacement,
from {1, . . . , m}, then setting
g̃ ∈ (1/m0) Σ_{j=1}^{m0} ∂x F(x; (a_{i_j}, b_{i_j})).

It is clear that E g̃ ∈ ∂f (x) with this choice. The interesting, though obvious, aspect of
this is that if m0 ≪ m, then it is m/m0 -times more efficient to compute the noisy unbiased
subgradient g̃ as opposed to the full subgradient of the objective. The power of stochastic
subgradient methods comes from the fact that, as often we only care about getting a mod-
erately accurate solution, we can often take m0 ≪ m to compute stochastic subgradients,
use these in the iteration of the methods, and achieve substantial computational benefits.
Here we work in reasonable detail through one such instance of this. We generate data pairs as follows. We fix a vector w⋆ ∈ Rn, chosen uniformly from the unit sphere (so that ‖w⋆‖ = 1), and then generate i.i.d. vectors ai ∼ N(0, I), the normal distribution on Rn. We then set bi = aTi w⋆ + ξi for ξi drawn according to a Cauchy distribution. For our objective, we take the absolute error

F(x; (a, b)) = |aT x − b|.

With this setup, our objective becomes f(x) = (1/m) ‖Ax − b‖₁, the mean ℓ1-error in our predictions, which is a natural choice for heavy-tailed data.

Figure 6: Prediction error w(k)T x(k+1) − y(k+1) versus iteration number k.

Figure 7: Empirical distribution of prediction errors for w⋆, based on 1000 samples.
Figure 8: Empirical performance of stochastic subgradient methods using a single sample (m0 = 1), a sample of ten pairs (ai, bi) (m0 = 10), or the full subgradient (m0 = m = 200) for the absolute regression problem. The horizontal axis indexes the number of iterations of each method.

With this in mind, we can now compare a projected subgradient method for this problem
with m = 200 and m0 chosen from 1, 10, and 200. Each iteration updates

x(k+1) = Π(x(k) − αk g̃ (k) )

where Π denotes projection onto the ℓ2 -ball of radius 1. We give two figures. In the first
(Fig. 8), we show the optimality gap f (x(k) ) − f (x⋆ ) versus iteration, averaged over 50
different experiments. For each method, we used stepsize αk = α/√k, where the scalar α was
chosen to yield the best performance in terms of optimality gap. On a per-iteration basis,
it is clear that the subgradient method using the full sample size m0 = 200 has the best
convergence as a function of the iterations.
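
A sketch of this experiment follows; the step size scale alpha below is a placeholder (in the experiments it is tuned for the best optimality gap), and the data are generated as described above.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 200, 20
    w_star = rng.standard_normal(n)
    w_star /= np.linalg.norm(w_star)                 # uniform on the unit sphere
    A = rng.standard_normal((m, n))
    b = A @ w_star + rng.standard_cauchy(m)          # heavy-tailed measurement noise

    def f(x):                                        # f(x) = (1/m) ||Ax - b||_1
        return np.mean(np.abs(A @ x - b))

    def minibatch_subgrad(x, m0):
        # Noisy unbiased subgradient of f from a random subsample of size m0.
        idx = rng.choice(m, size=m0, replace=False)
        return A[idx].T @ np.sign(A[idx] @ x - b[idx]) / m0

    def run(m0, alpha=0.1, num_iters=400):
        # Projected stochastic subgradient method with step sizes alpha / sqrt(k).
        x = np.zeros(n)
        for k in range(1, num_iters + 1):
            x = x - (alpha / np.sqrt(k)) * minibatch_subgrad(x, m0)
            nx = np.linalg.norm(x)
            if nx > 1:                               # projection onto the unit l2-ball
                x = x / nx
        return x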
On the other hand, when we normalize by the amount of actual computation performed,
the story is quite different. In Figure 9 we plot the optimality gap f (x(k) ) − f ⋆ against
the total amount of computation each method requires, where computation is given by the
total number of computations of individual subgradients ∂x F (x; (ai , bi )) = ai sign(aTi x − bi ),
each of which requires O(n) time. From this figure, we see that in the time the subgradient
method requires to make even a small amount of progress, both subsampled methods have
essentially made all of the progress they will make.

Figure 9: Empirical performance of stochastic subgradient methods using a single sample (m0 = 1), a sample of ten pairs (ai, bi) (m0 = 10), or the full subgradient (m0 = m = 200) for the absolute regression problem. The horizontal axis indexes the number of computations of each method. Thus, for the single sample m0 = 1 case, each tick mark indicates a total of m subgradient steps; for m0 = 10, each tick indicates m/m0 = 20 subgradient steps.
Acknowledgments
May Zhou helped with a preliminary version of this document. We thank Abbas El Gamal
for helping us simplify the convergence proof. Trevor Hastie suggested the on-line learning
example.

References
[BL97] J. R. Birge and F. Louveaux. Introduction to Stochastic Programming. Springer,
New York, 1997.

[Mar05] K. Marti. Stochastic Optimization Methods. Springer, 2005.

[Pol87] B. Polyak. Introduction to Optimization. Optimization Software, Inc., 1987.

[Pre95] A. Prekopa. Stochastic Programming. Kluwer Academic Publishers, 1995.

[Sho98] N. Shor. Nondifferentiable Optimization and Polynomial Problems. Nonconvex Optimization and its Applications. Kluwer, 1998.
