The stochastic subgradient method uses the standard update

    x(k+1) = x(k) − αk g̃(k),

where x(k) is the kth iterate, αk > 0 is the kth step size, and g̃(k) is a noisy subgradient of f at x(k), i.e.,

    E(g̃(k) | x(k)) = g(k) ∈ ∂f(x(k)).
Even more so than with the ordinary subgradient method, we can have f (x(k) ) increase
during the algorithm, so we keep track of the best point found so far, and the associated
function value
    f_best^(k) = min{f(x(1)), . . . , f(x(k))}.

The sequences x(k), g̃(k), and f_best^(k) are, of course, stochastic processes.
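As an aside, the basic iteration is simple enough to state in a few lines of code. Here is a minimal Python sketch (our illustration, not part of the original notes; f and noisy_subgrad stand in for an objective and a noisy subgradient oracle satisfying the condition above):

    import numpy as np

    def stoch_subgrad(f, noisy_subgrad, x1, K):
        # Minimal stochastic subgradient method with step size alpha_k = 1/k,
        # tracking the best objective value found so far.
        x = np.asarray(x1, dtype=float).copy()
        f_best = f(x)
        for k in range(1, K + 1):
            gt = noisy_subgrad(x)        # E(gt | x) is a subgradient of f at x
            x = x - (1.0 / k) * gt       # alpha_k = 1/k: square summable, not summable
            f_best = min(f_best, f(x))   # f(x(k)) need not decrease; keep the best value
        return x, f_best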
3 Convergence
We’ll prove a very basic convergence result for the stochastic subgradient method, using step
sizes that are square-summable but not summable,
    αk ≥ 0,    Σ_{k=1}^∞ αk² = ‖α‖₂² < ∞,    Σ_{k=1}^∞ αk = ∞.
We assume there is an x⋆ that minimizes f, and a G for which E ‖g̃(k)‖₂² ≤ G² for all k. We also assume that R satisfies E ‖x(1) − x⋆‖₂² ≤ R².
We will show that

    E f_best^(k) → f⋆

as k → ∞, i.e., we have convergence in expectation. We also have convergence in probability: for any ε > 0,

    lim_{k→∞} Prob(f_best^(k) ≥ f⋆ + ε) = 0.
Conditioning on x(k) and expanding the update, we have

    E(‖x(k+1) − x⋆‖₂² | x(k))
        = E(‖x(k) − αk g̃(k) − x⋆‖₂² | x(k))
        = ‖x(k) − x⋆‖₂² − 2αk E(g̃(k) | x(k))ᵀ(x(k) − x⋆) + αk² E(‖g̃(k)‖₂² | x(k))
        ≤ ‖x(k) − x⋆‖₂² − 2αk (f(x(k)) − f⋆) + αk² E(‖g̃(k)‖₂² | x(k)),

where the inequality holds almost surely, and follows because E(g̃(k) | x(k)) ∈ ∂f(x(k)).
Now we take expectation over x(k) (using iterated expectation and E ‖g̃(k)‖₂² ≤ G²) to get

    E ‖x(k+1) − x⋆‖₂² ≤ E ‖x(k) − x⋆‖₂² − 2αk (E f(x(k)) − f⋆) + αk² G².

Applying this inequality recursively gives

    E ‖x(k+1) − x⋆‖₂² ≤ E ‖x(1) − x⋆‖₂² − 2 Σ_{i=1}^k αi (E f(x(i)) − f⋆) + G² Σ_{i=1}^k αi².

Using E ‖x(1) − x⋆‖₂² ≤ R², E ‖x(k+1) − x⋆‖₂² ≥ 0, and Σ_{i=1}^k αi² ≤ ‖α‖₂², we have

    2 Σ_{i=1}^k αi (E f(x(i)) − f⋆) ≤ R² + G² ‖α‖₂².
Therefore we have

    min_{i=1,...,k} (E f(x(i)) − f⋆) ≤ (R² + G² ‖α‖₂²) / (2 Σ_{i=1}^k αi).

The righthand side converges to zero as k → ∞, since Σ_{i=1}^k αi → ∞, so min_{i=1,...,k} E f(x(i)) → f⋆. Since the minimum is a concave function, Jensen's inequality gives E f_best^(k) ≤ min_{i=1,...,k} E f(x(i)), and since E f_best^(k) ≥ f⋆ we conclude E f_best^(k) → f⋆. Convergence in probability then follows from Markov's inequality applied to the nonnegative random variable f_best^(k) − f⋆.
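As a quick sanity check of this bound (our own aside), consider the step size rule αk = 1/k used in the example below: ‖α‖₂² = Σ_{k=1}^∞ 1/k² = π²/6 is finite, while Σ_{i=1}^k αi ≈ log k grows without bound, so the righthand side of the bound tends to zero roughly like 1/log k. Convergence is guaranteed, but can be quite slow.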
4 Example
We consider the problem of minimizing a piecewise-linear function,

    minimize f(x) = max_{i=1,...,m} (aiᵀx + bi),

with variable x ∈ Rⁿ. At each step, we evaluate a noisy subgradient of the form g̃(k) = g(k) + v(k), where g(k) ∈ ∂f(x(k)) and v(k) are independent zero-mean random variables.
We illustrate the stochastic subgradient method on the same problem instance used in the notes on subgradient methods, with n = 20 variables, m = 100 terms, and problem data ai and bi generated from a unit normal distribution. We start with x(1) = 0, and use the square-summable-but-not-summable step size rule αk = 1/k. To report f(x(k)) − f⋆, we compute the optimal value f⋆ using linear programming.
The noises v(k) are IID N(0, 0.5I). Since the norm of the vectors ai is on the order of 4 or 5 (the RMS value is √20 ≈ 4.5), the subgradient noise is around 25% compared to the true subgradient.
Figure 1 shows the convergence of the stochastic subgradient method for two realizations
of the noisy subgradient process, together with the noise-free case for comparison. This
figure shows that convergence is only a bit slower with subgradient noise.
We carried out 100 realizations, and show the (sample) mean and standard deviation of f_best^(k) − f⋆ for k in multiples of 250, in figure 2. (The error bars show the mean plus and minus one standard deviation.) Figure 3 shows the empirical distribution (over the 100 realizations) of f_best^(k) − f⋆ at iterations k = 250, k = 1000, and k = 5000.
Figure 1: The value of f_best^(k) − f⋆ versus iteration number k, for the subgradient method with step size αk = 1/k. The plot shows a noise-free realization, and two realizations with subgradient noise.
Figure 2: Average and one standard deviation error bars for f_best^(k) − f⋆ versus iteration number k, computed using 100 realizations, every 250 iterations.
Figure 3: Empirical distributions of f_best^(k) − f⋆ at k = 250, k = 1000, and k = 5000 iterations, based on 100 realizations.
5 Stochastic programming
A stochastic programming problem has the form

    minimize    E f0(x, ω)
    subject to  E fi(x, ω) ≤ 0,   i = 1, . . . , m,        (1)
where x ∈ Rn is the optimization variable, and ω is a random variable. If fi (x, ω) is convex
in x for each ω, the problem is a convex stochastic programming problem. In this case the
objective and constraint functions are convex.
Stochastic programming can be used to model a variety of robust design or decision
problems with uncertain data. Although the basic form above involves only expectation
or average values, some tricks can be used to capture other measures of the probability
distributions of fi (x, ω). We can replace an objective or constraint term E f (x, ω) with
E Φ(f (x, ω)), where Φ is a convex increasing function. For example, with Φ(u) = max(u, 0),
we can form a constraint of the form
    E fi(x, ω)₊ ≤ ε,

where ε is a positive parameter, and (·)₊ denotes positive part. Here E fi(x, ω)₊ has a simple
interpretation as the expected violation of the ith constraint. It’s also possible to combine
the constraints, using a single constraint of the form
    E max(f1(x, ω)₊, . . . , fm(x, ω)₊) ≤ ε.
The lefthand side here can be interpreted as the expected worst violation (over all constraints).
Constraints of the form Prob(fi (x, ω) ≤ 0) ≥ η, which require a constraint to hold with
a probability (or reliability) exceeding η, are called chance constraints. These constraints
cannot be directly handled using the simple trick above, but an expected violation constraint
can often give a good approximation for a chance constraint. (Some chance constraints can
be handled exactly, e.g., when f (x, ω) = (a + ω)T x − b, ω is Gaussian, and η ≥ 0.5.)
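To see why the Gaussian case works out (a standard reformulation, not derived in these notes; here we assume ω ∼ N(0, Σ)): with f(x, ω) = (a + ω)ᵀx − b, the quantity (a + ω)ᵀx is Gaussian with mean aᵀx and standard deviation ‖Σ^{1/2}x‖₂, so Prob(f(x, ω) ≤ 0) ≥ η is equivalent to

    aᵀx + Φ⁻¹(η) ‖Σ^{1/2}x‖₂ ≤ b,

where Φ is the standard normal CDF. For η ≥ 0.5 we have Φ⁻¹(η) ≥ 0, so this is a convex second-order cone constraint.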
Recall that Jensen’s inequality tells us
E fi (x, ω) ≥ fi (x, E ω).
Now consider the problem

    minimize    f0(x, E ω)
    subject to  fi(x, E ω) ≤ 0,   i = 1, . . . , m,        (2)
obtained by replacing the random variable in each function by its expected value. This
problem is sometimes called the certainty equivalent of the original stochastic programming
problem (1), even though the two are equivalent only in very special cases. By Jensen's inequality, the constraint set for the certainty equivalent problem is larger than that of the original stochastic problem (1), and its objective is smaller. It follows that the optimal value of the certainty equivalent problem gives a lower bound on the optimal value of the stochastic problem (1). (It can be a poor bound, of course.)
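As a simple illustration (our own example) of how weak this lower bound can be: take f0(x, ω) = (x − ω)², with E ω = 0 and var ω = σ². The certainty equivalent problem minimizes x², with optimal value 0, while the true objective is E(x − ω)² = x² + σ², with optimal value σ². The gap between the bound and the true optimal value can thus be arbitrarily large.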
5.1 Noisy subgradient of expected function value
Suppose F : Rⁿ × Rᵖ → R, and F(x, w) is convex in x for each w. We define

    f(x) = E F(x, w) = ∫ F(x, w) p(w) dw,

where p is the density of w. (The integral is over Rᵖ.) The function f is convex. We'll show how to compute a noisy unbiased subgradient of f at x.
The function f comes up in many applications. We can think of x as some kind of
design variable to be chosen, and w as some kind of parameter that is random, i.e., subject
to statistical fluctuation. The function F tells us the cost of choosing x when w takes a
particular value; the function f , which is deterministic, gives the average cost of choosing
x, taking the statistical variation of w into account. Note that the dimension of w, how it
enters F , and the distribution are not restricted; we only require that for each value of w,
F is convex in x.
Except in some very special cases, we cannot easily compute f (x) exactly. However,
we can approximately compute f using Monte Carlo methods, if we can cheaply generate
samples of w from its distribution. (This depends on the distribution.) We generate M
independent samples w1 , . . . , wM , and then take
    f̂(x) = (1/M) Σ_{i=1}^M F(x, wi)

as our estimate of f(x). We hope that if M is large enough, we get a good estimate. In fact, f̂(x) is a random variable with E f̂(x) = f(x), and variance c/M, where c is the variance of F(x, w). If we know or can bound c, then we can at least bound the probability of a given level of error, i.e., Prob(|f̂(x) − f(x)| ≥ ε). In many cases it's possible to carry out a
much more sophisticated analysis of the error in Monte Carlo methods, but we won’t pursue
that here. A summary for our purposes is: we cannot evaluate f (x) exactly, but we can get
a good approximation, with (possibly) much effort.
Let G : Rn × Rp → Rn be a function that satisfies
G(x, w) ∈ ∂x F (x, w)
for each x and w. In other words, G(x, w) selects a subgradient for each value of x and w.
If F (x, w) is differentiable in x then we must have G(x, w) = ∇x F (x, w).
We claim that

    g = E G(x, w) = ∫ G(x, w) p(w) dw ∈ ∂f(x).

To see this, note that for any z we have

    F(z, w) ≥ F(x, w) + G(x, w)ᵀ(z − x),

since G(x, w) ∈ ∂x F(x, w). Multiplying this by p(w), which is nonnegative, and integrating gives

    ∫ F(z, w) p(w) dw ≥ ∫ (F(x, w) + G(x, w)ᵀ(z − x)) p(w) dw = f(x) + gᵀ(z − x),

i.e., f(z) ≥ f(x) + gᵀ(z − x) for all z, which shows that g ∈ ∂f(x).
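The construction above translates directly into code. Here is a minimal Python sketch (illustrative only; F, G, and sample_w are assumed, user-supplied functions for F(x, w), a subgradient selection G(x, w) ∈ ∂x F(x, w), and sampling from the distribution of w):

    import numpy as np

    def mc_value_and_subgrad(F, G, sample_w, x, M):
        # Monte Carlo estimates: fhat is unbiased for f(x) = E F(x, w), with
        # variance c/M; gtilde is a noisy unbiased subgradient, E gtilde in df(x).
        fhat = 0.0
        gtilde = np.zeros_like(x, dtype=float)
        for _ in range(M):
            w = sample_w()
            fhat += F(x, w) / M
            gtilde += G(x, w) / M
        return fhat, gtilde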
5.2 Example
We consider the problem of minimizing the expected value of a piecewise-linear convex function with random coefficients,

    minimize f(x) = E max_{i=1,...,m} (aiᵀx + bi),

with variable x ∈ Rⁿ. The data ai ∈ Rⁿ and bi ∈ R are random with some given distribution. We can compute an (unbiased) estimate of f(x), and a noisy unbiased subgradient g̃ (with E g̃ ∈ ∂f(x)), using the Monte Carlo methods described above.
We consider a problem instance with n = 20 variables and m = 100 terms. We assume that ai ∼ N(āi, 5I) and b ∼ N(b̄, 5I). The mean values āi and b̄ are generated from unit normal distributions (and are the same as the constant values ai and bi used in the previous examples). We take x(1) = 0 as the starting point and use the step size rule αk = 1/k.
We first compare the solution of the stochastic problem xstoch (obtained from the stochastic subgradient method) with xce, the solution of the certainty equivalent problem

    minimize fce(x) = max_{i=1,...,m} (āiᵀx + b̄i),

and with xheur, the solution of the heuristic problem

    minimize max_{i=1,...,m} (āiᵀx + b̄i + λ‖x‖₂),
where λ is a positive parameter. The extra terms are meant to account for the variation in aiᵀx + bi caused by variation in ai; the problem above can be cast as an SOCP and easily solved. We chose λ = 1, after some experimentation.
The certainty equivalent values for the three points are fce (xce ) = 1.12, fce (xheur ) = 1.23,
and fce (xstoch ) = 1.44. We (approximately) evaluate f (x) for these three points, based on
1000 Monte Carlo samples, obtaining f (xce ) ≈ 2.12, f (xheur ) ≈ 1.88, and f (xstoch ) ≈ 1.83.
The empirical distributions of maxi(aiᵀx + bi), for x = xstoch, x = xheur, and for x = xce, are shown in figure 4.

Figure 4: Empirical distributions of maxi(aiᵀx + bi) for the certainty equivalent solution xce, the heuristic solution xheur, and the stochastic optimal solution xstoch. The dark lines show the value of f(x), and the dotted lines show the value of fce(x).
In summary, the heuristic finds a point that is good, but not quite as good as the
stochastic optimal. Both of these points are much better than the certainty equivalent
point.
Now we show the convergence of the stochastic subgradient method, evaluating noisy
subgradients with M = 1, M = 10, M = 100, and M = 1000 samples at each step. For
M = 1000, we are computing a fairly accurate subgradient; for M = 1, the variance in our computed subgradient is large.

Figure 5: The value of f_best^(k) − f⋆ versus iteration number k, for the stochastic subgradient method with step size rule αk = 1/k. The plot shows one realization for noisy subgradients evaluated using M = 1, M = 10, M = 100, and M = 1000.
We (approximately) evaluate the function f(x) using M = 1000 samples at each iteration, and keep track of the best value f_best^(k). We estimate f⋆ by running the stochastic subgradient
algorithm for a large number of iterations. Figure 5 shows the convergence for one realization
for each of the four values of M . As expected, convergence is faster with M larger, which
yields a more accurate subgradient. Assuming the cost of an iteration is proportional to M ,
M = 1 seems to be the best choice. In any case, there seems little advantage (at least in
this example) to using a value of M larger than 10.
6 More examples
6.1 Minimizing expected maximum violation
The vector x ∈ Rⁿ is to be chosen subject to some (deterministic) linear inequalities Fx ⪯ g. These can represent manufacturing limits, cost limits, etc. Another set of random inequalities is given as Ax ⪯ b, where A and b come from some distribution. We would like these inequalities to hold, but because A and b vary, there may be no choice of x that results in Ax ⪯ b almost surely. We still would like to choose x so that Ax ⪯ b holds often, and when it is violated, the violation is small.
Perhaps the most natural problem formulation is to maximize the yield, defined as Prob(Ax ⪯ b). In some very special cases (e.g., A is deterministic and b has log-concave density), this can be converted to a convex problem; but in general it is not convex. Also, yield is not sensitive to how much the inequalities Ax ⪯ b are violated; a small violation is the same as a large violation, as far as the yield is concerned.
Instead, we will work with the expected maximum violation of the inequalities. We define the maximum violation as

    max((Ax − b)₊) = max_i (aiᵀx − bi)₊,

where aiᵀ are the rows of A. The maximum violation is zero if and only if Ax ⪯ b; it is positive otherwise. The expected maximum violation is

    E max((Ax − b)₊) = E max_i (aiᵀx − bi)₊.
It is a complicated but convex function of x, and gives a measure of how often, and by how much, the constraints Ax ⪯ b are violated. It is convex no matter what the distribution of A and b is.
As an aside, we note that many other interesting measures of violation could also be used, such as the expected total (or average) violation, E 1ᵀ(Ax − b)₊, or the expected sum of the squares of the individual violations, E ‖(Ax − b)₊‖₂². We can even measure violation by the expected distance from the desired polyhedron, E dist(x, {z | Az ⪯ b}), which we call the expected violation distance.
Back to our main story, and using expected (maximum) violation, we have the problem

    minimize    E max((Ax − b)₊) = E max_i (aiᵀx − bi)₊
    subject to  Fx ⪯ g.

The data are F, g, and the distribution of A and b.
We'll use a stochastic projected subgradient method to solve this problem. Given a point x that satisfies Fx ⪯ g, we need to generate a noisy unbiased subgradient g̃ of the objective. To do this we first generate a sample of A and b. If Ax ⪯ b, we can take g̃ = 0. (Note: if a subgradient is zero, it means we're done; but that's not the case here.) In the stochastic subgradient algorithm, g̃ = 0 means we don't update the current value of x. But we don't stop the algorithm, as we would in a deterministic projected subgradient method.

If Ax ⪯ b is violated, we choose j for which ajᵀx − bj = max_i (aiᵀx − bi)₊. Then we can take g̃ = aj. We then set xtemp = x − αk aj, where αk is the step size. Finally, we project xtemp back to the feasible set to get the updated value of x. (This can be done by solving a QP.)
We can generate a number of samples of A and b, and use the method above to find an
unbiased noisy subgradient for each case. We can use the average of these as g̃. If we take
enough samples, we can simultaneously estimate the expected maximum violation for the
current value of x.
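One step of this scheme might look as follows in Python, with the projection QP solved by cvxpy (a sketch under our own naming; sample_Ab is a hypothetical stand-in for a sampler of (A, b), not something defined in these notes):

    import numpy as np
    import cvxpy as cp

    def sps_step(x, k, F, g, sample_Ab):
        # One stochastic projected subgradient step for minimizing
        # E max((Ax - b)_+) subject to F x <= g, using a single sample of (A, b).
        A, b = sample_Ab()
        viol = A @ x - b
        if np.all(viol <= 0):
            return x                       # gtilde = 0: no update, but do not stop
        j = int(np.argmax(viol))           # worst violated inequality
        x_temp = x - (1.0 / k) * A[j]      # subgradient step with gtilde = a_j
        z = cp.Variable(x.size)            # project back onto {z | Fz <= g} (a QP)
        cp.Problem(cp.Minimize(cp.sum_squares(z - x_temp)), [F @ z <= g]).solve()
        return z.value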
A reasonable starting point is a point well inside the certainty equivalent inequalities (E A)x ⪯ E b, that also satisfies Fx ⪯ g. Simple choices include the analytic, Chebyshev, or maximum volume ellipsoid centers, or the point that maximizes the margin min(E b − (E A)x), subject to Fx ⪯ g.
6.2 On-line learning and adaptive signal processing
We suppose that (x, y) ∈ Rn × R have some joint distribution. Our goal is to find a weight
vector w ∈ Rn for which wT x is a good estimator of y. (This is linear regression; we can add
an extra component to x that is always one to get an affine estimator of the form wT x + v.)
We'd like to choose a weight vector that minimizes the expected absolute prediction error

    J(w) = E |wᵀx − y|.

At each step we are given a fresh sample (x, y) from the joint distribution; a noisy unbiased subgradient of J at w(k) is then

    g̃(k) = sign(w(k)ᵀx(k+1) − y(k+1)) x(k+1).

We take αk = 1/k. (In adaptive signal processing, this update is called a sign algorithm.)
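In code, one pass of the resulting on-line method is just a few lines of Python (a sketch; sample_xy is a hypothetical stand-in for drawing a fresh pair (x, y) from the joint distribution):

    import numpy as np

    def sign_algorithm(sample_xy, n, K):
        # On-line sign algorithm: w <- w - (1/k) sign(w'x - y) x.
        w = np.zeros(n)
        for k in range(1, K + 1):
            x, y = sample_xy()
            err = w @ x - y                        # prediction error
            w = w - (1.0 / k) * np.sign(err) * x   # noisy subgradient of E|w'x - y|
        return w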
We compute w⋆ using the stochastic subgradient algorithm, which we run for 5000 iterations. Figure 6 shows the behavior of the prediction error w(k)ᵀx(k+1) − y(k+1) for the first 300 iterations of the run. Figure 7 shows the empirical distribution (over 1000 realizations) of the prediction errors for w⋆.

Figure 6: The prediction error w(k)ᵀx(k+1) − y(k+1) versus iteration number k, for the first 300 iterations of one run.

Figure 7: Empirical distribution of the prediction errors for w = w⋆.
6.3 Minibatch stochastic subgradients
A common situation is minimizing an empirical average of convex losses,

    f(x) = (1/m) Σ_{i=1}^m F(x; (ai, bi)),

where for each pair (a, b), the objective F(x; (a, b)) is convex in x ∈ Rⁿ. In this situation, if we choose any m0 ≤ m as a batch size, we may construct a stochastic subgradient by taking a subsample of indices i1, . . . , im0 uniformly at random, either with or without replacement, from {1, . . . , m}, then setting

    g̃ ∈ (1/m0) Σ_{j=1}^{m0} ∂x F(x; (aij, bij)).
It is clear that E g̃ ∈ ∂f(x) with this choice. The interesting, though obvious, aspect of this is that if m0 ≪ m, then it is m/m0 times more efficient to compute the noisy unbiased subgradient g̃ than to compute a full subgradient of the objective. The power of stochastic subgradient methods comes from the fact that, since we often care only about obtaining a moderately accurate solution, we can take m0 ≪ m when computing stochastic subgradients, use these in the iterations of the method, and achieve substantial computational savings.
Here we work through one such instance in reasonable detail. We generate data pairs as follows. We fix a vector w⋆ ∈ Rⁿ, chosen uniformly from the unit sphere (so that ‖w⋆‖₂ = 1), and then generate vectors ai ∼ N(0, I), i.i.d., from the standard normal distribution on Rⁿ. We then set bi = aiᵀw⋆ + ξi, with ξi drawn according to a Cauchy distribution. For our objective, we take the absolute error

    F(x; (a, b)) = |aᵀx − b|.

With this setup, our objective becomes f(x) = (1/m)‖Ax − b‖₁, the mean ℓ₁-error in our predictions, which is a natural choice for heavy-tailed data.
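A Python sketch of the data generation and the minibatch subgradient for this objective (our reconstruction, not the authors' code; the dimension n below is a placeholder, while m = 200 matches the experiment described next):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 20, 200
    wstar = rng.standard_normal(n)
    wstar /= np.linalg.norm(wstar)            # ||w*||_2 = 1
    A = rng.standard_normal((m, n))           # rows a_i ~ N(0, I)
    b = A @ wstar + rng.standard_cauchy(m)    # heavy-tailed Cauchy noise

    def minibatch_subgrad(x, m0):
        # Noisy unbiased subgradient of f(x) = (1/m) ||Ax - b||_1: average the
        # subgradients a_i sign(a_i' x - b_i) over a random subsample of size m0.
        idx = rng.choice(m, size=m0, replace=False)
        return A[idx].T @ np.sign(A[idx] @ x - b[idx]) / m0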
Figure 8: The optimality gap f(x(k)) − f(x⋆) versus iteration number k, averaged over 50 experiments, for batch sizes m0 = 1, m0 = 10, and m0 = 200.
With this in mind, we can now compare a projected stochastic subgradient method for this problem with m = 200 and m0 chosen from 1, 10, and 200. Each iteration updates

    x(k+1) = Π(x(k) − αk g̃(k)),

where Π denotes projection onto the ℓ₂-ball of radius 1. We give two figures. In the first (Fig. 8), we show the optimality gap f(x(k)) − f(x⋆) versus iteration, averaged over 50 different experiments. For each method, we used the step size αk = α/√k, where the scalar α was chosen to yield the best performance in terms of optimality gap. On a per-iteration basis, it is clear that the subgradient method using the full sample size m0 = 200 has the best convergence as a function of the iteration number.
On the other hand, when we normalize by the amount of actual computation performed, the story is quite different. In Figure 9 we plot the optimality gap f(x(k)) − f⋆ against the total amount of computation each method requires, measured by the total number of individual subgradient evaluations ai sign(aiᵀx − bi) ∈ ∂x F(x; (ai, bi)), each of which requires O(n) time. From this figure, we see that in the time the full subgradient method requires to make even a small amount of progress, both subsampled methods have essentially made all of the progress they will make.
Figure 9: The optimality gap f(x(k)) − f⋆ versus the total number of individual subgradient evaluations, for batch sizes m0 = 1, m0 = 10, and m0 = 200.
Acknowledgments
May Zhou helped with a preliminary version of this document. We thank Abbas El Gamal
for helping us simplify the convergence proof. Trevor Hastie suggested the on-line learning
example.