
SIAM J. OPTIM.                                          © 2009 Society for Industrial and Applied Mathematics
Vol. 19, No. 4, pp. 1574–1609

ROBUST STOCHASTIC APPROXIMATION APPROACH TO STOCHASTIC PROGRAMMING∗

A. NEMIROVSKI†, A. JUDITSKY‡, G. LAN†, AND A. SHAPIRO†

Abstract. In this paper we consider optimization problems where the objective function is given in the form of an expectation. A basic difficulty of solving such stochastic optimization problems is that the involved multidimensional integrals (expectations) cannot be computed with high accuracy. The aim of this paper is to compare two computational approaches based on Monte Carlo sampling techniques, namely, the stochastic approximation (SA) and the sample average approximation (SAA) methods. Both approaches, the SA and SAA methods, have a long history. Current opinion is that the SAA method can efficiently use a specific (say, linear) structure of the considered problem, while the SA approach is a crude subgradient method, which often performs poorly in practice. We intend to demonstrate that a properly modified SA approach can be competitive and can even significantly outperform the SAA method for a certain class of convex stochastic problems. We extend the analysis to the case of convex-concave stochastic saddle point problems and present (in our opinion highly encouraging) results of numerical experiments.

Key words. stochastic approximation, sample average approximation method, stochastic programming, Monte Carlo sampling, complexity, saddle point, minimax problems, mirror descent algorithm

AMS subject classifications. 90C15, 90C25

DOI. 10.1137/070704277

1. Introduction. In this paper we first consider the following stochastic optimization problem:

(1.1)    min_{x∈X} { f(x) = E[F(x, ξ)] },

and then we deal with an extension of the analysis to stochastic saddle point problems. Here X ⊂ R^n is a nonempty bounded closed convex set, ξ is a random vector whose probability distribution P is supported on a set Ξ ⊂ R^d, and F : X × Ξ → R. We assume that the expectation

(1.2)    E[F(x, ξ)] = ∫_Ξ F(x, ξ) dP(ξ)

is well defined and finite valued for every x ∈ X. Moreover, we assume that the expected value function f(·) is continuous and convex on X. Of course, if for every ξ ∈ Ξ the function F(·, ξ) is convex on X, then it follows that f(·) is convex. With these assumptions, (1.1) becomes a convex programming problem.
A basic difficulty of solving stochastic optimization problem (1.1) is that the mul-
tidimensional integral (expectation) (1.2) cannot be computed with a high accuracy
for dimension d, say, greater than five. The aim of this paper is to compare two
∗Received by the editors October 1, 2007; accepted for publication (in revised form) August 26, 2008; published electronically January 21, 2009.
http://www.siam.org/journals/siopt/19-4/70427.html
†Georgia Institute of Technology, Atlanta, Georgia 30332 ([email protected], glan@isye.gatech.edu, [email protected]). Research of the first author was partly supported by NSF award DMI-0619977. Research of the third author was partially supported by NSF award CCF-0430644 and ONR award N00014-05-1-0183. Research of the fourth author was partly supported by NSF awards DMS-0510324 and DMI-0619977.
‡Université J. Fourier, B.P. 53, 38041 Grenoble Cedex 9, France ([email protected]).


computational approaches based on Monte Carlo sampling techniques, namely, the stochastic approximation (SA) and the sample average approximation (SAA) methods. To this end we make the following assumptions.
(A1) It is possible to generate an independent identically distributed (iid) sample ξ_1, ξ_2, ..., of realizations of random vector ξ.
(A2) There is a mechanism (an oracle), which, for a given input point (x, ξ) ∈ X × Ξ, returns a stochastic subgradient—a vector G(x, ξ) such that g(x) := E[G(x, ξ)] is well defined and is a subgradient of f(·) at x, i.e., g(x) ∈ ∂f(x).
Recall that if F(·, ξ), ξ ∈ Ξ, is convex and f(·) is finite valued in a neighborhood of a point x, then (cf. Strassen [28])

(1.3)    ∂f(x) = E[∂_x F(x, ξ)].

In that case we can employ a measurable selection G(x, ξ) ∈ ∂_x F(x, ξ) as a stochastic subgradient. At this stage, however, this is not important; we shall see later other relevant ways for constructing stochastic subgradients.
Both approaches, the SA and SAA methods, have a long history. The SA method goes back to the pioneering paper by Robbins and Monro [21]. Since then SA algorithms have become widely used in stochastic optimization (see, e.g., [3, 6, 7, 20, 22] and references therein) and, due to an especially low demand for computer memory, in signal processing. In the classical analysis of the SA algorithm (it apparently goes back to the works [5] and [23]) it is assumed that f(·) is twice continuously differentiable and strongly convex, and, in the case when the minimizer of f belongs to the interior of X, the algorithm exhibits the asymptotically optimal rate¹ of convergence E[f(x_t) − f_*] = O(t^{−1}) (here x_t is the tth iterate and f_* is the minimal value of f(x) over x ∈ X). This algorithm, however, is very sensitive to the choice of the respective stepsizes. Since an “asymptotically optimal” stepsize policy can be very bad in the beginning, the algorithm often performs poorly in practice (e.g., [27, section 4.5.3]).
An important improvement of the SA method was developed by Polyak [18] and Polyak and Juditsky [19], where longer stepsizes were suggested with consequent averaging of the obtained iterates. Under the outlined “classical” assumptions, the resulting algorithm exhibits the same optimal O(t^{−1}) asymptotical convergence rate, while using an easy to implement and “robust” stepsize policy. It should be mentioned that the main ingredients of Polyak’s scheme—long steps and averaging—were, in a different form, proposed already in Nemirovski and Yudin [15] for the case of problems (1.1) with general-type Lipschitz continuous convex objectives and for convex-concave saddle point problems. The algorithms from [15] exhibit, in a nonasymptotical fashion, the O(t^{−1/2}) rate of convergence. It is possible to show that in the general convex case (without assuming smoothness and strong convexity of the objective function), this rate of O(t^{−1/2}) is unimprovable. For a summary of early results in this direction, see Nemirovski and Yudin [16].
The SAA approach was used by many authors in various contexts under different names. Its basic idea is rather simple: generate a (random) sample ξ_1, ..., ξ_N, of size N, and approximate the “true” problem (1.1) by the sample average problem

(1.4)    min_{x∈X} { f̂_N(x) = N^{−1} Σ_{j=1}^N F(x, ξ_j) }.

¹Throughout the paper, we speak about convergence in terms of the objective value.


Note that the SAA method is not an algorithm; the obtained SAA problem (1.4) still
has to be solved by an appropriate numerical procedure. Recent theoretical studies
(cf. [11, 25, 26]) and numerical experiments (see, e.g., [12, 13, 29]) show that the SAA
method coupled with a good (deterministic) algorithm could be reasonably efficient
for solving certain classes of two-stage stochastic programming problems. On the
other hand, classical SA-type numerical procedures typically performed poorly for
such problems.
We intend to demonstrate in this paper that a properly modified SA approach can
be competitive and even significantly outperform the SAA method for a certain class
of stochastic problems. The mirror descent SA method we propose here is a direct
descendent of the stochastic mirror descent method of Nemirovski and Yudin [16].
However, the method developed in this paper is more flexible than its “ancestor”:
the iteration of the method is exactly the prox-step for a chosen prox-function, and
the choice of prox-type function is not limited to the norm-type distance-generating
functions. Close techniques, based on subgradient averaging, have been proposed
in Nesterov [17] and used in [10] to solve the stochastic optimization problem (1.1).
Moreover, the results on large deviations of solutions and applications of the mirror
descent SA to saddle point problems, to the best of our knowledge, are new.
The rest of this paper is organized as follows. In section 2 we focus on the theory of the SA method applied to (1.1). We start with outlining the relevant-to-our-goals part of the classical “O(t^{−1})” SA theory (section 2.1), along with its “O(t^{−1/2})” modifications (section 2.2). Well-known and simple results presented in these sections pave the road to our main developments carried out in section 2.3. In section 3 we extend the constructions and results of section 2.3 to the case of the convex-concave stochastic saddle point problem. In the concluding section 4 we present results (in our opinion, highly encouraging) of numerical experiments with the SA algorithm (sections 2.3 and 3) applied to large-scale stochastic convex minimization and saddle point problems. Section 5 gives a short conclusion for the presented results. Finally, some technical proofs are given in the appendix.
Throughout the paper, we use the following notation. By ‖x‖_p we denote the ℓ_p norm of a vector x ∈ R^n; in particular, ‖x‖_2 = √(x^T x) denotes the Euclidean norm, and ‖x‖_∞ = max{|x_1|, ..., |x_n|}. By Π_X we denote the metric projection operator onto the set X, that is, Π_X(x) = argmin_{x'∈X} ‖x − x'‖_2. Note that Π_X is a nonexpanding operator, i.e.,

(1.5)    ‖Π_X(x') − Π_X(x)‖_2 ≤ ‖x' − x‖_2  ∀x', x ∈ R^n.

By O(1) we denote positive absolute constants. The notation ⌊a⌋ stands for the largest integer less than or equal to a ∈ R, and ⌈a⌉ for the smallest integer greater than or equal to a ∈ R. By ξ_[t] = (ξ_1, ..., ξ_t) we denote the history of the process ξ_1, ξ_2, ..., up to time t. Unless stated otherwise, all relations between random variables are supposed to hold almost surely.
2. Stochastic approximation, basic theory. In this section we discuss the theory and implementations of the SA approach to the minimization problem (1.1).

2.1. Classical SA algorithm. The classical SA algorithm solves (1.1) by mimicking the simplest subgradient descent method. That is, for a chosen x_1 ∈ X and a sequence γ_j > 0, j = 1, 2, ..., of stepsizes, it generates the iterates by the formula

(2.1)    x_{j+1} = Π_X(x_j − γ_j G(x_j, ξ_j)).
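In code, one step of (2.1) is a stochastic subgradient step followed by a metric projection. The following Python sketch is purely illustrative (it is not the implementation used in the experiments of section 4; the Euclidean-ball domain and the Gaussian noise model are assumptions made for the example):

    import numpy as np

    def project_ball(x, radius=1.0):
        # Metric projection Pi_X onto the assumed domain X = {x : ||x||_2 <= radius}.
        nrm = np.linalg.norm(x)
        return x if nrm <= radius else (radius / nrm) * x

    def classical_sa(G, x1, gammas, project=project_ball, seed=0):
        # Runs x_{j+1} = Pi_X(x_j - gamma_j G(x_j, xi_j)) as in (2.1).
        rng = np.random.default_rng(seed)
        x, trajectory = x1.copy(), [x1.copy()]
        for gamma in gammas:
            xi = rng.standard_normal(x.shape)  # assumed noise model: xi_j ~ N(0, I_n)
            x = project(x - gamma * G(x, xi))
            trajectory.append(x.copy())
        return trajectory

For instance, G = lambda x, xi: 0.1 * x + xi mimics the strongly convex example f(x) = ½c x^T x with c = 0.1 and additive Gaussian noise.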


Of course, the crucial question of that approach is how to choose the stepsizes γ_j. Let x_* be an optimal solution of (1.1). Note that since the set X is compact and f(x) is continuous, (1.1) has an optimal solution. Note also that the iterate x_j = x_j(ξ_[j−1]) is a function of the history ξ_[j−1] = (ξ_1, ..., ξ_{j−1}) of the generated random process and hence is random.
Denote

(2.2)    A_j = ½‖x_j − x_*‖_2²  and  a_j = E[A_j] = ½E[‖x_j − x_*‖_2²].

By using (1.5) and since x_* ∈ X and hence Π_X(x_*) = x_*, we can write

         A_{j+1} = ½‖Π_X(x_j − γ_j G(x_j, ξ_j)) − x_*‖_2²
                 = ½‖Π_X(x_j − γ_j G(x_j, ξ_j)) − Π_X(x_*)‖_2²
(2.3)            ≤ ½‖x_j − γ_j G(x_j, ξ_j) − x_*‖_2²
                 = A_j + ½γ_j²‖G(x_j, ξ_j)‖_2² − γ_j (x_j − x_*)^T G(x_j, ξ_j).

Since x_j = x_j(ξ_[j−1]) is independent of ξ_j, we have

         E[(x_j − x_*)^T G(x_j, ξ_j)] = E{ E[(x_j − x_*)^T G(x_j, ξ_j) | ξ_[j−1]] }
(2.4)                                 = E{ (x_j − x_*)^T E[G(x_j, ξ_j) | ξ_[j−1]] }
                                      = E[(x_j − x_*)^T g(x_j)].

Assume now that there is a positive number M such that

(2.5)    E[‖G(x, ξ)‖_2²] ≤ M²  ∀x ∈ X.

Then, by taking expectation of both sides of (2.3) and using (2.4), we obtain

(2.6)    a_{j+1} ≤ a_j − γ_j E[(x_j − x_*)^T g(x_j)] + ½γ_j²M².

Suppose further that the expectation function f(x) is differentiable and strongly convex on X, i.e., there is a constant c > 0 such that

         f(x') ≥ f(x) + (x' − x)^T ∇f(x) + ½c‖x' − x‖_2²  ∀x', x ∈ X,

or, equivalently, that

(2.7)    (x' − x)^T (∇f(x') − ∇f(x)) ≥ c‖x' − x‖_2²  ∀x', x ∈ X.

Note that strong convexity of f(x) implies that the minimizer x_* is unique. By optimality of x_*, we have that

         (x − x_*)^T ∇f(x_*) ≥ 0  ∀x ∈ X,

which together with (2.7) implies that (x − x_*)^T ∇f(x) ≥ c‖x − x_*‖_2². In turn, it follows that (x − x_*)^T g ≥ c‖x − x_*‖_2² for all x ∈ X and g ∈ ∂f(x), and hence

         E[(x_j − x_*)^T g(x_j)] ≥ cE[‖x_j − x_*‖_2²] = 2ca_j.

Therefore, it follows from (2.6) that

(2.8)    a_{j+1} ≤ (1 − 2cγ_j)a_j + ½γ_j²M².


Let us take stepsizes γ_j = θ/j for some constant θ > 1/(2c). Then, by (2.8), we have

         a_{j+1} ≤ (1 − 2cθ/j)a_j + ½θ²M²/j².

It follows by induction that

(2.9)    E[‖x_j − x_*‖_2²] = 2a_j ≤ Q(θ)/j,

where

(2.10)   Q(θ) = max{ θ²M²(2cθ − 1)^{−1}, ‖x_1 − x_*‖_2² }.

Suppose further that x_* is an interior point of X and ∇f(x) is Lipschitz continuous, i.e., there is a constant L > 0 such that

(2.11)   ‖∇f(x') − ∇f(x)‖_2 ≤ L‖x' − x‖_2  ∀x', x ∈ X.

Then

(2.12)   f(x) ≤ f(x_*) + ½L‖x − x_*‖_2²  ∀x ∈ X,

and hence

(2.13)   E[f(x_j) − f(x_*)] ≤ ½LE[‖x_j − x_*‖_2²] ≤ ½LQ(θ)/j,

where Q(θ) is defined in (2.10).


Under the specified assumptions, it follows from (2.9) and (2.13), respectively, that after t iterations, the expected error of the current solution in terms of the distance to x_* is of order O(t^{−1/2}), and the expected error in terms of the objective value is of order O(t^{−1}), provided that θ > 1/(2c). The simple example of X = {x : ‖x‖_2 ≤ 1}, f(x) = ½c x^T x, and G(x, ξ) = ∇f(x) + ξ, with ξ having standard normal distribution N(0, I_n), demonstrates that the outlined upper bounds on the expected errors are tight within factors independent of t.
We have arrived at the O(t^{−1}) rate of convergence in terms of the expected value of the objective mentioned in the Introduction. Note, however, that the result is highly sensitive to the a priori information on c. What would happen if the parameter c of strong convexity is overestimated? As a simple example, consider f(x) = x²/10, X = [−1, 1] ⊂ R, and assume that there is no noise, i.e., G(x, ξ) ≡ ∇f(x). Suppose, further, that we take θ = 1 (i.e., γ_j = 1/j), which would be the optimal choice for c = 1, while actually here c = 0.2. Then the iteration process becomes

         x_{j+1} = x_j − f'(x_j)/j = (1 − 1/(5j)) x_j,

and hence, starting with x_1 = 1,

         x_j = ∏_{s=1}^{j−1} (1 − 1/(5s)) = exp{ −Σ_{s=1}^{j−1} ln(1 + 1/(5s−1)) } > exp{ −Σ_{s=1}^{j−1} 1/(5s−1) }
             > exp{ −(0.25 + ∫_1^{j−1} dt/(5t−1)) } > exp{ −0.25 + 0.2 ln 1.25 − (1/5) ln j }
             > 0.8 j^{−1/5}.


That is, the convergence is extremely slow. For example, for j = 10⁹, the error of the iterated solution is greater than 0.015. On the other hand, for the optimal stepsize factor of θ = 1/c = 5, the optimal solution x_* = 0 is found in one iteration.
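This degradation is easy to reproduce numerically; the small sketch below (our illustration, not part of the original experiments) runs the noise-free recursion for both values of θ:

    def run(theta, n_steps=10**6, x=1.0):
        # f(x) = x^2/10 on X = [-1, 1], no noise:
        # x_{j+1} = x_j - (theta/j) f'(x_j), with f'(x) = x/5.
        for j in range(1, n_steps + 1):
            x -= (theta / j) * (x / 5.0)
            x = max(-1.0, min(1.0, x))  # projection onto X = [-1, 1]
        return x

    print(run(theta=1.0))  # roughly 0.05 after 10^6 steps, consistent with x_j > 0.8 j^(-1/5)
    print(run(theta=5.0))  # 0.0: with the stepsize matched to c = 0.2, one step suffices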
It could be added that the stepsizes γ_j = θ/j may become completely unacceptable when f loses strong convexity. For example, when f(x) = x⁴, X = [−1, 1], and there is no noise, these stepsizes result in a disastrously slow convergence: |x_j| ≥ O([ln(j+1)]^{−1/2}). The precise statement here is that with γ_j = θ/j and 0 < x_1 ≤ 1/(6√θ), we have

         x_j ≥ x_1 / √(1 + 32θx_1²[1 + ln(j+1)])  for j = 1, 2, ....
We see that in order to make the SA “robust”—applicable to general convex objectives rather than to strongly convex ones—one should replace the classical stepsizes γ_j = O(j^{−1}), which can be too small to ensure a reasonable rate of convergence even in the “no noise” case, with “much larger” stepsizes. At the same time, a detailed analysis shows that “large” stepsizes poorly suppress noise. As early as in [15] it was realized that in order to resolve the arising difficulty, it makes sense to separate collecting information on the objective from generating approximate solutions. Specifically, we can use large stepsizes, say, γ_j = O(j^{−1/2}), in (2.1), thus avoiding too slow motion at the cost of making the trajectory “more noisy.” In order to suppress, to some extent, this noisiness, we take, as approximate solutions, appropriate averages of the search points x_j rather than these points themselves.
2.2. Robust SA approach. Results of this section go back to Nemirovski and Yudin [15, 16]. Let us look again at the basic relations (2.2), (2.5), and (2.6). By convexity of f(x), we have that f(x) ≥ f(x_t) + (x − x_t)^T g(x_t) for any x ∈ X, and hence

         E[(x_t − x_*)^T g(x_t)] ≥ E[f(x_t) − f(x_*)].

Together with (2.6), this implies (recall that a_t = ½E[‖x_t − x_*‖_2²])

         γ_t E[f(x_t) − f(x_*)] ≤ a_t − a_{t+1} + ½γ_t²M².

It follows that whenever 1 ≤ i ≤ j, we have

(2.14)   Σ_{t=i}^j γ_t E[f(x_t) − f(x_*)] ≤ Σ_{t=i}^j [a_t − a_{t+1}] + ½M² Σ_{t=i}^j γ_t² ≤ a_i + ½M² Σ_{t=i}^j γ_t²,

and hence, setting ν_t = γ_t / Σ_{τ=i}^j γ_τ,

(2.15)   E[ Σ_{t=i}^j ν_t f(x_t) ] − f(x_*) ≤ ( a_i + ½M² Σ_{t=i}^j γ_t² ) / Σ_{t=i}^j γ_t.

Note that ν_t ≥ 0 and Σ_{t=i}^j ν_t = 1. Consider the points

(2.16)   x̃_i^j = Σ_{t=i}^j ν_t x_t,

and let

(2.17)   D_X = max_{x∈X} ‖x − x_1‖_2.


By convexity of X, we have x̃_i^j ∈ X, and, by convexity of f, we have f(x̃_i^j) ≤ Σ_{t=i}^j ν_t f(x_t). Thus, by (2.15) and in view of a_1 ≤ ½D_X² and a_i ≤ 2D_X², i > 1, we get

(2.18)   (a)  E[f(x̃_1^j)] − f(x_*) ≤ ( D_X² + M² Σ_{t=1}^j γ_t² ) / ( 2 Σ_{t=1}^j γ_t )   for 1 ≤ j,
         (b)  E[f(x̃_i^j)] − f(x_*) ≤ ( 4D_X² + M² Σ_{t=i}^j γ_t² ) / ( 2 Σ_{t=i}^j γ_t )  for 1 < i ≤ j.

Based on the resulting bounds on the expected inaccuracy of the approximate solutions x̃_i^j, we can now develop “reasonable” stepsize policies along with the associated efficiency estimates.
Constant stepsizes and basic efficiency estimate. Assume that the number N of iterations of the method is fixed in advance and that γ_t = γ, t = 1, ..., N. Then it follows by (2.18(a)) that

(2.19)   E[f(x̃_1^N)] − f(x_*) ≤ ( D_X² + M²Nγ² ) / ( 2Nγ ).

Minimizing the right-hand side of (2.19) over γ > 0, we arrive at the constant stepsize policy

(2.20)   γ_t = D_X / (M√N), t = 1, ..., N,

along with the associated efficiency estimate

(2.21)   E[f(x̃_1^N)] − f(x_*) ≤ D_X M / √N.

With the constant stepsize policy (2.20), we also have, for 1 ≤ K ≤ N,

(2.22)   E[f(x̃_K^N)] − f(x_*) ≤ ( D_X M / √N ) [ 2N/(N−K+1) + ½ ].

When K/N ≤ 1/2, the right-hand side of (2.22) coincides, within an absolute constant factor, with the right-hand side of (2.21). Finally, for a constant θ > 0, passing from the stepsizes (2.20) to the stepsizes

(2.23)   γ_t = θD_X / (M√N), t = 1, ..., N,

the efficiency estimate becomes

(2.24)   E[f(x̃_K^N)] − f(x_*) ≤ max{θ, θ^{−1}} ( D_X M / √N ) [ 2N/(N−K+1) + ½ ],  1 ≤ K ≤ N.
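Putting (2.1), (2.16), and (2.20) together, the robust E-SA takes a few lines of code. The sketch below is a minimal illustration (the oracle G, the projector onto X, and the constants D_X, M are inputs that must be supplied or estimated; the noise model is an assumption of the example):

    import numpy as np

    def robust_euclidean_sa(G, project, x1, D_X, M, N, seed=0):
        # N steps of (2.1) with the constant stepsize gamma_t = D_X/(M sqrt(N)) of (2.20),
        # returning the uniform average (2.16) of the search points (nu_t = 1/N),
        # whose expected error obeys (2.21).
        rng = np.random.default_rng(seed)
        gamma = D_X / (M * np.sqrt(N))
        x = x1.copy()
        x_avg = np.zeros_like(x1)
        for _ in range(N):
            xi = rng.standard_normal(x.shape)  # assumed noise model
            x = project(x - gamma * G(x, xi))
            x_avg += x / N
        return x_avg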

Discussion. We conclude that the expected error in terms of the objective of the robust SA algorithm (2.1), (2.16), with constant stepsize policy (2.20), after N iterations is of order O(N^{−1/2}) in our setting. Of course, this is worse than the rate O(N^{−1}) for the classical SA algorithm as applied to a smooth strongly convex function attaining its minimum at a point from the interior of the set X. However, the error bounds (2.21) and (2.22) are guaranteed independently of any smoothness and/or strong convexity assumptions on f. All that matters is the convexity of f on the convex compact set X and the validity of (2.5). Moreover, scaling the stepsizes by a positive constant θ affects the error bound (2.24) linearly in max{θ, θ^{−1}}. This can be compared with the possibly disastrous effect of such scaling in the classical SA algorithm discussed in section 2.1. These observations, in particular the fact that there is no necessity in “fine tuning” the stepsizes to the objective function f, explain the adjective “robust” in the name of the method. Finally, it can be shown that without additional, as compared to convexity and (2.5), assumptions on f, the accuracy bound (2.21) within an absolute constant factor is the best one allowed by statistics (cf. [16]).
Varying stepsizes. When the number of steps is not fixed in advance, it makes sense to replace the constant stepsizes with the stepsizes

(2.25)   γ_t = θD_X / (M√t), t = 1, 2, ....

From (2.18(b)) it follows that with this stepsize policy one has, for 1 ≤ K ≤ N,

(2.26)   E[f(x̃_K^N)] − f(x_*) ≤ ( D_X M / √N ) [ (2/θ) N/(N−K+1) + (θ/2)(N/K) ].

Choosing K as a fixed fraction of N, i.e., setting K = ⌈rN⌉, with a fixed r ∈ (0, 1), we get the efficiency estimate

(2.27)   E[f(x̃_K^N)] − f(x_*) ≤ C(r) max{θ, θ^{−1}} D_X M / √N,  N = 1, 2, ...,

with an easily computable factor C(r) depending solely on r. This bound, up to a factor depending solely on r and θ, coincides with the bound (2.21), with the advantage that our new stepsize policy does not have to be adjusted to a fixed-in-advance number of steps N.
2.3. Mirror descent SA method. On a close inspection, the robust SA algorithm from section 2.2 is intrinsically linked to the Euclidean structure of R^n. This structure plays the central role in the very construction of the method (see (2.1)), the same as in the associated efficiency estimates, like (2.21) (since the quantities D_X, M participating in the estimates are defined in terms of the Euclidean norm; see (2.17) and (2.5)). For these reasons, from now on, we refer to the algorithm from section 2.2 as the (robust) Euclidean SA (E-SA). In this section we develop a substantial generalization of the E-SA approach allowing us to adjust, to some extent, the method to the geometry, not necessarily Euclidean, of the problem in question. We shall see in the meantime that we can gain a lot, both theoretically and numerically, from such an adjustment. A rudimentary form of the generalization to follow can be found in Nemirovski and Yudin [16], from where the name “mirror descent” originates.
Let ‖·‖ be a (general) norm on R^n and ‖x‖_* = sup_{‖y‖≤1} y^T x be its dual norm. We say that a function ω : X → R is a distance-generating function modulus α > 0 with respect to ‖·‖, if ω is convex and continuous on X, the set

(2.28)   X^o = {x ∈ X : ∂ω(x) ≠ ∅}

is convex (note that X^o always contains the relative interior of X), and, restricted to X^o, ω is continuously differentiable and strongly convex with parameter α with respect to ‖·‖, i.e.,

(2.29)   (x' − x)^T (∇ω(x') − ∇ω(x)) ≥ α‖x' − x‖²  ∀x', x ∈ X^o.

A simple example of a distance-generating function is ω(x) = ½‖x‖_2² (modulus 1 with respect to ‖·‖_2, X^o = X).
Let us define the function V : X^o × X → R_+ as follows:

(2.30)   V(x, z) = ω(z) − [ω(x) + ∇ω(x)^T (z − x)].

In what follows we shall refer to V(·, ·) as the prox-function associated with the distance-generating function ω(x) (it is also called the Bregman distance [4]). Note that V(x, ·) is nonnegative and is strongly convex modulus α with respect to the norm ‖·‖. Let us define the prox-mapping P_x : R^n → X^o, associated with ω and a point x ∈ X^o, viewed as a parameter, as follows:

(2.31)   P_x(y) = argmin_{z∈X} { y^T (z − x) + V(x, z) }.

Observe that the minimum in the right-hand side of (2.31) is attained since ω is continuous on X and X is compact, and all the minimizers belong to X^o, whence the minimizer is unique, since V(x, ·) is strongly convex on X^o. Thus, the prox-mapping is well defined.
For ω(x) = ½‖x‖_2², we have P_x(y) = Π_X(x − y), so that (2.1) is the recurrence

(2.32)   x_{j+1} = P_{x_j}(γ_j G(x_j, ξ_j)), x_1 ∈ X^o.

Our goal is to demonstrate that the main properties of the recurrence (2.1) (which from now on we call the E-SA recurrence) are inherited by (2.32), whatever the underlying distance-generating function ω(x).
The statement of the following lemma is a simple consequence of the optimality conditions of the right-hand side of (2.31) (the proof of this lemma is given in the appendix).
Lemma 2.1. For every u ∈ X, x ∈ X^o, and y ∈ R^n, one has

(2.33)   V(P_x(y), u) ≤ V(x, u) + y^T (u − x) + ‖y‖_*² / (2α).

Using (2.33) with x = x_j, y = γ_j G(x_j, ξ_j), and u = x_*, we get

(2.34)   γ_j (x_j − x_*)^T G(x_j, ξ_j) ≤ V(x_j, x_*) − V(x_{j+1}, x_*) + ( γ_j²/(2α) ) ‖G(x_j, ξ_j)‖_*².

Note that with ω(x) = ½‖x‖_2², one has V(x, z) = ½‖x − z‖_2², α = 1, ‖·‖_* = ‖·‖_2. That is, (2.34) becomes nothing but the relation (2.6), which played a crucial role in all the developments related to the E-SA method. We are about to process, in a completely similar fashion, the relation (2.34) in the case of a general distance-generating function, thus arriving at the mirror descent SA. Specifically, setting

(2.35)   Δ_j = G(x_j, ξ_j) − g(x_j),

we can rewrite (2.34), with j replaced by t, as

(2.36)   γ_t (x_t − x_*)^T g(x_t) ≤ V(x_t, x_*) − V(x_{t+1}, x_*) − γ_t Δ_t^T (x_t − x_*) + ( γ_t²/(2α) ) ‖G(x_t, ξ_t)‖_*².


Summing up over t = 1, ..., j, and taking into account that V(x_{j+1}, u) ≥ 0, u ∈ X, we get

(2.37)   Σ_{t=1}^j γ_t (x_t − x_*)^T g(x_t) ≤ V(x_1, x_*) + Σ_{t=1}^j ( γ_t²/(2α) ) ‖G(x_t, ξ_t)‖_*² − Σ_{t=1}^j γ_t Δ_t^T (x_t − x_*).

Setting ν_t = γ_t / Σ_{i=1}^j γ_i, t = 1, ..., j, and

(2.38)   x̃_1^j = Σ_{t=1}^j ν_t x_t

and invoking convexity of f(·), we have

         Σ_{t=1}^j γ_t (x_t − x_*)^T g(x_t) ≥ Σ_{t=1}^j γ_t [f(x_t) − f(x_*)]
                                           = ( Σ_{t=1}^j γ_t ) [ Σ_{t=1}^j ν_t f(x_t) − f(x_*) ]
                                           ≥ ( Σ_{t=1}^j γ_t ) [ f(x̃_1^j) − f(x_*) ],

which combines with (2.37) to imply that

(2.39)   f(x̃_1^j) − f(x_*) ≤ [ V(x_1, x_*) + Σ_{t=1}^j ( γ_t²/(2α) ) ‖G(x_t, ξ_t)‖_*² − Σ_{t=1}^j γ_t Δ_t^T (x_t − x_*) ] / Σ_{t=1}^j γ_t.

Let us suppose, as in the previous section (cf. (2.5)), that we are given a positive number M_* such that

(2.40)   E[‖G(x, ξ)‖_*²] ≤ M_*²  ∀x ∈ X.

Taking expectations of both sides of (2.39) and noting that (i) x_t is a deterministic function of ξ_[t−1] = (ξ_1, ..., ξ_{t−1}), (ii) conditional on ξ_[t−1], the expectation of Δ_t is 0, and (iii) the expectation of ‖G(x_t, ξ_t)‖_*² does not exceed M_*², we obtain

(2.41)   E[f(x̃_1^j)] − f(x_*) ≤ [ max_{u∈X} V(x_1, u) + (2α)^{−1} M_*² Σ_{t=1}^j γ_t² ] / Σ_{t=1}^j γ_t.

Assume from now on that the method starts with the minimizer of ω:

         x_1 = argmin_{x∈X} ω(x).

Then, from (2.30), it follows that

(2.42)   max_{z∈X} V(x_1, z) ≤ D_{ω,X}²,

where

(2.43)   D_{ω,X} := [ max_{z∈X} ω(z) − min_{z∈X} ω(z) ]^{1/2}.

Consequently, (2.41) implies that

(2.44)   E[f(x̃_1^j)] − f(x_*) ≤ [ D_{ω,X}² + (2α)^{−1} M_*² Σ_{t=1}^j γ_t² ] / Σ_{t=1}^j γ_t.


Constant stepsize policy. Assuming that the total number of steps N is given in advance and γ_t = γ, t = 1, ..., N, optimizing the right-hand side of (2.44) over γ > 0, we arrive at the constant stepsize policy

(2.45)   γ_t = √(2α) D_{ω,X} / (M_* √N), t = 1, ..., N,

and the associated efficiency estimate

(2.46)   E[f(x̃_1^N)] − f(x_*) ≤ D_{ω,X} M_* √(2/(αN))

(cf. (2.20), (2.21)). For a constant θ > 0, passing from the stepsizes (2.45) to the stepsizes

(2.47)   γ_t = θ√(2α) D_{ω,X} / (M_* √N), t = 1, ..., N,

the efficiency estimate becomes

(2.48)   E[f(x̃_1^N)] − f(x_*) ≤ max{θ, θ^{−1}} D_{ω,X} M_* √(2/(αN)).

We refer to the method (2.32), (2.38), and (2.47) as the (robust) mirror descent SA algorithm with constant stepsize policy.
Probabilities of large deviations. So far, all our efficiency estimates were upper bounds on the expected nonoptimality, in terms of the objective, of the approximate solutions generated by the algorithms. Here we complement these results with bounds on probabilities of large deviations. Observe that by the Markov inequality, (2.48) implies that

(2.49)   Prob{ f(x̃_1^N) − f(x_*) > ε } ≤ √2 max{θ, θ^{−1}} D_{ω,X} M_* / ( ε √(αN) )  ∀ε > 0.

It is possible, however, to obtain much finer bounds on deviation probabilities when imposing more restrictive assumptions on the distribution of G(x, ξ). Specifically, assume that

(2.50)   E[ exp{ ‖G(x, ξ)‖_*² / M_*² } ] ≤ exp{1}  ∀x ∈ X.

Note that condition (2.50) is stronger than (2.40). Indeed, if a random variable Y satisfies E[exp{Y/a}] ≤ exp{1} for some a > 0, then by the Jensen inequality, exp{E[Y/a]} ≤ E[exp{Y/a}] ≤ exp{1}, and therefore, E[Y] ≤ a. Of course, condition (2.50) holds if ‖G(x, ξ)‖_* ≤ M_* for all (x, ξ) ∈ X × Ξ.
Proposition 2.2. In the case of (2.50) and for the constant stepsizes (2.47), the following holds for any Ω ≥ 1:

(2.51)   Prob{ f(x̃_1^N) − f(x_*) > √2 max{θ, θ^{−1}} M_* D_{ω,X} (12 + 2Ω) / √(αN) } ≤ 2 exp{−Ω}.

The proof of this proposition is given in the appendix.


Varying stepsizes. Same as in the case of the E-SA, we can modify the mirror descent SA algorithm to allow for time-varying stepsizes and “sliding averages” of the search points x_t in the role of approximate solutions, thus getting rid of the necessity to fix in advance the number of steps. Specifically, consider

(2.52)   D_{ω,X} := √2 [ sup_{x∈X^o, z∈X} ( ω(z) − ω(x) − (z − x)^T ∇ω(x) ) ]^{1/2} = sup_{x∈X^o, z∈X} √(2V(x, z))

and assume that D_{ω,X} is finite. This is definitely so when ω is continuously differentiable on the entire X. Note that for the E-SA, that is, with ω(x) = ½‖x‖_2², D_{ω,X} is the Euclidean diameter of X.
In the case of (2.52), setting

(2.53)   x̃_i^j = Σ_{t=i}^j γ_t x_t / Σ_{t=i}^j γ_t,

summing up inequalities (2.34) over K ≤ t ≤ N, and acting exactly as when deriving (2.39), we get for 1 ≤ K ≤ N,

         f(x̃_K^N) − f(x_*) ≤ [ V(x_K, x_*) + Σ_{t=K}^N ( γ_t²/(2α) ) ‖G(x_t, ξ_t)‖_*² − Σ_{t=K}^N γ_t Δ_t^T (x_t − x_*) ] / Σ_{t=K}^N γ_t.

Noting that V(x_K, x_*) ≤ ½D_{ω,X}² and taking expectations, we arrive at

(2.54)   E[f(x̃_K^N)] − f(x_*) ≤ [ ½D_{ω,X}² + (2α)^{−1} M_*² Σ_{t=K}^N γ_t² ] / Σ_{t=K}^N γ_t

(cf. (2.44)). It follows that with the decreasing stepsize policy

(2.55)   γ_t = θD_{ω,X} √α / (M_* √t), t = 1, 2, ...,

one has for 1 ≤ K ≤ N,

(2.56)   E[f(x̃_K^N)] − f(x_*) ≤ ( D_{ω,X} M_* / (√α √N) ) [ (2/θ) N/(N−K+1) + (θ/2)(N/K) ]

(cf. (2.26)). In particular, with K = ⌈rN⌉ for a fixed r ∈ (0, 1), we get the efficiency estimate

(2.57)   E[f(x̃_K^N)] − f(x_*) ≤ C(r) max{θ, θ^{−1}} D_{ω,X} M_* / (√α √N),

completely similar to the estimate (2.27) for the E-SA.


Discussion. Comparing (2.21) to (2.46) and (2.27) to (2.57), we see that for both the Euclidean and the mirror descent robust SA, the expected inaccuracy, in terms of the objective, of the approximate solution built in the course of N steps is O(N^{−1/2}). A benefit of the mirror descent over the Euclidean algorithm is in the potential possibility to reduce the constant factor hidden in O(·) by adjusting the norm ‖·‖ and the distance-generating function ω(·) to the geometry of the problem.
Example. Let X = {x ∈ R^n : Σ_{i=1}^n x_i = 1, x ≥ 0} be a standard simplex. Consider two setups for the mirror descent SA:
— Euclidean setup, where ‖·‖ = ‖·‖_2 and ω(x) = ½‖x‖_2², and
— ℓ_1-setup, where ‖·‖ = ‖·‖_1, with ‖·‖_* = ‖·‖_∞, and ω is the entropy function

(2.58)   ω(x) = Σ_{i=1}^n x_i ln x_i.

The Euclidean setup leads to the Euclidean robust SA, which is easily implementable (computing the prox-mapping requires O(n ln n) operations) and guarantees that

(2.59)   E[f(x̃_1^N)] − f(x_*) ≤ O(1) max{θ, θ^{−1}} M N^{−1/2},

with M² = sup_{x∈X} E[‖G(x, ξ)‖_2²], provided that the constant M is known and the stepsizes (2.23) are used (see (2.24), (2.17), and note that the Euclidean diameter of X is √2).
The ℓ_1-setup corresponds to X^o = {x ∈ X : x > 0}, D_{ω,X} = √(ln n), α = 1, and x_1 = argmin_X ω = n^{−1}(1, ..., 1)^T (see appendix). The associated mirror descent SA is easily implementable: the prox-function here is

         V(x, z) = Σ_{i=1}^n z_i ln(z_i / x_i),

and the prox-mapping P_x(y) = argmin_{z∈X} { y^T (z − x) + V(x, z) } can be computed in O(n) operations according to the explicit formula

         [P_x(y)]_i = x_i e^{−y_i} / Σ_{k=1}^n x_k e^{−y_k},  i = 1, ..., n.
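In code, one iteration of the ℓ_1-setup thus costs O(n): apply the formula above to y = γ_t G(x_t, ξ_t) and update the weighted average (2.38). A minimal Python sketch follows (the log-domain shift is for numerical stability only and leaves the formula unchanged; the convention that the oracle G draws its own ξ_t from the supplied generator is an assumption of the example):

    import numpy as np

    def entropy_prox(x, y):
        # [P_x(y)]_i = x_i exp(-y_i) / sum_k x_k exp(-y_k); subtracting the maximum
        # in the exponent cancels after normalization but avoids overflow.
        z = np.log(x) - y
        z -= z.max()
        w = np.exp(z)
        return w / w.sum()

    def mirror_sa_simplex(G, n, gammas, seed=0):
        # Mirror descent SA (2.32) with the entropy setup on the standard simplex,
        # returning the weighted average (2.38) of the search points.
        rng = np.random.default_rng(seed)
        x = np.full(n, 1.0 / n)  # x_1 = argmin_X omega, the barycenter of X
        num, den = np.zeros(n), 0.0
        for gamma in gammas:
            x = entropy_prox(x, gamma * G(x, rng))  # oracle draws xi_t from rng
            num += gamma * x
            den += gamma
        return num / den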

The efficiency estimate guaranteed with the ℓ_1-setup is

(2.60)   E[f(x̃_1^N)] − f(x_*) ≤ O(1) max{θ, θ^{−1}} √(ln n) M_* N^{−1/2},

with

         M_*² = sup_{x∈X} E[‖G(x, ξ)‖_∞²],

provided that the constant M_* is known and the constant stepsizes (2.47) are used (see (2.48) and (2.40)). To compare (2.60) and (2.59), observe that M_* ≤ M, and the ratio M_*/M can be as small as n^{−1/2}. Thus, the efficiency estimate for the ℓ_1-setup is never much worse than the estimate for the Euclidean setup, and for large n, can be far better than the latter estimate:

         1/√(ln n) ≤ M / ( √(ln n) M_* ) ≤ √(n / ln n),  N = 1, 2, ...,

both the upper and the lower bounds being achievable. Thus, when X is a standard simplex of large dimension, we have strong reasons to prefer the ℓ_1-setup to the usual Euclidean one.


Note that the ‖·‖_1-norm can be coupled with “good” distance-generating functions different from the entropy one, e.g., with the function

(2.61)   ω(x) = ln(n) Σ_{i=1}^n |x_i|^{1 + 1/ln n},  n ≥ 3.

Whenever 0 ∈ X and Diam_{‖·‖_1}(X) ≡ max_{x,y∈X} ‖x − y‖_1 equals 1 (these conditions can always be ensured by scaling and shifting X), for the just-outlined setup, one has D_{ω,X} = O(1)√(ln n), α = O(1), so that the associated mirror descent robust SA guarantees that with M_*² = sup_{x∈X} E[‖G(x, ξ)‖_∞²] and N ≥ 1,

(2.62)   E[f(x̃_{⌈rN⌉}^N)] − f(x_*) ≤ C(r) M_* √(ln n) / √N

(see (2.57)), while the efficiency estimate for the Euclidean robust SA is

(2.63)   E[f(x̃_{⌈rN⌉}^N)] − f(x_*) ≤ C(r) M Diam_{‖·‖_2}(X) / √N,

with

         M² = sup_{x∈X} E[‖G(x, ξ)‖_2²]  and  Diam_{‖·‖_2}(X) = max_{x,y∈X} ‖x − y‖_2.

Ignoring logarithmic in n factors, the second estimate (2.63) can be much better than the first estimate (2.62) only when Diam_{‖·‖_2}(X) ≪ 1 = Diam_{‖·‖_1}(X), as is the case, e.g., when X is a Euclidean ball. On the other hand, when X is a ‖·‖_1-ball or its nonnegative part (which is the simplex), so that the ‖·‖_1- and ‖·‖_2-diameters of X are of the same order, the first estimate (2.62) is much more attractive than the estimate (2.63) due to the potentially much smaller constant M_*.
Comparison with the SAA approach. We compare now theoretical complexity estimates for the robust mirror descent SA and the SAA methods. Consider the case when (i) X ⊂ R^n is contained in the ‖·‖_p-ball of radius R, p = 1, 2, and the SA in question is either the E-SA (p = 2) or the SA associated with ‖·‖_1 and the distance-generating function² (2.61), (ii) in the SA, the constant stepsize rule (2.45) is used, and (iii) the “light tail” assumption (2.50) takes place.
Given ε > 0, δ ∈ (0, 1/2), let us compare the number of steps N = N_SA of the SA which, with probability ≥ 1 − δ, results in an approximate solution x̃_1^N such that f(x̃_1^N) − f(x_*) ≤ ε, with the sample size N = N_SAA for the SAA resulting in the same accuracy guarantees. According to Proposition 2.2, we have that Prob{ f(x̃_1^N) − f(x_*) > ε } ≤ δ for

(2.64)   N_SA = O(1) ε^{−2} D_{ω,X}² M_*² ln²(1/δ),

where M_* is the constant from (2.50) and D_{ω,X} is defined in (2.43). Note that the constant M_* depends on the chosen norm, and D_{ω,X}² = O(1)R² for p = 2, while D_{ω,X}² = O(1) ln(n) R² for p = 1.
This can be compared with the estimate of the sample size (cf. [25, 26])

(2.65)   N_SAA = O(1) ε^{−2} R² M_*² [ ln(1/δ) + n ln(RM_*/ε) ].

²In the second case, we apply the SA after the variables are scaled to make X the unit ‖·‖_1-ball.


We see that both the SA and SAA methods have logarithmic in δ and quadratic (or nearly so) in 1/ε complexity in terms of the corresponding sample sizes. It should be noted, however, that the SAA method requires the solution of the corresponding (deterministic) problem, while the SA approach is based on simple calculations, as long as stochastic subgradients can be easily computed.
3. Stochastic saddle point problem. We show in this section how the mirror descent SA algorithm can be modified to solve a convex-concave stochastic saddle point problem. Consider the following minimax (saddle point) problem:

(3.1)    min_{x∈X} max_{y∈Y} { φ(x, y) = E[Φ(x, y, ξ)] }.

Here X ⊂ R^n and Y ⊂ R^m are nonempty bounded closed convex sets, ξ is a random vector whose probability distribution P is supported on a set Ξ ⊂ R^d, and Φ : X × Y × Ξ → R. We assume that for every (x, y) ∈ X × Y, the expectation

         E[Φ(x, y, ξ)] = ∫_Ξ Φ(x, y, ξ) dP(ξ)

is well defined and finite valued and that the expected value function φ(x, y) is convex in x ∈ X and concave in y ∈ Y. It follows that (3.1) is a convex-concave saddle point problem. In addition, we assume that φ(·, ·) is Lipschitz continuous on X × Y. It is well known that, in the above setting, (3.1) is solvable, i.e., the corresponding “primal” and “dual” optimization problems

         min_{x∈X} [ max_{y∈Y} φ(x, y) ]  and  max_{y∈Y} [ min_{x∈X} φ(x, y) ],

respectively, are solvable with equal optimal values, denoted φ_*, and the pairs (x_*, y_*) of optimal solutions to the respective problems form the set of saddle points of φ(x, y) on X × Y.
As in the case of the minimization problem (1.1), we assume that neither the function φ(x, y) nor its sub/supergradients in x and y are available explicitly. However, we make the following assumption.
(A2′) We have at our disposal an oracle which, given an input point (x, y, ξ) ∈ X × Y × Ξ, returns a stochastic subgradient, that is, the (n + m)-dimensional vector

         G(x, y, ξ) = [ G_x(x, y, ξ) ; −G_y(x, y, ξ) ]

such that the vector

         g(x, y) = [ g_x(x, y) ; −g_y(x, y) ] = [ E[G_x(x, y, ξ)] ; −E[G_y(x, y, ξ)] ]

is well defined, g_x(x, y) ∈ ∂_x φ(x, y), and −g_y(x, y) ∈ ∂_y(−φ(x, y)).
For example, if for every ξ ∈ Ξ the function Φ(·, ·, ξ) is convex-concave and the respective subdifferential and integral operators are interchangeable, we ensure (A2′) by setting

         G(x, y, ξ) = [ G_x(x, y, ξ) ; −G_y(x, y, ξ) ] ∈ [ ∂_x Φ(x, y, ξ) ; ∂_y(−Φ(x, y, ξ)) ].

Let ‖·‖_x be a norm on R^n and ‖·‖_y be a norm on R^m, and let ‖·‖_{*,x} and ‖·‖_{*,y} stand for the corresponding dual norms. As in section 2.1, the basic assumption we make about the stochastic oracle (aside from its unbiasedness, which we have already postulated) is that we know positive constants M_{*,x}² and M_{*,y}² such that

(3.2)    E[‖G_x(u, v, ξ)‖_{*,x}²] ≤ M_{*,x}²  and  E[‖G_y(u, v, ξ)‖_{*,y}²] ≤ M_{*,y}²  ∀(u, v) ∈ X × Y.


3.1. Mirror SA algorithm for saddle point problems. We equip X and Y with distance-generating functions ω_x : X → R modulus α_x with respect to ‖·‖_x, and ω_y : Y → R modulus α_y with respect to ‖·‖_y. Let D_{ω_x,X} and D_{ω_y,Y} be the respective constants (see definition (2.42)). We equip R^n × R^m with the norm

(3.3)    ‖(x, y)‖ = [ ( α_x/(2D_{ω_x,X}²) ) ‖x‖_x² + ( α_y/(2D_{ω_y,Y}²) ) ‖y‖_y² ]^{1/2},

so that the dual norm is

(3.4)    ‖(ζ, η)‖_* = [ ( 2D_{ω_x,X}²/α_x ) ‖ζ‖_{*,x}² + ( 2D_{ω_y,Y}²/α_y ) ‖η‖_{*,y}² ]^{1/2},

and set

(3.5)    M_*² = ( 2D_{ω_x,X}²/α_x ) M_{*,x}² + ( 2D_{ω_y,Y}²/α_y ) M_{*,y}².

It follows by (3.2) that

(3.6)    E[‖G(x, y, ξ)‖_*²] ≤ M_*².

We use the notation z = (x, y) and equip the set Z = X × Y with the distance-generating function

         ω(z) = ω_x(x)/(2D_{ω_x,X}²) + ω_y(y)/(2D_{ω_y,Y}²).

It is immediately seen that ω indeed is a distance-generating function for Z modulus α = 1 with respect to the norm ‖·‖ and that Z^o = X^o × Y^o and D_{ω,Z} = 1. In what follows, V(z, u) : Z^o × Z → R and P_z(ζ) : R^{n+m} → Z^o are the prox-function and prox-mapping associated with ω and Z (see (2.30), (2.31)).
We are ready now to present the mirror SA algorithm for saddle point problems. This is the iterative procedure (compare with (2.32))

(3.7)    z_{j+1} = P_{z_j}(γ_j G(z_j, ξ_j)),

where the initial point z_1 ∈ Z is chosen to be the minimizer of ω(z) on Z. As before (compare with (2.38)), we define the approximate solution z̃_1^j = (x̃_1^j, ỹ_1^j) of (3.1) after j iterations as

(3.8)    z̃_1^j = Σ_{t=1}^j γ_t z_t / Σ_{t=1}^j γ_t.

We refer to the procedure (3.7), (3.8) as the saddle point mirror SA algorithm.
Let us analyze convergence properties of the algorithm. We measure the quality of an approximate solution z̃ = (x̃, ỹ) by the error

         ε_φ(z̃) := [ max_{y∈Y} φ(x̃, y) − φ_* ] + [ φ_* − min_{x∈X} φ(x, ỹ) ] = max_{y∈Y} φ(x̃, y) − min_{x∈X} φ(x, ỹ).

By convexity of φ(·, y), we have

         φ(x_t, y_t) − φ(x, y_t) ≤ (x_t − x)^T g_x(x_t, y_t)  ∀x ∈ X


and by concavity of φ(x, ·),

         φ(x_t, y) − φ(x_t, y_t) ≤ (y − y_t)^T g_y(x_t, y_t)  ∀y ∈ Y,

so that for all z = (x, y) ∈ Z,

         φ(x_t, y) − φ(x, y_t) ≤ (x_t − x)^T g_x(x_t, y_t) + (y − y_t)^T g_y(x_t, y_t) = (z_t − z)^T g(z_t).

Using once again the convexity-concavity of φ, we write

(3.9)    ε_φ(z̃_1^j) = max_{y∈Y} φ(x̃_1^j, y) − min_{x∈X} φ(x, ỹ_1^j)
                    ≤ ( Σ_{t=1}^j γ_t )^{−1} [ max_{y∈Y} Σ_{t=1}^j γ_t φ(x_t, y) − min_{x∈X} Σ_{t=1}^j γ_t φ(x, y_t) ]
                    ≤ ( Σ_{t=1}^j γ_t )^{−1} max_{z∈Z} Σ_{t=1}^j γ_t (z_t − z)^T g(z_t).

To bound the right-hand side of (3.9), we use the result of the following lemma (its proof is given in the appendix).
Lemma 3.1. In the above setting, for any j ≥ 1, the following inequality holds:

(3.10)   E[ max_{z∈Z} Σ_{t=1}^j γ_t (z_t − z)^T g(z_t) ] ≤ 2 + (5/2) M_*² Σ_{t=1}^j γ_t².

Now to get an error bound for the solution z̃_1^j, it suffices to substitute inequality (3.10) into (3.9) to obtain

(3.11)   E[ε_φ(z̃_1^j)] ≤ ( Σ_{t=1}^j γ_t )^{−1} [ 2 + (5/2) M_*² Σ_{t=1}^j γ_t² ].

Constant stepsizes and basic efficiency estimates. For a fixed number of steps N, with the constant stepsize policy

(3.12)   γ_t = 2θ / ( M_* √(5N) ), t = 1, ..., N,

condition (3.6) and estimate (3.11) imply that

(3.13)   E[ε_φ(z̃_1^N)] ≤ 2 max{θ, θ^{−1}} M_* √(5/N)
                       = 2 max{θ, θ^{−1}} √( 10 [ α_y D_{ω_x,X}² M_{*,x}² + α_x D_{ω_y,Y}² M_{*,y}² ] / (α_x α_y N) ).

Variable stepsizes. Same as in the minimization case, assuming that

(3.14)   D_{ω,Z} := √2 [ sup_{z∈Z^o, w∈Z} ( ω(w) − ω(z) − (w − z)^T ∇ω(z) ) ]^{1/2} = √2 [ sup_{z∈Z^o, w∈Z} V(z, w) ]^{1/2}

is finite, we can pass from constant stepsizes on a fixed “time horizon” to the decreasing stepsize policy

         γ_t = θD_{ω,Z} / (M_* √t), t = 1, 2, ...

(compare with (2.55) and take into account that we are in the situation of α = 1), and from the averaging of all iterates to the “sliding averaging”

         z̃_i^j = Σ_{t=i}^j γ_t z_t / Σ_{t=i}^j γ_t,

arriving at the efficiency estimates (compare with (2.56) and (2.57))

(3.15)   E[ε_φ(z̃_K^N)] ≤ ( D_{ω,Z} M_* / √N ) [ (2/θ) N/(N−K+1) + (5θ/2)(N/K) ],  1 ≤ K ≤ N,
         E[ε_φ(z̃_{⌈rN⌉}^N)] ≤ C(r) max{θ, θ^{−1}} D_{ω,Z} M_* / √N,  r ∈ (0, 1).
Probabilities of large deviations. Assume that instead of (3.2) the following stronger assumption holds:

(3.16)   E[ exp{ ‖G_x(u, v, ξ)‖_{*,x}² / M_{*,x}² } ] ≤ exp{1},  E[ exp{ ‖G_y(u, v, ξ)‖_{*,y}² / M_{*,y}² } ] ≤ exp{1}.

Proposition 3.2. In the case of (3.16), with the stepsizes given by (3.12) and (3.6), one has, for any Ω > 1,

(3.17)   Prob{ ε_φ(z̃_1^N) > (8 + 2Ω) max{θ, θ^{−1}} √5 M_* / √N } ≤ 2 exp{−Ω}.

The proof of this proposition is given in the appendix.


3.2. Application to minimax stochastic problems. Consider the following minimax stochastic problem:

(3.18)   min_{x∈X} max_{1≤i≤m} { f_i(x) = E[F_i(x, ξ)] },

where X ⊂ R^n is a nonempty bounded closed convex set, ξ is a random vector whose probability distribution P is supported on a set Ξ ⊂ R^d, and F_i : X × Ξ → R, i = 1, ..., m. We assume that the expected value functions f_i(·), i = 1, ..., m, are well defined, finite valued, convex, and Lipschitz continuous on X. Then the minimax problem (3.18) can be formulated as the following saddle point problem:

(3.19)   min_{x∈X} max_{y∈Y} { φ(x, y) = Σ_{i=1}^m y_i f_i(x) },

where Y = {y ∈ R^m : Σ_{i=1}^m y_i = 1, y ≥ 0}.
Assume that we are able to generate independent realizations ξ_1, ξ_2, ..., of the random vector ξ, and, for given x ∈ X and ξ ∈ Ξ, we can compute F_i(x, ξ) and its stochastic subgradient G_i(x, ξ) such that g_i(x) = E[G_i(x, ξ)] is well defined and g_i(x) ∈ ∂f_i(x),


x ∈ X, i = 1, . . . , m. In other words, we have a stochastic oracle for the problem


(3.19) such that assumption (A 2) holds, with
 m 
(3.20) G(x, y, ξ) =  i=1 yi Gi (x, ξ) 
− F1 (x, ξ), . . . , Fm (x, ξ)

and
 m   
(3.21) g(x, y) = E[G(x, y, ξ)] =  i=1 yi gi (x)  ∈ ∂x φ(x, y)
.
− f1 (x), . . . , fm (x) −∂y φ(x, y)

Suppose that the set X is equipped with a norm ‖·‖_x, whose dual norm is ‖·‖_{*,x}, and a distance-generating function ω modulus α_x with respect to ‖·‖_x, and let R_x² = D_{ω_x,X}²/α_x. We equip the set Y with the norm ‖·‖_y = ‖·‖_1, so that ‖·‖_{*,y} = ‖·‖_∞, and with the distance-generating function

         ω_y(y) = Σ_{i=1}^m y_i ln y_i,

and set R_y² = D_{ω_y,Y}²/α_y = ln m. Next, following (3.3), we set

         ‖(x, y)‖ = [ ‖x‖_x²/(2R_x²) + ‖y‖_1²/(2R_y²) ]^{1/2},

and hence

         ‖(ζ, η)‖_* = [ 2R_x² ‖ζ‖_{*,x}² + 2R_y² ‖η‖_∞² ]^{1/2}.

Let us assume the uniform bounds:

         max_{1≤i≤m} E[‖G_i(x, ξ)‖_{*,x}²] ≤ M_{*,x}²,  E[ max_{1≤i≤m} |F_i(x, ξ)|² ] ≤ M_{*,y}²,  i = 1, ..., m.

Note that

         E[‖G(x, y, ξ)‖_*²] = 2R_x² E[ ‖ Σ_{i=1}^m y_i G_i(x, ξ) ‖_{*,x}² ] + 2R_y² E[‖F(x, ξ)‖_∞²],

and since y ∈ Y,

         ‖ Σ_{i=1}^m y_i G_i(x, ξ) ‖_{*,x}² ≤ ( Σ_{i=1}^m y_i ‖G_i(x, ξ)‖_{*,x} )² ≤ Σ_{i=1}^m y_i ‖G_i(x, ξ)‖_{*,x}².

It follows that

(3.22)   E[‖G(x, y, ξ)‖_*²] ≤ M_*²,

where

         M_*² = 2R_x² M_{*,x}² + 2R_y² M_{*,y}² = 2R_x² M_{*,x}² + 2M_{*,y}² ln m.


Let us now use the saddle point mirror SA algorithm (3.7)–(3.8) with the constant stepsize policy

         γ_t = 2 / ( M_* √(5N) ), t = 1, 2, ..., N.

When substituting the value of M_*, we obtain the following from (3.13):

(3.23)   E[ε_φ(z̃_1^N)] = E[ max_{y∈Y} φ(x̃_1^N, y) − min_{x∈X} φ(x, ỹ_1^N) ]
                       ≤ 2M_* √(5/N) ≤ 2 √( 10 [ R_x² M_{*,x}² + M_{*,y}² ln m ] / N ).

Discussion. Looking at the bound (3.23), one can make the following important observation. The error of the saddle point mirror SA algorithm in this case is “almost independent” of the number m of constraints (it grows as O(√(ln m)) as m increases). The interested reader can easily verify that if an E-SA algorithm were used in the same setting (i.e., the algorithm tuned to the norm ‖·‖_y = ‖·‖_2), the corresponding bound would grow with m much faster (in fact, our error bound would be O(√m) in that case).
Note that the properties of the saddle point mirror SA can be used to reduce significantly the arithmetic cost of the algorithm implementation. To this end let us look at the definition (3.20) of the stochastic oracle: In order to obtain a realization G(x, y, ξ), one has to compute m random subgradients G_i(x, ξ), i = 1, ..., m, and then their convex combination Σ_{i=1}^m y_i G_i(x, ξ). Now let η be an independent of ξ and uniformly distributed on [0, 1] random variable, and let ı(η, y) : [0, 1] × Y → {1, ..., m} equal i when Σ_{s=1}^{i−1} y_s < η ≤ Σ_{s=1}^i y_s. That is, the random variable ı̂ = ı(η, y) takes values 1, ..., m with probabilities y_1, ..., y_m. Consider the random vector

(3.24)   G(x, y, (ξ, η)) = [ G_{ı(η,y)}(x, ξ) ; −(F_1(x, ξ), ..., F_m(x, ξ))^T ].

We refer to G(x, y, (ξ, η)) as a randomized oracle for problem (3.19), the corresponding random parameter being (ξ, η). By construction, we still have E[G(x, y, (ξ, η))] = g(x, y), where g is defined in (3.21), and, moreover, the same bound (3.22) holds for E[‖G(x, y, (ξ, η))‖_*²]. We conclude that the accuracy bound (3.23) holds for the error of the saddle point mirror SA algorithm with randomized oracle. On the other hand, in the latter procedure only one randomized subgradient G_ı̂(x, ξ) per iteration is computed. This simple idea is further developed in another interesting application of the saddle point mirror SA algorithm to bilinear matrix games, which we discuss next.
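(In code, the index ı(η, y) is plain inverse-CDF sampling from the discrete distribution y; a sketch, with 0-based indices:)

    import numpy as np

    def sample_index(eta, y):
        # i(eta, y): the smallest i with y_1 + ... + y_{i-1} < eta <= y_1 + ... + y_i,
        # i.e., a draw from {1, ..., m} with probabilities y_1, ..., y_m (returned 0-based).
        return int(np.searchsorted(np.cumsum(y), eta))

    # usage: i_hat = sample_index(rng.uniform(), y) selects which G_i(x, xi) to compute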
3.3. Application to bilinear matrix games. Consider the standard matrix game problem, that is, problem (3.1) with

         φ(x, y) = y^T Ax + b^T x + c^T y,

where A ∈ R^{m×n}, and X and Y are the standard simplexes:

         X = { x ∈ R^n : x ≥ 0, Σ_{j=1}^n x_j = 1 },  Y = { y ∈ R^m : y ≥ 0, Σ_{i=1}^m y_i = 1 }.


In the case in question it is natural to equip X (respectively, Y) with the ‖·‖_1-norm on R^n (respectively, R^m). We choose entropies as the corresponding distance-generating functions:

         ω_x(x) = Σ_{i=1}^n x_i ln x_i,  ω_y(y) = Σ_{i=1}^m y_i ln y_i  ⇒  D_{ω_x,X}²/α_x = ln n,  D_{ω_y,Y}²/α_y = ln m.

According to (3.3), we set

(3.25)   ‖(x, y)‖ = [ ‖x‖_1²/(2 ln n) + ‖y‖_1²/(2 ln m) ]^{1/2}  ⇒  ‖(ζ, η)‖_* = [ 2‖ζ‖_∞² ln n + 2‖η‖_∞² ln m ]^{1/2}.

In order to compute the estimates G(x, y, ξ) of g(x, y) = (b + A^T y, −c − Ax), to be used in the saddle point mirror SA iterations (3.7), we use the randomized oracle

(3.26)   G(x, y, ξ) = [ b + [A^{ı(ξ_1,y)}]^T ; −c − A_{ı(ξ_2,x)} ],

where ξ_1 and ξ_2 are independent uniformly distributed on [0, 1] random variables, and ĵ = ı(ξ_1, y), î = ı(ξ_2, x) are defined as in (3.24) (i.e., ĵ can take values 1, ..., m, with probabilities y_1, ..., y_m, and î can take values 1, ..., n, with probabilities x_1, ..., x_n), and A_j, [A^i]^T are the jth column and the ith row of A, respectively.
Note that

(3.27)   g(x, y) ≡ E[ G(x, y, (ĵ, î)) ] ∈ [ ∂_x φ(x, y) ; ∂_y(−φ(x, y)) ].

Besides this,

         |G(x, y, ξ)_i| ≤ max_{1≤j≤m} ‖[A^j]^T + b‖_∞,  1 ≤ i ≤ n,
         |G(x, y, ξ)_i| ≤ max_{1≤j≤n} ‖A_j + c‖_∞,  n + 1 ≤ i ≤ n + m,

whence, invoking (3.25), for any x ∈ X, y ∈ Y, and ξ,

(3.28)   ‖G(x, y, ξ)‖_*² ≤ M_*² = 2 ln n max_{1≤j≤m} ‖[A^j]^T + b‖_∞² + 2 ln m max_{1≤j≤n} ‖A_j + c‖_∞².

The bottom line is that our stochastic gradients along with the just-defined M_* satisfy both (A2′) and (3.16), and therefore with the constant stepsize policy (3.12), we have

(3.29)   E[ε_φ(z̃_1^N)] = E[ max_{y∈Y} φ(x̃_1^N, y) − min_{x∈X} φ(x, ỹ_1^N) ] ≤ 2M_* √(5/N)

(cf. (3.13)). In our present situation, Proposition 3.2 in a slightly refined form (for the proof, see the appendix) reads as follows.
Proposition 3.3. With the constant stepsize policy (3.12), for the just-defined algorithm, one has, for any Ω ≥ 1, that

(3.30)   Prob{ ε_φ(z̃_1^N) > 2M_* √(5/N) + 4MΩ/√N } ≤ exp{−Ω²/2},

where

(3.31)   M = max_{1≤j≤m} ‖[A^j]^T + b‖_∞ + max_{1≤j≤n} ‖A_j + c‖_∞.


Discussion. Consider a bilinear matrix game with n ≥ m, ln(m) = O(1) ln(n), and b = c = 0 (so that M_* = O(1)√(ln n) M and M = max_{i,j} |A_{ij}|; see (3.28), (3.31)). Suppose that we are interested in solving it within a fixed relative accuracy ρ, that is, to ensure that the (perhaps random) approximate solution z̃_1^N, which we get after N iterations, satisfies the error bound

         ε_φ(z̃_1^N) ≤ ρ max_{1≤i,j≤n} |A_{ij}|

with probability at least 1 − δ. According to (3.30), to this end one can use the randomized saddle point mirror SA algorithm (3.7), (3.8), (3.26) with stepsizes (3.12), (3.28) and with

(3.32)   N = O(1) ( ln n + ln(1/δ) ) / ρ².

The computational cost of building z̃_1^N with this approach is

         C(ρ) = O(1) [ln n + ln(1/δ)] [R + n] / ρ²

arithmetic operations, where R is the arithmetic cost of extracting a column/row from A given the index of this column/row. The total number of rows and columns visited by the algorithm does not exceed the number of steps N as given in (3.32), so that the total number of entries of A used in the course of the entire computation does not exceed

         M = O(1) n(ln n + ln(1/δ)) / ρ².

When ρ is fixed, m = O(1)n, and n is large, M is incomparably less than the total number mn of entries of A. Thus, our algorithm exhibits sublinear-time behavior: it produces reliable solutions of prescribed quality to large-scale matrix games by inspecting a negligible, as n → ∞, part of randomly selected data. Note that randomization here is critical.³ It can be seen that a deterministic algorithm capable of finding a solution with (deterministic) relative accuracy ρ ≤ 0.1 has to “see” in the worst case at least O(1)n rows/columns of A.
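For concreteness, here is a compact Python sketch of the resulting randomized method for the case b = c = 0. It is an illustration of (3.7), (3.8), (3.26), not the implementation of section 4; for simplicity it applies the entropy prox to the x- and y-blocks separately with a common stepsize (a mild simplification of the weighted joint prox of section 3.1) and uses θ = 1 in (3.12) with M_* taken from (3.28):

    import numpy as np

    def entropy_prox(u, s):
        # [P_u(s)]_i proportional to u_i exp(-s_i) (explicit formula of section 2.3).
        z = np.log(u) - s
        z -= z.max()
        w = np.exp(z)
        return w / w.sum()

    def matrix_game_sa(A, N, seed=0):
        # Randomized mirror SA for min_{x in Delta_n} max_{y in Delta_m} y^T A x.
        rng = np.random.default_rng(seed)
        m, n = A.shape
        # (3.28) with b = c = 0: M_*^2 = 2(ln n + ln m) max_ij A_ij^2.
        M_star = np.sqrt(2.0 * (np.log(n) + np.log(m))) * np.abs(A).max()
        gamma = 2.0 / (M_star * np.sqrt(5.0 * N))  # stepsize (3.12) with theta = 1
        x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
        x_num, y_num = np.zeros(n), np.zeros(m)
        for _ in range(N):
            j = int(np.searchsorted(np.cumsum(y), rng.uniform()))  # row index ~ y
            i = int(np.searchsorted(np.cumsum(x), rng.uniform()))  # column index ~ x
            x = entropy_prox(x, gamma * A[j, :])   # G_x: the sampled row of A
            y = entropy_prox(y, -gamma * A[:, i])  # -G_y: minus the sampled column
            x_num += gamma * x
            y_num += gamma * y
        return x_num / (gamma * N), y_num / (gamma * N)

Each iteration reads a single row and a single column of A, so the cost per step is O(m + n) no matter how large mn is, which is exactly the source of the sublinear-time behavior discussed above.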
4. Numerical results. In this section, we report the results of our computational experiments where we compare the performance of the robust mirror descent SA method and the SAA method applied to three stochastic programming problems, namely: a stochastic utility problem, a stochastic max-flow problem, and a network planning problem with random demand. We also present a small simulation study of the performance of the randomized mirror SA algorithm for bilinear matrix games.
The algorithms we were testing are the two variants of the robust mirror descent SA. The first variant, the E-SA, is as described in section 2.2; in terms of section 2.3, this is nothing but the mirror descent robust SA with the Euclidean setup; see the example in section 2.3. The second variant, referred to as the non-Euclidean SA (N-SA), is the mirror descent robust SA with the ℓ_1-setup; see the example in section 2.3.
³The possibility to solve matrix games in a sublinear-time fashion by a randomized algorithm was discovered by Grigoriadis and Khachiyan [9]. Their “ad hoc” algorithm is similar, although not completely identical, to ours, and possesses the same complexity bounds.

Table 4.1
Selecting stepsize policy [method: N-SA, N: 2,000, K: 10,000, instance: L1].

Policy      θ = 0.1    θ = 1      θ = 5      θ = 10
Variable    −7.4733    −7.8865    −7.8789    −7.8547
Constant    −6.9371    −7.8637    −7.9037    −7.8971

These two variants of the SA method are compared with the SAA approach in the following way: fixing an i.i.d. sample (of size N) for the random variable ξ, we apply the three aforementioned methods to obtain approximate solutions for the test problem under consideration, and then the quality of the solutions yielded by these algorithms is evaluated using another i.i.d. sample of size K ≫ N. It should be noted that SAA itself is not an algorithm, and in our experiments it was coupled with the non-Euclidean restricted memory level (NERML) method [2]—a powerful deterministic algorithm for solving the sample average problem (1.4).
4.1. Preliminaries.
Algorithmic schemes. Both E-SA and N-SA were implemented according to the
description in section 2.3, the number of steps N being the parameter of a particular
experiment. In such an experiment, we generated ≈ log2 N candidate solutions x̃N i ,
with N −i+1 = min[2k , N ], k = 0, 1, . . . , log2 N . We then used an additional sample
to estimate the objective at these candidate solutions in order to choose the best of
these candidates, specifically, as follows: we used a relatively short sample to choose
the two “most promising” of the candidate solutions, and then a large sample (of size
K  N ) to identify the best of these two candidates, thus getting the “final” solution.
The computational effort required by this simple postprocessing is not reflected in the
tables to follow.
The stepsizes. At the “pilot stage” of our experimentation, we made a decision on
which stepsize policy—(2.47) or (2.55)—to choose and how to identify the underlying
parameters M∗ and θ. In all our experiments, M∗ was estimated by taking the
maxima of G(·, ·)∗ over a small (just 100) calls to the stochastic oracle at randomly
generated feasible solutions. As about the value of θ and type of the stepsize policy
((2.47) or (2.55)), our choice was based on the results of experimentation with a
single test problem (instance L1 of the utility problem, see below); some results of
this experimentation are presented in Table 4.1. We have found that the constant
stepsize policy (2.47) with θ = 0.1 for the E-SA and θ = 5 for the N-SA slightly
outperforms other variants we have considered. This particular policy, combined with
the aforementioned scheme for estimating M∗ , was used in all subsequent experiments.
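In code, this estimation of M_* amounts to the following sketch (draw_feasible and dual_norm are placeholders for the instance at hand, and the convention that the oracle draws its own ξ from the supplied generator is an assumption of the example):

    import numpy as np

    def estimate_M_star(G, draw_feasible, dual_norm, n_calls=100, seed=0):
        # Estimate M_* as the maximum of ||G(x, xi)||_* over a small number of
        # stochastic oracle calls at randomly generated feasible points (section 4.1).
        rng = np.random.default_rng(seed)
        best = 0.0
        for _ in range(n_calls):
            x = draw_feasible(rng)
            best = max(best, dual_norm(G(x, rng)))  # oracle draws xi from rng
        return best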
Format of test problems. All our test problems are of the form min_{x∈X} f(x), f(x) = E[F(x, ξ)], where the domain X either is a standard simplex {x ∈ R^n : x ≥ 0, Σ_i x_i = 1} or can be converted into such a simplex by scaling of the original variables.
Notation in the tables. Below,
• n is the design dimension of an instance,
• N is the sample size (i.e., the number of steps in SA, and the size of the sample used to build the stochastic average in SAA),
• Obj is the empirical mean of the random variable F(x, ξ), x being the approximate solution generated by the algorithm in question. The empirical means are taken over a large (K = 10⁴ elements) dedicated sample,
• CPU is the CPU time in seconds.

Table 4.2
SA versus SAA on the stochastic utility problem.

                 L1: n = 500      L2: n = 1,000    L3: n = 2,000    L4: n = 5,000
ALG.    N        Obj      CPU     Obj      CPU     Obj      CPU     Obj      CPU
N-SA    100      −7.7599  0       −5.8340  0       −7.1419  1       −5.4688  3
        1,000    −7.8781  2       −5.9152  2       −7.2312  6       −5.5716  13
        2,000    −7.8987  2       −5.9243  5       −7.2513  10      −5.5847  25
        4,000    −7.9075  5       −5.9365  12      −7.2595  20      −5.5935  49
E-SA    100      −7.6895  0       −5.7988  1       −7.0165  1       −4.9364  4
        1,000    −7.8559  2       −5.8919  4       −7.2029  7       −5.3895  20
        2,000    −7.8737  3       −5.9067  7       −7.2306  15      −5.4870  39
        4,000    −7.8948  7       −5.9193  13      −7.2441  29      −5.5354  77
SAA     100      −7.6571  7       −5.6346  8       −6.9748  19      −5.3360  44
        1,000    −7.8821  31      −5.9221  68      −7.2393  134     −5.5656  337
        2,000    −7.9100  72      −5.9313  128     −7.2583  261     −5.5878  656
        4,000    −7.9087  113     −5.9384  253     −7.2664  515     −5.5967  1,283

Table 4.3
The variability for the stochastic utility problem.

                  N-SA                       E-SA                       SAA
                  Obj              CPU       Obj              CPU       Obj              CPU
  Inst  N         Mean     Dev     (Avg.)    Mean     Dev     (Avg.)    Mean     Dev     (Avg.)
  L2    1,000     −5.9159  0.0025  2.63      −5.8925  0.0024  4.99      −5.9219  0.0047  67.31
  L2    2,000     −5.9258  0.0022  5.03      −5.9063  0.0019  7.09      −5.9328  0.0028  131.25

4.2. A stochastic utility problem. Our first experiment was carried out with
the utility model

(4.1)    min_{x∈X} { f(x) = E[ φ( ∑_{i=1}^{n} (i/n + ξ_i) x_i ) ] },
where X = {x ∈ Rⁿ : x ≥ 0, ∑_{i=1}^{n} x_i = 1}, the ξ_i ∼ N(0, 1) are independent, and φ(·) is a
piecewise linear convex function given by φ(t) = max{v₁ + s₁t, . . . , v_m + s_m t}, where
the v_k and s_k are certain constants. In our experiment, we used m = 10 breakpoints, all
located on [0, 1]. The four instances L1, L2, L3, L4 we dealt with were of dimension
varying from 500 to 5,000, each instance with its own randomly generated function
φ. All the algorithms were coded in ANSI C, and the experiments were conducted on
an Intel PIV 1.6GHz machine running Microsoft Windows XP Professional.
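For concreteness, a minimal Python sketch of a stochastic oracle for (4.1) is given below (the helper name utility_oracle is ours); it returns the random objective value together with a stochastic subgradient, obtained by differentiating the active linear piece of φ:

```python
import numpy as np

def utility_oracle(x, v, s, rng):
    """Stochastic oracle for (4.1): value F(x, xi) and subgradient G(x, xi).

    v, s : arrays of the constants v_k, s_k with phi(t) = max_k (v_k + s_k*t)
    """
    n = x.size
    a = np.arange(1, n + 1) / n           # deterministic coefficients i/n
    xi = rng.standard_normal(n)           # xi_i ~ N(0, 1), independent
    t = (a + xi) @ x                      # the random linear form
    k = int(np.argmax(v + s * t))         # active piece of phi at t
    return v[k] + s[k] * t, s[k] * (a + xi)   # F(x, xi), G(x, xi)

# Example: rng = np.random.default_rng(0); F, G = utility_oracle(x, v, s, rng)
```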
We ran each of the three aforementioned methods with various sample sizes on
every one of the instances. The results are reported in Table 4.2.
In order to evaluate the stability of the algorithms, we ran each of them 100 times;
the resulting statistics are shown in Table 4.3. In this relatively time-consuming
experiment, we restricted ourselves to a single instance (L2) and just two sample
sizes (N = 1,000 and 2,000). In Table 4.3, "Mean" and "Dev" are, respectively, the
mean and the standard deviation, over the 100 runs, of the objective value Obj at the
resulting approximate solution.
The experiments demonstrate that as far as the quality of approximate solutions
is concerned, N-SA outperforms E-SA and is almost as good as SAA. At the same
time, the solution time for N-SA is significantly smaller than the one for SAA.
4.3. Stochastic max-flow problem. In the second experiment, we consider
a simple two-stage stochastic linear program, namely, a stochastic max-flow prob-
lem. The problem is to optimize the capacity expansion of a stochastic network.

Table 4.4
SA versus SAA on the stochastic max-flow problem.

                F1               F2               F3               F4
  (m, n)        (50, 500)        (100, 1,000)     (100, 2,000)     (250, 5,000)
  ALG.   N      Obj      CPU     Obj      CPU     Obj      CPU     Obj      CPU
  N-SA   100    0.1140     0     0.0637     0     0.1296     1     0.1278      3
         1,000  0.1254     1     0.0686     3     0.1305     6     0.1329     15
         2,000  0.1249     3     0.0697     6     0.1318    11     0.1338     29
         4,000  0.1246     5     0.0698    11     0.1331    21     0.1334     56
  E-SA   100    0.0840     0     0.0618     1     0.1277     2     0.1153      7
         1,000  0.1253     3     0.0670     6     0.1281    16     0.1312     39
         2,000  0.1246     5     0.0695    13     0.1287    28     0.1312     72
         4,000  0.1247     9     0.0696    24     0.1303    53     0.1310    127
  SAA    100    0.1212     5     0.0653    12     0.1310    20     0.1253     60
         1,000  0.1223    35     0.0694    84     0.1294   157     0.1291    466
         2,000  0.1223    70     0.0693   170     0.1304   311     0.1284    986
         4,000  0.1221   140     0.0693   323     0.1301   636     0.1293  1,885

Let G = (N, A) be a digraph with a source node s and a sink node t. Each arc (i, j) ∈ A
has an existing capacity p_ij ≥ 0 and a random operating level ξ_ij.
Moreover, there is a common random degrading factor η for all arcs in A. The goal is
to determine how much capacity to add to the arcs, subject to a budget constraint,
in order to maximize the expected maximum flow from s to t. Denoting by x_ij the
capacity to be added to arc (i, j), the problem reads

(4.2)    max_x { f(x) = E[F(x; ξ, η)] : ∑_{(i,j)∈A} c_ij x_ij ≤ b, x_ij ≥ 0 ∀(i, j) ∈ A },

where c_ij is the per unit cost of the capacity to be added, b is the total available
budget, and F(x; ξ, η) denotes the maximum s–t flow in the network when the
capacity of arc (i, j) is ηξ_ij(p_ij + x_ij). Note that the above is a maximization
rather than a minimization problem.
We assume that the random variables ξ_ij, η are independent and uniformly dis-
tributed on [0, 1] and [0.5, 1], respectively, and consider the case of p_ij = 0, c_ij = 1 for
all (i, j) ∈ A, and b = 1. We randomly generated 4 network instances (referred to as
F1, F2, F3, and F4) using the network generator GRIDGEN, available from the DIMACS
challenge. The push-relabel algorithm [8] was used to solve the second-stage max-flow
problem.
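A sketch of one draw of the second-stage oracle F(x; ξ, η) of (4.2) is shown below; we substitute a generic library max-flow routine (networkx) for the push-relabel code actually used in the experiments:

```python
import networkx as nx
import numpy as np

def maxflow_oracle(x, arcs, p, s, t, rng):
    """One draw of F(x; xi, eta) for (4.2): sample the random capacities,
    then solve the resulting deterministic max-flow problem.

    x, p : dicts mapping arcs (i, j) to added / existing capacities
    """
    eta = rng.uniform(0.5, 1.0)                  # common degrading factor
    G = nx.DiGraph()
    for (i, j) in arcs:
        xi_ij = rng.uniform(0.0, 1.0)            # per-arc operating level
        G.add_edge(i, j, capacity=eta * xi_ij * (p[(i, j)] + x[(i, j)]))
    return nx.maximum_flow_value(G, s, t)
```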
In the first test, each algorithm (N-SA, E-SA, SAA) was run once on each test
instance; the results are reported in Table 4.4, where m and n stand for the number of
nodes, respectively, arcs in G. As for the stochastic utility problem, we investigated
the stability of the methods by running each of them 100 times. The resulting
statistics are presented in Table 4.5, whose columns have exactly the same meaning as
in Table 4.3.
This experiment fully supports the conclusions on the methods suggested by the
experiments with the utility problem.
4.4. A network planning problem with random demand. In the last ex-
periment, we consider the so-called SSN problem of Sen, Doverspike, and Cosares
[24]. This problem arises in telecommunications network design where the owner of
the network sells private-line services between pairs of nodes in the network, and the
demands are treated as random variables based on the historical demand patterns.

Table 4.5
The variability for the stochastic max-flow problem.

                  N-SA                      E-SA                      SAA
                  Obj             Avg.      Obj             Avg.      Obj             Avg.
  Inst  N         Mean    Dev     CPU       Mean    Dev     CPU       Mean    Dev     CPU
  F2    1,000     0.0691  0.0004  3.11      0.0688  0.0006  4.62      0.0694  0.0003  90.15
  F2    2,000     0.0694  0.0003  6.07      0.0692  0.0002  6.91      0.0695  0.0003  170.45

The optimization problem is to decide where to add capacity to the network to min-
imize the expected rate of unsatisfied demands. Since this problem has been studied
by several authors (see, e.g., [12, 24]), it is interesting to compare the results.
Another purpose of this experiment is to investigate the behavior of the SA method
when the Latin hypercube sampling (LHS) variance reduction technique (introduced
in [14]) is applied.
The problem has been formulated as the following two-stage stochastic linear program:

(4.3)    min_x { f(x) = E[F(x, ξ)] : x ≥ 0, ∑_i x_i = b },

where x is the vector of capacities to be added to the arcs of the network, b (the
budget) is the total amount of capacity to be added, ξ denotes the random demand,
and F (x, ξ) represents the number of unserved requests, specifically,
(4.4)    F(x, ξ) = min_{s,f} { ∑_i s_i :  ∑_i ∑_{r∈R(i)} A_r f_ir ≤ x + c;
                                          ∑_{r∈R(i)} f_ir + s_i = ξ_i ∀i;
                                          f_ir ≥ 0, s_i ≥ 0 ∀i, r ∈ R(i) }.

Here,
• R(i) is the set of routes used for traffic i (traffic between the source–sink pair
of nodes # i),
• ξ_i is the (random) demand for traffic i,
• A_r are the route–arc incidence vectors (so that the jth component of A_r is 1 or 0
depending on whether arc j belongs to route r),
• c is the vector of current capacities, f_ir is the fraction of traffic i transferred via
route r, and s is the vector of unsatisfied demands.
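The second-stage problem (4.4) is an ordinary LP, so for one demand realization ξ it can be handed to any LP solver; a minimal sketch using scipy (the variable layout and helper name are ours) is:

```python
import numpy as np
from scipy.optimize import linprog

def ssn_second_stage(x, c, routes, xi):
    """Solve (4.4) for one demand vector xi.

    routes[i] : list of 0/1 route-arc incidence vectors A_r for pair i.
    Returns F(x, xi), the minimal total unserved demand.
    """
    n_arcs, n_pairs = c.size, len(routes)
    cols = [(i, Ar) for i in range(n_pairs) for Ar in routes[i]]
    nf = len(cols)
    # Objective: only the slacks s_i are penalized.
    obj = np.concatenate([np.zeros(nf), np.ones(n_pairs)])
    # Capacity constraints: sum_i sum_{r in R(i)} A_r f_ir <= x + c.
    A_ub = np.hstack([np.column_stack([Ar for _, Ar in cols]),
                      np.zeros((n_arcs, n_pairs))])
    # Demand constraints: sum_{r in R(i)} f_ir + s_i = xi_i for every i.
    A_eq = np.zeros((n_pairs, nf + n_pairs))
    for k, (i, _) in enumerate(cols):
        A_eq[i, k] = 1.0
    A_eq[:, nf:] = np.eye(n_pairs)
    res = linprog(obj, A_ub=A_ub, b_ub=x + c, A_eq=A_eq, b_eq=xi,
                  bounds=(0, None), method="highs")
    return res.fun
```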
In the SSN instance, there are dim x = 89 arcs and dim ξ = 86 source–sink
pairs, and the components of ξ are independent random variables with known discrete
distributions (from 3 to 7 possible values per component), which results in ≈ 10⁷⁰
possible demand scenarios.
In the first test with the SSN instance, each of our 3 algorithms was run once
without and once with the LHS technique; the results are reported in Table 4.6. We
then tested the stability of the algorithms by running each of them 100 times; see the
statistics in Table 4.7. Note that the experiments with the SSN problem were conducted
on a more powerful computer: an Intel Xeon 1.86GHz machine running Red Hat
Enterprise Linux.
As far as comparison of our three algorithms is concerned, the conclusions are
in full agreement with those for the utility and the max-flow problem. We also see
that for our particular example, the LHS does not yield much of an improvement,
especially when a larger sample size is applied. This result seems to be consistent
with the observation in [12].

Table 4.6
SA versus SAA on the SSN problem.

                Without LHS        With LHS
  Alg.   N      Obj       CPU      Obj       CPU
  N-SA   100    11.0984     1      10.1024     1
         1,000  10.0821     6      10.0313     7
         2,000   9.9812    12       9.9936    12
         4,000   9.9151    23       9.9428    22
  E-SA   100    10.9027     1      10.3860     1
         1,000  10.1268     6      10.0984     6
         2,000  10.0304    12      10.0552    12
         4,000   9.9662    23       9.9862    23
  SAA    100    11.8915    24      11.0561    23
         1,000  10.0939   215      10.0488   216
         2,000   9.9769   431       9.9872   426
         4,000   9.8773   849       9.9051   853

Table 4.7
The variability for the SSN problem.

                  N-SA                     E-SA                     SAA
                  Obj            Avg.      Obj            Avg.      Obj            Avg.
  N      LHS      Mean    Dev    CPU       Mean    Dev    CPU       Mean    Dev    CPU
  1,000  no       10.0624 0.1867  6.03     10.1730 0.1826  6.12     10.1460 0.2825 215.06
  1,000  yes      10.0573 0.1830  6.16     10.1237 0.1867  6.14     10.0135 0.2579 216.10
  2,000  no        9.9965 0.2058 11.61     10.0853 0.1887 11.68      9.9943 0.2038 432.93
  2,000  yes       9.9978 0.2579 11.71     10.0486 0.2066 11.74      9.9830 0.1872 436.94

4.5. N-SA versus E-SA. The data in Tables 4.3, 4.4, and 4.6 demonstrate that
with the same sample size N, N-SA consistently outperforms E-SA in terms of
both the quality of approximate solutions and the running time.4 The difference in
solution quality, at first glance, seems slim, and one could think that adjusting
the SA algorithm to the "geometry" of the problem in question (in our case, to
minimization over a standard simplex) is of minor importance. We, however,
believe that such a conclusion would be wrong. In order to get a better insight, let
us come back to the stochastic utility problem. This test problem has an important
advantage—we can compute the value of the objective f(x) at a given candidate
solution x analytically.5 Moreover, it is easy to minimize f(x) over the simplex—on
closer inspection, this problem reduces to minimizing an easy-to-compute univariate
convex function, so that we can approximate the true optimal value f∗ to high accuracy
by bisection. Thus, in the case in question, we can compare the solutions x generated
by various algorithms in terms of their "true inaccuracy" f(x) − f∗, and this is the
rationale behind our "Gaussian setup." We can now exploit this advantage of the
stochastic utility problem to compare N-SA and E-SA properly. In Table 4.8, we
present the true values of the objective f(x̄) at the approximate solutions x̄ generated
by N-SA and E-SA as applied to the instances L2 and L4 of the utility problem
(cf. Table 4.3), along with the inaccuracies f(x̄) − f∗ and the Monte Carlo estimates
f*(x̄) of f(x̄) obtained via 50,000-element samples.
4 The difference in running times can be easily explained: with X being a simplex, the prox-
mapping for E-SA takes O(n ln n) operations versus O(n) operations for N-SA.
5 Indeed, (ξ₁, . . . , ξ_n) ∼ N(0, I_n), so that the random variable ξ_x = ∑_i (a_i + ξ_i) x_i is normal with
easily computable mean and variance, and since φ is piecewise linear, the expectation f(x) = E[φ(ξ_x)]
can be immediately expressed via the error function.

Table 4.8
N-SA versus E-SA.

  Method             Problem          f*(x̄) / f(x̄)        f(x̄) − f∗   Time
  N-SA, N = 2,000    L2: n = 1,000    −5.9232 / −5.9326    0.0113      5.00
  E-SA, N = 2,000    L2               −5.8796 / −5.8864    0.0575      6.60
  E-SA, N = 10,000   L2               −5.9059 / −5.9058    0.0381      39.80
  E-SA, N = 20,000   L2               −5.9151 / −5.9158    0.0281      74.50
  N-SA, N = 2,000    L4: n = 5,000    −5.5855 / −5.5867    0.0199      25.00
  E-SA, N = 2,000    L4               −5.5467 / −5.5469    0.0597      44.60
  E-SA, N = 10,000   L4               −5.5810 / −5.5812    0.0254      165.10
  E-SA, N = 20,000   L4               −5.5901 / −5.5902    0.0164      382.00

We see that the difference in the inaccuracy f(x̄) − f∗ of the solutions produced by the
two algorithms is much more significant than suggested by the data in Table 4.3 (where
the actual inaccuracy is "obscured" by the estimation error and summation with f∗).
Specifically, at the sample size N = 2,000 common to both algorithms, the inaccuracy
yielded by N-SA is 3–5 times less than that of E-SA, and in order to compensate for this
difference, one should increase the sample size for E-SA (and hence the running time) by
a factor of 5–10. It should be added that in light of the theoretical complexity analysis
carried out in Example 2.3, the outlined significant difference in the performances of
N-SA and E-SA is not surprising; the surprising fact is that E-SA works at all.
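As a sketch of the exact evaluation mentioned in footnote 5, the expectation E[φ(ξ_x)] of a piecewise linear function of the Gaussian ξ_x can be summed interval by interval; in the code below we assume the upper envelope of the lines v_k + s_k t is already described by its ordered pieces and breakpoints:

```python
import numpy as np
from scipy.stats import norm

def exact_objective(x, pieces, breaks):
    """Closed-form f(x) = E[phi(xi_x)] with xi_x ~ N(mu, sigma^2).

    pieces : [(v_k, s_k)] active on consecutive intervals split at `breaks`
             (the breakpoints of the upper envelope of phi, assumed given).
    """
    n = x.size
    mu = (np.arange(1, n + 1) / n) @ x       # mean of xi_x
    sigma = np.linalg.norm(x)                # Var(xi_x) = sum_i x_i^2
    ends = np.concatenate([[-np.inf], np.asarray(breaks), [np.inf]])
    total = 0.0
    for (vk, sk), a, b in zip(pieces, ends[:-1], ends[1:]):
        alpha, beta = (a - mu) / sigma, (b - mu) / sigma
        pr = norm.cdf(beta) - norm.cdf(alpha)               # P(a < xi_x < b)
        m1 = mu * pr - sigma * (norm.pdf(beta) - norm.pdf(alpha))
        total += vk * pr + sk * m1     # E[(v_k + s_k*xi_x) 1{a < xi_x < b}]
    return total
```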
4.6. Bilinear matrix game. We consider here a bilinear matrix game

min_{x∈X} max_{y∈Y} yᵀAx,

where both feasible sets are the standard simplices in Rⁿ: X = Y = {x ∈ Rⁿ :
∑_{i=1}^{n} x_i = 1, x ≥ 0}. We consider two versions of the randomized mirror SA algo-
rithm (3.7), (3.8) for the saddle point problem. The first algorithm, the E-SA, uses
(1/2)‖x‖₂² as ω_x, ω_y and ‖·‖₂ as ‖·‖_x, ‖·‖_y. The second algorithm, the N-SA, uses the
entropy function (2.58) as ω_x, ω_y and the norm ‖·‖₁ as ‖·‖_x, ‖·‖_y. To compare
the two procedures, we compute the corresponding approximate solutions z̃₁^N =
(x̃₁^N, ỹ₁^N) for each of the two algorithms and compute the exact value of the error

ε(z̃₁^N) = max_{y∈Y} yᵀAx̃₁^N − min_{x∈X} (ỹ₁^N)ᵀAx.
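Evaluating this error is cheap because a linear form attains its extrema over a simplex at a vertex; a two-line sketch:

```python
import numpy as np

def matrix_game_error(A, x_bar, y_bar):
    """Exact error of (x_bar, y_bar) for min_x max_y y^T A x on simplices."""
    # max_y y^T A x_bar is the largest entry of A x_bar; similarly for the min.
    return float(np.max(A @ x_bar) - np.min(A.T @ y_bar))
```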

In our experiments we consider symmetric matrices A of two kinds. The matrices of
the first family, parameterized by α > 0, are given by

A_ij = ((i + j − 1)/(2n − 1))^α,  1 ≤ i, j ≤ n.

The second family of matrices, also parameterized by α > 0, is given by

A_ij = ((|i − j| + 1)/(2n − 1))^α,  1 ≤ i, j ≤ n.

We use the notations E1 (α) and E2 (α) to refer to the experiments with the matrices
of the first and second kind with parameter α. We present in Table 4.9 the results of
experiments conducted for the matrices A of size 104 × 104 . We made 100 simulation
runs in each experiment and present the average error (column Mean), standard

Table 4.9
SA for bilinear matrix games.

           E2(2), ε(z̃₁) = 0.500     E2(1), ε(z̃₁) = 0.500     E2(0.5), ε(z̃₁) = 0.390
  N-SA     ε(z̃₁^N)                  ε(z̃₁^N)                  ε(z̃₁^N)
  N        Mean      Dev      CPU    Mean      Dev      CPU    Mean      Dev      CPU
  100      0.0121    3.9e−4   0.58   0.0127    1.9e−4   0.69   0.0122    4.3e−4   0.81
  1,000    0.00228   3.7e−5   5.8    0.00257   2.2e−5   7.3    0.00271   4.5e−5   8.5
  2,000    0.00145   2.1e−5   11.6   0.00166   1.0e−5   13.8   0.00179   2.7e−5   16.4
  E-SA     ε(z̃₁^N)                  ε(z̃₁^N)                  ε(z̃₁^N)
  N        Mean      Dev      CPU    Mean      Dev      CPU    Mean      Dev      CPU
  100      0.00952   1.0e−4   1.27   0.0102    5.1e−5   1.77   0.00891   1.1e−4   1.94
  1,000    0.00274   1.3e−5   11.3   0.00328   7.8e−6   17.6   0.00309   1.6e−5   20.9
  2,000    0.00210   7.4e−6   39.7   0.00256   4.6e−6   36.7   0.00245   7.8e−6   39.2

           E1(2), ε(z̃₁) = 0.0625    E1(1), ε(z̃₁) = 0.125     E1(0.5), ε(z̃₁) = 0.138
  N-SA     ε(z̃₁^N)                  ε(z̃₁^N)                  ε(z̃₁^N)
  N        Mean      Dev      CPU    Mean      Dev      CPU    Mean      Dev      CPU
  100      0.00817   0.0016   0.58   0.0368    0.0068   0.66   0.0529    0.0091   0.78
  1,000    0.00130   2.7e−4   6.2    0.0115    0.0024   6.5    0.0191    0.0033   7.6
  2,000    0.00076   1.6e−4   11.4   0.00840   0.0014   11.7   0.0136    0.0018   13.8
  E-SA     ε(z̃₁^N)                  ε(z̃₁^N)                  ε(z̃₁^N)
  N        Mean      Dev      CPU    Mean      Dev      CPU    Mean      Dev      CPU
  100      0.00768   0.0012   1.75   0.0377    0.0062   2.05   0.0546    0.0064   2.74
  1,000    0.00127   2.2e−4   17.2   0.0125    0.0022   19.9   0.0207    0.0020   18.4
  2,000    0.00079   1.6e−4   35.0   0.00885   0.0015   36.3   0.0149    0.0020   36.7

(CPU times are averages over the runs.)

In Table 4.9 we report the average error (column Mean), the standard deviation
(column Dev), and the average running time, excluding the time needed to compute
the error of the resulting solution. For comparison, we also present the error of the
initial solution z̃₁ = (x₁, y₁).
Our basic observation is as follows: Both N-SA and E-SA succeed in reducing the
solution error reasonably fast. The N-SA implementation is preferable, as it is more
efficient in terms of running time. For comparison, it takes Matlab from 10 seconds (for
the simplest problem) to 35 seconds (for the hardest one) to compute just one answer
g(x, y) = (AᵀY, −Ax), i.e., g(x, y) = (Aᵀy, −Ax), of the deterministic oracle.
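The contrast with the randomized oracle is the whole point: instead of the O(n²) matrix-vector products above, a single row and column of A are sampled per step. A minimal sketch of this standard sampling trick (the paper's construction enters through the randomized mirror SA algorithm (3.7), (3.8); the code below is our illustration of the idea) is:

```python
import numpy as np

def randomized_game_oracle(x, y, A, rng):
    """Unbiased stochastic gradients for y^T A x at O(n) cost per call."""
    j = rng.choice(x.size, p=x)    # j ~ x, so E[A[:, j]] = A @ x
    i = rng.choice(y.size, p=y)    # i ~ y, so E[A[i, :]] = A.T @ y
    Gx = A[i, :]                   # stochastic gradient in x
    Gy = -A[:, j]                  # stochastic gradient in y (max player)
    return Gx, Gy
```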

5. Conclusions. It is shown that for a certain class of convex stochastic opti-
mization and saddle point problems, robust versions of the SA approach have theo-
retical estimates of computational complexity, in terms of the required sample size,
similar to those of the SAA method. The numerical experiments reported in section 4
confirm this conclusion. These results demonstrate that for the problems considered, a
properly implemented mirror descent SA algorithm produces solutions of accuracy
comparable to that of the SAA method for the same sample size of generated random
points. On the other hand, the implementation (computational) time of the SA method
is significantly smaller, by a factor of up to 30–40 for the problems considered. Thus,
both the theoretical and the numerical results suggest that the robust mirror descent
SA is a viable alternative to the SAA approach, an alternative which at least deserves
testing in particular applications. It is also shown that the robust mirror SA approach
can be applied as a randomization algorithm to large-scale deterministic saddle point
problems (in particular, to minimax optimization problems and bilinear matrix games)
with encouraging results.

6. Appendix. Proof of Lemma 2.1. Let x ∈ X° and v = P_x(y). Note that
v ∈ argmin_{z∈X} [ω(z) + pᵀz], where p = y − ∇ω(x). Thus, ω is differentiable at v and
v ∈ X°. As ∇_v V(x, v) = ∇ω(v) − ∇ω(x), the optimality conditions for (2.31) imply that

(6.1)    (∇ω(v) − ∇ω(x) + y)ᵀ(v − u) ≤ 0 ∀u ∈ X.
For u ∈ X, we therefore have

V(v, u) − V(x, u) = [ω(u) − ∇ω(v)ᵀ(u − v) − ω(v)] − [ω(u) − ∇ω(x)ᵀ(u − x) − ω(x)]
                  = (∇ω(v) − ∇ω(x) + y)ᵀ(v − u) + yᵀ(u − v) − V(x, v)
                  ≤ yᵀ(u − v) − V(x, v),

where the last inequality is due to (6.1). By Young's inequality,6 we have

yᵀ(x − v) ≤ ‖y‖∗²/(2α) + (α/2)‖x − v‖²,

while V(x, v) ≥ (α/2)‖x − v‖², due to the strong convexity of V(x, ·). We get

V(v, u) − V(x, u) ≤ yᵀ(u − v) − V(x, v) = yᵀ(u − x) + yᵀ(x − v) − V(x, v)
                  ≤ yᵀ(u − x) + ‖y‖∗²/(2α),

as required in (2.33).
Entropy as a distance-generating function on the standard simplex. The only
property which is not immediately evident is that the entropy ω(x) = ∑_{i=1}^{n} x_i ln x_i
is strongly convex, modulus 1 with respect to the ‖·‖₁-norm, on the standard simplex
X = {x ∈ Rⁿ : x ≥ 0, ∑_{i=1}^{n} x_i = 1}. We are in the situation where X° = {x ∈ X :
x > 0}, and in order to establish the property in question, it suffices to verify that
hᵀ∇²ω(x)h ≥ ‖h‖₁² for every x ∈ X°. Here is the computation:

(∑_i |h_i|)² = (∑_i (x_i^{1/2})(x_i^{−1/2}|h_i|))² ≤ (∑_i x_i)(∑_i h_i² x_i^{−1}) = ∑_i h_i² x_i^{−1} = hᵀ∇²ω(x)h,

where the inequality follows from the Cauchy inequality.
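For the record, with this choice of ω the prox-mapping of Lemma 2.1 has a closed form (v_i proportional to x_i e^{−y_i}); a small sketch with the usual overflow guard:

```python
import numpy as np

def entropy_prox(x, y):
    """Prox-mapping P_x(y) on the simplex for the entropy omega:
    argmin_z { y^T z + V(x, z) } = x * exp(-y) / sum(x * exp(-y))."""
    u = np.log(x) - y
    u -= u.max()              # shift in the exponent for numerical stability
    v = np.exp(u)
    return v / v.sum()
```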


Proof of Lemma 3.1. By (2.33), we have, for any u ∈ Z, that

(6.2)    γ_t (z_t − u)ᵀG(z_t, ξ_t) ≤ V(z_t, u) − V(z_{t+1}, u) + (γ_t²/2)‖G(z_t, ξ_t)‖∗²

(recall that we are in the situation of α = 1). This relation implies that for every
u ∈ Z, one has

(6.3)    γ_t (z_t − u)ᵀg(z_t) ≤ V(z_t, u) − V(z_{t+1}, u) + (γ_t²/2)‖G(z_t, ξ_t)‖∗² − γ_t (z_t − u)ᵀΔ_t,

where Δ_t = G(z_t, ξ_t) − g(z_t). Summing up these inequalities over t = 1, . . . , j, we get

∑_{t=1}^{j} γ_t (z_t − u)ᵀg(z_t) ≤ V(z_1, u) − V(z_{j+1}, u) + ∑_{t=1}^{j} (γ_t²/2)‖G(z_t, ξ_t)‖∗² − ∑_{t=1}^{j} γ_t (z_t − u)ᵀΔ_t.

Now we need the following simple lemma.


6 For any u, v ∈ Rⁿ, we have, by the definition of the dual norm, ‖u‖∗‖v‖ ≥ uᵀv, and hence
(‖u‖∗²/α + α‖v‖²)/2 ≥ ‖u‖∗‖v‖ ≥ uᵀv.

Lemma 6.1. Let ζ_1, . . . , ζ_j be a sequence of elements of R^{n+m}. Define the se-
quence v_t, t = 1, 2, . . . , in Z° as follows: v_1 ∈ Z° and

v_{t+1} = P_{v_t}(ζ_t), 1 ≤ t ≤ j.

Then, for any u ∈ Z, the following holds:

(6.4)    ∑_{t=1}^{j} ζ_tᵀ(v_t − u) ≤ V(v_1, u) + (1/2) ∑_{t=1}^{j} ‖ζ_t‖∗².

Proof. Using the bound (2.33) of Lemma 2.1 with y = ζ_t and x = v_t (so that
v_{t+1} = P_{v_t}(ζ_t)), and recalling that we are in the situation of α = 1, we obtain the
following for any u ∈ Z:

V(v_{t+1}, u) ≤ V(v_t, u) + ζ_tᵀ(u − v_t) + (1/2)‖ζ_t‖∗².

Summing up from t = 1 to t = j, we conclude that

V(v_{j+1}, u) ≤ V(v_1, u) + ∑_{t=1}^{j} ζ_tᵀ(u − v_t) + (1/2) ∑_{t=1}^{j} ‖ζ_t‖∗²,

which implies (6.4) due to V(v, u) ≥ 0 for any v ∈ Z°, u ∈ Z.


Applying Lemma 6.1 with v_1 = z_1 and ζ_t = −γ_tΔ_t, we get

(6.5)    ∑_{t=1}^{j} γ_t Δ_tᵀ(u − v_t) ≤ V(z_1, u) + (1/2) ∑_{t=1}^{j} γ_t²‖Δ_t‖∗² ∀u ∈ Z.

Observe that

E[‖Δ_t‖∗²] ≤ 4E[‖G(z_t, ξ_t)‖∗²] ≤ 4 [ (2D²_{ω_x,X}/α_x) M²_{∗,x} + (2D²_{ω_y,Y}/α_y) M²_{∗,y} ] = 4M∗²,

so that when taking the expectation of both sides of (6.5), we get

(6.6)    E[ sup_{u∈Z} ∑_{t=1}^{j} γ_t Δ_tᵀ(u − v_t) ] ≤ 1 + 2M∗² ∑_{t=1}^{j} γ_t²

(recall that V(z_1, ·) is bounded by 1 on Z). Now we proceed exactly as in section 2.2:
we sum up (6.3) from t = 1 to j to obtain

(6.7)    ∑_{t=1}^{j} γ_t (z_t − u)ᵀg(z_t) ≤ V(z_1, u) + ∑_{t=1}^{j} (γ_t²/2)‖G(z_t, ξ_t)‖∗² − ∑_{t=1}^{j} γ_t (z_t − u)ᵀΔ_t
                                         = V(z_1, u) + ∑_{t=1}^{j} (γ_t²/2)‖G(z_t, ξ_t)‖∗² − ∑_{t=1}^{j} γ_t (z_t − v_t)ᵀΔ_t + ∑_{t=1}^{j} γ_t (u − v_t)ᵀΔ_t.

When taking into account that z_t and v_t are deterministic functions of ξ_[t−1] =
(ξ_1, . . . , ξ_{t−1}) and that the conditional expectation of Δ_t, given ξ_[t−1], van-
ishes, we conclude that E[(z_t − v_t)ᵀΔ_t] = 0. We now take suprema in u ∈ Z and then

expectations on both sides of (6.7):

E[ sup_{u∈Z} ∑_{t=1}^{j} γ_t (z_t − u)ᵀg(z_t) ] ≤ sup_{u∈Z} V(z_1, u) + ∑_{t=1}^{j} (γ_t²/2) E[‖G(z_t, ξ_t)‖∗²] + E[ sup_{u∈Z} ∑_{t=1}^{j} γ_t (u − v_t)ᵀΔ_t ]
    ≤ 1 + (M∗²/2) ∑_{t=1}^{j} γ_t² + 1 + 2M∗² ∑_{t=1}^{j} γ_t²    (by (6.6))
    = 2 + (5/2) M∗² ∑_{t=1}^{j} γ_t²,

and we arrive at (3.10).


Proof of Propositions 2.2 and 3.2. We provide here the proof of Proposition 3.2
only. The proof of Proposition 2.2 follows the same lines and can be easily recon-
structed using the bound (2.39) instead of the relations (6.5) and (6.7) in the proof
below.
First of all, with M∗ given by (3.6), one has

(6.8)    ∀z ∈ Z : E[exp{‖G(z, ξ)‖∗²/M∗²}] ≤ exp{1}.

Indeed, setting p_x = 2D²_{ω_x,X} M²_{∗,x}/(α_x M∗²) and p_y = 2D²_{ω_y,Y} M²_{∗,y}/(α_y M∗²), we have
p_x + p_y = 1, whence, invoking (3.4),

E[exp{‖G(z, ξ)‖∗²/M∗²}] = E[exp{ p_x‖G_x(z, ξ)‖²_{∗,x}/M²_{∗,x} + p_y‖G_y(z, ξ)‖²_{∗,y}/M²_{∗,y} }],

and (6.8) follows from (3.16) by the Hölder inequality.
Setting Γ_N = ∑_{t=1}^{N} γ_t and using the notation from the proof of Lemma 3.1,
relations (3.9), (6.5), and (6.7), combined with the fact that V(z_1, u) ≤ 1 for u ∈ Z,
imply that

(6.9)    Γ_N ε_φ(z̃_N) ≤ 2 + (1/2) ∑_{t=1}^{N} γ_t² [‖G(z_t, ξ_t)‖∗² + ‖Δ_t‖∗²] + ∑_{t=1}^{N} γ_t (v_t − z_t)ᵀΔ_t =: 2 + α_N + β_N,

where α_N denotes the first sum and β_N the second.

Now, from (6.8), it follows straightforwardly that

(6.10)    E[exp{‖Δ_t‖∗²/(2M∗)²}] ≤ exp{1},    E[exp{‖G(z_t, ξ_t)‖∗²/M∗²}] ≤ exp{1},

which, in turn, implies that

(6.11)    E[exp{α_N/σ_α}] ≤ exp{1},    σ_α = (5/2) M∗² ∑_{t=1}^{N} γ_t²,

and therefore, by the Markov inequality, for any Ω > 0,

(6.12)    Prob{α_N ≥ (1 + Ω)σ_α} ≤ exp{−Ω}.
Indeed, we have by (6.8)

‖g(z_t)‖∗ = ‖E[G(z_t, ξ_t)|ξ_[t−1]]‖∗ ≤ √(E[‖G(z_t, ξ_t)‖∗² | ξ_[t−1]]) ≤ M∗

and

‖Δ_t‖∗² = ‖G(z_t, ξ_t) − g(z_t)‖∗² ≤ (‖G(z_t, ξ_t)‖∗ + ‖g(z_t)‖∗)² ≤ 2‖G(z_t, ξ_t)‖∗² + 2M∗²,

which implies that

α_N ≤ ∑_{t=1}^{N} (γ_t²/2) [3‖G(z_t, ξ_t)‖∗² + 2M∗²].

Further, by the Hölder inequality, we have the following from (6.8):

E[ exp{ γ_t² ((3/2)‖G(z_t, ξ_t)‖∗² + M∗²) / ((5/2) γ_t² M∗²) } ] ≤ exp(1).

Observe that if r_1, . . . , r_i are nonnegative random variables such that E[exp{r_t/σ_t}] ≤
exp{1} for some deterministic σ_t > 0, then, by convexity of the exponent w(s) = exp{s},

(6.13)    E[ exp{ ∑_{t≤i} r_t / ∑_{t≤i} σ_t } ] ≤ E[ ∑_{t≤i} (σ_t / ∑_{τ≤i} σ_τ) exp{r_t/σ_t} ] ≤ exp{1}.

Now, applying (6.13) with r_t = γ_t² [(3/2)‖G(z_t, ξ_t)‖∗² + M∗²] and σ_t = (5/2) γ_t² M∗², we obtain
(6.11).
Now let ζ_t = γ_t (v_t − z_t)ᵀΔ_t. Observing that v_t, z_t are deterministic functions
of ξ_[t−1], while E[Δ_t|ξ_[t−1]] = 0, we see that the sequence {ζ_t}_{t=1}^{N} of random real
variables forms a martingale difference. Besides this, by strong convexity of ω with
modulus 1 w.r.t. ‖·‖ and due to D_{ω,Z} ≤ 1, we have

u ∈ Z ⇒ 1 ≥ V(z_1, u) ≥ (1/2)‖u − z_1‖²,

whence the ‖·‖-diameter of Z does not exceed 2√2, so that |ζ_t| ≤ 2√2 γ_t‖Δ_t‖∗, and
therefore

E[ exp{|ζ_t|²/(32 γ_t² M∗²)} | ξ_[t−1] ] ≤ exp{1}

by (6.10). Applying Cramer's deviation bound, we obtain, for any Ω > 0,

(6.14)    Prob{ β_N > 4√2 Ω M∗ √(∑_{t=1}^{N} γ_t²) } ≤ exp{−Ω²/4}.

Indeed, for 0 ≤ γ, setting σt = 4 2γt M∗ and taking into account that ζt is a deter-
ministic function of ξ[t] , with E[ζt |ξ[t−1] ] = 0 and E[exp{ζt2 /σt2 }|ξ[t−1] ] ≤ exp{1}, we
have
& 2
'
0 < γσt ≤ 1 ⇒ as ex ≤ x + ex
 
E[exp{γζt }|ξ[t−1] ] ≤ E[exp γ 2 ζt2 |ξ[t−1] ]
  γ 2 σt2   
≤ E exp ζt2 /σt2 |ξ[t−1] ≤ exp γ 2 σt2 ;
γσt > 1 ⇒
 
E[exp{γζt }|ξ[t−1] ] ≤ E exp [ 12 γ 2 σt2 + 12 ζt2 /σt2 |ξ[t−1]
   
≤ exp 12 γ 2 σt2 + 12 ≤ exp γ 2 σt2 ,

that is, in both cases, E[exp{γζ_t}|ξ_[t−1]] ≤ exp{γ²σ_t²}. Therefore,

E[exp{γβ_i}] = E[ exp{γβ_{i−1}} E[exp{γζ_i}|ξ_[i−1]] ] ≤ exp{γ²σ_i²} E[exp{γβ_{i−1}}],

whence E[exp{γβ_N}] ≤ exp{γ² ∑_{t=1}^{N} σ_t²}, and thus, by the Markov inequality, for every
Ω > 0, it holds that

Prob{ β_N > Ω √(∑_{t=1}^{N} σ_t²) } ≤ exp{γ² ∑_{t=1}^{N} σ_t²} exp{−γΩ √(∑_{t=1}^{N} σ_t²)}.

Choosing γ = (1/2) Ω (∑_{t=1}^{N} σ_t²)^{−1/2}, we arrive at (6.14).
Combining (6.9), (6.10), and (6.14), we get, for any positive Ω and Θ,

Prob{ Γ_N ε_φ(z̃_N) > 2 + (1 + Ω)(5/2) M∗² ∑_{t=1}^{N} γ_t² + 4√2 Θ M∗ √(∑_{t=1}^{N} γ_t²) }
    ≤ exp{−Ω} + exp{−Θ²/4}.

Setting Θ = 2√Ω and substituting (3.12), we obtain (3.17).
Proof of Proposition 3.3. As in the proof of Proposition 3.2, setting Γ_N =
∑_{t=1}^{N} γ_t and using relations (3.9), (6.5), and (6.7), combined with the fact that
‖G(z, ξ)‖∗ ≤ M∗, we obtain

           Γ_N ε_φ(z̃_N) ≤ 2 + ∑_{t=1}^{N} (γ_t²/2) [‖G(z_t, ξ_t)‖∗² + ‖Δ_t‖∗²] + ∑_{t=1}^{N} γ_t (v_t − z_t)ᵀΔ_t
(6.15)                   ≤ 2 + (5/2) M∗² ∑_{t=1}^{N} γ_t² + ∑_{t=1}^{N} γ_t (v_t − z_t)ᵀΔ_t =: 2 + (5/2) M∗² ∑_{t=1}^{N} γ_t² + α_N.

Recall that by definition of Δ_t, ‖Δ_t‖∗ = ‖G(z_t, ξ_t) − g(z_t)‖∗ ≤ ‖G(z_t, ξ_t)‖∗ + ‖g(z_t)‖∗ ≤
2M∗.
Note that ζ_t = γ_t (v_t − z_t)ᵀΔ_t is a bounded martingale difference, i.e., E[ζ_t|ξ_[t−1]] =
0 and |ζ_t| ≤ 4γ_t M (here M is defined in (3.31)). Then, by the Azuma–Hoeffding in-
equality [1], for any Ω ≥ 0,

(6.16)    Prob{ α_N > 4ΩM √(∑_{t=1}^{N} γ_t²) } ≤ e^{−Ω²/2}.

Indeed, let us denote v_t = (v_t^(x), v_t^(y)) and Δ_t = (Δ_t^(x), Δ_t^(y)). Taking into
account that ‖v_t^(x)‖₁ ≤ 1, ‖v_t^(y)‖₁ ≤ 1 and ‖x_t‖₁ ≤ 1, ‖y_t‖₁ ≤ 1, we conclude that

|(v_t − z_t)ᵀΔ_t| ≤ |(v_t^(x) − x_t)ᵀΔ_t^(x)| + |(v_t^(y) − y_t)ᵀΔ_t^(y)|
                 ≤ 2‖Δ_t^(x)‖_∞ + 2‖Δ_t^(y)‖_∞ ≤ 4 max_{1≤j≤m} ‖A^j + b‖_∞ + 4 max_{1≤j≤n} ‖A_j + c‖_∞
                 = 4M.

We conclude from (6.15) and (6.16) that

Prob{ Γ_N ε_φ(z̃_N) > 2 + (5/2) M∗² ∑_{t=1}^{N} γ_t² + 4ΩM √(∑_{t=1}^{N} γ_t²) } ≤ e^{−Ω²/2},

and the bound (3.30) of the proposition is obtained by substituting the constant
stepsizes γ_t defined in (3.12).

REFERENCES

[1] K. Azuma, Weighted sums of certain dependent random variables, Tôhoku Math. J., 19 (1967),
pp. 357–367.
[2] A. Ben-Tal and A. Nemirovski, Non-Euclidean restricted memory level method for large-scale
convex optimization, Math. Program., 102 (2005), pp. 407–456.
[3] A. Benveniste, M. Métivier, and P. Priouret, Algorithmes Adaptatifs et Approximations
Stochastiques, Masson, Paris, 1987 (in French). Adaptive Algorithms and Stochastic Ap-
proximations, Springer, New York, 1993 (in English).
[4] L.M. Bregman, The relaxation method of finding the common points of convex sets and its
application to the solution of problems in convex programming, Comput. Math. Math.
Phys., 7 (1967), pp. 200–217.
[5] K.L. Chung, On a stochastic approximation method, Ann. Math. Statist., 25 (1954), pp. 463–
483.
[6] Y. Ermoliev, Stochastic quasigradient methods and their application to system optimization,
Stochastics, 9 (1983), pp. 1–36.
[7] A.A. Gaivoronski, Nonstationary stochastic programming problems, Kybernetika, 4 (1978),
pp. 89–92.
[8] A.V. Goldberg and R.E. Tarjan, A new approach to the maximum flow problem, J. ACM,
35 (1988), pp. 921–940.
[9] M.D. Grigoriadis and L.G. Khachiyan, A sublinear-time randomized approximation algo-
rithm for matrix games, Oper. Res. Lett., 18 (1995), pp. 53–58.
[10] A. Juditsky, A. Nazin, A. Tsybakov, and N. Vayatis, Recursive aggregation of estimators by
the mirror descent algorithm with averaging, Probl. Inf. Transm., 41 (2005), pp. 368–384.
[11] A.J. Kleywegt, A. Shapiro, and T. Homem-de-Mello, The sample average approximation
method for stochastic discrete optimization, SIAM J. Optim., 12 (2002), pp. 479–502.
[12] J. Linderoth, A. Shapiro, and S. Wright, The empirical behavior of sampling methods for
stochastic programming, Ann. Oper. Res., 142 (2006), pp. 215–241.
[13] W.K. Mak, D.P. Morton, and R.K. Wood, Monte Carlo bounding techniques for determining
solution quality in stochastic programs, Oper. Res. Lett., 24 (1999), pp. 47–56.
[14] M.D. McKay, R.J. Beckman, and W.J. Conover, A comparison of three methods for select-
ing values of input variables in the analysis of output from a computer code, Technometrics,
21 (1979), pp. 239–245.
[15] A. Nemirovski and D. Yudin, On Cezari’s convergence of the steepest descent method for
approximating saddle point of convex-concave functions, Dokl. Akad. Nauk SSSR, 239
(1978) (in Russian). Soviet Math. Dokl., 19 (1978) (in English).
[16] A. Nemirovski and D. Yudin, Problem Complexity and Method Efficiency in Optimization,
Wiley-Intersci. Ser. Discrete Math. 15, John Wiley, New York, 1983.
[17] Y. Nesterov, Primal-dual subgradient methods for convex problems, Math. Program., Ser. B,
available online at http://www.springerlink.com/content/b441795t5254m533.
[18] B.T. Polyak, New stochastic approximation type procedures, Avtomat. i Telemekh., 7 (1990),
pp. 98–107.
[19] B.T. Polyak and A.B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM
J. Control Optim., 30 (1992), pp. 838–855.
[20] G.C. Pflug, Optimization of Stochastic Models, The Interface Between Simulation and Opti-
mization, Kluwer, Boston, 1996.
[21] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951),
pp. 400–407.
[22] A. Ruszczyński and W. Syski, A method of aggregate stochastic subgradients with on-line
stepsize rules for convex stochastic programming problems, Math. Prog. Stud., 28 (1986),
pp. 113–131.

[23] J. Sacks, Asymptotic distribution of stochastic approximation procedures, Ann. Math. Statist.,
29 (1958), pp. 373–405.
[24] S. Sen, R.D. Doverspike, and S. Cosares, Network planning with random demand,
Telecomm. Syst., 3 (1994), pp. 11–30.
[25] A. Shapiro, Monte Carlo sampling methods, in Stochastic Programming, Handbook in OR &
MS, Vol. 10, A. Ruszczyński and A. Shapiro, eds., North-Holland, Amsterdam, 2003.
[26] A. Shapiro and A. Nemirovski, On complexity of stochastic programming problems, in Con-
tinuous Optimization: Current Trends and Applications, V. Jeyakumar and A.M. Rubinov,
eds., Springer, New York, 2005, pp. 111–144.
[27] J.C. Spall, Introduction to Stochastic Search and Optimization: Estimation, Simulation, and
Control, John Wiley, Hoboken, NJ, 2003.
[28] V. Strassen, The existence of probability measures with given marginals, Ann. Math. Statist.,
36 (1965), pp. 423–439.
[29] B. Verweij, S. Ahmed, A.J. Kleywegt, G. Nemhauser, and A. Shapiro, The sample aver-
age approximation method applied to stochastic routing problems: A computational study,
Comput. Optim. Appl., 24 (2003), pp. 289–333.
