ROBUST STOCHASTIC APPROXIMATION APPROACH TO STOCHASTIC PROGRAMMING
Abstract. In this paper we consider optimization problems where the objective function is given
in the form of an expectation. A basic difficulty of solving such stochastic optimization problems is
that the involved multidimensional integrals (expectations) cannot be computed with high accuracy.
The aim of this paper is to compare two computational approaches based on Monte Carlo sampling
techniques, namely, the stochastic approximation (SA) and the sample average approximation (SAA)
methods. Both approaches, the SA and SAA methods, have a long history. Current opinion is that
the SAA method can efficiently use a specific (say, linear) structure of the considered problem, while
the SA approach is a crude subgradient method, which often performs poorly in practice. We intend
to demonstrate that a properly modified SA approach can be competitive and even significantly
outperform the SAA method for a certain class of convex stochastic problems. We extend the
analysis to the case of convex-concave stochastic saddle point problems and present (in our opinion
highly encouraging) results of numerical experiments.
Key words. stochastic approximation, sample average approximation method, stochastic programming, Monte Carlo sampling, complexity, saddle point, minimax problems, mirror descent algorithm
DOI. 10.1137/070704277
We first deal with minimizing a convex expected value function, and then with an extension of the analysis to stochastic saddle point problems. Specifically, we consider the optimization problem

(1.1) min_{x∈X} { f(x) = E[F(x, ξ)] }.
Here X ⊂ Rn is a nonempty bounded closed convex set, ξ is a random vector whose
probability distribution P is supported on set Ξ ⊂ Rd and F : X × Ξ → R. We
assume that the expectation
(1.2) E[F(x, ξ)] = ∫_Ξ F(x, ξ) dP(ξ)
is well defined and finite valued for every x ∈ X. Moreover, we assume that the
expected value function f (·) is continuous and convex on X. Of course, if for every
ξ ∈ Ξ the function F (·, ξ) is convex on X, then it follows that f (·) is convex. With
these assumptions, (1.1) becomes a convex programming problem.
A basic difficulty of solving stochastic optimization problem (1.1) is that the multidimensional integral (expectation) (1.2) cannot be computed with high accuracy for dimension d, say, greater than five. The aim of this paper is to compare two computational approaches based on Monte Carlo sampling techniques, namely, the stochastic approximation (SA) and the sample average approximation (SAA) methods.
1 Throughout the paper, we speak about convergence in terms of the objective value.
Note that the SAA method is not an algorithm; the obtained SAA problem (1.4) still
has to be solved by an appropriate numerical procedure. Recent theoretical studies
(cf. [11, 25, 26]) and numerical experiments (see, e.g., [12, 13, 29]) show that the SAA
method coupled with a good (deterministic) algorithm could be reasonably efficient
for solving certain classes of two-stage stochastic programming problems. On the
other hand, classical SA-type numerical procedures typically performed poorly for
such problems.
We intend to demonstrate in this paper that a properly modified SA approach can
be competitive and even significantly outperform the SAA method for a certain class
of stochastic problems. The mirror descent SA method we propose here is a direct
descendent of the stochastic mirror descent method of Nemirovski and Yudin [16].
However, the method developed in this paper is more flexible than its “ancestor”:
the iteration of the method is exactly the prox-step for a chosen prox-function, and
the choice of prox-type function is not limited to the norm-type distance-generating
functions. Closely related techniques, based on subgradient averaging, have been proposed
in Nesterov [17] and used in [10] to solve the stochastic optimization problem (1.1).
Moreover, the results on large deviations of solutions and applications of the mirror
descent SA to saddle point problems, to the best of our knowledge, are new.
The rest of this paper is organized as follows. In section 2 we focus on the theory of the SA method applied to (1.1). We start by outlining the relevant-to-our-goals part of the classical "O(t⁻¹)" SA theory (section 2.1), along with its "O(t^{−1/2})" modifications (section 2.2). Well-known and simple results presented in these sections
pave the road to our main developments carried out in section 2.3. In section 3
we extend the constructions and results of section 2.3 to the case of the convex-
concave stochastic saddle point problem. In concluding section 4 we present results
(in our opinion, highly encouraging) of numerical experiments with the SA algorithm
(sections 2.3 and 3) applied to large-scale stochastic convex minimization and saddle
point problems. Section 5 gives a short conclusion for the presented results. Finally,
some technical proofs are given in the appendix.
Throughout the paper, we use the following notation. By ‖x‖_p we denote the ℓ_p norm of a vector x ∈ Rⁿ; in particular, ‖x‖₂ = √(xᵀx) denotes the Euclidean norm, and ‖x‖_∞ = max{|x₁|, . . . , |x_n|}. By Π_X we denote the metric projection operator onto the set X, that is, Π_X(x) = argmin_{x′∈X} ‖x − x′‖₂. Note that Π_X is a nonexpanding operator, i.e.,

‖Π_X(x′) − Π_X(x)‖₂ ≤ ‖x′ − x‖₂ ∀x′, x ∈ Rⁿ.

By O(1) we denote positive absolute constants. The notation ⌊a⌋ stands for the largest integer less than or equal to a ∈ R, and ⌈a⌉ for the smallest integer greater than or equal to a ∈ R. By ξ_[t] = (ξ₁, . . . , ξ_t) we denote the history of the process ξ₁, ξ₂, . . . up to time t. Unless stated otherwise, all relations between random variables are supposed to hold almost surely.
2. Stochastic approximation, basic theory. In this section we discuss theory
and implementations of the SA approach to the minimization problem (1.1).
2.1. Classical SA algorithm. The classical SA algorithm solves (1.1) by mimicking the simplest subgradient descent method. That is, for chosen x₁ ∈ X and a sequence γ_j > 0, j = 1, 2, . . . , of stepsizes, it generates the iterates by the formula

(2.1) x_{j+1} = Π_X(x_j − γ_j G(x_j, ξ_j)),

where G(x, ξ) is a stochastic subgradient, i.e., g(x) = E[G(x, ξ)] is well defined and g(x) ∈ ∂f(x) for every x ∈ X.
Of course, the crucial question of that approach is how to choose the stepsizes γj . Let
x∗ be an optimal solution of (1.1). Note that since the set X is compact and f (x) is
continuous, (1.1) has an optimal solution. Note also that the iterate xj = xj (ξ[j−1] )
is a function of the history ξ[j−1] = (ξ1 , . . . , ξj−1 ) of the generated random process
and hence is random.
Denote

(2.2) A_j = ½‖x_j − x*‖₂² and a_j = E[A_j] = ½E[‖x_j − x*‖₂²].

By using (2.1) and the fact that Π_X is nonexpanding, we can write

(2.3) A_{j+1} ≤ ½‖x_j − γ_j G(x_j, ξ_j) − x*‖₂² = A_j − γ_j (x_j − x*)ᵀG(x_j, ξ_j) + ½γ_j²‖G(x_j, ξ_j)‖₂².

Since x_j is a deterministic function of ξ_[j−1] and hence is independent of ξ_j, we have

(2.4) E[(x_j − x*)ᵀG(x_j, ξ_j)] = E[(x_j − x*)ᵀg(x_j)].

Assume that there is a positive number M such that

(2.5) E[‖G(x, ξ)‖₂²] ≤ M² ∀x ∈ X.

Then, by taking expectation of both sides of (2.3) and using (2.4), we obtain

(2.6) a_{j+1} ≤ a_j − γ_j E[(x_j − x*)ᵀg(x_j)] + ½γ_j²M².
Suppose further that the expectation function f(x) is differentiable and strongly convex on X, i.e., there is a constant c > 0 such that

f(x′) ≥ f(x) + (x′ − x)ᵀ∇f(x) + ½c‖x′ − x‖₂² ∀x′, x ∈ X,

or, equivalently, that

(2.7) (x′ − x)ᵀ(∇f(x′) − ∇f(x)) ≥ c‖x′ − x‖₂² ∀x′, x ∈ X.
Note that strong convexity of f(x) implies that the minimizer x* is unique. By optimality of x*, we have that

(x − x*)ᵀ∇f(x*) ≥ 0 ∀x ∈ X,

which together with (2.7) implies that (x − x*)ᵀ∇f(x) ≥ c‖x − x*‖₂². In turn, it follows that (x − x*)ᵀg ≥ c‖x − x*‖₂² for all x ∈ X and g ∈ ∂f(x), and hence, by (2.6),

(2.8) a_{j+1} ≤ (1 − 2cγ_j)a_j + ½γ_j²M².

Let us take stepsizes γ_j = θ/j for some constant θ > 1/(2c). Then, by (2.8), we have

(2.9) a_{j+1} ≤ (1 − 2cθ/j)a_j + ½θ²M²/j²,

and it follows by induction that E[‖x_j − x*‖₂²] = 2a_j ≤ Q(θ)/j for j = 1, 2, . . . , where

(2.10) Q(θ) = max{ θ²M²(2cθ − 1)⁻¹, ‖x₁ − x*‖₂² }.
An overestimate of c (i.e., a too small stepsize factor θ) can, however, be disastrous. Consider, e.g., f(x) = x²/10 on X = [−1, 1], so that c = 0.2, with x₁ = 1 and no noise, and take θ = 1, i.e., γ_j = 1/j. Then x_{j+1} = (1 − 1/(5j))x_j, whence x_j ≈ j^{−1/5}. That is, the convergence is extremely slow. For example, for j = 10⁹, the error of the iterated solution is greater than 0.015. On the other hand, for the optimal stepsize factor of θ = 1/c = 5, the optimal solution x* = 0 is found in one iteration.
It could be added that the stepsizes γ_j = θ/j may become completely unacceptable when f loses strong convexity. For example, when f(x) = x⁴, X = [−1, 1], and there is no noise, these stepsizes result in a disastrously slow convergence: |x_j| ≥ O([ln(j+1)]^{−1/2}). The precise statement here is that with γ_j = θ/j and 0 < x₁ ≤ 1/(6√θ), we have

x_j ≥ x₁ / √( 1 + 32θx₁²[1 + ln(j+1)] ) for j = 1, 2, . . . .
We see that in order to make the SA “robust”—applicable to general convex ob-
jectives rather than to strongly convex ones—one should replace the classical stepsizes
γj = O(j −1 ), which can be too small to ensure a reasonable rate of convergence even
in the “no noise” case, with “much larger” stepsizes. At the same time, a detailed
analysis shows that “large” stepsizes poorly suppress noise. As early as in [15] it
was realized that in order to resolve the arising difficulty, it makes sense to separate
collecting information on the objective from generating approximate solutions. Specif-
ically, we can use large stepsizes, say, γj = O(j −1/2 ) in (2.1), thus avoiding too slow
motion at the cost of making the trajectory “more noisy.” In order to suppress, to
some extent, this noisiness, we take, as approximate solutions, appropriate averages
of the search points xj rather than these points themselves.
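To make the construction concrete, here is a minimal Python sketch of the resulting scheme (projected stochastic subgradient steps with iterate averaging and the constant stepsize discussed in the next section); the names proj and grad_oracle are placeholders for the problem-specific projection Π_X and stochastic oracle G, not part of the paper:

    import numpy as np

    def robust_sa(proj, grad_oracle, x1, D_X, M, N, rng):
        # Euclidean robust SA: recurrence (2.1) with the constant stepsize
        # gamma = D_X / (M sqrt(N)) and uniform averaging of the iterates.
        gamma = D_X / (M * np.sqrt(N))
        x = np.array(x1, dtype=float)
        x_avg = np.zeros_like(x)
        for _ in range(N):
            G = grad_oracle(x, rng)   # stochastic subgradient G(x, xi)
            x = proj(x - gamma * G)   # prox-step: projection onto X
            x_avg += x / N            # average of the search points
        return x_avg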
2.2. Robust SA approach. Results of this section go back to Nemirovski and
Yudin [15, 16]. Let us look again at the basic relations (2.2), (2.5), and (2.6). By convexity of f(x), we have that f(x) ≥ f(x_t) + (x − x_t)ᵀg(x_t) for any x ∈ X, and hence E[(x_t − x*)ᵀg(x_t)] ≥ E[f(x_t) − f(x*)]. Together with (2.6), this implies

γ_t E[f(x_t) − f(x*)] ≤ a_t − a_{t+1} + ½γ_t²M².

Summing up over t = i, . . . , j, we obtain

(2.14) Σ_{t=i}^{j} γ_t E[f(x_t) − f(x*)] ≤ Σ_{t=i}^{j} [a_t − a_{t+1}] + ½M² Σ_{t=i}^{j} γ_t² ≤ a_i + ½M² Σ_{t=i}^{j} γ_t²,

and hence, setting ν_t = γ_t / Σ_{τ=i}^{j} γ_τ,

(2.15) E[ Σ_{t=i}^{j} ν_t f(x_t) − f(x*) ] ≤ [ a_i + ½M² Σ_{t=i}^{j} γ_t² ] / Σ_{t=i}^{j} γ_t.
j
Note that νt ≥ 0 and t=i νt = 1. Consider the points
j
(2.16) x̃ji = νt xt ,
t=i
and let
(2.17) DX = max x − x1 2 .
x∈X
By convexity of f we have f(x̃_i^j) ≤ Σ_{t=i}^{j} ν_t f(x_t), while a₁ ≤ ½D_X² and a_i ≤ 2D_X² for every i. Thus, (2.15) implies that

(2.18)
(a) E[f(x̃₁^j) − f(x*)] ≤ [ D_X² + M² Σ_{t=1}^{j} γ_t² ] / [ 2 Σ_{t=1}^{j} γ_t ] for 1 ≤ j,
(b) E[f(x̃_i^j) − f(x*)] ≤ [ 4D_X² + M² Σ_{t=i}^{j} γ_t² ] / [ 2 Σ_{t=i}^{j} γ_t ] for 1 < i ≤ j.
Based on the resulting bounds on the expected inaccuracy of approximate solutions x̃ji ,
we can now develop “reasonable” stepsize policies along with the associated efficiency
estimates.
Constant stepsizes and basic efficiency estimate. Assume that the number N of
iterations of the method is fixed in advance and that γt = γ, t = 1, . . . , N . Then it
follows by (2.18(a)) that
(2.19) E[f(x̃₁^N) − f(x*)] ≤ [ D_X² + M²Nγ² ] / (2Nγ).
Minimizing the right-hand side of (2.19) over γ > 0, we arrive at the constant stepsize
policy
(2.20) γ_t = D_X / (M√N), t = 1, . . . , N,

along with the associated efficiency estimate

(2.21) E[f(x̃₁^N) − f(x*)] ≤ D_X M / √N.
With the constant stepsize policy (2.20), we also have, for 1 ≤ K ≤ N ,
(2.22) E[f(x̃_K^N) − f(x*)] ≤ (D_X M/√N) [ 2N/(N − K + 1) + 1/2 ].
When K/N ≤ 1/2, the right-hand side of (2.22) coincides, within an absolute constant
factor, with the right-hand side of (2.21). Finally, for a constant θ > 0, passing from
the stepsizes (2.20) to the stepsizes
(2.23) γ_t = θD_X / (M√N), t = 1, . . . , N,

the efficiency estimate becomes

(2.24) E[f(x̃_K^N) − f(x*)] ≤ max{θ, θ⁻¹} (D_X M/√N) [ 2N/(N − K + 1) + 1/2 ], 1 ≤ K ≤ N.
Discussion. We conclude that the expected error, in terms of the objective, of the robust SA algorithm (2.1), (2.16) with constant stepsize policy (2.20) after N iterations is of order O(N^{−1/2}) in our setting. Of course, this is worse than the rate O(N⁻¹) for
the classical SA algorithm as applied to a smooth strongly convex function attaining
minimum at a point from the interior of the set X. However, the error bounds (2.21)
and (2.22) are guaranteed independently of any smoothness and/or strong convexity
assumptions on f . All that matters is the convexity of f on the convex compact set X
and the validity of (2.5). Moreover, scaling the stepsizes by a positive constant θ affects the error bound (2.24) only linearly in max{θ, θ⁻¹}. This can be compared with the possibly disastrous effect of such scaling in the classical SA algorithm discussed in section 2.1.
These observations, in particular the fact that there is no necessity in “fine tuning”
the stepsizes to the objective function f , explain the adjective “robust” in the name
of the method. Finally, it can be shown that without additional, as compared to
convexity and (2.5), assumptions on f , the accuracy bound (2.21) within an absolute
constant factor is the best one allowed by statistics (cf. [16]).
Varying stepsizes. When the number of steps is not fixed in advance, it makes
sense to replace constant stepsizes with the stepsizes
(2.25) γ_t = θD_X / (M√t), t = 1, 2, . . . .

From (2.18(b)) it follows that with this stepsize policy one has, for 1 ≤ K ≤ N,

(2.26) E[f(x̃_K^N) − f(x*)] ≤ (D_X M/√N) [ (2/θ) N/(N − K + 1) + (θ/2) N/K ].
2.3. Mirror descent SA method. Let ‖·‖ be a (general) norm on Rⁿ and ‖·‖* its dual norm. A distance-generating function is a convex continuous function ω : X → R such that the set

(2.28) X^o = {x ∈ X : ∂ω(x) ≠ ∅}

is convex (note that X^o always contains the relative interior of X) and, restricted to X^o, ω is continuously differentiable and strongly convex with parameter α with respect to ‖·‖, i.e.,

(2.29) (x′ − x)ᵀ(∇ω(x′) − ∇ω(x)) ≥ α‖x′ − x‖² ∀x′, x ∈ X^o.

With ω we associate the prox-function

(2.30) V(x, z) = ω(z) − [ω(x) + ∇ω(x)ᵀ(z − x)], x ∈ X^o, z ∈ X,

and the prox-mapping P_x : Rⁿ → X^o,

(2.31) P_x(y) = argmin_{z∈X} { yᵀ(z − x) + V(x, z) }.

Observe that the minimum in the right-hand side of (2.31) is attained since ω is continuous on X and X is compact, and all the minimizers belong to X^o, whence the minimizer is unique, since V(x, ·) is strongly convex on X^o. Thus, the prox-mapping is well defined.
For ω(x) = ½‖x‖₂², we have P_x(y) = Π_X(x − y), so that (2.1) is the recurrence

(2.32) x_{j+1} = P_{x_j}(γ_j G(x_j, ξ_j)).

Our goal is to demonstrate that the main properties of the recurrence (2.1) (which from now on we call the E-SA recurrence, "E" for Euclidean) are inherited by (2.32), whatever the underlying distance-generating function ω(x).
The statement of the following lemma is a simple consequence of the optimal-
ity conditions of the right-hand side of (2.31) (proof of this lemma is given in the
appendix).
Lemma 2.1. For every u ∈ X, x ∈ X o , and y ∈ Rn , one has
(2.33) V(P_x(y), u) ≤ V(x, u) + yᵀ(u − x) + ‖y‖*²/(2α).
Using (2.33) with x = xj , y = γj G(xj , ξj ), and u = x∗ , we get
(2.34) γ_j (x_j − x*)ᵀG(x_j, ξ_j) ≤ V(x_j, x*) − V(x_{j+1}, x*) + (γ_j²/(2α)) ‖G(x_j, ξ_j)‖*².
Note that with ω(x) = ½‖x‖₂², one has V(x, z) = ½‖x − z‖₂², α = 1, and ‖·‖* = ‖·‖₂.
That is, (2.34) becomes nothing but the relation (2.6), which played a crucial role
in all the developments related to the E-SA method. We are about to process, in
a completely similar fashion, the relation (2.34) in the case of a general distance-
generating function, thus arriving at the mirror descent SA. Specifically, setting

(2.35) Δ_t = G(x_t, ξ_t) − g(x_t),

we can rewrite (2.34) (with j renamed to t) as

(2.36) γ_t (x_t − x*)ᵀg(x_t) ≤ V(x_t, x*) − V(x_{t+1}, x*) − γ_t Δ_tᵀ(x_t − x*) + (γ_t²/(2α)) ‖G(x_t, ξ_t)‖*².
Summing up (2.36) over t = 1, . . . , j, taking into account that V(x_{j+1}, x*) ≥ 0 and that, by convexity, (x_t − x*)ᵀg(x_t) ≥ f(x_t) − f(x*), and defining, as in (2.15), the weights ν_t = γ_t / Σ_{τ=1}^{j} γ_τ and the averaged point

(2.38) x̃₁^j = Σ_{t=1}^{j} ν_t x_t,

we obtain

(2.39) Σ_{t=1}^{j} γ_t [f(x_t) − f(x*)] ≤ V(x₁, x*) + Σ_{t=1}^{j} [ (γ_t²/(2α)) ‖G(x_t, ξ_t)‖*² − γ_t Δ_tᵀ(x_t − x*) ].
Let us suppose, as in the previous section (cf. (2.5)), that we are given a positive
number M∗ such that
(2.40) E[‖G(x, ξ)‖*²] ≤ M*² ∀x ∈ X.
Taking expectations of both sides of (2.39) and noting that (i) xt is a deterministic
function of ξ[t−1] = (ξ1 , . . . , ξt−1 ), (ii) conditional on ξ[t−1] , the expectation of Δt is
0, and (iii) the expectation of G(xt , ξt )2∗ does not exceed M∗2 , we obtain
(2.41) E[f(x̃₁^j) − f(x*)] ≤ [ max_{u∈X} V(x₁, u) + (2α)⁻¹ M*² Σ_{t=1}^{j} γ_t² ] / Σ_{t=1}^{j} γ_t.
Assume from now on that the method starts with the minimizer of ω:
x₁ = argmin_{x∈X} ω(x).
Then, from (2.30), it follows that
(2.42) max_{z∈X} V(x₁, z) ≤ D²_{ω,X},

where

(2.43) D_{ω,X} = [ max_{z∈X} ω(z) − min_{z∈X} ω(z) ]^{1/2}.
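For instance, for the entropy function ω(x) = Σ_i x_i ln x_i on the standard simplex (the ℓ₁-setup considered below), (2.43) can be evaluated in closed form; a quick check:

    \max_{z\in X}\omega(z) = 0 \ \text{(attained at the vertices)}, \qquad
    \min_{z\in X}\omega(z) = -\ln n \ \text{(attained at } z = n^{-1}(1,\dots,1)^T\text{)},

so that D_{ω,X} = √(0 − (−ln n)) = √(ln n).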
Constant stepsize policy. Assuming that the total number of steps N is given in advance and γ_t = γ, t = 1, . . . , N, the bound (2.41) combined with (2.42) becomes

(2.44) E[f(x̃₁^N) − f(x*)] ≤ [ D²_{ω,X} + (2α)⁻¹ M*² N γ² ] / (Nγ).

Optimizing the right-hand side of (2.44) over γ > 0, we arrive at the constant stepsize policy

(2.45) γ_t = √(2α) D_{ω,X} / (M* √N), t = 1, . . . , N,

and the associated efficiency estimate

(2.46) E[f(x̃₁^N) − f(x*)] ≤ D_{ω,X} M* √(2/(αN))
(cf. (2.20), (2.21)). For a constant θ > 0, passing from the stepsizes (2.45) to the
stepsizes
(2.47) γ_t = θ√(2α) D_{ω,X} / (M* √N), t = 1, . . . , N,

the efficiency estimate becomes

(2.48) E[f(x̃₁^N) − f(x*)] ≤ max{θ, θ⁻¹} D_{ω,X} M* √(2/(αN)).
We refer to the method (2.32), (2.38), and (2.47) as the (robust) mirror descent SA
algorithm with constant stepsize policy.
Probabilities of large deviations. So far, all our efficiency estimates were upper
bounds on the expected nonoptimality, in terms of the objective, of approximate
solutions generated by the algorithms. Here we complement these results with bounds
on probabilities of large deviations. Observe that by Markov inequality, (2.48) implies
that
(2.49) Prob{ f(x̃₁^N) − f(x*) > ε } ≤ √2 max{θ, θ⁻¹} D_{ω,X} M* / (ε √(αN)) ∀ε > 0.
It is possible, however, to obtain much finer bounds on deviation probabilities when
imposing more restrictive assumptions on the distribution of G(x, ξ). Specifically,
assume that
(2.50) E[ exp{ ‖G(x, ξ)‖*² / M*² } ] ≤ exp{1} ∀x ∈ X.
Note that condition (2.50) is stronger than (2.40). Indeed, if a random variable Y sat-
isfies E[exp{Y /a}] ≤ exp{1} for some a > 0, then by Jensen inequality, exp{E[Y /a]} ≤
E[exp{Y /a}] ≤ exp{1}, and therefore, E[Y ] ≤ a. Of course, condition (2.50) holds if
G(x, ξ)∗ ≤ M∗ for all (x, ξ) ∈ X × Ξ.
Proposition 2.2. In the case of (2.50) and for the constant stepsizes (2.47), the
following holds for any Ω ≥ 1:
(2.51) Prob{ f(x̃₁^N) − f(x*) > √2 max{θ, θ⁻¹} M* D_{ω,X} (12 + 2Ω) / √(αN) } ≤ 2 exp{−Ω}.
Varying stepsizes. Same as in the case of E-SA, we can modify the mirror descent
SA algorithm to allow for time-varying stepsizes and “sliding averages” of the search
points xt in the role of approximate solutions, thus getting rid of the necessity to fix
in advance the number of steps. Specifically, consider
(2.52) D_{ω,X} = [ 2 sup_{x∈X^o, z∈X} ( ω(z) − ω(x) − (z − x)ᵀ∇ω(x) ) ]^{1/2} = sup_{x∈X^o, z∈X} √(2V(x, z)),
and assume that D_{ω,X} is finite. This is definitely so when ω is continuously differentiable on the entire X. Note that for the E-SA, that is, with ω(x) = ½‖x‖₂², D_{ω,X} is the Euclidean diameter of X.
In the case of (2.52), setting
(2.53) x̃_i^j = Σ_{t=i}^{j} γ_t x_t / Σ_{t=i}^{j} γ_t,
one can repeat the reasoning that led to (2.18(b)). Noting that V(x_K, x*) ≤ ½D²_{ω,X} and taking expectations, we arrive at

(2.54) E[f(x̃_K^N) − f(x*)] ≤ [ ½D²_{ω,X} + (2α)⁻¹ M*² Σ_{t=K}^{N} γ_t² ] / Σ_{t=K}^{N} γ_t
(cf. (2.26)). With the stepsizes

(2.55) γ_t = θ√α D_{ω,X} / (M* √t), t = 1, 2, . . . ,

we get, in particular, with K = ⌈rN⌉ for a fixed r ∈ (0, 1), the efficiency estimate

(2.57) E[f(x̃_K^N) − f(x*)] ≤ C(r) max{θ, θ⁻¹} D_{ω,X} M* / (√α √N),

where the constant C(r) depends solely on r. The advantage of the mirror descent SA over its Euclidean counterpart lies in the
potential possibility to reduce the constant factor hidden in O(·) by adjusting the norm ‖·‖ and the distance-generating function ω(·) to the geometry of the problem.
Example. Let X = {x ∈ Rⁿ : Σ_{i=1}^{n} x_i = 1, x ≥ 0} be a standard simplex.
Consider two setups for the mirror descent SA:
— Euclidean setup, where ‖·‖ = ‖·‖₂ and ω(x) = ½‖x‖₂², and
— ℓ₁-setup, where ‖·‖ = ‖·‖₁, with ‖·‖* = ‖·‖∞, and ω is the entropy function

(2.58) ω(x) = Σ_{i=1}^{n} x_i ln x_i.
The Euclidean setup leads to the Euclidean robust SA, which is easily implementable (computing the prox-mapping requires O(n ln n) operations) and guarantees that

(2.59) E[f(x̃₁^N) − f(x*)] ≤ O(1) max{θ, θ⁻¹} M N^{−1/2},

with M² = sup_{x∈X} E[‖G(x, ξ)‖₂²], provided that the constant M is known and the stepsizes (2.23) are used (see (2.24), (2.17), and note that the Euclidean diameter of X is √2).
The ℓ₁-setup corresponds to X^o = {x ∈ X : x > 0}, D_{ω,X} = √(ln n), α = 1, and x₁ = argmin_X ω = n⁻¹(1, . . . , 1)ᵀ (see appendix). The associated mirror descent SA is easily implementable: the prox-function here is

V(x, z) = Σ_{i=1}^{n} z_i ln(z_i/x_i),

and the prox-mapping is given explicitly by

[P_x(y)]_i = x_i e^{−y_i} / Σ_{k=1}^{n} x_k e^{−y_k}, i = 1, . . . , n.
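In code, the ℓ₁-setup prox-step is a single multiplicative update followed by normalization; a minimal sketch (computed in the log domain to avoid overflow; this is an illustration, not the authors' implementation):

    import numpy as np

    def entropy_prox(x, y):
        # [P_x(y)]_i = x_i exp(-y_i) / sum_k x_k exp(-y_k);
        # assumes x > 0, i.e., x lies in X^o
        z = np.log(x) - y
        z -= z.max()          # shift: invariant under normalization
        w = np.exp(z)
        return w / w.sum()

    # one mirror descent SA step: x <- entropy_prox(x, gamma * G(x, xi))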
The associated efficiency estimate is

(2.60) E[f(x̃₁^N) − f(x*)] ≤ O(1) max{θ, θ⁻¹} √(ln n) M* N^{−1/2},

with M*² = sup_{x∈X} E[‖G(x, ξ)‖∞²], provided that the constant M* is known and the constant stepsizes (2.47) are used (see (2.48) and (2.40)). To compare (2.60) and (2.59), observe that M* ≤ M, and the ratio M*/M can be as small as n^{−1/2}. Thus, the efficiency estimate for the ℓ₁-setup is never much worse than the estimate for the Euclidean setup, and for large n it can be far better than the latter estimate:

1/√(ln n) ≤ M / (√(ln n) M*) ≤ √(n/ln n), N = 1, 2, . . . ,
both the upper and the lower bounds being achievable. Thus, when X is a standard
simplex of large dimension, we have strong reasons to prefer the 1 -setup to the usual
Euclidean one.
Ignoring factors logarithmic in n, the second estimate (2.63) can be much better than the first estimate (2.62) only when Diam_{‖·‖₂}(X) ≪ Diam_{‖·‖₁}(X), as is the case, e.g., when X is a Euclidean ball. On the other hand, when X is an ‖·‖₁-ball or its nonnegative part (which is the simplex), so that the ‖·‖₁- and ‖·‖₂-diameters of X are of the same order, the first estimate (2.62) is much more attractive than the estimate (2.63) due to the potentially much smaller constant M*.
Comparison with the SAA approach. We compare now the theoretical complexity estimates for the robust mirror descent SA and the SAA methods. Consider the case when (i) X ⊂ Rⁿ is contained in the ‖·‖_p-ball of radius R, p = 1, 2, and the SA in question is either the E-SA (p = 2) or the SA associated with ‖·‖₁ and the distance-generating function² (2.61), (ii) in the SA, the constant stepsize rule (2.45) is used, and (iii) the "light tail" assumption (2.50) takes place.
Given ε > 0 and δ ∈ (0, 1/2), let us compare the number of steps N = N_SA of the SA which, with probability ≥ 1 − δ, results in an approximate solution x̃₁^N such that f(x̃₁^N) − f(x*) ≤ ε, with the sample size N = N_SAA for the SAA resulting in the same accuracy guarantees. According to Proposition 2.2, we have that Prob{ f(x̃₁^N) − f(x*) > ε } ≤ δ for

N_SA = O(1) D²_{ω,X} M*² ln²(1/δ) / (α ε²),
where M* is the constant from (2.50) and D_{ω,X} is defined in (2.43). Note that the constant M* depends on the chosen norm, while D²_{ω,X} = O(1)R² for p = 2 and D²_{ω,X} = O(1) ln(n) R² for p = 1.
This can be compared with the estimate of the sample size (cf. [25, 26])
²In the second case, we apply the SA after the variables are scaled to make X the unit ‖·‖₁-ball.
We see that both the SA and SAA methods have complexity logarithmic in δ and quadratic (or nearly so) in 1/ε in terms of the corresponding sample sizes. It should be noted, however, that the SAA method requires the solution of the corresponding (deterministic) sample average problem, while the SA approach is based on simple calculations, as long as stochastic subgradients can be computed cheaply.
3. Stochastic saddle point problem. We show in this section how the mirror
descent SA algorithm can be modified to solve a convex-concave stochastic saddle
point problem. Consider the following minimax (saddle point) problem:
(3.1) min_{x∈X} max_{y∈Y} { φ(x, y) = E[Φ(x, y, ξ)] }.

Here X ⊂ Rⁿ and Y ⊂ Rᵐ are nonempty bounded closed convex sets, and we assume that φ(x, y) is convex in x ∈ X and concave in y ∈ Y, so that the primal problem min_{x∈X} max_{y∈Y} φ(x, y) and the dual problem max_{y∈Y} min_{x∈X} φ(x, y), respectively, are solvable with equal optimal values, denoted φ*, and pairs (x*, y*) of optimal solutions to the respective problems form the set of saddle points of φ(x, y) on X × Y.
As in the case of the minimization problem (1.1), we assume that neither the func-
tion φ(x, y) nor its sub/supergradients in x and y are available explicitly. However,
we make the following assumption.
(A 2) We have at our disposal an oracle which, given an input point (x, y, ξ) ∈ X × Y × Ξ, returns a stochastic subgradient, that is, an (n + m)-dimensional vector

G(x, y, ξ) = [ G_x(x, y, ξ) ; −G_y(x, y, ξ) ]

such that the vector

g(x, y) = [ g_x(x, y) ; −g_y(x, y) ] = [ E[G_x(x, y, ξ)] ; −E[G_y(x, y, ξ)] ]

is well defined, g_x(x, y) ∈ ∂_x φ(x, y), and −g_y(x, y) ∈ ∂_y(−φ(x, y)).
For example, if for every ξ ∈ Ξ the function Φ(·, ·, ξ) is convex-concave and the respective subdifferential and integral operators are interchangeable, we ensure (A 2) by setting

G(x, y, ξ) = [ G_x(x, y, ξ) ; −G_y(x, y, ξ) ] ∈ [ ∂_x Φ(x, y, ξ) ; ∂_y(−Φ(x, y, ξ)) ].
Let ‖·‖_x be a norm on Rⁿ and ‖·‖_y be a norm on Rᵐ, and let ‖·‖_{*,x} and ‖·‖_{*,y} stand for the corresponding dual norms. As in section 2.1, the basic assumption we make about the stochastic oracle (aside from its unbiasedness, which we have already postulated) is that we know positive constants M²_{*,x} and M²_{*,y} such that

(3.2) E[‖G_x(u, v, ξ)‖²_{*,x}] ≤ M²_{*,x} and E[‖G_y(u, v, ξ)‖²_{*,y}] ≤ M²_{*,y} ∀(u, v) ∈ X × Y.
Let ω_x and ω_y be distance-generating functions for X and Y, with respective parameters α_x and α_y, and let D_{ω_x,X} and D_{ω_y,Y} be defined as in (2.43); set

(3.5) M*² = (2D²_{ω_x,X}/α_x) M²_{*,x} + (2D²_{ω_y,Y}/α_y) M²_{*,y}.

We use the notation z = (x, y) and equip the set Z = X × Y with the distance-generating function

ω(z) = ω_x(x) / (2D²_{ω_x,X}) + ω_y(y) / (2D²_{ω_y,Y}).
With the prox-mapping P_z(·) associated with this ω and Z, the iterates and approximate solutions are

(3.7) z_{t+1} = P_{z_t}(γ_t G(z_t, ξ_t)),
(3.8) z̃₁^j = [ Σ_{t=1}^{j} γ_t ]⁻¹ Σ_{t=1}^{j} γ_t z_t.

We refer to the procedure (3.7), (3.8) as the saddle point mirror SA algorithm.
Let us analyze the convergence properties of the algorithm. We measure the quality of an approximate solution z̃ = (x̃, ỹ) by the error

ε_φ(z̃) = [ max_{y∈Y} φ(x̃, y) − φ* ] + [ φ* − min_{x∈X} φ(x, ỹ) ] = max_{y∈Y} φ(x̃, y) − min_{x∈X} φ(x, ỹ).

It can be verified, exactly as in section 2, that

(3.9) ε_φ(z̃₁^j) ≤ [ Σ_{t=1}^{j} γ_t ]⁻¹ max_{z∈Z} Σ_{t=1}^{j} γ_t (z_t − z)ᵀ g(z_t).
To bound the right-hand side of (3.9), we use the result of the following lemma (its
proof is given in the appendix).
Lemma 3.1. In the above setting, for any j ≥ 1, the following inequality holds:
(3.10) E[ max_{z∈Z} Σ_{t=1}^{j} γ_t (z_t − z)ᵀ g(z_t) ] ≤ 2 + (5/2) M*² Σ_{t=1}^{j} γ_t².
Now to get an error bound for the solution z̃₁^j, it suffices to substitute inequality (3.10) into (3.9) to obtain

(3.11) E[ε_φ(z̃₁^j)] ≤ [ Σ_{t=1}^{j} γ_t ]⁻¹ [ 2 + (5/2) M*² Σ_{t=1}^{j} γ_t² ].
Constant stepsizes and basic efficiency estimates. For a fixed number of steps N, with the constant stepsize policy

(3.12) γ_t = 2θ / (M* √(5N)), t = 1, . . . , N,

(3.11) yields the efficiency estimate

(3.13) E[ε_φ(z̃₁^N)] ≤ 2 max{θ, θ⁻¹} M* √(5/N).

When the quantity D_{ω,Z}, defined as in (2.52) with Z substituted for X,
is finite, we can pass from constant stepsizes on a fixed “time horizon” to decreasing
stepsize policy
γ_t = θD_{ω,Z} / (M* √t), t = 1, 2, . . .
(compare with (2.55) and take into account that we are in the situation of α = 1),
and from the averaging of all iterates to the "sliding averaging"

z̃_i^j = Σ_{t=i}^{j} γ_t z_t / Σ_{t=i}^{j} γ_t,

with the corresponding efficiency estimates similar to (2.54) and (2.57).
Proposition 3.2. In the case of (3.16), with the stepsizes given by (3.12) and (3.6), one has, for any Ω > 1,

(3.17) Prob{ ε_φ(z̃₁^N) > (8 + 2Ω) max{θ, θ⁻¹} √5 M* / √N } ≤ 2 exp{−Ω}.
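In code, one pass of the saddle point mirror SA (3.7), (3.8) with constant stepsizes is only a few lines; a minimal Python sketch, where prox_z and oracle are placeholders for the prox-mapping of ω on Z and the stochastic oracle of (A 2):

    import numpy as np

    def saddle_mirror_sa(prox_z, oracle, z1, gamma, N, rng):
        # prox-steps (3.7) in z = (x, y) driven by the stacked stochastic
        # gradient (G_x, -G_y), followed by the averaging (3.8)
        z = np.array(z1, dtype=float)
        z_sum = np.zeros_like(z)
        for _ in range(N):
            g = oracle(z, rng)        # stacked vector (G_x, -G_y)
            z = prox_z(z, gamma * g)  # z_{t+1} = P_{z_t}(gamma G(z_t, xi_t))
            z_sum += z
        return z_sum / N              # constant stepsizes: plain average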
and

(3.21) g(x, y) = E[G(x, y, ξ)] = [ Σ_{i=1}^{m} y_i g_i(x) ; −(f₁(x), . . . , f_m(x)) ] ∈ [ ∂_x φ(x, y) ; −∂_y φ(x, y) ].
Suppose that the set X is equipped with norm ‖·‖_x, whose dual norm is ‖·‖_{*,x}, and with a distance-generating function ω_x of modulus α_x with respect to ‖·‖_x, and let R_x² = D²_{ω_x,X}/α_x. We equip the set Y with the norm ‖·‖_y = ‖·‖₁, so that ‖·‖_{*,y} = ‖·‖∞, and with the distance-generating function

ω_y(y) = Σ_{i=1}^{m} y_i ln y_i,

and set R_y² = D²_{ω_y,Y}/α_y = ln m. Next, following (3.3), we set

‖(x, y)‖² = ‖x‖_x²/(2R_x²) + ‖y‖₁²/(2R_y²),

and hence

‖(ζ, η)‖* = √( 2R_x² ‖ζ‖²_{*,x} + 2R_y² ‖η‖∞² ).
Note that

E[‖G(x, y, ξ)‖*²] = 2R_x² E[ ‖ Σ_{i=1}^{m} y_i G_i(x, ξ) ‖²_{*,x} ] + 2R_y² E[‖F(x, ξ)‖∞²],

and since y ∈ Y,

‖ Σ_{i=1}^{m} y_i G_i(x, ξ) ‖²_{*,x} ≤ [ Σ_{i=1}^{m} y_i ‖G_i(x, ξ)‖_{*,x} ]² ≤ Σ_{i=1}^{m} y_i ‖G_i(x, ξ)‖²_{*,x}.
It follows that

(3.22) E[‖G(x, y, ξ)‖*²] ≤ M*²,

where

M*² = 2R_x² max_{1≤i≤m} sup_{x∈X} E[‖G_i(x, ξ)‖²_{*,x}] + 2R_y² sup_{x∈X} E[‖F(x, ξ)‖∞²].
Let us now use the saddle point mirror SA algorithm (3.7)–(3.8) with the constant stepsize policy

γ_t = 2 / (M* √(5N)), t = 1, 2, . . . , N.

Then, by (3.13) (with θ = 1), we obtain the efficiency estimate

(3.23) E[ε_φ(z̃₁^N)] ≤ 2M* √(5/N).
Discussion. Looking at the bound (3.23), one can make the following important observation. The error of the saddle point mirror SA algorithm in this case is "almost independent" of the number m of constraints (it grows as O(√(ln m)) as m increases). The interested reader can easily verify that if an E-SA algorithm were used in the same setting (i.e., the algorithm tuned to the norm ‖·‖_y = ‖·‖₂), the corresponding bound would grow with m much faster (in fact, our error bound would be O(√m) in that case).
Note that the properties of the saddle point mirror SA can be used to significantly reduce the arithmetic cost of implementing the algorithm. To this end let us look at the definition (3.20) of the stochastic oracle: in order to obtain a realization G(x, y, ξ), one has to compute m random subgradients G_i(x, ξ), i = 1, . . . , m, and then their convex combination Σ_{i=1}^{m} y_i G_i(x, ξ). Now let η be a random variable independent of ξ and uniformly distributed on [0, 1], and let ı(η, y) : [0, 1] × Y → {1, . . . , m} be equal to i when Σ_{s=1}^{i−1} y_s < η ≤ Σ_{s=1}^{i} y_s. That is, the random variable ı̂ = ı(η, y) takes values 1, . . . , m with probabilities y₁, . . . , y_m. Consider the random vector

(3.24) G(x, y, (ξ, η)) = [ G_{ı(η,y)}(x, ξ) ; −(F₁(x, ξ), . . . , F_m(x, ξ)) ].
We refer to G(x, y, (ξ, η)) as a randomized oracle for problem (3.19), the corresponding
random parameter being (ξ, η). By construction, we still have E G(x, y, (ξ, η)) =
g(x, y), where g is defined in (3.21), and, moreover, the same bound (3.22) holds
for E G(x, y, (ξ, η))2∗ . We conclude that the accuracy bound (3.23) holds for the
error of the saddle point mirror SA algorithm with randomized oracle. On the other
hand, in the latter procedure only one randomized subgradient Gı̂ (x, ξ) per iteration is
computed. This simple idea is further developed in another interesting application of
the saddle point mirror SA algorithm to bilinear matrix games, which we discuss next.
3.3. Application to bilinear matrix games. Consider the standard matrix
game problem, that is, problem (3.1) with
φ(x, y) = yᵀAx + bᵀx + cᵀy,

where X ⊂ Rⁿ and Y ⊂ Rᵐ are standard simplices. In the case in question it is natural to equip X (respectively, Y) with the ‖·‖₁-norm on Rⁿ (respectively, Rᵐ). We choose entropies as the corresponding distance-generating functions:

ω_x(x) = Σ_{i=1}^{n} x_i ln x_i, ω_y(y) = Σ_{i=1}^{m} y_i ln y_i ⇒ D²_{ω_x,X}/α_x = ln n, D²_{ω_y,Y}/α_y = ln m.

According to (3.3), this gives

(3.25) ‖(x, y)‖² = ‖x‖₁²/(2 ln n) + ‖y‖₁²/(2 ln m) ⇒ ‖(ζ, η)‖* = √( 2‖ζ‖∞² ln n + 2‖η‖∞² ln m ).
In order to compute the estimates G(x, y, ξ) of g(x, y) = (b + Aᵀy, −c − Ax), to be used in the saddle point mirror SA iterations (3.7), we use the randomized oracle

(3.26) G(x, y, ξ) = [ b + A^{ı(ξ₁,y)} ; −c − A_{ı(ξ₂,x)} ],

where ξ₁ and ξ₂ are independent random variables uniformly distributed on [0, 1], and ĵ = ı(ξ₁, y), î = ı(ξ₂, x) are defined as in (3.24) (i.e., ĵ can take values 1, . . . , m, with probabilities y₁, . . . , y_m, and î can take values 1, . . . , n, with probabilities x₁, . . . , x_n); here A_i denotes the ith column of A, and A^j stands for the jth row of A written as a column vector.
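A minimal Python sketch of this randomized oracle (with the row/column convention stated above; A is m × n, y ∈ Rᵐ, x ∈ Rⁿ):

    import numpy as np

    def matrix_game_oracle(A, b, c, x, y, rng):
        # unbiased estimates of g(x, y) = (b + A^T y, -c - A x) from one
        # sampled row and one sampled column of A (cf. (3.26))
        j = rng.choice(A.shape[0], p=y)   # row index, distributed as y
        i = rng.choice(A.shape[1], p=x)   # column index, distributed as x
        G_x = b + A[j, :]                 # expectation over j is b + A^T y
        G_y = -c - A[:, i]                # expectation over i is -c - A x
        return G_x, G_y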
Note that

(3.27) g(x, y) ≡ E[ G(x, y, (ĵ, î)) ] ∈ [ ∂_x φ(x, y) ; ∂_y(−φ(x, y)) ].

Besides this,

(3.28) ‖G(x, y, ξ)‖*² ≤ M*² = 2 ln n max_{1≤j≤m} ‖A^j + b‖∞² + 2 ln m max_{1≤j≤n} ‖A_j + c‖∞².
The bottom line is that our stochastic gradients along with the just-defined M∗ satisfy
both (A 2) and (3.16), and therefore with the constant stepsize policy (3.12), we have
(3.29) E[ε_φ(z̃₁^N)] = E[ max_{y∈Y} φ(x̃₁^N, y) − min_{x∈X} φ(x, ỹ₁^N) ] ≤ 2M* √(5/N)
(cf. (3.13)). In our present situation, Proposition 3.2 in a slightly refined form (for
proof, see the appendix) reads as follows.
Proposition 3.3. With the constant stepsize policy (3.12), for the just-defined
algorithm, one has for any Ω ≥ 1, that
(3.30) Prob{ ε_φ(z̃₁^N) > 2M* √(5/N) + 4MΩ/√N } ≤ exp{−Ω²/2},
where M is the constant defined in (3.31). Suppose that we want to find a solution with a prescribed relative accuracy ρ, i.e., such that ε_φ(z̃₁^N) ≤ ρM with probability at least 1 − δ. According to (3.30), to this end, one can use the
randomized saddle point mirror SA algorithm (3.7), (3.8), (3.26) with stepsizes (3.12),
(3.28) and with
(3.32) N = O(1) [ln n + ln(1/δ)] / ρ²

iterations, so that the overall arithmetic cost of producing such a solution is

C(ρ) = O(1) [ln n + ln(1/δ)] (m + n) / ρ²

arithmetic operations.
When ρ is fixed, m = O(1)n, and n is large, C(ρ) is incomparably less than the total number mn of entries in A. Thus, our algorithm exhibits sublinear-time behavior: it produces reliable solutions of prescribed quality to large-scale matrix games by inspecting a negligible, as n → ∞, part of randomly selected data. Note that randomization here is critical.³ It can be seen that a deterministic algorithm capable of finding a solution with (deterministic) relative accuracy ρ ≤ 0.1 has to "see" in the worst case at least O(1)n rows/columns of A.
4. Numerical results. In this section, we report the results of our computa-
tional experiments where we compare the performance of the robust mirror descent
SA method and the SAA method applied to three stochastic programming problems,
namely: a stochastic utility problem, a stochastic max-flow problem, and a network
planning problem with random demand. We also present a small simulation study of
the performance of randomized mirror SA algorithm for bilinear matrix games.
The algorithms we were testing are two variants of the robust mirror descent SA. The first variant, the E-SA, is as described in section 2.2; in terms of section 2.3, this is nothing but the mirror descent robust SA with the Euclidean setup; see the example in section 2.3. The second variant, referred to as the non-Euclidean SA (N-SA), is the mirror descent robust SA with the ℓ₁-setup; see the example in section 2.3.
3 The possibility to solve matrix games in a sublinear-time fashion by a randomized algorithm
was discovered by Grigoriadis and Khachiyan [9]. Their “ad hoc” algorithm is similar, although not
completely identical to ours, and possesses the same complexity bounds.
Table 4.1
Selecting stepsize policy.
These two variants of the SA method are compared with the SAA approach in the following way: fixing an i.i.d. sample (of size N) for the random variable ξ, we apply the three aforementioned methods to obtain approximate solutions for the test problem under consideration, and then the quality of the solutions yielded by these algorithms is evaluated using another i.i.d. sample of size K ≫ N. It should be noted that SAA itself is not an algorithm, and in our experiments it was coupled with the non-Euclidean restricted memory level (NERML) method [2]—a powerful deterministic algorithm for solving the sample average problem (1.4).
4.1. Preliminaries.
Algorithmic schemes. Both E-SA and N-SA were implemented according to the description in section 2.3, the number of steps N being the parameter of a particular experiment. In such an experiment, we generated ≈ log₂ N candidate solutions x̃_i^N, with N − i + 1 = min[2^k, N], k = 0, 1, . . . , ⌊log₂ N⌋. We then used an additional sample to estimate the objective at these candidate solutions in order to choose the best of them, specifically, as follows: we used a relatively short sample to choose the two "most promising" of the candidate solutions, and then a large sample (of size K ≫ N) to identify the better of these two candidates, thus getting the "final" solution. The computational effort required by this simple postprocessing is not reflected in the tables to follow.
The stepsizes. At the "pilot stage" of our experimentation, we made a decision on which stepsize policy—(2.47) or (2.55)—to choose and how to identify the underlying parameters M* and θ. In all our experiments, M* was estimated by taking the maximum of ‖G(·, ·)‖* over a small number (just 100) of calls to the stochastic oracle at randomly generated feasible solutions. As for the value of θ and the type of the stepsize policy ((2.47) or (2.55)), our choice was based on the results of experimentation with a single test problem (instance L1 of the utility problem; see below); some results of this experimentation are presented in Table 4.1. We found that the constant stepsize policy (2.47) with θ = 0.1 for the E-SA and θ = 5 for the N-SA slightly outperforms the other variants we considered. This particular policy, combined with the aforementioned scheme for estimating M*, was used in all subsequent experiments.
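The M* estimation just described amounts to a handful of oracle calls; a sketch, where sample_point stands for whatever (hypothetical) scheme generates random feasible solutions:

    import numpy as np

    def estimate_M_star(grad_oracle, sample_point, rng, n_calls=100):
        # estimate M_* in (2.40) by the maximum of ||G||_* over n_calls
        # oracle calls; here ||.||_* = ||.||_inf, matching the l1-setup
        return max(
            np.linalg.norm(grad_oracle(sample_point(rng), rng), np.inf)
            for _ in range(n_calls)
        )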
Format of test problems. All our test problems are of the form min_{x∈X} f(x), f(x) = E[F(x, ξ)], where the domain X either is a standard simplex {x ∈ Rⁿ : x ≥ 0, Σ_i x_i = 1} or can be converted into such a simplex by scaling of the original variables.
Notation in the tables. Below,
• n is the design dimension of an instance,
• N is the sample size (i.e., the number of steps in SA, and the size of the sample
used to build the stochastic average in SAA),
• Obj is the empirical mean of the random variable F(x, ξ), x being the approximate solution generated by the algorithm in question. The empirical means are taken over a large (K = 10⁴ elements) dedicated sample,
• CPU is the CPU time in seconds.
Table 4.3
The variability for the stochastic utility problem.
4.2. A stochastic utility problem. Our first experiment was carried out with
the utility model
(4.1) min_{x∈X} { f(x) = E[ φ( Σ_{i=1}^{n} (i/n + ξ_i) x_i ) ] },

where X = {x ∈ Rⁿ : x ≥ 0, Σ_{i=1}^{n} x_i = 1}, the ξ_i ∼ N(0, 1) are independent, and φ(·) is a piecewise linear convex function given by φ(t) = max{v₁ + s₁t, . . . , v_m + s_mt}, where the v_k and s_k are certain constants. In our experiment, we used m = 10 breakpoints, all located on [0, 1]. The four instances L1, L2, L3, L4 we dealt with were of dimensions varying from 500 to 2,000, each instance with its own randomly generated function φ. All the algorithms were coded in ANSI C, and the experiments were conducted on an Intel PIV 1.6GHz machine with Microsoft Windows XP Professional.
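For reference, the stochastic oracle for (4.1) is cheap to simulate; a sketch with hypothetical breakpoint data (arrays v, s):

    import numpy as np

    def utility_oracle(x, v, s, rng):
        # F(x, xi) = phi(t), t = sum_i (i/n + xi_i) x_i, xi_i ~ N(0, 1),
        # phi(t) = max_k {v_k + s_k t}; returns value and a subgradient
        n = len(x)
        a = np.arange(1, n + 1) / n + rng.standard_normal(n)
        t = a @ x
        k = np.argmax(v + s * t)          # active linear piece of phi
        return v[k] + s[k] * t, s[k] * a  # value, subgradient s_k * a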
We ran each of the three aforementioned methods with various sample sizes on every one of the instances. The results are reported in Table 4.2.
In order to evaluate the stability of the algorithms, we ran each of them 100 times; the resulting statistics are shown in Table 4.3. In this relatively time-consuming experiment, we restricted ourselves to a single instance (L2) and just two sample sizes (N = 1,000 and 2,000). In Table 4.3, "Mean" and "Dev" are, respectively, the mean and the standard deviation, over the 100 runs, of the objective value Obj at the resulting approximate solution.
The experiments demonstrate that as far as the quality of approximate solutions
is concerned, N-SA outperforms E-SA and is almost as good as SAA. At the same
time, the solution time for N-SA is significantly smaller than the one for SAA.
4.3. Stochastic max-flow problem. In the second experiment, we consider
simple two-stage stochastic linear programming, namely, a stochastic max-flow prob-
lem. The problem is to optimize the capacity expansion of a stochastic network. Let
Table 4.4
SA versus SAA on the stochastic max-flow problem.

                  F1             F2             F3             F4
(m, n)        (50, 500)    (100, 1,000)   (100, 2,000)   (250, 5,000)
ALG.     N    Obj    CPU    Obj    CPU     Obj    CPU     Obj    CPU
N-SA   100   0.1140    0   0.0637    0    0.1296    1    0.1278    3
     1,000   0.1254    1   0.0686    3    0.1305    6    0.1329   15
     2,000   0.1249    3   0.0697    6    0.1318   11    0.1338   29
     4,000   0.1246    5   0.0698   11    0.1331   21    0.1334   56
E-SA   100   0.0840    0   0.0618    1    0.1277    2    0.1153    7
     1,000   0.1253    3   0.0670    6    0.1281   16    0.1312   39
     2,000   0.1246    5   0.0695   13    0.1287   28    0.1312   72
     4,000   0.1247    9   0.0696   24    0.1303   53    0.1310  127
SAA    100   0.1212    5   0.0653   12    0.1310   20    0.1253   60
     1,000   0.1223   35   0.0694   84    0.1294  157    0.1291  466
     2,000   0.1223   70   0.0693  170    0.1304  311    0.1284  986
     4,000   0.1221  140   0.0693  323    0.1301  636    0.1293 1,885
G = (N, A) be a digraph with a source node s and a sink node t. Each arc (i, j) ∈ A has an existing capacity p_{ij} ≥ 0 and a random implementing/operating level ξ_{ij}. Moreover, there is a common random degrading factor η for all arcs in A. The goal is to determine how much capacity to add to the arcs, subject to a budget constraint, in order to maximize the expected maximum flow from s to t. Denoting by x_{ij} the capacity to be added to arc (i, j), the problem reads
(4.2) max_x { f(x) = E[F(x; ξ, η)] : Σ_{(i,j)∈A} c_{ij} x_{ij} ≤ b, x_{ij} ≥ 0 ∀(i, j) ∈ A },
where c_{ij} is the per unit cost of the capacity to be added, b is the total available budget, and F(x; ξ, η) denotes the maximum s–t flow in the network when the capacity of an arc (i, j) is ηξ_{ij}(p_{ij} + x_{ij}). Note that the above is a maximization rather than a minimization problem.
We assume that the random variables ξ_{ij} and η are independent and uniformly distributed on [0, 1] and [0.5, 1], respectively, and consider the case of p_{ij} = 0, c_{ij} = 1 for all (i, j) ∈ A, and b = 1. We randomly generated 4 network instances (referred to as F1, F2, F3, and F4) using the network generator GRIDGEN available from the DIMACS challenge. The push-relabel algorithm [8] was used to solve the second-stage max-flow problem.
In the first test, each algorithm (N-SA, E-SA, SAA) was run once on each test instance; the results are reported in Table 4.4, where m and n stand for the number of nodes and arcs in G, respectively. As with the stochastic utility problem, we investigated the stability of the methods by running each of them 100 times. The resulting statistics are presented in Table 4.5, whose columns have exactly the same meaning as in Table 4.3.
This experiment fully supports the conclusions on the methods suggested by the
experiments with the utility problem.
4.4. A network planning problem with random demand. In the last ex-
periment, we consider the so-called SSN problem of Sen, Doverspike, and Cosares
[24]. This problem arises in telecommunications network design where the owner of
the network sells private-line services between pairs of nodes in the network, and the
demands are treated as random variables based on the historical demand patterns.
The optimization problem is to decide where to add capacity to the network to min-
imize the expected rate of unsatisfied demands. Since this problem has been studied
by several authors (see, e.g., [12, 24]), it could be interesting to compare the results.
Another purpose of this experiment is to investigate the behavior of the SA method when the Latin hypercube sampling (LHS) variance reduction technique (introduced in [14]) is applied.
The problem has been formulated as a two-stage stochastic linear program as follows:

(4.3) min_x { f(x) = E[F(x, ξ)] : x ≥ 0, Σ_i x_i = b },

where x is the vector of capacities to be added to the arcs of the network, b (the budget) is the total amount of capacity to be added, ξ denotes the random demand, and F(x, ξ) represents the number of unserved requests, specifically,

(4.4) F(x, ξ) = min_{s,f} { Σ_i s_i : Σ_i Σ_{r∈R(i)} A_r f_{ir} ≤ x + c, Σ_{r∈R(i)} f_{ir} + s_i = ξ_i ∀i, f_{ir} ≥ 0, s_i ≥ 0 ∀i, r ∈ R(i) }.
Here,
• R(i) is the set of routes used for traffic i (traffic between the source-sink pair of nodes #i),
• ξ_i is the (random) demand for traffic i,
• A_r are the route-arc incidence vectors (so that the jth component of A_r is 1 or 0 depending on whether arc j belongs to the route r),
• c is the vector of current capacities, f_{ir} is the fraction of traffic i transferred via route r, and s is the vector of unsatisfied demands.
In the SSN instance, there are dim x = 89 arcs and dim ξ = 86 source-sink pairs, and the components of ξ are independent random variables with known discrete distributions (from 3 to 7 possible values per component), which results in ≈ 10⁷⁰ possible demand scenarios.
In the first test with the SSN instance, each of our 3 algorithms was run once
without and once with the LHS technique; the results are reported in Table 4.6. We
then tested the stability of algorithms by running each of them 100 times; see statistics
in Table 4.7. Note that the experiments with the SSN problem were conducted on a more powerful computer: an Intel Xeon 1.86GHz machine with Red Hat Enterprise Linux.
As far as comparison of our three algorithms is concerned, the conclusions are
in full agreement with those for the utility and the max-flow problem. We also see
that for our particular example, the LHS does not yield much of an improvement,
especially when a larger sample size is applied. This result seems to be consistent
with the observation in [12].
Table 4.6
SA versus SAA on the SSN problem.
Table 4.7
The variability for the SSN problem.
4.5. N-SA versus E-SA. The data in Tables 4.3, 4.4, and 4.6 demonstrate that with the same sample size N, the N-SA outperforms the E-SA in terms of both the quality of approximate solutions and the running time.⁴ The difference in solution quality, at first glance, seems slim, and one could think that adjusting the SA algorithm to the "geometry" of the problem in question (in our case, minimization over a standard simplex) is of minor importance. We, however, believe that such a conclusion would be wrong. In order to get a better insight, let
us come back to the stochastic utility problem. This test problem has an important advantage—we can compute the value of the objective f(x) at a given candidate solution x analytically.⁵ Moreover, it is easy to minimize f(x) over the simplex—on closer inspection, this problem reduces to minimizing an easy-to-compute univariate convex function, so that we can approximate the true optimal value f* to high accuracy by bisection. Thus, in the case in question, we can compare the solutions x generated by various algorithms in terms of their "true inaccuracy" f(x) − f*, and this is the rationale behind our "Gaussian setup." We can now exploit this advantage of the
stochastic utility problem for comparing properly N-SA and E-SA. In Table 4.8, we
present the true values of the objective f (x̄) at the approximate solutions x̄ generated
by N-SA and E-SA as applied to the instances L1 and L4 of the utility problem
(cf. Table 4.3) along with the inaccuracies f (x̄) − f∗ and the Monte Carlo estimates
f̂(x̄) of f(x̄) obtained via 50,000-element samples. We see that the difference in the inaccuracy f(x̄) − f* of the solutions produced by the algorithms is much more significant than is suggested by the data in Table 4.3 (where the actual inaccuracy is "obscured" by the estimation error and summation with f*). Specifically, at the sample size N = 2,000, common to both algorithms, the inaccuracy yielded by N-SA is 3–5 times less than the one for E-SA, and in order to compensate for this difference, one should increase the sample size for E-SA (and hence the running time) by a factor of 5–10. It should be added that in light of the theoretical complexity analysis carried out in the example of section 2.3, the outlined significant difference in the performances of N-SA and E-SA is not surprising; the surprising fact is that E-SA works at all.
⁴The difference in running times can be easily explained: with X being a simplex, the prox-mapping for E-SA takes O(n ln n) operations versus O(n) operations for N-SA.
⁵Indeed, (ξ₁, . . . , ξ_n) ∼ N(0, I_n), so that the random variable ξ_x = Σ_i (i/n + ξ_i)x_i is normal with easily computable mean and variance, and since φ is piecewise linear, the expectation f(x) = E[φ(ξ_x)] can be immediately expressed via the error function.
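To illustrate footnote 5, here is a sketch of the closed-form evaluation of E[φ(Z)] for Z ∼ N(μ, σ²) and piecewise linear φ; it assumes distinct slopes and that every piece is active on the upper envelope (a simplification of the general case, not the authors' code):

    import numpy as np
    from scipy.stats import norm

    def exact_expected_utility(mu, sigma, v, s):
        # E[phi(Z)], Z ~ N(mu, sigma^2), phi(t) = max_k {v_k + s_k t}:
        # integrate each linear piece over the interval where it is active
        order = np.argsort(s)             # sort pieces by slope
        v, s = v[order], s[order]
        bp = (v[:-1] - v[1:]) / (s[1:] - s[:-1])   # handover points
        lo = np.concatenate(([-np.inf], bp))
        hi = np.concatenate((bp, [np.inf]))
        a, b = (lo - mu) / sigma, (hi - mu) / sigma
        dPhi = norm.cdf(b) - norm.cdf(a)  # probability mass of each piece
        dpdf = norm.pdf(b) - norm.pdf(a)
        # int (v + s t) dN(t) over a piece = (v + s mu) dPhi - s sigma dpdf
        return float(np.sum((v + s * mu) * dPhi - s * sigma * dpdf))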
4.6. Bilinear matrix game. We consider here a bilinear matrix game, that is, problem (3.1) with φ(x, y) = yᵀAx and X, Y standard simplices. We use the notations E1(α) and E2(α) to refer to the experiments with the matrices of the first and second kind with parameter α. We present in Table 4.9 the results of experiments conducted for matrices A of size 10⁴ × 10⁴. We made 100 simulation runs in each experiment and present the average error (column Mean), the standard deviation (column Dev), and the average running time (excluding the time needed to compute the error of the resulting solution). For comparison, we also present the error of the initial solution z̃₁ = (x₁, y₁).
Table 4.9
SA for bilinear matrix games.
Our basic observation is as follows: both N-SA and E-SA succeed in reducing the solution error reasonably fast. The N-SA implementation is preferable, as it is more efficient in terms of running time. For comparison, it takes Matlab from 10 (for the simplest problem) to 35 seconds (for the hardest one) to compute just one answer

g(x, y) = ( Aᵀy, −Ax )

of the deterministic oracle.
6. Appendix. Proof of Lemma 2.1. Set v = P_x(y); by construction, v ∈ X^o. As ∇_v V(x, v) = ∇ω(v) − ∇ω(x), the optimality conditions for (2.31) imply that

(6.1) (∇ω(v) − ∇ω(x) + y)ᵀ(v − u) ≤ 0 ∀u ∈ X.

For u ∈ X, we therefore have

V(v, u) − V(x, u) = [ω(u) − ∇ω(v)ᵀ(u − v) − ω(v)] − [ω(u) − ∇ω(x)ᵀ(u − x) − ω(x)]
= (∇ω(v) − ∇ω(x) + y)ᵀ(v − u) + yᵀ(u − v) − V(x, v)
≤ yᵀ(u − v) − V(x, v),

where the last inequality is due to (6.1). By Young's inequality,⁶ we have

yᵀ(x − v) ≤ ‖y‖*²/(2α) + (α/2)‖x − v‖²,

while V(x, v) ≥ (α/2)‖x − v‖², due to the strong convexity of V(x, ·). Combining these relations, we obtain (2.33).
For the entropy function on the standard simplex, X^o = {x ∈ X : x > 0}, and in order to establish the property in question, it suffices to verify that hᵀ∇²ω(x)h ≥ ‖h‖₁² for every x ∈ X^o. Here is the computation: by the Cauchy–Schwarz inequality and since Σ_i x_i = 1,

( Σ_i |h_i| )² = ( Σ_i (x_i^{1/2})(|h_i| x_i^{−1/2}) )² ≤ ( Σ_i x_i )( Σ_i h_i² x_i^{−1} ) = Σ_i h_i² x_i^{−1} = hᵀ∇²ω(x)h.
Proof of Lemma 3.1. Applying (2.33) with x = z_t, y = γ_t G(z_t, ξ_t), and u ∈ Z, and summing up over t = 1, . . . , j (recall that here α = 1), we obtain

Σ_{t=1}^{j} γ_t (z_t − u)ᵀg(z_t) ≤ V(z₁, u) − V(z_{j+1}, u) + ½ Σ_{t=1}^{j} γ_t² ‖G(z_t, ξ_t)‖*² − Σ_{t=1}^{j} γ_t (z_t − u)ᵀΔ_t.

We also need the following fact: for any u ∈ Z and any sequence v₁ ∈ Z^o, v_{t+1} = P_{v_t}(ζ_t), t = 1, 2, . . . , one has

(6.4) Σ_{t=1}^{j} ζ_tᵀ(v_t − u) ≤ V(v₁, u) + ½ Σ_{t=1}^{j} ‖ζ_t‖*².
Proof. Using the bound (2.33) of Lemma 2.1 with y = ζ_t and x = v_t (so that v_{t+1} = P_{v_t}(ζ_t)) and recalling that we are in the situation of α = 1, we obtain the following for any u ∈ Z:

V(v_{j+1}, u) ≤ V(v₁, u) + Σ_{t=1}^{j} ζ_tᵀ(u − v_t) + ½ Σ_{t=1}^{j} ‖ζ_t‖*².
Since V(v_{j+1}, u) ≥ 0, taking v₁ = z₁ and ζ_t = −γ_tΔ_t yields

(6.5) Σ_{t=1}^{j} γ_t Δ_tᵀ(u − v_t) ≤ V(z₁, u) + ½ Σ_{t=1}^{j} γ_t² ‖Δ_t‖*² ∀u ∈ Z.
Observe that
E[‖Δ_t‖*²] ≤ 4E[‖G(z_t, ξ_t)‖*²] ≤ 4 [ (2D²_{ω_x,X}/α_x) M²_{*,x} + (2D²_{ω_y,Y}/α_y) M²_{*,y} ] = 4M*²
(recall that V(z₁, ·) is bounded by 1 on Z). Now we proceed exactly as in section 2.2: we sum up (6.3) from t = 1 to j to obtain

Σ_{t=1}^{j} γ_t (z_t − u)ᵀg(z_t) ≤ V(z₁, u) + ½ Σ_{t=1}^{j} γ_t² ‖G(z_t, ξ_t)‖*² − Σ_{t=1}^{j} γ_t (z_t − u)ᵀΔ_t

(6.7)
= V(z₁, u) + ½ Σ_{t=1}^{j} γ_t² ‖G(z_t, ξ_t)‖*² − Σ_{t=1}^{j} γ_t (z_t − v_t)ᵀΔ_t + Σ_{t=1}^{j} γ_t (u − v_t)ᵀΔ_t.
When taking into account that z_t and v_t are deterministic functions of ξ_[t−1] = (ξ₁, . . . , ξ_{t−1}) and that the conditional expectation of Δ_t, ξ_[t−1] being given, vanishes, we conclude that E[(z_t − v_t)ᵀΔ_t] = 0. We take now suprema in u ∈ Z and then expectations, which, combined with (6.5), yields (3.10).

Proof of Proposition 3.2. Setting α_N = Σ_{t=1}^{N} (γ_t²/2)( ‖G(z_t, ξ_t)‖*² + ‖Δ_t‖*² ), we will show that

(6.11) E[exp{α_N/σ_α}] ≤ exp{1}, σ_α = (5/2) M*² Σ_{t=1}^{N} γ_t².

Indeed, by (3.16), E[exp{‖G(z_t, ξ_t)‖*²/M*²}] ≤ exp{1}, and

‖Δ_t‖*² = ‖G(z_t, ξ_t) − g(z_t)‖*² ≤ ( ‖G(z_t, ξ_t)‖* + ‖g(z_t)‖* )² ≤ 2‖G(z_t, ξ_t)‖*² + 2M*²,

which implies that

α_N ≤ Σ_{t=1}^{N} (γ_t²/2)( 3‖G(z_t, ξ_t)‖*² + 2M*² ).
Observe that if r₁, . . . , r_i are nonnegative random variables such that E[exp{r_t/σ_t}] ≤ exp{1} for some deterministic σ_t > 0, then, by convexity of the exponential function,

(6.13) E[ exp{ Σ_{t≤i} r_t / Σ_{t≤i} σ_t } ] ≤ E[ Σ_{t≤i} (σ_t / Σ_{τ≤i} σ_τ) exp{r_t/σ_t} ] ≤ exp{1}.

Now applying (6.13) with r_t = (γ_t²/2)( 3‖G(z_t, ξ_t)‖*² + 2M*² ) and σ_t = (5/2)γ_t²M*², we obtain (6.11).
Now let ζ_t = γ_t (v_t − z_t)ᵀΔ_t. Observing that v_t, z_t are deterministic functions of ξ_[t−1], while E[Δ_t|ξ_[t−1]] = 0, we see that the sequence {ζ_t}_{t=1}^{N} of random real variables forms a martingale difference. Besides this, by strong convexity of ω with modulus 1 w.r.t. ‖·‖ and due to D_{ω,Z} ≤ 1, we have

u ∈ Z ⇒ 1 ≥ V(z₁, u) ≥ ½‖u − z₁‖²,

whence the ‖·‖-diameter of Z does not exceed 2√2, so that |ζ_t| ≤ 2√2 γ_t ‖Δ_t‖*, and therefore

E[ exp{ |ζ_t|² / (32γ_t²M*²) } | ξ_[t−1] ] ≤ exp{1}
by (6.10). Applying Cramer's deviation bound, we obtain, for any Ω > 0,

(6.14) Prob{ β_N > 4ΩM* √( Σ_{t=1}^{N} γ_t² ) } ≤ exp{−Ω²/4},

where β_N = Σ_{t=1}^{N} ζ_t.
Indeed, for γ ≥ 0, setting σ_t = 4√2 γ_t M* and taking into account that ζ_t is a deterministic function of ξ_[t], with E[ζ_t|ξ_[t−1]] = 0 and E[exp{ζ_t²/σ_t²}|ξ_[t−1]] ≤ exp{1}, we have:

for 0 < γσ_t ≤ 1 (using the inequality eˣ ≤ x + e^{x²}),
E[exp{γζ_t}|ξ_[t−1]] ≤ E[exp{γ²ζ_t²}|ξ_[t−1]] ≤ ( E[exp{ζ_t²/σ_t²}|ξ_[t−1]] )^{γ²σ_t²} ≤ exp{γ²σ_t²};

for γσ_t > 1 (using the inequality γζ ≤ ½γ²σ² + ½ζ²/σ²),
E[exp{γζ_t}|ξ_[t−1]] ≤ E[exp{½γ²σ_t² + ½ζ_t²/σ_t²}|ξ_[t−1]] ≤ exp{½γ²σ_t² + ½} ≤ exp{γ²σ_t²}.

It follows that E[exp{γβ_N}] ≤ exp{γ² Σ_{t=1}^{N} σ_t²}, whence Prob{β_N > Θ} ≤ exp{γ² Σ_{t=1}^{N} σ_t² − γΘ} for every Θ > 0. When choosing γ = ½Ω ( Σ_{t=1}^{N} σ_t² )^{−1/2}, we arrive at (6.14).
Combining (6.9), (6.10), and (6.14), we get the following for any positive Ω and Θ:

Prob{ Γ_N ε_φ(z̃₁^N) > 2 + (5/2)(1 + Ω) M*² Σ_{t=1}^{N} γ_t² + 4√2 Θ M* √( Σ_{t=1}^{N} γ_t² ) } ≤ exp{−Ω} + exp{−Θ²/4}.

When setting Θ = 2√Ω and substituting (3.12), we obtain (3.17).
Proof of Proposition 3.3. As in the proof of Proposition 3.2, setting Γ_N = Σ_{t=1}^{N} γ_t and using the relations (3.9), (6.5), and (6.7), combined with the fact that ‖G(z, ξ)‖* ≤ M*, we obtain

Γ_N ε_φ(z̃₁^N) ≤ 2 + Σ_{t=1}^{N} (γ_t²/2)( ‖G(z_t, ξ_t)‖*² + ‖Δ_t‖*² ) + Σ_{t=1}^{N} γ_t (v_t − z_t)ᵀΔ_t

(6.15) ≤ 2 + (5/2) M*² Σ_{t=1}^{N} γ_t² + Σ_{t=1}^{N} γ_t (v_t − z_t)ᵀΔ_t,

where we denote the last sum by α_N. Recall that by definition of Δ_t, ‖Δ_t‖* = ‖G(z_t, ξ_t) − g(z_t)‖* ≤ ‖G(z_t, ξ_t)‖* + ‖g(z_t)‖* ≤ 2M*.
Note that ζ_t = γ_t (v_t − z_t)ᵀΔ_t is a bounded martingale difference, i.e., E[ζ_t|ξ_[t−1]] = 0 and |ζ_t| ≤ 4γ_tM (here M is the constant defined in (3.31)). Then, by the Azuma–Hoeffding inequality [1], for any Ω ≥ 0,

(6.16) Prob{ α_N > 4ΩM √( Σ_{t=1}^{N} γ_t² ) } ≤ e^{−Ω²/2},

and the bound (3.30) of the proposition can be easily obtained by substituting the constant stepsizes γ_t as defined in (3.12).
REFERENCES
[1] K. Azuma, Weighted sums of certain dependent random variables, Tôhoku Math. J., 19 (1967), pp. 357–367.
[2] A. Ben-Tal and A. Nemirovski, Non-Euclidean restricted memory level method for large-scale
convex optimization, Math. Program., 102 (2005), pp. 407–456.
[3] A. Benveniste, M. Métivier, and P. Priouret, Algorithmes Adaptatifs et Approximations
Stochastiques, Masson, Paris, 1987 (in French). Adaptive Algorithms and Stochastic Ap-
proximations, Springer, New York, 1993 (in English).
[4] L.M. Bregman, The relaxation method of finding the common points of convex sets and its
application to the solution of problems in convex programming, Comput. Math. Math.
Phys., 7 (1967), pp. 200–217.
[5] K.L. Chung, On a stochastic approximation method, Ann. Math. Statist., 25 (1954), pp. 463–
483.
[6] Y. Ermoliev, Stochastic quasigradient methods and their application to system optimization,
Stochastics, 9 (1983), pp. 1–36.
[7] A.A. Gaivoronski, Nonstationary stochastic programming problems, Kybernetika, 4 (1978),
pp. 89–92.
[8] A.V. Goldberg and R.E. Tarjan, A new approach to the maximum flow problem, J. ACM,
35 (1988), pp. 921–940.
[9] M.D. Grigoriadis and L.G. Khachiyan, A sublinear-time randomized approximation algo-
rithm for matrix games, Oper. Res. Lett., 18 (1995), pp. 53–58.
[10] A. Juditsky, A. Nazin, A. Tsybakov, and N. Vayatis, Recursive aggregation of estimators by
the mirror descent algorithm with averaging, Probl. Inf. Transm., 41 (2005), pp. 368–384.
[11] A.J. Kleywegt, A. Shapiro, and T. Homem-de-Mello, The sample average approximation
method for stochastic discrete optimization, SIAM J. Optim., 12 (2002), pp. 479–502.
[12] J. Linderoth, A. Shapiro, and S. Wright, The empirical behavior of sampling methods for
stochastic programming, Ann. Oper. Res., 142 (2006), pp. 215–241.
[13] W.K. Mak, D.P. Morton, and R.K. Wood, Monte Carlo bounding techniques for determining
solution quality in stochastic programs, Oper. Res. Lett., 24 (1999), pp. 47–56.
[14] M.D. McKay, R.J. Beckman, and W.J. Conover, A comparison of three methods for select-
ing values of input variables in the analysis of output from a computer code, Technometrics,
21 (1979), pp. 239–245.
[15] A. Nemirovski and D. Yudin, On Cezari’s convergence of the steepest descent method for
approximating saddle point of convex-concave functions, Dokl. Akad. Nauk SSSR, 239
(1978) (in Russian). Soviet Math. Dokl., 19 (1978) (in English).
[16] A. Nemirovski and D. Yudin, Problem Complexity and Method Efficiency in Optimization,
Wiley-Intersci. Ser. Discrete Math. 15, John Wiley, New York, 1983.
[17] Y. Nesterov, Primal-dual subgradient methods for convex problems, Math. Program., Ser. B, 120 (2009), pp. 221–259.
[18] B.T. Polyak, New stochastic approximation type procedures, Automat. i Telemekh., 7 (1990),
pp. 98–107.
[19] B.T. Polyak and A.B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM
J. Control Optim., 30 (1992), pp. 838–855.
[20] G.C. Pflug, Optimization of Stochastic Models, The Interface Between Simulation and Opti-
mization, Kluwer, Boston, 1996.
[21] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Stat., 22 (1951),
pp. 400–407.
[22] A. Ruszczyński and W. Syski, A method of aggregate stochastic subgradients with on-line
stepsize rules for convex stochastic programming problems, Math. Prog. Stud., 28 (1986),
pp. 113–131.
[23] J. Sacks, Asymptotic distributions of stochastic approximation procedures, Ann. Math. Stat., 29 (1958), pp. 373–405.
[24] S. Sen, R.D. Doverspike, and S. Cosares, Network planning with random demand,
Telecomm. Syst., 3 (1994), pp. 11–30.
[25] A. Shapiro, Monte Carlo sampling methods, in Stochastic Programming, Handbook in OR &
MS, Vol. 10, A. Ruszczyński and A. Shapiro, eds., North-Holland, Amsterdam, 2003.
[26] A. Shapiro and A. Nemirovski, On complexity of stochastic programming problems, in Con-
tinuous Optimization: Current Trends and Applications, V. Jeyakumar and A.M. Rubinov,
eds., Springer, New York, 2005, pp. 111–144.
[27] J.C. Spall, Introduction to Stochastic Search and Optimization: Estimation, Simulation, and
Control, John Wiley, Hoboken, NJ, 2003.
[28] V. Strassen, The existence of probability measures with given marginals, Ann. Math. Statist., 36 (1965), pp. 423–439.
[29] B. Verweij, S. Ahmed, A.J. Kleywegt, G. Nemhauser, and A. Shapiro, The sample aver-
age approximation method applied to stochastic routing problems: A computational study,
Comput. Optim. Appl., 24 (2003), pp. 289–333.