


A data-driven approach to multi-stage stochastic
linear optimization
Dimitris Bertsimas
Operations Research Center, Massachusetts Institute of Technology, [email protected]

Shimrit Shtern
The William Davidson Faculty of Industrial Engineering & Management, Technion - Israel Institute of Technology,
[email protected]

Bradley Sturt
Department of Information and Decision Sciences, University of Illinois at Chicago, [email protected]

We propose a new data-driven approach for addressing multi-stage stochastic linear optimization problems
with unknown distributions. The approach consists of solving a robust optimization problem that is con-
structed from sample paths of the underlying stochastic process. We provide asymptotic bounds on the
gap between the optimal costs of the robust optimization problem and the underlying stochastic problem
as more sample paths are obtained, and we characterize cases in which this gap is equal to zero. To the
best of our knowledge, this is the first sample-path approach for multi-stage stochastic linear optimization
that offers asymptotic optimality guarantees when uncertainty is arbitrarily correlated across time. Finally,
we develop approximation algorithms for the proposed approach by extending techniques from the robust
optimization literature, and demonstrate their practical value through numerical experiments on stylized
data-driven inventory management problems.

Key words: stochastic programming; robust optimization; sample-path approximations.


History: First submitted on Nov. 3, 2018. Revisions submitted on April 5, 2020 and April 16, 2021.
Accepted for publication on August 22, 2021.

1. Introduction
In the traditional formulation of linear optimization, one makes a decision which minimizes a known
objective function and satisfies a known set of constraints. Linear optimization has, by all measures,
succeeded as a framework for modeling and solving numerous real world problems. However, in many
practical applications, the objective function and constraints are unknown at the time of decision
making. To incorporate uncertainty into the linear optimization framework, Dantzig (1955) proposed
partitioning the decision variables across multiple stages, which are made sequentially as more uncertain
parameters are revealed. This formulation is known today as multi-stage stochastic linear optimization,
which has become an integral modeling paradigm in many applications (e.g., supply chain management,
energy planning, finance) and remains a focus of the stochastic optimization community (Birge and
Louveaux 2011, Shapiro et al. 2009).


In practice, decision makers increasingly have access to historical data which can provide valuable in-
sight into future uncertainty. For example, consider a manufacturer which sells short lifecycle products.
The manufacturer does not know a joint probability distribution of the demand over a new product’s
lifecycle, but has access to historical demand trajectories over the lifecycle of similar products. An-
other example is energy planning, where operators must coordinate and commit to production levels
throughout a day, the output of wind turbines is subject to uncertain weather conditions, and data on
historical daily wind patterns is increasingly available. Other examples include portfolio management,
where historical asset returns over time are available to investors, and transportation planning, where
data comes in the form of historical ride usage of transit and ride sharing systems over the course
of a day. Such historical data provides significant potential for operators to better understand how
uncertainty unfolds through time, which can in turn be used for better planning.

When the underlying probability distribution is unknown, data-driven approaches to multi-stage


stochastic linear optimization traditionally follow a two-step procedure. The historical data is first fit
to a parametric model (e.g., an autoregressive moving average process), and decisions are then ob-
tained by solving a multi-stage stochastic linear optimization problem using the estimated distribution.
The estimation step is considered essential, as techniques for solving multi-stage stochastic linear opti-
mization (e.g., scenario tree discretization) generally require knowledge of the correlation structure of
uncertainty across time; see Shapiro et al. (2009, Section 5.8). A fundamental difficulty in this approach
is choosing a parametric model which will accurately estimate the underlying correlation structure and
lead to good decisions.

Nonparametric data-driven approaches to multi-stage stochastic linear optimization where uncer-


tainty is correlated across time are surprisingly scarce. Pflug and Pichler (2016) propose a nonparametric
estimate-then-optimize approach based on applying a kernel density estimator to the historical data,
which enjoys asymptotic optimality guarantees under a variety of strong technical conditions. Hanasu-
santo and Kuhn (2013) present another nonparametric approach wherein the conditional distributions in
stochastic dynamic programming are estimated using kernel regression. Krokhmal and Uryasev (2007)
discuss nonparametric path-grouping heuristics for constructing scenario trees from historical data.
In the case of multi-stage stochastic linear optimization, to the best of our knowledge, there are no
previous nonparametric data-driven approaches which are asymptotically optimal in the presence of
time-dependent correlations. Moreover, in the absence of additional assumptions on the estimated dis-
tribution or on the problem setting, multi-stage stochastic linear optimization problems are notorious
for being computationally demanding.

The main contribution of this paper is a new data-driven approach for multi-stage stochastic linear
optimization that can be asymptotically optimal even when uncertainty is arbitrarily correlated across
time. In other words, we propose a data-driven approach for addressing multi-stage stochastic linear
optimization with unknown distributions that (i ) does not require any parametric modeling assump-
tions on the correlation structure of the underlying probability distribution, and (ii ) converges to the
underlying multi-stage stochastic linear optimization problem under certain conditions as the size of the
dataset tends to infinity. Such asymptotic optimality guarantees are of practical importance, as they
ensure that the approach offers a near-optimal approximation of the underlying stochastic problem in
the presence of big data.

Our approach for multi-stage stochastic linear optimization is based on robust optimization. Specifi-
cally, given sample paths of the underlying stochastic process, the proposed approach consists of con-
structing and solving a multi-stage robust linear optimization problem with multiple uncertainty sets.
The main result of this paper (Theorem 1) establishes, under certain assumptions, that the optimal cost of this robust optimization problem converges to that of the stochastic problem as the number of sample paths tends to infinity. While this robust optimization problem is computationally demanding to solve exactly, we provide evidence that it can be tractably approximated to reasonable accuracy by leveraging approximation techniques from the robust optimization literature. To the best of our knowledge, there is no similar work in the literature which addresses multi-stage stochastic linear optimization by solving a sequence of robust optimization problems.

The paper is organized as follows. Section 2 introduces multi-stage stochastic linear optimization
in a data-driven setting. Section 3 presents the new data-driven approach to multi-stage stochastic
linear optimization. Section 4 states the main asymptotic optimality guarantees. Section 5 presents two
examples of approximation algorithms by leveraging techniques from robust optimization. Section 6
discusses implications of our asymptotic optimality guarantees in the context of Wasserstein-based
distributionally robust optimization. Sections 7 and 8 demonstrate the practical value of the proposed
methodologies in computational experiments. Section 9 offers concluding thoughts. All technical proofs
are relegated to the attached appendices.

1.1. Related literature

Originating with Soyster (1973) and Ben-Tal and Nemirovski (1999), robust optimization has been
widely studied as a general framework for decision-making under uncertainty, in which “optimal” de-
cisions are those which perform best under the worst-case parameter realization from an “uncertainty
set”. Beginning with the seminal work of Ben-Tal et al. (2004), robust optimization has proven particularly successful as a computationally tractable framework for addressing multi-stage problems.
Indeed, by restricting the space of decision rules, a stream of literature showed that multi-stage robust
linear optimization problems can be solved in polynomial time by using duality-based reformulations
or cutting-plane methods. For a modern overview of decision-rule approximations, we refer the reader
to Delage and Iancu (2015), Georghiou et al. (2018), Ben-Tal et al. (2009), Bertsimas et al. (2011a). A
variety of non-decision rule approaches to solving multi-stage robust optimization have been proposed
as well, such as Zeng and Zhao (2013), Zhen et al. (2018), Xu and Burer (2018), Georghiou et al. (2019).

Despite its computational tractability for multi-stage problems, a central critique of traditional robust
optimization is that it does not aspire to find solutions which perform well on average. Several works
have aimed to quantify the quality of solutions from multi-stage robust linear optimization from the
perspective of multi-stage stochastic linear optimization (Chen et al. 2007, Bertsimas and Goyal 2010,
Bertsimas et al. 2011b). By and large, it is fair to say that multi-stage robust linear optimization is
viewed today as a distinct framework from multi-stage stochastic linear optimization, aiming to find
solutions with good worst-case as opposed to good average performance.

Providing a potential tradeoff between the stochastic and robust frameworks, distributionally robust
optimization has recently received significant attention. First proposed by Scarf (1958), distributionally
robust optimization models the uncertain parameters with a probability distribution, but the distri-
bution is presumed to be unknown and contained in an ambiguity set of distributions. Even though
single-stage stochastic optimization is generally intractable, the introduction of ambiguity can surprisingly admit tractable reformulations (Delage and Ye 2010, Wiesemann et al. 2014). Consequently, the
extension of distributionally robust optimization to multi-stage decision making is an active area of
research, including Bertsimas et al. (2019) for multi-stage distributionally robust linear optimization
with moment-based ambiguity sets.

There has been a proliferation of data-driven constructions of ambiguity sets which offer various
probabilistic performance guarantees, including those based on the p-Wasserstein distance for p ∈ [1, ∞)
(Pflug and Wozabal 2007, Mohajerin Esfahani and Kuhn 2018), phi-divergences (Ben-Tal et al. 2013,
Bayraksan and Love 2015, Van Parys et al. 2020), and statistical hypothesis tests (Bertsimas et al.
2018). Many of these data-driven approaches have since been applied to the particular case of two-
stage distributionally robust linear optimization, including Jiang and Guan (2018) for phi-divergence
and Hanasusanto and Kuhn (2018) for p-Wasserstein ambiguity sets when p ∈ [1, ∞). To the best of
our knowledge, no previous work has demonstrated whether such distributionally robust approaches, if
extended to solve multi-stage stochastic linear optimization (with three or more stages) directly from
data, retain their asymptotic optimality guarantees.

In contrast to the above literature, our motivation for robust optimization in this paper is not to
find solutions which perform well on the worst-case realization in an uncertainty set, are risk-averse,
or have finite-sample probabilistic guarantees. Rather, our proposed approach to multi-stage stochastic
linear optimization adds robustness to the historical data as a tool to avoid overfitting as the number
of data points tends to infinity. In this spirit, our work is perhaps most closely related to several papers
in the context of machine learning (Xu et al. 2012, Shafieezadeh-Abadeh et al. 2019), which showed
that adding robustness to historical data can be used to develop machine learning methods which have
nonparametric performance guarantees when the solution space (of classification or regression models)
is not finite-dimensional. To the best of our knowledge, this paper is the first to apply this use of
robust optimization in the context of multi-stage stochastic linear optimization to achieve asymptotic
optimality without restricting the space of decision rules.

As far as we are aware, our data-driven approach of averaging over multiple uncertainty sets is novel
in the context of multi-stage stochastic linear optimization, and its asymptotic optimality guarantees
do not follow from existing literature. Xu et al. (2012) considered averaging over multiple uncertainty
sets to establish convergence guarantees for predictive machine learning methods, drawing connections
with distributionally robust optimization and kernel density estimation. Their convergence results re-
quire that the objective function is continuous, the underlying distribution is continuous, and there are
no constraints on the support. Absent strong assumptions on the problem setting and on the space
of decision rules (which in general can be discontinuous), these properties do not hold in multi-stage
problems. Erdoğan and Iyengar (2006) provide feasibility guarantees on robust constraints over unions
of uncertainty sets with the goal of approximating ambiguous chance constraints using the Prohorov
metric. Their probabilistic guarantees require that the constraint functions have a finite VC-dimension
(Erdoğan and Iyengar 2006, Theorem 5), an assumption which does not hold in general for two- or
multi-stage problems (Erdoğan and Iyengar 2007). In this paper, we instead establish general asymp-
totic optimality guarantees for the proposed data-driven approach for multi-stage stochastic linear
optimization by developing new bounds for distributionally robust optimization with the 1-Wasserstein
ambiguity set and connections with nonparametric support estimation (Devroye and Wise 1980).

Under a particular construction of the uncertainty sets, we show that the proposed data-driven ap-
proach to multi-stage stochastic linear optimization can also be interpreted as distributionally robust
optimization using the ∞-Wasserstein ambiguity set (see Section 6). However, the asymptotic opti-
mality guarantees in our paper do not make use of this interpretation, as there were surprisingly few
previous convergence results for this ambiguity set, even in single-stage settings. Indeed, when an un-
derlying distribution is unbounded, the ∞-Wasserstein distance between an empirical distribution and
true distribution is always infinite (Givens and Shortt 1984) and thus does not converge to zero as
more data is obtained. Therefore, it is not possible to develop measure concentration guarantees for
the ∞-Wasserstein distance (akin to those of Fournier and Guillin (2015)) which hold in general for
light-tailed but unbounded probability distributions. Consequently, the proof techniques used by Moha-
jerin Esfahani and Kuhn (2018, Theorem 3.6) to establish convergence guarantees for the 1-Wasserstein
ambiguity set do not appear to extend to the ∞-Wasserstein ambiguity set. As a byproduct of the results
in this paper, we obtain asymptotic optimality guarantees for distributionally robust optimization with
the ∞-Wasserstein ambiguity set under the same mild probabilistic assumptions as Mohajerin Esfahani
and Kuhn (2018) for the first time.

1.2. Notation

We denote the real numbers by $\mathbb{R}$, the nonnegative real numbers by $\mathbb{R}_+$, and the integers by $\mathbb{Z}$. Lowercase and uppercase bold letters refer to vectors and matrices. We assume throughout that $\|\cdot\|$ refers to an $\ell_p$-norm on $\mathbb{R}^d$, such as $\|v\|_1 = \sum_{i=1}^{d} |v_i|$ or $\|v\|_\infty = \max_{i \in [d]} |v_i|$. We let $\emptyset$ denote the empty set, $\operatorname{int}(\cdot)$ be the interior of a set, and $[K]$ be shorthand for the set of consecutive integers $\{1, \ldots, K\}$. Throughout the paper, we let $\xi := (\xi_1, \ldots, \xi_T) \in \mathbb{R}^d$ denote a stochastic process with a joint probability distribution $\mathbb{P}$, and assume that $\hat{\xi}^1, \ldots, \hat{\xi}^N$ are independent and identically distributed (i.i.d.) samples from that distribution. Let $\mathbb{P}^N := \mathbb{P} \times \cdots \times \mathbb{P}$ denote the $N$-fold probability distribution over the historical data. We let $S \subseteq \mathbb{R}^d$ denote the support of $\mathbb{P}$, that is, the smallest closed set for which $\mathbb{P}(\xi \in S) = 1$. The extended real numbers are defined as $\bar{\mathbb{R}} := \mathbb{R} \cup \{-\infty, \infty\}$, and we adopt the convention that $\infty - \infty = \infty$. The expectation of a measurable function $f : \mathbb{R}^d \to \mathbb{R}$ applied to the stochastic process is denoted by $\mathbb{E}[f(\xi)] = \mathbb{E}_{\mathbb{P}}[f(\xi)] = \mathbb{E}_{\mathbb{P}}[\max\{f(\xi), 0\}] - \mathbb{E}_{\mathbb{P}}[\max\{-f(\xi), 0\}]$. Finally, for any set $\mathcal{Z} \subseteq \mathbb{R}^d$, we let $\mathcal{P}(\mathcal{Z})$ denote the set of all probability distributions on $\mathbb{R}^d$ which satisfy $\mathbb{Q}(\xi \in \mathcal{Z}) \equiv \mathbb{E}_{\mathbb{Q}}[\mathbb{I}\{\xi \in \mathcal{Z}\}] = 1$.

2. Problem Setting
We consider multi-stage stochastic linear optimization problems with $T \ge 1$ stages. The uncertain parameters observed over the time horizon are represented by a stochastic process $\xi := (\xi_1, \ldots, \xi_T) \in \mathbb{R}^d$ with an underlying joint probability distribution, where $\xi_t \in \mathbb{R}^{d_t}$ is a random variable that is observed immediately after the decision in stage $t$ is selected. We assume throughout that the random variables $\xi_1, \ldots, \xi_T$ may be correlated. A decision rule $x := (x_1, \ldots, x_T)$ is a collection of policies which specify what decision to make in each stage based on the information observed up to that point. More precisely, a policy in each stage is a measurable function of the form $x_t : \mathbb{R}^{d_1} \times \cdots \times \mathbb{R}^{d_{t-1}} \to \mathbb{R}^{n_t - p_t} \times \mathbb{Z}^{p_t}$. We use the shorthand notation $\mathcal{X}$ to denote the space of all decision rules.

In multi-stage stochastic linear optimization, our goal is to find a decision rule which minimizes a
linear cost function in expectation while satisfying a system of linear inequalities almost surely. These
problems are represented by
$$
\begin{aligned}
\underset{x \in \mathcal{X}}{\text{minimize}} \quad & \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\xi) \cdot x_t(\xi_1, \ldots, \xi_{t-1}) \right] \\
\text{subject to} \quad & \sum_{t=1}^{T} A_t(\xi)\, x_t(\xi_1, \ldots, \xi_{t-1}) \le b(\xi) \quad \text{a.s.}
\end{aligned}
\tag{1}
$$
Following standard convention, we assume that the problem parameters $c_1(\xi) \in \mathbb{R}^{n_1}, \ldots, c_T(\xi) \in \mathbb{R}^{n_T}$, $A_1(\xi) \in \mathbb{R}^{m \times n_1}, \ldots, A_T(\xi) \in \mathbb{R}^{m \times n_T}$, and $b(\xi) \in \mathbb{R}^m$ are affine functions of the stochastic process.
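As a concrete sketch of this convention (our own illustration; the class and field names below are hypothetical, not from the paper), an affine parameter map such as $b(\xi) = b_0 + B\xi$ can be represented as a pair consisting of an intercept and a linear coefficient:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class AffineMap:
    """An affine function of the stochastic process, e.g. b(xi) = const + lin @ xi."""
    const: np.ndarray  # intercept term
    lin: np.ndarray    # linear coefficient matrix applied to xi

    def __call__(self, xi: np.ndarray) -> np.ndarray:
        return self.const + self.lin @ xi

# A hypothetical right-hand side b(xi) in R^2 for a stochastic process xi in R^2:
b = AffineMap(const=np.array([1.0, 2.0]), lin=np.eye(2))
```

For matrix-valued parameters such as $A_t(\xi)$, the linear part would be a third-order tensor; the sketch is kept vector-valued for brevity.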

In this paper, we assume that the underlying joint probability distribution of the stochastic process
is unknown. Instead, our information comes from historical data of the form

$$\hat{\xi}^j \equiv (\hat{\xi}_1^j, \ldots, \hat{\xi}_T^j), \qquad j = 1, \ldots, N.$$

We refer to each of these trajectories as a sample path of the stochastic process. This setting corresponds
to many real-life applications. For example, consider managing the inventory of a new short lifecycle
product, in which production decisions must be made over the product’s lifecycle. In this case, each
sample path represents the historical sales data observed over the lifecycle of a comparable product.
Further examples are readily found in energy planning and finance, among many others. We assume
that $\hat{\xi}^1, \ldots, \hat{\xi}^N$ are independent and identically distributed (i.i.d.) realizations of the stochastic process $\xi \equiv (\xi_1, \ldots, \xi_T)$. Our goal in this paper is to develop a general-purpose, nonparametric sample-path approach for solving Problem (1) in practical computation times.

We will also assume that the support of the stochastic process is unknown. For example, in inventory
management, an upper bound on the demand, if one exists, is generally unknown. On the other hand,
we often have partial knowledge on the underlying support. For example, when the stochastic process
captures the demand for a new product or the energy produced by a wind turbine, it is often the
case that the uncertainty will be nonnegative. To allow any partial knowledge on the support to be
incorporated, we assume knowledge of a convex superset Ξ ⊆ Rd of the support of the underlying joint
distribution, that is, P(ξ ∈ Ξ) = 1.

3. A Robust Approach to Multi-Stage Stochastic Linear Optimization


We now present the proposed data-driven approach, based on robust optimization, for solving multi-
stage stochastic linear optimization. First, we construct an uncertainty set $\mathcal{U}_N^j \subseteq \Xi$ around each sample path, consisting of realizations $\zeta \equiv (\zeta_1, \ldots, \zeta_T)$ which are slight perturbations of $\hat{\xi}^j \equiv (\hat{\xi}_1^j, \ldots, \hat{\xi}_T^j)$.
Then, we optimize for decision rules by averaging over the worst-case costs from each uncertainty set,
and require that the decision rule is feasible for all realizations in all of the uncertainty sets. Formally,
the proposed approach is the following:
$$
\begin{aligned}
\underset{x \in \mathcal{X}}{\text{minimize}} \quad & \frac{1}{N} \sum_{j=1}^{N}\, \sup_{\zeta \in \mathcal{U}_N^j}\, \sum_{t=1}^{T} c_t(\zeta) \cdot x_t(\zeta_1, \ldots, \zeta_{t-1}) \\
\text{subject to} \quad & \sum_{t=1}^{T} A_t(\zeta)\, x_t(\zeta_1, \ldots, \zeta_{t-1}) \le b(\zeta) \quad \forall \zeta \in \textstyle\bigcup_{j=1}^{N} \mathcal{U}_N^j.
\end{aligned}
\tag{2}
$$
In contrast to traditional robust optimization, Problem (2) involves averaging over multiple uncertainty
sets. Thus, the explicit goal here is to obtain decision rules which perform well on average while simul-
taneously not overfitting the historical data. We note that Problem (2) only requires that the decision
rules are feasible for the realizations in the uncertainty sets. These feasibility requirements are justified
when the overlapping uncertainty sets encompass the variability of future realizations of the uncertainty;
see Section 4.

Out of the various possible constructions of the uncertainty sets, our investigation shall henceforth
be focused on uncertainty sets constructed as balls of the form
$$
\mathcal{U}_N^j := \left\{ \zeta \equiv (\zeta_1, \ldots, \zeta_T) \in \Xi \;:\; \big\| \zeta - \hat{\xi}^j \big\| \le \epsilon_N \right\},
$$

where $\epsilon_N \ge 0$ is a parameter which controls the size of the uncertainty sets. The parameter is indexed by $N$ to allow for the size of the uncertainty sets to change as more data is obtained. The rationale for this
particular uncertainty set is three-fold. First, it is conceptually simple, requiring only a single parameter
to both estimate the expectation in the objective and the support of the distribution in the constraints.
Second, under appropriate choice of the robustness parameter, we will show that Problem (2) with
these uncertainty sets provides a near-optimal approximation of Problem (1) in the presence of big data
(see Section 4). Finally, the uncertainty sets are of similar structure, which can be exploited to obtain
tractable reformulations (see Section 5).
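To make the construction concrete, the following sketch (our own, with hypothetical function names) builds the ball uncertainty sets with the Euclidean norm and no additional support knowledge ($\Xi = \mathbb{R}^d$). For a fixed decision whose cost is linear in $\zeta$, the worst case over a single ball is available in closed form via the Cauchy–Schwarz inequality, so the objective of Problem (2) reduces to an average of $N$ such closed-form terms:

```python
import numpy as np

def epsilon_schedule(N: int, d: int, kappa: float = 1.0) -> float:
    """One concrete choice of the robustness parameter (cf. Assumption 2 in
    Section 4.1): eps_N = kappa * N**(-1 / max(3, d + 1))."""
    return kappa * N ** (-1.0 / max(3, d + 1))

def in_uncertainty_set(zeta, xi_hat, eps) -> bool:
    """Membership test for U_N^j = {zeta : ||zeta - xi_hat||_2 <= eps},
    assuming Xi = R^d (no extra support information)."""
    return float(np.linalg.norm(np.asarray(zeta) - np.asarray(xi_hat))) <= eps

def worst_case_linear(c, xi_hat, eps) -> float:
    """sup of c . zeta over the ball, which equals c . xi_hat + eps * ||c||_2
    (attained at zeta = xi_hat + eps * c / ||c||_2)."""
    c, xi_hat = np.asarray(c), np.asarray(xi_hat)
    return float(c @ xi_hat + eps * np.linalg.norm(c))
```

Handling genuine decision rules, integer decisions, or a nontrivial support set $\Xi$ requires the reformulation techniques discussed in Section 5; this sketch only illustrates the geometry of the sets.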

Our approach, in a nutshell, uses robust optimization as a tool for solving multi-stage stochastic linear
optimization directly from data. More specifically, we obtain decision rules and estimate the optimal
cost of Problem (1) by solving Problem (2). We refer to the proposed data-driven approach for solving multi-stage stochastic linear optimization problems as sample (or sample-path) robust optimization. As
mentioned previously, the purpose of robustness is to ensure that resulting decision rules do not overfit
the historical sample paths. To illustrate this role performed by robustness, we consider the following
example.

Example 1. Consider a supplier which aims to satisfy uncertain demand over two phases at minimal
cost. The supplier selects an initial production quantity at $1 per unit after observing preorders, and
produces additional units at $2 per unit after the regular orders are received. To determine the optimal
production levels, we wish to solve
$$
\begin{aligned}
\text{minimize} \quad & \mathbb{E}\left[ x_2(\xi_1) + 2\, x_3(\xi_1, \xi_2) \right] \\
\text{subject to} \quad & x_2(\xi_1) + x_3(\xi_1, \xi_2) \ge \xi_1 + \xi_2 \quad \text{a.s.} \\
& x_2(\xi_1),\; x_3(\xi_1, \xi_2) \ge 0 \quad \text{a.s.}
\end{aligned}
\tag{3}
$$


The output of the optimization problem is a pair of decision rules, $x_2 : \mathbb{R} \to \mathbb{R}$ and $x_3 : \mathbb{R}^2 \to \mathbb{R}$, which specify what production levels to choose as a function of the demands observed up to that point. The joint
probability distribution of the demand process $(\xi_1, \xi_2) \in \mathbb{R}^2$ is unknown, and the supplier’s knowledge comes from historical demand realizations of past products, denoted by $(\hat{\xi}_1^1, \hat{\xi}_2^1), \ldots, (\hat{\xi}_1^N, \hat{\xi}_2^N)$. For the

sake of illustration, suppose we attempted to approximate Problem (3) by choosing the decision rules
which perform best when averaging over the historical data without any robustness. Such a sample
average approach amounts to solving
$$
\begin{aligned}
\text{minimize} \quad & \frac{1}{N} \sum_{j=1}^{N} \left( x_2(\hat{\xi}_1^j) + 2\, x_3(\hat{\xi}_1^j, \hat{\xi}_2^j) \right) \\
\text{subject to} \quad & x_2(\hat{\xi}_1^j) + x_3(\hat{\xi}_1^j, \hat{\xi}_2^j) \ge \hat{\xi}_1^j + \hat{\xi}_2^j \quad \forall j \in [N] \\
& x_2(\hat{\xi}_1^j),\; x_3(\hat{\xi}_1^j, \hat{\xi}_2^j) \ge 0 \quad \forall j \in [N].
\end{aligned}
$$


Suppose that the random variable $\xi_1$ for preorders has a continuous distribution. In that case, it immediately follows that $\hat{\xi}_1^1, \ldots, \hat{\xi}_1^N$ are almost surely distinct, and thus an optimal decision rule for the above optimization problem is
$$
x_2(\xi_1) = \begin{cases} \hat{\xi}_1^j + \hat{\xi}_2^j, & \text{if } \xi_1 = \hat{\xi}_1^j \text{ for some } j \in [N], \\ 0, & \text{otherwise}; \end{cases} \qquad x_3(\xi_1, \xi_2) = 0.
$$

Unfortunately, these decision rules are nonsensical with respect to Problem (3). Indeed, the decision
rules will not result in feasible decisions for the true stochastic problem with probability one. Moreover,
the optimal cost of the above optimization problem will converge almost surely to E[ξ1 + ξ2 ] as the
number of sample paths N tends to infinity, which can in general be far from that of the stochastic
problem. Clearly, such a sample average approach results in overfitting, even in big data settings, and
thus provides an unsuitable approximation of Problem (3). □
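The two limits described in this example can be checked numerically. The Monte Carlo sketch below assumes, purely for illustration (the paper imposes no such distribution), that $\xi_1$ and $\xi_2$ are i.i.d. Uniform(0,1); under that assumption a standard newsvendor calculation gives the optimal policy $x_2(\xi_1) = \xi_1 + 1/2$ for Problem (3), with optimal cost $1.25$, while the sample-average estimate converges to $\mathbb{E}[\xi_1 + \xi_2] = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500_000
xi1 = rng.uniform(size=N)  # preorders (hypothetical Uniform(0,1) demand)
xi2 = rng.uniform(size=N)  # regular orders

# Cost to which the non-robust sample-average approach converges: its optimal
# decision rules memorize the sample paths, so the estimate tends to E[xi1 + xi2].
overfit_cost = (xi1 + xi2).mean()

# Cost of the true optimal policy of Problem (3) under this distribution:
# produce x2 = xi1 + 1/2 after observing preorders, then cover any shortfall
# x3 = max(xi1 + xi2 - x2, 0) at $2 per unit.
x2 = xi1 + 0.5
x3 = np.maximum(xi1 + xi2 - x2, 0.0)
true_cost = (x2 + 2.0 * x3).mean()

print(f"sample-average estimate ~ {overfit_cost:.2f}")  # close to 1.00
print(f"true optimal cost       ~ {true_cost:.2f}")     # close to 1.25
```

The gap between the two values persists no matter how large $N$ becomes, which is exactly the asymptotic overfitting that the robustness in Problem (2) is designed to eliminate.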

In the following section, we show that the asymptotic overfitting phenomenon illustrated in the above
example is eliminated by adding robustness to the historical data via Problem (2).

4. Asymptotic Optimality
In this section, we provide asymptotic bounds on the gap between the optimal cost of our robust opti-
mization approach, Problem (2), and the optimal cost of the multi-stage stochastic linear optimization
problem, Problem (1), as the number of sample paths grows to infinity. From a practical standpoint,
the bounds suggest that our robust optimization problem can provide a reasonable approximation of
the multi-stage stochastic linear optimization problem in the presence of big data. Moreover, we show
that our bounds collapse in particular examples of multi-stage stochastic linear optimization, in which
case our robust optimization approach is guaranteed to be asymptotically optimal. These results can be
viewed as significant due to the generality of multi-stage stochastic optimization problems considered
in this paper.
In Section 4.1, we describe the assumptions used in the subsequent convergence results. In Section 4.2,
we present the main result of this paper (Theorem 1), which establishes asymptotic lower and upper bounds on our proposed data-driven approach. In Section 4.3, we interpret Theorem 1 through several
examples. In Section 4.4, we present asymptotic feasibility guarantees.

4.1. Assumptions

We begin by introducing our assumptions which will be used for establishing asymptotic optimality
guarantees. First, we will assume that the joint probability distribution of the stochastic process satisfies
the following light-tail assumption:

Assumption 1. There exists a constant $a > 1$ such that $b := \mathbb{E}\left[\exp(\|\xi\|^a)\right] < \infty$.

For example, this assumption is satisfied if the stochastic process has a multivariate Gaussian dis-
tribution, and is not satisfied if the stochastic process has a multivariate exponential distribution.
Importantly, Assumption 1 does not require any parametric assumptions on the correlation structure
of the random variables across stages, and we do not assume that the coefficient a > 1 is known.

Second, we will assume that the robustness parameter N is chosen to be strictly positive and decreases
to zero as more data is obtained at the following rate:
Assumption 2. There exists a constant $\kappa > 0$ such that $\epsilon_N := \kappa N^{-1/\max\{3,\, d+1\}}$.

In a nutshell, Assumption 2 provides a theoretical requirement on how to choose the robustness param-
eter to ensure that Problem (2) will not overfit the historical data (see Example 1 from Section 3). The
rate also provides practical guidance on how the robustness parameter can be updated as more data
is obtained. We note that, for many of the following results, the robustness parameter can decrease to
zero at a faster rate; nonetheless, we shall impose Assumption 2 for all our results for simplicity.
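As a quick numerical sanity check of this rate (with illustrative values $\kappa = 1$ and $d = 2$, so the exponent is $-1/\max\{3, d+1\} = -1/3$), enlarging the dataset by a factor of 8 halves the robustness parameter:

```python
# Decay of the robustness parameter under Assumption 2, with illustrative
# values kappa = 1 and d = 2 (so eps_N = N ** (-1/3)).
for N in (1_000, 8_000, 64_000):
    eps_N = N ** (-1.0 / max(3, 2 + 1))
    print(N, round(eps_N, 4))  # 1000 -> 0.1, 8000 -> 0.05, 64000 -> 0.025
```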

Finally, our convergence guarantees for Problem (2) do not require any restrictions on the space of
decision rules. Our analysis will only require the following assumption on the problem structure.

Assumption 3. There exists an L ≥ 0 such that, for all N ∈ ℕ, the optimal cost of Problem (2) would
not change if we added the following constraints:

    sup_{ζ ∈ ∪_{j=1}^N U_N^j} ‖x_t(ζ_1, . . . , ζ_{t−1})‖ ≤ sup_{ζ ∈ ∪_{j=1}^N U_N^j} L(1 + ‖ζ‖)   ∀t ∈ [T].

This assumption says that there always exists a near-optimal decision rule to Problem (2) where the
decisions which result from realizations in uncertainty sets are bounded by the largest realization in
the uncertainty sets. Moreover, this is a mild assumption that we find can be easily verified in many
practical examples. In Appendix A, we show that every example presented in this paper satisfies this
assumption.

4.2. Main result

We now present the main result of this paper (Theorem 1), which establishes asymptotic lower and
upper bounds on the optimal cost of Problem (2). For notational convenience, let J* be the optimal
cost of Problem (1), Ĵ_N be the optimal cost of Problem (2), and S ⊆ Ξ be the support of the underlying
joint probability distribution of the stochastic process.

First, let J̲ be defined as the maximal optimal cost of any chance-constrained variant of the multi-stage stochastic linear optimization problem:

    J̲ := lim_{ρ↓0}  minimize_{x ∈ X, S̃ ⊆ Ξ}  E[ Σ_{t=1}^T c_t(ξ) · x_t(ξ_1, . . . , ξ_{t−1}) 𝕀{ξ ∈ S̃} ]
                     subject to               Σ_{t=1}^T A_t(ζ) x_t(ζ_1, . . . , ζ_{t−1}) ≤ b(ζ)   ∀ζ ∈ S̃,
                                              P(ξ ∈ S̃) ≥ 1 − ρ.

We observe that the above limit must exist, as the optimal cost of the chance-constrained optimization
problem is monotone in ρ. We also observe that J̲ is always a lower bound on J*, since for every
ρ > 0, adding the constraint P(ξ ∈ S̃) = 1 on the decision variable S̃ to the above chance-constrained
optimization problem would increase its optimal cost to J*.¹

Second, let J̄ be the optimal cost of the multi-stage stochastic linear optimization problem with an
additional restriction that the decision rules are feasible on an expanded support:

    J̄ := lim_{ρ↓0}  minimize_{x ∈ X}  Ē[ Σ_{t=1}^T c_t(ξ) · x_t(ξ_1, . . . , ξ_{t−1}) ]
                     subject to        Σ_{t=1}^T A_t(ζ) x_t(ζ_1, . . . , ζ_{t−1}) ≤ b(ζ)   ∀ζ ∈ Ξ : dist(ζ, S) ≤ ρ.

We remark that the limit as ρ tends down to zero must exist as well, since the optimal cost of the above
optimization problem with expanded support is monotone in ρ. Note also that the expectation in the
objective function has been replaced with Ē[·], which we define here as the local upper semicontinuous
envelope of an expectation, i.e., Ē[f(ξ)] := lim_{ε→0} E[sup_{ζ∈Ξ: ‖ζ−ξ‖≤ε} f(ζ)]. We similarly observe that J̄
is an upper bound on J*, since the above optimization problem involves additional constraints and an
upper envelope of the objective function.

Our main result is the following:

Theorem 1. Suppose Assumptions 1, 2, and 3 hold. Then, P^∞-almost surely we have

    J̲ ≤ lim inf_{N→∞} Ĵ_N ≤ lim sup_{N→∞} Ĵ_N ≤ J̄.

¹ The definition does not preclude the possibility that J̲ is equal to −∞ or ∞. However, we do not expect either of
those values to occur outside of pathological cases; see Section 4.3. The same remark applies to the upper bound J̄.

Proof. See Appendix C. 

Note that Theorem 1 holds in very general cases; for example, it does not require boundedness on the
decisions or random variables, requires no parametric assumptions on the correlations across stages, and
holds when the decisions contain both continuous and integer components. Moreover, these asymptotic
bounds for Problem (2) do not necessitate imposing any restrictions on the space of decision rules.
To the best of our knowledge, such nonparametric convergence guarantees for a sample-path approach
to multi-stage stochastic linear optimization are the first of their kind when uncertainty is correlated
across time.

Our proof of Theorem 1 is based on a new uniform convergence result (Theorem 2) which establishes
a general relationship for arbitrary functions between the in-sample worst-case cost and the expected
out-of-sample cost over the uncertainty sets. We state this theorem below due to its independent interest.

Theorem 2. If Assumptions 1 and 2 hold, then there exists an N̄ ∈ ℕ, P^∞-almost surely, such that

    E[ f(ξ) 𝕀{ξ ∈ ∪_{j=1}^N U_N^j} ] ≤ (1/N) Σ_{j=1}^N sup_{ζ ∈ U_N^j} f(ζ) + M_N sup_{ζ ∈ ∪_{j=1}^N U_N^j} |f(ζ)|

for all N ≥ N̄ and all measurable functions f : ℝ^d → ℝ, where M_N := N^{−1/((d+1)(d+2))} log N.

Proof. See Appendix D. 

We note that our proofs of Theorems 1 and 2 also utilize a feasibility guarantee (Theorem 3) which can
be found in Section 4.4.

4.3. Examples where J̄ − J̲ is zero or strictly positive
Theorem 1 establishes asymptotic lower and upper bounds on the optimal cost of Problem (2). We
next show that the lower and upper bounds can be equal, J̲ = J̄, in which case the optimal cost of
Problem (2) provably converges to the optimal cost of Problem (1). We first show that these bounds
can be equal by revisiting the stochastic inventory management problem from Example 1.

Proposition 1. For Problem (3), J̲ = J*. If there is an optimal x₂*: ℝ → ℝ for Problem (3) which
is continuous, then J̄ = J*.

Proof. See Appendix B.

The proof of this proposition holds for any underlying probability distribution which satisfies Assumption 1. In combination with Theorem 1, the above proposition shows that adding robustness to the
historical data provably overcomes the asymptotic overfitting phenomenon discussed in Section 3.

More generally, the equality of the lower bound J̲ and upper bound J̄ can be established for much
broader classes of problems than Example 1. In Appendix B, we provide sufficient conditions for the
lower and upper bounds to be equal and demonstrate that these sufficient conditions for asymptotic
optimality can be satisfied in the real-world applications studied in Sections 7 and 8.

Unless restrictions are placed on the space of multi-stage stochastic linear optimization problems,
we show next that the lower and upper bounds can have a nonzero gap. In the following, we present
three examples that provide intuition on the situations in which this gap may be strictly positive. Our
first example presents a problem in which the lower bound J̲ is equal to J* but is strictly less than the
upper bound J̄.

Example 2. Consider the single-stage stochastic problem

    minimize_{x₁ ∈ ℤ}   x₁
    subject to          x₁ ≥ ξ₁   a.s.,

where the random variable ξ₁ is governed by the probability distribution P(ξ₁ > α) = (1 − α)^k for fixed
k > 0, and Ξ = [0, 2]. We observe that the support of the random variable is S = [0, 1], and thus the
optimal cost of the stochastic problem is J* = 1. We similarly observe that the lower bound is J̲ = 1
and the upper bound, due to the integrality of the first-stage decision, is J̄ = 2. If ε_N = N^{−1/3}, then we
prove in Appendix E that the bounds in Theorem 1 are tight under different choices of k:

    Range of k      Result
    k ∈ (0, 3)      P^∞( J̲ < lim inf_{N→∞} Ĵ_N = lim sup_{N→∞} Ĵ_N = J̄ ) = 1
    k = 3           P^∞( J̲ = lim inf_{N→∞} Ĵ_N < lim sup_{N→∞} Ĵ_N = J̄ ) = 1
    k ∈ (3, ∞)      P^∞( J̲ = lim inf_{N→∞} Ĵ_N = lim sup_{N→∞} Ĵ_N < J̄ ) = 1

This example shows that gaps can arise between the lower and upper bounds when mild changes in
the support of the underlying probability distribution lead to significant changes in the optimal cost
of Problem (1). Moreover, this example illustrates that each of the inequalities in Theorem 1 can hold
with equality or strict inequality when the feasibility of decisions depends on random variables that
have not yet been realized. 
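Example 2 is simple enough to simulate directly: with the interval uncertainty sets of Section 3, the sample robust problem reduces to Ĵ_N = ⌈min{max_j ξ̂₁^j + ε_N, 2}⌉, so the limiting behavior under different k can be observed numerically. A sketch (sampling ξ₁ = 1 − U^{1/k} by inverse transform; assuming NumPy, and with sample sizes of our choosing):

```python
import math
import numpy as np

def sample_robust_cost(xi_hat: np.ndarray, eps: float) -> int:
    """J_hat_N for Example 2: the smallest integer x1 with x1 >= zeta for every
    zeta in the union of intervals [xi_j - eps, xi_j + eps], intersected with Xi = [0, 2]."""
    worst = min(float(xi_hat.max()) + eps, 2.0)
    return math.ceil(worst)

rng = np.random.default_rng(1)
N = 10_000
eps = N ** (-1.0 / 3.0)
for k in (1.0, 10.0):
    # Inverse transform for P(xi > a) = (1 - a)^k on [0, 1].
    xi_hat = 1.0 - rng.uniform(size=N) ** (1.0 / k)
    print(f"k={k}: J_hat_N = {sample_robust_cost(xi_hat, eps)}")
```

With k = 1 the maximum sample plus ε_N typically exceeds 1, giving Ĵ_N = 2 = J̄; with k = 10 it typically stays below 1, giving Ĵ_N = 1 = J̲, in line with the table above.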

Our second example presents a problem in which the upper bound J̄ is equal to J* but is strictly
greater than the lower bound J̲. This example deals with the special case in which any chance-constrained version of a stochastic problem leads to a decision which is infeasible for the true stochastic
problem.

Example 3. Consider the single-stage stochastic problem

    minimize_{x₁ ∈ ℝ²}   x₁₂
    subject to           ξ₁(1 − x₁₂) ≤ x₁₁   a.s.
                         0 ≤ x₁₂ ≤ 1,

where ξ₁ ∼ Gaussian(0, 1) and Ξ = ℝ. The constraints are satisfied only if x₁₂ = 1, and so the optimal
cost of the stochastic problem is J* = 1. Since there is no expectation in the objective and Ξ equals
the true support, we also observe that J̄ = 1. However, we readily observe that there is always a
feasible solution to the sample robust optimization problem (Problem (2)) where x₁₂ = 0, and therefore
J̲ = Ĵ_N = 0 for all N ∈ ℕ. □

Our third and final example demonstrates the necessity of the upper semicontinuous envelope Ē[·] in
the definition of the upper bound.

Example 4. Consider the two-stage stochastic problem

    minimize_{x₂ : ℝ → ℤ}   E[x₂(ξ₁)]
    subject to              x₂(ξ₁) ≥ ξ₁   a.s.,

where θ ∼ Bernoulli(0.5) and ψ ∼ Uniform(0, 1) are independent random variables, ξ₁ = θψ, and Ξ =
[0, 1]. An optimal decision rule x₂*: ℝ → ℤ for the stochastic problem is given by x₂*(ξ₁) = 0 for all ξ₁ ≤ 0
and x₂*(ξ₁) = 1 for all ξ₁ > 0, which implies that J* = 1/2. It follows from similar reasoning that J̲ = 1/2.
Since Ξ equals the support of the random variable, the only difference between the stochastic problem
and the upper bound is that the latter optimizes over the local upper semicontinuous envelope, and we
observe that lim_{N→∞} Ĵ_N = J̄ = Ē[x₂*(ξ₁)] = 1. □

In each of the above examples, we observe that the bounds in Theorem 1 are tight, in the sense that
the optimal cost of Problem (2) converges either to the lower bound or the upper bound. This provides
some indication that the bounds in Theorem 1 offer an accurate depiction of how Problem (2) can
behave in the asymptotic regime. On the other hand, the above examples which illustrate a nonzero
gap seem to require intricate construction, and future work may identify sub-classes of Problem (1)
where the equality of the bounds can be ensured.

4.4. Feasibility guarantees

We conclude Section 4 by discussing out-of-sample feasibility guarantees for decision rules obtained
from Problem (2). Recall that Problem (2) finds decision rules which are feasible for each realization
in the uncertainty sets. However, one cannot guarantee that these decision rules will be feasible for
realizations outside of the uncertainty sets. Thus, a pertinent question is whether a decision rule obtained
from approximately solving Problem (2) is feasible with high probability. To address the question of
feasibility, we leverage classic results from detection theory.
Let S_N := ∪_{j=1}^N U_N^j be shorthand for the union of the uncertainty sets. We say that a decision rule is
S_N-feasible if

    Σ_{t=1}^T A_t(ζ) x_t(ζ_1, . . . , ζ_{t−1}) ≤ b(ζ)   ∀ζ ∈ S_N.
In other words, the set of feasible decision rules to Problem (2) is exactly the set of those which are S_N-feasible.
Our subsequent analysis utilizes the following (seemingly tautological) observation: for any decision rule
that is S_N-feasible,

    P( Σ_{t=1}^T A_t(ξ) x_t(ξ_1, . . . , ξ_{t−1}) ≤ b(ξ) ) ≥ P(ξ ∈ S_N),

where P(ξ ∈ S_N) is shorthand for P(ξ ∈ S_N | ξ̂^1, . . . , ξ̂^N). Indeed, this inequality follows from the fact
that a decision rule which is S_N-feasible is by definition feasible for all realizations ζ ∈ S_N, and thus
the probability of feasibility is at least the probability that ξ ∈ S_N.

We have thus transformed the analysis of feasible decision rules for Problem (2) to the problem of
analyzing the performance of S_N as an estimate for the support S of a stochastic process. Interestingly,
this nonparametric estimator for the support of a joint probability distribution has been widely studied
in the statistics literature, with perhaps the earliest results coming from Devroye and Wise (1980) in
detection theory. Since then, the performance of S_N as a nonparametric estimate of S has been studied
with applications in cluster analysis and image recognition (Korostelev and Tsybakov 1993, Schölkopf
et al. 2001). Leveraging this connection between stochastic optimization and support estimation, we
obtain the following guarantee on feasibility.

Theorem 3. Suppose Assumptions 1 and 2 hold. Then, P^∞-almost surely we have

    lim_{N→∞}  ( N^{1/(d+1)} / (log N)^{d+1} ) · P(ξ ∉ S_N) = 0.

Proof. See Appendix F. 

Intuitively speaking, Theorem 3 provides a guarantee that any feasible decision rule to Problem (2) will
be feasible with high probability on future data when the number of sample paths is large. To illustrate
why robustness is indeed necessary to achieve such feasibility guarantees, we recall from Example 1 that
decision rules may prohibitively overfit the data and be infeasible with probability one if the robustness
parameter ε_N is set to zero.
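The coverage probability P(ξ ∈ S_N) is easy to estimate by Monte Carlo: build S_N from training sample paths and measure how often fresh paths fall within distance ε_N of some training path. A sketch, with a bivariate Gaussian stochastic process and sample sizes of our choosing (not the paper's), assuming NumPy:

```python
import numpy as np

def in_support_estimate(train: np.ndarray, test: np.ndarray, eps: float) -> float:
    """Fraction of test points within Euclidean distance eps of some training point,
    i.e., a Monte Carlo estimate of P(xi in S_N)."""
    # Pairwise distances between test points (rows) and training points (columns).
    dists = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    return float(np.mean(dists.min(axis=1) <= eps))

rng = np.random.default_rng(2)
d = 2
for N in (100, 1000):
    eps = N ** (-1.0 / max(3, d + 1))  # Assumption 2 with kappa = 1
    train = rng.standard_normal((N, d))
    test = rng.standard_normal((2000, d))
    print(N, in_support_estimate(train, test, eps))
```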

5. Approximation Techniques
In the previous section, we developed theoretical guarantees which demonstrated that Problem (2) provides a good approximation of multi-stage stochastic linear optimization when the number of sample
paths is large. In this section, we demonstrate that Problem (2) can be addressed using approximation
techniques from the field of robust optimization. Specifically, we show that two decision-rule approximation schemes from robust optimization, linear decision rules and finite adaptability, can be extended to
obtain approximations of Problem (2). In particular, we present a novel duality argument (Theorem 4)
which allows the computational cost of these techniques to scale efficiently in the number of sample
paths. The computational tractability and out-of-sample performance of these approximation schemes
are illustrated via numerical experiments in Sections 7 and 8.

5.1. Linear decision rules

Generally speaking, multi-stage optimization problems are computationally demanding due to optimizing over an unrestricted space of decision rules. To overcome this challenge, a common approximation
technique in robust optimization is to restrict the space of decision rules to a space which can more
easily be optimized. As described in Section 1.1, the success of robust optimization as a modeling framework for addressing real-world multi-stage problems is often attributed to the computational tractability
of such decision rule approximations. This section extends one such decision rule scheme, known as
linear decision rules, to approximately solve Problem (2) and illustrates its computational tractability
in big data settings.

Specifically, we consider approximating Problem (2) by restricting its decision rules to those of the
form

    x_t(ζ_1, . . . , ζ_{t−1}) = x_{t,0} + Σ_{s=1}^{t−1} X_{t,s} ζ_s.

Thus, rather than optimizing over the space of all possible decision rules (functions), we instead optimize over a finite collection of decision variables which parameterize a linear decision rule. For the
setting where c_t(ξ) and A_t(ξ) do not depend on the uncertain parameters and all decision variables are
continuous, the resulting linear decision rule approximation of Problem (2) is given by
    minimize    (1/N) Σ_{j=1}^N sup_{ζ ∈ U_N^j} Σ_{t=1}^T c_t · ( x_{t,0} + Σ_{s=1}^{t−1} X_{t,s} ζ_s )
                                                                                                          (4)
    subject to  Σ_{t=1}^T A_t ( x_{t,0} + Σ_{s=1}^{t−1} X_{t,s} ζ_s ) ≤ b(ζ)   ∀ζ ∈ ∪_{j=1}^N U_N^j,

where the decision variables are x_{t,0} ∈ ℝ^{n_t} and X_{t,s} ∈ ℝ^{n_t × d_s} for all 1 ≤ s < t ≤ T, and the affine function
b(ζ) ∈ ℝ^m is shorthand for b_0 + Σ_{t=1}^T B_t ζ_t.
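For illustration, evaluating a candidate linear decision rule on a realized sample path is elementary: each stage applies its affine map to the history observed so far. A minimal sketch with hypothetical dimensions of our choosing, assuming NumPy:

```python
import numpy as np

def evaluate_ldr(x0: list, X: dict, zeta: list) -> list:
    """Evaluate x_t(zeta_1, ..., zeta_{t-1}) = x_{t,0} + sum_{s<t} X[t,s] @ zeta_s per stage.
    x0[t]   : vector x_{t,0} in R^{n_t}
    X[t, s] : matrix X_{t,s} in R^{n_t x d_s} for s < t
    zeta[s] : realized uncertainty zeta_s in R^{d_s}"""
    decisions = []
    for t in range(len(x0)):
        xt = np.array(x0[t], dtype=float)
        for s in range(t):  # decisions depend only on past realizations
            xt = xt + np.asarray(X[t, s]) @ np.asarray(zeta[s], dtype=float)
        decisions.append(xt)
    return decisions

# Two stages: a here-and-now decision and a wait-and-see decision.
x0 = [np.array([1.0]), np.array([0.5])]
X = {(1, 0): np.array([[2.0]])}
print(evaluate_ldr(x0, X, zeta=[np.array([0.25]), np.array([0.0])]))
```

The first-stage decision ignores the uncertainty, while the second-stage decision is 0.5 + 2.0 · ζ₁; optimizing the coefficients is exactly what Problem (4) does.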

Much like linear decision rules in robust optimization, we observe that Problem (4), when feasible,
always produces a feasible decision rule for Problem (2) and an upper bound on its optimal cost.
Nonetheless, Problem (4) has semi-infinite constraints, which must be eliminated in order for the optimization problem to be solvable by off-the-shelf solvers. A standard technique from robust optimization
for eliminating semi-infinite constraints is to introduce (dual) auxiliary decision variables and constraints for each uncertainty set. Importantly, for Problem (4) to be practically tractable in the presence
of big data, the size of an equivalent finite-dimensional optimization problem must scale efficiently in
the number of sample paths.

We now show that Problem (4) can be reformulated as a linear optimization problem whose size scales
linearly in the number of sample paths (Theorem 4). The central idea enabling the following reformulation is that the worst-case realizations over the various uncertainty sets are found by optimizing
over identical linear functions. Thus, when constructing the robust counterparts for each uncertainty
set, we can combine the dual auxiliary decision variables from different uncertainty sets, resulting in a
reformulation where the number of auxiliary decision variables is independent of the number of sample
paths. To illustrate this reformulation technique, we focus on uncertainty sets which satisfy the following
construction:

Assumption 4. The uncertainty sets have the form U_N^j := {ζ ∈ ℝ^d : ℓ^j ≤ ζ ≤ u^j}.

For example, Assumption 4 holds if we choose the set Ξ to be ℝ^d_+ and use the ‖·‖_∞ norm in the
uncertainty sets from Section 3. The following illustrates the novel duality technique described above:

Theorem 4. If Assumption 4 holds, then Problem (4) can be reformulated as a linear optimization
problem with O(md) auxiliary decision variables and O(md + mN ) linear constraints.

Proof. By introducing epigraph variables v_1, . . . , v_N ∈ ℝ, the constraints in Problem (4) can be
rewritten as

    Σ_{t=1}^T ( Σ_{s=t+1}^T X_{s,t}^⊤ c_s ) · ζ_t ≤ v_j − Σ_{t=1}^T c_t · x_{t,0}      ∀ζ ∈ U_N^j,  j ∈ {1, . . . , N},

    Σ_{t=1}^T ( −B_t + Σ_{s=t+1}^T A_s X_{s,t} ) ζ_t ≤ b_0 − Σ_{t=1}^T A_t x_{t,0}    ∀ζ ∈ U_N^j,  j ∈ {1, . . . , N}.

We will now reformulate each of these semi-infinite constraints by introducing auxiliary variables. First,
we observe that each of the above semi-infinite constraints can be rewritten as

    max_{ζ ∈ U_N^j} Σ_{t=1}^T d_t · ζ_t ≤ γ

for some vector d := (d_1, . . . , d_T) ∈ ℝ^d and scalar γ ∈ ℝ. Moreover, it follows from strong duality for
linear optimization that

    max_{ζ ∈ U_N^j} Σ_{t=1}^T d_t · ζ_t  =  minimize_{μ_t, λ_t ∈ ℝ^{d_t}_+}  Σ_{t=1}^T ( u_t^j · μ_t − ℓ_t^j · λ_t )
                                            subject to                       μ_t − λ_t = d_t   ∀t ∈ [T],

where u^j := (u_1^j, . . . , u_T^j) ∈ ℝ^d and ℓ^j := (ℓ_1^j, . . . , ℓ_T^j) ∈ ℝ^d are the upper and lower bounds which define
the uncertainty set. We readily observe that the solutions μ_t = [d_t]_+ and λ_t = [−d_t]_+ are optimal for the
above optimization problem. Importantly, these optimal solutions to the dual problem are independent
of the index j. Thus, the semi-infinite constraints in the epigraph formulation of Problem (4) are satisfied
if and only if there exist α := (α_1, . . . , α_T) ∈ ℝ^d_+ and β := (β_1, . . . , β_T) ∈ ℝ^d_+ which satisfy
    Σ_{t=1}^T ( α_t · u_t^j − β_t · ℓ_t^j + c_t · x_{t,0} ) ≤ v_j   ∀j ∈ [N]

    α_t − β_t = Σ_{s=t+1}^T X_{s,t}^⊤ c_s   ∀t ∈ [T]

and there exist M := (M_1, . . . , M_T) ∈ ℝ^{m×d}_+ and Λ := (Λ_1, . . . , Λ_T) ∈ ℝ^{m×d}_+ which satisfy

    Σ_{t=1}^T ( M_t u_t^j − Λ_t ℓ_t^j + A_t x_{t,0} ) ≤ b_0   ∀j ∈ [N]

    M_t − Λ_t = −B_t + Σ_{s=t+1}^T A_s X_{s,t}   ∀t ∈ [T].

Removing the epigraph decision variables, the resulting reformulation of Problem (4) is

    minimize    (1/N) Σ_{j=1}^N Σ_{t=1}^T ( α_t · u_t^j − β_t · ℓ_t^j + c_t · x_{t,0} )

    subject to  α_t − β_t = Σ_{s=t+1}^T X_{s,t}^⊤ c_s                          t ∈ [T]

                Σ_{t=1}^T ( M_t u_t^j − Λ_t ℓ_t^j + A_t x_{t,0} ) ≤ b_0        j ∈ [N]

                M_t − Λ_t = −B_t + Σ_{s=t+1}^T A_s X_{s,t}                     t ∈ [T],

where the auxiliary decision variables are α ≡ (α_1, . . . , α_T), β ≡ (β_1, . . . , β_T) ∈ ℝ^d_+ and M ≡
(M_1, . . . , M_T), Λ ≡ (Λ_1, . . . , Λ_T) ∈ ℝ^{m×d}_+. Thus, the reformulation technique allowed us to decrease the
number of auxiliary decision variables from O(N md) to O(md). □
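The key step above, that the box-constrained maximum max_{ℓ ≤ ζ ≤ u} d · ζ is attained by the closed-form dual solution μ = [d]_+, λ = [−d]_+ with value u · [d]_+ − ℓ · [−d]_+, is easy to verify numerically. A small sketch, assuming NumPy:

```python
import numpy as np

def box_max_primal(d: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> float:
    """max_{lo <= z <= hi} d.z, attained coordinatewise: z_i = hi_i if d_i >= 0 else lo_i."""
    return float(np.where(d >= 0, hi, lo) @ d)

def box_max_dual(d: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> float:
    """Closed-form dual value u.[d]_+ - l.[-d]_+ with mu = [d]_+ and lam = [-d]_+."""
    mu, lam = np.maximum(d, 0.0), np.maximum(-d, 0.0)
    return float(hi @ mu - lo @ lam)

rng = np.random.default_rng(3)
d = rng.standard_normal(5)
lo = -rng.uniform(size=5)
hi = rng.uniform(size=5)
assert np.isclose(box_max_primal(d, lo, hi), box_max_dual(d, lo, hi))
print("strong duality holds on the box uncertainty set")
```

Because the dual solution does not depend on the box bounds, the same (α, β, M, Λ) can be shared across all N uncertainty sets, which is exactly what collapses the variable count from O(Nmd) to O(md).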

While linear decision rules can sometimes provide a near-optimal approximation of Problem (2) (see
Section 8), we do not expect this to be the case in general. Indeed, we recall from Section 4 that
Problem (2) can provide a near-optimal approximation of Problem (1), and it has been known from
the early literature that linear decision rules generally provide a poor approximation for multi-stage
stochastic linear optimization; see, e.g., Garstka and Wets (1974, Section 6). Nonetheless, we can obtain
tighter approximations of Problem (2) by selecting a richer space of decision rules, an abundance of
which can be found in the robust optimization literature (see Section 5.2 for an example). Moreover,
Problem (2) is also amenable to new approximation schemes which exploit its particular structure; we
refer to our companion paper Bertsimas et al. (2021) for such an approximation algorithm for two-stage
problems. In all cases, and as a result of the convergence guarantees from Section 4, Problem (2) offers
an opportunity to extend algorithmic advances from robust optimization to obtain approximations of
multi-stage stochastic linear optimization.

5.2. Finite adaptability

In this section, we show how to extend the decision rule approximation scheme of finite adaptability from
robust optimization (Bertsimas and Caramanis 2010) to obtain tighter approximations of Problem (2).
Specifically, finite adaptability partitions the set Ξ into smaller regions, and then optimizes a separate
static or linear decision rule in each region. The approach of finite adaptability extends to problems
with integer decision variables, and the practitioner can trade off the tightness of their approximations
with an increase in computational cost. We show that the duality techniques from the previous section
(Theorem 4) readily extend to this richer class of decision rules, and a practical demonstration of finite
adaptability is presented in Section 7.

We begin by describing the approximation scheme of finite adaptability from robust optimization.
In finite adaptability, one partitions the uncertainty set into different regions, and optimizes a separate
linear decision rule for each region. Let P^1, . . . , P^K ⊆ ℝ^d be regions which form a partition of Ξ ⊆ ℝ^d.
For each stage t, let P_t^k ⊆ ℝ^{d_1+···+d_t} be the projection of the region P^k onto the first t stages. Then, we
consider approximating Problem (2) by restricting its decision rules to those of the form
    x_t(ζ_1, . . . , ζ_{t−1}) =  { x_{t,0}^1 + Σ_{s=1}^{t−1} X_{t,s}^1 ζ_s,    if (ζ_1, . . . , ζ_{t−1}) ∈ P_{t−1}^1,
                                   ⋮
                                   x_{t,0}^K + Σ_{s=1}^{t−1} X_{t,s}^K ζ_s,   if (ζ_1, . . . , ζ_{t−1}) ∈ P_{t−1}^K.

In contrast to a single linear decision rule, finite adaptability allows for greater degrees of freedom at
a greater computational cost. Indeed, for each region P^k, we choose a separate linear decision rule
which is locally optimal for that region. To accommodate integer decision variables, we restrict the
corresponding component of each x_{t,0}^k to be integer and restrict the associated rows of each matrix X_{t,s}^k
to be zero.
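Operationally, a finitely adaptable rule dispatches on which region of the partition contains the realized history and then applies that region's affine rule. A minimal sketch with a hypothetical two-region partition of Ξ = [0, 1] and rules of our own choosing, assuming NumPy:

```python
import numpy as np

def finitely_adaptable_decision(zeta_hist: np.ndarray, regions: list, rules: list) -> np.ndarray:
    """Pick the first region whose box contains the realized history, then evaluate
    that region's linear rule x^k_{t,0} + X^k @ zeta_hist."""
    for (lo, hi), (x0, X) in zip(regions, rules):
        if np.all(lo <= zeta_hist) and np.all(zeta_hist <= hi):
            return x0 + X @ zeta_hist
    raise ValueError("history not contained in any region of the partition")

# Partition of Xi = [0, 1] into [0, 0.5] and [0.5, 1]; one affine rule per region.
regions = [(np.array([0.0]), np.array([0.5])), (np.array([0.5]), np.array([1.0]))]
rules = [(np.array([0.0]), np.array([[1.0]])),   # rule for the low-demand region
         (np.array([1.0]), np.array([[2.0]]))]   # more aggressive rule for high demand
print(finitely_adaptable_decision(np.array([0.3]), regions, rules))
```

On the shared boundary 0.5 the two rules would need to agree; this is precisely the indistinguishability issue that Proposition 2 below handles with linking constraints.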

A complication of finite adaptability is that we may not have enough information at any intermediary
stage to determine which region P^k will contain the entire trajectory. In other words, at the start of
stage t, a decision must be chosen after only observing the values of (ζ_1, . . . , ζ_{t−1}), and there may be two
or more regions of the partition for which their projections P_{t−1}^k and P_{t−1}^{k′} are overlapping. Fortunately,
the following proposition shows that the aforementioned complication caused by overlapping projections
can be resolved by adding constraints of the form x_t^k = x_t^{k′} and X_{t,s}^k = X_{t,s}^{k′} for every 1 ≤ s < t when the
regions P^k and P^{k′} are indistinguishable at stage t.

Proposition 2 (Proposition 4, Bertsimas and Dunning (2016)). If there exist ζ ≡
(ζ_1, . . . , ζ_T) ∈ P^k and ζ′ ≡ (ζ′_1, . . . , ζ′_T) ∈ P^{k′} such that (ζ_1, . . . , ζ_{t−1}) = (ζ′_1, . . . , ζ′_{t−1}), and ζ ∈ int(P^k)
or ζ′ ∈ int(P^{k′}) holds, then we must enforce the constraints x_{t,0}^k = x_{t,0}^{k′} and X_{t,s}^k = X_{t,s}^{k′} for all
1 ≤ s < t at stage t, as the two regions cannot be distinguished with the uncertain parameters realized by
that stage. Otherwise, we do not need to enforce any constraints at stage t for this pair.

For brevity, we let T(P^1, . . . , P^K) denote the collection of tuples (k, k′, t) for which P^k and P^{k′} cannot
be distinguished at stage t, which we assume can be tractably computed.

We now extend the approach of finite adaptability to Problem (2). Let P^1, . . . , P^K be a given partition
of Ξ, and let the intersections between regions of the partition and uncertainty sets be denoted by
K_j := {k ∈ [K] : U_N^j ∩ P^k ≠ ∅}. For the setting where c_t(ξ) and A_t(ξ) do not depend on the uncertain
parameters, the resulting finite adaptability approximation of Problem (2) is given by
    minimize    (1/N) Σ_{j=1}^N max_{k ∈ K_j} max_{ζ ∈ U_N^j ∩ P^k} Σ_{t=1}^T c_t · ( x_{t,0}^k + Σ_{s=1}^{t−1} X_{t,s}^k ζ_s )

    subject to  Σ_{t=1}^T A_t ( x_{t,0}^k + Σ_{s=1}^{t−1} X_{t,s}^k ζ_s ) ≤ b(ζ)   ∀ζ ∈ ∪_{j=1}^N U_N^j ∩ P^k,  k ∈ [K]       (5)

                x_t^k = x_t^{k′},  X_{t,s}^k = X_{t,s}^{k′}   ∀(k, k′, t) ∈ T(P^1, . . . , P^K),  1 ≤ s < t,

where the decision variables are x_{t,0}^k ∈ ℝ^{n_t} and X_{t,s}^k ∈ ℝ^{n_t × d_s} for all 1 ≤ s < t and k ∈ [K].

Speaking intuitively, the approximation gap between Problem (2) and Problem (5) depends on the
selection and granularity of the partition. By choosing partitions with a greater number of regions,
Problem (5) can produce a tighter approximation of Problem (2), although this comes with an increase
in problem size. For heuristic algorithms for selecting the partitions, we refer the reader to Postek and
Den Hertog (2016) and Bertsimas and Dunning (2016). Once the partitions are determined, we obtain
a reformulation of Problem (5) by employing the same duality techniques as in Section 5.1. To this end,
we will assume that the intersections between the regions of the partition and uncertainty sets take a
rectangular form:

Assumption 5. The intersection between each uncertainty set and region of the partition either has
the form U_N^j ∩ P^k := {ζ ∈ ℝ^d : ℓ^{jk} ≤ ζ ≤ u^{jk}} or is empty.

We remark that this assumption can be guaranteed under the same conditions as Assumption 4 when
the partition’s regions are constructed as hyperrectangles. We now show that Problem (5) can be
reformulated as a finite-dimensional linear optimization problem which scales lightly in the number of
sample paths N as well as the number of regions K.

Corollary 1. If Assumption 5 holds, then Problem (5) can be reformulated by adding at most O(N + Kmd)
auxiliary continuous decision variables and O(m Σ_{j=1}^N |K_j| + Kmd) linear constraints. The reformulation is

    minimize    (1/N) Σ_{j=1}^N v_j

    subject to  Σ_{t=1}^T ( u_t^{jk} · α_t^k − ℓ_t^{jk} · β_t^k + c_t · x_{t,0}^k ) ≤ v_j     j ∈ [N], k ∈ K_j

                α_t^k − β_t^k = Σ_{s=t+1}^T (X_{s,t}^k)^⊤ c_s                                 t ∈ [T], k ∈ [K]

                Σ_{t=1}^T ( M_t^k u_t^{jk} − Λ_t^k ℓ_t^{jk} + A_t x_{t,0}^k ) ≤ b_0           j ∈ [N], k ∈ K_j

                M_t^k − Λ_t^k = −B_t + Σ_{s=t+1}^T A_s X_{s,t}^k                              t ∈ [T], k ∈ [K]

                x_t^k = x_t^{k′},  X_{t,s}^k = X_{t,s}^{k′}                                   (k, k′, t) ∈ T(P^1, . . . , P^K),  1 ≤ s < t,

where the auxiliary decision variables are v ∈ ℝ^N as well as α^k := (α_1^k, . . . , α_T^k), β^k := (β_1^k, . . . , β_T^k) ∈
ℝ^d_+ and M^k := (M_1^k, . . . , M_T^k), Λ^k := (Λ_1^k, . . . , Λ_T^k) ∈ ℝ^{m×d}_+ for each k ∈ [K]. Note that b(ζ) := b_0 +
Σ_{t=1}^T B_t ζ_t ∈ ℝ^m.

Proof. The proof follows from similar duality techniques as Theorem 4 and is thus omitted.

This result suggests that Problem (2) with finite adaptability is scalable, in the sense that the size of the
resulting reformulation for a given partition P^1, . . . , P^K scales lightly in the number of sample paths N.
Assuming that the partition's regions and uncertainty sets are hyperrectangles, we remark that ℓ^{jk}, u^{jk},
and T(P^1, . . . , P^K) can be obtained efficiently by computing the intersection of each uncertainty set
and region of the partition.
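When both the uncertainty sets and the partition's regions are hyperrectangles, the bounds ℓ^{jk}, u^{jk} of Assumption 5 come from a coordinatewise intersection, which also yields the index sets K_j. A sketch with a hypothetical uncertainty set and two-region partition, assuming NumPy:

```python
import numpy as np

def box_intersection(lo1, hi1, lo2, hi2):
    """Intersect two hyperrectangles; return (lo, hi), or None if the intersection is empty."""
    lo, hi = np.maximum(lo1, lo2), np.minimum(hi1, hi2)
    return (lo, hi) if np.all(lo <= hi) else None

# Uncertainty set [0.2, 0.6]^2 against a partition of [0, 1]^2 into left/right halves.
u_lo, u_hi = np.array([0.2, 0.2]), np.array([0.6, 0.6])
partition = [(np.array([0.0, 0.0]), np.array([0.5, 1.0])),
             (np.array([0.5, 0.0]), np.array([1.0, 1.0]))]
K_j = [k for k, (p_lo, p_hi) in enumerate(partition)
       if box_intersection(u_lo, u_hi, p_lo, p_hi) is not None]
print(K_j)  # both halves intersect this uncertainty set
```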

6. Relationships with Distributionally Robust Optimization


In the previous sections, we discussed the theoretical underpinnings and computational tractability
of Problem (2) as a data-driven approach to multi-stage stochastic linear optimization. An attractive
aspect of the proposed approach is its simplicity, interpretable as a straightforward robustification of
historical sample paths. In this section, we explore connections between our data-driven approach to
multi-stage stochastic linear optimization and distributionally robust optimization, and discuss implications of our results for the latter.

Our exposition in this section focuses on the following formulation of multi-stage distributionally
robust linear optimization:

    minimize_{x ∈ X}   sup_{Q ∈ A_N}  E_Q[ Σ_{t=1}^T c_t(ξ) · x_t(ξ_1, . . . , ξ_{t−1}) ]
                                                                                                (6)
    subject to         Σ_{t=1}^T A_t(ξ) x_t(ξ_1, . . . , ξ_{t−1}) ≤ b(ξ)   Q-a.s.,  ∀Q ∈ A_N.

Intuitively speaking, this framework chooses the decision rules which minimize the expected cost with
respect to an adversarially chosen probability distribution from an ambiguity set. The requirement that
the constraints hold almost surely for every distribution in the ambiguity set ensures that the objective
function will evaluate the cost function on realizations of the stochastic process where the decision rules
are feasible. Examples of this formulation in multi-stage and data-driven two-stage problems include
Bertsimas et al. (2019) and Hanasusanto and Kuhn (2018).

Our following discussion focuses on ambiguity sets which are constructed using historical data and
Wasserstein-based distances between probability distributions. Given two bounded probability distributions, their ∞-Wasserstein distance is defined as

    d_∞(Q, Q′) := inf { Π-ess sup_{Ξ×Ξ} ‖ξ − ξ′‖ : Π is a joint distribution of ξ and ξ′ with marginals Q and Q′, respectively },

where the essential supremum of the joint distribution is given by

    Π-ess sup_{Ξ×Ξ} ‖ξ − ξ′‖ := inf { M : Π( ‖ξ − ξ′‖ > M ) = 0 }.

From an intuitive standpoint, we note, in the case of d = 1, that the ∞-Wasserstein distance between
two bounded probability distributions can be interpreted as the maximum distance between the quantile
functions of the two distributions; see Ramdas et al. (2017). For any p ∈ [1, ∞), the p-Wasserstein
distance between two probability distributions is defined as

    d_p(Q, Q′) := inf { ( ∫_{Ξ×Ξ} ‖ξ − ξ′‖^p dΠ(ξ, ξ′) )^{1/p} : Π is a joint distribution of ξ and ξ′ with marginals Q and Q′, respectively }.

For technical details on these distances, we refer the reader to Givens and Shortt (1984). For any
p ∈ [1, ∞], let the p-Wasserstein ambiguity set be defined as

    A_N := { Q ∈ P(Ξ) : d_p(Q, P̂_N) ≤ ε_N },
where ε_N ≥ 0 is a robustness parameter which controls the size of the ambiguity set and P̂_N is the
empirical probability distribution which assigns equal weight to each of the historical sample paths
ξ̂^1, . . . , ξ̂^N. We henceforth refer to Problem (6) with the p-Wasserstein ambiguity set as p-WDRO.

As discussed at the end of Section 1.1, there are relatively few previous convergence guarantees
for distributionally robust optimization with the ∞-Wasserstein ambiguity set, even for single-stage
problems. Indeed, when the underlying distribution is unbounded, the ∞-Wasserstein ambiguity set will
never contain the true distribution, even as N tends to infinity, since the distance d_∞(P, P̂_N) from the
true to the empirical distribution will always be infinite. Thus, except under stronger assumptions than
Assumption 1, the techniques used by Mohajerin Esfahani and Kuhn (2018, Theorems 3.5 and 3.6) to
establish finite-sample and convergence guarantees for the 1-Wasserstein ambiguity set do not extend
to the ∞-Wasserstein ambiguity set. Nonetheless, distributionally robust optimization with the ∞-Wasserstein ambiguity set has recently received interest in the context of regularization and adversarial
training in machine learning (Gao et al. 2017, Staib and Jegelka 2017).

The following proposition shows that Problem (2), under the particular construction of uncertainty
sets from Section 3, can also be interpreted as Problem (6) with the ∞-Wasserstein ambiguity set.

Proposition 3. Problem (2) with uncertainty sets of the form


$$ \mathcal{U}_N^j := \left\{ \zeta \equiv (\zeta_1, \ldots, \zeta_T) \in \Xi : \|\zeta - \hat{\xi}^j\| \le \epsilon_N \right\} $$

is equivalent to ∞-WDRO.

Proof. See Appendix G. 

Therefore, as a byproduct of Theorem 1 from Section 4, we have obtained general convergence guarantees
for distributionally robust optimization using the ∞-Wasserstein ambiguity set under mild probabilistic
assumptions.

For comparison, we now show that similar asymptotic optimality guarantees for multi-stage stochastic
linear optimization are not obtained by p-WDRO for any p ∈ [1, ∞). Indeed, the following proposition
shows that the constraints induced by such an approach are overly conservative in general.

Proposition 4. If $p \in [1, \infty)$ and $\epsilon_N > 0$, then a decision rule is feasible for p-WDRO only if
$$ \sum_{t=1}^{T} A_t(\zeta)\, x_t(\zeta_1, \ldots, \zeta_{t-1}) \le b(\zeta) \quad \forall \zeta \in \Xi. $$

Proof. See Appendix H. 



As discussed in Section 2, the set Ξ is not necessarily a tight approximation of the true (unknown)
support of the stochastic process, and may be strictly and significantly larger. Thus, the constraints
induced from p-WDRO with p ∈ [1, ∞) may eliminate optimal or high-quality decision rules for Problem (1). Consequently, p-WDRO with p ∈ [1, ∞) is not asymptotically optimal for multi-stage stochastic
linear optimization in general. We conclude this section with two further remarks.

Remark 1. If we relax the constraints of p-WDRO with p ∈ [1, ∞) in an attempt to decrease its conservatism, then the resulting decision rules are not guaranteed to be feasible for the stochastic problem. Thus, the finite-sample guarantees provided by Mohajerin Esfahani and Kuhn (2018, Equation 2), which served as one of the principal justifications for using p-WDRO, would no longer provide meaningful insight into the true out-of-sample performance of these decision rules. □

Remark 2. The conservatism of p-WDRO can lead to suboptimal decisions, even for problems where
uncertainty does not impact feasibility, if the true support is not known exactly. For example, consider
the problem
$$ \begin{aligned} \underset{x_2: \mathbb{R} \to \mathbb{R},\; x_3: \mathbb{R}^2 \to \mathbb{R}}{\text{minimize}} \quad & \mathbb{E}\left[ x_2(\xi_1) + 2 x_3(\xi_1, \xi_2) \right] \\ \text{subject to} \quad & x_2(\xi_1) + x_3(\xi_1, \xi_2) \ge \xi_1 + \xi_2 \quad \text{a.s.} \\ & x_2(\xi_1) + x_3(\xi_1, \xi_2) \ge \xi_1 - \xi_2 \quad \text{a.s.} \end{aligned} $$


We observe that x2 (ξ1 ) = ξ1 and x3 (ξ1 , ξ2 ) = |ξ2 | are feasible decision rules, regardless of the underlying
probability distribution. Suppose that the probability distribution and support of (ξ₁, ξ₂) are unknown,
and our only information comes from historical data. If we approximate this stochastic problem using
p-WDRO for any p ∈ [1, ∞) and linear decision rules, we are tasked with solving

$$ \begin{aligned} \underset{x_{2,0},\, x_{2,1},\, x_{3,0},\, x_{3,1},\, x_{3,2} \in \mathbb{R}}{\text{minimize}} \quad & \sup_{\mathbb{Q} \in \mathcal{A}_N} \mathbb{E}_{\mathbb{Q}}\left[ (x_{2,0} + x_{2,1} \xi_1) + 2 (x_{3,0} + x_{3,1} \xi_1 + x_{3,2} \xi_2) \right] \\ \text{subject to} \quad & (x_{2,0} + x_{2,1} \zeta_1) + (x_{3,0} + x_{3,1} \zeta_1 + x_{3,2} \zeta_2) \ge \zeta_1 + \zeta_2 \quad \forall \zeta \in \mathbb{R}^2 \\ & (x_{2,0} + x_{2,1} \zeta_1) + (x_{3,0} + x_{3,1} \zeta_1 + x_{3,2} \zeta_2) \ge \zeta_1 - \zeta_2 \quad \forall \zeta \in \mathbb{R}^2. \end{aligned} $$

It follows from identical reasoning as Bertsimas et al. (2019, Section 3) that there are no linear decision
rules which are feasible for the above optimization problem. In particular, the above optimization
problem will remain infeasible even if the true support of the random variable happens to be bounded
but the bound is unknown. In contrast, the sample robust optimization approach (Problem (2)) to this
example will always have a feasible linear decision rule. A similar example is found in Section 8. 
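The infeasibility of linear decision rules in this example can be verified by comparing coefficients; the following short derivation is our gloss on the reasoning cited from Bertsimas et al. (2019, Section 3), not text from the paper:

```latex
% A linear inequality a_0 + a_1 \zeta_1 + a_2 \zeta_2 \ge 0 holds for all
% \zeta \in \mathbb{R}^2 only if a_1 = a_2 = 0 and a_0 \ge 0.
% The first constraint, rearranged, reads
\underbrace{(x_{2,0} + x_{3,0})}_{a_0}
  + \underbrace{(x_{2,1} + x_{3,1} - 1)}_{a_1}\,\zeta_1
  + \underbrace{(x_{3,2} - 1)}_{a_2}\,\zeta_2 \;\ge\; 0
  \quad\Longrightarrow\quad x_{3,2} = 1,
% while the second constraint has coefficient (x_{3,2} + 1) on \zeta_2
\quad\Longrightarrow\quad x_{3,2} = -1.
```

Since no $x_{3,2}$ satisfies both requirements simultaneously, no linear decision rule is feasible over all of $\mathbb{R}^2$, whereas the nonlinear rule $x_3(\xi_1, \xi_2) = |\xi_2|$ remains feasible.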

7. Application to a Stochastic Inventory Replenishment Problem


In our first set of numerical experiments, we consider a stochastic inventory replenishment problem
in a network with a single warehouse and multiple retailers. Our setting is motivated by the real-
world case study of Avrahami et al. (2014), who study a supply chain controlled by a major Israeli

publisher that sells a weekly magazine. The goal of the publisher is to find production and replenishment
policies for delivering magazines to retailers which minimize their expected weekly operating costs. We
perform numerical experiments to assess the practical value of our robust optimization approach in this
application in comparison to alternative data-driven approaches from the literature.

7.1. Problem description

The stochastic inventory replenishment problem faced by the publisher can be modeled as a three-stage
stochastic linear optimization problem with mixed-integer decision variables and a multi-dimensional
stochastic process. Let the number of retailers in the supply chain network be denoted by R ∈ N. The
time horizon of the decision problem faced by the publisher corresponds to one calendar week, beginning
on Sunday and ending on Saturday, in which procurement and replenishment decisions are made at the
start and middle of the week. The dynamics of the decision problem are summarized below:

• (Sunday) At the beginning of the week, the publisher decides the production quantity and the
destinations of the weekly magazine. These decisions are captured by the decision variables
Q11 , . . . , Q1R ≥ 0, which represent the number of produced magazines delivered directly to each
of the retailers at the start of the week, and the decision variable Q10 ≥ 0, which represents the
number of produced magazines sent to a common warehouse. Magazines are produced at a cost of
$c$ per unit, and $\sum_{r=0}^{R} Q_{1r}$ is the total number of magazines that are produced for the entire week.

• (Sunday - Tuesday) After the publisher has made the initial production decisions, each retailer
r ∈ {1, . . . , R} uses their inventory of Q1r magazines to satisfy the random customer demand ξ1r
from the first half of the week. There is no backlogging for demand that exceeds the available
inventory, and the inventory at each retailer at the end of the first half of the week is thus equal
to max{0, Q1r − ξ1r }. The publisher incurs a cost of b per unit of unmet customer demand at each
of the retailers, and the realized demands from the first half of the week ξ 1 := (ξ11 , . . . , ξ1R ) are
observed by the publisher on Tuesday night.

• (Wednesday) At the midweek point, the publisher can replenish the inventories of the retailers
using the magazines from the warehouse. These replenishment quantities to each of the retailers are
captured by second-stage decision rules, Q21 (ξ 1 ), . . . , Q2R (ξ 1 ) ≥ 0. The replenishment quantities
are constrained by the inventory at the warehouse, $\sum_{r=1}^{R} Q_{2r}(\xi_1) \le Q_{10}$, and the publisher pays a
fixed shipping cost f ≥ 0 for each retailer to which it sends a nonzero replenishment quantity.

• (Wednesday - Saturday) After the publisher has made the replenishment decisions, each retailer
r ∈ {1, . . . , R} uses their inventory of max{0, Q1r − ξ1r } + Q2r (ξ 1 ) magazines to satisfy the random
customer demand ξ2r from the second half of the week. The random demands in the second half
of the week at each of the retailers are represented by ξ 2 := (ξ21 , . . . , ξ2R ).

• (Saturday) At the end of the week, the publisher incurs a holding cost of h per unit and a back-
logging cost of b per unit at each of the retailers, and a holding cost of h per unit is also applied
to any remaining units at the warehouse.

It follows from the above dynamics that the stochastic inventory replenishment problem faced by the
publisher corresponds to a three-stage stochastic nonlinear optimization problem of the form
$$ \begin{aligned} \underset{Q \ge 0,\; z \in \{0,1\}^R,\; I}{\text{minimize}} \quad & \mathbb{E}\Bigg[ c \left( Q_{10} + \sum_{r=1}^{R} Q_{1r} \right) + h \left( Q_{10} - \sum_{r=1}^{R} Q_{2r}(\xi_1) \right) + b \sum_{r=1}^{R} \max\{0, \xi_{1r} - Q_{1r}\} \\ & \qquad + f \sum_{r=1}^{R} z_r(\xi_1) + b \sum_{r=1}^{R} \max\{0, -I_{3r}(\xi_1, \xi_2)\} + h \sum_{r=1}^{R} \max\{0, I_{3r}(\xi_1, \xi_2)\} \Bigg] \\ \text{subject to} \quad & \sum_{r=1}^{R} Q_{2r}(\xi_1) \le Q_{10} \quad \text{a.s.} \\ & I_{3r}(\xi_1, \xi_2) = \max\{0, Q_{1r} - \xi_{1r}\} + Q_{2r}(\xi_1) - \xi_{2r} \quad \forall r \in [R], \text{ a.s.} \\ & z_r(\xi_1) M \ge Q_{2r}(\xi_1) \quad \forall r \in [R], \text{ a.s.}, \end{aligned} $$


where the decision rules I31 (ξ 1 , ξ 2 ), . . . , I3R (ξ 1 , ξ 2 ) represent the net inventory at each retailer at the
end of the week, the decision rules z1 (ξ 1 ), . . . , zR (ξ 1 ) ∈ {0, 1} represent whether a fixed cost should be
applied at each retailer, and M is a sufficiently large big-M constant. Following similar reasoning as
Avrahami et al. (2014, Appendix A), we show in Appendix J.1 that the above optimization problem
can be equivalently reformulated as the following three-stage stochastic linear optimization problem:
$$ \begin{aligned} \underset{Q \ge 0,\; z \in \{0,1\}^R,\; v}{\text{minimize}} \quad & \mathbb{E}\left[ c \left( Q_{10} + \sum_{r=1}^{R} Q_{1r} \right) + h Q_{10} + \sum_{r=1}^{R} v_r(\xi_1, \xi_2) + f \sum_{r=1}^{R} z_r(\xi_1) \right] \\ \text{subject to} \quad & \sum_{r=1}^{R} Q_{2r}(\xi_1) \le Q_{10} \quad \text{a.s.} \\ & v_r(\xi_1, \xi_2) \ge b\left( \xi_{2r} + \xi_{1r} - Q_{2r}(\xi_1) - Q_{1r} \right) - h Q_{2r}(\xi_1) \quad \forall r \in [R], \text{ a.s.} \qquad (7) \\ & v_r(\xi_1, \xi_2) \ge h\left( Q_{1r} - \xi_{1r} - \xi_{2r} \right) \quad \forall r \in [R], \text{ a.s.} \\ & v_r(\xi_1, \xi_2) \ge b\left( \xi_{1r} - Q_{1r} \right) - h \xi_{2r} \quad \forall r \in [R], \text{ a.s.} \\ & z_r(\xi_1) M \ge Q_{2r}(\xi_1) \quad \forall r \in [R], \text{ a.s.} \end{aligned} $$
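Because each epigraph variable $v_r(\xi_1, \xi_2)$ appears in the objective with a positive weight and is restricted only by lower bounds, at optimality it is tight at the largest of its three lower bounds. This gives a direct way to evaluate the realized second-half cost of fixed quantities on a demand scenario; a minimal sketch in Python (the function name is ours; the default costs $b = 0.5$, $h = 0.05$ are the values used in the experiments below):

```python
def retailer_cost(q1, q2, xi1, xi2, b=0.5, h=0.05):
    """Realized value of the epigraph variable v_r in Problem (7)
    for a fixed initial quantity q1, replenishment q2, and demand
    realizations (xi1, xi2): the max of its three linear lower bounds."""
    return max(
        b * (xi2 + xi1 - q2 - q1) - h * q2,
        h * (q1 - xi1 - xi2),
        b * (xi1 - q1) - h * xi2,
    )

# e.g., demand exceeding available stock in both halves of the week
print(retailer_cost(q1=6.0, q2=2.0, xi1=8.0, xi2=3.0))  # 1.4
```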

7.2. Experiments

We compare several approaches for finding decision rules for Problem (7) when the only information
on the joint probability distribution of the stochastic process comes from historical data. Specifically,
following the discussion of Avrahami et al. (2014), we assume that the publisher has collected historical
data consisting of the demands for magazines in past weeks, $\hat{\xi}^1 := (\hat{\xi}_1^1, \hat{\xi}_2^1), \ldots, \hat{\xi}^N := (\hat{\xi}_1^N, \hat{\xi}_2^N)$, which are independent and identically distributed sample paths of the underlying stochastic process $\xi := (\xi_1, \xi_2)$. We also assume that the random demands are known to be nonnegative almost surely, $\Xi := \mathbb{R}_+^{2R}$. We
compare the following data-driven approaches for finding decision rules for Problem (7):

• SAA - Independence: Given that the only information on the underlying joint probability distribu-
tion comes from sample paths, Avrahami et al. (2014) study a method for obtaining approximate
solutions to Problem (7) which assumes that the demands in the first half of the week ξ 1 and the
second half of the week ξ 2 are independent random vectors. The assumption of stage-wise indepen-
dence is a common simplifying assumption in the stochastic programming literature which allows
for a scenario tree to be constructed directly from historical sample paths. Specifically, under this
assumption of stage-wise independence, this approach finds first-stage decisions for Problem (7)
by solving the following mixed-integer linear optimization problem constructed from the scenario
tree:
$$ \begin{aligned} \underset{Q \ge 0,\; z \in \{0,1\}^{N \times R},\; v}{\text{minimize}} \quad & c \left( Q_{10} + \sum_{r=1}^{R} Q_{1r} \right) + h Q_{10} + \frac{1}{N} \sum_{j=1}^{N} \sum_{r=1}^{R} \left( f z_r^j + \frac{1}{N} \sum_{k=1}^{N} v_r^{jk} \right) \\ \text{subject to} \quad & \sum_{r=1}^{R} Q_{2r}^j \le Q_{10} \quad j \in [N] \\ & v_r^{jk} \ge b\left( \hat{\xi}_{2r}^k + \hat{\xi}_{1r}^j - Q_{2r}^j - Q_{1r} \right) - h Q_{2r}^j \quad j, k \in [N],\; r \in [R] \qquad (8) \\ & v_r^{jk} \ge h\left( Q_{1r} - \hat{\xi}_{1r}^j - \hat{\xi}_{2r}^k \right) \quad j, k \in [N],\; r \in [R] \\ & v_r^{jk} \ge b\left( \hat{\xi}_{1r}^j - Q_{1r} \right) - h \hat{\xi}_{2r}^k \quad j, k \in [N],\; r \in [R] \\ & z_r^j M \ge Q_{2r}^j \quad j \in [N],\; r \in [R]. \end{aligned} $$


We denote the optimal first-stage decisions to the above linear optimization problem by $\hat{Q}_{10}^{\mathrm{SAA}}, \hat{Q}_{11}^{\mathrm{SAA}}, \ldots, \hat{Q}_{1R}^{\mathrm{SAA}} \ge 0$. Given these first-stage decisions and a realization of the demand in the first half of the week, $\bar{\xi}_1 = (\bar{\xi}_{11}, \ldots, \bar{\xi}_{1R})$, the approach obtains second-stage decisions by solving
the following mixed-integer linear optimization problem:
$$ \begin{aligned} \underset{Q_2 \ge 0,\; z \in \{0,1\}^R,\; v}{\text{minimize}} \quad & \sum_{r=1}^{R} \left( f z_r + \frac{1}{N} \sum_{k=1}^{N} v_r^k \right) \\ \text{subject to} \quad & \sum_{r=1}^{R} Q_{2r} \le \hat{Q}_{10}^{\mathrm{SAA}} \\ & v_r^k \ge b\left( \hat{\xi}_{2r}^k + \bar{\xi}_{1r} - Q_{2r} - \hat{Q}_{1r}^{\mathrm{SAA}} \right) - h Q_{2r} \quad k \in [N],\; r \in [R] \\ & v_r^k \ge h\left( \hat{Q}_{1r}^{\mathrm{SAA}} - \bar{\xi}_{1r} - \hat{\xi}_{2r}^k \right) \quad k \in [N],\; r \in [R] \\ & v_r^k \ge b\left( \bar{\xi}_{1r} - \hat{Q}_{1r}^{\mathrm{SAA}} \right) - h \hat{\xi}_{2r}^k \quad k \in [N],\; r \in [R] \\ & z_r M \ge Q_{2r} \quad r \in [R]. \end{aligned} $$

We denote the optimal decisions to the above optimization problem by $\hat{Q}_{21}^{\mathrm{SAA}}(\bar{\xi}_1), \ldots, \hat{Q}_{2R}^{\mathrm{SAA}}(\bar{\xi}_1) \ge 0$ and $\hat{z}_1^{\mathrm{SAA}}(\bar{\xi}_1), \ldots, \hat{z}_R^{\mathrm{SAA}}(\bar{\xi}_1) \in \{0, 1\}$.

• AR Linear : This approach obtains an approximation of Problem (7) using a parametric ‘estimate-
then-optimize’ technique from the literature. Specifically, the approach consists of two steps: in
the ‘estimate’ step, the historical data ξ̂ 1 , . . . , ξ̂ N is fit to a parametric family of joint probability

distributions; in the ‘optimize’ step, a scenario tree is constructed by sampling from the esti-
mated joint probability distribution, and decision rules are obtained by solving the corresponding
approximation of the multi-stage stochastic linear optimization problem.
In the ‘optimize’ step, the scenario tree is constructed by first generating $\tilde{N}$ realizations for the demand in the first half of the week and then, for each such realization $\tilde{\xi}_1^j$, generating a conditional set of demand realizations for the second half of the week $\{\tilde{\xi}_2^{jk}\}_{k \in [\tilde{N}]}$. This generation procedure creates a scenario tree with $\tilde{N}^2$ leaves. Given this scenario tree, decisions are obtained by solving Problem (8) where $\hat{\xi}_{2r}^k$ is replaced by $\tilde{\xi}_{2r}^{jk}$ in each of the constraints. Following the stochastic programming literature (see, e.g., Löhndorf and Shapiro (2019)), we fit the historical data to a general linear model of the form $\xi_1 \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $\xi_2 \sim \mathcal{N}\left( B \begin{pmatrix} 1 \\ \xi_1 \end{pmatrix}, \Sigma_2 \right)$, and we use the standard maximum likelihood estimators for the parameters $\mu_1$, $B$, $\Sigma_1$, and $\Sigma_2$ (Finn 1974, Section 4.4).

• SRO-FA: Our final method obtains decision rules for Problem (7) using the proposed robust opti-
mization approach. Specifically, given historical sample paths of the form $\hat{\xi}^1 := (\hat{\xi}_1^1, \hat{\xi}_2^1), \ldots, \hat{\xi}^N := (\hat{\xi}_1^N, \hat{\xi}_2^N)$, we first construct an instance of Problem (2) in which the uncertainty sets from Section 3

are defined with the `∞ -norm. We then approximately solve the robust optimization problem to ob-
tain decision rules for Problem (7). To obtain a tractable approximation of the robust optimization
problem, we implement the finite adaptability technique described in Section 5.2.
In our implementation of the finite adaptability, we construct a partition of $\Xi = \mathbb{R}_+^{2R}$ in which the historical demands in the first half of the week, $\hat{\xi}_1^1, \ldots, \hat{\xi}_1^N$, each lie in their own hyperrectangular region. In other words, we construct a partition of Ξ consisting of regions of the form $\mathcal{P}^1 := [\ell^1, u^1] \times \mathbb{R}_+^R, \ldots, \mathcal{P}^N := [\ell^N, u^N] \times \mathbb{R}_+^R$ which satisfy the property that $\hat{\xi}^j \in \mathcal{P}^j$ for each sample path

j ∈ [N ]. This partition is motivated by the desire to obtain a tight approximation of the robust
optimization problem; indeed, we observe that the number and granularity of regions will increase
with the number of sample paths. Details on the partitioning heuristic used in this section can be
found in Appendix J.2.
Given a partition of the form described above, we approximate the robust optimization problem,
Problem (2), by restricting the space of second-stage decision rules to those of the form
$$ Q_{2r}(\zeta_1) = \begin{cases} Q_{2r}^1, & \text{if } \ell^1 \le \zeta_1 \le u^1, \\ \;\;\vdots \\ Q_{2r}^N, & \text{if } \ell^N \le \zeta_1 \le u^N; \end{cases} \qquad z_r(\zeta_1) = \begin{cases} z_r^1, & \text{if } \ell^1 \le \zeta_1 \le u^1, \\ \;\;\vdots \\ z_r^N, & \text{if } \ell^N \le \zeta_1 \le u^N. \end{cases} $$

For notational convenience, let $\mathcal{K}_j := \{k \in [N] : \mathcal{U}_N^j \cap \mathcal{P}^k \ne \emptyset\}$ denote the indices of regions $\mathcal{P}^1, \ldots, \mathcal{P}^N$ that intersect the uncertainty set $\mathcal{U}_N^j$, and let us define the quantities $\underline{\zeta}_{tr}^{jk} := \min_{\zeta \in \mathcal{U}_N^j \cap \mathcal{P}^k} \zeta_{tr}$ and $\bar{\zeta}_{tr}^{jk} := \max_{\zeta \in \mathcal{U}_N^j \cap \mathcal{P}^k} \zeta_{tr}$ for each period $t \in \{1, 2\}$, retailer $r \in \{1, \ldots, R\}$, sample path $j \in \{1, \ldots, N\}$, and region $k \in \mathcal{K}_j$. With this notation, we show in Appendix J.3 that the resulting approximation of Problem (2) using finite adaptability is obtained by solving the following mixed-integer linear optimization problem:
$$ \begin{aligned} \underset{\substack{Q_1 \ge 0,\; Q_2^k \ge 0,\; z^k \in \{0,1\}^R, \\ u^{j,k},\, v^j \in \mathbb{R}^R}}{\text{minimize}} \quad & c \left( Q_{10} + \sum_{r=1}^{R} Q_{1r} \right) + h Q_{10} + \frac{1}{N} \sum_{j=1}^{N} \sum_{r=1}^{R} v_r^j \\ \text{subject to} \quad & \sum_{r=1}^{R} Q_{2r}^k \le Q_{10} \quad \forall k \in [K] \\ & \sum_{r=1}^{R} v_r^j \ge \sum_{r=1}^{R} \left( u_r^{j,k} + f z_r^k \right) \quad \forall j \in [N],\; k \in \mathcal{K}_j \qquad (9) \\ & u_r^{j,k} \ge b\left( \bar{\zeta}_{2r}^{jk} + \bar{\zeta}_{1r}^{jk} - Q_{2r}^k - Q_{1r} \right) - h Q_{2r}^k \quad \forall r \in [R],\; j \in [N],\; k \in \mathcal{K}_j \\ & u_r^{j,k} \ge h\left( Q_{1r} - \underline{\zeta}_{1r}^{jk} - \underline{\zeta}_{2r}^{jk} \right) \quad \forall r \in [R],\; j \in [N],\; k \in \mathcal{K}_j \\ & u_r^{j,k} \ge b\left( \bar{\zeta}_{1r}^{jk} - Q_{1r} \right) - h \underline{\zeta}_{2r}^{jk} \quad \forall r \in [R],\; j \in [N],\; k \in \mathcal{K}_j \\ & z_r^k M \ge Q_{2r}^k \quad \forall r \in [R],\; k \in [K]. \end{aligned} $$

Solving the above optimization problem yields the first-stage decisions $\hat{Q}_{10}^{\mathrm{SRO}}, \hat{Q}_{11}^{\mathrm{SRO}}, \ldots, \hat{Q}_{1R}^{\mathrm{SRO}} \ge 0$ and decision rules $\hat{Q}_{21}^{\mathrm{SRO}}(\xi_1), \ldots, \hat{Q}_{2R}^{\mathrm{SRO}}(\xi_1) \ge 0$ and $\hat{z}_1^{\mathrm{SRO}}(\xi_1), \ldots, \hat{z}_R^{\mathrm{SRO}}(\xi_1) \in \{0, 1\}$.
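The index sets $\mathcal{K}_j$ and the bounds $\underline{\zeta}_{tr}^{jk}, \bar{\zeta}_{tr}^{jk}$ are inexpensive to compute here because both the regions and the uncertainty sets are axis-aligned boxes (an $\ell_\infty$-ball intersected with a box is again a box, so the coordinatewise min and max are just interval endpoints). A minimal sketch in Python; the function and array names are ours, and we assume the nonnegativity of Ξ has been folded into the region bounds:

```python
import numpy as np

def box_intersections(xi_hat, lo, hi, eps):
    """xi_hat: (N, d) sample paths; lo, hi: (K, d) region bounds.
    Returns K[j, k] = True iff the l_inf-ball of radius eps around
    xi_hat[j] intersects region k, together with the coordinatewise
    min (zeta_lo) and max (zeta_hi) over each nonempty intersection."""
    zeta_lo = np.maximum(xi_hat[:, None, :] - eps, lo[None, :, :])
    zeta_hi = np.minimum(xi_hat[:, None, :] + eps, hi[None, :, :])
    nonempty = np.all(zeta_lo <= zeta_hi, axis=-1)
    return nonempty, zeta_lo, zeta_hi

xi_hat = np.array([[0.0], [10.0]])
lo, hi = np.array([[-1.0], [9.0]]), np.array([[1.0], [11.0]])
K, zlo, zhi = box_intersections(xi_hat, lo, hi, eps=0.5)
print(K)  # each ball meets only its own region here
```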

To compare the above data-driven approaches, we perform a variety of numerical experiments on


different numbers of retailers, R ∈ {2, 3, 4}, with cost parameters c = 0.25, h = 0.05 and b = 0.5, and
in cases with no fixed cost f = 0 as well as when the fixed cost is f = 0.1. In the joint probability
distribution of the stochastic process, the demand in the first half of the week for each retailer, ξ1r , is
independent of the demands in the first half of the week for the other retailers and is generated from
a truncated normal distribution with mean 6 and standard deviation 2.5 that is bounded below by
zero. The demands in the second half of the week are independent across retailers and follow truncated
normal distributions with means 2(6 − ξ1r )2 and standard deviations 2.5 that are bounded below by
zero. This joint probability distribution is chosen because of its relative simplicity and, particularly in
the setting with multiple retailers, the difficulty of correctly identifying its structure using only limited
historical data. The number of retailers in our experiments is motivated by the organizational structure
of the Israeli publisher, in which sales representatives are assigned to managing the inventory of up to
five retailers (Avrahami et al. 2014, p. 452). In Appendix J.4, we provide numerical evidence that the
fixed cost of f = 0.1 leads to replenishment decision rules that are meaningfully different than those
obtained in experiments with no fixed cost.

We compare the aforementioned data-driven methods on training sets of sizes N ∈


{50, 100, 200, 400, 800} when there is no fixed cost (f = 0) and N ∈ {50, 71, 100, 141, 200, 282} when
there are fixed costs (f = 0.1). To obtain statistically meaningful results, for each size N , we generate

Figure 1 Three-stage inventory replenishment problem:


Impact of robustness parameter on SRO-FA for R = 3, no fixed cost (f = 0)

Note. The solid black lines are the average out-of-sample costs of decision rules produced by SRO-FA, and the shaded
regions are the 20th and 80th percentiles over the 50 training datasets. The dotted red lines are the average in-sample
costs for SRO-FA. The dashed green line is the benchmark for Problem (7). Results are shown for N ∈ {50, 200, 800}.

50 training datasets of size N and apply each of the above data-driven approaches to each training
dataset. We evaluate the decision rules from each approach on a common testing dataset of 10,000
sample paths. For “SRO-FA”, we select the robustness parameter using five-fold cross-validation. For
“AR Linear”, we generate scenario trees of size Ñ = 50 for all sizes of training sets N . We found that
larger values of Ñ resulted in longer computation times but did not have any discernible impact on the
performance of the “AR Linear” approach.

For comparison purposes, we also implement a benchmark method that has perfect knowledge of
the true joint probability distribution. In the benchmark, we obtain an estimate of the optimal cost of
Problem (7) by constructing a scenario tree from the true joint probability distribution. We construct
the scenario tree using the same procedure as described previously, with a size of Ñ = 800 for experi-
ments without fixed cost and Ñ = 200 for experiments with fixed cost. We repeat this process over 50
replications and report the average of the resulting optimal costs.

7.3. Results

In Figure 1, we show the impact of the robustness parameter on the in-sample and out-of-sample cost
of the decision rules obtained by our robust optimization approach. The results demonstrate that a
strictly positive choice of the robustness parameter is essential in order to obtain the best out-of-sample
cost for each N . This is due to the fact that Problem (2) has been approximated with a rich space of
decision rules, and in particular, a space of decision rules that becomes increasingly flexible as more

historical sample paths are obtained. In contrast, when the robustness parameter is set to zero, Figure 1
shows that the in-sample cost of Problem (9) does not appear to converge to the optimal cost of the
stochastic problem; these findings are consistent with the discussion of Example 1 in Section 3 and
show that approximating Problem (2) with a rich restricted space of decision rules will asymptotically
overfit the historical data when the robustness parameter is set to zero. Additional numerical results
on the relationship between the robustness parameter and the performance of our robust optimization
approach can be found in Appendix J.4.

More broadly, we believe that Figure 1 highlights a practical strength of our robust optimization
approach. Indeed, the approximation of Problem (2) using piecewise static decision rules did not require
nor utilize any information about the structure of optimal decision rules for the underlying stochastic
problem. At the same time, despite searching over a rich space of decision rules, Problem (2) with an
appropriate choice of the robustness parameter yields decision rules which do not overfit the historical
data. This shows that Problem (2) provides an opportunity to find high-quality decision rules for multi-
stage stochastic linear optimization problems, even when (i ) the only information on the underlying
distribution comes from limited data, and (ii ) the structure of optimal decision rules for the stochastic
problem is complex or unknown.

In Figures 2 and 3, we compare the average out-of-sample costs of the decision rules produced
by the various data-driven methods in experiments with no fixed cost (f = 0) and with fixed cost
(f = 0.1). We first observe that the gap between “SRO-FA CV” (where the robustness parameter is
chosen through five-fold cross validation) and the unrealistic “SRO-FA Best” (where the robustness
parameter is chosen post hoc to obtain decision rules with the best average out-of-sample performance) is
relatively small across sizes of training sets. This suggests that cross-validation can indeed be practically
effective in choosing the robustness parameter in the robust optimization approach. Moreover, the
results of the experiments show that the proposed robust optimization approach (“SRO-FA CV”) can
find decision rules which significantly outperform those obtained by widely-used alternative data-driven
approaches (“SAA Independence” and “AR Linear”). In particular, we note that the problem sizes in
our experiments are realistic, both in terms of the number of retailers R and sizes of training sets N .
The results of the experiments thus provide numerical evidence that the robust optimization approach
proposed in this paper can be valuable in practice, particularly in challenging applications with mixed-
integer decisions and multi-dimensional stochastic processes.

8. Application to a Multi-Stage Stochastic Inventory Management Problem
In our second set of experiments, we consider a classic and widely-studied stochastic inventory man-
agement problem for a single product with an unknown autoregressive demand. Our motivation in this

Figure 2 Three-stage inventory replenishment problem:


Performance of data-driven approaches, no fixed cost (f = 0)

Note. The solid lines are the average out-of-sample costs of decision rules produced by the various data-driven
approaches to Problem (7), and the shaded regions are the 20th and 80th percentiles over the 50 training datasets.
The robustness parameter in “SRO-FA CV” is chosen using five-fold cross validation, and the robustness parameter
in “SRO-FA Best” is chosen optimally with respect to the testing dataset.

Figure 3 Three-stage inventory replenishment problem:


Performance of data-driven approaches, fixed cost (f = 0.1)

Note. The solid lines are the average out-of-sample costs of decision rules produced by the various data-driven
approaches to Problem (7), and the shaded regions are the 20th and 80th percentiles over the 50 training datasets.
The robustness parameter in “SRO-FA CV” is chosen using five-fold cross validation, and the robustness parameter
in “SRO-FA Best” is chosen optimally with respect to the testing dataset.

set of experiments is to explore the practical value of Problem (2) in applications where the number
of time periods is comparatively large (e.g., T = 10) and the number of historical sample paths is
relatively small (e.g., 10 ≤ N ≤ 100). In this context, we compare our proposed robust optimization
approach with linear decision rules to a variety of alternative data-driven approaches. In contrast to
the previous section, where we compared to data-driven approaches that are practically tractable in
problems with mixed-integer decisions and short time horizons, this section compares to approaches
that are practically tractable in problems with continuous decisions and long time horizons.

8.1. Problem description

We consider an inventory management problem of a single product over a finite planning horizon. At
the beginning of each time period t ∈ [T ], we start with It ∈ R units of product in inventory. We then
select a production quantity of xt ∈ [0, x̄t ] with zero lead time at a cost of ct per unit. The product
demand ξt ≥ 0 is then revealed, the inventory is updated to It+1 = It + xt − ξt , and we incur a holding
cost of ht max{It+1 , 0} and a backorder cost of bt max{−It+1 , 0}. We begin with zero units of inventory
in the first period. Our goal is to dynamically select the production quantities to minimize the expected
total cost over the planning horizon, captured by
$$ \begin{aligned} \underset{x,\; I,\; y}{\text{minimize}} \quad & \mathbb{E}\left[ \sum_{t=1}^{T} \left( c_t x_t(\xi_1, \ldots, \xi_{t-1}) + y_{t+1}(\xi_1, \ldots, \xi_t) \right) \right] \\ \text{subject to} \quad & I_{t+1}(\xi_1, \ldots, \xi_t) = I_t(\xi_1, \ldots, \xi_{t-1}) + x_t(\xi_1, \ldots, \xi_{t-1}) - \xi_t \quad \text{a.s.}, \; \forall t \in [T] \qquad (10) \\ & y_{t+1}(\xi_1, \ldots, \xi_t) \ge h_t I_{t+1}(\xi_1, \ldots, \xi_t) \quad \text{a.s.}, \; \forall t \in [T] \\ & y_{t+1}(\xi_1, \ldots, \xi_t) \ge -b_t I_{t+1}(\xi_1, \ldots, \xi_t) \quad \text{a.s.}, \; \forall t \in [T] \\ & 0 \le x_t(\xi_1, \ldots, \xi_{t-1}) \le \bar{x}_t \quad \text{a.s.}, \; \forall t \in [T]. \end{aligned} $$


We consider the setting where the joint probability distribution of the stochastic process (ξ1 , . . . , ξT ) ∈
RT is unknown. Our only information on the distribution comes from historical data consisting of
demand realizations for past products, $(\hat{\xi}_1^1, \ldots, \hat{\xi}_T^1), \ldots, (\hat{\xi}_1^N, \ldots, \hat{\xi}_T^N)$, which are independent and identically distributed sample paths of the underlying stochastic process, and knowledge that the stochastic process will be contained in $\Xi = \mathbb{R}_+^T$ almost surely.

8.2. Experiments

We perform computational experiments on the following data-driven approaches for obtaining decision
rules for Problem (10):

• SRO-LDR: This is the proposed data-driven approach for multi-stage stochastic linear optimization
(Problem (2)), where the uncertainty sets are constructed as described in Section 3 with the
`∞ -norm. The approach is approximated using linear decision rules (see Section 5.1) and solved

using the reformulation developed in Theorem 4. We choose the robustness parameter for each
training dataset using five-fold cross validation, where the range of possible values considered in
the cross-validation procedure is $\epsilon_N \in \{b \cdot 10^a : a \in \{-2, -1, 0, 1\},\; b \in \{1, \ldots, 9\}\}$.

• SAA-LDR: This is the same approach as SRO-LDR, except the robustness parameter is set to
zero.

• Approx PCM : This is a data-driven extension of the approach developed in Bertsimas et al. (2019).
In this approach, decision rules are obtained by solving a multi-stage distributionally robust op-
timization problem (Problem (6)) in which AN is the set of joint probability distributions with
the same mean and covariance as those estimated from the historical data. This distributionally
robust optimization problem is solved approximately by restricting to lifted linear decision rules,
as described in Bertsimas et al. (2019, Section 3).

• DDP and RDDP : This is the robust data-driven dynamic programming approach proposed by
Hanasusanto and Kuhn (2013). The approach estimates cost-to-go functions by applying kernel
regression to the historical sample paths. Decisions are obtained from optimizing over the cost-
to-go functions, which are evaluated approximately using the algorithm described in Hanasusanto
and Kuhn (2013, Section 4). Since the algorithm requires both input sample paths and initial state
paths, we use half of the training dataset as the input sample paths, and the other half to generate
the state paths via the lifted linear decision rules obtained by Approx PCM. The approach also
requires a robustness parameter γ, which we choose to be either γ = 0 (DDP) or γ = 10 (RDDP).

• WDRO-LDR: Described in Section 6, this approach obtains decision rules by solving a multi-stage
distributionally robust optimization problem (Problem (6)) in which AN is chosen to be the 1-
Wasserstein ambiguity set with the $\ell_1$-norm. As with SRO-LDR, the distributionally robust optimization problem is approximated using linear decision rules and solved using a duality-
based reformulation provided in Appendix I. The robustness parameter is chosen using the same
procedure as SRO-LDR.
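The cross-validation selection of the robustness parameter can be sketched as follows; `fit` and `evaluate` are hypothetical placeholders for training a decision rule at a given $\epsilon_N$ and scoring it on held-out sample paths, and the grid matches the one described for SRO-LDR:

```python
import numpy as np

def choose_epsilon(train_paths, fit, evaluate, n_folds=5):
    """Five-fold cross-validation over the candidate grid
    {b * 10**a : a in {-2, -1, 0, 1}, b in {1, ..., 9}}."""
    grid = sorted(b * 10.0 ** a for a in range(-2, 2) for b in range(1, 10))
    folds = np.array_split(np.arange(len(train_paths)), n_folds)

    def cv_cost(eps):
        total = 0.0
        for fold in folds:
            mask = np.ones(len(train_paths), dtype=bool)
            mask[fold] = False
            model = fit(train_paths[mask], eps)          # train on 4 folds
            total += evaluate(model, train_paths[fold])  # score on the 5th
        return total / n_folds

    return min(grid, key=cv_cost)
```

Any training procedure with the `fit`/`evaluate` interface can be plugged in; the same loop is used for both SRO-LDR and WDRO-LDR in the experiments.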

We perform computational simulations using the same parameters and data generation as See and
Sim (2010). Specifically, the demand is a nonstationary autoregressive stochastic process of the form
$\xi_t = \varsigma_t + \alpha \varsigma_{t-1} + \cdots + \alpha \varsigma_1 + \mu$, where $\varsigma_1, \ldots, \varsigma_T$ are independent random variables distributed uniformly over $[-\bar{\varsigma}, \bar{\varsigma}]$. The parameters of the stochastic process are $\mu = 200$ and $\bar{\varsigma} = 40$ when $T = 5$, and $\mu = 200$ and $\bar{\varsigma} = 20$ when $T = 10$. The capacities and costs are $\bar{x}_t = 260$, $c_t = 0.1$, $h_t = 0.02$ for all $t \in [T]$, $b_t = 0.2$
for all t ∈ [T − 1], and bT = 2.
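The demand process can be simulated directly from its definition; a sketch in Python (the function name is ours, and the autoregressive coefficient `alpha` is left as an argument since its value is not stated in this excerpt):

```python
import numpy as np

def demand_paths(n, T, mu, sbar, alpha, seed=0):
    """Sample n paths of xi_t = varsigma_t + alpha*(varsigma_{t-1}
    + ... + varsigma_1) + mu, with varsigma_s ~ Uniform[-sbar, sbar]."""
    rng = np.random.default_rng(seed)
    vs = rng.uniform(-sbar, sbar, size=(n, T))
    past = np.cumsum(vs, axis=1) - vs   # varsigma_1 + ... + varsigma_{t-1}
    return vs + alpha * past + mu

paths = demand_paths(n=100, T=5, mu=200.0, sbar=40.0, alpha=0.25)
print(paths.shape)  # (100, 5)
```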

To compare the above data-driven approaches, we take the following steps. For various choices of
N , we generate 100 training datasets of size N and obtain decision rules by applying the above data-
driven approaches to each training dataset. The out-of-sample costs of the obtained decision rules are

approximated using a common testing dataset of 10000 sample paths.² Specifically, for each sample path $(\xi_1^i, \ldots, \xi_T^i) \in \mathbb{R}^T$ in the testing dataset, we calculate production quantities $(x_1^{A,i,\ell}, \ldots, x_T^{A,i,\ell}) \in \mathbb{R}^T$ by applying the decision rule obtained from approach $A$ on the $\ell$-th training dataset. The out-of-sample cost of the decision rule is then approximated as
$$\frac{1}{10000} \sum_{i=1}^{10000} \sum_{t=1}^{T} \left( c_t x_t^{A,i,\ell} + \max\left\{ h_t I_{t+1}^{A,i,\ell},\; -b_t I_{t+1}^{A,i,\ell} \right\} \right),$$
where the inventory levels $(I_1^{A,i,\ell}, \ldots, I_T^{A,i,\ell})$ are computed from the production quantities $(x_1^{A,i,\ell}, \ldots, x_T^{A,i,\ell}) \in \mathbb{R}^T$ and the test sample path $(\xi_1^i, \ldots, \xi_T^i) \in \mathbb{R}^T$. All sample paths in the training and testing datasets are drawn independently from the true joint probability distribution.
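Concretely, evaluating the expression above on one test path amounts to an inventory recursion followed by holding/backlog charges. A minimal sketch, assuming $I_1 = 0$ as in the Section 8 model (the function name is ours):

```python
import numpy as np

def path_cost(x, xi, c, h, b):
    """Cost of production plan x on demand path xi:
    sum_t c*x_t + max{h*I_{t+1}, -b_t*I_{t+1}}, with I_{t+1} = I_t + x_t - xi_t and I_1 = 0."""
    x, xi = np.asarray(x, dtype=float), np.asarray(xi, dtype=float)
    inventory = np.cumsum(x - xi)                      # I_2, ..., I_{T+1}
    return float(np.sum(c * x + np.maximum(h * inventory, -b * inventory)))

b = np.array([0.2] * 4 + [2.0])                        # b_t = 0.2 for t in [T-1], b_T = 2
path_cost([200] * 5, [200] * 5, c=0.1, h=0.02, b=b)    # -> 100.0: zero inventory, production cost only
```

The out-of-sample estimate is then the mean of `path_cost` over the 10000 testing paths.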

As discussed earlier, Problem (2) is not guaranteed to find decision rules which are feasible for all realizations in $\Xi$. Therefore, the linear decision rules obtained by SRO-LDR and SAA-LDR, when applied to sample paths in the testing dataset, may result in production quantities which exceed $\bar{x}_1, \ldots, \bar{x}_T$ or are negative. Thus, before computing the out-of-sample costs, we first project each production quantity $x_t^{A,i,\ell}$ onto the interval $[0, \bar{x}_t]$ to ensure it is feasible. We discuss the impact of this projection procedure at the end of the results section.
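The projection onto $[0, \bar{x}_t]$ is componentwise clipping, and the $\ell_1$-distance to the feasible region (the quantity summarized later in the results discussion) is the total amount clipped. An illustrative sketch (helper names are ours):

```python
import numpy as np

def project_production(x, x_bar):
    """Componentwise projection of production quantities onto [0, x_bar_t]."""
    return np.clip(x, 0.0, x_bar)

def l1_distance_to_feasible(x, x_bar):
    """l1-distance between x and the box [0, x_bar_1] x ... x [0, x_bar_T]."""
    x = np.asarray(x, dtype=float)
    return float(np.abs(x - np.clip(x, 0.0, x_bar)).sum())

project_production(np.array([-5.0, 100.0, 300.0]), x_bar=260.0)   # -> [0., 100., 260.]
l1_distance_to_feasible(np.array([-5.0, 100.0, 300.0]), 260.0)    # -> 45.0
```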

8.3. Results

In Table 1 and Figure 4, we report the out-of-sample costs and computation times of the various approaches. SRO-LDR achieves a lower out-of-sample cost than the other approaches, most notably when the size of the training dataset is small, and requires less than one second of computation time. We note that the out-of-sample cost of SRO-LDR roughly converges to the dynamic programming (DP) estimate of the optimal cost of Problem (10), which suggests that linear decision rules provide a good approximation of the optimal production decision rules for this particular stochastic problem. The relationship between the robustness parameter and the in-sample and out-of-sample cost of SRO-LDR is shown in Figure 5.
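The paper does not prescribe a specific routine for selecting the robustness parameter, so the following is only an illustrative sketch of K-fold cross validation over a logarithmic grid; `train_sro_ldr` and `eval_cost` are hypothetical stand-ins for fitting linear decision rules on a training fold and evaluating their cost on a held-out sample path.

```python
import numpy as np

def choose_epsilon(paths, train_sro_ldr, eval_cost, grid=None, k=5):
    """Pick the robustness parameter with lowest average held-out cost across K folds."""
    if grid is None:
        grid = np.logspace(-3, 1, 9)  # candidate values spanning the range shown in Figure 5
    folds = np.array_split(np.random.permutation(len(paths)), k)
    avg_cost = []
    for eps in grid:
        fold_costs = []
        for i in range(k):
            held_out = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            rule = train_sro_ldr(paths[train_idx], eps)
            fold_costs.append(np.mean([eval_cost(rule, paths[m]) for m in held_out]))
        avg_cost.append(np.mean(fold_costs))
    return grid[int(np.argmin(avg_cost))]
```

The same skeleton applies to WDRO-LDR by swapping the training routine.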

We briefly reflect on some notable differences between SRO-LDR and the other approaches. First,
the results demonstrate that a strictly positive choice of the robustness parameter is not necessary
to avoid asymptotic overfitting when Problem (2) is approximated with a fixed, finite-dimensional
space of decision rules; indeed, Table 1 and Figure 5 show that SAA-LDR can provide an out-of-
sample cost which is similar to that of SRO-LDR for moderate to large training datasets. However,
SRO-LDR produces an out-of-sample cost which significantly outperforms SAA-LDR when N is small
(N ∈ {10, 25}). More generally, this shows that there exist regimes in which a positive choice of the

² For DDP and RDDP, we only evaluated on the first 1000 sample paths in the testing dataset, due to the computational cost of optimizing over the cost-to-go functions for each testing sample path.

Table 1    Multi-stage stochastic inventory management: Average out-of-sample cost.

                                Size of training dataset (N)
 T    α     Approach       10              25              50              100            DP
 5    0     SRO-LDR        111.8(7.2)      109.5(1.3)      108.5(0.7)      108.0(0.3)     108
            SAA-LDR        127.3(12.6)     111.7(3.1)      108.7(1.0)      107.9(0.3)
            Approx PCM     118.5(2.2)      117.4(1.0)      117.1(0.7)      117.0(0.5)
            DDP            2262.7(363.3)   1189.6(854.2)   525.4(510.8)    205.1(201.5)
            RDDP           2255.5(393.2)   1175.8(856.2)   515.6(506.0)    202.4(195.3)
            WDRO-LDR       2400.3(0.0)     2400.3(0.0)     2400.3(0.0)     2400.3(0.0)
      0.25  SRO-LDR        113.0(4.2)      110.0(1.8)      108.7(0.8)      108.0(0.3)     107
            SAA-LDR        127.6(13.0)     111.7(3.1)      108.6(1.0)      108.0(0.2)
            Approx PCM     126.8(3.8)      125.3(1.5)      124.8(0.9)      124.6(0.8)
            DDP            2251.5(488.8)   1393.7(897.8)   679.3(656.0)    236.9(240.5)
            RDDP           2222.0(556.5)   1386.2(900.7)   670.1(654.9)    236.2(237.2)
            WDRO-LDR       2400.7(0.0)     2400.7(0.0)     2400.7(0.0)     2400.7(0.0)
      0.5   SRO-LDR        115.5(5.3)      112.0(4.0)      110.8(2.7)      111.7(2.6)     108
            SAA-LDR        129.5(13.3)     113.1(6.8)      110.7(2.9)      111.6(2.6)
            Approx PCM     136.0(4.8)      134.0(1.9)      133.4(1.2)      133.2(1.0)
            DDP            2263.8(480.1)   1563.7(917.6)   777.5(787.7)    364.2(488.1)
            RDDP           2253.8(515.8)   1532.6(940.9)   716.8(758.9)    334.6(477.1)
            WDRO-LDR       2401.2(0.0)     2401.2(0.0)     2401.2(0.0)     2401.2(0.0)
 10   0     SRO-LDR        208.9(1.0)      207.5(0.6)      206.8(0.5)      206.2(0.2)     206
            SAA-LDR        293.9(70.1)     212.6(2.1)      207.8(1.1)      206.3(0.4)
            Approx PCM     215.3(2.1)      214.5(0.6)      214.1(0.6)      214.1(0.4)
            DDP            5211.4(1131.1)  2827.9(1757.5)  1335.6(1206.4)  497.5(550.2)
            RDDP           5210.1(1133.4)  2820.3(1758.7)  1327.6(1206.0)  500.1(552.7)
            WDRO-LDR       5800.3(0.0)     5800.3(0.0)     5800.3(0.0)     5800.3(0.0)
      0.25  SRO-LDR        210.3(2.9)      207.8(1.1)      206.9(0.5)      206.3(0.2)     206
            SAA-LDR        295.1(70.4)     212.7(2.1)      207.8(1.1)      206.3(0.4)
            Approx PCM     228.7(4.5)      226.2(1.8)      225.7(1.1)      225.5(0.9)
            DDP            5215.6(1350.1)  3214.7(1984.9)  1598.0(1566.5)  440.2(417.1)
            RDDP           5202.0(1368.7)  3185.3(1977.1)  1593.6(1566.6)  437.0(418.0)
            WDRO-LDR       5800.2(0.0)     5800.5(0.0)     5800.5(0.0)     5800.5(0.0)
      0.5   SRO-LDR        211.1(3.9)      207.9(1.0)      206.9(0.6)      206.3(0.2)     206
            SAA-LDR        297.8(70.7)     213.0(2.3)      207.9(1.1)      206.4(0.4)
            Approx PCM     245.3(7.0)      242.1(2.7)      241.6(2.0)      240.9(1.6)
            DDP            5374.8(1052.7)  3676.4(2159.1)  1960.9(1878.3)  644.1(914.6)
            RDDP           5313.0(1173.0)  3630.9(2161.4)  1949.2(1863.9)  644.0(913.3)
            WDRO-LDR       5800.7(0.0)     5800.7(0.0)     5800.3(0.0)     5800.7(0.0)

Note. Mean (standard deviation) for the out-of-sample cost of decision rules obtained by various data-driven approaches for Problem (10). The robustness parameters in SRO-LDR and WDRO-LDR are chosen using cross validation. The column DP presents the dynamic programming approximations of the optimal cost of Problem (10) from See and Sim (2010, Tables EC.1 and EC.2), which have an accuracy of ±1%.

robustness parameter can still provide significant value even when Problem (2) is approximated using linear decision rules. Second, we note that WDRO-LDR consistently produces decision rules with large average out-of-sample cost; this is because the approach requires the linear decision rules to satisfy $0 \le x_{t,0} + \sum_{s=1}^{t-1} x_{t,s}\zeta_s \le \bar{x}_t$ for all $(\zeta_1, \ldots, \zeta_T) \in \mathbb{R}_+^T$, which reduces to a static decision rule for the production quantity in each stage. Finally, we remark that the average out-of-sample cost of DDP

Figure 4    Multi-stage stochastic inventory management: Computation times for T = 10, α = 0.25.

[Figure: computation time in seconds (log scale, 10⁻² to 10³) versus N ∈ {25, 50, 75, 100} for SRO-LDR, SAA-LDR, Approx PCM, DDP, RDDP, and WDRO-LDR.]

Note. Computation times for data-driven approaches to the multi-stage stochastic inventory management problem. Results are shown for T = 10 and α = 0.25, and similar computation times were observed for other choices of α. The graph shows the mean value of the computation times over 100 training datasets for each value of N.

Figure 5    Multi-stage stochastic inventory management: Impact of robustness parameter on SRO-LDR for T = 10, α = 0.25.

[Figure: cost (approximately 204 to 214) versus the robustness parameter ε_N (log scale, 10⁻³ to 10¹), in three panels corresponding to N = 25, 50, and 100.]

Note. The solid black line is the average out-of-sample cost of decision rules produced by SRO-LDR, and the shaded black region is the 20th and 80th percentiles over the 100 training datasets. The dotted red line is the average in-sample cost of SRO-LDR, and the solid green line is a dynamic programming approximation of the optimal cost of Problem (10) from See and Sim (2010, Table EC.2).

and RDDP improved significantly with the size of the training dataset, but exhibited high variability across training datasets and required long computation times.

We recall that Table 1 reports the out-of-sample costs of SRO-LDR and SAA-LDR after their production quantities are projected onto the feasible region (see Section 8.2). In Appendix K, we discuss the impact of this projection procedure on the out-of-sample cost. Specifically, we show across the above experiments that (i) SRO-LDR produces feasible production quantities for more than 93% of the sample paths in the testing dataset, and (ii) the average $\ell_1$-distance between the production quantities $(x_1^{A,i,\ell}, \ldots, x_T^{A,i,\ell})$ and the feasible region $[0, \bar{x}_1] \times \cdots \times [0, \bar{x}_T]$ is less than 2 units. This shows that SRO-LDR consistently produces feasible or nearly-feasible decisions, and thus the out-of-sample costs of SRO-LDR are unlikely to be an artifact of this projection procedure.

9. Conclusion
In this work, we presented a new data-driven approach, based on robust optimization, for solving multi-
stage stochastic linear optimization problems where uncertainty is correlated across time. We showed
that the proposed approach is asymptotically optimal, providing assurance that the approach offers a
near-optimal approximation of the underlying stochastic problem in the presence of big data. At the
same time, the optimization problem resulting from the proposed approach can be addressed by approx-
imation algorithms and reformulation techniques which have underpinned the success of multi-stage
robust optimization. The practical value of the proposed approach was illustrated by computational
examples inspired by real-world applications, demonstrating that the proposed data-driven approach
can produce high-quality decisions in reasonable computation times. Through these contributions, this
work provides a step towards helping organizations across domains leverage historical data to make
better operational decisions in dynamic environments.

References
Assaf Avrahami, Yale T Herer, and Retsef Levi. Matching supply and demand: Delayed two-phase distribution at Yedioth Group - Models, algorithms, and information technology. Interfaces, 44(5):445–460, 2014.

Amparo Baíllo, Antonio Cuevas, and Ana Justel. Set estimation and nonparametric detection. Canadian Journal of Statistics, 28(4):765–782, 2000.

Güzin Bayraksan and David K Love. Data-driven stochastic programming using phi-divergences. In The
Operations Research Revolution, chapter 1, pages 1–19. INFORMS, 2015.

Aharon Ben-Tal and Arkadi Nemirovski. Robust solutions of uncertain linear programs. Operations Research
Letters, 25(1):1–13, 1999.

Aharon Ben-Tal, Alexander Goryashko, Elana Guslitzer, and Arkadi Nemirovski. Adjustable robust solutions
of uncertain linear programs. Mathematical Programming, 99(2):351–376, 2004.

Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization. Princeton University
Press, 2009.

Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust
solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):
341–357, 2013.

Dimitris Bertsimas and Constantine Caramanis. Finite adaptability in multistage linear optimization. IEEE
Transactions on Automatic Control, 55(12):2751–2766, 2010.

Dimitris Bertsimas and Iain Dunning. Multistage robust mixed-integer optimization with adaptive partitions.
Operations Research, 64(4):980–998, 2016.

Dimitris Bertsimas and Vineet Goyal. On the power of robust solutions in two-stage stochastic and adaptive
optimization problems. Mathematics of Operations Research, 35(2):284–305, 2010.

Dimitris Bertsimas, David B Brown, and Constantine Caramanis. Theory and applications of robust optimization. SIAM Review, 53(3):464–501, 2011a.

Dimitris Bertsimas, Vineet Goyal, and Xu Andy Sun. A geometric characterization of the power of finite
adaptability in multistage stochastic and adaptive optimization. Mathematics of Operations Research,
36(1):24–54, 2011b.

Dimitris Bertsimas, Vishal Gupta, and Nathan Kallus. Robust sample average approximation. Mathematical
Programming, 171(1):217–282, 2018.

Dimitris Bertsimas, Melvyn Sim, and Meilin Zhang. Adaptive distributionally robust optimization. Management Science, 65(2):604–618, 2019.

Dimitris Bertsimas, Shimrit Shtern, and Bradley Sturt. Two-stage sample robust optimization. Operations
Research (forthcoming), 2021.

John R Birge and Francois Louveaux. Introduction to Stochastic Programming. Springer Science & Business
Media, 2011.

Leo Breiman. Probability. SIAM, 1992.

Xin Chen, Melvyn Sim, and Peng Sun. A robust optimization perspective on stochastic programming.
Operations Research, 55(6):1058–1071, 2007.

George B Dantzig. Linear programming under uncertainty. Management Science, 1(3-4):197–206, 1955.

Erick Delage and Dan A Iancu. Robust multistage decision making. In The Operations Research Revolution,
pages 20–46. INFORMS, 2015.

Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application
to data-driven problems. Operations Research, 58(3):595–612, 2010.

Luc Devroye and Gary L Wise. Detection of abnormal behavior via nonparametric estimation of the support.
SIAM Journal on Applied Mathematics, 38(3):480–488, 1980.

E. Erdoğan and Garud Iyengar. Ambiguous chance constrained problems and robust optimization. Mathematical Programming, 107(1):37–61, 2006.

E. Erdoğan and Garud Iyengar. On two-stage convex chance constrained problems. Mathematical Methods
of Operations Research, 65(1):115–140, 2007.

K Bruce Erickson. The strong law of large numbers when the mean is undefined. Transactions of the
American Mathematical Society, 185:371–381, 1973.

Jeremy D Finn. A general model for multivariate analysis. Holt, Rinehart & Winston, 1974.

Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical
measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015.

Rui Gao, Xi Chen, and Anton J Kleywegt. Wasserstein distributional robustness and regularization in
statistical learning. arXiv preprint arXiv:1712.06050v2, 2017.

Stanley J Garstka and Roger J-B Wets. On decision rules in stochastic programming. Mathematical Programming, 7(1):117–143, 1974.

Angelos Georghiou, Daniel Kuhn, and Wolfram Wiesemann. The decision rule approach to optimization
under uncertainty: methodology and applications. Computational Management Science, pages 1–32,
2018.

Angelos Georghiou, Angelos Tsoukalas, and Wolfram Wiesemann. Robust dual dynamic programming.
Operations Research, 67(3):813–830, 2019.

Clark R Givens and Rae Michael Shortt. A class of Wasserstein metrics for probability distributions. The
Michigan Mathematical Journal, 31(2):231–240, 1984.

Grani A Hanasusanto and Daniel Kuhn. Robust data-driven dynamic programming. In Advances in Neural
Information Processing Systems, pages 827–835, 2013.

Grani A Hanasusanto and Daniel Kuhn. Conic programming reformulations of two-stage distributionally
robust linear programs over Wasserstein balls. Operations Research, 66(3):849–869, 2018.

Ruiwei Jiang and Yongpei Guan. Risk-averse two-stage stochastic program with distributional ambiguity.
Operations Research, 66(5):1390–1405, 2018.

Aleksandr Petrovich Korostelev and Alexandre B Tsybakov. Minimax Theory of Image Reconstruction,
volume 82. Springer-Verlag, New York, 1993.

Pavlo Krokhmal and Stanislav Uryasev. A sample-path approach to optimal position liquidation. Annals of
Operations Research, 152(1):193–225, 2007.

Nils Löhndorf and Alexander Shapiro. Modeling time-dependent randomness in stochastic dual dynamic
programming. European Journal of Operational Research, 273(2):650–661, 2019.

Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the
Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming,
171(1):115–166, 2018.

Georg Ch Pflug and Alois Pichler. From empirical observations to tree models for stochastic optimization:
Convergence properties. SIAM Journal on Optimization, 26(3):1715–1740, 2016.

Georg Ch Pflug and David Wozabal. Ambiguity in portfolio selection. Quantitative Finance, 7(4):435–442,
2007.

Krzysztof Postek and Dick Den Hertog. Multistage adjustable robust mixed-integer optimization via iterative
splitting of the uncertainty set. INFORMS Journal on Computing, 28(3):553–574, 2016.

Aaditya Ramdas, Nicolás García Trillos, and Marco Cuturi. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017.

Herbert E Scarf. A min-max solution of an inventory problem. In Studies in the Mathematical Theory of
Inventory and Production, pages 201–209. Stanford University Press, 1958.

Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating
the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

Chuen-Teck See and Melvyn Sim. Robust approximation to multiperiod inventory management. Operations
Research, 58(3):583–594, 2010.

Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Regularization via mass
transportation. Journal of Machine Learning Research, 20(103):1–68, 2019.

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on Stochastic Programming:
Modeling and Theory. SIAM, 2009.

Allen L Soyster. Convex programming with set-inclusive constraints and applications to inexact linear
programming. Operations Research, 21(5):1154–1157, 1973.

Matthew Staib and Stefanie Jegelka. Distributionally robust deep learning as a generalization of adversarial
training. In NIPS Machine Learning and Computer Security Workshop, 2017.

Bart PG Van Parys, Peyman Mohajerin Esfahani, and Daniel Kuhn. From data to decisions: Distributionally
robust optimization is optimal. Management Science, 2020.

Jean-Louis Verger-Gaugry. Covering a ball with smaller equal balls in Rn . Discrete & Computational
Geometry, 33(1):143–155, 2005.

Hong Wang and RJ Tomkins. A zero-one law for large order statistics. Canadian Journal of Statistics, 20
(3):323–334, 1992.

Wolfram Wiesemann, Daniel Kuhn, and Melvyn Sim. Distributionally robust convex optimization. Operations Research, 62(6):1358–1376, 2014.

Guanglin Xu and Samuel Burer. A copositive approach for two-stage adjustable robust optimization with
uncertain right-hand sides. Computational Optimization and Applications, 70(1):33–59, 2018.

Huan Xu, Constantine Caramanis, and Shie Mannor. A distributional interpretation of robust optimization.
Mathematics of Operations Research, 37(1):95–110, 2012.

Bo Zeng and Long Zhao. Solving two-stage robust optimization problems using a column-and-constraint
generation method. Operations Research Letters, 41(5):457–461, 2013.

Jianzhe Zhen, Dick Den Hertog, and Melvyn Sim. Adjustable robust optimization via Fourier–Motzkin
elimination. Operations Research, 66(4):1086–1100, 2018.
e-companion to Bertsimas, Shtern, and Sturt: A data-driven approach to multi-stage stochastic linear optimization ec1

Appendix A: Verifying Assumption 3 in Examples

In this appendix, we show that each multi-stage stochastic linear optimization problem considered in this
paper satisfies Assumption 3.

A.1. Example 1 from Section 3

Consider the sample robust optimization problem
$$\begin{array}{ll}
\displaystyle\mathop{\textup{minimize}}_{x_2: \mathbb{R} \to \mathbb{R},\ x_3: \mathbb{R}^2 \to \mathbb{R}} & \displaystyle\frac{1}{N} \sum_{j=1}^{N} \sup_{\zeta \in \mathcal{U}_N^j} \left\{ x_2(\zeta_1) + 2 x_3(\zeta_1, \zeta_2) \right\} \\[2ex]
\textup{subject to} & x_2(\zeta_1) + x_3(\zeta_1, \zeta_2) \ge \zeta_1 + \zeta_2 \quad \forall \zeta \in \cup_{j=1}^{N} \mathcal{U}_N^j \\[1ex]
& x_2(\zeta_1),\ x_3(\zeta_1, \zeta_2) \ge 0 \quad \forall \zeta \in \cup_{j=1}^{N} \mathcal{U}_N^j.
\end{array}$$

We observe that the decisions must be nonnegative for every realization in the uncertainty sets. Moreover, the following constraints can be added to the above problem without affecting its optimal cost:
$$x_2(\zeta_1) \le \sup_{\zeta' \in \cup_{j=1}^{N} \mathcal{U}_N^j} \{\zeta_1' + \zeta_2'\} \quad \forall \zeta \in \cup_{j=1}^{N} \mathcal{U}_N^j,$$
$$x_3(\zeta_1, \zeta_2) \le \sup_{\zeta' \in \cup_{j=1}^{N} \mathcal{U}_N^j} \{\zeta_1' + \zeta_2'\} \quad \forall \zeta \in \cup_{j=1}^{N} \mathcal{U}_N^j.$$

Indeed, the above constraints ensure that we are never purchasing inventory which exceeds the maximal
ζ1 + ζ2 which can be realized in the uncertainty sets. Thus, we have shown that Assumption 3 holds.

A.2. Example 2 from Section 4.3

Consider the sample robust optimization problem
$$\begin{array}{ll}
\displaystyle\mathop{\textup{minimize}}_{x_1 \in \mathbb{Z}} & x_1 \\[1.5ex]
\textup{subject to} & x_1 \ge \zeta_1 \quad \forall \zeta_1 \in \cup_{j=1}^{N} \mathcal{U}_N^j.
\end{array}$$

We observe that an optimal solution to this problem is $x_1 = \max_{\zeta_1 \in \cup_{j=1}^{N} \mathcal{U}_N^j} \lceil \zeta_1 \rceil$, and thus the constraint
$$x_1 \le \max_{\zeta_1 \in \cup_{j=1}^{N} \mathcal{U}_N^j} \zeta_1 + 1$$

can be added to the above problem without affecting its optimal cost. We conclude that Assumption 3 holds.

A.3. Example 3 from Section 4.3

Consider the sample robust optimization problem
$$\begin{array}{ll}
\displaystyle\mathop{\textup{minimize}}_{x_1 \in \mathbb{R}^2} & x_{12} \\[1.5ex]
\textup{subject to} & \zeta_1 (1 - x_{12}) \le x_{11} \quad \forall \zeta_1 \in \cup_{j=1}^{N} \mathcal{U}_N^j \\[1ex]
& 0 \le x_{12} \le 1.
\end{array}$$

We observe that an optimal solution to this problem is given by $x_{11} = \max_{\zeta_1 \in \cup_{j=1}^{N} \mathcal{U}_N^j} \zeta_1$ and $x_{12} = 0$. Thus, the constraint
$$x_{11} \le \max_{\zeta_1 \in \cup_{j=1}^{N} \mathcal{U}_N^j} \zeta_1$$
can be added to the above problem without affecting its optimal cost. We conclude that Assumption 3 holds.

A.4. Example 4 from Section 4.3

Consider the sample robust optimization problem
$$\begin{array}{ll}
\displaystyle\mathop{\textup{minimize}}_{x_2: \mathbb{R} \to \mathbb{Z}} & \displaystyle\frac{1}{N} \sum_{j=1}^{N} \sup_{\zeta_1 \in \mathcal{U}_N^j} x_2(\zeta_1) \\[2ex]
\textup{subject to} & x_2(\zeta_1) \ge \zeta_1 \quad \forall \zeta_1 \in \cup_{j=1}^{N} \mathcal{U}_N^j.
\end{array}$$
Since $\Xi = [0,1]$, we observe that the constraint
$$x_2(\zeta_1) \le 1 \quad \forall \zeta_1 \in \cup_{j=1}^{N} \mathcal{U}_N^j$$
can be added to the above problem without affecting its optimal cost. We conclude that Assumption 3 holds.

A.5. Example from Section 7

Consider the sample robust optimization problem associated with Problem (7):
$$\begin{array}{lll}
\displaystyle\mathop{\textup{minimize}}_{v,\ Q_1 \ge 0,\ Q_2,\ z} & \displaystyle\frac{1}{N} \sum_{j=1}^{N} \max_{\boldsymbol{\zeta} \in \mathcal{U}_N^j} \left\{ c \left( Q_{10} + \sum_{r=1}^{R} Q_{1r} \right) + h Q_{10} + \sum_{r=1}^{R} v_r(\boldsymbol{\zeta}_1, \boldsymbol{\zeta}_2) + f \sum_{r=1}^{R} z_r(\boldsymbol{\zeta}_1) \right\} \\[2.5ex]
\textup{subject to} & \displaystyle\sum_{r=1}^{R} Q_{2r}(\boldsymbol{\zeta}_1) \le Q_{10} & \forall \boldsymbol{\zeta} \in \cup_{j=1}^{N} \mathcal{U}_N^j \\[1.5ex]
& v_r(\boldsymbol{\zeta}_1, \boldsymbol{\zeta}_2) \ge b(\zeta_{2r} + \zeta_{1r} - Q_{2r}(\boldsymbol{\zeta}_1) - Q_{1r}) - h Q_{2r}(\boldsymbol{\zeta}_1) & \forall r \in [R],\ \boldsymbol{\zeta} \in \cup_{j=1}^{N} \mathcal{U}_N^j \\[1ex]
& v_r(\boldsymbol{\zeta}_1, \boldsymbol{\zeta}_2) \ge h(Q_{1r} - \zeta_{1r} - \zeta_{2r}) & \forall r \in [R],\ \boldsymbol{\zeta} \in \cup_{j=1}^{N} \mathcal{U}_N^j \\[1ex]
& v_r(\boldsymbol{\zeta}_1, \boldsymbol{\zeta}_2) \ge b(\zeta_{1r} - Q_{1r}) - h \zeta_{2r} & \forall r \in [R],\ \boldsymbol{\zeta} \in \cup_{j=1}^{N} \mathcal{U}_N^j \\[1ex]
& z_r(\boldsymbol{\zeta}_1) M \ge Q_{2r}(\boldsymbol{\zeta}_1) & \forall r \in [R],\ \boldsymbol{\zeta} \in \cup_{j=1}^{N} \mathcal{U}_N^j \\[1ex]
& z_r(\boldsymbol{\zeta}_1) \in \{0,1\},\ Q_{2r}(\boldsymbol{\zeta}_1) \ge 0 & \forall r \in [R],\ \boldsymbol{\zeta} \in \cup_{j=1}^{N} \mathcal{U}_N^j.
\end{array}$$

Since $\Xi = \mathbb{R}_+^{2R}$, we observe that the constraints
$$0 \le Q_{1r} \le \max_{\boldsymbol{\zeta} \in \cup_{j=1}^{N} \mathcal{U}_N^j} (\zeta_{1r} + \zeta_{2r}) \quad \forall r \in [R],$$
$$0 \le Q_{10} \le \max_{\boldsymbol{\zeta} \in \cup_{j=1}^{N} \mathcal{U}_N^j} \sum_{r=1}^{R} (\zeta_{1r} + \zeta_{2r})$$
can be added to the above problem without affecting its optimal cost. It thus follows from the constraint $\sum_{r=1}^{R} Q_{2r}(\boldsymbol{\zeta}_1) \le Q_{10}$ that the constraints
$$0 \le Q_{21}(\boldsymbol{\zeta}_1), \ldots, Q_{2R}(\boldsymbol{\zeta}_1) \le \max_{\boldsymbol{\zeta}' \in \cup_{j=1}^{N} \mathcal{U}_N^j} \sum_{r=1}^{R} (\zeta_{1r}' + \zeta_{2r}') \quad \forall \boldsymbol{\zeta} \in \cup_{j=1}^{N} \mathcal{U}_N^j$$

can also be added to the above problem without affecting its optimal cost. Finally, we can without loss of generality impose the constraint for each retailer $r \in [R]$ that
$$0 \le v_r(\boldsymbol{\zeta}_1, \boldsymbol{\zeta}_2) = \max\big\{ b(\zeta_{2r} + \zeta_{1r} - Q_{2r}(\boldsymbol{\zeta}_1) - Q_{1r}) - h Q_{2r}(\boldsymbol{\zeta}_1),\ h(Q_{1r} - \zeta_{1r} - \zeta_{2r}),\ b(\zeta_{1r} - Q_{1r}) - h \zeta_{2r} \big\} \quad \forall \boldsymbol{\zeta} \in \cup_{j=1}^{N} \mathcal{U}_N^j.$$
Applying the aforementioned bounds on the decision rules, we conclude that Assumption 3 holds.

A.6. Example from Section 8

Consider the sample robust optimization problem
$$\begin{array}{lll}
\displaystyle\mathop{\textup{minimize}}_{x,\, I,\, y} & \displaystyle\frac{1}{N} \sum_{j=1}^{N} \sup_{\zeta \in \mathcal{U}_N^j} \sum_{t=1}^{T} \big( c_t x_t(\zeta_1, \ldots, \zeta_{t-1}) + y_{t+1}(\zeta_1, \ldots, \zeta_t) \big) \\[2.5ex]
\textup{subject to} & I_{t+1}(\zeta_1, \ldots, \zeta_t) = I_t(\zeta_1, \ldots, \zeta_{t-1}) + x_t(\zeta_1, \ldots, \zeta_{t-1}) - \zeta_t & \forall \zeta \in \cup_{j=1}^{N} \mathcal{U}_N^j,\ \forall t \in [T] \\[1ex]
& y_{t+1}(\zeta_1, \ldots, \zeta_t) \ge h_t I_{t+1}(\zeta_1, \ldots, \zeta_t) & \forall \zeta \in \cup_{j=1}^{N} \mathcal{U}_N^j,\ \forall t \in [T] \\[1ex]
& y_{t+1}(\zeta_1, \ldots, \zeta_t) \ge -b_t I_{t+1}(\zeta_1, \ldots, \zeta_t) & \forall \zeta \in \cup_{j=1}^{N} \mathcal{U}_N^j,\ \forall t \in [T] \\[1ex]
& 0 \le x_t(\zeta_1, \ldots, \zeta_{t-1}) \le \bar{x}_t & \forall \zeta \in \cup_{j=1}^{N} \mathcal{U}_N^j,\ \forall t \in [T],
\end{array}$$

where $I_1 = 0$ and $\Xi = \mathbb{R}_+^T$. For any feasible decision rule to the above problem and for each stage $t$, we observe that the following constraint is satisfied:
$$- \sup_{\zeta' \in \cup_{j=1}^{N} \mathcal{U}_N^j} \sum_{s=1}^{T} \zeta_s' \ \le\ I_{t+1}(\zeta_1, \ldots, \zeta_t) \ \le\ \sum_{s=1}^{T} \bar{x}_s \quad \forall \zeta \in \cup_{j=1}^{N} \mathcal{U}_N^j.$$

Moreover, we can without loss of generality impose the constraint that
$$0 \le y_{t+1}(\zeta_1, \ldots, \zeta_t) = \max\big\{ h_t I_{t+1}(\zeta_1, \ldots, \zeta_t),\ -b_t I_{t+1}(\zeta_1, \ldots, \zeta_t) \big\} \quad \forall \zeta \in \cup_{j=1}^{N} \mathcal{U}_N^j.$$
Applying the aforementioned bounds on $I_{t+1}(\zeta_1, \ldots, \zeta_t)$ over the uncertainty sets, we conclude that Assumption 3 holds.
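As a numerical sanity check on these inventory bounds, one can simulate an arbitrary feasible production rule and verify that every realized inventory level lies inside them. A sketch under the Section 8 capacity $\bar{x}_t = 260$ (all variable names are ours):

```python
import numpy as np

T, x_bar = 5, 260.0                                   # horizon and capacity from Section 8
rng = np.random.default_rng(1)
zeta = rng.uniform(0.0, 300.0, size=(1000, T))        # demand realizations in Xi = R_+^T
x = rng.uniform(0.0, x_bar, size=(1000, T))           # an arbitrary feasible production plan
inventory = np.cumsum(x - zeta, axis=1)               # I_{t+1} = I_t + x_t - zeta_t with I_1 = 0
assert np.all(inventory <= T * x_bar)                 # upper bound: sum of the capacities
assert np.all(inventory >= -zeta.sum(axis=1, keepdims=True))  # lower bound: -(sum of demands)
```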

Appendix B: On the Tightness of the Bounds from Theorem 1

In Section 4.2, we introduced a lower bound $\underline{J}$ and an upper bound $\bar{J}$ on the optimal cost $J^*$ of Problem (1). In Theorem 1, we showed under mild assumptions that these quantities also provide an asymptotic lower and upper bound on the optimal cost $\hat{J}_N$ of Problem (2). In this appendix, we demonstrate the practical value of these bounds by establishing sufficient conditions for the lower and upper bounds to be equal and applying those sufficient conditions to applications of multi-stage stochastic linear optimization from Sections 3, 7, and 8.

B.1. Sufficient conditions for $\underline{J} = J^*$ and $\bar{J} = J^*$

We begin by developing our two primary results, Theorems EC.1 and EC.2, which establish sufficient conditions for the lower bound $\underline{J}$ and the upper bound $\bar{J}$ to be equal to the optimal cost $J^*$ of Problem (1). In particular, we will show in the following Appendix B.2 that the sufficient conditions in these theorems can be verified in examples in which the underlying joint probability distribution and the support of the stochastic process are unknown. Consequently, the following two theorems can serve as practical tools for demonstrating that our proposed robust optimization approach, Problem (2), is asymptotically optimal for specific multi-stage stochastic linear optimization problems which arise in real-world applications.

Our first primary result, Theorem EC.1, establishes a sufficient condition for the lower bound $\underline{J}$ to be equal to the optimal cost $J^*$ of Problem (1). Recall that $S$ denotes the support of the joint probability distribution. Speaking intuitively, the following theorem shows that $\underline{J}$ is guaranteed to equal $J^*$ if there exists an optimal decision rule to a stochastic problem over any restricted support $\tilde{S} \subseteq S$ that can be extended to a decision rule that is feasible for Problem (1) and has a well-behaved objective function. To see the utility of the following theorem, we refer the reader to the examples in Appendix B.2.

Theorem EC.1 (Sufficient condition for lower bound). Let Assumption 1 hold, and suppose there exists an $L \ge 0$ such that, for all $\tilde{S} \subseteq S$, the optimal cost of the optimization problem
$$\begin{array}{ll}
\displaystyle\mathop{\textup{minimize}}_{x \in \mathcal{X}} & \displaystyle\mathbb{E}\left[ \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1})\, \mathbb{I}\big\{\boldsymbol{\xi} \in \tilde{S}\big\} \right] \\[2ex]
\textup{subject to} & \displaystyle\sum_{t=1}^{T} A_t(\boldsymbol{\zeta}) x_t(\boldsymbol{\zeta}_1, \ldots, \boldsymbol{\zeta}_{t-1}) \le b(\boldsymbol{\zeta}) \quad \forall \boldsymbol{\zeta} \in \tilde{S}
\end{array}$$
would not change if we added the constraints
$$0 \le \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \le L\left(1 + \max\left\{\|\boldsymbol{\xi}\|,\ \sup_{\boldsymbol{\zeta} \in \tilde{S}} \|\boldsymbol{\zeta}\|\right\}\right) \quad \textup{a.s.}$$
$$\sum_{t=1}^{T} A_t(\boldsymbol{\xi}) x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \le b(\boldsymbol{\xi}) \quad \textup{a.s.}$$
Then, $\underline{J} = J^*$.
Proof. We recall from the definition of the lower bound that $\underline{J} := \lim_{\rho \to 0} \underline{J}_{\rho}$, where
$$\underline{J}_{\rho} := \min_{\tilde{S} \subseteq \Xi:\ \mathbb{P}(\boldsymbol{\xi} \in \tilde{S}) \ge 1 - \rho} \left\{
\begin{array}{ll}
\mathop{\textup{minimize}}\limits_{x \in \mathcal{X}} & \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1})\, \mathbb{I}\big\{\boldsymbol{\xi} \in \tilde{S}\big\} \right] \\[1.5ex]
\textup{subject to} & \sum_{t=1}^{T} A_t(\boldsymbol{\zeta}) x_t(\boldsymbol{\zeta}_1, \ldots, \boldsymbol{\zeta}_{t-1}) \le b(\boldsymbol{\zeta}) \quad \forall \boldsymbol{\zeta} \in \tilde{S}
\end{array} \right\}.$$
Therefore, it follows from the conditions of Theorem EC.1 that
$$\underline{J}_{\rho} = \min_{\tilde{S} \subseteq \Xi:\ \mathbb{P}(\boldsymbol{\xi} \in \tilde{S}) \ge 1 - \rho} \left\{
\begin{array}{ll}
\mathop{\textup{minimize}}\limits_{x \in \mathcal{X}} & \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1})\, \mathbb{I}\big\{\boldsymbol{\xi} \in \tilde{S}\big\} \right] \\[1.5ex]
\textup{subject to} & \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \ge 0 \quad \textup{a.s.} \\[1.5ex]
& \sum_{t=1}^{T} A_t(\boldsymbol{\zeta}) x_t(\boldsymbol{\zeta}_1, \ldots, \boldsymbol{\zeta}_{t-1}) \le b(\boldsymbol{\zeta}) \quad \forall \boldsymbol{\zeta} \in \tilde{S}
\end{array} \right\}.$$
For any feasible solution to the above optimization problem, we observe that the function $\sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1})$ is nonnegative almost surely. Therefore, for any arbitrary $r \ge 0$, a lower bound on $\underline{J}_{\rho}$ is given by
$$\begin{aligned}
\underline{J}_{\rho,r} &:= \min_{\tilde{S} \subseteq \Xi:\ \mathbb{P}(\boldsymbol{\xi} \in \tilde{S}) \ge 1 - \rho} \left\{
\begin{array}{ll}
\mathop{\textup{minimize}}\limits_{x \in \mathcal{X}} & \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1})\, \mathbb{I}\big\{\boldsymbol{\xi} \in \tilde{S},\, \|\boldsymbol{\xi}\| \le r\big\} \right] \\[1.5ex]
\textup{subject to} & \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \ge 0 \quad \textup{a.s.} \\[1.5ex]
& \sum_{t=1}^{T} A_t(\boldsymbol{\zeta}) x_t(\boldsymbol{\zeta}_1, \ldots, \boldsymbol{\zeta}_{t-1}) \le b(\boldsymbol{\zeta}) \quad \forall \boldsymbol{\zeta} \in \tilde{S}
\end{array} \right\} \hspace{2em} \textup{(EC.1)} \\[1.5ex]
&\ge \min_{\tilde{S} \subseteq \Xi:\ \mathbb{P}(\boldsymbol{\xi} \in \tilde{S}) \ge 1 - \rho} \left\{
\begin{array}{ll}
\mathop{\textup{minimize}}\limits_{x \in \mathcal{X}} & \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1})\, \mathbb{I}\big\{\boldsymbol{\xi} \in \tilde{S},\, \|\boldsymbol{\xi}\| \le r\big\} \right] \\[1.5ex]
\textup{subject to} & \sum_{t=1}^{T} A_t(\boldsymbol{\zeta}) x_t(\boldsymbol{\zeta}_1, \ldots, \boldsymbol{\zeta}_{t-1}) \le b(\boldsymbol{\zeta}) \quad \forall \boldsymbol{\zeta} \in \tilde{S} : \|\boldsymbol{\zeta}\| \le r
\end{array} \right\}, \hspace{2em} \textup{(EC.2)}
\end{aligned}$$
where the inequality follows from removing constraints from the inner minimization problem in line (EC.1). Furthermore, it follows from the conditions of Theorem EC.1 that line (EC.2) is equal to
$$\min_{\tilde{S} \subseteq \Xi:\ \mathbb{P}(\boldsymbol{\xi} \in \tilde{S}) \ge 1 - \rho} \left\{
\begin{array}{ll}
\mathop{\textup{minimize}}\limits_{x \in \mathcal{X}} & \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1})\, \mathbb{I}\big\{\boldsymbol{\xi} \in \tilde{S},\, \|\boldsymbol{\xi}\| \le r\big\} \right] \\[1.5ex]
\textup{subject to} & \sum_{t=1}^{T} A_t(\boldsymbol{\zeta}) x_t(\boldsymbol{\zeta}_1, \ldots, \boldsymbol{\zeta}_{t-1}) \le b(\boldsymbol{\zeta}) \quad \forall \boldsymbol{\zeta} \in \tilde{S} : \|\boldsymbol{\zeta}\| \le r \\[1.5ex]
& 0 \le \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \le L\left(1 + \max\left\{\|\boldsymbol{\xi}\|,\ \sup_{\boldsymbol{\zeta} \in \tilde{S}: \|\boldsymbol{\zeta}\| \le r} \|\boldsymbol{\zeta}\|\right\}\right) \quad \textup{a.s.} \\[1.5ex]
& \sum_{t=1}^{T} A_t(\boldsymbol{\xi}) x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \le b(\boldsymbol{\xi}) \quad \textup{a.s.}
\end{array} \right\} \hspace{2em} \textup{(EC.3)}$$
$$\ge \min_{\tilde{S} \subseteq \Xi:\ \mathbb{P}(\boldsymbol{\xi} \in \tilde{S}) \ge 1 - \rho} \left\{
\begin{array}{ll}
\mathop{\textup{minimize}}\limits_{x \in \mathcal{X}} & \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1})\, \mathbb{I}\big\{\boldsymbol{\xi} \in \tilde{S},\, \|\boldsymbol{\xi}\| \le r\big\} \right] \\[1.5ex]
\textup{subject to} & 0 \le \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \le L\left(1 + \max\{\|\boldsymbol{\xi}\|, r\}\right) \quad \textup{a.s.} \\[1.5ex]
& \sum_{t=1}^{T} A_t(\boldsymbol{\xi}) x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \le b(\boldsymbol{\xi}) \quad \textup{a.s.}
\end{array} \right\}, \hspace{2em} \textup{(EC.4)}$$
where the inequality follows from removing constraints from the inner minimization problem in line (EC.3) and using the fact that $\sup_{\boldsymbol{\zeta}' \in \tilde{S}: \|\boldsymbol{\zeta}'\| \le r} \|\boldsymbol{\zeta}'\| \le r$.

We now use Assumption 1 to obtain a lower bound on (EC.4). Indeed, Assumption 1 says that there exists an $a > 1$ such that $b := \mathbb{E}[\exp(\|\boldsymbol{\xi}\|^a)] < \infty$. Therefore, for any feasible solutions $\tilde{S} \subseteq \Xi$ and $x \in \mathcal{X}$ to the optimization problems in (EC.4),
$$\begin{aligned}
&\mathbb{E}\left[ \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1})\, \mathbb{I}\big\{\boldsymbol{\xi} \notin \tilde{S} \textup{ or } \|\boldsymbol{\xi}\| > r\big\} \right] \\
&\quad\le \mathbb{E}\left[ L\left(1 + \max\{\|\boldsymbol{\xi}\|, r\}\right) \mathbb{I}\big\{\boldsymbol{\xi} \notin \tilde{S} \textup{ or } \|\boldsymbol{\xi}\| > r\big\} \right] \\
&\quad\le \mathbb{E}\left[ L\left(1 + \max\{\|\boldsymbol{\xi}\|, r\}\right) \mathbb{I}\big\{\boldsymbol{\xi} \notin \tilde{S}\big\} \right] + \mathbb{E}\left[ L\left(1 + \max\{\|\boldsymbol{\xi}\|, r\}\right) \mathbb{I}\big\{\|\boldsymbol{\xi}\| > r\big\} \right] \\
&\quad\le \sqrt{\mathbb{E}\left[ L^2 \left(1 + \max\{\|\boldsymbol{\xi}\|, r\}\right)^2 \right]} \sqrt{\rho} + \sqrt{\mathbb{E}\left[ L^2 \left(1 + \max\{\|\boldsymbol{\xi}\|, r\}\right)^2 \right]} \sqrt{\mathbb{P}(\|\boldsymbol{\xi}\| > r)} \\
&\quad\le \underbrace{\sqrt{\mathbb{E}\left[ L^2 \left(1 + \max\{\|\boldsymbol{\xi}\|, r\}\right)^2 \right]} \sqrt{\rho} + \sqrt{\mathbb{E}\left[ L^2 \left(1 + \max\{\|\boldsymbol{\xi}\|, r\}\right)^2 \right]} \sqrt{\frac{b}{\exp(r^a)}}}_{h(\rho,\, r)}.
\end{aligned}$$
Indeed, the first inequality follows because $0 \le \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \le L(1 + \max\{\|\boldsymbol{\xi}\|, r\})$ almost surely, the second inequality follows from the union bound, the third inequality follows from $\mathbb{P}(\boldsymbol{\xi} \in \tilde{S}) \ge 1 - \rho$ and the Cauchy-Schwarz inequality, and the fourth and final inequality follows from Markov's inequality. Therefore,
$$\begin{aligned}
\underline{J}_{\rho,r} &\ge -h(\rho, r) + \min_{\tilde{S} \subseteq \Xi:\ \mathbb{P}(\boldsymbol{\xi} \in \tilde{S}) \ge 1 - \rho} \left\{
\begin{array}{ll}
\mathop{\textup{minimize}}\limits_{x \in \mathcal{X}} & \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \right] \\[1.5ex]
\textup{subject to} & 0 \le \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \le L\left(1 + \max\{\|\boldsymbol{\xi}\|, r\}\right) \quad \textup{a.s.} \\[1.5ex]
& \sum_{t=1}^{T} A_t(\boldsymbol{\xi}) x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \le b(\boldsymbol{\xi}) \quad \textup{a.s.}
\end{array} \right\} \\[1.5ex]
&\ge -h(\rho, r) + \left\{
\begin{array}{ll}
\mathop{\textup{minimize}}\limits_{x \in \mathcal{X}} & \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\boldsymbol{\xi}) \cdot x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \right] \\[1.5ex]
\textup{subject to} & \sum_{t=1}^{T} A_t(\boldsymbol{\xi}) x_t(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{t-1}) \le b(\boldsymbol{\xi}) \quad \textup{a.s.}
\end{array} \right\} \\[1.5ex]
&= -h(\rho, r) + J^*. \hspace{2em} \textup{(EC.5)}
\end{aligned}$$
The first inequality follows from the lower bound on $\underline{J}_{\rho,r}$ in line (EC.4), the definition of $h(\rho, r)$, and the law of iterated expectation. The second inequality follows from removing constraints, and the final equality follows from the definition of $J^*$.

We now combine the above results to prove the main result. Indeed,
$$\underline{J} = \lim_{\rho \downarrow 0} \underline{J}_{\rho} \ \ge\ \lim_{r \to \infty} \lim_{\rho \downarrow 0} \underline{J}_{\rho,r} \ \ge\ \lim_{r \to \infty} \lim_{\rho \downarrow 0} \left( -h(\rho, r) + J^* \right) = J^*.$$
The first inequality follows because $\underline{J}_{\rho} \ge \underline{J}_{\rho,r}$ for any arbitrary $r \ge 0$ and the quantity $\lim_{\rho \downarrow 0} \underline{J}_{\rho,r}$ is monotonically increasing in $r$. The second inequality follows from (EC.5). The final equality follows from the definition of $h(\rho, r)$ and Assumption 1. Since the inequality $\underline{J} \le J^*$ always holds, our proof is complete. $\square$
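To spell out the final equality, note that $h(\rho, r)$ vanishes in the stated double limit. Writing $M_r := \sqrt{\mathbb{E}[L^2(1 + \max\{\|\boldsymbol{\xi}\|, r\})^2]}$, which is finite for every fixed $r$ by Assumption 1, we have

```latex
\lim_{r \to \infty} \lim_{\rho \downarrow 0} h(\rho, r)
  = \lim_{r \to \infty} \lim_{\rho \downarrow 0}
      \left( M_r \sqrt{\rho} + M_r \sqrt{\frac{b}{\exp(r^a)}} \right)
  = \lim_{r \to \infty} M_r \sqrt{\frac{b}{\exp(r^a)}} = 0,
```

since $M_r \le L\sqrt{\mathbb{E}[(1 + \|\boldsymbol{\xi}\| + r)^2]}$ grows at most linearly in $r$ while $\exp(r^a)$ grows faster than any polynomial in $r$.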

We conclude the present Appendix B.1 by developing our second primary result, Theorem EC.2, which establishes a sufficient condition for the upper bound $\bar{J}$ to be equal to the optimal cost $J^*$ of Problem (1). Speaking intuitively, the following theorem says that $\bar{J}$ is equal to $J^*$ if there are near-optimal decision rules to Problem (1) that are feasible and result in an objective function which is both upper-semicontinuous and well-behaved on a slight extension of the support $S$.

Theorem EC.2 (Sufficient condition for upper bound). Let Assumption 1 hold, and suppose for all $\eta > 0$ that there exist an $\eta$-optimal decision rule $x^{\eta} \in \mathcal{X}$ for Problem (1) and constants $\rho_{\eta} > 0$, $L_{\eta} \ge 0$ such that
\[
\sum_{t=1}^{T} A_t(\zeta)\, x_t^{\eta}(\zeta_1, \ldots, \zeta_{t-1}) \le b(\zeta) \quad \forall \zeta \in \Xi:\ \textup{dist}(\zeta, \mathcal{S}) \le \rho_{\eta}; \tag{EC.6}
\]
\[
0 \le \sum_{t=1}^{T} c_t(\zeta) \cdot x_t^{\eta}(\zeta_1, \ldots, \zeta_{t-1}) \le L_{\eta}\left(1 + \|\zeta\|\right) \quad \forall \zeta \in \Xi:\ \textup{dist}(\zeta, \mathcal{S}) \le \rho_{\eta}; \tag{EC.7}
\]
\[
\lim_{\epsilon \to 0}\; \sup_{\zeta \in \Xi:\, \|\zeta - \xi\| \le \epsilon} \left\{ \sum_{t=1}^{T} c_t(\zeta) \cdot x_t^{\eta}(\zeta_1, \ldots, \zeta_{t-1}) \right\} = \sum_{t=1}^{T} c_t(\xi) \cdot x_t^{\eta}(\xi_1, \ldots, \xi_{t-1}) \quad \textup{a.s.} \tag{EC.8}
\]
Then, $\bar{J} = J^*$.

Proof. For any arbitrary $\eta > 0$, consider an $\eta$-optimal decision rule $x^{\eta} \in \mathcal{X}$ for Problem (1) and constants $\rho_{\eta} > 0$, $L_{\eta} \ge 0$ which satisfy properties (EC.6), (EC.7), and (EC.8). Then,
\[
\begin{aligned}
\bar{J} &\le \bar{\mathbb{E}}\left[ \sum_{t=1}^{T} c_t(\xi) \cdot x_t^{\eta}(\xi_1, \ldots, \xi_{t-1}) \right] \\
&= \lim_{\epsilon \to 0} \mathbb{E}\left[ \sup_{\zeta \in \Xi:\, \|\zeta - \xi\| \le \epsilon} \left\{ \sum_{t=1}^{T} c_t(\zeta) \cdot x_t^{\eta}(\zeta_1, \ldots, \zeta_{t-1}) \right\} \right] \\
&= \mathbb{E}\left[ \lim_{\epsilon \to 0} \sup_{\zeta \in \Xi:\, \|\zeta - \xi\| \le \epsilon} \left\{ \sum_{t=1}^{T} c_t(\zeta) \cdot x_t^{\eta}(\zeta_1, \ldots, \zeta_{t-1}) \right\} \right] \\
&= \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\xi) \cdot x_t^{\eta}(\xi_1, \ldots, \xi_{t-1}) \right] \\
&\le J^* + \eta.
\end{aligned}
\]
Indeed, the first inequality follows from the definition of $\bar{J}$ and the feasibility of $x^{\eta}$ for the problem defining $\bar{J}$ for all sufficiently small $\rho$, as indicated by (EC.6). The first equality is the definition of the local upper semicontinuous envelope. The second equality follows from the dominated convergence theorem, which can be applied because of Assumption 1 and (EC.7). The third equality follows from (EC.8), and the final inequality holds because $x^{\eta}$ is an $\eta$-optimal decision rule for Problem (1). Since $\eta > 0$ was chosen arbitrarily and since the inequality $J^* \le \bar{J}$ always holds, our proof is complete. $\square$

B.2. Applications of Theorems EC.1 and EC.2

In the previous section, we developed sufficient conditions, Theorems EC.1 and EC.2, for the proposed robust optimization approach, Problem (2), to be asymptotically optimal for multi-stage stochastic linear optimization problems. In this section, we use those sufficient conditions to show that Problem (2) is asymptotically optimal in three data-driven examples of multi-stage stochastic linear optimization based on Sections 3, 7, and 8. Altogether, the three examples provide evidence that the lower and upper bounds in Theorem 1 can coincide in applications of multi-stage stochastic linear optimization that arise in practice.

B.2.1. Example 1 from Section 3. Consider the multi-stage stochastic linear optimization problem
\[
\begin{aligned}
J^* = \underset{x_2: \mathbb{R} \to \mathbb{R},\; x_3: \mathbb{R}^2 \to \mathbb{R}}{\textup{minimize}} \quad & \mathbb{E}\left[ x_2(\xi_1) + 2 x_3(\xi_1, \xi_2) \right] \\
\textup{subject to} \quad & x_2(\xi_1) + x_3(\xi_1, \xi_2) \ge \xi_1 + \xi_2 \;\; \textup{a.s.} \\
& x_2(\xi_1),\; x_3(\xi_1, \xi_2) \ge 0 \;\; \textup{a.s.},
\end{aligned} \tag{3}
\]
where the random variables $\xi = (\xi_1, \xi_2) \in \mathbb{R}^2$ denote the preorder and regular demand of a new product. We assume that this stochastic process satisfies Assumption 1 and is contained in $\Xi := \mathbb{R}_+^2$.

Proposition 1. For Problem (3), $\underline{J} = J^*$. If there is an optimal $x_2^*: \mathbb{R} \to \mathbb{R}$ for Problem (3) which is continuous, then $\bar{J} = J^*$.

Proof. Our proof is split into two parts:

• For any arbitrary $\tilde{\mathcal{S}} \subseteq \mathcal{S}$, we observe that the optimal cost of the optimization problem
\[
\begin{aligned}
\textup{minimize} \quad & \mathbb{E}\left[ \left( x_2(\xi_1) + 2 x_3(\xi_1, \xi_2) \right) \mathbb{I}\{\xi \in \tilde{\mathcal{S}}\} \right] \\
\textup{subject to} \quad & x_2(\zeta_1) + x_3(\zeta_1, \zeta_2) \ge \zeta_1 + \zeta_2 \quad \forall \zeta \in \tilde{\mathcal{S}} \\
& x_2(\zeta_1),\; x_3(\zeta_1, \zeta_2) \ge 0 \quad \forall \zeta \in \tilde{\mathcal{S}}
\end{aligned} \tag{EC.9}
\]
would not change if we added the constraints
\[
0 \le x_2(\xi_1) \le \sup_{\zeta \in \tilde{\mathcal{S}}} \{\zeta_1 + \zeta_2\} \;\; \textup{a.s.}, \qquad
0 \le x_3(\xi_1, \xi_2) \le \xi_1 + \xi_2 \;\; \textup{a.s.}
\]
Moreover, any feasible solution to Problem (EC.9) which satisfies the above constraints will also satisfy
\[
0 \le x_2(\xi_1) + 2 x_3(\xi_1, \xi_2) \le 3 \max\left\{ \sup_{\zeta \in \tilde{\mathcal{S}}} \{\zeta_1 + \zeta_2\},\; \xi_1 + \xi_2 \right\} \;\; \textup{a.s.}
\]
as well as satisfy the constraints in Problem (3). Since Assumption 1 holds, we conclude from Theorem EC.1 that $\underline{J} = J^*$ for Problem (3).

• Let $x_2^*: \mathbb{R} \to \mathbb{R}$ denote an optimal second-stage decision rule for Problem (3) which is continuous. For any $M \ge 0$, define the new decision rules
\[
x_2^M(\zeta_1) := \max\{0, \min\{x_2^*(\zeta_1), M\}\}, \qquad
x_3^M(\zeta_1, \zeta_2) := \max\left\{0,\; \zeta_1 + \zeta_2 - x_2^M(\zeta_1)\right\}. \tag{EC.10}
\]
We observe that the decision rules from (EC.10) satisfy the constraints of Problem (3). Moreover,
\[
\begin{aligned}
& \mathbb{E}\left[ x_2^M(\xi_1) + 2 x_3^M(\xi_1, \xi_2) \right] \\
&\quad = \mathbb{E}\left[ x_2^M(\xi_1) \right] + \mathbb{E}\left[ 2 x_3^M(\xi_1, \xi_2)\, \mathbb{I}\{x_2^*(\xi_1) \le M\} \right] + \mathbb{E}\left[ 2 x_3^M(\xi_1, \xi_2)\, \mathbb{I}\{x_2^*(\xi_1) > M\} \right] && \textup{(EC.11)} \\
&\quad \le \mathbb{E}\left[ x_2^*(\xi_1) \right] + \mathbb{E}\left[ 2 x_3^M(\xi_1, \xi_2)\, \mathbb{I}\{x_2^*(\xi_1) \le M\} \right] + \mathbb{E}\left[ 2 x_3^M(\xi_1, \xi_2)\, \mathbb{I}\{x_2^*(\xi_1) > M\} \right] && \textup{(EC.12)} \\
&\quad = \mathbb{E}\left[ x_2^*(\xi_1) \right] + \mathbb{E}\left[ 2 x_3^*(\xi_1, \xi_2)\, \mathbb{I}\{x_2^*(\xi_1) \le M\} \right] + \mathbb{E}\left[ 2 x_3^M(\xi_1, \xi_2)\, \mathbb{I}\{x_2^*(\xi_1) > M\} \right] && \textup{(EC.13)} \\
&\quad \le J^* + \mathbb{E}\left[ 2 x_3^M(\xi_1, \xi_2)\, \mathbb{I}\{x_2^*(\xi_1) > M\} \right] && \textup{(EC.14)} \\
&\quad \le J^* + 2\, \mathbb{E}\left[ (\xi_1 + \xi_2)\, \mathbb{I}\{x_2^*(\xi_1) > M\} \right] && \textup{(EC.15)} \\
&\quad \le J^* + 2 \sqrt{\mathbb{E}\left[ (\xi_1 + \xi_2)^2 \right]}\, \sqrt{\mathbb{P}\left( x_2^*(\xi_1) > M \right)}. && \textup{(EC.16)}
\end{aligned}
\]
Indeed, (EC.11) follows from the law of total probability; (EC.12) and (EC.13) follow from the definition of the decision rules from (EC.10); (EC.14) holds because the inequality $x_3^M(\xi_1, \xi_2) \ge 0$ holds almost surely; (EC.15) follows from the definition of the decision rule $x_3^M(\xi_1, \xi_2)$; and (EC.16) follows from the Cauchy-Schwarz inequality.

Since $\lim_{M \to \infty} \mathbb{P}(x_2^*(\xi_1) > M) = 0$ and $\mathbb{E}[(\xi_1 + \xi_2)^2] < \infty$ (Assumption 1), we have shown, for every arbitrary $\eta > 0$, that there exists a constant $M \equiv M_{\eta} \ge 0$ such that the decision rules from (EC.10) are an $\eta$-optimal solution to Problem (3). Moreover, we readily observe that the decision rules from (EC.10) satisfy the properties of Theorem EC.2. Indeed, for every $M \ge 0$:

— Property (EC.6) is clearly satisfied by the decision rules $(x_2^M, x_3^M)$.

— Property (EC.7) is satisfied by the decision rules $(x_2^M, x_3^M)$ because the inequalities $0 \le x_2^M(\zeta_1) + 2 x_3^M(\zeta_1, \zeta_2) \le M + 2(\zeta_1 + \zeta_2)$ hold for all $\zeta \in \Xi$.

— Property (EC.8) is satisfied by the decision rules $(x_2^M, x_3^M)$ because the optimal decision rules $(x_2^*, x_3^*)$ are continuous functions, which implies that $(x_2^M, x_3^M)$ are continuous functions as well over the domain $\zeta \in \Xi$.

We thus conclude from Theorem EC.2 that $J^* = \bar{J}$ for Problem (3). $\square$
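To make the truncation argument concrete, the following sketch evaluates the decision rules from (EC.10) on simulated data. The rule $x_2^*(\zeta_1) = \zeta_1 + 1$ and the exponential demands are our own illustrative choices, not part of the proof; the sketch only checks that the truncated rules are feasible for Problem (3) by construction and reports their sample-average cost.

```python
import numpy as np

rng = np.random.default_rng(0)
xi1 = rng.exponential(size=100_000)   # illustrative preorder demand
xi2 = rng.exponential(size=100_000)   # illustrative regular demand

def x2_star(z1):
    # a hypothetical continuous second-stage rule (our own choice)
    return z1 + 1.0

def truncated_rules(z1, z2, M):
    x2M = np.clip(x2_star(z1), 0.0, M)        # x2^M from (EC.10)
    x3M = np.maximum(0.0, z1 + z2 - x2M)      # x3^M from (EC.10)
    return x2M, x3M

for M in [1.0, 5.0, 50.0]:
    x2M, x3M = truncated_rules(xi1, xi2, M)
    # feasibility for Problem (3): cover total demand, stay nonnegative
    assert np.all(x2M + x3M >= xi1 + xi2 - 1e-9)
    print(M, (x2M + 2.0 * x3M).mean())        # cost of the truncated rules
```

As the proof argues, the cost of $(x_2^M, x_3^M)$ approaches that of $(x_2^*, x_3^*)$ as $M$ grows, while feasibility holds for every $M \ge 0$ by construction.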




B.2.2. Example from Section 7. Consider the multi-stage stochastic linear optimization problem
\[
\begin{aligned}
J^* = \underset{Q \ge 0,\, z \in \{0,1\}^R,\, v}{\textup{minimize}} \quad & \mathbb{E}\left[ c\left( Q_{10} + \sum_{r=1}^{R} Q_{1r} \right) + h Q_{10} + \sum_{r=1}^{R} v_r(\xi_1, \xi_2) + f \sum_{r=1}^{R} z_r(\xi_1) \right] \\
\textup{subject to} \quad & \sum_{r=1}^{R} Q_{2r}(\xi_1) \le Q_{10} \;\; \textup{a.s.} \\
& v_r(\xi_1, \xi_2) \ge b(\xi_{2r} + \xi_{1r} - Q_{2r}(\xi_1) - Q_{1r}) - h Q_{2r}(\xi_1) \quad \forall r \in [R], \;\textup{a.s.} \\
& v_r(\xi_1, \xi_2) \ge h(Q_{1r} - \xi_{1r} - \xi_{2r}) \quad \forall r \in [R], \;\textup{a.s.} \\
& v_r(\xi_1, \xi_2) \ge b(\xi_{1r} - Q_{1r}) - h \xi_{2r} \quad \forall r \in [R], \;\textup{a.s.} \\
& z_r(\xi_1) M \ge Q_{2r}(\xi_1) \quad \forall r \in [R], \;\textup{a.s.},
\end{aligned} \tag{7}
\]
where the random variables $\xi = (\xi_1, \xi_2) \in \mathbb{R}^{2R}$ denote the demands of a weekly magazine at different retailers. We assume that this stochastic process satisfies Assumption 1 and is contained in $\Xi := \mathbb{R}_+^{2R}$. For simplicity, we focus on the case where $f = 0$.

Proposition EC.1. For Problem (7), $\underline{J} = J^*$. Moreover, if there are optimal decision rules $Q_{21}^*, \ldots, Q_{2R}^*: \mathbb{R}^R \to \mathbb{R}$ which are continuous, then $\bar{J} = J^*$.

Proof. Our proof is split into two parts:

• For any arbitrary $\tilde{\mathcal{S}} \subseteq \mathcal{S}$, we observe that the optimal cost of the optimization problem
\[
\begin{aligned}
\underset{Q \ge 0,\, z \in \{0,1\}^R,\, v}{\textup{minimize}} \quad & \mathbb{E}\left[ \left( c\left( Q_{10} + \sum_{r=1}^{R} Q_{1r} \right) + h Q_{10} + \sum_{r=1}^{R} v_r(\xi_1, \xi_2) \right) \mathbb{I}\{\xi \in \tilde{\mathcal{S}}\} \right] \\
\textup{subject to} \quad & \sum_{r=1}^{R} Q_{2r}(\zeta_1) \le Q_{10} \quad \forall \zeta \in \tilde{\mathcal{S}} \\
& v_r(\zeta_1, \zeta_2) \ge b(\zeta_{2r} + \zeta_{1r} - Q_{2r}(\zeta_1) - Q_{1r}) - h Q_{2r}(\zeta_1) \quad \forall r \in [R],\; \forall \zeta \in \tilde{\mathcal{S}} \\
& v_r(\zeta_1, \zeta_2) \ge h(Q_{1r} - \zeta_{1r} - \zeta_{2r}) \quad \forall r \in [R],\; \forall \zeta \in \tilde{\mathcal{S}} \\
& v_r(\zeta_1, \zeta_2) \ge b(\zeta_{1r} - Q_{1r}) - h \zeta_{2r} \quad \forall r \in [R],\; \forall \zeta \in \tilde{\mathcal{S}}
\end{aligned} \tag{EC.17}
\]
would not change if we added the following constraints:
\[
\begin{aligned}
& Q_{10} \le \sup_{\zeta \in \tilde{\mathcal{S}}} \left\{ \sum_{r=1}^{R} (\zeta_{1r} + \zeta_{2r}) \right\} \\
& Q_{1r} \le \sup_{\zeta \in \tilde{\mathcal{S}}} \{ \zeta_{1r} + \zeta_{2r} \} \quad \forall r \in [R] \\
& \sum_{r=1}^{R} Q_{2r}(\xi_1) \le Q_{10} \;\; \textup{a.s.} \\
& v_r(\xi_1, \xi_2) \le \max\left\{ b(\xi_{2r} + \xi_{1r}),\; h Q_{1r},\; b \xi_{1r} - h \xi_{2r} \right\} \quad \forall r \in [R], \;\textup{a.s.}
\end{aligned}
\]
Moreover, any feasible solution to Problem (EC.17) which satisfies the above constraints will satisfy
\[
0 \le c\left( Q_{10} + \sum_{r=1}^{R} Q_{1r} \right) + h Q_{10} + \sum_{r=1}^{R} v_r(\xi_1, \xi_2) \;\; \textup{a.s.},
\]
will satisfy
\[
\begin{aligned}
& c\left( Q_{10} + \sum_{r=1}^{R} Q_{1r} \right) + h Q_{10} + \sum_{r=1}^{R} v_r(\xi_1, \xi_2) \\
&\quad \le c\left( \sup_{\zeta \in \tilde{\mathcal{S}}} \left\{ \sum_{r=1}^{R} (\zeta_{1r} + \zeta_{2r}) \right\} + \sum_{r=1}^{R} \sup_{\zeta \in \tilde{\mathcal{S}}} \{ \zeta_{1r} + \zeta_{2r} \} \right) + h \sup_{\zeta \in \tilde{\mathcal{S}}} \left\{ \sum_{r=1}^{R} (\zeta_{1r} + \zeta_{2r}) \right\} \\
&\quad\qquad + \sum_{r=1}^{R} \max\left\{ b(\xi_{2r} + \xi_{1r}),\; h \sup_{\zeta \in \tilde{\mathcal{S}}} \{ \zeta_{1r} + \zeta_{2r} \},\; b \xi_{1r} - h \xi_{2r} \right\} \;\; \textup{a.s.} \\
&\quad \le (2c + 2h + b) \sum_{r=1}^{R} \max\left\{ \xi_{1r} + \xi_{2r},\; \sup_{\zeta \in \tilde{\mathcal{S}}} \left\{ \sum_{r=1}^{R} (\zeta_{1r} + \zeta_{2r}) \right\} \right\} \;\; \textup{a.s.},
\end{aligned}
\]
and will satisfy the constraints of Problem (7). Therefore, it readily follows that the conditions of Theorem EC.1 are satisfied, which implies that $\underline{J} = J^*$ for Problem (7).

• Let $Q_{21}^*, \ldots, Q_{2R}^*: \mathbb{R}^R \to \mathbb{R}$ denote optimal decision rules for Problem (7) which are continuous, and let $Q_{10}^*, \ldots, Q_{1R}^*$ denote optimal first-stage decisions for Problem (7). We define the following new decision rules for all retailers $r \in [R]$ and all realizations $\zeta = (\zeta_1, \zeta_2) \in \Xi$:
\[
\begin{aligned}
Q_{2r}'(\zeta_1) &:= \max\left\{ 0,\; \min\left\{ Q_{2r}^*(\zeta_1),\; Q_{10}^* - \sum_{s=1}^{r-1} Q_{2s}'(\zeta_1) \right\} \right\} \\
v_r'(\zeta_1, \zeta_2) &:= \max\left\{ b(\zeta_{2r} + \zeta_{1r} - Q_{2r}'(\zeta_1) - Q_{1r}^*) - h Q_{2r}'(\zeta_1),\; h(Q_{1r}^* - \zeta_{1r} - \zeta_{2r}),\; b(\zeta_{1r} - Q_{1r}^*) - h \zeta_{2r} \right\}.
\end{aligned}
\]
We observe that $(Q_{10}^*, \ldots, Q_{1R}^*, Q_{21}', \ldots, Q_{2R}', v_1', \ldots, v_R')$ is an optimal solution to Problem (7) and satisfies the conditions of Theorem EC.2, which concludes our proof that $J^* = \bar{J}$ for Problem (7). $\square$
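The sequential clipping that defines $Q_{2r}'$ can be read as a greedy allocation of the desired second-stage orders against the remaining first-stage warehouse stock. A minimal sketch of that construction (the `allocate` helper and its inputs are hypothetical, for illustration only):

```python
import numpy as np

def allocate(q2_star, Q10):
    """Sequential clipping from the proof of Proposition EC.1: allocate the
    desired second-stage orders q2_star[r] in order r = 1, ..., R without
    exceeding the first-stage warehouse stock Q10."""
    out, remaining = [], Q10
    for q in q2_star:
        a = max(0.0, min(q, remaining))   # Q'_{2r}
        out.append(a)
        remaining -= a
    return np.array(out)

print(allocate([3.0, 4.0, 2.0], Q10=6.0))   # → [3. 3. 0.]
```

By construction, the allocated quantities are nonnegative and sum to at most $Q_{10}$, which is exactly why the clipped rules remain feasible for Problem (7).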


B.2.3. Example from Section 8. Consider the multi-stage stochastic optimization problem
\[
\begin{aligned}
J^* = \underset{x,\, I,\, y}{\textup{minimize}} \quad & \mathbb{E}\left[ \sum_{t=1}^{T} \left( c_t\, x_t(\xi_1, \ldots, \xi_{t-1}) + y_{t+1}(\xi_1, \ldots, \xi_t) \right) \right] \\
\textup{subject to} \quad & I_{t+1}(\xi_1, \ldots, \xi_t) = I_t(\xi_1, \ldots, \xi_{t-1}) + x_t(\xi_1, \ldots, \xi_{t-1}) - \xi_t \;\; \textup{a.s.}, \;\; \forall t \in [T] \\
& y_{t+1}(\xi_1, \ldots, \xi_t) \ge h_t\, I_{t+1}(\xi_1, \ldots, \xi_t) \;\; \textup{a.s.}, \;\; \forall t \in [T] \\
& y_{t+1}(\xi_1, \ldots, \xi_t) \ge -b_t\, I_{t+1}(\xi_1, \ldots, \xi_t) \;\; \textup{a.s.}, \;\; \forall t \in [T] \\
& 0 \le x_t(\xi_1, \ldots, \xi_{t-1}) \le \bar{x}_t \;\; \textup{a.s.}, \;\; \forall t \in [T],
\end{aligned} \tag{10}
\]
where $I_1 = 0$ and the parameters $c_t, h_t, b_t, \bar{x}_t$ are nonnegative for all periods $t \in [T]$. We assume that the stochastic process $\xi \equiv (\xi_1, \ldots, \xi_T) \in \mathbb{R}^T$ satisfies Assumption 1 and is contained in $\Xi = \mathbb{R}_+^T$.

Proposition EC.2. For Problem (10), $\underline{J} = J^*$. Moreover, if there exist optimal decision rules $x_t^*: \mathbb{R}^{t-1} \to \mathbb{R}$ which are continuous, then $\bar{J} = J^*$.

Proof. Our proof is split into two parts:

• For any arbitrary $\tilde{\mathcal{S}} \subseteq \mathcal{S}$, we observe that the optimal cost of the optimization problem
\[
\begin{aligned}
\underset{x,\, I,\, y}{\textup{minimize}} \quad & \mathbb{E}\left[ \sum_{t=1}^{T} \left( c_t\, x_t(\xi_1, \ldots, \xi_{t-1}) + y_{t+1}(\xi_1, \ldots, \xi_t) \right) \mathbb{I}\{\xi \in \tilde{\mathcal{S}}\} \right] \\
\textup{subject to} \quad & I_{t+1}(\zeta_1, \ldots, \zeta_t) = I_t(\zeta_1, \ldots, \zeta_{t-1}) + x_t(\zeta_1, \ldots, \zeta_{t-1}) - \zeta_t \quad \forall \zeta \in \tilde{\mathcal{S}},\; \forall t \in [T] \\
& y_{t+1}(\zeta_1, \ldots, \zeta_t) \ge h_t\, I_{t+1}(\zeta_1, \ldots, \zeta_t) \quad \forall \zeta \in \tilde{\mathcal{S}},\; \forall t \in [T] \\
& y_{t+1}(\zeta_1, \ldots, \zeta_t) \ge -b_t\, I_{t+1}(\zeta_1, \ldots, \zeta_t) \quad \forall \zeta \in \tilde{\mathcal{S}},\; \forall t \in [T] \\
& 0 \le x_t(\zeta_1, \ldots, \zeta_{t-1}) \le \bar{x}_t \quad \forall \zeta \in \tilde{\mathcal{S}},\; \forall t \in [T]
\end{aligned} \tag{EC.18}
\]
would not change if we added the following constraints for each period $t \in [T]$:
\[
\begin{aligned}
& I_{t+1}(\xi_1, \ldots, \xi_t) = I_t(\xi_1, \ldots, \xi_{t-1}) + x_t(\xi_1, \ldots, \xi_{t-1}) - \xi_t \;\; \textup{a.s.} \\
& y_{t+1}(\xi_1, \ldots, \xi_t) = \max\left\{ h_t\, I_{t+1}(\xi_1, \ldots, \xi_t),\; -b_t\, I_{t+1}(\xi_1, \ldots, \xi_t) \right\} \;\; \textup{a.s.} \\
& 0 \le x_t(\xi_1, \ldots, \xi_{t-1}) \le \bar{x}_t \;\; \textup{a.s.}
\end{aligned}
\]
Moreover, any feasible solution to Problem (EC.18) which satisfies the above constraints is feasible for Problem (10) and also satisfies
\[
0 \le \sum_{t=1}^{T} \left( c_t\, x_t(\xi_1, \ldots, \xi_{t-1}) + y_{t+1}(\xi_1, \ldots, \xi_t) \right) \le \left( \sum_{t=1}^{T} c_t + \sum_{t=1}^{T} h_t \right) \left( \sum_{t=1}^{T} \bar{x}_t \right) + \left( \sum_{t=1}^{T} b_t \right) \left( \sum_{t=1}^{T} \xi_t \right) \;\; \textup{a.s.}
\]
Therefore, it readily follows that the conditions of Theorem EC.1 are satisfied, which implies that $\underline{J} = J^*$ for Problem (10).

• Let $x_t^*: \mathbb{R}^{t-1} \to \mathbb{R}$ for each $t \in [T]$ denote optimal decision rules for Problem (10) which are continuous. We define the following new decision rules for each period $t \in [T]$ and all $\zeta \in \Xi$:
\[
\begin{aligned}
x_t'(\zeta_1, \ldots, \zeta_{t-1}) &:= \max\{0, \min\{x_t^*(\zeta_1, \ldots, \zeta_{t-1}),\, \bar{x}_t\}\} \\
I_{t+1}'(\zeta_1, \ldots, \zeta_t) &:= I_t'(\zeta_1, \ldots, \zeta_{t-1}) + x_t'(\zeta_1, \ldots, \zeta_{t-1}) - \zeta_t \\
y_{t+1}'(\zeta_1, \ldots, \zeta_t) &:= \max\left\{ h_t\, I_{t+1}'(\zeta_1, \ldots, \zeta_t),\; -b_t\, I_{t+1}'(\zeta_1, \ldots, \zeta_t) \right\},
\end{aligned}
\]
where $I_1' = 0$. We observe that $(x_1', \ldots, x_T', y_2', \ldots, y_{T+1}', I_1', \ldots, I_{T+1}')$ is an optimal solution to Problem (10) and satisfies the conditions of Theorem EC.2, which concludes our proof that $J^* = \bar{J}$ for Problem (10). $\square$
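As a sanity check on this construction, the following sketch simulates the clipped rules $(x', I', y')$ forward on sampled demand paths. The base rule `x_star`, the cost parameters, and the exponential demands are our own illustrative choices, not part of the proposition:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 4
c, h, b = [1.0] * T, [0.2] * T, [2.0] * T   # nonnegative cost parameters
x_bar = [5.0] * T                           # order capacities

def x_star(t, history):
    # hypothetical continuous decision rule (our own choice, for illustration):
    # order the most recent demand, or a constant in the first period
    return history[-1] if history else 3.0

def simulate(path):
    """Evaluate the clipped rules x', I', y' from the proof on one demand path."""
    I, cost = 0.0, 0.0
    for t in range(T):
        x = max(0.0, min(x_star(t, list(path[:t])), x_bar[t]))  # x'_t
        I = I + x - path[t]                                     # I'_{t+1}
        y = max(h[t] * I, -b[t] * I)                            # y'_{t+1}
        cost += c[t] * x + y
    return cost

paths = rng.exponential(2.0, size=(1000, T))
costs = np.array([simulate(p) for p in paths])
print(costs.mean())   # sample-average cost of the clipped decision rules
```

Because $y_{t+1}' = \max\{h_t I_{t+1}', -b_t I_{t+1}'\} \ge 0$ and the order quantities are clipped to $[0, \bar{x}_t]$, every simulated cost is nonnegative and every path satisfies the constraints of Problem (10).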


Appendix C: Proof of Theorem 1 from Section 4.2

In this appendix, we present the proof of Theorem 1. The theorem consists of asymptotic lower and upper
bounds on the optimal cost of Problem (2), and we will address the proofs of the two bounds separately.

We first present the proof of the lower bound, which utilizes Theorem 2 from Section 4.2 and Theorem 3
from Section 4.4.

Theorem 1a. Suppose Assumptions 1, 2, and 3 hold. Then, $\mathbb{P}^{\infty}$-almost surely we have
\[
\underline{J} \le \liminf_{N \to \infty} \hat{J}_N.
\]

Proof. Recall from Assumption 1 that $b := \mathbb{E}[\exp(\|\xi\|^a)] < \infty$ for some $a > 1$, and let $L \ge 0$ be the constant from Assumption 3. Then,
\[
\begin{aligned}
\sum_{N=1}^{\infty} \mathbb{P}^N\left( \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} L(1 + \|\zeta\|) > \log N \right)
&= \sum_{N=1}^{\infty} \mathbb{P}^N\left( \max_{j \in [N]} \left\{ L(1 + \|\hat{\xi}^j\| + \epsilon_N) \right\} > \log N \right) && \textup{(EC.19)} \\
&\le \sum_{N=1}^{\infty} N\, \mathbb{P}\left( L(1 + \|\xi\| + \epsilon_N) > \log N \right) && \textup{(EC.20)} \\
&= \sum_{N=1}^{\infty} N\, \mathbb{P}\left( \|\xi\| > \frac{\log N}{L} - 1 - \epsilon_N \right) \\
&= \sum_{N=1}^{\infty} N\, \mathbb{P}\left( \exp(\|\xi\|^a) > \exp\left( \left( \frac{\log N}{L} - 1 - \epsilon_N \right)^a \right) \right) \\
&\le \sum_{N=1}^{\infty} \frac{N b}{\exp\left( \left( \frac{\log N}{L} - 1 - \epsilon_N \right)^a \right)} && \textup{(EC.21)} \\
&< \infty, && \textup{(EC.22)}
\end{aligned}
\]
where (EC.19) follows from the definition of the uncertainty sets, (EC.20) follows from the union bound, (EC.21) follows from Markov's inequality, and (EC.22) follows from $a > 1$ and $\epsilon_N \to 0$. Therefore, the Borel-Cantelli lemma and Assumption 3 imply that the following equality holds for all sufficiently large $N \in \mathbb{N}$, $\mathbb{P}^{\infty}$-almost surely:
\[
\begin{aligned}
\hat{J}_N = \underset{x \in \mathcal{X}}{\textup{minimize}} \quad & \frac{1}{N} \sum_{j=1}^{N} \sup_{\zeta \in \mathcal{U}_N^j} \sum_{t=1}^{T} c_t(\zeta) \cdot x_t(\zeta_1, \ldots, \zeta_{t-1}) \\
\textup{subject to} \quad & \sum_{t=1}^{T} A_t(\zeta)\, x_t(\zeta_1, \ldots, \zeta_{t-1}) \le b(\zeta) \quad \forall \zeta \in \cup_{j=1}^N \mathcal{U}_N^j \\
& \|x_t(\zeta_1, \ldots, \zeta_{t-1})\| \le \log N \quad \forall \zeta \in \cup_{j=1}^N \mathcal{U}_N^j,\; \forall t \in [T].
\end{aligned} \tag{EC.23}
\]
Moreover, since $c_1(\zeta) \in \mathbb{R}^{n_1}, \ldots, c_T(\zeta) \in \mathbb{R}^{n_T}$ are affine functions of the stochastic process, it follows from identical reasoning as (EC.19)-(EC.22) and the equivalence of $\ell_p$-norms in finite-dimensional spaces that $\sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} \|c_t(\zeta)\|_* \le \log N$ for all sufficiently large $N \in \mathbb{N}$, $\mathbb{P}^{\infty}$-almost surely.

We now apply Theorem 2 to obtain an asymptotic lower bound on the optimization problem in (EC.23). Indeed, let $M_N$ be shorthand for $N^{-\frac{1}{(d+1)(d+2)}} \log N$. Then, for all sufficiently large $N \in \mathbb{N}$, $\mathbb{P}^{\infty}$-almost surely, and for any decision rule $x \in \mathcal{X}$ which is feasible for the optimization problem in (EC.23),
\[
\begin{aligned}
& \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\xi) \cdot x_t(\xi_1, \ldots, \xi_{t-1})\, \mathbb{I}\left\{ \xi \in \cup_{j=1}^N \mathcal{U}_N^j \right\} \right] \\
&\quad \le \frac{1}{N} \sum_{j=1}^{N} \sup_{\zeta \in \mathcal{U}_N^j} \sum_{t=1}^{T} c_t(\zeta) \cdot x_t(\zeta_1, \ldots, \zeta_{t-1}) + M_N \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} \left| \sum_{t=1}^{T} c_t(\zeta) \cdot x_t(\zeta_1, \ldots, \zeta_{t-1}) \right| \\
&\quad \le \frac{1}{N} \sum_{j=1}^{N} \sup_{\zeta \in \mathcal{U}_N^j} \sum_{t=1}^{T} c_t(\zeta) \cdot x_t(\zeta_1, \ldots, \zeta_{t-1}) + M_N \sum_{t=1}^{T} \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} \|c_t(\zeta)\|_* \|x_t(\zeta_1, \ldots, \zeta_{t-1})\| \\
&\quad \le \frac{1}{N} \sum_{j=1}^{N} \sup_{\zeta \in \mathcal{U}_N^j} \sum_{t=1}^{T} c_t(\zeta) \cdot x_t(\zeta_1, \ldots, \zeta_{t-1}) + T M_N (\log N)^2,
\end{aligned}
\]
where the first inequality follows from Theorem 2, the second inequality follows from the triangle inequality and the Cauchy-Schwarz inequality, and the third inequality follows because $\|c_t(\zeta)\|_* \le \log N$ and $\|x_t(\zeta_1, \ldots, \zeta_{t-1})\| \le \log N$ for all sufficiently large $N \in \mathbb{N}$ and all realizations in the uncertainty sets. We remark that the above bound holds uniformly over all decision rules which are feasible for the optimization problem in (EC.23). Therefore, we have shown that the following inequality holds for all sufficiently large $N \in \mathbb{N}$, $\mathbb{P}^{\infty}$-almost surely:
\[
\begin{aligned}
\hat{J}_N + T M_N (\log N)^2 \ge \underset{x \in \mathcal{X}}{\textup{minimize}} \quad & \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\xi) \cdot x_t(\xi_1, \ldots, \xi_{t-1})\, \mathbb{I}\left\{ \xi \in \cup_{j=1}^N \mathcal{U}_N^j \right\} \right] \\
\textup{subject to} \quad & \sum_{t=1}^{T} A_t(\zeta)\, x_t(\zeta_1, \ldots, \zeta_{t-1}) \le b(\zeta) \quad \forall \zeta \in \cup_{j=1}^N \mathcal{U}_N^j \\
& \|x_t(\zeta_1, \ldots, \zeta_{t-1})\| \le \log N \quad \forall \zeta \in \cup_{j=1}^N \mathcal{U}_N^j,\; \forall t \in [T].
\end{aligned}
\]
Next, we obtain a lower bound on the right side of the above inequality by removing the last row of constraints and relaxing $\cup_{j=1}^N \mathcal{U}_N^j$ to any set which contains the stochastic process with sufficiently high probability:
\[
\begin{aligned}
\underset{x \in \mathcal{X},\, \tilde{\mathcal{S}} \subseteq \Xi}{\textup{minimize}} \quad & \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\xi) \cdot x_t(\xi_1, \ldots, \xi_{t-1})\, \mathbb{I}\{\xi \in \tilde{\mathcal{S}}\} \right] \\
\textup{subject to} \quad & \sum_{t=1}^{T} A_t(\zeta)\, x_t(\zeta_1, \ldots, \zeta_{t-1}) \le b(\zeta) \quad \forall \zeta \in \tilde{\mathcal{S}} \\
& \mathbb{P}\left( \xi \in \tilde{\mathcal{S}} \right) \ge \mathbb{P}\left( \xi \in \cup_{j=1}^N \mathcal{U}_N^j \right).
\end{aligned}
\]
Finally, for any fixed $\rho \in (0,1)$, it follows from Theorem 3 that $\mathbb{P}(\xi \in \cup_{j=1}^N \mathcal{U}_N^j \cap \mathcal{S}) \ge 1 - \rho$ for all sufficiently large $N \in \mathbb{N}$, $\mathbb{P}^{\infty}$-almost surely.$^3$ Furthermore, we observe that $T M_N (\log N)^2$ converges to zero as the number of sample paths $N$ tends to infinity. Therefore, we have shown that the following inequality holds, $\mathbb{P}^{\infty}$-almost surely:
\[
\begin{aligned}
\liminf_{N \to \infty} \hat{J}_N \ge \underset{x \in \mathcal{X},\, \tilde{\mathcal{S}} \subseteq \Xi}{\textup{minimize}} \quad & \mathbb{E}\left[ \sum_{t=1}^{T} c_t(\xi) \cdot x_t(\xi_1, \ldots, \xi_{t-1})\, \mathbb{I}\{\xi \in \tilde{\mathcal{S}}\} \right] \\
\textup{subject to} \quad & \sum_{t=1}^{T} A_t(\zeta)\, x_t(\zeta_1, \ldots, \zeta_{t-1}) \le b(\zeta) \quad \forall \zeta \in \tilde{\mathcal{S}} \\
& \mathbb{P}\left( \xi \in \tilde{\mathcal{S}} \right) \ge 1 - \rho.
\end{aligned}
\]
Since the inequality holds true for every $\rho \in (0,1)$, and since the optimal cost of the above optimization problem is monotone in $\rho$, we obtain the desired result. $\square$

$^3$ We remark that Devroye and Wise (1980, Theorem 2) could be used here in lieu of Theorem 3. Our primary use of Theorem 3 is in the proof of Theorem 2.

We now conclude the proof of Theorem 1 by establishing its upper bound.

Theorem 1b. Suppose Assumption 2 holds. Then, $\mathbb{P}^{\infty}$-almost surely we have
\[
\limsup_{N \to \infty} \hat{J}_N \le \bar{J}.
\]

Proof. Consider any $\rho > 0$ such that there is a decision rule $x \in \mathcal{X}$ which satisfies
\[
\bar{\mathbb{E}}\left[ \sum_{t=1}^{T} c_t(\xi) \cdot x_t(\xi_1, \ldots, \xi_{t-1}) \right] < \infty, \tag{EC.24}
\]
\[
\sum_{t=1}^{T} A_t(\zeta)\, x_t(\zeta_1, \ldots, \zeta_{t-1}) \le b(\zeta) \quad \forall \zeta \in \Xi:\ \textup{dist}(\zeta, \mathcal{S}) \le \rho. \tag{EC.25}
\]
Indeed, if no such $\rho > 0$ and $x \in \mathcal{X}$ existed, then $\bar{J} = \infty$ and the desired result follows immediately. We recall from Assumption 2 that $\epsilon_N \to 0$ as $N \to \infty$. Therefore,
\[
\begin{aligned}
\limsup_{N \to \infty} \hat{J}_N &\le \limsup_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} \sup_{\zeta \in \Xi:\, \|\zeta - \hat{\xi}^j\| \le \epsilon_N} \sum_{t=1}^{T} c_t(\zeta) \cdot x_t(\zeta_1, \ldots, \zeta_{t-1}) \\
&\le \lim_{k \to \infty} \limsup_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} \sup_{\zeta \in \Xi:\, \|\zeta - \hat{\xi}^j\| \le \epsilon_k} \sum_{t=1}^{T} c_t(\zeta) \cdot x_t(\zeta_1, \ldots, \zeta_{t-1}) \\
&= \lim_{k \to \infty} \mathbb{E}\left[ \sup_{\zeta \in \Xi:\, \|\zeta - \xi\| \le \epsilon_k} \sum_{t=1}^{T} c_t(\zeta) \cdot x_t(\zeta_1, \ldots, \zeta_{t-1}) \right] \quad \mathbb{P}^{\infty}\textup{-almost surely} \\
&= \bar{\mathbb{E}}\left[ \sum_{t=1}^{T} c_t(\xi) \cdot x_t(\xi_1, \ldots, \xi_{t-1}) \right]. && \textup{(EC.26)}
\end{aligned}
\]
The first inequality holds because the decision rule is feasible but possibly suboptimal for Problem (2) for all $N \ge \min\{\bar{N} : \epsilon_{\bar{N}} \le \rho\}$. The second inequality holds because $\epsilon_k \ge \epsilon_N$ for every fixed $k$ and all $N \ge k$. The first equality follows from the strong law of large numbers (Erickson 1973), which holds since (EC.24) ensures that
\[
\mathbb{E}\left[ \max\left\{ \sup_{\zeta \in \Xi:\, \|\zeta - \xi\| \le \epsilon_k} \sum_{t=1}^{T} c_t(\zeta) \cdot x_t(\zeta_1, \ldots, \zeta_{t-1}),\; 0 \right\} \right] < \infty
\]
for all sufficiently large $k$. The final equality follows from the definition of the local upper semicontinuous envelope. Since the set of decision rules which satisfy (EC.25) does not get smaller as $\rho \downarrow 0$, we conclude that the following holds $\mathbb{P}^{\infty}$-almost surely:
\[
\begin{aligned}
\limsup_{N \to \infty} \hat{J}_N \le \lim_{\rho \downarrow 0}\; \underset{x \in \mathcal{X}}{\textup{minimize}} \quad & \bar{\mathbb{E}}\left[ \sum_{t=1}^{T} c_t(\xi) \cdot x_t(\xi_1, \ldots, \xi_{t-1}) \right] \\
\textup{subject to} \quad & \sum_{t=1}^{T} A_t(\zeta)\, x_t(\zeta_1, \ldots, \zeta_{t-1}) \le b(\zeta) \quad \forall \zeta \in \Xi:\ \textup{dist}(\zeta, \mathcal{S}) \le \rho.
\end{aligned}
\]
This concludes the proof. $\square$

Appendix D: Proof of Theorem 2 from Section 4.2

In this appendix, we present the proof of Theorem 2. The proof is organized as follows. In Appendix D.1,
we first develop a helpful intermediary bound (Lemma EC.2). In Appendix D.2, we use that bound to prove
Theorem 2. In Appendix D.3, we provide for completeness the proofs of some miscellaneous and rather
technical results that were used in Appendix D.2.

D.1. An intermediary result

Our proof of Theorem 2 relies on an intermediary result (Lemma EC.2), which establishes a relationship be-
tween sample robust optimization and distributionally robust optimization with the 1-Wasserstein ambiguity
set. We begin by establishing the relationship for the specific case where there is a single data point.

Lemma EC.1. Let $f: \mathbb{R}^d \to \mathbb{R}$ be measurable, $\mathcal{Z} \subseteq \mathbb{R}^d$, and $\hat{\xi} \in \mathcal{Z}$. If $\theta_2 \ge 2\theta_1 \ge 0$, then
\[
\sup_{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}):\, \mathbb{E}_{\mathbb{Q}}[\|\xi - \hat{\xi}\|] \le \theta_1} \mathbb{E}_{\mathbb{Q}}[f(\xi)] \;\le\; \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \theta_2} f(\zeta) \;+\; \frac{2\theta_1}{\theta_2} \sup_{\zeta \in \mathcal{Z}} |f(\zeta)|. \tag{EC.27}
\]

Proof. We first apply the Richter-Rogosinski theorem (see Theorem 7.32 and Proposition 6.40 of Shapiro et al. (2009)), which says that a distributionally robust optimization problem with $m$ moment constraints is equivalent to optimizing over weighted averages of $m+1$ points. Thus,
\[
\begin{aligned}
\sup_{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}):\, \mathbb{E}_{\mathbb{Q}}[\|\xi - \hat{\xi}\|] \le \theta_1} \mathbb{E}_{\mathbb{Q}}[f(\xi)]
&= \left\{
\begin{aligned}
\sup_{\zeta^1, \zeta^2 \in \mathcal{Z},\, \lambda \in [0,1]} \quad & \lambda f(\zeta^1) + (1-\lambda) f(\zeta^2) \\
\textup{subject to} \quad & \lambda \|\zeta^1 - \hat{\xi}\| + (1-\lambda) \|\zeta^2 - \hat{\xi}\| \le \theta_1
\end{aligned}
\right. \\
&\le \left\{
\begin{aligned}
\sup_{\zeta^1, \zeta^2 \in \mathcal{Z},\, \lambda \in [0,1]} \quad & \lambda f(\zeta^1) + (1-\lambda) f(\zeta^2) \\
\textup{subject to} \quad & \lambda \|\zeta^1 - \hat{\xi}\| \le \theta_1, \;\; (1-\lambda) \|\zeta^2 - \hat{\xi}\| \le \theta_1,
\end{aligned}
\right. && \textup{(EC.28)}
\end{aligned}
\]
where the inequality follows from relaxing the constraints on $\zeta^1$ and $\zeta^2$. Let us assume from this point onward that $\sup_{\zeta \in \mathcal{Z}} |f(\zeta)| < \infty$; indeed, if $\sup_{\zeta \in \mathcal{Z}} |f(\zeta)| = \infty$, then the inequality in (EC.27) would trivially hold since the right-hand side would equal infinity. Then, it follows from (EC.28) that
\[
\sup_{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}):\, \mathbb{E}_{\mathbb{Q}}[\|\xi - \hat{\xi}\|] \le \theta_1} \mathbb{E}_{\mathbb{Q}}[f(\xi)] \le \sup_{0 \le \lambda \le 1} \left\{ \lambda \left( \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \frac{\theta_1}{\lambda}} f(\zeta) \right) + (1-\lambda) \left( \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \frac{\theta_1}{1-\lambda}} f(\zeta) \right) \right\}. \tag{EC.29}
\]
We observe that the supremum over $0 \le \lambda \le 1$ is symmetric with respect to $\lambda$, in the sense that $\lambda$ can be restricted to $[0, \frac{1}{2}]$ or $[\frac{1}{2}, 1]$ without loss of generality. Moreover, under the assumption that $\theta_2 \ge 2\theta_1$, the interval $[0, 1 - \frac{\theta_1}{\theta_2}]$ is a superset of the interval $[0, \frac{1}{2}]$. Combining these arguments, we conclude that the right side of (EC.29) is equal to
\[
\sup_{0 \le \lambda \le 1 - \frac{\theta_1}{\theta_2}} \left\{ \lambda \left( \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \frac{\theta_1}{\lambda}} f(\zeta) \right) + (1-\lambda) \left( \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \frac{\theta_1}{1-\lambda}} f(\zeta) \right) \right\}. \tag{EC.30}
\]
Next, we observe that $\frac{\theta_1}{1-\lambda} \le \theta_2$ for every feasible $\lambda$ for the above optimization problem. Using this inequality, we obtain the following upper bound:
\[
\begin{aligned}
\sup_{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}):\, \mathbb{E}_{\mathbb{Q}}[\|\xi - \hat{\xi}\|] \le \theta_1} \mathbb{E}_{\mathbb{Q}}[f(\xi)]
&\le \sup_{0 \le \lambda \le 1 - \frac{\theta_1}{\theta_2}} \left\{ \lambda \left( \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \frac{\theta_1}{\lambda}} f(\zeta) \right) + (1-\lambda) \left( \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \theta_2} f(\zeta) \right) \right\} \\
&= \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \theta_2} f(\zeta) + \sup_{0 \le \lambda \le 1 - \frac{\theta_1}{\theta_2}} \left\{ \lambda \left( \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \frac{\theta_1}{\lambda}} f(\zeta) - \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \theta_2} f(\zeta) \right) \right\}, && \textup{(EC.31)}
\end{aligned}
\]
where the above equality comes from rearranging terms. For every $\frac{\theta_1}{\theta_2} \le \lambda \le 1 - \frac{\theta_1}{\theta_2}$, it immediately follows from $\frac{\theta_1}{\lambda} \le \theta_2$ that
\[
\sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \frac{\theta_1}{\lambda}} f(\zeta) - \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \theta_2} f(\zeta) \le 0,
\]
and the above holds at equality when $\lambda = \frac{\theta_1}{\theta_2}$. Therefore,
\[
\begin{aligned}
\sup_{0 \le \lambda \le 1 - \frac{\theta_1}{\theta_2}} \left\{ \lambda \left( \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \frac{\theta_1}{\lambda}} f(\zeta) - \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \theta_2} f(\zeta) \right) \right\}
&= \sup_{0 \le \lambda \le \frac{\theta_1}{\theta_2}} \left\{ \lambda \left( \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \frac{\theta_1}{\lambda}} f(\zeta) - \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}\| \le \theta_2} f(\zeta) \right) \right\} && \textup{(EC.32)} \\
&\le \sup_{0 \le \lambda \le \frac{\theta_1}{\theta_2}} \left\{ \lambda \left( \sup_{\zeta \in \mathcal{Z}} f(\zeta) - \inf_{\zeta \in \mathcal{Z}} f(\zeta) \right) \right\} && \textup{(EC.33)} \\
&\le \frac{2\theta_1}{\theta_2} \sup_{\zeta \in \mathcal{Z}} |f(\zeta)|. && \textup{(EC.34)}
\end{aligned}
\]
Line (EC.32) follows because we can without loss of generality restrict $\lambda$ to the interval $[0, \frac{\theta_1}{\theta_2}]$. Line (EC.33) is obtained by applying the global lower and upper bounds on $f(\zeta)$. Finally, we obtain (EC.34) since
\[
0 \le \sup_{\zeta \in \mathcal{Z}} f(\zeta) - \inf_{\zeta \in \mathcal{Z}} f(\zeta) \le 2 \sup_{\zeta \in \mathcal{Z}} |f(\zeta)|.
\]
Combining (EC.31) and (EC.34), we obtain the desired result. $\square$
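Inequality (EC.27) can be checked numerically in one dimension by brute force: by the Richter-Rogosinski theorem, the worst case over distributions supported on a finite grid is attained by a two-point distribution, so it suffices to enumerate pairs. The grid, the test function, and the radii below are our own illustrative choices:

```python
import numpy as np

Z = np.linspace(-2.0, 2.0, 81)        # discretized support Z (contains xi_hat = 0)
f = np.sin(3.0 * Z) + 0.5 * Z ** 2    # an arbitrary bounded test function
xi_hat = 0.0
theta1, theta2 = 0.3, 0.8             # satisfies theta2 >= 2*theta1 >= 0

dist = np.abs(Z - xi_hat)

# Left-hand side of (EC.27): enumerate two-point distributions lam*delta_{z1} + (1-lam)*delta_{z2}
# subject to the mean-distance constraint lam*|z1 - xi_hat| + (1-lam)*|z2 - xi_hat| <= theta1.
lhs = -np.inf
for lam in np.linspace(0.0, 1.0, 201):
    cost = lam * dist[:, None] + (1.0 - lam) * dist[None, :]
    val = lam * f[:, None] + (1.0 - lam) * f[None, :]
    feas = cost <= theta1 + 1e-12
    if feas.any():
        lhs = max(lhs, val[feas].max())

# Right-hand side of (EC.27)
rhs = f[dist <= theta2].max() + (2.0 * theta1 / theta2) * np.abs(f).max()
assert lhs <= rhs + 1e-9
print(lhs, rhs)
```

The brute-force left-hand side is exact for the discretized problem precisely because a single moment constraint ($m = 1$) admits a two-point worst-case distribution.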

We now extend the previous lemma to the general case with more than one data point. In the following, we let $\hat{\mathbb{P}}_N$ denote the empirical distribution of the historical data $\hat{\xi}^1, \ldots, \hat{\xi}^N \in \mathbb{R}^d$, let $\mathcal{Z} \subseteq \mathbb{R}^d$ be any set that contains the historical data, and let $d_1(\cdot, \cdot)$ denote the 1-Wasserstein distance between two probability distributions (see Section 6).

Lemma EC.2. Let $f: \mathbb{R}^d \to \mathbb{R}$ be measurable, $\mathcal{Z} \subseteq \mathbb{R}^d$, and $\hat{\xi}^1, \ldots, \hat{\xi}^N \in \mathcal{Z}$. If $\theta_2 \ge 2\theta_1 \ge 0$, then
\[
\sup_{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}):\, d_1(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \theta_1} \mathbb{E}_{\mathbb{Q}}[f(\xi)] \;\le\; \frac{1}{N} \sum_{j=1}^{N} \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}^j\| \le \theta_2} f(\zeta) \;+\; \frac{4\theta_1}{\theta_2} \sup_{\zeta \in \mathcal{Z}} |f(\zeta)|.
\]
Proof. We recall from the proof of Mohajerin Esfahani and Kuhn (2018, Theorem 4.2) that
\[
\left\{ \mathbb{Q} \in \mathcal{P}(\mathcal{Z}) : d_1(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \theta_1 \right\} = \left\{ \frac{1}{N} \sum_{j=1}^{N} \mathbb{Q}^j \;:\; \frac{1}{N} \sum_{j=1}^{N} \mathbb{E}_{\mathbb{Q}^j}\left[ \|\xi - \hat{\xi}^j\| \right] \le \theta_1,\;\; \mathbb{Q}^1, \ldots, \mathbb{Q}^N \in \mathcal{P}(\mathcal{Z}) \right\}.
\]
Therefore,
\[
\sup_{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}):\, d_1(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \theta_1} \mathbb{E}_{\mathbb{Q}}[f(\xi)] = \sup_{\gamma \in \mathbb{R}_+^N} \left\{ \frac{1}{N} \sum_{j=1}^{N}\; \sup_{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}):\, \mathbb{E}_{\mathbb{Q}}[\|\xi - \hat{\xi}^j\|] \le \gamma_j} \mathbb{E}_{\mathbb{Q}}[f(\xi)] \;:\; \frac{1}{N} \sum_{j=1}^{N} \gamma_j \le \theta_1 \right\}.
\]
For any choice of $\gamma \in \mathbb{R}_+^N$, we can partition the components $\gamma_j$ into those that satisfy $2\gamma_j \le \theta_2$ and those that satisfy $2\gamma_j > \theta_2$. Thus,
\[
\sup_{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}):\, d_1(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \theta_1} \mathbb{E}_{\mathbb{Q}}[f(\xi)] \le \sup_{\gamma \in \mathbb{R}_+^N} \left\{ \frac{1}{N} \sum_{j \in [N]:\, 2\gamma_j \le \theta_2}\; \sup_{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}):\, \mathbb{E}_{\mathbb{Q}}[\|\xi - \hat{\xi}^j\|] \le \gamma_j} \mathbb{E}_{\mathbb{Q}}[f(\xi)] + \frac{1}{N} \sum_{j \in [N]:\, 2\gamma_j > \theta_2} \sup_{\zeta \in \mathcal{Z}} |f(\zeta)| \;:\; \frac{1}{N} \sum_{j=1}^{N} \gamma_j \le \theta_1 \right\}, \tag{EC.35}
\]
where the inequality follows from upper bounding each of the inner distributionally robust optimization problems for which $2\gamma_j > \theta_2$ by $\sup_{\zeta \in \mathcal{Z}} |f(\zeta)|$. Due to the constraints on $\gamma$, there can be at most $\frac{2N\theta_1}{\theta_2}$ components which satisfy $2\gamma_j > \theta_2$. It thus follows from (EC.35) that
\[
\sup_{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}):\, d_1(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \theta_1} \mathbb{E}_{\mathbb{Q}}[f(\xi)] \le \sup_{\gamma \in \mathbb{R}_+^N} \left\{ \frac{1}{N} \sum_{j \in [N]:\, 2\gamma_j \le \theta_2}\; \sup_{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}):\, \mathbb{E}_{\mathbb{Q}}[\|\xi - \hat{\xi}^j\|] \le \gamma_j} \mathbb{E}_{\mathbb{Q}}[f(\xi)] \;:\; \frac{1}{N} \sum_{j=1}^{N} \gamma_j \le \theta_1 \right\} + \frac{2\theta_1}{\theta_2} \sup_{\zeta \in \mathcal{Z}} |f(\zeta)|. \tag{EC.36}
\]
To conclude the proof, we apply Lemma EC.1 to each distributionally robust optimization problem in (EC.36) to obtain the following upper bounds:
\[
\begin{aligned}
\sup_{\mathbb{Q} \in \mathcal{P}(\mathcal{Z}):\, d_1(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \theta_1} \mathbb{E}_{\mathbb{Q}}[f(\xi)]
&\le \sup_{\gamma \in \mathbb{R}_+^N} \left\{ \frac{1}{N} \sum_{j \in [N]:\, 2\gamma_j \le \theta_2} \left( \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}^j\| \le \theta_2} f(\zeta) + \frac{2\gamma_j}{\theta_2} \sup_{\zeta \in \mathcal{Z}} |f(\zeta)| \right) : \frac{1}{N} \sum_{j=1}^{N} \gamma_j \le \theta_1 \right\} + \frac{2\theta_1}{\theta_2} \sup_{\zeta \in \mathcal{Z}} |f(\zeta)| && \textup{(EC.37)} \\
&\le \sup_{\gamma \in \mathbb{R}_+^N} \left\{ \frac{1}{N} \sum_{j=1}^{N} \left( \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}^j\| \le \theta_2} f(\zeta) + \frac{2\gamma_j}{\theta_2} \sup_{\zeta \in \mathcal{Z}} |f(\zeta)| \right) : \frac{1}{N} \sum_{j=1}^{N} \gamma_j \le \theta_1 \right\} + \frac{2\theta_1}{\theta_2} \sup_{\zeta \in \mathcal{Z}} |f(\zeta)| && \textup{(EC.38)} \\
&= \frac{1}{N} \sum_{j=1}^{N} \sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}^j\| \le \theta_2} f(\zeta) + \frac{4\theta_1}{\theta_2} \sup_{\zeta \in \mathcal{Z}} |f(\zeta)|. && \textup{(EC.39)}
\end{aligned}
\]
Line (EC.37) follows from applying Lemma EC.1 to (EC.36). Line (EC.38) follows because
\[
\sup_{\zeta \in \mathcal{Z}:\, \|\zeta - \hat{\xi}^j\| \le \theta_2} f(\zeta) + \frac{2\gamma_j}{\theta_2} \sup_{\zeta \in \mathcal{Z}} |f(\zeta)| \ge 0
\]
for each component that satisfies $2\gamma_j > \theta_2$, and thus adding these quantities to (EC.37) results in an upper bound. Finally, (EC.39) follows from the constraint $\frac{1}{N} \sum_{j=1}^{N} \gamma_j \le \theta_1$. This concludes the proof. $\square$

D.2. Proof of Theorem 2

We have established above a deterministic bound (Lemma EC.2) relating sample robust optimization and distributionally robust optimization with the 1-Wasserstein ambiguity set. We will now combine that bound with a concentration inequality of Fournier and Guillin (2015) to prove Theorem 2. We remark that the following proof will employ Theorem 3 from Section 4.4 as well as notation from Section 6. For clarity of exposition, some intermediary and rather technical details of the following proof have been relegated to Appendix D.3.

Theorem 2. If Assumptions 1 and 2 hold, then there exists an $\bar{N} \in \mathbb{N}$, $\mathbb{P}^{\infty}$-almost surely, such that
\[
\mathbb{E}\left[ f(\xi)\, \mathbb{I}\left\{ \xi \in \cup_{j=1}^N \mathcal{U}_N^j \right\} \right] \le \frac{1}{N} \sum_{j=1}^{N} \sup_{\zeta \in \mathcal{U}_N^j} f(\zeta) + M_N \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} |f(\zeta)|
\]
for all $N \ge \bar{N}$ and all measurable functions $f: \mathbb{R}^d \to \mathbb{R}$, where $M_N := N^{-\frac{1}{(d+1)(d+2)}} \log N$.

Proof. Let $\kappa > 0$ be the coefficient from Assumption 2, and define $\bar{\kappa} = \kappa/8$. For each $N \in \mathbb{N}$, define
\[
\delta_N := \begin{cases} \bar{\kappa} N^{-\frac{1}{2}} \log N, & \textup{if } d = 1, \\ \bar{\kappa} N^{-\frac{1}{d}} (\log N)^2, & \textup{if } d \ge 2. \end{cases}
\]
It follows from Fournier and Guillin (2015) and Assumption 1 that $d_1(\mathbb{P}, \hat{\mathbb{P}}_N) \le \delta_N$ for all sufficiently large $N \in \mathbb{N}$, $\mathbb{P}^{\infty}$-almost surely (see Lemma EC.3 in Appendix D.3). Therefore, for every measurable function $f: \mathbb{R}^d \to \mathbb{R}$,
\[
\begin{aligned}
\mathbb{E}\left[ f(\xi)\, \mathbb{I}\left\{ \xi \in \cup_{j=1}^N \mathcal{U}_N^j \right\} \right]
&= \mathbb{E}\Bigg[ \underbrace{\left( f(\xi) + \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} |f(\zeta)| \right) \mathbb{I}\left\{ \xi \in \cup_{j=1}^N \mathcal{U}_N^j \right\}}_{g(\xi)} \Bigg] - \left( \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} |f(\zeta)| \right) \mathbb{P}\left( \xi \in \cup_{j=1}^N \mathcal{U}_N^j \right) \\
&\le \sup_{\mathbb{Q} \in \mathcal{P}(\Xi):\, d_1(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \delta_N} \mathbb{E}_{\mathbb{Q}}\left[ g(\xi) \right] - \left( \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} |f(\zeta)| \right) \mathbb{P}\left( \xi \in \cup_{j=1}^N \mathcal{U}_N^j \right), && \textup{(EC.40)}
\end{aligned}
\]
where the inequality holds for all sufficiently large $N \in \mathbb{N}$, $\mathbb{P}^{\infty}$-almost surely. Next, we observe that $g(\xi)$ equals zero when $\xi$ is not an element of $\cup_{j=1}^N \mathcal{U}_N^j$ and is nonnegative otherwise. Therefore, without loss of generality, we can restrict the supremum over the expectation of $g(\xi)$ to distributions with support contained in $\cup_{j=1}^N \mathcal{U}_N^j$ (see Lemma EC.4 in Appendix D.3). Therefore, (EC.40) is equal to
\[
\begin{aligned}
& \sup_{\mathbb{Q} \in \mathcal{P}(\cup_{j=1}^N \mathcal{U}_N^j):\, d_1(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \delta_N} \mathbb{E}_{\mathbb{Q}}\left[ \left( f(\xi) + \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} |f(\zeta)| \right) \mathbb{I}\left\{ \xi \in \cup_{j=1}^N \mathcal{U}_N^j \right\} \right] - \left( \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} |f(\zeta)| \right) \mathbb{P}\left( \xi \in \cup_{j=1}^N \mathcal{U}_N^j \right) \\
&\quad = \sup_{\mathbb{Q} \in \mathcal{P}(\cup_{j=1}^N \mathcal{U}_N^j):\, d_1(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \delta_N} \mathbb{E}_{\mathbb{Q}}\left[ f(\xi) + \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} |f(\zeta)| \right] - \left( \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} |f(\zeta)| \right) \mathbb{P}\left( \xi \in \cup_{j=1}^N \mathcal{U}_N^j \right) \\
&\quad = \sup_{\mathbb{Q} \in \mathcal{P}(\cup_{j=1}^N \mathcal{U}_N^j):\, d_1(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \delta_N} \mathbb{E}_{\mathbb{Q}}[f(\xi)] + \left( \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} |f(\zeta)| \right) \mathbb{P}\left( \xi \notin \cup_{j=1}^N \mathcal{U}_N^j \right), && \textup{(EC.41)}
\end{aligned}
\]
where the first equality follows because the support of probability distributions in the outer-most supremum is restricted to those which assign measure only on $\cup_{j=1}^N \mathcal{U}_N^j$, and the second equality follows because $\sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} |f(\zeta)|$ is independent of $\mathbb{Q}$. By Assumption 2 and the construction of $\delta_N$, we have that $\epsilon_N \ge 2\delta_N$ for all sufficiently large $N \in \mathbb{N}$. Thus, it follows from Lemma EC.2 that (EC.41) is upper bounded by
\[
\frac{1}{N} \sum_{j=1}^{N} \sup_{\zeta \in \mathcal{U}_N^j} f(\zeta) + \frac{4\delta_N}{\epsilon_N} \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} |f(\zeta)| + \left( \sup_{\zeta \in \cup_{j=1}^N \mathcal{U}_N^j} |f(\zeta)| \right) \mathbb{P}\left( \xi \notin \cup_{j=1}^N \mathcal{U}_N^j \right). \tag{EC.42}
\]
By the definition of $\delta_N$, and since $\epsilon_N = \kappa N^{-\frac{1}{3}}$ when $d = 1$ and $\epsilon_N = \kappa N^{-\frac{1}{d+1}}$ when $d \ge 2$, we have that $\frac{4\delta_N}{\epsilon_N} \le \frac{M_N}{2}$ for all sufficiently large $N$. Finally, Theorem 3 implies that $\mathbb{P}(\xi \notin \cup_{j=1}^N \mathcal{U}_N^j) \le \frac{M_N}{2}$ for all sufficiently large $N$, $\mathbb{P}^{\infty}$-almost surely. Combining (EC.40), (EC.41), and (EC.42), we obtain the desired result. $\square$
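The inequality of Theorem 2 can be probed numerically in one dimension. The sketch below approximates each ball supremum on a grid and estimates the left-hand side by Monte Carlo; the test function and all parameters are our own illustrative choices, and the grid and Monte Carlo approximations introduce error, so this is an illustration rather than a verification:

```python
import numpy as np

rng = np.random.default_rng(5)
d, N, kappa = 1, 500, 1.0
eps_N = kappa * N ** (-1.0 / 3.0)                      # Assumption 2 radius for d = 1
M_N = N ** (-1.0 / ((d + 1) * (d + 2))) * np.log(N)    # error term from Theorem 2

data = rng.standard_normal(N)                          # historical samples xi_hat^j

def f(z):
    # an arbitrary test function (our own choice)
    return np.sin(z) + 0.1 * z

# Right-hand side: average of per-ball suprema, each ball approximated by a grid
offsets = np.linspace(-1.0, 1.0, 21) * eps_N
ball_vals = f(data[:, None] + offsets[None, :])        # shape (N, 21)
rhs = ball_vals.max(axis=1).mean() + M_N * np.abs(ball_vals).max()

# Left-hand side: Monte Carlo estimate of E[ f(xi) * I{xi in union of balls} ]
test_pts = rng.standard_normal(20_000)
in_union = np.abs(test_pts[:, None] - data[None, :]).min(axis=1) <= eps_N
lhs = (f(test_pts) * in_union).mean()

assert lhs <= rhs   # at this sample size the bound holds with large slack
print(lhs, rhs)
```

At moderate $N$ the error term $M_N \sup |f|$ is still sizable, so the bound is loose; the theorem's content is that this term vanishes as $N \to \infty$.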

D.3. Miscellaneous results

We conclude Appendix D with some intermediary and technical lemmas which were used in the proof of
Theorem 2. The following lemma is a corollary of Fournier and Guillin (2015, Theorem 2) and is included
for completeness.

Lemma EC.3. Suppose Assumption 1 holds, and let
\[
\delta_N := \begin{cases} \bar{\kappa} N^{-\frac{1}{2}} \log N, & \textup{if } d = 1, \\ \bar{\kappa} N^{-\frac{1}{d}} (\log N)^2, & \textup{if } d \ge 2, \end{cases}
\]
for any fixed $\bar{\kappa} > 0$. Then, $d_1(\mathbb{P}, \hat{\mathbb{P}}_N) \le \delta_N$ for all sufficiently large $N \in \mathbb{N}$, $\mathbb{P}^{\infty}$-almost surely.

Proof. Let $\bar{N} \in \mathbb{N}$ be any index such that $\delta_N \le 1$ for all $N \ge \bar{N}$. It follows from Assumption 1 that there exists an $a > 1$ such that $b := \mathbb{E}[\exp(\|\xi\|^a)] < \infty$. Thus, it follows from Fournier and Guillin (2015, Theorem 2) that there exist constants $c_1, c_2 > 0$ (which depend only on $a$, $b$, and $d$) such that for all $N \ge \bar{N}$,
\[
\mathbb{P}^N\left( d_1(\mathbb{P}, \hat{\mathbb{P}}_N) > \delta_N \right) \le \begin{cases} c_1 \exp\left( -c_2 N \delta_N^2 \right), & \textup{if } d = 1, \\ c_1 \exp\left( -\dfrac{c_2 N \delta_N^2}{(\log(2 + 1/\delta_N))^2} \right), & \textup{if } d = 2, \\ c_1 \exp\left( -c_2 N \delta_N^d \right), & \textup{if } d \ge 3. \end{cases} \tag{EC.43}
\]
First, suppose $d = 1$ and $N \ge \bar{N}$. Then, it follows from the definition of $\delta_N = \bar{\kappa} N^{-\frac{1}{2}} \log N$ and (EC.43) that
\[
\mathbb{P}^N\left( d_1(\mathbb{P}, \hat{\mathbb{P}}_N) > \delta_N \right) \le c_1 \exp\left( -c_2 N \delta_N^2 \right) = c_1 \exp\left( -c_2 \bar{\kappa}^2 (\log N)^2 \right).
\]
Second, suppose $d = 2$ and $N \ge \bar{N}$. Then, it follows from the definition of $\delta_N = \bar{\kappa} N^{-\frac{1}{2}} (\log N)^2$ and (EC.43) that there exists some constant $\bar{c} > 0$ (which depends only on $\bar{\kappa}$ and $c_2$) such that
\[
\begin{aligned}
\mathbb{P}^N\left( d_1(\mathbb{P}, \hat{\mathbb{P}}_N) > \delta_N \right) &\le c_1 \exp\left( -\frac{c_2 N \delta_N^2}{\log(2 + 1/\delta_N)^2} \right) \\
&= c_1 \exp\left( -\frac{c_2 \bar{\kappa}^2 (\log N)^4}{\log\left(2 + \bar{\kappa}^{-1} N^{\frac{1}{2}} (\log N)^{-2}\right)^2} \right) \\
&\le c_1 \exp\left( -\frac{c_2 \bar{\kappa}^2 (\log N)^4}{\log\left(2 + \bar{\kappa}^{-1} N^{\frac{1}{2}}\right)^2} \right) \\
&\le c_1 \exp\left( -\bar{c} (\log N)^2 \right).
\end{aligned}
\]
Third, suppose $d \ge 3$ and $N \ge \bar{N}$. Then, it follows from the definition of $\delta_N = \bar{\kappa} N^{-\frac{1}{d}} (\log N)^2$ and (EC.43) that
\[
\mathbb{P}^N\left( d_1(\mathbb{P}, \hat{\mathbb{P}}_N) > \delta_N \right) \le c_1 \exp\left( -c_2 N \delta_N^d \right) = c_1 \exp\left( -c_2 \bar{\kappa}^d (\log N)^{2d} \right).
\]
Therefore, for any $d \ge 1$, we have shown that
\[
\sum_{N=1}^{\infty} \mathbb{P}^N\left( d_1(\mathbb{P}, \hat{\mathbb{P}}_N) > \delta_N \right) < \infty,
\]
and thus the desired result follows from the Borel-Cantelli lemma. $\square$
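The convergence rate in Lemma EC.3 can be illustrated empirically for $d = 1$, where SciPy computes the 1-Wasserstein distance between two empirical distributions. Here a very large sample stands in for the true distribution $\mathbb{P}$ (an approximation, since the exact distance to $\mathbb{P}$ is not computed), and the standard normal is our own illustrative choice:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
ref = rng.standard_normal(200_000)    # large-sample stand-in for the true P

for N in [100, 1_000, 10_000]:
    sample = rng.standard_normal(N)   # the N historical observations
    d1 = wasserstein_distance(sample, ref)
    print(N, d1)                      # decays roughly like N**(-1/2) when d = 1
```

The observed decay is consistent with the $N^{-1/2}$ (up to logarithms) rate that $\delta_N$ encodes in the one-dimensional case.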

The second lemma (Lemma EC.4) demonstrates that restrictions may be placed on the support of an
ambiguity set in distributionally robust optimization without loss of generality when the objective function
is nonnegative.

Lemma EC.4. Suppose Ξ ⊆ Rd and ξ̂ 1 , . . . , ξ̂ N ∈ Z ⊆ Ξ. Let g : Ξ → R be any measurable function where


g(ζ) ≥ 0 for all ζ ∈ Z. Then, for all θ ≥ 0,

sup EQ [g(ξ)I {ξ ∈ Z}] = sup EQ [g(ξ)] .


Q∈P (Ξ): d1 (Q,b
PN )≤θ Q∈P (Z ): d1 (Q,b
PN )≤θ

Proof. For notational convenience, let ḡ(ζ) := g(ζ)I {ζ ∈ Z} for all ζ ∈ Ξ. It readily follows from Z ⊆ Ξ
that

sup EQ [ḡ(ξ)] ≥ sup EQ [ḡ(ξ)] = sup EQ [g(ξ)] ,


Q∈P (Ξ): d1 (Q,b
PN )≤θ Q∈P (Z ): d1 (Q,b
PN )≤θ Q∈P (Z ): d1 (Q,b
PN )≤θ

where the equality holds since ḡ(ζ) = g(ζ) for all ζ ∈ Z.


It remains to show the other direction. By the Richter-Rogosinski theorem (see Theorem 7.32 and Proposition 6.40 of Shapiro et al. (2009)),
$$\sup_{Q\in\mathcal{P}(\Xi):\,d_1(Q,\hat P_N)\le\theta} E_Q[\bar g(\xi)] = \begin{array}{cl}\underset{\zeta^{j1},\zeta^{j2}\in\Xi,\;\lambda^j\in[0,1]}{\sup} & \displaystyle\frac{1}{N}\sum_{j=1}^N\Big(\lambda^j\bar g(\zeta^{j1}) + (1-\lambda^j)\bar g(\zeta^{j2})\Big)\\[6pt] \text{subject to} & \displaystyle\frac{1}{N}\sum_{j=1}^N\Big(\lambda^j\|\zeta^{j1}-\hat\xi^j\| + (1-\lambda^j)\|\zeta^{j2}-\hat\xi^j\|\Big)\le\theta.\end{array}$$

For any arbitrary $\eta > 0$, let $(\bar\zeta^{j1},\bar\zeta^{j2},\bar\lambda^j)_{j\in[N]}$ be an $\eta$-optimal solution to the above optimization problem. We now perform a transformation on this solution. For each $j\in[N]$, define $\breve\lambda^j = \bar\lambda^j$, and for each $*\in\{1,2\}$, define $\breve\zeta^{j*} = \bar\zeta^{j*}$ if $\bar\zeta^{j*}\in Z$ and $\breve\zeta^{j*} = \hat\xi^j$ otherwise. Since $\bar g(\zeta)\ge 0$ for all $\zeta\in Z$ and $\bar g(\zeta) = 0$ for all $\zeta\notin Z$, it follows that $\bar g(\breve\zeta^{j*})\ge\bar g(\bar\zeta^{j*})$. By construction, $(\breve\zeta^{j1},\breve\zeta^{j2},\breve\lambda^j)_{j\in[N]}$ is a feasible solution to the above optimization problem, and is also feasible for
$$\begin{array}{cl}\underset{\zeta^{j1},\zeta^{j2}\in Z,\;\lambda^j\in[0,1]}{\sup} & \displaystyle\frac{1}{N}\sum_{j=1}^N\Big(\lambda^j\bar g(\zeta^{j1}) + (1-\lambda^j)\bar g(\zeta^{j2})\Big)\\[6pt] \text{subject to} & \displaystyle\frac{1}{N}\sum_{j=1}^N\Big(\lambda^j\|\zeta^{j1}-\hat\xi^j\| + (1-\lambda^j)\|\zeta^{j2}-\hat\xi^j\|\Big)\le\theta,\end{array}$$
where we replaced the domain of $\zeta^{j1}$ and $\zeta^{j2}$ by $Z$. We have thus shown that
$$\sup_{Q\in\mathcal{P}(\Xi):\,d_1(Q,\hat P_N)\le\theta} E_Q[\bar g(\xi)] \le \frac{1}{N}\sum_{j=1}^N\Big(\bar\lambda^j\bar g(\bar\zeta^{j1}) + (1-\bar\lambda^j)\bar g(\bar\zeta^{j2})\Big) + \eta \le \frac{1}{N}\sum_{j=1}^N\Big(\breve\lambda^j\bar g(\breve\zeta^{j1}) + (1-\breve\lambda^j)\bar g(\breve\zeta^{j2})\Big) + \eta \le \sup_{Q\in\mathcal{P}(Z):\,d_1(Q,\hat P_N)\le\theta} E_Q[\bar g(\xi)] + \eta.$$
Since $\eta > 0$ was chosen arbitrarily, and by the equivalence of $\bar g$ and $g$ on $Z$, we have shown the other direction. This concludes the proof. □

Appendix E: Details for Example 2 from Section 4.3

In this appendix, we provide the omitted technical details of Example 2 from Section 4.3. For convenience,
we repeat the example below.

Consider the single-stage stochastic problem
$$\begin{array}{ll}\underset{x_1\in\mathbb{Z}}{\text{minimize}} & x_1\\ \text{subject to} & x_1 \ge \xi_1 \text{ a.s.},\end{array}$$
where the random variable $\xi_1$ is governed by the probability distribution $P(\xi_1 > \alpha) = (1-\alpha)^k$ for fixed $k > 0$, and $\Xi = [0,2]$. We observe that the support of the random variable is $S = [0,1]$, and thus the optimal cost of the stochastic problem is $J^* = 1$. We similarly observe that the lower bound is $\underline J = 1$ and the upper bound, due to the integrality of the first-stage decision, is $\bar J = 2$. If $\epsilon_N = N^{-1/3}$, then the bounds in Theorem 1 are tight under different choices of $k$, as we prove below:

Range of k      Result
$k\in(0,3)$:    $P^\infty\big(\underline J < \liminf_{N\to\infty}\hat J_N = \limsup_{N\to\infty}\hat J_N = \bar J\big) = 1$
$k = 3$:        $P^\infty\big(\underline J = \liminf_{N\to\infty}\hat J_N < \limsup_{N\to\infty}\hat J_N = \bar J\big) = 1$
$k\in(3,\infty)$: $P^\infty\big(\underline J = \liminf_{N\to\infty}\hat J_N = \limsup_{N\to\infty}\hat J_N < \bar J\big) = 1$

We now prove the above bounds. To begin, we recall that $P(\xi_1 > \alpha) = (1-\alpha)^k$. Thus, for any $k > 0$,
$$\underline J = \lim_{\rho\downarrow 0}\min_{x_1\in\mathbb{Z}}\{x_1 : P(x_1\ge\xi_1)\ge 1-\rho\} = 1, \quad\text{and}\quad \bar J = \lim_{\rho\downarrow 0}\min_{x_1\in\mathbb{Z}}\{x_1 : x_1\ge 1+\rho\} = 2.$$
Furthermore, given historical data, the choice of the robustness parameter $\epsilon_N = N^{-1/3}$, and $\Xi = [0,2]$,
$$\hat J_N = \min_{x_1\in\mathbb{Z}}\Big\{x_1 : x_1\ge\zeta_1,\ \forall\zeta_1\in\cup_{j=1}^N U_N^j\Big\} = \begin{cases}1, & \text{if } \max_{j\in[N]}\hat\xi_1^j\le 1-N^{-1/3},\\ 2, & \text{if } \max_{j\in[N]}\hat\xi_1^j > 1-N^{-1/3}.\end{cases}$$
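The formula for $\hat J_N$ can be evaluated directly from a dataset; a small sketch under the conventions of this example (the function name is ours):

```python
def j_hat(samples):
    # \hat{J}_N for Example 2: the robust constraint forces the integer
    # decision x1 to cover every ball of radius eps_N = N**(-1/3) around a
    # sample (intersected with Xi = [0, 2]), so x1 = 1 suffices exactly when
    # max_j samples[j] <= 1 - eps_N, and otherwise x1 = 2 is needed.
    N = len(samples)
    eps = N ** (-1.0 / 3.0)
    return 1 if max(samples) <= 1 - eps else 2
```

With $N = 8$ samples, $\epsilon_N = 0.5$, so a largest sample of 0.4 gives $\hat J_N = 1$ while 0.6 gives $\hat J_N = 2$.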

We first show that
$$P^\infty\Big(\limsup_{N\to\infty}\hat J_N = 1\Big) = \begin{cases}0, & \text{if } 0 < k\le 3,\\ 1, & \text{if } k > 3.\end{cases} \quad\text{(Claim 1)}$$

Indeed,
$$\begin{aligned} P^\infty\Big(\limsup_{N\to\infty}\hat J_N = 1\Big) &= P^\infty\Big(\max_{j\in[N]}\hat\xi_1^j\le 1-N^{-1/3}\text{ for all sufficiently large }N\Big)\\ &= \lim_{N\to\infty}P^\infty\Big(\max_{j\in[n]}\hat\xi_1^j\le 1-n^{-1/3}\text{ for all }n\ge N\Big)\\ &= \lim_{N\to\infty}P^\infty\Big(\max_{j\in[N]}\hat\xi_1^j\le 1-N^{-1/3}\text{ and }\max_{j\in[n]}\hat\xi_1^j\le 1-n^{-1/3}\text{ for all }n\ge N+1\Big)\\ &= \lim_{N\to\infty}P^N\Big(\max_{j\in[N]}\hat\xi_1^j\le 1-N^{-1/3}\Big)\prod_{n=N+1}^\infty P\Big(\max_{j\in[n]}\hat\xi_1^j\le 1-n^{-1/3}\ \Big|\ \max_{j\in[n-1]}\hat\xi_1^j\le 1-(n-1)^{-1/3}\Big) && \text{(EC.44)}\\ &= \lim_{N\to\infty}P^N\Big(\max_{j\in[N]}\hat\xi_1^j\le 1-N^{-1/3}\Big)\prod_{n=N+1}^\infty P\Big(\hat\xi_1^n\le 1-n^{-1/3}\ \Big|\ \max_{j\in[n-1]}\hat\xi_1^j\le 1-(n-1)^{-1/3}\Big) && \text{(EC.45)}\\ &= \lim_{N\to\infty}P\big(\xi_1\le 1-N^{-1/3}\big)^N\prod_{n=N+1}^\infty P\big(\xi_1\le 1-n^{-1/3}\big) && \text{(EC.46)}\\ &= \lim_{N\to\infty}\big(1-N^{-k/3}\big)^N\prod_{n=N+1}^\infty\big(1-n^{-k/3}\big). && \text{(EC.47)}\end{aligned}$$
Line (EC.44) follows from the law of total probability. Line (EC.45) follows because, conditional on $\max_{j\in[n-1]}\hat\xi_1^j\le 1-(n-1)^{-1/3}$, we have that $\hat\xi_1^j\le 1-n^{-1/3}$ for all $j\in[n-1]$. Line (EC.46) follows from the independence of $\hat\xi^j$, $j\in\mathbb{N}$. By evaluating the limit in (EC.47), we conclude the proof of Claim 1.
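The dichotomy at $k = 3$ in (EC.47) stems from the fact that the infinite product $\prod_n(1-n^{-k/3})$ converges to a positive value exactly when $\sum_n n^{-k/3} < \infty$, i.e., when $k > 3$. A quick numerical illustration of this behavior (the truncation points are our choice):

```python
import math

def log_partial_product(k, n_max, n0=2):
    # log of prod_{n=n0}^{n_max} (1 - n**(-k/3)); the infinite product is
    # positive iff sum_n n**(-k/3) converges, i.e. iff k > 3.
    return sum(math.log1p(-n ** (-k / 3.0)) for n in range(n0, n_max + 1))

# For k = 6 the product telescopes to (M + 1) / (2 M) -> 1/2 > 0, while for
# k = 2 the partial products decay to zero (the log diverges to -infinity).
```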

Next, we show that
$$P^\infty\Big(\liminf_{N\to\infty}\hat J_N = 1\Big) = 1 \quad\text{if } k\ge 3. \quad\text{(Claim 2)}$$
Indeed,
$$\begin{aligned}P^\infty\Big(\liminf_{N\to\infty}\hat J_N = 1\Big) &= P^\infty\Big(\max_{j\in[N]}\hat\xi_1^j\le 1-N^{-1/3}\text{ for infinitely many }N\Big)\\ &= \lim_{N\to\infty}P^\infty\Big(\max_{j\in[n]}\hat\xi_1^j\le 1-n^{-1/3}\text{ for some }n\ge N\Big)\\ &\ge \lim_{N\to\infty}P^N\Big(\max_{j\in[N]}\hat\xi_1^j\le 1-N^{-1/3}\Big)\\ &= \lim_{N\to\infty}P\big(\xi_1\le 1-N^{-1/3}\big)^N && \text{(EC.48)}\\ &= \lim_{N\to\infty}\big(1-N^{-k/3}\big)^N. && \text{(EC.49)}\end{aligned}$$
Line (EC.48) follows from the independence of $\hat\xi^j$, $j\in\mathbb{N}$. We observe that the limit in (EC.49) is strictly positive when $k\ge 3$. It follows from the Hewitt-Savage zero-one law (see, e.g., Breiman (1992), Wang and Tomkins (1992)) that the event $\{\max_{j\in[N]}\hat\xi_1^j\le 1-N^{-1/3}\text{ for infinitely many }N\}$ happens with probability zero or one. Thus, (EC.49) implies that the event $\{\liminf_{N\to\infty}\hat J_N = 1\}$ must occur with probability one for $k\ge 3$.

Finally, we show that
$$P^\infty\Big(\liminf_{N\to\infty}\hat J_N = 1\Big) = 0 \quad\text{if } 0 < k < 3. \quad\text{(Claim 3)}$$
Indeed, suppose that $0 < k < 3$. Then,
$$\sum_{N=1}^\infty P^\infty\big(\hat J_N = 1\big) = \sum_{N=1}^\infty P^N\Big(\max_{j\in[N]}\hat\xi_1^j\le 1-N^{-1/3}\Big) = \sum_{N=1}^\infty\big(1-N^{-k/3}\big)^N < \infty.$$
Therefore, it follows from the Borel-Cantelli lemma that
$$P^\infty\Big(\liminf_{N\to\infty}\hat J_N = 1\Big) = P^\infty\Big(\max_{j\in[N]}\hat\xi_1^j\le 1-N^{-1/3}\text{ for infinitely many }N\Big) = 0$$
when $0 < k < 3$, which proves Claim 3.

Combining Claims 1, 2, and 3 with the definitions of $\underline J$ and $\bar J$, we have shown the desired results.

Appendix F: Proof of Theorem 3 from Section 4.4

In this appendix, we present the proof of Theorem 3. Our proof techniques follow similar reasoning to Devroye and Wise (1980) and Baíllo et al. (2000) for $S_N := \cup_{j=1}^N U_N^j$, which we adapt to Assumption 1. We remark that the following theorem also provides an intermediary step in the proofs of Theorems 1 and 2, which are found in Appendices C and D.

Theorem 3. Suppose Assumptions 1 and 2 hold. Then, $P^\infty$-almost surely we have
$$\lim_{N\to\infty}\frac{N^{1/(d+1)}}{(\log N)^{d+1}}\,P(\xi\notin S_N) = 0.$$
Proof. Choose any arbitrary $\eta > 0$, and let $R_N := N^{1/(d+1)}(\log N)^{-(d+1)}$. Moreover, let $a > 1$ be a fixed constant such that $b := E[\exp(\|\xi\|^a)] < \infty$ (the existence of $a$ and $b$ follows from Assumption 1). Define
$$A_N := \Big\{\zeta\in\mathbb{R}^d : \|\zeta\|\le(\log N)^{(a+1)/(2a)}\Big\}.$$
We begin by showing that $R_N P(\xi\notin A_N)\le\eta$ for all sufficiently large $N\in\mathbb{N}$. Indeed,
$$R_N P(\xi\notin A_N) = R_N P\big(\|\xi\| > (\log N)^{(a+1)/(2a)}\big) = R_N P\big(\exp(\|\xi\|^a) > \exp((\log N)^{(a+1)/2})\big) \le \frac{bR_N}{\exp((\log N)^{(a+1)/2})}\le\eta.$$
The first inequality follows from Markov's inequality and the second inequality holds for all sufficiently large $N\in\mathbb{N}$ since $a > 1$.
Next, define
$$\alpha_N := \frac{\eta}{(\log N)^{d(a+1)/(2a)}\,\phi R_N}, \qquad B_N := \Big\{\zeta\in\mathbb{R}^d : P\big(\|\xi-\zeta\|\le\epsilon_N\big) > \alpha_N\epsilon_N^d\Big\},$$
where $\phi > 0$ is a constant which depends only on $d$ and will be defined shortly. We now show that $R_N P(\xi\notin B_N)\le 2\eta$ for all sufficiently large $N$. Indeed, for all sufficiently large $N\in\mathbb{N}$,
$$\begin{aligned}R_N P(\xi\notin B_N) &= R_N P(\xi\in A_N,\ \xi\notin B_N) + R_N P(\xi\notin A_N,\ \xi\notin B_N)\\ &\le R_N P(\xi\in A_N,\ \xi\notin B_N) + R_N P(\xi\notin A_N)\\ &\le R_N P(\xi\in A_N,\ \xi\notin B_N) + \eta, && \text{(EC.50)}\end{aligned}$$
where the final inequality follows because $R_N P(\xi\notin A_N)\le\eta$ for all sufficiently large $N\in\mathbb{N}$. Now, choose points $\zeta^1,\ldots,\zeta^{K_N}\in A_N$ such that $\min_{j\in[K_N]}\|\zeta-\zeta^j\|\le\epsilon_N/2$ for all $\zeta\in A_N$. For example, one can place the points on a grid overlaying $A_N$. It follows from Verger-Gaugry (2005) that this can be accomplished with a number of points $K_N$ which satisfies
$$K_N\le\phi\left(\frac{(\log N)^{(a+1)/(2a)}}{\epsilon_N}\right)^d, \quad\text{(EC.51)}$$
where $\phi > 0$ is a constant that depends only on $d$.
where φ > 0 is a constant that depends only on d. Then, continuing from (EC.50),
KN
X  N 
RN P(ξ 6∈ BN ) ≤ RN P (ξ ∈ AN , ξ 6∈ BN ) + η ≤ RN P kξ − ζ j k ≤ , ξ 6∈ BN + η, (EC.52)
j =1
2
where the second inequality follows from the union bound. For each j ∈ [KN ], we have two cases to consider.
First, suppose there exists a realization ζ 6∈ BN such that kζ − ζ j k ≤ 2N . Then,
 N   N 
P kξ − ζ j k ≤ , ξ 6∈ BN ≤ P kξ − ζ j k ≤ ≤ P (kξ − ζk ≤ N ) ≤ αN dN ,
2 2
where the second inequality follows because kξ − ζk ≤ kξ − ζ j k + kζ j − ζk ≤ N whenever kξ − ζ j k ≤ N
2
, and
the third inequality follows from ζ 6∈ BN . Second, suppose there does not exist a realization ζ 6∈ BN such
that kζ − ζ j k ≤ N
2
. Then,
 N 
P kξ − ζ j k ≤ , ξ 6∈ BN = 0.
2
In each of the two cases, we have shown that
 N 
P kξ − ζ j k ≤ , ξ 6∈ BN ≤ αN dN (EC.53)
2
for each j ∈ [KN ]. Therefore, we combine (EC.52) and (EC.53) to obtain the following upper bound on
RN P(ξ 6∈ BN ) for all sufficiently large N ∈ N:
d(a+1)
RN P(ξ 6∈ BN ) ≤ RN KN αN dN + η ≤ (log N ) 2a φRN αN + η ≤ 2η. (EC.54)

The first inequality follows from (EC.52) and (EC.53), the second inequality follows from (EC.51), and the
third inequality follows from the definition of αN .
We now prove the main result. Indeed, for all sufficiently large $N\in\mathbb{N}$,
$$R_N P(\xi\notin S_N) = R_N P(\xi\notin S_N,\ \xi\in B_N) + R_N P(\xi\notin S_N,\ \xi\notin B_N)\le R_N P(\xi\notin S_N,\ \xi\in B_N) + 2\eta, \quad\text{(EC.55)}$$
where the equality follows from the law of total probability and the inequality follows from (EC.54). Let $\rho := \frac{d(a-1)}{2a} > 0$. Then, for all sufficiently large $N\in\mathbb{N}$:
$$\begin{aligned}P^N\big(R_N P(\xi\notin S_N) > 3\eta\big) &\le P^N\big(R_N P(\xi\notin S_N,\ \xi\in B_N) > \eta\big) && \text{(EC.56)}\\ &\le \eta^{-1}R_N E_{P^N}\big[P(\xi\notin S_N,\ \xi\in B_N)\big] && \text{(EC.57)}\\ &= \eta^{-1}R_N E\big[\mathbb{I}\{\xi\in B_N\}\,P^N(\xi\notin S_N)\big] && \text{(EC.58)}\\ &= \eta^{-1}R_N E\Big[\mathbb{I}\{\xi\in B_N\}\,P^N\big(\|\xi-\hat\xi^1\| > \epsilon_N,\ldots,\|\xi-\hat\xi^N\| > \epsilon_N\big)\Big] && \text{(EC.59)}\\ &= \eta^{-1}R_N E\Big[\mathbb{I}\{\xi\in B_N\}\,P^1\big(\|\xi-\hat\xi^1\| > \epsilon_N\big)^N\Big] && \text{(EC.60)}\\ &\le \eta^{-1}R_N\big(1-\alpha_N\epsilon_N^d\big)^N && \text{(EC.61)}\\ &\le \eta^{-1}R_N\exp\big(-N\alpha_N\epsilon_N^d\big) && \text{(EC.62)}\\ &\le \eta^{-1}R_N\exp\big(-\kappa^d N^{1/(d+1)}\alpha_N\big) && \text{(EC.63)}\\ &= \eta^{-1}R_N\exp\big(-\kappa^d\eta\phi^{-1}(\log N)^{1+\rho}\big). && \text{(EC.64)}\end{aligned}$$

Line (EC.56) follows from (EC.55), (EC.57) follows from Markov's inequality, (EC.58) follows from Fubini's theorem, and (EC.59) follows from the definition of $S_N$. Line (EC.60) follows because, given any fixed realization of $\xi$, the random variables $\|\xi-\hat\xi^1\|,\ldots,\|\xi-\hat\xi^N\|$ are independent. Note that the random vector $\hat\xi^1$ is drawn from the measure $P^1$. Line (EC.61) follows from the definition of $B_N$ and the fact that the function $\xi\mapsto\mathbb{I}\{\xi\in B_N\}\,P^1\big(\|\xi-\hat\xi^1\| > \epsilon_N\big)^N$ is equal to zero if $\xi\notin B_N$. Line (EC.62) follows from the mean value theorem. Line (EC.63) holds since Assumption 2 implies that $\epsilon_N\ge\kappa N^{-1/(d+1)}$. Line (EC.64) follows from the definitions of $\alpha_N$, $R_N$, and $\rho$. Since $\rho > 0$, it follows from (EC.64) and the definition of $R_N$ that
$$\sum_{N=1}^\infty P^N\big(R_N P(\xi\notin S_N) > 3\eta\big) < \infty \quad\forall\eta > 0,$$
and thus the Borel-Cantelli lemma implies that $R_N P(\xi\notin S_N)\to 0$ as $N\to\infty$, $P^\infty$-almost surely. □

Appendix G: Proof of Proposition 3 from Section 6

In this appendix, we present the proof of Proposition 3. We begin with the following intermediary lemma.

Lemma EC.5. The ∞-Wasserstein ambiguity set is equivalent to
$$\left\{\frac{1}{N}\sum_{j=1}^N Q_j \;:\; Q_j\big(\|\xi-\hat\xi^j\|\le\epsilon_N\big) = 1\ \forall j\in[N],\quad Q_1,\ldots,Q_N\in\mathcal{P}(\Xi)\right\}.$$
Proof. By the definition of the ∞-Wasserstein distance from Section 6,
$$\Big\{Q\in\mathcal{P}(\Xi) : d_\infty\big(Q,\hat P_N\big)\le\epsilon_N\Big\} = \left\{Q\in\mathcal{P}(\Xi) : \begin{array}{l}\exists\,\Pi\in\mathcal{P}(\Xi\times\Xi) \text{ with } \Pi(\|\xi-\xi'\|\le\epsilon_N) = 1, \text{ and}\\ \Pi \text{ is a joint distribution of }\xi\text{ and }\xi'\\ \text{with marginals } Q \text{ and } \hat P_N, \text{ respectively}\end{array}\right\}. \quad\text{(EC.65)}$$
Let $\bar\xi^1,\ldots,\bar\xi^L$ be the distinct vectors among $\hat\xi^1,\ldots,\hat\xi^N$, and let $I_1,\ldots,I_L$ be index sets defined as
$$I_\ell := \{j\in[N] : \hat\xi^j = \bar\xi^\ell\}.$$

For any joint distribution $\Pi$ that satisfies the constraints in the ambiguity set in (EC.65), let $Q_\ell$ be the conditional distribution of $\xi$ given $\xi' = \bar\xi^\ell$. Then, for every Borel set $A\subseteq\Xi$,
$$Q(\xi\in A) = \Pi\big((\xi,\xi')\in A\times\Xi\big) = \sum_{\ell=1}^L\Pi\big(\xi\in A\mid\xi' = \bar\xi^\ell\big)\hat P_N\big(\xi' = \bar\xi^\ell\big) = \sum_{\ell=1}^L\frac{|I_\ell|}{N}Q_\ell(\xi\in A).$$
The first equality follows because $\Pi$ is a joint distribution of $\xi$ and $\xi'$ with marginals $Q$ and $\hat P_N$, respectively. The second equality follows from the law of total probability. The final equality follows from the definitions of $Q_\ell$ and $\hat P_N$. Since the above equalities hold for every Borel set, we have shown that
$$Q = \sum_{\ell=1}^L\frac{|I_\ell|}{N}Q_\ell.$$
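In the finitely supported case this decomposition can be verified directly: if the joint pmf is stored as a matrix whose columns sum to the empirical weights $|I_\ell|/N$, then the marginal of $\xi$ is the weighted mixture of the column-conditional distributions. A small sketch of that check (the helper name and data layout are ours):

```python
def mixture_from_joint(joint, weights):
    # joint[i][l]: pmf of (xi = i-th atom, xi' = l-th distinct sample), with
    # column sums equal to weights[l] = |I_l| / N. Recovers the marginal of
    # xi as sum_l weights[l] * Q_l, where Q_l is the conditional given xi'.
    L = len(weights)
    conds = [[row[l] / weights[l] for row in joint] for l in range(L)]
    return [sum(weights[l] * conds[l][i] for l in range(L))
            for i in range(len(joint))]
```

As the displayed identity asserts, the mixture coincides with the row sums (the marginal of $\xi$) of the joint pmf.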
Furthermore, by using similar reasoning as above, we observe that
$$\Pi\big(\|\xi-\xi'\|\le\epsilon_N\big) = \sum_{\ell=1}^L\Pi\big(\|\xi-\xi'\|\le\epsilon_N\mid\xi' = \bar\xi^\ell\big)\hat P_N\big(\xi' = \bar\xi^\ell\big) = \sum_{\ell=1}^L\frac{|I_\ell|}{N}Q_\ell\big(\|\xi-\bar\xi^\ell\|\le\epsilon_N\big).$$

Combining the above results, the ambiguity set from (EC.65) can be rewritten as
$$\begin{aligned}&\left\{\sum_{\ell=1}^L\frac{|I_\ell|}{N}Q_\ell : \sum_{\ell=1}^L\frac{|I_\ell|}{N}Q_\ell\big(\|\xi-\bar\xi^\ell\|\le\epsilon_N\big) = 1,\ Q_1,\ldots,Q_L\in\mathcal{P}(\Xi)\right\}\\ &\quad= \left\{\sum_{\ell=1}^L\frac{|I_\ell|}{N}Q_\ell : Q_\ell\big(\|\xi-\bar\xi^\ell\|\le\epsilon_N\big) = 1\ \forall\ell\in[L],\ Q_1,\ldots,Q_L\in\mathcal{P}(\Xi)\right\}\\ &\quad= \left\{\frac{1}{N}\sum_{j=1}^N Q_j : Q_j\big(\|\xi-\hat\xi^j\|\le\epsilon_N\big) = 1\ \forall j\in[N],\ Q_1,\ldots,Q_N\in\mathcal{P}(\Xi)\right\}.\end{aligned}$$
The first equality follows because $Q_\ell(\|\xi-\bar\xi^\ell\|\le\epsilon_N)\le 1$ for each $\ell\in[L]$. The second equality follows because $Q_\ell(\|\xi-\bar\xi^\ell\|\le\epsilon_N) = 1$ if and only if there exists $Q_j\in\mathcal{P}(\Xi)$ for each $j\in I_\ell$ such that $Q_j(\|\xi-\hat\xi^j\|\le\epsilon_N) = 1$ and $\sum_{j\in I_\ell}\frac{1}{|I_\ell|}Q_j = Q_\ell$. This concludes the proof. □

We now present the proof of Proposition 3.

Proposition 3. Problem (2) with uncertainty sets of the form
$$U_N^j := \Big\{\zeta\equiv(\zeta_1,\ldots,\zeta_T)\in\Xi : \|\zeta-\hat\xi^j\|\le\epsilon_N\Big\}$$
is equivalent to ∞-WDRO.

Proof. It follows from Lemma EC.5 that the ∞-Wasserstein ambiguity set can be decomposed into separate distributions, each having a support that is contained in $\{\zeta\in\Xi : \|\zeta-\hat\xi^j\|\le\epsilon_N\}$ for $j\in[N]$. Of course, these sets are exactly equal to the uncertainty sets from Section 3, and thus Lemma EC.5 implies that the ∞-Wasserstein ambiguity set is equivalent to
$$\left\{\frac{1}{N}\sum_{j=1}^N Q_j : Q_j\in\mathcal{P}(U_N^j)\text{ for each }j\in[N]\right\}.$$
Therefore, when $\mathcal{A}_N$ is the ∞-Wasserstein ambiguity set and each $U_N^j$ is a closed ball around $\hat\xi^j$ intersected with $\Xi$,
$$\sup_{Q\in\mathcal{A}_N}E_Q\left[\sum_{t=1}^T c_t(\xi)\cdot x_t(\xi_1,\ldots,\xi_{t-1})\right] = \frac{1}{N}\sum_{j=1}^N\sup_{Q\in\mathcal{P}(U_N^j)}E_Q\left[\sum_{t=1}^T c_t(\xi)\cdot x_t(\xi_1,\ldots,\xi_{t-1})\right] = \frac{1}{N}\sum_{j=1}^N\sup_{\zeta\in U_N^j}\sum_{t=1}^T c_t(\zeta)\cdot x_t(\zeta_1,\ldots,\zeta_{t-1}).$$

Moreover, it similarly follows from Lemma EC.5 that the following inequalities are equivalent:
$$Q\left(\sum_{t=1}^T A_t(\xi)x_t(\xi_1,\ldots,\xi_{t-1})\le b(\xi)\right) = 1 \quad\forall Q\in\mathcal{A}_N$$
$$\frac{1}{N}\sum_{j=1}^N Q_j\left(\sum_{t=1}^T A_t(\xi)x_t(\xi_1,\ldots,\xi_{t-1})\le b(\xi)\right) = 1 \quad\forall Q_j\in\mathcal{P}(U_N^j),\ j\in[N]$$
$$Q_j\left(\sum_{t=1}^T A_t(\xi)x_t(\xi_1,\ldots,\xi_{t-1})\le b(\xi)\right) = 1 \quad\forall Q_j\in\mathcal{P}(U_N^j),\ j\in[N]$$
$$\sum_{t=1}^T A_t(\zeta)x_t(\zeta_1,\ldots,\zeta_{t-1})\le b(\zeta) \quad\forall\zeta\in U_N^j,\ j\in[N].$$
We have thus shown that Problem (2) and Problem (6) have equivalent objective functions and constraints under the specified constructions of the uncertainty sets and ambiguity set. This concludes the proof. □

Appendix H: Proof of Proposition 4 from Section 6

In this appendix, we present the proof of Proposition 4.

Proposition 4. If $p\in[1,\infty)$ and $\epsilon_N > 0$, then a decision rule is feasible for p-WDRO only if
$$\sum_{t=1}^T A_t(\zeta)x_t(\zeta_1,\ldots,\zeta_{t-1})\le b(\zeta) \quad\forall\zeta\in\Xi.$$

Proof. Consider any arbitrary $\bar\xi\in\Xi$ such that $\bar\xi\ne\hat\xi^j$ for each $j\in[N]$. Let $\delta_{\bar\xi}$ denote the Dirac delta distribution which satisfies $\delta_{\bar\xi}(\xi = \bar\xi) = 1$, and let $\hat P_N := \frac{1}{N}\sum_{j=1}^N\delta_{\hat\xi^j}$ be the empirical distribution of the sample paths. For any $\lambda\in(0,1)$, let the convex combination of the two distributions be given by
$$Q_{\bar\xi}^\lambda := (1-\lambda)\hat P_N + \lambda\delta_{\bar\xi}.$$
We recall the definition of the p-Wasserstein distance between $\hat P_N$ and $Q_{\bar\xi}^\lambda$:
$$d_p\big(\hat P_N, Q_{\bar\xi}^\lambda\big) = \inf\left\{\left(\int_{\Xi\times\Xi}\|\xi-\xi'\|^p\,d\Pi(\xi,\xi')\right)^{1/p} : \begin{array}{l}\Pi\text{ is a joint distribution of }\xi\text{ and }\xi'\\ \text{with marginals }\hat P_N\text{ and }Q_{\bar\xi}^\lambda,\text{ respectively}\end{array}\right\}. \quad\text{(EC.66)}$$
Consider a feasible joint distribution $\bar\Pi$ for the above optimization problem in which $\xi'\sim Q_{\bar\xi}^\lambda$, $\xi''\sim\hat P_N$, and
$$\xi = \begin{cases}\xi', & \text{if }\xi' = \hat\xi^j\text{ for some }j\in[N],\\ \xi'', & \text{otherwise}.\end{cases}$$
Indeed, we readily verify that the marginal distributions of $\xi$ and $\xi'$ are $\hat P_N$ and $Q_{\bar\xi}^\lambda$, respectively, and thus this joint distribution is feasible for the optimization problem in (EC.66). Moreover,

this joint distribution is feasible for the optimization problem in (EC.66). Moreover,
  Z  p1
λ 0 p 0
dp PN , Qξ̄ ≤
b kξ − ξ k dΠ̄(ξ, ξ )
Ξ×Ξ
  p1
Z Z
p  p 
kξ − ξ 0 k I ξ 0 = ξ̄ dΠ̄(ξ, ξ 0 ) + kξ − ξ 0 k I ξ 0 6= ξ̄ dΠ̄(ξ, ξ 0 )
 
=


Ξ×Ξ Ξ×Ξ
| {z }
=0
N
! p1
1 X p
= λ ξ̂ j − ξ̄ .
N j =1
The inequality follows since Π̄ is a feasible but possibly suboptimal joint distribution for the optimization
problem in (EC.66). The first equality follows from splitting the integral into two cases, and observing that
the second case equals zero since ξ = ξ 0 whenever ξ 0 6= ξ̄. The final equality follows because ξ = ξ 00 whenever
ξ 0 = ξ̄, and ξ 00 is distributed uniformly over the historical sample paths. Thus, for any arbitrary choice of
ξ̄ ∈ Ξ, we have shown that Qλξ̄ is contained in the p-Wasserstein ambiguity set whenever λ ∈ (0, 1) satisfies
N
! p1
1 X j
p
λ ξ̂ − ξ̄ ≤ N
N j =1
N
1 X p
λ ξ̂ j − ξ̄ ≤ pN
N j =1
pN
λ≤ PN p .
1
N j =1
ξ̂ j − ξ̄
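The final bound on $\lambda$ is easy to compute for a given dataset; a minimal sketch for scalar observations (the function name is ours):

```python
def max_lambda(samples, xi_bar, eps, p):
    # Largest mixture weight lambda for which Q^lambda = (1-lambda)*P_hat +
    # lambda*delta_{xi_bar} provably lies in the p-Wasserstein ball of radius
    # eps, via lambda <= eps**p / ((1/N) * sum_j |xi_j - xi_bar|**p).
    # Assumes xi_bar differs from every sample, as in the proof.
    mean_dist = sum(abs(x - xi_bar) ** p for x in samples) / len(samples)
    return min(1.0, eps ** p / mean_dist)
```

Note how the admissible weight shrinks as $p$ grows when the samples are far from $\bar\xi$, consistent with the chain of equivalences above.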

Now, consider any feasible decision rule for Problem (6), i.e., a decision rule $x\in\mathcal{X}$ which satisfies
$$\sum_{t=1}^T A_t(\xi)x_t(\xi_1,\ldots,\xi_{t-1})\le b(\xi) \quad Q\text{-a.s.},\ \forall Q\in\mathcal{A}_N. \quad\text{(EC.67)}$$
Let $\mathcal{A}_N$ be the p-Wasserstein ambiguity set for $1\le p < \infty$ and $\epsilon_N > 0$. Then, for any arbitrary $\bar\xi\in\Xi$, there exists a $\lambda\in(0,1)$ such that $Q_{\bar\xi}^\lambda$ is contained in $\mathcal{A}_N$, and so it follows from (EC.67) that the decision rule must satisfy
$$\sum_{t=1}^T A_t(\bar\xi)x_t(\bar\xi_1,\ldots,\bar\xi_{t-1})\le b(\bar\xi).$$
Since $\bar\xi\in\Xi$ was chosen arbitrarily, we conclude that the decision rule must satisfy
$$\sum_{t=1}^T A_t(\zeta)x_t(\zeta_1,\ldots,\zeta_{t-1})\le b(\zeta) \quad\forall\zeta\in\Xi,$$
which is what we wished to show. □

Appendix I: Linear Decision Rules for Problem (6) with 1-Wasserstein Ambiguity
Sets

In this appendix, we present a reformulation of linear decision rules for Problem (6) using the 1-Wasserstein
ambiguity set. The performance of this data-driven approach is illustrated in Section 8.

We first review the necessary notation. Following Section 5, we focus on a specific case of Problem (6) of the form
$$\begin{array}{ll}\underset{x\in\mathcal{X}}{\text{minimize}} & \displaystyle\sup_{Q\in\mathcal{A}_N}E_Q\left[\sum_{t=1}^T c_t\cdot x_t(\xi_1,\ldots,\xi_{t-1})\right]\\[6pt] \text{subject to} & \displaystyle\sum_{t=1}^T A_t x_t(\xi_1,\ldots,\xi_{t-1})\le b(\xi) \quad Q\text{-a.s.},\ \forall Q\in\mathcal{A}_N,\end{array} \quad\text{(EC.68)}$$
in which $A_t(\xi)$ and $c_t(\xi)$ do not depend on the stochastic process. The ambiguity set is constructed as
$$\mathcal{A}_N = \Big\{Q\in\mathcal{P}(\Xi) : d_1\big(Q,\hat P_N\big)\le\epsilon_N\Big\},$$
where $\hat P_N$ is the empirical distribution of the historical data, $\epsilon_N\ge 0$ is the robustness parameter, and the 1-Wasserstein distance between two distributions is given by
$$d_1(Q,Q') = \inf\left\{\int_{\Xi\times\Xi}\|\xi-\xi'\|\,d\Pi(\xi,\xi') : \begin{array}{l}\Pi\text{ is a joint distribution of }\xi\text{ and }\xi'\\ \text{with marginals }Q\text{ and }Q',\text{ respectively}\end{array}\right\}.$$
We refer to Section 6 for more details on the 1-Wasserstein ambiguity set. We assume that the robustness parameter satisfies $\epsilon_N > 0$, in which case it follows from Proposition 4 in Section 6 that Problem (EC.68) is equivalent to
$$\begin{array}{ll}\underset{x\in\mathcal{X}}{\text{minimize}} & \displaystyle\sup_{Q\in\mathcal{A}_N}E_Q\left[\sum_{t=1}^T c_t\cdot x_t(\xi_1,\ldots,\xi_{t-1})\right]\\[6pt] \text{subject to} & \displaystyle\sum_{t=1}^T A_t x_t(\zeta_1,\ldots,\zeta_{t-1})\le b(\zeta) \quad\forall\zeta\in\Xi.\end{array} \quad\text{(EC.69)}$$

We next present an extension of the linear decision rule approach to Problem (EC.69), in which we restrict the space of decision rules to those of the form
$$x_t(\zeta_1,\ldots,\zeta_{t-1}) = x_{t,0} + \sum_{s=1}^{t-1}X_{t,s}\zeta_s.$$
The resulting approximation of Problem (EC.69) is given by
$$\begin{array}{ll}\text{minimize} & \displaystyle\sup_{Q\in\mathcal{A}_N}E_Q\left[\sum_{t=1}^T c_t\cdot\left(x_{t,0} + \sum_{s=1}^{t-1}X_{t,s}\xi_s\right)\right]\\[6pt] \text{subject to} & \displaystyle\sum_{t=1}^T A_t\left(x_{t,0} + \sum_{s=1}^{t-1}X_{t,s}\zeta_s\right)\le b(\zeta) \quad\forall\zeta\in\Xi,\end{array} \quad\text{(EC.70)}$$
where the decision variables are $x_{t,0}\in\mathbb{R}^{n_t}$ and $X_{t,s}\in\mathbb{R}^{n_t\times d_s}$ for all $1\le s < t$.

In the remainder of this appendix, we develop a tractable reformulation of Problem (EC.70). Our reformulation, which uses similar duality techniques to those presented in Section 5, is presented as Theorem EC.3 and requires the following assumption:

Assumption EC.1. The set $\Xi\subseteq\mathbb{R}^d$ is a nonempty multi-dimensional box of the form $[\ell,u]$, where any component of $\ell$ may be $-\infty$ and any component of $u$ may be $\infty$. Moreover, the norm in the 1-Wasserstein distance is equal to $\|\cdot\|_1$.

We now present the reformulation of Problem (EC.70) given Assumption EC.1.

Theorem EC.3. If Assumption EC.1 holds, then Problem (EC.70) can be reformulated by adding at most $O(md)$ additional continuous decision variables and $O(md)$ additional linear constraints. The reformulation is given by
$$\begin{array}{ll}\text{minimize} & \displaystyle\lambda\epsilon_N + \frac{1}{N}\sum_{j=1}^N\left(\sum_{t=1}^T c_t\cdot\left(x_{t,0} + \sum_{s=1}^{t-1}X_{t,s}\hat\xi_s^j\right) + \alpha\cdot(u-\hat\xi^j) + \beta\cdot(\hat\xi^j-\ell)\right)\\[6pt] \text{subject to} & \displaystyle\left\|\sum_{s=t+1}^T(X_{s,t})^\top c_s - \alpha_t + \beta_t\right\|_\infty\le\lambda \quad t\in[T]\\[6pt] & \displaystyle M_t - \Lambda_t = -B_t + \sum_{s=t+1}^T A_s X_{s,t} \quad t\in[T]\\[6pt] & \displaystyle\sum_{t=1}^T\big(M_t u_t - \Lambda_t\ell_t + A_t x_{t,0}\big)\le b_0,\end{array}$$
where the auxiliary decision variables are $\lambda\ge 0$, $\alpha := (\alpha_1,\ldots,\alpha_T)$, $\beta := (\beta_1,\ldots,\beta_T)\in\mathbb{R}_+^d$, as well as $M := (M_1,\ldots,M_T)$, $\Lambda := (\Lambda_1,\ldots,\Lambda_T)\in\mathbb{R}_+^{m\times d}$.

Proof. Following similar reasoning to Theorem 4, the constraints
$$\sum_{t=1}^T A_t\left(x_{t,0} + \sum_{s=1}^{t-1}X_{t,s}\zeta_s\right)\le b_0 + \sum_{t=1}^T B_t\zeta_t \quad\forall\zeta\in\Xi$$
are satisfied if and only if there exist $M := (M_1,\ldots,M_T),\ \Lambda := (\Lambda_1,\ldots,\Lambda_T)\in\mathbb{R}_+^{m\times d}$ which satisfy
$$\sum_{t=1}^T\big(M_t u_t - \Lambda_t\ell_t + A_t x_{t,0}\big)\le b_0, \qquad M_t - \Lambda_t = \sum_{s=t+1}^T A_s X_{s,t} - B_t,\quad t\in[T].$$

The remainder of the proof focuses on the objective function. Note that for any fixed solution to Problem (EC.70) one can define a function $f:\Xi\to\mathbb{R}$ as follows:
$$f(\zeta) = \sum_{t=1}^T c_t\cdot\left(x_{t,0} + \sum_{s=1}^{t-1}X_{t,s}\zeta_s\right).$$
It follows from Assumption EC.1 that $\Xi\subseteq\mathbb{R}^d$ is nonempty, convex, and closed, and $-f(\cdot)$ is proper, convex, and lower semicontinuous on $\Xi$, thus satisfying Mohajerin Esfahani and Kuhn (2018, Assumption 4.1). Therefore, we conclude from Mohajerin Esfahani and Kuhn (2018, Equation 12b) that
$$\begin{aligned}\sup_{Q\in\mathcal{A}_N}E_Q\left[\sum_{t=1}^T c_t\cdot\left(x_{t,0}+\sum_{s=1}^{t-1}X_{t,s}\xi_s\right)\right] &= \sup_{Q\in\mathcal{A}_N}E_Q[f(\xi)]\\ &= \inf_{\lambda\ge 0}\left\{\lambda\epsilon_N + \frac{1}{N}\sum_{j=1}^N\sup_{\zeta\in\Xi}\Big\{f(\zeta)-\lambda\|\zeta-\hat\xi^j\|_1\Big\}\right\}\\ &= \inf_{\lambda\ge 0}\Bigg\{\lambda\epsilon_N + \frac{1}{N}\sum_{j=1}^N\underbrace{\sup_{\zeta\in\Xi}\left\{\sum_{t=1}^T c_t\cdot\left(x_{t,0}+\sum_{s=1}^{t-1}X_{t,s}\zeta_s\right)-\lambda\|\zeta-\hat\xi^j\|_1\right\}}_{\gamma_j}\Bigg\}. \quad\text{(EC.71)}\end{aligned}$$
We now reformulate the expression $\gamma_j$ for each $j\in[N]$. Indeed, it follows from strong duality for linear programming that
$$\begin{array}{ll}\gamma_j = \underset{\alpha,\beta\in\mathbb{R}_+^d}{\text{minimize}} & \displaystyle\sum_{t=1}^T\Big(c_t\cdot x_{t,0} + \alpha_t\cdot(u_t-\hat\xi_t^j) + \beta_t\cdot(\hat\xi_t^j-\ell_t)\Big)\\[6pt] \text{subject to} & \displaystyle\left\|\sum_{s=t+1}^T(X_{s,t})^\top c_s - \alpha_t + \beta_t\right\|_\infty\le\lambda, \quad t\in[T].\end{array} \quad\text{(EC.72)}$$
Remark: For any index $l$ such that $u_l = \infty$ (alternatively, $\ell_l = -\infty$), the corresponding decision variable $\alpha_l$ (alternatively, $\beta_l$) should be set to zero and the term $\alpha_l(u_l-\hat\xi_l^j)$ (alternatively, $\beta_l(\hat\xi_l^j-\ell_l)$) should be dropped from the objective.

Note that problem (EC.72) is component-wise separable into $d$ problems of the form
$$\begin{array}{ll}\underset{\alpha_l,\beta_l\in\mathbb{R}_+}{\text{minimize}} & \alpha_l(u_l-\hat\xi_l^j) + \beta_l(\hat\xi_l^j-\ell_l)\\ \text{subject to} & |g_l-\alpha_l+\beta_l|\le\lambda,\end{array} \quad\text{(EC.73)}$$
where $g := \big(\sum_{s=2}^T(X_{s,1})^\top c_s,\ \sum_{s=3}^T(X_{s,2})^\top c_s,\ \ldots,\ (X_{T,T-1})^\top c_T,\ 0\big)\in\mathbb{R}^d$. Moreover, $\hat\xi^j\in\Xi$ implies that both $(u_l-\hat\xi_l^j)$ and $(\hat\xi_l^j-\ell_l)$ are nonnegative, and so for any fixed $\lambda$ and $g_l$, an optimal solution of (EC.73) is given by $\alpha_l = \max\{g_l-\lambda, 0\}$ and $\beta_l = \max\{-g_l-\lambda, 0\}$ (their corresponding minimal values). This solution is independent of the value of $\hat\xi_l^j$, and therefore, the same variables $\alpha$ and $\beta$ can be used in (EC.72) for all values of $j\in[N]$. Combining (EC.71) and (EC.72) and plugging the result into the objective function of (EC.70), we obtain the desired formulation. □
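The closed-form solution of the separable problem (EC.73) can be sanity-checked against a brute-force grid search; a small sketch (the function names and grid resolution are ours):

```python
def ec73_closed_form(g, lam, u, l, xi):
    # alpha = max(g - lam, 0), beta = max(-g - lam, 0) and the resulting
    # objective value alpha*(u - xi) + beta*(xi - l) of problem (EC.73).
    alpha, beta = max(g - lam, 0.0), max(-g - lam, 0.0)
    return alpha * (u - xi) + beta * (xi - l)

def ec73_brute_force(g, lam, u, l, xi, steps=400):
    # Grid search over (alpha, beta) in [0, 4]^2 with |g - alpha + beta| <= lam.
    best = float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1):
            a, b = 4.0 * i / steps, 4.0 * j / steps
            if abs(g - a + b) <= lam:
                best = min(best, a * (u - xi) + b * (xi - l))
    return best
```

On small instances with bounded $u$ and $\ell$, the two functions agree, illustrating why the optimal $(\alpha,\beta)$ does not depend on $\hat\xi^j$.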

Appendix J: Supplement to Section 7

J.1. Reformulation of the Three-Stage Inventory Replenishment Problem

Let $I_{2r}^-(\xi_1) = \max\{0,\xi_{1r}-Q_{1r}\}$ denote the demand at retailer $r$ from the first half of the week that exceeded the initial inventory of the retailer. With these auxiliary decision rules, we observe that the three-stage stochastic nonlinear optimization problem from Section 7.1 is equivalent to
$$\begin{array}{ll}\underset{Q\ge 0,\,z\in\{0,1\}^R,\,I}{\text{minimize}} & \displaystyle E\Bigg[c\left(Q_{10}+\sum_{r=1}^R Q_{1r}\right) + h\left(Q_{10}-\sum_{r=1}^R Q_{2r}(\xi_1)\right) + b\left(\sum_{r=1}^R I_{2r}^-(\xi_1)\right)\\ & \displaystyle\qquad\quad + f\left(\sum_{r=1}^R z_r(\xi_1)\right) + b\left(\sum_{r=1}^R\max\{0,-I_{3r}(\xi_1,\xi_2)\}\right) + h\left(\sum_{r=1}^R\max\{0,I_{3r}(\xi_1,\xi_2)\}\right)\Bigg]\\[6pt] \text{subject to} & \displaystyle\sum_{r=1}^R Q_{2r}(\xi_1)\le Q_{10} \quad\text{a.s.}\\ & I_{2r}^-(\xi_1) = \max\{0,\xi_{1r}-Q_{1r}\} \quad\forall r\in[R],\ \text{a.s.}\\ & I_{3r}(\xi_1,\xi_2) = Q_{1r}-\xi_{1r}+I_{2r}^-(\xi_1)+Q_{2r}(\xi_1)-\xi_{2r} \quad\forall r\in[R],\ \text{a.s.}\\ & z_r(\xi_1)M\ge Q_{2r}(\xi_1) \quad\forall r\in[R],\ \text{a.s.}\end{array}$$


After substituting the equality for $I_{3r}(\xi_1,\xi_2)$ into the objective function, and after adding epigraph decision rules, we obtain the following simplified formulation:
$$\begin{array}{ll}\underset{Q\ge 0,\,z\in\{0,1\}^R,\,I^-,\,v}{\text{minimize}} & \displaystyle E\left[c\left(Q_{10}+\sum_{r=1}^R Q_{1r}\right) + hQ_{10} + \sum_{r=1}^R v_r(\xi_1,\xi_2) + f\sum_{r=1}^R z_r(\xi_1)\right]\\[6pt] \text{subject to} & \displaystyle\sum_{r=1}^R Q_{2r}(\xi_1)\le Q_{10} \quad\text{a.s.}\\ & I_{2r}^-(\xi_1) = \max\{0,\xi_{1r}-Q_{1r}\} \quad\forall r\in[R],\ \text{a.s.}\\ & v_r(\xi_1,\xi_2)\ge b\big(\xi_{2r}+\xi_{1r}-Q_{2r}(\xi_1)-Q_{1r}\big) - hQ_{2r}(\xi_1) \quad\forall r\in[R],\ \text{a.s.}\\ & v_r(\xi_1,\xi_2)\ge(h+b)I_{2r}^-(\xi_1) + h\big(Q_{1r}-\xi_{1r}-\xi_{2r}\big) \quad\forall r\in[R],\ \text{a.s.}\\ & z_r(\xi_1)M\ge Q_{2r}(\xi_1) \quad\forall r\in[R],\ \text{a.s.}\end{array}$$
Plugging the equality for the auxiliary decision rules $I_{2r}^-(\xi_1)$ into the remaining constraints, and eliminating the maximization by splitting the relevant constraint, we conclude our derivation of Problem (7).
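The splitting step can be checked numerically: the per-retailer second-stage cost $-hQ_{2r} + bI_{2r}^- + b\max\{0,-I_{3r}\} + h\max\{0,I_{3r}\}$ coincides pointwise with the maximum of the three affine pieces obtained after substituting $I_{2r}^-$ and splitting. A small sketch of that check (variable names are ours):

```python
def stage_cost(q1, q2, xi1, xi2, h, b):
    # -h*Q2 + b*I2minus + b*max(0, -I3) + h*max(0, I3), where
    # I2minus = max(0, xi1 - q1) and I3 = q1 - xi1 + I2minus + q2 - xi2.
    i2m = max(0.0, xi1 - q1)
    i3 = q1 - xi1 + i2m + q2 - xi2
    return -h * q2 + b * i2m + b * max(0.0, -i3) + h * max(0.0, i3)

def pieces_max(q1, q2, xi1, xi2, h, b):
    # Maximum of the three affine pieces obtained after the splitting step.
    return max(b * (xi2 + xi1 - q2 - q1) - h * q2,
               h * (q1 - xi1 - xi2),
               b * (xi1 - q1) - h * xi2)
```

Evaluating both functions over a grid of nonnegative order quantities and demands shows they agree, which is exactly what makes the epigraph formulation exact.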

J.2. Heuristic partitioning algorithm

In this appendix, we describe the heuristic partitioning algorithm for finite adaptability which is used in our numerical experiments in Section 7. The goal of the algorithm is to construct a partition comprised of hyperrectangular regions of the form $\mathcal{P}^1 := [\ell^1,u^1]\times\mathbb{R}_+^R,\ \ldots,\ \mathcal{P}^N := [\ell^N,u^N]\times\mathbb{R}_+^R$ such that the inclusion $\hat\xi^j\in\mathcal{P}^j$ is satisfied for each sample path $j\in[N]$. The outputs of the partitioning algorithm from this section are thus the vectors $\ell^1,u^1,\ldots,\ell^N,u^N\in[0,\infty]^R$ which define the partition of the set $\Xi := \mathbb{R}_+^{2R}$.

Our algorithm is comprised of the following steps, which are formalized in Algorithm 1 and visualized in Figure EC.1. We first define $M := \lceil N^{1/R}\rceil$ as the smallest integer which satisfies the inequality $M^R\ge N$.

Figure EC.1 Illustration of the partitioning algorithm

[Figure: three panels plotting the six historical sample paths, labeled A through F, in the $(\zeta_{11},\zeta_{12})$ plane; see the note below.]

Note. Left: This shows the N = 6 historical sample paths in an example with R = 2 retailers. Center: In the first
iteration (r = 1) of the heuristic partitioning algorithm, there are N = 6 historical sample paths. Since there are R = 2
retailers, we subdivide the historical sample paths into M = 3 regions along the demand of the first retailer. Right: In
the second iteration (r = 2), we subdivide each of the 3 regions along the demand of the second retailer. The regions
comprise the partition which is returned by the algorithm.

We then iterate over the indices $r = 1,2,\ldots,R$. In each iteration $r$, we start with a partition of $\Xi$ and then subdivide each region in that partition into at most $M$ smaller regions by adding cuts along the first-stage demand $\zeta_{1r}$. The borders of the new regions are determined by the historical data points such that each region contains approximately the same number of data points and the region borders are the furthest possible from the closest data point in the $\zeta_{1r}$ dimension. The progression of the algorithm is illustrated in Figure EC.1. When the algorithm concludes its $R$th iteration, we have obtained a partition $\mathcal{P} = \{\mathcal{P}^1,\ldots,\mathcal{P}^N\}$ of $\Xi$ with exactly $N$ regions such that each $\hat\xi_1^j$ lies in its own region. Moreover, since $\Xi$ is a hyperrectangle, each region $\mathcal{P}^j$ is also a hyperrectangle. This concludes the description of our heuristic partitioning algorithm. We note that, for simplicity, the version of the algorithm presented here ignores cases where there exist $j_1\ne j_2\in[N]$ and $r$ such that $\hat\xi_{1r}^{j_1} = \hat\xi_{1r}^{j_2}$, which may happen if the distribution is not continuous; however, this is not the case in our numerical experiments, and, furthermore, our algorithm can be easily adjusted to address such ties.

J.3. Reformulation of the robust optimization problem

In this appendix, we present the derivation of Problem (9) from Section 7.2. Indeed, we recall from Section 7.2 that the uncertainty sets $U_N^1,\ldots,U_N^N$ are hyperrectangles, and it follows from Appendix J.2 that the regions $\mathcal{P}^1,\ldots,\mathcal{P}^K\subseteq\Xi$ that are obtained from our heuristic partitioning algorithm are hyperrectangles as well.

Algorithm 1 Partitioning algorithm
Input: $\Xi$, $N$, $\{\hat\xi^1,\ldots,\hat\xi^N\}$
Output: Partition $\mathcal{P}$
1: Initialize $M = \lceil N^{1/R}\rceil$, $\mathcal{P} = \{\Xi\}$
2: for $r := 1,\ldots,R$ do
3:   for all $P\in\mathcal{P}$ do
4:     Find $J = \{j : \hat\xi^j\in P\}$
5:     Let $\{j_{(k)}\}_{k=1,\ldots,|J|}$ be an ordering of the indexes in $J$ such that $\hat\xi_{1r}^{j_{(k)}}\le\hat\xi_{1r}^{j_{(k+1)}}$, $k\in[|J|-1]$
6:     Update $\mathcal{P} = \mathcal{P}\setminus\{P\}$
7:     Set $K = \min\{M,|J|\}$ and $k_0 = 0$
8:     for all $l := 1,\ldots,K$ do
9:       Set $k_l = \max\{\lceil |J|\,l/K\rceil,\ k_{l-1}+1\}$
10:      if $k_l < |J|$ then
11:        $\bar\zeta^l = \big(\hat\xi_{1r}^{j_{(k_l)}} + \hat\xi_{1r}^{j_{(k_l+1)}}\big)/2$
12:        Set $P^l = \{\zeta\in P : \zeta_{1r}\le\bar\zeta^l\}$ and update $P = P\setminus P^l$
13:      else
14:        Set $K = l$ and $P^K = P$
15:      end if
16:    end for
17:    Update $\mathcal{P} = \mathcal{P}\cup\{P^1,\ldots,P^K\}$
18:  end for
19: end for
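A runnable sketch of Algorithm 1 in Python (the data layout and helper name are ours; regions are represented by their first-stage lower and upper bounds, and all coordinate values are assumed distinct, as in the experiments):

```python
import math

def partition(samples, R):
    # Algorithm 1: samples is a list of first-stage demand vectors of length R.
    # Returns a list of axis-aligned boxes (lower, upper) covering [0, inf)^R,
    # one historical sample per box.
    M = math.ceil(len(samples) ** (1.0 / R))
    regions = [([0.0] * R, [math.inf] * R)]
    for r in range(R):
        new_regions = []
        for lo, hi in regions:
            # Step 4: samples falling in the current region.
            J = [j for j, s in enumerate(samples)
                 if all(lo[i] <= s[i] <= hi[i] for i in range(R))]
            J.sort(key=lambda j: samples[j][r])  # Step 5
            K = min(M, len(J))                   # Step 7
            k_prev, cur_lo, pieces = 0, list(lo), []
            for l in range(1, K + 1):
                k_l = max(math.ceil(len(J) * l / K), k_prev + 1)  # Step 9
                if k_l < len(J):
                    # Step 11: cut halfway between consecutive sorted samples.
                    cut = (samples[J[k_l - 1]][r] + samples[J[k_l]][r]) / 2.0
                    piece_hi = list(hi)
                    piece_hi[r] = cut
                    pieces.append((cur_lo, piece_hi))
                    cur_lo = list(cur_lo)
                    cur_lo[r] = cut
                    k_prev = k_l
                else:
                    pieces.append((cur_lo, list(hi)))  # Step 14
                    break
            new_regions.extend(pieces if pieces else [(list(lo), list(hi))])
        regions = new_regions
    return regions
```

With $R = 1$ and samples 1, 2, 3, 4, the sketch produces four intervals with cuts at 1.5, 2.5, and 3.5; with the six two-dimensional samples of Figure EC.1 it produces six regions, matching the figure.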

Consequently, the approximation of the robust optimization problem using finite adaptability is given by
$$\begin{array}{ll}\underset{\substack{v,\;Q_1\ge 0,\\ Q_2^k\ge 0,\;z^k\in\{0,1\}^R}}{\text{minimize}} & \displaystyle c\left(Q_{10}+\sum_{r=1}^R Q_{1r}\right) + hQ_{10} + \frac{1}{N}\sum_{j=1}^N\max_{k\in K_j,\;\zeta\in U_N^j\cap\mathcal{P}^k}\left\{\sum_{r=1}^R v_r(\zeta_1,\zeta_2) + f\sum_{r=1}^R z_r^k\right\}\\[6pt] \text{subject to} & \displaystyle\sum_{r=1}^R Q_{2r}^k\le Q_{10} \quad\forall k\in[K]\\ & v_r(\zeta_1,\zeta_2)\ge b\big(\zeta_{2r}+\zeta_{1r}-Q_{2r}^k-Q_{1r}\big) - hQ_{2r}^k \quad\forall r\in[R],\ k\in[K],\ \zeta\in\cup_{j=1}^N U_N^j\cap\mathcal{P}^k\\ & v_r(\zeta_1,\zeta_2)\ge h\big(Q_{1r}-\zeta_{1r}-\zeta_{2r}\big) \quad\forall r\in[R],\ k\in[K],\ \zeta\in\cup_{j=1}^N U_N^j\cap\mathcal{P}^k\\ & v_r(\zeta_1,\zeta_2)\ge b\big(\zeta_{1r}-Q_{1r}\big) - h\zeta_{2r} \quad\forall r\in[R],\ k\in[K],\ \zeta\in\cup_{j=1}^N U_N^j\cap\mathcal{P}^k\\ & z_r^k M\ge Q_{2r}^k \quad\forall r\in[R],\ k\in[K],\end{array} \quad\text{(EC.74)}$$
where $K_j := \{k\in[K] : U_N^j\cap\mathcal{P}^k\ne\emptyset\}$ contains the indices of regions $\mathcal{P}^1,\ldots,\mathcal{P}^K$ that intersect the uncertainty set $U_N^j$. In the remainder of this appendix, we show that the above optimization problem is equivalent to Problem (9).

We first observe that there is an optimal solution to Problem (EC.74) in which each decision rule $v_r(\zeta_1,\zeta_2)$ is equivalent to a function that depends only on $\zeta_{1r}$ and $\zeta_{2r}$. Therefore, since each $U_N^j\cap\mathcal{P}^k$ is a hyperrectangle, we observe that
$$\max_{k\in K_j,\;\zeta\in U_N^j\cap\mathcal{P}^k}\left\{\sum_{r=1}^R v_r(\zeta_1,\zeta_2) + f\sum_{r=1}^R z_r^k\right\} = \max_{k\in K_j}\left\{\max_{\zeta\in U_N^j\cap\mathcal{P}^k}\sum_{r=1}^R\Big(v_r(\zeta_1,\zeta_2) + fz_r^k\Big)\right\}.$$
It follows from the above observation that Problem (EC.74) is equivalent to

$$\begin{array}{ll}\underset{\substack{v^j,\;Q_1\ge 0,\;Q_2^k\ge 0,\\ z^k\in\{0,1\}^R,\;u^{j,k}\in\mathbb{R}^R}}{\text{minimize}} & \displaystyle c\left(Q_{10}+\sum_{r=1}^R Q_{1r}\right) + hQ_{10} + \frac{1}{N}\sum_{j=1}^N\sum_{r=1}^R v_r^j\\[6pt] \text{subject to} & \displaystyle\sum_{r=1}^R Q_{2r}^k\le Q_{10} \quad\forall k\in[K]\\ & \displaystyle\sum_{r=1}^R v_r^j\ge\sum_{r=1}^R\big(u_r^{j,k} + fz_r^k\big) \quad\forall j\in[N],\ k\in K_j\\ & u_r^{j,k}\ge b\big(\zeta_{2r}+\zeta_{1r}-Q_{2r}^k-Q_{1r}\big) - hQ_{2r}^k \quad\forall r\in[R],\ j\in[N],\ k\in K_j,\ \zeta\in U_N^j\cap\mathcal{P}^k\\ & u_r^{j,k}\ge h\big(Q_{1r}-\zeta_{1r}-\zeta_{2r}\big) \quad\forall r\in[R],\ j\in[N],\ k\in K_j,\ \zeta\in U_N^j\cap\mathcal{P}^k\\ & u_r^{j,k}\ge b\big(\zeta_{1r}-Q_{1r}\big) - h\zeta_{2r} \quad\forall r\in[R],\ j\in[N],\ k\in K_j,\ \zeta\in U_N^j\cap\mathcal{P}^k\\ & z_r^k M\ge Q_{2r}^k \quad\forall r\in[R],\ k\in[K],\end{array}$$
where at optimality each auxiliary decision variable $u_r^{j,k}$ will satisfy the equality $u_r^{j,k} = \max_{\zeta\in U_N^j\cap\mathcal{P}^k}v_r(\zeta_1,\zeta_2)$. Applying the standard 'robust counterpart' reformulation technique, we observe that the above optimization problem can be rewritten as


$$\begin{array}{ll}\underset{\substack{v^j,\;Q_1\ge 0,\;Q_2^k\ge 0,\\ z^k\in\{0,1\}^R,\;u^{j,k}\in\mathbb{R}^R}}{\text{minimize}} & \displaystyle c\left(Q_{10}+\sum_{r=1}^R Q_{1r}\right) + hQ_{10} + \frac{1}{N}\sum_{j=1}^N\sum_{r=1}^R v_r^j\\[6pt] \text{subject to} & \displaystyle\sum_{r=1}^R Q_{2r}^k\le Q_{10} \quad\forall k\in[K]\\ & \displaystyle\sum_{r=1}^R v_r^j\ge\sum_{r=1}^R\big(u_r^{j,k} + fz_r^k\big) \quad\forall j\in[N],\ k\in K_j\\ & \displaystyle u_r^{j,k}\ge\max_{\zeta\in U_N^j\cap\mathcal{P}^k}\Big\{b\big(\zeta_{2r}+\zeta_{1r}-Q_{2r}^k-Q_{1r}\big)\Big\} - hQ_{2r}^k \quad\forall r\in[R],\ j\in[N],\ k\in K_j\\ & \displaystyle u_r^{j,k}\ge\max_{\zeta\in U_N^j\cap\mathcal{P}^k}\Big\{h\big(Q_{1r}-\zeta_{1r}-\zeta_{2r}\big)\Big\} \quad\forall r\in[R],\ j\in[N],\ k\in K_j\\ & \displaystyle u_r^{j,k}\ge\max_{\zeta\in U_N^j\cap\mathcal{P}^k}\Big\{b\big(\zeta_{1r}-Q_{1r}\big) - h\zeta_{2r}\Big\} \quad\forall r\in[R],\ j\in[N],\ k\in K_j\\ & z_r^k M\ge Q_{2r}^k \quad\forall r\in[R],\ k\in[K].\end{array}$$


Let us define a lower bound $\underline{\zeta}_{tr}^{jk} := \min_{\boldsymbol{\zeta} \in \mathcal{U}_N^j \cap \mathcal{P}^k} \zeta_{tr}$ and an upper bound $\bar{\zeta}_{tr}^{jk} := \max_{\boldsymbol{\zeta} \in \mathcal{U}_N^j \cap \mathcal{P}^k} \zeta_{tr}$ for each period $t \in \{1, 2\}$, retailer $r \in [R]$, sample path $j \in [N]$, and region $k \in \mathcal{K}_j$. Since each set $\mathcal{U}_N^j \cap \mathcal{P}^k$ is a hyperrectangle, and since the holding costs and backlogging costs are nonnegative ($h, b \ge 0$), we see that the above optimization problem is equivalent to Problem (9).
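The final step rests on the fact that each term inside the inner maxima is affine in $\boldsymbol{\zeta}$ with nonnegative coefficients, so its maximum over a hyperrectangle is attained at a corner given by the bounds $\underline{\zeta}$ and $\bar{\zeta}$. A minimal numerical sketch of this fact, using made-up box bounds and order quantities (all values below are illustrative, not from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical box bounds for (zeta_1r, zeta_2r), standing in for a
# hyperrectangle U_N^j intersected with P^k.
lo = np.array([0.2, 0.1])
hi = np.array([0.9, 0.7])
b, h = 2.0, 1.0          # backlogging and holding costs, both nonnegative
Q1r, Q2r = 0.5, 0.3      # illustrative order quantities

def backlog_cost(zeta1, zeta2):
    # One piece of the worst-case objective: b(z2 + z1 - Q2r - Q1r) - h*Q2r.
    return b * (zeta2 + zeta1 - Q2r - Q1r) - h * Q2r

# Because b >= 0, the cost is nondecreasing in both coordinates, so its
# maximum over the box is attained at the upper corner (the zeta-bar bounds).
samples = rng.uniform(lo, hi, size=(10000, 2))
empirical_max = backlog_cost(samples[:, 0], samples[:, 1]).max()
closed_form = backlog_cost(hi[0], hi[1])

assert empirical_max <= closed_form
print(round(closed_form, 2))  # 1.3
```

This is exactly why the robust counterparts reduce to finitely many linear constraints in the bounds $\underline{\zeta}_{tr}^{jk}$ and $\bar{\zeta}_{tr}^{jk}$.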

J.4. Supplement to Section 7.3

We conclude the present Appendix J with additional numerical results which were omitted from Section 7.
In Figure EC.2, we show the impact of having a fixed cost of f = 0.1 versus not having a fixed cost on the
replenishment decision rules obtained from SRO-FA. In Figures EC.3 and EC.4, we show the impact of the
robustness parameter on SRO-FA for various numbers of retailers and sizes of training datasets with and
without fixed costs.

Appendix K: Supplement to Section 8.3

In this appendix, we provide supplemental numerical results for the multi-stage stochastic inventory man-
agement problem from Section 8. Specifically, the aim of this appendix is to evaluate the impact of the
projection procedure, described at the end of Section 8.2, on the out-of-sample costs of SRO-LDR and
SAA-LDR reported in Table 1.

Following the same notation as in Section 8.2, let $\mathbf{x}^{A,i,\ell} = (x_1^{A,i,\ell}, \ldots, x_T^{A,i,\ell})$ be the production quantities obtained when the decision rule from approach $A$ on training dataset $\ell$ is applied to the $i$th sample path in the testing dataset. For each approach $A$ and training dataset $\ell$, the probability that the resulting decision rule is feasible is approximated by
\[
P^{A,\ell} = \frac{1}{10000} \sum_{i=1}^{10000} \mathbb{1}\left\{ \mathbf{x}^{A,i,\ell} \in [0, \bar{x}] \right\},
\]

Figure EC.2 Three-stage inventory replenishment problem: Histogram of replenishment decision rules for SRO-FA

Note. The histogram corresponds to the replenishment decision rules obtained by SRO-FA in experiments where $N = 200$, $R = 3$, and the robustness parameter equals 0.7. The lighter bars correspond to experiments in which there was no fixed cost ($f = 0$) and the darker bars correspond to experiments in which there was a fixed cost ($f = 0.1$). For each number of retailers $0, 1, 2, 3$, the histogram shows the number of sample paths in the test set for which the replenishment policies sent a nonzero quantity of magazines to the corresponding number of retailers.

Figure EC.3 Three-stage inventory replenishment problem: Impact of robustness parameter on SRO-FA, no fixed cost (f = 0)

Note. The solid black lines are the average out-of-sample cost of decision rules produced by SRO-FA, and the shaded
regions are the 20th and 80th percentiles over the 50 training datasets. The dotted red lines are the average in-sample
cost for SRO-FA. The green line is the benchmark cost of Problem (7). Results are shown for N ∈ {50, 200, 800}.

Figure EC.4 Three-stage inventory replenishment problem: Impact of robustness parameter on SRO-FA, fixed cost (f = 0.1)

Note. The solid black lines are the average out-of-sample cost of decision rules produced by SRO-FA, and the shaded
regions are the 20th and 80th percentiles over the 50 training datasets. The dotted red lines are the average in-sample
cost for SRO-FA. The green line is the benchmark cost of Problem (7). Results are shown for N ∈ {50, 100, 200}.

and the infeasibility magnitude is approximated by
\[
C^{A,\ell} = \frac{1}{10000} \sum_{i=1}^{10000} \min_{\mathbf{y} \in [0, \bar{x}]} \left\| \mathbf{x}^{A,i,\ell} - \mathbf{y} \right\|_1.
\]

In other words, $1 - P^{A,\ell}$ tells us how frequently the projection procedure needs to be applied, and $C^{A,\ell}$ captures the average number of production units that are changed by the projection procedure.
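Both quantities are straightforward to estimate, since the $\ell_1$-projection onto the box $[0, \bar{x}]$ reduces to coordinate-wise clipping. A minimal sketch with synthetic data (the distribution and parameter values below are purely illustrative, not taken from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic production quantities from a decision rule applied to 10,000
# test sample paths over T periods; x_bar is an illustrative capacity bound.
T, n_test, x_bar = 5, 10000, 20.0
x = rng.normal(loc=10.0, scale=6.0, size=(n_test, T))

# Feasibility frequency P: fraction of sample paths with all production
# quantities inside the box [0, x_bar].
P = np.all((x >= 0.0) & (x <= x_bar), axis=1).mean()

# Infeasibility magnitude C: the l1-projection onto a box is coordinate-wise
# clipping, so min_{y in [0, x_bar]} ||x - y||_1 = ||x - clip(x, 0, x_bar)||_1.
C = np.abs(x - np.clip(x, 0.0, x_bar)).sum(axis=1).mean()

assert 0.0 <= P <= 1.0 and C >= 0.0
```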

In Tables EC.1 and EC.2, for each experiment in Section 8.3 and for each approach $A \in \{\text{SRO-LDR}, \text{SAA-LDR}\}$, we report the averages and standard deviations of $P^{A,\ell}$ and $C^{A,\ell}$ over the 100 training datasets. For almost all choices of $T$, $\alpha$, and $N$, the decision rules produced by SRO-LDR are feasible for over 93% of the sample paths in the testing dataset, and their infeasibility magnitude is below 2 units. These results imply that the projection procedure does not significantly impact the out-of-sample costs of SRO-LDR reported in Table 1. In contrast, SAA-LDR produces decision rules that have low feasibility and high infeasibility magnitude when $N$ is small. This shows that, for small training datasets, the decision rules obtained by SAA-LDR can be unreliable and require significant corrections to obtain feasible production quantities.

Table EC.1 Multi-stage stochastic inventory management: out-of-sample


feasibility.
Size of training dataset (N)
T α Approach 10 25 50 100
5 0 SRO-LDR 96.3(6.6) 98.9(2.1) 99.7(0.8) 100.0(0.1)
SAA-LDR 83.0(13.4) 96.5(4.3) 99.5(1.3) 100.0(0.1)
0.25 SRO-LDR 93.8(7.3) 95.9(3.5) 97.3(2.3) 98.1(1.1)
SAA-LDR 79.2(12.9) 92.3(5.3) 96.5(2.8) 98.1(1.1)
0.5 SRO-LDR 89.7(8.6) 91.0(4.9) 91.1(3.7) 94.1(2.4)
SAA-LDR 73.4(11.3) 85.4(4.9) 89.9(3.5) 94.0(2.3)
10 0 SRO-LDR 99.6(1.0) 100.0(0.1) 100.0(0.0) 100.0(0.0)
SAA-LDR 61.5(24.6) 99.0(1.6) 100.0(0.1) 100.0(0.0)
0.25 SRO-LDR 99.4(1.8) 99.9(0.3) 100.0(0.1) 100.0(0.0)
SAA-LDR 60.2(23.9) 97.8(2.2) 99.8(0.4) 100.0(0.0)
0.5 SRO-LDR 96.7(2.9) 97.7(1.4) 98.6(0.7) 98.9(0.3)
SAA-LDR 57.6(22.4) 93.9(3.0) 97.7(1.2) 98.9(0.3)
Mean (standard deviation) of the percentage of the 10,000 sample paths in the testing dataset for which the linear decision rule resulted in feasible production quantities ($P^{A,\ell}$). In other words, 100% minus the above values indicates the percentage of sample paths in the testing dataset for which the production quantities needed correction. The mean and standard deviation are computed over 100 training datasets for each value of $N$, $T$, $\alpha$.

Table EC.2 Multi-stage stochastic inventory management: infeasibility


magnitude.
Size of training dataset (N)
T α Approach 10 25 50 100
5 0 SRO-LDR 0.5(1.6) 0.1(0.3) 0.0(0.0) 0.0(0.0)
SAA-LDR 4.6(6.5) 0.4(0.9) 0.0(0.1) 0.0(0.0)
0.25 SRO-LDR 0.8(1.4) 0.4(0.6) 0.2(0.3) 0.1(0.1)
SAA-LDR 5.7(6.6) 0.9(1.1) 0.2(0.3) 0.1(0.1)
0.5 SRO-LDR 1.7(2.1) 1.1(0.8) 1.0(0.6) 0.6(0.4)
SAA-LDR 7.8(7.3) 2.0(1.2) 1.1(0.7) 0.6(0.4)
10 0 SRO-LDR 0.0(0.1) 0.0(0.0) 0.0(0.0) 0.0(0.0)
SAA-LDR 218.2(1417.0) 0.1(0.2) 0.0(0.0) 0.0(0.0)
0.25 SRO-LDR 0.0(0.2) 0.0(0.0) 0.0(0.0) 0.0(0.0)
SAA-LDR 218.8(1417.2) 0.2(0.3) 0.0(0.0) 0.0(0.0)
0.5 SRO-LDR 0.4(0.4) 0.2(0.2) 0.1(0.1) 0.1(0.0)
SAA-LDR 220.3(1417.4) 0.7(0.5) 0.2(0.2) 0.1(0.0)
Mean (standard deviation) of the average infeasibility magnitude on the testing dataset resulting from applying the projection procedure to the production quantities produced by the linear decision rule ($C^{A,\ell}$). The mean and standard deviation are computed over 100 training datasets for each value of $N$, $T$, $\alpha$.
