Thesis by
Boris Defourny
2010
Abstract
This thesis investigates the following question: Can supervised learning techniques be
successfully used for finding better solutions to multistage stochastic programs? A similar
question had already been posed in the context of reinforcement learning, and had led to
algorithmic and conceptual advances in the field of approximate value function methods
over the years (Lagoudakis and Parr, 2003; Ernst, Geurts, and Wehenkel, 2005; Lang-
ford and Zadrozny, 2005; Antos, Munos, and Szepesvári, 2008). This thesis identifies
several ways to exploit the combination “multistage stochastic programming/supervised
learning” for sequential decision making under uncertainty.
Multistage stochastic programming is essentially the extension of stochastic program-
ming (Dantzig, 1955; Beale, 1955) to several recourse stages. After an introduction to
multistage stochastic programming and a summary of existing approximation approaches
based on scenario trees, this thesis mainly focusses on the use of supervised learning for
building decision policies from scenario-tree approximations.
Two ways of exploiting learned policies in the context of the practical issues posed
by the multistage stochastic programming framework are explored: the fast evaluation
of performance guarantees for a given approximation, and the selection of good scenario
trees. The computational efficiency of the approach allows novel investigations relative
to the construction of scenario trees, from which novel insights, solution approaches and
algorithms are derived. For instance, we generate and select scenario trees with random
branching structures for problems over large planning horizons. Our experiments on
the empirical performance of learned policies, compared with gold-standard policies,
suggest that the combination of stochastic programming and machine learning techniques
could also constitute a method per se for sequential decision making under uncertainty,
inasmuch as learned policies are simple to use, and come with performance guarantees
that can actually be quite good.
Finally, limitations of approaches that build an explicit model to represent an optimal
solution mapping are studied in a simple parametric programming setting, and various
insights regarding this issue are obtained.
Acknowledgments
Warm thanks are addressed to my family for their love and constant support over the
years.
I express my deepest gratitude to my advisor, Louis Wehenkel. In addition to his
guidance, to his questioning, and to the many discussions we had together, from technical
subjects to scientific research in general, Louis has provided me with an outstanding
environment for doing research and communicating it, while demonstrating, with style,
his broad culture, his high ethical standards, and his formidable ability to be present in times of need.
This thesis would not have been written without Louis’s interest in supervised learning,
ensemble methods, and optimal control.
I am very grateful to Damien Ernst. Together, we had many discussions, for instance
on approximate dynamic programming and reinforcement learning, the cross-entropy
method, the uses of scenario-tree methods, and the ways to communicate and publish
results. Damien has repeatedly provided support and encouragement regarding the
present work. Collaborating with Louis and Damien has been extremely energizing,
productive, and life-enriching.
I wish to express my gratitude to Rodolphe Sepulchre. Rodolphe too deserves credit
for setting up an excellent research environment, for instance by organizing stimulating
weekly seminars, group lunches, and offering me the opportunity to participate in activ-
ities of the Belgian network DYSCO or in activities hosted by CESAME/INMA (UCL).
Warm thanks to Jean-Louis Lilien for his good advice and support over the years.
With hindsight, I owe to Jean-Louis many important choices that I am happy to have
made.
I would like to thank Yves Crama for setting up discussions on multistage stochastic
programming and models in logistics, and inviting me to presentations given by his group.
I would like to thank Quentin Louveaux for several stimulating discussions.
I express my gratitude to Mania Pavella for her interest and her repeated encourage-
ments.
I am grateful to Jacek Gondzio (University of Edinburgh), Rüdiger Schultz (University
of Duisburg-Essen), Werner Römisch (Humboldt University), and Alexander Shapiro
(Georgia Tech) for stimulating discussions on stochastic programming, and expressions
of interest in the general orientation of this research. The scientific responsibility of the
present work and the views expressed in the thesis rest with us.
I am grateful to Rémi Munos (INRIA Lille) and Olivier Teytaud (INRIA Saclay) for
stimulating discussions related to machine learning and sequential decision making.
I am grateful to Roman Belavkin (Middlesex University) for stimulating discussions
on cognitive sciences, on finance, and for inviting me to give a seminar in his group.
I am grateful to Jovan Ilic and Marija Ilic (Carnegie Mellon University) for meetings
and discussions from which my interest in risk-aware decision making originates.
My interest in multistage stochastic programming originates from meetings with mem-
bers of the department OSIRIS (Optimisation, Simulation, Risque et Statistiques) at
Electricité de France. I would like to thank especially Yannick Jacquemart, Kengy Barty,
Pascal Martinetto, Jean-Sébastien Roy, Cyrille Strugarek, and Gérald Vignal for inspiring
discussions on sequential decision making models and current challenges in the electricity
industry. The scientific responsibility of the present work and the views expressed in the
thesis rest with us.
Warm thanks to my friends, colleagues and past post-docs from the Systems and
Modeling research unit and beyond. I had innumerable scientific and personal discus-
sions with Alain Sarlette and Michel Journée, and during their post-doc time with Emre
Tuna and Silvère Bonnabel. I had a wonderful time and discussions with Pierre Geurts,
Vincent Auvray, Guy-Bart Stan, Renaud Ronsse, Christophe Germay, Maxime Bon-
jean, Luca Scardovi, Denis Efimov, Christian Lageman, Alexandre Mauroy, Gilles Meyer,
Marie Dumont, Pierre Sacré, Christian Bastin, Guillaume Drion, Anne Collard, Laura
Trotta, Bertrand Cornélusse, Raphaël Fonteneau, Florence Belmudes, François Schnit-
zler, François Van Lishout, Gérard Dethier, Olivier Barnich, Philippe Ries, and Axel
Bodart. I extend my acknowledgements to Thierry Pironet and Yasemin Arda (HEC-
Ulg). Thanks also to Patricia Rousseaux, Thierry Van Cutsem and Mevludin Glavic.
Many thanks to my friends who encouraged me to pursue this work, and in particular to
Estelle Derclaye (University of Nottingham).
I am also indebted to people who helped me when I was abroad for conferences and
made my time there even more enjoyable: Marija Prica, Guillaume Obozinski, Janne
Kettunen, Aude Piette, Diana Chan, Daniel Bartz, Kazuya Haraguchi.
I had the opportunity to coauthor papers by Sourour Ammar, Philippe Leray (INSA
Rouen) and Louis Wehenkel, and a paper by Bertrand Cornélusse, Gérald Vignal and
Louis Wehenkel, and I wish to thank these authors for these collaborations.
I gratefully acknowledge the financial support of the Belgian network DYSCO (Dy-
namical Systems, Control, and Optimization), funded by the Interuniversity Attraction
Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific
responsibility rests with us. This work was supported in part by the IST Programme
of the European Union, under the PASCAL2 Network of Excellence, IST-2007-216886.
This thesis only reflects our views.
Finally, I would like to express my extreme gratitude to the members of my thesis
defense committee: Rodolphe Sepulchre (Chair), Louis Wehenkel (Advisor), Yves Crama,
Quentin Louveaux, Shie Mannor (Technion), Werner Römisch (Humboldt University),
Alexander Shapiro (Georgia Tech), and Olivier Teytaud (INRIA Saclay/Université Paris-
Sud).
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Published Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
7. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.2 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.3 Influences on Research in Machine Learning . . . . . . . . . . . . . . . . . 140
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Chapter 1
Introduction
Multistage stochastic programming has attracted a lot of interest in recent years,
as a promising framework for formulating sequential decision making problems under
uncertainty. Several potential applications of the framework are often cited:
• Capacity planning: finding the location and size of units or equipment, such as
power plants or telecommunication relays.
In these applications, uncertainty may refer to the evolution of the demand for goods or
services, temperature and rainfall patterns affecting consumption or production, inter-
est rates affecting the burden of debt, . . . Under growing environmental stress, resource
limitations, and the concentration of populations in cities, many believe that these applications
can only gain in societal impact in the future, and that even better quantitative
methods for tackling them are needed, especially methods able to take into account a
large number of constraints.
In general, problems where a flexible plan of successive decisions has to be imple-
mented, under uncertainties described by a probabilistic model, can be formulated as a
multistage stochastic program (Chapter 2). However, scalable numerical solution algo-
rithms are not always available, so that restrictions to certain classes of programs and
then further approximations are needed.
Interestingly, the approximations affect primarily the representation of the uncer-
tainty, rather than the space of possible decisions or the space of possible states reachable
by the controlled system. Thus, the limitations on problem dimensions suffered by
the multistage stochastic programming framework are of a different nature than those found in dynamic
programming (the so-called curse of dimensionality). The multistage stochastic program-
ming framework is very attractive for settings where decisions in high-dimensional spaces
must be found, but suffers quickly from the dimension of the uncertainty and from the
length of the planning horizon.
This thesis deals with some aspects related to multistage stochastic programming.
Our research was initially motivated by finding ways to incorporate supervised learning
techniques into the multistage stochastic programming framework.
Whereas the material of Chapters 5 and 6 is still unpublished, most of the material of
Chapters 2, 3, 4 has been published in the following papers.
• B. Defourny, L. Wehenkel. 2007. Projecting generation decisions induced by a stochastic
program on a family of supply curve functions. Third Carnegie Mellon Conference on the
Electricity Industry. Pittsburgh PA. 6 pages.
Work that uses concepts or algorithms from stochastic programming, and addresses
specific topics in machine learning, has also been presented in the following papers.
• B. Defourny, D. Ernst, L. Wehenkel. 2008. Risk-aware decision making and dynamic
programming. Y. Engel, M. Ghavamzadeh, S. Mannor, P. Poupart, editors, NIPS-08 workshop
on model uncertainty and risk in reinforcement learning. 8 pages.
• B. Defourny, L. Wehenkel. 2009. Large margin classification with the progressive hedging
algorithm. S. Nowozin, S. Sra, S. Vishwanathan, S. Wright, editors, Second NIPS workshop on
optimization for machine learning. 6 pages.
In this section, we describe an attitude towards risk and uncertainty that can motivate
decision makers to employ multistage stochastic programming. Then, we detail the ele-
ments of the decision model and the approximations that can make the model tractable.
In their first attempt towards planning under uncertainty, decision makers often set up a
course of actions, or nominal plan (reference plan), deemed to be robust to uncertainties
in some sense, or to be a wise bet on future events. Then, they apply the decisions, often
departing from the nominal plan to better take account of actual events. To further
improve the plan, decision makers are then led to consider (i) in which parts of the
plan flexibility in the decisions may help to better fulfill the objectives, and (ii) whether
the process by which they make themselves (or the system) “ready to react” impacts the
initial decisions of the plan and the overall objectives. If the answer to (ii) is positive, then
it becomes valuable to cast the decision problem as a sequential decision making problem,
even if the net added value of doing so (benefits minus increased complexity) is unknown
at this stage. During the planning process, the adaptations (or recourse decisions) that
may be needed are clarified, and their influence on prior decisions is quantified. The notion of
nominal plan is replaced by the notion of decision process, defined as a course of actions
driven by observable events. As distinct outcomes usually have antagonistic effects on ideal
prior decisions, it becomes crucial to determine which outcomes should be considered, and
what importance weights should be put on these outcomes, in the perspective of selecting
decisions under uncertainty that are not regretted too much after the dissipation of the
uncertainty by the course of real-life events.
In the robust optimization approach to decision making under uncertainty, decision mak-
ers are concerned with worst-case outcomes. Describing the uncertainty is then essentially
reduced to drawing the frontier between events that should be considered and events
that should be excluded from consideration (for instance, because they would paralyze
any action). In that context, outcomes under consideration form the uncertainty set, and
decision making becomes a game against some hostile opponent that selects the worst
outcome from the uncertainty set. Ben-Tal et al. (2004) provide arguments in favor of
robust approaches.
In a stochastic programming approach, decision makers use a softer frontier between
possible outcomes, by assigning weights to outcomes and optimizing some aggregated
measure of performance that takes into account all these possible outcomes. In that
context, the weights are often interpreted as a probability measure over the events, and
a typical way of aggregating the events is to consider the expected performance under
that probability measure.
Furthermore, interpreting weights as probabilities allows reasoning under uncertainty.
Essentially, probability distributions are conditioned on observations, and Bayes’ rule
from probability theory quantifies how decision makers’ initial beliefs about the likelihood
of future events — be it from historical data or from bets — should be updated on the
basis of new observations.
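To make this updating rule concrete, here is a small numerical illustration written by us with made-up numbers, not taken from the thesis: a discrete prior over two demand regimes is revised after a single observation, following Bayes' rule.

    # Toy Bayesian update: prior beliefs over two regimes, revised after one observation.
    priors = {"low": 0.5, "high": 0.5}              # prior beliefs (assumed numbers)
    likelihood = {"low": 0.2, "high": 0.7}          # P(observation | regime), assumed numbers
    evidence = sum(priors[r] * likelihood[r] for r in priors)
    posterior = {r: priors[r] * likelihood[r] / evidence for r in priors}
    print(posterior)                                # {'low': 0.222..., 'high': 0.777...}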
Technically, it turns out that the optimization of a decision process contingent on
future events is more tractable (read: suitable for large-scale operations) when the “rea-
soning under uncertainty” part can be decoupled from the optimization process itself. In
particular, such a decoupling occurs when the probability distributions describing future
events are not influenced in any way by the decisions selected by the agent, that is, when
the uncertainty is exogenous to the decision process.
We can now describe the main elements of a multistage stochastic programming decision
model. These elements are:
i. A sequence of random variables ξ1 , . . . , ξT , with joint probability measure P, representing
the exogenous uncertainty to which the decision maker will react. The probability measure P serves to quantify the prior beliefs
about the uncertainty. There is no restriction on the structure of the random vari-
ables; in particular, the random variables may be dependent. When the realization
of ξ1 , . . . , ξt−1 is known, there is a residual uncertainty represented by the random
variables ξt , . . . , ξT , the distribution of which is now conditioned on the realization
of ξ1 , . . . , ξt−1 .
ii. A sequence of decisions u1 , u2 , . . . , uT defining the decision process for the problem.
Many models also use a terminal decision uT +1 . We will assume that ut is valued in
a Euclidean space Rm (the space dimension m, corresponding to a number of scalar
decisions, could vary with the index t, but we will not stress that in the notation).
iii. A convention specifying when decisions should actually be taken and when the
realizations of the random variables are actually revealed. This means that if ξt−1
is observed before taking a decision ut , we can actually adapt ut to the realization
of ξt−1 . To this end, we identify decision stages: see Table 2.1. A row of the
table is read as follows: at decision stage t > 1, the decisions u1 , . . . , ut−1 are
already implemented (no modification is possible), the realization of the random
variables ξ1 , . . . , ξt−1 is known, the realization of the random variables ξt , . . . , ξT is
still unknown but a density P(ξt , . . . , ξT | ξ1 , . . . , ξt−1 ) conditioned on the realized
value of ξ1 , . . . , ξt−1 is available, and the current decision to take concerns the
value of ut . Once such a convention holds, we need not stress in the notation the
difference between random variables ξt and their realized value, or decisions as
functions of uncertain events and the actual value for these decisions: the correct
interpretation is clear from the context of the current decision stage.
The adaptation of a decision ut to prior observations ξ1 , . . . , ξt−1 will always be
made in a deterministic fashion, in the sense that ut is uniquely determined by the
value of (ξ1 , . . . , ξt−1 ).
A sequential decision making problem has more than two decision stages inas-
much as the realizations of the random variables are not revealed simultaneously:
the choice of the decisions taken between successive observations has to take into
account some residual uncertainty on future observations. If the realization of
several random variables is revealed before actually taking a decision, then the
corresponding random variables should be merged into a single random vector;
if several decisions are taken without intermediary observations, then the corre-
sponding decisions should be merged into a single decision vector (Gassmann and
Prékopa, 2005). This is how a problem concerning several time periods could ac-
tually be a two-stage stochastic program, involving two large decision vectors u1
(first-stage decision, constant), u2 (recourse decision, adapted to the observation of
ξ1 ). What is called a decision in a stochastic programming model may thus actually
correspond to several actions implemented over a certain number of discrete time
periods.
Tab. 2.1: Decision stages, setting the order of observations and decisions.

    stage             decisions already implemented   observations available    decision to take
    1                 none                            none                      u1
    t (1 < t ≤ T)     u1 , . . . , ut−1               ξ1 , . . . , ξt−1         ut
    T + 1             u1 , . . . , uT                 ξ1 , . . . , ξT           none (optional: uT +1 )
iv. A sequence of feasibility sets U1 , . . . , UT constraining the decisions. In principle these
sets could be parametrized by prior observations only, since decisions are themselves uniquely
determined by prior observations, but for convenience we keep track of prior decisions
to parametrize the feasibility sets.
An important role of the feasibility sets is to model how decisions are affected by
prior decisions and prior events. In particular, a situation with no possible recourse
decision (Ut empty at stage t, meaning that no feasible decision ut ∈ Ut exists) is
interpreted as a catastrophic situation to be avoided at any cost.
We will always assume that the planning agent knows the set-valued mapping from
the random variables ξ1 , . . . , ξt−1 and the decisions u1 , . . . , ut−1 to the set Ut of
feasible decisions ut .
We will also assume that the feasibility sets are such that a feasible sequence of
decisions u1 ∈ U1 , . . . , uT ∈ UT exists for all possible joint realizations of ξ1 , . . . , ξT .
In particular, the fixed set U1 must be nonempty. A feasibility set Ut parametrized
only by variables in a subset of {ξ1 , . . . , ξt−1 } must be nonempty for any possi-
ble joint realization of those variables. A feasibility set Ut also parametrized by
variables in a subset of {u1 , . . . , ut−1 } must be implicitly taken into account in the
definition of the prior feasibility sets, so as to prevent a decision maker
from taking, at some earlier stage, a decision that could lead to a situation at stage t
with no possible recourse decision (Ut empty), be it for all possible joint realiza-
tions of the subset of {ξ1 , . . . , ξt−1 } on which Ut depends, or for some possible joint
realization only. These implicit requirements will affect in particular the definition
of U1 .
For example, assume that ut−1 , ut ∈ Rm , and take Ut = {ut ∈ Rm : ut ≥ 0,
At−1 ut−1 + Bt ut = ht (ξt−1 )} with At−1 , Bt ∈ Rq×m fixed matrices, and ht an
affine function of ξt−1 with values in Rq . If Bt is such that {Bt ut : ut ≥ 0} = Rq ,
meaning that for any v ∈ Rq , there exists some ut ≥ 0 with Bt ut = v, then this
is true in particular for v = ht (ξt−1 ) − At−1 ut−1 , so that Ut is never empty. More
details on such conditions can be found in Appendix D.
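The nonemptiness of a recourse set of this form can also be checked numerically by solving a small linear feasibility problem. The sketch below is our own illustration, assuming NumPy and SciPy are available; it is not part of the thesis.

    # Check that U_t = {u >= 0 : B u = v} is nonempty for a given right-hand side v,
    # where v would be h_t(xi_{t-1}) - A_{t-1} u_{t-1} in the setting above.
    import numpy as np
    from scipy.optimize import linprog

    def recourse_nonempty(B, v):
        # True if some u >= 0 satisfies B u = v (an LP feasibility problem with zero objective)
        m = B.shape[1]
        res = linprog(c=np.zeros(m), A_eq=B, b_eq=v, bounds=[(0, None)] * m)
        return res.status == 0      # status 0 means a feasible optimal point was found

    # Example: B = [I, -I] satisfies {B u : u >= 0} = R^q, so any v is attainable.
    q = 3
    B = np.hstack([np.eye(q), -np.eye(q)])
    print(recourse_nonempty(B, np.array([1.0, -2.0, 0.5])))     # True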
Fig. 2.1: (From left to right) Nested partitioning of the event space Ω, starting from a trivial
partition representing the absence of observations. (Rightmost) Scenario tree corre-
sponding to the partitioning process.
v. An objective function defining the performance measure. In this chapter, we write the performance measure as the expecta-
tion of a function f that assigns some scalar value to each realization of ξ1 , . . . , ξT
and u1 , . . . , uT , assuming the integrability of f with respect to the joint distribution
of ξ1 , . . . , ξT .
For example, one could take for f a sum of scalar products Σ_{t=1}^{T} ct · ut , where
c1 is fixed and where ct depends affinely on ξ1 , . . . , ξt−1 . The function f would
represent a sum of instantaneous costs over the planning horizon. The decision
maker would be assumed to know the vector-valued mapping from the random
variables ξ1 , . . . , ξt−1 to the vector ct , for each t.
Besides the expectation, more sophisticated ways to aggregate the distribution
of f into a single measure of performance have been investigated (Ruszczyński and
Shapiro, 2006; Pflug and Römisch, 2007). An important element considered in the
choice of the performance measure is the tractability of the resulting optimization
problem.
The nodes of a scenario tree carry the conditional probability of the value to which they are associated, conditioned on the value associated
to their ancestor node. Multiplying the probabilities of the nodes of the path from the
root to a leaf gives the probability of a scenario.
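As a small illustration (ours, with arbitrary numbers), the probability of a scenario is simply the product of the conditional probabilities attached to the nodes on its root-to-leaf path.

    # Probability of a scenario = product of conditional node probabilities along its path.
    from math import prod

    def scenario_probability(conditional_probs):
        # conditional_probs: P(node value | ancestor node), listed from the root downwards
        return prod(conditional_probs)

    print(scenario_probability([0.5, 0.4, 0.25]))    # 0.05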
Clearly, an exact construction of the scenario tree would require an infinite num-
ber of nodes if the support of (ξ1 , . . . , ξT ) is discrete but not finite. A random process
involving continuous random variables cannot be represented as a scenario tree; never-
theless, the scenario tree construction turns out to be instrumental in the construction
of approximations to nested continuous conditional distributions.
Branchings are essential to represent residual uncertainty beyond the first decision
stage. At the planning time, the decision makers may contemplate as many hypothetical
scenarios as desired, but when decisions are actually implemented, the decisions can-
not depend on observations that are not yet available. We have seen that the decision
model specifies, with decision stages, how the scenario actually realized will be gradually
revealed. No branchings in the representation of the outcomes of the random process
would mean that after conditioning on the observation of ξ1 , the outcome of ξ2 , . . . , ξT
could be predicted (anticipated) exactly. Under such a representation, decisions spanning
stages 2 to T would be optimized on the anticipated outcome. This would be equivalent
to optimizing a nominal plan for u2 , . . . , uT that fully bets on some scenario anticipated
at stage 2.
To visualize how information on the realization of the random variables becomes
gradually available, it is convenient to imagine nested partitions of the event space (Fig-
ure 2.1): refinements of the partitions appear gradually at each decision stage in cor-
respondence with the possible realizations of the new observations. To each subregion
induced by the partitioning of the event space can be associated a constant recourse
decision, as if decisions were chosen according to a piecewise constant decision policy.
On Figure 2.1, the surface of each subregion could also represent probabilities (then by
convention the initial square has a unit surface and the thin space between subregions
is for visual separation only). The dynamical evolution of the partitioning can be rep-
resented by a scenario tree: the nodes of the tree correspond to the subregions of the
event space, and the edges between subregions connect a parent subregion to its refined
subregions obtained by one step of the recursive partitioning process.
Ideally a scenario tree should cover the totality of possible outcomes of a random
process. But unless the support of the distribution of the random variables is finite, no
scenario tree with a finite number of nodes can represent exactly the random process and
the probability measure, as we already mentioned, while even if the support is finite, the
number of scenarios grows exponentially with the number of stages.
In the general decision model, the agent is assumed to have access to the joint probability
distribution of the random process, and to be able to derive from it the conditional distributions listed in Table 2.1.
In practice, computational limitations will restrict the quality of the representation of
P. Let us however reason at first at an abstract and ideal level to establish the program
that an agent would solve for planning under uncertainty.
For brevity, let ξ denote (ξ1 , . . . , ξT ), and let π(ξ) denote a decision policy mapping
realizations of ξ to realizations of the decision process u1 , . . . , uT . Let πt (ξ) denote
the decision ut prescribed by the policy; for a non-anticipative policy, πt (ξ) may depend
only on ξ1 , . . . , ξt−1 . The planning problem can then be stated abstractly as the program

    P :  minimize over π    E{ f (ξ, π(ξ)) }
         subject to         πt (ξ) ∈ Ut (ξ) for each t and each realization of ξ ,
                            with π non-anticipative.
Here we used an abstract notation which hides the nested expectations corresponding
to the successive random variables, and the possible sum decomposition of the function f
among the decision stages. Concrete formulations are presented in Appendix D. Note
that it is possible to be more general by replacing the expectation operator by a func-
tional Φ{·} that maps the distribution of f to a single number in [−∞, ∞]. We also
stressed the possible dependence of Ut on ξ1 , u1 , ξ2 , u2 , . . . , ξt−1 by writing Ut (ξ).
A program more amenable to numerical optimization techniques is obtained by repre-
senting π(·) by a set of optimization variables for each possible argument of the function
— for each possible outcome ξ k = (ξ1k , . . . , ξTk ) of ξ, one associates the optimization vari-
ables (uk1 , . . . , ukT ), written uk for brevity. The non-anticipativity of the policy can be
expressed by a set of equality constraints: for the first decision stage (t = 1) we require
u_1^k = u_1^j for all (k, j), and for subsequent stages (t ≥ 2) we require u_t^k = u_t^j for each (k, j)
such that (ξ_1^k , . . . , ξ_{t−1}^k ) ≡ (ξ_1^j , . . . , ξ_{t−1}^j ).
A finite-dimensional approximation to the program P is obtained by considering a
finite number n of outcomes, and assigning to each outcome a probability p^k > 0. This
yields a formulation on a scenario tree covering the scenarios ξ^k :
    P′ :  minimize    Σ_{k=1}^{n} p^k f (ξ^k , u^k )
          subject to  u_t^k ∈ Ut (ξ^k )   ∀ k ;
                      u_1^k = u_1^j   ∀ k, j ,
                      u_t^k = u_t^j   whenever (ξ_1^k , . . . , ξ_{t−1}^k ) ≡ (ξ_1^j , . . . , ξ_{t−1}^j ) .
Once again we used a simple notation ξ^k for designating outcomes of the process ξ,
which hides the fact that outcomes can share some elements according to the branching
structure of the scenario tree.
Non-anticipativity constraints can also be accounted for implicitly. A partial path
from the root (depth 0) to some node of depth t of the scenario tree identifies some
outcome (ξ1k , . . . , ξtk ) of (ξ1 , . . . , ξt ). To the node can be associated the decision ukt+1 , but
also all decisions ujt+1 such that (ξ1k , . . . , ξtk ) ≡ (ξ1j , . . . , ξtj ). Those decisions are redundant
and can be merged into a single decision on the tree, associated to the considered node
of depth t.
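The merging of redundant decisions can be made explicit by mapping every scenario prefix to a tree node; scenarios sharing the same history then share the same node, and hence the same decision variable. The sketch below is our own construction, not code from the thesis.

    # Map each (scenario k, depth t) pair to a tree node; identical histories share a node.
    def node_index(scenarios):
        nodes, index = {}, {}
        for k, scen in enumerate(scenarios):
            for t in range(len(scen) + 1):           # t = 0 is the root (empty history)
                hist = tuple(scen[:t])
                index[(k, t)] = nodes.setdefault(hist, len(nodes))
        return index, len(nodes)

    scenarios = [(1, 1), (1, 2), (2, 1)]             # three toy scenarios over two stages
    index, n_nodes = node_index(scenarios)
    print(n_nodes)                                   # 6 nodes: root, two at depth 1, three leaves
    print(index[(0, 1)] == index[(1, 1)])            # True: scenarios 1 and 2 share their depth-1 node

The decision relative to stage t + 1 is then attached to the node of depth t, which enforces non-anticipativity without writing explicit equality constraints.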
To fix ideas, we illustrate the scenario tree technique on a trajectory tracking problem
under uncertainty with control penalization. In the proposed example, the uncertainty
is such that the exact problem can be posed on a small finite scenario tree.
k 1 2 3 4 5 6 7
ξ1k -4 -4 -4 3 3 3 3
ξ2k -3 2 2 -3 0 0 2
ξ3k 0 -2 1 0 -1 2 1
pk 0.1 0.2 0.1 0.2 0.1 0.1 0.2 .
The random process is fully represented by the scenario tree of Figure 2.2 (Left): the
first possible outcome is ξ 1 = (−4, −3, 0) with probability p1 = 0.1, and so on. Note that
the random variables ξ1 , ξ2 , ξ3 are not mutually independent.
Assume that an agent can choose actions vt ∈ R at t = 1, 2, 3 (the notation vt instead
of ut is justified in the sequel). The goal of the agent is the minimization of an expected
sum of costs E{ Σ_{t=1}^{3} ct (vt , xt+1 ) | x1 = 0 }. Here xt ∈ R is the state of a continuous-
state, discrete-time dynamical system that starts from the initial state x1 = 0 and
follows the state transition equation xt+1 = xt + vt + ξt . Costs ct (vt , xt+1 ), associated
to the decision vt and the transition to the state xt+1 , are defined by ct = dt+1 + vt^2 /4
with dt+1 = |xt+1 − αt+1 | and α2 = 2.9, α3 = 0, α4 = 0 (αt+1 : nominal trajectory; dt+1 :
tracking error; vt^2 /4: penalization of control effort).
An optimal policy mapping observations ξ1 , . . . , ξt−1 to decisions vt can be obtained
by solving the following convex quadratic program over variables vtk , xkt+1 , dkt+1 , where k
runs from 1 to 7 and t from 1 to 3, and over xk1 trivially set to 0:
    minimize    Σ_{k=1}^{7} p^k Σ_{t=1}^{3} ( d_{t+1}^k + (v_t^k )^2 /4 )
    subject to  −d_{t+1}^k ≤ x_{t+1}^k − α_{t+1} ≤ d_{t+1}^k    ∀ k, t
                x_1^k = 0 ,   x_{t+1}^k = x_t^k + v_t^k + ξ_t^k    ∀ k, t
                v_1^1 = v_1^2 = v_1^3 = v_1^4 = v_1^5 = v_1^6 = v_1^7
                v_2^1 = v_2^2 = v_2^3 ,   v_2^4 = v_2^5 = v_2^6 = v_2^7
                v_3^2 = v_3^3 ,   v_3^5 = v_3^6 .
Here, the vector of optimization variables (v1k , xk1 ) plays the role of uk1 , the vector
(vtk , xkt , dkt ) plays the role of ukt for t = 2, 3, and the vector (xk4 , dk4 ) plays the role of uk4 ,
showing that the decision process u1 , . . . , uT +1 of the general multistage stochastic pro-
gramming decision model can in fact include state variables and more generally any
element that serves to evaluate costs conveniently.
The optimal objective value is +7.3148, and the optimal solution is depicted on Fig-
ure 2.2. In this example, the final solution can be recast as a mapping π̃t from xt
to vt : π̃1 (0) = −0.1, π̃2 (−4.1) = 2.1, π̃2 (2.9) = −1.16, π̃3 (−5) = 2, π̃3 (−1.26) = 1.26,
π̃3 (0) = 0.667, π̃3 (1.74) = −0.74, π̃3 (3.74) = −2. Hence in this case the modeling as-
sumption of an agent observing ξt instead of the system state xt is not a fundamental
restriction.
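For readers who wish to reproduce this small example numerically, the following sketch formulates the same quadratic program with the modeling package cvxpy; it is our own illustration, under the assumption that cvxpy and a QP-capable solver are installed, and the variable names are ours.

    # Scenario data from the table above: (xi_1^k, xi_2^k, xi_3^k) and probabilities p^k.
    import cvxpy as cp
    import numpy as np

    xi = np.array([[-4, -3, 0], [-4, 2, -2], [-4, 2, 1], [3, -3, 0],
                   [3, 0, -1], [3, 0, 2], [3, 2, 1]], dtype=float)
    p = np.array([0.1, 0.2, 0.1, 0.2, 0.1, 0.1, 0.2])
    alpha = np.array([2.9, 0.0, 0.0])            # nominal trajectory alpha_{t+1}, t = 1, 2, 3
    K, T = xi.shape

    v = cp.Variable((K, T))                      # decisions v_t^k
    x = cp.Variable((K, T + 1))                  # states x_t^k
    d = cp.Variable((K, T))                      # tracking errors d_{t+1}^k

    cons = [x[:, 0] == 0]
    for t in range(T):
        cons += [x[:, t + 1] == x[:, t] + v[:, t] + xi[:, t],
                 d[:, t] >= x[:, t + 1] - alpha[t],
                 d[:, t] >= -(x[:, t + 1] - alpha[t])]
    # Non-anticipativity: scenarios sharing the same history share the same decision.
    for t in range(T):
        groups = {}
        for k in range(K):
            groups.setdefault(tuple(xi[k, :t]), []).append(k)
        for ks in groups.values():
            cons += [v[k, t] == v[ks[0], t] for k in ks[1:]]

    cost = sum(p[k] * cp.sum(d[k, :] + cp.square(v[k, :]) / 4) for k in range(K))
    prob = cp.Problem(cp.Minimize(cost), cons)
    prob.solve()
    print(prob.value)        # should be close to the optimal value +7.3148 reported above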
Fig. 2.2: (Left) Scenario tree representing the 7 possible scenarios for a random process ξ =
(ξ1 , ξ2 , ξ3 ). The outcomes ξtk are written in bold, and the scenario probabilities pk are
reported at the leaf nodes. (Middle) Optimal actions vt for the agent. (Right) Visited
states xt under the optimal actions, treated as artificial decisions (see text).
This section discusses several modeling and algorithmic complexity issues raised by the
multistage stochastic programming framework and scenario-tree based decision making.
able to simulate possible state transitions for every possible action, or at least to have at
one’s disposal a fairly exhaustive data set relating actions to state transitions.
In Markov Decision Processes (MDP) (Bellman, 1954; Howard, 1960), the decision maker
seeks to optimize a performance criterion decomposed into a sum of instantaneous re-
wards. The information state of the decision maker at time t coincides with the state xt
of a dynamical system. For simplicity, we do not consider in this discussion partial ob-
servability (POMDP) or risk-sensitivity, for which the system state need not be the
information state of the agent. Optimal decision policies are often found by a reasoning
based on the dynamic programming principle, to which is essential the notion of state as
a sufficient statistic for representing the complete history of the system’s evolution and
agent’s beliefs.
Multistage stochastic programming problems could be viewed as a subclass of finite-
horizon Markov Decision Processes, by identifying the growing history of observations
(ξ1 , . . . , ξt−1 ) with the agent’s state. However, the mathematical assumptions under the
MDP and the stochastic programming formulations are in fact quite different. Complex-
ity results suggest that the algorithmic resolution of MDPs is efficient when the decision
space is finite and small (Littman et al., 1995; Rust, 1997; Mundhenk et al., 2000; Kearns
et al., 2002), while for the scenario-tree based stochastic programming framework, the
resolution is efficient when the optimization problem is convex — in particular the deci-
sion space is continuous — and the number of decision stages is small (Shapiro, 2006).
One of the main appeals of stochastic programming techniques is their ability to deal
efficiently with high-dimensional continuous decision spaces structured by numerous con-
straints, and with sophisticated, non-memoryless random processes. At the same time,
if stochastic programming models have traditionally been used for optimizing long-term
decisions that are implemented once and have lasting consequences, for example in net-
work capacity planning (Sen et al., 1994), they are now increasingly used in the context
of near-optimal control strategies that Bertsekas (2005a) calls limited-lookahead strate-
gies. In this usage, at each decision stage an updated model over the remaining planning
horizon is rebuilt and optimized on the fly, from which only the first-stage decisions are
actually implemented. Indeed, when a stochastic program is solved on a scenario tree, the
initial search for a decision policy degenerates into the search for sequences of decisions
relative to the scenarios covered by the tree. The first-stage decision does not depend
on observations and can thus always be implemented on any new scenario, whereas the
recourse decisions relative to any particular scenario in the tree could be infeasible on a
new scenario, especially if the feasibility sets depend on the random process.
ii. Value-function based methods assume that there is a finite set of actions (or policy
parameters), given a priori, that are the elementary building blocks of a near-
optimal policy, and that can be used to drive the exploratory phase. The value
function represents or approximates the expected value-to-go from the current state,
and can be used to rank candidate actions (or policy parameters).
iii. By the structure of the optimization problem, the decisions and the state space
subregions identified as promising early in the exploratory phase are those that are
actually relevant to a near-optimal policy. This ensures the success of optimistic
exploratory strategies, that refine decisions within promising subregions.
Stochastic programming algorithms do not rely on the covering of the state space
of dynamic programming. Instead, they rely on the covering of the random exogenous
process, which need not correspond to the complete state space (see how the auxiliary
state xt is treated in the example of the previous section). The complement of the state
space and the decision space are “explored” during the optimization procedure itself. The
success of the approach will thus depend on the tractability of the joint optimization in
those spaces, and not on insights on the structure of near-optimal policies.
In multistage stochastic programming approaches, the curse of dimensionality is
present when the number of decision stages increases, and in the face of high-dimensional
exogenous processes. Therefore, methods that one could call, by analogy to approxi-
mate dynamic programming, approximate stochastic programming methods, will attempt
to cover only the realizations of the exogenous random process that are truly needed to
obtain near-optimal decisions. These methods work with a number of scenarios that does
not grow exponentially with the dimension of the exogenous process and the number of
stages.
• V ∗ , the optimal value of the multistage stochastic program minπ E{f (ξ, π(ξ))}. For
notational simplicity, we adopt the convention that f (ξ, π(ξ)) = ∞ if the policy π
is anticipative or yields infeasible decisions.
• uζ , the optimal solution to the expected value problem minu f (ζ, u). Note that the
optimization is over a single fixed sequence of feasible decisions; the problem data
is determined by ζ.
• V ζ , the optimal value of the multistage stochastic program minπ E{f (ξ, π(ξ))}
subject to the additional constraint π1 (ξ) = uζ1 for all ξ. If by a slight abuse of
notation, we write π1 , viewed as an optimization variable, for the value of the
constant-valued function π1 , then the additional constraint is simply π1 = uζ1 .
By definition, V ζ is the value of a policy implementing the first decision from
the expected value problem, and then selecting optimal recourse decisions for the
subsequent decision stages. The recourse decisions differ in general from those that
would be selected by a policy optimal for the original multistage program.
A less radical simplification consists in discarding the distinction between recourse stages,
keeping in the model a first stage (associated to full uncertainty) and a second stage (as-
sociated to full knowledge). A multistage model degenerates into a two-stage model when
the scenario tree has branchings only at one stage. The situation arises for instance when
scenarios are sampled over the full horizon independently: the tree has then branchings
only at the root. In Huang and Ahmed (2009), the value of multistage stochastic program-
ming (VMS) is defined as the difference of the optimal values of the multistage model
versus the two-stage model. The authors establish bounds on the VMS and describe an
application (in the semiconductor industry) where the VMS is high. Note however that a
generalization of the notion of VSS would rather quantify how multistage decisions out-
perform two-stage decisions when those two-stage decisions are implemented with model
rebuilding at each stage, in the manner of the Model Predictive Control scheme.
As an intermediate simplification between the expected value problem and the reduction
to a two-stage model, it is possible to optimize sequences of decisions separately on
each scenario. The decision maker can then use some averaging, consensus or selection
strategy to implement a first-stage decision inferred from the so-obtained ensemble of
first-stage decisions. Here again, the model should be rebuilt with updated scenarios at
each decision stage.
The problem of computing optimal decisions separately for each scenario is known as
the distribution problem. The problem appears in the definition of the expected value of
perfect information (EVPI), which quantifies the additional value that a decision maker
could reach in expectation if he or she were able to predict the future. To make the
notion precise, let V ∗ denote as before the optimal value of the multistage stochastic
program minπ E{f (ξ, π(ξ))} over non-anticipative policies π; let V (ξ) denote the optimal
value of the deterministic program minu f (ξ, u); and let V A be the expected value of
V (ξ), according to the distribution of ξ. Observe that V A is also the optimal value of the
program minπA E{f (ξ, π A (ξ))} over anticipative policies π A , the optimization of which
is now decomposable among scenario subproblems. The EVPI is then defined as the
difference V ∗ − V A ≥ 0. For maximization problems, it is defined by V A − V ∗ ≥ 0.
Intuitively, the EVPI is high when having to delay adaptations to final outcomes due to
a lack of information results in high costs.
The EVPI is usually interpreted as the price a decision maker would be ready to pay
to know the future (Raiffa and Schlaifer, 1961; Birge, 1992). The EVPI also indicates
how valuable the dependence of decision sequences is on the particular scenario they are
optimized over. Mercier and Van Hentenryck (2007) show on an example with low EVPI
how a strategy based on a particular aggregation of decisions optimized separately on
deterministic scenarios can be arbitrarily bad. Thus even if the EVPI is low, heuristics
based on the decisions of anticipative policies can perform poorly.
This does not mean that the approach cannot perform well in practice. Van Henten-
ryck and Bent (2006) have studied and refined various aggregation and regret-minimization
strategies on a series of stochastic combinatorial problems already hard to solve on a sin-
gle scenario, as well as schemes that build a bank of pre-computed reference solutions
and then adapt them online to accelerate the optimization on new scenarios. They show
that their strategies perform well on vehicle routing applications.
Remark 2.1. The progressive hedging algorithm (Rockafellar and Wets, 1991) is
a decomposition method that computes the solution to a multistage stochastic
program on a scenario tree by solving repeatedly separate subproblems on the
scenarios covered by the tree. First-stage decisions and other decisions coupled
by non-anticipativity constraints are obtained by aggregating the decisions of the
concerned scenarios, in the spirit of the heuristics based on the distribution problem
presented above. The algorithm modifies the scenario subproblems at each iteration
to make the decisions coupled by non-anticipativity constraints converge towards
a common and optimal decision.
As the iterations are carried out, first-stage decisions evolve from decisions hedged
by the aggregation strategy to decisions hedged by the multiple recourse deci-
sions computed on the scenario tree. Therefore, the progressive hedging algorithm
shows that there can be a smooth conceptual transition between the decision model
based on the distribution problem and the decision model based on the multistage
stochastic programming problem.
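To fix ideas, here is a one-dimensional toy illustration of the progressive hedging iteration, written by us and not taken from the thesis: each scenario subproblem is solved with an augmented penalty toward the current aggregate, and the multipliers are then updated, driving the scenario-wise first-stage decisions toward a common value.

    # Toy progressive hedging on min_x sum_s p_s (x - xi_s)^2; the nonanticipative optimum
    # is the probability-weighted mean of the scenario targets xi_s.
    xi = [1.0, 2.0, 6.0]              # scenario outcomes
    p = [0.5, 0.3, 0.2]               # scenario probabilities
    rho = 1.0                         # penalty parameter
    w = [0.0, 0.0, 0.0]               # multipliers, one per scenario
    x_bar = 0.0                       # aggregated (nonanticipative) decision
    for _ in range(100):
        # scenario subproblems: argmin (x - xi_s)^2 + w_s x + (rho/2)(x - x_bar)^2, closed form here
        x = [(2 * xi_s - w_s + rho * x_bar) / (2 + rho) for xi_s, w_s in zip(xi, w)]
        x_bar = sum(p_s * x_s for p_s, x_s in zip(p, x))              # aggregation step
        w = [w_s + rho * (x_s - x_bar) for w_s, x_s in zip(w, x)]     # multiplier update
    print(x_bar)                      # approx 2.3 = 0.5*1 + 0.3*2 + 0.2*6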
Example 2.1. We illustrate the computation of the VSS and the EVPI on an artifi-
cial multistage problem, with numerical parameters chosen in such a way that the
full multistage model is valuable. By valuable we mean that the presented simpli-
fied decision-making schemes will output first-stage decisions that are suboptimal.
If those decisions were implemented, and subsequently the best possible recourse
decisions were applied, the value of the objective over the full horizon would be
significantly suboptimal.
Let w1 , w2 , w3 be mutually independent random variables uniformly distributed on
{+1, −1}. Let ξ = (ξ1 , ξ2 , ξ3 ) be a random walk such that ξ1 = w1 , ξ2 = w1 + w2 ,
ξ3 = w1 + w2 + w3 . Let the 8 equiprobable outcomes of ξ form a scenario tree
and induce non-anticipativity constraints (the tree is a binary tree of depth 3).
Consider the decision process u = (u1 , u2 , u3 ) with u2 ∈ R and ut = (ut1 , ut2 ) ∈ R2
for t = 1, 3. Then consider the multistage stochastic program
    maximize    (1/8) Σ_{k=1}^{8} { [ 0.8 u_{11}^k − 0.4 (u_2^k /2 + u_{31}^k − ξ_3^k )^2 ]
                                    + u_{32}^k ξ_3^k + [ 1 − u_{11}^k − u_{12}^k ] }
    subject to  u_{11}^k + u_{12}^k ≤ 1    ∀ k
                −u_{11}^k ≤ u_2^k ≤ u_{11}^k    ∀ k
                −u_{1j}^k ≤ u_{3j}^k ≤ u_{1j}^k    ∀ k and j = 1, 2
        C1:     u_1^k = u_1^1    ∀ k
        C2:     u_2^k = u_2^{k+1} = u_2^{k+2} = u_2^{k+3}    for k = 1, 5
        C3:     u_3^k = u_3^{k+1}    for k = 1, 3, 5, 7 .
The non-anticipativity constraints C1, C2, C3, which are convenient to state the
problem, indicate in practice the redundant optimization variables that can be
eliminated.
• The expected value problem is obtained by replacing the random process ξ by its
expectation ζ = E{ξ} = (0, 0, 0). Its optimal value is 1 with first-stage decision
uζ1 = (0, 0). When equality constraints are made implicit, the problem can be
formulated using 5 scalar optimization variables only.
• The two-stage relaxation is obtained by relaxing the constraints C2, C3. Its
optimal value is 0.6361 with uk1 = uII1 = (0.6111, 0.3889), which defines uII1 .
• The distribution problem is obtained by relaxing the constraints C1, C2, C3.
Its optimal value is V A = 0.6444. The two extreme scenarios ξ 1 = (1, 2, 3)
and ξ 8 = (−1, −2, −3) have first-stage decisions u11 = u81 = (0.7778, 0.2222)
and value -0.0556. The 6 other scenarios have uk1 = (0.5556, 0.3578) and value
0.8778, k = 2, . . . , 7. Note that in general, (i) scenarios with the same optimal
first-stage decision and values may still have different recourse decisions, and
(ii) the first-stage decisions can be distinct for all scenarios.
• The EVPI is equal to V A − V ∗ = 0.2944.
• Solving the multistage stochastic program with the additional constraint
yields an upper bound on the optimal value of any scheme using the first-stage
decision of the expected value problem. This value is V ζ = −0.2000.
• The VSS is equal to V ∗ − V ζ = 0.55.
• Solving the multistage stochastic program with the additional constraint
yields an upper bound on the optimal value of any scheme using the first-stage
decision of the two-stage relaxation model. This value is V II = 0.2431. Thus,
the value of the multistage model over a two-stage model, in our sense (distinct
from the VMS of Huang and Ahmed (2009)), is at least V ∗ − V II =0.1069.
• Selection strategy: Re-solving the multistage program with the first-stage decision
fixed to each of the two distinct candidates from the distribution problem gives
optimal value 0.3056 if uk1 = (0.7778, 0.2222), optimal value 0.2167 if uk1 =
(0.5556, 0.3578); the strategy then selects the candidate with the best value. But one has
to concede that in contrast to other simplified models, for which we solve
multistage programs only to measure the quality of a suboptimal first-stage
decision, the selection strategy needs good estimates of the different optimal
values to actually output the best decision.
• Consensus strategy: The outcome of a majority vote out of the set of the 8
first-stage decisions would be the decision (0.5556, 0.3578) associated to the
scenarios 2 to 7. With value 0.2167, this turns out to be the worse of the two
decisions (0.7778, 0.2222) and (0.5556, 0.3578).
• Averaging strategy: The mean first-stage decision of the set of 8 first-stage
decisions is ū1 = (0.6111, 0.3239). Solving the multistage program with uk1 =
ū1 for all k yields the optimal value 0.2431.
The best result is the value 0.3056 obtained by the selection strategy. Note that
we are here in a situation where the multistage program and its variants could be
solved exactly, that is, with a scenario tree representing the possible outcomes of
the random process exactly.
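The example is small enough to be reproduced with an off-the-shelf convex solver. The sketch below is our own formulation with cvxpy (assumed available); relaxing C2 and C3, or all of C1, C2, C3, gives the two-stage relaxation and the distribution problem discussed above.

    import itertools
    import cvxpy as cp

    scen = list(itertools.product([1, -1], repeat=3))     # (w1, w2, w3); scenario 1 is (1, 1, 1)
    xi3 = [w1 + w2 + w3 for (w1, w2, w3) in scen]         # xi_3^k
    K = len(scen)

    u1 = [cp.Variable(2) for _ in range(K)]               # (u11, u12) per scenario
    u2 = [cp.Variable() for _ in range(K)]
    u3 = [cp.Variable(2) for _ in range(K)]               # (u31, u32) per scenario

    obj = (1.0 / K) * sum(0.8 * u1[k][0] - 0.4 * cp.square(u2[k] / 2 + u3[k][0] - xi3[k])
                          + u3[k][1] * xi3[k] + (1 - u1[k][0] - u1[k][1]) for k in range(K))
    cons = []
    for k in range(K):
        cons += [u1[k][0] + u1[k][1] <= 1,
                 -u1[k][0] <= u2[k], u2[k] <= u1[k][0],
                 -u1[k] <= u3[k], u3[k] <= u1[k]]
    # non-anticipativity: C1 (common u1), C2 (u2 shared within xi_1-groups), C3 (u3 shared in pairs)
    cons += [u1[k] == u1[0] for k in range(1, K)]
    cons += [u2[k] == u2[4 * (k // 4)] for k in range(K)]
    cons += [u3[k] == u3[2 * (k // 2)] for k in range(K)]

    prob = cp.Problem(cp.Maximize(obj), cons)
    prob.solve()
    print(prob.value)    # expected to be about 0.35, consistent with the EVPI and VSS figures above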
In two-stage stochastic programming, the large or infinite set of recourse decisions of the
original program is reduced to a finite set of recourse decisions for the approximation.
Hence the exact and approximate solutions lie in different spaces and cannot be com-
pared directly. Still, recourse decisions can be treated implicitly, as if they were already
incorporated to the objective function, and as if the only remaining element to optimize
were the first-stage decision.
In multistage stochastic programming, we face the same issue: one cannot directly
compare finite-dimensional solutions obtained from finite scenario-tree approximations to
exact optimal solutions lying in a space of functions. But now, using the same technique
of treating all recourse decisions implicitly leads to a dilution of the structural properties
of the objective function. As these structural properties are weaker, the class of objective
functions to consider becomes very general. Worst-case distances between functions in
such classes may cease to guide satisfactorily a discretization procedure. In addition, as
the random process runs over several stages, the discretization problems are posed over
typically larger spaces, making them more difficult to solve, even approximately.
For these reasons, rather than presenting the generation of scenario trees as a natu-
ral extension of discretization methods for two-stage stochastic programming, with the
incorporation of branchings for representing the nested conditional probability densities,
we state the problem in a more open way, which also highlights complexity aspects:
choose the surrogate program P′ and an algorithm A for building it so as to make the
regret, on the original multistage program, of the policy π̂ induced by the solution of P′
as small as possible.
Notice that we allow, for the sake of generality, that the surrogate program may refer
to a function g different from the original objective f , and that we impose that the
algorithm A, the solving strategy associated to the problem P′ , as well as the evaluation
of the induced policy π̂ on any new scenario, are all tractable. At this stage, we do not
specify how π̂ is inferred or understood; π̂ needs to be introduced here only to be able
to write a valid expression for the regret on the original multistage program.
Depending on the situation, the problem P (random process model and function f ) can
be described analytically, or be only accessible through sampling and/or simulation. The
problem P′ will be described by a scenario tree and the choice of the function g, under
limitations intrinsically due to the tractability of the optimization of the approximate
program.
As we have seen, there are many derived decision-making schemes and usages of
the multistage stochastic programming framework. Also, various classes of optimization
programs can be distinguished — with the main distinctions being between two-stage and
multistage settings, and among linear, convex, and integer/mixed-integer formulations —
and thus several possible families of functions over which one might attempt to minimize
a worst-case regret.
In the stochastic programming literature, several scenario tree generation strategies
have been studied. The scenario tree generation problem is there often viewed in one or
another of two reduced ways with respect to the above definition, namely
(i) as the problem of finding a scenario tree whose associated optimal value

        min_u  Σ_{k=1}^{n} p^k f (ξ^k , u^k )

    is close to the optimal value of the exact program;
(ii) as the problem of finding a scenario tree with its associated optimal first-stage
decision û1 close to a first-stage decision π1 optimal for the exact program.
Indeed, version (i) is useful when the goal is merely to estimate the optimal value of the
original program P, while version (ii) is useful when the goal is to extract only the first
stage decision, assuming that later on, recourse decisions are recomputed using a similar
algorithm, given the new observations.
The generic approximation problem that we have described is more general, since it
covers also the case where the scenario tree approach may be exploited offline to extract
a complete policy π̂(ξ) that may then be used later on, in a stand-alone fashion for
decision making over arbitrary scenarios and decision steps, be it in the real world or in
the context of Monte Carlo simulations.
To give an idea of theoretical results established in the scenario tree generation lit-
erature, we now briefly discuss two representative trends of research: works that study
Monte Carlo methods for building the tree, and works that seek to minimize in a deter-
ministic fashion a certain measure of discrepancy between the original process and the
approximate process represented by the scenario tree.
Monte Carlo methods have several advantages: they are easy to implement and they scale
well with the dimension, in the sense that with enough samples, one can get close to the
statistical properties of high-dimensional target distributions with high probability. The
major drawback of (pure) Monte Carlo methods is the variance of the results (in our case,
the optimal value and optimal solutions of the approximate programs) in small-sample
conditions.
Let us describe the Sample Average Approximation method (SAA) (Shapiro, 2003b),
which uses Monte Carlo for generating the scenario tree. One starts by building the
branching structure of the tree. Note that the method does not specify how to carry out
that step. Practitioners often use the same branching factor for each node relative to
a given decision stage. They also often concentrate the branchings at early stages: the
branching factor is high at the root node and then decreases with the index of the decision
stage. The next step of the method consists in sampling the node values according to the
distributions conditioned on the values of the ancestor nodes. The procedure, referred to
as conditional sampling, is implemented by sampling the realizations of random variables
at stage t before sampling those of stage t+1. Distinct realizations are assigned to distinct
nodes, which are given a conditional probability equal to the inverse of the branching
factor. The last step consists in solving the program on the so-obtained scenario tree
and thus, although part of the description of the SAA method, does not concern the
generation of the tree itself.
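A bare-bones version of the conditional sampling step can be sketched as follows; this is our own illustration, with a placeholder Gaussian random-walk sampler standing in for the problem's true conditional distributions.

    # Build an SAA scenario tree by conditional sampling with given per-stage branching factors;
    # each child node receives conditional probability 1/branching_factor.
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_conditional(history):
        # placeholder conditional sampler: a Gaussian random walk around the last observation
        mean = history[-1] if history else 0.0
        return mean + rng.normal()

    def build_tree(branching):
        # branching: list of branching factors per stage; returns a list of (scenario, probability)
        scenarios = [((), 1.0)]
        for b in branching:
            new = []
            for hist, prob in scenarios:
                for _ in range(b):
                    new.append((hist + (sample_conditional(list(hist)),), prob / b))
            scenarios = new
        return scenarios

    tree = build_tree([4, 3, 2])                          # 4 * 3 * 2 = 24 scenarios
    print(len(tree), sum(prob for _, prob in tree))       # 24 scenarios, total probability 1.0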
Consider scenario trees obtained by conditional sampling. For simplicity assume a
uniform branching factor nt at each stage t, so that the number of scenarios is n =
Π_{t=1}^{T} nt . Shapiro (2006) shows under some technical assumptions that if we want to
guarantee, with a probability at least 1 − α, that implementing the first-stage decision û1
optimized on a scenario tree of size n while implementing subsequently optimal recourse
decisions conditionally on the first-stage decision will yield an objective value ε-close to
the exact optimal value, then the size n of the tree we use for that purpose has to grow
exponentially with the number of stages. The result goes against the intuition that
by asking for ε-optimality with probability 1 − α only, one could get moderate sample
complexity requirements. Now, as the exponential growth of the number of scenarios is
not sustainable, one can only hope to solve multistage models in small-sample conditions,
and obtain solutions that at least with the SAA method may vary from tree to tree
and be of uncertain value for the real problem. Perhaps surprisingly, it is not possible to
obtain valid statistical bounds for that uncertain value by imposing as first-stage decision
the tested first-stage decision and reoptimizing recourse decisions on several new random
trees (Shapiro, 2003a).
There exist various deterministic techniques for selecting jointly the scenarios of the tree.
Note that the development of scenario tree methods involves a good deal of numerical experimentation,
and a risk of overestimating the domain of validity of the proposed methods,
since research efforts are oriented by experiments on particular problems.
Moment-matching methods (Høyland and Wallace, 2001; Høyland et al., 2003) at-
tempt to produce discrete distributions with some statistical moments matching those
of a target distribution. Moment matching may be done at the expense of other statis-
tics, such as the number and the location of the modes, that might also be important.
Hochreiter and Pflug (2007) give an example illustrating that risk.
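To illustrate only the basic idea behind moment matching, the following rough sketch (ours, not the algorithm of Høyland and Wallace) fits a small discrete distribution to a target mean and variance by minimizing squared moment deviations; practical methods match higher moments and correlations as well.

    import numpy as np
    from scipy.optimize import minimize

    target_mean, target_var = 0.0, 1.0      # assumed target moments
    n = 5                                   # number of discrete points

    def loss(z):
        x, q = z[:n], z[n:]
        prob = np.exp(q) / np.exp(q).sum()              # softmax keeps probabilities valid
        mean = prob @ x
        var = prob @ (x - mean) ** 2
        return (mean - target_mean) ** 2 + (var - target_var) ** 2

    z0 = np.concatenate([np.linspace(-2.0, 2.0, n), np.zeros(n)])
    res = minimize(loss, z0)
    x, prob = res.x[:n], np.exp(res.x[n:]) / np.exp(res.x[n:]).sum()
    print(np.round(x, 3), np.round(prob, 3))            # matched points and probabilities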
The theoretical analysis underlying the so-called probability metrics methods, which
we have briefly evoked in the context of two-stage stochastic programming, was initially
believed to be easily extensible to the multistage case (Heitsch and Römisch, 2003); but
then it turned out that more elaborate measures of probability distances, integrating the
intertemporal aspect of observations, were needed (Heitsch and Römisch, 2009). These
elaborate metrics are more difficult to compute and to minimize, so that well-justified
discretizations of multistage programs are more difficult to obtain.
We can also mention methods that come with approximation guarantees, such as
bounds on the suboptimality of the approximation (Frauendorfer, 1996; Kuhn, 2005).
However, they are applicable only under relatively strong assumptions concerning the
problem class and the type of randomness. Quasi Monte Carlo techniques are perhaps
among the more generally applicable methods (Pennanen, 2009).
Most deterministic methods end up with the formulation of difficult optimization problems, such as nonconvex or NP-hard problems (Høyland et al., 2003; Hochreiter and Pflug, 2007), or with computationally demanding tasks (such as multidimensional integration), especially for high-dimensional random processes.
The field is still in a state where the scope of existing methods is not well defined, and where the algorithmic description of the methods is incomplete, especially concerning the branching structure of the trees. Because the domains of applicability are unknown or overestimated, it is delicate to select a sophisticated deterministic technique for building a scenario tree on a new problem.
Several authors have proposed to use a generic scheme similar to Model Predictive Control
to assess the performances associated to a particular algorithm A for building the scenario
tree (Kouwenberg, 2001; Chiralaksanakul, 2003; Hilli and Pennanen, 2008). The scheme
can be sketched as follows.
i. Generate a scenario tree using algorithm A. Solve the resulting program and extract from its solution the value of the first-stage decision u1, say ū1.
ii. Generate a test sample of n″ mutually independent scenarios by drawing i.i.d. realizations ξj of the random process ξ.
iii. For each scenario ξj of the test sample, set u1^j = ū1, and obtain sequentially the recourse decisions u2^j, . . . , uT^j, as follows: each decision ut^j is obtained as a first-stage decision computed by taking as an initial condition the past decisions u1^j, . . . , ut−1^j and the history ξ1^j, . . . , ξt−1^j of the test scenario ξj, by conditioning the joint distribution of ξt, . . . , ξT on the history, by using the algorithm A to build a new scenario tree that approximates the random process ξt, . . . , ξT, by solving the program formulated on this tree over the optimization variables relative to the decisions ut, . . . , uT, and by discarding all but the decision ut, the optimal value of which is then assigned to ut^j.
iv. Estimate the overall performance of the scheme by Monte Carlo simulation. This consists in evaluating on the test sample the empirical average

    VTS(A) = (1/n″) ∑_{j=1}^{n″} f(ξj, uj) .
The Monte Carlo estimate VTS (A) can provide an unbiased estimation of the value
of the scenario tree building algorithm A in the context of the other approximations
involved in the numerical computations of the sequences of decisions, such as for instance
simplifications of the objective function, or early stopping at low-accuracy solutions.
The estimator VTS (A) may have a high variance, but we can expect a high positive
correlation between this estimator and an estimator VTS (A0 ) using the same test sample
but relative to another tree generation algorithm A0 . This would allow a reliable com-
parison of the relative performance of the two algorithms A, A0 on the problem instance
at hand.
The validation is generic in the sense that it can be applied to any algorithm A, but
also in the sense that it addresses the general scenario tree building problem in the larger
context of the decision making scheme actually implemented in practice.
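To make the scheme of steps i–iv above concrete, here is a minimal sketch in Python (our own illustration; the tree-building algorithm A and the helpers solve_on_tree, sample_scenario, condition_process and the objective f are hypothetical, problem-specific components, not part of the thesis).

import numpy as np

def shrinking_horizon_value(A, problem, T, n_test, rng):
    """Estimate V_TS(A) for a tree building algorithm A, following steps i-iv.
    'problem' is assumed to expose the problem-specific stubs used below."""
    # (i) build one tree over the full horizon and keep the first-stage decision
    tree = A(problem.process, stages=range(1, T + 1))
    u1_bar = problem.solve_on_tree(tree, past_decisions=[])[0]

    values = []
    for _ in range(n_test):                       # (ii) independent test scenarios
        xi = problem.sample_scenario(T, rng)
        u = [u1_bar]
        for t in range(2, T + 1):                 # (iii) shrinking-horizon recourse
            cond = problem.condition_process(history=xi[:t - 1])
            subtree = A(cond, stages=range(t, T + 1))
            u.append(problem.solve_on_tree(subtree, past_decisions=u)[0])
        values.append(problem.f(xi, u))           # (iv) objective on this scenario
    return float(np.mean(values))                 # empirical average V_TS(A)

The sketch makes the computational burden apparent: every test scenario triggers T − 1 additional tree constructions and optimizations.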
We point out that alternative numerical methods for solving infinite-dimensional two-stage stochastic programs exist, based on incorporating the discretization procedure into the optimization, for instance by updating the discretization or carrying out importance sampling within the iterations of a given optimization algorithm (Slyke and Wets, 1969; Higle and Sen, 1991; Norkin et al., 1998a), or by using stochastic subgradient methods (Nemirovski et al., 2009). Also, heuristics for finding good policies directly on the
infinite-dimensional multistage problem have been suggested: a possible idea, akin to di-
rect policy search procedures in Markov Decision Processes, is to optimize a combination
of feasible non-anticipative basis policies π j (ξ) specified beforehand (Koivu and Penna-
nen, 2010). These methods are nevertheless less general than the standard scenario tree
approach, because they seem to be reserved to applications with rather simple feasibility
sets.
2.5 Conclusions
the framework of stochastic programming based on scenario trees has in this way, in spite of its theoretical appeal, lost its practical attractiveness in recent years in many environments dealing with large-scale systems (Powell and Topaloglu, 2003; Van Hentenryck and Bent, 2006).
Chapter 3
Let X be a random vector following some fixed but unknown density PX , referred to as
the data-generating density.
Let D = {x1 , . . . , xn } denote a set of realizations of X drawn from PX in some way.
Call D the data set. For brevity we write xn for x1 , . . . , xn . A data set of n samples
is a random quantity. Its density is written PD . There is a wide spectrum of methods
from statistics and machine learning that can be used to explain the data, and predict
(forecast) future samples. We discuss those methods in the context of the inference of
a predictive density px|D for a new sample x, given the data. One could also condition the density of x = (y, z) on some of its components y and interpret the resulting density pz|y,D as the predictive distribution of output variables z given input variables y and the data set D.
The latter use of densities is not discussed in this section. We simply recall that the summary of a density through a single value is addressed by decision theory, and is usually
done through the choice of a loss function (Robert, 2007, Chapter 2). The quality of the
inference can also be quantified through a measure of divergence between the predicted
and true densities (Ali and Silvey, 1966; Csiszár, 1967). A particular divergence that has
been found useful (Clarke and Barron, 1990) is the Kullback-Leibler divergence between
two densities g, h, defined by
    KL(g||h) = ∫ g(x) log ( g(x) / h(x) ) dx .     (3.1)
The KL divergence is also referred to as the cross-entropy distance (Rubinstein and
Kroese, 2004).
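For a discrete distribution the integral in (3.1) becomes a sum; the short snippet below (our own numerical illustration with arbitrary example densities g and h) computes KL(g||h) directly from the definition.

import numpy as np

g = np.array([0.5, 0.3, 0.2])      # two example densities on a common finite support
h = np.array([0.4, 0.4, 0.2])

kl_gh = np.sum(g * np.log(g / h))  # KL(g||h): nonnegative, zero iff g == h
kl_hg = np.sum(h * np.log(h / g))  # the divergence is not symmetric
print(kl_gh, kl_hg)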
In the simplest frequentist approach to explaining data, one assumes that the samples
are drawn independently, and that the data-generating density belongs to a family of
densities parametrized by θ ∈ Rd . The density at x with parameter θ is written p(x; θ).
As the joint density of independent random variables is the product of the marginal
densities, we can write the joint density of the samples as
    pD|θ(x^n ; θ) = ∏_{k=1}^{n} p(xk ; θ) .
The parameter θ can be inferred (estimated) from the finite data set by maximizing the
log-likelihood of the data (Fisher, 1925):
    θ̂ ∈ argmaxθ ∑_{k=1}^{n} log p(xk ; θ) ,     (3.2)
where argmax f denotes the set of maximizers of f (often a singleton). The predictive
density, defined as the density of a new sample xn+1 , conditionally to the data set, is
then given by
provided that the resulting estimate θ̂ is in the interior of Θ. The left-hand side of
(3.3) is called the score function. The random vector
    δ(θ) = ∑_{k=1}^{n} ∇θ p(Xk ; θ) / p(Xk ; θ) ,   with Xk drawn according to p(·; θ),
is called the efficient score. The covariance matrix of the efficient score is called
the Fisher information matrix, written I(θ; n) ∈ Rd×d — the argument n stresses
that we define the Fisher information matrix for n observations. We write I(θ)
for I(θ; 1). Under suitable conditions on p(X; θ) allowing the interchange of the
expectation and differentiation operators,
    Iij(θ; n) = −n E{ ∂² log p(X; θ) / ∂θi ∂θj } ,
which shows that the Fisher information is related to the curvature of p evaluated
at θ.
Let φ(D) ∈ Rd , with D made of n observations, denote an unbiased estimator of θ,
that is, an Rd -valued mapping such that E{φ(D)} = θ. Let us also assume that the
true density PX is p(·; θ). Then under some regularity conditions, the covariance
matrix of φ(D), written Σφ , satisfies the Cramér-Rao inequality:
    Σφ ⪰ I^{−1}(θ; n) ,
the inequality referring to the cone of positive semi-definite matrices. If the maxi-
mum likelihood estimate θ̂ in (3.2) is unique and “far enough” from the boundary
of Θ, then for n “large enough”, θ̂ is “approximately” normally distributed with
mean θ and covariance I −1 (θ; n) = n−1 I −1 (θ). This result is usually expressed by
saying that
    √n (θ̂ − θ) converges in distribution to N(0, I^{−1}(θ)) .
As θ̂ has asymptotically the best possible covariance for unbiased estimators, the
maximum likelihood estimator is said to be an efficient estimator. Note, however,
that the covariance matrix relative to a biased estimator could be smaller.
For asymptotic results in situations where p(x; θ) is not twice differentiable in
a neighborhood of θ, see Dupacova and Wets (1988); for asymptotic results in
situations where θ is on the boundary of Θ, see Shapiro (2000).
The function `(x; θ) = − log p(x; θ) is a loss function, called the negative log-likelihood
loss function. If the samples are truly drawn independently, the maximization of the log-
likelihood of the data in (3.2) is a surrogate program for the minimization of E{`(X; θ)},
where the expectation is taken with respect to the true data-generating density P X .
Hence, as the surrogate problem is ill-posed, it may be preferable to penalize the objective
when the number n of samples is small, for example (Tikhonov and Arsenin, 1977) by
adding a regularization term −½λ‖θ‖² with λ > 0 to the log-likelihood:

    θ̂ ∈ argmaxθ ∑_{k=1}^{n} log p(xk ; θ) − ½ n^{−1} λ‖θ‖² .     (3.4)
When the objective can be written as a sum ∑_{k=1}^{n} ρ(xk ; θ) for some function ρ, the corresponding estimates θ̂ are sometimes referred to as M-estimates (Maximum Likelihood Type Estimates) (Huber, 1964).
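As a toy numerical illustration of (3.2) and (3.4) (ours, not taken from the thesis), assume a unit-variance Gaussian family p(x; θ) = N(θ, 1) and a penalty −½λ‖θ‖²; both the maximum likelihood estimate and its regularized counterpart are then available in closed form.

import numpy as np

rng = np.random.default_rng(0)
n, theta_true, lam = 20, 2.0, 5.0
x = rng.normal(theta_true, 1.0, size=n)       # data set {x_1, ..., x_n}

theta_ml = x.mean()                           # maximum likelihood estimate (3.2)

# penalized estimate: argmax_theta  sum_k log p(x_k; theta) - 0.5 * lam * theta**2,
# which for the N(theta, 1) family has the closed form sum(x) / (n + lam)
theta_reg = x.sum() / (n + lam)

print(theta_ml, theta_reg)                    # theta_reg is shrunk toward 0

The regularized estimate is biased toward zero, which is exactly the trade-off discussed above when n is small.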
When the true density of the data set PD cannot be identified to pD|θ for some θ,
be it because the data-generating density does not belong to the parametric family of
densities, or because the samples are not drawn independently, the probability model is
said to be misspecified. This is the most common situation encountered in practice. Using
maximum likelihood type estimators with misspecified models does not necessarily lead
to inconsistent estimates (estimates with non-vanishing bias as the number of samples
grows to infinity): what can really harm consistency is rather to omit some of the relevant
variables for explaining the data, or to assume wrong constraints between the components
of x (White, 1982).
In the simplest Bayesian approach to explaining data, one assumes that the samples are
drawn independently, that the data-generating density belongs to a family of densities
parametrized by θ ∈ Rd , and in addition that the parameter θ has been drawn from a
fixed density pθ , called the prior. The conditional density of θ given the data, written
pθ|D , is called the posterior. It is computed according to the Bayes formula for conditional
distributions
    pθ|D(θ; x^n) = pD|θ(x^n ; θ) pθ(θ) / ∫_Θ pD|θ(x^n ; θ) pθ(θ) dθ     (3.5)
where Θ denotes the domain of θ, and dθ denotes the Lebesgue measure if θ is continu-
ous, or the counting measure if θ is countable. Note that the normalization of pθ|D by the integral makes it possible to (formally) use improper priors (Jeffrey, 1939), that is, "generalized" densities pθ such that ∫_Θ pθ(θ) dθ = +∞. In particular, using a uniform prior pθ(θ) = 1 amounts to identifying pθ|D with the likelihood pD|θ. On the other hand, for
certain families of distributions p(x; θ), there exists a special choice for the prior, referred
to as the conjugate prior, such that the prior and the posterior belong to the same family
(Raiffa and Schlaifer, 1961). This is convenient for evaluating pθ|D in closed-form, but
reduces the prior to a mere device for making tractable predictions.
Note that the frequentist approach also makes prior assumptions, for instance through
the choice of λ in (3.4), which is formally set to 0 in the maximum likelihood estimate
(3.2). The family p(x; θ) and the type of regularization are also often chosen to facilitate
the evaluation of the M -estimate.
If (3.5) cannot be evaluated in closed-form, the simplest approximation is Maximum A
Posteriori (MAP) estimation, which consists in approximating pθ|D by a distribution with
all the probability mass concentrated on the mode of pθ|D (with ties broken arbitrarily).
Maximum A Posteriori estimation with a uniform prior coincides with Maximum Likeli-
hood estimation. More advanced approximation techniques include asymptotic methods
such as Laplace’s method (MacKay, 2003, Chapter 27) which consists in replacing a
distribution by a Gaussian approximation, importance sampling, multiple quadrature,
and Markov Chain Monte Carlo methods (MCMC) (Metropolis et al., 1953; Neal, 1993,
2010), which essentially consist in approximating the integration over θ by accumulating evaluations at points θk generated by a random walk in the parameter space Θ. The relative merits of these methods are discussed in Evans and Swartz (1995) and in MacKay
(2003, Chapters 29 & 30). The methods that scale well with the dimension d are the
MCMC methods.
The Bayesian approach aims at taking into account (through the prior) the uncer-
tainty associated to the selection of a particular value θ̂ for making predictions after
having observed the data. The density of a new sample xn+1 , conditionally to the data
set, is given by a mixture of all members of the parametric family, obtained by averaging
all the members with importance weights given by pθ|D (Bayesian averaging):
    px|D(xn+1 ; x^n) = ∫_Θ p(xn+1 ; θ) pθ|D(θ; x^n) dθ .     (3.6)
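As an illustration (our own, not an example from the text), a Bernoulli family with a conjugate Beta prior makes the posterior (3.5) and the predictive density (3.6) available in closed form; the snippet contrasts the Bayesian predictive probability with the plug-in maximum likelihood prediction.

import numpy as np

x = np.array([1, 1, 0, 1, 0, 1, 1])    # Bernoulli data set
a, b = 1.0, 1.0                         # Beta(a, b) prior (uniform for a = b = 1)

# posterior (3.5) is Beta(a + #ones, b + #zeros) by conjugacy
a_post = a + x.sum()
b_post = b + len(x) - x.sum()

# predictive probability (3.6) that a new sample equals 1: the posterior mean of theta
p_bayes = a_post / (a_post + b_post)
p_ml = x.mean()                         # plug-in prediction with the ML estimate

print(p_bayes, p_ml)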
Remark 3.2. Under suitable conditions (Walker, 1969), one can show that if the true density PX is p(·; θ) for some θ in the interior of its domain Θ, then for n "large enough", the posterior distribution pθ|D given the data is "approximately" a normal distribution with mean θ̂ (maximum likelihood estimator on the data) and covariance matrix I^{−1}(θ; n) (inverse of the Fisher information matrix for n observations).
In a more advanced approach to explaining data, one assumes that the data-generating
density PX belongs to a space of densities described by a model structure M ∈ M with
model parameters θM ∈ ΘM . The dimension of θM can vary with M . One speaks of
nested models when there exists a complete ordering M1 , M2 , . . . of the models such that
all the densities representable by Mν are also representable by Mν+1 .
Models of different complexity (flexibility) coexist in the hypothesis space. Loosely
speaking, low complexity was originally associated to a small number of parameters for
describing a model (Rissanen, 1978), or to a greater smoothness of the model (Good and
Gaskins, 1971). It is convenient to view a model through the pair (M, s), where s is a
complexity parameter associated to M . For nested models Mν , we can assume that there
exists an increasing function that maps structure indices ν to complexity parameters s.
Model selection methods aim at identifying the model M that best explains the data,
often by adapting the complexity s of the selected model to the size n of the data set.
Note that the misspecification issue is completely irrelevant here inasmuch as one seeks
to explain learnable properties of the data (Vapnik, 1998): assumptions on a hypothetical
true distribution PX are a matter of pure convenience.
In finite mixture density estimation for instance, the cardinality of the finite support
of pθ|D determines the model structure and induces a model ordering, so that compet-
ing models can be ranked according to the log-likelihood of the data penalized by a
complexity parameter s (Li and Barron, 2000).
with pν|D interpreted as the importance weight of the model Mν , determined by updating
the prior pν using the observed data.
For models M identified by some continuous hyper-parameter α ∈ Rq , so that x fol-
lows f (x; α, θ), it is common to define a joint prior pα,θ = pα pθ|α on (α, θ) ∈ A × Θ. The
predictive distribution is then given by
    px|D(xn+1 ; x^n) = ∫_A ( ∫_{Θ(α)} p(xn+1 ; α, θ) pθ|α,D(θ; α, x^n) dθ ) pα|D(α; x^n) dα .     (3.8)
assume a predictive distribution of the form (3.8) with a MCMC approximation already
applied to the integral, that is,
    px|D(xn+1 ; x^n) = ∑_{ν=1}^{m} p(xn+1 ; αν , θν) wν ,     (3.9)
where
Each term in the sum represents the contribution of a weighted sample as if it were drawn from the joint density pα,θ|D = pθ|α,D pα|D in (3.8), duplicate samples being permitted.
Bagging.
Boosting.
In boosting methods (Schapire, 1990; Freund and Schapire, 1996; Schapire et al., 2002),
the sequence M1 , . . . , Mm is built by sampling models Mν as follows: the parameters of a
model Mν are particularized to a random resampling Dν of the elements in a data set D,
each element of D having a certain probability of being drawn, determined by assigning to
each element k of the data set D an importance weight that is relatively greater if the
element k is not well explained (or predicted) by the previous models. The m models are
then aggregated by weighted averaging (Littlestone and Warmuth, 1989), the weights
reflecting the respective quality of each model at explaining the data. The weighted
aggregation scheme depends on generalization bounds proper to the loss function chosen
for scoring the models.
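A minimal sketch of the reweighting idea is given below (an AdaBoost-flavoured loop written by us purely for illustration; it assumes binary labels in {−1, +1} and weak learners exposing fit/predict methods that accept sample weights, and it is not the exact scheme of any of the cited papers).

import numpy as np

def boost(X, y, make_weak_learner, m):
    """AdaBoost-flavoured sketch. y takes values in {-1, +1}; make_weak_learner()
    returns an object with fit(X, y, sample_weight=...) and predict(X)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # importance weights over the data set
    models, alphas = [], []
    for _ in range(m):
        h = make_weak_learner()
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)      # weight of the model in the vote
        w *= np.exp(-alpha * y * pred)             # emphasize badly explained samples
        w /= w.sum()
        models.append(h)
        alphas.append(alpha)
    # aggregated predictor: weighted vote of the m weak models
    return lambda Xnew: np.sign(sum(a * h.predict(Xnew) for a, h in zip(alphas, models)))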
Boosting has been shown to induce predictive models with excellent generalization capabilities starting from a family of models Mk whose predictions are only slightly better than random predictions once their parameters θk are adapted to the data (weak models). The reasons for the empirical success of boosting may still not be fully elucidated (Mease and Wyner, 2008). The aggregation schemes used for the online prediction
of (bounded) sequences X1 , X2 , . . . without assuming a probabilistic model PX (Cesa-
Bianchi and Lugosi, 1999), as advocated by Dawid (1984), are similar to the aggregation
schemes used in boosting (Cesa-Bianchi et al., 1997), and have been analyzed in terms of
their generalization ability in the context of online prediction (Cesa-Bianchi et al., 2004).
While bagging and boosting exploit perturbations of the data set based on the temporary
presence or absence of particular samples, other ensemble methods use perturbations of
the data set based on the temporary presence or absence of certain components of x
(features) (Dietterich, 2000; Breiman, 2001; Geurts et al., 2006). Research is still very
active in machine learning for finding beneficial ways to perturb data sets by further ran-
domizing the features, be it in the context of ensemble methods stricto sensu (Breiman,
2000), or in the context of kernel methods (Rahimi and Recht, 2008, 2009; Shi et al.,
2009).
Most stochastic programs of practical interest use unbounded objective functions. This
is in strong contrast with the usual assumptions made in machine learning and online se-
quence prediction. A large body of theoretical work based on empirical processes theory
(Pollard, 1990), large-deviation theory and concentration inequalities such as Hoeffding’s
inequality (Hoeffding, 1963), Azuma’s inequality (Azuma, 1967), McDiarmid’s inequal-
ity (McDiarmid, 1989), ultimately relies on a bounded range assumption for establishing
the generalization bounds that back the predictions realized by mixtures of experts or
boosting-type approaches (Koltchinskii and Panchenko, 2002; Audibert et al., 2007; Shiv-
aswamy and Jebara, 2010). Results and reasonings from those works are thus difficult
to adapt to the usual models of stochastic programming. Note that when one is willing to focus on bounded objective functions, theoretical investigations are possible (Nesterov and Vial, 2008).
We follow another path here, and investigate empirically the use of bagging methods
for estimating an optimal first-stage decision to a multistage stochastic program. We
consider perturbed scenario-tree approximations to multistage stochastic programs with
decisions valued in a nonconvex feasible set. The standard averaging rule is not imple-
mentable, calling for a more sophisticated aggregation strategy. The first-stage decision
plays the role of the parameter θ considered in Section 3.1.
In this section, we outline the principle of the proposed approach, and discuss the main
underlying assumptions. We start by describing the class of problems that we address
and then provide an overview of the main ingredients of the proposed solution approach,
namely, a procedure for generating an ensemble of scenario trees, an algorithm based
on the cross-entropy method for computing near-optimal first-stage decisions, and a
kernel-based method for aggregating the first-stage decisions derived from the ensemble
of scenario trees. Background material on kernel methods can be found in Appendix C.
    xt+1 = ft(xt , ut , wt) ,

    J*(x0) = maxµ E{ ∑_{t=0}^{T−1} rt(xt , ut , wt) | x0 } .     (3.10)
A candidate strategy µ for selecting the decisions ut at times 0 ≤ t < T is a sequence of time-indexed deterministic mappings µt from the current history ht = (w0 , w1 , . . . , wt−1) of the disturbance process to a decision ut = µt(ht) ∈ U.
(To compare this setup to the Markov Decision Process framework, one may assume temporarily that the disturbance process is observable. Then, the mappings from ht to ut are as expressive as mappings from states xt to actions ut, since the states xt are ultimately a function of ht: xt can be recovered from ht, given x0, u0, the decision rules µ1, . . . , µt−1, and the state transition functions f0, . . . , ft−1.)
No assumption is made about the dimensionality or the structure of the state space X.
The space U of possible actions, and the space W of possible disturbances, are assumed
to be made of a finite number of elements.
The notations xt , ut , wt , ft , rt , the assumption of a memoryless disturbance process,
the initial condition for t = 0 rather than t = 1, are meant to facilitate the connection
with the usual discrete-time optimal control framework (Bertsekas, 2005a). The memo-
ryless assumption may be relaxed, by simply requiring that the probabilities of all future
disturbance sequences are known by the decision maker. The temporal decomposition of
the performance criterion in Equation (3.10) is fundamental in an optimization procedure
based on dynamic programming, but is not essential in the present approach.
A complete scenario tree of depth T represents all the possible realizations of the process
w0 , w1 , . . . , wT −1 , together with their probabilities. In such a tree, the root node (depth 0)
corresponds to time t = 0 and to an empty process history. To each node n of depth
t ∈ {1, 2, . . . , T } in the tree corresponds a possible history hn = [w0 , . . . , wt−1 ]n of the
process, through the unique path from the root to the node n. The disturbance (wt−1)n is directly assigned to the node n together with its probability, while [w0 , . . . , wt−2]n and their joint probability can be collected from the disturbances and probabilities associated to the nodes in the path.
Any strategy µ can be represented on the complete tree by assigning to each node n of depth 0 ≤ t < T a fixed value un = µ(hn) ∈ U. Consequently, searching for an optimal
strategy is equivalent to jointly optimizing the values un assigned to the internal nodes
of the tree.
The performance criterion defined in (3.10) can be evaluated once decisions have
been assigned to the nodes. Indeed, given the value of x0 , u0 = µ0 and a particular w0 ,
one can evaluate r0 = r0 (x0 , u0 , w0 ) and x1 = f0 (x0 , u0 , w0 ) by the knowledge of rt
and ft at t = 0. The values r0 and x1 can thus be assigned to the node associated to
the disturbance process history [w0 ]. The probability P0,w (w0 ) is determined from the
disturbance process model. Given the nodal decision for u1 = µ1 (w0 ), and using x1
and a particular w1 , one gets the values of x2 and r1 for the corresponding particular
value of [w0 , w1 ]. The value r1 can be assigned to the node corresponding to [w0 , w1 ], to
which is also assigned a probability P0,w (w0 ) · P1,w (w1 ), since we assume that w0 , w1 are
independent. The propagation of nodal values is pursued until values are assigned to x T
and rT −1 . It can be carried out for each disturbance path in the tree. Therefore, for
a given decision strategy µ, all the rewards and probabilities entering the evaluation of
the expectation in (3.10) can be computed, given the system model ft , rt , Pt,w and the
initial state x0 .
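The propagation just described can be written as a recursion over disturbance histories; the sketch below (our illustration, assuming the model is supplied as per-stage functions f[t], r[t] and per-stage disturbance distributions P[t]) evaluates the expectation in (3.10) for a fixed strategy µ stored as a mapping from histories to decisions.

def expected_return(x0, mu, f, r, P, T):
    """Expectation in (3.10) for a fixed strategy mu on the complete tree.
    mu maps a disturbance history (tuple of past w's) to a decision; f[t] and r[t]
    are the stage-t model functions; P[t] is a dict {w: probability}."""
    def recurse(t, x, history):
        if t == T:
            return 0.0
        u = mu[history]                          # decision assigned to this node
        value = 0.0
        for w, p in P[t].items():                # branch on the stage-t disturbance
            value += p * (r[t](x, u, w) + recurse(t + 1, f[t](x, u, w), history + (w,)))
        return value
    return recurse(0, x0, ())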
Without any particular structure assumed for ft and rt , the optimization of the
policy µ may be done by a direct search of the decisions un assigned to the nodes of
the tree. However, the number of possible combinations is of the order of |U|^{|W|^{T−1}}, meaning that as soon as the cardinalities |U|, |W|, or the time horizon T are large, an
exact optimization is intractable.
[Figure: outcomes (w1, p1), . . . , (w4, p4), with the probability mass p2 split into two halves.]
For each v ∈ V , let C −1 (v) denote the subset of elements in W \ V that have v as a
nearest neighbor:
C −1 (v) = {w ∈ W \ V : w ∈ C(v)} .
Consider an incomplete scenario tree with N nodes numbered from 1 (root, depth 0)
to N (last leaf, depth T ). We assume that the leaf nodes (depth T ) are numbered
from N − L + 1 to N , where L is the number of leaf nodes or equivalently, the number
of scenarios. Let w n , xn , un , r n , denote respectively the disturbance wt−1 , state xt ,
decision ut , reward rt−1 assigned to node n, where t corresponds to the depth of the node,
and where xt , ut , rt−1 are conditioned on the disturbance process history [w0 , . . . , wt−1 ]
induced by the path from the root to the node n. The root node has no disturbance and
reward assigned to it. The leaf nodes (N − L + 1 ≤ n ≤ N ) have no decision un assigned
to them. Let pn be the probability mass assigned to node n (the probabilities pn of the
nodes of depth t sum up to 1). For n > 1, let n− denote the index of the parent node
of node n. Let fn− and rn− denote the functions ft and rt for t equal to the depth of the
node n− . The problem (3.10) formulated on the incomplete scenario tree becomes
    maximize    ∑_{n=2}^{N} pn rn
    subject to  x1 = x0
                xn = fn−(xn−, un−, wn) ,   2 ≤ n ≤ N
                rn = rn−(xn−, un−, wn) ,   2 ≤ n ≤ N
• The procedure that generates a candidate solution u by sampling from a parametric density g(·; θ) over the space of solutions:
    u ∼ g(·; θ) .     (3.13)
• The procedure that computes the value of F for a candidate solution u, where
F (u) will be interpreted as the score assigned to u by F :
    u ↦ F(u) .     (3.14)
The method works as follows. Starting with the value of θ that corresponds to the uniform
distribution over the space of solutions, one draws NCE samples u1 , . . . , uNCE from the
density g(·; θ), scores them using the scoring function F , and tags as elite solutions the
samples with a score greater than or equal to the ⌈ρNCE⌉-th best score, written γ̂. The
parameter ρ is set to a small positive value, typically 0.01 (a value for which the elite
solutions correspond to the best percentile of the empirical distribution of the score).
The parameter θ is then updated so as to decrease the CE distance of g(·; θ) with respect
to the empirical density induced by the elite solutions. The update rule proposed by
Rubinstein and Kroese (2004, Equation 4.8) is
    θ ← θ̂   where   θ̂ ∈ argmaxθ ∑_{k: F(uk) ≥ γ̂} log g(uk ; θ) .
Thus in fact θ̂ is just the maximum likelihood estimate of the sampling density g(u; θ)
given the data set of elite samples (the parameter update maximizes the probability of
generating the elite samples).
After the parameter update step, a new set of NCE samples is redrawn from g(·; θ).
The parameter update/resampling procedure is repeated until the density g(·; θ) concen-
trates on a single solution, or until the elite scores have ceased to improve. The best
candidate solution with respect to F (at any iteration) is then returned. The authors of
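A compact version of the whole iteration might look as follows (our own sketch, for maximization over binary vectors with an independent-Bernoulli sampling density g(·; θ); the scoring function F is a placeholder for the problem-specific score).

import numpy as np

def cross_entropy_maximize(F, dim, n_ce=200, rho=0.01, n_iter=50, seed=0):
    """CE sketch: maximize F over {0,1}^dim with an independent-Bernoulli density
    g(.; theta); theta starts at the uniform distribution over solutions."""
    rng = np.random.default_rng(seed)
    theta = np.full(dim, 0.5)
    best_u, best_score = None, -np.inf
    for _ in range(n_iter):
        U = (rng.random((n_ce, dim)) < theta).astype(int)   # draw N_CE candidates
        scores = np.array([F(u) for u in U])
        gamma = np.quantile(scores, 1.0 - rho)              # elite threshold gamma_hat
        elite = U[scores >= gamma]
        theta = elite.mean(axis=0)       # ML update of theta on the elite samples
        k = int(np.argmax(scores))
        if scores[k] > best_score:
            best_score, best_u = float(scores[k]), U[k].copy()
        if np.all((theta < 1e-3) | (theta > 1 - 1e-3)):     # density has concentrated
            break
    return best_u, best_score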
The squared distance between the centroid and some first-stage decision u 0 ∈ U is given
by
The squared distance from some decision uν0 ∈ S0 to the centroid uc0 may be expressed in terms of the elements Kij = ⟨ϕ(ui0), ϕ(uj0)⟩ of the Gram matrix K ∈ R^{m×m} by

    ‖ϕ(uν0) − ϕ(uc0)‖² = Kνν − 2 m^{−1} ∑_{i=1}^{m} Kiν + m^{−2} ∑_{i=1}^{m} ∑_{j=1}^{m} Kij .     (3.16)
with ties broken arbitrarily, may thus also be computed without the need to refer to the
feature map ϕ once the Gram matrix is given. Therefore, the explicit computation of the
centroid in the feature space, which would require the explicit knowledge of the feature
map, is not actually needed for evaluating the aggregated solution.
Note that the empirical variance of the ensemble of decisions in the feature space
induced by the kernel kU , defined by
    var{S0} = m^{−1} ∑_{ν=1}^{m} ‖ϕ(uν0) − ϕ(uc0)‖² = m^{−1} ∑_{i=1}^{m} Kii − m^{−2} ∑_{i=1}^{m} ∑_{j=1}^{m} Kij ,

could also be evaluated even if the feature map is specified only implicitly by the definition of the kernel, and could quantify the discrepancy between candidate decisions in S0.
Discussion.
First consider the situation where the decision space U only possesses a handful of el-
ements. Thanks to the small cardinality of U , we may expect that optimal first-stage
decisions are present among the elements of set S0 . Therefore, a simple majority vote
among the elements of S0 can be taken as the estimate ua of an optimal first-stage de-
cision. Note that the majority vote can be obtained from the general formulation (3.17)
based on kernels by setting Kij = δ{ui0 = uj0 }, where δ{·} denotes the 0-1 indicator
function of the event placed in argument. Indeed, as Kνν = 1, the squared distances ‖ϕ(uν0) − ϕ(uc0)‖², 1 ≤ ν ≤ m, will only differ by the term −2m^{−1} ∑_{i=1}^{m} δ{ui0 = uν0}, which is proportional to the frequency of uν0 in S0.
Now consider the situation where U is finite but has a cardinality |U | much larger
than |S0 | = m. It is then very likely that a clear majority will not be attained in S0 ,
especially if there are many quasi-equivalent decisions in terms of optimality. However, in
many situations, U is formed from the combination of several elementary decisions. One
could thus combine kernels on the elementary decision spaces, for instance by combining
separate majority votes on the elementary decisions.
The kernelization of the decision space enables one to incorporate prior knowledge on
the structure of the decision space. Therefore, kernels should be consistent with prior
beliefs about the decisions that have similar effects on the problem at hand.
Fig. 3.2: Example of configuration for the sensor network problem, with eight sensors and two targets (•) (figure taken from Dutech et al. (2005)).
In this section, we illustrate the proposed approach on a test problem that has
a large, structured, discrete action space. We explain in detail how the action space is
kernelized, how the incomplete scenario trees are generated, and how the corresponding
optimization problems are solved approximately. We assess the quality of the first-stage
decision estimators û0 = ua0 obtained with the proposed approach by a direct comparison
with the optimal strategy, which can be computed exactly in this test problem by dynamic
programming (by evaluating recursively the tabular representation of the expected costs-
to-go (Q-functions), where the tabular representation of a Q-function has an entry for
each combination of state-action pairs).
The test problem is part of a series of standard benchmark problems proposed for compar-
ing Reinforcement Learning solution approaches (Dutech et al., 2005). The test problem
is inspired by a distributed control application from Ali et al. (2005) and named Sensor
Network. Note that among all the problems selected by Dutech et al. (2005), Sensor
Network is the only problem with a relatively large discrete decision space.
The problem can be described as follows. Eight sensors bracket an array of three
cells, as shown on Figure 3.3.1. Each cell is surrounded by 4 sensors. Two targets float over the cells. Each cell is occupied by at most one target.
At time step t, each sensor can be focussed on the cell to its left, on the cell to its
right, or be idle. The decision ut sets the action of the 8 sensors. The decision space
is thus a joint action space U = {0, 1, 2}8 that encodes the 3 possible actions of the
8 sensors (0: idle, 1: focus left, 2: focus right), totalling 38 = 6561 possible actions. A
unit cost is incurred for each focussed sensor; idle sensors have no cost.
The game consists in eliminating the two floating targets from the cells as quickly as
possible. Each target starts at energy level 3. After the sensors have been set according to ut, the targets move. The leftmost target randomly moves to the left (L), to the right (R), or stays idle (I). A priori the 3 possibilities are equiprobable, but a move is cancelled if
the cell where the target intends to go is already occupied or does not exist. After the
move performed by the leftmost target, the rightmost target randomly moves according
to the same rules.
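The move-cancellation rule can be made explicit with a small sketch (our own illustration; cells are indexed 0 to 2 and only the move dynamics are shown, not the elimination of targets).

import random

MOVES = {'L': -1, 'R': +1, 'I': 0}

def move_targets(left_pos, right_pos, w):
    """Apply the intended moves w = (w_left, w_right), each in {'L','R','I'}.
    A move is cancelled if the destination cell is outside 0..2 or is already
    occupied by the other target; the leftmost target moves first."""
    dest = left_pos + MOVES[w[0]]
    if 0 <= dest <= 2 and dest != right_pos:
        left_pos = dest
    dest = right_pos + MOVES[w[1]]
    if 0 <= dest <= 2 and dest != left_pos:
        right_pos = dest
    return left_pos, right_pos

# intended moves: each of the 3 x 3 combinations has probability 1/9
w = (random.choice('LRI'), random.choice('LRI'))
print(move_targets(0, 2, w))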
The intended moves of the targets are viewed as the disturbances in the problem.
The disturbance space is W = {L, R, I} × {L, R, I}, with each of the 32 = 9 possible com-
binations having probability 1/9. The effective moves may differ inasmuch as intended
The total return is the discounted cumulated reward ∑_{t=0}^{T} γ^t rt with γ = 0.95 and T = 10. The problem is the maximization of the expected total return, starting from some given state, over stochastic programming decision rules µt : W^{t−1} → U with
values µt (w0 , . . . , wt−1 ) = ut . We concentrate on the estimation of an optimal first-stage
decision u0 , given x0 .
The general approach described in the preceding subsections is adapted to the problem
at hand as follows.
• The sampling distribution for candidate solutions u (Section 3.2.3) is first decom-
posed into N − L independent components, each component being relative to one
internal node of the scenario tree (assuming that the tree has N nodes, including
L leaf nodes of depth T ). The N − L components are themselves decomposed into
8 independent parts corresponding to the 8 sensors. Each part defines the distribu-
tion over {0, 1, 2} of the action of a sensor j at a node i, written aij . A distribution
for aij is described by the two scalar parameters pij = P{aij = 0}, qij = P{aij = 1},
with pij , qij ∈ [0, 1] and 0 ≤ 1 − pij − qij = P{aij = 2} ≤ 1. A uniform distri-
bution over all possible strategies on the incomplete tree is obtained by setting
pij = qij = 1/3 for all i, j, whereas any particular deterministic solution u can be
obtained by selecting for each pair (i, j) one of the three configurations {pij = 1,
qij = 0}, {pij = 0, qij = 1}, or {pij = 0, qij = 0}. The distribution for generating
a random solution u associated to the incomplete scenario tree is thus specified by
2 · 8 · (N − L) parameters.
Once the elite samples uk (Section 3.2.3) have been scored by computing the ex-
pected discounted sum of rewards on the incomplete tree with the nodal decisions
set to uk , the parameters of the generating distribution for the solutions are up-
dated as follows: given ` elite samples, with akij denoting the action aij from the
elite sample uk , 1 ≤ k ≤ `, one first computes the empirical frequencies of the
elementary actions in the elite samples,
    p̂ij = ℓ^{−1} ∑_{k=1}^{ℓ} δ{akij = 0} ,
    q̂ij = ℓ^{−1} ∑_{k=1}^{ℓ} δ{akij = 1} ,
and then one updates the parameters pij , qij of the solution generating distribution
by
• The aggregation scheme exploits the decomposition of the decision space into sep-
arate sensor actions. Recalling that the root node has the node index i = 1, let
aν1j denote the first-stage action of sensor j from the ν-th solution in the set S 0 ,
1 ≤ ν ≤ m. The action of sensor j in the centroid decision uc0 (Section 3.2.4) is
determined by a majority vote over the action of sensor j in the first-stage decisions
and then the aggregated decision ua0 is set to the element uν0 ∈ S0 sharing the most actions with the centroid decision uc0, that is,

    ν = min{ argmaxk ∑_{j=1}^{8} δ{ak1j = ac1j} } .
A similar effect can be obtained by defining the kernel kU (Section 3.2.4) between
two elements uν0 , uσ0 of S0 as
    kU(uν0 , uσ0) = ∑_{j=1}^{8} δ{aν1j = aσ1j} = Kνσ = Kσν
and setting
    ua0 = uν′0   with   ν′ = min{ argmaxν ∑_{i=1}^{m} Kiν } ,
following (3.17) with Kνν constant and ties broken by the lexicographical order
on ν.
Typical outcomes with an ensemble of m = 5 incomplete trees are reported in Table 3.1.
Three problems corresponding to the 3 initial configurations x0 of the targets with 3 en-
ergy points that float over the 3 cells are considered. Decisions uν0 are represented
graphically. For instance, the symbol -/\- /\/\
indicates that 3 sensors are focussed on
the leftmost cell, no sensor is focussed on the middle cell, 3 sensors are focussed on the
rightmost cell, and the remaining 2 sensors are idle (-). If the targets move onto the leftmost or the rightmost cell, they will be hit, so the combined action of the sensors is effective. It would be suboptimal to have 1, 2 or 4 sensors focussed on the same cell.
The table shows that the structure of optimal decisions can be destroyed in the centroid
decision uc0 . In fact, there are several configurations of the sensors that can lead to the
same effective targeting of two cells, but these equivalent configurations are made inef-
fective when they are averaged. The aggregated decision ua0 reaches a consensus while
preserving the structure of effective configurations.
It turns out that for the first and third problems, the aggregated decisions ua0 are optimal, in the sense that the targeted cells are optimal, according to an exact dynamic programming solution where the decision space is reduced a priori to 6 sensible choices of targeted cells instead of considering the full set of combined actions of the sensors. For the second problem (x0 = [3 0 3]), the decision ua0 shown in the table is slightly suboptimal: if subsequently optimal decisions are selected, then ua0 brings an expected return of 27.78 instead of the optimal return 27.88.
We repeated 10 times the experiment of building an ensemble of 5 trees and computing
the aggregated decision. An optimal decision was found: 7 times for x0 = [3 3 0], 9 times
3.4 Conclusions
This chapter has investigated empirically the estimation of an optimal first-stage de-
cision to a finite-horizon optimal control problem by scenario tree techniques. While
we recognize that the solution techniques used in this work might be of limited inter-
est in practice, given that the studied problem class would be more naturally addressed
from a Markov Decision Process perspective, we believe that the statistical framework in
which the proposed tree-bagging solution technique was presented clarifies the connec-
tion between statistical estimation/prediction methods and sequential decision making
by stochastic programming.
It is interesting to realize, in particular, that stochastic programming models seldom take into account the intrinsic limitation that only finite-sample approximations can
be solved. Usual stochastic programming models are thus close in spirit to maximum
likelihood estimation models used on finite data without regularization. Certainly, the
appropriate ways to apply regularization to sequential decision making are not clear
at this stage, and would call for further research. For instance, we observed (independently of the material presented in this chapter) that the early stopping of the
progressive hedging algorithm (Rockafellar and Wets, 1991) (see also Remark 2.1), where
decisions are optimized on separate scenarios (with a penalization of the difference with
the decisions at the previous iteration) and then averaged if they are relative to a same
information state, could provide a kind of regularization without even modifying the
formulation of the model. With early stopping however, the objective being optimized
is no longer totally explicit, the solution has a dependence on the initial conditions, and
therefore the solution algorithm would require a careful tuning of its parameters on the
problem at hand.
Unfortunately, we are still lacking at this stage efficient methods for testing the real
value of any solution procedure, be it regularized or not. As the right amount of regularization (weight of the regularization term in the objective, early stopping criterion, . . . ) is usually selected in the light of the results obtained by simulating the model on an independent validation sample or by cross-validation methods (Stone, 1974; Efron and Tibshirani, 1993), it would be vacuous to discuss regularization further if we were ultimately unable to estimate the true quality of the regularized solution. The development
of efficient validation methods is the subject of the next chapter.
Chapter 4
4.1 Motivation
on scenario tree based methods. Intensive testing is needed, because for obtaining perfor-
mance estimators that are statistically significant, it is important to test a decision policy
on a sufficiently large number of independent scenarios. Testing decisions a posteriori by
the shrinking-horizon approach (Section 2.4.4) is not a viable option, given the internal
use of additional scenario trees by this approach, and given the overall computational
complexity of the procedure. With respect to this motivation, machine learning offers a
multitude of ways of extracting policies that are easy to test in an automatic way on a
large number of independent samples.
The second motivation has to do with the intrinsic nature of the finite scenario-tree
approximation for multistage stochastic programming. The variance in the quality of the
optimal decisions that may be inferred from finite approximations suggests that those
problems are essentially ill-posed in the same sense as the inverse problems addressed
in machine learning are also ill-posed: small perturbations in the values of a finite data
set — finite number of scenarios in stochastic programming; finite number of input-
output pairs in machine learning — lead to perturbations of empirical expectations, and
ultimately lead to large variations (instability) of the quantities of interest — first-stage
decisions in stochastic programming; parameters of classifiers or regressors in machine
learning — that are being optimized on the basis of empirical estimates.
This analogy suggests that regularization techniques and principles from statistical
learning theory (Vapnik, 1998), such as the structural risk minimization principle, could
help to extract solutions from scenario-tree approximations in a sound way from the
theoretical point of view, and in an efficient way from the practical point of view.
The main ideas developed in the following subsections can be summarized as follows:
we propose an approach that (i) makes it possible to test small scenario trees quickly and reliably, (ii) is likely to offer better ways of exploiting individual scenario-tree approximations, and (iii) in the end, allows us to revisit the initial question (Section 2.4.2) of generating,
solving, ranking and exploiting tractable scenario trees for solving complex multistage
decision making problems.
We start from the following observation: estimators of the quality of a scenario-tree ap-
proximation that are computationally cheap to evaluate can be constructed by resorting
to supervised learning techniques.
The basic principle consists in inferring a suboptimal decision policy by first learning
a sequence of decision predictors π̂1 , . . . π̂T from a data set of examples of information
state/decision pairs. The examples of information states are extracted from the nodes of
the scenario tree; they correspond to the partial scenario histories (ξ1^k, . . . , ξt−1^k) in the
tree. Later in the chapter (Section 4.4.3), we will see that the information states can also
be represented differently, for instance by features or by states in the sense of dynamic
programming. The examples of decisions are also extracted from the nodes of the tree:
they correspond to the decisions ukt optimized on the tree.
When a decision predictor π̂t is applied on a new scenario ξ (or more exactly, on
the observed part (ξ1 , . . . , ξt−1 ) of the scenario ξ), it outputs a predicted decision that
cannot be assumed to satisfy the exact feasibility constraints ut ∈ Ut (ξ) relative to the
new scenario, if we want to define a framework that allows the use of existing standard
supervised learning algorithms for building the decision predictors. Therefore, to obtain
feasible decisions, we assume that the predicted decision can then be corrected in an
ad-hoc fashion using a computationally cheap feasibility-restoration procedure, that we
call repair procedure in the sequel and denote by Mt . The idea of using repair procedures
is also suggested in Küchler and Vigerske (2010) as a means of restoring the feasibility
of decisions extracted from a tree and applied to test scenarios.
We now formalize these ideas to describe how a learned decision policy can be used
to assess (validate), in a certain sense, a given scenario-tree approximation.
where ξ = (ξ1 , . . . , ξT ) is a random process, and where the optimization is over the
decision policy π = (π1 , . . . , πT ). We assume that ξt has its outcomes in some space Ξt ,
say Rd for simplicity, and that πt is valued in some space Ut (of which Ut (ξ) is a subset),
say Rm . We recall that π is non-anticipative if π1 is a constant-valued function, π2 is
a function of ξ1 , and more generally πt is a function of ξ1 , . . . , ξt−1 . Therefore, one can
define πt either as a mapping from Ξ1 × · · · × ΞT = RT d to Rm restricted to the class of
non-anticipative mappings, or as a mapping from Ξ1 × · · · × Ξt−1 = R(t−1)d to Rm .
Given an approximation of P on a scenario tree having n scenarios ξk of probability pk,

    P′ :  minimize  ∑_{k=1}^{n} pk f(ξk, uk)
          subject to  ut^k ∈ Ut(ξk)  ∀ k ;
                      u1^k = u1^j  ∀ k, j ;
                      ut^k = ut^j  whenever  (ξ1^k, . . . , ξt−1^k) ≡ (ξ1^j, . . . , ξt−1^j) ,
let {ūk}1≤k≤n denote an optimal solution to P′, where each ūk = (ū1^k, . . . , ūT^k) corresponds to the sequence of decisions associated to ξk. We define a decision predictor π̂t as a mapping from inputs Xt = (ξ1, . . . , ξt−1) ∈ R^{(t−1)d} to outputs Yt = ut ∈ R^m, learned from a data set Dt = {(Xt^k, Yt^k)}1≤k≤n of input-output pairs, obtained by collecting from the scenario tree the observed parts of the scenarios and their associated optimized decisions:

    Xt^k = (ξ1^k, . . . , ξt−1^k) ,
    Yt^k = ūt^k .
Note that the duplicate samples (Xtk , Ytk ) ≡ (Xtj , Ytj ) induced by the non-anticipativity
conditions (the branching structure of the scenario tree) may be removed from the learn-
ing set Dt . In particular, D1 is reduced to a single learning sample Y1 = ū1 , leading to
a trivial learning problem and to the decision predictor π̂1 ≡ ū1 .
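In code, the construction of the data sets Dt and the fitting of the decision predictors might look as follows (a sketch under the assumption that the tree solution is available as aligned lists of scenarios and decision sequences, and that make_regressor() returns any estimator with a fit/predict interface, e.g. a scikit-learn regressor).

import numpy as np

def learn_decision_predictors(scenarios, decisions, T, make_regressor):
    """Build D_t = {(X_t^k, Y_t^k)} from the tree solution and fit one predictor per
    stage t = 2..T. scenarios[k] = (xi_1^k, ..., xi_T^k); decisions[k] = (u_1^k, ..., u_T^k)."""
    predictors = {1: lambda history: decisions[0][0]}    # pi_1 is the constant u1_bar
    for t in range(2, T + 1):
        X, Y, seen = [], [], set()
        for xi, u in zip(scenarios, decisions):
            key = tuple(np.ravel(xi[:t - 1]))            # observed part (xi_1, ..., xi_{t-1})
            if key in seen:                              # drop duplicates induced by the
                continue                                 # branching structure of the tree
            seen.add(key)
            X.append(np.ravel(xi[:t - 1]))
            Y.append(u[t - 1])
        reg = make_regressor()
        reg.fit(np.array(X), np.array(Y))
        predictors[t] = lambda history, reg=reg: reg.predict(
            np.ravel(history).reshape(1, -1))[0]
    return predictors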
By construction of P 0 , and by the fact that U1 is constant-set-valued, the first-stage
decision is feasible: π̂1 (ξ) = ū1 ∈ U1 (ξ). For the subsequent decisions, the supervised
learning procedure cannot in general guarantee that π̂t (ξ) ∈ Ut (ξ) for all scenarios in the
learning set and for all new scenarios ξ. Therefore, we repair the predictions to restore the
feasibility of the decisions. The nature of the repair procedure varies with the feasibility
constraints that have to be enforced. The realizations of the random quantities on which
Ut (ξ) depend are passed in arguments of the repair procedure, and the procedure is then
applied online on each new scenario and predicted decisions.
An example of repair procedure is the projection of a predicted decision on the feasibil-
ity set. Later in the thesis, we resort to simple problem-dependent heuristics for restoring
feasibility (Section 5.3.4). Formally, we define as an admissible repair procedure for Ut
any mapping
such that the range of Mt is always contained in the feasible set Ut (ξ), assuming that
u1 , . . . , ut−1 are in the corresponding feasibility sets U1 (ξ), . . . , Ut−1 (ξ), and that Ut (ξ) is
nonempty.
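As a concrete example (ours; the feasibility structure below is an assumption for illustration, not one used later in the thesis), a repair procedure for a feasible set combining nonnegativity, upper bounds and a budget coupling the stages could be the simple clipping-and-rescaling heuristic below.

import numpy as np

def repair_box_budget(u_pred, used_budget, upper, budget):
    """Admissible repair M_t for a feasible set of the (assumed) form
    {u : 0 <= u <= upper, sum(u) <= budget - used_budget}, where used_budget is
    the amount already committed by the past decisions u_1, ..., u_{t-1}."""
    u = np.clip(np.asarray(u_pred, dtype=float), 0.0, upper)   # project onto the box
    remaining = max(budget - used_budget, 0.0)
    total = float(np.sum(u))
    if total > remaining and total > 0:
        u *= remaining / total                                 # scale down to the budget
    return u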
A learned (feasible) policy is made of the association of the decision predictors and
the repair procedures.
We can exploit a learned policy for computing an estimate of the quality of a scenario
tree, or a bound on the exact value of the original multistage program P. The procedure
can be described as follows.
i. Generate a scenario tree using a tree building algorithm A. Solve the resulting program P′, extract from its solution the first-stage decision ū1, and the data sets Dt of scenario/decision pairs.
ii. Learn the decision predictors π̂t from the data set Dt for t = 2, . . . , T .
iv. For each scenario ξj of the test sample, set u1^j = ū1 and compute sequentially the recourse decisions u2^j, . . . , uT^j. Each decision ut^j is obtained by first evaluating π̂t(ξ1^j, . . . , ξt−1^j) and then restoring feasibility by the repair procedure Mt.
v. Estimate the performance of the learned decision policy on the test sample by forming the empirical average VTS(A) = (1/n′) ∑_{j=1}^{n′} f(ξj, uj), where the sum runs over the indices relative to the scenarios in the test sample and their associated decision sequences uj = (u1^j, . . . , uT^j).
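The simulation and estimation steps then amount to the loop sketched below (our illustration; sample_scenario, the objective f, the learned predictors and the repair procedures are the problem-specific components introduced above).

import numpy as np

def evaluate_learned_policy(predictors, repairs, u1_bar, sample_scenario, f, T, n_test, rng):
    """Simulate the learned policy on n_test independent scenarios and return the
    empirical average V_TS(A), a pessimistic bound up to its standard error."""
    values = []
    for _ in range(n_test):
        xi = sample_scenario(rng)                   # xi = (xi_1, ..., xi_T)
        u = [u1_bar]                                # the first-stage decision is fixed
        for t in range(2, T + 1):
            u_pred = predictors[t](xi[:t - 1])      # predict from the observed history
            u.append(repairs[t](u_pred, xi, u))     # restore feasibility online (M_t)
        values.append(f(xi, u))
    return float(np.mean(values))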
The estimator VTS (A) computed in this way reflects the joint quality of the scenario
tree, the learned predictors and the repair procedures.
The estimator VTS (A) is obtained by simulating an explicit policy that generates
feasible decisions, and thus always provides a pessimistic bound (upper bound for min-
imization, lower bound for maximization) on the performance of the best policy that
could be inferred from the considered scenario tree, up to the standard error of the test
sample estimator. The pessimistic bound is also a reliable bound on the achievable per-
formance of a decision policy for the true problem, up to the standard error of the test
sample estimator.
Note that in theory, a learned policy is not necessarily worse than a shrinking-horizon
policy using the same first-stage decision ū1 , since the supervised learning step could
actually improve the quality of the recourse decisions uj2 , . . . , ujT .
The pessimistic bound can be made tighter by testing various policies obtained from
the same scenario tree, but with different learning algorithms and/or repair procedures.
The best combination of algorithms and learning parameters could then be retained.
Note, however, that due to the optimistic bias induced by the selection of the best bound
on the test sample of size n0 , the value of the best policy should be evaluated again by
simulation on a new independent test sample of size n00 .
It is also possible to exploit estimators relative to policies learned from different
scenario trees but computed on the same test sample of size n0 . We may even expect
that scenario tree variants can be ranked reliably based on the value of these estimators,
despite the variance of the estimator due to the randomness in the generation of the test
sample, and despite a new source of bias due to the use of suboptimal recourse decisions
obtained from the learned policies. These ideas will be further developed in Section 4.3.
Note also that the input space of a learned policy is a simple matter of convenience.
As long as the policy remains non-anticipative, the input space can be described differ-
ently, typically by letting past decisions, state variables, and additional features derived from the information state appear explicitly, which might facilitate the generalization of the decisions in the data sets, or later on, the online evaluation of the learned decision predictors. These ideas are illustrated in Section 4.4.3.
To simplify the exposition in the sequel, we will assume that all the considered al-
gorithms for learning policies use the same repair procedures Mt , and differ only by the
choice of the hypothesis space Ht for π̂t (space of functions considered by the supervised
learning algorithm). It is convenient to denote the possible hypothesis spaces by Htλ,
where λ belongs to some index set Λ. For instance, λ could represent the weight of a
regularization term used in the supervised learning algorithm. For simplicity, we assume
that Λ has a finite cardinality |Λ|.
correspond to a non-anticipative feasible decision policy π̄ = (π̄1 , . . . , π̄T) for the original program P.

Algorithm 4.1.
1. For each λ ∈ Λ,
   learn the decision predictors π̂tλ given the data sets Dt,
   using the hypothesis spaces Htλ, t = 1, . . . , T.
2. For each λ ∈ Λ,
   evaluate the performance of the policy π̄λ obtained by combining π̂tλ with Mt.
   Let vλ be that performance evaluated on the common test sample of size n′.
The computational complexity of exploiting π̄t on new scenarios depends on the com-
plexity of evaluating π̂t and Mt for all t.
The mappings π̂t should ideally be the best mappings from the best hypothesis spaces
one could consider, but in practice they correspond to the mappings identified by a given
supervised learning algorithm on the basis of the data sets Dt . We find it useful to
consider a series of policies in this section, because there is some leeway in the choice
of the supervised learning algorithm and/or its parameters, that can be exploited in the
search for ideal mappings.
In the usual supervised learning framework, one generally selects a model by evalu-
ating its performance on a fraction of the data set kept apart for testing purpose. In
the present setup, it is preferable to evaluate models by directly simulating the learned
policy π̄ on a common test sample of new scenarios (Algorithm 4.1).
If Algorithm 4.1 is merely run to select a best learned policy, a single test sample
of size n0 on which the policies are compared suffices. If in addition an unbiased upper
bound on the exact value of P is sought, an additional independent test sample of size n″ is required on which the best policy should be simulated again.
In practice, the selection bias may be very small if n0 is large enough with respect to the
considered hypothesis spaces. Therefore, in some numerical experiments, we sometimes
allow ourselves to report directly the estimates obtained on the first test sample of size n′.
To discuss the complexity of Algorithm 4.1, let us introduce the following quantities.
• cE (t): expected running time of the combined computation of π̂t and Mt on a new
scenario.
We assume that cL (1) = cE (1) = 0 since the first decision ū1 is fixed and simply
extracted from a solution to P 0 . For t ≥ 2, note that cL (t) and cE (t) usually grow with
the dimension of the random variables ξt , the dimension of the decisions ut , and the
cardinality of the data sets Dt . The ratio between cL (t) and cE (t) depends largely on
the type of supervised learning algorithm and the type of repair procedure for M t . We
neglect the time for computing f (ξ, u) given ξ and u.
The following proposition is a straightforward consequence of the definition of Algo-
rithm 4.1:
starting from data sets obtained in expected time cA + cS. The optional step of Algorithm 4.1 adds to the expected time a term n″ ∑_{t=2}^{T} cE(t).
If Algorithm 4.1 is run on N parallel processes, one can essentially replace |Λ| in
Proposition 4.1 by |Λ|/N , and n00 by n00 /N .
The complexity of Algorithm 4.1 can be compared to the complexity of the usual
shrinking-horizon validation approach (Section 2.4.4). To this end, we extend our nota-
tions as follows.
• P(t) denotes the program for the minimization of the objective over the remaining
stages t, t + 1, . . . , T . The program P(1) is the original program P. Given real-
izations for ξ1 , . . . , ξt−1 and the corresponding implemented decisions ū1 , . . . , ūt−1 ,
one can obtain P(t) by replacing in P the random variables ξ1 , . . . , ξt−1 by their
outcomes, conditioning the distribution of ξt , . . . , ξT accordingly, and introducing
the constraints π1 (ξ) = ū1 , . . . , πt−1 (ξ) = ūt−1 .
• cA (t) denotes the expected time for forming the approximation P′(t) to P(t) using
algorithm A(t) for building a scenario tree over the shrunk horizon.
If the tree building algorithm A(t) is based on a pure Monte Carlo sampling, cA (t) should
be relatively small, and approximately proportional to the size of the scenario tree. If
A(t) is based on a deterministic method and the dimension of the random process is
larger than, say, 1 or 2, cA (t) may actually be quite large, even for t near the horizon T . The
time cS (t) for solving P′(t) can also be quite large, except perhaps for t = T or t close to T .
In Section 2.4.4, we had denoted all the algorithms A(t) simply by A, and written
VTS (A) for the estimate produced by the shrinking-horizon validation approach. In
the following proposition, we assume that the shrinking-horizon approach is run on the
independent test sample of size n'' used for reevaluating the best policy selected by
Algorithm 4.1.
In this section, we mention variants of the validation approaches discussed above, mo-
tivated by the complexity estimates of Propositions 4.1 and 4.2. These variants are
interesting to consider, but their implementation has been left as future work.
We begin by observing that it is possible to combine the two preceding validation approaches (supervised learning of policies and shrinking-horizon optimization) by combining learned policies at stages 2, . . . , t0 with a shrinking-horizon decision making procedure
for t = t0 + 1, . . . , T . This hybrid approach would run in expected time
$$\sum_{t=2}^{t_0} |\Lambda| \cdot [c_L(t) + n' c_E(t)] \;+\; \sum_{t=t_0+1}^{T} n' \cdot [c_A(t) + c_S(t)]\,,$$
starting from data obtained in expected time cA + cS . We would also add to the expected
time the term $n'' \cdot \bigl[\sum_{t=2}^{t_0} c_E(t) + \sum_{t=t_0+1}^{T} [c_A(t) + c_S(t)]\bigr]$, relative to the reevaluation of
the selected hybrid policy on the test sample of size n''.
The stage t0 could be chosen to minimize the expected running time,
namely (neglecting the reevaluation term)
$$t_0 \in \arg\min_{1 \le t_0 \le T}\; \sum_{t=2}^{t_0} |\Lambda| \cdot [c_L(t) + n' c_E(t)] + \sum_{t=t_0+1}^{T} n' \cdot [c_A(t) + c_S(t)]\,,$$
but a complication with the optimal choice of t0 is the possible dependence of the standard
error of the estimates on nondeterministic algorithms A(t).
Another possible variant is to carry out the selection of the models for π̂ 1 , . . . , π̂T
sequentially, that is, stage by stage. To describe this variant, we extend our notations as
follows.
• Given t < T and a policy π̄† = (π̄1 , . . . , π̄t ) specified only from stage 1 to stage t,
the notation P+ (t; π̄†) refers to the original program P over a policy π, subject to
the additional constraints π1 (ξ) = π̄1 (ξ), . . . , πt (ξ) = π̄t (ξ). Thus, the program
P+ (t; π̄†) is the original problem P, except that π̄1 , . . . , π̄t are already specified.
• P′+ (t; π̄†) denotes the scenario-tree approximation to P+ (t; π̄†) built by some algorithm A(t). The algorithm A(t) must always return the same tree, while the
trees relative to A(1), . . . , A(T − 1) must all be different. Thus, P′+ (t; π̄†) is the
approximate program P′ posed over a new scenario tree specific to t, and subject
to the additional constraints u1^k = π̄1 (ξ^k ), . . . , ut^k = π̄t (ξ^k ) for all k.
Algorithm 4.2 describes how decision predictors are learned from data sets that incorporate the effect of the decision rules already selected for the previous stages, and
left unspecified for the subsequent stages. At each stage t, there is also a selection step
among possible decision predictors indexed by λt ∈ Λt . The selection then proceeds stage by stage as follows.
2. For each λt ∈ Λt ,
learn the predictor π̂t^λt given the data set Dt , using the hypothesis space Ht^λt ;
build the policy π^λt = (π̄1 , . . . , π̄t−1 , πt^λt ), where πt^λt combines π̂t^λt and Mt .
If t = T , go to Step 4.
3. For each λt ∈ Λt ,
form and solve the problem P′+ (t; π^λt ); let v^λt denote its optimal value.
Set νt ∈ argmin_{λt ∈ Λt} v^λt and set π̄t = πt^νt .
Form the data set Dt+1 relative to ut+1 from the solution to P′+ (t; π^νt ).
Set t to t + 1 and go to Step 2.
4. For each λT ∈ ΛT ,
evaluate the performance of π^λT on the common test sample of size n';
let v^λT be that performance.
5. Set νT ∈ argmin_{λT ∈ ΛT} v^λT and set π̄T = πT^νT . Return π̄ = (π̄1 , . . . , π̄T ).
Indeed, the advantage of Algorithm 4.2 over Algorithm 4.1 is that the learning problem
for a decision predictor for stage t + 1 takes into account the learned decisions rules
π̄1 , . . . , π̄t . As the learned decision rules introduce a loss of optimality and modify the
information states that can be reached at stage t + 1, other recourse decisions at stages
t + 1, . . . , T are preferable, and in fact, the ideal recourse decisions are those that would
be obtained by solving the problem P+ (t; π̄†) with π̄† suitably defined (see Step 2 of
Algorithm 4.2, where π^λt plays the role of π̄†). We cannot solve P+ (t; π̄†), but we
can exploit a scenario-tree approximation P′+ (t; π̄†) from which a data set Dt+1 for
learning π̂t+1 can be constructed (Step 3 of Algorithm 4.2).
From the statistical point of view, the main drawback of Algorithm 4.2 is that it
cannot evaluate on the test sample of size n' a policy that is not specified on the full
horizon. Instead, Step 3 in Algorithm 4.2 performs a weak form of model selection by
scoring the incomplete policies of Step 2 with the optimal value of the programs P′+ (t; π̄†).
The programs P′+ (t; π̄†) use a common scenario tree independent of the trees relative to
π̄1 , . . . , π̄t−1 , so as to reduce the selection bias. The selection is weak in the sense that
the score of an incompletely specified policy π̄† is not a reliable estimate of the optimal
value of the exact program P+ (t; π̄†). An unbiased upper bound on the exact value of P
can be obtained by the optional Step 6 of Algorithm 4.2.
From the complexity point of view, the main drawback of Algorithm 4.2 is that the
programs P′+ (t; π̄†) must be solved for each λt ∈ Λt , t = 2, . . . , T . Another concern is the
new source of variance of the test sample estimates coming from the use of several scenario
trees, which could force us to use larger test samples.
4.3 Monte Carlo Selection of Scenario Trees
We now sketch a workable and generic scheme for obtaining approximate solutions to
a multistage stochastic program with performance guarantees, and for selecting good
scenario-tree approximations to the multistage stochastic program. The scheme builds
on the validation procedure described in Section 4.2.1 (Algorithm 4.1), which infers a
decision policy from examples of scenarios and decisions collected from a scenario-tree
approximation, and also computes an accurate estimate of the value of the learned policy
by Monte Carlo simulation.
A first idea simply consists in perturbing the data sets Dt of scenario/decisions pairs
used by the supervised learning procedure, by obtaining these data sets from different
scenario-tree approximations. This source of variation creates new opportunities for
finding better policies by supervised learning.
A second idea consists in identifying good scenario trees, on the basis of the perfor-
mance of the policies that can be learned from the data sets Dt collected from those
trees. This approach allows us to study empirically the algorithms that construct the scenario-tree approximations, and to tune or modify these algorithms so as to improve the solution
procedure in terms of solution accuracy or in terms of computational complexity.
In this section, we describe the scheme that allows us to identify good scenario trees. The
scheme consists in generating a possibly large set of randomized scenario-tree approxi-
mations P 0 for a given problem P, ranking them according to the estimated value of the
best policy learned from them, and identifying in this way a presumably best scenario
tree among the considered sample of trees. The best policy of the best scenario tree is
then viewed as the best solution for P found by the method, and its value can be assessed
by Monte Carlo simulation on an independent test sample. Algorithm 4.3 describes each
step of the procedure; its final step reads:
4. Set µ ∈ argmin_{1≤ν≤M} v^ν . Return the scenario tree indexed by µ and the policy π̄ = π^µ .
Having enough diversity in the considered scenario-tree approximations multiplies our
chance of obtaining good data sets, from which good policies can be learned. Therefore,
it is interesting to assume that the generated scenario trees have a random branching
structure — a novelty with respect to the usual practice of multistage stochastic program-
ming. In our presentation of Algorithm 4.3, we formally decompose a tree generation
algorithm A into 2 components: AS for generating a random branching structure, and
AV for sampling realizations ξ k of the random process ξ according to the fixed branching
structure and for assigning probabilities to the nodes of the tree. Existing tree generation
methods from the stochastic programming literature, briefly discussed in Section 2.4.2,
correspond to a particular choice of AV .
Developing algorithms AS able to generate rich but tractable branching structures,
for low-dimensional processes or for high-dimensional processes, valid for short horizons
and long horizons, is an interesting open problem. We have investigated several variants
for AS in the context of a concrete family of problems (Section 4.4.2), without how-
ever providing general-purpose algorithms for generating random branching structures
adapted to high-dimensional random processes.
Algorithm 4.3 uses in theory 3 independent test samples of size n', n'', n''': one for
selecting a best hypothesis space for the best learned policy from the data sets relative
to a given tree; one for selecting the best tree; and one for estimating the performance of
the overall best policy. In practice, we do not always reevaluate our estimates on distinct test samples.
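To fix ideas, a minimal Python sketch of such a selection loop is given below. It only illustrates the structure of the scheme (our actual implementation relies on Matlab/cvx); the helper functions sample_structure, assign_values, solve_tree, learn_policy and simulate_policy, as well as the problem object and its sample_scenarios method, are hypothetical placeholders for the components AS and AV, the tree-program solver, the supervised learning step, and the policy simulator.

```python
import numpy as np

def select_scenario_tree(problem, M, n_test, rng,
                         sample_structure, assign_values,
                         solve_tree, learn_policy, simulate_policy):
    """Monte Carlo selection of a scenario tree, in the spirit of Algorithm 4.3.

    Hypothetical problem-specific callables:
      sample_structure(rng)          -> random branching structure (role of AS)
      assign_values(structure, rng)  -> scenario tree, values + probabilities (role of AV)
      solve_tree(problem, tree)      -> data sets D_2, ..., D_T of scenario/decision pairs
      learn_policy(datasets, rng)    -> callable policy: scenario -> decisions
      simulate_policy(problem, policy, scenarios) -> per-scenario costs (lower is better)
    """
    test_scenarios = problem.sample_scenarios(n_test, rng)  # common test sample
    best = (np.inf, None, None)
    for nu in range(M):
        structure = sample_structure(rng)
        tree = assign_values(structure, rng)
        datasets = solve_tree(problem, tree)
        policy = learn_policy(datasets, rng)
        score = simulate_policy(problem, policy, test_scenarios).mean()
        if score < best[0]:
            best = (score, tree, policy)
    # The returned score is optimistically biased by the selection; an unbiased
    # estimate requires re-simulating the winner on a fresh, independent sample.
    return best
```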
4.3.2 Discussion
The generic procedure presented in this section is based on various open ingredients that
may be exploited for the design of a wide class of algorithms in a flexible way. Namely, the
main ingredients are (i) the scenario tree sampling scheme, (ii) the (possibly regularized)
optimization technique used to obtain data sets from a scenario tree, (iii) the supervised
learning algorithm used to obtain the decision strategies from the data sets, (iv) the
repair procedure used to restore the feasibility of the decisions on new scenarios.
The main ideas of the proposed scheme are evaluated in the case study section on
a family of problems proposed by other authors. We illustrate how one may adjust the
scenario tree generation algorithm and the policy learning algorithm to one’s needs, and
by doing so we also illustrate the flexibility of the proposed approach and the potential
of the combination of scenario-tree based decision making with supervised learning. In
particular, the efficiency of supervised learning strategies makes it possible to rank large
numbers of policies inferred from large numbers of randomly generated scenario trees.
Although we do not illustrate this in the present work, we would like also to stress
that the scenario tree sampling scheme may be coupled in various other ways with the
inference of policies by machine learning. For example, one could seek to use sequential
Monte Carlo techniques inspired by the importance sampling literature, in order to
progressively guide the scenario tree sampling and machine learning methods towards
regions of high interest, given the quality of the policies inferred from scenario trees
at previous iterations. Also, instead of using the data set obtained from each scenario
tree to extract a policy, one could use data sets collecting data from several scenario-tree approximations to extract a single policy, in the spirit of the wide range of model
averaging techniques developed in machine learning.
4.4 Case Study
We will show the interest of the approximate solution techniques presented in the chapter
by applying them to a family of multistage stochastic programs. Implementation choices
difficult to discuss in general terms, such as choices concerning the supervised learning
of a policy for the recourse decisions, and the choices for the random generation of the
trees, will be illustrated on a concrete case.
The section starts with the formulation of a multistage stochastic program that various
researchers have presented as difficult for scenario tree methods (Hilli and Pennanen,
2008; Koivu and Pennanen, 2010; Küchler and Vigerske, 2010). Several instances of
the problem will be addressed, including instances on horizons considered as almost
unmanageable by scenario tree methods.
We consider a multistage problem adapted from Hilli and Pennanen (2008), interpreted in
that paper as the valuation of an electricity swing option. In this chapter, we interpret
the problem rather as the search for risk-aware strategies for distributing the sales of
a commodity over T stages in a flexible way adapted to market prices. A risk-aware
objective is very interesting for our purposes, but it is difficult to justify it in a context
of option valuation. The formulation of the problem is as follows:
$$\begin{aligned}
\text{minimize}\quad & \rho^{-1} \log \mathbb{E}\Bigl\{\exp\Bigl\{-\rho \sum_{t=1}^{T} \xi_{t-1} \cdot \pi_t(\xi)\Bigr\}\Bigr\} \\
\text{subject to}\quad & 0 \le \pi_t(\xi) \le 1 \ \text{ and } \ \sum_{t=1}^{T} \pi_t(\xi) \le Q\,, \qquad (4.1) \\
& \pi \ \text{non-anticipative.}
\end{aligned}$$
The objective uses the exponential utility function, with risk aversion coefficient ρ.
Such objectives are discussed at the end of the chapter.
In our formulation of the problem, there is no constant first-stage decision to optimize.
We begin directly by the observation of ξ0 , followed by a recourse decision u1 = π1 (ξ0 ).
Observations and decisions are intertwined so that in general ut = πt (ξ0 , . . . , ξt−1 ). The
random variable ξt−1 is the unitary profit (ξt−1 > 0) or loss (ξt−1 < 0) that can re-
sult from the sale of the commodity at time t. Potential profits and losses fluctuate
in time, depending on market conditions (we later select a random process model for
market prices to complete the problem specification). The commodity is sold in quantity
ut = πt (ξ0 , . . . , ξt−1 ) at time t, meaning that the quantity ut can depend on past and
current prices. The decision is made under the knowledge of the potential profit or loss
at time t, given by ξt−1 · ut , but under uncertainty about future prices. This is, incidentally,
why scenario tree techniques must be used with great care on this problem when the
planning horizon is long: as soon as the scenarios cease to have branchings, there is no
more residual uncertainty on future prices, and the optimization process wrongly iden-
tifies opportunities anticipatively. Those spurious future opportunities may significantly
degrade the quality of previous decisions.
We seek strategies where the sales per stage are bounded (constraint 0 ≤ π t (ξ) ≤ 1).
The constraint can model a bottleneck in the production process. Notice also that
bounded sales are consistent with the model assumption of an exogenous random process:
very large sales are more likely to influence the market prices on long planning horizons.
The scalar Q bounds the total sales (we assume Q ≥ 1). It represents the initial stock of
commodity, the sale of which must be distributed optimally over the horizon T .
When the risk aversion coefficient ρ tends to 0, the problem reduces to the search for
a risk-neutral strategy. This case has been studied by Küchler and Vigerske (2010). It
admits a linear programming formulation:
$$\begin{aligned}
\text{minimize}\quad & -\mathbb{E}\Bigl\{\sum_{t=1}^{T} \xi_{t-1} \cdot \pi_t(\xi)\Bigr\} \\
\text{subject to}\quad & 0 \le \pi_t(\xi) \le 1 \ \text{ and } \ \sum_{t=1}^{T} \pi_t(\xi) \le Q\,, \qquad (4.2) \\
& \pi \ \text{non-anticipative.}
\end{aligned}$$
• In a first series of experiments, we will take the numerical parameters and the
process ξ selected in Hilli and Pennanen (2008) (to ease the comparisons): ρ = 1,
T = 4, Q = 2; ξt = (exp{bt } − K) where K = 1 is the fixed cost (or the strike
price, when the problem is interpreted as the valuation of an option) and bt is a
random walk: b0 = σε0 , bt = bt−1 + σεt , with σ = √0.2 and εt following a standard
normal distribution N (0, 1).
Noting that a priori $b_t = \sigma \sum_{t'=0}^{t} \epsilon_{t'}$ is normally distributed with mean 0 and
variance (t + 1)σ², we record, for future reference, that the first process ξ is such
that
$$\frac{1}{\sigma\sqrt{t}}\, \log(\xi_{t-1} + K) \ \text{ is a priori distributed as } \ N(0, 1)\,, \qquad (4.4)$$
where σ = √0.2 and K = 1.
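For concreteness, the following short numpy sketch simulates this first process and checks (4.4) empirically; the function name and the sample size are arbitrary choices made for illustration only.

```python
import numpy as np

def simulate_price_process(T, n_scenarios, sigma=np.sqrt(0.2), K=1.0, seed=0):
    """Simulate xi_0, ..., xi_{T-1} with xi_t = exp(b_t) - K and b_t a Gaussian random walk."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_scenarios, T))   # innovations eps_0, ..., eps_{T-1}
    b = sigma * np.cumsum(eps, axis=1)            # b_t = sigma * (eps_0 + ... + eps_t)
    return np.exp(b) - K                          # xi_t = exp(b_t) - K

xi = simulate_price_process(T=4, n_scenarios=100000)
# Empirical check of (4.4): log(xi_{t-1} + K) / (sigma * sqrt(t)) should be ~ N(0, 1).
t = 3
z = np.log(xi[:, t - 1] + 1.0) / (np.sqrt(0.2) * np.sqrt(t))
print(z.mean(), z.std())   # approximately 0 and 1
```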
At the heart of the tree selection procedure lies our ability to generate scenario trees reduced
to a very small number of scenarios, with interesting branching structures. As the trees
are small, they can be solved quickly and then scored using the supervised learning policy
inference procedure. Fast testing procedures make it possible to rank large numbers of
random trees.
The generation of random branching structures has not been explored in the classical
stochastic programming literature; we thus have to propose a first family of algorithms
in this section. The algorithms are developed with our needs in view, with the feedback
provided by the final numerical results of the tests, until results on the whole set of con-
sidered numerical instances suggest that a particular algorithm suffices for the application
at hand. We believe that the main ideas behind the algorithms will be reused in subse-
quent work for addressing the representation of stochastic processes of higher dimensions.
Therefore, in the following explanations we put more emphasis on the methodology we
followed than on the final resulting algorithms.
Method of Investigation.
the programs associated to these trees, simply by computing the identified features.
We now describe the branching process used in the first series of experiments, made
with deterministic node values. Let r ∈ [0, 1] denote a fixed probability of creating a
branching. We start by creating the root node of the tree (depth 0), to which we assign
the conditional probability 1. With probability r, we create 2 successor nodes to which we
assign the values ±0.6745 and the conditional probabilities 0.5 (see Remark 4.1 below).
With probability (1 − r) we create instead a single successor node to which we assign
the value 0 and the conditional probability 1; this node is a degenerate approximation
of the distribution of εt . Then we take each node of depth 1 as a new root and repeat
the process of creating 1 or 2 successor nodes to these new roots randomly. The process
is further repeated on the nodes of depth 2, . . . , T − 1, yielding a tree of depth T for
representing the original process ε0 , . . . , εT −1 . The scenario tree for ξ is derived from the
scenario tree for ε.
Remark 4.1 (Wasserstein distance). The discrete distribution that assigns proba-
bilities 0.5 on the values ±0.6745 is the discrete distribution with support of car-
dinality 2 that has the smallest Wasserstein distance l1 to the normal distribution
N (0, 1) followed by εt . The Wasserstein distance l1 may be defined as follows. Let
X, Y be random variables following marginal distributions G and H respectively.
Assume that G and H are such that X and Y have finite first moments. Let P de-
note the collection of coupling measures between X and Y , that is, the collection of
probability measures P such that X follows G, and Y follows H. Then the Wasser-
stein distance l1 between G and H is defined as l1 (G, H) = inf P∈P {E{|X − Y |}}.
It admits a dual representation l1 (G||H) = supf ∈F1 {E{f (X)} − E{f (Y )}}, where
F1 = {f : R → R : |f (x) − f (y)| ≤ |x − y|} denotes the class of functions with
Lipschitz constant 1. It can be shown that the distribution with values y k and
probabilities pk , 1 ≤ k ≤ N , closest in the l1 sense to a density g (with respect to
the Lebesgue measure) can be computed as follows: Set y 0 = −∞, y N +1 = +∞
and minimize over y 1 < y 2 < · · · < y N the sum
N Z
X (y k +y k+1 )/2 Z (y k +y k+1 )/2
k k
|x − y |g(x)dx , and then set p = g(x)dx .
k=1 (y k−1 +y k )/2 (y k−1 +y k )/2
For the case N = 2 and g the density of N (0, 1), one can use a symmetry argument
and then evaluate $\arg\min_{y} \int_{0}^{\infty} |x - y|\,(2\pi)^{-1/2} \exp\{-x^2/2\}\, dx \simeq 0.6745$.
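The quoted value can also be checked numerically; it coincides with the 0.75-quantile of N (0, 1). A small scipy sketch, for illustration only:

```python
import numpy as np
from scipy import integrate, optimize, stats

def l1_loss(y):
    # Expected absolute deviation between y and the part of N(0,1) lying in (0, inf).
    f = lambda x: np.abs(x - y) * stats.norm.pdf(x)
    val, _ = integrate.quad(f, 0.0, np.inf)
    return val

y_star = optimize.minimize_scalar(l1_loss, bounds=(0.0, 3.0), method="bounded").x
print(y_star)                   # ~ 0.6745
print(stats.norm.ppf(0.75))     # closed form: the 0.75-quantile of N(0,1), ~ 0.6745
```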
For problems on larger horizons, it is difficult to keep the size of the tree under
control with a single fixed branching parameter r — the number of scenarios would have
a large variance. Therefore, in the second series of experiments (made with random node
values), we used a slightly more complicated branching process, by letting the branching
probability r depend on the number of scenarios currently developed (Algorithm 4.4).
Specifically, let N be a target number of scenarios and T a target depth for the scenario
tree, with the realizations of ξt relative to depth t + 1. Let nt be the number of parent
nodes at depth t. Note that nt is a random variable, except at the root where n0 = 1.
During the construction of the tree, parent nodes at depth t < T are developed and split
into 2 child nodes with a probability $r_t = n_t^{-1}(N - 1)/T$. Parent nodes have a single
child node with a probability 1 − rt . If rt > 1, we set rt = 1 and all nodes are split into 2
child nodes. Thus in general $r_t = \min\{1,\ n_t^{-1}(N - 1)/T\}$. Note that the truncation
of rt to 1 has no effect on Algorithm 4.4 and has thus been omitted.
Algorithm 4.4 Branching structure generation for the second series of experiments
Input: A targeted number of scenarios N ≥ 1, and a tree depth T ≥ 1.
Output: A random branching structure for a scenario tree having n ≃ N scenarios.
Algorithm 4.4 produces branching structures having approximately N scenarios in
the following sense. Assume that the number nT −1 of existing nodes at depth T − 1 is
large. By the independence of the random decision of creating 1 or 2 successor nodes,
and by a concentration of measure argument, the number of nodes created at depth T is
approximately equal to its expectation $n_{T-1}(1 + r_{T-1}) = n_{T-1} + (N-1)/T$ (when $r_{T-1} < 1$). Iterating this argument over the depths shows that the number of scenarios is approximately $n_T \simeq n_0 + T \cdot (N-1)/T = N$.
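As an illustration, here is a minimal Python sketch consistent with this description; the data structure returned (a list of child counts per depth) is an arbitrary choice made for the sketch.

```python
import numpy as np

def random_branching_structure(N, T, seed=0):
    """Generate a random branching structure with approximately N scenarios and depth T.

    Returns a list of length T; element t is an array giving, for each parent node
    at depth t, its number of children (1 or 2) at depth t + 1.
    """
    rng = np.random.default_rng(seed)
    structure = []
    n_t = 1                                     # number of parent nodes at the current depth
    for t in range(T):
        r_t = min(1.0, (N - 1) / (T * n_t))     # splitting probability at depth t
        children = 1 + (rng.random(n_t) < r_t).astype(int)
        structure.append(children)
        n_t = int(children.sum())               # nodes at depth t + 1
    return structure

structure = random_branching_structure(N=20, T=12)
print(int(structure[-1].sum()))                 # number of scenarios, close to N on average
```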
Solving a program on a scenario tree yields a data set of scenario/decision sequence pairs
(ξ, u). To infer a decision policy that generalizes the decisions of the tree to test scenarios,
we have to learn mappings from (ξ0 , . . . , ξt−1 ) to ut and ensure the compliance of the
decisions with the constraints. To some extent the procedure is thus problem-specific.
Here again we insist on the methodology.
Dimensionality Reduction.
We first try to represent the information state (ξ0 , . . . , ξt−1 ) by a smaller number of
variables, because the representation (ξ0 , . . . , ξt−1 ) risks becoming very cumbersome as
t grows. In particular, we can try to get back to a state-action space representation of
the policy (and postprocess data sets accordingly to recover the states). Note that in
general, the states we need are those that would be used by a hypothetical reformulation
of the optimization problem using dynamic programming. Here the objective is based
on the exponential utility function. By the property that
$$\mathbb{E}\Bigl\{\exp\Bigl\{-\sum_{t'=1}^{T} \xi_{t'-1} \cdot u_{t'}\Bigr\} \,\Big|\, \xi_0, \dots, \xi_{t-1}\Bigr\}
= \exp\Bigl\{-\sum_{t'=1}^{t-1} \xi_{t'-1} \cdot u_{t'}\Bigr\}\; \mathbb{E}\Bigl\{\exp\Bigl\{-\sum_{t'=t}^{T} \xi_{t'-1} \cdot u_{t'}\Bigr\} \,\Big|\, \xi_0, \dots, \xi_{t-1}\Bigr\}\,,$$
the term that remains to be optimized at time t depends on the past only through the current price information ξt−1 (the random walk bt−1 being Markovian) and through the remaining stock $\zeta_t = Q - \sum_{t'=1}^{t-1} u_{t'}$, which enters the residual constraints; the pair (ξt−1 , ζt ) can therefore serve as a reduced information state.
We try to map the output space in such a way that the predictions learned under the
new geometry and then transformed back using the inverse mapping comply with the
feasibility constraints. Here, we scale the output ut so as to have to learn the fraction
yt = yt (ξt−1 , ζt ) ∈ [0, 1] of the maximal allowed output min(1, ζt ). Indeed, note that
0 ≤ ut ≤ min(1, ζt ) summarizes the constraints of the problem at time t, namely the
constraint 0 ≤ πt (ξ) ≤ 1 in (4.1) and the constraint (4.6). Since ζ0 = Q is fixed
(with Q greater than 1 by assumption), we distinguish the cases u1 = y1 (ξ0 ) · 1 and
ut = yt (ξt−1 , ζt ) · min(1, ζt ). It will be easy to ensure that fractions yt predicted by the
learned models are valued in [0, 1] (thus we actually do not need to define an a posteriori
repair procedure).
Input Normalization.
It is convenient for the sequel to normalize the inputs. From the definition of ξt−1 we can
recover the state of the random walk bt−1 , and use as first input xt1 := (σ²t)^{−1/2} bt−1 ,
which follows a standard normal distribution. Thus for the first version of the process ξ,
recalling (4.4), instead of ξt−1 we use xt1 = σ^{−1} t^{−1/2} log(ξt−1 + K). For the second
version of the process ξ, recalling (4.5), instead of ξt−1 we use xt1 = σ^{−1} t^{−1/2} log(ξt−1 + K) + σ t^{1/2}/2. Instead of the second input ζt (for t > 1) we use xt2 := ζt /Q, which is
valued in [0, 1]. We will also rewrite the fraction yt = yt (ξt−1 , ζt ) as yt = gt (xt1 , xt2 ) to
stress the change of input variables.
Fig. 4.1: Neural network model with L = 3 hidden neurons for the component gt of the policy (4.7)
to be learned from data. The figure is a graphical representation of (4.8). Training
the neural networks consists in finding, for each t, values for the parameters vtjk , βtj ,
wtj , γt that best explain examples of pairs (xt , yt ).
$$x_{t1} = \begin{cases} \sigma^{-1} t^{-1/2} \log(\xi_{t-1} + K) & \text{for the process } \xi \text{ of (4.4)} \\ \sigma^{-1} t^{-1/2} \log(\xi_{t-1} + K) + \sigma t^{1/2}/2 & \text{for the process } \xi \text{ of (4.5)} \end{cases}$$
$$x_{t2} = \zeta_t / Q = 1 - Q^{-1} \sum_{t'=1}^{t-1} u_{t'} \qquad\qquad (4.7)$$
$$y_t = g_t(x_{t1}, x_{t2})$$
$$u_t = y_t \cdot \min\{1, \zeta_t\} = y_t \cdot \min\Bigl\{1,\ Q - \sum_{t'=1}^{t-1} u_{t'}\Bigr\}\,,$$
with π non-anticipative and feasible if and only if gt is always valued in [0, 1].
Hypothesis Space.
We have to choose the hypothesis space for the functions gt in (4.7). In the present
situation, we find it convenient to choose the class of feed-forward neural networks with
one hidden layer of L neurons (Figure 4.1):
$$g_t(x_{t1}, x_{t2}) = \mathrm{logsig}\Bigl(\gamma_t + \sum_{j=1}^{L} w_{tj}\, \mathrm{tansig}\bigl(\beta_{tj} + \sum_{k=1}^{2} v_{tjk}\, x_{tk}\bigr)\Bigr)\,, \qquad (4.8)$$
with weights vtjk and wtj , biases βtj and γt , and activation functions tansig(z) = 2/(1 + exp(−2z)) − 1 and logsig(z) = 1/(1 + exp(−z)),
a usual choice for imposing the output ranges [−1, +1] and [0, 1] respectively.
Since the training sets are extremely small, we take L = 2 for g1 (which has only one
input x11 ) and L = 3 for gt (t > 1).
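For illustration, a small numpy sketch of the parameterization (4.7)–(4.8) is given below; the weights passed through params are placeholders (here, random values), whereas in our experiments they result from the training procedure described next, and the input normalization shown is the one of the first process (4.4).

```python
import numpy as np

def tansig(z):
    return 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0     # values in (-1, 1)

def logsig(z):
    return 1.0 / (1.0 + np.exp(-z))                  # values in (0, 1)

def g(x, V, beta, w, gamma):
    """One-hidden-layer network (4.8): x has 1 or 2 inputs, V has shape (L, dim(x))."""
    return logsig(gamma + w @ tansig(beta + V @ x))

def simulate_policy(xi, params, Q=2.0, sigma=np.sqrt(0.2), K=1.0):
    """Apply the policy (4.7) on one scenario xi = (xi_0, ..., xi_{T-1}).

    params[t-1] holds the placeholder parameters (V, beta, w, gamma) of g_t.
    """
    T = len(xi)
    u, zeta = np.zeros(T), Q
    for t in range(1, T + 1):
        x1 = np.log(xi[t - 1] + K) / (sigma * np.sqrt(t))   # normalized price input
        x = np.array([x1]) if t == 1 else np.array([x1, zeta / Q])
        y = g(x, *params[t - 1])                             # fraction in [0, 1]
        u[t - 1] = y * min(1.0, zeta)
        zeta -= u[t - 1]                                     # remaining stock
    return u

# Placeholder parameters: L = 2 for g_1 (one input), L = 3 for g_t, t > 1 (two inputs).
rng = np.random.default_rng(0)
T = 4
params = [(rng.normal(size=(2, 1)), rng.normal(size=2), rng.normal(size=2), 0.0)] \
       + [(rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=3), 0.0) for _ in range(T - 1)]
xi = np.exp(np.sqrt(0.2) * np.cumsum(rng.standard_normal(T))) - 1.0
print(simulate_policy(xi, params))
```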
We recall that artificial neural networks have been found to be well-adapted to nonlin-
ear regression. Standard implementations of neural networks (data structure construction
and training algorithms) are widely available (Demuth and Beale, 1993). We report here
the parameters chosen in our experiments for the sake of completeness; the method is
largely off-the-shelf.
The weights and biases are determined by training the neural networks. We used the
Neural Network toolbox of Matlab with the default methods for training the networks
by backpropagation — the Nguyen-Widrow method for initializing the weights and bi-
ases of the networks randomly, the mean square error loss function, and the Levenberg-
Marquardt optimization algorithm. We used [−3, 3] for the estimated range of x t1 , cor-
responding to 3 standard deviations, and [0, 1] for the estimated range of x t2 .
Trained neural networks are dependent on the initial weights and biases before train-
ing, because the loss minimization problem is nonconvex. Therefore, we repeat the
training 5 times from different random initializations. We obtain several candidate poli-
cies (to be ranked on the test sample). In our experiments on the problem with T = 4,
we randomize the initial weights and biases of each network independently. In our exper-
iments on problems with T > 4, we randomize the initial weights and biases of g 1 (x11 )
and g2 (x21 , x22 ), but then we use the optimized weights and biases of gt−1 as the initial
weights and biases for the training of gt . Such a warm-start strategy accelerates the
learning tasks. Our intuition was that for optimal control problems, the decision rules
πt would change rather slowly with t, at least for stages far from the terminal horizon.
We do not claim that using neural networks is the only or the best way of building
models gt that generalize well and are fast in exploitation mode. The choice of the Matlab
implementation for the neural networks could also be criticized. It just turns out that
these choices are satisfactory in terms of implementation efforts, reliability of the codes,
solution quality, and overall running time.
An option of the proposed testing framework that we have not discussed, as it is linked
to technical aspects of numerical optimization, is that we can form the data sets of sce-
nario/decisions pairs using inexact solutions to the optimization programs associated to
the trees. Indeed, simulating a policy based on any data set will still give a pessimistic
bound on the optimal solution of the targeted problem. The tree selection procedure will
implicitly take this new source of approximation into account. In fact, every approxima-
tion one can think of for solving the programs could be tested on the problem at hand
and thus ultimately accepted or rejected, on the basis of the performance of the policy
on the test sample, and the time taken by the solver to generate the decisions of the
data set. In the present setting, we judged that solving multiple instances of large-scale
nonlinear programs would be too slow with cvx, and preferred to use a large-scale linear
programming approximation of the initial objective.
Here, we present an approximation used for the problems with ρ > 0 on horizons larger
than T = 4, that turned out to perform satisfactorily on that family of problems. We
approximated the function exp{z} in the objective by a convex piecewise linear ap-
proximation, expL {z} := max_{j∈{0,1,...,J−1}} {cj · z + dj }, with cJ−1 = dJ−1 = 0, and
with cj , dj ∈ R chosen such that expL {zj } = exp{zj } on a sequence of anchor points zj .
The resulting linear program reads:
$$\begin{aligned}
\text{minimize}\quad & \mathbb{E}\{v(\xi)\} \\
\text{subject to}\quad & v(\xi) \ \ge\ c_j \cdot \Bigl[-\rho \sum_{t=1}^{T} \xi_{t-1} \cdot \pi_t(\xi)\Bigr] + d_j \qquad \text{for } j = 0, \dots, J-2\,, \\
& v(\xi) \ \ge\ 0 \quad \text{(case } j = J-1\text{)}, \\
& 0 \le \pi_t(\xi) \le 1 \ \text{ and } \ \sum_{t=1}^{T} \pi_t(\xi) \le Q\,, \\
& \pi \ \text{non-anticipative.}
\end{aligned}$$
The anchor points zj may be chosen as follows. It is easy to see that at optimality
we should always have πt (ξ) = 0 if ξt−1 < 0. This means that the arguments
$z = -\rho \sum_{t=1}^{T} \xi_{t-1} \cdot \pi_t(\xi)$ of the exponential function will always be nonpositive at optimality.
Thus we may set z0 = 0: the exponential function will be approximated by the linear
function c0 · z + d0 for z > 0 during the optimization process, without loss of precision. On
the other hand, in a finite-dimensional approximation, the support of the approximation
to the distribution of ξt has a maximal value, say ξM . The minimal value of the argument
of the exponential is thus greater than or equal to z̄ = −ρ · ξM · Q. Thus if zJ−1 ≤ z̄,
the exponential function will be approximated by max{0, cJ−2 · z + dJ−2 } for z < zJ−1
during the optimization process, without loss of precision. We can then select J and
zJ−1 < zJ−2 < · · · < z0 = 0, with zJ−1 ≤ z̄, such that the approximation of exp{z} by
expL {z} is tight enough on the domain [z̄, 0]. For all z ∈ [z̄, 0], we have expL {z} ≥ exp{z}
and $\max_{z \in [\bar z, 0]} (\exp_L\{z\} - \exp\{z\}) < \max_j |\exp\{z_{j+1}\} - \exp\{z_j\}|$.
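As an illustration, the following numpy sketch builds such an approximation under one natural (but not unique) construction satisfying the stated requirements: the affine pieces are the chords of exp between consecutive, equally spaced anchor points.

```python
import numpy as np

def build_exp_pwl(z_bar, J):
    """Slopes c_j and intercepts d_j of a convex PWL overestimator of exp on [z_bar, 0].

    Anchor points z_0 = 0 > z_1 > ... > z_{J-1} = z_bar; each affine piece is the chord
    of exp between two consecutive anchors, and the last piece is the constant 0.
    """
    z = np.linspace(0.0, z_bar, J)
    c = (np.exp(z[:-1]) - np.exp(z[1:])) / (z[:-1] - z[1:])   # chord slopes
    d = np.exp(z[:-1]) - c * z[:-1]                           # chord intercepts
    return np.append(c, 0.0), np.append(d, 0.0)               # c_{J-1} = d_{J-1} = 0

def exp_pwl(zz, c, d):
    return np.max(c[None, :] * np.asarray(zz)[:, None] + d[None, :], axis=1)

c, d = build_exp_pwl(z_bar=-5.0, J=20)
zz = np.linspace(-5.0, 0.0, 1001)
print(np.max(exp_pwl(zz, c, d) - np.exp(zz)))   # small nonnegative approximation error
```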
For solving the linear programs we still use the interior-point solver associated to cvx.
One could also switch to simplex methods — arguments in favor of simplex methods may
be found in Bixby (2002).
We now describe the numerical experiments we have carried out and comment on the
results.
First, we consider the process ξ and parameters (ρ, Q, T ) taken from Hilli and Pennanen
(2008). We generate a sample of n' = 10^4 scenarios drawn independently, on which each
learned policy will be tested. We generate 200 random tree structures as described previously (using r = 0.5 and rejecting structures with fewer than 2 or more than 10 scenarios).
Fig. 4.2: First experiment: scores on the test sample associated to the random scenario trees (lower is better); the horizontal axis gives the number of scenarios of the tree and the vertical axis the cost of the policy. The linear segments join the best scores of policies inferred from trees of equivalent complexity.

Fig. 4.3: Some of the small scenario trees associated to the best performances of the first experiment, with the node values ξ0k , ξ1k , ξ2k , ξ3k and the scenario probabilities pk .
Node values are set by the deterministic method, thus the variance in performance that
we will observe among trees of similar complexity will come mainly from the branching
structure. We form and solve the programs on the trees using cvx, and extract the data
sets. We generate 5 policies per tree, by repeatedly training the neural networks from
random initial weights and biases. Each policy is simulated on the test sample and the
best of the 5 policies is retained for each tree.
The result of the experiment is shown on Figure 4.2. Each point is relative to a
particular scenario tree. Points from left to right are relative to trees of increasing size.
We report the value of $(1/n') \sum_{j=1}^{n'} \exp\bigl\{-\sum_{t=1}^{T} \xi_{t-1}^{j} \cdot \hat\pi_t(\xi^{j})\bigr\}$ for each learned policy π̂, in
accordance with the objective minimized in Hilli and Pennanen (2008). Lower is better.
Notice the large variance of the test sample scores among trees with the same number of
scenarios but different branching structures.
The tree selection method requires a single lucky outlier to output a good valid upper
bound on the targeted objective — quite an advantage with respect to approaches based
on worst-case reasonings for building a single scenario tree. With a particular tree of 6
scenarios (best result: 0.59) we already reach the guarantee that the optimal value of
our targeted problem is less than or equal to log(0.59) ≈ −0.5276. On Figure 4.3, we have
represented graphically some of the lucky small scenario trees associated to the best
performances. Of course, tree structures that perform well here may not be optimal for
other problem instances.
The full experiment, from which Figures 4.2 and 4.3 are drawn, takes 10 minutes to run
on a PC with a single 1.55 GHz processor and 512 MB of RAM. By comparing our bounds
to the results reported in Hilli and Pennanen (2008) (who have undertaken validation
experiments taking up to 30 hours on a PC with a single 3.8 GHz processor and 8 GB of RAM,
using a test sample of 10000 scenarios, and whose Figure 1 seems to indicate that the
best possible value for the bounds should be slightly greater than 0.58), we deduce that
we reached essentially the quality of the optimal solution.
Second, we consider the process ξ taken from Küchler and Vigerske (2010) (see Equa-
tion (4.5)) and a series of 15 sets of parameters for ρ, Q, T (see the first columns of
Table 4.1). We repeat the following experiment on each (ρ, Q, T ) with 3 different values
for the parameter N that controls the size of the random trees obtained with Algo-
rithm 4.4: Generate 25 random trees (we recall that this time the node values are also
randomized), solve the resulting 25 programs, learn 5 policies per tree (depending on the
random initialization of the neural networks), and report as the best score (best upper
bound) the lowest of the resulting 125 values computed on a common test sample of
n' = 10000 scenarios. The test sample is specific to the problem instance (in fact, specific
to the time horizon T ).
Table 4.1 reports values corresponding to the average performance
$$\rho^{-1} \log\Bigl\{(1/n') \sum_{j=1}^{n'} \exp\Bigl\{-\rho \sum_{t=1}^{T} \xi_{t-1}^{j} \cdot \hat\pi_t(\xi^{j})\Bigr\}\Bigr\}$$
obtained for the considered series of problem instances, for the 3 considered nominal tree
sizes N (so as to illustrate the effect of the size of the trees on the performance of the
learned policies). One column is dedicated to the performance of the analytical reference
policy π ref on the test sample.
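For completeness, a small Python helper computing this score from simulated per-scenario revenues could read as follows; the log-sum-exp rearrangement is only a numerical-stability detail, and the ρ = 0 branch returns the risk-neutral score of (4.2).

```python
import numpy as np

def reported_score(revenues, rho):
    """Empirical score rho^{-1} log( (1/n') sum_j exp(-rho * revenue_j) ).

    `revenues` contains, for each test scenario j, the realized total revenue
    sum_t xi_{t-1}^j * u_t^j of the simulated policy; lower scores are better.
    """
    revenues = np.asarray(revenues, dtype=float)
    if rho == 0.0:
        return -revenues.mean()                  # risk-neutral limit
    a = -rho * revenues
    m = a.max()                                  # log-sum-exp trick for stability
    return (m + np.log(np.mean(np.exp(a - m)))) / rho

print(reported_score([1.2, 0.8, 1.0], rho=1.0))
```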
Note that the case that Küchler and Vigerske (2010) have considered is the case
corresponding to (ρ, Q, T ) = (0, 20, 52) in our table. The plots from their Figure 3 seem
to confirm that the optimal value for this case is around −3.6.
For the cases with ρ = 0, the reference value provided by the analytical optimal policy
suggests that the best policies found by our approach are close to optimality. For the
Tab. 4.1: Second experiment: best upper bounds for a family of problem instances.
Problem (ρ, Q, T ) | Upper bounds on the value of problems (4.1) with the process (4.5), for N = 1 · T , N = 5 · T , N = 25 · T
cases with ρ = 0.25, the reference policy is now suboptimal. It still slightly dominates
the learned policies when Q = 2, but not anymore when Q = 6 or Q = 20. For the cases
with ρ = 1, the reference policy is dominated by the learned policies, except perhaps for
the cases with Q = 2. We also observe that results obtained with smaller trees (cases
N = 1 · T ) are sometimes better than results obtained with larger trees (cases N = 25 · T ,
that is, N = 300 if T = 12 and N = 1300 if T = 52). There is indeed a random
component in our tree generation approach, and it may happen that one small tree
ultimately gives a better data set than the data sets of the large trees, especially given
the relatively small number of trials in this experiment (25 trees per size N ) compared
to the number of stages.
Overall, the approach seems promising in terms of the usage of computational re-
sources. Table 4.2 reports the times taken for computing the bounds reported in Ta-
ble 4.1, using a Matlab/cvx implementation on a PC with a single 1.55 GHz processor and
512 MB of RAM. We recall that obtaining one bound involves generating 25 trees, form-
ing and solving the 25 corresponding mathematical programs, learning 125 policies, and
testing the 125 policies on 10000 scenarios. For instance, obtaining one of the 15 bounds
of the column N = 1 · T of Table 4.1 takes between 2 minutes (for the case ρ = 0, Q = 2,
T = 12) and 9 minutes (for the case ρ = 1, Q = 20, T = 52). Obtaining one of the 15
Tab. 4.2: CPU times for computing the bounds in Table 4.1.
Problem (ρ, Q, T ) | Total CPU time (in seconds) for N = 1 · T , N = 5 · T , N = 25 · T
bounds of the column N = 25 · T of Table 4.1 takes from less than 4 minutes (for the case
ρ = 0, Q = 2, T = 12, N = 300) to 110 minutes (for the case ρ = 1, Q = 20, T = 52,
N = 1300).
The experiment shows that even if the proposed scenario tree selection method re-
quires generating and solving several trees, rather than one single tree, it can work very
well. In fact, the experiment illustrates that with a random tree generation process that
can generate an “interesting” set of small trees, there is a good likelihood (on the studied
family of problems) that at least one of those trees will lead to excellent performances.
4.5 Time Inconsistency and Bounded Rationality Limitations
This section briefly discusses the notion of dynamically consistent decision processes, which
is relevant to sequential decision making with risk sensitivity, as opposed to the
optimization of the expectation of a total return over the planning horizon, which can
be described as risk-indifferent, or risk-neutral.
state fully describes the distribution of total returns conditionally to the current state.
Note, however, that a nice way of handling a mean-variance objective on the total return
is to relate it to the expected exponential utility: if R denotes a random total return,
$\Phi_\rho\{R\} = \mathbb{E}\{R\} - (\rho/2)\,\mathrm{var}\{R\} \approx -\rho^{-1} \log \mathbb{E}\{\exp(-\rho R)\}$. The approximation holds for
small ρ > 0. It is exact for all ρ > 0 if R follows a Gaussian distribution.
4.6 Conclusions
This chapter has presented a generic procedure for estimating the value of approximate
solutions to multistage stochastic programs. A direct application of this procedure is
the evaluation of the quality of the discretization of the original program. The proposed
selection of a best scenario tree among an ensemble of trees generated randomly, with the
branching structure also randomized, contributes to bring partial answers to the general
problem of building good scenario trees efficiently.
Our simple description of the proposed tree selection scheme (Algorithm 4.3), based on
an ensemble of random scenario trees generated independently, is less naive than it might
appear at first sight, even when more advanced Monte Carlo sampling techniques for
generating the trees sequentially are kept in mind. Indeed, there is a formidable dimensionality challenge in
the search for a proper approximate representation of a random process ξ = (ξ 1 , . . . , ξT )
by a scenario tree, already on short horizons, say T equal to 4 or 5, and especially if
the dimension of the random vectors ξt is larger than say 1 or 2. In that context, it is
not clear whether more advanced importance sampling schemes would be tractable for
problems of practical interest.
On the other hand, given a scenario tree and optimal decisions associated to its nodes,
there is still much liberty in the way a policy can be learned, and in the way the feasibility
of the output of a learned decision predictor can be efficiently restored. The next chapter
will explore some of these possibilities.
We leave as future work the investigations concerning policies learned from the data
obtained from several scenario trees. Based on the numerical results collected in this
chapter, our first intuition is that the trees would have first to be sorted out. Indeed,
many trees, as we currently generate them, give very poor decisions. Adding the decisions
of such trees to a common data set is likely to hurt policies learned from the common
data set. The issue, however, is that we can sort out trees only if we can score them.
Currently, we score the trees by testing a policy learned from them. Our conclusion is that
learning a policy from several scenario trees would imply a computationally intensive,
boosting-like approach: the best policies learned from say the largest trees one could
solve would serve to identify the scenario/decisions pairs to be collected in a large data
set, that would then be used by a next generation of policies. Such ideas are difficult
to test and refine on the problems we have considered in this chapter, because the best
policies learned from single trees already yield near-optimal results.
Chapter 5
Inferring Decisions from Predictive Densities
In this chapter, we investigate alternative methods for learning feasible policies given
a data set of scenario/decisions pairs. We seek to infer conditional probability models
(predictive densities) for the decisions ut given the information state (ξ1 , . . . , ξt−1 ), and
then to obtain feasible decisions on new scenarios ξ by maximizing online the probability
of the decision ut subject to the current feasibility constraints ut ∈ Ut (ξ).
The chapter is organized in a backward fashion: Section 5.1 assumes that a predictive
density is available and seeks to exploit it so as to select a feasible decision; Section 5.2
concentrates on the inference of conditional predictive densities, given the current in-
formation state. In Section 5.3, a certain number of the proposed ideas are illustrated,
evaluated, and sometimes modified, in the context of a particular problem.
Notations.
We consider the following setup: Given a predictive density p̂t for the decision ut ∈ Rn ,
infer (select) a decision ūt such that ūt satisfies the feasibility constraints ūt ∈ Ut (ξ).
The given density p̂t is in fact an estimated density, obtained for instance as described in
Section 5.2. For the selection of a decision from the density, we maximize (the logarithm
of) the predictive density subject to constraints, which leads to the following estimate:
$$\bar u_t \in \arg\max_{u_t \in U_t(\xi)} \ \log \hat p_t(u_t)\,. \qquad (5.1)$$
If p̂t is unimodal with its mode in Ut (ξ), then ūt is given by the mode of p̂t . The nontrivial
case is when the mode is not in Ut (ξ).
To make the approach computationally viable, we have to introduce restrictions on the
density p̂t and the feasible sets Ut . Moreover, one may want to ensure that the solution
set in (5.1) is a singleton. Indeed, we are interested in the situation where ū t is viewed
as the decision of a deterministic policy πt (ξ1 , . . . , ξt−1 ); by selecting ūt arbitrarily from
the solution set, an undesirable source of randomness would be added to the decision
process.
5.1.1 Assumptions
An interesting restriction on the models for p̂t is to assume that p̂t is taken from an
exponential family of distributions.
The following description of exponential families will suffice for our purposes. Given
an index set I of finite cardinality |I| = d, and a finite collection {φ` }`∈I of functions
φ` : Rn → R, let φ(ut ) ∈ Rd denote the d-dimensional column vector with elements
φ` (ut ), ` ∈ I, and define the (natural) exponential family associated to the collection
{φ` }`∈I as
$$p(u_t;\, \theta) = \exp\{\langle \theta, \phi(u_t)\rangle - A(\theta)\}\,, \qquad (5.2)$$
where θ is allowed to take values from a set Θ ⊂ Rd described below, and where A(θ) is
the so-called cumulant generating function (log-partition function) defined by
$$A(\theta) = \log \int_{\mathbb{R}^n} \exp\{\langle\theta, \phi(u_t)\rangle\}\, du_t\,. \qquad (5.3)$$
Choosing a value for θ amounts to select a distribution among the members of the
exponential family.
The domain of the parameter θ is the set Θ = {θ ∈ Rd : A(θ) < ∞}. In the
terminology of Appendix A, the set Θ is the effective domain of the cumulant generating
function A(θ) viewed as an extended-real-valued function. The (natural) exponential
family is said to be regular if Θ is open. In the sequel, we assume that the family is
regular. It is well-known (Brown, 1986; Robert, 2007; Wainwright and Jordan, 2008) that
A(θ) is a convex function of θ (and thus in particular that Θ is convex). Moreover, A(θ)
is strictly convex for the so-called minimal exponential families. Minimal exponential
families are (natural) exponential families such that the functions φ` , ` ∈ I, and the
constant-valued function φ0 (x) = 1, form a set of linearly independent functions — that
is, for any θ ≠ 0, ⟨θ, φ(ut )⟩ is not a constant-valued function of ut .
For minimal exponential families, there is a one-to-one correspondence between a
value θ ∈ Θ and a distribution from the family.
Using p(ut ; θ) from (5.2) for p̂t (ut ), the problem (5.1) becomes
$$\bar u_t \in \arg\max_{u_t \in U_t(\xi)} \ \langle \theta, \phi(u_t)\rangle\,, \qquad (5.4)$$
which is independent of the constant term A(θ), and corresponds formally to a maximum
a posteriori (MAP) estimation problem subject to additional constraints.
To ensure that (5.4) has a solution, we assume that the set Ut (ξ) is nonempty, closed,
and convex. Moreover, we assume that the support of p(ut ; θ) meets the interior of Ut (ξ),
in order to guarantee that (5.4) has a nonempty solution set and does not lead to a
pathological optimization problem. It is well-known that the support of exponential
families does not depend on the value of their parameter θ. Therefore, given a subset C
of Rn such that Ut (ξ) is always in C for all ξ (it is possible to choose C = Rn ), one can
choose the exponential family so that its support covers C.
5.1.2 Particularizations
In the sequel, we assume that the feasible sets have the polyhedral form
$$U_t(\xi) = \{u_t \in \mathbb{R}^n : B_t u_t = h_t - A_t u_{t-1},\ u_t \ge 0\}\,, \qquad (5.5)$$
where ut−1 is the decision relative to the previous stage (with ut−1 actually depending
only on ξ1 , . . . , ξt−2 ), Bt is very often a fixed matrix (recourse matrix), and At , ht are
a matrix (technology matrix) and a vector that may both depend on ξ1 , . . . , ξt−1 (often
only affinely). The form (5.5) is in part justified by results for two-stage stochastic
programming problems (Appendix D). When one uses (5.4) for computing ūt online on
a new scenario, the realizations of ut−1 , At and ht are known, and (5.4) becomes the
problem of solving over ut ∈ Rn the program
$$\text{maximize} \ \langle\theta, \phi(u_t)\rangle \quad \text{subject to} \ B_t u_t = h_t - A_t u_{t-1}\,,\ u_t \ge 0\,. \qquad (5.6)$$
In the sequel, we seek to identify some exponential families that lead to a concave
objective in (5.6).
We consider for p̂t in (5.1) the multivariate normal distribution N (λ, Λ) with mean
λ ∈ Rn and covariance matrix Λ ∈ Rn×n (we do not stress in the notation λ, Λ a possible
dependence of these parameters on ξ1 , . . . , ξt−1 and on t). We assume that Λ is positive
definite, so that the normal distribution has a density, namely,
p̂t (ut ) = ((2π)n det Λ)−1/2 exp{− 12 (ut − λ)T Λ−1 (ut − λ)}
= exp{− 12 (tr{Λ−1 ut uTt } − 2hΛ−1 λ, ut i + λT Λ−1 λ − log{(2π)n det Λ−1 })} .
In that case, using the precision matrix S = Λ−1 , the program (5.6) becomes the strictly
convex quadratic program
$$\text{minimize} \ (u_t - \lambda)^T S\, (u_t - \lambda) \quad \text{subject to} \ B_t u_t = h_t - A_t u_{t-1}\,,\ u_t \ge 0\,. \qquad (5.7)$$
The program (5.7) has a simple geometrical interpretation (Figure 5.1) in terms of the
Mahalanobis distance dM (ut , vt ) = ||S 1/2 (ut − vt )||2 between two vectors ut , vt ∈ Rn
(Mahalanobis, 1936). For conditions ensuring that the feasibility set is nonempty, see
Definitions D.3, D.4, D.5 in Appendix D.
A zero-valued element Sij of the precision matrix has the interpretation that the com-
ponents i, j of ut are conditionally independent given the other components (Dempster,
1972).
Fig. 5.1: Geometrical interpretation for (5.7). The matrix S defines a Mahalanobis distance
in Rn . The program (5.7) consists in computing the projection of λ on the set Ut (ξ)
according to this metric, by minimizing the distance between λ and ut ∈ Ut (ξ). On this
figure, ut ∈ R2 , and λ 6∈ Ut (ξ). The level set corresponding to the optimal objective
value f ∗ has been drawn (dashed line).
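For illustration, a minimal Python/cvxpy sketch of this projection is given below; it assumes a feasible set of the form (5.5) with given Bt , At , ht and previous decision, and the numerical values in the usage example are arbitrary placeholders.

```python
import cvxpy as cp
import numpy as np

def gaussian_map_repair(lam, S, B, A, h, u_prev):
    """Project the conditional mean `lam` on U_t(xi) in the Mahalanobis metric of S, as in (5.7)."""
    u = cp.Variable(lam.shape[0])
    objective = cp.Minimize(cp.quad_form(u - lam, S))          # (u - lam)^T S (u - lam)
    constraints = [B @ u == h - A @ u_prev, u >= 0]            # feasible set of the form (5.5)
    cp.Problem(objective, constraints).solve()
    return u.value

# Hypothetical data: 2 decision components, 1 equality constraint, nonnegativity.
lam = np.array([1.5, -0.3]); S = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[1.0, 1.0]]); A = np.zeros((1, 2)); h = np.array([1.0]); u_prev = np.zeros(2)
print(gaussian_map_repair(lam, S, B, A, h, u_prev))            # feasible decision closest to lam
```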
For stochastic programming problems where the components of the decisions u t can be
put in correspondence with spatial locations, for instance problems defined on networks,
it could make sense to use a Gaussian Markov random field model (Speed and Kiiveri,
1986) for the density p̂t (ut ).
When the density p̂t factorizes over the components of ut , say $\hat p_t(u_t) = \prod_{i=1}^{n} \hat p_{it}(u_{t\,i})$ with each p̂it taken from a scalar exponential family, the program (5.6) becomes
$$\text{maximize} \ \sum_{i=1}^{n} \langle\theta_i, \phi_i(u_{t\,i})\rangle \quad \text{subject to} \ B_t u_t = h_t - A_t u_{t-1}\,,\ u_t \ge 0\,. \qquad (5.8)$$
The form (5.8) is well suited to situations where probabilistic models for the scalar
components ut i have been learned separately, so as to obtain more tractable learning
problems. There is probably some structure among the components ut i once ut is op-
timized, and we may hope that by enforcing the condition ut ∈ Ut (ξ), we recover, to a
certain extent, a part of that structure — while the part of the structure that is induced
by the objective function of the original multistage decision making problem is unlikely
to be restored by this myopic feasibility restoration procedure.
As an example of log-concave density, we can cite the univariate normal distribution
N (µ, σ 2 ) with σ 2 > 0. Another potentially useful example is the gamma distribution
Γ(α, β) with α ≥ 1 (condition ensuring the log-concavity) and β > 0, supported on
(0, ∞). If we choose for p̂it the gamma distribution Γ(αi , βi ), then the density of p̂it is
given by
$$\hat p_{it}(u_{t\,i}) = \frac{\beta_i^{\alpha_i}}{\Gamma(\alpha_i)}\, u_{t\,i}^{\,\alpha_i - 1} \exp\{-\beta_i u_{t\,i}\} \qquad \text{for } u_{t\,i} > 0\,,$$
Fig. 5.2: Geometrical interpretation for (5.10). The parameters αi of the marginal distributions
Γ(αi , βi ) for 1 ≤ i ≤ n define a weighted Itakura-Saito distance in Rn (see Remark 5.1).
The program (5.10) consists in computing the projection of the mode mp of the dis-
tribution of ut on the set Ut (ξ) according to this (pseudo) metric, by minimizing the
distance between mp and ut ∈ Ut (ξ), where mp = [mp 1 . . . mp n ]T , mp i = (αi − 1)/βi .
On the present figure, ut = [ut 1 ut 2 ]T ∈ R2 , and mp 6∈ Ut (ξ). The level set corre-
sponding to the optimal objective value f ∗ has been drawn (dashed line).
where $\Gamma(\alpha_i) = \int_0^{\infty} t^{\alpha_i - 1} \exp\{-t\}\, dt$ is the gamma function evaluated at αi . One then
obtains the objective component
$$\langle\theta_i, \phi_i(u_{t\,i})\rangle = (\alpha_i - 1) \log u_{t\,i} - \beta_i u_{t\,i} \quad \text{(up to an additive constant)}\,, \qquad (5.9)$$
which is strictly concave if αi > 1. Note that its unconstrained maximization would then
yield the mode of the distribution Γ(αi , βi ), namely mp i := (αi − 1)/βi .
Now, if for instance each component ut i follows a distribution Γ(αi , βi ) with αi > 1
and βi > 0, the program (5.8) becomes the strictly convex program
$$\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{n} \beta_i u_{t\,i} - \sum_{i=1}^{n} (\alpha_i - 1) \log u_{t\,i} \qquad (5.10) \\
\text{subject to}\quad & B_t u_t = h_t - A_t u_{t-1}\,, \quad u_t \ge 0\,.
\end{aligned}$$
Remark 5.1 (Justification of the geometrical interpretation for (5.10)). The strictly
convex function $F(u_t) = -\sum_{i=1}^{n} \langle\theta_i, \phi_i(u_{t\,i})\rangle$, obtained by summing the components (5.9) and changing the sign, induces a Bregman divergence (Bregman, 1967;
Banerjee et al., 2005) between ut , vt ∈ Rn given by
$$D_F(u_t, v_t) = F(u_t) - F(v_t) - \langle\nabla F(v_t),\, u_t - v_t\rangle\,.$$
Since ∇F vanishes at the mode mp , we have $D_F(u_t, m_p) = F(u_t) - F(m_p)$,
that is, the objective of (5.10) up to a constant term. The omission of the constant
term shifts the value of the objective but does not alter the geometry of the level
sets.
We consider the following joint Gaussian model as a base case for learning probabilistic
models.
Description.
It is well known that if a random vector z = (x, y) follows a multivariate normal distribution N (z̄, Σ) with
$$\bar z = \begin{bmatrix} \bar x \\ \bar y \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{xy}^T & \Sigma_y \end{bmatrix},$$
then y conditionally to the observation x follows a multivariate normal N (λ(x), Λ) with
$$\lambda(x) = \bar y + \Sigma_{xy}^T \Sigma_x^{-1} (x - \bar x)\,, \qquad \Lambda = \Sigma_y - \Sigma_{xy}^T \Sigma_x^{-1} \Sigma_{xy}\,. \qquad (5.12)$$
A simple model of the predictive density for ut given ξ1 , . . . , ξt−1 can be obtained by
setting x = (ξ1 , . . . , ξt−1 ), y = ut , and then using the conditioning formulae (5.12) on
a multivariate normal model N (z̄, Σ) for z = (x, y), with z̄ and Σ learned (estimated)
from a data set of scenario/decisions pairs (see below).
The evaluation of (5.12) for u2 , . . . , uT requires T − 1 matrix inversions, but as $\Sigma_x^{-1}$
is independent of the observations x, the inversions and matrix products need not be
recomputed online on new scenarios.
By (5.12), the conditional mean of ut is an affine function of the observed history
(ξ1 , . . . , ξt−1 ). In fact, λ(x) would be called a linear decision rule in the context of
stochastic programming (Garstka and Wets, 1974).
Estimation.
There is a large literature on the estimation of the mean and the covariance matrix (or
its inverse) of a Gaussian random vector (Stein, 1956; Haff, 1980; Banerjee et al., 2008).
In the present context, given a data set of samples {z^k }1≤k≤N , where z^k = (x^k , y^k ),
x^k = (ξ1^k , . . . , ξt−1^k ), y^k = ut^k , we can estimate the mean z̄ by $\hat z = (1/N) \sum_{k=1}^{N} z^k$, and
estimate the covariance matrix Σ by a simple shrinkage estimator of the form
$$\hat\Sigma = (1 - \epsilon)\, \Sigma_{\mathrm{ML}} + \epsilon I\,, \qquad \text{with } \ \Sigma_{\mathrm{ML}} = (1/N) \sum_{k=1}^{N} (z^k - \hat z)(z^k - \hat z)^T\,. \qquad (5.13)$$
The identity matrix I is added with weight ε ∈ (0, 1) in order to ensure that the estimated
covariance is positive definite and well-conditioned.
If Σ̂ in (5.13) is scaled by some positive factor, the conditional covariance Λ in (5.12)
is scaled by the same factor, whereas the conditional mean λ(x) is left unchanged. As
the minimizer of (5.7) is invariant with respect to a rescaling of the objective, one can
thus rescale (5.13) by a factor $(1 - \epsilon)^{-1}$, set $\epsilon' = \epsilon/(1 - \epsilon) > 0$ and simply use
$$\hat\Sigma = \Sigma_{\mathrm{ML}} + \epsilon' I\,. \qquad (5.14)$$
By the same token, there is no potential advantage in replacing the maximum likelihood
estimator ΣML by an unbiased empirical estimator
$$\Sigma_{\mathrm{emp}} = (N - 1)^{-1} \sum_{k=1}^{N} (z^k - \hat z)(z^k - \hat z)^T\,.$$
The program (5.7) that restores the feasibility of ut uses larger corrections for compo-
nents ut i of ut with larger conditional variances Λii . Under the joint Gaussian model,
the components ut i of the decision vector ut that have a larger estimated variance (rela-
tively to the other components) are those that are not well explained by the linear model
(compared to the other components).
Due to the corrections made by (5.7), the actual decision ūt will not in general de-
pend affinely on (ξ1 , . . . , ξt−1 ). Therefore, it might be beneficial to consider the actual
decisions (u2 , . . . , ut−1 ) as new observations, and extend the conditional model for ut by
computing (5.12) with y = ut and x = (ξ1 , . . . , ξt−1 , u2 , . . . , ut−1 ).
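As an illustration, a small numpy sketch of the estimation (5.14) and of the conditioning formulae (5.12) could read as follows; the shrinkage weight and the data set used in the usage line are placeholders.

```python
import numpy as np

def fit_joint_gaussian(Z, dim_x, eps=1e-2):
    """Estimate (5.14) from samples Z of shape (N, dim_x + dim_y) and precompute the pieces of (5.12)."""
    z_hat = Z.mean(axis=0)
    Sigma = (Z - z_hat).T @ (Z - z_hat) / Z.shape[0] + eps * np.eye(Z.shape[1])
    Sx, Sxy, Sy = Sigma[:dim_x, :dim_x], Sigma[:dim_x, dim_x:], Sigma[dim_x:, dim_x:]
    W = np.linalg.solve(Sx, Sxy)                 # Sigma_x^{-1} Sigma_xy, computed once offline
    Lambda = Sy - Sxy.T @ W                      # conditional covariance in (5.12)
    x_bar, y_bar = z_hat[:dim_x], z_hat[dim_x:]
    def conditional_mean(x):                     # linear decision rule lambda(x) in (5.12)
        return y_bar + W.T @ (x - x_bar)
    return conditional_mean, Lambda

# Hypothetical usage: x = observed history, y = decision extracted from the scenario tree.
Z = np.random.default_rng(0).standard_normal((50, 5))
cond_mean, Lambda = fit_joint_gaussian(Z, dim_x=3)
print(cond_mean(Z[0, :3]), Lambda.shape)
```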
Gaussian processes allow one to define nonparametric models (as opposed to models with
an a priori fixed number of parameters for summarizing the data, whatever the size of
the data). Following O’Hagan (1978), it is often said that Gaussian processes allow
one to define prior distributions over spaces of functions, which are then updated to posterior
distributions, given a data set of observations of the relation between inputs and outputs.
Note that since Gaussian process models often incorporate the effect of a noise process
on observations, the relation between inputs and noisy observed outputs is actually not
of a purely functional nature (Neal, 1997, page 4).
Description.
Let J denote an index set (typically infinite), and let {X α }α∈J denote a collection of
vectors X α ∈ Rn such that X α 6= X β if α 6= β. The vectors X α are interpreted as query
points uniquely identified by labels α ∈ J (the labels can indicate an ordering between
distinct query points). For each α ∈ J , let Y α be a real-valued random variable with
finite variance. For any finite subset S of indices from J , let |S| denote the cardinality
of S, and let Y (S) denote the |S|-dimensional random vector with elements Y α , α ∈ S.
We assume that for any such subset S, the random vector Y (S) follows a multivariate nor-
mal distribution N (µ(S), K(S)), with its mean vector µ(S) ∈ R|S| and covariance matrix
K(S) ∈ R|S|×|S| defined below. This defines a so-called Gaussian process {Y α }α∈J .
The mean vector µ(S) = E{Y (S)} collects (stacks into a column vector) the elements
µα = g(X α ) , α∈S ,
defined using some fixed real-valued function g : Rn → R called the mean function.
The covariance matrix K(S) = E{[Y (S) − µ(S)][Y (S) − µ(S)]T } collects (stacks into
a symmetric |S| × |S| matrix) the elements
K αβ = k(X α , X β ) , α, β ∈ S ,
defined using some fixed positive definite kernel k : Rn × Rn → R (see Definition C.11 in
Appendix C — the name “positive definite kernel” is standard whereas the corresponding
matrix K(S) is only positive semi-definite). The kernel k (also called covariance function)
is parametrized by a vector η of hyperparameters that has not been written explicitly to
lighten the notation. For example, a kernel k : R^n × R^n → R with values
    k(X^α, X^β) = v_0 exp{ −(1/2) ∑_{i=1}^{n} (X_i^α − X_i^β)² / σ_i² }
(radial basis kernel) is parametrized by η = (v_0, σ_1^{−2}, ..., σ_n^{−2}), with v_0 > 0 and where
each σ_i > 0 is a bandwidth parameter associated to the i-th coordinate of the inputs X^α
and X^β.
Now, let (S_1, S_2) denote a partition of S, that is, S_1 ∪ S_2 = S and S_1 ∩ S_2 = ∅. Let
K(S_1, S_2) = E{[Y(S_1) − µ(S_1)][Y(S_2) − µ(S_2)]^T} be the matrix with elements K^{αβ} for
α ∈ S_1, β ∈ S_2, and let
    µ = [ µ(S_1) ]        K = [ K(S_1)          K(S_1, S_2) ]
        [ µ(S_2) ]  ,         [ K(S_1, S_2)^T   K(S_2)      ] .
Let Z(S1 ) denote a |S1 |-dimensional random vector representing a noisy observation of
Y (S1 ), collecting elements defined by
Z α = Y α + σW α , α ∈ S1 ,
where σ 2 > 0 represents the variance of the observation noise, assumed to be i.i.d.
Gaussian, and where each W α is assumed to be drawn independently from the stan-
dard normal distribution N(0, 1). Then, the random vector (Z(S_1), Y(S_2)) follows a
multivariate normal N(µ′, K′) with
    µ′ = µ ,    K′ = [ K(S_1) + σ² I    K(S_1, S_2) ]
                     [ K(S_1, S_2)^T    K(S_2)      ] ,
where I stands for the |S_1| × |S_1| identity matrix. In particular, the random vector Y(S_2)
conditionally to Z(S_1) follows a multivariate normal N(λ_Y(Z(S_1)), Λ_Y) with the conditional
mean and conditional covariance matrix given respectively by
    λ_Y(Z(S_1)) = µ(S_2) + K(S_1, S_2)^T [K(S_1) + σ² I]^{−1} (Z(S_1) − µ(S_1)) ,    (5.15)
    Λ_Y = K(S_2) − K(S_1, S_2)^T [K(S_1) + σ² I]^{−1} K(S_1, S_2) .    (5.16)
When one actually observes a realization z(S1 ) ∈ R|S1 | of the random vector Z(S1 ),
the conditional mean of Y (S2 ) given z(S1 ) is a real vector λ̂ = λY (z(S1 )) ∈ R|S2 | that
represents the best prediction for the realization of Y (S2 ) in the mean-square error sense,
while the covariance matrix of the prediction error λ̂ − Y (S2 ) is given by Λ̂ = ΛY .
The contribution σ 2 I from the noise vector W can be viewed as a jitter term that
stabilizes the matrix inversion without perturbing too much the model (Neal, 1997). It
also allows to consider in Equations (5.15), (5.16), several independent noisy observations
at a same query point X α , by reinterpreting S1 as a multiset (collection) of indices of J .
We now apply the described Gaussian Process model to the inference of the distri-
bution of a decision vector ut conditionally to a new scenario ξ (of which we can only
observe ξ1 , . . . , ξt−1 ), given a data set of scenario/decisions pairs (ξ k , uk ) extracted from
a scenario tree. We describe the calculations for the i-th component of ut , written ut i .
We define S_1 as an index set relative to the distinct values of (ξ_1^k, ..., ξ_{t−1}^k) found in
the data set, and we set
    X^α = (ξ_1^α, ..., ξ_{t−1}^α) ,   z^α = u^α_{t i} ,   α ∈ S_1 .    (5.17)
This allows us to compute the term (K(S_1) + σ² I)^{−1}(z(S_1) − µ(S_1)) in (5.15) as soon as
we obtain the data set (the realization z(S_1) of Z(S_1)). Then, we view S_2 as a singleton
relative to a new scenario ξ*, and we set
    X^β = (ξ_1^*, ..., ξ_{t−1}^*) ,   Y^β = u_{t i} ,   β ∈ S_2 .    (5.18)
This allows us to compute K(S_1, S_2) as soon as we actually observe (ξ_1^*, ..., ξ_{t−1}^*), with
K(S_1, S_2) interpreted as a vector of weights describing the similarity of the new scenario ξ*
with respect to each example ξ^k stored in the data set.
At this stage, we can infer that the real-valued random variable u_{t i} follows a univariate normal distribution N(λ_i, Λ_ii) with λ_i = λ_Y(z(S_1)) given by (5.15), and Λ_ii = Λ_Y given by (5.16). As for the predictive density for the full decision u_t, we assume that each component u_{t i} follows its univariate predictive density independently of the other components.
It can be seen from (5.15) that the mean, and thus the mode, of the predicted Gaussian density for u_{t i} combines the decisions u^α_{t i} of the data set, in a way that depends on the similarity (determined by the kernel k) between the observed part (ξ_1^*, ..., ξ_{t−1}^*) of the new scenario ξ* and the scenarios ξ^α, α ∈ S_1, stored in the data set.
The factor (K(S1 )+σ 2 I)−1 (Z(S1 )−µ(S1 )) in (5.15) has to be evaluated once, whereas
µ(S2 ) and the vector K(S1 , S2 ) ∈ R|S1 | must be evaluated online for each new scenario.
Therefore, training requires a time cubic in the cardinality |S1 | of the training set due to
the matrix inversion, whereas the computation of the conditional mean λ can be done in
linear time. The online computation of the variance ΛY would require a time quadratic
in the cardinality of the training set, but following Remark 5.2, it is possible to bypass
the estimation of the variance by keeping the same kernel for each component of u t .
The storage of the Gram matrix K(S1 ) takes a space quadratic in the cardinality of the
training set.
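As an illustration of these computational remarks, here is a minimal NumPy sketch of the conditional-mean computation (5.15) with a zero mean function and a common-bandwidth radial basis kernel. The class and parameter names are hypothetical, and the Cholesky-based solve is one standard way (among others) of organizing the cubic-time training step.

```python
import numpy as np

def rbf_kernel(XA, XB, v0=1.0, sigma=1.0):
    """Radial basis kernel with a single bandwidth (all sigma_i equal)."""
    sq = ((XA[:, None, :] - XB[None, :, :]) ** 2).sum(-1)
    return v0 * np.exp(-0.5 * sq / sigma ** 2)

class GPDecisionModel:
    """Conditional mean (5.15) with zero mean function g: training is one
    factorization of K(S1) + sigma_noise^2 I (cubic in |S1|); each prediction
    is a dot product with the similarity vector K(S1, S2) (linear in |S1|)."""
    def __init__(self, X_train, z_train, sigma_noise=0.1, sigma=1.0):
        self.X_train, self.sigma = X_train, sigma
        K = rbf_kernel(X_train, X_train, sigma=sigma)
        K += sigma_noise ** 2 * np.eye(len(X_train))
        L = np.linalg.cholesky(K)
        # alpha = (K(S1) + sigma_noise^2 I)^{-1} z(S1), computed once
        self.alpha = np.linalg.solve(L.T, np.linalg.solve(L, z_train))

    def predict_mean(self, X_new):
        k_star = rbf_kernel(X_new, self.X_train, sigma=self.sigma)
        return k_star @ self.alpha

# training inputs: histories (xi_1, ..., xi_{t-1}); targets: one component of u_t
X = np.random.default_rng(1).standard_normal((50, 2))
z = np.sin(X[:, 0]) + X[:, 1]
model = GPDecisionModel(X, z)
print(model.predict_mean(np.array([[0.2, -0.1]])))
```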
Estimation.
In Gaussian Process regression, the mean function g with values g(X α ) is often set to the
constant zero-valued function, so that the terms µ(S1 ), µ(S2 ) do not appear in (5.15).
Sometimes, the mean function is set to a linear function of the inputs X α . In the present
context, the values of g could also be set to constant reference decisions, for instance, to
the decisions from a nominal plan (Section 2.1.1).
Selecting a kernel type automatically is not easy. In support vector machines, the problem is partially addressed by working over a set of kernels (Lanckriet et al., 2004; Micchelli and Pontil, 2005; Sonnenburg et al., 2006). Once the kernel type is chosen, the selection of the hyperparameters η can be formulated as the maximization over η of the log-likelihood of the observed data z(S_1) (Mardia and Marshall, 1984).
where At , Bt for t ≥ 1 denote fixed matrices of proper dimension, and where the cost
coefficients ct , the constraint right-hand sides ht , and decision vectors ut may depend, for
t ≥ 2, on the realization (ξ1 , . . . , ξt−1 ) of a random process ξ = (ξ1 , . . . , ξT ). As usual, the
expectation is taken over ξ and can be decomposed in nested conditional expectations.
Recall that a scenario tree for ξ is a set of realizations {ξ^k}_{1≤k≤N} of ξ, along with probabilities p^k > 0 assigned to the scenarios ξ^k and summing to 1. Recall that the branching structure of the tree causes histories (ξ_1^k, ..., ξ_{t−1}^k) to be identical among some scenarios k. Let us denote by c_t^k and h_t^k the values of c_t and h_t associated to ξ^k, noting in particular that these values are identical among scenarios sharing the same history (ξ_1^k, ..., ξ_{t−1}^k).
The mapping π_t^SH for t ≥ 2 is a function of (ξ_1, ..., ξ_{t−1}), with value ū_t, where ū_t
corresponds to an optimal solution for u_t^k (any k), relative to the following program over
u_{t'}^k for 1 ≤ k ≤ N and t ≤ t' ≤ T:
    minimize    N^{−1} ∑_{k=1}^{N} ∑_{t'=t}^{T} p^k ⟨c_{t'}^k, u_{t'}^k⟩
    subject to  A_{t'} u_{t'−1}^k + B_{t'} u_{t'}^k = h_{t'}^k ,   u_{t'}^k ⪰ 0   for each k and for t' ≥ t ,
                where we set, for t' = t, u_{t−1}^k := ū_{t−1} for each k ,
                u_{t'}^j = u_{t'}^k   for each t' ≥ t and j, k such that (ξ_1^k, ..., ξ_{t'−1}^k) ≡ (ξ_1^j, ..., ξ_{t'−1}^j) ,
where N and all scenario-dependent quantities pk , ξtk , ckt , hkt should here be understood as
relative to the scenario tree for ξ given (ξ1 , . . . , ξt−1 ), which is built once the realization of
(ξ1 , . . . , ξt−1 ) becomes available, and which instantiates, along with ūt−1 , the parameters
of the program.
Our intention in this section is to take shrinking-horizon policies as the golden standard for sequential decision making, and compare them to other policies built with the techniques proposed in the chapter on a common test sample of M = 10^4 scenarios.
For the simplicity of the parametrization of the scenario tree building algorithm, we
consider scenario trees with a uniform branching factor, and use the same branching
factor for rebuilding scenario trees on the shrinking horizon. Therefore, once the dis-
cretization method for ξt is fixed (choices are explained at length in Section 5.3.2), a
shrinking-horizon policy is uniquely determined by the branching factor. Moreover, us-
ing the same branching factor at each stage results in the following property: If the realization of (ξ_1, ..., ξ_{t−1}) is identical to (ξ_1^k, ..., ξ_{t−1}^k) for some scenario k in the initial scenario tree for ξ, then the subtree rooted at the node relative to (ξ_1^k, ..., ξ_{t−1}^k) is exactly
the subtree built at stage t for the scenario tree of ξ given (ξ1 , . . . , ξt−1 ). Hence, if one
simulates the shrinking horizon policy with uniform branching factor on the scenario ξ k ,
one will recover the decisions uk = (uk1 , . . . , ukT ) that were found to be optimal on the
initial scenario tree for computing ū1 .
To a single shrinking-horizon policy π^SH will correspond several learned policies, obtained by different learning algorithms applied to the same training data {(ξ^k, u^k)}_{1≤k≤N}, relative to the scenario tree used to optimize π_1^SH = ū_1. Obviously, all these learned policies start with the same first-stage decision ū_1.
The test problem is a multi-product assembly problem under demand uncertainty. The multistage structure of the problem is summarized in Table 5.1: the decisions to take at each stage are put in correspondence with the information available at that stage, represented by the realization of certain random variables. The mathematical formulation of the problem is presented in Table 5.2 in nested form. The nested form is a generalization to several stages of the formulation for two-stage programs presented in Appendix D; it enables a reader to distinguish easily the constraints specific to a decision stage t, that is, the actual definition of the sets U_t(ξ). We have put at the end of the chapter (page 104) a table that specifies the numerical value of all the parameters for the test problem (Table 5.10).
The test problem can be described as follows. A manufacturer can assemble 5 products P_i, for which the demand d_i ∈ R is unknown, but influenced by three random factors ε_t ∈ R, t = 1, 2, 3, observed at distinct decision stages (see Table 5.1). We let d ∈ R^5 be the random vector representing the demand. The products are made of subparts, some of which are common among several products. There is a total of 8 distinct subparts. The subparts are themselves made of components, some of which are common among several subparts. There is a total of 12 distinct components that the manufacturer can buy.
The random demand d is assumed to be distributed according to the following model:
    d = [b_0 + b_1 ε_1 + b_2 ε_2 + b_3 ε_3]_+ ,    (5.19)
    ε_1 ∼ N(0, 1) ,  ε_2 ∼ N(0, 1) ,  ε_3 ∼ N(0, 1) ,    (5.20)
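A short sketch of how scenarios can be drawn from the demand model (5.19)-(5.20) is given below; the function name and the choice of random number generator are illustrative, and the numerical values of b_0, ..., b_3 are those listed in Table 5.10 at the end of the chapter.

```python
import numpy as np

def sample_demand(b0, b1, b2, b3, n_scenarios, rng=None):
    """Draw demand scenarios from (5.19)-(5.20):
    d = [b0 + b1*eps1 + b2*eps2 + b3*eps3]_+ with eps_t i.i.d. N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_scenarios, 3))        # columns eps1, eps2, eps3
    d = b0 + eps[:, [0]] * b1 + eps[:, [1]] * b2 + eps[:, [2]] * b3
    return np.maximum(d, 0.0), eps                     # truncation of d at 0

# numerical values of b0, b1, b2, b3 taken from Table 5.10
b0 = np.array([13.9, 12.86, 18.21, 10.14, 17.21])
b1 = np.array([9.708, 9.901, 7.889, 4.387, 4.983])
b2 = np.array([2.14, 6.435, 3.2, 9.601, 7.266])
b3 = np.array([4.12, 7.446, 2.679, 4.399, 9.334])
demands, factors = sample_demand(b0, b1, b2, b3, n_scenarios=10 ** 4)
```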
A first group of decisions q_1 ∈ R^12, q_2 ∈ R^8, q_3 ∈ R^5 determines the quantities of components, subparts and products that are bought or assembled. A second group of so-called ancillary decisions v_2 ∈ R^{12×8}, v_3 ∈ R^{8×5} determines each quantity of component/subpart allocated to a given subpart/product in the next stage of the production process. A decision q_4 ∈ R^5, corresponding to the quantity of product actually sold, and defined by q_4 = min{q_3, d} (elementwise minimum), is added to the group of ancillary decisions, for the convenience of the problem formulation (convexity). To summarize, the full decision vector u stacks the decisions (q_1, v_2, q_2, v_3, q_3, q_4) according to the decomposition (5.21). The cost vector c collects the unit cost of each decision in u, in the order determined by the decomposition (5.21). The subvectors c_1, c_2, c_3 associated to q_1, q_2, q_3 correspond to fixed production costs and are nonnegative. Zero costs are associated to the decisions v_2, v_3. The subvector c_4 associated to q_4 has negative entries that correspond to the fixed prices of the 5 products with a sign change.
The decision vector u is structured by various constraints. Besides a nonnegativity constraint u ⪰ 0, these constraints are of two types:
Constraints (5.23) express that w_t^{jk} units of j are necessary for obtaining one unit of k, where w_t^{jk} ≥ 0 is a fixed parameter, j refers to a component (if t = 1) or a subpart (if t = 2), and k refers to a subpart (if t = 1) or a product (if t = 2). Note that if j does not enter into the composition of k, one has w_t^{jk} = 0, so that (5.23) reduces to a redundant nonnegativity constraint that can be removed. Constraints (5.24) express that the total quantity of j employed in the various k cannot exceed the available quantity of j.
The relation q_4 = min{q_3, d} can be expressed by the constraints
    q_4 ⪯ q_3 ,    (5.25)
    q_4 ⪯ d .    (5.26)
This section details how the scenario trees with uniform branching factors are built. We focus on the problem of approximating N(0, 1) by a discrete distribution on S points, specified by a support (ε̂_1, ..., ε̂_S) and associated positive probability masses (p̂_1, ..., p̂_S). Indeed, once the support (ε̂_1, ..., ε̂_S) and the probabilities (p̂_1, ..., p̂_S) of the discrete distribution are determined, the scenario tree is made of the S³ distinct realizations of ξ = (ε_1, ε_2, ε_3, d) of the form
    ξ^k = ( ε̂_{i_1}, ε̂_{i_2}, ε̂_{i_3}, [b_0 + b_1 ε̂_{i_1} + b_2 ε̂_{i_2} + b_3 ε̂_{i_3}]_+ ) ,
where the indices i_1, i_2, i_3 are valued in {1, ..., S}, and the probability of the scenario ξ^k is given by p^k = p̂_{i_1} p̂_{i_2} p̂_{i_3}.
Let ε̂ = (ε̂_1, ..., ε̂_S) denote the support of the discrete distribution, treated as an
optimization variable in R^S. We will use the quadratic distortion D² between the discrete
distribution and the target distribution N(0, 1), defined for any ε̂ ∈ R^S as
    D²(ε̂) = E{ min_{1≤i≤S} ||ε̂_i − ε||² } ,    (5.27)
where ε is a random variable following the target distribution N(0, 1). By defining the
cells C_i(ε̂) = {ε ∈ R : ||ε̂_i − ε|| ≤ ||ε̂_j − ε||, 1 ≤ j ≤ S}, whose boundaries have a
null measure under the target probability measure, (5.27) can be written as D²(ε̂) =
∑_{i=1}^{S} ∫_{C_i(ε̂)} ||ε̂_i − ε||² φ(ε) dε, with φ the probability density function of N(0, 1).
If ∇D²(ε̂) = 0, that is,
    ∫_{C_i(ε̂)} (ε̂_i − ε) φ(ε) dε = 0 ,   1 ≤ i ≤ S ,    (5.28)
then ε̂ is called a stationary quantizer. When the distortion is minimized over ε̂ without
constraint, as here where the support of the target distribution is unbounded, a local
minimum of the distortion is a stationary quantizer.
On the real line, the attention can be restricted to points ε̂ such that −∞ < ε̂_1 <
··· < ε̂_S < ∞, since the distortion decreases when a new point distinct from the others is
added to the support of the discrete distribution. Under the convention that ε̂_0 = −∞
and ε̂_{S+1} = ∞, the cell C_i(ε̂) is the closure of the interval ([ε̂_{i−1} + ε̂_i]/2, [ε̂_i + ε̂_{i+1}]/2).
With the univariate normal distribution, which has a strictly log-concave density, a local minimum of D² can be found by Newton's method (Pages and Printems, 2003), and this minimum is also a global minimum (this does not hold in the multivariate case). Optimal solutions ε̂ for the values of S used in the sequel are represented on Figure 5.3. The probabilities reported on the figure are obtained by integrating the normal density over the cells C_i:
    p̂_i = Φ([ε̂_i + ε̂_{i+1}]/2) − Φ([ε̂_{i−1} + ε̂_i]/2) ,
where Φ is the cumulative distribution function (cdf) of N(0, 1). The probabilities have a closed-form expression thanks to the simple domain of integration.
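The following sketch computes a stationary quantizer of N(0, 1) together with the associated cell probabilities. It uses a Lloyd-type fixed-point iteration (each point is moved to the conditional mean of its cell, which is exactly condition (5.28)) rather than the Newton method mentioned above; both approaches target the same stationary points. The function name and the iteration count are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

def stationary_quantizer(S, n_iter=500):
    """Lloyd-type fixed-point iteration for a stationary quantizer of N(0, 1):
    each point is moved to the conditional mean of its cell, as required by the
    stationarity condition (5.28); cell probabilities follow from Phi."""
    pts = norm.ppf((np.arange(S) + 0.5) / S)           # starting points
    for _ in range(n_iter):
        edges = np.concatenate(([-np.inf], (pts[:-1] + pts[1:]) / 2, [np.inf]))
        mass = norm.cdf(edges[1:]) - norm.cdf(edges[:-1])          # p_hat_i
        pts = (norm.pdf(edges[:-1]) - norm.pdf(edges[1:])) / mass  # cell means
    return pts, mass

points, probs = stationary_quantizer(5)
print(points, probs)   # compare with the values displayed in Figure 5.3
```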
Fig. 5.3: Discretizations of N(0, 1) for branching factors from 1 to 10, obtained by minimizing
the quadratic distortion. Values that can be guessed by symmetry are not indicated.
where the inequality holds by Jensen's inequality with the conditional density
φ(ε)/p̂_i. This implies that for a function f with values f(ε, x) convex in ε, one
has, for any fixed x, ∑_{i=1}^{S} p̂_i f(ε̂_i, x) ≤ E{f(ε, x)}. Let x̄ ∈ argmin_x E{f(ε, x)}.
Then it holds that
    min_x ∑_{i=1}^{S} p̂_i f(ε̂_i, x) ≤ ∑_{i=1}^{S} p̂_i f(ε̂_i, x̄) ≤ E{f(ε, x̄)} = min_x E{f(ε, x)} .    (5.29)
The argument can be extended by taking g(y) = E{f_2(ε_2, y)}, with ε_2 a new random
variable independent of ε, and f_2(ε_2, y) defined by
    f_2(ε_2, y) = ⟨c_2, y⟩ + min{ g_2(z) : A_2 y + B_2 z = C_2 ε_2, z ⪰ 0 } ,
where g_2 is convex in z. Given a stationary quantizer for ε_2, with values ε̂_2^j and
probabilities p̂_2^j, j = 1, ..., S_2, it holds by (5.29) and the convexity of f_2 in ε_2 that
    min_{y^i ∈ Y^i(x)} ∑_{j=1}^{S_2} p̂_2^j f_2(ε̂_2^j, y^i) ≤ min_{y^i ∈ Y^i(x)} E{f_2(ε_2, y^i)} ,    (5.32)
where Y^i(x) = {y^i : Ax + B y^i = C ε̂_i, y^i ⪰ 0}. Let x̄ be an optimal solution to
the minimization over x ∈ X of the left-hand side of (5.31). One then obtains the
chain of inequalities
    min_{x∈X} { ⟨c, x⟩ + ∑_{i=1}^{S} p̂_i min_{y^i: Ax+By^i=Cε̂_i, y^i⪰0} { ⟨c_2, y^i⟩
        + ∑_{j=1}^{S_2} p̂_2^j min_{z^{ij}: A_2 y^i+B_2 z^{ij}=C_2 ε̂_2^j, z^{ij}⪰0} g_2(z^{ij}) } }
    ≤ ⟨c, x̄⟩ + ∑_{i=1}^{S} p̂_i min_{y^i: Ax̄+By^i=Cε̂_i, y^i⪰0} { ⟨c_2, y^i⟩
        + ∑_{j=1}^{S_2} p̂_2^j min_{z^{ij}: A_2 y^i+B_2 z^{ij}=C_2 ε̂_2^j, z^{ij}⪰0} g_2(z^{ij}) }
    = ⟨c, x̄⟩ + ∑_{i=1}^{S} p̂_i min_{y^i ∈ Y^i(x̄)} ∑_{j=1}^{S_2} p̂_2^j f_2(ε̂_2^j, y^i)
    ≤ ⟨c, x̄⟩ + ∑_{i=1}^{S} p̂_i min_{y^i ∈ Y^i(x̄)} E{f_2(ε_2, y^i)}
    = min_{x∈X} { ⟨c, x⟩ + ∑_{i=1}^{S} p̂_i min_{y^i ∈ Y^i(x)} g(y^i) }
    ≤ min_{x∈X} { ⟨c, x⟩ + E{ min_{y(ε): Ax+By(ε)=Cε, y(ε)⪰0} E_2{ f_2(ε_2, y(ε)) } } }
    = min_{x∈X} { ⟨c, x⟩ + E{ min_{y(ε): Ax+By(ε)=Cε, y(ε)⪰0} { ⟨c_2, y(ε)⟩
        + E_2{ min_{z(ε,ε_2): A_2 y(ε)+B_2 z(ε,ε_2)=C_2 ε_2, z(ε,ε_2)⪰0} g_2(z(ε, ε_2)) } } } } ,
In Remark 5.3, a class of multistage programs has been identified, for which a single
scenario-tree approximation based on quadratic quantization yields a lower bound on the
exact optimal value of the program.
For this result to hold, the stagewise independence assumption between the random variables ε, ε_2, ..., is essential. The function f(ε, x) in (5.30) has to be convex in ε, preventing us from considering, instead of g(y), a general function g(y, ε), as would be the case if the expectation in the definition of g were conditioned on ε. The only dependence of g on ε is through the value of its argument y, which depends on the realization of ε.
Now, there exists a formulation trick that allows passing the value of ε to functions at subsequent stages. It suffices to extend the decision vector y to the vector y⁺ = (y, y_ε), where y_ε is a dummy decision variable subject to the constraint y_ε = ε. The value of ε can then be passed to the function g through y⁺ itself, and by the same mechanism to any subsequent function inside the nested expectations.
In fact, the multi-product assembly problem described in Section 5.3.1 could be put under that form if (5.20) were replaced by d = b_0 + b_1 ε_1 + b_2 ε_2 + b_3 ε_3. Indeed, in the reasoning of Remark 5.3, the transform δ = Cε − Ax can be extended to δ = Cε + D − Ax, which is also an affine transform of ε but allows fixed right-hand sides when C = 0. With
the extension trick, it is possible to pass the value of ε_1, ε_2, ε_3 to the last stage, and to express d through the linear equality constraint d = b_0 + b_1 ε_1 + b_2 ε_2 + b_3 ε_3.
Unfortunately, the lower bound certificate cannot be extended to the case where d is defined by (5.20): the value of the last stage is convex in d but not in ε_3. We expect, however, that when the conditional probability of having all components of d not truncated is large enough (we refer to the probability P{d ≻ 0} = P{∩_{j=1}^{5} (b_3)_j ε_3 > −λ_j} when λ = (b_0 + b_1 ε_1 + b_2 ε_2) ⪰ 0), one is close to the case where d is affine in ε_1, ε_2, ε_3, and thus close to being able to certify that the quadratic quantization yields a lower bound.
When one or several components λ_j of λ are close to 0 or below, then it is likely that the optimal choice of q_3 will attempt to redirect the assembly to products with the largest expected profit E{|(c_4)_j|(q_4)_j − (c_3)_j (q_3)_j}, and thus to favor products with a larger conditional expected demand, which happen to be the products that follow the affine demand model more closely, potentially diminishing the impact of a discretization bias in the wrong direction. By bias in the wrong direction, we mean this: If we were able to dynamically adjust a quantizer for the distribution of the components d_j to make it stationary given the values of ε_1 and ε_2, so as to take the expectation over d rather than ε_3, then the values of the adjusted quantizer would be greater than the values of the fixed quantizer induced by the fixed quantization of ε_1, ε_2, ε_3, which neglects the truncation of d at 0.
Empirically, on our problem data, the optimal value of the scenario-tree approxima-
tions with uniform branching factor S = 1, 2, . . . increases with S and stabilizes at a
certain level for higher values of S. This strongly suggests that on our problem data, the
quadratic quantization approach consistently provides lower bounds on the value of the
exact multistage program (Table 5.3). The time taken by the numerical optimization
algorithm for solving the successive approximations has also been indicated on Table 5.3,
so as to provide an indication of the increasing difficulty of solving programs posed on
larger scenario trees.
As already mentioned, the present problem is simple enough to let us simulate shrinking-horizon policies on mutually independent test scenarios. We considered one cpu day as the time limit beyond which the simulation time of one policy on 10^4 scenarios is not acceptable. Simulation results for 4 shrinking-horizon policies on a fixed test sample of 10^4 scenarios are reported on Table 5.4 (page 100). The average cpu time for the evaluation of the sequence of decisions on one new scenario is also indicated on Table 5.4, clearly illustrating the growing complexity of simulating shrinking-horizon policies. The policy with branching factor 7 takes 6.5 seconds per scenario, that is, 6.5·10^4/(3600·24) ≈ 0.75 days to be evaluated on the test sample.
The reported empirical averages on the test sample are our estimate for the expected
cost of the policies. The standard error, defined as the standard deviation of the costs
on the test sample divided by the square root of the test sample size, indicates the order
of magnitude of our uncertainty about the true value of the policies as solutions to the
multistage program.
The apparent plateau of performance beyond a branching factor of 5 suggests that
the shrinking-horizon policy with branching factor 5 already attains performances that
are almost optimal, and this is confirmed by comparing the empirical average on the test
sample to the lower bounds of Table 5.3, in particular the best bound obtained on the
single program with the largest scenario tree (branching factor 10).
Remark 5.4. As the same test sample is used for each policy, the difference of
costs between pairs of policies should be significant enough to allow us to rank
the various policies reliably. On Table 5.9 (page 101), the reported standard error
is the standard deviation, on the test sample, of the difference of costs between
each pair of policies considered in the section, divided by the square root of the
test sample size. Thus, for instance, a confidence interval for the difference of
average cost between shrinking-horizon policies with branching factors 3 and 5
could be built by considering that the estimator for the difference is approximately
normally distributed with a standard deviation of 0.70. For some pairs of policies,
the standard errors reported in Table 5.9 are larger, but then they correspond to
policies with a larger difference in their empirical performance.
In general, the uncertainty about the true value of the difference of expected costs
among policies appears to be considerably smaller than the uncertainty about the
level of the expected cost itself, and actually small enough to justify with hindsight
the choice of the test sample size for ranking the policies. With a test sample 4
times larger, we would be able to improve our statistical estimates by a factor of
2, but then 4 cpus would be needed to simulate the shrinking-horizon policies on
the test sample in less than one day.
Remark 5.5. The shrinking-horizon policy with branching factor 1 (that uses a
single scenario to represent the future, corresponding to the mean scenario condi-
tionally to the information state, and thus implements a Model Predictive Control
approach) is already far better than a two-stage approximation strategy, which would consist in optimizing the production decisions q_1, v_2, q_2, v_3, q_3 once and for all, and adapting only the last-stage decision q_4 to the observed demand.
When simulating such a policy on the test sample, using a first-stage decision computed on the scenario tree with branching factor 10 (and simply imposing that the decisions q_1, v_2, q_2, v_3, q_3 are common to every scenario), we obtain an empirical cost equal to −261.39 (standard deviation of the estimate: 6.15), far worse than the value −305.48 of the simplest shrinking-horizon policy. Such a test confirms the interest of taking into account the available information on the demand and of adjusting the production process online. It also allows one to quickly compute a lower bound on the value of multistage stochastic programming (VMS): the VMS can be estimated as at least the difference of performance between the simplest shrinking-horizon policy and the policy based on the two-stage approximation.
In the following experiments, we test policies that are built with the data extracted
from a given single scenario-tree approximation solved to optimality (the optimal value
of which is already reported in Table 5.3). We consider 3 such data sets, namely,
the ones obtained with branching factors 3, 5, and 7 respectively. Larger data sets are
advantageous from the statistical learning point of view, and at the same time they
provide better recourse decision examples, due to the finer discretization of the random
process used in the approximate stochastic programs.
The first-stage decision of the learned policies is exactly that of the corresponding shrinking-horizon policy. The policy for the last-stage decision is always set to the optimal policy with decisions q_4 = min{q_3, d} ∈ R^5. It remains to learn a mapping π_1 from ε_1 ∈ R to q_2 ∈ R^8, and a mapping π_2 from (ε_1, ε_2) ∈ R² to q_3 ∈ R^5. Indeed, once q_{t−1} and q_t are determined, the value of v_t can be deduced by solving a simple optimization program, which has to be solved anyway to ensure that a predicted decision q̂_t is feasible, and to repair it if necessary.
First, we test the simple approach described in Section 5.2. We estimate the mean and the covariance matrix of a joint Gaussian model for (ε_1, ε_2, ε_3, q_1, q_2, q_3) from the considered data set. The value of the parameter ε′ in (5.14) is set to 0.01 in all the experiments. The predicted conditional densities of q_2 given (ε_1, q_1), and of q_3 given (ε_1, ε_2, q_2), are computed with the conditioning formulae (5.12). The decisions are then inferred by solving programs of the form (5.7), as described in Section 5.1.2. The optimized variables are q_t and v_t, structured by the constraints (5.23), (5.24).
The performances of those policies are reported on Table 5.5. The branching factor identifies the data set from which the policies are learned.
The performances of the learned policies are worse than those of the corresponding shrinking-horizon policies reported in Table 5.4, but already much better than the score of the policy with a fixed optimized production plan described in Remark 5.5.
Next, we test the nonparametric approach described in Section 5.2.2. Experiments were limited to the case of a radial basis kernel with a common bandwidth parameter r > 0 set beforehand for each component of the decision vectors. For the components of the predictive conditional mean of q_2, we used the kernel with values
    k(ε_1^i, ε_1^j) = exp{ −(ε_1^i − ε_1^j)² / 2r² } ,
and for the components of the predictive conditional mean of q_3, we used the kernel with values
    k′(ε_1^i, ε_2^i, ε_1^j, ε_2^j) = exp{ −∑_{t'=1}^{2} (ε_{t'}^i − ε_{t'}^j)² / 2r² } = k(ε_1^i, ε_1^j) · k(ε_2^i, ε_2^j) .
We did not try to determine the best value of the bandwidth parameter r from
the data set, but rather tested the resulting policies on the test sample. The jitter
parameter σ 2 that enters the expression of the predictive conditional means (5.15) was
always set to 0.01.
The performance of the policies with the best found value of r — which depends on
the size of the data set from which the policy is learned — are reported in Table 5.6.
If we compare the results of Table 5.6 to the results of Table 5.5, we observe that on a
same training set (identified by the branching factor), the selected policy based on the
Gaussian Process model is better, in the case of branching factors 3 and 7, than the
corresponding policy based on the joint Gaussian model, and in fact a lot better with the
branching factor 3, corresponding to the smallest studied training set. On the training
set with the branching factor 5, however, the policy based on the joint Gaussian model is
better. In fact, that latter policy seems to dominate the 3 policies of Table 5.6, suggesting
that the simple approach based on the joint Gaussian model was worth investigating.
Finally, we tested the idea of emulating input-dependent bandwidth choices by using the radial basis kernels above evaluated on the transformed inputs Φ(ε_t), where Φ is the cumulative distribution function of N(0, 1). In fact, since each ε_t follows N(0, 1), it holds that Φ(ε_t) is uniformly distributed on the interval [0, 1]. It seems then wise to use a constant bandwidth r on this transformed input space, rather than on the original input space.
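A possible implementation of such a kernel on the transformed input space is sketched below; the explicit form (a radial basis kernel applied to Φ(ε_t)) and the bandwidth value are assumptions consistent with the description above, not formulas taken verbatim from the experiments.

```python
import numpy as np
from scipy.stats import norm

def kernel_transformed(eps_a, eps_b, r=0.1):
    """Radial basis kernel evaluated on the transformed inputs Phi(eps_t), which
    are uniformly distributed on [0, 1] when eps_t ~ N(0, 1); keeping a constant
    bandwidth r on this scale emulates an input-dependent bandwidth on the
    original scale."""
    ua = norm.cdf(np.atleast_1d(eps_a).astype(float))
    ub = norm.cdf(np.atleast_1d(eps_b).astype(float))
    return float(np.exp(-np.sum((ua - ub) ** 2) / (2 * r ** 2)))

# similarity between a new factor eps* = 0.3 and a stored example eps^k = 2.0
print(kernel_transformed(0.3, 2.0))
```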
The performance of the policies with the best found value of r — which happened to
be independent of the size of the data set from which the policy is learned — are reported
in Table 5.7. If we compare the results of Table 5.7 to the results of Table 5.6, we observe
Tab. 5.5: Results for policies based on the joint Gaussian model.
Tab. 5.6: Results for policies based on the Gaussian Process model.
Tab. 5.7: Results for policies based on the Gaussian Process model on the transformed input space.
Tab. 5.8: Gaussian Process with a transformed input space and a fast repair procedure.
Tab. 5.9: Standard errors for the differences of costs between pairs of policies on the test sample.
that on a same training set, the policies using the kernel on the transformed space are
significantly better than the policies using the kernel on the original input space.
Therefore, these experiments illustrate that the performances of the policies based on
the Gaussian Process model are sensitive to the choice of the kernel. Depending on the
efforts that one is ready to make to test different choices of kernels, one can thus expect to obtain good policies with the Gaussian Process model, perhaps even with small data sets, as was the case here with the training set relative to branching factor 3.
Discussion.
In terms of optimality, the results obtained here suggest that the learned policies are
able to attain performances that are quite decent with respect to the shrinking-horizon
policies. With trees of branching factor 3, for instance, the policy based on Gaussian
Process regression (with a good choice for the kernel) attains an average cost of about
-360 on the test sample, while the corresponding shrinking-horizon policy attains -370.
In terms of simulation times, with our Matlab implementation that calls cvx for
formulating and solving all programs, the learned policies are penalized by the need
to repair the predictions by solving a quadratic program, and the simulation times are
thus similar to the time taken by simulating the shrinking-horizon policy with branching
factor 1.
These results led us to try to replace the generic MAP repair procedure of Section 5.1 by a problem-specific, faster heuristic. In the present context, a possible heuristic consists in fixing an ordering of the components of q_t a priori, and then using the stocks q_{t−1} as needed to reach the nominal level (q̂_t)_j predicted by the learned policy, or a lower level if one needed component of q_{t−1} gets depleted. The priority order is a hyper-parameter of the repair procedure that can be tested; our prior belief is that products with a higher profit per unit should be given a higher priority of access to the available stocks of components.
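The following sketch illustrates one possible implementation of this priority-based repair heuristic. The function signature, the representation of the bill-of-materials matrix w, and the example data are hypothetical, and the procedure actually used in the experiments may differ in its details.

```python
import numpy as np

def fast_repair(q_hat, stock, w, priority):
    """Priority-based repair: scan the outputs j in the given order, assemble at
    most the nominal level q_hat[j] predicted by the learned policy, and lower
    it whenever a required input from the previous stage gets depleted.
    stock : available quantities (previous-stage decision q_{t-1});
    w     : w[i, j] = units of input i needed per unit of output j."""
    q = np.zeros(len(q_hat))
    stock = np.asarray(stock, dtype=float).copy()
    for j in priority:
        need = w[:, j]
        cap = np.min(np.where(need > 0, stock / np.maximum(need, 1e-12), np.inf))
        q[j] = max(0.0, min(q_hat[j], cap))            # achievable level for output j
        stock -= q[j] * need                           # consume the allocated inputs
    return q

# toy example: repair a predicted q_3 given two input stocks and two rows of w
w3 = np.array([[0, 1.223, 0.6367, 0, 0], [0, 0, 0, 1.111, 0]])
q3 = fast_repair(np.array([3., 1., 2., 1., 1.]), np.array([2.5, 1.0]), w3,
                 priority=[1, 2, 0, 3, 4])
```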
On the test sample, this new repair procedure combined with the Gaussian model
turns out to degrade the performance of the policy considerably. But combined with the
Gaussian process model, the performance is maintained (with the best found ordering for
the repair procedure), suggesting that the predictions of the Gaussian Process model are
precise enough to mitigate the potential inaccuracies of the repair procedure (Table 5.8).
Remark 5.6. It is a recurrent observation on our tables that the policies learned
from the data set with branching factor 5 slightly dominate those learned with
branching factor 7. One possible explanation is that despite its smaller cardinality,
the first data set contains better examples of decisions. In particular, the first-stage
decision may be better, or at least more robust to inaccuracies in the subsequent
recourse decisions. In fact, we have often observed that in two-stage programs,
the exact value of the first-stage decision optimal with respect to an approximate
program built with a deterministic method can actually be degraded by using more
discretization points, by a simple effect of luck in the selection of the values.
We can now claim that the best learned policy for our problem is the middle policy of
Table 5.8. Thanks to the high efficiency in the evaluation of this learned policy with the
fast repair procedure, we are able to test the policy on a new, independent test sample of 10^6 scenarios.
The empirical average of the cost of the policy on this new test sample is −371.87, estimated with standard error 0.63. The simulation of the policy on the new independent test sample takes about 15 minutes in cpu time. With a confidence of approximately 95 %, the exact value of the selected policy lies in the 2-standard-error interval [−373.14, −370.61].
5.4 Conclusions
In this chapter, alternative methods for learning policies from data sets of scenario-
decisions pairs were explored, especially methods based on Gaussian Process regression.
The framework of Gaussian Processes was found attractive for several reasons: the pre-
dictions are relatively easy to compute (with small data sets, or in fact with kernels that
induce sparse Gram matrices), and are not based on probabilistic assumptions concerning
the way the scenarios of a data set were generated, in particular independence assump-
tions. This last observation is important, because the scenarios of a data set usually
come from a scenario tree built by conditional sampling or by deterministic methods,
and as such, are not independent. It is also true that the sequence of decisions associated
to a scenario actually depends, through the optimization of the decisions, on the other
scenario/decisions pairs present in the tree, so that we may be far from a situation where
each scenario/decisions pair in the data set could be viewed as generated independently
from some unknown probability distribution. Our case study suggests that Gaussian processes can be combined gracefully with scenario-tree generation methods, with choices guided by knowledge of the way inputs were distributed or generated.
The MAP repair procedure expounded in the beginning of the chapter is generic, but complicates the online evaluation of a learned policy. In the next chapter, we review in detail the theory of Euclidian projections, and investigate to what extent it is possible to accelerate the algorithm that computes that kind of projection mapping by exploiting a data set of examples of projections already computed.
Nevertheless, our experiment in the present chapter seems to show that when the
feasibility sets are described by many constraints, it is better, from the point of view of
the computational complexity, to try to tailor a simple heuristic to restore the feasibility
of the decisions and obtain policies that are simple to evaluate, than to resort to a generic
procedure based on online optimization.
Tab. 5.10: Multi-product assembly problem: Values of the parameters in Table 5.2.
    c1 = [ 0.25  1.363  0.8093  0.7284  0.25  0.535  0.25  0.25  0.25  0.4484  0.25  0.25 ]^T
    c2 = [ 2.5  2.5  2.5  2.5  13.22  2.5  3.904  2.5 ]^T
    [w3] =
        [ 0       1.223   0.6367  0       0
          0       0       0       1.111   0
          0       0       0.4579  0       0
          0       0.1693  0.6589  0       0
          0.5085  2.643   0       0       0
          0.4017  0       0       0       0
          0       0.7852  85.48   0       0
          0       0       0       0.806   0.5825 ]
    c4 = [ −21.87  −98.16  −31.99  −10  −10 ]^T
    b0 = [ 13.9  12.86  18.21  10.14  17.21 ]^T
    b1 = [ 9.708  9.901  7.889  4.387  4.983 ]^T
    b2 = [ 2.14  6.435  3.2  9.601  7.266 ]^T
    b3 = [ 4.12  7.446  2.679  4.399  9.334 ]^T
Chapter 6
Learning Projections on Random Polyhedra
Notations.
• B = {z : ||z|| ≤ 1} is the closed unit ball in Rn with n understood from the context.
• For a scalar ρ and a set B, ρB stands for the set {ρv : v ∈ B}.
assuming that the parameter is the realization x(ω) ∈ Rs of a random variable x drawn
from some unknown but fixed probability distribution, and that A ∈ Rs×m is a fixed
matrix.
We are interested in the prediction of the optimal solution y ∗ (ω) to P(x(ω)), given
x(ω), assuming that we know P (we do not have to estimate A, for instance). This
problem could be addressed from a machine learning point of view by trying to learn
a hypothesis h in some hypothesis space H that approximates well the optimal solu-
tion y ∗ (ω), in the sense that the distance between h(x(ω)) and the feasibility set
    C(x(ω)) := {y ∈ R^m : Ay ⪯ x(ω)}    (6.2)
solution of P(x(ω)) for a sequence of realizations of x, that is, in some sense, build a
self-improving algorithm (Ailon et al., 2006).
In the sequel, we assume that x(ω) is valued in the set
    dom C := {x ∈ R^s : C(x) ≠ ∅} ,    (6.3)
becomes, with a suitable change of variables, the problem of projecting (with respect to
the Euclidian metric) the origin 0 on some polyhedral set.
    minimize  (1/2) y^T y − (1/2) u^T S^{−1} u   subject to  F R^{−1} y ⪯ v + F S^{−1} u ,
where the constant term −uT S −1 u/2 can be dropped. Hence the program on z is equiv-
alent to the evaluation of the Euclidian projection of 0 ∈ R^m on the set C(x) = {y ∈ R^m : Ay ⪯ x} with A = F R^{−1} and x = v + F S^{−1} u. Assuming that (6.4) is feasible and
thus C(x) is nonempty, the optimal solution z ∗ to (6.4) is recovered from the optimal
solution y ∗ using z ∗ = R−1 y ∗ − S −1 u.
can be recast as the parametric program P(x(ω)) by setting x(ω) = v(ω) + F S −1 u(ω)
and A = F R−1 in (6.1), where S = RT R is the Cholesky factorization of S.
Let us start by recalling some useful geometrical facts about Euclidian projections on
convex polyhedral sets (Rockafellar and Wets, 1998, Example 6.16, Theorems 6.9 and
6.46, Proposition 6.17). Figure 6.1 provides a visual support to the following definitions.
[Figure 6.1: a convex set C, a point ȳ on its boundary, the normal cone N_C(ȳ), and a point y.]
When C is a nonempty closed convex set, PC is single-valued, in the sense that PC (y) is
a singleton.
6.6 Proposition. Let C = {y ∈ R^m : Ay ⪯ b}, where A is a matrix with rows a_i^T. For ȳ ∈ C, let I(ȳ) = {i : a_i^T ȳ = b_i} denote the set of active constraints at ȳ. Then the normal cone to C at ȳ is given by
    N_C(ȳ) = { ∑_{i ∈ I(ȳ)} λ_i a_i : λ_i ≥ 0 } .
The relation between the normal cone and the Euclidian projection mapping is given
in the following proposition, only valid for closed convex sets.
6.7 Proposition. For a closed convex set C ⊂ Rm , every normal vector is a proximal
normal vector: d ∈ NC (ȳ) iff ȳ ∈ PC (ȳ + d), where in fact ȳ = PC (ȳ + d).
with [ȳ] reduced to the singleton {ȳ} when ȳ is in the relative interior of C — the relative
interior of a nonempty convex set C corresponds to the interior of C when C is viewed
as a subset of the smallest linear space containing C (the affine hull of C).
CI = {y ∈ Rm : aTi y = bi , i ∈ I} = {y ∈ Rm : AI y = bI } .
Hoffman’s lemma (Hoffman, 1952), stated next, shows that the Euclidian distance
d(y, C) = ||y − PC (y)|| from any point y to a polyhedral set C can be related to a
measure that does not depend on PC (y).
6.10 Proposition. Let C(b) = {y ∈ R^m : Ay ⪯ b}. Let b_0 and b_1 be two vectors such that C(b_0) and C(b_1) are nonempty. Then d_h(C(b_0), C(b_1)) ≤ κ(A) ||b_0 − b_1||.
Remark 6.2. Observe that having A constant is essential in Proposition 6.10. For
instance, consider the set-valued mapping C 0 : R ⇒ R2 with values
Now, coming back to the parametric program (6.1), we observe that Hoffman's lemma makes it possible to prove that if x in (6.1) follows a distribution having a compact support, then the projection of the origin 0 ∈ R^m on the random polyhedral set C(x) defined by (6.2) lies in a bounded set.
6.11 Proposition. Let C : Rs ⇒ Rm be the set-valued mapping with values C(x) defined
by (6.2). If x follows a probability distribution with compact support, then there exists
a finite κ̄ > 0 such that the projection y ∗ (ω) of the origin on the polyhedral set C(x(ω))
satisfies ||y ∗ (ω)|| ≤ κ̄ for all possible realizations x(ω) of x.
Proposition 6.11 shows that if one wants to try to predict from x(ω) a “nearly feasible”
optimal solution y ∗ (ω), with x drawn from a distribution with compact support, then
one could legitimately select a hypothesis space H of bounded functions.
A possible way to draw random points x(ω) ∈ dom C is thus to draw a random linear combination of vectors forming an orthonormal basis for the range of A, and then add to the resulting vector a random vector with nonnegative components.
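A minimal sketch of this sampling recipe is given below, assuming A has full column rank; the distributions chosen for the coefficients and for the nonnegative slack vector are arbitrary illustrative choices.

```python
import numpy as np

def sample_dom_C(A, n_samples, rng=None):
    """Draw parameters x in dom C = {x : {y : A y <= x} is nonempty}, following
    the recipe above: a random element of the range of A plus a random vector
    with nonnegative components (assumes A has full column rank)."""
    rng = np.random.default_rng() if rng is None else rng
    Q, _ = np.linalg.qr(A)                             # orthonormal basis of range(A)
    coeffs = rng.standard_normal((n_samples, Q.shape[1]))
    slack = np.abs(rng.standard_normal((n_samples, A.shape[0])))
    return coeffs @ Q.T + slack                        # each row is a valid parameter x

xs = sample_dom_C(np.array([[0., -1.], [-1., 1.], [-1., 0.], [-1., -2.]]), 5)
```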
Although we do not directly invoke it in the sequel, for completeness we also recall
the following structural property:
For the notion of extended-real-valued function used in the following proof, see Ap-
pendix A.1.
Proof. The program P(x) amounts to the minimization of the extended-real-valued function f̄ defined by f̄(x, y) = ||y||²/2 if Ay ⪯ x, and f̄(x, y) = ∞ otherwise. We check that f̄(x, y) is jointly convex in x, y. Let us write x_t = (1 − t)x_0 + t x_1 and y_t = (1 − t)y_0 + t y_1 for 0 < t < 1. If f̄(x_0, y_0) and f̄(x_1, y_1) are finite, implying Ay_0 ⪯ x_0, Ay_1 ⪯ x_1, then f̄(x_t, y_t) is also finite, since Ay_t ⪯ x_t and f̄(x_t, y_t) = ||y_t||²/2 ≤ (1−t)||y_0||²/2 + t||y_1||²/2 = (1−t)f̄(x_0, y_0) + t f̄(x_1, y_1). If f̄(x_0, y_0) = ∞ or f̄(x_1, y_1) = ∞, the convexity inequality f̄(x_t, y_t) ≤ (1−t)f̄(x_0, y_0) + t f̄(x_1, y_1) = ∞ for 0 < t < 1 is trivially verified. Hence f̄(x, y) is convex in (x, y) (Rockafellar, 1970, Theorem 4.1). As a convex set, the epigraph of f̄ defined by epi f̄ = {(x, y, α) ∈ (R^s × R^m) × R : α ≥ f̄(x, y)} has its projection on its component R^s × R convex as well. The function g(x) = inf_y f̄(x, y), whose epigraph is epi g = {(x, α) ∈ R^s × R : (x, y, α) ∈ epi f̄ for some y}, is thus convex.
Now, for x(ω) ∈ dom C, the program P(x(ω)) has a single minimizer y*(ω) corresponding to the projection of 0 ∈ R^m on the convex polyhedral set C(x(ω)). By Proposition 6.7, setting C = C(x(ω)), a point y ∈ R^m is thus optimal if the vector 0 − y = −y lies in the normal cone to C at y, that is, −y ∈ N_C(y), or equivalently 0 ∈ y + N_C(y).
Given the optimal solution y*(ω) to P(x(ω)), it is easy to describe sets of nearly optimal solutions, called ε-optimal solutions (see Appendix A, Section A.4). To this end, let us recall the notion of tangent cone to an arbitrary set C (Rockafellar and Wets, 1998, Definition 6.1, Theorem 6.9). A vector d ∈ R^m is tangent to C at a point ȳ ∈ C if there exist sequences y^ν ∈ C with y^ν → ȳ and scalars τ^ν ↓ 0 such that
    (y^ν − ȳ)/τ^ν → d .
The set of all such vectors d is a closed cone, possibly reduced to the singleton {0}, called the tangent cone to C at ȳ, and written T_C(ȳ). In the particular case where C is a convex subset of R^m, the tangent cone to C at ȳ is a convex set given by
    T_C(ȳ) = cl{ τ (y − ȳ) : y ∈ C, τ ≥ 0 } .
6.15 Proposition(?). The sets of ε-optimal solutions S_ε(ω) to P(x(ω)) satisfy two properties, expressed with respect to the exact optimal solution y*(ω) and the set C(x(ω)):
The following proof relies on standard arguments — see for instance Dontchev and
Rockafellar (2009, Theorem 2E.3).
Proof. To lighten the notation, we write S_ε for S_ε(ω), C for C(x(ω)), and y* for y*(ω). The set S_ε = ε-argmin_{y∈C} f(y) is given by
There is no feasible vector y with ||y|| < ||y ∗ ||, whereas ||y|| = ||y ∗ || entails y = y ∗
by the strict convexity of f (y), hence the first part of the proposition. On the other
hand, the feasibility set C is described by a finite number of constraints, so that in a
sufficiently small neighborhood of y ∗ , say R0 , there is no new constraint that becomes
active: I(y) ⊂ I(y*) for y ∈ R_0 ∩ C. As the constraints are linear, C can be approximated locally by the set C_{y*} = {y ∈ R^m : a_i^T y ≤ b_i, i ∈ I(y*)}. Since a_i^T y* = b_i for i ∈ I(y*), we have C_{y*} = {y* + d : a_i^T d ≤ 0, i ∈ I(y*)} = y* + T_C(y*).
Remark 6.3. Having the set C polyhedral is important in the proof of Proposition 6.15. If the set C were not polyhedral (it can still be convex), there would not necessarily exist a neighborhood R_0 of y* in which a proper inclusion of C ∩ R_0 in [y* + T_C(y*)] ∩ R_0 can be precluded. The local approximation at y* of the set C by the set y* + T_C(y*) could thus include infeasible points. For example, for C = {(x, t) ∈ R² : t ≥ |x| + x²}, one has T_C(0) = {(x, t) ∈ R² : t ≥ |x|}, and consequently (T_C(0) \ C) ∩ εB ≠ ∅ for any ε > 0.
The following proposition relies on duality theory (Rockafellar and Wets, 1998, Chap-
ter 11).
6.16 Proposition(?). The dual of P(x(ω)) corresponds, after a sign change, to the program
    D(x(ω)) :   minimize  (1/2) λ^T A A^T λ + x(ω)^T λ   subject to  λ ⪰ 0 .
Proof. The Lagrangian for P(x(ω)) is L(y, λ) = (1/2)||y||² + λ^T(Ay − x(ω)), with λ ⪰ 0. The infimum of L(y, λ) over y is attained at ȳ = −A^T λ. Hence the dual function g(·) = inf_y L(y, ·) has values g(λ) = −(1/2) λ^T A A^T λ − x(ω)^T λ. The dual formulation is obtained by maximizing g(λ) subject to λ ⪰ 0.
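As an illustration, the following sketch computes the projection of the origin on C(x) by solving this bound-constrained dual with a generic solver and recovering the primal solution through y* = −A^T λ*. The routine name and the choice of solver are illustrative, and the data reproduce Example 6.1 given later in this section.

```python
import numpy as np
from scipy.optimize import minimize

def project_origin(A, x):
    """Euclidean projection of the origin on C(x) = {y : A y <= x}, obtained by
    solving the dual  min_{lam >= 0} 0.5 lam' A A' lam + x' lam  with a generic
    bound-constrained solver, then recovering y* = -A' lam* (assumes C(x) nonempty)."""
    G = A @ A.T
    fun = lambda lam: 0.5 * lam @ G @ lam + x @ lam
    jac = lambda lam: G @ lam + x
    res = minimize(fun, np.zeros(A.shape[0]), jac=jac, method="L-BFGS-B",
                   bounds=[(0.0, None)] * A.shape[0])
    return -A.T @ res.x, res.x

# data of Example 6.1 (left panel); the projection should be close to (2, 4)
A = np.array([[0., -1.], [-1., 1.], [-1., 0.], [-1., -2.]])
x = np.array([-4., 2., -2., 0.])
y_star, lam_star = project_origin(A, x)
print(y_star)
```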
Given y ∗ (ω), it is often possible to obtain a solution to the dual problem, as shown
by the following proposition. Note that from now on, when ω or x(ω) is clear from the
context, we freely write C for C(x(ω)), and y ∗ for the optimal solution y ∗ (ω) to P(x(ω)).
We will also freely write x for its realization x(ω).
Proof. Having y* ∈ C optimal means −y* ∈ N_C(y*), that is, there exists at least one vector λ ∈ R^s such that
    y* + ∑_{i=1}^{s} λ_i a_i = 0 ,   λ ⪰ 0 ,   λ_i = 0 if i ∉ I(y*) ,
conditions that can be identified with the Karush-Kuhn-Tucker conditions
    ∇f(y*) + A^T λ = 0 ,   Ay* ⪯ x ,   λ ⪰ 0 ,   λ_i (a_i^T y* − x_i) ≥ 0 ,
with multipliers λ_i optimal for the dual problem. Let A_I ∈ R^{p×m} be the submatrix of A with rows a_i^T, i ∈ I(y*), p = |I(y*)|, so that the subvector λ_I ∈ R^p of λ stacking the possibly nonzero elements λ_i, i ∈ I(y*), has to satisfy y* + A_I^T λ_I = 0. If the rows of A_I are linearly independent (a constraint qualification which always holds for p = 1 and never holds for p > m), then
    λ_I = −(A_I A_I^T)^{−1} A_I y* = −(A_I A_I^T)^{−1} x_I .
Recall that the null space of A_I^T can be described as the span of the eigenvectors associated to the zero eigenvalues of (A_I A_I^T).
Uniquely determined multipliers do not always exist: this is consistent with the ob-
servation that the dual problem can have a continuum of optimal solutions if the matrix
(AAT ) in Proposition 6.16 is only positive semi-definite.
Remark 6.4. A property of the objective function f that facilitated the develop-
ments in Proposition 6.17 is the expression of its gradient ∇f (y) = y. The solution
to the inversion of the generalized equation u ∈ ∂f (y), where ∂f (y) is the subgra-
dient of f evaluated at y, is then simply y = u. We recall that in general, when f is
a proper lower-semicontinuous convex function, u ∈ ∂f (y) if and only if y ∈ ∂f ∗ (u)
with f ∗ (u) = supy {uT y −f (y)} (Rockafellar and Wets, 1998, Proposition 11.3).
where I(y*) = {i : a_i^T y* = x_i} is the set of active constraints at y*, and where we define N_i(I) = {0} if i ∈ I and N_i(I) = [0, ∞) if i ∉ I. The inclusion can be refined by considering, instead of I, the index set I⁺ = {i : λ_i > 0} ⊂ I(y*) of the positive KKT multipliers associated to the active constraints at y*.
    S^{−1}(y*) = {x ∈ R^s : x_i = a_i^T y* if i ∈ I⁺, x_i ∈ [a_i^T y*, ∞) if i ∉ I⁺} ,
where I⁺ = {i : λ_i > 0} is the index set of active constraints with positive multipliers.
Example 6.1. Propositions 6.17 and 6.18 can be illustrated on a numerical example (Figure 6.2). Let A have 4 rows a_1^T = [0 −1], a_2^T = [−1 1], a_3^T = [−1 0], a_4^T = [−1 −2]. Let x have the value x(ω_1) = [−4 2 −2 0]^T. The optimal solution to P(x(ω_1)) is y*(ω_1) = [2 4]^T. The set of active constraints is I(y*(ω_1)) = {1, 2, 3}, meaning that 3 hyperplanes meet at y*(ω_1). The matrix A_I has the 3 rows a_1^T, a_2^T, a_3^T. The optimality condition is y*(ω_1) = −A_I^T λ_I.
Fig. 6.2: Left: Pathological case x(ω_1) for which the dual D(x(ω_1)) has several optimal solutions described in the example (see text). Right: Case x(ω_2) where the dual problem D(x(ω_2)) has a single optimal solution. The primal problems P(x(ω_1)), P(x(ω_2)) have the same unique optimal solution y* = (2, 4) ∈ R². The dashed line indicates the minimal distance between the origin and the set C_i = {y ∈ R² : Ay ⪯ x(ω_i)}.
The set of solutions for λ_I is Λ_I = {(4 + µ, µ, 2 − µ)^T : 0 ≤ µ ≤ 2}.
Note that a numerical solution algorithm applied to the dual problem could return
any particular solution λ ∈ ΛI × {0}. The solutions corresponding to µ = 0 and
µ = 2 are λ = [ 4 0 2 0 ]T and λ = [ 6 2 0 0 ]T respectively. The zero
elements of the solutions indicate that y*(ω_1) is still optimal when the right-hand sides x_i of the corresponding constraints are increased (Proposition 6.18).
Remark 6.5. Proposition 6.18 has formalized an invariance property with respect to
a subset of translations of the input x, where the subset depends on the output y ∗ .
In the perspective of using supervised learning to predict nearly feasible optimal
solutions, invariance properties could be used as a means to obtain virtual samples
(xν , y ν ) with xν ∈ S −1 (y ν ), or can be embedded in learning algorithms to improve
generalization abilities from prior knowledge (Decoste and Schölkopf, 2002).
The next proposition shows that from a single pair (x̄, ȳ ∗ ) with ȳ ∗ optimal for P(x̄),
it is sometimes possible to predict the optimal solution y ∗ (ω) for parameters x(ω) in a
neighborhood of x̄. The size of the neighborhood is estimated in the proof, and is related
to the smallest singular value of the matrix AI defined in Proposition 6.19.
6.19 Proposition. Let ȳ be the optimal solution to the program P(x̄). Let A I ∈ Rp×m ,
p = |I(ȳ)|, be the submatrix of A stacking the rows aTi of active constraints i ∈ I(ȳ), and
for a vector x ∈ Rs , let xI ∈ Rp be the subvector of x stacking the elements xi , i ∈ I(ȳ).
If the rows of AI are linearly independent and if (AI ATI )−1 x̄I ≺ 0, then there exists a
neighborhood Q of x̄ such that for all x(ω) ∈ Q ∩ dom C, the optimal solution to P(x(ω))
is given by y ∗ (ω) = ATI (AI ATI )−1 xI (ω).
We choose Q0 = {x̄ + η0 u : ||u|| < 1} and R0 = {ȳ + (d0 /2) v : ||v|| < 1}. Then,
the distance of ȳ to any hyperplane {y : aTj y = xj (ω)}, j ∈ J, is greater than d0 /2
whenever x(ω) ∈ Q0 ∩ dom C, and y ∈ R0 ∩ C(x(ω)) is separated from the hyperplanes
{y : aTj y = xj (ω)} for j ∈ J. Hence j 6∈ I(y) and thus I(y) ⊂ I(ȳ) (no new active
constraints).
Next, we claim that if the rows aTi for i ∈ I(ȳ) are linearly independent, and if
(AI ATI )−1 x̄I ≺ 0, then there exists a neighborhood Q ⊂ Q0 of x̄ such that I(y ∗ (ω)) =
I(ȳ) whenever x(ω) ∈ Q ∩ dom C, where y ∗ (ω) denotes the optimal solution to P(x(ω)).
It is sufficient to show that whenever x(ω) ∈ Q∩dom C, y ∗ (ω) lies in R0 , and any optimal
λ∗i (ω), i ∈ I(ȳ), associated to y ∗ (ω) is positive, as λ∗i > 0 entails i ∈ I(y ∗ (ω)). Since the
rows of A_I are linearly independent, the vector
    λ̄_I = −(A_I A_I^T)^{−1} x̄_I
is the only vector of possibly nonzero multipliers associated to ȳ (the reference optimal solution). Let us replace the dual problem D(x(ω)) by a problem on the reduced set of variables δ_I ∈ R^p with λ_I(ω) = λ̄_I + δ_I, I = I(ȳ), namely,
If we relax the constraint δ_I ⪰ −λ̄_I, and if we set x(ω) = x̄ + ∆x(ω), the optimality condition for the resulting problem is ∇g_I(δ_I*) = 0, and its optimal solution is
    δ_I* = −(A_I A_I^T)^{−1} (A_I A_I^T λ̄_I + x_I(ω)) = −(A_I A_I^T)^{−1} ∆x_I(ω) ,
where we have used the fact that λ̄_I = −(A_I A_I^T)^{−1} x̄_I. Let us define
    ε = min{λ̄_i : i ∈ I} > 0 .
Since δ_i* > −λ̄_i for each i ∈ I whenever ||δ_I*|| < ε, we can guarantee, using the inequality
    ||δ_I*|| ≤ ||(A_I A_I^T)^{−1}|| · ||∆x_I(ω)|| ≤ ||(A_I A_I^T)^{−1}|| · ||∆x(ω)||
that whenever ||x(ω) − x̄|| = ||∆x(ω)|| < η_1 with η_1 = min{η_0, ε ||(A_I A_I^T)^{−1}||^{−1}}, the solution δ_I* satisfies the constraint of the initial reduced problem, and x(ω) ∈ Q_0. Thus δ_I* is also optimal for the reduced problem. We note that ||(A_I A_I^T)^{−1}||^{−1} = (σ_p(A_I))², where σ_p(A_I) > 0 is the smallest singular value of A_I (A_I has rank p = |I(ȳ)|). Reverting now to the full dual problem over λ ∈ R^s, we see that the vector λ* with λ_i* = λ̄_i + δ_i* > 0 if i ∈ I(ȳ), λ_i* = 0 if i ∉ I(ȳ), induces a vector
    y = −∑_{i∈I} λ_i* a_i = ȳ − ∑_{i∈I} δ_i* a_i = ȳ + A_I^T (A_I A_I^T)^{−1} ∆x_I(ω) .
Using ||y − ȳ|| ≤ ||A_I^T (A_I A_I^T)^{−1}|| · ||∆x(ω)||, we have ||y − ȳ|| < d_0/2 if ||∆x(ω)|| < ||A_I^T (A_I A_I^T)^{−1}||^{−1} d_0/2. In fact ||A_I^T (A_I A_I^T)^{−1}||^{−1} = σ_p(A_I). By setting
    η = min{ η_1 , σ_p(A_I) d_0/2 }
and choosing for Q the open ball of radius η centered at x̄, we can ensure that y ∈ R_0, so that I(ȳ) is the set of constraints active at y. This means that the vector y is optimal for the primal problem, and that λ* is optimal for the dual problem.
Now, given the existence of a neighborhood Q of x̄ for which I(y*(ω)) = I(ȳ) when x(ω) ∈ Q ∩ dom C, y*(ω) can be obtained as the projection of the origin on the affine subspace {y ∈ R^m : a_i^T y = x_i(ω), i ∈ I(ȳ)} whenever x(ω) ∈ Q ∩ dom C. With the rows of A_I linearly independent, the projection is given by y*(ω) = A_I^T (A_I A_I^T)^{−1} x_I(ω).
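A small sketch of this local model is given below: starting from a reference pair (x̄, ȳ) with ȳ the projection of the origin on C(x̄), it identifies the active set, checks the two assumptions of Proposition 6.19, and returns the closed-form prediction A_I^T (A_I A_I^T)^{−1} x_I(ω). The helper name, tolerance, and example data are hypothetical.

```python
import numpy as np

def local_prediction(A, x_bar, y_bar, x_new, tol=1e-9):
    """Local model of Proposition 6.19: given a reference pair (x_bar, y_bar)
    with y_bar the projection of the origin on C(x_bar), identify the active
    set I(y_bar), check the two assumptions, and predict the optimal solution
    for a nearby parameter as  A_I' (A_I A_I')^{-1} x_new_I."""
    I = np.where(A @ y_bar >= x_bar - tol)[0]          # active constraints at y_bar
    A_I = A[I, :]
    if np.linalg.matrix_rank(A_I) < len(I):
        raise ValueError("rows of A_I are not linearly independent")
    M = np.linalg.inv(A_I @ A_I.T)
    if not np.all(M @ x_bar[I] < 0):
        raise ValueError("the condition (A_I A_I^T)^{-1} x_bar_I < 0 fails")
    return A_I.T @ (M @ x_new[I])

# C(x) = {y : -y1 <= x1, -y2 <= x2}; reference pair (x_bar, y_bar) and nearby x
A = np.array([[-1., 0.], [0., -1.]])
y_pred = local_prediction(A, np.array([-1., -2.]), np.array([1., 2.]),
                          np.array([-1.1, -1.9]))
print(y_pred)   # approximately [1.1, 1.9]
```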
In the context of the supervised learning of nearly feasible optimal solutions, where
one looks for a hypothesis h in a hypothesis space H of mappings from x(ω) to y ∗ (ω),
the knowledge of a local model for y ∗ (ω) in a neighborhood of x̄, for instance a first-
order approximation y ∗ (ω) ' ȳ + D(x(ω) − x̄), means that one could learn h not only
by penalizing the discrepancies between the sampled targets y(ω) and the predictions
h(x(ω)), but also by penalizing the discrepancy between the gradient of h at x̄ and
the gradient D of the local model known a priori. Such ideas have been developed by
Simard et al. (1998). We also note that it is technically possible to incorporate derivative
information in Gaussian Process regression (Solak et al., 2003).
Another possibility would be to learn classifiers for the events i ∈ I(ȳ), 1 ≤ i ≤ s, since we know that the information on the active constraints can be generalized locally around x̄, and then to compute y*(ω) straightforwardly from the predicted active set.
Remark 6.6. A typical situation where the assumptions of Proposition 6.19 fail is the case where two inequality constraints form an equality constraint: a_1^T y ≤ x_1, a_2^T y ≤ x_2 with a_2 = −a_1 and x_2 = −x_1. In that case, a solution ȳ has to satisfy a_1^T ȳ = x_1, and A_I is always rank-deficient. In the event where the two parallel
hyperplanes are separated, it is not easy to predict which side of the so-induced
slab region the optimal solution will follow. If q pairs of hyperplanes are merged,
there might exist 2q distinct configurations of active constraints in the neighborhood
of x̄, provided that the assumptions of Proposition 6.19 hold with one element of
each pair of equality-forming hyperplanes removed from the index set I of active
constraints at ȳ.
6.20 Lemma. Given x(0), x(1) ∈ dom C, let x(t) = (1 − t)x(0) + tx(1) for 0 ≤ t ≤ 1.
Let y ∗ (t) denote the optimal solution to P(x(t)). If I = I(y ∗ (0)) = I(y ∗ (1)), then
y ∗ (t) = (1 − t)y ∗ (0) + ty ∗ (1). If in addition the rows aTi , i ∈ I, are linearly independent,
then y ∗ (t) = ATI (AI ATI )−1 xI (t).
As the convex hull of a collection of points {xν }, written conv({xν }), contains the
line segments between any two of its points, we have:
Another interesting question is whether we can, from a single point (x̄, ȳ) equipped
with a local model, infer the domain of validity of the model.
6.22 Proposition (Outer generalization). Let ȳ be the optimal solution to the program P(x̄). Let I(ȳ), written I for short, be the index set of active constraints at ȳ, and let J be its complement. Let A_I be the submatrix of active rows of A. If the rows of A_I are linearly independent, then the subset of dom C (values for the parameter x) where the index set of active constraints at the optimal solution y* of the program P(x) coincides with I = I(ȳ) can be described as the polyhedral cone
    X(I) = {x ∈ R^s : B_I x_I ⪯ 0, D_I x_I − x_J ⪯ 0}
with B_I = (A_I A_I^T)^{−1} and D_I = A_J A_I^T (A_I A_I^T)^{−1}.
Proof. Having ȳ as an optimal solution shows that there exists some λ̄_I ∈ R^p, p = |I(ȳ)|, such that ȳ + A_I^T λ̄_I = 0, λ̄_I ⪰ 0, A_I ȳ = x̄_I, and A_J ȳ ≺ x̄_J. If the rows of A_I are linearly independent, λ̄_I = −(A_I A_I^T)^{−1} x̄_I, which implies ȳ = A_I^T (A_I A_I^T)^{−1} x̄_I. Now, we can replace x̄_I by any x_I and obtain a corresponding optimal solution y determined by
    y = A_I^T (A_I A_I^T)^{−1} x_I ,    (6.7)
as long as we keep
    λ_I = −(A_I A_I^T)^{−1} x_I ⪰ 0 ,    (6.8)
    A_I y = x_I ,    (6.9)
    A_J y ≺ x_J .    (6.10)
To satisfy (6.8) we must enforce (A_I A_I^T)^{−1} x_I ⪯ 0. Equation (6.9) is a consequence of (6.7) multiplied by A_I. To satisfy (6.10), we must enforce A_J A_I^T (A_I A_I^T)^{−1} x_I − x_J ≺ 0. Actually, y will still be optimal if (6.10) is replaced by A_J y ⪯ x_J (non-strict inequality). In that case, we use the convention that if some new constraints enter the set of active constraints at y, the index set I is still understood as the set of active constraints at ȳ.
To easily see that the resulting set X(I), as defined in the proposition with B_I and D_I,
is a cone, assume without loss of generality that x = (x_I, x_J), allowing us to rewrite
X(I) = {x ∈ R^s : G_I x ⪯ 0}   with   G_I = [ B_I  0 ; D_I  −I ] ∈ R^{s×s},
where 0 is the zero matrix of dimension |I| and I the identity matrix of dimension |J|.
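As an illustration of Proposition 6.22, the following sketch (Python with NumPy and SciPy, on hypothetical data; the function names and tolerances are ours, not the thesis') builds the cell X(I) and the local model from a single solved instance (x̄, ȳ), and reuses the closed-form solution on a nearby parameter vector that falls in the same cell.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
m, s = 5, 10
A = rng.standard_normal((s, m))

def solve_p(x):
    # Reference solve of P(x): minimize 0.5*||y||^2 subject to A y <= x (SLSQP).
    res = minimize(lambda y: 0.5 * y @ y, np.zeros(m), jac=lambda y: y,
                   constraints=[{"type": "ineq", "fun": lambda y: x - A @ y,
                                 "jac": lambda y: -A}])
    return res.x

def build_cell(x_bar, y_bar, tol=1e-5):
    # Cell X(I) and local model built from one sample (x_bar, y_bar).
    I = np.where(A @ y_bar >= x_bar - tol)[0]          # active constraints at y_bar
    J = np.setdiff1d(np.arange(s), I)                  # inactive constraints
    AI, AJ = A[I], A[J]
    B = np.linalg.inv(AI @ AI.T)                       # B_I = (A_I A_I^T)^{-1}
    D = AJ @ AI.T @ B                                  # D_I = A_J A_I^T B_I
    P = AI.T @ B                                       # y = P x_I inside the cell
    return I, J, B, D, P

def in_cell(x, I, J, B, D, tol=1e-9):
    # Membership test x in X(I): B_I x_I <= 0 and D_I x_I - x_J <= 0.
    return np.all(B @ x[I] <= tol) and np.all(D @ x[I] - x[J] <= tol)

# Solve one instance, build its cell, then reuse the local model nearby.
x_bar = A @ rng.standard_normal(m) + 0.1 * np.abs(rng.standard_normal(s))
y_bar = solve_p(x_bar)
I, J, B, D, P = build_cell(x_bar, y_bar)
x_new = x_bar + 1e-3 * rng.standard_normal(s)
if in_cell(x_new, I, J, B, D):
    print(np.max(np.abs(P @ x_new[I] - solve_p(x_new))))   # close to the solver tolerance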
Remark 6.7. The subset of dom C for which there is no active constraint at the
optimal solution is X(∅) = R^s_+: it is easy to check that 0 ∈ argmin P(x) if and
only if x ⪰ 0. That the point x = 0 is included in every set X(I) corresponds
to the existence of pathological cases (recall Figure 6.2) where several hyperplanes
meet at zero.
6.23 Proposition. Let F be the set-valued mapping with values F(v) = {z ∈ R^m : F z ⪯ v},
and let dom F = {v : F(v) ≠ ∅}. For some fixed µ̄ and v̄ ∈ dom F, let z̄ be the optimal
solution to Q(µ̄, v̄). With f_i^T denoting the i-th row of F, let I = {i : f_i^T z̄ = v̄_i} be the
index set of active constraints at z̄. Then, there exist a neighborhood Q_µ of µ̄ and a
neighborhood Q_v of v̄ such that for all µ(ω) ∈ Q_µ and v(ω) ∈ Q_v ∩ dom F, the optimal
solution of Q(µ(ω), v(ω)) is
z^*(ω) = µ(ω) + Σ F_I^T (F_I Σ F_I^T)^{-1} (v_I(ω) − F_I µ(ω)) ,   (6.11)
if the rows of F_I are linearly independent and (F_I Σ F_I^T)^{-1} (v_I(ω) − F_I µ(ω)) ≺ 0. In fact, the
expression (6.11) is valid if one has v(ω) ∈ dom F, the rows of F_I linearly independent,
and µ(ω), v(ω) satisfying
Remark 6.8. There is a nice interpretation of z ∗ (ω) in (6.11) as the conditional mean
of a random variable Z with realizations Z(η), such that Z follows a priori a normal
N (µ(ω), Σ), and then is conditioned on the observation FI Z(η) = vI (ω).
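A small numerical check of this interpretation is easy to write. The sketch below (Python with NumPy, made-up data) evaluates formula (6.11) and compares it with the solution of the equality-constrained program minimize ½(z − µ)^T Σ^{-1}(z − µ) subject to F_I z = v_I, obtained from its KKT system; when I is the active set of the inequality-constrained program, the two coincide.

import numpy as np

rng = np.random.default_rng(2)
m, p = 6, 2
mu = rng.standard_normal(m)
L = rng.standard_normal((m, m))
Sigma = L @ L.T + np.eye(m)            # a positive definite covariance
F_I = rng.standard_normal((p, m))      # active rows, linearly independent a.s.
v_I = rng.standard_normal(p)

# Formula (6.11): conditional mean of Z ~ N(mu, Sigma) given F_I Z = v_I.
z_star = mu + Sigma @ F_I.T @ np.linalg.solve(F_I @ Sigma @ F_I.T, v_I - F_I @ mu)

# Cross-check through the KKT system of the equality-constrained program.
Sinv = np.linalg.inv(Sigma)
KKT = np.block([[Sinv, F_I.T], [F_I, np.zeros((p, p))]])
rhs = np.concatenate([Sinv @ mu, v_I])
z_kkt = np.linalg.solve(KKT, rhs)[:m]
assert np.allclose(z_star, z_kkt)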
Proof of Proposition 6.23. All the developments in the section have been done for the
parametric program (6.1), but can be applied easily to the parametric program (6.5),
Q(u(ω), v(ω)) : minimize ½ z^T S z + u(ω)^T z subject to F z ⪯ v(ω) ,
with S positive definite. To adapt Proposition 6.19, for instance, let z̄ be the optimal
solution to Q(ū, v̄), and let I = {i : fiT z̄ = v̄i }. Let S = RT R be the Cholesky factoriza-
tion of S. The change of variables z = R−1 y − S −1 u(ω) applied to the system of active
constraints FI z = vI (ω) yields FI R−1 y = vI (ω) + FI S −1 u(ω), that is, AI y = xI (ω) if we
set A = F R−1 and x(ω) = v(ω)+F S −1 u(ω). Applying Proposition 6.19 and substituting
back, we deduce that there exist some neighborhoods Qu of ū and Qv of v̄ such that for
all u(ω) ∈ Q_u and v(ω) ∈ Q_v ∩ dom F, the optimal solution z^*(ω) to (6.5) is given by
z^*(ω) = S^{-1} F_I^T (F_I S^{-1} F_I^T)^{-1} (v_I(ω) + F_I S^{-1} u(ω)) − S^{-1} u(ω) ,
if the rows of F_I are linearly independent and (F_I S^{-1} F_I^T)^{-1} (v_I(ω) + F_I S^{-1} u(ω)) ≺ 0. It
remains to set S = Σ−1 and u(ω) = −Σ−1 µ(ω) to get (6.11). The rest of the proposition
follows similarly from Proposition 6.22.
Our study of the optimal solution y ∗ (ω) to the program P(x(ω)) defined by (6.1) suggests
that the exact prediction problem, mapping an input x(ω) to the output y ∗ (ω), can be
reduced to the prediction of the index set of active constraints, mapping x(ω) to I(y ∗ (ω)).
The index sets of active constraints I partition the input space into subregions X(I),
found to be polyhedral cones. (The subregions are also called cells in the sequel.) Once
x(ω) is known to belong to some cell X(I), it is straightforward to find y ∗ (ω).
A cell X(I) requires s linear inequalities to be described as a polyhedron, where s is
the number of constraints of the parametric program P. In a problem with s constraints,
there could be an astronomically large number of index sets of active constraints I to
consider. Enumerating them individually without prior knowledge would be a daunting
task. But at least, Proposition 6.22 allows us to build instantly the cell X(I) associated
with a sample (x̄, ȳ), where ȳ is the optimal solution to P(x̄), and to create a classifier
associated with X(I) indicating whether a new input x is in X(I). A classifier is simply
a 0-1 indicator function of the set X(I), which can be represented by a decision tree
(Algorithm 6.1). By creating new classifiers and exploiting existing ones, it would then be
possible to “learn” minimizers in an online fashion (Algorithm 6.2).
1. Let J = {j : a_j^T ȳ < x̄_j} be the index set of inactive constraints at ȳ.
Set B = (A_I A_I^T)^{-1}, and let b_j^T denote the j-th row of B.
Set D = A_J A_I^T B, and let d_k^T denote the k-th row of D.
Define φ_j(x) = b_j^T x_I.
Define ψ_k(x) = d_k^T x_I − x_{J(k)}, where J(k) is the k-th index of J.
4. Repeat for k = 1, . . . , s − p :
Split the current node using test ψ_k(x) ≤ 0 (true for the left child),
attach label {0} to the right child, and call the left child the current node.
The approach builds classifiers that delimit the domain of validity of the local models;
in particular, it does not attempt to build a single classifier per constraint that would
tell us whether that constraint is active at the optimal solution. When the input data
cannot be processed by existing local models,
a standard quadratic programming solver is called, and its result is used to build a new
local model.
Algorithm 6.2 can be viewed as a quadratic programming solver that adapts itself to
the input data it receives, so as to “minimize” its response time. One could for example
fix a maximal number of local models, and then allow local models that are infrequently
called to be disposed of after some time, since the membership tests induce an overhead
at most linear with the number of existing local models (we refine the linear complexity
estimate in Lemma 6.28 below, and implement a strategy for managing the local models
in Section 6.5).
Algorithm 6.2 can also be viewed as the builder of a decision policy h, that assigns to
each input x ∈ Rs an optimal decision y ∈ Rm . Initially, the decision policy always calls
a quadratic programming solver, except when the solution is trivially 0. Denoting by π
the mapping from the input x to the optimal solution y implemented by the solver, the
policy h can be expressed as h(x) = 0 if x ⪰ 0, and h(x) = π(x) otherwise.
After some training during which inputs x^µ are received, outputs y^µ = h(x^µ) are
self-generated, and classifiers δ_{I^µ} are built with I^µ = I(y^µ) the set of active constraints
at y^µ, the decision policy h : R^s → R^m can exploit, in the region R^s_+ ∪ ⋃_µ X(I^µ) of
the input space, an explicit representation of the optimal solution mapping from x to y.
Namely, if M denotes the collection of index sets I^µ already seen, the policy h can be
formally expressed by 3 pieces, assuming for notational simplicity that the probability of
x falling on the boundaries of the cells X(I) is 0:
h(x) = 0   if x ⪰ 0,
h(x) = ∑_{I∈M} δ_I(x) A_I^T (A_I A_I^T)^{-1} x_I   if x ∈ ⋃_{I∈M} X(I),   (6.14)
h(x) = π(x)   otherwise.
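To fix ideas, here is a compact sketch of the self-adapting policy (6.14) in Python (NumPy, with a SciPy routine standing in for the quadratic programming solver; class and variable names, tolerances and the toy problem are ours and purely illustrative). Known cells are tried first, and the solver is only called, and a new local model stored, when the input falls outside all known cells.

import numpy as np
from scipy.optimize import minimize

class OnlineProjectionPolicy:
    # Sketch of the policy h of (6.14) for P(x): min 0.5*||y||^2 s.t. A y <= x.

    def __init__(self, A, tol=1e-6):
        self.A, (self.s, self.m) = A, A.shape
        self.tol = tol
        self.cells = []            # list of (I, J, B, D, P) local models
        self.solver_calls = 0

    def _solve(self, x):
        self.solver_calls += 1
        res = minimize(lambda y: 0.5 * y @ y, np.zeros(self.m), jac=lambda y: y,
                       constraints=[{"type": "ineq", "fun": lambda y: x - self.A @ y,
                                     "jac": lambda y: -self.A}])
        return res.x

    def _add_cell(self, x, y):
        I = np.where(self.A @ y >= x - 1e-5)[0]         # active constraints at y
        if I.size == 0 or np.linalg.matrix_rank(self.A[I]) < I.size:
            return                                      # skip degenerate active sets
        J = np.setdiff1d(np.arange(self.s), I)
        AI, AJ = self.A[I], self.A[J]
        B = np.linalg.inv(AI @ AI.T)                    # B_I
        D = AJ @ AI.T @ B                               # D_I
        self.cells.append((I, J, B, D, AI.T @ B))       # last entry maps x_I to y

    def __call__(self, x):
        if np.all(x >= 0):                              # trivial region R^s_+
            return np.zeros(self.m)
        for I, J, B, D, P in self.cells:                # known cells X(I)
            if np.all(B @ x[I] <= self.tol) and np.all(D @ x[I] - x[J] <= self.tol):
                return P @ x[I]
        y = self._solve(x)                              # fallback: call the solver
        self._add_cell(x, y)
        return y

# Hypothetical usage on a random problem.
rng = np.random.default_rng(3)
m, s = 5, 10
h = OnlineProjectionPolicy(rng.standard_normal((s, m)))
for _ in range(200):
    x = h.A @ rng.standard_normal(m) + 0.5 * np.abs(rng.standard_normal(s))
    y = h(x)
print(len(h.cells), h.solver_calls)

A cap on the number of stored cells and a periodic pruning of rarely used ones, in the spirit of the management rules described in Section 6.5, can be layered on top of this sketch.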
6.4.2 Complexity.
With a finite number s of constraints, the number of possible cells X(I), say N , is finite
but very large:
6.24 Lemma. For the parametric program P(x) over y ∈ R^m with s constraints, the
number of cells X(I), written N, is at most
∑_{p=1}^{min{s,m}} s!/(p!(s − p)!) ≤ 2^s − 1 .
Proof. For the feasibility set C = {y ∈ R^m : Ay ⪯ x}, A ∈ R^{s×m}, and an index set I of
active constraints of cardinality p, the cell X(I) is well defined when the p rows of A_I
(1 ≤ p ≤ s) are linearly independent. Note that if A has rank s (possible only if s ≤ m),
then A_I has rank p and its rows are linearly independent. Having the p rows of A_I linearly
independent is impossible if p > m. Thus, the index sets to consider are those obtained
by picking p constraints out of s, with p ≤ m. For programs with s ≤ m, there exist
2^s − 1 theoretical combinations of active constraints.
Clearly, there is no hope of covering efficiently all the cells X(I). The proposed
approach is expected to work when the support of the distribution of x is concentrated
on a relatively modest number of cells. Instead of building in a systematic way the explicit
part of the mapping h, we let the construction process be driven by i.i.d. samples of x,
always allowing calls π(x) to the solver for samples that fall out of the domain where h
is known explicitly.
The complexity of building a classifier can be estimated as follows.
6.26 Lemma. A test x ∈ X(I) with |I| = p requires at most O(sp) operations, meaning
that the complexity is at most O(s²) for all I.
(Lemma 6.26 does not take into consideration the fact that a test should be aborted
as soon as one of the s inequalities to check is false.)
6.27 Lemma. A prediction for y given x ∈ X(I) with |I| = p requires at most O(mp)
operations, meaning that the complexity is at most O(ms) for all I if s ≤ m, or at most
O(m²) if s > m.
Proof. The prediction is y = ATI (AI ATI )−1 xI . The matrix product has already been
evaluated in the construction of the classifier δI , so that the complexity of evaluating the
prediction is reduced to the complexity of the matrix-vector multiplication.
Keeping more classifiers in the collection M reduces the probability
p_0 = P{x ⋡ 0 and x ∉ X(I) for all I ∈ M}   (6.15)
of a new sample x falling in an unknown subregion of the input space, but potentially
delays the call π(x) to the solver.
6.28 Lemma. Let N be the total number of cells X(I) induced by the parametric pro-
gram P(x). Let M = αN, α ∈ (0, 1], be the number of classifiers appended sequentially
to an initially empty collection M of classifiers, new cells X(I) being potentially discov-
ered as new i.i.d. samples of x are received. Let h be the policy (6.14) that maps x to
argmin P(x). Then, the expected number of tests of the form x ∈ X(I) in the evaluation
of h(x) is upper bounded by M (1 − α/2) + α/2.
Proof. Let X1 , . . . , XN be the polyhedral cells X(I) induced by the parametric pro-
gram P. Let qi be the probability that x ∈ Xi . Let δ1 , . . . , δM denote the distinct
classifiers of the collection M, indexed in the order of their creation. Each classifier δ j
is associated to a different cell Xi , and the probability of the possible matchings is a
function of the probabilities qi . Given the sequence {δj }1≤j≤M , let t(x) be the number
of tests (implemented by the classifiers) needed to detect any event x ∈ X i , or the event
that x falls in a cell not covered by any δ_j. We have t(x) = 0 if x ⪰ 0, t(x) = j if there
exists some j such that δj (x) = 1, and t(x) = M otherwise.
The choice of qi that maximizes E{t(x)} (where the expectation is taken over x and
over the possible sequences of classifiers) is obtained with qi = 1/N for each i, since
having any qi > 1/N would make Xi more likely to appear sooner in the sequence of
classifiers, and every new sample x more likely to hit Xi . Note that qi = 1/N also means
that the probability of x ⪰ 0 is chosen to be zero.
Then, thanks to the property that all the orderings of the classifiers are now equiprob-
able, it holds that t(x) = 0 with probability 0, t(x) = j with probability 1/N if j < M ,
and t(x) = M with probability 1/N + (N − M )/N . Writing M = αN for some α ∈ (0, 1],
the expectation of t with the worst-case distribution is
E{t(x)} = ∑_{j=1}^{M} j/N + M (N − M)/N = M(1 + M)/(2N) + M(N − M)/N = M(1 − α/2) + α/2.
The expected time T_h for evaluating h(x) on a new sample can then be bounded as
T_h ≤ p_0 · T_π + (1 − p_0) · T_y + (M(1 − α/2) + α/2) · T_X ,
where T_π, T_y and T_X denote respectively the time of a call to the solver, of a closed-form
prediction, and of a single membership test, using the bound of Lemma 6.28 based on a
worst-case distribution, which is also insensitive to the possible reordering of the tests x ∈ X(I).
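As a quick sanity check of Lemma 6.28, one can simulate the worst-case setting of the proof: N equiprobable cells, M classifiers appended in their order of discovery, and then the number of membership tests for a fresh sample. The sketch below (Python with NumPy, arbitrary parameters) compares the empirical average with the bound M(1 − α/2) + α/2.

import numpy as np

rng = np.random.default_rng(4)
N, M, trials = 50, 20, 20000
alpha = M / N
bound = M * (1 - alpha / 2) + alpha / 2

total_tests = 0
for _ in range(trials):
    # Discover M distinct cells under the uniform (worst-case) distribution.
    order, seen = [], set()
    while len(order) < M:
        c = int(rng.integers(N))
        if c not in seen:
            seen.add(c)
            order.append(c)
    # Number of membership tests for a fresh sample.
    c = int(rng.integers(N))
    total_tests += order.index(c) + 1 if c in seen else M

print(total_tests / trials, "vs bound", bound)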
6.5 Numerical Experiments
The potential merits of the proposed algorithms are evaluated on various random prob-
lems. We recall that problems with random constraint matrices are not necessarily
representative of practical problems (Edelman, 1992) — for the simplex method, they
do provide insights, but they are unable to explain the behavior of the algorithm in a
relevant neighborhood of some fixed input data (Spielman and Teng, 2004). However,
random problems are easy to specify, and by a statistical concentration phenomenon,
large problems tend to be very similar; taken together, these two features facilitate the
replication of experiments and observations.
We create the test problems as follows. We select sets of parameters for the input
dimension and the number of constraints, namely:
(m, s) = (5, 10), (10, 5), (10, 20), (20, 10), (20, 40), (40, 20).
For each set, we generate one random matrix U ∈ R^{s×m} with i.i.d. elements U_ij drawn
from the uniform distribution in [0, 1]. We define V^k = U − 0.1k · 11^T for k = 1, 2, . . . , 5,
where 11^T is a matrix of ones in R^{s×m}. We form the matrix A^k from V^k by stacking
the s rows a_i^k = 2α_i^k v_i^k/||v_i^k||, where α_i^k is drawn from the uniform distribution in [0, 1],
and v_i^k is the i-th row of V^k. We call P(m, s; k) the problem with A = A^k ∈ R^{s×m}. It
will turn out that problems get harder as k increases.
The parameter k controls the diversity of the directions normal to the halfspaces
defined by the random rows of A. With k small, the first singular value of Ak tends to
dominate the others.
In every problem P(m, s; k), we draw samples for x as follows. Starting from an
orthonormal basis Ã ∈ R^{s×m0} for the range of A, where m0 = min{s, m} (the basis is
obtained from the singular value decomposition of A), we set x = Ãξ0 + 0.5|ξ1|, where ξ0 ∈ R^{m0}
the elementwise absolute value. By Proposition 6.12, x ∈ dom C. We have kept the
magnitude of the term in |ξ1| relatively small, so as to avoid the case x ⪰ 0, to which is
associated the optimal solution y ∗ = 0. Notice that the support of the distribution of x
is unbounded.
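A sketch of this construction, in Python with NumPy (hypothetical seed; contrary to the text above, a fresh U is drawn at each call instead of being shared by the five values of k), reads as follows.

import numpy as np

def make_problem(m, s, k, rng):
    # Constraint matrix A^k of P(m, s; k): rows 2*alpha_i*v_i/||v_i||,
    # with V^k = U - 0.1*k*ones and U uniform over [0, 1]^{s x m}.
    U = rng.uniform(0.0, 1.0, size=(s, m))
    V = U - 0.1 * k * np.ones((s, m))
    alpha = rng.uniform(0.0, 1.0, size=s)
    return 2.0 * alpha[:, None] * V / np.linalg.norm(V, axis=1, keepdims=True)

def sample_x(A, rng):
    # Sample x = A_tilde xi0 + 0.5*|xi1| in dom C, with A_tilde an orthonormal
    # basis of the range of A obtained from the singular value decomposition.
    s, m = A.shape
    A_tilde = np.linalg.svd(A, full_matrices=False)[0]
    xi0 = rng.standard_normal(A_tilde.shape[1])
    xi1 = rng.standard_normal(s)
    return A_tilde @ xi0 + 0.5 * np.abs(xi1)

rng = np.random.default_rng(5)
A = make_problem(m=5, s=10, k=1, rng=rng)   # the problem called P(5, 10; 1)
x = sample_x(A, rng)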
• M : the number of classifiers δI built during training (online learning). Note that
M also gives the number of calls to the quadratic programming solver made during
the training phase.
• Xh : the fraction of test samples that can be processed by the learned classifiers,
corresponding to the fraction of samples from the test set that hit some cell X(I)
seen during the training phase.
1. Never create more than 1000 classifiers over the course of the online learning process.
3. Every 250 samples, reorder the classifiers by decreasing frequency of use, and remove
the stored classifiers that were never recalled after their creation.
Note that the same classifier could be rebuilt several times if it is removed too soon (espe-
cially for the last classifiers added within the window of 250 samples). However, together
the rules imply that the same classifier will be rebuilt at most 4 times. The purpose
of the first rule is to be able to decide online whether the learning approach should be
pursued: if one keeps building or rebuilding classifiers all the time, the problem is not well
adapted to the online learning approach, and one should stop building new classifiers.
The CPU times are reported in Table 6.1. In those tests, the solver is the Matlab
function quadprog.
The results of Table 6.1 suggest that for several problems, especially those where s < m
or k is small, the proposed approach is promising. A relatively small number of classifiers
suffices to cover almost all the input space relevant to the (unbounded) input distribution,
as shown by the fractions Xh close to 1. For those problems, we measured speed-up
factors between 2 and 15 over the systematic strategy that calls a quadratic programming
solver for each sample.
For other problems, especially those with many constraints and higher values for the
parameter k, the merits of the approach are less clear. The local models are valid on a
relatively small volume of the input space, leading to a multiplication of the classifiers to
build. As the multiplication of the tests begins to hurt computing times, the management
rules for maintaining a useful collection of classifiers start to be important, if the proposed
approach is to remain competitive with the usual approach consisting in solving every
problem instance by calling the solver. Notice that the problem P (10, 5; 5) has M =
2s − 1 = 31 classifiers, meaning that all possible combinations of active constraints
are needed. Therefore, we believe that the problem P (20, 40; 5), on which the worst
fraction of the input space covered by 5000 classifiers is recorded, could in fact require,
by Lemma 6.24, as many as 0.6 · 1012 classifiers.
The last series of problems with m = 40, s = 20, illustrates clearly that the practical
speed-up performance of the proposed approach depends on the problem data A. The
whole table illustrates that the overhead cost of learning and maintaining classifiers can
be controlled, so that there is in fact a very strong incentive to use the proposed approach.
To close this section, we give in Figure 6.3 an example of prediction for the first
test problem P(5, 10; 1): two points x0, x1 ∈ R^{10} have been drawn randomly, and the
5 components of the optimal solution along the line segment x(t) = (1 − t)x0 + tx1,
t ∈ [0, 1], have been plotted against t.
Tab. 6.1: Results on test problems: Covering by classifiers and Speed-up performance.
Fig. 6.3: The components yk of the optimal solutions y ∈ R5 for the test problem P (5, 10; 1),
along a random line segment defined by xt = (1 − t)x0 + tx1 ∈ R10 , 0 ≤ t ≤ 1.
Breakpoints in the yk -curves indicate where the segment cuts a boundary between the
domains of 2 classifiers.
6.6 Conclusions
This chapter has discussed a family of parametrized optimization programs and a hy-
pothesis class for predicting, after some training on the task of solving random instances,
optimal solutions to new instances of the program. Based mainly on geometrical in-
sights, the analysis of the properties of optimal solutions has also emphasized the role
of constraint qualifications, and the possible occurrence of pathological cases for some
distributions of the input data. A natural choice for the hypothesis class was a piecewise
linear model describing how optimal solutions vary locally with the input data. Fitting
the model was possible using a strategy based on the exploitation of first-order optimality
conditions.
The technical assumption that the rows of AI (rows of active constraints at the op-
timal solution) are linearly independent corresponds to a linear independence constraint
qualification (LICQ). It implies that the set of optimal dual solutions is a singleton
(Facchinei and Pang, 2003, Proposition 3.2.1). Note that in nonlinear programming, a
necessary and sufficient condition ensuring that the set of optimal primal-dual solutions
is a singleton is the strict Mangasarian-Fromovitz constraint qualification (SMFCQ)
(Kyparisis, 1985) — see again Facchinei and Pang (2003, Proposition 3.2.1).
It is well known that a quadratic program subject to constraints with parametrized
right-hand side admits a piecewise linear optimal solution (Garstka and Wets, 1974,
Proposition 3.5). Early results involving perturbations of the constraint matrix are also
available (Daniel, 1973), but they do not really allow one to circumvent the difficulties that
arise from inequality constraints forming new equality constraints (the merging effect
detailed in Remark 6.6). An example revealing the combinatorial nature of the difficul-
ties, inspired from a linear program given by Martin (1975), is provided by the following
program over y = (y1, y2) ∈ R², where the constraint matrix depends affinely on t ∈ R:
minimize ½||y||²   subject to   y1 + y2 ≥ 1 ,   y1 + t y2 ≤ 1 .
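A brief numerical look at this example can be obtained with the sketch below (Python with SciPy, values of t chosen arbitrarily); one finds that the minimizer stays at (½, ½) as long as t ≤ 1 and jumps to (1, 0) as soon as t > 1, so that the active set, and the solution itself, change abruptly when the constraint matrix crosses t = 1.

import numpy as np
from scipy.optimize import minimize

def solve(t):
    # Solve min 0.5*||y||^2 s.t. y1 + y2 >= 1 and y1 + t*y2 <= 1.
    cons = [{"type": "ineq", "fun": lambda y: y[0] + y[1] - 1.0},
            {"type": "ineq", "fun": lambda y: 1.0 - y[0] - t * y[1]}]
    res = minimize(lambda y: 0.5 * y @ y, np.array([1.0, 0.0]), constraints=cons)
    return res.x

for t in [0.5, 0.9, 0.99, 1.0, 1.01, 1.1, 2.0]:
    y = solve(t)
    active = [abs(y[0] + y[1] - 1.0) < 1e-6, abs(y[0] + t * y[1] - 1.0) < 1e-6]
    print(f"t = {t:5.2f}   y = ({y[0]:.3f}, {y[1]:.3f})   active = {active}")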
parametric program
where the minimization is over y ∈ Rm1 and z ∈ Rm2 , and the parameters are x(ω) ∈ Rs1
and w(ω) ∈ R^{s2}. This form is more general than (6.1), except when B2 is such that
{B2 z : z ∈ R^{m2}} = R^{s2} (which allows one to ignore the constraint B1 y + B2 z ⪯ w(ω),
trivially satisfied for any y with some proper choice of z). The mapping of interest is the
mapping from (x(ω), w(ω)) to the uniquely determined part y ∗ (ω) of an optimal solution.
Chapter 7
Conclusion
This thesis presents novel strategies for the search of approximate solutions to multistage
stochastic programs. The framework is based on the association of statistical learning
techniques to scenario-tree approximation techniques from the multistage stochastic pro-
gramming literature. At first, the framework serves two purposes:
difficult. For general distributions, the question “Is E{f (ξ, π̄(ξ))} ≤ θ ?” can only be
answered up to a certain probabilistic confidence level α < 1.
A second computational difficulty stems from the fact that approximate stochastic
programming solution techniques furnish “solutions” that are fully specified only for
the first decision stage. To evaluate on a new realization of ξ the mapping π (the
decision policy) induced by these approximation techniques, one has to solve a sequence
of approximate versions of the original problem posed over gradually shrunk time horizons
(see Chapter 2).
Our diagnosis is that, taken together, these two computational difficulties render
it impossible to assess in practice the third level in the hierarchy of the stochastic pro-
gramming methodology, namely, the adjustment of the sampling or discretization method
that replaces expectations by finite sums (so as to yield a program on a finite number of
optimization variables). Yet, we view this third level as a key ingredient for the success of
the whole approach (specific constructions back up this view in Chapter 4): the fact that
the random process is observed gradually, which translates into a tree-structured representation
of the samples (scenarios), leaves many degrees of freedom for adjusting the location of
branchings in the tree, a possibility that should be exploited in the context of problems
posed over long time horizons (or more generally in the context of multistage problems
where the dimension of the random process is high).
In Chapter 2, we have presented multistage stochastic programming in the context of
several competing frameworks and methods for sequential decision making under uncertainty,
such as Markov Decision Processes (MDP) and Model Predictive Control (MPC). We
have mentioned several solution heuristics for multistage stochastic programming that
have been explored in the optimization and operations research literature, such as two-
stage approximations, aggregation and averaging strategies, and consensus strategies
(Section 2.3).
In principle, the value of a multistage stochastic programming model over other or
simpler models cannot be estimated without building and developing a solution method
for all those models on the real data. The examples and case studies presented in the
thesis have been selected (Section 4.4) or created (Example 2.1, Section 5.3) after careful
experimentation on the model and problem data, in such a way that the multistage
model had a high value with respect to other models — in particular two-stage approxi-
mations — given the numerical data. This was an important stage for a sound evaluation
of the solution methods proposed in the thesis, but also time-consuming, which explains
in part why we chose not to multiply the number of examples or assess the methods on
problem instances with random or arbitrary data.
In Chapter 3, a series of statistical estimation methods has been considered, from max-
imum likelihood and maximum a posteriori estimation to bootstrap aggregating meth-
ods (bagging). The particular mix of perturbation, averaging and selection steps that
differentiates those methods suggests that the estimation and optimization aspects in
stochastic programming problems could in fact be given a unified treatment, based on
Monte Carlo methods and importance sampling methods. The Cross-Entropy method
for the simulation of rare events, and its application to combinatorial optimization, was
identified as a promising candidate for reducing the conceptual gap between the two
aspects (Section 3.2.3). At the same time, the idea of solving an ensemble of random
approximations to a multistage stochastic program, and then aggregating the results,
• Posing the learning problems over a transformed output space, where the feasibility
constraints can be more easily enforced.
• Basing the model selection of a feasible policy π on an estimate of its expected cost
on the true multistage problem, rather than on the loss function of the supervised
learning problems.
Two model selection strategies have been proposed. The first one consists in the simul-
taneous search of the hyperparameters of π viewed as a whole entity. The second one
• We have proposed to consider several scenario trees for a same multistage problem,
each of them leading to distinct approximations of the true problem. The trees are
to be ranked according to the value of the best policy that can be learned using
the data set of decisions optimized on those trees.
• We have suggested to retain the best policy of the best tree, say π ? , as a suboptimal
but feasible solution to the true multistage problem. The empirical estimate θ̂
of E{f (ξ, π ? )}, obtained by simulating the policy π ? on a new independent test
sample, furnishes a certificate of performance on the true problem. The estimate θ̂
can be adjusted if one wants to consider confidence levels. In other words, π ? is a
witness to the claim
• We have demonstrated on a family of test problems that the full approach is imple-
mentable in practice, at a very moderate computational cost, and yields, for those
test problems, near-optimal policies.
• Thanks to the moderate computational cost of this novel tree selection method, we
were able to study empirically the effect of meta parameters on the quality of the
solution, such as the number of scenarios in the trees to consider, or the type of
sampling processes for the values at the nodes of the trees.
Our experiments indicate that considering a large number of small trees can lead
to an excellent tradeoff between solution quality and computational time.
In Chapter 5, we have considered a second set of strategies for learning policies, in
the context of a four-stage multi-product assembly problem under demand uncertainty
for which the value of the multistage formulation is high (Section 5.3).
• The general principle underlying the proposed learning approach is that the decisions
of a policy could initially be represented as probability densities conditioned on a
growing number of observations.
• The framework has been found to lead naturally to Gaussian Process regression
techniques.
• The choice of the covariance matrices (kernels) of the Gaussian Process models is
sometimes facilitated by the knowledge of the algorithm that generates the scenario
tree.
• Experiments on the test problem have demonstrated that with a suitable choice of
the kernels and of the repair procedure (the projection of the candidate decisions
on the current feasibility set), a near-optimal policy could be selected.
• We have indeed formulated the question in a setting where fruitful results and
insights can be derived. The setting is a well-known class of parametric strictly
convex quadratic programs, and is related to the feasibility restoration task as for-
mulated in Chapter 5, although some additional work would be needed to generalize
the results to general convex quadratic programming problems.
7.2 Perspectives
Learning a policy from a data set of optimized decisions is a technical compromise. One
would certainly prefer to optimize the parameters of a parametric policy directly on a
scenario tree, or by simulation combined with stochastic gradient descent techniques.
The idea of searching for policies directly seems to be as old as stochastic programming
itself (Garstka and Wets, 1974). Unfortunately, with the exception of particular settings
• For specific applications, specific decisions rules might be proposed and tested. For
instance, it is often the case in planning and sequential decision making under
uncertainty that one is offered the choice to act now (the implementation details
being adjusted greedily), or to postpone the decision. Such situations have been
analyzed by Van Hentenryck and Bent (2006, Chapter 8) in the context of the
online dispatching of a taxi fleet, but could also be found in electricity generation
planning (optimal response to contingencies). Then, a fundamental component of
the decision policy is the mapping from the information state, possibly represented
by features, to the delay before taking irreversible or expensive decisions. The
mapping could be learned according to the data collected from optimized multistage
stochastic programs, and then further adjusted by simulations. If the decision of
acting now is selected, what we refer to as the repair procedure could be anything
from a greedy, one-step online optimization, to a call to another policy dedicated
to immediate actions.
• The proper way to associate, in a single data set, scenario/decision examples col-
lected from several scenario-tree approximations solved to optimality, so as to infer
from this data set, or a post-processed version of it, a policy with better perfor-
mances on the exact problem than the best of the policies learned from the data
relative to a single scenario tree, is still to be found and shown to be computation-
ally efficient. Our intuition is that this approach could be especially interesting for
multistage problems with high-dimensional random processes, but would require
much work to ensure that the inconsistencies among the data sets of decisions are
innocuous in the context of the learning algorithm, or can be corrected by some
problem-dependent processing step.
• In Section 4.4.2, it was observed that a near-optimal policy had been obtained from
a scenario tree with statistical properties (including first moments) very far from
those of the targeted random process. This suggests that the paradigm according
to which the most rational objective is to find ways to build a unique scenario tree
as “close” as possible (in any sense) to the original random process is in fact too
limitative in the context of challenging multistage problems.
Besides the proposed multi-tree framework based on branching structure random-
ization, it might be conceivable to perturb the parameters of the targeted random
process itself (as long as the learned policies are ultimately tested on the exact
random process).
The objective of the approximate multistage programs could also be perturbed
or modified, for instance by adding regularization terms (as long as the learned
policies are ultimately tested with the exact cost function).
• The work on scenario tree generation methods is likely to have an impact on op-
timal experiment design, active learning, and on the direct selection of concise
data sets (Rachelson et al., 2010) for reducing, at the source, the complexity of
non-parametric models, or for facilitating the processing of data by complex algo-
rithms.
From a point of view more specific to the present work, Chapter 6 suggests possi-
ble research directions in supervised learning and artificial intelligence, based on simple
settings in which concepts such as “learning to learn” or “learning faster” are given a
simple mathematical formalization. It would certainly be worth exploring and developing
such approaches further, as they could allow a better integration of existing technologies
and a focus on context detection rather than on the learning task itself. Related work in this
general orientation includes Ailon et al. (2006) and Hartland et al. (2006).
Appendix A
Elements of Variational Analysis
This appendix presents material from variational analysis (Rockafellar and Wets, 1998)
useful in the study of minimization problems subject to constraints, and approximations
of those minimization problems.
This material is a part of the fundamental theoretical background supporting many
works in stochastic programming, and more generally many works in optimization. It
provides a convenient formalism that we use in the thesis for discussing optimization
programs abstractly, although in the main body of the thesis we do not insist on some
of the technical subtleties highlighted in the present appendix, as these subtleties are
not absolutely required to communicate the kind of work carried out in the context of the
thesis.
The appendix is organized as follows. Section A.1 defines minimization problems
through extended-real-valued functions. Section A.2 introduces notations for dealing
with sequences, subsequences and neighborhoods. Section A.3 defines the notion of
semi-continuity. Section A.4 gives sufficient conditions for the existence of optimal so-
lutions. Section A.5 defines the notion of epigraph. Section A.6 defines the notion of
epi-convergence of functions. Section A.7 connects epi-convergence to the property that
optimal solutions converge to true optimal solutions. Section A.8 relates epi-convergence
to other modes of convergence of functions. Section A.9 considers the generalization of
results to parametric optimization. Section A.10 considers the particularization of results
to convex optimization. Section A.11 defines the notion of local Lipschitz continuity.
A.1 Minimization
Let R̄ = R ∪ {−∞, +∞} denote the set of extended real numbers. Minimization problems
and constrained minimization problems can be defined through the notion of extended-
real-valued functions.
The infimum of f , written inf f , is the greatest lower bound of f , that is, the greatest
value v ∈ R such that v ≤ f (x) for all x ∈ Rn . The infimum of f on a (possibly empty)
set C ⊂ Rn , written inf C f , is the greatest lower bound of the extended-real-valued
function that assigns to x ∈ C the value f (x), and to x ∈ Rn \ C the value ∞. When
C = R^n, inf_C f = inf f. To emphasize the argument of f, we may write inf_x f(x) instead of inf f.
A.2 Sequences
Let N (x̄) denote the collection of all neighborhoods of x̄ ∈ Rn . We take the notions
of open set and neighborhood for granted (Mendelson, 1990). We are about to deal
with properties that must hold for all V ∈ N(x̄). In the metric space (R^n, d) where
d(x, y) = ||x − y|| = [∑_{i=1}^n (x_i − y_i)²]^{1/2}, the properties that we will consider hold for all
V ∈ N(x̄) iff they hold for all open Euclidean balls of rational radius centered at x̄, that
is, for all V of the form {x ∈ Rn : ||x − x̄|| < δ} with 0 < δ ∈ Q.
Let the topological closure and interior of a set C ⊂ Rn be defined by
(Rockafellar and Wets, 1998, page 14). The topological boundary of a set C is defined
by bdry C = cl C \ int C.
Let {xν }ν∈N denote a sequence x1 , x2 , . . . with xν ∈ Rn and ν ∈ N (the set of natural
numbers, taken as the index set of the elements of the sequence). The set of points x ν in a
sequence {xν }ν∈N is called the range of the sequence (Rudin, 1976, page 48). A sequence
is said to be bounded if its range is bounded. In (Rn , d), a sequence {xν }ν∈N is said to
converge to x̄ (or to have x̄ as its limit point), written x^ν → x̄ or lim_{ν→∞} d(x^ν, x̄) = 0,
if for any ε > 0, there is some ν₀ ∈ N such that ν ≥ ν₀ entails d(x^ν, x̄) < ε. For instance,
the constant sequence with x^ν = x̄ converges to x̄. We can also have x^ν → x̄ despite
x^ν ≠ x̄ for all ν.
Given a sequence {xν }ν∈N , the sequence xν1 , xν2 , . . . , where ν1 , ν2 , . . . is a sequence of
positive integers such that ν1 < ν2 < . . . , is called a subsequence of {xν }ν∈N . To facilitate
statements involving subsequences, let N∞ denote the collection of subsets of N of the
form {ν₀, ν₀ + 1, . . .}, which contain all integers k greater than or equal to some positive
integer ν₀. Let N∞^# denote the collection of all subsets of N of infinite cardinality. Note
that N∞ ⊂ N∞^#. Given N ∈ N∞ or N ∈ N∞^#, we shall write x^ν →_N x to indicate that
the subsequence {x^k}_{k∈N} of the sequence {x^ν}_{ν∈N} converges to x. The limit point x of
a subsequence {x^k}_{k∈N} with N ∈ N∞^# is called a cluster point of the sequence {x^ν}_{ν∈N}.
It is also called an accumulation point of the sequence {x^ν}_{ν∈N} if x^ν →_N x with x^ν ≠ x
for all ν ∈ N. For instance, the sequence {(−1)^ν}_{ν∈N} has no limit point, but it has two
cluster points −1, +1 that are not accumulation points.
In a metric space, it is often illuminating to view a neighborhood V ∈ N (x̄) as the
[interior of the closure of the] union of the ranges from all sequences in V that converge
to x̄. Such a viewpoint leads to definitions based on sequences — consider, for instance,
In the sequel, following Rockafellar and Wets (1998), statements are made preferably
in terms of sequences.
A.3 Semicontinuity
The following definition of lower and upper limits uses a min/max characterization proved
in Rockafellar and Wets (1998, Lemma 1.7).
By considering the constant sequence xν = x̄ one sees that lim inf x→x̄ f (x) ≤ f (x̄)
and lim supx→x̄ f (x) ≥ f (x̄).
A.4 Definition. For properties invoked as relative to X, the limits are taken over se-
quences in X. In particular, the lower and upper limits of f : Rn → R at x̄ relative to X
become
lim inf_{x →_X x̄} f(x) = sup_{V ∈ N(x̄)} inf_{x ∈ V ∩ X} f(x)   and   lim sup_{x →_X x̄} f(x) = inf_{V ∈ N(x̄)} sup_{x ∈ V ∩ X} f(x) .
Among the numerous characterizations of continuity, we can thus find the following ones.
A.8 Definition. The function f is lower level-bounded if for all α ∈ R, the set
lev≤α f = {x ∈ Rn : f (x) ≤ α} is bounded.
To handle situations where the evaluation of inf f and argmin f has a finite precision,
it is useful to consider, when inf f is finite, the set of ε-optimal solutions
ε-argmin f = {x ∈ R^n : f(x) ≤ inf f + ε} .
As the elements of ε-argmin f are themselves evaluated with a finite precision, it is useful
to clarify in which sense elements x̃ close to ε-optimal solutions are close to being optimal
(Rockafellar and Wets, 1998, Theorem 1.43):
A.10 Theorem. If f : R^n → R is l.s.c. with inf f finite, the closed Euclidean ball of
radius ρ > 0 centered at an ε-optimal solution x̄ (ε > 0) contains a point x̃ which is the
unique solution to the minimization of the perturbed function f(x) + ε ρ^{-1} ||x − x̃|| and
satisfies f(x̃) ≤ f(x̄).
A.5 Epigraph
For a real-valued function f0 : Rn → R, recall that the graph of f0 is the set gph f0 =
{(x, α) ∈ Rn × R : α = f0 (x)}. For extended-real-valued functions to be minimized, we
consider epigraphs.
We note that in earlier references (Rockafellar, 1970, 1974), a slightly altered definition
of the closure of a function was used: the function lim inf_{x′→x̄} f(x′) of Definition A.13 was
denoted lsc f, and cl f was set to the constant function −∞ whenever lim inf_{x′→x} f(x′) =
−∞ for some x; cl f was set to lsc f otherwise.
A.6 Epi-convergence
The outer limit of a sequence {C^ν}_{ν∈N} of subsets C^ν ⊂ R^n is the set
lim sup_{ν→∞} C^ν = {x ∈ R^n : ∃ N ∈ N∞^#, ∃ x^ν ∈ C^ν for each ν ∈ N, such that x^ν →_N x}
= ⋂_{N ∈ N∞} cl ⋃_{ν ∈ N} C^ν .
The inner limit of a sequence {C^ν}_{ν∈N} of subsets C^ν ⊂ R^n is the set of limit points (if
any) of all sequences {x^ν}_{ν∈N} such that N ∈ N∞ and ∅ ≠ C^ν ∋ x^ν for each ν ∈ N:
lim inf_{ν→∞} C^ν = ⋂_{N ∈ N∞^#} cl ⋃_{ν ∈ N} C^ν .
From the definition, the inner and outer limits are (possibly empty) closed sets. In
particular, for an arbitrary subset V of Rn , the constant sequence with C ν = V has
lim inf_{ν→∞} C^ν = lim sup_{ν→∞} C^ν = cl(V). We always have the inclusion lim inf_{ν→∞} C^ν ⊂
lim sup_{ν→∞} C^ν since N∞ ⊂ N∞^#. For instance, given two closed subsets A, B ⊂ R^n, the
sequence {C^ν}_{ν∈N} with C^ν = A for ν odd and C^ν = B for ν even has lim sup_{ν→∞} C^ν = A ∪ B
and lim inf_{ν→∞} C^ν = A ∩ B.
A.15 Definition. If lim inf ν→∞ C ν = lim supν→∞ C ν = C, the limit limν C ν exists and
is equal to C. One writes C ν → C to indicate that the sequence {C ν }ν∈N converges to C
(in the sense of the Painlevé-Kuratowski convergence of sets).
A.16 Definition. The lower epi-limit of the sequence {f ν }ν∈N , denoted e-lim inf ν f ν ,
is the function defined by identifying its epigraph to the outer limit of the sequence of
sets epi f ν :
The upper epi-limit of the sequence {f ν }ν∈N , denoted e-lim supν f ν , is the function
defined by identifying its epigraph to the inner limit of the sequence of sets epi f ν :
The value of the epi-limits at x has the following characterization proved in Rockafellar
and Wets (1998, Proposition 7.2). Note that for a sequence {y ν }ν∈N in R, the least and
greatest cluster points of the sequence are respectively lim inf ν y ν = limν→∞ [inf κ≥ν y κ ]
and lim supν y ν = limν→∞ [supκ≥ν y κ ].
Since by definition the lower and upper limits are closed, the epi-limit f , when it
exists, is lower semicontinuous (Rockafellar and Wets, 1998, Proposition 7.4(a)).
A.19 Proposition. The sequence {f ν }ν∈N epi-converges to f iff at each point x, the
two following conditions hold (Rockafellar and Wets, 1998, Equation 7(3)):
i. lim inf ν f ν (xν ) ≥ f (x) for every sequence xν → x,
ii. lim supν f ν (xν ) ≤ f (x) for some sequence xν → x.
We note the following monotone convergence property (Rockafellar and Wets, 1998,
Proposition 7.4(c-d)): if f^ν ≥ f^{ν+1} for all ν (in the sense that f^ν(x) ≥ f^{ν+1}(x) for every
x and ν), then f^ν epi-converges to cl[inf_ν f^ν]. If f^ν ≤ f^{ν+1} for all ν, then f^ν epi-converges
to sup_ν[cl f^ν].
A.20 Theorem. Assume that the sequence {f ν }ν∈N epi-converges to a proper func-
tion f , with the functions f ν satisfying the assumptions of Theorem A.9 (f ν is proper,
l.s.c., and level-bounded). Then
i. The sets argmin f ν are nonempty and compact (by A.9);
ii. inf f is finite with argmin f nonempty and compact;
iii. inf f ν → inf f ;
iv. The cluster points of sequences {xν }ν∈N with xν ∈ argmin f ν are optimal for f :
∅ 6= lim supν (argmin f ν ) ⊂ argmin f ;
v. If argmin f = {x̄}, any sequence {xν }ν∈N with xν ∈ argmin f ν converges to x̄.
In Rockafellar and Wets (1998, Theorem 7.33), the assumptions on the functions f ν
are weakened: the sets lev≤α f ν have to be bounded for all α ∈ R only for ν in some
N ∈ N∞ , as having argmin f ν nonempty and compact in the tail of the sequence suffices.
Also, the sets from which the solutions x^ν are extracted are the sets
ε^ν-argmin f^ν = {x ∈ R^n : f^ν(x) ≤ inf f^ν + ε^ν}
with {ε^ν}_{ν∈N} a sequence with ε^ν > 0 decreasing monotonically to 0. In Theorem A.20
it is explicitly assumed that f is lower semicontinuous but actually as an epi-limit, f is
necessarily lower semicontinuous (Rockafellar and Wets, 1998, Proposition 7.4(a)).
With the following definition, taken from Rockafellar and Wets (1998, Exercise 7.9),
one gets a condition under which pointwise convergence entails epi-convergence (Rock-
afellar and Wets, 1998, Theorem 7.10). Here min{a, b} and max{a, b} refer to the lowest
and highest value between a and b.
The sequence is equi-upper semicontinuous at x̄ relative to X iff for every ρ > 0 and
ε > 0,
At the same time, observe from Proposition A.17 that when f ν epi-converges to f ,
there is at least one sequence xν → x̄ such that f ν (xν ) → f (x̄), whereas for an arbitrary
sequence xν → x̄ epi-convergence does not ensure that f ν (xν ) → f (x̄).
The following theorem is taken from Rockafellar and Wets (1998, Theorem 7.10):
are bounded for all α ∈ R and for every u in some neighborhood V ∈ N (ū) of ū.
The following definition is useful inasmuch as one cannot usually assert that p(u) is
continuous in u.
A.10 Convexity
A.35 Definition. A subset C of Rn is convex if for all points x, y ∈ C and for 0 < λ < 1,
the points (1 − λ)x + λy are in C.
Convex functions are continuous at least on the interior of their effective domain
(Rockafellar and Wets, 1998, Theorem 2.35):
The last implication means that for every integer m and sets X(m) = {x ∈ R^n : x =
∑_{k=1}^m λ_k x_k with λ_k > 0, ∑_{k=1}^m λ_k = 1, x_k ∈ dom f}, it holds that x^ν →_{X(m)} x̄ implies
f(x^ν) → f(x̄).
Proposition A.37 also covers the result that lower semicontinuous convex functions
whose effective domain is a polytope are continuous on their domain (Gale et al., 1968).
Convex functions can have discontinuities on the boundary of their effective domain,
even if they are also l.s.c.:
For extended-real valued functions, it makes sense to focus on Lipschitz continuity prop-
erties locally (Rockafellar and Wets, 1998, page 350).
where by convention |f(x) − f(x′)| = ∞ if f(x) or f(x′) (or both) is infinite. The
function f is said to be strictly continuous (or locally Lipschitz continuous) at x̄
relative to X if lip_X f(x̄) is finite, and f is said to be strictly continuous relative to X if
it has that property at each x̄ ∈ X. The mention of X is omitted when X = int dom f.
Appendix B
Basic Probability Theory
This appendix presents standard material from measure and probability theory (Billings-
ley, 1995; Pollard, 1990).
The definitions and results collected in the present appendix are relevant to this thesis
inasmuch as stochastic programming fundamentally deals with randomness. As random
objects more general than random vectors are required in the context of stochastic pro-
gramming, we define them here, using a formalism based on set-valued mappings (see
Definition B.8), following Rockafellar and Wets (1998).
The appendix is organized as follows. Section B.1 defines the notions of sigma-algebra
and probability space. Section B.2 defines random variables and random vectors. Section
B.3 defines random sets. Section B.4 defines random functions. Section B.5 defines the
expectation, including the treatment of extended-real-valued functions (Rockafellar and
Wets, 1998, Chapter 14). Section B.6 defines distributions and cumulative distribution
functions.
The probability space is made of three elements: the sample space, the sigma-algebra,
and the probability measure.
The empty set Ωc = ∅ and the sample space Ω are always in B by definition. Countable
B.3 Definition. The sigma-algebra generated by B0 is the intersection of all the sigma-
algebras that contain the class B0 .
Note that there are often several ways of generating the same sigma-algebra.
Examples of useful sigma-algebras are given below.
• The trivial sigma-algebra is the smallest possible sigma-algebra, made of the two
sets ∅ and Ω.
• The Borel sigma-algebra of the unit interval B((0, 1]) is the sigma-algebra generated
by the class of subintervals of (0, 1] of the form I = (a, b] with 0 < a < b ≤ 1. In
fact, the sigma-algebra generated by a countable number of subintervals (a, b] with
a, b restricted to rational numbers and 0 < a < b ≤ 1 can also be shown to coincide
with B((0, 1]).
• The Borel sigma-algebra B(R) is the sigma-algebra generated by the class of inter-
vals I = (a, b] of R. It can also be generated by the class of intervals (−∞, t], t ∈ R.
When we define functions on a sigma-algebra that may be valued in the extended
real line R̄ = R ∪ {±∞}, we consider the Borel sigma-algebra B(R̄) generated by
the subsets of B(R) and the two sets {−∞} and {+∞}, or alternatively by intervals
of the form (t, +∞], [−∞, t), t ∈ R.
In general, Borel sigma-algebras can be generated from all the open subsets of a topo-
logical space, or alternatively, from all the closed subsets of the topological space.
The elements of B(Rk ) are called Borel sets, without mention to Rk when k is clear
from the context.
A set B of a sigma-algebra B is said to be B-measurable. In the context of probability
theory, a set B ∈ B is referred to as an event.
Probabilities can be assigned to events by the means of a probability measure.
B.6 Definition. The P-completion of a sigma-algebra B is the class of sets B for which
there exist sets B0 , B1 ∈ B with B0 ⊂ B ⊂ B1 and P{B1 \ B0 } = 0.
F^{-1}(y) = {x ∈ X : y ∈ F(x)} ,   F(B) = ⋃_{x∈B} F(x) = {y ∈ Y : F^{-1}(y) ∩ B ≠ ∅} .
In practice, it is not necessary to check that the pre-image of every element of the
sigma-algebra C is in B: checking the condition for a class of subsets generating the
sigma-algebra C is sufficient (Billingsley, 1995, Theorem 13.1(i)):
When Ω′ = R with C the Borel sigma-algebra B(R), the B/C measurable mapping is
a real-valued mapping corresponding to a real-valued random variable.
Random variables can also be defined as functions of other random variables, not
necessarily defined on the same measurable space. For that purpose, the following result
is useful (Billingsley, 1995, Theorem 13.1(ii)).
Sometimes one considers the random variables first, and then generates the sigma-
algebras in such a way that the random variables of interest are measurable.
That is, the sigma-algebra generated by f1, . . . , fk, with fi a mapping from Ω to a space Ω′_i
equipped with a sigma-algebra Ci, is defined as the sigma-algebra generated by the class
of sets {f_i^{-1}(C) : C ∈ Ci, i = 1, . . . , k}.
Random variables measurable with respect to the sigma-algebra generated by a collec-
tion of random variables are equivalent to functions of those random variables (Billingsley,
1995, Theorem 20.1):
(For brevity, we also allow ourselves to say that h is measurable with respect to f 1 , . . . , fk
when h is measurable with respect to the sigma-algebra generated by f1 , . . . , fk .)
Consider again the collection of random variables f1, . . . , fk where fi is a B/Ci-
measurable mapping from Ω to Ω′_i. Let F0 denote the trivial sigma-algebra, and let
Fi denote the sigma-algebras generated by the subcollection {f1 , . . . , fi } of random vari-
ables. By definition, it holds that Fi ⊆ Fj for 0 ≤ i < j ≤ k.
Random sets are defined as measurable set-valued mappings. We will only consider
mappings to closed subsets of Rn .
Remark B.1. It is not clear to us whether the empty set is considered in Rockafellar
and Wets as an admissible value for a closed-valued mapping, and how the selection
defined as a function can handle that case. The definition of the selection has
been changed in Dontchev and Rockafellar (2009, page 49) to allow for a local
definition, but a local definition on the subset Ω0 of Ω where F is not empty-valued
is not desirable for a measurable selection, which should be defined on the full
sample space Ω. Aubin and Frankowska (1990, Theorem 8.1.3) avoid the issue
by dealing only with non-empty-closed-valued measurable mappings F , but this
choice rules out the use of a measurable selection for selecting an optimal solution
to a parametric optimization program which is infeasible in some region of the
parameter space.
The following results are taken from Rockafellar and Wets (1998, Propositions 14.28,
Theorem 14.37).
B.22 Theorem. Let f : Ω × Rn → R be a normal integrand. Then for p(ω) = inf f (ω, ·)
and P (ω) = argmin f (ω, ·), it holds that the function p : Ω → R is measurable, the
mapping P : Ω ⇒ Rn is closed-valued and measurable, and in particular P admits a
measurable selection.
Examples of normal integrands are recorded below (Rockafellar and Wets, 1998, Ex-
amples 14.29, 14.31, 14.32; Proposition 14.39; Exercise 14.55), whereas Theorem B.19
gives the general measurability condition for set-valued mappings that characterizes nor-
mal integrands.
B.5 Expectation
Let (Ω, B, µ) be a measure space. Let M denote the class of all B/B(R)-measurable
mappings from Ω to R, and let M+ denote the class of all nonnegative mappings in M.
The expectation (or expected value) of nonnegative random variables is defined through
the integral (Billingsley, 1995, Equation 15.3).
where the supremum is over the partitions {B ν }1≤ν≤N of Ω with N finite and B ν ∈ B.
The expectation of a random variable f ∈ M+ on a probability space (Ω, B, P) is
(setting µ to P)
E{f} = ∫ f dP .
The expectation of nonnegative random variables can also be defined through prop-
erties (Pollard, 2001, Theorem 2.12). For a set B ∈ B, let IB ∈ M+ denote the indicator
function of B defined by I_B(ω) = 1 if ω ∈ B and I_B(ω) = 0 if ω ∉ B.
B.25 Definition. For each probability measure P on the measurable space (Ω, B), there
is a functional E from M+ to [0, ∞] uniquely determined by the following properties.
i. E{IB } = P{B} for all B ∈ B;
ii. E{0} = 0 where the zero of the left-hand side denotes a zero-valued measurable
mapping;
iii. For α, β ≥ 0 and f, g ∈ M+ , E{αf + βg} = αE{f } + βE{g};
iv. If f, g are in M+ and f (ω) ≤ g(ω) for almost all ω ∈ Ω, then E{f } ≤ E{g};
v. (Monotone convergence.) For a sequence {f ν }ν∈N of functions f ν ∈ M+ , if
f ν (ω) → f (ω) with f ν (ω) ≤ f ν+1 (ω) for almost all ω ∈ Ω, then E{f ν } → E{f }
with E{f ν } ≤ E{f ν+1 }.
E{αf +βg} = αE{f }+βE{g} for α, β in R provided that the situation ∞−∞ is avoided.
Classes of random variables with finite expectations define particular spaces of mea-
surable functions (Pollard, 2001, Section 2.7).
B.28 Definition. Let (Ω, B, P) be a probability space. For 1 ≤ p < ∞, consider the
space ℒ^p(Ω, B, P) of functions f ∈ M such that E{|f|^p} is finite. For p = ∞, consider
the space ℒ^∞(Ω, B, P) of functions f ∈ M for which the essential supremum inf{α ∈ R :
P{ω : |f(ω)| > α} = 0} is finite. Then, the Lebesgue space L^p(Ω, B, P) (1 ≤ p ≤ ∞)
is defined as the space of equivalence classes of functions [f] = {g ∈ ℒ^p(Ω, B, P) : g =
f P-almost surely}.
To each element f of the space ℒ^p(Ω, B, P) can be associated the real number ||f||_p =
(E{|f|^p})^{1/p}. The reduction to equivalence classes of functions is made so that in
L^p(Ω, B, P), ||f − g||_p = 0 entails f = g. (|| · ||_p is a semi-norm for ℒ^p(Ω, B, P) and
a norm for L^p(Ω, B, P): see Definition C.3.)
Now we turn our attention to expectations over random functions. The expectation
over a random function is well-defined for normal integrands (Rockafellar and Wets, 1998,
Proposition 14.58):
The Lebesgue spaces Lp (Ω, B, P; Rn ) are decomposable, whereas the space of constant-
valued functions and the space of continuous mappings f : Ω → Rn are not decomposable
relative to most measures P (Rockafellar and Wets, 1998, page 677).
and as long as inf x∈X E{f (ω, x(ω))} > −∞, it holds that x̄ ∈ X is in argminx∈X Ef [x]
iff x̄(ω) is in argminx∈Rn f (ω, x) for P-almost every ω ∈ Ω.
B.6 Distributions
µ(B) = P{x(ω) ∈ B} .
if and only if F is absolutely continuous (Billingsley, 1995, Theorem 31.8), in the following
sense (Billingsley, 1995, Equation 31.28):
Appendix C
Elements of Functional Analysis for Kernel Methods
This appendix presents results from functional analysis useful in the theory of kernel
methods. We use kernels or kernel-based methods in several places in the thesis (Chapters
3, 5).
The appendix is organized as follows. Section C.1 defines Hilbert spaces. Section C.2
defines continuous linear mappings. Section C.3 defines reproducing kernels, positive def-
inite kernels, and reproducing kernel Hilbert spaces. Section C.4 gives the interpretation
of positive definite kernels as generalized inner products.
If (F, d) is a complete metric space, then a set C is closed when d(f, C) = 0 entails
f ∈ C.
Recall that a linear space F over a field K is a set F for which the addition (+) of two
elements f, g ∈ F and the multiplication (·) of an element f ∈ F by a scalar α ∈ K obey
C.3 Definition. A linear space F over a field K is a normed linear space if to every
element f ∈ F is associated a real number ||f ||, called the norm of f , with the following
properties (where f, g ∈ F and α ∈ K):
i. ||f || ≥ 0 with ||f || = 0 iff f = 0
ii. (Subadditivity.) ||f + g|| ≤ ||f || + ||g||
iii. ||αf || = |α| · ||f ||.
In a normed linear space F , the function d(f, g) = ||f − g|| is a metric for F .
In Definition C.3, if ||f || satisfies only the conditions ii and iii (which imply ||f || ≥ 0),
then ||f || is called a semi-norm. If ||f || satisfies conditions i, ii, and instead of condition iii
the weaker set of conditions
iii’. || − f || = ||f || ,
αν → 0 entails ||αν f || → 0 ,
||f ν || → 0 entails ||αf ν || → 0 ,
then ||f || is called a quasi-norm, and F is called a quasi-normed linear space. When F
is a quasi-normed or a normed linear space, f ν → f entails ||f ν || → ||f ||; furthermore, if
f ν → f , g ν → g, and αν → α, it holds that f ν + g ν → f + g and αν f ν → αf (Yosida,
1980, Proposition 2.2).
C.4 Proposition. Let F be a real pre-Hilbert space. The inner product between
f, g ∈ F is defined by
⟨f, g⟩ = ¼||f + g||² − ¼||f − g||² ,
and satisfies the following properties (where f, g, h ∈ F and α ∈ R):
⟨αf, g⟩ = α⟨f, g⟩ ;   ⟨f + g, h⟩ = ⟨f, h⟩ + ⟨g, h⟩ ;   ⟨f, g⟩ = ⟨g, f⟩ ;   ⟨f, f⟩ = ||f||² .
Moreover, we have |⟨f, g⟩| ≤ ||f|| ||g|| (Cauchy-Schwarz inequality).
C.5 Definition. A normed linear space that is complete is called a Banach space. A
pre-Hilbert space that is complete is called a Hilbert space.
C.6 Proposition. A separable Hilbert space F has an orthogonal base {f^ν}_{ν∈I} with
at most a countable number of elements (Yosida, 1980, Corollary III.5). Any f ∈ F then
admits a Fourier expansion with respect to this base.
C.2 Linear Mappings
Let us denote by || · ||X and || · ||Y the norms of the Banach spaces X and Y . Let ||T ||
denote the smallest constant c > 0 such that ||T (x)||Y ≤ c||x||X for all x ∈ dom T . We
say that T is bounded if ||T || is finite, and call ||T || the operator norm of T . It holds
that a linear mapping T : X → Y is continuous if and only if T is bounded (Yosida,
1980, Corollary I.6.2).
Let L(X, Y ) denote the space of all continuous linear mappings T : X → Y . The
following statement of Riesz’s representation theorem is taken from Yosida (1980, Section
III.6).
C.8 Theorem. Let X be a Hilbert space over the field K and let f be an element of
L(X, K). Then there exists a unique element y_f ∈ X such that f(x) = ⟨x, y_f⟩ for every
x ∈ X, with ||f|| = ||y_f||_X. Conversely, an element y ∈ X defines a unique mapping f_y
in L(X, K) by f_y(x) = ⟨x, y⟩ for every x ∈ X, with ||f_y|| = ||y||_X.
Note that k(·, y) acts as a Dirac distribution centered at y by the reproducing property,
whereas k(·, y) is actually a function defined on X.
With f(·) = k(·, y), Property ii. yields k(x, y) = ⟨k(·, y), k(·, x)⟩. For real Hilbert
spaces we can thus write k(x, y) = ⟨k(·, x), k(·, y)⟩, whereas for complex Hilbert spaces
k(x, y) is the complex conjugate of ⟨k(·, x), k(·, y)⟩.
If a reproducing kernel k exists, it is unique (Aronszajn, 1950). A Hilbert space
for which a reproducing kernel exists is called a reproducing kernel Hilbert space
(RKHS). A reproducing kernel of F exists if and only if for every y ∈ X, the mapping
f ↦ f(y) (called the evaluation functional) is a continuous linear mapping with
respect to f ∈ F, meaning that there exists a finite c_y > 0 such that |f(y)| ≤ c_y ||f||
for all f ∈ F. If k exists, the smallest c_y is k(y, y)^{1/2}, by the Cauchy-Schwarz
inequality (Proposition C.4) applied to f(y) = ⟨f(·), k(·, y)⟩; conversely, if a continuous
linear mapping F_y(f) = f(y) exists for every y, we have F_y(f) = ⟨f(·), g_y(·)⟩ for some
g_y ∈ F (by Theorem C.8), so that g_y(x) = k(x, y) is a reproducing kernel (Yosida, 1980,
Proof of Theorem III.9.1).
From the relation |f(y)| ≤ k(y, y)^{1/2} ||f||, one deduces that if there exists a scalar
c > 0 such that k(y, y)^{1/2} ≤ c for all y ∈ X, then ||f||_∞ = sup_{y∈X} |f(y)| ≤ c ||f||.
For the particular case of normalized kernels [k(y, y) = 1] we have ||f||_∞ ≤ ||f||.
For a sequence {f_n}_{n∈N} in a RKHS, ||f_n − f|| → 0 entails f_n(y) → f(y) for every
y ∈ X: strong convergence f_n → f implies f_n(y) → f(y) by continuity of the evaluation
functional represented by k(·, y).
C.10 Proposition. A reproducing kernel k for a class F of K-valued functions has the
property that
Σ_{i=1}^{n} Σ_{j=1}^{n} α_i k(y_i, y_j) α_j = || Σ_{i=1}^{n} α_i k(·, y_i) ||^2 ≥ 0
for any finite collection of elements y_i ∈ X and coefficients α_i ∈ K. That is, the Gram
matrix K ∈ K^{n×n} with elements K_{ij} = k(y_i, y_j) is positive semi-definite.
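The following short numerical sketch (in Python, with an arbitrary Gaussian kernel and
randomly drawn points chosen purely for illustration) checks the statement of Proposition
C.10 by forming a Gram matrix and verifying that its eigenvalues, and the quadratic form
it induces, are nonnegative.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a positive definite kernel on R^d
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))                      # six points y_i in R^3
K = np.array([[gaussian_kernel(a, b) for b in points] for a in points])

print(np.linalg.eigvalsh(K).min() >= -1e-12)          # True: the Gram matrix is positive semi-definite
alpha = rng.normal(size=6)
print(alpha @ K @ alpha >= -1e-12)                    # the quadratic form of Proposition C.10 is nonnegative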
When X ⊂ R^d, Proposition C.10 can also be stated in integral form: for a reproducing
kernel k and any K-valued continuous function α with nonzero values on a compact subset
of X,
∫_X ∫_X α(x) k(x, y) α(y) dx dy ≥ 0 .
The converse of Proposition C.10 is also true (Aronszajn, 1950, Theorem 2.4 at-
tributed to E.H. Moore). Before stating the theorem, we define the notion of positive
definite kernel.
Being a reproducing kernel, a positive definite kernel k has the property that k(·, y)
is continuous for every y ∈ X. The property does not imply that k is continuous as a
mapping from X × X to K (Lehto, 1952). A continuous positive definite kernel is called
a Mercer kernel. Since a function is continuous at any isolated point of its domain
(Rudin, 1976, Definition 4.5), the distinction between positive definite kernels and Mercer
kernels is irrelevant when X is a discrete set.
To build the class F of Theorem C.12 corresponding to a positive definite kernel k,
we follow Aronszajn (1950): one starts from the functions of the form f(·) = Σ_i α_i k(·, y_i)
(finite sums with y_i ∈ X and α_i ∈ K), to which corresponds the norm
||f|| = [ Σ_i Σ_j α_i k(y_i, y_j) α_j ]^{1/2}, and then the class is completed with respect
to that norm. The inner product between two such functions f(·) = Σ_{i=1}^{m} α_i k(·, y_i)
and g(·) = Σ_{j=1}^{n} β_j k(·, y_j') is then given by
⟨f, g⟩ = Σ_{i=1}^{m} Σ_{j=1}^{n} α_i β_j k(y_j', y_i) = Σ_{i=1}^{m} Σ_{j=1}^{n} α_i β_j k(y_i, y_j') .
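As a hedged numerical sketch of this construction (the kernel, points and coefficients
below are arbitrary choices made for illustration), the norm and inner product of finite
kernel expansions reduce to a quadratic and a bilinear form in Gram matrices.

import numpy as np

def k(x, y, sigma=1.0):
    # illustrative Gaussian kernel; any positive definite kernel could be used instead
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
ys, yps = rng.normal(size=(4, 2)), rng.normal(size=(3, 2))   # points y_i and y_j'
alpha, beta = rng.normal(size=4), rng.normal(size=3)         # coefficients of f and g

# f(.) = sum_i alpha_i k(., y_i)   and   g(.) = sum_j beta_j k(., y_j')
K = np.array([[k(a, b) for b in ys] for a in ys])            # K_ij = k(y_i, y_j)
C = np.array([[k(a, b) for b in yps] for a in ys])           # C_ij = k(y_i, y_j')

norm_f = np.sqrt(alpha @ K @ alpha)    # ||f|| = [sum_ij alpha_i k(y_i, y_j) alpha_j]^{1/2}
inner_fg = alpha @ C @ beta            # <f, g> = sum_ij alpha_i beta_j k(y_i, y_j')
print(norm_f, inner_fg)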
Proposition C.10 and Theorem C.12 show that Reproducing Kernel Hilbert Spaces are
uniquely determined by the choice of a positive definite kernel. In the sequel, we refer
to a positive definite kernel simply as a kernel.
If X is a compact subspace of R^d, a continuous kernel k : X × X → R admits an
eigenfunction expansion
k(x, y) = Σ_{ν=1}^{m} λ_ν ψ_ν(x) ψ_ν(y)
with λ_ν > 0 and m ≤ ∞ (Mercer, 1909). The vector φ(x) = {√λ_ν ψ_ν(x)}_{1≤ν≤m} is
interpreted as a feature vector for x, whereas k(x, y) is viewed as a generalized inner
product between the vectors φ(x), φ(y) valued in some feature space F. The mapping
φ : X → F is called a feature map (Aizerman et al., 1964).
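To make the feature-map interpretation concrete, the following sketch (an illustration
added here, using the degree-2 polynomial kernel on R², for which a finite-dimensional
feature map is available in closed form) verifies numerically that k(x, y) = ⟨φ(x), φ(y)⟩.

import numpy as np

def k(x, y):
    # polynomial kernel of degree 2 on R^2
    return (x @ y + 1.0) ** 2

def phi(x):
    # explicit feature map satisfying k(x, y) = <phi(x), phi(y)>
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 * x1, x2 * x2, s * x1 * x2, s * x1, s * x2, 1.0])

x, y = np.array([0.5, -1.0]), np.array([2.0, 0.3])
print(k(x, y), phi(x) @ phi(y))    # both evaluate to the same number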
To elucidate the nature of F, observe that φ(x) belongs to the space ℓ² of vectors
{ξ^ν}_{ν∈N} such that Σ_{ν=1}^{∞} |ξ^ν|² < ∞, since k(x, x) = Σ_{ν=1}^{m} φ_ν(x)² is
finite. In fact ℓ² is a normed linear space equipped with the norm
||{ξ^ν}_{ν∈N}|| = [Σ_{ν=1}^{∞} (ξ^ν)²]^{1/2}, which can be interpreted as a generalization
of the Euclidean norm in R^n when n tends to ∞.
The feature map is continuous: x^ν → x̄ entails φ(x^ν) → φ(x̄), since
||φ(x^ν) − φ(x̄)||² = k(x^ν, x^ν) + k(x̄, x̄) − 2k(x^ν, x̄) → 0 by continuity of k
(Cucker and Smale, 2001).
From k(·, y) = Σ_{ν=1}^{m} λ_ν ψ_ν(·) ψ_ν(y) and f(·) = Σ_{i=1}^{n} α_i k(·, y_i) for some
n ≤ ∞, one can see that f has the form f(·) = Σ_{ν=1}^{m} α^f_ν ψ_ν(·) with
α^f_ν = Σ_{i=1}^{n} α_i λ_ν ψ_ν(y_i), and that
⟨f, g⟩ = Σ_{ν=1}^{m} α^f_ν α^g_ν / λ_ν .
In machine learning, it is common to extend the feature map interpretation to more
general spaces X and say that a function k : X × X → K is a kernel if there exist a
Hilbert space H and a mapping φ : X → H such that k(x, y) = ⟨φ(x), φ(y)⟩ for all
x, y ∈ X (Steinwart and Christmann, 2008, Definition 4.1). The corresponding RKHS is
the class F of functions of the form f(·) = ⟨h, φ(·)⟩_H for some h ∈ H, equipped with
the norm ||f|| = inf_{h∈H} {||h||_H : f(·) = ⟨h, φ(·)⟩_H} (Steinwart and Christmann, 2008,
Theorem 4.21). Proposition C.13 still holds.
For the class of shift-invariant continuous kernels k : X × X → R with X = R^d, where
the shift-invariance property means that k(x + τ, y + τ) = k(x, y) for any τ ∈ R^d, Bochner's
theorem (Bochner, 1933) [see also Yosida (1980, Theorem XI.13.2)] characterizes the
kernels k in the frequency domain. The following statement particularizes to real-valued
normalized kernels (k(x, x) = 1 for all x ∈ Rd ) a form of Bochner’s theorem given in
Hofmann et al. (2008).
Thus we have k(x, y) = E{exp(j⟨x, ξ⟩) · conj(exp(j⟨y, ξ⟩))} = E{exp(j⟨x − y, ξ⟩)}, which
is very similar to Mercer's eigenfunction expansion (the countable sum has been replaced
by an integral, as X = R^d is now unbounded).
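A practical use of this spectral view, in the spirit of Rahimi and Recht (2008), is to
approximate a shift-invariant normalized kernel by Monte Carlo sampling of the frequency
ξ. The sketch below is an illustration under the assumption that k is the Gaussian kernel
exp(−||x − y||²/2), whose spectral measure is the standard normal distribution; it is not
a construction taken from the thesis.

import numpy as np

rng = np.random.default_rng(2)
d, D = 3, 5000                                  # input dimension, number of random features

W = rng.normal(size=(D, d))                     # frequencies xi sampled from the spectral measure N(0, I)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)       # random phases

def z(x):
    # random Fourier feature map: E[ z(x) @ z(y) ] = exp(-||x - y||^2 / 2)
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
print(np.exp(-np.linalg.norm(x - y) ** 2 / 2.0), z(x) @ z(y))   # exact kernel value vs. Monte Carlo estimate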
From Bochner’s theorem, Schoenberg (1938) obtains a characterization of shift-invariant
kernels having a radial symmetry.
Using basic kernels and positivity-preserving operations, more complex kernels can be
built; for instance, sums and products of positive definite kernels are again positive
definite kernels.
Appendix D
Structural Results for Two-Stage Stochastic Programming

This appendix describes a classical formulation of the two-stage stochastic linear program
with recourse, and gives details on the structure of optimal solutions. We have included
this appendix in the thesis because it clarifies the origin of certain assumptions that are
found to be technically challenging to remove in stochastic programming models.
The material is mainly taken from Birge and Louveaux (1997), up to some adjustments
based on Wets (1974); Römisch and Wets (2007); Shapiro et al. (2009).
The appendix is organized as follows. Section D.1 states the problem and gives a list
of assumptions that ensure that the formulation is meaningful. Section D.2 gives useful
properties that can be derived from the previous assumptions.
D.1 Problem Statement and Assumptions

Let (Ω, B, P) be a probability space. A two-stage stochastic linear program with recourse
is a program that minimizes, over the first-stage decision x, a linear first-stage cost plus
the expected recourse cost E{Q(x, ω)}, where Q(x, ω) is the optimal value of a second-stage
linear program in the recourse decision y (its form is recalled in (D.8) below). Consider
the random vector
z(ω) = ( q(ω), t1 (ω), . . . , ts2 (ω), w1 (ω), . . . , ws2 (ω), h(ω) ) (D.6)
which collects all the (possibly non-random) elements of (q, T, W, h). Let Z ⊂ R p with
p = m2 + s2 (m + m2 + 1) denote the support of z.
Various well-posed forms of the program can be distinguished. To this end, we describe
standard assumptions. The joint role of those assumptions is detailed in Section D.2.
D.1 Definition (Measurability). The support of z is a Borel set of Rp , and the sigma-
algebra B contains the collection of Borel sets of the support of z, that is,
B ⊃ {B ∩ Z : B ∈ B(Rp )} .
The stated measurability assumption is consistent with Wets (1974, page 311). The
measurability of Q(x, ·) for each fixed x requires at least that the sigma-algebra gener-
ated by z be included in B. It does not harm to allow sigma-algebras larger than the
sigma-algebra generated by z since this would not alter the optimal value of the pro-
gram. Note that using larger sigma-algebras makes it possible to select distinct vectors
y ∗ (ω1 ), y ∗ (ω2 ) for attaining the optimal value of Q when z(ω1 ) = z(ω2 ), that is, to im-
plement a stochastic policy for y. Most authors rule out this possibility, but in practice
a numerical solution algorithm could indeed return distinct optimal values for y in face
of duplicate realizations of z.
Now, recall that a mapping F : Rd → Rm is said to be affine iff it has values F (ξ) =
b̄ + Bξ for some fixed b̄ ∈ Rm and B ∈ Rm×d .
D.3 Definition (Fixed Recourse). For all ω, W(ω) is a fixed matrix W ∈ R^{s2×m2}.
Fixed recourse is a simplifying assumption under which the value function E{Q(·, ω)}
is easier to describe. The rows of the fixed matrix W are always assumed to be linearly
independent to avoid trivial redundancies or conflicts among equality constraints (Wets,
1974, page 312).
D.4 Definition (Complete Recourse). The fixed recourse matrix W satisfies
pos W = {W y : y ≥ 0} = W R^{m2}_+ = R^{s2} .

The condition W R^{m2}_+ = R^{s2} implies that for any x ∈ R^m and all ω ∈ Ω, there exists
some y ≥ 0 such that T(ω)x + W y = h(ω). This surjectivity condition is sufficient
for having Q(x, ω) < ∞ almost surely. From (D.8) below, one can show that complete
recourse holds iff {π ∈ R^{s2} : Wᵀπ ≤ 0} = {0} (Shapiro et al., 2009, page 33). Recall
from Rockafellar (1970, page 65) that the dimension of the largest subspace contained
in a cone is called the lineality of the cone; methods to check that the lineality of pos W
is s2 are described in Wets and Witzgall (1967) and Wallace and Wets (1992).
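As a small computational sketch (an illustration added here, distinct from the procedures
cited above), the characterization {π ∈ R^{s2} : Wᵀπ ≤ 0} = {0} can be tested directly by
solving a few bounded linear programs, one per coordinate and sign.

import numpy as np
from scipy.optimize import linprog

def has_complete_recourse(W, tol=1e-9):
    # Check whether {pi : W^T pi <= 0} = {0}, i.e. pos W = R^{s2}, with 2*s2 small LPs.
    s2 = W.shape[0]
    A_ub, b_ub = W.T, np.zeros(W.shape[1])
    bounds = [(-1.0, 1.0)] * s2                  # box constraints keep each LP bounded
    for i in range(s2):
        for sign in (1.0, -1.0):
            c = np.zeros(s2)
            c[i] = -sign                         # maximize sign * pi_i over the truncated cone
            res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
            if -res.fun > tol:                   # a nonzero pi lies in the cone: no complete recourse
                return False
    return True

print(has_complete_recourse(np.hstack([np.eye(2), -np.eye(2)])))   # True:  pos [I, -I] = R^2
print(has_complete_recourse(np.array([[1.0], [0.0]])))             # False: a single column cannot span R^2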
D.5 Definition (Relatively Complete Recourse). For all x ∈ K1 and P-almost all
ω ∈ Ω, there exists some y ≥ 0 such that T(ω)x + W(ω)y = h(ω), that is,
h(ω) − T(ω)x ∈ W(ω) R^{m2}_+ .
The relatively complete recourse assumption means that Q(x, ω) < ∞ for almost
all ω and x ∈ K1 . We could still have E{Q(x, ω)} = ∞ if the distribution of Q(x, ·) is
not integrable. In particular, the assumption alone does not guarantee that K 1 ⊂ K2
— compare to Wets (1974, Definition 6.1) and Birge and Louveaux (1997, page 92).
Note that no generic method is available for checking that a relatively complete recourse
assumption holds. Relatively complete recourse is thus typically asserted at the modeling
step, where penalties in the objective can be favored over hard constraints.
The dual feasibility assumption ensures that Q(x, ω) > −∞ for almost all ω and all x.
Indeed, by weak duality,
Q(x, ω) = inf_y { ⟨q(ω), y⟩ : T(ω)x + W(ω)y = h(ω), y ≥ 0 }
        ≥ sup_π { ⟨π, h(ω) − T(ω)x⟩ : W(ω)ᵀ π ≤ q(ω) }        (D.8)
        > −∞ if Π(ω) ≠ ∅,
where Π(ω) = {π : W(ω)ᵀ π ≤ q(ω)} denotes the dual feasible set.
D.7 Definition (Fixed Technology). For all ω, T(ω) is a fixed matrix T ∈ R^{s2×m}.
D.8 Definition (Finite Second Moments). z ∈ L^2(Ω, B, P), that is, E{||z||^2} < ∞.
D.9 Definition (Finite Support). The support of z is finite, that is, there exists a
finite set Z = {z^1, z^2, . . . , z^n} such that P{ω : z(ω) = z^ν} = p_ν > 0 with
Σ_{ν=1}^{n} p_ν = 1.
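Under the finite support assumption, the two-stage program collapses into one deterministic
linear program with a copy of the recourse variables per atom z^ν. The sketch below is a
toy instance invented for illustration (one first-stage variable with unit cost, an equality
constraint x + y_shortage − y_surplus = h with h ∈ {1, 3} equally likely, and recourse costs
(3, 0.1)); it solves the deterministic equivalent with scipy.

import numpy as np
from scipy.optimize import linprog

p = np.array([0.5, 0.5])                 # scenario probabilities
h = np.array([1.0, 3.0])                 # scenario right-hand sides
c_first, q = 1.0, np.array([3.0, 0.1])   # first-stage cost and recourse costs (shortage, surplus)

# Variables: (x, y1_shortage, y1_surplus, y2_shortage, y2_surplus), all nonnegative
c = np.concatenate(([c_first], p[0] * q, p[1] * q))
A_eq = np.array([[1.0, 1.0, -1.0, 0.0, 0.0],      # x + y1_shortage - y1_surplus = h^1
                 [1.0, 0.0, 0.0, 1.0, -1.0]])     # x + y2_shortage - y2_surplus = h^2
res = linprog(c, A_eq=A_eq, b_eq=h, bounds=[(0.0, None)] * 5, method="highs")
print(res.x[0], res.fun)                 # optimal first-stage decision and optimal expected cost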
ii. Under the finite second moments and fixed recourse assumptions,
K2 = ∩_{(h,T)∈Σ} {x ∈ R^m : h − T x ∈ pos W} ,
where Σ denotes the support of (h, T), as shown in Wets (1974, Theorem 4.2) or Shapiro
et al. (2009, Equation 2.33).
iii. Under the finite second moments and fixed recourse assumptions, if the support
of T is polyhedral and if h, T are statistically independent, then K2 is polyhedral
(Wets, 1974, Corollary 4.13).
iv. Under the finite second moments, fixed recourse and fixed technology assumptions,
K2 is polyhedral; more precisely (Wets, 1974, Theorem 4.10) there exist a matrix
W* ∈ R^{p×s2} (p finite) and a vector α* ∈ R^p such that
K2 = {x ∈ R^m : W* T x ≥ α*} .
A proper function is said to be polyhedral when its epigraph is a polyhedral set. The
domain of a polyhedral function is necessarily a polyhedral set. A result established in
Rockafellar and Wets (1998, Theorem 2.49) shows that the class of proper polyhedral
functions is the class of proper convex piecewise linear functions.
It is interesting to identify conditions under which E{Q(x, ω)} (second term of ob-
jective in D.1, often referred to as the value function) is polyhedral. The finite sum
of proper polyhedral convex functions is polyhedral (Rockafellar, 1970, Theorem 19.4)
and the multiplication of a proper polyhedral convex function by a nonnegative scalar is
polyhedral (Rockafellar, 1970, Corollary 19.5.1), so when the support of ξ is finite, the
question is reduced to investigating under which conditions the integrand of the value
function, Q(x, ω), is proper and polyhedral.
The next lemma (Römisch and Wets, 2007, Lemma 3.1), which reformulates results in
Walkup and Wets (1969a), is instrumental in describing the structure of Q(x, ω) without
necessarily assuming fixed recourse. Under the affine dependence assumption, let Ξ ⊂ R^d
denote the support of ξ, and let Φ : R^d × R^{m2} × R^{s2} → R be a mapping with values
Observe that
D.12 Lemma. Let the affine dependence and the polynomial support assumptions hold.
Let ξ ∈ Ξ be fixed. Then,
i. The sets D(ξ) and H(ξ) are polyhedral;
ii. The function Φ(ξ, ·, ·) is finite and continuous on D(ξ) × H(ξ);
iii. The function Φ(ξ, q, ·) is piecewise linear convex on H(ξ) for fixed q ∈ D(ξ);
iv. The function Φ(ξ, ·, t) is piecewise linear concave on D(ξ) for fixed t ∈ H(ξ).
D.13 Lemma. Let the affine dependence and the polynomial support assumptions hold.
Under the finite second moments, the fixed recourse, the relatively complete recourse and
the dual feasibility assumptions, there exist constants L1 > 0, L2 > 0, K > 0 such that
for all ξ, ξ′ ∈ Ξ, any ρ > 0, and for all x, x′ ∈ K1 ∩ K2 ∩ ρB,
i. |Q_f(x, ξ) − Q_f(x, ξ′)| ≤ L1 ρ max{1, ||ξ||, ||ξ′||} ||ξ − ξ′|| ;
ii. |Q_f(x, ξ) − Q_f(x′, ξ)| ≤ L2 max{1, ||ξ||²} ||x − x′|| ;
iii. |Q_f(x, ξ)| ≤ K ρ max{1, ||ξ||²} .
Finally, the following proposition, which addresses the differentiability of the value
function, is based on Walkup and Wets (1969b), Wets (1974) and Shapiro et al. (2009,
Propositions 2.7, 2.8, 2.9). Note that under the finite second moments and relatively
complete recourse assumptions, we have K1 ⊂ K2 , so that K2 is nonempty if K1 is
nonempty.
D.14 Proposition. Under the finite second moments, fixed recourse, relatively complete
recourse and dual feasibility assumptions, and assuming that K1 is nonempty,
i. E{Q(·, ω)} is proper;
ii. E{Q(·, ω)} is convex, lower semicontinuous and Lipschitz continuous on K 2 ;
iii. If (q, T ) is constant-valued, and the distribution of h is absolutely continuous, then
E{Q(·, ω)} is differentiable at x0 ∈ int K2 ;
iv. If for almost all (q, T ), the distribution of h conditionally to (q, T ) is absolutely
continuous, and if for almost all ω ∈ Ω, the dual solution set at x0 ∈ int K2
Ailon, N., B. Chazelle, K.L. Clarkson, D. Liu, W. Mulzer, C. Seshadhri. 2006. Self-improving
algorithms. Proceedings of the Seventeenth ACM-SIAM Symposium on Discrete Algorithms
(SODA 2006). 261–270.
Ali, S.M., S. Koenig, M. Tambe. 2005. Preprocessing techniques for accelerating the DCOP
algorithm ADOPT. Proceedings of the Fourth International Joint Conference on Autonomous
Agents & Multi Agent Systems. 1041–1048.
Ali, S.M., S.D. Silvey. 1966. A general class of coefficients of divergence of one distribution from
another. Journal of the Royal Statistical Society 28 131–142.
Antos, A., R. Munos, Cs. Szepesvári. 2008. Fitted Q-iteration in continuous action-space MDPs.
Advances in Neural Information Processing Systems 20 (NIPS-2007). 9–16.
Arrow, K.J. 1958. Historical background. K.J. Arrow, S. Karlin, H. Scarf, eds., Studies in the
Mathematical Theory of Inventory and Production. Stanford University Press, 3–15.
Artzner, P., F. Delbaen, J.-M. Eber, D. Heath, H. Ku. 2007. Coherent multiperiod risk adjusted
values and Bellman’s principle. Annals of Operations Research 152(1) 5–22.
Aubin, J.-P., H. Frankowska. 1990. Set-Valued Analysis. Modern Birkhäuser Classics, Springer.
2009 Reprint of the 1990 Edition.
Audibert, J.-Y., R. Munos, C. Szepesvári. 2007. Tuning bandit algorithms in stochastic envi-
ronments. Proceedings of the Eighteenth International Conference on Algorithmic Learning
Theory (ALT-2007). LNCS 4754, Springer, 150–165.
Auer, P., N. Cesa-Bianchi, P. Fischer. 2002. Finite-time analysis of the multiarmed bandit
problem. Machine Learning 47 235–256.
Azuma, K. 1967. Weighted sums of certain dependent random variables. Tohoku Mathematical
Journal 19 357–367.
Balas, E. 1998. Disjunctive programming: Properties of the convex hull of feasible points.
Discrete Applied Mathematics 89 3–44.
Banerjee, A., S. Merugu, I.S. Dhillon, J. Ghosh. 2005. Clustering with Bregman divergences.
Journal of Machine Learning Research 6 1705–1749.
Banerjee, O., L. El Ghaoui, A. d’Aspremont. 2008. Model selection through sparse maximum
likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning
Research 9 485–516.
Baotić, M., F. Borrelli, A. Bemporad, M. Morari. 2008. Efficient on-line computation of con-
strained optimal control. SIAM Journal on Control and Optimization 47 2470–2489.
Bartlett, P.L., S. Mendelson. 2002. Rademacher and Gaussian complexities: Risk bounds and
structural results. Journal of Machine Learning Research 3 463–482.
Beale, E.M.L. 1955. On minimizing a convex function subject to linear inequalities. Journal of
the Royal Statistical Society 17 173–184.
Bellman, R.E. 1954. The theory of dynamic programming. Bulletin of the American Mathemat-
ical Society 60 503–516.
Bemporad, A., M. Morari, V. Dua, E. Pistikopoulos. 2002. The explicit linear quadratic regulator
for constrained systems. Automatica 38 3–20. Corrigendum: Automatica 39 (2003) 1845-1846.
Bertsekas, D.P. 2005a. Dynamic Programming and Optimal Control . 3rd ed. Athena Scientific,
Belmont, MA.
Bertsekas, D.P. 2005b. Dynamic programming and suboptimal control: A survey from ADP to
MPC. European Journal of Control 11 310–334.
Bertsekas, D.P., J.N. Tsitsiklis. 1996. Neuro-Dynamic Programming. Athena Scientific, Belmont,
MA.
Birge, J.R. 1992. The value of the stochastic solution in stochastic linear programs with fixed
recourses. Mathematical Programming 24 314–325.
Bixby, R. 2002. Solving real-world linear programs: A decade and more of progress. Operations
Research 50 3–15.
Bochner, S. 1933. Monotone funktionen, stieltjessche integrale und harmonische analyse. Math-
ematische Annalen 108(1) 378–410.
Boda, K., J.A. Filar. 2006. Time consistent dynamic risk measures. Mathematical Methods of
Operations Research 63 169–186.
Bregman, L.M. 1967. The relaxation method of finding the common points of convex sets and
its application to the solution of problems in convex programming. USSR Computational
Mathematics and Mathematical Physics 7 200–217.
Breiman, L. 1998. Arcing classifiers (with discussion and a rejoinder by the author). The Annals
of Statistics 26 801–849.
Brown, L.D. 1986. Fundamentals of statistical exponential families, IMS lecture notes – Mono-
graph series, vol. 9. Institute of Mathematical Statistics, Hayward, California.
Cesa-Bianchi, N., A. Conconi, C. Gentile. 2004. On the generalization ability of on-line learning
algorithms. IEEE Transactions on Information Theory 50 2050–2057.
Cesa-Bianchi, N., Y. Freund, D.P. Helmbold, D. Haussler, R.E. Schapire, M.K. Warmuth. 1997.
How to use expert advice. Journal of the Association for Computing Machinery 44 427–485.
Cesa-Bianchi, N., G. Lugosi. 2006. Prediction, Learning and Games. Cambridge University
Press, New York.
Chiralaksanakul, A. 2003. Monte Carlo methods for multi-stage stochastic programs. Ph.D.
thesis, University of Texas at Austin.
Chung, K.-J., M. Sobel. 1987. Discounted MDP’s: Distribution functions and exponential utility
maximization. SIAM J. Control and Optimization 25(1) 49–62.
Clarke, B.S., A.R. Barron. 1990. Information-theoretic asymptotics of Bayes methods. IEEE
Transactions of Information Theory 36 453–471.
Coquelin, P.-A., R. Munos. 2007. Bandit algorithms for tree search. Proceedings of the Twenty-
Third Conference on Uncertainty in Artificial Intelligence (UAI-2007). 67–74.
Cucker, F., S. Smale. 2001. On the mathematical foundations of learning. Bulletin of the
American Mathematical Society 39(1) 1–49.
Daniel, J.W. 1973. Stability of the solution of definite quadratic programs. Mathematical
Programming 5 41–53.
Dantzig, G.B. 1955. Linear programming under uncertainty. Management Science 1 197–206.
Dawid, A.P. 1984. Present position and potential developments: some personal views. Statistical
theory: The prequential approach (with discussion). Journal of the Royal Statistical Society
147 278–292.
Decoste, D., B. Schölkopf. 2002. Training invariant support vector machines. Machine Learning
46 161–190.
Defourny, B., D. Ernst, L. Wehenkel. 2008. Risk-aware decision making and dynamic program-
ming. Y. Engel, M. Ghavamzadeh, S. Mannor, P. Poupart, eds., NIPS-08 workshop on model
uncertainty and risk in reinforcement learning.
Defourny, B., L. Wehenkel. 2009. Large margin classification with the progressive hedging
algorithm. S. Nowozin, S. Sra, S. Vishwanathan, S. Wright, eds., Second NIPS workshop on
optimization for machine learning.
Dempster, A.P., N.M. Laird, D.B. Rubin. 1977. Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society 39 1–38.
Dempster, M.A.H., G. Pflug, G. Mitra, eds. 2008. Quantitative Fund Management. Financial
Mathematics Series, Chapman & Hall/CRC.
Demuth, H., M. Beale. 1993. Neural network toolbox for use with Matlab.
Dontchev, A.L., R.T. Rockafellar. 2009. Implicit Functions and Solution Mappings. Springer.
Draper, D. 1995. Assessment and propagation of model uncertainty. Journal of the Royal
Statistical Society 57 45–97.
Dupacova, J., R. J.-B. Wets. 1988. Asymptotic behavior of statistical estimators and of optimal
solutions of stochastic optimization problems. The Annals of Statistics 16 1517–1549.
Edelman, A. 1992. Eigenvalue roulette and random test matrices. M.S. Moonen, G.H. Golub,
B. De Moor, eds., Linear Algebra for Large Scale and Real-Time Applications, NATO ASI ,
vol. 232. Springer, 365–368.
Efron, B., R. Tibshirani. 1993. An introduction to the bootstrap. Chapman and Hall, London.
Epstein, L., M. Schneider. 2003. Recursive multiple-priors. Journal of Economic Theory 113
1–13.
Ernst, D., P. Geurts, L. Wehenkel. 2005. Tree-based batch mode reinforcement learning. Journal
of Machine Learning Research 6 503–556.
Escudero, L.F. 2009. On a mixture of the fix-and-relax coordination and Lagrangian substitution
schemes for multistage stochastic mixed integer programming. Top 17 5–29.
Evans, M., T. Swartz. 1995. Methods for approximating integrals in statistics with special
emphasis on Bayesian integration problems. Statistical Science 10 254–272.
Facchinei, F., A. Fischer, C. Kanzow. 1998. On the accurate identification of active constraints.
SIAM Journal on Optimization 9 14–32.
Facchinei, F., J.-S. Pang. 2003. Finite-dimensional variational inequalities and complementary
problems. Springer. Published in two volumes, paginated continuously.
Fisher, R.A. 1925. Theory of statistical estimation. Proceedings of the Cambridge Philosophical
Society, vol. 22. 700–725.
Freund, Y., R.E. Schapire. 1996. Experiments with a new boosting algorithm. Proceedings of
the Thirteenth International Conference on Machine Learning (ICML-1996). 148–156.
Gale, D., V. Klee, R.T. Rockafellar. 1968. Convex functions on convex polytopes. Proceedings
of the American Mathematical Society, vol. 19. 867–873.
Garstka, S.J., R.J.-B. Wets. 1974. On decision rules in stochastic programming. Mathematical
Programming 7(1) 117–143.
Gassmann, H.I., A. Prékopa. 2005. On stages and consistency checks in stochastic programming.
Operations Research Letters 33 171–175.
Geurts, P., D. Ernst, L. Wehenkel. 2006. Extremely randomized trees. Machine Learning 63
3–42.
Good, I.J., R.A. Gaskins. 1971. Non-parametric roughness penalties for probability densities.
Biometrika 58 255–277.
Haff, L.R. 1980. Empirical Bayes estimation of the multivariate normal covariance matrix. The
Annals of Statistics 8 586–597.
Hammond, P. J. 1976. Changing tastes and coherent dynamic choice. The Review of Economic
Studies 43 159–173.
Hartland, C., S. Gelly, N. Baskiotis, O. Teytaud, M. Sebag. 2006. Multi-armed bandit, dy-
namic environments and meta-bandits. P. Auer, N. Cesa-Bianchi, Z. Hussain, L. Newnham,
J. Shawe-Taylor, eds., NIPS-06 workshop on on-line trading of exploration and exploitation.
Haussler, D. 1999. Convolution kernels on discrete structures. Tech. rep., University of California
at Santa Cruz.
Heitsch, H., W. Römisch. 2009. Scenario tree modeling for multistage stochastic programs.
Mathematical Programming 118(2) 371–406.
Higle, J.L., S. Sen. 1991. Stochastic decomposition: An algorithm for two stage stochastic linear
programs with recourse. Mathematics of Operations Research 16 650–669.
Hilli, P., T. Pennanen. 2008. Numerical study of discretizations of multistage stochastic pro-
grams. Kybernetika 44 185–204.
Hochreiter, R., G.Ch. Pflug. 2007. Financial scenario generation for stochastic multi-stage
decision processes as facility location problems. Annals of Operations Research 152 257–272.
Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. Journal of
the American Statistical Association 58 13–30.
Hofmann, T., B. Schölkopf, A. Smola. 2008. Kernel methods in machine learning. The Annals
of Statistics 36(3) 1171–1220.
Howard, R.A. 1960. Dynamic Programming and Markov Processes. MIT Press.
Høyland, K., M. Kaut, S.W. Wallace. 2003. A heuristic for moment-matching scenario genera-
tion. Computational Optimization and Applications 24 169–185.
Høyland, K., S. Wallace. 2001. Generating scenario trees for multistage decision problems.
Management Science 47(2) 295–307.
Huang, K., S. Ahmed. 2009. The value of multistage stochastic programming in capacity plan-
ning under uncertainty. Operations Research 57 893–904.
Huber, P.J. 1964. Robust estimation of a location parameter. The Annals of Mathematical
Statistics 35 73–101.
Infanger, G. 1992. Monte Carlo (importance) sampling within a Benders decomposition algo-
rithm for stochastic linear programs. Annals of Operations Research 39 69–95.
Itakura, F., S. Saito. 1968. Analysis synthesis telephony based on the maximum likelihood
method. Proceedings of the Sixth International Congress on Acoustics. C17–20.
Kallrath, J., P.M. Pardalos, S. Rebennack, M. Scheidt, eds. 2009. Optimization in the Energy
Industry. Springer.
Kearns, M.J., Y. Mansour, A.Y. Ng. 2002. A sparse sampling algorithm for near-optimal plan-
ning in large Markov Decision Processes. Machine Learning 49(2-3) 193–208.
Koivu, M., T. Pennanen. 2010. Galerkin methods in dynamic stochastic programming. Opti-
mization 59 339–354.
Koltchinskii, V., D. Panchenko. 2002. Empirical margin distributions and bounding the gener-
alization error of combined classifiers. The Annals of Statistics 30 1–50.
Kouwenberg, R. 2001. Scenario generation and stochastic programming models for asset liability
management. European Journal of Operational Research 134 279–292.
Kuhn, D. 2005. Generalized Bounds for convex multistage Stochastic Programs, Lecture Notes
in Economics and Mathematical Systems, vol. 548. Springer.
Kydland, F.E., E.C. Prescott. 1977. Rules rather than discretion: The inconsistency of optimal
plans. The Journal of Political Economy 85 473–492.
Lanckriet, G., N. Cristianini, P. Bartlett, L.E. Ghaoui, M.I. Jordan. 2004. Learning the kernel
matrix with semidefinite programming. Journal of Machine Learning Research 5 27–72.
Lehto, O. 1952. Some remarks on the kernel functions in Hilbert spaces. Annales Academiae
Scientiarium, Fennicae Sereie A 109 3–6.
Li, J.Q., A.R. Barron. 2000. Mixture density estimation. Advances in Neural Information
Processing Systems 12 (NIPS-2000). 279–285.
Littlestone, N., M.K. Warmuth. 1989. The weighted majority algorithm. Proceedings of the
Thirtieth Annual Symposium on Foundations of Computer Science. 256–261.
Littman, M.L., T.L. Dean, L.P. Kaelbling. 1995. On the complexity of solving Markov Decision
Problems. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence
(UAI-1995). 394–402.
MacKay, D.J.C. 2003. Information Theory, Inference and Learning Algorithms. Cambridge
University Press.
MacKay, M.D., R.J. Beckman, W.J. Conover. 1979. A comparison of three methods for selecting
values of input variables in the analysis of output from a computer code. Technometrics 21
239–245.
Madigan, D., J. York. 1995. Bayesian graphical models for discrete data. International Statistical
Review 63 215–232.
Mahalanobis, P.C. 1936. On the generalized distance in statistics. Proceedings of the National
Institute of Sciences of India, vol. 12. 49–55.
Mak, Wai-Kei, D.P. Morton, R. Kevin Wood. 1999. Monte Carlo bounding techniques for
determining solution quality in stochastic programs. Operations Research Letters 24(1-2)
47–56.
Mardia, K.V., R.J. Marshall. 1984. Maximum likelihood estimation of models for residual
covariance in spatial regression. Biometrika 71 135–146.
Martin, D.H. 1975. On the continuity of the maximum in parametric linear programming.
Journal of Optimization Theory and Applications 17 205–210.
Mease, D., A. Wyner. 2008. Evidence contrary to the statistical view of boosting (with responses
and rejoinder). Journal of Machine Learning Research 131–201.
Mercer, J. 1909. Functions of positive and negative type, and their connection with the theory
of integral equations. Philosophical Transactions of the Royal Society of London, Series A
209 415–446.
Mercier, L., P. Van Hentenryck. 2007. Performance analysis of online anticipatory algorithms
for large multistage stochastic integer programs. Proceedings of the Twentieth International
Joint Conference on Artificial Intelligence (IJCAI-07). 1979–1984.
Metropolis, N., A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller. 1953. Equation of
state calculations by fast computing machines. Journal of Chemical Physics 21 1087–1092.
Metropolis, N., S. Ulam. 1949. The Monte Carlo method. Journal of the American Statistical
Association 44(247) 335–341.
Micchelli, C.A., M. Pontil. 2005. Learning the kernel function via regularization. Journal of
Machine Learning Research 6 1099–1125.
Neal, R.M. 1993. Probabilistic inference using Markov Chain Monte Carlo methods. Tech. Rep.
CRG-TR-93-1, University of Toronto.
Neal, R.M. 1997. Monte Carlo implementation of Gaussian Process models for Bayesian regres-
sion and classification. Tech. Rep. 9702, University of Toronto.
Neal, R.M. 2010. MCMC using Hamiltonian dynamics. S. Brooks, A. Gelman, G. Jones, X.-L.
Meng, eds., Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Press.
Nesterov, Y. 2007. Gradient methods for minimizing composite objective function. CORE
discussion paper 76, Catholic University of Louvain.
Nesterov, Y., J.-Ph. Vial. 2008. Confidence level solutions for stochastic programming. Auto-
matica 44 1559–1568.
Nickisch, H., C.E. Rasmussen. 2008. Approximations for binary Gaussian process classification.
Journal of Machine Learning Research 9 2035–2078.
Norkin, V.I., Y.M. Ermoliev, A. Ruszczyński. 1998a. On optimal allocation of indivisibles under
uncertainty. Operations Research 46 381–395.
Norkin, V.I., G.Ch. Pflug, A. Ruszczyński. 1998b. A branch and bound method for stochastic
global optimization. Mathematical Programming 83 425–450.
O’Hagan, A. 1978. Curve fitting and optimal design for prediction. Journal of the Royal
Statistical Society 40 1–42.
Pages, G., J. Printems. 2003. Optimal quadratic quantization for numerics: the Gaussian case.
Monte Carlo Methods and Applications 9 135–166.
Patrinos, P., H. Sarimveis. 2010. A new algorithm for solving convex parametric quadratic
programs based on graphical derivatives of solution mappings. Automatica 46 1405–1418.
Pflug, G.Ch., W. Römisch. 2007. Modeling, Measuring and Managing Risk . World Scientific
Publishing Company.
Pollard, D. 1990. Empirical Processes: Theory and Applications, NSF-CBMS Regional Confer-
ence Series in Probability and Statistics, vol. 2. Institute of Mathematical Statistics.
Pollard, D. 2001. A User’s Guide to Measure Theoretic Probability. Cambridge University Press.
Powell, W.B. 2007. Approximate Dynamic Programming: Solving the Curses of Dimensionality.
Wiley.
Puterman, M.L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Wiley.
Rachelson, E., F. Schnitzler, L. Wehenkel, D. Ernst. 2010. Optimal sample selection for batch-
mode reinforcement learning. Submitted.
Rachev, S.T., W. Römisch. 2002. Quantitative stability in stochastic programming: The method
of probability metrics. Mathematics of Operations Research 27(4) 792–818.
Rahimi, A., B. Recht. 2008. Random features for large-scale kernel machines. Advances in
Neural Information Processing Systems 20 (NIPS-2007). 1177–1184.
Rahimi, A., B. Recht. 2009. Random kitchen sinks: Replacing optimization with randomization
in learning. Advances in Neural Information Processing Systems 21 (NIPS-2008). 1313–1320.
Raiffa, H., R. Schlaifer. 1961. Applied Statistical Decision Theory. Harvard University.
Rasmussen, C.E., C.K.I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press.
Riedel, F. 2004. Dynamic coherent risk measures. Stochastic Processes and their Applications
112 185–200.
Rockafellar, R.T. 1974. Conjugate Duality and Optimization. CBMS-NSF Regional Conference
Series in Applied Mathematics, SIAM.
Rockafellar, R.T., R.J.-B. Wets. 1991. Scenarios and policy aggregation in optimization under
uncertainty. Mathematics of Operations Research 16 119–147.
Rockafellar, R.T., R.J.-B. Wets. 1998. Variational Analysis. 3rd ed. Springer.
Römisch, W., R.J.-B. Wets. 2007. Stability of ε-approximate solutions to convex stochastic
programs. SIAM Journal on Optimization 18 961–979.
Rubinstein, R.Y., D.P. Kroese. 2004. The Cross-Entropy Method. A Unified Approach to Combi-
natorial Optimization, Monte-Carlo Simulation, and Machine Learning. Information Science
and Statistics, Springer.
Rust, J. 1997. Using randomization to break the curse of dimensionality. Econometrica 65(3)
487–516.
Samuelson, P.A. 1937. A note on measurement of utility. The Review of Economic Studies 4(2)
155–161.
Schapire, R.E. 1990. The strength of weak learnability. Machine Learning 5(2) 197–227.
Schapire, R.E., P. Stone, D. McAllester, M.L. Littman, J.A. Csirik. 2002. Modeling auction
price uncertainty using boosting-based conditional density estimation. Proceedings of the
Nineteenth International Conference on Machine Learning (ICML-2002). 546–553.
Schoenberg, I.J. 1938. Metric spaces and completely monotone functions. Annals of Mathematics
39(4) 811–841.
Sen, S., R.D. Doverspike, S. Cosares. 1994. Network planning with random demand. Telecom-
munications Systems 3 11–30.
Shapiro, A. 2000. On the asymptotics of constrained local M-estimators. The Annals of Statistics
28 948–960.
Shapiro, A. 2003a. Inference of statistical bounds for multistage stochastic programming prob-
lems. Mathematical Methods of Operations Research 58(1) 57–68.
Shapiro, A. 2003b. Monte Carlo sampling methods. A. Ruszczyński, A. Shapiro, eds., Stochastic
Programming. Handbooks in Operations Research and Management Science, vol. 10. Elsevier,
353–425.
Shi, Q., J. Petterson, G. Dror, J. Langford, A. Smola, S.V.N. Vishwanathan. 2009. Hash kernels
for structured data. Journal of Machine Learning Research 10 2615–2637.
Shivaswamy, P., T. Jebara. 2010. Empirical Bernstein boosting. Proceedings of the Thirteenth
International Conference on Artificial Intelligence and Statistics (AISTATS-2010). JMLR
Workshop and Conference proceedings (vol. 9), 733–740.
Simon, H. 1956. Rational choice and the structure of the environment. Psychological Review 63
129–138.
Slyke, R. Van, R.J.-B. Wets. 1969. L-shaped linear programs with applications to optimal
control and stochastic programming. SIAM Journal on Applied Mathematics 17 638–663.
Solak, E., R. Murray-Smith, W.E. Leithead, D.J. Leith, C.E. Rasmussen. 2003. Derivative
observations in Gaussian Process models of dynamic systems. Advances in Neural Information
Processing Systems 15 (NIPS-2002). 1033–1040.
Sonnenburg, S., G. Rätsch, C. Schäfer, B. Schölkopf. 2006. Large scale multiple kernel learning.
Journal of Machine Learning Research 7 1531–1565.
Speed, T.P., H.T. Kiiveri. 1986. Gaussian Markov distributions over finite graphs. The Annals
of Statistics 14 138–150.
Spielman, D., S.-H. Teng. 2004. Smoothed analysis: Why the simplex algorithm usually takes
polynomial time. Journal of the Association for Computing Machinery 51 385–463.
Spjøtvold, J., P. Tøndel, T.A. Johansen. 2007. Continuous selection and unique polyhedral
representation of solutions to convex parametric quadratic programs. Journal of Optimization
Theory and Applications 134 177–189.
Stein, C. 1956. Inadmissibility of the usual estimator for the mean of a multivariate distribution.
Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability,
vol. 1. 197–206.
Steinwart, I., A. Christmann. 2008. Support Vector Machines, chap. Kernels and Reproducing
Kernel Hilbert Spaces. Springer, 111–164.
Stone, M. 1974. Cross-validatory choice and assessment of statistical predictions. Journal of the
Royal Statistical Society 36 111–147.
Strotz, R.H. 1955. Myopia and inconsistency in dynamic utility maximization. The Review of
Economic Studies 23 165–180.
Sutton, R.S., A.G. Barto. 1998. Reinforcement Learning, an introduction. MIT Press.
Tikhonov, A.N., V.Y. Arsenin. 1977. Solutions of ill posed problems. W.H. Winston and Sons
(distributed by Wiley).
Tøndel, P., T. Johansen, A. Bemporad. 2003. Evaluation of piecewise affine control via binary
search tree. Automatica 39 945–950.
Van Hentenryck, P., R. Bent. 2006. Online stochastic combinatorial optimization. MIT Press.
von Neumann, J., O. Morgenstern. 1947. Theory of games and economic behavior . 2nd ed.
Princeton University Press.
Wainwright, M.J., M.I. Jordan. 2008. Graphical models, exponential families, and variational
inference. Foundations and Trends in Machine Learning 1 1–305.
Walker, A.M. 1969. On the asymptotic behaviour of posterior distributions. Journal of the
Royal Statistical Society 31 80–88.
Walkup, D., R.J.-B. Wets. 1967. Stochastic programs with recourse. SIAM Journal on Applied
Mathematics 15 1299–1314.
Walkup, D., R.J.-B. Wets. 1969a. Lifting projections of convex polyhedra. Pacific Journal of
Mathematics 28 465–475.
Walkup, D., R.J.-B. Wets. 1969b. Stochastic programs with recourse II: On the continuity of
the objective. SIAM Journal on Applied Mathematics 17 98–103.
Wallace, S.W., R.J.-B. Wets. 1992. Preprocessing in stochastic programming: The case of linear
programs. ORSA Journal of Computing 4 45–49.
Wets, R.J.-B. 1974. Stochastic programs with fixed recourse: The equivalent deterministic
program. SIAM Review 16 309–339.
Wets, R.J.-B., C. Witzgall. 1967. Algorithms for frames and lineality spaces of cones. Journal
of Research of the National Bureau of Standards 71 1–7.
Weyl, H. 1935. Elementare Theorie der konvexen Polyeder. Commentarii Mathematici Helvetici
7 290–306.
Zhang, Z., M.I. Jordan, D.-Y. Yeung. 2009. Posterior consistency of the Silverman g-prior in
Bayesian model choice. Advances in Neural Information Processing Systems 21 (NIPS-2008).
1969–1976.
Index
random
function, 161
set, 159
vector, 158
regularization, 31
relatively complete recourse, 173
repair procedure, 53, 102
robust optimization, 6