
Machine Learning Solution Methods for

Multistage Stochastic Programming

Thesis by
Boris Defourny

Systems and Modeling Research Unit


Department of Electrical Engineering and Computer Science
University of Liège, Belgium

2010
Abstract

This thesis investigates the following question: Can supervised learning techniques be
successfully used for finding better solutions to multistage stochastic programs? A similar
question had already been posed in the context of reinforcement learning, and had led to
algorithmic and conceptual advances in the field of approximate value function methods
over the years (Lagoudakis and Parr, 2003; Ernst, Geurts, and Wehenkel, 2005; Lang-
ford and Zadrozny, 2005; Antos, Munos, and Szepesvári, 2008). This thesis identifies
several ways to exploit the combination “multistage stochastic programming/supervised
learning” for sequential decision making under uncertainty.
Multistage stochastic programming is essentially the extension of stochastic program-
ming (Dantzig, 1955; Beale, 1955) to several recourse stages. After an introduction to
multistage stochastic programming and a summary of existing approximation approaches
based on scenario trees, this thesis mainly focusses on the use of supervised learning for
building decision policies from scenario-tree approximations.
Two ways of exploiting learned policies in the context of the practical issues posed
by the multistage stochastic programming framework are explored: the fast evaluation
of performance guarantees for a given approximation, and the selection of good scenario
trees. The computational efficiency of the approach allows novel investigations relative
to the construction of scenario trees, from which novel insights, solution approaches and
algorithms are derived. For instance, we generate and select scenario trees with random
branching structures for problems over large planning horizons. Our experiments on
the empirical performance of learned policies, compared to gold-standard policies,
suggest that the combination of stochastic programming and machine learning techniques
could also constitute a method per se for sequential decision making under uncertainty,
inasmuch as learned policies are simple to use, and come with performance guarantees
that can actually be quite good.
Finally, limitations of approaches that build an explicit model to represent an optimal
solution mapping are studied in a simple parametric programming setting, and various
insights regarding this issue are obtained.
Acknowledgments

Warm thanks are addressed to my family for their love and constant support over the
years.
I express my deepest gratitude to my advisor, Louis Wehenkel. In addition to his
guidance, to his questioning, and to the many discussions we had together, from technical
subjects to scientific research in general, Louis has provided me with an outstanding
environment for doing research and communicating it, while demonstrating, with style,
his broad culture, his high ethical standards, and his formidable ability to be present in times of need.
This thesis would not have been written without Louis' interest in supervised learning,
ensemble methods, and optimal control.
I am very grateful to Damien Ernst. Together, we had many discussions, for instance
on approximate dynamic programming and reinforcement learning, the cross-entropy
method, the uses of scenario-tree methods, and the ways to communicate and publish
results. Damien has repeatedly provided support and encouragements regarding the
present work. Collaborating with Louis and Damien has been extremely energizing,
productive, and life-enriching.
I wish to express my gratitude to Rodolphe Sepulchre. Rodolphe too deserves credit
for setting up an excellent research environment, for instance by organizing stimulating
weekly seminars and group lunches, and offering me the opportunity to participate in
activities of the Belgian network DYSCO or in activities hosted by CESAME/INMA (UCL).
Warm thanks to Jean-Louis Lilien for his good advice and support over the years.
With hindsight, I owe to Jean-Louis many important choices that I am happy to have
made.
I would like to thank Yves Crama for setting up discussions on multistage stochastic
programming and models in logistics, and inviting me to presentations given by his group.
I would like to thank Quentin Louveaux for several stimulating discussions.
I express my gratitude to Mania Pavella for her interest and her repeated encouragement.
I am grateful to Jacek Gondzio (University of Edinburgh), Rüdiger Schultz (University
of Duisburg-Essen), Werner Römisch (Humboldt University), and Alexander Shapiro
(Georgia Tech) for stimulating discussions on stochastic programming, and expressions
of interest in the general orientation of this research. The scientific responsibility of the
present work and the views expressed in the thesis rest with us.
I am grateful to Rémi Munos (INRIA Lille) and Olivier Teytaud (INRIA Saclay) for
stimulating discussions related to machine learning and sequential decision making.
I am grateful to Roman Belavkin (Middlesex University) for stimulating discussions
on cognitive sciences, on finance, and for inviting me to give a seminar in his group.
I am grateful to Jovan Ilic and Marija Ilic (Carnegie Mellon University) for meetings
and discussions in which my interest in risk-aware decision making originated.
My interest in multistage stochastic programming originates from meetings with mem-
bers of the department OSIRIS (Optimisation, Simulation, Risque et Statistiques) at
Electricité de France. I would like to thank especially Yannick Jacquemart, Kengy Barty,
Pascal Martinetto, Jean-Sébastien Roy, Cyrille Strugarek, and Gérald Vignal for inspiring
discussions on sequential decision making models and current challenges in the electricity
industry. The scientific responsibility of the present work and the views expressed in the
thesis rest with us.
Warm thanks to my friends, colleagues and past post-docs from the Systems and
Modeling research unit and beyond. I had innumerable scientific and personal discus-
sions with Alain Sarlette and Michel Journée, and during their post-doc time with Emre
Tuna and Silvère Bonnabel. I had a wonderful time and many discussions with Pierre Geurts,
Vincent Auvray, Guy-Bart Stan, Renaud Ronsse, Christophe Germay, Maxime Bon-
jean, Luca Scardovi, Denis Efimov, Christian Lageman, Alexandre Mauroy, Gilles Meyer,
Marie Dumont, Pierre Sacré, Christian Bastin, Guillaume Drion, Anne Collard, Laura
Trotta, Bertrand Cornélusse, Raphaël Fonteneau, Florence Belmudes, François Schnit-
zler, François Van Lishout, Gérard Dethier, Olivier Barnich, Philippe Ries, and Axel
Bodart. I extend my acknowledgements to Thierry Pironet and Yasemin Arda (HEC-
Ulg). Thanks also to Patricia Rousseaux, Thierry Van Cutsem and Mevludin Glavic.
Many thanks to my friends who encouraged me to pursue this work, and in particular to
Estelle Derclaye (University of Nottingham).
I am also indebted to people who helped me when I was abroad for conferences and
made my time there even more enjoyable: Marija Prica, Guillaume Obozinski, Janne
Kettunen, Aude Piette, Diana Chan, Daniel Bartz, Kazuya Haraguchi.
I had the opportunity to coauthor papers with Sourour Ammar, Philippe Leray (INSA
Rouen) and Louis Wehenkel, and a paper with Bertrand Cornélusse, Gérald Vignal and
Louis Wehenkel, and I wish to thank these authors for these collaborations.
I gratefully acknowledge the financial support of the Belgian network DYSCO (Dy-
namical Systems, Control, and Optimization), funded by the Interuniversity Attraction
Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific
responsibility rests with us. This work was supported in part by the IST Programme
of the European Union, under the PASCAL2 Network of Excellence, IST-2007-216886.
This thesis only reflects our views.
Finally, I would like to express my extreme gratitude to the members of my thesis
defense committee: Rodolphe Sepulchre (Chair), Louis Wehenkel (Advisor), Yves Crama,
Quentin Louveaux, Shie Mannor (Technion), Werner Römisch (Humboldt University),
Alexander Shapiro (Georgia Tech), and Olivier Teytaud (INRIA Saclay/Université Paris-
Sud).
Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Published Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2. The Multistage Stochastic Programming Framework . . . . . . . . . . . . . . . 5


2.1 Description of the Framework . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 From Nominal Plans to Decision Processes . . . . . . . . . . . . . 5
2.1.2 Incorporating Probabilistic Reasoning . . . . . . . . . . . . . . . . 6
2.1.3 The Elements of the General Decision Model . . . . . . . . . . . . 6
2.1.4 The Tree Representation of Gradually Revealed Scenarios . . . . . 9
2.1.5 Approximating Random Processes with Scenario Trees . . . . . . . 10
2.1.6 Simple Example of Formulation . . . . . . . . . . . . . . . . . . . . 11
2.2 Comparison to Related Approaches . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 The Exogenous Nature of the Random Process . . . . . . . . . . . 13
2.2.2 Comparison to Markov Decision Processes . . . . . . . . . . . . . . 14
2.2.3 The Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . 14
2.3 The Value of Multistage Stochastic Programming . . . . . . . . . . . . . . 15
2.3.1 Reduction to Model Predictive Control . . . . . . . . . . . . . . . 16
2.3.2 Reduction to Two-Stage Stochastic Programming . . . . . . . . . . 16
2.3.3 Reduction to Heuristics based on Parametric Optimization . . . . 17
2.4 Practical Scenario-Tree Approaches . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 Approximation Methods in Two-stage Stochastic Programming . . 20
2.4.2 Challenges in the Generation of Scenario Trees . . . . . . . . . . . 22
2.4.3 The Need for Testing Scenario-Tree Approximations . . . . . . . . 25
2.4.4 The Inference of Shrinking-Horizon Decision Policies . . . . . . . . 26
2.4.5 Alternatives to the Scenario Tree Approach . . . . . . . . . . . . . 27

2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3. Solution Averaging in Structured Spaces . . . . . . . . . . . . . . . . . . . . . . 29


3.1 The Perturb-and-Combine Paradigm . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . 30
3.1.2 Bayesian Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.4 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.5 Bayesian Model Averaging . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.6 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Adaptation to Stochastic Programming . . . . . . . . . . . . . . . . . . . 36
3.2.1 Principle of the Approach . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.2 Generation of an Ensemble of Incomplete Trees . . . . . . . . . . . 39
3.2.3 Optimization with the Cross-Entropy method . . . . . . . . . . . . 40
3.2.4 Aggregation of First-Stage Decisions . . . . . . . . . . . . . . . . . 42
3.3 Numerical Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 Description of the Test Problem . . . . . . . . . . . . . . . . . . . 44
3.3.2 Particular Choices . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4. Validation of Solutions and Scenario Tree Generation . . . . . . . . . . . . . . . 51


4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Learning and Evaluation of Scenario Tree Based Policies . . . . . . . . . . 52
4.2.1 Description of the Validation Method . . . . . . . . . . . . . . . . 53
4.2.2 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.3 Other Validation Strategies . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Monte Carlo Selection of Scenario Trees . . . . . . . . . . . . . . . . . . . 60
4.3.1 Description of the Selection Scheme . . . . . . . . . . . . . . . . . 60
4.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.1 Description of the Problem . . . . . . . . . . . . . . . . . . . . . . 63
4.4.2 Algorithms for Generating Small Scenario Trees . . . . . . . . . . 64
4.4.3 Algorithm for Learning Policies . . . . . . . . . . . . . . . . . . . . 67
4.4.4 Solving Programs Approximately by Linear Programming . . . . . 70
4.4.5 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 Time Inconsistency and Bounded Rationality Limitations . . . . . . . . . 75

4.5.1 Time-Consistent Decision Processes . . . . . . . . . . . . . . . . . 75


4.5.2 Limitations of Validations Based on Learned Policies . . . . . . . . 77
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5. Inferring Decisions from Predictive Densities . . . . . . . . . . . . . . . . . . . 79


5.1 Constrained MAP Repair Procedure . . . . . . . . . . . . . . . . . . . . . 79
5.1.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.1.2 Particularizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Gaussian Predictive Densities . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.1 Joint Gaussian Model . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.2 Gaussian Process Model . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.3.1 Description of the Test Problem . . . . . . . . . . . . . . . . . . . 90
5.3.2 Discretization of the Random Process . . . . . . . . . . . . . . . . 93
5.3.3 Shrinking-Horizon Policies on the Test Sample . . . . . . . . . . . 97
5.3.4 Performances of Learned Policies . . . . . . . . . . . . . . . . . . . 98
5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6. Learning Projections on Random Polyhedra . . . . . . . . . . . . . . . . . . . . 105


6.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.2 Geometry of Euclidian Projections . . . . . . . . . . . . . . . . . . . . . . 107
6.3 Properties of Optimal Solutions . . . . . . . . . . . . . . . . . . . . . . . . 110
6.4 Classifiers for Sets of Active Constraints . . . . . . . . . . . . . . . . . . . 120
6.4.1 Description of the Algorithms. . . . . . . . . . . . . . . . . . . . . 120
6.4.2 Complexity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.5.1 Description of the Test Problems . . . . . . . . . . . . . . . . . . . 125
6.5.2 Description of the Experiments . . . . . . . . . . . . . . . . . . . . 125
6.5.3 Discussion of the Results. . . . . . . . . . . . . . . . . . . . . . . . 126
6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

7. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.2 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.3 Influences on Research in Machine Learning . . . . . . . . . . . . . . . . . 140

A. Elements of Variational Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 143


A.1 Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

A.2 Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144


A.3 Semicontinuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
A.4 Attainment of a Minimum . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
A.5 Epigraph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
A.6 Epi-convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
A.7 Convergence in Minimization . . . . . . . . . . . . . . . . . . . . . . . . . 149
A.8 Pointwise, Continuous and Uniform Convergence . . . . . . . . . . . . . . 150
A.9 Parametric Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
A.10 Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
A.11 Lipschitz Continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

B. Basic Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155


B.1 The Probability Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
B.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
B.3 Random Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
B.4 Random Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
B.5 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
B.6 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

C. Elements of Functional Analysis for Kernel Methods . . . . . . . . . . . . . . . 165


C.1 Hilbert Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
C.2 Linear Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
C.3 Reproducing Kernel Hilbert Spaces . . . . . . . . . . . . . . . . . . . . . . 167
C.4 Positive Definite Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

D. Structural Results for Two-Stage Stochastic Programming . . . . . . . . . . . . 171


D.1 Problem Statement and Assumptions . . . . . . . . . . . . . . . . . . . . . 171
D.2 Structural Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Chapter 1

Introduction

Multistage stochastic programming has attracted a lot of interest in recent years
as a promising framework for formulating sequential decision making problems under
uncertainty. Several potential applications of the framework are often cited:

• Capacity planning: finding the location and size of units or equipment, such as
power plants or telecommunication relays.

• Production planning: selecting components to produce, allocating components to


machines, managing stocks.

• Transportation and logistics: dispatching trucks and delivering goods.

• Financial management: balancing a portfolio of assets and liabilities according to


market conditions and subject to regulatory constraints.

In these applications, uncertainty may refer to the evolution of the demand for goods or
services, temperature and rainfall patterns affecting consumption or production, inter-
est rates affecting the burden of debt, and so on. Under growing environmental stress,
resource limitations, and the concentration of populations in cities, many believe that these
applications will only gain in societal impact in the future, and that even better quantitative
methods for tackling them are needed, especially methods able to take into account a
large number of constraints.
In general, problems where a flexible plan of successive decisions has to be imple-
mented, under uncertainties described by a probabilistic model, can be formulated as a
multistage stochastic program (Chapter 2). However, scalable numerical solution algo-
rithms are not always available, so that restrictions to certain classes of programs and
then further approximations are needed.
Interestingly, the approximations affect primarily the representation of the uncer-
tainty, rather than the space of possible decisions or the space of possible states reachable
by the controlled system. Thus, the dimensionality limitations faced by the multistage
stochastic programming framework are of a different nature than those found in dynamic
programming (the so-called curse of dimensionality). The multistage stochastic program-
ming framework is very attractive for settings where decisions in high-dimensional spaces
must be found, but it quickly runs into difficulties as the dimension of the uncertainty and
the length of the planning horizon grow.
This thesis deals with some aspects related to multistage stochastic programming.
Our research was initially motivated by finding ways to incorporate into the multistage
stochastic programming framework recent advances in statistics and machine learning,
especially perturb-and-combine estimation methods, and value function approximation
methods from approximate dynamic programming.
In this thesis, we propose and implement on a series of test problems a fast approach,
based on supervised learning, for estimating the quality of an approximate solution. We
show that this approach is flexible and tractable enough to foster advances in the ways
multistage stochastic programming problems are approximated, by explicitly proposing
and evaluating novel approximation procedures, according to two criteria: the quality of
the approximate solution for the true problem, and the overall computational complexity
of the procedure.
A detailed account of the contributions presented in the thesis can be found in Sec-
tion 7.1.

1.1 Outline of the Thesis

The thesis is organized as follows.


Chapter 2 introduces the multistage stochastic programming approach to sequential
decision making under uncertainty, and several notions used throughout the thesis. It
discusses the value of multistage stochastic programming with respect to related ap-
proaches, and presents the main challenges and limitations of the multistage stochastic
programming approach.
Chapter 3 reviews some approaches to statistical estimation investigated in machine
learning. Then, it explores the idea of aggregating in a certain sense the solutions to
various approximations of the same multistage problem.
Chapter 4 develops the principles of a solution validation approach, based on super-
vised learning, and shows how it can be exploited so as to identify good approximations
to multistage programs under tight complexity limitations. Then, it proposes and eval-
uates on a family of test problems a new approximate solution approach, based on the
generation of several approximations (scenario trees) rather than a single one.
Chapter 5 investigates further methods for estimating the value of a single approxi-
mation to a multistage program, in the practical context of a test problem.
Chapter 6 develops an efficient procedure for predicting the optimal solution of a
certain class of parametric programs, with the aim of better characterizing the potential
limitations of approaches based on learning.
Chapter 7 concludes with a summary of contributions, a discussion of future research
directions, and some thoughts about the possible impacts on machine learning and arti-
ficial intelligence of the research in stochastic programming.
Some mathematical background, not deemed essential as preliminary reading, has been
collected in a series of appendices. The reasons for including in the thesis
the material of a given appendix are detailed at the beginning of the appendix. The
appendices could also be handy to clarify some statements in the main body of the
thesis, and to this end, the content of the appendices has been referenced in an index
placed at the end of the thesis.
The appendices are organized as follows.

Appendix A defines notions from optimization and variational analysis.


Appendix B defines notions from measure and probability theory.
Appendix C defines notions from functional analysis related to kernel methods.
Appendix D summarizes some results from two-stage stochastic programming.

1.2 Published Work

Whereas the material of Chapters 5 and 6 is still unpublished, most of the material of
Chapters 2, 3, 4 has been published in the following papers.
• B. Defourny, L. Wehenkel. 2007. Projecting generation decisions induced by a stochastic
program on a family of supply curve functions. Third Carnegie Mellon Conference on the
Electricity Industry. Pittsburgh PA. 6 pages.

• B. Defourny, D. Ernst, L. Wehenkel. 2008. Lazy planning under uncertainty by optimizing
decisions on an ensemble of incomplete disturbance trees. S. Girgin, M. Loth, R. Munos, editors,
Recent Advances in Reinforcement Learning, Eighth European Workshop (EWRL-2008). LNCS
(LNAI) 5323, Springer, 1–14.

• B. Defourny, D. Ernst, L. Wehenkel. 2009. Planning under uncertainty, ensembles of
disturbance trees and kernelized discrete action spaces. IEEE Symposium on Adaptive Dynamic
Programming and Reinforcement Learning (ADPRL-2009). 145–152.

• B. Defourny, D. Ernst, L. Wehenkel. 2009. Bounds for multistage stochastic programs
using supervised learning strategies. O. Watanabe, T. Zeugmann, editors, Stochastic Algo-
rithms: Foundations and Applications. Fifth International Symposium, SAGA 2009. LNCS
5792, Springer, 61–73.

• B. Defourny, D. Ernst, L. Wehenkel. 2010. Multistage stochastic programming: A scenario
tree based approach to planning under uncertainty. Accepted as a contributing chapter to
L.E. Sucar, E.F. Morales, and J. Hoey, editors, Decision Theory Models for Applications in
Artificial Intelligence: Concepts and Solutions. To be published by IGI Global.

Work that uses concepts or algorithms from stochastic programming, and addresses
specific topics in machine learning, has also been presented in the following papers.
• B. Defourny, D. Ernst, L. Wehenkel. 2008. Risk-aware decision making and dynamic
programming. Y. Engel, M. Ghavamzadeh, S. Mannor, P. Poupart, editors, NIPS-08 workshop
on model uncertainty and risk in reinforcement learning. 8 pages.

• B. Defourny, L. Wehenkel. 2009. Large margin classification with the progressive hedging
algorithm. S. Nowozin, S. Sra, S. Vishwanathan, S. Wright, editors, Second NIPS workshop on
optimization for machine learning. 6 pages.

Finally, some collaborative work can also be mentioned.


• S. Ammar, P. Leray, B. Defourny, L. Wehenkel. 2008. High-dimensional probability density
estimation with randomized ensembles of tree structured Bayesian networks. Fourth European
Workshop on Probabilistic Graphical Models (PGM 2008). 9–16.

• S. Ammar, P. Leray, B. Defourny, L. Wehenkel. 2009. Probability density estimation
by perturbing and combining tree structured Markov networks. Tenth European Conference
on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU 2009).
158–167.

• B. Cornélusse, G. Vignal, B. Defourny, L. Wehenkel. 2009. Supervised learning of intra-
daily recourse strategies for generation management under uncertainties. IEEE Power Tech
Conference. 8 pages.
Chapter 2

The Multistage Stochastic Programming Framework

This chapter presents the multistage stochastic programming approach to sequential
decision making under uncertainty. It points out important issues posed by the approach,
and discusses the value of the framework with respect to related frameworks.
The chapter is organized as follows. Section 2.1 presents the multistage stochastic
programming framework, the discretization techniques, and the considerations on nu-
merical optimization methods that have an influence on the way problems are modeled.
Section 2.2 compares the approach to Markov Decision Processes, discusses the curse
of dimensionality, and puts in perspective simpler decision making models based on nu-
merical optimization, such as two-stage stochastic programming with recourse or Model
Predictive Control. Section 2.3 explains the issues posed by the dominant approxima-
tion/discretization approach for solving multistage programs (which is suitable for han-
dling both discrete and continuous random variables). Section 2.4 provides some back-
ground information on existing approximation methods. Finally, Section 2.5 concludes
with our summary of how multistage stochastic programming is perceived today among
researchers.

2.1 Description of the Framework

In this section, we describe an attitude towards risk and uncertainty that can motivate
decision makers to employ multistage stochastic programming. Then, we detail the ele-
ments of the decision model and the approximations that can make the model tractable.

2.1.1 From Nominal Plans to Decision Processes

In their first attempt towards planning under uncertainty, decision makers often set up a
course of actions, or nominal plan (reference plan), deemed to be robust to uncertainties
in some sense, or to be a wise bet on future events. Then, they apply the decisions, often
departing from the nominal plan to better take account of actual events. To further
improve the plan, decision makers are then led to consider (i) in which parts of the
plan flexibility in the decisions may help to better fulfill the objectives, and (ii) whether
the process by which they make themselves (or the system) “ready to react” impacts the
initial decisions of the plan and the overall objectives. If the answer to (ii) is positive, then
it becomes valuable to cast the decision problem as a sequential decision making problem,
even if the net added value of doing so (benefits minus increased complexity) is unknown
at this stage. During the planning process, the adaptations (or recourse decisions) that
may be needed are clarified, and their influence on prior decisions is quantified. The notion of
nominal plan is replaced by the notion of decision process, defined as a course of actions
driven by observable events. As distinct outcomes usually have antagonistic effects on ideal
prior decisions, it becomes crucial to determine which outcomes should be considered, and
what importance weights should be put on these outcomes, with a view to selecting
decisions under uncertainty that will not be regretted too much once the uncertainty has
been dissipated by the course of real-life events.

2.1.2 Incorporating Probabilistic Reasoning

In the robust optimization approach to decision making under uncertainty, decision mak-
ers are concerned by worst-case outcomes. Describing the uncertainty is then essentially
reduced to drawing the frontier between events that should be considered and events
that should be excluded from consideration (for instance, because they would paralyze
any action). In that context, outcomes under consideration form the uncertainty set, and
decision making becomes a game against some hostile opponent that selects the worst
outcome from the uncertainty set. Ben-Tal et al. (2004) provide arguments in favor of
robust approaches.
In a stochastic programming approach, decision makers use a softer frontier between
possible outcomes, by assigning weights to outcomes and optimizing some aggregated
measure of performance that takes into account all these possible outcomes. In that
context, the weights are often interpreted as a probability measure over the events, and
a typical way of aggregating the events is to consider the expected performance under
that probability measure.
Furthermore, interpreting weights as probabilities allows reasoning under uncertainty.
Essentially, probability distributions are conditioned on observations, and Bayes’ rule
from probability theory quantifies how decision makers’ initial beliefs about the likelihood
of future events — be it from historical data or from bets — should be updated on the
basis of new observations.
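To make this updating step concrete, the short sketch below (plain Python, with a purely hypothetical set of scenarios and weights) conditions a finite set of scenario weights on the observed value of the first random variable, which is the computation implied by Bayes' rule in this finite setting.

```python
# Minimal sketch: conditioning the weights of a finite set of scenarios on an
# observation of xi_1 (Bayes' rule in the finite case).  Data are hypothetical.
scenarios = [(-4, -3, 0), (-4, 2, -2), (-4, 2, 1), (3, -3, 0)]  # outcomes of (xi_1, xi_2, xi_3)
prior = [0.2, 0.3, 0.2, 0.3]                                    # prior weights, summing to 1

def posterior(scenarios, prior, xi1_observed):
    """Weights conditioned on the event {xi_1 = xi1_observed}."""
    mass = sum(p for s, p in zip(scenarios, prior) if s[0] == xi1_observed)
    return [p / mass if s[0] == xi1_observed else 0.0
            for s, p in zip(scenarios, prior)]

print(posterior(scenarios, prior, -4))  # roughly [0.286, 0.429, 0.286, 0.0]
```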
Technically, it turns out that the optimization of a decision process contingent on
future events is more tractable (read: suitable to large-scale operations) when the “rea-
soning under uncertainty” part can be decoupled from the optimization process itself. In
particular, such a decoupling occurs when the probability distributions describing future
events are not influenced in any way by the decisions selected by the agent, that is, when
the uncertainty is exogenous to the decision process.

2.1.3 The Elements of the General Decision Model

We can now describe the main elements of a multistage stochastic programming decision
model. These elements are:

i. A sequence of random variables ξ1 , ξ2 , . . . , ξT defined on a probability space


(Ω, B, P). The random variables represent the uncertainty in the decision problem,
and their possible values represent the possible observations to which the decision
maker will react. The probability measure P serves to quantify the prior beliefs
about the uncertainty. There is no restriction on the structure of the random vari-
ables; in particular, the random variables may be dependent. When the realization
of ξ1 , . . . , ξt−1 is known, there is a residual uncertainty represented by the random
variables ξt , . . . , ξT , the distribution of which is now conditioned on the realization
of ξ1 , . . . , ξt−1 .

ii. A sequence of decisions u1 , u2 , . . . , uT defining the decision process for the problem.
Many models also use a terminal decision uT +1 . We will assume that ut is valued in
a Euclidian space Rm (the space dimension m, corresponding to a number of scalar
decisions, could vary with the index t, but we will not stress that in the notation).

iii. A convention specifying when decisions should actually be taken and when the
realizations of the random variables are actually revealed. This means that if ξt−1
is observed before taking a decision ut , we can actually adapt ut to the realization
of ξt−1 . To this end, we identify decision stages: see Table 2.1. A row of the
table is read as follows: at decision stage t > 1, the decisions u1 , . . . , ut−1 are
already implemented (no modification is possible), the realization of the random
variables ξ1 , . . . , ξt−1 is known, the realization of the random variables ξt , . . . , ξT is
still unknown but a density P(ξt , . . . , ξT | ξ1 , . . . , ξt−1 ) conditioned on the realized
value of ξ1 , . . . , ξt−1 is available, and the current decision to take concerns the
value of ut . Once such a convention holds, we need not stress in the notation the
difference between random variables ξt and their realized value, or decisions as
functions of uncertain events and the actual value for these decisions: the correct
interpretation is clear from the context of the current decision stage.
The adaptation of a decision ut to prior observations ξ1 , . . . , ξt−1 will always be
made in a deterministic fashion, in the sense that ut is uniquely determined by the
value of (ξ1 , . . . , ξt−1 ).
A sequential decision making problem has more than two decision stages inas-
much as the realizations of the random variables are not revealed simultaneously:
the choice of the decisions taken between successive observations has to take into
account some residual uncertainty on future observations. If the realization of
several random variables is revealed before actually taking a decision, then the
corresponding random variables should be merged into a single random vector;
if several decisions are taken without intermediary observations, then the corre-
sponding decisions should be merged into a single decision vector (Gassmann and
Prékopa, 2005). This is how a problem concerning several time periods could ac-
tually be a two-stage stochastic program, involving two large decision vectors u1
(first-stage decision, constant), u2 (recourse decision, adapted to the observation of
ξ1 ). What is called a decision in a stochastic programming model may thus actually
correspond to several actions implemented over a certain number of discrete time
periods.

iv. A sequence of feasibility sets U1 , . . . , UT describing which decisions u1 , . . . , uT are


admissible. When ut ∈ Ut , one says that ut is feasible. The feasibility sets
U2 , . . . , UT may depend, in a deterministic fashion, on available observations and
prior decisions. Thus, following Table 2.1, Ut may depend on ξ1 , u1 , ξ2 , u2 ,
. . . , ξt−1 in a deterministic fashion. Note that prior decisions are uniquely determined
by prior observations, but for convenience we keep track of prior decisions to
parametrize the feasibility sets.

Tab. 2.1: Decision stages, setting the order of observations and decisions.

              Available information for taking decisions
Stage   Prior decisions     Observed outcomes    Residual uncertainty          Decision
  1     none                none                 P(ξ1, . . . , ξT)             u1
  2     u1                  ξ1                   P(ξ2, . . . , ξT | ξ1)        u2
  3     u1, u2              ξ1, ξ2               P(ξ3, . . . , ξT | ξ1, ξ2)    u3
  .     .                   .                    .                             .
  T     u1, . . . , uT−1    ξ1, . . . , ξT−1     P(ξT | ξ1, . . . , ξT−1)      uT
 T+1    u1, . . . , uT      ξ1, . . . , ξT       none                          (uT+1, optional)

An important role of the feasibility sets is to model how decisions are affected by
prior decisions and prior events. In particular, a situation with no possible recourse
decision (Ut empty at stage t, meaning that no feasible decision ut ∈ Ut exists) is
interpreted as a catastrophic situation to be avoided at any cost.
We will always assume that the planning agent knows the set-valued mapping from
the random variables ξ1 , . . . , ξt−1 and the decisions u1 , . . . , ut−1 to the set Ut of
feasible decisions ut .
We will also assume that the feasibility sets are such that a feasible sequence of
decisions u1 ∈ U1 , . . . , uT ∈ UT exists for all possible joint realizations of ξ1 , . . . , ξT .
In particular, the fixed set U1 must be nonempty. A feasibility set Ut parametrized
only by variables in a subset of {ξ1 , . . . , ξt−1 } must be nonempty for any possi-
ble joint realization of those variables. A feasibility set Ut also parametrized by
variables in a subset of {u1 , . . . , ut−1 } must be implicitly taken into account in the
definition of the prior feasibility sets, so as to prevent a decision maker from taking,
at some earlier stage, a decision that could lead to a situation at stage t
with no possible recourse decision (Ut empty), be it for all possible joint realiza-
tions of the subset of {ξ1 , . . . , ξt−1 } on which Ut depends, or for some possible joint
realization only. These implicit requirements will affect in particular the definition
of U1 .
For example, assume that ut−1 , ut ∈ Rm , and take Ut = {ut ∈ Rm : ut ≥
0, At−1 ut−1 + Bt ut = ht (ξt−1 )} with At−1 , Bt ∈ Rq×m fixed matrices, and ht an
affine function of ξt−1 with values in Rq . If Bt is such that {Bt ut : ut ≥ 0} = Rq ,
meaning that for any v ∈ Rq , there exists some ut ≥ 0 with Bt ut = v, then this
is true in particular for v = ht (ξt−1 ) − At−1 ut−1 , so that Ut is never empty (a
numerical check of this condition is sketched just after this list). More
details on such conditions can be found in Appendix D.

v. A performance measure, summarizing the overall objectives of the decision maker,


that should be optimized. It is assumed that the decision maker knows the performance
measure.

Fig. 2.1: (From left to right) Nested partitioning of the event space Ω, starting from a trivial
partition representing the absence of observations. (Rightmost) Scenario tree corre-
sponding to the partitioning process.

In this chapter, we write the performance measure as the expectation
of a function f that assigns some scalar value to each realization of ξ1 , . . . , ξT
and u1 , . . . , uT , assuming the integrability of f with respect to the joint distribution
of ξ1 , . . . , ξT .
For example, one could take for f a sum of scalar products ∑_{t=1}^{T} ct · ut , where
c1 is fixed and where ct depends affinely on ξ1 , . . . , ξt−1 . The function f would
represent a sum of instantaneous costs over the planning horizon. The decision
maker would be assumed to know the vector-valued mapping from the random
variables ξ1 , . . . , ξt−1 to the vector ct , for each t.
Besides the expectation, more sophisticated ways to aggregate the distribution
of f into a single measure of performance have been investigated (Ruszczyński and
Shapiro, 2006; Pflug and Römisch, 2007). An important element considered in the
choice of the performance measure is the tractability of the resulting optimization
problem.
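Returning to the example in item (iv), whether {Bt ut : ut ≥ 0} covers all of Rq can be checked numerically: since this set is a convex cone, it equals Rq as soon as each direction +ei and −ei is reachable, which amounts to 2q linear feasibility problems. A minimal sketch, assuming SciPy's linprog routine and a purely hypothetical recourse matrix, is given below.

```python
# Minimal sketch of a complete-recourse check for the example in item (iv):
# does {B u : u >= 0} cover all of R^q?  The set is a convex cone, so it does
# as soon as every direction +e_i and -e_i is reachable.
import numpy as np
from scipy.optimize import linprog

def positively_spans(B):
    q, m = B.shape
    for i in range(q):
        for sign in (1.0, -1.0):
            v = np.zeros(q)
            v[i] = sign
            # Feasibility problem: find u >= 0 with B u = v (zero objective).
            res = linprog(c=np.zeros(m), A_eq=B, b_eq=v, bounds=[(0, None)] * m)
            if not res.success:
                return False
    return True

B = np.array([[1.0, -1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, -1.0]])   # hypothetical recourse matrix B_t
print(positively_spans(B))              # True: U_t is nonempty for any right-hand side
```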

The planning problem is then formalized as a mathematical programming problem.


The formulation relies on a particular representation of the random process ξ1 , . . . , ξT in
relation to the decision stages, referred to as a scenario tree in the stochastic program-
ming literature, and described in the next section.

2.1.4 The Tree Representation of Gradually Revealed Scenarios

Let us call scenario an outcome of the random process ξ1 , . . . , ξT . A scenario tree is an


explicit representation of the branching process induced by the gradual observation of
ξ1 , . . . , ξT , under the assumption that the random variables have a finite discrete support.
It is built as follows. A root node is associated to the first decision stage and to the initial
absence of observations. To the root node are connected children nodes associated to
stage 2, one child node for each possible outcome of the random variable ξ 1 . Then, to
each node of stage 2 are connected children nodes associated to stage 3, one for each
outcome of ξ2 given the observation of ξ1 relative to the parent node. The branching
process construction goes on until the last stage is reached; at this point, the outcomes
associated to the nodes on the unique path from the root to a leaf define together a
particular scenario, that can be associated to the leaf.
The probability distribution of the random variables is also taken into account. Prob-
ability masses are associated to the nodes of the scenario tree. The root node has
probability 1, whereas children nodes are weighted by probabilities that represent the
probability of the value to which they are associated, conditioned on the value associated
to their ancestor node. Multiplying the probabilities of the nodes of the path from the
root to a leaf gives the probability of a scenario.
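The construction just described can be summarized by a small data structure. The sketch below (plain Python, with hypothetical node values) attaches to each node an outcome and its probability conditioned on the parent node; the probability of a scenario is recovered by multiplying the conditional probabilities along the root-to-leaf path.

```python
# Minimal sketch of a scenario tree: each node stores an outcome of the next
# random variable and its probability conditioned on the parent node.
class Node:
    def __init__(self, value=None, prob=1.0, children=()):
        self.value = value          # outcome of xi_t at this node (None at the root)
        self.prob = prob            # P(value | ancestor outcomes), 1.0 at the root
        self.children = list(children)

def scenarios(node, path=(), prob=1.0):
    """Yield (scenario, probability) pairs by walking all root-to-leaf paths."""
    if node.value is not None:
        path = path + (node.value,)
    prob = prob * node.prob
    if not node.children:
        yield path, prob
    for child in node.children:
        yield from scenarios(child, path, prob)

# Hypothetical two-stage tree: xi_1 in {-4, 3}, then xi_2 conditioned on xi_1.
root = Node(children=[
    Node(-4, 0.4, [Node(-3, 0.25), Node(2, 0.75)]),
    Node( 3, 0.6, [Node(-3, 0.50), Node(0, 0.50)]),
])
for path, prob in scenarios(root):
    print(path, round(prob, 3))     # (-4, -3) 0.1, (-4, 2) 0.3, (3, -3) 0.3, (3, 0) 0.3
```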
Clearly, an exact construction of the scenario tree would require an infinite num-
ber of nodes if the support of (ξ1 , . . . , ξT ) is discrete but not finite. A random process
involving continuous random variables cannot be represented as a scenario tree; never-
theless, the scenario tree construction turns out to be instrumental in the construction
of approximations to nested continuous conditional distributions.
Branchings are essential to represent residual uncertainty beyond the first decision
stage. At the planning time, the decision makers may contemplate as many hypothetical
scenarios as desired, but when decisions are actually implemented, the decisions can-
not depend on observations that are not yet available. We have seen that the decision
model specifies, with decision stages, how the scenario actually realized will be gradually
revealed. No branchings in the representation of the outcomes of the random process
would mean that after conditioning on the observation of ξ1 , the outcome of ξ2 , . . . , ξT
could be predicted (anticipated) exactly. Under such a representation, decisions spanning
stages 2 to T would be optimized on the anticipated outcome. This would be equivalent
to optimizing a nominal plan for u2 , . . . , uT that fully bets on some scenario anticipated
at stage 2.
To visualize how information on the realization of the random variables becomes
gradually available, it is convenient to imagine nested partitions of the event space (Fig-
ure 2.1): refinements of the partitions appear gradually at each decision stage in cor-
respondence with the possible realizations of the new observations. To each subregion
induced by the partitioning of the event space can be associated a constant recourse
decision, as if decisions were chosen according to a piecewise constant decision policy.
On Figure 2.1, the surface of each subregion could also represent probabilities (then by
convention the initial square has a unit surface and the thin space between subregions
is for visual separation only). The dynamical evolution of the partitioning can be rep-
resented by a scenario tree: the nodes of the tree correspond to the subregions of the
event space, and the edges between subregions connect a parent subregion to its refined
subregions obtained by one step of the recursive partitioning process.
Ideally a scenario tree should cover the totality of possible outcomes of a random
process. But unless the support of the distribution of the random variables is finite, no
scenario tree with a finite number of nodes can represent exactly the random process and
the probability measure, as we already mentioned, while even if the support is finite, the
number of scenarios grows exponentially with the number of stages.

2.1.5 Approximating Random Processes with Scenario Trees

In the general decision model, the agent is assumed to have access to the joint probability
distributions, and is able to derive from it the conditional distributions listed in Table 2.1.
In practice, computational limitations will restrict the quality of the representation of
P. Let us however reason at first at an abstract and ideal level to establish the program
that an agent would solve for planning under uncertainty.
For brevity, let ξ denote (ξ1 , . . . , ξT ), and let π(ξ) denote a decision policy mapping
realizations of ξ to realizations of the decision process u1 , . . . , uT . Let πt (ξ) denote
ut viewed as a function of ξ. To be consistent with the decision stages, the policy
must be non-anticipative, in the sense that ut cannot depend on observations relative to
subsequent stages. Equivalently one can say that π1 must be a constant-valued function,
π2 a function of ξ1 , and in general πt a function of ξ1 , . . . , ξt−1 for t = 2, . . . , T .
The planning problem can then be stated as the search for a non-anticipative policy π,
restricted by the feasibility sets Ut , that minimizes an expected total cost f spanning the
decision stages and determined by the scenario ξ and the decisions π(ξ):
    P :   minimize    E{f (ξ, π(ξ))}
          subject to  πt (ξ) ∈ Ut (ξ);   π(ξ) non-anticipative.

Here we used an abstract notation which hides the nested expectations corresponding
to the successive random variables, and the possible sum decomposition of the function f
among the decision stages. Concrete formulations are presented in Appendix D. Note
that it is possible to be more general by replacing the expectation operator by a func-
tional Φ{·} that maps the distribution of f to a single number in [−∞, ∞]. We also
stressed the possible dependence of Ut on ξ1 , u1 , ξ2 , u2 , . . . , ξt−1 by writing Ut (ξ).
A program more amenable to numerical optimization techniques is obtained by repre-
senting π(·) by a set of optimization variables for each possible argument of the function
— for each possible outcome ξ^k = (ξ^k_1, . . . , ξ^k_T) of ξ, one associates the optimization
variables (u^k_1, . . . , u^k_T), written u^k for brevity. The non-anticipativity of the policy can be
expressed by a set of equality constraints: for the first decision stage (t = 1) we require
u^k_1 = u^j_1 for all (k, j), and for subsequent stages (t ≥ 2) we require u^k_t = u^j_t for each (k, j)
such that (ξ^k_1, . . . , ξ^k_{t−1}) ≡ (ξ^j_1, . . . , ξ^j_{t−1}).
A finite-dimensional approximation to the program P is obtained by considering a
finite number n of outcomes, and assigning to each outcome a probability p^k > 0. This
yields a formulation on a scenario tree covering the scenarios ξ^k:

    P' :  minimize    ∑_{k=1}^{n} p^k f(ξ^k, u^k)
          subject to  u^k_t ∈ Ut(ξ^k)   ∀ k;
                      u^k_1 = u^j_1     ∀ k, j;
                      u^k_t = u^j_t   whenever (ξ^k_1, . . . , ξ^k_{t−1}) ≡ (ξ^j_1, . . . , ξ^j_{t−1}).

Once again we used a simple notation ξ^k for designating outcomes of the process ξ,
which hides the fact that outcomes can share some elements according to the branching
structure of the scenario tree.
Non-anticipativity constraints can also be accounted for implicitly. A partial path
from the root (depth 0) to some node of depth t of the scenario tree identifies some
outcome (ξ^k_1, . . . , ξ^k_t) of (ξ1 , . . . , ξt ). To the node can be associated the decision u^k_{t+1}, but
also all decisions u^j_{t+1} such that (ξ^k_1, . . . , ξ^k_t) ≡ (ξ^j_1, . . . , ξ^j_t). Those decisions are redundant
and can be merged into a single decision on the tree, associated to the considered node
of depth t.
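This merging can be made explicit: grouping the scenario indices by their common history prefix (ξ^k_1, . . . , ξ^k_t) recovers exactly the nodes of depth t, and a single decision variable can then be attached to each group. A minimal sketch in plain Python, using for illustration the seven scenarios of the example in Section 2.1.6 below, is as follows.

```python
# Minimal sketch: recover the tree nodes of depth t by grouping the scenarios
# that share the same history prefix; one decision variable per group then
# enforces non-anticipativity implicitly.
from collections import defaultdict

# The seven scenarios (xi_1, xi_2, xi_3) of the example in Section 2.1.6, k = 1..7.
scenarios = [(-4, -3, 0), (-4, 2, -2), (-4, 2, 1), (3, -3, 0),
             (3, 0, -1), (3, 0, 2), (3, 2, 1)]

def nodes_at_depth(scenarios, t):
    """Map each distinct prefix (xi_1, ..., xi_t) to the scenario indices sharing it."""
    groups = defaultdict(list)
    for k, scen in enumerate(scenarios, start=1):
        groups[scen[:t]].append(k)
    return dict(groups)

for t in range(4):
    print(t, nodes_at_depth(scenarios, t))
# Depth 0: a single group (the root), hence a single first-stage decision u_1;
# depth 1: groups {1,2,3} and {4,5,6,7}; depth 2: groups {1}, {2,3}, {4}, {5,6}, {7}.
```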

2.1.6 Simple Example of Formulation

To fix ideas, we illustrate the scenario tree technique on a trajectory tracking problem
under uncertainty with control penalization. In the proposed example, the uncertainty
is such that the exact problem can be posed on a small finite scenario tree.
Say that a random process ξ = (ξ1 , ξ2 , ξ3 ), representing perturbations at times t =
1, 2, 3, has 7 possible outcomes (scenarios), denoted by ξ^k, 1 ≤ k ≤ 7, with known
probabilities p^k:

    k        1     2     3     4     5     6     7
    ξ^k_1   -4    -4    -4     3     3     3     3
    ξ^k_2   -3     2     2    -3     0     0     2
    ξ^k_3    0    -2     1     0    -1     2     1
    p^k    0.1   0.2   0.1   0.2   0.1   0.1   0.2

The random process is fully represented by the scenario tree of Figure 2.2 (Left): the
first possible outcome is ξ^1 = (−4, −3, 0) with probability p^1 = 0.1, and so on. Note that
the random variables ξ1 , ξ2 , ξ3 are not mutually independent.
Assume that an agent can choose actions vt ∈ R at t = 1, 2, 3 (the notation vt instead
of ut is justified in the sequel). The goal of the agent is the minimization of an expected
sum of costs E{∑_{t=1}^{3} ct(vt , xt+1) | x1 = 0}. Here xt ∈ R is the state of a continuous-
state, discrete-time dynamical system that starts from the initial state x1 = 0 and
follows the state transition equation xt+1 = xt + vt + ξt . Costs ct(vt , xt+1), associated
to the decision vt and the transition to the state xt+1 , are defined by ct = dt+1 + vt^2/4
with dt+1 = |xt+1 − αt+1| and α2 = 2.9, α3 = 0, α4 = 0 (αt+1 : nominal trajectory; dt+1 :
tracking error; vt^2/4: penalization of control effort).
An optimal policy mapping observations ξ1 , . . . , ξt−1 to decisions vt can be obtained
by solving the following convex quadratic program over variables v^k_t, x^k_{t+1}, d^k_{t+1}, where k
runs from 1 to 7 and t from 1 to 3, and over x^k_1 trivially set to 0:

    minimize    ∑_{k=1}^{7} p^k ∑_{t=1}^{3} ( d^k_{t+1} + (v^k_t)^2 / 4 )
    subject to  −d^k_{t+1} ≤ x^k_{t+1} − α_{t+1} ≤ d^k_{t+1}        ∀ k, t
                x^k_1 = 0,   x^k_{t+1} = x^k_t + v^k_t + ξ^k_t      ∀ k, t
                v^1_1 = v^2_1 = v^3_1 = v^4_1 = v^5_1 = v^6_1 = v^7_1
                v^1_2 = v^2_2 = v^3_2,   v^4_2 = v^5_2 = v^6_2 = v^7_2
                v^2_3 = v^3_3,   v^5_3 = v^6_3.

Here, the vector of optimization variables (v^k_1, x^k_1) plays the role of u^k_1, the vector
(v^k_t, x^k_t, d^k_t) plays the role of u^k_t for t = 2, 3, and the vector (x^k_4, d^k_4) plays the role of u^k_4,
showing that the decision process u1 , . . . , uT+1 of the general multistage stochastic pro-
gramming decision model can in fact include state variables and more generally any
element that serves to evaluate costs conveniently.
The optimal objective value is +7.3148, and the optimal solution is depicted on Fig-
ure 2.2. In this example, the final solution can be recast as a mapping π̃t from xt
to vt : π̃1(0) = −0.1, π̃2(−4.1) = 2.1, π̃2(2.9) = −1.16, π̃3(−5) = 2, π̃3(−1.26) = 1.26,
π̃3(0) = 0.667, π̃3(1.74) = −0.74, π̃3(3.74) = −2. Hence in this case the modeling as-
sumption of an agent observing ξt instead of the system state xt is not a fundamental
restriction.
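As a cross-check of this example, the program above can be transcribed directly in Python. The sketch below assumes the cvxpy modeling library, replaces the auxiliary variables d^k_{t+1} by absolute values in the objective (an equivalent formulation), and generates the non-anticipativity constraints from shared scenario prefixes, which works here because scenarios with identical prefixes are numbered contiguously.

```python
# Minimal sketch (assuming the cvxpy library) of the scenario-tree program above.
import numpy as np
import cvxpy as cp

xi = np.array([[-4, -4, -4,  3,  3,  3,  3],        # outcomes of xi_1
               [-3,  2,  2, -3,  0,  0,  2],        # outcomes of xi_2
               [ 0, -2,  1,  0, -1,  2,  1]])       # outcomes of xi_3
p = np.array([0.1, 0.2, 0.1, 0.2, 0.1, 0.1, 0.2])   # scenario probabilities
alpha = np.array([2.9, 0.0, 0.0])                   # targets alpha_2, alpha_3, alpha_4
T, K = 3, 7

v = cp.Variable((T, K))                             # decisions v_t^k
x = cp.Variable((T + 1, K))                         # states x_1^k, ..., x_4^k

constraints = [x[0, :] == 0]
for t in range(T):
    constraints.append(x[t + 1, :] == x[t, :] + v[t, :] + xi[t, :])
    # Non-anticipativity: v_t may only depend on xi_1, ..., xi_{t-1}; scenarios
    # sharing that prefix (contiguous by construction) get equal decisions.
    for k in range(1, K):
        if np.array_equal(xi[:t, k], xi[:t, k - 1]):
            constraints.append(v[t, k] == v[t, k - 1])

cost = sum(p[k] * sum(cp.abs(x[t + 1, k] - alpha[t]) + cp.square(v[t, k]) / 4
                      for t in range(T))
           for k in range(K))
problem = cp.Problem(cp.Minimize(cost), constraints)
problem.solve()
print(problem.value)   # should reproduce the optimal value 7.3148 reported above
print(v.value[0, 0])   # first-stage decision, approximately -0.1
```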
Fig. 2.2: (Left) Scenario tree representing the 7 possible scenarios for a random process ξ =
(ξ1 , ξ2 , ξ3 ). The outcomes ξ^k_t are written in bold, and the scenario probabilities p^k are
reported at the leaf nodes. (Middle) Optimal actions vt for the agent. (Right) Visited
states xt under the optimal actions, treated as artificial decisions (see text).

2.2 Comparison to Related Approaches

This section discusses several modeling and algorithmic complexity issues raised by the
multistage stochastic programming framework and scenario-tree based decision making.

2.2.1 The Exogenous Nature of the Random Process

A frequent assumption made in the stochastic programming framework is that decision
makers do not influence, by their decisions, the realization of the random process repre-
senting the uncertainty. The random process is said to be exogenous. This makes it possible to
simulate, select and organize in advance possible realizations of the exogenous process,
before any observation is actually made, and then optimize jointly (as opposed to
individually for each scenario) the decisions contingent on the possible realizations.
The need to decouple the description of uncertainties and the optimization of decisions
might appear at first as a strong limitation on the situations that can be modeled and
treated by stochastic programming techniques. This impression is in part justified for a
large family of problems of control theory in which the uncertainty is identified with some
zero-mean noise perturbing the observations or the dynamics of the system, or when the
uncertainty is understood as uncertainty about the value of system parameters. But in
another large family of sequential decision making problems under uncertainty, the major
sources of uncertainty are precisely the ones that are least influenced by the behavior
of the decision makers. We also note that random processes strongly influenced by the
behavior of the decision makers can sometimes be handled by incorporating them to the
initial decision process and treating them as a virtual decision process.
A probabilistic reasoning based on a subset of the possible scenarios could easily be
tricked by an adversarial random process that would exploit one of the scenarios discarded
during the planning process. In many practical problems however, the environment is
not totally adversarial. In situations where the environment is mildly adversarial, it is
often possible to choose measures of performance that are more robust to bad outcomes,
and that can still be optimized in a tractable way.
Finally, it is easier in terms of sample complexity to learn a model (find model pa-
rameters from finite data sets) for an exogenous process than for an endogenous process.
Learning a model for an exogenous process is possible from observations of the process,
such as time series, whereas learning a model for an endogenous process forces us to be

able to simulate possible state transitions for every possible action, or at least to have at
one’s disposal a fairly exhaustive data set relating actions to state transitions.

2.2.2 Comparison to Markov Decision Processes

In Markov Decision Processes (MDP) (Bellman, 1954; Howard, 1960), the decision maker
seeks to optimize a performance criterion decomposed into a sum of instantaneous re-
wards. The information state of the decision maker at time t coincides with the state xt of a dynamical system. For simplicity, we do not consider in this discussion partial observability (POMDP) or risk-sensitivity, for which the system state need not be the information state of the agent. Optimal decision policies are often found by a reasoning based on the dynamic programming principle, for which the notion of state, as a sufficient statistic representing the complete history of the system's evolution and the agent's beliefs, is essential.
Multistage stochastic programming problems could be viewed as a subclass of finite-
horizon Markov Decision Processes, by identifying the growing history of observations
(ξ1 , . . . , ξt−1 ) to the agent’s state. However, the mathematical assumptions under the
MDP and the stochastic programming formulations are in fact quite different. Complex-
ity results suggest that the algorithmic resolution of MDPs is efficient when the decision
space is finite and small (Littman et al., 1995; Rust, 1997; Mundhenk et al., 2000; Kearns
et al., 2002), while for the scenario-tree based stochastic programming framework, the
resolution is efficient when the optimization problem is convex — in particular the deci-
sion space is continuous — and the number of decision stages is small (Shapiro, 2006).
One of the main appeals of stochastic programming techniques is their ability to deal
efficiently with high-dimensional continuous decision spaces structured by numerous con-
straints, and with sophisticated, non-memoryless random processes. At the same time,
if stochastic programming models have traditionally been used for optimizing long-term
decisions that are implemented once and have lasting consequences, for example in net-
work capacity planning (Sen et al., 1994), they are now increasingly used in the context
of near-optimal control strategies that Bertsekas (2005a) calls limited-lookahead strate-
gies. In this usage, at each decision stage an updated model over the remaining planning
horizon is rebuilt and optimized on the fly, from which only the first-stage decisions are
actually implemented. Indeed, when a stochastic program is solved on a scenario tree, the
initial search for a decision policy degenerates into the search for sequences of decisions
relative to the scenarios covered by the tree. The first-stage decision does not depend
on observations and can thus always be implemented on any new scenario, whereas the
recourse decisions relative to any particular scenario in the tree could be infeasible on a
new scenario, especially if the feasibility sets depend on the random process.

2.2.3 The Curse of Dimensionality

The curse of dimensionality is an algorithmic-complexity phenomenon by which comput-


ing optimal policies on higher dimensional input spaces requires an exponential growth
of computational resources, leading to intractable problem formulations. In dynamic
programming, the input space is the state space or a reduced parametrization of it. In
practice, the curse of dimensionality restricts covering-based approaches to input spaces embedded in Rd with d at most equal to 5–10.


Approximate Dynamic Programming methods (Bertsekas and Tsitsiklis, 1996; Bert-
sekas, 2005a; Powell, 2007) and Reinforcement Learning approaches (Sutton and Barto,
1998) help to mitigate the curse of dimensionality, for instance by attempting to focus
on the part of the state space that is actually reached under an optimal policy. An ex-
ploratory phase may be added to the original dynamic programming solution strategy so
as to discover the relevant part of the state space. Those approaches work well in several
cases:

i. The structure of a near-optimal policy is already known. For example, policy


search methods assume that a near-optimal policy can be parametrized a priori
by a small number of parameters, and rely on the user’s expertise to find such a
parametrization.

ii. Value-function based methods assume that there is a finite set of actions (or policy
parameters), given a priori, that are the elementary building blocks of a near-
optimal policy, and that can be used to drive the exploratory phase. The value
function represents or approximates the expected value-to-go from the current state,
and can be used to rank candidate actions (or policy parameters).

iii. By the structure of the optimization problem, the decisions and the state space
subregions identified as promising early in the exploratory phase are those that are
actually relevant to a near-optimal policy. This ensures the success of optimistic
exploratory strategies, that refine decisions within promising subregions.

Stochastic programming algorithms do not rely on the covering of the state space
of dynamic programming. Instead, they rely on the covering of the random exogenous
process, which need not correspond to the complete state space (see how the auxiliary
state xt is treated in the example of the previous section). The complement of the state
space and the decision space are “explored” during the optimization procedure itself. The
success of the approach will thus depend on the tractability of the joint optimization in
those spaces, and not on insights on the structure of near-optimal policies.
In multistage stochastic programming approaches, the curse of dimensionality is
present when the number of decision stages increases, and in face of high-dimensional
exogenous processes. Therefore, methods that one could call, by analogy to approxi-
mate dynamic programming, approximate stochastic programming methods, will attempt
to cover only the realizations of the exogenous random process that are truly needed to
obtain near-optimal decisions. These methods work with a number of scenarios that does
not grow exponentially with the dimension of the exogenous process and the number of
stages.

2.3 The Value of Multistage Stochastic Programming

Due to the curse of dimensionality, multistage stochastic programming is in competition


with more tractable decision models. At the same time it provides a unifying framework
between several simplified decision making paradigms, that we now describe.

2.3.1 Reduction to Model Predictive Control

A radical simplification consists in discarding the detailed probabilistic information on the


uncertainty, taking a nominal scenario, and optimizing decisions on the nominal scenario.
The common practice for defining a nominal scenario is to replace random variables by
their expectation. The resulting problem is called the expected value problem, the solution
of which constitutes a nominal plan. Even if the nominal plan could be used as an open-
loop decision policy, that is, implemented over the complete planning horizon, decision
makers will usually want to recompute the plan at the next decision stage by solving
an updated expected value problem. In control theory, the approach is known as Model
Predictive Control (MPC) (Bertsekas, 2005b).
An indicator of the value of multistage programming decisions over model predictive
control decisions is given by the value of the stochastic solution (VSS). To make the
notion precise, let us define successively:

• V ∗ , the optimal value of the multistage stochastic program minπ E{f (ξ, π(ξ))}. For
notational simplicity, we adopt the convention that f (ξ, π(ξ)) = ∞ if the policy π
is anticipative or yields infeasible decisions.

• ζ = (ζ1 , . . . , ζT ), the nominal scenario.

• uζ , the optimal solution to the expected value problem minu f (ζ, u). Note that the
optimization is over a single fixed sequence of feasible decisions; the problem data
is determined by ζ.

• uζ1 , the first-stage decision of uζ .

• V ζ , the optimal value of the multistage stochastic program minπ E{f (ξ, π(ξ))}
subject to the additional constraint π1 (ξ) = uζ1 for all ξ. If by a slight abuse of
notation, we write π1 , viewed as an optimization variable, for the value of the
constant-valued function π1 , then the additional constraint is simply π1 = uζ1 .
By definition, V ζ is the value of a policy implementing the first decision from
the expected value problem, and then selecting optimal recourse decisions for the
subsequent decision stages. The recourse decisions differ in general from those that
would be selected by a policy optimal for the original multistage program.

The VSS is then defined as the difference V ζ − V ∗ ≥ 0. For maximization problems, it


would be defined by V ∗ − V ζ ≥ 0. Birge and Louveaux (1997) describe special cases
(with two decision stages, and restrictions on the way randomness affects problem data)
for which it is possible to compute bounds on the VSS. They also come to the conclusion,
from their survey of works studying the VSS, that there is no rule that can predict a priori
whether the VSS is low or high for a given problem instance — for example increasing
the variance of random variables may increase or decrease the VSS.
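As a sketch of how the VSS can be evaluated numerically, the snippet below chains the three optimizations involved in its definition. The routines solve_multistage, solve_expected_value and solve_multistage_with_fixed_first_stage are hypothetical problem-specific solvers, not part of any library; they must be supplied by the user.

    def value_of_stochastic_solution(solve_multistage,
                                     solve_expected_value,
                                     solve_multistage_with_fixed_first_stage,
                                     nominal_scenario):
        """Sketch of the VSS computation for a minimization problem.

        The three callables are hypothetical problem-specific solvers:
          - solve_multistage() -> (V*, optimal first-stage decision)
          - solve_expected_value(zeta) -> (value, first-stage decision u1_zeta)
          - solve_multistage_with_fixed_first_stage(u1) -> value with pi_1 = u1 imposed
        """
        v_star, _ = solve_multistage()
        _, u1_zeta = solve_expected_value(nominal_scenario)
        v_zeta = solve_multistage_with_fixed_first_stage(u1_zeta)
        return v_zeta - v_star   # VSS >= 0 for minimization problems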

2.3.2 Reduction to Two-Stage Stochastic Programming

A less radical simplification consists in discarding the distinction between recourse stages,
keeping in the model a first stage (associated to full uncertainty) and a second stage (as-
sociated to full knowledge). A multistage model degenerates into a two-stage model when

the scenario tree has branchings only at one stage. The situation arises for instance when
scenarios are sampled over the full horizon independently: the tree has then branchings
only at the root. In Huang and Ahmed (2009), the value of multistage stochastic program-
ming (VMS) is defined as the difference of the optimal values of the multistage model
versus the two-stage model. The authors establish bounds on the VMS and describe an
application (in the semiconductor industry) where the VMS is high. Note however that a
generalization of the notion of VSS would rather quantify how multistage decisions out-
perform two-stage decisions when those two-stage decisions are implemented with model
rebuilding at each stage, in the manner of the Model Predictive Control scheme.

2.3.3 Reduction to Heuristics based on Parametric Optimization

As an intermediate simplification between the expected value problem and the reduction
to a two-stage model, it is possible to optimize sequences of decisions separately on
each scenario. The decision maker can then use some averaging, consensus or selection
strategy to implement a first-stage decision inferred from the so-obtained ensemble of
first-stage decisions. Here again, the model should be rebuilt with updated scenarios at
each decision stage.
The problem of computing optimal decisions separately for each scenario is known as
the distribution problem. The problem appears in the definition of the expected value of
perfect information (EVPI), which quantifies the additional value that a decision maker
could reach in expectation if he or she were able to predict the future. To make the
notion precise, let V ∗ denote as before the optimal value of the multistage stochastic
program minπ E{f (ξ, π(ξ))} over non-anticipative policies π; let V (ξ) denote the optimal
value of the deterministic program minu f (ξ, u); and let V A be the expected value of
V (ξ), according to the distribution of ξ. Observe that V A is also the optimal value of the
program minπA E{f (ξ, π A (ξ))} over anticipative policies π A , the optimization of which
is now decomposable among scenario subproblems. The EVPI is then defined as the
difference V ∗ − V A ≥ 0. For maximization problems, it is defined by V A − V ∗ ≥ 0.
Intuitively, the EVPI is high when having to delay adaptations to final outcomes due to
a lack of information results in high costs.
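Since V A decomposes over scenarios, it can be estimated by averaging per-scenario optima; in the sketch below, solve_multistage, solve_deterministic and sample_scenario are again hypothetical problem-specific routines, and the Monte Carlo average is only an estimate of V A.

    import numpy as np

    def estimate_evpi(solve_multistage, solve_deterministic, sample_scenario,
                      n_samples=1000, seed=0):
        """Monte Carlo sketch of the EVPI for a minimization problem.

        solve_deterministic(xi) solves min_u f(xi, u) for one scenario xi (the
        distribution problem), solve_multistage() returns (V*, u1*); both routines
        and sample_scenario(rng) are hypothetical placeholders.
        """
        rng = np.random.default_rng(seed)
        values = [solve_deterministic(sample_scenario(rng))[0] for _ in range(n_samples)]
        v_anticipative = float(np.mean(values))   # estimate of V^A
        v_star, _ = solve_multistage()
        return v_star - v_anticipative            # EVPI >= 0 for minimization problems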
The EVPI is usually interpreted as the price a decision maker would be ready to pay
to know the future (Raiffa and Schlaifer, 1961; Birge, 1992). The EVPI also indicates
how valuable the dependence of decision sequences is on the particular scenario they are
optimized over. Mercier and Van Hentenryck (2007) show on an example with low EVPI
how a strategy based on a particular aggregation of decisions optimized separately on
deterministic scenarios can be arbitrarily bad. Thus even if the EVPI is low, heuristics
based on the decisions of anticipative policies can perform poorly.
This does not mean that the approach cannot perform well in practice. Van Henten-
ryck and Bent (2006) have studied and refined various aggregation and regret-minimization
strategies on a series of stochastic combinatorial problems already hard to solve on a sin-
gle scenario, as well as schemes that build a bank of pre-computed reference solutions
and then adapt them online to accelerate the optimization on new scenarios. They show
that their strategies perform well on vehicle routing applications.

Remark 2.1. The progressive hedging algorithm (Rockafellar and Wets, 1991) is
a decomposition method that computes the solution to a multistage stochastic
program on a scenario tree by solving repeatedly separate subproblems on the
scenarios covered by the tree. First-stage decisions and other decisions coupled
by non-anticipativity constraints are obtained by aggregating the decisions of the
concerned scenarios, in the spirit of the heuristics based on the distribution problem
presented above. The algorithm modifies the scenario subproblems at each iteration
to make the decisions coupled by non-anticipativity constraints converge towards
a common and optimal decision.
As the iterations are carried out, first-stage decisions evolve from decisions hedged
by the aggregation strategy to decisions hedged by the multiple recourse deci-
sions computed on the scenario tree. Therefore, the progressive hedging algorithm
shows that there can be a smooth conceptual transition between the decision model
based on the distribution problem and the decision model based on the multistage
stochastic programming problem.
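To make Remark 2.1 concrete, the sketch below runs the progressive hedging update on an invented toy problem in which each scenario subproblem is a small smooth minimization; the scenario objectives, probabilities and penalty parameter ρ are assumptions chosen only for illustration.

    import numpy as np
    from scipy.optimize import minimize

    # Invented toy data: scenario subproblem s is min_x 0.5*(x - target_s)^2.
    targets = np.array([1.0, 2.0, 4.0])
    probs = np.array([0.3, 0.3, 0.4])
    rho = 1.0                                  # progressive hedging penalty parameter

    def scenario_objective(x, s):
        return 0.5 * (x[0] - targets[s]) ** 2

    x_s = np.zeros((len(targets), 1))          # per-scenario copies of the first-stage decision
    w_s = np.zeros((len(targets), 1))          # multipliers enforcing non-anticipativity

    for _ in range(50):
        x_bar = probs @ x_s                    # aggregated (implementable) decision
        for s in range(len(targets)):
            # Augmented scenario subproblem, solved separately for each scenario.
            obj = lambda x, s=s: (scenario_objective(x, s)
                                  + w_s[s] @ x
                                  + 0.5 * rho * np.sum((x - x_bar) ** 2))
            x_s[s] = minimize(obj, x_s[s]).x
        x_bar = probs @ x_s
        w_s += rho * (x_s - x_bar)             # multiplier update

    print("consensus first-stage decision:", probs @ x_s)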

Example 2.1. We illustrate the computation of the VSS and the EVPI on an artifi-
cial multistage problem, with numerical parameters chosen in such a way that the
full multistage model is valuable. By valuable we mean that the presented simpli-
fied decision-making schemes will output first-stage decisions that are suboptimal.
If those decisions were implemented, and subsequently the best possible recourse
decisions were applied, the value of the objective over the full horizon would be
significantly suboptimal.
Let w1 , w2 , w3 be mutually independent random variables uniformly distributed on
{+1, −1}. Let ξ = (ξ1 , ξ2 , ξ3 ) be a random walk such that ξ1 = w1 , ξ2 = w1 + w2 ,
ξ3 = w1 + w2 + w3 . Let the 8 equiprobable outcomes of ξ form a scenario tree
and induce non-anticipativity constraints (the tree is a binary tree of depth 3).
Consider the decision process u = (u1 , u2 , u3 ) with u2 ∈ R and ut = (ut1 , ut2 ) ∈ R2
for t = 1, 3. Then consider the multistage stochastic program

maximize   (1/8) Σ_{k=1}^{8} { [ 0.8 u_{11}^k − 0.4 (u_2^k /2 + u_{31}^k − ξ_3^k)^2 ] + u_{32}^k ξ_3^k + [ 1 − u_{11}^k − u_{12}^k ] }
subject to
u_{11}^k + u_{12}^k ≤ 1      ∀k
−u_{11}^k ≤ u_2^k ≤ u_{11}^k      ∀k
−u_{1j}^k ≤ u_{3j}^k ≤ u_{1j}^k      ∀k and j = 1, 2
C1: u_1^k = u_1^1      ∀k
C2: u_2^k = u_2^{k+1} = u_2^{k+2} = u_2^{k+3}      for k = 1, 5
C3: u_3^k = u_3^{k+1}      for k = 1, 3, 5, 7 .

The non-anticipativity constraints C1, C2, C3, which are convenient to state the
problem, indicate in practice the redundant optimization variables that can be
eliminated.

• The optimal value of the multistage stochastic program is V ∗ = 0.35 with


optimal first-stage decision u∗1 = (1, 0).
• The expected value problem for the mean scenario ζ = (0, 0, 0) is obtained by
setting momentarily ξ k = ζ for all k and adding the constraints

C2’: uk2 = u12 for all k ,


C3’: uk3 = u13 for all k .

Its optimal value is 1 with first-stage decision uζ1 = (0, 0). When equality
constraints are made implicit the problem can be formulated using 5 scalar
optimization variables only.
• The two-stage relaxation is obtained by relaxing the constraints C2, C3. Its optimal value is 0.6361 with uk1 = uII1 := (0.6111, 0.3889).

• The distribution problem is obtained by relaxing the constraints C1, C2, C3.
Its optimal value is V A = 0.6444. The two extreme scenarios ξ 1 = (1, 2, 3)
and ξ 8 = (−1, −2, −3) have first-stage decisions u11 = u81 = (0.7778, 0.2222)
and value -0.0556. The 6 other scenarios have uk1 = (0.5556, 0.3578) and value
0.8778, k = 2, . . . , 7. Note that in general, (i) scenarios with the same optimal
first-stage decision and values may still have different recourse decisions, and
(ii) the first-stage decisions can be distinct for all scenarios.
• The EVPI is equal to V A − V ∗ = 0.2944.
• Solving the multistage stochastic program with the additional constraint

C1ζ : uk1 = uζ1 ∀k

yields an upper bound on the optimal value of any scheme using the first-stage
decision of the expected value problem. This value is V ζ = −0.2000.
• The VSS is equal to V ∗ − V ζ = 0.55.
• Solving the multistage stochastic program with the additional constraint

C1II : uk1 = uII1 ∀k

yields an upper bound on the optimal value of any scheme using the first-stage
decision of the two-stage relaxation model. This value is V II = 0.2431. Thus,
the value of the multistage model over a two-stage model, in our sense (distinct
from the VMS of Huang and Ahmed (2009)), is at least V ∗ − V II =0.1069.

To summarize, observe the collapse of the optimal value from V ∗ = 0.35 to V II =


0.2431 (with the first-stage decision of the two-stage model) and then to V ζ = −0.2
(with the first-stage decision of the expected value model).
We can also consider the anticipative decision sequences of the distribution problem,
and check if there exist plausible strategies that could exploit the set of first-stage
decisions to output a good first-stage decision (with respect to any decision-making
scheme for the subsequent stages).

• Selection strategy: Solving the multistage stochastic program with a con-


straint that enforces one of the first-stage decisions extracted from the dis-
tribution problem yields the following results: optimal value 0.3056 if uk1 = (0.7778, 0.2222), optimal value 0.2167 if uk1 = (0.5556, 0.3578). But one has
to concede that in contrast to other simplified models, for which we solve
multistage programs only to measure the quality of a suboptimal first-stage
decision, the selection strategy needs good estimates of the different optimal
values to actually output the best decision.
• Consensus strategy: The outcome of a majority vote out of the set of the 8
first-stage decisions would be the decision (0.5556, 0.3578) associated to the
scenarios 2 to 7. With value 0.2167, this turns out to be the worse of the two candidate decisions (0.7778, 0.2222) and (0.5556, 0.3578).
• Averaging strategy: The mean first-stage decision of the set of 8 first-stage
decisions is ū1 = (0.6111, 0.3239). Solving the multistage program with uk1 =
ū1 for all k yields the optimal value 0.2431.

The best result is the value 0.3056 obtained by the selection strategy. Note that
we are here in a situation where the multistage program and its variants could be
solved exactly, that is, with a scenario tree representing the possible outcomes of
the random process exactly.

2.4 Practical Scenario-Tree Approaches

We now focus on a practical question essential to the deployment of a multistage stochas-


tic programming model: if a problem has to be approximately represented by a scenario
tree in order to compute a decision strategy, how should a tractable and at the same
time representative scenario-tree approximation be selected for a given problem?
After some background on discretization methods for two-stage stochastic program-
ming, we pose the scenario tree building problem in an abstract way and then discuss
the antagonistic requirements that make its solution very challenging. Then we review the
main families of methods proposed in the literature to build tractable scenario-tree ap-
proximations for a given problem, and highlight their main properties from a theoretical
point of view.
Given the difficulty of determining a priori good scenario-tree approximations for
many problems of practical interest (a difficulty which is to some extent surprising, given
the practical success of related approximation methods for two-stage stochastic program-
ming), there is a growing consensus on the necessity of being able to test a posteriori
the quality of scenario-tree based approximations on a large independent sample of new
scenarios. We present in this light a standard strategy based on the so-called shrinking-
horizon approach — the term is used, for instance, in Balasubramanian and Grossmann
(2003).

2.4.1 Approximation Methods in Two-stage Stochastic Programming

Let P denote a two-stage stochastic program, where the uncertainty is modeled by a


random vector ξ, possibly of high-dimension, following a certain distribution with either a
discrete support of large cardinality, or a continuous support. Let P′ be an approximation

to P, where ξ is approximated by a random vector ξ′ that follows a distribution with a finite discrete support, the cardinality of the support being limited by the fact that to each possible realization of ξ′ is associated a number of optimization variables for representing the corresponding recourse decisions. To obtain a good approximation, one would ideally target the problem of finding a finite discrete distribution for ξ′ (values for the support and associated probability masses) such that any first-stage decision u′1 optimal for P′ yields on P a minimal regret, in the sense that the value on P of the solution made of u′1 and of optimal recourse decisions is close to the exact optimal value of P. By analogy to the VSS, we could also say that the distribution for ξ′ should minimize the value of the exact program P with respect to the approximate program P′.
Many authors have found it more convenient to restrict attention to the problem of finding a finite discrete distribution for ξ′ such that the optimal value of P′ is close to the optimal value of P, and the solutions u′1 optimal for P′ are close to solutions
optimal for P. For this approach to work, one might want to impose some weak form of
continuity of the objective of P with respect to solutions. One may also want to ensure
that small perturbations of the probability measure for ξ have a bounded effect on the
perturbation of optimal solutions u1 .
An interesting deterministic approach (Rachev and Römisch, 2002) consists in ana-
lyzing the structure of optimal policies for a given problem class, the structure of the
objective when the optimal policy is implicitly taken into account, and inferring from it
a relevant measure of distance between probability distributions. The distance is based
on the worst-case difference among objectives taken from a class of functions with the
identified structure (or from a larger class of functions if this is technically convenient
for the computation of the distance measure). Finding a good approximation to a two-
stage stochastic program is then reformulated as the problem of finding a discrete dis-
tribution minimizing the relevant probability distance to the original distribution. Note
that probability distance minimization problems can be difficult to solve, especially with
high-dimensional distributions. Thus, the approach, which reformulates the approxi-
mation problem as the optimal quantization of the initial probability distribution, can
have essentially two sources of suboptimality with respect to the ideal approximation
problem: (i) in order to get a tractable computation of the distance, the class of func-
tions over which worst-case distances are evaluated has often to be enlarged; (ii) even
on the enlarged class, the probability distance minimization problem is often difficult to
solve (for instance NP-hard), so that the minimal distance is not necessarily attained,
especially when heuristics for finding solutions are used. Despite these limitations, the
approach has been shown to work well. Moreover, the reduction of the initial approxi-
mation problem to an optimal quantization problem indicates the relevance of existing
work on vector quantization and probability density estimation (MacKay, 2003, Chapter
20), and on discretization methods explored in approximate dynamic programming.
Randomized approaches are based on Monte Carlo sampling (Metropolis and Ulam,
1949) and its many extensions, including variance reduction techniques, and quasi Monte
Carlo techniques (MacKay et al., 1979). All these techniques have more or less been tried
for solving two-stage stochastic programs: Infanger (1992), for instance, investigates im-
portance sampling. They have been shown to work well in practice. Random approxi-
mations based on Monte Carlo have been shown to be consistent, in the sense that with

an infinite number of samples, optimal solutions to discretized programs can converge to


solutions optimal for P. More detailed results can be found in Shapiro (2003b).

2.4.2 Challenges in the Generation of Scenario Trees

In two-stage stochastic programming, the large or infinite set of recourse decisions of the
original program is reduced to a finite set of recourse decisions for the approximation.
Hence the exact and approximate solutions lie in different spaces and cannot be com-
pared directly. Still, recourse decisions can be treated implicitly, as if they were already
incorporated to the objective function, and as if the only remaining element to optimize
were the first-stage decision.
In multistage stochastic programming, we face the same issue: one cannot directly
compare finite-dimensional solutions obtained from finite scenario-tree approximations to
exact optimal solutions lying in a space of functions. But now, using the same technique
of treating all recourse decisions implicitly leads to a dilution of the structural properties
of the objective function. As these structural properties are weaker, the class of objective
functions to consider becomes very general. Worst-case distances between functions in
such classes may cease to guide satisfactorily a discretization procedure. In addition, as
the random process runs over several stages, the discretization problems are posed over
typically larger spaces, making them more difficult to solve, even approximately.
For these reasons, rather than presenting the generation of scenario trees as a natu-
ral extension of discretization methods for two-stage stochastic programming, with the
incorporation of branchings for representing the nested conditional probability densities,
we state the problem in a more open way, which also highlights complexity aspects:

Construct a tractable algorithm A that,

• given a multistage stochastic program P : minπ E{f (ξ, π(ξ))} defined


over a probability space (Ω, B, P) with objective f (including by conven-
tion the constraints) and non-anticipative policies π(ξ),
• will produce an approximate finite-dimensional surrogate program of the form P′ : minu Σ_{k=1}^{n} pk {g(ξ k , uk )} defined over some reduced space (Ω′ , B′ , P′ ) and objective g, and from which a surrogate policy π̂(ξ) sub-
ject to non-anticipativity constraints may be computed in a tractable way,
• with the goal of making the regret

R = E{f (ξ, π̂(ξ))} − minπ E{f (ξ, π(ξ))} ≥ 0

as small as possible.

Notice that we allow, for the sake of generality, that the surrogate program may refer
to a function g different from the original objective f , and that we impose that the
algorithm A, the solving strategy associated to the problem P′ , as well as the evaluation
of the induced policy π̂ on any new scenario, are all tractable. At this stage, we do not
specify how π̂ is inferred or understood; π̂ needs to be introduced here only to be able
to write a valid expression for the regret on the original multistage program.

Depending on the situation, the problem P (random process model and function f ) can be described analytically, or be only accessible through sampling and/or simulation. The problem P′ will be described by a scenario tree and the choice of the function g, under
limitations intrinsically due to the tractability of the optimization of the approximate
program.
As we have seen, there are many derived decision-making schemes and usages of
the multistage stochastic programming framework. Also, various classes of optimization
programs can be distinguished — with the main distinctions being between two-stage and
multistage settings, and among linear, convex, and integer/mixed-integer formulations —
and thus several possible families of functions over which one might attempt to minimize
a worst-case regret.
In the stochastic programming literature, several scenario tree generation strategies
have been studied. The scenario tree generation problem is there often viewed in one or
another of two reduced ways with respect to the above definition, namely

(i) as the problem of finding a scenario tree with an associated optimal value

minu Σ_{k=1}^{n} pk {f (ξ k , uk )}

close to the exact optimal value minπ E{f (ξ, π(ξ))}, or

(ii) as the problem of finding a scenario tree with its associated optimal first-stage
decision û1 close to a first-stage decision π1 optimal for the exact program.

Indeed, version (i) is useful when the goal is merely to estimate the optimal value of the
original program P, while version (ii) is useful when the goal is to extract only the first
stage decision, assuming that later on, recourse decisions are recomputed using a similar
algorithm, given the new observations.
The generic approximation problem that we have described is more general, since it
covers also the case where the scenario tree approach may be exploited offline to extract
a complete policy π̂(ξ) that may then be used later on, in a stand-alone fashion for
decision making over arbitrary scenarios and decision steps, be it in the real world or in
the context of Monte Carlo simulations.
To give an idea of theoretical results established in the scenario tree generation lit-
erature, we now briefly discuss two representative trends of research: works that study
Monte Carlo methods for building the tree, and works that seek to minimize in a deter-
ministic fashion a certain measure of discrepancy between the original process and the
approximate process represented by the scenario tree.

Monte Carlo Scenario Tree Sampling Methods.

Monte Carlo methods have several advantages: they are easy to implement and they scale
well with the dimension, in the sense that with enough samples, one can get close to the
statistical properties of high-dimensional target distributions with high probability. The
major drawback of (pure) Monte Carlo methods is the variance of the results (in our case,
the optimal value and optimal solutions of the approximate programs) in small-sample
conditions.

Let us describe the Sample Average Approximation method (SAA) (Shapiro, 2003b),
which uses Monte Carlo for generating the scenario tree. One starts by building the
branching structure of the tree. Note that the method does not specify how to carry out
that step. Practitioners often use the same branching factor for each node relative to
a given decision stage. They also often concentrate the branchings at early stages: the
branching factor is high at the root node and then decreases with the index of the decision
stage. The next step of the method consists in sampling the node values according to the
distributions conditioned on the values of the ancestor nodes. The procedure, referred to
as conditional sampling, is implemented by sampling the realizations of random variables
at stage t before sampling those of stage t+1. Distinct realizations are assigned to distinct
nodes, which are given a conditional probability equal to the inverse of the branching
factor. The last step consists in solving the program on the so-obtained scenario tree
and thus, although part of the description of the SAA method, does not concern the
generation of the tree itself.
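A minimal sketch of conditional sampling is given below. It assumes that the exogenous process can be sampled stage by stage through a user-supplied function sample_conditional(history, rng); that function, as well as the simple dict-based tree representation, are illustrative assumptions (in particular, duplicate sampled children are not merged here).

    import numpy as np

    def build_tree(branching, sample_conditional, rng=np.random.default_rng(0)):
        """Conditional sampling of a scenario tree (Monte Carlo / SAA style).

        branching[t] is the branching factor used when sampling stage t+1;
        sample_conditional(history, rng) draws one outcome of the next stage given
        the history observed so far (a hypothetical user-supplied routine).
        Each node is a dict {'history': ..., 'prob': ..., 'children': [...]}.
        """
        root = {'history': (), 'prob': 1.0, 'children': []}
        frontier = [root]
        for n_t in branching:
            new_frontier = []
            for node in frontier:
                for _ in range(n_t):
                    outcome = sample_conditional(node['history'], rng)
                    child = {'history': node['history'] + (outcome,),
                             'prob': node['prob'] / n_t,   # uniform conditional probabilities
                             'children': []}
                    node['children'].append(child)
                    new_frontier.append(child)
            frontier = new_frontier
        return root, frontier                              # frontier = leaves = scenarios

    # Usage on the random walk of Example 2.1 (binary branching over three stages):
    tree, leaves = build_tree([2, 2, 2],
                              lambda h, rng: (h[-1] if h else 0) + int(rng.choice([-1, 1])))
    print(len(leaves), sum(leaf['prob'] for leaf in leaves))   # 8 scenarios, total mass 1.0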
Consider scenario trees obtained by conditional sampling. For simplicity assume a
uniform branching factor nt at each stage t, so that the number of scenarios is n = Π_{t=1}^{T} nt . Shapiro (2006) shows under some technical assumptions that if we want to
guarantee, with a probability at least 1 − α, that implementing the first-stage decision û1 optimized on a scenario tree of size n while implementing subsequently optimal recourse decisions conditionally to the first-stage decision will yield an objective value ε-close to
the exact optimal value, then the size n of the tree we use for that purpose has to grow
exponentially with the number of stages. The result goes against the intuition that
by asking for ε-optimality with probability 1 − α only, one could get moderate sample
complexity requirements. Now, as the exponential growth of the number of scenarios is
not sustainable, one can only hope to solve multistage models in small-sample conditions,
and obtain solutions that at least with the SAA method may vary from tree to tree
and be of uncertain value for the real problem. Perhaps surprisingly, it is not possible to
obtain valid statistical bounds for that uncertain value by imposing as first-stage decision
the tested first-stage decision and reoptimizing recourse decisions on several new random
trees (Shapiro, 2003a).

Deterministic Scenario Tree Optimization Methods.

There exist various deterministic techniques for selecting jointly the scenarios of the tree.
Note that the development of scenario tree methods involves a fair amount of numerical experimentation, and thus a risk of overestimating the domain of validity of the proposed methods, since research efforts are oriented by experiments on particular problems.
Moment-matching methods (Høyland and Wallace, 2001; Høyland et al., 2003) at-
tempt to produce discrete distributions with some statistical moments matching those
of a target distribution. Moment matching may be done at the expense of other statis-
tics, such as the number and the location of the modes, that might also be important.
Hochreiter and Pflug (2007) give an example illustrating that risk.
The theoretical analysis underlying the so-called probability metrics methods, which we briefly evoked in the context of two-stage stochastic programming, was initially believed to be easily extensible to the multistage case (Heitsch and Römisch, 2003); but it then turned out that more elaborate measures of probability distances, integrating the intertemporal aspect of observations, were needed (Heitsch and Römisch, 2009). These more elaborate metrics are more difficult to compute and to minimize, so that well-justified
discretizations of multistage programs are more difficult to obtain.
We can also mention methods that come with approximation guarantees, such as
bounds on the suboptimality of the approximation (Frauendorfer, 1996; Kuhn, 2005).
However, they are applicable only under relatively strong assumptions concerning the
problem class and the type of randomness. Quasi Monte Carlo techniques are perhaps
among the more generally applicable methods (Pennanen, 2009).
Most deterministic methods end up with the formulation of difficult optimization
problems, such as nonconvex or NP-hard problems (Høyland et al., 2003; Hochreiter and
Pflug, 2007), with computationally demanding tasks (such as multidimensional integra-
tions), especially for high-dimensional random processes.
The field is still in a state where the scope of existing methods is not well defined,
and where the algorithmic description of the methods is incomplete, especially concerning
the branching structure of the trees. The fact that the domains of applicability are unknown or overestimated makes it delicate to select a sophisticated deterministic technique for building a scenario tree on a new problem.

2.4.3 The Need for Testing Scenario-Tree Approximations

Theoretical analyses of scenario tree generation algorithms, often based on worst-case


reasoning or large-deviation theory, provide guarantees on the quality of approximate
solutions that are usually too loose in practice or equivalently call for intractable sce-
nario tree sizes. Hence they do not really solve the basic question of how to build a priori
small scenario trees in a generic, scalable, and computationally efficient way, potentially
jeopardizing the practical relevance of the multistage extension of stochastic program-
ming for sequential decision making under uncertainty. Now, if we are ready to give up the worst-case guarantees embedded in the scenario tree generation method, new tools
are needed for computing, a posteriori, guarantees on the value of a given numerical
approximation scheme.
If we want to assess on an independent test set of scenarios the performance of deci-
sions optimized on a scenario tree, a difficulty arises: first-stage decisions can be tested
but subsequent recourse decisions are only defined for the scenarios covered by the sce-
nario tree. Therefore, it is necessary to extend the approach so as to allow one to test
solutions on new scenarios, at a computational cost low enough to allow the validation
on a sufficiently large number of test scenarios.
We have to stress that this extension is not really necessary for two-stage stochastic
programming. First, approximations of two-stage models yield constant first-stage deci-
sions, that are implementable on any scenario, while recourse decisions on new scenarios
can then often be found analytically, or by running a myopic one-stage optimization pro-
cedure for each new scenario, or by implementing a known recourse procedure that the
initial two-stage model was only approximating for optimizing the first-stage decisions
— a strategy found efficient in capacity planning (Sen et al., 1994). Thus, testing is gen-
erally straightforward for two-stage models. Second, finite-dimensional approximations
of two-stage stochastic programming models do not use scenario trees. They only use a
finite set of outcomes. Theoretical results show that in the two-stage situation, statistical

confidence bounds on the quality of an approximate solution can be computed (Norkin


et al., 1998b; Mak et al., 1999). These results break down in the multistage case, giving
its true interest to guarantees based on testing (Shapiro, 2003a).

2.4.4 The Inference of Shrinking-Horizon Decision Policies

Several authors have proposed to use a generic scheme similar to Model Predictive Control
to assess the performance associated with a particular algorithm A for building the scenario
tree (Kouwenberg, 2001; Chiralaksanakul, 2003; Hilli and Pennanen, 2008). The scheme
can be sketched as follows; a code sketch of the resulting procedure is given after the list.

i. Generate a scenario tree using algorithm A. Solve the resulting program and extract
from its solution the value of the first-stage decision u1 , say ū1 .

ii. Generate a test sample of n′′ mutually independent scenarios by drawing i.i.d.
realizations ξ j of the random process ξ.

iii. For each scenario ξ j of the test sample, set uj1 = ū1 , and obtain sequentially
the recourse decisions uj2 , . . . , ujT , as follows: each decision ujt is obtained as a
first-stage decision computed by taking as an initial condition the past decisions
uj1 , . . . , ujt−1 and the history ξ1j , . . . , ξt−1j of the test scenario ξ j , by conditioning the
joint distribution of ξt , . . . , ξT on the history, by using the algorithm A to build
a new scenario tree that approximates the random process ξt , . . . , ξT , by solving
the program formulated on this tree over the optimization variables relative to the
decisions ut , . . . , uT , and by discarding all but the decision ut , the optimal value of
which is then assigned to ujt .

iv. Estimate the overall performance of the scheme by Monte Carlo simulation. This
consists in evaluating on the test sample the empirical average
VTS (A) = (1/n′′) Σ_{j=1}^{n′′} f (ξ j , uj ) ,

where we have denoted by uj = (uj1 , . . . , ujT ) the sequence of decisions associated


to the scenario ξ j , and where TS recalls that the estimate is computed on the test
sample.
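The test-sample procedure above can be written as the following loop. The routines build_tree_with_A and solve_on_tree are hypothetical wrappers around the tree generation algorithm A and the optimizer (solve_on_tree is assumed to return the sequence of decisions optimized on the tree, of which only the first one is kept); they are passed as arguments because their implementation is problem-specific.

    import numpy as np

    def shrinking_horizon_estimate(A, f, T, test_scenarios,
                                   build_tree_with_A, solve_on_tree):
        """Sketch of the shrinking-horizon performance estimate V_TS(A).

        build_tree_with_A(A, history, decisions, stage) rebuilds a scenario tree for
        stages `stage`..T conditionally on the observed history and past decisions;
        solve_on_tree(tree) returns the optimized decision sequence on that tree.
        Both are hypothetical, problem-specific callables.
        """
        # First-stage decision, shared by all test scenarios.
        u1 = solve_on_tree(build_tree_with_A(A, history=(), decisions=(), stage=1))[0]
        values = []
        for xi in test_scenarios:                          # i.i.d. test scenarios
            decisions = [u1]
            for t in range(2, T + 1):
                tree = build_tree_with_A(A, history=tuple(xi[:t - 1]),
                                         decisions=tuple(decisions), stage=t)
                decisions.append(solve_on_tree(tree)[0])   # keep only the stage-t decision
            values.append(f(xi, decisions))                # cost of the full decision sequence
        return float(np.mean(values))                      # empirical average V_TS(A)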

The Monte Carlo estimate VTS (A) can provide an unbiased estimation of the value
of the scenario tree building algorithm A in the context of the other approximations
involved in the numerical computations of the sequences of decisions, such as for instance
simplifications of the objective function, or early stopping at low-accuracy solutions.
The estimator VTS (A) may have a high variance, but we can expect a high positive
correlation between this estimator and an estimator VTS (A′) using the same test sample but relative to another tree generation algorithm A′. This would allow a reliable comparison of the relative performance of the two algorithms A, A′ on the problem instance
at hand.
The validation is generic in the sense that it can be applied to any algorithm A, but
also in the sense that it addresses the general scenario tree building problem in the larger
context of the decision making scheme actually implemented in practice.

2.4.5 Alternatives to the Scenario Tree Approach

We point out that alternative numerical methods for solving infinite-dimensional two-
stage stochastic programs exist, based on an incorporation of the discretization procedure
to the optimization, for instance by updating the discretization or carrying out impor-
tance sampling within the iterations of a given optimization algorithm (Slyke and Wets,
1969; Higle and Sen, 1991; Norkin et al., 1998a), or by using stochastic subgradient meth-
ods (Nemirovski et al., 2009). Also, heuristics for finding good policies directly on the
infinite-dimensional multistage problem have been suggested: a possible idea, akin to di-
rect policy search procedures in Markov Decision Processes, is to optimize a combination
of feasible non-anticipative basis policies πj (ξ) specified beforehand (Koivu and Pennanen, 2010). These methods are nevertheless less general than the standard scenario tree approach, because they seem to be restricted to applications with rather simple feasibility sets.

2.5 Conclusions

Elaborating strategies involving complex quantitative decisions calls for optimization


techniques that can avoid a systematic enumeration and evaluation of possible options
at each decision stage. Multistage stochastic programming has been recognized by several
industries (Dempster et al., 2008; Kallrath et al., 2009) as a promising framework to for-
mulate complex problems under uncertainty, exploit domain knowledge, use risk-averse
objectives, incorporate probabilistic and dynamical aspects, and preserve structures that
allow the application of large-scale optimization techniques. These techniques for sequential de-
cision making under uncertainty are particularly interesting to study: Puterman (1994),
citing Arrow (1958) on the early roots of sequential decision processes, recalls the role of
the multi-period inventory models from the industry in the development of the theory of
Markov Decision Processes (Bellman, 1954; Howard, 1960). We could also mention the
role of applications in finance as a motivation for the early theory of multi-armed ban-
dits and for the theory of sequential prediction (Cesa-Bianchi and Lugosi, 2006), now an
important field of research in machine learning (Auer et al., 2002; Coquelin and Munos,
2007).
In the preceding sections, where the problem of inferring a good scenario-tree approx-
imation for a given multistage stochastic programming problem was presented, we have
seen that the state of the art does not currently provide strong enough methods with
broad enough practical coverage and good enough theoretical guarantees in terms of the
quality of the approximate solutions derived in this way.
Researchers in the field were thus led to suggest the use of the shrinking-horizon
recursive procedure for exploiting the scenario tree based approach in practice. However,
evaluating the resulting performance estimator on an independent sample of scenarios is
extremely demanding, as it requires, for each test scenario and at each stage of recourse
decisions, the automatic construction of a new scenario tree and the optimization of
the resulting program on the tree. Doing this is still beyond the possibility of available
computational approaches when considering the solution of large-scale problems.
For these reasons, there is currently no scalable off-the-shelf method for generating
and testing scenario-tree based approximations of multistage stochastic programs, and

the framework of stochastic programming based on scenario trees has in this way, in spite
of its theoretical appeal, lost its practical attractiveness in recent years in many
environments dealing with large-scale systems (Powell and Topaloglu, 2003; Van Henten-
ryck and Bent, 2006).
Chapter 3

Solution Averaging in Structured Spaces

In this chapter, we consider an extension to the multistage stochastic programming frame-


work, based on the simultaneous use of several scenario trees for making decisions.
This work is motivated by the excellent performance of the perturb-and-combine (P&C) estimation methods from machine learning (Breiman, 1998).
In fact, stochastic programs are usually formulated in a way very similar to maximum
likelihood estimation problems (Dupacova and Wets, 1988). Yet, since Fisher’s contri-
butions to maximum likelihood estimation, considerable methodological advances have
been made on the problem of estimating a parameter of interest with a finite amount of
data, through a mix of optimization, resampling and averaging techniques.
The chapter is organized as follows. Section 3.1 provides a unified view of a series of estimation methods used in machine learning that have ultimately led to ensemble methods.
Section 3.2 proposes an approach that seeks to better estimate the first-stage decision
of a multistage program by aggregating several first-stage decisions optimal with re-
spect to different scenario-tree approximations. Section 3.3 illustrates and evaluates the
proposed approach on a test problem having an interesting discrete decision space. Sec-
tion 3.4 concludes the chapter with some ideas concerning the regularization of stochastic
programming approximations.

3.1 The Perturb-and-Combine Paradigm

Let X be a random vector following some fixed but unknown density PX , referred to as
the data-generating density.
Let D = {x1 , . . . , xn } denote a set of realizations of X drawn from PX in some way.
Call D the data set. For brevity we write xn for x1 , . . . , xn . A data set of n samples
is a random quantity. Its density is written PD . There is a wide spectrum of methods
from statistics and machine learning that can be used to explain the data, and predict
(forecast) future samples. We discuss those methods in the context of the inference of
a predictive density px|D for a new sample x, given the data. One could also condition
def
the density of x = (y, z) on some of its components y and interpret the resulting density
pz|y,D as the predictive distribution of output variables z given input variables y and
data set D.
The latter use of densities is not discussed in this section. We simply recall that the
summary of a density through a single value is addressed by decision theory, and is usually

done through the choice of a loss function (Robert, 2007, Chapter 2). The quality of the
inference can also be quantified through a measure of divergence between the predicted
and true densities (Ali and Silvey, 1966; Csiszár, 1967). A particular divergence that has
been found useful (Clarke and Barron, 1990) is the Kullback-Leibler divergence between
two densities g, h, defined by
KL(g||h) = ∫ g(x) log [ g(x)/h(x) ] dx .     (3.1)
The KL divergence is also referred to as the cross-entropy distance (Rubinstein and
Kroese, 2004).
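For two discrete densities on a common finite support, the integral in (3.1) reduces to a sum; the small check below is only meant to make the definition concrete.

    import numpy as np

    def kl_divergence(g, h):
        """Kullback-Leibler divergence KL(g || h) between two discrete densities.

        g and h are probability vectors over the same finite support; the usual
        convention 0 * log(0 / h) = 0 is applied.
        """
        g, h = np.asarray(g, dtype=float), np.asarray(h, dtype=float)
        mask = g > 0
        return float(np.sum(g[mask] * np.log(g[mask] / h[mask])))

    g, h = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
    print(kl_divergence(g, h), kl_divergence(h, g))   # asymmetric in general, both >= 0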

3.1.1 Maximum Likelihood Estimation

In the simplest frequentist approach to explaining data, one assumes that the samples
are drawn independently, and that the data-generating density belongs to a family of
densities parametrized by θ ∈ Rd . The density at x with parameter θ is written p(x; θ).
As the joint density of independent random variables is the product of the marginal
densities, we can write the joint density of the samples as
pD|θ (xn ; θ) = Π_{k=1}^{n} p(xk ; θ) .

The parameter θ can be inferred (estimated) from the finite data set by maximizing the
log-likelihood of the data (Fisher, 1925):
θ̂ ∈ argmaxθ Σ_{k=1}^{n} log p(xk ; θ) ,     (3.2)

where argmax f denotes the set of maximizers of f (often a singleton). The predictive
density, defined as the density of a new sample xn+1 , conditionally to the data set, is
then given by

px|D (xn+1 ; xn ) = p(xn+1 ; θ̂) ,

where θ̂ is in this context referred to as a plug-in estimate.
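As a toy numerical illustration of (3.2) and of the plug-in predictive density, the sketch below fits a univariate Gaussian by maximizing the log-likelihood numerically; for this family the maximum likelihood estimates also have the closed form given by the sample mean and the (1/n) sample standard deviation, which is printed as a check. The data are simulated for the example.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=1.5, size=200)        # simulated x_1, ..., x_n

    def negative_log_likelihood(theta):
        mu, log_sigma = theta                              # sigma > 0 parametrized via its log
        return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

    theta_hat = minimize(negative_log_likelihood, x0=np.array([0.0, 0.0])).x
    mu_hat, sigma_hat = theta_hat[0], np.exp(theta_hat[1])

    print(mu_hat, sigma_hat)                               # numerical maximum likelihood estimates
    print(data.mean(), data.std())                         # closed-form check (np.std uses 1/n)

    # Plug-in predictive density evaluated at a new point x_{n+1}:
    print(norm.pdf(1.0, loc=mu_hat, scale=sigma_hat))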

Remark 3.1. If log p(x; θ) is a continuously differentiable concave function with


respect to θ ∈ Θ ⊂ Rd , the maximum in (3.2) can be obtained from the first-order
optimality condition
Σ_{k=1}^{n} ∇θ log p(xk ; θ) = Σ_{k=1}^{n} ∇θ p(xk ; θ) / p(xk ; θ) = 0 ,     (3.3)

provided that the resulting estimate θ̂ is in the interior of Θ. The left-hand side of
(3.3) is called the score function. The random vector
δ(θ) = Σ_{k=1}^{n} ∇θ p(Xk ; θ) / p(Xk ; θ) ,   with Xk drawn according to p(·; θ),

is called the efficient score. The covariance matrix of the efficient score is called
the Fisher information matrix, written I(θ; n) ∈ Rd×d — the argument n stresses
that we define the Fisher information matrix for n observations. We write I(θ)
for I(θ; 1). Under suitable conditions on p(X; θ) allowing the interchange of the
expectation and differentiation operators,

Iij (θ; n) = −n E{ ∂2 log p(X; θ) / ∂θi ∂θj } ,

which shows that the Fisher information is related to the curvature of p evaluated
at θ.
Let φ(D) ∈ Rd , with D made of n observations, denote an unbiased estimator of θ,
that is, an Rd -valued mapping such that E{φ(D)} = θ. Let us also assume that the
true density PX is p(·; θ). Then under some regularity conditions, the covariance
matrix of φ(D), written Σφ , satisfies the Cramér-Rao inequality:

Σφ ⪰ I −1 (θ; n) ,

the inequality referring to the cone of positive semi-definite matrices. If the maxi-
mum likelihood estimate θ̂ in (3.2) is unique and “far enough” from the boundary
of Θ, then for n “large enough”, θ̂ is “approximately” normally distributed with
mean θ and covariance I −1 (θ; n) = n−1 I −1 (θ). This result is usually expressed by
saying that

√n (θ̂ − θ) converges in distribution to N (0, I −1 (θ)) .

As θ̂ has asymptotically the best possible covariance for unbiased estimators, the
maximum likelihood estimator is said to be an efficient estimator. Note, however,
that the covariance matrix relative to a biased estimator could be smaller.
For asymptotic results in situations where p(x; θ) is not twice differentiable in
a neighborhood of θ, see Dupacova and Wets (1988); for asymptotic results in
situations where θ is on the boundary of Θ, see Shapiro (2000).
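To illustrate the asymptotic statement numerically, the sketch below uses the exponential family p(x; θ) = θ exp(−θx), for which the maximum likelihood estimate is θ̂ = 1/x̄ and the Fisher information is I(θ) = 1/θ2; the empirical variance of θ̂ over repeated simulated data sets is compared with the Cramér-Rao value I −1 (θ)/n. The numerical choices (θ, n, number of repetitions) are arbitrary and only serve the illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    theta_true, n, n_rep = 2.0, 500, 2000

    # For p(x; theta) = theta * exp(-theta * x): theta_hat = 1 / mean(x), I(theta) = 1 / theta^2.
    estimates = np.array([1.0 / rng.exponential(scale=1.0 / theta_true, size=n).mean()
                          for _ in range(n_rep)])

    print("empirical variance of theta_hat:", estimates.var())
    print("Cramer-Rao value I(theta)^-1 / n:", theta_true ** 2 / n)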

The function ℓ(x; θ) = − log p(x; θ) is a loss function, called the negative log-likelihood
loss function. If the samples are truly drawn independently, the maximization of the log-
likelihood of the data in (3.2) is a surrogate program for the minimization of E{`(X; θ)},
where the expectation is taken with respect to the true data-generating density P X .
Hence, as the surrogate problem is ill-posed, it may be preferable to penalize the objective
when the number n of samples is small, for example (Tikhonov and Arsenin, 1977) by
adding a regularization term − 21 λ||θ||2 with λ > 0 to the log-likelihood:
n
X
log p(xk ; θ) − 21 n−1 λ||θ||2

θ̂ ∈ argmaxθ . (3.4)
k=1
When the objective can be written as a sum Σ_{k=1}^{n} ρ(xk ; θ) for some function ρ, the corresponding estimates θ̂ are sometimes referred to as M-estimates (Maximum Likelihood
Type Estimates) (Huber, 1964).
When the true density of the data set PD cannot be identified to pD|θ for some θ,
be it because the data-generating density does not belong to the parametric family of

densities, or because the samples are not drawn independently, the probability model is
said to be misspecified. This is the most common situation encountered in practice. Using
maximum likelihood type estimators with misspecified models does not necessarily lead
to inconsistent estimates (estimates with non-vanishing bias as the number of samples
grows to infinity): what can really harm consistency is rather to omit some of the relevant
variables for explaining the data, or to assume wrong constraints between the components
of x (White, 1982).

3.1.2 Bayesian Averaging

In the simplest Bayesian approach to explaining data, one assumes that the samples are
drawn independently, that the data-generating density belongs to a family of densities
parametrized by θ ∈ Rd , and in addition that the parameter θ has been drawn from a
fixed density pθ , called the prior. The conditional density of θ given the data, written
pθ|D , is called the posterior. It is computed according to the Bayes formula for conditional
distributions
pθ|D (θ; xn ) = pD|θ (xn ; θ) pθ (θ) / ∫Θ pD|θ (xn ; θ) pθ (θ) dθ     (3.5)

where Θ denotes the domain of θ, and dθ denotes the Lebesgue measure if θ is continuous, or the counting measure if θ is countable. Note that the normalization of pθ|D by the integral makes it possible to (formally) use improper priors (Jeffrey, 1939), that is, "generalized" densities pθ such that ∫Θ pθ (θ)dθ = +∞. In particular, using a uniform
prior pθ (θ) = 1 amounts to identifying pθ|D with the likelihood pD|θ . On the other hand, for
certain families of distributions p(x; θ), there exists a special choice for the prior, referred
to as the conjugate prior, such that the prior and the posterior belong to the same family
(Raiffa and Schlaifer, 1961). This is convenient for evaluating pθ|D in closed-form, but
reduces the prior to a mere device for making tractable predictions.
Note that the frequentist approach also makes prior assumptions, for instance through
the choice of λ in (3.4), which is formally set to 0 in the maximum likelihood estimate
(3.2). The family p(x; θ) and the type of regularization are also often chosen to facilitate
the evaluation of the M -estimate.
If (3.5) cannot be evaluated in closed-form, the simplest approximation is Maximum A
Posteriori (MAP) estimation, which consists in approximating pθ|D by a distribution with
all the probability mass concentrated on the mode of pθ|D (with ties broken arbitrarily).
Maximum A Posteriori estimation with a uniform prior coincides with Maximum Likeli-
hood estimation. More advanced approximation techniques include asymptotic methods
such as Laplace’s method (MacKay, 2003, Chapter 27) which consists in replacing a
distribution by a Gaussian approximation, importance sampling, multiple quadrature,
and Markov Chain Monte Carlo methods (MCMC) (Metropolis et al., 1953; Neal, 1993,
2010), which essentially consist in approximating the integration over θ by accumulating evaluations at points θk generated by a random walk in the parameter space Θ. The relative merits of these methods are discussed in Evans and Swartz (1995) and in MacKay
(2003, Chapters 29 & 30). The methods that scale well with the dimension d are the
MCMC methods.
The Bayesian approach aims at taking into account (through the prior) the uncer-
tainty associated to the selection of a particular value θ̂ for making predictions after

having observed the data. The density of a new sample xn+1 , conditionally to the data
set, is given by a mixture of all members of the parametric family, obtained by averaging
all the members with importance weights given by pθ|D (Bayesian averaging):
        px|D(xn+1; xⁿ) = ∫Θ p(xn+1; θ) pθ|D(θ; xⁿ) dθ .      (3.6)

Here again approximations can be employed so as to carry out the integration.
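For concreteness, the predictive density (3.6) can be approximated by a simple quadrature when θ is scalar. The sketch below is only an illustration added here (a Bernoulli model with a uniform prior and an equally spaced grid on [0, 1], all of which are illustrative choices): it forms the posterior on the grid and averages the member densities with the posterior weights.

```python
import numpy as np

def posterior_predictive_success(x, grid_size=1000):
    # Bernoulli model p(x; theta) with a uniform prior on theta in [0, 1].
    theta = np.linspace(1e-6, 1.0 - 1e-6, grid_size)
    s, n = np.sum(x), len(x)
    log_post = s * np.log(theta) + (n - s) * np.log(1.0 - theta)   # uniform prior
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                       # discretized posterior on the grid
    # Mixture (3.6): average p(x_{n+1} = 1; theta) with the posterior weights.
    return float(np.sum(theta * post))

print(posterior_predictive_success(np.array([1, 1, 0, 1])))   # about (3+1)/(4+2)
```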


The frequentist approach takes the uncertainty into account through regularization.
In many cases, there is a connection between particular prior distributions and particular
regularization mechanisms, so that regularization can be reinterpreted as an implicit
prior, and vice-versa (Kimeldorf and Wahba, 1970).

Remark 3.2. Under suitable conditions (Walker, 1969), one can show that if the
true density PX is p(·; θ) for some θ in the interior of its domain Θ, then for n
“large enough”, the posterior distribution pθ|D given the data is “approximately” a
normal distribution with mean θ̂ (maximum likelihood estimator on the data) and
covariance matrix I⁻¹(θ; n) (inverse of the Fisher information matrix for n obser-
vations).

3.1.3 Mixture Models

Bayesian averaging suggests to consider a more expressive class of probability models,


formed by combining models p(x; θ) from the parametric family.
Following this idea, one assumes that the samples of the data set are drawn inde-
pendently by first drawing a density from a family of densities parametrized by θ ∈ R d ,
according to some fixed but unknown distribution pθ , and then by drawing the sample
from the selected density p(x; θ). Note that θ is a latent variable: it is not part of the
observed samples collected in the data set. The density for a new sample x n+1 will be
given by (3.6), except that now pθ|D loses its Bayesian interpretation and is viewed as a
free component of the probability model. When pθ|D has a finite support of fixed cardi-
nality (finite mixture models), pθ|D is described by a finite number of parameters (values
and probability masses) and can thus be optimized according to the maximum likelihood
principle (Hasselblad, 1966), for example through the Expectation-Maximization (EM)
iterative procedure (Dempster et al., 1977), that treats a sample x as an incomplete
observation of (θ, x).

3.1.4 Model Selection

In a more advanced approach to explaining data, one assumes that the data-generating
density PX belongs to a space of densities described by a model structure M ∈ M with
model parameters θM ∈ ΘM . The dimension of θM can vary with M . One speaks of
nested models when there exists a complete ordering M1 , M2 , . . . of the models such that
all the densities representable by Mν are also representable by Mν+1 .
Models of different complexity (flexibility) coexist in the hypothesis space. Loosely
speaking, low complexity was originally associated to a small number of parameters for

describing a model (Rissanen, 1978), or to a greater smoothness of the model (Good and
Gaskins, 1971). It is convenient to view a model through the pair (M, s), where s is a
complexity parameter associated to M . For nested models Mν , we can assume that there
exists an increasing function that maps structure indices ν to complexity parameters s.
Model selection methods aim at identifying the model M that best explains the data,
often by adapting the complexity s of the selected model to the size n of the data set.
Note that the misspecification issue is completely irrelevant here inasmuch as one seeks
to explain learnable properties of the data (Vapnik, 1998): assumptions on a hypothetical
true distribution PX are a matter of pure convenience.
In finite mixture density estimation for instance, the cardinality of the finite support
of pθ|D determines the model structure and induces a model ordering, so that compet-
ing models can be ranked according to the log-likelihood of the data penalized by a
complexity parameter s (Li and Barron, 2000).

3.1.5 Bayesian Model Averaging

Bayesian Model Averaging is the extension of Bayesian averaging to hypothesis spaces


formed of several model structures. Instead of performing model selection, one defines,
for nested models Mν , a prior pν on the structure index ν ∈ N ⊂ ℕ relative to the
model Mν . If p(x; ν, θν ) denotes the density associated to Mν with parameter θν ∈ Θν ,
the predictive distribution is given by
        px|D(xn+1; xⁿ) = Σ_{ν∈N} pν|D(ν; xⁿ) ∫_{Θν} p(xn+1; ν, θν) pθν|D(θν; xⁿ) dθν      (3.7)

with pν|D interpreted as the importance weight of the model Mν , determined by updating
the prior pν using the observed data.
For models M identified by some continuous hyper-parameter α ∈ Rq , so that x fol-
lows f (x; α, θ), it is common to define a joint prior pα,θ = pα pθ|α on (α, θ) ∈ A × Θ. The
predictive distribution is then given by
        px|D(xn+1; xⁿ) = ∫_A ( ∫_{Θ(α)} p(xn+1; α, θ) pθ|α,D(θ; α, xⁿ) dθ ) pα|D(α; xⁿ) dα .      (3.8)

Approximations include MAP-type simplifications (selection of the model that maxi-


mizes pα,θ|D ), model expansion techniques that restrict the integration to a neighborhood
of a good model (Draper, 1995), selective model averaging that restrict the integration
to a series of good models, and Markov Chain Monte Carlo techniques that perform a
random walk in the space of models (Madigan and York, 1995). There also exist connec-
tions between particular model selection criteria and particular priors on models (Zhang
et al., 2009).

3.1.6 Ensemble Methods

Ensemble methods build on the principles suggested by Bayesian averaging, Bayesian


model averaging, and MCMC or importance sampling Monte Carlo techniques. We
will present these methods under a common umbrella by saying that ensemble methods

assume a predictive distribution of the form (3.8) with a MCMC approximation already
applied to the integral, that is,
        px|D(xn+1; xⁿ) = Σ_{ν=1}^{m} p(xn+1; αν, θν) wν ,      (3.9)

where

• αν describes the structure of a model Mν ,

• θν refers to the parameters of the model Mν ,

• p(·; αν , θν ) stands for the predictive distribution according to the model Mν ,

• m is the number, possibly depending on xn , of models Mν that are generated


sequentially given the data xn , and possibly given information extracted from pre-
vious models — this information would be represented by the state of the Markov
Chain in MCMC,

• wν ≥ 0 is the weight of the model Mν in the ensemble, with weight updates


permitted during the construction of the sequence — in particular, setting an initial
sequence of weights to 0 amounts to discard models generated during a first “burn
in” period.

Each term in the sum represents the contribution of a weighted sample as if it were drawn
from the joint density pα,θ|D = pθ|α,D pα|D in (3.8), duplicate samples being permitted.

Bagging.

In bagging methods (bootstrap aggregating methods) (Breiman, 1996), the sequence


M1 , . . . , Mm is built by sampling models Mν as follows: the parameters of a model Mν
are particularized (plug-in estimate) to a random resampling Dν of the elements in the
data set D with replacement (bootstrap) (Efron and Tibshirani, 1993), each element of D
having the same probability to be drawn, repetitions permitted. The number of draws
is usually set to a fixed ratio α of the cardinality of D.
The idea of bagging has originally been proposed in the context of prediction, but
has then also been applied to density estimation. In the context of prediction, bagging
has been shown to reduce the variance of estimators that are unstable with respect to
perturbations of the data xn (assuming that the number n of samples is held fixed), with
a beneficial effect on the bias/variance tradeoff provided that the resampling ratio α is
large enough (Buja and Stuetzle, 2006) (α = 1 in Breiman’s original algorithm).
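The resampling step of bagging can be sketched as follows in Python; the estimator `fit`, the resampling ratio, and the aggregation by plain averaging are placeholders chosen for this illustration only, since the aggregation rule used later in this chapter has to be adapted to discrete decision spaces.

```python
import numpy as np

def bagged_estimate(data, fit, m=50, ratio=1.0, rng=np.random.default_rng(0)):
    # Draw m bootstrap resamples (with replacement) of size ratio * |data|,
    # fit a plug-in estimate on each resample, and average the estimates.
    n = len(data)
    size = max(1, int(ratio * n))
    estimates = [fit(data[rng.integers(0, n, size=size)]) for _ in range(m)]
    return np.mean(estimates, axis=0)

# Example with a statistic that is unstable on small samples (the median).
data = np.array([0.1, 0.4, 2.3, 2.5, 9.0])
print(bagged_estimate(data, np.median))
```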

Boosting.

In boosting methods (Schapire, 1990; Freund and Schapire, 1996; Schapire et al., 2002),
the sequence M1 , . . . , Mm is built by sampling models Mν as follows: the parameters of a
model Mν are particularized to a random resampling Dν of the elements in a data set D,
each element of D having a certain probability to be drawn, determined by assigning to
each element k of the data set D an importance weight that is relatively greater if the

element k is not well explained (or predicted) by the previous models. The m models are
then aggregated by weighted averaging (Littlestone and Warmuth, 1989), the weights
reflecting the respective quality of each model at explaining the data. The weighted
aggregation scheme depends on generalization bounds proper to the loss function chosen
for scoring the models.
Boosting has been shown to induce predictive models with excellent generalization
capabilities starting from a family of models Mk having their prediction slightly bet-
ter than random predictions once their parameters θk are adapted to the data (weak
models). The reasons for the empirical success of boosting may still not be fully eluci-
dated (Mease and Wyner, 2008). The aggregation schemes used for the online prediction
of (bounded) sequences X1 , X2 , . . . without assuming a probabilistic model PX (Cesa-
Bianchi and Lugosi, 1999), as advocated by Dawid (1984), are similar to the aggregation
schemes used in boosting (Cesa-Bianchi et al., 1997), and have been analyzed in terms of
their generalization ability in the context of online prediction (Cesa-Bianchi et al., 2004).

Other Ensemble Methods.

While bagging and boosting exploit perturbations of the data set based on the temporary
presence or absence of particular samples, other ensemble methods use perturbations of
the data set based on the temporary presence or absence of certain components of x
(features) (Dietterich, 2000; Breiman, 2001; Geurts et al., 2006). Research is still very
active in machine learning for finding beneficial ways to perturb data sets by further ran-
domizing the features, be it in the context of ensemble methods stricto sensu (Breiman,
2000), or in the context of kernel methods (Rahimi and Recht, 2008, 2009; Shi et al.,
2009).

3.2 Adaptation to Stochastic Programming

Most stochastic programs of practical interest use unbounded objective functions. This
is in strong contrast with the usual assumptions made in machine learning and online se-
quence prediction. A large body of theoretical work based on empirical processes theory
(Pollard, 1990), large-deviation theory and concentration inequalities such as Hoeffding’s
inequality (Hoeffding, 1963), Azuma’s inequality (Azuma, 1967), McDiarmid’s inequal-
ity (McDiarmid, 1989), ultimately relies on a bounded range assumption for establishing
the generalization bounds that back the predictions realized by mixtures of experts or
boosting-type approaches (Koltchinskii and Panchenko, 2002; Audibert et al., 2007; Shiv-
aswamy and Jebara, 2010). Results and reasonings from those works are thus difficult
to adapt to the usual models of stochastic programming. Note that when one accepts to
focus on bounded objective functions, theoretical investigations are possible (Nesterov
and Vial, 2008).
We follow another path here, and investigate empirically the use of bagging methods
for estimating an optimal first-stage decision to a multistage stochastic program. We
consider perturbed scenario-tree approximations to multistage stochastic programs with
decisions valued in a nonconvex feasible set. The standard averaging rule is not imple-
mentable, calling for a more sophisticated aggregation strategy. The first-stage decision
plays the role of the parameter θ considered in Section 3.1.

3.2.1 Principle of the Approach

In this section, we outline the principle of the proposed approach, and discuss the main
underlying assumptions. We start by describing the class of problems that we address
and then provide an overview of the main ingredients of the proposed solution approach,
namely, a procedure for generating an ensemble of scenario trees, an algorithm based
on the cross-entropy method for computing near-optimal first-stage decisions, and a
kernel-based method for aggregating the first-stage decisions derived from the ensemble
of scenario trees. Background material on kernel methods can be found in Appendix C.

Problem Formulation and Assumptions.

We consider a system that evolves according to a state transition equation

xt+1 = ft (xt , ut , wt ) ,

starting from a fixed initial state x0 ∈ X. The state trajectory x0 , x1 , . . . is controlled by


the decisions ut ∈ U and perturbed by disturbances wt ∈ W generated by a memoryless,
exogenous process, so that wt is drawn from a fixed probability distribution Pt,w . A
reward process r0 , r1 , . . . , rT −1 is defined by mappings rt from X ×U ×W to R with values
rt (xt , ut , wt ). The initial state x0 , system dynamics ft , reward functions rt , disturbance
model Pt,w , are assumed to be known by the decision maker, whose goal is to find a non-
anticipative decision strategy µ for selecting actions ut and maximizing the expectation
of the cumulated rewards over T stages, written

        J*(x0) = max_µ E{ Σ_{t=0}^{T−1} rt(xt , ut , wt) | x0 } .      (3.10)

A candidate strategy µ for selecting the decisions ut at times 0 ≤ t < T is a sequence of
time-indexed deterministic mappings µt from the current history ht = (w0 , w1 , . . . , wt−1 )
of the disturbance process to a fixed decision ut = µt (ht ) ∈ U .
(To compare this setup to the Markov Decision Process framework, one may assume
temporarily that the disturbance process is observable. Then, the mappings from h t to
ut are as expressive as mappings from states xt to actions ut , since the states xt are
ultimately a function of ht : xt can be recovered from ht , given x0 , u0 , the decision rules
µ1 ,. . . µt−1 , and the state transition functions f0 , . . . , ft−1 ).
No assumption is made about the dimensionality or the structure of the state space X.
The space U of possible actions, and the space W of possible disturbances, are assumed
to be made of a finite number of elements.
The notations xt , ut , wt , ft , rt , the assumption of a memoryless disturbance process,
the initial condition for t = 0 rather than t = 1, are meant to facilitate the connection
with the usual discrete-time optimal control framework (Bertsekas, 2005a). The memo-
ryless assumption may be relaxed, by simply requiring that the probabilities of all future
disturbance sequences are known by the decision maker. The temporal decomposition of
the performance criterion in Equation (3.10) is fundamental in an optimization procedure
based on dynamic programming, but is not essential in the present approach.

Exact Solution Based on a Complete Scenario Tree.

A complete scenario tree of depth T represents all the possible realizations of the process
w0 , w1 , . . . , wT −1 , together with their probabilities. In such a tree, the root node (depth 0)
corresponds to time t = 0 and to an empty process history. To each node n of depth
t ∈ {1, 2, . . . , T } in the tree corresponds a possible history hn = [w0 , . . . , wt−1 ]n of the
process, through the unique path from the root to the node n. The disturbance (w t−1 )n
is directly assigned to the node n together with its probability, while [w 0 , . . . , wt−2 ]n and
their joint probability can be collected from the disturbances and probabilities associated
to the nodes in the path.
Any strategy µ can be represented on the complete tree by assigning to each node n of
depth 0 ≤ t < T a fixed value un = µ(hn ) ∈ U . Consequently, searching for an optimal
strategy is equivalent to jointly optimizing the values un assigned to the internal nodes
of the tree.
The performance criterion defined in (3.10) can be evaluated once decisions have
been assigned to the nodes. Indeed, given the value of x0 , u0 = µ0 and a particular w0 ,
one can evaluate r0 = r0 (x0 , u0 , w0 ) and x1 = f0 (x0 , u0 , w0 ) by the knowledge of rt
and ft at t = 0. The values r0 and x1 can thus be assigned to the node associated to
the disturbance process history [w0 ]. The probability P0,w (w0 ) is determined from the
disturbance process model. Given the nodal decision for u1 = µ1 (w0 ), and using x1
and a particular w1 , one gets the values of x2 and r1 for the corresponding particular
value of [w0 , w1 ]. The value r1 can be assigned to the node corresponding to [w0 , w1 ], to
which is also assigned a probability P0,w (w0 ) · P1,w (w1 ), since we assume that w0 , w1 are
independent. The propagation of nodal values is pursued until values are assigned to x T
and rT −1 . It can be carried out for each disturbance path in the tree. Therefore, for
a given decision strategy µ, all the rewards and probabilities entering the evaluation of
the expectation in (3.10) can be computed, given the system model ft , rt , Pt,w and the
initial state x0 .
Without any particular structure assumed for ft and rt , the optimization of the
policy µ may be done by a direct search of the decisions un assigned to the nodes of
the tree. However, the number of possible combinations is of the order of |U|^(|W|^(T−1)) ,
meaning that as soon as the cardinalities |U |, |W |, or the time horizon T are large, an
exact optimization is intractable.
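The nodal propagation just described amounts to the following recursion, sketched here in Python for a memoryless disturbance model; the argument names (policy, f, r, P_w) are placeholders introduced for this illustration and do not refer to an actual implementation used in the thesis.

```python
def expected_return(x, t, history, T, policy, f, r, P_w):
    # Exact evaluation of the criterion (3.10) on the complete tree for a fixed
    # strategy: policy(history) returns the nodal decision u_n = mu_t(h_n),
    # f(t, x, u, w) the next state, r(t, x, u, w) the reward, and P_w is a list
    # of (disturbance, probability) pairs shared by all stages.
    if t == T:
        return 0.0
    u = policy(history)
    value = 0.0
    for w, p in P_w:          # one successor node per possible disturbance
        value += p * (r(t, x, u, w)
                      + expected_return(f(t, x, u, w), t + 1, history + (w,),
                                        T, policy, f, r, P_w))
    return value
```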

Approximate Solution Based on an Ensemble of Incomplete Scenario Trees.

Conceptually, an incomplete scenario tree is obtained by selecting a subset of the nodes


of a complete tree, by removing the arcs leading to these nodes as well as the subtrees
emanating from them, and by adjusting the probabilities of successor nodes so that they
still sum up to one. Each node of depth inferior to T in the resulting incomplete tree must
still have at least one successor node. In practice, incomplete trees can be constructed in a
top-down fashion, by subsampling the disturbance process according to its probabilities,
in such a way that the resulting incomplete scenario tree is small enough to induce a
tractable optimization problem.
While the decisions associated to the nodes of an incomplete scenario tree yield an
incomplete decision strategy, such a strategy always provides a value for the first-stage

decision u0 . Therefore, building an incomplete tree, optimizing an incomplete decision


strategy over it, and extracting the first-stage decision, can be viewed as a simplified
estimation procedure for the search of u0 on the complete tree.
Rather than this usual estimate for an optimal u0 , we propose to consider the model
space of all incomplete trees, sample an ensemble of models in that space, and estimate
u0 by aggregating the optimal first-stage decisions associated to these models.
Recalling the conditions for the success of bagging approaches (Section 3.1.6), we can
expect that this estimation procedure can be beneficial when the individual incomplete
trees are not too small. Working on a small tree or on a single scenario could induce too
large a bias on the individual first-stage decisions with respect to the optimal first-stage
decision.
The main ingredients of this approach are further discussed below, with an emphasis
on decision problems with large discrete action and disturbance spaces.

3.2.2 Generation of an Ensemble of Incomplete Trees

Under our assumptions, the disturbance space can be written as W = {w 1 , . . . , w|W | },


where wj stands for a specific realization of the disturbance. To each disturbance w j is
associated a probability mass Pt,w (wt = wj ) = pj > 0 for each j ∈ J = {1, . . . , |W |},
with Σ_{j=1}^{|W|} pj = 1.
In our proposal, the generation of an ensemble of incomplete scenario trees is based on
the random sampling of a small number of successor nodes, in a top-down fashion. This
amounts to replace the discrete distributions Pt,w defined by the pairs (w j , pj ), j ∈ J,
by simpler distributions P̃t,w (the approximation can be different at each node), and
proceed by developing the nodes recursively. We assume that the simpler distributions
are described by the pairs (w j , p̃j ) with j ∈ J̃ ⊂ J and Σ_{j∈J̃} p̃j = 1. In fact, as the
disturbance space is discrete, there would be no obvious way to define intermediary
or averaged values for the disturbances. We write V for the set {w j }j∈J˜. The set
W \ V = {w j }j∈J\J˜ is the set of disturbances omitted by the approximation.
The probability masses of the disturbances in W \V must be reallocated (transported)
to one or several disturbances in V . To do that in a way consistent with approximation
methods based on probability metrics (Section 2.4.2), we assume that the user can define
distances between the elements in W , ideally in such a way that close disturbances
induce similar state transitions and similar rewards. These ideas will be illustrated in
Section 3.3. Under this assumption, it makes sense to redistribute the probability mass of
a disturbance in W \V among the nearest disturbances in V , so as to reduce (heuristically)
approximation errors.
A generic way to induce distances in a discrete space is to introduce a positive-definite
kernel k : W × W → R with values k(w, w′) = k(w′, w) (see Appendix C). The distance
between two disturbances w, w′ is then given by

        d(w, w′) = [k(w, w) + k(w′, w′) − 2k(w, w′)]^{1/2} = d(w′, w) ≥ 0 .

For each w ∈ W \ V , let C(w) denote the subset of V of nearest neighbors to w:

C(w) = {v ∈ V : d(v, w) ≤ d(v 0 , w) for all v 0 ∈ V } .



Fig. 3.1: Illustration of the probability redistribution rule (3.11), for W = {w 1 , w2 , w3 , w4 },


V = {w 1 , w3 }, C(w 2 ) = {w 1 , w3 }, C(w 4 ) = {w 3 }. The pairs (w j , pj ) can be embedded
in a feature space induced by the choice of the kernel k. The dots represent V (black)
and W \ V (white) in the feature space where the pairwise distances are evaluated.

For each v ∈ V , let C⁻¹(v) denote the subset of elements in W \ V that have v as a
nearest neighbor:

        C⁻¹(v) = {w ∈ W \ V : v ∈ C(w)} .

The probability mass of a node n to which a disturbance w j ∈ V is associated is then


given by
        p̃j = pj + Σ_{wk ∈ C⁻¹(wj)} pk / |C(wk)| .      (3.11)

The probability mass redistribution rule is illustrated in Figure 3.1.
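A small Python sketch of the redistribution rule (3.11) is given below; the kernel is passed as a Gram matrix K on W, and the retained set V is passed as a list of indices. The function and variable names are placeholders introduced for this illustration.

```python
import numpy as np

def redistribute_probabilities(K, p, kept):
    # K[i, j] = k(w_i, w_j): kernel (Gram) matrix on the disturbance space W.
    # p[i]: original probability masses; kept: indices of the retained set V.
    kept = list(kept)
    p_tilde = {v: float(p[v]) for v in kept}
    for w in (j for j in range(len(p)) if j not in kept):
        # Kernel-induced squared distances d(w, v)^2 = k(w,w) + k(v,v) - 2 k(w,v).
        d2 = np.array([K[w, w] + K[v, v] - 2.0 * K[w, v] for v in kept])
        nearest = [v for v, d in zip(kept, d2) if np.isclose(d, d2.min())]
        for v in nearest:                      # C(w): split the mass among ties
            p_tilde[v] += float(p[w]) / len(nearest)
    return p_tilde                              # the masses p_tilde sum to one
```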

3.2.3 Optimization with the Cross-Entropy method

Consider an incomplete scenario tree with N nodes numbered from 1 (root, depth 0)
to N (last leaf, depth T ). We assume that the leaf nodes (depth T ) are numbered
from N − L + 1 to N , where L is the number of leaf nodes or equivalently, the number
of scenarios. Let w n , xn , un , r n , denote respectively the disturbance wt−1 , state xt ,
decision ut , reward rt−1 assigned to node n, where t corresponds to the depth of the node,
and where xt , ut , rt−1 are conditioned on the disturbance process history [w0 , . . . , wt−1 ]
induced by the path from the root to the node n. The root node has no disturbance and
reward assigned to it. The leaf nodes (N − L + 1 ≤ n ≤ N ) have no decision un assigned
to them. Let pn be the probability mass assigned to node n (the probabilities pn of the
nodes of depth t sum up to 1). For n > 1, let n− denote the index of the parent node
of node n. Let fn− and rn− denote the functions ft and rt for t equal to the depth of the
node n− . The problem (3.10) formulated on the incomplete scenario tree becomes
        maximize     Σ_{n=2}^{N} pn rn

        subject to   x1 = x0
                     xn = fn−(xn−, un−, wn)      2 ≤ n ≤ N
                     rn = rn−(xn−, un−, wn)      2 ≤ n ≤ N

over variables xn ∈ X, un ∈ U , r n ∈ R. By the implicit treatment of the equality


constraints, the problem can be rewritten as

maximize F (u1 , . . . , uN −L ) over u1 , . . . , uN −L ∈ U. (3.12)



We view F as an arbitrary mapping from U × · · · × U = U^{N−L} to R. In theory, the
maximum could be computed by sorting the values of F for the |U|^{N−L} possible inputs.
For brevity, we will write u for [u1 , . . . , uN−L ].
For the estimation of the maximum of F over u ∈ U^{N−L} , we use the Cross-Entropy
method (CE method) (Rubinstein and Kroese, 2004). When it is applied to importance
sampling, the Cross-Entropy method aims at selecting, from a family of parametrized
densities, the density that has the smallest Kullback-Leibler divergence (defined previ-
ously by (3.1)) with respect to an ideal importance sampling density. Rubinstein and
Kroese (2004) note that when the densities come from an exponential family of distri-
butions (defined in Chapter 5), the use of the Kullback-Leibler divergence (CE-distance)
allows analytical calculations, and reduces the complexity of importance sampling algo-
rithms that sequentially update the parameters of the sampling distributions.
When the CE method is applied to an optimization problem, viewed as the search of
a particular rare event, the method is based on two components:

• A random generator parametrized by θ for sampling candidate solutions u accord-


ing to a density g(·; θ):

u ∼ g(·; θ) . (3.13)

The parametrization by θ must be chosen in such a way that the distribution g


can be either uniform over the possible solutions in U^{N−L} , or concentrated on any
particular solution in U^{N−L} .

• The procedure that computes the value of F for a candidate solution u, where
F (u) will be interpreted as the score assigned to u by F :

u 7→ F (u) . (3.14)

The method works as follows. Starting with the value of θ that corresponds to the uniform
distribution over the space of solutions, one draws NCE samples u1 , . . . , uNCE from the
density g(·; θ), scores them using the scoring function F , and tags as elite solutions the
samples with a score greater than or equal to the ⌈ρNCE⌉-th best score, written γ̂. The
parameter ρ is set to a small positive value, typically 0.01 (a value for which the elite
solutions correspond to the best percentile of the empirical distribution of the score).
The parameter θ is then updated so as to decrease the CE distance of g(·; θ) with respect
to the empirical density induced by the elite solutions. The update rule proposed by
Rubinstein and Kroese (2004, Equation 4.8) is
        θ ← θ̂   where   θ̂ ∈ argmax_θ  Σ_{k: F(uk) ≥ γ̂} log g(uk ; θ) .

Thus in fact θ̂ is just the maximum likelihood estimate of the sampling density g(u; θ)
given the data set of elite samples (the parameter update maximizes the probability of
generating the elite samples).
After the parameter update step, a new set of NCE samples is redrawn from g(·; θ).
The parameter update/resampling procedure is repeated until the density g(·; θ) concen-
trates on a single solution, or until the elite scores have ceased to improve. The best
candidate solution with respect to F (at any iteration) is then returned. The authors of

the method recommend to choose NCE proportional to the number of parameters in θ


(the dimension of θ depending itself on the size of the search space). They also propose
to smooth the updates of θ as follows: denoting by θj the value of θ at iteration j, they
suggest to set

θj+1 = αθ̂ + (1 − α)θj (3.15)

where α ∈ (0, 1] is the smoothing factor.
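A generic sketch of the Cross-Entropy loop described above is given below in Python, for decisions coded as vectors of categorical actions with coordinate-wise independent sampling distributions; the scoring function, the parameter values, and the names are placeholders for this illustration, not the implementation used for the experiments.

```python
import numpy as np

def cross_entropy_maximize(score, n_vars, n_actions, n_samples=200, rho=0.01,
                           alpha=0.6, n_iters=50, seed=0):
    # Maximize score(u) over u in {0, ..., n_actions-1}^n_vars.
    rng = np.random.default_rng(seed)
    theta = np.full((n_vars, n_actions), 1.0 / n_actions)   # start from the uniform law
    best_u, best_score = None, -np.inf
    for _ in range(n_iters):
        # Draw candidate solutions coordinate-wise from g(.; theta).
        samples = np.stack([rng.choice(n_actions, size=n_samples, p=theta[i])
                            for i in range(n_vars)], axis=1)
        scores = np.array([score(u) for u in samples])
        # Elite set: the ceil(rho * n_samples) best-scoring samples.
        n_elite = max(1, int(np.ceil(rho * n_samples)))
        elite = samples[np.argsort(scores)[-n_elite:]]
        if scores.max() > best_score:
            best_score, best_u = scores.max(), samples[np.argmax(scores)].copy()
        # Maximum-likelihood update on the elites, smoothed as in (3.15).
        freq = np.stack([np.bincount(elite[:, i], minlength=n_actions) / n_elite
                         for i in range(n_vars)])
        theta = alpha * freq + (1.0 - alpha) * theta
    return best_u, best_score
```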

3.2.4 Aggregation of First-Stage Decisions

Let uν0 ∈ U denote a near-optimal first-stage decision (root-node decision) relative to an


incomplete tree ν, where the tree and uν0 have been obtained by the procedures described
in the preceding subsections. Let S0 = {u^1_0 , . . . , u^m_0} denote the set of near-optimal first-
stage decisions relative to m such trees.
The analysis of bagging methods (see Section 3.1.6) suggests that forming an ag-
gregate first-stage decision from the set S0 could decrease the influence on uν0 of the
particular sampled tree ν. However, a difficulty in the present setup, where decisions are
valued in a large discrete set U , consists in defining an admissible and useful aggregation
rule. By admissible, it is meant that the aggregated decision is valued in U ; by useful, it
is meant that the aggregation rule should be able to preserve the structure and properties
of near-optimal solutions.
To this end, we propose to take as the aggregated first-stage decision, written u a0 ,
the decision in S0 that is nearest to a special point that we call the centroid of the
decisions in S0 , and that is defined, as we explain next, using the metric induced on U
by a kernel kU that quantifies the similarity between decisions in U . Formally, we may
assume momentarily (so as to clarify intermediate calculations) that we have access to
the feature map ϕ : U → H relative to the reproducing kernel Hilbert space H induced
by kU : U ×U → R. Alternatively, we may assume that we can enumerate a finite number
of features for the decisions, from which we induce a kernel by (see Appendix C.4)

        kU(u0 , u0′) = ⟨ϕ(u0), ϕ(u0′)⟩ ,

where ⟨·, ·⟩ stands for a suitably defined inner product.


The centroid of S0 , written uc0 , is defined by its coordinate ϕ(uc0) in the feature space,
set to

        ϕ(uc0) = m⁻¹ Σ_{ν=1}^{m} ϕ(uν0) .

The squared distance between the centroid and some first-stage decision u 0 ∈ U is given
by

    ||ϕ(u0) − ϕ(uc0)||² = ⟨ϕ(u0), ϕ(u0)⟩ − 2m⁻¹ Σ_{i=1}^{m} ⟨ϕ(u0), ϕ(ui0)⟩ + m⁻² Σ_{i=1}^{m} Σ_{j=1}^{m} ⟨ϕ(ui0), ϕ(uj0)⟩ .

The squared distance from some decision uν0 ∈ S0 to the centroid uc0 may be expressed
in terms of the elements Kij = ⟨ϕ(ui0), ϕ(uj0)⟩ of the Gram matrix K ∈ R^{m×m} by

    ||ϕ(uν0) − ϕ(uc0)||² = Kνν − 2m⁻¹ Σ_{i=1}^{m} Kiν + m⁻² Σ_{i=1}^{m} Σ_{j=1}^{m} Kij .      (3.16)

The aggregated solution

        ua0 ∈ argmin_{uν0 ∈ S0} ||ϕ(uν0) − ϕ(uc0)||² ,      (3.17)

    or equivalently   ua0 = uj0   with   j ∈ argmin_{1≤ν≤m} { Kνν − 2m⁻¹ Σ_{i=1}^{m} Kiν } ,

with ties broken arbitrarily, may thus also be computed without the need to refer to the
feature map ϕ once the Gram matrix is given. Therefore, the explicit computation of the
centroid in the feature space, which would require the explicit knowledge of the feature
map, is not actually needed for evaluating the aggregated solution.
Note that the empirical variance of the ensemble of decisions in the feature space
induced by the kernel kU , defined by
    var{S0} = m⁻¹ Σ_{ν=1}^{m} ||ϕ(uν0) − ϕ(uc0)||² = m⁻¹ Σ_{i=1}^{m} Kii − m⁻² Σ_{i=1}^{m} Σ_{j=1}^{m} Kij ,

could also be evaluated even if the feature map is specified only implicitly by the definition
of the kernel, and could quantify the discrepancy between candidate decisions in S 0 .
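The selection rule (3.17) and the variance above indeed only involve the Gram matrix, as the following Python sketch makes explicit; the function name, the 0-1 kernel used in the example, and the tie-breaking by the first index are illustrative choices.

```python
import numpy as np

def aggregate_first_stage(K):
    # K[i, j] = k_U(u_0^i, u_0^j): Gram matrix of the m candidate first-stage decisions.
    # Squared distance of each candidate to the centroid, following (3.16).
    dist2 = np.diag(K) - 2.0 * K.mean(axis=1) + K.mean()
    var_S0 = np.diag(K).mean() - K.mean()     # empirical variance in the feature space
    return int(np.argmin(dist2)), float(var_S0)

# Example with a 0-1 kernel on three candidates, two of which coincide.
K = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
print(aggregate_first_stage(K))               # selects index 0 (the majority decision)
```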

Discussion.

First consider the situation where the decision space U only possesses a handful of el-
ements. Thanks to the small cardinality of U , we may expect that optimal first-stage
decisions are present among the elements of set S0 . Therefore, a simple majority vote
among the elements of S0 can be taken as the estimate ua of an optimal first-stage de-
cision. Note that the majority vote can be obtained from the general formulation (3.17)
based on kernels by setting Kij = δ{ui0 = uj0 }, where δ{·} denotes the 0-1 indicator
function of the event placed in argument. Indeed, as Kνν = 1, the squared distances
||ϕ(uν0) − ϕ(uc0)||² , 1 ≤ ν ≤ m, will only differ by the term −2m⁻¹ Σ_{i=1}^{m} δ{ui0 = uν0},
proportional to the frequency of uν0 in S0 .
Now consider the situation where U is finite but has a cardinality |U | much larger
than |S0 | = m. It is then very likely that a clear majority will not be attained in S0 ,
especially if there are many quasi-equivalent decisions in terms of optimality. However, in
many situations, U is formed from the combination of several elementary decisions. One
could thus combine kernels on the elementary decision spaces, for instance by combining
separate majority votes on the elementary decisions.
The kernelization of the decision space enables one to incorporate prior knowledge on
the structure of the decision space. Therefore, kernels should be consistent with prior
beliefs about the decisions that have similar effects on the problem at hand.

Fig. 3.2: Example of configuration for the sensor network problem, with eight sensors ( ) and
two targets (•) (figure taken from Dutech et al. (2005)).

3.3 Numerical Experiment

In this section, we illustrate the proposed approach on a test problem that has
a large, structured, discrete action space. We explain in detail how the action space is
kernelized, how the incomplete scenario trees are generated, and how the corresponding
optimization problems are solved approximately. We assess the quality of the first-stage
decision estimators û0 = ua0 obtained with the proposed approach by a direct comparison
with the optimal strategy, which can be computed exactly in this test problem by dynamic
programming (by evaluating recursively the tabular representation of the expected costs-
to-go (Q-functions), where the tabular representation of a Q-function has an entry for
each combination of state-action pairs).

3.3.1 Description of the Test Problem

The test problem is part of a series of standard benchmark problems proposed for compar-
ing Reinforcement Learning solution approaches (Dutech et al., 2005). The test problem
is inspired by a distributed control application from Ali et al. (2005) and named Sensor
Network. Note that among all the problems selected by Dutech et al. (2005), Sensor
Network is the only problem with a relatively large discrete decision space.
The problem can be described as follows. Eight sensors bracket an array of three
cells, as shown in Figure 3.2. Each cell is surrounded by 4 sensors. Two targets float
over the cells. Each cell is occupied by at most one target.
At time step t, each sensor can be focussed on the cell to its left, on the cell to its
right, or be idle. The decision ut sets the action of the 8 sensors. The decision space
is thus a joint action space U = {0, 1, 2}^8 that encodes the 3 possible actions of the
8 sensors (0: idle, 1: focus left, 2: focus right), totalling 3^8 = 6561 possible actions. A
unit cost is incurred for each focussed sensor; idle sensors have no cost.
The game consists in eliminating the two floating targets from the cells as quickly as
possible. Each target starts at energy level 3. After sensors have been set according to ut ,
the targets move. The leftmost target randomly moves to the left (L), to the right (R),
or stays idle (I). A priori the 3 possibilities are equiprobable, but a move is cancelled if
the cell where the target intends to go is already occupied or does not exist. After the
move performed by the leftmost target, the rightmost target randomly moves according
to the same rules.
The intended moves of the targets are viewed as the disturbances in the problem.
The disturbance space is W = {L, R, I} × {L, R, I}, with each of the 3^2 = 9 possible com-
binations having probability 1/9. The effective moves may differ inasmuch as intended

moves can be blocked as described above.


The sensors are then activated. A target that lies in a cell where 3 sensors or more
are focussed loses one energy point. A target is removed from the board when its energy
falls to 0. The game ends when the two targets have been eliminated from the board.
The state space X = {0, 1, 2, 3} × {0, 1, 2, 3} × {0, 1, 2, 3} encodes the target energy
level (0 to 3) of the 3 cells. When a target moves from one cell to another, these cells
swap their energy level. The initial state x0 is either [3 3 0], [3 0 3], [0 3 3], representing
2 targets with energy level 3. The state [0 0 0], corresponding to a board with no
remaining targets, is a terminal state. When a target is eliminated, a reward +30 is
obtained. Therefore, the instantaneous reward rt is given by
    rt = 30 Σ_{i=1}^{2} δ{energy level of target i goes from 1 to 0} − Σ_{i=1}^{8} δ{sensor i is not idle} .

The total return is the discounted cumulated reward Σ_{t=0}^{T} γ^t rt with γ = 0.95, and
T = 10. The problem is the maximization of the expected total return, starting from
some given state, over stochastic programming decision rules µt : W^{t−1} → U with
values µt (w0 , . . . , wt−1 ) = ut . We concentrate on the estimation of an optimal first-stage
decision u0 , given x0 .
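For reference, the target-move and sensor-activation steps of this test problem can be sketched as follows in Python. The encoding of a sensor's focus directly as a target cell index (rather than the left/right/idle action) abstracts away the exact sensor-to-cell geometry, so this is only an illustrative simulator assuming that reading of the rules, not the implementation used for the experiments.

```python
def move_targets(energy, w):
    # energy: energy levels of the 3 cells; w = (move of leftmost target, move of
    # rightmost target), each in {'L', 'R', 'I'}. Moves leading off the board or to
    # an occupied cell are cancelled; cells swap energy levels on an effective move.
    energy = list(energy)
    occupied = [c for c in range(3) if energy[c] > 0]      # leftmost target moves first
    for cell, move in zip(occupied, w):
        dest = cell + {'L': -1, 'R': +1, 'I': 0}[move]
        if move != 'I' and 0 <= dest <= 2 and energy[dest] == 0:
            energy[cell], energy[dest] = 0, energy[cell]
    return energy

def activate_sensors(energy, focus):
    # focus: list of 8 entries, each a cell index in {0, 1, 2} or None for an idle
    # sensor. A cell with at least 3 focussed sensors loses one energy point; a unit
    # cost is paid per focussed sensor and +30 is obtained per eliminated target.
    energy = list(energy)
    reward = -sum(1 for c in focus if c is not None)
    for cell in range(3):
        if sum(1 for c in focus if c == cell) >= 3 and energy[cell] > 0:
            energy[cell] -= 1
            if energy[cell] == 0:
                reward += 30
    return energy, reward
```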

3.3.2 Particular Choices

The general approach described in the preceding subsections is adapted to the problem
at hand as follows.

• The disturbance space is decomposed as W = Wa × Wb with Wa = {L, R, I}


relative to the intended move of the leftmost target, and Wb = {L, R, I} relative to
the intended move of the rightmost target (if there are still two active targets). A
disturbance w is thus decomposed as w = (wa , wb ), with wa relative to Wa and wb
relative to Wb .
We assume (heuristically) that two disturbances w, w′ such that wa = wa′ or wb =
wb′ are similar in terms of induced state transitions and rewards. This assumption
is taken into account by defining the kernel on the disturbance space (Section 3.2.2)
as

        k(w, w′) = δ{wa = wa′} + δ{wb = wb′} .      (3.18)

• The incomplete trees are built by sampling, with replacement, NW disturbances


in W at each node. Those NW disturbances are sampled according to their proba-
bilities. Duplicate samples are then eliminated, and the distinct samples are taken
as the children of the node and assigned a probability according to (3.11), with the
kernel k defined by (3.18). The initial number of samples NW is random, so that
the branching structure of an incomplete scenario tree is random. For a node of
depth d ≥ 0, NW = 3 samples are drawn with probability 1/(1 + d), and NW = 1
sample with probability 1 − 1/(1 + d). Random trees with more than 150 nodes
are rejected. Note that the complete tree on the |W| = 9 disturbances would have
Σ_{d=0}^{10} 9^d ≈ 3.9 · 10^9 nodes.

• The sampling distribution for candidate solutions u (Section 3.2.3) is first decom-
posed into N − L independent components, each component being relative to one
internal node of the scenario tree (assuming that the tree has N nodes, including
L leaf nodes of depth T ). The N − L components are themselves decomposed into
8 independent parts corresponding to the 8 sensors. Each part defines the distribu-
tion over {0, 1, 2} of the action of a sensor j at a node i, written aij . A distribution
for aij is described by the two scalar parameters pij = P{aij = 0}, qij = P{aij = 1},
with pij , qij ∈ [0, 1] and 0 ≤ 1 − pij − qij = P{aij = 2} ≤ 1. A uniform distri-
bution over all possible strategies on the incomplete tree is obtained by setting
pij = qij = 1/3 for all i, j, whereas any particular deterministic solution u can be
obtained by selecting for each pair (i, j) one of the three configurations {p ij = 1,
qij = 0}, {pij = 0, qij = 1}, or {pij = 0, qij = 0}. The distribution for generating
a random solution u associated to the incomplete scenario tree is thus specified by
2 · 8 · (N − L) parameters.
Once the elite samples uk (Section 3.2.3) have been scored by computing the ex-
pected discounted sum of rewards on the incomplete tree with the nodal decisions
set to uk , the parameters of the generating distribution for the solutions are up-
dated as follows: given ℓ elite samples, with a^k_ij denoting the action aij from the
elite sample uk , 1 ≤ k ≤ ℓ, one first computes the empirical frequencies of the
elementary actions in the elite samples,

        p̂ij := ℓ⁻¹ Σ_{k=1}^{ℓ} δ{a^k_ij = 0} ,
        q̂ij := ℓ⁻¹ Σ_{k=1}^{ℓ} δ{a^k_ij = 1} ,

and then one updates the parameters pij , qij of the solution generating distribution
by

pij ← α p̂ij + (1 − α) pij


qij ← α q̂ij + (1 − α) qij ,

where α is the smoothing factor of Equation 3.15. In the numerical experiments
reported in the next section, α = 0.6 (a minimal sketch of this update appears
after this list).

• The Cross-Entropy optimization can be stopped as soon as the sampling distribu-


tions relative to the root node are almost deterministic, since ultimately only the
first-stage decision is extracted from a solution u and used in the aggregation step.
In the numerical experiments reported in the next section, the Cross-Entropy op-
timization is carried out by generating NCE = 32 (N − L) candidate solutions
per iteration, that is, twice the number of parameters that describe the solution-
generating distribution. The optimization is stopped as soon as the 8 actions at the
root have their distribution concentrated on a single action with probability 0.99.

• The aggregation scheme exploits the decomposition of the decision space into sep-
arate sensor actions. Recalling that the root node has the node index i = 1, let
aν1j denote the first-stage action of sensor j from the ν-th solution in the set S 0 ,
1 ≤ ν ≤ m. The action of sensor j in the centroid decision uc0 (Section 3.2.4) is
determined by a majority vote over the action of sensor j in the first-stage decisions

uν0 = {aν1j }1≤j≤8 collected in the set S0 :


        a^c_1j = a^{ν(j)}_1j   with   ν(j) = min{ argmax_k Σ_{l=1}^{m} δ{a^k_1j = a^l_1j} },   1 ≤ j ≤ 8 ,

and then the aggregated decision ua0 is set to the element uν0 ∈ S0 sharing the most
actions with uc0 , that is,

        ν = min{ argmax_k Σ_{j=1}^{8} δ{a^k_1j = a^c_1j} } .

A similar effect can be obtained by defining the kernel kU (Section 3.2.4) between
two elements uν0 , uσ0 of S0 as
        kU(uν0 , uσ0) = Σ_{j=1}^{8} δ{a^ν_1j = a^σ_1j} = Kνσ = Kσν

and setting

        ua0 = u^{ν′}_0   with   ν′ = min{ argmax_ν Σ_{i=1}^{m} Kiν } ,
i=1

following (3.17) with Kνν constant and ties broken by the lexicographical order
on ν.
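The per-sensor parameter update with smoothing, described in the third item above, can be sketched as follows in Python; the array shapes and names are illustrative and not fixed by the formulation.

```python
import numpy as np

def update_sensor_distributions(p, q, elite_actions, alpha=0.6):
    # p[i, j] = P{a_ij = 0} and q[i, j] = P{a_ij = 1} for sensor j at internal node i
    # (the probability of action 2 is 1 - p - q). elite_actions has shape
    # (n_elite, n_nodes, 8) and contains the elite candidate solutions.
    p_hat = np.mean(elite_actions == 0, axis=0)    # empirical frequency of action 0
    q_hat = np.mean(elite_actions == 1, axis=0)    # empirical frequency of action 1
    return alpha * p_hat + (1 - alpha) * p, alpha * q_hat + (1 - alpha) * q
```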

3.3.3 Simulation Results

Typical outcomes with an ensemble of m = 5 incomplete trees are reported in Table 3.1.
Three problems corresponding to the 3 initial configurations x0 of the targets with 3 en-
ergy points that float over the 3 cells are considered. Decisions uν0 are represented
graphically. For instance, the symbol -/\- /\/\
indicates that 3 sensors are focussed on
the leftmost cell, no sensor is focussed on the middle cell, 3 sensors are focussed on the
rightmost cell, and the remaining 2 sensors are idle (-). If the targets move onto the
leftmost or the rightmost cell, they will be hit, so the combined action of the sensors
is effective. It would be suboptimal to have 1, 2 or 4 sensors be focussed on a same cell.
The table shows that the structure of optimal decisions can be destroyed in the centroid
decision uc0 . In fact, there are several configurations of the sensors that can lead to the
same effective targeting of two cells, but these equivalent configurations are made inef-
fective when they are averaged. The aggregated decision ua0 reaches a consensus while
preserving the structure of effective configurations.
It turns out that for the first and third problem, the aggregated decisions u a0 are
optimal, in the sense that the targeted cells are optimal, according to an exact dynamic
programming solution where the decision space is reduced a priori to 6 sensible choices
of targeted cells instead of considering the full set of combined actions of the sensors.
For the second problem (x0 = [3 0 3]), the decision ua0 shown in the table is slightly
suboptimal: if subsequently optimal decisions are selected, then u a0 brings an expected
return of 27.78 instead of the optimal return 27.88.
We repeated 10 times the experiment of building an ensemble of 5 trees and computing
the aggregated decision. An optimal decision was found: 7 times for x 0 = [3 3 0], 9 times

Tab. 3.1: Typical result with an ensemble of 5 trees.

x0          First-stage decision uν0                                   uc0       ua0
            ν=1       ν=2       ν=3       ν=4       ν=5

\//- \/-- \\/- -/\/ \/\- \//- \//-


[3 3 0] //\- /--- /\\- /\/- /-/\ /-\- //\-

-/\- \/\/ \/\- \-\- \//- \/\- \/\-


[3 0 3] /\/\ -\/- -\/\ /\/\ //\- /\/\ -\/\

\/-/ -\\/ -\\/ \-\/ -\// -\\/ -\\/


[0 3 3] /-/\ -/\\ -/\\ /\-\ -//\ -/\\ -/\\

for x0 = [0 3 3], and 5 times for x0 = [3 0 3] (a second-best decision is found in the


5 other cases).

3.4 Conclusions

This chapter has investigated empirically the estimation of an optimal first-stage de-
cision to a finite-horizon optimal control problem by scenario tree techniques. While
we recognize that the solution techniques used in this work might be of limited inter-
est in practice, given that the studied problem class would be more naturally addressed
from a Markov Decision Process perspective, we believe that the statistical framework in
which the proposed tree-bagging solution technique was presented clarifies the connec-
tion between statistical estimation/prediction methods and sequential decision making
by stochastic programming.
It is interesting to realize, in particular, that stochastic programming models seldom
take into account the intrinsic limitation that only finite-sample approximations can
be solved. Usual stochastic programming models are thus close in spirit to maximum
likelihood estimation models used on finite data without regularization. Certainly, the
appropriate ways to apply regularization to sequential decision making are not clear
at this stage, and would call for further research. For instance, we observed — in-
dependently of the material presented in this chapter — that the early stopping of the
progressive hedging algorithm (Rockafellar and Wets, 1991) (see also Remark 2.1), where
decisions are optimized on separate scenarios (with a penalization of the difference with
the decisions at the previous iteration) and then averaged if they are relative to a same
information state, could provide a kind of regularization without even modifying the
formulation of the model. With early stopping however, the objective being optimized
is no longer totally explicit, the solution has a dependence on the initial conditions, and
therefore the solution algorithm would require a careful tuning of its parameters on the
problem at hand.
Unfortunately, we are still lacking at this stage efficient methods for testing the real
value of any solution procedure, be it regularized or not. As the right amount of reg-
ularization (weight of the regularization in the objective, early stopping criterion, . . . )

is usually selected at the light of the results obtained by simulating the model on an
independent validation sample or by cross-validation methods (Stone, 1974; Efron and
Tibshirani, 1993), it would be vacuous to discuss regularization further if we were ulti-
mately unable to estimate the true quality of the regularized solution. The development
of efficient validation methods is the subject of the next chapter.
Chapter 4

Validation of Solutions and Scenario Tree Generation

In this chapter, we propose an approach for solving multistage stochastic programming


problems based on the idea of generating in a lazy fashion a large number of random
tractable scenario-tree based approximations. The approach is lazy in the sense that
instead of recommending a careful analysis of the structure of the problem at hand, and
instead of devoting all computational resources to the construction of a single scenario
tree, we recommend a multiplication of solution attempts through the generation of
several approximations. The method works by extracting, from the solutions of these
approximations, data sets that combine realizations of the random process and decision
sequences, and by processing these data sets by a supervised learning method, so as to
infer policies that can be later on tested efficiently on a large sample of new independent
scenarios. The learned policies can be exploited to infer multistage decision strategies
that achieve good performances in a very generic way. They can also be used to score
and select scenario trees, and thus to guide, in the context of a precise application, the
development and fine-tuning of a scenario tree generation algorithm.
The chapter is organized as follows. Section 4.1 motivates the approach investigated in
the chapter. Section 4.2 describes how feasible decision policies can be learned from a data
set of scenarios and decisions. Section 4.3 builds on the idea of exploiting several scenario-
tree approximations for learning several policies. Section 4.4 implements the proposed
approach on a family of test problems, introduces specific tree generation algorithms,
tunes them in the context of the test problems, and discusses the overall complexity
of the approach. Section 4.6 concludes by discussing the potential of some possible
extensions of the proposed algorithms, at the light of the results and insights obtained
in this chapter.

4.1 Motivation

Our approach is motivated by two complementary and intimately related considerations


induced by our analysis of current approximation methods for multistage stochastic pro-
gramming, and their confrontation to the problems addressed in the field of machine
learning in the last years.
The first motivation is derived from the need for intensive testing of decision-making
policies for multistage programs (Section 2.4.3). This need is primarily a consequence of
the lack of tight theoretical results that would provide broadly-usable prior guarantees

on scenario tree based methods. Intensive testing is needed, because for obtaining perfor-
mance estimators that are statistically significant, it is important to test a decision policy
on a sufficiently large number of independent scenarios. Testing decisions a posteriori by
the shrinking-horizon approach (Section 2.4.4) is not a viable option, given the internal
use of additional scenario trees by this approach, and given the overall computational
complexity of the procedure. With respect to this motivation, machine learning offers a
multitude of ways of extracting policies that are easy to test in an automatic way on a
large number of independent samples.
The second motivation has to do with the intrinsic nature of the finite scenario-tree
approximation for multistage stochastic programming. The variance in the quality of the
optimal decisions that may be inferred from finite approximations suggests that those
problems are essentially ill-posed in the same sense as the inverse problems addressed
in machine learning are also ill-posed: small perturbations in the values of a finite data
set — finite number of scenarios in stochastic programming; finite number of input-
output pairs in machine learning — lead to perturbations of empirical expectations, and
ultimately lead to large variations (instability) of the quantities of interest — first-stage
decisions in stochastic programming; parameters of classifiers or regressors in machine
learning — that are being optimized on the basis of empirical estimates.
This analogy suggests that regularization techniques and principles from statistical
learning theory (Vapnik, 1998), such as the structural risk minimization principle, could
help to extract solutions from scenario-tree approximations in a sound way from the
theoretical point of view, and in an efficient way from the practical point of view.
The main ideas developed in the following subsections can be summarized as follows:
we propose an approach that (i) allows to test small scenario trees quickly and reliably,
(ii) is likely to offer better ways of exploiting individual scenario-tree approximations,
and (iii) in the end, allows to revisit the initial question (Section 2.4.2) of generating,
solving, ranking and exploiting tractable scenario trees for solving complex multistage
decision making problems.

4.2 Learning and Evaluation of Scenario Tree Based Policies

We start from the following observation: estimators of the quality of a scenario-tree ap-
proximation that are computationally cheap to evaluate can be constructed by resorting
to supervised learning techniques.
The basic principle consists in inferring a suboptimal decision policy by first learning
a sequence of decision predictors π̂1 , . . . π̂T from a data set of examples of information
state/decision pairs. The examples of information states are extracted from the nodes of
the scenario tree; they correspond to the partial scenario histories (ξ^k_1 , . . . , ξ^k_{t−1}) in the
tree. Later in the chapter (Section 4.4.3), we will see that the information states can also
be represented differently, for instance by features or by states in the sense of dynamic
programming. The examples of decisions are also extracted from the nodes of the tree:
they correspond to the decisions ukt optimized on the tree.
When a decision predictor π̂t is applied on a new scenario ξ (or more exactly, on
the observed part (ξ1 , . . . , ξt−1 ) of the scenario ξ), it outputs a predicted decision that
cannot be assumed to satisfy the exact feasibility constraints ut ∈ Ut (ξ) relative to the

new scenario, if we want to define a framework that allows the use of existing standard
supervised learning algorithms for building the decision predictors. Therefore, to obtain
feasible decisions, we assume that the predicted decision can then be corrected in an
ad-hoc fashion using a computationally cheap feasibility-restoration procedure, that we
call repair procedure in the sequel and denote by Mt . The idea of using repair procedures
is also suggested in Küchler and Vigerske (2010) as a means of restoring the feasibility
of decisions extracted from a tree and applied to test scenarios.
We now formalize these ideas to describe how a learned decision policy can be used
to assess (validate), in a certain sense, a given scenario-tree approximation.

4.2.1 Description of the Validation Method

We consider a multistage stochastic program in abstract form (see Section 2.1.5):

P: minimize E {f (ξ, π(ξ))} subject to πt (ξ) ∈ Ut (ξ) ;


π(ξ) non-anticipative,

where ξ = (ξ1 , . . . , ξT ) is a random process, and where the optimization is over the
decision policy π = (π1 , . . . , πT ). We assume that ξt has its outcomes in some space Ξt ,
say Rd for simplicity, and that πt is valued in some space Ut (of which Ut (ξ) is a subset),
say Rm . We recall that π is non-anticipative if π1 is a constant-valued function, π2 is
a function of ξ1 , and more generally πt is a function of ξ1 , . . . , ξt−1 . Therefore, one can
define πt either as a mapping from Ξ1 × · · · × ΞT = R^{Td} to R^m restricted to the class of
non-anticipative mappings, or as a mapping from Ξ1 × · · · × Ξt−1 = R^{(t−1)d} to R^m .
Given an approximation of P on a scenario tree having n scenarios ξ k of probability pk ,
    P′ :   minimize  Σ_{k=1}^{n} p^k f(ξ^k, u^k)   subject to   u^k_t ∈ Ut(ξ^k) ∀ k ;
                                                                u^k_1 = u^j_1 ∀ k, j ,
                                                                u^k_t = u^j_t whenever (ξ^k_1 , . . . , ξ^k_{t−1}) ≡ (ξ^j_1 , . . . , ξ^j_{t−1}) ,

let {ūk}1≤k≤n denote an optimal solution to P′ , where each ūk = (ū^k_1 , . . . , ū^k_T) corre-
sponds to the sequence of decisions associated to ξ^k . We define a decision predictor π̂t as
a mapping from inputs Xt := (ξ1 , . . . , ξt−1) ∈ R^{(t−1)d} to outputs Yt := ut ∈ R^m , learned
from a data set Dt := {(X^k_t , Y^k_t)}1≤k≤n of input-output pairs, obtained by collecting
from the scenario tree the observed parts of the scenarios and their associated optimized
decisions:

        X^k_t := (ξ^k_1 , . . . , ξ^k_{t−1}) ,        Y^k_t := ū^k_t .

Note that the duplicate samples (Xtk , Ytk ) ≡ (Xtj , Ytj ) induced by the non-anticipativity
conditions (the branching structure of the scenario tree) may be removed from the learn-
ing set Dt . In particular, D1 is reduced to a single learning sample Y1 = ū1 , leading to
a trivial learning problem and to the decision predictor π̂1 ≡ ū1 .
By construction of P′ , and by the fact that U1 is constant-set-valued, the first-stage
decision is feasible: π̂1 (ξ) = ū1 ∈ U1 (ξ). For the subsequent decisions, the supervised
learning procedure cannot in general guarantee that π̂t (ξ) ∈ Ut (ξ) for all scenarios in the

learning set and for all new scenarios ξ. Therefore, we repair the predictions to restore the
feasibility of the decisions. The nature of the repair procedure varies with the feasibility
constraints that have to be enforced. The realizations of the random quantities on which
Ut (ξ) depend are passed in arguments of the repair procedure, and the procedure is then
applied online on each new scenario and predicted decisions.
An example of repair procedure is the projection of a predicted decision on the feasibil-
ity set. Later in the thesis, we resort to simple problem-dependent heuristics for restoring
feasibility (Section 5.3.4). Formally, we define as an admissible repair procedure for Ut any mapping
\[
M_t : (\Xi_1 \times \cdots \times \Xi_{t-1}) \times (U_1 \times \cdots \times U_{t-1}) \times U_t \to U_t
\]
with values $M_t(\xi_1, \dots, \xi_{t-1};\, u_1, \dots, u_{t-1};\, \hat{\pi}_t(\xi_1, \dots, \xi_{t-1}))$

such that the range of Mt is always contained in the feasible set Ut (ξ), assuming that
u1 , . . . , ut−1 are in the corresponding feasibility sets U1 (ξ), . . . , Ut−1 (ξ), and that Ut (ξ) is
nonempty.
A learned (feasible) policy is made of the association of the decision predictors and
the repair procedures.
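As a simple illustration, the following Python sketch implements a projection-type repair for box-and-budget constraints of the kind met in the case study of Section 4.4 (0 ≤ ut ≤ 1 and a remaining budget of Q); it is an assumption for exposition, not the general Mt.

```python
# Sketch of a projection-type repair procedure M_t for constraints of the form
# 0 <= u_t <= 1 and u_1 + ... + u_t <= Q (a case-study-like setting).
# xi_history is unused here but kept to match the signature M_t(xi; u; prediction).

def repair(t, xi_history, u_history, u_predicted, Q=2.0):
    remaining = Q - sum(u_history)            # budget left after past decisions
    upper = min(1.0, max(0.0, remaining))     # feasible upper bound at stage t
    return min(max(u_predicted, 0.0), upper)  # projection of the prediction onto [0, upper]
```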
We can exploit a learned policy for computing an estimate of the quality of a scenario
tree, or a bound on the exact value of the original multistage program P. The procedure
can be described as follows.

i. Generate a scenario tree using a tree building algorithm A. Solve the resulting
program P 0 , extract from its solution the first-stage decision ū1 , and the data
sets Dt of scenario/decisions pairs.

ii. Learn the decision predictors π̂t from the data set Dt for t = 2, . . . , T .

iii. Generate a test sample of n′ mutually independent scenarios {ξ j }1≤j≤n′ by sampling realizations of the random process ξ.

iv. For each scenario ξ j of the test sample, set uj1 = ū1 and compute sequentially
the recourse decisions uj2 , . . . , ujT . Each decision ujt is obtained by first evaluating
π̂t (ξ1j , . . . , ξt−1j ) and then restoring feasibility by the repair procedure Mt .

v. Estimate the performance of the learned decision policy on the test sample by forming the empirical average
\[
V_{TS}(A) = (1/n') \sum_{j=1}^{n'} f(\xi^j, u^j)\,,
\]
where the sum runs over the indices relative to the scenarios in the test sample and their associated decision sequences uj = (uj1 , . . . , ujT ). (A sketch of this simulation loop is given below.)
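A minimal Python sketch of steps iii to v follows; the scenario sampler, decision predictors, repair procedure and cost function are hypothetical callables standing in for ξ, π̂t , Mt and f.

```python
import numpy as np

# Simulate the learned policy on a Monte Carlo test sample and average the costs.

def estimate_policy_value(sample_scenario, u1_bar, predictors, repair, cost,
                          T, n_test=10000, seed=0):
    rng = np.random.default_rng(seed)
    values = np.empty(n_test)
    for j in range(n_test):
        xi = sample_scenario(rng)                 # one realization (xi_1, ..., xi_T)
        u = [u1_bar]                              # fixed first-stage decision
        for t in range(2, T + 1):
            u_pred = predictors[t](xi[: t - 1])   # pi_hat_t(xi_1, ..., xi_{t-1})
            u.append(repair(t, xi[: t - 1], u, u_pred))  # restore feasibility
        values[j] = cost(xi, u)                   # f(xi, u)
    return values.mean(), values.std(ddof=1) / np.sqrt(n_test)  # V_TS and its std. error
```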

The estimator VTS (A) computed in this way reflects the joint quality of the scenario
tree, the learned predictors and the repair procedures.
The estimator VTS (A) is obtained by simulating an explicit policy that generates
feasible decisions, and thus always provides a pessimistic bound (upper bound for min-
imization, lower bound for maximization) on the performance of the best policy that
could be inferred from the considered scenario tree, up to the standard error of the test
sample estimator. The pessimistic bound is also a reliable bound on the achievable per-
formance of a decision policy for the true problem, up to the standard error of the test
sample estimator.
4.2. Learning and Evaluation of Scenario Tree Based Policies 55

Note that in theory, a learned policy is not necessarily worse than a shrinking-horizon
policy using the same first-stage decision ū1 , since the supervised learning step could
actually improve the quality of the recourse decisions uj2 , . . . , ujT .
The pessimistic bound can be made tighter by testing various policies obtained from
the same scenario tree, but with different learning algorithms and/or repair procedures.
The best combination of algorithms and learning parameters could then be retained.
Note, however, that due to the optimistic bias induced by the selection of the best bound
on the test sample of size n0 , the value of the best policy should be evaluated again by
simulation on a new independent test sample of size n00 .
It is also possible to exploit estimators relative to policies learned from different
scenario trees but computed on the same test sample of size n0 . We may even expect
that scenario tree variants can be ranked reliably based on the value of these estimators,
despite the variance of the estimator due to the randomness in the generation of the test
sample, and despite a new source of bias due to the use of suboptimal recourse decisions
obtained from the learned policies. These ideas will be further developed in Section 4.3.
Note also that the input space of a learned policy is a simple matter of convenience.
As long as the policy remains non-anticipative, the input space can be described differ-
ently, typically by including explicitly past decisions, state variables, and additional
features derived from the information state, which might facilitate the generalization of
the decisions in the data sets, or later on, the online evaluation of the learned decision
predictors. These ideas are illustrated in Section 4.4.3.
To simplify the exposition in the sequel, we will assume that all the considered al-
gorithms for learning policies use the same repair procedures Mt , and differ only by the
choice of the hypothesis space Ht for π̂t (space of functions considered by the supervised
learning algorithm). It is convenient to denote the possible hypothesis spaces by Htλ ,
where λ belongs to some index set Λ. For instance, λ could represent the weight of a
regularization term used in the supervised learning algorithm. For simplicity, we assume
that Λ has a finite cardinality |Λ|.

4.2.2 Complexity Analysis

In this section, we consider the complexity of computing an upper bound (a performance guarantee) on the value of an exact multistage program P by simulating a series of policies learned from a single scenario-tree approximation P′.
Recall that we have assumed for simplicity that ut ∈ Rm and ξt ∈ Rd , 1 ≤ t ≤ T .
We denote by ū1 ∈ Rm the constant first-stage decision. For t = 2, . . . , T , the map-
pings π̂t : R(t−1)d → Rm , with values π̂t (ξ1 , . . . , ξt−1 ), represent the learned deci-
sion predictors, and the mappings Mt : R(t−1)d × R(t−1)m × Rm → Rm , with values
Mt (ξ1 , . . . , ξt−1 , u1 , . . . , ut−1 , ut ), represent the repair procedures.
Then the mappings π̄t : RdT → Rm , with values π̄t (ξ), defined iteratively for t =
1, . . . , T by

\[
\bar{\pi}_1(\xi) = \bar{u}_1\,, \qquad
\bar{\pi}_t(\xi) = M_t(\xi_1, \dots, \xi_{t-1};\, u_1, \dots, u_{t-1};\, \hat{\pi}_t(\xi_1, \dots, \xi_{t-1})) = u_t\,,
\]

correspond to a non-anticipative feasible decision policy π̄ = (π̄1 , . . . , π̄T ) for the original
56 Chapter 4. Validation of Solutions and Scenario Tree Generation

Algorithm 4.1 Selection and evaluation of a decision policy


Input: A first-stage decision ū1 , learning sets Dt of pairs (Xtk , Ytk ), hy-
pothesis spaces Htλ indexed by λ ∈ Λ, repair procedures Mt , and
a test sample of new scenarios (size n0 ).
Output: A feasible policy π̄.

1. For each λ ∈ Λ,
learn the decision predictors π̂tλ given the data sets Dt ,
using the hypothesis spaces Htλ , t = 1, . . . , T .

2. For each λ ∈ Λ,
evaluate the performance of the policy π̄ λ obtained by combining π̂tλ with Mt .
Let v λ be that performance evaluated on the common test sample of size n0 .

3. Select ν ∈ arg minλ∈Λ v λ and return π̄ ν .

4. Optional: return the value of v ν reevaluated on an independent test sample of size n″.

program P.
The computational complexity of exploiting π̄t on new scenarios depends on the com-
plexity of evaluating π̂t and Mt for all t.
The mappings π̂t should ideally be the best mappings from the best hypothesis spaces
one could consider, but in practice they correspond to the mappings identified by a given
supervised learning algorithm on the basis of the data sets Dt . We find it useful to
consider a series of policies in this section, because there is some leeway in the choice
of the supervised learning algorithm and/or its parameters, that can be exploited in the
search for ideal mappings.
In the usual supervised learning framework, one generally selects a model by evalu-
ating its performance on a fraction of the data set kept apart for testing purpose. In
the present setup, it is preferable to evaluate models by directly simulating the learned
policy π̄ on a common test sample of new scenarios (Algorithm 4.1).
If Algorithm 4.1 is merely run to select a best learned policy, a single test sample of size n′ on which the policies are compared suffices. If in addition an unbiased upper bound on the exact value of P is sought, an additional independent test sample of size n″ is required on which the best policy should be simulated again.
In practice, the selection bias may be very small if n′ is large enough with respect to the considered hypothesis spaces. Therefore, in some numerical experiments, we sometimes allow ourselves to report directly the estimates obtained on the first test sample of size n′.
To discuss the complexity of Algorithm 4.1, let us introduce the following quantities.

• cA : expected time for forming the approximation P′ to P using a scenario tree building algorithm A,

• cS : expected time for obtaining a solution to P′,



• cL (t): expected running time of the learning algorithm on data Dt ,

• cE (t): expected running time of the combined computation of π̂t and Mt on a new
scenario.

We assume that cL (1) = cE (1) = 0 since the first decision ū1 is fixed and simply
extracted from a solution to P 0 . For t ≥ 2, note that cL (t) and cE (t) usually grow with
the dimension of the random variables ξt , the dimension of the decisions ut , and the
cardinality of the data sets Dt . The ratio between cL (t) and cE (t) depends largely on
the type of supervised learning algorithm and the type of repair procedure for Mt . We
neglect the time for computing f (ξ, u) given ξ and u.
The following proposition is a straightforward consequence of the definition of Algo-
rithm 4.1:

4.1 Proposition. Algorithm 4.1 runs in expected time
\[
|\Lambda| \cdot \Big[ \sum_{t=2}^{T} c_L(t) + n' \sum_{t=2}^{T} c_E(t) \Big] = |\Lambda| \cdot \sum_{t=2}^{T} \big[ c_L(t) + n'\, c_E(t) \big]\,,
\]
starting from data sets obtained in expected time $c_A + c_S$. The optional step of Algorithm 4.1 adds to the expected time a term $n'' \sum_{t=2}^{T} c_E(t)$.

If Algorithm 4.1 is run on N parallel processes, one can essentially replace |Λ| in
Proposition 4.1 by |Λ|/N , and n00 by n00 /N .
The complexity of Algorithm 4.1 can be compared to the complexity of the usual
shrinking-horizon validation approach (Section 2.4.4). To this end, we extend our nota-
tions as follows.

• P(t) denotes the program for the minimization of the objective over the remaining
stages t, t + 1, . . . , T . The program P(1) is the original program P. Given real-
izations for ξ1 , . . . , ξt−1 and the corresponding implemented decisions ū1 , . . . , ūt−1 ,
one can obtain P(t) by replacing in P the random variables ξ1 , . . . , ξt−1 by their
outcomes, conditioning the distribution of ξt , . . . , ξT accordingly, and introducing
the constraints π1 (ξ) = ū1 , . . . , πt−1 (ξ) = ūt−1 .

• P 0 (t) denotes a scenario-tree approximation to P(t), built by some algorithm A(t).

• cA (t) denotes the expected time for forming the approximation P 0 (t) to P(t) using
algorithm A(t) for building a scenario tree over the shrunk horizon.

• cS (t) denotes the expected time for obtaining a solution to P 0 (t).

If the tree building algorithm A(t) is based on pure Monte Carlo sampling, cA (t) should be relatively small, and approximately proportional to the size of the scenario tree. If A(t) is based on a deterministic method and the dimension of the random process is not small (say, 1 or 2), cA (t) may actually be quite large, even for t near the horizon T . The
time cS (t) can also be quite large, except perhaps for t = T or t close to T .
In Section 2.4.4, we had denoted all the algorithms A(t) simply by A, and written
VTS (A) for the estimate produced by the shrinking-horizon validation approach. In
the following proposition, we assume that the shrinking-horizon approach is run on the

independent test sample of size n00 used for reevaluating the best policy selected by
Algorithm 4.1.

4.2 Proposition. The shrinking-horizon approach runs in expected time
\[
n'' \cdot \sum_{t=2}^{T} \big[ c_A(t) + c_S(t) \big]\,,
\]
using a first-stage decision obtained in expected time $c_A + c_S$.

(The use of N parallel processes makes it possible to replace n″ by n″/N .)


Note that when the scenario tree building algorithms A(t) are not deterministic, each
algorithm A(t) constitutes a new source of variance for the shrinking-horizon estimate.
These new sources of variance can greatly affect the sample size n00 that would be needed
to obtain a meaningful estimate.
The comparison of the complexity estimates stated in Propositions 4.1 and 4.2 sug-
gests that Algorithm 4.1 should be far more tractable than the shrinking-horizon valida-
tion strategy, provided that |Λ| is kept under control, and cE (t) is small enough.

4.2.3 Other Validation Strategies

In this section, we mention variants of the validation approaches discussed above, mo-
tivated by the complexity estimates of Propositions 4.1 and 4.2. These variants are
interesting to consider, but their implementation has been left as future work.
We begin by observing that it is possible to combine the two preceding validation approaches (supervised learning of policies and shrinking-horizon optimization) by coupling learned policies at stages 2, . . . , t0 with a shrinking-horizon decision-making procedure for t = t0 + 1, . . . , T . This hybrid approach would run in expected time
\[
\sum_{t=2}^{t_0} |\Lambda| \cdot \big[ c_L(t) + n'\, c_E(t) \big] \;+\; \sum_{t=t_0+1}^{T} n' \cdot \big[ c_A(t) + c_S(t) \big]\,,
\]

starting from data obtained in expected time cA + cS . We would also add to the expected time the term
\[
n'' \cdot \Big[ \sum_{t=2}^{t_0} c_E(t) + \sum_{t=t_0+1}^{T} \big[ c_A(t) + c_S(t) \big] \Big]\,,
\]
relative to the reevaluation of the selected hybrid policy on the test sample of size n″.
The stage number t0 could be chosen to minimize the expected running time, namely (neglecting the reevaluation term)
\[
t_0 = \sup\{\, t \le T : |\Lambda| \cdot [c_L(t) + n'\, c_E(t)] \le n' \cdot [c_A(t) + c_S(t)] \,\}\,,
\]

but a complication with the optimal choice of t0 is the possible dependence of the standard
error of the estimates on nondeterministic algorithms A(t).
Another possible variant is to carry out the selection of the models for π̂ 1 , . . . , π̂T
sequentially, that is, stage by stage. To describe this variant, we extend our notations as
follows.

• The index λ ∈ Λ is replaced by indices (λ1 , . . . , λT ) ∈ Λ1 × · · · × ΛT . This allows us to denote by Htλt a hypothesis space for the predictor π̂t , with the choice of λt ∈ Λt decoupled from the previous choices of λ1 , . . . , λt−1 .

Algorithm 4.2 Stage by stage selection and evaluation of a decision policy


Input: A first-stage decision ū1 , a data set Dt of pairs (Xt , Yt ) for t = 2
only, hypothesis spaces Htλt indexed by λt ∈ Λt , repair proce-
dures Mt , and a test sample of new scenarios (size n0 ).
Output: A feasible policy π̄.

1. Set π̄1 (ξ) = ū1 , and then set t = 2.

2. For each λt ∈ Λt ,
learn the predictor π̂tλt given the data set Dt , using the hypothesis space Htλt ;
build the policy π λt = (π̄1 , . . . , π̄t−1 , πtλt ), where πtλt combines π̂tλt and Mt .
If t = T , go to Step 4.

3. For each λt ∈ Λt ,
form and solve the problem P′+ (t; π λt ); let v λt denote its optimal value.
Set νt ∈ argminλt ∈Λt v λt and set π̄t = πtνt .
Form the data set Dt+1 relative to ut+1 from the solution to P′+ (t; π νt ).
Set t to t + 1 and go to Step 2.

4. For each λT ∈ ΛT ,
evaluate the performance of π λT on the common test sample of size n0 ;
let v λT be that performance.

5. Set νT ∈ argminλT ∈ΛT v λT and set π̄T = πTνT . Return π̄ = (π̄1 , . . . , π̄T ).

6. Optional: return the value of v νT reevaluated on an independent test sample of size n″.

• Given t < T and a policy π̄ † = (π̄1 , . . . , π̄t ) specified only from stage 1 to stage t,
the notation P+ (t; π̄ † ) refers to the original program P over a policy π, subject to
the additional constraints π1 (ξ) = π̄1 (ξ), . . . , πt (ξ) = π̄t (ξ). Thus, the program
P+ (t; π̄ † ) is the original problem P, except that π̄1 , . . . , π̄t are already specified.
• P′+ (t; π̄ † ) denotes the scenario-tree approximation to P+ (t; π̄ † ) built by some algorithm A(t). The algorithm A(t) must always return the same tree, while the trees relative to A(1), . . . , A(T − 1) must all be different. Thus, P′+ (t; π̄ † ) is the approximate program P′ posed over a new scenario tree proper to t, and subject to the additional constraints uk1 = π̄1 (ξ k ), . . . , ukt = π̄t (ξ k ) for all k.

Algorithm 4.2 describes how decision predictors are learned from data sets that in-
corporate the effect of the decision rules already selected for the previous stages, and
left unspecified for the subsequent stages. At each stage t, there is also a selection step
among possible decision predictors indexed by λt ∈ Λt .
Indeed, the advantage of Algorithm 4.2 over Algorithm 4.1 is that the learning problem
for a decision predictor for stage t + 1 takes into account the learned decision rules
π̄1 , . . . , π̄t . As the learned decision rules introduce a loss of optimality and modify the
information states that can be reached at stage t + 1, other recourse decisions at stages
t + 1, . . . , T are preferable, and in fact, the ideal recourse decisions are those that would

be obtained by solving the problem P+ (t; π̄ † ) with π̄ † suitably defined (see Step 2 of
Algorithm 4.2, where π λt plays the role of π̄ † ). We cannot solve P+ (t; π̄ † ), but we
can exploit a scenario-tree approximation P′+ (t; π̄ † ) from which a data set Dt+1 for
learning π̂t+1 can be constructed (Step 3 of Algorithm 4.2).
From the statistical point of view, the main drawback of Algorithm 4.2 is that it
cannot evaluate on the test sample of size n0 a policy that is not specified on the full
horizon. Instead, Step 3 in Algorithm 4.2 performs a weak form of model selection by
0
scoring the incomplete policies of Step 2 with the optimal value of the programs P + (t; π̄ † ).
0 †
The programs P+ (t; π̄ ) use a common scenario tree independent of the trees relative to
π̄1 , . . . , π̄t−1 , so as to reduce the selection bias. The selection is weak in the sense that
the score of an incompletely specified policy π̄ † is not a reliable estimate of the optimal
value of the exact program P+ (t; π̄ † ). An unbiased upper bound on the exact value of P
can be obtained by the optional Step 6 of Algorithm 4.2.
From the complexity point of view, the main drawback of Algorithm 4.2 is that the
programs P′+ (t; π̄ † ) must be solved for each λt ∈ Λt , t = 2, . . . , T . Another concern is the
new source of variance of the test sample estimates coming from the use of several scenario
trees, which could force us to use larger test samples.

4.3 Monte Carlo Selection of Scenario Trees

We now sketch a workable and generic scheme for obtaining approximate solutions to
a multistage stochastic program with performance guarantees, and for selecting good
scenario-tree approximations to the multistage stochastic program. The scheme builds
on the validation procedure described in Section 4.2.1 (Algorithm 4.1), which infers a
decision policy from examples of scenarios and decisions collected from a scenario-tree
approximation, and also computes an accurate estimate of the value of the learned policy
by Monte Carlo simulation.
A first idea simply consists in perturbing the data sets Dt of scenario/decisions pairs
used by the supervised learning procedure, by obtaining these data sets from different
scenario-tree approximations. This source of variation creates new opportunities for
finding better policies by supervised learning.
A second idea consists in identifying good scenario trees, on the basis of the perfor-
mance of the policies that can be learned from the data sets Dt collected from those
trees. This approach makes it possible to study empirically the algorithms that construct the scenario-
tree approximations, and to tune or modify these algorithms so as to improve the solution
procedure in terms of solution accuracy or in terms of computational complexity.

4.3.1 Description of the Selection Scheme

In this section, we describe the scheme that allows us to identify good scenario trees. The
scheme consists in generating a possibly large set of randomized scenario-tree approxi-
mations P 0 for a given problem P, ranking them according to the estimated value of the
best policy learned from them, and identifying in this way a presumably best scenario
tree among the considered sample of trees. The best policy of the best scenario tree is
then viewed as the best solution for P found by the method, and its value can be assessed

Algorithm 4.3 Monte Carlo selection of scenario trees


Input: Algorithms AS for building random tree structures and AV for building the trees for the process ξ given a branching structure, and 3 independent test samples of size n′, n″, n‴, made of realizations of ξ sampled independently.
Output: A scenario tree, a feasible policy π̄, a performance guarantee.

1. Generate a set T of scenario trees, using algorithms AS and AV .

2. For each tree in the set T indexed by 1 ≤ ν ≤ M ,
select a best policy π ν learned from the data set extracted from the tree,
using Algorithm 4.1 on the first test sample of size n′.

3. For each policy π ν ,
reassess the performance of π ν on the second test sample of size n″.
Let v ν denote that performance.

4. Set µ ∈ argmin1≤ν≤M v ν .
Return the scenario tree indexed by µ and the policy π̄ = π µ .

5. Reevaluate π µ on the third test sample of size n‴.
Let V µ denote that performance.
Return the bound minπ E{f (ξ, π(ξ))} ≤ E{f (ξ, π̄(ξ))} ≈ V µ (see text for details).

by Monte Carlo simulation on an independent test sample. Algorithm 4.3 describes each
step of the procedure.
Having enough diversity in the considered scenario-tree approximations multiplies our
chance of obtaining good data sets, from which good policies can be learned. Therefore,
it is interesting to assume that the generated scenario trees have a random branching
structure — a novelty with respect to the usual practice of multistage stochastic program-
ming. In our presentation of Algorithm 4.3, we formally decompose a tree generation
algorithm A into 2 components: AS for generating a random branching structure, and
AV for sampling realizations ξ k of the random process ξ according to the fixed branching
structure and for assigning probabilities to the nodes of the tree. Existing tree generation
methods from the stochastic programming literature, briefly discussed in Section 2.4.2,
correspond to a particular choice of AV .
Developing algorithms AS able to generate rich but tractable branching structures,
for low-dimensional processes or for high-dimensional processes, valid for short horizons
and long horizons, is an interesting open problem. We have investigated several variants
for AS in the context of a concrete family of problems (Section 4.4.2), without how-
ever providing general-purpose algorithms for generating random branching structures
adapted to high-dimensional random processes.
Algorithm 4.3 uses in theory 3 independent test samples of size n0 , n00 , n000 : one for
selecting a best hypothesis space for the best learned policy from the data sets relative
to a given tree; one for selecting the best tree; and one for estimating the performance of
the overall best policy. In practice, we do not always reevaluate our estimates on distinct

independent test samples.


The tree and the policy that Algorithm 4.3 returns can be exploited in several ways.
We have already mentioned the possibility of tuning the scenario tree generation algo-
rithms to the problem at hand. Another possibility is to implement only the first-stage
decision π̄1 of the policy returned by Algorithm 4.3, and therefore to see the whole proce-
dure as a single step of a general shrinking-horizon or receding-horizon decision making
scheme.
It is also possible to use the policy π̄ on the full horizon, with a performance guarantee, since π̄ should achieve, for the original problem P, an objective value close to the performance guarantee estimated on the independent test sample of size n‴. More precisely, the variance of the empirical estimate
\[
V^\mu = \frac{1}{n'''} \sum_{j=1}^{n'''} f(\xi^j, \bar{\pi}(\xi^j))
\]
in Step 5 of Algorithm 4.3 should be approximately equal to
\[
\hat{\sigma}^2 = \frac{1}{n'''(n'''-1)} \sum_{j=1}^{n'''} \big[ f(\xi^j, \bar{\pi}(\xi^j)) - V^\mu \big]^2\,,
\]
so that $V^\mu + z_\alpha \hat{\sigma}$ would yield a conservative estimate of the performance of π̄ with confidence 1 − α, where $z_\alpha$ is the α-critical value of the standard normal distribution (Shapiro et al., 2009, Section 5.6).
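For concreteness, a minimal Python sketch of this confidence bound computation follows; the array of simulated costs is assumed to come from the reevaluation of Step 5.

```python
import numpy as np
from scipy.stats import norm

def conservative_bound(costs, alpha=0.05):
    """One-sided (1 - alpha) upper confidence bound on E{f(xi, pi_bar(xi))}
    computed from simulated costs f(xi^j, pi_bar(xi^j)) on the third test sample."""
    n = len(costs)
    v_mu = np.mean(costs)                            # empirical estimate V^mu
    sigma_hat = np.std(costs, ddof=1) / np.sqrt(n)   # standard error of the mean
    z_alpha = norm.ppf(1 - alpha)                    # alpha-critical value of N(0, 1)
    return v_mu + z_alpha * sigma_hat
```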
We could also mention that when the only information on the random process ξ is a
finite set of realizations ξ j , the tree selection method could be extended as follows. One
would split the set of realizations into a test set and a learning set from which a generative
model for simulating realizations of the random process would be inferred. The random
scenario trees would then be built by querying new samples from the generative model.

4.3.2 Discussion

The generic procedure presented in this section is based on various open ingredients that
may be exploited for the design of a wide class of algorithms in a flexible way. Namely, the
main ingredients are (i) the scenario tree sampling scheme, (ii) the (possibly regularized)
optimization technique used to obtain data sets from a scenario tree, (iii) the supervised
learning algorithm used to obtain the decision strategies from the data sets, (iv) the
repair procedure used to restore the feasibility of the decisions on new scenarios.
The main ideas of the proposed scheme are evaluated in the case study section on
a family of problems proposed by other authors. We illustrate how one may adjust the
scenario tree generation algorithm and the policy learning algorithm to one’s needs, and
by doing so we also illustrate the flexibility of the proposed approach and the potential
of the combination of scenario-tree based decision making with supervised learning. In
particular, the efficiency of supervised learning strategies makes it possible to rank large
numbers of policies inferred from large numbers of randomly generated scenario trees.
Although we do not illustrate this in the present work, we would also like to stress
that the scenario tree sampling scheme may be coupled in various other ways with the
inference of policies by machine learning. For example, one could seek to use sequential
Monte Carlo techniques inspired from the importance sampling literature, in order to
progressively guide the scenario tree sampling and machine learning methods towards
regions of high interest, given the quality of the policies inferred from scenario trees
at previous iterations. Also, instead of using the data set obtained from each scenario
tree to extract a policy, one could use data sets collecting data from several scenario-
tree approximations to extract a single policy, in the spirit of the wide range of model

perturbation and combination schemes reviewed in Chapter 3.

4.4 Case Study

We will demonstrate the value of the approximate solution techniques presented in this chapter by applying them to a family of multistage stochastic programs. Implementation choices that are difficult to discuss in general terms, such as those concerning the supervised learning of a policy for the recourse decisions and the random generation of the trees, will be illustrated on a concrete case.
The section starts with the formulation of a multistage stochastic program that various
researchers have presented as difficult for scenario tree methods (Hilli and Pennanen,
2008; Koivu and Pennanen, 2010; Küchler and Vigerske, 2010). Several instances of
the problem will be addressed, including instances on horizons considered as almost
unmanageable by scenario tree methods.

4.4.1 Description of the Problem

We consider a multistage problem adapted from Hilli and Pennanen (2008), interpreted in
that paper as the valuation of an electricity swing option. In this chapter, we interpret
the problem rather as the search for risk-aware strategies for distributing the sales of
a commodity over T stages in a flexible way adapted to market prices. A risk-aware
objective is very interesting for our purposes, but it is difficult to justify it in a context
of option valuation. The formulation of the problem is as follows:
\[
\begin{aligned}
\text{minimize} \quad & \rho^{-1} \log \mathbb{E}\Big\{\exp\Big\{-\rho \sum_{t=1}^{T} \xi_{t-1} \cdot \pi_t(\xi)\Big\}\Big\} \\
\text{subject to} \quad & 0 \le \pi_t(\xi) \le 1 \ \text{ and } \ \sum_{t=1}^{T} \pi_t(\xi) \le Q\,, \qquad (4.1)\\
& \pi \ \text{non-anticipative.}
\end{aligned}
\]

The objective uses the exponential utility function, with risk aversion coefficient ρ.
Such objectives are discussed at the end of the chapter.
In our formulation of the problem, there is no constant first-stage decision to optimize.
We begin directly by the observation of ξ0 , followed by a recourse decision u1 = π1 (ξ0 ).
Observations and decisions are intertwined so that in general ut = πt (ξ0 , . . . , ξt−1 ). The
random variable ξt−1 is the unitary profit (ξt−1 > 0) or loss (ξt−1 < 0) that can re-
sult from the sale of the commodity at time t. Potential profits and losses fluctuate
in time, depending on market conditions (we later select a random process model for
market prices to complete the problem specification). The commodity is sold in quantity
ut = πt (ξ0 , . . . , ξt−1 ) at time t, meaning that the quantity ut can depend on past and
current prices. The decision is made under the knowledge of the potential profit or loss
at time t, given by ξt−1 · ut , but under uncertainty about future prices. This is, incidentally, why scenario tree techniques must be used with great care on this problem when the planning horizon is long: as soon as the scenarios cease to branch, there is no
more residual uncertainty on future prices, and the optimization process wrongly iden-
tifies opportunities anticipatively. Those spurious future opportunities may significantly
degrade the quality of previous decisions.
We seek strategies where the sales per stage are bounded (constraint 0 ≤ πt (ξ) ≤ 1).
The constraint can model a bottleneck in the production process. Notice also that

bounded sales are consistent with the model assumption of an exogenous random process:
very large sales are more likely to influence the market prices on long planning horizons.
The scalar Q bounds the total sales (we assume Q ≥ 1). It represents the initial stock of
commodity, the sale of which must be distributed optimally over the horizon T .
When the risk aversion coefficient ρ tends to 0, the problem reduces to the search for
a risk-neutral strategy. This case has been studied by Küchler and Vigerske (2010). It
admits a linear programming formulation:
\[
\begin{aligned}
\text{minimize} \quad & -\mathbb{E}\Big\{\sum_{t=1}^{T} \xi_{t-1} \cdot \pi_t(\xi)\Big\} \\
\text{subject to} \quad & 0 \le \pi_t(\xi) \le 1 \ \text{ and } \ \sum_{t=1}^{T} \pi_t(\xi) \le Q\,, \qquad (4.2)\\
& \pi \ \text{non-anticipative,}
\end{aligned}
\]

and an exact analytical solution (which thus serves as a reference)
\[
\pi_t^{\text{ref}}(\xi) =
\begin{cases}
0 & \text{if } t \le T - Q \ \text{ or } \ \xi_{t-1} \le 0\,, \\
1 & \text{if } t > T - Q \ \text{ and } \ \xi_{t-1} > 0\,.
\end{cases}
\qquad (4.3)
\]

• In a first series of experiments, we will take the numerical parameters and the process ξ selected in Hilli and Pennanen (2008) (to ease the comparisons): ρ = 1, T = 4, Q = 2; ξt = (exp{bt } − K) where K = 1 is the fixed cost (or the strike price, when the problem is interpreted as the valuation of an option) and bt is a random walk: b0 = σε0 , bt = bt−1 + σεt , with σ = √0.2 and εt following a standard normal distribution N (0, 1).
Noting that a priori bt = σ(ε0 + · · · + εt ) is normally distributed with mean 0 and variance (t + 1)σ² , we record, for future reference, that the first process ξ is such that
\[
\frac{1}{\sigma\sqrt{t}}\, \log(\xi_{t-1} + K) \ \text{ is a priori distributed as } \ N(0,1)\,, \qquad (4.4)
\]
where σ = √0.2 and K = 1.

• In a second series of experiments over various values of the parameters (ρ, T, Q) with T up to 52, we will take for ξ the process selected in Küchler and Vigerske (2010) (because otherwise on long horizons the price levels of the first process blow up in an unrealistic way, making the problem rather trivial): ξt = ξ′t − K with ξ′t = ξ′t−1 exp{σεt − σ²/2} where σ = 0.07, K = 1, and εt following a standard normal distribution. Equivalently ξt = (exp{bt − (t + 1) σ²/2} − K) with bt a random walk such that b0 = σε0 and bt = bt−1 + σεt .
We record, for future reference, that the second process ξ is such that
\[
\frac{1}{\sigma\sqrt{t}}\, \log(\xi_{t-1} + K) + \frac{\sigma\sqrt{t}}{2} \ \text{ is a priori distributed as } \ N(0,1)\,, \qquad (4.5)
\]
where σ = 0.07 and K = 1.
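For concreteness, the following Python sketch simulates the second price process and estimates by Monte Carlo the value of the reference policy (4.3); the parameter values ρ = 0, Q = 20, T = 52 are the risk-neutral case of Table 4.1, and the sketch is an illustrative check rather than the thesis code. Its output can be compared with the corresponding reference entry in Table 4.1.

```python
import numpy as np

def simulate_xi(T, sigma=0.07, K=1.0, rng=None):
    """One realization (xi_0, ..., xi_{T-1}) of the second price process (4.5)."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal(T)
    b = sigma * np.cumsum(eps)                   # b_t = b_{t-1} + sigma * eps_t
    t_plus_1 = np.arange(1, T + 1)
    return np.exp(b - t_plus_1 * sigma**2 / 2) - K   # xi_t = exp{b_t - (t+1) sigma^2/2} - K

def reference_policy_value(T=52, Q=20, n_test=10000, seed=0):
    """Monte Carlo estimate of -E{sum_t xi_{t-1} u_t} under the reference policy (4.3)."""
    rng = np.random.default_rng(seed)
    totals = np.empty(n_test)
    for j in range(n_test):
        xi = simulate_xi(T, rng=rng)             # xi[t-1] is observed before deciding u_t
        u = [1.0 if (t > T - Q and xi[t - 1] > 0) else 0.0 for t in range(1, T + 1)]
        totals[j] = -np.dot(xi, u)
    return totals.mean()
```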

4.4.2 Algorithms for Generating Small Scenario Trees

At the heart of the tree selection procedure lies our ability to generate scenario trees reduced to a very small number of scenarios, with interesting branching structures. As the trees

are small, they can be solved quickly and then scored using the supervised learning policy
inference procedure. Fast testing procedures make it possible to rank large numbers of
random trees.
The generation of random branching structures has not been explored in the classical
stochastic programming literature; we thus have to propose a first family of algorithms
in this section. The algorithms are developed with our needs in view, with the feedback
provided by the final numerical results of the tests, until results on the whole set of con-
sidered numerical instances suggest that a particular algorithm suffices for the application
at hand. We believe that the main ideas behind the algorithms will be reused in subse-
quent work for addressing the representation of stochastic processes of higher dimensions.
Therefore, in the following explanations we put more emphasis on the methodology we
followed than on the final resulting algorithms.

Method of Investigation.

The branching structure is generated by simulating the evolution of a branching process.


We will soon describe the branching process that we have used, but observe first that
the probability space behind the random generation of the tree structure is not at all
related to the probability space of the random process that the tree approximates. It is
the values and probabilities of the nodes that are later chosen in accordance with the target
probability distribution, either deterministically or randomly, using any new or existing
method.
For selecting the node values, we have tested different deterministic quantizations of
the one-dimensional continuous distributions of random variables ξt , and alternatively
different quantizations of the gaussian innovations t that serve to define ξt = ξt (t ), as
described by the relations given in the previous section. Namely, we have tested the min-
imization of the quadratic distortion (Pages and Printems, 2003) and the minimization
of the Wasserstein distance (Hochreiter and Pflug, 2007). On the considered problems
we did not notice significant differences in performance attributable to a particular de-
terministic variant.
What happened was that with deterministic methods, performance began to degrade
as the planning horizon was increased, perhaps because trying to preserve statistical
properties of the marginal distributions ξt distorts other statistics of the joint distribution
of (ξ0 , . . . , ξT −1 ), especially in higher dimensions. Therefore, for treating instances on
longer planning horizons, we switched to a crude Monte Carlo sampling for generating
node values.
By examining trees with the best scores in the context of the present family of prob-
lems, we observed that the empirical estimates of several statistics of the random process,
that were computed from the values and probabilities of the nodes of these scenario trees,
could be very far from values consistent with the exact model of the random process. For instance, even the empirical first moments $\sum_{k=1}^{N} p^k \xi_t^k$ could be very far from their theoretical values E{ξt }. This observation might suggest that it is very difficult to predict,
without any information on the optimal solutions, which properties should be preserved
in small scenario trees, and thus which objective should be optimized when attempting
to build a small scenario tree. If we had discovered a correlation between some features
of the trees and the scores, we could have filtered out bad trees without actually solving

the programs associated to these trees, simply by computing the identified features.

Description of the Branching Processes.

We now describe the branching process used in the first series of experiments, made
with deterministic node values. Let r ∈ [0, 1] denote a fixed probability of creating a
branching. We start by creating the root node of the tree (depth 0), to which we assign
the conditional probability 1. With probability r, we create 2 successor nodes to which we
assign the values ±0.6745 and the conditional probabilities 0.5 (see Remark 4.1 below).
With probability (1 − r) we create instead a single successor node to which we assign
the value 0 and the conditional probability 1; this node is a degenerate approximation
of the distribution of t . Then we take each node of depth 1 as a new root and repeat
the process of creating 1 or 2 successor nodes to these new roots randomly. The process
is further repeated on the nodes of depth 2, . . . , T − 1, yielding a tree of depth T for
representing the original process 0 , . . . , T −1 . The scenario tree for ξ is derived from the
scenario tree for .

Remark 4.1 (Wasserstein distance). The discrete distribution that assigns probabilities 0.5 to the values ±0.6745 is the discrete distribution with support of cardinality 2 that has the smallest Wasserstein distance l1 to the normal distribution N (0, 1) followed by εt . The Wasserstein distance l1 may be defined as follows. Let X, Y be random variables following marginal distributions G and H respectively. Assume that G and H are such that X and Y have finite first moments. Let P denote the collection of coupling measures between X and Y , that is, the collection of probability measures P such that X follows G, and Y follows H. Then the Wasserstein distance l1 between G and H is defined as l1 (G, H) = inf P∈P {E{|X − Y |}}. It admits a dual representation l1 (G, H) = supf ∈F1 {E{f (X)} − E{f (Y )}}, where F1 = {f : R → R : |f (x) − f (y)| ≤ |x − y|} denotes the class of functions with Lipschitz constant 1. It can be shown that the distribution with values y k and probabilities pk , 1 ≤ k ≤ N , closest in the l1 sense to a density g (with respect to the Lebesgue measure) can be computed as follows: set y 0 = −∞, y N+1 = +∞ and minimize over y 1 < y 2 < · · · < y N the sum
\[
\sum_{k=1}^{N} \int_{(y^{k-1}+y^k)/2}^{(y^k+y^{k+1})/2} |x - y^k|\, g(x)\, dx\,, \quad \text{and then set} \quad p^k = \int_{(y^{k-1}+y^k)/2}^{(y^k+y^{k+1})/2} g(x)\, dx\,.
\]
For the case N = 2 and g the density of N (0, 1), one can use a symmetry argument and then evaluate $\arg\min_{y} \int_0^\infty |x - y|\, (2\pi)^{-1/2} \exp\{-x^2/2\}\, dx \simeq 0.6745$.
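This value can be checked numerically; the following Python sketch (an illustrative assumption relying on SciPy) recovers 0.6745 both by direct minimization and via the standard normal quantile at 3/4, which solves the corresponding first-order condition.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Numerical check of Remark 4.1: the positive atom of the best symmetric two-point
# l1-quantizer of N(0,1) minimizes the integral of |x - y| * g(x) over [0, infinity).

def half_loss(y):
    val, _ = quad(lambda x: abs(x - y) * norm.pdf(x), 0.0, 10.0)  # tail truncated at 10
    return val

res = minimize_scalar(half_loss, bounds=(0.0, 3.0), method="bounded")
print(res.x)            # approximately 0.6745
print(norm.ppf(0.75))   # the same value in closed form
```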

For problems on larger horizons, it is difficult to keep the size of the tree under
control with a single fixed branching parameter r — the number of scenarios would have
a large variance. Therefore, in the second series of experiments (made with random node
values), we used a slightly more complicated branching process, by letting the branching
probability r depend on the number of scenarios currently developed (Algorithm 4.4).
Specifically, let N be a target number of scenarios and T a target depth for the scenario
tree with the realizations of ξt relative to depth t + 1. Let nt be the number of parent

Algorithm 4.4 Branching structure generation for the second series of experiments
Input: A targeted number of scenarios N ≥ 1, and a tree depth T ≥ 1.
Output: A random branching structure for a scenario tree having n ≈ N scenarios.

1. Create a root node (depth 0). Set t = 0.

2. Set nt to the number of nodes at depth t. Set rt = (1/nt ) · (N − 1)/T .

3. For each node j of depth t:
Draw Zj uniformly in the interval [0, 1].
If Zj ≤ rt , append 2 children nodes to node j (binary branching).
If Zj > rt , append 1 child node to node j (no branching).

4. If t < T − 1, increment t and go to Step 2. Otherwise, return the branching structure.

nodes at depth t. Note that nt is a random variable, except at the root where n0 = 1.
During the construction of the tree, parent nodes at depth t < T are developed and split
into 2 children nodes with probability rt = (1/nt )(N − 1)/T . Parent nodes have a single child node with probability 1 − rt . If rt > 1, we set rt = 1 and all nodes are split into 2 children nodes. Thus in general rt = min{1, (1/nt )(N − 1)/T }. Note that the truncation
of rt to 1 has no effect on Algorithm 4.4 and has thus been omitted.
Algorithm 4.4 produces branching structures having approximately N scenarios in
the following sense. Assume that the number nT −1 of existing nodes at depth T − 1 is
large. By the independence of the random decision of creating 1 or 2 successor nodes,
and by a concentration of measure argument, the number of nodes created at depth T is
approximately equal to

\[
n_T = n_{T-1}\,\big(2\, r_{T-1} + 1 \cdot (1 - r_{T-1})\big) = n_{T-1}\,(1 + r_{T-1}) = n_{T-1} + (N-1)/T.
\]
Iterating this recursion yields nT = n0 + T · (N − 1)/T = N . To establish this latter result, we have neglected the fact that when nt−1 is small, the random value of nt conditionally on nt−1 should not be approximated by the conditional mean of nt , as done in the recursive formula. Thus, we have only nT ≈ N . The error affects mostly the first levels
of the tree under development, and seems to have a relatively small effect in practice.
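For completeness, a minimal Python transcription of Algorithm 4.4 follows; the list-of-children tree representation is an assumption of this sketch.

```python
import numpy as np

def random_branching_structure(N, T, rng=None):
    """Algorithm 4.4: random branching structure targeting about N scenarios, depth T."""
    rng = rng or np.random.default_rng()
    levels = [[0]]                     # node ids per depth; depth 0 holds the root
    children = {0: []}
    next_id = 1
    for t in range(T):                 # t = 0, ..., T-1
        n_t = len(levels[t])
        r_t = (1.0 / n_t) * (N - 1) / T        # branching probability at depth t
        new_level = []
        for node in levels[t]:
            n_children = 2 if rng.uniform() <= r_t else 1
            for _ in range(n_children):
                children[node].append(next_id)
                children[next_id] = []
                new_level.append(next_id)
                next_id += 1
        levels.append(new_level)
    return levels, children            # len(levels[T]) is the number of scenarios (about N)
```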

4.4.3 Algorithm for Learning Policies

Solving a program on a scenario tree yields a data set of scenario/decision sequence pairs
(ξ, u). To infer a decision policy that generalizes the decisions of the tree to test scenarios,
we have to learn mappings from (ξ0 , . . . , ξt−1 ) to ut and ensure the compliance of the
decisions with the constraints. To some extent the procedure is thus problem-specific.
Here again we insist on the methodology.

Dimensionality Reduction.

We first try to represent the information state (ξ0 , . . . , ξt−1 ) by a smaller number of variables, because the representation (ξ0 , . . . , ξt−1 ) risks becoming very cumbersome as t grows. In particular, we can try to get back to a state-action space representation of
the policy (and postprocess data sets accordingly to recover the states). Note that in
general, the states we need are those that would be used by a hypothetical reformulation
of the optimization problem using dynamic programming. Here the objective is based
on the exponential utility function. By the property that
\[
\mathbb{E}\Big\{\exp\Big\{-\sum_{t'=1}^{T} \xi_{t'-1} \cdot u_{t'}\Big\} \,\Big|\, \xi_0, \dots, \xi_{t-1}\Big\}
= \exp\Big\{-\sum_{t'=1}^{t-1} \xi_{t'-1} \cdot u_{t'}\Big\}\; \mathbb{E}\Big\{\exp\Big\{-\sum_{t'=t}^{T} \xi_{t'-1} \cdot u_{t'}\Big\} \,\Big|\, \xi_0, \dots, \xi_{t-1}\Big\}\,,
\]

we can see that decisions at t0 = 1, . . . , t − 1 scale by a same factor the contribution


to the return brought by the decisions at t0 = t, . . . , T . Therefore, if the feasibility set
at time t can be expressed from state variables, the decisions at t0 = t, . . . , T can be
optimized independently of the decisions at t0 = 1, . . . , t − 1. This suggests to express
ut as a function of the state ξt−1 of the process ξ, and of an additional state variable ζt
defined by
Pt−1
ζ0 = Q , ζt = Q − t0 =1 ut0 ,
PT
that allows to reformulate, at time t, the constraint t0 =1 ut0 ≤ Q in (4.1) as
PT
t0 =t ut0 ≤ ζt . (4.6)

Feasibility Guarantees Sought Before Repair Procedures.

We try to map the output space in such a way that the predictions learned under the
new geometry and then transformed back using the inverse mapping comply with the
feasibility constraints. Here, we rescale the output ut so that the quantity to be learned is the fraction
yt = yt (ξt−1 , ζt ) ∈ [0, 1] of the maximal allowed output min(1, ζt ). Indeed, note that
0 ≤ ut ≤ min(1, ζt ) summarizes the constraints of the problem at time t, namely the
constraint 0 ≤ πt (ξ) ≤ 1 in (4.1) and the constraint (4.6). Since ζ0 = Q is fixed
(with Q greater than 1 by assumption), we distinguish the cases u1 = y1 (ξ0 ) · 1 and
ut = yt (ξt−1 , ζt ) · min(1, ζt ). It will be easy to ensure that fractions yt predicted by the
learned models are valued in [0, 1] (thus we actually do not need to define an a posteriori
repair procedure).

Input Normalization.

It is convenient for the sequel to normalize the inputs. From the definition of ξt−1 we can recover the state of the random walk bt−1 , and use as first input xt1 = (σ² t)^{−1/2} bt−1 , which follows a standard normal distribution. Thus for the first version of the process ξ, recalling (4.4), instead of ξt−1 we use xt1 = σ^{−1} t^{−1/2} log(ξt−1 + K). For the second version of the process ξ, recalling (4.5), instead of ξt−1 we use xt1 = σ^{−1} t^{−1/2} log(ξt−1 + K) + σ t^{1/2}/2. Instead of the second input ζt (for t > 1) we use xt2 = ζt /Q, which is
valued in [0, 1]. We will also rewrite the fraction yt = yt (ξt−1 , ζt ) as yt = gt (xt1 , xt2 ) to
stress the change of input variables.

[Figure 4.1: feed-forward network diagram with inputs xt1 , xt2 , one hidden layer of tansig units, and a logsig output unit producing yt .]

Fig. 4.1: Neural network model with L = 3 hidden neurons for the component gt of the policy (4.7) to be learned from data. The figure is a graphical representation of (4.8). Training the neural networks consists in finding, for each t, values for the parameters vtjk , βtj , wtj , γt that best explain examples of pairs (xt , yt ).

To summarize, the decisions ut = πt (ξ) will be obtained as follows:

\[
\begin{aligned}
x_{t1} &= \begin{cases} \sigma^{-1} t^{-1/2} \log(\xi_{t-1} + K) & \text{for the process } \xi \text{ of (4.4)} \\ \sigma^{-1} t^{-1/2} \log(\xi_{t-1} + K) + \sigma t^{1/2}/2 & \text{for the process } \xi \text{ of (4.5)} \end{cases} \\
x_{t2} &= \zeta_t / Q = 1 - Q^{-1} \textstyle\sum_{t'=1}^{t-1} u_{t'} \qquad\qquad (4.7) \\
y_t &= g_t(x_{t1}, x_{t2}) \\
u_t &= y_t \cdot \min\{1, \zeta_t\} = y_t \cdot \min\Big\{1,\, Q - \textstyle\sum_{t'=1}^{t-1} u_{t'}\Big\}\,,
\end{aligned}
\]

with π non-anticipative and feasible if and only if gt is always valued in [0, 1].

Hypothesis Space.

We have to choose the hypothesis space for the functions gt in (4.7). In the present
situation, we find it convenient to choose the class of feed-forward neural networks with
one hidden layer of L neurons (Figure 4.1):
\[
g_t(x_{t1}, x_{t2}) = \mathrm{logsig}\Big( \gamma_t + \sum_{j=1}^{L} w_{tj} \cdot \mathrm{tansig}\Big( \beta_{tj} + \sum_{k=1}^{2} v_{tjk}\, x_{tk} \Big) \Big)\,, \qquad (4.8)
\]

with weights vtjk and wtj , biases βtj and γt , and activation functions

\[
\mathrm{tansig}(x) = 2 \cdot (1 + e^{-2x})^{-1} - 1 \ \ \text{valued in } [-1, +1]\,, \qquad
\mathrm{logsig}(x) = (1 + e^{-x})^{-1} \ \ \text{valued in } [0, 1]\,,
\]

a usual choice for imposing the output ranges [−1, +1] and [0, 1] respectively.
Since the training sets are extremely small, we take L = 2 for g1 (which has only one
input x11 ) and L = 3 for gt (t > 1).
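For illustration, a minimal NumPy sketch of the forward pass (4.8) follows; it is an assumption made for exposition (the thesis experiments use the Matlab Neural Network toolbox), and the parameter values shown are placeholders to be replaced by trained weights and biases.

```python
import numpy as np

def tansig(x):
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0   # valued in [-1, +1]

def logsig(x):
    return 1.0 / (1.0 + np.exp(-x))               # valued in [0, 1]

def g_t(x, v, beta, w, gamma):
    """Forward pass of (4.8): x holds (x_t1, x_t2), v is L-by-2, beta and w have
    L entries, gamma is a scalar; the output is a fraction y_t in [0, 1]."""
    hidden = tansig(beta + v @ x)                 # L hidden tansig units
    return logsig(gamma + w @ hidden)             # logsig output layer

# Placeholder parameters for L = 3 (to be determined by training in practice):
L = 3
rng = np.random.default_rng(0)
v, beta, w, gamma = rng.normal(size=(L, 2)), rng.normal(size=L), rng.normal(size=L), 0.0
print(g_t(np.array([0.5, 1.0]), v, beta, w, gamma))   # some value in (0, 1)
```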
We recall that artificial neural networks have been found to be well-adapted to nonlin-
ear regression. Standard implementations of neural networks (data structure construction
and training algorithms) are widely available (Demuth and Beale, 1993). We report here
the parameters chosen in our experiments for the sake of completeness; the method is
largely off-the-shelf.

Details on the Implementation.

The weights and biases are determined by training the neural networks. We used the
Neural Network toolbox of Matlab with the default methods for training the networks
by backpropagation — the Nguyen-Widrow method for initializing the weights and bi-
ases of the networks randomly, the mean square error loss function, and the Levenberg-
Marquardt optimization algorithm. We used [−3, 3] for the estimated range of xt1 , corresponding to 3 standard deviations, and [0, 1] for the estimated range of xt2 .
Trained neural networks are dependent on the initial weights and biases before train-
ing, because the loss minimization problem is nonconvex. Therefore, we repeat the
training 5 times from different random initializations. We obtain several candidate poli-
cies (to be ranked on the test sample). In our experiments on the problem with T = 4,
we randomize the initial weights and biases of each network independently. In our exper-
iments on problems with T > 4, we randomize the initial weights and biases of g1 (x11 )
and g2 (x21 , x22 ), but then we use the optimized weights and biases of gt−1 as the initial
weights and biases for the training of gt . Such a warm-start strategy accelerates the
learning tasks. Our intuition was that for optimal control problems, the decision rules
πt would change rather slowly with t, at least for stages far from the terminal horizon.
We do not claim that using neural networks is the only or the best way of building
models gt that generalize well and are fast in exploitation mode. The choice of the Matlab
implementation for the neural networks could also be criticized. It just turns out that
these choices are satisfactory in terms of implementation efforts, reliability of the codes,
solution quality, and overall running time.

4.4.4 Solving Programs Approximately by Linear Programming

An option of the proposed testing framework that we have not discussed, as it is linked
to technical aspects of numerical optimization, is that we can form the data sets of sce-
nario/decision pairs using inexact solutions to the optimization programs associated to
the trees. Indeed, simulating a policy based on any data set will still give a pessimistic
bound on the optimal solution of the targeted problem. The tree selection procedure will
implicitly take this new source of approximation into account. In fact, every approxima-
tion one can think of for solving the programs could be tested on the problem at hand
and thus ultimately accepted or rejected, on the basis of the performance of the policy
on the test sample, and the time taken by the solver to generate the decisions of the
data set. In the present setting, we judged that solving multiple instances of large-scale
nonlinear programs would be too slow with cvx, and preferred to use a large-scale linear
programming approximation of the initial objective.

Principle of the Approximation.

Here, we present an approximation used for the problems with ρ > 0 on horizons larger than T = 4, which turned out to perform satisfactorily on that family of problems. We approximated the function exp{z} in the objective by a convex piecewise linear approximation, expL {z} = maxj∈{0,1,...,J−1} {cj · z + dj }, with cJ−1 = dJ−1 = 0, and with cj , dj ∈ R chosen such that expL {zj } = exp{zj } on a sequence of anchor points

z0 > z1 > · · · > zJ−1 :


exp{zj+1 } − exp{zj } zj+1 exp{zj } − zj exp{zj+1 }
cj = , dj = .
zj+1 − zj zj+1 − zj

This allows us to approximate the nonlinear formulation of the targeted problem by a linear formulation, at a very light cost in terms of additional optimization variables (representing a new function v = v(ξ) valued in R) and at a controllable cost in terms of additional constraints (J new constraints per scenario ξ). Precisely, everything happens as if the targeted multistage program were formulated as
\[
\begin{aligned}
\text{minimize} \quad & \mathbb{E}\{v(\xi)\} \\
\text{subject to} \quad & v(\xi) \ge c_j \cdot \Big[-\rho \sum_{t=1}^{T} \xi_{t-1} \cdot \pi_t(\xi)\Big] + d_j \quad \text{for } j = 0, \dots, J-2\,, \\
& v(\xi) \ge 0 \quad \text{(case } j = J-1\text{)}, \\
& 0 \le \pi_t(\xi) \le 1 \ \text{ and } \ \sum_{t=1}^{T} \pi_t(\xi) \le Q\,, \\
& \pi \ \text{non-anticipative.}
\end{aligned}
\]

Details on the Implementation.

The anchor points zj may be chosen as follows. It is easy to see that at optimality we should always have πt (ξ) = 0 if ξt−1 < 0. This means that the arguments $z = -\rho \sum_{t=1}^{T} \xi_{t-1} \cdot \pi_t(\xi)$ of the exponential function will always be nonpositive at optimality. Thus we may set z0 = 0: the exponential function will be approximated by the linear function c0 · z + d0 for z > 0 during the optimization process, without loss of precision. On the other hand, in a finite-dimensional approximation, the support of the approximation to the distribution of ξt has a maximal value, say ξM . The minimal value of the argument of the exponential is thus greater than or equal to z̄ = −ρ · ξM · Q. Thus if zJ−1 ≤ z̄ the exponential function will be approximated by max{0, cJ−2 · z + dJ−2 } for z < zJ−1 during the optimization process, without loss of precision. We can then select J and zJ−1 < zJ−2 < · · · < z0 = 0, with zJ−1 ≤ z̄, such that the approximation of exp{z} by expL {z} is tight enough on the domain [z̄, 0]. For all z ∈ [z̄, 0], we have expL {z} ≥ exp{z} and max(expL {z} − exp{z}) < maxj {| exp{zj+1 } − exp{zj }|}.
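As an illustration, the following Python sketch (the anchor grid is a hypothetical choice) builds the coefficients cj , dj and checks that expL upper-bounds exp on [z̄, 0].

```python
import numpy as np

def piecewise_linear_exp(anchors):
    """Coefficients (c_j, d_j) of exp_L{z} = max_j (c_j z + d_j) for anchors z_0 > ... > z_{J-1};
    the last piece is the zero function (c_{J-1} = d_{J-1} = 0)."""
    z = np.asarray(anchors, dtype=float)
    c = (np.exp(z[1:]) - np.exp(z[:-1])) / (z[1:] - z[:-1])
    d = (z[1:] * np.exp(z[:-1]) - z[:-1] * np.exp(z[1:])) / (z[1:] - z[:-1])
    return np.append(c, 0.0), np.append(d, 0.0)

def exp_L(zz, c, d):
    zz = np.atleast_1d(zz)
    return np.max(c[None, :] * zz[:, None] + d[None, :], axis=1)

# Hypothetical anchor grid on [z_bar, 0], e.g. with z_bar = -rho * xi_M * Q = -3:
anchors = np.linspace(0.0, -3.0, 13)       # z_0 = 0 > ... > z_{J-1} = -3
c, d = piecewise_linear_exp(anchors)
zz = np.linspace(-3.0, 0.0, 1001)
gap = exp_L(zz, c, d) - np.exp(zz)
print(gap.min() >= -1e-12, gap.max())      # exp_L >= exp, with a small maximal gap
```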
For solving the linear programs we still use the interior-point solver associated to cvx.
One could also switch to simplex methods — arguments in favor of simplex methods may
be found in Bixby (2002).

4.4.5 Numerical Results

We now describe the numerical experiments we have carried out and comment on the
results.

Experiment on the short-horizon problem instance.

First, we consider the process ξ and parameters (ρ, Q, T ) taken from Hilli and Pennanen
(2008). We generate a sample of n′ = 10000 scenarios drawn independently, on which each
learned policy will be tested. We generate 200 random tree structures as described previ-
ously (using r = 0.5 and rejecting structures with less than 2 or more than 10 scenarios).
[Figure 4.2: scatter plot of the cost of each learned policy on the test sample (vertical axis, roughly 0.58 to 0.66) against the number of scenarios of the corresponding tree (horizontal axis, 1 to 10), with one point per random scenario tree.]

Fig. 4.2: First experiment: scores on the test sample associated to the random scenario trees (lower is better). The linear segments join the best scores of policies inferred from trees of equivalent complexity.

[Figure 4.3: diagrams of four small scenario trees over stages ξ0 , ξ1 , ξ2 , ξ3 , with node values such as +1.472, +0.828, +0.352, +0.000, −0.260, −0.453, −0.595 and scenario probabilities pk .]

Fig. 4.3: Small trees (5, 6, 7, 9 scenarios) from which good data sets could be obtained. The scenarios ξ k = (ξ0k , . . . , ξ3k ) are shifted vertically to distinguish them when they pass through common values, written on the left. Scenario probabilities pk are indicated on the right.

Node values are set by the deterministic method, thus the variance in performance that
we will observe among trees of similar complexity will come mainly from the branching
structure. We form and solve the programs on the trees using cvx, and extract the data
sets. We generate 5 policies per tree, by repeatedly training the neural networks from
random initial weights and biases. Each policy is simulated on the test sample and the
best of the 5 policies is retained for each tree.
The result of the experiment is shown in Figure 4.2. Each point is relative to a
particular scenario tree. Points from left to right are relative to trees of increasing size.

We report the value of $(1/n') \sum_{j=1}^{n'} \exp\{-\sum_{t=1}^{T} \xi_{t-1}^j \cdot \hat{\pi}_t(\xi^j)\}$ for each learned policy π̂, in
accordance with the objective minimized in Hilli and Pennanen (2008). Lower is better.
Notice the large variance of the test sample scores among trees with the same number of
scenarios but different branching structures.
The tree selection method requires only a single lucky outlier to output a good valid upper bound on the targeted objective, which is quite an advantage with respect to approaches based on worst-case reasoning for building a single scenario tree. With a particular tree of 6 scenarios (best result: 0.59) we already reach the guarantee that the optimal value of our targeted problem is less than or equal to log(0.59) ≈ −0.5276. In Figure 4.3, we have
represented graphically some of the lucky small scenario trees associated to the best
performances. Of course, tree structures that perform well here may not be optimal for
other problem instances.
The full experiment, from which Figures 4.2 and 4.3 are drawn, takes 10 minutes to run on a PC with a single 1.55 GHz processor and 512 MB RAM. By comparing our bounds to the results reported in Hilli and Pennanen (2008) (who have undertaken validation experiments taking up to 30 hours on a PC with a single 3.8 GHz processor, 8 GB RAM,
using a test sample of 10000 scenarios, and whose Figure 1 seems to indicate that the
best possible value for the bounds should be slightly greater than 0.58), we deduce that
we reached essentially the quality of the optimal solution.

Experiment on long-horizon problem instances.

Second, we consider the process ξ taken from Küchler and Vigerske (2010) (see Equa-
tion (4.5)) and a series of 15 sets of parameters for ρ, Q, T (see the first columns of
Table 4.1). We repeat the following experiment on each (ρ, Q, T ) with 3 different values
for the parameter N that controls the size of the random trees obtained with Algo-
rithm 4.4: Generate 25 random trees (we recall that this time the node values are also
randomized), solve the resulting 25 programs, learn 5 policies per tree (depending on the
random initialization of the neural networks), and report as the best score (best upper
bound) the lowest of the resulting 125 values computed on a common test sample of
n′ = 10000 scenarios. The test sample is specific to the problem instance (in fact, specific to the time horizon T).
Table 4.1 reports values corresponding to the average performance

ρ^{−1} log{ (1/n′) Σ_{j=1}^{n′} exp{ −ρ Σ_{t=1}^{T} ξ_{t−1}^j · π̂_t(ξ^j) } }

obtained for the considered series of problem instances, for the 3 considered nominal tree
sizes N (so as to illustrate the effect of the size of the trees on the performance of the
learned policies). One column is dedicated to the performance of the analytical reference policy π^ref on the test sample.
Note that the case considered by Küchler and Vigerske (2010) corresponds to (ρ, Q, T) = (0, 20, 52) in our table. The plots from their Figure 3 seem to confirm that the optimal value for this case is around −3.6.
For the cases with ρ = 0, the reference value provided by the analytical optimal policy
suggests that the best policies found by our approach are close to optimality. For the

Tab. 4.1: Second experiment: Best upper bounds for a family of problem instances.

Problem            Upper bounds¹ on the value of problems (4.1) with the process (4.5)

ρ     Q    T       Reference²     Value of the best policy³, for 3 tree sizes N
                                  N = 1·T      N = 5·T      N = 25·T

0     2    12      -0.19          -0.18        -0.17        -0.18
      2    52      -0.40          -0.34        -0.32        -0.39
      6    12      -0.51          -0.50        -0.49        -0.49
      6    52      -1.19          -1.07        -1.03        -1.18
      20   52      -3.64          -3.59        -3.50        -3.50
0.25  2    12      -0.18          -0.17        -0.17        -0.17
      2    52      -0.34          -0.32        -0.31        -0.33
      6    12      -0.44          -0.44        -0.44        -0.44
      6    52      -0.75          -0.78        -0.78        -0.80
      20   52      -1.46          -1.89        -1.93        -1.91
1     2    12      -0.15          -0.15        -0.15        -0.15
      2    52      -0.22          -0.25        -0.22        -0.24
      6    12      -0.31          -0.34        -0.34        -0.34
      6    52      -0.37          -0.53        -0.53        -0.54
      20   52      -0.57          -0.96        -0.98        -0.96

¹ Estimated on a test sample of n′ = 10000 scenarios.
  In a given row, lower is better. The best upper bound is in bold.
² Defined by π_t^ref(ξ) (Equation (4.3)) and optimal for the risk-neutral case ρ = 0.
³ Out of 125 policies learned from 25 random scenario trees (considered separately)
  of about N scenarios, built with Algorithm 4.4.

cases with ρ = 0.25, the reference policy is now suboptimal. It still slightly dominates the learned policies when Q = 2, but no longer when Q = 6 or Q = 20. For the cases with ρ = 1, the reference policy is dominated by the learned policies, except perhaps for
the cases with Q = 2. We also observe that results obtained with smaller trees (cases
N = 1 · T ) are sometimes better than results obtained with larger trees (cases N = 25 · T ,
that is, N = 300 if T = 12 and N = 1300 if T = 52). There is indeed a random
component in our tree generation approach, and it may happen that one small tree
ultimately gives a better data set than the data sets of the large trees, especially given
the relatively small number of trials in this experiment (25 trees per size N ) compared
to the number of stages.
Overall, the approach seems promising in terms of the usage of computational re-
sources. Table 4.2 reports the times taken for computing the bounds reported in Ta-
ble 4.1, using a Matlab/cvx implementation on a PC with a single 1.55 GHz processor and 512 MB RAM. We recall that obtaining one bound involves generating 25 trees, form-
ing and solving the 25 corresponding mathematical programs, learning 125 policies, and
testing the 125 policies on 10000 scenarios. For instance, obtaining one of the 15 bounds
of the column N = 1 · T of Table 4.1 takes between 2 minutes (for the case ρ = 0, Q = 2,
T = 12) and 9 minutes (for the case ρ = 1, Q = 20, T = 52). Obtaining one of the 15

Tab. 4.2: CPU times for computing the bounds in Table 4.1.

Problem            Total¹ CPU time (in seconds)

ρ     Q    T       N = 1·T      N = 5·T      N = 25·T

0     2    12      122          156          220
      2    52      415          551          1282
      6    12      123          150          223
      6    52      435          590          1690
      20   52      465          666          1783
0.25  2    12      136          169          250
      2    52      460          780          2955
      6    12      133          161          263
      6    52      504          1002         4702
      20   52      524          1084         5144
1     2    12      136          168          268
      2    52      485          986          4425
      6    12      139          187          313
      6    52      524          1095         5312
      20   52      543          1234         6613

¹ Time for generating 25 trees of about N scenarios,
  forming and solving the corresponding 25 programs,
  learning 125 policies, and testing each policy on 10⁴ scenarios.

bounds of the column N = 25 · T of Table 4.1 takes from less than 4 minutes (for the case
ρ = 0, Q = 2, T = 12, N = 300) to 110 minutes (for the case ρ = 1, Q = 20, T = 52,
N = 1300).
The experiment shows that even if the proposed scenario tree selection method re-
quires generating and solving several trees, rather than one single tree, it can work very
well. In fact, the experiment illustrates that with a random tree generation process that
can generate an “interesting” set of small trees, there is a good likelihood (on the studied family of problems) that at least one of those trees will lead to excellent performance.

4.5 Time Inconsistency and Bounded Rationality Limitations

This section briefly discusses the notion of a dynamically consistent decision process, which is relevant to sequential decision making with risk sensitivity, as opposed to the optimization of the expectation of a total return over the planning horizon, which can be described as risk-indifferent, or risk-neutral.

4.5.1 Time-Consistent Decision Processes

We will say that an objective induces a dynamically consistent policy, or time-consistent policy, if the decisions selected by a policy optimal for that objective coincide with the decisions selected by a policy recomputed at any subsequent time step t and optimal for the same objective with decisions and observations prior to t set to their realized value

(and decisions prior to t chosen according to the initial optimal policy).


Time-consistent policies are not necessarily time-invariant: we simply require that the
optimal mappings πt from information states it to decisions ut at time t, evaluated from
some initial information state at t = 0, do not change if we take some decisions following
these mappings, and then decide to recompute them from the current information state.
We recall that in the Markov Decision Process framework, the information state i t is
the current state xt , and in the multistage stochastic programming framework, it is the
current history (ξ1 , . . . , ξt−1 ) of the random process, with t indexing decision stages. We
say that a decision process is time-consistent if it is generated by a time-consistent policy.
A closely related notion of time-consistency can also be defined by saying that the preferences
of the decision maker among possible distributions for the total return over the planning
horizon can never be affected by future information states that the agent recognizes, at
some point in the decision process, as impossible to reach (Shapiro, 2009; Defourny et al.,
2008).
In the absence of time-consistency, the following situation may arise (the discussion
is made in the multistage stochastic programming framework). At time t = 1, an agent
determines that for each possible outcome of a random variable ξ2 at time t = 2, the
decision u2 = a at time t = 2 is optimal (with respect to the stated objective and
constraints of the problem, given the distribution of ξ2 , ξ3 , . . . , and taking account of
optimized recourse decisions u3 , u4 , . . . over the planning horizon). Then at time t = 2,
having observed the outcome of the random variable ξ1 and conditioned the probability
distributions of ξ2 , ξ3 , . . . over this observation, and in particular, having ruled out all
scenarios where ξ1 differs from the observed outcome, the agent finds that for some
possible realizations of ξ2 , u2 = a is not optimal.
The notion of time-consistency already appears in Samuelson (1937), who states: “as
the individual moves along in time there is a sort of perspective phenomenon in that
his view of the future in relation to his instantaneous time position remains invariant,
rather than his evaluation of any particular year” (page 160). Several economists have
rediscovered and refined the notion (Strotz, 1955; Kydland and Prescott, 1977), especially
when trying to apply expected utility theory (von Neumann and Morgenstern, 1947),
valid for comparisons of return distributions viewed from a single initial information
state, to sequential decision making settings, where the information state evolves.
In fact, if an objective function subject to constraints can be optimized by dynamic
programming, in the sense that a recursive formulation of the optimization is possible
using value functions (on an augmented state space if necessary, and irrespectively of
complexity issues), then an optimal policy will satisfy the time-consistency property. This
connection between Bellman’s principle (1957) and time-consistency is well-established
(Epstein and Schneider, 2003; Riedel, 2004; Ruszczyński and Shapiro, 2006; Boda and
Filar, 2006; Artzner et al., 2007). By definition and by recursion, a value function is
not affected by states that have a zero probability to be reached in the future; when the
value function is exploited, a decision ut depends only on the current information state
it . Objectives that can be optimized recursively include the expected sum of rewards,
and the expected exponential utility of a sum of rewards (Howard and Matheson, 1972),
with discount permitted, although the recursion gets more involved (Chung and Sobel,
1987). A typical example of objective that cannot be rewritten recursively in general is
the variance of the total return over several decision steps. This holds true even if the

state fully describes the distribution of total returns conditionally to the current state.
Note, however, that a nice way of handling a mean-variance objective on the total return is to relate it to the expected exponential utility: if R denotes a random total return, Φ_ρ{R} = E{R} − (ρ/2) var{R} ≈ −ρ^{−1} log E{exp(−ρR)}. The approximation holds for small ρ > 0. It is exact for all ρ > 0 if R follows a Gaussian distribution.
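The second-order expansion behind this approximation can be made explicit; the following derivation is a standard cumulant-expansion argument, recalled here only as a reminder.

% Cumulant expansion of the exponential certainty equivalent, for small rho > 0:
\begin{aligned}
-\rho^{-1}\log \mathbb{E}\{\exp(-\rho R)\}
  &= -\rho^{-1}\Big(-\rho\,\mathbb{E}\{R\} + \tfrac{\rho^2}{2}\,\mathrm{var}\{R\} + O(\rho^3)\Big)\\
  &= \mathbb{E}\{R\} - \tfrac{\rho}{2}\,\mathrm{var}\{R\} + O(\rho^2).
\end{aligned}
% For Gaussian R, all cumulants beyond the second vanish, so that
% log E{exp(-rho R)} = -rho E{R} + (rho^2/2) var{R} exactly, for every rho > 0.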

4.5.2 Limitations of Validations Based on Learned Policies

In our presentation of multistage stochastic programming, we did not discuss several


extensions that can be used to incorporate risk awareness in the decision making pro-
cess. In particular, a whole branch of stochastic programming is concerned with the
incorporation of chance constraints in models (Prékopa, 1970; Prékopa, 1995), that is,
constraints to be satisfied with a probability less than 1. Another line of research in-
volves the incorporation of modern risk measures such as the conditional value-at-risk at
level α (expectation of the returns relative to the worst α-quantile of the distribution of
returns) (Rockafellar and Uryasev, 2000). An issue raised by many of these extensions,
when applied to sequential decision making, is that they may induce time-inconsistent
decision making processes (Boda and Filar, 2006).
The validation techniques based on supervised learning that we have proposed are not
adapted to time-inconsistent processes. Indeed, these techniques rely on the assumption
that the optimal solution of a multistage stochastic program is a sequence of optimal
mappings πt from reachable information states (ξ1 , . . . , ξt−1 ) to feasible decisions ut ,
uniquely determined by some initial information state at which the optimization of the
mappings takes place. We believe, however, that the inability to address the full range
of possible multistage programming models should have minor practical consequences.
On the one hand, we hardly see the point of formulating a sophisticated multistage
model with optimal recourse decisions unrelated to those that would be implemented
if the corresponding information states are actually reached. On the other hand, it is
always possible to simulate any learned policy, whatever the multistage model generating
the learning data might be, and score an empirical return distribution obtained with the
simulated policy according to any risk measure important for the application. Computing
a policy and sticking to it, even if preferences are changing over time, is a form of
precommitment (Hammond, 1976).
Finally, let us observe that a shrinking-horizon policy can be time-inconsistent for
two reasons: (i) the policy is based on an objective that cannot induce a time-consistent
decision process; (ii) the policy is based on an objective that could be reformulated using
value functions, but anyway the implicit evaluation of these value functions changes over
time, due to numerical approximations local to the current information state. Similarly,
if an agent uses a supervised-learning based policy to take decisions at some stage and is
then allowed to reemploy the learning procedure at later stages, the overall decision se-
quence may appear as dynamically inconsistent. The source (ii) of inconsistency appears
rather unavoidable in a context of bounded computational resources; more generally, it
seems that bounded rationality (Simon, 1956) would necessarily entail dynamical incon-
sistency.

4.6 Conclusions

This chapter has presented a generic procedure for estimating the value of approximate
solutions to multistage stochastic programs. A direct application of this procedure is
the evaluation of the quality of the discretization of the original program. The proposed
selection of a best scenario tree among an ensemble of trees generated randomly, with the
branching structure also randomized, contributes to bring partial answers to the general
problem of building good scenario trees efficiently.
Our simple description of the proposed tree selection scheme (Algorithm 4.3), based on an ensemble of random scenario trees generated independently, is less naive than it might appear at first sight, even with more advanced Monte Carlo sampling techniques for generating the trees sequentially in mind. Indeed, there is a severe dimensionality challenge in
the search for a proper approximate representation of a random process ξ = (ξ 1 , . . . , ξT )
by a scenario tree, already on short horizons, say T equal to 4 or 5, and especially if
the dimension of the random vectors ξt is larger than say 1 or 2. In that context, it is
not clear whether more advanced importance sampling schemes would be tractable for
problems of practical interest.
On the other hand, given a scenario tree and optimal decisions associated to its nodes,
there is still much liberty in the way a policy can be learned, and in the way the feasibility
of the output of a learned decision predictor can be efficiently restored. The next chapter
will explore some of these possibilities.
We leave as future work the investigations concerning policies learned from the data
obtained from several scenario trees. Based on the numerical results collected in this
chapter, our first intuition is that the trees would have first to be sorted out. Indeed,
many trees, as we currently generate them, give very poor decisions. Adding the decisions
of such trees to a common data set is likely to hurt policies learned from the common
data set. The issue, however, is that we can sort out trees only if we can score them.
Currently, we score the trees by testing a policy learned from them. Our conclusion is that
learning a policy from several scenario trees would imply a computationally intensive,
boosting-like approach: the best policies learned from, say, the largest trees one could solve would serve to identify the scenario/decisions pairs to be collected in a large data set, which would then be used by a next generation of policies. Such ideas are difficult
to test and refine on the problems we have considered in this chapter, because the best
policies learned from single trees already yield near-optimal results.
Chapter 5

Inferring Decisions from Predictive Densities

In this chapter, we investigate alternative methods for learning feasible policies given
a data set of scenario/decisions pairs. We seek to infer conditional probability models
(predictive densities) for the decisions ut given the information state (ξ1 , . . . , ξt−1 ), and
then to obtain feasible decisions on new scenarios ξ by maximizing online the probability
of the decision ut subject to the current feasibility constraints ut ∈ Ut (ξ).
The chapter is organized in a backward fashion: Section 5.1 assumes that a predictive
density is available and seeks to exploit it so as to select a feasible decision; Section 5.2
concentrates on the inference of conditional predictive densities, given the current in-
formation state. In Section 5.3, a certain number of the proposed ideas are illustrated,
evaluated, and sometimes modified, in the context of a particular problem.

Notations.

In this chapter, we use the following notations.

• A^T ∈ R^{n×m} is the transpose of A ∈ R^{m×n}.

• ⟨a, b⟩ = a^T b is the inner product between 2 vectors a, b of the same dimension.

• ||a|| = ⟨a, a⟩^{1/2} is the Euclidean norm of the vector a.

• For x = [x_1 . . . x_n]^T and y = [y_1 . . . y_n]^T ∈ R^n, x ⪯ y means x_i ≤ y_i, 1 ≤ i ≤ n,
  and x ≺ y means x_i < y_i, 1 ≤ i ≤ n.

• Given column vectors z_1, . . . , z_n, we freely write z = (z_1, . . . , z_n) to define a column
  vector z = [z_1^T . . . z_n^T]^T, especially when the vectors z_i are replaced by vectors with
  superscripts.

5.1 Constrained MAP Repair Procedure

We consider the following setup: Given a predictive density p̂t for the decision ut ∈ Rn ,
infer (select) a decision ūt such that ūt satisfies the feasibility constraints ūt ∈ Ut (ξ).
The given density p̂t is in fact an estimated density, obtained for instance as described in
Section 5.2. For the selection of a decision from the density, we maximize (the logarithm
of) the predictive density subject to constraints, which leads to the following estimate:

ūt ∈ argmaxut ∈Ut (ξ) log p̂t (ut ) . (5.1)



If p̂t is unimodal with its mode in Ut (ξ), then ūt is given by the mode of p̂t . The nontrivial
case is when the mode is not in Ut (ξ).
To make the approach computationally viable, we have to introduce restrictions on the
density p̂t and the feasible sets Ut . Moreover, one may want to ensure that the solution
set in (5.1) is a singleton. Indeed, we are interested in the situation where ū t is viewed
as the decision of a deterministic policy πt (ξ1 , . . . , ξt−1 ); by selecting ūt arbitrarily from
the solution set, an undesirable source of randomness would be added to the decision
process.

5.1.1 Assumptions

An interesting restriction on the models for p̂t is to assume that p̂t is taken from an
exponential family of distributions.
The following description of exponential families will suffice for our purposes. Given
an index set I of finite cardinality |I| = d, and a finite collection {φ` }`∈I of functions
φ` : Rn → R, let φ(ut ) ∈ Rd denote the d-dimensional column vector with elements
φ` (ut ), ` ∈ I, and define the (natural) exponential family associated to the collection
{φ` }`∈I as

p(u_t; θ) = exp{⟨θ, φ(u_t)⟩ − A(θ)} ,   (5.2)

where θ is allowed to take values from a set Θ ⊂ Rd described below, and where A(θ) is
the so-called cumulant generating function (log-partition function) defined by
A(θ) = log ∫_{R^n} exp{⟨θ, φ(u_t)⟩} du_t .   (5.3)

Choosing a value for θ amounts to selecting a distribution among the members of the
exponential family.
The domain of the parameter θ is the set Θ = {θ ∈ Rd : A(θ) < ∞}. In the
terminology of Appendix A, the set Θ is the effective domain of the cumulant generating
function A(θ) viewed as an extended-real-valued function. The (natural) exponential
family is said to be regular if Θ is open. In the sequel, we assume that the family is
regular. It is well-known (Brown, 1986; Robert, 2007; Wainwright and Jordan, 2008) that
A(θ) is a convex function of θ (and thus in particular that Θ is convex). Moreover, A(θ)
is strictly convex for the so-called minimal exponential families. Minimal exponential
families are (natural) exponential families such that the functions φ` , ` ∈ I, and the
constant-valued function φ0 (x) = 1, form a set of linearly independent functions — that
is, for any θ ≠ 0, ⟨θ, φ(u_t)⟩ is not a constant-valued function of u_t.
For minimal exponential families, there is a one-to-one correspondence between a
value θ ∈ Θ and a distribution from the family.
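As a concrete example (a standard fact, recalled here only for illustration), the univariate normal distributions form a minimal regular exponential family in the sense above:

% Univariate normal N(mu, sigma^2) written in the form (5.2):
N(u_t;\mu,\sigma^2) \propto \exp\Big\{\underbrace{\tfrac{\mu}{\sigma^2}}_{\theta_1} u_t
  + \underbrace{\big(-\tfrac{1}{2\sigma^2}\big)}_{\theta_2} u_t^2\Big\},
\qquad \phi(u_t) = (u_t,\; u_t^2),\qquad
\Theta = \{\theta \in \mathbb{R}^2 : \theta_2 < 0\},

so that Θ is open (the family is regular), and φ_1, φ_2 together with the constant function are linearly independent (the family is minimal).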
Using p(ut ; θ) from (5.2) for p̂t (ut ), the problem (5.1) becomes

ū_t ∈ argmax_{u_t ∈ U_t(ξ)} ⟨θ, φ(u_t)⟩ ,   (5.4)

which is independent of the constant term A(θ), and corresponds formally to a maximum
a posteriori (MAP) estimation problem subject to additional constraints.
To ensure that (5.4) has a solution, we assume that the set Ut (ξ) is nonempty, closed,
and convex. Moreover, we assume that the support of p(ut ; θ) meets the interior of Ut (ξ),

in order to guarantee that (5.4) has a nonempty solution set and does not lead to a
pathological optimization problem. It is well-known that the support of exponential
families does not depend on the value of their parameter θ. Therefore, given a subset C
of Rn such that Ut (ξ) is always in C for all ξ (it is possible to choose C = Rn ), one can
choose the exponential family so that its support covers C.

5.1.2 Particularizations

In multistage stochastic programming models, a frequent form for a set U t (ξ) is

U_t(ξ) = {u_t ∈ R^n : A_t u_{t−1} + B_t u_t = h_t , u_t ⪰ 0} ,   (5.5)

where ut−1 is the decision relative to the previous stage (with ut−1 actually depending
only on ξ1 , . . . , ξt−2 ), Bt is very often a fixed matrix (recourse matrix), and At , ht are
a matrix (technology matrix) and a vector that may both depend on ξ1 , . . . , ξt−1 (often
only affinely). The form (5.5) is in part justified by results for two-stage stochastic
programming problems (Appendix D). When one uses (5.4) for computing ū t online on
a new scenario, the realizations of ut−1 , At and ht are known, and (5.4) becomes the
problem of solving over ut ∈ Rn the program

maximize   ⟨θ_t, φ(u_t)⟩   subject to   B_t u_t = h_t − A_t u_{t−1} ,  u_t ⪰ 0 .   (5.6)

In the sequel, we seek to identify some exponential families that lead to a concave
objective in (5.6).

Multivariate normal distributions.

We consider for p̂t in (5.1) the multivariate normal distribution N (λ, Λ) with mean
λ ∈ Rn and covariance matrix Λ ∈ Rn×n (we do not stress in the notation λ, Λ a possible
dependence of these parameters on ξ1 , . . . , ξt−1 and on t). We assume that Λ is positive
definite, so that the normal distribution has a density, namely,

p̂_t(u_t) = ((2π)^n det Λ)^{−1/2} exp{−(1/2)(u_t − λ)^T Λ^{−1}(u_t − λ)}
         = exp{−(1/2)(tr{Λ^{−1} u_t u_t^T} − 2⟨Λ^{−1}λ, u_t⟩ + λ^T Λ^{−1} λ − log{det Λ^{−1}/(2π)^n})} .

In that case, using the precision matrix S = Λ−1 , the program (5.6) becomes the strictly
convex quadratic program

minimize    (u_t − λ)^T S (u_t − λ)                                   (5.7)
subject to  B_t u_t = h_t − A_t u_{t−1} ,  u_t ⪰ 0 .

The program (5.7) has a simple geometrical interpretation (Figure 5.1) in terms of the
Mahalanobis distance dM (ut , vt ) = ||S 1/2 (ut − vt )||2 between two vectors ut , vt ∈ Rn
(Mahalanobis, 1936). For conditions ensuring that the feasibility set is nonempty, see
Definitions D.3, D.4, D.5 in Appendix D.
A zero-valued element Sij of the precision matrix has the interpretation that the com-
ponents i, j of ut are conditionally independent given the other components (Dempster,
1972).

Fig. 5.1: Geometrical interpretation for (5.7). The matrix S defines a Mahalanobis distance
in Rn . The program (5.7) consists in computing the projection of λ on the set Ut (ξ)
according to this metric, by minimizing the distance between λ and ut ∈ Ut (ξ). On this
figure, ut ∈ R2 , and λ 6∈ Ut (ξ). The level set corresponding to the optimal objective
value f ∗ has been drawn (dashed line).
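As an illustration of how the repair step (5.7) can be carried out in practice, the following sketch solves the projection with cvxpy; the thesis experiments use Matlab/cvx, so the library, the function name, and the toy data below are assumptions made only for this example.

import cvxpy as cp
import numpy as np

def map_repair_gaussian(lam, S, B, A, h, u_prev):
    """Project the predicted mean lam onto U_t(xi) in the Mahalanobis metric S."""
    n = lam.shape[0]
    u = cp.Variable(n)
    objective = cp.Minimize(cp.quad_form(u - lam, S))
    constraints = [B @ u == h - A @ u_prev, u >= 0]
    cp.Problem(objective, constraints).solve()
    return u.value

# Toy data (assumed, for illustration only)
rng = np.random.default_rng(0)
n, m = 4, 2
lam = rng.normal(size=n)               # predicted (possibly infeasible) decision
S = np.eye(n)                          # precision matrix of the predictive density
B = rng.normal(size=(m, n))
A = rng.normal(size=(m, n))
h = B @ np.abs(rng.normal(size=n)) + A @ np.ones(n)   # keeps the program feasible
u_bar = map_repair_gaussian(lam, S, B, A, h, np.ones(n))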

For stochastic programming problems where the components of the decisions u t can be
put in correspondence with spatial locations, for instance problems defined on networks,
it could make sense to use a Gaussian Markov random field model (Speed and Kiiveri,
1986) for the density p̂t (ut ).

Product of log-concave univariate densities.

We consider exponential families obtained as the product of log-concave univariate densities p̂it (densities such that log p̂it(·) is a concave function) relative to the i-th component
of ut = [ut 1 . . . ut n ]T . Here, p̂it is taken from an exponential family relative to a collec-
tion of functions {φi` }`∈I i , through the choice of a parameter vector θ i ∈ Θi . We write
φi (ut i ) for the vector collecting the elements φi` (ut i ), ` ∈ I i . With these choices, the
program (5.6) becomes

maximize   Σ_{i=1}^{n} ⟨θ^i, φ^i(u_t i)⟩   subject to   B_t u_t = h_t − A_t u_{t−1} ,  u_t ⪰ 0 .   (5.8)

The form (5.8) is well suited to situations where probabilistic models for the scalar
components ut i have been learned separately, so as to obtain more tractable learning
problems. There is probably some structure among the components ut i once ut is op-
timized, and we may hope that by enforcing the condition ut ∈ Ut (ξ), we recover, to a
certain extent, a part of that structure — while the part of the structure that is induced
by the objective function of the original multistage decision making problem is unlikely
to be restored by this myopic feasibility restoration procedure.
As an example of log-concave density, we can cite the univariate normal distribution
N (µ, σ 2 ) with σ 2 > 0. Another potentially useful example is the gamma distribution
Γ(α, β) with α ≥ 1 (condition ensuring the log-concavity) and β > 0, supported on
(0, ∞). If we choose for p̂it the gamma distribution Γ(αi , βi ), then the density of p̂it is
given by

p̂it(u_t i) = (β_i)^{α_i} [Γ(α_i)]^{−1} (u_t i)^{α_i−1} exp{−β_i u_t i}
           = exp{−β_i u_t i + (α_i − 1) log u_t i − log{[Γ(α_i)]/(β_i)^{α_i}}} ,


Fig. 5.2: Geometrical interpretation for (5.10). The parameters αi of the marginal distributions
Γ(αi , βi ) for 1 ≤ i ≤ n define a weighted Itakura-Saito distance in Rn (see Remark 5.1).
The program (5.10) consists in computing the projection of the mode mp of the dis-
tribution of ut on the set Ut (ξ) according to this (pseudo) metric, by minimizing the
distance between mp and ut ∈ Ut (ξ), where mp = [mp 1 . . . mp n ]T , mp i = (αi − 1)/βi .
On the present figure, ut = [ut 1 ut 2 ]T ∈ R2 , and mp 6∈ Ut (ξ). The level set corre-
sponding to the optimal objective value f ∗ has been drawn (dashed line).

where Γ(α_i) = ∫_0^∞ t^{α_i−1} exp{−t} dt is the gamma function evaluated at α_i. One then
obtains the objective component

⟨θ^i, φ^i(u_t i)⟩ = −β_i u_t i + (α_i − 1) log u_t i ,   (5.9)

which is strictly concave if α_i > 1. Note that its unconstrained maximization would then yield the mode of the distribution Γ(α_i, β_i), namely m_p i := (α_i − 1)/β_i.
Now, if for instance each component ut i follows a distribution Γ(αi , βi ) with αi > 1
and βi > 0, the program (5.8) becomes the strictly convex program
minimize    Σ_{i=1}^{n} β_i u_t i − Σ_{i=1}^{n} (α_i − 1) log u_t i              (5.10)
subject to  B_t u_t = h_t − A_t u_{t−1} ,  u_t ⪰ 0

over the decision vector ut = [ut 1 , . . . , ut n ]T .


A geometrical interpretation of (5.10) is presented on Figure 5.2. In the context of
stochastic programming, it could make sense to choose a Gamma density for a decision u t i
(or some invertible transform of ut i ) that is naturally valued on the positive reals.
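The Gamma-based repair (5.10) can be solved in essentially the same way as the Gaussian case; the following cvxpy sketch is again only an assumed illustration (variable names and the choice of library are not those of the reported experiments). The constraint u_t ⪰ 0 is enforced implicitly through the domain of the logarithm.

import cvxpy as cp
import numpy as np

def map_repair_gamma(alpha, beta, B, A, h, u_prev):
    """Gamma-based MAP repair (5.10); assumes alpha > 1 and beta > 0 componentwise."""
    n = alpha.shape[0]
    u = cp.Variable(n, pos=True)
    objective = cp.Minimize(beta @ u - cp.sum(cp.multiply(alpha - 1, cp.log(u))))
    constraints = [B @ u == h - A @ u_prev]
    cp.Problem(objective, constraints).solve()
    return u.value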

Remark 5.1 (Justification of the geometrical interpretation for (5.10)). The strictly
convex function F(u_t) = −Σ_{i=1}^{n} ⟨θ^i, φ^i(u_t i)⟩, obtained by summing the compo-
nents (5.9) and changing the sign, induces a Bregman divergence (Bregman, 1967;
Banerjee et al., 2005) between ut , vt ∈ Rn given by

B(u_t||v_t) = F(u_t) − F(v_t) − ∇F(v_t)^T (u_t − v_t)
            = Σ_{i=1}^{n} [ β_i(u_t i − v_t i) − (α_i − 1) log(u_t i/v_t i) ] − Σ_{i=1}^{n} [ β_i − (α_i − 1)/v_t i ] (u_t i − v_t i)
            = Σ_{i=1}^{n} (α_i − 1) [ u_t i/v_t i − log(u_t i/v_t i) − 1 ] .   (5.11)

The divergence (5.11) is a weighted version of the Itakura-Saito distance (Itakura and Saito, 1968), defined by d_IS(u_t||v_t) = Σ_{i=1}^{n} [ (u_t i/v_t i) − log(u_t i/v_t i) − 1 ], which can be obtained as the Bregman divergence induced by F_IS(u_t) = −Σ_{i=1}^{n} log(u_t i).
(Originally, the Itakura-Saito distance is defined between power spectrum functions,
so that the sum over components i is in fact an integral over phases between −π
and π.)
Now, setting vt to the mode mp of the joint distribution of ut defined componentwise
by mp i = (αi − 1)/βi , the divergence (5.11) becomes
B(u_t||m_p) = Σ_{i=1}^{n} β_i u_t i − Σ_{i=1}^{n} (α_i − 1) log u_t i + Σ_{i=1}^{n} (α_i − 1) [ log((α_i − 1)/β_i) − 1 ] ,

that is, the objective of (5.10) up to a constant term. The omission of the constant
term shifts the value of the objective but does not alter the geometry of the level
sets.

5.2 Gaussian Predictive Densities

In this section, we consider joint probability models over (ξ_1, . . . , ξ_T, u_1, . . . , u_T), from which conditional densities p̂_t for u_t can be obtained by conditioning over the observation
of ξ1 , . . . , ξt−1 . An interesting case is when the conditional densities for ut are Gaussian,
since this allows us to use (5.7) and find a decision ut ∈ Ut (ξ).
Note that the Gaussian case is in fact rather general inasmuch as one can also approximate a density by a multivariate normal density (Laplace's approximation): Given a density p̂_t(x) with mode m_p ∈ R^n, twice differentiable in a neighborhood of m_p, compute the Hessian matrix H ∈ R^{n×n} of log p̂_t at m_p (elements H_ij = ∂² log p̂_t(m_p)/∂x_i ∂x_j), and then replace p̂_t by the density of a normal N(λ, Λ) with λ = m_p and Λ^{−1} = −H.
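A minimal numerical sketch of this approximation is given below; it is an assumed illustration only, and it uses the inverse-Hessian estimate returned by BFGS as a convenient stand-in for (−H)^{−1}.

import numpy as np
from scipy.optimize import minimize

def laplace_approximation(neg_log_density, x0):
    """Fit N(lambda, Lambda) to a density given through its negative log-density.

    Returns the mode (lambda) and an approximation of the covariance (Lambda),
    obtained from the inverse Hessian of -log p at the mode.
    """
    res = minimize(neg_log_density, x0, method="BFGS")
    return res.x, res.hess_inv      # hess_inv approximates (-H)^{-1} = Lambda

# Illustration on a density whose Laplace approximation is known in closed form:
# a Gamma(alpha, beta) density has mode (alpha-1)/beta and -d^2 log p / dx^2 = (alpha-1)/x^2.
alpha, beta = 4.0, 2.0
neg_log_p = lambda x: beta * x[0] - (alpha - 1.0) * np.log(x[0])
mode, cov = laplace_approximation(neg_log_p, np.array([1.0]))
# mode is close to 1.5 and cov is close to 1.5**2 / (alpha - 1) = 0.75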

5.2.1 Joint Gaussian Model

We consider the following joint Gaussian model as a base case for learning probabilistic
models.

Description.

It is well known that if a random vector z = (x, y) follows a multivariate normal distribution N(z̄, Σ) with

z̄ = [ x̄ ; ȳ ] ,    Σ = [ Σ_x , Σ_xy ; Σ_xy^T , Σ_y ] ,

with Σ positive definite, then y conditionally to x follows a multivariate normal distribution N(λ(x), Λ), where

λ(x) = ȳ + Σ_xy^T Σ_x^{−1} (x − x̄) ,    Λ = Σ_y − Σ_xy^T Σ_x^{−1} Σ_xy .   (5.12)

A simple model of the predictive density for ut given ξ1 , . . . , ξt−1 can be obtained by
setting x = (ξ1 , . . . , ξt−1 ), y = ut , and then using the conditioning formulae (5.12) on

a multivariate normal model N (z̄, Σ) for z = (x, y), with z̄ and Σ learned (estimated)
from a data set of scenario/decisions pairs (see below).
The evaluation of (5.12) for u_2, . . . , u_T requires T − 1 matrix inversions, but as Σ_x^{−1} is independent of the observations x, the inversions and matrix products need not be
recomputed online on new scenarios.
By (5.12), the conditional mean of ut is an affine function of the observed history
(ξ1 , . . . , ξt−1 ). In fact, λ(x) would be called a linear decision rule in the context of
stochastic programming (Garstka and Wets, 1974).

Estimation.

There is a large literature on the estimation of the mean and the covariance matrix (or
its inverse) of a Gaussian random vector (Stein, 1956; Haff, 1980; Banerjee et al., 2008).
In the present context, given a data set of samples {z^k}_{1≤k≤N}, where z^k = (x^k, y^k), x^k = (ξ_1^k, . . . , ξ_{t−1}^k), y^k = u_t^k, we can estimate the mean z̄ by ẑ = (1/N) Σ_{k=1}^{N} z^k, and estimate the covariance matrix Σ by a simple shrinkage estimator of the form

Σ̂ = (1 − ε) Σ_ML + ε I ,   with   Σ_ML = (1/N) Σ_{k=1}^{N} (z^k − ẑ)(z^k − ẑ)^T .   (5.13)

The identity matrix I is added with weight ε ∈ (0, 1) in order to ensure that the estimated covariance is positive definite and well-conditioned.
If Σ̂ in (5.13) is scaled by some positive factor, the conditional covariance Λ in (5.12) is scaled by the same factor, whereas the conditional mean λ(x) is left unchanged. As the minimizer of (5.7) is invariant with respect to a rescaling of the objective, one can thus rescale (5.13) by a factor (1 − ε)^{−1}, set ε′ = ε/(1 − ε) > 0 and simply use

Σ̂ = Σ_ML + ε′ I .   (5.14)

By the same token, there is no potential advantage in replacing the maximum likelihood estimator Σ_ML by an unbiased empirical estimator

Σ_emp = (N − 1)^{−1} Σ_{k=1}^{N} (z^k − ẑ)(z^k − ẑ)^T .
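The resulting predictive model is easy to reproduce; the following numpy sketch (with assumed names and shapes) fits the shrinkage estimate (5.13)-(5.14) and evaluates the conditioning formulae (5.12).

import numpy as np

def fit_joint_gaussian(Z, eps=1e-2):
    """Estimate (z_hat, Sigma_hat) from rows of Z = [x, y], as in (5.13)-(5.14)."""
    z_hat = Z.mean(axis=0)
    centered = Z - z_hat
    sigma_ml = centered.T @ centered / Z.shape[0]
    return z_hat, sigma_ml + eps * np.eye(Z.shape[1])

def conditional_gaussian(z_hat, sigma, dim_x, x):
    """Conditional mean and covariance of y given x, as in (5.12)."""
    x_bar, y_bar = z_hat[:dim_x], z_hat[dim_x:]
    s_x, s_xy = sigma[:dim_x, :dim_x], sigma[:dim_x, dim_x:]
    s_y = sigma[dim_x:, dim_x:]
    w = np.linalg.solve(s_x, s_xy)            # equals Sigma_x^{-1} Sigma_xy
    lam = y_bar + w.T @ (x - x_bar)           # linear decision rule in x
    cov = s_y - s_xy.T @ w
    return lam, cov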

Discussion and Extension.

The program (5.7) that restores the feasibility of ut uses larger corrections for compo-
nents ut i of ut with larger conditional variances Λii . Under the joint Gaussian model,
the components ut i of the decision vector ut that have a larger estimated variance (rela-
tively to the other components) are those that are not well explained by the linear model
(compared to the other components).
Due to the corrections made by (5.7), the actual decision ūt will not in general de-
pend affinely on (ξ1 , . . . , ξt−1 ). Therefore, it might be beneficial to consider the actual
decisions (u2 , . . . , ut−1 ) as new observations, and extend the conditional model for ut by
computing (5.12) with y = ut and x = (ξ1 , . . . , ξt−1 , u2 , . . . , ut−1 ).

5.2.2 Gaussian Process Model

Gaussian processes make it possible to define nonparametric models (as opposed to models with an a priori fixed number of parameters for summarizing the data, whatever the size of the data). Following O'Hagan (1978), it is often said that Gaussian processes allow one to define prior distributions over spaces of functions, which are then updated to posterior distributions, given a data set of observations of the relation between inputs and outputs.
Note that since Gaussian process models often incorporate the effect of a noise process
on observations, the relation between inputs and noisy observed outputs is actually not
of a purely functional nature (Neal, 1997, page 4).

Description.

Let J denote an index set (typically infinite), and let {X α }α∈J denote a collection of
vectors X α ∈ Rn such that X α 6= X β if α 6= β. The vectors X α are interpreted as query
points uniquely identified by labels α ∈ J (the labels can indicate an ordering between
distinct query points). For each α ∈ J , let Y α be a real-valued random variable with
finite variance. For any finite subset S of indices from J , let |S| denote the cardinality
of S, and let Y (S) denote the |S|-dimensional random vector with elements Y α , α ∈ S.
We assume that for any such subset S, the random vector Y (S) follows a multivariate nor-
mal distribution N (µ(S), K(S)), with its mean vector µ(S) ∈ R|S| and covariance matrix
K(S) ∈ R|S|×|S| defined below. This defines a so-called Gaussian process {Y α }α∈J .
The mean vector µ(S) = E{Y (S)} collects (stacks into a column vector) the elements

µα = g(X α ) , α∈S ,

defined using some fixed real-valued function g : Rn → R called the mean function.
The covariance matrix K(S) = E{[Y (S) − µ(S)][Y (S) − µ(S)]T } collects (stacks into
a symmetric |S| × |S| matrix) the elements

K αβ = k(X α , X β ) , α, β ∈ S ,

defined using some fixed positive definite kernel k : Rn × Rn → R (see Definition C.11 in
Appendix C — the name “positive definite kernel” is standard whereas the corresponding
matrix K(S) is only positive semi-definite). The kernel k (also called covariance function)
is parametrized by a vector η of hyperparameters that has not been written explicitly to
lighten the notation. For example, a kernel k : Rn × Rn → R with values
k(X^α, X^β) = v_0 exp{ −(1/2) Σ_{i=1}^{n} (X_i^α − X_i^β)²/σ_i² }

(radial basis kernel) is parametrized by η = (v0 , σ1−2 , . . . , σn−2 ), with v0 > 0 and where
each σi > 0 is a bandwidth parameter associated to the i-th coordinate of the inputs X α
and X β .
Now, let (S1 , S2 ) denote a partition of S, that is, S1 ∪ S2 = S and S1 ∩ S2 = ∅. Let
K(S1 , S2 ) = E{[Y (S1 ) − µ(S1 )][Y (S2 ) − µ(S2 )]T } be the matrix with elements K αβ for
α ∈ S_1, β ∈ S_2, and let

µ = [ µ(S_1) ; µ(S_2) ] ,    K = [ K(S_1) , K(S_1, S_2) ; K(S_1, S_2)^T , K(S_2) ] .

Let Z(S1 ) denote a |S1 |-dimensional random vector representing a noisy observation of
Y (S1 ), collecting elements defined by

Z α = Y α + σW α , α ∈ S1 ,

where σ 2 > 0 represents the variance of the observation noise, assumed to be i.i.d.
Gaussian, and where each W α is assumed to be drawn independently from the stan-
dard normal distribution N (0, 1). Then, the random vector (Z(S1 ), Y (S2 )) follows a
multivariate normal N(µ′, K′) with

µ′ = µ ,    K′ = [ K(S_1) + σ²I , K(S_1, S_2) ; K(S_1, S_2)^T , K(S_2) ] ,

where I stands for the |S1 |×|S1 | identity matrix. In particular, the random vector Y (S2 )
conditionally to Z follows a multivariate normal N (λY (Z(S1 )), ΛY ) with the conditional
mean and conditional covariance matrix given respectively by

λ_Y(Z(S_1)) = µ(S_2) + K(S_1, S_2)^T (K(S_1) + σ²I)^{−1} (Z(S_1) − µ(S_1)) ,   (5.15)
Λ_Y = K(S_2) − K(S_1, S_2)^T (K(S_1) + σ²I)^{−1} K(S_1, S_2) .   (5.16)

When one actually observes a realization z(S1 ) ∈ R|S1 | of the random vector Z(S1 ),
the conditional mean of Y (S2 ) given z(S1 ) is a real vector λ̂ = λY (z(S1 )) ∈ R|S2 | that
represents the best prediction for the realization of Y (S2 ) in the mean-square error sense,
while the covariance matrix of the prediction error λ̂ − Y (S2 ) is given by Λ̂ = ΛY .
The contribution σ 2 I from the noise vector W can be viewed as a jitter term that
stabilizes the matrix inversion without perturbing too much the model (Neal, 1997). It
also allows to consider in Equations (5.15), (5.16), several independent noisy observations
at a same query point X α , by reinterpreting S1 as a multiset (collection) of indices of J .
We now apply the described Gaussian Process model to the inference of the distri-
bution of a decision vector ut conditionally to a new scenario ξ (of which we can only
observe ξ1 , . . . , ξt−1 ), given a data set of scenario/decisions pairs (ξ k , uk ) extracted from
a scenario tree. We describe the calculations for the i-th component of ut , written ut i .
We define S_1 as an index set relative to the distinct values of (ξ_1^k, . . . , ξ_{t−1}^k) found in the data set, and we set

X^α = (ξ_1^α, . . . , ξ_{t−1}^α) ,   z^α = u_{t i}^α ,   α ∈ S_1 .   (5.17)

This allows to compute the term (K(S1 ) + σ 2 I)−1 (z(S1 ) − µ(S1 )) in (5.15) as soon as
we obtain the data set (the realization z(S1 ) of Z(S1 )). Then, we view S2 as a singleton
relative to a new scenario ξ ∗ , and we set

X^β = (ξ_1^∗, . . . , ξ_{t−1}^∗) ,   Y^β = u_{t i} ,   β ∈ S_2 .   (5.18)

This allows to compute K(S1 , S2 ) as soon as we actually observe (ξ1∗ , . . . , ξt−1 ), with
K(S1 , S2 ) interpreted as a vector of weights describing the similarity of the new sce-
nario ξ ∗ with respect to each example ξ k stored in the data set.
At this stage, we can infer that the real-valued random variable ut i follows a univariate
normal distribution N (λi , Λii ) with λi = λY (z(S1 )) given by (5.15), and Λii = ΛY given
by (5.16). As for the predictive density for the full decision ut , we assume that each

component is independent conditionally to (ξ1 , . . . , ξt−1 ), so that ut follows a multivariate


normal N (λ, Λ) with λ formed from the components λi , and Λ defined as a diagonal
matrix with diagonal entries Λii .

Remark 5.2. It is also conceivable to infer a noisy predictive density (Rasmussen


and Williams, 2006, page 18), that is, build a model for Z(S2 ) given Z(S1 ). In
that case, K(S2 ) is replaced by K(S2 ) + σ 2 in the expression of K 0 (assuming
that S2 is a singleton), so that ut i follows a normal distribution N (λY (z(S1 )), ΛY +
σ 2 ). Note also that if we use the same kernel k (with the same hyperparameter
values) for each component of ut and model these components as being conditionally
independent given (ξ1 , . . . , ξt−1 ), the joint Gaussian model for the components of ut
has a covariance matrix equal to a multiple of the identity matrix. Under that
choice, whatever the chosen noise variance σ 2 and the value of ΛY common to
all components i, the program (5.7) is always equivalent to the minimization of
||ut − λ||2 subject to the feasibility constraints, with the conditional mean λ of u t
unaffected by the jitter term σ 2 added to K(S2 ).

It can be seen from (5.15) that the mean and thus the mode of the predicted Gaussian density for u_{t i} combine the decisions u_{t i}^α of the data set, in a way that depends on the similarity (determined by the kernel k) between the observed part (ξ_1^∗, . . . , ξ_{t−1}^∗) of the new scenario ξ^∗, and the scenarios ξ^α, α ∈ S_1, stored in the data set.
The factor (K(S1 )+σ 2 I)−1 (Z(S1 )−µ(S1 )) in (5.15) has to be evaluated once, whereas
µ(S2 ) and the vector K(S1 , S2 ) ∈ R|S1 | must be evaluated online for each new scenario.
Therefore, training requires a time cubic in the cardinality |S1 | of the training set due to
the matrix inversion, whereas the computation of the conditional mean λ can be done in
linear time. The online computation of the variance ΛY would require a time quadratic
in the cardinality of the training set, but following Remark 5.2, it is possible to bypass
the estimation of the variance by keeping the same kernel for each component of u t .
The storage of the Gram matrix K(S1 ) takes a space quadratic in the cardinality of the
training set.
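The precomputation and the online prediction described above can be summarized in a few lines of numpy; the kernel, the placeholder data, and the function names below are assumptions made only for illustration, and the predictive variance is bypassed as discussed in Remark 5.2.

import numpy as np

def rbf_kernel(X1, X2, v0=1.0, sigma=1.0):
    """Radial basis kernel with a common bandwidth sigma for all coordinates."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return v0 * np.exp(-0.5 * sq / sigma**2)

def gp_fit(X_train, z_train, noise=1e-2, **kern):
    """Precompute the factor (K(S1) + sigma^2 I)^{-1} (z(S1) - mu(S1)), with mu = 0."""
    K = rbf_kernel(X_train, X_train, **kern) + noise * np.eye(len(X_train))
    return np.linalg.solve(K, z_train)

def gp_predict_mean(X_train, weights, X_new, **kern):
    """Predictive mean (5.15) at new query points, with mean function g = 0."""
    return rbf_kernel(X_new, X_train, **kern) @ weights

# Usage sketch: one model per decision component u_{t,i}; inputs are scenario histories.
X_train = np.random.randn(50, 3)        # histories (xi_1, ..., xi_{t-1}) from the tree
z_train = X_train.sum(axis=1)           # placeholder decisions u_{t,i}^k
w = gp_fit(X_train, z_train, noise=1e-2, sigma=1.0)
mean_new = gp_predict_mean(X_train, w, np.random.randn(5, 3), sigma=1.0)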

Estimation.

In Gaussian Process regression, the mean function g with values g(X α ) is often set to the
constant zero-valued function, so that the terms µ(S1 ), µ(S2 ) do not appear in (5.15).
Sometimes, the mean function is set to a linear function of the inputs X α . In the present
context, the values of g could also be set to constant reference decisions, for instance, to
the decisions from a nominal plan (Section 2.1.1).
Selecting a kernel type automatically is not easy. In support vector machines, the
problem is partially addressed by working over a set of kernels (Lanckriet et al., 2004;
Micchelli and Pontil, 2005; Sonnenburg et al., 2006). Once the kernel type is chosen, the
selection of the hyperparameters η can be formulated as the maximization over η of the
loglikelihood of the observed data z(S1 ) (Mardia and Marshall, 1984), that is,

ℓ(η ; z(S_1)) = −(N_1/2) log{2π} − (1/2) log{det(K(S_1) + σ²I)}
              − (1/2) (z(S_1) − µ(S_1))^T (K(S_1) + σ²I)^{−1} (z(S_1) − µ(S_1))

where the value of K αβ = k(X α , X β ) for α, β ∈ S1 depends on η. A local maximum can


be found by gradient-ascent based optimization techniques. Another, Bayesian, approach
consists in putting prior distributions over the hyperparameters, and make predictions
by integrating out the hyperparameters.
For more details, we refer to the recent review by Nickisch and Rasmussen (2008). We
do not deem it essential to discuss these techniques further inasmuch as our ultimate goal
is not to find the best explanation to a training set of scenario/decisions pairs: the deci-
sions of the training set are not the optimal decisions for the original multistage stochastic
programming problem in the first place. In this chapter, our procedure for selecting a
model is limited to the simulation of repaired predicted decisions using candidate kernels
with fixed hyperparameter values.

5.3 Case Study

In this section, we consider a particular multistage stochastic program (described in detail in Section 5.3.1) of the form

minimize    E{ Σ_{t=1}^{T} ⟨c_t, u_t⟩ }
subject to  B_1 u_1 = h_1 ,  u_1 ⪰ 0 ,
            A_t u_{t−1} + B_t u_t = h_t ,  u_t ⪰ 0   for t = 2, . . . , T ,

where At , Bt for t ≥ 1 denote fixed matrices of proper dimension, and where the cost
coefficients ct , the constraint right-hand sides ht , and decision vectors ut may depend, for
t ≥ 2, on the realization (ξ1 , . . . , ξt−1 ) of a random process ξ = (ξ1 , . . . , ξT ). As usual, the
expectation is taken over ξ and can be decomposed in nested conditional expectations.
Recall that a scenario tree for ξ is a set of realizations {ξ^k}_{1≤k≤N} of ξ, along with probabilities p^k > 0 assigned to scenarios ξ^k and summing to 1. Recall that the branching structure of the tree causes histories (ξ_1^k, . . . , ξ_{t−1}^k) to be identical among some scenarios k. Let us denote by c_t^k and h_t^k the values of c_t and h_t associated to ξ^k, noting in particular that c_1^k and h_1^k are necessarily constant-valued.
that c1 and h1 are necessarily constant-valued.


Now, observe that if a deterministic scenario-tree generation algorithm is chosen, and
if the parameters for building a scenario tree for ξ, a scenario tree for ξ given ξ1 (meaning
that ξ1k = ξ1 for each scenario k), a scenario tree for ξ given (ξ1 , ξ2 ), . . . , a scenario
tree for ξ given (ξ1 , . . . , ξT −1 ), are also fixed, then these choices uniquely determine a
shrinking-horizon policy π SH = (π1SH , . . . , πTSH ).
A shrinking-horizon policy π SH = (π1SH , . . . , πTSH ) is defined as follows. The map-
ping π1SH is a constant-valued function with value ū1 , where ū1 corresponds to an opti-
mal solution for u1 , relative to the following program over u1 and ukt for 1 ≤ k ≤ N and
1 ≤ t ≤ T,
minimize    N^{−1} Σ_{k=1}^{N} Σ_{t=1}^{T} p^k ⟨c_t^k, u_t^k⟩
subject to  B_1 u_1 = h_1 ,  u_1 ⪰ 0 ,
            u_1^k = u_1   for each k ,
            A_t u_{t−1}^k + B_t u_t^k = h_t^k ,  u_t^k ⪰ 0   for each k and for t ≥ 2 ,
            u_t^j = u_t^k   for each t ≥ 2 and j, k such that (ξ_1^k, . . . , ξ_{t−1}^k) ≡ (ξ_1^j, . . . , ξ_{t−1}^j) .

The mapping πtSH for t ≥ 2 is a function of (ξ1 , . . . , ξt−1 ), with value ūt , where ūt
corresponds to an optimal solution for ukt (any k), relative to the following program over
u_{t′}^k for 1 ≤ k ≤ N and t ≤ t′ ≤ T,

minimize    N^{−1} Σ_{k=1}^{N} Σ_{t′=t}^{T} p^k ⟨c_{t′}^k, u_{t′}^k⟩
subject to  A_{t′} u_{t′−1}^k + B_{t′} u_{t′}^k = h_{t′}^k ,  u_{t′}^k ⪰ 0   for each k and for t′ ≥ t ,
            where we set, for t′ = t, u_{t−1}^k := ū_{t−1} for each k ,
            u_{t′}^j = u_{t′}^k   for each t′ ≥ t and j, k such that (ξ_1^k, . . . , ξ_{t′−1}^k) ≡ (ξ_1^j, . . . , ξ_{t′−1}^j)
            (thus in particular for t′ = t, we have u_t^j = u_t^k for each j, k) ,

where N and all scenario-dependent quantities pk , ξtk , ckt , hkt should here be understood as
relative to the scenario tree for ξ given (ξ1 , . . . , ξt−1 ), which is built once the realization of
(ξ1 , . . . , ξt−1 ) becomes available, and which instantiates, along with ūt−1 , the parameters
of the program.
Our intention in this section is to take shrinking-horizon policies as the gold standard for sequential decision making, and compare them to other policies built with the
techniques proposed in the chapter on a common test sample of M = 104 scenarios.
For the simplicity of the parametrization of the scenario tree building algorithm, we
consider scenario trees with a uniform branching factor, and use the same branching
factor for rebuilding scenario trees on the shrinking horizon. Therefore, once the dis-
cretization method for ξt is fixed (choices are explained in length in Section 5.3.2), a
shrinking-horizon policy is uniquely determined by the branching factor. Moreover, us-
ing the same branching factor at each stage results in the following property: If the
k
realization of (ξ1 , . . . , ξt−1 ) is identical to (ξ1k , . . . , ξt−1 ) for some scenario k in the initial
scenario tree for ξ, then the subtree rooted at the node relative to (ξ1k , . . . , ξt−1 k
) is exactly
the subtree built at stage t for the scenario tree of ξ given (ξ1 , . . . , ξt−1 ). Hence, if one
simulates the shrinking horizon policy with uniform branching factor on the scenario ξ k ,
one will recover the decisions uk = (uk1 , . . . , ukT ) that were found to be optimal on the
initial scenario tree for computing ū1 .
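The simulation of a shrinking-horizon policy on a test scenario can be organized as in the following sketch; build_tree and solve_tree_program are hypothetical helpers (not functions from the thesis), the first building a scenario tree for ξ conditionally on the observed history and the second solving the associated program and returning the optimized root decision.

def simulate_shrinking_horizon(xi, branching, build_tree, solve_tree_program):
    """Simulate a shrinking-horizon policy on one scenario xi = (xi_1, ..., xi_T)."""
    decisions, history = [], []
    for t in range(1, len(xi) + 1):
        tree = build_tree(history, branching)          # tree for xi given the history
        u_t = solve_tree_program(tree, decisions)      # optimal decision at the root
        decisions.append(u_t)
        history.append(xi[t - 1])                      # observe xi_t, shrink the horizon
    return decisions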
To a single shrinking-horizon policy π SH will correspond several learned policies, ob-
tained by different learning algorithm applied to the same training data {(ξ k , uk )}1≤k≤N ,
relative to the scenario tree used to optimize π1SH = ū1 . Obviously, all these learned poli-
cies start with the same first-stage decision ū1 .

5.3.1 Description of the Test Problem

The test problem is a multi-product assembly problem under demand uncertainty. The
multistage structure of the problem is summarized in Table 5.1: the decisions to take
at each stage is put in correspondence with the available information at those stage,
represented by the realization of certain random variables. The mathematical formu-
lation of the problem is presented in Table 5.2 in nested form. The nested form is a
generalization to several stages of the formulation for two-stage programs presented in
Appendix D; it enables a reader to distinguish easily the constraints specific a decision
stage t, that is, the actual definition of the sets Ut (ξ). We have put at the end of the
chapter (page 104) a table that specifies the numerical value of all the parameters for the
test problem (Table 5.10).

Tab. 5.1: Multi-product Assembly Problem: Multistage structure.

Stage   Available information                  Decision to take
        Description          Variables         Description              Variables

1       No information                         Components to buy        q_1 ∈ R^12
2       Factor 1             ε_1 ∈ R           Subparts to make         v_2 ∈ R^{12×8}, q_2 ∈ R^8
3       Factors 1, 2         ε_1, ε_2 ∈ R      Products to assemble     v_3 ∈ R^{8×5}, q_3 ∈ R^5
4       Factors 1, 2, 3      ε_1, ε_2, ε_3 ∈ R Sales                    q_4 ∈ R^5

Tab. 5.2: Multi-product Assembly Problem: Nested formulation.

minimize    ⟨c_1, q_1⟩ + E_{ε_1}{Q_1(q_1, ε_1)}
subject to  q_1 ⪰ 0 ,

Q_1(q_1, ε_1) = min  ⟨c_2, q_2⟩ + E_{ε_2}{Q_2(q_2, ε_1, ε_2)}
     subject to  w_{2ij}(q_2)_j ≤ (v_2)_{ij} ,   Σ_j (v_2)_{ij} ≤ (q_1)_i ,
                 q_2, v_2 ⪰ 0 ,   (1 ≤ i ≤ 12, 1 ≤ j ≤ 8)

Q_2(q_2, ε_1, ε_2) = min  ⟨c_3, q_3⟩ + E_{ε_3}{Q_3(q_3, ε_1, ε_2, ε_3)}
     subject to  w_{3jk}(q_3)_k ≤ (v_3)_{jk} ,   Σ_k (v_3)_{jk} ≤ (q_2)_j ,
                 q_3, v_3 ⪰ 0 ,   (1 ≤ j ≤ 8, 1 ≤ k ≤ 5)

Q_3(q_3, ε_1, ε_2, ε_3) = min  ⟨c_4, q_4⟩
     subject to  q_4 ⪯ [b_0 + b_1 ε_1 + b_2 ε_2 + b_3 ε_3]_+ := d ,
                 0 ⪯ q_4 ⪯ q_3 .

The test problem can be described as follows. A manufacturer can assemble 5 products
P_i, for which the demand d_i ∈ R is unknown, but influenced by three random factors ε_t ∈ R, t = 1, 2, 3, observed at distinct decision stages (see Table 5.1). We let d ∈ R^5 be the random vector representing the demand. The products are made of subparts, some of which are common among several products. There is a total of 8 distinct subparts. The subparts are themselves made of components, some of which are common among several subparts. There is a total of 12 distinct components that the manufacturer can buy.
The random demand d is assumed to be distributed according to the following model:

d = [b_0 + b_1 ε_1 + b_2 ε_2 + b_3 ε_3]_+                         (5.19)
ε_1 ∼ N(0, 1) ,   ε_2 ∼ N(0, 1) ,   ε_3 ∼ N(0, 1) ,              (5.20)

where b_0, b_1, b_2, b_3 ∈ R^5 are fixed parameters, the random variables ε_i are mutually independent, and [·]_+ denotes the componentwise positive part. To relate this model with the usual notation ξ for the gradually observed random process, one has to set ξ = (ξ_1, ξ_2, ξ_3) with ξ_1 = ε_1, ξ_2 = ε_2, ξ_3 = (ε_3, d).
The decision vector u is decomposed into two groups of variables. A first group of
so-called strategic decisions q1 ∈ R12 , q2 ∈ R8 , q3 ∈ R5 corresponds to the quantities of

components, subparts and products that are bought or assembled. A second group of
so-called ancillary decisions v2 ∈ R12×8 , v3 ∈ R8×5 determines each quantity of compo-
nent/subpart allocated to a given subpart/product in the next stage of the production
process. A decision q4 ∈ R5 , corresponding to the quantity of product actually sold, and
defined by q4 = min{q3 , d} (elementwise minimum), is added to the group of ancillary
decisions, for the convenience of the problem formulation (convexity). To summarize,

u = (q1 , v2 , q2 , v3 , q3 , q4 ) ∈ R166 , (5.21)

with a slight abuse of notation (since v2 and v3 were defined as matrices).


The time horizon T = 4 and the total dimension of a scenario (ε_1, ε_2, ε_3, d) ∈ R^8, where d is actually a function of (ε_1, ε_2, ε_3) ∈ R^3, are small enough to let us test shrinking-
horizon policies of various complexity on a test sample of significant size.
The objective to be minimized is the expected cost E{⟨c, u⟩}, where the vector

c = (c1 , 0, c2 , 0, c3 , c4 ) ∈ R166 (5.22)

collects the unit cost of each decision in u, in the order determined by the decomposition
(5.21). The subvectors c1 , c2 , c3 associated to q1 , q2 , q3 correspond to fixed production
costs and are nonnegative. Zero costs are associated to the decisions v 2 , v3 . The sub-
vector c4 associated to q4 has negative entries that correspond to the fixed prices of the
5 products with a sign change.
The decision vector u is structured by various constraints. Besides a nonnegativity
constraint u ⪰ 0, these constraints are of two types:

w_{tjk} (q_t)_k ≤ (v_t)_{jk} ,                       (5.23)
Σ_k (v_t)_{jk} ≤ (q_{t−1})_j .                       (5.24)

Constraints (5.23) express that w_{tjk} units of j are necessary for obtaining one unit of k, where w_{tjk} ≥ 0 is a fixed parameter, j refers to a component (if t = 1) or a subpart (if
t = 2), and k refers to a subpart (if t = 1) or a product (if t = 2). Note that if j does not
enter in the composition of k, one has wtjk = 0, so that (5.23) reduces to a redundant
nonnegativity constraint that can be removed. Constraints (5.24) express that the total
quantity of j employed in the various k cannot exceed the available quantity of j.
The relation q4 = min{q3 , d} can be expressed by the constraints

q_4 ⪯ q_3 ,                                          (5.25)
q_4 ⪯ d                                              (5.26)

since the components of c associated to q4 in the objective are negative.


The constraints at each stage can thus be expressed in the format (5.5), converting the
inequality constraints to equality constraints by introducing nonnegative slack variables.
It is easy to see that the right-hand side ht of (5.5) is fixed at t = 1, 2, 3 and depends
affinely on d at t = 4, due to (5.26). The dependence of the feasibility sets on the demand factors ε_t is implicit. For instance q_3 depends on q_2 through (5.23) and (5.24), while q_2 itself depends on ε_1. Such dependences give a rich structure to the feasibility sets.

5.3.2 Discretization of the Random Process

This section details how the scenario trees with uniform branching factors are built. We focus on the problem of approximating N(0, 1) by a discrete distribution on S points, specified by a support (ε̂^1, . . . , ε̂^S) and associated positive probability masses (p̂^1, . . . , p̂^S). Indeed, once the support (ε̂^1, . . . , ε̂^S) and the probabilities (p̂^1, . . . , p̂^S) of the discrete distribution are determined, the scenario tree is made of the S^3 distinct realizations of ξ = (ε_1, ε_2, ε_3, d) of the form

ξ^k = (ε̂^{i1}, ε̂^{i2}, ε̂^{i3}, [b_0 + b_1 ε̂^{i1} + b_2 ε̂^{i2} + b_3 ε̂^{i3}]_+)

where the indices i_1, i_2, i_3 are valued in {1, . . . , S}, and the probability of the scenario ξ^k is given by p^k = p̂^{i1} p̂^{i2} p̂^{i3}.
Let ε̂ = (ε̂^1, . . . , ε̂^S) denote the support of the discrete distribution, treated as an optimization variable in R^S. We will use the quadratic distortion D² between the discrete distribution and the target distribution N(0, 1), defined for any ε̂ ∈ R^S as

D²(ε̂) = E min_{1≤i≤S} ||ε̂^i − ε||² ,                          (5.27)

where ε is a random variable following the target distribution N(0, 1). By defining the cells C^i(ε̂) = {ε ∈ R : ||ε̂^i − ε|| ≤ ||ε̂^j − ε||, 1 ≤ j ≤ S}, whose boundaries have a null measure under the target probability measure, (5.27) can be written as D²(ε̂) = Σ_{i=1}^{S} ∫_{C^i(ε̂)} ||ε̂^i − ε||² φ(ε) dε, with φ the probability density function of N(0, 1).
If ∇D²(ε̂) = 0, that is,

∫_{C^i(ε̂)} (ε̂^i − ε) φ(ε) dε = 0 ,   1 ≤ i ≤ S ,               (5.28)

then ε̂ is called a stationary quantizer. When the distortion is minimized over ε̂ without constraint, as here where the support of the target distribution is unbounded, a local minimum of the distortion is a stationary quantizer.
On the real line, the attention can be restricted to the points ε̂ such that −∞ < ε̂^1 < · · · < ε̂^S < ∞, since the distortion decreases when a new point distinct from others is added to the support of the discrete distribution. Under the convention that ε̂^0 = −∞ and ε̂^{S+1} = ∞, the cell C^i(ε̂) is the closure of the interval ([ε̂^{i−1} + ε̂^i]/2, [ε̂^i + ε̂^{i+1}]/2).
With the univariate normal distribution, which has a strictly log-concave density, a local minimum of D² can be found by Newton's method (Pages and Printems, 2003), and this minimum is also a global minimum (this does not hold in the multivariate case). Optimal solutions ε̂ for the values of S used in the sequel are represented on Figure 5.3. The probabilities reported on the figure are obtained by integrating the normal density over the cells C^i:

p̂^i = ∫_{C^i(ε̂)} φ(ε) dε = Φ(ε̂^{i+1}/2 + ε̂^i/2) − Φ(ε̂^i/2 + ε̂^{i−1}/2) ,

where Φ is the cumulative distribution function (cdf) of N(0, 1). The probabilities have a closed-form expression thanks to the simple domain of integration.
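A compact way to reproduce these discretizations, and the resulting trees of S^3 scenarios, is sketched below; it uses a fixed-point (Lloyd-type) iteration on the stationarity condition (5.28) rather than Newton's method, and the parameter values and function names are placeholders introduced only for this example.

import numpy as np
from scipy.stats import norm
from itertools import product

def normal_quantizer(S, iters=200):
    """Stationary quantizer of N(0,1) on S points, via a Lloyd fixed-point iteration."""
    e = norm.ppf((np.arange(S) + 0.5) / S)               # reasonable starting points
    for _ in range(iters):
        b = np.concatenate(([-np.inf], (e[:-1] + e[1:]) / 2, [np.inf]))
        mass = norm.cdf(b[1:]) - norm.cdf(b[:-1])        # cell probabilities
        e = (norm.pdf(b[:-1]) - norm.pdf(b[1:])) / mass  # cell conditional means, cf. (5.28)
    return e, mass

def build_scenario_set(S, b0, b1, b2, b3):
    """All S^3 scenarios (eps1, eps2, eps3, d) with their probabilities p^k."""
    e, p = normal_quantizer(S)
    scenarios, probs = [], []
    for i1, i2, i3 in product(range(S), repeat=3):
        d = np.maximum(b0 + b1 * e[i1] + b2 * e[i2] + b3 * e[i3], 0.0)
        scenarios.append((e[i1], e[i2], e[i3], d))
        probs.append(p[i1] * p[i2] * p[i3])
    return scenarios, np.array(probs)

For S = 2, for instance, the iteration converges to the points ±0.798 with probabilities 1/2, in agreement with Figure 5.3.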

Fig. 5.3: Discretizations of N(0, 1) for branching factors from 1 to 10, obtained by minimizing
the quadratic distortion. Values that can be guessed by symmetry are not indicated.

Remark 5.3. A property of stationary quadratic quantizers is noteworthy in the


context of stochastic programming. It is well known (Pages and Printems, 2003)
that for any function f0 convex in ε, one has, by (5.28),

Σ_{i=1}^S p̂^i f0(ε̂^i) = Σ_{i=1}^S p̂^i f0( ∫_{C^i(ε̂)} ε φ(ε) dε / ∫_{C^i(ε̂)} φ(ε) dε ) ≤ Σ_{i=1}^S p̂^i ∫_{C^i(ε̂)} f0(ε) [φ(ε)/p̂^i] dε = E{f0(ε)} ,

where the inequality holds by Jensen's inequality with the conditional density
φ(ε)/p̂^i. This implies that for a function f with values f(ε, x) convex in ε, one
has, for any fixed x, Σ_{i=1}^S p̂^i f(ε̂^i, x) ≤ E{f(ε, x)}. Let x̄ ∈ argmin_x E{f(ε, x)}.
Then it holds that

min_x Σ_{i=1}^S p̂^i f(ε̂^i, x) ≤ Σ_{i=1}^S p̂^i f(ε̂^i, x̄) ≤ E{f(ε, x̄)} = min_x E{f(ε, x)} .   (5.29)

Now, as a particular function convex in ε, consider

f(ε, x) = ⟨c, x⟩ + min_{y : Ax+By=Cε, y≽0} g(y)   (5.30)

where g is convex in y. The function f is convex in ε as the sum of a fixed
term ⟨c, x⟩ and a function obtained as the composition of the affine transform
δ = Cε − Ax with the function f̃(δ) = min_{y : By=δ, y≽0} g(y), which can be shown to
be convex in δ (Rockafellar, 1970, Theorem 5.7). Then (5.29) becomes, restricting
the minimization over x to some set X,

min_{x∈X} {⟨c, x⟩ + Σ_{i=1}^S p̂^i min_{y^i : Ax+By^i=Cε̂^i, y^i≽0} g(y^i)}
   ≤ min_{x∈X} {⟨c, x⟩ + E{min_{y(ε) : Ax+By(ε)=Cε, y(ε)≽0} g(y(ε))}} .   (5.31)

The argument can be extended by taking g(y) = E{f2(ε2, y)}, with ε2 a new random
variable independent of ε, and f2(ε2, y) defined by

f2(ε2, y) = ⟨c2, y⟩ + min_{z : A2 y+B2 z=C2 ε2, z≽0} g2(z)

where g2 is convex in z. Given a stationary quantizer for ε2, with values ε̂2^j and
probabilities p̂2^j, j = 1, . . . , S2, it holds by (5.29) and the convexity of f2 in ε2 that

min_{y^i∈Y^i(x)} Σ_{j=1}^{S2} p̂2^j f2(ε̂2^j, y^i) ≤ min_{y^i∈Y^i(x)} E{f2(ε2, y^i)} ,   (5.32)

where Y^i(x) = {y^i : Ax + By^i = Cε̂^i, y^i ≽ 0}. Let x̄ be an optimal solution to
the minimization over x ∈ X of the left-hand side of (5.31). One then obtains the
chain of inequalities
min_{x∈X} {⟨c, x⟩ + Σ_{i=1}^S p̂^i min_{y^i : Ax+By^i=Cε̂^i, y^i≽0} {⟨c2, y^i⟩
      + Σ_{j=1}^{S2} p̂2^j min_{z^{ij} : A2 y^i+B2 z^{ij}=C2 ε̂2^j, z^{ij}≽0} g2(z^{ij})}}
≤ ⟨c, x̄⟩ + Σ_{i=1}^S p̂^i min_{y^i : Ax̄+By^i=Cε̂^i, y^i≽0} {⟨c2, y^i⟩
      + Σ_{j=1}^{S2} p̂2^j min_{z^{ij} : A2 y^i+B2 z^{ij}=C2 ε̂2^j, z^{ij}≽0} g2(z^{ij})}
= ⟨c, x̄⟩ + Σ_{i=1}^S p̂^i min_{y^i∈Y^i(x̄)} Σ_{j=1}^{S2} p̂2^j f2(ε̂2^j, y^i)
≤ ⟨c, x̄⟩ + Σ_{i=1}^S p̂^i min_{y^i∈Y^i(x̄)} E{f2(ε2, y^i)}
= min_{x∈X} {⟨c, x⟩ + Σ_{i=1}^S p̂^i min_{y^i∈Y^i(x)} g(y^i)}
≤ min_{x∈X} {⟨c, x⟩ + E{min_{y(ε) : Ax+By(ε)=Cε, y(ε)≽0} E2{f2(ε2, y(ε))}}}
= min_{x∈X} {⟨c, x⟩ + E{min_{y(ε) : Ax+By(ε)=Cε, y(ε)≽0} {⟨c2, y(ε)⟩
      + E2{min_{z(ε,ε2) : A2 y(ε)+B2 z(ε,ε2)=C2 ε2, z(ε,ε2)≽0} g2(z(ε, ε2))}}}} ,

where the last inequality follows from (5.31).


By induction, the result can further be extended to several decision stages.
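As a small numerical sanity check of the inequality Σ_i p̂^i f0(ε̂^i) ≤ E{f0(ε)} that drives the argument (this check is ours and uses the closed-form two-point stationary quantizer of N(0, 1) with a few convex test functions, not the thesis data):

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

values = np.array([-np.sqrt(2 / np.pi), np.sqrt(2 / np.pi)])   # two-point stationary quantizer
probs = np.array([0.5, 0.5])

for f0 in (lambda e: e ** 2, lambda e: np.maximum(e - 0.5, 0.0), np.exp):
    quantized = float(np.sum(probs * f0(values)))                        # sum_i p^i f0(eps^i)
    exact = quad(lambda e: f0(e) * norm.pdf(e), -np.inf, np.inf)[0]      # E{f0(eps)}
    assert quantized <= exact + 1e-9
    print(f"{quantized:.4f} <= {exact:.4f}")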

In Remark 5.3, a class of multistage programs has been identified, for which a single
scenario-tree approximation based on quadratic quantization yields a lower bound on the
exact optimal value of the program.
For this result to hold, the stagewise independence assumption between the random
variables ε, ε2, . . . , is essential. The function f(ε, x) in (5.30) has to be convex in ε,
which prevents us from considering, instead of g(y), a general function g(y, ε), as would be the case
if the expectation in the definition of g were conditioned on ε. The only dependence of
g on ε is through the value of its argument y, which depends on the realization of ε.
Now, there exists a formulation trick that allows the value of ε to be passed to functions at
subsequent stages. It suffices to extend the decision vector y to the vector y⁺ = (y, y_ε),
where y_ε is a dummy decision variable subject to the constraint y_ε = ε. The value of ε
can then be passed to the function g through y⁺ itself, and by the same mechanism to
any subsequent function inside the nested expectations.
In fact, the multi-product assembly problem described in Section 5.3.1 could be put
under that form if (5.20) were replaced by d = b0 + b1 ε1 + b2 ε2 + b3 ε3. Indeed, in the
reasoning of Remark 5.3, the transform δ = Cε − Ax can be extended to δ = Cε + D − Ax,
which is also an affine transform of ε but allows fixed right-hand sides when C = 0.

Tab. 5.3: Minimum of the approximate programs.

Branching factor    Optimal value    Cpu time (seconds)
1 -805.73 1.8
2 -450.89 1.8
3 -397.80 5.0
4 -388.57 8.0
5 -383.36 17.2
6 -379.85 40.7
7 -378.09 79.8
8 -377.23 177.5
9 -376.91 353.5
10 -376.56 670.6

With the extension trick, it is possible to pass the value of ε1, ε2, ε3 to the last stage, and to
express d through the linear equality constraint d = b0 + b1 ε1 + b2 ε2 + b3 ε3.
Unfortunately, the lower bound certificate cannot be extended to the case where
d is defined by (5.20): the value of the last stage is convex in d but not in ε3. We
expect, however, that when the conditional probability of having all components of d not
truncated is large enough (we refer to the probability P{d ≻ 0} = P{∩_{j=1}^5 (b3)_j ε3 > −λ_j}
when λ = (b0 + b1 ε1 + b2 ε2) ≻ 0), one is close to the case where d is affine in ε1, ε2, ε3, and
thus close to being able to certify that the quadratic quantization yields a lower bound.
When one or several components λ_j of λ are close to 0 or below, then it is likely that the
optimal choice of q3 will attempt to redirect the assembly to products with the largest
expected profit E{|(c4)_j|(q4)_j − (c3)_j (q3)_j}, and thus to favor products with a larger
conditional expected demand, which happens to be the products that follow the affine
demand model more closely — potentially diminishing the impact of a discretization bias
in the wrong direction. By bias in the wrong direction, we mean this: if we were able
to dynamically adjust a quantizer for the distribution of the components d_j to make it
stationary given the values of ε1 and ε2, so as to take the expectation over d rather than
ε3, then the values of the adjusted quantizer would be greater than the values of the fixed
quantizer induced by the fixed quantization of ε1, ε2, ε3, which neglects the truncation of d
at 0.
Empirically, on our problem data, the optimal value of the scenario-tree approxima-
tions with uniform branching factor S = 1, 2, . . . increases with S and stabilizes at a
certain level for higher values of S. This strongly suggests that on our problem data, the
quadratic quantization approach consistently provides lower bounds on the value of the
exact multistage program (Table 5.3). The time taken by the numerical optimization
algorithm for solving the successive approximations has also been indicated on Table 5.3,
so as to provide an indication of the increasing difficulty of solving programs posed on
larger scenario trees.

5.3.3 Shrinking-Horizon Policies on the Test Sample

As already mentioned, the present problem is simple enough to let us simulate shrinking-
horizon policies on mutually independent test scenarios. We considered one cpu day as
the time limit beyond which the simulation time of one policy on 10^4 scenarios is not
acceptable. Simulation results for 4 shrinking-horizon policies on a fixed test sample
of 10^4 scenarios are reported on Table 5.4 (page 100). The average cpu time for the
evaluation of the sequence of decisions on one new scenario is also indicated on Table 5.4,
clearly illustrating the growing complexity of simulating shrinking-horizon policies. The
policy with branching factor 7 takes 6.5 seconds per scenario, that is, 6.5·10^4/(3600·24) ≈
0.75 days to be evaluated on the test sample.
The reported empirical averages on the test sample are our estimate for the expected
cost of the policies. The standard error, defined as the standard deviation of the costs
on the test sample divided by the square root of the test sample size, indicates the order
of magnitude of our uncertainty about the true value of the policies as solutions to the
multistage program.
The apparent plateau of performance beyond a branching factor of 5 suggests that
the shrinking-horizon policy with branching factor 5 already attains performances that
are almost optimal, and this is confirmed by comparing the empirical average on the test
sample to the lower bounds of Table 5.3, in particular the best bound obtained on the
single program with the largest scenario tree (branching factor 10).

Remark 5.4. As the same test sample is used for each policy, the difference of
costs between pairs of policies should be significant enough to allow us to rank
the various policies reliably. On Table 5.9 (page 101), the reported standard error
is the standard deviation, on the test sample, of the difference of costs between
each pair of policies considered in the section, divided by the square root of the
test sample size. Thus, for instance, a confidence interval for the difference of
average cost between shrinking-horizon policies with branching factors 3 and 5
could be built by considering that the estimator for the difference is approximately
normally distributed with a standard deviation of 0.70. For some pairs of policies,
the standard errors reported in Table 5.9 are larger, but then they correspond to
policies with a larger difference in their empirical performance.
In general, the uncertainty about the true value of the difference of expected costs
among policies appears to be considerably smaller than the uncertainty about the
level of the expected cost itself, and actually small enough to justify with hindsight
the choice of the test sample size for ranking the policies. With a test sample 4
times larger, we would be able to improve our statistical estimates by a factor of
2, but then 4 cpus would be needed to simulate the shrinking-horizon policies on
the test sample in less than one day.

Remark 5.5. The shrinking-horizon policy with branching factor 1 (which uses a
single scenario to represent the future, corresponding to the mean scenario conditional
on the information state, and thus implements a Model Predictive Control approach)
is already far better than a two-stage approximation strategy, which would consist in

• relaxing the multistage program to a two-stage program, with first-stage decision
(q1, v2, q2, v3, q3) and second-stage decision q4 adjusted to the observation
of (ε1, ε2, ε3), and then

• implementing the resulting optimal first-stage decision (q1, v2, q2, v3, q3) in
open-loop (that is, neglecting the observations of ε1 and ε2), followed by the
optimal second-stage decision q4 = min{d, q3} given the observation of d.

When simulating such a policy on the test sample, using a first-stage decision
computed on the scenario tree with branching factor 10 (and simply imposing that
the decisions q1 , v2 , q2 , v3 , q3 are common to every scenario), we obtain an empirical
cost equal to −261.39 (standard deviation of the estimate: 6.15), far worse than the
value −305.48 of the simplest shrinking-horizon policy. Such a test confirms the
interest of taking into account the available information on the demand and of adjusting
the production process online. It also allows one to quickly compute a lower bound on
the value of multistage stochastic programming (VMS): the VMS can be estimated
as at least the difference of performance between the simplest shrinking-horizon
policy and the policy based on the two-stage approximation.

5.3.4 Performances of Learned Policies

In the following experiments, we test policies that are built with the data extracted
from a given single scenario-tree approximation solved to optimality (the optimal value
of which is already reported in Table 5.3). We consider 3 such data sets, namely,
the ones obtained with branching factors 3, 5, and 7 respectively. Larger data sets are
advantageous from the statistical learning point of view, and at the same time they
provide better recourse decision examples, due to the finer discretization of the random
process used in the approximate stochastic programs.
The first-stage decision of the learned policies is exactly that of the corresponding
shrinking-horizon policy. The policy for the last stage decision is always set to the optimal
policy with decisions q4 = min{q3, d} ∈ R5. It remains to learn a mapping π1 from ε1 ∈ R
to q2 ∈ R8, and a mapping π2 from (ε1, ε2) ∈ R2 to q3 ∈ R5. Indeed, once qt−1 and qt are
determined, the value of vt can be deduced by solving a simple optimization program,
which had to be solved anyway to ensure that a predicted decision q̂t is feasible, and to
repair it if necessary.

Policies based on the Joint Gaussian Model.

First, we test the simple approach described in Section 5.2. We estimate the mean and the
covariance matrix of a joint Gaussian model for (ε1, ε2, ε3, q1, q2, q3) from the considered
data set. The value of the parameter ε0 in (5.14) is set to 0.01 in all the experiments.
The predicted conditional densities of q2 given (ε1, q1), and of q3 given (ε1, ε2, q2), are
computed with the conditioning formulae (5.12). The decisions are then inferred by
solving programs of the form (5.7), as described in Section 5.1.2. The optimized variables
are qt and vt, structured by the constraints (5.23), (5.24).
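The conditioning step is the standard partitioned-Gaussian computation; a minimal sketch of the formula we assume behind (5.12) is given below (the array names and the way the regularization enters are our own choices, made for illustration).

import numpy as np

def conditional_mean(mu, Sigma, obs_idx, dec_idx, obs_values, jitter=0.01):
    """Mean of the decision block given the observed block, under a joint Gaussian model."""
    mu_o, mu_d = mu[obs_idx], mu[dec_idx]
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)] + jitter * np.eye(len(obs_idx))  # regularized block
    S_do = Sigma[np.ix_(dec_idx, obs_idx)]
    return mu_d + S_do @ np.linalg.solve(S_oo, obs_values - mu_o)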

The performances of those policies are reported on Table 5.5. The branching factor
identifies the data set from which policies are learned.
The performance of the learned policies is worse than that of the corresponding
shrinking-horizon policies reported in Table 5.4, but already much better than the score
of the policy with a fixed optimized production plan described in Remark 5.5.

Policies based on the Gaussian Process Model.

Next, we test the nonparametric approach described in Section 5.2.2. Experiments were
limited to the case of a radial basis kernel with a common bandwidth parameter r > 0
set beforehand for each component of the decision vectors. For the components of the
predictive conditional mean of q2, we used the kernel with values

k(ε1^i, ε1^j) = exp{−(ε1^i − ε1^j)^2/(2r^2)} ,

and for the components of the predictive conditional mean of q3, we used the kernel with
values

k′(ε1^i, ε2^i, ε1^j, ε2^j) = exp{−Σ_{t′=1}^2 (ε_{t′}^i − ε_{t′}^j)^2/(2r^2)} = k(ε1^i, ε1^j) · k(ε2^i, ε2^j) .

We did not try to determine the best value of the bandwidth parameter r from
the data set, but rather tested the resulting policies on the test sample. The jitter
parameter σ² that enters the expression of the predictive conditional means (5.15) was
always set to 0.01.
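A minimal sketch of the predictive conditional mean we assume behind (5.15), namely the standard Gaussian Process regression formula with an RBF kernel and a jitter term, is as follows (the toy data at the end are purely illustrative):

import numpy as np

def rbf_kernel(U, V, r):
    d2 = np.sum((U[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * r ** 2))

def gp_predictive_mean(X, Y, X_new, r, sigma2=0.01):
    """Standard GP regression mean: k(X_new, X) (K + sigma2 I)^{-1} Y, one column per target."""
    K = rbf_kernel(X, X, r) + sigma2 * np.eye(len(X))
    return rbf_kernel(X_new, X, r) @ np.linalg.solve(K, Y)

# toy usage: 9 training scenarios with a scalar input and a 2-dimensional decision target
X = np.linspace(-2.0, 2.0, 9).reshape(-1, 1)
Y = np.column_stack([np.maximum(X[:, 0], 0.0), X[:, 0] ** 2])
print(gp_predictive_mean(X, Y, np.array([[0.3]]), r=0.5))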
The performance of the policies with the best found value of r — which depends on
the size of the data set from which the policy is learned — is reported in Table 5.6.
If we compare the results of Table 5.6 to the results of Table 5.5, we observe that on the
same training set (identified by the branching factor), the selected policy based on the
Gaussian Process model is better, in the case of branching factors 3 and 7, than the
corresponding policy based on the joint Gaussian model, and in fact a lot better with the
branching factor 3, corresponding to the smallest studied training set. On the training
set with the branching factor 5, however, the policy based on the joint Gaussian model is
better. In fact, that latter policy seems to dominate the 3 policies of Table 5.6, suggesting
that the simple approach based on the joint Gaussian model was worth investigating.
Finally, we tested the idea of emulating input-dependent bandwidth choices by using
kernels with values

k(ε1^i, ε1^j) = exp{−[Φ(ε1^i) − Φ(ε1^j)]^2/(2r^2)} ,
k′(ε1^i, ε2^i, ε1^j, ε2^j) = k(ε1^i, ε1^j) · k(ε2^i, ε2^j) ,

where Φ is the cumulative distribution function of N(0, 1). In fact, since each εt follows
N(0, 1), it holds that Φ(εt) is uniformly distributed on the interval [0, 1]. It seems then
wise to use a constant bandwidth r on this transformed input space, rather than on the
original input space.
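In code, emulating the input-dependent bandwidth amounts to mapping the inputs through Φ before applying the same RBF kernel (a sketch; the function name is ours):

import numpy as np
from scipy.stats import norm

def rbf_on_transformed_inputs(U, V, r):
    """RBF kernel computed on Phi(inputs), which lie in [0, 1] when the inputs are N(0,1)."""
    Pu, Pv = norm.cdf(U), norm.cdf(V)
    d2 = np.sum((Pu[:, None, :] - Pv[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * r ** 2))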
The performance of the policies with the best found value of r — which happened to
be independent of the size of the data set from which the policy is learned — is reported
in Table 5.7. If we compare the results of Table 5.7 to the results of Table 5.6, we observe

Tab. 5.4: Simulation results for shrinking-horizon policies.

Branching factor    Average    Standard error    Cpu time (sec.) per scenario
1 -305.48 4.88 0.9
3 -369.52 5.91 1.7
5 -374.34 6.37 3.1
7 -374.56 6.17 6.5

Tab. 5.5: Results for policies based on the joint Gaussian model.

Branching factor    Average    Standard error    Cpu time (sec.) per scenario
3 -307.57 5.27 1.3
5 -360.81 6.13 1.3
7 -356.07 5.88 1.3

Tab. 5.6: Results for policies based on the Gaussian Process model.

Branching factor    Average    Standard error    Cpu time (sec.) per scenario
3 -347.49 5.58 1.1
5 -348.94 6.12 1.1
7 -357.63 6.02 1.2

Tab. 5.7: Gaussian Process model with a transformed input space.

Branching factor    Average    Standard error    Cpu time (sec.) per scenario
3 -359.50 5.73 1.2
5 -368.76 6.31 1.2
7 -363.26 6.05 1.2

Tab. 5.8: Gaussian Process with a transformed input space and a fast repair procedure.

Branching factor    Average    Standard error    Cpu time (sec.) per scenario
3 -359.87 5.74 0.001
5 -371.10 6.33 0.001
7 -370.28 6.12 0.001

Tab. 5.9: Standard error of pairwise differences on the test sample.

Columns, from left to right: policies of Tab. 5.4 (branching factors 3, 5, 7), Tab. 5.5 (3, 5, 7),
Tab. 5.6 (3, 5, 7), Tab. 5.7 (3, 5, 7), Tab. 5.8 (3, 5, 7). Each row lists the entries for the
columns of the policies that come after it in this ordering (upper triangle only).

Tab. 5.4 / 1:  1.94 2.39 2.17 | 2.03 2.22 2.02 | 1.74 2.13 2.06 | 1.88 2.33 2.06 | 1.88 2.35 2.13
Tab. 5.4 / 3:       0.70 0.43 | 1.81 0.97 0.96 | 0.72 0.81 0.59 | 0.42 0.66 0.43 | 0.42 0.68 0.43
Tab. 5.4 / 5:            0.33 | 2.21 0.93 1.13 | 1.23 0.80 0.76 | 0.97 0.27 0.56 | 0.96 0.26 0.46
Tab. 5.4 / 7:                 | 2.00 0.88 1.00 | 0.98 0.69 0.57 | 0.71 0.32 0.32 | 0.71 0.35 0.24
Tab. 5.5 / 3:                 |      1.98 1.72 | 1.33 1.72 1.83 | 1.68 2.15 1.95 | 1.68 2.16 1.95
Tab. 5.5 / 5:                 |           0.37 | 1.12 0.72 1.07 | 1.12 0.88 0.92 | 1.12 0.88 0.88
Tab. 5.5 / 7:                 |                | 0.94 0.77 1.09 | 1.05 1.07 0.98 | 1.05 1.08 0.97
Tab. 5.6 / 3:                 |                |      0.88 0.79 | 0.63 1.17 0.93 | 0.63 1.19 0.95
Tab. 5.6 / 5:                 |                |           0.68 | 0.99 0.76 0.75 | 0.99 0.76 0.69
Tab. 5.6 / 7:                 |                |                | 0.73 0.68 0.51 | 0.73 0.70 0.53
Tab. 5.7 / 3:                 |                |                |      0.87 0.62 | 0.01 0.89 0.63
Tab. 5.7 / 5:                 |                |                |           0.42 | 0.87 0.07 0.30
Tab. 5.7 / 7:                 |                |                |                | 0.62 0.46 0.19
Tab. 5.8 / 3:                 |                |                |                |      0.89 0.63
Tab. 5.8 / 5:                 |                |                |                |           0.33

that on the same training set, the policies using the kernel on the transformed space are
significantly better than the policies using the kernel on the original input space.
Therefore, these experiments illustrate that the performance of the policies based on
the Gaussian Process model is sensitive to the choice of the kernel. Depending on the
effort that one is ready to make to test different choices of kernels, one can thus expect
to obtain good policies with the Gaussian Process model, perhaps even with small data
sets, as was the case here with the training set corresponding to branching factor 3.

Discussion.

In terms of optimality, the results obtained here suggest that the learned policies are
able to attain performances that are quite decent with respect to the shrinking-horizon
policies. With trees of branching factor 3, for instance, the policy based on Gaussian
Process regression (with a good choice for the kernel) attains an average cost of about
−360 on the test sample, while the corresponding shrinking-horizon policy attains −370.
In terms of simulation times, with our Matlab implementation that calls cvx for
formulating and solving all programs, the learned policies are penalized by the need
to repair the predictions by solving a quadratic program, and the simulation times are
thus similar to the time taken by simulating the shrinking-horizon policy with branching
factor 1.
These results led us to try to replace the generic MAP repair procedure of Section 5.1
by a problem-specific, faster heuristic. In the present context, a possible heuristic consists
in fixing an ordering of the components of qt a priori, and then using the stocks qt−1 as
needed to reach the nominal level (q̂t )j predicted by the learned policy, or to a lower level
if one needed component of qt−1 gets depleted. The priority order is a hyper-parameter
of the repair procedure that can be tuned; our prior belief is that products with a higher
profit per unit should be given a higher priority in the allocation of the available stocks of components.
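One possible reading of such a heuristic is sketched below; the data structures are our own assumptions (W[i, j] is the quantity of component i consumed per unit of product j, q_prev the available component stocks, q_hat the nominal product quantities predicted by the learned policy), and this is not claimed to be the exact procedure used in the experiments.

import numpy as np

def repair_by_priority(q_hat, q_prev, W, priority):
    """Serve products in a fixed priority order, granting each its predicted quantity
    or less if some needed component stock gets depleted."""
    stock = np.asarray(q_prev, dtype=float).copy()
    q = np.zeros(len(q_hat))
    for j in priority:                              # e.g., products ranked by profit per unit
        need = W[:, j]
        ratios = stock[need > 0] / need[need > 0]   # units of product j each component allows
        limit = ratios.min() if ratios.size else np.inf
        q[j] = min(max(q_hat[j], 0.0), limit)
        stock = stock - need * q[j]
    return q, stock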
On the test sample, this new repair procedure combined with the Gaussian model
turns out to degrade the performance of the policy considerably. But combined with the
Gaussian process model, the performance is maintained (with the best found ordering for
the repair procedure), suggesting that the predictions of the Gaussian Process model are
precise enough to mitigate the potential inaccuracies of the repair procedure (Table 5.8).

Remark 5.6. It is a recurrent observation on our tables that the policies learned
from the data set with branching factor 5 slightly dominate those learned with
branching factor 7. One possible explanation is that despite its smaller cardinality,
the first data set contains better examples of decisions. In particular, the first-stage
decision may be better, or at least more robust to inaccuracies in the subsequent
recourse decisions. In fact, we have often observed that in two-stage programs,
the exact value of the first-stage decision optimal with respect to an approximate
program built with a deterministic method can actually be degraded by using more
discretization points, by a simple effect of luck in the selection of the values.

We can now claim that the best learned policy for our problem is the middle policy of
Table 5.8. Thanks to the high efficiency in the evaluation of this learned policy with the
fast repair procedure, we are able to test the policy on a new, independent test sample
of 106 scenarios.
The empirical average of the cost of the policy on this new test sample is -371.87, esti-
mated with standard error 0.63. The simulation of the policy on the new independent test
sample takes about 15 minutes in cpu time. With a confidence of approximately 95 %, the
exact value of the selected policy lies in the 2-standard error interval [−373.14, −370.61].

5.4 Conclusions

In this chapter, alternative methods for learning policies from data sets of scenario-
decisions pairs were explored, especially methods based on Gaussian Process regression.
The framework of Gaussian Processes was found attractive for several reasons: the pre-
dictions are relatively easy to compute (with small data sets, or in fact with kernels that
induce sparse Gram matrices), and are not based on probabilistic assumptions concerning
the way the scenarios of a data set were generated, in particular independence assump-
tions. This last observation is important, because the scenarios of a data set usually
come from a scenario tree built by conditional sampling or by deterministic methods,
and as such, are not independent. It is also true that the sequence of decisions associated
to a scenario actually depends, through the optimization of the decisions, on the other
scenario/decisions pairs present in the tree, so that we may be far from a situation where
each scenario/decisions pair in the data set could be viewed as generated independently
from some unknown probability distribution. Our case study suggests that Gaussian pro-
cesses can be combined gracefully with scenario-tree generation methods, with choices
guided by the knowledge on the way inputs were distributed or generated.
The MAP repair procedure expounded at the beginning of the chapter is generic, but
complicates the online evaluation of a learned policy. In
the next chapter, we review in detail the theory on Euclidian projections, and investigate
to which extent it is possible to accelerate the algorithm that computes that kind of
projection mapping by exploiting a data set of examples of projections already computed.
Nevertheless, our experiment in the present chapter seems to show that when the
feasibility sets are described by many constraints, it is better, from the point of view of
the computational complexity, to try to tailor a simple heuristic to restore the feasibility
of the decisions and obtain policies that are simple to evaluate, than to resort to a generic
procedure based on online optimization.

Tab. 5.10: Multi-product assembly problem: Values of the parameters in Table 5.2.

c1 = [0.25 1.363 0.8093 0.7284 0.25 0.535 0.25 0.25 0.25 0.4484 0.25 0.25]^T

c2 = [2.5 2.5 2.5 2.5 13.22 2.5 3.904 2.5]^T

       | 0.4572  0       4.048   0       0       0       0.8243  11.37  |
       | 0       0       0.7674  0.5473  0.3776  0       0       0      |
       | 0.4794  0       0.4861  1.223   0       1.475   0       0      |
       | 0       0       0       0       0.5114  0.3139  0       0      |
       | 0       12.29   1.378   0       0.3748  0.4554  0       0      |
       | 0.7878  0       0.293   1.721   0       0       0       0      |
[w2] = | 1.504   0.4696  0.248   0       0.1852  0       0.3486  0      |
       | 0       1.204   0       0.7598  0.452   0       0       0      |
       | 0       0       0.2515  0.3753  0.6249  0       1.248   0      |
       | 1.545   0       0       0       0       0       0.2732  0      |
       | 0       0       0       0.6597  0       2.525   0       0      |
       | 0       0       1.595   0       0       1.51    1.041   0.9847 |

c3 = [3.255 2.5 2.5 8.418 2.5]^T

       | 0       1.223   0.6367  0       0      |
       | 0       0       0       1.111   0      |
       | 0       0       0.4579  0       0      |
[w3] = | 0       0.1693  0.6589  0       0      |
       | 0.5085  2.643   0       0       0      |
       | 0.4017  0       0       0       0      |
       | 0       0.7852  85.48   0       0      |
       | 0       0       0       0.806   0.5825 |

c4 = [−21.87 −98.16 −31.99 −10 −10]^T

b0 = [13.9 12.86 18.21 10.14 17.21]^T
b1 = [9.708 9.901 7.889 4.387 4.983]^T
b2 = [2.14 6.435 3.2 9.601 7.266]^T
b3 = [4.12 7.446 2.679 4.399 9.334]^T
Chapter 6

Learning Projections on Random Polyhedra

Recent advances in numerical optimization algorithms (Nesterov, 2007; Nemirovski et al.,


2009) seem to suggest that two very different categories of convex feasibility sets can
be distinguished: the sets on which the Euclidian projection (or its generalization via
Bregman divergences) can be computed in closed-form, and the sets for which evaluating
projections requires the use of standard iterative methods.
In many applications, the feasibility set of interest is a convex polyhedron, that is, a
set described by a finite number of linear equality and linear inequality constraints, for
which Euclidian projectors in closed-form are typically not available. In this chapter, we
consider the fundamental operation of evaluating the Euclidian projection of the origin
(zero vector) on a random convex polyhedral set. We study a subclass of that problem
in depth, namely, a subclass related to the MAP repair procedure evaluated in the case
study of chapter 5. The analysis suggests an algorithm able to predict exact projections
by generalizing information from a data set of examples of projections. We say that
the algorithm is able to learn projections, even if strictly speaking, the algorithm knows
exactly to which extent it can generalize the examples already encountered, so that when
it is unable to return an exact result for the projection, it can simply call a standard
optimization procedure.
The overall goal of the chapter is less to build an efficient implementation of the
studied approach, than to identify its limitations, inasmuch as this latter perspective
may also shed light on limitations of learning applied to data sets of minimizers.
The chapter is organized as follows. Section 6.1 motivates the studied problem. Sec-
tion 6.2 presents geometrical insights, and Section 6.3 builds on these insights to study
the properties of the projections. Section 6.4 presents algorithms derived from those re-
sults. Section 6.5 evaluates empirically on a series of random problems the circumstances
for the success of the approach, and Section 6.6 concludes with references to related work.
We have written proofs for a series of propositions collected in Sections 6.1, 6.2, 6.3,
as it is by this mechanism that we came to the ideas of Sections 6.4 and 6.5. The
proofs have been established independently of the existing literature. With hindsight,
we believe that the results of Sections 6.1, 6.2, 6.3 are natural and rather standard (see,
for instance, Facchinei and Pang (2003); Dontchev and Rockafellar (2009)), while being
sometimes rediscovered in some communities (see Section 6.6).

Notations.

In the sequel, we use the following notations.

• AT ∈ Rn×m is the transpose of A ∈ Rm×n .

• ||z|| = (z T z)1/2 = hz, zi1/2 is the Euclidian norm of z ∈ Rn .

• B = {z : ||z|| ≤ 1} is the closed unit ball in Rn with n understood from the context.

• For a scalar ρ and a set B, ρB stands for the set {ρv : v ∈ B}.

• For v1 ∈ Rn and a set B2 ⊂ Rn , v1 + B2 stands for the set {v1 + v2 : v2 ∈ B2 }.

• For sets B1 , B2 ⊂ Rn , B1 + B2 stands for the set {v1 + v2 : v1 ∈ B1 , v2 ∈ B2 }. If


B1 is a singleton B1 = {v1 }, we write v1 + B2 rather than {v1 } + B2 .

• For x = [x1 . . . xn]^T and y = [y1 . . . yn]^T ∈ Rn, x ≼ y means xi ≤ yi, 1 ≤ i ≤ n,
and x ≺ y means xi < yi, 1 ≤ i ≤ n.

• Given x = [x1 . . . xn ]T ∈ Rn , x+ (or [x]+ ) denotes the vector in Rn with components


max{0, xi }, 1 ≤ i ≤ n.

• Given x = [x1 . . . xn ]T ∈ Rn and a subset I of {1, . . . , n} of cardinality |I|, the


vector xI ∈ R|I| is the subvector of x that stacks the components xi such that
i ∈ I. For a matrix A ∈ Rn×m with rows aT1 ,. . . ,aTn , the matrix AI ∈ R|I|×m is the
submatrix of A that stacks the rows aTi of A such that i ∈ I.

6.1 Problem Statement

We consider the following parametric optimization program over y ∈ Rm,

P(x(ω)) : minimize f(y) = ½ ||y||^2 subject to Ay ≼ x(ω) ,   (6.1)

assuming that the parameter is the realization x(ω) ∈ Rs of a random variable x drawn
from some unknown but fixed probability distribution, and that A ∈ Rs×m is a fixed
matrix.
We are interested in the prediction of the optimal solution y ∗ (ω) to P(x(ω)), given
x(ω), assuming that we know P (we do not have to estimate A, for instance). This
problem could be addressed from a machine learning point of view by trying to learn
a hypothesis h in some hypothesis space H that approximates well the optimal solu-
tion y∗(ω), in the sense that the distance between h(x(ω)) and the feasibility set

C(x(ω)) := {y ∈ Rm : Ay ≼ x(ω)}   (6.2)

is small, and the regret ||h(x(ω))||^2/2 − ||y∗(ω)||^2/2 is small.


However, the interest of a prediction of y∗(ω) which is only “nearly feasible” remains
hard to define in the absence of a precise interpretation of the constraints in the context
of an application. Without renouncing totally to that possible avenue (results that could
be useful in that perspective are also given in this chapter), we find it more adequate
to look first for approaches that could accelerate the repeated evaluation of the optimal
solution of P(x(ω)) for a sequence of realizations of x, that is, in some sense, build a
self-improving algorithm (Ailon et al., 2006).
In the sequel, we assume that x(ω) is valued in the set

dom C := {x ∈ Rs : C(x) ≠ ∅} ,   (6.3)

called the domain of C, with C interpreted as a set-valued mapping C : Rs ⇒ Rm with


values C(x) (Dontchev and Rockafellar, 2009) (see Appendix B, Definition B.8). In
probabilistic terms, we assume that the support of the distribution of x is in dom C.
We do not assume that the support of the distribution of x is bounded, although some
specific results can be established in that case.
The setting covers a large class of parametric, strictly convex quadratic programs, as
shown by the following proposition and its corollary.

6.1 Proposition. Let S ∈ Rm×m be a positive definite matrix, let F ∈ Rs×m be a


matrix, and let u ∈ Rm , v ∈ Rs be vectors. The quadratic program over z ∈ Rm ,
minimize ½ z^T S z + u^T z subject to F z ≼ v ,   (6.4)

becomes, with a suitable change of variables, the problem of projecting (with respect to
the Euclidian metric) the origin 0 on some polyhedral set.

Proof. Let S = R^T R be the Cholesky factorization of S (where R is upper triangular).
Let z = R^{-1} y − S^{-1} u. By substitution, we obtain a program over y ∈ Rm,

minimize ½ y^T y − ½ u^T S^{-1} u subject to F R^{-1} y ≼ v + F S^{-1} u ,

where the constant term −u^T S^{-1} u/2 can be dropped. Hence the program on z is equiv-
alent to the evaluation of the Euclidian projection of 0 ∈ Rm on the set C(x) = {y ∈
Rm : Ay ≼ x} with A = F R^{-1} and x = v + F S^{-1} u. Assuming that (6.4) is feasible and
thus C(x) is nonempty, the optimal solution z∗ to (6.4) is recovered from the optimal
solution y∗ using z∗ = R^{-1} y∗ − S^{-1} u.

6.2 Corollary. The parametric optimization program over z ∈ Rm with parameters
u(ω) ∈ Rm, v(ω) ∈ Rs,

Q(u(ω), v(ω)) : minimize ½ z^T S z + u(ω)^T z subject to F z ≼ v(ω) ,   (6.5)

can be recast as the parametric program P(x(ω)) by setting x(ω) = v(ω) + F S^{-1} u(ω)
and A = F R^{-1} in (6.1), where S = R^T R is the Cholesky factorization of S.
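A minimal sketch of this reduction, with a generic solver call used only to check the equivalence on a random toy instance (numpy returns the lower Cholesky factor, so R is its transpose):

import numpy as np
from scipy.optimize import minimize

def qp_to_projection(S, F, u, v):
    """Map min 1/2 z'Sz + u'z s.t. Fz <= v to min 1/2 ||y||^2 s.t. Ay <= x (Prop. 6.1)."""
    R = np.linalg.cholesky(S).T                 # S = R^T R, with R upper triangular
    A = F @ np.linalg.inv(R)
    x = v + F @ np.linalg.solve(S, u)
    return A, x, R

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3)); S = M @ M.T + np.eye(3)
F = rng.standard_normal((4, 3)); u = rng.standard_normal(3)
v = F @ rng.standard_normal(3) + 1.0            # guarantees a nonempty feasibility set
A, x, R = qp_to_projection(S, F, u, v)

y_star = minimize(lambda y: 0.5 * y @ y, np.zeros(3),
                  constraints=[{"type": "ineq", "fun": lambda y: x - A @ y}]).x
z_star = minimize(lambda z: 0.5 * z @ S @ z + u @ z, np.zeros(3),
                  constraints=[{"type": "ineq", "fun": lambda z: v - F @ z}]).x
print(np.allclose(np.linalg.solve(R, y_star) - np.linalg.solve(S, u), z_star, atol=1e-4))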

6.2 Geometry of Euclidian Projections

Let us start by recalling some useful geometrical facts about Euclidian projections on
convex polyhedral sets (Rockafellar and Wets, 1998, Example 6.16, Theorems 6.9 and
6.46, Proposition 6.17). Figure 6.1 provides a visual support to the following definitions.

6.3 Definition. Let C ⊂ Rm be a closed set. The Euclidian projection mapping on C


is the set-valued mapping PC : Rm ⇒ Rm with values

PC(y) = {ȳ ∈ C : ||ȳ − y|| ≤ ||y′ − y|| for every y′ ∈ C} .


Fig. 6.1: ȳ is the projection of y on C if y − ȳ is in the normal cone to C at ȳ.

When C is a nonempty closed convex set, PC is single-valued, in the sense that PC (y) is
a singleton.

6.4 Definition. Let C ⊂ Rm be a closed set. The proximal normals to C at ȳ ∈ Rm


are the vectors d ∈ Rm such that ȳ ∈ PC (ȳ + τ d) for some τ > 0.

6.5 Definition. Let C ⊂ Rm be a closed convex set and ȳ ∈ C. A vector d is normal
to C at ȳ if ⟨d, y′ − ȳ⟩ ≤ 0 for every y′ ∈ C. The normal cone to C at ȳ is the set
NC(ȳ) = {d ∈ Rm : ⟨d, y′ − ȳ⟩ ≤ 0 for every y′ ∈ C} if ȳ ∈ C, or NC(ȳ) = ∅ if ȳ ∉ C.

The normal cone to a closed convex set C at ȳ ∈ C always contains 0. If ȳ is in the


interior of C, the normal cone is reduced to {0}. A more general definition for the normal
cone, valid for an abstract set C, is also available, but it is not needed in the sequel.
The normal cone to a convex polyhedral set has a particular expression, given by the
following proposition.

6.6 Proposition. Let C = {y ∈ Rm : Ay ≼ b}, where A is a matrix with rows a_i^T. For
ȳ ∈ C, let I(ȳ) = {i : a_i^T ȳ = b_i} denote the set of active constraints at ȳ. Then the
normal cone to C at ȳ is given by

NC(ȳ) = {d = A^T λ : λ_i ≥ 0 for i ∈ I(ȳ), λ_i = 0 for i ∉ I(ȳ)} .

The relation between the normal cone and the Euclidian projection mapping is given
in the following proposition, only valid for closed convex sets.

6.7 Proposition. For a closed convex set C ⊂ Rm , every normal vector is a proximal
normal vector: d ∈ NC (ȳ) iff ȳ ∈ PC (ȳ + d), where in fact ȳ = PC (ȳ + d).

From Proposition 6.7, one deduces that every point ȳ of C = {y ∈ Rm : Ay ≼ b}
defines an equivalence class of points

[ȳ] = {y ∈ Rm : PC(y) = ȳ}
    = {ȳ + A^T λ : λ_i ≥ 0 for i ∈ I(ȳ), λ_i = 0 for i ∉ I(ȳ)}

with [ȳ] reduced to the singleton {ȳ} when ȳ is in the relative interior of C — the relative
interior of a nonempty convex set C corresponds to the interior of C when C is viewed
as a subset of the smallest linear space containing C (the affine hull of C).

Remark 6.1. Let C = {y ∈ Rm : Ay ≼ b}, let y be some point in Rm, and let


ȳ = PC (y) be the projection of y on C. Let I(ȳ) = {i : aTi ȳ = bi } denote the index
set of active constraints at ȳ. If I(ȳ) were known in advance for any y, then one
could compute PC (y) as the projection of y on the linear space CI defined with the
set of active constraints I = I(PC (y)) by

CI = {y ∈ Rm : aTi y = bi , i ∈ I} = {y ∈ Rm : AI y = bI } .

In that hypothetical situation, a closed-form formula is available for the projection.


For instance, assuming for simplicity that the rows of AI are linearly independent
(Dontchev and Rockafellar, 2009, Exercise 2D.10), one has

PCI (y) = y − ATI (AI ATI )−1 (AI y − bI ) . (6.6)

For some particular sets C (for example, hyperrectangles), it holds that I =


I(PC (y)) is equal to the index set of active or violated constraints at y. But for an
arbitrary polyhedral set C and point y, I(PC (y)) is difficult to guess, and does not
usually coincide with the index set of active or violated constraints at y.
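A minimal sketch of formula (6.6), assuming the index set I is known and the rows of A_I are linearly independent:

import numpy as np

def project_on_active_subspace(y, A_I, b_I):
    """Euclidian projection of y on {z : A_I z = b_I}, cf. (6.6)."""
    G = A_I @ A_I.T                                    # invertible when the rows are independent
    return y - A_I.T @ np.linalg.solve(G, A_I @ y - b_I)

# small check: projecting the origin on the plane x1 + x2 = 2 gives (1, 1, 0)
print(project_on_active_subspace(np.zeros(3), np.array([[1.0, 1.0, 0.0]]), np.array([2.0])))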

Hoffman’s lemma (Hoffman, 1952), stated next, shows that the Euclidian distance
d(y, C) = ||y − PC (y)|| from any point y to a polyhedral set C can be related to a
measure that does not depend on PC (y).

6.8 Lemma (Hoffman’s Lemma). Let C = {y ∈ Rm : Ay  b} be nonempty with


A ∈ Rs×m a nonzero matrix. For any y ∈ Rm , there exists a scalar κ(A) > 0 depending
on A such that d(y, C) ≤ κ(A) || [Ay − b]+ ||.

Estimating κ(A) and its sensitivity with respect to perturbations of A is an important


subject of study — useful references are collected in Facchinei and Pang (2003, Notes to
Chapter 3, page 332).
A well-known corollary of Hoffman’s lemma is stated in the next proposition (Propo-
sition 6.10). Let us first define the Hausdorff “distance” between two sets (Dontchev and
Rockafellar, 2009, page 138).

6.9 Definition. The excess of C0 ⊂ Rm beyond C1 ⊂ Rm is the quantity

e(C0 , C1 ) = supy∈C0 d(y, C1 ) ,

with e(∅, C1 ) = 0 if C1 6= ∅ and e(∅, ∅) = ∞. Equivalently,

e(C0 , C1 ) = inf{ρ ≥ 0 : C0 ⊂ C1 + ρB} .

The Pompeiu-Hausdorff “distance” between C0 and C1 can then be defined as the


quantity
dh (C0 , C1 ) = max{e(C0 , C1 ), e(C1 , C0 )} .

6.10 Proposition. Let C(b) = {y ∈ Rm : Ay ≼ b}. Let b0 and b1 be two vectors such
that C(b0 ) and C(b1 ) are nonempty. Then dh (C(b0 ), C(b1 )) ≤ κ(A)||b0 − b1 ||.

Proof. For y ∈ C(b0), we have Ay ≼ b0 and thus Ay − b1 ≼ b0 − b1; in particular
[Ay − b1]+ ≼ [b0 − b1]+. Hence ||[Ay − b1]+|| ≤ ||[b0 − b1]+|| ≤ ||b0 − b1||. By Hoffman's lemma
applied to C(b1), we have d(y, C(b1)) ≤ κ(A)||[Ay − b1]+|| ≤ κ(A)||b0 − b1||. As y ∈ C(b0)
is arbitrary, sup_{y∈C(b0)} d(y, C(b1)) ≤ κ(A)||b0 − b1||. Similarly, sup_{y∈C(b1)} d(y, C(b0)) ≤
κ(A)||b1 − b0||, and the result follows.

Remark 6.2. Observe that having A constant is essential in Proposition 6.10. For
instance, consider the set-valued mapping C′ : R ⇒ R2 with values

C′(ε) = {(x, t) ∈ R2 : t ≥ (1 − ε)|x|} = {y ∈ R2 : A(ε)y ≼ 0}

with A(ε) = [ (1 − ε)  −1 ; −(1 − ε)  −1 ]. For each η ≥ 0, the point (η, η(1 − ε)) ∈ C′(ε) is at
distance ηε/√2 from the set C′(0), so that by definition of the Pompeiu-Hausdorff
distance, dh(C′(0), C′(ε)) = ∞ for any ε > 0, whereas the matrices A(0) and A(ε)
could be made arbitrarily “close” by choosing ε > 0 small enough.

Now, coming back to the parametric program (6.1), we observe that Hoffman’s lemma
allows one to prove that if x in (6.1) follows a distribution having a compact support, then
the projection of the origin 0 ∈ Rm on the random polyhedral set C(x) defined by (6.2)
lies in a bounded set.

6.11 Proposition. Let C : Rs ⇒ Rm be the set-valued mapping with values C(x) defined
by (6.2). If x follows a probability distribution with compact support, then there exists
a finite κ̄ > 0 such that the projection y ∗ (ω) of the origin on the polyhedral set C(x(ω))
satisfies ||y ∗ (ω)|| ≤ κ̄ for all possible realizations x(ω) of x.

Proof. We assume that x(ω) ∈ X ∩ dom C, where X is a bounded subset of Rs . We


must show that the minimizer y ∗ (ω) of P(x(ω)) lies in a bounded subset Y of Rm . But
actually, if x(ω) ∈ ρB for some constant ρ > 0, and if x(ω) ∈ dom C, where C(x(ω)) =
{y ∈ Rm : Ay  x(ω)}, then by Hoffman’s lemma it holds that ||y ∗ (ω)|| = d(0, C(x(ω))) ≤
κ(A)|| [−x(ω)]+ || ≤ κ(A)||x(ω)|| ≤ κ(A)ρ, where κ(A) is a constant depending on A, so
that y ∗ (ω) lies in the ball Y = κ(A)ρB. We set κ̄ = κ(A)ρ.

Proposition 6.11 shows that if one wants to try to predict from x(ω) a “nearly feasible”
optimal solution y ∗ (ω), with x drawn from a distribution with compact support, then
one could legitimately select a hypothesis space H of bounded functions.

6.3 Properties of Optimal Solutions

In this section, we establish a list of properties of optimal solutions to the parametric


program (6.1). The results that are not directly used in the subsequent sections are
marked by a star (⋆). The results converted to an algorithm in the sequel are Propositions
6.19 and 6.22.
We will first note the following simple characterization of the domain of the set-valued
mapping C defined by (6.2):

6.12 Proposition. The domain of the set-valued mapping C : Rs ⇒ Rm with values
C(x) = {y ∈ Rm : Ay ≼ x} is the closed convex cone dom C = range(A) + R^s_+.

Proof. The set dom C is the projection of the set {(x, y) ∈ Rs × Rm : Ay ≼ x} on Rs
(first s components) and is thus closed as the projection of a closed set (the inequality
constraints defining the set in Rs+m are non-strict). The set dom C is convex since
x0, x1 ∈ dom C means that Ay0 ≼ x0, Ay1 ≼ x1 for some y0, y1 ∈ Rm, implying the
existence of yt = (1−t)y0 + ty1 satisfying Ayt ≼ (1−t)x0 + tx1 for 0 ≤ t ≤ 1. Furthermore,
dom C is a cone since Ay ≼ x entails A(ty) ≼ tx for t ≥ 0, so that x ∈ dom C entails
tx ∈ dom C for t ≥ 0. Now, the constraints defining the set C(x) are equivalent to
x = Ay + ξ, ξ ≽ 0, where {v ∈ Rs : v = Ay, y ∈ Rm} is by definition the range of
A ∈ Rs×m.

A possible way to draw random points x(ω) ∈ dom C is thus to draw a linear com-
bination of vectors forming an orthonormal basis for the range of A, and then add to the resulting
vector a random vector with nonnegative components.
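In code, this sampling recipe reads as follows (a sketch; the choice of distributions for the weights and the nonnegative part is arbitrary):

import numpy as np

def sample_x(A, rng):
    """Draw x in dom C = range(A) + R^s_+, so that C(x) = {y : Ay <= x} is nonempty."""
    s, m = A.shape
    return A @ rng.standard_normal(m) + rng.exponential(size=s)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
x = sample_x(A, rng)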
Although we do not directly invoke it in the sequel, for completeness we also recall
the following structural property:

6.13 Proposition(⋆). The function g(x) = inf_{y∈C(x)} ½||y||^2 is convex in x.

For the notion of extended-real-valued function used in the following proof, see Ap-
pendix A.1.

Proof. The program P(x) amounts to the minimization of the extended-real-valued func-
tion f̄ defined by f̄(x, y) = ||y||^2/2 if Ay ≼ x, and f̄(x, y) = ∞ otherwise. We
check that f̄(x, y) is jointly convex in x, y. Let us write xt = (1 − t)x0 + tx1 and
yt = (1 − t)y0 + ty1 for 0 < t < 1. If f̄(x0, y0) and f̄(x1, y1) are finite, implying
Ay0 ≼ x0, Ay1 ≼ x1, then f̄(xt, yt) is also finite, since Ayt ≼ xt and f̄(xt, yt) = ||yt||^2/2 ≤
(1−t)||y0 ||2 /2+t||y1 ||2 /2 = (1−t)f¯(x0 , y0 )+tf¯(x1 , y1 ). If f¯(x0 , y0 ) = ∞ or f¯(x1 , y1 ) = ∞
, the convexity inequality f¯(xt , yt ) ≤ (1−t)f¯(x0 , y0 )+tf¯(x1 , y1 ) = ∞ for 0 < t < 1 is triv-
ially verified. Hence f¯(x, y) is convex in (x, y) (Rockafellar, 1970, Theorem 4.1). As a con-
vex set, the epigraph of f¯ defined by epi f¯ = {(x, y, α) ∈ (Rs ×Rm )×R : α ≥ f¯(x, y)} has
its projection on its component Rs × R convex as well. The function g(x) = inf y f¯(x, y)
whose epigraph is epi g = {(x, α) ∈ Rs × R : (x, y, α) ∈ epi f¯ for some y} is thus con-
vex.

Now, for x(ω) ∈ dom C, the program P(x(ω)) has a single minimizer y ∗ (ω) correspond-
ing to the projection of 0 ∈ Rm on the convex polyhedral set C(x(ω)). By Proposition 6.7,
setting C = C(x(ω)), a point y ∈ Rm is thus optimal if the vector 0 − y = −y lies in the
normal cone to C at y, that is, −y ∈ NC(y), or equivalently y + NC(y) ∋ 0.
Given the optimal solution y ∗ (ω) to P(x(ω)), it is easy to describe sets of nearly
optimal solutions, called -optimal solutions (see Appendix A, Section A.4). To this end,
let us recall the notion of tangent cone to an arbitrary set C (Rockafellar and Wets, 1998,
Definition 6.1, Theorem 6.9).

6.14 Definition(⋆). A vector d is tangent to C at ȳ ∈ C if for some sequence {y^ν}_{ν∈N}


of points y ν ∈ C converging to ȳ, and some sequence {τ ν }ν∈N of scalars τ ν converging
to 0 with 0 < τ ν+1 < τ ν , one has

(y ν − ȳ)/τ ν → d .

The set of all such vectors d is a closed cone, possibly reduced to the singleton {0}, called
the tangent cone to C at ȳ, and written TC (ȳ). In the particular case where C is a
convex subset of Rm , the tangent cone to C at ȳ is a convex set given by

TC (ȳ) = cl{d ∈ Rm : ȳ + λd ∈ C for some λ > 0} .

For a polyhedral set C = {y ∈ Rm : Ay ≼ x}, the tangent cone to C at ȳ is given by

TC (ȳ) = {d ∈ Rm : aTi d ≤ 0 for all i ∈ I(ȳ)}

(Rockafellar and Wets, 1998, Theorem 6.46).


The next proposition describes properties of the sets of ε-optimal solutions, denoted
by S_ε(ω) for a given ε and a given realization of ω.

6.15 Proposition(⋆). The sets of ε-optimal solutions S_ε(ω) to P(x(ω)) satisfy two prop-
erties, expressed with respect to the exact optimal solution y∗(ω) and the set C(x(ω)):

i. S_ε(ω) ∩ ||y∗(ω)||B = {y∗(ω)} for all ε > 0;

ii. There exists an ε0 > 0 such that for every ε ∈ [0, ε0],

S_ε(ω) = ρB ∩ [y∗(ω) + T_{C(x(ω))}(y∗(ω))] ,

where ρ = √(||y∗(ω)||^2 + 2ε) and T_{C(x(ω))}(y∗(ω)) = {d ∈ Rm : a_i^T d ≤ 0, i ∈ I(y∗(ω))}.

The following proof relies on standard arguments — see for instance Dontchev and
Rockafellar (2009, Theorem 2E.3).

Proof. To lighten the notation, we write S_ε for S_ε(ω), C for C(x(ω)), and y∗ for y∗(ω).
The set S_ε = ε-argmin_{y∈C} f(y) is given by

S_ε = {y ∈ C : f(y) ≤ f(y∗) + ε} = {y ∈ C : ||y||^2 ≤ ||y∗||^2 + 2ε}
    = C ∩ (√(||y∗||^2 + 2ε)) B.

There is no feasible vector y with ||y|| < ||y∗||, whereas ||y|| = ||y∗|| entails y = y∗
by the strict convexity of f(y), hence the first part of the proposition. On the other
hand, the feasibility set C is described by a finite number of constraints, so that in a
sufficiently small neighborhood of y∗, say R0, there is no new constraint that becomes
active: I(y) ⊂ I(y∗) for y ∈ R0 ∩ C. As the constraints are linear, C can be approximated
locally by the set C_{y∗} = {y ∈ Rm : a_i^T y ≤ b_i, i ∈ I(y∗)}. Since a_i^T y∗ = b_i for i ∈ I(y∗),
we have C_{y∗} = {y∗ + d : a_i^T d ≤ 0, i ∈ I(y∗)} = y∗ + TC(y∗).

Remark 6.3. Having the set C polyhedral is important in the proof of Proposi-
tion 6.15. If the set C were not polyhedral (it can still be convex), there would not
necessarily exist a neighborhood R0 of y∗ in which a proper inclusion of C ∩ R0
in [y∗ + TC(y∗)] ∩ R0 can be precluded. The local approximation at y∗ of the set
C by the set y∗ + TC(y∗) could thus include infeasible points. For example, for
C = {(x, t) ∈ R2 : t ≥ |x| + x^2}, one has TC(0) = {(x, t) ∈ R2 : t ≥ |x|}, and
consequently (TC(0) \ C) ∩ εB ≠ ∅ for any ε > 0.

The following proposition relies on duality theory (Rockafellar and Wets, 1998, Chap-
ter 11).

6.16 Proposition(⋆). The dual of P(x(ω)) corresponds, after a sign change, to the
program

D(x(ω)) : minimize −g(λ) = ½ λ^T (AA^T) λ + x(ω)^T λ subject to λ ≽ 0 .

Proof. The Lagrangian for P(x(ω)) is L(y, λ) = ½ ||y||^2 + λ^T (Ay − x(ω)), with λ ≽ 0.
The infimum of L(y, λ) over y is attained at ȳ = −A^T λ. Hence the dual function
g(·) = inf_y L(y, ·) has values g(λ) = −½ λ^T AA^T λ − x(ω)^T λ. The dual formulation is
obtained by maximizing g(λ) subject to λ ≽ 0.

Given y ∗ (ω), it is often possible to obtain a solution to the dual problem, as shown
by the following proposition. Note that from now on, when ω or x(ω) is clear from the
context, we freely write C for C(x(ω)), and y ∗ for the optimal solution y ∗ (ω) to P(x(ω)).
We will also freely write x for its realization x(ω).

6.17 Proposition(⋆). If y∗ is optimal for P(x), any optimal solution λ∗ ∈ Rs for


the dual D(x) is determined by a subvector λI ∈ Rp of possibly nonzero elements λ∗i ,
i ∈ I(y ∗ ), p = |I(y ∗ )|, such that λI is a nonnegative solution to ATI λI = −y ∗ .

Proof. Having y ∗ ∈ C optimal means −y ∗ ∈ NC (y ∗ ), that is, there exists at least one
vector λ ∈ Rs such that
y∗ + Σ_{i=1}^s λ_i a_i = 0 ,   λ ≽ 0 ,   λ_i = 0 if i ∉ I(y∗) ,

where I(y ∗ ) = {i : aTi y ∗ = xi } is the index set of active constraints at y ∗ ∈ C. These


conditions are nothing else but the usual Karush-Kuhn-Tucker optimality conditions

∇f(y∗) + A^T λ = 0 ,   Ay∗ ≼ x ,   λ ≽ 0 ,   λ_i(a_i^T y∗ − x_i) ≥ 0

with multipliers λi optimal for the dual problem. Let AI ∈ Rp×m be the submatrix of
A with rows aTi , i ∈ I(y ∗ ), p = |I(y ∗ )|, so that the subvector λI ∈ Rp of λ stacking the
possibly nonzero elements λi , i ∈ I(y ∗ ), has to satisfy y ∗ + ATI λI = 0. If the rows of
AI are linearly independent (a constraint qualification which always holds for p = 1 and
never holds for p > m), then

λI = −(AI ATI )−1 AI y ∗ = −(AI ATI )−1 xI

where xI ∈ Rp is the subvector of x stacking the elements xi , i ∈ I(y ∗ ). We can


assume that the solution λI is nonnegative inasmuch as y ∗ is optimal. Now if p >
m and the columns of AI are linearly independent, the equation y ∗ + ATI λI = 0 is
underdetermined and admits the particular solution v0 = −AI (ATI AI )−1 y ∗ (least-norm
solution). If ker(ATI ) = {v : ATI v = 0} denotes the null space of ATI , then

λI ∈ {v0 + v : v ∈ ker(ATI )} ∩ Rp+ .

Recall that the null space of ATI can be described as the span of the eigenvectors associated
to the zero eigenvalues of (ATI AI ).

Uniquely determined multipliers do not always exist: this is consistent with the ob-
servation that the dual problem can have a continuum of optimal solutions if the matrix
(AAT ) in Proposition 6.16 is only positive semi-definite.

Remark 6.4. A property of the objective function f that facilitated the develop-
ments in Proposition 6.17 is the expression of its gradient ∇f (y) = y. The solution
to the inversion of the generalized equation u ∈ ∂f (y), where ∂f (y) is the subgra-
dient of f evaluated at y, is then simply y = u. We recall that in general, when f is
a proper lower-semicontinuous convex function, u ∈ ∂f (y) if and only if y ∈ ∂f ∗ (u)
with f ∗ (u) = supy {uT y −f (y)} (Rockafellar and Wets, 1998, Proposition 11.3).

It is possible to extract information from the index set of active constraints at an


optimal solution y ∗ , as shown by the following proposition.

6.18 Proposition(⋆). Let S(x) = {y∗ ∈ Rm : Ay∗ ≼ x, ||y∗|| ≤ ||y|| whenever Ay ≼ x}
denote the set of optimal solutions for P(x) (the set is a singleton, assuming x ∈ dom C).
Let S^{-1}(y∗) = {x ∈ Rs : y∗ ∈ S(x)} denote the set of parameter vectors x such that y∗
is optimal for P(x). Then, it holds that

S^{-1}(y∗) ⊃ {x ∈ Rs : x_i = a_i^T y∗ for i ∈ I(y∗), x_i ∈ [a_i^T y∗, ∞) for i ∉ I(y∗)}
          = Ay∗ + N_1(I(y∗)) × · · · × N_s(I(y∗))

where I(y∗) = {i : a_i^T y∗ = x_i} is the set of active constraints at y∗, and where we define
N_i(I) = {0} if i ∈ I and N_i(I) = [0, ∞) if i ∉ I. The inclusion can be refined by
considering, instead of I, the index set I + = {i : λi > 0} ⊂ I(y ∗ ) of the positive KKT
multipliers associated to the active constraints at y ∗ .

Proof. Let bi = aTi y ∗ , 1 ≤ i ≤ s. By definition of I(y ∗ ), we have bi = xi if i ∈ I(y ∗ )


and bi < xi if i is in the complement of I(y ∗ ), that is, i ∈ J(y ∗ ) = {i : aTi y ∗ < xi } = J.
A constraint indexed by i ∈ J remains inactive if xi is in the open interval (bi , ∞), and
becomes active but does not alter the optimal solution y ∗ if xi = bi , whence the first part
of the proposition. Now, relaxing the constraints to which are associated zero-valued
KKT multipliers does not alter the optimal solution y ∗ , so that in fact

S^{-1}(y∗) = {x ∈ Rs : x_i = a_i^T y∗ if i ∈ I^+, x_i ∈ [a_i^T y∗, ∞) if i ∉ I^+}

where I + = {i : λi > 0} is the index set of active constraints with positive multipliers.

Example 6.1. Propositions 6.17 and 6.18 can be illustrated on a numerical example
(Figure 6.2). Let A have 4 rows aT1 = [ 0 −1 ], aT2 = [ −1 1 ], aT3 = [ −1 0 ],
aT4 = [ −1 −2 ]. Let x have the value x(ω1 ) = [ −4 2 −2 0 ]T . The opti-
mal solution to P(x(ω1 )) is y ∗ (ω1 ) = [ 2 4 ]T . The set of active constraints is
I(y∗(ω1)) = {1, 2, 3}, meaning that 3 hyperplanes meet at y∗(ω1). The matrix A_I
has the 3 rows a_1^T, a_2^T, a_3^T. The optimality condition is y∗(ω1) = −A_I^T λ_I.

Fig. 6.2: Left: Pathological case x(ω1) for which the dual D(x(ω1)) has several optimal solu-
tions described in the example (see text). Right: Case x(ω2) where the dual problem
D(x(ω2)) has a single optimal solution. The primal problems P(x(ω1)), P(x(ω2)) have
the same unique optimal solution y∗ = (2, 4) ∈ R2. The dashed line indicates the
minimal distance between the origin and the set Ci = {y ∈ R2 : Ay ≼ x(ωi)}.

The set of solutions for λ_I is

Λ_I = (−A_I (A_I^T A_I)^{-1} y∗(ω1) + ker{A_I^T}) ∩ R^3_+
    = {[10/3  −2/3  8/3]^T + µ [1  1  −1]^T : µ ∈ R} ∩ R^3_+
    = {[4  0  2]^T + µ [1  1  −1]^T : µ ∈ [0, 2]} .

Note that a numerical solution algorithm applied to the dual problem could return
any particular solution λ ∈ ΛI × {0}. The solutions corresponding to µ = 0 and
µ = 2 are λ = [ 4 0 2 0 ]T and λ = [ 6 2 0 0 ]T respectively. The zero
elements of the solutions indicate that y ∗ (ω1 ) is still optimal when

x ∈ Ay∗(ω1) + [({0} × [0, ∞) × {0} × [0, ∞)) ∪ ({0} × {0} × [0, ∞) × [0, ∞))]

with Ay ∗ (ω1 ) = [ −4 2 −2 −10 ]T .


Now if x has the value x(ω2 ) = x(ω1 ) + [ 1 1 1 −10 ]T , the optimal solution
to P(x(ω2 )) is y ∗ (ω2 ) = [ 2 4 ]T = y ∗ (ω1 ), showing that the inclusion concerning
S −1 (y ∗ ) in Proposition 6.18 may be proper.
Given that I(y ∗ (ω2 )) = {4}, and thus AI = aT4 , the solution to the optimality condi-
tion y ∗ (ω2 )+ATI λI = 0 is uniquely determined by λI = −(AI ATI )−1 AI y ∗ = 2 = λ4 .
Therefore, the dual D(x(ω2 )) admits the unique solution λ = [ 0 0 0 2 ]T . The
zero elements of the solution indicate that y ∗ (ω2 ) = y ∗ (ω1 ) is still optimal when
x ∈ Ay ∗ (ω1 ) + [0, ∞) × [0, ∞) × [0, ∞) × {0}.
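The example can be checked numerically (a sketch; the solver call is generic, and the multipliers printed are the least-norm solution v0 of A_I^T λ_I = −y∗ discussed in Proposition 6.17, which for x(ω1) has a negative component and must be corrected by an element of ker(A_I^T) to obtain a dual-feasible solution):

import numpy as np
from scipy.optimize import minimize

A = np.array([[0., -1.], [-1., 1.], [-1., 0.], [-1., -2.]])
x1 = np.array([-4., 2., -2., 0.])
for x in (x1, x1 + np.array([1., 1., 1., -10.])):
    y_star = minimize(lambda y: 0.5 * y @ y, np.array([5., 5.]),
                      constraints=[{"type": "ineq", "fun": lambda y, x=x: x - A @ y}]).x
    active = np.where(np.abs(A @ y_star - x) < 1e-5)[0]
    A_I = A[active]
    v0, *_ = np.linalg.lstsq(A_I.T, -y_star, rcond=None)   # least-norm multipliers
    print(np.round(y_star, 3), active + 1, np.round(v0, 3))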

Remark 6.5. Proposition 6.18 has formalized an invariance property with respect to
a subset of translations of the input x, where the subset depends on the output y ∗ .
In the perspective of using supervised learning to predict nearly feasible optimal
solutions, invariance properties could be used as a means to obtain virtual samples
(xν , y ν ) with xν ∈ S −1 (y ν ), or can be embedded in learning algorithms to improve
generalization abilities from prior knowledge (Decoste and Schölkopf, 2002).

The next proposition shows that from a single pair (x̄, ȳ ∗ ) with ȳ ∗ optimal for P(x̄),
it is sometimes possible to predict the optimal solution y∗(ω) for parameters x(ω) in a
neighborhood of x̄. The size of the neighborhood is estimated in the proof, and is related
to the smallest singular value of the matrix AI defined in Proposition 6.19.

6.19 Proposition. Let ȳ be the optimal solution to the program P(x̄). Let A I ∈ Rp×m ,
p = |I(ȳ)|, be the submatrix of A stacking the rows aTi of active constraints i ∈ I(ȳ), and
for a vector x ∈ Rs , let xI ∈ Rp be the subvector of x stacking the elements xi , i ∈ I(ȳ).
If the rows of AI are linearly independent and if (AI ATI )−1 x̄I ≺ 0, then there exists a
neighborhood Q of x̄ such that for all x(ω) ∈ Q ∩ dom C, the optimal solution to P(x(ω))
is given by y ∗ (ω) = ATI (AI ATI )−1 xI (ω).

Proof. First, we show that there exist a neighborhood Q0 of x̄ and a neighborhood R0


of ȳ such that I(y) ⊂ I(ȳ) whenever x(ω) ∈ Q0 ∩ dom C and y ∈ R0 ∩ C(x(ω)). Let
J denote the set of inactive constraints at ȳ. For all j ∈ J, let dj be the distance
of ȳ to the hyperplane {y : aTj y = x̄j }, namely, dj = ||aj ||−1 (x̄j − aTj ȳ) > 0. Let
d0 = min{dj : j ∈ J} and let us define

η0 = min{||aj ||(dj − d0 /2) : j ∈ J} > 0 .

We choose Q0 = {x̄ + η0 u : ||u|| < 1} and R0 = {ȳ + (d0 /2) v : ||v|| < 1}. Then,
the distance of ȳ to any hyperplane {y : aTj y = xj (ω)}, j ∈ J, is greater than d0 /2
whenever x(ω) ∈ Q0 ∩ dom C, and y ∈ R0 ∩ C(x(ω)) is separated from the hyperplanes
{y : aTj y = xj (ω)} for j ∈ J. Hence j 6∈ I(y) and thus I(y) ⊂ I(ȳ) (no new active
constraints).
Next, we claim that if the rows aTi for i ∈ I(ȳ) are linearly independent, and if
(AI ATI )−1 x̄I ≺ 0, then there exists a neighborhood Q ⊂ Q0 of x̄ such that I(y ∗ (ω)) =
I(ȳ) whenever x(ω) ∈ Q ∩ dom C, where y ∗ (ω) denotes the optimal solution to P(x(ω)).
It is sufficient to show that whenever x(ω) ∈ Q∩dom C, y ∗ (ω) lies in R0 , and any optimal
λ∗i (ω), i ∈ I(ȳ), associated to y ∗ (ω) is positive, as λ∗i > 0 entails i ∈ I(y ∗ (ω)). Since the
rows of AI are linearly independent, the vector

λ̄I = −(AI ATI )−1 AI ȳ = −(AI ATI )−1 x̄I ≻ 0 (I = I(ȳ))

is the only vector of possibly nonzero multipliers associated to ȳ (the reference optimal
solution). Let us replace the dual problem D(x(ω)) by a problem on the reduced set of
variables δI ∈ Rp with λI (ω) = λ̄I + δI , I = I(ȳ), namely,

minimize − gI (δI ) = 1/2 (λ̄I + δI )T (AI ATI )(λ̄I + δI ) + xI (ω)T (λ̄I + δI )
subject to λ̄I + δI ⪰ 0 .

If we relax the constraint δI ⪰ −λ̄I , and if we set x(ω) = x̄ + ∆x(ω), the optimality
condition for the resulting problem is ∇gI (δI∗ ) = 0, and its optimal solution is

δI∗ = −(AI ATI )−1 (AI ATI λ̄I + xI (ω)) = −(AI ATI )−1 ∆xI (ω) ,

where we have used the fact that λ̄I = −(AI ATI )−1 x̄I . Let us define

ε = min{λ̄i : i ∈ I} > 0 .

Since δi∗ > −λ̄i for each i ∈ I whenever ||δI∗ || < ε, we can guarantee, using the inequality

||δI∗ || ≤ ||(AI ATI )−1 || · ||∆xI (ω)|| ≤ ||(AI ATI )−1 || · ||∆x(ω)||

that whenever ||x(ω) − x̄|| = ||∆x(ω)|| < η1 with η1 = min{η0 , ε ||(AI ATI )−1 ||−1 }, the
solution δI∗ satisfies the constraint of the initial reduced problem, and x(ω) ∈ Q0 . Thus
δI∗ is also optimal for the reduced problem. We note that ||(AI ATI )−1 ||−1 = (σp (AI ))2 ,
where σp (AI ) > 0 is the smallest singular value of AI (AI has rank p = |I(ȳ)|). Reverting
now to the full dual problem over λ ∈ Rm , we see that the vector λ∗ with λ∗i = λ̄i +δi∗ > 0
if i ∈ I(ȳ), λ∗i = 0 if i ∉ I(ȳ), induces a vector

y = − ∑i∈I λ∗i ai = ȳ − ∑i∈I δi∗ ai = ȳ + ATI (AI ATI )−1 ∆xI (ω).

Using ||y − ȳ|| ≤ ||ATI (AI ATI )−1 || · ||∆x(ω)||, we have ||y − ȳ|| < d0 /2 if ||∆x(ω)|| <
||ATI (AI ATI )−1 ||−1 d0 /2. In fact ||ATI (AI ATI )−1 ||−1 = σp (AI ). By setting

η = min{η0 , σp (AI ) d0 /2, ε (σp (AI ))2 }

and choosing for Q the open ball of radius η centered at x̄, we can ensure that y ∈ R0 ,
so that I(ȳ) is the set of constraints active at y. This means that the vector y is optimal
for the primal problem, and that λ∗ is optimal for the dual problem.
Now, given the existence of a neighborhood Q of x̄ for which I(y ∗ (ω)) = I(ȳ) when
x(ω) ∈ Q ∩ dom C, y ∗ (ω) can be obtained as the projection of the origin on the linear
subspace {y ∈ Rm : aTi y = xi (ω), i ∈ I(ȳ)} whenever x(ω) ∈ Q ∩ dom C. With the rows
of AI linearly independent, the projection is given by y ∗ (ω) = ATI (AI ATI )−1 xI (ω).
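To make the construction concrete, the following sketch (numpy assumed; names and tolerances are illustrative) forms the local model of Proposition 6.19 from a reference pair (x̄, ȳ) and also returns the smallest singular value σp (AI ), which governs the size of the neighborhood estimated in the proof.

```python
import numpy as np

def local_model(A, x_bar, y_bar, tol=1e-8):
    """Local closed-form solution map around x_bar (sketch of Proposition 6.19)."""
    I = np.where(np.abs(A @ y_bar - x_bar) <= tol)[0]   # active constraints at y_bar
    A_I = A[I, :]
    M = A_I @ A_I.T
    lam_I = -np.linalg.solve(M, x_bar[I])                # multipliers; must be > 0
    assert np.all(lam_I > 0), "assumptions of Proposition 6.19 not satisfied"
    D = A_I.T @ np.linalg.inv(M)                          # local model: y*(x) = D @ x_I
    sigma_p = np.linalg.svd(A_I, compute_uv=False)[-1]    # smallest singular value of A_I
    return I, (lambda x: D @ x[I]), sigma_p
```

The returned map is exact as long as x(ω) stays in a neighborhood of x̄ whose size scales with σp (AI ) and with the smallest multiplier, as quantified in the proof.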

In the context of the supervised learning of nearly feasible optimal solutions, where
one looks for a hypothesis h in a hypothesis space H of mappings from x(ω) to y ∗ (ω),
the knowledge of a local model for y ∗ (ω) in a neighborhood of x̄, for instance a first-
order approximation y ∗ (ω) ' ȳ + D(x(ω) − x̄), means that one could learn h not only
by penalizing the discrepancies between the sampled targets y(ω) and the predictions
h(x(ω)), but also by penalizing the discrepancy between the gradient of h at x̄ and
the gradient D of the local model known a priori. Such ideas have been developed by
Simard et al. (1998). We also note that it is technically possible to incorporate derivative
information in Gaussian Process regression (Solak et al., 2003).
Another possibility would be to learn classifiers for the events i ∈ I(ȳ), 1 ≤ i ≤ s, since
we know that the information on active constraints can be generalized locally around x̄,
and followed by a straightforward computation of y ∗ (ω).

Remark 6.6. A typical situation where the assumptions of Proposition 6.19 fail is
the case where two inequality constraints form an equality constraint: aT1 y ≤ x1 ,
aT2 y ≤ x2 with a2 = −a1 and x2 = −x1 . In that case, a solution ȳ has to satisfy
aT1 ȳ = x1 , and AI is always rank-deficient. In the event where the two parallel
hyperplanes are separated, it is not easy to predict which side of the so-induced
slab region the optimal solution will follow. If q pairs of hyperplanes are merged,
there might exist 2q distinct configurations of active constraints in the neighborhood
of x̄, provided that the assumptions of Proposition 6.19 hold with one element of
each pair of equality-forming hyperplanes removed from the index set I of active
constraints at ȳ.

Now, an important question is whether a local model shared by a finite collection of


points can be generalized to the convex hull of the points.

6.20 Lemma. Given x(0), x(1) ∈ dom C, let x(t) = (1 − t)x(0) + tx(1) for 0 ≤ t ≤ 1.
Let y ∗ (t) denote the optimal solution to P(x(t)). If I = I(y ∗ (0)) = I(y ∗ (1)), then
y ∗ (t) = (1 − t)y ∗ (0) + ty ∗ (1). If in addition the rows aTi , i ∈ I, are linearly independent,
then y ∗ (t) = ATI (AI ATI )−1 xI (t).

Proof. We consider the points y(t) = (1 − t)y ∗ (0) + ty ∗ (1), 0 ≤ t ≤ 1, in correspondence
with x(t) = (1 − t)x(0) + tx(1). Let j(t) represent the constraint aTj y ≤ xj (t), 1 ≤ j ≤ s,
0 ≤ t ≤ 1. We have aTj y(0) − xj (0) < 0 for each j ∉ I, and aTj y(1) − xj (1) < 0
for each j ∉ I, by definition of I for y(0) = y ∗ (0) and y(1) = y ∗ (1). Hence
(1 − t)(aTj y(0) − xj (0)) + t(aTj y(1) − xj (1)) = aTj y(t) − xj (t) < 0 whenever j ∉ I, meaning
that y(t) is feasible with respect to j(t), with j ∉ I(y(t)). Similarly, for each i ∈ I, it
holds that aTi y(0) = xi (0) and aTi y(1) = xi (1). Hence aTi y(t) = xi (t), meaning that y(t)
is feasible with respect to i(t), with i ∈ I(y(t)). We have thus shown that y(t) is feasible
and that I(y(t)) = I(y ∗ (0)) = I. Now, let λ(t) = (1−t)λ(0)+tλ(1), where λj (0) = λj (1) = 0
for j ∉ I, where λI (0) ⪰ 0 is a solution to y ∗ (0) + ATI λI (0) = 0, and λI (1) ⪰ 0 is a
solution to y ∗ (1)+ATI λI (1) = 0. The equality (1−t)(y ∗ (0)+ATI λI (0))+t(y ∗ (1)+ATI λI (1)) =
y(t)+ATI λI (t) = 0 with λI (t) ⪰ 0 shows that y(t) satisfies the optimality conditions for P(x(t)).
Therefore, y(t) is the projection of 0 on the active constraints, and if the rows of AI are
linearly independent, y(t) = ATI (AI ATI )−1 xI (t) for 0 ≤ t ≤ 1.

As the convex hull of a collection of points {xν }, written conv({xν }), contains the
line segments between any two of its points, we have:

6.21 Proposition (Inner generalization). Let {xν } be a collection of points in dom C


with a common set I of active constraints at the optimal solution to P(x ν ). If the
rows of AI are linearly independent, then y ∗ (ω) = ATI (AI ATI )−1 xI (ω) whenever x(ω) ∈
conv({xν }).

Another interesting question is whether we can, from a single point (x̄, ȳ) equipped
with a local model, infer the domain of validity of the model.

6.22 Proposition (Outer generalization). Let ȳ be the optimal solution to the pro-
gram P(x̄). Let I(ȳ), written I for short, be the index set of active constraints at ȳ,
and let J be its complement. Let AI be the submatrix of active rows of A. If the rows
of AI are linearly independent, then the subset of dom C (values for the parameter x)
where the index set of active constraints at the optimal solution y ∗ of the program P(x)
coincides with I = I(ȳ) can be described as the polyhedral cone

X(I) = {x ∈ Rs : BI xI ⪯ 0, DI xI − xJ ⪯ 0}

where BI = (AI ATI )−1 ∈ Rp×p and DI = AJ ATI BI ∈ R(s−p)×p .

Proof. Having ȳ as an optimal solution shows that there exists some λ̄I ∈ Rp , p = |I(ȳ)|,
such that ȳ + ATI λ̄I = 0, λ̄I ⪰ 0, AI ȳ = x̄I , and AJ ȳ ≺ x̄J . If the rows of AI are
linearly independent, λ̄I = −(AI ATI )−1 x̄I , which implies ȳ = ATI (AI ATI )−1 x̄I . Now, we
can replace x̄I by any xI and obtain a corresponding optimal solution y determined by

y = ATI (AI ATI )−1 xI , (6.7)



as long as we keep

λI = −(AI ATI )−1 xI ⪰ 0 , (6.8)


AI y = xI , (6.9)
AJ y ≺ xJ . (6.10)

To satisfy (6.8) we must enforce (AI ATI )−1 xI ⪯ 0. Equation (6.9) is a consequence of
(6.7) multiplied by AI . To satisfy (6.10), we must enforce AJ ATI (AI ATI )−1 xI − xJ ≺ 0.
Actually, y will still be optimal if (6.10) is replaced by AJ y ⪯ xJ (non-strict inequality).
In that case, we use the convention that if some new constraints enter the set of active
constraints at y, the index set I is still understood as the set of active constraints at ȳ.
To easily see that the resulting set X(I) as defined in the proposition with BI and
DI is a cone, assume without loss of generality that x = [ xI ; xJ ] (active components
stacked on top), allowing us to rewrite

X(I) = {x ∈ Rs : GI x ⪯ 0} with GI = [ BI 0 ; DI −I ] ∈ Rs×s ,
DI −I

where 0 is the zero matrix of dimension |I| and I the identity matrix of dimension |J|.

Remark 6.7. The subset of dom C for which there is no active constraint at the
optimal solution is X(∅) = Rs+ : it is easy to check that 0 ∈ argmin P(x) if and
only if x ⪰ 0. That the point x = 0 is included in every set X(I) corresponds
to the existence of pathological cases (recall Figure 6.2) where several hyperplanes
meet at zero.

We close the section with a particularization of the results.

6.23 Proposition(?). Consider the parametric program over z ∈ Rm ,

Q(µ(ω), v(ω)) : minimize (z − µ(ω))T Σ−1 (z − µ(ω)) subject to F z ⪯ v(ω) .

Let F be the set-valued mapping with values F(v) = {z ∈ Rm : F z ⪯ v}, and let
dom F = {v : F(v) 6= ∅}. For some fixed µ̄ and v̄ ∈ dom F, let z̄ be the optimal
solution to Q(µ̄, v̄). With fiT denoting the i-th row of F , let I = {i : fiT z̄ = v̄i } be the
index set of active constraints at z̄. Then, there exist a neighborhood Q µ of µ̄ and a
neighborhood Qv of v̄ such that for all µ(ω) ∈ Qµ and v(ω) ∈ Qv ∩ dom F, the optimal
solution of Q(µ(ω), v(ω)) is

z ∗ (ω) = µ(ω) + ΣFIT (FI ΣFIT )−1 (vI (ω) − FI µ(ω)) , (6.11)

if the rows of FI are linearly independent and (FI ΣFIT )−1 (vI − FI µ(ω)) ≺ 0. In fact, the
expression (6.11) is valid if one has v(ω) ∈ dom F, the rows of FI linearly independent,
and µ(ω), v(ω) satisfying

(FI ΣFIT )−1 (vI (ω) − FI µ(ω)) ⪯ 0 , (6.12)

(vJ (ω) − FJ µ(ω)) − FJ ΣFIT (FI ΣFIT )−1 (vI (ω) − FI µ(ω)) ⪰ 0 , (6.13)

where J = {j : fjT z̄ < v̄j } is the complement of I.



Remark 6.8. There is a nice interpretation of z ∗ (ω) in (6.11) as the conditional mean
of a random variable Z with realizations Z(η), such that Z follows a priori a normal
N (µ(ω), Σ), and then is conditioned on the observation FI Z(η) = vI (ω).

Proof of Proposition 6.23. All the developments in the section have been done for the
parametric program (6.1), but can be applied easily to the parametric program (6.5),
Q(u(ω), v(ω)) : minimize 1/2 zT Sz + u(ω)T z subject to F z ⪯ v(ω) ,

with S positive definite. To adapt Proposition 6.19, for instance, let z̄ be the optimal
solution to Q(ū, v̄), and let I = {i : fiT z̄ = v̄i }. Let S = RT R be the Cholesky factoriza-
tion of S. The change of variables z = R−1 y − S −1 u(ω) applied to the system of active
constraints FI z = vI (ω) yields FI R−1 y = vI (ω) + FI S −1 u(ω), that is, AI y = xI (ω) if we
set A = F R−1 and x(ω) = v(ω)+F S −1 u(ω). Applying Proposition 6.19 and substituting
back, we deduce that there exist some neighborhoods Qu of ū and Qv of v̄ such that for
all u(ω) ∈ Qu and v(ω) ∈ Qv ∩ dom F, the optimal solution z ∗ (ω) to (6.5) is given by

z ∗ (ω) = S −1 FIT (FI S −1 FIT )−1 (vI (ω) + FI S −1 u(ω)) − u(ω) ,


 

if the rows of FI are linearly independent and (FI S −1 FIT )−1 (vI (ω) + FI S −1 u(ω)) ≺ 0. It
remains to set S = Σ−1 and u(ω) = −Σ−1 µ(ω) to get (6.11). The rest of the proposition
follows similarly from Proposition 6.22.
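As a small sanity check of Proposition 6.23, the sketch below (numpy assumed; names illustrative) evaluates the candidate (6.11) for a given active set I and tests the validity conditions (6.12)-(6.13); the same expression is the conditional mean of Remark 6.8.

```python
import numpy as np

def solve_Q_with_active_set(mu, Sigma, F, v, I):
    """Candidate solution (6.11) of Q(mu, v) for active set I, accepted only if the
    validity conditions (6.12)-(6.13) hold (illustrative sketch)."""
    I = np.asarray(I)
    J = np.setdiff1d(np.arange(F.shape[0]), I)
    F_I, F_J = F[I, :], F[J, :]
    r = np.linalg.solve(F_I @ Sigma @ F_I.T, v[I] - F_I @ mu)
    z = mu + Sigma @ F_I.T @ r           # (6.11); also E[Z | F_I Z = v_I] with Z ~ N(mu, Sigma)
    ok = np.all(r <= 0) and np.all(v[J] - F_J @ z >= 0)   # conditions (6.12) and (6.13)
    return z if ok else None             # None: fall back to a quadratic programming solver
```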

6.4 Classifiers for Sets of Active Constraints

Our study of the optimal solution y ∗ (ω) to the program P(x(ω)) defined by (6.1) suggests
that the exact prediction problem, mapping an input x(ω) to the output y ∗ (ω), can be
reduced to the prediction of the index set of active constraints, mapping x(ω) to I(y ∗ (ω)).
The index sets of active constraints I partition the input space into subregions X(I),
found to be polyhedral cones. (The subregions are also called cells in the sequel.) Once
x(ω) is known to belong to some cell X(I), it is straightforward to find y ∗ (ω).
A cell X(I) requires s linear inequalities to be described as a polyhedron, where s is
the number of constraints of the parametric program P. In a problem with s constraints,
there could be an astronomically large number of index sets of active constraints I to
consider. Enumerating them individually without prior knowledge would be a daunting
task. But at least, Proposition 6.22 allows us to build instantly the cell X(I) associated
to a sample (x̄, ȳ), where ȳ is the optimal solution to P(x̄), and create a classifier asso-
ciated to X(I) for indicating whether a new input x is in X(I). A classifier is simply
a 0-1 indicator function of the set X(I), that could be represented by a decision tree
(Algorithm 6.1). By creating and exploiting existing classifiers, it would then be possible
to “learn” minimizers in an online fashion (Algorithm 6.2).

6.4.1 Description of the Algorithms.

The proposed learning strategy is essentially memory-based: it consists in building a


growing collection of local linear models. It does not attempt to generalize results beyond

Algorithm 6.1 Building a decision-tree classifier associated to a set of active constraints


Input: A sample point (x̄, ȳ) such that x̄ ∈ dom C, ȳ ∈ argmin P(x̄),
and the rows of AI , I = I(ȳ), are linearly independent.
Output: A classifier δI : Rs → {0, 1} defined on dom C
with values δI (x) = 1 if I(argmin P(x)) = I(ȳ), and 0 otherwise.

1. Let J = {j : aTj ȳ < x̄j } be the index set of inactive constraints at ȳ.
Set B = (AI ATI )−1 , and let bTj denote the j-th row of B.
Set D = AJ ATI B, and let dTk denote the k-th row of D.
Define φj (x) = bTj xI .
Define ψk (x) = dTk xI − xJ(k) , where J(k) is the k-th index of J.

2. Create a root node and call it the current node.

3. Repeat for j = 1, . . . , p = |I| :


Split the current node using test φj (x) ≤ 0 (true for the left child),
attach label {0} to the right child, and call the left child the current node.

4. Repeat for k = 1, . . . , s − p :
Split the current node using test ψk (x) ≤ 0 (true for the left child),
attach label {0} to the right child, and call the left child the current node.

5. Attach label {+1} to the current node,


meaning that the local model y ∗ = ATI (AI ATI )−1 xI , I = I(ȳ), is valid.

Algorithm 6.2 Learning minimizers (online version)


Input: A set of M classifiers {δI µ }1≤µ≤M and a new sample x ∈ dom C.
Output: y ∈ argmin P(x) and an updated set of classifiers.

1. If x ⪰ 0, return y = 0, leaving the set of classifiers intact.

2. Evaluate at x the classifiers δI µ , 1 ≤ µ ≤ M .

3. As soon as δI µ (x) = +1 for some µ ≤ M ,


set I = I µ and return y = ATI (AI ATI )−1 xI ,
leaving the set of classifiers intact.

4. Otherwise, call a solver. Set y ∈ argmin P(x), and set I = I(y).

5. If the rows of AI are linearly independent,


build a classifier δI M +1 associated to I M +1 = I,
and append it to the set of existing classifiers.

the domain of validity of the local models; in particular, it does not attempt to build a
single classifier per constraint, that would tell us whether a single constraint is active at
the optimal solution. When the input data cannot be processed by existing local models,
a standard quadratic programming solver is called, and its result is used to build a new
local model.
Algorithm 6.2 can be viewed as a quadratic programming solver that adapts itself to
the input data it receives, so as to “minimize” its response time. One could for example
fix a maximal number of local models, and then allow local models that are infrequently
called to be disposed of after some time, since the membership tests induce an overhead
at most linear with the number of existing local models (we refine the linear complexity
estimate in Lemma 6.28 below, and implement a strategy for managing the local models
in Section 6.5).
Algorithm 6.2 can also be viewed as the builder of a decision policy h, that assigns to
each input x ∈ Rs an optimal decision y ∈ Rm . Initially, the decision policy always calls
a quadratic programming solver, except when the solution is trivially 0. Denoting by π
the mapping from the input x to the optimal solution y implemented by the solver, the
policy h can be expressed as

h(x) = 0 if x ⪰ 0 , and h(x) = π(x) otherwise.

After some training during which inputs xµ are received, outputs y µ = h(xµ ) are
self-generated, and classifiers δI µ are built with I µ = I(y µ ) the set of active constraints
at y µ , the decision policy h : Rs → Rm can exploit, in the region Rs+ ∪ (∪µ X(I µ )) of
the input space, an explicit representation of the optimal solution mapping from x to y.
Namely, if M denotes the collection of index sets I µ already seen, the policy h can be
formally expressed by 3 pieces, assuming for notational simplicity that the probability of
x falling on the boundaries of the cells X(I) is 0:

h(x) = 0 if x ⪰ 0 ,
h(x) = ∑I∈M δI (x) ATI (AI ATI )−1 xI if x ∈ ∪I∈M X(I) , (6.14)
h(x) = π(x) otherwise.
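Algorithms 6.1 and 6.2, together with the policy (6.14), translate almost directly into code. The sketch below is our own illustrative Python transcription (numpy and scipy assumed); in particular, the bound-constrained dual solved with L-BFGS-B merely stands in for the quadratic programming solver π (the experiments of Section 6.5 rely on Matlab's quadprog instead), and the tolerances are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def solve_qp(A, x):
    """Stand-in for pi(x): min 0.5||y||^2 s.t. A y <= x, solved through its dual
    min 0.5 lam^T (A A^T) lam + x^T lam over lam >= 0 (convex, bound-constrained)."""
    M = A @ A.T
    res = minimize(lambda lam: 0.5 * lam @ M @ lam + x @ lam,
                   np.zeros(A.shape[0]),
                   jac=lambda lam: M @ lam + x,
                   bounds=[(0.0, None)] * A.shape[0],
                   method="L-BFGS-B")
    return -A.T @ res.x

def build_cell(A, x, y, tol=1e-7):
    """Algorithm 6.1 (sketch): cell X(I) and local model y = T_I x_I, or None if LICQ fails."""
    I = np.where(np.abs(A @ y - x) <= tol)[0]
    A_I = A[I, :]
    if len(I) == 0 or np.linalg.matrix_rank(A_I) < len(I):
        return None
    J = np.setdiff1d(np.arange(A.shape[0]), I)
    B = np.linalg.inv(A_I @ A_I.T)
    D = A[J, :] @ A_I.T @ B
    return dict(I=I, J=J, B=B, D=D, T=A_I.T @ B)

def in_cell(cell, x, tol=1e-9):
    """Membership test x in X(I): B_I x_I <= 0 and D_I x_I - x_J <= 0."""
    xI, xJ = x[cell["I"]], x[cell["J"]]
    return np.all(cell["B"] @ xI <= tol) and np.all(cell["D"] @ xI - xJ <= tol)

def online_minimizer(A, cells, x):
    """Algorithm 6.2 / policy (6.14) (sketch): reuse a stored local model when possible,
    otherwise call the solver and try to store a new cell."""
    if np.all(x >= 0):                                  # x in X(empty): y* = 0
        return np.zeros(A.shape[1]), cells
    for cell in cells:                                  # membership tests
        if in_cell(cell, x):
            return cell["T"] @ x[cell["I"]], cells      # closed-form prediction
    y = solve_qp(A, x)                                  # fallback to the solver
    cell = build_cell(A, x, y)
    if cell is not None:                                # store only under LICQ
        cells.append(cell)
    return y, cells
```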

6.4.2 Complexity.

With a finite number s of constraints, the number of possible cells X(I), say N , is finite
but very large:

6.24 Lemma. For the parametric program P(x) over y ∈ Rm with s constraints, the
number of cells X(I), written N , is at most
∑_{p=1}^{min{s,m}} s!/(p!(s − p)!) ≤ 2^s − 1 ,

if we do not count X(∅) = {x ∈ Rs : x ⪰ 0}.

Proof. For the feasibility set C = {y ∈ Rm : Ay ⪯ x}, A ∈ Rs×m , and an index set I of
active constraints of cardinality p, the cell X(I) is well defined when the p rows of A I
(1 ≤ p ≤ s) are linearly independent. Note that if A has rank s (possible only if s ≤ m),
then AI has rank p and its rows linearly independent. Having the p rows of AI linearly
6.4. Classifiers for Sets of Active Constraints 123

independent is impossible if p > m. Thus, the index sets to consider are those obtained
by picking p constraints out of s, with p ≤ m. For programs with s ≤ m, there exist
2^s − 1 theoretical combinations of active constraints.
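The bound of Lemma 6.24 is easy to evaluate numerically; a small illustrative snippet:

```python
import math

def max_cells(m, s):
    """Upper bound of Lemma 6.24 on the number of cells X(I), X(empty) excluded."""
    return sum(math.comb(s, p) for p in range(1, min(s, m) + 1))

print(max_cells(5, 10))    # 637
print(max_cells(20, 40))   # about 6.2e11, i.e. roughly 0.6 * 10^12
```

The second value is the figure quoted for the (m, s) = (20, 40) problems in Section 6.5.3.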

Clearly, there is no hope of covering efficiently all the cells X(I). The proposed
approach is expected to work when the support of the distribution of x is concentrated
on a relatively modest number of cells. Instead of building in a systematic way the explicit
part of the mapping h, we let the construction process be driven by i.i.d. samples of x,
always allowing calls π(x) to the solver for samples that fall out of the domain where h
is known explicitly.
The complexity of building a classifier can be estimated as follows.

6.25 Lemma. Building a classifier δI requires at most O(ms2 ) operations.

Proof. We follow the notations of Proposition 6.22. If A ∈ Rs×m is not assumed to be


sparse, building a new classifier δI with |I| = p requires O(mp2 ) operations to form AI ATI ,
O(p3 ) operations to invert AI ATI and obtain BI = (AI ATI )−1 , O(mp2 ) operations to form
ATI (AI ATI )−1 from ATI and BI , and O(m(s − p)p) operations to form DI = AJ ATI BI .
Note that m(s − p)p ≤ ms2 /4.

The complexity of exploiting a classifier δI (steps 1 and 2 of Algorithm 6.2) can be


estimated as follows.

6.26 Lemma. A test x ∈ X(I) with |I| = p requires at most O(sp) operations, meaning
that the complexity is at most O(s2 ) for all I.

Proof. We follow the notations of Proposition 6.22. Checking that BI xI ⪯ 0 requires
O(p2 ) operations, and checking that DI xI − xJ ⪯ 0 requires O((s − p)p) operations.
Note that (s − p)p ≤ s2 /4.

(Lemma 6.26 does not take into consideration the fact that a test should be aborted
as soon as one of the s inequalities to check is false.)

6.27 Lemma. A prediction for y given x ∈ X(I) with |I| = p requires at most O(mp)
operations, meaning that the complexity is at most O(ms) for all I if s ≤ m, or at most
O(m2 ) if s > m.

Proof. The prediction is y = ATI (AI ATI )−1 xI . The matrix product has already been
evaluated in the construction of the classifier δI , so that the complexity of evaluating the
prediction is reduced to the complexity of the matrix-vector multiplication.

If we assume that a stored classifier δI is never replaced by another in the course of


the training phase, appending a new classifier to the collection M of stored classifiers as
described in Algorithm 6.2 can only decrease the probability

p0 = P{x ⋡ 0 and x ∉ X(I) for all I ∈ M} (6.15)

of a new sample x falling in an unknown subregion of the input space, but potentially
delays the call π(x) to the solver.

It would be interesting to be able to check whether x is in the union of the cells in M,


so as to avoid unnecessary tests, but this approach seems rather difficult to concretize —
at least considering the number of tests induced by the expression of the convex hull of
the union of k polyhedra (Balas, 1998). In the following lemma, we assume that all tests
x ∈ X(I), I ∈ M, have to be done before being able to conclude that x ∉ ∪I∈M X(I).

6.28 Lemma. Let N be the total number of cells X(I) induced by the parametric pro-
gram P(x). Let M = αN , α ∈ (0, 1], be the number of classifiers appended sequentially
to an initially empty collection M of classifiers, new cells X(I) being potentially discov-
ered as new i.i.d. samples of x are received. Let h be the policy (6.14) that maps x to
argmin P(x). Then, the expected number of tests of the form x ∈ X(I) in the evaluation
of h(x) is upper bounded by M (1 − α/2) + α/2.

Proof. Let X1 , . . . , XN be the polyhedral cells X(I) induced by the parametric pro-
gram P. Let qi be the probability that x ∈ Xi . Let δ1 , . . . , δM denote the distinct
classifiers of the collection M, indexed in the order of their creation. Each classifier δ j
is associated to a different cell Xi , and the probability of the possible matchings is a
function of the probabilities qi . Given the sequence {δj }1≤j≤M , let t(x) be the number
of tests (implemented by the classifiers) needed to detect any event x ∈ X i , or the event
that x falls in a cell not covered by any δj . We have t(x) = 0 if x ⪰ 0, t(x) = j if there
exists some j such that δj (x) = 1, and t(x) = M otherwise.
The choice of qi that maximizes E{t(x)} (where the expectation is taken over x and
over the possible sequences of classifiers) is obtained with qi = 1/N for each i, since
having any qi > 1/N would make Xi more likely to appear sooner in the sequence of
classifiers, and every new sample x more likely to hit Xi . Note that qi = 1/N also means
that the probability of x  0 is chosen to be zero.
Then, thanks to the property that all the orderings of the classifiers are now equiprob-
able, it holds that t(x) = 0 with probability 0, t(x) = j with probability 1/N if j < M ,
and t(x) = M with probability 1/N + (N − M )/N . Writing M = αN for some α ∈ (0, 1],
the expectation of t with the worst-case distribution is
M
X j N −M M (1 + M ) N −M
E{t(x)} = +M = +M
j=1
N N 2N N

= N (α − α2 /2) + α/2 = M (1 − α/2) + α/2 .
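The worst-case expectation can be checked by a short simulation under the uniform cell distribution used in the proof (an illustrative sketch; the values of N and M are arbitrary):

```python
import numpy as np

def mean_tests(N, M, n_samples=200_000, seed=0):
    """Average number of membership tests when x is uniform over N equiprobable cells
    and the stored classifiers are the first M distinct cells observed (sketch)."""
    rng = np.random.default_rng(seed)
    stored = []
    while len(stored) < M:                     # warm-up: discover M distinct cells
        c = int(rng.integers(N))
        if c not in stored:
            stored.append(c)
    rank = {c: j + 1 for j, c in enumerate(stored)}
    hits = rng.integers(N, size=n_samples)     # new samples
    costs = np.array([rank.get(int(c), M) for c in hits])
    return costs.mean()

N, M = 200, 50
alpha = M / N
print(mean_tests(N, M))                        # close to the bound below
print(M * (1 - alpha / 2) + alpha / 2)         # Lemma 6.28: 43.875
```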

The expected time complexity of evaluating h as specified by (6.14) with M = αN


stored classifiers on a new sample x could be estimated as follows. With Tπ denoting the
expected time for evaluating π(x), TX the expected time for evaluating a cell membership
test (Lemma 6.26), Ty the expected time for evaluating a prediction given the positive
classifier (Lemma 6.27), and p0 the probability (6.15), h would be evaluated in expected
time

Th ≤ p0 · Tπ + (1 − p0 ) · Ty + (M (1 − α/2) + α/2) · TX ,

using the bound of Lemma 6.28 based on a worst-case distribution, which is also insen-
sitive to the possible reordering of the tests x ∈ X(I).

6.5 Numerical Experiments

The potential merits of the proposed algorithms are evaluated on various random prob-
lems. We recall that problems with random constraint matrices are not necessarily
representative of practical problems (Edelman, 1992) — for the simplex method, they
do provide insights, but they are unable to explain the behavior of the algorithm in a
relevant neighborhood of some fixed input data (Spielman and Teng, 2004). However,
random problems are easy to specify, and by a statistical concentration phenomenon,
large problems tend to be very similar; taken together, these two features facilitate the
replication of experiments and observations.

6.5.1 Description of the Test Problems

We create the test problems as follows. We select sets of parameters for the input
dimension and the number of constraints, namely:

(m, s) = (5, 10), (10, 5), (10, 20), (20, 10), (20, 40), (40, 20).

For each set, we generate one random matrix U ∈ Rs×m with i.i.d. elements Uij drawn
from the uniform distribution in [0, 1]. We define V k = U − 0.1k11T for k = 1, 2, . . . , 5,
where 11T is a matrix of ones in Rs×m . We form the matrix Ak from V k by stacking
the s rows aki = 2αik vik /||vik ||, where αik is drawn from the uniform distribution in [0, 1],
and vik is the i-th row of V k . We call P (m, s; k) the problem with A = Ak ∈ Rs×m . It
will turn out that problems get harder with k higher.
The parameter k controls the diversity of the directions normal to the halfspaces
defined by the random rows of A. With k small, the first singular value of Ak tends to
dominate the others.
In every problem P (m, s; k), we draw samples for x as follows. Starting from an
orthonormal basis à ∈ Rs×m0 for the range of A, where m0 = min{s, m} (the basis is
obtained from the svd decomposition of A), we set x = Ãξ0 + 0.5|ξ1 |, where ξ0 ∈ Rm0
and ξ1 ∈ Rs are drawn from standard multivariate normal distributions, and | · | denotes
the elementwise absolute value. By Proposition 6.12, x ∈ dom C. We have kept the
magnitude of the term in |ξ1 | relatively small, so as to avoid the case x  0, to which is
associated the optimal solution y ∗ = 0. Notice that the support of the distribution of x
is unbounded.
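For reproducibility, the construction just described can be summarized by the following numpy sketch (our own transcription; note that it draws a fresh U for every call, whereas the experiments reuse the same U across the five values of k):

```python
import numpy as np

def make_problem(m, s, k, rng):
    """Constraint matrix A of the test problem P(m, s; k) (Section 6.5.1)."""
    U = rng.uniform(size=(s, m))
    V = U - 0.1 * k * np.ones((s, m))
    alpha = rng.uniform(size=s)
    return 2.0 * alpha[:, None] * V / np.linalg.norm(V, axis=1, keepdims=True)

def sample_x(A, rng):
    """Draw x in dom C as x = A_tilde xi0 + 0.5 |xi1| (Section 6.5.1)."""
    s, m = A.shape
    A_tilde = np.linalg.svd(A, full_matrices=False)[0]   # orthonormal basis of range(A)
    xi0 = rng.standard_normal(A_tilde.shape[1])
    xi1 = rng.standard_normal(s)
    return A_tilde @ xi0 + 0.5 * np.abs(xi1)

rng = np.random.default_rng(0)
A = make_problem(5, 10, 1, rng)
x = sample_x(A, rng)
```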

6.5.2 Description of the Experiments

We conduct a test on a problem P (m, s; k) as follows. We run Algorithm 6.2 on 5000


i.i.d. samples for x, storing classifiers if the rows of AI are linearly independent. The
rank deficiency of AI is checked from the svd decomposition of AI , which is also used to
compute (AI ATI )−1 . Then, we test the stored classifiers on 5000 new i.i.d. samples to
estimate to what extent these classifiers cover the part of the input space relevant for
the distribution of the input data.
The results are presented on Table 6.1. We report:

• M : the number of classifiers δI built during training (online learning). Note that

M also gives the number of calls to the quadratic programming solver made during
the training phase.

• n0 : the number of learning samples with y ∗ = 0 (identified by X(∅): test x ⪰ 0).


These samples are not compensated by new samples, so they simply diminish the
effective size of the training set.

• nh : the number of samples processed by an already existing classifier during


training. These samples are not compensated by new samples. They are lost
for the detection of a useful cell X(I), but can be used to estimate the relative
importance of the classifiers, so as to reorder the classifiers at some point.

• Xh : the fraction of test samples that can be processed by the learned classifiers,
corresponding to the fraction of samples from the test set that hit some cell X(I)
seen during the training phase.

In these experiments, we did not encounter rank-deficient matrices AI , so for each


row of the table, we have M + n0 + nh = 5000.
We have also compared the cpu time taken by calling the quadratic programming
solver π(x) on the 5000 i.i.d. samples (reference method), to the cpu time taken by
running the online learning/prediction algorithm on the same 5000 samples, using in
addition the following rules for building and updating the collection of stored classifiers:

1. Never create more than 1000 classifiers over the course of the online learning process.

2. Never store more than 250 classifiers simultaneously.

3. Every 250 samples, reorder the classifiers by decreasing frequency of use, and remove
the stored classifiers that were never recalled after their creation.

Note that the same classifier could be rebuilt several times if it is removed too soon (espe-
cially for the last classifiers added within a window of 250 samples). However, taken together,
the rules imply that a given classifier will be rebuilt at most 4 times. The purpose
of the first rule is to be able to decide online whether the learning approach should be
pursued: if one keeps building or rebuilding classifiers all the time, the problem is not well
adapted to the online learning approach, and one should stop building new classifiers.
The cpu times are reported on Table 6.1. In those tests, the solver is the Matlab
function quadprog.
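A possible transcription of these management rules is sketched below (the thresholds are those listed above; the data structure and method names are our own illustrative choices):

```python
class ClassifierStore:
    """Bounded collection of cell classifiers with periodic reordering (sketch)."""

    def __init__(self, max_created=1000, max_stored=250, period=250):
        self.cells, self.hits = [], []
        self.created, self.samples_seen = 0, 0
        self.max_created, self.max_stored, self.period = max_created, max_stored, period

    def register_hit(self, idx):
        """Called whenever the idx-th stored classifier answers positively."""
        self.hits[idx] += 1

    def maybe_add(self, cell):
        """Rules 1 and 2: caps on the total number of creations and on simultaneous storage."""
        if self.created < self.max_created and len(self.cells) < self.max_stored:
            self.cells.append(cell)
            self.hits.append(0)
            self.created += 1

    def end_of_sample(self):
        """Rule 3: every `period` samples, reorder by decreasing frequency of use and
        drop the classifiers that were never recalled after their creation."""
        self.samples_seen += 1
        if self.samples_seen % self.period == 0:
            order = sorted(range(len(self.cells)), key=lambda i: -self.hits[i])
            keep = [i for i in order if self.hits[i] > 0]
            self.cells = [self.cells[i] for i in keep]
            self.hits = [self.hits[i] for i in keep]
```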

6.5.3 Discussion of the Results.

The results of Table 6.1 suggest that for several problems, especially those where s < m
or k is small, the proposed approach is promising. A relatively small number of classifiers
suffices to cover almost all the input space relevant to the (unbounded) input distribution,
as shown by the fractions Xh close to 1. For those problems, we measured speed-up
factors between 2 and 15 over the systematic strategy that calls a quadratic programming
solver for each sample.
For other problems, especially those with many constraints and higher values for the
parameter k, the merits of the approach are less clear. The local models are valid on a

relatively small volume of the input space, leading to a multiplication of the classifiers to
build. As the multiplication of the tests begins to hurt computing times, the management
rules for maintaining a useful collection of classifiers start to be important, if the proposed
approach is to remain competitive with the usual approach consisting in solving every
problem instance by calling the solver. Notice that the problem P (10, 5; 5) has M =
2^s − 1 = 31 classifiers, meaning that all possible combinations of active constraints
are needed. Therefore, we believe that the problem P (20, 40; 5), on which the worst
fraction of the input space covered by 5000 classifiers is recorded, could in fact require,
by Lemma 6.24, as many as 0.6 · 10^12 classifiers.
The last series of problems with m = 40, s = 20, illustrates clearly that the practical
speed-up performance of the proposed approach depends on the problem data A. The
whole table illustrates that the overhead cost of learning and maintaining classifiers can
be controlled, so that there is in fact a very strong incentive to use the proposed approach.
To close this section, we give in Figure 6.3 an example of prediction for the first
test problem P(5,10;1): two points x0 , x1 ∈ R10 have been drawn randomly, and the
5 components of the optimal solution along the line segment x(t) = (1 − t)x0 + tx1 ,
t ∈ [0, 1], have been plotted against t.

Tab. 6.1: Results on test problems: Covering by classifiers and Speed-up performance.

Parameters          Training            Testing     Cpu time (sec)
m    s    k      M     n0     nh       Xh         Ref     Learn
5 10 1 101 423 4476 0.9956 11.81 1.51
5 10 2 131 418 4451 0.9946 12.54 2.08
5 10 3 176 400 4424 0.9930 14.56 3.20
5 10 4 228 340 4432 0.9868 15.19 4.08
5 10 5 251 416 4333 0.9866 15.02 4.64
10 5 1 25 609 4366 1.0000 11.38 0.59
10 5 2 25 584 4391 0.9998 12.02 0.69
10 5 3 30 588 4382 1.0000 13.51 0.85
10 5 4 30 535 4435 0.9998 14.20 0.89
10 5 5 31 567 4402 1.0000 15.50 0.94
10 20 1 465 36 4499 0.9544 15.45 6.31
10 20 2 742 29 4229 0.9276 17.23 11.17
10 20 3 1398 30 3572 0.8394 22.00 21.06
10 20 4 3092 22 1886 0.5216 28.87 31.83
10 20 5 3361 28 1611 0.4594 29.14 31.98
20 10 1 96 74 4830 0.9926 15.20 1.66
20 10 2 162 66 4772 0.9932 17.56 2.92
20 10 3 311 62 4627 0.9844 23.84 7.49
20 10 4 399 65 4536 0.9792 28.95 9.98
20 10 5 575 68 4357 0.9756 37.83 18.38
20 40 1 1513 0 3487 0.8032 24.63 21.08
20 40 2 2322 1 2677 0.6574 29.05 29.68
20 40 3 3957 0 1043 0.3134 38.38 40.26
20 40 4 4966 0 34 0.0180 57.51 58.24
20 40 5 5000 0 0 0.0002 86.14 86.86
40 20 1 276 3 4721 0.9766 22.99 5.59
40 20 2 674 0 4326 0.9236 33.92 15.36
40 20 3 1645 0 3355 0.7782 44.92 36.11
40 20 4 3181 1 1818 0.4850 88.13 79.72
40 20 5 4516 0 484 0.1442 130.01 127.88

m, s : number of variables and constraints of the problem (dimensions of A)


k : method of construction of the constraint matrix A (see text)
M : number of classifiers after training (without storage management rules)
n0 : samples with a zero optimal solution
nh : samples processed by an already existing classifier during training
Xh : empirical fraction of the relevant input space covered by the classifiers
Learn: cpu time of online learning on 5000 samples, using classifier storage management
rules (see text).
Ref: cpu time if one calls a solver on each of the 5000 samples.

Fig. 6.3: The components yk of the optimal solutions y ∈ R5 for the test problem P (5, 10; 1),
along a random line segment defined by xt = (1 − t)x0 + tx1 ∈ R10 , 0 ≤ t ≤ 1.
Breakpoints in the yk -curves indicate where the segment cuts a boundary between the
domains of 2 classifiers.

6.6 Conclusions

This chapter has discussed a family of parametrized optimization programs and a hy-
pothesis class for predicting, after some training on the task of solving random instances,
optimal solutions to new instances of the program. Based mainly on geometrical in-
sights, the analysis of the properties of optimal solutions has also emphasized the role
of constraint qualifications, and the possible occurrence of pathological cases for some
distributions of the input data. A natural choice for the hypothesis class was a piecewise
linear model describing how optimal solutions vary locally with the input data. Fitting
the model was possible using a strategy based on the exploitation of first-order optimality
conditions.
The technical assumption that the rows of AI (rows of active constraints at the op-
timal solution) are linearly independent corresponds to a linear independence constraint
qualification (LICQ). It implies that the set of optimal dual solutions is a singleton
(Facchinei and Pang, 2003, Proposition 3.2.1). Note that in nonlinear programming, a
necessary and sufficient condition ensuring that the set of optimal primal-dual solutions
is a singleton is the strict Mangasarian-Fromowitz constraint qualification (SMFCQ)
(Kyparisis, 1985) — see again Facchinei and Pang (2003, Proposition 3.2.1).
It is well known that a quadratic program subject to constraints with parametrized
righthand side admits a piecewise linear optimal solution (Garstka and Wets, 1974,
Proposition 3.5). Early results involving perturbations of the constraint matrix are also
available (Daniel, 1973), but they do not really make it possible to circumvent the difficulties that
arise from inequality constraints forming new equality constraints (the merging effect
detailed in Remark 6.6). An example revealing the combinatorial nature of the difficul-
ties, inspired from a linear program given by Martin (1975), is provided by the following
program over y = (y1 , y2 ) ∈ R2 , where the constraint matrix depends affinely on t ∈ R:

minimize 1/2 ||y||2 subject to y1 + y2 ≥ 1 , y1 + ty2 ≤ 1 .

The optimal solution, y ∗ = (1/2, 1/2) if t ≤ 1, y ∗ = (1, 0) if t > 1, is a discontinuous



function of the parameter t. The optimal value as a function of t is also discontinuous


at t = 1.
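The jump is easy to reproduce numerically, for instance with scipy's SLSQP solver (an illustrative sketch):

```python
import numpy as np
from scipy.optimize import minimize

def solve(t):
    """min 0.5||y||^2 subject to y1 + y2 >= 1 and y1 + t*y2 <= 1."""
    cons = [{"type": "ineq", "fun": lambda y: y[0] + y[1] - 1.0},
            {"type": "ineq", "fun": lambda y: 1.0 - y[0] - t * y[1]}]
    res = minimize(lambda y: 0.5 * y @ y, x0=np.array([1.0, 0.0]),
                   constraints=cons, method="SLSQP")
    return res.x

print(solve(0.99))   # approximately (0.5, 0.5)
print(solve(1.01))   # jumps to approximately (1.0, 0.0)
```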
That the input space (set of values for the parameter x of the program P(x)) can be
partitioned into polyhedral subregions associated to index sets of active constraints is a
well-established result in variational analysis. One already finds it in spirit in a basis
decomposition theorem for linear programming proved in Walkup and Wets (1969a) and
restated in Wets (1974, Theorem 7.2). Interestingly, local Lipschitz continuity properties
useful in the analysis of the stability of two-stage stochastic linear programs (see Ap-
pendix D, Lemma D.13) are proved in Rachev and Römisch (2002, Proposition 3.2) by
appealing to such a decomposition.
The decomposition of the input space into polyhedral cells is rediscovered in Bempo-
rad et al. (2002) in the context of online quadratic programming for Model Predictive
Control (MPC). Further work along that line (including work on better implementations)
has been pursued since then (Tøndel et al., 2003; Spjøtvold et al., 2007; Baotić et al.,
2008). We also arrive in Section 6.3 to the polyhedral decomposition in the context of our
learning setting, but then we let the construction of the cells be guided by the empirical
distribution of the input data. In the practical implementation of the approach, we use
a hybrid method able to find a tradeoff between cell-based, closed-form calculations and
standard optimization, recognizing that all problems are not addressable efficiently by
the subregion-based approach.
Indeed, the worst-case complexity of parametric linear programming (Murty, 1980)
suggests that it is always possible to come up with examples contrived in such a way
that a subregion-based approach performs badly — in our case, the problems P (m, s; 5)
with the rows of the constraint matrix uniformly distributed in every direction.
The identification of active constraints has been studied by Facchinei et al. (1998) in
the context of nonlinear programming and variational inequalities through so-called iden-
tification functions. This work starts from the observation that under the Mangasarian-
Fromowitz constraint qualification, a primal-dual solution has a neighborhood where the
set of active constraints is not modified. The technical conditions that identification
functions have to satisfy are defined, and identification functions are built for specific
problem classes by exploiting error bounds valid for those classes — see also Facchinei
and Pang (2003, Chapter 6). The determination of the subregion where the identification
of active constraints is correct is left as an open problem. The methodology that we have
adopted in this chapter is (i) to find conditions ensuring the existence of a neighborhood
where active constraints are not modified (Proposition 6.19); (ii) to check whether the
subregion associated to a set of active constraints is a connected set (Proposition 6.21);
(iii) to see whether the extent of the subregion could be found (Proposition 6.22).
The notion of local models used throughout the chapter can be seen as a basic form
of single-valued localization for general solution mappings, as developed by Dontchev and
Rockafellar (2009). Investigating some aspects of the framework expounded in Dontchev
and Rockafellar (2009) was also one of the goals of this chapter.
Finally, we mention that some additional work would be needed to adapt the results
established in this chapter to the concrete MAP repair procedure used in Section 5.3
for restoring the feasibility of decisions predicted by learned policies. In particular, the
results relative to the parametric program (6.1) would have to be generalized to the
6.6. Conclusions 131

parametric program

P ′ (x(ω), w(ω)) : minimize 1/2 ||y||2 subject to Ay ⪯ x(ω) , B1 y + B2 z ⪯ w(ω) ,

where the minimization is over y ∈ Rm1 and z ∈ Rm2 , and the parameters are x(ω) ∈ Rs1
and w(ω) ∈ Rs2 . This form is more general than (6.1), except when B2 is such that
{B2 z : z ∈ Rm2 } = Rs2 (which allows to ignore the constraint B1 y + B2 z  w(ω),
trivially satisfied for any y with some proper choice of z). The mapping of interest is the
mapping from (x(ω), w(ω)) to the uniquely determined part y ∗ (ω) of an optimal solution.
Chapter 7

Conclusion

This thesis presents novel strategies for the search of approximate solutions to multistage
stochastic programs. The framework is based on the association of statistical learning
techniques to scenario-tree approximation techniques from the multistage stochastic pro-
gramming literature. At first, the framework serves two purposes:

• Making it possible to estimate the true value of an approximate solution in a generic


way;

• Making it possible to build discretizations (scenario trees) that allow to obtain


satisfying solutions to the true problem.

We propose several practical implementations of the framework based on various super-


vised learning techniques, ranging from neural networks to Gaussian processes, and apply
the general methodology to a variety of multistage problems (a family of risk-averse pro-
duction management problems under price uncertainty, and a multi-product assembly
problem under demand uncertainty).
From a higher level perspective, the association of multistage stochastic programming
and supervised learning can be viewed as a specific method for sequential decision making
under uncertainty, well adapted to large continuous action spaces. The quality of the
decision policies found on test problems with this approach suggests that it is possible
to capture a part of the value of multistage stochastic programming models in a variety
of contexts, at an acceptable computational cost.

7.1 Summary of Contributions

A multistage stochastic programming problem is an optimization problem of the form


minπ E{f (ξ, π(ξ))}, where ξ = (ξ1 , . . . , ξT ) is a random process, π is a mapping from ξ
to a sequence of decisions u = (u1 , . . . , uT ) adapted to the gradual observation of ξ, and
f (ξ, π(ξ)) is the cost of applying π(ξ) with ξ. The cost is formally set to +∞ if the
sequence of decisions π(ξ) is infeasible on ξ. Chapter 2 has provided an introduction to
this mathematical formalization, while more technical material has been collected in the
Appendices.
In the case where ξ has a continuous distribution, the expectation operator represents
a multidimensional integral, so even before considering the problem of searching for a
best mapping π, the mere evaluation of the cost function f given a fixed mapping π̄ is
134 Chapter 7. Conclusion

difficult. For general distributions, the question “Is E{f (ξ, π̄(ξ))} ≤ θ ?” can only be
answered up to a certain probabilistic confidence level α < 1.
A second computational difficulty stems from the fact that approximate stochastic
programming solution techniques furnish “solutions” that are fully specified only for
the first decision stage. To evaluate on a new realization of ξ the mapping π (the
decision policy) induced by these approximation techniques, one has to solve a sequence
of approximate versions of the original problem posed over gradually shrunk time horizons
(see Chapter 2).
Our diagnosis is that combined together, these two computational difficulties render
it impossible to assess in practice the third level in the hierarchy of the stochastic pro-
gramming methodology, namely, the adjustment of the sampling or discretization method
that replaces expectations by finite sums (so as to yield a program on a finite number of
optimization variables). Yet, we view this third level as a key ingredient for the success of
the whole approach (specific constructions back up this view in Chapter 4): the fact that
the random process is gradually observed, translated to a tree-structured representation
of the samples (scenarios), leaves many degrees of freedom for adjusting the location of
branchings in the tree, a possibility that should be exploited in the context of problems
posed over long time horizons (or more generally in the context of multistage problems
where the dimension of the random process is high).
In Chapter 2, we have presented multistage stochastic programming in the context of
several competing frameworks and methods for sequential decision making under uncertainty,
such as Markov Decision Processes (MDP) and Model Predictive Control (MPC). We
have mentioned several solution heuristics for multistage stochastic programming that
have been explored in the optimization and operations research literature, such as two-
stage approximations, aggregation and averaging strategies, and consensus strategies
(Section 2.3).
In principle, the value of a multistage stochastic programming model over other or
simpler models cannot be estimated without building and developing a solution method
for all those models on the real data. The examples and case studies presented in the
thesis have been selected (Section 4.4) or created (Example 2.1, Section 5.3) after careful
experimentations on the model and problem data, in such a way that the multistage
model had a high value with respect to other models — in particular two-stage approxi-
mations — given the numerical data. This was an important stage for a sound evaluation
of the solution methods proposed in the thesis, but also time-consuming, which explains
in part why we chose not to multiply the number of examples or assess the methods on
problem instances with random or arbitrary data.
In Chapter 3, a series of statistical estimation methods has been considered, from max-
imum likelihood and maximum a posteriori estimation to bootstrap aggregating meth-
ods (bagging). The particular mix of perturbation, averaging and selection steps that
differentiates those methods suggests that the estimation and optimization aspects in
stochastic programming problems could in fact be given a unified treatment, based on
Monte Carlo methods and importance sampling methods. The Cross-Entropy method
for the simulation of rare events, and its application to combinatorial optimization, was
identified as a promising candidate for reducing the conceptual gap between the two
aspects (Section 3.2.3). At the same time, the idea of solving an ensemble of random
approximations to a multistage stochastic program, and then aggregating the results,
7.1. Summary of Contributions 135

was presented as an estimation technique, related to bagging, that can be applied to a


set of perturbed (noisy) first-stage decisions.
To serve as a test bed for the development of these ideas, a multistage stochastic
programming formulation of a simple benchmark problem from reinforcement learning
with a finite discrete action space has been considered (Section 3.3). The optimization of
scenario-tree approximations is made via the Cross-Entropy method. Standard aggrega-
tion procedures based on averages or majority votes are not suitable for the processing
of discrete decisions, so we have proposed a generalization of the aggregation operation
based on the introduction of positive-definite kernels. This approach has been evaluated
by a series of experiments on the test problem, for which it was shown that the proposed
aggregation technique allows indeed to select a good first-stage decision out of the set of
candidate optimized first-stage decisions.
In Chapter 4, we have proposed a practical procedure for deriving candidate map-
pings π (decision policies) from a data set containing optimized sequences of decisions
contingent to scenarios, originating from an optimal solution to a given scenario-tree
based approximate program. We have proposed the use of supervised learning techniques
for inferring a mapping π, decomposed into its successive decision rules for u1 , . . . , uT .
The learning task raises however several issues: (i) Each individual learning problem is
over a growing input space, and over an output space corresponding to a decision ut of
potentially large dimension; (ii) The decision policy π has to comply with the original
program’s feasibility constraints, that structure the sequence of decisions and the compo-
nents of the individual decisions ut , so that the learning problems cannot be completely
decoupled; (iii) The data sets of decisions are noisy and biased (as they are obtained
from approximate programs), whereas the learned policy should have good performances
on the true multistage stochastic program; (iv) The sequences of decisions π(ξ) must
be computed quickly, a requirement that restricts the policy representations usable in
practice (for example, by introducing limitations on the size of the models built with
nonparametric methods).
A first set of strategies to circumvent these difficulties has then been proposed and
evaluated in the context of a practical problem (Section 4.4). Namely,

• Preprocessing the data set of optimal solutions and scenarios, so as to obtain a


compact representation of the information state. The introduction of new state
variables, that may depend on past decisions and past state variables, allows a
dynamical decoupling of the learning problems.

• Posing the learning problems over a transformed output space, where the feasibility
constraints can be more easily enforced.

• Combining learned models with repair procedures, so as to adjust the predicted de-


cisions to the actual feasibility constraints.

• Basing the model selection of a feasible policy π on an estimate of its expected cost
on the true multistage problem, rather than on the loss function of the supervised
learning problems.

Two model selection strategies have been proposed. The first one consists in the simul-
taneous search of the hyperparameters of π viewed as a whole entity. The second one
136 Chapter 7. Conclusion

consists in searching the hyperparameters of π gradually on the sequence of decisions,


using optimizations over independent scenario trees of shrinking depth as a proxy as long
as the policy π is not fully specified (see Section 4.2.3).
In Chapter 4, we have taken advantage of these novel solution valuation capabilities to
open up new avenues for the generation of scenario trees adapted to multistage problems
on long time horizons. Specifically,

• We have proposed to consider several scenario trees for a same multistage problem,
each of them leading to distinct approximations of the true problem. The trees are
to be ranked according to the value of the best policy that can be learned using
the data set of decisions optimized on those trees.

• We have suggested to retain the best policy of the best tree, say π ? , as a suboptimal
but feasible solution to the true multistage problem. The empirical estimate θ̂
of E{f (ξ, π ? )}, obtained by simulating the policy π ? on a new independent test
sample, furnishes a certificate of performance on the true problem. The estimate θ̂
can be adjusted if one wants to consider confidence levels. In other words, π ? is a
witness to the claim

minπ E{f (ξ, π(ξ))} ≤ θ̂

at a certain level of confidence. As the full distribution of the cost of using π ?


can also be estimated from Monte Carlo simulation (using histograms), arbitrary
measures of risk could also be reliably estimated, using a very large test sample.

• Practically, we have proposed to randomize the branching structure of the scenario


trees, so as to obtain a rich diversity in the considered scenario trees. We have
proposed simple strategies that allow to generate the trees in a top-down fashion
(and thus do not require to build and store large trees before pruning them in a
bottom-up fashion).

• We have demonstrated on a family of test problems that the full approach is imple-
mentable in practice, at a very moderate computational cost, and yields, for those
test problems, near-optimal policies.

• Thanks to the moderate computational cost of this novel tree selection method, we
were able to study empirically the effect of meta parameters on the quality of the
solution, such as the number of scenarios in the trees to consider, or the type of
sampling processes for the values at the nodes of the trees.

Our experiments indicate that considering a large number of small trees can lead
to an excellent tradeoff between solution quality and computational time.
In Chapter 5, we have considered a second set of strategies for learning policies, in
the context of a four-stage multi-product assembly problem under demand uncertainty
for which the value of the multistage formulation is high (Section 5.3).

• The general principle underlying the proposed learning approach is that the decisions
of a policy could initially be represented as probability densities conditioned on a
growing number of observations.
7.2. Perspectives 137

• The selection of a single decision is formulated as a maximum a posteriori (MAP)


estimation problem, subject to feasibility constraints (Section 5.1). Under suitable
assumptions on the densities to consider, the procedure is closely related to the
projection of a single predicted decision on the current feasibility set.

• The framework has been found to lead naturally to Gaussian Process regression
techniques.

• The choice of the covariance matrices (kernels) of the Gaussian Process models is
sometimes facilitated by the knowledge of the algorithm that generates the scenario
tree.

• Experiments on the test problem have demonstrated that with a suitable choice of
the kernels and of the repair procedure (the projection of the candidate decisions
on the current feasibility set), a near-optimal policy could be selected.

In Chapter 6, we have addressed the more fundamental issue of the usefulness, in


terms of computational complexity, of building a model for predicting exact optimal
solutions rather than computing them with a standard optimization procedure, albeit
the issue has been addressed in the context of a specific setting (projections on random
polyhedra).

• We have indeed formulated the question in a setting where fruitful results and
insights can be derived. The setting is a well-known class of parametric strictly
convex quadratic programs, and is related to the feasibility restoration task as for-
mulated in Chapter 5, although some additional work would be needed to generalize
the results to general convex quadratic programming problems.

• We have improved our understanding of the structure of the solutions to these


parametric programs by establishing, from basic geometrical principles, properties
of the optimal solutions.

• In particular, we have identified a relation between the potential size of an exact


predictive model for the optimal solutions, and the smallest positive singular values
of some matrices. We have illustrated that relation by generating parametric test
problems in a particular way, and then estimating the size of the predictive models.

• We have developed a self-improving algorithm for solving a set of instances from


the considered class of parametric optimization problems, able to find a tradeoff
between an online learning approach and a pure optimization approach, mostly by
controlling the size of the learned model.

7.2 Perspectives

Learning a policy from a data set of optimized decisions is a technical compromise. One
would certainly prefer to optimize the parameters of a parametric policy directly on a
scenario tree, or by simulation combined with stochastic gradient descent techniques.
The idea of searching for policies directly seems to be as old as stochastic programming
itself (Garstka and Wets, 1974). Unfortunately, at the exception of particular settings
138 Chapter 7. Conclusion

where simple parametrizations can be used (decisions u affine in ξ or in features of ξ)


(Garstka and Wets, 1974; Ben-Tal et al., 2004), such formulations lead to nonconvex op-
timization problems. While one can still in theory argue for the selection of a local minimum,
one has also to observe that a parametric policy can seldom accommodate hard con-
straints, so that for most practical multistage problems, fixing the parametric form and
optimizing the parameters directly would simply lead to an infeasible problem. Circum-
venting such difficulties by relaxing hard constraints or reformulating the optimization
problem in terms of state variables would then simply bring us back to frameworks based
on Markov Decision Processes (Section 2.2.2).
By taking the intermediate step of exploiting the usual multistage stochastic pro-
gramming framework to obtain sequences of decisions tractably, one furnishes an initial
data set of decisions that plays the role of an initial condition for the simulation-based
optimization of a policy. As much flexibility is gained in the representation of the policy,
the policy can now be nonlinear (as in neural networks), non-parametric (as in Gaussian
processes), and/or incorporate repair procedures adapted to the hard constraints. From
that point of view, the use of several scenario trees can be interpreted as the use of sev-
eral initial conditions from which a policy can be locally optimized, while algorithms for
generating good scenario trees would implicitly aim at generating good initial conditions.
Clearly, by introducing an intermediate step in the direct policy optimization task,
we isolate (decouple) the two subtasks of optimizing decisions and optimizing policy
parameters, and thus block the circulation of information between the two subtasks. A
full range of iterative methods, alternating between the two subtasks, could be developed
with the aim of restoring the information flow — the iterative model selection procedure
(Algorithm 4.2) is a step in that direction. However, one would still have to demonstrate
that such an approach has some practical advantages over standard decision making
strategies based on successive shrinking-horizon optimizations.
In this thesis, we have presented methods, algorithms, and test problems on which
significant computational gains over alternative approaches could be shown. One can
still think of several potential improvements or extensions to the proposed techniques,
whose value will have to be assessed by applying the proposed solution methodology based
on multistage stochastic programming and supervised learning to a wider variety of
applications.

• We have considered an application with many decision stages and a low-dimensional
random process (Section 4.4), but we have not considered an application with few
decision stages and a high-dimensional random process. An adaptation of the
choices concerning the supervised learning component and of the model selection
procedure may be needed in that context.

• We have not considered multistage problems based on mixed-integer linear programming
formulations, which are currently under active investigation (Escudero,
2009). Our impression is that the advantage of a multistage stochastic programming
formulation over a Markov Decision Process formulation is less clear when
convex optimization tools cannot be used. Yet, an adaptation of the supervised
learning approach to discrete decision spaces is possible, in the spirit of the
proposals made in Chapter 3, and in that case, it should be possible to identify an
application on which the advantages of the solution methodology developed in
Chapter 4 would appear.

• For specific applications, specific decision rules might be proposed and tested. For
instance, it is often the case in planning and sequential decision making under
uncertainty that one is offered the choice to act now (the implementation details
being adjusted greedily), or to postpone the decision. Such situations have been
analyzed by Van Hentenryck and Bent (2006, Chapter 8) in the context of the
online dispatching of a taxi fleet, but could also be found in electricity generation
planning (optimal response to contingencies). Then, a fundamental component of
the decision policy is the mapping from the information state, possibly represented
by features, to the delay before taking irreversible or expensive decisions. The
mapping could be learned from the data collected from optimized multistage
stochastic programs, and then further adjusted by simulations. If the decision of
acting now is selected, what we refer to as the repair procedure could be anything
from a greedy, one-step online optimization, to a call to another policy dedicated
to immediate actions.

• A proper way of combining, in a single data set, the scenario/decision examples
collected from several scenario-tree approximations solved to optimality, so as to infer
from this data set, or a post-processed version of it, a policy with better performance
on the exact problem than the best of the policies learned from the data
relative to a single scenario tree, is still to be found and shown to be computationally
efficient. Our intuition is that this approach could be especially interesting for
multistage problems with high-dimensional random processes, but would require
much work to ensure that the inconsistencies among the data sets of decisions are
innocuous in the context of the learning algorithm, or can be corrected by some
problem-dependent processing step.

• In Section 4.4.2, it was observed that a near-optimal policy had been obtained from
a scenario tree with statistical properties (including first moments) very far from
those of the targeted random process. This suggests that the paradigm according
to which the most rational objective is to build a unique scenario tree as “close” as
possible (in any sense) to the original random process is in fact too restrictive in the
context of challenging multistage problems.
Besides the proposed multi-tree framework based on branching structure random-
ization, it might be conceivable to perturb the parameters of the targeted random
process itself (as long as the learned policies are ultimately tested on the exact
random process).
The objective of the approximate multistage programs could also be perturbed
or modified, for instance by adding regularization terms (as long as the learned
policies are ultimately tested with the exact cost function).

In Chapter 6, a simple setting was identified in which the complexity of an explicit
representation of an optimal solution mapping could be studied, and related to
problem data (rather than problem structure). At the time of writing, we are still
discovering recent work on explicit representations of solution mappings in slightly more
general settings (Patrinos and Sarimveis, 2010), and we may expect that such research
directions will continue to be investigated in the future. Nevertheless, it is clear from our
experiments that the construction of a purely explicit representation of an exact solution
mapping is doomed to fail in some circumstances.
Even if the hybrid approach that we have proposed in Chapter 6 is shown to be able to
extract the computational benefit of explicit solution mappings when it exists, while
avoiding the risk of catastrophic performances in terms of computational times in other
circumstances, a question not yet addressed in Chapter 6 is the determination of the
extent to which an inexact approach, based on supervised learning and followed by a
simple feasibility restoration, allows one to overcome the potentially overwhelming
complexity of an exact model.
In the pure supervised learning setting, this question has been answered (at least
for classification) by the notions of hypothesis space, structural risk minimization, and
VC-dimension (Vapnik, 1998). In this thesis, we have not constructed a general theory
for predicting the sample complexity (number of scenarios, location of branchings in the
trees, number of generated trees) for learning a good enough policy in the context of a cho-
sen hypothesis class (space of policies), but rather have relied on standard model selection
principles, and on numerical testing. It could be interesting to see whether supervised
learning notions of combinatorial complexity (such as VC-dimension and Rademacher
complexity) (Bartlett and Mendelson, 2002) could be adapted more directly to the set-
ting of optimal solution mapping approximation, for instance, referring to Chapter 6
again, by using quantities related to the smallest positive singular values of matrices
of active constraints.

7.3 Influences on Research in Machine Learning

From a general point of view, technical or conceptual advances in stochastic programming
techniques can influence specific fields of machine learning. We list below some
possibilities that come to mind.

• The investigations on general risk functionals made in the context of multistage
stochastic programming can have an impact on the research in reinforcement learning
focussed on risk-aware strategies, as confirmed by various recent works (Morimura
et al., 2010).

• Decomposition algorithms initially motivated by stochastic programming applications
(Rockafellar and Wets, 1991) can be used in the context of supervised learning
(Defourny and Wehenkel, 2009). Algorithmic advances in robust stochastic approximation
methods (Nemirovski et al., 2009) could also have an impact on supervised
learning approaches relying on large-scale optimization techniques.

• The work on scenario tree generation methods is likely to have an impact on op-
timal experiment design, active learning, and on the direct selection of concise
data sets (Rachelson et al., 2010) for reducing, at the source, the complexity of
non-parametric models, or for facilitating the processing of data by complex algo-
rithms.

From a point of view more specific to the present work, Chapter 6 suggests possible
research directions in supervised learning and artificial intelligence, based on simple
settings in which concepts such as “learning to learn” or “learning faster” are given a
simple mathematical formalization. It would certainly be worth exploring and developing
such approaches further, as they could allow one to better integrate existing technologies
and to focus on context detection, rather than on the learning task itself. Related work
in this general orientation includes Ailon et al. (2006) and Hartland et al. (2006).
Appendix A

Elements of Variational Analysis

This appendix presents material from variational analysis (Rockafellar and Wets, 1998)
useful in the study of minimization problems subject to constraints, and approximations
of those minimization problems.
This material is a part of the fundamental theoretical background supporting many
works in stochastic programming, and more generally many works in optimization. It
provides a convenient formalism that we use in the thesis for discussing optimization
programs abstractly, although we do not insist in the main body of the thesis on some
of the technical subtleties highlighted in the present appendix, as these subtleties are
not absolutely required to communicate on the kind of work carried out in the context of the
thesis.
The appendix is organized as follows. Section A.1 defines minimization problems
through extended-real-valued functions. Section A.2 introduces notations for dealing
with sequences, subsequences and neighborhoods. Section A.3 defines the notion of
semi-continuity. Section A.4 gives sufficient conditions for the existence of optimal so-
lutions. Section A.5 defines the notion of epigraph. Section A.6 defines the notion of
epi-convergence of functions. Section A.7 connects epi-convergence to the property that
optimal solutions converge to true optimal solutions. Section A.8 relates epi-convergence
to other modes of convergence of functions. Section A.9 considers the generalization of
results to parametric optimization. Section A.10 considers the particularization of results
to convex optimization. Section A.11 defines the notion of local Lipschitz continuity.

A.1 Minimization

Let R = R∪{−∞, +∞} denote the set of extended real numbers. Minimization problems
and constrained minimization problems can be defined through the notion of extended-
real-valued functions.

A.1 Definition. An extended-real-valued function f : Rn → R assigns to each


element x = (x1 , . . . , xn ) ∈ Rn a value in R, written f (x).

The infimum of f , written inf f , is the greatest lower bound of f , that is, the greatest
value v ∈ R such that v ≤ f (x) for all x ∈ Rn . The infimum of f on a (possibly empty)
set C ⊂ Rn , written inf C f , is the greatest lower bound of the extended-real-valued
function that assigns to x ∈ C the value f (x), and to x ∈ Rn \ C the value ∞. When
C = Rn , inf C f = inf f . To emphasize the argument of f , we may write inf x f (x) instead
of inf f , and inf x∈C f (x) instead of inf C f .


Similarly, the supremum of f , written sup f , is the least upper bound of f , and the
supremum of f on C ⊂ Rn is the least upper bound of the function that assigns to x ∈ C
the value f (x), and to x ∈ Rn \ C the value −∞. We always have sup f = − inf −f . We
have inf C f ≤ supC f if and only if C ≠ ∅.
If inf C f < ∞ and there exists some x ∈ C such that f (x) = inf C f , we say that
the minimum of f on C is attained, we denote inf C f by minC f , and we define the
set of minimizers of f over C, written argminC f , as the subset of elements x of C such
that f (x) = inf C f . If inf C f < ∞ but no x ∈ C satisfies f (x) = inf C f , we say that
the minimum of f on C is not attained, and we set argminC f to the empty set ∅. If
inf C f = ∞, we set argminC f = ∅.
Minimizing f on C refers to the task of evaluating minC f and finding a point x ∈
argminC f . Very often in applications, f and x have an interpretation, and requirements
on x are expressed through inequality constraints fj (x) ≤ 0 and equality constraints
hj (x) = 0 using functions fj : Rn → R, 1 ≤ j ≤ p, and hj : Rn → R, 1 ≤ j ≤ q. In that
specific case,

C = {x ∈ Rn : fj (x) ≤ 0 for 1 ≤ j ≤ p, hj (x) = 0 for 1 ≤ j ≤ q} .

As minimizing f on C is equivalent to minimizing the function that coincides with f


on C and is set to ∞ on Rn \ C, in the sequel we will simply focus on the minimization
of f [on Rn ], assuming that C is already embedded in f .
Given a nonempty set X ⊂ R, we also use the notation inf X for the greatest v ∈ R
satisfying v ≤ x for all x ∈ X. When v = inf X is in X, we write v = min X. Similarly,
sup X is the least v ∈ R satisfying v ≥ x for all x ∈ X, written max X when v ∈ X. The
case X = ∅ is handled by setting inf ∅ = ∞ and sup ∅ = −∞.
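To make the convention of embedding the constraint set C in f concrete, here is a small numerical sketch of our own (a toy problem, not taken from the thesis), in Python: the value +∞ is assigned outside C, and the minimization is carried out on the resulting extended-real-valued function.

    import numpy as np

    # Minimize f0(x) = (x1 - 1)^2 + (x2 - 2)^2 over C = {x : x1 + x2 <= 1},
    # working with the function f equal to f0 on C and +inf outside C
    # (crude grid search on an exactly representable grid).
    x1, x2 = np.meshgrid(np.linspace(-3.0, 3.0, 25), np.linspace(-3.0, 3.0, 25))
    f0 = (x1 - 1.0) ** 2 + (x2 - 2.0) ** 2
    f = np.where(x1 + x2 <= 1.0, f0, np.inf)     # the set C is embedded in f
    k = np.argmin(f)
    print("min_C f0    =", f.flat[k])            # 2.0, attained
    print("argmin_C f0 =", (x1.flat[k], x2.flat[k]))   # (0.0, 1.0)

Here the minimum is attained; when it is not, argminC f is the empty set, as specified above.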

A.2 Sequences

Let N (x̄) denote the collection of all neighborhoods of x̄ ∈ Rn . We take the notions
of open set and neighborhood for granted (Mendelson, 1990). We are about to deal
with properties that must hold for all V ∈ N (x̄). In the metric space (Rn , d) where
d(x, y) = ||x − y|| = [ Σi=1,...,n (xi − yi )2 ]1/2 , the properties that we will consider hold for all
V ∈ N (x̄) iff they hold for all open Euclidian balls of rational radius centered at x̄, that
is, for all V of the form {x ∈ Rn : ||x − x̄|| < δ} with 0 < δ ∈ Q.
Let the topological closure and interior of a set C ⊂ Rn be defined by

cl C = {x ∈ Rn : V ∩ C 6= ∅ for all V ∈ N (x)} ,


int C = {x ∈ Rn : V ⊂ C for some V ∈ N (x)}

(Rockafellar and Wets, 1998, page 14). The topological boundary of a set C is defined
by bdry C = cl C \ int C.
Let {xν }ν∈N denote a sequence x1 , x2 , . . . with xν ∈ Rn and ν ∈ N (the set of natural
numbers, taken as the index set of the elements of the sequence). The set of points x ν in a
sequence {xν }ν∈N is called the range of the sequence (Rudin, 1976, page 48). A sequence
is said to be bounded if its range is bounded. In (Rn , d), a sequence {xν }ν∈N is said to
converge to x̄ (or to have x̄ as its limit point), written xν → x̄ or limν→∞ d(xν , x̄) = 0,
if for any ε > 0, there is some ν0 ∈ N such that ν ≥ ν0 entails d(xν , x̄) < ε. For instance,
the constant sequence with xν = x̄ converges to x̄. We can also have xν → x̄ despite
xν ≠ x̄ for all ν.
Given a sequence {xν }ν∈N , the sequence xν1 , xν2 , . . . , where ν1 , ν2 , . . . is a sequence of
positive integers such that ν1 < ν2 < . . . , is called a subsequence of {xν }ν∈N . To facilitate
statements involving subsequences, let N∞ denote the collection of subsets of N of the
form {ν0 , ν0 + 1, . . . }, which contain all integers k greater or equal to some positive
integer ν0 . Let N∞# denote the collection of all subsets of N of infinite cardinality. Note
that N∞ ⊂ N∞# . Given N ∈ N∞ or N ∈ N∞# , we shall write xν →N x to indicate that
the subsequence {xk }k∈N of the sequence {xν }ν∈N converges to x. The limit point x of
a subsequence {xk }k∈N with N ∈ N∞# is called a cluster point of the sequence {xν }ν∈N .
It is also called an accumulation point of the sequence {xν }ν∈N if xν →N x with xν ≠ x
for all ν ∈ N .
In a metric space, it is often illuminating to view a neighborhood V ∈ N (x̄) as the
[interior of the closure of the] union of the ranges from all sequences in V that converge
to x̄. Such a viewpoint leads to definitions based on sequences — consider, for instance,

cl C = {x ∈ Rn : there is some sequence {xν }ν∈N converging to x with xν ∈ C for all ν ∈ N}
     = {x ∈ Rn : ∃ xν → x with xν ∈ C} (brief statement),
int C = {x ∈ Rn : for all xν → x, there is some ν0 ∈ N such that xk ∈ C for all k ≥ ν0 }
      = {x ∈ Rn : ∀ xν → x, ∃ N ∈ N∞ such that xν ∈ C for all ν ∈ N } .

In the sequel, following Rockafellar and Wets (1998), statements are made preferably
in terms of sequences.

A.3 Semicontinuity

The following definition of lower and upper limits uses a min/max characterization proved
in Rockafellar and Wets (1998, Lemma 1.7).

A.2 Definition. The lower and upper limits of an extended-real-valued function f :
Rn → R at x̄ are values in R defined respectively as

lim inf x→x̄ f (x) = supV ∈N (x̄) inf x∈V f (x)
                  = min{α ∈ R : ∃ xν → x̄ with f (xν ) → α} ,
lim sup x→x̄ f (x) = inf V ∈N (x̄) supx∈V f (x)
                  = max{α ∈ R : ∃ xν → x̄ with f (xν ) → α} .

The convergence to α = ±∞ is interpreted as follows: f (xν ) → ∞ if for any
ρ > 0, there is some N ∈ N∞ such that ν ∈ N entails f (xν ) ≥ ρ, and symmetrically
f (xν ) → −∞ if f (xν ) ≤ −ρ for all ν ∈ N .

By considering the constant sequence xν = x̄ one sees that lim inf x→x̄ f (x) ≤ f (x̄)
and lim supx→x̄ f (x) ≥ f (x̄).

A.3 Definition. A function f : Rn → R is lower semicontinuous (l.s.c) at a point x̄


if lim inf x→x̄ f (x) ≥ f (x̄), or equivalently lim inf x→x̄ f (x) = f (x̄). It is upper semicon-
tinuous (u.s.c.) at a point x̄ if lim supx→x̄ f (x) ≤ f (x̄), or equivalently lim supx→x̄ f (x) =
f (x̄). The function f is l.s.c (respectively u.s.c.) if it is l.s.c. (respectively u.s.c.) at
every point x̄ ∈ Rn .

Sometimes we speak of functions with properties (such as semicontinuity) relative to a
set X with X a subset of Rn . This means that we only consider sequences xν converging
to x̄ ∈ X, with xν ∈ X for all ν. In that case, we write xν →X x̄, and redefine the liminf
and limsup operators as follows.

A.4 Definition. For properties invoked as relative to X, the limits are taken over
sequences in X. In particular, the lower and upper limits of f : Rn → R at x̄ relative to X
become

lim inf x→X x̄ f (x) = supV ∈N (x̄) inf x∈V ∩X f (x)   and   lim sup x→X x̄ f (x) = inf V ∈N (x̄) supx∈V ∩X f (x) .

Among the numerous characterizations of continuity, we can thus find the following
ones.

A.5 Proposition. A function f : Rn → R is continuous iff it is both l.s.c. and u.s.c.

A.6 Proposition. A function f : Rn → R is continuous relative to X iff xν →X x̄
entails f (xν ) → f (x̄). In particular with X = Rn , f is continuous iff xν → x̄ entails
f (xν ) → f (x̄).
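A one-dimensional example may help fix ideas (this is our own illustration, not taken from the cited references). Consider f : R → R with f (x) = 0 for x < 0 and f (x) = 1 for x ≥ 0. At x̄ = 0 one has lim inf x→0 f (x) = 0 < f (0) = 1 while lim sup x→0 f (x) = 1 = f (0), so f is u.s.c. but not l.s.c. at 0. If instead one sets f (0) = 0, then f becomes l.s.c. at 0 but is no longer u.s.c. there. At every x̄ ≠ 0 the function is continuous, hence both l.s.c. and u.s.c.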

A.4 Attainment of a Minimum

Let f : Rn → R be an extended-real-valued function. We consider the minimization of f ,


and in that context, we define the effective domain of f as the set
dom f = {x ∈ Rn : f (x) < ∞} .
A.7 Definition. The function f is proper if f has a nonempty effective domain and is
finite-valued there, that is, f (x) > −∞ for all x ∈ dom f .

The extended-real-valued functions that coincide with some real-valued function f 0 on


a nonempty set C and are equal to ∞ outside C are proper, whereas all other extended-
real-valued functions are improper. When f is proper, the corresponding real-valued
function f0 is referred to as the essential objective function.

A.8 Definition. The function f is lower level-bounded if for all α ∈ R, the set
lev≤α f = {x ∈ Rn : f (x) ≤ α} is bounded.

For example, the function x 7→ x2 is lower level-bounded whereas x 7→ exp(x) is not.


Both functions are continuous on their domain R.
The following theorem (Rockafellar and Wets, 1998, Theorem 1.9) provides sufficient
conditions under which an extended-real-valued function attains its minimum.

A.9 Theorem. If f : Rn → R is lower semicontinuous, lower level-bounded, and proper,


the infimum inf f is finite and attained on a nonempty compact subset of R n .

To handle situations where the evaluation of inf f and argmin f has a finite precision,
it is useful to consider, when inf f is finite, the set of ε-optimal solutions

ε-argmin f = {x ∈ Rn : f (x) ≤ inf f + ε} .

As the elements of ε-argmin f are themselves evaluated with a finite precision, it is useful
to clarify in which sense elements x̃ close to ε-optimal solutions are close to being optimal
(Rockafellar and Wets, 1998, Theorem 1.43):

A.10 Theorem. If f : Rn → R is l.s.c. with inf f finite, the closed Euclidian ball of
radius ρ > 0 centered at an ε-optimal solution x̄ (ε > 0) contains a point x̃ which is the
unique solution to the minimization of the perturbed function f (x) + ε ρ−1 ||x − x̃|| and
satisfies f (x̃) ≤ f (x̄).
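As a simple illustration of our own of these notions, take f (x) = exp(x) on R: inf f = 0 is finite but not attained, so argmin f = ∅, while for every ε > 0 the set of ε-optimal solutions ε-argmin f = {x ∈ R : exp(x) ≤ ε} = (−∞, ln ε] is nonempty and unbounded. For the lower level-bounded function f (x) = x2 , by contrast, ε-argmin f = [−√ε, √ε] shrinks to argmin f = {0} as ε decreases to 0.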

A.5 Epigraph

For a real-valued function f0 : Rn → R, recall that the graph of f0 is the set gph f0 =
{(x, α) ∈ Rn × R : α = f0 (x)}. For extended-real-valued functions to be minimized, we
consider epigraphs.

A.11 Definition. The epigraph of an extended-real-valued function f : R n → R is the


subset of Rn+1 defined by epi f = {(x, α) ∈ Rn × R : α ≥ f (x)}.

There are correspondences between the properties of an extended-real-valued function


and properties of its epigraph.

A.12 Proposition. For an extended-real-valued function f and its epigraph epi f :
i. dom f = {x ∈ Rn : (x, α) ∈ epi f for some α ∈ R}.
ii. f is proper iff epi f ≠ ∅ and epi f includes no entire vertical line {(x, α) : α ∈ R} with x ∈ Rn fixed.
iii. cl(epi f ) = {(x, α) ∈ Rn+1 : α ≥ lim inf x′ →x f (x′ )} (Rockafellar and Wets, 1998,
Exercise 1.13(a)).
iv. f is lower semicontinuous iff epi f is closed (Rockafellar and Wets, 1998, Theorem 1.6(b)).
v. int(epi f ) = {(x, α) ∈ Rn+1 : α > lim sup x′ →x f (x′ )} (Rockafellar and Wets, 1998,
Exercise 1.13(b)).

A.13 Definition. The lower closure of a function f : Rn → R, denoted cl f , is the
function whose epigraph is cl(epi f ). A function f is said to be closed when f = cl f .
Specifically, cl f (x) = lim inf x′ →x f (x′ ) (Rockafellar and Wets, 1998, Equation 1(7)).

We note that in earlier references (Rockafellar, 1970, 1974), a slightly altered definition
of the closure of a function was used: the function x 7→ lim inf x′ →x f (x′ ) of Definition A.13 was
denoted lsc f , and cl f was set to the constant function −∞ whenever lim inf x′ →x f (x′ ) =
−∞ for some x; cl f was set to lsc f otherwise.

A.6 Epi-convergence

First we consider notions of limits for sequences of subsets of Rn .

A.14 Definition (Painlevé-Kuratowski convergence of sets). The outer limit of
a sequence {C ν }ν∈N of subsets C ν ⊂ Rn is the set of limit points (if any) of all sequences
{xν }ν∈N such that N ∈ N∞# and ∅ ≠ C ν ∋ xν for each ν ∈ N :

lim supν→∞ C ν = {x ∈ Rn : ∃ N ∈ N∞# , ∃ xν ∈ C ν for each ν ∈ N, such that xν →N x}
               = ∩N ∈N∞ cl ∪ν∈N C ν .

The inner limit of a sequence {C ν }ν∈N of subsets C ν ⊂ Rn is the set of limit points (if
any) of all sequences {xν }ν∈N such that N ∈ N∞ and ∅ ≠ C ν ∋ xν for each ν ∈ N:

lim inf ν→∞ C ν = {x ∈ Rn : ∃ N ∈ N∞ , ∃ xν ∈ C ν for each ν ∈ N, such that xν →N x}
               = ∩N ∈N∞# cl ∪ν∈N C ν .

From the definition, the inner and outer limits are (possibly empty) closed sets. In
particular, for an arbitrary subset V of Rn , the constant sequence with C ν = V has
lim inf ν→∞ C ν = lim supν→∞ C ν = cl(V ). We always have the inclusion lim inf ν→∞ C ν ⊂
lim supν→∞ C ν since N∞ ⊂ N∞# . For instance, given two closed subsets A, B ⊂ Rn , the
sequence {C ν }ν∈N with C ν = A for ν odd and C ν = B for ν even has lim supν→∞ C ν =
A ∪ B and lim inf ν→∞ C ν = A ∩ B.

A.15 Definition. If lim inf ν→∞ C ν = lim supν→∞ C ν = C, the limit limν C ν exists and
is equal to C. One writes C ν → C to indicate that the sequence {C ν }ν∈N converges to C
(in the sense of the Painlevé-Kuratowski convergence of sets).

Next we consider a sequence {f ν }ν∈N of extended-real-valued functions f ν : Rn → R.

A.16 Definition. The lower epi-limit of the sequence {f ν }ν∈N , denoted e-lim inf ν f ν ,
is the function defined by identifying its epigraph to the outer limit of the sequence of
sets epi f ν :

epi(e-lim inf ν f ν ) = lim supν→∞ (epi f ν )
                     = {(x, α) ∈ Rn+1 : ∃ (xν , αν ) →N (x, α) with αν ≥ f ν (xν ) for some N ∈ N∞# } .

The upper epi-limit of the sequence {f ν }ν∈N , denoted e-lim supν f ν , is the function
defined by identifying its epigraph to the inner limit of the sequence of sets epi f ν :

epi(e-lim supν f ν ) = lim inf ν→∞ (epi f ν )
                     = {(x, α) ∈ Rn+1 : ∃ (xν , αν ) →N (x, α) with αν ≥ f ν (xν ) for some N ∈ N∞ } .

The value of the epi-limits at x has the following characterization proved in Rockafellar
and Wets (1998, Proposition 7.2). Note that for a sequence {y ν }ν∈N in R, the least and
greatest cluster points of the sequence are respectively lim inf ν y ν = limν→∞ [inf κ≥ν y κ ]
and lim supν y ν = limν→∞ [supκ≥ν y κ ].

A.17 Proposition. For a sequence of functions f ν : Rn → R and a point x ∈ Rn ,
(e-lim inf ν f ν )(x) = min{α ∈ R : ∃ xν → x with lim inf ν f ν (xν ) = α} ,
(e-lim supν f ν )(x) = min{α ∈ R : ∃ xν → x with lim supν f ν (xν ) = α} .
A.18 Definition. The sequence {f ν }ν∈N is said to epi-converge to f when the lower
and upper epi-limits are identical and equal to f . This is written f ν →e f .

Since by definition the lower and upper limits are closed, the epi-limit f , when it
exists, is lower semicontinuous (Rockafellar and Wets, 1998, Proposition 7.4(a)).

A.19 Proposition. The sequence {f ν }ν∈N epi-converges to f iff at each point x, the
two following conditions hold (Rockafellar and Wets, 1998, Equation 7(3)):
i. lim inf ν f ν (xν ) ≥ f (x) for every sequence xν → x,
ii. lim supν f ν (xν ) ≤ f (x) for some sequence xν → x.

We note the following monotone convergence property (Rockafellar and Wets, 1998,
Proposition 7.4(c-d)): if f ν ≥ f ν+1 for all ν (in the sense that f ν (x) ≥ f ν+1 (x) for every
x and ν), then f ν →e cl[inf ν f ν ]. If f ν ≤ f ν+1 for all ν, then f ν →e supν [cl f ν ].

A.7 Convergence in Minimization

Consider a sequence of extended-real-valued functions f ν : Rn → R representing a se-


quence of minimization problems. Among the several possible notions of convergence
according to which we could say that f ν converges to f , epi-convergence plays a key role
for ensuring that the optimal value of f ν also converges to the optimal value of f as
ν → ∞, and that sequences of minimizers for the functions f ν have cluster points that
are also optimal for f .

A.20 Theorem. Assume that the sequence {f ν }ν∈N epi-converges to a proper func-
tion f , with the functions f ν satisfying the assumptions of Theorem A.9 (f ν is proper,
l.s.c., and level-bounded). Then
i. The sets argmin f ν are nonempty and compact (by A.9);
ii. inf f is finite with argmin f nonempty and compact;
iii. inf f ν → inf f ;
iv. The cluster points of sequences {xν }ν∈N with xν ∈ argmin f ν are optimal for f :
∅ ≠ lim supν (argmin f ν ) ⊂ argmin f ;

v. If argmin f = {x̄}, any sequence {xν }ν∈N with xν ∈ argmin f ν converges to x̄.

In Rockafellar and Wets (1998, Theorem 7.33), the assumptions on the functions f ν
are weakened: the sets lev≤α f ν have to be bounded for all α ∈ R only for ν in some
N ∈ N∞ , as having argmin f ν nonempty and compact in the tail of the sequence suffices.
Also, the sets from which the solutions xν are extracted are the sets
εν -argmin f ν = {x ∈ Rn : f ν (x) ≤ inf f ν + εν }
with {εν }ν∈N a sequence with εν > 0 decreasing monotonically to 0. In Theorem A.20
it is explicitly assumed that f is lower semicontinuous, but actually, as an epi-limit, f is
necessarily lower semicontinuous (Rockafellar and Wets, 1998, Proposition 7.4(a)).
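A tiny numerical illustration of our own (in Python) of the conclusions of Theorem A.20: the functions f ν (x) = x2 + x/ν epi-converge (indeed converge continuously) to f (x) = x2 , and both the optimal values and the minimizers of the approximations converge to those of the limit.

    import numpy as np

    # f_nu(x) = x^2 + x/nu has argmin {-1/(2 nu)} and optimal value -1/(4 nu^2);
    # as nu grows these converge to argmin f = {0} and inf f = 0 for f(x) = x^2.
    x = np.linspace(-2.0, 2.0, 40001)
    for nu in (1, 10, 100, 1000):
        f_nu = x ** 2 + x / nu
        k = np.argmin(f_nu)
        print(nu, "inf f_nu ~", f_nu[k], "argmin f_nu ~", x[k])
    print("inf f =", (x ** 2).min(), "argmin f =", x[np.argmin(x ** 2)])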

A.8 Pointwise, Continuous and Uniform Convergence

Pointwise convergence of a sequence {f ν }ν∈N concerns the convergence of f ν (x) with x


fixed, in contrast to epi-convergence concerned with f ν (xν ).

A.21 Definition. Let {f ν }ν∈N be a sequence of functions f ν : Rn → R. The sequence


is said to converge pointwise on X when f ν (x̄) → f (x̄) for every x̄ ∈ X.

The pointwise convergence of f ν to a function f on Rn means that lim inf ν f ν (x) =
lim supν f ν (x) = f (x) for every x ∈ Rn .
In contrast to epi-convergence, pointwise convergence does not entail that inf f ν con-
verges to inf f (Rockafellar and Wets, 1998, Figure 7-1).

A.22 Proposition. Assume that {f ν }ν∈N converges pointwise to f . Even if f ν and f
satisfy the assumptions of Theorem A.9, we can have limν→∞ (inf x f ν (x)) ≠ inf x f (x)
and lim supν (argmin f ν ) ∩ argmin f = ∅.
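A standard type of counterexample for the first phenomenon (reconstructed here by us, in the spirit of Rockafellar and Wets (1998, Figure 7-1)) is a sequence with moving downward spikes: let f ν (x) = 1 − max{0, 1 − ν|x − 1/ν|} on R. For every fixed x, f ν (x) = 1 for all ν large enough, so f ν converges pointwise to the constant function f ≡ 1 and inf f = 1; yet inf f ν = 0 for every ν (attained at x = 1/ν), so inf f ν does not converge to inf f . The epi-limit of this sequence is not f but the function equal to 0 at x = 0 and to 1 elsewhere, which correctly records that argmin f ν = {1/ν} → {0}.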

With the following definition, taken from Rockafellar and Wets (1998, Exercise 7.9),
one gets a condition under which pointwise convergence entails epi-convergence (Rock-
afellar and Wets, 1998, Theorem 7.10). Here min{a, b} and max{a, b} refer to the lowest
and highest value between a and b.

A.23 Definition. A sequence {f ν }ν∈N is equi-lower semicontinuous at x̄ relative to
X ⊂ Rn iff for every ρ > 0 and ε > 0,

lim inf x→X x̄ f ν (x) ≥ min{f ν (x̄) − ε, ρ} for all ν ∈ N .

The sequence is equi-upper semicontinuous at x̄ relative to X iff for every ρ > 0 and
ε > 0,

lim sup x→X x̄ f ν (x) ≤ max{f ν (x̄) + ε, −ρ} for all ν ∈ N .

A sequence {f ν }ν∈N is equicontinuous at x̄ relative to X if it is both equi-l.s.c. and
equi-u.s.c. at x̄ relative to X. It is equi-l.s.c./u.s.c./continuous relative to X if it has
the corresponding property at every x̄ ∈ X. It is said to be asymptotically equi-
l.s.c./u.s.c./continuous relative to X if the stated conditions hold for all ν ∈ N for some
N ∈ N∞ .

A.24 Theorem. Let {f ν }ν∈N be a sequence of l.s.c. functions f ν : Rn → R. Assume
that the sequence is asymptotically equi-lower semicontinuous. Then f ν epi-converges
to a function f iff f ν converges pointwise to f .

At the same time, observe from Proposition A.17 that when f ν epi-converges to f ,
there is at least one sequence xν → x̄ such that f ν (xν ) → f (x̄), whereas for an arbitrary
sequence xν → x̄ epi-convergence does not ensure that f ν (xν ) → f (x̄).

A.25 Definition. The sequence {f ν }ν∈N is said to converge continuously to f at x̄
relative to X iff xν →X x̄ entails f ν (xν ) → f (x̄).

The following theorem is taken from Rockafellar and Wets (1998, Theorem 7.10):

A.26 Theorem. The sequence {f ν }ν∈N converges continuously to f at x̄ relative to X iff


f ν epi-converges to f at x̄ relative to X and the sequence is asymptotically equicontinuous
at x̄ relative to X.

Finally, we define the notion of uniform convergence for a sequence of extended-


real-valued functions (Rockafellar and Wets, 1998, Definition 7.12) and its relation with
epi-convergence (Rockafellar and Wets, 1998, Proposition 7.15(a)).

A.27 Definition. The ρ-truncation of an extended-real-valued function f : R n → R


is the real-valued function fρ : Rn → R defined by fρ (x) = f (x) on {x : |f (x)| ≤ ρ},
fρ (x) = ρ on {x : f (x) > ρ}, fρ (x) = −ρ on {x : f (x) < −ρ}.

A.28 Definition. A sequence {f ν }ν∈N of real-valued functions f ν : Rn → R is said to
converge uniformly to f on X ⊂ Rn if for each ε > 0, there is some N ∈ N∞ such
that |f ν (x) − f (x)| ≤ ε for every x ∈ X when ν ∈ N . A sequence {f ν }ν∈N of extended-
real-valued functions f ν : Rn → R is said to converge uniformly to f on X ⊂ Rn if
for each ρ > 0, the real-valued ρ-truncations fρν converge uniformly to the real-valued
ρ-truncation fρ .

A.29 Theorem. Let {f ν }ν∈N be a sequence of lower semicontinuous functions f ν :


Rn → R. If the functions f ν converge uniformly to a function f : Rn → R on a set
X ⊂ Rn , then the functions f ν epi-converge to f relative to X.

A.9 Parametric Optimization

A minimization problem with n optimization variables and m parameters can be viewed


as a single extended-real-valued function f : Rn × Rm → R. Such a function with values
f (x, u) induces a function-valued mapping u 7→ f (·, u).
Parametric optimization is concerned with the variation of p(u) = inf x f (x, u) and of
P (u) = argminx f (x, u) with u, that is, the characterization of the extended-real-valued
function u 7→ inf x f (x, u) and of the set-valued mapping u 7→ argminx f (x, u).
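To make p and P concrete, here is a small sketch of our own in Python (the problem is a toy example, not one from the thesis): for f (x, u) = (x − u)2 when x ≥ 0 and f (x, u) = ∞ otherwise, one has p(u) = min{u, 0}2 and P (u) = {max{u, 0}}, and a crude grid computation recovers these values.

    import numpy as np

    # p(u) = inf_x f(x, u) and P(u) = argmin_x f(x, u) for
    # f(x, u) = (x - u)^2 if x >= 0, and +inf otherwise.
    grid = np.linspace(-5.0, 5.0, 4001)              # grid over x, step 0.0025
    def p_and_P(u):
        f = np.where(grid >= 0.0, (grid - u) ** 2, np.inf)
        k = np.argmin(f)
        return f[k], grid[k]

    for u in (-2.0, -0.5, 0.0, 1.5):
        print("u =", u, "  (p(u), P(u)) ~", p_and_P(u))
    # analytically: p(u) = min(u, 0)**2 and P(u) = {max(u, 0)}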
Sufficient conditions for the attainment of the infimum inf x f (x, u) will use the fol-
lowing generalization of Definition A.8 (Rockafellar and Wets, 1998, Definition 1.16).

A.30 Definition. A function f : Rn × Rm → R with values f (x, u) is said to be (lower)


level-bounded in x locally uniformly in u if for each ū ∈ Rm , the sets

lev≤α (u) = {x : f (x, u) ≤ α}

are bounded for all α ∈ R and for every u in some neighborhood V ∈ N (ū) of ū.

The following definition is useful inasmuch as one cannot usually assert that p(u) is
continuous in u.

A.31 Definition. Let p be an arbitrary function from Rm to R. A sequence of points


uν ∈ Rm is said to converge in the p-attentive sense to ū if uν → ū and p(uν ) →
p(ū).

A.32 Proposition. For having uν converging to ū in the p-attentive sense, it suffices


to have uν → ū with p continuous at ū relative to a set U containing uν and ū.

As with Theorem A.20, there is a useful notion of epi-convergence, now adapted


to function-valued mappings. The following definition results from combining Proposi-
tion 7.2 and Exercise 7.40 in Rockafellar and Wets (1998).

A.33 Definition. Let f : Rn × Rm → R with values f (x, u). The function-valued


mapping u → f (·, u) is said to be epi-continuous at ū iff for every sequence uν → ū,
the sequence of functions f ν = f (·, uν ) epi-converges to the function f (·, ū).

A.34 Theorem. Let f : Rn × Rm → R be proper and l.s.c. (on Rn × Rm ) with


f (x, u) level-bounded in x locally uniformly in u. Consider p(u) = inf x f (x, u) and
P (u) = argminx f (x, u).
i. The function p is proper and l.s.c. (on Rm ).
ii. (Generalization of Theorem A.9.) The set-valued mapping P assigns to each u ∈
dom p a nonempty compact set, and is empty-valued for each u ∉ dom p.
iii. If a sequence of points uν ∈ dom p converges to ū ∈ dom p in the p-attentive sense,
then any sequence {xν }ν∈N with xν ∈ P (uν ) is bounded and has its cluster points
in P (ū), that is (given that P (ū) is bounded),

∅ ≠ lim sup P (u) ⊂ P (ū) , the outer limit being taken over u →dom p ū with p(u) → p(ū).

iv. If p is continuous at ū relative to a set U with ū ∈ U and P (ū) ≠ ∅, then any
sequence {xν }ν∈N with xν ∈ P (uν ) is bounded and has its cluster points in P (ū),
that is, lim sup u→U ū P (u) ⊂ P (ū) with P (ū) ⊂ B for some bounded set B.
v. For p to be continuous at ū relative to U containing ū, a sufficient condition is the
existence of some x̄ ∈ P (ū) such that f (x̄, u) is continuous in u at ū relative to U
(Rockafellar and Wets, 1998, Theorem 1.17(c)).
vi. For p to be continuous at ū relative to U containing ū, another sufficient condition
is to have the function-valued mapping u 7→ f (·, u) epi-continuous at ū relative
to U (Rockafellar and Wets, 1998, Theorem 7.31(b)).

A.10 Convexity

We start with the notion of convexity for subsets of Rn .

A.35 Definition. A subset C of Rn is convex if for all points x, y ∈ C and for 0 < λ < 1,
the points (1 − λ)x + λy are in C.

Convex extended-real-valued functions are defined as follows.

A.36 Definition. A function f : Rn → R is convex iff its epigraph epi f is convex in


Rn × R.

We recall for comparison the definition of convexity for real-valued functions (Rockafellar,
1970, page 10): A function f from a convex set C ⊂ Rn to R is convex iff for all points
x, y ∈ C and for 0 < λ < 1, it holds that f ((1 − λ)x + λy) ≤ (1 − λ)f (x) + λf (y). Note
also that f is said to be strictly convex iff for all x, y ∈ C, x ≠ y, and for 0 < λ < 1,
it holds that f ((1 − λ)x + λy) < (1 − λ)f (x) + λf (y).

Convex functions are continuous at least on the interior of their effective domain
(Rockafellar and Wets, 1998, Theorem 2.35):

A.37 Proposition. Let f : Rn → R be convex. Then f is continuous on int(dom f ). If


f is also l.s.c., f is continuous relative to the convex hull of any finite subset of dom f .

The last implication means that for every integer m and sets X(m) = {x ∈ Rn : x =
Σk=1,...,m λk xk with λk > 0, Σk=1,...,m λk = 1, xk ∈ dom f }, it holds that xν →X(m) x̄ implies
f (xν ) → f (x̄).
Proposition A.37 also covers the result that lower semicontinuous convex functions
whose effective domain is a polytope are continuous on their domain (Gale et al., 1968).
Convex functions can have discontinuities on the boundary of their effective domain,
even if they are also l.s.c.:

A.38 Proposition. Let f : Rn → R be convex, proper, and lower semicontinuous.
i. f may fail to be continuous relative to a compact subset of dom f (Rockafellar and
Wets, 1998, Example 2.38).
ii. However, if n = 1, f : R → R is continuous relative to the closure of its domain
(Rockafellar and Wets, 1998, Corollary 2.37).

A sequence of convex functions can epi-converge under favorable circumstances. The


following theorem is taken from Rockafellar and Wets (1998, Theorem 7.17(c)):

A.39 Theorem. Let {f ν }ν∈N be a sequence of convex functions f ν : Rn → R. If
the functions f ν converge uniformly to f on every compact set that does not meet the
boundary of dom f , with f convex, l.s.c., and with dom f having a nonempty interior, then
f ν epi-converges to f .

A.11 Lipschitz Continuity

A function f : Rn → Rm is said to be Lipschitz continuous if there exists a finite constant
κ ≥ 0 such that

||f (x) − f (x′ )|| ≤ κ||x′ − x|| for all x, x′ ∈ Rn .

For extended-real-valued functions, it makes sense to focus on Lipschitz continuity
properties locally (Rockafellar and Wets, 1998, page 350).

A.40 Definition. Let f : Rn → R be an extended-real-valued function and let X be an open
subset of Rn containing a point x̄. Then, the Lipschitz modulus of f at x̄ relative to X
is defined as the value

lipX f (x̄) = lim sup x,x′ →X x̄, x≠x′  |f (x) − f (x′ )| / ||x − x′ || ,

where by convention |f (x) − f (x′ )| = ∞ if f (x) or f (x′ ) (or both) is infinite. The
function f is said to be strictly continuous (or locally Lipschitz continuous) at x̄
relative to X if lipX f (x̄) is finite, and f is said to be strictly continuous relative to X if
it has that property at each x̄ ∈ X. The mention of X is omitted when X = int dom f .
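For instance (our own illustration), for f (x) = x2 on R one gets |f (x) − f (x′ )|/|x − x′ | = |x + x′ |, so that lip f (x̄) = 2|x̄| at every point: the function is strictly continuous everywhere although no single Lipschitz constant works globally.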
Appendix B

Basic Probability Theory

This appendix presents standard material from measure and probability theory (Billings-
ley, 1995; Pollard, 1990).
The definitions and results collected in the present appendix are relevant to this thesis
inasmuch as stochastic programming fundamentally deals with randomness. As random
objects more general than random vectors are required in the context of stochastic pro-
gramming, we define them here, using a formalism based on set-valued mappings (see
Definition B.8), following Rockafellar and Wets (1998).
The appendix is organized as follows. Section B.1 defines the notions of sigma-algebra
and probability space. Section B.2 defines random variables and random vectors. Section
B.3 defines random sets. Section B.4 defines random functions. Section B.5 defines the
expectation, including the treatment of extended-real-valued functions (Rockafellar and
Wets, 1998, Chapter 14). Section B.6 defines distributions and cumulative distribution
functions.

B.1 The Probability Space

The probability space is made of three elements: the sample space, the sigma-algebra,
and the probability measure.

B.1 Definition. The sample space Ω is an arbitrary nonempty set.

An element of Ω is denoted by ω. Often the sample space is interpreted as an arbitrary


space or set of points consisting of all the possible results or outcomes of an experiment
(Billingsley, 1995, page 17).
Then, a collection of subsets of Ω is identified for the purpose of performing set
operations involving limits along sequences of sets. In essence, admissible collections are
those that are closed under countable set operations.

B.2 Definition. A sigma-algebra B of subsets of a space Ω is a class of subsets of Ω


(a collection of sets B ⊂ Ω) such that
i. The set Ω belongs to B;
ii. If a set B is in B, then its complement B c = Ω \ B is in B;
iii. For a countable collection {B ν }ν∈N of sets B ν in B, the union ∪ν∈N B ν is in B.

The empty set Ωc = ∅ and the sample space Ω are always in B by definition. Countable
intersections of sets in B are also in B, since (∪ν B ν )c = ∩ν (B ν )c and (∩ν B ν )c = ∪ν (B ν )c
(De Morgan’s laws of set theory).
Sigma-algebras can be constructed from an initial class B0 of subsets of interest.

B.3 Definition. The sigma-algebra generated by B0 is the intersection of all the sigma-
algebras that contain the class B0 .

Note that there are often several ways of generating the same sigma-algebra.
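For instance (a small example of our own), take Ω = {1, 2, 3, 4} and B0 = {{1, 2}}: the sigma-algebra generated by B0 is {∅, {1, 2}, {3, 4}, Ω}, and the same sigma-algebra is also generated by {{3, 4}} or by {{1, 2}, {3, 4}}.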
Examples of useful sigma-algebras are given below.

• The trivial sigma-algebra is the smallest possible sigma-algebra, made of the two
sets ∅ and Ω.

• The Borel sigma-algebra of the unit interval B((0, 1]) is the sigma-algebra generated
by the class of subintervals of (0, 1] of the form I = (a, b] with 0 < a < b ≤ 1. In
fact, the sigma-algebra generated by a countable number of subintervals (a, b] with
a, b restricted to rational numbers and 0 < a < b ≤ 1 can also be shown to coincide
with B((0, 1]).

• The Borel sigma-algebra B(R) is the sigma-algebra generated by the class of inter-
vals I = (a, b] of R. It can also be generated by the class of intervals (−∞, t], t ∈ R.
When we consider measurable functions that may take values in the extended
real line R ∪ {±∞}, we consider the Borel sigma-algebra on the extended real line,
generated by the sets in B(R) together with the two sets {−∞} and {+∞}, or alternatively
by intervals of the form (t, +∞], [−∞, t), t ∈ R.

• The k-dimensional Borel sigma-algebra B(Rk ) is the sigma-algebra generated by


the class of bounded rectangles {x = (x1 , . . . , xk ) ∈ Rk : ai < xi ≤ bi , i = 1, . . . , k}.

In general, Borel sigma-algebras can be generated from all the open subsets of a topo-
logical space, or alternatively, from all the closed subsets of the topological space.
The elements of B(Rk ) are called Borel sets, without mention to Rk when k is clear
from the context.
A set B of a sigma-algebra B is said to be B-measurable. In the context of probability
theory, a set B ∈ B is referred to as an event.
Probabilities can be assigned to events by the means of a probability measure.

B.4 Definition. A measure on a sigma-algebra B is an extended-real-valued function µ


defined on the class B of subsets of Ω such that
i. (Nonnegativity.) 0 ≤ µ{B} ≤ ∞ for every B ∈ B;
ii. µ{∅} = 0;
iii. (Countable additivity.) For any countable collection {B ν }ν∈N of sets B ν ∈ B with
B i ∩ B j = ∅ if i ≠ j, µ{ ∪ν∈N B ν } = Σν∈N µ{B ν }.

A probability measure P on a sigma-algebra B is a measure P with P{Ω} = 1.

A measurable space is a pair (Ω, B) with Ω a nonempty set and B a sigma-algebra
on Ω. A measure space is a triple (Ω, B, µ) with Ω a nonempty set, B a sigma-algebra
on Ω, and µ a measure on B. A probability space is a measure space (Ω, B, P) with P
a probability measure.
Some sets are special in the context of a given probability space (Ω, B, P). A set B
is said to be a support of P if B is in B with P{B} = 1 (Billingsley, 1995, page 23).
A set B is said to be P-negligible if B is in B and P{B} = 0. When a property holds
for all ω on a set B with P{B} = 1, the property is said to hold with probability 1, or
almost surely.

B.5 Definition. A probability space (Ω, B, P) or a sigma-algebra B is said to be com-


plete for the probability measure P if any subset of a P-negligible set is also in B (and
hence P-negligible).

A measurable space can be made complete relative to a measure P by enlarging its


sigma-algebra (Pollard, 2001, Definition 2.27):

B.6 Definition. The P-completion of a sigma-algebra B is the class of sets B for which
there exist sets B0 , B1 ∈ B with B0 ⊂ B ⊂ B1 and P{B1 \ B0 } = 0.

Measurable spaces can be combined together.

B.7 Definition. Let (Ω1 , B1 ) and (Ω2 , B2 ) be measurable spaces.


i. The product space of Ω1 and Ω2 , denoted Ω1 × Ω2 , is the set of all pairs (ω1 , ω2 )
with ωi ∈ Ωi (i = 1, 2).
ii. The product sigma-algebra on Ω1 × Ω2 , denoted B1 ⊗ B2 , is the sigma-algebra
generated by the collection of sets of the form B1 × B2 = {(ω1 , ω2 ) ∈ Ω1 × Ω2 :
ωi ∈ Bi (i = 1, 2)} with Bi ∈ Bi (i = 1, 2).

B.2 Random Variables

Let us first describe notions relative to set-valued mappings.

B.8 Definition. A set-valued mapping F : X ⇒ Y assigns to each element x of X


one or more elements of Y , or possibly none. The set of elements y ∈ Y assigned by F
to x is denoted by F (x) (Dontchev and Rockafellar, 2009, page 2, exact citation).
i. The domain of F : X ⇒ Y is the set dom F = {x ∈ X : F (x) ≠ ∅}.
ii. The range of F is the set range F = {y ∈ Y : y ∈ F (x) for some x ∈ X}.
iii. The inverse mapping F −1 is the set-valued mapping F −1 : Y ⇒ X defined by

F −1 (y) = {x ∈ X : y ∈ F (x)} .

iv. The image of a set B ⊂ X by F is the set

F (B) = ∪x∈B F (x) = {y ∈ Y : F −1 (y) ∩ B ≠ ∅} .

v. The inverse image of a set C ⊂ Y (or pre-image of C) by F is the set

F −1 (C) = ∪y∈C F −1 (y) = {x ∈ X : F (x) ∩ C ≠ ∅} .

There is a one-to-one correspondence between set-valued mappings from X to Y and


subsets of X × Y , by identifying a set-valued mapping F to its graph gph F = {(x, y) ∈
X × Y : y ∈ F (x)}. Note that the projection (x, y) 7→ x maps gph F to dom F , that
(x, y) 7→ y maps gph F to range F , and that (x, y) 7→ (y, x) maps gph F to gph F −1 .
In the standard usage, the word function designates a relation f that assigns to
each element x of a subset dom f of X, one element y of a subset range f of Y , written
y = f (x), and is undefined on X \ dom f . The inverse of a function f is not defined
on Y \ range f . There is however a one-to-one correspondence between functions and
set-valued mappings, in the sense that to each function f can be associated a set-valued
mapping F , that is single-valued on dom f with values F (x) = {y}, and empty-valued
on X \ dom f .
In the sequel, we adopt a compromise that consists in calling indifferently mapping or
function the relation F : X → Y that assigns to each element x of a subset dom F ⊂ X,
one element y of a subset range F of Y , written y = F (x), and is empty-valued on
X \ dom F . This is because the functions F : X → Y considered in the sequel are defined
on the full space X anyway. The inverse mapping of F , written F −1 is defined on the
full space Y as a set-valued mapping.
Let us now consider the probability space (Ω, B, P), and a measurable space (Ω′ , C).
We view a random variable with values in Ω′ as a mapping F : Ω → Ω′ . Mappings of
interest are those for which the pre-image of every element of C is in B.

B.9 Definition. Let (Ω, B) and (Ω′ , C) be measurable spaces. A mapping F : Ω → Ω′
is said to be B/C-measurable if for each C ∈ C, the pre-image F −1 (C) is B-measurable.

In practice, it is not necessary to check that the pre-image of every element of the
sigma-algebra C is in B: checking the condition for a class of subsets generating the
sigma-algebra C is sufficient (Billingsley, 1995, Theorem 13.1(i)):

B.10 Theorem. Let F : Ω → Ω′ be a mapping with Ω equipped with a sigma-algebra
B and Ω′ equipped with a sigma-algebra C. If a class C0 generates C, and if for every
C0 ∈ C0 , the inverse image F −1 (C0 ) is in B, then F is B/C-measurable.

When Ω′ = R with C the Borel sigma-algebra B(R), the B/C-measurable mapping is
a real-valued mapping corresponding to a real-valued random variable.

B.11 Definition. A real-valued random variable f on the probability space (Ω, B, P) is


a real-valued mapping from Ω to R that is B/B(R)-measurable.

Similarly, extended-real-valued random variables on (Ω, B, P) correspond to mappings


f : Ω → R that are B/B(R)-measurable.
Random variables with values in Rk are called random vectors in Billingsley (1995).

B.12 Definition. A k-dimensional random vector f on the probability space (Ω, B, P)


is a B/B(Rk )-measurable mapping from Ω to Rk .

In fact, a random vector turns out to be simply a k-tuple f = (f1 , . . . , fk ) of one-


dimensional random variables, since it can be shown that the random vector f (ω) =
(f1 (ω), . . . , fk (ω)) is B/B(Rk )-measurable if and only if its components fi (ω) are B/B(R)-
measurable (Billingsley, 1995, page 183).

Random variables can also be defined as functions of other random variables, not
necessarily defined on the same measurable space. For that purpose, the following result
is useful (Billingsley, 1995, Theorem 13.1(ii)).

B.13 Theorem. If F : Ω → Ω′ is B/B′-measurable and G : Ω′ → Ω′′ is B′/B′′-measurable,
then the composed mapping G ◦ F from Ω to Ω′′ is B/B′′-measurable.

Sometimes one considers the random variables first, and then generates the sigma-
algebras in such a way that the random variables of interest are measurable.

B.14 Definition. The sigma-algebra generated by a collection of random variables is


the intersection of all sigma-algebras with respect to which each random variable is
measurable.

That is, the sigma-algebra generated by f1 , . . . , fk , with fi a mapping from Ω to a space Ω′i
equipped with a sigma-algebra Ci , is defined as the sigma-algebra generated by the class
of sets {fi−1 (C) : C ∈ Ci , i = 1, . . . , k}.
Random variables measurable with respect to the sigma-algebra generated by a collec-
tion of random variables are equivalent to functions of those random variables (Billingsley,
1995, Theorem 20.1):

B.15 Theorem. Let f = (f1 , . . . , fk ) be a k-dimensional random vector.


i. The sigma-algebra generated by f1 , . . . , fk consists of the sets {f ∈ H} for H ∈
B(Rk ).
ii. A random variable h is measurable with respect to the sigma-algebra generated by
f1 , . . . , fk iff there exists a measurable mapping g from Rk to R such that for all ω,
h(ω) = g(f1 (ω), . . . , fk (ω)).

(For brevity, we also allow ourselves to say that h is measurable with respect to f 1 , . . . , fk
when h is measurable with respect to the sigma-algebra generated by f1 , . . . , fk .)
Consider again the collection of random variables f1 , . . . , fk where fi is a B/Ci -
measurable mapping from Ω to a space Ω′i equipped with the sigma-algebra Ci . Let F0
denote the trivial sigma-algebra, and let Fi denote the sigma-algebra generated by the
subcollection {f1 , . . . , fi } of random variables. By definition, it holds that Fi ⊆ Fj for
0 ≤ i < j ≤ k.

B.16 Definition. The natural filtration associated to a sequence f 1 , . . . , fk of random


variables is the family {Fi : i = 0, . . . , k} of sub-sigma-algebras of B generated by the
growing subcollections {f1 , . . . , fi } of random variables.

The natural filtration represents the growing information on ω obtained by considering


growing collections of B-measurable random variables. Note from Theorem B.15 that
adding to a collection f1 , . . . , fi a random variable fi+1 which is merely a function of
f1 , . . . , fi will not alter the sub-sigma-algebra generated by the collection, that is, will
not refine the available information on ω.

B.3 Random Sets

Random sets are defined as measurable set-valued mappings. We will only consider
mappings to closed subsets of Rn .

B.17 Definition. Let (Ω, B) be a measurable space. A set-valued mapping F : Ω ⇒
Rn is measurable if for every open set C ⊂ Rn , the pre-image F −1 (C) is in B. If
F is closed-valued (the sets F (ω) are closed), an equivalent measurability condition is
F −1 (C) ∈ B for every closed set C (Rockafellar and Wets, 1998, Theorem 14.3(b)).

The purpose of a measurable selection, defined below, is to reduce a measurable set-


valued mapping to a measurable (single-valued) mapping. This is useful, for instance,
for selecting a single optimal solution from a set of optimal solutions to a parametric
optimization program. Other examples of selections that are frequently met in practice
are the particular choices of matrix pseudo-inverses for returning a single solution to
an underdetermined system of equations. For a particular example in the thesis, see
Example 6.1.
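As a concrete sketch of such a selection (our own toy system, unrelated to Example 6.1), the Moore-Penrose pseudo-inverse picks, out of the set-valued solution mapping of an underdetermined linear system, the single least-norm solution:

    import numpy as np

    # For A x = b with more unknowns than equations, {x : A x = b} is a
    # set-valued mapping of (A, b); numpy's pinv selects the least-norm element.
    A = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, 1.0]])
    b = np.array([1.0, 2.0])
    x_sel = np.linalg.pinv(A) @ b
    print(x_sel, np.allclose(A @ x_sel, b))   # one element of the solution set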
We use the following non-standard definition of the measurable selection to avoid
issues posed by set-valued mappings that are empty-valued in some regions of the sample
space (see Remark B.1 below).

B.18 Definition. Given a measurable set-valued mapping F : Ω ⇒ Rn , a measur-


able selection for F is a measurable set-valued mapping f : Ω ⇒ Rn which is single-
valued with values f (ω) ∈ F (ω) whenever F (ω) is nonempty, and empty-valued whenever
F (ω) = ∅.
If F (ω) is nonempty for all ω ∈ Ω, the measurable selection can be defined more
simply as a measurable function f : Ω → Rn with values f (ω) ∈ F (ω) for all ω ∈ Ω.

A measurable closed-valued mapping always admits a measurable selection (Rockafellar


and Wets, 1998, Corollary 14.6).

Remark B.1. It is not clear to us whether the empty set is considered in Rockafellar
and Wets as an admissible value for a closed-valued mapping, and how the selection
defined as a function can handle that case. The definition of the selection has
been changed in Dontchev and Rockafellar (2009, page 49) to allow for a local
definition, but a local definition on the subset Ω0 of Ω where F is not empty-valued
is not desirable for a measurable selection, which should be defined on the full
sample space Ω. Aubin and Frankowska (1990, Theorem 8.1.3) avoid the issue
by dealing only with non-empty-closed-valued measurable mappings F , but this
choice rules out the use of a measurable selection for selecting an optimal solution
to a parametric optimization program which is infeasible in some region of the
parameter space.

There is a correspondence between the measurability of mappings and the measura-


bility of their associated graph (Rockafellar and Wets, 1998, Theorem 14.8).

B.19 Theorem. Let (Ω, B, P) be a probability space. Let F : Ω ⇒ Rn be closed-valued.


If the probability space is complete, the 3 following properties are equivalent.
i. The set-valued mapping F is measurable;
ii. For any set C ∈ B(Rn ), the pre-image F −1 (C) is in B;
iii. gph F is a B ⊗ B(Rn )-measurable subset of Ω × Rn .

B.4 Random Functions

Random functions can be interpreted as mappings from Ω × Rn to R: for every ω ∈ Ω,


there is a function f (ω, ·) from Rn to R. They can also be defined using their epigraph
representation: for every ω ∈ Ω, there is a subset epi f (ω, ·) of Rn+1 . In the sequel, we
will only consider l.s.c. random functions. Recall that the epigraph of a l.s.c. function is
a closed set (Proposition A.12).
If x is a random variable with values x(ω) and f is a random l.s.c. function with “val-
ues” f (ω, ·), where f (ω, ·) is a l.s.c. function, then f (·, x(·)) defines a random variable
with values f (ω, x(ω)) (a measurable mapping from Ω to R) if f satisfies suitable mea-
surability conditions. These conditions are given below: f has to be a normal integrand
(Rockafellar and Wets, 1998, Definition 14.27).

B.20 Definition. Let (Ω, B, P) be a probability space. Let f : Ω × Rn → R be a random


function with associated domain and epigraph

Df (ω) = dom f (ω, ·) = {x ∈ Rn : f (ω, x) < ∞}


Sf (ω) = epi f (ω, ·) = {(x, α) ∈ Rn × R : f (ω, x) ≤ α} .

If Sf : Ω ⇒ Rn × R is closed-valued and measurable (as a set-valued mapping defined on


Ω × Rn ), the function f is said to be a normal integrand (Rockafellar and Wets, 1998,
Definition 14.27).

The following results are taken from Rockafellar and Wets (1998, Propositions 14.28,
Theorem 14.37).

B.21 Proposition. Let (Ω, B, P) be a probability space and f : Ω × Rn → R a random


function with domain Df and epigraph Sf . If f is a normal integrand, then Df : Ω ⇒ Rn
is measurable (as a set-valued mapping), f is l.s.c. in x ∈ Rn for each fixed ω ∈ Ω, and f
is B-measurable in ω ∈ Ω for each fixed x ∈ Rn . In addition, the random variable defined
by ω 7→ f (ω, x(ω)), where x is B/B(Rn )-measurable, is itself B/B(R)-measurable.

B.22 Theorem. Let f : Ω × Rn → R be a normal integrand. Then for p(ω) = inf f (ω, ·)
and P (ω) = argmin f (ω, ·), it holds that the function p : Ω → R is measurable, the
mapping P : Ω ⇒ Rn is closed-valued and measurable, and in particular P admits a
measurable selection.

Examples of normal integrands are recorded below (Rockafellar and Wets, 1998, Ex-
amples 14.29, 14.31, 14.32; Proposition 14.39; Exercise 14.55), whereas Theorem B.19
gives the general measurability condition for set-valued mappings that characterizes nor-
mal integrands.

B.23 Proposition. Let f : Ω × Rn → R be a random function with domain Df (ω). If


any of the following conditions hold, f is a normal integrand.
i. (Carathéodory integrands.) f (ω, x) is finite-valued, measurable in ω for each x,
and continuous in x for each ω.
ii. There is a Carathéodory integrand f0 : Ω × Rn → R and a closed-valued measurable
mapping C : Ω ⇒ Rn such that f(ω, x) coincides with f0(ω, x) if x ∈ C(ω) and
f(ω, x) = ∞ if x ∉ C(ω).

iii. (Jointly l.s.c. functions.) Ω is a Borel subset of Rd and f is l.s.c. (over Ω × Rn ).


iv. (Convex integrands.) f (ω, ·) is l.s.c. and convex (over Rn ) for each ω, the interior
of Df (ω) is nonempty whenever Df (ω) is nonempty, and f (ω, x) is measurable in ω
for each x.
v. (Simple integrands.) The range of Df and the range of f are finite (this holds true
in situations where the set of feasible x given ω is finite and Ω is finite).

B.5 Expectation

Let (Ω, B, µ) be a measure space. Let M denote the class of all B/B(R)-measurable
mappings from Ω to R, and let M+ denote the class of all nonnegative mappings in M.
The expectation (or expected value) of nonnegative random variables is defined through
the integral (Billingsley, 1995, Equation 15.3).

B.24 Definition. The integral of a function f ∈ M+ on a measure space (Ω, B, µ) is


∫ f dµ = sup Σ_ν [ inf_{ω ∈ B^ν} f(ω) ] µ(B^ν)

where the supremum is over the partitions {B ν }1≤ν≤N of Ω with N finite and B ν ∈ B.
The expectation of a random variable f ∈ M+ on a probability space (Ω, B, P) is
(setting µ to P)
E{f} = ∫ f dP .
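
As a quick illustration (a standard special case, recorded here only for concreteness), if f = Σ_{ν=1}^{N} c^ν I_{B^ν} is a simple nonnegative function, with c^ν ≥ 0 and {B^ν}_{1≤ν≤N} a partition of Ω in B, the supremum in the definition is attained by that partition itself, since f is constant on each B^ν; hence

∫ f dµ = Σ_{ν=1}^{N} c^ν µ(B^ν) ,   and in particular   E{f} = Σ_{ν=1}^{N} c^ν P{B^ν} .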

The expectation of nonnegative random variables can also be defined through prop-
erties (Pollard, 2001, Theorem 2.12). For a set B ∈ B, let IB ∈ M+ denote the indicator
function of B defined by IB(ω) = 1 if ω ∈ B and IB(ω) = 0 if ω ∉ B.

B.25 Definition. For each probability measure P on the measurable space (Ω, B), there
is a functional E from M+ to [0, ∞] uniquely determined by the following properties.
i. E{IB } = P{B} for all B ∈ B;
ii. E{0} = 0 where the zero of the left-hand side denotes a zero-valued measurable
mapping;
iii. For α, β ≥ 0 and f, g ∈ M+ , E{αf + βg} = αE{f } + βE{g};
iv. If f, g are in M+ and f (ω) ≤ g(ω) for almost all ω ∈ Ω, then E{f } ≤ E{g};
v. (Monotone convergence.) For a sequence {f ν }ν∈N of functions f ν ∈ M+ , if
f ν (ω) → f (ω) with f ν (ω) ≤ f ν+1 (ω) for almost all ω ∈ Ω, then E{f ν } → E{f }
with E{f ν } ≤ E{f ν+1 }.

The expectation of an extended-real-valued random variable f is obtained by decom-


posing f into its positive and negative parts f + , f − ∈ M+ .

B.26 Definition. Let f : Ω → R be an extended-real-valued random variable. Define


f + (ω) = max{0, f (ω)} and f − (ω) = max{0, −f (ω)}. If E{f + } and E{f − } are not both
infinite, the expectation of f is said to be well-defined, and

E{f } = E{f + } − E{f − } .



With this definition, E{αf + βg} = αE{f} + βE{g} for α, β in R, provided that the situation ∞ − ∞ is avoided.

The expectation has additional properties.

B.27 Proposition. Let {f ν }ν∈N be a sequence of functions in M.


vi. (Fatou’s lemma.) E{lim inf ν f ν } ≤ lim inf ν E{f ν }.
vii. (Dominated convergence.) If f ν (ω) → f (ω) for almost all ω and if there is some
g ∈ M such that |f ν (ω)| ≤ g(ω) for all ν and almost all ω with E{g} < ∞, then
E{f ν } → E{f } with E{f ν } finite and E{f } finite.
viii. (Uniform integrability.) If f ν (ω) → f (ω) for almost all ω and if the sequence is
uniformly integrable, in the sense that supν E{|f ν |I{|f ν |>α} } tends to 0 as α → ∞,
then E{f ν } → E{f }.

Classes of random variables with finite expectations define particular spaces of mea-
surable functions (Pollard, 2001, Section 2.7).

B.28 Definition. Let (Ω, B, P) be a probability space. For 1 ≤ p < ∞, consider the
space ℒp(Ω, B, P) of functions f ∈ M such that E{|f|^p} is finite. For p = ∞, consider
the space ℒ∞(Ω, B, P) of functions f ∈ M for which the essential supremum inf{α ∈ R :
P{ω : |f(ω)| > α} = 0} is finite. Then, the Lebesgue space Lp(Ω, B, P) (1 ≤ p ≤ ∞)
is defined as the space of equivalence classes of functions [f] = {g ∈ ℒp(Ω, B, P) : g =
f P-almost surely}.

To each element f of the space ℒp(Ω, B, P) can be associated the real number ||f||p =
(E{|f|^p})^{1/p}. The reduction to equivalence classes of functions is made so that in
Lp(Ω, B, P), ||f − g||p = 0 entails f = g. (|| · ||p is a semi-norm for ℒp(Ω, B, P) and
a norm for Lp(Ω, B, P): see Definition C.3.)

B.29 Definition. Let ℒp(Ω, B, P; Rn) be the space of B-measurable mappings x : Ω →
Rn such that the Euclidean norm mapping ω ↦ ||x(ω)|| is in ℒp(Ω, B, P). Then, the
Lebesgue space Lp(Ω, B, P; Rn) (1 ≤ p ≤ ∞) is defined as the space of equivalence
classes of functions [f] = {g ∈ ℒp(Ω, B, P; Rn) : g = f P-almost surely}.

Now we turn our attention to expectations over random functions. The expectation
over a random function is well-defined for normal integrands (Rockafellar and Wets, 1998,
Proposition 14.58):

B.30 Proposition. Let (Ω, B, P) be a probability space. Let X denote a space of


B/B(Rn )-measurable mappings, and let x : Ω → Rn be a mapping in X . If the ran-
dom function f : Ω × Rn → R is a normal integrand, the functional Ef from X to R
given by
Ef [x] = E{f (ω, x(ω))}
is well-defined, under the additional convention (in a context of minimization) that
E{f (ω, x(ω))} = ∞ if E{f + (ω, x(ω))} = ∞. When Ef [x] is finite, it holds that x(ω) lies
in dom f (ω, ·) almost surely.

An important theorem identifies conditions under which the infinite-dimensional min-


imization of Ef [x] over x ∈ X reduces to a minimization of f (ω, ·) for each ω (Rockafellar
and Wets, 1998, Theorem 14.60). The result makes use of a property possessed by certain
spaces X of measurable functions (Rockafellar and Wets, 1998, Definition 14.59).

B.31 Definition. A space X of measurable functions x : Ω → Rn is decomposable


relative to a measure µ if for every function x0 ∈ X , every set B ∈ B with µ(B) finite
and every bounded measurable function x1 : Ω → Rn , X contains the function x that
coincides with x0 on Ω \ B and coincides with x1 on B.

The Lebesgue spaces Lp (Ω, B, P; Rn ) are decomposable, whereas the space of constant-
valued functions and the space of continuous mappings f : Ω → Rn are not decomposable
relative to most measures P (Rockafellar and Wets, 1998, page 677).

B.32 Theorem. Let (Ω, B, P) be a probability space. Let X be a space of measurable


functions x : Ω → Rn decomposable relative to P. Let f : Ω × Rn → R be a normal
integrand. Then, as long as Ef[x] ≢ ∞,

inf_{x∈X} E{f(ω, x(ω))} = E{ inf_{x∈R^n} f(ω, x) } ,

and as long as inf_{x∈X} E{f(ω, x(ω))} > −∞, it holds that x̄ ∈ X is in argmin_{x∈X} Ef[x]
iff x̄(ω) is in argmin_{x∈R^n} f(ω, x) for P-almost every ω ∈ Ω.
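
To make the role of decomposability concrete, the following small numerical sketch (illustrative toy data only; numpy is assumed available) checks the interchange on a finite probability space, where the space of all scenario-indexed decisions is decomposable, and compares it to the space of constant decisions, which is not decomposable.

import numpy as np

# A finite probability space with three scenarios (illustrative data only).
p  = np.array([0.2, 0.5, 0.3])    # probabilities P{omega}
xi = np.array([1.0, 2.0, 4.0])    # scenario-dependent parameter

def f(w, x):
    # A convex normal integrand in x for each scenario w (made-up example).
    return (x - xi[w]) ** 2 + 0.1 * np.abs(x)

xgrid = np.linspace(-5.0, 5.0, 20001)
vals = np.array([f(w, xgrid) for w in range(3)])      # shape (3, len(xgrid))

# Right-hand side of Theorem B.32: E{ inf_x f(omega, x) }.
rhs = float(p @ vals.min(axis=1))

# Left-hand side: inf over scenario-indexed decisions x(omega).  Since the
# decision may differ across scenarios, the expectation decouples and the
# pointwise minimizers (a measurable selection) attain the infimum.
xbar = xgrid[vals.argmin(axis=1)]
lhs = float(sum(p[w] * f(w, xbar[w]) for w in range(3)))

# Restricting to constant decisions (a non-decomposable space) can only
# increase the value.
const = float((p @ vals).min())

print(lhs, rhs, const)    # lhs == rhs <= const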

B.6 Distributions

Let (Ω, B, P) be a probability space and let x : Ω → Rm be a B/B(Rm )-measurable


mapping.

B.33 Definition. The distribution of a random vector x : Ω → Rm is the mapping µ


from the Borel sets B ∈ B(Rm ) to the interval [0, 1], with values

µ(B) = P{x(ω) ∈ B} .

The support of the distribution of x, also referred to as the support of x, is defined as


the smallest closed set B (with respect to the set-inclusion ordering) such that µ(B) = 1.
The cumulative distribution function (cdf) of x is the mapping F : Rm → R with
values

F (t) = P{xi (ω) ≤ ti , i = 1, . . . , m} .

For a real-valued random variable x : Ω → R, the corresponding distribution function


F has an integral representation
F(t) = ∫_{−∞}^{t} f(s) ds

if and only if F is absolutely continuous (Billingsley, 1995, Theorem 31.8), in the following
sense (Billingsley, 1995, Equation 31.28):

B.34 Definition. A function F : R → R is absolutely continuous if for each ε > 0, there
is some δ > 0 such that the following condition holds: for each collection of k intervals
[a_i, b_i] with disjoint interiors,

Σ_{i=1}^{k} |F(b_i) − F(a_i)| < ε   if   Σ_{i=1}^{k} (b_i − a_i) < δ .
Appendix C

Elements of Functional Analysis for Kernel


Methods

This appendix presents results from functional analysis useful in the theory of kernel
methods. We use kernels or kernel-based methods in several places in the thesis (Chapters
3, 5).
The appendix is organized as follows. Section C.1 defines Hilbert spaces. Section C.2
defines continuous linear mappings. Section C.3 defines reproducing kernels, positive def-
inite kernels, and reproducing kernel Hilbert spaces. Section C.4 gives the interpretation
of positive definite kernels as generalized inner products.

C.1 Hilbert Spaces

C.1 Definition. Let F be a nonempty set. A metric for F is a function d : F × F → R


with the following properties (where f, g, h ∈ F ):
i. d(f, g) ≥ 0 with d(f, g) = 0 iff f = g
ii. d(f, g) = d(g, f )
iii. (Triangle inequality.) d(f, h) ≤ d(f, g) + d(g, h).

A metric space (F, d) is defined as a nonempty set F equipped with a metric d.

In a metric space (F, d), the distance from an element f ∈ F to a set C ⊂ F is


given by d(f, C) = inf g∈C d(f, g), with d(f, ∅) = ∞. We say that a metric space (F, d)
is separable if F has a dense countable subset, in the sense that there exists a set
Q = {q^ν}_{ν∈N} of elements of F such that for all ε > 0 and f ∈ F, d(f, Q) < ε.
A sequence {f^ν}_{ν∈N} in a metric space (F, d) is a Cauchy sequence if for each ε > 0,
there is some N ∈ N∞ such that d(f^µ, f^ν) < ε when µ, ν ∈ N. We say that {f^ν}_{ν∈N}
converges [strongly] to some limit point f if limν→∞ d(f ν , f ) = 0. We denote the
limit by s-limν→∞ f ν = f and write f ν → f .

C.2 Definition. A metric space (F, d) is complete if every Cauchy sequence in it


converges to an element of F .

If (F, d) is a complete metric space, then a set C is closed when d(f, C) = 0 entails
f ∈ C.
Recall that a linear space F over a field K is a set F for which the addition (+) of two
elements f, g ∈ F and the multiplication (·) of an element f ∈ F by a scalar α ∈ K obey

the standard rules of algebra (commutativity, associativity, distributivity), with f + g


and α · f being also elements of F (Yosida, 1980, Section 0.4). For the properties of fields
we refer to Rudin (1976, Definition 1.12). A linear space is called a real linear space if
K = R. A linear space is called a complex linear space if K = C.

C.3 Definition. A linear space F over a field K is a normed linear space if to every
element f ∈ F is associated a real number ||f ||, called the norm of f , with the following
properties (where f, g ∈ F and α ∈ K):
i. ||f || ≥ 0 with ||f || = 0 iff f = 0
ii. (Subadditivity.) ||f + g|| ≤ ||f || + ||g||
iii. ||αf || = |α| · ||f ||.

A normed linear space is a pre-Hilbert space if its norm also satisfies


iv. ||f + g||2 + ||f − g||2 = 2(||f ||2 + ||g||2 ) .

In a normed linear space F , the function d(f, g) = ||f − g|| is a metric for F .
In Definition C.3, if ||f || satisfies only the conditions ii and iii (which imply ||f || ≥ 0),
then ||f || is called a semi-norm. If ||f || satisfies conditions i, ii, and instead of condition iii
the weaker set of conditions
iii’. || − f || = ||f || ,
αν → 0 entails ||αν f || → 0 ,
||f ν || → 0 entails ||αf ν || → 0 ,

then ||f || is called a quasi-norm, and F is called a quasi-normed linear space. When F
is a quasi-normed or a normed linear space, f ν → f entails ||f ν || → ||f ||; furthermore, if
f ν → f , g ν → g, and αν → α, it holds that f ν + g ν → f + g and αν f ν → αf (Yosida,
1980, Proposition 2.2).

C.4 Proposition. Let F be a real pre-Hilbert space. The inner product between
f, g ∈ F is defined by

⟨f, g⟩ = (1/4)||f + g||² − (1/4)||f − g||² ,

and satisfies the following properties (where f, g, h ∈ F and α ∈ R):

⟨αf, g⟩ = α⟨f, g⟩ ;  ⟨f + g, h⟩ = ⟨f, h⟩ + ⟨g, h⟩ ;  ⟨f, g⟩ = ⟨g, f⟩ ;  ⟨f, f⟩ = ||f||² .

Moreover, we have |⟨f, g⟩| ≤ ||f|| ||g|| (Cauchy-Schwarz inequality).

If F is a complex pre-Hilbert space, the inner product is defined as ⟨f, g⟩ = (f, g) +
j(f, j g) with (f, g) = ||f + g||²/4 − ||f − g||²/4 and j = √−1. The properties of
Proposition C.4 hold with α ∈ C, except that now ⟨f, g⟩ is the complex conjugate of ⟨g, f⟩.
In particular, ⟨f, αg⟩ = ᾱ⟨f, g⟩ with ᾱ the complex conjugate of α.
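
As a small numerical sanity check for the finite-dimensional Euclidean case (an illustrative sketch only, assuming numpy), the parallelogram law of condition iv in Definition C.3 and the polarization identity of Proposition C.4 can be verified directly:

import numpy as np

rng = np.random.default_rng(0)
f, g = rng.standard_normal(5), rng.standard_normal(5)
norm = np.linalg.norm

# Parallelogram law (condition iv of Definition C.3).
lhs = norm(f + g) ** 2 + norm(f - g) ** 2
rhs = 2 * (norm(f) ** 2 + norm(g) ** 2)

# Polarization identity of Proposition C.4 recovers the Euclidean inner product.
inner = 0.25 * norm(f + g) ** 2 - 0.25 * norm(f - g) ** 2

print(np.isclose(lhs, rhs), np.isclose(inner, f @ g))    # True True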

C.5 Definition. A normed linear space that is complete is called a Banach space. A
pre-Hilbert space that is complete is called a Hilbert space.

A set B = {f ν }ν∈I of elements of a Hilbert space F is called an orthonormal set of F


if hf ν , f ν i = 1 and hf µ , f ν i = 0 for µ ≠ ν. If in addition B is not a proper subset of an
orthonormal set of F , then B is an orthogonal basis of F .

C.6 Proposition. A separable Hilbert space F has an orthogonal base {f^ν}_{ν∈I} with
at most a countable number of elements (Yosida, 1980, Corollary III.5). Any f ∈ F
can be represented as f = Σ_{ν∈I} c^ν f^ν with c^ν = ⟨f, f^ν⟩ (Fourier expansion), whereas
||f||² = Σ_{ν∈I} |c^ν|² (Parseval's relation).

C.2 Linear Mappings

C.7 Definition. Let X, Y be Banach spaces over a field K. A mapping T : X → Y is


said to be a linear mapping if dom T = X and T(αx1 + βx2) = αT(x1) + βT(x2) for
every x1, x2 ∈ X and scalars α, β ∈ K.

Let us denote by || · ||X and || · ||Y the norms of the Banach spaces X and Y . Let ||T ||
denote the smallest constant c > 0 such that ||T (x)||Y ≤ c||x||X for all x ∈ dom T . We
say that T is bounded if ||T || is finite, and call ||T || the operator norm of T . It holds
that a linear mapping T : X → Y is continuous if and only if T is bounded (Yosida,
1980, Corollary I.6.2).
Let L(X, Y ) denote the space of all continuous linear mappings T : X → Y . The
following statement of Riesz’s representation theorem is taken from Yosida (1980, Section
III.6).

C.8 Theorem. Let X be a Hilbert space over the field K and let f be an element of
L(X, K). Then there exists a unique element yf ∈ X such that f (x) = hx, yf i for every
x ∈ X with ||f || = ||yf ||X . Conversely, an element y ∈ X defines a unique mapping fy
in L(X, K) by fy (x) = hx, yi for every x ∈ X with ||fy || = ||y||X .

There is a one-to-one correspondence between elements f ∈ L(X, K) and elements


yf ∈ X. In particular, if X is a real Hilbert space, L(X, R) (the dual of X) can be
identified to a real Hilbert space equipped with the inner product hf, gi = hy f , yg i. If X
is a complex Hilbert space, L(X, C) can be identified to a complex Hilbert space equipped
with the inner product hf, gi = hyf , yg i.

C.3 Reproducing Kernel Hilbert Spaces

C.9 Definition. Let F be a space of functions f : X → K forming a Hilbert space. The


inner product between f, g ∈ F is written hf (·), g(·)i. Then, the mapping k : X × X → R
with values k(x, y) is a reproducing kernel of F if
i. For every y ∈ X, the function fy (·) = k(·, y) is in F ;
ii. (Reproducing property.) For every f ∈ F and every y ∈ X, f (y) = hf (·), k(·, y)i.

Note that k(·, y) acts as a Dirac distribution centered at y by the reproducing property,
whereas k(·, y) is actually a function defined on X.
With f (·) = k(·, y), Property ii. yields k(x, y) = hk(·, y), k(·, x)i. For real Hilbert
spaces we can write k(x, y) = hk(·, x), k(·, y)i, whereas for complex Hilbert spaces we
have k(x, y) = hk(·, x), k(·, y)i.
If a reproducing kernel k exists, it is unique (Aronszajn, 1950). A Hilbert space
for which a reproducing kernel exists is called a reproducing kernel Hilbert space
(RKHS). A reproducing kernel of F exists if and only if for every y ∈ X, the mapping
f 7→ f (y) (called the evaluation functional) is a continuous linear mapping with

respect to f ∈ F, meaning that there exists a finite c_y > 0 such that |f(y)| ≤ c_y ||f|| for
all f ∈ F. If k exists, the smallest c_y is k(y, y)^{1/2}, by the Cauchy-Schwarz inequality (C.4)
applied to f(y) = ⟨f(·), k(·, y)⟩, whereas if a continuous linear mapping F_y(f) = f(y)
exists for every y, we have F_y(f) = ⟨f(·), g_y(·)⟩ for some g_y ∈ F (by Theorem C.8), so
that g_y(x) = k(x, y) is a reproducing kernel (Yosida, 1980, Proof of Theorem III.9.1).
From the relation |f (y)| ≤ k(y, y)1/2 ||f ||, one deduces that if there exists a scalar
c > 0 such that k(y, y)1/2 ≤ c for all y ∈ X, then ||f ||∞ = supy∈X |f (y)| ≤ c||f ||. For
the particular case of normalized kernels [k(y, y) = 1] we have ||f ||∞ ≤ ||f ||.
For a sequence {f^ν}_{ν∈N} in a RKHS, ||f^ν − f|| → 0 entails f^ν(y) → f(y) for every
y ∈ X: strong convergence f^ν → f implies pointwise convergence by continuity of the
evaluation functionals.

C.10 Proposition. A reproducing kernel k for a class F of K-valued functions has the
property that Σ_{i=1}^{n} Σ_{j=1}^{n} α_i k(y_i, y_j) α_j = ||Σ_{i=1}^{n} k(·, y_i) α_i||² ≥ 0 for any finite collection
of elements y_i ∈ X and coefficients α_i ∈ K. That is, the Gram matrix K ∈ K^{n×n} with
elements K_{ij} = k(y_i, y_j) is positive semi-definite.
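
A minimal numerical illustration of Proposition C.10 (the Gaussian kernel is an arbitrary concrete choice, and numpy is assumed available): the Gram matrix built from any finite set of points has no negative eigenvalue, up to round-off.

import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 3))              # points y_1, ..., y_8 in R^3

def k(a, b, sigma=1.0):
    # Gaussian kernel, a positive definite kernel (see also Theorem C.16).
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

K = np.array([[k(yi, yj) for yj in Y] for yi in Y])   # Gram matrix
print(np.linalg.eigvalsh(K).min() >= -1e-12)          # True: K is PSD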

When X ⊂ R^d, Proposition C.10 can also be stated as follows: the linear mappings
L : F → K defined by L(f) = ∫_X ∫_X α(x) k(x, y) α(y) dx dy, with k a reproducing kernel
and α any K-valued continuous function with nonzero values on a compact subset of X,
are such that L(f) ≥ 0.
The converse of Proposition C.10 is also true (Aronszajn, 1950, Theorem 2.4 at-
tributed to E.H. Moore). Before stating the theorem, we define the notion of positive
definite kernel.

C.11 Definition. A function k : X × X → K that is hermitian [k(x, y) equals the complex conjugate of k(y, x)], and


such that any matrix in Kn×n (n ∈ N) with elements Kij = k(yi , yj ) (yi ∈ X, 1 ≤ i ≤ n)
is positive semi-definite, is called a positive definite kernel.

C.12 Theorem. To every positive definite kernel k : X × X → K, there corresponds


a unique class F of functions f : X → K forming a Hilbert space with a uniquely
determined inner product and with k as a reproducing kernel.

Being a reproducing kernel, a positive definite kernel k has the property that k(·, y)
is continuous for every y ∈ X. The property does not imply that k is continuous as a
mapping from X × X to K (Lehto, 1952). A continuous positive definite kernel is called
a Mercer kernel. Since a function is continuous at any isolated point of its domain
(Rudin, 1976, Definition 4.5), the distinction between positive definite kernels and Mercer
kernels is irrelevant when X is a discrete set.
To build the class F of Theorem C.12 corresponding to a positive definite kernel k,
we follow Aronszajn (1950):

C.13 Proposition. The class F is generated by functions of the form

f(·) = Σ_{i=1}^{m} α_i k(·, y_i)   for some m ∈ N, α_i ∈ K, y_i ∈ X,

to which corresponds the norm ||f|| = [Σ_i Σ_j α_i k(y_i, y_j) α_j]^{1/2}, and then the class is
completed by the limit functions of Cauchy sequences in the metric of the norm.

The inner product between two functions f(·) = Σ_{i=1}^{m} α_i k(·, y_i) and g(·) = Σ_{j=1}^{n} β_j k(·, y'_j)

is then given by

⟨f, g⟩ = Σ_{i=1}^{m} Σ_{j=1}^{n} α_i β_j k(y'_j, y_i) = Σ_{i=1}^{m} Σ_{j=1}^{n} α_i β_j k(y_i, y'_j) .
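
For real-valued kernels, these quantities are quadratic forms in Gram matrices, which the following sketch evaluates numerically (the Gaussian kernel and the data are arbitrary illustrative choices; numpy is assumed available).

import numpy as np

rng = np.random.default_rng(1)
Y, Yp = rng.standard_normal((5, 2)), rng.standard_normal((4, 2))   # centers y_i, y'_j
alpha, beta = rng.standard_normal(5), rng.standard_normal(4)       # coefficients

def k(a, b, sigma=1.0):
    # A concrete positive definite kernel (Gaussian), chosen for illustration.
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

# f(.) = sum_i alpha_i k(., y_i) and g(.) = sum_j beta_j k(., y'_j).
K_ff = np.array([[k(a, b) for b in Y] for a in Y])     # Gram matrix of the y_i
K_fg = np.array([[k(a, b) for b in Yp] for a in Y])    # cross Gram matrix

norm_f   = np.sqrt(alpha @ K_ff @ alpha)   # ||f|| as in Proposition C.13
inner_fg = alpha @ K_fg @ beta             # <f, g> as in the display above
print(norm_f, inner_fg)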

C.4 Positive Definite Kernels

Theorems C.10 and C.12 show that Reproducing Kernel Hilbert Spaces are uniquely
determined by the choice of a positive definite kernel. In the sequel, we refer to a
positive definite kernel simply as a kernel.
If X is a compact subspace of Rd , a continuous kernel k : X × X → R admits an
eigenfunction expansion
k(x, y) = Σ_{ν=1}^{m} λ^ν ψ^ν(x) ψ^ν(y)

with λ^ν > 0 and m ≤ ∞ (Mercer, 1909). The vector φ(x) = {√λ^ν ψ^ν(x)}_{1≤ν≤m} is
interpreted as a feature vector for x, whereas k(x, y) is viewed as a generalized inner
product between the vectors φ(x), φ(y) valued in some feature space F. The mapping
φ : X → F is called a feature map (Aizerman et al., 1964).
To elucidate the nature of F, observe that φ(x) belongs to the space ℓ² of vectors
{ξ^ν}_{ν∈N} such that Σ_{ν=1}^{∞} |ξ^ν|² < ∞, since k(x, x) = Σ_{ν=1}^{m} φ^ν(x)² is finite. In fact
ℓ² is a normed linear space equipped with the norm ||{ξ^ν}_{ν∈N}|| = [Σ_{ν=1}^{∞} (ξ^ν)²]^{1/2}, which
can be interpreted as a generalization of the Euclidean norm in R^n when n tends to ∞.
The feature map is continuous: x^ν → x̄ entails φ(x^ν) → φ(x̄), since ||φ(x^ν) − φ(x̄)||² =
k(x^ν, x^ν) + k(x̄, x̄) − 2k(x^ν, x̄) → 0 by continuity of k (Cucker and Smale, 2001).
From k(·, y) = Σ_{ν=1}^{m} λ^ν ψ^ν(·) ψ^ν(y) and f(·) = Σ_{i=1}^{n} α_i k(·, y_i) for some n ≤ ∞, one
can see that f has the form f(·) = Σ_{ν=1}^{m} α_f^ν ψ^ν(·) with α_f^ν = Σ_{i=1}^{n} α_i λ^ν ψ^ν(y_i), and that
⟨f, g⟩ = Σ_{ν=1}^{m} α_f^ν α_g^ν / λ^ν.
In machine learning, it is common to extend the feature map interpretation to more
general spaces X and say that a function k : X × X → K is a kernel if there exist a
Hilbert space H and a mapping φ : X → H such that k(x, y) = hφ(x), φ(y)i for all
x, y ∈ X (Steinwart and Christmann, 2008, Definition 4.1). The corresponding RKHS is
the class F of functions of the form f (·) = hh, φ(·)iH for some h ∈ H, equipped with
the norm ||f || = inf h∈H {||h||H : f (·) = hh, φ(·)iH } (Steinwart and Christmann, 2008,
Theorem 4.21). Proposition C.13 still holds.
For the class of shift-invariant continuous kernels k : X × X → R with X = Rd , where
the shift-invariance property means that k(x+τ, y+τ ) = k(x, y) for any τ ∈ R d , Bochner’s
theorem (Bochner, 1933) [see also Yosida (1980, Theorem XI.13.2)] characterizes the
kernels k in the frequency domain. The following statement particularizes to real-valued
normalized kernels (k(x, x) = 1 for all x ∈ Rd ) a form of Bochner’s theorem given in
Hofmann et al. (2008).

C.14 Theorem. Let h : Rn → R be a continuous function with h(0) = 1 and h(x) =


h(−x). Then, the function k : Rn × Rn → R with values k(x, y) = h(x − y) is a
kernel iff there exists a random vector ξ ∈ Rn on a probability space (Ω, B, P) such that
h(x) = E{exp{j hx, ξi}}.

Thus we have k(x, y) = E{exp{j⟨x, ξ⟩} exp{−j⟨y, ξ⟩}}, which is very similar to Mercer's
eigenfunction expansion (the countable sum has been replaced by an integral, as X = R^d
is now unbounded).
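
For instance (a standard computation, recalled here only as an illustration), if ξ is a Gaussian random vector with zero mean and covariance matrix σ^{−2} I, then h(x) = E{exp{j⟨x, ξ⟩}} = exp{−||x||²/(2σ²)}, so that Theorem C.14 recovers the fact that k(x, y) = exp{−||x − y||²/(2σ²)} is a kernel.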
From Bochner’s theorem, Schoenberg (1938) obtains a characterization of shift-invariant
kernels having a radial symmetry.

C.15 Definition. A function f : R → R with values f (t) is completely monotone


(c.m.) for t ≥ 0 if it is infinitely continuously differentiable on (0, ∞) with f (0) = f (0 + )
and has for every k its derivative of order k satisfying

(−1)k f (k) (t) ≥ 0 for 0 < t < ∞.

C.16 Theorem. The function k : Rn × Rn → R with values k(x, y) = f (||x − y||2 ) is a


kernel iff f : R → R is a completely monotone (c.m.) function for R+ .

An example of a c.m. function for R+ is f(t) = exp{−at} with a ≥ 0; it shows that
the function k(x, y) = exp{−½ ||x − y||²/σ²} is a kernel. Other simple examples of c.m.
functions are (a + b t)−q with q > 0 and a, b ≥ 0 (a, b not both 0), and log(a + bt−1 ) with
a ≥ 1 and b > 0. Simple composition rules are as follows: If f1 , f2 are c.m. for t ≥ 0,
then α1 f1 + α2 f2 with α1 , α2 ≥ 0 is c.m. and f1 (t)f2 (t) is c.m.; If f is c.m. and g is
nonnegative with a c.m. derivative g 0 , then f (g(t)) is c.m.
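
The first example can be checked directly from Definition C.15: f(t) = exp{−at} has f^{(k)}(t) = (−a)^k exp{−at}, so that (−1)^k f^{(k)}(t) = a^k exp{−at} ≥ 0 for all k and all t > 0.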
Kernels are closed under positive sums and pointwise products:

C.17 Proposition. If ki : X ×X → K (i = 1, 2) are kernels, then the following functions


k : X × X → K are kernels.
i. k = α1 k1 + α2 k2 (α1 , α2 ≥ 0) with values k(x, y) = α1 k1 (x, y) + α2 k2 (x, y) .
ii. k = k1 · k2 with values k(x, y) = k1 (x, y)k2 (x, y).

If ki : Xi × Xi → K (i = 1, 2) are kernels, then the following functions k : (X1 × X2 ) ×


(X1 × X2 ) → K are kernels.
iii. k = k1 ⊕ k2 with values k(x1 , x2 , y1 , y2 ) = k1 (x1 , y1 ) + k2 (x2 , y2 ).
iv. k = k1 ⊗ k2 with values k(x1 , x2 , y1 , y2 ) = k1 (x1 , y1 )k2 (x2 , y2 ).

If k : (X × X) × (X × X) → K with values k(x1 , x2 , y1 , y2 ) is a kernel, then k ∆ :


X × X → K with values k ∆ (x, y) = k(x, x, y, y) is a kernel (Haussler, 1999).
Kernels are also closed under pointwise limits:

C.18 Proposition. If {k ν }ν∈N is a sequence of kernels k ν : X × X → K such that


k ν (x, y) → k(x, y) for all x, y ∈ X, then k is a kernel.

Using basic kernels and positivity-preserving operations, more complex kernels can
be built. For example, one can define a kernel k : X × X → R with values

k(x, y) = E{φ(ω, x)φ(ω, y)} ,

where φ : Ω × X → R is such that φ(ω, x) is in L2 (Ω, B, P) for each x ∈ X. If in addition


E{φ(ω, x)} = 0 for each x, we can interpret the Gram matrix for k evaluated at x 1 , . . . , xn
as the covariance matrix of the random variables φ(·, x1 ), . . . , φ(·, xn ).
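
A small Monte Carlo sketch of this construction (numpy assumed; the Gaussian choices are illustrative): by Theorem C.14 with ξ Gaussian of covariance σ^{−2} I, the Gaussian kernel can be written as k(x, y) = E{cos⟨x − y, ξ⟩} = E{cos⟨x, ξ⟩ cos⟨y, ξ⟩} + E{sin⟨x, ξ⟩ sin⟨y, ξ⟩}, a positive sum of two kernels of the above form, which the snippet below approximates by sampling ω.

import numpy as np

rng = np.random.default_rng(0)
sigma, n_samples = 1.0, 200000
xi = rng.standard_normal((n_samples, 2)) / sigma   # samples of xi with covariance I / sigma^2

def phi(x):
    # Random feature map phi(omega, x) = (cos<x, xi(omega)>, sin<x, xi(omega)>).
    t = xi @ x
    return np.column_stack([np.cos(t), np.sin(t)])

x = np.array([0.3, -1.2])
y = np.array([1.0, 0.4])

mc_estimate = np.mean(np.sum(phi(x) * phi(y), axis=1))
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
print(mc_estimate, exact)    # close, up to Monte Carlo error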
Appendix D

Structural Results for Two-Stage Stochastic


Programming

This appendix describes a classical formulation of the two-stage stochastic linear program
with recourse, and gives details on the structure of optimal solutions. We have included
this appendix in the thesis, because it clarifies the origin of certain assumptions that are
found to be technically challenging to remove in stochastic programming models.
The material is mainly taken from Birge and Louveaux (1997), up to some adjustments
based on Wets (1974); Römisch and Wets (2007); Shapiro et al. (2009).
The appendix is organized as follows. Section D.1 states the problem and gives a list
of assumptions that ensure that the formulation is meaningful. Section D.2 gives useful
properties that can be derived from the previous assumptions.

D.1 Problem Statement and Assumptions

Let (Ω, B, P) be a probability space. A two-stage stochastic linear program with recourse
is a program of the form

minimize   ⟨c, x⟩ + E{Q(x, ω)}                                        (D.1)
subject to x ∈ K1 ∩ K2 ,                                              (D.2)
where  Q(x, ω) = min_y { ⟨q(ω), y⟩ : T(ω)x + W(ω)y = h(ω), 0 ⪯ y ∈ R^{m2} }   (D.3)
       K1 = {x ∈ R^m : Ax = b, x ⪰ 0}                                 (D.4)
       K2 = {x ∈ R^m : E{Q(x, ω)} < ∞}                                (D.5)

with c ∈ Rm (first-stage cost vector), A ∈ Rs×m , b ∈ Rs , and B-measurable mappings


q : Ω → Rm2 (second-stage cost vector), T : Ω → Rs2 ×m (technology matrix), W : Ω →
Rs2 ×m2 (recourse matrix), h : Ω → Rs2 . We let ai , ti , wi denote the i-th rows of A, T ,
W respectively.
Let z : Ω → Rm2 × Rs2 ×m × Rs2 ×m2 × Rs2 be the B-measurable mapping with values

z(ω) = ( q(ω), t1 (ω), . . . , ts2 (ω), w1 (ω), . . . , ws2 (ω), h(ω) ) (D.6)

which collects all the (possibly non-random) elements of (q, T, W, h). Let Z ⊂ R p with
p = m2 + s2 (m + m2 + 1) denote the support of z.
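
To make these objects concrete, the following self-contained sketch (toy data; numpy and scipy are assumed available) solves a tiny instance with finite support by assembling the deterministic equivalent, in which one copy of the recourse variable y is created per scenario; it is only an illustration of (D.1)-(D.5), not a solution method studied in the thesis.

import numpy as np
from scipy.optimize import linprog

# First-stage data: a single decision x >= 0 with cost c; the constraints
# Ax = b of K1 are omitted in this toy example.
c = np.array([1.0])

# Three scenarios z^nu = (q, T, W, h) with probabilities p^nu (finite support,
# fixed recourse, and complete recourse since {W y : y >= 0} = R).
p = [0.3, 0.5, 0.2]
q = np.array([2.0, 0.5])
T = np.array([[1.0]])
W = np.array([[1.0, -1.0]])
h_list = [np.array([1.0]), np.array([2.0]), np.array([4.0])]

m, m2, n = c.size, q.size, len(p)

# Decision vector (x, y^1, ..., y^n); objective <c, x> + sum_nu p^nu <q, y^nu>.
cost = np.concatenate([c] + [p[nu] * q for nu in range(n)])

# Equality constraints T x + W y^nu = h^nu for each scenario nu.
rows, rhs = [], []
for nu in range(n):
    row = np.zeros((T.shape[0], m + n * m2))
    row[:, :m] = T
    row[:, m + nu * m2 : m + (nu + 1) * m2] = W
    rows.append(row)
    rhs.append(h_list[nu])

res = linprog(cost, A_eq=np.vstack(rows), b_eq=np.concatenate(rhs),
              bounds=[(0, None)] * (m + n * m2), method="highs")
print(res.fun, res.x[:m])    # optimal value and first-stage decision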

Various well-posed forms of the program can be distinguished. To this end, we describe
standard assumptions. The joint role of those assumptions is detailed in Section D.2.

D.1 Definition (Measurability). The support of z is a Borel set of Rp , and the sigma-
algebra B contains the collection of Borel sets of the support of z, that is,

B ⊃ {B ∩ Z : B ∈ B(Rp )} .

The stated measurability assumption is consistent with Wets (1974, page 311). The
measurability of Q(x, ·) for each fixed x requires at least that the sigma-algebra gener-
ated by z be included in B. It does not harm to allow sigma-algebras larger than the
sigma-algebra generated by z since this would not alter the optimal value of the pro-
gram. Note that using larger sigma-algebras makes it possible to select distinct vectors
y ∗ (ω1 ), y ∗ (ω2 ) for attaining the optimal value of Q when z(ω1 ) = z(ω2 ), that is, to im-
plement a stochastic policy for y. Most authors rule out this possibility, but in practice
a numerical solution algorithm could indeed return distinct optimal values for y in face
of duplicate realizations of z.
Now, recall that a mapping F : Rd → Rm is said to be affine iff it has values F (ξ) =
b̄ + Bξ for some fixed b̄ ∈ Rm and B ∈ Rm×d .

D.2 Definition (Affine dependence). There exist a random variable ξ : Ω → R d


(d ≤ p) and affine mappings qf : Rd → Rm2 , Tf : Rd → Rs2 ×m , Wf : Rd → Rs2 ×m2 ,
hf : Rd → Rs2 , possibly constant-valued, such that for all ω ∈ Ω,

q(ω) = qf (ξ(ω)) , T (ω) = Tf (ξ(ω)) , W (ω) = Wf (ξ(ω)) , h(ω) = hf (ξ(ω)).

The affine dependence assumption enforces the parametrization of (q, T, W, h) by ξ.


It is made without loss of generality, as it is always possible to set ξ = z, and ex-
tract the appropriate coordinates of z through mappings qf , Tf , Wf , hf . One goal of the
parametrization is to represent through ξ the randomness of the non-constant elements
of z. Ideally, ξ is made of a small number of components, and the measure P is specified
indirectly by the joint distribution of those components.
If we define

Qf(x, ξ) = min_y { ⟨qf(ξ), y⟩ : Tf(ξ)x + Wf(ξ)y = hf(ξ), 0 ⪯ y ∈ R^{m2} } ,    (D.7)

we have Q(x, ω) = Qf (x, ξ(ω)).

D.3 Definition (Fixed Recourse). For all ω, W (ω) is a fixed matrix W ∈ Rs2 ×m2 .

Fixed recourse is a simplifying assumption under which the value function E{Q(·, ω)}
is easier to describe. The rows of the fixed matrix W are always assumed to be linearly
independent to avoid trivial redundancies or conflicts among equality constraints (Wets,
1974, page 312).

D.4 Definition (Complete Recourse). For all ω, W (ω) is a fixed matrix W ∈


Rs2 ×m2 , and the positive hull of W coincides with Rs2 :

pos W = {W y : y ⪰ 0} = W R^{m2}_+ = R^{s2} .

The condition W R^{m2}_+ = R^{s2} implies that for any x ∈ R^m and all ω ∈ Ω, there exists
some y ⪰ 0 such that T(ω)x + W y = h(ω). This surjectivity condition is sufficient
for having Q(x, ω) < ∞ almost surely. From (D.8) below, one can show that complete
recourse holds iff {π ∈ R^{s2} : W^T π ⪯ 0} = {0} (Shapiro et al., 2009, page 33). Recall
from Rockafellar (1970, page 65) that the dimension of the largest subspace contained
in a cone is called the lineality of the cone; methods to check that the lineality of pos W
is s2 are described in Wets and Witzgall (1967) and Wallace and Wets (1992).
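
As a complement, a naive numerical check of the complete recourse condition can be set up with a few feasibility linear programs (an illustrative sketch assuming numpy and scipy, not one of the specialized procedures cited above): since pos W is a convex cone, pos W = R^{s2} holds iff +e_i and −e_i belong to pos W for every unit vector e_i.

import numpy as np
from scipy.optimize import linprog

def has_complete_recourse(W):
    # Test whether pos W = {W y : y >= 0} covers R^{s2} by checking that
    # +e_i and -e_i are in pos W for every coordinate i.
    s2, m2 = W.shape
    for i in range(s2):
        for sign in (+1.0, -1.0):
            t = np.zeros(s2)
            t[i] = sign
            res = linprog(np.zeros(m2), A_eq=W, b_eq=t,
                          bounds=[(0, None)] * m2, method="highs")
            if res.status != 0:      # the system W y = t, y >= 0 is infeasible
                return False
    return True

W1 = np.array([[1.0, -1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, -1.0]])    # complete recourse: pos W1 = R^2
W2 = np.array([[1.0, 0.0],
               [0.0, 1.0]])               # pos W2 = R^2_+ only
print(has_complete_recourse(W1), has_complete_recourse(W2))   # True False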

D.5 Definition (Relatively Complete Recourse). For all x ∈ K1 and P-almost all
ω ∈ Ω, there exists some y ⪰ 0 such that T(ω)x + W(ω)y = h(ω), that is,

h(ω) − T(ω)x ∈ W(ω) R^{m2}_+ .

The relatively complete recourse assumption means that Q(x, ω) < ∞ for almost
all ω and x ∈ K1 . We could still have E{Q(x, ω)} = ∞ if the distribution of Q(x, ·) is
not integrable. In particular, the assumption alone does not guarantee that K 1 ⊂ K2
— compare to Wets (1974, Definition 6.1) and Birge and Louveaux (1997, page 92).
Note that no generic method is available for checking that a relatively complete recourse
assumption holds. Relatively complete recourse is thus typically asserted at the modeling
step, where penalties in the objective can be favored over hard constraints.

D.6 Definition (Dual Feasibility). For P-almost all ω ∈ Ω, the set


Π(ω) = {π ∈ R^{s2} : W(ω)^T π ⪯ q(ω)}
is nonempty.

The dual feasibility assumption ensures that Q(x, ω) > −∞ for almost all ω and all x.
Indeed, by weak duality,
Q(x, ω) = inf_y { ⟨q(ω), y⟩ : T(ω)x + W(ω)y = h(ω), y ⪰ 0 }
         ≥ sup_π { ⟨π, h(ω) − T(ω)x⟩ : W(ω)^T π ⪯ q(ω) }        (D.8)
         > −∞ if Π(ω) ≠ ∅.
D.7 Definition (Fixed Technology). For all ω, T (ω) is a fixed matrix T ∈ R s2 ×m .

D.8 Definition (Finite Second Moments). z ∈ L2 (Ω, B, P), that is, E{||z||2 } < ∞.

D.9 Definition (Finite Support). The support of z is finite, that is, there exists a
finite set Z = {z^1, z^2, . . . , z^n} such that P{ω : z(ω) = z^ν} = p^ν > 0 with Σ_{ν=1}^{n} p^ν = 1.

D.10 Definition (Polyhedral Support). The support of z is a polyhedral set.

A set is said to be polyhedral if it can be described as the intersection of a finite


number of halfspaces. For instance, by definition, K1 is polyhedral. A classic result from
Weyl (1935) shows that polyhedral sets are the only sets that can also be described as the
convex hull of a finite number of points and directions (Rockafellar, 1970, Theorem 19.1).
As the image of a polyhedral set in Rn1 by a linear transformation F : Rn1 → Rn2
is a polyhedral set in Rn2 (Rockafellar, 1970, Theorem 19.3), the polyhedral support
assumption holds if the affine dependence assumption holds with a polyhedral support
for ξ. Note also that an affine mapping F : Rd → Rm with values F (ξ) = b̄ + Bξ is
Lipschitz continuous with modulus ||B|| = max_{||u||=1} ||Bu|| (Rockafellar and Wets, 1998,
Example 9.3).

D.2 Structural Properties

It is interesting to identify conditions under which K1 ∩ K2 (constraint D.2) is polyhedral


(Walkup and Wets, 1967). Clearly, the intersection of a finite number of polyhedral sets
is a polyhedral set, so the question, if it is not circumvented by the relatively complete
assumption, is reduced to investigating under which circumstances K 2 is polyhedral.
Sufficient conditions are collected in the next proposition, mainly based on Wets (1974)
and Birge and Louveaux (1997, Section 3.1). Essentially, these results concern cases
where the recourse matrix W is fixed and cases where the random variables have a finite
support.

D.11 Proposition (Representations of the effective domain of E{Q(·, ω)}). Un-


der the following sufficient conditions, the set K2 = {x ∈ Rm : E{Q(x, ω)} < ∞} in (D.5)
admits the following representations.
i. Under the finite second moments assumption,

K2 = {x ∈ Rm : P{ω : Q(x, ω) < ∞} = 1} .

ii. Under the finite second moments and fixed recourse assumptions,

K2 = {x ∈ R^m : for P-almost all ω ∈ Ω, there is some y ⪰ 0 such that W y + T(ω)x = h(ω)} ,

or equivalently, with Σ denoting the support of the distribution of (h, T ),


K2 = ∩_{(h,T)∈Σ} {x ∈ R^m : W R^{m2}_+ ∋ h − T x} ,

as shown in Wets (1974, Theorem 4.2) or Shapiro et al. (2009, Equation 2.33).
iii. Under the finite second moments and fixed recourse assumptions, if the support
of T is polyhedral and if h, T are statistically independent, then K2 is polyhedral
(Wets, 1974, Corollary 4.13).
iv. Under the finite second moments, fixed recourse and fixed technology assumptions,
K2 is polyhedral; more precisely (Wets, 1974, Theorem 4.10) there exist a matrix
W* ∈ R^{p×s2} (p finite) and a vector α* ∈ R^p such that

K2 = {x ∈ R^m : W* T x ⪯ α*} .

v. Under the finite second moments and complete recourse assumptions, K 2 = Rm .


vi. Under the finite support assumption, K2 is polyhedral; more precisely,

K2 = {x ∈ R^m : for all ω ∈ Ω, there is some y(ω) ⪰ 0 such that W(ω)y(ω) = h(ω) − T(ω)x}
   = ∩_{ν=1}^{n} {x ∈ R^m : there is some y^ν ⪰ 0 with W^ν y^ν = h^ν − T^ν x}

where the elements indexed by ν refer to the realizations ξ ν .



A proper function is said to be polyhedral when its epigraph is a polyhedral set. The
domain of a polyhedral function is necessarily a polyhedral set. A result established in
Rockafellar and Wets (1998, Theorem 2.49) shows that the class of proper polyhedral
functions is the class of proper convex piecewise linear functions.
It is interesting to identify conditions under which E{Q(x, ω)} (second term of ob-
jective in D.1, often referred to as the value function) is polyhedral. The finite sum
of proper polyhedral convex functions is polyhedral (Rockafellar, 1970, Theorem 19.4)
and the multiplication of a proper polyhedral convex function by a nonnegative scalar is
polyhedral (Rockafellar, 1970, Corollary 19.5.1), so when the support of ξ is finite, the
question is reduced to investigating under which conditions the integrand of the value
function, Q(x, ω), is proper and polyhedral.
The next lemma (Römisch and Wets, 2007, Lemma 3.1), which reformulates results in
Walkup and Wets (1969a), is instrumental in describing the structure of Q(x, ω) without
necessarily assuming fixed recourse. Under the affine dependence assumption, let Ξ ⊂ R d
denote the support of ξ, and let Φ : Rd × Rm2 × Rs2 → R be a mapping with values

Φ(ξ, q, t) = inf_y { ⟨q, y⟩ : Wf(ξ)y = t, y ⪰ 0 } .

Observe that

Q(x, ω) = Qf (x, ξ(ω)) = Φ( ξ(ω), qf (ξ(ω)), hf (ξ(ω)) − Tf (ξ(ω)) x ) .

By analogy to the relatively complete recourse assumption, let H(ξ) = Wf(ξ) R^{m2}_+, and
by analogy to the dual feasibility assumption, let

Πf(ξ) = {π ∈ R^{s2} : Wf(ξ)^T π ⪯ qf(ξ)} ,   D(ξ) = {q ∈ R^{m2} : {π ∈ R^{s2} : Wf(ξ)^T π ⪯ q} ≠ ∅} .

D.12 Lemma. Let the affine dependence and the polyhedral support assumptions hold.
Let ξ ∈ Ξ be fixed. Then,
i. The sets D(ξ) and H(ξ) are polyhedral;
ii. The function Φ(ξ, ·, ·) is finite and continuous on D(ξ) × H(ξ);
iii. The function Φ(ξ, q, ·) is piecewise linear convex on H(ξ) for fixed q ∈ D(ξ);
iv. The function Φ(ξ, ·, t) is piecewise linear concave on D(ξ) for fixed t ∈ H(ξ).

Inasmuch as t = hf (ξ(ω)) − Tf (ξ(ω))x, it holds that t depends affinely on x when ω


and thus h, T are fixed.
When the recourse is fixed, Lemma D.12 allows to establish the following local Lips-
chitz continuity properties for Q or equivalently Qf (Rachev and Römisch, 2002, Propo-
sition 3.2).

D.13 Lemma. Let the affine dependence and the polyhedral support assumptions hold.
Under the finite second moments, the fixed recourse, the relatively complete recourse and
the dual feasibility assumptions, there exist constants L1 > 0, L2 > 0, K > 0 such that
for all ξ, ξ 0 ∈ Ξ, any ρ > 0, and for all x, x0 ∈ K1 ∩ K2 ∩ ρB,
i. |Qf (x, ξ) − Qf (x, ξ 0 )| ≤ L1 ρ max{1, ||ξ||, ||ξ 0 ||} ||ξ − ξ 0 || ;
ii. |Qf (x, ξ) − Qf (x0 , ξ)| ≤ L2 max{1, ||ξ||2 } ||x − x0 || ;
iii. |Qf (x, ξ)| ≤ Kρ max{1, ||ξ||2 } .

Finally, the following proposition, which addresses the differentiability of the value
function, is based on Walkup and Wets (1969b), Wets (1974) and Shapiro et al. (2009,
Propositions 2.7, 2.8, 2.9). Note that under the finite second moments and relatively
complete recourse assumptions, we have K1 ⊂ K2 , so that K2 is nonempty if K1 is
nonempty.

D.14 Proposition. Under the finite second moments, fixed recourse, relatively complete
recourse and dual feasibility assumptions, and assuming that K1 is nonempty,
i. E{Q(·, ω)} is proper;
ii. E{Q(·, ω)} is convex, lower semicontinuous and Lipschitz continuous on K 2 ;
iii. If (q, T ) is constant-valued, and the distribution of h is absolutely continuous, then
E{Q(·, ω)} is differentiable at x0 ∈ int K2 ;
iv. If for almost all (q, T ), the distribution of h conditionally to (q, T ) is absolutely
continuous, and if for almost all ω ∈ Ω, the dual solution set at x0 ∈ int K2

arg max_π { ⟨π, h(ω) − T(ω)x0⟩ : W^T π ⪯ q(ω) }

is a singleton, then E{Q(·, ω)} is differentiable at x0 ∈ int K2 .


Bibliography

Ailon, N., B. Chazelle, K.L. Clarkson, D. Liu, W. Mulzer, C. Seshadhri. 2006. Self-improving
algorithms. Proceedings of the Seventeenth ACM-SIAM Symposium on Discrete Algorithms
(SODA 2006). 261–270.

Aizerman, M., E. Braverman, L. Rozonoer. 1964. Theoretical foundations of the potential


function method in pattern recognition learning. Automation and Remote Control 25 821–
837.

Ali, S.M., S. Koenig, M. Tambe. 2005. Preprocessing techniques for accelerating the DCOP
algorithm ADOPT. Proceedings of the Fourth International Joint Conference on Autonomous
Agents & Multi Agent Systems. 1041–1048.

Ali, S.M., S.D. Silvey. 1966. A general class of coefficients of divergence of one distribution from
another. Journal of the Royal Statistical Society 28 131–142.

Antos, A., R. Munos, Cs. Szepesvári. 2008. Fitted Q-iteration in continuous action-space MDPs.
Advances in Neural Information Processing Systems 20 (NIPS-2007). 9–16.

Aronszajn, N. 1950. Theory of reproducing kernels. Transactions of the American Mathematical


Society 68(3) 337–404.

Arrow, K.J. 1958. Historical background. K.J. Arrow, S. Karlin, H. Scarf, eds., Studies in the
Mathematical Theory of Inventory and Production. Stanford University Press, 3–15.

Artzner, P., F. Delbaen, J.-M. Eber, D. Heath, H. Ku. 2007. Coherent multiperiod risk adjusted
values and Bellman’s principle. Annals of Operations Research 152(1) 5–22.

Aubin, J.-P., H. Frankowska. 1990. Set-Valued Analysis. Modern Birkhäuser Classics, Springer.
2009 Reprint of the 1990 Edition.

Audibert, J.-Y., R. Munos, C. Szepesvári. 2007. Tuning bandit algorithms in stochastic envi-
ronments. Proceedings of the Eighteenth International Conference on Algorithmic Learning
Theory (ALT-2007). LNCS 4754, Springer, 150–165.

Auer, P., N. Cesa-Bianchi, P. Fischer. 2002. Finite-time analysis of the multiarmed bandit
problem. Machine Learning 47 235–256.

Azuma, K. 1967. Weighted sums of certain dependent random variables. Tohoku Mathematical
Journal 19 357–367.

Balas, E. 1998. Disjunctive programming: Properties of the convex hull of feasible points.
Discrete Applied Mathematics 89 3–44.

Balasubramanian, J., I.E. Grossmann. 2003. Approximation to multistage stochastic optimiza-


tion in multiperiod batch plant scheduling under demand uncertainty. Industrial & Engineer-
ing Chemistry Research 43 3695–3713.

Banerjee, A., S. Merugu, I.S. Dhillon, J. Ghosh. 2005. Clustering with Bregman divergences.
Journal of Machine Learning Research 6 1705–1749.

Banerjee, O., L. El Ghaoui, A. d’Aspremont. 2008. Model selection through sparse maximum
likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning
Research 9 485–516.

Baotić, M., F. Borrelli, A. Bemporad, M. Morari. 2008. Efficient on-line computation of con-
strained optimal control. SIAM Journal on Control and Optimization 47 2470–2489.

Bartlett, P.L., S. Mendelson. 2002. Rademacher and Gaussian complexities: Risk bounds and
structural results. Journal of Machine Learning Research 3 463–482.

Beale, E.M.L. 1955. On minimizing a convex function subject to linear inequalities. Journal of
the Royal Statistical Society 17 173–184.

Bellman, R.E. 1954. The theory of dynamic programming. Bulletin of the American Mathemat-
ical Society 60 503–516.

Bemporad, A., M. Morari, V. Dua, E. Pistikopoulos. 2002. The explicit linear quadratic regulator
for constrained systems. Automatica 38 3–20. Corrigendum: Automatica 39 (2003) 1845-1846.

Ben-Tal, A., A. Goryashko, E. Guslitzer, A. Nemirovski. 2004. Adjustable robust solutions of


uncertain linear programs. Mathematical Programming 99(2) 351–376.

Bertsekas, D.P. 2005a. Dynamic Programming and Optimal Control . 3rd ed. Athena Scientific,
Belmont, MA.

Bertsekas, D.P. 2005b. Dynamic programming and suboptimal control: A survey from ADP to
MPC. European Journal of Control 11 310–334.

Bertsekas, D.P., J.N. Tsitsiklis. 1996. Neuro-Dynamic Programming. Athena Scientific, Belmont,
MA.

Billingsley, P. 1995. Probability and Measure. 3rd ed. Wiley.

Birge, J.R. 1992. The value of the stochastic solution in stochastic linear programs with fixed
recourses. Mathematical Programming 24 314–325.

Birge, J.R., F. Louveaux. 1997. Introduction to Stochastic Programming. Springer-Verlag, New


York.

Bixby, R. 2002. Solving real-world linear programs: A decade and more of progress. Operations
Research 50 3–15.

Bochner, S. 1933. Monotone funktionen, stieltjessche integrale und harmonische analyse. Math-
ematische Annalen 108(1) 378–410.

Boda, K., J.A. Filar. 2006. Time consistent dynamic risk measures. Mathematical Methods of
Operations Research 63 169–186.

Bregman, L.M. 1967. The relaxation method of finding the common points of convex sets and
its application to the solution of problems in convex programming. USSR Computational
Mathematics and Mathematical Physics 7 200–217.

Breiman, L. 1996. Bagging predictors. Machine Learning 24(2) 123–140.

Breiman, L. 1998. Arcing classifiers (with discussion and a rejoinder by the author). The Annals
of Statistics 26 801–849.

Breiman, L. 2000. Randomizing outputs to increase prediction accuracy. Machine Learning 40


229–242.

Breiman, L. 2001. Random forests. Machine Learning 45 5–32.

Brown, L.D. 1986. Fundamentals of statistical exponential families, IMS lecture notes – Mono-
graph series, vol. 9. Institute of Mathematical Statistics, Hayward, California.

Buja, A., W. Stuetzle. 2006. Observations on bagging. Statistica Sinica 16 323–351.

Cesa-Bianchi, N., A. Conconi, C. Gentile. 2004. On the generalization ability of on-line learning
algorithms. IEEE Transactions on Information Theory 50 2050–2057.

Cesa-Bianchi, N., Y. Freund, D.P. Helmbold, D. Haussler, R.E. Schapire, M.K. Warmuth. 1997.
How to use expert advice. Journal of the Association for Computing Machinery 44 427–485.

Cesa-Bianchi, N., G. Lugosi. 1999. On prediction of individual sequences. The Annals of


Statistics 27 1865–1895.

Cesa-Bianchi, N., G. Lugosi. 2006. Prediction, Learning and Games. Cambridge University
Press, New York.

Chiralaksanakul, A. 2003. Monte Carlo methods for multi-stage stochastic programs. Ph.D.
thesis, University of Texas at Austin.

Chung, K.-J., M. Sobel. 1987. Discounted MDP’s: Distribution functions and exponential utility
maximization. SIAM J. Control and Optimization 25(1) 49–62.

Clarke, B.S., A.R. Barron. 1990. Information-theoretic asymptotics of Bayes methods. IEEE
Transactions on Information Theory 36 453–471.

Coquelin, P.-A., R. Munos. 2007. Bandit algorithms for tree search. Proceedings of the Twenty-
Third Conference on Uncertainty in Artificial Intelligence (UAI-2007). 67–74.

Csiszár, I. 1967. Information-type measures of difference of probability distributions and indirect


observation. Studia Scientiarum Mathematicarum Hungarica 2 229–318.

Cucker, F., S. Smale. 2001. On the mathematical foundations of learning. Bulletin of the
American Mathematical Society 39(1) 1–49.

Daniel, J.W. 1973. Stability of the solution of definite quadratic programs. Mathematical
Programming 5 41–53.

Dantzig, G.B. 1955. Linear programming under uncertainty. Management Science 1 197–206.

Dawid, A.P. 1984. Present position and potential developments: some personal views. Statistical
theory: The prequential approach (with discussion). Journal of the Royal Statistical Society
147 278–292.

Decoste, D., B. Schölkopf. 2002. Training invariant support vector machines. Machine Learning
46 161–190.

Defourny, B., D. Ernst, L. Wehenkel. 2008. Risk-aware decision making and dynamic program-
ming. Y. Engel, M. Ghavamzadeh, S. Mannor, P. Poupart, eds., NIPS-08 workshop on model
uncertainty and risk in reinforcement learning.

Defourny, B., L. Wehenkel. 2009. Large margin classification with the progressive hedging
algorithm. S. Nowozin, S. Sra, S. Vishwanathan, S. Wright, eds., Second NIPS workshop on
optimization for machine learning.

Dempster, A.M. 1972. Covariance selection. Biometrics 157–175.

Dempster, A.P., N.M. Laird, D.B. Rubin. 1977. Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society 39 1–38.

Dempster, M.A.H., G. Pflug, G. Mitra, eds. 2008. Quantitative Fund Management. Financial
Mathematics Series, Chapman & Hall/CRC.

Demuth, H., M. Beale. 1993. Neural network toolbox for use with Matlab.

Dietterich, T. 2000. An experimental comparison of three methods for constructing ensembles


of decision trees: Bagging, Boosting and Randomization. Machine Learning 40 139–157.

Dontchev, A.L., R.T. Rockafellar. 2009. Implicit Functions and Solution Mappings. Springer.

Draper, D. 1995. Assessment and propagation of model uncertainty. Journal of the Royal
Statistical Society 57 45–97.

Dupacova, J., R. J.-B. Wets. 1988. Asymptotic behavior of statistical estimators and of optimal
solutions of stochastic optimization problems. The Annals of Statistics 16 1517–1549.

Dutech, A., T. Edmunds, J. Kok, M. Lagoudakis, M. Littman, M. Riedmiller, B. Russel,


B. Scherrer, R. Sutton, S. Timmer, N. Vlassis, A. White, S. Whiteson. 2005. Reinforcement
Learning benchmarks and bake-offs II: A workshop at the 2005 NIPS conference. Available
at http://www.cs.rutgers.edu/∼mlittman/topics/nips05-mdp/bakeoffs05.pdf.

Edelman, A. 1992. Eigenvalue roulette and random test matrices. M.S. Moonen, G.H. Golub,
B. De Moor, eds., Linear Algebra for Large Scale and Real-Time Applications, NATO ASI ,
vol. 232. Springer, 365–368.

Efron, B., R. Tibshirani. 1993. An introduction to the bootstrap. Chapman and Hall, London.

Epstein, L., M. Schneider. 2003. Recursive multiple-priors. Journal of Economic Theory 113
1–13.

Ernst, D., P. Geurts, L. Wehenkel. 2005. Tree-based batch mode reinforcement learning. Journal
of Machine Learning Research 6 503–556.

Escudero, L.F. 2009. On a mixture of the fix-and-relax coordination and Lagrangian substitution
schemes for multistage stochastic mixed integer programming. Top 17 5–29.

Evans, M., T. Swartz. 1995. Methods for approximating integrals in statistics with special
emphasis on Bayesian integration problems. Statistical Science 10 254–272.

Facchinei, F., A. Fischer, C. Kanzow. 1998. On the accurate identification of active constraints.
SIAM Journal on Optimization 9 14–32.

Facchinei, F., J.-S. Pang. 2003. Finite-dimensional variational inequalities and complementary
problems. Springer. Published in two volumes, paginated continuously.

Fisher, R.A. 1925. Theory of statistical estimation. Proceedings of the Cambridge Philosophical
Society, vol. 22. 700–725.

Frauendorfer, K. 1996. Barycentric scenario trees in convex multistage stochastic programming.


Mathematical Programming 75 277–294.

Freund, Y., R.E. Schapire. 1996. Experiments with a new boosting algorithm. Proceedings of
the Thirteenth International Conference on Machine Learning (ICML-1996). 148–156.

Gale, D., V. Klee, R.T. Rockafellar. 1968. Convex functions on convex polytopes. Proceedings
of the American Mathematical Society, vol. 19. 867–873.

Garstka, S.J., R.J.-B. Wets. 1974. On decision rules in stochastic programming. Mathematical
Programming 7(1) 117–143.

Gassmann, H.I., A. Prékopa. 2005. On stages and consistency checks in stochastic programming.
Operations Research Letters 33 171–175.

Geurts, P., D. Ernst, L. Wehenkel. 2006. Extremely randomized trees. Machine Learning 63
3–42.

Good, I.J., R.A. Gaskins. 1971. Non-parametric roughness penalties for probability densities.
Biometrika 58 255–277.

Haff, L.R. 1980. Empirical Bayes estimation of the multivariate normal covariance matrix. The
Annals of Statistics 8 586–597.

Hammond, P. J. 1976. Changing tastes and coherent dynamic choice. The Review of Economic
Studies 43 159–173.

Hartland, C., S. Gelly, N. Baskiotis, O. Teytaud, M. Sebag. 2006. Multi-armed bandit, dy-
namic environments and meta-bandits. P. Auer, N. Cesa-Bianchi, Z. Hussain, L. Newnham,
J. Shawe-Taylor, eds., NIPS-06 workshop on on-line trading of exploration and exploitation.

Hasselblad, V. 1966. Estimation of parameters for a mixture of normal distributions. Techno-


metrics 8 431–444.

Haussler, D. 1999. Convolution kernels on discrete structures. Tech. rep., University of California
at Santa Cruz.

Heitsch, H., W. Römisch. 2003. Scenario reduction algorithms in stochastic programming.


Computational Optimization and Applications 24 187–206.

Heitsch, H., W. Römisch. 2009. Scenario tree modeling for multistage stochastic programs.
Mathematical Programming 118(2) 371–406.

Higle, J.L., S. Sen. 1991. Stochastic decomposition: An algorithm for two stage stochastic linear
programs with recourse. Mathematics of Operations Research 16 650–669.

Hilli, P., T. Pennanen. 2008. Numerical study of discretizations of multistage stochastic pro-
grams. Kybernetika 44 185–204.

Hochreiter, R., G.Ch. Pflug. 2007. Financial scenario generation for stochastic multi-stage
decision processes as facility location problems. Annals of Operations Research 152 257–272.

Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. Journal of
the Americal Statistical Association 58 13–30.

Hoffman, A.J. 1952. On approximate solutions of systems of linear inequalities. Journal of


Research of the National Bureau of Standards 49 263–265.

Hofmann, T., B. Schölkopf, A. Smola. 2008. Kernel methods in machine learning. The Annals
of Statistics 36(3) 1171–1220.

Howard, R.A. 1960. Dynamic Programming and Markov Processes. MIT Press.

Howard, R.A., J. Matheson. 1972. Risk-sensitive Markov Decision Processes. Management


Science 18(7) 356–369.

Høyland, K., M. Kaut, S.W. Wallace. 2003. A heuristic for moment-matching scenario genera-
tion. Computational Optimization and Applications 24 1573–2894.

Høyland, K., S. Wallace. 2001. Generating scenario trees for multistage decision problems.
Management Science 47(2) 295–307.

Huang, K., S. Ahmed. 2009. The value of multistage stochastic programming in capacity plan-
ning under uncertainty. Operations Research 57 893–904.

Huber, P.J. 1964. Robust estimation of a location parameter. The Annals of Mathematical
Statistics 35 73–101.

Infanger, G. 1992. Monte Carlo (importance) sampling within a Benders decomposition algo-
rithm for stochastic linear programs. Annals of Operations Research 39 69–95.

Itakura, F., S. Saito. 1968. Analysis synthesis telephony based on the maximum likelihood
method. Proceedings of the Sixth International Congress on Acoustics. C17–20.

Jeffreys, H. 1939. Theory of Probability. Oxford University Press.

Kallrath, J., P.M. Pardalos, S. Rebennack, M. Scheidt, eds. 2009. Optimization in the Energy
Industry. Springer.

Kearns, M.J., Y. Mansour, A.Y. Ng. 2002. A sparse sampling algorithm for near-optimal plan-
ning in large Markov Decision Processes. Machine Learning 49(2-3) 193–208.

Kimeldorf, G.S., G. Wahba. 1970. A correspondence between Bayesian estimation on stochastic


processes and smoothing by splines. The Annals of Mathematical Statistics 41 495–502.

Koivu, M., T. Pennanen. 2010. Galerkin methods in dynamic stochastic programming. Opti-
mization 59 339–354.

Koltchinskii, V., D. Panchenko. 2002. Empirical margin distributions and bounding the gener-
alization error of combined classifiers. The Annals of Statistics 30 1–50.

Kouwenberg, R. 2001. Scenario generation and stochastic programming models for asset liability
management. European Journal of Operational Research 134 279–292.

Küchler, C., S. Vigerske. 2010. Numerical evaluation of approximation methods in stochastic


programming. Optimization 59 401–415.

Kuhn, D. 2005. Generalized Bounds for convex multistage Stochastic Programs, Lecture Notes
in Economics and Mathematical Systems, vol. 548. Springer.

Kydland, F.E., E.C. Prescott. 1977. Rules rather than discretion: The inconsistency of optimal
plans. The Journal of Political Economy 85 473–492.

Kyparisis, J. 1985. On uniqueness of Kuhn Tucker multipliers in nonlinear programming. Math-


ematical Programming 32 242–246.

Lagoudakis, M.G., R. Parr. 2003. Reinforcement learning as classification: leveraging mod-


ern classifiers. Proceedings of the Twentieth International Conference on Machine Learning
(ICML-2003). 424–431.

Lanckriet, G., N. Cristianini, P. Bartlett, L.E. Ghaoui, M.I. Jordan. 2004. Learning the kernel
matrix with semidefinite programming. Journal of Machine Learning Research 5 27–72.

Langford, J., B. Zadrozny. 2005. Relating reinforcement learning performance to classifica-


tion performance. Proceedings of the Twenty-Second International Conference on Machine
Learning (ICML-2005). 473–480.

Lehto, O. 1952. Some remarks on the kernel functions in Hilbert spaces. Annales Academiae
Scientiarium, Fennicae Sereie A 109 3–6.

Li, J.Q., A.R. Barron. 2000. Mixture density estimation. Advances in Neural Information
Processing Systems 12 (NIPS-2000). 279–285.

Littlestone, N., M.K. Warmuth. 1989. The weighted majority algorithm. Proceedings of the
Thirtieth Annual Symposium on Foundations of Computer Science. 256–261.

Littman, M.L., T.L. Dean, L.P. Kaelbling. 1995. On the complexity of solving Markov Decision
Problems. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence
(UAI-1995). 394–402.

MacKay, D.J.C. 2003. Information Theory, Inference and Learning Algorithms. Cambridge
University Press.

MacKay, M.D., R.J. Beckman, W.J. Conover. 1979. A comparison of three methods for selecting
values of input variables in the analysis of output from a computer code. Technometrics 21
239–245.

Madigan, D., J. York. 1995. Bayesian graphical models for discrete data. International Statistical
Review 63 215–232.

Mahalanobis, P.C. 1936. On the generalized distance in statistics. Proceedings of the National
Institute of Sciences of India, vol. 12. 49–55.

Mak, Wai-Kei, D.P. Morton, R. Kevin Wood. 1999. Monte Carlo bounding techniques for
determining solution quality in stochastic programs. Operations Research Letters 24(1-2)
47–56.

Mardia, K.V., R.J. Marshall. 1984. Maximum likelihood estimation of models for residual
covariance in spatial regression. Biometrika 71 135–146.

Martin, D.H. 1975. On the continuity of the maximum in parametric linear programming.
Journal of Optimization Theory and Applications 17 205–210.

McDiarmid, C. 1989. On the method of bounded differences. Surveys in Combinatorics 141


148–188.

Mease, D., A. Wyner. 2008. Evidence contrary to the statistical view of boosting (with responses
and rejoinder). Journal of Machine Learning Research 131–201.

Mendelson, B. 1990. Introduction to Topology. 3rd ed. Dover.

Mercer, J. 1909. Functions of positive and negative type, and their connection with the theory
of integral equations. Philosophical Transactions of the Royal Society of London, Series A
209 415–446.

Mercier, L., P. Van Hentenryck. 2007. Performance analysis of online anticipatory algorithms
for large multistage stochastic integer programs. Proceedings of the Twentieth International
Joint Conference on Artificial Intelligence (IJCAI-07). 1979–1984.

Metropolis, N., A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller. 1953. Equation of
state calculations by fast computing machines. Journal of Chemical Physics 21 1087–1092.

Metropolis, N., S. Ulam. 1949. The Monte Carlo method. Journal of the American Statistical
Association 44(247) 335–341.

Micchelli, C.A., M. Pontil. 2005. Learning the kernel function via regularization. Journal of
Machine Learning Research 6 1099–1125.

Morimura, T., M. Sugiyama, H. Kashima, H. Hachiya, T. Tanaka. 2010. Parametric return
density estimation for reinforcement learning. Proceedings of the Twenty-Sixth Conference on
Uncertainty in Artificial Intelligence (UAI-2010).

Mundhenk, M., J. Goldsmith, C. Lusena, E. Allender. 2000. Complexity of finite-horizon Markov
decision process problems. Journal of the ACM 47 681–720.

Murty, K. 1980. Computational complexity of parametric linear programming. Mathematical
Programming 19 213–219.

Neal, R.M. 1993. Probabilistic inference using Markov Chain Monte Carlo methods. Tech. Rep.
CRG-TR-93-1, University of Toronto.

Neal, R.M. 1997. Monte Carlo implementation of Gaussian Process models for Bayesian regres-
sion and classification. Tech. Rep. 9702, University of Toronto.

Neal, R.M. 2010. MCMC using Hamiltonian dynamics. S. Brooks, A. Gelman, G. Jones, X.-L.
Meng, eds., Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Press.

Nemirovski, A., A. Juditsky, G. Lan, A. Shapiro. 2009. Stochastic approximation approach to
stochastic programming. SIAM Journal on Optimization 19 1574–1609.

Nesterov, Y. 2007. Gradient methods for minimizing composite objective function. CORE
discussion paper 76, Catholic University of Louvain.

Nesterov, Y., J.-Ph. Vial. 2008. Confidence level solutions for stochastic programming. Auto-
matica 44 1559–1568.

Nickisch, H., C.E. Rasmussen. 2008. Approximations for binary Gaussian process classification.
Journal of Machine Learning Research 9 2035–2078.

Norkin, V.I., Y.M. Ermoliev, A. Ruszczyński. 1998a. On optimal allocation of indivisibles under
uncertainty. Operations Research 46 381–395.

Norkin, V.I., G.Ch. Pflug, A. Ruszczyński. 1998b. A branch and bound method for stochastic
global optimization. Mathematical Programming 83 425–450.

O’Hagan, A. 1978. Curve fitting and optimal design for prediction. Journal of the Royal
Statistical Society 40 1–42.

Pagès, G., J. Printems. 2003. Optimal quadratic quantization for numerics: the Gaussian case.
Monte Carlo Methods and Applications 9 135–166.

Patrinos, P., H. Sarimveis. 2010. A new algorithm for solving convex parametric quadratic
programs based on graphical derivatives of solution mappings. Automatica 46 1405–1418.

Pennanen, T. 2009. Epi-convergent discretizations of multistage stochastic programs via
integration quadratures. Mathematical Programming 116 461–479.

Pflug, G.Ch., W. Römisch. 2007. Modeling, Measuring and Managing Risk. World Scientific
Publishing Company.

Pollard, D. 1990. Empirical Processes: Theory and Applications, NSF-CBMS Regional Confer-
ence Series in Probability and Statistics, vol. 2. Institute of Mathematical Statistics.

Pollard, D. 2001. A User’s Guide to Measure Theoretic Probability. Cambridge University Press.

Powell, W.B. 2007. Approximate Dynamic Programming: Solving the Curses of Dimensionality.
Wiley.

Powell, W.B., H. Topaloglu. 2003. Stochastic programming in transportation and logistics.
A. Ruszczyński, A. Shapiro, eds., Stochastic Programming. Handbooks in Operations Research
and Management Science, vol. 10. Elsevier, 555–635.

Prékopa, A. 1970. On probabilistic constrained programming. H.W. Kuhn, ed., Proceedings of
the Princeton Symposium on Mathematical Programming. Princeton University Press, 113–138.

Prékopa, A. 1995. Stochastic Programming. Kluwer, Dordrecht, Boston.

Puterman, M.L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Wiley.

Rachelson, E., F. Schnitzler, L. Wehenkel, D. Ernst. 2010. Optimal sample selection for batch-
mode reinforcement learning. Submitted.

Rachev, S.T., W. Römisch. 2002. Quantitative stability in stochastic programming: The method
of probability metrics. Mathematics of Operations Research 27(4) 792–818.

Rahimi, A., B. Recht. 2008. Random features for large-scale kernel machines. Advances in
Neural Information Processing Systems 20 (NIPS-2007). 1177–1184.

Rahimi, A., B. Recht. 2009. Random kitchen sinks: Replacing optimization with randomization
in learning. Advances in Neural Information Processing Systems 21 (NIPS-2008). 1313–1320.

Raiffa, H., R. Schlaifer. 1961. Applied Statistical Decision Theory. Harvard University.

Rasmussen, C.E., C.K.I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press.

Riedel, F. 2004. Dynamic coherent risk measures. Stochastic Processes and their Applications
112 185–200.

Rissanen, J. 1978. Modeling by shortest data description. Automatica 14 465–471.

Robert, C.P. 2007. The Bayesian Choice. 2nd ed. Springer.

Rockafellar, R.T. 1970. Convex Analysis. Princeton University Press.

Rockafellar, R.T. 1974. Conjugate Duality and Optimization. CBMS-NSF Regional Conference
Series in Applied Mathematics, SIAM.

Rockafellar, R.T., S. Uryasev. 2000. Optimization of conditional value-at-risk. Journal of Risk
2(3) 21–41.

Rockafellar, R.T., R.J.-B. Wets. 1991. Scenarios and policy aggregation in optimization under
uncertainty. Mathematics of Operations Research 16 119–147.

Rockafellar, R.T., R.J.-B. Wets. 1998. Variational Analysis. 3rd ed. Springer.

Römisch, W., R.J.-B. Wets. 2007. Stability of ε-approximate solutions to convex stochastic
programs. SIAM Journal on Optimization 18 961–979.

Rubinstein, R.Y., D.P. Kroese. 2004. The Cross-Entropy Method. A Unified Approach to Combi-
natorial Optimization, Monte-Carlo Simulation, and Machine Learning. Information Science
and Statistics, Springer.

Rudin, W. 1976. Principles of Mathematical Analysis. 3rd ed. McGraw-Hill.

Rust, J. 1997. Using randomization to break the curse of dimensionality. Econometrica 65(3)
487–516.

Ruszczyński, A., A. Shapiro. 2006. Conditional risk mappings. Mathematics of Operations
Research 31 544–561.

Samuelson, P.A. 1937. A note on measurement of utility. The Review of Economic Studies 4(2)
155–161.

Schapire, R.E. 1990. The strength of weak learnability. Machine Learning 5(2) 197–227.

Schapire, R.E., P. Stone, D. McAllester, M.L. Littman, J.A. Csirik. 2002. Modeling auction
price uncertainty using boosting-based conditional density estimation. Proceedings of the
Nineteenth International Conference on Machine Learning (ICML-2002). 546–553.

Schoenberg, I.J. 1938. Metric spaces and completely monotone functions. Annals of Mathematics
39(4) 811–841.

Sen, S., R.D. Doverspike, S. Cosares. 1994. Network planning with random demand. Telecom-
munications Systems 3 11–30.

Shapiro, A. 2000. On the asymptotics of constrained local M-estimators. The Annals of Statistics
28 948–960.

Shapiro, A. 2003a. Inference of statistical bounds for multistage stochastic programming prob-
lems. Mathematical Methods of Operations Research 58(1) 57–68.

Shapiro, A. 2003b. Monte Carlo sampling methods. A. Ruszczyński, A. Shapiro, eds., Stochastic
Programming. Handbooks in Operations Research and Management Science, vol. 10. Elsevier,
353–425.

Shapiro, A. 2006. On complexity of multistage stochastic programs. Operations Research Letters
34(1) 1–8.

Shapiro, A. 2009. On a time-consistency concept in risk averse multistage stochastic
programming. Operations Research Letters 37 143–147.

Shapiro, A., D. Dentcheva, A. Ruszczyński. 2009. Lectures on Stochastic Programming: Modeling
and Theory. SIAM.

Shi, Q., J. Petterson, G. Dror, J. Langford, A. Smola, S.V.N. Vishwanathan. 2009. Hash kernels
for structured data. Journal of Machine Learning Research 10 2615–2637.

Shivaswamy, P., T. Jebara. 2010. Empirical Bernstein boosting. Proceedings of the Thirteenth
International Conference on Artificial Intelligence and Statistics (AISTATS-2010). JMLR
Workshop and Conference proceedings (vol. 9), 733–740.

Simard, P., Y. Le Cun, J. Denker, B. Victorri. 1998. Transformation invariance in pattern
recognition – Tangent distance and tangent propagation. G.B. Orr, K.-R. Müller, eds., Neural
Networks: Tricks of the Trade, vol. LNCS 1524. Springer, 239–274.
Networks: Tricks of the Trade, vol. LNCS 1524. Springer, 239–274.

Simon, H. 1956. Rational choice and the structure of the environment. Psychological Review 63
129–138.

Slyke, R. Van, R.J.-B. Wets. 1969. L-shaped linear programs with applications to optimal
control and stochastic programming. SIAM Journal on Applied Mathematics 17 638–663.

Solak, E., R. Murray-Smith, W.E. Leithead, D.J. Leith, C.E. Rasmussen. 2003. Derivative
observations in Gaussian Process models of dynamic systems. Advances in Neural Information
Processing Systems 15 (NIPS-2002). 1033–1040.

Sonnenburg, S., G. Rätsch, C. Schäfer, B. Schölkopf. 2006. Large scale multiple kernel learning.
Journal of Machine Learning Research 7 1531–1565.

Speed, T.P., H.T. Kiiveri. 1986. Gaussian Markov distributions over finite graphs. The Annals
of Statistics 14 138–150.

Spielman, D., S.-H. Teng. 2004. Smoothed analysis: Why the simplex algorithm usually takes
polynomial time. Journal of the Association for Computing Machinery 51 385–463.

Spjøtvold, J., P. Tøndel, T.A. Johansen. 2007. Continuous selection and unique polyhedral
representation of solutions to convex parametric quadratic programs. Journal of Optimization
Theory and Applications 134 177–189.

Stein, C. 1956. Inadmissibility of the usual estimator for the mean of a multivariate distribution.
Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability,
vol. 1. 197–206.

Steinwart, I., A. Christmann. 2008. Support Vector Machines, chap. Kernels and Reproducing
Kernel Hilbert Spaces. Springer, 111–164.

Stone, M. 1974. Cross-validatory choice and assessment of statistical predictions. Journal of the
Royal Statistical Society 36 111–147.

Strotz, R.H. 1955. Myopia and inconsistency in dynamic utility maximization. The Review of
Economic Studies 23 165–180.

Sutton, R.S., A.G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.

Tikhonov, A.N., V.Y. Arsenin. 1977. Solutions of Ill-Posed Problems. W.H. Winston and Sons
(distributed by Wiley).

Tøndel, P., T. Johansen, A. Bemporad. 2003. Evaluation of piecewise affine control via binary
search tree. Automatica 39 945–950.

Van Hentenryck, P., R. Bent. 2006. Online Stochastic Combinatorial Optimization. MIT Press.

Vapnik, V.N. 1998. Statistical Learning Theory. Wiley.

von Neumann, J., O. Morgenstern. 1947. Theory of Games and Economic Behavior. 2nd ed.
Princeton University Press.

Wainwright, M.J., M.I. Jordan. 2008. Graphical models, exponential families, and variational
inference. Foundations and Trends in Machine Learning 1 1–305.

Walker, A.M. 1969. On the asymptotic behaviour of posterior distributions. Journal of the
Royal Statistical Society 31 80–88.

Walkup, D., R.J.-B. Wets. 1967. Stochastic programs with recourse. SIAM Journal on Applied
Mathematics 15 1299–1314.

Walkup, D., R.J.-B. Wets. 1969a. Lifting projections of convex polyhedra. Pacific Journal of
Mathematics 28 465–475.

Walkup, D., R.J.-B. Wets. 1969b. Stochastic programs with recourse II: On the continuity of
the objective. SIAM Journal on Applied Mathematics 17 98–103.

Wallace, S.W., R.J.-B. Wets. 1992. Preprocessing in stochastic programming: The case of linear
programs. ORSA Journal on Computing 4 45–49.

Wets, R.J.-B. 1974. Stochastic programs with fixed recourse: The equivalent deterministic
program. SIAM Review 16 309–339.

Wets, R.J.-B., C. Witzgall. 1967. Algorithms for frames and lineality spaces of cones. Journal
of Research of the National Bureau of Standards 71 1–7.

Weyl, H. 1935. Elementare Theorie der konvexen Polyeder. Commentarii Mathematici Helvetici
7 290–306.

White, H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50 1–25.

Yosida, K. 1980. Functional Analysis. 6th ed. Springer.

Zhang, Z., M.I. Jordan, D.-Y. Yeung. 2009. Posterior consistency of the Silverman g-prior in
Bayesian model choice. Advances in Neural Information Processing Systems 21 (NIPS-2008).
1969–1976.
Index

ε-optimal solution, 111, 147
accumulation point, 145
bagging, 35, 42
boosting, 35, 78
boundary of a set, 144
branching process, 9, 65
Bregman divergence, 84
cell, 120, 122
classifier, 120
closure
    of a function, 147
    of a set, 144
cluster point, 145
complete recourse, 173
conditional sampling, 24
continuity, 146
    of convex functions, 153
convergence
    of optimal values, 149
    of sets, 148
    pointwise, 150
    uniform, 151
convexity, 14, 152
    in the random variables, 94
    of the value function, 176
    strict, 107, 152
cross-entropy, 30
    method, 41
cumulative distribution function (cdf), 164
curse of dimensionality, 14
decision stage, 7
decomposable space, 164
distribution
    of a random vector, 164
    problem, 17, 19
dual feasibility, 173
effective domain, 146
elite samples, 41, 46
ensemble methods, 34
epi-convergence, 149
    for function-valued mappings, 152
    from pointwise convergence, 150
    of convex functions, 153
epigraph, 111, 147
exogenous stochastic process, 13
expected value, 162
    of perfect information (EVPI), 17
    problem, 16
exponential
    family of distributions, 80
    utility function, 63, 68, 70
extended-real-valued function, 111, 143
feasibility set, 7
feature map, 169
filtration, 159
fixed recourse, 172
Gaussian process, 86, 117
graph, 147
Hilbert space, 166
ill-posed problem, 52
infimum, 143
inner product, 166
interior of a set, 144
Itakura-Saito distance, 84
kernel
    as covariance function, 86
    for decisions, 42
    for disturbances, 39
    from feature map, 169
    positive definite, 168, 169
    reproducing, 167
Lebesgue space Lp, 163
Lipschitz continuous, 153
    locally, 153, 175
lower
    level-bounded, 146, 151
    semicontinuous (l.s.c.), 146, 147
Mahalanobis distance, 81
Markov decision process (MDP), 14
maximum a posteriori (MAP), 32
maximum likelihood, 30, 41
measurable, 158
    selection, 160
    set-valued mapping, 160
    space, 156
metric, 165
minimum, 144
    attained, 146, 151
model predictive control (MPC), 16, 97
model selection, 34
    for policies, 59
    for trees, 73
moment matching, 24
Monte Carlo simulation
    of a policy, 26
neural networks, 69
nominal plan, 5, 16, 88
non-anticipativity, 11
normal
    cone, 108
    integrand, 161, 163
parametric optimization, 106, 119, 151
policy search, 15, 27
polyhedral
    function, 175
    set, 105, 173
probability metrics method, 24, 39
probability space, 155
progressive hedging algorithm, 18, 48
proper, 146, 147
quadratic distortion, 65, 93
random
    function, 161
    set, 159
    vector, 158
regularization, 31
relatively complete recourse, 173
repair procedure, 53, 102
robust optimization, 6
sample average approximation (SAA), 24
scenario tree, 9
    generation, 22
    incomplete, 38, 45
    with random branching structure, 61
self-improving algorithm, 107
set-valued mapping, 157
shrinking-horizon policy, 26, 77, 89
    complexity, 57, 97
sigma-algebra, 155
    Borel, 156
    generated, 159, 172
    product, 157
stationary quantizer, 93
supremum, 144
    essential, 163
tangent cone, 111
time consistency, 75
two-stage
    approximation, 16, 19, 98
    stochastic program, 7, 171
uncertainty set, 6
value
    function, 175
    of multistage stochastic programming (VMS), 17, 19, 98
    of the stochastic solution (VSS), 16
Wasserstein distance, 66
