Mixed-Integer Optimization With Constraint Learning
Donato Maragno*
Amsterdam Business School, University of Amsterdam, 1018 TV Amsterdam, Netherlands
Holly Wiberg*
Operations Research Center, Massachusetts Institute of Technology, Cambridge MA 02139
arXiv:2111.04469v3 [math.OC] 26 Oct 2023
Dimitris Bertsimas
Sloan School of Management, Massachusetts Institute of Technology, Cambridge MA 02139
We establish a broad methodological foundation for mixed-integer optimization with learned constraints.
We propose an end-to-end pipeline for data-driven decision making in which constraints and objectives are
directly learned from data using machine learning, and the trained models are embedded in an optimization
formulation. We exploit the mixed-integer optimization-representability of many machine learning methods,
including linear models, decision trees, ensembles, and multi-layer perceptrons, which allows us to capture
various underlying relationships between decisions, contextual variables, and outcomes. We also introduce
two approaches for handling the inherent uncertainty of learning from data. First, we characterize a decision
trust region using the convex hull of the observations, to ensure credible recommendations and avoid
extrapolation. We efficiently incorporate this representation using column generation and propose a more flexible
formulation to deal with low-density regions and high-dimensional datasets. Then, we propose an ensemble
learning approach that enforces constraint satisfaction over multiple bootstrapped estimators or multiple
algorithms. In combination with domain-driven components, the embedded models and trust region define a
mixed-integer optimization problem for prescription generation. We implement this framework as a Python
package (OptiCL) for practitioners. We demonstrate the method in both World Food Programme planning
and chemotherapy optimization. The case studies illustrate the framework’s ability to generate high-quality
prescriptions as well as the value added by the trust region, the use of ensembles to control model robustness,
the consideration of multiple machine learning methods, and the inclusion of multiple learned constraints.
Key words: mixed-integer optimization, machine learning, constraint learning, prescriptive analytics
Maragno et al.: Mixed-integer Optimization with Constraint Learning
1. Introduction
Mixed-integer optimization (MIO) is a powerful tool that allows us to optimize a given objective
subject to various constraints. This general problem statement of optimizing under constraints is
nearly universal in decision-making settings. Some problems have readily quantifiable and explicit
objectives and constraints, in which case MIO can be directly applied. The situation becomes more
complicated, however, when the constraints and/or objectives are not explicitly known.
For example, suppose we deal with cancerous tumors and want to prescribe a treatment regimen
with a limit on toxicity; we may have observational data on treatments and their toxicity outcomes,
but we have no natural function that relates the treatment decision to its resultant toxicity. We
may also encounter constraints that are not directly quantifiable. Consider a setting where we
want to recommend a diet, defined by a combination of foods and quantities, that is sufficiently
“palatable.” Palatability cannot be written as a function of the food choices, but we may have
qualitative data on how well people “like” various potential dietary prescriptions. In both of these
examples, we cannot directly represent the outcomes of interest as functions of our decisions, but
we have data that relates the outcomes and decisions. This raises a question: how can we consider
In this work, we tackle the challenge of data-driven decision making through a combined machine
learning (ML) and MIO approach. ML allows us to learn functions that relate decisions to outcomes
of interest directly through data. Importantly, many popular ML methods result in functions that
are MIO-representable, meaning that they can be embedded into MIO formulations. This MIO-
representable class includes both linear and nonlinear models, allowing us to capture a broad
set of underlying relationships in the data. While the idea of learning functions directly from
data is core to the field of ML, data is often underutilized in MIO settings due to the need
for functional relationships between decision variables and outcomes. We seek to bridge this gap
through constraint learning; we propose a general framework that allows us to learn constraints
and objectives directly from data, using ML, and to optimize decisions accordingly, using MIO.
Once the learned constraints have been incorporated into the larger MIO, we can solve the problem
The term constraint learning, used several times throughout this work, captures both constraints
and objective functions. We are fundamentally learning functions to relate our decision variables
to the outcome(s) of interest. The predicted values can then either be incorporated as constraints
or objective terms; the model learning and embedding procedures remain largely the same. For
this reason, we refer to them both under the same umbrella of constraint learning. We describe
Previous work has demonstrated the use of various ML methods in MIO problems and their utility
in different application domains. The simplest of these methods is the regression function, as the
approach is easy to understand and easy to implement. Given a regression function learned from
data, the process of incorporating it into an MIO model is straightforward, and the final model
does not require complex reformulations. As an example, Bertsimas et al. (2016) use regression
models and MIO to develop new chemotherapy regimens based on existing data from previous
More complex ML models have also been shown to be MIO-representable, although more effort
is required to represent them than simple regression models. Neural networks which use the ReLU
activation function can be represented using binary variables and big-M formulations (Amos et al.
2016, Grimstad and Andersson 2019, Anderson et al. 2020, Chen et al. 2020, Spyros 2020, Venzke
et al. 2020). Where other activation functions are used (Gutierrez-Martinez et al. 2011, Lombardi
et al. 2017, Schweidtmann and Mitsos 2019), the MIO representation of neural networks is still
With decision trees, each path in the tree from root to leaf node can be represented using
one or more constraints (Bonfietti et al. 2015, Verwer et al. 2017, Halilbasic et al. 2018). The
number of constraints required to represent decision trees is a function of the tree size, with larger
trees requiring more linearizations and binary variables. The advantage here, however, is that
decision trees are known to be highly interpretable, which is often a requirement of ML in critical
application settings (Thams et al. 2017). Random forests (Biggs et al. 2021, Mišić 2020) and other
tree ensembles (Cremer et al. 2019) have also been used in MIO in the same way as decision trees,
with one set of constraints for each tree in the forest/ensemble along with one or more additional
aggregate constraints.
Data for constraint learning can either contain information on continuous data, feasible and
infeasible states (two-class data), or only one state (one-class data). The problem of learning
functions from one-class data and embedding them into optimization models has been recently
investigated with the use of decision trees (Kudla and Pawlak 2018), genetic programming (Pawlak
and Krawiec 2019), local search (Sroka and Pawlak 2018), evolutionary strategies (Pawlak 2019),
and a combination of clustering, principal component analysis and wrapping ellipsoids (Pawlak
The above selected applications generally involve a single function to be learned and a fixed ML
method for the model choice. Verwer et al. (2017) use two model classes (decision trees and linear
models) in a specific auction design application, but in this case the models were determined a
priori. Some authors have presented a more general framework of embedding learned ML models
in optimization problems such as JANOS (Bergman et al. 2022) and EML (Lombardi et al. 2017),
but in practice these works are restricted to limited problem structures and learned model classes.
the full ML and optimization components of a data-driven decision making problem. In contrast to
EML and JANOS, OptiCL supports a wider variety of predictive models: neural networks (with
ReLU), linear regression, logistic regression, decision trees, random forests, gradient boosted trees
and linear support vector machines. OptiCL is also more flexible than JANOS, as it can handle
predictive models as constraints, and it also incorporates new concepts to deal with uncertainty in
the ML models. A comparison of OptiCL against JANOS and EML on two test problems is shown
in Appendix E.
Our work falls under the umbrella of prescriptive analytics. Bertsimas and Kallus (2020) and
Elmachtoub and Grigas (2021) leverage ML model predictions as inputs into an optimization
problem. Our approach is distinct from existing work in that we directly embed ML models rather than
extracting predictions, allowing us to optimize our decisions over the model. In the broadest sense,
our framework relates to work that jointly harnesses ML and MIO, an area that has garnered
significant interest in recent years in both the optimization and machine learning communities (Bengio
et al. 2021).
1.2. Contributions
Our work unifies several research areas in a comprehensive manner. Our key contributions are as
follows:
1. We develop an end-to-end framework that takes data and directly implements model training,
model selection, integration into a larger MIO, and ultimately optimization. We make this
vide a practitioner-friendly tool for making better data-driven decisions. The code is available
mization pipeline with the goal of being accessible to end users as well as extensible by technical
researchers. Our framework natively supports models for both regression and classification
functions and handles constraint learning in cases with both one-class and two-class data. We
implement a cross-validation procedure for function learning that selects from a broad set of
model classes. We also implement the optimization procedure in the generic mathematical
modeling library Pyomo, which supports various state-of-the-art solvers. We introduce two
approaches for handling the inherent uncertainty when learning from data. First, we propose
an ensemble learning approach that enforces constraint satisfaction over an ensemble of mul-
tiple bootstrapped estimators or multiple algorithms, yielding more robust solutions. This
on a single point prediction: in the case of learned constraints, model misspecification can lead
to infeasibility. Additionally, we restrict solutions to lie within a trust region, defined as the
domain of the training data, which leads to better performance of the learned constraints. We
offer several improvements to a basic convex hull formulation, including a clustering heuristic
and a column selection algorithm that significantly reduce computation time. We also pro-
pose an enlargement of the convex hull which allows for exploration of solutions outside of
the observed bounds. Both the ensemble model wrapper and trust region enlargement are
controlled by parameters that allow an end user to directly trade off the conservativeness of
2. We demonstrate the power of our method in two real-world case studies, using data from
the World Food Programme and chemotherapy clinical trials. We pose relevant questions
in the respective areas and formalize them as constraint learning problems. We implement
our framework and subsequently evaluate the quantitative performance and scalability of our
mation w̄i , and outcomes of interest ȳi for sample i. Following the guidelines proposed in Fajemisin
et al. (2021), we present a framework that, given data D, learns functions for the outcomes of
interest (y) that are to be constrained or optimized. These learned representations can then be
used to generate predictions for a new observation with context w. Figure 1 outlines the complete
Given the decision variable x ∈ Rn and the fixed feature vector w ∈ Rp , we propose model M(w)
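$$
\begin{aligned}
\min_{x \in \mathbb{R}^n,\, y \in \mathbb{R}^k} \quad & f(x, w, y) \\
\text{s.t.} \quad & g(x, w, y) \le 0, \\
& y = \hat{h}_D(x, w), \\
& x \in \mathcal{X}(w),
\end{aligned}
\tag{1}
$$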
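where $f(\cdot, w, \cdot) : \mathbb{R}^{n+k} \mapsto \mathbb{R}$, $g(\cdot, w, \cdot) : \mathbb{R}^{n+k} \mapsto \mathbb{R}^m$, and $\hat{h}_D(\cdot, w) : \mathbb{R}^n \mapsto \mathbb{R}^k$. Explicit forms of f and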
g are known but they may still depend on the predicted outcome y. Here, ĥD (x, w) represents
the predictive models, one per outcome of interest, which are ML models trained on D. Although
our subsequent discussion mainly revolves around linear functions, we acknowledge the significant
progress in nonlinear (convex) integer solvers. Our discussion can be easily extended to nonlinear
We note that the embedding of a single learned outcome may require multiple constraints and
auxiliary variables; the embedding formulations are described in Section 2.2. For simplicity, we
omit D in further notation of ĥ but note that all references to ĥ implicitly depend on the data used
to train the model. Finally, the set X (w) defines the trust region, i.e., the set of solutions for which
we trust the embedded predictive models. In Section 2.3, we provide a detailed description of how
the trust region X (w) is obtained from the observed data. We refer to the final MIO formulation
Model M(w) is quite general and encompasses several important constraint learning classes:
1. Regression. When the trained model results from a regression problem, it can be constrained
which the data is labeled as “feasible” (1) or “infeasible” (0), then the prediction is generally
a probability y ∈ [0, 1]. We can enforce a lower bound on the feasibility probability, i.e., y ≥ τ .
A natural choice of τ is 0.5, which can be interpreted as enforcing that the result is more likely
feasible than not. This can also extend to the multi-class setting, say k classes, in which the
output y is a k-dimensional unit vector, and we apply the constraint yi ≥ τ for whichever class
i is desired. When multiple classes are considered to be feasible, we can add binary variables
to ensure that a solution is feasible only if it falls in one of these classes with sufficiently high
probability.
3. Objective function. If the objective function has a term that is also learned by training
an ML model, then we can introduce an auxiliary variable t ∈ R, and add it to the objective
function along with an epigraph constraint. Suppose for simplicity that the model involves
a single learned objective function, ĥ, and no learned constraints. Then the general model
becomes
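$$
\begin{aligned}
\min_{x \in \mathbb{R}^n,\, y \in \mathbb{R},\, t \in \mathbb{R}} \quad & t \\
\text{s.t.} \quad & g(x, w) \le 0, \\
& y = \hat{h}(x, w), \\
& y - t \le 0, \\
& x \in \mathcal{X}(w).
\end{aligned}
$$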
Although we have rewritten the problem to show the generality of our model, it is quite
common in practice to use y in the objective and omit the auxiliary variable t.
We observe that constraints on learned outcomes can be applied in two ways depending on the
model training approach. Suppose that we have a continuous scalar outcome y to learn and we
want to impose an upper bound of τ ∈ R (it may also be a lower bound without loss of generality).
The first approach is called function learning and concerns all cases where we learn a regression
function ĥ(x, w) without considering the feasibility threshold (τ ). The resultant model returns a
predicted value y ∈ R. The threshold is then applied as a constraint in the optimization model
as y ≤ τ . Alternatively, we could use the feasibility threshold τ to binarize the outcome of each
sample in D into feasible and infeasible, that is ȳi := I(ȳi ≤ τ ), i = 1, . . . , N , where I stands for
the indicator function. After this relabeling, we train a binary classification model ĥ(x, w) that
returns a probability y ∈ [0, 1]. This approach, called indicator function learning, does not require
any further use of the feasibility threshold τ in the optimization model, since the predictive models
The function learning approach is particularly useful when we are interested in varying the
threshold τ as a model parameter. Additionally, if the fitting process is expensive and therefore
difficult to perform multiple times, learning an indicator function for each potential τ might be
infeasible. In contrast, the indicator function learning approach is necessary when the raw data
contains binary labels rather than continuous outcomes, and thus we have no ability to select or
vary τ .
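The difference between the two approaches can be illustrated with a minimal Python sketch. The helper name `binarize_outcomes` and the toxicity values are hypothetical and are not taken from the paper or from OptiCL; the sketch only shows the relabeling step ȳi := I(ȳi ≤ τ) and why indicator function learning ties the trained model to one particular τ.

```python
# Toy illustration with made-up data: relabeling outcomes for
# indicator function learning.

def binarize_outcomes(y_observed, tau):
    """Relabel each outcome as 1 (feasible) if y <= tau, else 0 (infeasible)."""
    return [1 if y <= tau else 0 for y in y_observed]

# Hypothetical observed toxicity outcomes for four treatments.
toxicity = [0.2, 0.9, 0.5, 1.4]

# Function learning: a single regression model would be fit to `toxicity`
# once, and any threshold tau is imposed afterwards as the constraint y <= tau.
# Indicator function learning: the labels, and hence the classifier that is
# trained on them, change with every candidate tau.
labels_by_tau = {tau: binarize_outcomes(toxicity, tau) for tau in (0.5, 1.0)}

for tau, labels in labels_by_tau.items():
    print(f"tau={tau}: labels={labels}")
```

Each entry of `labels_by_tau` would require its own classifier, which is the cost that function learning avoids when τ is varied.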
Our framework is enabled by the ability to embed learned predictive models into an MIO
formulation with linear constraints. This is possible for many classes of ML models, ranging from linear
models to ensembles, and from support vector machines to neural networks. In this section, we
outline the embedding procedure for decision trees, tree ensembles, and neural networks to
illustrate the approach. We include additional technical details and formulations for these methods,
In all cases, the model has been pre-trained; we embed the trained model ĥ(x, w) into our larger
MIO formulation to allow us to constrain or optimize the resultant predicted value. Consequently,
the optimization model is not dependent on the complexity of the model training procedure, but
solely the size of the final trained model. Without loss of generality, we assume that y is one-
dimensional; i.e., we are learning a single model, and this model returns a scalar, not a multi-output
vector.
All of the methods below can be used to learn constraints that apply upper or lower bounds
to y, or to learn y that we incorporate as part of the objective. We present the model embedding
procedure for both cases when ĥ(x, w) is a continuous or a binary predictive model, where relevant.
We assume that either regression or classification models can be used to learn feasibility constraints,
Decision Trees. Decision trees partition observations into distinct leaves through a series of
feature splits. These algorithms are popular in predictive tasks due to their natural interpretability
and ability to capture nonlinear interactions among variables. Breiman et al. (1984) first introduced
Classification and Regression Trees (CART), which constructs trees through parallel splits in the
feature space. Decision tree algorithms have subsequently been adapted and extended. Bertsimas
and Dunn (2017) propose an alternative decision tree algorithm, Optimal Classification Trees
(and Optimal Regression Trees), that improves on the basic decision tree formulation through
an optimization framework that approximates globally optimal trees. Optimal trees also support
multi-feature splits, referred to as hyper-plane splits, that allow for splits on a linear combination
inequality $A_i^\top x \le b_i$. We assume that $A_i$ can have multiple non-zero elements, in which case we have
the hyper-plane split setting; if there is only one non-zero element, this creates a parallel (single
feature) split. Each terminal node j (i.e., leaf) yields a prediction ($p_j$) for its observations. In the
case of regression, the prediction is the average value of the training observations in the leaf, and in
binary classification, the prediction is the proportion of leaf members with the feasible class. Each
leaf can be described as a polyhedron, namely a set of linear constraints that must be satisfied by
all leaf members. For example, for node 3, we define $P_3 = \{x : A_1^\top x \le b_1,\ A_2^\top x \le b_2\}$.
[Figure 2: Decision tree with root Node 1 (split $A_1^\top x \le b_1$), whose True branch leads to Node 2 (split $A_2^\top x \le b_2$) and whose False branch leads to Node 5 (split $A_5^\top x \le b_5$).]
Suppose that we wish to constrain the predicted value of this tree to be at most τ , a fixed
constant. After obtaining the tree in Figure 2, we can identify which paths satisfy the desired bound
(pi ≤ τ ). Suppose that p3 and p6 do satisfy the bound, but p4 and p7 do not. In this case, we can
enforce that our solution belongs to P3 or P6 . This same approach applies if we only have access to
two-class data (feasible vs. infeasible); we can directly train a binary classification algorithm and
enforce that the solution lies within one of the “feasible” prediction leaves (determined by a set
probability threshold).
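The leaf selection step described above can be sketched in a few lines of Python. The nested-dictionary tree representation, the function name `collect_leaves`, and the leaf predictions are illustrative assumptions; only the layout (splits at nodes 1, 2, and 5, leaves 3, 4, 6, and 7) follows the example in the text. Each leaf is returned with the sequence of split outcomes that defines its polyhedron.

```python
# Sketch: enumerate the leaves of a small tree and keep those whose
# prediction satisfies p_j <= tau. Leaf predictions are made-up numbers.

def collect_leaves(node, path=()):
    """Return {leaf_id: (prediction, path)}.

    `path` lists (split_node_id, sense) pairs: "<=" means the split
    A_i^T x <= b_i holds (True branch), ">" means it is violated.
    """
    if "prediction" in node:
        return {node["id"]: (node["prediction"], list(path))}
    leaves = {}
    leaves.update(collect_leaves(node["true"], path + ((node["id"], "<="),)))
    leaves.update(collect_leaves(node["false"], path + ((node["id"], ">"),)))
    return leaves

tree = {
    "id": 1,
    "true": {"id": 2,
             "true": {"id": 3, "prediction": 0.3},
             "false": {"id": 4, "prediction": 0.8}},
    "false": {"id": 5,
              "true": {"id": 6, "prediction": 0.4},
              "false": {"id": 7, "prediction": 0.9}},
}

tau = 0.5
leaves = collect_leaves(tree)
feasible_leaves = {j: path for j, (p, path) in leaves.items() if p <= tau}
print(feasible_leaves)
```

With these numbers, leaves 3 and 6 are retained, matching the case discussed in the text. In an actual MIO the strict inequalities on the False branches would typically have to be replaced by non-strict ones with a small tolerance; that detail is outside this sketch.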
If the decision tree provides our only learned constraint, we can decompose the problem into
multiple separate MIOs, one per feasible leaf. The conceptual model for the subproblem of leaf i
then becomes
$$
\begin{aligned}
\min_{x} \quad & f(x, w) \\
\text{s.t.} \quad & g(x, w) \le 0, \\
& (x, w) \in P_i,
\end{aligned}
$$
where the learned constraints for leaf i’s subproblem are implicitly represented by the polyhedron
Pi . These subproblems can be solved in parallel, and the minimum across all subproblems is
obtained as the optimal solution. Furthermore, if all decision variables x are continuous, these
subproblems are linear optimization problems (LOs), which can provide substantial computational
In the more general setting where the decision tree forms one of many constraints, or we are
interested in varying the τ limit within the model, we can directly embed the model into a larger
MIO. We add binary variables representing each leaf, and set y to the predicted value of the
assigned leaf. An observation can only be assigned to a leaf if it obeys all of its constraints; the
structure of the tree guarantees that exactly one path will be fully satisfied, and thus, the leaf
used in a constraint or objective. The full formulation for the embedded decision tree is included in
Appendix A.2. This formulation is similar to the proposal in Verwer et al. (2017). Both approaches
have their own merits: while the Verwer formulation includes fewer constraints in the general case,
our formulation is more efficient in the case where the problem can be decomposed into individual
Ensemble Methods. Ensemble methods, such as random forests (RF) and gradient-boosting
machines (GBM), consist of many decision trees that are aggregated to obtain a single
prediction for a given observation. These models can thus be implemented by embedding many
“sub-models” (Breiman 2001). Suppose we have a forest with P trees. Each tree can be embedded as a
single decision tree (see previous paragraph) with the constraints from Appendix A.2, which yields
a predicted value yi .
RF models typically generate predictions by taking the average of the predictions from the
individual trees:
$$
y = \frac{1}{P} \sum_{i=1}^{P} y_i.
$$
This can then be used as a term in the objective, or constrained by an upper bound as y ≤ τ ; this
can be done equivalently for a lower bound. In the classification setting, the prediction averages the
probabilities returned by each model (yi ∈ [0, 1]), which can likewise be constrained or optimized.
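As a small numerical illustration of the averaged prediction and its bound, consider the following Python sketch. The function name `forest_prediction` and the per-tree values are hypothetical; in the embedded model each yi would be a decision variable fixed by the tree constraints rather than a given number.

```python
# Sketch: aggregate per-tree predictions as a random forest would
# and check the bound y <= tau. Values are made up.

def forest_prediction(tree_predictions):
    """Average of the per-tree predictions y_1, ..., y_P."""
    return sum(tree_predictions) / len(tree_predictions)

tree_predictions = [0.25, 0.5, 0.75, 0.5]
tau = 0.6

y = forest_prediction(tree_predictions)
satisfies_bound = y <= tau
print(y, satisfies_bound)
```

Note that the third tree predicts 0.75, which exceeds τ, even though the average does not; the averaged constraint does not control individual trees.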
Alternatively, we can further leverage the fact that unlike the other model classes, which return
a single prediction, the RF model generates P predictions, one per tree. We can impose a violation
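In the case of GBM, we have an ensemble of base-learners which are not necessarily decision
trees. The model output is then computed as
$$
y = \sum_{i=1}^{P} \beta_i y_i,
$$
where $y_i$ is the predicted value of the i-th regression model $\hat{h}_i(x, w)$ and $\beta_i$ is the weight associated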
with the prediction. Although trees are typically used as base-learners, in theory we might use any
Neural Networks. We implement multi-layer perceptrons (MLP) with a rectified linear unit
(ReLU) activation function, which form an MIO-representable class of neural networks (Grimstad
and Andersson 2019, Anderson et al. 2020). These networks consist of an input layer, L − 2 hidden
layer(s), and an output layer. This nonlinear transformation of the input space over multiple nodes
(and layers) using the ReLU operator (v = max{0, x}) allows MLPs to capture complex functions
that other algorithms cannot adequately encode, making them a powerful class of models.
Critically, the ReLU operator, v = max{0, x}, can be encoded using linear constraints, as detailed
in Appendix A.3. The constraints for an MLP network can be generated recursively starting from
the input layer, which allows us to embed a trained MLP with an arbitrary number of hidden layers
and nodes into an MIO. We refer to Appendix A.3 for details on the embedding of regression,
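To make the linear encoding of the ReLU operator concrete, the sketch below checks a commonly used big-M formulation. This is the standard textbook encoding and is an assumption here; the formulation in Appendix A.3 may differ in its details. It assumes known bounds lower ≤ x ≤ upper with lower < 0 < upper, and a binary variable z that indicates whether the unit is active.

```python
# Sketch: big-M constraints that force v = max(0, x) for a single unit.
# Assumed encoding (not necessarily the one in the paper's appendix):
#   v >= 0,  v >= x,  v <= x - lower * (1 - z),  v <= upper * z,  z in {0, 1}

def relu_bigm_feasible(x, v, z, lower, upper, tol=1e-9):
    """Return True if (x, v, z) satisfies all big-M ReLU constraints."""
    return (
        z in (0, 1)
        and v >= -tol
        and v >= x - tol
        and v <= x - lower * (1 - z) + tol
        and v <= upper * z + tol
    )

lower, upper = -5.0, 5.0

checks = {
    "active unit, v = x": relu_bigm_feasible(2.0, 2.0, 1, lower, upper),
    "inactive unit, v = 0": relu_bigm_feasible(-3.0, 0.0, 0, lower, upper),
    "wrong value rejected": relu_bigm_feasible(2.0, 0.0, 0, lower, upper),
}
print(checks)
```

Applying these constraints to every hidden node, layer by layer, gives one way to build the recursive embedding described above.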
As the optimal solutions of optimization problems are often at the extremes of the feasible region,
this can be problematic for the validity of the trained ML model. Generally speaking, the accuracy
of a predictive model deteriorates for points that are further away from the data points in D
(Goodfellow et al. 2015). To mitigate this problem, we elaborate on the idea proposed by Biggs et al.
(2021) to use the convex hull (CH) of the dataset as a trust region to prevent the predictive model
from extrapolating. According to Ebert et al. (2014), when data is enclosed by a boundary of convex
shape, the region inside this boundary is known as an interpolation region. This interpolation region
is also referred to as the CH, and by excluding solutions outside the CH, we prevent extrapolation.
If $X = \{\hat{x}_i\}_{i=1}^{N}$ is the set of observed input data with $\hat{x}_i = (\bar{x}_i, \bar{w}_i)$, we define the trust region as
the CH of this set and denote it by CH(X). Recall that CH(X) is the smallest convex polytope
that contains the set of points X. It is well-known that computing the CH is exponential in time
and space with respect to the number of samples and their dimensionality (Skiena 2008). However,
since the CH is a polytope, explicit expressions for its facets are not necessary. More precisely,
CH(X) is represented as
$$
\mathrm{CH}(X) = \Big\{ x \;\Big|\; \sum_{i \in I} \lambda_i \hat{x}_i = x,\ \sum_{i \in I} \lambda_i = 1,\ \lambda \ge 0 \Big\}, \tag{2}
$$
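The representation in (2) avoids facets entirely: a point belongs to CH(X) exactly when suitable weights λ exist. The following Python sketch verifies the three conditions for weights that are supplied by the caller; the function name `is_convex_combination` and the data are illustrative assumptions. In the optimization model, λ would instead be decision variables chosen by the solver.

```python
# Sketch: verify the conditions of formulation (2) for given weights.

def is_convex_combination(x, points, lam, tol=1e-9):
    """Check x = sum_i lam_i * points[i], sum_i lam_i = 1, and lam >= 0."""
    if len(lam) != len(points) or any(weight < -tol for weight in lam):
        return False
    if abs(sum(lam) - 1.0) > tol:
        return False
    combo = [
        sum(weight * p[d] for weight, p in zip(lam, points))
        for d in range(len(x))
    ]
    return all(abs(c - xd) <= tol for c, xd in zip(combo, x))

# Three observed points forming a triangle.
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]

# (1, 1) lies inside the triangle: valid weights exist.
inside = is_convex_combination((1.0, 1.0), points, [0.5, 0.25, 0.25])

# (3, 3) can only be written with a negative weight, so these weights fail.
outside = is_convex_combination((3.0, 3.0), points, [-0.5, 0.75, 0.75])
print(inside, outside)
```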
In situations such as the one shown in Figure 3a, CH(X) includes regions with few or no data
points (low-density regions). Blindly using CH(X) in this case can be problematic if the solutions
are found in the low-density regions. We therefore advocate the use of a two-step approach. First,
clustering is used to identify distinct high-density regions, and then the trust region is represented
We can either solve EM(w) for each cluster, or embed the union of the |K| CHs into the MIO given
by
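$$
\bigcup_{k \in K} \mathrm{CH}(X_k) = \Big\{ x \;\Big|\; \sum_{i \in I} \lambda_i \hat{x}_i = x,\ \sum_{i \in I_k} \lambda_i = u_k \ \forall k \in K,\ \sum_{k \in K} u_k = 1,\ \lambda \ge 0,\ u \in \{0, 1\}^{|K|} \Big\}, \tag{3}
$$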
where Xk ⊆ X refers to the subset of samples in cluster k ∈ K with the index set Ik ⊆ I. The union of
CHs requires the binary variables uk to constrain a feasible solution to be in exactly one of the CHs.
More precisely, uk = 1 corresponds to the CH of the k-th cluster. As we show in Section 4, solving
EM(w) for each cluster may be done in parallel, which has a positive impact on computation time.
We note that both formulations (2) and (3) assume that x̂ is continuous. These formulations can
be extended to datasets with binary, categorical and ordinal features. In the case of categorical
features, extra constraints on the domain and one-hot encoding are required.
[Figure 3: (a) CH(X) with single region. (b) CH(X) with clustered regions.]
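The effect of the binary variables uk in formulation (3) can be seen in a small Python sketch. The function name `in_union_of_hulls`, the cluster labels, and the data are hypothetical, and the weights λ and indicators u are supplied by hand rather than found by a solver. The sketch reads (3) as requiring the weights of each cluster to sum to uk, with exactly one uk equal to 1.

```python
# Sketch: verify the conditions of formulation (3) for given lam and u.
# `clusters` maps a cluster label k to its index set I_k.

def in_union_of_hulls(x, points, clusters, lam, u, tol=1e-9):
    """Return True if (lam, u) certifies that x lies in one cluster hull."""
    if any(uk not in (0, 1) for uk in u.values()) or sum(u.values()) != 1:
        return False
    if any(weight < -tol for weight in lam):
        return False
    for k, index_set in clusters.items():
        # Weights of cluster k must sum to u_k (1 for the chosen cluster).
        if abs(sum(lam[i] for i in index_set) - u[k]) > tol:
            return False
    combo = [
        sum(lam[i] * points[i][d] for i in range(len(points)))
        for d in range(len(x))
    ]
    return all(abs(c - xd) <= tol for c, xd in zip(combo, x))

# Two well-separated clusters on a line.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
clusters = {"A": [0, 1], "B": [2, 3]}

# (1, 0) lies in the hull of cluster A.
in_a = in_union_of_hulls((1.0, 0.0), points, clusters,
                         [0.5, 0.5, 0.0, 0.0], {"A": 1, "B": 0})

# (6, 0) lies in CH(X) but in the empty gap between the clusters;
# mixing weights across clusters violates the per-cluster sums.
in_gap = in_union_of_hulls((6.0, 0.0), points, clusters,
                           [0.0, 0.5, 0.5, 0.0], {"A": 1, "B": 0})
print(in_a, in_gap)
```

The point (6, 0) would be accepted by the single convex hull in (2), which is exactly the low-density situation the clustered trust region is meant to exclude.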
Although the CH can be represented by linear constraints, the number of variables in EM(w)
grows with the dataset size, which may make the optimization process prohibitive
when the number of samples becomes too large. We therefore provide a column selection algorithm
that selects a small subset of the samples. This algorithm can be directly used in the case of convex
optimization problems or embedded as part of a branch and bound algorithm when the optimization
problem involves integer variables. Figure 4 visually demonstrates the procedure; we begin with an
arbitrary sample of the full data, and use column selection to iteratively add samples x̂i until no
improvement can be found. In Appendix B.2, we provide a full description of the approach, as well
as a formal lemma which states that in each iteration of column selection, the selected sample from
X is also a vertex of CH(X). In synthetic experiments, we observe that the algorithm scales well
with the dataset size. The computation time required by solving the optimization problem with the
algorithm is near-constant and minimally affected by the number of samples in the dataset. The
experiments in Appendix B.2 show optimization with column selection to be significantly faster
than a traditional approach, which makes it an ideal choice when dealing with massive datasets.
Figure 4 Visualization of the column selection algorithm. Known and learned constraints define the infeasible
region. The column selection algorithm starts using only a subset of data points (red filled circles),
X′ ⊆ X to define the trust region. In each iteration a vertex of CH(X) is selected (red hollow circle)
and included in X′ until the optimal solution (star) is within the feasible region, namely the convex
hull of X′. Note that with column selection we do not need the complete dataset to obtain the optimal
solution.
There are multiple sources of uncertainty, and consequently notions of robustness, that can be
considered when embedding a trained machine learning model as a constraint. We define two types
Function Uncertainty. The first source of uncertainty is in the underlying functional form of ĥ.
We do not know the ground truth relationship between (x, w) and y, and there is potential for model
mis-specification. We mitigate this risk through our nonparametric model selection procedure,
namely training ĥ for a diverse set of methods (e.g., decision tree, regression, neural network) and
Parameter Uncertainty. Even within a single model class, there is uncertainty in the parameter
estimates that define ĥ. Consider the case of linear regression. A regression estimator consists of
point estimates of coefficients and an intercept term, but there is uncertainty in the estimates
as they are derived from noisy data. We seek to make our model robust by characterizing this
uncertainty and optimizing against it. We propose model-wrapper ensemble approaches, which are
agnostic to the underlying model. The rest of this section addresses the model-wrapper approaches
and a looser formulation of the trust region that prevents the optimal solution from being too
We begin by describing the model “wrapper” approach for characterizing uncertainty, in which
we work directly with any trained models and their point predictions. Rather than obtaining our
estimated outcome from a single trained predictive model, we suppose that we have P estimators.
The set of estimators can be obtained by bootstrapping or by training models using entirely
different methods. The uncertainty is thus characterized by different realizations of the predicted
We introduce a constraint that at most α ∈ [0, 1] proportion of the P estimators violate the
constraint. Let ĥ1, . . . , ĥP be the individual estimators. Then ĥi(x) ≤ τ must hold for at least (1 − α)P of these
estimators. This allows for a degree of robustness to individual model predictions by discarding a
Note that α = 0 enforces the bound for all estimators, yielding the most conservative estimate,
$$
\begin{aligned}
& y_i \le \tau + M(1 - z_i), \quad i = 1, \ldots, P, \\
& \frac{1}{P} \sum_{i=1}^{P} z_i \ge 1 - \alpha,
\end{aligned}
$$
where zi ∈ {0, 1} ∀i = 1, . . . , P , and M is a sufficiently large constant. Appendix A.4 includes further
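The logic of the violation limit can be checked with a short Python sketch. The function name `satisfies_violation_limit` and the predictions are illustrative assumptions. The sketch uses the fact that, under the big-M constraints, zi can be set to 1 only when yi ≤ τ, so the largest attainable value of the average of z is the fraction of estimators that respect the bound.

```python
# Sketch: does a set of estimator predictions satisfy the violation limit?

def satisfies_violation_limit(predictions, tau, alpha, tol=1e-12):
    """True if at most a fraction alpha of the P predictions exceed tau."""
    # z_i = 1 is only allowed when y_i <= tau; choose z as large as possible.
    z = [1 if y <= tau else 0 for y in predictions]
    return sum(z) / len(z) >= 1 - alpha - tol

# Hypothetical predictions from P = 4 bootstrapped estimators.
predictions = [0.4, 0.5, 0.7, 0.3]
tau = 0.6

strict = satisfies_violation_limit(predictions, tau, alpha=0.0)
relaxed = satisfies_violation_limit(predictions, tau, alpha=0.25)
print(strict, relaxed)
```

With α = 0 the single violating estimator makes the solution infeasible, whereas α = 0.25 tolerates one violation out of four.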
The violation limit concept can also be applied to estimators coming from multiple model classes,
which allows us to enforce that the constraint is generally obeyed when modeled through distinct
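The violation-limit constraint can be checked for a candidate solution once the predictions of the P estimators are available. The following is a minimal sketch under stated assumptions, not the OptiCL implementation: the function name, the tolerance, and the sample predictions are hypothetical. It sets z_i = 1 exactly when prediction y_i satisfies y_i ≤ τ and then tests whether the satisfied fraction reaches 1 − α.

```python
def satisfies_violation_limit(predictions, tau, alpha, tol=1e-9):
    """Check whether at most a fraction alpha of estimators violate y_i <= tau.

    predictions: point predictions y_i of the P estimators (hypothetical input).
    """
    P = len(predictions)
    # z_i = 1 if estimator i satisfies the bound, mirroring the binary variables z_i.
    z = [1 if y <= tau else 0 for y in predictions]
    # (1/P) * sum(z_i) >= 1 - alpha, with a small tolerance for floating-point error.
    return sum(z) >= (1 - alpha) * P - tol


# Hypothetical predictions from P = 4 estimators with threshold tau = 0.5.
preds = [0.40, 0.45, 0.55, 0.30]
print(satisfies_violation_limit(preds, tau=0.5, alpha=0.25))  # one violation tolerated
print(satisfies_violation_limit(preds, tau=0.5, alpha=0.0))   # no violation tolerated
```

In the optimization model the z_i are decision variables chosen by the solver; this check only verifies a fixed candidate after the fact.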
The use of the model wrapper approach and the trust region constraints, as defined in (2), has
a direct effect on the feasible region. The better performance of the learned constraints might be
balanced out by the (potentially) unnecessary conservatism of the optimal solution. Although we
introduced the trust region as a set of constraints to preserve the predictive performance of the
fitted constraints, Balestriero et al. (2021) show that in high-dimensional spaces a fitted model typically
generalizes by extrapolating rather than interpolating. In light of this evidence, we propose
an ϵ-CH formulation which builds on (2), and more generally on (3). The relaxed formulation
of the trust region enables the optimal solution of problem M (w) to be outside CH(X). Formally,
we enlarge the trust region such that solutions outside CH(X) are considered feasible if they fall
within the hyperball, with radius ϵ, surrounding at least one of the data points in X, see Figure 5
with s ∈ R^n, and p set equal to 1, 2, or ∞ to preserve the complexity of the optimization problem.
Figure 5 (right) shows the extended region obtained with the ϵ-CH. The choice of ϵ is pivotal in the
trade-off between the performance of the learned constraints and the conservatism of the optimal
solution. In the next section, we demonstrate how an increase in ϵ affects both the performance of
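The hyperball part of the enlarged trust region can be illustrated with a short membership test. This is a sketch under stated assumptions: the function names and the sample points are hypothetical, and the test covers only the "within ϵ of at least one data point" condition. Deciding membership in CH(X) itself would additionally require solving for convex weights.

```python
def p_norm(v, p):
    """p-norm of a vector for p in {1, 2, float('inf')}."""
    if p == float("inf"):
        return max(abs(c) for c in v)
    return sum(abs(c) ** p for c in v) ** (1.0 / p)


def within_epsilon_of_data(point, data, eps, p=2):
    """True if `point` lies in the radius-eps p-norm ball around some data point."""
    return any(
        p_norm([a - b for a, b in zip(point, x)], p) <= eps
        for x in data
    )


# Hypothetical two-dimensional data set and candidate solution.
X = [(0.0, 0.0), (1.0, 0.0)]
candidate = (1.3, 0.4)
print(within_epsilon_of_data(candidate, X, eps=0.6, p=2))
print(within_epsilon_of_data(candidate, X, eps=0.4, p=2))
```

A larger ϵ admits more candidates, which matches the trade-off described above between the conservatism of the solution and the reliability of the learned constraints.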
4. Case study: a palatable food basket for the World Food Programme
In this case study, we use a simplified version of the model proposed by Peters et al. (2021),
which seeks to optimize humanitarian food aid. Its extended version aims to provide the World
Food Programme (WFP) with a decision-making tool for long-term recovery operations, which
simultaneously optimizes the food basket to be delivered, the sourcing plan, the delivery plan,
and the transfer modality of a month-long food supply. The model proposed by Peters et al.
(2021) enforces that the food baskets address the nutrient gap and are palatable. To guarantee a
Figure 5 Trust region enlarged using a hyperball with radius ϵ around each sample in CH(X).
[Figure: two panels showing the data points in X, CH(X), the hyperballs with radius ϵ, and a feasible solution.]
certain level of palatability, the authors use a number of “unwritten rules” that have been defined
in collaboration with nutrition experts. In this case study, we take a step further by inferring
palatability constraints directly from data that reflects local people’s opinions. We use the specific
case of Syria for this example. The conceptual model presents an LO structure with only the food
the procedure would remain unchanged if data were collected in the field, for example through
surveys. The structure of this problem, which is an LO and involves only one learned constraint,
allows the following analyses: (1) the effect of the trust-region on the optimal solution, and (2)
the effect of clustering on the computation time and the optimal objective value. Additionally,
the use of simulated data provides us with a ground truth to use in evaluating the quality of the
prescriptions.
and a diet model with constraints for nutrition levels and food basket palatability.
The sets used to define the constraints and the objective function are displayed in Table 1. We
have three different sets of nodes, and the set of commodities contains all the foods available for
Table 1 Sets
  N_S    Set of source nodes
  N_T    Set of transshipment nodes
  N_D    Set of delivery nodes
  K      Set of commodities (k ∈ K)
  L      Set of nutrients (l ∈ L)
The parameters used in the model are displayed in Table 2. The costs used in the objective
function concern transportation (pT ) and procurement (pP ). The amount of food to deliver depends
on the demand (d) and the number of feeding days (days). The nutritional requirements (nutreq)
and nutritional values (nutval) are detailed in Appendix C. The parameter γ is needed to convert
the metric tons used in the supply chain constraints to the grams used in the nutritional constraints.
The parameter t is used as a lower bound on the food basket palatability. The values of these
Table 2 Parameters
  γ            Conversion rate from metric tons (mt) to grams (g)
  d_i          Number of beneficiaries at delivery point i ∈ N_D
  days         Number of feeding days
  nutreq_l     Nutritional requirement for nutrient l ∈ L (grams/person/day)
  nutval_kl    Nutritional value for nutrient l ∈ L per gram of commodity k ∈ K
  p^P_ik       Procurement cost (in $/mt) of commodity k from source i ∈ N_S
  p^T_ijk      Transportation cost (in $/mt) of commodity k from node i ∈ N_S ∪ N_T to node j ∈ N_T ∪ N_D
  t            Palatability lower bound
The decision variables are shown in Table 3. The flow variables Fijk are defined as the metric
tons of a commodity k transported from node i to j. The variable xk represents the average daily
ration per beneficiary for commodity k. The variable y refers to the palatability of the food basket.
Table 3 Variables
  F_ijk    Metric tons of commodity k ∈ K transported between node i and node j
  x_k      Grams of commodity k ∈ K in the food basket
  y        Food basket palatability
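\begin{align}
\min_{x,\,y,\,F} \quad & \sum_{i \in N_S} \sum_{j \in N_T \cup N_D} \sum_{k \in K} p^{P}_{ik} F_{ijk} + \sum_{i \in N_S \cup N_T} \sum_{j \in N_T \cup N_D} \sum_{k \in K} p^{T}_{ijk} F_{ijk} \tag{6a} \\
\text{s.t.} \quad & \sum_{j \in N_T} F_{ijk} = \sum_{j \in N_T} F_{jik}, && i \in N_T,\ k \in K, \tag{6b} \\
& \sum_{j \in N_S \cup N_T} \gamma F_{jik} = d_i \, x_k \, \mathit{days}, && i \in N_D,\ k \in K, \tag{6c} \\
& \sum_{k \in K} \mathit{nutval}_{kl} \, x_k \ge \mathit{nutreq}_l, && l \in L, \tag{6d} \\
& x_{\text{salt}} = 5, \tag{6e} \\
& x_{\text{sugar}} = 20, \tag{6f} \\
& y \ge t, \tag{6g} \\
& y = \hat{h}(x), \tag{6h} \\
& F_{ijk},\ x_k \ge 0, && i, j \in N,\ k \in K. \tag{6i}
\end{align}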
The objective function consists of two components, procurement costs and transportation costs.
Constraints (6b) are used to balance the network flow, namely to ensure that the inflow and the
outflow of a commodity are equal for each transshipment node. Constraints (6c) state that flow
into a delivery node has to be equal to its demand, which is defined by the number of beneficiaries
times the daily ration for commodity k times the feeding days. Constraints (6d) guarantee an
optimal solution that meets the nutrition requirements. Constraints (6e) and (6f) force the amount
of salt and sugar to be 5 grams and 20 grams respectively. Constraint (6g) requires the food basket
palatability (y), defined by means of a predictive model (6h), to be greater than a threshold (t).
Lastly, non-negativity constraints (6i) are added for all commodity flows and commodity rations.
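As a worked illustration of the demand constraint (6c), the sketch below computes the metric tons of a commodity that must arrive at a delivery node for a given daily ration. The function name and the sample values (1,000 beneficiaries, a 400 g ration, 30 feeding days) are hypothetical; the only fixed fact is that one metric ton equals 1,000,000 grams, which is the role of γ.

```python
def required_inflow_mt(beneficiaries, ration_g, days, gamma=1_000_000):
    """Metric tons of a commodity that must flow into a delivery node under (6c).

    Rearranges gamma * F = d_i * x_k * days, where gamma converts metric tons to grams.
    """
    return beneficiaries * ration_g * days / gamma


# Hypothetical delivery node: 1,000 beneficiaries, 400 g per person per day, 30 days.
print(required_inflow_mt(1000, 400, 30))  # 12.0 metric tons
```

Because x_k is a decision variable, the required inflow changes with the chosen food basket, which is what couples the diet and supply chain parts of the model.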
To evaluate the ability of our framework to learn and implement the palatability constraints, we
use a simulator to generate diets with varying palatabilities. Each sample is defined by 25 features
representing the amount (in grams) of all commodities that make up the food basket. We then
use a ground truth function to assign each food basket a palatability between 0 and 1, where 1
corresponds to a perfectly palatable basket, and 0 to an inedible basket. This function is based
on suggestions provided by WFP experts and complete details are outlined in Appendix C.1. The
data is then balanced to ensure that a wide variety of palatability scores are represented in the
dataset. The final data used to learn the palatability constraint consists of 121,589 samples. Two
examples of daily food baskets and their respective palatability scores are shown in Table 4. In this
case study, we use a palatability lower bound (t) of 0.5 for our learned constraint.
The next step of the framework involves training and choosing the predictive model that best
approximates the unknown constraint. The predictive models used to learn the palatability con-
straints are those discussed in Section 2, namely LR, SVM, CART, RF, GBM with decision trees
The experiments are executed using OptiCL jointly with Gurobi v9.1 (Gurobi Optimization, LLC
2021) as the optimization solver. Table 5 reports the performances of the predictive models evalu-
ated both for the validation set and for the prescriptions after being embedded into the optimization
model. The table also compares the performance of the optimization with and without the trust
region. The column “Validation MSE” gives the Mean Squared Error (MSE) of each model obtained
in cross-validation during model selection. While all scores in this column are desirably low, the
MLP model achieves the lowest error by a clear margin during this validation phase. The column “MSE”
gives the MSE of the predictive models once embedded into the optimization problem to evalu-
ate how well the predictions for the optimal solutions match their true palatabilities (computed
using the simulator). It is computed over 100 optimal solutions of the optimization model generated
with different cost vectors. The MLP model exhibits the best performance (0.055) in this context,
showing its ability to model the palatability constraint better than all other methods.
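The prescription-level error described above compares, for each optimal solution, the palatability predicted by the embedded model with the palatability returned by the simulator. A minimal sketch of that computation follows; the function name and the sample values are hypothetical and are not results from the paper.

```python
def prescription_mse(predicted, true):
    """Mean squared error between predicted and simulator ("true") outcomes."""
    if len(predicted) != len(true):
        raise ValueError("predicted and true must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(predicted)


# Hypothetical case: the model predicts exactly the threshold 0.5 for two
# prescriptions, while the simulator reports lower palatability for one of them.
print(prescription_mse([0.5, 0.5], [0.5, 0.3]))
```

Unlike the validation error, this quantity is evaluated at points selected by the optimizer, which may lie far from the training data when no trust region is used.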
Table 5 Performance of the predictive models on the validation set (“Validation MSE”), and for
the prescriptions after being embedded into the optimization model with (“MSE-TR”) and
without the trust region (“MSE”). The last two columns show the average computation time in
seconds and its standard deviation (SD) required to solve the optimization model with
(“Time-TR”) and without the trust region (“Time”).
Benefit of trust region. Table 5 shows that when the trust region is used (“MSE-TR”), the MSEs
obtained by all models are now much closer to the results from the validation phase. This shows
the benefit of using the trust region as discussed in Section 2.3 to prevent extrapolation. With
the trust region included, the MLP model also exhibits the lowest MSE (0.001). The improved
performance seen with the inclusion of the trust region does come at the expense of computation
speed. The column “Time-TR” shows the average computation time in seconds and its standard
deviation (SD) with trust region constraints included. In all cases, the computation time has clearly
increased when compared against the computation time required without the trust region (column
“Time”). This is however acceptable, as significantly more accurate results are obtained with the
trust region.
Benefit of clustering. The large dataset used in this case study makes the use of the trust
region expensive in terms of time required to solve the final optimization model. While the column
selection algorithm described in Section 2.3 is ideal for significantly reducing the computation
time, optimization models that require binary variables, either for embedding an ML model or to
represent decision variables, would require column selection to be combined with a branch and
bound algorithm. However, in this more general MIO case, it is possible to divide the dataset into
clusters and solve in parallel an MIO for each cluster. By using parallelization, the total solution
time can be expected to be equal to the longest time required to solve any single cluster’s MIO.
Contrary to column selection, the use of clusters can result in more conservative solutions; the
trust region gets smaller with more clusters and prevents the model from finding solutions that
are convex combinations of members of different clusters. However, as described in Section 2.3,
solutions that lie between clusters may in fact reside in low-density areas of the feature space that
should not be included in the trust region. In this sense, the loss in the objective value might
Figure 6 shows the effect of clusters in solving the model (6a-6i) with GBM as the predictive
model used to learn the palatability constraint. K-means is used to partition the dataset into K
clusters, and the reported values are averaged over 100 iterations. In the left graph, we report the
maximum runtime distribution across clusters needed to solve the different MIOs in parallel. In
the right graph, we show the distribution of the optimality gap, i.e., the relative difference between
the optimal objective value obtained with clusters and the one obtained without clustering.
In this case study, the use of clusters significantly decreases the runtime (89.2% speedup with
K = 50) while still obtaining near-optimal solutions (less than 0.25% average gap with K = 50).
We observe that the trends are not necessarily monotonic in K. It is possible that a certain choice
of K may lead to a suboptimal solution, whereas a larger value of K may preserve the optimal
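The cluster-based procedure can be summarized as: solve one problem per cluster, keep the best feasible objective, and report the slowest single solve as the parallel runtime. The sketch below illustrates this bookkeeping under stated assumptions; the data layout (a list of objective and runtime pairs, with an objective of None for an infeasible cluster) and the sample numbers are hypothetical.

```python
def best_over_clusters(cluster_results):
    """Combine per-cluster solves of a minimization problem.

    cluster_results: list of (objective, runtime_seconds); objective is None
    when the cluster's problem is infeasible (hypothetical data layout).
    Returns (best_objective, parallel_runtime).
    """
    objectives = [obj for obj, _ in cluster_results if obj is not None]
    if not objectives:
        raise ValueError("no cluster produced a feasible solution")
    # With one worker per cluster, wall-clock time is the slowest single solve.
    parallel_runtime = max(t for _, t in cluster_results)
    return min(objectives), parallel_runtime


def optimality_gap(clustered_objective, full_objective):
    """Relative gap between the clustered and unclustered optima (minimization)."""
    return (clustered_objective - full_objective) / abs(full_objective)


results = [(1010.0, 3.2), (None, 1.0), (1002.0, 5.5)]
best, runtime = best_over_clusters(results)
print(best, runtime, optimality_gap(best, 1000.0))
```

The gap is nonnegative for a minimization problem because each cluster's trust region is a subset of the full convex hull.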
Figure 6 Effect of the number of clusters (K) on the computation time and the optimality gap across clusters,
[Figure 6: left panel, maximum runtime across clusters versus K; right panel, optimality gap versus K; K ranges from 0 to 50.]
In these experiments, we assess the performance of the nominal and robust models. We consider
three dimensions of performance: (1) true constraint satisfaction, (2) objective function value, and
(3) runtime. The synthetic data used in this case study allows us to evaluate true palatability and
constraint satisfaction as these parameters vary. The primary goal of the model wrapper
ensemble approach is to improve feasibility and produce solutions that are robust to the errors of any single learned
estimator.
We hypothesize that as our models become more conservative, we will more reliably satisfy the
desired palatability constraint with some toll on the objective function. Additionally, embedding
single nominal model. In this section, we compare the trade-offs in these metrics as we consider
different notions of robustness and vary our conservativeness. We note that we are able to evaluate
whether the true palatability meets the constraint threshold since palatability is defined through
a known function. As with the experiments above, we solve the palatability problem with 100
The results below explore the effect of the α (violation limit) on cost and palatability in the
WFP case study. Additional results on runtime, and experiments with varied estimators (P ), are
included in Appendix C.3. As the results demonstrate, the robustness parameters yield solutions
that vary in their conservativeness and runtime. There is no single optimal set of parameters.
Rather, the appropriate choice depends heavily on the use case, including factors like the stakes of the decision and
Multiple embedded models. We first consider the impact of the model wrapper approach in the
WFP problem. We compare different ways of embedding the palatability constraint, both using
multiple estimators of a single model class and an ensemble containing multiple model classes. We
run the experiments on a random sample of 1000 observations in the original WFP dataset. Within
a single model class, we vary the number of estimators (P ∈ {2, 5, 10, 25}) and the violation limit
(α ∈ {0, 0.1, 0.2, 0.5}, or applying a mean constraint). Each estimator is obtained using a bootstrap
sample (proportion = 0.5) of the underlying data. We compute metrics (1-3) for each variant to
compare the tradeoffs in palatability (constraint satisfaction) and cost (objective function value).
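The bootstrapped estimators described above each see a resampled subset of the data. A minimal sketch of drawing the index sets follows, assuming sampling with replacement and a fixed seed for reproducibility; both choices, as well as the function name, are assumptions and not details confirmed by the text.

```python
import random


def bootstrap_samples(n_obs, n_estimators, proportion=0.5, seed=0):
    """Draw one index list per estimator, each of size proportion * n_obs.

    Sampling is with replacement (an assumption of this sketch).
    """
    rng = random.Random(seed)
    size = int(round(proportion * n_obs))
    return [
        [rng.randrange(n_obs) for _ in range(size)]
        for _ in range(n_estimators)
    ]


# 1000 observations, P = 25 estimators, bootstrap proportion 0.5.
samples = bootstrap_samples(1000, 25, proportion=0.5)
print(len(samples), len(samples[0]))
```

Each index list would then be used to fit one estimator ĥ_i of the chosen model class.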
Figure 7 presents the results for a decision tree with P = 25 and palatability threshold (τ ) equal
to 0.5. The left figure shows the trade-off between palatability and the objective as the violation
limit (α) varies. As expected, improvements in palatability (when α decreases) lead to increases
in the total cost. However, we observe that a violation limit of 0.0 (vs. 0.5) leads to an 11.3%
modest 2.5% increase in cost. The center and right figures show how palatability and violations
vary with α. Palatability increases and violations decrease with lower α. Both the violation rate
(proportion of iterations with real palatability < 0.5) and violation margin (average distance to
palatability threshold in cases where there is a violation) decrease with lower α. This experiment
demonstrates how the α parameter effectively controls the model’s robustness as measured by
constraint satisfaction. The approach has the advantage of parameterizing the violation limit,
tradeoffs.
Appendix C.3 reports further results for other model classes as well as runtime experiments.
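The two violation metrics defined above can be computed directly from the true palatability scores of the prescriptions. The sketch below follows those definitions; the function name and the sample scores are hypothetical.

```python
def violation_metrics(true_palatability, threshold=0.5):
    """Return (violation rate, violation margin) for a list of true scores.

    rate: share of iterations whose true palatability is below the threshold.
    margin: average distance to the threshold among the violating iterations.
    """
    shortfalls = [threshold - y for y in true_palatability if y < threshold]
    rate = len(shortfalls) / len(true_palatability)
    margin = sum(shortfalls) / len(shortfalls) if shortfalls else 0.0
    return rate, margin


# Hypothetical true palatability scores from four iterations.
rate, margin = violation_metrics([0.60, 0.40, 0.55, 0.45])
print(rate, margin)
```

Reporting both quantities distinguishes frequent but small violations from rare but large ones.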
[Figure 7: three panels, Objective (Cost), Palatability, and Violation, each plotted against x-axis values 0.0, 0.1, 0.25, and 0.5.]
Enlarged trust region. In order to evaluate the effects of the enlarged trust region on the optimal
solution, we use a simplified version of problem (6a-6i) where the only constraints are on the
predictive model embedding, the palatability lower bound, and the ϵ-CH. In Figure 8, we show
how the objective function value and true palatability score vary according to different values of
ϵ ∈ [0, 0.8]. The results are obtained by averaging over 200 iterations with randomly generated cost
vectors and using a decision tree as a predictive model to represent the palatability outcome. As
expected, the objective value improves as ϵ increases. More interesting is the true palatability score
which stays around the imposed lower bound of 0.5 for values of ϵ smaller than 0.25. This means
that the predictive model is able to generalize even outside the CH as long as the optimal solution
In this case study, we extend the work of Bertsimas et al. (2016) in the design of chemotherapy
regimens for advanced gastric cancer. Late stage gastric cancer has a poor prognosis with limited
treatment options (Yang et al. 2011). This has motivated significant research interest and clinical
trials (National Cancer Institute 2021). In Bertsimas et al. (2016), the authors pose the question
of algorithmically identifying promising chemotherapy regimens for new clinical trials based on
existing trial results. They construct a database of clinical trial treatment arms which includes
cohort and study characteristics, the prescribed chemotherapy regimen, and various outcomes.
Figure 8 Effect of the ϵ-CH on the objective value and the predictive model performance with respect to the
[Figure 8 plot: x-axis epsilon, ranging from 0.0 to 0.8.]
Given a new study cohort and study characteristics, they optimize a chemotherapy regimen to
maximize the cohort’s survival subject to a constraint on overall toxicity. The original work uses
linear regression models to predict survival and toxicity, and it constrains a single toxicity measure.
In this work, we leverage a richer class of ML methods and more granular outcome measures. This
offers benefits through higher-performing predictive models and more clinically relevant constraints.
Chemotherapy regimens are particularly challenging to optimize, since they involve multiple
drugs given at potentially varying dosages, and they present risks for multiple adverse events
that must be managed. This example highlights the generalizability of our framework to complex
domains with multiple decisions and learned functions. The treatment variables in this problem
consist of both binary and continuous elements, which are easily incorporated through our use
of MIO. We have several learned constraints which must be simultaneously satisfied, and we also
The use of clinical trial data forces us to consider each cohort as an observation, rather than
an individual, since only aggregate measures are available. Thus, our model optimizes a cohort’s
treatment. The contextual variables (w) consist of various cohort and study summary variables.
The inclusion of fixed, i.e., non-optimization, features allows us to account for differences in baseline
health status and risk across study cohorts. These features are included in the predictive models
but then are fixed in the optimization model to reflect the group for whom we are generating a
prescription. We assume that there are no unobserved confounding variables in this prescriptive
setting.
The treatment variables (x) encode a chemotherapy regimen. A regimen is defined by a set of
drugs, each with an administration schedule of potentially varied dosages throughout a chemother-
apy cycle. We characterize a regimen by drug indicators and each drug’s average daily dose and
frequency dosing strategies. The outcomes of interest (y) consist of overall survival, to be included
To determine the optimal chemotherapy regimen x for a new study cohort with characteristics
\begin{align*}
\min_{x,\,y} \quad & y_{OS} \\
\text{s.t.} \quad & y_i \le \tau_i, \quad i \in Y_C, \\
& x_b \in \{0, 1\}^d, \\
& x \in X(w).
\end{align*}
In this case study, we learn the full objective. However, this model could easily incorporate deter-
ministic components to optimize as additional weighted terms in the objective. We include one
The trust region, X (w), plays two crucial roles in the formulation. First, it ensures that the
predictive models are applied within their valid bounds and not inappropriately extrapolated. It
also naturally enforces a notion of “clinically reasonable” treatments. It prevents drugs from being
prescribed at doses outside of previously observed bounds, and it requires that the drug combination
must have been previously seen (although potentially in different doses). It is nontrivial to explicitly
characterize what constitutes a realistic treatment, and the convex hull provides a data-driven
solution that integrates directly into the model framework. Furthermore, the convex hull implicitly
enforces logical constraints between the different dimensions of x. For example, a drug’s average
and instantaneous dose must be 0, if the drug’s binary indicator is set to 0: this does not need to
be explicitly included as a constraint, since this is true for all observed treatment regimens. The
only explicit constraint required here is that the indicator variables xb are binary.
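The implicit logical constraint can be seen on a small example. If every observed regimen with indicator 0 has dose 0, then a convex combination whose indicator equals 0 can only place weight on such observations, so its dose is 0 as well. The sketch below illustrates this with hypothetical (indicator, average dose) pairs for a single drug; the function name and the values are illustrative only.

```python
def convex_combination(points, weights, tol=1e-9):
    """Return the convex combination of `points` with the given weights."""
    if any(w < -tol for w in weights) or abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must be nonnegative and sum to 1")
    dim = len(points[0])
    return [sum(w * p[j] for w, p in zip(weights, points)) for j in range(dim)]


# Hypothetical observed regimens for one drug: (binary indicator, average dose).
observed = [(0, 0.0), (1, 20.0), (1, 35.0)]

print(convex_combination(observed, [1.0, 0.0, 0.0]))  # drug off, dose 0
print(convex_combination(observed, [0.0, 0.5, 0.5]))  # drug on, dose between 20 and 35
print(convex_combination(observed, [0.5, 0.5, 0.0]))  # fractional indicator
```

The third combination yields a fractional indicator, which is why the binary restriction on x_b is still stated explicitly.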
5.2. Dataset
Our data consists of 495 clinical trial arms from 1979-2012 (Bertsimas et al. 2016). We consider
nine contextual variables, including the average patient age and breakdown of primary cancer site.
There are 28 unique drugs that appear in multiple arms of the training set, yielding 84 decision
variables. We include several “dose-limiting toxicities” (DLTs) for our constraint set: Grade 3/4
constitutional toxicity, gastrointestinal toxicity, and infection, as well as Grade 4 blood toxicity.
As the name suggests, these are chemotherapy side effects that are severe enough to affect the
course of treatment. We also consider incidence of any dose-limiting toxicity (“Any DLT”), which
We apply a temporal split, training the predictive models on trial arms through 2008 and generat-
ing prescriptions for the trial arms in 2009-2012. The final training set consists of 320 observations,
and the final testing set consists of 96 observations. The full feature set, inclusion criteria, and
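The temporal split described above can be sketched as a simple filter on the trial year. The record layout (a dictionary with a "year" key) and the sample records are hypothetical; the cutoff of 2008 comes from the text.

```python
def temporal_split(arms, last_train_year=2008):
    """Split trial arms into training (through last_train_year) and test sets."""
    train = [a for a in arms if a["year"] <= last_train_year]
    test = [a for a in arms if a["year"] > last_train_year]
    return train, test


# Hypothetical trial-arm records.
arms = [{"id": 1, "year": 1995}, {"id": 2, "year": 2008}, {"id": 3, "year": 2010}]
train, test = temporal_split(arms)
print([a["id"] for a in train], [a["id"] for a in test])
```

Splitting by time rather than at random mimics the prospective use of the model, where prescriptions are generated only from earlier trials.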
To define the trust region, we take the convex hull of the treatment variables (x) on the training
set. This aligns with the temporal split setting, in which we are generating prescriptions going
forward based on an existing set of past treatment decisions. In general it is preferable to define the
convex hull with respect to both x and w as discussed in Appendix B.1, but this does not work
well with a temporal split. Our data includes the study year as a feature to incorporate temporal
effects, and so our test set observations will by definition fall outside of the convex hull defined by
Several ML models are trained for each outcome of interest using cross-validation for parameter
tuning, and the best model is selected based on the validation criterion. We employ function
learning for all toxicities, directly predicting the toxicity incidence and applying an upper bound
Based on the model selection procedure, overall DLT, gastrointestinal toxicity, and overall sur-
vival are predicted using GBM models. Blood toxicity and infection are predicted using linear
models, and constitutional toxicity is predicted with a RF model. This demonstrates the advantage
of learning with multiple model classes; no single method dominates in predictive performance. A
We generate prescriptions using the optimization model outlined in Section 5.1, with the embedded
model choices specified in Section 5.3. In order to evaluate the quality of our prescriptions, we must
estimate the outcomes under various treatment alternatives. This evaluation task is notoriously
challenging due to the lack of counterfactuals. In particular, we only know the true outcomes for
observed cohort-treatment pairs and do not have information on potential unobserved combina-
tions. We propose an evaluation scheme that leverages a “ground truth” ensemble (GT ensemble).
We train several ML models using all data from the study. These models are not embedded in
an MIO model, so we are able to consider a broader set of methods in the ensemble. We then
predict each outcome by averaging across all models in the ensemble. This approach allows us to
capture the maximal knowledge scenario. Furthermore, such a “consensus” approach of combining
ML models has been shown to improve predictive performance and is more robust to individual
model error (Bertsimas et al. 2021). The full details of the ensemble models and their predictive
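The averaging step of the GT ensemble can be sketched as follows. The models are represented here as plain callables returning a point prediction, which is an assumption of this sketch; the function name and the toy models are hypothetical.

```python
def gt_ensemble_predict(models, features):
    """Average the point predictions of all models in the ensemble."""
    if not models:
        raise ValueError("the ensemble must contain at least one model")
    predictions = [model(features) for model in models]
    return sum(predictions) / len(predictions)


# Hypothetical stand-ins for trained models predicting a toxicity incidence.
models = [lambda f: 0.2, lambda f: 0.4, lambda f: 0.6]
print(gt_ensemble_predict(models, features={"age": 60}))
```

Because these models are used only for evaluation and are never embedded in the MIO, they need not be MIO-representable.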
We evaluate our model in multiple ways. We first consider the performance of our prescrip-
tions against observed (given) treatments. We then explore the impact of learning multiple sub-
constraints rather than a single aggregate toxicity constraint. All optimization models have the
following shared parameters: a toxicity upper bound set at the 0.6 quantile (as observed in the training data)
and a maximum violation of 25% for RF models. We report results for all test set observations with
a feasible solution. It is possible that an observation has no feasible solution, implying that there is
not a suitable drug combination lying within the convex hull for this cohort based on the toxicity
requirements. These cases could be further investigated through a sensitivity analysis by relaxing
the toxicity constraints or enlarging the trust region. With clinical guidance, one could evaluate
the modifications required to make the solution feasible and the clinical appropriateness of such
relaxations.
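The toxicity upper bounds above are set at the 0.6 quantile of the incidences observed in the training data. The sketch below shows one way such a threshold could be computed, assuming a linear-interpolation quantile; the interpolation rule, the function name, and the sample incidences are assumptions, since the text does not specify them.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics (assumed rule)."""
    data = sorted(values)
    pos = q * (len(data) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    frac = pos - lo
    return data[lo] + frac * (data[hi] - data[lo])


# Hypothetical toxicity incidences observed in training arms.
train_incidence = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
tau = quantile(train_incidence, 0.6)
print(tau)
```

Relaxing the toxicity constraints in a sensitivity analysis would correspond to raising q and re-solving for the cohorts that were infeasible.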
Table 6 reports the predicted outcomes under two constraint approaches: (1) constraining each
toxicity separately (“All Constraints”), and (2) constraining a single aggregate toxicity measure
(“DLT Only”). For each cohort in the test set, we generate predictions for all outcomes of interest
under both prescription schemes and compute the relative change of our prescribed outcome from
Benefit of prescriptive scheme. We begin by evaluating our proposed prescriptive scheme (“All
Constraints”) against the observed actual treatments. For example, under the GT ensemble scheme,
84.7% of cohorts satisfied the overall DLT constraint under the given treatment, compared to
94.1% under the proposed treatment. This yields a relative improvement of 11.10%. We obtain a significant
improvement in survival (11.40%) while also improving toxicity limit satisfaction across all
individual toxicities. Using the GT ensemble, we see toxicity satisfaction improvements between
1.3%-25.0%. We note that since toxicity violations are reported using the average incidence for each
cohort, and the constraint limits are toxicity-specific, it is possible for a single DLT’s incidence to
be over the allowable limit while the overall “Any DLT” rate is not.
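The improvement quoted for the overall DLT constraint (84.7% to 94.1% satisfaction) appears to be a relative change in the satisfaction rate rather than a difference in percentage points, as the following check suggests:

\[
\frac{94.1 - 84.7}{84.7} = \frac{9.4}{84.7} \approx 0.1110 = 11.10\%.
\]

In percentage points, the same change is 94.1 − 84.7 = 9.4.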
Table 6 Comparison of outcomes under given treatment regimen, regimen prescribed when only constraining the
aggregate toxicity, and regimen prescribed under our full model.
Benefit of multiple constraints. Table 6 also illustrates the value of enforcing constraints on each
individual toxicity rather than as a single measure. When only constraining the aggregate toxicity
measure (“DLT Only”), the resultant prescriptions actually have lower constraint satisfaction for
blood toxicity and infection than the baseline given regimens. By constraining multiple measures,
we are able to improve across all individual toxicities. The fully constrained model actually improves
the overall DLT measure satisfaction, suggesting that the inclusion of these “sub-constraints”
also makes the aggregate constraint more robust. This improvement does come at the expense of
slightly lower survival between the “All” and “DLT Only” models (-0.38 months) but we note that
incurring the individual toxicities that are violated in the “DLT Only” model would likely make
6. Discussion
Our experimental results illustrate the benefits of our constraint learning framework in data-
driven decision making in two problem settings: food basket recommendations for the WFP and
chemotherapy regimens for advanced gastric cancer. The quantitative results show an improvement
in predictive performance when incorporating the trust region and learning from multiple candi-
date model classes. Our framework scales to large problem sizes, enabled by efficient formulations
and tailored approaches to specific problem structures. Our approach for efficiently learning the
The nominal problem formulation is strengthened by embedding multiple models for a single
constraint rather than relying on a single learned function. This notion of robustness is particularly
functions can lead to suboptimal outcomes, a mis-specified constraint can lead to infeasible solu-
tions. Finally, our software exposes the model ensemble construction and trust region enlargement
options directly through user-specified parameters. This allows an end user to directly evaluate
tradeoffs in objective value and constraint satisfaction, as the problem’s real-world context often
We recognize several opportunities to further extend this framework. Our work naturally relates
to the causal inference literature and individual treatment effect estimation (Athey and Imbens
2016, Shalit et al. 2017). These methods do not directly translate to our problem setting; existing
work generally assumes highly structured treatment alternatives (e.g., binary treatment vs. control)
or a single continuous treatment (e.g., dosing), whereas we allow more general decision structures. In
future work, we are interested in incorporating ideas from causal inference to relax the assumption
of no unobserved confounders.
Additionally, our framework is dependent on the quality of the underlying predictive models. We
constrain and optimize point predictions from our embedded models. This can be problematic in the
toub and Grigas 2021). We mitigate this concern in two ways. First, our model selection procedure
allows us to obtain higher quality predictive models by capturing several possible functional rela-
tionships. Second, our model wrapper approach for embedding a single constraint with an ensemble
of models allows us to directly control our robustness to the predictions of individual learners.
In future work, there is an opportunity to incorporate ideas from robust optimization to directly
account for prediction uncertainty in individual model classes. While this has been addressed in
the linear case (Goldfarb and Iyengar 2003), it remains an open area of research in more general
ML methods.
In this work, we present a unified framework for optimization with learned constraints that
leverages both ML and MIO for data-driven decision making. Our work flexibly learns problem
constraints and objectives with supervised learning, and incorporates them into a larger optimiza-
tion problem of interest. We also learn the trust region, providing more credible recommendations
and improving predictive performance, and accomplish this efficiently using column generation and
unsupervised learning. The generality of our method allows us to tackle quite complex decision set-
tings, such as chemotherapy optimization, but also includes tailored approaches for more efficiently
solving specific problem types. Finally, we implement this as a Python software package (OptiCL)
to enable practitioner use. We envision that OptiCL’s methodology will be added to state-of-the-art
Acknowledgments
The authors thank the anonymous reviewers and editorial team for their valuable feedback on this work. This
work was supported by the Dutch Scientific Council (NWO) grant OCENW.GROOT.2019.015, Optimization
for and with Machine Learning (OPTIMAL). Additionally, Holly Wiberg was supported by the National
Science Foundation Graduate Research Fellowship under Grant No. 174530. Any opinion, findings, and
conclusions or recommendations expressed in this material are those of the authors(s) and do not necessarily
References
formulations for trained neural networks. Mathematical Programming 183(1-2):3–39, ISSN 14364646,
URL http://dx.doi.org/10.1007/s10107-020-01474-5.
Athey S, Imbens G (2016) Recursive partitioning for heterogeneous causal effects. Proceedings of the National
1510489113.
Balestriero R, Pesenti J, LeCun Y (2021) Learning in high dimension always amounts to extrapolation. URL
http://dx.doi.org/10.48550/ARXIV.2110.09485.
Bengio Y, Lodi A, Prouvost A (2021) Machine learning for combinatorial optimization: A methodological
tour d’horizon. European Journal of Operational Research 290(2):405–421, ISSN 0377-2217, URL
http://dx.doi.org/10.1016/j.ejor.2020.07.063.
Bergman D, Huang T, Brooks P, Lodi A, Raghunathan AU (2022) JANOS: an integrated predictive and
mann DJ, Estrada V, Macaya C, Gil IJ (2021) Personalized prescription of ACEI/ARBs for hyper-
tensive COVID-19 patients. Health Care Management Science 24(2):339–355, ISSN 15729389, URL
http://dx.doi.org/10.1007/s10729-021-09545-5.
Bertsimas D, Dunn J (2017) Optimal classification trees. Machine Learning 106(7):1039–1082, ISSN
Bertsimas D, Kallus N (2020) From predictive to prescriptive analytics. Management Science 66(3):1025–
chemotherapy regimens for cancer. Management Science 62(5):1511–1531, ISSN 15265501, URL http:
//dx.doi.org/10.1287/mnsc.2015.2363.
Bertsimas D, Dunn J (2018) Machine Learning under a Modern Optimization Lens (Belmont: Dynamic
Ideas).
Biggs M, Hariss R, Perakis G (2021) Optimizing objective functions determined from random forests. SSRN
Bonfietti A, Lombardi M, Milano M (2015) Embedding decision trees and random forests in constraint
programming. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial
Intelligence and Lecture Notes in Bioinformatics) 9075:74–90, ISSN 16113349, URL
http://dx.doi.org/10.1007/978-3-319-18008-3_6.
Breiman L (2001) Random forests. Machine Learning 45(1):5–32, ISSN 08856125, URL
http://dx.doi.org/10.1023/A:1010933404324.
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees (Routledge),
Cancer Therapy Evaluation Program (2006) Common terminology criteria for adverse events v3.0. URL
https://ctep.cancer.gov/protocoldevelopment/electronic_applications/docs/ctcaev3.pdf.
Chen Y, Shi Y, Zhang B (2020) Input convex neural networks for optimal voltage regulation. URL
http://arxiv.org/abs/2002.08684.
Cremer JL, Konstantelos I, Tindemans SH, Strbac G (2019) Data-driven power system operation: Explor-
ing the balance between cost and risk. IEEE Transactions on Power Systems 34(1):791–801, ISSN
Drucker H, Burges CJ, Kaufman L, Smola A, Vapnik V (1997) Support vector regression machines. Advances
Ebert T, Belz J, Nelles O (2014) Interpolation and extrapolation: Comparison of definitions and survey of
algorithms for convex and concave hulls. 2014 IEEE Symposium on Computational Intelligence and
Elmachtoub AN, Grigas P (2021) Smart “Predict, then Optimize”. Management Science 1–46, ISSN 0025-
Fajemisin A, Maragno D, den Hertog D (2021) Optimization with constraint learning: A framework and
Dantzig GB, Wolfe P (1960) Decomposition principle for linear programs. Operations Research 8(1):101–
Goldfarb D, Iyengar G (2003) Robust portfolio selection problems. Mathematics of Operations Research
Goodfellow IJ, Shlens J, Szegedy C (2015) Explaining and harnessing adversarial examples. CoRR
abs/1412.6572.
Grimstad B, Andersson H (2019) ReLU networks as surrogate models in mixed-integer linear programs. Com-
compchemeng.2019.106580.
Gurobi Optimization, LLC (2021) Gurobi Optimizer Reference Manual. URL https://www.gurobi.com.
security-boundary constrained optimal power flow. IEEE Transactions on Power Systems 26(1):63–72,
AC-OPF for operations and markets. 20th Power Systems Computation Conference, PSCC 2018 URL
http://dx.doi.org/10.23919/PSCC.2018.8442786.
Kleijnen JP (2015) Design and analysis of simulation experiments. International Workshop on Simulation,
3–22 (Springer).
Kudla P, Pawlak TP (2018) One-class synthesis of constraints for Mixed-Integer Linear Programming with
C4.5 decision trees. Applied Soft Computing Journal 68:1–12, ISSN 15684946, URL
http://dx.doi.org/10.1016/j.asoc.2018.03.025.
Lombardi M, Milano M, Bartolini A (2017) Empirical decision model learning. Artificial Intelligence 244:343–
367.
Mišić VV (2020) Optimization of tree ensembles. Operations Research 68(5):1605–1624, ISSN 15265463, URL
http://dx.doi.org/10.1287/opre.2019.1928.
MOSEK (2019) MOSEK Optimizer API for Python 9.3.7. URL
https://docs.mosek.com/latest/pythonapi/index.html.
National Cancer Institute (2021) Treatment clinical trials for gastric (stomach) cancer. URL
https://www.cancer.gov/about-cancer/treatment/clinical-trials/disease/stomach-cancer/treatment.
Pawlak TP (2019) Synthesis of mathematical programming models with one-class evolutionary strategies.
//doi.org/10.1016/j.swevo.2018.04.007.
Pawlak TP, Krawiec K (2019) Synthesis of constraints for mathematical programming with one-class genetic
org/10.1109/TEVC.2018.2835565.
Pawlak TP, Litwiniuk B (2021) Ellipsoidal one-class constraint acquisition for quadratically constrained
programming. European Journal of Operational Research 293(1):36–49, ISSN 03772217, URL http:
//dx.doi.org/10.1016/j.ejor.2020.12.018.
Peters K, Silva S, Gonçalves R, Kavelj M, Fleuren H, den Hertog D, Ergun O, Freeman M (2021) The
nutritious supply chain: Optimizing humanitarian food assistance. INFORMS Journal on Optimization
3(2):200–226.
Schweidtmann AM, Mitsos A (2019) Deterministic global optimization with artificial neural networks
embedded. Journal of Optimization Theory and Applications 180(3):925–948, ISSN 15732878, URL
http://dx.doi.org/10.1007/s10957-018-1396-0.
Shalit U, Johansson FD, Sontag D (2017) Estimating individual treatment effect: generalization bounds and
Skiena SS (2008) The Algorithm Design Manual (Springer Publishing Company, Incorporated), 2nd edition.
Spyros C (2020) From decision trees and neural networks to MILP: power system optimization considering
dynamic stability constraints. 2020 European Control Conference (ECC), 594–594 (IEEE), ISBN 978-
Sroka D, Pawlak TP (2018) One-class constraint acquisition with local search. GECCO 2018 - Proceedings
of the 2018 Genetic and Evolutionary Computation Conference 363–370, URL
http://dx.doi.org/10.1145/3205455.3205480.
Stoer J, Botkin ND (2005) Minimization of convex functions on the convex hull of a point set.
s00186-005-0018-4.
Stoer J, Botkin ND, Pykhteev OA (2007) An interior-point method for minimizing convex functions on
02331930701421111.
OPF. Proc. 10th Bulk Power Syst. Dyn. Control Symp., 1–10, URL
http://irep2017.inesctec.pt/conference-papers/conference-papers/paper65r7z1aplj.pdf.
UNHCR, UNICEF, WFP, WHO (2002) Food and nutrition needs in emergencies. URL
https://www.who.int/nutrition/publications/emergencies/a83743/en/.
Venzke A, Viola DT, Mermet-Guyennet J, Misyris GS, Chatzivasileiadis S (2020) Neural networks for
encoding dynamic security-constrained optimal power flow to mixed-integer linear programs URL
http://arxiv.org/abs/2003.07939.
Verwer S, Zhang Y, Ye QC (2017) Auction optimization using regression trees and linear models as integer
artint.2015.05.004.
Wolfe P (1961) A duality theorem for non-linear programming. Quarterly of Applied Mathematics
19(3):239–244.
Yang D, Hendifar A, Lenz C, Togawa K, Lenz F, Lurje G, Pohl A, Winder T, Ning Y, Groshen S, Lenz
HJ (2011) Survival of metastatic gastric cancer: Significance of age, sex and race/ethnicity. Journal of
2078-6891.2010.025.
$$y = \beta_0 + \beta_x^\top x + \beta_w^\top w.$$
Support Vector Machines. A support vector machine (SVM) uses a hyper-plane split to generate
predictions, both for classification (Cortes and Vapnik 1995) and regression (Drucker et al. 1997).
We consider the case of linear SVMs, since this allows us to obtain the prediction as a linear
function of the decision variables x. In linear support vector regression (SVR), which we use for
function learning, we fit a linear function to the data. The setting is similar to linear regression,
but the loss function only penalizes residuals greater than an ϵ threshold (Drucker et al. 1997). As
with linear regression, the trained model returns a linear function with coefficients βx , βw , and β0 .
The final prediction is
$$y = \beta_0 + \beta_x^\top x + \beta_w^\top w.$$
For the classification setting, linear support vector classification (SVC) identifies a hyper-plane
that best separates positive and negative samples (Cortes and Vapnik 1995). A trained SVC model
similarly returns coefficients βx , βw , and β0 , where a sample’s prediction is given by
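$$y = \begin{cases} 1, & \text{if } \beta_0 + \beta_x^\top x + \beta_w^\top w \ge 0; \\ 0, & \text{otherwise.} \end{cases}$$
In SVC, the output variable y is binary rather than a probability. In this case, the constraint can
simply be embedded as $\beta_0 + \beta_x^\top x + \beta_w^\top w \ge 0$.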
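As a minimal sketch of the linear embeddings above, the snippet below evaluates the linear score shared by linear regression and SVR, and the SVC decision rule derived from it. The coefficient values and the helper names `linear_score` and `svc_predict` are hypothetical and chosen only for illustration; they are not part of OptiCL.

```python
def linear_score(beta0, beta_x, beta_w, x, w):
    """Return beta0 + beta_x^T x + beta_w^T w."""
    score = beta0
    score += sum(b * v for b, v in zip(beta_x, x))
    score += sum(b * v for b, v in zip(beta_w, w))
    return score


def svc_predict(beta0, beta_x, beta_w, x, w):
    """Binary SVC output: 1 if the linear score is non-negative, else 0."""
    return 1 if linear_score(beta0, beta_x, beta_w, x, w) >= 0 else 0


# Hypothetical trained coefficients and a fixed context w.
beta0, beta_x, beta_w = -1.0, [2.0, -1.0], [0.5]
w = [1.0]

print(linear_score(beta0, beta_x, beta_w, [1.0, 1.0], w))
print(svc_predict(beta0, beta_x, beta_w, [1.0, 1.0], w))
print(svc_predict(beta0, beta_x, beta_w, [0.0, 1.0], w))
```

Because w is fixed at optimization time, the term involving w acts as a constant, and requiring a prediction of 1 reduces to a single linear inequality in x.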
equivalently, $-A_2^\top x < -b_2$. Furthermore, we can remove the strict inequalities using a sufficiently
$$
\begin{aligned}
A_1^\top x - M(1 - l_3) &\le b_1, && \text{(7a)} \\
A_2^\top x - M(1 - l_3) &\le b_2, && \text{(7b)} \\
A_1^\top x - M(1 - l_4) &\le b_1, && \text{(7c)} \\
-A_2^\top x - M(1 - l_4) &\le -b_2 - \epsilon, && \text{(7d)} \\
-A_1^\top x - M(1 - l_6) &\le -b_1 - \epsilon, && \text{(7e)} \\
A_5^\top x - M(1 - l_6) &\le b_5, && \text{(7f)} \\
-A_1^\top x - M(1 - l_7) &\le -b_1 - \epsilon, && \text{(7g)} \\
-A_5^\top x - M(1 - l_7) &\le -b_5 - \epsilon, && \text{(7h)} \\
l_3 + l_4 + l_6 + l_7 &= 1, && \text{(7i)} \\
y - (p_3 l_3 + p_4 l_4 + p_6 l_6 + p_7 l_7) &= 0, && \text{(7j)}
\end{aligned}
$$
where $l_3, l_4, l_6, l_7$ are binary variables associated with the corresponding leaves. For a given x, if
$A_1^\top x \le b_1$, constraints (7e) and (7g) will force $l_6$ and $l_7$ to zero, respectively. If
$A_2^\top x \le b_2$, constraint (7d) will force $l_4$ to 0. The assignment constraint (7i) will then force
$l_3 = 1$, assigning the observation to leaf 3 as desired. Finally, constraint (7j) sets y to the prediction
of the assigned leaf ($p_3$). We can then constrain the value of y using our desired upper bound of τ (or
lower bound, without loss of generality).
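The leaf-assignment logic of constraints (7a)-(7j) can be checked numerically with a small sketch. The split coefficients, thresholds, leaf predictions, and the values of M and ϵ below are hypothetical; the code simply enumerates the four one-hot leaf assignments and reports which ones satisfy every constraint for a given x.

```python
EPS = 1e-6   # small constant standing in for the strict inequality
M = 10.0     # big-M, assumed large enough for x in [0, 1]^2

# Hypothetical splits: node 1 (root), node 2 (left child), node 5 (right child).
A1, B1 = (1.0, 0.0), 0.5
A2, B2 = (0.0, 1.0), 0.5
A5, B5 = (0.0, 1.0), 0.2
LEAF_PRED = {3: 0.1, 4: 0.4, 6: 0.7, 7: 0.9}   # hypothetical p_3, p_4, p_6, p_7


def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))


def constraints_hold(x, l):
    """Check (7a)-(7i) for a point x and leaf indicators l (dict leaf -> 0/1)."""
    a1, a2, a5 = dot(A1, x), dot(A2, x), dot(A5, x)
    return all([
        a1 - M * (1 - l[3]) <= B1,            # (7a)
        a2 - M * (1 - l[3]) <= B2,            # (7b)
        a1 - M * (1 - l[4]) <= B1,            # (7c)
        -a2 - M * (1 - l[4]) <= -B2 - EPS,    # (7d)
        -a1 - M * (1 - l[6]) <= -B1 - EPS,    # (7e)
        a5 - M * (1 - l[6]) <= B5,            # (7f)
        -a1 - M * (1 - l[7]) <= -B1 - EPS,    # (7g)
        -a5 - M * (1 - l[7]) <= -B5 - EPS,    # (7h)
        sum(l.values()) == 1,                 # (7i)
    ])


def feasible_leaves(x):
    """Return the leaves whose one-hot assignment satisfies all constraints."""
    result = []
    for leaf in LEAF_PRED:
        l = {k: int(k == leaf) for k in LEAF_PRED}
        if constraints_hold(x, l):
            result.append(leaf)
    return result


for point in [(0.2, 0.3), (0.2, 0.9), (0.8, 0.1), (0.8, 0.9)]:
    leaves = feasible_leaves(point)
    # (7j): y equals the prediction of the single feasible leaf.
    print(point, leaves, [LEAF_PRED[leaf] for leaf in leaves])
```

For points that are not within ϵ of a split threshold, exactly one leaf assignment is feasible, which is what lets constraint (7j) recover the tree's prediction.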
More generally, consider a decision tree $\hat{h}(x, w)$ with a set of leaf nodes $L$, each described by a
binary variable $l_i$ and a prediction score $p_i$. Splits take the form $(A_x)^\top x + (A_w)^\top w \le b$,
where $A_x$ gives the coefficients for the optimization variables x and $A_w$ gives the coefficients for the
non-optimization (fixed) variables w. Let $S^l$ be the set of nodes that define the splits that observations
in leaf i must obey. Without loss of generality, we can write these all as
$(\bar{A}_x)_j^\top x + (\bar{A}_w)_j^\top w - M(1 - l_i) \le \bar{b}_j$, where $\bar{A}$ is $A$ if leaf i
follows the left split of j and $-A$ otherwise. Similarly, $\bar{b}$ equals $b$ if the leaf falls to the left
split, and $-b - \epsilon$ otherwise, as established above. This decision tree can then be embedded through
the following constraints:
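$$
\begin{aligned}
(\bar{A}_x)_j^\top x + (\bar{A}_w)_j^\top w - M(1 - l_i) &\le \bar{b}_j, \quad i \in L,\ j \in S^l, && \text{(8a)} \\
\sum_{i \in L} l_i &= 1, && \text{(8b)} \\
y - \sum_{i \in L} p_i l_i &= 0. && \text{(8c)}
\end{aligned}
$$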
Here, M can be selected for each split by considering the maximum difference between
$(\bar{A}_x)_j^\top x + (\bar{A}_w)_j^\top w$ and $\bar{b}_j$. A prescription solution x for a patient with
features w must obey the constraints determined by its split path, i.e. only the splits that lead to its
assigned leaf i. If $l_i = 1$, constraint (8a) will enforce that the solution obeys all split constraints
leading to leaf i. If $l_i = 0$, no constraints related to leaf i should be applied: constraint (8a) will be
nonbinding at node j if $M \ge (\bar{A}_x)_j^\top x + (\bar{A}_w)_j^\top w - \bar{b}_j$. Thus we can find the
minimum necessary value of M by maximizing these expressions over all possible values of x (for the
patient's fixed w). For a given patient with features w for whom we wish to optimize treatment, EM(w) is the
solution of
$$
\begin{aligned}
\max_{x}\quad & (\bar{A}_x)_j^\top x + (\bar{A}_w)_j^\top w - \bar{b}_j && \text{(9a)} \\
& x \in \mathcal{X}(w). && \text{(9c)}
\end{aligned}
$$
Note that the non-learned constraints on x, namely constraint (9b), and the trust region constraint
(9c) allow us to reduce the search space when determining M .
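To illustrate how a split-specific M can be derived, the sketch below maximizes the split expression in closed form under a simplifying assumption that is not made in the text: the feasible set for x is taken to be a simple box with known lower and upper bounds, instead of the non-learned constraints and trust region. The function name `big_m_over_box` and all numbers are hypothetical.

```python
def big_m_over_box(a_x, a_w, b_bar, w, lower, upper):
    """Smallest valid big-M for one split when x is restricted to a box.

    Maximizes a_x^T x + a_w^T w - b_bar over lower <= x <= upper.
    A linear function over a box is maximized coordinate-wise, so each
    term contributes max(a * lo, a * hi).
    """
    max_ax = sum(max(a * lo, a * hi) for a, lo, hi in zip(a_x, lower, upper))
    fixed = sum(a * v for a, v in zip(a_w, w))
    # If the maximum is negative the split always holds and M = 0 suffices.
    return max(0.0, max_ax + fixed - b_bar)


# Hypothetical split: x1 - 2*x2 + 0.5*w1 <= 0.5 with x in [0, 1]^2 and w = (2,).
m_value = big_m_over_box(
    a_x=[1.0, -2.0], a_w=[0.5], b_bar=0.5,
    w=[2.0], lower=[0.0, 0.0], upper=[1.0, 1.0],
)
print(m_value)
```

With a tighter feasible set, such as the one defined by additional constraints and the trust region, the maximum can only decrease, so the box-based value is a valid but possibly loose choice of M.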
MIO vs. LO formulation for decision trees. In Section 2, we proposed two ways of embedding a
decision tree as a constraint. The first uses an LO to represent each feasible leaf node in the tree,
while the second directly uses the entire MIO representation of the tree as a constraint. To compare
the performance of these two approaches, we learn the palatability constraint using decision trees
(CART) grown to have various numbers of leaves, and solve the optimization model with both
approaches.
Figure EC.1 Comparison of MIO and multiple LO approach to tree representation, as a function of the number
of leaves.
When comparing the solution times (averaged over 10 runs), Figure EC.1 shows that the MIO
approach is relatively consistent in terms of solution time regardless of the number of leaves. With
the LO approach however, as the number of leaves grows, the number of LOs to be solved also
grows. While the solution time of a single LO is very low, solving multiple LOs sequentially might
be heavily time consuming. A way to speed up the process is to solve the LOs in parallel. When
only one LO needs to be solved, it takes 1.8 seconds in this problem setting. By parallelizing the
solution of the LOs, the total solution time can be expected to take only as long as it takes for the
slowest LO to be solved.
v ≥ x, (10a)
v ≤ x − ML (1 − z), (10b)
v ≤ MU z, (10c)
v ≥ 0, (10d)
z ∈ {0, 1} , (10e)
where ML < 0 is a lower bound on all possible values of x, and MU > 0 is an upper bound. While
this embedding relies on a big-M formulation, it can be improved in multiple ways. The model
can be tightened by careful selection of ML and MU . Furthermore, Anderson et al. (2020) recently
proposed an additional iterative cut generation procedure to improve the strength of the basic
big-M formulation.
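As a quick numerical check of the big-M ReLU encoding (10a)-(10e), the sketch below enumerates both values of z for a given input x and computes the interval of values v that satisfy the constraints. The bounds used for M_L and M_U and the function name `relu_feasible_values` are hypothetical.

```python
def relu_feasible_values(x, m_l, m_u, tol=1e-9):
    """Return (z, v_low, v_high) for each z in {0, 1} that admits a feasible v."""
    solutions = []
    for z in (0, 1):
        v_low = max(x, 0.0)                       # (10a) and (10d)
        v_high = min(x - m_l * (1 - z), m_u * z)  # (10b) and (10c)
        if v_low <= v_high + tol:
            solutions.append((z, v_low, v_high))
    return solutions


M_L, M_U = -5.0, 5.0  # assumed bounds on the pre-activation value x
for x in (2.0, -3.0, 0.0):
    print(x, relu_feasible_values(x, M_L, M_U))
```

In every feasible case the interval collapses to the single value max(0, x), so the constraints reproduce the ReLU output whenever x lies within the assumed bounds.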
The constraints for an MLP network can be generated recursively starting from the input layer,
with a set of ReLU constraints for each node in each internal layer, l ∈ {2, . . . , L − 1}. This allows
us to embed a trained MLP with an arbitrary number of hidden layers and nodes into an MIO.
Regression. In a regression setting, the output layer L consists of a single node that is a linear
combination of the node values in layer L − 1, so it can be encoded directly as
$$y = v^L = \beta_0^L + \sum_{j \in N^{L-1}} \beta_j^L v_j^{L-1}.$$
Binary Classification. In the binary classification setting, the output layer requires one neuron
with a sigmoid activation function, $S(x) = \frac{1}{1 + e^{-x}}$. The value is given as
$$v^L = \frac{1}{1 + e^{-(\beta_0^L + \beta^{L\top} v^{L-1})}}$$
with $v^L \in (0, 1)$. This function is nonlinear, and thus, cannot be directly embedded into our
formulation. However, if τ is our desired probability lower bound, it will be satisfied when
$\beta_0^L + \beta^{L\top} v^{L-1} \ge \ln\frac{\tau}{1-\tau}$. Therefore, the neural network’s output,
binarized with a threshold of τ, is given by
$$y = \begin{cases} 1, & \text{if } \beta_0^L + \beta^{L\top} v^{L-1} \ge \ln\frac{\tau}{1-\tau}; \\ 0, & \text{otherwise.} \end{cases}$$
For example, at a threshold of τ = 0.5, the predicted value is 1 when β0L + β L⊤ v L−1 ≥ 0. Here, τ
can be chosen according to the minimum necessary probability to predict 1. As for the SVC case,
y is binary and the constraint can be embedded as y ≥ 1. We refer to Appendix A.3 for the case of
neural networks trained for multi-class classification.
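The equivalence between thresholding the sigmoid output at τ and thresholding the linear pre-activation score at ln(τ/(1−τ)) can be sketched as follows. The function names and the score values are hypothetical.

```python
import math


def logit(tau):
    """Threshold on the linear score equivalent to a probability threshold tau."""
    return math.log(tau / (1 - tau))


def sigmoid(score):
    return 1 / (1 + math.exp(-score))


def binarized_output(score, tau):
    """Return 1 if the pre-activation score clears the logit threshold, else 0."""
    return 1 if score >= logit(tau) else 0


tau = 0.8
for score in (1.0, 2.0):  # hypothetical values of beta_0 + beta^T v^{L-1}
    # The linear test and the probability test agree.
    print(score, binarized_output(score, tau), sigmoid(score) >= tau)
```

Since the threshold depends only on τ, it can be computed once before optimization, leaving a linear constraint on the last hidden layer.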
Multi-class classification. In multi-class classification, the outputs are traditionally obtained by
applying a softmax activation function, $S(x)_i = e^{x_i} / \sum_{k=1}^{K} e^{x_k}$, to the final layer. This
function ensures that the outputs sum to one and can thus be interpreted as probabilities. In particular,
suppose we have a K-class classification problem. Each node in the final layer has an associated
weight vector $\beta_i$, which maps the nodes of layer L − 1 to the output layer by $\beta_i^\top v^{L-1}$.
The softmax function rescales these values, so that class i will be assigned probability
$$v_i^L = \frac{e^{\beta_i^\top v^{L-1}}}{\sum_{k=1}^{K} e^{\beta_k^\top v^{L-1}}}.$$
We cannot apply the softmax function directly in an MIO framework with linear constraints.
Instead, we use an argmax function to directly return an indicator of the highest probability class,
similar to the approach with SVC and binary classification MLP. In other words, the output y is
a one-hot indicator vector with $y_i = 1$ for the most likely class. Class i has the highest probability
if and only if
$$\beta_{i0}^L + \beta_i^{L\top} v^{L-1} \ge \beta_{k0}^L + \beta_k^{L\top} v^{L-1}, \quad k = 1, \dots, K.$$
Constraint (11a) forces $y_i = 0$ if the constraint is not satisfied for some $k \in \{1, \dots, K\}$.
Constraint (11b) ensures that $y_i = 1$ for the highest likelihood class. We can then constrain the
prediction to fall in our desired class i by enforcing $y_i = 1$.
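The reason an argmax over the linear scores can stand in for the softmax is that the softmax is monotone in each score, so the most likely class is the one with the largest pre-activation value. A minimal sketch with hypothetical scores:

```python
import math


def softmax(scores):
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax_indicator(scores):
    """One-hot vector marking the largest score (first index wins ties)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == best else 0 for i in range(len(scores))]


scores = [0.2, 1.5, -0.3]  # hypothetical linear scores for K = 3 classes
probs = softmax(scores)
print(argmax_indicator(scores))
print(argmax_indicator(probs))
```

Both calls mark the same class, which is why the pairwise linear comparisons between scores are sufficient to identify the predicted class.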
As we explain in Section 2.3, the trust region prevents the predictive models from extrapolating. It
is defined as the convex hull of the set $Z = \{(\bar{x}_i, \bar{w}_i)\}_{i=1}^{N}$, with
$\bar{x}_i \in \mathbb{R}^n$ observed treatment decisions, and $\bar{w}_i \in \mathbb{R}^p$ contextual
information. In Section B.1, we explain the importance of using both $\bar{x}$ and $\bar{w}$ in the
formulation of the convex hull. When the number of samples (N) is too large, the optimization model trust
region constraints may become computationally expensive. In this case, we propose a column selection
algorithm which is detailed in Section B.2.
[Figure: two panels plotting w (vertical axis) against x (horizontal axis).]
is exactly what we need. Stoer and Botkin (2005) and Stoer et al. (2007) describe another method
to optimize over the convex hull of a very large set of points. However, the method proposed
in these papers is only suitable for problems that have the convex hull constraint and no
additional constraints.
Let PI be a convex and continuously differentiable model consisting of an objective function and
constraints that may be known a priori as well as learned from data. Like in Section 2.3, we denote
the index set of samples by I . As part of the constraints, the trust region is defined on the entire
set $Z$. We start with the matrix $Z \in \mathbb{R}^{N \times (n+p)}$, where each row corresponds to a given
data point in $Z$. Then, model $P_I$ is given as
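$$
\begin{aligned}
\min_{\lambda}\quad & f(Z^\top \lambda) && && \text{(12a)} \\
\text{s.t.}\quad & g_j(Z^\top \lambda) \le 0, \quad j = 1, \dots, m, && \perp \mu, && \text{(12b)} \\
& \sum_{i \in I} \lambda_i = 1, && \perp \rho, && \text{(12c)} \\
& \lambda_i \ge 0, \quad i \in I, && \perp \upsilon, && \text{(12d)}
\end{aligned}
$$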
where the decision variable x is replaced by $Z^\top \lambda$. Constraints (12b) include both known and
learned constraints, while constraints (12c) and (12d) are used for the trust region. The dual variables
associated with constraints (12b), (12c), and (12d) are $\mu \in \mathbb{R}^m$, $\rho \in \mathbb{R}$, and
$\upsilon \in \mathbb{R}^N$, respectively. Note that for readability, we omit the contextual variables (w)
without loss of generality.
When we deal with huge datasets, solving $P_I$ may be computationally expensive. Therefore, we
propose an iterative column selection algorithm (Algorithm 1) that can be used to speed up the
optimization while still obtaining a global optimum.
The algorithm starts by initializing $I' \subseteq I$ with an arbitrarily small subset of samples $I^0$ and
iteratively solves the restricted master problem $P_{I'}$ and the WolfeDual function. By solving $P_{I'}$,
we get the primal and dual optimal solutions $\lambda^*$ and $(\mu^*, \rho^*, \upsilon^*)$, respectively.
The primal and dual optimal solutions, together with $I$ and $I'$, are given as input to WolfeDual, which
returns a set of samples $\bar{I} \subseteq I \setminus I'$ with negative reduced cost. If $\bar{I}$ is not
empty, it is added to $I'$ and a new iteration starts; otherwise the algorithm stops, and $\lambda^*$ (with
the corresponding $x^*$) is returned as the global optimum of $P_I$. A visual interpretation of Algorithm 1
is shown in Figure 4.
In function WolfeDual, samples Ī are selected using the Karush–Kuhn–Tucker (KKT) sta-
tionary condition which corresponds to the equality constraint in the Wolfe dual formulation of
PI (Wolfe 1961). The KKT stationary condition of PI ′ is
$$\nabla_\lambda f(\tilde{Z}^\top \lambda^*) + \sum_{i=1}^{m} \mu_i^* \nabla_\lambda g_i(\tilde{Z}^\top \lambda^*) - e\rho^* - \upsilon^* = 0, \qquad \text{(13)}$$
Algorithm 1 Column selection
  I′ ← I⁰
  while TRUE do
    λ∗, (µ∗, ρ∗, υ∗) ← P_{I′}
    Ī ← WolfeDual(I, I′, λ∗, (µ∗, ρ∗, υ∗))
    if Ī ≠ ∅ then
      I′ ← I′ ∪ Ī
    else
      Break
    end if
  end while
  return λ∗
where $\tilde{Z}$ is the matrix constructed with samples in $I'$, and $e$ is an $N'$-dimensional vector of
ones with $N' = |I'|$. Equation (13) can be rewritten as
$$\tilde{Z} \nabla_x f(\tilde{Z}^\top \lambda^*) + \sum_{i=1}^{m} \mu_i^* \tilde{Z} \nabla_x g_i(\tilde{Z}^\top \lambda^*) - e\rho^* - \upsilon^* = 0. \qquad \text{(14)}$$
Equation (14) is used to evaluate the reduced cost related to each sample $\bar{z} \in Z$ which is not
in matrix $\tilde{Z}$. Consider a new sample $\bar{z}$ in (14), with its associated $\lambda_{N'+1}$ set
equal to zero. $(\lambda_1^*, \dots, \lambda_{N'}^*, \lambda_{N'+1})$ is still a feasible solution of the
restricted master problem $P_{I'}$, since it does not affect the value of x. As a consequence, µ and ρ will
not change their value, nor will f and g. The only unknown variable is $\upsilon_{N'+1}$, namely the reduced
cost of $\bar{z}$. However, we can write it as
$$\begin{bmatrix} \upsilon^* \\ \upsilon_{N'+1} \end{bmatrix} = \begin{bmatrix} \tilde{Z} \\ \bar{z}^\top \end{bmatrix} \nabla_x f(\tilde{Z}^\top \lambda^*) + \sum_{i=1}^{m} \mu_i^* \begin{bmatrix} \tilde{Z} \\ \bar{z}^\top \end{bmatrix} \nabla_x g_i(\tilde{Z}^\top \lambda^*) - e\rho^*. \qquad \text{(15)}$$
If $\upsilon_{N'+1}$ is negative it means that we may improve the incumbent solution of $P_{I'}$ by including
the sample $\bar{z}$ in $\tilde{Z}$.
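The column selection loop and the reduced-cost test can be sketched for a deliberately simplified special case: a linear objective c⊤x with the trust region as the only constraint. Under that assumption (which is ours, not the general setting of the text), the restricted master problem has a closed-form solution, ρ∗ equals the smallest value of c⊤z over the active samples, and the reduced cost of a candidate z̄ is z̄⊤c − ρ∗. In the general convex case the master problem would be handed to a solver. All function names and data below are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def solve_restricted_master(points, active, c):
    """Closed-form master for a linear objective over conv{points[i], i in active}.

    All weight goes to the active sample with the smallest c^T z, and the
    dual of the convexity constraint, rho, equals that objective value.
    """
    best = min(active, key=lambda i: dot(c, points[i]))
    return best, dot(c, points[best])


def wolfe_dual(points, active, c, rho, tol=1e-9):
    """Return the inactive sample with the most negative reduced cost, or None."""
    candidates = [
        (dot(c, points[i]) - rho, i)
        for i in range(len(points)) if i not in active
    ]
    if not candidates:
        return None
    cost, index = min(candidates)
    return index if cost < -tol else None


def column_selection(points, c, initial):
    active = set(initial)
    while True:
        best, rho = solve_restricted_master(points, active, c)
        new_index = wolfe_dual(points, active, c, rho)
        if new_index is None:
            return points[best], rho, active
        active.add(new_index)


points = [(1.0, 1.0), (2.0, 3.0), (4.0, 1.0), (0.0, 2.0), (3.0, 0.0)]
x_opt, value, used = column_selection(points, c=(2.0, 1.0), initial=[1])
print(x_opt, value, sorted(used))
```

In this toy run only two of the five samples ever enter the restricted problem, which mirrors the motivation for column selection: the optimum is reached without loading the full dataset into the model.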
Lemma 1. After solving the convex and continuously differentiable problem PI ′ , the sample in
I \ I ′ with the most negative reduced cost is a vertex of the convex hull CH(Z ).
The problem of finding z̄, such that its reduced cost is the most negative one, can be written as
a linear program where equation (16) is being minimized, and a solution must lie within CH(Z ).
That is,
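$$
\begin{aligned}
\min_{z, \lambda}\quad & z^\top \nabla_x f(\tilde{Z}^\top \lambda^*) + z^\top \nabla_x g(\tilde{Z}^\top \lambda^*) \mu^* - \rho^* \\
\text{s.t.}\quad & Z^\top \lambda = z, \\
& \sum_{j \in I} \lambda_j = 1, \\
& \lambda_j \ge 0, \quad j \in I,
\end{aligned}
\qquad \text{(17)}
$$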
where z and λ are the decision variables, and µ∗ , λ∗ , ρ∗ are fixed parameters. Since the objective
function is linear with respect to z, the optimal solution of (17) will necessarily be a vertex of
CH(Z ). □
To illustrate the benefits of column selection, consider the following convex optimization problem
that we shall refer to as Pexp :
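$$
\begin{aligned}
\min_{x}\quad & c^\top x && \text{(18a)} \\
\text{s.t.}\quad & \log\Big(\sum_{i=1}^{n} e^{x_i}\Big) \le t, && \text{(18b)} \\
& Ax \le b, && \text{(18c)} \\
& \sum_{i=1}^{N} \lambda_i \bar{z}_i = x, && \text{(18d)} \\
& \sum_{j=1}^{N} \lambda_j = 1, && \text{(18e)} \\
& \lambda_j \ge 0, \quad j = 1, \dots, N. && \text{(18f)}
\end{aligned}
$$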
Without loss of generality, we assume that the constraint (18b) is known a priori, and constraints
(18c) are the linear embeddings of learned constraints with $A \in \mathbb{R}^{k \times n}$ and
$b \in \mathbb{R}^k$. Constraints (18d)-(18f) define the trust region based on N datapoints. Figure EC.3
shows the computation time required to solve $P_{exp}$ with different values of n, k, and N. The “No Column
Selection” approach consists of solving $P_{exp}$ using the entire dataset. The “Column Selection” approach
makes use of Algorithm 1 to solve the problem, starting with $|I^0| = 100$, and selecting only one sample at
each iteration, i.e., the one with the most negative reduced cost. In all cases, the use of column selection
results in significantly improved computation times. This allows us to more quickly define the trust region
for problems with large amounts of data.
Figure EC.3 Effect of column selection on computation time. Solution times are reported for three different sizes
100. The number of samples goes from 500 to 5 × 10^5. In each iteration, the sample with most
negative reduced cost is selected. The same problem is solved using MOSEK (2019) with conic
[Three panels; horizontal axis: number of samples in Z (×10^5).]
Table EC.1 and Table EC.2 show the nutritional value of each food and our assumed nutrient
requirements, respectively. The values adopted are based on the World Health Organization (WHO)
guidelines (UNHCR et al. 2002).
Eng = Energy, Prot = Protein, Cal = Calcium, VitA = Vitamin A, ThB1 = Thiamine B1, RibB2 = Riboflavin B2, NicB3 = Niacin B3, Fol
= Folate, VitC = Vitamin C, Iod = Iodine
Eng = Energy, Prot = Protein, Cal = Calcium, VitA = Vitamin A, ThB1 = Thiamine B1, RibB2 = Riboflavin B2, NicB3 = Niacin B3, Fol
= Folate, VitC = Vitamin C, Iod = Iodine
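$$\text{Palatability Score} = \sqrt{\sum_{g \in G} \big(\gamma_g (\hat{x}_g - \mathrm{Opt}_g)\big)^2}, \qquad \text{(19)}$$
where
$$\hat{x}_g = \sum_{k \in K_g} x_k \quad \text{with } g \in G$$
and
$$\mathrm{Opt}_g = \frac{\max_g + \min_g}{2} \quad \text{with } g \in G.$$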
To account for the different range sizes (maxg − ming ) across the macro-categories, we introduce
a scaling parameter γg that determines their influence on the score, as presented in Table EC.3.
The resulting score is normalized on a scale of 0 to 1, where a score of 1 represents a perfectly
appetizing food basket, while a score of 0 indicates an inedible basket.
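Equation (19) can be sketched in a few lines. The food groups, quantities, ranges, and scaling parameters below are hypothetical and are not the values of Table EC.3; the function returns the raw value of (19) and does not attempt the subsequent normalization to the 0-to-1 scale, whose exact form is not given here.

```python
import math


def palatability_distance(x, groups, gamma, min_q, max_q):
    """Raw value of Equation (19): scaled distance from each group's midpoint."""
    total = 0.0
    for g, foods in groups.items():
        x_hat = sum(x[k] for k in foods)    # total quantity in macro-category g
        opt = (max_q[g] + min_q[g]) / 2     # midpoint of the acceptable range
        total += (gamma[g] * (x_hat - opt)) ** 2
    return math.sqrt(total)


# Hypothetical basket with two macro-categories.
groups = {"cereals": ["rice", "wheat"], "pulses": ["beans"]}
x = {"rice": 2.0, "wheat": 1.0, "beans": 0.5}
gamma = {"cereals": 1.0, "pulses": 2.0}
min_q = {"cereals": 2.0, "pulses": 0.0}
max_q = {"cereals": 4.0, "pulses": 2.0}

print(palatability_distance(x, groups, gamma, min_q, max_q))
```

In this example the cereal quantity sits exactly at its midpoint and contributes nothing, so the value is driven entirely by the pulses category and its larger scaling parameter.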
The generation of diverse food baskets is done by solving several diet problems whose cost function
changes at each run and enforcing constraints on the nutrient requirements as well as on the
maximum number of foods belonging to the same category.
Table EC.4 shows the structure of the predictive models used in the WFP experiments. For each
model, the choice of parameters is based on a cross-validation procedure.
Table EC.4 Definition of the predictive model parameters used in the WFP case study
Model    Parameters
Linear   ElasticNet parameters: 0.1 (alpha), 0.1 (ℓ1-ratio)
SVM      regularization parameter: 100
CART     max depth: 10, max features: 1.0, min samples leaf: 0.02
RF       max depth: 4, max features: auto, number of estimators: 25
GBM      learning rate: 0.2, max depth: 5, number of estimators: 20
MLP      hidden layers: 1, size hidden layers: (100,), activation: relu
Robustness impact by algorithm. Table EC.5 reports the change in objective value (cost) and
constrained outcome (palatability) between the nominal and bootstrapped solution with 10 estimators
and a violation limit of 25%. The goal of the WFP case study is to minimize cost such that
palatability is at least 0.5; thus, a smaller cost and larger palatability are better. As expected, the
robust solution increases both the cost and palatability of the prescribed diets. We see that the
relative increase in cost is consistently lower than the relative increase in real palatability across all
methods, indicating that the improvement in palatability exceeds the incremental cost addition.
While the acceptable trade-off between cost and palatability could differ by use case, this could be
further explored with alternative violation limits. Additionally, we compare the single algorithm
constraints against an ensemble of all six methods, also with a violation limit of α = 0.25. The
ensemble with multiple algorithms yields an objective value of 1313 and real palatability of 0.57.
Relative to the nominal solutions, this represents a change in cost of -1.8% to 1% and an increase in
real palatability of 5.6% to 15.6%. When compared to the bootstrapped single-method models, it is
generally more conservative. This is consistent with the fact that it must satisfy the constraint estimate
across the majority of the individual methods, forcing it to be conservative relative to this set.
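The role of the violation limit can be illustrated with a small check. The sketch assumes, as our reading of the text, that a candidate solution is accepted when at most a fraction α of the P estimators predict a constraint violation; the predicted palatability values and the function name are hypothetical.

```python
def satisfies_with_violation_limit(predictions, tau, alpha):
    """Accept if at most alpha * P of the P estimators predict a value below tau."""
    violations = sum(1 for y in predictions if y < tau)
    return violations <= alpha * len(predictions)


# Hypothetical palatability predictions from P = 10 bootstrapped estimators.
two_violations = [0.52, 0.55, 0.49, 0.61, 0.50, 0.58, 0.47, 0.53, 0.56, 0.60]
three_violations = [0.52, 0.55, 0.49, 0.61, 0.48, 0.58, 0.47, 0.53, 0.56, 0.60]

print(satisfies_with_violation_limit(two_violations, tau=0.5, alpha=0.25))
print(satisfies_with_violation_limit(three_violations, tau=0.5, alpha=0.25))
```

With P = 10 and α = 0.25, up to two estimators may disagree, so lowering α makes the accepted solutions more conservative.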
Table EC.5 Change in cost and palatability from nominal to bootstrapped (P = 10, α = 0.25) solution.
Effect of number of estimators. Table EC.6 compares the runtime as the number of estimators
(P) increases up to 25 estimators. We see that the solve times for the linear, SVM, CART, and
MLP models are stable as the number of estimators increases. In contrast, we see that the ensemble
algorithms, RF and GBM, have exponential runtime increases as the number of estimators grows.
RF and GBM are already composed of multiple individual learners, so embedding multiple estimators
involves adding multiple sets of decision trees, which becomes computationally expensive.
All results are reported over 100 instances. The experiments were run using a virtual computing
environment with 4 CPUs and 32 GB total RAM. We also report the runtime for an ensemble of
estimators obtained from different model classes (“Ensemble”), using a single model from each
class.
We further investigate the runtimes with 25 estimators in Table EC.7. The left side of the table
reports the mean, median, and maximum runtimes for each method on the same 100 experiments
as above. We see that the RF and GBM models have reasonable median solve times (6.66 and
18.80 minutes, respectively), but the average solve times are driven up by outlier instances that
have significantly higher runtimes (max. 2110 and 1603 minutes, respectively). We propose to use
a time limit to control the experiment times. On the right side of the table, we see that using a 4
hour time limit returns optimal solutions for 95% of the RF runs and 82% of the GBM runs, and
feasible solutions for all but four GBM instances. In cases where an optimal solution is obtained,
the average runtime is less than 40 minutes. In cases where the time limit is hit, the average
remaining MIP gap is 1.02% for RF and 5.21% for GBM. The results suggest that imposing this
termination condition results in high quality solutions with a modest optimality gap.
The runtime experiments raise a natural question: what is the impact of embedding a larger
number of estimators? We consider the cost-palatability trade-off for a decision tree model as we
vary the number of estimators from P = 2 to P = 50, averaged over the candidate violation limits.
The results are shown in Figure EC.4. As the number of estimators increases, the results tend
to be more conservative. By 10 estimators, the trade-off curve closely approximates the curves for
higher numbers of estimators up to an inflection point where average cost increases significantly. By P = 25
Algorithm   P = 2   P = 5   P = 10   P = 25
Linear      0.01    0.01    0.02     0.01
SVM         0.01    0.01    0.02     0.01
CART        0.02    0.02    0.03     0.12
RF          0.15    1.34    11.93    44.87
GBM         0.37    3.58    10.71    133.13
MLP         0.01    0.02    0.04     0.48
Ensemble    0.09
Table EC.7 Runtime results for P = 25 estimators, both when solved to optimality (left) and with a 4 hour time limit
(right).
and P = 50, the curves closely match, suggesting diminishing value in increasing the number of
estimators beyond a certain point.
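To make the role of the violation limit concrete, the following is a minimal sketch. It assumes that a learned constraint of the form "prediction >= threshold" counts as satisfied when at most a fraction alpha of the P bootstrapped estimators violate it. The function name, the direction of the inequality, and the toy predictions are illustrative assumptions and are not taken from OptiCL.

```python
def satisfies_violation_limit(predictions, threshold, alpha):
    """Return True if at most a fraction `alpha` of estimators predict below `threshold`."""
    if not predictions:
        raise ValueError("need at least one estimator prediction")
    violations = sum(1 for p in predictions if p < threshold)
    return violations <= alpha * len(predictions)


# Hypothetical palatability predictions from P = 5 bootstrapped estimators.
preds = [0.52, 0.49, 0.55, 0.51, 0.47]

# alpha = 0 requires every estimator to agree; alpha = 0.5 tolerates up to half.
strict = satisfies_violation_limit(preds, threshold=0.5, alpha=0.0)
relaxed = satisfies_violation_limit(preds, threshold=0.5, alpha=0.5)
```

With more estimators and a small alpha, more predictions must clear the threshold, which is one way to read the more conservative behavior seen as P grows.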
Finally, Table EC.8 reports the parameters used in our bootstrapped models. For each method,
we report the parameter grid that was used in our model training and selection procedure.
Individual estimators use different combinations of these parameters based on the validation performance
on the specific bootstrapped samples. We note that for these experiments, we used a default
parameter grid implemented in OptiCL; this grid can be manually set by a user when specifying each
outcome of interest before model training.
Figure EC.4 Effect of the number of bootstrapped estimators (P) on the cost and palatability of the prescribed
diet.
[Figure: objective_function (axis range roughly 1300 to 1340) plotted against violation_margin (0.05 to 0.01) with one series per bootstraps value (2, 5, 10, 25, 50), and against bootstraps with one series per violation_margin value (0.01 to 0.05).]
for the study context: the study year, country, and number of patients. Missing data was imputed
using multiple imputation based on the other contextual variables; 20% of observations had one
missing feature and 6% had multiple missing features.
Treatment Variables. Chemotherapy regimens involve multiple drugs being delivered at
potentially varied frequencies over the course of a chemotherapy cycle. As a result, multiple dimensions
of the dosage must be encoded to reflect the treatment strategy. As in Bertsimas et al. (2016), we
include three variables to represent each drug: an indicator (1 if the drug is used in the regimen),
instantaneous dose, and average dose.
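The three-variables-per-drug encoding can be sketched as follows. This is an illustrative assumption about the data layout, not the authors' preprocessing code: the function name, the drug names, and the dose values are hypothetical, and the two dose quantities are taken as given inputs.

```python
def encode_regimen(regimen, drug_list):
    """Map {drug: (instantaneous_dose, average_dose)} to a flat feature list.

    Each drug in `drug_list` contributes three entries:
    indicator (1 if used, else 0), instantaneous dose, average dose.
    """
    features = []
    for drug in drug_list:
        if drug in regimen:
            inst, avg = regimen[drug]
            features.extend([1, inst, avg])
        else:
            # Unused drugs get a zero indicator and zero doses.
            features.extend([0, 0.0, 0.0])
    return features


# Hypothetical drug names and dose values, for illustration only.
drugs = ["drug_a", "drug_b", "drug_c"]
x = encode_regimen({"drug_a": (60.0, 20.0), "drug_c": (1000.0, 500.0)}, drugs)
```

Keeping a fixed drug order means every regimen maps to a vector of the same length, which is what a predictive model embedded in an optimization problem would need.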
Outcomes. We use Overall Survival (OS) as our survival metric, as reported in the clinical trials.
Any observations with unreported OS are excluded. We consider several “dose-limiting toxicities”
(DLTs): Grade 3/4 constitutional, gastrointestinal, infection, and neurological toxicities, as well
as Grade 4 blood toxicities. The toxicities reported in the original clinical trials are aggregated
according to the CTCAE toxicity classes (Cancer Therapy Evaluation Program 2006). We also
include a variable for the occurrence of any of the four individual toxicities (t_i for each toxicity
i ∈ T), called the DLT proportion; we treat these toxicity groups as independent and thus define the
DLT proportion as
\[ \mathrm{DLT} = 1 - \prod_{i \in T} (1 - t_i). \]
We define Grade 4 blood toxicity as the maximum of five individual blood toxicities (related to
neutrophils, leukocytes, lymphocytes, thrombocytes, anemia). Observations missing all of these
toxicities were excluded; entries with partial missingness were imputed using multiple imputation
based on other blood toxicity columns. Similarly, observations with no reported Grade 3/4 toxicities
were excluded; those with partial missingness were imputed using multiple imputation based on
the other toxicity columns. These exclusion criteria resulted in a final set of 461 (of 495) treatment
arms.
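As a small worked example of the two aggregation rules above (the maximum over individual blood toxicities and the independence-based DLT proportion), consider the sketch below. The function names and all toxicity rates are hypothetical.

```python
import math


def grade4_blood_toxicity(blood_toxicities):
    """Aggregate individual blood toxicity rates by taking their maximum."""
    return max(blood_toxicities)


def dlt_proportion(toxicities):
    """DLT = 1 - prod_i (1 - t_i), treating the toxicity groups as independent."""
    return 1.0 - math.prod(1.0 - t for t in toxicities)


# Hypothetical rates for neutrophils, leukocytes, lymphocytes, thrombocytes, anemia.
blood = grade4_blood_toxicity([0.10, 0.05, 0.02, 0.08, 0.01])

# Hypothetical rates for four toxicity groups.
dlt = dlt_proportion([0.10, 0.20, 0.05, 0.0])
```

Here the DLT proportion is 1 - (0.9)(0.8)(0.95)(1.0) = 0.316, which is lower than the sum of the individual rates (0.35) because overlapping occurrences are not double counted.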
We split the data into training/testing sets temporally. The training set consists of all clinical
trials through 2008, and the testing set consists of all 2009-2012 trials. We exclude trials from the
testing set if they use drugs not seen in the training data (since we cannot evaluate such
treatments). We also identify sparse treatments (defined as those seen only once in the training
set) and remove all observations that include these treatments. The final training set consists of
320 observations, and the final testing set consists of 96 observations.
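The splitting and filtering steps can be sketched as below. This is a minimal illustration under stated assumptions: each observation is a dictionary with a "year" and a set of "drugs", a "treatment" is interpreted as an individual drug, and sparse drugs are removed from both the training and testing sets. These choices, the function name, and the toy data are assumptions rather than the authors' code.

```python
def temporal_split(trials, cutoff_year=2008, min_train_count=2):
    """Split trial arms by year, then drop unseen-drug and sparse-drug arms.

    Each trial is a dict with keys "year" (int) and "drugs" (set of str).
    """
    train = [t for t in trials if t["year"] <= cutoff_year]
    test = [t for t in trials if t["year"] > cutoff_year]

    # Count how often each drug appears in the training arms.
    counts = {}
    for t in train:
        for d in t["drugs"]:
            counts[d] = counts.get(d, 0) + 1

    # Test arms may only use drugs that were seen in training.
    test = [t for t in test if all(d in counts for d in t["drugs"])]

    # Sparse drugs: seen fewer than `min_train_count` times in training.
    sparse = {d for d, c in counts.items() if c < min_train_count}
    train = [t for t in train if not (t["drugs"] & sparse)]
    test = [t for t in test if not (t["drugs"] & sparse)]
    return train, test


# Toy data: drug "c" is seen once in training (sparse), drug "d" is unseen.
trials = [
    {"year": 2005, "drugs": {"a", "b"}},
    {"year": 2007, "drugs": {"a", "b"}},
    {"year": 2008, "drugs": {"a", "c"}},
    {"year": 2010, "drugs": {"a", "b"}},
    {"year": 2011, "drugs": {"a", "d"}},
    {"year": 2012, "drugs": {"b", "c"}},
]
train, test = temporal_split(trials)
```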
Table EC.10 Predictive model parameters used in the chemotherapy case study.
Table EC.12 Predictive model parameters used in the ground truth ensemble for model evaluation.
\begin{align*}
\max \quad & \sum_{i=1}^{N} y_i \tag{20a} \\
\text{s.t.} \quad & \sum_{i=1}^{N} x_i \le \text{BUDGET}, \tag{20b}
\end{align*}
where x_i is the decision variable indicating the amount of scholarship assigned to accepted
student i, s_i is the SAT score of applicant i, and g_i is the GPA of applicant i. The predicted
outcome y_i represents the probability that candidate i accepts the offer, and ĥ is the fitted
model used to predict this probability for any candidate. The parameters s_i and g_i and
the decision variable x_i are the predictive model's inputs. To compare OptiCL and
JANOS, we solved the SEP for different numbers of students and compared the objective values and
runtimes. Although OptiCL and JANOS handle neural network embedding in a similar manner,
JANOS uses a parameterized discretization to handle logistic regression predictions. We therefore
compared their performance using only the logistic regression models, as we expected to see a
difference in performance based on the differences in implementation. In the experiments reported
in Figure EC.5, we discretize the logistic regression (LogReg) in JANOS using three different
numbers of intervals (reported in parentheses in the figure legend). The experiments show
that OptiCL achieves better objective values in all three instances. For the larger problems,
OptiCL is also much more efficient than JANOS in terms of optimization runtime.
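To illustrate why the number of discretization intervals matters, the sketch below approximates the logistic function by linear interpolation between equally spaced breakpoints and measures the worst-case approximation error for 5, 15, and 25 intervals. This is an assumed, generic discretization scheme chosen for illustration; it is not claimed to be the scheme implemented in JANOS, and the approximation range [-6, 6] is arbitrary.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def piecewise_sigmoid(u, n_intervals, lo=-6.0, hi=6.0):
    """Linear interpolation of the sigmoid between equally spaced breakpoints."""
    u = min(max(u, lo), hi)  # clamp to the approximation range
    width = (hi - lo) / n_intervals
    k = min(int((u - lo) / width), n_intervals - 1)  # interval containing u
    left = lo + k * width
    right = left + width
    weight = (u - left) / width
    return (1.0 - weight) * sigmoid(left) + weight * sigmoid(right)


def max_abs_error(n_intervals, lo=-6.0, hi=6.0, n_grid=2001):
    """Largest gap between the sigmoid and its approximation on a fine grid."""
    grid = [lo + (hi - lo) * j / (n_grid - 1) for j in range(n_grid)]
    return max(
        abs(sigmoid(u) - piecewise_sigmoid(u, n_intervals, lo, hi)) for u in grid
    )


errors = {n: max_abs_error(n) for n in (5, 15, 25)}
```

Under this assumed scheme, the error shrinks as the number of intervals grows, while each extra interval would add variables to the optimization model. That trade-off is one plausible source of the differences in objective value and runtime.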
Figure EC.5 Objective value (right) and runtime (left) comparison between OptiCL and JANOS for the SEP.
[Figure: legend entries JANOS LogReg(5), JANOS LogReg(15), JANOS LogReg(25), and OptiCL LogReg.]
\begin{align*}
\max \quad & z \tag{21a} \\
\text{s.t.} \quad & z \le y_k \quad \forall k = 0, \dots, m-1, \tag{21b}
\end{align*}
where x_{ik} is the binary decision variable indicating whether job i is mapped to core k. The
parameter cpi_i represents the average Clock Per Instruction (CPI) characterizing job i, and is a
measure of the difficulty of job i. The objective is to maximize the worst-case core efficiency, and
the fitted model ĥ_k is used to predict the efficiency of core k, which is represented by y_k ∈ [0, 1].
Constraint (21d) ensures that each job is mapped to only one core, and (21e) forces the same
number of jobs to run on each core. Constraints (21f), (21g), and (21h) are used to compute the
average CPI for a core k, the average CPI for the cores in the neighborhood of k (N(k)), and the
average CPI for cores not in the neighborhood of k, respectively. Lombardi et al. (2017) conclude
that learning the efficiency function for each core by means of neural networks (with one hidden
layer of two nodes and tanh activation function) is computationally intractable. In contrast,
our experiments show that we are able to solve this problem using neural networks with one hidden
layer of 10 nodes in a reasonable amount of time (19.4 seconds). We tried deeper neural networks,
but the increase in computational complexity did not lead to a gain in predictive performance.
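The max-min structure of the objective and the three CPI features described above can be sketched for a fixed job-to-core assignment as follows. This is an evaluation sketch only, not the mixed-integer formulation: the function names, the ring-shaped neighborhood, the CPI values, and the stand-in efficiency function `toy_predict` are all hypothetical.

```python
def core_features(assignment, cpi, neighbors, k, n_cores):
    """Average CPI on core k, on its neighboring cores, and on all other cores."""
    def avg_cpi(cores):
        values = [cpi[i] for i, core in enumerate(assignment) if core in cores]
        return sum(values) / len(values) if values else 0.0

    neigh = set(neighbors[k])
    others = set(range(n_cores)) - neigh - {k}
    return avg_cpi({k}), avg_cpi(neigh), avg_cpi(others)


def worst_case_efficiency(assignment, cpi, neighbors, n_cores, predict):
    """Objective z = min_k y_k, where y_k = predict(k, features of core k)."""
    return min(
        predict(k, core_features(assignment, cpi, neighbors, k, n_cores))
        for k in range(n_cores)
    )


# Toy instance: 4 cores on a ring, 8 jobs, 2 jobs per core (all values hypothetical).
cpi = [1.0, 3.0, 2.0, 2.0, 0.5, 1.5, 2.5, 3.5]
assignment = [0, 0, 1, 1, 2, 2, 3, 3]  # assignment[i] = core running job i
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}


def toy_predict(k, features):
    # Arbitrary stand-in for a fitted model: depends only on the core's own average CPI.
    own, _, _ = features
    return 1.0 / (1.0 + own)


features_core0 = core_features(assignment, cpi, neighbors, 0, 4)
z = worst_case_efficiency(assignment, cpi, neighbors, 4, toy_predict)
```

In the optimization model the assignment is a decision rather than an input, so the same quantities appear as constraints linking the assignment variables to the inputs of each embedded predictive model.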