Mixed-Integer Optimization with Constraint Learning
Donato Maragno*
Amsterdam Business School, University of Amsterdam, 1018 TV Amsterdam, Netherlands

Holly Wiberg*
Operations Research Center, Massachusetts Institute of Technology, Cambridge MA 02139
arXiv:2111.04469v3 [math.OC] 26 Oct 2023

Dimitris Bertsimas
Sloan School of Management, Massachusetts Institute of Technology, Cambridge MA 02139

Ş. İlker Birbil, Dick den Hertog, Adejuyigbe O. Fajemisin
Amsterdam Business School, University of Amsterdam, 1018 TV Amsterdam, Netherlands

We establish a broad methodological foundation for mixed-integer optimization with learned constraints.

We propose an end-to-end pipeline for data-driven decision making in which constraints and objectives are

directly learned from data using machine learning, and the trained models are embedded in an optimization

formulation. We exploit the mixed-integer optimization-representability of many machine learning methods,

including linear models, decision trees, ensembles, and multi-layer perceptrons, which allows us to capture

various underlying relationships between decisions, contextual variables, and outcomes. We also introduce

two approaches for handling the inherent uncertainty of learning from data. First, we characterize a decision

trust region using the convex hull of the observations, to ensure credible recommendations and avoid

extrapolation. We efficiently incorporate this representation using column generation and propose a more flexible

formulation to deal with low-density regions and high-dimensional datasets. Then, we propose an ensemble

learning approach that enforces constraint satisfaction over multiple bootstrapped estimators or multiple

algorithms. In combination with domain-driven components, the embedded models and trust region define a

mixed-integer optimization problem for prescription generation. We implement this framework as a Python

package (OptiCL) for practitioners. We demonstrate the method in both World Food Programme planning

and chemotherapy optimization. The case studies illustrate the framework’s ability to generate high-quality

prescriptions as well as the value added by the trust region, the use of ensembles to control model robustness,

the consideration of multiple machine learning methods, and the inclusion of multiple learned constraints.

Key words : mixed-integer optimization, machine learning, constraint learning, prescriptive analytics

* These authors contributed equally.

Maragno et al.: Mixed-integer Optimization with Constraint Learning

1. Introduction

Mixed-integer optimization (MIO) is a powerful tool that allows us to optimize a given objective

subject to various constraints. This general problem statement of optimizing under constraints is

nearly universal in decision-making settings. Some problems have readily quantifiable and explicit

objectives and constraints, in which case MIO can be directly applied. The situation becomes more

complicated, however, when the constraints and/or objectives are not explicitly known.

For example, suppose we deal with cancerous tumors and want to prescribe a treatment regimen

with a limit on toxicity; we may have observational data on treatments and their toxicity outcomes,

but we have no natural function that relates the treatment decision to its resultant toxicity. We

may also encounter constraints that are not directly quantifiable. Consider a setting where we

want to recommend a diet, defined by a combination of foods and quantities, that is sufficiently

“palatable.” Palatability cannot be written as a function of the food choices, but we may have

qualitative data on how well people “like” various potential dietary prescriptions. In both of these

examples, we cannot directly represent the outcomes of interest as functions of our decisions, but

we have data that relates the outcomes and decisions. This raises a question: how can we use

data to learn these functions?

In this work, we tackle the challenge of data-driven decision making through a combined machine

learning (ML) and MIO approach. ML allows us to learn functions that relate decisions to outcomes

of interest directly through data. Importantly, many popular ML methods result in functions that

are MIO-representable, meaning that they can be embedded into MIO formulations. This MIO-

representable class includes both linear and nonlinear models, allowing us to capture a broad

set of underlying relationships in the data. While the idea of learning functions directly from

data is core to the field of ML, data is often underutilized in MIO settings due to the need

for functional relationships between decision variables and outcomes. We seek to bridge this gap

through constraint learning; we propose a general framework that allows us to learn constraints

and objectives directly from data, using ML, and to optimize decisions accordingly, using MIO.

Once the learned constraints have been incorporated into the larger MIO, we can solve the problem

directly using off-the-shelf solvers.

The term constraint learning, used several times throughout this work, captures both constraints

and objective functions. We are fundamentally learning functions to relate our decision variables

to the outcome(s) of interest. The predicted values can then either be incorporated as constraints

or objective terms; the model learning and embedding procedures remain largely the same. For

this reason, we refer to them both under the same umbrella of constraint learning. We describe

this further in Section 2.2.

1.1. Literature review

Previous work has demonstrated the use of various ML methods in MIO problems and their utility

in different application domains. The simplest of these methods is the regression function, as the

approach is easy to understand and easy to implement. Given a regression function learned from

data, the process of incorporating it into an MIO model is straightforward, and the final model

does not require complex reformulations. As an example, Bertsimas et al. (2016) use regression

models and MIO to develop new chemotherapy regimens based on existing data from previous

clinical trials. Kleijnen (2015) provides further information on this subject.

More complex ML models have also been shown to be MIO-representable, although more effort

is required to represent them than simple regression models. Neural networks which use the ReLU

activation function can be represented using binary variables and big-M formulations (Amos et al.

2016, Grimstad and Andersson 2019, Anderson et al. 2020, Chen et al. 2020, Spyros 2020, Venzke

et al. 2020). Where other activation functions are used (Gutierrez-Martinez et al. 2011, Lombardi

et al. 2017, Schweidtmann and Mitsos 2019), the MIO representation of neural networks is still

possible, provided the solvers are capable of handling these functions.

With decision trees, each path in the tree from root to leaf node can be represented using

one or more constraints (Bonfietti et al. 2015, Verwer et al. 2017, Halilbasic et al. 2018). The

number of constraints required to represent decision trees is a function of the tree size, with larger

trees requiring more linearizations and binary variables. The advantage here, however, is that

decision trees are known to be highly interpretable, which is often a requirement of ML in critical

application settings (Thams et al. 2017). Random forests (Biggs et al. 2021, Mišić 2020) and other

tree ensembles (Cremer et al. 2019) have also been used in MIO in the same way as decision trees,

with one set of constraints for each tree in the forest/ensemble along with one or more additional

aggregate constraints.

Data for constraint learning can either contain information on continuous data, feasible and

infeasible states (two-class data), or only one state (one-class data). The problem of learning

functions from one-class data and embedding them into optimization models has been recently

investigated with the use of decision trees (Kudla and Pawlak 2018), genetic programming (Pawlak

and Krawiec 2019), local search (Sroka and Pawlak 2018), evolutionary strategies (Pawlak 2019),

and a combination of clustering, principal component analysis and wrapping ellipsoids (Pawlak

and Litwiniuk 2021).

The above selected applications generally involve a single function to be learned and a fixed ML

method for the model choice. Verwer et al. (2017) use two model classes (decision trees and linear

models) in a specific auction design application, but in this case the models were determined a

priori. Some authors have presented a more general framework of embedding learned ML models

in optimization problems such as JANOS (Bergman et al. 2022) and EML (Lombardi et al. 2017),

but in practice these works are restricted to limited problem structures and learned model classes.

We take a broader perspective, proposing a comprehensive end-to-end pipeline that encompasses

the full ML and optimization components of a data-driven decision making problem. In contrast to

EML and JANOS, OptiCL supports a wider variety of predictive models — neural networks (with

ReLU), linear regression, logistic regression, decision trees, random forests, gradient boosted trees

and linear support vector machines. OptiCL is also more flexible than JANOS, as it can handle

predictive models as constraints, and it also incorporates new concepts to deal with uncertainty in

the ML models. A comparison of OptiCL against JANOS and EML on two test problems is shown

in Appendix E.

Our work falls under the umbrella of prescriptive analytics. Bertsimas and Kallus (2020) and

Elmachtoub and Grigas (2021) leverage ML model predictions as inputs into an optimization

problem. Our approach is distinct from existing work in that we directly embed ML models rather than

extracting predictions, allowing us to optimize our decisions over the model. In the broadest sense,

our framework relates to work that jointly harnesses ML and MIO, an area that has garnered

significant interest in recent years in both the optimization and machine learning communities (Bengio

et al. 2021).

1.2. Contributions

Our work unifies several research areas in a comprehensive manner. Our key contributions are as

follows:

1. We develop an end-to-end framework that takes data and directly implements model training,

model selection, integration into a larger MIO, and ultimately optimization. We make this

available as open-source software, OptiCL (Optimization with Constraint Learning), to

provide a practitioner-friendly tool for making better data-driven decisions. The code is available

at https://github.com/hwiberg/OptiCL. The software encompasses the full ML and

optimization pipeline with the goal of being accessible to end users as well as extensible by technical

researchers. Our framework natively supports models for both regression and classification

functions and handles constraint learning in cases with both one-class and two-class data. We

implement a cross-validation procedure for function learning that selects from a broad set of

model classes. We also implement the optimization procedure in the generic mathematical

modeling library Pyomo, which supports various state-of-the-art solvers. We introduce two

approaches for handling the inherent uncertainty when learning from data. First, we propose

an ensemble learning approach that enforces constraint satisfaction over an ensemble of

multiple bootstrapped estimators or multiple algorithms, yielding more robust solutions. This

addresses a shortcoming of existing approaches to embedding trained ML models, which rely

on a single point prediction: in the case of learned constraints, model misspecification can lead

to infeasibility. Additionally, we restrict solutions to lie within a trust region, defined as the

domain of the training data, which leads to better performance of the learned constraints. We

offer several improvements to a basic convex hull formulation, including a clustering heuristic

and a column selection algorithm that significantly reduce computation time. We also

propose an enlargement of the convex hull which allows for exploration of solutions outside of

the observed bounds. Both the ensemble model wrapper and trust region enlargement are

controlled by parameters that allow an end user to directly trade off the conservativeness of

the constraint satisfaction.

2. We demonstrate the power of our method in two real-world case studies, using data from

the World Food Programme and chemotherapy clinical trials. We pose relevant questions

in the respective areas and formalize them as constraint learning problems. We implement

our framework and subsequently evaluate the quantitative performance and scalability of our

methods in these settings.

2. Embedding predictive models

Suppose we have data $D = \{(\bar{x}_i, \bar{w}_i, \bar{y}_i)\}_{i=1}^{N}$, with observed treatment decisions $\bar{x}_i$, contextual

information $\bar{w}_i$, and outcomes of interest $\bar{y}_i$ for sample $i$. Following the guidelines proposed in Fajemisin

et al. (2021), we present a framework that, given data D, learns functions for the outcomes of

interest (y) that are to be constrained or optimized. These learned representations can then be

used to generate predictions for a new observation with context w. Figure 1 outlines the complete

pipeline, which is detailed in the sections below.

2.1. Conceptual model

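Given the decision variable $x \in \mathbb{R}^n$ and the fixed feature vector $w \in \mathbb{R}^p$, we propose model M(w)

$$
\begin{aligned}
\min_{x \in \mathbb{R}^n,\; y \in \mathbb{R}^k} \quad & f(x, w, y) \\
\text{s.t.} \quad & g(x, w, y) \le 0, \\
& y = \hat{h}_D(x, w), \\
& x \in X(w),
\end{aligned}
\tag{1}
$$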

Figure 1 Constraint learning and optimization pipeline.

- Conceptual model (Section 2.1): decision variables, contextual variables, parameters, known constraints, unknown constraints.
- Data pre-processing (Sections 3 & 4): data cleaning, feature scaling, feature engineering.
- Predictive models (Section 2.2): linear regression, support vector machines, decision trees, ensemble methods, neural networks.
- Trust region (Section 2.3): clustering, convex hull, column selection.
- Optimization (Sections 3 & 4): MIO with learned predictive models and trust region constraints.
- Evaluation (Sections 3 & 4): analyses of the optimal solution and the embedded predictive models’ performance.

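where $f(\cdot, w, \cdot) : \mathbb{R}^{n+k} \mapsto \mathbb{R}$, $g(\cdot, w, \cdot) : \mathbb{R}^{n+k} \mapsto \mathbb{R}^m$, and $\hat{h}_D(\cdot, w) : \mathbb{R}^n \mapsto \mathbb{R}^k$. Explicit forms of f and

g are known but they may still depend on the predicted outcome y. Here, $\hat{h}_D(x, w)$ represents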

the predictive models, one per outcome of interest, which are ML models trained on D. Although

our subsequent discussion mainly revolves around linear functions, we acknowledge the significant

progress in nonlinear (convex) integer solvers. Our discussion can be easily extended to nonlinear

models that can be tackled by those ever-improving solvers.

We note that the embedding of a single learned outcome may require multiple constraints and

auxiliary variables; the embedding formulations are described in Section 2.2. For simplicity, we

omit D in further notation of ĥ but note that all references to ĥ implicitly depend on the data used

to train the model. Finally, the set X (w) defines the trust region, i.e., the set of solutions for which

we trust the embedded predictive models. In Section 2.3, we provide a detailed description of how

the trust region X (w) is obtained from the observed data. We refer to the final MIO formulation

with the embedded constraints and variables as EM(w).

Model M(w) is quite general and encompasses several important constraint learning classes:

1. Regression. When the trained model results from a regression problem, it can be constrained

by a specified upper bound τ , i.e., g(y) = y − τ ≤ 0, or lower bound τ , i.e., g(y) = −y + τ ≤ 0.

If y is a vector (i.e., multi-output regression), we can likewise provide a threshold vector τ

for the constraints.

2. Classification. If the trained model is obtained with a binary classification algorithm, in

which the data is labeled as “feasible” (1) or “infeasible” (0), then the prediction is generally

a probability y ∈ [0, 1]. We can enforce a lower bound on the feasibility probability, i.e., y ≥ τ .

A natural choice of τ is 0.5, which can be interpreted as enforcing that the result is more likely

feasible than not. This can also extend to the multi-class setting, say k classes, in which the

output y is a k-dimensional unit vector, and we apply the constraint yi ≥ τ for whichever class

i is desired. When multiple classes are considered to be feasible, we can add binary variables

to ensure that a solution is feasible, only if it falls in one of these classes with sufficiently high

probability.

3. Objective function. If the objective function has a term that is also learned by training

an ML model, then we can introduce an auxiliary variable t ∈ R, and add it to the objective

function along with an epigraph constraint. Suppose for simplicity that the model involves

a single learned objective function, ĥ, and no learned constraints. Then the general model

becomes

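$$
\begin{aligned}
\min_{x \in \mathbb{R}^n,\; y \in \mathbb{R},\; t \in \mathbb{R}} \quad & t \\
\text{s.t.} \quad & g(x, w) \le 0, \\
& y = \hat{h}(x, w), \\
& y - t \le 0, \\
& x \in X(w).
\end{aligned}
$$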

Although we have rewritten the problem to show the generality of our model, it is quite

common in practice to use y in the objective and omit the auxiliary variable t.
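As a minimal sketch of the regression case in the list above, the learned outcome y can be kept as an explicit variable that is tied to the decisions by an equality, with the lower bound −y + τ ≤ 0 added as an ordinary linear constraint. Everything concrete here is an illustrative assumption: the noise-free training data, the cost vector, the threshold, the use of SciPy's `linprog` rather than OptiCL, and simple box bounds standing in for the trust region X(w).

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical training data: two decisions and a noise-free linear outcome.
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 10.0, size=(50, 2))
y_train = 1.0 + 2.0 * X_train[:, 0] + 3.0 * X_train[:, 1]

# Learn h(x) = beta0 + beta^T x by ordinary least squares.
design = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
beta0, beta = coef[0], coef[1:]

# Optimization variables are (x1, x2, y); only the decisions carry a cost.
tau = 13.0                                       # assumed lower bound on y
cost = np.array([1.0, 2.0, 0.0])
A_eq = np.array([[-beta[0], -beta[1], 1.0]])     # y - beta^T x = beta0
b_eq = np.array([beta0])
A_ub = np.array([[0.0, 0.0, -1.0]])              # -y <= -tau, i.e. -y + tau <= 0
b_ub = np.array([-tau])
bounds = [(0.0, 10.0), (0.0, 10.0), (None, None)]  # box stands in for X(w)

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds, method="highs")
x_opt, y_opt = res.x[:2], res.x[2]
```

Because the embedded model is linear, the problem stays a linear program; the same pattern carries over to an upper bound by flipping the sign of the inequality row.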

We observe that constraints on learned outcomes can be applied in two ways depending on the

model training approach. Suppose that we have a continuous scalar outcome y to learn and we

want to impose an upper bound of τ ∈ R (it may also be a lower bound without loss of generality).

The first approach is called function learning and concerns all cases where we learn a regression

function ĥ(x, w) without considering the feasibility threshold (τ ). The resultant model returns a

predicted value y ∈ R. The threshold is then applied as a constraint in the optimization model

as y ≤ τ . Alternatively, we could use the feasibility threshold τ to binarize the outcome of each

sample in D into feasible and infeasible, that is, $\bar{y}_i := \mathbb{I}(\bar{y}_i \le \tau)$, $i = 1, \dots, N$, where $\mathbb{I}$ stands for

the indicator function. After this relabeling, we train a binary classification model ĥ(x, w) that

returns a probability y ∈ [0, 1]. This approach, called indicator function learning, does not require

any further use of the feasibility threshold τ in the optimization model, since the predictive models

directly encode feasibility.

The function learning approach is particularly useful when we are interested in varying the

threshold τ as a model parameter. Additionally, if the fitting process is expensive and therefore

difficult to perform multiple times, learning an indicator function for each potential τ might be

infeasible. In contrast, the indicator function learning approach is necessary when the raw data

contains binary labels rather than continuous outcomes, and thus we have no ability to select or

vary τ .
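The relabeling step behind indicator function learning can be sketched in a few lines. The outcome values, the thresholds and the helper name `binarize_outcomes` are hypothetical and only illustrate why each candidate τ yields a different label vector, and hence a separate classifier to train.

```python
def binarize_outcomes(y_observed, tau):
    """Relabel continuous outcomes: 1 if feasible (y <= tau), otherwise 0."""
    return [1 if y <= tau else 0 for y in y_observed]


# Hypothetical observed outcomes (for example, toxicity scores).
toxicity = [0.2, 0.7, 0.45, 0.9, 0.5]

# Indicator function learning: tau is fixed before training the classifier.
tau = 0.5
labels = binarize_outcomes(toxicity, tau)

# Varying tau changes the training labels, so every candidate threshold
# would need its own fitted classifier. Function learning instead keeps
# `toxicity` as the regression target and applies y <= tau in the MIO.
labels_per_tau = {t: binarize_outcomes(toxicity, t) for t in (0.4, 0.5, 0.8)}
```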

2.2. MIO-representable predictive models

Our framework is enabled by the ability to embed learned predictive models into an MIO

formulation with linear constraints. This is possible for many classes of ML models, ranging from linear

models to ensembles, and from support vector machines to neural networks. In this section, we

outline the embedding procedure for decision trees, tree ensembles, and neural networks to

illustrate the approach. We include additional technical details and formulations for these methods,

along with linear regression and support vector machines, in Appendix A.



In all cases, the model has been pre-trained; we embed the trained model ĥ(x, w) into our larger

MIO formulation to allow us to constrain or optimize the resultant predicted value. Consequently,

the optimization model is not dependent on the complexity of the model training procedure, but

solely the size of the final trained model. Without loss of generality, we assume that y is one-

dimensional; i.e., we are learning a single model, and this model returns a scalar, not a multi-output

vector.

All of the methods below can be used to learn constraints that apply upper or lower bounds

to y, or to learn y that we incorporate as part of the objective. We present the model embedding

procedure for both cases when ĥ(x, w) is a continuous or a binary predictive model, where relevant.

We assume that either regression or classification models can be used to learn feasibility constraints,

as described in Section 2.1.

Decision Trees. Decision trees partition observations into distinct leaves through a series of

feature splits. These algorithms are popular in predictive tasks due to their natural interpretability

and ability to capture nonlinear interactions among variables. Breiman et al. (1984) first introduced

Classification and Regression Trees (CART), which constructs trees through parallel splits in the

feature space. Decision tree algorithms have subsequently been adapted and extended. Bertsimas

and Dunn (2017) propose an alternative decision tree algorithm, Optimal Classification Trees

(and Optimal Regression Trees), that improves on the basic decision tree formulation through

an optimization framework that approximates globally optimal trees. Optimal trees also support

multi-feature splits, referred to as hyper-plane splits, that allow for splits on a linear combination

of features (Bertsimas and Dunn 2018).

A generic decision tree of depth 2 is shown in Figure 2. A split at node i is described by an

inequality $A_i^\top x \le b_i$. We assume that A can have multiple non-zero elements, in which case we have

the hyper-plane split setting; if there is only one non-zero element, this creates a parallel (single

feature) split. Each terminal node j (i.e., leaf) yields a prediction ($p_j$) for its observations. In the

case of regression, the prediction is the average value of the training observations in the leaf, and in

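Figure 2 A decision tree of depth 2 with four terminal nodes (leaves).

- Node 1: split $A_1^\top x \le b_1$ (True: Node 2; False: Node 5)
- Node 2: split $A_2^\top x \le b_2$, with leaves Node 3 (prediction $p_3$) and Node 4 (prediction $p_4$)
- Node 5: split $A_5^\top x \le b_5$, with leaves Node 6 (prediction $p_6$) and Node 7 (prediction $p_7$)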

binary classification, the prediction is the proportion of leaf members with the feasible class. Each

leaf can be described as a polyhedron, namely a set of linear constraints that must be satisfied by

all leaf members. For example, for node 3, we define $P_3 = \{x : A_1^\top x \le b_1,\ A_2^\top x \le b_2\}$.

Suppose that we wish to constrain the predicted value of this tree to be at most τ , a fixed

constant. After obtaining the tree in Figure 2, we can identify which paths satisfy the desired bound

(pi ≤ τ ). Suppose that p3 and p6 do satisfy the bound, but p4 and p7 do not. In this case, we can

enforce that our solution belongs to P3 or P6 . This same approach applies if we only have access to

two-class data (feasible vs. infeasible); we can directly train a binary classification algorithm and

enforce that the solution lies within one of the “feasible” prediction leaves (determined by a set

probability threshold).
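The path-filtering step can be sketched as follows. The dictionary layout, split coefficients, leaf predictions and threshold are hypothetical; they mirror the shape of Figure 2 and are chosen so that Nodes 3 and 6 satisfy the bound, as in the example above.

```python
def leaf_paths(node, path=()):
    """Return (name, prediction, path) for every leaf below `node`.

    `path` collects split conditions (a, sense, b), read as a^T x <sense> b.
    """
    if "prediction" in node:
        return [(node["name"], node["prediction"], list(path))]
    a, b = node["a"], node["b"]
    true_side = leaf_paths(node["true"], path + ((a, "<=", b),))
    false_side = leaf_paths(node["false"], path + ((a, ">", b),))
    return true_side + false_side


def leaf(name, prediction):
    return {"name": name, "prediction": prediction}


# Hypothetical depth-2 tree with the same structure as Figure 2.
tree = {
    "a": (1.0, 0.0), "b": 5.0,                       # Node 1
    "true": {"a": (0.0, 1.0), "b": 3.0,              # Node 2
             "true": leaf("Node 3", 2.0), "false": leaf("Node 4", 8.0)},
    "false": {"a": (1.0, 1.0), "b": 12.0,            # Node 5
              "true": leaf("Node 6", 4.0), "false": leaf("Node 7", 9.0)},
}

tau = 5.0
# Keep only leaves whose prediction satisfies p <= tau; each surviving
# path is the list of inequalities defining that leaf's polyhedron.
feasible_leaves = [(n, p, path) for n, p, path in leaf_paths(tree) if p <= tau]
```

For a classification tree, the same filter would compare each leaf's class proportion against the chosen probability threshold instead.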

If the decision tree provides our only learned constraint, we can decompose the problem into

multiple separate MIOs, one per feasible leaf. The conceptual model for the subproblem of leaf i

then becomes

$$
\begin{aligned}
\min_{x} \quad & f(x, w) \\
\text{s.t.} \quad & g(x, w) \le 0, \\
& (x, w) \in P_i,
\end{aligned}
$$
where the learned constraints for leaf i’s subproblem are implicitly represented by the polyhedron

Pi . These subproblems can be solved in parallel, and the minimum across all subproblems is

obtained as the optimal solution. Furthermore, if all decision variables x are continuous, these

subproblems are linear optimization problems (LOs), which can provide substantial computational

gains. This is explored further in Appendix A.2.
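The decomposition can be illustrated with one linear program per feasible leaf, keeping the best objective. This is a sketch under assumptions: the two leaf polyhedra, the objective and the box bounds (playing the role of the known constraints) are invented, SciPy's `linprog` is used as a generic LP solver, and the strict inequality on a False branch is approximated with a small tolerance `EPS`.

```python
import numpy as np
from scipy.optimize import linprog

EPS = 1e-6                        # tolerance replacing the strict ">" on False branches
c = np.array([-1.0, -1.0])        # assumed objective: minimize -(x1 + x2)
bounds = [(0.0, 10.0), (0.0, 10.0)]  # assumed known constraints g(x, w) <= 0

# Each feasible leaf is a polyhedron {x : A x <= b} built from its path.
feasible_leaves = {
    # x1 <= 5 and x2 <= 3
    "Node 3": (np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([5.0, 3.0])),
    # x1 > 5 (written as -x1 <= -(5 + EPS)) and x1 + x2 <= 12
    "Node 6": (np.array([[-1.0, 0.0], [1.0, 1.0]]), np.array([-(5.0 + EPS), 12.0])),
}

results = {}
for name, (A_leaf, b_leaf) in feasible_leaves.items():
    res = linprog(c, A_ub=A_leaf, b_ub=b_leaf, bounds=bounds, method="highs")
    if res.status == 0:           # keep only subproblems solved to optimality
        results[name] = (res.fun, res.x)

# The overall optimum is the minimum across the leaf subproblems.
best_leaf = min(results, key=lambda name: results[name][0])
best_value, best_x = results[best_leaf]
```

The loop body is independent across leaves, so the subproblems could be dispatched to separate workers.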

In the more general setting where the decision tree forms one of many constraints, or we are

interested in varying the τ limit within the model, we can directly embed the model into a larger

MIO. We add binary variables representing each leaf, and set y to the predicted value of the

assigned leaf. An observation can only be assigned to a leaf, if it obeys all of its constraints; the

structure of the tree guarantees that exactly one path will be fully satisfied, and thus, the leaf

assignment is uniquely determined. A solution belonging to P3 will inherit y = p3 . Then, y can be

used in a constraint or objective. The full formulation for the embedded decision tree is included in

Appendix A.2. This formulation is similar to the proposal in Verwer et al. (2017). Both approaches

have their own merits: while the Verwer formulation includes fewer constraints in the general case,

our formulation is more efficient in the case where the problem can be decomposed into individual

subproblems (as described above).
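One common way to write such a leaf-assignment embedding is a big-M formulation. The block below is a sketch under stated assumptions and is not necessarily the formulation of Appendix A.2: $z_\ell$ is an assumed binary indicator for leaf $\ell$, $\mathcal{T}_\ell$ and $\mathcal{F}_\ell$ are assumed names for the split nodes on the path to $\ell$ that are passed on their True and False branch respectively, $M$ is a sufficiently large constant, and $\epsilon > 0$ is a small tolerance standing in for the strict inequality on False branches.

```latex
\begin{aligned}
A_i^\top x &\le b_i + M(1 - z_\ell), && i \in \mathcal{T}_\ell,\ \forall \ell, \\
A_i^\top x &\ge b_i + \epsilon - M(1 - z_\ell), && i \in \mathcal{F}_\ell,\ \forall \ell, \\
\sum_{\ell} z_\ell &= 1, \qquad y = \sum_{\ell} p_\ell z_\ell, \qquad z_\ell \in \{0, 1\}.
\end{aligned}
```

With $z_3 = 1$ in the tree of Figure 2, the first family reduces to the two inequalities that define $P_3$, all constraints of the other leaves are relaxed by $M$, and the last line gives $y = p_3$.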

Ensemble Methods. Ensemble methods, such as random forests (RF) and gradient-boosting

machines (GBM), consist of many decision trees that are aggregated to obtain a single

prediction for a given observation. These models can thus be implemented by embedding many

“sub-models” (Breiman 2001). Suppose we have a forest with P trees. Each tree can be embedded as a

single decision tree (see previous paragraph) with the constraints from Appendix A.2, which yields

a predicted value yi .

RF models typically generate predictions by taking the average of the predictions from the

individual trees:
$$
y = \frac{1}{P} \sum_{i=1}^{P} y_i.
$$

This can then be used as a term in the objective, or constrained by an upper bound as y ≤ τ ; this

can be done equivalently for a lower bound. In the classification setting, the prediction averages the

probabilities returned by each model (yi ∈ [0, 1]), which can likewise be constrained or optimized.

Alternatively, we can further leverage the fact that unlike the other model classes, which return

a single prediction, the RF model generates P predictions, one per tree. We can impose a violation

limit across the individual P estimators as proposed in Section 3.1.

In the case of GBM, we have an ensemble of base-learners which are not necessarily decision

trees. The model output is then computed as


$$
y = \sum_{i=1}^{P} \beta_i y_i,
$$

where $y_i$ is the predicted value of the i-th regression model $\hat{h}_i(x, w)$ and $\beta_i$ is the weight associated

with the prediction. Although trees are typically used as base-learners, in theory we might use any

of the MIO-representable predictive models discussed in this section.
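Both aggregation rules are linear in the per-learner outputs, which is what keeps the ensemble MIO-representable once each sub-model is embedded. The snippet below only evaluates the two formulas numerically; the predictions, weights, threshold and function names are hypothetical, and in the actual MIO the $y_i$ would be decision variables linked by these equalities rather than fixed numbers.

```python
def random_forest_output(tree_predictions):
    """Average of the P per-tree predictions: y = (1/P) * sum_i y_i."""
    return sum(tree_predictions) / len(tree_predictions)


def boosted_output(learner_predictions, weights):
    """Weighted sum of base-learner predictions: y = sum_i beta_i * y_i."""
    if len(learner_predictions) != len(weights):
        raise ValueError("one weight per base-learner is required")
    return sum(b * y for b, y in zip(weights, learner_predictions))


# Hypothetical outputs y_i of P = 4 embedded sub-models at some solution x.
sub_model_preds = [0.2, 0.4, 0.6, 0.8]

y_rf = random_forest_output(sub_model_preds)
y_gbm = boosted_output(sub_model_preds, [1.0, 0.5, 0.5, 0.25])

tau = 0.55
rf_satisfies_bound = y_rf <= tau    # the learned constraint y <= tau
```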

Neural Networks. We implement multi-layer perceptrons (MLP) with a rectified linear unit

(ReLU) activation function, which form an MIO-representable class of neural networks (Grimstad

and Andersson 2019, Anderson et al. 2020). These networks consist of an input layer, L − 2 hidden

layer(s), and an output layer. This nonlinear transformation of the input space over multiple nodes

(and layers) using the ReLU operator (v = max{0, x}) allows MLPs to capture complex functions

that other algorithms cannot adequately encode, making them a powerful class of models.

Critically, the ReLU operator, v = max{0, x}, can be encoded using linear constraints, as detailed

in Appendix A.3. The constraints for an MLP network can be generated recursively starting from

the input layer, which allows us to embed a trained MLP with an arbitrary number of hidden layers

and nodes into an MIO. We refer to Appendix A.3 for details on the embedding of regression,

binary classification, and multi-class classification MLP variants.
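A widely used linear encoding of v = max{0, x} introduces a binary variable z and bounds L ≤ x ≤ U with L < 0 < U: v ≥ 0, v ≥ x, v ≤ x − L(1 − z), v ≤ Uz. This is the standard big-M pattern and may differ in detail from Appendix A.3; the bounds and the helper name below are assumptions. The check enumerates both values of z and confirms that the only feasible v is the ReLU output.

```python
def relu_bigm_feasible(x, lower, upper, tol=1e-12):
    """List (z, v_min, v_max) for each z in {0, 1} that admits a feasible v.

    Constraints: v >= 0, v >= x, v <= x - lower * (1 - z), v <= upper * z,
    assuming lower <= x <= upper and lower < 0 < upper.
    """
    feasible = []
    for z in (0, 1):
        v_min = max(0.0, x)                              # v >= 0 and v >= x
        v_max = min(x - lower * (1 - z), upper * z)      # the two big-M upper bounds
        if v_min <= v_max + tol:
            feasible.append((z, v_min, v_max))
    return feasible


# Hypothetical pre-activation bounds for one neuron.
L_BOUND, U_BOUND = -5.0, 5.0
checks = {x: relu_bigm_feasible(x, L_BOUND, U_BOUND)
          for x in (-5.0, -3.0, -0.1, 0.0, 0.1, 2.0, 5.0)}
```

For negative x only z = 0 survives and forces v = 0; for positive x only z = 1 survives and forces v = x. Tighter bounds L and U generally give a stronger relaxation.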

2.3. Convex hull as trust region

Optimal solutions of optimization problems often lie at the extremes of the feasible region, which

can be problematic for the validity of the trained ML model. Generally speaking, the accuracy

of a predictive model deteriorates for points that are further away from the data points in D

(Goodfellow et al. 2015). To mitigate this problem, we elaborate on the idea proposed by Biggs et al.

(2021) to use the convex hull (CH) of the dataset as a trust region to prevent the predictive model

from extrapolating. According to Ebert et al. (2014), when data is enclosed by a boundary of convex

shape, the region inside this boundary is known as an interpolation region. This interpolation region

is also referred to as the CH, and by excluding solutions outside the CH, we prevent extrapolation.

If $X = \{\hat{x}_i\}_{i=1}^{N}$ is the set of observed input data with $\hat{x}_i = (\bar{x}_i, \bar{w}_i)$, we define the trust region as

the CH of this set and denote it by CH(X). Recall that CH(X) is the smallest convex polytope

that contains the set of points X. It is well-known that computing the CH is exponential in time

and space with respect to the number of samples and their dimensionality (Skiena 2008). However,

since the CH is a polytope, explicit expressions for its facets are not necessary. More precisely,

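CH(X) is represented as

$$
\mathrm{CH}(X) = \left\{ x \;\middle|\; \sum_{i \in I} \lambda_i \hat{x}_i = x,\ \sum_{i \in I} \lambda_i = 1,\ \lambda \ge 0 \right\}, \tag{2}
$$

where $\lambda \in \mathbb{R}^N$, and $I = \{1, \dots, N\}$ is the index set of samples in X.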
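Representation (2) means that trust-region membership can be tested with a small feasibility LP over the weights λ, without ever computing facets. The sketch below uses SciPy's `linprog` for this; the sample points and the function name are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def in_convex_hull(samples, x):
    """Check x in CH(samples) by searching for lambda >= 0 with
    sum_i lambda_i * x_hat_i = x and sum_i lambda_i = 1."""
    samples = np.asarray(samples, dtype=float)
    x = np.asarray(x, dtype=float)
    n_samples = samples.shape[0]
    A_eq = np.vstack([samples.T, np.ones((1, n_samples))])
    b_eq = np.append(x, 1.0)
    res = linprog(np.zeros(n_samples), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * n_samples, method="highs")
    return res.status == 0        # 0: a feasible (optimal) lambda was found


# Hypothetical observations: the corners of the unit square.
X_obs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
inside = in_convex_hull(X_obs, [0.5, 0.5])
outside = in_convex_hull(X_obs, [1.5, 0.5])
```

Inside an optimization model the same equalities are added as constraints with x as a decision variable, which is why the number of λ variables grows with the number of samples.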

In situations such as the one shown in Figure 3a, CH(X) includes regions with few or no data

points (low-density regions). Blindly using CH(X) in this case can be problematic if the solutions

are found in the low-density regions. We therefore advocate the use of a two-step approach. First,

clustering is used to identify distinct high-density regions, and then the trust region is represented

as the union of the CHs of the individual clusters (Figure 3b).

We can either solve EM(w) for each cluster, or embed the union of the |K| CHs into the MIO given

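by

$$
\bigcup_{k \in K} \mathrm{CH}(X_k) = \left\{ x \;\middle|\; \sum_{i \in I_k} \lambda_i \hat{x}_i = x,\ \sum_{i \in I_k} \lambda_i = u_k \ \forall k \in K,\ \sum_{k \in K} u_k = 1,\ \lambda \ge 0,\ u \in \{0, 1\}^{|K|} \right\}, \tag{3}
$$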

where $X_k \subseteq X$ refers to the subset of samples in cluster $k \in K$ with the index set $I_k \subseteq I$. The union of

CHs requires the binary variables $u_k$ to constrain a feasible solution to lie in exactly one of the CHs.

More precisely, uk = 1 corresponds to the CH of the k-th cluster. As we show in Section 4, solving

EM(w) for each cluster may be done in parallel, which has a positive impact on computation time.

We note that both formulations (2) and (3) assume that x̂ is continuous. These formulations can

Figure 3 Use of the two-step approach to remove low-density regions. (a) CH(X) with single region. (b) CH(X) with clustered regions.

be extended to datasets with binary, categorical and ordinal features. In the case of categorical

features, extra constraints on the domain and one-hot encoding are required.

Although the CH can be represented by linear constraints, the number of variables in EM(w)

increases with the increase in the dataset size, which may make the optimization process prohibitive

when the number of samples becomes too large. We therefore provide a column selection algorithm

that selects a small subset of the samples. This algorithm can be directly used in the case of convex

optimization problems or embedded as part of a branch and bound algorithm when the optimization

problem involves integer variables. Figure 4 visually demonstrates the procedure; we begin with an

arbitrary sample of the full data, and use column selection to iteratively add samples x̂i until no

improvement can be found. In Appendix B.2, we provide a full description of the approach, as well

as a formal lemma which states that in each iteration of column selection, the selected sample from

X is also a vertex of CH(X). In synthetic experiments, we observe that the algorithm scales well

with the dataset size. The computation time required by solving the optimization problem with the

algorithm is near-constant and minimally affected by the number of samples in the dataset. The
experiments in Appendix B.2 show optimization with column selection to be significantly faster

than a traditional approach, which makes it an ideal choice when dealing with massive datasets.

Figure 4 Visualization of the column selection algorithm. Known and learned constraints define the infeasible

region. The column selection algorithm starts using only a subset of data points (red filled circles),

X ′ ⊆ X to define the trust region. In each iteration a vertex of CH(X) is selected (red hollow circle)

and included in X ′ until the optimal solution (star) is within the feasible region, namely the convex

hull of X ′ . Note that with column selection we do not need the complete dataset to obtain the optimal

solution, but rather only a subset.


[Figure 4 panels, left to right: Initial Convex Hull; Column Selection iter 1; Column Selection iter 2. Legend: points in X, points in X′, vertex of CH(X), CH(X), CH(X′), infeasible region, optimal solution.]

3. Uncertainty and Robustness

There are multiple sources of uncertainty, and consequently notions of robustness, that can be

considered when embedding a trained machine learning model as a constraint. We define two types

of uncertainty in model (1).

Function Uncertainty. The first source of uncertainty is in the underlying functional form of ĥ.

We do not know the ground truth relationship between (x, w) and y, and there is potential for model

mis-specification. We mitigate this risk through our nonparametric model selection procedure,

namely training ĥ for a diverse set of methods (e.g., decision tree, regression, neural network) and

selecting the final model using a cross-validation procedure.

Parameter Uncertainty. Even within a single model class, there is uncertainty in the parameter

estimates that define ĥ. Consider the case of linear regression. A regression estimator consists of

point estimates of coefficients and an intercept term, but there is uncertainty in the estimates

as they are derived from noisy data. We seek to make our model robust by characterizing this

uncertainty and optimizing against it. We propose model-wrapper ensemble approaches, which are

agnostic to the underlying model. The rest of this section addresses the model-wrapper approaches

and a looser formulation of the trust region that prevents the optimal solution from being too

conservative when the predictive models have good extrapolation performance.

3.1. Model wrapper approach

We begin by describing the model “wrapper” approach for characterizing uncertainty, in which

we work directly with any trained models and their point predictions. Rather than obtaining our

estimated outcome from a single trained predictive model, we suppose that we have P estimators.

The set of estimators can be obtained by bootstrapping or by training models using entirely

different methods. The uncertainty is thus characterized by different realizations of the predicted

value from multiple estimators, which effectively form an ensemble.

We introduce a constraint that at most a proportion α ∈ [0, 1] of the P estimators may violate the

constraint. Let ĥ1 , . . . , ĥP be the individual estimators. Then ĥi (x) ≤ τ must hold for at least (1 − α)P of these

estimators. This allows for a degree of robustness to individual model predictions by discarding a

small number of potential outlier predictions. Formally,


\[
\frac{1}{P} \sum_{i=1}^{P} \mathbb{I}(y_i \le \tau) \ge 1 - \alpha. \tag{4}
\]

Note that α = 0 enforces the bound for all estimators, yielding the most conservative estimate,

whereas α = 1 removes the constraint entirely. Constraint (4) is MIO-representable:

\[
\begin{aligned}
& y_i \le \tau + M (1 - z_i), \quad i = 1, \ldots, P, \\
& \frac{1}{P} \sum_{i=1}^{P} z_i \ge 1 - \alpha,
\end{aligned}
\]

where zi ∈ {0, 1} ∀i = 1, . . . , P , and M is a sufficiently large constant. Appendix A.4 includes further

details on this formulation and special cases.
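To make the violation limit concrete, the following sketch evaluates constraint (4) and its big-M rows for a fixed set of estimator predictions. The function names and the prediction values are hypothetical; in the optimization model the indicators z_i are binary decision variables chosen by the solver, whereas here they are derived directly from the predictions.

```python
def satisfied_indicators(predictions, tau):
    """z_i = 1 when estimator i predicts y_i <= tau, and 0 otherwise."""
    return [1 if y <= tau else 0 for y in predictions]


def meets_violation_limit(predictions, tau, alpha):
    """Check constraint (4): the share of estimators with y_i <= tau is at least 1 - alpha."""
    z = satisfied_indicators(predictions, tau)
    # Small slack guards against floating-point error in (1 - alpha) * P.
    return sum(z) >= (1.0 - alpha) * len(predictions) - 1e-12


def big_m_rows_hold(predictions, tau, z, big_m):
    """Check y_i <= tau + M * (1 - z_i) for every estimator."""
    return all(y <= tau + big_m * (1 - zi) for y, zi in zip(predictions, z))


# Hypothetical predictions from P = 4 estimators with threshold tau = 0.5.
predictions = [0.42, 0.47, 0.55, 0.49]
z = satisfied_indicators(predictions, 0.5)
allowed_with_quarter = meets_violation_limit(predictions, 0.5, alpha=0.25)
allowed_with_zero = meets_violation_limit(predictions, 0.5, alpha=0.0)
```

With one of four estimators above the threshold, the point is accepted for α = 0.25 but rejected for α = 0, which matches the interpretation of α = 0 as the most conservative setting.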

The violation limit concept can also be applied to estimators coming from multiple model classes,

which allows us to enforce that the constraint is generally obeyed when modeled through distinct

methods. This provides a measure of robustness to function uncertainty.



3.2. Enlarged convex hull

The use of the model wrapper approach and the trust region constraints, as defined in (2), has

a direct effect on the feasible region. The better performance of the learned constraints might be

balanced out by the (potentially) unnecessary conservatism of the optimal solution. Although we

introduced the trust region as a set of constraints to preserve the predictive performance of the

fitted constraints, Balestriero et al. (2021) show that, in high-dimensional spaces, the generalization

performance of a fitted model is typically obtained by extrapolating. In light of this evidence, we propose

an ϵ-CH formulation which builds on (2), and more generally on (3). The relaxed formulation

of the trust region enables the optimal solution of problem M (w) to be outside CH(X). Formally,

we enlarge the trust region such that solutions outside CH(X) are considered feasible if they fall

within the hyperball, with radius ϵ, surrounding at least one of the data points in X, see Figure 5

(left). The ϵ-CH is formulated as follows:

\[
\epsilon\text{-CH}(X) = \Big\{ (x, s) \;\Big|\; \sum_{i \in I} \lambda_i \hat{x}_i = x + s,\ \sum_{i \in I} \lambda_i = 1,\ \lambda \ge 0,\ \|s\|_p \le \epsilon \Big\}, \tag{5}
\]

with s ∈ R^n, and p set equal to 1, 2 or ∞ to preserve the complexity of the optimization problem.

Figure 5 (right) shows the extended region obtained with the ϵ-CH. The choice of ϵ is pivotal in the

trade-off between the performance of the learned constraints and the conservatism of the optimal

solution. In the next section, we demonstrate how an increase in ϵ affects both the performance of

the embedded predictive models and the objective function value.
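A small sketch of formulation (5) is given below: for fixed weights λ it computes the slack s implied by the equality constraint and compares its norm with ϵ. The function names, the tolerance, and the sample points are illustrative assumptions, not part of the original formulation, and the weights would normally be chosen by the solver.

```python
import math


def slack_norm(s, p):
    """Norm of the slack vector for p in {1, 2, math.inf}."""
    if p == 1:
        return sum(abs(v) for v in s)
    if p == 2:
        return math.sqrt(sum(v * v for v in s))
    if p == math.inf:
        return max(abs(v) for v in s)
    raise ValueError("p must be 1, 2, or math.inf")


def certifies_enlarged_membership(samples, lam, x, eps, p=2, tol=1e-9):
    """Return True if lam certifies that x lies in the eps-CH of samples, as in (5)."""
    if len(samples) != len(lam):
        raise ValueError("need exactly one weight per sample")
    if any(w < -tol for w in lam) or abs(sum(lam) - 1.0) > tol:
        return False
    combination = [
        sum(w * sample[j] for w, sample in zip(lam, samples))
        for j in range(len(x))
    ]
    # Slack implied by sum_i lam_i * x_hat_i = x + s.
    s = [c - xj for c, xj in zip(combination, x)]
    return slack_norm(s, p) <= eps + tol


# Hypothetical samples; the query point lies outside their convex hull.
samples = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
lam = [1.0, 0.0, 0.0]
x = (-0.3, -0.4)
accepted_l2 = certifies_enlarged_membership(samples, lam, x, eps=0.5, p=2)
accepted_l1 = certifies_enlarged_membership(samples, lam, x, eps=0.5, p=1)
```

For the same ϵ, the choice of p changes the shape of the enlargement: the slack (0.3, 0.4) has 2-norm 0.5 but 1-norm 0.7, so this certificate passes for p = 2 and fails for p = 1.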

4. Case study: a palatable food basket for the World Food Programme

In this case study, we use a simplified version of the model proposed by Peters et al. (2021),

which seeks to optimize humanitarian food aid. Its extended version aims to provide the World

Food Programme (WFP) with a decision-making tool for long-term recovery operations, which

simultaneously optimizes the food basket to be delivered, the sourcing plan, the delivery plan,

and the transfer modality of a month-long food supply. The model proposed by Peters et al.

(2021) enforces that the food baskets address the nutrient gap and are palatable. To guarantee a

Figure 5 Trust region enlarged using a hyperball with radius ϵ around each sample in CH(X).

[Figure 5 panels: (left) CH(X), hyperballs with radius ϵ, data points in X, and a feasible solution; (right) the enlarged region, CH(X), and data points in X.]

certain level of palatability, the authors use a number of “unwritten rules” that have been defined

in collaboration with nutrition experts. In this case study, we take a step further by inferring

palatability constraints directly from data that reflects local people’s opinions. We use the specific

case of Syria for this example. The conceptual model presents an LO structure with only the food

palatability constraint to be learned. Data on palatability is generated through a simulator, but

the procedure would remain unchanged if data were collected in the field, for example through

surveys. The structure of this problem, which is an LO and involves only one learned constraint,

allows the following analyses: (1) the effect of the trust-region on the optimal solution, and (2)

the effect of clustering on the computation time and the optimal objective value. Additionally,

the use of simulated data provides us with a ground truth to use in evaluating the quality of the

prescriptions.

4.1. Conceptual model

The optimization model is a combination of a capacitated, multi-commodity network flow model,

and a diet model with constraints for nutrition levels and food basket palatability.

The sets used to define the constraints and the objective function are displayed in Table 1. We

have three different sets of nodes, and the set of commodities contains all the foods available for

procurement during the food aid operation.

Table 1 Definition of the sets used in the WFP model.

Sets
N_S    Set of source nodes
N_T    Set of transshipment nodes
N_D    Set of delivery nodes
K      Set of commodities (k ∈ K)
L      Set of nutrients (l ∈ L)

The parameters used in the model are displayed in Table 2. The costs used in the objective

function concern transportation (pT ) and procurement (pP ). The amount of food to deliver depends

on the demand (d) and the number of feeding days (days). The nutritional requirements (nutreq)

and nutritional values (nutval) are detailed in Appendix C. The parameter γ is needed to convert

the metric tons used in the supply chain constraints to the grams used in the nutritional constraints.

The parameter t is used as a lower bound on the food basket palatability. The values of these

parameters are based on those used by Peters et al. (2021).

Table 2 Definition of the parameters used in the WFP model.

Parameters
γ            Conversion rate from metric tons (mt) to grams (g)
d_i          Number of beneficiaries at delivery point i ∈ N_D
days         Number of feeding days
nutreq_l     Nutritional requirement for nutrient l ∈ L (grams/person/day)
nutval_kl    Nutritional value for nutrient l ∈ L per gram of commodity k ∈ K
p^P_ik       Procurement cost (in $ / mt) of commodity k from source i ∈ N_S
p^T_ijk      Transportation cost (in $ / mt) of commodity k from node i ∈ N_S ∪ N_T to node j ∈ N_T ∪ N_D
t            Palatability lower bound

The decision variables are shown in Table 3. The flow variables Fijk are defined as the metric

tons of a commodity k transported from node i to j. The variable xk represents the average daily

ration per beneficiary for commodity k. The variable y refers to the palatability of the food basket.

Table 3 Definition of the variables used in the WFP model.

Variables
F_ijk    Metric tons of commodity k ∈ K transported between node i and node j
x_k      Grams of commodity k ∈ K in the food basket
y        Food basket palatability

The full model formulation is as follows:

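\begin{align}
\min_{x, y, F} \quad & \sum_{i \in N_S} \sum_{j \in N_T \cup N_D} \sum_{k \in K} p^P_{ik} F_{ijk} + \sum_{i \in N_S \cup N_T} \sum_{j \in N_T \cup N_D} \sum_{k \in K} p^T_{ijk} F_{ijk} \tag{6a} \\
\text{s.t.} \quad & \sum_{j \in N_T} F_{ijk} = \sum_{j \in N_T} F_{jik}, && i \in N_T,\ k \in K, \tag{6b} \\
& \sum_{j \in N_S \cup N_T} \gamma F_{jik} = d_i \, x_k \, days, && i \in N_D,\ k \in K, \tag{6c} \\
& \sum_{k \in K} Nutval_{kl} \, x_k \ge Nutreq_l, && l \in L, \tag{6d} \\
& x_{salt} = 5, \tag{6e} \\
& x_{sugar} = 20, \tag{6f} \\
& y \ge t, \tag{6g} \\
& y = \hat{h}(x), \tag{6h} \\
& F_{ijk},\ x_k \ge 0, && i, j \in N,\ k \in K. \tag{6i}
\end{align}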

The objective function consists of two components, procurement costs and transportation costs.

Constraints (6b) are used to balance the network flow, namely to ensure that the inflow and the

outflow of a commodity are equal for each transshipment node. Constraints (6c) state that flow

into a delivery node has to be equal to its demand, which is defined by the number of beneficiaries

times the daily ration for commodity k times the feeding days. Constraints (6d) guarantee an

optimal solution that meets the nutrition requirements. Constraints (6e) and (6f) force the amount

of salt and sugar to be 5 grams and 20 grams respectively. Constraint (6g) requires the food basket

palatability (y), defined by means of a predictive model (6h), to be greater than a threshold (t).

Lastly, non-negativity constraints (6i) are added for all commodity flows and commodity rations.
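The unit conversion in (6c) and the nutrient check in (6d) can be illustrated with a short sketch. All numbers below are made up for illustration (they are not WFP data or values from Peters et al. 2021), the function names are hypothetical, and γ is taken to be 1,000,000 grams per metric ton.

```python
GRAMS_PER_METRIC_TON = 1_000_000  # gamma: conversion rate from metric tons to grams


def required_inflow_mt(beneficiaries, ration_g, feeding_days, gamma=GRAMS_PER_METRIC_TON):
    """Inflow (mt) into a delivery node implied by (6c): gamma * F = d * x * days."""
    return beneficiaries * ration_g * feeding_days / gamma


def meets_nutrition(basket_g, nutval, nutreq):
    """Check (6d): for every nutrient l, sum_k nutval[k][l] * x_k >= nutreq[l]."""
    return all(
        sum(nutval[k][l] * grams for k, grams in basket_g.items()) >= need
        for l, need in nutreq.items()
    )


# Hypothetical delivery node: 10,000 beneficiaries, 20 g of sugar per day, 30 feeding days.
sugar_inflow = required_inflow_mt(10_000, 20, 30)

# Hypothetical basket and nutrient contents per gram of commodity.
basket_g = {"wheat": 400, "oil": 25}
nutval = {"wheat": {"energy": 3}, "oil": {"energy": 9}}
enough_energy = meets_nutrition(basket_g, nutval, {"energy": 1400})
```

In this toy instance the delivery node needs 6 metric tons of sugar over the month, and the basket supplies 1425 energy units per day, which clears the assumed requirement of 1400.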

Table 4 Two examples of daily food baskets.

Commodity             Basket 1 Amount (g)    Basket 2 Amount (g)
DSM                   31.9                   33.9
Chickpeas             –                      75.7
Lentils               41                     –
Maize meal            48.9                   –
Meat                  –                      17.2
Oil                   22                     28.6
Salt                  5                      5
Sugar                 20                     20
Wheat                 384.2                  131.2
Wheat flour           –                      261.3
WSB                   67.3                   59.8
Palatability Score    0.436                  0.741
DSM=dried skim milk, WSB=wheat soya blend.

4.2. Dataset and predictive models

To evaluate the ability of our framework to learn and implement the palatability constraints, we

use a simulator to generate diets with varying palatabilities. Each sample is defined by 25 features

representing the amount (in grams) of all commodities that make up the food basket. We then

use a ground truth function to assign each food basket a palatability between 0 and 1, where 1

corresponds to a perfectly palatable basket, and 0 to an inedible basket. This function is based

on suggestions provided by WFP experts and complete details are outlined in Appendix C.1. The

data is then balanced to ensure that a wide variety of palatability scores are represented in the

dataset. The final data used to learn the palatability constraint consists of 121,589 samples. Two

examples of daily food baskets and their respective palatability scores are shown in Table 4. In this

case study, we use a palatability lower bound (t) of 0.5 for our learned constraint.

The next step of the framework involves training and choosing the predictive model that best

approximates the unknown constraint. The predictive models used to learn the palatability constraints

are those discussed in Section 2, namely LR, SVM, CART, RF, GBM with decision trees

as base-learners, and MLP with ReLU activation function.

4.3. Optimization results

The experiments are executed using OptiCL jointly with Gurobi v9.1 (Gurobi Optimization, LLC

2021) as the optimization solver. Table 5 reports the performances of the predictive models evalu-

ated both for the validation set and for the prescriptions after being embedded into the optimization

model. The table also compares the performance of the optimization with and without the trust

region. The column “Validation MSE” gives the Mean Squared Error (MSE) of each model obtained

in cross-validation during model selection. While all scores in this column are desirably low, the

MLP model achieves the lowest error (0.001) during this validation phase. The column “MSE”

gives the MSE of the predictive models once embedded into the optimization problem to evalu-

ate how well the predictions for the optimal solutions match their true palatabilities (computed

using the simulator). It is found using 100 optimal solutions of the optimization model generated

with different cost vectors. The MLP model exhibits the best performance (0.055) in this context,

showing its ability to model the palatability constraint better than all other methods.

Table 5 Predictive models performances for the validation set (“Validation MSE”), and for the prescriptions after being embedded into the optimization model with (“MSE-TR”) and without the trust region (“MSE”). The last two columns show the average computation time in seconds and its standard deviation (SD) required to solve the optimization model with (“Time-TR”) and without the trust region (“Time”).

Model    Validation MSE    MSE      MSE-TR    Time (SD)          Time-TR (SD)
LR       0.046             0.256    0.042     0.003 (0.0008)     1.813 (0.204)
SVM      0.019             0.226    0.027     0.003 (0.0006)     1.786 (0.208)
CART     0.014             0.273    0.059     0.012 (0.0030)     7.495 (5.869)
RF       0.018             0.252    0.025     0.248 (0.1050)     30.128 (13.917)
GBM      0.006             0.250    0.017     0.513 (0.4562)     60.032 (41.685)
MLP      0.001             0.055    0.001     14.905 (41.764)    28.405 (23.339)
Runtimes reported using an Intel i7-8665U 1.9 GHz CPU, 16 GB RAM (Windows 10 environment).

Benefit of trust region. Table 5 shows that when the trust region is used (“MSE-TR”), the MSEs

obtained by all models are now much closer to the results from the validation phase. This shows

the benefit of using the trust region as discussed in Section 2.3 to prevent extrapolation. With

the trust region included, the MLP model also exhibits the lowest MSE (0.001). The improved

performance seen with the inclusion of the trust region does come at the expense of computation

speed. The column “Time-TR” shows the average computation time in seconds and its standard

deviation (SD) with trust region constraints included. In all cases, the computation time has clearly

increased when compared against the computation time required without the trust region (column

“Time”). This is however acceptable, as significantly more accurate results are obtained with the

trust region.

Benefit of clustering. The large dataset used in this case study makes the use of the trust

region expensive in terms of time required to solve the final optimization model. While the column

selection algorithm described in Section 2.3 is ideal for significantly reducing the computation

time, optimization models that require binary variables, either for embedding an ML model or to

represent decision variables, would require column selection to be combined with a branch and

bound algorithm. However, in this more general MIO case, it is possible to divide the dataset into

clusters and solve in parallel an MIO for each cluster. By using parallelization, the total solution

time can be expected to be equal to the longest time required to solve any single cluster’s MIO.

Contrary to column selection, the use of clusters can result in more conservative solutions; the

trust region gets smaller with more clusters and prevents the model from finding solutions that

are convex combinations of members of different clusters. However, as described in Section 2.3,

solutions that lie between clusters may in fact reside in low-density areas of the feature space that

should not be included in the trust region. In this sense, the loss in the objective value might

actually coincide with more trustable solutions.

Figure 6 shows the effect of clusters in solving the model (6a-6i) with GBM as the predictive

model used to learn the palatability constraint. K-means is used to partition the dataset into K

clusters, and the reported values are averaged over 100 iterations. In the left graph, we report the

maximum runtime distribution across clusters needed to solve the different MIOs in parallel. In

the right graph, we have the distributions of optimality gap, i.e., the relative difference between

the optimal solution obtained with clusters compared to the solution obtained with no clustering.

In this case study, the use of clusters significantly decreases the runtime (89.2% speed up with

K = 50) while still obtaining near-optimal solutions (less than 0.25% average gap with K = 50).

We observe that the trends are not necessarily monotonic in K. It is possible that a certain choice

of K may lead to a suboptimal solution, whereas a larger value of K may preserve the optimal

solution as the convex combination of points within a single cluster.
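The two quantities discussed above can be computed as in the following sketch. The exact definitions used for the reported figures are not restated here, so the formulas are assumptions: the parallel runtime is taken as the maximum over clusters, the speed-up as the relative reduction in runtime, and the gap as the relative difference between the best per-cluster objective and the objective without clustering (for a minimization problem). All numbers are hypothetical.

```python
def parallel_runtime(cluster_runtimes):
    """Wall-clock time when every cluster's MIO is solved in parallel."""
    return max(cluster_runtimes)


def speed_up_pct(baseline_runtime, clustered_runtime):
    """Relative runtime reduction, in percent, with respect to the baseline."""
    return 100.0 * (baseline_runtime - clustered_runtime) / baseline_runtime


def objective_gap_pct(objective_no_clustering, cluster_objectives):
    """Relative loss, in percent, of the best per-cluster objective (minimization)."""
    best = min(cluster_objectives)
    return 100.0 * (best - objective_no_clustering) / abs(objective_no_clustering)


# Hypothetical runtimes (seconds) and objective values.
runtime = parallel_runtime([4.0, 6.5, 5.0])
speed_up = speed_up_pct(65.0, runtime)
gap = objective_gap_pct(1000.0, [1002.0, 1010.0])
```

Because the union of cluster hulls is contained in CH(X), the gap computed this way is never negative for a minimization problem solved to optimality.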



Figure 6 Effect of the number of clusters (K) on the computation time and the optimality gap across clusters,

with bootstrapped 95% confidence intervals.

[Figure 6 panels: (left) Maximum runtime (seconds) versus K; (right) Objective gap (%) versus K, for K from 0 to 50.]

4.4. Robustness results

In these experiments, we assess the performance of the nominal and robust models. We consider

three dimensions of performance: (1) true constraint satisfaction, (2) objective function value, and

(3) runtime. The synthetic data used in this case study allows us to evaluate true palatability and

constraint satisfaction as these parameters vary. This is the primary goal of the model wrapper

ensemble approach, to improve feasibility and make solutions that are robust to any single learned

estimator.

We hypothesize that as our models become more conservative, we will more reliably satisfy the

desired palatability constraint with some toll on the objective function. Additionally, embedding

multiple models or characterizing uncertainty sets introduces computational complexity over a

single nominal model. In this section, we compare the trade-offs in these metrics as we consider

different notions of robustness and vary our conservativeness. We note that we are able to evaluate

whether the true palatability meets the constraint threshold since palatability is defined through

a known function. As with the experiments above, we solve the palatability problem with 100

different realizations of the cost vector and average the results.



The results below explore the effect of the α (violation limit) on cost and palatability in the

WFP case study. Additional results on runtime, and experiments with varied estimators (P ), are

included in Appendix C.3. As the results demonstrate, the robustness parameters yield solutions

that vary in their conservativeness and runtime. There is not a single set of optimal parameters.

Rather, it is highly dependent on the use case, including factors like the stakes of the decision and

the allowable turnaround time to generate solutions.

Multiple embedded models. We first consider the impact of the model wrapper approach in the

WFP problem. We compare different ways of embedding the palatability constraint, both using

multiple estimators of a single model class and an ensemble containing multiple model classes. We

run the experiments on a random sample of 1000 observations in the original WFP dataset. Within

a single model class, we vary the number of estimators (P ∈ {2, 5, 10, 25}) and the violation limit

(α ∈ {0, 0.1, 0.2, 0.5}, or applying a mean constraint). Each estimator is obtained using a bootstrap

sample (proportion = 0.5) of the underlying data. We compute metrics (1-3) for each variant to

compare the tradeoffs in palatability (constraint satisfaction) and cost (objective function value).

Figure 7 presents the results for a decision tree with P = 25 and palatability threshold (τ ) equal

to 0.5. The left figure shows the trade off between palatability and the objective as the violation

limit (α) varies. As expected, improvements in palatability (when α decreases) lead to increases

in the total cost. However, we observe that a violation limit of 0.0 (vs. 0.5) leads to an 11.3%

improvement in real palatability (20.8% improvement in predicted palatability), with a relatively

modest 2.5% increase in cost. The center and right figures show how palatability and violations

vary with α. Palatability increases and violations decrease with lower α. Both the violation rate

(proportion of iterations with real palatability < 0.5) and violation margin (average distance to

palatability threshold in cases where there is a violation) decrease with lower α. This experiment

demonstrates how the α parameter effectively controls the model’s robustness as measured by

constraint satisfaction. The approach has the advantage of parameterizing the violation limit,

allowing us to explicitly control the model’s conservativeness and evaluate constraint-objective

tradeoffs.
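The two violation measures defined above can be computed as in this minimal sketch. The function name and the palatability values are hypothetical; the definitions follow the text, with the violation rate as the proportion of iterations whose real palatability falls below the threshold and the violation margin as the average distance to the threshold among those iterations.

```python
def violation_metrics(real_palatability, threshold=0.5):
    """Return (violation rate, violation margin) for a list of real palatability scores."""
    shortfalls = [threshold - score for score in real_palatability if score < threshold]
    rate = len(shortfalls) / len(real_palatability)
    # The margin is only defined over violating iterations; report 0.0 when there are none.
    margin = sum(shortfalls) / len(shortfalls) if shortfalls else 0.0
    return rate, margin


# Hypothetical real palatability scores from four iterations.
rate, margin = violation_metrics([0.62, 0.48, 0.55, 0.44])
```

Here two of the four iterations fall below 0.5, by 0.02 and 0.06, so the rate is 0.5 and the margin is 0.04.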

Appendix C.3 reports further results for other model classes as well as runtime experiments.

Figure 7 Comparison of CART models on objective function and constraint satisfaction.


[Figure 7 panels: (left) Objective (Cost); (center) Palatability, real and predicted, with the constraint threshold; (right) Violation, showing violation rate and violation margin with the constraint threshold. Horizontal axis ticks in each panel: 0.0, 0.1, 0.25, 0.5.]

Enlarged trust region. In order to evaluate the effects of the enlarged trust region on the optimal

solution, we use a simplified version of problem (6a-6i) where the only constraints are on the

predictive model embedding, the palatability lower bound, and the ϵ-CH. In Figure 8, we show

how the objective function value and true palatability score vary according to different values of

ϵ ∈ [0, 0.8]. The results are obtained by averaging over 200 iterations with randomly generated cost

vectors and using a decision tree as a predictive model to represent the palatability outcome. As

expected, the objective value improves as ϵ increases. More interesting is the true palatability score

which stays around the imposed lower bound of 0.5 for values of ϵ smaller than 0.25. This means

that the predictive model is able to generalize even outside the CH as long as the optimal solution

is not too far from it.

5. Case study: chemotherapy regimen design

In this case study, we extend the work of Bertsimas et al. (2016) in the design of chemotherapy

regimens for advanced gastric cancer. Late stage gastric cancer has a poor prognosis with limited

treatment options (Yang et al. 2011). This has motivated significant research interest and clinical

trials (National Cancer Institute 2021). In Bertsimas et al. (2016), the authors pose the question

of algorithmically identifying promising chemotherapy regimens for new clinical trials based on

existing trial results. They construct a database of clinical trial treatment arms which includes

cohort and study characteristics, the prescribed chemotherapy regimen, and various outcomes.

Figure 8 Effect of the ϵ-CH on the objective value and the predictive model performance with respect to the

optimal solution. The values are obtained as an average of 200 iterations.

[Figure 8 plot: avg obj, avg real palat, and palat threshold versus epsilon (0.0 to 0.8).]

Given a new study cohort and study characteristics, they optimize a chemotherapy regimen to

maximize the cohort’s survival subject to a constraint on overall toxicity. The original work uses

linear regression models to predict survival and toxicity, and it constrains a single toxicity measure.

In this work we leverage a richer class of ML methods and more granular outcome measures. This

offers benefits through higher performing predictive models and more clinically-relevant constraints.

Chemotherapy regimens are particularly challenging to optimize, since they involve multiple

drugs given at potentially varying dosages, and they present risks for multiple adverse events

that must be managed. This example highlights the generalizability of our framework to complex

domains with multiple decisions and learned functions. The treatment variables in this problem

consist of both binary and continuous elements, which are easily incorporated through our use

of MIO. We have several learned constraints which must be simultaneously satisfied, and we also

learn the objective function directly as a predictive model.



5.1. Conceptual model

The use of clinical trial data forces us to consider each cohort as an observation, rather than

an individual, since only aggregate measures are available. Thus, our model optimizes a cohort’s

treatment. The contextual variables (w) consist of various cohort and study summary variables.

The inclusion of fixed, i.e., non-optimization, features allows us to account for differences in baseline

health status and risk across study cohorts. These features are included in the predictive models

but then are fixed in the optimization model to reflect the group for whom we are generating a

prescription. We assume that there are no unobserved confounding variables in this prescriptive

setting.

The treatment variables (x) encode a chemotherapy regimen. A regimen is defined by a set of

drugs, each with an administration schedule of potentially varied dosages throughout a chemotherapy

cycle. We characterize a regimen by drug indicators and each drug’s average daily dose and

maximum instantaneous dose in the cycle:

\[
\begin{aligned}
x_d^b &= \mathbb{I}(\text{drug } d \text{ is administered}), \\
x_d^a &= \text{average daily dose of drug } d, \\
x_d^i &= \text{maximum instantaneous dose of drug } d.
\end{aligned}
\]

This allows us to differentiate between low-intensity, high-frequency and high-intensity, low-

frequency dosing strategies. The outcomes of interest (y) consist of overall survival, to be included

as the objective (yOS ), and various toxicities, to be included as constraints (yi , i ∈ YC ).
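The encoding above can be sketched as follows. The drug names, the dosing schedules, and the function names are hypothetical, and a schedule is assumed to be a list of daily doses over one cycle with zeros on days without administration.

```python
def encode_regimen(drugs, schedule):
    """Flatten a regimen into (indicator, average daily dose, maximum dose) per drug."""
    x = []
    for drug in drugs:
        doses = schedule.get(drug, [])
        if any(dose > 0 for dose in doses):
            x += [1, sum(doses) / len(doses), max(doses)]
        else:
            x += [0, 0.0, 0.0]
    return x


def respects_drug_limit(x, max_drugs=3):
    """Check that the number of administered drugs does not exceed max_drugs."""
    return sum(x[0::3]) <= max_drugs


# Hypothetical four-day cycle: drug A is given once at a high dose, drug B daily at a low dose.
drugs = ["A", "B", "C"]
schedule = {"A": [100, 0, 0, 0], "B": [25, 25, 25, 25]}
x = encode_regimen(drugs, schedule)
```

Drugs A and B share the same average daily dose of 25 but differ in their maximum dose (100 versus 25), which is what lets the encoding separate high-intensity, low-frequency dosing from low-intensity, high-frequency dosing. Three entries per drug is also consistent with 28 drugs yielding 84 decision variables.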

To determine the optimal chemotherapy regimen x for a new study cohort with characteristics

w, we formulate the following MIO:

\begin{align*}
\min_{x, y} \quad & y_{OS} \\
\text{s.t.} \quad & y_i \le \tau_i, && i \in Y_C, \\
& y_i = \hat{h}_i(x, w), && i \in Y_C, \\
& y_{OS} = \hat{h}_{OS}(x, w), \\
& \sum_{d} x_d^b \le 3, \\
& x^b \in \{0, 1\}^d, \\
& x \in X(w).
\end{align*}

In this case study, we learn the full objective. However, this model could easily incorporate deterministic

components to optimize as additional weighted terms in the objective. We include one

domain-driven constraint, enforcing a maximum regimen combination of three drugs.

The trust region, X (w), plays two crucial roles in the formulation. First, it ensures that the

predictive models are applied within their valid bounds and not inappropriately extrapolated. It

also naturally enforces a notion of “clinically reasonable” treatments. It prevents drugs from being

prescribed at doses outside of previously observed bounds, and it requires that the drug combination

must have been previously seen (although potentially in different doses). It is nontrivial to explicitly

characterize what constitutes a realistic treatment, and the convex hull provides a data-driven

solution that integrates directly into the model framework. Furthermore, the convex hull implicitly

enforces logical constraints between the different dimensions of x. For example, a drug’s average

and instantaneous dose must be 0, if the drug’s binary indicator is set to 0: this does not need to

be explicitly included as a constraint, since this is true for all observed treatment regimens. The

only explicit constraint required here is that the indicator variables xb are binary.

5.2. Dataset

Our data consists of 495 clinical trial arms from 1979-2012 (Bertsimas et al. 2016). We consider

nine contextual variables, including the average patient age and breakdown of primary cancer site.

There are 28 unique drugs that appear in multiple arms of the training set, yielding 84 decision

variables. We include several “dose-limiting toxicities” (DLTs) for our constraint set: Grade 3/4

constitutional toxicity, gastrointestinal toxicity, and infection, as well as Grade 4 blood toxicity.

As the name suggests, these are chemotherapy side effects that are severe enough to affect the

course of treatment. We also consider incidence of any dose-limiting toxicity (“Any DLT”), which

aggregates over a superset of these DLTs.

We apply a temporal split, training the predictive models on trial arms through 2008 and generat-

ing prescriptions for the trial arms in 2009-2012. The final training set consists of 320 observations,

and the final testing set consists of 96 observations. The full feature set, inclusion criteria, and

data processing details are included in Appendix D.1.
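The temporal split described above can be sketched in a few lines. This is only an illustration: the record layout, the field names `arm_id` and `year`, and the toy values are hypothetical and are not the study data.

```python
# Toy records; field names and values are hypothetical, not the study data.
arms = [
    {"arm_id": "A", "year": 1998},
    {"arm_id": "B", "year": 2008},
    {"arm_id": "C", "year": 2009},
    {"arm_id": "D", "year": 2012},
]


def temporal_split(records, last_train_year=2008, last_test_year=2012):
    """Train on arms through last_train_year, test on the later arms."""
    train = [r for r in records if r["year"] <= last_train_year]
    test = [r for r in records
            if last_train_year < r["year"] <= last_test_year]
    return train, test


train_arms, test_arms = temporal_split(arms)
```

Splitting on the study year, rather than at random, mimics the setting in which prescriptions are generated going forward from past treatment decisions.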

To define the trust region, we take the convex hull of the treatment variables (x) on the training

set. This aligns with the temporal split setting, in which we are generating prescriptions going

forward based on an existing set of past treatment decisions. In general it is preferable to define the

convex hull with respect to both x and w as discussed in Appendix B.1, but this does not work

well with a temporal split. Our data includes the study year as a feature to incorporate temporal

effects, so our test set observations will by construction fall outside of the convex hull defined by

the observed (x, w) in our training set.

5.3. Predictive models

Several ML models are trained for each outcome of interest using cross-validation for parameter

tuning, and the best model is selected based on the validation criterion. We employ function

learning for all toxicities, directly predicting the toxicity incidence and applying an upper bound

threshold within the optimization model.

Based on the model selection procedure, overall DLT, gastrointestinal toxicity, and overall sur-

vival are predicted using GBM models. Blood toxicity and infection are predicted using linear

models, and constitutional toxicity is predicted with a RF model. This demonstrates the advantage

of learning with multiple model classes; no single method dominates in predictive performance. A

complete comparison of the considered models is included in Appendix D.2.



5.4. Evaluation framework

We generate prescriptions using the optimization model outlined in Section 5.1, with the embedded

model choices specified in Section 5.3. In order to evaluate the quality of our prescriptions, we must

estimate the outcomes under various treatment alternatives. This evaluation task is notoriously

challenging due to the lack of counterfactuals. In particular, we only know the true outcomes for

observed cohort-treatment pairs and do not have information on potential unobserved combina-

tions. We propose an evaluation scheme that leverages a “ground truth” ensemble (GT ensemble).

We train several ML models using all data from the study. These models are not embedded in

an MIO model, so we are able to consider a broader set of methods in the ensemble. We then

predict each outcome by averaging across all models in the ensemble. This approach allows us to

capture the maximal knowledge scenario. Furthermore, such a “consensus” approach of combining

ML models has been shown to improve predictive performance and is more robust to individual

model error (Bertsimas et al. 2021). The full details of the ensemble models and their predictive

performances are included in Appendix D.3.
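The averaging step of the GT ensemble can be sketched as follows. The three lambda functions are hypothetical stand-ins for trained regressors; the actual ensemble members are those listed in Appendix D.3, and the function name `gt_ensemble_predict` is introduced here for illustration.

```python
from statistics import mean


def gt_ensemble_predict(models, x, w):
    """Average the predictions of every model in the ensemble for one (x, w)."""
    return mean(model(x, w) for model in models)


# Hypothetical stand-ins for trained regressors; each maps (x, w) to a number.
toy_models = [
    lambda x, w: 0.2 * x[0] + 0.1 * w[0],
    lambda x, w: 0.3 * x[0],
    lambda x, w: 0.1 * x[0] + 0.2 * w[0],
]

estimate = gt_ensemble_predict(toy_models, x=[1.0], w=[2.0])
```

Because the ensemble is used only for evaluation and is never embedded in the MIO, its members do not need to be MIO-representable.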

5.5. Optimization results

We evaluate our model in multiple ways. We first consider the performance of our prescrip-

tions against observed (given) treatments. We then explore the impact of learning multiple sub-

constraints rather than a single aggregate toxicity constraint. All optimization models have the

following shared parameters: a toxicity upper bound at the 0.6 quantile (as observed in the training data)

and a maximum violation of 25% for RF models. We report results for all test set observations with

a feasible solution. It is possible that an observation has no feasible solution, implying that there is

not a suitable drug combination lying within the convex hull for this cohort based on the toxicity

requirements. These cases could be further investigated through a sensitivity analysis by relaxing

the toxicity constraints or enlarging the trust region. With clinical guidance, one could evaluate

the modifications required to make the solution feasible and the clinical appropriateness of such

relaxations.
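The two shared parameters can be illustrated with a minimal sketch. It assumes that the threshold is the empirical 0.6 quantile of the training incidences (linear interpolation is one of several quantile conventions) and that the violation limit means that at most 25% of the individual trees may predict a value above the threshold. Both readings, the function names, and all toy numbers are assumptions made here for illustration.

```python
import math


def quantile(values, q):
    """Empirical q-quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def satisfies_with_violation_limit(tree_predictions, tau, alpha=0.25):
    """True if at most a fraction alpha of the estimators predict above tau."""
    violations = sum(1 for p in tree_predictions if p > tau)
    return violations <= alpha * len(tree_predictions)


# Hypothetical training incidences for one toxicity.
train_incidence = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
tau = quantile(train_incidence, 0.6)

# Hypothetical per-tree predictions for one candidate regimen.
ok = satisfies_with_violation_limit([0.10, 0.12, 0.18, 0.30], tau)
```

Raising `alpha` makes the constraint less conservative, which is one of the relaxations a sensitivity analysis could explore when a cohort has no feasible solution.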

Table 6 reports the predicted outcomes under two constraint approaches: (1) constraining each

toxicity separately (“All Constraints”), and (2) constraining a single aggregate toxicity measure

(“DLT Only”). For each cohort in the test set, we generate predictions for all outcomes of interest

under both prescription schemes and compute the relative change of our prescribed outcome from

the given outcome predictions.

Benefit of prescriptive scheme. We begin by evaluating our proposed prescriptive scheme (“All

Constraints”) against the observed actual treatments. For example, under the GT ensemble scheme,

84.7% of cohorts satisfied the overall DLT constraint under the given treatment, compared to

94.1% under the proposed treatment. This yields a relative improvement of 11.10%. We obtain a

significant improvement in survival (11.40%) while also improving toxicity limit satisfaction across all

individual toxicities. Using the GT ensemble, we see toxicity satisfaction improvements between

1.3%-25.0%. We note that since toxicity violations are reported using the average incidence for each

cohort, and the constraint limits are toxicity-specific, it is possible for a single DLT’s incidence to

be over the allowable limit while the overall “Any DLT” rate is not.

Table 6 Comparison of outcomes under given treatment regimen, regimen prescribed when only constraining the
aggregate toxicity, and regimen prescribed under our full model.

                                   All Constraints              DLT Only
Outcome           Given (SD)       Prescribed (SD)  % Change    Prescribed (SD)  % Change
Any DLT           0.847 (0.362)    0.941 (0.237)     11.10%     0.906 (0.294)      6.90%
Blood             0.812 (0.393)    0.824 (0.383)      1.40%     0.706 (0.458)    -13.00%
Constitutional    0.953 (0.213)    1.000 (0.000)      4.90%     1.000 (0.000)      4.90%
Infection         0.882 (0.324)    0.894 (0.310)      1.30%     0.800 (0.402)     -9.30%
Gastrointestinal  0.800 (0.402)    1.000 (0.000)     25.00%     1.000 (0.000)     25.00%
Overall Survival  10.855 (1.939)   12.092 (1.470)    11.40%     12.468 (1.430)    14.90%
We report the mean and standard deviation (SD) of constraint satisfaction (binary indicator) and overall survival (months) across
the test set. The relative change is reported against the given treatment.
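As a quick check of how the "% Change" columns are formed, the relative change can be recomputed from the rounded means in Table 6. The helper name `relative_change` is introduced here for illustration; small discrepancies with the table are possible for other rows, since the table was presumably computed from unrounded values.

```python
def relative_change(given, prescribed):
    """Relative change of the prescribed value with respect to the given value."""
    return (prescribed - given) / given


# Means taken from the "Given" and "All Constraints" columns of Table 6.
any_dlt = relative_change(0.847, 0.941)      # constraint satisfaction rate
survival = relative_change(10.855, 12.092)   # overall survival in months

print(f"Any DLT: {any_dlt:.1%}, Overall Survival: {survival:.1%}")
```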

Benefit of multiple constraints. Table 6 also illustrates the value of enforcing constraints on each

individual toxicity rather than as a single measure. When only constraining the aggregate toxicity

measure (“DLT Only”), the resultant prescriptions actually have lower constraint satisfaction for

blood toxicity and infection than the baseline given regimens. By constraining multiple measures,

we are able to improve across all individual toxicities. The fully constrained model actually improves

the overall DLT measure satisfaction, suggesting that the inclusion of these “sub-constraints”

also makes the aggregate constraint more robust. This improvement does come at the expense of

slightly lower survival between the “All” and “DLT Only” models (-0.38 months) but we note that

incurring the individual toxicities that are violated in the “DLT Only” model would likely make

the treatment unviable.

6. Discussion

Our experimental results illustrate the benefits of our constraint learning framework in data-

driven decision making in two problem settings: food basket recommendations for the WFP and

chemotherapy regimens for advanced gastric cancer. The quantitative results show an improvement

in predictive performance when incorporating the trust region and learning from multiple candi-

date model classes. Our framework scales to large problem sizes, enabled by efficient formulations

and tailored approaches to specific problem structures. Our approach for efficiently learning the

trust region also has broad applicability in one-class constraint learning.

The nominal problem formulation is strengthened by embedding multiple models for a single

constraint rather than relying on a single learned function. This notion of robustness is particularly

important in the context of learning constraints: whereas misspecifications in learned objective

functions can lead to suboptimal outcomes, a mis-specified constraint can lead to infeasible solu-

tions. Finally, our software exposes the model ensemble construction and trust region enlargement

options directly through user-specified parameters. This allows an end user to directly evaluate

tradeoffs in objective value and constraint satisfaction, as the problem’s real-world context often

shapes the level of desired conservatism.

We recognize several opportunities to further extend this framework. Our work naturally relates

to the causal inference literature and individual treatment effect estimation (Athey and Imbens

2016, Shalit et al. 2017). These methods do not directly translate to our problem setting; existing

work generally assumes highly structured treatment alternatives (e.g., binary treatment vs. control)

or a single continuous treatment (e.g., dosing), whereas we allow more general decision structures. In

future work, we are interested in incorporating ideas from causal inference to relax the assumption

of no unobserved confounders.

Additionally, our framework is dependent on the quality of the underlying predictive models. We

constrain and optimize point predictions from our embedded models. This can be problematic in the

case of model misspecification, a known shortcoming of “predict-then-optimize” methods (Elmach-

toub and Grigas 2021). We mitigate this concern in two ways. First, our model selection procedure

allows us to obtain higher quality predictive models by capturing several possible functional rela-

tionships. Second, our model wrapper approach for embedding a single constraint with an ensemble

of models allows us to directly control our robustness to the predictions of individual learners.

In future work, there is an opportunity to incorporate ideas from robust optimization to directly

account for prediction uncertainty in individual model classes. While this has been addressed in

the linear case (Goldfarb and Iyengar 2003), it remains an open area of research in more general

ML methods.

In this work, we present a unified framework for optimization with learned constraints that

leverages both ML and MIO for data-driven decision making. Our work flexibly learns problem

constraints and objectives with supervised learning, and incorporates them into a larger optimiza-

tion problem of interest. We also learn the trust region, providing more credible recommendations

and improving predictive performance, and accomplish this efficiently using column generation and

unsupervised learning. The generality of our method allows us to tackle quite complex decision set-

tings, such as chemotherapy optimization, but also includes tailored approaches for more efficiently

solving specific problem types. Finally, we implement this as a Python software package (OptiCL)

to enable practitioner use. We envision that OptiCL’s methodology will be added to state-of-the-art

optimization modeling software packages.

Acknowledgments
The authors thank the anonymous reviewers and editorial team for their valuable feedback on this work. This

work was supported by the Dutch Scientific Council (NWO) grant OCENW.GROOT.2019.015, Optimization

for and with Machine Learning (OPTIMAL). Additionally, Holly Wiberg was supported by the National

Science Foundation Graduate Research Fellowship under Grant No. 174530. Any opinions, findings, and

conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily

reflect the views of the National Science Foundation.

References

Amos B, Xu L, Kolter JZ (2016) Input convex neural networks. URL http://arxiv.org/abs/1609.07152.

Anderson R, Huchette J, Ma W, Tjandraatmadja C, Vielma JP (2020) Strong mixed-integer programming formulations for trained neural networks. Mathematical Programming 183(1-2):3–39, ISSN 14364646, URL http://dx.doi.org/10.1007/s10107-020-01474-5.

Athey S, Imbens G (2016) Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113(27):7353–7360, ISSN 0027-8424, URL http://dx.doi.org/10.1073/pnas.1510489113.

Balestriero R, Pesenti J, LeCun Y (2021) Learning in high dimension always amounts to extrapolation. URL http://dx.doi.org/10.48550/ARXIV.2110.09485.

Bengio Y, Lodi A, Prouvost A (2021) Machine learning for combinatorial optimization: A methodological tour d’horizon. European Journal of Operational Research 290(2):405–421, ISSN 0377-2217, URL http://dx.doi.org/10.1016/j.ejor.2020.07.063.

Bergman D, Huang T, Brooks P, Lodi A, Raghunathan AU (2022) JANOS: an integrated predictive and prescriptive modeling framework. INFORMS Journal on Computing 34(2):807–816.

Bertsimas D, Borenstein A, Mingardi L, Nohadani O, Orfanoudaki A, Stellato B, Wiberg H, Sarin P, Varelmann DJ, Estrada V, Macaya C, Gil IJ (2021) Personalized prescription of ACEI/ARBs for hypertensive COVID-19 patients. Health Care Management Science 24(2):339–355, ISSN 15729389, URL http://dx.doi.org/10.1007/s10729-021-09545-5.

Bertsimas D, Dunn J (2017) Optimal classification trees. Machine Learning 106(7):1039–1082, ISSN 15730565, URL http://dx.doi.org/10.1007/s10994-017-5633-9.

Bertsimas D, Kallus N (2020) From predictive to prescriptive analytics. Management Science 66(3):1025–1044, ISSN 0025-1909, URL http://dx.doi.org/10.1287/mnsc.2018.3253.



Bertsimas D, O’Hair A, Relyea S, Silberholz J (2016) An analytics approach to designing combination chemotherapy regimens for cancer. Management Science 62(5):1511–1531, ISSN 15265501, URL http://dx.doi.org/10.1287/mnsc.2015.2363.

Bertsimas D, Dunn J (2018) Machine Learning under a Modern Optimization Lens (Belmont: Dynamic Ideas).

Biggs M, Hariss R, Perakis G (2021) Optimizing objective functions determined from random forests. SSRN Electronic Journal 1–46, ISSN 1556-5068, URL http://dx.doi.org/10.2139/ssrn.2986630.

Bonfietti A, Lombardi M, Milano M (2015) Embedding decision trees and random forests in constraint programming. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9075:74–90, ISSN 16113349, URL http://dx.doi.org/10.1007/978-3-319-18008-3_6.

Breiman L (2001) Random forests. Machine Learning 45(1):5–32, ISSN 08856125, URL http://dx.doi.org/10.1023/A:1010933404324.

Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees (Routledge), ISBN 978-0412048418, URL http://dx.doi.org/10.1201/9781315139470.

Cancer Therapy Evaluation Program (2006) Common terminology criteria for adverse events v3.0. URL https://ctep.cancer.gov/protocoldevelopment/electronic_applications/docs/ctcaev3.pdf.

Chen Y, Shi Y, Zhang B (2020) Input convex neural networks for optimal voltage regulation. URL http://arxiv.org/abs/2002.08684.

Cortes C, Vapnik V (1995) Support-vector networks. Machine Learning 20(3):273–297.

Cremer JL, Konstantelos I, Tindemans SH, Strbac G (2019) Data-driven power system operation: Exploring the balance between cost and risk. IEEE Transactions on Power Systems 34(1):791–801, ISSN 08858950, URL http://dx.doi.org/10.1109/TPWRS.2018.2867209.

Drucker H, Burges CJ, Kaufman L, Smola A, Vapnik V (1997) Support vector regression machines. Advances in Neural Information Processing Systems 1:155–161, ISSN 10495258.



Ebert T, Belz J, Nelles O (2014) Interpolation and extrapolation: Comparison of definitions and survey of algorithms for convex and concave hulls. 2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), 310–314, URL http://dx.doi.org/10.1109/CIDM.2014.7008683.

Elmachtoub AN, Grigas P (2021) Smart “Predict, then Optimize”. Management Science 1–46, ISSN 0025-1909, URL http://dx.doi.org/10.1287/mnsc.2020.3922.

Fajemisin A, Maragno D, den Hertog D (2021) Optimization with constraint learning: A framework and survey. URL https://arxiv.org/abs/2110.02121.

Dantzig GB, Wolfe P (1960) Decomposition principle for linear programs. Operations Research 8(1):101–111, URL http://dx.doi.org/10.1287/opre.8.1.101.

Goldfarb D, Iyengar G (2003) Robust portfolio selection problems. Mathematics of Operations Research 28(1):1–38, ISSN 0364765X, 15265471, URL http://www.jstor.org/stable/4126989.

Goodfellow IJ, Shlens J, Szegedy C (2015) Explaining and harnessing adversarial examples. CoRR abs/1412.6572.

Grimstad B, Andersson H (2019) ReLU networks as surrogate models in mixed-integer linear programs. Computers and Chemical Engineering 131:106580, ISSN 00981354, URL http://dx.doi.org/10.1016/j.compchemeng.2019.106580.

Gurobi Optimization, LLC (2021) Gurobi Optimizer Reference Manual. URL https://www.gurobi.com.

Gutierrez-Martinez VJ, Cañizares CA, Fuerte-Esquivel CR, Pizano-Martinez A, Gu X (2011) Neural-network security-boundary constrained optimal power flow. IEEE Transactions on Power Systems 26(1):63–72, ISSN 08858950, URL http://dx.doi.org/10.1109/TPWRS.2010.2050344.

Halilbasic L, Thams F, Venzke A, Chatzivasileiadis S, Pinson P (2018) Data-driven security-constrained AC-OPF for operations and markets. 20th Power Systems Computation Conference, PSCC 2018, URL http://dx.doi.org/10.23919/PSCC.2018.8442786.

Kleijnen JP (2015) Design and analysis of simulation experiments. International Workshop on Simulation, 3–22 (Springer).

Kudla P, Pawlak TP (2018) One-class synthesis of constraints for Mixed-Integer Linear Programming with C4.5 decision trees. Applied Soft Computing Journal 68:1–12, ISSN 15684946, URL http://dx.doi.org/10.1016/j.asoc.2018.03.025.

Lombardi M, Milano M, Bartolini A (2017) Empirical decision model learning. Artificial Intelligence 244:343–367.

Mišić VV (2020) Optimization of tree ensembles. Operations Research 68(5):1605–1624, ISSN 15265463, URL http://dx.doi.org/10.1287/opre.2019.1928.

MOSEK (2019) MOSEK Optimizer API for Python 9.3.7. URL https://docs.mosek.com/latest/pythonapi/index.html.

National Cancer Institute (2021) Treatment clinical trials for gastric (stomach) cancer. URL https://www.cancer.gov/about-cancer/treatment/clinical-trials/disease/stomach-cancer/treatment.

Pawlak TP (2019) Synthesis of mathematical programming models with one-class evolutionary strategies. Swarm and Evolutionary Computation 44:335–348, ISSN 2210-6502, URL http://dx.doi.org/10.1016/j.swevo.2018.04.007.

Pawlak TP, Krawiec K (2019) Synthesis of constraints for mathematical programming with one-class genetic programming. IEEE Transactions on Evolutionary Computation 23(1):117–129, URL http://dx.doi.org/10.1109/TEVC.2018.2835565.

Pawlak TP, Litwiniuk B (2021) Ellipsoidal one-class constraint acquisition for quadratically constrained programming. European Journal of Operational Research 293(1):36–49, ISSN 03772217, URL http://dx.doi.org/10.1016/j.ejor.2020.12.018.

Peters K, Silva S, Gonçalves R, Kavelj M, Fleuren H, den Hertog D, Ergun O, Freeman M (2021) The nutritious supply chain: Optimizing humanitarian food assistance. INFORMS Journal on Optimization 3(2):200–226.

Schweidtmann AM, Mitsos A (2019) Deterministic global optimization with artificial neural networks embedded. Journal of Optimization Theory and Applications 180(3):925–948, ISSN 15732878, URL http://dx.doi.org/10.1007/s10957-018-1396-0.

Shalit U, Johansson FD, Sontag D (2017) Estimating individual treatment effect: generalization bounds and algorithms. International Conference on Machine Learning, 3076–3085 (PMLR).

Skiena SS (2008) The Algorithm Design Manual (Springer Publishing Company, Incorporated), 2nd edition.

Spyros C (2020) From decision trees and neural networks to MILP: power system optimization considering dynamic stability constraints. 2020 European Control Conference (ECC), 594–594 (IEEE), ISBN 978-3-90714-402-2, URL http://dx.doi.org/10.23919/ECC51009.2020.9143834.

Sroka D, Pawlak TP (2018) One-class constraint acquisition with local search. GECCO 2018 - Proceedings of the 2018 Genetic and Evolutionary Computation Conference 363–370, URL http://dx.doi.org/10.1145/3205455.3205480.

Stoer J, Botkin ND (2005) Minimization of convex functions on the convex hull of a point set. Mathematical Methods of Operations Research 62(2):167–185, URL http://dx.doi.org/10.1007/s00186-005-0018-4.

Stoer J, Botkin ND, Pykhteev OA (2007) An interior-point method for minimizing convex functions on the convex hull of a point set. Optimization 56(4):515–524, URL http://dx.doi.org/10.1080/02331930701421111.

Thams F, Halilbaši L, Pinson P, Chatzivasileiadis S, Eriksson R (2017) Data-driven security-constrained OPF. Proc. 10th Bulk Power Syst. Dyn. Control Symp., 1–10, URL http://irep2017.inesctec.pt/conference-papers/conference-papers/paper65r7z1aplj.pdf.

UNHCR, UNICEF, WFP, WHO (2002) Food and nutrition needs in emergencies. URL https://www.who.int/nutrition/publications/emergencies/a83743/en/.

Venzke A, Viola DT, Mermet-Guyennet J, Misyris GS, Chatzivasileiadis S (2020) Neural networks for encoding dynamic security-constrained optimal power flow to mixed-integer linear programs. URL http://arxiv.org/abs/2003.07939.

Verwer S, Zhang Y, Ye QC (2017) Auction optimization using regression trees and linear models as integer programs. Artificial Intelligence 244:368–395, ISSN 00043702, URL http://dx.doi.org/10.1016/j.artint.2015.05.004.

Wolfe P (1961) A duality theorem for non-linear programming. Quarterly of Applied Mathematics 19(3):239–244.

Yang D, Hendifar A, Lenz C, Togawa K, Lenz F, Lurje G, Pohl A, Winder T, Ning Y, Groshen S, Lenz HJ (2011) Survival of metastatic gastric cancer: Significance of age, sex and race/ethnicity. Journal of Gastrointestinal Oncology 2(2):77–84, ISSN 2219-679X, URL http://dx.doi.org/10.3978/j.issn.2078-6891.2010.025.

Appendix A: Machine learning model embedding


A.1. Linear models
Linear Regression. Linear regression (LR) is a natural choice of predictive function given its
inherent linearity and ease of embedding. A regression model can be trained to predict the outcome
of interest, y, as a function of x and w. The algorithm can optionally use regularization; the
embedding only requires the final coefficient vectors βx ∈ Rn and βw ∈ Rp (and intercept term β0 )
to describe the model. The model can then be embedded as

\[ y = \beta_0 + \beta_x^\top x + \beta_w^\top w. \]

Support Vector Machines. A support vector machine (SVM) uses a hyper-plane split to generate
predictions, both for classification (Cortes and Vapnik 1995) and regression (Drucker et al. 1997).
We consider the case of linear SVMs, since this allows us to obtain the prediction as a linear
function of the decision variables x. In linear support vector regression (SVR), which we use for
function learning, we fit a linear function to the data. The setting is similar to linear regression,
but the loss function only penalizes residuals greater than an ϵ threshold (Drucker et al. 1997). As
with linear regression, the trained model returns a linear function with coefficients βx , βw , and β0 .
The final prediction is
\[ y = \beta_0 + \beta_x^\top x + \beta_w^\top w. \]

For the classification setting, linear support vector classification (SVC) identifies a hyper-plane
that best separates positive and negative samples (Cortes and Vapnik 1995). A trained SVC model
similarly returns coefficients βx , βw , and β0 , where a sample’s prediction is given by
\[ y = \begin{cases} 1, & \text{if } \beta_0 + \beta_x^\top x + \beta_w^\top w \ge 0; \\ 0, & \text{otherwise.} \end{cases} \]
In SVC, the output variable y is binary rather than a probability. In this case, the constraint can
simply be embedded as $\beta_0 + \beta_x^\top x + \beta_w^\top w \ge 0$.
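Since $w$ is fixed when the optimization model is solved, its contribution to a linear model is a constant. The sketch below shows this for an upper-bound constraint $y \le \tau$ on a linear regression or SVR prediction; for SVC the same idea applies with the inequality reversed and a right-hand side of zero. The function names and coefficient values are hypothetical.

```python
def linear_constraint_in_x(beta0, beta_x, beta_w, w, tau):
    """Return (coeffs, rhs) such that coeffs . x <= rhs encodes y <= tau.

    With y = beta0 + beta_x . x + beta_w . w and w fixed, the context term
    is a constant and moves to the right-hand side.
    """
    context_term = sum(bw * wi for bw, wi in zip(beta_w, w))
    return list(beta_x), tau - beta0 - context_term


def is_feasible(coeffs, rhs, x, tol=1e-9):
    """Check a candidate decision x against the embedded linear constraint."""
    return sum(c * xi for c, xi in zip(coeffs, x)) <= rhs + tol


# Hypothetical coefficients of a trained linear toxicity model.
coeffs, rhs = linear_constraint_in_x(
    beta0=0.1, beta_x=[0.2, 0.3], beta_w=[0.05], w=[2.0], tau=0.6
)
```

The resulting pair `(coeffs, rhs)` is an ordinary linear constraint in `x`, so no auxiliary variables are needed for this model class.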

A.2. Decision trees


Consider the leaves in Figure 2. An observation will be assigned to the leftmost leaf (node 3) if
$A_1^\top x \le b_1$ and $A_2^\top x \le b_2$. An observation would be assigned to node 4 if
$A_1^\top x \le b_1$ and $A_2^\top x > b_2$, or equivalently, $-A_2^\top x < -b_2$. Furthermore, we
can remove the strict inequalities using a sufficiently small $\epsilon$ parameter, so that
$-A_2^\top x \le -b_2 - \epsilon$. We can then encode the leaf assignment of observation $x$ through
the following constraints:

\begin{align}
A_1^\top x - M(1 - l_3) &\le b_1, \tag{7a} \\
A_2^\top x - M(1 - l_3) &\le b_2, \tag{7b} \\
A_1^\top x - M(1 - l_4) &\le b_1, \tag{7c} \\
-A_2^\top x - M(1 - l_4) &\le -b_2 - \epsilon, \tag{7d} \\
-A_1^\top x - M(1 - l_6) &\le -b_1 - \epsilon, \tag{7e} \\
A_5^\top x - M(1 - l_6) &\le b_5, \tag{7f} \\
-A_1^\top x - M(1 - l_7) &\le -b_1 - \epsilon, \tag{7g} \\
-A_5^\top x - M(1 - l_7) &\le -b_5 - \epsilon, \tag{7h} \\
l_3 + l_4 + l_6 + l_7 &= 1, \tag{7i} \\
y - (p_3 l_3 + p_4 l_4 + p_6 l_6 + p_7 l_7) &= 0, \tag{7j}
\end{align}

where $l_3, l_4, l_6, l_7$ are binary variables associated with the corresponding leaves. For a given
$x$, if $A_1^\top x \le b_1$, constraints (7e) and (7g) will force $l_6$ and $l_7$ to zero,
respectively. If $A_2^\top x \le b_2$, constraint (7d) will force $l_4$ to 0. The assignment
constraint (7i) will then force $l_3 = 1$, assigning the observation to leaf 3 as desired. Finally,
constraint (7j) sets $y$ to the prediction of the assigned leaf ($p_3$). We can then constrain the
value of $y$ using our desired upper bound of $\tau$ (or lower bound, without loss of generality).
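The leaf-assignment logic can be checked by brute force on a small instance. The sketch below uses a hypothetical depth-two tree with the same node numbering (root 1, internal nodes 2 and 5, leaves 3, 4, 6, 7); the split coefficients, the value of M, and the test points are made up for illustration. It enumerates all binary leaf vectors and keeps those satisfying constraints (7a)-(7i).

```python
from itertools import product

# Hypothetical split parameters; x is assumed to lie in [0, 1]^2.
A1, b1 = [1.0, 0.0], 0.5
A2, b2 = [0.0, 1.0], 0.3
A5, b5 = [0.0, 1.0], 0.7
M, EPS = 10.0, 1e-6  # M is loose but valid on this domain


def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))


def feasible_leaves(x):
    """Enumerate binary (l3, l4, l6, l7) satisfying constraints (7a)-(7i)."""
    out = []
    for l3, l4, l6, l7 in product((0, 1), repeat=4):
        ok = (
            dot(A1, x) - M * (1 - l3) <= b1                # (7a)
            and dot(A2, x) - M * (1 - l3) <= b2            # (7b)
            and dot(A1, x) - M * (1 - l4) <= b1            # (7c)
            and -dot(A2, x) - M * (1 - l4) <= -b2 - EPS    # (7d)
            and -dot(A1, x) - M * (1 - l6) <= -b1 - EPS    # (7e)
            and dot(A5, x) - M * (1 - l6) <= b5            # (7f)
            and -dot(A1, x) - M * (1 - l7) <= -b1 - EPS    # (7g)
            and -dot(A5, x) - M * (1 - l7) <= -b5 - EPS    # (7h)
            and l3 + l4 + l6 + l7 == 1                     # (7i)
        )
        if ok:
            out.append({3: l3, 4: l4, 6: l6, 7: l7})
    return out


def routed_leaf(x):
    """Leaf reached by following the splits directly."""
    if dot(A1, x) <= b1:
        return 3 if dot(A2, x) <= b2 else 4
    return 6 if dot(A5, x) <= b5 else 7


for x in ([0.2, 0.1], [0.2, 0.9], [0.8, 0.4], [0.8, 0.9]):
    (assignment,) = feasible_leaves(x)  # exactly one feasible assignment
    assert assignment[routed_leaf(x)] == 1
```

For each test point exactly one leaf vector is feasible, and it selects the leaf that the tree itself would route the point to.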
More generally, consider a decision tree $\hat{h}(x, w)$ with a set of leaf nodes $\mathcal{L}$, each
described by a binary variable $l_i$ and a prediction score $p_i$. Splits take the form
$(A_x)^\top x + (A_w)^\top w \le b$, where $A_x$ gives the coefficients for the optimization variables
$x$ and $A_w$ gives the coefficients for the non-optimization (fixed) variables $w$. Let
$\mathcal{S}^i$ be the set of nodes that define the splits that observations in leaf $i$ must obey.
Without loss of generality, we can write these all as
$(\bar{A}_x)_j^\top x + (\bar{A}_w)_j^\top w - M(1 - l_i) \le \bar{b}_j$, where $\bar{A}$ is $A$ if
leaf $i$ follows the left split of $j$ and $-A$ otherwise. Similarly, $\bar{b}$ equals $b$ if the leaf
falls to the left split, and $-b - \epsilon$ otherwise, as established above. This decision tree can
then be embedded through the following constraints:

\begin{align}
(\bar{A}_x)_j^\top x + (\bar{A}_w)_j^\top w - M(1 - l_i) &\le \bar{b}_j, \quad i \in \mathcal{L},\ j \in \mathcal{S}^i, \tag{8a} \\
\sum_{i \in \mathcal{L}} l_i &= 1, \tag{8b} \\
y - \sum_{i \in \mathcal{L}} p_i l_i &= 0. \tag{8c}
\end{align}

Here, $M$ can be selected for each split by considering the maximum difference between
$(\bar{A}_x)_j^\top x + (\bar{A}_w)_j^\top w$ and $\bar{b}_j$. A prescription solution $x$ for a
patient with features $w$ must obey the constraints determined by its split path, i.e., only the
splits that lead to its assigned leaf $i$. If $l_i = 1$, constraint (8a) will enforce that the
solution obeys all split constraints leading to leaf $i$. If $l_i = 0$, no constraints related to
leaf $i$ should be applied: constraint (8a) will be nonbinding at node $j$ if
$M \ge (\bar{A}_x)_j^\top x + (\bar{A}_w)_j^\top w - \bar{b}_j$. Thus we can find the minimum
necessary value of $M$ by maximizing

these expressions over all possible values of x (for the patient’s fixed w). For a given patient with
features w for whom we wish to optimize treatment, EM(w) is the solution of

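\begin{align}
\max_{x} \quad & (\bar{A}_x)_j^\top x + (\bar{A}_w)_j^\top w - \bar{b}_j \tag{9a} \\
\text{s.t.} \quad & g(x, w) \le 0, \tag{9b} \\
& x \in \mathcal{X}(w). \tag{9c}
\end{align}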

Note that the non-learned constraints on x, namely constraint (9b), and the trust region constraint
(9c) allow us to reduce the search space when determining M .
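When solving problem (9) for every split is too costly, a cheaper valid value can be obtained by relaxing the feasible set to simple bounds on $x$. The sketch below does this; it is not the procedure described above but a relaxation of it, so the resulting value is valid and possibly looser. The function name, the box bounds, and the split coefficients are assumptions for illustration.

```python
def big_m_over_box(a_x, a_w, w, b_bar, lower, upper):
    """Largest value of a_x . x + a_w . w - b_bar over lower <= x <= upper.

    Each coordinate is pushed to the bound that increases the expression,
    so the maximum of this linear function over a box has a closed form.
    """
    best_x_term = sum(
        a * (u if a >= 0 else l) for a, l, u in zip(a_x, lower, upper)
    )
    context_term = sum(a * wi for a, wi in zip(a_w, w))
    # A negative maximum means the split can never be violated; M = 0 suffices.
    return max(0.0, best_x_term + context_term - b_bar)


# Hypothetical split 2*x1 - x2 + 0.5*w1 <= 1 with x in [0, 1]^2 and w1 = 2.
M_j = big_m_over_box(a_x=[2.0, -1.0], a_w=[0.5], w=[2.0], b_bar=1.0,
                     lower=[0.0, 0.0], upper=[1.0, 1.0])
```

Adding the constraints (9b) and (9c) back can only shrink the feasible set, so the value from problem (9) is never larger than this bound.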
MIO vs. LO formulation for decision trees. In Section 2, we proposed two ways of embedding a
decision tree as a constraint. The first uses an LO to represent each feasible leaf node in the tree,
while the second directly uses the entire MIO representation of the tree as a constraint. To compare
the performance of these two approaches, we learn the palatability constraint using decision trees
(CART) grown to have various numbers of leaves, and solve the optimization model with both
approaches.

Figure EC.1 Comparison of MIO and multiple LO approach to tree representation, as a function of the number

of leaves.

When comparing the solution times (averaged over 10 runs), Figure EC.1 shows that the MIO
approach is relatively consistent in terms of solution time regardless of the number of leaves. With
the LO approach however, as the number of leaves grows, the number of LOs to be solved also
grows. While the solution time of a single LO is very low, solving many LOs sequentially can be
time consuming. One way to speed up the process is to solve the LOs in parallel. When
only one LO needs to be solved, it takes 1.8 seconds in this problem setting. By parallelizing the
solution of the LOs, the total solution time can be expected to take only as long as it takes for the
slowest LO to be solved.

A.3. Multi-layer perceptrons


MLPs consist of an input layer, $L - 2$ hidden layer(s), and an output layer. In a given hidden layer
$l$ of the network, with nodes $N^l$, the value of a node $i \in N^l$, denoted as $v_i^l$, is
calculated using the weighted sum of the previous layer’s node values, followed by the ReLU
activation function, $\mathrm{ReLU}(x) = \max\{0, x\}$. The value is given as

\[ v_i^l = \max\Big\{ 0,\ \beta_{i0}^l + \sum_{j \in N^{l-1}} \beta_{ij}^l v_j^{l-1} \Big\}, \]

where $\beta_i^l$ is the coefficient vector for node $i$ in layer $l$.


The ReLU operator can be encoded using linear constraints:

v ≥ x, (10a)

v ≤ x − ML (1 − z), (10b)

v ≤ MU z, (10c)

v ≥ 0, (10d)

z ∈ {0, 1} , (10e)

where ML < 0 is a lower bound on all possible values of x, and MU > 0 is an upper bound. While
this embedding relies on a big-M formulation, it can be improved in multiple ways. The model
can be tightened by careful selection of ML and MU . Furthermore, Anderson et al. (2020) recently
proposed an additional iterative cut generation procedure to improve the strength of the basic
big-M formulation.
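
As a sanity check on constraints (10a)-(10e), the following sketch enumerates z ∈ {0, 1} for a fixed pre-activation x and computes the interval of values v that the constraints allow. The function name and the numerical tolerance are illustrative choices, and the bounds are assumed to satisfy ML ≤ x ≤ MU.

```python
def relu_bigm_feasible_v(x, m_lower, m_upper, tol=1e-9):
    """Return (z, v_min, v_max) for each z in {0, 1} that admits a feasible v."""
    feasible = []
    for z in (0, 1):
        v_min = max(x, 0.0)                                # (10a) and (10d)
        v_max = min(x - m_lower * (1 - z), m_upper * z)    # (10b) and (10c)
        if v_min <= v_max + tol:
            feasible.append((z, v_min, v_max))
    return feasible


for x in (-2.0, 0.0, 3.0):
    print(x, relu_bigm_feasible_v(x, m_lower=-5.0, m_upper=5.0))
```

For every x within the bounds, the only feasible value is v = max{0, x}: z = 0 is feasible only when x ≤ 0, and z = 1 only when x ≥ 0. Tighter values of ML and MU shrink the linear relaxation without changing this set.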
The constraints for an MLP network can be generated recursively starting from the input layer,
with a set of ReLU constraints for each node in each internal layer, l ∈ {2, . . . , L − 1}. This allows
us to embed a trained MLP with an arbitrary number of hidden layers and nodes into an MIO.
Regression. In a regression setting, the output layer L consists of a single node that is a linear
combination of the node values in layer L − 1, so it can be encoded directly as

$$ y = v^L = \beta_0^L + \sum_{j \in N^{L-1}} \beta_j^L v_j^{L-1}. $$

Binary Classification. In the binary classification setting, the output layer requires one neuron
with a sigmoid activation function, $S(x) = \frac{1}{1+e^{-x}}$. The value is given as

$$ v^L = \frac{1}{1 + e^{-(\beta_0^L + \beta^{L\top} v^{L-1})}}, $$

with $v^L \in (0, 1)$. This function is nonlinear, and thus, cannot be directly embedded into our formulation.
However, if τ is our desired probability lower bound, it will be satisfied when
$\beta_0^L + \beta^{L\top} v^{L-1} \ge \ln\left(\frac{\tau}{1-\tau}\right)$. Therefore, the neural network's output, binarized with a threshold of τ, is given by

$$ y = \begin{cases} 1, & \text{if } \beta_0^L + \beta^{L\top} v^{L-1} \ge \ln\left(\frac{\tau}{1-\tau}\right); \\ 0, & \text{otherwise.} \end{cases} $$
For example, at a threshold of τ = 0.5, the predicted value is 1 when $\beta_0^L + \beta^{L\top} v^{L-1} \ge 0$. Here, τ
can be chosen according to the minimum necessary probability to predict 1. As for the SVC case,
y is binary and the constraint can be embedded as y ≥ 1. The case of neural networks trained for
multi-class classification is treated next.
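
The threshold on the linear score follows from inverting the sigmoid: S(x) ≥ τ holds exactly when x ≥ ln(τ/(1 − τ)). A minimal sketch of this conversion is given below; the function names are illustrative and not part of any library.

```python
import math


def logit_threshold(tau):
    """Smallest linear score whose sigmoid is at least tau, for 0 < tau < 1."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    return math.log(tau / (1.0 - tau))


def binarized_output(score, tau):
    """Binary prediction y for a linear output score at probability threshold tau."""
    return 1 if score >= logit_threshold(tau) else 0


print(logit_threshold(0.5))        # threshold on the score for tau = 0.5
print(binarized_output(1.0, 0.8))  # sigmoid(1.0) is below 0.8
print(binarized_output(1.5, 0.8))  # sigmoid(1.5) is above 0.8
```

Because the threshold is a constant once τ is fixed, the condition on the score is linear in the last hidden layer's values and can be embedded directly.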
Multi-class classification. In multi-class classification, the outputs are traditionally obtained by
applying a softmax activation function, $S(x)_i = e^{x_i} / \left(\sum_{k=1}^{K} e^{x_k}\right)$, to the final layer. This function
ensures that the outputs sum to one and can thus be interpreted as probabilities. In particular,
suppose we have a K-class classification problem. Each node in the final layer has an associated
weight vector $\beta_i$, which maps the nodes of layer L − 1 to the output layer by $\beta_i^\top v^{L-1}$. The softmax
function rescales these values, so that class i will be assigned probability

$$ v_i^L = \frac{e^{\beta_i^\top v^{L-1}}}{\sum_{k=1}^{K} e^{\beta_k^\top v^{L-1}}}. $$

We cannot apply the softmax function directly in an MIO framework with linear constraints.
Instead, we use an argmax function to directly return an indicator of the highest probability class,
similar to the approach with SVC and binary classification MLP. In other words, the output y is
a unit vector with $y_i = 1$ for the most likely class. Class i has the highest probability if and
only if

$$ \beta_{i0}^L + \beta_i^{L\top} v^{L-1} \ge \beta_{k0}^L + \beta_k^{L\top} v^{L-1}, \quad k = 1, \dots, K. $$

We can constrain this with a big-M constraint as follows:

$$ \beta_{i0}^L + \beta_i^{L\top} v^{L-1} \ge \beta_{k0}^L + \beta_k^{L\top} v^{L-1} - M(1 - y_i), \quad k = 1, \dots, K, \tag{11a} $$

$$ \sum_{k=1}^{K} y_k = 1. \tag{11b} $$

Constraint (11a) forces yi = 0, if the constraint is not satisfied for some k ∈ {1, . . . , K }. Con-
straint (11b) ensures that yi = 1 for the highest likelihood class. We can then constrain the predic-
tion to fall in our desired class i by enforcing yi = 1.
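
The behaviour of (11a)-(11b) can be checked by brute force on a small example. The sketch below assumes that (11a) is imposed for every class i, that y is binary, and that M is at least as large as the largest gap between two class scores; the function name is illustrative.

```python
def feasible_one_hot_classes(scores, big_m):
    """Classes i for which the unit vector y = e_i satisfies (11a) and (11b)."""
    n_classes = len(scores)
    feasible = []
    for i in range(n_classes):
        y = [1 if j == i else 0 for j in range(n_classes)]  # satisfies (11b)
        ok = all(
            scores[j] >= scores[k] - big_m * (1 - y[j])     # (11a) for class j
            for j in range(n_classes)
            for k in range(n_classes)
        )
        if ok:
            feasible.append(i)
    return feasible


print(feasible_one_hot_classes([0.2, 1.5, -0.3], big_m=10.0))
```

Only the class with the highest score survives; when two classes tie, both unit vectors are feasible and the solver may pick either.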

A.4. Model-wrapper approach


As discussed in Section 3.1, we can embed a set of models, rather than a single model, to improve
the robustness of constraint satisfaction. This ensemble of P estimators can be obtained through
multiple approaches, such as through bootstrapped estimators within a single model class (e.g., P
linear models or P decision trees) or by combining estimators across a range of model types (e.g.,
one linear model, one decision tree, and so on). Given the set of P estimators, we then constrain
that at most α proportion of the estimators violate the desired constraint. The α parameter then
allows us to control the degree of conservativeness of our solution, with higher α values resulting
in more permissive solutions and lower α resulting in more stringent constraint requirements. In
order to constrain the violation proportion, we need indicator variables to indicate whether each
of the P estimators satisfies the constraint. We use a big-M formulation to obtain these indicators
zi ∀i = 1, . . . , P , as outlined in Section 3.1. However, this does increase the complexity of the master
problem through the introduction of additional binary variables. There are two special cases of the
violation limit that circumvent the need for a big-M formulation:
• No allowable violation: We can enforce a violation limit of α = 0%, effectively the most
conservative “worst case violation” approach.
• Average constraint: Rather than constraining a certain proportion of estimators to obey
the constraint, we can enforce that the average prediction of all estimators obeys the constraint.
This avoids the need for tracking individual constraint satisfaction for each estimator.
In general, we note that the embedded models can be highly nonconvex on their own (e.g., if using
an ensemble model as the base estimator, such as Random Forests). Thus, the additional constraints
to identify and constrain violating models in this model-wrapper approach are not the primary
complexity drivers in the master problem; rather, complexity is driven by the individual estimators.
The experiments in Appendix C.3 further investigate this latter issue: we explore runtime as the
number of estimators (P ) increases, the incremental benefit of increasing the number of estimators,
and the impact of early stopping conditions.
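
The two ways of aggregating the P estimators can be sketched for a fixed candidate solution. The sketch assumes a constraint of the form "prediction ≥ τ" (as in the palatability example) and simply evaluates it on given predictions; it does not reproduce the big-M formulation of Section 3.1, and the function names are illustrative.

```python
def satisfies_violation_limit(predictions, tau, alpha):
    """True if at most a proportion alpha of the estimators predicts below tau."""
    violations = sum(1 for p in predictions if p < tau)
    return violations <= alpha * len(predictions)


def satisfies_average_constraint(predictions, tau):
    """True if the average prediction over all estimators is at least tau."""
    return sum(predictions) / len(predictions) >= tau


preds = [0.52, 0.55, 0.48, 0.60]  # hypothetical predictions of P = 4 estimators
print(satisfies_violation_limit(preds, tau=0.5, alpha=0.25))
print(satisfies_violation_limit(preds, tau=0.5, alpha=0.0))
print(satisfies_average_constraint(preds, tau=0.5))
```

With α = 0 every estimator must satisfy the constraint, which is why no indicator variables are needed in that case; the average constraint likewise needs only one aggregate inequality.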

Appendix B: Trust region

As we explain in Section 2.3, the trust region prevents the predictive models from extrapolating. It
is defined as the convex hull of the set $Z = \{(\bar{x}_i, \bar{w}_i)\}_{i=1}^{N}$, with $\bar{x}_i \in \mathbb{R}^n$ observed treatment decisions,
and $\bar{w}_i \in \mathbb{R}^p$ contextual information. In Section B.1, we explain the importance of using both x̄
and w̄ in the formulation of the convex hull. When the number of samples (N ) is too large, the
trust region constraints of the optimization model may become computationally expensive. In this case,
we propose a column selection algorithm which is detailed in Section B.2.

Figure EC.2 Effect of w̄ on the trust region. (a) Solutions lie within trust region. (b) Solutions may lie outside the trust region.
[Two panels, each plotting w against x.]

B.1. Defining the convex hull


We characterize the feasible decision space using the convex hull of our observed data. In general,
we recommend defining the feasible region with respect to both x̄ and w̄. This ensures that our
prescriptions are reasonable with respect to the contextual variables as well. Note that for different
values of w, the convex hull in the x space may be different. In Figure EC.2, the shaded region
represents the convex hull of Z formed by the dataset (blue dots), and the red line represents the
set of trusted solutions when w is fixed to a certain value. In Figure EC.2a, we see that the set
of trusted solutions (red line) lies within CH(Z ) when we include w̄. If we leave out w̄ in the
definition of the trust region, then we end up with the undesired situation shown in Figure EC.2b,
where the solution may lie outside of CH(Z ). We observe that in some cases we must define the
convex hull with a subset of variables. This is true in cases where the convex hull constraint leads
to excessive data thinning, in which case it may be necessary to define the convex hull on treatment
variables only.
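
The difference between the two panels of Figure EC.2 can be reproduced numerically for a toy dataset with one decision x and one contextual variable w. The sketch below uses the fact that, in two dimensions, the slice of the convex hull at a fixed w is the interval spanned by the points where segments between pairs of data points cross that level. The data and the function name are made up for illustration.

```python
from itertools import combinations_with_replacement


def trusted_x_interval(points, w0):
    """Interval of x such that (x, w0) lies in the convex hull of 2-D points (x, w)."""
    xs = []
    for (xa, wa), (xb, wb) in combinations_with_replacement(points, 2):
        if not min(wa, wb) <= w0 <= max(wa, wb):
            continue  # the segment between the two points does not reach level w0
        if wa == wb:
            xs.extend([xa, xb])
        else:
            t = (w0 - wa) / (wb - wa)
            xs.append(xa + t * (xb - xa))
    return (min(xs), max(xs)) if xs else None


data = [(1.0, 1.0), (3.0, 1.0), (2.0, 3.0)]   # hypothetical observations (x, w)
print(trusted_x_interval(data, w0=2.0))        # trust region that accounts for w
print((min(x for x, _ in data), max(x for x, _ in data)))  # hull of x alone
```

For this dataset the hull of x alone admits every x between 1 and 3, while conditioning on w = 2 narrows the trusted decisions to a smaller interval, which is the situation contrasted in Figure EC.2.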

B.2. Column selection


In this section, we propose a column selection method to deal with a huge set of data points. When
objectives and constraints are linear, our method reduces to the Dantzig-Wolfe decomposition
method (George B. Dantzig 1960). However, in our case, we do not have to solve the dual problem,
since we can just enumerate all the data points. In case the functions f and g in formulation (1) are
nonlinear and convex, the Dantzig-Wolfe method cannot be used. The key point in our approach
is the choice of the dual problem. Although the use of Fenchel duality seems a logical way to deal
with our problem, it appears that Wolfe duality, which in general leads to nonconvex formulations,
is exactly what we need. Stoer and Botkin (2005) and Stoer et al. (2007) describe another method
to optimize over the convex hull of a huge set of points. However, the method proposed
in these papers is only suitable for problems that have only the convex hull constraint and no
additional constraints.
Let $P_I$ be a convex and continuously differentiable model consisting of an objective function and
constraints that may be known a priori as well as learned from data. Like in Section 2.3, we denote
the index set of samples by $I$. As part of the constraints, the trust region is defined on the entire
set Z. We start with the matrix $Z \in \mathbb{R}^{N \times (n+p)}$, where each row corresponds to a given data point
in Z. Then, model $P_I$ is given as

\begin{align}
\min_{\lambda} \quad & f(Z^\top \lambda) \tag{12a} \\
\text{s.t.} \quad & g_j(Z^\top \lambda) \le 0, \quad j = 1, \dots, m, && \perp \mu, \tag{12b} \\
& \sum_{i \in I} \lambda_i = 1, && \perp \rho, \tag{12c} \\
& \lambda_i \ge 0, \quad i \in I, && \perp \upsilon, \tag{12d}
\end{align}

where the decision variable x is replaced by $Z^\top \lambda$. Constraints (12b) include both known and learned
constraints, while constraints (12c) and (12d) are used for the trust region. The dual variables
associated with constraints (12b), (12c), and (12d) are $\mu \in \mathbb{R}^m$, $\rho \in \mathbb{R}$, and $\upsilon \in \mathbb{R}^N$, respectively.
Note that for readability, we omit the contextual variables (w) without loss of generality.
When we deal with huge datasets, solving $P_I$ may be computationally expensive. Therefore, we
propose an iterative column selection algorithm (Algorithm 1) that can be used to speed up the
optimization while still obtaining a global optimum.
The algorithm starts by initializing $I' \subseteq I$ with an arbitrarily small subset of samples $I^0$ and
iteratively solves the restricted master problem $P_{I'}$ and the WolfeDual function. By solving $P_{I'}$,
we get the primal and dual optimal solutions $\lambda^*$ and $(\mu^*, \rho^*, \upsilon^*)$, respectively. The primal and dual
optimal solutions, together with $I$ and $I'$, are given as input to WolfeDual which returns a set
of samples $\bar{I} \subseteq I \setminus I'$ with negative reduced cost. If $\bar{I}$ is not empty it is added to $I'$ and a new
iteration starts, otherwise the algorithm stops, and $\lambda^*$ (with the corresponding $x^*$) is returned as
the global optimum of $P_I$. A visual interpretation of Algorithm 1 is shown in Figure 4.
In function WolfeDual, samples $\bar{I}$ are selected using the Karush–Kuhn–Tucker (KKT) stationarity
condition, which corresponds to the equality constraint in the Wolfe dual formulation of
$P_I$ (Wolfe 1961). The KKT stationarity condition of $P_{I'}$ is

$$ \nabla_\lambda f(\tilde{Z}^\top \lambda^*) + \sum_{i=1}^{m} \mu_i^* \nabla_\lambda g_i(\tilde{Z}^\top \lambda^*) - e\rho^* - \upsilon^* = 0, \tag{13} $$

Algorithm 1 Column Selection

Input: I ▷ Index set of columns of Z⊤
Output: λ∗ ▷ Optimal solution

I′ ← I0 ▷ Initial column pool
while TRUE do
    λ∗ , (µ∗ , ρ∗ , υ∗ ) ← PI′
    Ī ← WolfeDual(λ∗ , (µ∗ , ρ∗ , υ∗ ), I′ , I) ▷ Column(s) selection
    if Ī ≠ ∅ then
        I′ ← I′ ∪ Ī
    else
        Break
    end if
end while

where $\tilde{Z}$ is the matrix constructed with samples in $I'$, and e is an $N'$-dimensional vector of ones
with $N' = |I'|$. Equation (13) can be rewritten as

$$ \tilde{Z} \nabla_x f(\tilde{Z}^\top \lambda^*) + \sum_{i=1}^{m} \mu_i^* \tilde{Z} \nabla_x g_i(\tilde{Z}^\top \lambda^*) - e\rho^* - \upsilon^* = 0. \tag{14} $$

Equation (14) is used to evaluate the reduced cost related to each sample $\bar{z} \in Z$ which is not
in matrix $\tilde{Z}$. Consider a new sample $\bar{z}$ in (14), with its associated $\lambda_{N'+1}$ set equal to zero.
$(\lambda_1^*, \dots, \lambda_{N'}^*, \lambda_{N'+1})$ is still a feasible solution of the restricted master problem $P_{I'}$, since it does not
affect the value of x. As a consequence, µ and ρ will not change their value, nor will f and g. The
only unknown variable is $\upsilon_{N'+1}$, namely the reduced cost of $\bar{z}$. However, we can write it as

$$ \begin{bmatrix} \upsilon^* \\ \upsilon_{N'+1} \end{bmatrix} = \begin{bmatrix} \tilde{Z} \\ \bar{z}^\top \end{bmatrix} \nabla_x f(\tilde{Z}^\top \lambda^*) + \sum_{i=1}^{m} \mu_i^* \begin{bmatrix} \tilde{Z} \\ \bar{z}^\top \end{bmatrix} \nabla_x g_i(\tilde{Z}^\top \lambda^*) - e\rho^*. \tag{15} $$

If $\upsilon_{N'+1}$ is negative it means that we may improve the incumbent solution of $P_{I'}$ by including the
sample $\bar{z}$ in $\tilde{Z}$.
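
The loop of Algorithm 1 and the reduced-cost test can be traced in a deliberately simple special case: a linear objective f(x) = c⊤x and no constraints other than the trust region. This is an assumption made for illustration, not the general convex setting. In this case the restricted master problem is solved by inspection (all weight goes to the cheapest point in the pool), the dual variable ρ∗ equals the optimal value, and the reduced cost of an outside sample z̄ is z̄⊤c − ρ∗. All names below are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def column_selection_linear(points, c, initial_pool, tol=1e-9):
    """Column selection for min c^T x over the convex hull of `points` (linear case only)."""
    pool = list(initial_pool)
    while True:
        # Restricted master problem: put all weight on the cheapest point in the pool.
        best = min(pool, key=lambda i: dot(points[i], c))
        rho = dot(points[best], c)  # dual of the convexity constraint
        outside = [i for i in range(len(points)) if i not in pool]
        if not outside:
            return best, pool
        # Reduced cost of an outside sample: z^T c - rho.
        entering = min(outside, key=lambda i: dot(points[i], c) - rho)
        if dot(points[entering], c) - rho >= -tol:
            return best, pool  # no sample with negative reduced cost remains
        pool.append(entering)


pts = [(0, 0), (1, 0), (0, 1), (-1, -1), (2, 2)]
print(column_selection_linear(pts, c=(1, 1), initial_pool=[0, 1]))
```

Starting from two samples, the loop adds only the sample with the most negative reduced cost and then stops, without ever loading the remaining points into the master problem. In the general convex case the master problem requires a solver and the gradients of f and g enter the reduced cost as in (15).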

Lemma 1. After solving the convex and continuously differentiable problem $P_{I'}$, the sample in
$I \setminus I'$ with the most negative reduced cost is a vertex of the convex hull CH(Z).

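Proof From equation (15) we have

$$ \upsilon_{N'+1} = \bar{z}^\top \nabla_x f(\tilde{Z}^\top \lambda^*) + \bar{z}^\top \nabla_x g(\tilde{Z}^\top \lambda^*) \mu^* - \rho^*. \tag{16} $$

The problem of finding $\bar{z}$, such that its reduced cost is the most negative one, can be written as
a linear program where equation (16) is being minimized, and a solution must lie within CH(Z).
That is,

\begin{align}
\min_{z, \lambda} \quad & z^\top \nabla_x f(\tilde{Z}^\top \lambda^*) + z^\top \nabla_x g(\tilde{Z}^\top \lambda^*) \mu^* - \rho^* \notag \\
\text{s.t.} \quad & Z^\top \lambda = z, \notag \\
& \sum_{j \in I} \lambda_j = 1, \tag{17} \\
& \lambda_j \ge 0, \quad j \in I, \notag
\end{align}

where z and λ are the decision variables, and $\mu^*$, $\lambda^*$, $\rho^*$ are fixed parameters. Since the objective
function is linear with respect to z, the optimal solution of (17) will necessarily be a vertex of
CH(Z). □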
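To illustrate the benefits of column selection, consider the following convex optimization problem
that we shall refer to as $P_{\exp}$:

\begin{align}
\min_{x} \quad & c^\top x \tag{18a} \\
\text{s.t.} \quad & \log\Big(\sum_{i=1}^{n} e^{x_i}\Big) \le t, \tag{18b} \\
& Ax \le b, \tag{18c} \\
& \sum_{i=1}^{N} \lambda_i \bar{z}_i = x, \tag{18d} \\
& \sum_{j=1}^{N} \lambda_j = 1, \tag{18e} \\
& \lambda_j \ge 0, \quad j = 1, \dots, N. \tag{18f}
\end{align}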

Without loss of generality, we assume that the constraint (18b) is known a priori, and constraints
(18c) are the linear embeddings of learned constraints with $A \in \mathbb{R}^{k \times n}$ and $b \in \mathbb{R}^k$. Constraints
(18d)-(18f) define the trust region based on N data points. Figure EC.3 shows the computation time
required to solve $P_{\exp}$ with different values of n, k, and N . The “No Column Selection” approach
consists of solving $P_{\exp}$ using the entire dataset. The “Column Selection” approach makes use of
Algorithm 1 to solve the problem, starting with $|I^0| = 100$, and selecting only one sample at each
iteration, i.e., the one with the most negative reduced cost. It can be seen that in all cases, the use
of column selection results in significantly improved computation times. This allows us to more
quickly define the trust region for problems with large amounts of data.

Figure EC.3 Effect of column selection on computation time. Solution times are reported for three different sizes
of problem $P_{\exp}$. Small-scale: n = 5, k = 10. Medium-scale: n = 10, k = 50. Large-scale: n = 20, k =
100. The number of samples goes from 500 to 5 × 10^5. In each iteration, the sample with most
negative reduced cost is selected. The same problem is solved using MOSEK (2019) with conic
reformulation for 10 different instances where c, A, and b are randomly generated.
[Three panels (Small-scale problem, Medium-scale problem, Large-scale problem), each plotting computation time (seconds) against the number of samples in Z (×10^5) for “No Column Selection” and “Column Selection”.]

Appendix C: WFP case study

Table EC.1 and Table EC.2 show the nutritional value of each food and our assumed nutrient
requirements, respectively. The values adopted are based on the World Health Organization (WHO)
guidelines (UNHCR et al. 2002).

Table EC.1 Nutritional contents per gram for different foods.


Food Eng(kcal) Prot(g) Fat(g) Cal(mg) Iron(mg) VitA(ug) ThB1(mg) RibB2(mg) NicB3(mg) Fol(ug) VitC(mg) Iod(ug)
Beans 335 20 1.2 143 8.2 0 0.5 0.22 2.1 180 0 0
Bulgur 350 11 1.5 23 7.8 0 0.3 0.1 5.5 38 0 0
Cheese 355 22.5 28 630 0.2 120 0.03 0.45 0.2 0 0 0
Fish 305 22 24 330 2.7 0 0.4 0.3 6.5 16 0 0
Meat 220 21 15 14 4.1 0 0.2 0.23 3.2 2 0 0
Corn-soya blend 380 18 6 513 18.5 500 0.65 0.5 6.8 0 40 0
Dates 245 2 0.5 32 1.2 0 0.09 0.1 2.2 13 0 0
Dried skim milk 360 36 1 1257 1 1,500 0.42 1.55 1 50 0 0
Milk 360 36 1 912 0.5 280 0.28 1.21 0.6 37 0 0
Salt 0 0 0 0 0 0 0 0 0 0 0 1000000
Lentils 340 20 0.6 51 9 0 0.5 0.25 2.6 0 0 0
Maize 350 10 4 13 4.9 0 0.32 0.12 1.7 0 0 0
Maize meal 360 9 3.5 10 2.5 0 0.3 0.1 1.8 0 0 0
Chickpeas 335 22 1.4 130 5.2 0 0.6 0.19 3 100 0 0
Rice 360 7 0.5 7 1.2 0 0.2 0.08 2.6 11 0 0
Sorghum/millet 335 11 3 26 4.5 0 0.34 0.15 3.3 0 0 0
Soya-fortified bulgur wheat 350 17 1.5 54 4.7 0 0.25 0.13 4.2 74 0 0
Soya-fortified maize meal 390 13 1.5 178 4.8 228 0.7 0.3 3.1 0 0 0
Soya-fortified sorghum grits 360 360 1 40 2 0 0.2 0.1 1.7 50 0 0
Soya-fortified wheat flour 360 16 1.3 211 4.8 265 0.66 0.36 4.6 0 0 0
Sugar 400 0 0 0 0 0 0 0 0 0 0 0
Oil 885 0 100 0 0 0 0 0 0 0 0 0
Wheat 330 12.3 1.5 36 4 0 0.3 0.07 5 51 0 0
Wheat flour 350 11.5 1.5 29 3.7 0 0.28 0.14 4.5 0 0 0
Wheat-soya blend 370 20 6 750 20.8 498 1.5 0.6 9.1 0 40 0

Eng = Energy, Prot = Protein, Cal = Calcium, VitA = Vitamin A, ThB1 = ThiamineB1, RibB2 = RiboflavinB2, NicB3 = NiacinB3, Fol
= Folate, VitC = Vitamin C, Iod = Iodine

Table EC.2 Nutrient requirements used in optimization model.


Type Eng(kcal) Prot(g) Fat(g) Cal(mg) Iron(mg) VitA(ug) ThB1(mg) RibB2(mg) NicB3(mg) Fol(ug) VitC(mg) Iod(ug)
Avg person day 2100 52.5 89.25 1100 22 500 0.9 1.4 12 160 0 150

Eng = Energy, Prot = Protein, Cal = Calcium, VitA = Vitamin A, ThB1 = ThiamineB1, RibB2 = RiboflavinB2, NicB3 = NiacinB3, Fol
= Folate, VitC = Vitamin C, Iod = Iodine

C.1. Food baskets generation and palatability function


Referring to Peters et al. (2021), a food basket $x_k$ ($\forall k \in K$) is defined as a collection of K commodities,
such as beans, meat, and oil, along with their respective quantities measured in grams. These
commodities are classified into five macro-categories: cereals and grains; pulses and vegetables; oils
and fats; mixed and blended foods; and meat, fish, and dairy. Each macro-category $g \in G$
is associated with an upper bound ($\max_g$) and a lower bound ($\min_g$), as shown in Table EC.3.
We use the notation $K_g$ to indicate the set of commodities belonging to category g. In contrast
to the approach used in Peters et al. (2021), where bound constraints were utilized to ensure the
palatability of the food basket, we expand the notion of palatability by incorporating a palatability
score that is non-negative and tends towards zero for more enjoyable diets. The
score is calculated as:

$$ \text{Palatability Score} = \sqrt{\sum_{g \in G} \big(\gamma_g (\hat{x}_g - \text{Opt}_g)\big)^2}, \tag{19} $$

where

$$ \hat{x}_g = \sum_{k \in K_g} x_k \quad \text{with } g \in G \text{ and} $$

$$ \text{Opt}_g = \frac{\max_g + \min_g}{2} \quad \text{with } g \in G. $$
To account for the different range sizes (maxg − ming ) across the macro-categories, we introduce
a scaling parameter γg that determines their influence on the score, as presented in Table EC.3.
The resulting score is normalized on a scale of 0 to 1, where a score of 1 represents a perfectly
appetizing food basket, while a score of 0 indicates an inedible basket.

Table EC.3 Macro-categories bounds and scaling factor.

Macro-category min max γ

Cereals & Grains 200 600 1
Pulses & Vegetables 30 100 5.7
Oils & Fats 15 40 16
Mixed & Blended Foods 0 90 4.4
Meat & Fish & Dairy 0 60 6.6
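
The raw score in (19) can be computed directly from the bounds and scaling factors of Table EC.3. The sketch below evaluates only the unnormalized score; the subsequent normalization to the 0-1 scale is not specified in this appendix and is therefore left out. The dictionary layout and function name are illustrative.

```python
import math

# (min, max, gamma) per macro-category, taken from Table EC.3.
MACRO_CATEGORIES = {
    "Cereals & Grains": (200, 600, 1.0),
    "Pulses & Vegetables": (30, 100, 5.7),
    "Oils & Fats": (15, 40, 16.0),
    "Mixed & Blended Foods": (0, 90, 4.4),
    "Meat & Fish & Dairy": (0, 60, 6.6),
}


def palatability_raw_score(category_totals):
    """Unnormalized score of equation (19); category_totals maps category to grams."""
    total = 0.0
    for name, (lower, upper, gamma) in MACRO_CATEGORIES.items():
        optimum = (upper + lower) / 2
        total += (gamma * (category_totals[name] - optimum)) ** 2
    return math.sqrt(total)


midpoints = {name: (lo + hi) / 2 for name, (lo, hi, _) in MACRO_CATEGORIES.items()}
print(palatability_raw_score(midpoints))                        # basket at every optimum
print(palatability_raw_score({**midpoints, "Cereals & Grains": 500}))
```

A basket that sits at the midpoint of every range has a raw score of zero, and deviations in categories with a narrow range are penalized more heavily through their larger γ.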

The generation of diverse food baskets is done by solving several diet problems whose cost function
changes at each run and enforcing constraints on the nutrient requirements as well as on the
maximum number of foods belonging to the same category.

C.2. Predictive models

Table EC.4 shows the structure of the predictive models used in the WFP experiments. For each
model, the choice of parameters is based on a cross-validation procedure.

Table EC.4 Definition of the predictive model parameters used in the WFP
case study

Model Parameters
Linear ElasticNet parameters: 0.1 (alpha), 0.1 (ℓ1 -ratio)
SVM regularization parameter: 100
CART max depth: 10, max features: 1.0, min samples leaf: 0.02
RF max depth : 4, max features: auto, number of estimators: 25
GBM learning rate: 0.2, max depth: 5, number of estimators: 20
MLP hidden layers: 1, size hidden layers: (100,) activation: relu

C.3. Effect of robustness parameters

Robustness impact by algorithm. Table EC.5 reports the change in objective value (cost) and
constrained outcome (palatability) between the nominal and bootstrapped solution with 10 estimators
and a violation limit of 25%. The goal of the WFP case study is to minimize cost such that
palatability is at least 0.5; thus, a smaller cost and larger palatability are better. As expected, the
robust solution increases both the cost and palatability of the prescribed diets. We see that the
relative increase in cost is consistently lower than the relative increase in real palatability across all
methods, indicating that the improvement in palatability exceeds the incremental cost addition.
While the acceptable trade-off between cost and palatability could differ by use case, this could be
further explored with alternative violation limits. Additionally, we compare the single algorithm
constraints against an ensemble of all six methods, also with a violation limit of α = 0.25. The
ensemble with multiple algorithms yields an objective value of 1313 and real palatability of 0.57.
This represents a change in cost of -1.8% to 1% and a 5.6% to 15.6% increase in real palatability over
the nominal solutions. When compared to the bootstrapped single-method models, it is generally
more conservative. This is consistent with the fact that it must satisfy the constraint estimate
across the majority of the individual methods, forcing it to be conservative relative to this set.

Table EC.5 Change in cost and palatability from nominal to bootstrapped (P = 10, α = 0.25) solution.

Objective value Real palatability


Algorithm Nominal Bootstrapped Change Nominal Bootstrapped Change
Linear 1337 1359 1.6% 0.496 0.512 3.1%
SVM 1306 1308 0.1% 0.541 0.548 1.2%
CART 1301 1307 0.5% 0.539 0.550 2.1%
RF 1305 1306 0.0% 0.543 0.551 1.5%
GBM 1300 1304 0.3% 0.532 0.553 3.9%
MLP 1307 1313 0.5% 0.537 0.587 9.4%

Effect of number of estimators. Table EC.6 compares the runtime as the number of estimators
(P ) increases up to 25 estimators. We see that the solve times for the linear, SVM, CART, and
MLP models are stable as the number of estimators increases. In contrast, we see that the ensemble
algorithms, RF and GBM, have exponential runtime increases as the number of estimators grows.
RF and GBM are already composed of multiple individual learners, so embedding multiple estimators
involves adding multiple sets of decision trees, which becomes computationally expensive.
All results are reported over 100 instances. The experiments were run using a virtual computing
environment with 4 CPUs and 32 GB total RAM. We also report the runtime for an ensemble of
estimators obtained from different model classes (“Ensemble”), using a single model from each
class.
We further investigate the runtimes with 25 estimators in Table EC.7. The left side of the table
reports the mean, median, and maximum runtimes for each method on the same 100 experiments
as above. We see that the RF and GBM models have reasonable median solve times (6.66 and
18.80 minutes, respectively), but the average solve times are driven up by outlier instances that
have significantly higher runtimes (max. 2110 and 1603 minutes, respectively). We propose to use
a time limit to control the experiment times. On the right side of the table, we see that using a 4
hour time limit returns optimal solutions for 95% of the RF runs and 82% of the GBM runs, and
feasible solutions for all but four GBM instances. In cases where an optimal solution is obtained,
the average runtime is less than 40 minutes. In cases where the time limit is hit, the average
remaining MIP gap is 1.02% for RF and 5.21% for GBM. The results suggest that imposing this
termination condition results in high quality solutions with a modest optimality gap.
The runtime experiments raise a natural question: what is the impact of embedding a larger
number of estimators? We consider the cost-palatability trade-off for a decision tree model as we
vary the number of estimators from P = 2 to P = 50, averaged over the candidate violation limits.
The results are shown in Figure EC.4. As the number of estimators increases, the results tend
to be more conservative. By 10 estimators, the trade-off curve well-approximates the curves for
higher numbers of estimators up to an inflection point where average cost increases significantly. By P = 25
and P = 50, the curves closely match, suggesting diminishing value in increasing the number of
estimators beyond a certain point.

Table EC.6 Optimization solver runtime (minutes) as a function of number of bootstrapped estimators.

Algorithm P = 2 P = 5 P = 10 P = 25
Linear 0.01 0.01 0.02 0.01
SVM 0.01 0.01 0.02 0.01
CART 0.02 0.02 0.03 0.12
RF 0.15 1.34 11.93 44.87
GBM 0.37 3.58 10.71 133.13
MLP 0.01 0.02 0.04 0.48
Ensemble 0.09

Table EC.7 Runtime results for P = 25 estimators, both when solved to optimality (left) and with a 4 hour time limit
(right).

Columns: Algorithm; runtime (mins) to optimality: Mean, Median, Max; 4 hour time limit: % feasible, % optimal, Avg. runtime to optimality (mins), Avg. remaining MIP gap.
Linear 0.01 0.01 0.04 100% 100% 0.01 –
SVM 0.01 0.01 0.04 100% 100% 0.01 –
CART 0.12 0.08 0.64 100% 100% 0.12 –
RF 44.87 6.66 2109.72 100% 95% 19.76 1.02%
GBM 133.13 18.80 1603.51 96% 82% 37.79 5.21%
MLP 0.49 0.12 6.89 100% 100% 0.48 –
Finally, Table EC.8 reports the parameters used in our bootstrapped models. For each method,
we report the parameter grid that was used in our model training and selection procedure. Individ-
ual estimators use different combinations of these parameters based on the validation performance
on the specific bootstrapped samples. We note that for these experiments, we used a default param-
eter grid implemented in OptiCL; this grid can be manually set by a user when specifying each
outcome of interest before model training.
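
To give a sense of the size of these grids, the sketch below enumerates the parameter combinations for three of the algorithms in Table EC.8 using only the standard library. It is not OptiCL code and does not show how OptiCL searches the grid; the parameter names are written with underscores as an assumption about the intended identifiers, and the `l1_ratio` values are the ones that `np.arange(0.1, 1.0, 0.2)` evaluates to, up to floating-point rounding.

```python
from itertools import product

PARAM_GRIDS = {
    "Linear": {
        "alpha": [0.1, 1, 10, 100, 1000],
        "l1_ratio": [0.1, 0.3, 0.5, 0.7, 0.9],
    },
    "CART": {
        "max_depth": [3, 4, 5, 6, 7, 8, 9, 10],
        "min_samples_leaf": [0.02, 0.04, 0.06],
        "max_features": [0.4, 0.6, 0.8, 1.0],
    },
    "GBM": {
        "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
        "max_depth": [2, 3, 4, 5],
        "n_estimators": [20],
    },
}


def grid_combinations(grid):
    """All parameter settings in a grid, one dictionary per combination."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


for algorithm, grid in PARAM_GRIDS.items():
    print(algorithm, len(grid_combinations(grid)))
```

Each bootstrapped estimator is selected from these candidates based on validation performance, so the grid size bounds the training effort per estimator.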

Appendix D: Chemotherapy regimen design


D.1. Data Processing
The data for this case study includes three components, study cohort characteristics (w), treatment
variables (x), and outcomes (y). The raw data was obtained from Bertsimas et al. (2016), in which
the authors manually curated data from 495 clinical trial arms for advanced gastric cancer. Our
feature space was processed as follows:
Cohort Characteristics. We included several cohort characteristics to adjust for the study context:
fraction of male patients, median age, primary site breakdown (Stomach vs. GEJ), fraction of
patients receiving prior palliative chemotherapy, and mean ECOG score. We also included variables
for the study context: the study year, country, and number of patients. Missing data was imputed
using multiple imputation based on the other contextual variables; 20% of observations had one
missing feature and 6% had multiple missing features.

Table EC.8 Default parameter grid for supported algorithms.

Algorithm Parameter Grid
Linear 'alpha': [0.1, 1, 10, 100, 1000], 'l1_ratio': np.arange(0.1, 1.0, 0.2)
SVM 'C': [.1, 1, 10, 100]
CART 'max_depth': [3, 4, 5, 6, 7, 8, 9, 10], 'min_samples_leaf': [0.02, 0.04, 0.06], 'max_features': [0.4, 0.6, 0.8, 1.0]
RF 'n_estimators': [10, 25], 'max_features': ['auto'], 'max_depth': [2, 3, 4]
GBM 'learning_rate': [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2], 'max_depth': [2, 3, 4, 5], 'n_estimators': [20]
MLP 'hidden_layer_sizes': [(10,), (20,), (50,), (100,)]

Figure EC.4 Effect of the number of bootstrapped estimators (P ) on the cost and palatability of the prescribed diet.
[Panels plot objective_function against violation_margin and against bootstraps; legends: bootstraps (2, 5, 10, 25, 50) and violation_margin (0.01 to 0.05).]
Treatment Variables. Chemotherapy regimens involve multiple drugs being delivered at poten-
tially varied frequencies over the course of a chemotherapy cycle. As a result, multiple dimensions
of the dosage must be encoded to reflect the treatment strategy. As in Bertsimas et al. (2016), we
include three variables to represent each drug: an indicator (1 if the drug is used in the regimen),
instantaneous dose, and average dose.
Table EC.9 Comparison of out-of-sample R² of all considered models for learned outcomes in chemotherapy regimen selection
problem.

Outcome Linear SVM CART RF GBM
Any DLT 0.268 -0.094 -0.016 0.152 0.202
Blood 0.196 -1.102 0.012 0.153 0.105
Constitutional 0.106 0.144 0.157 0.194 0.136
Infection 0.082 -0.511 -0.222 0.070 0.035
Gastrointestinal 0.141 -0.196 -0.023 0.066 0.083
Overall Survival 0.448 0.385 0.474 0.496 0.450

Outcomes. We use Overall Survival (OS) as our survival metric, as reported in the clinical trials.
Any observations with unreported OS are excluded. We consider several “dose-limiting toxicities”
(DLTs): Grade 3/4 constitutional, gastrointestinal, infection, and neurological toxicities, as well
as Grade 4 blood toxicities. The toxicities reported in the original clinical trials are aggregated
according to the CTCAE toxicity classes (Cancer Therapy Evaluation Program 2006). We also
include a variable for the occurrence of any of the four individual toxicities ($t_i$ for each toxicity
$i \in T$), called DLT proportion; we treat these toxicity groups as independent and thus define the
DLT proportion as

$$ \text{DLT} = 1 - \prod_{i \in T} (1 - t_i). $$

We define Grade 4 blood toxicity as the maximum of five individual blood toxicities (related to
neutrophils, leukocytes, lymphocytes, thrombocytes, anemia). Observations missing all of these
toxicities were excluded; entries with partial missingness were imputed using multiple imputation
based on other blood toxicity columns. Similarly, observations with no reported Grade 3/4 toxicities
were excluded; those with partial missingness were imputed using multiple imputation based on
the other toxicity columns. These exclusion criteria resulted in a final set of 461 (of 495) treatment
arms.
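
The two aggregation rules above can be written out in a few lines. The sketch uses made-up toxicity proportions purely for illustration; the function names are not taken from the study's code.

```python
def dlt_proportion(toxicity_rates):
    """Probability of at least one toxicity, treating the toxicity groups as independent."""
    no_toxicity = 1.0
    for rate in toxicity_rates:
        no_toxicity *= 1.0 - rate
    return 1.0 - no_toxicity


def grade4_blood_toxicity(blood_rates):
    """Grade 4 blood toxicity defined as the maximum of the individual blood toxicities."""
    return max(blood_rates)


# Hypothetical proportions for one trial arm.
print(dlt_proportion([0.10, 0.20]))
print(grade4_blood_toxicity([0.05, 0.12, 0.03, 0.08, 0.02]))
```

Under the independence assumption, the DLT proportion is never smaller than the largest individual toxicity and never larger than their sum.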
We split the data into training/testing sets temporally. The training set consists of all clinical
trials through 2008, and the testing set consists of all 2009-2012 trials. We exclude trials from the
testing set if they use new drugs not seen in the training data (since we cannot evaluate these given
treatments). We also identify sparse treatments (defined as being only seen once in the training
set) and remove all observations that include these treatments. The final training set consists of
320 observations, and the final testing set consists of 96 observations.

D.2. Predictive Models


Table EC.9 shows the out-of-sample performance of all considered methods in the model selection
pipeline. We note that model choice is based on the 5-fold validation performance, so it does not
necessarily correspond to the highest test set performance. The final parameters for each model
and each outcome, selected through the cross-validation procedure, are shown in Table EC.10.

Table EC.10 Predictive model parameters used in the chemotherapy case study.

| Outcome          | Model  | Parameters                                                  |
|------------------|--------|-------------------------------------------------------------|
| Any DLT          | GBM    | learning rate: 0.2, max depth: 2, number of estimators: 20  |
| Blood            | Linear | ElasticNet parameters: 0.1 (alpha), 0.7 (ℓ1-ratio)          |
| Constitutional   | RF     | max depth: 4, max features: 'auto', number of estimators: 25 |
| Infection        | Linear | ElasticNet parameters: 1 (alpha), 0.5 (ℓ1-ratio)            |
| Gastrointestinal | GBM    | learning rate: 0.1, max depth: 4, number of estimators: 20  |
| Overall Survival | GBM    | learning rate: 0.1, max depth: 3, number of estimators: 20  |

Table EC.11 Performance ($R^2$) of individual models in ground truth ensemble for model evaluation.

| Outcome          | Linear | SVM   | CART  | RF    | GBM   | XGB   |
|------------------|--------|-------|-------|-------|-------|-------|
| Any DLT          | 0.301  | 0.330 | 0.250 | 0.573 | 0.670 | 0.323 |
| Blood            | 0.287  | 0.351 | 0.211 | 0.701 | 0.813 | 0.446 |
| Constitutional   | 0.139  | 0.224 | 0.246 | 0.602 | 0.682 | 0.285 |
| Infection        | 0.217  | 0.303 | 0.139 | 0.514 | 0.588 | 0.247 |
| Gastrointestinal | 0.201  | 0.328 | 0.238 | 0.563 | 0.733 | 0.475 |
| Overall Survival | 0.528  | 0.469 | 0.421 | 0.815 | 0.827 | 0.756 |

D.3. Prescription Evaluation


Table EC.11 shows the performance of the models that comprise the ground truth (GT) ensemble used
in the evaluation framework. These models are trained on the full data. We see that the ensemble
models, particularly RF and GBM, have the highest performance. The GT models are trained on more
data and include more complex parameter options (e.g., deeper trees, larger forests) since they are
not required to be embedded in the MIO and are instead used directly to generate predictions. For
the same reason, the GT ensemble could also be generalized to consider even broader method classes
that are not directly MIO-representable, such as neural networks with alternative activation
functions, providing an additional degree of robustness. The final parameters for each model and
each outcome, selected through the cross-validation procedure, are shown in Table EC.12.

D.4. Optimization runtimes


Table EC.13 reports the runtimes of the optimization models whose results are presented in Table 6
(Section 5.5). Runtimes are averaged over all patients in the test set.

Appendix E: Comparison with JANOS and EML


As mentioned in Section 1.1, JANOS and EML are two software frameworks for embedding
learned ML models in optimization problems. In this section, we compare the performance of
OptiCL with that of JANOS and EML using the test problems in Bergman et al. (2022) and Lombardi
et al. (2017), respectively. The experiments are conducted using an Intel i7-8665U 1.9 GHz CPU
with 16 GB RAM (Windows 10 environment).

Table EC.12 Predictive model parameters used in the ground truth ensemble for model evaluation.

| Algorithm | Parameter                | Any DLT | Blood | Const. | Inf. | GI   | OS   |
|-----------|--------------------------|---------|-------|--------|------|------|------|
| Linear    | alpha                    | 0.1     | 0.1   | 1      | 1    | 1    | 0.1  |
|           | ℓ1 ratio                 | 0.6     | 0.5   | 0.4    | 0.3  | 0.7  | 0.8  |
| SVM       | regularization parameter | 100     | 100   | 1      | 10   | 100  | 0.1  |
| CART      | max depth                | 3       | 3     | 4      | 3    | 5    | 3    |
|           | max features             | 1       | 1     | 0.6    | 0.6  | 0.6  | 0.8  |
|           | min samples per leaf     | 0.04    | 0.06  | 0.06   | 0.06 | 0.06 | 0.02 |
| RF        | max depth                | 6       | 8     | 6      | 6    | 6    | 8    |
|           | max features             | auto    | auto  | auto   | auto | auto | auto |
|           | number of estimators     | 500     | 500   | 500    | 250  | 250  | 250  |
| GBM       | learning rate            | 0.01    | 0.025 | 0.01   | 6    | 0.01 | 0.01 |
|           | max depth                | 5       | 5     | 5      | auto | 6    | 5    |
|           | number of estimators     | 250     | 250   | 250    | 250  | 250  | 250  |
| XGB       | cols sampled by tree     | 0.8     | 1     | 0.8    | 1    | 0.8  | 1    |
|           | gamma                    | 0.5     | 0.5   | 1      | 1    | 0.5  | 10   |
|           | max depth                | 4       | 5     | 4      | 4    | 5    | 4    |
|           | min child weight         | 10      | 1     | 10     | 10   | 1    | 10   |
|           | number of estimators     | 250     | 250   | 250    | 250  | 250  | 250  |
|           | subsample                | 1       | 0.8   | 0.8    | 0.8  | 0.8  | 1    |

Const. = Constitutional, Inf. = Infection, GI = Gastrointestinal, OS = Overall survival.

Table EC.13 Average (and standard deviation) of runtimes for gastric cancer case, in seconds.

| Model Version   | Average Time (SD) |
|-----------------|-------------------|
| All Constraints | 0.511 (0.892)     |
| DLT Only        | 0.203 (0.433)     |

E.1. OptiCL vs JANOS


In the Student Enrolment Problem (SEP) in Bergman et al. (2022), a university’s admission office
seeks to offer scholarships to some of the admitted students in order to bolster the class profile.
The objective is to maximize the expected class size subject to budget constraints. This problem
is formulated as:

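\begin{align}
\max \quad & \sum_{i=1}^{N} y_i \tag{20a} \\
\text{s.t.} \quad & \sum_{i=1}^{N} x_i \le \mathrm{BUDGET}, \tag{20b} \\
& y_i = \hat{h}(s_i, g_i, x_i) && \forall i \in \{1, \ldots, N\}, \tag{20c} \\
& 0 \le x_i \le 25{,}000 && \forall i \in \{1, \ldots, N\}, \tag{20d}
\end{align}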



where $x_i$ is the decision variable indicating the amount of scholarship offered to admitted
student $i$, $s_i$ is the SAT score of applicant $i$, and $g_i$ is the GPA of applicant $i$. The
predicted outcome $y_i$ represents the probability that candidate $i$ accepts the offer, and
$\hat{h}$ is the fitted model used to predict this probability. The parameters $s_i$, $g_i$ and the
decision variable $x_i$ are the predictive model's inputs. To compare OptiCL and JANOS, we solved
the SEP for different numbers of students and compared the objective values and runtimes. Although
OptiCL and JANOS handle neural network embedding in a similar manner, JANOS uses a parameterized
discretization to handle logistic regression predictions. We therefore compared their performance
using only the logistic regression models, where we expected the difference in implementation to
produce a difference in performance. In the experiments reported in Figure EC.5, we discretize the
logistic regression (LogReg) in JANOS using three different numbers of intervals (reported in
parentheses in the figure legend). The experiments show that OptiCL achieves better objective values
in all three instances, and that for the larger problems OptiCL is much more efficient than JANOS
in terms of optimization runtime.
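To give some intuition for why the number of discretization intervals can matter, the sketch below approximates the logistic function by linear interpolation on equal-width intervals and measures the worst-case approximation error for 5, 15, and 25 intervals. This is only an illustration of the general trade-off: the interpolation scheme, the range $[-6, 6]$, and all function names are assumptions, and the discretization actually implemented in JANOS may differ.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def piecewise_sigmoid(z, n_intervals, z_min=-6.0, z_max=6.0):
    """Linear interpolation of the sigmoid on n_intervals equal-width pieces.

    Inputs outside [z_min, z_max] are clamped to the nearest endpoint.
    """
    z = min(max(z, z_min), z_max)
    width = (z_max - z_min) / n_intervals
    j = min(int((z - z_min) / width), n_intervals - 1)  # interval index
    left = z_min + j * width
    weight = (z - left) / width
    return (1.0 - weight) * sigmoid(left) + weight * sigmoid(left + width)


def max_abs_error(n_intervals, n_grid=2001):
    """Largest absolute gap to the true sigmoid on a fine grid over [-6, 6]."""
    grid = [-6.0 + 12.0 * i / (n_grid - 1) for i in range(n_grid)]
    return max(abs(piecewise_sigmoid(z, n_intervals) - sigmoid(z)) for z in grid)


# Interval counts matching those reported in the Figure EC.5 legend.
errors = {n: max_abs_error(n) for n in (5, 15, 25)}
```

A finer discretization tracks the logistic curve more closely, at the cost of more pieces to encode in the optimization model, which is one plausible source of the objective and runtime differences discussed above.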

Figure EC.5 Objective value (right) and runtime (left) comparison between OptiCL and JANOS for the SEP.
[Figure: two panels plotting the objective function and the optimization runtime (s) against # students (50, 500, 5000) for JANOS LogReg(5), JANOS LogReg(15), JANOS LogReg(25), and OptiCL LogReg.]

E.2. OptiCL vs EML


In the thermal-aware Workload Dispatching Problem (WDP) in Lombardi et al. (2017), the goal
is to assign jobs to the cores of a multi-core processor. The processor has 24 dual-core
tiles arranged in a 4×6 grid, resulting in 48 cores arranged in an 8×6 grid. A direct
comparison between OptiCL and EML is not possible, as Lombardi et al. (2017) do not use neural
networks or decision trees for constraint learning in MIO problems; for these predictive
models, their focus is on Local Search, Constraint Programming, and SAT Modulo Theories
approaches. Instead, we demonstrate that OptiCL is able to solve the example in an MIO setting.
The model considered here is the “ANN1” model in Lombardi et al. (2017), given as:

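\begin{align}
\max \quad & z \tag{21a} \\
\text{s.t.} \quad & z \le y_k && \forall k = 0, \ldots, m-1, \tag{21b} \\
& y_k = \hat{h}_k(\mathit{avgcpi}_k, \mathit{neighcpi}_k, \mathit{othercpi}_k) && \forall k = 0, \ldots, m-1, \tag{21c} \\
& \sum_{k=0}^{m-1} x_{ik} = 1 && \forall i = 0, \ldots, n-1, \tag{21d} \\
& \sum_{i=0}^{n-1} x_{ik} = \frac{n}{m} && \forall k = 0, \ldots, m-1, \tag{21e} \\
& \mathit{avgcpi}_k = \frac{1}{m} \sum_{i=0}^{n-1} \mathit{cpi}_i \, x_{ik} && \forall k = 0, \ldots, m-1, \tag{21f} \\
& \mathit{neighcpi}_k = \frac{1}{m} \sum_{h \in N(k)} \mathit{avgcpi}_h && \forall k = 0, \ldots, m-1, \tag{21g} \\
& \mathit{othercpi}_k = \frac{1}{m - 1 - |N(k)|} \sum_{h \neq k,\, h \notin N(k)} \mathit{avgcpi}_h && \forall k = 0, \ldots, m-1, \tag{21h} \\
& x_{ik} \in \{0, 1\} && \forall i = 0, \ldots, n-1, \ \forall k = 0, \ldots, m-1, \tag{21i}
\end{align}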

where $x_{ik}$ is the binary decision variable indicating whether job $i$ is mapped to core $k$.
The parameter $\mathit{cpi}_i$ represents the average Clocks Per Instruction (CPI) characterizing
job $i$, and is a measure of the difficulty of job $i$. The objective is to maximize the worst-case
core efficiency, and the fitted model $\hat{h}_k$ is used to predict the efficiency of core $k$,
represented by $y_k \in [0, 1]$. Constraint (21d) ensures that each job is mapped to exactly one
core, and (21e) forces the same number of jobs to run on each core. Constraints (21f), (21g), and
(21h) compute the average CPI for core $k$, the average CPI for the cores in the neighborhood of
$k$, $N(k)$, and the average CPI for the cores not in the neighborhood of $k$, respectively.
Lombardi et al. (2017) conclude that learning the efficiency function for each core by means of
neural networks (with one hidden layer of two nodes and tanh activation function) is
computationally intractable. In contrast, our experiments show that we are able to solve this
problem using neural networks with one hidden layer of 10 nodes in a reasonable amount of time
(19.4 seconds). We also tried deeper neural networks, but the increase in computational complexity
did not lead to a gain in predictive performance.
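For a fixed job-to-core assignment, the inputs to the learned efficiency models can be computed directly from constraints (21d) to (21h). The sketch below does this with the normalizations as printed in (21f) to (21h). The neighborhood $N(k)$ is assumed here to be the four adjacent cores on the 8×6 grid, which is an illustrative choice and may not match the definition in Lombardi et al. (2017); the function names and toy data are likewise hypothetical.

```python
def grid_neighbors(k, rows=8, cols=6):
    """Adjacent cores (up, down, left, right) of core k on a rows x cols grid.

    This 4-neighborhood is an assumed definition of N(k).
    """
    r, c = divmod(k, cols)
    result = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            result.append(rr * cols + cc)
    return result


def core_features(assignment, cpi, m, neighbors):
    """Compute (avgcpi, neighcpi, othercpi) for every core.

    assignment[i] is the core of job i, so (21d) holds by construction.
    Raises ValueError if the balance constraint (21e) is violated.
    """
    n = len(assignment)
    loads = [0] * m
    for k in assignment:
        loads[k] += 1
    if any(load * m != n for load in loads):  # (21e): n/m jobs per core
        raise ValueError("each core must run exactly n/m jobs")

    avg = [0.0] * m
    for i, k in enumerate(assignment):
        avg[k] += cpi[i] / m  # (21f), normalization as printed

    neigh, other = [], []
    for k in range(m):
        nk = set(neighbors(k))
        neigh.append(sum(avg[h] for h in nk) / m)  # (21g), as printed
        rest = [h for h in range(m) if h != k and h not in nk]
        other.append(sum(avg[h] for h in rest) / (m - 1 - len(nk)))  # (21h)
    return avg, neigh, other


# Toy instance (hypothetical): 96 identical jobs spread evenly over 48 cores.
m_cores = 48
jobs_cpi = [1.0] * 96
round_robin = [i % m_cores for i in range(96)]
avg_cpi, neigh_cpi, other_cpi = core_features(
    round_robin, jobs_cpi, m_cores, grid_neighbors
)
```

In the MIO itself these quantities are linear expressions in the binary variables $x_{ik}$; the sketch only evaluates them for one candidate assignment, which can be useful for checking a solver's output.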
