Statistical Machine Learning For Quantitative Finance
ST10CH12_Ludkovski ARjats.cls February 14, 2023 11:59
1. INTRODUCTION
Quantitative finance (QF) is concerned with frameworks for managing trading activities in finan-
cial markets. The cornerstone of modern QF consists of stochastic models that aim to capture the
random phenomena pervasive in markets. A typical QF model specifies stochastic dynamics for
the system of interest (e.g., a particular stock share price) and then poses particular tasks for that
system (e.g., hedging a financial contract). At an abstract level, the model defines a data-generating
process, and the modeler manipulates quantities linked to that process. This perspective highlights
the manifold links between QF and the design and analysis of computer experiments (DACE). Like
Downloaded from www.annualreviews.org. Guest (guest) IP: 179.111.201.253 On: Tue, 29 Oct 2024 01:36:36
in DACE, the fundamental modeling goal in QF is to capture observed market features and struc-
tural properties and achieve the maximum degree of realism. At the same time, the fundamental
application of financial models is to analyze their output in an accurate and efficient manner. The
above tension between model complexity and tractability has grown as the QF field has matured
and has accelerated in the past decade as new computational paradigms have vastly expanded
the envelope of what is feasible. Compared with even the early 2000s, when compact parametric
models were de rigueur, practicing QF researchers, or quants, now routinely operate with large
nonparametric frameworks that are coupled with advanced machine learning–flavored numerical
techniques. As a result, the DACE lens is essential for decoupling the modeling complexity from
computational demands. The glue connecting such expensive experiments with cheap numerical
proxies is statistical machine learning.
This review surveys the current landscape of statistical learning in QF. For a broader overview,
readers are referred to the recent monograph by Dixon et al. (2020), the upcoming edited volume
by Capponi & Lehalle (2022), and the deep learning–oriented surveys by Ruf & Wang (2020),
Charpentier et al. (2021), and Hambly et al. (2021).
The main object in statistical learning is the surrogate (Gramacy 2020), which is an empirically trained functional approximator. More precisely, a surrogate is a function $\hat{f} : \mathcal{X} \to \mathbb{R}^p$ mapping inputs $x \in \mathcal{X} \subseteq \mathbb{R}^d$ into outputs, normally scalars ($p = 1$). The surrogate is constructed from a training set $\mathcal{D} = (x^{1:N}, y^{1:N})$, lifting it into a model describing the relations between x's and y's. This is done by finding a function $\hat{f}(\cdot)$ that minimizes a loss functional $R(f; \mathcal{D})$. The loss is constructed from a metric $L(\hat{y}, y)$ that defines the distance between surrogate predictions and the given data, describing the cost of incorrectly predicting outputs. In the idealized world, the loss is then the expected value based on the random variables $(X, Y)$ that generate the independent and identically distributed (i.i.d.) data in $\mathcal{D}$,
$$\bar{R}(f) := \mathbb{E}[L(f(X), Y)].$$
Since the joint distribution of $(X, Y)$ cannot be ascertained, we approximate $\bar{R}$ with the empirical loss $R(\hat{f}; \mathcal{D}) := \frac{1}{N} \sum_i L(\hat{f}(x^i), y^i)$.
1. Computing $\Pi(t, S)$ exactly for a given $(t, S)$ is possible, but is challenging and time-consuming. For example, it might necessitate solving a partial differential equation (PDE) or applying a fast Fourier transform. Then $\mathcal{D}$ is a collection of inputs where $y^i = \Pi(t^i, S^i)$ was evaluated exactly, and the goal is to obtain a cheaper representation of the pricing map by extrapolating these $y^i$ values.
2. Option prices are evaluated through a Monte Carlo engine. For a given $(t^i, S^i)$, the modeler has access to a noisy estimate of $\Pi(t^i, S^i)$, namely an empirical average $Y^i$ of $\check{N}$ samples, with precision being $O(\check{N}^{-1/2})$:
$$Y^i := \frac{1}{\check{N}} \sum_{j=1}^{\check{N}} Y^i_j, \qquad \mathbb{E}[Y^i_j] = \Pi(t^i, S^i), \text{ indep.} \tag{1}$$
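As a minimal sketch of the noisy setting in Equation 1, the snippet below averages $\check{N}$ discounted call payoffs and reports a standard error. It assumes Black–Scholes dynamics with exact lognormal sampling; the function name `mc_call_estimate` and all parameter values are illustrative choices, not taken from the text.

```python
import numpy as np

def mc_call_estimate(S, tau, K, r, sigma, n_inner, rng):
    """Empirical average Y^i of n_inner discounted call payoffs, plus its standard error."""
    z = rng.standard_normal(n_inner)
    # Exact lognormal draw of S_T given S_t = S (assumed Black-Scholes dynamics under Q)
    S_T = S * np.exp((r - 0.5 * sigma**2) * tau + sigma * np.sqrt(tau) * z)
    payoffs = np.exp(-r * tau) * np.maximum(S_T - K, 0.0)
    return payoffs.mean(), payoffs.std(ddof=1) / np.sqrt(n_inner)

rng = np.random.default_rng(0)
y_small, se_small = mc_call_estimate(40.0, 0.3, 40.0, 0.04, 0.22, 400, rng)
y_large, se_large = mc_call_estimate(40.0, 0.3, 40.0, 0.04, 0.22, 40_000, rng)
```

Increasing the inner sample size by a factor of 100 should shrink the standard error by roughly a factor of 10, in line with the $O(\check{N}^{-1/2})$ precision.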
probabilistic task of approximating a given quantity. To wit, given an analytically intractable ex-
pectation, the classical Monte Carlo method constructs an approximation that is evaluated on the
spot to any desired level of accuracy. In contrast, a surrogate model first builds a general purpose
approximator; later, it is evaluated at yet-unknown inputs.
designs or features) x, true (latent) responses f (x), and observations y(x). The outputs y can either
be exact samples of f (·) (the noise-free setup) or be stochastically sampled. In the latter case, noise
is usually assumed to be additive, stationary (in x), and Gaussian, $y(x) = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma_Y^2)$ i.i.d., although most algorithms can generalize (modulo computational challenges) to nonadditive, non-Gaussian state-dependent noise. Thus, statistical learning subsumes interpolation of exact samples, smoothing of noisy data, and extrapolation.
The interpretation of the input space x ∈ X is context-specific; an input x can include both
stochastic quantities (i.e., values of a stochastic process) and model or market parameters, such
as option properties (strike, maturity, etc.), volatility, or interest rates. Thus, x is almost always
multidimensional. Picking precisely what x does and does not contain is frequently a key modeling
choice. The overall training set D may be fixed and given (an external data set) or be generated by
the modeler themselves. In the latter framework, constructing D is known as experimental design.
To fit a surrogate, one selects an approximation space $\mathcal{H}$ and a loss function $L \equiv L(\hat{y}^{1:N}, y^{1:N})$, where $\hat{y}^i$ is the surrogate prediction at $x^i$. The most common choice is mean-squared error, $L_{MSE} := \frac{1}{N} \sum_{i=1}^N (\hat{y}^i - y^i)^2$, which matches the probabilistic structure of conditional expectations that are defined as $L^2$ minimizers. Further criteria are $L_{MAE} = \frac{1}{N} \sum_i |\hat{y}^i - y^i|$, $L_{MAPE} = \frac{1}{N} \sum_i \frac{|\hat{y}^i - y^i|}{y^i}$, $L_{MAX} = \max_{1 \le i \le N} |\hat{y}^i - y^i|$, and $L_{R2} = 1 - \frac{\sum_i (\hat{y}^i - y^i)^2}{\sum_i (\bar{y} - y^i)^2}$. The surrogate $\hat{f}$ is then taken to be the empirical minimizer of $L(f(x^{1:N}), y^{1:N}) = R(f; \mathcal{D})$,
$$\hat{f} = \arg\min_{f \in \mathcal{H}} L\big(f(x^{1:N}), y^{1:N}\big). \tag{2}$$
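The criteria listed above can be computed in a few lines. The following is a small illustrative helper; the name `surrogate_losses` and the toy numbers are assumptions for demonstration only, and the MAPE line presumes positive outputs.

```python
import numpy as np

def surrogate_losses(y_hat, y):
    """Evaluate the loss criteria from the text on predictions y_hat and data y."""
    y_hat, y = np.asarray(y_hat, float), np.asarray(y, float)
    err = y_hat - y
    return {
        "MSE": np.mean(err**2),
        "MAE": np.mean(np.abs(err)),
        "MAPE": np.mean(np.abs(err) / y),          # assumes y > 0 (e.g., option prices)
        "MAX": np.max(np.abs(err)),
        "R2": 1.0 - np.sum(err**2) / np.sum((y.mean() - y)**2),
    }

losses = surrogate_losses([1.1, 1.9, 3.2, 3.8], [1.0, 2.0, 3.0, 4.0])
```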
Practically speaking, the surrogate is defined through its (hyper)parameters ϑ so that the fitting
step can be actualized as an optimization problem for finding ϑ̂ that minimizes R( f ; D). This is
generally a high-dimensional, nonconvex optimization task, and global solutions are not guaran-
teed. Indeed, the performance of various surrogate frameworks is often predicated on how well
the latter optimization can be carried out. A common approach is to apply stochastic gradient
descent, generating a sequence of $\vartheta^{(j)}$'s over training epochs $j = 0, 1, \ldots$ that involve minibatches
(i.e., subsets) of $\mathcal{D}$.
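To make the minibatch idea concrete, here is a minimal sketch of stochastic gradient descent for a two-parameter linear surrogate under the mean-squared error loss. The synthetic data, learning rate, and batch size are illustrative assumptions; practical surrogates have far more parameters and a nonconvex loss.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=500)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(500)   # synthetic training set (illustrative)

theta = np.zeros(2)                                   # initial guess theta^(0)
lr, batch_size = 0.1, 32
for epoch in range(50):
    order = rng.permutation(len(x))                   # reshuffle the data each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]         # one minibatch (a subset of D)
        resid = theta[0] + theta[1] * x[idx] - y[idx]
        grad = 2.0 * np.array([resid.mean(), (resid * x[idx]).mean()])  # gradient of batch MSE
        theta -= lr * grad
```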
The optimization in Equation 2 may be regularized by adding additional terms. Regularization
is applied to mitigate overfitting; it can also include penalty terms to give preference to surrogates
that satisfy additional soft constraints. Financial domain knowledge can be embedded through
context-specific loss metrics L(·) or shape-constrained surrogates (Dugas et al. 2000, Yang et al.
2017, Huh 2019, Chataigner et al. 2021, Zheng et al. 2021).
To assess the quality of $\hat{f}$, one takes a test set $\mathcal{C}$ and a fitness metric (often different from the loss metric) $W(\hat{f}, \mathcal{C})$. For surrogate assessment, the relation between the training and test sets $\mathcal{D}$ and $\mathcal{C}$ is essential. Generalization bias, that is, overestimating model performance due to looking at in-sample results, is a well-known concern. Indeed, without testing on new, unseen inputs, one is not
able to detect overfitting. Similarly, model performance might be misleading if there is any unfair
advantage in preselecting the test set. For example, when working with real data sets, Ruf & Wang
274 Ludkovski
(2020) point out the potential for data snooping that arises when randomly partitioning between
training and testing sets. Instead, chronological partitioning should be considered; this is also necessary because financial data exhibit substantial time inhomogeneity, such as varying volatility regimes.
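A chronological partition can be sketched as follows: all training observations precede all test observations in time. The helper `chronological_split` and the sample dates are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def chronological_split(timestamps, train_fraction=0.8):
    """Return (train_idx, test_idx) such that every training date precedes every test date."""
    order = np.argsort(timestamps, kind="stable")
    n_train = int(round(train_fraction * len(order)))
    return order[:n_train], order[n_train:]

dates = np.array(
    ["2021-03-01", "2020-01-15", "2022-07-30", "2020-11-02", "2021-09-10"],
    dtype="datetime64[D]",
)
train_idx, test_idx = chronological_split(dates, train_fraction=0.6)
```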
In many financial contexts data are collected continuously (e.g., by scraping market quotes every day), a setting known as streaming or online learning. To make such learning efficient, one desires surrogates that can be updated rather than fully refit every time more training data arrive. Updating is possible through warm-starting the underlying optimizers [i.e., judiciously initializing $\vartheta^{(0)}$] and sometimes via explicit updating equations. The case where the streamed training data are picked by the modeler is called sequential design and brings forth the topic of the value of information from collecting additional samples.
1.3.1. Geometry of the training set. While training inputs are sometimes gridded, most sur-
rogates operate with mesh-free experimental designs. The classical QF learning algorithms tend
to generate inputs as samples along simulated paths of a stochastic process. This means that the
experimental design has a sampling density x ∼ p(x). Alternatives include sampling x using Latin
hypercube designs or low-discrepancy quasi–Monte Carlo sequences (Lemieux 2009). Stratifi-
cation or importance sampling (Glasserman 2004) may also come into view. It is important to
remember that surrogate predictions can only be trusted to approximate well over the training
domain. It is therefore prudent to monitor the use of the surrogate and retrain if the test domain
C shifts relative to the range of D, a situation known as concept drift.
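As one illustration of a space-filling, mesh-free design, the sketch below draws a Latin hypercube sample: each coordinate range is cut into N equal strata and every stratum receives exactly one point. The function `latin_hypercube` and the chosen bounds (time to maturity and stock price) are assumptions made for this example.

```python
import numpy as np

def latin_hypercube(n, bounds, rng):
    """n design points in the box `bounds` = [(lo_1, hi_1), ..., (lo_d, hi_d)]."""
    lo, hi = np.array(bounds, float).T
    d = len(lo)
    # One random permutation of the strata per coordinate, plus a uniform jitter inside each stratum
    strata = np.column_stack([rng.permutation(n) for _ in range(d)])
    u = (strata + rng.random((n, d))) / n
    return lo + u * (hi - lo)

rng = np.random.default_rng(0)
design = latin_hypercube(50, [(0.05, 1.0), (20.0, 60.0)], rng)   # columns: (tau, S)
```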
1.3.2. Convergence. Statistical learning paradigms afford extensive convergence theories re-
garding the approximation quality (Györfi et al. 2002). The first fundamental type of results
concerns consistency, that is, the approximation error vanishing in a certain limit, typically as
the sample size goes to infinity (Elie et al. 2020, Cheridito & Gersey 2021). This might be possi-
ble only for true responses f (·) belonging to some given class of functions or universally (e.g., for
any twice-differentiable function). The second type of results is about asymptotic convergence
rates (Belomestny 2011b, Glau & Mahlstedt 2019, Gonon & Schwab 2021). The third type is
about convergence in terms of surrogate hyperparameters ϑ, keeping the data-generating pro-
cess fixed. These results concern the question of using more basis functions (or more neurons
in NNs), and often require taking joint limits in |D| and |ϑ| (Clément et al. 2002, Glasserman
& Yu 2004a, Belomestny et al. 2010a). Finally, for some learning frameworks it is possible to
obtain probabilistic guarantees on their quality, that is, to do a self-assessment that provides lo-
calized error bounds on the surrogate prediction. This uncertainty quantification is essential for
adaptive learning and, moreover, allows the modeler to reject surrogate results if its self-reported
quality is insufficient. Such guarantees are increasingly important for financial risk managers and
for assessing model risk. A further common challenge is that many QF problems generate a se-
quence of interdependent surrogates, motivating analysis of error propagation from one surrogate
to another.
parameterized by the constant interest rate r and constant volatility σ. This implies that $S_T | S_t$ is lognormal. We wish to evaluate the price of a financial derivative on $(S_t)$, that is, a contract that entitles its owner to a payoff that depends on S. Given a payoff $g(S_T)$, no-arbitrage pricing theory implies that a fair price of this contract at date t is the expected discounted payoff conditional on information available at time t,
$$\mathbb{E}\big[e^{-r(T-t)} g(S_T) \,\big|\, S_t\big]. \tag{4}$$
For a call payoff $g_{Call}(S) := (S - K)_+$, exact integration is possible, yielding the Black–Scholes formula for $\Pi(t, S) = \mathbb{E}\big[e^{-r(T-t)} (S_T - K)_+ \,\big|\, S_t = S\big]$, which is a function of the five parameters $(S, K, r, \sigma, T - t)$. Note that the drift being r in Equation 3 reflects the fact that the dynamics of $(S_t)$ are already stated under the risk-neutral Q-measure that is the relevant one for contingent claim valuation, skipping the issue of risk premia common in econometrics.
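For reference, the standard Black–Scholes call formula as a function of the five parameters can be written as below. The function names are illustrative, and `tau` stands for the time to maturity $T - t$.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, r, sigma, tau):
    """Black-Scholes price of a call with strike K and time to maturity tau > 0."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return S * norm_cdf(d1) - K * exp(-r * tau) * norm_cdf(d2)

price = bs_call(S=40.0, K=40.0, r=0.04, sigma=0.22, tau=0.3)
```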
In general, Equation 4 does not admit closed-form solutions and is best thought of as an input–
output map, with inputs being the contract and model parameters θ and output being the option
price. To set the mathematical framework, let $(X_t)$ denote the stochastic process summarizing relevant financial quantities and taking values in $\mathcal{X} \subseteq \mathbb{R}^d$. We assume the standard probabilistic structure of $(\Omega, \mathcal{F}, (\mathcal{F}_t), \mathbb{P})$, where $X_t$ is adapted to the filtration $(\mathcal{F}_t)$. In the most common situation, one considers European-style contracts, where the payoff at maturity date T is a functional, $g(X_T) \in \mathbb{R}$, $g \in L^2(\mathbb{P})$. The no-arbitrage price at date t is $\mathbb{E}^Q\big[e^{-\int_t^T r_s\,ds} g(X_T) \,\big|\, \mathcal{F}_t\big]$, where $(r_t)$ is the risk-free interest rate process, Q is the pricing measure, and the σ-algebra $\mathcal{F}_t$ summarizes the available information up to $t \in [0, T]$. When $(X_t)$ is Markov, the expectation is a function of $X_t$,
$$\Pi(t, x) := \mathbb{E}^Q\big[e^{-\int_t^T r_s\,ds} g(X_T) \,\big|\, X_t = x\big]. \tag{5}$$
This is the case, for instance, when $(X_t)$ satisfies a stochastic differential equation,
$$dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dW_t^Q, \tag{6}$$
where $(W_t^Q)$ is a (multidimensional) Brownian motion.
Modern QF models feature either more sophisticated payoffs (for example, path-dependent ones that depend on $X_{[t,T]}$ and not just on $X_T$) or more sophisticated dynamics than in Equation 3, precluding analytic expressions for $\Pi(t, x)$. For example, the Heston stochastic volatility model has two-dimensional $X_t = (S_t, v_t)$ with five parameters $\theta = (r, \kappa, \eta, \sigma, \rho)$ and dynamics
$$dS_t = r S_t\,dt + \sqrt{v_t}\, S_t\,dW_t^1; \qquad dv_t = \kappa(\eta - v_t)\,dt + \sigma \sqrt{v_t}\,dW_t^2, \qquad d\langle W^1, W^2 \rangle_t = \rho\,dt. \tag{7}$$
Evaluating a single call or put price requires either solving a PDE or running a Monte Carlo simulation in order to integrate $g(\cdot)$ against the bivariate conditional distribution of $(S_T, v_T)$ (Glasserman 2004); quants seek the broader surrogate $(t, S, v, \theta) \mapsto \hat{\Pi}(t, S, v, \theta)$.
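The Monte Carlo route for Equation 7 can be sketched with a simple Euler discretization. The snippet below uses a log-Euler step for the price and a full-truncation fix for the variance; this particular scheme, the function name `heston_call_mc`, and the parameter values are assumptions of this example rather than the method prescribed in the text.

```python
import numpy as np

def heston_call_mc(S0, v0, K, T, r, kappa, eta, sigma, rho, n_paths, n_steps, rng):
    """Monte Carlo call price (and standard error) under Heston dynamics, Euler scheme."""
    dt = T / n_steps
    log_S = np.full(n_paths, np.log(S0))
    v = np.full(n_paths, float(v0))
    for _ in range(n_steps):
        z1 = rng.standard_normal(n_paths)
        z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_paths)  # correlation rho
        v_pos = np.maximum(v, 0.0)            # full truncation keeps the square root well defined
        log_S += (r - 0.5 * v_pos) * dt + np.sqrt(v_pos * dt) * z1
        v += kappa * (eta - v_pos) * dt + sigma * np.sqrt(v_pos * dt) * z2
    payoff = np.exp(-r * T) * np.maximum(np.exp(log_S) - K, 0.0)
    return payoff.mean(), payoff.std(ddof=1) / np.sqrt(n_paths)

rng = np.random.default_rng(42)
price, stderr = heston_call_mc(100.0, 0.04, 100.0, 1.0, 0.05, 2.0, 0.04, 0.3, -0.7,
                               50_000, 100, rng)
```

A useful sanity check is the degenerate case σ = 0 with v0 = η, where the variance stays constant and the estimate should agree with the Black–Scholes price for volatility √η up to Monte Carlo error.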
1.4.1. Option hedging. A hedging strategy is a sequence of functions h(k, ·) : Rd → R such that
h(k, x) specifies holdings of the asset S in period tk given overall state x. The strategy is assumed
to be self-financing; that is, cash is dynamically moved in or out of a riskless savings account to
account for intermediate profits and losses without additional cash flows. The resulting discounted
wealth at T is the discrete stochastic integral
$$V_T := (h \cdot S)_T = V_0 + \sum_{k=0}^{K_T - 1} h(k, X_{t_k})\,[S(X_{t_{k+1}}) - S(X_{t_k})] - c(h(k, X_{t_k})), \tag{8}$$
where $c(\cdot)$ captures transaction costs. The goal of hedging is to minimize the expected hedging error $\mathbb{E}[L(e^{-rT} g(X_T), V_T)]$ relative to the discounted option payoff $e^{-rT} g(X_T)$. As usual, this is evaluated using a batch of N samples of $X_{0:K_T}$, for example, mean trading error $W_{MTE}(g(x_T^{1:N}), v_T^{1:N}) := \frac{1}{N} \sum_n \{e^{-rT} g(x_T^n) - v_T^n\}$ and mean absolute trading error $W_{MATE}(g(x_T^{1:N}), v_T^{1:N}) = \frac{1}{N} \sum_n |e^{-rT} g(x_T^n) - v_T^n|$. As in option pricing, one is interested in surrogates $\hat{h}(k, \cdot)$ for the hedging ratios. Observe how the above fitness criteria are highly implicit with respect to $\hat{h}(k, \cdot)$. Training can be model-free, using historical paths of $(S_t)$.

Bermudan option: American-style contract that can be exercised at a discrete set of prespecified time instances
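The hedging recursion in Equation 8 and the two trading-error metrics can be illustrated with a small simulation. This sketch assumes Black–Scholes dynamics, zero transaction costs, the Black–Scholes Delta as the strategy h, an initial wealth equal to the Black–Scholes price, and gains computed on the discounted asset price; all names and parameter values are illustrative.

```python
import numpy as np
from math import erf

norm_cdf = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / np.sqrt(2.0))))

def bs_call_and_delta(S, K, r, sigma, tau):
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    n_d1 = norm_cdf(d1)
    return S * n_d1 - K * np.exp(-r * tau) * norm_cdf(d1 - sigma * np.sqrt(tau)), n_d1

def hedging_errors(S0, K, r, sigma, T, n_steps, n_paths, rng):
    """Discounted payoff minus discounted terminal wealth V_T, one value per simulated path."""
    dt = T / n_steps
    S = np.full(n_paths, float(S0))
    V0, _ = bs_call_and_delta(float(S0), K, r, sigma, T)
    V = np.full(n_paths, float(V0))                       # discounted wealth, starts at V_0
    for k in range(n_steps):
        _, h = bs_call_and_delta(S, K, r, sigma, T - k * dt)   # holdings h(k, X_{t_k})
        z = rng.standard_normal(n_paths)
        S_next = S * np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)
        # Self-financing gain on the discounted asset price over [t_k, t_{k+1}]
        V += h * (np.exp(-r * (k + 1) * dt) * S_next - np.exp(-r * k * dt) * S)
        S = S_next
    return np.exp(-r * T) * np.maximum(S - K, 0.0) - V

rng = np.random.default_rng(0)
errors = hedging_errors(100.0, 100.0, 0.05, 0.2, 1.0, n_steps=50, n_paths=10_000, rng=rng)
W_MTE, W_MATE = errors.mean(), np.abs(errors).mean()
```

With more frequent rebalancing the absolute trading error should shrink, while the mean trading error stays close to zero because the paths are simulated under the pricing measure.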
1.4.2. American options. In American-style contracts, the buyer can collect her payoff at any time $\tau \le T$ prior to maturity. This choice is made dynamically as market conditions unfold. The pricing problem for an American option with payoff $g(t, x)$ (we notationally subsume discounting) is known to reduce to finding the decision rule τ, namely a stopping time to maximize expected reward g,
$$\mathbb{E}^Q\big[g(\tau, X_\tau)\big] \to \max! \tag{9}$$
where each $H_\ell : \mathbb{R}^{n_{\ell-1}} \to \mathbb{R}^{n_\ell}$ is an affine map $H_\ell(x) = W_\ell \cdot x + w_\ell$, and $\phi_\ell$'s are the nonlinear activation functions applied coordinate-wise. Common activation functions are the rectified linear unit (ReLU) $\phi(x) = \max(x, 0)$ and exponential linear unit (ELU) $\phi(x) = \max(x, 0) + \min(e^x - 1, 0)$. The parameter $n_\ell$ is the number of nodes in layer $\ell = 1, \ldots, L$. Training an NN means learning the matrices $\hat{W}_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ and the vectors $\hat{w}_\ell \in \mathbb{R}^{n_\ell}$ such that the NN outputs $\hat{\mathcal{N}}(x^i)$ are close to the observed outputs $y^i$. The total number of the NN hyperparameters ϑ is $|\vartheta| = \sum_{\ell=1}^{L} n_\ell \times (n_{\ell-1} + 1)$ and is often in the thousands. In practice, the nonconvex optimization problem to find $\hat{W}_\ell$ and $\hat{w}_\ell$ is solved via gradient descent, gradually improving weights $\vartheta^{(j)}$ as training minibatches indexed by j are considered. The weights are randomly initialized with samples $\vartheta^{(0)}$ drawn from either the uniform or Gaussian distributions. One common procedure with the ReLU activation function initializes the weights in layer ℓ according to $w_\ell^{(0)} \sim \mathcal{N}(0, 2/n_\ell)$ (Ferguson & Green 2018). To enable such generic initialization all inputs (and outputs) are prescaled into the $[0, 1]^d$ hypercube.
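The layered structure, the parameter count, and the Gaussian initialization can be sketched in plain NumPy. This example assumes a linear output layer, zero-initialized bias vectors, and an initialization variance of 2 divided by the layer width as written above; the layer sizes are arbitrary.

```python
import numpy as np

def elu(x):
    """Exponential linear unit: x for x > 0 and exp(x) - 1 otherwise."""
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def init_network(layer_sizes, rng):
    """Random Gaussian weight matrices W_l and zero bias vectors w_l for each layer."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / n_out), size=(n_out, n_in))
        params.append((W, np.zeros(n_out)))
    return params

def forward(params, x):
    """Compose affine maps and activations; the last layer is left linear (an assumption)."""
    a = x
    for ell, (W, w) in enumerate(params):
        a = a @ W.T + w
        if ell < len(params) - 1:
            a = elu(a)
    return a

layer_sizes = [2, 64, 64, 1]          # input dimension d = 2, two hidden layers, scalar output
params = init_network(layer_sizes, np.random.default_rng(0))
n_params = sum(W.size + w.size for W, w in params)
out = forward(params, np.random.default_rng(1).random((5, 2)))   # inputs prescaled to [0, 1]^2
```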
Several features of NNs are useful for QF contexts. First, the universal approximation prop-
erty asserts that NNs are able to approximate continuous functions arbitrarily well; that is, it
guarantees sufficient flexibility given NNs with enough parameters. Second, NNs support extensive offline training together with fast prediction, resolving many of the scalability limitations of other surrogates. Third, minibatch training of NNs works well with streamed data.
$K_*^T = [\kappa(x_*, x^1; \vartheta), \ldots, \kappa(x_*, x^N; \vartheta)]$
and $K$ is an $N \times N$ covariance matrix described through the kernel function $\kappa(\cdot, \cdot; \vartheta)$.
GP fitting corresponds to selecting an appropriate function space $\mathcal{H} \equiv \mathcal{H}_\vartheta$ by optimizing the hyperparameters ϑ that drive $m(\cdot)$ and $\kappa(\cdot, \cdot)$. This is done in a hierarchical manner, first fixing a kernel family and then using maximum likelihood optimization to infer ϑ given $\mathcal{D}$. Once ϑ is chosen, the kriging equations (Equation 10) yield the surrogate output.
A popular choice for $\kappa(\cdot, \cdot)$ is the (anisotropic) squared exponential (SE) family, parameterized by the length scales $\{\ell_{len,k}\}_{k=1}^d$ and the process variance $\sigma_p^2$:
$$\kappa_{SE}(x, x') := \sigma_p^2 \exp\Big(-\sum_{k=1}^d \frac{(x_k - x_k')^2}{2 \ell_{len,k}^2}\Big). \tag{11}$$
Other popular kernels are the Matérn-5/2 and Matérn-3/2 families (Roustant et al. 2012):
$$\kappa_{M52}(x, x') := \sigma_p^2 \prod_{k=1}^d \Big(1 + \frac{\sqrt{5}}{\ell_{len,k}} |x_k - x_k'| + \frac{5}{3 \ell_{len,k}^2} (x_k - x_k')^2\Big)\, e^{-\frac{\sqrt{5}}{\ell_{len,k}} |x_k - x_k'|},$$
$$\kappa_{M32}(x, x') := \sigma_p^2 \prod_{k=1}^d \Big(1 + \frac{\sqrt{3}}{\ell_{len,k}} |x_k - x_k'|\Big)\, e^{-\frac{\sqrt{3}}{\ell_{len,k}} |x_k - x_k'|}.$$
The GP kernel $\kappa(x, x')$ controls the smoothness of $m_*(\cdot)$. The SE kernel (Equation 11) yields infinitely differentiable fits $m_*(\cdot)$, while a Matérn kernel of order $k + 1/2$ yields approximators that are in $C^k$. Thus Matérn-3/2 surrogates are in $C^1$ and Matérn-5/2 surrogates are in $C^2$. Note that modulo rescaling, all the above belong to the class of radial basis functions, where $\kappa(x, x')$ is a function of $|x - x'|$ only; nonseparable kernels are also available.
More advanced GP surrogates that have been considered in QF contexts include shape-constrained GPs (Cousin et al. 2016, Chataigner et al. 2021), multioutput GPs (Crépey & Dixon 2020) that can simultaneously learn multiple pricing functions (useful for portfolio valuation), and heteroskedastic GPs (Binois et al. 2018). GP kernels can also be composed via addition, multiplication, and convolution (Duvenaud 2014).

Feature engineering: selecting the best subset of pricing features to feed a statistical surrogate
GPs intrinsically support updating via iterated conditioning and also excel in interpolating within sparse data environments where $\mathcal{D}$ is small. A further key feature of GP models is uncertainty quantification via the posterior covariance $\mathrm{Cov}(f(x_{*1}), f(x_{*2}) \mid \mathcal{D}) = \kappa(x_{*1}, x_{*2}) - K_1^T [K + \sigma_\epsilon^2 I]^{-1} K_2$, where $K_i = [\kappa(x_{*i}, x^1; \vartheta), \ldots, \kappa(x_{*i}, x^N; \vartheta)]$ for $i = 1, 2$. The interpretation is that $x_* \mapsto m_*(x_*)$ is the most likely input–output map that is consistent with the training data set $\mathcal{D}$, and $\mathrm{Var}(f(x_*) \mid \mathcal{D})$ is the model uncertainty capturing the range of other potential input–output
fits. The latter allows for internal model assessment. For example, high predictive uncertainty can
allow the modeler to reject the existing surrogate prediction in favor of either retraining it or even
using direct evaluation of the underlying data generator (Crépey & Dixon 2020).
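The posterior mean and covariance formulas can be sketched directly in NumPy. The example below uses the SE kernel of Equation 11 with hyperparameters fixed by hand rather than inferred by maximum likelihood, a constant prior mean `beta0`, and a toy sine response; these choices and the function names are assumptions made for illustration.

```python
import numpy as np

def se_kernel(A, B, ell, sigma_p2):
    """Anisotropic squared exponential kernel between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) / ell) ** 2
    return sigma_p2 * np.exp(-0.5 * d2.sum(axis=2))

def gp_predict(X, y, X_star, ell, sigma_p2, sigma_eps2, beta0):
    """Posterior mean and covariance of f at X_star given noisy data (X, y)."""
    K = se_kernel(X, X, ell, sigma_p2) + sigma_eps2 * np.eye(len(X))
    K_star = se_kernel(X_star, X, ell, sigma_p2)          # cross-covariances with training inputs
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - beta0))
    mean = beta0 + K_star @ alpha
    V = np.linalg.solve(L, K_star.T)
    cov = se_kernel(X_star, X_star, ell, sigma_p2) - V.T @ V
    return mean, cov

X = np.linspace(0.0, 1.0, 8)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0])                         # toy response, illustrative only
X_star = np.vstack([X, [[5.0]]])                          # training sites plus one far-away input
mean, cov = gp_predict(X, y, X_star, ell=np.array([0.15]), sigma_p2=1.0,
                       sigma_eps2=1e-4, beta0=0.0)
```

At the training sites the posterior variance is small, while at the far-away input it reverts to the process variance, which is the signal a modeler could use to reject a prediction.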
where the base models $h_j(\cdot)$ are obtained sequentially and ν is the learning rate. The key concept of GB is that the base $h_j$'s are obtained in a stagewise manner; that is, $f_j(x) := h_0 + \sum_{k \le j} \nu h_k(x) = f_{j-1}(x) + \nu h_j(x)$ is learned by fitting $h_j(\cdot)$ conditional on $f_{j-1}$ and the jth training batch of size $N_j$. The most common base model is a decision tree; the shrinkage parameter $\nu < 1$ is used for stability. Fitting $h_j$ is done greedily by minimizing the loss function $L(f_{j-1} + \nu h_j, \mathcal{D}) = \min_h L(f_{j-1}(x^{1:N_j}) + \nu h(x^{1:N_j}), y^{1:N_j})$ over base h's. In other words, the next base model is chosen to minimize the residual errors from the previous model over the new training batch analogously to
a gradient descent step. Modern versions of GB, such as LightGBM (Ke et al. 2017), include
multiple machine learning enhancements that stabilize the surrogate and mitigate overfitting
(e.g., dropout, randomly removing some of the earlier hk ’s when solving for hj in order to lower
the influence of the first few base models). Because the base models are trees, feature engineering
is critical for GB in order to more easily find good leaf splits during training.
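The stagewise construction can be illustrated with a bare-bones gradient boosting loop. This sketch uses one-split decision stumps on a single feature, the squared loss (so each base model is fit to the current residuals), and the full training set in every round instead of separate batches; all of these are simplifying assumptions, and production libraries such as LightGBM are considerably more elaborate.

```python
import numpy as np

def fit_stump(x, resid):
    """Least-squares decision stump (a single split) on a one-dimensional feature."""
    best = (np.inf, None, resid.mean(), resid.mean())
    for s in np.unique(x)[:-1]:
        left, right = resid[x <= s], resid[x > s]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, c_left, c_right = best
    return lambda z: np.where(z <= s, c_left, c_right)

def gradient_boost(x, y, n_rounds=100, nu=0.1):
    """Return the boosted predictor f_J and the training MSE after each round."""
    h0 = y.mean()
    pred = np.full_like(y, h0)
    stumps, mse_path = [], [np.mean((y - pred)**2)]
    for _ in range(n_rounds):
        h = fit_stump(x, y - pred)          # squared loss: negative gradient equals the residual
        pred = pred + nu * h(x)             # f_j = f_{j-1} + nu * h_j
        stumps.append(h)
        mse_path.append(np.mean((y - pred)**2))
    return (lambda z: h0 + nu * sum(h(z) for h in stumps)), np.array(mse_path)

x = np.linspace(0.0, 1.0, 200)
y = np.maximum(x - 0.5, 0.0)                # call-like target, illustrative only
f_hat, mse_path = gradient_boost(x, y)
```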
train on a small subset of the Chebyshev grid and then use a completion algorithm via a low-rank approximation. This is combined with gradually adding training sites and adaptively increasing the rank. The tensor of the coefficients c is obtained using tensor-matrix multiplication and the fast Fourier transform. Chebyshev interpolation offers extensive convergence theory.

3. LEARNING TO PRICE OPTIONS
Daily tasks in managing financial positions involve numerous calculations of contract prices, hedging ratios, risk measures, and value adjustments. As the underlying stochastic models have become more sophisticated, their implementation is more computationally intensive. As a result, computational gains afforded by cloud computing, parallel processing, faster processors, and better algorithms are canceled out by the ever higher number of computations to be made (e.g., more valuation adjustments to account for credit risk or funding, more accuracy, more risk sensitivities, new structured products) and by each computation taking longer. Nowadays, trading desks do millions of similar computational tasks each day.

Call option: the right but not the obligation to buy the underlying asset for $K
Bermudan swaption: the right but not the obligation to enter into an interest rate swap on any of the predetermined dates t_k prior to T
To make these repeated calculations efficient and lightning fast, a surrogate is employed to
learn the price of the option as a function of the underlying price and contract characteristics.
Indeed, a derivatives valuation model is ultimately a function that maps inputs, consisting of
market data and trade-specific terms, to an output representing the option value. The basic loss
function is the squared distance LMSE between observed (simulated) option prices and surrogate
predictions. While a simple contract such as the call option in a Black–Scholes model has a
total of five inputs, more complex products such as Bermudan swaptions in Libor models have
valuation functions with hundreds of inputs, involving all the properties of the underlying swap
and option exercise schedule.
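As a toy version of this workflow, the sketch below builds a training set of Black–Scholes call prices over moneyness and time to maturity, fits a surrogate by minimizing the squared loss, and checks it on held-out inputs. The tensor Chebyshev polynomial basis, the input ranges, and the fixed r and σ are arbitrary choices for this illustration; the text's surrogates (NNs, GPs, and so on) would replace the least-squares fit.

```python
import numpy as np
from math import erf
from numpy.polynomial.chebyshev import chebvander2d

norm_cdf = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / np.sqrt(2.0))))

def bs_call_over_strike(m, tau, r=0.04, sigma=0.22):
    """Black-Scholes call price divided by strike, as a function of moneyness m = S/K."""
    d1 = (np.log(m) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    return m * norm_cdf(d1) - np.exp(-r * tau) * norm_cdf(d1 - sigma * np.sqrt(tau))

def features(m, tau, deg=6):
    # Rescale both inputs to [-1, 1] before building the polynomial basis
    return chebvander2d((m - 1.0) / 0.2, (tau - 0.6) / 0.4, [deg, deg])

rng = np.random.default_rng(0)
m_train, tau_train = rng.uniform(0.8, 1.2, 2000), rng.uniform(0.2, 1.0, 2000)
coef, *_ = np.linalg.lstsq(features(m_train, tau_train),
                           bs_call_over_strike(m_train, tau_train), rcond=None)

def surrogate(m, tau):
    """Cheap proxy for the pricing map (m, tau) -> price / K."""
    return features(m, tau) @ coef

m_test, tau_test = rng.uniform(0.8, 1.2, 500), rng.uniform(0.2, 1.0, 500)
rmse = np.sqrt(np.mean((surrogate(m_test, tau_test)
                        - bs_call_over_strike(m_test, tau_test))**2))
```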
The most well-studied surrogates for this task are NNs. Following the early pioneering work
by Hutchinson et al. (1994), the field has expanded rapidly in recent years, with Culkin & Das (2017) being
the first to apply deep NNs. The recent survey by Ruf & Wang (2020) presents an exhaustive and
monumental table that lists more than 150 papers that investigated NNs for option pricing. They
compare extant proposals across six categories, including model setup, NN setup, and case studies
considered. Complementary literature has considered GPs (De Spiegeleer et al. 2018, Crépey &
Dixon 2020), Chebyshev polynomials (Olivares & Alvarez 2016, Gaß et al. 2018, Glau et al. 2020),
GB (Davis et al. 2020), and cubic splines (Olivares & Alvarez 2016).
implied volatility $\sigma_{Imp}$. The most common surrogate output y is the option price; if working with moneyness features, the output could also be the option price divided by its strike $\Pi/K$.
Among test cases, existing literature usually sticks to the most liquid contracts, such as on the S&P 500 and DAX indices. Noise, limited observations, and special geometry (e.g., only certain strikes being liquid) are all challenges in working with real data sets and imply that some methods may not translate well from synthetic to real-life applications.

Implied volatility: the value $\sigma_{Imp}$ that inverts the Black–Scholes formula for a given call/put price $P_{Mkt}(t, S)$ and known r, K, T
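Inverting the Black–Scholes formula for the implied volatility has no closed form but is a one-dimensional root-finding problem. The sketch below uses plain bisection, which applies because the call price is increasing in volatility; the bracket `[lo, hi]`, the tolerance, and the function names are illustrative assumptions.

```python
from math import erf, exp, log, sqrt

def bs_call(S, K, r, sigma, tau):
    """Black-Scholes call price."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    cdf = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    return S * cdf(d1) - K * exp(-r * tau) * cdf(d2)

def implied_vol(price, S, K, r, tau, lo=1e-4, hi=5.0, tol=1e-10):
    """Volatility in [lo, hi] at which the Black-Scholes call price matches `price`."""
    if not (bs_call(S, K, r, lo, tau) <= price <= bs_call(S, K, r, hi, tau)):
        raise ValueError("price is not attainable for a volatility in [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, r, mid, tau) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

market_price = bs_call(42.0, 40.0, 0.04, 0.22, 0.3)     # synthetic "market" quote
sigma_imp = implied_vol(market_price, 42.0, 40.0, 0.04, 0.3)
```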
3.2. Training
To approximate a derivatives valuation function, the modeler must select the domain of application $\bar{\mathcal{D}}$ and then the actual (discrete) training set $\mathcal{D} \subset \bar{\mathcal{D}}$. Since the surrogate by construction minimizes the error between its estimates and the training data that it is presented, it is important to choose a representative domain. The trade-off between larger and smaller $\bar{\mathcal{D}}$ is one of generality
versus model complexity and training time (Ferguson & Green 2018). For example, in the case of
a Bermudan swaption we may choose to take as given the properties of a specific trade and then
train the model against only a variety of input market data scenarios. Lower-dimensional D̄ will
lead to smaller surrogates and hence lower training requirements (size of D, amount of time spent
training, etc.).
The sampling distribution underlying D is also crucial. For example, there is little point in
generating a large volume of examples that yield zero option value. Similarly, regions of rapidly
changing valuations need to be assigned more training data. Hence, the sampling distribution
should often be nonuniform and capture financial dependence between its inputs. One may
also complement with synthetic boundary inputs to enforce out-of-sample extrapolation shape
(Ackerer et al. 2020, Ludkovski & Saporito 2022).
In terms of training the NN surrogate itself, Garcia & Gençay (2000) and Gençay & Qi (2001)
look at enhancements such as early stopping and bagging. Gonon & Schwab (2021) establish con-
vergence rates for deep NN surrogates of option pricing in exponential Lévy models, exploiting
the relative smoothness of the latter functionals and pointing out the role of different activation
functions. Specifically, they show that $O(\epsilon^{-1})$ neurons are needed to achieve an error of at most $\epsilon$.
3.3. No Arbitrage
Option pricers must satisfy certain model-free constraints imposed by no arbitrage. For example, for call options, no arbitrage implies (we provide the financial names for the respective constraints on price sensitivities)
■ a calendar spread constraint, where prices increase in time to maturity T − t;
■ a bull spread constraint, where prices are increasing in moneyness S/K; and
■ a butterfly spread constraint, where prices are convex in moneyness.
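The three constraints above can be checked numerically on a grid of surrogate prices using finite differences. The sketch below assumes prices tabulated on an increasing maturity grid and an equally spaced moneyness grid, and uses Black–Scholes prices as a stand-in for surrogate output; the function `check_no_arbitrage` and the tolerance are illustrative.

```python
import numpy as np
from math import erf

norm_cdf = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / np.sqrt(2.0))))

def bs_call_over_strike(m, tau, r=0.04, sigma=0.22):
    d1 = (np.log(m) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    return m * norm_cdf(d1) - np.exp(-r * tau) * norm_cdf(d1 - sigma * np.sqrt(tau))

def check_no_arbitrage(prices, tol=1e-10):
    """prices[i, j]: call price at time to maturity tau_i and moneyness m_j (both increasing,
    moneyness equally spaced so that second differences test convexity)."""
    return {
        "calendar": bool(np.all(np.diff(prices, axis=0) >= -tol)),        # increasing in maturity
        "bull": bool(np.all(np.diff(prices, axis=1) >= -tol)),            # increasing in moneyness
        "butterfly": bool(np.all(np.diff(prices, n=2, axis=1) >= -tol)),  # convex in moneyness
    }

tau_grid = np.linspace(0.1, 1.0, 10)
m_grid = np.linspace(0.7, 1.3, 25)
surface = bs_call_over_strike(m_grid[None, :], tau_grid[:, None])
report = check_no_arbitrage(surface)

bumped = surface.copy()
bumped[5, 12] += 0.05                       # artificial spike that breaks convexity
report_bumped = check_no_arbitrage(bumped)
```

The same finite differences could be turned into penalty terms by summing the negative parts instead of testing their sign.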
Dugas et al. (2009) achieve global arbitrage-free predicted prices using hard constraints through a special one-layer network architecture. Yang et al. (2017) use a gated network structure, constrained to have nonnegative weights and certain activation functions. Chataigner et al. (2020) build a related, sparsely connected architecture with a convex activation function $\mathrm{softplus}(x) = \log(1 + e^x)$. An alternative is to penalize no-arbitrage violations, known as soft constraints (Ackerer et al. 2020, Chataigner et al. 2020). Itkin (2019) introduces penalty functions to enforce the positivity of the first and second derivatives of $\Pi$ with respect to maturity T and strike K, respectively, in addition to the negativity of the first derivative with respect to K. Note that such penalization is done during training and hence does not necessarily apply to out-of-sample inputs.
offs, unlike calls and puts). The translation is achieved through the Dupire formula that converts implied volatilities into a local volatility surface (Gatheral 2011),
$$\frac{\sigma^2(T, K)}{2} = \frac{\partial_T \Pi(T, K) + r K\, \partial_K \Pi(T, K)}{K^2\, \partial^2_{KK} \Pi(T, K)}.$$
Observe that above, the right-hand side is required to be positive; that is, the surrogate for $\Pi$ must be constrained to have nonnegative $\partial_T$ and nonnegative $\partial^2_{KK}$.

shares will replicate the option payoff, achieving perfect risk elimination
Theta: sensitivity of option value to time to maturity T − t
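The Dupire formula can be sanity-checked numerically: if the price surface comes from a Black–Scholes model with constant volatility, the recovered local volatility should equal that constant. The sketch below approximates the partial derivatives with central finite differences; the step sizes and the helper name `dupire_local_vol` are assumptions of this example, and a surrogate would normally supply the derivatives instead.

```python
from math import erf, exp, log, sqrt

def bs_call(S, K, r, sigma, T):
    """Black-Scholes call price as a function of strike K and maturity T."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    cdf = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    return S * cdf(d1) - K * exp(-r * T) * cdf(d2)

def dupire_local_vol(price_fn, T, K, r, dT=1e-4, dK=1e-2):
    """Local volatility sigma(T, K) from a price surface price_fn(T, K) via finite differences."""
    dP_dT = (price_fn(T + dT, K) - price_fn(T - dT, K)) / (2.0 * dT)
    dP_dK = (price_fn(T, K + dK) - price_fn(T, K - dK)) / (2.0 * dK)
    d2P_dK2 = (price_fn(T, K + dK) - 2.0 * price_fn(T, K) + price_fn(T, K - dK)) / dK**2
    half_variance = (dP_dT + r * K * dP_dK) / (K**2 * d2P_dK2)
    return sqrt(2.0 * half_variance)

S0, r, sigma = 40.0, 0.04, 0.22
local_vol = dupire_local_vol(lambda T, K: bs_call(S0, K, r, sigma, T), T=0.3, K=40.0, r=r)
```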
Ackerer et al. (2020) train NN surrogates to observed implied volatilities using a mix of L1 and
L2 norms and penalize violations of the above constraints (see also Zheng et al. 2021). They point
out that inputting $\sigma_{Imp}$ is more stable statistically than training to observed prices. Chataigner et al. (2020) use the implied volatility surface as an intermediate transformation: both inputs and desired outputs are option prices, and $\sigma_{Imp}$ is solely an intermediate output of the surrogate.
Fengler (2009) uses natural cubic splines with monotonicity and convexity constraints to build an
arbitrage-free implied volatility surface. Glau et al. (2019a) learn implied volatility surfaces using
Chebyshev interpolation.
for arbitrary time t and underlying price S. The functional $\Pi(\cdot, \cdot)$ is not directly known, and one is provided a training set $\mathcal{D} = \{(t^i, S^i, y^i) : i = 1, \ldots, N\}$, where $y^i \simeq \Pi(t^i, S^i)$. This is very similar to option pricing, except now the task is to learn $(t, S) \mapsto \Delta(t, S)$.
One approach is to fit a surrogate $(t, S) \mapsto \hat{\Pi}(t, S)$ and then set $\hat{\Delta}(t, S) := \partial \hat{\Pi}(t, S) / \partial S$. The second step can be done either analytically for certain surrogates or via backpropagation, that is, adjoint algorithmic differentiation (Capriotti & Giles 2012, Capriotti et al. 2017). Surrogate construction for Greeks estimation must be approached with care, since for some classes, such as a spline-based $\hat{\Pi}$, differentiation is known to lead to highly unstable or even nonsensical estimates for $\hat{\Delta}$ and other gradients (Jain & Oosterlee 2015). This is because the typical $L^2$ criterion $L_{MSE}$ that is driving the fitting of $\hat{\Pi}(t, S)$ is completely unaware of the subsequent plan to compute gradients. Indeed, assessment of $\hat{\Delta}$ is typically based on properties of the tracking error $V_T$ in Equation 8.
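The two-step approach (fit a price surrogate, then differentiate it) can be sketched with a polynomial surrogate whose derivative is available analytically. This example fits a Chebyshev series to noise-free Black–Scholes prices at a fixed time to maturity and compares its derivative with the exact Delta; the degree, grid, and parameter values are arbitrary, and with noisy training prices the derivative would be considerably less reliable, as cautioned above.

```python
import numpy as np
from math import erf
from numpy.polynomial import Chebyshev

norm_cdf = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / np.sqrt(2.0))))
K, r, sigma, tau = 40.0, 0.04, 0.22, 0.3

def d1(S):
    return (np.log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))

def bs_call(S):
    return S * norm_cdf(d1(S)) - K * np.exp(-r * tau) * norm_cdf(d1(S) - sigma * np.sqrt(tau))

def bs_delta(S):
    return norm_cdf(d1(S))

S_train = np.linspace(30.0, 50.0, 400)
price_hat = Chebyshev.fit(S_train, bs_call(S_train), deg=16, domain=[30.0, 50.0])
delta_hat = price_hat.deriv()               # analytic derivative of the fitted series in S

S_test = np.linspace(32.0, 48.0, 50)
max_delta_err = np.max(np.abs(delta_hat(S_test) - bs_delta(S_test)))
```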
GPs have been considered for this task by Crépey & Dixon (2020), Chataigner et al. (2021),
and Ludkovski & Saporito (2022), relying on the availability of analytic derivatives

$$\frac{\partial m_*}{\partial x_j}(x_*) = \frac{\partial m}{\partial x_j}(x_*) + \frac{\partial \kappa}{\partial x_j}(x_*, x_{1:N}) (K + \sigma_\epsilon^2 I)^{-1} (y - m),$$

which admit closed-form expressions for the three
kernel families provided in Section 2.2. On the one hand, this reduces the error in $\hat\Delta$ since only a
single approximation is needed, and the differentiation is exact. On the other hand, GPs offer an
in-model assessment of the accuracy of $\hat\Delta$ by rigorously propagating the underlying uncertainty
about $\hat\Pi$. The outputted credible bands around $\hat\Delta$ guide the end user on how well the surrogate
learned the desired Greek. This information is relevant for trading purposes, such as in the con-
text of no-transaction regions under trading costs (Whalley & Wilmott 1997). Chataigner et al.
(2021) consider shape-constrained GPs since arbitrage constraints heavily impact the shape of the
Greeks (such as $\Delta$ being monotone).
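The analytic GP derivative can be sketched in a few lines. The example below is a hedged illustration that assumes a squared-exponential kernel with fixed, hand-picked hyperparameters (`eta`, `ell`, `noise_sd` are hypothetical names and values; no maximum likelihood fitting is performed) and a constant prior mean estimated by the sample average, so that $\partial m/\partial x_j = 0$. A toy response $\sin(x)$ stands in for the option price so that the exact derivative is known.

```python
import numpy as np

def se_kernel(a, b, eta, ell):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return eta**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_mean_and_gradient(x_star, x_train, y_train, eta=1.0, ell=1.0, noise_sd=1e-2):
    """Posterior mean and its analytic derivative at the points x_star."""
    beta0 = y_train.mean()                      # constant prior mean (assumption)
    K = se_kernel(x_train, x_train, eta, ell)
    # alpha = (K + sigma_eps^2 I)^{-1} (y - m)
    alpha = np.linalg.solve(K + noise_sd**2 * np.eye(len(x_train)), y_train - beta0)
    k_star = se_kernel(x_star, x_train, eta, ell)
    mean = beta0 + k_star @ alpha
    # d kappa / d x_star for the squared-exponential kernel
    dk_star = -(x_star[:, None] - x_train[None, :]) / ell**2 * k_star
    grad = dk_star @ alpha                      # prior-mean derivative is zero
    return mean, grad

# Toy data: the response sin(x) has the known derivative cos(x).
x_train = np.linspace(0.0, 2.0 * np.pi, 40)
y_train = np.sin(x_train)
x_star = np.linspace(1.0, 5.0, 9)
mean, grad = gp_mean_and_gradient(x_star, x_train, y_train)
```

Only the posterior mean of the gradient is computed here; the credible bands discussed in the text would additionally require the posterior covariance of the derivative process.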
NN surrogates for Greek estimation have been considered by Chataigner (2021), Chataigner
et al. (2021), Jain & Oosterlee (2015), and Davis et al. (2020). NNs are attractive thanks to their uni-
form approximation, not just for smooth response functions but also for their derivatives (Hornik
et al. 1990). Chebyshev interpolation for Greeks is proposed by Maran et al. (2021).
Figure 1 displays the estimated Delta of a call option in a Black–Scholes model (Equation 3).
We employ two different surrogate types: GPs and NNs. The first method employs a Matérn-
5/2 covariance kernel and a constant prior mean $m(x) = \beta_0$; the NN employs the ELU activation
function, with $L = 3$ layers and $n_\ell = 64$ neurons. The GP is trained via maximum likelihood and
implemented in R; the NN is trained via the Adam algorithm and implemented in Python. For
both cases the training set is two dimensional across time to maturity $\tau$ and stock price $S$, fixing
the option strike $K = 40$, interest rate $r = 0.04$, and volatility $\sigma = 0.22$. Call price data come
from noisy Monte Carlo samples, using $\check{N} = 400$ draws of $S_T$ to approximate
$E^{Q}[e^{-r(T-t)}(S_T - K)^+ \mid S_t = s]$ at $N = 400$ gridded input sites. We observe some errors around the edges with a
nonmonotone $\hat\Delta$ and violation of the no-arbitrage restriction on the far right, $\hat\Delta > 1$. The fact that
the surrogate prediction is better in the middle of the training domain $\bar{\mathcal{D}}$ is very typical. Ludkovski
& Saporito (2022) provide more on the role of experimental design, demonstrating, for instance,
that learning sensitivities on a grid is much faster than on an irregularly sampled domain.

Figure 1
Estimated Delta $S \mapsto \hat\Delta(t, S)$ of a Black–Scholes call with parameters $K = 40$, $r = 0.04$, and $\sigma = 0.22$, and
time to maturity $T - t = 0.3$ as a function of $S$. Abbreviations: GP, Gaussian process; NN, neural network.
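A training set of the kind described above can be generated as in the following sketch. The strike, rate, volatility, number of draws (400), and number of sites (400) follow the text; the grid ranges for $\tau$ and $S$, the 20 × 20 layout, and the random seed are assumptions, since the source does not report them.

```python
import numpy as np

K, r, sigma = 40.0, 0.04, 0.22   # from the text
n_draws = 400                    # Monte Carlo draws per input site, from the text
rng = np.random.default_rng(0)   # seed chosen arbitrarily

def mc_call_price(s, tau, n_draws, rng):
    """Noisy Monte Carlo estimate of E^Q[e^{-r tau} (S_T - K)^+ | S_t = s]."""
    z = rng.standard_normal(n_draws)
    s_T = s * np.exp((r - 0.5 * sigma**2) * tau + sigma * np.sqrt(tau) * z)
    return np.exp(-r * tau) * np.maximum(s_T - K, 0.0).mean()

# Hypothetical 20 x 20 grid, giving N = 400 input sites (tau, S).
taus = np.linspace(0.05, 0.5, 20)
stocks = np.linspace(25.0, 55.0, 20)
design = np.array([(tau, s) for tau in taus for s in stocks])

# Noisy responses y_i, one per input site.
y = np.array([mc_call_price(s, tau, n_draws, rng) for tau, s in design])
```

Any surrogate (GP, NN, or otherwise) would then be fitted to the pairs `(design, y)`, with the observation noise level determined by `n_draws`.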
An alternative way to learn the Delta specifically is to take the hedging perspective. Since
delta hedging is risk-minimizing, one can learn hedging ratios by minimizing one-step portfolio
risk. To this end, Ruf & Wang (2022) propose minimizing local portfolio variance,
$\mathrm{Var}(V(\delta)) := \mathrm{Var}(\delta \cdot S_1 + (1 + r\Delta t)(\Pi_0 - \delta \cdot S_0) - \Pi_1)$, where $S_0, S_1$ are the underlying prices today and
tomorrow and $\Pi_0, \Pi_1$ are the respective call prices. Modulo time discretization, the minimizer is
$\delta^* = \partial\Pi(0, S_0)/\partial S$, that is, the Delta. Ruf & Wang (2022) then train an NN surrogate to real-life
(S&P 500) observations of $(S_0^{1:N}, S_1^{1:N}, \Pi_0^{1:N}, \Pi_1^{1:N})$ plus additional features, such as $\sigma_{\mathrm{Imp}}$. To match
no-arbitrage constraints that imply that the hedging ratio must lie in the interval [0, 1], they
employ a sigmoid activation function on the output (see also Halperin 2020).
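The claim that the variance minimizer recovers the Delta can be checked numerically in a simulated setting. The sketch below is not the Ruf & Wang (2022) procedure (which uses market data and an NN); it assumes a Black–Scholes world with a single fixed $S_0$, a one-day step, and hypothetical parameter values, in which case the minimizer of the local variance reduces to the regression slope $\mathrm{Cov}(S_1, \Pi_1)/\mathrm{Var}(S_1)$.

```python
import math
import numpy as np

K, r, sigma = 40.0, 0.04, 0.22            # hypothetical Black-Scholes parameters
S0, tau, dt = 40.0, 0.3, 1.0 / 252.0      # today's price, maturity, one-day step

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bs_d1(S, tau):
    return (math.log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))

def bs_call(S, tau):
    d1 = bs_d1(S, tau)
    d2 = d1 - sigma * math.sqrt(tau)
    return S * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d2)

# Simulate tomorrow's underlying price and the corresponding call price.
rng = np.random.default_rng(1)
z = rng.standard_normal(20000)
S1 = S0 * np.exp((r - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)
Pi1 = np.array([bs_call(s, tau - dt) for s in S1])

# S0 and Pi0 are known today, so Var(V(delta)) = Var(delta * S1 - Pi1),
# which is minimized by the least-squares slope of Pi1 on S1.
cov = np.cov(S1, Pi1)
delta_star = cov[0, 1] / cov[0, 0]

# Benchmark: the analytic Black-Scholes Delta at (S0, tau).
delta_bs = norm_cdf(bs_d1(S0, tau))
```

With market data, $S_0$ varies across observations and the scalar slope is replaced by a learned function of the features, which is where the NN surrogate and the sigmoid output layer enter.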
5. MODEL CALIBRATION
The model calibration task (Bayer et al. 2019, Liu et al. 2019, Benth et al. 2021, Horvath et al. 2021)
considers the inverse problem of finding the best set of model parameters θ that match observed
empirical prices. To this end, one wishes to obtain the pricing functional $\theta \mapsto \Pi(t, x, \lambda; \theta)$ and
then to solve the calibration task,

$$\inf_{\theta} \frac{1}{M} \sum_{m=1}^{M} \left( \Pi(t, x_m, \lambda_m; \theta) - P^{\mathrm{Mkt}}(t, x_m, \lambda_m) \right)^2, \tag{14}$$

where $\mathcal{P} := \{P^{\mathrm{Mkt}}(t, x_m, \lambda_m) : m = 1, \ldots, M\}$ is the collection of observed option prices with
features $\lambda_{1:M}$ at date $t$.
With a statistical machine learning approach introduced by Horvath et al. (2021), one first
trains a surrogate $\hat\Pi(t, x, \lambda; \theta)$ using a training set $\mathcal{D} := \{(t_j, x_j, \lambda_j; \theta_j), j = 1, \ldots, N\}$ (of size $N$,
picked by the modeler) and then optimizes $\theta$ for the given observation set $\mathcal{P}$,

$$\hat\theta(\mathcal{P}) := \arg\inf_{\theta} \frac{1}{M} \sum_{m} \left( \hat\Pi(t, x_m, \lambda_m; \theta) - P^{\mathrm{Mkt}}(t, x_m, \lambda_m) \right)^2. \tag{15}$$

Since solving Equation 14 requires repeatedly evaluating $\Pi$ across different $\theta$'s, when the latter
is expensive, there is a large gain from switching to a fast-to-evaluate surrogate $\hat\Pi$. Indeed,
Equation 15 can yield orders-of-magnitude performance gains, allowing on-the-fly calibration by
traders in real time. At the same time, the modeler may invest essentially unlimited resources in
training $\hat\Pi$ (e.g., via deep NNs) as long as its ultimate evaluation is fast.
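The offline/online split behind Equation 15 can be sketched with a deliberately simple stand-in. In the example below, the "expensive" pricer is just the Black–Scholes formula with $\theta = \sigma$, the surrogate is one degree-6 polynomial in $\theta$ per strike, and the optimization is a brute-force grid search; all of these choices, as well as the names `pricer`, `surrogates`, and `calibrate`, are assumptions for illustration rather than the deep NN approach of Horvath et al. (2021).

```python
import math
import numpy as np

S0, r, tau = 40.0, 0.04, 0.3                  # hypothetical market inputs
strikes = np.array([35.0, 40.0, 45.0])        # features x_m of the M = 3 quotes

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pricer(strike, theta):
    """Stand-in for an expensive pricing functional; theta is the volatility."""
    vol = theta * math.sqrt(tau)
    d1 = (math.log(S0 / strike) + (r + 0.5 * theta**2) * tau) / vol
    return S0 * norm_cdf(d1) - strike * math.exp(-r * tau) * norm_cdf(d1 - vol)

# Offline stage: evaluate the pricer on a training set of thetas and fit surrogates.
theta_train = np.linspace(0.05, 0.6, 25)
surrogates = [
    np.polynomial.Polynomial.fit(theta_train, [pricer(k, th) for th in theta_train], deg=6)
    for k in strikes
]

THETA_GRID = np.linspace(0.05, 0.6, 5501)     # search grid for the online stage

def calibrate(market_prices):
    """Online stage: minimize the surrogate-based mean squared error over theta."""
    preds = np.stack([s(THETA_GRID) for s in surrogates])       # shape (M, n_grid)
    mse = ((preds - market_prices[:, None]) ** 2).mean(axis=0)
    return THETA_GRID[np.argmin(mse)]

# Synthetic "market" quotes generated with a known parameter.
theta_true = 0.22
market = np.array([pricer(k, theta_true) for k in strikes])
theta_hat = calibrate(market)
```

The online stage never calls `pricer`, which is the source of the speedup: each calibration costs only surrogate evaluations.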
Huh (2019) calibrates exponential Lévy models using NNs. Itkin (2019) suggests calibrating
models by first building a surrogate for the forward pricing functional $(x, \lambda, \theta) \mapsto \Pi(x, \lambda; \theta)$ and
then inverting it to learn the map $(x, \lambda, \Pi) \mapsto \theta$ via a second NN. As an example, he considers
learning the implied volatility map $(x, \lambda, \Pi) \mapsto \sigma_{\mathrm{Imp}}(x)$ and then using observed $\sigma_{\mathrm{Imp}}$ to calibrate
a local volatility model.
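For reference, the implied volatility map itself can be evaluated by direct numerical inversion of the Black–Scholes formula. The bisection sketch below is not Itkin's NN-based construction; it only illustrates the target map that such an inverse surrogate would approximate, and the bracket `[lo, hi]` and tolerance are assumptions.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bs_call(S, K, r, tau, sigma):
    """Black-Scholes call price (the forward map sigma -> price)."""
    vol = sigma * math.sqrt(tau)
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * tau) / vol
    return S * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d1 - vol)

def implied_vol(price, S, K, r, tau, lo=1e-4, hi=5.0, tol=1e-8):
    """Invert the forward map by bisection; valid because the call price is
    increasing in sigma. The bracket [lo, hi] is a hypothetical choice."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, r, tau, mid) > price:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Round trip: price an option at sigma = 0.22 and recover the volatility.
quote = bs_call(40.0, 45.0, 0.04, 0.3, 0.22)
sigma_imp = implied_vol(quote, 40.0, 45.0, 0.04, 0.3)
```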
6. LEARNING TO STOP
Valuing American-style contracts is a particular case of optimal stopping problems (OSPs), where
the goal is to evaluate the value function $V : [0, T] \times \mathcal{X} \to \mathbb{R}$ representing expected reward,

$$V(t, x) := \sup_{\tau \in \mathcal{S}_t} E\left[ g(\tau, X_\tau) \mid X_t = x \right]. \tag{16}$$

For the remainder of the section we adopt the discrete-time paradigm of Bermudan options,
indexing by $k$ rather than $t_k$ and taking $T = t_{K_T}$.
It is most intuitive to think of optimal stopping as dynamic decision making. At each exercise
step k, the controller must decide whether to stop (0) or continue (1), which, within a Markovian
structure, is encoded via the feedback strategy $A = (A_{0:K_T}(\cdot))$ with each $A_k(x) \in \{0, 1\}$. The action
map $A_k$ gives rise to the stopping region,

$$\mathcal{S}_k := \{x \in \mathcal{X} : A_k(x) = 0\} \subseteq \mathcal{X},$$

where the decision is to stop, and in parallel defines the corresponding first hitting time,

$$\tau_{A_{k:K_T}} := \inf\{\ell \ge k : A_\ell(X_\ell) = 0\} \wedge K_T = \sum_{m=k}^{K_T} m \cdot (1 - A_m(X_m)) \prod_{\ell=k}^{m-1} A_\ell(X_\ell), \tag{17}$$

which is an optimal exercise time after $k$. Hence, solving an OSP is equivalent to classifying any
$(x, k)$ into $\mathcal{S}_k$ or its complement, the continuation set. For instance, in the best-known case
of the Bermudan put, it is known that $\mathcal{S}_k = [0, \bar{s}_k]$; that is, one should stop as soon as the asset
price drops below the exercise thresholds $\bar{s}_k$.
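The two expressions for the stopping time in Equation 17 can be compared on a single path. In the sketch below, `actions[l]` stands for the realized decision $A_\ell(X_\ell)$ along one path, and the terminal decision is assumed to be "stop" ($A_{K_T} = 0$), an assumption made here so that the sum-product form agrees with the capped infimum. Function names are hypothetical.

```python
from itertools import product

def tau_first_hit(actions, k):
    """inf{l >= k : A_l(X_l) = 0} capped at K_T, for one path of decisions."""
    K_T = len(actions) - 1
    for l in range(k, K_T + 1):
        if actions[l] == 0:
            return l
    return K_T

def tau_sum_product(actions, k):
    """Sum-product representation: sum_m m (1 - A_m) prod_{l=k}^{m-1} A_l."""
    K_T = len(actions) - 1
    total = 0
    for m in range(k, K_T + 1):
        still_running = 1
        for l in range(k, m):
            still_running *= actions[l]      # 1 only if we continued at every l < m
        total += m * (1 - actions[m]) * still_running
    return total

# Enumerate all decision paths with K_T = 4 and a forced stop at maturity.
all_paths = [list(bits) + [0] for bits in product([0, 1], repeat=4)]
agree = all(
    tau_first_hit(path, k) == tau_sum_product(path, k)
    for path in all_paths
    for k in range(5)
)
```

The sum-product form is the one that lends itself to gradient-based training, since it expresses the stopping time through products of the decisions themselves.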
The action set $A_k$ is characterized recursively as

$$A_k(x) = 1 \iff E\left[ g(\tau_{A_{k+1:K_T}}, X_{\tau_{A_{k+1:K_T}}}) \mid X_k = x \right] > g(k, x); \tag{18}$$

that is, one should continue if the expected reward-to-go dominates the immediate payoff.
Denoting the step-ahead conditional expectation of the value function by

$$q(k, x) := E\left[ V(k + 1, X_{k+1}) \mid X_k = x \right] \tag{19}$$

and using the dynamic programming principle that asserts that
$V(k, x) = \max\left( g(k, x), E[V(k + 1, X_{k+1}) \mid X_k = x] \right)$, we can equivalently write $A_k(x) = 1 \Leftrightarrow q(k, x) > g(k, x)$.
The q-value $q(k, \cdot)$ is known as the continuation value.
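When the state space is finite, the dynamic programming recursion can be carried out exactly, which makes the roles of $q$, $g$, and $A_k$ concrete. The sketch below assumes a Cox–Ross–Rubinstein binomial tree for the stock and a put payoff discounted to time 0, $g(k, x) = e^{-r t_k}(K - x)^+$; the tree model and all parameter values are assumptions for illustration and do not come from the source.

```python
import math

# Hypothetical Bermudan put on a binomial tree with exercise at every step.
S0, K, r, sigma, T, n = 40.0, 40.0, 0.06, 0.2, 1.0, 50
dt = T / n
u = math.exp(sigma * math.sqrt(dt))
d = 1.0 / u
p = (math.exp(r * dt) - d) / (u - d)        # risk-neutral up probability

def node(k, j):
    """Stock price at step k after j up-moves."""
    return S0 * u**j * d ** (k - j)

def g(k, x):
    """Put payoff discounted to time 0, so no extra discounting is needed in q."""
    return math.exp(-r * k * dt) * max(K - x, 0.0)

V = [g(n, node(n, j)) for j in range(n + 1)]    # V(K_T, x) = g(K_T, x)
A = {}                                          # A[k][j] = 1 (continue) or 0 (stop)
for k in range(n - 1, -1, -1):
    # Continuation value q(k, x) = E[V(k+1, X_{k+1}) | X_k = x]
    q = [p * V[j + 1] + (1.0 - p) * V[j] for j in range(k + 1)]
    # Continue exactly when q(k, x) > g(k, x)
    A[k] = [1 if q[j] > g(k, node(k, j)) else 0 for j in range(k + 1)]
    # Dynamic programming principle: V = max(g, q)
    V = [max(g(k, node(k, j)), q[j]) for j in range(k + 1)]

price = V[0]
```

In RMC the exact conditional expectation in the line computing `q` is what gets replaced by a statistical surrogate fitted on simulated paths.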
The regression Monte Carlo (RMC) framework (Longstaff & Schwartz 2001, Egloff 2005), also
known as the Longstaff–Schwartz algorithm or least squares Monte Carlo (the terminology being
historical and potentially a misnomer for statisticians), recursively constructs approximate
$\hat A_k$'s through iterating on either Equation 18 or Equation 19. Thus, the RMC framework generates
functional approximations of the continuation values $\hat q(k, \cdot)$ in order to build $\hat A_k(\cdot)$. The RMC
loop is initialized with $\hat V(K_T, x) = g(K_T, x)$, and for $k = K_T - 1, \ldots, 1$ it repeats as follows.
1. Learn the q-value $\hat q(k, \cdot)$.
2. Set $\hat A_k(x) := 1_{\{\hat q(k,x) > g(k,x)\}}$.
3. Set $\hat V(k, x) := \max\left( \hat q(k, x), g(k, x) \right)$.
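The three steps above can be sketched for a one-dimensional Bermudan put. This is a minimal illustration under stated assumptions: Black–Scholes dynamics, a cubic polynomial regression as the surrogate in step 1, the payoff discounted to time 0 so that $g(k, x) = e^{-r t_k}(K - x)^+$, and hypothetical parameter values and seed. The regression target is the one-step-ahead value $\hat V(k+1, X_{k+1})$, which matches step 3 as written; it is not a tuned or production implementation.

```python
import numpy as np

# Hypothetical problem setup: Bermudan put with K_T = 10 exercise dates.
S0, K, r, sigma, T, K_T = 40.0, 40.0, 0.06, 0.2, 1.0, 10
M = 20000                                    # number of simulated paths
dt = T / K_T
rng = np.random.default_rng(42)

# Simulate M paths of the stock; X[k] holds the states at step k.
z = rng.standard_normal((K_T, M))
log_incr = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
X = np.vstack([np.full(M, S0), S0 * np.exp(np.cumsum(log_incr, axis=0))])

def g(k, x):
    """Put payoff at step k, discounted to time 0."""
    return np.exp(-r * k * dt) * np.maximum(K - x, 0.0)

policy = {}
V_next = g(K_T, X[K_T])                      # initialization: V_hat(K_T, x) = g(K_T, x)
for k in range(K_T - 1, 0, -1):
    # Step 1: learn q_hat(k, .) by regressing V_hat(k+1, X_{k+1}) on X_k.
    q_fit = np.polynomial.Polynomial.fit(X[k], V_next, deg=3)
    q_hat = q_fit(X[k])
    # Step 2: continue (1) where the continuation value exceeds the payoff.
    policy[k] = (q_hat > g(k, X[k])).astype(int)
    # Step 3: update the value surrogate along the paths.
    V_next = np.maximum(q_hat, g(k, X[k]))

# At k = 0 all paths share the state S0, so the conditional mean is a plain average.
price = max(g(0, S0), V_next.mean())
```

Replacing `V_next` by the realized payoff along each path under the fitted policy would give the alternative regression target based on the stopping time, and swapping `Polynomial.fit` for another regressor changes only step 1.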
The principal surrogate fitting task in step 1 can be implemented as learning
$x \mapsto E[\hat V(k + 1, X_{k+1}) \mid X_k = x]$ (Tsitsiklis & van Roy 2001) or as approximating
$x \mapsto E[g(\tau_{\hat A_{k+1:K_T}}, X_{\tau_{\hat A_{k+1:K_T}}}) \mid X_k = x]$
(Longstaff & Schwartz 2001; see also a multistep, look-ahead version in Egloff et al. 2007).
These choices are distinct because
$\hat V(k + 1, x) \ne E[g(\tau_{\hat A_{k+1:K_T}}, X_{\tau_{\hat A_{k+1:K_T}}}) \mid X_{k+1} = x]$ due to the
approximation error.
Classical RMC employs an $\mathcal{L}_{\mathrm{MSE}}$ loss function with user-specified basis functions, fitted based
on a training set $\mathcal{D}_k$. The Monte Carlo reference is to the standard strategy of constructing $\mathcal{D}_k$ from
$M$ i.i.d. draws of $X_k$, obtained by a Monte Carlo simulation of $M$ respective paths of $(X_k)$ (thus $\mathcal{D}_k$
and $\mathcal{D}_{k+1}$ are not independent). Observe again a mismatch between $\mathcal{L}$ and the performance metric
for evaluating $\hat A_{0:K_T}$, which is based on the induced expected reward
$E\left[ g(\tau_{\hat A_{0:K_T}}, X_{\tau_{\hat A_{0:K_T}}}) \mid X_0 = x \right]$, as
usual evaluated via a Monte Carlo average. Nevertheless RMC has turned out to be enormously
successful; the seminal article by Longstaff & Schwartz (2001) has more than 4,000 citations
and has spawned numerous enhancements [see reviews in Broadie & Cao 2008, Kohler 2010,
Tompaidis & Yang 2013, and the monograph by Belomestny & Schoenmakers (2018)].
■ piecewise regression with adaptive subgrids by Bouchard & Warin (2011), the idea being to
avoid global fits that tend to be too rough and to come up with simple base fits (constant
or linear) defined over a collection of subregions $C_\ell$: $\hat f(x) = \sum_{\ell=1}^{L} \hat f_\ell(x) 1_{\{x \in C_\ell\}}$, with the latter
chosen to be rectangular and equi-probable in $\mathcal{D}_k$ [see also dynamic trees proposed by
Gramacy & Ludkovski (2015) who generate partitions via a probabilistic genetic ensemble
approach resembling a random forest];
■ regularized regression, such as LASSO (Kohler & Krzyżak 2012) and ridge regression (Hu &
Zastawniak 2020), the idea being to start with a large number of potential bases and prevent
overfitting by shrinking the irrelevant regression coefficients to zero;
■ kernel regression by Belomestny (2011b), the idea being to provide a nonparametric surro-
gate based on a kernel function κ (x; h) that reduces to the choice of the kernel bandwidth h
(see also nearest-neighbor regression in Agarwal & Juneja 2015);
■ GP regression by Goudenège et al. (2019), Goudenège et al. (2020), and Ludkovski (2018);
■ NN surrogates introduced by Haugh & Kogan (2004) and Kohler et al. (2010) within a
single-layer architecture [Becker et al. (2020) recently considered deep NNs];
■ Chebyshev polynomials by Glau et al. (2019b);
■ smoothing splines by Kohler (2008); and
■ adaptive regression bases by Belomestny et al. (2018).
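As an illustration of the first item in the list above, the piecewise fit $\hat f(x) = \sum_{\ell} \hat f_\ell(x) 1_{\{x \in C_\ell\}}$ with equi-probable cells can be sketched in one dimension, where the cells are intervals between empirical quantiles of the training inputs. This is a simplified sketch rather than the adaptive scheme of Bouchard & Warin (2011); the function names, the number of cells, and the synthetic data (a noisy put payoff) are assumptions.

```python
import numpy as np

def assign_cells(x, edges):
    """Index of the cell C_l containing each x, given sorted cell edges."""
    n_cells = len(edges) - 1
    return np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_cells - 1)

def fit_piecewise_linear(x, y, n_cells):
    """Fit one linear model per equi-probable cell; returns (predict, edges)."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_cells + 1))
    cell = assign_cells(x, edges)
    coefs = [np.polyfit(x[cell == l], y[cell == l], deg=1) for l in range(n_cells)]

    def predict(x_new):
        x_new = np.asarray(x_new, dtype=float)
        c = assign_cells(x_new, edges)
        out = np.empty_like(x_new)
        for l in range(n_cells):
            mask = c == l
            out[mask] = np.polyval(coefs[l], x_new[mask])   # local linear fit on C_l
        return out

    return predict, edges

# Synthetic training set: noisy observations of a put payoff with strike 40.
rng = np.random.default_rng(3)
x = rng.uniform(30.0, 50.0, 2000)
y = np.maximum(40.0 - x, 0.0) + 0.1 * rng.standard_normal(2000)

predict, edges = fit_piecewise_linear(x, y, n_cells=10)
counts = np.bincount(assign_cells(x, edges), minlength=10)

x_grid = np.linspace(30.5, 49.5, 200)
rmse = np.sqrt(np.mean((predict(x_grid) - np.maximum(40.0 - x_grid, 0.0)) ** 2))
```

In higher dimensions the same idea applies with rectangular cells built from per-coordinate quantiles, at the cost of a cell count that grows with the dimension.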
Figure 2
Surrogates for the timing value T̂(k, x) in a one-dimensional Bermudan put problem, with x being the stock price. We show three
surrogates fitted to the same problem: a GP utilizing the kernel in Equation 11 (black), a smoothing spline (purple), and a shallow
one-layer NN (blue). The surrogates are from time step k = 10 (t = 0.4, left) and k = 20 (t = 0.8, right). We also display the uncertainty
quantification regarding the GP fit of T̂ (k, x) (gray bands, 95%). Abbreviations: GP, Gaussian process; NN, neural network.
such that $E[\hat V(k + 1, X_{k+1}) \mid X_k]$ can be done analytically. Glasserman & Yu (2004b), Balata &
Palczewski (2017, 2018), and Glau et al. (2019b) all consider polynomial-type surrogates for American
options and more general control problems.
A deep learning approach to American option pricing was proposed by Becker et al. (2019) via
learning the action sets. Using Equation 17 one may rewrite
electricity markets (Carmona & Touzi 2008). Extending RMC to multiple stopping requires
constructing surrogates $\hat V^{(m)}(k, x)$ that enumerate the number $m$ of remaining exercise rights,
capturing the marginal value of each decision. Each $\hat V^{(m)}$, $m \ge 1$, is characterized as a solution
of an OSP where the payoff is related to $\hat V^{(m-1)}$. RMC for multiple stopping was pioneered by
Meinshausen & Hambly (2004) using linear models. Readers are referred to Kirkby & Deng (2019)
for a version utilizing B-splines and Ludkovski (2021) for implementation with GPs. Deschatre &
Mikael (2020) extend the policy parameterization of Becker et al. (2019) to multiexercise contracts.
Optimal switching formulations arise in the limit M → ∞ above, whereby the controller is
able to make an infinite series of discrete decisions, sequentially altering the state of the controlled
system (Xt ). This class of models also subsumes the situation where the decisions are not binary but
are in some finite action space U. Motivating case studies include on/off control of a power plant,
management of natural gas storage facilities (which can be in the gas injection, gas withdrawal, and
holding regimes) (Carmona & Ludkovski 2010, Mazières & Boogert 2013, Nadarajah et al. 2017,
Ludkovski & Maheshwari 2020), and capacity expansion models (Aid et al. 2014). Simulation-
based algorithms for optimal switching construct several surrogates $V^{(u)}(k, x)$ that are indexed by
the current control regime u ∈ U. Among recent works, Ludkovski & Maheshwari (2020) study
GP surrogates for this purpose and Bachouch et al. (2022) consider deep NNs.
Impulse control is a further generalization that features a double sequence of stopping times
and impulse amounts, $A := (\tau_m, z_m)$. The interpretation is that the state process $(X_t^A)$ is subject
to stochastic differential equation dynamics, as well as repeated lumpy interventions or shocks
of size $z_m \in \Xi$ at chosen instances $\tau_m \in [0, T]$:

$$X_s^{t,x,A} = x + \int_t^s \mu(X_r^{t,x,A})\,dr + \int_t^s \sigma(X_r^{t,x,A})\,dW_r + \sum_{m:\, t < \tau_m \le s} z_m.$$

The goal of the controller is to maximize rewards driven by $X$ and her actions
$(\tau_m, z_m)$. Impulse control can be reduced to repeated optimal stopping where the action is compound:
After deciding to act, the controller evaluates the intervention operator
$\mathcal{M}\hat V(t, x) := \sup_{z \in \Xi} \{\hat V(t, x + z) - \kappa(x, z)\}$ to select the best impulse. Ludkovski (2022) and Deschatre & Mikael
(2020) provide further details.
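The intervention operator can be evaluated numerically once a value surrogate is available. The sketch below assumes a one-dimensional state, approximates the supremum over $\Xi$ by a maximum over a finite grid of candidate impulses, and uses a toy quadratic value function with a fixed-plus-proportional cost; the function name, the grid, and both toy functions are hypothetical.

```python
import numpy as np

def intervention_operator(V_hat, cost, x, impulses):
    """Approximate sup_z {V_hat(x + z) - cost(x, z)} over a finite impulse grid.

    Returns the maximal value and the maximizing impulse.
    """
    values = np.array([V_hat(x + z) - cost(x, z) for z in impulses])
    best = int(np.argmax(values))
    return values[best], impulses[best]

# Toy ingredients (assumptions): the value surrogate prefers states near zero,
# and each intervention costs a fixed 0.1 plus 0.5 per unit of impulse size.
V_hat = lambda x: -x**2
cost = lambda x, z: 0.1 + 0.5 * abs(z)

impulse_grid = np.linspace(-5.0, 5.0, 1001)        # finite stand-in for the set Xi
mv, z_star = intervention_operator(V_hat, cost, 2.0, impulse_grid)
```

For this toy case the maximizer can be found by hand: for $z < 0$ the first-order condition $-2(2 + z) + 0.5 = 0$ gives $z^* = -1.75$ and a value of $-1.0375$, which the grid search reproduces. Comparing `mv` with the no-intervention continuation value then yields the stop/continue decision of the embedded stopping problem.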
with the superscript emphasizing the influence of the control on the transition density of $X_{k+1}^{u} \mid X_k$.
Like in the prior section, an increasingly popular approach is to construct functional approximators
$\hat V(k, \cdot)$ that are trained based on empirical samples of $(X_k, u_k, X_{k+1}^{u_k})$. This approach dates back
to at least Chen et al. (1999), who proposed spline surrogates. Readers are referred to Deisenroth
et al. (2009) for a GP implementation and Belomestny et al. (2010a) for linear models in the flavor
of RMC. Recently, there has been an explosion of interest in applying deep learning (Han & E
2016, Fecamp et al. 2020, Germain et al. 2021, Bachouch et al. 2022).
A different approach is to parameterize the set of strategies $\hat u_\vartheta(k, \cdot)$ and then to maximize expected
reward over $\vartheta$ (Huré et al. 2021). Another alternative is to learn the q-value $\hat q(k, x, u)$ that
summarizes costs to go jointly in terms of state-action pairs; optimal policy is then extracted as the
minimizer of $\hat q(k, x, \cdot)$ (Chen & Ludkovski 2021). Note that in Equation 21, optimal feedback control
$u^*(k, x)$ can be characterized in terms of the gradient of $V(k, x)$, connecting to the literature
in Section 4. This intrinsic coupling between value function and feedback control is weakened in
actor-critic approaches that build separate surrogates for $\hat V$ and $\hat u$ in the interest of computational
efficiency (Guéant & Manziuk 2019, Cao et al. 2021). Experimental design for Equation 21 is
addressed, for example, in the control randomization (Kharroubi et al. 2014, Zhang et al. 2019)
approach.
Complementary to the above are machine learning techniques for nonlinear PDEs, which
can be used to characterize continuous-time stochastic control problems using the Hamilton–
Jacobi–Bellman representation. A respective Deep Galerkin Method was proposed by Sirignano
& Spiliopoulos (2018) and has generated nearly 1,000 citations in less than 5 years.
There is by now a long list of specific financial control problems where statistical machine
learning algorithms have been taken up. We mention portfolio optimization with transaction costs
(Cong & Oosterlee 2016, Zhang et al. 2019), optimal execution in limit order books (Leal et al.
2020), and market making (Guéant & Manziuk 2019). Additional applications motivated by finan-
cial mathematics settings include (adaptive) robust control (Chen & Ludkovski 2021), principal
agent problems (Baldacci et al. 2019), constrained control (Balata et al. 2021), ranking problems
(Hu 2019), and McKean–Vlasov problems (Carmona & Laurière 2021).
A fruitful research area has been to extend the above methods to stochastic games, where
equilibrium strategies are characterized through best-response conditions (Han & Hu 2020,
Laurière 2021). One motivation is that finding equilibria requires repeated optimization over best
responses, which is expensive to do directly and where fast surrogates are greatly beneficial. Fixed-
point iteration concepts may be combined with the sequential training of the surrogate for best
response (see, for example, the class of fictitious play algorithms).
Finally, we mention in passing reinforcement learning (RL) approaches that aim to solve for
V (t, x) all at once across space and time (Dixon et al. 2020, Charpentier et al. 2021, Hambly et al.
2021). RL has been investigated especially for learning data-driven hedging strategies (Buehler
et al. 2019, Kolm & Ritter 2019, Cao et al. 2021, Giurca & Borovkova 2021, Ruf & Wang 2022)
that aspire to be model-free.
8. OUTLOOK
Historically, statistical learning in QF has evolved semi-independently across several distinct ap-
plications such as American option pricing and learning the implied volatility surface. It remains
the case that many papers propose new methodologies geared for a narrow or very specific context
such that their broader relevance is hard to assess. Recent software suites such as those by Gevret
et al. (2018) and Ludkovski (2021) aim to facilitate such meta-analysis and cross-comparison. An-
other gap is between the common test beds in academic circles and the practical concerns faced
by practitioners such that the real-life applicability of new methods is often limited.
We are currently witnessing a surge of theoretical publications as well as a proliferation of
industry start-ups that claim computational breakthroughs enabled by techniques such as deep
learning. It will take some time to sort out what will be the long-standing advances that pass the test
of time. What is clear is that QF applications have enough specialized features that customization
and tailoring are critical, and hence no single tool is ever going to be the right one for all tasks.
As such, it is worthwhile to adopt the higher perspective afforded by statistical learning theories
and be simultaneously conversant in the language of stochastics, finance, statistics, and machine
learning.
FUTURE ISSUES
1. Barriers due to different terminology used across research communities remain a
challenge but also provide ongoing opportunities for knowledge transfer.
2. There is a need for better benchmarking test beds. Despite many recent works pro-
viding reproducible computational packages and notebooks, there is a lack of common
benchmarks to enable meaningful comparison of tools and definition of state-of-the-art
performance.
DISCLOSURE STATEMENT
The author is not aware of any affiliations, memberships, funding, or financial holdings that might
be perceived as affecting the objectivity of this review.
ACKNOWLEDGMENTS
M.L. is partially supported by National Science Foundation grant DMS-1821240. Comments
from an anonymous reviewer are appreciated.
LITERATURE CITED
Ackerer D, Tagasovska N, Vatter T. 2020. Deep smoothing of the implied volatility surface. Adv. Neural Inform.
Proc. Syst. 33:11552–63
Agarwal A, Juneja S. 2015. Nearest neighbor based estimation technique for pricing Bermudan options. Int.
Game Theory Rev. 17(1):1540002
Aid R, Campi L, Langrené N, Pham H. 2014. A probabilistic numerical method for optimal multiple switching
problems in high dimension. SIAM J. Financ. Math. 5(1):191–231
Bachouch A, Huré C, Langrené N, Pham H. 2022. Deep neural networks algorithms for stochastic control
problems on finite horizon: numerical applications. Methodol. Comput. Appl. Probab. 24(1):143–78
Balata A, Ludkovski M, Maheshwari A, Palczewski J. 2021. Statistical learning for probability-constrained
stochastic optimal control. Eur. J. Oper. Res. 290(2):640–56
Balata A, Palczewski J. 2017. Regress-later Monte Carlo for optimal inventory control with applications in
energy. arXiv:1703.06461 [math.OC]
Balata A, Palczewski J. 2018. Regress-later Monte Carlo for optimal control of Markov processes.
arXiv:1712.09705 [math.OC]
Baldacci B, Manziuk I, Mastrolia T, Rosenbaum M. 2019. Market making and incentives design in the presence
of a dark pool: a deep reinforcement learning approach. arXiv:1912.01129 [q-fin.MF]
Bayer C, Horvath B, Muguruza A, Stemper B, Tomas M. 2019. On deep calibration of (rough) stochastic
volatility models. arXiv:1908.08806 [q-fin.MF]
Becker S, Cheridito P, Jentzen A. 2019. Deep optimal stopping. J. Mach. Learn. Res. 20:2712–36
Becker S, Cheridito P, Jentzen A. 2020. Pricing and hedging American-style options with deep learning.
J. Risk Financ. Manag. 13(7):158
Belomestny D. 2011a. On the rates of convergence of simulation-based optimization algorithms for optimal
stopping problems. Ann. Appl. Probab. 21(1):215–39
Belomestny D. 2011b. Pricing Bermudan options by nonparametric regression: optimal rates of convergence
for lower estimates. Finance Stochast. 15(4):655–83
Belomestny D, Kolodko A, Schoenmakers J. 2010a. Regression methods for stochastic control problems and
their convergence analysis. SIAM J. Control Optim. 48(5):3562–88
Belomestny D, Milstein GN, Schoenmakers J. 2010b. Sensitivities for Bermudan options by regression
methods. Decis. Econ. Finance 33(2):117–38
Belomestny D, Schoenmakers J. 2018. Advanced Simulation-Based Methods for Optimal Stopping and Control:
With Applications in Finance. London: Palgrave Macmillan
Belomestny D, Schoenmakers J, Spokoiny V, Tavyrikov Y. 2018. Optimal stopping via deeply boosted
backward regression. arXiv:1808.02341 [math.NA]
Benth FE, Detering N, Lavagnini S. 2021. Accuracy of deep learning in calibrating HJM forward curves. Digit.
Finance 3(3):209–48
Binois M, Gramacy RB, Ludkovski M. 2018. Practical heteroskedastic Gaussian process modeling for large
De Spiegeleer J, Madan DB, Reyners S, Schoutens W. 2018. Machine learning for quantitative finance: fast
derivative pricing, hedging and fitting. Quant. Finance 18(10):1635–43
Deisenroth MP, Rasmussen CE, Peters J. 2009. Gaussian process dynamic programming. Neurocomputing
72(7):1508–24
Deschatre T, Mikael J. 2020. Deep combinatorial optimisation for optimal stopping time problems:
application to swing options pricing. arXiv:2001.11247 [q-fin.CP]
Dixon MF, Halperin I, Bilokon P. 2020. Machine Learning in Finance. Cham, Switz.: Springer
Dugas C, Bengio Y, Bélisle F, Nadeau C, Garcia R. 2000. Incorporating second-order functional knowledge
for better option pricing. Adv. Neural Inform. Proc. Syst. 13:472–78
Dugas C, Bengio Y, Bélisle F, Nadeau C, Garcia R. 2009. Incorporating functional knowledge in neural
networks. J. Mach. Learn. Res. 10:1239–62
Duvenaud D. 2014. Automatic model construction with Gaussian processes. PhD Thesis, Univ. Cambridge,
Cambridge, UK
Egloff D. 2005. Monte Carlo algorithms for optimal stopping and statistical learning. Ann. Appl. Probab.
15(2):1396–432
Egloff D, Kohler M, Todorovic N. 2007. A dynamic look-ahead Monte Carlo algorithm for pricing Bermudan
options. Ann. Appl. Probab. 17(4):1138–71
Elie R, Perolat J, Laurière M, Geist M, Pietquin O. 2020. On the convergence of model free learning in mean
field games. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 7143–50. Palo Alto,
CA: AAAI
Fecamp S, Mikael J, Warin X. 2020. Deep learning for discrete-time hedging in incomplete markets. J. Comput.
Finance 25(2):51–85
Fengler MR. 2009. Arbitrage-free smoothing of the implied volatility surface. Quant. Finance 9(4):417–28
Ferguson R, Green AD. 2018. Deeply learning derivatives. SSRN Work. Pap. 3244821
Fromkorth A, Kohler M. 2011. Analysis of least squares regression estimates in case of additional errors in the
variables. J. Stat. Plan. Inference 141(1):172–88
Fu H, Jin X, Pan G, Yang Y. 2012. Estimating multiple option Greeks simultaneously using random parameter
regression. J. Comput. Finance 16(2):85–118
Garcia R, Gençay R. 2000. Pricing and hedging derivative securities with neural networks and a homogeneity
hint. J. Econom. 94(1–2):93–115
Gaß M, Glau K, Mahlstedt M, Mair M. 2018. Chebyshev interpolation for parametric option pricing. Finance
Stochast. 22(3):701–31
Gatheral J. 2011. The Volatility Surface: A Practitioner’s Guide. Hoboken, NJ: John Wiley & Sons
Gençay R, Qi M. 2001. Pricing and hedging derivative securities with neural networks: Bayesian regularization,
early stopping, and bagging. IEEE Trans. Neural Netw. 12(4):726–34
Germain M, Pham H, Warin X. 2021. Neural networks-based algorithms for stochastic control and PDEs in
finance. arXiv:2101. [math.OC]
Gevret H, Langrené N, Lelong J, Warin X, Maheshwari A. 2018. STochastic OPTimization library in C++. Res.
Rep., EDF Lab., Paris
Giurca A, Borovkova S. 2021. Delta hedging of derivatives using deep reinforcement learning. SSRN Work. Pap.
3847272
Glasserman P. 2004. Monte Carlo Methods in Financial Engineering. New York: Springer
Glasserman P, Yu B. 2004a. Number of paths versus number of basis functions in American option pricing.
Ann. Appl. Probab. 14(4):2090–119
Glasserman P, Yu B. 2004b. Simulation for American options: regression now or regression later? In Monte
Carlo and Quasi-Monte Carlo Methods 2002, ed. H Niederreiter, pp. 213–26. Berlin: Springer
Glau K, Herold P, Madan DB, Pötz C. 2019a. The Chebyshev method for the implied volatility. J. Comput.
Finance 23(3):1–31
Glau K, Kressner D, Statti F. 2020. Low-rank tensor approximation for Chebyshev interpolation in parametric
option pricing. SIAM J. Financ. Math. 11(3):897–927
Glau K, Mahlstedt M. 2019. Improved error bound for multivariate Chebyshev polynomial interpolation. Int.
J. Comput. Math. 96(11):2302–14
Glau K, Mahlstedt M, Pötz C. 2019b. A new approach for American option pricing: the dynamic Chebyshev
method. SIAM J. Sci. Comput. 41(1):B153–80
Gonon L, Schwab C. 2021. Deep ReLU network expression rates for option prices in high-dimensional,
exponential Lévy models. Finance Stochast. 25(4):615–57
Goudenège L, Molent A, Zanette A. 2019. Variance reduction applied to machine learning for pricing
Bermudan/American options in high dimension. arXiv:1903.11275 [q-fin.CP]
Goudenège L, Molent A, Zanette A. 2020. Machine learning for pricing American options in high-dimensional
Markovian and non-Markovian models. Quant. Finance 20(4):573–91
Gramacy RB. 2020. Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences. Boca
Kohler M, Krzyżak A. 2012. Pricing of American options in discrete time using least squares estimates with
complexity penalties. J. Stat. Plan. Inference 142(8):2289–307
Kohler M, Krzyżak A, Todorovic N. 2010. Pricing of high-dimensional American options by neural networks.
Math. Finance 20(3):383–410
Kolm PN, Ritter G. 2019. Dynamic replication and hedging: a reinforcement learning approach. J. Financ.
Data Sci. 1(1):159–71
Laurière M. 2021. Numerical methods for mean field games and mean field type control. In Mean Field Games,
ed. F Delarue, pp. 221–82. Providence, RI: Am. Math. Soc.
Leal L, Laurière M, Lehalle CA. 2020. Learning a functional control for high-frequency finance.
arXiv:2006.09611 [math.OC]
Lemieux C. 2009. Monte Carlo and Quasi-Monte Carlo Sampling. New York: Springer
Liu S, Borovykh A, Grzelak LA, Oosterlee CW. 2019. A neural network-based framework for financial model
calibration. J. Math. Ind. 9(1):9
Longstaff FA, Schwartz ES. 2001. Valuing American options by simulation: a simple least-squares approach.
Rev. Financ. Stud. 14(1):113–47
Ludkovski M. 2018. Kriging metamodels and experimental design for Bermudan option pricing. J. Comput.
Finance 22(1):37–77
Ludkovski M. 2020. mlOSP: Towards a unified implementation of regression Monte Carlo algorithms.
arXiv:2012.00729 [q-fin.CP]
Ludkovski M. 2021. mlOSP: Regression Monte Carlo algorithms for optimal stopping. R package, version 1.0.
https://github.com/mludkov/mlOSP
Ludkovski M. 2022. Regression Monte Carlo for impulse control. Math. Action 11:73–90
Ludkovski M, Maheshwari A. 2020. Simulation methods for stochastic storage problems: a statistical learning
perspective. Energy Syst. 11(2):377–415
Ludkovski M, Saporito Y. 2022. KrigHedge: Gaussian process surrogates for Delta hedging. Appl. Math.
Finance 28(4):330–60
Maran A, Pallavicini A, Scoleri S. 2021. Chebyshev Greeks: Smoothing Gamma without bias. SSRN Work. Pap.
3872744
Mazières D, Boogert A. 2013. A radial basis function approach to gas storage valuation. J. Energy Mark. 6(2):19–
50
Meinshausen N, Hambly BM. 2004. Monte Carlo methods for the valuation of multiple-exercise options.
Math. Finance 14(4):557–83
Nadarajah S, Margot F, Secomandi N. 2017. Comparison of least squares Monte Carlo methods with
applications to energy real options. Eur. J. Oper. Res. 256(1):196–204
Olivares P, Alvarez A. 2016. Pricing basket options by polynomial approximations. J. Appl. Math. 2016:9747394
Rasmussen CE, Williams CKI. 2006. Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press
Reppen AM, Soner HM, Tissot-Daguette V. 2022. Neural optimal stopping boundary. arXiv:2205.04595
[q-fin.PR]
Risk J, Ludkovski M. 2018. Sequential design and spatial modeling for portfolio tail risk measurement. SIAM
J. Financ. Math. 9(4):1137–74
Roustant O, Ginsbourger D, Deville Y. 2012. DiceKriging, DiceOptim: two R packages for the analysis of
computer experiments by Kriging-based metamodeling and optimization. J. Stat. Softw. 51(1):1–55
Ruf J, Wang W. 2020. Neural networks for option pricing and hedging: a literature review. J. Comput. Finance
24(1):1–46
Ruf J, Wang W. 2022. Hedging with linear regressions and neural networks. J. Bus. Econ. Stat. 40(4):1442–54
Ruppert D. 2004. Statistics and Finance: An Introduction. New York: Springer
Sirignano J, Spiliopoulos K. 2018. DGM: a deep learning algorithm for solving partial differential equations.
J. Comput. Phys. 375:1339–64
Tompaidis S, Yang C. 2013. Pricing American-style options by Monte Carlo simulation: alternatives to
ordinary least squares. J. Comput. Finance 18(1):121–43
Tsitsiklis JN, van Roy B. 2001. Regression methods for pricing complex American-style options. IEEE Trans.
Neural Netw. 12(4):694–703
Whalley AE, Wilmott P. 1997. An asymptotic analysis of an optimal hedging model for option pricing with
transaction costs. Math. Finance 7(3):307–24
Yang Y, Zheng Y, Hospedales T. 2017. Gated neural networks for option pricing: Rationality by design.
In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, ed. S Singh, S Markovitch,
pp. 52–58. Palo Alto: AAAI Press
Zanger DZ. 2018. Convergence of a least-squares Monte Carlo algorithm for American option pricing with
dependent sample data. Math. Finance 28(1):447–79
Zhang R, Langrené N, Tian Y, Zhu Z, Klebaner F, Hamza K. 2019. Dynamic portfolio optimization with
liquidity cost and market impact: a simulation-and-regression approach. Quant. Finance 19(3):519–32
Zheng Y, Yang Y, Chen B. 2021. Incorporating prior financial domain knowledge into neural networks for
implied volatility surface prediction. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge
Discovery & Data Mining, pp. 3968–75. New York: Assoc. Comput. Mach.