Bayesian Optimization: Theory and Practice Using Python
Peng Liu
Singapore, Singapore
Table of Contents
Introduction
Chapter 7: Case Study: Tuning CNN Learning Rate with BoTorch
    Seeking Global Optimum of Hartmann
    Generating Initial Conditions
    Updating GP Posterior
Index
About the Author
Peng Liu is an assistant professor of quantitative finance
(practice) at Singapore Management University and an
adjunct researcher at the National University of Singapore.
He holds a Ph.D. in Statistics from the National University
of Singapore and has ten years of working experience as a
data scientist across the banking, technology, and hospitality
industries.
About the Technical Reviewer
Jason Whitehorn is an experienced entrepreneur and
software developer and has helped many companies
automate and enhance their business solutions through data
synchronization, SaaS architecture, and machine learning.
Jason obtained his Bachelor of Science in Computer Science
from Arkansas State University, but he traces his passion
for development back many years before then, having first
taught himself to program BASIC on his family’s computer
while in middle school. When he’s not mentoring and
helping his team at work, writing, or pursuing one of his
many side-projects, Jason enjoys spending time with his wife and four children and
living in the Tulsa, Oklahoma, region. More information about Jason can be found on his
website: https://jason.whitehorn.us.
Acknowledgments
This book summarizes my learning journey in Bayesian optimization during my
(part-time) Ph.D. study. It started as a personal interest in exploring this area and
gradually grew into a book combining theory and practice. For that, I thank my
supervisors, Teo Chung Piaw and Chen Ying, for their continued support in my
academic career.
Introduction
Bayesian optimization provides a unified framework that solves the problem of
sequential decision-making under uncertainty. It includes two key components: a
surrogate model approximating the unknown black-box function with uncertainty
estimates and an acquisition function that guides the sequential search. This book
reviews both components, covering both theoretical introduction and practical
implementation in Python, building on top of popular libraries such as GPyTorch and
BoTorch. In addition, the book provides case studies on using Bayesian optimization to seek a simulated function's global optimum or to locate the best hyperparameters (e.g., the learning rate) when training deep neural networks. The book assumes readers have a minimal understanding of model development and machine learning.
All source code used in this book can be downloaded from github.com/apress/
Bayesian-optimization.
CHAPTER 1
Bayesian Optimization Overview
As the name suggests, Bayesian optimization is an area that studies optimization
problems using the Bayesian approach. Optimization aims at locating the optimal
objective value (i.e., a global maximum or minimum) among all possible values, or the corresponding location of the optimum, in the environment (the search domain). The
search process starts at a specific initial location and follows a particular policy to
iteratively guide the following sampling locations, collect new observations, and refresh
the guiding policy.
As shown in Figure 1-1, the overall optimization process consists of repeated
interactions between the policy and the environment. The policy is a mapping function
that takes in a new input observation (plus historical ones) and outputs the following
sampling location in a principled way. Here, we are constantly learning and improving the policy, since a good policy guides our search toward the global optimum more efficiently and effectively, saving the limited sampling budget for promising candidate locations. On the other hand, the environment contains
the unknown objective function to be learned by the policy within a specific boundary.
When probing the functional value as requested by the policy, the actual observation
revealed by the environment to the policy is often corrupted by noise, making learning
even more challenging. Thus, Bayesian optimization, a specific approach to global optimization, aims to learn a policy that helps us navigate to the global optimum of an unknown, noise-corrupted environment as efficiently and as quickly as possible.
Figure 1-1. The overall Bayesian optimization process. The policy digests the
historical observations and proposes the new sampling location. The environment
governs how the (possibly noise-corrupted) observation at the newly proposed
location is revealed to the policy. Our goal is to learn an efficient and effective
policy that could navigate toward the global optimum as quickly as possible
Global Optimization
Optimization aims to locate the optimal set of parameters of interest across the whole
domain through carefully allocating limited resources. For example, when searching
for the car key at home before leaving for work in two minutes, we would naturally start
with the most promising place where we would usually put the key. If it is not there,
think for a little while about the possible locations and go to the next most promising
place. This process iterates until the key is found. In this example, the policy is digesting
the available information on previous searches and proposing the following promising
location. The environment is the house itself, revealing if the key is placed at the
proposed location upon each sampling.
This is considered an easy example since we are familiar with the environment
in terms of its structural design. However, imagine locating an item in a totally new
environment. The policy would need to account for the uncertainty due to unfamiliarity
with the environment while sequentially determining the next sampling location. When
the sampling budget is limited, as is often the case in real-life searches in terms of
time and resources, the policy needs to reason carefully about the utility of each candidate sampling location.
Let us formalize the sequential global optimization using mathematical terms. We
are dealing with an unknown scalar-valued objective function f based on a specific
domain Α. In other words, the unknown subject of interest f is a function that maps a
certain sample in Α to a real number in ℝ, that is, f : Α → ℝ. We typically place no specific
assumption about the nature of the domain Α other than that it should be a bounded,
compact, and convex set.
Unless otherwise specified, we focus on the maximization setting instead of
minimization since maximizing the objective function is equivalent to minimizing the
negated objective, and vice versa. The optimization procedure thus aims at locating
the global maximum f ∗ or its corresponding location x∗ in a principled and systematic
manner. Mathematically, we wish to locate f ∗ where
$$f^* = \max_{x \in A} f(x) = f(x^*)$$
$$x^* = \mathop{\mathrm{arg\,max}}_{x \in A} f(x)$$
Figure 1-2 provides an example one-dimensional objective function with its global
maximum f ∗ and its location x∗ highlighted. The goal of global optimization is thus to
systematically reason about a series of sampling decisions within the total search space
Α, so as to locate the global maximum as fast as possible, that is, sampling as few times
as possible.
Figure 1-2. An example objective function with the global maximum and its
location marked with star. The goal of global optimization is to systematically
reason about a series of sampling decisions so as to locate the global maximum as
fast as possible
Note that this is a nonconvex function, as is often the case in real-life functions we
are optimizing. With a nonconvex function, we cannot resort to first-order gradient-based methods to reliably search for the global optimum, since they will likely converge to a local optimum. This is also one of the advantages of Bayesian optimization compared with gradient-based optimization procedures.
• Each functional evaluation is costly, thus ruling out exhaustive probing. We need a sample-efficient method to
minimize the number of evaluations of the environment while trying
to locate its global optimum. In other words, the optimizer needs to
fully utilize the existing observations and systematically reason about
the next sampling decision so that the limited resource is well spent
on promising locations.
Figure 1-3. Three possible functional forms. On the left is a convex function whose
optimization is easy. In the middle is a nonconvex function with multiple local
minima, and on the right is also a nonconvex function with a wide flat region full
of saddle points. Optimization for the latter two cases takes a lot more work than
for the first case
Figure 1-4. Slow convergence due to a small learning rate on the left and
divergence due to a large learning rate on the right
Next, we will delve into the various components of a typical Bayesian optimization
setup, including the observation model, the optimization policy, and the Bayesian
inference.
Figure 1-5. Illustrating the actual observations (in dots) and the underlying
objective function (in dashed line). When sampling at a specific location, the
observation would be disrupted by an additive noise. The observation model thus
determines how the observation would be revealed to the policy, which needs to
account for the uncertainty due to noise perturbation
To make our discussion more precise, let us use f (x) to denote the (unknown)
objective function value at location x. We sometimes write f (x) as f for simplicity. We
use y to denote the actual observation at location x, which will slightly differ from f due
to noise perturbation. We can thus express the observation model, which governs how
the policy sees the observation from the environment, as a probability distribution of y
based on a specific location x and true function value f:
$$p(y \mid x, f)$$
Let us assume an additive noise term ε inflicted on f; the actual observation y can
thus be expressed as
$$y = f + \varepsilon$$
Here, the noise term ε arises from measurement error or inaccurate statistical
approximation, although it may disappear in certain computer simulations. A common
practice is to treat the error as a random variable that follows a Gaussian distribution
with a zero mean and fixed standard deviation σ, that is, ε ~ N(0, σ²). Note that it is unnecessary to fix σ across the whole domain A; Bayesian optimization allows for both homoscedastic noise (i.e., fixed σ across A) and heteroscedastic noise (i.e., a σ that depends on the specific location in A).
Therefore, we can formulate a Gaussian observation model as follows:
$$p(y \mid x, f, \sigma) = \mathcal{N}(y; f, \sigma^2)$$
This means that for a specific location x, the actual observation y is treated as a
random variable that follows a Gaussian/normal distribution with mean f and variance
σ2. Figure 1-6 illustrates an example probability distribution of y centered around f. Note
that the variance of the noise is often estimated by sampling a few initial observations
and is expected to be small, so that the overall observation model still strongly depends
on and stays close to f.
Figure 1-6. Assuming a normal probability distribution for the actual observation
as a random variable. The Gaussian distribution is centered around the objective
function f value evaluated at a given location x and spread by the variance of the
noise term
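To make the observation model concrete, the following sketch (using NumPy; the objective function, noise level, and sampling locations are made up for illustration) simulates noisy observations y = f + ε:

import numpy as np

def objective(x):
    # a hypothetical objective function used only for illustration
    return np.sin(3 * x) + 0.5 * x

rng = np.random.default_rng(42)
sigma = 0.1                       # noise standard deviation
x = np.array([0.2, 0.5, 0.8])     # sampling locations requested by the policy
f = objective(x)                  # true (unknown) objective values
y = f + rng.normal(0.0, sigma, size=x.shape)  # noisy observations revealed to the policy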
The following section introduces Bayesian statistics to lay the theoretical foundation
as we work with probability distributions along the way.
Bayesian Statistics
Bayesian optimization is not a particular algorithm for global optimization; it is a suite of
algorithms based on the principles of Bayesian inference. As the optimization proceeds
in each iteration, the policy needs to determine the next sampling decision or if the
current search needs to be terminated. Due to uncertainty in the objective function and
the observation model, the policy needs to cater to such uncertainty upon deciding
the following sampling location, which bears both an immediate impact on follow-up
decisions and a long-term effect on all future decisions. The samples selected thus need
to reasonably contribute to the ultimate goal of global optimization and justify the cost
incurred due to sampling.
Using Bayesian statistics in optimization paves the way for us to systematically
and quantitatively reason about these uncertainties using probabilities. For example,
we would place a prior belief about the characteristics of the objective function and
quantify its uncertainties by assigning high probability to specific ranges of values and
low probability to others. As more observations are collected, the prior belief is gradually
updated and calibrated toward the true underlying distribution of the objective function
in the form of a posterior distribution.
We now cover the fundamental concepts and tools of Bayesian statistics.
Understanding these sections is essential to appreciate the inner workings of Bayesian
optimization.
Bayesian Inference
Bayesian inference essentially relies on the Bayesian formula (also called Bayes’ rule)
to reason about the interactions among three components: the prior distribution p(θ)
where θ represents the parameter of interest, the likelihood p(data| θ) given a specific
parameter θ, and the posterior distribution p(θ| data). There is one more component, the
evidence of the data p(data), which is often not computable. The Bayesian formula is as
follows:
$$p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta)\, p(\theta)}{p(\text{data})}$$
Let us look closely at this widely used and arguably most important formula in
Bayesian statistics. Remember that any Bayesian inference procedure aims to derive the
posterior distribution p(θ| data) (or calculate its marginal expectation) for the parameter
of interest θ, in the form of a probability density function. For example, we might end up
with a continuous posterior distribution as in Figure 1-7, where θ varies from 0 to 1, and
all the probabilities (i.e., area under the curve) would sum to 1.
As illustrated in Figure 1-8, a prior belief (such as a uniform distribution over θ) would progressively approach a normal distribution as more data is collected, thus forming a posterior distribution that better approximates the true distribution of θ.
Figure 1-8. Updating the prior uniform distribution toward a posterior normal
distribution as more data is collected. The role of the prior distribution decreases
as more data is collected to support the approximation to the true underlying
distribution
The last term is the denominator p(data), also referred to as the evidence, which
represents the probability of obtaining the data over all different choices of θ and serves
as a normalizing constant independent of θ in Bayes’ theorem. This is the most difficult part to compute among all the components, since it requires integrating over all possible values of θ. For each given θ, the likelihood is calculated based on the assumed observation model for data generation, just as in the likelihood term. The difference is that the evidence considers every possible value of θ and weights the resulting likelihood by the probability of observing that particular θ. Since the evidence does not depend on θ, it is often ignored when analyzing the proportionate change in the posterior; the analysis then focuses on the likelihood and the prior alone.
A relatively simple case is when the prior p(θ) and the likelihood p(data| θ) are
conjugate, making the resulting posterior p(θ| data) analytic and thus easy to work with
due to its closed-form expression. Bayesian inference becomes much easier and less
restrictive if we can write down the explicit form and generate the exact shape of the
posterior p(θ| data) without resorting to sampling methods. The posterior will follow the
same distribution as the prior when the prior is conjugate with the likelihood function.
For example, when both the prior and the likelihood functions follow a normal distribution, the resulting posterior will also be normally distributed. However, when the prior and the likelihood are not conjugate, we can still gain insight into the posterior distribution via efficient sampling techniques such as Gibbs sampling.
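As a concrete illustration of the conjugate normal-normal case, the following sketch (with made-up prior parameters and observations, using NumPy) computes the closed-form posterior of θ:

import numpy as np

# made-up prior over theta and noise level for illustration
theta_0, sigma_0 = 0.0, 1.0     # prior mean and standard deviation
sigma = 0.5                     # known observation noise standard deviation

# made-up observations assumed drawn from N(theta, sigma^2)
y = np.array([0.9, 1.1, 0.8, 1.2])
n = len(y)

# closed-form normal-normal conjugate update
post_var = 1.0 / (1.0 / sigma_0**2 + n / sigma**2)
post_mean = post_var * (theta_0 / sigma_0**2 + y.sum() / sigma**2)
print(post_mean, np.sqrt(post_var))   # posterior mean and standard deviation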
Figure 1-9. Comparing the frequentist approach and the Bayesian approach
regarding the parameter of interest. The frequentist approach treats θ as a fixed
quantity that can be estimated via MLE, while the Bayesian approach employs a
probability distribution which gets refreshed as more data is collected
The conditional probability p(x = X | y = Y) denotes the probability that the event x = X occurs given that the event y = Y has occurred. It is thus referred to as a conditional probability, as the probability of the first event is now conditioned on the second event.
All conditional probabilities for a (continuous) random variable x given a specific value
of another random variable (i.e., y = Y) form the conditional probability distribution
p(x| y = Y). More generally, we can write the joint probability distribution of random
variables x and y as p(x, y) and conditional probability distribution as p(x ∣ y).
The joint probability is also symmetrical, that is, p(X and Y) = p(Y and X), which is
a result of the exchangeability property of probability. Plugging in the definition of joint
probability using the chain rule gives the following:
$$p(X \cap Y) = p(X \mid Y)\, p(Y) = p(Y \mid X)\, p(X)$$
If you look at this equation more closely, it is not difficult to see that it can lead to the
Bayesian formula we introduced earlier, namely:
$$p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}$$
Understanding this connection gives us one more reason not to memorize the
Bayesian formula but to appreciate it. We can also replace a single event x = X with the
random variable x to get the corresponding conditional probability distribution p(x| y = Y).
Lastly, we may only be interested in the probability of an event for one random
variable alone, disregarding the possible realizations of the other random variable.
That is, we would like to consider the probability of the event x = X under all possible
values of y. This is called the marginal probability for the event x = X. The marginal
probability distribution for a (continuous) random variable x in the presence of another
(continuous) random variable y can be calculated as follows:
$$p(x) = \int p(x, y)\, dy = \int p(x \mid y)\, p(y)\, dy$$
The preceding definition essentially sums up possible values p(x| y) weighted by the
likelihood of occurrence p(y). The weighted sum operation resolves the uncertainty in
the random variable y and thus in a way integrates it out of the original joint probability
distribution, keeping only one random variable. For example, the prior probability
p(θ) in Bayes’ rule is a marginal probability distribution of θ, which integrates out
other random variables, if any. The same goes for the evidence term p(data) which is
calculated by integrating over all possible values of θ.
Similarly, the marginal probability distribution for y is calculated as

$$p(y) = \int p(x, y)\, dx = \int p(y \mid x)\, p(x)\, dx$$
Figure 1-10 summarizes the three common probability distributions. Note that
the joint probability distribution focuses on two or more random variables, while
both the conditional and marginal probability distributions generally refer to a single
random variable. In the case of the conditional probability distribution, the other
random variable assumes a specific value and thus, in a way, “disappears” from the
joint distribution. In the case of the marginal probability distribution, the other random
variable is instead integrated out of the joint distribution.
Let us revisit Bayes’ rule in the context of conditional and marginal probabilities.
Specifically, the likelihood term p(data| θ) can be treated as the conditional probability of
the data given the parameter θ, and the evidence term p(data) is a marginal probability
that needs to be evaluated across all possible choices of θ. Based on the definition
of marginal probability, we can write the calculation of p(data) as a weighted sum
(assuming a continuous θ):
$$p(\text{data}) = \int p(\text{data} \mid \theta)\, p(\theta)\, d\theta$$
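To see this weighted-sum view in code, the following sketch (with a made-up normal prior and observation model, using NumPy and SciPy for a simple grid approximation of the integral) evaluates the evidence:

import numpy as np
from scipy.stats import norm

# grid approximation over theta (values made up for illustration)
theta_grid = np.linspace(-5, 5, 1001)
d_theta = theta_grid[1] - theta_grid[0]

prior = norm.pdf(theta_grid, loc=0.0, scale=1.0)      # p(theta)
data = np.array([0.7, 1.3])                           # made-up observations
# likelihood p(data | theta) for every theta on the grid
likelihood = np.prod(norm.pdf(data[:, None], loc=theta_grid, scale=0.5), axis=0)

# evidence: likelihoods weighted by the prior and summed over theta
evidence = np.sum(likelihood * prior) * d_theta
posterior = likelihood * prior / evidence             # normalized posterior on the grid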
Independence
A special case that would impact the calculation of the three probabilities mentioned
earlier is independence, where the random variables are now independent of each
other. Let us look at the joint, conditional, and marginal probabilities with independent
random variables.
When two random variables are independent of each other, the event x = X would
have nothing to do with the event y = Y, that is, the conditional probability for x = X
given y = Y becomes p(X| Y) = p(X). The conditional probability distribution for two
independent random variables thus becomes p(x| y) = p(x). Their joint probability
becomes the multiplication of individual probabilities: p(X ∩ Y) = p(X| Y)p(Y) = p(X)p(Y),
and the joint probability distribution becomes a product of individual probability
distributions: p(x, y) = p(x)p(y). The marginal probability of x is just its own probability
distribution:
$$p(x) = \int p(x \mid y)\, p(y)\, dy = \int p(x)\, p(y)\, dy = p(x) \int p(y)\, dy = p(x)$$
where we have used the fact that p(x) can be moved out of the integration operation due
to its independence with y, and the total area under a probability distribution is one, that
is, ∫ p(y)dy = 1.
We can also extend to conditional independence, where the random variable x
could be independent from y given another random variable z. In other words, we have
p(x, y| z) = p(x| z)p(y| z).
Let us first consider the prior predictive distribution of a data point y before any observation is made, obtained by marginalizing out the parameter θ:

$$p(y) = \int p(y, \theta)\, d\theta = \int p(y \mid \theta)\, p(\theta)\, d\theta$$
which is the exact definition of the evidence term in Bayes’ formula. In a discrete world,
we would take the prior probability for a specific value of the parameter θ, multiply
the likelihood of the resulting data given the current θ, and sum across all weighted
likelihoods.
Now let us look at the posterior predictive distribution for a new data point y′ after
observing a collection of data points collectively denoted as 𝒟. We would like to assess how the future data would be distributed and what value of y′ we would likely observe if we were to run the experiment and acquire another data point, given that we have observed some actual data. That is, we want to calculate the posterior predictive distribution p(y′| 𝒟).
We can calculate the posterior predictive distribution by treating it as a marginal distribution (conditioned on the collected dataset 𝒟) and applying the same technique as before, namely:

$$p(y' \mid \mathcal{D}) = \int p(y' \mid \theta, \mathcal{D})\, p(\theta \mid \mathcal{D})\, d\theta$$

where the second term p(θ| 𝒟) is the posterior distribution of the parameter θ that can be calculated by applying Bayes’ rule. However, the first term p(y′| θ, 𝒟) is more involved. When assessing a new data point after observing some existing data points, a
common assumption is that they are conditionally independent given a particular value
of θ. Such conditional independence implies that p(y′| θ, 𝒟) = p(y′| θ), which happens to be the likelihood term. Thus, we can simplify the posterior predictive distribution as follows:

$$p(y' \mid \mathcal{D}) = \int p(y' \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$$
which follows the same pattern of calculation compared to the prior predictive
distribution. This would then give us the distribution of observations we would expect
for a new experiment (such as probing the environment in the Bayesian optimization
setting) given a set of previously collected observations. The prior and posterior
predictive distributions are summarized in Figure 1-11.
Figure 1-11. Definition of the prior and posterior predictive distributions. Both
are calculated based on the same pattern of a weighted sum between the prior and
the likelihood
Let us look at an example of the prior predictive distribution under a normal prior
and likelihood function. Before the experiment starts, we assume the observation model
for the likelihood of the data y to follow a normal distribution, that is, y ~ N(θ, σ²), or p(y| θ, σ²) = N(θ, σ²), where θ is the underlying parameter and σ² is a fixed variance. For example,
in the case of the observation model in the Bayesian optimization setting introduced
earlier, the parameter θ could represent the true objective function, and the variance σ2
originates from an additive Gaussian noise. The distribution of y depends on θ, which itself is assumed to follow a normal prior distribution θ ~ N(θ0, σ0²).
The prior predictive distribution can thus be calculated by plugging in the definition
of normal likelihood term p(y| θ) and the normal prior term p(θ). However, there is a
simple trick we can use to avoid the math, which would otherwise be pretty heavy if we
were to plug in the formula of the normal distribution directly.
Let us try directly working with the random variables. We will start by noting that
y = (y − θ) + θ. The first term y − θ takes θ away from y, which decentralizes y by changing
its mean to zero and removes the dependence of y on θ. In other words, (y − θ)~N(0, σ2),
which also represents the distribution of the random noise in the observation model
of Bayesian optimization. Since the second term θ is also normally distributed, we can
derive the distribution of y as follows:
$$y \sim \mathcal{N}(0, \sigma^2) + \mathcal{N}(\theta_0, \sigma_0^2) = \mathcal{N}(\theta_0, \sigma^2 + \sigma_0^2)$$
where we have used the fact that the addition of two independent normally distributed
random variables will also be normally distributed, with the mean and variance
calculated based on the sum of individual means and variances.
Therefore, the marginal probability distribution of y becomes p(y) = N(θ0, σ² + σ0²).
Intuitively, this form also makes sense. Before we start to collect any observation about
y, our best guess for its mean would be θ0, the expected value of the underlying random
variable θ. Its variance is the sum of individual variances since we are considering
uncertainties due to both the prior and the likelihood; the marginal distribution needs to account for both sources of uncertainty.
Figure 1-12. Derivation process of the prior predictive distribution for a new data
point before collecting any observations, assuming a normal distribution for both
the likelihood and the prior
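The following sketch (with made-up values for θ0, σ0, and σ, using NumPy) verifies this result by simulation: sampling θ from the prior and then y from the likelihood yields draws whose mean and variance approach θ0 and σ² + σ0²:

import numpy as np

rng = np.random.default_rng(0)
theta_0, sigma_0 = 1.0, 0.8   # made-up prior mean and standard deviation
sigma = 0.5                   # made-up likelihood (noise) standard deviation

# sample theta from the prior, then y from the likelihood given theta
theta = rng.normal(theta_0, sigma_0, size=100_000)
y = rng.normal(theta, sigma)

print(y.mean())               # close to theta_0
print(y.var())                # close to sigma**2 + sigma_0**2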
We can follow the same line of reasoning for the case of posterior predictive
distribution for a new observation y′ after collecting some data points under the
normality assumption for the likelihood p(y′| θ) and the posterior p(θ| 𝒟), where p(y′| θ) = N(θ, σ²) and p(θ| 𝒟) = N(θ′, σ′²). We can see that the posterior distribution for θ has an updated set of parameters θ′ and σ′² obtained using Bayes’ rule as more data is collected.
Now recall the definition of the posterior predictive distribution with a continuous
underlying parameter θ:
$$p(y' \mid \mathcal{D}) = \int p(y' \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$$
Figure 1-13 summarizes the derivation of the posterior predictive distributions under
normality assumption for the likelihood and the prior for a continuous θ.
Figure 1-13. Derivation process of the posterior predictive distribution for a new
data point after collecting some observations, assuming a normal distribution for
both the likelihood and the prior
Figure 1-14 illustrates an example of the marginal prior distribution and the
conditional likelihood function (which is also a probability distribution) along with the
observation Y. We can see that both distributions follow a normal curve, and the mean
of the latter is aligned to the actual observation Y due to the conditioning effect from
Y = θ. Also, the probability of observing Y is not very high under the prior distribution p(θ), which suggests the prior needs to be adjusted in the posterior update of the next iteration. We will need to change the prior in order to improve this probability and conform the subjective expectation to reality.
Figure 1-14. Illustrating the prior distribution and the likelihood function, both
following a normal distribution. The mean of the likelihood function is equal to the
actual observation due to the effect of conditioning
The prior distribution will then gradually get updated to approximate the
actual observations by invoking Bayes’ rule. This will give the posterior distribution
p(θ| Y) = N(θ′, σ′²), shown as a solid line, whose mean is slightly nudged from θ0 toward Y and updated to θ′, as shown in Figure 1-15. The prior distribution and likelihood function
are displayed in dashed lines for reference. The posterior distribution of θ is now more
aligned with what is actually observed in reality.
Figure 1-15. Deriving the posterior distribution for θ using Bayes’ rule. The
updated mean θ′ is now between the prior mean θ0 and actual observation Y,
suggesting an alignment between subjective preference and reality
Gaussian Process
A prevalent choice of stochastic process in Bayesian optimization is the Gaussian
process, which requires that all finite-dimensional probability distributions are multivariate Gaussian distributions, even in a continuous domain with an infinite number of variables. It is a flexible framework to model a broad family of functions and quantify
their uncertainties, thus being a powerful surrogate model used to approximate the true
underlying function. We will delve into the details of the Gaussian process in the next
chapter, but for now, let us look at a few visual examples to see what it offers.
Figure 1-17 illustrates an example of a “flipped” prior probability distribution for a
single random variable selected from the prior belief of the Gaussian process. Each point
follows a normal distribution. Plotting the mean (solid line) and 95% credible interval
(dashed lines) of all these prior distributions gives us the prior process for the objective
function regarding each location in the domain. The Gaussian process thus employs an
infinite number of normally distributed random variables within a bounded range to
model the underlying objective function and quantify the associated uncertainty via a
probabilistic approach.
Figure 1-17. A sample prior belief of the Gaussian process represented by the
mean and 95% credible interval for each location in the domain. Every objective
value is modeled by a random variable that follows a normal prior predictive
distribution. Collecting the distributions of all random variables could help us
quantify the potential shape of the true underlying function and its probability
The prior process can thus serve as the surrogate data-generating process to
generate samples in the form of functions, an extension of sampling single points from
a probability distribution. For example, if we were to repeatedly sample from the prior
process earlier, we would expect the majority (around 95%) of the samples to fall within
the credible interval and a minority outside this range. Figure 1-18 illustrates three
functions sampled from the prior process.
Figure 1-18. Three example functions sampled from the prior process, where the majority of the functions fall within the 95% credible interval
In the Gaussian process, the uncertainty on the objective value of each location is
quantified using the credible interval. As we start to collect observations and assume a
noise-free and exact observation model, the uncertainties at the collection locations will
be resolved, leading to zero variance and direct interpolation at these locations. Besides,
the variance increases as we move further away from the observations, resulting from
integrating the prior process with the information provided by the actual observations.
Figure 1-19 illustrates the updated posterior process after collecting two observations.
The posterior process, with knowledge updated based on the observations, thus provides a more accurate surrogate model and a better estimate of the objective function.
Figure 1-19. Updated posterior process after incorporating two exact observations
in the Gaussian process. The posterior mean interpolates through the observations,
and the associated variance reduces as we move nearer the observations
Acquisition Function
The tools from Bayesian inference and the extension to the Gaussian process provide
principled reasoning on the distribution of the objective function. However, we would
still need to incorporate such probabilistic information in our decision-making to search
for the global maximum. We need to build a policy that absorbs the most updated
information on the objective function and recommends the following most promising
sampling location in the face of uncertainties across the domain. The optimization
policy thus plays an essential role in connecting the Gaussian process to the eventual
goal of Bayesian optimization. In particular, the posterior predictive distribution
provides an outlook on the objective value and associated uncertainty for locations not
explored yet, which could be used by the optimization policy to quantify the utility of any
alternative location within the domain.
When converting the posterior knowledge about candidate locations, that is,
posterior parameters such as the mean and the variance, to a single utility score, the
acquisition function comes into play. An acquisition function is a manually designed
mechanism that evaluates the relative potential of each candidate location in the
form of a scalar score, and the location with the maximum score will be used as the
recommendation for the next round of sampling. It is a function that assesses how valuable a candidate location would be if we were to acquire/sample it.
Upon sampling at the recommended location, the environment reveals a new observation following a particular observation model. The Gaussian process surrogate model then uses the new
observation to obtain a posterior process in support of follow-up decision-making by the
preset acquisition function. This process continues until the stopping criterion such as
exhausting a given budget is met. Figure 1-20 illustrates this process.
Summary
Bayesian optimization is a class of methodologies that aims at sample-efficient global optimization. This chapter covered the foundations of the BO framework.
In the next chapter, we will discuss the first component: the Gaussian process,
covering both theoretical understanding and practical implementation in Python.
CHAPTER 2
Gaussian Processes
In the previous chapter, we covered the derivation of the posterior distribution for
parameter θ as well as the predictive posterior distribution of a new observation y′
under a normal/Gaussian prior distribution. Knowing the posterior predictive
distribution is helpful in supervised learning tasks such as regression and classification.
In particular, the posterior predictive distribution quantifies the possible realizations
and uncertainties of both existing and future observations (if we were to sample again).
In this chapter, we will cover more foundations of the Gaussian process in the first section and switch to its implementation in code in the second section.
The way we work with the parameters depends on the type of models used for
training. There are two types of models in supervised learning tasks: parametric and
nonparametric models. Parametric models assume a fixed set of parameters to be
estimated and used for prediction. For example, by defining a set of parameters θ
(bolded lowercase to denote multiple elements contained in a vector) given a set of input
observations X (bolded uppercase to denote a matrix) and output target y, we rely on the
parametric model p(y| X, θ) and estimate the optimal parameter values θˆ via procedures
such as maximum likelihood estimation or maximum a posteriori estimation. Using a
Bayesian approach, we can also infer the full posterior distribution p(θ| X, y) to enable a
distributional representation instead of a point estimate for the parameters θ.
Figure 2-1 illustrates the shorthand math notation for matrix X and vector y.
On the other hand, nonparametric methods do not assume a fixed set of parameters,
and, instead, the number of parameters is determined by the size of the dataset. For
example, the k-nearest neighbors (KNN) algorithm is a nonparametric model that takes
the average of the nearest k data points as the final prediction. This process involves
calculating the distance/similarity of all data points, where these similarities serve as the
weights (equal weightage for all k nearby points in KNN). Therefore, it tends to be slow at
test time when the size of the dataset is large, although training time is faster compared
with parametric models due to direct data memorization.
Gaussian processes (GP) can be considered a type of nonparametric model. It is a
stochastic process used to characterize the distribution over functions instead of a fixed
set of parameters. The key difference is that GP extends a limited set of parameters
θ from a discrete space, often used in multivariate Gaussian distribution, into an
unlimited function f in a continuous and infinite space, which corresponds to a function
or a curve when plotted on a graph. GP inherits the nice mathematical properties of a
multivariate Gaussian distribution and offers a flexible framework to model functions
with uncertainty estimates.
For example, Figure 2-2 shows one realization of a coefficient vector θ that contains
three parameters and a function f that displays as a curve and has an infinite number of
parameters. However, computationally we would still use a finite number of points
to approximate the actual curve when plotting the function. Using GP, the function f
represents an infinite collection of random variables, where any (finite) subset of these
random variables follows a joint Gaussian distribution. The mutual dependence among
these random variables thus determines the resulting shape of the function f.
$$p(f) = \mathcal{GP}(f; \mu, K)$$
The mean function μ returns the expected value (central tendency) for an arbitrary
input location x, and the covariance function quantifies the similarity between any two
input locations.
We will shed more light on the covariance function later. For now, it helps to think
of GP as a massive Gaussian distribution. Under the GP framework, inference on the
distribution over functions starts with a prior process (also over the functions), which
gets iteratively updated as new observations are revealed. By treating f as a collection
of random variables that follow a GP prior and using Bayesian inference to obtain the
GP posterior, we could quantify the uncertainties about the underlying true objective
function itself, including its maximizer f ∗ and x∗. Also, depending on the type of task, GP
can be used for both regression and classification.
Before we formalize the mathematical properties of GP, let us review the basics of
Gaussian distribution in a multivariate setting, which will help us build the intuition
in understanding GP. We will then delve into the mechanics of developing useful prior
distributions for the objective function and calculating the posterior distributions after
observing some noisy or noise-free samples.
Now we would like to fit a Gaussian distribution for each scenario, which will allow
us to describe other unknown points better if the distribution is fit properly. In
the two-dimensional case, a Gaussian distribution is characterized by a mean vector μ and a covariance matrix K:

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad K = \begin{bmatrix} \sigma_{11}^2 & \sigma_{12}^2 \\ \sigma_{21}^2 & \sigma_{22}^2 \end{bmatrix}$$

The mean vector describes the central tendency if we were to sample from the Gaussian distribution repeatedly, and the covariance matrix describes the intercorrelation among the points.
Correspondingly, any point on the plane can be considered as a realization (or
equivalently, a sample or simulation) from the Gaussian distribution:
$$\mathbf{x} \sim \mathcal{N}(\mu, K) = \mathcal{N}\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \sigma_{11}^2 & \sigma_{12}^2 \\ \sigma_{21}^2 & \sigma_{22}^2 \end{bmatrix} \right)$$

Since we assume these points are centered around the origin, this implies that μ = [0, 0]ᵀ. Now let us take a closer look at the covariance matrix.
$$\sigma_{11}^2 = \mathrm{var}(x_1) = \mathbb{E}\big[(x_1 - \mathbb{E}[x_1])(x_1 - \mathbb{E}[x_1])\big]$$
$$\sigma_{12}^2 = \sigma_{21}^2 = \mathbb{E}\big[(x_1 - \mathbb{E}[x_1])(x_2 - \mathbb{E}[x_2])\big] = \mathrm{cov}(x_1, x_2)$$
$$\rho_{x_1 x_2} = \frac{\mathrm{cov}(x_1, x_2)}{\mathrm{sd}(x_1)\,\mathrm{sd}(x_2)} = \frac{\sigma_{12}^2}{\sigma_{11}\sigma_{22}} = \frac{\sigma_{21}^2}{\sigma_{11}\sigma_{22}}$$
where we have used the symmetric property of the covariance matrix, assumed to be positive semidefinite, that is, σ12² = σ21². The correlation is a metric that summarizes the covariance matrix by making use of all the components in the calculation. It is a number between -1 and 1, where 1 represents perfect positive correlation (such as x2 = 2x1) and -1 means perfect negative correlation (such as x2 = −2x1).
Since the diagonal entries in the covariance matrix denote the feature-specific variance, we can easily apply normalization by dividing by the standard deviation and obtain a unit variance, resulting in σ11² = σ22² = 1.
Things become more interesting when we look at the off-diagonal entries, that is, σ12² and σ21². In the left distribution in Figure 2-3, if we were to increase x2 from zero to a specific value (say x2 = 0.8), our knowledge of the possible values of x1 (i.e., positive or negative) is still very limited. In fact, knowing x2 does not contribute to our knowledge about x1 at all, which still remains equally distributed around the vertical axis. The information about x2 does not help us get more information about x1. In other words, x1 and x2 are uncorrelated to each other, and σ12² = σ21² = 0. Therefore, for the first scenario, we have

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right)$$
On the other hand, if we were to take the same cut in the second scenario and
condition x2 to the same value, we know that x1 will be positive with a high probability.
This time, knowing the value of x2 gives us some information about x1, thus making
these two variables correlated. We would expect x1 to increase if x2 were to increase, thus
forming a positive correlation. Assume the covariance between x1 and x2 is 0.6, that is, the strength of co-movement is σ12² = σ21² = 0.6; the bivariate Gaussian distribution for the second scenario could then be expressed as

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.6 \\ 0.6 & 1 \end{bmatrix} \right)$$
Figure 2-4 summarizes these two scenarios, where the key difference is the
dependence between these two variables. A value of zero at the off-diagonal entries
indicates that the variables are not dependent upon each other, while a positive or
negative value suggests some level of dependence.
Figure 2-4. Illustrating the bivariate Gaussian distribution for the two scenarios.
The covariance entries on the left are set to zero to represent the fact that the
variables are uncorrelated, while the covariance entries are 0.6 due to a positive
correlation on the right
For a bivariate Gaussian distribution, the marginal distributions of the individual variables are themselves univariate Gaussian:

$$p(x_1) = \mathcal{N}(x_1 \mid \mu_1, K_{11})$$
$$p(x_2) = \mathcal{N}(x_2 \mid \mu_2, K_{22})$$
Now assume we have an observation for variable x2, say x2 = a. How does this information update our belief about the distribution of x1? In other words, we are interested in the posterior distribution of x1 conditioned on the observation x2 = a. The conditional posterior distribution of x1 given x2 = a can be written as

$$p(x_1 \mid x_2 = a) = \mathcal{N}(x_1 \mid \mu_{1 \mid 2}, K_{1 \mid 2})$$

where the conditional posterior mean and variance are defined as follows:

$$\mu_{1 \mid 2} = \mu_1 + K_{12} K_{22}^{-1} (a - \mu_2)$$
$$K_{1 \mid 2} = K_{11} - K_{12} K_{22}^{-1} K_{21}$$
Note that the posterior variance of x1 does not depend on the observed value of x2.
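As a quick numerical check, the following sketch (using NumPy and the made-up covariance from the second scenario) computes the conditional mean and variance of x1 given an observation x2 = a:

import numpy as np

mu = np.array([0.0, 0.0])
K = np.array([[1.0, 0.6],
              [0.6, 1.0]])       # covariance from the correlated scenario
a = 0.8                          # observed value of x2

# closed-form Gaussian conditioning formulas
mu_1_given_2 = mu[0] + K[0, 1] / K[1, 1] * (a - mu[1])
K_1_given_2 = K[0, 0] - K[0, 1] / K[1, 1] * K[1, 0]
print(mu_1_given_2, K_1_given_2)  # 0.48 and 0.64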
Figure 2-5. Obtaining a random sample from the desired univariate Gaussian
distribution based on three steps: sampling from a uniform distribution,
converting to the corresponding input of the CDF (or PDF) using the inverse
cumulative function, and applying the scale-location transformation to convert
the random sample to the space of the normal distribution with the desired mean
and variance
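A minimal sketch of these three steps, assuming NumPy and SciPy and a made-up target mean and standard deviation, might look as follows:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5             # desired mean and standard deviation (made up)

# step 1: sample from a uniform distribution on [0, 1)
u = rng.uniform(size=10_000)
# step 2: map through the inverse CDF of the standard normal distribution
z = norm.ppf(u)
# step 3: scale-location transformation to the desired normal distribution
x = mu + sigma * z

print(x.mean(), x.std())         # close to mu and sigma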
Let us extend to the multivariate case and look at how to sample from a bivariate Gaussian distribution with an arbitrary mean vector μ and covariance matrix K:

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad K = \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}$$

We could follow a similar process and start by sampling from a standard bivariate normal distribution and then apply the scale-location transformation to the resulting samples.
How could we randomly sample [x1, x2]ᵀ from this standard bivariate normal distribution? Put equivalently, we would like to sample x from N(0, I), where 0 is a vector of zeros and I is the identity matrix. Since the off-diagonal entries of the covariance matrix are zero, we could exploit the fact that the two variables x1 and x2 are uncorrelated. Thus, the problem becomes drawing random samples from the individual distribution for x1 and x2, respectively. Based on the previous theorem on multivariate Gaussian distribution, the marginal distributions are the univariate Gaussian distributions with the respective mean and variance. In other words:

$$x_1 \sim \mathcal{N}(0, 1)$$
$$x_2 \sim \mathcal{N}(0, 1)$$
To apply the scale-location transformation with a full covariance matrix K, we can use the Cholesky decomposition to find a lower triangular matrix L such that

$$K = LL^T$$

Thus, we can obtain a transformed sample via Lx + μ, which will follow N(μ, K).
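A minimal sketch of this transformation, assuming NumPy and a made-up mean vector and covariance matrix, could look as follows:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])               # made-up mean vector
K = np.array([[1.0, 0.6],
              [0.6, 1.0]])               # made-up covariance matrix

L = np.linalg.cholesky(K)                # K = L @ L.T
x = rng.standard_normal(size=(2, 5000))  # samples from N(0, I)
samples = (L @ x).T + mu                 # scale-location transform: L x + mu

print(samples.mean(axis=0))              # close to mu
print(np.cov(samples.T))                 # close to K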
Understanding the sampling process of a multivariate Gaussian distribution is
essential when it comes to simulating samples (specifically, curves) from a posterior
Gaussian process in the Bayesian framework. A random sample from the multivariate
Gaussian distribution lives in the form of a vector with limited elements, which gets
stretched to an unlimited function under Gaussian processes. Next, we will move into
the setting of sequentially obtaining new observations and look at how to plug in the GP
framework to obtain updated samples using GP regression.
We assume the variables are centered to simplify the derivation. Also, note that we use {xi} with the
subscript i to denote the location of the observations arranged along the horizontal axis,
which are considered given in the Bayesian inference framework in the previous chapter.
We can assume the three variables jointly follow a multivariate Gaussian distribution
as follows:
$$\mathbf{f} = \begin{bmatrix} f_1 \\ f_2 \\ f_3 \end{bmatrix} \sim \mathcal{N}(\mu, K) = \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K_{11} & K_{12} & K_{13} \\ K_{21} & K_{22} & K_{23} \\ K_{31} & K_{32} & K_{33} \end{bmatrix} \right)$$
where the covariance function K captures the correlation between the three
observations. Specifically, f1 should be more correlated to f2 than f3 due to a smaller
difference in value. Such prior belief also constitutes our inductive bias into the model:
the resulting model should be smooth without sudden jumps between nearby points.
See Figure 2-6 for a recap of this example.
$$\begin{bmatrix} f_1 \\ f_2 \\ f_3 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.8 & 0.3 \\ 0.8 & 1 & 0.7 \\ 0.3 & 0.7 & 1 \end{bmatrix} \right)$$
A popular kernel function we can use to obtain a covariance matrix similar to the
preceding example is the squared exponential kernel:
$$K_{ij} = \kappa(x_i, x_j) = e^{-\lVert x_i - x_j \rVert^2}$$
where we take the exponential of the negative squared distance as the final distance
measure. We can now quantify the similarity between any pair of points in the bounded
range. For example, when xi is very far away from xj and their distance approaches
infinity, that is, ‖xi − xj‖ → ∞, we have Kij → 0; when xi = xj, that is, the similarity of a
given point with itself, we have Kij = 1, corresponding to the diagonal entries in the
covariance matrix. Therefore, the squared exponential kernel function κ(xi, xj) describes
the similarity between any two input locations using a value between 0 and 1. Using
such a kernel function would thus enable us to build a covariance matrix similar to the
example earlier.
Let us implement the kernel function in the following code listing and observe the
similarity value as the input distance increases.
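A minimal sketch of such an implementation, assuming NumPy and Matplotlib (the function name and plotting details are illustrative), might look as follows:

import numpy as np
import matplotlib.pyplot as plt

def kernel(x_i, x_j):
    # squared exponential kernel for scalar inputs
    return np.exp(-(x_i - x_j)**2)

# evaluate the similarity against a fixed point as the distance increases
distance = np.linspace(0, 3, 100)
similarity = kernel(distance, 0.0)

plt.plot(distance, similarity)
plt.xlabel("distance between two inputs")
plt.ylabel("covariance / similarity")
plt.show()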
Running the preceding code would generate Figure 2-7, which shows that the
covariance between two points decreases as their distance increases and approaches
zero when the distance exceeds two. This means that the functional values at nearby
locations are more correlated, while distant locations would essentially lead to
independent random variables with little or no covariance. It also suggests that revealing
an observation at an arbitrary input location provides nontrivial information about the
functional values of nearby locations.
Note that Kii = κ(xi, xi) = 1 at every point xi ∈ A. Thus, the covariance between two
points xi and xj also measures the correlation between the corresponding functional
values f (xi) and f (xj) based on the definition of correlation introduced earlier. Assuming
a mean function that returns zero across the whole domain, we can see that the marginal
distribution of the random variable at every input location follows a standard normal
distribution.
From another perspective, the kernel function can also be considered to determine
the height of the three random variables {f1, f2, f3}, which represents our entity of interest,
given a fixed set of locations {x1, x2, x3}. We can also introduce additional parameters
to the kernel function so that it becomes more flexible in characterizing the particular
dataset, which could be in the form of images or texts.
As illustrated in Figure 2-8, given that closer points will be similar in value as encoded by
the kernel function, we can intuit that the middle point has the highest probability,
which has roughly the same distance with the second and third observations. That
is, a small change in the input location should result in a small change in the output
observation value. Choosing the middle would result in a smooth curve when
connecting it with other observations. In addition, if we were to move x∗ closer to x2, we
would expect f∗ to increase in value and further approach f2.
Figure 2-8. Three possible values of f∗ (dash circles) for a new input location x∗.
Based on the smoothness assumption encoded in the kernel/covariance function,
we would intuit the middle point to be the most likely candidate
Let us analyze how this intuition can be incorporated using the GP framework. We
can model the corresponding observation as a random variable f∗ that follows a joint
Gaussian distribution together with the observed random variables f = {f1, f2, f3}. Lumping
all variables together allows us to enforce the similarity constraints as encoded in the
covariance function.
Specifically, the observed variables f and the unobserved variable f∗ are respectively
distributed as follows:
$$\mathbf{f} \sim \mathcal{N}(0, K)$$
$$f_* \sim \mathcal{N}(0, \kappa(x_*, x_*))$$
where the self-covariance term K(x∗, x∗) is just the variance of x∗, which would equal 1
when using the squared exponential kernel function. To make the setting more general,
we assume a finite number of random variables in f observed at locations x without
noise and jointly following a multivariate Gaussian distribution:
$$p(\mathbf{f} \mid \mathbf{x}) = \mathcal{N}(\mathbf{f} \mid \mathbf{0}, K)$$
which is the same as f~N(0, K). Revealing these observations will help update our prior
belief p(f∗| x∗) = N(f∗| 0, κ(x∗, x∗)) of GP and obtain a posterior p(f∗| x∗, x, f) for a set of new
random variables f∗ at a set of new input locations x∗, also to be observed without noise.
To see how this can be achieved, we refer to the GP framework and use the fact that f and
f∗ will also be jointly Gaussian. That is
$$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{bmatrix} K & K_* \\ K_*^T & K_{**} \end{bmatrix} \right)$$
where K∗ = κ(x, x∗) and K∗∗ = κ(x∗, x∗). When there is a total of n observed samples (i.e.,
training data) and n∗ new input locations, we would have an n × n matrix for K, an n × n∗ matrix for K∗, and an n∗ × n∗ matrix for K∗∗.
With f and f∗ modeled as a joint Gaussian distribution, we can again rely on the
multivariate Gaussian theorem and directly use the closed-form solution to obtain the
parameters of the conditional distribution p(f∗| x∗, x, f):
$$p(\mathbf{f}_* \mid \mathbf{x}_*, \mathbf{x}, \mathbf{f}) = \mathcal{N}(\mathbf{f}_* \mid \mu_*, \Sigma_*)$$
$$\mu_* = K_*^T K^{-1} \mathbf{f}$$
$$\Sigma_* = K_{**} - K_*^T K^{-1} K_*$$
The updated posterior p(f∗| x∗, x, f) with such parameters will thus assign the highest
probability to the middle point (mean estimate) among the three candidates in Figure 2-8.
Figure 2-9 summarizes the derivation process of learning from existing data and
obtaining the predictive (conditional) distribution for the new data under the GP
framework. The key part is defining the kernel function as a form of a prior belief or
an inductive bias and iteratively refining it by incorporating information from the
observations.
Figure 2-9. Obtaining the predictive distribution of new input locations using the
Gaussian process. We start by defining and optionally optimizing a kernel function
to encode what is observed from actual samples and what is believed using the
prior. Then we model all variables as jointly Gaussian and obtain the conditional
distribution for the new variables
When the observations are corrupted by additive Gaussian noise with variance σy², that is, y = f + ε with ε ~ N(0, σy²), we condition on the noisy observations y and replace K with Ky = K + σy²I in the closed-form solution:

$$p(\mathbf{f}_* \mid \mathbf{x}_*, \mathbf{x}, \mathbf{y}) = \mathcal{N}(\mathbf{f}_* \mid \mu_*, \Sigma_*)$$
$$\mu_* = K_*^T K_y^{-1} \mathbf{y}$$
$$\Sigma_* = K_{**} - K_*^T K_y^{-1} K_*$$

The predictive distribution for the noisy observations y∗ at the new locations then adds the noise variance back to the covariance:

$$p(\mathbf{y}_* \mid \mathbf{x}_*, \mathbf{x}, \mathbf{y}) = \mathcal{N}(\mathbf{y}_* \mid \mu_*, \Sigma_* + \sigma_y^2 I)$$
We can then use this predictive distribution to perform sampling at any input
location within the domain. When the number of samples grows to infinity, connecting
these samples would form a functional curve, which is a sample from the Gaussian
process. We can also plot the confidence interval by connecting the pointwise standard
deviations. Viewing the functional curve as a composition of infinitely many pointwise
samples also follows the same argument as the case of sampling from a bivariate
Gaussian distribution.
See Listing 2-2 for the full code on generating samples from a GP prior.
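A minimal sketch of such a procedure, assuming NumPy and Matplotlib (the variable name X_test and the domain bounds are assumptions reused by later snippets), could look as follows:

import numpy as np
import matplotlib.pyplot as plt

def kernel_matrix(X1, X2):
    # plain squared exponential kernel between two sets of (d-dimensional) inputs
    sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return np.exp(-sq_dist)

# 100 test locations across the domain
X_test = np.linspace(-5, 5, 100).reshape(-1, 1)
mu = np.zeros(X_test.shape[0])                                 # zero mean function
cov = kernel_matrix(X_test, X_test) + 1e-8 * np.eye(X_test.shape[0])  # prior covariance

# draw five functions from the GP prior
samples = np.random.default_rng(0).multivariate_normal(mu, cov, size=5)
for sample in samples:
    plt.plot(X_test.ravel(), sample)
plt.show()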
Running the preceding code will generate five random curves, each consisting of
100 discrete points connected to approximate a function across the whole domain. See
Figure 2-10 for the output.
We can make the kernel function more flexible by introducing additional parameters.
For example, the widely used Gaussian kernel or RBF kernel uses two parameters: the
length parameter l to control the smoothness of the function and σf to control the vertical
variation:
$$\kappa(x_i, x_j) = \sigma_f^2 \exp\left( -\frac{1}{2l^2} \lVert x_i - x_j \rVert^2 \right)$$
For demonstration purposes, we will use the same parameters for each input
location/dimension, which also results in the so-called isotropic kernel and is defined
as follows. Both input arguments could include multiple elements/locations, each with dimension d.

import numpy as np

# Args:
#   X1: array of m points (m x d).
#   X2: array of n points (n x d).
# Returns:
#   (m x n) covariance matrix.
def ise_kernel(X1, X2, l=1.0, sigma_f=1.0):
    sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sq_dist)
We can obtain the mean vector (assumed to be zero) and covariance matrix (using
l = 1 and σf = 1) based on a list of input locations previously defined in X_test.
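A minimal sketch of this step, under the same assumptions as the earlier snippets (X_test as defined above; the uncertainty variable here is assumed to hold the 95% bounds referenced later), could be:

# zero mean vector and prior covariance for the test locations
mu = np.zeros(X_test.shape[0])
cov = ise_kernel(X_test, X_test, l=1.0, sigma_f=1.0)

# pointwise 95% uncertainty derived from the diagonal of the covariance matrix
uncertainty = 1.96 * np.sqrt(np.diag(cov))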
We can then plot the five samples together with the mean function and 95%
confidence bounds that form the uncertainty region based on the diagonal entries of the
covariance matrix (contained in the uncertainty variable). We define a utility function
to plot the samples across the predefined grid values in the following code listing. The
function also has two input arguments (X_train and Y_train) as placeholders when
actual observations are revealed.
Listing 2-6. Plotting GP prior mean function, uncertainty region, and samples
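A minimal sketch of such a plotting utility, assuming Matplotlib (the name plot_gp and its exact signature are assumptions for this sketch), might look as follows:

import numpy as np
import matplotlib.pyplot as plt

def plot_gp(mu, cov, X, X_train=None, Y_train=None, samples=[]):
    X = X.ravel()
    mu = mu.ravel()
    # 95% uncertainty region from the pointwise standard deviations
    uncertainty = 1.96 * np.sqrt(np.diag(cov))
    plt.fill_between(X, mu + uncertainty, mu - uncertainty, alpha=0.1)
    plt.plot(X, mu, label='mean')
    for i, sample in enumerate(samples):
        plt.plot(X, sample, lw=1, ls='--', label=f'sample {i+1}')
    if X_train is not None:
        # observed locations plotted as crosses
        plt.plot(X_train, Y_train, 'rx')
    plt.legend()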
We can then supply the necessary ingredients to invoke this function, which
produces an output shown in Figure 2-11:
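A sketch of the invocation, under the same assumptions as the earlier snippets, could be:

# draw five samples from the GP prior and plot them with the mean and bounds
samples = np.random.default_rng(0).multivariate_normal(mu, cov, size=5)
plot_gp(mu, cov, X_test, samples=samples)
plt.show()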
Figure 2-11. Five samples from the GP prior together with the mean function and
95% confidence intervals
Listing 2-7. Calculating GP posterior mean vector and covariance matrix for the
new inputs
from numpy.linalg import inv

def update_posterior(X_s, X_train, Y_train, l=1.0, sigma_f=1.0, sigma_y=1e-8):
    """
    Compute the GP posterior mean vector and covariance matrix for new inputs.
    (The function signature and default values are reconstructed from its later usage.)

    Args:
        X_s: new input locations (n x d).
        X_train: training locations (m x d).
        Y_train: training targets (m x 1).
        l: kernel length parameter.
        sigma_f: kernel vertical variation parameter.
        sigma_y: noise level of the observations.

    Returns:
        Posterior mean vector (n x d) and covariance matrix (n x n).
    """
    # covariance matrix for observed inputs
    K = ise_kernel(X_train, X_train, l, sigma_f) + sigma_y**2 * np.eye(len(X_train))
    # cross-covariance between observed and new inputs
    K_s = ise_kernel(X_train, X_s, l, sigma_f)
    # covariance matrix for new inputs
    K_ss = ise_kernel(X_s, X_s, l, sigma_f) + 1e-8 * np.eye(len(X_s))
    # compute inverse of covariance matrix
    K_inv = inv(K)
    # posterior mean vector based on derived closed-form formula
    mu_s = K_s.T.dot(K_inv).dot(Y_train)
    # posterior covariance matrix based on derived closed-form formula
    cov_s = K_ss - K_s.T.dot(K_inv).dot(K_s)
    return mu_s, cov_s
Now we can pass in a few noise-free training data points and apply the function
to obtain posterior mean and covariance and generate updated samples from the GP
posterior.
We can call the same plotting function again with the training data passed in as
additional parameters and plotted as crosses.
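A minimal sketch of these two steps, using placeholder noise-free training points rather
than the book's exact data, is:

# placeholder noise-free observations for illustration
X_train = np.array([[-4.0], [-2.0], [0.0], [2.0], [4.0]])
Y_train = np.sin(X_train)
# posterior mean and covariance over the grid, with zero observation noise
mu_s, cov_s = update_posterior(X_test, X_train, Y_train, sigma_y=0)
# draw samples from the GP posterior and plot them with the training crosses
samples = np.random.multivariate_normal(mu_s.ravel(), cov_s, 5)
plot_gp(mu_s, cov_s, X_test, X_train=X_train, Y_train=Y_train, samples=samples)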
Running the preceding code produces Figure 2-12. The posterior mean function
smoothly interpolates the observed inputs where the corresponding variances of
the random variables are also zero. As represented by the confidence interval, the
uncertainty is small in the neighborhood of observed locations and increases as
we move further away from the observed inputs. In the meantime, the marginal
distributions at locations sufficiently far away from the observed locations will largely
remain unchanged, given that the prior covariance function essentially encodes no
correlation for distant locations.
Running the preceding code generates Figure 2-13, where the mean function does
not interpolate the actual training observations due to the extra noise term in Ky and the
variances at the corresponding input locations are also nonzero.
Figure 2-13. Drawing five samples from GP posterior based on noisy observations
In the previous example, we set the standard deviation of the noise term (assumed
to be independent of the underlying random variable of interest) to 0.5. Increasing the
noise variance leads to larger deviations of the actual observations from the underlying
true functional values at the corresponding input locations, resulting in additional
inflation of the confidence interval across the whole domain. As verified in Figure 2-14,
by setting noise=1, the confidence interval becomes wider given the lower
signal-to-noise ratio.
Figure 2-14. A wider confidence interval across the whole input domain given a
lower signal-to-noise ratio
Also, note that the fitted mean function becomes smoother as the noise level increases.
By forcing coarser approximations, noisier training data helps prevent the model from
becoming too wiggly and, as a result, prone to overfitting.
params = [
(0.1, 1.0),
(2.0, 1.0),
(1.0, 0.1),
(1.0, 2.0)
]
plt.figure(figsize=(12, 5))
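A plausible sketch of the plotting loop over these parameter pairs, reusing the X_test,
X_train, Y_train, and noise variables from earlier, is:

for i, (l, sigma_f) in enumerate(params):
    # posterior under the current pair of kernel hyperparameters
    mu_s, cov_s = update_posterior(X_test, X_train, Y_train,
                                   l=l, sigma_f=sigma_f, sigma_y=noise)
    plt.subplot(2, 2, i + 1)
    plt.title(f'l = {l}, sigma_f = {sigma_f}')
    plot_gp(mu_s, cov_s, X_test, X_train=X_train, Y_train=Y_train)
plt.tight_layout()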
The output of running the preceding code is shown in Figure 2-15. In the top row,
increasing l while holding σf fixed from the left to the right plot shows that the resulting
fit is less wiggly and associated with narrower uncertainty regions. When fixing l and
increasing σf from the left to the right plot in the bottom row, we observe a much wider
uncertainty region, especially toward the locations at the right end.
Hyperparameter Tuning
The kernel parameters l and σf are fixed before fitting a GP and are often referred
to as the hyperparameters. In the last section, we showed the impact of different
manually set hyperparameters on the resulting fit. Instead of testing over multiple
different combinations one by one, a better approach is to automatically identify
the value of the hyperparameters based on the characteristics of the observed data
under the GP framework. Since we already have access to the marginal likelihood of
these observations, a common approach is to go with the set of hyperparameters that
maximize the joint log-likelihood of these marginal distributions.
Assume a total of n noisy samples y observed at locations x and following a joint
Gaussian distribution with mean vector μ and covariance matrix Ky. We will also use
θ = {l, σf} to denote all hyperparameters. The joint likelihood of the marginal Gaussian
distribution can be expressed as
$$p(y \mid x, \theta) = \mathcal{N}(y \mid \mu, K_y)$$
$$\hat{\theta} = \underset{\theta}{\arg\max}\ p(y \mid x, \theta)$$
$$p(y \mid x, \theta) = \frac{1}{(2\pi)^{n/2}\,|K_y|^{1/2}} \exp\!\left(-\frac{1}{2}(y - \mu)^T K_y^{-1} (y - \mu)\right)$$
where ∣Ky∣ represents the determinant of Ky. We can then take the log of the joint
likelihood to remove the exponent and instead maximize the log likelihood:
$$\log p(y \mid x, \theta) = -\frac{1}{2}(y - \mu)^T K_y^{-1} (y - \mu) - \frac{n}{2}\log 2\pi - \frac{1}{2}\log |K_y|$$
where the first term measures the goodness of fit, and the last two terms serve as a
form of entropy-based penalty. We can further simplify the formula by assuming μ = 0,
resulting in
$$\log p(y \mid x, \theta) = -\frac{1}{2}\, y^T K_y^{-1} y - \frac{n}{2}\log 2\pi - \frac{1}{2}\log |K_y|$$
Let us look at how to implement this. In the following code listing, we define a
function that takes in the training data and noise level and returns a function that
calculates the negative log marginal likelihood based on the specified parameters (l and
σf). This is a direct implementation based on the definition of the negative log-likelihood
earlier.
from numpy.linalg import det, inv

def nll_direct(X_train, Y_train, noise):
    """
    Return a function that computes the negative log marginal likelihood directly.
    (The function header is reconstructed from its later usage.)

    Args:
        X_train: training locations (m x d).
        Y_train: training targets (m x 1).
        noise: known noise level of Y_train.

    Returns:
        Minimization objective function.
    """
    Y_train = Y_train.ravel()
    def nll(theta):
        K = ise_kernel(X_train, X_train, l=theta[0], sigma_f=theta[1]) + \
            noise**2 * np.eye(len(X_train))
        return 0.5 * Y_train.dot(inv(K).dot(Y_train)) + \
               0.5 * len(X_train) * np.log(2*np.pi) + \
               0.5 * np.log(det(K))
    return nll
We can invoke the minimize function from SciPy to perform the minimization using
L-BFGS-B, a widely used optimization algorithm. The optimization procedure starts with
both l and σf equal to one and searches for the optimal parameters. In practice, multiple
rounds of search with different starting points are often conducted to avoid getting stuck
in a local minimum.
from scipy.optimize import minimize

# initialize at [1, 1] and search for optimal values between 1e-5 and infinity
res = minimize(nll_direct(X_train, Y_train, noise), [1, 1],
               bounds=((1e-5, None), (1e-5, None)),
               method='L-BFGS-B')
l_opt, sigma_f_opt = res.x
# compute posterior mean and covariance with optimized kernel parameters and plot the results
mu_s, cov_s = update_posterior(X_test, X_train, Y_train, l=l_opt,
                               sigma_f=sigma_f_opt, sigma_y=noise)
plot_gp(mu_s, cov_s, X_test, X_train=X_train, Y_train=Y_train)
Running the preceding code will generate Figure 2-16, where the training data points
are reasonably covered with the 95% uncertainty regions. The mean function of the GP
posterior also appears to be a good approximation.
Note that directly calculating the negative log marginal likelihood based on the given
formula may be numerically unstable and computationally expensive as the problem
starts to scale. In particular, calculating $K_y^{-1}$ requires inverting the matrix, which
becomes computationally costly as the matrix grows. In practice, a neat trick is to convert
the matrix inversion to another more computation-friendly operation and indirectly
obtain the same result.
To put things in perspective, let us look at the first term $-\frac{1}{2} y^T K_y^{-1} y$ in $\log p(y \mid x, \theta)$.
Calculating $K_y^{-1}$ using inv(K) may cause numerical instability as in the previous code.
To circumvent the direct calculation, we can treat the calculation of $K_y^{-1} y$ as a whole and
use the Cholesky decomposition trick to solve a system of equations, avoiding direct
matrix inversion. Specifically, we note that $K_y^{-1} y = L^{-T} L^{-1} y$, where we have used the
Cholesky decomposition to convert $K_y$ into the product of a triangular factor matrix L
and its transpose $L^T$, that is, $K_y = L L^T$. The triangular matrix is much more memory
efficient when performing mathematical operations, and only one triangular matrix
needs to be computed and stored.
Assuming $Lm = y$, we can easily solve for $m = L^{-1} y$ using the solve_triangular
function from the scipy package. Similarly, assuming $L^T \alpha = m$, we can use the same
function to solve for $\alpha = L^{-T} m = L^{-T} L^{-1} y$. Therefore, we can calculate $K_y^{-1} y$ by solving
two systems of equations based on the result of the Cholesky decomposition.
In addition, we can further accelerate the calculation by using the fact that
$\log |K_y| = 2 \sum_{i=1}^{n} \log L_{ii}$, where $L_{ii}$ is the ith diagonal entry of the factor matrix L. The
preceding tricks are applied in the following code listing.
# assumes: from numpy.linalg import cholesky; from scipy.linalg import solve_triangular
def nll_stable(X_train, Y_train, noise):  # the wrapper name is illustrative
    Y_train = Y_train.ravel()
    def nll(theta):
        K = ise_kernel(X_train, X_train, l=theta[0], sigma_f=theta[1]) + \
            noise**2 * np.eye(len(X_train))
        # Cholesky factor: K = L @ L.T, with L lower triangular
        L = cholesky(K)
        # solve the first system of equations: L m = y
        m = solve_triangular(L, Y_train, lower=True)
        # solve the second system of equations: L.T alpha = m, so alpha = K^{-1} y
        alpha = solve_triangular(L.T, m, lower=False)
        # negative log marginal likelihood via alpha and the log-diagonal of L
        return 0.5 * Y_train.dot(alpha) + \
               0.5 * len(X_train) * np.log(2 * np.pi) + \
               np.sum(np.log(np.diagonal(L)))
    return nll
Running the preceding code will generate the same plot as earlier, but the difference
becomes apparent as the problem grows larger. To see the full list of implementations, refer
to the accompanying notebook at https://github.com/jackliu333/bayesian_optimization_theory_and_practice_with_python/blob/main/Chapter_2.ipynb.
Summary
A Gaussian process is an important tool in many modeling problems. In this chapter, we
covered the following list of items:
In the next chapter, we will look at expected improvement (EI), the most widely used
acquisition function in Bayesian optimization.
CHAPTER 3
Bayesian Decision Theory and Expected Improvement
The previous chapter used Gaussian processes (GP) as the surrogate model to
approximate the underlying objective function. GP is a flexible framework that provides
uncertainty estimates in the form of probability distributions over plausible functions
across the entire domain. We could then resort to the closed-form posterior predictive
distributions at proposed locations to obtain an educated guess on the potential
observations.
However, it is not the only choice of surrogate model used in Bayesian optimization.
Many other models, such as random forest, have seen increasing use in recent years,
although the default and mainstream choice is still a GP. Nevertheless, the canonical
Bayesian optimization framework allows any surrogate model as long as it provides a
posterior estimate for the function, which then gets used by the acquisition function to
generate a sampling proposal.
The acquisition function offers even more choices and is an increasingly crowded
research space. Standard acquisition functions such as expected improvement and
upper confidence bound have seen wide usage in many applications, and problem-
specific acquisition functions incorporating domain knowledge, such as safety constraints,
are constantly being proposed. The acquisition function assumes a more important role
in the Bayesian optimization framework as it directly determines the sampling decision
for follow-up data acquisition. A good acquisition function thus enables the optimizer
to locate the (global) optimum as fast as possible, where the optimum refers either to the
location that attains the optimal value or to the optimal value itself across the
whole domain.
This chapter will dive into the Bayesian optimization pipeline using expected
improvement, the most widely used acquisition function for sampling decisions.
We will first characterize Bayesian optimization as a sequential decision process
under uncertainty, followed by a thorough introduction of expected improvement.
An intelligent selection of the following sampling location that involves uncertainty
estimates is the key to achieving sample-efficient global optimization. Lastly, we will go
through a case study using expected improvement to guide hyperparameter tuning.
The optimizer thus needs to trade off between calling off the query and performing
additional sampling, which incurs an additional cost. Therefore, the action space of
the optimizer contains not only the sampling location but also a binary decision on
termination.
Figure 3-1 characterizes the sequential decision-making process that underpins
Bayesian optimization. The policy either proposes the next sampling location at each
outer-loop iteration or terminates the loop. Suppose it decides to propose an
additional sampling action. In that case, we will enter the inner loop to seek the most
promising location with the highest value of the prespecified acquisition function. We
would then probe the most favorable location and append the additional observation to
our data collection, which is then used to update the posterior belief on the underlying
objective function through the GP. On the other hand, if the policy believes the additional
query is not worth the corresponding cost to improve our belief on the global optimum,
it will terminate the outer loop and return the current best estimate of
the global optimum or its location. This also forms the stopping rule of the policy,
which could be triggered upon exhausting a limited budget or by an adaptive
mechanism based on the current progress.
Quantifying the improvement on the belief of the global optimum is reflected in the
expected marginal gain on the utility of observed data, which is the core concept in the
Bayesian decision theory used in Bayesian optimization. We will cover this topic in the
following sections.
We use the superscript ∗ to denote an optimal action and the subscript
n + 1 to indicate the first future sampling location on top of the existing n
observations. The addition of one thus indicates the lookahead horizon, that is, the number of time steps into
the future. The notation $x_{n+1}$ denotes all possible candidate locations, including the
observed ones, and is viewed as a random variable. The optimal decision then amounts to
maximizing the acquisition function to select the next sampling location, probing it to obtain
$y_{n+1}$, appending the new pair to the dataset $\mathcal{D}_{n+1} = \mathcal{D}_n \cup \{(x_{n+1}, y_{n+1})\}$, and finally updating the
posterior belief. Choosing an appropriate acquisition function thus plays a critical role in this process.
Figure 3-2. Illustrating the entire BO loop by iteratively maximizing the current
acquisition function, probing additional data, and updating posterior belief
Although the BO loop could begin with an empty dataset, practical training often
relies on a small dataset consisting of a few uniformly sampled observations. This
accelerates the optimization process as it serves as a warm start and presents a more
informed prior belief than a uniform one. The effect is even more evident when the
initial dataset has good coverage of different locations of the domain.
Utility-Driven Optimization
The eventual goal of BO is to collect a valuable set of observations that are most informative
about the global optimum. The value of a dataset is quantified by utility, a notion initially
used in the Bayesian decision theory and used here to assist the sequential optimization in
BO via the acquisition function. The acquisition function builds on top of the utility of the
currently available dataset when assessing the value of candidate locations.
Since our goal is to locate the global maximum, a natural choice for the utility
function is the maximum value of the current dataset, that is, $u(\mathcal{D}_n) = \max y_{1:n}$,
assuming the case of noise-free observations. This is also called the incumbent of
the current dataset and is used as a benchmark when evaluating all future candidate
observations. As the most widely used acquisition function in practical applications,
the expected improvement function uses this incumbent to award candidate locations
whose putative observations are likely to be higher.
When assessing a candidate location $x_{n+1}$, we would require a fictional observation
$y_{n+1}$ to be able to calculate the utility if we were to acquire an additional observation at
this location. Considering the randomness of the objective function, our best estimate
is that $y_{n+1}$ will follow a posterior normal distribution according to the updated GP
posterior. Since $y_{n+1}$ is a random variable, the standard approach is to integrate out
its randomness by calculating the expected utility at the particular location, that is,
$\mathbb{E}_{y_{n+1}}[u(x_{n+1}, y_{n+1}, \mathcal{D}_n) \mid x_{n+1}, \mathcal{D}_n]$, conditioned on the specific evaluation location $x_{n+1}$
and the current set of observations $\mathcal{D}_n$. This also corresponds to the expected utility
when assuming we have an additional unknown observation $(x_{n+1}, y_{n+1})$, leading to

$$\mathbb{E}_{y_{n+1}}[u(\mathcal{D}_{n+1}) \mid x_{n+1}, \mathcal{D}_n] = \mathbb{E}_{y_{n+1}}[u(\mathcal{D}_n \cup \{(x_{n+1}, y_{n+1})\}) \mid x_{n+1}, \mathcal{D}_n] = \mathbb{E}_{y_{n+1}}[u(x_{n+1}, y_{n+1}, \mathcal{D}_n) \mid x_{n+1}, \mathcal{D}_n]$$

We could then utilize the posterior predictive distribution $p(y_{n+1} \mid x_{n+1}, \mathcal{D}_n)$ to express the
expected utility as an integration operation in the continuous case as follows:

$$\mathbb{E}_{y_{n+1}}[u(x_{n+1}, y_{n+1}, \mathcal{D}_n) \mid x_{n+1}, \mathcal{D}_n] = \int u(x_{n+1}, y_{n+1}, \mathcal{D}_n)\, p(y_{n+1} \mid x_{n+1}, \mathcal{D}_n)\, dy_{n+1}$$
This expression considers all possible values of $y_{n+1}$ at location $x_{n+1}$ and weighs the
corresponding utility based on the probability of occurrence. With access to the expected
utility at each candidate location, the next sampling location could be determined by
selecting the one with the largest expected marginal gain in utility:

$$\alpha_1(x_{n+1}; \mathcal{D}_n) = \mathbb{E}_{y_{n+1}}[u(\mathcal{D}_{n+1}) - u(\mathcal{D}_n) \mid x_{n+1}, \mathcal{D}_n] = \mathbb{E}_{y_{n+1}}[u(\mathcal{D}_{n+1}) \mid x_{n+1}, \mathcal{D}_n] - u(\mathcal{D}_n)$$

where the subscript 1 in $\alpha_1(x_{n+1}; \mathcal{D}_n)$ denotes the number of lookahead steps into the
future. The second step follows since there is no randomness in the utility of the existing
observations $u(\mathcal{D}_n)$.
The optimal action using the one-step lookahead policy is then defined as the
maximizer of the expected marginal gain:

$$x_{n+1}^* = \underset{x_{n+1} \in \mathcal{A}}{\arg\max}\ \alpha_1(x_{n+1}; \mathcal{D}_n)$$
Figure 3-3 illustrates this process. We start with the utility of collected observations
$u(\mathcal{D}_n)$ as the benchmark for comparison when evaluating the expected marginal gain
at a new candidate location. The evaluation needs to consider all possible values of the
next observation based on the updated posterior GP and thus leads to the expected utility
term $\mathbb{E}_{y_{n+1}}[u(x_{n+1}, y_{n+1}, \mathcal{D}_n) \mid x_{n+1}, \mathcal{D}_n]$. Since we are considering one step ahead in the
future, the acquisition function $\alpha_1(x_{n+1}; \mathcal{D}_n)$ becomes the one-step lookahead policy, and
our goal is to select the location that maximizes the expected marginal gain in the utility
of the collected dataset.
Figure 3-3. Deriving the one-step lookahead policy by maximizing the expected
marginal gain in the utility of the acquired observations
Following the same mechanics as before, the multi-step lookahead policy would
make the optimal sampling decision on $x_{n+1}$ by maximizing the expected long-term
utility, where the expectation is taken with respect to randomness in future locations and
observations. Equivalently, we can rely on the terminal expected marginal gain in the
utility defined as follows:

$$\alpha_\tau(x_{n+1}; \mathcal{D}_n) = \mathbb{E}[u(\mathcal{D}_{n+\tau}) \mid x_{n+1}, \mathcal{D}_n] - u(\mathcal{D}_n)$$

which serves as the multi-step lookahead acquisition function to support the optimal
sequential optimization:

$$x_{n+1}^* = \underset{x_{n+1} \in \mathcal{A}}{\arg\max}\ \alpha_\tau(x_{n+1}; \mathcal{D}_n)$$

where the definition is only shifted downward by a constant value $u(\mathcal{D}_n)$ compared with
maximizing the expected terminal utility $\mathbb{E}[u(\mathcal{D}_{n+\tau}) \mid x_{n+1}, \mathcal{D}_n]$ alone.
Here we explicitly write the posterior probability distribution of $y_{n+1}$ as $p(y_{n+1} \mid x_{n+1}, \mathcal{D}_n)$
and the following joint probability distributions of $\{(x_{n+i}, y_{n+i}), i = 2, \ldots, \tau\}$ as
$\prod_{i=2}^{\tau} p(x_{n+i}, y_{n+i} \mid \mathcal{D}_{n+i-1})$. Integrating out these random variables would give us the eventual
multi-step lookahead marginal gain in the expected utility of the returned dataset.
Figure 3-4 summarizes the process of deriving the multi-step lookahead acquisition
function. Note that the simulation of the next round of candidate locations and
observations in $\{(x_{n+i}, y_{n+i}), i = 2, \ldots, \tau\}$ depends on all previously accumulated data
$\mathcal{D}_{n+i-1}$, which is used to construct the updated posterior belief based on both observed
and putative values.
Figure 3-4. The multi-step lookahead optimal policy that selects the best sampling
location by maximizing the marginal expected utility of the terminal dataset
We can glean more insight on the process of calculating this expression by drawing
out the sequence of nested expectation and maximization operations. As shown in
Figure 3-5, we start with the next sampling location xn + 1 in a maximization operator,
followed by yn + 1 in an expectation operator. The same pattern continues at later
stages, with a maximization operator in xn + 2, an expectation operator in yn + 2, and so
on, until reaching the putative observation yn + τ at the last stage. Each operator, be it
maximization or expectation, involves multiple branches. A common strategy is to solve
the maximization operation via a standard procedure such as L-BFGS and to approximate
the expectation operation via Gaussian quadrature.
Apparently, calculating a nested form of expectations that accounts for all possible
future paths is computationally challenging. In addition, since our goal is to select
an optimal sampling action by maximizing the acquisition function, we will add a
reasonable assumption that all future actions will also be optimal given the current
dataset, which may include putative realizations of the random variable on the
objective value. Adding the optimality condition means that rather than considering
all possible future paths of {(xn + i, yn + i), i = 1, …, τ}, we will only focus on the optimal one
$\{(x_{n+i}^*, y_{n+i}), i = 1, \ldots, \tau\}$, which essentially removes the dependence on the candidate
locations by choosing the maximizing location. The argument for selecting the optimal
action by maximizing the long-term expected gain in utility follows the Bellman principle
of optimality, as described in the next section.
which is the subject we seek to maximize. To explicitly connect with the one-step
lookahead acquisition function and the remaining τ − 1 steps of simulation into the
future, we can introduce the one-step utility $u(\mathcal{D}_{n+1})$ by adding and subtracting this term
in the expectation, as shown in the following:

$$\begin{aligned}
\alpha_\tau(x_{n+1}; \mathcal{D}_n) &= \mathbb{E}[u(\mathcal{D}_{n+\tau}) \mid x_{n+1}, \mathcal{D}_n] - u(\mathcal{D}_n) \\
&= \mathbb{E}[u(\mathcal{D}_{n+\tau}) - u(\mathcal{D}_{n+1}) + u(\mathcal{D}_{n+1}) \mid x_{n+1}, \mathcal{D}_n] - u(\mathcal{D}_n) \\
&= \mathbb{E}[u(\mathcal{D}_{n+1}) \mid x_{n+1}, \mathcal{D}_n] - u(\mathcal{D}_n) + \mathbb{E}[u(\mathcal{D}_{n+\tau}) - u(\mathcal{D}_{n+1}) \mid x_{n+1}, \mathcal{D}_n] \\
&= \alpha_1(x_{n+1}; \mathcal{D}_n) + \mathbb{E}[\alpha_{\tau-1}(x_{n+2}; \mathcal{D}_{n+1}) \mid x_{n+1}, \mathcal{D}_n]
\end{aligned}$$
Here, we have decomposed the long-term expected marginal gain in utility into the
sum of an immediate one-step lookahead gain in utility and the expected lookahead
gain for the remaining τ − 1 steps.
Now, following Bellman’s principle of optimality, all the remaining τ − 1 actions
will be made optimally. This means that instead of evaluating each candidate location
for $x_{n+2}$ when calculating $\alpha_{\tau-1}(x_{n+2}; \mathcal{D}_{n+1})$, we would only be interested in the location
with the maximal value, that is, $\max_{x_{n+2}} \alpha_{\tau-1}(x_{n+2}; \mathcal{D}_{n+1})$, or equivalently $\alpha_{\tau-1}^*(\mathcal{D}_{n+1})$, removing the
dependence on the location $x_{n+2}$. The multi-step lookahead acquisition function under
the optimality assumption thus becomes

$$\begin{aligned}
\alpha_\tau^*(\mathcal{D}_n) &= \max_{x_{n+1} \in \mathcal{A}} \left[ \alpha_1(x_{n+1}; \mathcal{D}_n) + \mathbb{E}\!\left[\alpha_{\tau-1}^*(\mathcal{D}_{n+1}) \mid x_{n+1}, \mathcal{D}_n\right] \right] \\
&= \max_{x_{n+1} \in \mathcal{A}} \left[ \alpha_1(x_{n+1}; \mathcal{D}_n) + \mathbb{E}\!\left[\max_{x_{n+2} \in \mathcal{A}} \alpha_{\tau-1}(x_{n+2}; \mathcal{D}_{n+1}) \mid x_{n+1}, \mathcal{D}_n\right] \right]
\end{aligned}$$
where we have plugged in the definition of $\alpha_{\tau-1}^*(\mathcal{D}_{n+1})$ as well to explicitly express the
optimal policy value $\alpha_\tau^*(\mathcal{D}_n)$ as a series of nested maximization and expectation
operations. Such a recursive definition is called the Bellman equation, which explicitly
reflects the condition that all follow-up actions need to be made optimally in order to make
an optimal action at the current step.
Figure 3-6 summarizes the process of deriving the Bellman equation for the multi-
step lookahead policy. Again, calculating the optimal policy value requires calculating
the expected optimal value of future subpolicies. Being recursive in nature, calculating
the current acquisition function can be achieved by adopting a reverse computation,
starting from the terminal step and performing the calculations backward. However, this
would still incur an exponentially increasing burden as the lookahead horizon expands.
Figure 3-6. Illustrating the derivation process of the Bellman equation for the
multi-step lookahead policy, where the optimal policy is expressed as a series of
maximization and expectation operations, assuming all follow-up actions need to
be made optimally in order to make the optimal action at the current step
We will touch upon several tricks to accelerate the calculation of this dynamic
programming (DP) problem later in the book and only highlight two common
approaches for now. One approach is called limited lookahead, which limits the number
of lookahead steps in the future. The other is to use a rollout approach with a base
policy, which reduces the maximization operator into a quick heuristic-based exercise.
Both approaches are called approximate dynamic programming (ADP) methods and
are illustrated in Figure 3-7. See the recent book titled Bayesian Optimization by Roman
Garnett for more discussion on this topic.
Expected Improvement
Acquisition functions differ in multiple aspects, including the choice of the utility
function, the number of lookahead steps, the level of risk aversion or preference, etc.
Introducing risk appetite directly benefits from the posterior belief about the underlying
objective function. In the case of GP regression as the surrogate model, the risk is
quantified by the covariance function, with its credible interval expressing the level of
uncertainty about the possible values of the objective.
When it comes to the utility of the collected observations, the expected improvement
chooses the maximum of the observed values as the benchmark for comparison upon
selecting an additional sampling location. It also implicitly assumes that only one
sampling is left before the optimization process terminates. The expected marginal gain in
utility (i.e., the acquisition function) becomes the expected improvement in the maximal
observation, calculated as the expected positive difference between the new observation
obtained after the additional sampling at an arbitrary location and the observed maximum.
which returns the marginal increment in the incumbent if $f_{n+1} > f_n^*$ and zero otherwise,
as a result of observing $f_{n+1}$. Readers familiar with activation functions in neural
networks will instantly connect this form with the ReLU (rectified linear unit)
function, which keeps a positive signal as it is and silences a negative one.
Due to randomness in $y_{n+1}$, we can introduce the expectation operator to integrate
it out, giving us the expected marginal gain in utility, that is, the expected improvement
acquisition function:

$$\alpha_{EI}(x_{n+1}; \mathcal{D}_n) = \mathbb{E}_{y_{n+1}}\!\left[\max(f_{n+1} - f_n^*, 0) \mid x_{n+1}, \mathcal{D}_n\right]$$
To obtain a closed-form expression of this quantity, the process starts by converting the
max operator into an integral, which is then separated into two different and easily
computable parts. These two parts correspond to exploitation and exploration,
respectively. Exploitation means continuing to sample the neighborhood of the observed
region with a high posterior mean, while exploration encourages sampling unvisited areas
where the posterior uncertainty is high. The expected improvement acquisition function
thus implicitly balances these two opposing forces.
$$\frac{\partial \alpha_{EI}(x_{n+1}; \mathcal{D}_n)}{\partial \mu_{n+1}} = \Phi\!\left(\frac{\mu_{n+1} - f_n^*}{\sigma_{n+1}}\right) > 0$$
$$\frac{\partial \alpha_{EI}(x_{n+1}; \mathcal{D}_n)}{\partial \sigma_{n+1}} = \phi\!\left(\frac{\mu_{n+1} - f_n^*}{\sigma_{n+1}}\right) > 0$$
Since the partial derivatives of the expected improvement with respect to μn + 1 and
σn + 1 are both positive, an increase in either parameter will result in a higher expected
improvement, thus completing the automatic trade-off between exploitation and
exploration under the GP regression framework.
It is also worth noting that $\sigma_{n+1} = 0$ occurs when the posterior mean function
passes through the observations. In this case, we have $\alpha_{EI}(x_{n+1}; \mathcal{D}_n) = 0$. In addition,
a hyperparameter ξ is often introduced to control the amount of exploration in
practical implementations. By subtracting ξ from $\mu_{n+1} - f_n^*$ in the preceding closed-form
expression, the posterior mean $\mu_{n+1}$ will have less impact on the overall improvement
compared to the posterior standard deviation $\sigma_{n+1}$. The closed-form expression of the
expected improvement acquisition function thus becomes
$$\alpha_{EI}(x_{n+1}; \mathcal{D}_n) = \begin{cases} (\mu_{n+1} - f_n^* - \xi)\,\Phi(z_{n+1}) + \sigma_{n+1}\,\phi(z_{n+1}) & \sigma_{n+1} > 0 \\ 0 & \sigma_{n+1} = 0 \end{cases}$$

where

$$z_{n+1} = \begin{cases} \dfrac{\mu_{n+1} - f_n^* - \xi}{\sigma_{n+1}} & \sigma_{n+1} > 0 \\ 0 & \sigma_{n+1} = 0 \end{cases}$$
import numpy as np
import random
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern
SEED = 8
random.seed(SEED)
np.random.seed(SEED)
%matplotlib inline
Next, we will define the objective function and search domain. The objective
function provides noise-perturbed observations upon sampling at an arbitrary location
within the search domain. It will also be used to generate noise-free observations for
reference during plotting.
As shown in the following code listing, we generate a random number from a
standard normal distribution based on the dimension of the domain, accessed via the *
sign to unpack the tuple into an acceptable format. The value is then multiplied by the
prespecified noise level for the observation model. The search domain is specified as a
nested list in bounds, where the inner list contains the upper and lower bounds for each
dimension; in this case, we are looking at a single-dimensional search domain.
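The following sketch matches this description; the functional form, noise level, and
bounds below are placeholders rather than the book's exact objective:

noise = 0.2
# single-dimensional search domain: one [lower, upper] pair per dimension
bounds = np.array([[-1.0, 2.0]])

def f(X, noise=noise):
    # noise-free signal plus Gaussian noise matching the input shape
    return -np.sin(3 * X) - X**2 + 0.7 * X + noise * np.random.randn(*X.shape)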
Now we can visualize the objective function and generate two random noisy samples
in X_init and Y_init to kick-start the optimization procedure. Note that plotting a
function is completed by generating a dense grid of points/locations with the search
bounds in X_plot, calculating the corresponding noise-free functional values Y_plot for
each location, and connecting these values smoothly, as shown in the following code
listing.
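Again as a sketch (the two initial locations are placeholders):

# two initial noisy samples
X_init = np.array([[-0.9], [1.1]])
Y_init = f(X_init)
# dense grid of locations and their noise-free values for plotting
X_plot = np.arange(float(bounds[0, 0]), float(bounds[0, 1]), 0.01).reshape(-1, 1)
Y_plot = f(X_plot, 0)
plt.plot(X_plot, Y_plot, 'y--', lw=2, label='noise-free objective')
plt.plot(X_init, Y_init, 'kx', mew=3, label='initial samples')
plt.legend()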
The result is shown in Figure 3-9. Note that the two samples are selected to be
sufficiently distant from each other. In practice, a good initial design should have
good coverage of the whole search domain to promise a good GP prior before
optimization starts.
Figure 3-9. Visualizing the underlying objective function and two initial random
noisy samples
# the function header and the two posterior queries below are reconstructed
def expected_improvement(X, X_sample, Y_sample, gpr, xi=0.01):
    # posterior mean and standard deviation at the candidate locations
    mu, sigma = gpr.predict(X, return_std=True)
    # posterior mean at the locations sampled so far
    mu_sample = gpr.predict(X_sample)
    sigma = sigma.reshape(-1, 1)
    # incumbent: best posterior mean among the sampled locations
    mu_sample_opt = np.max(mu_sample)
    # ignore divide by zero warning if any
    with np.errstate(divide='warn'):
        # calculate ei if sd>0
        imp = mu - mu_sample_opt - xi
        Z = imp / sigma
        ei = imp * norm.cdf(Z) + sigma * norm.pdf(Z)
        # set zero if sd=0
        ei[sigma == 0.0] = 0.0
    return ei
Listing 3-5. Proposing the next sampling point by optimizing the acquisition
function
# the function header and the inner objective min_obj are reconstructed
def propose_location(acquisition, X_sample, Y_sample, gpr, bounds, n_restarts=25):
    dim = X_sample.shape[1]
    min_val = 1
    min_x = None
    def min_obj(X):
        # negate the acquisition so that maximization becomes minimization
        return -acquisition(X.reshape(-1, dim), X_sample, Y_sample, gpr)
    # iterate through n_restarts different random points and return the most promising result
    for x0 in np.random.uniform(bounds[:, 0], bounds[:, 1], size=(n_restarts, dim)):
        # use an off-the-shelf solver based on approximate second-order derivatives
        res = minimize(min_obj, x0=x0, bounds=bounds, method='L-BFGS-B')
        # replace the running optimum if improved
        if res.fun < min_val:
            min_val = res.fun[0]
            min_x = res.x
    return min_x.reshape(-1, 1)
Before entering BO's outer loop to seek the global optimum, we will define a few
utility functions that plot the policy performance across iterations. This includes
the plot_approximation() function that plots the GP posterior mean and 95%
confidence interval along with the collected samples and objective function, the plot_
acquisition() function that plots the expected improvement across the domain along
with the location of the maximum, and the plot_convergence() function that plots the
distances between consecutive sampling locations and the running optimal value as
optimization proceeds. All three functions are defined in the following code listing.
Listing 3-6. Utility functions for plotting the GP approximation, the acquisition
function, and the convergence behavior
# the function header is reconstructed from the surrounding description
def plot_convergence(X_sample, Y_sample, n_init=2):
    x = X_sample[n_init:].ravel()
    y = Y_sample[n_init:].ravel()
    r = range(1, len(x)+1)
    # distance between consecutive sampling locations
    x_neighbor_dist = [np.abs(a-b) for a, b in zip(x, x[1:])]
    # best observed value until the current time point
    y_max = np.maximum.accumulate(y)
    # plot the distance between consecutive sampling locations
    plt.subplot(1, 2, 1)
    plt.plot(r[1:], x_neighbor_dist, 'bo-')
    plt.xlabel('Iteration')
    plt.ylabel('Distance')
    plt.title('Distance between consecutive x\'s')
    # plot the evolution of the observed maximum so far
    plt.subplot(1, 2, 2)
    plt.plot(r, y_max, 'ro-')
    plt.xlabel('Iteration')
    plt.ylabel('Best Y')
    plt.title('Value of best selected sample')
Now we can move into the main outer loop to look for the global optimum by
maximizing the expected improvement at each stage. In the following code listing, we
first instantiate a GP regressor with a Matérn kernel, which accepts two hyperparameters
that can be estimated by maximizing the marginal likelihood of the observed samples.
In this case, we fix these hyperparameters to simplify the process. The GP regressor also
accepts the unknown noise level via the alpha argument to incorporate noise in the
observations.
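One possible instantiation matching this description (the specific hyperparameter
values are assumptions) is:

# Matérn 5/2 kernel with fixed hyperparameters (optimizer=None disables refitting)
m52 = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=2.5)
gpr = GaussianProcessRegressor(kernel=m52, alpha=noise**2, optimizer=None)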
X_sample = X_init
Y_sample = Y_init
# number of optimization iterations
n_iter = 20
# specify figure size
plt.figure(figsize=(12, n_iter * 3))
plt.subplots_adjust(hspace=0.4)
# start of optimization
for i in range(n_iter):
    # update GP posterior given existing samples
    gpr.fit(X_sample, Y_sample)
    # obtain next sampling point from the acquisition function (expected_improvement)
    X_next = propose_location(expected_improvement, X_sample, Y_sample, gpr, bounds)
    # obtain next noisy sample from the objective function
    Y_next = f(X_next, noise)
    # plot samples, surrogate function, noise-free objective and next sampling location
    plt.subplot(n_iter, 2, 2 * i + 1)
    plot_approximation(gpr, X_plot, Y_plot, X_sample, Y_sample, X_next,
                       show_legend=i==0)
    plt.title(f'Iteration {i+1}')
    plt.subplot(n_iter, 2, 2 * i + 2)
    plot_acquisition(X_plot, expected_improvement(X_plot, X_sample, Y_sample, gpr),
                     X_next, show_legend=i==0)
    # append the additional sample to previous samples
    X_sample = np.vstack((X_sample, X_next))
    Y_sample = np.vstack((Y_sample, Y_next))
Here, we use X_sample and Y_sample to be the running dataset augmented with
additional samples as optimization continues for a total of 20 iterations. Each iteration
consists of updating the GP posterior, locating the maximal expected improvement,
observing at the proposed location, and incorporating the additional sample to the
training set.
Figure 3-10. Plotting the first three iterations, in which the EI-based BO performs
more exploration at regions with high uncertainty
Figure 3-11. Concentration of sampling locations at the left peak of the objective
function, a sign of exploitation as the optimization process converges
For the full list of intermediate plots across iterations, please visit the accompanying
notebook for this chapter at https://github.com/jackliu333/bayesian_optimization_theory_and_practice_with_python/blob/main/Chapter_3.ipynb.
Once the optimization completes, we can examine its convergence using the
plot_convergence() function. As shown in the left plot in Figure 3-12, a larger distance
corresponds to more exploration, which occurs mostly at the initial stage of optimization
as well as at iterations 17 and 18, even when the optimization seems to be converging.
Such exploration is automatically enabled by expected improvement and helps the search
jump out of local optima in pursuit of a potentially higher global optimum. This is
also reflected in the right plot, where a higher value is obtained at iteration 17 due to
exploration.
Figure 3-12. Plotting the distance between consecutive proposed locations and the
value of the best-selected sample as optimization proceeds
At this point, we have managed to implement the full BO loop using expected
improvement from scratch. Next, we will look at a few BO libraries that help us achieve
the same task.
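One such library is scikit-optimize, whose gp_minimize() function wraps the entire loop
in a single call. A minimal sketch, reusing the objective f and noise level defined earlier
and treating the bounds and random seed as placeholders:

from skopt import gp_minimize

# gp_minimize performs minimization, so the objective is negated
res = gp_minimize(lambda x: -f(np.array(x), noise).item(),
                  dimensions=[(-1.0, 2.0)],   # search bounds per dimension
                  acq_func='EI',              # expected improvement
                  xi=0.01,                    # exploration parameter
                  n_calls=20,                 # total number of evaluations
                  noise=noise**2,             # variance of the observation noise
                  random_state=8)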
Running the preceding code will generate Figure 3-13, which shows a concentration
of samples around the global maximum at the left peak. Note that the samples are not
exactly the same as in our previous example due to the nondeterministic nature of the
optimization procedure as well as the randomness in the observation model.
Figure 3-13. Visualizing the proposed samples using the gp_minimize() function
We can also show the plots on the distances of consecutive proposals and the best-
observed value. As shown in Figure 3-14, even though the optimizer obtains a high value
at the second iteration, it continues to explore promising regions with high uncertainty,
as indicated by the two peaks in the distance plot.
Summary
Bayesian optimization is an extension of classic Bayesian decision theory, where the
extension lies in the choice and use of the surrogate model and the acquisition function.
In this chapter, we covered the following list of items:
• The inner BO loop involves seeking the location that maximizes the
acquisition function, and the outer BO loop seeks the location of the
global optimum.
In the next chapter, we will revisit the Gaussian process and discuss GP regression
using a widely used framework: GPyTorch.
CHAPTER 4
Gaussian Process Regression with GPyTorch
So far, we have grasped the main components of a Bayesian optimization procedure: a
surrogate model that provides posterior estimates on the mean and uncertainty of the
underlying objective function and an acquisition function that guides the search for
the next sampling location based on its expected gain in the marginal utility. Efficiently
calculating the posterior distributions becomes essential in the case of parallel Bayesian
optimization and Monte Carlo acquisition functions, a branch that evaluates multiple
points simultaneously and is discussed in a later chapter.
In this chapter, we will focus on the first component of building and refreshing a
surrogate model and its implementation based on a highly efficient, state-of-the-art
package called GPyTorch. The GPyTorch package is built on top of PyTorch and inherits
all its built-in advantages, such as GPU acceleration and auto-differentiation. It serves as
the backbone of BoTorch, the state-of-the-art BO package widely used in research and
practice and to be introduced in the next chapter.
Introducing GPyTorch
Gaussian processes (GP) model the underlying objective function as a sample from the
distribution of functions, where the distribution takes a prior form and gets updated as
a posterior distribution upon revealing functional observations. Being a nonparametric
model, it has an unlimited expressive capacity to learn from the training data and make
predictions with a quantified uncertainty level. After choosing a kernel function to
encode our prior assumption on the smoothness of the objective function, we can fit a
GP model and apply it in a regression setting to obtain the posterior mean and variance
at any new point, a topic covered in Chapter 2.
When we are concerned with estimating the value at a new location along with its
associated variance, performing either interpolation or extrapolation, we are conducting
GP regression. Depending on the nature of the outcome, the model can be extended to
the classification setting as well. In the next section, we will briefly cover the basics of
PyTorch that underpins the GPyTorch library, followed by a short review of the basics of
GP regression and an implementation case using GPyTorch.
Note that the data type in the output may differ depending on the specification of the
environment. The output suggests that the variable a is a tensor object. Alternatively, we
can also create a tensor by converting a NumPy array:
>>> b = np.array([2,3,4])
>>> b = torch.from_numpy(b)
>>> print(b)
tensor([2, 3, 4])
>>> c = torch.add(a, b)
>>> print(c)
tensor([3, 5, 7])
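The tensor a_grad used next is presumably created with gradient tracking enabled,
for example:

>>> a_grad = torch.tensor([1., 2., 3.], requires_grad=True)
>>> print(a_grad)
tensor([1., 2., 3.], requires_grad=True)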
>>> c = a_grad**2
>>> print(c)
tensor([1., 4., 9.], grad_fn=<PowBackward0>)
Here, the grad_fn attribute suggests that this is a power function when performing
backward propagation and computing the gradients. Recall that in basic calculus, for
a power function y = x2, the gradient (or first derivative) is y′ = 2x. Let us sum up all the
elements to aggregate the inputs into a single output:
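The summation step itself presumably looks like the following, where 14 = 1 + 4 + 9:

>>> out = c.sum()
>>> print(out)
tensor(14., grad_fn=<SumBackward0>)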
Other than the summed result of 14, the output also suggests that this result comes
from a summation operation. With all the basic operations specified and configured in
the computation graph, we can perform backpropagation using the autograd feature by
invoking the backward() function:
>>> out.backward()
>>> print(a_grad.grad)
tensor([2., 4., 6.])
The gradients are now automatically calculated and accessible via the grad attribute,
which matches our expected output.
When performing the inner optimization in a BO loop, that is, locating the maximum
of the acquisition function, the gradient-based method is often used. When such
gradients can be easily calculated, the optimization can be managed efficiently. Methods
such as multi-start stochastic gradient ascent (as we want to maximize the acquisition
function) heavily rely on the autograd feature in PyTorch, GPyTorch, and BoTorch in
particular. We will discuss this algorithm in a later chapter.
Next, let us revisit the basics of GP regression.
Revisiting GP Regression
GP regression relies on the closed-form solution for the predictive mean and variance
of the random variable y∗ at any new sampling location x∗. Suppose we have collected
a training set of n observations n x i ,yi i 1 , where each sampled location xi ∈ ℝd is
n
and k x ,x f x f x f x f x . For any finite set of random
variables within the domain, their joint distribution follows a multivariate Gaussian
distribution, resulting in y1:n ~ 0 ,K n , where we assume a constant zero mean
function μ = 0 and Kn = Kn(x1 : n, x1 : n) denotes the covariance matrix, noting Kn(x1 : n, x1 : n)i, j
= k(xi, xj).
Before fitting a GP model, we must define the mean and covariance (kernel) functions
for the prior GP. There are multiple mean functions available in GPyTorch, including the zero
mean function gpytorch.means.ZeroMean(), the constant mean function gpytorch.means.
ConstantMean(), and the linear mean function gpytorch.means.LinearMean(). In this case,
we use the constant mean function, assuming that there is no systematic trend in the data.
As for the kernel function, we choose one of the most widely used kernels: the RBF
kernel, or the squared exponential kernel $k(x_1, x_2) = \exp\!\left(-\frac{\|x_1 - x_2\|^2}{2l^2}\right)$, where l is the
length-scale parameter that can be optimized via a maximum likelihood procedure. The
RBF kernel can be instantiated by calling the class gpytorch.kernels.RBFKernel().
We would also add a scaling coefficient to the output of the kernel function, simply
by wrapping it with kernels.ScaleKernel(). Indeed, as the GPyTorch official
documentation suggests: If you don’t know what kernel to use, we recommend that you
start out with a gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel). The
following code listing defines the mean and kernel functions.
mean_fn = gpytorch.means.ConstantMean()
kernel_fn = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
Next, we will define a class to perform GP regression based on the exact GP inference
class gpytorch.models.ExactGP. As shown in the following code listing, we define a
GPRegressor class that subclasses ExactGP.
class GPRegressor(gpytorch.models.ExactGP):
    def __init__(self, X, y, mean, kernel, likelihood=None):
        # choose the standard observation model as required by exact GP inference
        if likelihood is None:
            likelihood = gpytorch.likelihoods.GaussianLikelihood()
        # initiate the superclass ExactGP to refresh the posterior
        super().__init__(X, y, likelihood)
        # store attributes
        self.mean = mean
        self.kernel = kernel
        self.likelihood = likelihood
    def predict(self, x):
        # switch to inference mode and disable gradient tracking
        self.eval()
        with torch.no_grad():
            # get posterior distribution p(f|x)
            pred = self(x)
            # convert posterior distribution p(f|x) to p(y|x)
            return self.likelihood(pred)
Next, the forward() function computes the posterior distribution at any input
location and returns a multivariate normal distribution parameterized by the realization
of the posterior mean and kernel functions at the particular location. This function
gets called automatically when an instance of the GPRegressor class is used to make
predictions. Note that MultivariateNormal is the only distribution allowed in the case of
exact GP inference in GPyTorch.
Finally, the predict() function takes the output p(f | x) from forward() and overlays
an observation model to produce p(y | x). Note that we need to set the GP model to
inference mode via the eval() function and perform inference without calculating the
gradients, as indicated by the torch.no_grad() context manager.
We can then create a model by passing in the required arguments in the self-defined
class as follows, which will train a GP model based on the provided training set and
model configurations for the mean and kernel functions:
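A plausible invocation, assuming X_train and y_train hold the noisy training
observations:

model = GPRegressor(X_train, y_train, mean_fn, kernel_fn)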
Now, let us visualize the posterior GP after fitting the model to the ten noisy
observations. As shown in the following code listing, we create a function that accepts
the learned GP model instance and optionally the range of the x-axis stored in xlim for
plotting. We first extract the training data by accessing the train_inputs and train_
targets attributes, respectively, both converted to NumPy format. When xlim is not
provided, we rely on the upper and lower limits of X_train with a slight extension. Next,
we create a list of equally spaced input locations in X_plot and score them to obtain the
corresponding posterior predictive distributions, whose mean, upper, and lower bounds
are used for plotting.
def plot_model(model, xlim=None):  # the function name and header are reconstructed
    """
    Plot the fitted GP model's posterior mean and confidence region.

    input:
        model : gpytorch.models.GP
        xlim : the limit of x axis, tuple(float, float) or None
    """
    # extract training data in numpy format
    X_train = model.train_inputs[0].cpu().numpy()
    y_train = model.train_targets.cpu().numpy()
    # obtain range of x axis
    if xlim is None:
        xmin = float(X_train.min())
        xmax = float(X_train.max())
        x_range = xmax - xmin
        xlim = [xmin - 0.05 * x_range, xmax + 0.05 * x_range]
    # create a list of equally spaced input locations following the same dtype as parameters
    model_tensor_example = list(model.parameters())[0]
    X_plot = torch.linspace(xlim[0], xlim[1], 200).to(model_tensor_example)
    # generate predictive posterior distribution
    model.eval()
    predictive_distribution = model.predict(X_plot)
    # obtain mean, upper and lower bounds
    lower, upper = predictive_distribution.confidence_region()
    prediction = predictive_distribution.mean.cpu().numpy()
    X_plot = X_plot.numpy()
    # plot posterior mean, confidence region, and training data
    # (the plotting details below are a reasonable reconstruction)
    plt.plot(X_plot, prediction, label='posterior mean')
    plt.fill_between(X_plot, lower.cpu().numpy(), upper.cpu().numpy(),
                     alpha=0.3, label='confidence region')
    plt.scatter(X_train, y_train, marker='x', c='k', label='observations')
    plt.legend()
We can see that the GP regression model is a rough fit with a wide confidence
interval across the domain. Since the length-scale parameter for the kernel function has
a direct influence on the shape of the resulting model fitting, we will discuss how to fine-
tune this parameter to improve the predictive performance in the next section.
Listing 4-5. Plotting the kernel function with different length-scale parameters
# assumes torch, gpytorch, numpy (np), and matplotlib.pyplot (plt) are imported
def plot_kernel(kernel, ax=None):
    # evaluate the kernel between a grid of inputs and the origin
    # (the grid range below is a reconstruction)
    x = torch.linspace(-4, 4, 200)
    with torch.no_grad():
        K = kernel(x, torch.zeros(1)).evaluate()
    if ax is None:
        fig = plt.figure()
        ax = fig.add_subplot(111)
    ax.plot(x.numpy(), K.cpu().numpy())

k = gpytorch.kernels.RBFKernel()
all_lengthscale = np.asarray([0.2, 0.5, 1, 2, 4, 8])
figure, axes = plt.subplots(2, 3, figsize=(12, 8))
for tmp_lengthscale, ax in zip(all_lengthscale, axes.ravel()):
    k.lengthscale = tmp_lengthscale
    plot_kernel(k, ax=ax)
    ax.set_ylim([0, 1])
    ax.legend([tmp_lengthscale])
Running this code snippet generates Figure 4-4. When the length scale is small, the
kernel function puts more weight on nearby points and discards distant ones by
setting the resulting similarity value to zero, thus displaying a less smooth curve. The
kernel function becomes smoother as the length scale increases.
Figure 4-4. Visualizing the kernel function under different length-scale parameter
settings
Back to our previous example. Upon initializing the kernel function in kernel_fn,
we can access its length-scale parameter via the base_kernel.lengthscale attribute,
where base_kernel refers to the RBF kernel wrapped inside the scaling operation. As shown
in the following, the default value is 0.6931, and the kernel object has a gradient attribute
that will be used for the gradient calculation flowing from the loss (the negative marginal
log likelihood, to be discussed shortly) to the length-scale parameter:
>>> kernel_fn.base_kernel.lengthscale
tensor([[0.6931]], grad_fn=<SoftplusBackward0>)
Changing this parameter will result in a different GP model fitting. For example, we
can manually set the length-scale parameter to 0.1 via the following code snippet:
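A plausible version of that snippet (plot_model refers to the plotting utility sketched
earlier) is:

kernel_fn.base_kernel.lengthscale = 0.1
plot_model(model)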
Running this code snippet generates Figure 4-5, where the model fitting now looks
better in terms of varying the overall confidence intervals along with the observed
data points.
Figure 4-5. GP regression model after setting the length-scale parameter manually
However, manually setting the length-scale parameter and observing the fitting
quality is time-consuming and inefficient. We need to resort to an automatic procedure
that seeks the optimal parameter by optimizing a specific metric, such as the likelihood
of the observed data.
Given a GP model $f \sim \mathcal{GP}(\mu, k)$ and a training set (X, y), the likelihood of observing
the outcome y can be expressed as

$$p_f(y \mid X) = \int p(y \mid f(X))\, p(f(X) \mid X)\, df$$
which can be computed exactly in an exact GP inference setting, that is, GP regression
with a Gaussian likelihood model. Negating the joint likelihood gives rise to the loss to
be minimized in the optimization procedure. In GPyTorch, this joint marginal likelihood
can be computed via first creating a gpytorch.mlls.ExactMarginalLogLikelihood()
instance, which takes the Gaussian likelihood and the exact GP model as inputs and
outputs the exact marginal log likelihood (MLL) based on the posterior distribution
p(f | X) and the target observations y.
Now let us look at how to optimize the length scale automatically based on an
optimization procedure similar to maximum likelihood estimation (MLE), where the
likelihood is negated and minimized. In the following code listing, we define a function
to perform the training procedure based on the existing training dataset (X and y), the
number of epochs (n_epochs, each epoch is a full pass of the training set), the learning
rate (lr, used by the chosen optimizer), and optionally the training log indicator
(verbose). Note that the variance σ2 of the Gaussian noise term used in the Gaussian
likelihood model is also another hyperparameter that needs to be fine-tuned. Both σ2
and lengthscale can be accessed via the model.parameters() function.
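A minimal sketch of the train() routine consistent with this description (the function
name, defaults, and the model argument are assumptions):

def train(model, X, y, n_epochs=100, lr=0.1, verbose=False):
    # switch to training mode
    model.train()
    # hyperparameters to optimize: the length scale and the noise variance
    training_parameters = model.parameters()
    optimizer = torch.optim.Adam(training_parameters, lr=lr)
    # exact marginal log likelihood as the training objective
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(model.likelihood, model)
    for epoch in range(n_epochs):
        optimizer.zero_grad()       # clear existing gradients
        output = model(X)           # model output at the training inputs
        loss = -mll(output, y)      # negative marginal log likelihood
        loss.backward()             # backpropagate gradients
        optimizer.step()            # update the hyperparameters
        if verbose and (epoch + 1) % 10 == 0:
            print(f'Epoch {epoch + 1}: loss = {loss.item():.3f}')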
The function starts by switching the model to the training mode. As for the optimizer
of the hyperparameters, we use the Adam optimizer, a popular choice known for its
stable updates and quick convergence. We also create an exact GP likelihood instance
that is used to calculate the joint marginal likelihood, an indication of the goodness of fit
in each training iteration. Each step of the gradient descent update consists of clearing
the existing gradients, calculating the loss function, backpropagating the gradients, and
performing the weight update.
As shown in Figure 4-6, after training the model for a total of 100 epochs, the result
GP fit looks much better, where the confidence bounds are much narrower compared
with before. Examining the length scale after training completes gives a value of 0.1279,
as shown in the following code snippet.
Figure 4-6. Visualizing the GP fit after optimizing model hyperparameters, that is,
the length scale and noise variance
>>> kernel_fn.base_kernel.lengthscale
tensor([[0.1279]], grad_fn=<SoftplusBackward0>)
Next, we will look at the noise variance term as another hyperparameter that
influences the smoothness of the fitted GP regression model.
The optimized noise variance can be accessed via the likelihood's noise attribute:
>>> model.likelihood.noise
tensor([0.0183], grad_fn=<AddBackward0>)
We can also override the noise variance, which directly impacts the resulting
smoothness and confidence interval of the fitted GP regression model. For example,
Figure 4-7 shows the fitted GP model after setting the noise variance to 0.1, where the
confidence intervals are much larger than earlier.
Figure 4-7. Visualizing the fitted GP model after overriding the noise variance
The automatic hyperparameter tuning procedure also performs well when the noise
level increases in the previous example. In the following code listing, we increase the
noise variance to 0.5 when generating noisy observations compared to 0.1 earlier. The
rest of the code remains the same.
Running the preceding code generates Figure 4-8. The confidence intervals are
narrower and more adaptive than before. The optimized noise is now 0.0114, as shown
in the following code snippet.
Figure 4-8. Visualizing the GP model fitted with increased noise variance
>>> model.likelihood.noise
tensor([0.0114], grad_fn=<AddBackward0>)
In certain situations, we would like to keep the noise variance fixed and unchanged
throughout the optimization process. This occurs when we have some prior knowledge
of the noise variance. In such a case, we can keep it out of the trainable parameters in
the training_parameters from the train() earlier and only optimize the length scale.
To achieve this, we can add an additional input argument fixed_noise_variance to
accept the expected noise variance, if any. Next, replace the assignment of training_
parameters using the following code snippet:
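One way to achieve this, assuming fixed_noise_variance is the new argument, is:

if fixed_noise_variance is not None:
    # fix the noise variance and exclude it from the trainable parameters
    model.likelihood.noise = fixed_noise_variance
    training_parameters = [p for name, p in model.named_parameters()
                           if not name.startswith('likelihood')]
else:
    training_parameters = model.parameters()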
Figure 4-9. Fitting GP regression model with fixed noise variance and optimized
length scale
Feel free to check out the entire training code in the accompanying notebook at
https://github.com/jackliu333/bayesian_optimization_theory_and_practice_with_python/blob/main/Chapter_4.ipynb.
The previous example shows the importance of choosing a proper kernel function
and optimizing the hyperparameters to obtain a good GP fit. In the next section, we will
look at different kernel functions available in GPyTorch and their combined usage in
improving model fitting performance.
• RBF kernel

$$k(x_1, x_2) = \sigma^2 \exp\!\left(-\frac{r^2}{2l^2}\right)$$

where r = ‖x1 − x2‖.
• Matérn kernel (with ν = 5/2)

$$k(x_1, x_2) = \sigma^2 \left(1 + \frac{\sqrt{5}\,r}{l} + \frac{5r^2}{3l^2}\right) \exp\!\left(-\frac{\sqrt{5}\,r}{l}\right)$$

• Linear kernel

$$k(x_1, x_2) = \sum_i \sigma_i^2\, x_{1,i}\, x_{2,i}$$

• Polynomial kernel

$$k(x_1, x_2) = \left(\sigma^2\, x_1^\top x_2 + c\right)^d$$

• Periodic kernel

$$k(x_1, x_2) = \sigma^2 \exp\!\left(-\frac{2\sin^2(\pi r)}{l^2}\right)$$
In the following code listing, we plot each kernel function using the plot_kernel()
function defined earlier. Note that we used the string match function split() to extract
the kernel function name as the title for each subplot.
covariance_functions = [gpytorch.kernels.RBFKernel(),
                        gpytorch.kernels.RQKernel(),
                        gpytorch.kernels.MaternKernel(nu=5/2),
                        gpytorch.kernels.LinearKernel(power=1),
                        gpytorch.kernels.PolynomialKernel(power=2),
                        gpytorch.kernels.PeriodicKernel()]
figure, axes = plt.subplots(2, 3, figsize=(9, 6))
axes = axes.ravel()
for i, k in enumerate(covariance_functions):
    plot_kernel(k, ax=axes[i])
    axes[i].set_title(str(k).split('(')[0])
figure.tight_layout()
Running the preceding code will generate Figure 4-10. Although each type of
kernel displays different characteristics, we can further combine them via addition and
multiplication operations to create an even more flexible representation, as we will show
in the next section.
$$k(x_1, x_2) = k_1(x_1, x_2) + k_2(x_1, x_2)$$

$$k(x_1, x_2) = k_1(x_1, x_2) \times k_2(x_1, x_2)$$
Now let us look at how to combine different kernel functions. We first create a list
kernel_functions to store three kernel functions: linear kernel, periodic kernel, and
RBF kernel. We also overload the + and * operators as lambda functions, so that +(a,b)
becomes a+b and *(a,b) is equivalent to a*b. Since we would like to apply the two
operators for each unique combination of any two kernel functions, the combinations()
function from the itertools package is used to generate the unique combinations to be
iterated over. See the following code listing for the full implementation details.
import itertools

kernel_functions = [gpytorch.kernels.LinearKernel(power=1),
                    gpytorch.kernels.PeriodicKernel(),
                    gpytorch.kernels.RBFKernel()]

# overload the + and * operators
operations = {'+': lambda x, y: x + y,
              '*': lambda x, y: x * y}

figure, axes = plt.subplots(len(operations), len(kernel_functions), figsize=(9, 6))
axes = axes.ravel()
count = 0

# iterate through each unique combination of kernels and operators
for j, base_kernels in enumerate(itertools.combinations(kernel_functions, 2)):
    for k, (op_name, op) in enumerate(operations.items()):
        kernel = op(base_kernels[0], base_kernels[1])
        plot_kernel(kernel, ax=axes[count])
        kernel_names = [str(base_kernels[i]).split('(')[0] for i in [0, 1]]
        axes[count].set_title('{} {} {}'.format(kernel_names[0], op_name, kernel_names[1]),
                              fontsize=12)
        count += 1

figure.tight_layout()
Running the preceding code will generate Figure 4-11. Note that when adding the
linear kernel with the periodic kernel, the resulting combined kernel displays both linear
trend and periodicity.
Of course, we can extend the combination to more than two kernels or even apply an
automatic procedure that searches for the optimal combination of kernels based on the
available training data. The next section will demonstrate the model’s learning capacity
improvement by applying more complex and powerful kernel combinations.
data = np.load('airline.npz')
X = torch.tensor(data['X'])
y = torch.tensor(data['y']).squeeze()
train_indices = list(range(70)) + list(range(90, 129))
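The remainder of the listing, sketched below, indexes the training subset and plots it (the plotting details are assumptions):

X_train, y_train = X[train_indices], y[train_indices]
plt.scatter(X_train.numpy(), y_train.numpy(), marker='x', color='black')
plt.xlabel('Year')
plt.ylabel('Passenger count')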
Running the preceding code generates Figure 4-12, showing the training set to be
used to fit a GP regression model. The figure suggests an increasing trend with a seasonal
pattern, which can be explained by the growing travel demand and seasonality across
the year. In addition, the vertical range is also increasing, indicating a heteroscedastic
variance across different months. These structural components, namely, the trend,
seasonality, and random noise, are essential elements often modeled in the time-series
forecasting literature. In this exercise, we will explore different kernel combinations and
eventually model the GP using three types of composite kernels in an additive manner.
Figure 4-12. Visualizing the training set of the monthly airline passenger count data
Let us try a vanilla RBF kernel function and observe the fitting performance. The
following code listing uses a similar hyperparameter optimization procedure. The only
notable difference is the use of the double() function to convert the GP model instance
to double precision format.
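A sketch of that listing, reusing GPRegressor, train(), and plot_model() from earlier in the chapter (the constant mean function is an assumption):

mean_fn = gpytorch.means.ConstantMean()
kernel_fn = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
# convert the model to double precision to match the airline data tensors
model = GPRegressor(X_train, y_train, mean_fn, kernel_fn).double()
train(model, X_train, y_train, verbose=False)
plot_model(model, xlim=[1948, 1964])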
Running this code listing outputs Figure 4-13, which shows a poor fitting
performance in that the structural components are not captured.
Figure 4-13. Visualizing the GP regression fit using the RBF kernel
The fitted model displays a smooth increasing trend within the interpolation region
(between 1949 and 1960) and a decreasing trend in the extrapolation region (after 1960).
We can improve the fitting performance for the trend component by adding another
polynomial kernel of degree one, as shown in the following code listing.
kernel_fn = gpytorch.kernels.ScaleKernel(gpytorch.kernels.PolynomialKernel(power=1)) + \
            gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
model = GPRegressor(X_train, y_train, mean_fn, kernel_fn).double()
train(model, X_train, y_train, verbose=False)
plot_model(model, xlim=[1948, 1964])
Running the preceding code generates Figure 4-14, where the long-term trend
displays a certain level of flexibility compared with earlier.
Figure 4-14. Visualizing the GP regression fit using the RBF kernel and
polynomial kernel
Since the data is seasonal, we can additionally build another kernel to capture such seasonality. In the following code listing, we create a seasonality-seeking kernel called k_seasonal by multiplying the periodic kernel, linear kernel, and RBF kernel. This kernel will then work together with the previous trend-seeking kernel via the AdditiveKernel() function, a utility function that supports summing over multiple component kernels.
k_trend = (gpytorch.kernels.ScaleKernel(gpytorch.kernels.PolynomialKernel(power=1)) +
           gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel()))
k_seasonal = (gpytorch.kernels.ScaleKernel(gpytorch.kernels.PeriodicKernel()) *
              gpytorch.kernels.LinearKernel() *
              gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel()))
kernel_fn = gpytorch.kernels.AdditiveKernel(k_trend, k_seasonal)
model = GPRegressor(X_train, y_train, mean_fn, kernel_fn).double()
train(model, X_train, y_train, verbose=False)
plot_model(model, xlim=[1948, 1964])
Running the preceding code generates Figure 4-15, where the trend and seasonality
components are well captured.
Figure 4-15. Visualizing the GP regression fit using composite kernels on both
trend and seasonality
However, the fitting still appears wiggly due to a uniform noise variance applied across the whole domain. In the next attempt, we will add another kernel that increases the noise variance as we travel from left to right. Such a kernel can be constructed as the product of a white noise kernel and a linear kernel, where the white noise kernel assumes a constant variance across the whole domain, and the linear kernel gradually lifts the variance when multiplied by the white noise kernel.
In the following code listing, we first define a class as the white noise kernel. When
calling the class instance, its forward() function will return the prespecified noise
variance (wrapped via a lazy tensor format for efficiency) when the two input locations
overlap and zero otherwise. This kernel is then multiplied with a linear kernel to model
the heteroscedastic noise component, which is added to the previous two composite
kernels to model the GP jointly.
from gpytorch.lazy import DiagLazyTensor

class WhiteNoiseKernel(gpytorch.kernels.Kernel):
    # the class header and constructor are a minimal reconstruction;
    # the forward() body follows the original listing
    def __init__(self, noise=0.1):
        super().__init__()
        self.noise = noise

    def forward(self, x1, x2, **params):
        if torch.equal(x1, x2):
            # same input locations: return the prespecified noise variance on the diagonal
            return DiagLazyTensor(torch.ones(x1.shape[0]).to(x1) * self.noise)
        else:
            # different input locations: no noise contribution
            return torch.zeros(x1.shape[0], x2.shape[0]).to(x1)

k_noise = gpytorch.kernels.ScaleKernel(WhiteNoiseKernel(noise=0.1)) * \
          gpytorch.kernels.LinearKernel()
kernel_fn = gpytorch.kernels.AdditiveKernel(k_trend, k_seasonal, k_noise)
model = GPRegressor(X_train, y_train, mean_fn, kernel_fn).double()
train(model, X_train, y_train, verbose=False)
plot_model(model, xlim=[1948, 1964])
Running the preceding code generates Figure 4-16, where the GP fitting looks much
better now. The overall trend, seasonality, and heteroscedastic noise are well captured in
the interpolation region and continued in the extrapolation region.
Figure 4-16. Visualizing the GP regression fit using composite kernels on the
trend, seasonality, and heteroscedastic noise
Summary
Building an excellent surrogate model is essential for effective BO. In this chapter, we discussed one of the most popular (and modular) GP packages, GPyTorch, and its use in various aspects. Specifically, we covered the following:
• Fitting a GP regression model and optimizing its hyperparameters, namely, the length scale and the noise variance, by maximizing the marginal log likelihood
• Fixing the noise variance when prior knowledge about it is available
• The common kernel functions in GPyTorch and how to combine them via addition and multiplication
• Using composite kernels to capture the trend, seasonality, and heteroscedastic noise in the airline passenger count data
In the next chapter, we will look at Monte Carlo acquisition functions and introduce
a series of optimization techniques that turn out to be quite helpful in accelerating the
BO process.
CHAPTER 5
Monte Carlo Acquisition Function with Sobol Sequences and Random Restart
In this chapter, we will start with the analytic expected improvement (EI) acquisition function in the first section and switch to the corresponding Monte Carlo (MC)–based EI in the next section, which requires MC estimation for a functional evaluation of the acquisition function.
In the following section, we will introduce the Hartmann function, a widely used
synthetic function to benchmark the performance of optimization algorithms.
4 6 2
f x i exp Aij x j Pij
i 1 j 1
where αi, Aij, and Pij are all constants. The function is usually evaluated over a six-
dimensional unit hypercube, namely, xj ∈ (0, 1) for all j = 1, …, 6. The global minimum
is obtained at x∗ = (0.20169, 0.150011, 0.476874, 0.275332, 0.311652, 0.6573), with
f (x∗) = − 3.32237.
Let us look at how to play around with the Hartmann function available in BoTorch.
First, we will need to install the package via the following command:
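The standard installation command is as follows (prefix it with an exclamation mark when running inside a notebook cell):

pip install botorch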
import random
import numpy as np
import torch

SEED = 8
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
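A sketch of how the ten initial observations could be generated, using the negated Hartmann function from botorch.test_functions (the exact sampling call is an assumption):

from botorch.test_functions import Hartmann

neg_hartmann6 = Hartmann(dim=6, negate=True)  # negate so that we maximize
train_x = torch.rand(10, 6)                   # 10 random points in the unit hypercube
train_obj = neg_hartmann6(train_x).unsqueeze(-1)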
>>> train_obj
tensor([[0.1066],
[0.0032],
[0.0027],
[0.7279],
[0.0881],
[0.8750],
[0.0038],
[0.1098],
[0.0103],
[0.7158]])
Next, we will build a GP-powered surrogate model and optimize the associated
hyperparameters using utilities provided by GPyTorch, a topic covered in the previous
chapter.
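A minimal sketch of the surrogate construction, assuming the SingleTaskGP model class (which matches the hyperparameter names printed below):

from botorch.models import SingleTaskGP

model = SingleTaskGP(train_X=train_x, train_Y=train_obj)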
The model object now serves as the surrogate model and provides posterior mean
and variance estimates for any sampling location within the unit hypercube.
The hyperparameters can be accessed via the named_hyperparameters() method as
follows, where the output is wrapped in a list in order to print out the generator object:
>>> list(model.named_hyperparameters())
[('likelihood.noise_covar.raw_noise', Parameter containing:
tensor([2.0000], requires_grad=True)),
('mean_module.constant', Parameter containing:
tensor([0.], requires_grad=True)),
('covar_module.raw_outputscale', Parameter containing:
tensor(0., requires_grad=True)),
('covar_module.base_kernel.raw_lengthscale', Parameter containing:
tensor([[0., 0., 0., 0., 0., 0.]], requires_grad=True))]
Based on the output, the default value for the length scale is zero for all six
dimensions and two for the noise variance. In order to optimize these hyperparameters
using the maximum likelihood principle, we will first obtain an instance that contains
the exact marginal log likelihood (stored in mll) and then use the fit_gpytorch_model
function to optimize these hyperparameters of the GPyTorch model instance. The
core optimization uses the L-BFGS-B procedure via the scipy.optimize.minimize()
function. The following code snippet performs the optimization of the hyperparameters:
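A sketch of that snippet:

from gpytorch.mlls import ExactMarginalLogLikelihood
from botorch.fit import fit_gpytorch_model

mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_model(mll)  # runs L-BFGS-B via scipy.optimize.minimize under the hood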
We can verify the hyperparameters now and check if these values have changed. As
shown in the following, the hyperparameters now assume different values, all of which
are optimized by maximizing the marginal log likelihood of the observed training data:
>>> list(model.named_hyperparameters())
[('likelihood.noise_covar.raw_noise', Parameter containing:
tensor([0.0060], requires_grad=True)),
('mean_module.constant', Parameter containing:
tensor([0.2433], requires_grad=True)),
('covar_module.raw_outputscale', Parameter containing:
tensor(-2.1142, requires_grad=True)),
('covar_module.base_kernel.raw_lengthscale', Parameter containing:
tensor([[-0.7155, -0.7190, -0.7218, -0.8089, -1.1630, -0.5477]],
requires_grad=True))]
In the next section, we will use the analytic form of the expected improvement
acquisition function and obtain the solution for the outer BO loop.
Introducing the Analytic EI
BoTorch provides the ExpectedImprovement class that computes the expected improvement over the current best-observed value based on the analytic formula we derived earlier. It requires two input parameters: a single-outcome GP model, ideally with optimized hyperparameters, and the best-observed scalar value, assumed to be noiseless. The calculation is based on the posterior mean and variance of the GP surrogate model at a single specific sampling location.
Next, we will instantiate the EI object based on best_value and the posterior GP
model object. In the following code snippet, we pass the first observation to the EI object
and obtain a result of 2.8202e-24, which is almost zero. Intuitively, this makes sense as
the marginal utility from an already sampled location should be minimal.
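A sketch of the instantiation and the evaluation at the first observed location (taking best_value as the maximum of train_obj is an assumption):

from botorch.acquisition import ExpectedImprovement

best_value = train_obj.max()
EI = ExpectedImprovement(model=model, best_f=best_value)
EI(train_x[0].unsqueeze(0))  # approximately 2.8e-24, essentially zero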
In this function, we first use the to() method of a tensor object to align the data
type of best_f to that of the candidate location X. We then access the posterior mean
and variance of the candidate location, with the shape arranged according to batch
evaluation and broadcasting needs and the standard deviation floored to avoid dividing
by zero. With the unit improvement calculated in u, we can plug in the analytic form of EI, given as follows for ease of reference, and perform the calculation accordingly:

$$\mathrm{EI}(x) = \sigma(x)\big[u\,\Phi(u) + \phi(u)\big], \qquad u = \frac{\mu(x) - f^*}{\sigma(x)}$$

where f∗ is the best-observed value so far, and ϕ and Φ denote the probability density and cumulative distribution function of a standard normal distribution, respectively. Note that the code extracts out a common factor of σ, making the implementation easier.
Next, we will look at how to optimize the black-box Hartmann function using EI.
After setting all these input arguments, we can run the following code snippet to
generate the final solution to the global optimum:
new_point_analytic, _ = optimize_acqf(
    acq_function=EI,      # acquisition function to guide the search
    bounds=torch.tensor([[0.0] * 6, [1.0] * 6]),  # 6D unit hypercube
    q=1,                  # generate one candidate location in each iteration
    num_restarts=20,      # number of starting points for multi-start optimization
    raw_samples=100,      # number of raw samples used for initialization
    options={},           # additional options, if any
)
>>> new_point_analytic
tensor([[0.1715, 0.3180, 0.0816, 0.4204, 0.1985, 0.7727]])
Any optimization routine starts with an initial condition, the starting point of an
optimizer, which could be set either randomly or following some heuristics. The initial
conditions are set by the optional batch_initial_conditions argument in optimize_
acqf(). Since this argument defaults to None and assuming we have no prior preference
for generating the initial optimization condition, optimize_acqf() will use the gen_
batch_initial_conditions() function to select a set of starting points in order to run
the multi-start gradient-based algorithm.
This set of starting points, also referred to as initial conditions in BoTorch, is selected based on a multinomial probability distribution whose probabilities are determined by the values of the acquisition function, additionally weighted by a temperature hyperparameter. Selecting the initial conditions involves two steps: generating a much bigger set of random locations (specified by the raw_samples parameter) to cover the search space as much as possible, and then selecting a subset of promising candidate locations to serve as initial starts (specified by num_restarts).
Let us look at these two steps in detail.
Intuitively, a good initial condition should bear a relatively high value when evaluated using the current acquisition function. This requires building a set of space-filling points that covers all regions of the search space without leaving out any big chunk of it. Although these initial points are randomly selected, drawing them directly from a uniform distribution does not guarantee an even coverage of the whole search space. Instead, we would like a quasi-random sequence of points with low discrepancy, that is, points that are evenly spread relative to one another.
To this end, the Sobol sequences are used in BoTorch to form a uniform partition
of the unit hypercube along each dimension. In other words, the quasi-Monte Carlo
samples generated using Sobol sequences could fill the space more evenly, resulting
in faster convergence and more stable estimates. The Sobol sequences are generated
using the torch.quasirandom.SobolEngine class, which accepts three input arguments:
dimension to specify the feature dimension, scramble to specify if the sequences would
be scrambled, and seed to specify the seed for the random number generator.
We generate 200 random points of Sobol sequences in a two-dimensional unit
hypercube in the following code snippet. First, we create a SobolEngine instance by
specifying the corresponding input arguments and storing it in sobol_engine. We then
call the draw() method to create 200 random points in sobol_samples, which are further
decomposed into xs_sobol and ys_sobol to create the two-dimensional plot:
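A sketch of that snippet:

from torch.quasirandom import SobolEngine
import matplotlib.pyplot as plt

sobol_engine = SobolEngine(dimension=2, scramble=True, seed=SEED)
sobol_samples = sobol_engine.draw(200)
xs_sobol = [x[0] for x in sobol_samples]
ys_sobol = [x[1] for x in sobol_samples]
plt.scatter(xs_sobol, ys_sobol)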
Running this code snippet generates Figure 5-2. The points appear to be evenly
distributed across the whole domain, providing a good space-filling design to “warm-
start” the optimization procedure. A good initial design could also enable better
estimates of, for example, the expectation operator and faster convergence to the global
optimum using a specific optimization routine.
Figure 5-2. Sobol sequences that consist of 200 quasi-random points in a two-
dimensional space. The points appear to be evenly distributed across the whole
space, providing better estimates and faster convergence in optimization tasks
random_samples = torch.rand(200, 2)
xs_uniform = [x[0] for x in random_samples]
ys_uniform = [x[1] for x in random_samples]
plt.scatter(xs_uniform, ys_uniform)
Running this code snippet generates Figure 5-3, where empty regions are more
prevalent compared to earlier. Leaving out these regions would thus provide a relatively
insufficient coverage and suboptimal estimation.
After generating a good coverage using the Sobol sequences, the next step is to obtain the corresponding evaluations of the acquisition function. Since we have access to the analytic solution of EI, calculating these evaluations is cheap and straightforward. In BoTorch, the Sobol sequences and the corresponding values of the acquisition function are respectively stored in X_rnd and Y_rnd inside the gen_batch_initial_conditions() function.
The last step is to generate a set of initial conditions using the initialize_q_
batch() function, whose input arguments include X (the set of Sobol sequences), Y
(corresponding EI values), and n (number of initial starts). Here, a heuristic is used to
select the n initial conditions from the X Sobol sequences without replacement, where the probability of being selected is proportional to $e^{\eta Z}$, with η being an optional input temperature parameter and $Z = \frac{Y - \mu_Y}{\sigma_Y}$ representing the standardized EI value. In other words, a multinomial distribution is used to select the n initial conditions, where
locations with high EI values are more likely to be selected, thus providing good initial
starts to the multi-start optimization procedure.
Back to our running example. In the following code snippet, we generate 20 initial
conditions in a six-dimensional unit hypercube using gen_batch_initial_conditions(),
where 100 raw samples are created via Sobol sequences. Printing out the dimension of the
returned result in Xinit shows that it is a 20x1x6 tensor, with 20 representing the number of
initial conditions, 1 the number of proposed sampling locations in each BO outer iteration
(i.e., the vanilla sequential BO), and 6 the number of features in the state space.
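A sketch of that snippet:

from botorch.optim.initializers import gen_batch_initial_conditions

Xinit = gen_batch_initial_conditions(
    acq_function=EI,
    bounds=torch.tensor([[0.0] * 6, [1.0] * 6]),
    q=1,
    num_restarts=20,
    raw_samples=100,
)
Xinit.shape  # torch.Size([20, 1, 6])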
if not initial_conditions_provided:
    batch_initial_conditions = _gen_initial_conditions()

The initial conditions generated via this function will get stored in the batch_initial_conditions variable. As a convention, functions used only inside a higher-level function start with an underscore, as in the case of _gen_initial_conditions().
With the initial conditions generated based on the aforementioned heuristic, we
will now enter into the core optimization phase using the gradient-based method with
random restart, which is completed via the _optimize_batch_candidates() function,
another built-in function in the source optimize_acqf() function.
As mentioned earlier, the actual optimization happens in gen_candidates_scipy(),
which generates a set of optimal candidates based on the widely used scipy.optimize.
minimize() function provided by SciPy. It starts with a set of initial candidates that serve
as the starting points for optimization and are passed in the initial_conditions input
argument. It will also accept additional core input arguments such as the acquisition function
acquisition_function, lower bounds in lower_bounds, and upper bounds in upper_bounds.
Let us examine the core optimization routine closely. In the following code,
we demonstrate the use of gen_candidates_scipy() in generating a set of optimal
candidates, of which the candidate with the highest EI value is returned:
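A sketch along those lines (the import path and the final get_best_candidates() call are assumptions):

from botorch.generation.gen import gen_candidates_scipy, get_best_candidates

batch_candidates, batch_acq_values = gen_candidates_scipy(
    initial_conditions=Xinit,
    acquisition_function=EI,
    lower_bounds=torch.zeros(6),
    upper_bounds=torch.ones(6),
)
# keep only the candidate with the highest EI value
batch_candidates = get_best_candidates(batch_candidates, batch_acq_values)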
Printing out the best candidate in batch_candidates shows the exact same
candidate location as before:
>>> batch_candidates
tensor([[0.1715, 0.3180, 0.0816, 0.4204, 0.1985, 0.7727]])
Under the hood, gen_candidates_scipy() wraps the acquisition function into a NumPy-compatible helper that returns both the objective value and its gradient at a given point, as shown in the following excerpt:

X = (
    torch.from_numpy(x)
    .to(initial_conditions)
    .view(shapeX)
    .contiguous()
    .requires_grad_(True)
)
X_fix = fix_features(X, fixed_features=fixed_features)
loss = f(X_fix).sum()
# compute gradient w.r.t. the inputs (does not accumulate in leaves)
gradf = _arrayify(torch.autograd.grad(loss, X)[0].contiguous().view(-1))
if np.isnan(gradf).any():
    msg = (
        f"{np.isnan(gradf).sum()} elements of the {x.size} element "
        "gradient array `gradf` are NaN. This often indicates numerical issues."
    )
    if initial_conditions.dtype != torch.double:
        msg += " Consider using `dtype=torch.double`."
    raise RuntimeError(msg)
fval = loss.item()
return fval, gradf
With the EI value and the gradient available at any point across the search space, we
can now utilize an off-the-shelf optimizer to perform the gradient-based optimization.
A common routine is to call the “L-BFGS-B” method in the minimize() function, which
is a quasi-Newton method that estimates the Hessian matrix (second derivative) of
the objective function, in this case, the analytic EI. The following code snippet shows
the part of gen_candidates_scipy() where the optimization happens, with the initial
condition stored in x0, the objective function in f_np_wrapper, and the optimization
results in res. For the full implementation details, the reader is encouraged to refer to
the official BoTorch GitHub.
res = minimize(
    fun=f_np_wrapper,
    args=(f,),
    x0=x0,
    method=options.get("method", "SLSQP" if constraints else "L-BFGS-B"),
    jac=True,
    bounds=bounds,
    constraints=constraints,
    callback=options.get("callback", None),
    options={k: v for k, v in options.items() if k not in ["method", "callback"]},
)
Up till now, we have examined the inner workings of optimizing over an analytic EI acquisition function. Note that when the number of proposals grows in the case of parallel BO, that is, q>1, the analytic solution is no longer readily available. In this case, the Monte Carlo (MC) version is often used to approximate the true value of the acquisition function at a specified input location. We refer to this as the MC acquisition function, as introduced in the next section.
In each iteration of the BO loop, we need to evaluate the value of the acquisition function α(xn + 1) at an arbitrary next sampling location xn + 1.
Since the corresponding observation yn + 1 is not yet revealed, our best strategy is to rely
on the expected utility while taking into account the updated posterior distribution at
xn + 1. Concretely, denote ξ~P(f (xn + 1)| D1 : n) as the random variable for the observation
at xn + 1. The value of the acquisition function α(xn + 1) would be conditioned on ξ, whose
utility is denoted as u(ξ). To integrate out the randomness in ξ, the expected utility is
used as the value of the acquisition function, giving

$$\alpha(x_{n+1}) = \mathbb{E}\big[u(\xi)\big] = \int u(\xi)\, P\big(f(x_{n+1}) \mid \mathcal{D}_{1:n}\big)\, d\xi$$

where P(f (xn + 1)| D1 : n) is the posterior distribution of f at xn + 1 given D1 : n observed so far.
Evaluating the acquisition function α(xn + 1), however, requires taking the integral
with respect to the posterior distribution, which is often analytically intractable and
difficult to compute. This is even the case when proposing more than one sampling
locations in parallel BO, where the analytic expressions generally do not exist. Instead,
we would only be able to approximate α(xn + 1) by sampling a set of m putative Monte
Carlo observations i i 1 at xn + 1, resulting in
m
1 m
x n 1 u i
m i 1
1 m
EI x n 1 max i f ,0
m i 1
where f ∗ is the best observation collected so far in a noiseless observation model.
ξi~P(f (xn + 1)| D1 : n) follows a normal distribution, where its mean and variance are
denoted as μ(xn + 1) and σ2(xn + 1), respectively. In practical implementation, the
reparameterization trick is often used to disentangle the randomness in ξi. Specifically,
we can reexpress ξi as μ(xn + 1) + L(xn + 1)ϵi, where L(xn + 1)L(xn + 1)T= σ2(xn + 1) is the root
decomposition of σ2(xn + 1), and ϵi~N(0, 1) follows a standard normal distribution.
In BoTorch, the MC EI is calculated by the qExpectedImprovement class, where the q
represents the number of proposals in the parallel BO setting.
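Before drawing MC samples, we need a quasi-MC sampler and the GP posterior at a candidate location; a minimal sketch, assuming 1024 base samples to match the output shape shown shortly:

from botorch.sampling import SobolQMCNormalSampler

sampler = SobolQMCNormalSampler(num_samples=1024, seed=SEED)
posterior = model.posterior(train_x[0].unsqueeze(0))
samples = sampler(posterior)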
We can perform the sampling operation by passing posterior to sampler, where the
results are printed as follows:
>>> samples
tensor([[[0.0506]],
[[0.1386]],
[[0.1737]],
...,
[[0.1739]],
[[0.1384]],
[[0.0509]]], grad_fn=<UnsqueezeBackward0>)
The shape of samples also shows that there are a total of 1024 samples:
>>> samples.shape
torch.Size([1024, 1, 1])
With these random samples in place, the next step is to obtain the corresponding
evaluations of the acquisition function so as to approximate its expectation. These
evaluations are then averaged to produce the final approximated expected utility of
the acquisition function at the specified sampling location, thus the name Monte Carlo
acquisition function.
These computations are completed via the qExpectedImprovement class, which is
designed to propose multiple points at the same time in the parallel BO framework.
Specifically, the following four steps are performed:
• Sampling the joint posterior distribution over the q input points
• Evaluating the improvement over the current best value for each
MC sample
• Maximizing over q points and selecting the best value
• Averaging over all the MC samples to produce the final estimate
$$\alpha_{qEI}(\mathbf{x}) = \mathbb{E}\Big[\max_{j=1,\dots,q}\big\{\max(\xi_j - f^*,\, 0)\big\}\Big]$$

where $\xi_j \sim P(f(x_j) \mid \mathcal{D})$. Using MC approximation together with the reparameterization trick, we can approximate qEI(x) as follows:

$$\alpha_{qEI}(\mathbf{x}) \approx \frac{1}{m}\sum_{i=1}^{m} \max_{j=1,\dots,q}\Big\{\max\big(\mu(x_j) + \big(L(\mathbf{x})\,\epsilon_i\big)_j - f^*,\, 0\big)\Big\}$$
where ϵi~N(0, 1). Let us look at an example of how an approximate qEI value can be calculated. In the following code snippet, we initialize a qExpectedImprovement instance
by passing in the model object created earlier, the best-observed value so far in best_
value, and the qMC sampler in sampler. The created instance in qEI could then be used
to provide evaluation at any given sampling location. In this case, the evaluation is zero
when passing in the first observed location, which makes sense since the marginal utility
from any observed location should be none.
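A sketch of that snippet:

from botorch.acquisition.monte_carlo import qExpectedImprovement

qEI = qExpectedImprovement(model=model, best_f=best_value, sampler=sampler)
qEI(train_x[0][None, :])  # evaluates to (approximately) zero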
Note that the [None, :] part essentially adds a batch dimension of 1 to the
current tensor.
We can also plug in the best candidate location stored in batch_candidates and
observe the qEI value, which is 0.0538:
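A sketch of that evaluation:

qEI(batch_candidates)  # approximately 0.0538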
Within the qEI instance, the sequence of operations mentioned earlier is performed
via the forward() function, which is included as follows for ease of reference:
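A paraphrased sketch of the logic inside forward(), not the verbatim BoTorch source:

def forward(self, X):
    posterior = self.model.posterior(X)     # joint posterior over the q points
    samples = self.sampler(posterior)       # MC samples: num_samples x b x q
    obj = self.objective(samples)
    # improvement over the current best value for each MC sample
    improvement = (obj - self.best_f.to(obj)).clamp_min(0)
    # maximize over the q points, then average over the MC samples
    return improvement.max(dim=-1)[0].mean(dim=0)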
Next, let us apply this MC version of EI and seek the global optimum. In particular,
we are interested in whether the MC acquisition function will return the same optimum
location as the analytic counterpart. In the following code snippet, we first instantiate
a normal sampler that produces 500 points of Sobol sequence, which will be used
to estimate the value of MC EI at each proposed location via the MC_EI object. This
acquisition function will then be passed into the same optimize_acqf() function that
performs the core optimization procedure introduced earlier. We also set the seed of
PyTorch to ensure reproducibility given the randomness in Monte Carlo simulations:
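A sketch of that snippet:

torch.manual_seed(SEED)
mc_sampler = SobolQMCNormalSampler(num_samples=500, seed=SEED)
MC_EI = qExpectedImprovement(model=model, best_f=best_value, sampler=mc_sampler)
new_point_mc, _ = optimize_acqf(
    acq_function=MC_EI,
    bounds=torch.tensor([[0.0] * 6, [1.0] * 6]),
    q=1,
    num_restarts=20,
    raw_samples=100,
)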
Examining the result in new_point_mc shows that the location closely matches the
one from analytic EI (saved in new_point_analytic):
>>> new_point_mc
tensor([[0.1715, 0.3181, 0.0816, 0.4204, 0.1985, 0.7726]])
We can also check the norm of the difference between these two vectors:
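For example:

torch.norm(new_point_mc - new_point_analytic)  # a very small value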
Summary
Many acquisition functions involve the expectation of some real-valued function of the
model’s output at a specific sampling location. Instead of directly evaluating the integral
operation from the expectation operator, the Monte Carlo version of the acquisition
function offers a convenient path to evaluate the value of the original acquisition
function via MC simulations, thus avoiding integration and offering faster computation.
Specifically, we covered the following:
• The analytic expected improvement (EI) acquisition function and its ExpectedImprovement implementation in BoTorch
• The inner workings of optimize_acqf(), including Sobol-sequence-based initialization and multi-start gradient-based optimization with L-BFGS-B
• The Monte Carlo version of EI (qExpectedImprovement) and how quasi-MC sampling with the reparameterization trick approximates the expectation
In the next chapter, we will discuss another acquisition function called knowledge
gradient (KG) and the one-shot optimization technique when performing sequential
search using KG.
CHAPTER 6
Knowledge Gradient:
Nested Optimization vs.
One-Shot Learning
In the previous chapter, we learned the inner workings of the optimization procedure
using BoTorch, highlighting the auto-differentiation mechanism and modular design
of the framework. This paves the way for many new acquisition functions we can plug
in and test. In this chapter, we will extend our toolkit of acquisition functions to the
knowledge gradient (KG), a nonmyopic acquisition function that performs better than
expected improvement (EI) in many cases.
Although empirically superior, calculating the KG value at an arbitrary location is nontrivial. The nonmyopic nature of the KG acquisition function increases the computational complexity: the analytic form is unavailable, and each KG evaluation requires solving a nested optimization problem, so one needs to resort to approximation methods to obtain the KG value. Besides introducing
the formulation of the KG acquisition function, this chapter also covers the mainstream
approximation methods used to compute the KG value. In particular, we will illustrate
the one-shot KG (OKG) formulation proposed in the BoTorch framework that can
significantly accelerate the computation by converting the nested optimization problem
into a deterministic optimization setting using techniques such as sample average
approximation (SAA). We will also dive into the implementation of OKG in BoTorch,
shedding light on the practical usage of this novel optimization technique.
this (and therefore arbitrarily any) location, or equivalently the utility of the augmented dataset via the maximum of the new predictive mean function, that is, $u(\mathcal{D}_{n+1}) = \mu^*_{n+1} = \max_{x \in \mathcal{X}} \mu_{\mathcal{D}_{n+1}}(x)$. The improvement based on the additional pair $(x_{n+1}, y_{n+1})$ could then be calculated as $\mu^*_{n+1} - \mu^*_n$.
We can then define the KG function as the expected marginal increase in utility from step n to step n + 1, where the nonmyopic nature is reflected in the expectation over various future simulations. Specifically, the KG function $\alpha_{KG}(x; \mathcal{D}_n)$ at a candidate location x is defined as follows:

$$\alpha_{KG}(x; \mathcal{D}_n) = \mathbb{E}_{y_{n+1}}\big[\mu^*_{n+1} - \mu^*_n \mid \mathcal{D}_n,\, x_{n+1} = x\big]$$
Here, the expectation is taken with respect to the random variable yn + 1 at a given
location xn + 1. Different realizations of yn + 1 would result in different increases in the
marginal gain in the utility function. By integrating out the randomness in yn + 1, we
obtain the expected average-case marginal gain in utility. Such one-step lookahead
formulation based on the global reward utility forms the KG acquisition function.
The global nature of KG also means that the last point to be evaluated may not
necessarily be one of the locations previously evaluated; we are willing to commit to a
new location upon the termination of the search process. For each new location, the
integration effectively considers all possible values of observations under the predictive
posterior mean function. Figure 6-1 summarizes the definition of the KG acquisition
function.
Having formulated the KG acquisition function, we will look at its computation in the
optimization procedure in the following section.
$$\hat{\alpha}_{KG}(x; \mathcal{D}_n) = \frac{1}{M}\sum_{i=1}^{M}\Big[u\big(\mathcal{D}^{(i)}_{n+1}\big) - u\big(\mathcal{D}_n\big)\Big] = \frac{1}{M}\sum_{i=1}^{M}\Big[\mu^{*(i)}_{n+1} - \mu^*_n\Big]$$

where each ith putative sample $y^{(i)}$ is generated based on the corresponding GP posterior, that is, $y^{(i)} \sim p(y \mid x, \mathcal{D}_n)$. In addition, $\mu^{*(i)}_{n+1} = \max_{x' \in \mathcal{X}} \mu_{\mathcal{D}^{(i)}_{n+1}}(x')$, and $\mathcal{D}^{(i)}_{n+1} = \mathcal{D}_n \cup \{(x, y^{(i)})\}$.
The larger the number of MC samples M, the better the approximation becomes. In the extreme case when M → ∞, the approximation is unbiased, that is, $\hat{\alpha}_{KG}(x; \mathcal{D}_n) \rightarrow \alpha_{KG}(x; \mathcal{D}_n)$.
Note that when locating the maximum posterior mean value, we can use a nonlinear optimization method such as L-BFGS-B to locate the global maximum along the posterior mean curve. L-BFGS-B is a second-order (quasi-Newton) optimization algorithm we encountered previously when using the minimize() optimization routine provided by SciPy.
Figure 6-2 provides the general schematic of optimization using Monte Carlo simulations for certain acquisition functions, where MC integration in $\hat{\alpha}_{KG}(x; \mathcal{D}_n)$ is used to approximate the expectation in $\alpha_{KG}(x; \mathcal{D}_n)$ using samples $\{y^{(i)}\}_{i=1}^{M}$ from the posterior distribution.
Let us look at the algorithm that can be used to evaluate the KG value at a given sampling location. As shown in Figure 6-3, we first calculate the maximum posterior mean value $\mu^*_n$ based on the collected dataset $\mathcal{D}_n$ and then enter the main loop to calculate the approximate KG value based on simulations. In each iteration, we
simulate a realization $y^{(i)}$ of a normally distributed random variable with mean $\mu_n(x)$ and variance $\sigma^2_n(x)$ at the location x under consideration. Equivalently, we can first generate a standard normal realization $z^{(i)} \sim \mathcal{N}(0, 1)$ and apply the scale-shift transformation to get $y^{(i)} = \mu_n(x) + \sigma_n(x)\,z^{(i)}$, a topic covered in Chapter 2. After obtaining the new simulation and appending it to the dataset to form $\mathcal{D}^{(i)}_{n+1}$, we can acquire a new maximum posterior mean value $\mu^{*(i)}_{n+1}$, which can then be used to calculate the single improvement in utility, namely, $\mu^{*(i)}_{n+1} - \mu^*_n$. Upon completing the loop, the average improvement is returned as the approximate KG value, that is,

$$\hat{\alpha}_{KG}(x; \mathcal{D}_n) = \frac{1}{M}\sum_{i=1}^{M}\big(\mu^{*(i)}_{n+1} - \mu^*_n\big)$$
Figure 6-3. Illustrating the algorithm for calculating the KG value using Monte
Carlo simulations
Therefore, using Monte Carlo simulation, we are essentially using the average difference in the maximum predictive mean between the augmented datasets $\{\mathcal{D}^{(i)}_{n+1}\}_{i=1}^{M}$ and the existing dataset $\mathcal{D}_n$ across all M simulations. This is also the inner optimization procedure, which involves locating the maximum posterior mean, whereas the outer optimization procedure concerns locating the maximum KG value across the domain.
To locate a good sampling location, we can follow a stochastic gradient approach (gradient ascent in the maximization setting) and calculate $\nabla_x \alpha_{KG}(x; \mathcal{D}_n)$ to guide the search toward the local optimum. In other words, we would like to obtain the next evaluation location $x_{t+1}$ from the current location $x_t$ using the following iteration:

$$x_{t+1} = x_t + \eta_t \nabla_{x_t} \alpha_{KG}(x_t; \mathcal{D}_n)$$

Now let us look at the gradient term $\nabla_{x_t} \alpha_{KG}(x_t; \mathcal{D}_n)$. Plugging in the definition of $\alpha_{KG}(x_t; \mathcal{D}_n)$ gives the following:

$$\nabla_{x_t} \alpha_{KG}(x_t; \mathcal{D}_n) = \nabla_{x_t} \mathbb{E}_{p(y|x_t, \mathcal{D}_n)}\big[\mu^*_{n+1} - \mu^*_n\big] = \nabla_{x_t} \mathbb{E}_{p(y|x_t, \mathcal{D}_n)}\big[\mu^*_{n+1}\big]$$
Here, we removed the term $\mu^*_n$ since the gradient does not depend on a fixed constant. When taking the gradient of an expectation term, a useful technique called infinitesimal perturbation analysis can be applied to ease the calculation, which allows the gradient and expectation operators to be exchanged:
$$\nabla_{x_t} \alpha_{KG}(x_t; \mathcal{D}_n) = \nabla_{x_t} \mathbb{E}_{p(y|x_t, \mathcal{D}_n)}\big[\mu^*_{n+1}\big] = \mathbb{E}_{p(y|x_t, \mathcal{D}_n)}\big[\nabla_{x_t} \mu^*_{n+1}\big]$$

With the expectation operator moved to the outer layer, we can apply the same Monte Carlo technique to sample a list of M gradients $\nabla_{x_t} \mu^{*(i)}_{n+1}$ for i ∈ {1, …, M} and take the average.
Now, the question is how to calculate a single gradient $\nabla_{x_t} \mu^{*(i)}_{n+1}$ for a given location $x_t$ and simulated observation $y^{(i)}$. Recall that $\mu^{*(i)}_{n+1}$ is the maximum posterior mean value after observing the putative pair $(x_t, y^{(i)})$, that is, $\mu^{*(i)}_{n+1} = \max_{x' \in \mathcal{X}} \mu_{\mathcal{D}^{(i)}_{n+1}}(x')$, where we denote by $x^*_t$ the maximizing location of the posterior mean function $\mu_{\mathcal{D}_{n+1}}(x)$. We can then keep $x^*_t$ fixed and calculate the gradient of $\mu_{\mathcal{D}_{n+1}}(x^*_t)$ with respect to $x_t$, the current location under consideration, that is:

$$\nabla_{x_t} \mu^*_{\mathcal{D}_{n+1}} = \nabla_{x_t} \max_{x' \in \mathcal{X}} \mu_{\mathcal{D}_{n+1}}(x') = \nabla_{x_t} \mu_{\mathcal{D}_{n+1}}(x^*_t)$$
We can then utilize the auto-differentiation capability in PyTorch to calculate the
gradient with respect to the current location xt, after evaluating the posterior mean
function at the maximizing location x∗t . Figure 6-4 summarizes the techniques involved
in the derivation of gradient calculation.
Figure 6-4. Illustrating the gradient calculation process for a given location
Note that we still use the same Monte Carlo technique to handle the expectation operator via simulations. In Figure 6-5, we summarize the algorithm used to calculate the gradient element required in the SGD update, where we use $G_{x_t}$ to denote the gradient of KG at the current running location $x_t$. In this algorithm, we perform a total of J simulations to approximate the KG gradient. In each iteration, we first obtain a putative observation $y^{(j)} \sim \mathcal{N}\big(\mu_n(x_t), \sigma^2_n(x_t)\big)$ at the current running location $x_t$ and form the new posterior mean function $\mu_{n+1}\big(x; \mathcal{D}^{(j)}_{n+1}\big)$. Next, we use the routine optimization protocol L-BFGS-B to locate the maximizer $x^*_t$ and calculate a single gradient $G^{(j)}_{x_t} = \nabla_{x_t} \mu_{n+1}(x^*_t)$ of the maximum posterior mean value with respect to $x_t$. Finally, we average the multiple instances $\{G^{(j)}_{x_t}\}_{j=1}^{J}$ and return the average as the approximate KG gradient $\nabla_{x_t} \alpha_{KG}(x_t; \mathcal{D}_n)$ for the current round of SGD update, that is, $x_{t+1} = x_t + \eta_t \nabla_{x_t} \alpha_{KG}(x_t; \mathcal{D}_n)$. The averaging operation is theoretically justified since the expected value of the average KG gradient is equal to the true gradient of the KG value, that is, $\mathbb{E}\big[\bar{G}_{x_t}\big] = \nabla_{x_t} \alpha_{KG}(x_t; \mathcal{D}_n)$.
Figure 6-5. Illustrating the algorithm for calculating the gradient of KG using
Monte Carlo simulations
Figure 6-6. Illustrating the reparameterization trick for the gradient calculation of
the KG acquisition function
Now we can shift to the outer optimization and look at the big picture: navigating
toward the sampling location with a (locally) maximum KG value. As previously
mentioned, we perform the multi-start stochastic gradient ascent procedure and select
the best candidate with the highest KG value as the next sampling location.
Figure 6-7 provides the details of the algorithmic framework. Before optimization starts, we prepare the collected dataset $\mathcal{D}_n$, the total iteration budgets R for the number of restarts and T for the number of gradient ascent updates, as well as the step size used for the gradient ascent update. We select a random point $x^{(r)}_0$ as the starting point for each path of the stochastic gradient ascent update. Each starting point will evolve into a sequence of updates $x^{(r)}_t$ and converge to a local optimum at $x^{(r)}_T$.
Figure 6-7. Illustrating the algorithm for locating the largest KG for sequential
sampling using multi-start stochastic gradient ascent
Starting from each random initial point, we perform the iterative stochastic gradient ascent update $x^{(r)}_{t+1} = x^{(r)}_t + \eta_t G_{x^{(r)}_t}$ to move to the next location. Note that the step size $\eta_t$ is made adaptive, reducing in magnitude as the iteration proceeds. Each stochastic gradient ascent will run for a total of T iterations. Finally, we evaluate the approximate KG value $\hat{\alpha}_{KG}(x^{(r)}_T; \mathcal{D}_n)$ upon convergence of the ascent updates (set by the hyperparameter T) and return the location with the largest approximate KG value, that is, $x^*_T = \arg\max_{r} \hat{\alpha}_{KG}(x^{(r)}_T; \mathcal{D}_n)$.
Due to its focus on improving the maximum posterior mean, the proposed next sampling location may not necessarily have a higher observation than previous observations. However, in the case of noisy observations, this property enables KG to significantly outperform the search strategy based on EI.
Note that the proposed suite of algorithms used to guide the sequential search process based on KG is much more efficient than the original direct computation. However, there is still room for improvement, especially when we need to propose multiple points at the same time, a topic under the theme of parallel Bayesian optimization.
Note that since these base samples are independently and identically distributed (i.i.d.), the resulting estimator of the maximum KG value is theoretically guaranteed to converge to the true maximum, that is, $\max_{x \in \mathcal{X}} \hat{\alpha}^{N}_{KG}(x; \mathcal{D}_n) \rightarrow \max_{x \in \mathcal{X}} \alpha_{KG}(x; \mathcal{D}_n)$ as $N \rightarrow \infty$. Such a theoretical guarantee is backed by the rich literature on the approximation properties of SAA, which also makes MC-based approximation convenient to implement.
In addition, we can also use the same set of base samples to estimate the KG gradient value, although such a gradient estimator may be used for different purposes. Figure 6-8 illustrates the SAA process used in MC integration for both the KG value and its gradient estimator. The common random numbers $\{z^{(i)}\}_{i=1}^{N}$ are used to pass through a deterministic transformation $y = h_n(x, z) = \mu_n(x) + \sigma_n(x)\,z$ to produce a set of putative observations $\{y^{(i)}\}_{i=1}^{N}$, which will then be used to calculate

$$\hat{\alpha}_{KG}(x; \mathcal{D}_n) = \frac{1}{N}\sum_{i=1}^{N}\big(\mu^{*(i)}_{n+1} - \mu^*_n\big)$$

as an approximation of $\alpha_{KG}(x; \mathcal{D}_n)$, and

$$\bar{G}_x = \frac{1}{N}\sum_{i=1}^{N} \nabla_x \mu^{*(i)}_{n+1}$$

as an approximation of $\nabla_x \alpha_{KG}(x; \mathcal{D}_n)$.
Figure 6-8. The sample average approximation technique used in the Monte
Carlo estimation for both the KG value and its gradient estimator using quasi-
Newton methods for faster convergence speed and more robust optimization
With a fixed set of base samples, we can now use second-order optimization routines such as quasi-Newton methods to achieve faster convergence speed and more robust optimization. This is because first-order methods such as stochastic gradient descent are sensitive to hyperparameters such as the learning rate, making it difficult to achieve reliable optimization results without significant tuning effort. When using second-order methods, however, such tuning is not necessary.
In the next section, we will discuss the one-shot formulation of KG using the SAA
technique.
The nested formulation requires an inner optimization for each putative observation y at the candidate location x. The inner loop can be performed either via the same first-order stochastic gradient ascent algorithm or the second-order L-BFGS-B procedure.
Let us now reexpress the KG value in more detail to facilitate the derivation of the one-shot KG formulation. Assume we take one more data pair $(x, y_{\mathcal{D}_n}(x))$, with $y_{\mathcal{D}_n}(x) \sim \mathcal{N}\big(\mu_{\mathcal{D}_n}(x), \sigma_{\mathcal{D}_n}(x)\big)$, and we further form an updated (putative) dataset $\mathcal{D}_{n+1} = \mathcal{D}_n \cup \{(x, y_{\mathcal{D}_n}(x))\}$. The KG acquisition function can then be expressed as follows:

$$\alpha_{KG}(x; \mathcal{D}_n) = \mathbb{E}_{y_{\mathcal{D}_n}(x)}\Big[\max_{x' \in \mathcal{X}} \mathbb{E}\big[y_{\mathcal{D}_{n+1}}(x') \mid \mathcal{D}_{n+1}\big]\Big] - \mu^*_{\mathcal{D}_n}$$
In other words, this expression essentially quantifies the expected marginal increase
in the maximum posterior mean after sampling at location x. When using SAA, we can
avoid the nested optimization and use the fixed base samples to convert the original
problem into a deterministic one. Specifically, after drawing a fixed set of base samples
$\{z^{(i)}\}_{i=1}^{N}$, we can approximate the KG acquisition function as follows:

$$\hat{\alpha}_{KG}(x; \mathcal{D}_n) = \frac{1}{N}\sum_{i=1}^{N}\Big[\max_{x^{(i)} \in \mathcal{X}} \mathbb{E}\big[y_{\mathcal{D}_{n+1}}(x^{(i)}) \mid \mathcal{D}^{(i)}_{n+1}\big]\Big] - \mu^*_{\mathcal{D}_n}$$
Since we would maximize the approximate KG function to obtain the next sampling position $x_{n+1} = \hat{x}^* = \arg\max_{x \in \mathcal{X}} \hat{\alpha}_{KG}(x; \mathcal{D}_n)$, we can move the second (inner) maximization operator outside of the sample average when conditioned on the fixed set of base samples. In other words, we have

$$\max_{x \in \mathcal{X}} \hat{\alpha}_{KG}(x; \mathcal{D}_n) = \max_{x \in \mathcal{X}} \frac{1}{N}\sum_{i=1}^{N}\Big[\max_{x^{(i)} \in \mathcal{X}} \mathbb{E}\big[y_{\mathcal{D}_{n+1}}(x^{(i)}) \mid \mathcal{D}^{(i)}_{n+1}\big]\Big] - \mu^*_{\mathcal{D}_n}$$

$$= \max_{x \in \mathcal{X}}\; \max_{x^{(i)} \in \mathcal{X}} \frac{1}{N}\sum_{i=1}^{N}\Big[\mathbb{E}\big[y_{\mathcal{D}_{n+1}}(x^{(i)}) \mid \mathcal{D}^{(i)}_{n+1}\big]\Big] - \mu^*_{\mathcal{D}_n}$$

$$= \max_{x,\, x' \in \mathcal{X}} \frac{1}{N}\sum_{i=1}^{N}\Big[\mathbb{E}\big[y_{\mathcal{D}_{n+1}}(x^{(i)}) \mid \mathcal{D}^{(i)}_{n+1}\big]\Big] - \mu^*_{\mathcal{D}_n}$$
where $x' := \{x^{(i)}\}_{i=1}^{N}$ is a collection of N sampling points that represent the locations of the maximum posterior mean. These points are also called fantasy points in the BoTorch paper, used to represent the next-stage solutions in the inner optimization loop. Using such a reexpression, we have converted the original problem into an equivalent form with just one maximization operator, which requires us to solve an optimization problem over the current candidate location x and the next-stage fantasy points x′.
Note that the only change in the current single optimization loop is that each
loop needs to solve a higher-dimensional problem. That is, we need to output N more
locations in each optimization. Since we are only working with a single optimization
loop, such formulation is thus called the one-shot knowledge gradient (OKG). Figure 6-9
summarizes the formulation process of OKG.
One-Shot KG in Practice
In this section, we will look at the implementation details of OKG in BoTorch based on
a synthetic function of the form f (x) = sin (2πx1) cos (2πx2), a two-dimensional function
where all input features are confined to the hypercube [0, 1]. We also add a vector of
normally distributed random noise to the simulated observations to simulate noise-
perturbed observations. In addition, we standardize the noisy observations using the
standardize() function in botorch.utils.transforms to create a vector of zero mean
and unit variance. See the following code listing that implements these operations and
creates 20 noisy observations after standardization:
import os
import math
import torch
from botorch.utils.transforms import standardize
bounds = torch.stack([torch.zeros(2), torch.ones(2)])
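A sketch of how the 20 standardized noisy observations might be generated (the uniform sampling of train_X and the noise scale of 0.1 are assumptions):

train_X = bounds[0] + (bounds[1] - bounds[0]) * torch.rand(20, 2)
Y = (torch.sin(2 * math.pi * train_X[:, 0]) * torch.cos(2 * math.pi * train_X[:, 1])).unsqueeze(-1)
train_Y = standardize(Y + 0.1 * torch.randn_like(Y))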
Next, we will build a surrogate model using Gaussian processes and optimize the
hyperparameters of the GP model. Following a similar approach to the previous chapter,
we first instantiate a SingleTaskGP model based on the provided training dataset in
train_X and train_Y, prepare the (closed-form) marginal likelihood calculator via
ExactMarginalLogLikelihood, and finally optimize the GP hyperparameters using fit_
gpytorch_model(). See the following code listing for these operations:
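A sketch of those operations:

from botorch.models import SingleTaskGP
from gpytorch.mlls import ExactMarginalLogLikelihood
from botorch.fit import fit_gpytorch_model

model = SingleTaskGP(train_X, train_Y)
mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_model(mll);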
Note that the semicolon is used here to suppress the output message after
executing the code. Next, we will build a one-shot knowledge gradient instance to
represent this particular acquisition function. In BoTorch, this class is implemented in
qKnowledgeGradient, which generalizes to parallel Bayesian optimization setting where
q represents the number of locations to be considered simultaneously. In addition,
qKnowledgeGradient performs batch Bayesian optimization where each batch goes
through the one-shot optimization process. See the following code listing on creating an
instance for OKG learning later on:
from botorch.acquisition import qKnowledgeGradient

NUM_FANTASIES = 128
qKG = qKnowledgeGradient(model, num_fantasies=NUM_FANTASIES)
Here, we set the number of fantasy samples to 128, corresponding to the number
of MC samples N for the outer expectation operator discussed earlier. More fantasy
samples will have a better approximation of the true KG value, although at the expense
of both RAM and wall time (running the whole script from start to end). As for the inner
expectation, we can directly rely on the closed-form posterior mean or use another MC
integration.
With the qKG instance created, we can use it to evaluate an arbitrary location across
the search domain. Note that the one-shot formulation in OKG requires a total of 129
input locations per evaluation, including one location under consideration of the outer
optimization and the remaining 128 fantasy locations used in MC integration.
The following code listing tests the OKG value of the first training location together
with 128 fantasy locations sampled following Sobol sequences. We combine the original
location with the 128 fantasy locations using the torch.cat() function, pass to qKG to
evaluate the approximate KG value, and output the final content in the returned result by
invoking the item() method:
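A sketch of that test (drawing the fantasy locations from a Sobol engine is an assumption):

from torch.quasirandom import SobolEngine

fantasy_points = SobolEngine(dimension=2, scramble=True, seed=0).draw(NUM_FANTASIES)
eval_X = torch.cat([train_X[0].view(1, -1), fantasy_points], dim=0)  # 129 x 2
qKG(eval_X.unsqueeze(0)).item()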
sampler = SobolQMCNormalSampler(
    num_samples=num_fantasies, resample=False, collapse_batch_dims=True
)
Now let us look at the evaluation code when passing in a candidate location to the
forward() function in qKnowledgeGradient. As shown in the following code snippet, we
start by splitting the combined location vectors into the actual location in X_actual and
fantasy locations in X_fantasies using the _split_fantasy_points() function:
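Roughly along these lines (paraphrasing the BoTorch source):

X_actual, X_fantasies = _split_fantasy_points(X=X, n_f=self.num_fantasies)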
# make sure to propagate gradients to the fantasy model train inputs
with settings.propagate_grads(True):
    values = value_function(X=X_fantasies)  # num_fantasies x b
Both the model object and the fantasy_model object represent the GP model that
is parameterized by the mean function and a covariance function conditioned on
the observed/simulated dataset. In other words, we can supply an arbitrary sampling
location and obtain the corresponding posterior mean and variance. For example, the
following code snippet evaluates the posterior mean value at the location of the first
training point using the PosteriorMean() function, a utility function used to evaluate
the closed-form mean value from the posterior distribution:
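For instance:

PosteriorMean(model)(train_X[0].view(1, -1))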
The result shows a scalar value of -0.4020 contained as a tensor object with the grad_
fn attribute, which is the gradient at the specific location and will be used for gradient
update later. We can also extract the scalar value by calling the item() function:
>>> PosteriorMean(model)(train_X[0].view(1,-1)).item()
-0.3386051654815674
We can apply the same operation to the fantasy model. As shown in the following
code snippet, the fantasy model returns five different evaluations on the posterior mean
due to five different fantasized posterior GP models:
>>> PosteriorMean(fantasy_model)(train_X[0].view(1,-1))
tensor([-0.4586, -0.2904, -0.2234, -0.3893, -0.3752], grad_fn=<ViewBackward0>)
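The fantasy_model itself can be built by conditioning the GP on simulated observations; a hypothetical sketch with five fantasy samples (not necessarily the construction used here):

fantasy_sampler = SobolQMCNormalSampler(num_samples=5)
fantasy_model = model.fantasize(X=train_X[0].view(1, -1), sampler=fantasy_sampler)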
def _get_value_function(
model: Model,
objective: Optional[MCAcquisitionObjective] = None,
posterior_transform: Optional[PosteriorTransform] = None,
sampler: Optional[MCSampler] = None,
project: Optional[Callable[[Tensor], Tensor]] = None,
valfunc_cls: Optional[Type[AcquisitionFunction]] = None,
valfunc_argfac: Optional[Callable[[Model], Dict[str, Any]]] = None,
) -> AcquisitionFunction:
Note that we keep the setting to the simplest case; thus, the returned result is base_value_function = PosteriorMean(model=model, posterior_transform=posterior_transform). The value_function is then used to score the posterior mean across all the fantasy locations and return the average score as the final result of the qKG object.
Note that the calculation process of the posterior mean values using value_function is
performed within the propagate_grads setting, which ensures that all calculations will
backpropagate the gradient values in each tensor.
Up till now, we have looked at the inner workings of calculating the average posterior
mean value as the approximate OKG value over a set of locations (current plus fantasy
locations) instead of over a single current location in other acquisition functions. In the
following section, we will look at the optimization process to identify the location set
with the highest OKG value.
NUM_RESTARTS = 10
RAW_SAMPLES = 512

with manual_seed(1234):
    candidates, acq_value = optimize_acqf(
        acq_function=qKG,
        bounds=bounds,
        q=2,
        num_restarts=NUM_RESTARTS,
        raw_samples=RAW_SAMPLES,
    )
We can access the candidate locations and the corresponding OKG values as follows:
>>> candidates
tensor([[0.3940, 1.0000], [0.0950, 0.0000]])
>>> acq_value
tensor(2.0358)
The modular design in BoTorch makes it easy to plug in new acquisition functions
and test the performance. Under the hood, the optimize_acqf() function has a separate
handle on generating the initial conditions used for optimization, as indicated by the
following code snippet inside the definition of the optimize_acqf() function from
BoTorch:
ic_gen = (
gen_one_shot_kg_initial_conditions
if isinstance(acq_function, qKnowledgeGradient)
else gen_batch_initial_conditions
)
This snippet requires that the initial conditions are generated using the gen_one_
shot_kg_initial_conditions() function when using the OKG acquisition function.
The gen_one_shot_kg_initial_conditions() function generates a set of smart initial
conditions using a combined strategy where some initial conditions are the maximizers
of the simulated fantasy posterior mean.
The following code snippet shows the definition of the gen_one_shot_kg_initial_
conditions() function in BoTorch. Here, we first get the total size of sampling locations
to be selected simultaneously, where q is the number of parallel locations to be
proposed. The definition of the get_augmented_q_batch_size() function is also given
as follows, where we add the number of fantasy points to form the total dimension of
sampling locations and save the result in q_aug:
def gen_one_shot_kg_initial_conditions(
    acq_function: qKnowledgeGradient,
    bounds: Tensor,
    q: int,
    num_restarts: int,
    raw_samples: int,
    fixed_features: Optional[Dict[int, float]] = None,
    options: Optional[Dict[str, Union[bool, float, int]]] = None,
    inequality_constraints: Optional[List[Tuple[Tensor, Tensor, float]]] = None,
    equality_constraints: Optional[List[Tuple[Tensor, Tensor, float]]] = None,
) -> Optional[Tensor]:
    options = options or {}
    frac_random: float = options.get("frac_random", 0.1)
    if not 0 < frac_random < 1:
        raise ValueError(
            f"frac_random must take on values in (0,1). Value: {frac_random}"
        )
    q_aug = acq_function.get_augmented_q_batch_size(q=q)
    ics = gen_batch_initial_conditions(
        acq_function=acq_function,
        bounds=bounds,
        q=q_aug,
        num_restarts=num_restarts,
        raw_samples=raw_samples,
        fixed_features=fixed_features,
        options=options,
        inequality_constraints=inequality_constraints,
        equality_constraints=equality_constraints,
    )
    # Later in the same function, the maximizers of the fantasy posterior mean
    # are located via optimize_acqf(), which is imported locally inside the
    # function body:
    from botorch.optim.optimize import optimize_acqf
We start by generating initial conditions for optimizing the OKG acquisition function over a total of q_aug sampling locations, following the parallel BO setting. Recasting the nested optimization problem of the original KG formulation as a one-shot, parallel BO problem is the key insight for understanding the OKG acquisition function. Here, we reuse the gen_batch_initial_conditions() function from the previous chapter to generate heuristic-based initial conditions. This standard type of random initial condition is stored in ics.
Next, we use the same _get_value_function() helper as in OKG to build the scoring function based on the posterior mean. This scoring function is then passed to another call to optimize_acqf() to locate the global maximum (with q=1) of the posterior mean among all fantasy models. In other words, the initial conditions are identified as the maximizers of the posterior mean of the fantasy models, which should intuitively be close to the maximizer of the current posterior mean. This type of nonrandom initial condition is stored in fantasy_cands.
Finally, we choose a fraction of the initial conditions from this second, nonrandom type of location, controlled by the input parameter frac_random. Specifically, the maximizing posterior mean values in fantasy_vals are passed through a softmax transformation to weight a multinomial distribution, which is then used to sample from fantasy_cands and replace a subset of the entries in ics. That is, the final initial conditions (sampling locations) are a mixture of random initializations and maximizing locations of the posterior mean of the simulated fantasy models.
We can now generate a set of initial conditions using the gen_one_shot_kg_initial_conditions() function. In the following code snippet, we generate the initial conditions by setting options={"frac_random": 0.25}. Printing out the size of the result shows that the dimension of the initial conditions follows num_restarts x (q + num_fantasies) x num_features.
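The snippet below is a minimal sketch of this call, reusing the qKG, bounds, NUM_RESTARTS, and RAW_SAMPLES objects defined earlier; the exact listing in the accompanying notebook may differ:

from botorch.optim.initializers import gen_one_shot_kg_initial_conditions

ics = gen_one_shot_kg_initial_conditions(
    acq_function=qKG,
    bounds=bounds,
    q=2,
    num_restarts=NUM_RESTARTS,
    raw_samples=RAW_SAMPLES,
    options={"frac_random": 0.25},
)
# shape follows num_restarts x (q + num_fantasies) x num_features
print(ics.shape)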
Note that the maximizing OKG values we have calculated so far are based on the
simulated posterior fantasy models. By definition, we also need to subtract the current
maximum posterior mean to derive the expected increase in the marginal utility. In
the following code snippet, we obtain the single maximum posterior mean based
on the current model by setting acq_function=PosteriorMean(model) and q=1 in
optimize_acqf():
NUM_RESTARTS = 20
RAW_SAMPLES = 2048
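A minimal sketch of the omitted snippet is shown below; it uses the PosteriorMean class from botorch.acquisition and discards the maximizer's location, keeping only the value max_pmean:

from botorch.acquisition import PosteriorMean

_, max_pmean = optimize_acqf(
    acq_function=PosteriorMean(model),
    bounds=bounds,
    q=1,
    num_restarts=NUM_RESTARTS,
    raw_samples=RAW_SAMPLES,
)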
We can observe the value of the maximum posterior mean as follows. The current
maximum posterior mean value of 1.9548 is indeed lower than the new maximum
posterior mean value of 2.0358 based on fantasized models:
>>> max_pmean
tensor(1.9548)
qKG_proper = qKnowledgeGradient(
    model,
    num_fantasies=NUM_FANTASIES,
    sampler=qKG.sampler,
    current_value=max_pmean,
)
Note that the current_value is used in the following snippet inside the definition of
qKnowledgeGradient(), where the current maximum posterior mean is subtracted from
the new fantasized maximum posterior mean:
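A condensed sketch of the relevant lines from the forward() method of qKnowledgeGradient is shown below; the variable names follow the BoTorch source, and the exact code may differ across versions:

# inside qKnowledgeGradient.forward():
values = value_function(X=X_fantasies)    # max posterior mean under each fantasy model
if self.current_value is not None:
    values = values - self.current_value  # subtract the current maximum posterior mean
return values.mean(dim=0)                 # average over the fantasy models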
Next, we apply the same optimization procedure to get the new candidate locations
in candidates_proper and OKG value in acq_value_proper:
with manual_seed(1234):
    candidates_proper, acq_value_proper = optimize_acqf(
        acq_function=qKG_proper,
        bounds=bounds,
        q=2,
        num_restarts=NUM_RESTARTS,
        raw_samples=RAW_SAMPLES,
    )
>>> candidates_proper
tensor([[0.2070, 1.0000],
[0.0874, 0.0122]])
>>> acq_value_proper
tensor(0.0107)
For the full implementation, please visit the accompanying notebook at https://github.com/Apress/Bayesian-optimization/blob/main/Chapter_6.ipynb.
Summary
In this chapter, we went through the inner workings of the one-shot knowledge gradient formulation and its implementation details in BoTorch. This recently proposed acquisition function enjoys attractive theoretical and practical properties. Specifically, we examined how the nested optimization of the original KG formulation is recast as a single, one-shot parallel optimization problem, how qKnowledgeGradient scores a candidate set via the average maximum posterior mean of the fantasy models, and how gen_one_shot_kg_initial_conditions() produces smart initial conditions for this optimization.
In the next chapter, we will cover a case study that fine-tunes the learning rate when
training a deep convolutional neural network.
CHAPTER 7

Case Study: Tuning CNN Learning Rate with BoTorch
The six-dimensional Hartmann function is defined as

f(x) = -\sum_{i=1}^{4} \alpha_i \exp\left( -\sum_{j=1}^{6} A_{ij} (x_j - P_{ij})^2 \right)
where αi, Aij, and Pij are all constants. The search space is the six-dimensional unit hypercube, with each input dimension xj ∈ (0, 1) for j = 1, …, 6. The global minimum is attained at x∗ = (0.20169, 0.150011, 0.476874, 0.275332, 0.311652, 0.6573), with f(x∗) = −3.32237.
Let us start by importing a few common packages as follows. Note that it is good practice to configure the random seed of all three common packages (random, numpy, and torch) before writing any other code. Here, we set the random seed to 8 for all three packages:
import os
import math
import torch
import random
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
SEED = 8
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
We would also like to configure two elements: the computing device and the data type of the tensor objects. In the following code snippet, we specify the device to be the GPU if one is available; a GPU significantly accelerates the computation in both the BO loop and neural network training. We additionally set the data type to double for more precise computation:
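The configuration snippet itself is not reproduced here; a minimal sketch following the usual BoTorch convention is:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.double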
We can load the Hartmann function as our unknown objective function and negate it
to fit the maximization setting as before:
from botorch.test_functions import Hartmann

neg_hartmann6 = Hartmann(negate=True)
Now we can generate initial conditions in the form of a set of randomly selected
input locations and the corresponding noise-corrupted observations.
def generate_initial_data(n=10):
    # generate random initial locations
    train_x = torch.rand(n, 6, device=device, dtype=dtype)
    # obtain the exact value of the objective function and add output dimension
    exact_obj = neg_hartmann6(train_x).unsqueeze(-1)
    # add Gaussian noise to the observation model
    train_y = exact_obj + NOISE_SE * torch.randn_like(exact_obj)
    # get the current best observed value, i.e., utility of the available dataset
    best_observed_value = train_y.max().item()
    return train_x, train_y, best_observed_value
Let us test this function and generate five initial conditions, as shown in the following code snippet. The result shows that both the input location variable train_x and the observation variable train_y live on the GPU (device='cuda:0') and use a double-precision floating-point data type (torch.float64). The best observed value is 0.83, which we seek to improve upon in the follow-up sequential search.
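A sketch of this test call is shown below; it assumes the noise scale NOISE_SE has been defined earlier in the notebook:

train_x, train_y, best_observed_value = generate_initial_data(n=5)
print(train_x.device, train_y.dtype, best_observed_value)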
Now we use these initial conditions to obtain the posterior distributions of the
GP model.
Updating GP Posterior
We are concerned with two objects in the language of BoTorch regarding the GP
surrogate model: the GP model itself and the marginal log-likelihood given the set of
collected observations. We define the following function to digest the dataset of initial
conditions and then return these two objects:
# initialize GP model
from botorch.models import SingleTaskGP
from gpytorch.mlls import ExactMarginalLogLikelihood
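The body of the function is not reproduced in the excerpt above; a minimal sketch consistent with how initialize_model() is used later in the chapter is:

def initialize_model(train_x, train_y):
    # a single-output GP surrogate over the collected observations
    model = SingleTaskGP(train_x, train_y)
    # closed-form marginal log-likelihood for exact GP inference
    mll = ExactMarginalLogLikelihood(model.likelihood, model)
    return mll, model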
Here, we use the SingleTaskGP class to build a single-output GP model and the ExactMarginalLogLikelihood class to obtain the marginal log-likelihood based on its closed-form expression. We can then use this function to create the two variables that hold the GP model and its marginal log-likelihood, respectively, and print out the values of the GP model's hyperparameters:
# fit GP hyperparameters
from botorch.fit import fit_gpytorch_mll

# create the GP model and its marginal log-likelihood from the initial data
mll, model = initialize_model(train_x, train_y)

# fit hyperparameters (kernel parameters and noise variance) of a GPyTorch model
fit_gpytorch_mll(mll.cpu());
mll = mll.to(train_x)
model = model.to(train_x)
>>> print(next(model.parameters()).is_cuda)
True
>>> list(model.named_hyperparameters())
[('likelihood.noise_covar.raw_noise', Parameter containing:
tensor([0.0201], device='cuda:0', dtype=torch.float64, requires_grad=True)),
('mean_module.raw_constant', Parameter containing:
tensor(0.2875, device='cuda:0', dtype=torch.float64, requires_grad=True)),
('covar_module.raw_outputscale', Parameter containing:
tensor(-1.5489, device='cuda:0', dtype=torch.float64, requires_grad=True)),
('covar_module.base_kernel.raw_lengthscale', Parameter containing:
tensor([[-0.8988, -0.9278, -0.9508, -0.9579, -0.9429, -0.9305]],
device='cuda:0', dtype=torch.float64, requires_grad=True))]
# import path for recent BoTorch versions
from botorch.sampling.normal import SobolQMCNormalSampler

MC_SAMPLES = 256
qmc_sampler = SobolQMCNormalSampler(sample_shape=torch.Size([MC_SAMPLES]))
The qEI acquisition function requires three input arguments: the GP model object,
the best-observed value so far, and the quasi-Monte Carlo sampler:
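The construction itself is omitted in the text above; a minimal sketch consistent with this description is:

from botorch.acquisition import qExpectedImprovement

qEI = qExpectedImprovement(
    model=model,
    best_f=best_observed_value,
    sampler=qmc_sampler,
)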
Now let us define a function that takes the specified acquisition function as input, performs the inner optimization to obtain the maximizer of the acquisition function using optimize_acqf(), and returns the proposed locations and the noise-corrupted observations as output.
In the following code snippet, we define the bounds variable to set the lower and upper bounds to the unit hypercube with the preassigned computing device and data type. We also use BATCH_SIZE to indicate the number of parallel candidate locations to be generated in each BO iteration, NUM_RESTARTS to indicate the number of starting points used in the multi-start optimization procedure, and RAW_SAMPLES to denote the number of samples used for initialization when optimizing the MC acquisition function:
# 6d unit hypercube
bounds = torch.tensor([[0.0] * 6, [1.0] * 6], device=device, dtype=dtype)
# parallel candidate locations generated in each iteration
BATCH_SIZE = 3
# number of starting points for multistart optimization
NUM_RESTARTS = 10
# number of samples for the initialization heuristic (value assumed; 512 was used earlier)
RAW_SAMPLES = 512
def optimize_acqf_and_get_observation(acq_func):
    """Optimizes the acquisition function, and returns a new candidate and
    a noisy observation."""
    # optimize
    candidates, _ = optimize_acqf(
        acq_function=acq_func,
        bounds=bounds,
        q=BATCH_SIZE,
        num_restarts=NUM_RESTARTS,
        raw_samples=RAW_SAMPLES,  # used for initialization heuristic
        options={"batch_limit": 5, "maxiter": 200},
    )
    # observe new values (continuation assumed, mirroring generate_initial_data above)
    new_x = candidates.detach()
    exact_obj = neg_hartmann6(new_x).unsqueeze(-1)
    new_y = exact_obj + NOISE_SE * torch.randn_like(exact_obj)
    return new_x, new_y
Now let us test this function and observe the next batch of three sampling locations
proposed and observations from the observation model. The result shows that both new_x
and new_y contain three elements, all living in the GPU and assuming a floating data type:
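A sketch of this test call is:

new_x, new_y = optimize_acqf_and_get_observation(qEI)
# expected shapes: torch.Size([3, 6]) and torch.Size([3, 1])
print(new_x.shape, new_y.shape)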
We now introduce the main course and move into the full BO loop.
import time
from botorch.acquisition import qKnowledgeGradient

best_observed_qei.append(best_observed_value_qei)
best_observed_qkg.append(best_observed_value_qkg)
best_random.append(best_observed_value_qei)

# run N_BATCH rounds of BayesOpt after the initial random batch
for iteration in range(1, N_BATCH + 1):
    t0 = time.monotonic()
    # ... obtain new candidates and observations for each policy
    # (abridged; see the description of the procedure below)
    fit_gpytorch_mll(mll_qkg.cpu());
    mll_qkg = mll_qkg.to(train_x)
    model_qkg = model_qkg.to(train_x)
    # reinitialize the models so they are ready for fitting on next iteration
    mll_qei, model_qei = initialize_model(
        train_x_qei,
        train_y_qei
    )
    mll_qkg, model_qkg = initialize_model(
        train_x_qkg,
        train_y_qkg
    )
    t1 = time.monotonic()
    if verbose:
        print(
            f"\nBatch {iteration:>2}: best_value (random, qEI, qKG) = "
            f"({max(best_random):>4.2f}, {best_value_qei:>4.2f}, {best_value_qkg:>4.2f}),"
            f"time = {t1-t0:>4.2f}.", end=""
        )
    else:
        print(".", end="")

best_observed_all_qei.append(best_observed_qei)
best_observed_all_qkg.append(best_observed_qkg)
best_random_all.append(best_random)
Note that the basic BO procedure stays the same in this rather long code listing. For each trial, we generate a set of initial conditions using generate_initial_data(), initialize the GP model using initialize_model(), optimize its hyperparameters using fit_gpytorch_mll(), employ optimize_acqf_and_get_observation() to obtain the next batch of sampling locations and noise-perturbed observations for the respective search policy, and finally update the results in the master lists.
The random search policy is a special and simple one; we choose a random set of
locations, collect the noisy observation, and record the current best candidate. The
following function provides the update rule:
def update_random_observations(best_random):
    """Simulates a random policy by drawing a BATCH_SIZE of new random points,
    observing their values, and updating the current best candidate to
    the running list.
    """
    rand_x = torch.rand(BATCH_SIZE, 6)
    next_random_best = neg_hartmann6(rand_x).max().item()
    best_random.append(max(best_random[-1], next_random_best))
    return best_random
Now let us plot the results and analyze their performances in one place. The
following code snippet generates three lines representing the average cumulative
best candidates (observation) across each BO step, with the vertical bars denoting the
standard deviation from the mean:
GLOBAL_MAXIMUM = neg_hartmann6.optimal_value
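The plotting code itself is not reproduced here; a condensed sketch along these lines (assuming each policy's record holds N_BATCH + 1 values per trial) is as follows:

import numpy as np

iters = np.arange(N_BATCH + 1) * BATCH_SIZE
y_rnd = np.asarray(best_random_all)
y_qei = np.asarray(best_observed_all_qei)
y_qkg = np.asarray(best_observed_all_qkg)

fig, ax = plt.subplots(figsize=(8, 6))
ax.errorbar(iters, y_rnd.mean(axis=0), yerr=y_rnd.std(axis=0), label="random")
ax.errorbar(iters, y_qei.mean(axis=0), yerr=y_qei.std(axis=0), label="qEI")
ax.errorbar(iters, y_qkg.mean(axis=0), yerr=y_qkg.std(axis=0), label="qKG")
ax.axhline(y=GLOBAL_MAXIMUM, color="k", linestyle="--", label="true best objective")
ax.set(xlabel="number of observations", ylabel="best observed value")
ax.legend(loc="lower right")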
Running this code generates Figure 7-1. The result shows that the qKG acquisition function performs best, likely due to its less myopic nature: it assesses the posterior over the whole domain instead of focusing only on the observed locations. qEI, on the other hand, dominates only in the first step and then performs worse than qKG due to its more myopic and restrictive nature: the best candidate must come from the observed dataset. Both qEI and qKG perform better than the random policy, showing the benefit of spending this extra computation in the global optimization exercise.
Figure 7-1. Comparing the performance of three search policies: qEI, qKG, and random search. The qKG acquisition function performs best due to its less myopic nature, while qEI dominates only in the first step and then performs worse than qKG due to its myopic nature. Both qEI and qKG perform better than the random policy, showing the benefit of spending this extra computation in the global optimization exercise
In the next section, we will switch gears and look at training and optimizing a
convolutional neural network.
Note that these parameters are part of the CNN model and are the subject of
optimization when training the CNN. In particular, the current parameters would
be used to generate a set of predictions for the current batch of data and obtain a
corresponding loss, a measure of the quality of fit. This completes the forward pass. In
the backward pass, the loss will flow back to each previous layer by taking the partial
derivative with respect to each parameter. These partial derivatives will be used to
update the parameters using the well-known stochastic gradient descent (SGD)
algorithm. This completes one iteration.
Understanding the inner workings of a CNN is not the main focus of this book. Rather, we will use a CNN as an example and demonstrate how to use BO to make a good choice among its many hyperparameters. Hyperparameters differ from parameters in that they must be chosen and fixed before each round of network training starts. Examples of hyperparameters include the learning rate used in SGD updates, the number of nodes in a layer, and the number of layers in a neural network. Since the learning rate is typically the first and foremost hyperparameter to tune when training a modern neural network, we will focus on optimizing it in the following sections. That is, we would like to find the learning rate that, for the current network architecture, gives the highest predictive accuracy on the test set.
Let us first get familiar with the image data we will work with.
Using MNIST
The MNIST dataset (Modified National Institute of Standards and Technology database) is a database of handwritten digits widely used for experimenting, prototyping, and benchmarking machine learning and deep learning models. It consists of 60,000 training images and 10,000 test images, which are normalized and center-cropped. They can be accessed via the torchvision.datasets class without a separate download.
We will cover a step-by-step implementation of training a simple CNN model on MNIST. We start with data loading and preprocessing using the datasets class from PyTorch's torchvision package and the DataLoader class from torch.utils.data. Using these utilities significantly accelerates the model development process since they offer consistent handling and preprocessing of the input data. Then, we will define the model architecture, the cost function, and the optimization routine. Finally, we will train the model using the training set and test its predictive performance using the test set.
Let us download the MNIST dataset using the datasets subpackage from torchvision, a powerful and convenient out-of-the-box utility for accessing popular image datasets. It stores the image samples and their corresponding labels. In the following code snippet, we first import the datasets class and instantiate it to download the training and test data into the data folder. In addition, each downloaded image is transformed into a tensor object, PyTorch's multidimensional array (analogous to a NumPy array). The transformation also scales the image pixels from the original range [0, 255] to a standardized scale [0.0, 1.0].
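A sketch of this snippet, consistent with the printed dataset summaries below, is:

from torchvision import datasets
from torchvision.transforms import ToTensor

train_data = datasets.MNIST(root="data", train=True, download=True, transform=ToTensor())
test_data = datasets.MNIST(root="data", train=False, download=True, transform=ToTensor())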
We can print the objects to access their profile information, including the number of
observations:
>>> print(train_data)
Dataset MNIST
Number of datapoints: 60000
Root location: data
Split: Train
StandardTransform
Transform: ToTensor()
>>> print(test_data)
Dataset MNIST
Number of datapoints: 10000
After downloading the data in the train_data variable, the input and output are
stored in the data and targets attributes, respectively. We can examine their shape as
follows:
>>> print(train_data.data.size())
>>> print(train_data.targets.size())
torch.Size([60000, 28, 28])
torch.Size([60000])
The output shows that we have 60,000 images, each having a shape of 28 by 28,
corresponding to the width and height of an image. Note that a color image would have
a third dimension called depth or channel, which usually assumes a value of 3. Since we
are dealing with grayscale images, the extra dimension of 1 is omitted.
We can visualize a few images in the training set. The following code randomly
selects 25 images and plots them on a five-by-five grid canvas:
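The visualization code is not reproduced here; a minimal sketch is shown below (the figure size and random selection logic are assumptions):

fig = plt.figure(figsize=(8, 8))
for i in range(25):
    idx = random.randint(0, len(train_data) - 1)
    img, label = train_data[idx]
    ax = fig.add_subplot(5, 5, i + 1)
    ax.imshow(img.squeeze(), cmap="gray")
    ax.set_title(label)
    ax.axis("off")
plt.tight_layout()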
Figure 7-2 shows the output of the 25 digits. Note that each image has a shape of 28
by 28, thus having a total of 784 pixels, that is, features per image.
Figure 7-2. Visualizing 25 random MNIST digits. Each digit is a grayscale image
that consists of 28*28=784 pixels
With the dataset downloaded, the next step is to load it for model training. The DataLoader class from the torch.utils.data module provides a convenient way to load and iterate through the data in batches and to perform specific transformations if needed. Passing the dataset object as an argument to DataLoader enables automatic batching, sampling, shuffling, and multiprocess loading of the dataset. The following code iteratively loads the training and test sets with a batch size of 100 after shuffling all the images and stores the result in the loaders object:
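A sketch of this step, consistent with how loaders is indexed later in the chapter, is:

from torch.utils.data import DataLoader

loaders = {
    "train": DataLoader(train_data, batch_size=100, shuffle=True, num_workers=1),
    "test": DataLoader(test_data, batch_size=100, shuffle=True, num_workers=1),
}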
We can observe the shape of the data loaded after iterating through loaders for the
first time, as shown in the following code:
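A sketch of this check:

X, y = next(iter(loaders["train"]))
print(X.shape, y.shape)  # e.g., torch.Size([100, 1, 28, 28]) torch.Size([100])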
When defining the model in PyTorch, each building block serves a specific purpose. In other words, the model parameters and the model architecture that specifies the interaction between the data and the parameters have their own dedicated functions.
The common practice is to use a class to wrap all these different utility functions
under one roof. Specifically, we can arrange the model parameters in one function,
which often manifests as the class’s default __init__ function. The __init__ function
contains initialized attributes when the class is first instantiated, meaning the abstract
class is converted to a physical and tangible object. These attributes include essential
building blocks of the class, that is, layers for the neural network and the resulting
parameters implied by the layers. Instantiation is similar to buying a Lego toy set with all the parts provided based on its design specification. These parts can also be of different sizes and structures; a big component can consist of several smaller ones.
In addition, the function that defines the model architecture and regulates the
flow of data goes into the forward function, which serves as the instruction sheet for
assembling the parts. This special function is automatically executed under the hood
when the class’s object is called. Thus, it acts as the default information road map that
chains together the various components defined by the __init__ function, marries
with the input data, specifies how they pass through the network, and returns the model
prediction.
Let us look at a specific example by creating a model class called CNN, suggesting that
this is a convolutional neural network. Convolution is a particular type of neural network
layer that effectively processes image data. It takes the form of a small square filter (also
called kernel) and operates on local regions of the input data by repeatedly scanning
through the input surface. In the case of grayscale image data in MNIST, each image
entry will manifest as a two-dimensional matrix where each cell stores a pixel value
between 0 and 255.
A common practice is to follow a convolution layer with a ReLU layer (to keep
positive elements only) and a pooling layer (to reduce the parameters). A pooling
operation is similar to convolution in that it involves a kernel interacting with different
patches of the input. However, there are no weights in the kernel. The pooling operation
uses the kernel to zoom in on a specific input patch, choose a representative value such
as the mean (average pooling) or the maximum (max pooling), and store it in the output
feature map. Such an operation reduces the input data dimension and keeps only the
(hopefully) meaningful features.
204
Chapter 7 Case Study: Tuning CNN Learning Rate with BoTorch
The CNN class we will now create uses all three types of layers introduced earlier. As shown in the following code listing, it inherits from the predefined base class torch.nn.Module, which manages the backend logistics of neural network modules so that we do not have to implement them from scratch. The CNN class has two essential functions: __init__ and forward. When we instantiate this class into an object, say model = CNN(), the model object is created using the __init__ method of CNN. When we use the model to predict on some new data, say model(new_data), the new_data input is passed to the forward function under the hood. Both functions are called implicitly.
The following code snippets show the model class. At a high level, we define two
convolutional blocks conv1 and conv2 upon initialization in the __init__ function. Each
block sequentially applies three layers: convolution by nn.Conv2d(), ReLU by nn.ReLU(),
and max pooling by nn.MaxPool2d(). Note that in nn.Conv2d(), in_channels refers to
the number of color channels in the image data. Since we are dealing with a grayscale
image, it is set to 1. The out_channels parameter refers to the number of kernels to be
created; this will determine the depth of the resulting feature map. In other words, setting
it to 16 means that we will create 16 different kernels to convolve with the input data,
each learning from a different perspective and jointly forming a 16-layer feature map
in the output. The kernel_size parameter determines the size of the square kernel. A
bigger kernel will promote more feature sharing but contain more weights to tune, while
a smaller kernel requires more convolution operations but can better attend to local
features in the input. Finally, we have the stride parameter to control the step size when moving the kernel to convolve with the next patch, and the padding parameter to adjust the shape of the output by padding zeros around the periphery of the input.
import torch.nn as nn  # (import assumed; needed for the building blocks below)

class CNN(nn.Module):
    # (the opening lines, through in_channels=1, follow the description in the text)
    def __init__(self):
        super(CNN, self).__init__()
        # The first convolutional block
        self.conv1 = nn.Sequential(
            nn.Conv2d(
                in_channels=1,
                out_channels=16,
                kernel_size=5,
                stride=1,
                padding=2,
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )
        # The second convolutional block
        self.conv2 = nn.Sequential(
            nn.Conv2d(16, 32, 5, 1, 2),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # The final fully connected layer which outputs 10 classes
        self.out = nn.Linear(32 * 7 * 7, 10)

    # Specify the flow of information
    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        # Flatten the output to shape (batch_size, 32 * 7 * 7)
        x = x.view(x.size(0), -1)
        output = self.out(x)
        return output
Lumping together multiple simple layers in one block and repeating such blocks multiple times with different configurations is a common way of defining relatively complex neural networks. Such structure in convolutional neural networks also helps learn the compositionality of image data. A sufficiently trained network extracts low-level features, such as edges, in the early layers and high-level patterns, such as objects, in the deeper layers.
In addition, we also define a fully connected layer as the last layer of the network
in the initialization function via nn.Linear(). Note that we need to pass in the correct
number of nodes for the input in the first parameter and the output in the second
parameter. The input size is determined based on the network configuration in previous
layers, and the output is set to ten since we have ten classes of digits to classify.
Lastly, the forward function chains these components together sequentially. Pay attention when connecting a convolutional layer to a fully connected layer: it is necessary to flatten the multidimensional feature map into a one-dimensional vector for each image. In this case, we use the view() function to reshape the data, with the -1 argument telling PyTorch to infer the flattened dimension while keeping the batch size (retrieved as the first dimension using x.size(0)) intact.
In addition, the training process is significantly faster on a GPU. When running the accompanying notebook in Colab, a hassle-free online Jupyter notebook platform, a GPU can be enabled without additional charge (up to a specific usage limit, of course) by simply changing the hardware accelerator option in the runtime type to GPU.
In the following code snippet, we create the model by instantiating the class into an object via model = CNN().to(device), where the device variable determines whether a GPU is used if available. Printing out the model object gives the specification of the different components defined by the __init__ function:
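A sketch of this step (the later code refers to the model object as model_cnn, so that name is used here):

model_cnn = CNN().to(device)
print(model_cnn)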
When the neural network architecture starts to scale up and become complex, it is
often helpful to print out the architecture for better clarification of its composition. In the
following code snippet, we resort to the torchsummary package to ease the visualization
task by passing in the size of an input entry. The output shows the model architecture
from top to bottom, with each layer sequentially suffixed by an integer. The output
shape and number of parameters in each layer are also provided. A total of (trainable)
28,938 parameters are used in the model. Note that we do not have any nontrainable
parameters; this relates to the level of model fine-tuning in transfer learning.
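A sketch of this call (assuming the torchsummary package is installed and the model lives on the GPU):

from torchsummary import summary

summary(model_cnn, input_size=(1, 28, 28))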
Training CNN
The CNN model contains the preset architecture and initial weights, which will be optimized as training proceeds. Specifically, the optimization follows the direction that minimizes the loss on the training set. In the case of classifying MNIST digits, the loss function takes the form of a cross-entropy loss:
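A sketch of this definition (the variable name criterion is an assumption):

criterion = nn.CrossEntropyLoss()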
We can now put the preceding components together into one complete training loop. Based on the predefined model and data loader, the following code listing defines the training function that performs SGD optimization via one complete pass through the training dataset. We also track the evolution of the cost as the weights are updated across different training batches, and this process is repeated for the specified number of epochs.
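The training listing itself is not reproduced here; a minimal sketch consistent with the description (the default learning rate of 0.01 and the exact reporting format are assumptions) is:

import torch.optim as optim

def train(model, loaders, verbose=True):
    model.train()
    criterion = nn.CrossEntropyLoss().to(device)
    optimizer = optim.SGD(model.parameters(), lr=0.01)  # assumed default learning rate
    total_img = len(loaders["train"].dataset)
    for batch, (X, y) in enumerate(loaders["train"]):
        X, y = X.to(device), y.to(device)
        # forward pass: predictions and loss for the current batch
        pred = model(X)
        loss = criterion(pred, y)
        # backward pass: gradients and SGD update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if verbose and batch % 100 == 0:
            current_img_idx = batch * len(X)
            print(f"loss: {loss.item():>7f} [{current_img_idx:>5d}/{total_img:>5d}]")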
We will also define a function to check the model's performance on the test set. The
following code listing generates model predictions and calculates the test accuracy by
comparing them against the target labels. Since no learning is needed, the model is set to
the evaluation mode and put under the torch.no_grad() context when generating the
predictions.
# A reconstructed sketch of the evaluation function (the accuracy computation
# is an assumption; only the reporting tail appears in the original listing):
def test(model, loaders, verbose=True):
    model.eval()
    correct = 0
    with torch.no_grad():
        for X, y in loaders["test"]:
            pred = model(X.to(device))
            correct += (pred.argmax(1) == y.to(device)).sum().item()
    correct /= len(loaders["test"].dataset)
    if verbose:
        print(f"Test accuracy: {correct:>0.3f}")
    return correct
Now let us test both functions over three epochs. Here, each epoch consists of a full training pass over the training set followed by an evaluation on the test set:
num_epochs = 3
for t in range(num_epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(model_cnn, loaders, verbose=False)
    test_accuracy = test(model_cnn, loaders, verbose=True)
print("Done!")
Epoch 1
-------------------------------
Test accuracy: 0.914
Epoch 2
-------------------------------
Test accuracy: 0.937
Epoch 3
-------------------------------
Test accuracy: 0.957
Done!
The result shows over 95% test set accuracy in just three epochs, a decent feat given the current model architecture and learning rate. In the next section, we would like to answer the following question: Which learning rate gives the best test set accuracy under the current training budget?
Listing 7-10. Defining a customized module to manage model training and testing
class CNN_CLASSIFIER(nn.Module):
    # Specify the components to be created automatically upon instantiation
    def __init__(self, loaders, num_epochs=10, verbose_train=False,
                 verbose_test=False):
        super(CNN_CLASSIFIER, self).__init__()
        self.loaders = loaders
        self.num_epochs = num_epochs
        self.model = CNN().to(device)
        self.criterion = nn.CrossEntropyLoss().cuda()
        self.verbose_train = verbose_train
        self.verbose_test = verbose_test

    # forward() receives the learning rate, builds the SGD optimizer over the
    # wrapped model's parameters, and runs the full train/test cycle
    # (signature and optimizer line follow the description in the text below)
    def forward(self, learning_rate):
        self.optimizer = optim.SGD(self.model.parameters(), lr=learning_rate)
        for t in range(self.num_epochs):
            print(f"Epoch {t+1}\n-------------------------------")
            self.train(verbose=self.verbose_train)
            test_accuracy = self.test(verbose=self.verbose_test)
        return test_accuracy

    # the train() and test() methods mirror the standalone train() and test()
    # functions defined earlier; only their reporting tails are shown here
    def train(self, verbose=True):
        # ... (batch loop over self.loaders["train"]: forward pass, backward pass, SGD step)
            if verbose:
                if batch % 100 == 0:
                    loss, current_img_idx = loss.item(), batch * len(X)
                    print(f"loss: {loss:>7f} [{current_img_idx:>5d}/{total_img:>5d}]")
        return self

    def test(self, verbose=True):
        # ... (evaluation loop over self.loaders["test"] computing the accuracy in correct)
        if verbose:
            print(f"Test accuracy: {correct:>0.3f}")
        return correct
One particular point to pay attention to is the way we initialize the SGD optimizer. In the forward() function, we pass self.model.parameters() to optim.SGD() so that we optimize the parameters of the model wrapped inside this class, given the learning rate input. The model will not be updated if this link is removed, that is, if the optimizer is not given the model parameters stored as attributes of self.
We can now perform the same trial as before, in a much more convenient manner:
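A sketch of this trial (the learning rate value of 0.01 is an arbitrary choice for illustration):

nn_classifier = CNN_CLASSIFIER(loaders=loaders, num_epochs=3, verbose_test=True)
test_accuracy = nn_classifier(learning_rate=0.01)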
def generate_initial_data_cnn(n=5):
    # generate training data
    # train_x = torch.rand(n, 1, device=device, dtype=dtype)
    train_x = torch.distributions.uniform.Uniform(0.0001, 10).sample([n, 1]).type(
        torch.DoubleTensor).cuda()
    train_y = []
    for tmp_x in train_x:
        print(f"\nCurrent learning rate: {tmp_x.item()}")
        nn_classifier = CNN_CLASSIFIER(loaders=loaders, num_epochs=3,
                                       verbose_test=True)
        tmp_y = nn_classifier(learning_rate=tmp_x.item())
        train_y.append(tmp_y)
    # (continuation assumed, mirroring generate_initial_data above)
    train_y = torch.tensor(train_y, dtype=dtype, device=device).unsqueeze(-1)
    best_observed_value = train_y.max().item()
    return train_x, train_y, best_observed_value
Now let us generate three initial observations. The result from running the following code shows that the highest test set accuracy is 10%, not a good start. We will see how to obtain more meaningful improvements later.
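A sketch of this call (the variable names are assumptions):

train_x_cnn, train_y_cnn, best_observed_value_cnn = generate_initial_data_cnn(n=3)
print(best_observed_value_cnn)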
After initializing and updating the GP model, let us take qEI as the example acquisition function to obtain the next learning rate. The following function, optimize_acqf_and_get_observation_cnn(), helps us achieve this:
Listing 7-11. Obtaining the next learning rate by optimizing the acquisition function
def optimize_acqf_and_get_observation_cnn(acq_func):
    """Optimizes the acquisition function, and returns a new candidate and
    a noisy observation."""
    # optimize
    candidates, _ = optimize_acqf(
        acq_function=acq_func,
        bounds=bounds,
        q=BATCH_SIZE,
        num_restarts=NUM_RESTARTS,
        raw_samples=RAW_SAMPLES,  # used for initialization heuristic
        options={"batch_limit": 5, "maxiter": 200},
    )
    # evaluate each proposed learning rate (new_x assumed to be the detached candidates)
    new_x = candidates.detach()
    new_y = []
    for tmp_x in new_x:
        print(f"\nCurrent learning rate: {tmp_x.item()}")
        nn_classifier = CNN_CLASSIFIER(loaders=loaders, num_epochs=3,
                                       verbose_test=True)
        tmp_y = nn_classifier(learning_rate=tmp_x.item())
        new_y.append(tmp_y)
    # (continuation assumed: collect the results and return them)
    new_y = torch.tensor(new_y, dtype=dtype, device=device).unsqueeze(-1)
    return new_x, new_y
Note that we use a for loop to evaluate each of the three proposed learning rates in turn. This could be further parallelized if we had access to multiple GPUs and designed the corresponding search strategy.
Let us test this function to obtain the next three learning rates as well:
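A sketch of this test call, assuming a qEI acquisition function has been built on the learning-rate surrogate model (the name qEI_cnn is hypothetical):

new_x_cnn, new_y_cnn = optimize_acqf_and_get_observation_cnn(qEI_cnn)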
Similarly, we would edit the function to generate the learning rates using the random
search strategy. This function, update_random_observations_cnn(), is defined as
follows:
def update_random_observations_cnn(best_random):
    """Simulates a random policy by drawing a BATCH_SIZE of new random points,
    observing their values, and updating the current best candidate to
    the running list.
    """
    rand_x = torch.distributions.uniform.Uniform(0.0001, 10).sample(
        [BATCH_SIZE, 1]).type(torch.DoubleTensor).cuda()
    rand_y = []
    for tmp_x in rand_x:
        print(f"\nCurrent learning rate: {tmp_x.item()}")
        nn_classifier = CNN_CLASSIFIER(loaders=loaders, num_epochs=3,
                                       verbose_test=True)
        tmp_y = nn_classifier(learning_rate=tmp_x.item())
        rand_y.append(tmp_y)
    # (continuation assumed: record the running best observed accuracy)
    best_random.append(max(best_random[-1], max(rand_y)))
    return best_random
Finally, we are ready to serve the main course. The following code snippet introduces the same three competing search strategies: qEI, qKG, and the random policy, all of which proceed for a total of ten steps by setting N_BATCH=10. Note that the total run takes around one hour on a single GPU instance, so grab a cup of coffee after you execute the accompanying notebook!
Listing 7-13. The full BO loop for three competing hyperparameter search strategies
best_observed_qei.append(best_observed_value_qei)
best_observed_qkg.append(best_observed_value_qkg)
best_random.append(best_observed_value_qei)

t0_total = time.monotonic()
# run N_BATCH rounds of BayesOpt after the initial random batch
for iteration in range(1, N_BATCH + 1):
    t0 = time.monotonic()
    # ... obtain new learning rates and observations for each policy
    # (abridged; see the description of the procedure earlier in the chapter)
    fit_gpytorch_mll(mll_qkg.cpu());
    mll_qkg = mll_qkg.to(train_x)
    model_qkg = model_qkg.to(train_x)
    # reinitialize the models so they are ready for fitting on next iteration
    mll_qei, model_qei = initialize_model(
        train_x_qei,
        train_y_qei
    )
    mll_qkg, model_qkg = initialize_model(
        train_x_qkg,
        train_y_qkg
    )
    t1 = time.monotonic()
    if verbose:
        print(
            f"\nBatch {iteration:>2}: best_value (random, qEI, qKG) = "
            f"({max(best_random):>4.2f}, {best_value_qei:>4.2f}, {best_value_qkg:>4.2f}),"
            f"time = {t1-t0:>4.2f}.", end=""
        )
    else:
        print(".", end="")

best_observed_all_qei.append(best_observed_qei)
best_observed_all_qkg.append(best_observed_qkg)
best_random_all.append(best_random)

t1_total = time.monotonic()
print(f"total time = {t1_total-t0_total:>4.2f}.")
We plot the cumulative best-observed test set accuracies of the three search strategies in Figure 7-3. Note that the standard deviation is very large due to the small number of trials in each experiment; the error bars even extend beyond the maximum achievable value (i.e., 100%), so the result is only indicative.
Figure 7-3. Comparing the performance of three search policies in identifying the
optimal learning rate. The qEI acquisition function located the highest-performing
learning rate in just two steps, while qKG exhibited more variation across all steps.
The random strategy only caught up in the last few steps
For the full implementation, please visit the accompanying notebook at https://github.com/Apress/Bayesian-optimization/blob/main/Chapter_7.ipynb.
Summary
In this chapter, we walked through the full BO loop, covering a synthetic case using the Hartmann function and a practical hyperparameter tuning case using a CNN. Specifically, we compared the qEI, qKG, and random search policies, first in seeking the global optimum of the Hartmann function and then in tuning the learning rate of a CNN trained on MNIST.