Machine Learning and Pattern Recognition — Gaussian Processes
Gaussian processes (GPs) are distributions over functions from an input x, which could
be high-dimensional and could be a complex object like a string or graph, to a scalar or
1-dimensional output f (x). We will use a Gaussian process prior over functions in a Bayesian
approach to regression. Gaussian processes can also be used as a part of larger models.
Goals: 1) model complex functions: not just straight lines/planes, or a combination of a fixed
number of basis functions. 2) Represent our uncertainty about the function, for example
with error bars. 3) Do as much of the mathematics as possible analytically.
[Figure: point estimates of a function f(x), with uncertainty, plotted over x ∈ [0, 1].]
The largest value appears to be around x = 0.4. If we can afford to run a couple of experiments,
we might also test the system at x = 0.8, which might actually perform better. According
to our point estimates, the performance at x = 0 is similar to x = 0.8. However, we have little
uncertainty at x = 0, so it isn't worth running an experiment there.
Using uncertainty in a probabilistic model to guide search is called Bayesian optimization. We
could do Bayesian optimization with Bayesian linear regression, although we would have to
be careful: Bayesian linear regression is often too confident if the model is too simple (think of
our ‘underfitting’ example in Bayesian linear regression), or doesn’t include enough basis
functions.
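As a rough sketch of the idea, one simple acquisition rule (an upper confidence bound, just one of several used in practice) scores each candidate input by its posterior mean plus a multiple of its posterior standard deviation. The helper below is hypothetical and assumes we already have those quantities at a grid of candidate inputs.

```python
import numpy as np

def next_query_ucb(candidates, post_mean, post_std, beta=2.0):
    """Pick the candidate input with the best optimistic estimate.

    candidates: (M, D) array of inputs we could test next.
    post_mean, post_std: (M,) posterior mean and standard deviation at those inputs.
    beta: assumed trade-off constant; larger values favour exploring uncertain regions.
    """
    score = post_mean + beta * post_std   # upper confidence bound on performance
    return candidates[np.argmax(score)]
```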
A multivariate Gaussian distribution over a vector f is fully specified by its mean vector and covariance matrix:
\[ \mu_i = \mathbb{E}[f_i], \qquad (2) \]
\[ \Sigma_{ij} = \mathbb{E}[f_i f_j] - \mu_i \mu_j. \qquad (3) \]
The covariance matrix Σ must be positive definite. (Or positive semi-definite if we allow some
of the elements of f to depend deterministically on each other.) If we know a distribution is
Gaussian and know its mean and covariances, we have completely defined the distribution.
Any marginal distribution of a Gaussian is also Gaussian. So given a joint distribution
\[ p(\mathbf{f}, \mathbf{g}) = \mathcal{N}\!\left( \begin{bmatrix} \mathbf{f} \\ \mathbf{g} \end{bmatrix};\; \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix},\; \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix} \right), \qquad (4) \]
the marginal over the first block is simply p(f) = N(f; a, A): we read off the corresponding parts of the mean and covariance. Any conditional distribution of a Gaussian is also Gaussian; conditioning on g gives
\[ p(\mathbf{f} \mid \mathbf{g}) = \mathcal{N}\!\left( \mathbf{f};\; \mathbf{a} + C B^{-1} (\mathbf{g} - \mathbf{b}),\; A - C B^{-1} C^\top \right). \]
Showing this result from scratch requires quite a few lines of linear algebra. However, this is
a standard result that is easily looked up (and we wouldn’t make you show it in a question!).
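As a quick numerical illustration of these block formulas, here is a minimal sketch using an arbitrary, made-up joint Gaussian (the dimensions and variable names are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build an arbitrary joint Gaussian over (f, g) with mean [a; b] and
# covariance [[A, C], [C^T, B]], as in equation (4).
D_f, D_g = 2, 3
L = rng.standard_normal((D_f + D_g, D_f + D_g))
Sigma = L @ L.T + 1e-6 * np.eye(D_f + D_g)   # positive definite joint covariance
mu = rng.standard_normal(D_f + D_g)

a, b = mu[:D_f], mu[D_f:]
A, B = Sigma[:D_f, :D_f], Sigma[D_f:, D_f:]
C = Sigma[:D_f, D_f:]

# Marginal: p(f) = N(f; a, A) -- just read off the relevant blocks.
# Conditional: p(f | g) = N(f; a + C B^{-1}(g - b), A - C B^{-1} C^T).
g_obs = rng.standard_normal(D_g)             # pretend we observed g
cond_mean = a + C @ np.linalg.solve(B, g_obs - b)
cond_cov = A - C @ np.linalg.solve(B, C.T)
```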
[Figure: several random sample curves f(x) plotted over x ∈ [0, 1].]
The ‘functions’ plotted here are actually samples from a 100-dimensional Gaussian distribution.
Each sample gave the height of a curve at 100 locations along the x-axis, which were
joined up with straight lines. If we chose a different kernel we could get rough functions,
straighter functions, or periodic functions.
Our prior is that the function comes from a Gaussian process, f ∼ GP, where an N×1
vector f of function values at training locations X = {x^{(i)}} has prior p(f) = N(f; 0, K), with
f_i = f(x^{(i)}) and K_{ij} = k(x^{(i)}, x^{(j)}).
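A minimal sketch of how such prior samples can be generated, assuming a squared-exponential (RBF) kernel with made-up parameters (the notes don’t say which kernel produced the figure above):

```python
import numpy as np

def rbf_kernel(X, Y, ell=0.1, sigma_f=1.0):
    """K(X, Y): kernel values between every pair of rows of X and Y.

    ell (lengthscale) and sigma_f (signal amplitude) are assumed values.
    """
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dists / ell**2)

# 100 input locations on a grid; the prior over the function values there is N(0, K).
X_grid = np.linspace(0, 1, 100)[:, None]
K = rbf_kernel(X_grid, X_grid) + 1e-9 * np.eye(100)   # jitter for numerical stability
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(np.zeros(100), K, size=3)  # each row is one 'function'
```

Joining each row of `samples` against `X_grid` with straight lines gives plots like the one described above; changing the kernel (or its lengthscale) gives rougher, straighter, or periodic curves.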
Given noisy observations of the function values at the training locations:
\[ y_i \sim \mathcal{N}(f_i,\, \sigma_y^2), \qquad (11) \]
we can update our beliefs about the function. For now we assume the noise variance σ_y²
is fixed and known. According to the model, the observations are centred around one
underlying true function as follows:
[Figure: noisy observations y_i scattered around an underlying function f(x), over x ∈ [0, 1].]
The model defines a joint Gaussian distribution over the noisy observations y and the function values f∗ at any set of test locations X∗:
\[ p\!\left( \begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \right) = \mathcal{N}\!\left( \begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix};\; \mathbf{0},\; \begin{bmatrix} K(X,X) + \sigma_y^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right), \qquad (12) \]
where the compact notation K(X, Y) follows the Rasmussen and Williams GP textbook
(Murphy uses an even more compact notation). Given an N×D matrix¹ X and an M×D
matrix Y, K(X, Y) is an N×M matrix of kernel values:
\[ K(X, Y)_{ij} = k(\mathbf{x}^{(i)}, \mathbf{y}^{(j)}). \]
Because Gaussians marginalize so easily, we can ignore the enormous (or infinite) prior
covariance matrix over all of the function values that we’re not interested in.
Using the rule for conditional Gaussians (for the conditional distribution over a subset of
values in a multivariate joint Gaussian given the others), we can immediately identify that
the posterior over function values p(f∗ | y) is Gaussian with:
\[ \mathbb{E}[\mathbf{f}_* \mid \mathbf{y}] = K(X_*, X)\,\big[K(X,X) + \sigma_y^2 I\big]^{-1} \mathbf{y}, \]
\[ \mathrm{cov}[\mathbf{f}_* \mid \mathbf{y}] = K(X_*, X_*) - K(X_*, X)\,\big[K(X,X) + \sigma_y^2 I\big]^{-1} K(X, X_*). \]
This posterior distribution over a few function values is a marginal of the joint posterior
distribution over the whole function. The posterior distribution over possible functions is
itself a Gaussian Process (a large or infinite Gaussian distribution).
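A sketch of this conditional-Gaussian computation for p(f∗ | y), reusing the hypothetical rbf_kernel helper from the earlier sketch; the noise level and variable names are assumptions:

```python
import numpy as np

def gp_posterior(X_train, y, X_test, kernel, sigma_y=0.1):
    """Posterior mean and covariance of f* at X_test, given noisy observations y."""
    N = len(X_train)
    K = kernel(X_train, X_train) + sigma_y**2 * np.eye(N)   # K(X, X) + sigma_y^2 I
    K_s = kernel(X_train, X_test)                           # K(X, X*)
    K_ss = kernel(X_test, X_test)                           # K(X*, X*)
    post_mean = K_s.T @ np.linalg.solve(K, y)
    post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return post_mean, post_cov

# Posterior samples like those in the next figure could then be drawn with, e.g.,
# rng.multivariate_normal(post_mean, post_cov + 1e-9 * np.eye(len(post_mean)), size=3)
```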
[The website version of this note has a question here.]
1. Actually the inputs don’t have to be D-dimensional vectors. There are kernel functions for graphs and strings.
Once we have computed the covariances, none of the remaining mathematics looks at the feature vectors themselves.
[Figure: three samples from the posterior over functions, f plotted against x ∈ [0, 1].]
Really we plotted three samples from a 100-dimensional Gaussian, p(f∗ | data), where 100
test locations were chosen on a grid. We can see from these samples that we’re fairly sure
about the function for x ∈ [0, 0.4], but less sure of what it’s doing for x ∈ [0.5, 0.8].
We can summarize the uncertainty at each input location by plotting the mean of the
posterior distribution and error bars. The figure below shows the mean of the posterior
plotted as a black line, with a grey band indicating ±2 standard deviations.
[Figure: posterior mean (black line) with a grey band showing ±2 standard deviations, over x ∈ [0, 1].]
This figure doesn’t show how the functions might vary within that band, for example
whether they are smooth or rough.² However, it might be easier to read, and is cheaper to
compute. We don’t need to compute the posterior covariances between test locations to plot
the mean and error band. We just need the 1-dimensional posterior at each test point:
\[ p(f(\mathbf{x}^{(*)}) \mid \mathbf{y}) = \mathcal{N}(f;\, m,\, s^2). \]
Then the computation is a special case of the posterior expression already provided, but
with only one test point:
\[ M = K + \sigma_y^2 I, \qquad (22) \]
\[ m = \mathbf{k}^{(*)\top} M^{-1} \mathbf{y}, \qquad (23) \]
\[ s^2 = k(\mathbf{x}^{(*)}, \mathbf{x}^{(*)}) - \underbrace{\mathbf{k}^{(*)\top} M^{-1} \mathbf{k}^{(*)}}_{\text{positive}}, \qquad (24) \]
where K = K(X, X) and k^{(∗)} = K(X, x^{(∗)}) is the N×1 vector of kernel values between the training inputs and the test input.
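A sketch of this special case for a single test input at a time, which is all that is needed to plot the mean and ±2 standard deviation band; it again assumes the hypothetical rbf_kernel helper and a known noise level:

```python
import numpy as np

def gp_pointwise(X_train, y, x_star, kernel, sigma_y=0.1):
    """Mean m and variance s^2 at one test input x_star, following (22)-(24)."""
    M = kernel(X_train, X_train) + sigma_y**2 * np.eye(len(X_train))   # (22)
    k_star = kernel(X_train, x_star[None, :])[:, 0]                    # k^(*) = K(X, x*)
    m = k_star @ np.linalg.solve(M, y)                                 # (23)
    s2 = kernel(x_star[None, :], x_star[None, :])[0, 0] \
         - k_star @ np.linalg.solve(M, k_star)                         # (24)
    return m, s2

# The error band at each grid point is then m +/- 2 * sqrt(s2).
```

In practice one would factorize M once (e.g. with a Cholesky decomposition) rather than re-solving the linear system for every test point; the function above just mirrors the equations.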
2. Sometimes we care about dependencies in predictions: for example, we might care whether a model says that
several stock-prices are likely to fall together when they fall.
8 Reading
The core recommended reading for Gaussian processes is Bishop section 6.4 to section 6.4.3
inclusive.
Alternatives are: Murphy, Chapter 15, to section 15.2.4 inclusive, Barber Chapter 19 to section
19.3 inclusive, or the dedicated Rasmussen and Williams book³ up to section 2.5. There is
also a chapter on GPs in MacKay’s book.
Gaussian processes are Bayesian kernel methods. Bishop Chapter 6 to section 6.2 inclusive
and Murphy Chapter 14 have more information about kernels in general if you want to read
about the wider picture.
The idea of Bayesian optimization is old. However, a relatively recent paper rekindled interest
in the idea as a way to tune machine learning methods: https://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms. These authors
then formed a startup around the idea, which was acquired by Twitter. Other companies
like Netflix have also used their software.
3. http://gaussianprocess.org/gpml/