Gaussian Processes in Machine Learning
Supervised learning in the form of regression (for continuous outputs) and classification (for discrete outputs) is an important constituent of statistics and machine learning, either for analysis of data sets, or as a subgoal of a more complex problem.
Traditionally, parametric¹ models have been used for this purpose. These have a possible advantage in ease of interpretability, but for complex data sets, simple parametric models may lack expressive power, and their more complex counterparts (such as feed-forward neural networks) may not be easy to work with in practice. The advent of kernel machines, such as Support Vector Machines and Gaussian Processes, has opened the possibility of flexible models which are practical to work with.
In this short tutorial we present the basic idea of how Gaussian Process models can be used to formulate a Bayesian framework for regression. We will first focus on understanding the stochastic process and how it is used in supervised learning. Secondly, we will discuss practical matters regarding the role of hyperparameters in the covariance function, the marginal likelihood and the automatic Occam's razor. For broader introductions to Gaussian processes, consult [1], [2].
1 Gaussian Processes
In this section we define Gaussian Processes and show how they can very naturally be used to define distributions over functions. In the following section we continue to show how this distribution is updated in the light of training examples.
¹ By a parametric model, we here mean a model which during training "absorbs" the information from the training data into the parameters; after training the data can be discarded.
O. Bousquet et al. (Eds.): Machine Learning 2003, LNAI 3176, pp. 63–71, 2004.
© Springer-Verlag Berlin Heidelberg 2004
C.E. Rasmussen
A Gaussian process is fully specified by its mean function m(x) and covariance function k(x, x′). This is a natural generalization of the Gaussian distribution, whose mean and covariance are a vector and a matrix, respectively. The Gaussian distribution is over vectors, whereas the Gaussian process is over functions. We will write:
f ∼ GP(m, k), (1)
meaning: “the function f is distributed as a GP with mean function m and
covariance function k”.
Although the generalization from distribution to process is straightforward,
we will be a bit more explicit about the details, because it may be unfamiliar
to some readers. The individual random variables in a vector from a Gaussian
distribution are indexed by their position in the vector. For the Gaussian process
it is the argument x (of the random function f (x)) which plays the role of index
set: for every input x there is an associated random variable f (x), which is the
value of the (stochastic) function f at that location. For reasons of notational
convenience, we will enumerate the x values of interest by the natural numbers,
and use these indexes as if they were the indexes of the process – don’t let yourself
be confused by this: the index to the process is xi , which we have chosen to index
by i.
Although working with infinite dimensional objects may seem unwieldy at first, it turns out that the quantities we are interested in computing require only working with finite dimensional objects. In fact, answering questions about the process reduces to computing with the related distribution. This is the key to why Gaussian processes are feasible. Let us look at an example. Consider the Gaussian process given by:

f ∼ GP(m, k),   m(x) = ¼ x²,   k(x, x′) = exp(−½ (x − x′)²).   (2)
In order to understand this process we can draw samples from the function
f . In order to work only with finite quantities, we request only the value of f at
a distinct finite number n of locations. How do we generate such samples? Given
the x-values we can evaluate the vector of means and a covariance matrix using
Eq. (2), which defines a regular Gaussian distribution:
µi = m(xi) = ¼ xi²,   i = 1, …, n,   and
Σij = k(xi, xj) = exp(−½ (xi − xj)²),   i, j = 1, …, n,   (3)
where to clarify the distinction between process and distribution we use m and
k for the former and µ and Σ for the latter. We can now generate a random vector from this distribution. This vector will have as coordinates the function values f(x) for the corresponding x's:

f ∼ N(µ, Σ).   (4)
Fig. 1. Function values from three functions drawn at random from a GP as specified in Eq. (2). The dots are the values generated from Eq. (4); the two other curves have (less correctly) been drawn by connecting sampled points. The function values suggest a smooth underlying function; this is in fact a property of GPs with the squared exponential covariance function. The shaded grey area represents the 95% confidence intervals.
We could now plot the values of f as a function of x, see Figure 1. How can
we do this in practice? Below are a few lines of Matlab² used to create the plot:
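The same sampling procedure can also be sketched in Python, using only the standard library (a minimal sketch: the 21-point input grid, the jitter value and all function names are illustrative choices):

```python
import math
import random

def m(x):
    # mean function from Eq. (2): m(x) = x^2 / 4
    return 0.25 * x * x

def k(a, b):
    # squared exponential covariance function from Eq. (2)
    return math.exp(-0.5 * (a - b) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][p] * L[j][p] for p in range(j))
            if i == j:
                L[i][i] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def sample_gp(xs, keps=1e-9):
    """Draw one random function from the GP, evaluated at the locations xs."""
    n = len(xs)
    mu = [m(x) for x in xs]
    # covariance matrix from Eq. (3), plus a tiny jitter keps*I for stability
    Sigma = [[k(xs[i], xs[j]) + (keps if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    L = cholesky(Sigma)
    z = [random.gauss(0.0, 1.0) for _ in range(n)]
    # mu + L z is distributed as N(mu, Sigma), since cov(L z) = L L^T = Sigma
    return [mu[i] + sum(L[i][j] * z[j] for j in range(i + 1)) for i in range(n)]

xs = [-5.0 + 0.5 * i for i in range(21)]   # 21 locations on [-5, 5]
fs = sample_gp(xs)                          # one sampled function, as in Fig. 1
```

The jitter `keps` plays the role described in footnote 3: it bounds the eigenvalues of the covariance matrix away from zero so the Cholesky factorization succeeds.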
In the above example, m and k are the mean and covariance functions; chol is a function to compute the Cholesky decomposition³ of a matrix.
This example has illustrated how we move from process to distribution and
also shown that the Gaussian process defines a distribution over functions. Up
until now, we have only been concerned with random functions – in the next
section we will see how to use the GP framework in a very simple way to make
inferences about functions given some training examples.
² Matlab is a trademark of The MathWorks Inc.
³ We've also added a tiny keps multiple of the identity to the covariance matrix for numerical stability (to bound the eigenvalues numerically away from zero); see the comments around Eq. (8) for an interpretation of this term as a tiny amount of noise.
2 Posterior Gaussian Process

In the previous section we saw how a GP defines a distribution over functions, which we can use as a prior for Bayesian inference; for example, in Figure 1 the function is smooth, and close to a quadratic. The goal of this section is to derive the simple rules of how to update this prior in the light of the training data. The goal of the next section is to attempt to learn about some properties of the prior⁴ in the light of the data.
One of the primary reasons for computing the posterior is that it can be used to make predictions for unseen test cases. Let f be the known function values of
the training cases, and let f∗ be a set of function values corresponding to the
test set inputs, X∗ . Again, we write out the joint distribution of everything we
are interested in:
[ f  ]        ( [ µ  ]   [ Σ     Σ∗  ] )
[ f∗ ]   ∼  N ( [ µ∗ ] , [ Σ∗ᵀ   Σ∗∗ ] ),        (5)
where we've introduced the following shorthand: µi = m(xi), i = 1, …, n for the training means, and analogously for the test means µ∗; for the covariance we use Σ for training set covariances, Σ∗ for training-test set covariances and Σ∗∗ for test set covariances. Since we know the values for the training set f, we are interested in the conditional distribution of f∗ given f, which is expressed as⁵:
f∗ | f ∼ N( µ∗ + Σ∗ᵀ Σ⁻¹ (f − µ),  Σ∗∗ − Σ∗ᵀ Σ⁻¹ Σ∗ ).   (6)
This is the posterior distribution for a specific set of test cases. It is easy to verify (by inspection) that the corresponding posterior process is:

f | D ∼ GP(mD, kD),
mD(x) = m(x) + Σ(X, x)ᵀ Σ⁻¹ (f − m),   (7)
kD(x, x′) = k(x, x′) − Σ(X, x)ᵀ Σ⁻¹ Σ(X, x′),

where Σ(X, x) is the vector of covariances between every training case in X and x.
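As a concrete check of Eqs. (6)–(7), the following sketch conditions the noise-free prior of Eq. (2) on two training points (the inputs and targets below are made-up values, chosen only for illustration). At a training input the noise-free posterior interpolates: the mean reproduces the observation and the variance drops to zero.

```python
import math

def m(x):
    # prior mean from Eq. (2)
    return 0.25 * x * x

def k(a, b):
    # squared exponential covariance from Eq. (2)
    return math.exp(-0.5 * (a - b) ** 2)

# two hypothetical training cases (illustrative values)
X = [-1.0, 1.0]
f = [0.5, 1.0]

# training covariance matrix Sigma and its inverse (closed form for 2x2)
S = [[k(X[0], X[0]), k(X[0], X[1])],
     [k(X[1], X[0]), k(X[1], X[1])]]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[ S[1][1] / det, -S[0][1] / det],
        [-S[1][0] / det,  S[0][0] / det]]

def posterior(x):
    """Posterior mean and variance at a test input x, following Eq. (7)."""
    kx = [k(X[0], x), k(X[1], x)]                                   # Sigma(X, x)
    w = [sum(Sinv[i][j] * kx[j] for j in range(2)) for i in range(2)]  # Sigma^-1 Sigma(X, x)
    mean = m(x) + sum(w[i] * (f[i] - m(X[i])) for i in range(2))
    var = k(x, x) - sum(w[i] * kx[i] for i in range(2))
    return mean, var

mu, var = posterior(-1.0)
# at the training input x = -1: mean -> 0.5 (the observation), variance -> 0
```

This is the behavior noted in footnote 6: without a noise term, the GP model fits the training data exactly.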
⁴ By definition, the prior is independent of the data; here we'll be using a hierarchical prior with free parameters, and make inference about the parameters.
⁵ The formula for conditioning a joint Gaussian distribution is:

[ x ]        ( [ a ]   [ A    C ] )
[ y ]   ∼  N ( [ b ] , [ Cᵀ   B ] )   =⇒   x | y ∼ N( a + C B⁻¹ (y − b),  A − C B⁻¹ Cᵀ ).
⁶ However, it is perhaps interesting that the GP model works also in the noise-free case – this is in contrast to most parametric methods, since they often cannot model the data exactly.
Fig. 2. Three functions drawn at random from the posterior, given 20 training data points, the GP as specified in Eq. (3) and a noise level of σn = 0.7. The shaded area gives the 95% confidence region. Compare with Figure 1 and note that the uncertainty goes down close to the observations.
In Gaussian process models, additive measurement noise is easily taken into account; the effect is that every f(x) has an extra covariance with itself only (since the noise is assumed independent), with a magnitude equal to the noise variance:

k(xi, xi′) = exp(−½ (xi − xi′)²) + σn² δii′,   (8)
where δii′ = 1 iff i = i′ is Kronecker's delta. Notice that the indices of the Kronecker delta refer to the identity of the cases, i, and not the inputs xi; you may have several cases with identical inputs, but the noise on these cases is assumed to be independent. Thus, the covariance function for a noisy process is the sum of the signal covariance and the noise covariance.
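To make the role of the Kronecker delta concrete, here is a small sketch (the inputs and noise level are made-up values): the noise variance is added per case index, so two distinct cases sharing the same input still get independent noise, while their signal covariance is unchanged.

```python
import math

def k_signal(a, b):
    # squared exponential signal covariance, as in Eq. (2)
    return math.exp(-0.5 * (a - b) ** 2)

def k_noisy(i, j, xs, sigma_n=0.7):
    # Eq. (8): noise enters through the case indices i, j (Kronecker delta),
    # not through the input values xs[i], xs[j]
    return k_signal(xs[i], xs[j]) + (sigma_n ** 2 if i == j else 0.0)

xs = [0.0, 0.0, 1.0]   # cases 0 and 1 are distinct cases with identical inputs
K = [[k_noisy(i, j, xs) for j in range(3)] for i in range(3)]
# K[0][1] = 1.0        : same input, different cases -> signal covariance only
# K[0][0] = 1.0 + 0.49 : the noise variance appears on the diagonal only
```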
Now, we can plug the posterior covariance function into the little Matlab example above to draw samples from the posterior process, see Figure 2. In
this section we have shown how simple manipulations with mean and covariance
functions allow updates of the prior to the posterior in the light of the training
data. However, we left some questions unanswered: How do we come up with
mean and covariance functions in the first place? How could we estimate the
noise level? This is the topic of the next section.
3 Training a Gaussian Process

In the previous section we saw how to update the prior to the posterior; here we attempt to learn about properties of the mean and covariance functions in the light of the data. This process will be referred to as training⁷ the GP model.
In the light of typically vague prior information, we use a hierarchical prior, where the mean and covariance functions are parameterized in terms of hyperparameters. For example, we could use a generalization of Eq. (2):

f ∼ GP(m, k),
m(x) = ax² + bx + c,   and   k(x, x′) = σy² exp(−(x − x′)²/(2ℓ²)) + σn² δii′,   (9)

where we have introduced hyperparameters for the mean and for the covariance (the signal variance σy², the lengthscale ℓ and the noise variance σn²).
L = log p(y|x, θ) = −½ log |Σ| − ½ (y − µ)ᵀ Σ⁻¹ (y − µ) − (n/2) log(2π).   (10)
We will call this quantity the log marginal likelihood. We use the term
“marginal” to emphasize that we are dealing with a non-parametric model. See
e.g. [1] for the weight-space view of Gaussian processes which equivalently leads
to Eq. (10) after marginalization over the weights.
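Eq. (10) is straightforward to evaluate once Σ is formed. As a sketch, for two training cases the closed-form 2×2 determinant and inverse suffice (pure Python; the function name is an illustrative choice):

```python
import math

def log_marginal_likelihood(y, mu, S):
    """Eq. (10) for n = 2 cases, using the closed-form 2x2 inverse/determinant."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    Sinv = [[ S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det,  S[0][0] / det]]
    r = [y[0] - mu[0], y[1] - mu[1]]                       # residual y - mu
    quad = sum(r[i] * Sinv[i][j] * r[j]                    # (y-mu)^T Sigma^-1 (y-mu)
               for i in range(2) for j in range(2))
    n = 2
    return -0.5 * math.log(det) - 0.5 * quad - 0.5 * n * math.log(2 * math.pi)

# sanity check: unit covariance and zero residual leave only the
# normalization term, -(n/2) log(2*pi)
L0 = log_marginal_likelihood([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```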
We can now find the values of the hyperparameters which optimize the marginal likelihood based on its partial derivatives, which are easily evaluated:

∂L/∂θm = (y − µ)ᵀ Σ⁻¹ ∂m/∂θm,
∂L/∂θk = −½ trace(Σ⁻¹ ∂Σ/∂θk) + ½ (y − µ)ᵀ Σ⁻¹ (∂Σ/∂θk) Σ⁻¹ (y − µ),   (11)
where θm and θk are used to indicate hyperparameters of the mean and covariance functions respectively. Eq. (11) can conveniently be used in conjunction
⁷ Training the GP model involves both model selection, i.e. the discrete choice between different functional forms for the mean and covariance functions, as well as adaptation of the hyperparameters of these functions; for brevity we will only consider the latter here – the generalization is straightforward, in that marginal likelihoods can be compared.
Fig. 3. Mean and 95% posterior confidence region with parameters learned by maximizing the marginal likelihood, Eq. (10), for the Gaussian process specification in Eq. (9), for the same data as in Figure 2. The hyperparameters found were a = 0.3, b = 0.03, c = −0.7, ℓ = 0.7, σy = 1.1, σn = 0.25. This example was constructed so that the approach without optimization of hyperparameters worked reasonably well (Figure 2), but there is of course no guarantee of this in a typical application.
with a numerical optimization routine such as conjugate gradients to find good⁸ hyperparameter settings.
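A standard sanity check when feeding Eq. (11) to a gradient-based optimizer is to compare the analytic derivative against a finite difference. The sketch below does this for a single hyperparameter, the signal amplitude σy of Eq. (9), writing Σ = σy² K0 + σn² I with a fixed unit-amplitude covariance matrix K0 (the two data points and the noise level are made-up values, and the prior mean is taken to be zero):

```python
import math

X = [-0.5, 0.5]
y = [0.4, -0.2]          # hypothetical observations, zero prior mean
sigma_n = 0.1

# unit-amplitude squared exponential covariance matrix (lengthscale 1)
K0 = [[math.exp(-0.5 * (a - b) ** 2) for b in X] for a in X]

def Sigma(sy):
    return [[sy * sy * K0[i][j] + (sigma_n ** 2 if i == j else 0.0)
             for j in range(2)] for i in range(2)]

def inv2(S):
    d = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / d, -S[0][1] / d], [-S[1][0] / d, S[0][0] / d]], d

def L(sy):
    # Eq. (10) with m = 0, n = 2
    S = Sigma(sy)
    Sinv, d = inv2(S)
    quad = sum(y[i] * Sinv[i][j] * y[j] for i in range(2) for j in range(2))
    return -0.5 * math.log(d) - 0.5 * quad - math.log(2 * math.pi)

def dL(sy):
    # Eq. (11) for theta_k = sigma_y, with dSigma/dsigma_y = 2 sigma_y K0
    S = Sigma(sy)
    Sinv, _ = inv2(S)
    dS = [[2 * sy * K0[i][j] for j in range(2)] for i in range(2)]
    tr = sum(Sinv[i][j] * dS[j][i] for i in range(2) for j in range(2))
    a = [sum(Sinv[i][j] * y[j] for j in range(2)) for i in range(2)]  # Sigma^-1 y
    quad = sum(a[i] * dS[i][j] * a[j] for i in range(2) for j in range(2))
    return -0.5 * tr + 0.5 * quad

eps = 1e-6
fd = (L(1.0 + eps) - L(1.0 - eps)) / (2 * eps)
# dL(1.0) and fd should agree closely if Eq. (11) is implemented correctly
```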
Because the Gaussian process is a non-parametric model, the marginal likelihood behaves somewhat differently from what one might expect from experience with parametric models. Note first that it is in fact very easy for the model to fit the training data exactly: simply set the noise level σn² to zero, and the model produces a mean predictive function which agrees exactly with the training points. However, this is not the typical behavior when optimizing the marginal likelihood. Indeed, the log marginal likelihood from Eq. (10) consists of three terms: The first term, −½ log |Σ|, is a complexity penalty term, which measures and penalizes the complexity of the model. The second term, a negative quadratic, plays the role of a data-fit measure (it is the only term which depends on the training set output values y). The third term is a log normalization term, independent of the data, and not very interesting. Figure 3 illustrates the predictions of a model trained by maximizing the marginal likelihood.
Note that the tradeoff between penalty and data-fit in the GP model is auto-
matic. There is no weighting parameter which needs to be set by some external
method such as cross validation. This is a feature of great practical importance,
since it simplifies training. Figure 4 illustrates how the automatic tradeoff comes
about.
We've seen in this section how we, via a hierarchical specification of the prior, can express prior knowledge in a convenient way, and how we can learn values of hyperparameters via optimization of the marginal likelihood. This can be done using some gradient-based optimization. Also, we've seen how the marginal likelihood automatically embodies Occam's razor, trading off data fit against model complexity.
⁸ Note that for most non-trivial Gaussian processes, optimization over hyperparameters is not a convex problem, so the usual precautions against bad local minima should be taken.
Fig. 4. Schematic of the marginal likelihood P(Y|Mi) as a function of all possible data sets Y, for models that are too simple, "just right" and too complex.
⁹ The covariance function must be positive definite to ensure that the resulting covariance matrix is positive definite.
Acknowledgements
The author was supported by the German Research Council (DFG) through
grant RA 1030/1.
References
1. Williams, C.K.I.: Prediction with Gaussian processes: From linear regression to
linear prediction and beyond. In Jordan, M.I., ed.: Learning in Graphical Models.
Kluwer Academic (1998) 599–621
2. MacKay, D.J.C.: Gaussian processes — a replacement for supervised neural networks? Tutorial lecture notes for NIPS 1997 (1997)
3. Williams, C.K.I., Barber, D.: Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(12) (1998) 1342–1351
4. Csató, L., Opper, M.: Sparse on-line Gaussian processes. Neural Computation 14
(2002) 641–668
5. Neal, R.M.: Regression and classification using Gaussian process priors (with dis-
cussion). In Bernardo, J.M., et al., eds.: Bayesian statistics 6. Oxford University
Press (1998) 475–501