An Intuitive Tutorial to Gaussian Processes Regression

arXiv:2009.10862v3 [stat.ML]

Jie Wang
[email protected]
Offroad Robotics, c/o Ingenuity Labs Research Institute
Queen's University, Kingston, ON K7L 3N6, Canada

February 3, 2021
Abstract
Contents

1 Introduction
2 Mathematical Basics
   2.1 Gaussian Distribution
   2.2 Multivariate Normal Distribution
   2.3 Kernels
   2.4 Non-parametric Model
3 Gaussian Processes
4 Illustrative example
   4.1 Hyperparameters Optimization
   4.2 Gaussian processes packages
A Appendix
1 Introduction
The Gaussian processes model is a probabilistic supervised machine learning framework that has been widely used for regression and classification tasks. A Gaussian processes regression (GPR) model can make predictions incorporating prior knowledge (kernels) and provide uncertainty measures over those predictions [11]. The Gaussian processes model is a supervised learning method developed by the computer science and statistics communities. Researchers with engineering backgrounds often find it difficult to gain a clear understanding of it. Understanding GPR, even only the basics, requires knowledge of the multivariate normal distribution, kernels, non-parametric models, and joint and conditional probability.

In this tutorial, we present a concise and accessible explanation of GPR. We first review the mathematical concepts that GPR models are built on to make sure readers have enough basic knowledge. Plots are used throughout to provide an intuitive understanding of GPR. The code developed to generate the plots is provided at https://github.com/jwangjie/Gaussian-Processes-Regression-Tutorial.
2 Mathematical Basics
This section reviews the basic concepts needed to understand GPR. We start with the Gaussian (normal) distribution, then explain the multivariate normal distribution (MVN), kernels, non-parametric models, and joint and conditional probability. In regression, given some observed data points, we want to fit a function to represent these data points and then use that function to make predictions at new data points. For the set of observed data points shown in Fig. 1(a), there are an infinite number of possible functions that fit them. In Fig. 1(b), we show five sample functions that fit the data points. In GPR, Gaussian processes conduct regression by defining a distribution over this infinite number of functions [8].
Figure 1: A regression example: (a) the observed data points, (b) five sample functions that fit the observed data points.
2.1 Gaussian Distribution

Here, $X$ represents the random variable and $x$ is the real argument. The normal distribution of $X$ is usually represented by $P_X(x) \sim \mathcal{N}(\mu, \sigma^2)$. The PDF of a uni-variate normal (Gaussian) distribution is plotted in Fig. 2. We randomly generated 1000 points from a uni-variate normal distribution and plotted them on the x axis.
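As a minimal sketch of how such a figure can be generated (using numpy, scipy, and matplotlib; the variable names and plotting details are illustrative, not taken from the tutorial's repository):

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Draw 1000 samples from a standard uni-variate normal distribution N(0, 1)
mu, sigma = 0.0, 1.0
samples = np.random.normal(mu, sigma, 1000)

# Evaluate the PDF on a grid to draw the bell curve
x = np.linspace(-4, 4, 200)
pdf = norm.pdf(x, loc=mu, scale=sigma)

plt.plot(x, pdf, label='PDF of N(0, 1)')
# Show the sampled points as short vertical bars on the x axis
plt.plot(samples, np.zeros_like(samples), '|', color='red', markersize=10)
plt.xlabel('x')
plt.ylabel('$P_X(x)$')
plt.legend()
plt.show()
```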
Figure 2: One thousand normally distributed data points plotted as red vertical bars on the x axis. The PDF of these data points is plotted as a two-dimensional bell curve.
These randomly generated data points can be expressed as a vector $\mathbf{x}_1 = [x_1^1, x_1^2, \ldots, x_1^n]$. By plotting the vector $\mathbf{x}_1$ on a new $Y$ axis at $Y = 0$, we project the points $[x_1^1, x_1^2, \ldots, x_1^n]$ into another space, shown in Fig. 3. We did nothing but plot the points of the vector $\mathbf{x}_1$ vertically in a new $Y, x$ coordinate space. We can plot another independent Gaussian vector in the same way, as shown in Fig. 3.
Figure 3: Two independent uni-variate Gaussian vectors plotted vertically in the Y, x coordinate space.
Figure 4: (a) Two Gaussian vectors and (b) twenty Gaussian vectors plotted in the Y, x coordinate space.
Figure 5: Visualization of the PDF of a BVN: (a) a 3-d bell curve whose height represents the probability density, (b) ellipse contour projections showing the correlation between x1 and x2.
Here, $\mu_1$ and $\mu_2$ are the means of $x_1$ and $x_2$ respectively. The covariance matrix is $\begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}$, where the diagonal terms $\sigma_{11}$ and $\sigma_{22}$ are the independent variances of $x_1$ and $x_2$ respectively. The off-diagonal terms $\sigma_{12}$ and $\sigma_{21}$ represent the correlations between $x_1$ and $x_2$. The BVN is expressed as

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix} \right) \sim \mathcal{N}(\mu, \Sigma) .$$
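To make the BVN concrete, here is a small sketch (with illustrative numbers of our own, not from the paper) that draws samples from a bivariate normal with a chosen mean vector and covariance matrix; the scatter of the samples follows the ellipse contours in Fig. 5(b):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative parameters of a bivariate normal N(mu, Sigma)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])  # off-diagonal terms encode the correlation between x1 and x2

# Draw correlated samples from the BVN
samples = np.random.multivariate_normal(mu, Sigma, size=1000)

plt.scatter(samples[:, 0], samples[:, 1], s=5, alpha=0.4)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.axis('equal')
plt.show()
```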
2.3 Kernels
After reviewing the MVN concept, we want to smooth the functions in Fig. 4(b).
This can be done by defining covariance functions, which reflect our ability to ex-
press prior knowledge of the form of the function we are modelling. In regression,
outputs of the function should be similar when two inputs are close to each other. One possible way to capture this is the dot product $\mathbf{A} \cdot \mathbf{B} = \|\mathbf{A}\|\|\mathbf{B}\|\cos\theta$, where $\theta$ is
the angle between two input vectors. When two input vectors are similar, their dot
product output value is high.
If a function is defined solely in terms of inner products in the input space, then the function $k(x, x')$ is a kernel function [11]. The most widely used kernel or covariance function is the squared exponential (SE) kernel. The SE kernel is the de facto default kernel for Gaussian processes [5]: it can be integrated against most functions that you need to, due to its universal property, and every function in its prior has infinitely many derivatives. It is also known as the radial basis function (RBF) kernel or Gaussian kernel, defined as

$$\mathrm{cov}(x_i, x_j) = \exp\!\left(-\frac{(x_i - x_j)^2}{2}\right).$$
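To connect this kernel to smooth random functions, the sketch below builds a covariance matrix with the RBF kernel over 20 evenly spaced inputs and draws samples from the resulting 20-variate normal prior; the function and variable names are our own, not from the tutorial's code:

```python
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(x1, x2):
    """Squared exponential (RBF) kernel: cov(xi, xj) = exp(-(xi - xj)^2 / 2)."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2  # pairwise squared distances
    return np.exp(-0.5 * sqdist)

# 20 evenly spaced inputs and the corresponding 20x20 covariance matrix
X = np.linspace(0, 1, 20)
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))  # small jitter for numerical stability

# Draw 10 sample functions from the 20-variate normal prior N(0, K)
samples = np.random.multivariate_normal(np.zeros(len(X)), K, size=10)

plt.plot(X, samples.T)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
```

Samples drawn with an identity covariance look like uncorrelated noise, while samples drawn with the RBF covariance vary smoothly, as the next figure illustrates.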
Figure: Ten samples of the 20-VN prior with (a) an identity kernel and (b) an RBF kernel.
3 Gaussian Processes
Before explaining the Gaussian process, let's do a quick review of the basic concepts we have covered. In regression, there is an unknown function f that we are trying to model given observed data points D (the training dataset). Traditional nonlinear regression methods typically give a single function that is considered to fit the dataset best. However, there may be more than one function that fits the observed data points equally well. We saw that when the dimension of an MVN is infinite, we can make predictions at any point with this infinite number of functions. These functions are MVN distributed because that is our assumption (the prior). More formally, the prior distribution of these infinite functions is MVN. The prior distribution represents the expected outputs of f over inputs x before observing any data. When we start to have observations, instead of an infinite number of functions, we keep only the functions that fit the observed data points. Now we have the posterior, which is the prior updated with observed data. When we have new observations, we use the current posterior as the prior and the newly observed data points to obtain a new posterior.
Here is a definition of Gaussian processes: a Gaussian processes model is a probability distribution over possible functions that fit a set of points. Because we have a probability distribution over all possible functions, we can calculate the mean as the regression function and the variance to indicate how confident the predictions are. The key points are summarized as: 1) the function (posterior) updates with new observations; 2) a Gaussian process model is a probability distribution over possible functions, and any finite sample of functions is jointly Gaussian distributed; 3) the mean function calculated from the posterior distribution of possible functions is the function used for regression predictions.
It is time to describe the standard Gaussian processes model. All the parameter definitions follow the classic textbook [11]. Besides the basic concepts covered here, Appendix A.1 and A.2 of [11] are also recommended reading. The regression function modeled by a multivariate Gaussian is given as

$$P(\mathbf{f} \mid X) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, K),$$

where $X$ are the observed inputs, $\mathbf{f}$ is the vector of latent function values, $\boldsymbol{\mu}$ is the mean vector, and $K$ is the kernel (covariance) matrix with $K_{ij} = k(x_i, x_j)$.
By deriving the conditional distribution, we get the predictive equations for Gaussian processes regression as

$$\mathbf{f}_* \mid X, \mathbf{y}, X_* \sim \mathcal{N}\!\left(\bar{\mathbf{f}}_*, \mathrm{cov}(\mathbf{f}_*)\right),$$

where

$$\bar{\mathbf{f}}_* \triangleq \mathbb{E}[\mathbf{f}_* \mid X, \mathbf{y}, X_*] = K_*^{\top}[K + \sigma_n^2 I]^{-1}\mathbf{y},$$
$$\mathrm{cov}(\mathbf{f}_*) = K_{**} - K_*^{\top}[K + \sigma_n^2 I]^{-1}K_* .$$
Note that the predictive variance $\mathrm{cov}(\mathbf{f}_*)$ does not depend on the observed outputs $\mathbf{y}$ but only on the inputs $X$ and $X_*$. This is a property of the Gaussian distribution [11].
4 Illustrative example
This section shows an implementation of the standard GPR. The example follows the algorithm in [11]:
$$L = \mathrm{cholesky}(K + \sigma_n^2 I)$$
$$\boldsymbol{\alpha} = L^{\top} \backslash (L \backslash \mathbf{y})$$
$$\bar{f}_* = \mathbf{k}_*^{\top} \boldsymbol{\alpha}$$
$$\mathbf{v} = L \backslash \mathbf{k}_*$$
$$\mathbb{V}[f_*] = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{v}^{\top}\mathbf{v}$$
$$\log p(\mathbf{y} \mid X) = -\tfrac{1}{2}\mathbf{y}^{\top}(K + \sigma_n^2 I)^{-1}\mathbf{y} - \tfrac{1}{2}\log\det(K + \sigma_n^2 I) - \tfrac{n}{2}\log 2\pi$$
The inputs are $X$ (inputs), $\mathbf{y}$ (targets), $k$ (covariance function), $\sigma_n^2$ (noise level), and $\mathbf{x}_*$ (test input). The outputs are $\bar{f}_*$ (mean), $\mathbb{V}[f_*]$ (variance), and $\log p(\mathbf{y} \mid X)$ (log marginal likelihood).
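A compact numpy sketch of this algorithm is given below. The RBF kernel, the noise level, and the toy data are our own illustrative choices; the paper's repository may differ in its details:

```python
import numpy as np

def rbf_kernel(A, B, sigma_f=1.0, l=1.0):
    """Squared exponential kernel matrix between input vectors A and B."""
    sqdist = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sqdist / l**2)

def gpr_predict(X, y, X_star, sigma_n=0.1):
    """Cholesky-based GPR prediction following the algorithm above."""
    K = rbf_kernel(X, X)            # K(X, X)
    K_star = rbf_kernel(X, X_star)  # K(X, X*)
    K_ss = rbf_kernel(X_star, X_star)

    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sigma_n^2 I)^{-1} y
    f_bar = K_star.T @ alpha                             # predictive mean
    v = np.linalg.solve(L, K_star)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)           # predictive variance
    log_ml = (-0.5 * y @ alpha                           # log marginal likelihood
              - np.sum(np.log(np.diag(L)))
              - 0.5 * len(X) * np.log(2 * np.pi))
    return f_bar, var, log_ml

# Toy usage: noisy observations of a sine function on [-5, 5]
X = np.random.uniform(-5, 5, 20)
y = np.sin(X) + 0.1 * np.random.randn(20)
X_star = np.linspace(-5, 5, 100)
mean, var, log_ml = gpr_predict(X, y, X_star)
```

The log marginal likelihood is computed here via the Cholesky factor, using $\tfrac{1}{2}\log\det(K+\sigma_n^2 I) = \sum_i \log L_{ii}$, which is equivalent to the expression above.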
An example result is shown in Fig. 10. We conducted regression within the region of interest [-5, 5]. The observed data points (training dataset) were generated from a uniform distribution between -5 and 5, so any point within this interval was equally likely to be drawn. The functions were evaluated at evenly spaced points between -5 and 5. The regression function is composed of the mean values estimated by a GPR model. Twenty samples of posterior functions were also plotted within the 3σ variance band.
Figure 10: An illustrative example of the standard GPR. The observed data points, generated by the blue dotted line (the true function), are plotted as black crosses. Given the observed/training data points, infinitely many possible posterior functions were obtained. We plotted 20 samples of these infinite functions with sorted colors. The mean function obtained from the probability distribution of these functions is plotted as a solid red line. The blue shaded area around the mean function indicates the 3σ prediction variance.
4.1 Hyperparameters Optimization

The SE kernel used here, with hyperparameters included, is

$$k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\!\left(-\frac{1}{2l^2}(\mathbf{x}_i - \mathbf{x}_j)^{\top}(\mathbf{x}_i - \mathbf{x}_j)\right),$$
where $\sigma_f$ and $l$ are hyperparameters [5]. The vertical scale $\sigma_f$ describes how much the function can span vertically. The horizontal scale $l$ indicates how quickly the correlation between two points drops as their distance increases. The effect of $l$ is shown in Fig. 11: a larger $l$ gives a smoother function, while a smaller $l$ gives a wigglier function.
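The following sketch reproduces this effect qualitatively by sampling prior functions for several values of $l$; the specific values are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

def kernel(A, B, sigma_f=1.0, l=1.0):
    """RBF kernel with vertical scale sigma_f and horizontal scale l."""
    sqdist = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sqdist / l**2)

X = np.linspace(0, 1, 100)
for l in [0.05, 0.2, 1.0]:  # small, medium, and large horizontal scales
    K = kernel(X, X, sigma_f=1.0, l=l) + 1e-8 * np.eye(len(X))
    # Three prior sample functions per lengthscale: smaller l gives wigglier curves
    f = np.random.multivariate_normal(np.zeros(len(X)), K, size=3)
    plt.plot(X, f.T)
plt.xlabel('x')
plt.show()
```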
Figure 11: The effect of the horizontal scale $l$ on function smoothness: (a) small $l$, (b) medium $l$, (c) large $l$.
Note that after learning/tuning the hyperparameters, the predictive variance $\mathrm{cov}(\mathbf{f}_*)$ depends not only on the inputs $X$ and $X_*$ but also on the outputs $\mathbf{y}$ [1]. With the optimized hyperparameters $\sigma_f = 0.0067$ and $l = 0.0967$, the regression result for the observed data points in Fig. 11 is shown in Fig. 12. Here, the hyperparameter optimization was conducted with the GPy package, which will be introduced in the next section.
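As a minimal sketch of how such an optimization can be set up with GPy (the kernel choice and the toy data here are illustrative; the exact calls used for the paper's figures may differ):

```python
import numpy as np
import GPy

# Toy 1-D training data (illustrative)
X = np.random.uniform(-5, 5, (20, 1))
y = np.sin(X) + 0.1 * np.random.randn(20, 1)

# RBF kernel with trainable variance (sigma_f^2) and lengthscale (l)
kernel = GPy.kern.RBF(input_dim=1, variance=1.0, lengthscale=1.0)
model = GPy.models.GPRegression(X, y, kernel)

# Tune the hyperparameters by maximizing the log marginal likelihood
model.optimize(messages=False)
print(model)  # displays the optimized variance, lengthscale, and noise variance

# Predictive mean and variance at test inputs
X_star = np.linspace(-5, 5, 100).reshape(-1, 1)
mean, var = model.predict(X_star)
```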
References
[1] Zexun Chen and Bo Wang. How priors of initial hyperparameters affect Gaus-
sian process regression models. Neurocomputing, 275:1702–1710, 2018.
[3] Alexander G De G. Matthews, Mark Van Der Wilk, Tom Nickson, Keisuke Fu-
jii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James
Hensman. GPflow: A Gaussian process library using TensorFlow. The Journal
of Machine Learning Research, 18(1):1299–1304, 2017.
[5] David Duvenaud. Automatic model construction with Gaussian processes. PhD
thesis, University of Cambridge, 2014.
[6] Roger Frigola, Fredrik Lindsten, Thomas B Schön, and Carl Edward Ras-
mussen. Bayesian Inference and Learning in Gaussian Process State-Space
Models with Particle MCMC. In Advances in Neural Information Processing Sys-
tems, pages 3156–3164, 2013.
[7] Jacob R Gardner, Geoff Pleiss, David Bindel, Kilian Q Weinberger, and An-
drew Gordon Wilson. GPyTorch: Blackbox Matrix-Matrix Gaussian Process
Inference with GPU Acceleration. In Advances in Neural Information Processing
Systems, 2018.
[8] Zoubin Ghahramani. A Tutorial on Gaussian Processes (or why I don’t use
SVMs). In Machine Learning Summer School (MLSS), 2011.
[9] Haitao Liu, Yew-Soon Ong, Xiaobo Shen, and Jianfei Cai. When Gaussian
process meets big data: A review of scalable GPs. IEEE Transactions on Neural
Networks and Learning Systems, 2020.
[10] Kevin P Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press,
2012.
A Appendix
The Marginal and Conditional Distributions of MVN theorem: suppose $X = (x_1, x_2)$ is a joint Gaussian with parameters

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}, \quad \Lambda = \Sigma^{-1} = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix}.$$

Then the marginals are given by

$$p(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_{11}),$$
$$p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22}),$$

and the posterior conditional is given by

$$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2}),$$
$$\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2) = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(x_2 - \mu_2) = \Sigma_{1|2}\left(\Lambda_{11}\mu_1 - \Lambda_{12}(x_2 - \mu_2)\right),$$
$$\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Lambda_{11}^{-1}.$$
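A quick numerical check of the conditional formulas above, using a made-up 2×2 covariance matrix with scalar blocks:

```python
import numpy as np

# Illustrative joint Gaussian over (x1, x2) with scalar blocks
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

mu1, mu2 = mu
S11, S12, S21, S22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]

x2 = 2.0  # observed value of x2

# Conditional p(x1 | x2) from the theorem
mu_cond = mu1 + S12 / S22 * (x2 - mu2)
Sigma_cond = S11 - S12 / S22 * S21

# Cross-check with the precision-matrix form: Sigma_{1|2} = Lambda_{11}^{-1}
Lam = np.linalg.inv(Sigma)
assert np.isclose(Sigma_cond, 1.0 / Lam[0, 0])
print(mu_cond, Sigma_cond)
```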