Expressing Structure With Kernels
This chapter shows how to use kernels to build models of functions with many different
kinds of structure: additivity, symmetry, periodicity, interactions between variables,
and changepoints. We also show several ways to encode group invariances into kernels.
Combining a few simple kernels through addition and multiplication will give us a rich,
open-ended language of models.
The properties of kernels discussed in this chapter are mostly known in the literature.
The original contribution of this chapter is to gather them into a coherent whole and
to offer a tutorial showing the implications of different kernel choices, and some of the
structures which can be obtained by combining them.
1.1 Definition
A kernel (also called a covariance function, kernel function, or covariance kernel) is
a positive-definite function of two inputs x and x'. In this chapter, x and x' are usually
vectors in a Euclidean space, but kernels can also be defined on graphs, images, discrete
or categorical inputs, or even text.
Gaussian process models use a kernel to define the prior covariance between any two
function values:

\mathrm{Cov}\big[ f(x), f(x') \big] = k(x, x')
Colloquially, kernels are often said to specify the similarity between two objects. This is
slightly misleading in this context, since what is actually being specified is the similarity
between two values of a function evaluated on each object. The kernel specifies which
functions are likely under the GP prior, which in turn determines the generalization
properties of the model.
[Figure 1.1: The SE, Per, and Lin kernels. Top: plots of k(x, x') as a function of x
(with x' = 1). Middle: functions f(x) sampled from the corresponding GP priors.
Bottom: the type of structure expressed: local variation, repeating structure, and
linear functions.]
Kernel parameters Each kernel has a number of parameters which specify the precise
shape of the covariance function. These are sometimes referred to as hyper-parameters,
since they can be viewed as specifying a distribution over function parameters, instead of
being parameters which specify a function directly. An example would be the lengthscale
parameter of the SE kernel, which specifies the width of the kernel and thereby the
smoothness of the functions in the model.
Stationary and Non-stationary The SE and Per kernels are stationary, meaning
that their value only depends on the difference x - x'. This implies that the probability
of observing a particular dataset remains the same even if we move all the x values by
the same amount. In contrast, the linear kernel (Lin) is non-stationary, meaning that
the corresponding GP model will produce different predictions if the data were moved
while the kernel parameters were kept fixed.
1.3 Combining kernels

What if the kind of structure we need is not expressed by any known kernel? For many
types of structure, it is possible to build a made-to-order kernel with the desired
properties. The next few sections of this chapter will explore ways in which kernels can
be combined to create new ones with different properties. This will allow us to include
as much high-level structure as necessary into our models.
1.3.1 Notation
Below, we will focus on two ways of combining kernels: addition and multiplication. We
will often write these operations in shorthand, without arguments: k_a + k_b stands for
k_a(x, x') + k_b(x, x'), and k_a \times k_b stands for k_a(x, x') \times k_b(x, x').
All of the basic kernels we considered in section 1.2 are one-dimensional, but kernels
over multi-dimensional inputs can be constructed by adding and multiplying between
kernels on different dimensions. The dimension on which a kernel operates is denoted
by a subscripted integer. For example, SE2 represents an SE kernel over the second
dimension of vector x. To remove clutter, we will usually refer to kernels without
specifying their parameters.
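To make this shorthand concrete, here is a minimal NumPy sketch (ours, not the thesis
source code) of three base kernels and of combining them by addition and multiplication.
The parameterizations follow standard conventions and may differ in detail from those of
section 1.2; all names and values are illustrative.

    import numpy as np

    def se(x, xp, ell=1.0, var=1.0):
        """Squared-exponential (SE) kernel."""
        return var * np.exp(-0.5 * ((x - xp) / ell) ** 2)

    def per(x, xp, period=1.0, ell=1.0, var=1.0):
        """Periodic (Per) kernel."""
        return var * np.exp(-2.0 * np.sin(np.pi * (x - xp) / period) ** 2 / ell ** 2)

    def lin(x, xp, c=0.0, var=1.0):
        """Linear (Lin) kernel."""
        return var * (x - c) * (xp - c)

    # Addition and multiplication, written out with their arguments:
    k_sum = lambda x, xp: se(x, xp) + per(x, xp)     # SE + Per
    k_prod = lambda x, xp: se(x, xp) * per(x, xp)    # SE x Per (locally periodic)

    # A kernel acting on single dimensions of a vector input, e.g. SE2 x Lin1:
    k_multi = lambda x, xp: se(x[1], xp[1]) * lin(x[0], xp[0])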
One can multiply any number of kernels together in this way to produce kernels
combining several high-level properties. For example, the kernel SE \times Lin \times Per specifies
a prior on functions which are locally periodic with linearly growing amplitude. We will
see a real dataset having this kind of structure in section 1.11.
For example, multiplying together D one-dimensional SE kernels, one per input dimension,
gives the SE-ARD kernel:

\mathrm{SE\text{-}ARD}(x, x') = \sigma^2 \exp\left( -\sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{2\ell_d^2} \right) = \sigma^2 \prod_{d=1}^{D} \exp\left( -\frac{(x_d - x'_d)^2}{2\ell_d^2} \right)

GP priors with this kernel are flexible enough to model any smooth function of the inputs.
However, this flexibility means that they can sometimes be relatively slow to learn,
due to the curse of dimensionality (Bellman, 1956). In general, the more structure we
account for, the less data we need - the blessing of abstraction (Goodman et al., 2011)
counters the curse of dimensionality. Below, we will investigate ways to encode more
structure into kernels.
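As a small numerical check (ours) that multiplying one-dimensional SE kernels, one per
dimension, yields a multivariate SE kernel with per-dimension lengthscales:

    import numpy as np

    def se_1d(x, xp, ell):
        return np.exp(-0.5 * ((x - xp) / ell) ** 2)

    def se_ard(x, xp, ells):
        return np.exp(-0.5 * np.sum(((x - xp) / ells) ** 2))

    x = np.array([0.3, -1.2, 2.0])
    xp = np.array([1.1, 0.4, 1.7])
    ells = np.array([0.5, 2.0, 1.0])

    # With unit signal variances, the product of 1-D SE kernels equals SE-ARD.
    product_of_1d = np.prod([se_1d(x[d], xp[d], ells[d]) for d in range(3)])
    assert np.isclose(product_of_1d, se_ard(x, xp, ells))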
[Figure: plots of k(x, x') (with x' = 1) and draws from GP priors with additive kernels,
showing structure that is periodic plus trend, periodic plus noise, linear plus variation,
and slow & fast variation.]
1.4 Modeling sums of functions

Suppose two functions are drawn independently from GP priors:

f_a \sim \mathcal{GP}(\mu_a, k_a)    (1.5)
f_b \sim \mathcal{GP}(\mu_b, k_b)    (1.6)
Then the distribution of the sum of those functions is simply another GP:
f_a + f_b \sim \mathcal{GP}(\mu_a + \mu_b, k_a + k_b).    (1.7)
Kernels ka and kb can be of different types, allowing us to model the data as a sum
of independent functions, each possibly representing a different type of structure. Any
number of components can be summed this way.
When modeling functions of multiple dimensions, summing kernels can give rise to addi-
tive structure across different dimensions. To be more precise, if the kernels being added
together are each functions of only a subset of input dimensions, then the implied prior
over functions decomposes in the same way. For example,

k(x_1, x_2, x_1', x_2') = k_1(x_1, x_1') + k_2(x_2, x_2')    (1.8)

corresponds to a prior on functions of the form f(x_1, x_2) = f_1(x_1) + f_2(x_2).
Figure 1.5: A sum of two orthogonal one-dimensional kernels. Top row: An additive
kernel is a sum of kernels. Bottom row: A draw from an additive kernel corresponds to
a sum of draws from independent GP priors, each having the corresponding kernel.
Figure 1.5 illustrates a decomposition of this form. Note that the product of two
kernels does not have an analogous interpretation as the product of two functions.
Figure 1.6: Left: A function with additive structure. Center: A GP with an additive
kernel can extrapolate away from the training data. Right: A GP with a product kernel
allows a different function value for every combination of inputs, and so is uncertain
about function values away from the training data. This causes the predictions to revert
to the mean.
As an example, we built an additive model of concrete strength as a function of eight
inputs (cement, slag, fly ash, water, plasticizer, coarse and fine aggregate, and age),
modeling each input dimension with an SE kernel. These eight SE kernels plus a white
noise kernel were added together as in equation (1.8) to form a single GP model whose
kernel had 9 additive components.
After learning the kernel parameters by maximizing the marginal likelihood of the
data, one can visualize the predictive distribution of each component of the model.
[Figure 1.7: For each input, the data, posterior density, and posterior samples of the
corresponding one-dimensional function of concrete strength.]
Figure 1.7 shows the marginal posterior distribution of each of the eight one-dimensional
functions in the model. The parameters controlling the variance of two of the functions,
f6 (coarse) and f7 (fine), were set to zero, meaning that the marginal likelihood preferred
a parsimonious model which did not depend on these inputs. This is an example of the
automatic sparsity that arises by maximizing marginal likelihood in GP models, and is
another example of automatic relevance determination (ARD) (Neal, 1995).
Learning kernel parameters in this way is much more difficult when using
non-probabilistic methods such as support vector machines (Cortes and Vapnik, 1995),
for which cross-validation is often the best available method for selecting kernel parameters.
Here we derive the posterior variance and covariance of all of the additive components
of a GP. These formulas allow one to make plots such as figure 1.7.
First, we write down the joint prior distribution over two functions drawn indepen-
dently from GP priors, and their sum. We distinguish between f (X) (the function values
at training locations [x_1, x_2, \ldots, x_N]^\mathsf{T} := X) and f(X^\star) (the function values at some
set of query locations [x^\star_1, x^\star_2, \ldots, x^\star_N]^\mathsf{T} := X^\star).

\begin{bmatrix} f_1(X) \\ f_1(X^\star) \\ f_2(X) \\ f_2(X^\star) \\ f_1(X) + f_2(X) \\ f_1(X^\star) + f_2(X^\star) \end{bmatrix}
\sim \mathcal{N}\left(
\begin{bmatrix} \mu_1 \\ \mu_1^\star \\ \mu_2 \\ \mu_2^\star \\ \mu_1 + \mu_2 \\ \mu_1^\star + \mu_2^\star \end{bmatrix},\;
\begin{bmatrix}
K_1 & K_1^\star & 0 & 0 & K_1 & K_1^\star \\
K_1^{\star\mathsf{T}} & K_1^{\star\star} & 0 & 0 & K_1^{\star\mathsf{T}} & K_1^{\star\star} \\
0 & 0 & K_2 & K_2^\star & K_2 & K_2^\star \\
0 & 0 & K_2^{\star\mathsf{T}} & K_2^{\star\star} & K_2^{\star\mathsf{T}} & K_2^{\star\star} \\
K_1 & K_1^\star & K_2 & K_2^\star & K_1 + K_2 & K_1^\star + K_2^\star \\
K_1^{\star\mathsf{T}} & K_1^{\star\star} & K_2^{\star\mathsf{T}} & K_2^{\star\star} & K_1^{\star\mathsf{T}} + K_2^{\star\mathsf{T}} & K_1^{\star\star} + K_2^{\star\star}
\end{bmatrix}
\right)    (1.13)
where we represent the Gram matrices, whose (i, j)th entry is given by k(x_i, x_j), by

K_i = k_i(X, X)    (1.14)
K_i^\star = k_i(X, X^\star)    (1.15)
K_i^{\star\star} = k_i(X^\star, X^\star)    (1.16)
The formula for Gaussian conditionals ?? can be used to give the conditional distri-
bution of a GP-distributed function conditioned on its sum with another GP-distributed
function:
f_1(X^\star) \,\big|\, f_1(X) + f_2(X) \;\sim\;
\mathcal{N}\Big( \mu_1^\star + K_1^{\star\mathsf{T}} (K_1 + K_2)^{-1} \big[ f_1(X) + f_2(X) - \mu_1 - \mu_2 \big],
\;\; K_1^{\star\star} - K_1^{\star\mathsf{T}} (K_1 + K_2)^{-1} K_1^\star \Big)    (1.17)
These formulas express the model's posterior uncertainty about the different components
of the signal, integrating over the possible configurations of the other components. To
extend these formulas to a sum of more than two functions, the term K_1 + K_2 can simply
be replaced by \sum_i K_i everywhere.
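To make the derivation concrete, here is a minimal NumPy sketch (ours, not the thesis
code) of equation (1.17), specialized to zero prior means and SE kernels; the function
and parameter names are illustrative.

    import numpy as np

    def se_kernel(X, Xp, lengthscale=1.0, variance=1.0):
        """Squared-exponential kernel matrix between two sets of 1-D inputs."""
        d = X[:, None] - Xp[None, :]
        return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

    def component_posterior(k1, k2, X, Xstar, f_sum, jitter=1e-8):
        """Posterior mean and covariance of component f1 at the query points Xstar,
        conditioned on observing f1(X) + f2(X), with zero prior means (equation 1.17)."""
        K1, K2 = k1(X, X), k2(X, X)        # K_1, K_2
        K1s = k1(X, Xstar)                 # K_1^star
        K1ss = k1(Xstar, Xstar)            # K_1^star-star
        Ksum = K1 + K2 + jitter * np.eye(len(X))
        mean = K1s.T @ np.linalg.solve(Ksum, f_sum)
        cov = K1ss - K1s.T @ np.linalg.solve(Ksum, K1s)
        return mean, cov

    # Toy example: decompose a slow trend plus fast variation.
    X = np.linspace(0, 10, 50)
    f_sum = 0.05 * X ** 2 + np.sin(3 * X)
    slow = lambda A, B: se_kernel(A, B, lengthscale=5.0)
    fast = lambda A, B: se_kernel(A, B, lengthscale=0.5)
    Xstar = np.linspace(0, 10, 200)
    mean_slow, cov_slow = component_posterior(slow, fast, X, Xstar, f_sum)

The small jitter term keeps the matrix solve numerically stable; in a full model the white
noise component plays this role.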
Figure 1.8: Posterior correlations between the heights of the one-dimensional functions
in equation (1.12), whose sum models concrete strength. Red indicates high correla-
tion, teal indicates no correlation, and blue indicates negative correlation. Plots on the
diagonal show posterior correlations between different values of the same function. Cor-
relations are evaluated over the same input ranges as in figure 1.7. Correlations with
f6 (coarse) and f7 (fine) are not shown, because their estimated variance was zero.
One can also compute the posterior covariance between the height of any two functions,
conditioned on their sum:
\mathrm{Cov}\big[ f_1(X^\star), f_2(X^\star) \,\big|\, f(X) \big] = -K_1^{\star\mathsf{T}} (K_1 + K_2)^{-1} K_2^\star    (1.18)
If this quantity is negative, it means that there is ambiguity about which of the two
functions is high or low at that location. For example, figure 1.8 shows the posterior
correlation between all non-zero components of the concrete model. This figure shows
that most of the correlation occurs within components, but there is also negative corre-
lation between the height of f1 (cement) and f2 (slag).
1.5 Changepoints
An example of how combining kernels can give rise to more structured priors is given by
changepoint kernels, which can express a change between different types of structure.
Changepoint kernels can be defined through addition and multiplication with sigmoidal
functions such as \sigma(x) = 1/(1 + \exp(-x)):

\mathrm{CP}(k_1, k_2)(x, x') = \sigma(x)\, k_1(x, x')\, \sigma(x') + (1 - \sigma(x))\, k_2(x, x')\, (1 - \sigma(x'))    (1.19)

which can be written in shorthand as

\mathrm{CP}(k_1, k_2) = k_1 \sigma + k_2 \bar\sigma    (1.20)

where \sigma = \sigma(x)\sigma(x') and \bar\sigma = (1 - \sigma(x))(1 - \sigma(x')).
Figure 1.9: Draws from GP priors with different changepoint kernels, constructed by
adding and multiplying together base kernels with sigmoidal functions.
We can also build a model of functions whose structure changes only within some
interval (a change-window) by replacing \sigma(x) with a product of two sigmoids, one
increasing and one decreasing.
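Here is a minimal NumPy sketch (ours) of the changepoint construction in equation (1.19).
The base kernels, the sigmoid steepness, and the changepoint location are illustrative
choices, not values from the thesis.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def se(X, Xp, ell=1.0, var=1.0):
        d = X[:, None] - Xp[None, :]
        return var * np.exp(-0.5 * (d / ell) ** 2)

    def cosine(X, Xp, period=1.0):
        # A simple periodic (cosine) kernel, used here as the second component.
        return np.cos(2 * np.pi * (X[:, None] - Xp[None, :]) / period)

    def changepoint(k1, k2, X, Xp, location=0.0, steepness=3.0):
        """CP(k1, k2) as in equation (1.19): k1 dominates where the sigmoid is
        near one (x > location), k2 where it is near zero (x < location)."""
        s = sigmoid(steepness * (X - location))
        sp = sigmoid(steepness * (Xp - location))
        return (s[:, None] * k1(X, Xp) * sp[None, :]
                + (1 - s[:, None]) * k2(X, Xp) * (1 - sp[None, :]))

    # Draw a function that is smooth before x = 0 and periodic after it.
    X = np.linspace(-5, 5, 300)
    K = changepoint(cosine, se, X, X) + 1e-8 * np.eye(len(X))
    f = np.random.multivariate_normal(np.zeros(len(X)), K)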
More generally, we can model an unknown function that has been multiplied by any fixed,
known function a(x) by multiplying the kernel by a(x)a(x'). Formally, if g \sim \mathcal{GP}(0, k(x, x')),
then f(x) = a(x)\, g(x) is distributed as f \sim \mathcal{GP}\big(0, a(x)\, k(x, x')\, a(x')\big).
For example, the squared-exponential kernel (SE) on the real line has a representation
in terms of infinitely many radial-basis functions of the form h_i(x) \propto \exp\!\big( -\tfrac{1}{4\ell^2}(x - c_i)^2 \big).
More generally, any stationary kernel can be represented by a set of sines and cosines - a
Fourier representation (Bochner, 1959). In general, the feature representation of a given
kernel is not necessarily unique (Minh et al., 2006).
In some cases, the input to a kernel, x, can even be the implicit infinite-dimensional
feature mapping of another kernel. Composing feature maps in this way leads to deep
kernels, which are explored in ??.
The link between Gaussian processes, linear regression, and neural networks is explored
further in ??.
If kernels k_a and k_b have feature maps a(x) and b(x), so that k_a(x, x') = a(x)^\mathsf{T} a(x')
and k_b(x, x') = b(x)^\mathsf{T} b(x'), then

k_a(x, x') + k_b(x, x') = a(x)^\mathsf{T} a(x') + b(x)^\mathsf{T} b(x')
= \begin{bmatrix} a(x) \\ b(x) \end{bmatrix}^\mathsf{T} \begin{bmatrix} a(x') \\ b(x') \end{bmatrix}

meaning that the features of k_a + k_b are the concatenation of the features of each kernel.
We can examine kernel multiplication in a similar way:
k_a(x, x')\, k_b(x, x') = \big[ a(x)^\mathsf{T} a(x') \big] \big[ b(x)^\mathsf{T} b(x') \big]    (1.27)
= \sum_i a_i(x) a_i(x') \sum_j b_j(x) b_j(x')    (1.28)
= \sum_{i,j} \big[ a_i(x) b_j(x) \big] \big[ a_i(x') b_j(x') \big]    (1.29)
In words, the features of k_a \times k_b are made up of all pairs of the original two sets of
features. For example, the features of the product of two one-dimensional SE kernels
(SE_1 \times SE_2) cover the plane with two-dimensional radial-basis functions of the form:

h_{ij}(x_1, x_2) \propto \exp\!\left( -\frac{1}{2} \frac{(x_1 - c_i)^2}{2\ell_1^2} \right) \exp\!\left( -\frac{1}{2} \frac{(x_2 - c_j)^2}{2\ell_2^2} \right)    (1.30)
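These identities are easy to verify numerically. The following small sketch (ours) uses two
arbitrary finite-dimensional feature maps to check equations (1.27) to (1.29), along with
the concatenation property of kernel addition:

    import numpy as np

    def a(x):  # an arbitrary 3-dimensional feature map
        return np.array([1.0, x, x ** 2])

    def b(x):  # an arbitrary 2-dimensional feature map
        return np.array([np.sin(x), np.cos(x)])

    x, xp = 0.7, -1.3
    k_a = a(x) @ a(xp)
    k_b = b(x) @ b(xp)

    # Product kernel: features are all pairwise products a_i(x) * b_j(x).
    pair_x = np.outer(a(x), b(x)).ravel()
    pair_xp = np.outer(a(xp), b(xp)).ravel()
    assert np.isclose(k_a * k_b, pair_x @ pair_xp)

    # Sum kernel: features are the concatenation of the two feature vectors.
    concat_x = np.concatenate([a(x), b(x)])
    concat_xp = np.concatenate([a(xp), b(xp)])
    assert np.isclose(k_a + k_b, concat_x @ concat_xp)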
1.7 Expressing symmetries and invariances

Consider the scenario where we have a finite set of transformations of the input space
\{g_1, g_2, \ldots\} to which we wish our function to remain invariant:

f(x) = f(g(x)) \quad \text{for all } x \text{ and all transformations } g \text{ in the set.}
How can we construct a prior on functions which respect these symmetries? Gins-
bourger et al. (2012) and Ginsbourger et al. (2013) showed that the only way to construct
a GP prior on functions which respect a set of invariances is to construct a kernel which
respects the same invariances with respect to each of its two inputs:

k(x, x') = k(g(x), x') = k(x, g(x')) \quad \text{for all } g \text{ in the set.}
Formally, given a finite group G whose elements are operations to which we wish our
function to remain invariant, and f \sim \mathcal{GP}(0, k(x, x')), then every f is invariant under
G (up to a modification) if and only if k(\cdot, \cdot) is argument-wise invariant under G. See
Ginsbourger et al. (2013) for details.
It might not always be clear how to construct a kernel respecting such argument-wise
invariances. Fortunately, there are a few simple ways to do this for any finite group:
1. Sum over the orbit. The orbit of x with respect to a group G is \{g(x) : g \in G\},
the set obtained by applying each element of G to x. Ginsbourger et al. (2012)
and Kondor (2008) suggest enforcing invariances through a double sum over the
orbits of x and x' with respect to G:
\sum_{g \in G} \sum_{g' \in G} k\big(g(x), g'(x')\big)

For the group G_{\mathrm{swap}} that swaps the two inputs of a function of (x_1, x_2), this gives

\sum_{g \in G_{\mathrm{swap}}} \sum_{g' \in G_{\mathrm{swap}}} k\big(g(x_1, x_2), g'(x_1', x_2')\big)
= k(x_1, x_2, x_1', x_2') + k(x_1, x_2, x_2', x_1')
+ k(x_2, x_1, x_1', x_2') + k(x_2, x_1, x_2', x_1')    (1.37)

(A code sketch of this construction appears after this list.)
For stationary kernels, some pairs of elements in this sum will be identical, and
can be ignored. Figure 1.10(left) shows a draw from a GP prior with a product of
SE kernels symmetrized in this way. This construction has the property that the
marginal variance is doubled near x1 = x2 , which may or may not be desirable.
For example, a fundamental domain of the group G_{\mathrm{swap}} is all \{x_1, x_2 : x_1 < x_2\},
a set which can be mapped to using A_{G_{\mathrm{swap}}}(x_1, x_2) = \big[ \min(x_1, x_2), \max(x_1, x_2) \big].
Constructing a kernel using this method introduces a non-differentiable seam
along x1 = x2 , as shown in figure 1.10(center).
3. Multiply over the orbit. Invariances can also be enforced through a double product
over the orbits of x and x' with respect to G:

\prod_{g \in G} \prod_{g' \in G} k\big(g(x), g'(x')\big)
This method can sometimes produce GP priors with zero variance in some regions,
as in figure 1.10(right).
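Here is the code sketch promised above (ours): the sum-over-orbit construction of
equation (1.37) for the swap group, starting from a product of SE kernels with different
lengthscales so that the base kernel is not already symmetric.

    import numpy as np

    def se(x, xp, ell=1.0):
        return np.exp(-0.5 * ((x - xp) / ell) ** 2)

    def k_base(x1, x2, x1p, x2p):
        """A product of SE kernels; not symmetric under swapping (x1, x2)."""
        return se(x1, x1p, ell=1.0) * se(x2, x2p, ell=2.0)

    def k_swap_symmetric(x1, x2, x1p, x2p):
        """Double sum over the orbit of the swap group (equation 1.37)."""
        return (k_base(x1, x2, x1p, x2p) + k_base(x1, x2, x2p, x1p)
                + k_base(x2, x1, x1p, x2p) + k_base(x2, x1, x2p, x1p))

    # The symmetrized kernel is invariant to swapping either pair of inputs,
    # so functions drawn from the corresponding GP satisfy f(x1, x2) = f(x2, x1).
    assert np.isclose(k_swap_symmetric(0.2, 1.5, -0.4, 0.9),
                      k_swap_symmetric(1.5, 0.2, -0.4, 0.9))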
There are often many possible ways to achieve a given symmetry, but we must be careful
to do so without compromising other qualities of the model we are constructing. For
example, simply setting k(x, x') = 0 gives rise to a GP prior which obeys all possible
symmetries, but this is presumably not a model we wish to use.
For example, periodicity corresponds to the invariance

f(x) = f(x + \tau)    (1.40)
achieves the invariance in equation (1.40). Simple algebra reduces this kernel to the
form given in figure 1.1.
This symmetry can be enforced using the sum over orbits method, by the transform
f(x) \mapsto f(x) + f(-x)    (1.44)
As above, translation invariance in one dimension can be achieved by a double sum over
the orbit, given an initial translation-sensitive kernel between signals k:
k_{\mathrm{invariant}}(x, x') = \sum_{i=1}^{D} \sum_{j=1}^{D} k\big(\mathrm{shift}(x, i), \mathrm{shift}(x', j)\big).    (1.46)
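A minimal sketch (ours) of equation (1.46), treating signals as vectors and assuming cyclic
shifts; whether shifts wrap around, and which base kernel to use, are modeling choices not
specified in the text.

    import numpy as np

    def se_between_signals(x, xp, ell=1.0):
        """A translation-sensitive SE kernel between two signals (vectors)."""
        return np.exp(-0.5 * np.sum((x - xp) ** 2) / ell ** 2)

    def shift(x, i):
        """Cyclically shift a signal by i positions."""
        return np.roll(x, i)

    def shift_invariant_kernel(x, xp, base_kernel=se_between_signals):
        """Double sum over all shifts of both signals (equation 1.46)."""
        D = len(x)
        return sum(base_kernel(shift(x, i), shift(xp, j))
                   for i in range(D) for j in range(D))

    # The kernel value is unchanged if either signal is shifted:
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([3.0, 1.0, 0.0, 2.0])
    assert np.isclose(shift_invariant_kernel(x, y),
                      shift_invariant_kernel(shift(x, 2), y))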
Figure 1.11: Generating 2D manifolds with different topologies. By enforcing that the
functions mapping from R2 to R3 obey certain symmetries, the surfaces created have
corresponding topologies, ignoring self-intersections.
Figure 1.11 shows 2D meshes warped into 3D by functions drawn from GP priors
with various kernels, giving rise to different topologies. Higher-dimensional analogues
of these shapes can be constructed by increasing the latent dimension and including
corresponding terms in the kernel. For example, an N-dimensional latent space using
kernel Per_1 \times Per_2 \times \ldots \times Per_N will give rise to a prior on manifolds having the topology
of N-dimensional toruses, ignoring self-intersections.
Figure 1.12: Generating Möbius strips. Left: A function drawn from a GP prior obeying
the symmetries given by equations (1.47) to (1.49). Center: Simply-connected surfaces
mapped from R2 to R3 by functions obeying those symmetries have a topology corre-
sponding to a Möbius strip. Surfaces generated this way do not have the familiar shape
of a flat surface connected to itself with a half-twist. Instead, they tend to look like
Sudanese Möbius strips (Lerner and Asimov, 1984), whose edge has a circular shape.
Right: A Sudanese projection of a Möbius strip. Image adapted from Wikimedia Com-
mons (2005).
A space having the topology of a Möbius strip can be constructed by enforcing invariance
to the following operations (Reid and Szendrői, 2005, chapter 7):
Section 1.7 already showed how to build GP priors invariant to each of these types of
transformations. We'll call a kernel which enforces these symmetries a Möbius kernel.
An example of such a kernel is:
Moving along the diagonal x_1 = x_2 of a function drawn from the corresponding GP prior
is equivalent to moving along the edge of a notional Möbius strip which has had that
Categorical variables are variables which can take values only from a discrete, unordered
set, such as {blue, green, red}. A simple way to construct a kernel over categorical
variables is to represent that variable by a set of binary variables, using a one-of-k
encoding. For example, if x can take one of four values, x {A, B, C, D}, then a one-of-k
encoding of x will correspond to four binary inputs, and one-of-k(C) = [0, 0, 1, 0]. Given
a one-of-k encoding, we can place any multi-dimensional kernel on that space, such as
the SE-ARD:

k(x, x') = \sigma^2 \exp\left( -\sum_{d=1}^{4} \frac{(x_d - x'_d)^2}{2\ell_d^2} \right)
Short lengthscales on any particular dimension of the SE-ARD kernel indicate that the
function value corresponding to that category is uncorrelated with the others. More
flexible parameterizations are also possible (Pinheiro and Bates, 1996).
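A minimal NumPy sketch (ours) of this construction: a one-of-k encoding followed by an
SE-ARD kernel over the resulting binary inputs. Category names and lengthscale values
are illustrative.

    import numpy as np

    CATEGORIES = ["A", "B", "C", "D"]

    def one_of_k(value, categories=CATEGORIES):
        """One-of-k encoding, e.g. one_of_k("C") -> [0, 0, 1, 0]."""
        v = np.zeros(len(categories))
        v[categories.index(value)] = 1.0
        return v

    def se_ard(x, xp, lengthscales, variance=1.0):
        """SE-ARD kernel with one lengthscale per (binary) input dimension."""
        return variance * np.exp(-0.5 * np.sum(((x - xp) / lengthscales) ** 2))

    # A short lengthscale on the third dimension makes category "C" nearly
    # uncorrelated with the other categories.
    ell = np.array([1.0, 1.0, 0.1, 1.0])
    print(se_ard(one_of_k("A"), one_of_k("B"), ell))  # moderate covariance
    print(se_ard(one_of_k("A"), one_of_k("C"), ell))  # close to zero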
Any GP prior can easily be extended to model multiple outputs: f_1(x), f_2(x), \ldots, f_T(x).
This can be done by building a model of a single-output function which has had an extra
input added that denotes the index of the output: f_i(x) = f(x, i). This amounts to
extending the original kernel k(x, x') to have an extra discrete input dimension:
k(x, i, x', i').
A simple and flexible construction of such a kernel multiplies the original kernel
k(x, x') with a categorical kernel on the output index (Bonilla et al., 2007):

k(x, i, x', i') = k(x, x') \times k_{\mathrm{index}}(i, i')
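A minimal sketch (ours) of this product construction, representing the categorical kernel
on the output index by a positive semi-definite matrix B, in the spirit of the inter-task
covariance of Bonilla et al. (2007); all values are illustrative.

    import numpy as np

    def se(x, xp, ell=1.0):
        return np.exp(-0.5 * ((x - xp) / ell) ** 2)

    def multi_output_kernel(x, i, xp, ip, B, base_kernel=se):
        """k(x, i, x', i') = k(x, x') * B[i, i'], where B is a positive
        semi-definite matrix of covariances between the T outputs."""
        return base_kernel(x, xp) * B[i, ip]

    # Two outputs whose underlying functions are strongly positively correlated.
    B = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
    print(multi_output_kernel(0.3, 0, 0.5, 1, B))  # covariance across outputs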
Source code
Source code to produce all figures and examples in this chapter is available at
http://www.github.com/duvenaud/phd-thesis.
References
Mark A. Armstrong, Gérard Iooss, and Daniel D. Joseph. Groups and symmetry.
Springer, 1988. (page 16)
Salomon Bochner. Lectures on Fourier integrals, volume 42. Princeton University Press,
1959. (page 14)
Edwin V. Bonilla, Kian Ming Adam Chai, and Christopher K.I. Williams. Multi-task
Gaussian process prediction. In Advances in Neural Information Processing Systems,
2007. (page 22)
David Ginsbourger, Xavier Bay, Olivier Roustant, and Laurent Carraro. Argumentwise
invariant kernels for the approximation of invariant functions. In Annales de la Faculté
de Sciences de Toulouse, 2012. (page 16)
Trevor J. Hastie and Robert J. Tibshirani. Generalized additive models. Chapman &
Hall/CRC, 1990. (page 9)
Imre Risi Kondor. Group theoretical methods in machine learning. PhD thesis, Columbia
University, 2008. (pages 16 and 19)
Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time
series. The handbook of brain theory and neural networks, 3361, 1995. (page 19)
Doug Lerner and Dan Asimov. The Sudanese Möbius band. In SIGGRAPH Electronic
Theatre, 1984. (pages 21 and 22)
David J.C. MacKay. Introduction to Gaussian processes. NATO ASI Series F Computer
and Systems Sciences, 168:133–166, 1998. (page 18)
James Mercer. Functions of positive and negative type, and their connection with the
theory of integral equations. Philosophical Transactions of the Royal Society of Lon-
don. Series A, Containing Papers of a Mathematical or Physical Character, pages
415–446, 1909. (page 14)
Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of
Machine Learning Research, 7:2651–2667, 2006. (page 5)
Ha Quang Minh, Partha Niyogi, and Yuan Yao. Mercer's theorem, feature maps, and
smoothing. In Learning theory, pages 154–168. Springer, 2006. (page 14)
Radford M. Neal. Bayesian learning for neural networks. PhD thesis, University of
Toronto, 1995. (page 10)
Carl E. Rasmussen and Hannes Nickisch. Gaussian processes for machine learning
(GPML) toolbox. Journal of Machine Learning Research, 11:3011–3015, December
2010. (page 23)
Miles A. Reid and Balázs Szendrői. Geometry and topology. Cambridge University Press,
2005. (page 21)