
Chapter 1

Expressing Structure with Kernels

This chapter shows how to use kernels to build models of functions with many different
kinds of structure: additivity, symmetry, periodicity, interactions between variables,
and changepoints. We also show several ways to encode group invariants into kernels.
Combining a few simple kernels through addition and multiplication will give us a rich,
open-ended language of models.
The properties of kernels discussed in this chapter are mostly known in the literature.
The original contribution of this chapter is to gather them into a coherent whole and
to offer a tutorial showing the implications of different kernel choices, and some of the
structures which can be obtained by combining them.

1.1 Definition
A kernel (also called a covariance function, kernel function, or covariance kernel) is
a positive-definite function of two inputs x and x'. In this chapter, x and x' are usually
vectors in a Euclidean space, but kernels can also be defined on graphs, images, discrete
or categorical inputs, or even text.
Gaussian process models use a kernel to define the prior covariance between any two
function values:

Cov[f(x), f(x')] = k(x, x')    (1.1)

Colloquially, kernels are often said to specify the similarity between two objects. This is
slightly misleading in this context, since what is actually being specified is the similarity
between two values of a function evaluated on each object. The kernel specifies which
functions are likely under the GP prior, which in turn determines the generalization
properties of the model.

1.2 A few basic kernels


To begin understanding the types of structures expressible by GPs, we will start by
briefly examining the priors on functions encoded by some commonly used kernels: the
squared-exponential (SE), periodic (Per), and linear (Lin) kernels. These kernels are
defined in figure 1.1.

Kernel name:  Squared-exp (SE)   |   Periodic (Per)   |   Linear (Lin)

k_{SE}(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)
k_{Per}(x, x') = \sigma_f^2 \exp\left(-\frac{2}{\ell^2}\sin^2\left(\pi \frac{x - x'}{p}\right)\right)
k_{Lin}(x, x') = \sigma_f^2 (x - c)(x' - c)

Type of structure:  local variation (SE)   |   repeating structure (Per)   |   linear functions (Lin)

[Each column of the figure plots k(x, x') as a function of x - x' (or of x, with x' = 1, for Lin), together with functions f(x) sampled from the corresponding GP prior.]

Figure 1.1: Examples of structures expressible by some basic kernels.
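To make these definitions concrete, the following sketch (not part of the original text; the function names, parameter defaults, and jitter value are illustrative choices) implements the three basic kernels and draws functions from the corresponding zero-mean GP priors via a Cholesky factorization of the Gram matrix.

```python
# A minimal sketch, assuming the kernel forms of figure 1.1; all defaults are arbitrary.
import numpy as np

def se_kernel(x, xp, sigma_f=1.0, lengthscale=1.0):
    """Squared-exponential: local, infinitely differentiable variation."""
    return sigma_f**2 * np.exp(-(x - xp)**2 / (2 * lengthscale**2))

def per_kernel(x, xp, sigma_f=1.0, lengthscale=1.0, period=1.0):
    """Periodic: exactly repeating structure with the given period."""
    return sigma_f**2 * np.exp(-2 * np.sin(np.pi * (x - xp) / period)**2
                               / lengthscale**2)

def lin_kernel(x, xp, sigma_f=1.0, c=0.0):
    """Linear: draws are straight lines passing through x = c."""
    return sigma_f**2 * (x - c) * (xp - c)

def sample_gp_prior(kernel, xs, n_samples=3, jitter=1e-8):
    """Draw functions from GP(0, kernel), evaluated on the grid xs."""
    K = kernel(xs[:, None], xs[None, :])          # Gram matrix
    K += jitter * np.eye(len(xs))                 # numerical stability
    L = np.linalg.cholesky(K)
    return L @ np.random.randn(len(xs), n_samples)

xs = np.linspace(-3, 3, 200)
for k in (se_kernel, per_kernel, lin_kernel):
    draws = sample_gp_prior(k, xs)
    print(k.__name__, draws.shape)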

Each covariance function corresponds to a different set of assumptions made about
the function we wish to model. For example, using a squared-exp (SE) kernel implies that
the function we are modeling has infinitely many derivatives. There exist many variants
of local kernels similar to the SE kernel, each encoding slightly different assumptions
about the smoothness of the function being modeled.

Kernel parameters Each kernel has a number of parameters which specify the precise
shape of the covariance function. These are sometimes referred to as hyper-parameters,
since they can be viewed as specifying a distribution over function parameters, instead of
being parameters which specify a function directly. An example would be the lengthscale
parameter of the SE kernel, which specifies the width of the kernel and thereby the
smoothness of the functions in the model.

Stationary and Non-stationary The SE and Per kernels are stationary, meaning
that their value only depends on the difference x - x'. This implies that the probability
of observing a particular dataset remains the same even if we move all the x values by
the same amount. In contrast, the linear kernel (Lin) is non-stationary, meaning that
the corresponding GP model will produce different predictions if the data were moved
while the kernel parameters were kept fixed.

1.3 Combining kernels

What if the kind of structure we need is not expressed by any known kernel? For many
types of structure, it is possible to build a made-to-order kernel with the desired
properties. The next few sections of this chapter will explore ways in which kernels can
be combined to create new ones with different properties. This will allow us to include
as much high-level structure as necessary into our models.

1.3.1 Notation

Below, we will focus on two ways of combining kernels: addition and multiplication. We
will often write these operations in shorthand, without arguments:

k_a + k_b = k_a(x, x') + k_b(x, x')    (1.2)

k_a \times k_b = k_a(x, x') \times k_b(x, x')    (1.3)

All of the basic kernels we considered in section 1.2 are one-dimensional, but kernels
over multi-dimensional inputs can be constructed by adding and multiplying between
kernels on different dimensions. The dimension on which a kernel operates is denoted
by a subscripted integer. For example, SE2 represents an SE kernel over the second
dimension of vector x. To remove clutter, we will usually refer to kernels without
specifying their parameters.

Lin \times Lin        SE \times Per        Lin \times SE        Lin \times Per

quadratic functions   |   locally periodic   |   increasing variation   |   growing amplitude

[Each column plots the product kernel and functions sampled from the corresponding GP prior.]

Figure 1.2: Examples of one-dimensional structures expressible by multiplying kernels.
Plots have same meaning as in figure 1.1.

1.3.2 Combining properties through multiplication


Multiplying two positive-definite kernels together always results in another positive-
definite kernel. But what properties do these new kernels have? Figure 1.2 shows some
kernels obtained by multiplying two basic kernels together.
Working with kernels, rather than the parametric form of the function itself, allows
us to express high-level properties of functions that do not necessarily have a simple
parametric form. Here, we discuss a few examples:

Polynomial Regression. By multiplying together T linear kernels, we obtain a
prior on polynomials of degree T. The first column of figure 1.2 shows a quadratic
kernel.

Locally Periodic Functions. In univariate data, multiplying a kernel by SE
gives a way of converting global structure to local structure. For example, Per
corresponds to exactly periodic structure, whereas Per \times SE corresponds to locally
periodic structure, as shown in the second column of figure 1.2.

Functions with Growing Amplitude. Multiplying by a linear kernel means
that the marginal standard deviation of the function being modeled grows linearly
away from the location given by kernel parameter c. The third and fourth columns
of figure 1.2 show two examples.

One can multiply any number of kernels together in this way to produce kernels
combining several high-level properties. For example, the kernel SE \times Lin \times Per specifies
a prior on functions which are locally periodic with linearly growing amplitude. We will
see a real dataset having this kind of structure in section 1.11.
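As a rough illustration of such products (a sketch, not code from this chapter; the helper names and parameter values are my own), the following code builds product kernels such as SE \times Per and SE \times Lin \times Per and samples from the resulting priors.

```python
# A hedged sketch of kernel multiplication; lengthscales, period, and c are arbitrary.
import numpy as np

def se(x, xp, ell=1.0):        return np.exp(-(x - xp)**2 / (2 * ell**2))
def per(x, xp, ell=1.0, p=1.0): return np.exp(-2 * np.sin(np.pi * (x - xp) / p)**2 / ell**2)
def lin(x, xp, c=0.0):          return (x - c) * (xp - c)

def product_kernel(*ks):
    """Pointwise product of kernels: the result is still positive definite."""
    return lambda x, xp: np.prod([k(x, xp) for k in ks], axis=0)

def sample_gp(kernel, xs, n=3, jitter=1e-8):
    K = kernel(xs[:, None], xs[None, :]) + jitter * np.eye(len(xs))
    return np.linalg.cholesky(K) @ np.random.randn(len(xs), n)

xs = np.linspace(-3, 3, 200)
locally_periodic = product_kernel(se, per)          # Per x SE
growing_amplitude = product_kernel(se, lin, per)    # SE x Lin x Per
print(sample_gp(growing_amplitude, xs).shape)       # amplitude grows away from c = 0
```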

1.3.3 Building multi-dimensional models


A flexible way to model functions having more than one input is to multiply together
kernels defined on each individual input. For example, a product of SE kernels over
different dimensions, each having a different lengthscale parameter, is called the SE-ARD
kernel:
SE-ARD(x, x') = \prod_{d=1}^{D} \sigma_d^2 \exp\left(-\frac{(x_d - x_d')^2}{2\ell_d^2}\right) = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{d=1}^{D} \frac{(x_d - x_d')^2}{\ell_d^2}\right)    (1.4)

Figure 1.3 illustrates the SE-ARD kernel in two dimensions.

[Figure 1.3 panels: SE_1(x_1, x_1'), SE_2(x_2, x_2'), their product SE_1 \times SE_2, and a function f(x_1, x_2) drawn from GP(0, SE_1 \times SE_2).]

Figure 1.3: A product of two one-dimensional kernels gives rise to a prior on functions
which depend on both dimensions.
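A minimal sketch of equation (1.4) follows; the function name se_ard and its defaults are my own, and the lengthscales below are chosen only to illustrate how a long lengthscale down-weights a dimension.

```python
# Sketch of the SE-ARD kernel: a product of one-dimensional SE kernels, one per dimension.
import numpy as np

def se_ard(X, Xp, sigma_f=1.0, lengthscales=None):
    """X: (N, D), Xp: (M, D). Returns the (N, M) Gram matrix."""
    D = X.shape[1]
    ell = np.ones(D) if lengthscales is None else np.asarray(lengthscales)
    # Squared distances, scaled per dimension by the corresponding lengthscale.
    diff = (X[:, None, :] - Xp[None, :, :]) / ell
    return sigma_f**2 * np.exp(-0.5 * np.sum(diff**2, axis=2))

# A long lengthscale on the second dimension means little variation along it.
X = np.random.randn(5, 2)
K = se_ard(X, X, lengthscales=[0.5, 10.0])
print(K.shape)   # (5, 5)
```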

ARD stands for automatic relevance determination, so named because estimating
the lengthscale parameters \ell_1, \ell_2, \ldots, \ell_D implicitly determines the relevance of each
dimension. Input dimensions with relatively large lengthscales imply relatively little
variation along those dimensions in the function being modeled.
SE-ARD kernels are the default kernel in most applications of GPs. This may be
partly because they have relatively few parameters to estimate, and because those pa-
rameters are relatively interpretable. In addition, there is a theoretical reason to use
them: they are universal kernels (Micchelli et al., 2006), capable of learning any contin-
uous function given enough data, under some conditions.

However, this flexibility means that they can sometimes be relatively slow to learn,
due to the curse of dimensionality (Bellman, 1956). In general, the more structure we
account for, the less data we need - the blessing of abstraction (Goodman et al., 2011)
counters the curse of dimensionality. Below, we will investigate ways to encode more
structure into kernels.

1.4 Modeling sums of functions


An additive function is one which can be expressed as f (x) = fa (x) + fb (x). Additivity
is a useful modeling assumption in a wide variety of contexts, especially if it allows us
to make strong assumptions about the individual components which make up the sum.
Restricting the flexibility of component functions often aids in building interpretable
models, and sometimes enables extrapolation in high dimensions.

Lin + Per        SE + Per        SE + Lin        SE(long) + SE(short)

periodic plus trend   |   periodic plus noise   |   linear plus variation   |   slow & fast variation

[Each column plots the summed kernel and functions sampled from the corresponding GP prior.]

Figure 1.4: Examples of one-dimensional structures expressible by adding kernels. Rows
have the same meaning as in figure 1.1. SE(long) denotes an SE kernel whose lengthscale
is long relative to that of SE(short).

It is easy to encode additivity into GP models. Suppose functions f_a, f_b are drawn
independently from GP priors:

f_a \sim GP(\mu_a, k_a)    (1.5)
f_b \sim GP(\mu_b, k_b)    (1.6)

Then the distribution of the sum of those functions is simply another GP:

f_a + f_b \sim GP(\mu_a + \mu_b, k_a + k_b).    (1.7)

Kernels ka and kb can be of different types, allowing us to model the data as a sum
of independent functions, each possibly representing a different type of structure. Any
number of components can be summed this way.
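A quick numerical check of equation (1.7), using illustrative SE and Per kernels: the empirical covariance of sums of independent draws approaches the sum of the two Gram matrices.

```python
# Sketch: verify that adding independent GP draws corresponds to adding kernels.
import numpy as np

def se(x, xp, ell=1.0):  return np.exp(-(x - xp)**2 / (2 * ell**2))
def per(x, xp, ell=1.0, p=1.0):
    return np.exp(-2 * np.sin(np.pi * (x - xp) / p)**2 / ell**2)

rng = np.random.default_rng(0)
xs = np.linspace(-2, 2, 40)
jitter = 1e-8 * np.eye(len(xs))
K_a = se(xs[:, None], xs[None, :]) + jitter
K_b = per(xs[:, None], xs[None, :]) + jitter

# Draw many independent pairs (f_a, f_b) and add them.
n = 20000
f_a = np.linalg.cholesky(K_a) @ rng.standard_normal((len(xs), n))
f_b = np.linalg.cholesky(K_b) @ rng.standard_normal((len(xs), n))
emp_cov = np.cov(f_a + f_b)

# The empirical covariance of the sums approaches k_a + k_b.
print(np.max(np.abs(emp_cov - (K_a + K_b))))   # small, shrinking with n
```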

1.4.1 Modeling noise

Additive noise can be modeled as an unknown, quickly-varying function added to the
signal. This structure can be incorporated into a GP model by adding a local kernel such
as an SE with a short lengthscale, as in the fourth column of figure 1.4. The limit of the
SE kernel as its lengthscale goes to zero is a white noise (WN) kernel. Function values
drawn from a GP with a WN kernel are independent draws from a Gaussian random
variable.
Given a kernel containing both signal and noise components, we may wish to isolate
only the signal components. Section 1.4.5 shows how to decompose a GP posterior into
each of its additive components.
In practice, there may not be a clear distinction between signal and noise. For
example, ?? contains examples of models having long-term, medium-term, and short-
term trends. Which parts we designate as the signal sometimes depends on the task
at hand.

1.4.2 Additivity across multiple dimensions

When modeling functions of multiple dimensions, summing kernels can give rise to addi-
tive structure across different dimensions. To be more precise, if the kernels being added
together are each functions of only a subset of input dimensions, then the implied prior
over functions decomposes in the same way. For example,

f(x_1, x_2) \sim GP(0, k_1(x_1, x_1') + k_2(x_2, x_2'))    (1.8)



[Figure 1.5 panels. Top row: the kernels k_1(x_1, x_1') and k_2(x_2, x_2') and their sum k_1(x_1, x_1') + k_2(x_2, x_2'). Bottom row: draws f_1(x_1) \sim GP(0, k_1) and f_2(x_2) \sim GP(0, k_2) and their sum f_1(x_1) + f_2(x_2).]

Figure 1.5: A sum of two orthogonal one-dimensional kernels. Top row: An additive
kernel is a sum of kernels. Bottom row: A draw from an additive kernel corresponds to
a sum of draws from independent GP priors, each having the corresponding kernel.

is equivalent to the model

f_1(x_1) \sim GP(0, k_1(x_1, x_1'))    (1.9)
f_2(x_2) \sim GP(0, k_2(x_2, x_2'))    (1.10)
f(x_1, x_2) = f_1(x_1) + f_2(x_2).    (1.11)

Figure 1.5 illustrates a decomposition of this form. Note that the product of two
kernels does not have an analogous interpretation as the product of two functions.

1.4.3 Extrapolation through additivity


Additive structure sometimes allows us to make predictions far from the training data.
Figure 1.6 compares the extrapolations made by additive versus product-kernel GP mod-
els, conditioned on data from a sum of two axis-aligned sine functions. The training
points were evaluated in a small, L-shaped area. In this example, the additive model is
able to correctly predict the height of the function at unseen combinations of inputs.
The product-kernel model is more flexible, and so remains uncertain about the function
away from the data.

[Figure 1.6 panels, left to right: the true function f(x_1, x_2) = \sin(x_1) + \sin(x_2); the GP mean using a sum of SE kernels, k_1(x_1, x_1') + k_2(x_2, x_2'); and the GP mean using a product of SE kernels, k_1(x_1, x_1') k_2(x_2, x_2').]

Figure 1.6: Left: A function with additive structure. Center: A GP with an additive
kernel can extrapolate away from the training data. Right: A GP with a product kernel
allows a different function value for every combination of inputs, and so is uncertain
about function values away from the training data. This causes the predictions to revert
to the mean.


These types of additive models have been well-explored in the statistics literature.
For example, generalized additive models (Hastie and Tibshirani, 1990) have seen wide
adoption. In high dimensions, we can also consider sums of functions of multiple input
dimensions. Section 1.11 considers this model class in more detail.

1.4.4 Example: An additive model of concrete strength


To illustrate how additive kernels give rise to interpretable models, we built an addi-
tive model of the strength of concrete as a function of the amount of seven different
ingredients (cement, slag, fly ash, water, plasticizer, coarse aggregate and fine aggre-
gate), and the age of the concrete (Yeh, 1998). Our simple model is a sum of 8 different
one-dimensional functions, each depending on only one of these quantities:

f(x) = f_1(cement) + f_2(slag) + f_3(fly ash) + f_4(water)
       + f_5(plasticizer) + f_6(coarse) + f_7(fine) + f_8(age) + noise    (1.12)

where noise \overset{iid}{\sim} \mathcal{N}(0, \sigma_n^2). Each of the functions f_1, f_2, \ldots, f_8 was modeled using a GP
with an SE kernel. These eight SE kernels plus a white noise kernel were added together
as in equation (1.8) to form a single GP model whose kernel had 9 additive components.

After learning the kernel parameters by maximizing the marginal likelihood of the
data, one can visualize the predictive distribution of each component of the model.
[Figure 1.7 panels: strength versus cement (kg/m3), slag (kg/m3), fly ash (kg/m3), water (kg/m3), plasticizer (kg/m3), coarse (kg/m3), fine (kg/m3), and age (days); legend: Data, Posterior density, Posterior samples.]

Figure 1.7: The predictive distribution of each one-dimensional function in a multi-
dimensional additive model. Blue crosses indicate the original data projected on to each
dimension, red indicates the marginal posterior density of each function, and colored lines
are samples from the marginal posterior distribution of each one-dimensional function.
The vertical axis is the same for all plots.

Figure 1.7 shows the marginal posterior distribution of each of the eight one-dimensional
functions in the model. The parameters controlling the variance of two of the functions,
f6 (coarse) and f7 (fine) were set to zero, meaning that the marginal likelihood preferred
a parsimonious model which did not depend on these inputs. This is an example of the
automatic sparsity that arises by maximizing marginal likelihood in GP models, and is
another example of automatic relevance determination (ARD) (Neal, 1995).
The ability to learn kernel parameters in this way is much more difficult when using
non-probabilistic methods such as Support Vector Machines (Cortes and Vapnik, 1995),
for which cross-validation is often the best method to select kernel parameters.

1.4.5 Posterior variance of additive components

Here we derive the posterior variance and covariance of all of the additive components
of a GP. These formulas allow one to make plots such as figure 1.7.
First, we write down the joint prior distribution over two functions drawn indepen-
dently from GP priors, and their sum. We distinguish between f (X) (the function values
at training locations [x_1, x_2, \ldots, x_N]^T := X) and f(X^*) (the function values at some set
of query locations [x_1^*, x_2^*, \ldots, x_N^*]^T := X^*).

Formally, if f_1 and f_2 are a priori independent, and f_1 \sim GP(\mu_1, k_1) and f_2 \sim GP(\mu_2, k_2),
then

\begin{bmatrix} f_1(X) \\ f_1(X^*) \\ f_2(X) \\ f_2(X^*) \\ f_1(X) + f_2(X) \\ f_1(X^*) + f_2(X^*) \end{bmatrix}
\sim
\mathcal{N}\!\left(
\begin{bmatrix} \mu_1 \\ \mu_1^* \\ \mu_2 \\ \mu_2^* \\ \mu_1 + \mu_2 \\ \mu_1^* + \mu_2^* \end{bmatrix} ,
\begin{bmatrix}
K_1 & K_1^* & 0 & 0 & K_1 & K_1^* \\
K_1^{*T} & K_1^{**} & 0 & 0 & K_1^{*T} & K_1^{**} \\
0 & 0 & K_2 & K_2^* & K_2 & K_2^* \\
0 & 0 & K_2^{*T} & K_2^{**} & K_2^{*T} & K_2^{**} \\
K_1 & K_1^* & K_2 & K_2^* & K_1 + K_2 & K_1^* + K_2^* \\
K_1^{*T} & K_1^{**} & K_2^{*T} & K_2^{**} & K_1^{*T} + K_2^{*T} & K_1^{**} + K_2^{**}
\end{bmatrix}
\right)    (1.13)

where we represent the Gram matrices, whose (i, j)th entry is given by k(x_i, x_j), by

K_i = k_i(X, X)    (1.14)
K_i^* = k_i(X, X^*)    (1.15)
K_i^{**} = k_i(X^*, X^*)    (1.16)

The formula for Gaussian conditionals ?? can be used to give the conditional distri-
bution of a GP-distributed function conditioned on its sum with another GP-distributed
function:
f_1(X^*) \mid f_1(X) + f_2(X) \;\sim\; \mathcal{N}\Big( \mu_1^* + K_1^{*T} (K_1 + K_2)^{-1} \big[f_1(X) + f_2(X) - \mu_1 - \mu_2\big],
\qquad K_1^{**} - K_1^{*T} (K_1 + K_2)^{-1} K_1^* \Big)    (1.17)

These formulas express the model's posterior uncertainty about the different components
of the signal, integrating over the possible configurations of the other components. To
extend these formulas to a sum of more than two functions, the term K_1 + K_2 can simply
be replaced by \sum_i K_i everywhere.
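The following sketch implements equations (1.17) and (1.18) with zero prior means; the toy data, kernels, lengthscales, and helper name component_posterior are my own illustrative choices, not code from this chapter.

```python
# Sketch: posterior over one additive component given observations of the sum.
import numpy as np

def se(x, xp, ell=1.0):
    return np.exp(-(x - xp)**2 / (2 * ell**2))

def component_posterior(k1, k2, X, y, Xs):
    """Posterior over f1(Xs) given f1(X) + f2(X) = y, assuming zero prior means."""
    K1 = k1(X[:, None], X[None, :])
    K2 = k2(X[:, None], X[None, :])
    K1s = k1(X[:, None], Xs[None, :])            # k1(X, X*)
    K2s = k2(X[:, None], Xs[None, :])            # k2(X, X*)
    K1ss = k1(Xs[:, None], Xs[None, :])          # k1(X*, X*)
    A = np.linalg.inv(K1 + K2 + 1e-8 * np.eye(len(X)))
    mean = K1s.T @ A @ y                          # equation (1.17), zero-mean case
    cov = K1ss - K1s.T @ A @ K1s                  # equation (1.17)
    cross = -K1s.T @ A @ K2s                      # equation (1.18)
    return mean, cov, cross

rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 30)
y = np.sin(X) + 0.2 * rng.standard_normal(len(X))    # toy "signal plus noise" data
Xs = np.linspace(-3, 3, 100)
k_slow = lambda a, b: se(a, b, ell=2.0)              # long-lengthscale component
k_fast = lambda a, b: 0.04 * se(a, b, ell=0.05)      # short-lengthscale "noise"
mean, cov, cross = component_posterior(k_slow, k_fast, X, y, Xs)
print(mean.shape, cov.shape, cross.shape)            # (100,), (100, 100), (100, 100)
```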

[Figure 1.8: a grid of panels indexed by cement, slag, fly ash, water, plasticizer, and age on both axes, coloured by posterior correlation.]

Figure 1.8: Posterior correlations between the heights of the one-dimensional functions
in equation (1.12), whose sum models concrete strength. Red indicates high correlation,
teal indicates no correlation, and blue indicates negative correlation. Plots on the
diagonal show posterior correlations between different values of the same function.
Correlations are evaluated over the same input ranges as in figure 1.7. Correlations with
f_6(coarse) and f_7(fine) are not shown, because their estimated variance was zero.

Posterior covariance of additive components

One can also compute the posterior covariance between the height of any two functions,
conditioned on their sum:
Cov\big[f_1(X^*), f_2(X^*) \mid f(X)\big] = -K_1^{*T} (K_1 + K_2)^{-1} K_2^*    (1.18)

If this quantity is negative, it means that there is ambiguity about which of the two
functions is high or low at that location. For example, figure 1.8 shows the posterior
correlation between all non-zero components of the concrete model. This figure shows
that most of the correlation occurs within components, but there is also negative corre-
lation between the height of f1 (cement) and f2 (slag).

1.5 Changepoints

An example of how combining kernels can give rise to more structured priors is given by
changepoint kernels, which can express a change between different types of structure.
Changepoint kernels can be defined through addition and multiplication with sigmoidal
functions such as \sigma(x) = 1/(1 + \exp(-x)):

CP(k_1, k_2)(x, x') = \sigma(x)\,k_1(x, x')\,\sigma(x') + (1 - \sigma(x))\,k_2(x, x')\,(1 - \sigma(x'))    (1.19)

which can be written in shorthand as

CP(k_1, k_2) = k_1\,\boldsymbol{\sigma} + k_2\,\bar{\boldsymbol{\sigma}}    (1.20)

where \boldsymbol{\sigma} = \sigma(x)\sigma(x') and \bar{\boldsymbol{\sigma}} = (1 - \sigma(x))(1 - \sigma(x')).


This compound kernel expresses a change from one kernel to another. The parameters
of the sigmoid determine where, and how rapidly, this change occurs. Figure 1.9 shows
some examples.

[Figure 1.9 panels: draws f(x) from GP priors with kernels CP(SE, Per), CP(SE, Per), CP(SE, SE), and CP(Per, Per).]

Figure 1.9: Draws from different priors using changepoint kernels, constructed by
adding and multiplying together base kernels with sigmoidal functions.

We can also build a model of functions whose structure changes only within some
interval (a change-window) by replacing \sigma(x) with a product of two sigmoids, one
increasing and one decreasing.
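A sketch of the changepoint construction in equation (1.19); the sigmoid parameterization (location and steepness) and all numerical values below are illustrative assumptions.

```python
# Sketch: blend kernel k1 into kernel k2 across a soft changepoint.
import numpy as np

def sigmoid(x, location=0.0, steepness=5.0):
    # With this sign convention, sigma is close to 1 for x below the changepoint.
    return 1.0 / (1.0 + np.exp(steepness * (x - location)))

def changepoint(k1, k2, location=0.0, steepness=5.0):
    def k(x, xp):
        s = sigmoid(x, location, steepness)
        sp = sigmoid(xp, location, steepness)
        return s * k1(x, xp) * sp + (1 - s) * k2(x, xp) * (1 - sp)
    return k

def se(x, xp, ell=1.0):  return np.exp(-(x - xp)**2 / (2 * ell**2))
def per(x, xp, ell=1.0, p=1.0):
    return np.exp(-2 * np.sin(np.pi * (x - xp) / p)**2 / ell**2)

cp = changepoint(se, per)                       # SE-like before 0, Per-like after 0
xs = np.linspace(-3, 3, 200)
K = cp(xs[:, None], xs[None, :]) + 1e-8 * np.eye(len(xs))
draw = np.linalg.cholesky(K) @ np.random.randn(len(xs))
print(draw.shape)
```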

1.5.1 Multiplication by a known function

More generally, we can model an unknown function that has been multiplied by any fixed,
known function a(x), by multiplying the kernel by a(x)a(x'). Formally,

f(x) = a(x)g(x), \quad g \sim GP(0, k(x, x')) \quad \Longrightarrow \quad f \sim GP(0, a(x)\,k(x, x')\,a(x')).    (1.21)

1.6 Feature representation of kernels

By Mercer's theorem (Mercer, 1909), any positive-definite kernel can be represented as
the inner product between a fixed set of features, evaluated at x and at x':

k(x, x') = h(x)^T h(x')    (1.22)

For example, the squared-exponential kernel (SE) on the real line has a representation
in terms of infinitely many radial-basis functions of the form h_i(x) \propto \exp\big(-\frac{1}{4\ell^2}(x - c_i)^2\big).
More generally, any stationary kernel can be represented by a set of sines and cosines - a
Fourier representation (Bochner, 1959). In general, any particular feature representation
of a kernel is not necessarily unique (Minh et al., 2006).
In some cases, the input to a kernel, x, can even be the implicit infinite-dimensional
feature mapping of another kernel. Composing feature maps in this way leads to deep
kernels, which are explored in ??.

1.6.1 Relation to linear regression

Surprisingly, GP regression is equivalent to Bayesian linear regression on the implicit
features h(x) which give rise to the kernel:

f(x) = w^T h(x), \quad w \sim \mathcal{N}(0, I) \quad \Longrightarrow \quad f \sim GP\big(0, h(x)^T h(x')\big)    (1.23)

The link between Gaussian processes, linear regression, and neural networks is explored
further in ??.
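As a rough numerical illustration of equation (1.23) (the finite radial-basis feature map below is an arbitrary stand-in for the implicit features of a kernel), weights drawn from N(0, I) induce functions whose covariance matches h(x)^T h(x').

```python
# Sketch: linear regression on explicit features induces the kernel h(x)^T h(x').
import numpy as np

centers = np.linspace(-3, 3, 25)
def h(x, width=0.5):
    """Feature map: one radial-basis function per centre."""
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * width**2))

rng = np.random.default_rng(0)
xs = np.linspace(-3, 3, 60)
H = h(xs)                                        # (60, 25)

# Kernel implied by the feature map:
K_features = H @ H.T

# Empirical covariance of f(x) = w^T h(x) over many weight draws:
n = 50000
W = rng.standard_normal((25, n))
F = H @ W                                        # each column is one function
print(np.max(np.abs(np.cov(F) - K_features)))    # small, shrinking with n
```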

1.6.2 Feature-space view of combining kernels


We can also view kernel addition and multiplication as a combination of the features of
the original kernels. For example, given two kernels

k_a(x, x') = a(x)^T a(x')    (1.24)
k_b(x, x') = b(x)^T b(x')    (1.25)

their addition has the form:


k_a(x, x') + k_b(x, x') = a(x)^T a(x') + b(x)^T b(x') = \begin{bmatrix} a(x) \\ b(x) \end{bmatrix}^T \begin{bmatrix} a(x') \\ b(x') \end{bmatrix}    (1.26)

meaning that the features of ka + kb are the concatenation of the features of each kernel.
We can examine kernel multiplication in a similar way:
k_a(x, x') \, k_b(x, x') = \big[a(x)^T a(x')\big] \big[b(x)^T b(x')\big]    (1.27)
    = \sum_i a_i(x) a_i(x') \sum_j b_j(x) b_j(x')    (1.28)
    = \sum_{i,j} \big[a_i(x) b_j(x)\big] \big[a_i(x') b_j(x')\big]    (1.29)

In words, the features of k_a \times k_b are made up of all pairs of the original two sets of
features. For example, the features of the product of two one-dimensional SE kernels
(SE_1 \times SE_2) cover the plane with two-dimensional radial-basis functions of the form:

h_{ij}(x_1, x_2) \propto \exp\left(-\frac{(x_1 - c_i)^2}{2\ell_1^2}\right)\exp\left(-\frac{(x_2 - c_j)^2}{2\ell_2^2}\right)    (1.30)

1.7 Expressing symmetries and invariances


When modeling functions, encoding known symmetries can improve predictive accuracy.
This section looks at different ways to encode symmetries into a prior on functions. Many
types of symmetry can be enforced through operations on the kernel.
We will demonstrate the properties of the resulting models by sampling functions
from their priors. By using these functions to define smooth mappings from R^2 to R^3,
we will show how to build a nonparametric prior on an open-ended family of topological
manifolds, such as cylinders, toruses, and Möbius strips.

1.7.1 Three recipes for invariant priors

Consider the scenario where we have a finite set of transformations of the input space
{g1 , g2 , . . .} to which we wish our function to remain invariant:

f(x) = f(g(x)) \qquad \forall x \in \mathcal{X}, \quad \forall g \in G    (1.31)

As an example, imagine we wish to build a model of functions invariant to swapping
their inputs: f(x_1, x_2) = f(x_2, x_1), for all x_1, x_2. Being invariant to a set of operations is
equivalent to being invariant to all compositions of those operations, the set of which
forms a group (Armstrong et al., 1988, chapter 21). In our example, the group G_swap
containing all operations to which the functions are invariant has two elements:

g_1([x_1, x_2]) = [x_2, x_1]    (swap)    (1.32)
g_2([x_1, x_2]) = [x_1, x_2]    (identity)    (1.33)

How can we construct a prior on functions which respect these symmetries? Gins-
bourger et al. (2012) and Ginsbourger et al. (2013) showed that the only way to construct
a GP prior on functions which respect a set of invariances is to construct a kernel which
respects the same invariances with respect to each of its two inputs:

k(x, x') = k(g(x), g'(x')), \qquad \forall x, x' \in \mathcal{X}, \quad \forall g, g' \in G    (1.34)

Formally, given a finite group G whose elements are operations to which we wish our
function to remain invariant, and f \sim GP(0, k(x, x')), then every f is invariant under
G (up to a modification) if and only if k(\cdot, \cdot) is argument-wise invariant under G. See
Ginsbourger et al. (2013) for details.
It might not always be clear how to construct a kernel respecting such argument-wise
invariances. Fortunately, there are a few simple ways to do this for any finite group
(a code sketch of all three constructions is given at the end of this subsection):

1. Sum over the orbit. The orbit of x with respect to a group G is {g(x) : g \in G},
the set obtained by applying each element of G to x. Ginsbourger et al. (2012)
and Kondor (2008) suggest enforcing invariances through a double sum over the
orbits of x and x' with respect to G:

k_{sum}(x, x') = \sum_{g \in G} \sum_{g' \in G} k(g(x), g'(x'))    (1.35)
[Figure 1.10 panels, left to right:
Additive method: SE(x_1, x_1') SE(x_2, x_2') + SE(x_1, x_2') SE(x_2, x_1');
Projection method: SE(min(x_1, x_2), min(x_1', x_2')) SE(max(x_1, x_2), max(x_1', x_2'));
Product method: SE(x_1, x_1') SE(x_2, x_2') \times SE(x_1, x_2') SE(x_2, x_1').]

Figure 1.10: Functions drawn from three distinct GP priors, each expressing symmetry
about the line x_1 = x_2 using a different type of construction. All three methods introduce
a different type of nonstationarity.

For the group G_swap, this operation results in the kernel:

k_{switch}(x, x') = \sum_{g \in G_{swap}} \sum_{g' \in G_{swap}} k(g(x), g'(x'))    (1.36)
    = k(x_1, x_2, x_1', x_2') + k(x_1, x_2, x_2', x_1')
    + k(x_2, x_1, x_1', x_2') + k(x_2, x_1, x_2', x_1')    (1.37)

For stationary kernels, some pairs of elements in this sum will be identical, and
can be ignored. Figure 1.10(left) shows a draw from a GP prior with a product of
SE kernels symmetrized in this way. This construction has the property that the
marginal variance is doubled near x1 = x2 , which may or may not be desirable.

2. Project onto a fundamental domain. Ginsbourger et al. (2013) also explored
the possibility of projecting each datapoint into a fundamental domain of the
group, using a mapping A_G:

k_{proj}(x, x') = k(A_G(x), A_G(x'))    (1.38)

For example, a fundamental domain of the group G_swap is all {x_1, x_2 : x_1 < x_2},
a set which can be mapped to using A_{G_{swap}}(x_1, x_2) = [min(x_1, x_2), max(x_1, x_2)].
Constructing a kernel using this method introduces a non-differentiable seam
along x1 = x2 , as shown in figure 1.10(center).

3. Multiply over the orbit. Ryan P. Adams (personal communication) suggested
a construction enforcing invariances through a double product over the orbits:

k_{prod}(x, x') = \prod_{g \in G} \prod_{g' \in G} k(g(x), g'(x'))    (1.39)

This method can sometimes produce GP priors with zero variance in some regions,
as in figure 1.10(right).
There are often many possible ways to achieve a given symmetry, but we must be careful
to do so without compromising other qualities of the model we are constructing. For
example, simply setting k(x, x') = 0 gives rise to a GP prior which obeys all possible
symmetries, but this is presumably not a model we wish to use.
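The sketch below is my own illustration, with a product of SE kernels as the base kernel; it implements the three constructions above for G_swap and checks argument-wise invariance numerically.

```python
# Sketch: three ways to make a kernel on R^2 invariant under swapping its inputs.
import numpy as np

def se2(x, xp, ell=1.0):
    """Base kernel on R^2: a product of SE kernels on each dimension."""
    return np.exp(-np.sum((x - xp)**2, axis=-1) / (2 * ell**2))

def swap(x):
    return x[..., ::-1]                           # [x1, x2] -> [x2, x1]

def k_sum(x, xp):
    """Double sum over the orbit (equation 1.35)."""
    return (se2(x, xp) + se2(x, swap(xp))
            + se2(swap(x), xp) + se2(swap(x), swap(xp)))

def k_project(x, xp):
    """Map both inputs to the fundamental domain {x1 <= x2} (equation 1.38)."""
    proj = lambda z: np.sort(z, axis=-1)
    return se2(proj(x), proj(xp))

def k_product(x, xp):
    """Double product over the orbit (equation 1.39)."""
    return (se2(x, xp) * se2(x, swap(xp))
            * se2(swap(x), xp) * se2(swap(x), swap(xp)))

x = np.array([0.3, 1.2])
xp = np.array([1.2, 0.3])                         # the swapped point
for k in (k_sum, k_project, k_product):
    # Each kernel is argument-wise invariant: swapping x leaves it unchanged.
    assert np.allclose(k(x, xp), k(swap(x), xp))
    print(k.__name__, float(k(x, xp)))
```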

1.7.2 Example: Periodicity


Periodicity in a one-dimensional function corresponds to the invariance

f(x) = f(x + \tau)    (1.40)

where \tau is the period.


The most popular method for building a periodic kernel is due to MacKay (1998),
who used the projection method in combination with an SE kernel. A fundamental
domain of the symmetry group is a circle, so the kernel

Per(x, x') = SE(\sin(x), \sin(x')) \times SE(\cos(x), \cos(x'))    (1.41)

achieves the invariance in equation (1.40). Simple algebra reduces this kernel to the
form given in figure 1.1.

1.7.3 Example: Symmetry about zero


Another example of an easily-enforceable symmetry is symmetry about zero:

f(x) = f(-x).    (1.42)

This symmetry can be enforced using the sum over orbits method, by the transform

k_{reflect}(x, x') = k(x, x') + k(x, -x') + k(-x, x') + k(-x, -x').    (1.43)



1.7.4 Example: Translation invariance in images


Many models of images are invariant to spatial translations (LeCun and Bengio, 1995).
Similarly, many models of sounds are also invariant to translation through time.
Note that this sort of translation invariance is completely distinct from the station-
arity of kernels such as SE or Per. A stationary kernel implies that the prior is invariant
to translations of the entire training and test set. In contrast, here we use translation
invariance to refer to situations where the signal has been discretized, and each pixel
(or the audio equivalent) corresponds to a different input dimension. We are interested
in creating priors on functions that are invariant to swapping pixels in a manner that
corresponds to shifting the signal in some direction:

f(x) = f(\mathrm{shift}(x, i)) \qquad \forall i    (1.44)

For example, in a one-dimensional image or audio signal, translation of an input vector
by i pixels can be defined as

\mathrm{shift}(x, i) = \big[ x_{mod(i+1, D)}, \; x_{mod(i+2, D)}, \; \ldots, \; x_{mod(i+D, D)} \big]^T    (1.45)

As above, translation invariance in one dimension can be achieved by a double sum over
the orbit, given an initial translation-sensitive kernel between signals k:

k_{invariant}(x, x') = \sum_{i=1}^{D} \sum_{j=1}^{D} k\big(\mathrm{shift}(x, i), \, \mathrm{shift}(x', j)\big).    (1.46)

The extension to two dimensions, shift(x, i, j), is straightforward, but notationally
cumbersome. Kondor (2008) built a more elaborate kernel between images that was
approximately invariant to both translation and rotation, using the projection method.
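A sketch of equations (1.45) and (1.46) for one-dimensional signals; the base kernel between whole signals and the signal length are illustrative assumptions.

```python
# Sketch: symmetrize a translation-sensitive kernel by a double sum over circular shifts.
import numpy as np

def se_vec(x, xp, ell=1.0):
    """Base kernel between whole signals (vectors of pixel values)."""
    return np.exp(-np.sum((x - xp)**2) / (2 * ell**2))

def shift(x, i):
    """Circular shift of a signal by i pixels, as in equation (1.45)."""
    return np.roll(x, i)

def k_invariant(x, xp, k=se_vec):
    """Double sum over all shifts of both signals (equation 1.46)."""
    D = len(x)
    return sum(k(shift(x, i), shift(xp, j)) for i in range(D) for j in range(D))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
xp = rng.standard_normal(8)
# Shifting either input leaves the kernel value unchanged.
print(np.isclose(k_invariant(x, xp), k_invariant(shift(x, 3), xp)))   # True
```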

1.8 Generating topological manifolds


In this section we give a geometric illustration of the symmetries encoded by different
compositions of kernels. The work presented in this section is based on a collaboration
with David Reshef, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. The
derivation of the Möbius kernel was my original contribution.
Priors on functions obeying invariants can be used to create a prior on topological
manifolds by using such functions to warp a simply-connected surface into a higher-
dimensional space. For example, one can build a prior on 2-dimensional manifolds
embedded in 3-dimensional space through a prior on mappings from R^2 to R^3. Such
mappings can be constructed using three independent functions [f_1(x), f_2(x), f_3(x)],
each mapping from R^2 to R. Different GP priors on these functions will implicitly give
rise to different priors on warped surfaces. Symmetries in [f_1, f_2, f_3] can connect different
parts of the manifolds, giving rise to non-trivial topologies on the sampled surfaces.

[Figure 1.11 panels: Euclidean (SE_1 \times SE_2), Cylinder (SE_1 \times Per_2), Toroid (Per_1 \times Per_2).]

Figure 1.11: Generating 2D manifolds with different topologies. By enforcing that the
functions mapping from R^2 to R^3 obey certain symmetries, the surfaces created have
corresponding topologies, ignoring self-intersections.

Figure 1.11 shows 2D meshes warped into 3D by functions drawn from GP priors
with various kernels, giving rise to different topologies. Higher-dimensional analogues
of these shapes can be constructed by increasing the latent dimension and including
corresponding terms in the kernel. For example, an N-dimensional latent space using
kernel Per_1 \times Per_2 \times \ldots \times Per_N will give rise to a prior on manifolds having the topology
of N-dimensional toruses, ignoring self-intersections.

This construction is similar in spirit to the GP latent variable model (GP-LVM) of
Lawrence (2005), which learns a latent embedding of the data into a low-dimensional
space, using a GP prior on the mapping from the latent space to the observed space.

[Figure 1.12 panels, left to right: a draw from a GP with kernel Per(x_1, x_1') Per(x_2, x_2') + Per(x_1, x_2') Per(x_2, x_1'); a Möbius strip drawn from an R^2 \to R^3 GP prior; a Sudanese Möbius strip generated parametrically.]

Figure 1.12: Generating Möbius strips. Left: A function drawn from a GP prior obeying
the symmetries given by equations (1.47) to (1.49). Center: Simply-connected surfaces
mapped from R^2 to R^3 by functions obeying those symmetries have a topology corresponding
to a Möbius strip. Surfaces generated this way do not have the familiar shape of a flat
surface connected to itself with a half-twist. Instead, they tend to look like Sudanese
Möbius strips (Lerner and Asimov, 1984), whose edge has a circular shape. Right: A
Sudanese projection of a Möbius strip. Image adapted from Wikimedia Commons (2005).

1.8.1 Möbius strips

A space having the topology of a Möbius strip can be constructed by enforcing invariance
to the following operations (Reid and Szendrői, 2005, chapter 7):

g_{p1}([x_1, x_2]) = [x_1 + \tau, x_2]    (periodic in x_1)    (1.47)
g_{p2}([x_1, x_2]) = [x_1, x_2 + \tau]    (periodic in x_2)    (1.48)
g_{s}([x_1, x_2]) = [x_2, x_1]    (symmetric about x_1 = x_2)    (1.49)

Section 1.7 already showed how to build GP priors invariant to each of these types of
transformations. We'll call a kernel which enforces these symmetries a Möbius kernel.
An example of such a kernel is:

k(x_1, x_2, x_1', x_2') = Per(x_1, x_1')\,Per(x_2, x_2') + Per(x_1, x_2')\,Per(x_2, x_1')    (1.50)

Moving along the diagonal x1 = x2 of a function drawn from the corresponding GP prior
is equivalent to moving along the edge of a notional Mbius strip which has had that
function mapped on to its surface. Figure 1.12(left) shows an example of a function
drawn from such a prior. Figure 1.12(center) shows an example of a 2D mesh mapped
to 3D by functions drawn from such a prior. This surface doesn't resemble the typical
representation of a Möbius strip, but instead resembles an embedding known as the
Sudanese Möbius strip (Lerner and Asimov, 1984), shown in figure 1.12(right).

1.9 Kernels on categorical variables

Categorical variables are variables which can take values only from a discrete, unordered
set, such as {blue, green, red}. A simple way to construct a kernel over categorical
variables is to represent that variable by a set of binary variables, using a one-of-k
encoding. For example, if x can take one of four values, x \in {A, B, C, D}, then a one-of-k
encoding of x will correspond to four binary inputs, and one-of-k(C) = [0, 0, 1, 0]. Given
a one-of-k encoding, we can place any multi-dimensional kernel on that space, such as
the SE-ARD:

k_{categorical}(x, x') = SE-ARD(one-of-k(x), one-of-k(x'))    (1.51)

Short lengthscales on any particular dimension of the SE-ARD kernel indicate that the
function value corresponding to that category is uncorrelated with the others. More
flexible parameterizations are also possible (Pinheiro and Bates, 1996).
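A sketch of equation (1.51); the category set and the lengthscale values are illustrative (the short lengthscale on the third dimension decorrelates category C from the rest).

```python
# Sketch: a kernel on a categorical variable via one-of-k encoding plus SE-ARD.
import numpy as np

categories = ["A", "B", "C", "D"]

def one_of_k(value):
    """One-of-k (one-hot) encoding of a categorical value."""
    v = np.zeros(len(categories))
    v[categories.index(value)] = 1.0
    return v

def se_ard(x, xp, lengthscales):
    return np.exp(-0.5 * np.sum(((x - xp) / lengthscales)**2))

def k_categorical(c, cp, lengthscales=np.array([1.0, 1.0, 0.1, 1.0])):
    # The short lengthscale on the dimension for "C" decorrelates that category.
    return se_ard(one_of_k(c), one_of_k(cp), lengthscales)

print(k_categorical("A", "B"), k_categorical("A", "C"), k_categorical("C", "C"))
```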

1.10 Multiple outputs

Any GP prior can easily be extended to model multiple outputs: f_1(x), f_2(x), \ldots, f_T(x).
This can be done by building a model of a single-output function which has had an ex-
tra input added that denotes the index of the output: f_i(x) = f(x, i). This can be
done by extending the original kernel k(x, x') to have an extra discrete input dimension:
k(x, i, x', i').
A simple and flexible construction of such a kernel multiplies the original kernel
k(x, x') with a categorical kernel on the output index (Bonilla et al., 2007):

k(x, i, x', i') = k_x(x, x')\,k_i(i, i')    (1.52)
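A sketch of equation (1.52); representing the categorical kernel on the output index by a freely parameterized positive semi-definite matrix B is one simple choice, assumed here for illustration.

```python
# Sketch: a multi-output kernel as the product of an input kernel and an index kernel.
import numpy as np

def se(x, xp, ell=1.0):
    return np.exp(-(x - xp)**2 / (2 * ell**2))

rng = np.random.default_rng(0)
T = 3                                  # number of outputs
A = rng.standard_normal((T, T))
B = A @ A.T                            # k_i(i, i') = B[i, i'], positive semi-definite

def k_multi(x, i, xp, ip):
    """k(x, i, x', i') = k_x(x, x') * k_i(i, i')."""
    return se(x, xp) * B[i, ip]

# Covariance between output 0 at x = 0.1 and output 2 at x = 0.4:
print(k_multi(0.1, 0, 0.4, 2))
```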



1.11 Building a kernel in practice


This chapter outlined ways to choose the parametric form of a kernel in order to express
different sorts of structure. Once the parametric form has been chosen, one still needs to
choose, or integrate over, the kernel parameters. If the kernel has relatively few parameters,
these parameters can be estimated by maximum marginal likelihood, using gradient-
based optimizers. The kernel parameters estimated in sections 1.4.3 and 1.4.4 were
optimized using the GPML toolbox (Rasmussen and Nickisch, 2010), available at
http://www.gaussianprocess.org/gpml/code.
A systematic search over kernel parameters is necessary when appropriate parameters
are not known. Similarly, sometimes appropriate kernel structure is hard to guess.
The next chapter will show how to perform an automatic search not just over kernel
parameters, but also over an open-ended space of kernel expressions.

Source code

Source code to produce all figures and examples in this chapter is available at
http://www.github.com/duvenaud/phd-thesis.
References

Mark A. Armstrong, Gérard Iooss, and Daniel D. Joseph. Groups and symmetry.
Springer, 1988. (page 16)

Richard Bellman. Dynamic programming and Lagrange multipliers. Proceedings of


the National Academy of Sciences of the United States of America, 42(10):767, 1956.
(page 6)

Salomon Bochner. Lectures on Fourier integrals, volume 42. Princeton University Press,
1959. (page 14)

Edwin V. Bonilla, Kian Ming Adam Chai, and Christopher K.I. Williams. Multi-task
Gaussian process prediction. In Advances in Neural Information Processing Systems,
2007. (page 22)

Corinna Cortes and Vladimir N. Vapnik. Support-vector networks. Machine Learning,
20(3):273–297, 1995. (page 10)

David Ginsbourger, Xavier Bay, Olivier Roustant, and Laurent Carraro. Argumentwise
invariant kernels for the approximation of invariant functions. In Annales de la Faculté
des Sciences de Toulouse, 2012. (page 16)

David Ginsbourger, Olivier Roustant, and Nicolas Durrande. Invariances of ran-


dom fields paths, with applications in Gaussian process regression. arXiv preprint
arXiv:1308.1359 [math.ST], August 2013. (pages 16 and 17)

Noah D. Goodman, Tomer D. Ullman, and Joshua B. Tenenbaum. Learning a theory of


causality. Psychological review, 118(1):110, 2011. (page 6)

Trevor J. Hastie and Robert J. Tibshirani. Generalized additive models. Chapman &
Hall/CRC, 1990. (page 9)

Imre Risi Kondor. Group theoretical methods in machine learning. PhD thesis, Columbia
University, 2008. (pages 16 and 19)

Neil D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian
process latent variable models. Journal of Machine Learning Research, 6:1783–1816,
2005. (page 20)

Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time
series. The handbook of brain theory and neural networks, 3361, 1995. (page 19)

Doug Lerner and Dan Asimov. The Sudanese Möbius band. In SIGGRAPH Electronic
Theatre, 1984. (pages 21 and 22)

David J.C. MacKay. Introduction to Gaussian processes. NATO ASI Series F Computer
and Systems Sciences, 168:133–166, 1998. (page 18)

James Mercer. Functions of positive and negative type, and their connection with the
theory of integral equations. Philosophical Transactions of the Royal Society of Lon-
don. Series A, Containing Papers of a Mathematical or Physical Character, pages
415–446, 1909. (page 14)

Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of
Machine Learning Research, 7:2651–2667, 2006. (page 5)

Ha Quang Minh, Partha Niyogi, and Yuan Yao. Mercer's theorem, feature maps, and
smoothing. In Learning theory, pages 154–168. Springer, 2006. (page 14)

Radford M. Neal. Bayesian learning for neural networks. PhD thesis, University of
Toronto, 1995. (page 10)

José C. Pinheiro and Douglas M. Bates. Unconstrained parametrizations for variance-
covariance matrices. Statistics and Computing, 6(3):289–296, 1996. (page 22)

Carl E. Rasmussen and Hannes Nickisch. Gaussian processes for machine learning
(GPML) toolbox. Journal of Machine Learning Research, 11:3011–3015, December
2010. (page 23)

Miles A. Reid and Balázs Szendrői. Geometry and topology. Cambridge University Press,
2005. (page 21)

Wikimedia Commons. Stereographic projection of a Sudanese Möbius band, 2005. URL
http://commons.wikimedia.org/wiki/File:MobiusSnail2B.png. (page 21)

I-Cheng Yeh. Modeling of strength of high-performance concrete using artificial neural
networks. Cement and Concrete Research, 28(12):1797–1808, 1998. (page 9)
