
Supervised Machine Learning

Andreas Lindholm, Niklas Wahlström,


Fredrik Lindsten, Thomas B. Schön

Draft version: September 25, 2020

This material will be published by Cambridge University Press. This


pre-publication version is free to view and download for personal use only. Not
for re-distribution, re-sale or use in derivative works. © The authors, 2020.
Feedback and exercise problems: http://smlbook.org
Contents

1 Introduction  5
  1.1 About this book  5
  1.2 How to proceed throughout the book  6

2 Supervised learning: a first approach  7
  2.1 The supervised learning problem  7
  2.2 A distance-based method: k-NN  12
  2.3 A rule-based method: Decision trees  17

3 Basic parametric models for regression and classification  25
  3.1 Linear regression  25
  3.2 Classification and logistic regression  32
  3.3 Polynomial regression and regularization  38
  3.4 Nonlinear regression and generalized linear models  40
  3.A Derivation of the normal equations  41

4 Understanding, evaluating and improving the performance  43
  4.1 Expected new data error E_new: performance in production  43
  4.2 Estimating E_new  45
  4.3 The training error–generalization gap decomposition of E_new  49
  4.4 The bias-variance decomposition of E_new  54
  4.5 Evaluation for imbalanced and asymmetric classification problems  60
  4.6 Further reading  63

5 Learning parametric models  65
  5.1 Loss functions  65
  5.2 Regularization  72
  5.3 Parameter optimization  75
  5.4 Optimization with large datasets  84

6 Neural networks and deep learning  87
  6.1 Neural networks  87
  6.2 Convolutional neural networks  94
  6.3 Training a neural network  97
  6.4 Perspective and further reading  100

7 Ensemble methods: Bagging and boosting  103
  7.1 Bagging  103
  7.2 Random forests  109
  7.3 Boosting and AdaBoost  111
  7.4 Gradient boosting  118

8 Nonlinear input transformations and kernels  121
  8.1 Creating features by nonlinear input transformations  121
  8.2 Kernel ridge regression  123
  8.3 Support vector regression  127
  8.4 Kernel theory  129
  8.5 Support vector classification  134
  8.6 Further reading  136
  8.A The representer theorem  136
  8.B Derivation of support vector classification  137

9 The Bayesian approach and Gaussian processes  139
  9.1 The Bayesian idea  139
  9.2 Bayesian linear regression  141
  9.3 The Gaussian process  146
  9.4 Practical aspects of the Gaussian process  155
  9.5 Other Bayesian methods in machine learning  159
  9.6 Further reading  159
  9.A The multivariate Gaussian distribution  160

10 User aspects of machine learning  163
  10.1 Defining the machine learning problem  163
  10.2 Improving a machine learning model  166
  10.3 What if we cannot collect more data?  171
  10.4 Practical data issues  173
  10.5 Can I trust my machine learning model?  175
  10.6 Further reading  176

11 Generative models and learning from unlabeled data  177
  11.1 The Gaussian mixture model and the LDA & QDA classifiers  177
  11.2 The Gaussian mixture model when some or all labels are missing  182
  11.3 More unsupervised methods: k-means and PCA  188

Bibliography  193

1 Introduction

1.1 About this book


The aim of this book is to convey the spirit of supervised machine learning, without requiring any previous
experience in the field. We focus on the underlying mathematics as well as the practical aspects. This
book is a textbook, but neither a reference work nor a programming manual. It therefore contains only
a careful (yet comprehensive) selection of supervised machine learning methods and no programming
code. There are by now many well-written and well-documented code packages available, and it is our
firm belief that with a good understanding of the mathematics and inner workings of the methods, the
reader will be able to make the connection between this book and her favorite code package in her favorite
programming language on her own.
We take a statistical perspective in this book, meaning that we discuss and motivate methods in terms of
their statistical properties. The book therefore requires some previous knowledge in statistics and probability theory,
as well as calculus and linear algebra. We hope that reading the book from start to end will give the reader
a good starting point for working as a machine learning engineer and/or pursuing further studies within
the subject.


1.2 How to proceed throughout the book


This book is written such that it can be read from start to end. There are, however, multiple possible paths
through the book that are more selective depending on the interest of the reader. Figure 1.1 illustrates
the major dependencies between the chapters. In particular, the most fundamental topics are discussed in
Chapters 2, 3 and 4, and we recommend the reader to read those chapters before proceeding to the later
chapters that contain technically more advanced topics (Chapters 5-9). Chapter 10 focuses on some of the
more practical aspects of designing a successful machine learning solution and has a less technical nature
than the other chapters, whereas Chapter 11 gives some directions for how to go beyond the supervised
setting of machine learning.
[Figure 1.1 is a block diagram. Fundamental chapters: 2 (Supervised learning: a first approach), 3 (Basic parametric models for regression and classification) and 4 (Understanding, evaluating and improving the performance). Advanced chapters: 5 (Learning parametric models), 6 (Neural networks and deep learning), 7 (Ensemble methods: Bagging and boosting), 8 (Nonlinear input transformations and kernels) and 9 (The Bayesian approach and Gaussian processes). Special chapters: 10 (User aspects of machine learning) and 11 (Generative models and learning from unlabeled data). Arrows indicate, among other things, that Sections 5.3 and 5.4 lead into Chapter 6, Section 5.1 into Sections 8.3 and 8.5, and Sections 8.1, 8.2 and 8.4 into Chapter 9.]

Figure 1.1: The structure of this book, illustrated by blocks (chapters) and arrows (recommended order to read the
chapters). We do recommend everyone to read (or at least skim) the fundamental Chapters 2, 3 and 4 first.
The path through the technically more advanced Chapters 5-9, however, can be chosen to match the particular interest
of the reader. For Chapters 10 and 11 we do recommend the reader to, at least, have read the fundamental chapters
first.

2 Supervised learning: a first approach
In this chapter we will introduce the supervised machine learning problem, as well as two basic machine
learning methods for solving it. In short, the problem amounts to using data to learn the relationship
between an input and an output variable. The methods we will introduce are called 𝑘-NN and decision
trees. These two methods are relatively simple but useful on their own, and therefore a good place to start.
Understanding the inner workings, advantages and shortcomings of these basic methods also lays a good
foundation for the more advanced methods that are to come in later chapters.

2.1 The supervised learning problem


The supervised machine learning problem is to use training data to learn (or, equivalently, train) a
mathematical model that relates an output¹ variable 𝑦 to an input² variable x. That mathematical model
can thereafter be used for predicting the output 𝑦 for a new, previously unseen, test data point for which only x
is known. The beauty of machine learning is that it is quite arbitrary what these variables represent, and it
varies from application to application. We saw several examples in Chapter 1 of this problem: Inferring
the type of a skin lesion (𝑦) from an image of the skin (x), deducing the type of road user (𝑦) from a lidar
measurement (x) or predicting the effect on carbon dioxide emissions (𝑦) from a political intervention (x).

Learning from labeled data


In most interesting supervised machine learning applications, the relationship between input x and output 𝑦
is difficult to describe explicitly. It may be too cumbersome or complicated to fully unravel from application
domain knowledge, or even unknown. The problem can therefore usually not be solved by writing a
traditional computer program that takes x as input and returns 𝑦 as output from a set of rules. The
supervised machine learning approach is instead to learn the relationship between x and 𝑦 from data,
which contains examples of observed pairs of input and output values. In other words, supervised machine
learning amounts to learning from examples.
The data used for learning is called training data, and it has to consist of several input-output data
points (samples) (x𝑖 , 𝑦𝑖 ), in total 𝑛 of them. We will compactly write the training data as T = {(x𝑖 , 𝑦𝑖 )}𝑖=1,...,𝑛 .

Each data point in the training data provides a snapshot of how 𝑦 depends on x, and the goal in supervised
machine learning is to squeeze as much information as possible out of T . In this book we will only
consider problems where the individual data points are assumed to be (probabilistically) independent. This
excludes, for example, applications in time series analysis where it is of interest to model the correlation
between x𝑖 and x𝑖+1 .
The fact that the training data contains not only input values x𝑖 , but also output values 𝑦 𝑖 , is the reason
for the term “supervised” machine learning. We may say that each input x𝑖 is accompanied with a label 𝑦 𝑖 ,
or simply that we have labeled data. For some applications, it is only a matter of jointly recording x and 𝑦.
In other applications, the output 𝑦 has to be created by labeling the training data inputs x by a domain
expert. For instance, to construct a training dataset for the skin lesion application, a dermatologist needs
to look at all training data inputs (images) x𝑖 and label them by assigning to the variable 𝑦 𝑖 the type of
lesion that is seen in the image. The entire learning process is thus “supervised” by the domain expert.
We use a vector boldface notation x to denote the input, since we assume it to be a 𝑝-dimensional vector,
x = [𝑥1 𝑥2 . . . 𝑥𝑝 ]ᵀ. Each element of the input vector x represents some information that is considered to
¹ The output is commonly also called response, regressand, label, explained variable, predicted variable or dependent variable.
² The input is commonly also called feature, attribute, predictor, regressor, covariate, explanatory variable, controlled variable
and independent variable.


be relevant, for example the outdoor temperature or the unemployment rate. In an application where the
input is a black and white image, x can be all pixel values in the image, so 𝑝 = ℎ × 𝑤, where ℎ and 𝑤 denote
the height and width of the input image.³ The output 𝑦, on the other hand, is often of low dimension
and throughout most of this book we will assume that it is a scalar value. The type of the output value,
numerical or categorical, turns out to be important and is used to distinguish between two subtypes of the
supervised machine learning problems: regression and classification. We will discuss this next.

Numerical and categorical variables


The variables contained in our data (input as well as output) can be of two different types, numerical or
categorical. A numerical variable has a natural ordering. We can say that one instance of a numerical
variable is larger or smaller than another instance of the same variable. A numerical variable could for
instance be represented by a continuous real number, but it could also be discrete, such as an integer.
Categorical variables, on the other hand, are always discrete and importantly they lack a natural ordering.
In this book we assume that any categorical variable can only take a finite number of different values. A
few examples are given in Table 2.1 below.

Table 2.1: Examples of numerical and categorical variables.

Variable type                               | Example                              | Handled as
Number (continuous)                         | 32.23 km/h, 12.50 km/h, 42.85 km/h   | Numerical
Number (discrete) with natural ordering     | 0 children, 1 child, 2 children      | Numerical
Number (discrete) without natural ordering  | 1 = Sweden, 2 = Denmark, 3 = Norway  | Categorical
Text string                                 | Hello, Goodbye, Welcome              | Categorical

The distinction between numerical and categorical is sometimes somewhat arbitrary. We could for
instance argue that having no children is something qualitatively different than having children, and use
the categorical variable “children: yes/no”, instead of the numerical “0, 1 or 2 children”. In other words, it
is sometimes left to the machine learning engineer to decide whether a certain variable is considered to be
numerical or categorical.
The notion of categorical vs. numerical applies to both the output variable 𝑦 and to the 𝑝 elements 𝑥𝑗
of the input vector x = [𝑥1 𝑥2 . . . 𝑥𝑝 ]ᵀ. The 𝑝 input variables do not all have to be of the same type. It is
perfectly fine (and common in practice) to have a mix of categorical and numerical inputs.

Classification and regression


We distinguish between different supervised machine learning problems by the type of the output⁴ 𝑦.

Regression means that the output is numerical, and classification means that the output is categorical.

The reason for this distinction is that the regression and classification problems have somewhat different
properties, and require different methods to be solved.
Note that the 𝑝 input variables x = [𝑥 1 𝑥 2 . . . 𝑥 𝑝 ] T can be either numerical or categorical for both
regression and classification problems. It is only the type of the output that determines whether a problem
is a regression or a classification problem. A method which solves a classification problem is often called
a classifier, and when a classifier predicts the wrong output for a data point it is called a misclassification.
For classification the output is categorical and can therefore only take values in a finite set. We use 𝑀 to
denote the number of elements in the set of possible output values. It can, for instance, be {false, true}
³ For image-based problems it is often more convenient to represent the input as a matrix of size ℎ × 𝑤 than as a vector of length
𝑝 = ℎ𝑤, but the dimension is nevertheless the same. We will get back to this in Chapter 6 when discussing the convolutional
neural network, a model structure tailored to image-type inputs.
⁴ In the classical machine learning literature, classification is sometimes also called pattern recognition. The term “pattern”
here refers to an input sample x𝑖 and “recognition” should be read as a synonym to “classifying”.


(𝑀 = 2) or {Sweden, Norway, Finland, Denmark} (𝑀 = 4). We will refer to these elements as classes
or labels. The number of classes 𝑀 is assumed to be known in the classification problem. To prepare for a
concise mathematical notation, we use integers 1, 2, . . . , 𝑀 to denote the output classes if 𝑀 > 2. The
ordering of the integers is arbitrary, and does not imply any ordering of the classes. When there are only
𝑀 = 2 classes, we have the important special case of binary classification. In binary classification we use
the labels −1 and 1 (instead of 1 and 2). Occasionally we will also use the equivalent terms negative and
positive class. The only reason for using a different convention for binary classification is that it gives a
more compact mathematical notation for some of the methods, and carries no deeper meaning. Let us now
have a look at a classification and a regression problem, both of which we will return to later in this book.

Example 2.1: Classifying songs

Say that we want to build a “song categorizer” app, where the user records a song and the app answers by
telling if the song has the artistic style of either the Beatles, Kiss or Bob Dylan. At the heart of this fictitious
app there has to be a machinery that takes an audio recording as an input and returns an artist name (the
Beatles, Kiss or Bob Dylan).
If we first collect some recordings with songs from the three groups/artists (where we know which artist is
behind each song; a labeled dataset), we could use supervised machine learning to learn the characteristics
of their different styles and therefrom predict the artist of the new user-provided song. In the supervised
machine learning terminology, the artist name (the Beatles, Kiss or Bob Dylan) is the output 𝑦. In this
problem 𝑦 is categorical, and we are hence facing a classification problem.
One of the important design choices for a machine learning engineer is a detailed specification of what
the input x really is. It would in principle be possible to consider the raw audio information as input, but
that would give a very high-dimensional x which (unless an audio-specific machine learning method is
used) most likely would require an unrealistically large amount of training data in order to be successful (we
will discuss this aspect in detail in Chapter 4). A better option could therefore be to define some summary
statistics (features) of audio recordings, and use those as input x instead. As input we could for example
use the length of the audio recording and the amount of energy in it. The length of a recording is easy to
measure, and since some songs are rather long it seems natural to take the logarithm of it. The energy is a bit
trickier (the exact definition may even be ambiguous), but we can leave that to the audio experts and
re-use a piece of software that they have written for this purposeᵃ without bothering too much about its inner
workings. As long as this piece of software returns a number for any recording that is sent into it (and always
returns the same number for the same recording), we can use it as an input to a machine learning method.
Below we have plotted a dataset with 230 songs from the three artists. Each song is represented by a dot,
where the horizontal axis is the logarithm of its length (measured in seconds) and the vertical axis the energy
(on a scale 0-1). When we later return to this example, and apply different supervised machine learning
methods to it, this data will be the training data.

[Scatter plot of the 230 songs: Length (ln s) on the horizontal axis, roughly from 4.4 to 7, and Energy (scale 0-1) on the vertical axis; the legend distinguishes The Beatles, Kiss and Bob Dylan, and the songs “Rock and roll all nite”, “Help!” and “A hard rain’s a-gonna fall” are pointed out.]

ᵃ We use http://api.spotify.com/ here.


Example 2.2: Car stopping distances

Ezekiel and Fox (1959) present a dataset with 62 observations of the distance needed for various cars at
different initial speeds to brake to a full stop.ᵃ The dataset has the following two variables:
- Speed: The speed of the car when the brake signal is given.
- Distance: The distance traveled after the signal is given until the car has reached a full stop.

[Scatter plot of the 62 data points: Speed (mph) on the horizontal axis, roughly 0 to 40, and Distance (feet) on the vertical axis, roughly 0 to 150.]

To make a supervised machine learning problem out of it, we interpret Speed as the input variable 𝑥,
and Distance as the output variable 𝑦. Since 𝑦 is numerical, this is a regression problem. We then ask
ourselves what the stopping distance would be if the initial speed were, for example, 33 mph or 45
mph, respectively (two speeds at which no data has been recorded). Another way to frame this question is to
ask for the prediction ŷ(𝑥★) for 𝑥★ = 33 and 𝑥★ = 45.
ᵃ The data is somewhat dated, so the conclusions are perhaps not applicable to modern cars.

Time to reflect 2.1: For each application example discussed in Chapter 1, can you determine the
input x, the output 𝑦, and whether it is a regression or classification problem?

Generalizing beyond training data


There are two primary reasons why it is of interest to design methods for learning mathematical models
of input–output relationships from training data.
(i) To reason about and explore how input and output variables are connected. An often encountered
task in sciences such as medicine and sociology is to determine whether a correlation between a
pair of variables exists or not (“does sea food increase the life expectancy?”). Such questions can be
addressed by learning a mathematical model and carefully reasoning about the chance that the learned
relationships between input x and output 𝑦 are due only to random effects in the data, or if there
appears to be some substance to the found relationships.

(ii) To predict the output value 𝑦★ for some new, previously unseen input x★. By first learning a
mathematical model of the input–output relationship from the training data, we can make a prediction
by inserting the test input x★ into the model and using the output from the model, ŷ(x★), as a prediction
of the true (but unknown) output 𝑦★ associated with x★. The hat ˆ indicates that the prediction is
an estimate (that is, an educated guess) of the true output.
These two objectives are sometimes used to roughly distinguish between classical statistics (i) and
machine learning (ii), even though the distinction is not completely clear-cut (predictive modeling is a topic of classical
statistics and explainable models are a topic within machine learning). The primary focus in this book,
however, is on learning models for making predictions, objective (ii) above, which is the foundation of
supervised machine learning. Our overall goal is to obtain as accurate predictions ŷ(x★) as possible

(measured in some appropriate way) for a wide range of possible test inputs x★. We say that we are
interested in models that generalize well beyond the training data.
A model that generalizes well for the music example above would be able to correctly tell the artist
of a new song which was not in the training data (assuming that the artist of the new song is one of
the three that was present in the training data, of course). The ability to generalize to new data is a key
concept of machine learning. It is not difficult to construct models that give very accurate predictions if
they are only evaluated on the training data (we will see an example in the next section). If the model is
not able to generalize, meaning that the predictions are poor when the model is applied to new test data
points, then the model is of little use in practice for making predictions. If that is the case we say that the
model is overfitting to the training data. We will illustrate the issue of overfitting for a specific machine
learning model in the next section and in Chapter 4 we will return to this concept using a more general and
mathematical approach.


2.2 A distance-based method: k-NN


It is now high time to encounter the first actual machine learning method. We will start with the relatively
simple 𝑘-NN method, which can be used for both regression and classification. Remember that the
setting is that we have access to training data {(x𝑖 , 𝑦𝑖 )}𝑖=1,...,𝑛 , which consists of 𝑛 data points with input x𝑖 and
corresponding output 𝑦𝑖 . From this we want to construct a prediction ŷ(x★) for what we believe the output
𝑦★ would be for a new x★, which we have not seen previously.

The k-nearest neighbors method


Most methods for supervised machine learning build on the intuition that if the test data point x★ is close
to a training data point x𝑖 , then the prediction ŷ(x★) should be close to 𝑦𝑖 . This is a general idea, but one
simple way to implement it in practice is the following: First, compute the distance⁵ between the test
input and all training inputs, ‖x𝑖 − x★‖ for 𝑖 = 1, . . . , 𝑛. Second, we find the training data point x𝑗 with the shortest
distance to x★ and let the prediction be ŷ(x★) = 𝑦𝑗 .
This simple prediction method is referred to as the 1-nearest neighbor method. It is not very complicated,
but for most machine learning applications of interest it is too simplistic. In practice we can rarely say
for certain what the output value 𝑦 will be, but we should mathematically describe 𝑦 as a random variable.
Consequently, it makes sense to consider the data as noisy, that is, being affected by random noise. In this
perspective, the shortcoming of 1-nearest neighbor is that the prediction relies only on one data point from
the training data, which makes it quite “erratic” and sensitive to noisy training data.
To improve the 1-nearest neighbor method we can extend it to make use of the 𝑘 nearest neighbors
instead. Formally we define the set 𝑅★ = {𝑖 : x𝑖 is one of the 𝑘 training data points closest to x★ } and
aggregate the information from the 𝑘 outputs 𝑦 𝑗 for 𝑗 ∈ 𝑅★ to make the prediction. For regression
problems we take the average of all 𝑦𝑗 for 𝑗 ∈ 𝑅★, and a majority vote⁶ for classification problems. We
illustrate the 𝑘-nearest neighbors (𝑘-NN) method by Example 2.3 and summarize by Method 2.1.
Methods that explicitly use the training data when making predictions are referred to as non-parametric.
This is in contrast with parametric methods, where the prediction is given by some function (a model),
governed by a fixed number of parameters. For parametric methods the training data is used to learn
the parameters in an initial training phase, but once the model has been learned, the training data can be
discarded since it is not used explicitly when making predictions. We will introduce some parametric
models in Chapter 3.

Data: Training data {(x𝑖 , 𝑦𝑖 )}𝑖=1,...,𝑛 and test input x★
Result: Predicted test output ŷ(x★)
1 Compute the distances ‖x𝑖 − x★‖ for all training data points 𝑖 = 1, . . . , 𝑛
2 Let 𝑅★ = {𝑖 : x𝑖 is one of the 𝑘 data points closest to x★ }
3 Compute the prediction ŷ(x★) as

        ŷ(x★) = Average{𝑦𝑗 : 𝑗 ∈ 𝑅★ }          (Regression problems)
        ŷ(x★) = MajorityVote{𝑦𝑗 : 𝑗 ∈ 𝑅★ }     (Classification problems)

Method 2.1: 𝑘-nearest neighbor, 𝑘-NN

⁵ Other distance functions can also be used, and will be discussed in Chapter 8. Categorical input variables can be handled as we
will discuss in Chapter 3.
⁶ Ties can be handled in different ways, for instance by a coin-flip, or by reporting the actual vote count to the end user who gets
to decide what to do with it.
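
As an aside, Method 2.1 is straightforward to sketch in code. The following Python/NumPy snippet is our own illustration (function and variable names are not taken from the text); it implements both the regression and the classification variant for a single test input:

import numpy as np

def knn_predict(X_train, y_train, x_star, k, classification=True):
    # Step 1: distances ||x_i - x_star|| to all n training inputs
    distances = np.linalg.norm(X_train - x_star, axis=1)
    # Step 2: indices of the k training points closest to x_star (the set R_star)
    neighbors = np.argsort(distances)[:k]
    # Step 3: aggregate the k outputs
    if classification:
        labels, counts = np.unique(y_train[neighbors], return_counts=True)
        return labels[np.argmax(counts)]   # majority vote (ties: first label wins)
    return y_train[neighbors].mean()       # average (regression)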


Example 2.3: Predicting colors with 𝑘-NN

We consider a binary classification problem (𝑀 = 2). We are given a training dataset with 𝑛 = 6 observations
of 𝑝 = 2 input variables 𝑥1 , 𝑥2 and one categorical output 𝑦, the color Red or Blue,

𝑖 𝑥1 𝑥2 𝑦
1 −1 3 Red
2 2 1 Blue
3 −2 2 Red
4 −1 2 Blue
5 −1 0 Blue
6 1 1 Red

and we are interested in predicting the output for x★ = [1 2] T . For this purpose we will explore two different
𝑘-NN classifiers, one using 𝑘 = 1 and one using 𝑘 = 3.
First, we compute the Euclidean distance ‖x𝑖 − x★‖ between each training data point x𝑖 and the test data
point x★, and then sort them in ascending order.

𝑖    ‖x𝑖 − x★‖    𝑦𝑖
6    √1           Red
2    √2           Blue
4    √4           Blue
1    √5           Red
5    √8           Blue
3    √9           Red

[Plot of the six training data points and the test point x★ in the (𝑥1 , 𝑥2 ) plane; an inner circle around x★ encloses the nearest neighbor (𝑘 = 1) and an outer circle encloses the three nearest neighbors (𝑘 = 3).]

Since the closest training data point to x★ is the data point 𝑖 = 6 (Red), it means that for 𝑘-NN with
𝑘 = 1, we get the prediction ŷ(x★) = Red. For 𝑘 = 3, the 3 nearest neighbors are 𝑖 = 6 (Red), 𝑖 = 2 (Blue),
and 𝑖 = 4 (Blue). Taking a majority vote among these three training data points, Blue wins with 2 votes
against 1, so our prediction becomes ŷ(x★) = Blue.
This is also illustrated by the figure above, where the training data points x𝑖 are represented with red
squares or blue triangles depending on which class they belong to. The test data point x★ is represented
with a black filled circle. For 𝑘 = 1 the closest training data point is identified by the inner circle and for
𝑘 = 3 the three closest points are identified by the outer circle.
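
The numbers in Example 2.3 are easy to reproduce; a small check using scikit-learn (assuming the library is available; its KNeighborsClassifier uses the Euclidean distance by default) could be:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[-1, 3], [2, 1], [-2, 2], [-1, 2], [-1, 0], [1, 1]], dtype=float)
y = np.array(["Red", "Blue", "Red", "Blue", "Blue", "Red"])
x_star = np.array([[1.0, 2.0]])

for k in (1, 3):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict(x_star))   # expected: Red for k = 1, Blue for k = 3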


Decision boundaries for a classifier

In Example 2.3 we only computed a prediction for one single test data point x★. That prediction might
indeed be the ultimate goal of the application, but in order to visualize and better understand a classifier
we can also study its decision boundary, which illustrates the prediction for all possible test inputs. We
introduce the decision boundary using Example 2.4. It is a general concept for classifiers, not only 𝑘-NN,
but it is only possible to visualize when the dimension of x is 𝑝 = 2.

Example 2.4: Decision boundaries for the color example

In Example 2.3 we computed the prediction for x★ = [1 2]ᵀ. If we shift that test point one step to
the left, to x★alt = [0 2]ᵀ, the three closest training data points would still include 𝑖 = 6 and 𝑖 = 4, but now 𝑖 = 2
is exchanged for 𝑖 = 1. For 𝑘 = 3 this would give two votes for Red and one vote for Blue and we would
therefore predict ŷ = Red. In between these two test data points, x★ and x★alt, at [0.5 2]ᵀ it is equally far to
𝑖 = 1 as to 𝑖 = 2 and it is undecided if the 3-NN classifier should predict Red or Blue. (In practice this is
most often not a problem, since the test data points rarely end up exactly at the decision boundary. If they do,
this can be handled by a coin-flip.) For all classifiers we always end up with such points in the input space
where the class prediction abruptly changes from one class to another. These points are said to be on the
decision boundary of the classifier.
Continuing in a similar way, changing the location of the test input across the entire input space and
recording the class prediction, we can compute the full decision boundaries for Example 2.3. We plot the
decision boundaries using dashed lines below, both for 𝑘 = 1 and 𝑘 = 3.

[Two panels showing the input space (𝑥1 , 𝑥2 ) for the color example, with the training data points and the dashed decision boundaries of the 𝑘-NN classifier: left panel 𝑘 = 1, right panel 𝑘 = 3. In each panel the regions are marked ŷ = red and ŷ = blue.]

These figures show the decision boundaries, that is, the points in input space where the class prediction
changes. This type of figure gives a concise summary of a classifier, but of course requires that the problem
has a 2-dimensional input in order to make such a plot. As we can see, the decision boundaries of 𝑘-NN are
not linear. In the terminology we will introduce later, 𝑘-NN is thereby a nonlinear classifier.
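
Decision boundary plots like the ones above are typically produced by evaluating the classifier on a dense grid of test inputs and recording where the predicted class changes; a rough sketch of this (our own code, again using scikit-learn) is:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[-1, 3], [2, 1], [-2, 2], [-1, 2], [-1, 0], [1, 1]], dtype=float)
y = np.array(["Red", "Blue", "Red", "Blue", "Blue", "Red"])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Evaluate the classifier on a grid covering the input space; the decision
# boundary lies where the prediction changes between neighboring grid points.
x1, x2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-1, 4, 200))
grid = np.column_stack([x1.ravel(), x2.ravel()])
predictions = clf.predict(grid).reshape(x1.shape)
# predictions can now be drawn with, for example, matplotlib's contourf.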

Choosing k

The user has to decide on which 𝑘 to use in 𝑘-NN and this decision has a big impact on the final predictions.
To understand the impact of 𝑘, we study how the decision boundary changes as 𝑘 changes in Figure 2.1,
where 𝑘-NN is applied to the music classification Example 2.1 and car stopping distance Example 2.2,
both with 𝑘 = 1 as well as 𝑘 = 20.
With 𝑘 = 1 all training data points will, by construction, be correctly predicted and the model is adapted
to the exact x and 𝑦 values of the training data. In the classification problem there are for instance small
green (Bob Dylan) regions within the red (the Beatles) area that are most likely not relevant when it comes
to accurately predicting the artist of a new song. In order to make good predictions it would probably be
better to instead predict red (the Beatles) for a new song in the entire middle-left region since the vast
majority of training data points in that area are red. For the regression problem 𝑘 = 1 gives a quite shaky


behavior, and also for this problem it is intuitively clear that this does not describe an interesting effect,
but rather noise.

[Figure 2.1 consists of four panels: two decision-boundary plots for the music classification data (Length (ln s) vs. Energy (scale 0-1), with the classes Beatles, Kiss and Bob Dylan) and two prediction plots for the car stopping distance data (Speed (mph) vs. Distance (feet)).]

(a) Decision boundaries for the music classification problem using 𝑘 = 1. This is a typical example of overfitting, meaning that the model has adapted too much to the training data so that it does not generalize well to new previously unseen data.

(b) The music classification problem again, now using 𝑘 = 20. A higher value of 𝑘 gives a smoother behavior which, hopefully, predicts the artist of new songs more accurately.

(c) The black dots are the car stopping distance data, and the blue line shows the prediction for 𝑘-NN with 𝑘 = 1 for any 𝑥. As for the classification problem above, 𝑘-NN with 𝑘 = 1 overfits to the training data.

(d) The car stopping distance data, this time with 𝑘 = 20. Except for the boundary effect at the right, this seems like a much more useful model which captures the interesting effects of the data and ignores the noise.

Figure 2.1: 𝑘-NN applied to the music classification Example 2.1 (a and b) and the car stopping distance Example 2.2
(c and d). For both problems 𝑘-NN is applied with 𝑘 = 1 as well as 𝑘 = 20.

The drawbacks of using 𝑘 = 1 are not specific to these two examples. In most real world problems there
is a certain amount of randomness involved in collecting the training data (or at least some decisions
which we know little about, and may consider as being random). In the music example the 𝑛 = 230 songs
were somehow selected from all songs ever recorded by these artists, and for the car stopping distance
there appears to be a certain amount of random effects also in 𝑦. Thus, by using 𝑘 = 1 and thereby
adapting very closely to the training data, the predictions will depend not only on the interesting patterns
in the problem, but also on the (more or less) random effects that have shaped the training data. Typically
we are not interested in capturing these effects, and we refer to this as overfitting.
With the 𝑘-NN classifier we can mitigate overfitting by increasing the region of the neighborhood used to
compute the prediction, that is, increase the parameter 𝑘. With, for example, 𝑘 = 20 the predictions are no
longer based on only the closest neighbor, but instead a majority vote among the 20 closest neighbors. As
a consequence all training data points are no longer perfectly classified, but some of the songs end up
in the wrong region in Figure 2.1b. The predictions are however less adapted to the peculiarities of the


training data and thereby less overfitted, and Figure 2.1b as well as Figure 2.1d are indeed less “noisy”
than Figure 2.1a and Figure 2.1c. Selecting 𝑘 is thus a trade-off between flexibility and rigidity. Indeed,
selecting 𝑘 too big will lead to a meaningless classifier, so there must exist a sweet spot for some moderate
𝑘 (possibly 20, but could be less or more) where the classifier generalizes the best. Unfortunately there
is no general answer for which 𝑘 this happens, but it is different for different problems. In the music
classification it seems reasonable that 𝑘 = 20 will predict new test data points better than 𝑘 = 1, but
there might very well be an even better choice of 𝑘. For the car stopping problem the behavior is also
more reasonable for 𝑘 = 20 than 𝑘 = 1, except for the boundary effect for large 𝑥 where 𝑘-NN is unable
to capture the trend in the data as 𝑥 increases (simply because the 20 nearest neighbors are the same
for all test points 𝑥★ around and above 35). A systematic way of choosing a good value of 𝑘 is to use
cross-validation, which we will discuss in Chapter 4.
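
Although cross-validation is postponed to Chapter 4, the basic idea of comparing different values of 𝑘 on data that was held out from training can be sketched already here. This is our own illustration on synthetic data, not a recipe from the text:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(230, 2))                               # synthetic inputs
y = (X[:, 0] + 0.3 * rng.normal(size=230) > 0).astype(int)  # noisy synthetic labels

# Hold out 30% of the data; it plays the role of "new, previously unseen" points.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 20, 50):
    accuracy = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    print(f"k = {k:2d}: validation accuracy {accuracy:.2f}")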

Time to reflect 2.2: The prediction ŷ(x★) obtained using the 𝑘-NN method is a piece-wise constant
function of the input x★. For a classification problem this is natural, since the output is categorical
(see, for example, Figure 2.1 where the colored regions correspond to areas of the input space
where the prediction is constant according to the color of that region). However, 𝑘-NN has
piece-wise constant predictions also for regression. Why?

Input normalization
A final important practical aspect when using 𝑘-NN is normalization of the input data.
Since 𝑘-NN is based on the Euclidean distance between points, it is important that this distance is a relevant measure
of how close two data points are. Imagine a training dataset with 𝑝 = 2 input variables x = [𝑥1 𝑥2 ]ᵀ where
all values of 𝑥1 are in the range [100, 1100] and the values for 𝑥2 are in the much smaller range [0, 1]. It
could for example be that 𝑥1 and 𝑥2 are measured in different units. The Euclidean distance between a test
point x★ and a training data point x𝑖 is ‖x𝑖 − x★‖ = √((𝑥𝑖1 − 𝑥★1 )² + (𝑥𝑖2 − 𝑥★2 )²), and this expression will
be totally dominated by the first term (𝑥𝑖1 − 𝑥★1 )² whereas the second term (𝑥𝑖2 − 𝑥★2 )² will almost not
matter in practice, due to the different magnitudes of 𝑥1 and 𝑥2 . That is, the different ranges lead to 𝑥1
being considered much more important than 𝑥2 by 𝑘-NN.
To avoid this undesired effect we have to scale the input variables. One option, in the mentioned
example, could be to subtract 100 from 𝑥1 and thereafter divide by 1000, creating 𝑥𝑖1^new = (𝑥𝑖1 − 100)/1000, such
that 𝑥1^new and 𝑥2 both are in the range [0, 1]. More generally, this normalization procedure for the input
data can be written as

x_{ij}^{\text{new}} = \frac{x_{ij} - \min_\ell (x_{\ell j})}{\max_\ell (x_{\ell j}) - \min_\ell (x_{\ell j})}, \quad \text{for all } j = 1, \dots, p, \; i = 1, \dots, n.     (2.1)

Another common way of normalizing (sometimes called standardizing) is by using the mean and standard
deviation in the training data:

x_{ij}^{\text{new}} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}, \quad \forall j = 1, \dots, p, \; i = 1, \dots, n,     (2.2)

where 𝑥¯𝑗 and 𝜎𝑗 are the mean and standard deviation for each input variable, respectively.
It is crucial for 𝑘-NN to apply some type of input normalization (that was indeed done in Figure 2.1),
but it is good practice to apply it also when using other methods, at least for numerical stability. It is
however important to compute the scaling factors (minℓ (𝑥ℓ𝑗 ), 𝑥¯𝑗 , etc.) using training data only and apply
that scaling also to future test data points. Failing this, for example by performing normalization before
setting test data aside (which we will discuss more in Chapter 4), might lead to wrong conclusions on how
well the method will perform in predicting future not yet seen data points.
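
Both (2.1) and (2.2) are one-liners to implement; the important point, as stressed above, is that the scaling factors are computed from the training data only and then re-used for any test data. A minimal NumPy sketch (our own code) of this pattern:

import numpy as np

def fit_minmax(X_train):
    # scaling factors in (2.1), computed from the training data only
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_minmax(X, x_min, x_max):
    # apply (2.1) with previously computed factors (also to test data)
    return (X - x_min) / (x_max - x_min)

def fit_standardize(X_train):
    # mean and standard deviation per input variable, as in (2.2)
    return X_train.mean(axis=0), X_train.std(axis=0)

def apply_standardize(X, mean, std):
    return (X - mean) / std

X_train = np.array([[100.0, 0.1], [600.0, 0.5], [1100.0, 0.9]])
x_min, x_max = fit_minmax(X_train)                       # from training data only
X_test_scaled = apply_minmax(np.array([[350.0, 0.2]]), x_min, x_max)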


2.3 A rule-based method: Decision trees

The 𝑘-NN method results in a prediction ŷ(x★) that is a piece-wise constant function of the input x★. That
is, the method partitions the input space into disjoint regions and each region is associated with a certain
(constant) prediction. For 𝑘-NN, these regions are given implicitly by the 𝑘-neighborhood of each possible
test input. An alternative approach, that we will study in this section, is to come up with a set of rules that
define the regions explicitly. For instance, considering the data in Figure 2.1, a simple set of high-level
rules for constructing a classifier would be: inputs on the right side are classified as green, in the lower
left corner as blue, and in the upper left corner as red. We will now see how such rules can be learned
systematically from the training data.

Predicting using a decision tree

The rule-based models that we consider here are referred to as decision trees. The reason is that the rules
used to define the model can be organized in a graph structure referred to as a binary tree. The decision
tree effectively divides the input space into multiple disjoint regions and in each region a constant value is
used for the prediction ŷ(x★). We illustrate this with an example.
Example 2.5: Predicting colors with a decision tree

We consider a classification problem with two numerical input variables x = [𝑥1 𝑥2 ] T and one categorical
output 𝑦, the color Red or Blue. For now we do not consider any training data or how to actually learn the
tree, but only how an already existing decision tree can be used to predict ŷ(x★).
The rules defining the model are organized in the graph below (left figure) which is referred to as a binary
tree. To use this tree to predict a label for the test input x★ = [𝑥★1 𝑥★2 ] T we start at the top, referred to as the
root node of the tree (in the metaphor the tree is growing upside down, with the root at the top and the leaves
at the bottom). If the condition stated at the root is true, that is, if 𝑥★2 < 3.0, then we proceed down the left
branch, otherwise along the right branch. If we reach a new internal node of the tree, we check the rule
associated with that node and pick the left or the right branch accordingly. We continue and work our way
down until we reach the end of a branch. Each such final branch corresponds to a constant prediction ŷ𝑚 , in
this case one of the two classes Red or Blue.
[Left panel: a classification tree with root split 𝑥2 < 3.0, a second internal split 𝑥1 < 5.0, and three leaf nodes with predictions ŷ1 = Blue, ŷ2 = Blue and ŷ3 = Red. Right panel: the corresponding partition of the (𝑥1 , 𝑥2 ) plane into the regions 𝑅1 , 𝑅2 and 𝑅3 .]

Left: A classification tree. At each internal node a rule of the form 𝑥𝑗 < 𝑠𝑘 indicates the left branch coming from that split, and the right branch then consequently corresponds to 𝑥𝑗 ≥ 𝑠𝑘 . This tree has two internal nodes (including the root) and three leaf nodes.

Right: A region partition, where each region corresponds to a leaf node in the tree. Each border between regions corresponds to a split in the tree. Each region is colored with the prediction corresponding to that region, and the boundary between red and blue is therefore the decision boundary.

The decision tree partitions the input space into axis-aligned “boxes”, as shown in the right panel above.
By increasing the depth of the tree (the number of steps from the root to the leaves), the partitioning can be
made finer and finer, thereby describing more complicated functions of the input variable.


A pseudo code for predicting a test input with the tree above would look like

if x_2 < 3.0 then
    return Blue
else
    if x_1 < 5.0 then
        return Blue
    else
        return Red
    end
end
As an example, if we have x★ = [2.5 3.5]ᵀ, in the first split we would take the right branch since
𝑥★2 = 3.5 ≥ 3.0, and in the second split we would take the left branch since 𝑥★1 = 2.5 < 5.0. The prediction
for this test point would be ŷ(x★) = Blue.

To set the terminology, the endpoints of the branches, 𝑅1 , 𝑅2 and 𝑅3 in Example 2.5, are called leaf nodes,
and the internal splits, 𝑥2 < 3.0 and 𝑥1 < 5.0, are known as internal nodes. The lines that connect the
nodes are referred to as branches. The tree is referred to as binary since each internal node splits into
exactly two branches.
With more than two input variables it is difficult to illustrate the partitioning of the input space into
regions (right figure in the example), but the tree representation can still be used in precisely the same way.
Each internal node corresponds to a rule where one of the 𝑝 input variables 𝑥 𝑗 , 𝑗 = 1, . . . , 𝑝, is compared
with a threshold 𝑠. If 𝑥 𝑗 < 𝑠 we continue along the left branch and if 𝑥 𝑗 ≥ 𝑠 we continue along the right
branch.
The constant predictions that we associate with the leaf nodes can be either categorical, as in the example
above, or numerical, and decision trees can thus be used to address both classification and regression
problems. In fact, the models that we describe here also go by the name Classification And Regression
Trees (CART). Example 2.5 illustrates how a decision tree is used for making a prediction. We will now
turn to the question of how the tree can be learned from training data.

Learning a regression tree


We will now discuss how to learn (or, equivalently, train) a decision tree for a regression problem. Training
a decision tree for classification is conceptually similar and explained in the next section.
As discussed above, the prediction ŷ(x★) corresponding to a classification or regression tree is a
piece-wise constant function of the input x★. We can write this mathematically as

\hat{y}(\mathbf{x}_\star) = \sum_{\ell=1}^{L} \hat{y}_\ell \, \mathbb{I}\{\mathbf{x}_\star \in R_\ell\},     (2.3)

where 𝐿 is the total number of regions (leaf nodes) in the tree, 𝑅ℓ is the ℓth region, and ŷℓ is the constant
prediction for the ℓth region. Note that in the regression setting ŷℓ is a numerical variable, and we will
consider it to be a real number for simplicity. In the equation above we have used the indicator function,
I{x ∈ 𝑅ℓ } = 1 if x ∈ 𝑅ℓ and I{x ∈ 𝑅ℓ } = 0 otherwise.
Learning the tree from data corresponds to finding suitable values for the parameters defining the
function (2.3), namely the regions 𝑅ℓ and the constant predictions ŷℓ , ℓ = 1, . . . , 𝐿, as well as the total
size of the tree 𝐿. If we start by assuming that the shape of the tree, the partition (𝐿, {𝑅ℓ }ℓ=1,...,𝐿 ), is known,
then we can compute the constants {ŷℓ }ℓ=1,...,𝐿 in a natural way, simply as the average of the training data
points falling in each region:

ŷℓ = Average{𝑦𝑖 : x𝑖 ∈ 𝑅ℓ }.

It remains however to find the shape of the tree, the regions 𝑅ℓ , which requires a bit more work. The
basic idea is of course to select the regions so that the tree fits the training data. This means that the output
predictions from the tree should match the output values in the training data. Unfortunately, even when


restricting ourselves to seemingly simple regions such as the “boxes” obtained from a decision tree, finding
the tree (a collection of splitting rules) that optimally partitions the input space to fit the training data as
well as possible turns out to be computationally infeasible. The problem is that there is a combinatorial
explosion in the number of ways in which we can partition the input space. Searching through all possible
binary trees is not possible in practice unless the tree size is so small that it is practically useless.
To handle this situation we use a greedy algorithm known as recursive binary splitting for learning the
tree. The word “recursive” means that we will determine the splitting rules one after the other, starting
with the first split at the root and then build the tree from top to bottom. The word “greedy” means that the tree
is constructed one split at a time, without having the complete tree “in mind”. That is, when determining
the splitting rule at the root node, the objective is to obtain a model that explains the training data as well
as possible after a single split, without taking into consideration that additional splits may be added before
arriving at the final model. When we have decided on the first split of the input space (corresponding to
the root node of the tree), this split is kept fixed and we continue in a similar way for the two resulting
half-spaces (corresponding to the two branches of the tree), etc.
To see in detail how one step of this algorithm works, consider the setting when we are about to do our
very first split at the root of the tree. Hence, we want to select one of the 𝑝 input variables 𝑥 1 , . . . , 𝑥 𝑝 and
a corresponding cutpoint 𝑠 which divides the input space into two half-spaces,
𝑅1 ( 𝑗, 𝑠) = {x | 𝑥 𝑗 < 𝑠} and 𝑅2 ( 𝑗, 𝑠) = {x | 𝑥 𝑗 ≥ 𝑠}. (2.4)
Note that the regions depend on the index 𝑗 of the splitting variable as well as the value of the cutpoint 𝑠,
which is why we write them as functions of 𝑗 and 𝑠. This is the case also for the predictions associated
with the two regions,
ŷ1 ( 𝑗, 𝑠) = Average{𝑦𝑖 : x𝑖 ∈ 𝑅1 ( 𝑗, 𝑠)}    and    ŷ2 ( 𝑗, 𝑠) = Average{𝑦𝑖 : x𝑖 ∈ 𝑅2 ( 𝑗, 𝑠)},

since the sums in these expressions range over different data points depending on the regions.
For each training data point (x𝑖 , 𝑦 𝑖 ) we can compute a prediction error by first determining which region
the data point falls in, and then computing the difference between 𝑦 𝑖 and the constant prediction associated
with that region. Doing this for all training data points, the sum of squared errors can be written as

\sum_{i:\, \mathbf{x}_i \in R_1(j,s)} (y_i - \hat{y}_1(j,s))^2 + \sum_{i:\, \mathbf{x}_i \in R_2(j,s)} (y_i - \hat{y}_2(j,s))^2.     (2.5)

The square is added to ensure that the expression above is non-negative. The squared error is a common
loss function used for measuring the closeness of a model’s prediction and the training data, but other loss
functions can also be used. We will discuss the choice of loss function in more detail in later chapters.
To find the optimal split we select the values for 𝑗 and 𝑠 that minimize the squared error (2.5). This
minimization problem can be solved easily by looping through all possible values for 𝑗 = 1, . . . , 𝑝. For
each 𝑗 we can scan through the finite number of possible splits, and pick the pair ( 𝑗, 𝑠) for which the
expression above is minimized. As pointed out above, when we have found the optimal split at the
root node, this splitting rule is fixed. We then continue in the same way for the left and right branch
independently. Each branch (corresponding to a half-space) is split again by minimizing the squared
prediction error over all training data points corresponding to that branch.
In principle, we can continue in this way until there is only a single training data point in each of the
regions, that is, until 𝐿 = 𝑛. Such a fully grown tree will result in predictions that exactly match the training
data points, and the resulting model is quite similar to 𝑘-NN with 𝑘 = 1. As pointed out above, this will
typically result in a too erratic model that has overfitted to (possibly noisy) training data. To mitigate this
issue, it is common to stop the growth of the tree at an earlier stage, for instance by adding a constraint on
the minimum number of training data points associated with each leaf node. Forcing the model to have
more training data points in each leaf will result in an averaging effect, similarly to increasing the value
of 𝑘 in the 𝑘-NN method. Note that using such a stopping criterion means that the value of 𝐿 is not set
manually, but determined adaptively based on the result of the learning procedure. The full algorithm is
summarized as Method 2.2. Note that the learning in Method 2.2 includes a recursive call, where we in
each recursion grow one branch of the tree one step further.
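To make the split-selection step concrete, the following sketch (Python with NumPy; the function and variable names are our own assumptions, not taken from the book) shows how the best split of one region can be found for a regression tree with the squared error criterion (2.5).

import numpy as np

def best_split(X, y):
    # Exhaustively search all input variables j and cutpoints s,
    # and return the split (j, s) minimizing the squared error (2.5).
    n, p = X.shape
    best = None  # tuple (error, j, s)
    for j in range(p):
        # Candidate cutpoints: midpoints between sorted unique values of x_j.
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            left = X[:, j] < s
            right = ~left
            # Constant predictions (averages) in each half-space.
            err = np.sum((y[left] - y[left].mean()) ** 2) + \
                  np.sum((y[right] - y[right].mean()) ** 2)
            if best is None or err < best[0]:
                best = (err, j, s)
    return best

Calling best_split recursively on the two resulting half-spaces, until a stopping criterion is met, gives the learning procedure summarized in Method 2.2 below.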

Learn a decision tree using recursive binary splitting


Data: Training data $\mathcal{T} = \{\mathbf{x}_i, y_i\}_{i=1}^n$
Result: Decision tree with regions $R_1, \dots, R_L$


1 Initialize the tree with 𝑅 being the whole input space
2 return Split(𝑅,T )
3 Function Split(𝑅,T ):
4 if splitting criteria fulfilled then
5 return 𝑅
6 else
7 Go through all possible splits 𝑥 𝑗 < 𝑠 for all input variables 𝑗 = 1, . . . , 𝑝.
8 Pick the pair ( 𝑗, 𝑠) that minimizes (2.5)/(2.6) for regression/classification problems.
9 Split region 𝑅 into 𝑅1 and 𝑅2 according to (2.4).
10 Split data T into T1 and T2 accordingly.
11 return Split(𝑅1 ,T1 ), Split(𝑅2 ,T2 )
12 end
13 end

Predict from a decision tree


Data: Decision tree with regions $R_1, \dots, R_L$, training data $\mathcal{T} = \{\mathbf{x}_i, y_i\}_{i=1}^n$, test data point $\mathbf{x}_\star$
Result: Predicted test output $\hat{y}(\mathbf{x}_\star)$
1 Find the region $R_\ell$ which $\mathbf{x}_\star$ belongs to.
2 Compute the prediction $\hat{y}(\mathbf{x}_\star)$ as
$$\hat{y}(\mathbf{x}_\star) = \begin{cases} \text{Average}\{y_i : \mathbf{x}_i \in R_\ell\} & \text{(Regression problems)} \\ \text{MajorityVote}\{y_i : \mathbf{x}_i \in R_\ell\} & \text{(Classification problems)} \end{cases}$$

Method 2.2: Decision trees

Classification trees

Trees can also be used for classification. We use the same procedure of recursive binary splitting but with
two main differences.
Firstly, we then use a majority vote instead of an average to compute the prediction associated with each
region,

$$\hat{y}_\ell = \text{MajorityVote}\{y_i : \mathbf{x}_i \in R_\ell\}.$$

Secondly, when learning the tree we need a different splitting criteria than the squared prediction error
to take into account the fact that the output is categorical. To define these criteria we first introduce

$$\hat{\pi}_{\ell m}(j, s) = \frac{1}{n_\ell} \sum_{i:\,\mathbf{x}_i \in R_\ell(j,s)} \mathbb{I}\{y_i = m\}$$

to be the proportion of training observations in the ℓth region that belong to the 𝑚th class.
Similar to (2.5), we now want to minimize

$$\min_{j,s}\; n_1 Q_1 + n_2 Q_2 \quad (2.6)$$

where 𝑛ℓ is the number of data points in region 𝑅ℓ and where 𝑄 ℓ is the splitting criteria.

For 𝑄 ℓ , we have a few options. One simple alternative is the misclassification rate

$$Q_\ell = 1 - \max_m \hat{\pi}_{\ell m}, \quad (2.7)$$

which is simply the proportion of data points in region 𝑅ℓ which do not belong to the most common class.
Another common splitting criterion is the Gini index
$$Q_\ell = \sum_{m=1}^{M} \hat{\pi}_{\ell m}(1 - \hat{\pi}_{\ell m}), \quad (2.8)$$
and yet another is the entropy criterion
$$Q_\ell = -\sum_{m=1}^{M} \hat{\pi}_{\ell m} \ln \hat{\pi}_{\ell m}. \quad (2.9)$$

In Example 2.6 we illustrate how to construct a classification tree using recursive binary splitting, with
the entropy as the splitting criterion.
Example 2.6: Learning a classification tree (continuation of Example 2.5)

We consider the same setup as in Example 2.5, now with the following dataset

𝑥1    𝑥2    𝑦
9.0   2.0   Blue
1.0   4.0   Blue
4.0   6.0   Blue
4.0   1.0   Blue
1.0   2.0   Blue
1.0   8.0   Red
6.0   4.0   Red
7.0   9.0   Red
9.0   8.0   Red
9.0   6.0   Red
[Figure: scatter plot of the ten data points in the (𝑥1, 𝑥2) plane, with the nine candidate splits shown as dashed lines.]

We want to learn a classification tree, by using the entropy criteria in (2.9) and growing the tree until
there are no regions with more than five data points left.
First split: There are infinitely many possible splits we can make, but all splits which gives the same
partition of the data points will be the same. Hence, in practice we only have nine different splits to consider
in this dataset. The data and these splits (dashed lines) are visualized in the figure above.
We consider all nine splits in turn. We start with the split at 𝑥1 = 2.5, which splits the input space into the
two regions 𝑅1 = {x | 𝑥1 < 2.5} and 𝑅2 = {x | 𝑥1 ≥ 2.5}. In region 𝑅1 we have two blue data points and one red, in
total 𝑛1 = 3 data points. The proportions of the two classes in region 𝑅1 will therefore be $\hat{\pi}_{1\text{B}} = 2/3$ and
$\hat{\pi}_{1\text{R}} = 1/3$. The entropy is calculated as
$$Q_1 = -\hat{\pi}_{1\text{B}} \ln(\hat{\pi}_{1\text{B}}) - \hat{\pi}_{1\text{R}} \ln(\hat{\pi}_{1\text{R}}) = -\tfrac{2}{3}\ln\big(\tfrac{2}{3}\big) - \tfrac{1}{3}\ln\big(\tfrac{1}{3}\big) = 0.64.$$
In region 𝑅2 we have 𝑛2 = 7 data points with the proportions $\hat{\pi}_{2\text{B}} = 3/7$ and $\hat{\pi}_{2\text{R}} = 4/7$. The entropy for this
region will be
$$Q_2 = -\hat{\pi}_{2\text{B}} \ln(\hat{\pi}_{2\text{B}}) - \hat{\pi}_{2\text{R}} \ln(\hat{\pi}_{2\text{R}}) = -\tfrac{3}{7}\ln\big(\tfrac{3}{7}\big) - \tfrac{4}{7}\ln\big(\tfrac{4}{7}\big) = 0.68,$$
and inserted into (2.6) the total weighted entropy for this split becomes

𝑛1 𝑄 1 + 𝑛2 𝑄 2 = 3 · 0.64 + 7 · 0.68 = 6.69.

We compute the cost for all other splits in the same manner, and summarize it in the table below.

Split (𝑅1)   𝑛1   π̂1B   π̂1R   𝑄1     𝑛2   π̂2B   π̂2R   𝑄2     𝑛1𝑄1 + 𝑛2𝑄2
𝑥1 < 2.5     3    2/3   1/3   0.64   7    3/7   4/7   0.68   6.69
𝑥1 < 5.0     5    4/5   1/5   0.50   5    1/5   4/5   0.50   5.00
𝑥1 < 6.5     6    4/6   2/6   0.64   4    1/4   3/4   0.56   6.07
𝑥1 < 8.0     7    4/7   3/7   0.68   3    1/3   2/3   0.64   6.69
𝑥2 < 1.5     1    1/1   0/1   0.00   9    4/9   5/9   0.69   6.18
𝑥2 < 3.0     3    3/3   0/3   0.00   7    2/7   5/7   0.60   4.18
𝑥2 < 5.0     5    4/5   1/5   0.50   5    1/5   4/5   0.50   5.00
𝑥2 < 7.0     7    5/7   2/7   0.60   3    0/3   3/3   0.00   4.18
𝑥2 < 8.5     9    5/9   4/9   0.69   1    0/1   1/1   0.00   6.18

From the table we can read that the two splits at 𝑥2 < 3.0 and 𝑥 2 < 7.0 are both equally good. We choose to
continue with 𝑥2 < 3.0.

[Figure: two panels, "After first split" (regions 𝑅1 and 𝑅2) and "After second split" (regions 𝑅1, 𝑅2 and 𝑅3), showing the partition of the (𝑥1, 𝑥2) plane; the remaining candidate splits of 𝑅2 are shown as dashed lines in the left panel.]
Second split: We notice that only 𝑅2 has more than five data points. Also there is no point splitting
region 𝑅1 further since it only contains data points from the same class. In the next step we therefore split
the second region into two new regions 𝑅2 and 𝑅3 . All possible splits are displayed above to the left (dashed
lines) and we compute their cost in the same manner as before.
Split (𝑅2)   𝑛2   π̂2B   π̂2R   𝑄2     𝑛3   π̂3B   π̂3R   𝑄3     𝑛2𝑄2 + 𝑛3𝑄3
𝑥1 < 2.5 2 1/2 1/2 0.69 5 1/5 4/5 0.50 3.89
𝑥1 < 5.0 3 2/3 1/3 0.63 4 0/4 4/4 0.00 1.91
𝑥1 < 6.5 4 2/4 2/4 0.69 3 0/3 3/3 0.00 2.77
𝑥1 < 8.0 5 2/5 3/5 0.67 2 0/2 2/2 0.00 3.37
𝑥2 < 5.0 2 1/2 1/2 0.69 5 1/5 4/5 0.50 3.88
𝑥2 < 7.0 4 2/4 2/4 0.69 3 0/3 3/3 0.00 2.77
𝑥2 < 8.5 6 2/6 4/6 0.64 1 0/1 1/1 0.00 3.82

The best split is the one at 𝑥1 < 5.0 visualized above to the right. The final tree and partition were displayed
in Example 2.5. None of the three regions has more than five data points. Therefore, we terminate the
training.
If we want to use the tree for prediction, we predict blue if x★ ∈ 𝑅1 or x★ ∈ 𝑅2 since the blue training data
points are in majority in each of these two regions. Similarly, we predict red if x★ ∈ 𝑅3 . This tree is also
visualized in Example 2.5.
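As a complement to the hand calculations in Example 2.6, the following sketch (Python with NumPy; our own variable names, not taken from the book) evaluates the weighted entropy cost 𝑛1𝑄1 + 𝑛2𝑄2 of (2.6) for a candidate split and reproduces two of the table entries above.

import numpy as np

x1 = np.array([9.0, 1.0, 4.0, 4.0, 1.0, 1.0, 6.0, 7.0, 9.0, 9.0])
x2 = np.array([2.0, 4.0, 6.0, 1.0, 2.0, 8.0, 4.0, 9.0, 8.0, 6.0])
y  = np.array(['B', 'B', 'B', 'B', 'B', 'R', 'R', 'R', 'R', 'R'])

def entropy(labels):
    # Entropy criterion (2.9) for the labels in one region.
    _, counts = np.unique(labels, return_counts=True)
    pi = counts / counts.sum()
    return -np.sum(pi * np.log(pi))

def split_cost(x, y, s):
    # Weighted entropy n1*Q1 + n2*Q2 in (2.6) for the split x < s.
    in_R1 = x < s
    return in_R1.sum() * entropy(y[in_R1]) + (~in_R1).sum() * entropy(y[~in_R1])

print(split_cost(x1, y, 2.5))  # approximately 6.69, as in the first row of the table
print(split_cost(x2, y, 3.0))  # approximately 4.18, one of the two best splits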

When choosing between the different splitting criteria mentioned above, the misclassification rate
sounds like a reasonable choice, since that is typically the criterion we want the final model to do well on.7
However, one drawback is that it does not favor pure nodes, by which we mean nodes where most
of the data points belong to a certain class.

7 This is not always true, for example for imbalanced and asymmetric classification problems; see further Section 4.5.


Figure 2.2: Three splitting criteria for classification trees as a function of the proportion of the first class 𝑟 = 𝜋ℓ1 in
a certain region 𝑅ℓ. The entropy criterion has been scaled such that it passes through (0.5, 0.5).

It is usually an advantage to favor pure nodes in the greedy
procedure that we use to grow the tree, since it can lead to a total of fewer splits. Both the entropy criterion
and the Gini index favor node purity more than the misclassification rate does.
This advantage can also be illustrated using Example 2.6. Consider the first split in this example. If
we used the misclassification rate as the splitting criterion, the splits 𝑥2 < 5.0 and 𝑥2 < 3.0 would both
give a total misclassification rate of 0.2. However, the split at 𝑥2 < 3.0, which the entropy criterion
favored, provides a pure node 𝑅1. If we instead went with the split 𝑥2 < 5.0, the misclassification rate
after the second split would still be 0.2. Furthermore, if we continued to grow the tree until no data points
are misclassified, we would need three splits when using the entropy criterion, whereas we would need
five splits if we used the misclassification criterion and started with the split at 𝑥2 < 5.0.
To generalize this discussion, consider a problem with two classes, where we denote the proportion of
the first class as 𝜋ℓ1 = 𝑟 and hence the proportion of the second class as 𝜋ℓ2 = 1 − 𝑟. The three criteria
can then be expressed in terms of 𝑟 as

Misclassification rate: 𝑄 ℓ = 1 − max(𝑟, 1 − 𝑟),


Gini index: 𝑄 ℓ = 2𝑟 (1 − 𝑟),
Entropy: 𝑄 ℓ = −𝑟 ln 𝑟 − (1 − 𝑟) ln(1 − 𝑟).

These functions are shown in Figure 2.2. All three criteria are similar in the sense that they give zero
loss if all data points belong to either of the two classes, and maximum loss if the data points are equally
divided between the two classes. However, the Gini index and the entropy have a higher loss for all other
proportions. In other words, the gain of having a pure node (𝑟 close to 0 or 1) is higher for the Gini index
and the entropy than for the misclassification rate. As a consequence, when doing a split using the Gini index
or the entropy, both tend to favour (more than the misclassification rate does) making one of the two nodes
pure (or close to pure), since that gives a smaller total loss.
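As a small numerical illustration of this comparison (a Python sketch of ours, not from the book), the three criteria can be evaluated as functions of 𝑟; scaling the entropy by 1/(2 ln 2) makes it pass through (0.5, 0.5), as in Figure 2.2.

import numpy as np

def misclassification(r):
    return 1 - np.maximum(r, 1 - r)

def gini(r):
    return 2 * r * (1 - r)

def entropy(r):
    # Clip to avoid log(0); the criterion's limit at r = 0 and r = 1 is 0.
    r = np.clip(r, 1e-12, 1 - 1e-12)
    return -r * np.log(r) - (1 - r) * np.log(1 - r)

r = np.linspace(0, 1, 11)
for name, Q in [("misclassification", misclassification(r)),
                ("gini", gini(r)),
                ("entropy (scaled)", entropy(r) / (2 * np.log(2)))]:
    print(name, np.round(Q, 2))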

How deep should a decision tree be?


The depth of a decision tree (the maximum number of splits along a branch) has a big impact on the
final predictions. This impact is very similar to the discussion of 𝑘 in 𝑘-NN, and we again use the music
classification Example 2.1 and the car stopping distance Example 2.2 to study how the decision boundaries
change depending on the depth of the trees. In Figure 2.3 the decision boundaries are illustrated for
different trees. In Figure 2.3a and Figure 2.3c we have used fully grown trees, whereas in Figure 2.3b and
Figure 2.3d the depth has been restricted to 4 and 3 splits in each branch, respectively.
Similar to choosing 𝑘 = 1 in 𝑘-NN, for a fully grown tree all training data points will by construction be
correctly predicted since each region only contains data points with the same output. As a result, for the
music classification problem we get thin and small regions adapted to single training data points and for
the car stopping distance problem we get an overly adaptive line passing exactly through the observations.
Even though these trees give excellent performance on the training data, they are not likely to be the best

models for new yet unseen data. As we discussed previously in the context of 𝑘-NN, we refer to this as
overfitting.
In decision trees we can mitigate overfitting by using trees that are not as deep. Consequently, we get
fewer and larger regions and the decision boundaries are less adapted to the peculiarities of the training
data. This is illustrated in Figure 2.3b and Figure 2.3d for the two problems. As for 𝑘 in 𝑘-NN, the optimal
size of the tree depends on many properties of the problem and it is a trade-off between flexibility and
rigidity. Similar trade-offs have to be made in almost all models presented in this book and systematic
strategies will be discussed in Chapter 4.
How can the user control the depth of the tree? There are different strategies. The most obvious
strategy is to adjust the stopping criterion, that is, the criterion that should be fulfilled for not proceeding
with further splits in a certain node. As mentioned earlier, this criterion could be that we do not attempt
further splits if there are fewer than a certain number of training data points in the corresponding region, or,
as in the example above, that we stop splitting when we reach a certain depth. Another strategy for
controlling the depth is to use pruning. In pruning we grow a deep tree, and then in a second post-processing
step prune it back to a smaller subtree. However, we will not discuss pruning further here.

[Figure 2.3: four panels. Top row (music classification data; axes Length (ln s) and Energy (scale 0-1); classes Beatles, Kiss and Bob Dylan): "Fully grown tree" and "Tree with max depth 4". Bottom row (car stopping distance data; axes Speed (mph) and Distance (feet)): "Fully grown tree" and "Tree with max depth 3".]
(a) Decision boundaries for the music classification problem for a fully grown classification tree with Gini index. This model overfits the data.
(b) The same problem and data as in 2.3a, for which a tree restricted to depth 4 has been learned, again using the Gini index. This model will hopefully make better predictions for new data.
(c) The prediction for a fully grown regression tree. As for the classification problem above, this model overfits to the training data.
(d) The same problem and data as in 2.3c, for which a tree restricted to depth 3 has been learned.

Figure 2.3: Decision trees applied to the music classification Example 2.1 (a and b) and the car stopping distance
Example 2.2 (c and d).

3 Basic parametric models for regression and classification

In the previous chapter we saw the supervised machine learning problem, as well as two methods for
solving it. In this chapter we introduce a systematic approach referred to as parametric modeling, by
first looking at linear regression and logistic regression which are two such parametric models. The key
point of a parametric model is that it contains some parameters 𝜽, which are learned from training data.
However, once the parameters are learned, the training data may be discarded, since the prediction only
depends on 𝜽.

3.1 Linear regression

Regression is one of the two fundamental tasks of supervised learning (the other one is classification). We
will now introduce the linear regression model, which might (at least historically) be the most popular
method for solving regression problems. Despite its relative simplicity, it is surprisingly useful and is an
important stepping stone for more advanced methods, such as deep learning, see Chapter 6.
As discussed in the previous chapter, regression amounts to learning the relationships between some
input variables x = [𝑥 1 𝑥 2 . . . 𝑥 𝑝 ] T and a numerical output variable 𝑦. The inputs can be either categorical
or numerical, but we will start by assuming that all 𝑝 inputs are numerical, and introduce categorical
inputs later. In a more statistical framework, regression is about learning a model 𝑓

𝑦 = 𝑓 (x) + 𝜀, (3.1)

mapping the input to the output, where 𝜀 is an error term that describes everything about the input–output
relationship that cannot be captured by the model. With a statistical perspective, we consider 𝜀 as a random
variable, referred to as noise, that is independent of x and has a mean value of zero. As a running example
of regression, we will use the car stopping distance regression problem introduced in the previous chapter
as Example 2.2.

The linear regression model

The linear regression model assumes that the output variable 𝑦 (a scalar) can be described as an affine1
combination of the input variables 𝑥 1 , 𝑥2 , . . . , 𝑥 𝑝 (each a scalar) plus a noise term 𝜀,

𝑦 = 𝜃 0 + 𝜃 1 𝑥 1 + 𝜃 2 𝑥2 + · · · + 𝜃 𝑝 𝑥 𝑝 + 𝜀. (3.2)

We refer to the coefficients 𝜃 0 , 𝜃 1 , . . . 𝜃 𝑝 as the parameters of the model, and we sometimes refer to 𝜃 0
specifically as the intercept (or offset) term. The noise term 𝜀 accounts for random errors in the data
not captured by the model. The noise is assumed to have mean zero and to be independent of x. The
zero-mean assumption is nonrestrictive, since any (constant) non-zero mean can be incorporated in the
offset term 𝜃 0 .
To have a more compact notation, we introduce the parameter vector 𝜽 = [𝜃 0 𝜃 1 . . . 𝜃 𝑝 ] T and extend
the vector x with a constant one in its first position, such that we can write the linear regression model (3.2)
1 An affine function is a linear function plus a constant offset.

compactly as
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p + \varepsilon = \begin{bmatrix} \theta_0 & \theta_1 & \dots & \theta_p \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_p \end{bmatrix} + \varepsilon = \boldsymbol{\theta}^\mathsf{T}\mathbf{x} + \varepsilon. \quad (3.3)$$

This notation means that the symbol x is unfortunately used both for the (𝑝 + 1)-dimensional and the 𝑝-dimensional
version of the input vector, with or without the constant one in the leading position. This is only a matter
of book-keeping for handling the intercept term 𝜃0, will be clear from the context, and carries no deeper
meaning.
The linear regression model is a parametric function of the form (3.3). The parameters 𝜽 can take
arbitrary values, and the actual values that we assign to them will control the input–output relationship
described by the model. Learning of the model therefore amounts to finding suitable values for 𝜽 based on
observed training data. Before discussing how to do this, however, let us have a closer look at how the
model can be used to make predictions once it has been learned.

Making predictions with the linear regression model

The goal in supervised machine learning is making predictions $\hat{y}(\mathbf{x}_\star)$ for new, previously unseen test
inputs $\mathbf{x}_\star = \begin{bmatrix} 1 & x_{\star 1} & x_{\star 2} & \dots & x_{\star p} \end{bmatrix}^\mathsf{T}$. Let us assume that we have already learned some parameter
values $\hat{\boldsymbol{\theta}}$ for the linear regression model (how this is done will be described next). We use the hat symbol to
indicate that $\hat{\boldsymbol{\theta}}$ contains learned values of the otherwise unknown parameter vector $\boldsymbol{\theta}$. Since we assume that
the noise term $\varepsilon$ is random with zero mean and independent of all observed variables, it makes statistical
sense to let $\varepsilon = 0$ in the prediction. That is, a prediction from the linear regression model takes the form
$$\hat{y}(\mathbf{x}_\star) = \hat{\theta}_0 + \hat{\theta}_1 x_{\star 1} + \hat{\theta}_2 x_{\star 2} + \cdots + \hat{\theta}_p x_{\star p} = \hat{\boldsymbol{\theta}}^\mathsf{T}\mathbf{x}_\star. \quad (3.4)$$

This is in fact applicable for all regression models of the type (3.1), where the model used for prediction in
general is $\hat{y}(\mathbf{x}_\star) = f(\mathbf{x}_\star)$. The noise term $\varepsilon$ is often referred to as an irreducible error or an aleatoric2
uncertainty in the prediction. We illustrate the predictions made by a linear regression model in Figure 3.1.

[Figure: a linear regression model for 𝑝 = 1, showing three data points (x1, 𝑦1), (x2, 𝑦2), (x3, 𝑦3) with their errors 𝜀, the fitted line, and the prediction $\hat{y}(\mathbf{x}_\star)$ at a test input $\mathbf{x}_\star$; axes: input x and output 𝑦.]
Figure 3.1: Linear regression with 𝑝 = 1: The black dots represent 𝑛 = 3 data points, from which a linear regression
model (blue line) is learned. The model does not fit the data perfectly, but there is a remaining error/noise 𝜀 (red).
The model can be used to predict (blue circle) the output $\hat{y}(\mathbf{x}_\star)$ for a test input $\mathbf{x}_\star$.

2 From the Latin word aleator, meaning dice-player.


Learning linear regression from training data


Let us now discuss how to learn a linear regression model, that is, learn $\boldsymbol{\theta}$, from training data $\mathcal{T} = \{\mathbf{x}_i, y_i\}_{i=1}^n$.
We collect the training data, which consists of $n$ data points with inputs $\mathbf{x}_i$ and outputs $y_i$, in the $n \times (p+1)$
matrix $\mathbf{X}$ and the $n$-dimensional vector $\mathbf{y}$,
$$\mathbf{X} = \begin{bmatrix} -\mathbf{x}_1^\mathsf{T}- \\ -\mathbf{x}_2^\mathsf{T}- \\ \vdots \\ -\mathbf{x}_n^\mathsf{T}- \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \text{where each } \mathbf{x}_i = \begin{bmatrix} 1 \\ x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix}. \quad (3.5)$$

Example 3.1: Car stopping distances

We continue Example 2.2 and attempt to learn a linear regression model for it. We therefore form the
matrices X and y. Since we only have one input and one output, both $x_i$ and $y_i$ are scalar. We get
$$\mathbf{X} = \begin{bmatrix} 1 & 4.0 \\ 1 & 4.9 \\ 1 & 5.0 \\ 1 & 5.1 \\ 1 & 5.2 \\ \vdots & \vdots \\ 1 & 39.6 \\ 1 & 39.7 \end{bmatrix}, \quad \boldsymbol{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}, \quad \text{and} \quad \mathbf{y} = \begin{bmatrix} 4.0 \\ 8.0 \\ 8.0 \\ 4.0 \\ 2.0 \\ \vdots \\ 134.0 \\ 110.0 \end{bmatrix}. \quad (3.6)$$

Altogether we can use this vector and matrix notation to describe the linear regression model for each
and every training data point $i = 1, \dots, n$ in one equation as a matrix multiplication
$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\epsilon}, \quad (3.7)$$
where $\boldsymbol{\epsilon}$ is a vector of errors/noise. Moreover, we can also define a vector of predicted outputs for the
training data, $\hat{\mathbf{y}} = \begin{bmatrix} \hat{y}(\mathbf{x}_1) & \hat{y}(\mathbf{x}_2) & \dots & \hat{y}(\mathbf{x}_n) \end{bmatrix}^\mathsf{T}$, which also allows a compact matrix formulation
$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}. \quad (3.8)$$

Note that whereas $\mathbf{y}$ is a vector of recorded training data values, $\hat{\mathbf{y}}$ is a vector whose entries are functions
of $\boldsymbol{\theta}$. Learning the unknown parameters $\boldsymbol{\theta}$ amounts to finding values such that $\hat{\mathbf{y}}$ is similar to $\mathbf{y}$, that is,
the model should fit the training data well. There are multiple ways to define what 'similar' or 'well'
actually means, but it somehow amounts to finding $\boldsymbol{\theta}$ such that $\mathbf{y} - \hat{\mathbf{y}} = \boldsymbol{\epsilon}$ is small. We will approach this
by formulating a loss function, which will imply a meaning of 'fitting the data well'. We will thereafter
interpret the loss function from a statistical perspective, by understanding this as selecting the value of
$\boldsymbol{\theta}$ which makes the observed training data $\mathbf{y}$ as likely as possible to have been observed, the so-called
maximum likelihood solution. Later, in Chapter 9, we will also introduce a conceptually different way to
learn $\boldsymbol{\theta}$.



Figure 3.2: A graphical explanation of the squared error loss function: the goal is to choose the model (blue line)
such that the sum of the squares (light red) of each error 𝜀 is minimized. That is, the blue line is to be chosen so that
the amount of red color is minimized. The black dots, the training data, are fixed. This motivates the name least
squares.

Loss functions and cost functions

One principled way to define the learning problem is to introduce a loss function 𝐿(b 𝑦 , 𝑦) which measures
how close the model’s prediction b 𝑦 is to the observed data 𝑦. If the model fits the data well, so that b
𝑦 ≈ 𝑦,
then the loss function should take a small value, and vice versa. Based on the chosen loss function we
also define the cost function as the average loss over the training data. Learning a model then amounts to
finding the parameter values that minimize the cost

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \underbrace{\frac{1}{n}\sum_{i=1}^{n} \overbrace{L(\hat{y}(\mathbf{x}_i;\boldsymbol{\theta}),\, y_i)}^{\text{loss function}}}_{\text{cost function } J(\boldsymbol{\theta})}. \quad (3.9)$$

Note that each term in the expression above corresponds to evaluating the loss function for the predic-
tion b
𝑦 (x𝑖 ; 𝜽), given by (3.4), for the training point with index 𝑖 and the true output value 𝑦 𝑖 at that point.
To emphasize that the prediction depends on the parameters 𝜽, we have included 𝜽 as an argument to b 𝑦 for
clarity. The operator $\arg\min_{\boldsymbol{\theta}}$ means "the value of $\boldsymbol{\theta}$ for which the cost function attains its minimum". The
relationship between loss and cost functions (3.9) is general for all cost functions in this book.

Least squares and the normal equations

For regression, a commonly used loss function is the squared error loss
$$L(\hat{y}(\mathbf{x};\boldsymbol{\theta}),\, y) = \big(\hat{y}(\mathbf{x};\boldsymbol{\theta}) - y\big)^2. \quad (3.10)$$

This loss function attains 0 if $\hat{y}(\mathbf{x};\boldsymbol{\theta}) = y$, and grows quadratically as the difference between $y$ and the
prediction $\hat{y}(\mathbf{x};\boldsymbol{\theta}) = \boldsymbol{\theta}^\mathsf{T}\mathbf{x}$ increases. The corresponding cost function for the linear regression model (3.7)
can be written with matrix multiplications as
$$J(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \big(\hat{y}(\mathbf{x}_i;\boldsymbol{\theta}) - y_i\big)^2 = \frac{1}{n}\|\hat{\mathbf{y}} - \mathbf{y}\|_2^2 = \frac{1}{n}\|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|_2^2 = \frac{1}{n}\|\boldsymbol{\epsilon}\|_2^2, \quad (3.11)$$

where $\|\cdot\|_2$ denotes the usual Euclidean vector norm and $\|\cdot\|_2^2$ its square. Due to the square, this particular
cost function is also commonly referred to as the least squares cost. It is illustrated in Figure 3.2. We will
discuss other loss functions in Chapter 5.

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
28
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
3.1 Linear regression

When using the squared error loss for learning a linear regression model from T , we thus need to solve
the problem

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \frac{1}{n}\sum_{i=1}^{n} \big(\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i - y_i\big)^2 = \arg\min_{\boldsymbol{\theta}} \frac{1}{n}\|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|_2^2. \quad (3.12)$$

From a linear algebra point of view, this can be seen as the problem of finding the closest vector to y (in an
Euclidean sense) in the subspace of R𝑛 spanned by the columns of X. The solution to this problem is the
orthogonal projection of $\mathbf{y}$ onto this subspace, and the corresponding $\hat{\boldsymbol{\theta}}$ can be shown (see Section 3.A) to
fulfill
$$\mathbf{X}^\mathsf{T}\mathbf{X}\,\hat{\boldsymbol{\theta}} = \mathbf{X}^\mathsf{T}\mathbf{y}. \quad (3.13)$$

Equation (3.13) is often referred to as the normal equations, and gives the solution to the least squares
problem (3.12). If $\mathbf{X}^\mathsf{T}\mathbf{X}$ is invertible, which is often the case, then $\hat{\boldsymbol{\theta}}$ has the closed-form expression
$$\hat{\boldsymbol{\theta}} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{y}. \quad (3.14)$$

The fact that this closed-form solution exists is important, and is probably the reason why linear
regression with the squared error loss is so extremely commonly used in practice. Other loss functions lead to
optimization problems that often lack closed-form solutions.
We now have everything in place for using linear regression, and we summarize it as Method 3.1 and
illustrate by Example 3.2.

Time to reflect 3.1: What does it mean in practice if XT X is not invertible?

Time to reflect 3.2: If the columns of X are linearly independent and 𝑝 = 𝑛 − 1, X spans the entire
$\mathbb{R}^n$. That means a unique solution exists such that $\mathbf{y} = \mathbf{X}\boldsymbol{\theta}$ exactly, and (3.14) reduces to
$\boldsymbol{\theta} = \mathbf{X}^{-1}\mathbf{y}$; that is, the model fits the training data perfectly.
Why would that not be a desired property in practice?

Learn linear regression with squared error loss

Data: Training data $\mathcal{T} = \{\mathbf{x}_i, y_i\}_{i=1}^n$
Result: Learned parameter vector $\hat{\boldsymbol{\theta}}$

1 Construct the matrix $\mathbf{X}$ and vector $\mathbf{y}$ according to (3.5).
2 Compute $\hat{\boldsymbol{\theta}}$ by solving (3.13).

Predict with linear regression

Data: Learned parameter vector $\hat{\boldsymbol{\theta}}$ and test input $\mathbf{x}_\star$
Result: Prediction $\hat{y}(\mathbf{x}_\star)$
1 Compute $\hat{y}(\mathbf{x}_\star) = \hat{\boldsymbol{\theta}}^\mathsf{T}\mathbf{x}_\star$.

Method 3.1: Linear regression
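A minimal implementation of Method 3.1 could look as follows (a Python sketch with our own function names, not taken from the book); the least-squares problem (3.12) is handed to a linear least-squares solver, which is numerically preferable to forming the inverse in (3.14) explicitly.

import numpy as np

def learn_linear_regression(X, y):
    # Solve the least squares problem (3.12); equivalent to the normal equations (3.13).
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_hat

def predict_linear_regression(theta_hat, x_star):
    # Prediction (3.4); x_star is assumed to start with a constant 1.
    return theta_hat @ x_star

# Usage on the first few rows of the car stopping distance data of Example 3.1:
speed = np.array([4.0, 4.9, 5.0, 5.1, 5.2])
dist = np.array([4.0, 8.0, 8.0, 4.0, 2.0])
X = np.column_stack([np.ones_like(speed), speed])
theta_hat = learn_linear_regression(X, dist)
print(predict_linear_regression(theta_hat, np.array([1.0, 10.0])))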


Example 3.2: Car stopping distances

By inserting the matrices (3.6) from Example 3.1 into the normal equations (3.13), we obtain $\hat{\theta}_0 = -20.1$ and
$\hat{\theta}_1 = 3.14$. If we plot the resulting model, it looks like this:
[Figure: the learned linear regression model (line) together with the car stopping distance data; axes: Speed (mph) and Distance (feet).]
This can be compared to how 𝑘-NN and decision trees solve the same problem, Figures 2.1 and 2.3. Clearly
the linear regression model behaves differently from 𝑘-NN and decision trees; linear regression does not share
the "local" nature of 𝑘-NN and decision trees (where only training data points close to $\mathbf{x}_\star$ affect $\hat{y}(\mathbf{x}_\star)$), which is
related to the fact that linear regression is a parametric model.

The maximum likelihood perspective


To get another perspective of the squared error loss we will now reinterpret the least squares method above
as a maximum likelihood solution. The word ‘likelihood’ refers to the statistical concept of the likelihood
function, and maximizing the likelihood function amounts to finding the value of 𝜽 that makes observing
y as likely as possible. That is, instead of (somewhat arbitrarily) selecting a loss function, we start with the
problem
$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\; p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}). \quad (3.15)$$

Here 𝑝(y | X; 𝜽) is the probability density of all observed outputs y in the training data, given all inputs X
and parameters 𝜽. This determines mathematically what ‘likely’ means, but we need to specify it in more
detail. We do that by considering the noise term 𝜀 as a stochastic variable with a certain distribution. A
common assumption is that the noise terms are independent, each with a Gaussian distribution with zero mean
and variance $\sigma_\varepsilon^2$,
$$\varepsilon \sim \mathcal{N}\big(0, \sigma_\varepsilon^2\big). \quad (3.16)$$

This implies that the 𝑛 observed training data points are independent, and 𝑝(y | X; 𝜽) factorizes as
$$p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}). \quad (3.17)$$

Together with the linear regression model (3.3) we have
$$p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}) = \mathcal{N}\big(y_i;\, \boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i,\, \sigma_\varepsilon^2\big) = \frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}} \exp\Big(-\frac{1}{2\sigma_\varepsilon^2}\big(\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i - y_i\big)^2\Big). \quad (3.18)$$
Recall that we want to maximize the likelihood with respect to 𝜽. For numerical reasons, it is usually
better to work with the logarithm of 𝑝(y | X; 𝜽),
$$\ln p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \sum_{i=1}^{n} \ln p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}). \quad (3.19)$$


Since the logarithm is a monotonically increasing function, maximizing the log-likelihood (3.19) is
equivalent to maximizing the likelihood itself. Putting (3.18) and (3.19) together, we get

$$\ln p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = -\frac{n}{2}\ln(2\pi\sigma_\varepsilon^2) - \frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^{n}\big(\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i - y_i\big)^2. \quad (3.20)$$

Removing terms and factors independent of 𝜽 does not change the maximizing argument, and we see that
we can rewrite (3.15) as
$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\; p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}}\; -\sum_{i=1}^{n}\big(\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i - y_i\big)^2 = \arg\min_{\boldsymbol{\theta}}\; \frac{1}{n}\sum_{i=1}^{n}\big(\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i - y_i\big)^2. \quad (3.21)$$

This is indeed linear regression with the least squares cost (the cost function implied by the squared error
loss function (3.10)). The key to arrive at this expression was the assumption of 𝜀 having a Gaussian
distribution. Other assumptions on 𝜀 lead to other loss functions, as we will discuss more in Chapter 5.

Categorical input variables


The regression problem is characterized by a numerical output 𝑦, and inputs x of arbitrary type. We have,
however, only discussed the case of numerical inputs so far, but categorical inputs are perfectly possible as
well.
Assume that we have a categorical input variable that only takes two different values. We refer to those
two values as A and B. We can then create a dummy variable 𝑥 as
$$x = \begin{cases} 0 & \text{if A,} \\ 1 & \text{if B,} \end{cases} \quad (3.22)$$

and use this variable in any supervised machine learning method as if it was numerical. For linear
regression, this effectively gives us a model which looks like
$$y = \theta_0 + \theta_1 x + \varepsilon = \begin{cases} \theta_0 + \varepsilon & \text{if A,} \\ \theta_0 + \theta_1 + \varepsilon & \text{if B.} \end{cases} \quad (3.23)$$

The choice is somewhat arbitrary, since A and B of course can be switched. If the categorical variable takes
more than two values, let us say A, B, C and D, we can make a so-called one-hot encoding by constructing a
four-dimensional vector
$$\mathbf{x} = \begin{bmatrix} x_A & x_B & x_C & x_D \end{bmatrix}^\mathsf{T} \quad (3.24)$$

where 𝑥 𝐴 = 1 if A, 𝑥 𝐵 = 1 if B, and so on. That is, only one element of x will be 1, the rest are 0. Again,
this construction can be used for any supervised machine learning method, not only linear regression.
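As a small illustration (a Python sketch under our own naming, not taken from the book), dummy variables and one-hot encodings of this kind can be constructed as follows.

import numpy as np

def one_hot(categories, values):
    # Map each categorical value to a one-hot vector, e.g. 'B' -> [0, 1, 0, 0].
    index = {c: k for k, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for i, v in enumerate(values):
        encoded[i, index[v]] = 1.0
    return encoded

print(one_hot(['A', 'B', 'C', 'D'], ['B', 'A', 'D']))

The resulting columns can simply be appended to (or replace) the corresponding columns of the input matrix X before learning.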


3.2 Classification and logistic regression


After presenting a parametric method for solving the regression problem, we now turn our attention to
classification. As we will see, with a modification of the linear regression model we can apply it to the
classification problem as well, however at the cost of not being able to use the convenient normal equations
for learning the parameters. Instead we have to resort to numerical optimization, which we
will discuss later in Section 5.3.

A statistical view of the classification problem


Supervised machine learning amounts to predicting the output from the input. With a statistical perspective,
classification amounts to predicting the conditional class probabilities

𝑝(𝑦 = 𝑚 | x), (3.25)

where 𝑦 is the output (1, 2, . . . , or 𝑀) and x is the input.3 In words, 𝑝(𝑦 = 𝑚 | x) describes the probability
for class 𝑚 given that we know the input x. Talking about 𝑝(𝑦 | x) implies that we think about the class
label 𝑦 as a random variable. Why? Because we choose to model the real world, from where the data
originates, as involving a certain amount of randomness (much like the random error 𝜀 in regression). Let
us illustrate with an example:

Example 3.3: Describing voting behavior using probabilities

We want to construct a model that can predict voting preferences (= 𝑦, the categorical output) for different
population groups (= x, the input). However, we then have to face the fact that not everyone in a certain
population group will vote for the same political party. We can therefore think of 𝑦 as a random variable
which follows a certain probability distribution. If we knew that the vote count in the group of 45 year old
women (= x) is 13% for the cerise party, 39% for the turquoise party and 48% for the purple party (here we
have 𝑀 = 3), we could describe it as

𝑝(𝑦 = cerise party | x = 45 year old women) = 0.13,


𝑝(𝑦 = turquoise party | x = 45 year old women) = 0.39,
𝑝(𝑦 = purple party | x = 45 year old women) = 0.48.

In this way, the probabilities 𝑝(𝑦 | x) describe the non-trivial fact that
(a) all 45 year old women do not vote for the same party, but
(b) the choice of party does not appear to be completely random among 45 year old women either; the
purple party is the most popular, and the cerise party is the least popular.
Thus, it can be useful to have a classifier which predicts not only a class b
𝑦 (one party), but a distribution over
classes 𝑝(𝑦 | x).

We now aim to construct a classifier which can not only predict classes, but also learn the class
probabilities 𝑝(𝑦 | x). More specifically, for binary classification problems (𝑀 = 2, and 𝑦 is either 1 or
−1), we learn a model 𝑔(x) for which

𝑝(𝑦 = 1 | x) is modeled by 𝑔(x). (3.26a)

By the laws of probabilities, it holds that 𝑝(𝑦 = 1 | x) + 𝑝(𝑦 = −1 | x) = 1, which gives that

𝑝(𝑦 = −1 | x) is modeled by 1 − 𝑔(x). (3.26b)

Since 𝑔(x) is a model for a probability, it is natural to require that 0 ≤ 𝑔(x) ≤ 1 for any x. We will see
how this constraint can be enforced below.
3 We use the notation 𝑝(𝑦 | x) to denote probability masses (𝑦 discrete) as well as probability densities (𝑦 continuous).


[Figure: the logistic function ℎ(𝑧) plotted for 𝑧 between −10 and 10.]
Figure 3.3: The logistic function $h(z) = \frac{e^z}{1+e^z}$.

For the multiclass problem, we instead let the classifier return a vector-valued function g(x), where
$$\begin{bmatrix} p(y=1 \mid \mathbf{x}) \\ p(y=2 \mid \mathbf{x}) \\ \vdots \\ p(y=M \mid \mathbf{x}) \end{bmatrix} \text{ is modeled by } \begin{bmatrix} g_1(\mathbf{x}) \\ g_2(\mathbf{x}) \\ \vdots \\ g_M(\mathbf{x}) \end{bmatrix} = \mathbf{g}(\mathbf{x}). \quad (3.27)$$
In words, each element $g_m(\mathbf{x})$ of $\mathbf{g}(\mathbf{x})$ corresponds to the conditional class probability $p(y = m \mid \mathbf{x})$.
Since $\mathbf{g}(\mathbf{x})$ models a probability vector, we require that each element $g_m(\mathbf{x}) \ge 0$ and that
$\|\mathbf{g}(\mathbf{x})\|_1 = \sum_{m=1}^{M} |g_m(\mathbf{x})| = 1$ for any $\mathbf{x}$.

The logistic regression model


We will now introduce the logistic regression model, which is one possible way of modeling conditional
class probabilities. Logistic regression can be viewed as a modification of the linear regression model so
that it fits the classification (rather than regression) problem.
Let us start with binary classification, that is, learning a function 𝑔(x) that approximates the conditional
probability of the positive class. The linear regression model, without the noise term, is given by
𝑧 = 𝜃 0 + 𝜃 1 𝑥 1 + 𝜃 2 𝑥2 + · · · + 𝜃 𝑝 𝑥 𝑝 = 𝜽 T x. (3.28)
This is a mapping which takes x and returns 𝑧, which in this context is called the logit. Note that 𝑧 takes
values on the entire real line, whereas we need a function which instead returns a value on the interval
[0, 1]. The key idea of logistic regression is thus to ‘squeeze’ 𝑧 from (3.28) to the interval [0, 1] by using
the logistic function $h(z) = \frac{e^z}{1+e^z}$, see Figure 3.3. This gives the function

$$g(\mathbf{x}) = \frac{e^{\boldsymbol{\theta}^\mathsf{T}\mathbf{x}}}{1 + e^{\boldsymbol{\theta}^\mathsf{T}\mathbf{x}}}, \quad (3.29a)$$

which is restricted to [0, 1] and hence can be interpreted as a probability. (3.29a) is the logistic regression
model for 𝑝(𝑦 = 1 | x). Note that this implicitly also gives a model for 𝑝(𝑦 = −1 | x),
$$1 - g(\mathbf{x}) = 1 - \frac{e^{\boldsymbol{\theta}^\mathsf{T}\mathbf{x}}}{1 + e^{\boldsymbol{\theta}^\mathsf{T}\mathbf{x}}} = \frac{1}{1 + e^{\boldsymbol{\theta}^\mathsf{T}\mathbf{x}}} = \frac{e^{-\boldsymbol{\theta}^\mathsf{T}\mathbf{x}}}{1 + e^{-\boldsymbol{\theta}^\mathsf{T}\mathbf{x}}}. \quad (3.29b)$$

These equations define the logistic regression model, which in a nutshell is linear regression appended
with the logistic function. The reason why there is no noise term 𝜀 in (3.28) when comparing to the
linear regression model (3.3) is that the randomness in classification is statistically modeled by the class
probability construction 𝑝(𝑦 = 𝑚 | x) instead of an additive noise 𝜀.
As for linear regression, we have a model (3.29) which contains unknown parameters 𝜽. Logistic
regression is thereby also a parametric model, and we learn the parameters from training data.
Remark 3.1 Despite its name, logistic regression is a method for classification, not regression! The
(somewhat confusing) name is due to historical reasons.


Learning the logistic regression model by maximum likelihood


By using the logistic function, we have transformed linear regression (a model for regression problems)
into logistic regression (a model for classification problems). As it will turn out, the price to pay is that we
will not be able to use the convenient normal equations for learning 𝜽 in logistic regression (as we could
for linear regression if we used the squared error loss).
In order to derive a principled way of learning $\boldsymbol{\theta}$ in (3.29) from training data $\mathcal{T} = \{\mathbf{x}_i, y_i\}_{i=1}^n$, we start
with the maximum likelihood approach. From a maximum likelihood perspective, learning a classifier
amounts to solving
$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\; p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{n} \ln p(y_i \mid \mathbf{x}_i; \boldsymbol{\theta}), \quad (3.30)$$

where we, similarly to linear regression (3.19), assume that the training data points are independent and we
consider the logarithm of the likelihood function for numerical reasons. We have also added 𝜽 explicitly
to the notation to emphasize the dependence on the model parameters. Remember that our model of
𝑝(𝑦 = 1 | x; 𝜽) is 𝑔(x; 𝜽), which gives
$$\ln p(y_i \mid \mathbf{x}_i; \boldsymbol{\theta}) = \begin{cases} \ln g(\mathbf{x}_i;\boldsymbol{\theta}) & \text{if } y_i = 1, \\ \ln\big(1 - g(\mathbf{x}_i;\boldsymbol{\theta})\big) & \text{if } y_i = -1. \end{cases} \quad (3.31)$$
It is common to turn the maximization problem (3.30) into an equivalent minimization problem by using
the negative log-likelihood as cost function, $J(\boldsymbol{\theta}) = -\frac{1}{n}\sum_{i=1}^{n} \ln p(y_i \mid \mathbf{x}_i; \boldsymbol{\theta})$, that is,
$$J(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \underbrace{\begin{cases} -\ln g(\mathbf{x}_i;\boldsymbol{\theta}) & \text{if } y_i = 1, \\ -\ln\big(1 - g(\mathbf{x}_i;\boldsymbol{\theta})\big) & \text{if } y_i = -1 \end{cases}}_{\text{Binary cross-entropy loss } L(g(\mathbf{x}_i;\boldsymbol{\theta}),\, y_i)} \quad (3.32)$$

The loss function in the expression above is called the cross-entropy loss. It is not specific to logistic
regression, but can be used for any binary classifier that predicts class probabilities 𝑔(x; 𝜽).
Considering specifically the logistic regression model, we can write out the cost function (3.32) in more
detail. In doing so, the particular choice of labeling {−1, 1} turns out to be convenient, since for $y_i = 1$
we can write
$$g(\mathbf{x}_i;\boldsymbol{\theta}) = \frac{e^{\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i}}{1 + e^{\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i}} = \frac{e^{y_i\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i}}{1 + e^{y_i\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i}}, \quad (3.33a)$$
and for $y_i = -1$
$$1 - g(\mathbf{x}_i;\boldsymbol{\theta}) = \frac{e^{-\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i}}{1 + e^{-\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i}} = \frac{e^{y_i\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i}}{1 + e^{y_i\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i}}. \quad (3.33b)$$

Since we get the same expression in both cases, we write (3.32) compactly as
$$J(\boldsymbol{\theta}) = -\frac{1}{n}\sum_{i=1}^{n} \ln \frac{e^{y_i\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i}}{1 + e^{y_i\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i}} = -\frac{1}{n}\sum_{i=1}^{n} \ln \frac{1}{1 + e^{-y_i\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i}} = \frac{1}{n}\sum_{i=1}^{n} \underbrace{\ln\big(1 + e^{-y_i\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i}\big)}_{\text{Logistic loss } L(\mathbf{x}_i,\, y_i,\, \boldsymbol{\theta})}. \quad (3.34)$$

The loss function $L(\mathbf{x}_i, y_i, \boldsymbol{\theta})$ above, which is the cross-entropy loss specialized to logistic regression, is
called the logistic loss (or sometimes binomial deviance). Learning a logistic regression model thus
amounts to solving
$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}}\; \frac{1}{n}\sum_{i=1}^{n} \ln\big(1 + e^{-y_i\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_i}\big). \quad (3.35)$$

Contrary to linear regression with squared error loss, (3.35) has no closed-form solution, so we have to
use numerical optimization instead. We will come back to that topic in Section 5.3.


Learn binary logistic regression


Data: Training data $\mathcal{T} = \{\mathbf{x}_i, y_i\}_{i=1}^n$ (with output classes $y \in \{-1, 1\}$)
Result: Learned parameter vector $\hat{\boldsymbol{\theta}}$

1 Compute $\hat{\boldsymbol{\theta}}$ by solving (3.35) numerically.

Predict with logistic regression

Data: Learned parameter vector $\hat{\boldsymbol{\theta}}$ and test input $\mathbf{x}_\star$
Result: Prediction $\hat{y}(\mathbf{x}_\star)$
1 Compute $g(\mathbf{x}_\star)$ according to (3.29a).
2 If $g(\mathbf{x}_\star) > 0.5$, return $\hat{y}(\mathbf{x}_\star) = 1$, otherwise return $\hat{y}(\mathbf{x}_\star) = -1$.

Method 3.2: Logistic regression
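A compact way to carry out step 1 of Method 3.2 is to hand the cost function (3.34) to a general-purpose numerical optimizer. The sketch below (Python with NumPy and SciPy; the function names are our own assumptions, not from the book) is one minimal way to do it.

import numpy as np
from scipy.optimize import minimize

def logistic_cost(theta, X, y):
    # Cost (3.34): average logistic loss; labels y are assumed to be -1 or +1,
    # and X is assumed to have a leading column of ones.
    z = y * (X @ theta)
    return np.mean(np.logaddexp(0.0, -z))  # ln(1 + exp(-z)), computed stably

def learn_logistic_regression(X, y):
    theta0 = np.zeros(X.shape[1])
    result = minimize(logistic_cost, theta0, args=(X, y))
    return result.x

def predict_logistic_regression(theta_hat, x_star):
    g = 1.0 / (1.0 + np.exp(-theta_hat @ x_star))  # model (3.29a) of p(y = 1 | x)
    return 1 if g > 0.5 else -1

Adding a penalty term lambda * theta @ theta to logistic_cost turns the same sketch into the regularized problem (3.50) discussed in Section 3.3.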

Predictions and decision boundaries


So far, we have discussed logistic regression as a method for predicting class probabilities for a test input
x★ by first learning 𝜽 from training data and thereafter computing 𝑔(x★) (our model for 𝑝(𝑦 = 1 | x★)
in binary classification) or g(x★) (our model for 𝑝(𝑦 = 𝑚 | x★) in multiclass classification). However,
sometimes we want to make a "hard" prediction for the test input $\mathbf{x}_\star$, that is, predicting $\hat{y}(\mathbf{x}_\star) = -1$ or
$\hat{y}(\mathbf{x}_\star) = 1$ in binary classification, just like with 𝑘-NN or decision trees. We then have to append logistic
regression with a final step in which the predicted probabilities are turned into a class prediction. The most
common approach is to let the most probable class be $\hat{y}(\mathbf{x}_\star)$. For binary classification, we can express
this as4
$$\hat{y}(\mathbf{x}_\star) = \begin{cases} 1 & \text{if } g(\mathbf{x}_\star) > r, \\ -1 & \text{if } g(\mathbf{x}_\star) \le r, \end{cases} \quad (3.36)$$
with decision threshold 𝑟 = 0.5, which is illustrated in Figure 3.4. With binary logistic regression, this is
equivalent to
$$\hat{y}(\mathbf{x}_\star) = \operatorname{sign}(\boldsymbol{\theta}^\mathsf{T}\mathbf{x}_\star). \quad (3.37)$$
We now have everything in place for summarizing binary logistic regression in Method 3.2.
In some applications, however, it can be beneficial to explore different thresholds than 𝑟 = 0.5. It can be
shown that if 𝑔(x) = 𝑝(𝑦 = 1 | x), that is, the model provides a correct description of the real-world class
probabilities, then the choice 𝑟 = 0.5 will give the smallest possible number of misclassification on average.
In other words, 𝑟 = 0.5 minimizes the so-called misclassification rate. The misclassification rate is,
however, not always the most important aspect of a classifier. Many classification problems are asymmetric
(the classes are of different importance to predict correctly) or imbalanced (the classes occur with very
different frequencies). In a medical diagnosis application, for example, it can be more important not
to falsely predict the negative class (that is, by mistake predict a sick patient as healthy) than to falsely
predict the positive class (by mistake predict a healthy patient as sick). For such a problem, minimizing the
misclassification rate might not lead to the desired performance. Furthermore, the medical diagnosis
problem could be imbalanced if the disorder is very rare, meaning that the vast majority of the data points
(patients) belong to the negative class. By only considering the misclassification rate in such a situation we
implicitly value accurate predictions of the negative class higher than accurate predictions of the positive
class, simply because the negative class is more common in the data. We will discuss how we can evaluate
such situations more systematically in Section 4.5. In the end, however, the decision threshold 𝑟 is a choice
that the user has to make.
For the multiclass problem, taking the prediction as the most probable class amounts to solving
$$\hat{y}(\mathbf{x}_\star) = \arg\max_{m}\; g_m(\mathbf{x}_\star). \quad (3.38)$$
4 It is arbitrary what happens if 𝑔(x) = 0.5.



Figure 3.4: In binary classification (𝑦 = −1 or 1) logistic regression predicts 𝑔(x★) (x is two-dimensional here),
which is an attempt to determine 𝑝(𝑦 = 1 | x★). This implicitly also gives a prediction for 𝑝(𝑦 = −1 | x★) as 1 − 𝑔(x★).
To turn these probabilities into actual class predictions (b
𝑦 (x★) is either −1 or 1), the class which is modeled to have
the highest probability can be taken as the prediction, as in Equation (3.36). The point(s) where the prediction
changes from one class to another is the decision boundary (gray plane).

As for the binary case, it is possible to modify this when working with an asymmetric or imbalanced
problem.
The decision boundary for binary classification can be computed by solving the equation

𝑔(x) = 1 − 𝑔(x). (3.39)

The solutions to this equation are points in the input space for which the two classes are predicted to be
equally probable. These points therefore lie on the decision boundary. For binary logistic regression, this
means

$$\frac{e^{\boldsymbol{\theta}^\mathsf{T}\mathbf{x}}}{1 + e^{\boldsymbol{\theta}^\mathsf{T}\mathbf{x}}} = \frac{1}{1 + e^{\boldsymbol{\theta}^\mathsf{T}\mathbf{x}}} \quad \Leftrightarrow \quad e^{\boldsymbol{\theta}^\mathsf{T}\mathbf{x}} = 1 \quad \Leftrightarrow \quad \boldsymbol{\theta}^\mathsf{T}\mathbf{x} = 0. \quad (3.40)$$

The equation 𝜽 T x = 0 parameterizes a (linear) hyperplane. Hence, the decision boundaries in logistic
regression always have the shape of a (linear) hyperplane. The same argument can be generalized to
multiclass logistic regression, but the decision boundaries will then be given by a combination of 𝑀 − 1
hyperplanes.
In general we distinguish between different types of classifiers by the shape of their decision boundaries.

A classifier whose decision boundaries are linear hyperplanes is a linear classifier.

All other classifiers are nonlinear classifiers. Logistic regression is an example of a linear classifier,
whereas 𝑘-NN and decision trees are nonlinear classifiers.
Note that the term “linear” is used in a different sense for linear regression; linear regression is a model
which is linear in its parameters, whereas a linear classifier is a model whose decision boundaries are
linear.


Logistic regression for more than two classes


Logistic regression can be used also for the multiclass problem when there are more than two classes,
𝑀 > 2. There are several ways of generalizing logistic regression to the multiclass problem. We will
follow one path using the so-called softmax function which will be useful also later when introducing
deep learning models in Chapter 6.
For the binary problem, we used the logistic function to design a model for 𝑔(x) (a scalar-valued
function representing 𝑝(𝑦 = 1 | x)). For the multiclass problem we instead have to design a vector-valued
function g(x), whose elements should be non-negative and sum to one. For this purpose, we first use 𝑀
instances of (3.28), each denoted 𝑧 𝑚 and each with a different set of parameters 𝜽 𝑚 . We stack all 𝑧 𝑚 into a
vector of logits z = [𝑧 1 𝑧 2 . . . 𝑧 𝑀 ] T and use the softmax function as a vector-valued generalization of the
logistic function,
$$\operatorname{softmax}(\mathbf{z}) \triangleq \frac{1}{\sum_{m=1}^{M} e^{z_m}} \begin{bmatrix} e^{z_1} \\ e^{z_2} \\ \vdots \\ e^{z_M} \end{bmatrix}. \quad (3.41)$$
Note that the argument z to the softmax function is an 𝑀-dimensional vector, and that it also returns a
vector of the same dimension. By construction, the output vector from the softmax function always sums
to 1, and each element is always ≥ 0. Similarly to how we combined linear regression and the logistic
function for the binary classification problem (3.29), we have now combined linear regression and the
softmax function to model the class probabilities,
$$\mathbf{g}(\mathbf{x}) = \operatorname{softmax}(\mathbf{z}), \quad \text{where } \mathbf{z} = \begin{bmatrix} \boldsymbol{\theta}_1^\mathsf{T}\mathbf{x} \\ \boldsymbol{\theta}_2^\mathsf{T}\mathbf{x} \\ \vdots \\ \boldsymbol{\theta}_M^\mathsf{T}\mathbf{x} \end{bmatrix}. \quad (3.42)$$
Equivalently, we can write out the individual class probabilities, that is, the elements of the vector $\mathbf{g}(\mathbf{x})$, as
$$g_m(\mathbf{x}) = \frac{e^{\boldsymbol{\theta}_m^\mathsf{T}\mathbf{x}}}{\sum_{j=1}^{M} e^{\boldsymbol{\theta}_j^\mathsf{T}\mathbf{x}}}, \quad m = 1, \dots, M. \quad (3.43)$$

This is the multiclass logistic regression model. Note that this construction uses 𝑀 parameter vectors
𝜽 1 , . . . , 𝜽 𝑀 (one for each class), meaning that the number of parameters to learn grows with 𝑀. As for
binary logistic regression, we can learn those parameters using the maximum likelihood method. We use
𝜽 to denote all model parameters, 𝜽 = {𝜽 1 , . . . , 𝜽 𝑀 }. Since 𝑔𝑚 (x; 𝜽) is our model for 𝑝(𝑦 𝑖 = 𝑚 | x𝑖 ), the
cost function for the cross-entropy (or negative log-likelihood) loss for the multiclass problem is

$$J(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \underbrace{\big(-\ln g_{y_i}(\mathbf{x}_i;\boldsymbol{\theta})\big)}_{\text{Multiclass cross-entropy loss } L(\mathbf{g}(\mathbf{x}_i;\boldsymbol{\theta}),\, y_i)}. \quad (3.44)$$

Note that we use the training data labels 𝑦 𝑖 as index variables to select the correct conditional probability
for the loss function. That is, the 𝑖th term of the sum is the negative logarithm of the 𝑦 𝑖 th element of the
vector g(x𝑖 ; 𝜽). We illustrate the meaning of this further in example 3.4.
Inserting the model (3.43) into the loss function (3.44) gives the cost function to optimize when learning
multiclass logistic regression (the softmax version),

$$J(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \bigg( -\boldsymbol{\theta}_{y_i}^\mathsf{T}\mathbf{x}_i + \ln \sum_{j=1}^{M} e^{\boldsymbol{\theta}_j^\mathsf{T}\mathbf{x}_i} \bigg). \quad (3.45)$$
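As a small sketch of how (3.45) can be evaluated in code (Python with NumPy and SciPy; the names and the parameter layout are our own assumptions, not from the book), with one parameter vector per class stored as the rows of a matrix:

import numpy as np
from scipy.special import logsumexp

def multiclass_cost(Theta, X, y):
    # Cost (3.45). Theta has shape (M, p+1), one parameter vector theta_m per row;
    # X has shape (n, p+1) with a leading column of ones;
    # y contains integer labels 0, ..., M-1 (the book's classes 1, ..., M shifted by one).
    Z = X @ Theta.T                      # logits, shape (n, M)
    n = X.shape[0]
    return np.mean(-Z[np.arange(n), y] + logsumexp(Z, axis=1))

Minimizing this function numerically over Theta, for instance with scipy.optimize.minimize on the flattened parameter matrix, learns the multiclass logistic regression model.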


Example 3.4: The cross-entropy loss for multiclass problems

Consider the following (very small) data set with 𝑛 = 6 data points, 𝑝 = 2 input dimensions and 𝑀 = 3
classes, which we want to use for learning a multiclass classifier:
 0.20 0.86  2
  
 0.41 0.18  3
  
 0.96 −1.84 1
X =  ,
 y =   .
−0.25 1.57  2
−0.82 −1.53 1
   
−0.31 0.58  3
   
Multiclass logistic regression with softmax, or any other multiclass classifier which predicts conditional class
probabilities, returns a 3-dimensional probability vector $\mathbf{g}(\mathbf{x};\boldsymbol{\theta})$ for any $\mathbf{x}$ and $\boldsymbol{\theta}$. If we stack the logarithms of
the transposes of all vectors $\mathbf{g}(\mathbf{x}_i;\boldsymbol{\theta})$ for $i = 1, \dots, 6$, we obtain the matrix

 
 
$$\mathbf{G} = \begin{bmatrix} \ln g_1(\mathbf{x}_1;\boldsymbol{\theta}) & \boxed{\ln g_2(\mathbf{x}_1;\boldsymbol{\theta})} & \ln g_3(\mathbf{x}_1;\boldsymbol{\theta}) \\ \ln g_1(\mathbf{x}_2;\boldsymbol{\theta}) & \ln g_2(\mathbf{x}_2;\boldsymbol{\theta}) & \boxed{\ln g_3(\mathbf{x}_2;\boldsymbol{\theta})} \\ \boxed{\ln g_1(\mathbf{x}_3;\boldsymbol{\theta})} & \ln g_2(\mathbf{x}_3;\boldsymbol{\theta}) & \ln g_3(\mathbf{x}_3;\boldsymbol{\theta}) \\ \ln g_1(\mathbf{x}_4;\boldsymbol{\theta}) & \boxed{\ln g_2(\mathbf{x}_4;\boldsymbol{\theta})} & \ln g_3(\mathbf{x}_4;\boldsymbol{\theta}) \\ \boxed{\ln g_1(\mathbf{x}_5;\boldsymbol{\theta})} & \ln g_2(\mathbf{x}_5;\boldsymbol{\theta}) & \ln g_3(\mathbf{x}_5;\boldsymbol{\theta}) \\ \ln g_1(\mathbf{x}_6;\boldsymbol{\theta}) & \ln g_2(\mathbf{x}_6;\boldsymbol{\theta}) & \boxed{\ln g_3(\mathbf{x}_6;\boldsymbol{\theta})} \end{bmatrix}.$$

Computing the multiclass cross-entropy cost (3.44) now simply amounts to computing the average of all
boxed elements and multiplying that average by −1. The element that is boxed in row $i$ is given by the training
label $y_i$. Training the model amounts to finding $\boldsymbol{\theta}$ such that this average is maximized.
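In code, picking out one element per row of a matrix like G and averaging is a one-liner; the sketch below (our own Python, not from the book, with a placeholder model) makes the index construction explicit.

import numpy as np

# A hypothetical log-probability matrix G of shape (6, 3) and the labels of Example 3.4.
G = np.log(np.full((6, 3), 1.0 / 3.0))   # placeholder: a model predicting 1/3 for every class
y = np.array([2, 3, 1, 2, 1, 3]) - 1     # shift the labels 1, 2, 3 to indices 0, 1, 2

J = -np.mean(G[np.arange(len(y)), y])    # the cost (3.44)
print(J)                                 # ln 3, approximately 1.10, for this placeholder model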

Time to reflect 3.3: Can you derive (3.32) as a special case of (3.44)?
Hint: think of the binary case as a special case of the multiclass case with $\mathbf{g}(\mathbf{x}) = \begin{bmatrix} g(\mathbf{x}) \\ 1 - g(\mathbf{x}) \end{bmatrix}$.

Time to reflect 3.4: The softmax-based logistic regression is actually over-parameterized, in


the sense that we can construct an equivalent model with fewer parameters. That is often not a
problem in practice, but compare the multiclass model (3.42) for the case 𝑀 = 2 with binary
logistic regression (3.29) and see if you can spot the over-parametrization!

3.3 Polynomial regression and regularization

In comparison to 𝑘-NN and decision trees in Chapter 2, linear and logistic regression might appear to be
rigid and non-flexible models with their straight lines (such as Figure 3.1 and 3.4). However, both models
are able to adapt to the training data well if the input dimension 𝑝 is large, or the number of data points 𝑛
is small.
A common way of increasing the input dimension in linear and logistic regression, which we will
discuss more thoroughly in Chapter 8, is to make a nonlinear transformation of the input. A simple
nonlinear transformation is to replace a one-dimensional input 𝑥 with itself raised to different powers,
which makes the linear regression model a polynomial

𝑦 = 𝜃 0 + 𝜃 1 𝑥 + 𝜃 2 𝑥 2 + 𝜃 3 𝑥 3 + · · · + 𝜀. (3.46)


This is called polynomial regression, and it can be applied also to logistic regression. Note that if we let
$x_1 = x$, $x_2 = x^2$ and $x_3 = x^3$, this is still a linear model (3.2) with input $\mathbf{x} = [1\;\, x\;\, x^2\;\, x^3]$, but we have
'lifted' the input from being one-dimensional ($p = 1$) to three-dimensional ($p = 3$). Using nonlinear input
transformations can be very useful in practice, but it effectively increases 𝑝 and we may easily end up
overfitting the model to the noise—rather than the interesting patterns—in the training data, as in the
example below.

Example 3.5: Car stopping distances with polynomial regression

We return to Example 2.2, but this time we also add the squared speed as an input and thereby use a 2nd
order polynomial in linear regression. This gives the new matrices (instead of Example 3.1)
1 4.0 16.0   4.0 
  
1 4.9 24.0   8.0 
 𝜃 0   
1 5.0 25.0     8.0 
  
X = . ..  , 𝜽 = 𝜃 1  , y=  . , (3.47)
 .. .   .. 
.

.. 𝜃 2   
1 39.6 1568.2   134.0
   
1 39.7 1576.1 110.0
   

and when we insert them into the normal equations (3.13), the new parameter estimates are $\hat{\theta}_0 = 1.58$,
$\hat{\theta}_1 = 0.42$ and $\hat{\theta}_2 = 0.07$. (Note that also $\hat{\theta}_0$ and $\hat{\theta}_1$ change, compared to Example 3.2.)
In a completely analogous way we also learn a 10th order polynomial, and we illustrate them all below.

[Figure: the car stopping distance data together with three learned models: linear regression with 𝑥 (dashed), linear regression with 𝑥 and 𝑥², and linear regression with 𝑥, 𝑥², . . . , 𝑥¹⁰; axes: Speed (mph) and Distance (feet).]
The second order polynomial (blue solid line) appears sensible, and using a 2nd order polynomial seems to give some advantage compared to plain linear regression (blue dashed line, from Example 3.2). However, using a 10th order polynomial (red solid line) seems to make the model less useful than even plain linear regression; it suffers from overfitting. In conclusion, there seems to be some merit to the idea of polynomial regression, but it has to be applied carefully.
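To make the recipe concrete, the following is a minimal sketch in Python/NumPy that fits a 2nd order polynomial with plain least squares. The handful of (speed, distance) pairs are hypothetical stand-ins, not the actual car stopping distance dataset.

```python
import numpy as np

# Hypothetical (speed, distance) pairs, standing in for the car stopping distance data.
speed = np.array([4.0, 4.9, 5.0, 17.0, 25.0, 39.6, 39.7])
dist = np.array([4.0, 8.0, 8.0, 45.0, 85.0, 134.0, 110.0])

# Build X = [1, x, x^2] as in (3.47): a nonlinear transformation of the 1-dim input.
X = np.column_stack([np.ones_like(speed), speed, speed**2])

# Solve the normal equations X^T X theta = X^T y for the least squares estimate.
theta_hat = np.linalg.solve(X.T @ X, X.T @ dist)
print(theta_hat)  # [theta_0, theta_1, theta_2]

# Predict the stopping distance for a new speed x_star.
x_star = 30.0
print(np.array([1.0, x_star, x_star**2]) @ theta_hat)
```

A 10th order polynomial is obtained in exactly the same way by stacking the powers speed**1, ..., speed**10 as columns of X.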

For linear (or logistic) regression with nonlinear input transformations, one could indeed mitigate overfitting issues by very carefully selecting the nonlinear transformations. Such a careful selection can be quite hard to perform in practice, and overfitting can occur also in other situations where 𝑝 is large compared to 𝑛.
A useful approach in practice is therefore to use regularization. The idea of regularization can be described as ‘keeping the parameters b𝜽 small unless the data really convinces us otherwise’, or alternatively ‘if a model with small parameter values in b𝜽 fits the data almost as well as a model with larger parameter values, the one with small parameter values should be preferred’. There are several ways to implement this idea mathematically, which lead to different regularization methods. We will give a more complete treatment of this in Section 5.2, and only discuss the so-called 𝐿 2 regularization for now. When paired with regularization, the idea of using nonlinear input transformations can be very powerful and enables a whole family of supervised machine learning methods that we will properly introduce and discuss in
Chapter 8.
To keep b𝜽 small (in order to prevent overfitting), an extra penalty term 𝜆‖𝜽‖₂² is added to the cost function when using 𝐿 2 regularization. The purpose of the penalty term is to prevent overfitting, whereas the original cost function only rewards the fit to training data (which does not prevent overfitting). The regularization parameter 𝜆 ≥ 0 is therefore important to select wisely, in order to obtain the right amount of regularization. With 𝜆 = 0 the regularization has no effect, whereas 𝜆 → ∞ will force all parameters in b𝜽 to 0. A common solution is to use cross-validation (Chapter 4) to select 𝜆. Applied to linear regression with the squared error loss (3.12), it becomes⁵

b𝜽 = arg min_𝜽  (1/𝑛) ‖X𝜽 − y‖₂² + 𝜆‖𝜽‖₂² .        (3.48)

It turns out that (3.48) has a closed-form solution, as a modified version of the normal equations, namely

(XᵀX + 𝑛𝜆I 𝑝+1) b𝜽 = Xᵀy,        (3.49)

where I 𝑝+1 is the identity matrix of size (𝑝 + 1) × (𝑝 + 1).
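As a minimal sketch (not the book's own code), the estimate (3.49) can be computed directly. The data below is synthetic, the choice 𝜆 = 0.1 is arbitrary, and, following footnote 5, the intercept is excluded from the penalty by zeroing the corresponding diagonal entry.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: n points, p = 3 inputs, plus a column of ones for the intercept.
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

lam = 0.1  # regularization parameter, arbitrary here; select it with cross-validation in practice

# L2 regularized (ridge) solution: (X^T X + n*lambda*I) theta = X^T y, as in (3.49).
# Zeroing I[0, 0] excludes the intercept from the penalty (cf. footnote 5).
I = np.eye(p + 1)
I[0, 0] = 0.0
theta_hat = np.linalg.solve(X.T @ X + n * lam * I, X.T @ y)
print(theta_hat)
```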


Regularization is not restricted to linear regression. The very same 𝐿 2 regularization idea can be applied
to any method that involves optimizing a cost function, such as logistic regression

b𝜽 = arg min_𝜽  (1/𝑛) ∑_{𝑖=1}^{𝑛} ln(1 + exp(−𝑦 𝑖 𝜽ᵀx𝑖)) + 𝜆‖𝜽‖₂² .        (3.50)

It is common in practice to train logistic regression using (3.50) instead of (3.29). One reason is indeed to decrease possible overfitting issues. Another reason is that for the non-regularized cost function (3.29), the optimal b𝜽 is not finite if the training data is linearly separable (meaning there exists a linear decision boundary which separates the classes perfectly). In practice this means that the training of logistic regression diverges for some datasets, unless (3.50) (with 𝜆 > 0) is used instead of (3.29).
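The sketch below minimizes the regularized cost (3.50) with plain gradient descent on a small synthetic dataset with labels 𝑦𝑖 ∈ {−1, 1}. It is only meant to illustrate the idea; the step size, number of iterations and 𝜆 are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, linearly separable data with labels in {-1, +1}.
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = np.where(X[:, 1] + X[:, 2] > 0, 1.0, -1.0)

lam = 0.1           # regularization parameter
theta = np.zeros(3)

for _ in range(5000):
    margins = y * (X @ theta)
    # Gradient of (1/n) * sum ln(1 + exp(-y_i theta^T x_i)) + lam * ||theta||_2^2
    sigma = 1.0 / (1.0 + np.exp(margins))      # = exp(-m) / (1 + exp(-m))
    grad = -(X.T @ (y * sigma)) / n + 2 * lam * theta
    theta -= 0.5 * grad                        # fixed step size, arbitrary choice

print(theta)  # stays finite thanks to lam > 0, even though the data is separable
```

Setting lam = 0 in this sketch makes the parameter norm grow without bound, which is exactly the divergence discussed above.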

3.4 Nonlinear regression and generalized linear models


In this chapter we have so far introduced two basic parametric models for regression and classification,
linear regression and logistic regression, respectively. The concept of parametric modeling is however
much more general. Before leaving this chapter we will briefly discuss how the models introduced above
can be generalized to describe more intricate input–output relationships. We will also dig deeper into a
special case of this later in Chapter 6, where we discuss neural networks.

Generalized linear models


To be written.

Nonlinear parametric functions


Let us remind ourselves about the general regression model (3.1),

𝑦 = 𝑓𝜽 (x) + 𝜀. (3.51)

We have introduced an explicit dependence on the parameters 𝜽 in the notation to emphasize that 𝑓𝜽 is a
model of the input–output relationship which depends on some parameters 𝜽. To turn this model into
a linear regression, that could be trained using least squares with a closed form solution, we made two
assumptions in Section 3.1. First, the function 𝑓𝜽 was assumed to be linear in the model parameters,
𝑓𝜽 (x) = 𝜽ᵀx. Second, the noise term 𝜀 was assumed to be Gaussian, 𝜀 ∼ N (0, 𝜎𝜀²). The latter assumption
5 In practice, it can be wise to exclude 𝜃 0 , the intercept, from the regularization.


is sometimes implicit, but as we saw above it makes the maximum likelihood formulation equivalent to
least squares.
Both of these assumptions can be replaced with other assumptions. Replacing the Gaussian assumption
will be discussed in Section 5.1, and we will here discuss what happens if we allow the function 𝑓𝜽 to be
some arbitrary nonlinear function. We still want to learn the model from data, so 𝑓𝜽 has to depend on
some parameters 𝜽 which we can fit to the training data, no matter if it is a linear function or not. Let us
give an example of a nonlinear 𝑓𝜽 from biochemistry.
Example 3.6: Michaelis–Menten kinetics

An example of a relatively simple nonlinear parametric function is the Michaelis–Menten equation for
modeling enzyme kinetics. The model is given by
𝑦 = 𝜃₁𝑥/(𝜃₂ + 𝑥) + 𝜀,        with 𝑓𝜽 (𝑥) = 𝜃₁𝑥/(𝜃₂ + 𝑥),

where 𝑦 corresponds to a reaction rate and 𝑥 to a substrate concentration. The model is parameterized by the maximum reaction rate 𝜃₁ > 0 and the so-called Michaelis constant of the enzyme 𝜃₂ > 0. Just like the parameters in linear regression, these parameters can be learned from input–output data {𝑥 𝑖 , 𝑦 𝑖 }_{𝑖=1}^{𝑛} by formulating a cost function and solving the resulting optimization problem.


Within biochemistry this model is often expressed as a deterministic relationship without the noise term
𝜀, but we include it as an error term for consistency with our framework.
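As a minimal sketch of how such a nonlinear 𝑓𝜽 can be fitted by least squares, the code below uses scipy.optimize.curve_fit on synthetic data; the true parameter values and the starting point p0 are made-up choices for the illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(x, theta1, theta2):
    # f_theta(x) = theta1 * x / (theta2 + x)
    return theta1 * x / (theta2 + x)

rng = np.random.default_rng(0)

# Synthetic substrate concentrations and noisy reaction rates (made-up true parameters).
x = np.linspace(0.1, 5.0, 30)
y = michaelis_menten(x, 2.0, 0.5) + 0.05 * rng.normal(size=x.size)

# Nonlinear least squares fit; p0 is an arbitrary starting point for the optimizer.
theta_hat, _ = curve_fit(michaelis_menten, x, y, p0=[1.0, 1.0])
print(theta_hat)  # estimates of [theta1, theta2]
```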

In the example above the parameters 𝜃 1 and 𝜃 2 have physical interpretations and are restricted to be
positive. However, in machine learning we most often lack such physical interpretations of the parameters.
The way we use a model is more of a “black box” which is adapted to fit the training data as well as
possible. The archetype of such nonlinear black-box models are neural networks, which we will discuss in
more detail in Chapter 6.
Nonlinear classification models can be constructed in a very similar way, as a generalization of
the logistic regression model (3.42). In multiclass logistic regression we compute a vector of logits
z = [𝑧 1 𝑧 2 . . . 𝑧 𝑀 ]ᵀ, where each element of the vector is given by a linear model 𝑧 𝑚 = 𝜽 𝑚ᵀx. The
class probabilities are then obtained by propagating the logit vector through the softmax function as
𝑔𝜽 (x) = softmax(z). To turn this into a nonlinear classification model we can replace the expression for
the logit vector with

z = 𝒇𝜽 (x)

where 𝒇𝜽 is some arbitrary function that maps x to an 𝑀-dimensional real-valued vector. Propagating this logit vector through the softmax function, analogously to the logistic regression case, results in a nonlinear model for the conditional class probabilities, 𝑔𝜽 (x) = softmax( 𝒇𝜽 (x)). We will return to nonlinear
classification models of this form in Chapter 6, where we use neural networks to construct the function 𝒇𝜽 .
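A small sketch of this construction: any parametric function that maps x to 𝑀 logits can be pushed through the softmax to give class probabilities. The quadratic logit function below is just a made-up placeholder for 𝒇𝜽, not a model from the book.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: subtracting max(z) does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def f_theta(x, Theta1, Theta2):
    # Made-up nonlinear logits: z_m = theta_m1^T x + theta_m2^T (x elementwise squared).
    return Theta1 @ x + Theta2 @ x**2

x = np.array([0.5, -1.2])
Theta1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # M = 3 classes, p = 2 inputs
Theta2 = np.array([[0.5, 0.0], [0.0, 0.5], [-0.5, 0.5]])

g = softmax(f_theta(x, Theta1, Theta2))   # conditional class probabilities
print(g, g.sum())                         # the probabilities sum to 1
```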

3.A Derivation of the normal equations


The normal equations (3.13),

XᵀX b𝜽 = Xᵀy,

can be derived from (3.12) (the scaling 1/𝑛 does not affect the minimizing argument),

b𝜽 = arg min_𝜽 ‖X𝜽 − y‖₂² ,

in different ways. We will present one based on (matrix) calculus and one based on geometry and linear
algebra.


No matter how (3.13) is derived, if XᵀX is invertible, it (uniquely) gives

b𝜽 = (XᵀX)⁻¹Xᵀy.

If XᵀX is not invertible, then (3.13) has infinitely many solutions b𝜽, which all are equally good solutions to the problem (3.12).
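In numerical code one rarely forms the inverse explicitly. The sketch below, on made-up data, solves the normal equations directly and also shows np.linalg.lstsq, which returns a minimizer of (3.12) even when XᵀX is not invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
y = rng.normal(size=8)

# Solve X^T X theta = X^T y (assumes X^T X is invertible).
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# lstsq minimizes ||X theta - y||_2^2 directly and handles rank-deficient X as well.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_solve, theta_lstsq))  # True when X^T X is invertible
```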

A calculus approach
Let

𝑉 (𝜽) = ‖X𝜽 − y‖₂² = (X𝜽 − y)ᵀ(X𝜽 − y) = yᵀy − 2yᵀX𝜽 + 𝜽ᵀXᵀX𝜽,        (3.52)

and differentiate 𝑉 (𝜽) with respect to the vector 𝜽,


𝜕𝑉(𝜽)/𝜕𝜽 = −2Xᵀy + 2XᵀX𝜽.        (3.53)

Since 𝑉(𝜽) is a positive quadratic form, its minimum must be attained at 𝜕𝑉(𝜽)/𝜕𝜽 = 0, which characterizes the solution b𝜽 as

𝜕𝑉(b𝜽)/𝜕𝜽 = 0  ⇔  −2Xᵀy + 2XᵀX b𝜽 = 0  ⇔  XᵀX b𝜽 = Xᵀy,        (3.54)
which is the normal equations.

A linear algebra approach


Denote the 𝑝 + 1 columns of X as 𝑐 𝑗 , 𝑗 = 1, . . . , 𝑝 + 1. We first show that ‖X𝜽 − y‖₂² is minimized if 𝜽 is chosen such that X𝜽 is the orthogonal projection of y onto the (sub)space spanned by the columns 𝑐 𝑗 of X, and then show that the orthogonal projection is found by the normal equations.
Let us decompose y as y⊥ + y∥, where y⊥ is orthogonal to the (sub)space spanned by all columns 𝑐 𝑗 , and y∥ is in the (sub)space spanned by all columns 𝑐 𝑗 . Since y⊥ is orthogonal to both y∥ and X𝜽, it follows that

‖X𝜽 − y‖₂² = ‖X𝜽 − (y⊥ + y∥)‖₂² = ‖(X𝜽 − y∥) − y⊥‖₂² ≥ ‖y⊥‖₂² ,        (3.55)

and, since y⊥ is also orthogonal to X𝜽 − y∥, we furthermore have

‖X𝜽 − y⊥ − y∥‖₂² ≤ ‖y⊥‖₂² + ‖X𝜽 − y∥‖₂² .        (3.56)

This implies that if we choose 𝜽 such that X𝜽 = y∥, the criterion ‖X𝜽 − y‖₂² must have reached its minimum. Thus, our solution b𝜽 must be such that Xb𝜽 − y is orthogonal to the (sub)space spanned by all columns 𝑐 𝑗 , meaning

(y − Xb𝜽)ᵀ𝑐 𝑗 = 0,   𝑗 = 1, . . . , 𝑝 + 1        (3.57)

(remember that two vectors u, v are, by definition, orthogonal if their scalar product uᵀv is 0). Since the columns 𝑐 𝑗 together form the matrix X, we can write this compactly as

(y − Xb𝜽)ᵀX = 0,        (3.58)

where the right-hand side is the (𝑝 + 1)-dimensional zero vector. This can equivalently be written as

XᵀX b𝜽 = Xᵀy,

which is the normal equations.

4 Understanding, evaluating and improving the
performance

We have so far encountered four different methods for supervised machine learning, and more are to come
in later chapters. We always learn the models by adapting them to training data, and hope that the models
thereby will give us good predictions also when faced with new, previously unseen, data. But can we
really expect that to work? This may sound like a trivial question, but on second thought it is perhaps not so obvious, and we will give it some attention in this chapter before we dive into more advanced methods
in later chapters. By doing so we will unveil some interesting concepts, and also discover some practical
tools for evaluating, choosing between and improving supervised machine learning methods.

4.1 Expected new data error 𝐸 new : performance in production


We start by introducing some concepts and notation. First, we define an error function 𝐸 (b 𝑦 , 𝑦) which
encodes the purpose of classification or regression. The error function compares a prediction b 𝑦 (x) to a
measured data point 𝑦, and returns a small value (possibly zero) if b
𝑦 (x) is a good prediction of 𝑦, and a
larger value otherwise. One could consider many different error functions, but our default choices are
misclassification and squared error, respectively:
Misclassification: 𝐸 (b𝑦 , 𝑦) ≜ { 0 if b𝑦 = 𝑦,  1 if b𝑦 ≠ 𝑦 }        (classification)        (4.1a)
Squared error:     𝐸 (b𝑦 , 𝑦) ≜ (b𝑦 − 𝑦)²        (regression)        (4.1b)

When we compute the average misclassification (4.1a), we usually refer to it as the misclassification rate.
The misclassification rate is often a natural quantity to consider in classification, but for imbalanced or
asymmetric problems other aspects might be more important, as we discuss in Section 4.5.
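As a small sketch, both default error functions in (4.1) amount to a simple average over predictions and measured outputs (the arrays below are made up purely for illustration).

```python
import numpy as np

# Classification: misclassification rate, the average of the error function (4.1a).
y_hat_class = np.array([0, 1, 1, 0, 1])
y_class = np.array([0, 1, 0, 0, 1])
misclassification_rate = np.mean(y_hat_class != y_class)

# Regression: average squared error, the average of (4.1b).
y_hat_reg = np.array([1.2, 0.7, 2.9])
y_reg = np.array([1.0, 1.0, 3.0])
mean_squared_error = np.mean((y_hat_reg - y_reg) ** 2)

print(misclassification_rate, mean_squared_error)
```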
The error function 𝐸 (b𝑦 , 𝑦) has similarities to a loss function 𝐿(b 𝑦 , 𝑦). However, they are used differently:
A loss function is used to train a model, whereas we use the error function to analyze performance of
an already trained model. There are reasons for choosing 𝐸 (b 𝑦 , 𝑦) and 𝐿 (b𝑦 , 𝑦) differently, which we will
come back to soon.
In the end, supervised machine learning amounts to designing a method which performs well when
faced with an endless stream of new, unseen data. Imagine for example all real-time recordings of street
views that have to be processed by a vision system in a self-driving car once it is sold to a customer, or
all incoming patients that have to be classified by a medical diagnosis system once it is implemented
in clinical practice. The performance on fresh unseen data can in mathematical terms be understood as
the average of the error function: how often the classifier is right, or how well the regression method
predicts. To be able to mathematically describe the endless stream of new data, we introduce a distribution
over data 𝑝(x, 𝑦). In most other chapters, we only consider the output 𝑦 as a random variable whereas the
input x is considered fixed. In this chapter, however, we have to think of also the input x as a random
variable with a certain probability distribution. In any real-world machine learning scenario 𝑝(x, 𝑦) can
be extremely complicated and practically impossible to write down. We will nevertheless use 𝑝(x, 𝑦)
to reason about supervised machine learning methods, and the bare notion of 𝑝(x, 𝑦) (even though it is
unknown in practice) will be helpful for that.
No matter which specific classification or regression method we consider, once it has been trained on training data T = {x𝑖 , 𝑦 𝑖 }_{𝑖=1}^{𝑛} , it will return predictions b𝑦 (x★) for any new input x★ we give to it. We will


in this chapter write b𝑦 (x; T ) to emphasize that the training data T was used to train the model. Indeed,
different training datasets will train the model differently and, consequently, give different predictions.
In the other chapters we mostly discuss how a model predicts one, or a few, test inputs x★. Let us take
that to the next level by integrating (averaging) the error function (4.1) over all possible test data points
with respect to the distribution 𝑝(x, 𝑦). We refer to this as the expected new data error

𝐸 new ≜ E★ [𝐸 (b𝑦 (x★; T ), 𝑦★)] ,        (4.2)

where the expectation E★ is the expectation over all possible test data points with respect to the distribution
(x★, 𝑦★) ∼ 𝑝(x, 𝑦), that is,

E★ [𝐸 (b𝑦 (x★; T ), 𝑦★)] = ∫ 𝐸 (b𝑦 (x★; T ), 𝑦★) 𝑝(x★, 𝑦★) 𝑑x★ 𝑑𝑦★ .        (4.3)

Remember that the model (regardless if it is linear regression, a classification tree, an ensemble of trees, a
neural network or something else) is trained on a given training dataset T and represented by b 𝑦 (·; T ).
What is happening in equation (4.2) is an averaging over possible test data points (x★, 𝑦★). Thus, 𝐸 new
describes how well the model generalizes from the training data T to new situations.
We also introduce the training error

𝐸 train ≜ (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝐸 (b𝑦 (x𝑖 ; T ), 𝑦 𝑖 ) ,        (4.4)

where {x𝑖 , 𝑦 𝑖 }_{𝑖=1}^{𝑛} is the training data T . 𝐸 train simply describes how well a method performs on the training data on which it was trained, but gives no information on how well the method will perform for new unseen data points.¹

Time to reflect 4.1: What is 𝐸 train for 𝑘-NN with 𝑘 = 1?

Whereas the training error 𝐸 train describes how well the method is able to “reproduce” the data from
which it was learned, the expected new data error 𝐸 new tells us how well a method performs when we put
it into production; what proportions of predictions a classifier will get right, and how well a regression
method will predict in terms of average squared error. Or, in a more applied setting, what rate of false and
missed detections of pedestrians we can expect a vision system in a self-driving car to make, or how big a
proportion of all future patients a medical diagnosis system will get wrong.

The overall goal in supervised machine learning is to achieve as small 𝐸 new as possible.

This actually sheds some additional light upon the comment we made previously, that the loss function
𝐿(b𝑦 , 𝑦) and the error function 𝐸 (b
𝑦 , 𝑦) do not have to be the same. As we will discuss thoroughly in this
chapter, a model which fits the training data well and consequently has a small 𝐸 train might still have a
large 𝐸 new when faced with new unseen data. The best strategy to achieve as small 𝐸 new as possible is
therefore not necessarily to minimize 𝐸 train . Besides the fact that the misclassification (4.1a) is unsuited as an optimization objective (it is discontinuous and has derivative zero almost everywhere), it can also, depending on the method, be argued that 𝐸 new can be made smaller by a more clever choice of loss function. Such examples include gradient boosting and support vector machines. (Of course, this is only relevant for methods that are trained using a loss function; 𝑘-NN, for example, is not. The idea of evaluating using an error function applies, however, to all methods, no matter how they are trained.)
In practical cases we can, unfortunately, never compute 𝐸 new to assess how well we are doing. The
reason is that 𝑝(x, 𝑦)—which we do not know in practice—is part of the definition of 𝐸 new . It is, however,
1 The term “risk function” is used in some literature both for loss and error functions, assuming they are chosen equally. In that
terminology, 𝐸 new is referred to as “expected risk”, 𝐸 train as “empirical risk”, and the idea of minimizing the cost function as
“empirical risk minimization”.


too important a construction to be abandoned just because we cannot compute it. We will instead spend quite some effort both on trying to estimate 𝐸 new (essentially by replacing the integral with a sum) and on analyzing how 𝐸 new behaves, to better understand how we can decrease it.
We would like to emphasize that 𝐸 new is a property of a trained model and a specific machine learning
problem. That is, we cannot talk about “𝐸 new for logistic regression” in general, but instead we have to
make more specific statements, like “𝐸 new for the handwritten digit recognition problem, with a logistic
regression classifier trained with the MNIST data2 ”.

4.2 Estimating 𝐸 new


There are multiple reasons for a machine learning engineer to be interested in 𝐸 new , such as:
• judging if the performance is satisfying (whether 𝐸 new is small enough), or if more work should be
put into the solution and/or more training data should be collected
• choosing between different methods
• choosing hyperparameters (such as 𝑘 in 𝑘-NN, the regularization parameter in ridge regression or
the number of hidden layers in deep learning) in order to minimize 𝐸 new
• reporting the expected performance to the customer
As discussed above, we can unfortunately not compute 𝐸 new in any practical situation. We will therefore
explore some possibilities to estimate 𝐸 new , which will lead us to a very useful concept known as
cross-validation.

𝐸 train ≉ 𝐸 new : We cannot estimate 𝐸 new from training data


We have both introduced the expected new data error, 𝐸 new , and the training error 𝐸 train . In contrast to
𝐸 new , we can always compute 𝐸 train .
We assume for now that T consists of samples (data points) from 𝑝(x, 𝑦). This assumption means that
the training data is collected under similar circumstances as the ones the learned model will be used under,
which seems reasonable.
When an integral is hard to compute, it can be numerically approximated with a sum. Now, the question
is if the integral in 𝐸 new can be well approximated by the sum in 𝐸 train , like
𝐸 new = ∫ 𝐸 (b𝑦 (x; T ), 𝑦) 𝑝(x, 𝑦) 𝑑x 𝑑𝑦  ≈?  (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝐸 (b𝑦 (x𝑖 ; T ), 𝑦 𝑖 ) = 𝐸 train .        (4.5)

Or, put differently: Can we expect a method to perform equally well when faced with new, previously
unseen, data, as it did on the training data?
The answer is, unfortunately, no.

Time to reflect 4.2: Why can we not expect the performance on training data (𝐸 train ) to be a good
approximation for how a method will perform on new, previously unseen data (𝐸 new ), even though
the training data is drawn from the distribution 𝑝(x, 𝑦)?

Equation (4.5) does not hold, and the reason is that the training data points are not just any data points: the predictions b𝑦 depend on them, since they were used for training the model. (Technically, the conditions for approximating an integral with a sum are not fulfilled, since b𝑦 depends on T .)
As we will discuss more thoroughly later, the average behavior of 𝐸 train and 𝐸 new is, in fact, typically
𝐸 train < 𝐸 new . That means that a method usually performs worse on new, unseen data, than on training
data. The performance on training data 𝐸 train is therefore not a good estimate of 𝐸 new .
2 http://yann.lecun.com/exdb/mnist/


[Figure 4.1 graphic: all available data split into training data T and a hold-out validation dataset.]

Figure 4.1: The hold-out validation dataset approach: If we split the available data in two sets and train the model
on the training set, we can compute 𝐸 hold-out using the hold-out validation set. 𝐸 hold-out is an unbiased estimate of
𝐸 new , and the more data in the hold-out validation dataset the less variance (better estimate) in 𝐸 hold-out , but the less
data left for training the model (larger 𝐸 new ). The split here is only pictorial, in practice one should always split the
data randomly.

𝐸 hold-out ≈ 𝐸 new : We can estimate 𝐸 new from hold-out validation data


We could not use the “approximate-the-integral-with-a-finite-sum” trick to estimate 𝐸 new by 𝐸 train , due to
the fact that it effectively meant using the training data twice: first, to train the model (b 𝑦 in (4.4)) and
second, to evaluate the error function (the sum in (4.4)). A remedy is to set aside some hold-out validation
data {x 𝑗 , 𝑦 𝑗 }_{𝑗=1}^{𝑛 𝑣} , which are not part of the training data T , and then use the hold-out validation data only for estimating the model performance as the hold-out validation error

𝐸 hold-out ≜ (1/𝑛 𝑣) ∑_{𝑗=1}^{𝑛 𝑣} 𝐸 (b𝑦 (x 𝑗 ; T ), 𝑦 𝑗 ) .        (4.6)

In this way, not all data will be used for training, but some data points (the hold-out validation data) will
be saved and used only for computing 𝐸 hold-out . This procedure is a simple version of cross-validation,
and is illustrated by Figure 4.1.

Be aware! If you are splitting your data, always do it randomly! Someone might—intentionally or
unintentionally—have sorted the dataset for you. If you do not split randomly, your binary classification
problem might end up with one class in your training data and the other class in your hold-out validation
data . . .
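A minimal sketch of the hold-out approach is given below. It assumes a hypothetical training routine `train` that returns a prediction function b𝑦(·) and an error function `err`; the 25% split fraction is an arbitrary choice.

```python
import numpy as np

def hold_out_error(X, y, train, err, frac_val=0.25, seed=0):
    """Randomly split (X, y), train on one part and compute E_hold-out (4.6) on the other."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)   # always split randomly!
    n_val = int(frac_val * n)
    val, tr = idx[:n_val], idx[n_val:]
    model = train(X[tr], y[tr])                         # hypothetical training routine
    return np.mean([err(model(x), t) for x, t in zip(X[val], y[val])])
```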

With the conditions 𝑛 𝑣 ≥ 1 and that all data is drawn from 𝑝(x, 𝑦), it can be shown that 𝐸 hold-out is an
unbiased estimate of 𝐸 new (meaning that if the entire procedure is repeated multiple times, each time with
new data, the average value of 𝐸 hold-out will be 𝐸 new ). That is reassuring, at least on a theoretical level, but
it does not tell us how close 𝐸 hold-out will be to 𝐸 new in a single experiment. However, the variance of
𝐸 hold-out decreases when the size of hold-out validation data 𝑛 𝑣 increases; a small variance of 𝐸 hold-out
means that we can expect it to be close to 𝐸 new . Thus, if we make the hold-out validation dataset large enough, 𝐸 hold-out will be close to 𝐸 new . However, setting aside a large validation dataset means that the dataset left for training becomes small. As a rule of thumb, the more training data, the smaller 𝐸 new (which we will discuss later in Section 4.3). That is bad news, since achieving a small 𝐸 new is our ultimate goal.
Sometimes there is a lot of available data. When we really have a lot of data, we can indeed afford to set aside a few percent of the data as a reasonably large hold-out validation dataset, without sacrificing the size of the training dataset too much. In such data-rich situations, the hold-out validation data approach is sufficient.
If the amount of available data is more limited, this becomes more of a problem. We are in practice
faced with the following dilemma: the better we want to know 𝐸 new (more hold-out validation data gives
less variance in 𝐸 hold-out ), the worse we have to make it (less training data increases 𝐸 new ). That is not very
satisfying, and we have to look for an alternative to the hold-out validation data approach.


k-fold cross-validation: 𝐸 𝑘-fold ≈ 𝐸 new without setting aside validation data

To avoid setting aside validation data, but still obtain an estimate of 𝐸 new , one could suggest a two-step
procedure of

(i) splitting the available data in one training and one hold-out validation set, train the model on the
training data and compute 𝐸 hold-out using hold-out validation data (as in Figure 4.1), and then

(ii) training the model again, this time using the entire dataset.

By such a procedure, we get an estimate of 𝐸 new at the same time as a model trained on the entire dataset.
That is not bad, but not perfect either. Why? To achieve small variance in the estimate, we have to put lots
of data in the hold-out validation dataset in step (i). Unfortunately, that means the model trained in step (i) will possibly be very different from the model trained in step (ii), and the estimate of 𝐸 new concerns the model from step (i), not the possibly very different model from step (ii). Hence, this will not give us a good estimate of 𝐸 new . It is,
however, rather close to the useful 𝑘-fold cross-validation idea.
We would like to use all available data to train a model, and at the same time have a good estimate
of 𝐸 new for that model. By 𝑘-fold cross-validation, we can achieve (almost) this. The idea of 𝑘-fold
cross-validation is simply to repeat the hold-out validation dataset approach multiple times with a different
hold-out dataset each time, in the following way:

(i) split the dataset in 𝑘 batches of similar size (see Figure 4.2), and let ℓ = 1

(ii) take batch ℓ as the hold-out validation data, and the remaining batches as training data
(iii) train the model on the training data, and compute 𝐸 hold-out^(ℓ) as the average error on the hold-out validation data, (4.6)

(iv) if ℓ < 𝑘, set ℓ ← ℓ + 1 and return to (ii). If ℓ = 𝑘, compute the 𝑘-fold cross-validation error

        𝐸 𝑘-fold ≜ (1/𝑘) ∑_{ℓ=1}^{𝑘} 𝐸 hold-out^(ℓ)        (4.7)

(v) train the model again, this time using the entire dataset

This is illustrated in Figure 4.2.
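The procedure maps almost line by line to code. The sketch below assumes, as before, a hypothetical `train` routine that returns a prediction function and an error function `err`; the default 𝑘 = 10 is an arbitrary choice.

```python
import numpy as np

def k_fold_cv(X, y, train, err, k=10, seed=0):
    """k-fold cross-validation: returns the final model (trained on all data) and E_k-fold (4.7)."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)    # (i) random split into k batches
    batches = np.array_split(idx, k)
    errors = []
    for ell in range(k):                                 # loop over ell = 1, ..., k
        val = batches[ell]                               # (ii) batch ell is hold-out validation data
        tr = np.concatenate([b for j, b in enumerate(batches) if j != ell])
        model = train(X[tr], y[tr])                      # (iii) train on the remaining batches
        errors.append(np.mean([err(model(x), t) for x, t in zip(X[val], y[val])]))
    e_k_fold = np.mean(errors)                           # (iv) average the k hold-out errors
    final_model = train(X, y)                            # (v) train again on the entire dataset
    return final_model, e_k_fold
```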


With 𝑘-fold cross-validation, we get a model which is trained on all data, as well as an approximation of
𝐸 new for that model, namely 𝐸 𝑘-fold . Whereas 𝐸 hold-out (Section 4.2) was an unbiased estimate of 𝐸 new (at the cost of setting aside hold-out validation data), 𝐸 𝑘-fold is only approximately so. However, with 𝑘 large
enough, it turns out to often be a sufficiently good approximation. Let us try to understand why 𝑘-fold
cross-validation works.
First, we have to distinguish between the final model, which is trained on all data in step (v), and
the intermediate models which are trained on all except a 1/𝑘 fraction of the data in step (iii). The
key in 𝑘-fold cross-validation is that if 𝑘 is large enough, the intermediate models are quite similar to
the final model (since they are trained on almost the same dataset, only a fraction 1/𝑘 of the data is
missing). Furthermore, each intermediate 𝐸 hold-out^(ℓ) is an unbiased but high-variance estimate of 𝐸 new for the corresponding intermediate model ℓ. Since all intermediate models and the final model are similar, 𝐸 𝑘-fold (4.7)
is approximately the average of 𝑘 high-variance estimates of 𝐸 new for the final model. When averaging
estimates, the variance decreases and 𝐸 𝑘-fold will have a lower variance.

Be aware! For the same reason as with the hold-out validation data approach, it is important to always
split the data randomly for cross-validation to work! A simple solution is to first randomly permute the
entire dataset, and thereafter split it into batches.


[Figure 4.2 graphic: all available data is split into 𝑘 batches; for ℓ = 1, . . . , 𝑘, batch ℓ is used as validation data and the remaining batches as training data, giving 𝐸 hold-out^(ℓ); the average of these is 𝐸 𝑘-fold.]

Figure 4.2: Illustration of 𝑘-fold cross-validation. The data is split into 𝑘 batches of similar sizes. When looping over ℓ = 1, 2, . . . , 𝑘, batch ℓ is held out as validation data, and the model is trained on the remaining 𝑘 − 1 data batches. Each time, the trained model is used to compute the average error 𝐸 hold-out^(ℓ) for the validation data. The final model is trained using all available data, and the estimate of 𝐸 new for that model is 𝐸 𝑘-fold , the average of all 𝐸 hold-out^(ℓ).

We usually talk about training (or learning) as a procedure that is executed once. However, in 𝑘-fold
cross-validation the training is repeated 𝑘 (or even 𝑘 + 1) times. A special case is 𝑘 = 𝑛 which is also called
leave-one-out cross-validation. For methods such as linear regression, the actual training (solving the
normal equations) is usually done within milliseconds on modern computers, and doing it an extra 𝑛 times might not be a problem in practice. If the training is computationally intense (as for deep learning), it becomes a rather computationally demanding procedure, and a choice like 𝑘 = 10 might be more practically feasible. If there is a lot of data available, it is also an option to use the computationally less demanding hold-out validation approach.

Using a test dataset

A very important use of 𝐸 𝑘-fold (or 𝐸 hold-out ) in practice is to choose between methods and select
different types of hyperparameters such that 𝐸 𝑘-fold (or 𝐸 hold-out ) becomes as small as possible. Typical
hyperparameters to choose in this way are 𝑘 in 𝑘-NN, tree depths or regularization parameters. However,
much like we cannot use the training data error 𝐸 train to estimate the new data error 𝐸 new , selecting
models and hyperparameters based on 𝐸 𝑘-fold (or 𝐸 hold-out ) will invalidate its use as an estimator of 𝐸 new .
If it is important to have a good estimate of the final 𝐸 new , it is wise to first set aside another hold-out
dataset, which we refer to as a test set. This test set should be used only once (after selecting models and
hyperparameters) to estimate 𝐸 new for the final model.
In problems where the training data is expensive, it is common to increase the training dataset using
more or less artificial techniques. Such techniques can be to duplicate the data and add noise to the
duplicated versions, to use simulated data, or to use data from a different but related problem, as we
discuss in more depth in Chapter 10. With such techniques (which indeed can be very successful), the
training data T is no longer drawn from 𝑝(x, 𝑦). In the worst case (if the artificial training data is very


poor), T might not provide any information about 𝑝(x, 𝑦), and we can not really expect the model to
learn anything useful. It can therefore be very useful to have a good estimate of 𝐸 new if such techniques
were used during training, but a reliable estimate of 𝐸 new can only be achieved from data that we know
are drawn from 𝑝(x, 𝑦) (that is, collected under production-like circumstances). If the training data is
extended artificially, it is therefore extra important to set aside a test dataset before that extension is done.
The error function evaluated on the test dataset could indeed be called “test error”. To avoid confusion, however, we do not use the term “test error”, since it is commonly used (ambiguously) both as a name for the error on the test dataset and as another name for 𝐸 new .

4.3 The training error–generalization gap decomposition of 𝐸 new

Designing a method with small 𝐸 new is the goal in supervised machine learning, and cross-validation
helps in estimating 𝐸 new . However, more can be said to also understand 𝐸 new . To be able to reason about
𝐸 new , we have to introduce another abstraction level, namely the training-data averaged versions of 𝐸 new
and 𝐸 train ,

𝐸¯ new ≜ E T [𝐸 new ] ,        (4.8a)
𝐸¯ train ≜ E T [𝐸 train ] .        (4.8b)

Here, E T denotes the expected value when the training dataset T = {x𝑖 , 𝑦 𝑖 }_{𝑖=1}^{𝑛} (of a fixed size 𝑛) is drawn from 𝑝(x, 𝑦). Thus 𝐸¯ new is the average 𝐸 new if we were to train the model multiple times on different
from 𝑝(x, 𝑦). Thus 𝐸¯ new is the average 𝐸 new if we would train the model multiple times on different
training datasets, and similarly for 𝐸¯ train . The point of introducing these, as it turns out, is that we can
say more about the average behavior 𝐸¯ new and 𝐸¯ train , than we can say about 𝐸 new and 𝐸 train when the
model is trained on one specific training dataset T . Even though we most often care about 𝐸 new in the
end (the training data is usually fixed), insights from studying 𝐸¯ new are still useful. In fact, the 𝑘-fold
cross-validation is rather an estimate of 𝐸¯ new instead of 𝐸 new , since each model in 𝑘-fold is trained with a
(slightly) different training dataset.
We have already discussed the fact that 𝐸 train cannot be used in estimating 𝐸 new . In fact, it usually holds
that

𝐸¯ train < 𝐸¯ new , (4.9)

Put in words, this means that on average, a method usually performs worse on new, unseen data than on training data. A method's ability to perform well on unseen data after being trained on training data can be understood as the method's ability to generalize from training data. We consequently call the difference between 𝐸¯ new and 𝐸¯ train the generalization gap³, as

generalization gap ≜ 𝐸¯ new − 𝐸¯ train .        (4.10)

The generalization gap is the difference between performance on training data and the performance ‘in
production’ on new, previously unseen data. Since we can always compute 𝐸 train and cross-validation
gives an estimate of 𝐸 new , we can also estimate the generalization gap in practice.
With the decomposition of 𝐸¯ new into

𝐸¯ new = 𝐸¯ train + generalization gap, (4.11)

we also have an opening for digging deeper and trying to understand what affects 𝐸¯ new in practice. We
will refer to (4.11) as the training error–generalization gap decomposition of 𝐸¯ new .
3 We use a loose terminology and refer also to 𝐸 new − 𝐸 train as the generalization gap.


[Figure 4.3 graphic: error versus model complexity, with 𝐸¯ train decreasing and 𝐸¯ new U-shaped; the gap between them is the generalization gap, with underfit to the left and overfit to the right.]

Figure 4.3: Behavior of 𝐸¯ train and 𝐸¯ new for many supervised machine learning methods, as a function of model
complexity. We have not made a formal definition of complexity, but a rough proxy is the number of parameters
that are learned from the data. The difference between the two curves is the generalization gap. The training error
𝐸¯ train decreases as the model complexity increases, whereas the new data error 𝐸¯ new typically has a U-shape. If the
model is so complex that 𝐸¯ new is larger than it had been with a less complex model, the term overfit is commonly
used. The term underfit is somewhat less commonly used for the opposite situation. The level of model complexity
which gives the minimum 𝐸¯ new (at the dotted line) could be called a balanced fit. When we, for example, use
cross-validation to select hyperparameters (that is, tuning the model complexity), we are searching for a balanced fit.

What affects the generalization gap

The generalization gap depends on the method and the problem. Concerning the method, one can typically
say that the more a method adapts to training data, the larger the generalization gap. A theoretical
framework for how much a method adapts to training data is given by the so-called VC dimension.
From the VC dimension framework, probabilistic bounds on the generalization gap can be derived, but
those bounds are unfortunately rather conservative, and we will not pursue that approach any further.
Instead, we only use the vague terms model complexity or model flexibility (we use them as synonyms),
by which we mean the ability of a method to adapt to patterns in the training data. A model with high
complexity/flexibility (such as a fully connected deep neural network, deep trees and 𝑘-NN with small 𝑘)
can describe almost arbitrarily complicated relationships, whereas a model with low complexity/flexibility
(such as logistic regression) is less flexible in what functions it can describe. For parametric methods,
the model complexity is somewhat related to the number of parameters that are trained, but is also
affected by regularization techniques. As we will come back to later, the idea of model complexity is an
oversimplification and does not capture the full nature of various supervised machine learning methods,
but it nevertheless carries some useful intuition.
Typically, higher model complexity implies a larger generalization gap. Furthermore, 𝐸¯ train decreases as the model complexity increases, whereas 𝐸¯ new typically attains a minimum for some intermediate model complexity value: too low and too high model complexity both raise 𝐸¯ new . This is illustrated in Figure 4.3. Too high a model complexity, meaning that 𝐸¯ new is higher than it had been with a less complex model, is called overfit. The other situation, when the model complexity is too low, is sometimes called underfit. In a consistent terminology, the point where 𝐸¯ new attains its minimum could be referred to as a balanced fit. Since the goal is to minimize 𝐸¯ new , we are interested in finding this sweet spot. We also illustrate this by Example 4.1.
Note that we discuss the usual behavior of 𝐸¯ new , 𝐸¯ train and the generalization gap. We use the term ‘usual’
because there are so many supervised machine learning methods and problems that it is almost impossible


to make any claim that is always true for all possible situations, and pathological counter-examples also
exist. One should also keep in mind that claims about 𝐸¯ train and 𝐸¯ new are about the average behavior,
which hopefully is clear in Example 4.1.

Example 4.1: The training error–generalization gap decomposition for 𝑘-NN

We consider a simulated binary classification example with a two-dimensional input x. In contrast to all real-world machine learning problems, in a simulated problem like this we do know 𝑝(x, 𝑦) (otherwise we could not make the simulation). In this example, 𝑝(x) is a uniform distribution on the square [−1, 1]², and
𝑝(𝑦 | x) is defined as follows: all points above the dotted curve in the figure below are green with probability
0.8, and points below the curve are red with probability 0.8. (The optimal classifier, in terms of minimal
𝐸 new , would have the dotted line as its decision boundary and achieve 𝐸 new = 0.2.)

[Figure: the square input region [−1, 1]² with the dotted curve that separates the two classes.]
We have 𝑛 = 200 in the training data, and learn three classifiers: 𝑘-NN with 𝑘 = 70, 𝑘 = 20 and 𝑘 = 2,
respectively. In terms of model complexity, 𝑘 = 70 gives the least flexible model, and 𝑘 = 2 the most flexible
model. We plot their decision boundaries, together with the training data:
[Figure: decision boundaries and training data for 𝑘-NN with 𝑘 = 70 (left), 𝑘 = 20 (middle) and 𝑘 = 2 (right).]
Intuitively we see that 𝑘 = 2 (right) adapts too well to the data. With 𝑘 = 70, on the other hand, the model is
rigid enough not to adapt to the noise, but appears to possibly be too inflexible to adapt well to the true
dotted line above.
We can compute 𝐸 train by counting the fraction of misclassified training data points in the figures above.
From left to right, we get 𝐸 train = 0.27, 0.24, 0.22. Since this is a simulated example, we can also access
𝐸 new (or rather estimate it numerically by simulating a lot of test data), and from left to right we get
𝐸 new = 0.26, 0.23, 0.33. This pattern resembles Figure 4.3, except for the fact that 𝐸 new is actually smaller than 𝐸 train for some values of 𝑘. This is, however, not unexpected. What we have discussed in the main text
is the average 𝐸¯ new and 𝐸¯ train , not the situation with 𝐸 new and 𝐸 train for one particular set of training data.
To study 𝐸¯ new and 𝐸¯ train , we therefore repeat this experiment 100 times, and compute the average over those
100 experiments:

            𝑘-NN with 𝑘 = 70    𝑘-NN with 𝑘 = 20    𝑘-NN with 𝑘 = 2
𝐸¯ train          0.24                0.22                0.17
𝐸¯ new            0.25                0.23                0.30

This table follows Figure 4.3 well: The generalization gap (difference between 𝐸¯ new and 𝐸¯ train ) is positive and
increases with model complexity (decreasing 𝑘 in 𝑘-NN), whereas 𝐸¯ train decreases with model complexity.
Among these values for 𝑘, 𝐸¯ new has its minimum for 𝑘 = 20. This suggests that 𝑘-NN with 𝑘 = 2 suffers
from overfitting for this problem, whereas 𝑘 = 70 is a case of underfitting.


[Figure 4.4 graphics: error versus size of training data 𝑛 for (a) a simple model and (b) a complex model, showing 𝐸¯ train increasing and 𝐸¯ new decreasing towards each other as 𝑛 grows.]

Figure 4.4: Typical relationship between 𝐸¯ new , 𝐸¯ train and the number of data points 𝑛 in the training dataset for a
simple model (low model flexibility, left) and a complex model (high model flexibility, right). The generalization
gap (difference between 𝐸¯ new and 𝐸¯ train ) decreases, at the same time as 𝐸¯ train increases. Typically, a more complex
model (right panel) will for large enough 𝑛 attain a smaller 𝐸¯ new than a simpler model (left panel) would on the
same problem (the axes of the figures are comparable). However, the generalization gap is typically larger for a more
complex model, in particular when the training dataset is small.

We have so far been concerned with the relationship between the generalization gap and the model
complexity. Another very important aspect is the size of the training dataset, 𝑛. We can in general expect
that the more training data, the smaller the generalization gap. On the other hand, 𝐸¯ train typically increases
as 𝑛 increases, since most models are not able to fit all training data perfectly if there are too many of them.
A typical behavior of 𝐸¯ train and 𝐸¯ new is sketched in Figure 4.4.

Reducing 𝐸 new in practice

Our overall goal is to achieve a small error “in production”, that is, a small 𝐸 new . To achieve that, according
to the decomposition 𝐸 new = 𝐸 train + generalization gap, we need to have 𝐸 train as well as the generalization
gap small. Let us draw two conclusions from what we have seen so far.

• The new data error 𝐸 new will (most often) not be smaller than the training error 𝐸 train . Thus, if 𝐸 train
is much bigger than the 𝐸 new you need for your application to be successful, you do not even need to
waste time on implementing cross-validation for estimating 𝐸 new . Instead, you should re-think the
problem and which method you are using.

• The generalization gap and 𝐸 new almost always decrease as 𝑛 increases. Thus, increasing the size of the training data may help a lot, and will at least not hurt.

Making the model more flexible decreases 𝐸 train but often increases the generalization gap. Making
the model less flexible, on the other hand, typically decreases the generalization gap but increases 𝐸 train .
The optimal tradeoff, in terms of small 𝐸 new , is often obtained when neither the generalization gap nor the
training error 𝐸 train is zero. Thus, by monitoring 𝐸 train and estimating 𝐸 new with cross-validation we also
get the following advice:

• If 𝐸 hold-out ≈ 𝐸 train (small generalization gap), it might be beneficial to increase the model flexibility
by loosening the regularization, increasing the model order (more parameters to learn), etc.

• If 𝐸 train is close to zero, it might be beneficial to decrease the model flexibility by tightening the
regularization, decreasing the order (fewer parameters to learn), etc.


Shortcomings of the model complexity scale


When having only one hyperparameter to choose, the situation sketched in Figure 4.3 is often a relevant
picture. However, when having multiple hyperparameters (or even competing methods) to choose, it is
important to realize that the one-dimensional model complexity scale in Figure 4.3 does not do justice to the space of all possible choices. For a given problem, one method can have a smaller generalization
gap than another method without having a larger training error. Some methods are simply better for
certain problems. The one-dimensional complexity scale can be particularly misleading for intricate
deep learning models, but as we illustrate in Example 4.2 it is not even sufficient for the relatively simple
problem of jointly choosing the degree of polynomial regression (higher degree means more flexibility)
and regularization parameter (more regularization means less flexibility).
Example 4.2: Training error and generalization gap for a regression problem

To illustrate how the training error and generalization gap can behave, we consider a simulated problem so
that we can compute 𝐸 new . We let 𝑛 = 10 data points be generated as 𝑥 ∼ U [−5, 10], 𝑦 = min(0.1𝑥 2 , 3) + 𝜀,
and 𝜀 ∼ N (0, 1), and consider the following regression methods:
• Linear regression with 𝐿 2 regularization
• Linear regression with a quadratic polynomial and 𝐿 2 regularization
• Linear regression with a third order polynomial and 𝐿 2 regularization
• Regression tree
• A random forest (Chapter 7) with 10 regression trees
For each of these methods, we try a few different values of the hyperparameters (regularization parameter
and tree depth, respectively), and compute 𝐸¯ train and the generalization gap.
[Figure: generalization gap versus 𝐸¯ train for the five methods, with one marker per hyperparameter value (regularization parameter or maximum tree depth): linear regression with 𝐿 2 regularization, linear regression with a 2nd order polynomial (green) and a 3rd order polynomial (red), a regression tree (purple), and a random forest (black).]
For each method the hyperparameter that minimizes 𝐸¯ new is the value which is closest (in the 1-norm sense)
to the origin, since 𝐸¯ new = 𝐸¯ train + generalization gap. Having decided on a certain model, and only having
one hyperparameter left to choose, corresponds well to the situation in Figure 4.3.
However, when we compare the different methods, a more complicated situation is revealed than is
described by the one-dimensional model complexity scale. Compare, for example, the second (green) to the
third order polynomial (red) linear regression: for some values of the regularization parameter, the training
error decreases without increasing the generalization gap. Similarly, for a maximum tree depth of 2, the generalization gap is smaller for the random forest (black) than for the regression tree (purple), while the training error remains the same. These types of relationships are quite intricate, problem-dependent and impossible to describe
using the simplified picture in Figure 4.3.


In any real problem we cannot make a plot such as the one in Example 4.2; that is only possible for simulated
problems. In practice it is therefore important to use your knowledge about the problem and how different
methods are designed, in order to make a good choice. It is good to choose models that are known to work
well for a specific type of data and use experience from similar problems. We can also use cross-validation
for selecting between different models and choosing hyperparameters. Despite the simplified picture, the
intuition about under- and overfit from Figure 4.3 can still be very helpful when deciding on what method
or hyperparameter value to explore next with cross-validation.

4.4 The bias-variance decomposition of 𝐸 new


We will now introduce another decomposition of 𝐸¯ new into a (squared) bias and a variance term, as well
as an unavoidable component of irreducible noise. This decomposition is somewhat more abstract than
the training error–generalization gap decomposition, but provides some additional insights into 𝐸 new and how different models
behave.
Let us first make a short reminder of the general concepts of bias and variance. Consider an experiment
with an unknown constant 𝑧 0 , which we would like to estimate. To our help for estimating 𝑧 0 we have a
random variable 𝑧. Think, for example, of 𝑧 0 as being the (true) position of an object, and 𝑧 of being noisy
GPS measurements of that position. Since 𝑧 is a random variable, it has some mean E[𝑧] which we denote
by 𝑧¯. We now define

Bias: 𝑧¯ − 𝑧 0        (4.12a)
Variance: E[(𝑧 − 𝑧¯)²] = E[𝑧²] − 𝑧¯² .        (4.12b)

The variance describes how much the experiment varies each time we perform it (the amount of noise in
the GPS measurements), whereas the bias describes the systematic error in 𝑧 that remains no matter how
many times we repeat the experiment (a possible shift or offset in the GPS measurements). If we consider
the expected squared error between 𝑧 and 𝑧 0 as a metric of how good the estimator 𝑧 is, we can re-write it
in terms of the variance and the squared bias,

E[(𝑧 − 𝑧 0)²] = E[((𝑧 − 𝑧¯) + (𝑧¯ − 𝑧 0))²] = E[(𝑧 − 𝑧¯)²] + 2 (E[𝑧] − 𝑧¯)(𝑧¯ − 𝑧 0) + (𝑧¯ − 𝑧 0)² ,        (4.13)

where the first term is the variance, the middle term is zero (since E[𝑧] = 𝑧¯), and the last term is the squared bias.

In words, the average squared error between 𝑧 and 𝑧 0 is the sum of the squared bias and the variance. The
main point here is that to obtain a small expected squared error, we have to consider the bias and the
variance. A small bias alone, or a small variance alone, is not enough; both aspects are important.
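As a small numerical sketch of (4.12)–(4.13), the code below repeats a made-up GPS-like experiment many times and estimates the bias, the variance and the expected squared error of an estimator with a deliberate systematic offset.

```python
import numpy as np

rng = np.random.default_rng(0)

z0 = 3.0                       # unknown true quantity (a made-up "true position")
offset, noise_std = 0.5, 1.0   # systematic error and measurement noise, arbitrary choices

# Repeat the experiment many times: each z is a noisy, systematically shifted measurement.
z = z0 + offset + noise_std * rng.normal(size=100_000)

bias = z.mean() - z0                    # (4.12a), approximately equal to offset
variance = z.var()                      # (4.12b), approximately equal to noise_std**2
expected_sq_error = np.mean((z - z0) ** 2)

print(bias, variance, expected_sq_error)   # expected_sq_error is approx. bias**2 + variance
```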
We will now apply the bias and variance concept to our supervised machine learning setting, and
in particular to the regression problem. The intuition, however, carries over also to the classification
problem4 . In this setting, 𝑧 0 corresponds to the true relationship between inputs and output, and the
random variable 𝑧 corresponds to the model learned from training data. (Since the training data collection
includes randomness, the model learned from it will also be random.) Let us spell out the details.
We first make the assumption that the true relationship between input x and output 𝑦 can be described
as some (possibly very complicated) function 𝑓0 (x) plus independent noise 𝜀,

𝑦 = 𝑓0 (x) + 𝜀, with E[𝜀] = 0 and var(𝜀) = 𝜎 2 . (4.14)

In our notation, b
𝑦 (x; T ) represents the model when it is trained on training data T . This is our random
variable, corresponding to 𝑧 above. We now also introduce the average trained model, corresponding to 𝑧¯,

𝑓¯(x) ≜ E T [b𝑦 (x; T )] .        (4.15)
4 For intuition, we may think of classification problems as regression in terms of the decision boundaries.


As before, E T denotes the expected value over training data drawn from 𝑝(x, 𝑦). Thus, 𝑓¯(x) is the
(hypothetical) average model we would achieve, if we could re-train the model an infinite number of times
on different training datasets and compute the average.
Remember that the definition of 𝐸¯ new (for regression with squared error) is
𝐸¯ new = E T [E★ [(b𝑦 (x★; T ) − 𝑦★)²]] ,        (4.16)

and assuming these integrals (expressed as expected values) fulfill some technical assumptions, we can
change the order of integration and write (4.16) as
𝐸¯ new = E★ [E T [(b𝑦 (x★; T ) − 𝑓0 (x★) − 𝜀)²]]        (4.17)

With a slight extension of (4.13) to also include the zero-mean noise term 𝜀 (which is independent of b𝑦 (x★; T )), we can rewrite the expression inside the expected value E★ in (4.17) as

E T [(b𝑦 (x★; T ) − 𝑓0 (x★) − 𝜀)²] = (𝑓¯(x★) − 𝑓0 (x★))² + E T [(b𝑦 (x★; T ) − 𝑓¯(x★))²] + 𝜀² ,        (4.18)

where b𝑦 (x★; T ) plays the role of “𝑧” and 𝑓0 (x★) the role of “𝑧 0” in (4.13).

This is (4.13) applied to supervised machine learning. In 𝐸¯ new , which we are interested in decomposing,
we also have the expectation over new data points E★. By incorporating also that expected value in the
expression, we can decompose 𝐸¯ new as
𝐸¯ new = E★ [(𝑓¯(x★) − 𝑓0 (x★))²] + E★ [E T [(b𝑦 (x★; T ) − 𝑓¯(x★))²]] + 𝜎² ,        (4.19)

where the three terms are the squared bias, the variance and the irreducible error, respectively.

The squared bias term E★ [(𝑓¯(x★) − 𝑓0 (x★))²] now describes how much the average trained model 𝑓¯(x★) differs from the true 𝑓0 (x★), averaged over all possible data points x★. In a similar fashion, the variance term E★ [E T [(b𝑦 (x★; T ) − 𝑓¯(x★))²]] describes how much b𝑦 (x; T ) varies each time the model is trained
on a different training dataset. The bias term cannot be small unless the model is flexible enough such that
𝑓¯(x) can be close to 𝑓0 (x). If the variance term is small, the model is not very sensitive to exactly which
data points happened to be in the training data, and vice versa. The irreducible error 𝜎 2 is simply an effect
of the assumption (4.14) —it is not possible to predict 𝜀 since it is truly random. There is not so much
more to say about the irreducible error, but we will focus on the bias and variance terms.

What affects the bias and variance


We have not properly defined model complexity, but we can actually use the bias and variance concept to
give it a more concrete meaning: A high model complexity means low bias and high variance, and a low
model complexity means high bias and low variance, as illustrated by Figure 4.5.
This resonates well with the intuition. The more flexible a model is, the more it will adapt to the training
data T —not only to the interesting patterns, but also to the actual data points and noise that happened to
be in T . That is exactly what is described by the variance term. On the other hand, a model with low
flexibility can be too rigid to capture the true relationship f₀(x) between inputs and outputs well. This
effect is described by the squared bias term.
Figure 4.5 may very well be compared to Figure 4.3, which instead builds on the training error–generalization gap decomposition of Ē_new. Based on Figure 4.5, we can also refer to the challenge of finding the right model complexity level as the bias-variance tradeoff. We give an example of this in Example 4.3.


[Figure 4.5 here: Ē_new, variance, bias², and the irreducible error plotted against model complexity, from underfit (left) to overfit (right).]

Figure 4.5: The bias-variance decomposition of 𝐸¯ new , instead of the training error-generalization gap decomposition
in Figure 4.3. Low model complexity means high bias. The more complicated the model is, the more it adapts to
training data, and the higher variance. The irreducible error is always constant. The problem of achieving a small
𝐸 new by selecting a good model complexity level is often called the bias-variance tradeoff.

[Figure 4.6 here: bias², variance, irreducible error, and Ē_new plotted against the size n of the training data, for (a) a simple model and (b) a complex model.]

Figure 4.6: The typical relationship between bias, variance and the size 𝑛 of the training dataset. The bias is
(approximately) constant, whereas the variance decreases as the size of the training dataset increases. This figure
can be compared with Figure 4.4.

The squared bias term is mostly a property of the model rather than of the training dataset, and we may think⁵ of the bias term as independent of the number of data points n in the training data. The variance term, on the other hand, depends strongly on n. As we know, Ē_new typically decreases as n increases, and
essentially the entire decline in 𝐸¯ new is because of the decline in the variance. Intuitively, the more data,
the more information about the parameters, meaning less variance. This is summarized by Figure 4.6,
which can be compared to Figure 4.4.

5 This is not exactly true. The average model 𝑓¯ might indeed be different if all training datasets (which we average over) contain
𝑛 = 2 or 𝑛 = 100 000 data points, but we neglect that effect here.


Example 4.3: The bias–variance tradeoff for 𝐿 2 regularized linear regression

Let us consider a simulated regression example. We let 𝑝(𝑥, 𝑦) follow from 𝑥 ∼ U [0, 1] and

y = 5 − 2x + x³ + ε,  ε ∼ N(0, 1).  (4.20)

We let the training data consist of only 𝑛 = 10 data points. We now try to model the data using linear
regression with a 4th order polynomial

y = β₀ + β₁x + β₂x² + β₃x³ + β₄x⁴ + ε.  (4.21)

Since (4.20) is a special case of (4.21) and the squared error loss corresponds to Gaussian noise, we do
actually have zero bias for this model if we train it using squared error loss. However, learning 5 parameters
from only 10 data points leads to very high variance, so we decide to train the model with squared error loss
and 𝐿 2 regularization, which will decrease the variance (but increase the bias). The more regularization
(bigger 𝜆), the more bias and less variance.
Since this is a simulated example, we can repeat the experiment multiple times and estimate the bias and
variance terms (since we can simulate as much training and test data as needed). We plot them in the very
same style as Figures 4.3 and 4.5 (note the reversed x-axis: a smaller regularization parameter corresponds
to a higher model complexity). For this problem the optimal value of 𝜆 would have been about 0.7 since
𝐸¯ new attains its minimum there. Finding this optimal 𝜆 is a typical example of the bias–variance tradeoff.

[Figure: Ē_new, Ē_train, bias², variance, and the irreducible error plotted against the regularization parameter, from 10¹ down to 10⁻³ (note the reversed x-axis).]
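For readers who want to reproduce this kind of experiment, the bias and variance terms can be estimated numerically by re-training the model on many freshly simulated training datasets and averaging, exactly as described above. The following sketch (in Python with NumPy; the grid of regularization values, the number of repetitions, and all function names are our own choices, not taken from the example) illustrates the recipe for the L2-regularized 4th order polynomial model.

import numpy as np

def simulate(n, rng):
    # Training data according to (4.20): x ~ U[0, 1], y = 5 - 2x + x^3 + eps, eps ~ N(0, 1)
    x = rng.uniform(0, 1, n)
    y = 5 - 2 * x + x**3 + rng.standard_normal(n)
    return x, y

def poly_features(x, order=4):
    # Design matrix with columns [1, x, x^2, x^3, x^4], cf. (4.21)
    return np.vander(x, order + 1, increasing=True)

def fit_ridge(X, y, lam):
    # Closed-form L2-regularized least squares solution, cf. (5.18)
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 200)            # test inputs x_star
f0_test = 5 - 2 * x_test + x_test**3       # true f_0(x_star)
n, n_repetitions = 10, 1000

for lam in [10.0, 1.0, 0.1, 0.01]:         # regularization grid (our own choice)
    preds = np.empty((n_repetitions, x_test.size))
    for rep in range(n_repetitions):
        x, y = simulate(n, rng)
        theta = fit_ridge(poly_features(x), y, lam)
        preds[rep] = poly_features(x_test) @ theta      # yhat(x_star; T_rep)
    f_bar = preds.mean(axis=0)                           # average model f_bar(x_star)
    bias2 = np.mean((f_bar - f0_test) ** 2)              # E_star[(f_bar - f_0)^2]
    variance = np.mean((preds - f_bar) ** 2)             # E_star[E_T[(yhat - f_bar)^2]]
    print(f"lambda={lam:6.2f}  bias^2={bias2:.3f}  variance={variance:.3f}")

An estimate of Ē_new is then obtained as bias² + variance + σ², where σ² = 1 in this simulation.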

Connections between bias, variance and the generalization gap


The bias and variance are theoretically well defined properties, but often intangible in practice since they
are defined in terms of 𝑝(x, 𝑦). In practice, we only have an estimate of the generalization gap (for example
as E_hold-out − E_train), but neither the variance nor the bias. It is therefore interesting to explore what E_train and the generalization gap say about the bias and variance.
Considering the regression problem, and assuming that the error function (squared error) is also the loss function and that the global minimum is found during training, we can write

σ² + bias² = E★[(f̄(x★) − y★)²] ≈ (1/n) Σ_{i=1}^{n} (f̄(xᵢ) − yᵢ)² ≥ (1/n) Σ_{i=1}^{n} (ŷ(xᵢ; T) − yᵢ)² = E_train.  (4.22)

In the approximate equality, we used the "replace the integral with a sum" trick⁶.
⁶ Since neither f̄(x★) nor y★ depends on the training data {xᵢ, yᵢ}_{i=1}^{n}, we can use {xᵢ, yᵢ}_{i=1}^{n} for approximating the integral.
In the next step, equality is obtained if ŷ = f̄ (which we assume is possible), and inequality otherwise. Remembering that


Ē_new = σ² + bias² + variance, and allowing ourselves to write Ē_new − E_train = generalization gap, we have

generalization gap ≳ variance,  (4.23a)
E_train ≲ bias² + σ².  (4.23b)

We have made several assumptions in this derivation that are not always met in practice, but it at least
gives us some rough idea.
As we discussed previously, the choice of method is crucial for the E_new that is obtained. Again, the one-dimensional scale in Figure 4.5 and the notion of a bias-variance tradeoff give a simplified picture; decreased bias does not always lead to increased variance, and vice versa. However, in contrast to the decomposition of E_new into training error and generalization gap, the bias and variance decomposition can shed more light on why E_new differs between methods: the superiority of one method over another can sometimes be attributed to either a lower bias or a lower variance.
A simple (and useless) way to increase the variance without decreasing the bias in linear regression is to first learn the parameters using the normal equations and thereafter add zero-mean random noise to them. The extra noise does not affect the bias, since the noise has zero mean and hence leaves the average model f̄ unchanged, but the variance increases. (This also affects the training error and the generalization gap, but in a less clear way.) This way of training linear regression would be pointless in practice since it increases E_new, but it illustrates the fact that increased variance does not automatically lead to decreased bias.
A much more useful way of dealing with bias and variance is the meta-method called bagging (Chapter 7).
It makes use of several copies (an ensemble) of a base model, each trained on a slightly different version
of the training dataset. Since bagging averages over many base models, it reduces the variance, but the
bias remains (essentially) unchanged. Hence, by using bagging instead of the base model, the variance is
decreased (almost) without increasing the bias, which consequently decreases 𝐸 new .
To conclude, the world is more complex than just the one-dimensional model complexity scale used in Figures 4.3 and 4.5, as we illustrate in Example 4.4.

Time to reflect 4.3: Can you modify linear regression such that the bias increases, without
decreasing the variance?


Example 4.4: Bias and variance for a regression problem

We consider the exact same setting as in Example 4.2, but decompose 𝐸¯ new into bias and variance instead.
This gives us the figure below.

[Figure: variance plotted against bias² for linear regression with 1st, 2nd, and 3rd order polynomials (with L2 regularization, for several values of the regularization parameter), a regression tree, and a random forest.]

There are clear resemblances to Example 4.2, as expected from (4.23). The effect of bagging (random forest) is, however, clearer here: it reduces the variance compared to the regression tree with no noteworthy increase in bias.
For another illustration of what bias and variance mean, we look at some of these cases in more detail. First we plot some of the linear regression models. The dashed red line is the true f₀(x), the dotted gray lines are different models ŷ(x★; T) learned from different training datasets T, and the solid black line their
mean 𝑓¯(x). In these figures, bias is the difference between the dashed red and solid black lines, whereas
variance is the spread of the dotted gray lines around the solid black. The variance appears to be roughly
the same for all three models, perhaps somewhat smaller for the first order polynomial, whereas the bias is
clearly smaller for the higher order polynomials. This can be compared to the figure above.

[Figure: six panels showing the fitted models for −5 ≤ x ≤ 10: linear regression (γ = 0.1), 2nd order polynomial (γ = 0.1), 3rd order polynomial (γ = 0.1), 2nd order polynomial (γ = 1 000), regression tree (max depth 5), and random forest (max depth 5).]

Comparing the second order polynomial with little (γ = 0.1) and heavy (γ = 1 000) regularization, it is clear that regularization reduces variance, but also increases bias. Furthermore, the random forest has a smaller variance than the regression tree, but without any change in the black line f̄(x), and hence no change in bias.


4.5 Evaluation for imbalanced and asymmetric classification problems


We started this chapter by presenting the misclassification rate as the default error function, but sometimes
classification problems are imbalanced and/or asymmetric and other error functions are more relevant.
The typical example is perhaps when binary classification is used to detect the presence of something,
such as a disease, an object on the radar, etc. The convention is that 𝑦 = 1 (positive) denotes presence, and
𝑦 = −1 (negative) denotes absence. We say that a problem is

(i) imbalanced if the vast majority of the data points belongs to one class, typically the negative class
𝑦 = −1. This imbalance implies that a (useless) classifier which always predicts b 𝑦 (x) = −1 will
score very well in terms of misclassification rate (4.1a).

(ii) asymmetric if a missed detection (predicting b 𝑦 (x) = −1, when in fact 𝑦 = 1) is considered more
severe than a false detection (predicting b
𝑦 = 1, when in fact 𝑦 = −1), or vice versa. That asymmetry
is not taken into account in the misclassification rate (4.1a).

If a classification problem is imbalanced and/or asymmetric, the misclassification rate, and hence 𝐸 new
as we defined it in Section 4.1, is not always very useful for evaluating a classifier. The concepts of
generalization gap and bias-variance tradeoff are still applicable, but for those problems other metrics
than misclassification rate might be more relevant. We will now briefly introduce some useful tools for
asymmetric and/or imbalanced classification problems, which can be used either as an alternative or as a
complement to the misclassification error function (4.1a) in 𝐸 new . For simplicity we consider the binary
problem and use a hold-out validation dataset approach, but the ideas can be extended to the multiclass
problem as well as to 𝑘-fold cross-validation.

The confusion matrix and the F1 score

If we learn a binary classifier and evaluate it on a hold-out validation dataset, a simple yet useful way to inspect the performance beyond just computing E_hold-out is the confusion matrix. By separating the validation data into four groups depending on y (the actual output) and ŷ(x) (the output predicted by the classifier), we can construct the confusion matrix,

                   y = −1             y = 1              total
ŷ(x) = −1          True neg. (TN)     False neg. (FN)    N*
ŷ(x) = 1           False pos. (FP)    True pos. (TP)     P*
total              N                  P                  n

Of course, TN, FN, FP, TP (and also N*, P*, N, P and 𝑛) should be replaced by the actual numbers, as
in Example 4.5. The confusion matrix provides a quick and informative overview of the characteristics
of a classifier. For asymmetric problems, it is important to distinguish between false positive (FP, also
called type I error) and false negative (FN, also called type II error). Ideally they both should be 0, but in
practice one usually has to decide which ones are considered most important.
There is also a wide body of terminology related to the confusion matrix, which is summarized in
Table 4.1. Some particularly common terms are the recall (TP/P) and the precision (TP/P*). Recall describes how large a proportion of the actually positive data points are predicted as positive. A high recall (close to 1) is good, and a low recall (close to 0) indicates a problem with false negatives. Precision describes the proportion of true positives among the data points predicted as positive. A high precision (close to 1) is good, and a low precision (close to 0) indicates a problem with false positives.
For imbalanced (but not asymmetric) problems where the negative class y = −1 is the most common class, it makes sense to summarize the precision and recall by their harmonic mean, the F1 score 2TP/(P* + P) (see Table 4.1). For such problems, the F1 score can be used in place of the misclassification rate. Note, however, the scale of the F1 score: a perfect classifier attains an F1 score of 1, whereas 0 is worst (for the misclassification rate, 1 is worst).
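To make the bookkeeping concrete, the sketch below (Python with NumPy; the toy label vectors at the end are made up for illustration) computes the entries of the confusion matrix together with the precision, recall, F1 score, and misclassification rate for the label convention y ∈ {−1, 1}.

import numpy as np

def confusion_summary(y_true, y_pred):
    """Confusion matrix counts and derived metrics for labels in {-1, +1}."""
    TP = np.sum((y_pred == 1) & (y_true == 1))     # true positives
    FP = np.sum((y_pred == 1) & (y_true == -1))    # false positives (type I errors)
    FN = np.sum((y_pred == -1) & (y_true == 1))    # false negatives (type II errors)
    TN = np.sum((y_pred == -1) & (y_true == -1))   # true negatives
    P, N = TP + FN, TN + FP                        # actual positives / negatives
    P_star = TP + FP                               # predicted positives
    recall = TP / P if P > 0 else 0.0              # TP/P
    precision = TP / P_star if P_star > 0 else 0.0 # TP/P*
    f1 = 2 * TP / (P_star + P) if (P_star + P) > 0 else 0.0
    misclassification = (FP + FN) / y_true.size
    return dict(TP=TP, FP=FP, FN=FN, TN=TN, precision=precision,
                recall=recall, F1=f1, misclassification=misclassification)

# Toy example (made-up labels and predictions)
y_true = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, -1])
y_pred = np.array([1, -1, 1, -1, -1, -1, 1, -1, -1, -1])
print(confusion_summary(y_true, y_pred))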


Ratio Name
FP/N False positive rate, Fall-out, Probability of false alarm
TN/N True negative rate, Specificity, Selectivity
TP/P True positive rate, Sensitivity, Power, Recall, Probability of detection
FN/P False negative rate, Miss rate
TP/P* Positive predictive value, Precision
FP/P* False discovery rate
TN/N* Negative predictive value
FN/N* False omission rate
P/𝑛 Prevalence
(FN+FP)/𝑛 Misclassification rate
(TN+TP)/𝑛 Accuracy
2TP/(P*+P) F1 score

Table 4.1: Some common terms related to the quantities (TN, FN, FP, TP) in the confusion matrix. The terms
written in italics are discussed in the text.

Example 4.5: The confusion matrix in thyroid disease detection

The thyroid is an endocrine gland in the human body. The hormones it produces influence the metabolic rate and the protein synthesis, and thyroid disorders may have serious implications. We consider the problem of detecting thyroid diseases, using the dataset provided by the UCI Machine Learning Repository (Dheeru and Karra Taniskidou 2017). The dataset contains 7200 data points, each with 21 medical indicators as inputs (both qualitative and quantitative). It also contains the qualitative diagnosis {normal, hyperthyroid, hypothyroid}, which we convert into a binary problem with the output classes {normal, not normal}. The dataset is split into a training and a hold-out validation part, with 3772 and 3428 data points respectively. The problem is imbalanced since only 7% of the data points are not normal, and possibly asymmetric if false negatives (not indicating the disease) are considered more problematic than false positives (falsely indicating the disease). We train a logistic regression classifier and evaluate it on the validation dataset (using the default decision threshold r = 0.5, see (3.36)). We obtain the confusion matrix

                        y = normal    y = not normal
ŷ(x) = normal           3177          237
ŷ(x) = not normal       1             13

Most validation data points are correctly predicted as normal, but a large part of the not normal data is
also falsely predicted as normal. This might indeed be undesired in the application. The misclassification
rate is 0.069 and the F1 score is 0.106. (The useless predictor of always predicting normal has a very similar
misclassification rate of 0.073, but worse F1 score 0.)
To change the picture, we change the threshold to 𝑟 = 0.15, and obtain new predictions with the following
confusion matrix instead:
                        y = normal    y = not normal
ŷ(x) = normal           3067          165
ŷ(x) = not normal       111           85

This change gives more true positives (85 instead of 13 patients are correctly predicted as not normal), but
this happens at the expense of more false positives (111, instead of 1, patients are now falsely predicted as
not normal). As expected, the misclassification rate is now higher (worse) at 0.081, but the F1 score is
higher (better) at 0.381. Remember, however, that the F1 score does not take the asymmetry into account,
but only the imbalance. We have to decide ourselves whether this classifier is a good tradeoff between the
false negative and false positive rates, by considering which type of error has the most severe consequences.


[Figure 4.7 here: (a) the ROC curve, true positive rate against false positive rate, and (b) the precision-recall curve, precision against recall. Each panel shows a typical example, a perfect classifier, and a random-guess classifier (a diagonal line in (a) and a horizontal line at height P/n in (b)), traced out by decreasing the decision threshold r.]

Figure 4.7: The ROC curve (left) and the precision-recall curve (right). Both plots summarize the performance of a classifier for all decision thresholds r (see (3.36)), but the ROC curve is most relevant for balanced problems, whereas the precision-recall curve is more informative for imbalanced problems.

ROC and precision-recall curve


As suggested by the example above, the tuning of the threshold r in (3.36) can be crucial for the performance in binary classification. If we want to compare different classifiers for a certain problem without specifying a particular decision threshold r, the ROC curve can be useful. The abbreviation ROC means "receiver operating characteristics", a name that stems from its history in communications theory.
To plot an ROC curve, the recall/true positive rate (TP/P, a large value is good) is drawn against the
false positive rate (FP/N, a small value is good) for all values of 𝑟 ∈ [0, 1]. The curve typically looks as
shown in Figure 4.7a. An ROC curve for a perfect classifier (always predicting the correct value for all
𝑟 ∈ (0, 1)) touches the upper left corner, whereas a classifier which only assigns random guesses gives a
straight diagonal line.
A compact summary of the ROC curve is the area under the ROC curve, ROC-AUC. From Figure 4.7a,
we conclude that a perfect classifier has ROC-AUC 1, whereas a classifier which only assigns random
guesses has ROC-AUC 0.5. The ROC-AUC is thus summarizing the performance of a classifier for all
possible values of the decision threshold 𝑟.
We previously discussed how the misclassification rate might be misleading for imbalanced problems, and that the precision, recall, and F1 score should be considered instead. Similarly, the ROC curve might be misleading for imbalanced problems, and the precision-recall curve can then (for imbalanced problems where y = −1 is the most common class) be more useful. As the name suggests, the precision-recall curve
plots the precision (TP/P*, a large value is good) against the recall (TP/P, a large value is good) for all
values of 𝑟 ∈ [0, 1], much like the ROC curve. The precision-recall curve for the perfect classifier touches
the upper right corner, and a classifier which only assigns random guesses gives a horizontal line with
height 𝑃/𝑛, as shown in Figure 4.7b.
Also for the precision-recall curve, we can define area under the curve, precision-recall AUC. The best
possible precision-recall AUC is 1, and the classifier which only makes random guesses has precision-recall
AUC equal to 𝑃/𝑛.
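Both curves, and their AUC summaries, are obtained by simply sweeping the threshold r and recording the corresponding rates. A minimal sketch is given below (Python/NumPy; g is assumed to hold predicted probabilities p(y = 1 | x) for a validation set, and using the trapezoidal rule for the areas is our own simplification).

import numpy as np

def trapezoid_area(x, y):
    # Area under a curve y(x) by the trapezoidal rule (points sorted along the x-axis)
    order = np.argsort(x)
    x, y = x[order], y[order]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))

def roc_and_pr_curves(g, y_true, thresholds=np.linspace(0, 1, 101)):
    """Sweep the decision threshold r and collect ROC and precision-recall points.

    g      : predicted conditional class probabilities p(y = 1 | x) on validation data
    y_true : true labels in {-1, +1}
    """
    P = np.sum(y_true == 1)     # number of actual positives
    N = np.sum(y_true == -1)    # number of actual negatives
    tpr, fpr, precision = [], [], []
    for r in thresholds:
        y_pred = np.where(g >= r, 1, -1)                # predict positive if g(x) >= r
        TP = np.sum((y_pred == 1) & (y_true == 1))
        FP = np.sum((y_pred == 1) & (y_true == -1))
        tpr.append(TP / P)                              # recall / true positive rate
        fpr.append(FP / N)                              # false positive rate
        precision.append(TP / (TP + FP) if TP + FP > 0 else 1.0)  # convention when no point is predicted positive
    fpr, tpr, precision = np.array(fpr), np.array(tpr), np.array(precision)
    roc_auc = trapezoid_area(fpr, tpr)
    pr_auc = trapezoid_area(tpr, precision)             # recall on the x-axis
    return fpr, tpr, precision, roc_auc, pr_auc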


4.6 Further reading


This chapter was to a large extent inspired by the introductory machine learning textbook by Abu-Mostafa
et al. (2012). There are also several other textbooks on machine learning, including Vapnik (2000) and
Mohri et al. (2018), in which the central theme is understanding the generalization gap using formal
definitions of model flexibility such as the VC dimension or Rademacher complexity. The understanding
of model flexibility for deep neural networks (Chapter 6) is, however, still subject to research, see for
example C. Zhang et al. (2017), Neyshabur et al. (2017), Belkin et al. (2019) and Neal et al. (2019) for
some directions. Furthermore, the bias-variance decomposition is most often (including in this chapter) presented only for regression, but a possible generalization to the classification problem is suggested by Domingos (2000).

5 Learning parametric models

This chapter revolves around three different topics, namely loss functions, regularization, and optimization.
We have touched upon all of them already, mostly in connection to the parametric models in Chapter 3,
linear and logistic regression. These topics are however central for many more supervised machine
learning methods and deserve a dedicated discussion, which we will give now. The understanding of loss
functions is particularly helpful when we introduce boosting (Chapter 7) and support vector machines
(Chapter 8) later on. Regularization is crucial in preventing overfit (Chapter 4) and should be in the back
of the mind of any machine learning engineer. Finally, it is always helpful to be familiar with the main
ideas of optimization, in particular since they are a key enabler for deep learning (Chapter 6).

5.1 Loss functions


Most parametric regression and classification models, including linear and logistic regression (Chapter 3),
are trained by minimizing a cost function 𝐽 (𝜽). The cost function is the average of the loss function1 𝐿
evaluated on the training data,
θ̂ = arg min_θ (1/n) Σ_{i=1}^{n} L(ŷ(xᵢ; θ), yᵢ),  (5.1)

where L(ŷ(xᵢ; θ), yᵢ) is the loss function and the average over the training data is the cost function J(θ).

The choice of loss function is, in the end, a user choice. Different loss functions give rise to different solutions θ̂, which in turn result in models with different characteristics. There is never a "right" or "wrong" loss function, but one loss function may of course perform better than another on a given problem.
Certain combinations of models and loss functions have proven particularly fruitful and have historically
been branded as different methods carrying different names. For example, the term “linear regression”
most often refers to the combination of a linear-in-the-parameter model and the squared error loss, whereas
the term “support vector classification” (Chapter 8) refers to a linear-in-the-parameter model trained using
the hinge loss. In this section, however, we provide a general discussion about different loss functions and
their properties, without connections to a specific method.
One important aspect of a loss function is its robustness. Robustness is tightly connected to outliers,
meaning spurious data points that do not describe the relationship we are interested in modeling. If
outliers in the training data only have a minor impact on the learned model, we say that the loss function is robust. Conversely, a loss function is not robust if the outliers have a major impact on the learned model. Robustness is therefore a very important property in applications where the training data is contaminated with outliers. Some of the common "default" loss functions, such as the squared error loss, are unfortunately not particularly robust, and it is therefore important to make an informed user choice. Robustness is, however, not a binary property; loss functions can be robust to a greater or lesser degree.
Another important property of loss functions for binary classification is their so-called asymptotic
minimizers. The asymptotic minimizer refers to the function that minimizes the loss function when 𝑛 → ∞,
and reveals whether the model will possibly learn the full conditional class probability 𝑝(𝑦 = 1 | x) or only
the decision boundary. We will discuss this in more detail later.
¹ As you might already have noticed, the arguments to the loss function (here ŷ and y★) vary with context. This is unfortunate but unavoidable, since different loss functions are formulated in terms of different quantities, for example the prediction ŷ, the predicted conditional class probability g(x), the classifier margin, etc.

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 65
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
5 Learning parametric models

Figure 5.1: The loss functions for regression presented in the text (squared error, absolute error, Huber, and ϵ-insensitive loss), each plotted as a function of the error ŷ − y.

Loss functions for regression


In Chapter 3 we introduced the squared error loss

L(ŷ, y) = (ŷ − y)²,  (5.2)

which is the default choice for linear regression since it simplifies the training to only solving the normal
equations. The squared error loss is often used also for other regression models, such as neural networks.
Another common, but more robust, loss function is the absolute error loss,

L(ŷ, y) = |ŷ − y|.  (5.3)

In Chapter 3 we introduced the maximum likelihood motivation of the squared error loss by assuming that
the output 𝑦 is measured with additive uncorrelated noise 𝜀 from a Gaussian distribution. We can similarly
motivate the absolute error loss by instead assuming 𝜀 to have a Laplace distribution, 𝜀 ∼ L (0, 𝑏 𝜀 ). This
statistical perspective is one way to understand the fact that the absolute error loss is more robust (less
sensitive to outliers) than the squared error loss, since the Laplace distribution has a thicker tail compared
to the Gaussian distribution, see Figure 5.1. The Laplace distribution therefore encodes sporadic large
noise values (that is, outliers) as more probable, compared to the Gaussian distribution.
If we have statistical knowledge about how the noise 𝜀 in a regression problem behaves (for example
a skewed normal distribution), it is perfectly possible to use that information and make a maximum
likelihood derivation like (3.17)-(3.21) to derive a loss function for regression tailored to that information.
There are, however, a few additional off-the-shelf loss functions commonly used for regression which are
less natural to derive from a maximum likelihood perspective.
It is sometimes argued that the squared error loss is a good choice because of its quadratic shape which
penalizes small errors (𝜀 < 1) less than linearly. After all, the Gaussian distribution appears (at least
approximately) quite often in nature. However, the quadratic shape for large errors (𝜀 > 1) is the reason
for its non-robustness, and the Huber loss has therefore been suggested as a hybrid between the absolute
loss and squared error loss:
L(ŷ, y) = { (ŷ − y)²   if |ŷ − y| < 1,
            |ŷ − y|    otherwise.  (5.4)

Another extension to the absolute error loss is the 𝜖-insensitive loss,


L(ŷ, y) = { 0             if |ŷ − y| < ϵ,
            |ŷ − y| − ϵ   otherwise,  (5.5)

where 𝜖 is a user-chosen design parameter. This loss places a “tolerance” of width 2𝜖 around the true 𝑦
and behaves like the absolute error loss outside this region. Furthermore the robustness properties of the


𝜖-insensitive loss are very similar to those of the absolute error loss. The 𝜖-insensitive loss turns out to be
a useful loss function for support vector regression in Chapter 8. We illustrate all these loss functions for
regression in Figure 5.1.
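To make the loss shapes in Figure 5.1 concrete, all four regression losses are one-liners. The sketch below (Python/NumPy, vectorized over the error ŷ − y; the value ϵ = 0.5 in the usage example is an arbitrary choice of ours) implements (5.2)–(5.5) as written.

import numpy as np

def squared_error_loss(e):            # (5.2), e = yhat - y
    return e**2

def absolute_error_loss(e):           # (5.3)
    return np.abs(e)

def huber_loss(e):                    # (5.4): quadratic for small errors, linear otherwise
    return np.where(np.abs(e) < 1, e**2, np.abs(e))

def epsilon_insensitive_loss(e, eps=0.5):   # (5.5): zero inside a tolerance of width 2*eps
    return np.where(np.abs(e) < eps, 0.0, np.abs(e) - eps)

errors = np.linspace(-2, 2, 9)
print(huber_loss(errors))
print(epsilon_insensitive_loss(errors, eps=0.5))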

Loss functions for binary classification


An intuitive loss function for binary classification is provided by the misclassification loss,

L(ŷ, y) = { 1   if ŷ ≠ y,
            0   otherwise.  (5.6)
However, even though a small misclassification loss often is the ultimate goal in practice, this loss function is rarely used when training models. There are (at least) two reasons for this. The most apparent reason is that the resulting cost function is a piecewise constant function, which is a poor optimization objective since the gradient is zero everywhere, except where it is undefined. Furthermore, a more subtle reason is that for many models the final prediction ŷ does not reveal all aspects of the classifier. Thinking intuitively in terms of decision boundaries, we may prefer not to have the decision boundary close to the training data points, but instead push the boundary further away to have some "safety leeway" if possible. The misclassification loss (5.6) cannot achieve this, since it only encourages training data points to be on the right side of the decision boundary.
For binary classifiers that predict conditional class probabilities p(y = 1 | x) in terms of a function g(x), the cross entropy loss is, as introduced in Chapter 3,

L(g(x), y) = { −ln g(x)         if y = 1,
               −ln(1 − g(x))    if y = −1.  (5.7)
This loss was derived from a maximum likelihood perspective, but unlike regression (where we had to
specify a distribution for 𝜀) there are no user choices left in the cross entropy loss, other than what model
to use for 𝑔(x).
The misclassification and cross entropy loss are however not the only loss functions for binary
classification. To introduce an entire family of loss functions for logistic regression, let us first introduce
the concept of margins in binary classification. Many binary classifiers b 𝑦 (x) can be constructed by
thresholding some real-valued function 𝑐(x) at 0. That is, we can write
ŷ(x) = sign{c(x)}.  (5.8)

Logistic regression, for example, can be brought into this form by simply using c(x) = θᵀx. The decision boundary for any classifier of the form (5.8) is given by the values of x for which c(x) = 0. To simplify our discussion we will assume that none of the data points fall exactly on the decision boundary (which would give rise to an ambiguity). This implies that ŷ(x) as defined above is always either −1 or +1.
always either −1 or +1.
Based on the function 𝑐(x), we say that
the margin of a classifier is 𝑦 · 𝑐(x). (5.9)
It follows that if 𝑦 and 𝑐(x) have the same sign, meaning the classification is correct, then the margin is
positive. Analogously, for an incorrect classification 𝑦 and 𝑐(x) will have different signs and the margin is
negative. The margin can be viewed as a measure of certainty in a prediction, where data points with small margins are, in some sense (not necessarily Euclidean), close to the decision boundary. The margin plays a similar role for binary classification as the error ŷ − y does for regression.
We can now define loss functions for binary classification in terms of the margin, by assigning a small
loss to positive margins (correct classifications) and a large loss to negative margins (misclassifications).
We can, for instance, re-formulate the logistic loss (3.34) as
𝐿(𝑦 · 𝑐(x)) = ln (1 + exp (−𝑦 · 𝑐(x))) , (5.10)


with c(x) = θᵀx. That is, learning θ by minimizing the loss function (5.10) and making predictions according to (5.8) is equivalent to logistic regression as described in Chapter 3, except for the fact that we have lost the notion of a conditional class probability estimate g(x) and only have a "hard" prediction ŷ(x★). We will, however, recover the class probability estimate later when we discuss the asymptotic
minimizer of (5.10).
We can also formulate the misclassification loss in terms of the margin,
L(y · c(x)) = { 1   if y · c(x) < 0,
                0   otherwise.  (5.11)

The benefit of formulating a loss function in terms of the margin is that it is easy to come up with new loss
functions. Let us now introduce the exponential loss defined as

𝐿(𝑦 · 𝑐(x)) = exp(−𝑦 · 𝑐(x)), (5.12)

which, due to its properties, turns out to be a useful loss function when we later derive AdaBoost in Chapter 7. The downside of the exponential loss is that it is not particularly robust against outliers,
compared to the logistic loss. We also have the hinge loss, which we will use later for support vector
classification in Chapter 8,
L(y · c(x)) = { 1 − y · c(x)   for y · c(x) ≤ 1,
                0              otherwise.  (5.13)

As we will see in Chapter 8, the hinge loss has an attractive so-called support-vector property. However, as we will soon see, its asymptotic minimizer is somewhat unfortunate. As a remedy, one may instead consider the squared hinge loss

L(y · c(x)) = { (1 − y · c(x))²   for y · c(x) ≤ 1,
                0                 otherwise,  (5.14)

which on the other hand is much less robust than the hinge loss. A more elaborate alternative might
therefore be the Huberized squared hinge loss





L(y · c(x)) = { −4y · c(x)        for y · c(x) ≤ −1,
                (1 − y · c(x))²   for −1 ≤ y · c(x) ≤ 1,   (as in the squared hinge loss)
                0                 otherwise,  (5.15)
whose name refers to its similarities to the Huber loss for regression, namely that the quadratic function
is replaced with a linear function for margins < −1. The three loss functions presented above are all
particularly interesting for support vector classification, due to the fact that they are all exactly 0 for
margins > 1.
We summarize this cascade of loss functions for binary classification in Figure 5.2, which illustrates all
these losses as a function of the margin.
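Expressed in terms of the margin, these losses are also straightforward to implement and compare numerically. The following sketch (Python/NumPy, vectorized over the margin m = y · c(x)) follows (5.10)–(5.15) directly.

import numpy as np

def logistic_loss(m):           # (5.10)
    return np.log(1 + np.exp(-m))

def misclassification_loss(m):  # (5.11)
    return np.where(m < 0, 1.0, 0.0)

def exponential_loss(m):        # (5.12)
    return np.exp(-m)

def hinge_loss(m):              # (5.13)
    return np.where(m <= 1, 1 - m, 0.0)

def squared_hinge_loss(m):      # (5.14)
    return np.where(m <= 1, (1 - m)**2, 0.0)

def huberized_squared_hinge_loss(m):   # (5.15)
    return np.where(m <= -1, -4 * m, np.where(m <= 1, (1 - m)**2, 0.0))

margins = np.linspace(-3, 2.5, 12)
for loss in (logistic_loss, exponential_loss, hinge_loss, huberized_squared_hinge_loss):
    print(loss.__name__, np.round(loss(margins), 2))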
We have already made some claims about robustness. Let us motivate them using Figure 5.2. One
characterization of an outlier is as a data point on the “wrong” side of and far away from the decision
boundary. In a margin perspective, that is equivalent to a large negative margin. The robustness of a
loss function is therefore tightly connected to the shape of the loss function for large negative margins.
The steeper the slope and the heavier the penalization of large negative margins, the more sensitive the loss is to outliers.
We can therefore tell from Figure 5.2 that the exponential loss is most sensitive to outliers, due to its
exponential left asymptote, whereas the squared hinge loss is somewhat more robust with a quadratic
asymptote instead. However, even more robust are the Huberized squared hinge loss, the hinge loss and
the logistic loss which all have an asymptotic behavior which is linear. Most robust is the misclassification
loss, but as already discussed that loss has other disadvantages.


[Figure 5.2 here: the logistic, exponential, hinge, squared hinge, Huberized squared hinge, and misclassification losses plotted against the margin y · c(x).]

Figure 5.2: Comparison of some common loss functions for classification, plotted as a function of the margin.


The asymptotic minimizer in binary classification


As mentioned earlier, the asymptotic minimizer is a useful theoretical concept in understanding loss functions for classification. The asymptotic minimizer is the function (g(x) or c(x), depending on the parametrization of the loss function) which minimizes the cost function when the number of training data points n → ∞, hence the name asymptotic. It contains valuable information about whether the model will, eventually as n → ∞, learn the conditional class probability p(y = 1 | x) correctly or not.
In the following discussion we will make heavy use of the notion of a ground truth conditional class
probability 𝑝(𝑦 = 1 | x), which we assume the training data is drawn from. Learning 𝑝(𝑦 = 1 | x), or some
aspect of it, is the ultimate goal with binary classification in supervised machine learning. We assume that
the model is flexible enough such that it can describe 𝑝(𝑦 = 1 | x) well. This assumption is not always met
in practice, and we will discuss the implications of this assumption at the end.
There is a notion of a strictly proper loss function. A strictly proper loss function is a loss function
for which the asymptotic minimizer (i) is unique and (ii) is an invertible transformation of 𝑝(𝑦 = 1 | x).
That is, a strictly proper loss function forces the model to learn the full2 distribution 𝑝(𝑦 = 1 | x) if given
enough training data. All models can be used for predicting classes b 𝑦 (x★), but only a model trained
with a strictly proper loss function can be used for predicting conditional class probabilities (like 𝑔(x) in
logistic regression). Deriving the asymptotic minimizer of a loss function is most often a straightforward
calculation, but for brevity we do not include the derivations here.
Starting with the binary maximum likelihood loss (5.7), its asymptotic minimizer can be shown to be
𝑔(x) = 𝑝(𝑦 = 1 | x). In other words, when 𝑛 → ∞ the loss function (5.7) trains the model such that 𝑔(x)
becomes the conditional class probability. Since we derived (5.7) with this intention in Chapter 3, this was
indeed expected.
The asymptotic minimizer of the logistic loss (5.10) is c(x) = ln[ p(y = 1 | x) / (1 − p(y = 1 | x)) ]. This is an invertible expression and hence the logistic loss is a strictly proper loss function. By inverting c(x) we obtain p(y = 1 | x) = exp c(x) / (1 + exp c(x)), which shows how conditional class probability predictions can be obtained from c(x). (With the "margin formulation" of logistic regression, we seemingly lost the class probability predictions g(x). We have now recovered them.)
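As an illustration of the kind of calculation behind such statements (a sketch only; we minimize the expected loss pointwise over c(x) for a fixed x, and write p as shorthand for p(y = 1 | x)), the asymptotic minimizer of the logistic loss follows from

E[L(y · c(x)) | x] = p ln(1 + e^{−c(x)}) + (1 − p) ln(1 + e^{c(x)}),

and setting the derivative with respect to c(x) to zero,

−p / (1 + e^{c(x)}) + (1 − p) e^{c(x)} / (1 + e^{c(x)}) = 0   ⟹   e^{c(x)} = p / (1 − p)   ⟹   c(x) = ln[ p / (1 − p) ].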
For the misclassification loss (5.11) the asymptotic minimizer is not unique but only fulfills
sign{c(x)} = { 1    if p(y = 1 | x) > 0.5,
               −1   if p(y = 1 | x) < 0.5.
From this expression we see that a model trained with the misclassification loss has the correct decision
boundary (when 𝑛 → ∞). However, the misclassification loss does not train the model such that
𝑝(𝑦 = 1 | x) can be inferred from 𝑐(x), since the asymptotic minimizer is not an invertible transformation
of it. The misclassification loss is therefore not a strictly proper loss function, and 𝑐(x) can not be used for
predicting conditional class probabilities. This is yet another downside of the misclassification loss.
Turning to the exponential loss (5.12), its asymptotic minimizer is c(x) = ½ ln[ p(y = 1 | x) / (1 − p(y = 1 | x)) ], almost like the logistic loss. The exponential loss is therefore also strictly proper, and c(x) can be transformed and used for predicting conditional class probabilities.
The asymptotic minimizer of the hinge loss is
c(x) = { 1    if p(y = 1 | x) > 0.5,
         −1   if p(y = 1 | x) < 0.5.

This is a non-invertible transformation of p(y = 1 | x), which means that p(y = 1 | x) cannot be inferred from the asymptotic minimizer c(x). This implies that a classifier learned using the hinge loss (such as support vector classification, Section 8.5) cannot predict conditional class probabilities.
The squared hinge loss, on the other hand, is a strictly proper loss function, since its asymptotic minimizer is c(x) = 2p(y = 1 | x) − 1. This also holds for the Huberized squared hinge loss. Recalling our
2 Loss functions that are proper, but not strictly proper, may for example learn only the correct decision boundary, that is
𝑝(𝑦 = 1 | x) = 0.5, which is not a full characterization of 𝑝(𝑦 = 1 | x).


robustness discussion, we see that by squaring the hinge loss we make it strictly proper, but at the same time we impact its robustness. However, the "huberization" (replacing the quadratic curve with a linear one for margins < −1) improves the robustness while keeping the property of being strictly proper.
We have now seen that some (but not all) of the loss functions are strictly proper, meaning they could
potentially predict conditional class probabilities correctly. However, this is only under the assumption
that the model is sufficiently flexible such that 𝑔(x) or 𝑐(x) actually can take the shape of the asymptotic
minimizer. This is possibly problematic; remember that 𝑐(x) is a linear function in logistic regression,
whereas 𝑝(𝑦 = 1 | x) can be almost arbitrarily complicated in real world applications. It is therefore not
sufficient to use a strictly proper loss function in order to predict conditional class probabilities, but our
model also has to be flexible enough. This discussion is also only valid in the limit as 𝑛 → ∞. However,
in practice n is always bounded, and we may ask how large n has to be for a flexible enough model to at least approximately learn the asymptotic minimizer. Unfortunately we cannot give any general numbers, but following the same principles as the overfit discussion in Chapter 4, the more flexible the model, the larger n is required. If n is not large enough, the predicted conditional class probabilities tend to be "overfitted" to the training data. We therefore unfortunately have to conclude that the concept of the asymptotic minimizer only gives limited guidance on how to interpret the learned c(x) in practice.

Multiclass classification
So far we have only discussed the binary classification problem with 𝑀 = 2. The maximum likelihood
loss is indeed possible to generalize to the multiclass problem, that is 𝑀 > 2, as we did in Chapter 3 for
logistic regression. The generalization of the other loss functions is, however, a more complicated problem
and would require a generalization of the margin to the multiclass problem to start with. That is possible
but we do not discuss it in this book. A more pragmatic approach is instead to reformulate the problem as
several binary problems. This re-formulation can be done using either a one-versus-rest or one-versus-one
scheme.
The one-versus-rest (or one-versus-all or binary relevance) idea is to train 𝑀 binary classifiers. Each
classifier in this scheme is trained for predicting one class against all the other classes. To make a prediction
for a test data point, all 𝑀 classifiers are used, and the class which, for example, is predicted with the
largest margin is taken as the predicted class. This approach is a pragmatic solution, which may turn out
to work well for some problems.
The one-versus-one idea is instead to train one classifier for each pair of classes. If there are M classes in total, there are ½M(M − 1) such pairs. To make a prediction, each classifier predicts either of its two classes, and the class which overall obtains the most "votes" is chosen as the final prediction. The predicted margins can be used to break a tie if that happens. Compared to one-versus-rest, the one-versus-one approach has the disadvantage of involving ½M(M − 1) classifiers, instead of only M. On the other hand
each of these classifiers is trained on much smaller datasets (only the data points that belong to either
of the two classes) compared to one-versus-rest which uses the entire original training dataset for all 𝑀
classifiers.
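As an illustration of how the one-versus-rest scheme can be organized in code, the sketch below (Python/NumPy) trains one binary classifier per class and predicts the class with the largest score. The binary trainer is deliberately kept abstract; the least-squares stand-in at the end is only a toy choice of ours so that the sketch runs, not a method advocated here.

import numpy as np

def train_one_vs_rest(X, y, classes, train_binary):
    # One-versus-rest: one binary classifier per class, class m against all the rest.
    classifiers = {}
    for m in classes:
        y_pm = np.where(y == m, 1, -1)          # relabel: class m is +1, the rest -1
        classifiers[m] = train_binary(X, y_pm)  # must return a scoring function c(X)
    return classifiers

def predict_one_vs_rest(classifiers, X):
    # Predict the class whose classifier gives the largest (margin-like) score.
    classes = list(classifiers)
    scores = np.column_stack([classifiers[m](X) for m in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]

def train_binary_least_squares(X, y_pm):
    # Toy stand-in for a binary classifier: linear least squares on +/-1 targets.
    Xb = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(Xb, y_pm, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ theta

# Toy usage with made-up data and M = 3 classes
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 2)) + np.repeat(np.array([[0, 0], [3, 0], [0, 3]]), 20, axis=0)
y = np.repeat(np.array([0, 1, 2]), 20)
clf = train_one_vs_rest(X, y, classes=[0, 1, 2], train_binary=train_binary_least_squares)
print(predict_one_vs_rest(clf, X)[:5])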


5.2 Regularization
We will now take a closer look at regularization, which was briefly introduced in Section 3.3 as a useful
tool for avoiding overfit if the model was too flexible, such as a polynomial of high degree. We have
also discussed thoroughly in Chapter 4 the need for tuning the model flexibility, which is the purpose
of regularization. Finding the right level of flexibility, and thereby avoiding overfit, is very important in
practice.
Regularization is applicable to any parametric model, and the idea is to 'keep the parameters θ̂ small unless the data really convinces us otherwise', or alternatively 'if a model with small parameter values fits the data almost as well as a model with larger parameter values, the one with small parameter values should be preferred'. We will now have a closer look at the so-called L2 and L1 regularization.

𝐿 2 regularization
The L2 regularization (also known as Tikhonov regularization, ridge regression, and weight decay) amounts to adding an extra penalty term ‖θ‖₂² to the cost function. Linear regression with squared error loss and L2 regularization, as an example, amounts to solving

θ̂ = arg min_θ (1/n)‖Xθ − y‖₂² + λ‖θ‖₂².  (5.16)
By choosing the regularization parameter 𝜆 ≥ 0, a trade-off between the original cost function (fitting the
training data as well as possible) and the regularization term (keeping the parameters b 𝜽 close to zero) is
made. In the setting 𝜆 = 0 we recover the original least squares problem (3.12), whereas 𝜆 → ∞ will force
all parameters b 𝜽 to 0. A good choice of 𝜆 is usually somewhere in between and depends on the actual
problem, and can be selected using cross-validation.
It is actually possible to derive a version of the normal equations for (5.16), namely

(XᵀX + nλI_{p+1}) θ̂ = Xᵀy,  (5.17)

where I_{p+1} is the identity matrix of size (p + 1) × (p + 1). For λ > 0, the matrix XᵀX + nλI_{p+1} is always invertible, and we have the closed-form solution

θ̂ = (XᵀX + nλI_{p+1})⁻¹ Xᵀy.  (5.18)

This also reveals another reason for using regularization in linear regression, namely if XᵀX is not invertible. When XᵀX is not invertible, the ordinary normal equations (3.13) have no unique solution θ̂, whereas the L2-regularized version always has the unique solution (5.18) if λ > 0.
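Because of (5.18), training an L2-regularized linear regression model takes only a few lines of code. A minimal sketch (Python/NumPy; the data are made up, and we solve the linear system rather than forming the matrix inverse explicitly):

import numpy as np

def fit_l2_regularized_linear_regression(X, y, lam):
    """Solve (X^T X + n*lambda*I) theta = X^T y, cf. (5.17)-(5.18)."""
    n, d = X.shape
    A = X.T @ X + n * lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)   # numerically preferable to inv(A) @ X.T @ y

# Made-up data: n = 50 points, p = 3 inputs plus an offset column of ones
rng = np.random.default_rng(1)
Xraw = rng.standard_normal((50, 3))
X = np.column_stack([np.ones(50), Xraw])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + 0.1 * rng.standard_normal(50)

theta_hat = fit_l2_regularized_linear_regression(X, y, lam=0.1)
print(theta_hat)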

𝐿 1 regularization
With L1 regularization (also called LASSO, an abbreviation for Least Absolute Shrinkage and Selection Operator), the penalty term ‖θ‖₁ is added to the cost function. Here ‖θ‖₁ is the 1-norm, or 'taxicab norm', ‖θ‖₁ = |θ₀| + |θ₁| + · · · + |θ_p|. The L1-regularized cost function for linear regression (with squared error loss) (3.12) then becomes

θ̂ = arg min_θ (1/n)‖Xθ − y‖₂² + λ‖θ‖₁.  (5.19)

Contrary to linear regression with 𝐿 2 regularization (3.48), there is no closed-form solution available
for (5.19). However, as we will see in Section 5.3, it is possible to design an efficient numerical optimization
algorithm for solving (5.19).
As for 𝐿 2 regularization, the regularization parameter 𝜆 has to be chosen by the user, and has a similar
meaning: 𝜆 = 0 gives the ordinary least squares solution and 𝜆 → ∞ gives b 𝜽 = 0. Between these extremes,


however, 𝐿 1 and 𝐿 2 tend to give different solutions. Whereas 𝐿 2 regularization pushes all parameters
towards small values (but not necessarily exactly zero), 𝐿 1 tends to favor so-called sparse solutions where
only a few of the parameters are non-zero, and the rest are exactly zero. Thus, 𝐿 1 regularization can
effectively ‘switch off’ some inputs (by setting the corresponding parameter 𝜃 𝑘 to zero) and it can therefore
be used as an input (or feature) selection method.

Example 5.1: Regularization for car stopping distance

Consider again Example 2.2 with the car stopping distance regression problem. We use the 10th order
polynomial that was considered meaningless in Example 3.5 and apply 𝐿 2 and 𝐿 1 regularization to it,
respectively. With manually chosen 𝜆, we obtain the following models
[Figure: the regularized 10th order polynomial models, distance (feet) against speed (mph), with L2 regularization (left) and L1 regularization (right).]


Both models suffer less from overfit than the non-regularized 10th order polynomial in Example 3.5. The
two models here are, however, not identical. Whereas all parameters are relatively small but non-zero in the
𝐿 2 -regularized model (left panel), only 4 (out of 11) parameters are non-zero in the 𝐿 1 -regularized model
(right panel). It is typical for 𝐿 1 regularization to give sparse models, where some parameters are set to
exactly zero.

General cost function regularization

The 𝐿 1 and 𝐿 2 regularization are two commonly used regularization methods, and they are both formulated
as additions to the cost function. Together they suggest a more general regularization scheme

θ̂ = arg min_θ J(θ; X, y) + λ R(θ).  (5.20)

This expression contains three important elements:

(i) the cost function, which encourages a good fit to training data,
(ii) the regularization term, which encourages small parameter values, and
(iii) the regularization parameter 𝜆, which determines the trade-off between (i) and (ii).

The regularization term can be designed in many ways. As a combination of the L1 and L2 terms, one option is R(θ) = ‖θ‖₁ + ‖θ‖₂², which often is referred to as elastic net regularization. Regardless of the
exact expression of the regularization term, its purpose is to encourage small parameter values and thereby
decrease the flexibility of the model. As we discussed in depth in Chapter 4, a too flexible (or complex)
model may overfit to training data and not be able to generalize well to previously unseen test data. By
using regularization we get a systematic and practically useful way of controlling the model flexibility and
thereby counteracting overfit.

Implicit regularization

Any supervised machine learning method that is trained by minimizing a cost function can be regularized
as (5.20). There are, however, alternative ways to achieve similar effects. One such example of implicit


regularization is early stopping. Early stopping is applicable to any method that is trained using numerical optimization, and amounts to aborting the optimization before it has reached the minimum of the cost function. Even though it might appear counter-intuitive to prematurely abort an optimization procedure, early stopping has proven to be of practical importance to avoid overfit for some models, most notably deep learning (Chapter 6).
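A rough sketch of how early stopping is often realized in practice is given below (pseudocode-like Python; gradient_step and validation_error are hypothetical helpers standing in for one optimization update and a hold-out evaluation, and the patience value is an arbitrary choice of ours).

def train_with_early_stopping(theta, train_data, val_data, gradient_step,
                              validation_error, max_iter=1000, patience=10):
    """Abort the optimization when the hold-out validation error stops improving."""
    best_theta, best_error, iters_without_improvement = theta, float("inf"), 0
    for _ in range(max_iter):
        theta = gradient_step(theta, train_data)      # one optimization update
        error = validation_error(theta, val_data)     # E_hold-out for the current theta
        if error < best_error:
            best_theta, best_error = theta, error
            iters_without_improvement = 0
        else:
            iters_without_improvement += 1
            if iters_without_improvement >= patience:  # stop before the cost-function minimum
                break
    return best_theta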


5.3 Parameter optimization


Many supervised machine learning methods, linear and logistic regression included, involve one (or more) optimization problems, such as (3.12), (3.35) or (5.19). A machine learning engineer therefore needs to be familiar with the main strategies for solving optimization problems fast. Starting from the optimization problems in linear and logistic regression, we will introduce the ideas behind some of the optimization
methods common in supervised machine learning. This section is, however, not a complete treatment of
optimization theory, and we will for example only discuss unconstrained optimization problems.
Optimization is about finding the minimum or maximum of an objective function. Since the maximization
problem can be formulated as minimization of the negative objective function, we can limit ourselves
to minimization without any loss of generality. When we use optimization for training models in
machine learning, that objective function is the cost function 𝐽. Optimization is, however, often used
also in combination with cross-validation (Chapter 4) for tuning hyperparameters, such as selecting the
regularization parameter 𝜆 by minimizing the prediction error on a held-out validation dataset 𝐸 hold-out .
An important class of objective functions are convex functions. Optimization is often easier to carry out
for convex objective functions, and it is generally good advice to spend some extra effort on considering whether a non-convex optimization problem can be re-formulated into a convex problem (which sometimes, but not always, is possible). The most important property of a convex function, for this discussion, is that a convex function has a unique minimum³, and no other local minima. Examples of convex functions are the cost functions for logistic regression, linear regression, and L1-regularized linear regression. An example of a non-convex function is the cost function for a deep neural network. We illustrate this in Example 5.2.

Example 5.2: Example of objective functions

These figures are examples of what an objective function can look like.
[Figure: two surface plots of objective functions over the parameters (θ₁, θ₂).]
Both examples are functions of a two-dimensional parameter vector θ = [θ₁ θ₂]ᵀ. The left is convex and has

a finite unique global minimum, whereas the right is non-convex and has three local minima (of which only
one is the global minimum). We will in the following examples illustrate these objective functions using
contour plots instead, as shown below.

[Figure: contour plots of the same two objective functions over (θ₁, θ₂).]

³ The minimum does, however, not have to be finite. The exponential function, for example, is convex but attains its minimum only as its argument tends to −∞. Convexity is a relatively strong property, and also non-convex functions may have only one minimum.

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 75
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
5 Learning parametric models

Time to reflect 5.1: After reading the rest of this book, return here and try to fill out this table,
summarizing how optimization is used by the different methods.
What is optimization used for? What type of optimization?
Method Training Hyper- Nothing Closed- Grid search Gradient- Stochastic
parameters form* based gradient
Linear regression
Linear regression with L2-regularization
Linear regression with L1-regularization
Logistic regression
𝑘 -NN
Trees
Random forests
AdaBoost
Gradient boosting
Deep learning
Gaussian processes
*including coordinate descent

Optimization using closed-form expressions


For linear regression with squared error loss, training the model amounts to solving the optimization
problem (3.12)

b 1
𝜽 = arg min kX𝜽 − yk 22 .
𝜽 𝑛
As we have discussed, and also proved in Appendix 3.A, the solution (3.14) to this problem can (under
the assumption that XT X is invertible) be derived analytically. If we only spend some time to efficiently
implement (3.14) once, for example using the Cholesky or QR factorization, we can use that every time
we want to train a linear regression model with squared error loss. Each time we use it we know that we
have found the optimal solution in a computationally efficient way.
If we instead want to learn the 𝐿 1 -regularized version, we instead have to solve (5.19)

b 1
𝜽 = arg min kX𝜽 − yk 22 + 𝜆k𝜽 k 1 .
𝜽 𝑛
This problem can, unfortunately, not be solved analytically. Instead we have to use computer power to
solve it, by constructing an iterative procedure for seeking the solution. With a certain choice of such an
optimization algorithm, we can make use of some analytical expressions along the way, which turns out to
give an efficient way of solving it. Remember that 𝜽 is a vector containing 𝑝 + 1 parameters we want to
learn from the training data. As it turns out, if we seek the minimum for only one of these parameters, say
𝜃 𝑗 , while keeping the other parameters fix, we can find the optimum as

1 ∑︁
𝑛 ∑︁
arg min kX𝜽 − yk 22 + 𝜆k𝜽 k 1 = sign(𝑡) (|𝑡| − 𝜆), where 𝑡 = 𝑥 𝑖 𝑗 (𝑦 𝑖 − 𝑥 𝑖𝑘 𝜃 𝑘 ). (5.21)
𝜃𝑗 𝑛 𝑖=1 𝑘≠ 𝑗

It turns out that making repeated “sweeps” through the vector 𝜽 and updating one-parameter-at-a-time
according to (5.21) is a good way to solve (5.19). This type of algorithm, where we update one-parameter-
at-a-time, is referred to as coordinate descent, and we illustrate it in Example 5.3
It can be shown that the cost function in (5.19) is convex. Only convexity is not sufficient to guarantee
that coordinate descent will find its (global) minimum, but for the 𝐿 1 -regularized cost function (5.19) it
can be shown that coordinate descent actually finds the (global) minimum. In practice we know that we
have found the global minimum when no parameters have changed during a full “sweep” of the parameter
vector.
It turns out that coordinate descent is a very efficient method for 𝐿 1 -regularized linear regression (5.19).
The keys are (i) that (5.21) exists and is cheap to compute, and (ii) many updates will simply set 𝜃 𝑗 = 0 due
to the sparsity of the optimal b𝜽. This makes the algorithm fast. For most machine learning optimization
problems it can, however, not be said that coordinate descent is the preferred method. We will now have a
look at some more general families of optimization methods that are widely used in machine learning.

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
76
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
5.3 Parameter optimization

Example 5.3: Coordinate descent

We apply coordinate descent to the objective functions from Example 5.2. For coordinate descent to be
an efficient alternative in practice, closed-form solutions for updating one-parameter-at-a-time, similar to
(5.21), have to be available.

5 5

0 0
𝜃2

𝜃2
−5 −5

−5 0 5 −5 0 5

𝜃1 𝜃1
The figures show how the parameters are updated in the coordinate descent algorithm, for two different
initial parameter vectors (blue and green trajectory, respectively). It is clear from the figures that only one
parameter is updated each time, which gives the trajectory a characteristic shape. The obtained minimum
is marked with a yellow dot. Note how the different initializations lead to different (local) minima in the
non-convex case (right panel).

Grid search
Most often we do not have closed-form solutions to the optimization problems we need to solve.
Sometimes we can only evaluate the objective function in certain points, and we have no knowledge
abouts its derivatives. This typically happens when selecting hyperparameters, such as the regularization
parameter 𝜆 in (5.19). The objective is then to minimize the prediction error for a held-out validation
dataset (or some other version of cross-validation). That is a very complicated objective function to write
down, and it is even harder to find its derivative (in fact, this objective function includes an optimization
problem itself—learning b 𝜽 for a given value of 𝜆), but we can nevertheless evaluate it for any given 𝜆 (that
is, run the entire learning procedure and see how good the predictions become for the validation dataset).
The perhaps simplest way to solve such an optimization problem is to “try a few different parameter
values, and pick the one which works best”. That is the idea of grid search. The term “grid” here refers
to some (more or less arbitrary chosen) set of different parameter values to try out, and we illustrated it
in Example 5.4.
Although simple to implement, grid search can be computationally inefficient, in particular if the
parameter vector has a high dimension. As an example, having a grid with a resolution of 10 grid
points per dimension (which is a very coarse-grained grid) for a 5-dimensional parameter vector requires
105 = 100 000 evaluations of the objective function. If possible one should avoid using grid search for this
reason. However, with low-dimensional hyperparameters (in 𝐿 1 and 𝐿 2 regularization, 𝜆 is 1-dimensional,
for example), grid search can be feasible. We summarize grid search in Algorithm 5.1, where we use it for
determining a regularization parameter 𝜆.

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 77
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
5 Learning parametric models

Algorithm 5.1: Grid search for regularization parameter 𝜆


Input: Training data {x𝑖 , 𝑦 𝑖 }𝑖=1𝑛 , validation data {x , 𝑦 } 𝑛𝑣
𝑗 𝑗 𝑗=1
b
Result: 𝜆
1 for 𝜆 = 10−3 , 10−2 , . . . , 103 (as an example) do
2 Learn b
𝜽 with regularization parameter 𝜆 from training data
Í 𝑣
3 Compute error on validation data 𝐸 val (𝜆) ← 𝑛𝑗=1 𝑦 (x 𝑗 ; b
(b 𝜽) − 𝑦 𝑗 ) 2
4 end
b as arg min𝜆 𝐸 val (𝜆)
5 return 𝜆

Example 5.4: Grid search

We apply grid search to the objective functions from Example 5.2, with an arbitrary chosen grid indicated
below by blue marks. The found minimum, which is the grid point with the smallest value of the objective
functions, is marked with a yellow dot.

5 5

0 0
𝜃2

𝜃2

−5 −5

−5 0 5 −5 0 5

𝜃1 𝜃1
Due to the unfortunate selection of the grid, the global minimum is not found in the non-convex problem
(right). That problem could be handled by increasing the resolution of the grid, which however requires
more computations (more evaluations of the objective function).

Some hyperparameters (for example 𝑘 in 𝑘-NN, Chapter 2) are integers, and sometimes it is feasible to
simply try all reasonable integer values in grid search. However, most of the time the major challenge in
grid search is to select a good grid. The grid used in Algorithm 5.1 is logarithmic between 0.001 and
1 000, but that is of course only an example. One could indeed do some manual work by first selecting a
coarse grid to get an initial guess, and thereafter refine the grid only around the promising candidates, etc.
In practice, if the problem has more than one dimension, it can also be beneficial to select the grid points
randomly instead of using an equally spaced linear or logarithmic grid.
The manual procedure of choosing a grid might, however, become quite tedious, and one could wish
for an automated method. That is, in fact, possible by treating the grid point selection problem as a
machine learning problem itself. If we consider the points in which the objective function already has
been evaluated as a training dataset, we can use a regression method to learn a model for the objective
function. That model can, in turn, be used to answer questions on where to evaluate the objective function
next, and thereby automatically selecting the next grid point. A concrete method built from this idea is the
Gaussian process optimization method, which uses Gaussian processes (Chapter 9) for learning a model
of the objective function.

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
78
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
5.3 Parameter optimization

Gradient descent
In many situation we do have access to more than just the value of the objective function in certain
points. For example when training parameters it is most often possible to compute the derivative (or rather
gradient) of the objective function, and sometimes even the second derivative (or rather Hessian). In those
situations, it is often a good idea to use a gradient descent method, which we will introduce now, or even
Newton’s method that we will discuss later.
Gradient descent can be used for learning parameter vectors 𝜽 of high dimension when the objective
function 𝐽 (𝜽) is simple enough such that its gradient is possible to compute. Let us therefore consider the
parameter learning problem

b
𝜽 = arg min 𝐽 (𝜽) (5.22)
𝜽

(even though gradient descent possibly can be used for hyperparameters as well). We will assume4 that
the gradient of the cost function ∇𝜽 𝐽 (𝜽) exists for all 𝜽. As an example, the gradient of the cost function
for logistic regression (3.34) is
𝑛  
1 ∑︁ 1
∇𝜽 𝐽 (𝜽) = − 𝑦 𝑖 x𝑖 . (5.23)
𝑛 𝑖=1 1 + 𝑒 𝑦𝑖 𝜽 T x𝑖

Note that ∇𝜽 𝐽 (𝜽) is a vector of the same dimension as 𝜽, which describes the direction in which 𝐽 (𝜽)
increases. Consequently, and more useful for us, −∇𝜽 𝐽 (𝜽) describes the direction in which 𝐽 (𝜽) decreases.
That is,
 
𝐽 𝜽 − 𝛾∇𝜽 𝐽 (𝜽) ≤ 𝐽 𝜽 (5.24)

for some (possibly very small) 𝛾 > 0. If 𝐽 (𝜽) is convex, the inequality in (5.24) is strict except at the
minimum (where ∇𝜽 𝐽 (𝜽) is zero). This suggests that if we have 𝜽 (𝑡) and want to select 𝜽 (𝑡+1) such that
𝐽 (𝜽 (𝑡+1) ) ≤ 𝐽 (𝜽 (𝑡) ), we should

update 𝜽 (𝑡+1) as 𝜽 (𝑡) − 𝛾∇𝜽 𝐽 (𝜽 (𝑡) ) (5.25)

with some positive 𝛾 > 0. Repeating (5.25) gives the gradient descent Algorithm 5.2.

Algorithm 5.2: Gradient descent


Input: Objective function 𝐽 (𝜽), initial 𝜽 (0) , learning rate 𝛾
Result: b
𝜽
1 Set 𝑡 ← 0
2 while k𝜽 (𝑡) − 𝜽 (𝑡−1) k not small enough do
3 Update 𝜽 (𝑡+1) ← 𝜽 (𝑡) − 𝛾∇𝜽 𝐽 (𝜽 (𝑡) )
4 Update 𝑡 ← 𝑡 + 1
5 end
6 return b
𝜽 ← 𝜽 (𝑡−1)

4 This assumption is primarily made for the theoretical discussion.


In practice, there are successful examples of gradient descent
being applied to objective functions not differentiable everywhere, such as neural networks with ReLu activation functions
(Chapter 6).

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 79
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
5 Learning parametric models

𝐽 (𝜃) 1 1 1

0.5 0.5 0.5

0 0 0

−1 0 1 −1 0 1 −1 0 1

𝜃 𝜃 𝜃
(a) Low learning rate 𝛾 = 0.05 (b) High learning rate 𝛾 = 1.2 (c) Good learning rate 𝛾 = 0.3

Figure 5.3: Optimization using gradient descent of a cost function 𝐽 (𝜃) where 𝜃 is a scalar parameter. In the
different subfigures we use a too low learning rate (a), a too high learning rate (b), and a good learning rate (c).
Remember that a good value of 𝛾 is very much related to the shape of the cost function; 𝛾 = 0.3 might be too small
(or high) for a different 𝐽 (𝜃).

In practice we do not know 𝛾, which determines how big the 𝜽-step is at each iteration. It is possible to
formulate the selection of 𝛾 as an optimization problem itself to be solved at each iteration, a so-called
line-search problem. Here we will consider the simpler solution where we leave the choice of 𝛾 to the user.
When left as a user choice, 𝛾 is often referred to as the learning rate or step-size. Note that the gradient
∇𝜽 𝐽 (𝜽) will typically decrease and eventually attain 0 at a stationary point (possibly, but not necessarily, a
minimum), so Algorithm 5.2 may converge if 𝛾 is kept constant. This is in contrast to what we later will
discuss the stochastic gradient algorithm.
The choice of learning rate 𝛾 is important. Some typical situations with too small, too high and a good
choice of learning rate are shown in Figure 5.3. With the intuition from these figures, we advise to monitor
𝐽 (𝜽 (𝑡) ) during the optimization, and

• decrease the learning rate 𝛾 if the cost function values 𝐽 (𝜃 (𝑡) ) are getting worse or oscillates widely
(as in Figure 5.3b),

• increase the learning rate 𝛾 if the cost function values 𝐽 (𝜃 (𝑡) ) are fairly constant and only slowly
decreasing (as in Figure 5.3a).

No general convergence guarantees can be given for gradient descent, basically because a bad learning
rate 𝛾 may break the method. However, with the “right” choice of 𝛾, the value of 𝐽 (𝜽) will decrease for
each iteration (as suggested by (5.24)) until a point with zero gradient is found, that is, a stationary point.
A stationary point is, however, not necessarily a minimum, but can also be a maximum or a saddle-point
of the objective function. In practice one typically monitor the value of 𝐽 (𝜽) and terminate the algorithm
when it seems not to decrease anymore, and hope it has arrived at a minimum.
In non-convex problem with multiple local minima, we can not expect gradient descent to always find
the global minimum. The initialization is usually critical for determining which minimum (or stationary
point) is found, as illustrated by Example 5.5. It can therefore be a good practice (if time and computational
resources permits) to run the optimization multiple times with different initializations. For computationally
heavy non-convex problems such as training a deep neural network (Chapter 6) when we cannot afford to
re-run the training, we usually employ method-specific heuristics and tricks to find a good initialization
point.
For convex problems there is only one stationary point, which also is the global minimum. Hence, the
initialization for convex problem can be done arbitrarily. However, by warm-starting the optimization
with a good initial guess we may still save valuable computational time. Sometimes, such as when doing
𝑘-fold cross validation (Chapter 4), we have to train 𝑘 models on similar (but not identical) datasets. In

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
80
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
5.3 Parameter optimization

situations like that we can typically make use of that by initializing Algorithm 5.2 with the parameters
learned for the previous model.
For learning logistic regression (3.35), gradient descent can be used. Since its cost function is convex,
we know that once the gradient descent has converged to a minimum, it has reached the global minimum
and we are done. For logistic regression there are, however, more advanced alternatives that usually
performs even better.

Example 5.5: Gradient descent

We first consider the convex objective function from Example 5.2, and apply gradient descent to it with
a seemingly reasonable learning rate. Note that each step is perpendicular to the level curves at the point
where it starts, which is a property of the gradient. As expected, we find the (global) minimum with both of
the two different initializations.

0
𝜃2

−5

−5 0 5

𝜃1

For the non-convex objective function from Example 5.2 we apply gradient descent with two different
learning rates. In the left plot, the learning rate seems well chosen and the optimization converges nicely,
albeit to different minima depending on the initialization. Note that it could have converged also to one
of the saddle points between the different minima. In the right plot the learning rate is too big, and the
procedure does not seem to converge.

5 5

0 0
𝜃2

𝜃2

−5 −5

−5 0 5 −5 0 5

𝜃1 𝜃1

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 81
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
5 Learning parametric models

Second order gradient methods


We can think of gradient descent as approximating 𝐽 (𝜽) with a first order Taylor expansion around 𝜽 (𝑡) ,
that is, a (hyper-)plane. The next parameter 𝜽 (𝑡+1) is selected by taking a step in the steepest direction of
the (hyper-)plane. Let us now see what happen if we instead use a second order Taylor expansion,
1
𝐽 (𝜽 + v) ≈ 𝐽 (𝜽) + vT [∇𝜽 𝐽 (𝜽)] + vT [∇𝜽2 𝐽 (𝜽)]v, (5.26)
2
| {z }
,𝑠 (𝜽,v)

where v is a vector of same dimension as 𝜽. This expression does not only contain the gradient of the
cost function ∇𝜽 𝐽 (𝜽), but also the Hessian matrix of the cost function ∇𝜽2 𝐽 (𝜽). Remember that we are
searching for the minimum of 𝐽 (𝜽). If the Hessian ∇𝜽2 𝐽 (𝜽) is positive definite, that minimum is obtained
where the derivative of 𝑠(𝜽, v) is zero,
𝜕
𝑠(𝜽, v) = ∇𝜽 𝐽 (𝜽) + [∇𝜽2 𝐽 (𝜽)]v = 0 ⇔ v = −[∇𝜽2 𝐽 (𝜽)] −1 [∇𝜽 𝐽 (𝜽)]. (5.27)
𝜕v
This suggests to update

𝜽 (𝑡+1) as 𝜽 (𝑡) − [∇𝜽2 𝐽 (𝜽 (𝑡) )] −1 [∇𝜽 𝐽 (𝜽 (𝑡) )], (5.28)

which is Netwton’s method for minimization. Unfortunately, no general convergence guarantees can be
given for Newton’s method either. For certain cases, Newton’s method can be much faster than gradient
descent. In fact, if the cost function 𝐽 (𝜽) is a quadratic function in 𝜽 then (5.26) is exact and Newton’s
method (5.28) will find the optimum in only one iteration! Quadratic objective functions are, however,
rare in machine learning5 . It is not even guaranteed that the Hessian ∇𝜽2 𝐽 (𝜽) is always positive definite in
practice, which may result in rather strange parameter updates in (5.28). To still make use of the potentially
valuable second order information, but at the same time also have a robust and practically useful algorithm,
we have to introduce some modification of Newton’s method. There are multiple options, and we will look
at so-called trust regions.
We derived Newton’s method using the second order Taylor expansion (5.26) as a model for how 𝐽 (𝜽)
behave around 𝜽 (𝑡) . We should perhaps not trust the Taylor expansion to be a good model for all values
of 𝜽, but only in the vicinity of 𝜽 (𝑡) . One natural restriction is therefore to trust the second order Taylor
expansion (5.26) only within a ball of radius 𝐷 around 𝜽 (𝑡) , which we refer to as our trust region. This
suggests that we could make a Newton update (5.28) of the parameters, unless the step is longer than 𝐷, in
which case we downscale the step to never leave our trust region. In the next iteration, the trust region is
moved to be centered around the updated 𝜽 (𝑡+1) , and another step is taken from there. We can express this
as

update 𝜽 (𝑡+1) as 𝜽 (𝑡) − 𝜂[∇𝜽2 𝐽 (𝜽 (𝑡) )] −1 [∇𝜽 𝐽 (𝜽 (𝑡) )], (5.29)

where 𝜂 ≤ 1 is chosen as large as possible such that k𝜽 (𝑡+1) − 𝜽 (𝑡) k ≤ 𝐷. The radius of the trust region 𝐷
can be updated and adapted as the optimization proceeds, but for simplicity we consider here 𝐷 to be a
user choice (much like the learning rate for gradient descent). We summarize this by Algorithm 5.3 and
look at it in Example 5.6. The trust-region Newton method, with a certain set of rules on how to update 𝐷,
is actually one of the methods commonly used for training logistic regression in practice.
It can be computationally expensive or even impossible to compute the inverse of the Hessian matrix
[∇𝜽2 𝐽 (𝜽 (𝑡) )] −1 . To this end, there is an entire class of methods called quasi-Newton methods that all use
different ways to approximate the inverse of the Hessian matrix [∇𝜽2 𝐽 (𝜽)] −1 in (5.28). This class includes,
among other, the Broyden method and the BFGS method (an abbreviation of Broyden, Fletcher, Goldfarb
and Shanno). A further approximation of the latter, called limited-memory BFGS or L-BFGS, has proven
to be another good choice for the logistic regression problem.
5 For regression, we often use the squared error loss 𝑦 − 𝑦) 2 , which indeed is a quadratic function in b
𝑦 , 𝑦) = (b
𝐿 (b 𝑦 . That does not
imply that 𝐽 (𝜽) (the objective function) is a quadratic function in 𝜽.

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
82
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
5.3 Parameter optimization

Algorithm 5.3: Truncated Newton’s method


Input: Objective function 𝐽 (𝜽), initial 𝜽 (0) , trust region radius 𝐷
Result: b
𝜽
1 Set 𝑡 ← 0
2 while k𝜽 (𝑡) − 𝜽 (𝑡−1) k not small enough do
3 Compute v ← [∇𝜽2 𝐽 (𝜽 (𝑡) )] −1 [∇𝜽 𝐽 (𝜽 (𝑡) )]
4 Compute 𝜂 ← max( kv 𝐷
k,𝐷)
5 Update 𝜽 (𝑡+1) ← 𝜽 (𝑡) − 𝜂v
6 Update 𝑡 ← 𝑡 + 1
7 end
8 return b
𝜽 ← 𝜽 (𝑡−1)

Example 5.6: Newton’s method

We first apply Newton’s method to the cost functions from Example 5.2. Since the convex cost (left) function
also happens to be close to a quadratic function, the Newton’s method works well and finds, for both
initializations, the minimum in only two iterations. For the non-convex problem (right), Newton’s method
diverges for both initializations, since the second order Taylor expansion (5.26) is a poor approximation of
this function and leads the method wrong.

5 5

0 0
𝜃2

𝜃2

−5 −5

−5 0 5 −5 0 5

𝜃1 𝜃1

We apply also the trust-region Newton’s method to both problems. Note that the first step direction is
identical to the non-truncated version above, but the steps are now limited to stay within the trust region
(here a circle of radius 2). This prevents the severe divergence problems for the non-convex case, and all
cases converges nicely. Indeed, the convex case (left) requires more iterations than for the non-truncated
version above, but that is a price we have to pay in order to have a robust method which also works for the
non-convex case shown to the right.

5 5

0 0
𝜃2

𝜃2

−5 −5

−5 0 5 −5 0 5

𝜃1 𝜃1

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 83
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
5 Learning parametric models

5.4 Optimization with large datasets


In machine learning the training data may have 𝑛 = millions (or more) of data points. Computing, for
example, the gradient of the cost function

1 ∑︁
𝑛
∇𝜽 𝐽 (𝜽) = ∇𝜽 𝐿 (x𝑖 , 𝑦 𝑖 , 𝜽) (5.30)
𝑛 𝑖=1

thus involves summing a million of elements. Besides taking lot of time to sum, it can also be an issue to
keep all data points in the computer memory at the same time. However, with that many data points many
of them are probably relatively similar, and in practice we might not need to consider all of them every
time, but looking only at a subset of them might give sufficient information. This is a general idea called
subsampling, and we will have a closer look at how subsampling can be combined with gradient descent
into a very useful optimization method called stochastic gradient. It is, however, possible to combine the
subsampling idea also with other methods.

Stochastic gradient
With 𝑛 very big, we can expect the gradient computed only for the first half of the dataset ∇𝜽 𝐽 (𝜽) ≈
Í𝑛/2
, y , 𝜽) to be almost identical to the gradient based on the second half of the dataset
𝑖=1 ∇𝜽 𝐿(xÍ𝑖 𝑛 𝑖
∇𝜽 𝐽 (𝜽) ≈ 𝑖=𝑛/2+1 ∇𝜽 𝐿(x𝑖 , y𝑖 , 𝜽). Consequently, it might be a waste of time to compute the gradient
based on the whole training dataset for each iteration of gradient descent. Instead, we could compute the
gradient based on the first half of the training dataset, update the parameters according to the gradient
descent method Algorithm 5.2, and then compute the gradient for the new parameters based on the second
half of the training data,

1 ∑︁
𝑛
2

𝜽 (𝑡+1)
=𝜽 (𝑡)
−𝛾 ∇𝜽 𝐿 (x𝑖 , y𝑖 , 𝜽 (𝑡) ), (5.31a)
𝑛/2 𝑖=1
1 ∑︁
𝑛
𝜽 (𝑡+2) = 𝜽 (𝑡+1) − 𝛾 ∇𝜽 𝐿 (x𝑖 , y𝑖 , 𝜽 (𝑡+1) ). (5.31b)
𝑛/2 𝑛
𝑖= 2 +1

In other words, we use only a subsample of the training when we compute the gradient. In this case we only
use half of the data for each computation, and (5.31) would require roughly only half the computational
time compared to normal gradient descent. This computational saving illustrates the benefit of the
subsampling idea.
The extreme version of subsampling would be to use only one single data point each time when
computing the gradient. In practice it is most common to do something in between. We call a small
subsample of data a mini-batch, which typically can contain 𝑛𝑏 = 10, 𝑛𝑏 = 100 or 𝑛𝑏 = 1000 data
points. One complete pass through the training data is called an epoch, and consists consequently of 𝑛/𝑛𝑏
iterations.
When using mini-batches it is important to ensure that the different mini-batches are balanced and
representative for the whole dataset. For example, if we have a big training dataset with a few different
output classes and the dataset is sorted with respect to the output, the mini-batch with the first 𝑛𝑏 data
points would only include one class and hence not give a good approximation of the gradient for the
full dataset. For this reason, the mini-batches should be formed randomly. One implementation of this
is to first randomly shuffle the training data, and thereafter dividing it into mini-batches in an ordered
manner. When we have completed one epoch, we do another random reshuffling of the training data and
do another pass through the dataset. We summarize gradient descent with mini-batches, often called
stochastic gradient, by Algorithm 5.4.
Stochastic gradient is widely used in machine learning, and there are many extensions tailored for
different methods. For training deep neural networks (Chapter 6), some commonly used methods include

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
84
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
5.4 Optimization with large datasets

automatic adaption of the learning rate and an idea called momentum to counteract the randomness
caused by subsampling. The AdaGrad (short for adaptive gradient), RMSProp (short for root mean square
propagation) and Adam (short for adaptive moments) methods are such examples. For logistic regression
in this setting, the stochastic average gradient (SAG) method, which averages over all previous gradient
estimates, has proven useful, to only mention a few.

Algorithm 5.4: Stochastic gradient


Í𝑛
Input: Objective function 𝐽 (𝜽) = 𝑖=1 𝐿(x𝑖 , 𝑦 𝑖 ; 𝜽), initial 𝜽 (0) , learning rate 𝛾 (𝑡)
Result: b𝜽
1 Set 𝑡 ← 0
2 while Convergence criteria not met do
3 for 𝑖 = 1, 2, . . . , 𝐸 do
4 Randomly shuffle the training data {x𝑖 , 𝑦 𝑖 }𝑖=1 𝑛

5 for 𝑗 = 1, 2, . . . , 𝑛𝑏 do
𝑛

Approximate the gradient using the mini-batch {(x𝑖 , y𝑖 )}𝑖=(𝑏𝑗−1)𝑛 +1 ,


𝑗𝑛
6
b Í 𝑗𝑛𝑏 𝑏
d (𝑡) = 1
𝑛𝑏 ∇𝜽 𝐿 (x𝑖 , 𝑦 𝑖 , 𝜽 (𝑡) ).
𝑖=( 𝑗−1)𝑛𝑏 +1
7 Update 𝜽 (𝑡+1)
←𝜽 (𝑡)
− 𝛾 (𝑡) d (𝑡)
8 Update 𝑡 ← 𝑡 + 1
9 end
10 end
11 end
12 return b
𝜽 ← 𝜽 (𝑡−1)

Learning rate and convergence for stochastic gradient

Standard gradient descent converges if the learning rate is wisely chosen and constant, since the gradient
itself is zero at the minimum (or any other stationary point). For stochastic gradient, on the other hand, we
can not obtain convergence with a constant learning rate. The reason is that we only use an estimate of the
true gradient, and this estimate will not necessarily be zero at the minimum of the objective function, but
there might still be a considerable amount of “noise” in the gradient estimate due to the subsampling. As a
consequence, the stochastic gradient algorithm with a constant learning rate will not converge towards a
point, but continue to walk around, somewhat randomly.
By not using a constant learning rate, but instead decrease it gradually towards zero, the parameter
updates will be smaller and smaller, and eventually converge. We hence start at 𝑡 = 0 with a fairly high
learning rate 𝛾 (𝑡) (meaning that we take big steps) and then decay 𝛾 (𝑡) as 𝑡 increases. Under certain
regularity conditions of the cost function and with a learning rate fulfilling the Robbins-Monro conditions
Í∞ (𝑡) Í (𝑡) 2
𝑡=0 𝛾 = ∞ and ∞ 𝑡=0 (𝛾 ) < ∞, the stochastic gradient algorithm can be shown to converge almost
surely to a local minimum. The Robbins-Monro conditions are, for example, fulfilled if using 𝛾 (𝑡) = 1𝑡 .
For many machine learning problems, however, it has been found that better performance is often obtained
in practice if not letting 𝛾 (𝑡) → 0, but to cap it at some small value 𝛾min > 0. This will cause stochastic
gradient not to exactly converge, but the algorithm will in fact walk around indefinitely (or until the
algorithm is aborted by the user). For practical purposes this seemingly undesired property does usually
not cause any major issue if 𝛾min is only small enough, and one possible rule for setting the learning rate
in practice is
𝑡
𝛾 (𝑡) = 𝛾min + (𝛾max − 𝛾min )𝑒 − 𝜏 . (5.32)

Now the learning rate 𝛾 (𝑡) starts at 𝛾max and goes to 𝛾min as 𝑡 → ∞. How to pick the parameters 𝛾min ,
𝛾max and 𝜏 is more of an art than science. As a rule of thumb 𝛾min can be chosen approximately as 1% of
𝛾max . The parameter 𝜏 depends on the size of the dataset and the complexity of the problem, but should be

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 85
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
5 Learning parametric models

chosen such that multiple epochs have passed before we reach 𝛾min . The strategy to pick 𝛾max can be done
by monitoring the cost function as for standard gradient descent in Figure 5.3.
Example 5.7: Stochastic gradient

We apply the stochastic gradient method to the objective functions from Example 5.2. For the convex
function below, the choice of learning rate is not very crucial. Note, however, that the algorithm does not
converge as nicely as, for example, gradient descent, due to the “noise” in the gradient estimate caused by
the subsampling. This is the price we have to pay for the substantial computational savings offered by the
subsampling.

0
𝜃2

−5

−5 0 5

𝜃1

For the objective function with multiple local minima, we apply stochastic gradient with two decaying
learning rates, but with different initial 𝛾 (0) . With a smaller learning rate, left, stochastic gradient converges
to the closest minima, whereas a larger learning rate causes it to initially take larger steps and therefore not
necessarily converge to the closest minimum (right).

5 5

0 0
𝜃2

𝜃2

−5 −5

−5 0 5 −5 0 5

𝜃1 𝜃1

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
86
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
6 Neural networks and deep learning

Neural networks can be used for both regression and classification, and they can be seen as an extension of
linear regression and logistic regression, respectively. Traditionally neural networks with one so-called
hidden layer have been used and analysed, and several success stories came in the 1980s and early 1990s.
In the 2000s it was, however, realized that deep neural networks with several hidden layers, or simply deep
learning, are even more powerful. With the combination of a lot of training data, new software, hardware
and parallel algorithms for training, deep learning has made a major contribution to machine learning
and several other fields. Deep learning has excelled in many applications, including image classification,
speech recognition and language translation. New applications, analysis, and algorithmic developments to
deep learning are published literally every day.
We will start in Section 6.1 by generalizing linear regression to a two-layer neural network (that is,
a neural network with one hidden layer), and then generalize it further to a deep neural network. We
thereafter leave regression and look at the classification setting in Section 6.1. In Section 6.2 we present a
special neural network tailored for images and finally we look into some details on how to train neural
networks in Section 6.3.

6.1 Neural networks


A neural network is a nonlinear function that describes a prediction of the output b
𝑦 as a nonlinear function
of its input variables

b
𝑦 = 𝑓𝜽 (𝑥 1 , . . . , 𝑥 𝑝 ), (6.1)

where the function 𝑓 is parametrized by 𝜽. Such a nonlinear function can be parametrized in many
ways. In a neural network, the strategy is to use several layers of linear regression models and nonlinear
activation functions. We will explain this carefully in turn below.

Generalized linear regression

We start the description with a graphical illustration of the linear regression model

b
𝑦 = 𝑊1 𝑥 1 + 𝑊2 𝑥 2 + · · · + 𝑊 𝑝 𝑥 𝑝 + 𝑏, (6.2)

which is shown in Figure 6.1a. Each input variable 𝑥 𝑗 is represented with a node and each parameter 𝑊 𝑗
with a link. Furthermore, the output 𝑧 is described as the sum of all terms 𝑊 𝑗 𝑥 𝑗 . Note that we use 1 as the
input variable corresponding to the offset term 𝑏.
To describe nonlinear relationships between x = [1 𝑥 1 𝑥 2 . . . 𝑥 𝑝 ] T and b
𝑦 we introduce a nonlinear scalar
function called the activation function ℎ : R → R. The linear regression model (6.2) is now modified into
a generalized linear regresssion model where the linear combination of the inputs is squashed through the
(scalar) activation function

b
𝑦 = ℎ 𝑊1 𝑥 1 + 𝑊2 𝑥 2 + · · · + 𝑊 𝑝 𝑥 𝑝 + 𝑏 . (6.3)
This extension to the generalized linear regression model is visualized in Figure 6.1b.
Common choices for activation function are the logistic function and the rectified linear unit (ReLU).
These are illustrated in Figure 6.2a and Figure 6.2b, respectively. The logistic (or sigmoid) function has
already been used in the context of logistic regression (Section 3.2). The logistic function is linear close to

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 87
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
6 Neural networks and deep learning

1 𝑏 1 𝑏
𝑥1 𝑥1
.. 𝑊1 b
𝑦 .. 𝑊1 ℎ b
𝑦
. .
𝑥𝑝 𝑊 𝑥𝑝 𝑊
𝑝 𝑝

(a) (b)

Figure 6.1: Graphical illustration of a linear regression model (Figure 6.1a), and a generalized linear regression
model (Figure 6.1b). In Figure 6.1a, the output 𝑧 is described as the sum of all terms 𝑏 and {𝑊 𝑗 𝑥 𝑗 } 𝑝𝑗=1 , as in (6.2).
In Figure 6.1b, the circle denotes addition and also transformation through the activation function ℎ, as in (6.3).

1 ℎ(𝑧) 1 ℎ(𝑧)

𝑧 𝑧
−6 6 −1 1
1
Logistic: ℎ(𝑧) = 1+𝑒−𝑧 ReLU: ℎ(𝑧) = max(0, 𝑧)
(a) (b)

Figure 6.2: Two common activation functions used in neural networks. The logistic (or sigmoid) function
(Figure 6.2a), and the rectified linear unit (Figure 6.2b).

𝑧 = 0 and saturates at 0 and 1 as 𝑧 decreases or increases. The ReLU is even simpler. The function is the
identity function for positive inputs and equal to zero for negative inputs.
The logistic function used to be the standard choice of activation function in neural networks for many
years, whereas the ReLU is now the standard choice (despite its simplicity!) in most neural network
models.
The generalized linear regression model (6.3) is very simple and is itself not capable of describing
very complicated relationships between the input x and the output b 𝑦 . Therefore, we make two further
extensions to increase the generality of the model: We will first make use of several generalized linear
regression models to build a layer (which will lead us to the two-layer neural network) and then stack these
layers in a sequential construction (which will result in a deep neural network, or simply deep learning).

Two-layer neural network

In (6.3), the output is constructed by one scalar regression model. To increase its flexibility and turn it into
a two-layer neural network, we instead let the output be a sum of 𝑈 such generalized linear regression
models, each of which has its own parameters. The parameter for the 𝑖th regression model are 𝑏 𝑖 , . . . , 𝑊𝑖 𝑝
and we denote its output by 𝑞 𝑖 ,


𝑞 𝑖 = ℎ 𝑊𝑖1 𝑥 1 + 𝑊𝑖2 𝑥 2 + · · · + 𝑊𝑖 𝑝 𝑥 𝑝 + 𝑏 𝑖 , 𝑖 = 1, . . . , 𝑈. (6.4)

These intermediate outputs 𝑞 𝑖 are so-called hidden units, since they are not the output of the whole model.
The 𝑈 different hidden units {𝑞 𝑖 }𝑈
𝑖=1 instead act as input variables to an additional linear regression model

b
𝑦 = 𝑊1 𝑞 1 + 𝑊2 𝑞 2 + · · · + 𝑊𝑈 𝑞𝑈 + 𝑏. (6.5)

To distinguish the parameters in (6.4) and (6.5) we add the superscripts (1) and (2), respectively. The
equations describing this two-layer neural network (or equivalently, a neural network with one layer of

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
88
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
6.1 Neural networks

Input variables Hidden units Output

1
1 𝑏 1(1)
𝑏 (2)
ℎ 𝑞1
𝑥1
ℎ b
𝑦
..
. ..
.
𝑥𝑝
𝑞𝑈 𝑊𝑈(2)
𝑊𝑈(1)𝑝 ℎ

Figure 6.3: A two-layer neural network, or equivalently, a neural network with one intermediate layer of hidden
units.

hidden units) are thus


 
(1) (1) (1) (1)
𝑞 1 = ℎ 𝑊11 𝑥 1 + 𝑊12 𝑥 2 + · · · + 𝑊1 𝑝 𝑥 𝑝 + 𝑏 1 ,
 
(1) (1)
𝑞 2 = ℎ 𝑊21 𝑥 1 + 𝑊22 𝑥 2 + · · · + 𝑊2(1)
𝑝 𝑝𝑥 + 𝑏 (1)
2 , (6.6a)
..
.
 
𝑞𝑈 = ℎ 𝑊𝑈(1)1 𝑥 1 + 𝑊𝑈(1)2 𝑥 2 + · · · + 𝑊𝑈(1)𝑝 𝑥 𝑝 + 𝑏𝑈
(1)
,

b
𝑦 = 𝑊1(2) 𝑞 1 + 𝑊2(2) 𝑞 2 + · · · + 𝑊𝑈(2) 𝑞𝑈 + 𝑏 (2) . (6.6b)

Extending the graphical illustration from Figure 6.1, this model can be depicted as a graph with two layers
of links (illustrated using arrows), see Figure 6.3. As before, each link has a parameter associated with it.
Note that we include an offset term not only in the input layer, but also in the hidden layer.

Matrix notation
The two-layer neural network model in (6.6) can also be written more compactly using matrix notation,
where the parameters in each layer are stacked in a weight matrix W and an offset vector1 b as
𝑊 (1)  𝑏 (1) 
 11 ... 𝑝 
𝑊1(1)  1  h i
 ..  ,    
W (1) =  ... .  b (1) =  ...  , W (2) = 𝑊1(2) . . . 𝑊𝑈(2) , b (2) = 𝑏 (2) . (6.7)
 (1)   (1) 
𝑊𝑈 1 𝑊𝑈(1)𝑝  𝑏 

...
  𝑈 
The full model can then be written as
 
q = ℎ W (1) x + b (1) , (6.8a)
b
𝑦 = W (2) q + b (2) , (6.8b)

where we have also stacked the components in x and q as x = [𝑥 1 , . . . , 𝑥 𝑝 ] T and q = [𝑞 1 , . . . , 𝑞𝑈 ] T . The


activation function ℎ acts element-wise. The two weight matrices and the two offset vectors will be the
parameters in the model, which can be written as
 T
𝜽 = vec(W (1) ) T vec(W (2) ) T b (1) T b (2) T . (6.9)
1 Theword “bias” is often used for the offset vector in the neural network literature, but this is really just a model parameter and
not a bias in the statistical sense. To avoid confusion we refer to it as an offset instead.

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 89
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
6 Neural networks and deep learning

By this we have described a nonlinear regression model on the form b


𝑦 = 𝑓 (x; 𝜽) according to above. Note
that the predicted output b
𝑦 in (6.8b) depends on all the parameters in 𝜽 even though it is not explicitly
stated in the notation.

Deep neural network


The two-layer neural network is a useful model on its own, and a lot of research and analysis has been
done for it. However, the real descriptive power of a neural network is realized when we stack multiple
such layers of generalized linear regression models, and thereby achieve a deep neural network. Deep
neural networks can model complicated relationships (such as the one between an image and its class),
and is one of the state-of-the-art methods in machine learning as of today.
We enumerate the layers with index 𝑙. Each layer is parametrized with a weight matrix W (𝑙) and an
offset vector b (𝑙) , as for the two-layer case. For example, W (1) and b (1) belong to layer 𝑙 = 1, W (2) and
b (2) belong to layer 𝑙 = 2 and so forth. We also have multiple layers of hidden units denoted by q (𝑙−1) .
Each such layer consists of 𝑈𝑙 hidden units q (𝑙) = [𝑞 1(𝑙) , . . . , 𝑞𝑈
(𝑙) T
𝑙
] , where the dimensions 𝑈1 , 𝑈2 , . . .
can be different across the various layers.
Each layer maps a hidden layer q (𝑙−1) to the next hidden layer q (𝑙) according to

q (𝑙) = ℎ(W (𝑙) q (𝑙−1) + b (𝑙) ). (6.10)

This means that the layers are stacked such that the output of the first layer q (1) (the first layer of hidden
units) is the input to the second layer, the output of the second layer q (2) (the second layer of hidden units)
is the input to the third layer, etc. By stacking multiple layers we have constructed a deep neural network.
A deep neural network of 𝐿 layers can mathematically be described as.

q (1) = ℎ(W (1) x + b (1) ),


q (2) = ℎ(W (2) q (1) + b (2) ),
..
. (6.11)
q (𝐿−1)
= ℎ(W q
(𝐿−1) (𝐿−2)
+b (𝐿−1)
),
b
y = W (𝐿) q (𝐿−1) + b (𝐿) .

A graphical representation of this model is provided in Figure 6.4. The equation (6.11) for a deep neural
network can be compared with the equation (6.8) for a two-layer neural network.
The weight matrix W (1) for the first layer 𝑙 = 1 has the dimension 𝑈1 × 𝑝 and the corresponding offset
vector b (1) has the dimension 𝑈1 . In deep learning it is common to consider applications where also the
output is multi-dimensional b 𝑦1, . . . , b
y = [b 𝑦 𝑀 ] T . This means that for the last layer the weight matrix W (𝐿)
has the dimension 𝑀 × 𝑈 𝐿−1 and the offset vector b (𝐿) has the dimension 𝑀. For all intermediate layers
𝑙 = 2, . . . , 𝐿 − 1, W (𝑙) has the dimension 𝑈𝑙 × 𝑈𝑙−1 and the corresponding offset vector 𝑈𝑙 .
The number of inputs 𝑝 and the number of outputs 𝑀 are given by the problem, but the number of layers
𝐿 and the dimensions 𝑈1 , 𝑈2 , . . . are user design choices that will determine the flexibility of the model.

Learning the network from data


Analogously to the parametric models presented earlier (e.g. linear regression and logistic regression) we
need to learn all the parameters in order to use the model. For deep neural networks the parameters are
 T
𝜽 = vec(W (1) ) T vec(W (2) ) T · · · vec(W (𝐿) ) T b (1) T b (2) T · · · b (𝐿) T (6.12)

The wider and deeper the network is, the more parameters there are. Practical deep neural networks
can easily have in the order of millions of parameters and these models are therefore also extremely
flexible. Hence, some mechanism to avoid overfitting is needed. Regularization such as ridge regression
(3.48) is common, but there are also other techniques specific to deep learning; see further Section 6.3.

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
90
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
6.1 Neural networks

Input Hidden Hidden Hidden Hidden Outputs


variables units units units units
1 1 1 1
1
(1) (2) (𝐿−2) (𝐿−1) b
𝑦1
ℎ 𝑞1 ℎ 𝑞1 ℎ 𝑞1 ℎ 𝑞1
𝑥1
..
ℎ ℎ ... ℎ ℎ .
..
. .. .. .. .. ..
𝑥𝑝
.
(1)
.
(2)
. .
(𝐿−2)
.
(𝐿−1) b
𝑦𝑀
𝑞𝑈 1
𝑞𝑈 2
𝑞𝑈 𝐿−2
𝑞𝑈 𝐿−1
ℎ ℎ ℎ ℎ
Layer 1 Layer 2 Layer 𝐿-1 Layer 𝐿
W (1) b (1) W (2) b (2) W b
(𝐿−1)
(𝐿−1)
W (𝐿) b (𝐿)

Figure 6.4: A deep neural network with 𝐿 layers. Each layer 𝑙 is parameterized by W (𝑙) and b (𝑙) .

Furthermore, the more parameters there are, the more computational power is needed to train the model.
As before, the training data T = {(x𝑖 , y𝑖 )}𝑖=1
𝑛 consists of 𝑛 data points with input x and output y.

For a regression
 problem we typically start with maximum likelihood and assume Gaussian noise
2
𝜀 ∼ N 0, ℎ 𝜀 , and thereby obtain the square error loss function as in c3,

1 ∑︁
𝑛
b
𝜽 = arg min 𝐿(x𝑖 , y𝑖 , 𝜽), where 𝐿 (x𝑖 , y𝑖 , 𝜽) = ky𝑖 − 𝑓 (x𝑖 ; 𝜽) k 2 = ky𝑖 − b
y𝑖 k 2 . (6.13)
𝜽 𝑛
𝑖=1

This problem can be solved with numerical optimization, and more precisely stochastic gradient. This is
described in more detail in Chapter 5.
From the parameters 𝜽 and inputs {x𝑖 }𝑖=1
𝑛 we can compute the predicted outputs {b y𝑖 }𝑖=1
𝑛 using the model

b
y𝑖 = 𝑓 (x𝑖 ; 𝜽). For example, for the two-layer neural network presented in Section 6.1 we have

qT𝑖 = ℎ(xT𝑖 W (1) T + b (1) T ), (6.14a)


b
yT𝑖 = qT𝑖 W (2) T + b (2) T , 𝑖 = 1, . . . , 𝑛 (6.14b)

In (6.14) the equations are transposed in comparison to the model in (6.8). This is a small trick such that
we easily can extend (6.14) to include multiple data points 𝑖. Similar to (3.5) we stack all data points in
matrices, where each data point represents one row

yT  xT  b T qT 


 1  1 y1   1
    b =  ...  ,
  
Y =  ...  , X =  ...  , Y and Q =  ...  . (6.15)
 T  T  T  T
y𝑛  x𝑛  b  q𝑛 
    y𝑛   

We can then conveniently write (6.14) as

Q = ℎ(XW (1) T + b (1) T ), (6.16a)


b = QW (2) T + b (2) T ,
Y (6.16b)

where we have also stacked the predicted output and the hidden units in matrices. Note that the transposed
offset vectors bT1 and bT2 are added and broadcasted to each row in this notation.
The vectorized equations in (6.16) is also how the model would typically be implemented in languages
that support array programming. For the implementation you might want to consider using the transposed
version of W and b as your weight matrix and offset vector to avoid taking transpose in each layer.

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 91
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
6 Neural networks and deep learning

Input Hidden Hidden Hidden Hidden Logits Outputs


variables units units units units
1 1 1 1
1 𝑧1 𝑔1
(1) (2) (𝐿−2) (𝐿−1)
ℎ 𝑞1 ℎ 𝑞1 ℎ 𝑞1 ℎ 𝑞1
𝑥1
.. .. ..
ℎ ℎ ... ℎ ℎ . . .
..
. .. .. .. .. ..
. . . . . 𝑧𝑀 𝑔𝑀
𝑥𝑝 (1) (2) (𝐿−2) (𝐿−1)
𝑞𝑈 1
𝑞𝑈 2
𝑞𝑈 𝐿−2
𝑞𝑈 𝐿−1
ℎ ℎ ℎ ℎ
Layer 1 Layer 2 Layer 𝐿-1 Layer 𝐿 Softmax
W (1) b (1) W (2) b (2) W b (𝐿−1)
( 𝐿−1)
W (𝐿) b (𝐿)

Figure 6.5: A deep neural network with 𝐿 layers for classification. The only difference to regression (Figure 6.4) is
the softmax transformation after layer 𝐿.

Neural networks for classification

Neural networks can also be used for classification where we have qualitative outputs 𝑦 ∈ {1, . . . , 𝑀 }
instead of quantitative. In Section 3.2 we extended linear regression to logistic regression by simply adding
the logistic function to the output. In the same manner we can extend the neural network presented in the
previous section to a neural network for classification. In doing this extension, we will use the multi-class
version of logistic regression presented in Section 2, and more specifically the softmax parametrization
given in (3.41), repeated here for convenience

 𝑒 𝑧1 
 
 𝑒 𝑧2 
1  
softmax(z) , Í 𝑀  ..  . (6.17)
 . 
 
𝑧𝑗
𝑗=1 𝑒
𝑒 𝑧 𝑀 
 

The softmax function now becomes an additional activation function acting on the final layer of the
neural network. In addition to the regression network in (6.11) we add the softmax function at the end of
the network as

q (1) = ℎ(W (1) x + b (1) ), (6.18a)


..
.
q (𝐿−1) = ℎ(W (𝐿−1) q ( 𝐿−2) + b (1) ), (6.18b)
z=W q
(𝐿) (𝐿−1)
+b (𝐿)
, (6.18c)
g = softmax(z). (6.18d)

The softmax function maps the output of the last layer z = [𝑧 1 , . . . , 𝑧 𝑀 ] T to g = [𝑔1 , . . . , 𝑔 𝑀 ] T where 𝑔𝑚
is a model of the class probability 𝑝(𝑦 = 𝑚 | x𝑖 ), see also Figure 6.5 for a graphical illustration. The input
variables 𝑧 1 , . . . , 𝑧 𝑀 to the softmax function are referred to as logits.
Note that the softmax function does not come as a layer with additional parameters, it merely acts as
a transformation from the output into the modeled class probabilities. By construction, the outputs of
Í𝑀
the softmax function will always be in the interval 𝑔𝑚 ∈ [0, 1] and sum to 𝑚=1 𝑔𝑚 = 1 , otherwise they
could not be interpreted as probabilities.

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
92
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
6.1 Neural networks

𝑚=1 𝑚=2 𝑚=3 𝑚=1 𝑚=2 𝑚=3


𝑦 𝑖𝑚 0 1 0 𝑦 𝑖𝑚 0 1 0
𝑔𝑖𝑚 (𝜽 𝐴) 0.1 0.8 0.1 𝑔𝑖𝑚 (𝜽 𝐵 ) 0.8 0.1 0.1
Cross-entropy: Cross-entropy:
𝐿(x𝑖 , y𝑖 , 𝜽 𝐴) = −1 · ln 0.8 = 0.22 𝐿(x𝑖 , y𝑖 , 𝜽 𝐵 ) = −1 · ln 0.1 = 2.30
Figure 6.6: Illustration of the cross-entropy between a data point y𝑖 and two different prediction outputs g𝑖 (𝜽 𝐴) =
[𝑔𝑖1 (𝜽 𝐴), 𝑔𝑖2 (𝜽 𝐴), 𝑔𝑖3 (𝜽 𝐴)] and g𝑖 (𝜽 𝐵 ) = [𝑔𝑖1 (𝜽 𝐵 ), 𝑔𝑖2 (𝜽 𝐵 ), 𝑔𝑖3 (𝜽 𝐵 )].

Learning classification networks from data


The training data consists of 𝑛 data points with inputs and outputs {(x𝑖 , y𝑖 )}𝑖=1
𝑛 . For the classification

problem we use one-hot encoding scheme for the output y𝑖 . This means that for a problem with 𝑀
 T
different classes, y𝑖 consists of 𝑀 elements y𝑖 = 𝑦 𝑖1 . . . 𝑦 𝑖 𝑀 . If a data point 𝑖 belongs to class 𝑚
then 𝑦 𝑖𝑚 = 1 and 𝑦 𝑖 𝑗 = 0 for all 𝑗 ≠ 𝑚. See more about the one-hot encoding in Section 2.
For a neural network which has the softmax activation function in the final layer we typically use the
negative log-likelihood, which is also commonly referred to as the cross-entropy loss function (see (3.44)),
to train the model

1 ∑︁
𝑛 ∑︁
𝑀
b
𝜽 = arg min 𝐿(x𝑖 , y𝑖 , 𝜽), where 𝐿(x𝑖 , y𝑖 , 𝜽) = − 𝑦 𝑖𝑚 ln 𝑝(𝑦 = 𝑚 | x𝑖 ; 𝜽). (6.19)
𝜽 𝑛
𝑖=1 𝑚=1

To motivate the use of this cost function, we note that cross-entropy is close to its minimum if the
predicted probability 𝑝(𝑚 | x𝑖 ; 𝜽) is close to 1 for the class 𝑚 for which 𝑦 𝑖𝑚 = 1. For example, if the 𝑖th
 T
data point belongs to class 𝑚 = 2 out of a total of 𝑀 = 3 classes we have y𝑖 = 0 1 0 . Assume
that we have a set of parameters for the network that we denote by 𝜽 𝐴, and with these parameters we
predict 𝑝(𝑦 = 1 | x𝑖 ; 𝜽 𝐴) = 0.1, 𝑝(𝑦 = 2 | x𝑖 ; 𝜽 𝐴) = 0.8 and 𝑝(𝑦 = 3 | x𝑖 ; 𝜽 𝐴) = 0.1 indicating that we are
quite sure that data point 𝑖 actually belongs to class 𝑚 = 2. This would generate a low cross-entropy
𝐿(x𝑖 , y𝑖 , 𝜽 𝐴) = −(0 · ln 0.1 + 1 · ln 0.8 + 0 · ln 0.1) ≈ 0.22. If we instead for another set of parameters
𝜽 𝐵 predict 𝑝(𝑦 = 1 | x𝑖 ; 𝜽 𝐵 ) = 0.8, 𝑝(𝑦 = 2 | x𝑖 ; 𝜽 𝐵 ) = 0.1 and 𝑝(𝑦 = 3 | x𝑖 ; 𝜽 𝐵 ) = 0.1, the cross-entropy
would be much higher 𝐿(x𝑖 , y𝑖 , 𝜽 𝐵 ) = −(0 · ln 0.8 + 1 · ln 0.1 + 0 · ln 0.1) ≈ 2.30. For this case, we would
indeed prefer the parameters 𝜽 𝐴 over 𝜽 𝐵 . The above reasoning is summarized in Figure 6.6.
Computing the loss function explicitly via the logarithm could lead to numerical problems when
𝑝(𝑦 = 𝑚 | x𝑖 ; 𝜽) is close to zero since ln(𝑥) → −∞ as 𝑥 → 0. This can be avoided since the logarithm in
the cross-entropy loss function (6.19) can “undo” the exponential in the softmax function (6.17),

∑︁
𝑀 ∑︁
𝑀
𝐿 (x𝑖 , y𝑖 , 𝜽) = − 𝑦 𝑖𝑚 ln 𝑝(𝑦 = 𝑚 | x𝑖 ; 𝜽) = − 𝑦 𝑖𝑚 ln 𝑔𝑖𝑚
𝑚=1 𝑚=1
∑︁
𝑀  nÍ o
=− 𝑦 𝑖𝑚 𝑧 𝑖𝑚 − ln 𝑀
𝑗=1 𝑒 𝑧𝑖 𝑗 , (6.20)

 o
𝑚=1
∑︁
𝑀 nÍ
=− 𝑦 𝑖𝑚 𝑧 𝑖𝑚 − max 𝑧 𝑖 𝑗 − ln 𝑀
𝑗=1 𝑒 𝑧𝑖 𝑗 −max 𝑗 𝑧𝑖 𝑗 , (6.21)
𝑗
𝑚=1

where 𝑧 𝑖𝑚 denote the logits.

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 93
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
6 Neural networks and deep learning

Image Data representation Input variables

0.0 0.0 0.8 0.9 0.6 0.0 𝑥 1,1 𝑥 1,2 𝑥 1,3 𝑥 1,4 𝑥 1,5 𝑥 1,6

0.0 0.9 0.6 0.0 0.8 0.0 𝑥 2,1 𝑥 2,2 𝑥 2,3 𝑥 2,4 𝑥 2,5 𝑥 2,6

0.0 0.0 0.0 0.0 0.9 0.0 𝑥 3,1 𝑥 3,2 𝑥 3,3 𝑥 3,4 𝑥 3,5 𝑥 3,6

0.0 0.0 0.0 0.9 0.6 0.0 𝑥 4,1 𝑥 4,2 𝑥 4,3 𝑥 4,4 𝑥 4,5 𝑥 4,6

0.0 0.0 0.9 0.0 0.0 0.0 𝑥 5,1 𝑥 5,2 𝑥 5,3 𝑥 5,4 𝑥 5,5 𝑥 5,6

0.0 0.8 0.9 0.9 0.9 0.9 𝑥 6,1 𝑥 6,2 𝑥 6,3 𝑥 6,4 𝑥 6,5 𝑥 6,6

Figure 6.7: Data representation of a grayscale image with 6 × 6 pixels. Each pixel is represented with a number
encoding the grayscale color. We denote the whole image as X (a matrix), and each pixel value is an input variable
𝑥 𝑗,𝑘 (element in the matrix X).

6.2 Convolutional neural networks

Convolutional neural networks (CNN) are a special kind of neural networks originally tailored for
problems where the input data has a grid-like topology. In this text we will focus on images, which have a
2D-topology of pixels. Images are also the most common type of input data in applications where CNNs
are applied. However, CNNs can be used for any input data on a grid, also in 1D (e.g. audio waveform
data) and 3D (volumetric data e.g. CT scans or video data). We will focus on grayscale images, but the
approach can easily be extended to color images as well.

Data representation of an image

Digital grayscale images consist of pixels ordered in a matrix. Each pixel can be represented as a range
from 0 (total absence, black) to 1 (total presence, white) and values between 0 and 1 represent different
shades of gray. In Figure 6.7 this is illustrated for an image with 6 × 6 pixels. In an image classification
problem, an image is the input x and the pixels in the image are the input variables 𝑥 1,1 , 𝑥1,2 , . . . , 𝑥 6,6 . The
two indices 𝑗 and 𝑘 determine the position of the pixel in the image, as illustrated in Figure 6.7.
If we put all input variables representing the image pixels in a long vector, we can use the network
architecture presented in Section 6.1 and 6.1. However, by doing that, a lot of the structure present in the
image data will be lost. For example, we know that two pixels close to each other typically have more in
common than two pixels further apart. This information would be destroyed by such a vectorization. In
contrast, CNNs preserve this information by representing the input variables as well as the hidden layers
as matrices. The core component in a CNN is the convolutional layer, which will be explained next.

The convolutional layer

Following the input layer, we use a hidden layer with as many hidden units as there are input variables.
For the image with 6 × 6 pixels we consequently have 6 × 6 = 36 hidden units. We choose to order the
hidden units in a 6 × 6 matrix, in the same manner as we did for the input variables, see Figure 6.8a.
The network layers presented in earlier sections (like the one in Figure 6.3) have been dense layers.
This means that each input variable is connected to all hidden units in the subsequent layer, and each such
connection has a unique parameter 𝑊 𝑗 𝑘 associated to it. These layers have empirically been found to
provide too much flexibility for images and we might not be able to capture the patterns of real importance,
and hence not generalize and perform well on unseen data. Instead, a convolutional layer appears to exploit
the structure present in images to find a more efficiently parameterized model. In contrast to a dense layer,
a convolutional layer leverages two important concepts – sparse interactions and parameter sharing – to
achieve such a parametrization.


Figure 6.8: An illustration of the interactions in a convolutional layer: each hidden unit (circle) only depends on the pixels in a small region of the image (red boxes), here of size 3 × 3 pixels. The location of the hidden unit corresponds to the location of the region in the image: if we move to a hidden unit one step to the right, the corresponding region in the image also moves one step to the right, compare Figure 6.8a and Figure 6.8b. Furthermore, the nine parameters 𝑊⁽¹⁾_{1,1}, 𝑊⁽¹⁾_{1,2}, . . . , 𝑊⁽¹⁾_{3,3} are the same for all hidden units in the layer.

Figure 6.9: An illustration of the zero-padding used when the region is partly outside the image. With zero-padding, the size of the image can be preserved in the following layer.

Sparse interactions

With sparse interactions we mean that most of the parameters in a corresponding dense layer are forced to
be equal to zero. More specifically, a hidden unit in a convolutional layer only depends on the pixels in a
small region of the image and not on all pixels. In Figure 6.8 this region is of size 3 × 3. The position of
the region is related to the position of the hidden unit in its matrix topology. If we move to the hidden unit
one step to the right, the corresponding region in the image also moves one step to the right, as displayed
by comparing Figure 6.8a and Figure 6.8b. For the hidden units on the border, the corresponding region is
partly located outside the image. For these border cases, we typically use zero-padding where the missing
pixels are simply replaced with zeros. Zero-padding is illustrated in Figure 6.9.

Parameter sharing

In a dense layer each link between an input variable and a hidden unit has its own unique parameter. With
parameter sharing we instead let the same parameter be present in multiple places in the network. In a
convolutional layer the set of parameters for the different hidden units are all the same. For example, in
Figure 6.8a we use the same set of parameters to map the 3 × 3 region of pixels to the hidden unit as we do
in Figure 6.8b. Instead of learning separate sets of parameters for every position we only learn one set of a
few parameters, and use it for all links between the input layer and the hidden units. We call this set of
parameters a filter. The mapping between the input variables and the hidden units can be interpreted as a
convolution between the input variables and the filter, hence the name convolutional neural network.
The sparse interactions and parameter sharing in a convolutional layer make the CNN fairly invariant to translations of objects in the image. If the parameters in the filter are sensitive to a certain detail (such as a corner or an edge), a hidden unit will react to this detail (or not) regardless of where in the image that detail is present. Furthermore, a convolutional layer uses significantly fewer parameters than the corresponding dense layer. In Figure 6.8 only 3 · 3 + 1 = 10 parameters are required (including the


Figure 6.10: A convolutional layer with stride [2,2] and a filter of size 3 × 3.

offset parameter). If we instead had used a dense layer, (36 + 1) · 36 = 1332 parameters would have been needed. Another way of interpreting this is that, with the same number of parameters, a convolutional layer can encode more properties of an image than a dense layer.
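To make the sparse interactions, parameter sharing and zero-padding concrete, here is a minimal Python/NumPy sketch (not an example from the book) of one channel of a convolutional layer with a single 3 × 3 filter. The specific filter values, the offset and the use of a ReLU activation are illustrative assumptions.

```python
import numpy as np

def conv_layer_single_filter(X, W, b, stride=1):
    """One channel of a convolutional layer: each hidden unit depends only on a
    small (here 3 x 3) region of the image (sparse interactions), and the same
    nine weights W and offset b are reused for every hidden unit (parameter
    sharing). Zero-padding is used at the image borders."""
    J, K = X.shape
    f = W.shape[0]                 # filter size, assumed square (here 3)
    pad = f // 2
    Xp = np.zeros((J + 2 * pad, K + 2 * pad))
    Xp[pad:pad + J, pad:pad + K] = X          # zero-padded image
    rows = (J + 2 * pad - f) // stride + 1
    cols = (K + 2 * pad - f) // stride + 1
    H = np.zeros((rows, cols))
    for j in range(rows):
        for k in range(cols):
            region = Xp[j * stride:j * stride + f, k * stride:k * stride + f]
            H[j, k] = max(np.sum(region * W) + b, 0.0)   # ReLU activation (an assumption)
    return H

X = np.random.rand(6, 6)           # a 6 x 6 grayscale image
W = 0.1 * np.random.randn(3, 3)    # 9 shared filter parameters
b = 0.0                            # 1 offset parameter -> 10 parameters in total
print(conv_layer_single_filter(X, W, b).shape)   # (6, 6) hidden units
```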

Condensing information with strides

In the convolutional layer presented above we have as many hidden units as there are pixels in the image. As we add more layers to the CNN we usually want to condense the information by reducing the number of hidden units in each layer. One way of doing this is to not apply the filter to every pixel, but to, say, every second pixel. If we apply the filter to every second pixel both row-wise and column-wise, the hidden units will only have half as many rows and half as many columns. For a 6 × 6 image we get 3 × 3 hidden units. This concept is illustrated in Figure 6.10.
The stride controls how many pixels the filter shifts over the image at each step. In Figure 6.8 the stride is [1,1] since the filter moves by one pixel both row- and column-wise. In Figure 6.10 the stride is [2,2] since it moves by two pixels row- and column-wise. Note that the convolutional layer in Figure 6.10 still requires 10 parameters, just like the convolutional layer in Figure 6.8. Another way of condensing the information after a convolutional layer is to subsample the data, so-called pooling. The interested reader can read more about pooling in Goodfellow et al. (2016).
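As a small arithmetic illustration (the formula is the standard one for convolution output sizes, not something stated in this section), the number of hidden-unit rows or columns for a given stride can be computed as:

```python
def hidden_units_per_dimension(pixels, filter_size, padding, stride):
    """Rows (or columns) of hidden units produced by a convolutional layer."""
    return (pixels + 2 * padding - filter_size) // stride + 1

# 6 x 6 image, 3 x 3 filter, zero-padding of one pixel on each side:
print(hidden_units_per_dimension(6, 3, 1, 1))  # stride [1,1] -> 6
print(hidden_units_per_dimension(6, 3, 1, 2))  # stride [2,2] -> 3
```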

Multiple channels

The networks presented in Figure 6.8 and 6.10 only have 10 parameters each. Even though this
parameterization comes with several important advantages, one filter is probably not sufficient to encode
all interesting properties of the images in our dataset. To extend the network, we add multiple filters,
each with their own set of parameters. Each filter produces its own set of hidden units—a so-called
channel—using the same convolution operation as explained in Section 6.2. Hence, each layer of hidden
units in a CNN is organized into a tensor with the dimensions (rows × columns × channels). In Figure 6.11,
the first layer of hidden units has four channels and that hidden layer consequently has dimension 6 × 6 × 4.
When we continue to stack convolutional layers, each filter depends not only on one channel, but on all
the channels in the previous layer. This is displayed in the second convolutional layer in Figure 6.11. As
a consequence, each filter is a tensor of dimension (filter rows × filter columns × input channels). For
example, each filter in the second convolutional layer in Figure 6.11 is of size 3 × 3 × 4. If we collect all
filter parameters in one weight tensor W, that tensor will be of dimension (filter rows × filter columns ×
input channels × output channels). In the second convolutional layer in Figure 6.11, the corresponding
weight matrix W (2) is a tensor of dimension 3 × 3 × 4 × 6. With multiple filters in each convolutional
layer, each of them can be sensitive to different features in the image, such as certain edges, lines or circles, enabling a rich representation of the images in our training data.
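The tensor dimensions involved can be illustrated with a small NumPy sketch (border handling is ignored and all values are random; this is only an illustration of the shapes, not book code):

```python
import numpy as np

H1 = np.random.rand(6, 6, 4)             # hidden layer: rows x columns x channels
W2 = 0.1 * np.random.randn(3, 3, 4, 6)   # filter rows x filter cols x in-channels x out-channels
b2 = np.zeros(6)                          # one offset per output channel

# The hidden unit at position (j, k) in output channel m depends on a
# 3 x 3 region in *all* four input channels.
j, k, m = 2, 2, 0
region = H1[j - 1:j + 2, k - 1:k + 2, :]              # 3 x 3 x 4 region
pre_activation = np.sum(region * W2[:, :, :, m]) + b2[m]
print(W2.shape, pre_activation)
```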


[Figure: input variables (dim 6 × 6 × 1) → hidden units (dim 6 × 6 × 4) → hidden units (dim 3 × 3 × 6) → hidden units (dim 𝑈₃) → logits (dim 𝑀) → outputs (dim 𝑀), connected by a convolutional layer with W⁽¹⁾ ∈ ℝ^{3×3×1×4}, b⁽¹⁾ ∈ ℝ⁴, a convolutional layer with W⁽²⁾ ∈ ℝ^{3×3×4×6}, b⁽²⁾ ∈ ℝ⁶, a dense layer with W⁽³⁾ ∈ ℝ^{54×𝑈₃}, b⁽³⁾ ∈ ℝ^{𝑈₃}, a dense layer with W⁽⁴⁾ ∈ ℝ^{𝑈₃×𝑀}, b⁽⁴⁾ ∈ ℝ^{𝑀}, and a softmax layer.]
Figure 6.11: A full CNN architecture for classification of grayscale 6 × 6 images. In the first convolutional layer, four filters, each of size 3 × 3, produce a hidden layer with four channels. The first channel (in the back) is visualized in red and the fourth channel (in the front) in blue. We use stride [1,1], which maintains the number of rows and columns. In the second convolutional layer, six filters of size 3 × 3 × 4 and stride [2,2] are used. They produce a hidden layer with 3 rows, 3 columns and 6 channels. After the two convolutional layers follows a dense layer where all 3 · 3 · 6 = 54 hidden units in the second hidden layer are densely connected to the third layer of hidden units, where all links have their unique parameters. We add an additional dense layer mapping down to the 𝑀 logits. The network ends with a softmax function to provide predicted class probabilities as output.

Full CNN architecture

A full CNN architecture consists of multiple convolutional layers. Typically, we decrease the number of rows and columns in the hidden layers as we proceed through the network, but instead increase the number of channels to enable the network to encode more high-level features. After a few convolutional layers we usually end the network with one or more dense layers. If we consider an image classification task, we place a softmax layer at the very end to get outputs in the range [0,1]. The loss function when training a CNN is the same as in the regression and classification networks explained earlier, depending on which type of problem we have at hand. In Figure 6.11 a small example of a full CNN architecture is displayed.
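As an illustration only (the book does not prescribe any particular software), a network like the one in Figure 6.11 could be sketched in PyTorch as follows. PyTorch uses a channels-first layout, ReLU activations for the hidden layers are an assumption, and the values of 𝑈₃ and 𝑀 are arbitrary placeholders:

```python
import torch
import torch.nn as nn

U3, M = 32, 10   # hypothetical sizes of the third hidden layer and the output

cnn = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, stride=1, padding=1),  # 1x6x6 -> 4x6x6
    nn.ReLU(),
    nn.Conv2d(4, 6, kernel_size=3, stride=2, padding=1),  # 4x6x6 -> 6x3x3
    nn.ReLU(),
    nn.Flatten(),                                         # 54 hidden units
    nn.Linear(54, U3),                                    # dense layer
    nn.ReLU(),
    nn.Linear(U3, M),                                     # logits
    nn.Softmax(dim=1),                                    # class probabilities
)

x = torch.rand(1, 1, 6, 6)    # one grayscale 6 x 6 image
print(cnn(x).shape)           # torch.Size([1, M])
```

In practice the softmax is often folded into the loss function (a cross-entropy loss applied directly to the logits); it is kept explicit here only to match the figure.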

6.3 Training a neural network


To use a neural network for prediction we need to find suitable values for its parameters 𝜽. To do that we
solve an optimization problem on the form

\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}), \quad \text{where } J(\boldsymbol{\theta}) = \frac{1}{n} \sum_{i=1}^{n} L(\mathbf{x}_i, \mathbf{y}_i, \boldsymbol{\theta}). \qquad (6.22)

We denote 𝐽 (𝜽) as the cost function and 𝐿 (x𝑖 , y𝑖 , 𝜽) as the loss function. The functional form of the loss
function depends on the characteristics of the problem at hand, see for example (6.13) for regression and
(6.19) for classification.
These optimization problems cannot be solved in closed form, so numerical optimization has to be used. In all numerical optimization algorithms the parameters are updated in an iterative manner. In deep learning we typically use various versions of gradient-based search:

1. Pick an initialization 𝜽₀.
2. Update the parameters as 𝜽_{t+1} = 𝜽_t − 𝛾∇_𝜽 J(𝜽_t) for t = 1, 2, . . . .   (6.23)
3. Terminate when some criterion is fulfilled, and take the last 𝜽_t as the estimate of 𝜽.

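A minimal sketch of this iterative scheme (with a fixed step length 𝛾, a toy cost function and a simple termination criterion chosen purely for illustration) could look as follows:

```python
import numpy as np

def gradient_descent(grad_J, theta0, gamma=0.1, num_iter=1000, tol=1e-6):
    """Sketch of (6.23): repeatedly take a step of length gamma in the
    negative gradient direction, until the gradient is small."""
    theta = theta0
    for _ in range(num_iter):
        g = grad_J(theta)
        theta = theta - gamma * g
        if np.linalg.norm(g) < tol:   # one possible termination criterion
            break
    return theta

# Toy example: J(theta) = ||theta - 1||^2 has gradient 2*(theta - 1).
theta_hat = gradient_descent(lambda th: 2 * (th - 1.0), theta0=np.zeros(3))
print(theta_hat)   # approximately [1, 1, 1]
```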

This problem has two main computational challenges. The first is the big data problem. For deep learning applications the number of data points 𝑛 is typically very large, which makes the computation of the cost function and its gradient costly since it requires a sum over all data points. As a consequence, we cannot afford to compute the exact gradient ∇_𝜽 J(𝜽_t) at each iteration. Instead, we compute an approximation of this gradient by considering a random subset of the training data at each iteration. These so-called stochastic gradient algorithms are further explained in Section 5.4.
The second computational challenge is the number of parameters dim(𝜽), which is also very large for deep learning problems. To efficiently compute the gradient ∇_𝜽 J(𝜽_t) we apply the chain rule of calculus and reuse the partial derivatives needed to compute this gradient. This is called the back-propagation algorithm and is not further explained here. The interested reader can, for example, consult Goodfellow et al. (2016).

Initialization

Most of the optimization problems we have encountered so far (such as 𝐿₁ regularization and logistic regression) have been convex. This means that we can guarantee global convergence regardless of which initialization 𝜽₀ we use. In contrast, the cost function for training a neural network is usually non-convex. This means that the training is sensitive to the value of the initial parameters. Typically, we initialize all the parameters to small random numbers to enable the different hidden units to encode different aspects of the data. If ReLU activation functions are used, the offset elements 𝑏₀ are typically initialized to a small positive value, such that the units operate in the non-negative range of the ReLU.
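A minimal sketch of such an initialization (the scale 0.01 and the offset value 0.1 are arbitrary example values, not recommendations from the book):

```python
import numpy as np

def init_layer(num_inputs, num_outputs, scale=0.01, bias_value=0.1):
    """Initialize one layer: small random weights (so that different hidden
    units encode different aspects of the data) and, when ReLU activations are
    used, a small positive offset so that the units start out in the
    non-negative range of the ReLU."""
    W = scale * np.random.randn(num_inputs, num_outputs)
    b = bias_value * np.ones(num_outputs)
    return W, b

W1, b1 = init_layer(36, 64)   # e.g. a dense layer on a flattened 6 x 6 image
print(W1.shape, b1[:3])
```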

Dropout

Like all models presented in this book, neural network models can suffer from overfitting if the model is too flexible in relation to the complexity of the data. Bagging (Section 7.1) is one way to reduce the variance and thereby also the risk of overfitting. In bagging we train an entire ensemble of models. Each model (ensemble member) is trained on its own dataset, which has been bootstrapped (sampled with replacement) from the original training dataset. To make a prediction, we first make one prediction with each model (ensemble member), and then average over all models to obtain the final prediction.
Bagging is also applicable to neural networks. However, it comes with some practical problems; a large
neural network model usually takes quite some time to train and it also has quite some parameters to store.
To train not just one, but an entire ensemble of many large neural networks would thus be very costly, both
in terms of runtime and memory. Dropout is a bagging-like technique that allows us to combine many
neural networks without the need to train them separately. The trick is to let the different models share
parameters with each other, which reduces the computational cost and memory requirement.

Ensemble of sub-networks

Consider a neural network like the one in Figure 6.12a. In dropout we construct the equivalent to an
ensemble member by randomly removing some of the hidden units. We say that we drop the units,
hence the name dropout. Via this process we obtain a sub-network of our original network. Two such
sub-networks are displayed in Figure 6.12b. We randomly sample with a pre-defined probability which
units to drop, and the collection of dropped units in one sub-network is independent from the collection of
dropped units in another sub-network. When a unit is removed, we also remove all of its incoming and
outgoing connections. Not only hidden units can be dropped, but also input variables.
Since all sub-networks stem from the very same original network, the different sub-networks share some parameters with each other. For example, in Figure 6.12b the parameter 𝑊⁽¹⁾₅₅ is present in both sub-networks. The fact that they share parameters allows us to train the ensemble of sub-networks in an efficient manner.


Figure 6.12: A neural network with two hidden layers (a), and two sub-networks with dropped units (b). The collections of units that have been dropped are independent between the two sub-networks.

Training with dropout

To train with dropout we use the stochastic gradient algorithm described in Algorithm 5.4. In each gradient
step a mini-batch of data is used to compute an approximation of the gradient, as before. However, instead
of computing the gradient for the full network, we generate a random sub-network by randomly dropping
units as described above. We compute the gradient for that sub-network as if the dropped units were
not present and then do a gradient step. This gradient step only updates the parameters present in the
sub-network. The parameters that are not present are left untouched. In the next gradient step we grab
another mini-batch of data, remove another randomly selected collection of units and update the parameters
present in that sub-network. We proceed in this manner until some terminal condition is fulfilled.
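In practice, generating a random sub-network amounts to multiplying each layer of hidden units (and, possibly, the input variables) by a freshly drawn binary mask in every gradient step. A minimal sketch, where the keep-probability r is an assumed hyperparameter:

```python
import numpy as np

def apply_dropout(h, r, rng):
    """Dropout applied to one layer of hidden units h during training: each
    unit is kept with probability r and dropped (set to zero) otherwise.
    A new mask is drawn for every mini-batch, giving a new random
    sub-network in every gradient step."""
    mask = rng.random(h.shape) < r    # True for the units that are kept
    return h * mask

rng = np.random.default_rng(0)
h = np.ones(8)                        # hidden units of some layer
print(apply_dropout(h, r=0.8, rng=rng))
```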

Dropout vs bagging

The dropout procedure to generate an ensemble of models differs from bagging in a few ways:

• In bagging all models are independent in the sense that they have their own parameters. In dropout
the different models (the sub-networks) share parameters.

• In bagging each model is trained until convergence. In dropout each sub-network is only trained for a single gradient step. However, since they share parameters, all models are updated also when the other networks are trained.

• Similar to bagging, in dropout we train each model on a dataset that has been randomly selected
from our training data. However, in bagging we usually do it on a bootstrapped version of the whole
dataset whereas in dropout each model is trained on a randomly selected mini-batch of data.

Even though dropout differs from bagging in some aspects, it has empirically been shown to enjoy similar properties as bagging in terms of avoiding overfitting and reducing the variance of the model.


Figure 6.13: The network used for prediction after being trained with dropout. All units and links are present (no dropout), but the weights going out from a certain unit are multiplied with the probability of that unit being included during training. This is to compensate for the fact that some of them were dropped during training. Here all units have been kept with probability 𝑟 during training (and consequently dropped with probability 1 − 𝑟).

Prediction at test time

After we have trained the sub-networks, we want to make a prediction based on an unseen input data
point x★. In bagging we evaluate all the different models in the ensemble and combine their results. This
would be infeasible in dropout due to the very large (combinatorial) number of possible sub-networks.
However, there is a simple trick to approximately achieve the same result. Instead of evaluating all possible sub-networks we simply evaluate the full network containing all the parameters. To compensate for the fact that the model was trained with dropout, we multiply each estimated parameter going out from a unit with the probability of that unit being included during training. This ensures that the expected value of the input to a unit is the same during training and testing, as during training only a fraction of the incoming links were active. For instance, if we during training kept each unit with probability 𝑟 in all layers, then during testing we multiply all estimated parameters with 𝑟 before we make a prediction with the network. This is illustrated in Figure 6.13. This procedure of approximating the average over all ensemble
members has been shown to work surprisingly well in practice even though there is not yet any solid
theoretical argument for the accuracy of this approximation.
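A minimal sketch of this test-time compensation, assuming every unit was kept with the same probability r during training (the dictionary of weight matrices is a hypothetical example):

```python
import numpy as np

def scale_weights_for_prediction(trained_weights, r):
    """At test time the full network is used, but every weight going out from
    a unit that was kept with probability r during training is multiplied by
    r, so that the expected input to each unit matches training conditions."""
    return {name: r * W for name, W in trained_weights.items()}

trained = {"W1": np.random.randn(36, 64), "W2": np.random.randn(64, 10)}
scaled = scale_weights_for_prediction(trained, r=0.8)
print(scaled["W1"].shape)
```

Many software implementations instead scale the activations by 1/r already during training (so-called inverted dropout), which makes this test-time rescaling unnecessary; the effect is the same.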

Dropout as a regularization method

As a way to reduce the variance and avoid overfitting, dropout can be seen as a regularization method. There are plenty of other regularization methods for neural networks, including parameter penalties (analogous to ridge regression and LASSO in Chapter 5), early stopping (the training is stopped before the parameters have converged, thereby reducing the risk of overfitting) and various sparse representations (for example, CNNs can be seen as a regularization method where most parameters are forced to be zero), just to mention a few. Since its invention, dropout has become one of the most popular regularization techniques due to its simplicity, the fact that it is computationally cheap, and its good performance. In fact, a good practice when designing a neural network is often to extend the network until it overfits, then extend it a bit more, and finally add regularization such as dropout to avoid that overfitting.

6.4 Perspective and further reading


Although the first conceptual ideas of neural networks date back to the 1940s (McCulloch and Pitts
1943), they had their first main success stories in the late 1980s and early 1990s with the use of the
so-called back-propagation algorithm. At that stage, neural networks could, for example, be used to
classify handwritten digits from low-resolution images (LeCun, Boser, et al. 1990). However, in the late
1990s neural networks were largely forsaken because it was widely believed that they could not be used to


solve any challenging problems in computer vision and speech recognition. In these areas, neural networks
could not compete with hand-crafted solutions based on domain specific prior knowledge.
This situation has changed dramatically since the late 2000s, when neural networks with multiple layers became practical under the name deep learning. Progress in software, hardware and algorithm parallelization made it possible to address more complicated problems, which were unthinkable only a couple of decades ago. For example, in image recognition, these deep models are now the dominant methods of use and they reach human or even super-human performance on some specific tasks (LeCun, Bengio, et al. 2015). Recent advances based on deep neural networks have generated algorithms that can learn how to play computer games based on pixel information only (Mnih et al. 2015), how to beat the world champion in the board game of Go (Silver et al. 2016), and how to automatically understand the situation in images for caption generation (K. Xu et al. 2015).
An accessible introduction and overview of deep learning is provided by LeCun, Bengio, et al. (2015),
and via the textbook by Goodfellow et al. (2016).

7 Ensemble methods: Bagging and boosting

In Chapters 2 and 3 we introduced four fundamental models for supervised machine learning. In this chapter we will introduce ensemble methods, a type of meta-algorithm that makes use of multiple
copies of some fundamental model. We refer to a set of multiple copies of a fundamental model as an
ensemble of base models, and the key idea is to train each such base model in a slightly different way. To
obtain a prediction from an ensemble, we let each base model make its own prediction and then use a
(possibly weighted) average or majority vote to obtain the final prediction. With a carefully constructed
ensemble, the prediction obtained in this way is better than the prediction of a single base model.
We start in Section 7.1 by introducing a general technique referred to as bootstrap aggregating, or
bagging for short. The bagging idea is to first create multiple slightly different “versions” of the training data by, essentially, randomly sampling overlapping subsets of the training data (the so-called bootstrap). Thereafter, one base model is trained on each such “version” of the training data. In this way, an ensemble of similar, but not identical, base models is obtained. With this procedure it is possible to reduce the variance (without any notable increase in bias) compared to using only a single base model learned from the entire training dataset. In practice this means that by using bagging the risk of overfitting decreases, compared to using the base model itself. In Section 7.2 we introduce an extension to bagging
decreases, compared to using the base model itself. In Section 7.2 we introduce an extension to bagging
only applicable when the base model is a classification or regression tree, which results in a powerful
off-the-shelf method called random forests. In random forests, each tree is randomly perturbed in order to
obtain additional variance reduction, beyond what is already obtained by the bagging procedure itself.
In Sections 7.3–7.4 we introduce another ensemble method known as boosting. Boosting is different from bagging and random forests, since its base models are trained sequentially, one after the other, where each model tries to “correct” the “mistakes” made by the previous ones. In contrast to bagging, the main effect of boosting is bias reduction compared to the base model. Thus, boosting is able to turn an ensemble of “weak” base models (e.g., linear classifiers) into one “strong” ensemble model (e.g., a heavily non-linear classifier), and has also proven very useful in practice.

7.1 Bagging

As discussed already in Chapter 4, a central concept in machine learning is the bias–variance trade–off.
Roughly speaking, the more flexible a model is, the lower its bias will be. That is, a flexible model is
capable of representing complicated input–output relationships. Examples of simple yet flexible models
are 𝑘-NN with a small value of 𝑘 and a classification tree that is grown deep. Such highly flexible models
are sometimes needed for solving real-world machine learning problems, where relationships are far from
linear. The downside, however, is the risk of overfitting1 , or equivalently, high model variance. Despite
their high variance, those models are not useless. By using them as base models in bootstrap aggregating,
or bagging, we can

reduce the variance of the base model, without increasing its bias.

We outline the main idea of bagging with the example below.

1 Both a 𝑘-NN model with 𝑘 = 1 and a classification tree with a single data point per leaf node will result in zero training error – typical cases of severe overfitting.


Example 7.1: Using bagging for a regression problem

Consider the data (black dots) below that are drawn from a function (dashed line) plus noise. As always
in supervised machine learning, we want to learn a model from the data which is able to predict new data
points well. Being able to predict new data points well means, among other things, that the model should
predict the dotted line at 𝑥★ (the empty blue circle) well.
For solving this problem, we could use any regression method. Here, we use a regression tree which is
grown until each leaf node only contains one data point, whose prediction is shown to the lower left (blue
line and dot). This is a typical low-bias high-variance model, and the overfit to the training data is apparent
from the figure. We could decrease its variance, and hence the overfit, by using a more shallow (less deep)
tree, but that would on the other hand increase the bias. Instead, we lower the variance (without increasing
the bias much) by using bagging with the regression tree as base model.

[Figure: the training data and the true function (upper left); the ensemble of bootstrapped regression trees (upper right); the regression tree learned from all data (lower left); and the average of the bootstrapped regression trees = bagging (lower right).]

The rationale behind bagging goes as follows: Because of the noise in the training data, we may think of the prediction 𝑦ˆ(𝑥★) (the blue dot) as a random variable. In bagging, we learn an ensemble of base models (upper right panel), where each base model is trained on a different “version” of the training data obtained using the bootstrap. We may therefore think of each base model as a different realization of the random variable 𝑦ˆ(𝑥★). It is well known that the average of multiple realizations of a random variable has a lower variance than the random variable itself, which means that by taking the average (lower right) of all base models we obtain a prediction with less variance than the base model itself. That is, the bagged regression tree (lower right) has lower variance than the single regression tree (lower left). Since the base model itself also has low bias, the averaged prediction will have low bias and low variance. We can visually confirm that the prediction is better (the blue dot and circle are closer to each other) for bagging than for the single regression tree.


The bootstrap
As outlined in Example 7.1, the idea of bagging is to average over multiple base models, each learned
from a different training dataset. First we therefore have to construct different training datasets. In the best
of worlds we would just collect multiple datasets, but most often we cannot do that and instead we have to
make the most out of the limited data available. For this purpose, the bootstrap is useful.
The bootstrap is a method for artificially creating multiple datasets (of size 𝑛) out of one dataset (also of
size 𝑛). The traditional usage of the bootstrap is to quantify uncertainties in statistical estimators (such as
confidence intervals), but it turns out to be useful also for machine learning. We denote the original dataset T = {xᵢ, 𝑦ᵢ}ⁿᵢ₌₁, and assume that T provides a good representation of the real-world data-generating
process, in the sense that if we were to collect more training data, these data points would likely be similar
to the training data points already contained in T . We can thus argue that randomly picking data points
from T is a reasonable way to simulate a “new” training dataset. In statistical terms, instead of sampling
from the population (collecting more data), we sample from the available training data which is assumed
to provide a good representation of the population.
The bootstrap is stated in Algorithm 7.1 and illustrated in Example 7.2 below. Note that the sampling is
done with replacement, meaning that the resulting bootstrapped dataset may contain multiple copies of
some of the original training data points, whereas other data points are not included at all.

Algorithm 7.1: The bootstrap.

Data: Training dataset T = {xᵢ, 𝑦ᵢ}ⁿᵢ₌₁
Result: Bootstrapped data T̃ = {x̃ᵢ, 𝑦̃ᵢ}ⁿᵢ₌₁
1 for 𝑖 = 1, . . . , 𝑛 do
2   Sample ℓ uniformly on the set of integers {1, . . . , 𝑛}
3   Set x̃ᵢ = x_ℓ and 𝑦̃ᵢ = 𝑦_ℓ
4 end

Time to reflect 7.1: What would happen if the sampling was done without replacement in the
bootstrap?
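A minimal NumPy sketch of Algorithm 7.1 (sampling indices uniformly with replacement):

```python
import numpy as np

def bootstrap(X, y, rng):
    """Algorithm 7.1: draw n indices uniformly *with replacement* from
    {0, ..., n-1} and return the corresponding bootstrapped dataset."""
    n = X.shape[0]
    idx = rng.integers(0, n, size=n)   # sampling with replacement
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1).astype(float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X_boot, y_boot = bootstrap(X, y, rng)
print(X_boot.ravel())   # some data points appear several times, others not at all
```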


Example 7.2: The bootstrap

We have a small training dataset with 𝑛 = 10 data points with a two-dimensional input x = [𝑥1 𝑥2 ] and a
binary output 𝑦 ∈ {Blue, Red}.

Original training data, T = {xᵢ, 𝑦ᵢ}¹⁰ᵢ₌₁:

Index   𝑥₁    𝑥₂    𝑦
1       9.0   2.0   Blue
2       1.0   4.0   Blue
3       4.0   6.0   Blue
4       4.0   1.0   Blue
5       1.0   2.0   Blue
6       1.0   8.0   Red
7       6.0   4.0   Red
8       7.0   9.0   Red
9       9.0   8.0   Red
10      9.0   6.0   Red

[Right panel: scatter plot of the ten data points in the (𝑥₁, 𝑥₂) plane.]

To generate a bootstrapped dataset T̃ = {x̃ᵢ, 𝑦̃ᵢ}¹⁰ᵢ₌₁ we simulate 10 times with replacement from the index set {1, . . . , 10}, resulting in the indices {2, 10, 10, 5, 9, 2, 5, 10, 8, 10}. Thus, (x̃₁, 𝑦̃₁) = (x₂, 𝑦₂), (x̃₂, 𝑦̃₂) = (x₁₀, 𝑦₁₀), etc. We end up with the following dataset, where the numbers in parentheses in the right panel indicate that there are multiple copies of some of the original data points in the bootstrapped data.

Bootstrapped data, T̃ = {x̃ᵢ, 𝑦̃ᵢ}¹⁰ᵢ₌₁:

Index   𝑥̃₁    𝑥̃₂    𝑦̃
2       1.0   4.0   Blue
10      9.0   6.0   Red
10      9.0   6.0   Red
5       1.0   2.0   Blue
9       9.0   8.0   Red
2       1.0   4.0   Blue
5       1.0   2.0   Blue
10      9.0   6.0   Red
8       7.0   9.0   Red
10      9.0   6.0   Red

[Right panel: scatter plot of the bootstrapped data points; the numbers in parentheses indicate how many copies of a data point are included.]

Variance reduction by averaging


By running the bootstrap (Algorithm 7.1) repeatedly 𝐵 times we obtain 𝐵 identically distributed bootstrapped datasets T̃⁽¹⁾, . . . , T̃⁽ᴮ⁾. We can then use those bootstrapped datasets to train an ensemble of 𝐵 base models. We thereafter average their predictions

\hat{y}_{\text{bag}}(\mathbf{x}_\star) = \frac{1}{B} \sum_{b=1}^{B} \tilde{y}^{b}(\mathbf{x}_\star) \quad \text{or} \quad \mathbf{g}_{\text{bag}}(\mathbf{x}_\star) = \frac{1}{B} \sum_{b=1}^{B} \tilde{\mathbf{g}}^{b}(\mathbf{x}_\star), \qquad (7.1)

depending on whether we are concerned with regression (predicting an output value ŷ_bag(x★)) or classification (predicting class probabilities g_bag(x★)). In (7.1), ỹ¹(x★), . . . , ỹᴮ(x★) and g̃¹(x★), . . . , g̃ᴮ(x★) denote the predictions from the individual ensemble members. The averaged prediction, denoted ŷ_bag(x★) or g_bag(x★), is the final prediction obtained from bagging. We summarize this by Method 7.1. (For classification, the prediction could alternatively be decided by a majority vote among the ensemble members, but that typically degrades the performance slightly compared to averaging the predicted class probabilities.)
We will now give some more details on the variance reduction that happens in (7.1), which is the entire
point of bagging. We focus on regression, but the intuition works also for classification.
Let us point out a basic property of random variables, namely that averaging reduces variance. To
formalize this, let 𝑧 1 , . . . , 𝑧 𝐵 be a collection of identically distributed (but possibly dependent) random


Learn all base models
Data: Training dataset T = {xᵢ, 𝑦ᵢ}ⁿᵢ₌₁
Result: 𝐵 base models
1 for 𝑏 = 1, . . . , 𝐵 do
2   Run Algorithm 7.1 to obtain a bootstrapped training dataset T̃⁽ᵇ⁾
3   Learn a base model from T̃⁽ᵇ⁾
4 end

Predict with the base models
Data: 𝐵 base models and test input x★
Result: A prediction ŷ_bag(x★) or g_bag(x★)
1 for 𝑏 = 1, . . . , 𝐵 do
2   Use base model 𝑏 to predict ỹᵇ(x★) or g̃ᵇ(x★)
3 end
4 Obtain ŷ_bag(x★) or g_bag(x★) by averaging as in (7.1)

Method 7.1: Bagging
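A minimal sketch of Method 7.1 for regression, using scikit-learn regression trees as base models (the choice of library and of fully grown trees is an assumption for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_bagging(X, y, B, rng):
    """Method 7.1 (regression): train B regression trees, each on its own
    bootstrapped version of the training data."""
    n = X.shape[0]
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)    # the bootstrap (Algorithm 7.1)
        tree = DecisionTreeRegressor()       # low-bias, high-variance base model
        tree.fit(X[idx], y[idx])
        models.append(tree)
    return models

def predict_bagging(models, X_star):
    """Average the predictions of all ensemble members, as in (7.1)."""
    return np.mean([m.predict(X_star) for m in models], axis=0)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2, size=(30, 1)), axis=0)
y = np.sin(3 * X[:, 0]) + 0.3 * rng.standard_normal(30)
models = train_bagging(X, y, B=25, rng=rng)
print(predict_bagging(models, np.array([[1.0]])))
```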

variables with mean value E[𝑧_𝑏] = 𝜇 and variance Var[𝑧_𝑏] = 𝜎² for 𝑏 = 1, . . . , 𝐵. Furthermore, assume that the average correlation² between any pair of variables is 𝜌. Then, computing the mean and the variance of the average (1/𝐵) ∑ᴮ_{𝑏=1} 𝑧_𝑏 of these variables, we get

\mathbb{E}\left[\frac{1}{B}\sum_{b=1}^{B} z_b\right] = \mu, \qquad (7.2a)

\operatorname{Var}\left[\frac{1}{B}\sum_{b=1}^{B} z_b\right] = \frac{1-\rho}{B}\,\sigma^2 + \rho\sigma^2. \qquad (7.2b)

The first equation (7.2a) tells us that the mean is unaltered by averaging a number of identically distributed random variables. Furthermore, the second equation (7.2b) tells us that the variance is reduced by averaging if the correlation 𝜌 < 1. The first term in the variance expression (7.2b) can be made arbitrarily small by increasing 𝐵, whereas the second term is determined only by the correlation 𝜌.
To make the connection between bagging and (7.2), consider the predictions ỹᵇ(x★) from the base models as random variables (in other words, imagine 𝑧_𝑏 being the red dots in Example 7.1). All base models, and hence their predictions, originate from the same data T (via the bootstrap), and the ỹᵇ(x★) are therefore identically distributed but correlated. By averaging the predictions we decrease the variance, according to (7.2b). Even if we choose 𝐵 very large, the achievable variance reduction is limited by the correlation 𝜌. Experience has shown that 𝜌 is often small enough that the computational cost of bagging (compared to only using the base model itself) pays off well in terms of decreased variance. To summarize, by averaging the identically distributed predictions from several base models as in (7.1), each with a low bias, the bias remains low³ (according to (7.2a)) and the variance is reduced (according to (7.2b)).
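The variance formula (7.2b) can also be checked numerically. In the sketch below, identically distributed variables with variance σ² and pairwise correlation ρ are constructed through a shared component; the particular values of B, σ² and ρ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
B, sigma2, rho = 10, 1.0, 0.3
n_sim = 100_000

# z_b = sqrt(rho)*z0 + sqrt(1-rho)*e_b gives Var[z_b] = sigma2 and Corr[z_b, z_c] = rho.
z0 = rng.standard_normal((n_sim, 1))
e = rng.standard_normal((n_sim, B))
z = np.sqrt(sigma2) * (np.sqrt(rho) * z0 + np.sqrt(1 - rho) * e)

print(np.var(z.mean(axis=1)))                 # empirical variance of the average
print((1 - rho) / B * sigma2 + rho * sigma2)  # (7.2b): 0.37 for these values
```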
At first glance, one might think that a bagging model (7.1) becomes more “complex” as the number of ensemble members 𝐵 increases, and that we therefore run a risk of overfitting if we use many ensemble
2 That is, \frac{1}{B(B-1)} \sum_{b \neq c} \mathbb{E}[(z_b - \mu)(z_c - \mu)] = \rho\sigma^2.
3 Strictly speaking, (7.2a) implies that the bias is identical for a single ensemble member and the ensemble average. The use of the bootstrap might however affect the bias, in that a base model trained on the original data might have a smaller bias than a base model trained on a bootstrapped version of the training data. Most often, this is not an issue in practice.


members 𝐵. However, there is nothing in (7.2) that indicates any such problem (the bias remains low and the variance decreases), and we confirm this by Example 7.3.
Example 7.3: Bagging for regression (cont.)

We consider the problem from Example 7.1 again, and explore how the number of base models 𝐵 affects the result. We measure the squared error between the “true” function value at 𝑥★ and the prediction ŷ_bag(𝑥★) when using different 𝐵. (Because of the bootstrap, there is a certain amount of randomness in the bagging algorithm itself. To avoid that “noise”, we average the result over multiple runs of the bagging algorithm.)

[Plot: squared error for ŷ_bag(𝑥★) as a function of the number of ensemble members 𝐵 = 5, . . . , 45.]
What we see here is that the squared error eventually reaches a plateau as 𝐵 → ∞. Had there been an overfitting issue as 𝐵 → ∞, the squared error would have started to increase again for some large value of 𝐵.

Despite the fact that the number of parameters in the model increases with 𝐵, the lack of overfitting as 𝐵 → ∞ observed in Example 7.3 is the expected (and intended) behavior. It is important to understand that, by the construction of bagging, more ensemble members do not make the resulting model more flexible, but only reduce its variance. With this in mind, in practice the choice of 𝐵 is mainly guided by computational constraints. The larger 𝐵 the better, but increasing 𝐵 when there is no further reduction in test error is computationally wasteful.
Be aware! Bagging can still overfit; we cannot prevent that. The only claim we make is that the problem never gets worse as 𝐵 → ∞.

Out-of-bag error estimation


When using bagging (or random forests), it turns out that there is a way to estimate the expected new data error 𝐸_new without using cross-validation. The first observation we have to make is that not all data points from the original dataset T will have been used for training all ensemble members. It is actually possible to show that with the bootstrap, on average only 63% of the original training data points in T = {xᵢ, 𝑦ᵢ}ⁿᵢ₌₁ will be present in a bootstrapped training dataset T̃ = {x̃ᵢ, 𝑦̃ᵢ}ⁿᵢ₌₁. Roughly speaking, this means that for any given {xᵢ, 𝑦ᵢ} in T, one third of the ensemble members will not have seen that data point. We refer to these roughly 𝐵/3 ensemble members as being out-of-bag for data point 𝑖, and we let them form their own ensemble, the out-of-bag-ensemble 𝑖. Note that the out-of-bag-ensemble is different for each data point {xᵢ, 𝑦ᵢ}.
The next key insight is that for the out-of-bag-ensemble 𝑖, the data point {xᵢ, 𝑦ᵢ} can actually act as a test data point, since it has not been seen by any of its ensemble members. By computing the squared or misclassification error when the out-of-bag-ensemble 𝑖 predicts {xᵢ, 𝑦ᵢ}, we thus get an estimate of 𝐸_new for this out-of-bag-ensemble, which we denote 𝐸_OOB⁽ⁱ⁾. Since 𝐸_OOB⁽ⁱ⁾ is based on only one data point, it will be a fairly poor estimate of 𝐸_new. If we however repeat this for all data points {xᵢ, 𝑦ᵢ} in the training data T and average, 𝐸_OOB = (1/𝑛) ∑ⁿᵢ₌₁ 𝐸_OOB⁽ⁱ⁾, we get a better estimate of 𝐸_new. Indeed, 𝐸_OOB will be an estimate of 𝐸_new for an ensemble with only 𝐵/3 (and not 𝐵) members, but as we have seen (Example 7.3), the performance of bagging plateaus after a certain number of ensemble members. Hence, if 𝐵 is large enough so that ensembles with 𝐵 and 𝐵/3 members perform roughly equally, 𝐸_OOB provides an estimate of 𝐸_new which can be at least as good as the estimate 𝐸_𝑘-fold from 𝑘-fold cross-validation. Most importantly, however, 𝐸_OOB comes almost for free in bagging, whereas 𝐸_𝑘-fold requires much more computation since the model is re-trained 𝑘 times.
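A minimal sketch of out-of-bag error estimation for bagged regression trees (again using scikit-learn as an assumed base-model implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_with_oob(X, y, B, rng):
    """Train a bagged ensemble and estimate E_new with the out-of-bag error:
    each data point is predicted only by the ensemble members that did not
    see it during training."""
    n = X.shape[0]
    models, in_bag = [], []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
        in_bag.append(np.unique(idx))
    errors = []
    for i in range(n):
        oob = [m for m, used in zip(models, in_bag) if i not in used]
        if oob:  # skip the (rare) points that every member happened to see
            pred = np.mean([m.predict(X[i:i + 1])[0] for m in oob])
            errors.append((y[i] - pred) ** 2)
    return models, np.mean(errors)   # E_OOB, an estimate of E_new

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(40, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.standard_normal(40)
_, E_oob = bagging_with_oob(X, y, B=50, rng=rng)
print(E_oob)
```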

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
108
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
7.2 Random forests

7.2 Random forests


In bagging we reduce the variance by averaging over an ensemble of models. That variance reduction,
however, is limited by the correlation between the individual ensemble members (encoded as 𝜌 in (7.2b)).
Using a simple trick, it turns out to be possible to reduce that correlation further, beyond what is achieved
by using the bootstrap. This is known as random forests.
While bagging is a general technique that in principle can be used to reduce the variance of any base
model, random forests assumes that these base models are classification or regression trees. The idea is to
inject additional randomness when constructing each tree, in order to further reduce the correlation among
the base models. At first this might seem like a silly idea: randomly perturbing the training of a model
should intuitively degrade its performance. There is a rationale for this perturbation, however, which we
will discuss below, but first we present the details of the algorithm.
Let T̃⁽ᵇ⁾ be one of the 𝐵 bootstrapped datasets in bagging. To train a classification or regression tree
on this data we proceed as usual (see Section 2.3), but with one difference. Throughout the training,
whenever we are about to split a node we do not consider all possible input variables 𝑥 1 , . . . , 𝑥 𝑝 as splitting
variables. Instead, we pick a random subset consisting of 𝑞 ≤ 𝑝 inputs, and only consider these 𝑞 variables
as possible splitting variables. At the next splitting point we draw a new random subset of 𝑞 inputs to use
as possible splitting variables, and so on. Naturally, this random subset selection is done independently for
each of the 𝐵 ensemble members, so that we (with high probability) end up using different subsets for the
different trees. This additional random constraint when training is what turns bagging into random forests.
This will cause the 𝐵 trees to be less correlated and averaging their predictions can therefore result in larger
variance reduction compared to bagging. It should be noted, however, that this random perturbation of the
training procedure will increase the variance4 of each individual tree. In the notation of Equation (7.2b),
random forests decreases 𝜌 (good) but increases 𝜎 2 (bad) compared to bagging. Experience has however
shown that the reduction in correlation is the dominant effect, so that the averaged prediction variance is
often reduced. We illustrate this in Example 7.4 below.
To understand why it can be a good idea to only consider a subset of inputs as splitting variables,
recall that tree-building is based on recursive binary splitting which is a greedy algorithm. This means
that the algorithm can make choices early on that appear to be good, but which nevertheless turn out to
be suboptimal further down the splitting procedure. For instance, consider the case when there is one
dominant input variable. If we construct an ensemble of trees using plain bagging, it is then very likely
that all of the ensemble members will pick this dominant variable up as the first splitting variable, making
all trees identical (that is, perfectly correlated) after the first split. If we instead apply random forests,
some of the ensemble members will not even have access to this dominant variable at the first split, since
it most likely will not be present in the random subset of 𝑞 inputs selected at the first split for some of
the ensemble members. This will force those members to split according to some other variable. While
there is no reason for why this would improve the performance of the individual tree, it could prove to be
useful further down the splitting process, and since we average over many ensemble members the overall
performance can be improved.

4 And possibly also the bias, in a similar manner as the bootstrap might increase the bias, see Footnote 3, page 107.


Example 7.4: Random forests and bagging for a binary classification problem

Consider a binary classification problem with 𝑝 = 2, using the data given below. The two classes are shown as blue circles and red dots. The input values were randomly sampled from [0, 2] × [0, 2], and labeled red with probability 0.98 if above the dotted line, and vice versa. We use two different classifiers: bagging with classification trees (which is equivalent to a random forest with 𝑞 = 𝑝 = 2) and a random forest with 𝑞 = 1, each with 𝐵 = 9 ensemble members. Below we plot the decision boundary for each ensemble member as well as the majority-voted final decision boundary.
The most apparent difference is that the variation among the ensemble members of the random forest is larger than among those of bagging. Roughly half of the random forest ensemble members have been forced to make the first split along the horizontal axis, which has led to increased variance and decreased correlation compared to bagging, where all ensemble members made the first split along the vertical axis.
[Figure: top row – the data, the final bagging decision boundary and the final random forest decision boundary; middle panels – the nine bagging ensemble members and the nine random forest ensemble members; bottom panel – 𝐸_new as a function of 𝐵 = 5, . . . , 45 for random forest and bagging.]
While it is hard to visually compare the final decision boundaries for bagging and random forest (top right),
we also compute 𝐸 new for different numbers of ensemble members 𝐵. Since the learning itself has a certain
amount of randomness, we average over multiple learned models to not be confused by that random effect.
Indeed we see that the random forest performs better than bagging, except for very small 𝐵, and we conclude
that the positive effect of the reduced correlation between the ensemble members outweighs the negative
effect of additional variance. The poor performance of random forest with only one ensemble member is
expected, since this lonely model has higher variance and no averaging is taking place when 𝐵 = 1.


[Figure: songs by the Beatles, Kiss and Bob Dylan plotted with length (ln s) on the horizontal axis and energy (scale 0–1) on the vertical axis.]
Figure 7.1: Random forest applied to the music classification problem from Example 2.1.

User aspects

Since random forests are a bagging method, the tools and properties from Section 7.1 apply also to random forests. One such example is the out-of-bag error estimation, which is applicable also to random forests. Also for random forests, 𝐵 → ∞ does not lead to overfitting. Hence, there is no reason to choose 𝐵 small, other than the computational load. Compared to using a single tree, a random forest requires approximately 𝐵 times as much computation. Since all trees are identically distributed, it is however possible to parallelize the random forest learning over multiple nodes, where each node learns a few ensemble members.
The choice of 𝑞 is a tuning parameter, and for 𝑞 = 𝑝 we recover the basic bagging method described previously. As a rule of thumb we can set 𝑞 = √𝑝 for classification problems and 𝑞 = 𝑝/3 for regression problems (values rounded down to the closest integer). A more systematic way of selecting 𝑞 is to use out-of-bag error estimation or cross-validation and select 𝑞 such that 𝐸_OOB or 𝐸_𝑘-fold is minimized.
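With an off-the-shelf implementation such as scikit-learn (an assumption; the text does not tie itself to any library), 𝐵 corresponds to n_estimators, 𝑞 to max_features, and the out-of-bag estimate is available directly; the toy data below is made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 2))
y = (X[:, 1] > X[:, 0]).astype(int)

forest = RandomForestClassifier(
    n_estimators=100,       # B
    max_features="sqrt",    # q = sqrt(p), the rule of thumb for classification
    oob_score=True,         # also compute the out-of-bag estimate
).fit(X, y)
print(forest.oob_score_)    # out-of-bag accuracy (i.e. 1 - E_OOB for 0-1 loss)
```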
We conclude by applying random forest to the music classification problem from Example 2.1 in
Figure 7.1.

7.3 Boosting and AdaBoost

Whereas bagging primarily is an ensemble method for reducing variance in high-variance base models,
boosting is rather an ensemble method for reducing bias in high-bias base models. A typical example of a
simple (or, in other words, weak) high-bias model is a classification tree of depth one (sometimes called a
classification stump). Boosting is built on the idea that even a weak high-bias model often can capture
some of the relationship between the inputs and the output. Thus, by training multiple weak models, each
describing part of the input-output relationship, it might be possible to combine the predictions of these
models into an overall better prediction. Hence, the intention is to reduce the bias by turning an ensemble
of weak models into one strong model.
Boosting shares some similarities with bagging. Both are ensemble methods, in the sense that they are
based on combining the predictions from multiple models (an ensemble). Both bagging and boosting
can also be viewed as meta-algorithms, in the sense that they can be used to combine essentially any
regression or classification algorithm—they are algorithms built on top of other algorithms. However,
there are also important differences between boosting and bagging which we will discuss below.
The main difference is how the base models are learned. In bagging we learn 𝐵 identically distributed
models in parallel. Boosting, on the other hand, uses a sequential construction of the ensemble members.
Informally, this is done in such a way that each model tries to correct the mistakes made by the previous
one. This is accomplished by modifying the training dataset at each iteration in order to put more emphasis
on the data points for which the model (so far) has performed poorly. The final prediction is obtained from
a weighted average or a weighted majority vote among the models. We look at the simple Example 7.5 to


illustrate this idea.


The example illustrates the idea of boosting, but there are several important details left to specify in
order to have a useful off-the-shelf algorithm. We will soon have a look at a specific boosting algorithm
called AdaBoost, and thereafter consider the more general framework of gradient boosting. We will
restrict our attention to binary classification (𝐾 = 2, with 𝑦 ∈ {+1, −1}), but boosting can be applied also to the multiclass problem and to regression.
Example 7.5: Boosting illustration

We consider a binary classification problem with a two-dimensional input x = [𝑥₁ 𝑥₂]. The training data consists of 𝑛 = 10 data points, 5 from each of the two classes. We use a decision stump, a classification tree of depth one, as a simple (weak) base classifier. A decision stump means that we select one of the input variables, 𝑥₁ or 𝑥₂, and split the input space into two half-spaces, in order to minimize the training error. This results in a decision boundary that is perpendicular to one of the axes. The left panel below shows the training data, illustrated by red crosses and blue dots for the two classes, respectively. The colored regions show the decision boundary for a decision stump ŷ⁽¹⁾(x) trained on this data.

[Figure, from left to right: the decision stump ŷ⁽¹⁾(x) from iteration 𝑏 = 1; ŷ⁽²⁾(x) from iteration 𝑏 = 2; ŷ⁽³⁾(x) from iteration 𝑏 = 3; and the boosted model ŷ_boost(x). Axes: 𝑥₁ (horizontal) and 𝑥₂ (vertical).]

The model ŷ⁽¹⁾(x) misclassifies three data points (red crosses falling in the blue region), which are encircled in the figure. To improve the performance of the classifier we want to find a model that can distinguish these three points from the blue class. To this end, we train another decision stump, ŷ⁽²⁾(x), on the same data. To put emphasis on the three misclassified points, however, we assign weights {𝑤ᵢ⁽²⁾}ⁿᵢ₌₁ to the data. Points correctly classified by ŷ⁽¹⁾(x) are down-weighted, whereas the three points misclassified by ŷ⁽¹⁾(x) are up-weighted. This is illustrated in the second panel of the figure above, where the marker sizes have been scaled according to the weights. The classifier ŷ⁽²⁾(x) is then found by minimizing the weighted misclassification error, \frac{1}{n}\sum_{i=1}^{n} w_i^{(2)} \mathbb{I}\{y_i \neq \hat{y}^{(2)}(\mathbf{x}_i)\}, resulting in the decision boundary shown in the second panel. This procedure is repeated for a third and final iteration: we update the weights based on the hits and misses of ŷ⁽²⁾(x) and train a third decision stump ŷ⁽³⁾(x), shown in the third panel. The final classifier ŷ_boost(x) is then obtained as a majority vote of the three decision stumps.
The decision boundary of the boosted classifier is shown in the right panel. Note that this decision
boundary is nonlinear, whereas the decision boundary for each ensemble member is linear. This illustrates
the concept of turning an ensemble of three weak (high-bias) base models into a stronger (low-bias) model.
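A minimal sketch of the re-weighting idea from the example, using scikit-learn decision stumps (an assumed implementation). The simple up-/down-weighting factors used here are only for illustration; they are not the AdaBoost weight update derived in the next section:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_stumps(X, y, B):
    """Train B decision stumps sequentially, up-weighting the data points that
    the previous stump misclassified, and combine them by a majority vote."""
    n = X.shape[0]
    w = np.ones(n) / n
    stumps = []
    for _ in range(B):
        stump = DecisionTreeClassifier(max_depth=1)        # a decision stump
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        w = np.where(miss, 2.0 * w, 0.5 * w)               # illustrative update
        w = w / w.sum()
        stumps.append(stump)
    return stumps

def predict_boost(stumps, X_star):
    votes = np.sum([s.predict(X_star) for s in stumps], axis=0)   # y in {-1, +1}
    return np.sign(votes)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(10, 2))
y = np.where(X[:, 0] + X[:, 1] > 1, 1, -1)
stumps = boost_stumps(X, y, B=3)
print(predict_boost(stumps, X))
```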

AdaBoost

What we have discussed so far is a general idea, but there are still a few technical design choices left. Let us now derive an actual boosting method, the AdaBoost (Adaptive Boosting) algorithm for binary classification. AdaBoost was the first successful practical implementation of the boosting idea and led the way for its popularity.
As we outlined in Example 7.5, boosting attempts to construct a sequence of 𝐵 (weak) binary classifiers ŷ⁽¹⁾(x), ŷ⁽²⁾(x), . . . , ŷ⁽ᴮ⁾(x). We will in this procedure only consider the final ‘hard’ prediction ŷ(x) from the base models, and not their predicted class probabilities 𝑔(x). Any classification model can in principle be used as base classifier; shallow classification trees are common in practice. The individual predictions of the 𝐵 ensemble members are then combined into a final prediction. Unlike bagging, all ensemble members are not treated equally. Instead, we assign positive coefficients {𝛼⁽ᵇ⁾}ᴮ_{𝑏=1} to the ensemble members and

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
112
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
7.3 Boosting and AdaBoost

construct the boosted classifier using a weighted majority vote


( 𝐵 )
∑︁
b (𝐵)
𝑦 boost (x) = sign 𝛼 b
(𝑏) (𝑏)
𝑦 (x) . (7.3)
𝑏=1

Each ensemble member votes either −1 or +1, and the output from the boosted classifier is +1 if the
weighted sum of the individual votes is positive and −1 if it is negative.
In AdaBoost, the ensemble members and their coefficients 𝛼 (𝑏) are trained greedily by minimizing
the exponential loss of the boosted classifier at each iteration. That is, one ensemble member is added
iteratively at a time. When member 𝑏 is added, it is trained such that the exponential loss (5.12) of the
entire ensemble of 𝑏 members (that is, the boosted classifier we have so far) is minimized. The reason for
choosing the exponential loss is that the problem becomes easier to solve (much like the squared error loss
in linear regression), which we will now see when we derive a mathematical expression for this prodecure.
Let us write the boosted classifier after 𝑏 iterations as b (𝑏)
𝑦 boost (x) = sign{𝑐 (𝑏) (x)} where 𝑐 (𝑏) (x) =
Í𝑏
𝑗=1 𝛼 b
( 𝑗) 𝑦 ( 𝑗) (x). Since 𝑐 (𝑏) (x) is a sum (see (7.3)), we can express it iteratively as

𝑐 (𝑏) (x) = 𝑐 (𝑏−1) (x) + 𝛼 (𝑏) b


𝑦 (𝑏) (x), (7.4)

initialized with 𝑐0 (x) = 0. The ensemble members are constructed sequentially, meaning that at iteration 𝑏
of the procedure the function 𝑐 (𝑏−1) (x) is known and fixed. This is what makes this construction “greedy”.
What remains to be learned at iteration 𝑏 is the ensemble member b 𝑦 (𝑏) (x) and its coefficient 𝛼 (𝑏) . We do
this by minimizing the exponential loss of the training data,
∑︁
𝑛
(𝛼 (𝑏)
,b
(𝑏)
𝑦 ) = arg min 𝐿(𝑦 𝑖 · 𝑐 (𝑏) (x𝑖 )) (7.5a)
( 𝛼,b
𝑦) 𝑖=1
∑︁
𝑛   
= arg min exp −𝑦 𝑖 𝑐 (𝑏−1) (x𝑖 ) + 𝛼b
𝑦 (x𝑖 ) (7.5b)
( 𝛼,b
𝑦) 𝑖=1
∑︁
𝑛  
= arg min exp −𝑦 𝑖 𝑐 (𝑏−1) (x𝑖 ) exp (−𝑦 𝑖 𝛼b
𝑦 (x𝑖 )) , (7.5c)
( 𝛼,b
𝑦) 𝑖=1 | {z }
(𝑏)
=𝑤𝑖

where for the first equality we have used the definition of the exponential loss function (5.12) and the
sequential structure of the boosted classifier (7.4). The last equality is where the convenience of the
exponential loss appears, namely the fact that exp(𝑎 + 𝑏) = exp(𝑎) exp(𝑏). This allows us to define the
quantities
 
def
𝑤 𝑖(𝑏) = exp −𝑦 𝑖 𝑐 (𝑏−1) (x𝑖 ) , (7.6)

which can be interpreted as weights for the individual data points in the training dataset. Note that the
weights 𝑤 𝑖(𝑏) are independent of 𝛼 and b
𝑦 . That is, when learning b
𝑦 (𝑏) (x) and its coefficient 𝛼 (𝑏) by solving
(𝑏) 𝑛
(7.5c) we can regard {𝑤 𝑖 }𝑖=1 as constants.
To solve (7.5) we start by rewriting the objective function as
∑︁
𝑛 ∑︁
𝑛 ∑︁
𝑛
𝑤 𝑖(𝑏) exp (−𝑦 𝑖 𝛼b
𝑦 (x𝑖 )) = 𝑒 −𝛼 𝑤 𝑖(𝑏) I{𝑦 𝑖 = b
𝑦 (x𝑖 )} + 𝑒 𝛼 𝑤 𝑖(𝑏) I{𝑦 𝑖 ≠ b
𝑦 (x𝑖 )}, (7.7)
| {z } | {z }
𝑖=1 𝑖=1 𝑖=1

=𝑊𝑐 =𝑊𝑒

where we have used the indicator function to split the sum into two sums: the first ranging over all training
data points correctly classified by b
𝑦 and the second ranging over all points erroneously classified by b 𝑦.
(Remember that b 𝑦 is the ensemble member we are to learn at this step.) Furthermore, for notational

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 113
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
7 Ensemble methods: Bagging and boosting

simplicity we define 𝑊𝑐 and 𝑊𝑒 for the sum of weights of correctly classified and erroneously classified
Í𝑛
data points, respectively. Furthermore, let 𝑊 = 𝑊𝑐 + 𝑊𝑒 be the total weight sum, 𝑊 = 𝑖=1 𝑤 𝑖(𝑏) .
Minimizing (7.7) is done in two stages, first with respect to b
𝑦 and then with respect to 𝛼. This is possible
since the minimizing argument in b 𝑦 turns out to be independent of the actual value of 𝛼 > 0, another
convenient effect of using the exponential loss function. To see this, note that we can write the objective
function (7.7) as

𝑒 −𝛼 𝑊 + (𝑒 𝛼 − 𝑒 −𝛼 )𝑊𝑒 . (7.8)

Since the total weight sum 𝑊 is independent of b𝑦 and since 𝑒 𝛼 − 𝑒 −𝛼 > 0 for any 𝛼 > 0, minimizing this
expression with respect to b
𝑦 is equivalent to minimizing 𝑊𝑒 with respect to b𝑦 . That is,
∑︁
𝑛
b
𝑦 (𝑏) = arg min 𝑤 𝑖(𝑏) I{𝑦 𝑖 ≠ b
𝑦 (x𝑖 )}. (7.9)
b
𝑦 𝑖=1

In words, the 𝑏 th ensemble member should be trained by minimizing the weighted misclassification loss,
where each data point (x𝑖 , 𝑦 𝑖 ) is assigned a weight 𝑤 𝑖(𝑏) . The intuition for these weights is that, at iteration
𝑏, we should focus our attention on the data points previously misclassified in order to “correct the
mistakes” made by the ensemble of the first 𝑏 − 1 classifiers.

Time to reflect 7.2: In AdaBoost, we use the exponential loss for training the boosting ensem-
ble. How come that we end up training the individual ensemble members using a weighted
misclassification loss (and not the unweighted exponential loss) then?

How the problem (7.9) is solved in practice depends on the choice of base classifier that we use, that
is, on the specific restrictions that we put on the function b 𝑦 (for example a shallow classification tree).
However, solving (7.9) is almost our standard classification problem, except for the weights 𝑤 𝑖(𝑏) . Training
the ensemble member 𝑏 on a weighted classification problem is, for most base classifiers, straightforward.
Since most classifiers are trained by minimizing some cost function, this simply boils down to weighting
the individual terms of the cost function and solve that slightly modified problem instead.
When the 𝑏 th ensemble member, b 𝑦 (𝑏) (x), has been trained for solving the weighted classification
problem (7.9) it remains to learn its coefficient 𝛼 (𝑏) . This is done by solving (7.5), which amounts to
minimizing (7.8) once b 𝑦 has been trained. By differentiating (7.8) with respect to 𝛼 and setting the
derivative to zero we get the equation
   
2𝛼 1 𝑊
−𝛼𝑒 𝑊 + 𝛼 (𝑒 + 𝑒 ) 𝑊𝑒 = 0 ⇔ 𝑊 = 𝑒 + 1 𝑊𝑒 ⇔ 𝛼 = ln
−𝛼 𝛼 −𝛼
−1 .
2 𝑊𝑒

Thus, by defining

𝑊𝑒 ∑︁ 𝑤 𝑖
𝑛 (𝑏)
I{𝑦 𝑖 ≠ b
def
Í𝑛
(𝑏)
𝐸 train = = (𝑏)
𝑦 (𝑏) (x𝑖 )} (7.10)
𝑊 𝑖=1 𝑤𝑗=1 𝑗

to be the weighted misclassification error for the 𝑏 th classifier, we can express the optimal value for its
coefficient as
(𝑏)
!
1 1 − 𝐸 train
𝛼 (𝑏) = ln . (7.11)
2 𝐸 (𝑏) train

This completes the derivation of the AdaBoost algorithm, which is summarized in Method 7.2. In the
algorithm we exploit the fact that the weights (7.6) can be computed recursively by using the expression
(7.4) in line 5 in the learning. Furthermore, we have added an explicit weight normalization (line 6) which
is convenient in practice and which does not affect the derivation of the method above.

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
114
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
7.3 Boosting and AdaBoost

(𝐵)
The derivation of AdaBoost assumes that all coefficients {𝛼 (𝑏) } 𝑏=1 are positive. To see that this is
indeed the case when the coefficients are computed according to (7.11), note that the function ln((1 − 𝑥)/𝑥)
is positive for any 0 < 𝑥 < 0.5. Thus, 𝛼 (𝑏) will be positive as long as the weighted training error for the
𝑏 th classifier, 𝐸 train
(𝑏)
, is less than 0.5. That is, the classifier just has to be slightly better than a coin flip,
(𝑏) (𝑏)
which is always the case in practice (note that 𝐸 train is the training error). (Indeed, if 𝐸 train > 0.5, then we
could simply flip the sign of all predictions made by b 𝑦 (x) to reduce the error below 0.5.)
(𝑏)

Learn an AdaBoost classifier


Data: Training data T = {x𝑖 , 𝑦 𝑖 }𝑖=1
𝑛

Result: 𝐵 weak classifiers


1 Assign weights 𝑤 𝑖(1) = 1/𝑛 to all data points.
2 for 𝑏 = 1, . . . , 𝐵 do
3 Train a weak classifier b 𝑦 (𝑏) (x) on the weighted training data {(x𝑖 , 𝑦 𝑖 , 𝑤 𝑖(𝑏) )}𝑖=1
𝑛 .
Í𝑛
4 Compute 𝐸 train = 𝑖=1 𝑤 𝑖 I{𝑦 𝑖 ≠ b
(𝑏) (𝑏)
𝑦 (𝑏) (x𝑖 )}.
(𝑏) (𝑏)
5 Compute 𝛼 (𝑏) = 0.5 ln((1 − 𝐸 train )/𝐸 train ).
6 Compute 𝑤 𝑖 (𝑏+1)
= 𝑤 𝑖 exp(−𝛼 𝑦 𝑖 b
(𝑏) (𝑏) 𝑦 (x𝑖 )), 𝑖 = 1, . . . , 𝑛.
(𝑏)
(𝑏+1) (𝑏+1) Í𝑛 (𝑏+1)
7 Set 𝑤 𝑖 ← 𝑤𝑖 / 𝑗=1 𝑤 𝑗 , for 𝑖 = 1, . . . , 𝑛.
8 end

Predict with the AdaBoost classifier


Data: 𝐵 weak classifiers and test input x★
Result: Prediction b (𝐵)
𝑦 boost (x★)
Í 𝐵
1 Output b 𝛼 (𝑏) b
(𝐵)
𝑦 boost (x★) = sign 𝑏=1 𝑦 (𝑏) (x★) .

Method 7.2: AdaBoost

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 115
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
7 Ensemble methods: Bagging and boosting

Example 7.6: AdaBoost and bagging for a binary classification example

Consider the same binary classification problem as in Example 7.4. We now compare how AdaBoost and
bagging performs on this problem, when using trees of depth one (decision stumps) and three, respectively.
The decision boundaries for each method with 𝐵 = 1, 5, 20 and 100 ensemble members are shown below.
Despite using quite weak ensemble members (a shallow tree has high bias), AdaBoost adapts quite well to
the data (Example 7.4). This is in contrast to bagging, where the decision boundary does not become much
more flexible despite using many ensemble members. In other words, AdaBoost reduces the bias of the base
model, whereas bagging only has minor effect on the bias.

2 2 2 2
Tree depth 1

1 1 1 1
AdaBoost

0 0 0 0
0 1 2 0 1 2 0 1 2 0 1 2

2 2 2 2
Tree depth 3

1 1 1 1

0 0 0 0
0 1 2 0 1 2 0 1 2 0 1 2

2 2 2 2
Tree depth 1

1 1 1 1
Bagging

0 0 0 0
0 1 2 0 1 2 0 1 2 0 1 2

2 2 2 2
Tree depth 3

1 1 1 1

0 0 0 0
0 1 2 0 1 2 0 1 2 0 1 2

𝐵=1 𝐵=5 𝐵 = 20 𝐵 = 100

AdaBoost, tree depth 1


0.14
AdaBoost, tree depth 3
Bagging, tree depth 1
𝐸¯ new

0.12
Bagging, tree depth 3
0.1

0.08
10 20 30 40 50 60 70 80 90 100

We also numerically compute 𝐸¯ new for this problem, as a function of 𝐵, which is shown above. Remember
that 𝐸¯ new depends on both the bias and the variance. As discussed, the main effect of bagging is variance
reduction, but that does not help much since the base model already is quite low-variance (but high-bias).
Boosting, on the other hand, reduces bias, which has a much bigger effect in this case. Furthermore, bagging
does not overfit as 𝐵 → ∞, but that is not the case for boosting! We can indeed see that for trees of depth 3,
the smallest 𝐸¯ new is obtained for 𝐵 ≈ 25, and there is actually a slight increase in 𝐸¯ new for larger values of 𝐵.
Hence, AdaBoost with depth-3 trees suffers from a (minor) overfit as 𝐵 & 25 in this problem.

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
116
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
7.3 Boosting and AdaBoost

User aspects of AdaBoost


AdaBoost, and in fact any boosting algorithm, has two important design choices, (i) which base classifier
to use, an (ii) how many iterations 𝐵 to run the boosting algorithm for. As previously pointed out, we can
use essentially any classification method as base classifier. However, the most common choice in practice
is to use a shallow classification tree, or even a decision stump (a tree of depth one; see Example 7.5).
This choice is guided by the fact that boosting reduces bias efficiently, and can thereby learn good models
despite using a very weak (high-bias) base model. Since shallow trees can be trained quickly, they are a
good default choice. Practical experience suggests that trees with roughly 6 terminal nodes often works
well as base models, but trees of depth one (only 𝑀 = 2 terminal nodes in binary classification) are
perhaps even more commonly used. In fact, using deep classification trees (high-variance models) as base
classifiers typically deteriorates performance.
The base models are learned sequentially in boosting, each iteration introduces a new base model aiming
at reducing the errors made by the current model. As an effect, the boosting model becomes more and more
flexible as the number of iterations 𝐵 increases and using too many base models can result in overfitting (in
contrast to bagging, where increased 𝐵 cannot lead to overfit). It has been observed in practice, however,
that this overfitting often occurs slowly and the performance tends to be rather insensitive to the choice of
𝐵. Nevertheless, it is a good practice to select 𝐵 in some systematic way, for instance using early stopping
during training, similarly to how it is used for neural networks (recall Section 6.3). Another unfortunate
aspect of the sequential nature of boosting it that it is not possible to parallelize the learning.
In the method discussed above we have assumed that each base classifier outputs a class prediction,
b
𝑦 (x) ∈ {−1, 1}. However, many classification models outputs 𝑔(x), which is an estimate of the class
(𝑏)

probability 𝑝(𝑦 = 1 | x). In AdaBoost it is possible to use the predicted probabilities 𝑔(x) (instead of
the binary prediction b 𝑦 (x)) when constructing the prediction, however at the cost of a more complicated
expression than (7.3). This extension of Method 7.2 is referred to as Real AdaBoost.

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 117
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
7 Ensemble methods: Bagging and boosting

7.4 Gradient boosting


It has been seen in practice that AdaBoost performs well if there is little noise (few mislabeled data points)
in the training data. If the training data has more noise and hence contains more outliers, the performance
of AdaBoost typically deteriorates. That is not an artifact of the boosting idea, but of the exponential loss
function that we discussed in Section 5.1. It is possible to construct more robust boosting algorithms by
choosing another loss function, but the training becomes more computationally involved, as we will see.
If the squared error loss in linear regression is replaced with, say, absolute error loss, the simple
analytical solution (the normal equations) is lost. We can, however, still learn the model by using numerical
optimization. The situation is quite similar for boosting. The simplicity of AdaBoost is that the boosting
problem is turned into a sequence of weighted classification problems, which we can solve (at least
approximately) by almost any off-the-shelf classification method. If the exponential loss in (7.5a) is
replaced, the problem does not separate into weighted classification problems anymore, and the situation
becomes more intricate.
It turns out to be possible to approximately solve (7.5a) for rather general loss functions using a method
reminiscent of gradient descent ((5.25) in Chapter 5). The resulting method is referred to as gradient
boosting.
Like AdaBoost (7.4), we construct the classifier sequentially as

𝑐 (𝑏) (x) = 𝑐 (𝑏−1) (x) + 𝛼 (𝑏) b


𝑦 (𝑏) (x), (7.12)

for 𝑏 = 1, . . . , 𝐵, and the goal is to sequentially select {𝛼 (𝑏) , b


𝑦 (𝑏) (x)} such that the final 𝑐 (𝐵) (x) minimizes

1 ∑︁
𝑛
𝐽 (𝑐) = 𝐿(𝑦 𝑖 · 𝑐(x𝑖 )) (7.13)
𝑛 𝑖=1

for any arbitrary differentiable loss function 𝐿, such as the logistic loss.
The idea of gradient boosting is to think of

 𝑐 (𝐵) (x1 ) 
 
 
 
..
 (𝐵) . 
𝑐 (x𝑛 ) 
 
| {z }
𝑐 (𝐵) (X)

as an 𝑛-dimensional optimization variable, which by construction is build up in the additive fashion (7.12).
We then observe that the boosted model (7.12) actually looks like a gradient descent update (recall (5.25)
in Chapter 5), if (7.13) would be the objective function 𝐽 (𝑐(X)) and b
𝑦 (𝑏) (x) its gradient ∇𝑐 𝐽 (𝑐 (𝑏−1) (X)).
With this inspiration from gradient descent, we would like to choose b 𝑦 (𝑏) (x) in (7.12) such that

b   𝜕𝐿 ( 𝑦1 ·𝑐 (𝑏−1) (x1 )) 
 𝑦 (x1 ) 
(𝑏)
 
   𝜕𝑐 

..
 is close to  .. .
 (𝑏) .   .  (7.14)
b   𝜕𝐿 ( 𝑦𝑛 ·𝑐 (𝑏−1) (x𝑛 )) 
 (x )   
 
𝑦 𝑛
| {z }
𝜕𝑐

∇𝑐 𝐽 (𝑐 (𝑏−1) (X))

By the arguments of gradient descent this would as 𝑏 → ∞ lead to a 𝑐 ( 𝐵) (x) which is, at least approximately,
a local minimizer of (7.13).
We have now outlined the idea of gradient boosting, but it is not yet a concrete algorithm. Specifically
we have to be more precise about b 𝑦 (𝑏) in (7.14). In gradient boosting, we simply handle (7.14) as a
regression problem where b
(𝑏−1)
𝑦 (𝑏) is learned with input x𝑖 and output 𝜕𝐿 ( 𝑦𝑖 ·𝑐𝜕𝑐 (x𝑖 )) (𝑖 = 1, . . . , 𝑛). (We
have assumed that the loss function 𝐿 is differentiable, meaning that we can always compute the necessary

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
118
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
7.4 Gradient boosting

derivatives.) That is, gradient boosting builds a classifier by using regression. The type of base regression
model b𝑦 is in principle arbitrary, but regression trees are often used in practice.
We have not yet discussed how to choose 𝛼 (𝑏) , which corresponds to 𝛾 (the step size or learning rate) in
gradient descent. In the simplest version of gradient descent it is considered a tuning choice left to the
user, but in can also be formulated as a line-search optimization problem itself to choose the optimal value
at each iteration. For gradient boosting, it is most often handled in the latter way. When using trees as base
models optimizing 𝛼 (𝑏) can be done jointly with learning b 𝑦 , resulting in a more efficient implementation.
If multiplying the optimal 𝛼 with a parameter < 1, a regularizing effect is obtained which has proven
(𝑏)

useful in practice.
We summarize gradient boosting by Method 7.3.

Learn a simple gradient boosting classifier


Data: Training data T = {x𝑖 , 𝑦 𝑖 }𝑖=1
𝑛

Result: 𝐵 weak classifiers


Í𝑛
1 Initialize (as a constant), 𝑐0 (x) ≡ arg min𝑐 𝑖=1 𝐿(𝑦 𝑖 , 𝑐).
2 for 𝑏 = 1, . . . , 𝐵 do
Compute the negative gradient of the loss function
h i
3

𝑔𝑖(𝑏) = − 𝜕𝐿 (𝜕𝑐𝑦𝑖 ,𝑐)


(𝑏−1)
, 𝑖 = 1, . . . , 𝑛.
𝑐=𝑐 (x𝑖 )
4 Train a base regression model b 𝑓 (𝑏) (x) to fit the gradient values,
Í𝑛  2
b
𝑓 (𝑏) = arg min 𝑓 𝑖=1 𝑓 (x𝑖 ) − 𝑔𝑖(𝑏) .
5 Update the boosted model, 𝑐 (𝑏) (x) = 𝑐 (𝑏−1) (x) + 𝛾 b 𝑓 (𝑏) (x).
6 end

Predict with the gradient boosting classifier


Data: 𝐵 weak classifiers and test input x★
Result: Prediction b (𝐵)
𝑦 boost (x★)
1 Output b (𝐵)
𝑦 boost (x) = sign{𝑐 (𝐵) (x)}.

Method 7.3: A simple gradient boosting algorithm

While presented for classification in Method 7.3, gradient boosting can also be used for regression with
minor modifications. As mentioned above, gradient boosting requires a certain amount of smoothness in
the loss function. A minimal requirement is that it is almost everywhere differentiable, so that it is possible
to compute the gradient of the loss function. However, some implementations of gradient boosting require
stronger conditions, such as second order differentiability. The logistic loss (see Figure 5.2) is in this
respect a “safe choice” which is infinitely differentiable and strongly convex, while still enjoying good
statistical properties. As a consequence, the logistic loss is one of the most commonly used loss functions
in practice.
We conclude this chapter by applying AdaBoost as well as gradient boosting to the music classification
problem from Example 2.1 in Figure 7.2.

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 119
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
7 Ensemble methods: Bagging and boosting

1 Beatles 1 Beatles
Kiss Kiss
Energy (scale 0-1)

Energy (scale 0-1)

Bob Dylan Bob Dylan

0.5 0.5

0 0
4.5 5 5.5 6 6.5 7 4.5 5 5.5 6 6.5 7

Length (ln s) Length (ln s)


(a) AdaBoost (b) A gradient boosting classifier

Figure 7.2: Two boosting algorithms applied to the music classification problem from Example 2.1.

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
120
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
8 Nonlinear input transformations and kernels

In this chapter we will continue to develop the idea from Chapter 3 of creating new input features by using
nonlinear transformations 𝝓(x). It turns out that by the so-called kernel trick, we can have infinitely many
such nonlinear transformations and we can extend our basic methods, such as linear regression and 𝑘-NN,
into more versatile and flexible ones. When we also change the loss function of linear regression, we
obtain support vector machines, another powerful off-the-shelf machine learning method. The concept of
kernels is important also to the next chapter (Chapter 9), where a Bayesian perspective of linear regression
and kernels leads us to the Gaussian process.

8.1 Creating features by nonlinear input transformations


The reason for the word ‘linear’ in the name ‘linear regression’ is that the output is modelled as a linear
combination of the inputs. We have, however, not made a clear definition of what an input is: if the speed
is an input in Example 2.2, then why could not also the kinetic energy—the square of the speed—be
considered as another input? The answer is yes, it can. We can in fact make use of arbitrary nonlinear
transformations of the “original” input variables in any model, including linear regression. If we, for
example, only have a one-dimensional input 𝑥, the vanilla linear regression model is

𝑦 = 𝜃 0 + 𝜃 1 𝑥 + 𝜀. (8.1)

Starting from this we can extend the model with 𝑥 2 , 𝑥 3 , . . . , 𝑥 𝑑−1 as inputs (𝑑 is a user-choice), and thus
obtain a linear regression model which is a polynomial in 𝑥,

𝑦 = 𝜃 0 + 𝜃 1 𝑥 + 𝜃 2 𝑥 2 + · · · + 𝜃 𝑑−1 𝑥 𝑑−1 + 𝜀 = 𝜽 T 𝝓(𝑥) + 𝜀. (8.2)

Since 𝑥 is known, we can directly compute 𝑥 2 , . . . , 𝑥 𝑑−1 . Note that this is still a linear regression model
since the parameters 𝜽 appear in a linear fashion with 𝝓(𝑥) = [1 𝑥 𝑥 2 . . . 𝑥 𝑑−1 ] T as a new input vector.
We refer to a transformation of x as a feature1 and the vector of transformed inputs 𝝓(x), a vector of
dimension 𝑑 × 1, as a feature vector. The parameters b 𝜽 are still learned the same way, but we

−xT −  −𝝓(x1 ) T − 
 1   
−x −  −𝝓(x2 ) T − 
 2   
T
replace the original X =  .  with the transformed 𝚽(X) =   (8.3)
 .   
. ..
   . 
−xT − −𝝓(x𝑛 ) T −
 𝑛   
| {z } | {z }
𝑛× 𝑝+1 𝑛×𝑑

in the cost function. For linear regression, this means that we can learn the parameters by doing the
substitution (8.3) directly in the normal equations (3.13).
The idea of nonlinear input transformations is not unique to linear regression, but any choice of
nonlinear transformation 𝝓(·) can be used with any supervised machine learning methods: the nonlinear
transformation is first applied to the input, like a pre-processing step, and the transformed input is
thereafter used when training, evaluating and using the model. We illustrated this for regression already
by Example 3.5 in Chapter 3, and for classification by Example 8.1.
1 The original input x is sometimes also referred to as a feature.

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 121
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
8 Nonlinear input transformations and kernels

Model Model
Data Data
output 𝑦

output 𝑦
input 𝑥 input 𝑥
(a) A linear regression model with a 2nd order polynomial, (b) A linear regression model with a 4th order polynomial,
trained with squared error loss. The line is no longer straight trained with squared error loss. Note that a 4th order polyno-
(as in Figure 3.1), but this is merely an artifact of the plot: in mial implies 5 unknown parameters, which roughly means that
a three-dimensional plot with each feature (here, 𝑥 and 𝑥 2 ) on we can expect the learned model to fit 5 data points exactly, a
a separate axis, it would still be an affine model. typical case of overfit.

Figure 8.1: A linear regression model with 2nd and 4th order polynomials in the input 𝑥, as shown in (8.2).

Time to reflect 8.1: Figure 8.1 shows an example of two linear regression models with transformed
(polynomial) inputs. When studying the figure one may ask how a linear regression model can
result in a curved line? Are linear regression models not restricted to linear (or affine) straight
lines?

Example 8.1: Nonlinear feature transformations for classification

Consider the data for a binary classification problem in the left panel below with x = [𝑥1 𝑥2 ] T and with a
blue and a red class. By only looking at the data, we can conclude that a linear classifier would not be able
to perform any good on this problem.

0.5
0.5

0 0
𝑥2

𝑥1 𝑥2

−0.5 0.5
−0.5
0
−2
−1
0
1 −0.5 𝑥2
−2 −1 0 1 2 2
𝑥1
𝑥1
However, by adding the nonlinear transformation 𝑥1 𝑥 2 as a feature, such that 𝝓(x) = [𝑥 1 𝑥2 𝑥 1 𝑥 2 ] T we get
the situation in the right panel. With this relatively simple introduction of an extra feature, the problem now
appears to be much better suited for a linear classifier, since the data can be separated relatively well by the
sketched plane. The conclusion here is that one strategy for increasing the capability of otherwise relatively
simple methods is by introducing nonlinear feature transformations.

Polynomials are only one out of (infinitely) many possible choices of features 𝝓(x). One should take
care when using polynomials higher than second order in practice, because of their behavior outside
the range where the data is observed (recall Figure 8.1b). Instead, there are several alternatives that
are often more useful in practice, such as Forier series, corresponding to (for scalar 𝑥) something like
𝝓(𝑥) = [1 sin(𝑥) cos(𝑥) sin(2𝑥) cos(2𝑥) · · · ] T , step functions, regression splines, etc. The use of
nonlinear input transformations 𝝓(x) arguably makes simple models more flexible and applicable to
real-world problems with nonlinear characteristics. In order to obtain a good performance, it is important
to chose 𝝓(x) such that enough flexibility is obtained, but overfit avoided. With a very careful choice

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
122
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
8.2 Kernel ridge regression

of 𝝓(x) good performance can be obtained for certain problems, but that choice is problem-specific and
perhaps more of an art form than a science. Let us instead explore the conceptual idea of letting the
number of features 𝑑 → ∞, and combine it with regularization. In a sense this will automate the choice of
features, and leads us to a family of powerful off-the-shelf machine learning tools called kernel methods.

8.2 Kernel ridge regression


A carefully engineered transformation 𝝓(𝑥) in linear regression, or any other method for that matter, may
indeed perform well for a specific machine learning problem. However, we would like 𝝓(𝑥) to contain
“all” transformations that could possibly be of interest for most problems, in order to obtain a general
off-the-shelf method. We will therefore explore the idea of choosing 𝑑 really big, much bigger than the
number of data points 𝑛, and eventually even let 𝑑 → ∞. The derivation and reasoning will be done using
𝐿 2 regularized linear regression, but we will later see that the idea is applicable also to other models.

Re-formulating linear regression


First of all we have to use some kind of regularization if we are going to increase 𝑑 in linear regression, in
order to avoid overfit when 𝑑  𝑛. For reasons that we will discuss later we chose to use 𝐿 2 regularization.
Let us repeat the equation for 𝐿 2 -regularized linear regression,
𝑛 
∑︁ 2
b 1
𝜽 = arg min 𝜽 T 𝝓(x𝑖 ) −𝑦 𝑖 + 𝜆k𝜽 k 22 = (𝚽(X) T 𝚽(X) + 𝑛𝜆I) −1 𝚽(X) T y, (8.4a)
𝜽 𝑛 𝑖=1
| {z }
b
𝑦 (x𝑖 )

𝑦 (x★) = b
b
T
𝜽 𝝓(x★) (8.4b)

We have not fixed the nonlinear transformations 𝝓(x) to anything specific yet, but we are preparing for
choosing 𝑑  𝑛 of these transformations. The downside of choosing 𝑑, the dimension of 𝝓(x), large is
that we also have to learn 𝑑 parameters when training. In linear regression, we usually first learn and store
the 𝑑-dimensional vector b 𝜽, and thereafter we use it for computing a prediction. To be able to choose 𝑑
really large, perhaps even 𝑑 → ∞, we have to re-formulate the model such that there are no computations
or storage demands that scales with 𝑑. The first step is to realize that the prediction b
𝑦 (x★) can be rewritten
as
 T
𝑦 (x★) = b
b
T
𝜽 𝝓(x★) = 𝚽(X) T 𝚽(X) + 𝑛𝜆I) −1 𝚽(X) T y 𝝓(x★)
|{z} |{z}
1×𝑑 𝑑×1
= yT 𝚽(X) (𝚽(X) T 𝚽(X) + 𝑛𝜆I) −1 𝝓(x★) (8.5)
|{z} |{z} | {z } |{z}
| {z }
1×𝑛 𝑛×𝑑 𝑑×𝑑 𝑑×1

𝑛×1

(where the underbraces provides the size of the corresponding vectors and matrices). This expression for
𝑦 (x★) suggests that instead of computing and storing the 𝑑-dimensional b
b 𝜽 once (independently of x★) we
could, for each test input x★, compute the 𝑛-dimensional vector 𝚽(X) (𝚽(X) T 𝚽(X) + 𝑛𝜆I) −1 𝝓(x★). By
doing so, we avoid storing a 𝑑-dimensional vector, but this would still require the inversion of a 𝑑 × 𝑑
matrix. Hence we have some more work to do before we have a practically useful method where we can
select 𝑑 really large.
The push-through matrix identity says that A(AT A + I) −1 = (AAT + I) −1 A holds for any matrix A. By
using it in (8.5), we can write b
𝑦 (x★) as

b
𝑦 (x★) = yT (𝚽(X)𝚽(X) T + 𝑛𝜆I) −1 𝚽(X)𝝓(x★) . (8.6)
|{z} | {z } | {z }
1×𝑛 𝑛×𝑛 𝑛×1

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 123
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
8 Nonlinear input transformations and kernels

It now seems as if we can compute b 𝑦 (x★) without having to deal with any 𝑑-dimensional vectors or
matrices. That requires, however, that the matrix multiplication 𝚽(X)𝚽(X) T and 𝚽(X)𝝓(x★) in (8.6) can
be computed. Let us therefore have a closer look at these,
 𝝓(x1 ) T 𝝓(x1 ) 𝝓(x1 ) T 𝝓(x2 ) . . . 𝝓(x1 ) T 𝝓(x𝑛 ) 

 𝝓(x2 ) T 𝝓(x1 ) 𝝓(x2 ) T 𝝓(x2 ) . . . 𝝓(x2 ) T 𝝓(x𝑛 ) 

𝚽(X)𝚽(X) T =   (8.7)
 
.. .. ..
 . . . 
𝝓(x𝑛 ) T 𝝓(x1 ) 𝝓(x𝑛 ) T 𝝓(x2 ) . . . 𝝓(x𝑛 ) 𝝓(x𝑛 ) 

T

 𝝓(x1 ) T 𝝓(x★) 
 
 𝝓(x2 ) T 𝝓(x★) 
 
and 𝚽(X)𝝓(x★) =  . (8.8)
 
..
 . 
𝝓(x𝑛 ) T 𝝓(x★) 
 
Remember that 𝝓(x) T 𝝓(x0) is an inner product between the two 𝑑-dimensional vectors 𝝓(x) and 𝝓(x0).
The key insight here is to note that the transformed inputs 𝝓(x) only enters into (8.6) as inner products
𝝓(x) T 𝝓(x0). That is, if we are able to compute the inner product 𝝓(x) T 𝝓(x0) directly, without first
explicitly computing 𝝓(x), we are safe.
As a concrete illustration, let us use polynomials since they are familiar and have no complicated
expressions. With 𝑝 = 1, meaning
2
√ x is a scalar 𝑥, and 𝝓(𝑥) is a third order polynomial (𝑑 = 4) with the
second and third term scaled by 3, we have
 1 
√ 0
   3𝑥 
3𝑥 𝑥 √ 02  = 1 + 3𝑥𝑥 0 + 3𝑥 2 𝑥 02 + 𝑥 3 𝑥 03 = (1 + 𝑥𝑥 0) 3 .
√ √ 2 3
0
𝝓(𝑥) 𝝓(𝑥 ) = 1
T
3𝑥 (8.9)
 3𝑥 
 𝑥 03 
 
It can generally be shown that if 𝝓(𝑥) is a (scaled) polynomial of order 𝑑 −1, then 𝝓(𝑥) T 𝝓(𝑥 0) = (1+𝑥𝑥 0) 𝑑−1 .
The point we want to make is that instead of first computing the two 𝑑-dimensional vectors 𝝓(𝑥) and 𝝓(𝑥 0)
and thereafter computing their inner product, we could just evaluate the expression (1 + 𝑥𝑥 0) 𝑑−1 directly
instead. With a second or third order polynomial this might not make much of a difference, but imagine a
situation where it is of interest to use 𝑑 in the hundreds or thousands.
The key insight we are getting at is that if we only make the choice of 𝝓(x) such that the inner product
𝝓(x) T 𝝓(x0) can be computed without first computing 𝝓(x), we can let 𝑑 be arbitrary big. Since it is
possible to define inner products also between infinite-dimensional vectors, there is nothing preventing us
from letting 𝑑 → ∞.
We have now derived a version of 𝐿 2 -regularized linear regression that we can use in practice also
with an unbounded number of features 𝑑 in 𝝓(x), if we only restrict ourselves to 𝝓(x) such that its inner
product 𝝓(x) T 𝝓(x0) has a closed-form expression (or can, at least, be computed n such a way that it does
not scale with 𝑑). This might appear to be of rather limited interest for a machine learning engineer, since
one still has to come up with a nonlinear transformation 𝝓(x), choose 𝑑 (possibly ∞) and thereafter make
a pen-and-paper derivation (like (8.9)) of 𝝓(x) T 𝝓(x0). To alleviate this, let us introduce the concept of a
kernel.

Introducing the kernel idea


A kernel 𝜅(x, x0) is (in this book) any function that takes two arguments x and x0 from the same space,
and returns a scalar. Throughout this book, we will limit ourselves to kernels that are real-valued and
symmetric, that is, 𝜅(x, x0) = 𝜅(x0, x) ∈ R for all x and x0. The inner product of two nonlinear input
transformations is an example of a kernel

𝜅(x, x0) = 𝝓(x) T 𝝓(x0). (8.10)


2 The

scaling 3 can be compensated with an inverse scaling of the second and third element in 𝜽.

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
124
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
8.2 Kernel ridge regression

Equation (8.9), for example, is such a kernel. The key insight at this stage is that since 𝝓(x) only appears
in the linear regression model (8.6) via inner products, we do not have to engineer a 𝑑-dimensional vector
𝝓(x) and derive its inner product. Instead, we can just chose a kernel 𝜅(x, x0) directly. This is known as
the kernel trick,

If x only enters the model as 𝝓(x) T 𝝓(x0), we can choose a kernel 𝜅(x, x0) instead of chosing 𝝓(x).

To be clear on what this means in practice, we rewrite (8.6) using the kernel (8.10),

b
𝑦 (x★) = yT (𝑲 (X, X) + 𝑛𝜆I) −1 𝑲 (X, x★) , (8.11a)
|{z} | {z } | {z }
1×𝑛 𝑛×𝑛 𝑛×1
 𝜅(x1 , x1 ) 𝜅(x1 , x2 ) . . . 𝜅(x1 , x𝑛 ) 

 𝜅(x2 , x1 ) 𝜅(x2 , x2 ) . . . 𝜅(x2 , x𝑛 ) 

where 𝑲 (X, X) =  , (8.11b)
 
.. .. ..
 . . . 
𝜅(x𝑛 , x1 ) 𝜅(x𝑛 , x2 ) . . . 𝜅(x𝑛 , x𝑛 ) 

 𝜅(x1 , x★) 
 
 𝜅(x2 , x★) 
 
𝑲 (X, x★) =  . (8.11c)
 
..
 . 
𝜅(x𝑛 , x★) 
 

These equations describes linear regression with 𝐿 2 regularization using a kernel 𝜅(x, x0). Since 𝐿 2
regularization also is called ridge regression, we refer to (8.11) as kernel ridge regression. The 𝑛 × 𝑛
matrix 𝑲 (X, X) is called the Gram matrix. We initially argued that linear regression with a possibly
infinite-dimensional nonlinear transformation vector 𝝓(x) could be an interesting model, and (8.11) is
(for certain choices of 𝝓(x) and 𝜅(x, x0)) equivalent to that. The design choice for the user is now to
select a kernel 𝜅(x, x0) instead of 𝝓(x). In practice, choosing 𝜅(x, x0) is a much less tedious problem than
choosing 𝝓(x).
As users we may in principle choose the kernel 𝜅(x, x0) arbitrarily, as long as we can compute (8.11a).
This requires the inverse of (𝑲 (X, X) + 𝑛𝜆I) to exist. We are therefore on the safe side if we restrict
ourselves to kernels for which the Gram matrix 𝑲 (X, X) (8.11b) is always positive semidefinite. Such
kernels are called3 positive semidefinite kernels.
The user of kernel ridge regression chooses a positive semidefinite kernel 𝜅(x, x0), and does not have
to select (and even less compute) 𝝓(x). However, a corresponding 𝝓(x) always exists for a positive
semidefinite kernel as we will discuss in Section 8.4.
There is a number of positive semidefinite kernels that a user can choose between. A commonly used
positive semidefinite kernel is the squared exponential kernel
!
kx − x0 k 22
𝜅(x, x0) = exp − , (8.12)
2ℓ 2

where the hyperparameter ℓ > 0 is a design choice left to the user, for example to be chosen using cross
validation. Another example of a positive semidefinite kernel mentioned earlier is the polynomial kernel
𝜅(x, x0) = (𝑐 + xT x0) 𝑑−1 . A special case thereof is the linear kernel 𝜅(x, x0) = xT x0. We will give more
examples later.
From the formulation (8.11), it may seem as if we have to compute the inverse of (𝑲 (X, X) + 𝑛𝜆I) every
time we want to make a prediction. That is, however, not necessary since it does not depend on the test

3 Confusingly enough, such kernels are called positive definite in some texts.

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 125
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
8 Nonlinear input transformations and kernels

input x★. It is therefore wise to introduce the 𝑛-dimensional vector

b 
 𝛼1 
b 
 𝛼2 
𝜶= .  (8.13a)
 .. 
 
b 
𝛼 𝑛 
which allows us to rewrite kernel ridge regression (8.11) as

𝑦 (x★) = b
b 𝜶T 𝑲 (X, x★), (8.13b)
b
𝜶 = y (𝑲 (X, X) + 𝑛𝜆I) .
T −1
(8.13c)

That is, instead of computing and storing a 𝑑-dimensional vector b


𝜽 as we intended initially, we now instead
compute and store an 𝑛-dimensional vector b 𝜶. However, we also need to store X, since we have to compute
𝑲 (X, x★) for every prediction.
We summarize kernel ridge regression in Method 8.1. Kernel ridge regression is in itself a practically
useful method, but we will next take a step back and discuss what we have derived, in order to prepare for
more kernel methods. We will also come back to kernel ridge regression in Chapter 9, where it is used as
a stepping stone in deriving the Gaussian process.
Example 8.2: Linear regression with kernels

This example is not yet complete.

Time to reflect 8.2: Verify that you retrieve 𝐿 2 -regularized linear regression, without any nonlinear
transformations, by using the linear kernel 𝜅(x, x0) = xT x in (8.11).

Learn kernel ridge regression


Data: Training data T = {x𝑖 , 𝑦 𝑖 }𝑖=1
𝑛

Result: Learned parameters b𝜶


1 Compute b
𝜶 as (8.13c)

Predict with kernel ridge regression


Data: Learned parameters b𝜶 and test input x★
Result: Prediction b
𝑦 (x★)
1 Compute b
𝑦 (x★) as (8.13b).

Method 8.1: Kernel ridge regression

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
126
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
8.3 Support vector regression

8.3 Support vector regression


Kernel ridge regression, as we just derived, is our fist kernel method for regression, which is a useful
method on its own. We will now develop kernel ridge regression into support vector regression by
replacing the loss function. First, however, we take a step back and make an interesting observation that
suggests the so-called representer theorem, which will be useful later in this chapter.

Preparing for more kernel methods: the representer theorem


The formulation (8.13) is not only practical for implementation, it is also important for theoretical
understanding. We can interpret (8.13) as a dual formulation of linear regression, where we have the dual
parameters 𝜶 instead of the primal formulation (8.4) with primal parameters 𝜽. Remember that 𝑑, the
(possibly infinite) number of primal parameters in 𝜽, is a user design choice, whereas 𝑛, the number of
dual parameters in 𝜶, is the number of data points.
By comparing (8.5) and (8.4b), we have that

𝑦 (x★) = b
b 𝜽 𝝓(x★) = b
T
𝜶T 𝚽(X)𝝓(x★) (8.14)
| {z }
𝑲 (X,x★ )

for all x★, which suggests that


b
𝜽 = 𝚽(X) T b
𝜶. (8.15)

This relationship between the primal parameters 𝜽 and the dual parameters 𝜶 is not specific for kernel
ridge regression, but (8.15) is the consequence of a general result called the representer theorem.
In essence the representer theorem states that if b 𝑦 (x) = 𝜽 T 𝝓(x), the equation (8.15) holds when 𝜽 is
2
learned using (almost) any loss function and 𝐿 regularization. A full treatment is beyond the scope of this
chapter, but we give a complete statement of the theorem in Section 8.A. An implication of the representer
theorem is that 𝐿 2 regularization is crucial in order to obtain kernel ridge regression (8.13), and we could
not have achieved it using, say, 𝐿 1 regularization instead. The representer theorem is a cornerstone for
most kernel methods, since it tells us that we can express some models in terms of dual parameters 𝜶 (of
finite length 𝑛) and a kernel 𝜅(x, x0) instead of the primal parameters 𝜽 (possibly of infinite length 𝑑) and
a nonlinear feature transformation 𝝓(x), just like we did with linear regression in (8.13).

Support vector regression


We will now look at support vector regression, another off-the-shelf kernel method for regression. From a
model perspective the only difference to kernel ridge regression is a change of loss function. This new loss
function has an interesting effect in that the dual parameter vector b 𝜶 in support vector regression becomes
sparse, meaning that several elements of b 𝜶 are exactly zero. The training data points corresponding to
the non-zero elements of b 𝜶 are referred to as support vectors, since b 𝑦 (x★) will depend only on those (in
contrast to kernel ridge regression (8.11), where all training data points are needed to compute b 𝑦 (x★)).
This makes support vector regression an example of a so-called support vector machine (SVM), a family
of methods with sparse dual parameter vectors.
The loss function we will use for support vector regression is the 𝜖-insensitive loss
(
0 if |b
𝑦 (x★) − 𝑦★ | < 𝜖,
𝑦 (x★), 𝑦★) =
𝐿(b , (8.16)
𝑦 (x★) − 𝑦★ | − 𝜖 otherwise,
|b

which was introduced in Section 5.1. The parameter 𝜖 is a user design choice. In its primal formulation,
support vector regression also makes use of the linear regression model

b
𝑦 (x★) = 𝜽 T 𝝓(x★), (8.17a)

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 127
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
8 Nonlinear input transformations and kernels

but instead of the least square cost function in (8.4), we have

1 ∑︁
𝑛
b
𝜽 = arg min max{0, | 𝜽 T 𝝓(x𝑖 ) −𝑦 𝑖 | − 𝜖 } + 𝜆k𝜽 k 22 . (8.17b)
𝜽 𝑛 𝑖=1
| {z }
b
𝑦 (x𝑖 )

As with kernel ridge regression, we reformulate the primal formulation (8.17) into a dual formulation
with 𝜶 instead of 𝜽 and use the kernel-trick. For the dual formulation, we cannot repeat the convenient
closed-form derivation along the lines of (8.4)-(8.13) since there is no closed-form solution for b
𝜽. Instead
we have to use optimization theory and introduce slack variables and construct the Lagrangian of (8.17b)).
We do not give the full derivation here, but as it turns out the dual formulation becomes

𝑦 (x★) = b
b 𝜶T 𝑲 (X, x★), (8.18a)

where b
𝜶 is the solution to the optimization problem
1 T
b
𝜶 = arg min 𝜶 𝑲 (X, X)𝜶 − 𝜶T y + 𝜖 k𝜶k 1 (8.18b)
𝜶 2
2
subject to |𝛼𝑖 | ≤ (8.18c)
𝑛𝜆
Note that (8.18a) is identical to the corresponding expression for kernel ridge regression (8.13b). This is a
consequence of the representer theorem. The only difference to kernel ridge regression is how the dual
parameters 𝜶 are learned; by numerically solving the optimization problem (8.18b) instead of using the
closed-form solution in (8.13c).
The 𝜖-insensitive loss function could be used for any regression model but it is particularly interesting
in combination in this kernel context since the dual parameter vector 𝜶 becomes sparse (meaning that only
some elements are non-zero). Remember that 𝜶 has one entry per training data point. This implies that the
prediction (8.18a) depends only on some of the training data points, namely those whose corresponding
𝛼𝑖 is non-zero. All training data is indeed used at training time (solving (8.18b)), but when making
predictions (using (8.18a)) only the data points, the support vectors, with non-zero 𝛼𝑖 contributes, which
significantly reduces the computational burden. The larger 𝜖 chosen by the user, the fewer support vectors
and the fewer computations needed in making predictions. It can also be argued that 𝜖 has a regularizing
effect, in the sense that the more/fewer support vectors that are used the more/less complicated model. We
illustrate support vector regression with Example 8.3, and summarize it in Method 8.2.
Example 8.3: support vector regression and kernel ridge regression

This example is not yet complete.

The 𝜖-insensitive loss indeed makes the dual parameter vector 𝜶 sparse. Note however that it does not
mean that the corresponding primal parameter vector 𝜽 is sparse (their relationship is given by (8.15)).
Also note that (8.18b) is a constrained optimization problem (there is a constraint given by (8.18c)), and
more theory than what we presented in Section 5.3 is needed to derive a good solver.

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
128
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
8.4 Kernel theory

Learn support vector regression


Data: Training data T = {x𝑖 , 𝑦 𝑖 }𝑖=1
𝑛

Result: Learned parameters b𝜶


1 Compute b
𝜶 by numerically solving (8.18b)-(8.18c)

Predict with kernel ridge regression


Data: Learned parameters b𝜶 and test input x★
Result: Prediction b
𝑦 (x★)
1 Compute b
𝑦 (x★) as (8.18a).

Method 8.2: Support vector regression

The nonlinear feature vector 𝚽(x) corresponding to some kernels, such as the squared exponential
kernel (8.12), does not have a constant/offset term. Therefore an additional 𝜃 0 is sometimes included in
Í
(8.18a) for support vector regression which adds the constraint 𝑖 𝛼𝑖 = 0 to the optimization problem
in (8.18b). The same addition could be made also to kernel ridge regression (8.13b), but that would break
the closed-form calculation of 𝜶 (8.13b).

Conclusions on using kernels for regression

With kernel ridge regression and support vector regression we have actually been dealing with the interplay
between three different concepts, each of them interesting on its own. To clarify this, we repeat them in an
ordered list:

1. We have considered the primal and dual formulation of a model. The primal formulation expresses
the model in terms of 𝜽 (fixed size 𝑑), whereas the dual formulation uses 𝜶 (one 𝛼𝑖 per data point
𝑖, hence the size is 𝑛 no matter what 𝑑 happens to be). Both formulations are mathematically
equivalent, but more or less useful in practice depending on whether 𝑑 > 𝑛 or 𝑛 > 𝑑.

2. We have introduced kernels 𝜅, which allows us to let 𝑑 → ∞. This idea is practically useful only
when using the dual formulation with 𝜶, since 𝑑 is the dimension of 𝜽.

3. We have used different loss functions. Kernel ridge regression makes use of squared error loss,
whereas support vector regression uses the 𝜖-insensitive loss. The 𝜖-insensitive loss is particularly
interesting in the dual formulation, since it gives sparse 𝜶. (We will later also use the hinge loss for
support vector classification in Section 8.5, which has a similar effect.)

We will now spend some additional effort on understanding the kernel concept in Section 8.4, and thereafter
introduce support vector classification (SVC) in Section 8.5.

8.4 Kernel theory

We have defined a kernel as being any function taking two arguments from the same space, and returning a
scalar. We have also suggested that we should (at least sometimes) restrict ourselves to positive semidefinite
kernels, and presented two practically useful algorithms—kernel ridge regression and support vector
regression. Before we introduce also support vector classification, we will discuss what a kernel means in
practice, and also give a flavor of the available theory behind it. To make the discussion more concrete, let
us start by introducing the kernel version of 𝑘-NN.

This material will be published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale or use in derivative works. 129
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
8 Nonlinear input transformations and kernels

Introducing kernel k-NN


As you know from Chapter 2, 𝑘-NN constructs the prediction for x★ by taking the average or a majority
vote among the 𝑘 nearest neighbors to x★. In its standard formulation, “nearest” is defined by the Euclidian
distance. Since the Euclidian distance always is positive, we can equivalently consider the squared
Euclidian distance instead, which can be written in terms of the linear kernel 𝜅(x, x0) = xT x0 as
kx − x0 k 22 = (x − x0) T (x − x0) = xT x + x0T x0 − 2xT x0 = 𝜅(x, x) + 𝜅(x0, x0) − 2𝜅(x, x0). (8.19)
To generalize the 𝑘-NN algorithm to use kernels, we allow the linear kernel to be replaced by any, say,
positive semidefinite kernel 𝜅(x, x0) in (8.19). Kernel 𝑘-NN thus works as standard 𝑘-NN, but determines
proximity between the data points using the right hand side of (8.19) (with a user-chosen kernel 𝜅(x, x0))
instead of the left hand side of (8.19).
For many (but not all) kernels, it holds that 𝜅(x, x) = 𝜅(x0, x0) for all x and x0, suggesting that the most
interesting part of the right hand side of (8.19) is the term −2𝜅(x, x0). Thus, if 𝜅(x, x0) takes a large value,
the two data points x and x0 are considered to be close, and vice versa if 𝜅(x, x0) takes a small value. That
is, the kernel determines how close any two data points are.
Furthermore, kernel 𝑘-NN allows us to use 𝑘-NN also for data where the Euclidian distance has no
natural meaning. As long as we have a kernel which acts on the input space, we can apply kernel 𝑘-NN
even if the Euclidian distance is not defined for that input type. We can thereby apply kernel 𝑘-NN to
input data that is neither numerical nor categorical, such as text snippets, as illustrated by Example 8.4.
Example 8.4: Kernel 𝑘-NN for interpreting words

This example illustrates how kernel 𝑘-NN can be applied to text data, where the Euclidian distance has no
meaning and standard 𝑘-NN therefore can not be applied. In this example, the input is single words (or
more technically: character strings) and we use the so-called Levenshtein distance to construct a kernel.
The Levenshtein distance is the number of single-character edits needed to transform one word (string) into
another. It takes two strings and returns a non-negative integer, which is zero only if the two strings are
 and we 0can
equivalent. It fulfills the properties of being a metric on the space of character strings,  thereby
𝑥,𝑥 )) 2
use it for example in the squared exponential to construct a kernel as 𝜅(𝑥, 𝑥 0) = exp − (LD(2ℓ2 (where
LD is the Levenshtein distance) with, say, ℓ = 5.
In this very small example, we consider a training dataset of 10 adjectives shown below (𝑥 𝑖 ), each labeled
(𝑦 𝑖 ) Positive or Negative, according to their meaning. We will now use kernel 𝑘-NN (with the kernel
defined above) to predict whether the word “horrendous” (𝑥★) is a positive or negative word. In the third
column below we have therefore computed the Levenshtein distance (LD) between each known word (𝑥𝑖 )
and “horrendous” (𝑥★). The rightmost column shows the value of the right hand side of (8.19), which is the
value that kernel 𝑘-NN uses to determine how close two data points are.
Word, 𝑥 𝑖 Meaning, 𝑦 𝑖 Levenshtein dist. from 𝑥𝑖 to 𝑥★ 𝜅(𝑥𝑖 , 𝑥𝑖 ) + 𝜅(𝑥★, 𝑥★) − 2𝜅(𝑥𝑖 , 𝑥★)
Awesome Positive 8 1.44
Excellent Positive 10 1.73
Spotless Positive 9 1.60
Terrific Positive 8 1.44
Tremendous Positive 4 0.55
Awful Negative 9 1.60
Dreadful Negative 6 1.03
Horrific Negative 6 1.03
Outrageous Negative 6 1.03
Terrible Negative 8 1.44
Inspecting the rightmost column, the closest word to horrendous is the positive word tremendous. Thus, if
we use 𝑘 = 1 the conclusion would be that horrendous is a positive word. However, the second, third and
fourth closest words are all negative (dreadful, horrific, outrageous), and with 𝑘 = 3 or 𝑘 = 4 the conclusion
thereby becomes that horrendous is a negative word (which also happens to be correct in this case).
The purpose of this example is to illustrate how a kernel allows a basic method such as 𝑘-NN to be used
for a problem where the input has a more intricate structure than just being numerical. For the particular
application of predicting word semantics, more elaborate machine learning methods exist.

Draft (September 25, 2020) of Supervised Machine Learning. Feedback and exercise problems: http://smlbook.org
130
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2020.
8.4 Kernel theory

The meaning of a kernel


From kernel 𝑘-NN we got (at least) two lessons about kernels that are generally applicable to all supervised
machine learning methods that use kernels:

• The kernel defines how close/similar any two data points are. If, say, 𝜅(x𝑖 , x★) > 𝜅(x 𝑗 , x★), then x★
is considered to be more similar to x𝑖 than x 𝑗 . Intuitively speaking, for most methods the prediction
b
𝑦 (x★) is most influenced by the training data points that are closest/most similar to x★. The kernel
thereby plays an important role in determining the individual influence of each training data point
when making a prediction.

• Even though we started by introducing kernels from the inner product 𝝓(x) T 𝝓(x0), no inner product
has to be defined for the space in which x itself lives! As we saw in Example 8.4, we can apply a
positive semidefinite kernel method also to text string without worrying about inner products, as
long as we have a kernel for that type of data.

In addition to this, the kernel also plays a somewhat more subtle role in methods that builds on
the representer theorem (such as kernel ridge regression, support vector regression and support vector
classification, but not kernel 𝑘-NN). Remember that the primal formulation of those methods, by virtue
of the representer theorem, contains the 𝐿 2 regularization term 𝜆k𝜽 k 22 . Even though we do not solve
the primal formulation explicitly when using kernels (we solve the dual instead), it is nevertheless an
equivalent representation, and we may ask what impact the regularization 𝜆k𝜽 k 22 has on the solution?
The 𝐿 2 regularization means that primal parameter values 𝜽 close to zero are favored. Besides the
regularization term, 𝜽 only appear in the expression 𝜽 T 𝝓(x). The solution b 𝜽 to the primal problem is
2
therefore an interplay between the feature vector 𝝓(x) and the 𝐿 regularization term. Consider two
different choices of feature vectors 𝝓1 (x) and 𝝓2 (x). If they both span the same space of functions, there
exists 𝜽 1 and 𝜽 2 such that 𝜽 T1 𝝓1 (x) = 𝜽 T2 𝝓2 (x) for all x, and it might appear irrelevant which feature vector
is being used. However, the 𝐿 2 regularization complicates the situation because it acts directly on 𝜽, and it
therefore matters whether 𝝓1 (x) or 𝝓2 (x) is being used. In the dual formulation we choose the kernel
𝜅(x, x0) instead of feature vector 𝝓(x), but since that choice implicitly corresponds to a feature vector the
effect is still present, and we may add one more bullet point about the meaning of a kernel:

• The choice of kernel corresponds to a choice of a regularization functional. That is, the kernel implies a preference for certain functions in the space of all functions that are spanned by the feature vector. For example, the squared exponential kernel enforces smooth solutions.

Using a kernel makes a method quite flexible, and one could perhaps expect it to suffer heavily from overfit. However, the regularizing role of the kernel explains why that rarely is the case in practice. All three bullet points above are central to understanding the usefulness and versatility of kernel methods. They also highlight the importance for the machine learning engineer of choosing the kernel wisely, and not only resorting to “default” choices.

Valid choices of kernels


We introduced kernels as a way to compactly work with nonlinear feature transformations as in (8.10). A direct consequence of this is that it is now sufficient to specify 𝜅(x, x′), and not 𝝓(x). A natural question to ask is now whether an arbitrary kernel 𝜅(x, x′) always corresponds to a feature transformation 𝝓(x), such that it can be written as the inner product

𝜅(x, x′) = 𝝓(x)ᵀ𝝓(x′)?   (8.20)

Before answering the question, we have to be aware that this question is of a merely theoretical nature. As long as we can use 𝜅(x, x′) when computing predictions it serves its purpose, no matter whether it admits the factorization (8.20) or not. The specific requirements on 𝜅(x, x′) are different for different methods; for example, the inverse (𝑲(X, X) + 𝑛𝜆I)⁻¹ is needed for kernel ridge regression but not for support


vector regression. Furthermore, whether a kernel admits the factorization (8.20) or not has no direct
correspondence to how well it performs in terms of, say, prediction accuracy. For any practical machine
learning problem the performance still has to be evaluated using cross validation or similarly.
That being said, we will now have a closer look at the important family of positive semidefinite kernels.
A kernel is said to be positive semidefinite if the matrix 𝑲 (X, X) (8.11b) is positive semidefinite (has no
negative eigenvalues) for any choice of X.
First, it holds that any kernel 𝜅(x, x′) that is defined as an inner product between feature vectors 𝝓(x), as in (8.20), is always positive semidefinite. It can, for example, be shown from the definition of positive semidefiniteness that vᵀ𝑲(X, X)v ≥ 0 holds for any vector v = [𝑣₁ . . . 𝑣ₙ]ᵀ. By using (8.11b), the definition of matrix multiplication (first equality below) and some properties of the inner product (second equality below) we can indeed conclude that

vᵀ𝑲(X, X)v = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ 𝑣ᵢ 𝝓(xᵢ)ᵀ𝝓(xⱼ) 𝑣ⱼ = ( Σᵢ₌₁ⁿ 𝑣ᵢ 𝝓(xᵢ) )ᵀ ( Σⱼ₌₁ⁿ 𝑣ⱼ 𝝓(xⱼ) ) ≥ 0.   (8.21)

The last inequality follows from the fact that the inner product of a vector with itself (its squared norm) is always non-negative.
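As a small numerical illustration of (8.21) (a sketch of our own, not part of the original text, with a hypothetical feature map), the following code builds a Gram matrix from explicit features and checks that its eigenvalues are non-negative:

```python
import numpy as np

# Hypothetical feature map: phi(x) = [x, x^2] for scalar inputs x.
def phi(x):
    return np.array([x, x**2])

X = np.array([-1.0, 0.5, 2.0, 3.0])          # n = 4 scalar training inputs
Phi = np.stack([phi(x) for x in X])          # n x 2 feature matrix
K = Phi @ Phi.T                              # Gram matrix, K_ij = phi(x_i)^T phi(x_j)

# All eigenvalues of K should be >= 0 (up to numerical round-off),
# confirming that an inner-product kernel is positive semidefinite.
print(np.linalg.eigvalsh(K))
```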
Less trivially the other direction also holds, that is, for any positive semidefinite kernel 𝜅(x, x0) there
always exists a feature vector 𝝓(x) such that 𝜅(x, x0) can be written as an inner product (8.20). Technically
it can be shown that for any positive semidefinite kernel 𝜅(x, x0) it is possible to construct a function space,
more specifically a Hilbert space, that is spanned by a feature vector 𝝓(x) for which (8.20) holds. The
dimensionality of the Hilbert space, and thereby also the dimension of 𝝓(x), can however be infinite.
Given a kernel 𝜅(x, x′) there are multiple ways to construct a Hilbert space spanned by 𝝓(x), and we will only mention some directions here. One alternative is to consider the so-called reproducing kernel map. The reproducing kernel map is obtained by keeping one argument of 𝜅(x, x′), say the latter, fixed, letting the functions 𝜅(·, x′) span the Hilbert space, and defining the inner product ⟨·, ·⟩ such that ⟨𝜅(·, x), 𝜅(·, x′)⟩ = 𝜅(x, x′). This inner product has a so-called reproducing property, and it is the main building block for the so-called reproducing kernel Hilbert space. Another alternative is to use the so-called Mercer kernel map, which constructs the Hilbert space using eigenfunctions of an integral operator which is related to the kernel.
A given Hilbert space uniquely defines a kernel, but for a given kernel there exist multiple Hilbert spaces which correspond to it. In practice this means that given a kernel 𝜅(x, x′) the corresponding feature vector 𝝓(x) is not unique, in fact not even its dimensionality is. As a simple example, consider the linear kernel 𝜅(x, x′) = xᵀx′, which can either be expressed as an inner product between 𝝓(x) = x (one-dimensional 𝝓(x)) or as an inner product between 𝝓(x) = [ (1/√2)x  (1/√2)x ]ᵀ (two-dimensional 𝝓(x)).

Examples of kernels
We will now give a list of some commonly used kernels, of which we have already introduced some. These examples are only for the case when x is a numeric variable. For other types of input variables (such as in Example 8.4), we have to resort to the more application-specific literature. We start with some positive semidefinite kernels, of which the linear kernel might be the simplest one,

𝜅(x, x′) = xᵀx′.   (8.22)

A generalization thereof, still positive semidefinite, is the polynomial kernel

𝜅(x, x′) = (𝑐 + xᵀx′)^(𝑑−1)   (8.23)

with hyperparameter 𝑐 ≥ 0 and polynomial order 𝑑 − 1 (an integer). The polynomial kernel corresponds to a finite-dimensional feature vector 𝝓(x) of monomials up to order 𝑑 − 1. The polynomial kernel does therefore not conceptually enable anything that could not be achieved by instead implementing the primal formulation and the finite-dimensional 𝝓(x) explicitly. The other positive semidefinite kernels below, on the other hand, all correspond to infinite-dimensional feature vectors 𝝓(x).


We have previously also mentioned the squared exponential kernel⁴,

𝜅(x, x′) = exp( −‖x − x′‖₂² / (2ℓ²) ),   (8.24)

with hyperparameter ℓ > 0 (usually called the lengthscale). As we saw in Examples 8.2 and 8.3, this kernel has more of a “local” nature compared to the polynomial kernel, since 𝜅(x, x′) → 0 as ‖x − x′‖ → ∞. This property makes sense in many problems, and is perhaps the reason why this might be the most commonly used kernel.
Somewhat related to the squared exponential we have the family of Matérn kernels

𝜅(x, x′) = ( 2^(1−𝜈) / Γ(𝜈) ) ( √(2𝜈) ‖x − x′‖₂ / ℓ )^𝜈 𝜁_𝜈( √(2𝜈) ‖x − x′‖₂ / ℓ ),   (8.25)

with hyperparameters ℓ > 0 and 𝜈 > 0. Here Γ is the Gamma function and 𝜁_𝜈 is a modified Bessel function. All Matérn kernels are positive semidefinite. Of particular interest are the cases 𝜈 = 1/2, 3/2 and 5/2, when (8.25) simplifies to

𝜈 = 1/2  ⇒  𝜅(x, x′) = exp( −‖x − x′‖₂ / ℓ ),   (8.26)
𝜈 = 3/2  ⇒  𝜅(x, x′) = ( 1 + √3 ‖x − x′‖₂ / ℓ ) exp( −√3 ‖x − x′‖₂ / ℓ ),   (8.27)
𝜈 = 5/2  ⇒  𝜅(x, x′) = ( 1 + √5 ‖x − x′‖₂ / ℓ + 5‖x − x′‖₂² / (3ℓ²) ) exp( −√5 ‖x − x′‖₂ / ℓ ).   (8.28)

The Matérn kernel with 𝜈 = 1/2 is also called the exponential kernel. It can furthermore be shown that the Matérn kernel (8.25) equals the squared exponential kernel (8.24) when 𝜈 → ∞.
As a final example of a positive semidefinite kernel, we mention the rational quadratic kernel

𝜅(x, x′) = ( 1 + ‖x − x′‖₂² / (2𝑎ℓ²) )^(−𝑎),   (8.29)

with hyperparameters ℓ > 0 and 𝑎 > 0. The squared exponential, Matérn and rational quadratic kernels are all examples of isotropic kernels, since they are functions only of ‖x − x′‖₂.
Going back to the discussion in connection to (8.20), positive semidefinite kernels are a subset of all kernels, for which we know that certain theoretical properties hold. In practice, however, a kernel is potentially useful as long as we can compute a prediction using it, no matter its theoretical properties. One (at least historically) popular kernel for SVMs which is not positive semidefinite is the sigmoid kernel

𝜅(x, x′) = tanh( 𝑎xᵀx′ + 𝑏 ),   (8.30)

where 𝑎 > 0 and 𝑏 < 0 are hyperparameters. The fact that it is not positive semidefinite can, for example, be seen by computing the eigenvalues of 𝑲(X, X) with 𝑎 = 1, 𝑏 = −1 and X = [1 2]ᵀ. Since this kernel is not positive semidefinite the inverse (𝑲(X, X) + 𝑛𝜆I)⁻¹ does not always exist, and it is therefore not suitable for kernel ridge regression. It can, however, be used in support vector regression and classification, where that inverse is not needed. For certain values of 𝑏 it can be shown to be a so-called conditionally positive semidefinite kernel (a weaker property than positive semidefiniteness).
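As a quick numerical check of the claim above (our own sketch, not code from the book), one can compute the eigenvalues of the 2 × 2 Gram matrix for 𝑎 = 1, 𝑏 = −1 and X = [1 2]ᵀ and observe that one of them is negative:

```python
import numpy as np

a, b = 1.0, -1.0
X = np.array([1.0, 2.0])          # two scalar training inputs

# Gram matrix of the sigmoid kernel kappa(x, x') = tanh(a*x*x' + b)
K = np.tanh(a * np.outer(X, X) + b)

# One eigenvalue turns out to be negative, so the kernel is not
# positive semidefinite for this choice of hyperparameters.
print(np.linalg.eigvalsh(K))      # approximately [-0.41, 1.41]
```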
It is possible to construct “new” kernels by modifying or combining existing ones. There is in particular a set of operations that preserve the positive semidefinite property: if 𝜅(x, x′) is a positive semidefinite kernel, then so is 𝑎𝜅(x, x′) for 𝑎 > 0 (scaling). Furthermore, if both 𝜅₁(x, x′) and 𝜅₂(x, x′) are positive semidefinite kernels, then so are 𝜅₁(x, x′) + 𝜅₂(x, x′) (addition) as well as 𝜅₁(x, x′)𝜅₂(x, x′) (multiplication).

⁴ Also known as the Gaussian radial basis function kernel.
Most kernels contain a few hyperparameters that are left to the user to choose. Much like cross validation can give valuable help in choosing between different kernels, cross validation can also help in choosing hyperparameters (and the regularization parameter 𝜆), for example with a grid search. A small code sketch of some of the kernels above follows below.
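To make the kernel examples above concrete, the following sketch (our own illustration, not part of the original text) implements the squared exponential kernel (8.24) and the Matérn-3/2 kernel (8.27), and combines them by scaling and addition as described above:

```python
import numpy as np

def squared_exponential(x1, x2, ell=1.0):
    # Squared exponential kernel (8.24): exp(-||x - x'||^2 / (2 ell^2))
    d2 = np.sum((np.atleast_1d(x1) - np.atleast_1d(x2)) ** 2)
    return np.exp(-d2 / (2.0 * ell ** 2))

def matern32(x1, x2, ell=1.0):
    # Matern kernel with nu = 3/2, see (8.27)
    d = np.sqrt(np.sum((np.atleast_1d(x1) - np.atleast_1d(x2)) ** 2))
    r = np.sqrt(3.0) * d / ell
    return (1.0 + r) * np.exp(-r)

def combined(x1, x2):
    # Scaling and addition preserve positive semidefiniteness
    return 2.0 * squared_exponential(x1, x2, ell=0.5) + matern32(x1, x2, ell=2.0)

def gram_matrix(kernel, X):
    # K(X, X) as in (8.11b): K_ij = kernel(x_i, x_j)
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.linspace(-2, 2, 5)
print(gram_matrix(combined, X))   # symmetric and positive semidefinite
```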

8.5 Support vector classification


We have so far spent most time deriving two kernel versions of linear regression: kernel ridge regression
and support vector regression. We will now focus on classification. Unfortunately the derivations are
now more technically intricate than for regression, and we have therefore placed the details in the chapter
appendix. However, the intuition carries over from regression, as well as the main ideas of the dual
formulation, the kernel trick and the change of loss function.
It is possible to derive a kernel version of logistic regression with 𝐿 2 regularization. The derivation can
be made by first replacing x with 𝝓(x), and then use the representer theorem to derive its dual formulation
and apply the kernel trick. Kernel logistic regression is a classification counterpart to kernel ridge
regression. Since kernel logistic regression is very rarely used in practice, we will instead
go straight to support vector classification. As the name suggests, support vector classification is the
classification counterpart of support vector regression. Both support vector regression and classification
are referred to as support vector machines (SVM) since they both have sparse dual parameter vectors.
We consider the binary classification problem 𝑦 ∈ {−1, 1}, and start with the margin formulation (see (5.8) in Chapter 5) of the logistic regression classifier

ŷ(x★) = sign( 𝜽ᵀ𝝓(x★) ).   (8.31)

If we now were to learn 𝜽 using the logistic loss (5.10), we would obtain logistic regression with a nonlinear feature transformation 𝝓(x), from which kernel logistic regression eventually would follow. Inspired by support vector regression, we will instead make use of the hinge loss function (5.13),

𝐿(x, 𝑦, 𝜽) = { 1 − 𝑦𝜽ᵀ𝝓(x) if 𝑦𝜽ᵀ𝝓(x) < 1;  0 otherwise } = max( 0, 1 − 𝑦𝜽ᵀ𝝓(x) ).   (8.32)

From Figure 5.2, it is not immediately clear what advantages the hinge loss has over the logistic loss. Analogously to the 𝜖-insensitive loss, the main advantage of the hinge loss comes when we look at the dual formulation using 𝜶 instead of the primal formulation with 𝜽. But before introducing a dual formulation we first have to spell out the primal one. Since the representer theorem is lurking behind all this, we have to use 𝐿² regularization, which together with (8.32) gives the primal formulation

𝜽̂ = arg min_𝜽 (1/𝑛) Σᵢ₌₁ⁿ max( 0, 1 − 𝑦ᵢ𝜽ᵀ𝝓(xᵢ) ) + 𝜆‖𝜽‖₂².   (8.33)

The primal formulation does not allow for the kernel trick, since the feature vector does not show up in the form 𝝓(x)ᵀ𝝓(x′). By using optimization theory and constructing the Lagrangian (the details are found in Appendix 8.B), we can arrive at the dual formulation⁵ of (8.33),

𝜶̂ = arg min_𝜶 (1/2) 𝜶ᵀ𝑲(X, X)𝜶 − 𝜶ᵀy   (8.34a)
subject to |𝛼ᵢ| ≤ 2/(𝑛𝜆) and 0 ≤ 𝛼ᵢ𝑦ᵢ,   (8.34b)
⁵ In other texts, it is common to let the dual variables instead be 𝛼ᵢ′ = 𝛼ᵢ𝑦ᵢ (which happen to be the Lagrange multipliers). That is mathematically equivalent, but our formulation better highlights the similarities to the other methods and the importance of the representer theorem.


with

ŷ(x★) = sign( 𝜶̂ᵀ𝑲(X, x★) )   (8.34c)

instead of (8.31). Note that we here also have made use of the kernel trick by replacing 𝝓(x)ᵀ𝝓(x′) with 𝜅(x, x′), which gives 𝑲(X, X) in (8.34a) and 𝑲(X, x★) in (8.34c).
Because of the representer theorem, the formulation (8.34c) should come as no surprise to us, since it simply corresponds to inserting (8.15) into (8.31). The representer theorem, however, only tells us that this dual formulation exists; (8.34a)-(8.34b) does not follow automatically from it but requires its own derivation.
Perhaps the most interesting property of the constrained optimization problem (8.34) is that its solution 𝜶̂ turns out to be sparse. This is exactly the same phenomenon as with support vector regression, and explains why (8.34) is an SVM. More specifically, (8.34) is called support vector classification. The strength of this method, like support vector regression, is that the model has the full flexibility of being a kernel method, and yet the prediction (8.34c) only explicitly depends on some of the training data points (the support vectors). It is, however, important to realize that all training data points are needed when solving (8.34a).
The support vector property is due to the fact that the loss function is exactly equal to zero when the margin (see Section 5.1) is > 1. In the dual formulation, the parameter 𝛼ᵢ becomes nonzero only if the margin for data point 𝑖 is < 1, which makes 𝜶̂ sparse.
As a consequence of using the hinge loss, as we discussed in Section 5.1, support vector classification does not provide probability estimates 𝑔(x) but only a “hard” classification ŷ(x★). The margin, in this case 𝑦𝜶̂ᵀ𝑲(X, x★), cannot be interpreted as a class probability estimate, because of the asymptotic minimizer of the hinge loss. As an alternative it is possible to instead use the squared hinge loss or the Huberized squared hinge loss, which allow for a probability interpretation of the margin. Since all these loss functions are exactly zero for margins > 1, they retain the support vector property with a sparse 𝜶̂. However, by using a different loss function than the hinge loss we will obtain a different optimization problem, since the derivation would no longer start from (8.33).
It is not necessary to make use of kernels in support vector classification. It is perfectly possible to use the linear kernel 𝜅(x, x′) = xᵀx′ (or any other kernel corresponding to a finite-dimensional 𝝓(x)) in (8.34). That would indeed limit the flexibility of the classifier; using the linear kernel would, for instance, limit the classifier to linear decision boundaries. The possible benefit, however, of not making full use of kernels is that it suffices to implement and solve the primal (and not the dual) formulation (8.33), since 𝜽 is of finite dimension. The support vector property would still be present, but much less visible since 𝜶 is not explicitly computed. If only using the primal formulation and no kernels, the representer theorem is not needed, and one could therefore also replace the 𝐿² regularization with any other regularization method.

Learn support vector classification

Data: Training data T = {xᵢ, 𝑦ᵢ}ᵢ₌₁ⁿ
Result: Learned parameters 𝜶̂

1 Compute 𝜶̂ by numerically solving (8.34a)-(8.34b).

Predict with support vector classification

Data: Learned parameters 𝜶̂ and test input x★
Result: Prediction ŷ(x★)

1 Compute ŷ(x★) as (8.34c).

Method 8.3: Support vector classification
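As an illustration of Method 8.3 (a minimal sketch under our own choice of kernel and optimizer, not code from the book), the dual problem (8.34a)-(8.34b) only involves box constraints on each 𝛼ᵢ, so it can be solved for instance with a bound-constrained quasi-Newton method:

```python
import numpy as np
from scipy.optimize import minimize

def sq_exp(x1, x2, ell=1.0):
    # Squared exponential kernel, an assumed choice for this sketch
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * ell ** 2))

def learn_svc(X, y, lam=0.1, ell=1.0):
    n = len(y)
    K = np.array([[sq_exp(X[i], X[j], ell) for j in range(n)] for i in range(n)])
    C = 2.0 / (n * lam)
    # |alpha_i| <= 2/(n*lambda) and 0 <= alpha_i * y_i translate into
    # per-coordinate bounds on alpha_i, depending on the sign of y_i.
    bounds = [(0.0, C) if yi > 0 else (-C, 0.0) for yi in y]
    obj = lambda a: 0.5 * a @ K @ a - a @ y          # (8.34a)
    grad = lambda a: K @ a - y
    res = minimize(obj, np.zeros(n), jac=grad, bounds=bounds, method="L-BFGS-B")
    return res.x

def predict_svc(alpha, X, x_star, ell=1.0):
    k_star = np.array([sq_exp(xi, x_star, ell) for xi in X])
    return np.sign(alpha @ k_star)                   # (8.34c)

# Tiny synthetic example
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = learn_svc(X, y)
print(predict_svc(alpha, X, np.array([1.5])))        # expected +1
```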


Example 8.5:

This example is not yet complete.

The support vector classifier is most often formulated as a solution to the binary classification problem.
The generalization to the multiclass problem is unfortunately not straightforward, since it requires a
multiclass generalization of the loss function, as discussed in Section 5.1. In practice it is common
to construct a multiclass classifier from multiple binary ones using either the one-versus-rest or the
one-versus-one strategy (see page 71).

8.6 Further reading


A comprehensive textbook on kernel methods is Schölkopf and Smola (2002), which includes a thorough discussion on SVMs and kernel theory, as well as several references to original work which we do not repeat here. A commonly used software package for solving SVM problems is described by Chang and Lin (2011). Kernel 𝑘-NN is described by Yu et al. (2002) (and the specific kernel based on the Levenshtein distance in Example 8.4 is found in J. Xu and X. Zhang (2004)), and kernel logistic regression by Zhu and Hastie (2005). Some more discussion related to the comment about the presence or absence of the offset term in the SVM formulation on page 129 is found in Poggio et al. (2001) and Steinwart et al. (2011).

8.A The representer theorem


We give a slightly simplified version of the representer theorem by Schölkopf, Herbrich, et al. (2001), adapted to our notation and terminology, without using the reproducing kernel Hilbert space formalism. It is stated in a regression-like setting, with ŷ(x) = 𝜽ᵀ𝝓(x), and we discuss below how it applies also to classification.
Theorem 8.1 (The representer theorem) Let ŷ(x) = 𝜽ᵀ𝝓(x) with a fixed nonlinear feature transformation 𝝓(x) and 𝜽 to be learned from training data {xᵢ, 𝑦ᵢ}ᵢ₌₁ⁿ. (The dimensionality of 𝜽 and 𝝓(x) does not have to be finite.) Furthermore, let 𝐿(𝑦, ŷ) be any arbitrary loss function and ℎ : [0, ∞] ↦→ ℝ a strictly monotonically increasing function. Then each minimizing 𝜽 of the regularized cost function

(1/𝑛) Σᵢ₌₁ⁿ 𝐿( 𝑦ᵢ, 𝜽ᵀ𝝓(xᵢ) ) + ℎ(‖𝜽‖₂²)   (8.35)

(where 𝜽ᵀ𝝓(xᵢ) = ŷ(xᵢ)) can be written as 𝜽 = 𝚽(X)ᵀ𝜶 (or, equivalently, ŷ(x★) = 𝜶ᵀ𝑲(X, x★)) for some 𝑛-dimensional vector 𝜶.
Proof: For a given X, any 𝜽 can be decomposed into one part 𝚽(X)ᵀ𝜶 (with some 𝜶) that lives in the row span of 𝚽(X) and one part v orthogonal to it, that is, 𝜽 = 𝚽(X)ᵀ𝜶 + v with v being orthogonal to all rows 𝝓(xᵢ) of 𝚽(X). For any xᵢ in {xᵢ, 𝑦ᵢ}ᵢ₌₁ⁿ it therefore holds that

ŷ(xᵢ) = 𝜽ᵀ𝝓(xᵢ) = (𝚽(X)ᵀ𝜶 + v)ᵀ𝝓(xᵢ) = 𝜶ᵀ𝚽(X)𝝓(xᵢ) + vᵀ𝝓(xᵢ) = 𝜶ᵀ𝚽(X)𝝓(xᵢ),   (8.36)

since vᵀ𝝓(xᵢ) = 0.

The first term in (8.35) is therefore independent of v. Concerning the second term in (8.35) we have

ℎ(‖𝜽‖₂²) = ℎ(‖𝚽(X)ᵀ𝜶 + v‖₂²) = ℎ(‖𝚽(X)ᵀ𝜶‖₂² + ‖v‖₂²) ≥ ℎ(‖𝚽(X)ᵀ𝜶‖₂²),   (8.37)

where the second equality follows from the fact that v is orthogonal to 𝚽(X)ᵀ𝜶, the inequality from ℎ being increasing, and equality in the last step holds only if v = 0. Equation (8.37) therefore implies that the minimum of (8.35) is found for v = 0, from which the theorem follows. □


The assumption that the model is linear in both parameters and features, 𝜽ᵀ𝝓(x), is indeed crucial for Theorem 8.1. That is not an issue when we consider models of linear regression type, but in order to apply it to, for example, logistic regression we have to find a linear formulation of that model. Not all models are possible to formulate as linear, but logistic regression can (instead of (3.29a)) be understood as a linear model predicting the so-called log-odds, 𝜽ᵀ𝝓(x) = ln( 𝑝(𝑦 = 1 | x) / 𝑝(𝑦 = −1 | x) ), and the representer theorem is therefore applicable to it. Furthermore, support vector classification is also a linear model if we consider the function 𝑐(x) = 𝜽ᵀ𝝓(x) rather than the predicted class ŷ(x★) = sign( 𝜽ᵀ𝝓(x★) ).

8.B Derivation of support vector classification

9 The Bayesian approach and Gaussian processes

So far, learning a parametric model has in all previous chapters amounted to somehow finding a parameter value 𝜽̂. With the Bayesian approach (also called the probabilistic approach), learning instead amounts to finding the distribution of the parameter values 𝜽 conditioned on the observed training data T = {xᵢ, 𝑦ᵢ}ᵢ₌₁ⁿ = {X, y}, that is, 𝑝(𝜽 | y) (to ease the notation we do not write out X in the conditioning). With the Bayesian approach also the prediction is a distribution 𝑝(𝑦★ | y), instead of a single value. Before we become more concrete and use equations, let us just say that on a theoretical (or even philosophical) level the Bayesian approach is rather different from what we have previously seen in this book. As it, however, opens up for a family of new, versatile and practically useful methods, the extra effort required to understand this somewhat different approach pays off well and provides another interesting perspective on supervised machine learning. As the Bayesian approach makes repeated use of probability distributions, it is also natural that this chapter will be heavier on the probability theory side compared to the rest of the book.
We will start this chapter by giving a general introduction to the Bayesian idea. We thereafter go back to basics and apply the Bayesian approach to linear regression, which we then extend to the non-parametric Gaussian process model.

9.1 The Bayesian idea

In this chapter we will use the term model with a very specific meaning. A model, in this chapter, refers to a set of equations that contains parameters 𝜽 and defines a probability distribution over the output y given some input X and parameters 𝜽, that is, 𝑝(y | 𝜽) (we omit X from the probability notation since it is not a random variable). Remember that 𝑝(y | 𝜽) was used also for the maximum likelihood perspective; one example of a model 𝑝(y | 𝜽) is linear regression (3.17)-(3.18).
In the Bayesian approach the parameters 𝜽 of any model are consistently treated as being random
variables. This implies that 𝜽 is always subject to a probability distribution. Consequently, learning (in the
Bayesian setting) amounts to computing the distribution of 𝜽 conditional on training data T = {X, y}.
To ease the notation we consistently omit X in the conditioning (mathematically we motivate this by the
fact that we only consider 𝑦, and not x, to be a random variable), and we therefore denote the distribution
of 𝜽 conditioned on training data as 𝑝(𝜽 | y). The computation of 𝑝(𝜽 | y) is done using the laws of
probabilities, and notably Bayes’ theorem,

𝑝(𝜽 | y) = 𝑝(y | 𝜽) 𝑝(𝜽) / 𝑝(y),   (9.1)

which is the reason why it is called the Bayesian approach. The left hand side of (9.1) is the sought after
distribution 𝑝(𝜽 | y). The right hand side of (9.1) contains some important elements: 𝑝(y | 𝜽) is the model,
and 𝑝(𝜽) is the distribution of 𝜽 before any observations are made (that is, not conditional on training
data). By nature 𝑝(𝜽) cannot be computed, but has to be postulated by the user (much like the model is
chosen by the user as well). Finally 𝑝(y) can, by the laws of probabilities, be rewritten as
𝑝(y) = ∫ 𝑝(y, 𝜽) d𝜽 = ∫ 𝑝(y | 𝜽) 𝑝(𝜽) d𝜽,   (9.2)

which is an integral that can be computed. In other words, learning a parametric model (in the Bayesian
fashion) amounts to conditioning 𝑝(𝜽) on y, that is, compute 𝑝(𝜽 | y). After learning, the next step with a


parametric model is to compute predictions. Again this is a matter of computing a distribution 𝑝(𝑦★ | y) (rather than a point prediction ŷ) for a test input x★, which connects to 𝑝(𝜽 | y) via the integral

𝑝(𝑦★ | y) = ∫ 𝑝(𝑦★ | 𝜽) 𝑝(𝜽 | y) d𝜽.   (9.3)

Here 𝑝(𝑦★ | 𝜽) encodes the model for a test data input x★ (again omitted in the notation). The model is thus denoted by 𝑝(y | 𝜽) (for the training data) as well as 𝑝(𝑦★ | 𝜽) (for the test data).
Sometimes 𝑝(y | 𝜽) is referred to, somewhat sloppily, as the ‘likelihood’. The other elements involved in
the Bayesian approach are traditionally given the names

• 𝑝(𝜽) prior,

• 𝑝(𝜽 | y) posterior,

• 𝑝(𝑦★ | y) posterior predictive,

These names are useful for communication, but it is important to remember that they are nothing but
different probability distributions. In addition 𝑝(y) is often called the marginal likelihood or evidence.

A representation of beliefs
The main feature of the Bayesian approach is its use of probability distributions. It is possible to interpret
those distributions as representing beliefs in the following sense: The prior 𝑝(𝜽) represents what beliefs
there are about 𝜽 a priori, that is, before any data has been observed. The model, encoded as 𝑝(y | 𝜽),
defines how data y relates to the parameter 𝜽. Using Bayes’ theorem (9.1) we update the belief about
𝜽 to the posterior 𝑝(𝜽 | y) which also takes the observed data y into account. In an everyday language,
these distributions could be said to represent uncertainty about 𝜽 before and after observing the data y
respectively.
An interesting and practically relevant consequence is that the Bayesian approach is less prone to overfit,
compared to using a maximum-likelihood based method. This is due to the fact that the prior (which has a
similar effect as regularization) is taken into account, as well as the fact that an entire distribution 𝑝(𝜽 | y)
is obtained instead of a single value b 𝜽. In particular in the regime of small datasets, the “uncertainty”
seen in the posterior 𝑝(𝜽 | y) reflects how much (or little) that can be said about 𝜽 from the (presumably)
limited information in y, under the assumed conditions.
The posterior 𝑝(𝜽 | y) is a combination of the prior belief 𝑝(𝜽) and the information about 𝜽 carried by
y through the model. Without a meaningful prior 𝑝(𝜽), the posterior 𝑝(𝜽 | y) is not meaningful either. In
some applications it can be hard to make a choice of 𝑝(𝜽) that is not influenced by the personal experience
of the machine learning engineer, which sometimes is emphasized by saying that 𝑝(𝜽), and thereby also
𝑝(𝜽 | y), represents a subjective belief. This notion is meant to reflect that the result is not independent of
the human that designed the solution. However, also the model (no matter if the Bayesian approach is used
or not) is often chosen based on the personal experience of the machine learning engineer, meaning that
most machine learning results are in that sense subjective anyway.
An interesting situation for the Bayesian approach is when data arrives in a sequential fashion, that is,
one data point after the other. Say that we have two sets of data, y1 and y2 . Starting with a prior 𝑝(𝜽) we
can condition on y1 by computing 𝑝(𝜽 | y1 ) using Bayes’ theorem (9.1). However, if we thereafter want to
condition on all data, y1 and y2 as 𝑝(𝜽 | y1 , y2 ), we do not have to start over again. We can instead replace
the prior 𝑝(𝜽) with 𝑝(𝜽 | y1 ) in Bayes’ theorem to compute 𝑝(𝜽 | y1 , y2 ). In a sense, the “old posterior”
becomes the “new prior” when data arrives sequentially.

The marginal likelihood as a model selection tool


When using the Bayesian approach there are often some hyperparameters in the model or the prior, say 𝜂, that need to be chosen. It is indeed possible to assume a prior 𝑝(𝜂) and compute the posterior also


for the hyperparameters, 𝑝(𝜂 | y). That would be the fully Bayesian solution, but sometimes that is too
computationally challenging.
A more pragmatic solution is to instead select a value 𝜂̂ using cross-validation. It is perfectly possible to use cross-validation, but the Bayesian approach also comes with an alternative for selecting the hyperparameters 𝜂, namely maximizing the marginal likelihood (9.2) as

𝜂̂ = arg max_𝜂 𝑝_𝜂(y),   (9.4)

where we have added an index 𝜂 to emphasize the fact that 𝑝(y) depends on 𝜂. This approach is sometimes
referred to as empirical Bayes. Choosing hyperparameter 𝜂 is, in a sense, a selection of a model (and/or
prior), and we can therefore understand the marginal likelihood as a tool for selecting a model. Maximizing
the marginal likelihood is, however, not equivalent to using cross-validation (the obtained hyperparameter
value might differ), and unlike cross-validation the marginal likelihood does not give an estimate of 𝐸 new .
In many situations, however, it is relatively easy to compute (and maximize) the marginal likelihood,
compared to employing a full cross-validation procedure.
In the previous section we argued that the Bayesian approach is less prone to overfit compared to the maximum likelihood approach. However, maximizing the marginal likelihood is, in a sense, a kind of maximum likelihood approach, and one may ask if there is a risk of overfitting when maximizing the marginal likelihood. To some extent that might be the case, but the key point is that handling one (or, at most, a few) hyperparameters 𝜂 with maximum (marginal) likelihood typically does not cause any severe overfit, much like there rarely are overfit issues when learning a straight line with plain linear regression. In other words, we can usually “afford” to learn one or a few (hyper)parameters by maximizing the (marginal) likelihood; overfit typically becomes a potential issue only when learning a larger number of (hyper)parameters.

9.2 Bayesian linear regression


As a first example of the Bayesian approach, we will apply it to linear regression. In itself Bayesian linear regression is perhaps not the most useful and versatile method, but just like ordinary linear regression it is a good starting point that illustrates the main concepts well. Just like ordinary linear regression it is possible to extend it in various directions, and it also opens up for the arguably more interesting and useful Gaussian process. Before we work out the details of the Bayesian approach applied to linear regression, we will however repeat some facts about the multivariate Gaussian distribution that will be useful.

The multivariate Gaussian distribution


A central mathematical object for Bayesian linear regression (and later also the Gaussian process) is
the multivariate Gaussian distribution. We assume that the reader already has some familiarity with
multivariate random variables, or equivalently random vectors, and repeat only the most important
properties of the multivariate Gaussian distribution here.
Let z denote a 𝑞-dimensional multivariate Gaussian random vector z = [𝑧₁ 𝑧₂ . . . 𝑧_𝑞]ᵀ. The multivariate Gaussian distribution is parametrized by a 𝑞-dimensional mean vector 𝝁 and a 𝑞 × 𝑞 covariance matrix 𝚺,

𝝁 = [𝜇₁  𝜇₂  ⋯  𝜇_𝑞]ᵀ,   𝚺 = [ 𝜎₁²  𝜎₁₂  ⋯  𝜎₁𝑞 ;  𝜎₂₁  𝜎₂²  ⋯  𝜎₂𝑞 ;  ⋮  ⋱  ⋮ ;  𝜎𝑞₁  𝜎𝑞₂  ⋯  𝜎𝑞² ].
The covariance matrix is a real-valued positive semidefinite matrix, that is, a symmetric matrix with
nonnegative eigenvalues. As a shorthand notation we write z ∼ N ( 𝝁, 𝚺) or 𝑝(z) = N (z; 𝝁, 𝚺). Note that
we use the same symbol N to denote the univariate as well as the multivariate Gaussian distribution. The
reason is that the former is only a special case of the latter.


 
The expected value of z is E[z] = 𝝁 and the variance of 𝑧 1 is var(𝑧 1 ) = E (𝑧 1 − E[𝑧 1 ]) 2 = 𝜎12 , and simi-
larly for 𝑧 2 , . . . , 𝑧 𝑞 . Moreover the covariance between 𝑧 1 and 𝑧 2 is cov(𝑧 1 , 𝑧2 ) = E[(𝑧 1 − E[𝑧 1 ]) (𝑧 2 − E[𝑧2 ])] =
𝜎12 = 𝜎21 , and similarly for 𝑧 3 , . . . , 𝑧 𝑞 . All these properties can be derived from the probability density
function of the multivariate Gaussian distribution, which is
 
− 𝑞2 − 12 1
N (z; 𝝁, 𝚺) = (2𝜋) det(𝚺) exp − (z − 𝝁) 𝚺 (z − 𝝁) .T −1
(9.5)
2

If all off-diagonal elements of 𝚺 are 0, the elements of z are just independent univariate Gaussian random
variables. However, if some off-diagonal element, say 𝜎𝑖 𝑗 (𝑖 ≠ 𝑗), is nonzero there is a correlation between
𝑧 𝑖 and 𝑧 𝑗 . Intuitively the correlation means that 𝑧 𝑖 carries information also about 𝑧 𝑗 , and vice versa. Some
important results on how the multivariate Gaussian distribution can be manipulated are summarized in
Appendix 9.A.

Linear regression with the Bayesian approach

We will now apply the Bayesian approach to the linear regression model. We will first spend some effort on mapping the elements of linear regression from Chapter 3 to the Bayesian terminology, and thereafter derive the solution. From Chapter 3 we have that the linear regression model is

𝑦 = 𝜽ᵀx + 𝜀,   𝜀 ∼ N(0, 𝜎²),   (9.6)

which equivalently can be written as

𝑝(𝑦 | 𝜽) = N(𝑦; 𝜽ᵀx, 𝜎²).   (9.7)

This is an expression for one output data point 𝑦, and for the entire vector of all training data outputs y we can write

𝑝(y | 𝜽) = ∏ᵢ₌₁ⁿ 𝑝(𝑦ᵢ | 𝜽) = ∏ᵢ₌₁ⁿ N(𝑦ᵢ; 𝜽ᵀxᵢ, 𝜎²) = N(y; X𝜽, 𝜎²I),   (9.8)

where we in the last step used the notation X from (3.5) and the fact that an 𝑛-dimensional Gaussian
random vector with a diagonal covariance matrix is equivalent to 𝑛 scalar Gaussian random variables.
In the Bayesian approach there is also a need for a prior 𝑝(𝜽) for the unknown parameters 𝜽. In Bayesian linear regression the prior distribution is most often chosen as a Gaussian with mean 𝝁₀ and covariance 𝚺₀,

𝑝(𝜽) = N(𝜽; 𝝁₀, 𝚺₀),   (9.9)

with 𝚺₀ = 𝜎₀²I. The reason for this choice is frankly that it simplifies the calculations (much like the squared error loss simplifies the computations for plain linear regression).
The next step is now to compute the posterior. It is perfectly possible to derive it using Bayes’ theorem, but since 𝑝(y | 𝜽) as well as 𝑝(𝜽) are multivariate Gaussian distributions, Corollary 9.1 in Appendix 9.A directly gives us that

𝑝(𝜽 | y) = N(𝜽; 𝝁ₙ, 𝚺ₙ),   (9.10a)
𝝁ₙ = 𝚺ₙ( (1/𝜎₀²) 𝝁₀ + (1/𝜎²) Xᵀy ),   (9.10b)
𝚺ₙ = ( (1/𝜎₀²) I + (1/𝜎²) XᵀX )⁻¹.   (9.10c)


From (9.10) we can also derive the posterior predictive by Corollary 9.2 in Appendix 9.A,

𝑝(𝑦★ | y) = N(𝑦★; 𝑚★, 𝑠★),   (9.11a)
𝑚★ = x★ᵀ𝝁ₙ,   (9.11b)
𝑠★ = 𝜎² + x★ᵀ𝚺ₙx★.   (9.11c)

We now have all pieces of Bayesian linear regression in place. The main difference to plain linear regression is that we compute a posterior distribution 𝑝(𝜽 | y) (instead of a single value 𝜽̂) and a posterior predictive distribution 𝑝(𝑦★ | y) instead of a prediction ŷ. We summarize it as Method 9.1 and illustrate it with Example 9.1.

Learn Bayesian linear regression

Data: Training data T = {xᵢ, 𝑦ᵢ}ᵢ₌₁ⁿ
Result: Posterior 𝑝(𝜽 | y) = N(𝜽; 𝝁ₙ, 𝚺ₙ)

1 Compute 𝝁ₙ and 𝚺ₙ as (9.10).

Predict with Bayesian linear regression

Data: Posterior 𝑝(𝜽 | y) = N(𝜽; 𝝁ₙ, 𝚺ₙ) and test input x★
Result: Posterior predictive 𝑝(𝑦★ | y) = N(𝑦★; 𝑚★, 𝑠★)

1 Compute 𝑚★ and 𝑠★ as (9.11).

Method 9.1: Bayesian linear regression
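A minimal numpy sketch of Method 9.1 (our own illustration, assuming a known noise variance 𝜎², prior mean 𝝁₀ = 0 and prior variance 𝜎₀²) could look as follows:

```python
import numpy as np

def learn_blr(X, y, sigma2, sigma02):
    # Posterior (9.10) with prior mean mu0 = 0 and Sigma0 = sigma02 * I
    d = X.shape[1]
    Sigma_n = np.linalg.inv(np.eye(d) / sigma02 + X.T @ X / sigma2)
    mu_n = Sigma_n @ (X.T @ y / sigma2)
    return mu_n, Sigma_n

def predict_blr(mu_n, Sigma_n, x_star, sigma2):
    # Posterior predictive (9.11)
    m_star = x_star @ mu_n
    s_star = sigma2 + x_star @ Sigma_n @ x_star
    return m_star, s_star

# Small synthetic example: y = 1 + 2x + noise
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=30)
X = np.column_stack([np.ones_like(x), x])    # include an intercept column
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=30)

mu_n, Sigma_n = learn_blr(X, y, sigma2=0.25, sigma02=1.0)
print(predict_blr(mu_n, Sigma_n, np.array([1.0, 0.5]), sigma2=0.25))
```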

We have so far assumed that the noise variance 𝜎² is fixed. Most often 𝜎² is also a parameter the user has to decide, which can be done by maximizing the marginal likelihood. Corollary 9.2 gives us the marginal likelihood

𝑝(y) = N(y; X𝝁₀, 𝜎²I + X𝚺₀Xᵀ).   (9.12)

It is also possible to choose the prior variance 𝜎₀² in 𝚺₀ = 𝜎₀²I by maximizing (9.12).
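As an illustration of how (9.12) can be used in practice (a sketch of our own, selecting 𝜎² and 𝜎₀² over a simple grid rather than via a fully Bayesian treatment), one can evaluate the log marginal likelihood for a set of candidate values and pick the maximizer:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(X, y, sigma2, sigma02):
    # log p(y) from (9.12) with prior mean mu0 = 0:
    # y ~ N(0, sigma2 * I + sigma02 * X X^T)
    n = X.shape[0]
    cov = sigma2 * np.eye(n) + sigma02 * (X @ X.T)
    return multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

# Synthetic data, reusing the straight-line example above
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=30)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=30)

# Grid search in the spirit of (9.4) over (sigma2, sigma0^2)
grid = [(s2, s02) for s2 in (0.1, 0.25, 1.0) for s02 in (0.5, 1.0, 4.0)]
best = max(grid, key=lambda s: log_marginal_likelihood(X, y, *s))
print("selected (sigma2, sigma0^2):", best)
```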
It is possible to use nonlinear input transformations, such as polynomials, in Bayesian linear regression,
just as for plain linear regression. We give an example of that in Example 9.2, where we return to the
running regression example of car stopping distances. We can, however, go one step further and also use
the kernel trick from Chapter 8. That will lead us to the Gaussian process, which is the topic of Section 9.3.


Example 9.1: Bayesian linear regression

To illustrate the inner workings of Bayesian linear regression, we consider a one-dimensional example with
𝑦 = 𝜃 1 + 𝜃 2 𝑥 + 𝜀. In the first row of the plot below, the left panel shows the prior 𝑝(𝜽) (blue surface; a
two-dimensional Gaussian distribution over 𝜃 1 and 𝜃 2 ) from which 10 samples are drawn (red dots). Each of
these samples corresponds to a straight line in the 𝑥–𝑦 plot to the right (red lines).

[Figure: four rows of panels. The left panels show, from top to bottom, the prior 𝑝(𝜽) and the posteriors 𝑝(𝜽 | 𝑦₁), 𝑝(𝜽 | {𝑦ᵢ}³ᵢ₌₁) and 𝑝(𝜽 | {𝑦ᵢ}³⁰ᵢ₌₁) over (𝜃₁, 𝜃₂), each with 10 samples (red dots). The right panels show the corresponding straight lines in the 𝑥–𝑦 plane (red lines), together with the observed data points.]
In the second row of the plot above one (𝑛 = 1) data point {𝑥1 , 𝑦 1 } is introduced. Its value is shown in the
right panel (black dot), and the posterior for 𝜽 in the left panel (a Gaussian; blue surface). In addition ten
samples are drawn from the posterior (red dots), each corresponding to a straight line in the right panel (red
lines). This is repeated also with 𝑛 = 3 and 𝑛 = 30 data points. We can in particular see how the posterior
contracts (“less uncertainty”) as more data arrives, in terms of the blue surface being more peaked as well as
the red lines being more concentrated.


Example 9.2: Car stopping distances

To be completed.

Connection to regularized linear regression


The main feature of the Bayesian approach is that it provides a full distribution 𝑝(𝜽 | y) over the parameters 𝜽, rather than a single point estimate 𝜽̂. There is, however, also an interesting connection between the Bayesian approach and regularization. We will make this concrete by considering the posterior 𝑝(𝜽 | y) from Bayesian linear regression and the point estimate 𝜽̂_𝐿² obtained from 𝐿²-regularized linear regression with squared error loss. Let us extract the so-called maximum a posteriori (MAP) point estimate 𝜽̂_MAP from the posterior 𝑝(𝜽 | y). The MAP estimate is the value of 𝜽 for which the posterior reaches its maximum,

𝜽̂_MAP = arg max_𝜽 𝑝(𝜽 | y) = arg max_𝜽 𝑝(y | 𝜽)𝑝(𝜽) = arg max_𝜽 [ ln 𝑝(y | 𝜽) + ln 𝑝(𝜽) ],   (9.13)

where the second equality follows from the fact that 𝑝(𝜽 | y) = 𝑝(y | 𝜽)𝑝(𝜽)/𝑝(y) and that 𝑝(y) does not depend on 𝜽. Remember that 𝐿²-regularized linear regression can be understood as using the cost function (3.48),

𝜽̂_𝐿² = arg max_𝜽 [ ln 𝑝(y | 𝜽) + 𝜆̃‖𝜽‖₂² ],   (9.14)

with some regularization parameter 𝜆̃. When comparing (9.14) to (9.13), we realize that if ln 𝑝(𝜽) ∝ ‖𝜽‖₂², the MAP estimate and the 𝐿²-regularized estimate of 𝜽 are identical for some value of 𝜆̃. With the prior 𝑝(𝜽) in (9.9), that is indeed the case, and the MAP estimate is in that case identical to 𝜽̂_𝐿².
This connection between MAP estimates and regularized maximum likelihood estimates holds as long as the regularization term is proportional to the logarithm of the prior. If, for example, we would instead choose a Laplace prior for 𝜽, the MAP estimate would be identical to 𝐿¹ regularization. In general there are many regularization methods that can be interpreted as implicitly choosing a certain prior. This connection to regularization gives another perspective as to why the Bayesian approach is less prone to overfit.
It is, however, important to note that the connection between the Bayesian approach and the use of regularization does not imply that the two approaches are equivalent. The main point with the Bayesian approach is still that a posterior distribution is computed for 𝜽, instead of just a point estimate 𝜽̂.


9.3 The Gaussian process

We introduced the Bayesian approach as the idea of considering unknown parameters 𝜽 as random variables, and consequently learning a posterior distribution 𝑝(𝜽 | y) instead of a single value 𝜽̂. The Bayesian idea does, however, not only apply to models with parameters, but also to non-parametric models. We will now introduce the Gaussian process, where instead of considering only the parameters 𝜽 as random variables we effectively consider an entire function 𝑓 to be a stochastic process, and compute the posterior 𝑝(𝑓 | y). The Gaussian process is an interesting and very useful Bayesian non-parametric model. In a nutshell, it is the Bayesian approach applied to kernel ridge regression (Chapter 8). We will present the Gaussian process as a method for handling regression problems, but it is possible to use it for classification problems as well (somewhat similar to how linear regression is modified into logistic regression).
In this section we will discuss the fundamentals of the Gaussian process, and thereafter, in Section 9.4, discuss more on how it can be used as a supervised machine learning method. We will first discuss what a Gaussian process is and thereafter see how we can construct a Gaussian process that connects closely to kernel ridge regression from Section 8.2.
Remember that for regression in general we have

𝑦★ = 𝑓 (x★) + 𝜀. (9.15)


As for Bayesian linear regression, we assume 𝜀 ∼ N(0, 𝜎²). Throughout the Gaussian process discussion we will, however, consider the posterior predictive for the (theoretical) “noise-free” 𝑓(x★) instead of the noise-corrupted test output 𝑦★. In mathematical terms, we will compute 𝑝(𝑓(x★) | y) instead of 𝑝(𝑦★ | y). In the previous chapters with only a point prediction ŷ(x★) there was no difference in predicting 𝑓(x★) or 𝑦★, since 𝜀 by assumption has mean 0. Now, when considering the posterior predictive, there is a difference since we are less certain (there is more variance) about 𝑦★ than about 𝑓(x★) because of the random noise 𝜀 added to it. With the Gaussian process, 𝑝(𝑓(x★) | y) will be a Gaussian distribution, and we will later discuss how it can be turned into 𝑝(𝑦★ | y) (another Gaussian distribution). Note, however, that we still assume that the training data y contains noise 𝜀.

What is a Gaussian process?

A Gaussian process is a stochastic process. A stochastic process, in turn, is a generalization of a random


variable. Whereas a random variable has a finite dimension (even though it may take infinitely many
different values), a stochastic process is not limited to that. We will now introduce the Gaussian process
by extending a multivariate Gaussian random variable into a Gaussian process.
Consider a 𝑞-dimensional random vector f = [𝑓₁ ⋯ 𝑓𝑞]ᵀ with a multivariate Gaussian distribution

𝑝(f) = N(f; 𝝁, 𝚺),   (9.16)

with mean vector 𝝁 and covariance matrix 𝚺. The reason why we denote the random variable by f (instead of, say, z) is that it will eventually represent function values (like 𝑓(x₁) and 𝑓(x₂)) in the regression problem. Let us partition f into two vectors f₁ and f₂ such that f = [f₁ᵀ f₂ᵀ]ᵀ, and partition 𝝁 and 𝚺 similarly, allowing us to write

𝑝( [f₁; f₂] ) = N( [f₁; f₂]; [𝝁₁; 𝝁₂], [𝚺₁₁ 𝚺₁₂; 𝚺₂₁ 𝚺₂₂] ).   (9.17)


(a) A two-dimensional Gaussian distribution for the random variables 𝑓₁ and 𝑓₂, with a blue surface plot for the density, and the marginal distribution for each component sketched using dashed blue lines along each axis. Note that the marginal distributions do not contain all information about the distribution of 𝑓₁ and 𝑓₂, since the covariance information is lacking in that representation.
(b) The conditional distribution of 𝑓₁ (green line), when 𝑓₂ is observed (orange dot). The conditional distribution of 𝑓₁ is given by (9.18), which (apart from a normalizing constant) in this graphical representation also is the green ‘slice’ of the joint distribution (blue surface). The marginals of the joint distribution from Figure 9.1a are kept for reference (blue dashed lines).

Figure 9.1: A two-dimensional multivariate Gaussian distribution for 𝑓₁ and 𝑓₂ in (a), and the conditional distribution for 𝑓₁, when a particular value of 𝑓₂ is observed, in (b).

(a) The marginal distributions for 𝑓₁ and 𝑓₂ from Figure 9.1a.
(b) The distribution for 𝑓₁ (green line) when 𝑓₂ is observed (orange dot), as in Figure 9.1b.

Figure 9.2: The marginals of the distributions in Figure 9.1, here plotted slightly differently. Note that this more compact plot comes with the cost of missing the information about the covariance between 𝑓₁ and 𝑓₂.

If some elements of f, let us say the ones in f₂, are observed, the conditional distribution for f₁ given the observation of f₂ is, by Theorem 9.3,

𝑝(f₁ | f₂) = N( f₁; 𝝁₁ + 𝚺₁₂𝚺₂₂⁻¹(f₂ − 𝝁₂), 𝚺₁₁ − 𝚺₁₂𝚺₂₂⁻¹𝚺₂₁ ).   (9.18)

The conditional distribution is nothing but another Gaussian distribution with closed-form expressions for
the mean and covariance. This is particularly useful.
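To make (9.18) concrete, the following small sketch (our own numerical illustration, not from the book) conditions a two-dimensional Gaussian on an observation of its second component, as in Figure 9.1b:

```python
import numpy as np

# A two-dimensional Gaussian for f = [f1, f2]
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

f2_observed = 1.5

# Conditional distribution of f1 given f2, following (9.18)
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (f2_observed - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]

print(mu_cond, var_cond)   # mean 1.2, variance 0.36
```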
Figure 9.1 shows a 2-dimensional example (f 1 is a scalar 𝑓1 , and f 2 is a scalar 𝑓2 ) where a multivariate
Gaussian distribution is conditioned on an observation of 𝑓2 . In Figure 9.2 we have plotted the marginal
distributions from Figure 9.1. We have made this plot to prepare for the generalization to the Gaussian
process. In a similar fashion to Figure 9.2 we can plot a 6-dimensional multivariate Gaussian distribution
by its marginal distributions in Figure 9.3. Bear in mind that to fully illustrate the joint distribution for
𝑓1 , . . . , 𝑓6 , a 6-dimensional surface plot would be needed, whereas Figure 9.3a only contains the marginal
distributions for each component. We may also condition the 6-dimensional distribution underlying
Figure 9.3a on an observation of, say, 𝑓4 . Once again, the conditional distribution is another Gaussian


(a) A 6-dimensional Gaussian distribution, plotted in the same way as Figure 9.2a, that is, only its marginals are illustrated.
(b) The conditional distribution of 𝑓₁, 𝑓₂, 𝑓₃, 𝑓₅ and 𝑓₆ when 𝑓₄ is observed (orange dot), illustrated by its marginals (green lines) similarly to Figure 9.2b.

Figure 9.3: A 6-dimensional Gaussian distribution, illustrated in the same fashion as Figure 9.2.

distribution, and the marginal distributions of the 5-dimensional random variable [ 𝑓1 , 𝑓2 , 𝑓3 , 𝑓5 , 𝑓6 ] T are
plotted in Figure 9.3b.
In Figure 9.2 and 9.3 we illustrated the marginal distributions for a finite-dimensional multivariate
Gaussian random variable. However, our aim is the Gaussian process, which is a stochastic process on a
continuous space.
The extension of the Gaussian distribution (defined on a finite set) to the Gaussian process (defined on
a continuous space) is achieved by replacing the index set {1, 2, 3, 4, 5, 6} in Figure 9.3 by a variable x
taking values on a continuous space, for example the real line. We then also have to replace the random
variables 𝑓1 , 𝑓2 , . . . , 𝑓6 with a random function (a stochastic process) 𝑓 which can be evaluated at any x as
𝑓 (x). Furthermore, in the Gaussian multivariate distribution 𝝁 is a vector with 𝑞 components, and Σ is a
𝑞 × 𝑞 matrix. In the Gaussian process we replace 𝝁 by a mean function 𝜇(x) into which we can insert any
x, and the covariance matrix 𝚺 is replaced by a covariance function 𝜅(x, x0) into which we can insert any
pair x and x0.
From this we can define the Gaussian process. If, for any finite set of points {x₁, . . . , xₙ}, it holds that

𝑝( [𝑓(x₁); . . . ; 𝑓(xₙ)] ) = N( [𝑓(x₁); . . . ; 𝑓(xₙ)]; [𝜇(x₁); . . . ; 𝜇(xₙ)], [𝜅(x₁, x₁) ⋯ 𝜅(x₁, xₙ); ⋮ ⋱ ⋮; 𝜅(xₙ, x₁) ⋯ 𝜅(xₙ, xₙ)] ),   (9.19)

then 𝑓 is a Gaussian process.
That is, with a Gaussian process 𝑓 and any choice of {x1 , . . . , x𝑛 }, the vector of function values
[ 𝑓 (x1 ), . . . , 𝑓 (x𝑛 )] has a multivariate Gaussian distribution, just like the one in Figure 9.3. Since
{x1 , . . . , x𝑛 } can be chosen arbitrarily from the continuous space on which it lives, the Gaussian process
defines a distribution for all points in that space. Indeed, for this definition to make sense, 𝜅(x, x0) has to
be such that a positive semidefinite covariance matrix is obtained for any choice of {x1 , . . . , x𝑛 }.
We will use the notation

𝑓 ∼ GP(𝜇, 𝜅)   (9.20)

to express that the function 𝑓 (x) is distributed according to a Gaussian process with mean function 𝜇(x)
and covariance function 𝜅(x, x0). If we want to illustrate a Gaussian process, which we do in Figure 9.4,
we can choose {x1 , . . . , x𝑛 } to correspond to the pixels on the screen or the printer dots on the paper,
and print the marginal distributions for each {x1 , . . . , x𝑛 }, so that it appears as a continuous line to the
eye (despite the fact that we actually can access the distribution only in a finite, however arbitrary, set of
points).
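As a complement to this description (a sketch of our own, using the squared exponential kernel (8.24)), the following code draws samples of a Gaussian process prior evaluated on a fine grid, which is exactly how plots like Figure 9.4a are typically produced:

```python
import numpy as np

def sq_exp_kernel(xa, xb, ell=0.5):
    # Squared exponential covariance function (8.24) for scalar inputs
    return np.exp(-(xa[:, None] - xb[None, :]) ** 2 / (2 * ell ** 2))

# Grid of test inputs playing the role of {x_1, ..., x_n}
x_grid = np.linspace(-2, 2, 200)

mu = np.zeros_like(x_grid)                       # mean function mu(x) = 0
K = sq_exp_kernel(x_grid, x_grid)
K += 1e-9 * np.eye(len(x_grid))                  # jitter for numerical stability

# Each sample is one "function" evaluated on the grid, as in (9.19)
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mu, K, size=5)
print(samples.shape)                             # (5, 200)
```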
It is no coincidence that we use the same symbol 𝜅 for covariance functions as we used for kernels in
Chapter 8. As we soon will discuss, applying the Bayesian approach to kernel ridge regression will give
us a Gaussian process in which the covariance function is the kernel.


(a) A Gaussian process defined on the real line parameterized by 𝑥, not conditioned on any observations. The intensity of the blue color is proportional to the (marginal) density, and the marginal distributions for some 𝑥★₁ and 𝑥★₂ are pictured in red. Akin to Figure 9.3, we only plot the marginal distribution for each x★, but the Gaussian process defines a full joint distribution for all points on the 𝑥-axis.
(b) The conditional Gaussian process given the observation of 𝑓(𝑥₁ᵈ) in the point 𝑥₁ᵈ corresponding to 𝑥★₂ in (a). The prior distribution from Figure (a) is dashed gray. Note how the conditional distribution adjusts to the observation, both in terms of mean (closer to the observation) and (marginal) variance (smaller in the proximity of the observation, but it remains unchanged in areas distant from it).

Figure 9.4: A Gaussian process. Figure (a) shows the prior distribution (shaded blue), whereas (b) shows the posterior distribution (shaded green) after conditioning on one observation (orange dot).

We can also condition the Gaussian process on some observations {𝑓(xᵢ), xᵢ}ᵢ₌₁ⁿ, the Gaussian process counterpart to Figures 9.1b, 9.2b and 9.3b. As usual we stack the observed inputs in X, and let 𝑓(X) denote the observed outputs (we assume for now that the observations are made without any noise). We use the notation 𝑲(X, X) and 𝑲(X, x★) as defined by (8.13b) to write the joint distribution between the observed values 𝑓(X) and the test value 𝑓(x★) as

𝑝( [𝑓(x★); 𝑓(X)] ) = N( [𝑓(x★); 𝑓(X)]; [𝜇(x★); 𝜇(X)], [𝜅(x★, x★)  𝑲(X, x★)ᵀ; 𝑲(X, x★)  𝑲(X, X)] ).   (9.21)

Now, as we have observed 𝑓(X), we use the by now well-known expressions for the Gaussian distribution to write the distribution for 𝑓(x★) conditional on the observations of 𝑓(X) as

𝑝(𝑓(x★) | 𝑓(X)) = N( 𝑓(x★); 𝜇(x★) + 𝑲(X, x★)ᵀ𝑲(X, X)⁻¹(𝑓(X) − 𝜇(X)), 𝜅(x★, x★) − 𝑲(X, x★)ᵀ𝑲(X, X)⁻¹𝑲(X, x★) ),   (9.22)

which is another Gaussian distribution for any test input x★. We illustrate this in Figure 9.4.
We have now introduced the somewhat abstract concept of a Gaussian process. In some subjects, such as signal processing, so-called white Gaussian processes are common. A white Gaussian process has a white covariance function,

𝜅(x, x′) = 1 if x = x′, and 0 otherwise,   (9.23)

which implies that 𝑓(x) is uncorrelated with 𝑓(x′) unless x = x′. White Gaussian processes are of less use in supervised machine learning, but we will instead have a look at how kernel ridge regression can be turned into a Gaussian process, where the mean function becomes zero and the covariance function becomes the kernel from Chapter 8.


[Figure: (regularized) linear regression becomes kernel ridge regression when using kernels, and Bayesian linear regression when taking the Bayesian approach; kernel ridge regression with the Bayesian approach, or Bayesian linear regression with kernels, gives the Gaussian process.]
Figure 9.5: A graphical summary on the connections between linear regression, Bayesian linear regression, kernel
ridge regression and the Gaussian process.

Extending kernel ridge regression into a Gaussian process


In order to obtain a Gaussian process construction that we can use as a practical tool for supervised
machine learning, we apply the kernel trick from Section 8.2 to Bayesian linear regression (9.11). The
connection between linear regression, Bayesian linear regression, kernel ridge regression and the Gaussian
process is summarized in Figure 9.5. This will essentially lead us back to (9.22), with the kernel being the
covariance function 𝜅(x, x0) and the mean function being 𝜇(x) = 0.
Let us now repeat the posterior predictive for Bayesian linear regression (9.11), however with three changes. The first change is that we assume that the prior mean and covariance for 𝜽 are 𝝁0 = 0 and 𝜎0 = 1, respectively. This assumption is not strictly needed for our purposes, but it simplifies the expressions. The
second change is that we, as in the previous section, consider the prediction of 𝑓 (x★) rather than 𝑦★
(see (9.15)), which simply amounts to not adding 𝜎 2 in (9.11c). Lastly, as in Section 8.1, we introduce
nonlinear feature transformations 𝝓(x) of the input variable x in the linear regression model. We therefore
replace X with 𝚽(X) in the notation. Altogether (9.11) becomes

$$p(f(\mathbf{x}_\star) \mid \mathbf{y}) = \mathcal{N}\left( f(\mathbf{x}_\star);\ m_\star,\ s_\star \right), \qquad (9.24a)$$
$$m_\star = \boldsymbol{\phi}(\mathbf{x}_\star)^{\mathsf{T}} \left( \sigma^2 \mathbf{I} + \boldsymbol{\Phi}(\mathbf{X})^{\mathsf{T}} \boldsymbol{\Phi}(\mathbf{X}) \right)^{-1} \boldsymbol{\Phi}(\mathbf{X})^{\mathsf{T}} \mathbf{y}, \qquad (9.24b)$$
$$s_\star = \boldsymbol{\phi}(\mathbf{x}_\star)^{\mathsf{T}} \left( \mathbf{I} + \tfrac{1}{\sigma^2} \boldsymbol{\Phi}(\mathbf{X})^{\mathsf{T}} \boldsymbol{\Phi}(\mathbf{X}) \right)^{-1} \boldsymbol{\phi}(\mathbf{x}_\star). \qquad (9.24c)$$

In a similar fashion to the derivation of kernel ridge regression, we use the push-through matrix identity A(AᵀA + I)⁻¹ = (AAᵀ + I)⁻¹A to re-write 𝑚★ with the aim of having 𝝓(x) only entering through inner products,
$$m_\star = \boldsymbol{\phi}(\mathbf{x}_\star)^{\mathsf{T}} \boldsymbol{\Phi}(\mathbf{X})^{\mathsf{T}} \left( \sigma^2 \mathbf{I} + \boldsymbol{\Phi}(\mathbf{X}) \boldsymbol{\Phi}(\mathbf{X})^{\mathsf{T}} \right)^{-1} \mathbf{y}. \qquad (9.25a)$$

To re-write 𝑠★ in a similar fashion, we have to use the matrix inversion lemma (I + UV)⁻¹ = I − U(I + VU)⁻¹V (which holds for any matrices U, V of compatible dimensions),
$$s_\star = \boldsymbol{\phi}(\mathbf{x}_\star)^{\mathsf{T}} \boldsymbol{\phi}(\mathbf{x}_\star) - \boldsymbol{\phi}(\mathbf{x}_\star)^{\mathsf{T}} \boldsymbol{\Phi}(\mathbf{X})^{\mathsf{T}} \left( \sigma^2 \mathbf{I} + \boldsymbol{\Phi}(\mathbf{X}) \boldsymbol{\Phi}(\mathbf{X})^{\mathsf{T}} \right)^{-1} \boldsymbol{\Phi}(\mathbf{X}) \boldsymbol{\phi}(\mathbf{x}_\star). \qquad (9.25b)$$

Analogously to the derivation of kernel ridge regression as in (8.11), we are now ready to apply the kernel
trick and replace all instances of 𝝓(x)ᵀ𝝓(x′) with a kernel 𝜅(x, x′). With the same notation as in (8.11),


we get
$$m_\star = \boldsymbol{K}(\mathbf{X},\mathbf{x}_\star)^{\mathsf{T}} \left( \sigma^2 \mathbf{I} + \boldsymbol{K}(\mathbf{X},\mathbf{X}) \right)^{-1} \mathbf{y}, \qquad (9.26a)$$
$$s_\star = \kappa(\mathbf{x}_\star,\mathbf{x}_\star) - \boldsymbol{K}(\mathbf{X},\mathbf{x}_\star)^{\mathsf{T}} \left( \sigma^2 \mathbf{I} + \boldsymbol{K}(\mathbf{X},\mathbf{X}) \right)^{-1} \boldsymbol{K}(\mathbf{X},\mathbf{x}_\star). \qquad (9.26b)$$

The posterior predictive that is defined by (9.24a) and (9.26) is the Gaussian process model again, identical
to (9.22) if 𝜇(x★) = 0 and 𝜎 2 = 0. The reason for 𝜇(x★) = 0 is that we made this derivation starting with
𝝁0 = 0. When we derived (9.22) we assumed that we observed 𝑓 (X) directly (rather than noisy observations y = 𝑓 (X) + 𝜀), which is the reason why 𝜎 2 = 0 in (9.22). The Gaussian process is hence a kernel version of Bayesian linear
regression, much like kernel ridge regression is a kernel version of (regularized) linear regression, as
illustrated in Figure 9.5.
The fact that the kernel plays the role of a covariance function in the Gaussian process gives us another
interpretation of the kernel in addition to the ones in Section 8.4, namely that the kernel 𝜅(x, x′) determines how much 𝑓 (x) and 𝑓 (x′) are assumed to correlate.
We have so far only derived the posterior predictive for 𝑓 (x★). Since we have 𝑦★ = 𝑓 (x★) + 𝜀, where 𝜀
is assumed to be independent with variance 𝜎 2 , we get
 
$$p(y_\star \mid \mathbf{y}) = \mathcal{N}\left( y_\star;\ m_\star,\ s_\star + \sigma^2 \right), \qquad (9.27)$$

if we are interested in the posterior predictive for 𝑦★.
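As an illustrative sketch of (9.26) and (9.27) (with hypothetical data, a squared exponential kernel and a zero mean function as working assumptions), the posterior predictive mean and variance can be computed in a few lines of NumPy:

```python
import numpy as np

def squared_exponential(X1, X2, ell=1.0):
    """kappa(x, x') = exp(-||x - x'||^2 / (2 ell^2)) for row-wise inputs."""
    sq_dist = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] \
              - 2 * X1 @ X2.T
    return np.exp(-0.5 * sq_dist / ell**2)

def gp_posterior_predictive(X, y, X_star, ell=1.0, sigma2=0.1):
    """m_star and s_star as in (9.26), assuming mu(x) = 0. Add sigma2 to
    s_star to obtain the posterior predictive for y_star as in (9.27)."""
    K = squared_exponential(X, X, ell)            # K(X, X)
    K_s = squared_exponential(X, X_star, ell)     # K(X, x_star)
    k_ss = np.ones(X_star.shape[0])               # kappa(x_star, x_star) = 1 for this kernel
    # Cholesky factorization for numerically stable solves with (sigma2 I + K)
    L = np.linalg.cholesky(K + sigma2 * np.eye(len(y)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (sigma2 I + K)^-1 y
    V = np.linalg.solve(L, K_s)
    m_star = K_s.T @ alpha                        # (9.26a)
    s_star = k_ss - np.sum(V**2, axis=0)          # (9.26b)
    return m_star, s_star

# Hypothetical toy data
X = np.array([[-2.0], [0.0], [1.5]])
y = np.array([-0.5, 0.3, 1.0])
X_star = np.linspace(-4.0, 4.0, 9).reshape(-1, 1)
m_star, s_star = gp_posterior_predictive(X, y, X_star)
y_star_var = s_star + 0.1                         # predictive variance for y_star, (9.27)
```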

Time to reflect 9.1: Verify that you retrieve Bayesian linear regression when using the linear kernel
(8.22) in the Gaussian process. Why is that?

Concerning the notation with the noise variance 𝜎 2 , it is common to not write out the addition of 𝜎 2 I
to the Gram matrix 𝑲 (X, X), but instead add a white noise kernel (9.23) multiplied with 𝜎 2 to the original
kernel.

A nonparametric distribution over functions


As a supervised machine learning tool we use the Gaussian process for making predictions, that is,
computing the posterior predictive 𝑝( 𝑓 (x★) | y) (or 𝑝(𝑦★ | y)). However, unlike most other methods, which only deliver a point prediction ŷ(x★), the posterior predictive is a distribution. Since we can compute the
posterior predictive for any x★, the Gaussian process actually defines a distribution over functions, as we
illustrate in Figure 9.6.
Much like we could derive a connection between Bayesian linear regression and 𝐿 2 -regularized linear regression, there is a connection between the Gaussian process and kernel ridge regression as well. If we only consider the mean 𝑚★ of the posterior predictive, we recover kernel ridge regression. To take full advantage of the Bayesian perspective, we have to consider also the posterior predictive variance 𝑠★. With most kernels, the predictive variance is smaller if there is a training data point nearby, and larger if the closest training data point is distant. The predictive variance hence gives a quantification of the “uncertainty” in the prediction. Altogether the Gaussian process is an interesting and practical tool for
regression problems, as we illustrate in Figure 9.7.


Figure 9.6: The Gaussian process defines a distribution over functions, which we can condition on training data and access in arbitrary points (such as 𝑥 and 𝑥′) in order to compute predictions. (a) The training data {𝑥𝑖 , 𝑦𝑖 }3𝑖=1 of a regression problem is given to us. (b) The underlying assumption when we do regression is that there exists some function 𝑓 which describes the data as 𝑦𝑖 = 𝑓 (𝑥𝑖 ) + 𝜀. It is unknown to us, but the purpose of regression (no matter which method is used) is, in one way or another, to determine 𝑓 . (c) The Gaussian process defines a distribution over 𝑓 . We can condition that distribution on data (that is, learn) and access it for any input, say 𝑥 and 𝑥′. That is, we make a prediction for 𝑥 and 𝑥′. The Gaussian process gives us a Gaussian distribution for 𝑓 (𝑥) and 𝑓 (𝑥′), illustrated by solid blue lines. That distribution is heavily influenced by the choice of kernel, which is a design choice in the Gaussian process.



Figure 9.7: The Gaussian process as a supervised machine learning method: we can learn (that is, compute (9.24a)
and (9.26)) the posterior predictive for 𝑓 (𝑥) (shaded gray; dark gray for one standard deviation, light gray for 2
standard deviations, and solid black line for the mean) learned from 0 (upper), 2 (middle), 30 (bottom) observations
(red dots). We see how the model adapts to training data, and note in particular that the variance shrinks in the
regions where observations are made, but remains larger in regions where no observations are made.



Figure 9.8: Figure 9.7 again, this time also showing samples from 𝑝( 𝑓 (x★) | y).

Drawing samples from a Gaussian process


When making a prediction of 𝑓 (x★) for a single input point x★, the posterior predictive 𝑝( 𝑓 (x★) | y)
captures all information the Gaussian process has about 𝑓 (x★). However, if we are interested not only
in predicting 𝑓 (x★), but also 𝑓 (x★ + 𝜹), the Gaussian process contains more information than what is
present in the two posterior predictive distributions 𝑝( 𝑓 (x★) | y) and 𝑝( 𝑓 (x★ + 𝜹) | y). This is because
the Gaussian process also contains information on the correlation between 𝑓 (x★) and 𝑓 (x★ + 𝜹), and the
pitfall is that 𝑝( 𝑓 (x★) | y) and 𝑝( 𝑓 (x★ + 𝜹) | y) are only the marginal distributions of the joint distribution
𝑝( 𝑓 (x★), 𝑓 (x★ + 𝜹) | y). As we saw in connection to Figures 9.1 and 9.2, the joint distribution contains more information than its marginals alone.
If we are interested in computing predictions for a larger set of input values it can be rather cumbersome
to grasp and visualize the resulting high-dimensional posterior predictive. An interesting alternative
can therefore be to visualize the Gaussian process posterior by samples from it. Technically this simply
amounts to drawing a sample from the posterior predictive distribution, which we illustrate in Figure 9.8.
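As a sketch of this (reusing the hypothetical squared_exponential kernel and data from the earlier snippet), drawing such samples over a grid of test inputs amounts to drawing from a multivariate Gaussian with the joint posterior mean and covariance:

```python
import numpy as np

def gp_posterior_joint(X, y, X_star, ell=1.0, sigma2=0.1):
    """Joint posterior mean and covariance over all test points in X_star;
    same pattern as (9.26), but keeping the full covariance matrix."""
    K = squared_exponential(X, X, ell)
    K_s = squared_exponential(X, X_star, ell)
    K_ss = squared_exponential(X_star, X_star, ell)
    C = np.linalg.inv(K + sigma2 * np.eye(len(y)))
    mean = K_s.T @ C @ y
    cov = K_ss - K_s.T @ C @ K_s
    return mean, cov

X_grid = np.linspace(-4.0, 9.0, 200).reshape(-1, 1)
mean, cov = gp_posterior_joint(X, y, X_grid)
rng = np.random.default_rng(0)
# A small jitter on the diagonal keeps the covariance numerically positive definite
samples = rng.multivariate_normal(mean, cov + 1e-9 * np.eye(len(mean)), size=5)
```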


9.4 Practical aspects of the Gaussian process


When using the Gaussian process as a method for supervised machine learning, there are a few important
design choices left to the user. Like the methods presented in Chapter 8, the Gaussian process is a kernel
method, and the choice of kernel is very important. Most kernels contain a few hyperparameters, which
also have to be chosen. That choice can be made by maximizing the marginal likelihood, which we will
discuss now.

Kernel choice
Since the Gaussian process can be understood as the Bayesian version of kernel ridge regression, the
Gaussian process also requires a positive semidefinite kernel. Any of the positive semidefinite kernels
presented in Section 8.4 can therefore be used also for Gaussian processes.
Among all kernel methods presented in this book, the Gaussian process could be the method where the exact choice of kernel has the biggest importance, since the Gaussian posterior predictive 𝑝( 𝑓 (x★) | y) has a mean and a variance, both of which are heavily affected by the choice of kernel. It is therefore important to make a good choice, and besides the discussion in Section 8.4 it can also be instructive to visually compare different kernel choices as in Figure 9.9. In the end it is a design choice left to the machine learning engineer. In the Bayesian perspective the kernel is an important part of the prior that implements crucial assumptions about the function 𝑓 . For example, as can be seen from Figure 9.9, the squared exponential and the Matérn kernel with 𝜈 = 1/2 correspond to drastically different assumptions about the smoothness of 𝑓 (x).

Hyperparameter tuning
Most kernels 𝜅(x, x′) contain a few hyperparameters 𝜂 that need to be chosen, such as ℓ and 𝛼, as well as the noise variance 𝜎 2 . Since any positive semidefinite kernel remains a positive semidefinite kernel also when multiplied with a positive constant, as 𝜍 2 𝜅(x, x′), it is common to include 𝜍 as yet another hyperparameter that needs to be chosen. Cross-validation can indeed be used for this purpose, but the Bayesian approach also comes with the option to maximize the marginal likelihood 𝑝(y), (9.4), as a method for selecting hyperparameters. To emphasize how 𝜂 enters into the marginal likelihood, we add the subscript 𝜂 to all terms that depend on it. From the construction of the Gaussian process, we have that 𝑝 𝜂 (y) = N (y; 0, 𝑲 𝜂 (X, X) + 𝜎 2 I), and consequently

$$\ln p_\eta(\mathbf{y}) = -\frac{1}{2} \mathbf{y}^{\mathsf{T}} \left( \boldsymbol{K}_\eta(\mathbf{X},\mathbf{X}) + \sigma^2 \mathbf{I} \right)^{-1} \mathbf{y} - \frac{1}{2} \ln \det\left( \boldsymbol{K}_\eta(\mathbf{X},\mathbf{X}) + \sigma^2 \mathbf{I} \right) - \frac{n}{2} \ln 2\pi. \qquad (9.28)$$
In other words, the hyperparameters of the Gaussian process kernel can be chosen by solving the
optimization problem of maximizing (9.28) with respect to 𝜂. If using this approach, solving the
optimization problem can be seen as a part of learning the Gaussian process, which is illustrated in
Figure 9.10.
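As an illustrative sketch (again with the hypothetical squared exponential kernel from the earlier snippets; the log-scale parameterization and the use of scipy.optimize.minimize are our own choices), the negative of (9.28) can be minimized numerically over 𝜂 = (ℓ, 𝜎²):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_eta, X, y):
    """-ln p_eta(y) from (9.28); eta = (ell, sigma2) is optimized on log scale
    so that both hyperparameters stay positive."""
    ell, sigma2 = np.exp(log_eta)
    K = squared_exponential(X, X, ell) + sigma2 * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * y @ alpha + 0.5 * log_det + 0.5 * len(y) * np.log(2 * np.pi)

# Several initializations reduce the risk of getting stuck in a poor local
# maximum of the marginal likelihood (cf. Figure 9.11)
results = [minimize(neg_log_marginal_likelihood, x0, args=(X, y))
           for x0 in (np.log([1.0, 0.1]), np.log([0.2, 1.0]))]
best = min(results, key=lambda r: r.fun)
ell_hat, sigma2_hat = np.exp(best.x)
```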
When selecting hyperparameters 𝜂 of a kernel, it is important to be aware that (9.28) (as a function of the hyperparameters) may have multiple local maxima, as we illustrate in Figure 9.11. It is therefore
important to carefully choose the initialization of the optimization procedure. The challenge with local
maxima is not unique to using the marginal likelihood approach, but can arise also when cross-validation
is used to choose hyperparameters.


[Figure 9.9 panels, from top to bottom: Squared exponential, ℓ = 0.5; Squared exponential, ℓ = 2; Matérn 𝜈 = 1/2, ℓ = 2; Matérn 𝜈 = 3/2, ℓ = 2; Matérn 𝜈 = 5/2, ℓ = 2; Rational quadratic, 𝛼 = 2, ℓ = 2.]

Figure 9.9: The Gaussian process applied with different kernels and hyperparameters to the same data, in order to
illustrate what assumptions are made by the different kernels. The data is marked with red dots, and the Gaussian
process is illustrated by its mean (black line), variance (gray areas; dark gray for one standard deviation and light
gray for two standard deviations) and samples (blue lines).


[Figure 9.10 panels: upper left, a contour plot of ln 𝑝(y) as a function of the hyperparameters ℓ and 𝜎²; the remaining panels show the Gaussian process for five different hyperparameter choices, with ln 𝑝(y) = −7.18, −8.52, −8.59, −7.38 and −49.36, respectively.]

Figure 9.10: To choose hyperparameters 𝜂 for the Gaussian process kernel, the marginal likelihood 𝑝 𝜂 (y) can be
maximized. For a given dataset of five points (black dots) and the squared exponential kernel, the landscape of the
logarithm of the marginal likelihood (as a function of the hyperparameters ℓ, lengthscale, and 𝜎 2 , noise variance) is
shown as a contour plot in the upper left panel. Each point in that plot corresponds to a certain selection of the
hyperparameters ℓ, 𝜎 2 . For five such points (yellow, purple, blue, green and red crosses) the corresponding Gaussian
process is shown in separate panels with the same color. Note that the red cross is located at the maximum of
𝑝 𝜂 (y). The red upper right plot therefore corresponds to a Gaussian process where the hyperparameters have been chosen by maximizing the marginal likelihood. It is clear that optimizing 𝑝 𝜂 (y) does not mean selecting hyperparameters such that the Gaussian process follows the data as closely as possible (as the blue one does).


[Figure 9.11 panels: top, a contour plot of ln 𝑝(y) as a function of ℓ and 𝜎²; below, the Gaussian processes corresponding to the two local maxima, with ln 𝑝(y) = −8.06 and ln 𝑝(y) = −8.13, respectively.]

Figure 9.11: The landscape of 𝑝(y) may have several local maxima. In this case there is one local maximum at the
blue cross, with relatively large noise variance and length scale. There is also another local maximum, which also is
the global one, at the red cross, with much less noise variance and a shorter lengthscale. There is also a third local
maximum in between (not shown). It is not uncommon that the different maximum points correspond to different
“interpretations” of the data, as in this case. As a machine learning engineer it is important to be aware that this can
happen; the red one does indeed optimize the marginal likelihood, but the blue one can also be practically useful.


9.5 Other Bayesian methods in machine learning


Besides introducing the Bayesian approach in general, this chapter contains the Bayesian treatment of
linear regression (Bayesian linear regression, Section 9.2) and kernel ridge regression (Gaussian processes,
Sections 9.3–9.4). The Bayesian approach is, however, applicable to all methods that somehow learn a
model from training data. The reason for the selection of methods in this chapter is frankly that Bayesian
linear regression and Gaussian processes are among the few Bayesian supervised machine learning
methods where the posterior and the posterior predictive are easy to compute.
Most often, the Bayesian approach requires numerical integration routines as well as numerical
methods for representing the posterior distribution (the posterior does not have to be a Gaussian or
any other standard distribution) when applied to various models. There are two major families of such
numerical methods called variational inference and Monte Carlo methods, respectively. The idea of
variational inference is to approximate probability distributions in such a way that the problem becomes
tractable enough, whereas Monte Carlo methods represent probability distributions using random samples
from them.
A particular direction which has gained a lot of attention in the machine learning research community is
the Bayesian approach applied to deep learning, often referred to as Bayesian deep learning. In short, Bayesian deep learning amounts to computing the posterior 𝑝(𝜽 | y) instead of only a point estimate 𝜽̂ of the parameters. In doing so, stochastic gradient descent has to be replaced with either some version of variational inference or
some Monte Carlo method. Due to the massive number of parameters often used in deep learning, that is a
computationally rather demanding problem.

9.6 Further reading


There are several textbooks devoted to the Bayesian approach to machine learning, for example Barber
(2012), Gelman et al. (2014) and Rogers and Girolami (2017), and to some parts also the comprehensive
books by Bishop (2006) and Murphy (2012). Gaussian processes in particular are covered in depth in the
textbook by Rasmussen and Williams (2006).


9.A The multivariate Gaussian distribution


This appendix contains some results on the multivariate Gaussian distribution that are essential for Bayesian
linear regression and the Gaussian process. Figure 9.12 summarizes how they relate to each other.

[Figure 9.12 relates 𝑝(z𝑎 , z𝑏 ), 𝑝(z𝑎 ), 𝑝(z𝑏 ), 𝑝(z𝑏 | z𝑎 ) and 𝑝(z𝑎 | z𝑏 ) through the theorems and corollaries below.]

Figure 9.12: A graphical summary of how Theorems 9.2–9.4 and Corollaries 9.1–9.2 relate to each other. In all results, z𝑎 and z𝑏 are dependent multivariate Gaussian random variables.

Theorem 9.2 (Marginalization) Partition the Gaussian random vector z ∼ N ( 𝝁, 𝚺) according to
$$\mathbf{z} = \begin{bmatrix} \mathbf{z}_a \\ \mathbf{z}_b \end{bmatrix}, \qquad \boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{bmatrix}. \qquad (9.29)$$
The marginal distribution 𝑝(z𝑎 ) is then given by
$$p(\mathbf{z}_a) = \mathcal{N}\left( \mathbf{z}_a;\ \boldsymbol{\mu}_a,\ \boldsymbol{\Sigma}_{aa} \right). \qquad (9.30)$$

Theorem 9.3 (Conditioning) Partition the Gaussian random vector z ∼ N ( 𝝁, 𝚺) according to
$$\mathbf{z} = \begin{bmatrix} \mathbf{z}_a \\ \mathbf{z}_b \end{bmatrix}, \qquad \boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{bmatrix}. \qquad (9.31)$$
The conditional distribution 𝑝(z𝑎 | z𝑏 ) is then given by
$$p(\mathbf{z}_a \mid \mathbf{z}_b) = \mathcal{N}\left( \mathbf{z}_a;\ \boldsymbol{\mu}_{a|b},\ \boldsymbol{\Sigma}_{a|b} \right), \qquad (9.32a)$$
where
$$\boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} (\mathbf{z}_b - \boldsymbol{\mu}_b), \qquad (9.32b)$$
$$\boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba}. \qquad (9.32c)$$
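As a small numeric illustration of Theorem 9.3 (with made-up numbers), the conditional mean and covariance in (9.32) can be computed directly:

```python
import numpy as np

# Hypothetical joint Gaussian over (z_a, z_b), both scalar
mu = np.array([1.0, 2.0])            # [mu_a, mu_b]
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])       # [[Sigma_aa, Sigma_ab], [Sigma_ba, Sigma_bb]]

z_b = 3.0                            # observed value of z_b
# (9.32b)-(9.32c): conditional mean and covariance of z_a given z_b
mu_a_given_b = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (z_b - mu[1])          # 1.4
Sigma_a_given_b = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]   # 0.68
```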

Theorem 9.4 (Affine transformation) Assume that z𝑎 as well as z𝑏 | z𝑎 are both Gaussian distributed
according to

$$p(\mathbf{z}_a) = \mathcal{N}\left( \mathbf{z}_a;\ \boldsymbol{\mu}_a,\ \boldsymbol{\Sigma}_a \right), \qquad (9.33a)$$
$$p(\mathbf{z}_b \mid \mathbf{z}_a) = \mathcal{N}\left( \mathbf{z}_b;\ \mathbf{A}\mathbf{z}_a + \mathbf{b},\ \boldsymbol{\Sigma}_{b|a} \right). \qquad (9.33b)$$


Then the joint distribution of z𝑎 and z𝑏 is
$$p(\mathbf{z}_a, \mathbf{z}_b) = \mathcal{N}\left( \begin{bmatrix} \mathbf{z}_a \\ \mathbf{z}_b \end{bmatrix};\ \begin{bmatrix} \boldsymbol{\mu}_a \\ \mathbf{A}\boldsymbol{\mu}_a + \mathbf{b} \end{bmatrix},\ \mathbf{R} \right) \qquad (9.34a)$$
with
$$\mathbf{R} = \begin{bmatrix} \boldsymbol{\Sigma}_a & \boldsymbol{\Sigma}_a \mathbf{A}^{\mathsf{T}} \\ \mathbf{A}\boldsymbol{\Sigma}_a & \boldsymbol{\Sigma}_{b|a} + \mathbf{A}\boldsymbol{\Sigma}_a \mathbf{A}^{\mathsf{T}} \end{bmatrix}. \qquad (9.34b)$$

Corollary 9.1 (Affine transformation – conditional) Assume that z𝑎 as well as z𝑏 | z𝑎 are both Gaussian
distributed according to

$$p(\mathbf{z}_a) = \mathcal{N}\left( \mathbf{z}_a;\ \boldsymbol{\mu}_a,\ \boldsymbol{\Sigma}_a \right), \qquad (9.35a)$$
$$p(\mathbf{z}_b \mid \mathbf{z}_a) = \mathcal{N}\left( \mathbf{z}_b;\ \mathbf{A}\mathbf{z}_a + \mathbf{b},\ \boldsymbol{\Sigma}_{b|a} \right). \qquad (9.35b)$$

Then the conditional distribution of z𝑎 given z𝑏 is
$$p(\mathbf{z}_a \mid \mathbf{z}_b) = \mathcal{N}\left( \mathbf{z}_a;\ \boldsymbol{\mu}_{a|b},\ \boldsymbol{\Sigma}_{a|b} \right), \qquad (9.36a)$$
with
$$\boldsymbol{\mu}_{a|b} = \boldsymbol{\Sigma}_{a|b} \left( \boldsymbol{\Sigma}_a^{-1} \boldsymbol{\mu}_a + \mathbf{A}^{\mathsf{T}} \boldsymbol{\Sigma}_{b|a}^{-1} (\mathbf{z}_b - \mathbf{b}) \right), \qquad (9.36b)$$
$$\boldsymbol{\Sigma}_{a|b} = \left( \boldsymbol{\Sigma}_a^{-1} + \mathbf{A}^{\mathsf{T}} \boldsymbol{\Sigma}_{b|a}^{-1} \mathbf{A} \right)^{-1}. \qquad (9.36c)$$

Corollary 9.2 (Affine transformation – Marginalization) Assume that z𝑎 as well as z𝑏 | z𝑎 are both
Gaussian distributed according to

$$p(\mathbf{z}_a) = \mathcal{N}\left( \mathbf{z}_a;\ \boldsymbol{\mu}_a,\ \boldsymbol{\Sigma}_a \right), \qquad (9.37a)$$
$$p(\mathbf{z}_b \mid \mathbf{z}_a) = \mathcal{N}\left( \mathbf{z}_b;\ \mathbf{A}\mathbf{z}_a + \mathbf{b},\ \boldsymbol{\Sigma}_{b|a} \right). \qquad (9.37b)$$

Then the marginal distribution of z𝑏 is given by
$$p(\mathbf{z}_b) = \mathcal{N}\left( \mathbf{z}_b;\ \boldsymbol{\mu}_b,\ \boldsymbol{\Sigma}_b \right), \qquad (9.38a)$$
where
$$\boldsymbol{\mu}_b = \mathbf{A}\boldsymbol{\mu}_a + \mathbf{b}, \qquad (9.38b)$$
$$\boldsymbol{\Sigma}_b = \boldsymbol{\Sigma}_{b|a} + \mathbf{A}\boldsymbol{\Sigma}_a \mathbf{A}^{\mathsf{T}}. \qquad (9.38c)$$

10 User aspects of machine learning

Dealing with supervised machine learning problems in practice is to a great extent an engineering discipline
where many practical issues have to be considered and where the scarce resource is in the end the available
work-hours. To use this resource efficiently, we need to have a well-structured procedure for how to
develop and improve the model. Multiple actions can potentially be taken. How do we know which action to take, and is it really worth the time to implement it? Is it, for example, worth spending an extra week collecting and labeling more training data, or should we do something else? These issues will be addressed in this chapter. Note that the layout of this chapter is thematic and does not necessarily represent the sequential order in which the different issues should be addressed.

10.1 Defining the machine learning problem


Solving a machine learning problem in practice is an iterative process. We train the model, evaluate the
model, and from there suggest an action for improvement and train the model again, and so on. To do this
efficiently, we need to be able to tell whether a new model is an improvement over the previous model
or not. One way to evaluate the model after each iteration would be to put it in production (for example
running a traffic-sign classifier in a self-driving car for a few hours). Besides the obvious safety issues, this
would be very time inefficient as well as inaccurate since it could still be hard to tell whether the proposed
change was an actual improvement or not.
A better strategy is to automate this evaluation procedure without the need to put the model in production
each time we want to evaluate its performance. We do this by putting aside a validation dataset and
test dataset and evaluate the performance using a single number evaluation metric. This will define the
machine learning problem that we are solving.

Training, validation and test data


In Chapter 4 we introduced the strategy of splitting the available data into training data, validation data
and test data.

• Training data is used for training the model.

• Hold-out validation data is used for comparing different model structures, choosing hyperparame-
ters of the model, feature selection, and so on.

• Test data is used to evaluate the performance of the final model.

If the amount of available data is small, it is possible to perform 𝑘-fold cross-validation instead of
putting aside hold-out validation data; the idea of how to use it in the iterative procedure is unchanged. To
get a final estimate of the performance, test data is always needed.


Figure 10.1: Splitting the data into training data, hold-out validation data and test data.


In the iterative procedure, the hold-out validation data (or 𝑘-fold cross-validation) is used to judge if
the new model is an improvement over the previous model. This step is called validation. During the
validation stage, we can also choose to train several new models and not just one new model. For example,
if we have a neural network model and are interested in the number of hidden units to use in a certain
layer, we can train several models, each with a different choice of hidden units. Afterwards, we pick the
one that performs best on the validation data. In the next iteration we can use validation to choose the
learning rate (by training multiple models using different learning rates), and so on.
Eventually, we will effectively have used the validation data to compare many models. Depending on
the size of the validation data, we might risk picking a model that does particularly well on the validation
data in comparison to completely unseen data. To detect this and to get a fair estimate of the actual
performance of a model, we use the test data, which has neither been used during training nor validation.
If the performance on the validation data is substantially better than the performance on the test data, we
have overfitted on the validation data. The easiest solution in that case would be to extend the size of the
validation data, if possible. Make sure not to use the test data too often. If we start making major decisions
based on our test data, the test data will become the validation data and we can no longer trust that the
performance on the test data is an objective indication of the actual performance of our model.
It is important that both the validation data and the test data always come from the same data distribution,
namely the data distribution that we are expecting to see when we put the model into production. If they
do not stem from the same distribution, we are validating and improving our model towards something
that is not represented in the test data and hence "aiming for the wrong target".
Eventually, we should also evaluate the performance of the model in production. Also after deploying the model, we should monitor the performance. If we realize that the model is doing systematically worse in production than on the test data, the test data and validation data have to be updated such that they actually represent what we expect to see at runtime. Preferably, also the training data should come from the same data distribution as the test and validation data, but this requirement can be relaxed if we have good reasons to do so; more about this in Section 10.2.
When splitting the data into training, validation and test, group leakage could be a potential problem.
Group leakage can occur if the data points are ordered into different groups. For example, in the medical
domain many X-ray images may belong to the very same patient. In this case, if we do a random split over
the images, different images belonging to the same patient will most likely end up in both the training and the validation sets. If the model learns the properties of a certain patient, the performance on the validation data might be better than what we would expect in production.
The solution to the group leakage problem is to do group partitioning. Instead of doing a random split
over the data points, we do the split over the groups the data points belong to. In the medical example
above, that would mean that we do a random split over the patients rather than over the medical images.
By this, we make sure that the images for a certain patient only end up in one of the datasets and the
leakage of unintentional information from the training data to the validation and test data is avoided.
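As a sketch of such a group-wise split (with hypothetical data and patient identifiers; scikit-learn's GroupShuffleSplit is one possible tool for this):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: several X-ray images per patient
X = np.random.randn(1000, 50)
y = np.random.randint(0, 2, size=1000)
patient_id = np.random.randint(0, 200, size=1000)   # group label per image

# Split over patients rather than over images, so that all images belonging
# to one patient end up on the same side of the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(X, y, groups=patient_id))
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
```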

Size of validation and test data


How much data should we set aside as hold-out validation data and test data, or should we even avoid
setting aside hold-out validation data and use 𝑘-fold cross-validation instead? This depends on how much
data we have available, which performance difference we plan to detect, and how many models we plan to
compare. For example, if we have a classification model with a 99.8% accuracy and want to know if a new
model is even better, a validation dataset of 100 data points will not be able to tell that difference. Also, if
we plan to compare many (say, hundreds or more) different hyperparameter values and model structures
using 100 validation data points, we will most likely overfit to that validation data.
If we have, say, 500 data points, a reasonable split would be 60%-20%-20% (that is 300-100-100 data
points) for training-validation-test. With such a small validation dataset, we cannot afford to compare
several hyperparameter values and model structures, or detect an improvement in accuracy of 0.1%. In
this situation, we are probably better off using 𝑘-fold cross-validation to decrease the risk of overfitting
to the validation data. Be aware, however, that the risk of overfitting to the data still exists even with


𝑘-fold cross-validation. We also still need to set aside test data if we want a final unbiased estimate of the
performance.
Many machine learning problems have substantially larger datasets. Assume we have a dataset
of 1 000 000 data points. In this scenario it would be enough to use a split of 98%-1%-1%, that is,
leaving 10 000 data points for validation and test, respectively, unless we really care about the very last
decimals in performance. Here, 𝑘-fold cross-validation is of less use, since having all 99%=98%+1%
(training+validation) available for training would make a small difference in comparison to using “only”
98%. Also the price for training 𝑘 models (instead of only one) with this amount of data would be much
higher.
Another advantage of having a hold-out validation dataset is that we can allow the training data to come
from a slightly different distribution than the validation and test dataset, for example if that would enable
us to find a much larger training dataset. We will discuss this more in Section 10.2.
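As an illustration (with a hypothetical dataset), such a 60%-20%-20% split can for instance be obtained by applying scikit-learn's train_test_split twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset with 500 data points
X = np.random.randn(500, 10)
y = np.random.randint(0, 2, size=500)

# First set aside 20% as test data, then 25% of the remainder (20% of the
# total) as hold-out validation data, giving a 60-20-20 split
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)
```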

Single number evaluation metric


In Section 4.5 we introduced additional metrics besides the misclassification rate, such as precision, recall
and F1-score, for evaluating binary classifiers. There is no unique answer to which metric is the most
appropriate. What metric to pick is rather a part of the problem definition. To improve the model quickly
and in a more automated fashion, it is advisable to agree on a single number evaluation metric, especially
if a larger team of engineers are working on the problem.
The single number evaluation metric together with the validation data defines the supervised machine
learning problem. Having an efficient procedure in place where we can evaluate the model on the hold-out
validation data (or by 𝑘-fold cross-validation) using the metric, allows us to speed up the iterations since
we quickly can see if a proposed change to the model improves the performance or not. This is important
in order to manage an efficient workflow of trying out and accepting or rejecting new models.
With that said, besides the single number evaluation metric we might want to monitor other metrics as well to reveal the trade-offs being made using our metric. For example, we might develop the model with different end users in mind who care more or less about different metrics, but for practical reasons we train only one model to accommodate them all. If we, based on these trade-offs, realize that the single number evaluation metric we have chosen does not favor the properties we want a good model to have, we can always change that metric.

Baseline and achievable performance level


Before working with the machine learning problem it is a good idea to establish some reference points for
the performance level of the model. A baseline is a very simple model that serves as a lower expected
level of the performance. A baseline can for example be to randomly pick an output value 𝑦 𝑖 from the
training data and use that as the prediction. Another baseline for regression problems is to take the mean
of all output values in the training data and use that as the prediction. A corresponding baseline for a
classification problem is to pick the most common class among class labels in the training data and use
that for the prediction. For example if we have a binary classification problem with 70% of the training
data belonging to one class and 30% belonging to the other class and we have chosen the accuracy as
our performance metric, the accuracy for that baseline is 70%. The baseline is a lower threshold on the
performance. We know that the model has to be better than this baseline.
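As a sketch (with hypothetical data), such naive baselines can be computed in a few lines, for instance using scikit-learn's DummyClassifier and DummyRegressor:

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Hypothetical binary classification data with a 70%/30% class split
X_train = np.random.randn(100, 5)
y_train = np.array([0] * 70 + [1] * 30)

# Most-common-class baseline: always predicts class 0, so its accuracy is 70%
# on data with these class proportions
clf_baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(clf_baseline.score(X_train, y_train))     # 0.70

# Mean-of-the-training-outputs baseline for a regression problem
y_train_reg = np.random.randn(100)
reg_baseline = DummyRegressor(strategy="mean").fit(X_train, y_train_reg)
```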
Hopefully, the model will perform well beyond the naive baselines stated above. In addition, it is also good to define an achievable performance which is on par with the maximum performance we can expect from the model. For regression problems this performance is in theory limited by the irreducible error presented in Chapter 4, and for classification problems it is analogous to the so-called Bayes error rate. In practice, we do not have access to these theoretical bounds, but there are a few strategies to estimate them.
For supervised problems which humans can do very well on (and our aim is to automate this process
with the machine learning model), the human-level performance can serve as the achievable performance.
Consider for example an image classification problem. If humans can identify the correct class with an


accuracy of 99%, that serves as a reference point for what you can expect to achieve. The achievable
performance can also be based on what other state-of-the-art models on the same or a similar problem
achieve. Comparing the performance with the achievable performance gives us a reference point to assess
the quality of the model. Also, if the model is close to the achievable performance we might not be able to
improve our model further.

10.2 Improving a machine learning model


As already mentioned, solving a machine learning problem is an iterative procedure where we train,
evaluate, and suggest actions for improvement, for instance by changing some hyperparameters or trying another method. How do we start this iterative procedure?

Try simple things first


A good strategy is to try simple things first. This could for example be to start with a basic method like 𝑘-NN or linear/logistic regression on the problem. Also, do not add extra add-ons, like regularization, to your first model. This will come later; at this stage we want to avoid introducing any bugs right from the start. A simple thing can also be to start with an already existing solution to the same or a similar problem which you trust. For example, to build an image classifier it can be simpler to start with an existing pre-trained neural network and then fine-tune it, rather than to handcraft features from the images to be used with 𝑘-NN. Starting simple may also include not considering all training data when training your first model, but only a subset of it. Also, avoid doing more data pre-processing than necessary for your first model. The purpose is, again, to minimize the risk of introducing bugs early in the process. This first step does not only involve writing code for learning your first simple model, but also code for evaluating it on your validation data using your single number performance metric.
Trying a simple thing first allows us to start early with the iterative procedure of finding a good
model. This is important, since it might reveal important aspects of the problem formulation that we
need to re-think before it makes sense to proceed with more complicated models. Also, if we start with a
low-complexity model, it also reduces the risk of ending up with a too complicated model when a much
simpler model would be just as good (or even better).

Debugging your model


Before proceeding we should make sure that the code we have is producing what we expect it to do. The first obvious check is to make sure that your code runs without any errors or warnings. If it does not, use a debugging tool to spot the error. These are the easy bugs to spot.
The trickier (and most common) bugs are those where the code is syntactically correct and runs without warnings, but is still not doing what we expect it to do. The procedure for how to debug depends on the model you have picked, but there are a few general tips:
• Compare with baselines. Compare your model's performance on validation data with the baselines you have stated in Section 10.1. If we do not manage to beat these baselines, or are even worse than them, the code for learning and evaluating the model might not work as expected.
• Overfit a small subset. Try to overfit the model on a very small subset (e.g., as small as two data points) of the training data and make sure that we can achieve the best possible performance evaluated on that training data subset, and, if it is a parametric model, also the lowest possible cost (often zero); see the sketch below.
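A minimal sketch of this check (with a hypothetical model choice and made-up data) could look as follows:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical tiny subset: two data points, one from each class
X_tiny = np.array([[0.0, 1.0],
                   [1.0, 0.0]])
y_tiny = np.array([0, 1])

# A flexible model should fit two points perfectly; if the accuracy on this
# tiny training subset is not 1.0 (or the cost does not approach zero),
# suspect a bug in the learning or evaluation code rather than the model choice
model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=5000).fit(X_tiny, y_tiny)
print(model.score(X_tiny, y_tiny))   # expect 1.0
```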
When we have verified, to the best of our ability, that the code is bug-free and does what it is expected to do, we are ready to proceed. There are many actions that could be taken to improve the model, for example changing the type of model, increasing/decreasing model complexity, changing input variables, collecting more data, correcting mislabeled data (if there are any), etc. What should we do next? Two possible strategies for guiding us towards meaningful actions to improve the solution are analyzing the trade-off between training error and generalization gap, or applying error analysis.


Training error vs generalization gap


With the notation from Chapter 4, 𝐸 train is the performance of the model on training data and 𝐸 hold-out the
performance on hold-out validation data. In the validation step, we are interested in changing the model
such that 𝐸 hold-out is minimized. We can write the hold-out validation error as a sum of the training error
and the generalization gap as

$$E_{\text{hold-out}} = E_{\text{train}} + \underbrace{(E_{\text{hold-out}} - E_{\text{train}})}_{\approx\ \text{generalization gap}}. \qquad (10.1)$$

In words, the generalization gap is the difference between the validation error 𝐸 hold-out and the training
error 𝐸 train . 1
Once the validation step is completed (𝐸 hold-out or 𝐸 𝑘-fold computed), we can easily compute the training
error 𝐸 train and the generalization gap 𝐸 hold-out − 𝐸 train . By computing these quantities, we can actually get
good guidance for what changes we may consider for the next iteration.
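As a sketch (with a hypothetical model and dataset), computing the quantities in (10.1) only requires evaluating the model on the training and validation data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data and model, only to illustrate computing (10.1)
X = np.random.randn(500, 20)
y = (X[:, 0] + 0.5 * np.random.randn(500) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

E_train = 1.0 - model.score(X_train, y_train)     # training error
E_hold_out = 1.0 - model.score(X_val, y_val)      # hold-out validation error
gap = E_hold_out - E_train                        # ~ generalization gap, (10.1)
```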
As we discussed in Chapter 4, if the training error is small and the generalization gap is big (𝐸 train small,
𝐸 hold-out big), we have typically overfitted the model. The opposite situation, big training error and small
generalization gap (both 𝐸 train and 𝐸 hold-out big), typically indicates underfitting.
If we want to reduce the generalization gap 𝐸 hold-out − 𝐸 train (reduce overfitting), the following actions
can be explored:

• Use a less flexible model. If we have a very flexible model we might start overfitting to the training
data, that is, that 𝐸 train is much smaller than 𝐸 hold-out . If we use a less flexible model, we also reduce
this gap.

• Use more regularization. Using more regularization will reduce the flexibility of the model, and
hence also reduce the generalization gap.

• Use bagging, or use more ensemble members if we already are using it. Bagging is a method for
reducing the variance of the model, which typically also means that we reduce the generalization
gap.

• Collect more training data. If we collect more training data, the model is less prone to overfit that
extended training dataset and is forced to only focus on aspects which generalize to the validation
data.

• Early stopping. For models that are trained iteratively, we can stop the training before reaching the minimum. One good practice is to monitor 𝐸 hold-out during training and stop if it starts increasing.

If we want to reduce the training error 𝐸 train (reduce underfitting), the following actions can be
considered:

• Use a more flexible model that is able to fit the training data better. This can be to change a
hyperparameter in the model we are considering, for example decreasing 𝑘 in 𝑘-NN or changing
the model to a more flexible one, for example changing the linear regression model to a deep neural network.

• Extend the set of input variables. If we suspect that there are more input variables that carry
information, we might want to extend the data with these input variables.

• Use less regularization. Can of course only be applied if regularization is used at all.

• Train the model longer. For models that are trained iteratively, we can reduce 𝐸 train by training longer.
1 This can be related to (4.11), if approximating 𝐸¯ train ≈ 𝐸 train and 𝐸¯ new ≈ 𝐸 hold-out . If using 𝑘-fold cross validation,
𝐸¯ new ≈ 𝐸 𝑘-fold would be an equally good approximation when computing the generalization gap.


[Figure 10.2 is a flowchart: if the training error is high, try a more flexible model, less regularization, or training the model longer; otherwise, if the generalization gap is high, try a less flexible model, more regularization, early stopping, or getting more training data; otherwise, done.]

Figure 10.2: The iterative procedure of improving the model based on the decomposition into training error and generalization gap.

We summarize the above discussion in Figure 10.2. Fortunately, evaluating 𝐸 train and 𝐸 hold-out is cheap.
We only have to evaluate the model on the training data and the validation data, respectively. Yet, it gives
us good advice on what actions to take next. Besides suggesting what action to explore next, this procedure
also tells us what not to do: if 𝐸 train ≫ 𝐸 hold-out − 𝐸 train , collecting more training data will most likely not help. Furthermore, if 𝐸 train ≪ 𝐸 hold-out − 𝐸 train , a more flexible model will most likely not help.

Error analysis
Another strategy to identify actions that can improve the model is to perform error analysis. Below we
only describe error analysis for classification problems, but the same strategy can be applied to regression
problems as well.
In error analysis we manually look at a subset, say 100 data points, of the validation data that the model
classified incorrectly. Such an analysis does not take much time, but might give valuable clues as to what
type of data the model is struggling with and how much improvement we can expect by fixing these issues.
We illustrate the procedure with an example.
Example 10.1: Error analysis applied to vehicle detection

Consider a classification problem of detecting cars, bicycles and pedestrians in an image. The model takes
an image as input and one of the four classes car, bike, pedestrian, or other as output. Assume that the
model has a classification accuracy of 90% on validation data.
When looking at a subset of 100 images that were misclassified in the validation data, we make the
following observations:
• All 10 images of class pedestrian that were incorrectly classified as bike contained a baby carrier.
• 30 images were substantially tilted.
• 15 images were mislabeled.
From this observation we can conclude:
• If we launch a project for improving the model to classify pedestrians with baby carriers as
pedestrian and not incorrectly as bike, an improvement of at most ∼ 1% (a tenth of the 10%
classification error rate) can be expected.


• If we improve the performance on tilted images, an improvement of at most ∼ 3% can be expected.


• If we correct all mislabeled data, an improvement of at most ∼ 1.5% can be expected.

Following the example, we get an indication of what improvement we can expect by tackling these three issues. These numbers should be considered as the maximal possible improvement. To prioritize which aspect to focus on, we should also consider what strategies are available for improving them, how much progress we expect to make by applying these strategies, and how much effort we have to invest in fixing these issues.
For example, to improve the performance on tilted images we could try to extend the training data by
augmenting it with more tilted images. This strategy could be investigated without too much extra effort
by augmenting the training data with tilted versions of the training data points that we already have. Since this could be applied fairly quickly and has a maximal performance increase of 3%, it seems to be a good thing to try out.
To improve the performance on the images with baby carriers, one approach would be to collect more
images of pedestrians with baby carriers. This obviously requires some more manual work, and it can be questioned whether it is worth the effort, since it would only give a performance improvement of at most 1%.
Regarding the mislabeled data, the obvious action to take to improve on this issue is to manually go through the data and correct these labels. In the example above we may say it is not quite worth the effort for an improvement of 1.5%. However, assume that we have improved the model with other actions to an accuracy of 98.0% on validation data and that still 1.5% of the total error is due to mislabeled data; this issue is then quite relevant to address if we want to improve the model further. Remember, the purpose of the validation data is to choose between different models. This purpose is degraded when the majority of the reported error on validation data is due to incorrectly labeled data rather than the actual performance of the model.
There are three levels of ambitions for correcting the labels:

1. Go through the data points in the validation/test data that were misclassified by the best algorithm
and correct these labels.

2. Go through all data points in the validation/test data and correct the labels.

3. Go through all data points, including the training data and correct the labels.

Approach 1 is in general not recommended, since there could be mislabeled data points that the algorithm also classified as belonging to that mislabeled class. These data points would then not be corrected, resulting in a too optimistic estimate of the performance on validation and test data. Therefore, it is safer to go through all data points in the validation and test data, as suggested in approach 2. The advantage of approach 1, in comparison to approach 2, is that it requires less work. If we have a model with 98% accuracy, there is 50 times less data to process in comparison to approach 2. Also, note that correcting labels in validation and test data only does not necessarily increase the performance of the model in production, but it will give us a fairer estimate of the actual performance of the model.
An even more ambitious approach is to go through all mislabeled data in the training data as well, as
suggested in approach 3. This is also substantially more labor intensive. Assume, for example, that we
have made a 98%−1%−1% split of training-validation-test data. Then it is yet another factor of 50 more time-consuming in comparison with approach 2. Unless the mislabeling is systematic, correcting the
labels in the training data does not necessarily pay off.
Applying the data cleaning to validation and test data only, as suggested in approach 2, will result in the
training data coming from a slightly different distribution than the validation and test data. However, if we
are eager to correct the mislabeled data in the training data as well, a good recommendation would still be
to start correcting validation and test data only, and then use the techniques in the following section to see how much extra performance we can expect by also cleaning the training data, before launching that substantially more labor-intensive data cleaning project.


[Figure 10.3 shows all available data split into training data, train-validation data, validation data and test data, where the training data may come from a slightly different distribution than the data distribution we care about, together with the corresponding decomposition of 𝐸 hold-out into the training error 𝐸 train , the generalization gap 𝐸 train-val − 𝐸 train and the training–validation mismatch 𝐸 hold-out − 𝐸 train-val .]

Figure 10.3: Revising the training-validation-test data split by carving out a separate train-validation dataset from the training data.

Mismatched training and validation/test data

As already pointed out in Chapter 4, we should strive for letting the training data come from the same
distribution as the validation and test data. However, there are situations where we, for different reasons,
can accept the training data to come from a slightly different distribution than the validation and test data.
One reason was presented in the previous section where we choose to correct mislabeled data in validation
and test data, but not necessarily invest the time to do the same correction to the training data.
Another reason is that we might have access to another substantially larger dataset which comes from a
slightly different distribution than the data we care about, but similar enough that the advantage of having a larger training dataset outweighs the disadvantage of that data mismatch. This scenario is further elaborated
in Section 10.3.
If we have a mismatch between training data and validation/test data, that mismatch contributes to yet
another error source of the final validation error 𝐸 hold-out that we care about. We want to estimate the
magnitude of that error source. This can be done by revising the training-validation-test data split. From
the training data we carve out a separate training-validation dataset, see Figure 10.3. That dataset is
neither used for training nor for validation. However, we do evaluate the performance of our model on that
dataset as well. As before, the remaining part of the training data is used for training, the validation data is
used for comparing different model structures, and test data is used for evaluating the final performance of
the model.
This revised data split also allows us to revise the decomposition in (10.1)

$$E_{\text{hold-out}} = E_{\text{train}} + \underbrace{(E_{\text{train-val}} - E_{\text{train}})}_{\approx\ \text{generalization gap}} + \underbrace{(E_{\text{hold-out}} - E_{\text{train-val}})}_{\approx\ \text{train-val mismatch}}, \qquad (10.2)$$

where 𝐸 train-val is the performance of the model on the new training-validation data and where, as
before, 𝐸 hold-out and 𝐸 train are the performances on validation and training data, respectively. With this
new decomposition, the term 𝐸 train-val − 𝐸 train is an approximation of the generalization gap, that is,
how well the model generalizes to unseen data of the same distribution as the training data, whereas
the term 𝐸 hold-out − 𝐸 train-val is the error related to the training-validation data mismatch. If the term
𝐸 hold-out − 𝐸 train-val is small in comparison to the other two terms, it seems likely that the training-validation data mismatch is not a big problem, and it is then better to focus on techniques reducing the training error and the generalization gap, as discussed earlier. On the other hand, if 𝐸 hold-out − 𝐸 train-val is
significant, the data mismatch does have an impact and it might be worth investing time reducing that
term. For example, if the mismatch is caused by the fact that we only corrected labels in the validation and
test data, we might want to consider correcting labels for the training data as well.


10.3 What if we cannot collect more data?

We have seen in Section 10.2 that collecting more data is a good strategy to reduce the generalization gap
and hence reduce overfitting. However, collecting labeled data is usually expensive and sometimes not
even possible. What can we do if we cannot afford to collect more data but still want to benefit from the
advantages that a larger dataset would give? In this section a few approaches are presented.

Extending the training data with slightly different data

As already mentioned, there are situations where we can accept the training data to come from a slightly
different distribution than the validation and test data. One reason to accept this is if we then would have
access to a substantially larger training dataset.
Consider a problem with 10 000 data points representing the data that we also would expect to get when
the model is deployed in production. We call this dataset A. We also have another dataset with 200 000
data points that come from a slightly different distribution, but which is similar enough that we think
exploiting information from that data can improve the model. We call this dataset B. Some options to
proceed would be the following:

• Option 1 Use only dataset A and split it into training, validation, and test data.
Dataset A is split into training data (5 000 points), validation data (2 500 points) and test data (2 500 points).

The advantage of this option is that we only train, validate, and evaluate on dataset A, which is also
the type of data that we want our model to perform well on. The disadvantage is that we have quite
few data points and we do not exploit potentially useful information in the larger dataset B.

• Option 2 Use both dataset A and dataset B. Randomly shuffle the data and split it in training,
validation and test data.
Datasets A and B are pooled and split into training data (205 000 points), validation data (2 500 points) and test data (2 500 points).

The advantage over option 1 is that we have a lot more data available for training. However, the
disadvantage is that we mainly evaluate the model on data from dataset B, whereas we want our
model to perform well on data from dataset A.

• Option 3 Use both dataset A and dataset B. Use data points from dataset A for validation data and
test data and some in the training data. Dataset B only goes into the training data.
The training data consists of 200 000 points from dataset B and 5 000 points from dataset A, whereas the validation data (2 500 points) and the test data (2 500 points) come from dataset A only.


Figure 10.4: Different examples of data augmentation applied to images. Above: An image of a cat has been
reproduced by tilting and vertical flipping. Below: An image of a digit has been reproduced by tilting and blurring.

Similar to option 2, the advantage is that we have more training data than in option 1, and in contrast to option 2 we now evaluate the model on data from dataset A, which is the data we want our model to perform well on. One disadvantage, however, is that the training data no longer has the same distribution as the validation and test data.

Of these three options we would strongly recommend option 3. We exploit the information available in the much larger dataset B, but evaluate only on the type of data on which we want the model to perform well (dataset A). The main disadvantage of option 3 is that the training data no longer comes from the same distribution as the validation and test data. To quantify how big an impact this mismatch has on the final performance, the techniques described in Section 10.2 can be used. To push the model to do better on data from dataset A during training, we can also consider giving data from dataset A a higher weight in the cost function than data from dataset B.

Artificially extending the training dataset


Data augmentation is another approach to extend the dataset without the need to collect more data. In data augmentation we construct new data points by duplicating the existing data and applying transformations to which the output is invariant. This is especially common for images, where such invariant transformations can be scaling, rotation, and flipping of the images. For example, if we flip an image of a cat, it still displays a cat; see the examples in Figure 10.4.
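A minimal sketch of such augmentation for a single image, using NumPy and SciPy, is given below; the random placeholder image, the mirror flip, and the 15-degree tilt are arbitrary choices made only for this illustration.

    import numpy as np
    from scipy.ndimage import rotate

    image = np.random.rand(64, 64)                    # stand-in for one grayscale training image

    flipped = np.fliplr(image)                        # mirror the image left-right
    tilted = rotate(image, angle=15, reshape=False)   # tilt the image by 15 degrees

    # Each transformed copy keeps the original label and is added to the training set.
    augmented_batch = np.stack([image, flipped, tilted])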

Transfer learning
Yet another technique to effectively use more data than the dataset we have is by applying transfer learning.
In transfer learning we use the knowledge from a model that has been trained on a different task with a different dataset, and then apply that model to solve a different but related problem.
Transfer learning is especially common for sequential model structures such as the neural network
models introduced in Chapter 6. Consider an application where we want to detect whether a certain type
of skin cancer is malignant or benign, and for this task we have 100 000 labeled images of skin cancer.
We call this task A. Instead of training the full neural network from scratch on this data, we can reuse an
already pre-trained network from another image classification task (task B), which preferably has been
trained on a much larger dataset, not necessarily containing images even resembling skin cancer tumors.
By using the weights from the model trained on task B and only training the last few layers on the data from task A, we can get a better model than if the whole model had been trained on just dataset A. The procedure is
also displayed in Figure 10.5. The intuition is that the layers closer to the input accomplish tasks that are
generic for all types of images, such as extracting lines, edges and corners in the image, whereas the layers
closer to the output are more specific to the particular problem.
In order for transfer learning to be applicable, we need the two tasks to have the same type of input (in
the example above, images of the same dimension). Further, for transfer learning to be an attractive option,



Figure 10.5: In transfer learning we reuse models that have been trained on a different task than the one we are interested in. Here we reuse a model which has been trained on images displaying all sorts of classes, such as cars, cats, and computers, and later train only the last few layers on the skin cancer data, which is the task we are interested in.


Figure 10.6: Two typical examples of outliers (marked with red circle) in regression (left) and classification (right),
respectively.

the task that we transfer from should have been trained on substantially more data than the task we transfer
to.
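A minimal Keras sketch of this procedure is given below. The choice of MobileNetV2 as the pre-trained network, the input size, and the two-class output head are illustrative assumptions, not the setup of the skin cancer example above.

    import tensorflow as tf

    # Reuse a network pre-trained on a large image classification task (task B).
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, weights="imagenet")
    base.trainable = False  # freeze the generic, early layers

    # Add and train only a small task-specific head on the new data (task A).
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(2, activation="softmax"),  # e.g. malignant / benign
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # model.fit(images_task_a, labels_task_a, epochs=5)  # task-A training data assumed available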

10.4 Practical data issues


Besides the amount and distribution of data, a machine learning engineer may also face other data issues. In this section we will discuss some of the most common ones: outliers, missing data, and whether some features can be removed.

Outliers

In some applications, a common issue is outliers, meaning data points whose outputs do not follow the
overall pattern. Two typical examples of outliers are sketched in Figure 10.6. Even though the situation
in Figure 10.6 looks simple, it can be quite hard to find outliers when the data has more dimensions
and is harder to visualize. The error analysis discussed in Section 10.2, which amounts to inspecting
misclassified data points in the validation data, is a systematic way to discover outliers.
When facing a problem with outliers, the first question to ask is whether the outliers are meant to
be captured by the model or not. Do the encircled data points in Figure 10.6 describe an interesting phenomenon that we would like to predict, or are they irrelevant noise (possibly originating from a poor data collection process)? The answer to this question depends on the actual problem and ambition. Since outliers by definition (no matter their origin) do not follow the overall pattern, they are typically hard to predict.
If the outliers are not of any interest, the first step is to consult the data provider, identify the reason for the outliers, and determine whether something could be changed in the data collection to avoid them, for example replacing a malfunctioning sensor. If the outliers are unavoidable, there are basically two approaches one could take. The first approach is to simply delete (or replace) the outliers in the data. Unfortunately this means that one has to first find the outliers, which can be hard, but sometimes some thresholding and manual inspection (that is, looking at all data points whose output value is smaller or larger than some value) can help. Once the outliers are removed from the data, one can proceed as usual. The second approach is to instead make sure that the learning algorithm is robust against outliers, for example by using a robust loss function such as absolute error instead of squared error loss (see Chapter 5 for more details). Making a model more robust amounts, to some extent, to making it less flexible. However, robustness amounts to making the model less flexible in a particular way, namely by putting less emphasis on the data points whose predictions are severely wrong.
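As a small illustration of the second approach, the sketch below fits a linear model to synthetic data containing one artificially injected outlier, once with the ordinary squared-error loss and once with scikit-learn's HuberRegressor, whose loss (a close relative of the absolute-error loss mentioned above) grows only linearly for large residuals. The data and the outlier are arbitrary placeholders.

    import numpy as np
    from sklearn.linear_model import LinearRegression, HuberRegressor

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, size=(50, 1))
    y = 2.0 * x[:, 0] + 0.1 * rng.normal(size=50)
    y[0] = 10.0  # a single outlier in the output

    ols = LinearRegression().fit(x, y)    # squared-error loss, sensitive to the outlier
    robust = HuberRegressor().fit(x, y)   # robust loss, much less affected by the outlier

    print("OLS slope:  ", ols.coef_[0])
    print("Huber slope:", robust.coef_[0])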
If the outliers are of interest to the prediction, they are not really an issue, but rather a challenge. We
have to use a model that is flexible enough to capture the behavior (small bias). This has to be done with
care, since very flexible models have a high risk of overfitting also to noise. If it turns out that the outliers
in a classification problem indeed are interesting and in fact are from an underrepresented class, we are
rather facing an imbalanced problem, see Section 4.5.

Missing data

A common practical issue is that certain values are sporadically missing in the data. Throughout this book so far the data has always consisted of complete input-output pairs {x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛}, and missing data refers to the situation where some values of either the input x_𝑖 or the output 𝑦_𝑖, for some 𝑖, are missing. If the output 𝑦_𝑖 is missing, we can also refer to it as unlabeled data. It is common practice to denote missing data in a computer with NaN (not a number), but less obvious codings also exist, such as 0.
Reasons for missing data could for instance be a malfunctioning sensor or similar issues at data collection
time, or that certain values for some reason have been discarded during the data processing.
As for outliers, a sensible first option is to figure out the reason for the missing data. By going back to
the data provider this issue could potentially be fixed and the missing data recovered. If this is not possible,
there is no universal solution for how to handle missing data. There is, however, some common practice
which can serve as a guideline. First of all, if the output 𝑦_𝑖 is missing, the data point is useless for supervised machine learning2 and can be discarded. In the following, we assume that the missing values are only in the input x_𝑖.

2 The “partly labeled data” problem is a semi-supervised problem, which is introduced in Chapter 11 but not covered in depth by this book.
The easiest way to handle missing data is to discard the entire data points (“rows of X”) where some value is missing. That is, if some feature is missing in x_𝑖, the entire input x_𝑖 and its corresponding output value 𝑦_𝑖 are discarded from the data, and we are left with a smaller dataset. If the dataset that remains after this procedure still contains enough data, this approach can work well. However, if it leads to too small a dataset, it is of course problematic. More subtle, but also important, is the situation where the data is missing in a systematic fashion, for example if missing data is more common for a certain class. In such
a situation, discarding data points with missing data would lead to a mismatch between reality and training
data, which may degrade the performance of the learned model further.
If missing data is common, but only for certain features, another easy option is to not use those features (“columns of X”) that suffer from missing data. It depends on the situation whether this is a
fruitful approach or not.
Instead of discarding the missing data, it is possible to impute (fill in) the missing values using some
heuristics. Say, for example, that the 𝑗th feature 𝑥 𝑗 is missing from data point x𝑖 . A simple imputation
strategy would be to take the mean or median of 𝑥 𝑗 for all other data points (where it is not missing), or
the mean or median of 𝑥 𝑗 for all data points of the same class (if it is a classification problem). It is also
possible to come up with more complicated imputation strategies, but each imputation strategy implies
some assumptions about the problem. Those assumptions might or might not be fulfilled, and it is hard to
guarantee that imputation will help the performance in the end. A poor imputation can even degrade the
performance compared to just discarding the missing data.
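A minimal NumPy sketch of the simplest of these strategies, median imputation computed column by column, is given below; the small array X with NaN-coded missing values is an arbitrary placeholder.

    import numpy as np

    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [2.0, np.nan],
                  [4.0, 1.0]])                       # placeholder data; NaN marks a missing value

    col_medians = np.nanmedian(X, axis=0)            # per-feature median, ignoring missing values
    X_imputed = np.where(np.isnan(X), col_medians, X)  # fill in each missing value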
Some methods are actually able to handle missing data to some extent (which we have not discussed in this book), but under rather restrictive assumptions. One such assumption is that the data is “completely missing at random”, meaning that which data is missing is completely uncorrelated with the value it, the other features, and the output would have had, had it not been missing. Assumptions like these are
very strong and rarely met in practice, and the performance can be severely degraded if those assumptions
are not fulfilled.

Feature selection
When working with a supervised machine learning problem, the question of whether all available input variables/features contribute to the performance is often relevant. Removing the right feature is indeed a type of regularization that can possibly reduce overfitting and improve the performance, and the data collection might be simplified if a certain variable does not even have to be collected. Selecting between the available features is an important task for the machine learning engineer.
The connection between regularization and feature selection becomes clear by considering 𝐿1 regularization. Since the main feature of 𝐿1 regularization is that the learned parameter vector 𝜽̂ is sparse, it effectively removes the influence of certain features. If using a model where 𝐿1 regularization is possible, we can study 𝜽̂ to see which features we can simply remove from the dataset. However, if we cannot or prefer not to use 𝐿1 regularization, we have to use a more manual approach to feature selection.
Remember that our overall goal is to obtain a small new data error 𝐸 new, which we for most methods estimate using cross-validation. We can therefore always use cross-validation to tell whether we gain, or lose, by including a certain feature in x. Depending on the amount of data, evaluating all possible combinations of removed features might not be a good idea, either for computational reasons or due to the risk of overfitting. There are, however, some rules of thumb that can possibly give us some guidance on which features we should investigate more closely to see whether they contribute to the performance or not.
To get a feeling for the different features, we can look at the correlation between each feature and the output, and thereby get a clue about which features might be more informative about the output. If there is little correlation between a feature and the output, it is possibly a useless feature and we could investigate further whether we can remove it. However, looking only at one feature at a time can be misleading, and there exist cases where this would lead to the wrong conclusion, for example the case in Example 8.1.
Another approach is to explore whether there are redundant features, with the reasoning that having two features that (essentially) contain the same information will lead to an increased variance compared to having only one feature with the same information. Based on this argument one may look at the pairwise correlation between the features, and investigate removing features that have high correlation to other features. This approach is somewhat related to PCA (Chapter 11).
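As a rough illustration of these heuristics, the sketch below builds a small synthetic dataset (an arbitrary stand-in, with one informative feature, one nearly redundant copy of it, and one uninformative feature) and computes the feature-output correlations as well as the pairwise feature correlations.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)    # nearly redundant copy of x1
    x3 = rng.normal(size=n)                # unrelated to the output
    X = np.column_stack([x1, x2, x3])
    y = 2 * x1 + 0.5 * rng.normal(size=n)

    # Correlation between each feature and the output.
    for j in range(X.shape[1]):
        print(f"corr(x{j + 1}, y) = {np.corrcoef(X[:, j], y)[0, 1]:.2f}")

    # Pairwise correlation between the features (large off-diagonal values suggest redundancy).
    print(np.round(np.corrcoef(X, rowvar=False), 2))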

10.5 Can I trust my machine learning model?


Supervised machine learning presents a powerful family of all-purpose, general black-box methods, and has demonstrated impressive performance in many applications. The main argument for supervised machine learning is, frankly, that it works well empirically. However, depending on the requirements of the application, supervised machine learning also has a potential shortcoming, in that it relies on “repeating patterns seen in training data” rather than “deduction from a set of carefully written rules”.

Understanding why a certain prediction was made


In some applications there might be an interest in “understanding” why a certain prediction was made by a supervised machine learning model, for example in medicine or law. Unfortunately, the underlying design


philosophy in machine learning is to deliver good predictions rather than explaining them.
With simpler models, like the ones in Chapters 2-3, it can to some degree be possible for an engineer to inspect the learned model and explain the “reasoning” behind it to a non-expert. For the more complicated models this can be a rather hard task.
There are, however, methods at the research forefront, and the situation may look different in the future. A related topic is that of so-called adversarial examples, which essentially amounts to finding an input x′ which is as close as possible to x but still gives a different prediction. In the image classification setting, it can for example be the problem of having a picture of a car being predicted as a dog by only changing a few pixel values.

Worst case guarantees


In the view of this book, a supervised machine learning model is good if it attains a small 𝐸 new . It is,
however, important to remember that 𝐸 new is a statistical claim, under the assumption that the training
and/or test data resembles the reality which the model will face once it is put in production. And even if
that non-trivial assumption is satisfied, there are no claims about how badly the model will predict in the worst individual cases. This is indeed a shortcoming of supervised machine learning, and potentially also
a show-stopper for some applications.
Simpler and more interpretable models, like logistic regression and trees for example, can be inspected
manually in order to deduce the “worst case” that could happen. By looking at the leaf nodes in a
regression tree, as an example, it is possible to give an interval to which all predictions will belong. With
more complicated models, like random forests and deep learning, it is very hard to give any worst case guarantees about how wrong the model can be in its predictions when faced with some particular input.
However, an extensive testing scheme might reveal some of the potential issues.

10.6 Further reading


The user aspects of machine learning constitute a fairly under-explored area, both in academic research publications and in standard textbooks on machine learning. Two exceptions are Ng (2019) and Burkov (2020), by which parts of this chapter have been inspired. Connected to data augmentation, see Shorten and Khoshgoftaar (2019) for a review of different data augmentation techniques for images.

11 Generative models and learning from unlabeled data
The models introduced so far in this book are so-called discriminative models, meaning (in the context of
supervised machine learning) that they are designed to learn from data with the aim of predicting new
outputs. We will in the first half of this chapter introduce models of another paradigm, namely so-called
generative models. Generative models are indeed also learned from data, but their scope is wider. In
contrast to discriminative models, a generative model is for instance able to simulate more data, and
predictions are merely obtained as a by-product. Generative modeling is therefore a natural way to take us
beyond supervised learning, which we will do in the second half of this chapter.
A generative model aims to describe the distribution 𝑝(x, 𝑦), that is, how the data (both its input and
its output) is generated. Perhaps we should write 𝑝(x, 𝑦 | 𝜽) to emphasize that generative models also
contain some parameters that we will learn from data, but to ease the notation we settle for 𝑝(x, 𝑦). To
make a prediction with a generative model, the expression for 𝑝(𝑦 | x) has to be derived from 𝑝(x, 𝑦)
using probability theory. We will make this idea concrete by considering the rather simple, yet useful,
generative Gaussian mixture model. Like any generative model, the Gaussian mixture model can be used for different purposes. When used for prediction it results in a method traditionally called linear or quadratic discriminant analysis (LDA or QDA, respectively). We will thereafter see how the Gaussian mixture model can also be used for semi-supervised learning (where labels 𝑦 are partly missing) and unsupervised learning (where no labels are present at all; there are only x and no 𝑦). In the latter
case, the Gaussian mixture model can be used for solving the so-called clustering problem where similar x
are to be grouped together.
Generative models bridge the gap between supervised and unsupervised machine learning, but not all methods for unsupervised learning come from generative models. We therefore finally also review two other popular unsupervised methods (not derived from generative models), namely 𝑘-means and principal component analysis (PCA). 𝑘-means is another clustering method, whereas the purpose of PCA is to infer which dimensions of x are the most informative, so-called dimensionality reduction.

11.1 The Gaussian mixture model and the LDA & QDA classifiers
We will now introduce a generative model, the Gaussian mixture model, from which we will derive several
methods for different purposes. The Gaussian mixture model attempts to model 𝑝(x, 𝑦), that is, the joint
distribution between inputs x and outputs 𝑦. (The discriminative models in previous chapters only attempt
to model the conditional distribution 𝑝(𝑦 | x), a less ambitious problem since 𝑝(𝑦 | x) can be derived
from 𝑝(x, 𝑦) but not vice versa.) For the Gaussian mixture model, we assume that x is a numerical and 𝑦 a
categorical variable.

The Gaussian mixture model


The Gaussian mixture model makes use of the factorization

𝑝(x, 𝑦) = 𝑝(x | 𝑦) 𝑝(𝑦), (11.1a)

where 𝑝(x | 𝑦) is assumed to be a Gaussian distribution

    𝑝(x | 𝑦) = N(x | 𝝁_𝑦 , 𝚺_𝑦),   (11.1b)



Figure 11.1: The Gaussian mixture model is a generative model, and we think about the input variables x as random
and assume that they have a certain distribution. The Gaussian mixture model assumes that 𝑝(x | 𝑦) has a Gaussian
distribution for each 𝑦. In this figure x is two-dimensional and there are two classes 𝑦 (red and blue). The left panel
shows some data with this nature. The right panel shows, for each value of 𝑦, the contour lines of the Gaussians
𝑝(x | 𝑦) that are learned from the data using (11.3).

where 𝝁_𝑦 and 𝚺_𝑦 depend on 𝑦. Since 𝑦 is categorical, and thereby takes values 1, . . . , 𝑀, the distribution 𝑝(𝑦) is modeled as a categorical distribution with 𝑀 parameters {𝜋_𝑚}_{𝑚=1}^{𝑀} as

    𝑝(𝑦 = 1) = 𝜋_1 ,
    . . .
    𝑝(𝑦 = 𝑀) = 𝜋_𝑀 .   (11.1c)

In words the model (11.1) assumes a categorical distribution for 𝑦 and, for each possible value of 𝑦, a
Gaussian distribution for x. If we look at x, as we do in Figure 11.1, the model corresponds to a mixture
of Gaussians (one component for each value of 𝑦), and hence its name. Altogether (11.1) is a generative
model which describes an assumption on how data (x, 𝑦) is generated.
The Gaussian mixture model (11.1) will lead us to classifiers (for supervised learning) which are easy to learn (requiring no numerical optimization, in contrast to logistic regression) and useful in practice also when the data does not obey the Gaussian assumption (11.1b) perfectly. The classifiers are for historical reasons called linear and quadratic discriminant analysis (LDA1 and QDA, respectively). It will also lead us to a clustering method for unsupervised learning, and it is useful also for the in-between problem where the output label 𝑦 is missing for some of the data points.

1 Not to be confused with Latent Dirichlet Allocation, also abbreviated LDA, which is a completely different method.

Supervised learning of the Gaussian mixture model

Like any machine learning model, the Gaussian mixture model (11.1) is learned from training data. The unknown parameters to be learned are 𝜽 = {𝝁_𝑚 , 𝚺_𝑚 , 𝜋_𝑚}_{𝑚=1}^{𝑀}. We start with the supervised case, meaning that the training data contains both inputs x and outputs 𝑦, T = {x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛} (which has been the case for all other methods in this book so far). We will, however, later see how we can learn the Gaussian mixture model also when the output 𝑦 is missing.
Mathematically we learn the Gaussian mixture model by maximizing the likelihood

    𝜽̂ = arg max_𝜽 𝑝({x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛} | 𝜽) = arg max_𝜽 𝑝(T | 𝜽).   (11.2)

Alternatively it is possible to learn the Gaussian mixture model also using the Bayesian approach, although we do not pursue that any further here. The optimization problem (11.2) turns out to have the closed-form solution

    𝜋̂_𝑚 = 𝑛_𝑚 / 𝑛,   (11.3a)


where 𝑛_𝑚 is the number of training data points in class 𝑚. (Consequently, all 𝑛_𝑚 must sum to 𝑛, and thereby Σ_𝑚 𝜋̂_𝑚 = 1.) Furthermore, the mean vector 𝝁_𝑚 of each class is learned as

    𝝁̂_𝑚 = (1/𝑛_𝑚) Σ_{𝑖 : 𝑦_𝑖 = 𝑚} x_𝑖 ,   (11.3b)

the empirical mean among all training data points of class 𝑚, and the covariance matrix 𝚺_𝑚 of each class 𝑚 = 1, . . . , 𝑀 is learned as

    𝚺̂_𝑚 = (1/𝑛_𝑚) Σ_{𝑖 : 𝑦_𝑖 = 𝑚} (x_𝑖 − 𝝁̂_𝑚)(x_𝑖 − 𝝁̂_𝑚)^T .   (11.3c)

Note that we can compute the learned parameters 𝜽̂ regardless of whether the data comes from a distribution where 𝑝(x | 𝑦) is Gaussian or not. In fact, (11.3b)-(11.3c) learn a Gaussian distribution for x for each class such that the mean and covariance fit the data (so-called moment-matching).
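A minimal NumPy sketch of the closed-form estimates (11.3), computed on a small synthetic two-class dataset (an arbitrary placeholder), could look as follows.

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic training data with two classes (a stand-in for {x_i, y_i}).
    X = np.vstack([rng.normal(loc=[0, 0], scale=1.0, size=(40, 2)),
                   rng.normal(loc=[3, 3], scale=1.0, size=(60, 2))])
    y = np.concatenate([np.zeros(40, dtype=int), np.ones(60, dtype=int)])

    n = len(y)
    pi_hat, mu_hat, Sigma_hat = {}, {}, {}
    for m in np.unique(y):
        Xm = X[y == m]
        pi_hat[m] = len(Xm) / n                   # (11.3a)
        mu_hat[m] = Xm.mean(axis=0)               # (11.3b)
        diff = Xm - mu_hat[m]
        Sigma_hat[m] = diff.T @ diff / len(Xm)    # (11.3c)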

Predicting output labels for new inputs: QDA

We have so far described the generative Gaussian mixture model 𝑝(x, 𝑦), where x is numerical and 𝑦
categorical, and how to learn the unknown parameters in 𝑝(x, 𝑦) from training data. We will now see how
this can be used as a classifier for supervised machine learning.
The key insight for using a generative model 𝑝(x, 𝑦) to make predictions is to realize that a prediction “only” amounts to computing 𝑝(𝑦 | x). From probability theory we have

    𝑝(𝑦 | x) = 𝑝(x, 𝑦) / 𝑝(x) = 𝑝(x, 𝑦) / Σ_{𝑚=1}^{𝑀} 𝑝(x, 𝑚).   (11.4)

The left-hand side 𝑝(𝑦 | x) is the prediction, whereas all expressions on the right-hand side are defined by the generative Gaussian mixture model (11.1). We therefore get the classifier

    𝑝(𝑦 = 𝑚 | x★) = 𝜋̂_𝑚 N(x★ | 𝝁̂_𝑚 , 𝚺̂_𝑚) / Σ_{𝑗=1}^{𝑀} 𝜋̂_𝑗 N(x★ | 𝝁̂_𝑗 , 𝚺̂_𝑗),   (11.5)

which historically has been called quadratic discriminant analysis (QDA). As usual, we can obtain “hard” predictions 𝑦̂★ by selecting the class which is predicted to be the most probable,

    𝑦̂★ = arg max_𝑚 𝑝(𝑦 = 𝑚 | x★),   (11.6)

and compute corresponding decision boundaries. It turns out that the decision boundary for QDA is
always a quadratic function (hence its name). We summarize this by Method 11.1 and Figure 11.3, and in
Figure 11.2 we show the decision boundary when the Gaussian mixture model from Figure 11.1 is turned
into a QDA classifier.
It is possible to make the restriction that the covariance matrix is equal for all classes, 𝚺_1 = 𝚺_2 = · · · = 𝚺_𝑀 = 𝚺 in (11.1b). With that restriction (11.3c) is replaced with

    𝚺̂ = (1/(𝑛 − 𝑀)) Σ_{𝑚=1}^{𝑀} Σ_{𝑖 : 𝑦_𝑖 = 𝑚} (x_𝑖 − 𝝁̂_𝑚)(x_𝑖 − 𝝁̂_𝑚)^T .   (11.7)

Using (11.7) in (11.5) leads to the so-called linear discriminant analysis (LDA), a classifier with linear
decision boundaries. Note that LDA is obtained by replacing (11.3c) with (11.7) in Method 11.1. We
compare LDA and QDA in Figure 11.4 by applying both of them to the music classification problem from
Example 2.1.



Figure 11.2: The decision boundary for the QDA classifier (obtained by (11.5) and (11.6)) corresponding to the
learned Gaussian mixture model in the right panel of Figure 11.1.

Learn the Gaussian mixture model

Data: Training data T = {x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛}
Result: Gaussian mixture model

1 for 𝑚 = 1, . . . , 𝑀 do
2     Compute 𝜋̂_𝑚 (11.3a), 𝝁̂_𝑚 (11.3b) and 𝚺̂_𝑚 (11.3c)
3 end

Predict with Gaussian mixture model

Data: Gaussian mixture model and test input x★
Result: Prediction 𝑦̂(x★)

1 for 𝑚 = 1, . . . , 𝑀 do
2     Compute 𝑝(𝑦 = 𝑚 | x★) (11.5)
3 end
4 Find the largest 𝑝(𝑦 = 𝑚 | x★) and set 𝑦̂★ to that 𝑚

Method 11.1: Quadratic Discriminant Analysis, QDA
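Continuing the NumPy sketch above (with the estimates pi_hat, mu_hat and Sigma_hat), the prediction step of Method 11.1 could be sketched as follows; SciPy's multivariate normal density stands in for N(x | 𝝁, 𝚺), and x_star is an arbitrary test input.

    import numpy as np
    from scipy.stats import multivariate_normal

    def predict_qda(x_star, pi_hat, mu_hat, Sigma_hat):
        classes = sorted(pi_hat)
        # Unnormalized class scores pi_m * N(x_star | mu_m, Sigma_m), cf. (11.5).
        unnorm = np.array([pi_hat[m] * multivariate_normal.pdf(x_star, mu_hat[m], Sigma_hat[m])
                           for m in classes])
        p = unnorm / unnorm.sum()                  # p(y = m | x_star)
        return classes[int(np.argmax(p))], p       # hard prediction (11.6) and class probabilities

    x_star = np.array([2.0, 2.5])                  # arbitrary test input
    y_hat, probs = predict_qda(x_star, pi_hat, mu_hat, Sigma_hat)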

Time to reflect 11.1: In the Gaussian mixture model it was assumed that 𝑝(x | 𝑦) is Gaussian.
When applying LDA or QDA “out of the box” for a supervised problem, is there any check that the
Gaussian assumption actually holds? If yes, what? If no, is that a problem?

We have now derived a classifier, QDA, from a generative model. In practice the QDA classifier can be
employed just like any discriminative classifier. It can be argued that a generative model contains more
assumptions than a discriminative model, and if the assumptions are fulfilled we could possibly expect
QDA to be slightly more data efficient (requiring fewer data points to reach a certain performance) than a
discriminative model. In most practical cases, however, this does not make a major difference. The difference between using a generative and a discriminative model will be larger when we next look at the semi-supervised problem.



Figure 11.3: An illustration of QDA for 𝑀 = 3 classes, with dimension 𝑝 = 1 of the input x. The top row shows the generative Gaussian mixture model: to the left the Gaussian model of 𝑝(x | 𝑦), parameterized by 𝝁̂_𝑚 and 𝚺̂_𝑚, and to the right the model of 𝑝(𝑦), parameterized by 𝜋̂_𝑚. All parameters are learned from training data, not shown in the figure. (Since 𝑝 = 1, we only have a scalar variance 𝜎_𝑚^2 instead of a covariance matrix 𝚺_𝑚.) By (11.4) the generative model is “warped” into 𝑝(𝑦 = 𝑚 | 𝑥), shown in the bottom. The decision boundaries are shown as vertical dotted lines in the bottom plot.



(a) Decision boundaries for the music classification problem for an LDA classifier. (b) Decision boundaries for the music classification problem for a QDA classifier.

Figure 11.4: We apply an LDA and QDA classifier to the music classification problem from Example 2.1 and plot
the decision boundaries. Note that the LDA classifier gives linear decision boundaries, whereas the QDA classifier
has decision boundaries with quadratic shapes.



Figure 11.5: We consider the same situation as in Figure 11.1, except for the fact that we have “lost” the output 𝑦_𝑖 for most of the data points. The problem is now semi-supervised. The unlabeled data points {x_𝑖}_{𝑖=1}^{𝑛_𝑢} are illustrated in the left panel as gray dots, whereas the 𝑛_𝑐 = 6 complete data points {x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛_𝑐} are red or blue. In the right panel we have learned a Gaussian mixture model using only the complete data points, as if the problem was supervised with only 𝑛_𝑐 data points. Clearly the unlabeled data points have made the problem harder, compared to Figure 11.1. We will, however, continue this story in Figure 11.6 where we use this as an initialization to a semi-supervised procedure.

11.2 The Gaussian mixture model when some or all labels are missing
We have so far discussed how the Gaussian mixture model can be learned in the supervised setting, that is, from training data that contains input as well as output values. We will now have a look at the so-called semi-supervised problem, where some output values 𝑦_𝑖 are missing in the training data. The input values x_𝑖 for which the corresponding 𝑦_𝑖 are missing are called unlabeled data points. As before we denote the total number of training data points by 𝑛, out of which now only 𝑛_𝑐 are complete input-output pairs {x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛_𝑐}, and the remaining 𝑛_𝑢 data points {x_𝑖}_{𝑖=1}^{𝑛_𝑢} are unlabeled. All in all we have the training data T = {{x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛_𝑐} , {x_𝑖}_{𝑖=1}^{𝑛_𝑢}}, where 𝑛 = 𝑛_𝑐 + 𝑛_𝑢. Semi-supervised problems are common to encounter, since supervised problems in practice often turn out to be semi-supervised because of various data issues that give corrupted or missing output values 𝑦_𝑖.
Remember that a generative model is a model of 𝑝(x, 𝑦), which can be factorized as 𝑝(x, 𝑦) = 𝑝(x) 𝑝(𝑦 | x). Since 𝑝(x) in this sense is a part of the generative model, it seems intuitively plausible that also the unlabeled data points {x_𝑖}_{𝑖=1}^{𝑛_𝑢} can be useful when learning a generative model. This is in contrast to discriminative models, which only care about how 𝑦 depends on x, and for which unlabeled data points are harder to make use of.
We will later also consider the unsupervised problem, where all data points are unlabeled.

Semi-supervised learning of the Gaussian mixture model


The simplest solution to the semi-supervised problem would be to discard the 𝑛_𝑢 unlabeled data points, and thereby turn the problem into a standard supervised machine learning problem. That is indeed a pragmatic solution if the number of unlabeled data points 𝑛_𝑢 is small compared to the total number of data points 𝑛. However, if the unlabeled data points are a significant part of all available data T, we might be left with rather few complete data points {x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛_𝑐}. We illustrate this with Figure 11.5, which depicts a semi-supervised problem where we have learned a (poor) Gaussian mixture model by only using the few 𝑛_𝑐 complete data points. From before we also know that the more data the better, which is another incentive for a more careful way to handle the semi-supervised problem.
Conceptually we would like to solve the maximum likelihood problem again (as for the supervised problem, (11.2)), but leaving out the missing outputs. That is, we would like to solve

    𝜽̂ = arg max_𝜽 𝑝({{x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛_𝑐} , {x_𝑖}_{𝑖=1}^{𝑛_𝑢}} | 𝜽) = arg max_𝜽 𝑝(T | 𝜽).   (11.8)

Unfortunately this problem has no closed-form solution for the Gaussian mixture model since, if we spell 𝑝({{x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛_𝑐} , {x_𝑖}_{𝑖=1}^{𝑛_𝑢}} | 𝜽) out, it contains a marginalization over the missing 𝑦_𝑖.


Leaving the seemingly hard problem (11.8) aside, a pragmatic approach to the semi-supervised problem would be to use the Gaussian mixture model to predict the missing output values {𝑦_𝑖}_{𝑖=1}^{𝑛_𝑢}, in the following iterative fashion:

(i) Learn the Gaussian mixture model from the 𝑛_𝑐 available complete input-output pairs {x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛_𝑐}

(ii) Use the Gaussian mixture model to predict (as a QDA classifier) the missing outputs for {x_𝑖}_{𝑖=1}^{𝑛_𝑢}

(iii) Update the Gaussian mixture model using also the predicted outputs from step (ii)

and then repeat steps (ii) and (iii) until convergence.
It is far from obvious whether such a procedure makes any sense, and even less so whether it will converge. As it turns out, however, it can be rigorously shown that this is indeed a very sensible idea, and it does converge. As long as we pay attention to some important details, the suggested procedure is actually the so-called expectation-maximization (EM) algorithm applied to (11.8).
The EM algorithm is a widely used computational statistics tool for solving maximum likelihood problems which contain latent (unobserved) variables, and it can be shown to converge to a local optimum of (11.8). By considering the missing output values {𝑦_𝑖}_{𝑖=1}^{𝑛_𝑢} as such latent variables, the EM algorithm fits perfectly and boils down to the suggested iteration with a few important details: from step (ii) we should return the predicted class probabilities 𝑝(𝑦 = 𝑚 | x) (and not the hard class prediction 𝑦̂(x★)), and in step (iii) we make use of the predicted class probabilities by introducing the notation





    𝑤_𝑖(𝑚) = 𝑝(𝑦 = 𝑚 | x_𝑖)   if 𝑦_𝑖 is missing,
             1                 if 𝑦_𝑖 = 𝑚,
             0                 otherwise,                                              (11.9a)

and learn the parameters as

    𝜋̂_𝑚 = (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝑤_𝑖(𝑚),                                                    (11.9b)

    𝝁̂_𝑚 = (1 / Σ_{𝑖=1}^{𝑛} 𝑤_𝑖(𝑚)) Σ_{𝑖=1}^{𝑛} 𝑤_𝑖(𝑚) x_𝑖 ,                              (11.9c)

    𝚺̂_𝑚 = (1 / Σ_{𝑖=1}^{𝑛} 𝑤_𝑖(𝑚)) Σ_{𝑖=1}^{𝑛} 𝑤_𝑖(𝑚) (x_𝑖 − 𝝁̂_𝑚)(x_𝑖 − 𝝁̂_𝑚)^T .         (11.9d)

Note that (11.9) is a generalization of the supervised case, and that (11.3) is a special case of (11.9) when
no labels 𝑦 𝑖 are missing. We also summarize this as Method 11.2, and illustrate it in Figure 11.6 by
applying it to the situation introduced in Figure 11.5.
We have now devised a way to handle semi-supervised classification problems using the Gaussian
mixture model, and thereby extended the QDA classifier such that it can be used also in the semi-supervised
setting when some output values 𝑦 𝑖 are missing from the training data.
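For concreteness, a compact NumPy sketch of this weighted EM-style iteration is given below. It is only an illustration of Method 11.2: the encoding of a missing label as 𝑦_𝑖 = -1 is an arbitrary choice, y is assumed to be an integer array, and each class is assumed to have at least one labeled data point.

    import numpy as np
    from scipy.stats import multivariate_normal

    def m_step(X, W):
        # Weighted parameter estimates (11.9b)-(11.9d); each row of W sums to one.
        Nm = W.sum(axis=0)                              # sum_i w_i(m) for each class m
        pi = Nm / W.sum()                               # (11.9b)
        mu = (W.T @ X) / Nm[:, None]                    # (11.9c)
        Sigma = []
        for m in range(W.shape[1]):
            diff = X - mu[m]
            Sigma.append((W[:, m, None] * diff).T @ diff / Nm[m])   # (11.9d)
        return pi, mu, Sigma

    def learn_gmm_semisupervised(X, y, M, n_iter=20):
        n, labeled = len(X), (y >= 0)                   # y_i = -1 marks a missing label
        W = np.zeros((n, M))
        W[labeled, y[labeled]] = 1.0                    # hard weights for the labeled points
        pi, mu, Sigma = m_step(X[labeled], W[labeled])  # initialization, cf. (11.3)
        for _ in range(n_iter):
            # Step (ii): predicted class probabilities for the unlabeled points, cf. (11.5).
            for m in range(M):
                W[~labeled, m] = pi[m] * multivariate_normal.pdf(X[~labeled], mu[m], Sigma[m])
            W[~labeled] /= W[~labeled].sum(axis=1, keepdims=True)
            # Step (iii): update all parameters with the weighted estimates (11.9).
            pi, mu, Sigma = m_step(X, W)
        return pi, mu, Sigma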
It may not yet be clear why we have chosen to introduce the semi-supervised problem in connection with generative models. We could indeed think of using any discriminative model (instead of the Gaussian mixture model) for iteratively predicting the missing 𝑦_𝑖 and updating the model using the predictions. That could sometimes work to some extent, in particular if the fraction of unlabeled data points is small, but the procedure becomes much more robust and able to handle a larger proportion of unlabeled data points when using a generative model. The reason for this is, however, somewhat subtle.
Let us first remind ourselves about the difference between a discriminative and a generative model. A discriminative model is a model for 𝑝(𝑦 | x) only, which encodes information like “if x, then 𝑦 is . . . ”. Consequently the discriminative model does not contain any model of x whatsoever, and makes no assumptions about the connection from 𝑦 to x. This is in contrast to a generative model, which models 𝑝(𝑦, x), from which 𝑝(𝑦 | x) as well as 𝑝(x | 𝑦) follows. The presence of 𝑝(x | 𝑦) in a generative model is what makes it more useful for semi-supervised learning.



Figure 11.6: The iterative Method 11.2 applied to the problem from Figure 11.5. For each iteration, the left panel shows the predicted class probabilities from the previously learned model (using color coding; purple lies between red and blue). The previous model for the top row was shown in the right panel of Figure 11.5. The new model learned using (11.9) and the predictions in the left panel is shown in the right panel of each row.


Learn the Gaussian mixture model

Data: Incomplete training data T = {{x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛_𝑐} , {x_𝑖}_{𝑖=1}^{𝑛_𝑢}} (with output classes 𝑚 = 1, . . . , 𝑀)
Result: Gaussian mixture model

1 Compute, for each 𝑚 = 1, . . . , 𝑀, 𝜋̂_𝑚 (11.3a), 𝝁̂_𝑚 (11.3b) and 𝚺̂_𝑚 (11.3c) using only the complete data {x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛_𝑐}
2 repeat
3     Compute, for each x_𝑖 in {x_𝑖}_{𝑖=1}^{𝑛_𝑢}, the prediction 𝑝(𝑦 | x_𝑖)
4     Compute, for each 𝑚 = 1, . . . , 𝑀, 𝜋̂_𝑚 (11.9b), 𝝁̂_𝑚 (11.9c) and 𝚺̂_𝑚 (11.9d)
5 until convergence

Predict as in QDA, Method 11.1

Method 11.2: Learning of the Gaussian mixture model from incomplete data

Consider the semi-supervised problem in Figure 11.5, and let us think about using a discriminative model, say a neural network, instead of a Gaussian mixture model in the iterative fashion of Method 11.2 and Figure 11.6. We would first learn the model (the neural network) using only the complete data points {x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛_𝑐}. That would give us a decision boundary that separates the blue and red data points well. We would thereafter use that model to predict the outputs for the unlabeled (gray) data points. The next step would be to train the model (the neural network) anew using also the predicted outputs. However, a discriminative model (like a neural network) would in this step only care about fitting the training data (including the predicted outputs) as closely as possible, and the learned model would in practice look very similar to the model learned in the previous iteration. The procedure would therefore in practice get stuck after the first iteration, and the discriminative model would not converge to something much better than a model trained using only the complete data points {x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛_𝑐}.
The generative model, on the other hand, contains more assumptions than a discriminative model, which helps when handling the semi-supervised problem. The generative model implies a model 𝑝(x | 𝑦), which for the Gaussian mixture model encodes the information that all inputs x that belong to a certain output 𝑦 should have the same Gaussian distribution and thereby live in a “cluster”. This assumption is the key enabler for Method 11.2 to work in practice, also in such a challenging situation as in Figures 11.5-11.6, where the vast majority of the data points are unlabeled. The downside of using a generative model is that the result can be rather misleading if the assumptions are not fulfilled.

Unsupervised learning of the Gaussian mixture model

In the supervised setting, all data points are complete (both x and 𝑦), and in the semi-supervised problem some data points are complete and some are unlabeled (only x). In the unsupervised problem, all data points are unlabeled, T = {x_𝑖}_{𝑖=1}^{𝑛}. Whereas the overall goal in supervised and semi-supervised machine learning is to make as good predictions as possible, the overall goal of unsupervised machine learning is in general less pronounced. In the context of generative models, however, unsupervised machine learning amounts to learning 𝑝(x).
The Gaussian mixture model (11.1) is a joint model for x and 𝑦, 𝑝(x, 𝑦). To obtain a model only for x we can marginalize out 𝑦 as 𝑝(x) = Σ_𝑦 𝑝(x, 𝑦). The marginalization implies that we consider 𝑦 as a latent random variable, that is, a random variable which exists in the model but is not observed in the data. In practice we still learn 𝑝(x, 𝑦) (11.1), but from data containing only inputs {x_𝑖}_{𝑖=1}^{𝑛}.

Luckily we already have a tool for learning the Gaussian mixture model from unlabeled data. Method 11.2, which we devised for the semi-supervised case, also works for completely unlabeled data {x_𝑖}_{𝑖=1}^{𝑛}, if we only replace the initialization (line 1) with some pragmatic choice of initial 𝜋̂_𝑚, 𝝁̂_𝑚 and 𝚺̂_𝑚. It can be shown that Method 11.2 then is the EM algorithm applied to solving the unsupervised maximum likelihood


problem

    𝜽̂ = arg max_𝜽 𝑝({x_𝑖}_{𝑖=1}^{𝑛} | 𝜽) = arg max_𝜽 𝑝(T | 𝜽).   (11.10)

If we have unlabeled data and know the number of classes 𝑀, we can use Method 11.2 to learn a Gaussian mixture model, as we illustrate in Figure 11.7.
There are a few important details when learning the Gaussian mixture model in the unsupervised setting which deserve some attention. First of all, even though the data is unlabeled, the number of classes (or rather the number of Gaussian components in the mixture) 𝑀 still has to be known. Furthermore, since there are only unlabeled data points, the indexation of the 𝑀 Gaussian components becomes arbitrary. In Figure 11.7 this means that the colors (red and blue) are interchangeable, and the way it happened to turn out is only an effect of the initialization.
The initialization is also an important detail to consider carefully, since Method 11.2 (which builds on the EM algorithm) is only guaranteed to find a local optimum. With an unfortunate initialization Method 11.2 may therefore end up in a poor local optimum of (11.10). A pragmatic remedy is to run Method 11.2 with different (random) initializations. Another issue with Method 11.2 is that it might find a local optimum which corresponds to a degenerate solution where some Gaussian component has a singular covariance matrix that causes numerical problems.
We have so far only said that unsupervised learning of a generative model corresponds to learning 𝑝(x), but it is perhaps not clear why 𝑝(x) can be of interest. As the name suggests, a generative model can be used to generate more data, by drawing samples from 𝑝(x). That is true also for the Gaussian mixture model. The Gaussian mixture model in particular can also be used for solving the so-called clustering problem. By learning a Gaussian mixture model and then using it as a QDA classifier (even though it was learned from unlabeled data), we get a method for clustering (classifying) “similar” x_𝑖 into 𝑀 different clusters (classes). We will next have a look at another common clustering method, which can be understood as a simplified version of the Gaussian mixture model and Method 11.2.



Figure 11.7: Method 11.2 applied to an unsupervised problem, where all training data is unlabeled. The only practical difference compared to Figure 11.6 is the initialization (in the upper row), which here is done arbitrarily instead of using the complete data points.

11.3 More unsupervised methods: k-means and PCA


Whereas supervised machine learning deals with the problem of predicting 𝑦̂(x★) by learning a model from input-output data {x_𝑖 , 𝑦_𝑖}_{𝑖=1}^{𝑛}, unsupervised machine learning is more of an umbrella term for methods that somehow attempt to extract information from unlabeled data {x_𝑖}_{𝑖=1}^{𝑛}. We have introduced generative models, illustrated by the Gaussian mixture model, which is a general concept that can be used for supervised as well as unsupervised machine learning. We will now have a look at two more methods for unsupervised machine learning, for solving the clustering and the dimensionality reduction problem, respectively. The clustering problem amounts to grouping similar data points x_𝑖, whereas dimensionality reduction amounts to finding a lower-dimensional representation of the data that still contains most of the information.

k-means clustering
The clustering problem amounts to grouping similar data points x_𝑖. There are several alternative ways of mathematically defining as well as solving the clustering problem. In 𝑘-means it is assumed that all elements of x are numerical and that similarity is defined in terms of the Euclidean distance. We can define the 𝑘-means objective if we first introduce 𝑘 sets (clusters) 𝑅_1 , 𝑅_2 , . . . , 𝑅_𝑘. With the condition that each data point x_𝑖 in {x_𝑖}_{𝑖=1}^{𝑛} should be a member of exactly one cluster, 𝑘-means clustering amounts to selecting the clusters such that the sum of pairwise squared Euclidean distances within each cluster is minimized,

    arg min_{𝑅_1, 𝑅_2, ..., 𝑅_𝑘} Σ_{𝑗=1}^{𝑘} (1/|𝑅_𝑗|) Σ_{x, x′ ∈ 𝑅_𝑗} ‖x − x′‖_2^2 ,   (11.11)

where |𝑅_𝑗| is the number of data points in cluster 𝑅_𝑗. The intention of (11.11) is to select the clusters such that all points within each cluster are as similar as possible. Solving (11.11) can be shown to be equivalent to selecting the clusters such that the distance to the cluster center, summed over all data points, is minimized,

    arg min_{𝑅_1, 𝑅_2, ..., 𝑅_𝑘} Σ_{𝑗=1}^{𝑘} Σ_{x ∈ 𝑅_𝑗} ‖x − 𝝁_𝑗‖_2^2 .   (11.12)

Here 𝝁_𝑗 is the center (average) of all data points x ∈ 𝑅_𝑗. Unfortunately both (11.11) and (11.12) are combinatorial problems, meaning that we cannot expect to solve them exactly if the number of data points 𝑛 is large.
A common way to heuristically solve (11.11)/(11.12) is to

(i) Set the cluster centers 𝝁_1 , 𝝁_2 , . . . , 𝝁_𝑘 to some initial values

(ii) Determine which cluster 𝑅_𝑗 each x_𝑖 belongs to, that is, compute which 𝝁_𝑗 is closest to each x_𝑖

(iii) Update each cluster center 𝝁_𝑗 as the average of all x_𝑖 that belong to 𝑅_𝑗

and then iterate steps (ii) and (iii) until convergence. This procedure, which is an instance of the so-called Lloyd's algorithm and often simply called “the 𝑘-means algorithm”, is guaranteed to converge (since we can verify that the objective in (11.12) decreases in each iteration), but it is not guaranteed to find the global optimum. In practice it is common to run it multiple times, each with a different initialization in step (i), and pick the result of the run for which the objective in (11.11)/(11.12) attains the smallest value. As an illustration of 𝑘-means, we apply it to the input data from the music classification problem in Example 2.1; a minimal code sketch of the procedure is given below.
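A bare-bones NumPy sketch of this iteration could look as follows. The initialization by picking 𝑘 random data points as centers, the placeholder data X, and 𝑘 = 3 are arbitrary choices, and the clusters are assumed to stay non-empty.

    import numpy as np

    def k_means(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # step (i): initial centers
        for _ in range(n_iter):
            # Step (ii): assign each point to its closest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step (iii): move each center to the average of its assigned points.
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers

    X = np.random.default_rng(1).normal(size=(200, 2))   # placeholder data
    labels, centers = k_means(X, k=3)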
Indeed 𝑘-means has some similarities to 𝑘-NN, in particular its heavy use of the Euclidean distance to define similarity in the input space. This implies that 𝑘-means, too, is very sensitive to the normalization of the input values. It is, however, important not to confuse the two methods. Whereas 𝑘-NN is a method for supervised machine learning, 𝑘-means is a method for the unsupervised clustering problem. Note, in particular, that 𝑘 has a different meaning in the two methods.



Figure 11.8: 𝑘-means applied to the music classification data from Example 2.1. In this example we do have input as well as output data, so the purpose of applying a clustering algorithm to the inputs only is rather unclear, but if nothing else we do it for the sake of curiosity (we could perhaps imagine that we have lost the artist labels and are attempting to recover them). We try 𝑘 = 3 (left) and 𝑘 = 5 (right). It is worth noting that for 𝑘 = 3 there happens to be almost one artist per cluster.

It is actually possible to view 𝑘-means almost as a special case of unsupervised learning of the Gaussian mixture model with 𝑘 = 𝑀. In the Gaussian mixture view, all covariance matrices in 𝑘-means are fixed to the identity matrix, and furthermore the hard prediction 𝑦̂ (instead of 𝑝(𝑦 | x)) is used in Method 11.2.

Selecting k
In 𝑘-means it is assumed that 𝑘 is known. However, when doing clustering in practice it is common that 𝑘 is a design variable left for the user to choose. Unfortunately we cannot choose 𝑘 using cross-validation, since clustering is an unsupervised method and the concept of a new data error 𝐸 new (which cross-validation builds upon, see Chapter 4) is not applicable to clustering. One heuristic for deciding 𝑘 is instead to compare the objective in (11.11), the sum of pairwise squared Euclidean distances within each cluster, for various values of 𝑘.
Selecting 𝑘 based on (11.11) is, however, not a fully automatic procedure; some interpretation is left to the user. By scrutinizing the equivalent formulation (11.12) we can see that it is minimized (zero) by the meaningless solution with 𝑘 = 𝑛 clusters and one data point per cluster. We can furthermore see that if we compare (11.12) with 𝑘 and 𝑘 + 1 clusters, the objective will always take a smaller value for 𝑘 + 1 clusters. We can therefore not blindly search for the minimum of (11.11) (which must always be found at 𝑘 = 𝑛). Instead we should look for the point where the decrease in the objective becomes smaller, that is, select 𝑘 such that the gain of going from 𝑘 − 1 to 𝑘 clusters is significantly larger than the gain of going from 𝑘 to 𝑘 + 1 clusters. For a dataset with very distinct clusters, a graph of the objective in (11.11) visually resembles an elbow, and this method for selecting 𝑘 in 𝑘-means is therefore sometimes called the elbow method. We illustrate it in Figure 11.9, and sketch it in code below.
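A small sketch of this procedure, reusing the k_means function and the placeholder data X from the sketch above (the range of 𝑘 values tried is an arbitrary choice):

    import numpy as np

    def within_cluster_ss(X, labels, centers):
        # The objective (11.12): squared distance to the assigned cluster center, summed over all points.
        return sum(np.sum((X[labels == j] - centers[j]) ** 2) for j in range(len(centers)))

    objectives = []
    for k in range(1, 10):
        labels, centers = k_means(X, k)            # from the k-means sketch above
        objectives.append(within_cluster_ss(X, labels, centers))

    # Plot (or print) the objective against k and look for the "elbow".
    for k, obj in zip(range(1, 10), objectives):
        print(k, round(obj, 1))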



Figure 11.9: For selecting 𝑘 in 𝑘-means we can use the so-called elbow method, which amounts to trying different values of 𝑘 (the upper panels) and recording the objective in (11.11) (the bottom panel). To select 𝑘, we look for a “bend” in the bottom panel. In an ideal case there is a very distinct kink, but for this particular data we could draw the conclusion 𝑘 = 2 or 𝑘 = 4, and it is up to the user to decide. Note that in this example the data has only 2 dimensions, and we can therefore also visualize the clusters themselves and compare them visually. If the data has more than two dimensions, however, we have to select 𝑘 based only on the “elbow plot” in the bottom panel.


Figure 11.10: The idea of PCA is to find the dimensions along which the data varies the most. These dimensions are parameterized by the so-called principal components. The left panel shows some data {x_𝑖}_{𝑖=1}^{𝑛}. The middle panel also shows the first (red) and second (blue) principal components. The right panel shows the data when projected onto only the first principal component (red dots). The red dots are now a one-dimensional representation {t_𝑖}_{𝑖=1}^{𝑛} of the original two-dimensional blue dots {x_𝑖}_{𝑖=1}^{𝑛}.

Principal component analysis

We will now introduce principal component analysis (PCA), a method for dimensionality reduction. We start with a dataset {x_𝑖}_{𝑖=1}^{𝑛}, where x is a 𝑝-dimensional vector x = [𝑥_1 𝑥_2 . . . 𝑥_𝑝]^T of numerical variables. The underlying assumption of dimensionality reduction is that there is some amount of redundancy in the 𝑝-dimensional representation of the data, and that it lives on (or close to) some manifold that allows a more compact representation. Dimensionality reduction therefore amounts to creating another dataset {t_𝑖}_{𝑖=1}^{𝑛}, where t is a 𝑞-dimensional vector t = [𝑡_1 𝑡_2 . . . 𝑡_𝑞]^T. Importantly, the new dataset has a smaller dimension, 𝑞 < 𝑝, but with as much “information” as possible preserved from the original dataset. The purpose of doing dimensionality reduction is to get a more compact representation of the data, and thereby (hopefully) also filter out noise and keep only the most dominant patterns. In linear dimensionality reduction we also impose the condition that t has to be a linear transformation of x, meaning that it must hold that

    t = U x,   (11.13)

where t is 𝑞 × 1, U is 𝑞 × 𝑝 and x is 𝑝 × 1, for some U. Designing a linear dimensionality reduction method therefore “only” amounts to prescribing a way of choosing the transformation matrix U.
In PCA, which is probably the most common method for linear dimensionality reduction, 𝑞 is a user choice and the transformation U is selected such that the variance of the transformed data {t_i}_{i=1}^n is maximized. The matrix U is built up of 𝑞 so-called principal components u, which are 𝑝-dimensional vectors,

\[
  \mathbf{U} = \begin{bmatrix} \mathbf{u}_1^\mathsf{T} \\ \mathbf{u}_2^\mathsf{T} \\ \vdots \\ \mathbf{u}_q^\mathsf{T} \end{bmatrix}.  \tag{11.14}
\]

The idea is illustrated in Figure 11.10.


Let us start with how to compute the first principal component u_1. Since u_1 prescribes a direction onto which x will be projected, it is convenient to let it be normalized, u_1^T u_1 = 1. We are searching for u_1 such that the variance of t_1 = u_1^T x is as large as possible. Strictly speaking we cannot maximize the variance of t_1, since we do not know the distribution of x but only have 𝑛 samples of it, {x_i}_{i=1}^n. Instead we therefore maximize the sample variance of t_1. If we denote the mean of {x_i}_{i=1}^n as x̄ = (1/𝑛) Σ_{i=1}^n x_i, the
sample variance of t_1 = u_1^T x is by definition

\[
  \frac{1}{n}\sum_{i=1}^{n} \left( \mathbf{u}_1^\mathsf{T}\mathbf{x}_i - \mathbf{u}_1^\mathsf{T}\bar{\mathbf{x}} \right)^2
  = \frac{1}{n}\sum_{i=1}^{n} \left( \mathbf{u}_1^\mathsf{T}\mathbf{x}_i\mathbf{x}_i^\mathsf{T}\mathbf{u}_1 - 2\,\mathbf{u}_1^\mathsf{T}\mathbf{x}_i\bar{\mathbf{x}}^\mathsf{T}\mathbf{u}_1 + \mathbf{u}_1^\mathsf{T}\bar{\mathbf{x}}\bar{\mathbf{x}}^\mathsf{T}\mathbf{u}_1 \right)
  = \mathbf{u}_1^\mathsf{T} \underbrace{\left( \frac{1}{n}\sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\mathsf{T} \right)}_{\mathbf{S}} \mathbf{u}_1
  = \mathbf{u}_1^\mathsf{T}\mathbf{S}\mathbf{u}_1,  \tag{11.15}
\]

where S is the sample covariance matrix for {x_i}_{i=1}^n. Maximizing the sample variance of t_1 can thus be done by solving the optimization problem

\[
  \arg\max_{\mathbf{u}_1} \; \mathbf{u}_1^\mathsf{T}\mathbf{S}\mathbf{u}_1  \tag{11.16}
\]

subject to the constraint that u_1^T u_1 = 1. Note that the constraint is important for the optimization problem to make sense: without it, the objective could be made arbitrarily large simply by rescaling u_1. By introducing a Lagrange multiplier λ_1 we can rewrite the above optimization problem as the unconstrained problem

\[
  \arg\max_{\mathbf{u}_1, \lambda_1} \; \mathbf{u}_1^\mathsf{T}\mathbf{S}\mathbf{u}_1 + \lambda_1\left(1 - \mathbf{u}_1^\mathsf{T}\mathbf{u}_1\right),  \tag{11.17}
\]

which is a quadratic problem. By taking the derivative with respect to u_1 and setting it to zero, we see that the solution must fulfill

\[
  \mathbf{S}\mathbf{u}_1 = \lambda_1\mathbf{u}_1.  \tag{11.18}
\]

We recognize this as the eigenvalue equation for S, implying that the solution u_1 to (11.16) is an eigenvector of S with eigenvalue λ_1. Furthermore, by multiplying (11.18) from the left with u_1^T we obtain an expression for the sample variance of t_1 = u_1^T x in (11.15) as

\[
  \mathbf{u}_1^\mathsf{T}\mathbf{S}\mathbf{u}_1 = \mathbf{u}_1^\mathsf{T}\lambda_1\mathbf{u}_1 = \lambda_1.  \tag{11.19}
\]

Since we seek to maximize the sample variance of t_1, we conclude that the solution u_1 to (11.16) must be the eigenvector of S with the largest eigenvalue λ_1.
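As a small numerical sanity check of this conclusion, the following sketch (plain NumPy; the synthetic data and all variable names are our own, not from the book) forms the sample covariance matrix S, extracts its eigenvector with the largest eigenvalue, verifies that u_1^T S u_1 equals λ_1 as in (11.19), and confirms that a few random unit vectors all give a smaller sample variance of the projection.

# Numerical check: the top eigenvector of S maximizes the sample variance
# of the projection u^T x. Synthetic stand-in data; names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])   # stand-in data

Xc = X - X.mean(axis=0)               # centred data
S = (Xc.T @ Xc) / X.shape[0]          # sample covariance matrix, as in (11.15)

eigvals, eigvecs = np.linalg.eigh(S)  # S is symmetric; eigenvalues ascending
u1 = eigvecs[:, -1]                   # eigenvector with the largest eigenvalue
print("lambda_1        :", eigvals[-1])
print("u1^T S u1       :", u1 @ S @ u1)   # equal, cf. (11.19)

for _ in range(5):                    # random unit vectors never do better
    u = rng.normal(size=3)
    u /= np.linalg.norm(u)
    print("random direction:", u @ S @ u)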
The number 𝑞 of principal components in U is a user choice, and if 𝑞 is chosen larger than 1 we have to find (at least) a second one, u_2. We will not give a complete derivation of the second principal component u_2, but considering that it should be chosen such that the sample variance of t_2 is maximized while at the same time not “overlapping” with t_1, and the fact that every 𝑝 × 𝑝 symmetric matrix has a set of 𝑝 orthogonal eigenvectors, it is perhaps not hard to accept that the second principal component u_2 is the eigenvector of S with the second largest eigenvalue λ_2. In fact we have the general pattern that the 𝑗th principal component u_j is the eigenvector of S with the 𝑗th largest eigenvalue λ_j. If some eigenvalues of S are not distinct, there is a corresponding degree of freedom in choosing the principal components. There is also a strong connection between PCA and the singular value decomposition, which we do not discuss any further here.
It can be good practice to normalize the data, as in (2.2), before applying PCA, in particular if the different variables have different scales. This is, however, more of an art than a science, in particular since the overall goal of dimensionality reduction is defined in rather vague terms. If all variables are measured in the same unit, for example, one can argue that PCA is more informative without normalization. A code sketch that collects the steps of the PCA procedure follows below.
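The following is a minimal PCA sketch in NumPy. The function name, the normalize flag, and the synthetic data are our own illustrative choices and not notation from the book: the data is centred (and optionally scaled), the sample covariance matrix is formed, its 𝑞 eigenvectors with the largest eigenvalues become the rows of U, and the lower-dimensional representation t_i is obtained by projection.

# Minimal PCA sketch (NumPy only). All names here are illustrative.
import numpy as np

def pca(X, q, normalize=False):
    """Return the q-dimensional representation T, the matrix U, and the
    corresponding eigenvalues, for an n-by-p data matrix X."""
    Xc = X - X.mean(axis=0)               # centre the data
    if normalize:                         # optional scaling, cf. (2.2)
        Xc = Xc / Xc.std(axis=0)
    S = (Xc.T @ Xc) / X.shape[0]          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:q] # indices of the q largest eigenvalues
    U = eigvecs[:, order].T               # q x p, rows are u_1, ..., u_q
    T = Xc @ U.T                          # n x q, rows are the t_i
    return T, U, eigvals[order]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))             # stand-in data
T, U, lam = pca(X, q=2)
print(T.shape, U.shape, lam)

Note that the sketch projects the centred data; this is a common convention and does not change the directions u_j. In practice PCA is usually computed via the singular value decomposition of the centred data matrix rather than by forming S explicitly, which is numerically more favourable and yields the same principal components.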

