
Machine Learning

The Science of Selection under Uncertainty

Yevgeny Seldin

June 27, 2023


Foreword

The material was developed in the process of teaching the following courses:

• Machine Learning A, Department of Computer Science, University of Copenhagen. (2021 - present)


• Machine Learning B, Department of Computer Science, University of Copenhagen. (2021 - present)
• Online and Reinforcement Learning, Department of Computer Science, University of Copenhagen.
(2021 - present)

• Machine Learning, Department of Computer Science, University of Copenhagen. (2015 - 2021)


• Advanced Topics in Machine Learning, Department of Computer Science, University of Copenhagen. (2015 - 2020)
The material is periodically updated (check the compilation date on the title page). The courses are
co-taught by me, Christian Igel, Sadegh Talebi, and Fabian Gieseke. The material only covers my part
of the above courses.

I would like to thank all students who have pointed out typos and flaws in the lecture notes. There
are certainly more, and if you spot any, please report them to me at [email protected]. Your feedback will
serve future generations of students.
Contents

1 Supervised Learning 3
1.1 The Supervised Learning Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Classification, Regression, and Other Supervised Learning Problems . . . . . . . . 4
1.1.2 The Loss Function ℓ(Y′, Y) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 K Nearest Neighbors for Binary Classification . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 How to Pick K in K-NN? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Test Set: It’s not about what you call it, it’s about how you use it! . . . . . . . . . 7
1.3.2 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Perceptron - Basic Algorithm for Linear Classification . . . . . . . . . . . . . . . . . . . . 8

2 Concentration of Measure Inequalities 10


2.1 Markov’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Chebyshev’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Hoeffding’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Understanding Hoeffding’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Basics of Information Theory: Entropy, Relative Entropy, and the Method of Types . . . 15
2.5 kl Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.1 Relaxations of the kl-inequality: Pinsker’s and refined Pinsker’s inequalities . . . . 17
2.6 Sampling Without Replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Generalization Bounds for Classification 20


3.1 Overview: Learning by Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Generalization Bound for a Single Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Generalization Bound for Finite Hypothesis Classes . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Occam’s Razor Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.1 Applications of Occam’s Razor bound . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Vapnik-Chervonenkis (VC) Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5.1 The VC Analysis: Symmetrization . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5.2 Bounding the Growth Function: The VC-dimension . . . . . . . . . . . . . . . . . 33
3.6 VC Analysis of SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7 VC Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.8 PAC-Bayesian Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.8.1 Relation and Differences with other Learning Approaches . . . . . . . . . . . . . . 40
3.8.2 A Proof of PAC-Bayes-kl Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.8.3 Application to SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.8.4 Relaxation of PAC-Bayes-kl: PAC-Bayes-λ Inequality . . . . . . . . . . . . . . . . 43
3.8.5 Alternating Minimization of PAC-Bayes-λ Bound . . . . . . . . . . . . . . . . . . . 44
3.8.6 Construction of a Hypothesis Space for PAC-Bayes-λ . . . . . . . . . . . . . . . . . 44
3.9 PAC-Bayesian Analysis of Ensemble Classifiers . . . . . . . . . . . . . . . . . . . . . . . . 45
3.9.1 Ensemble Classifiers and Weighted Majority Vote . . . . . . . . . . . . . . . . . . . 45
3.9.2 First Order Oracle Bound for the Weighted Majority Vote . . . . . . . . . . . . . . 45
3.9.3 Second Order Oracle Bound for the Weighted Majority Vote . . . . . . . . . . . . 46
3.9.4 Comparison of the First and Second Order Oracle Bounds . . . . . . . . . . . . . . 47

3.9.5 Second Order PAC-Bayesian Bounds for the Weighted Majority Vote . . . . . . . . 48
3.9.6 Ensemble Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.9.7 Comparison of the Empirical Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4 Supervised Learning - Regression 50


4.1 Linear Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1.1 Analytical Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1.2 Algebraic Approach - Fast Track . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.3 Algebraic Approach - Complete Picture . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.4 Using Linear Least Squares for Learning Coefficients of Non-linear Models . . . . . 52

5 Online Learning 53
5.1 The Space of Online Learning Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2 A General Basic Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3 I.I.D. (stochastic) Multiarmed Bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.4 Prediction with Expert Advice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.4.1 Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.5 Adversarial Multiarmed Bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.5.1 Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6 Adversarial Multiarmed Bandits with Expert Advice . . . . . . . . . . . . . . . . . . . . . 68
5.6.1 Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

A Set Theory Basics 71

B Probability Theory Basics 72


B.1 Axioms of Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
B.2 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
B.3 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
B.4 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
B.5 The Bernoulli and Binomial Random Variables . . . . . . . . . . . . . . . . . . . . . . . . 76
B.6 Jensen’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

C Linear Algebra 77

D Calculus 80
D.1 Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

Chapter 1

Supervised Learning

The most basic and widespread form of machine learning is supervised learning. In the classical batch
supervised learning setting the learner is given an annotated sample, which is used to derive a prediction
rule for annotating new samples. We start with a simple informal example and then formalize the
problem.
Let’s say that we want to build a prediction rule that will use the average grade of a student in
home assignments, say on a 100-point scale, to predict whether the student will pass the final exam.
Such a prediction rule could be used for preliminary filtering of students to be allowed to take the final
exam. The annotated sample could be a set of average grades of students from the previous year with
indications of whether they have passed the final exam. The prediction rule could take the form of a
threshold grade (a.k.a. a decision stump), above which the student is expected to pass and below which to fail.
Now assume that we want to take a more refined approach and look into individual grades in each
assignment, say, 5 assignments in total. For example, different assignments may have different relevance
for the final exam or, maybe, some students may demonstrate progression throughout the course, which
would mean that their early assignments should not be weighted equally with the later ones. In the refined
approach each student can be represented by a point in a 5-dimensional space. The one-dimensional
threshold could be replaced by a separating hyperplane, which separates the 5-dimensional space of
grades into a linear subspace, where most students are likely to pass, and the complement, where they
are likely to fail. An alternative approach is to look at “nearest neighbors” of a student in the space
of grades. Given a grade profile of a student (the point representing the student in the 5-dimensional
space) we look at students with the closest grade profile and see whether most of them passed or failed.
This is known as the K Nearest Neighbors algorithm, where K is the number of neighbors we look at.
But how many neighbors K should we look at? Considering the extremes gives some intuition about
the problem. Taking just one nearest neighbor may be unreliable. For example, a good student who
accidentally failed the final exam would cause all of his or her neighbors to be marked as “expected to
fail”. Going to the other extreme and taking all the students as neighbors is also undesirable, because
it effectively ignores the individual profile altogether. So a good value of K should be somewhere
between 1 and n, where n is the size of the annotated set. But how do we find it? Read on and you
will learn how to approach this question formally.

1.1 The Supervised Learning Setting


We start with a bunch of notations and then illustrate them with examples.
• X - the sample space.
• Y - the label space.
• X ∈ X - unlabeled sample.
• (X, Y ) ∈ (X × Y) - labeled sample.
• S = {(X1 , Y1 ), . . . , (Xn , Yn )} - a training set. We assume that (Xi , Yi ) pairs in S are sampled i.i.d.
according to an unknown, but fixed distribution p(X, Y ).

• h : X → Y - a hypothesis, which is a function from X to Y.
• H - a hypothesis set.
• ℓ(Y′, Y) - the loss function for predicting Y′ instead of Y.

• L̂(h, S) = (1/n) ∑_{i=1}^n ℓ(h(Xi), Yi) - the empirical loss (a.k.a. error or risk) of h on S. (In many
textbooks S is omitted from the notation and L̂(h) or L̂_n(h) is used to denote L̂(h, S).)

• L(h) = E[ℓ(h(X), Y)] - the expected loss (a.k.a. error or risk) of h, where the expectation is taken
with respect to p(X, Y).
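To make the notation concrete, here is a minimal Python sketch (my addition, not part of the original notes) that estimates the empirical zero-one loss L̂(h, S) of a fixed prediction rule h on a labeled sample. The threshold rule and the synthetic grade data are purely illustrative assumptions.

```python
import numpy as np

def empirical_loss(h, X, y):
    """Empirical zero-one loss: fraction of samples where h(X_i) != Y_i."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Illustrative example: a threshold rule on average homework grades.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=200)                       # average grade of each student
y = np.where(X + rng.normal(0, 15, 200) > 50, 1, -1)    # noisy pass/fail labels

h = lambda x: 1 if x > 50 else -1                       # a fixed hypothesis (decision stump)
print("empirical zero-one loss:", empirical_loss(h, X, y))
```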

The Learning Protocol

Classical supervised learning proceeds according to the following protocol:
1. The learner gets a training set S of size n sampled i.i.d. according to p(X, Y ).

2. The learner returns a prediction rule h.


3. New instances (X, Y ) are sampled according to p(X, Y ), but only X is observed and h is used to
predict the unobserved Y .
The goal of the learner is to return h that minimizes L(h), which is the expected error on
new samples.

Examples - Sample and Label Spaces


Let’s say that we want to predict a person’s height based on age, gender, and weight. Then X = N ×
{±1} × R and Y = R. If we want to predict gender based on age, weight, and height, then X = N × R × R
and Y = {±1}. If we want to predict the height of a baby at the age of 4 years based on his or her height
at the ages of 1, 2, and 3 years, then X = R3 and Y = R.

1.1.1 Classification, Regression, and Other Supervised Learning Problems


The most widespread forms of supervised learning are classification and regression. We also mention a
few more, mainly to show that the supervised learning setting is much richer.

Classification A supervised learning problem is a classification problem when the output (label) space
Y is binary. The goal of the learning algorithm is to separate between two classes: yes or no; good or
bad; healthy or sick; male or female; etc. Most often the translation of the binary label into numerical
representation is done by either taking Y = {±1} or Y = {0, 1}. Sometimes the setting is called binary
classification to emphasize that Y takes just two values.

Regression A supervised learning problem is a regression problem when the output space Y = R. For
example, prediction of a person’s height would be a regression problem.

Multiclass Classification When Y consists of a finite and typically unordered and relatively small set
of values, the corresponding supervised learning problem is called multiclass classification. For example,
prediction of a study program a student will apply for based on his or her grades would be a multiclass
classification problem. Finite ordered output spaces, for example, prediction of age or age group can
also be modeled as multiclass classification, but it may be possible to exploit the structure of Y to
obtain better solutions. For example, it may be possible to exploit the fact that ages 22 and 23 are
close together, whereas 22 and 70 are far apart; therefore, it may be possible to share some information
between close ages, as well as exploit the fact that predicting 22 instead of 23 is not such a big mistake
as predicting 22 instead of 70. Depending on the setting, it may be preferable to model prediction of
ordered sets as regression rather than multiclass classification.

Structured Prediction Consider the problem of machine translation. An algorithm gets a sentence
in English as an input and should produce a sentence in Danish as an output. In this case the output
(the sentence in Danish) is not merely a number, but a structured object and such prediction problems
are known as structured prediction.

1.1.2 The Loss Function ℓ(Y′, Y)

The loss function (a.k.a. the error function) encodes how much the user of an algorithm cares about
various kinds of mistakes. Most literature on binary classification, including these lecture notes, uses the
zero-one loss defined by

ℓ(Y′, Y) = 1(Y′ ≠ Y) = { 1, if Y′ ≠ Y;  0, otherwise },

where 1 is the indicator function. Common loss functions in regression are the square loss

ℓ(Y′, Y) = (Y′ − Y)²

and the absolute loss

ℓ(Y′, Y) = |Y′ − Y|.
The above loss functions are convenient general choices, but not necessarily the right choice for a
particular application. For example, imagine that you design an algorithm for a fire alarm that predicts
“fire / no fire”. Assume that the cost of a house is 3,000,000 DKK and the cost of calling in a fire brigade
is 2,000 DKK. Then the loss function ℓ(Y′, Y) would be

                     Y = no fire     Y = fire
  Y′ = no fire            0          3,000,000
  Y′ = fire             2,000            0

The loss for making the correct prediction is zero, but the losses of a false positive (predicting fire when in
reality there is no fire) and of a false negative (predicting no fire when the reality is fire) are not symmetric
anymore.
Note that the loss depends on how the predictions are used, and the loss table depends on
the user. For example, if the same alarm is installed in a house that is worth 10,000,000 DKK, the ratio
between the cost of false positives and false negatives will be very different and, as a result, the optimal
prediction strategy will not necessarily be the same.

1.2 K Nearest Neighbors for Binary Classification


One of the simplest algorithms for binary classification is K Nearest Neighbors (K-NN). The algorithm
is based on an externally provided distance function d(x, x′) that computes distances between pairs of
points x and x′. For example, for points in Rd the distance could be the Euclidean distance

d(x, x′) = ‖x − x′‖ = √( ∑_{j=1}^d (xj − x′j)² ) = √( (x − x′)ᵀ (x − x′) ),

where x = (x1, . . . , xd) and xj is the j-th coordinate of vector x. Other choices of distance measures are
possible and, in general, lead to different predictions. The choice of the distance measure d(x, x′) is the
key for success or failure of K-NN, but we leave the topic of selection of d outside the scope of the lecture notes.
The K-NN algorithm takes as input a set of training points S = {(x1, y1), . . . , (xn, yn)} and predicts the
label of a target point x based on the majority vote of the K points from S that are closest to x in
terms of the distance measure d(xi, x).
The ordering of the di-s in Step 3 of the algorithm below is identical to the ordering of the di², so for the
Euclidean distance we can save the computation of the square root by working with squared distances.
The hypothesis space H is implicit in the K-NN algorithm. It is the space of all possible parti-
tions of the sample space X . The output hypothesis h is parametrized by all training points hS =
h{(x1 ,y1 ),...,(xn ,yn )} . In the sequel we will see other prediction rules that operate with more explicit
hypothesis spaces, for example, a space of all linear separators.

Algorithm 1 K Nearest Neighbors (K-NN) for Binary Classification with Y = {±1}
1: Input: A set of labeled points {(x1, y1), . . . , (xn, yn)} and a target point x that has to be classified.
2: Calculate the distances di = d(xi, x).
3: Sort the di-s in ascending order and let σ : {1, . . . , n} → {1, . . . , n} be the corresponding permutation of
indices. In other words, for any pair of indices i < j we should have dσ(i) ≤ dσ(j).
4: The output of K-NN is y = sign( ∑_{i=1}^K yσ(i) ). It is the majority vote of the K points that are closest
to x. Note that we can calculate the output of K-NN for all K in one shot.
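As a complement to the pseudocode, the following Python sketch (my illustration, not code from the notes) implements the same steps for points in Rd with the Euclidean distance. It works with squared distances, as noted above, and breaks ties in favor of +1, which is an arbitrary choice; the random data are assumed for demonstration only.

```python
import numpy as np

def knn_predict(X_train, y_train, x, K):
    """Predict the label of x by a majority vote of its K nearest neighbors."""
    # Squared Euclidean distances (same ordering as the distances themselves).
    d2 = np.sum((X_train - x) ** 2, axis=1)
    nearest = np.argsort(d2)[:K]          # indices of the K closest points
    vote = np.sum(y_train[nearest])       # labels are +1 / -1
    return 1 if vote >= 0 else -1

# Illustrative usage on synthetic data with labels in {-1, +1}.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))       # 100 students, 5 assignment grades
y_train = np.where(X_train.sum(axis=1) > 0, 1, -1)
print(knn_predict(X_train, y_train, rng.normal(size=5), K=3))
```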

1.2.1 How to Pick K in K-NN?


One of the key questions in K-NN is how to pick K. It is instructive to consider the extreme cases to
gain some intuition. In 1-NN the prediction is based on a single sample (xi , yi ) which happens to be
closest to the target point x. This may not be the best thing to do. Imagine that you are admitted to a
hospital and a diagnostic system determines whether you are healthy or sick based on a single annotated
patient that has the symptoms closest to yours (in distance measure d). You would likely prefer to be
diagnosed based on the majority of diagnoses of several patients with similar symptoms. At the other
extreme, in n-NN, where n is the number of samples in S, the prediction is based on the majority of
labels yi within the sample S, without even taking any particular x into account. So the desirable K is
somewhere between 1 and n, but how to find it?
Let hK-NN denote the prediction rule of K-NN. As K goes from 1 to n, K-NN provides n different
prediction rules, h1-NN , h2-NN , . . . , hn-NN (or half of that if we only take the odd values of K). Recall that
we are interested in finding K that minimizes the expected loss L(hK-NN ) and that L(hK-NN ) is unobserved.
We can calculate the empirical loss L̂(hK-NN, S) for any K. However, L̂(h1-NN, S) is always zero¹, and in
general the empirical error of K-NN is an underestimate of its expected error, so we need other tools
to estimate L(hK-NN ). We start developing these tools in the next section and continue throughout the
lecture notes.

1.3 Validation
Whenever we select a hypothesis ĥ∗S out of a hypothesis set H based on empirical performances L̂(h, S),
the empirical performance L̂(ĥ∗S , S) becomes a biased estimate of L(ĥ∗S ). This is clearly observed in
1-NN, where L̂(h1-NN , S) = 0, but L(h1-NN ) is most often not zero (we remind that the hypothesis space
in 1-NN is the space of all possible partitions of the sample space X and h1-NN is the hypothesis that
achieves the minimal empirical error in this space). The reason is that when we do the selection we pick
ĥ∗S that is best suited for S (it achieves the minimal L̂(h, S) out of all h in H). Therefore, from the
perspective of ĥ∗S the new samples (X, Y ) are not “similar” to the samples (Xi , Yi ) in S. A bit more
precisely, (X, Y) is not exchangeable with (Xi, Yi), because if we exchanged (Xi, Yi) with (X, Y) it
is likely that ĥ∗S, the hypothesis that minimizes L̂(h, S), would be different. Again, this is very clear in
1-NN: if we change one sample (Xi, Yi) in S we get a different prediction rule h1-NN. We get back to this
topic in much more detail in Chapter 3, after we develop some mathematical tools for analyzing the bias
in Chapter 2. For now we present a simple solution for estimating L(ĥ∗S) and motivate why we need the
tools from Chapter 2.
The solution is to split the sample set S into training set Strain and validation set Sval . We can then
find the best hypothesis for the training set, h∗Strain , and validate it on the validation set by computing
L̂(h∗Strain , Sval ). Note that from the perspective of h∗Strain the samples in Sval are exchangeable with any
new samples (X, Y). If we exchange (Xi, Yi) ∈ Sval with another sample (X, Y) coming from the same
distribution, h∗Strain will stay the same and in expectation E[ℓ(h∗Strain(Xi), Yi)] = E[ℓ(h∗Strain(X), Y)],
meaning that on average L̂(h∗Strain, Sval) will also stay the same (only on average, the exact value may
change). Therefore, L̂(h∗Strain, Sval) is an unbiased estimate of L(h∗Strain). (We get back to this point in
much more detail in Chapter 3.)
¹ This is because the closest point in S to a sample point xi is xi itself, and we assume that S includes no identical points
with dissimilar labels, which is a reasonable assumption if X = Rd.

Now we get to the question of how to split S into Strain and Sval , and again it is very instructive to
consider the extreme cases. Imagine that we keep a single sample for validation and use the remaining
n − 1 samples for training. Let’s say that we keep the last sample, (Xn, Yn), for validation; then
L̂(h∗Strain, Sval) = ℓ(h∗Strain(Xn), Yn), and in the case of zero-one loss it is either zero or one. Even though
L̂(h∗Strain , Sval ) is an unbiased estimate of L(h∗Strain ), it clearly does not represent it well. At the other
extreme, if we keep n − 1 points for validation and use the single remaining point for training we run into
a different kind of problem: a classifier trained on a single point is going to be extremely weak. Let’s
say that we have used the first point, (X1 , Y1 ), for training. In the case of K-NN classifier, as well as
most other classifiers, h∗Strain will always predict Y1 , no matter what input it gets. The validation error
L̂(h∗Strain , Sval ) will be a very good estimate of L(h∗Strain ), but this is definitely not a classifier we want.
So how many samples from S should go into Strain and how many into Sval ? Currently there is no
“gold answer” to this question, but in Chapters 2 and 3 we develop mathematical tools for intelligent
reasoning about it. An important observation to make is that for h independent of (X, Y ) the zero-one
loss ℓ(h(X), Y) is a Bernoulli random variable with bias P(ℓ(h(X), Y) = 1) = L(h). Furthermore, when h
is independent of a set of samples {(X1, Y1), . . . , (Xm, Ym)} (i.e., these samples are not used for selecting
h), the losses ℓ(h(Xi), Yi) are independent identically distributed (i.i.d.) Bernoulli random variables with
bias L(h). Therefore, when Sval is of size m, the validation loss L̂(h∗Strain, Sval) is an average of m i.i.d.
Bernoulli random variables with bias L(h∗Strain ). The validation loss L̂(h∗Strain , Sval ) is observed, but the
expected loss that we are actually interested in is unobserved. One of the key questions that we are
interested in is how far L̂(h∗Strain , Sval ) can be from L(h∗Strain ). We have already seen that m = 1 is
too little. But how large should it be, 10, 100, 1000? Essentially this question is equivalent to asking
how many times do we need to flip a biased coin in order to get a satisfactory estimate of its bias. In
Chapter 2 we develop concentration of measure inequalities that answer this question.
Another technical question is which samples should go into Strain and which into Sval ? From the
theoretical perspective we assume that S is sampled i.i.d. and, therefore, it does not matter. We can
take the first n − m samples into Strain and the last m into Sval or split in any other way. From a
practical perspective the samples may actually not be i.i.d. and there could be some parameter that has
influenced their order in S. For example, they could have been ordered alphabetically. Therefore, from a
practical perspective it is desirable to take a random permutation of S before splitting, unless the order
carries some information we would like to preserve. For example, if S is a time-ordered series of product
reviews and we would like to build a classifier that classifies them into positive and negative, we may
want to get an estimate of temporal variation and keep the order when we do the split, i.e., train on the
earlier samples and validate on the later.
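The following Python sketch (an illustration of the procedure just described, not code from the notes) splits S into Strain and Sval after a random permutation and selects K by the validation loss. The function knn_predict is assumed to be the K-NN classifier sketched in Section 1.2; the split fraction is an arbitrary choice.

```python
import numpy as np

def holdout_select_k(X, y, knn_predict, candidate_ks, val_fraction=0.2, seed=0):
    """Split S into S_train and S_val and pick the K with the smallest validation loss."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))            # random permutation before splitting
    n_val = int(val_fraction * len(X))
    val, train = perm[:n_val], perm[n_val:]
    losses = {}
    for K in candidate_ks:
        preds = np.array([knn_predict(X[train], y[train], x, K) for x in X[val]])
        losses[K] = np.mean(preds != y[val])  # validation zero-one loss
    best_k = min(losses, key=losses.get)
    return best_k, losses
```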

1.3.1 Test Set: It’s not about what you call it, it’s about how you use it!
Assume that we have split S into Strain and Sval ; we have trained h1-NN , . . . , hn-NN on Strain ; we calculated
L̂(h1-NN , Sval ), . . . , L̂(hn-NN , Sval ) and picked the value K ∗ that minimizes L̂(hK-NN , Sval ). Is L̂(hK∗ -NN , Sval )
an unbiased estimate of L(hK∗ -NN )?
This is probably one of the most conceptually difficult points about validation, at least when you
encounter it for the first time. While for each hK-NN individually L̂(hK-NN , Sval ) is an unbiased estimate
of L(hK-NN ), the validation loss L̂(hK∗ -NN , Sval ) is a biased estimate of L(hK∗ -NN ). This is because Sval was
used for selection of K ∗ and, therefore, hK∗ -NN depends on Sval . So if we want to get an unbiased estimate
of L(hK∗ -NN ) we have to reserve some “fresh” data for that. So we need to split S into Strain , Sval , and
Stest ; train the K-NN classifiers on Strain ; pick the best K ∗ based on L̂(hK-NN , Sval ); and then compute
L̂(hK∗ -NN , Stest ) to get an unbiased estimate of L(hK∗ -NN ).

It’s not about what you call it, it’s about how you use it! Some people think that if you call
some data a test set it automatically makes loss estimates on this set unbiased. This is not true. Imagine
that you have split S into Strain , Sval , and Stest ; you trained K-NN on Strain , picked the best value K ∗
using Sval , and estimated the loss of hK∗ -NN on Stest . And now you are unhappy with the result and you
want to try a different learning method, say a neural network. You go through the same steps: you train
networks with various parameter settings on Strain , you validate them on Sval , and you pick the best
parameter set θ∗ based on the validation loss. Finally, you compute the test loss of the neural network
parametrized by θ∗ on Stest . It happens to be lower than the test loss of K ∗ -NN and you decide to go

with the neural network. Does the empirical loss of the neural network on Stest represent an unbiased
estimate of its expected loss? No! Why? Because our choice to pick the neural network was based on
its superior performance relative to hK∗ -NN on Stest , so Stest was used in selection of the neural network.
Therefore, there is dependence between Stest and the hypothesis we have selected, and the loss on Stest
is biased. If we want to get an unbiased estimate of the loss we have to find new “fresh” data or reserve
such data from the start and keep it in a locker until the final evaluation moment. Alternatively, we
can correct for the bias and in Chapter 3 we will learn some tools for making the correction. The main
take-home message is: It is not about what you call a data set, Strain, Sval, or Stest; it is the
way you use it that determines whether you get unbiased estimates or not! In some cases it
is possible to get unbiased estimates or to correct for the bias already with Strain , and sometimes there
is bias even on Stest and we need to correct for that.

1.3.2 Cross-Validation
Sometimes it feels wasteful to use only part of the data for training and part for validation. A heuristic
way around it is cross-validation. In the standard N -fold cross-validation setup the data S are split into
N non-overlapping folds S1 , . . . , SN . Then for i ∈ {1, . . . , N } we train on all folds except the i-th and
validate on Si . We then take the average of the N validation errors and pick the parameter that achieves
the minimum (for example, the best K in K-NN). Finally, we train a model with the best parameter we
have selected in the cross-validation procedure (for example the best K ∗ in K-NN) using all the data S.
The standard cross-validation procedure described above is a heuristic and has no theoretical guarantees.
It is fairly robust and widely used in practice, but it is possible to construct examples where
it fails. In Chapter 3 we describe a modification of the cross-validation procedure, which comes with
theoretical generalization guarantees and is empirically competitive with the standard cross-validation
procedure.
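A minimal sketch of the N-fold procedure just described (my illustration only). Here model_loss is an assumed helper that trains a model with a given parameter on the training folds and returns its validation loss on the held-out fold.

```python
import numpy as np

def cross_validate(X, y, candidate_params, model_loss, n_folds=5, seed=0):
    """Return the parameter with the smallest average validation loss over N folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)   # non-overlapping folds
    avg_loss = {}
    for param in candidate_params:
        fold_losses = []
        for i in range(n_folds):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            # train on all folds except the i-th, validate on the i-th
            fold_losses.append(model_loss(X[train], y[train], X[val], y[val], param))
        avg_loss[param] = np.mean(fold_losses)
    return min(avg_loss, key=avg_loss.get)
```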

1.4 Perceptron - Basic Algorithm for Linear Classification


Linear classification is another basic family of classification strategies. Let X = Rd and Y = {±1}. A
hyperplane in Rd is described by a tuple (w, b), where w ∈ Rd and b ∈ R. The points x on the hyperplane
are described by the equation
wT x + b = 0.
A linear classifier h = (w, b) assigns label +1 to all points on the “positive” side of the hyperplane and
−1 on the “negative” side of the hyperplane. Specifically,
h(x) = sign(wT x + b).


Homogeneous classifiers We distinguish between homogeneous linear classifiers and non-homogeneous
linear classifiers. A homogeneous linear classifier is described by a hyperplane passing through the origin.
From the mathematical point of view it means that b = 0.
We note that any linear classifier in Rd can be transformed into a homogeneous linear classifier in
Rd+1 by the following transformation:

x → (x; 1)
{w, b} → (w; b)

(where by “;” we mean that we append a row to a column vector). In other words, we append “1” to
the x vector and combine w and b into one vector in Rd+1. Note that wT x + b = (w; b)T (x; 1) and,
therefore, the predictions of the transformed model are identical to the predictions of the original model.
Through this transformation any learning algorithm for homogeneous classifiers can be directly applied
to learning non-homogeneous classifiers.

Hypothesis space The hypothesis space in linear classification is the space of all possible separating
hyperplanes. If we are talking about homogeneous linear classifiers then it is restricted to hyperplanes
passing through the origin. Thus, for homogeneous linear classifiers H = Rd and for general linear
classifiers H = Rd+1 .

Perceptron algorithm Perceptron is the simplest algorithm for learning homogeneous separating
hyperplanes. It operates under the assumption that the data are separable by a homogeneous
hyperplane, meaning that there exists a hyperplane passing through the origin that perfectly separates
positive points from negative.

Algorithm 2 Perceptron
1: Input: A training set {(x1 , y1 ), . . . , (xn , yn )}
2: Initialization: w1 = 0 (where 0 is the zero vector)
3: t=1
4: while exists (xit , yit ), such that yit (wtT xit ) ≤ 0 do
5: wt+1 = wt + yit xit
6: t=t+1
7: end while
8: Return: wt

Note that a point (x, y) is classified correctly if ywT x > 0 and misclassified if ywT x ≤ 0. Thus, the
selection step (line 4 in the pseudocode) picks a misclassified point, as long as one exists. The
update step (line 5 in the pseudocode) rotates the hyperplane w so that the classification is “improved”.
Specifically, the following property is satisfied: if (xit, yit) is the point selected at step t, then
yit wt+1T xit > yit wtT xit (verification of this property is left as an exercise to the reader). Note that this
property does not guarantee that after the update wt+1 will classify (xit, yit) correctly. But it will rotate
in the right direction and after sufficiently many updates (xit, yit) will end up on the right side of the
hyperplane. Also note that while the classification of (xit, yit) is improved, it may go the opposite way
for other points. As long as the data are linearly separable, the algorithm will eventually find the separation.
The algorithm does not specify the order in which misclassified points are selected. Two natural
choices are sequential and random. We leave it as an exercise to the reader to check which of the two
choices leads to faster convergence of the algorithm.
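Below is a minimal Python sketch of the Perceptron (my illustration rather than code from the notes). It uses sequential selection of misclassified points and a cap on the number of updates, both of which are arbitrary choices; the data are assumed to be linearly separable by a hyperplane through the origin.

```python
import numpy as np

def perceptron(X, y, max_updates=10_000):
    """Learn a homogeneous separating hyperplane; X is n x d, y in {-1, +1}.

    Append a constant 1 to each x beforehand to handle the non-homogeneous case."""
    w = np.zeros(X.shape[1])                 # w_1 = 0
    for _ in range(max_updates):
        margins = y * (X @ w)
        misclassified = np.where(margins <= 0)[0]
        if len(misclassified) == 0:          # all points classified correctly
            return w
        i = misclassified[0]                 # sequential selection of a misclassified point
        w = w + y[i] * X[i]                  # rotate the hyperplane towards correct classification
    return w
```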

Chapter 2

Concentration of Measure
Inequalities

Concentration of measure inequalities are one of the main tools for analyzing learning algorithms. This
chapter is devoted to a number of concentration of measure inequalities that form the basis for the results
discussed in later chapters.

2.1 Markov’s Inequality


Markov’s Inequality is the simplest and relatively weak concentration inequality. Nevertheless, it forms
the basis for many much stronger inequalities that we will see in the sequel.
Theorem 2.1 (Markov’s Inequality). For any non-negative random variable X and ε > 0:

P(X ≥ ε) ≤ E[X] / ε.

Proof. Define a random variable Y = 1(X ≥ ε) to be the indicator function of whether X exceeds ε. Then
Y ≤ X/ε (see Figure 2.1). Since Y is a Bernoulli random variable, E[Y] = P(Y = 1) (see Appendix B).
We have:

P(X ≥ ε) = P(Y = 1) = E[Y] ≤ E[X/ε] = E[X] / ε.

Check yourself: where in the proof do we use the non-negativity of X and the strict positivity of ε?

Figure 2.1: Relation between the identity function and the indicator function.

By denoting the right hand side of Markov’s inequality by δ we obtain the following equivalent
statement. For any non-negative random variable X:

P( X ≥ (1/δ) E[X] ) ≤ δ.

Example. We would like to bound the probability that we flip a fair coin 10 times and obtain 8 or more
heads. Let X1, . . . , X10 be i.i.d. Bernoulli random variables with bias 1/2. The question is equivalent to
asking what is the probability that ∑_{i=1}^{10} Xi ≥ 8. We have E[∑_{i=1}^{10} Xi] = 5 (the reader is invited to
prove this statement formally) and by Markov’s inequality

P( ∑_{i=1}^{10} Xi ≥ 8 ) ≤ E[∑_{i=1}^{10} Xi] / 8 = 5/8.

We note that even though Markov’s inequality is weak, there are situations in which it is tight. We
invite the reader to construct an example of a random variable for which Markov’s inequality is tight.
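As a small aside (my addition, not part of the notes), the coin-flip example above is easy to check numerically: the sketch below compares the Markov bound with the empirical frequency of the event over simulated experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=(100_000, 10))      # 100,000 experiments of 10 fair coin flips
empirical = np.mean(flips.sum(axis=1) >= 8)          # frequency of "8 or more heads"
markov_bound = 5 / 8                                  # E[sum] / 8

print(f"empirical probability: {empirical:.4f}")      # about 0.055
print(f"Markov bound:          {markov_bound:.4f}")   # 0.625, valid but loose
```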

2.2 Chebyshev’s Inequality


Our next stop is Chebyshev’s inequality, which exploits variance to obtain tighter concentration.

Theorem 2.2 (Chebyshev’s inequality). For any ε > 0

P(|X − E[X]| ≥ ε) ≤ Var[X] / ε².

Proof. The proof uses a transformation of a random variable. We have that P(|X − E[X]| ≥ ε) =
P((X − E[X])² ≥ ε²), because the first statement holds if and only if the second holds. In addition,
using Markov’s inequality and the fact that (X − E[X])² is a non-negative random variable, we have

P(|X − E[X]| ≥ ε) = P((X − E[X])² ≥ ε²) ≤ E[(X − E[X])²] / ε² = Var[X] / ε².

Check yourself: where in the proof did we use the positiveness of ε?

In order to illustrate the relative advantage of Chebyshev’s inequality compared to Markov’s, consider
the following example. Let X1, . . . , Xn be n independent identically distributed Bernoulli random variables
and let µ̂n = (1/n) ∑_{i=1}^n Xi be their average. We would like to bound the probability that µ̂n deviates
from E[µ̂n] by more than ε (this is the central question in machine learning). We have E[µ̂n] = E[X1] = µ
and by independence of the Xi-s and Theorem B.26 we have Var[µ̂n] = (1/n²) Var[nµ̂n] = (1/n²) ∑_{i=1}^n Var[Xi] =
(1/n) Var[X1]. By Markov’s inequality

P(µ̂n − E[µ̂n] ≥ ε) = P(µ̂n ≥ E[µ̂n] + ε) ≤ E[µ̂n] / (E[µ̂n] + ε) = E[X1] / (E[X1] + ε).

Note that as n grows the inequality stays the same. By Chebyshev’s inequality we have

P(µ̂n − E[µ̂n] ≥ ε) ≤ P(|µ̂n − E[µ̂n]| ≥ ε) ≤ Var[µ̂n] / ε² = Var[X1] / (nε²).

Note that as n grows the right hand side of the inequality decreases at the rate of 1/n. Thus, in this case
Chebyshev’s inequality is much tighter than Markov’s, and it illustrates that as the number of random
variables grows the probability that their average significantly deviates from the expectation decreases.
In the next section we show that this probability actually decreases at an exponential rate.
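A small numerical sketch (my addition) of the comparison above for Bernoulli variables with bias 1/2: it prints the Markov and Chebyshev bounds on P(µ̂n − µ ≥ ε) for growing n, with ε chosen arbitrarily.

```python
# Compare the Markov and Chebyshev bounds for the average of n Bernoulli(1/2) variables.
mu, var, eps = 0.5, 0.25, 0.1
for n in [10, 100, 1000]:
    markov = mu / (mu + eps)          # does not improve with n
    chebyshev = var / (n * eps ** 2)  # decreases at the rate 1/n
    print(f"n={n:5d}  Markov bound: {markov:.3f}  Chebyshev bound: {chebyshev:.3f}")
```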

2.3 Hoeffding’s Inequality
Hoeffding’s inequality is a much more powerful concentration result.
Theorem 2.3 (Hoeffding’s Inequality). Let X1, . . . , Xn be independent real-valued random variables,
such that for each i ∈ {1, . . . , n} there exist ai ≤ bi, such that Xi ∈ [ai, bi]. Then for every ε > 0:

P( ∑_{i=1}^n Xi − E[∑_{i=1}^n Xi] ≥ ε ) ≤ e^{−2ε² / ∑_{i=1}^n (bi − ai)²}     (2.1)

and

P( ∑_{i=1}^n Xi − E[∑_{i=1}^n Xi] ≤ −ε ) ≤ e^{−2ε² / ∑_{i=1}^n (bi − ai)²}.     (2.2)

By taking a union bound of the events in (2.1) and (2.2) we obtain the following corollary.
Corollary 2.4. Under the assumptions of Theorem 2.3:
n
" n # !
X X 2 Pn 2
P Xi − E Xi ≥ ε ≤ 2e−2ε / i=1 (bi −ai ) . (2.3)
i=1 i=1

Equations (2.1) and (2.2) are known as “one-sided Hoeffding’s inequalities” and (2.3) is known as
“two-sided Hoeffding’s inequality”.
If we assume that Xi -s are identically distributed and belong to the [0, 1] interval we obtain the
following corollary.

Corollary 2.5. Let X1, . . . , Xn be independent random variables, such that Xi ∈ [0, 1] and E[Xi] = µ
for all i. Then for every ε > 0:

P( (1/n) ∑_{i=1}^n Xi − µ ≥ ε ) ≤ e^{−2nε²}     (2.4)

and

P( µ − (1/n) ∑_{i=1}^n Xi ≥ ε ) ≤ e^{−2nε²}.     (2.5)
Recall that by Chebyshev’s inequality µ̂n = (1/n) ∑_{i=1}^n Xi converges to µ at the rate of n⁻¹. Hoeffding’s
inequality demonstrates that the convergence is actually much faster, at least at the rate of e⁻ⁿ.
The proof of Hoeffding’s inequality is based on Hoeffding’s lemma.
Lemma 2.6 (Hoeffding’s Lemma). Let X be a random variable, such that X ∈ [a, b]. Then for any
λ ∈ R:

E[e^{λX}] ≤ e^{λE[X] + λ²(b−a)²/8}.

The function f(λ) = E[e^{λX}] is known as the moment generating function of X, since f′(0) = E[X],
f″(0) = E[X²], and, more generally, f⁽ᵏ⁾(0) = E[Xᵏ]. We provide the proof of the lemma immediately
after the proof of Theorem 2.3.

Proof of Theorem 2.3. We prove the first inequality in Theorem 2.3. The second inequality follows by
applying the first inequality to −X1, . . . , −Xn. The proof is based on Chernoff’s bounding technique.
For any λ > 0 the following holds:

P( ∑_{i=1}^n Xi − E[∑_{i=1}^n Xi] ≥ ε ) = P( e^{λ(∑_{i=1}^n Xi − E[∑_{i=1}^n Xi])} ≥ e^{λε} ) ≤ E[ e^{λ(∑_{i=1}^n Xi − E[∑_{i=1}^n Xi])} ] / e^{λε},

where the first step holds since e^{λx} is a monotonically increasing function for λ > 0 and the second step
holds by Markov’s inequality. We now take a closer look at the numerator:

E[ e^{λ(∑_{i=1}^n Xi − E[∑_{i=1}^n Xi])} ] = E[ e^{∑_{i=1}^n λ(Xi − E[Xi])} ]
                                          = E[ ∏_{i=1}^n e^{λ(Xi − E[Xi])} ]
                                          = ∏_{i=1}^n E[ e^{λ(Xi − E[Xi])} ]          (2.6)
                                          ≤ ∏_{i=1}^n e^{λ²(bi − ai)²/8}             (2.7)
                                          = e^{(λ²/8) ∑_{i=1}^n (bi − ai)²},

where (2.6) holds since X1, . . . , Xn are independent and (2.7) holds by Hoeffding’s lemma applied to the
random variable Zi = Xi − E[Xi] (note that E[Zi] = 0 and that Zi ∈ [ai − µi, bi − µi] for µi = E[Xi]). Pay
attention to the crucial role that independence of X1, . . . , Xn plays in the proof! Without independence
we would not have been able to exchange the expectation with the product and the proof would break
down! To complete the proof we substitute the bound on the expectation into the previous calculation
and obtain:

P( ∑_{i=1}^n Xi − E[∑_{i=1}^n Xi] ≥ ε ) ≤ e^{(λ²/8) ∑_{i=1}^n (bi − ai)² − λε}.

This expression is minimized by

λ* = arg min_λ e^{(λ²/8) ∑_{i=1}^n (bi − ai)² − λε} = arg min_λ ( (λ²/8) ∑_{i=1}^n (bi − ai)² − λε ) = 4ε / ∑_{i=1}^n (bi − ai)².

It is important to note that the best choice of λ does not depend on the sample. In particular, it allows
us to fix λ before observing the sample. By substituting λ* into the calculation we obtain the result of the
theorem.
Proof of Lemma 2.6. Note that

E[e^{λX}] = E[e^{λ(X−E[X]) + λE[X]}] = e^{λE[X]} · E[e^{λ(X−E[X])}].

Hence, it is sufficient to show that for any random variable Z with E[Z] = 0 and Z ∈ [a, b] we have:

E[e^{λZ}] ≤ e^{λ²(b−a)²/8}.

By convexity of the exponential function, for z ∈ [a, b] we have:

e^{λz} ≤ ((z − a)/(b − a)) e^{λb} + ((b − z)/(b − a)) e^{λa}.

Let p = −a/(b − a). Then:

E[e^{λZ}] ≤ E[ ((Z − a)/(b − a)) e^{λb} + ((b − Z)/(b − a)) e^{λa} ]
          = ((E[Z] − a)/(b − a)) e^{λb} + ((b − E[Z])/(b − a)) e^{λa}
          = (−a/(b − a)) e^{λb} + (b/(b − a)) e^{λa}
          = (1 − p + p e^{λ(b−a)}) e^{−pλ(b−a)}
          = e^{φ(u)},

where u = λ(b − a) and φ(u) = −pu + ln(1 − p + p e^u), and we used the fact that E[Z] = 0. It is easy to
verify that the derivative of φ is

φ′(u) = −p + p / (p + (1 − p) e^{−u})

and, therefore, φ(0) = φ′(0) = 0. Furthermore,

φ″(u) = p(1 − p) e^{−u} / (p + (1 − p) e^{−u})² ≤ 1/4.

By Taylor’s theorem, φ(u) = φ(0) + uφ′(0) + (u²/2) φ″(θ) for some θ ∈ [0, u]. Thus, we have:

φ(u) = φ(0) + uφ′(0) + (u²/2) φ″(θ) = (u²/2) φ″(θ) ≤ u²/8 = λ²(b − a)²/8.

2.3.1 Understanding Hoeffding’s Inequality

Hoeffding’s inequality involves three interconnected terms: n, ε, and δ = 2e^{−2nε²}, which is the bound on
the probability that the event under P(·) holds (for the purpose of the discussion we consider the two-sided
Hoeffding’s inequality for random variables bounded in [0, 1]). We can fix any two of the three terms n,
ε, and δ, and then the relation δ = 2e^{−2nε²} provides the value of the third. Thus, we have

δ = 2e^{−2nε²},
ε = √( ln(2/δ) / (2n) ),
n = ln(2/δ) / (2ε²).

Overall, Hoeffding’s inequality tells us by how much the empirical average (1/n) ∑_{i=1}^n Xi can deviate from
its expectation µ, but the interplay between the three parameters provides several ways of seeing and
using Hoeffding’s inequality. For example, if the number of samples n is fixed (we have made a fixed
number of experiments and now analyze what we can get from them), there is an interplay between the
precision ε and the confidence δ. We can request higher precision ε, but then we have to compromise on
the confidence δ with which the desired bound (1/n) ∑_{i=1}^n Xi − µ ≤ ε holds. And the other way around: we can
request higher confidence δ, but then we have to compromise on precision ε, i.e., we have to increase the
allowed range ±ε around µ where we expect to find the empirical average (1/n) ∑_{i=1}^n Xi.

As another example, we may have a target precision ε and confidence δ, and then the inequality provides
us the number of experiments n that we have to perform in order to achieve the target.
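A small sketch (my addition) of this three-way interplay: given any two of n, ε, δ, compute the third from the two-sided relation δ = 2e^{−2nε²}; the sample values are arbitrary.

```python
import math

def hoeffding_delta(n, eps):
    """Failure probability for given sample size and precision."""
    return 2 * math.exp(-2 * n * eps ** 2)

def hoeffding_eps(n, delta):
    """Precision achievable with n samples at confidence 1 - delta."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

def hoeffding_n(eps, delta):
    """Number of samples needed for precision eps at confidence 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(hoeffding_n(0.05, 0.01))    # samples needed for +-0.05 precision with 99% confidence
print(hoeffding_eps(1000, 0.05))  # precision with 1000 samples at 95% confidence
```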
It is often convenient to write the inequalities (2.4) and (2.5) with a fixed confidence in mind; thus
we have

P( (1/n) ∑_{i=1}^n Xi − µ ≥ √( ln(1/δ) / (2n) ) ) ≤ δ,

P( µ − (1/n) ∑_{i=1}^n Xi ≥ √( ln(1/δ) / (2n) ) ) ≤ δ,

P( |(1/n) ∑_{i=1}^n Xi − µ| ≥ √( ln(2/δ) / (2n) ) ) ≤ δ.

(Note that the ln 2 factor in the last inequality comes from the union bound over the first two
inequalities: if we want to keep the same confidence we have to compromise on precision.)

In many situations we are interested in the complementary events. Thus, for example, we have

P( µ − (1/n) ∑_{i=1}^n Xi ≤ √( ln(1/δ) / (2n) ) ) ≥ 1 − δ.

A careful reader may point out that the inequalities above should be strict (“<” and “>”). This is true,
but if it holds for strict inequalities it also holds for non-strict inequalities (“≤” and “≥”). Since strict
inequalities provide no practical advantage, we will use the non-strict inequalities to avoid the headache
of remembering which inequalities should be strict and which should not.

The last inequality essentially says that with probability at least 1 − δ we have

µ ≤ (1/n) ∑_{i=1}^n Xi + √( ln(1/δ) / (2n) )

and this is how we will occasionally use it. Note that the random variable is (1/n) ∑_{i=1}^n Xi and the right
way of interpreting the above inequality is actually that with probability at least 1 − δ

(1/n) ∑_{i=1}^n Xi ≥ µ − √( ln(1/δ) / (2n) ),

i.e., the probability is over (1/n) ∑_{i=1}^n Xi and not over µ. However, many generalization bounds that we
study in Chapter 3 are written in the first form in the literature and we follow the tradition.

2.4 Basics of Information Theory: Entropy, Relative Entropy, and the Method of Types
In this section we briefly introduce a number of basic concepts from information theory that are very
useful for deriving concentration inequalities. Specifically, we introduce the notions of entropy and
relative entropy (Cover and Thomas, 2006, Chapter 2) and some basic tools from the method of types
(Cover and Thomas, 2006, Chapter 11). We start with some definitions.
Definition 2.7 (Entropy). Let p(x) be a distribution of a discrete random variable X taking values in
a finite set X. We define the entropy of p as:

H(p) = − ∑_{x∈X} p(x) ln p(x).

We use the convention that 0 ln 0 = 0 (which is justified by continuity of z ln z, since z ln z → 0 as z → 0).


We have special interest in Bernoulli random variables.
Definition 2.8 (Bernoulli random variable). X is a Bernoulli random variable with bias p if X accepts
values in {0, 1} with P(X = 0) = 1 − p and P(X = 1) = p.
Note that expectation of a Bernoulli random variable is equal to its bias:

E [X] = 0 × P(X = 0) + 1 × P(X = 1) = P(X = 1) = p.

With a slight abuse of notation we specialize the definition of entropy to Bernoulli random variables.
Definition 2.9 (Binary entropy). Let p be a bias of Bernoulli random variable X. We define the entropy
of p as
H(p) = −p ln p − (1 − p) ln(1 − p).
Note that when we talk about Bernoulli random variables p denotes the bias of the random variable
and when we talk about more general random variables p denotes the complete distribution.
Entropy is one of the central quantities in information theory and it has numerous applications. We
start by using binary entropy to bound binomial coefficients.

Lemma 2.10.

(1/(n + 1)) e^{n H(k/n)} ≤ C(n, k) ≤ e^{n H(k/n)},

where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient. (Note that k/n ∈ [0, 1] and H(k/n) in the lemma is the binary entropy.)
Proof. By the binomial formula we know that for any p ∈ [0, 1]:

∑_{i=0}^n C(n, i) pⁱ (1 − p)^{n−i} = 1.     (2.8)

We start with the upper bound. Take p = k/n. Since the sum is larger than any individual term, for the
k-th term of the sum we get:

1 ≥ C(n, k) pᵏ (1 − p)^{n−k}
  = C(n, k) (k/n)ᵏ (1 − k/n)^{n−k}
  = C(n, k) (k/n)ᵏ ((n − k)/n)^{n−k}
  = C(n, k) e^{k ln(k/n) + (n−k) ln((n−k)/n)}
  = C(n, k) e^{n( (k/n) ln(k/n) + ((n−k)/n) ln((n−k)/n) )}
  = C(n, k) e^{−n H(k/n)}.

By changing sides of the inequality we obtain the upper bound.

For the lower bound it is possible to show that if we fix p = k/n then C(n, k) pᵏ (1 − p)^{n−k} ≥ C(n, i) pⁱ (1 − p)^{n−i}
for any i ∈ {0, . . . , n}, see Cover and Thomas (2006, Example 11.1.3) for details. We also note that there
are n + 1 terms in the sum in equation (2.8). Again, take p = k/n; then

1 ≤ (n + 1) max_i C(n, i) (k/n)ⁱ ((n − k)/n)^{n−i} = (n + 1) C(n, k) (k/n)ᵏ ((n − k)/n)^{n−k} = (n + 1) C(n, k) e^{−n H(k/n)},

where the last step follows the same steps as in the derivation of the upper bound.
Lemma 2.10 shows that the number of configurations of choosing k out of n objects is directly related
to the entropy of the imbalance k/n between the number of objects that are selected (k) and the number
of objects that are left out (n − k).
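A quick numerical sketch (my addition) of Lemma 2.10, comparing C(n, k) with its lower and upper entropy bounds; the values of n and k are arbitrary.

```python
import math

def binary_entropy(p):
    """Binary entropy in nats, with the convention 0 ln 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

n, k = 20, 7
upper = math.exp(n * binary_entropy(k / n))
lower = upper / (n + 1)
print(lower, math.comb(n, k), upper)   # lower bound <= C(n, k) <= upper bound
```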
We now introduce one additional quantity, the Kullback-Leibler (KL) divergence, also known as
Kullback-Leibler distance and as relative entropy.
Definition 2.11 (Relative entropy or Kullback-Leibler divergence). Let p(x) and q(x) be two probability
distributions of a random variable X (or two probability density functions, if X is a continuous random
variable). The Kullback-Leibler divergence or relative entropy is defined as:

KL(p‖q) = E_p[ ln( p(X)/q(X) ) ] = ∑_{x∈X} p(x) ln( p(x)/q(x) )  if X is discrete,
          and ∫_{x∈X} p(x) ln( p(x)/q(x) ) dx  if X is continuous.

We use the convention that 0 ln(0/0) = 0, 0 ln(0/q) = 0, and p ln(p/0) = ∞.


We specialize the definition to Bernoulli distributions.
Definition 2.12 (Binary kl-divergence). Let p and q be biases of two Bernoulli random variables. The
binary kl divergence is defined as:
kl(p‖q) = KL([1 − p, p] ‖ [1 − q, q]) = p ln(p/q) + (1 − p) ln((1 − p)/(1 − q)).

KL divergence is the central quantity in information theory. Although it is not a distance measure,
because it does not satisfy the triangle inequality, it is the right way of measuring distances between
probability distributions. This is illustrated by the following example.
Example 2.13. Let X1, . . . , Xn be an i.i.d. sample of n Bernoulli random variables with bias p and
let (1/n) ∑_{i=1}^n Xi be the empirical bias of the sample. (Note that (1/n) ∑_{i=1}^n Xi ∈ {0, 1/n, 2/n, . . . , n/n}.) Then by
Lemma 2.10:

P( (1/n) ∑_{i=1}^n Xi = k/n ) = C(n, k) pᵏ (1 − p)^{n−k} ≤ e^{n H(k/n)} e^{n( (k/n) ln p + ((n−k)/n) ln(1−p) )} = e^{−n kl(k/n ‖ p)}     (2.9)

and

P( (1/n) ∑_{i=1}^n Xi = k/n ) ≥ (1/(n + 1)) e^{−n kl(k/n ‖ p)}.

Thus, kl(k/n ‖ p) governs the probability of observing empirical bias k/n when the true bias is p. It is easy
to verify that kl(p‖p) = 0, and it is also possible to show that kl(p̂‖p) is convex in p̂ and that kl(p̂‖p) ≥ 0.
Thus, the probability of the empirical bias is maximized when it coincides with the true bias.

2.5 kl Inequality
Example 2.13 shows that kl can be used to bound the empirical bias when the true bias is known. But
in machine learning we are usually interested in the inverse problem - how to infer the true bias p when
the empirical bias p̂ is known. Next we demonstrate that this is also possible and that it leads to an
inequality, which in most cases is tighter than Hoeffding’s inequality. We start with the following lemma.
Lemma 2.14. Let X1, . . . , Xn be i.i.d. Bernoulli with bias p and let p̂ = (1/n) ∑_{i=1}^n Xi be the empirical
bias. Then

E[ e^{n kl(p̂ ‖ p)} ] ≤ n + 1.

Proof.

E[ e^{n kl(p̂ ‖ p)} ] = ∑_{k=0}^n P( p̂ = k/n ) e^{n kl(k/n ‖ p)} ≤ ∑_{k=0}^n e^{−n kl(k/n ‖ p)} e^{n kl(k/n ‖ p)} = n + 1,

where the inequality was derived in equation (2.9).


We combine this lemma with Markov’s inequality to obtain the following result.
Theorem 2.15 (kl inequality). Let X1, . . . , Xn be i.i.d. Bernoulli with bias p and let p̂ = (1/n) ∑_{i=1}^n Xi be
the empirical bias. Then

P( kl(p̂ ‖ p) ≥ ε ) ≤ (n + 1) e^{−nε}.     (2.10)

Proof. By Markov’s inequality and Lemma 2.14:

P( kl(p̂ ‖ p) ≥ ε ) = P( e^{n kl(p̂ ‖ p)} ≥ e^{nε} ) ≤ E[ e^{n kl(p̂ ‖ p)} ] / e^{nε} ≤ (n + 1) / e^{nε}.

2.5.1 Relaxations of the kl-inequality: Pinsker’s and refined Pinsker’s inequalities

By denoting the right hand side of the kl inequality (2.10) by δ, we obtain that with probability greater than
1 − δ:

kl(p̂ ‖ p) ≤ ln((n + 1)/δ) / n.     (2.11)
This leads to an implicit bound on p, which is not very intuitive and not always convenient to work with.
In order to understand the behavior of the kl inequality better we use a couple of its relaxations. The
first relaxation is known as Pinsker’s inequality, see Cover and Thomas (2006, Lemma 11.6.1).

Lemma 2.16 (Pinsker’s inequality).

KL(p‖q) ≥ (1/2) ‖p − q‖₁²,

where ‖p − q‖₁ = ∑_{x∈X} |p(x) − q(x)| is the L1-norm.

Corollary 2.17 (Pinsker’s inequality for the binary kl divergence).

kl(p‖q) ≥ (1/2) (|p − q| + |(1 − p) − (1 − q)|)² = 2(p − q)².     (2.12)
By applying Corollary 2.17 to inequality (2.11) we obtain that with probability greater than 1 − δ

|p − p̂| ≤ √( kl(p̂ ‖ p) / 2 ) ≤ √( ln((n + 1)/δ) / (2n) ).

Recall that Hoeffding’s inequality assures that with probability greater than 1 − δ

p ≤ p̂ + √( ln(1/δ) / (2n) ).

Thus, in the worst case the kl inequality is only weaker by the ln(n + 1) factor, and in fact the ln(n + 1)
factor can be reduced by a more careful analysis, see Maurer (2004), Langford (2005). Next we show
that the kl inequality can actually be significantly tighter than Hoeffding’s inequality. For this we use
the refined Pinsker’s inequality, see Marton (1996, 1997), Samson (2000), Boucheron et al. (2013, Lemma
8.4).
Lemma 2.18 (Refined Pinsker’s inequality).

kl(p‖q) ≥ (p − q)² / (2 max{p, q}) + (p − q)² / (2 max{(1 − p), (1 − q)}).

Corollary 2.19 (Refined Pinsker’s inequality). If q > p then

kl(p‖q) ≥ (p − q)² / (2q).

Corollary 2.20 (Refined Pinsker’s inequality). If kl(p‖q) ≤ ε then

q ≤ p + √(2pε) + 2ε.

By applying Corollary 2.20 to inequality (2.11) we obtain that with probability greater than 1 − δ

p ≤ p̂ + √( 2p̂ ln((n + 1)/δ) / n ) + 2 ln((n + 1)/δ) / n.
Note that when p̂ is close to zero, the latter inequality is much tighter than Hoeffding’s inequality. Finally,
we note that although there is no analytic inversion of kl(p̂ ‖ p), it is possible to invert it numerically to
obtain even tighter bounds than the relaxations above. Additionally, the bound in Theorem 2.15 can be
improved slightly, see Maurer (2004), Langford (2005).
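The numerical inversion mentioned above is easy to implement. The sketch below (my illustration) finds the largest p consistent with kl(p̂ ‖ p) ≤ ε by binary search and compares it with the refined-Pinsker and Hoeffding style bounds; the tolerance and the sample values of n, δ, p̂ are arbitrary assumptions.

```python
import math

def kl(p_hat, p):
    """Binary kl divergence kl(p_hat || p), with the 0 ln 0 conventions."""
    terms = 0.0
    if p_hat > 0:
        terms += p_hat * math.log(p_hat / p)
    if p_hat < 1:
        terms += (1 - p_hat) * math.log((1 - p_hat) / (1 - p))
    return terms

def kl_upper_inverse(p_hat, eps, tol=1e-9):
    """Largest p in [p_hat, 1) such that kl(p_hat || p) <= eps (binary search)."""
    lo, hi = p_hat, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl(p_hat, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

n, delta, p_hat = 1000, 0.05, 0.02
eps = math.log((n + 1) / delta) / n
print("kl bound:       ", kl_upper_inverse(p_hat, eps))
print("refined Pinsker:", p_hat + math.sqrt(2 * p_hat * eps) + 2 * eps)
print("Hoeffding:      ", p_hat + math.sqrt(math.log(1 / delta) / (2 * n)))
```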

2.6 Sampling Without Replacement


Let X1 , . . . , Xn be a sequence of random variables sampled without replacement from a finite set of
values X = {x1 , . . . , xN } of size N . The random variables X1 , . . . , Xn are dependent. For example,
if X = {−1, +1} and we sample two values then X1 = −X2 . Since X1 , . . . , Xn are dependent, the
concentration results from the previous sections do not apply directly. However, the following result by
Hoeffding (1963, Theorem 4), which we cite without proof, allows us to extend results for sampling with
replacement to sampling without replacement.

Lemma 2.21. Let X1, . . . , Xn denote a random sample without replacement from a finite set X =
{x1, . . . , xN} of N real values. Let Y1, . . . , Yn denote a random sample with replacement from X. Then
for any continuous and convex function f : R → R

E[ f( ∑_{i=1}^n Xi ) ] ≤ E[ f( ∑_{i=1}^n Yi ) ].

In particular, the lemma can be used to prove Hoeffding’s inequality for sampling without replacement.

Theorem 2.22 (Hoeffding’s inequality for sampling without replacement). Let X1, . . . , Xn denote a
random sample without replacement from a finite set X = {x1, . . . , xN} of N values, where each element
xi is in the [0, 1] interval. Let µ = (1/N) ∑_{i=1}^N xi be the average of the values in X. Then for all ε > 0

P( (1/n) ∑_{i=1}^n Xi − µ ≥ ε ) ≤ e^{−2nε²},

P( µ − (1/n) ∑_{i=1}^n Xi ≥ ε ) ≤ e^{−2nε²}.

The proof is a minor adaptation of the proof of Hoeffding’s inequality for sampling with replacement
using Lemma 2.21 and is left as an exercise. (Note that it requires a small modification inside the proof,
because Lemma 2.21 cannot be applied directly to the statement of Hoeffding’s inequality.)
While the formal proof requires a bit of work, intuitively the result is quite expected. Imagine the process
of sampling without replacement. If the average of points sampled so far starts deviating from the mean
of the values in X , the average of points that are left in X deviates in the opposite direction and “applies
extra force” to new samples to bring the average back to µ. In the limit when n = N we are guaranteed
to have the average of Xi -s being equal to µ.

Chapter 3

Generalization Bounds for


Classification

One of the most central questions in machine learning is: “How much can we trust the predictions
of a learning algorithm?”. A way of answering this question is by providing generalization bounds on
the expected performance of the algorithm on new data points. In this chapter we derive a number of
generalization bounds for supervised classification.

3.1 Overview: Learning by Selection


The classical process of learning can be seen as a selection process (see Figure 3.1):
1. We start with a hypothesis set H, which is a set of plausible prediction rules (for example, linear
separators).

2. We observe a sample S sampled i.i.d. according to a fixed, but unknown distribution p(X, Y ).
3. Based on the empirical performances L̂(h, S) of the hypotheses in H, we select a prediction rule
ĥ∗S , which we consider to be the “best” in H in some sense. Typically, ĥ∗S is either the empirical
risk minimizer (ERM), ĥ∗S = arg min_{h∈H} L̂(h, S), or a regularized empirical risk minimizer.

4. ĥ∗S is then applied to predict labels for new samples X.

In this chapter we are concerned with the question of what can be said about the expected loss L(ĥ∗S ),
which is the error we are expected to make on new samples. More precisely, we provide tools for bounding
the probability that L̂(ĥ∗S , S) is significantly smaller than L(ĥ∗S ). Recall that L̂(ĥ∗S , S) is observed and
L(ĥ∗S ) is unobserved. Having small L̂(ĥ∗S , S) and large L(ĥ∗S ) is undesirable, because it means that based
on L̂(ĥ∗S , S) we believe that ĥ∗S performs well, but in reality it does not.

Figure 3.1: Learning by Selection.

Assumptions There are two key assumptions we make throughout the chapter:
1. The samples in S are i.i.d..
2. The new samples (X, Y ) come from the same distribution as the samples in S.
These are the assumptions behind concentration of measure inequalities developed in Chapter 2 and
it is important to remember that if they are not satisfied the results derived in this chapter are not valid.
In a sense, it is intuitive why we have to make these assumptions. For example, if we train a language
model using data from The Wall Street Journal and then apply it to Twitter the change in prediction
accuracy can be very dramatic. Even though both are written in English and comprehensible by humans,
the language used by professional journalists writing for The Wall Street Journal is very different from
the language used in the short tweets.
The two assumptions are behind most supervised learning algorithms that you can meet in practice
and, therefore, it is important to keep them in mind. In Chapter 5 we discuss how to depart from them,
but for now we stick with them.
Given the assumptions above, for any fixed prediction rule h that is independent of S, the empirical
loss is an unbiased estimate of the true loss, E[L̂(h, S)] = L(h). An intuitive way to see it is that under
the assumptions that the samples in S are i.i.d. and coming from the same distribution as new samples
(X, Y ), from the perspective of h the new samples (X, Y ) are in no way different from the samples in S:
any new sample (X, Y ) could have happened to be in S instead of some other sample (Xi , Yi ) (they are
“exchangeable”). Formally,
\begin{align*}
\mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}\left[\hat L(h,S)\right] &= \mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}\left[\frac{1}{n}\sum_{i=1}^n \ell(h(X_i),Y_i)\right]\\
&= \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}\left[\ell(h(X_i),Y_i)\right]\\
&= \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{(X_i,Y_i)}\left[\ell(h(X_i),Y_i)\right]\\
&= \frac{1}{n}\sum_{i=1}^n L(h)\\
&= L(h).
\end{align*}

However, when we make the selection of ĥ∗S based on S the “exchangeability” argument no longer applies
and E[L̂(ĥ∗S , S)] ≠ E[L(ĥ∗S )] (note that ĥ∗S is a random variable depending on S and we take the expectation
with respect to this randomness). This is because ĥ∗S is tailored to S (for example, it minimizes L̂(h, S))
and from the perspective of the selection process the samples in S are not exchangeable with new samples
(X, Y ). If we exchange the samples we may end up with a different ĥ∗S . In the extreme case when the
hypothesis space H is so rich that it can fit any possible labeling of the data (for example, the hypothesis
space corresponding to the 1-nearest-neighbor prediction rule) we may end up in a situation where L̂(ĥ∗S , S)
is always zero, but E[L(ĥ∗S )] ≥ 1/4, as in the following informal example.

Informal Lower Bound Imagine that we want to learn a classifier that predicts whether a student’s
birthday is on an even or odd day based on the student’s id. Assume that the total number of students
is 2n, that the hypothesis class H includes all possible mappings from student id to even/odd, so that
|H| = 2^{2n}, and that we observe a sample of n uniformly sampled students (potentially with repetitions).
Since all possible mappings are within H, we have ĥ∗S ∈ H for which L̂(ĥ∗S , S) = 0. However, ĥ∗S is
guaranteed to make zero error only on the samples that were observed, which constitute at most half
of the total number of students. For the remaining students ĥ∗S can, at best, make a random guess,
which will succeed with probability 1/2. Therefore, the expected loss of ĥ∗S is L(ĥ∗S ) ≥ (1/2) · 0 + (1/2) · (1/2) = 1/4,
where the first term is an upper bound on the probability of observing an already seen student times
the expected error ĥ∗S makes in this case, and the second term is a lower bound on the probability of
observing a new student times the expected error ĥ∗S makes in this case. For a more formal treatment
see the lower bounds in Section 3.7.

Figure 3.2: Learning by Selection.
Considering it from the perspective of expectations, we have:
\begin{align*}
\mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}\left[\hat L(\hat h^*_S,S)\right] &= \mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}\left[\frac{1}{n}\sum_{i=1}^n \ell(\hat h^*_S(X_i),Y_i)\right]\\
&= \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}\left[\ell(\hat h^*_S(X_i),Y_i)\right]\\
&= \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}\left[\ell(\hat h^*_S(X_1),Y_1)\right]\\
&= \mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}\left[\ell(\hat h^*_S(X_1),Y_1)\right]\\
&\neq \mathbb{E}_{(X,Y)}\left[\mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}\left[\ell(\hat h^*_S(X),Y)\right]\right]\\
&= \mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}\left[\mathbb{E}_{(X,Y)}\left[\ell(\hat h^*_S(X),Y)\right]\right]\\
&= \mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}\left[L(\hat h^*_S)\right].
\end{align*}
(The third equality holds by symmetry of the i.i.d. sample: swapping the roles of (X_i, Y_i) and (X_1, Y_1) does not change the joint distribution.)

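The following Python sketch (not part of the original text; the population size, the labels, and the “predict 0 on unseen ids” rule are illustrative choices) mimics the informal lower bound above: a hypothesis that memorizes the observed students has zero empirical loss, while its expected loss, computed over all 2n students, stays around 1/4 or above.

import numpy as np

rng = np.random.default_rng(0)
n = 50                                         # sample size; the population has 2n students
labels = rng.integers(0, 2, size=2 * n)        # even/odd birthday for each student id

emp_losses, true_losses = [], []
for _ in range(2000):
    sample = rng.integers(0, 2 * n, size=n)    # n uniform draws of student ids, with repetitions
    h = np.zeros(2 * n, dtype=int)             # memorizing hypothesis: observed labels, 0 elsewhere
    h[sample] = labels[sample]
    emp_losses.append(np.mean(h[sample] != labels[sample]))   # empirical loss on S (always 0)
    true_losses.append(np.mean(h != labels))                  # expected loss under uniform p(x)

print("average empirical loss:", np.mean(emp_losses))
print("average expected loss :", np.mean(true_losses))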
The selection leads to the approximation-estimation trade-off (a.k.a. bias-variance trade-off), see
Figure 3.2. If the hypothesis class H is small it is easy to identify a good hypothesis h in H, but since
H is small it is likely that all the hypotheses in H are weak. On the other hand, if H is large it is more
likely to contain stronger hypotheses, but at the same time the probability of confusion with a poor
hypothesis grows. This is because there is always a small chance that the empirical loss L̂(h, S) does
not represent the true loss L(h) faithfully. The more hypotheses we take, the higher the chance that
L̂(h, S) is misleading for some of them, which increases the chance of confusion.
Finding a good balance between approximation and estimation errors is one of the central questions
in machine learning. The main tools for analyzing the trade-off from the theoretical perspective are
concentration of measure inequalities. Since concentration of measure inequalities do not apply when
the prediction rule ĥ∗S depends on S, the main approach to analyzing the prediction power of ĥ∗S is to
consider cases with no dependency and then take a union bound over selection from these cases. In this
chapter we study three different ways of implementing this idea, see Figure 3.3 for an overview. We
distinguish between hard selection, where the learning procedure returns a single hypothesis h, and soft
selection, where the learning procedure returns a distribution over H.

Figure 3.3: Overview of the major approaches to derivation of generalization bounds considered in this
chapter.

1. Occam’s razor applies to hard selection from a countable hypothesis space H and it is based on a
weighted union bound over H. We know that for every fixed h the expected loss is close to the
empirical loss, meaning that L(h) − L̂(h, S) is small. When H is countable we can take a weighted
union bound and obtain that L(h) − L̂(h, S) is “small” for all h ∈ H (where the magnitude of
“small” is inversely proportional to the weight of h in the union bound) and thus it is “small” for
ĥ∗S .

2. Vapnik-Chervonenkis (VC) analysis applies to hard selection from an uncountable hypothesis space
H and it is based on projection of H onto S and a union bound over what we obtain after the
projection. The idea is that even when H is uncountably infinite, there is only a finite number
of “behaviors” (ways to label S) we can observe on a finite sample S. In other words, when we
look at H through the prism of S we can only distinguish between a finite number of subsets of
H and everything that falls within the subsets is equivalent in terms of L̂(h, S). Therefore, S
only serves for a (finite) selection of a subset of H out of a finite number of subsets, whereas the
(infinite) selection from within the subset is independent of S. Selection that is independent of S
introduces no bias. As before, the VC analysis exploits the fact that for any fixed h the distance
L(h) − L̂(h, S) is small and then takes a union bound over the potential dependencies, which are
the dependencies between the subsets (the projections) and S.
3. PAC-Bayesian analysis applies to soft selection from an uncountable hypothesis space H and it
is based on change of measure inequality, which can be seen as a refinement of the union bound.
Unlike the preceding two approaches, which return a single classifier ĥ∗S , PAC-Bayesian analysis
returns a randomized classifier defined by a distribution ρ over H. The actual classification then
happens by drawing a new classifier h from H according to ρ at each prediction round and applying
it to make a prediction. When H is countable, ρ can (but does not have to) be a delta-distribution
putting all the mass on a single hypothesis ĥ∗S and in this case the generalization guarantees are
identical to those in Occam’s razor approach. The amount of selection is measured by deviation
of ρ from a prior distribution π, where π is selected independently of S. It is natural to put more
of ρ-mass on hypotheses that perform well on S, but the more we skew ρ toward well-performing
hypotheses the more it deviates from π. This provides a more refined way of measuring the amount
of selection compared to the other two approaches. Furthermore, randomization allows us to avoid
selection when it is not necessary. The avoidance of selection reduces the variance without impairing
the bias. For example, when two hypotheses have similar empirical performance we do not have

to commit to one of them, but can instead distribute ρ equally among them. The analysis then
provides a certain “bonus” for avoiding commitment.

3.2 Generalization Bound for a Single Hypothesis


We start with the simplest case, where H consists of a single prediction rule h. We are interested in
the quality of h, measured by L(h), but all we can measure is L̂(h, S). What can we say about L(h)
based on L̂(h, S)? Note that the samples (Xi , Yi ) ∈ S come from the same distribution as any future
samples (X, Y ) we will observe. Therefore, ℓ(h(Xi ), Yi ) has the same distribution as ℓ(h(X), Y ) for any
future sample (X, Y ). Let Zi = ℓ(h(Xi ), Yi ) be the loss of h on (Xi , Yi ). Then L̂(h, S) = (1/n) Σ_{i=1}^n Zi is an
average of n i.i.d. random variables with E[Zi ] = E[ℓ(h(X), Y )] = L(h). The distance between L̂(h, S)
and L(h) can thus be bounded by application of Hoeffding’s inequality.
Theorem 3.1. Assume that ℓ is bounded in the [0, 1] interval (i.e., ℓ(Y′, Y ) ∈ [0, 1] for all Y′, Y ), then
for a single h and any δ ∈ (0, 1) we have:
\[
\mathbb{P}\left(L(h) \ge \hat L(h,S) + \sqrt{\frac{\ln\frac{1}{\delta}}{2n}}\right) \le \delta \tag{3.1}
\]
and
\[
\mathbb{P}\left(\left|L(h) - \hat L(h,S)\right| \ge \sqrt{\frac{\ln\frac{2}{\delta}}{2n}}\right) \le \delta. \tag{3.2}
\]
Proof. For (3.1) take ε = \sqrt{\ln(1/\delta)/(2n)} in (2.5) and rearrange the terms. Equation (3.2) follows in a similar
way from the two-sided Hoeffding’s inequality. Note that in (3.1) we have 1/δ and in (3.2) we have 2/δ.
There is an alternative way to read equation (3.1): with probability at least 1 − δ we have
\[
L(h) \le \hat L(h,S) + \sqrt{\frac{\ln\frac{1}{\delta}}{2n}}.
\]
We remind the reader that the above inequality should actually be interpreted as
\[
\hat L(h,S) \ge L(h) - \sqrt{\frac{\ln\frac{1}{\delta}}{2n}},
\]
and it means that with probability at least 1 − δ the empirical loss L̂(h, S) does not underestimate the
expected loss L(h) by more than \sqrt{\ln(1/\delta)/(2n)}. However, it is customary to write the inequality in the
first form (as an upper bound on L(h)) and we follow the tradition (see the discussion at the end of
Section 2.3.1).
Theorem 3.1 is analogous to the problem of estimating a bias of a coin based on coin flip outcomes.
There is always a small probability that the flip outcomes will not be representative of the coin bias.
For example, it may happen that we flip a fair coin 1000 times (without knowing that it is a fair coin!)
and observe “all heads” or some other misleading outcome. And if this happens we are doomed - there
is nothing we can do when the sample does not represent the reality faithfully. Fortunately for us, this
happens with a small probability that decreases exponentially with the sample size n.
Whether we use the one-sided bound (3.1) or the two-sided bound (3.2) depends on the situation.
In most cases we are interested in the upper bound on the expected performance of the prediction rule
given by (3.1).
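As a small numerical illustration (a sketch, not part of the original text; the empirical loss 0.1 and δ = 0.05 are made-up values), the following Python snippet evaluates the upper bound from (3.1) for a few sample sizes.

import math

def single_hypothesis_bound(emp_loss, n, delta):
    # Upper bound on L(h) from (3.1), valid for a single sample-independent h with losses in [0, 1].
    return emp_loss + math.sqrt(math.log(1 / delta) / (2 * n))

for n in (100, 1000, 10000):
    print(n, round(single_hypothesis_bound(0.1, n, 0.05), 4))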

3.3 Generalization Bound for Finite Hypothesis Classes


A hypothesis set H containing a single hypothesis is a very boring set. In fact, we cannot learn in
this case, because we end up with the same single hypothesis no matter what the sample S is. Learning

Figure 3.4: Validation (the red part in the figure) is identical to learning with a reduced hypothesis set
H0 (most often H0 is finite).

becomes interesting when the training sample S helps to improve future predictions or, equivalently, to decrease
the expected loss L(h). In this section we consider the simplest non-trivial case, where H consists of
a finite number of hypotheses M . There are at least two cases, where we meet a finite H in real life.
The first is when the input space X is finite. This case is relatively rare. The second and much more
frequent case is when H itself is an outcome of a learning process. For example, this is what happens in
a validation procedure, see Figure 3.4. In validation we are using a validation set in order to select the
best hypothesis out of a finite number of candidates corresponding to different parameter values and/or
different algorithms.
And now comes the delicate point. Let ĥ∗S be a hypothesis with minimal empirical risk, ĥ∗S =
arg min_{h∈H} L̂(h, S) (it is natural to pick the empirical risk minimizer ĥ∗S to make predictions on new samples,
but the following discussion equally applies to any other selection rule that takes the sample S into account;
note that there may be multiple hypotheses that achieve the minimal empirical error and in this case
we can pick one arbitrarily). While for each h individually E[L̂(h, S)] = L(h), this is not true for
E[L̂(ĥ∗S , S)]. In other words, E[L̂(ĥ∗S , S)] ≠ E[L(ĥ∗S )] (we have to put the expectation on the right hand
side, because ĥ∗S depends on the sample). The reason is that when we pick ĥ∗S that minimizes the
empirical error on S, from the perspective of ĥ∗S the samples in S no longer look identical to future
samples (X, Y ). This is because ĥ∗S is selected in a very special way - it is selected to minimize the
empirical error on S and, thus, it is tailored to S and most likely does better on S than on new random
samples (X, Y ). One way to handle this issue is to apply a union bound.
Theorem 3.2. Assume that ℓ is bounded in the [0, 1] interval and that |H| = M . Then for any δ ∈ (0, 1)
we have:
\[
\mathbb{P}\left(\exists h \in \mathcal H : L(h) \ge \hat L(h,S) + \sqrt{\frac{\ln\frac{M}{\delta}}{2n}}\right) \le \delta. \tag{3.3}
\]
Proof.
\[
\mathbb{P}\left(\exists h \in \mathcal H : L(h) \ge \hat L(h,S) + \sqrt{\frac{\ln\frac{M}{\delta}}{2n}}\right)
\le \sum_{h \in \mathcal H} \mathbb{P}\left(L(h) \ge \hat L(h,S) + \sqrt{\frac{\ln\frac{M}{\delta}}{2n}}\right)
\le \sum_{h \in \mathcal H} \frac{\delta}{M} = \delta,
\]
where the first inequality is by the union bound and the second is by Hoeffding’s inequality.
Another way of reading Theorem 3.2 is: with probability at least 1 − δ for all h ∈ H
\[
L(h) \le \hat L(h,S) + \sqrt{\frac{\ln\frac{M}{\delta}}{2n}}. \tag{3.4}
\]
It means that no matter which h from H is returned by the algorithm, with high probability we have the
guarantee (3.4). In particular, it holds for ĥ∗S . Again, remember that the random quantity is actually
L̂(h, S) and the right way to read the bound is L̂(h, S) ≥ L(h) − √(ln(M/δ)/(2n)), see the discussion in
the previous section.

The price for considering M hypotheses instead of a single one is ln M . Note that it grows only
logarithmically with M ! Also note that there is no contradiction between the upper bound and the lower
bound we have discussed in Section 3.1. In the construction of the lower bound we took M = |H| = 2^{2n}.
If we substitute this value of M into (3.4) we obtain √(ln(M/δ)/(2n)) ≥ √(ln 2) ≥ 0.8, which does not
contradict L(h) ≥ 0.25.
Similar to Theorem 3.1 it is possible to derive a two-sided bound on the error. It is also possible to
derive a lower bound by using the other side of Hoeffding’s inequality (2.4): with probability at least
1 − δ, for all h ∈ H we have L(h) ≥ L̂(h, S) − √(ln(M/δ)/(2n)). Typically we want the upper bound on
L(h), but if we want to compare two prediction rules, h and h′, we need an upper bound for one and
a lower bound for the other. The “lazy” approach is to take the two-sided bound for everything, but
sometimes it is possible to save the factor of ln 2 by carefully considering which hypotheses require the
lower bound and which require the upper bound and applying the union bound correspondingly (we are
not getting into the details).

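The following Python sketch (not part of the original text; the empirical loss and the class sizes are made up) evaluates the penalty term of (3.4) and illustrates both the logarithmic growth in M and its blow-up for the lower-bound construction M = 2^{2n}.

import math

def finite_class_penalty(n, M, delta):
    # Penalty term of (3.4): sqrt(ln(M/delta) / (2n)), written so that a huge integer M does not overflow.
    return math.sqrt((math.log(M) + math.log(1 / delta)) / (2 * n))

n, delta = 1000, 0.05
for M in (1, 100, 10 ** 6, 2 ** (2 * n)):
    name = str(M) if M < 10 ** 7 else "2^(2n)"
    print(name, round(finite_class_penalty(n, M, delta), 3))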
3.4 Occam’s Razor Bound


Now we take a closer look at Hoeffding’s inequality. It says that
\[
\mathbb{P}\left(L(h) \ge \hat L(h,S) + \sqrt{\frac{\ln\frac{1}{\delta}}{2n}}\right) \le \delta,
\]
where δ is the probability that things go wrong and L̂(h, S) happens to be far away from L(h) because
S is not representative for the performance of h. There is a dependence between the probability that
things go wrong and the requirement on the closeness between L(h) and L̂(h, S). If we want them to be
very close (meaning that ln(1/δ) is small) then δ has to be large, but if we can allow a larger distance then
δ can be smaller.
So, δ can be seen as our “confidence budget” (or, more precisely, “uncertainty budget”) - the probability
that we allow things to go wrong. The idea behind Occam’s Razor bound is to distribute this budget
unevenly among the hypotheses in H. We use π(h) ≥ 0, such that Σ_{h∈H} π(h) ≤ 1, as our distribution
of the confidence budget δ, where each hypothesis h is assigned a π(h) fraction of the budget. This means
that for every hypothesis h ∈ H the sample S is allowed to be “non-representative” with probability at
most π(h)δ, so that the probability that there exists any h ∈ H for which S is not representative is at
most δ (by the union bound). The price that we pay is that the precision (the closeness of L̂(h, S) to
L(h)) now differs from one hypothesis to another and depends on the confidence budget π(h)δ that was
assigned to it. More precisely, L̂(h, S) is allowed to underestimate L(h) by up to √(ln(1/(π(h)δ))/(2n)).
The precision increases when π(h) increases, but since Σ_{h∈H} π(h) ≤ 1 we cannot afford high precision
for every h and have to compromise. More on this in the next theorem and its applications that follow.
Theorem 3.3. Let ℓ be bounded in [0, 1], let H be a countable hypothesis set and let π(h) be independent
of the sample, satisfying π(h) ≥ 0 for all h and Σ_{h∈H} π(h) ≤ 1. Then:
\[
\mathbb{P}\left(\exists h \in \mathcal H : L(h) \ge \hat L(h,S) + \sqrt{\frac{\ln\frac{1}{\pi(h)\delta}}{2n}}\right) \le \delta.
\]
Proof.
\begin{align*}
\mathbb{P}\left(\exists h \in \mathcal H : L(h) \ge \hat L(h,S) + \sqrt{\frac{\ln\frac{1}{\pi(h)\delta}}{2n}}\right)
&\le \sum_{h\in\mathcal H} \mathbb{P}\left(L(h) \ge \hat L(h,S) + \sqrt{\frac{\ln\frac{1}{\pi(h)\delta}}{2n}}\right)\\
&\le \sum_{h\in\mathcal H} \pi(h)\delta\\
&\le \delta,
\end{align*}

where the first inequality is by the union bound, the second inequality is by Hoeffding’s inequality, and
the last inequality is by the assumption on π(h). Note that π(h) has to be selected before we observe the
sample (or, in other words, independently of the sample), otherwise the second inequality does not hold.
More explicitly, in Hoeffding’s inequality
\[
\mathbb{P}\left(\mathbb{E}[Z_1] - \frac{1}{n}\sum_{i=1}^n Z_i \ge \sqrt{\frac{\ln\frac{1}{\delta'}}{2n}}\right) \le \delta'
\]
the parameter δ′ has to be independent of Z1 , . . . , Zn . For π(h) independent of S we take δ′ = π(h)δ and apply the
inequality. But if π(h) were dependent on S we would not be able to apply it.
Another way of reading Theorem 3.3 is that with probability at least 1 − δ, for all h ∈ H:
\[
L(h) \le \hat L(h,S) + \sqrt{\frac{\ln\frac{1}{\pi(h)\delta}}{2n}}.
\]
Again, refer back to the discussion in Section 3.2 regarding the correct interpretation of the inequality.
Note that the bound on L(h) depends both on L̂(h, S) and on π(h). Therefore, according to the bound,
the best generalization is achieved by h that optimizes the trade-off between the empirical performance
L̂(h, S) and π(h), where π(h) can be interpreted as a complexity measure or a prior belief. Also, note
that π(h) can be designed arbitrarily, but it should be independent of the sample S. If π(h) happens to
put more mass on h-s with low L̂(h, S) the bound will be tighter, otherwise the bound will be looser,
but it will still be a valid bound. But we cannot readjust π(h) after observing S! Some considerations
behind the choice of π(h) are provided in Section 3.4.1.
Also note that while we can select π(h) such that Σ_{h∈H} π(h) = 1 and interpret π as a probability
distribution over H, it is not a requirement (we may have Σ_{h∈H} π(h) < 1) and π is used as an auxiliary
construction for the derivation of the bound rather than as a prior distribution in the Bayesian sense (for
readers who are familiar with Bayesian learning). However, we can use π to incorporate prior knowledge
into the learning procedure.

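The following Python sketch (not part of the original text; the hypotheses, empirical losses, and prior weights are made up for illustration) evaluates the bound of Theorem 3.3 for a small set of hypotheses and returns the hypothesis minimizing the bound, illustrating the trade-off between L̂(h, S) and π(h).

import math

def occam_bound(emp_loss, prior, n, delta):
    # Theorem 3.3: L(h) <= L_hat(h, S) + sqrt(ln(1/(pi(h)*delta)) / (2n)), simultaneously for all h.
    return emp_loss + math.sqrt(math.log(1 / (prior * delta)) / (2 * n))

# Made-up empirical losses and prior weights (the weights sum to at most 1).
emp = {"h1": 0.10, "h2": 0.08, "h3": 0.05}
pi = {"h1": 0.5, "h2": 0.25, "h3": 1e-6}
n, delta = 500, 0.05

bounds = {h: occam_bound(emp[h], pi[h], n, delta) for h in emp}
print(bounds)
print("hypothesis minimizing the bound:", min(bounds, key=bounds.get))
# h3 has the smallest empirical loss, but its tiny prior weight makes its bound the loosest.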
3.4.1 Applications of Occam’s Razor bound


We consider two applications of Occam’s Razor bound.

Generalization bound for finite hypotheses spaces


An immediate corollary of Occam’s razor bound is the generalization bound for finite hypothesis classes
that we have already seen in Section 3.3.
Corollary 3.4. Let H be a finite hypothesis class of size M , then
\[
\mathbb{P}\left(\exists h \in \mathcal H : L(h) \ge \hat L(h,S) + \sqrt{\frac{\ln(M/\delta)}{2n}}\right) \le \delta.
\]
Proof. We set π(h) = 1/M (which means that we distribute the confidence budget δ uniformly among the
hypotheses in H) and apply Theorem 3.3.

Generalization bound for binary decision trees


Theorem 3.5. Let Hd be the set of binary decision trees of depth d and let H = ∪_{d=0}^∞ Hd be the set of
binary decision trees of unlimited depth. Let d(h) be the depth of a tree (hypothesis) h. Then
\[
\mathbb{P}\left(\exists h \in \mathcal H : L(h) \ge \hat L(h,S) + \sqrt{\frac{\ln\left(2^{2^{d(h)}}\, 2^{d(h)+1}/\delta\right)}{2n}}\right) \le \delta.
\]
Proof. We first note that |Hd | = 2^{2^d}. We define π(h) = (1/2^{d(h)+1}) · (1/2^{2^{d(h)}}). The first part of π(h) distributes
the confidence budget δ among the Hd -s (we can see it as p(Hd ) = 1/2^{d+1}, the share of the confidence budget
that goes to Hd ) and the second part of π(h) distributes the confidence budget uniformly within Hd .
Since Σ_{d=0}^∞ 1/2^{d+1} = 1, the assumption Σ_{h∈H} π(h) ≤ 1 is satisfied. The result follows by application of
Theorem 3.3.

(a) Subsets of linear homogeneous separators defined by two sample points. (b) Subsets of linear homogeneous separators defined by three sample points.

Figure 3.5: Subsets of homogeneous linear separators in R2 formed by 3.5a two and 3.5b
three sample points. A homogeneous linear separator in R2 is defined by a vector w ∈ R2 . The
sample points define a number of regions in R2 that are shown by the numbers in circles. We say that a
linear separator falls within a certain region when the vector w defining it falls within that region. All
homogeneous linear separators falling within the same region have the same empirical loss L̂(h, S) and,
therefore, any selection among them is not based on the sample S and introduces no bias. The sample
only discriminates between the subsets.

Note that the bound depends on ln(1/(π(h)δ)) and the dominating term in 1/π(h) is 2^{2^{d(h)}}. We could
have selected a different distribution of confidence over the Hd -s, for example, p(Hd ) = 1/((d+1)(d+2)) (for which
we also have Σ_{d=0}^∞ 1/((d+1)(d+2)) = 1), which is perfectly fine, but makes no significant difference for the
bound. The dominating complexity term ln(2^{2^{d(h)}}) comes from the uniform distribution of confidence within
Hd , which makes sense unless we have some prior information about the problem. In the absence of such
information there is no reason to give preference to any of the trees within Hd , because Hd is symmetric.
information there is no reason to give preference to any of the trees within Hd , because Hd is symmetric.
The prior selected in the proof of Theorem 3.5 exploits structural symmetries within the hypothesis
class H and assigns equal weight to hypotheses that are symmetric under permutation of names of
the input variables. While we want π(h) to be as large as possible for every h, the number of such
permutationP symmetric hypotheses is the major barrier dictating how large π(h) can be (because π has
to satisfy h∈H π(h) ≤ 1). Deeper trees have more symmetric permutations and, therefore, get smaller
π(h) compared to shallow trees. If there is prior information that breaks the permutation symmetry it
can be used to assign higher prior to the corresponding trees and if it correctly reflects the true data
distribution it will also lead to tighter bounds. If the prior information does not match the true data
distribution such adjustments may have the opposite effect.
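The following Python sketch (not part of the original text; the sample size and δ are arbitrary choices) evaluates the complexity term of Theorem 3.5 as a function of the tree depth, illustrating how quickly the 2^{d(h)} term dominates for a fixed n.

import math

def tree_penalty(depth, n, delta):
    # Complexity term of Theorem 3.5: sqrt( ln( 2^(2^d) * 2^(d+1) / delta ) / (2n) ).
    log_term = (2 ** depth) * math.log(2) + (depth + 1) * math.log(2) + math.log(1 / delta)
    return math.sqrt(log_term / (2 * n))

n, delta = 10000, 0.05
for d in range(14):
    print(d, round(tree_penalty(d, n, delta), 3))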

3.5 Vapnik-Chervonenkis (VC) Analysis


Now we present Vapnik-Chervonenkis (VC) analysis of generalization when a hypothesis is selected from
an uncountably infinite hypothesis class H. The reason that we are able to derive a generalization bound
even though we are selecting from an uncountably large set is that only a finite part of this selection
is based on the sample S, whereas the remaining uncountable selection is not based on the sample
and, therefore, introduces no bias. Since the sample is finite, the number of distinct labeling patterns,
also called dichotomies, (h(X1 ), . . . , h(Xn )) is also finite. When two hypotheses, h and h0 , produce the
same labeling pattern, (h(X1 ), . . . , h(Xn )) = (h0 (X1 ), . . . , h0 (Xn )), the sample does not discriminate
between them and the selection between h and h0 is based on some other considerations rather than
the sample. Therefore, the sample defines a finite number of (typically uncountably infinite) subsets of
the hypothesis space H, where hypotheses within the same subset produce the same labeling pattern
(h(X1 ), . . . , h(Xn )). The sample then allows selection of the “best” subset, for example, the subset that
minimizes the empirical error. All prediction rules within the same subset have the same empirical error

L̂(h, S) and selection among them is independent of S. See Figure 3.5 for an illustration.
The effective selection based on the sample S depends on the number of subsets of H with distinct
labeling patterns on S. When the number of such subsets is exponential in the size of the sample
n, the selection is too large and leads to overfitting, as we have already seen for selection from large
finite hypothesis spaces in the earlier sections. I.e., we cannot guarantee closeness of L̂(ĥ∗S , S) to L(ĥ∗S ).
However, if the number of subsets is subexponential in n, we can provide generalization guarantees for
L(ĥ∗S ). In Figure 3.5 we illustrate (informally) that at a certain point the number of subsets of the class
of homogeneous linear separators in R2 stops growing exponentially with n.1 For n = 2 the sample
defines 4 = 2^n subsets, but for n = 3 the sample defines 6 < 2^n subsets. It can be formally shown that
no 3 sample points can define more than 6 subsets of the space of homogeneous linear separators in R2
(some may define less, but that is even better for us) and that for n > 2 the number of subsets grows
polynomially rather than exponentially with n.
In what follows we first bound the distance between L̂(h, S) and L(h) for all h ∈ H in terms of
the number of subsets using symmetrization (Section 3.5.1) and then bound the number of subsets
(Section 3.5.2).

3.5.1 The VC Analysis: Symmetrization


We start with a couple of definitions.
Definition 3.6 (Dichotomies). Let x1 , . . . , xn ∈ X . The set of dichotomies (the labeling patterns)
generated by H on x1 , . . . , xn is defined by
\[
\mathcal H(x_1, \dots, x_n) = \{(h(x_1), \dots, h(x_n)) : h \in \mathcal H\}.
\]
Definition 3.7 (The Growth Function). The growth function of H is the maximal number of dichotomies
it can generate on n points:
\[
m_{\mathcal H}(n) = \max_{x_1,\dots,x_n} |\mathcal H(x_1, \dots, x_n)|.
\]

Note that mH (n) is defined by the “worst-case” configuration of points x1 , . . . , xn , for which
|H (x1 , . . . , xn )| is maximized. Thus, for lower bounding mH (n) (i.e., for showing that mH (n) ≥ v
for some value v) we have to find a configuration of points x1 , . . . , xn for which |H (x1 , . . . , xn )| ≥
v or, at least, prove that such configuration exists. For upper bounding mH (n) (i.e., for showing
that mH (n) ≤ v) we have to show that for any possible configuration of points x1 , . . . , xn we have
|H (x1 , . . . , xn )| ≤ v. In other words, coming up with an example of a particular configuration x1 , . . . , xn
for which |H (x1 , . . . , xn )| ≤ v is insufficient for proving that mH (n) ≤ v, because there may potentially
be an alternative configuration of points achieving a larger number of labeling configurations. To be
concrete, the illustration in Figure 3.5b shows that for the hypothesis space H of homogeneous linear
separators in R2 we have mH (3) ≥ 6, but it does not show that mH (3) ≤ 6. If we want to prove that
mH (3) ≤ 6 we have to show that no configuration of 3 sample points can differentiate between more
than 6 distinct subsets of the hypothesis space. More generally, if we want to show that mH (n) = v we
have to show that mH (n) ≥ v and mH (n) ≤ v. I.e., the only way to show equality is by proving a lower
and an upper bound.
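The following Python sketch (not part of the original text) gives a Monte Carlo lower bound on |H(x1 , . . . , xn )| for homogeneous linear separators in R^2: it samples many directions w and counts the distinct labeling patterns sign(⟨w, x_i⟩) they produce. Since mH (n) is a maximum over configurations, and since random sampling of w may miss some regions, the printed counts are only lower bounds; for points in general position they should come out close to 2n rather than 2^n.

import numpy as np

rng = np.random.default_rng(0)

def count_dichotomies(points, num_w=200000):
    # Monte Carlo lower bound on |H(x_1,...,x_n)| for homogeneous linear separators
    # h_w(x) = sign(<w, x>): sample many directions w and count distinct labeling patterns.
    w = rng.normal(size=(num_w, points.shape[1]))
    patterns = (points @ w.T) > 0              # shape (n, num_w); each column is one labeling
    return len({tuple(col) for col in patterns.T})

for n in (2, 3, 4, 5):
    pts = rng.normal(size=(n, 2))              # n points in R^2, in general position w.h.p.
    print(n, "dichotomies >=", count_dichotomies(pts), "   2^n =", 2 ** n)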
The following theorem uses the growth function to bound the distance between empirical and expected
loss for all h ∈ H.
Theorem 3.8. Assume that ℓ is bounded in the [0, 1] interval. Then for any δ ∈ (0, 1)
\[
\mathbb{P}\left(\exists h \in \mathcal H : L(h) \ge \hat L(h,S) + \sqrt{\frac{8\ln\frac{2 m_{\mathcal H}(2n)}{\delta}}{n}}\right) \le \delta.
\]

The result is useful when mH (2n) ≪ e^n. In Section 3.5.2 we discuss when we can and cannot expect
to have it, but for now we concentrate on the proof of the theorem.
The proof of the theorem is based on three ingredients. First we introduce a “ghost sample” S 0 ,
which is an imaginary sample of the same size as S (i.e., of size n). We do not need to have this sample
1 Homogeneous linear separators are linear separators passing through the origin.

Figure 3.6: Illustration for Step 2 of the proof of Theorem 3.8.

at hand, but we ask what would have happened if we had such sample. Then we apply symmetrization:
we show that the probability that for any h the empirical loss L̂(h, S) is far from L(h) by more than ε is
bounded by twice the probability that L̂(h, S) is far from L̂(h, S 0 ) by more than ε/2. This allows us to
consider the behavior of H on the two samples, S and S 0 , instead of studying it over all X (because the
definition of L(h) involves all X , whereas the definition of L̂(h, S 0 ) involves only S 0 ). In the third step
we project H onto the two samples, S and S 0 . Even though H is uncountably infinite, when we look at
it through the prism of S ∪ S 0 we can only observe a finite number of distinct behaviors. More precisely,
the number of different ways H can label S ∪ S 0 is at most mH (2n). We show that the probability that
for any of the possible ways to label S ∪ S 0 the empirical losses L̂(h, S) and L̂(h, S 0 ) diverge by more
than ε/2 decreases exponentially with n.
Now we do this formally.

Step 1 We introduce a ghost sample S 0 = {(X10 , Y10 ), . . . , (Xn0 , Yn0 )} of size n.

Step 2 [Symmetrization] We prove the following result.


Lemma 3.9. Assuming that e^{−nε²/2} ≤ 1/2 we have
\[
\mathbb{P}\left(\exists h \in \mathcal H : L(h) - \hat L(h,S) \ge \varepsilon\right) \le 2\,\mathbb{P}\left(\exists h \in \mathcal H : \hat L(h,S') - \hat L(h,S) \ge \frac{\varepsilon}{2}\right). \tag{3.5}
\]
The illustration in Figure 3.6 should be helpful for understanding the proof. The distance L(h) −
L̂(h, S) can be expressed as L(h) − L̂(h, S) = (L(h) − L̂(h, S′)) + (L̂(h, S′) − L̂(h, S)). We remind that
in general empirical losses are likely to be close to their expected values. More explicitly, under the mild
assumption that e^{−nε²/2} ≤ 1/2 we have that L(h) − L̂(h, S′) ≤ ε/2 with probability greater than 1/2. If
L(h) − L̂(h, S) ≥ ε and L(h) − L̂(h, S′) ≤ ε/2 we must have L̂(h, S′) − L̂(h, S) ≥ ε/2 (see the illustration).
The proof is based on a careful exploitation of this observation.
Proof of Lemma 3.9. We start from the right hand side of (3.5).
\begin{align}
&\mathbb{P}\left(\exists h \in \mathcal H : \hat L(h,S') - \hat L(h,S) \ge \frac{\varepsilon}{2}\right) \nonumber\\
&\quad\ge \mathbb{P}\left(\exists h \in \mathcal H : \hat L(h,S') - \hat L(h,S) \ge \frac{\varepsilon}{2} \text{ AND } \exists h \in \mathcal H : L(h) - \hat L(h,S) \ge \varepsilon\right) \nonumber\\
&\quad= \mathbb{P}\left(\exists h \in \mathcal H : L(h) - \hat L(h,S) \ge \varepsilon\right)\,\mathbb{P}\left(\exists h \in \mathcal H : \hat L(h,S') - \hat L(h,S) \ge \frac{\varepsilon}{2} \,\middle|\, \exists h \in \mathcal H : L(h) - \hat L(h,S) \ge \varepsilon\right). \tag{3.6}
\end{align}

(a) Illustration of the split. (b) Illustration of the distances.

Figure 3.7: Illustration of the split of S ∪ S 0 into S and S 0 . On the left: First we sample the
joint sample S ∪ S 0 . Then each hypothesis hj produces a “big bag” of losses {Z1 , . . . , Z2n }, where
Zi = `(hj (Xi ), Yi ). Even though H is uncountably infinite, the number of different ways to label S ∪ S 0
is at most mH (2n) by the definition of the growth function and thus the number of different “big bags”
of losses is at most mH (2n) (in the illustration we have m ≤ mH (2n)). Finally, we split S ∪ S 0 into S
and S 0 , which corresponds to splitting the “big bags” of 2n losses into pairs of “small bags” of n losses,
corresponding to L̂(hj , S) and L̂(hj , S 0 ). On the right: we illustrate the distances between the average
losses in a pair of “small bags” and the corresponding “big bag”, which is the average of the two “small
bags”.

The inequality follows by the fact that for any two events A and B we have P(A) ≥ P(A AND B)
and the equality by P(A AND B) = P(B)P(A|B). The first term in (3.6) is the term we want and
we need to lower bound the second term. We let h∗ be any h for which, by conditioning, we have
L(h∗ ) − L̂(h∗ , S) ≥ ε. With high probability we have that L̂(h∗ , S′) is close to L(h∗ ) up to ε/2. And
since we are given that L̂(h∗ , S) is far from L(h∗ ) by more than ε it must also be far from L̂(h∗ , S′) by
more than ε/2 with high probability, see the illustration in Figure 3.6. Formally, we have:
\begin{align}
&\mathbb{P}\left(\exists h \in \mathcal H : \hat L(h,S') - \hat L(h,S) \ge \frac{\varepsilon}{2} \,\middle|\, \exists h \in \mathcal H : L(h) - \hat L(h,S) \ge \varepsilon\right) \nonumber\\
&\quad\ge \mathbb{P}\left(\hat L(h^*,S') - \hat L(h^*,S) \ge \frac{\varepsilon}{2} \,\middle|\, L(h^*) - \hat L(h^*,S) \ge \varepsilon\right) \tag{3.7}\\
&\quad\ge \mathbb{P}\left(L(h^*) - \hat L(h^*,S') \le \frac{\varepsilon}{2} \,\middle|\, L(h^*) - \hat L(h^*,S) \ge \varepsilon\right) \tag{3.8}\\
&\quad= \mathbb{P}\left(L(h^*) - \hat L(h^*,S') \le \frac{\varepsilon}{2}\right) \tag{3.9}\\
&\quad\ge 1 - \mathbb{P}\left(L(h^*) - \hat L(h^*,S') \ge \frac{\varepsilon}{2}\right) \nonumber\\
&\quad\ge 1 - e^{-2n(\varepsilon/2)^2} \tag{3.10}\\
&\quad\ge \frac{1}{2}. \tag{3.11}
\end{align}
Explanation of the steps: in (3.7) the event on the left hand side includes the event on the right hand
side; in (3.8) we have L̂(h, S′) − L̂(h, S) = (L(h) − L̂(h, S)) − (L(h) − L̂(h, S′)) and since we are given
that L(h) − L̂(h, S) ≥ ε the event L̂(h, S′) − L̂(h, S) ≥ ε/2 follows from L(h) − L̂(h, S′) ≤ ε/2, see
Figure 3.6; in (3.9) we can remove the conditioning on S, because the event of interest concerns S′,
which is independent of S; (3.10) follows by Hoeffding’s inequality; and (3.11) follows by the lemma’s
assumption on e^{−nε²/2}.
By plugging the result back into (3.6) and multiplying by 2 we obtain the statement of the lemma.

 
Step 3 [Projection] Now we focus on P ∃h ∈ H : L̂(h, S 0 ) − L̂(h, S) ≥ 2ε , which concerns the be-
havior of H on two finite samples, S and S 0 . There are two possible ways to sample S and S 0 . The first
is to sample S and then S 0 . An alternative way is to sample a joint sample S2n = S ∪ S 0 and then split

it into S and S 0 by randomly assigning half of the samples into S and half into S 0 . The two procedures
are equivalent and lead to the same distribution over S and S 0 . We focus on the second procedure. Its
advantage is that once we have sampled S ∪ S 0 the number of ways to label it with hypotheses from
H is finite, even though H is uncountably infinite. This way we turn an uncountably infinite problem
into a finite problem. The number of different sequences of losses on S ∪ S 0 is at most the number of
different ways to label it, which is at most the growth function mH (2n) by definition. The probability
of having L̂(h, S 0 ) − L̂(h, S) ≥ ε/2 for a fixed h reduces to the probability of splitting a sequence of 2n
losses into n and n losses and having more than ε/2 difference between the average of the two. The latter
reduces to the problem of sampling n losses without replacement from a bag of 2n losses and obtaining
an average which deviates from the bag’s average by more than ε/4, see Figure 3.7. This probability
can be bounded by Hoeffding’s inequality for sampling without replacement and decreases as e^{−nε²/8}.
Putting this together we obtain the following result.
Lemma 3.10.
\[
\mathbb{P}\left(\exists h \in \mathcal H : \hat L(h,S') - \hat L(h,S) \ge \frac{\varepsilon}{2}\right) \le m_{\mathcal H}(2n)\, e^{-n\varepsilon^2/8}.
\]
As you may guess, mH (2n) comes from a union bound over the number of possible sequences of losses
we may obtain with hypotheses from H on S ∪ S 0 . We now prove the lemma formally.
Proof of Lemma 3.10.
\begin{align*}
\mathbb{P}\left(\exists h \in \mathcal H : \hat L(h,S') - \hat L(h,S) \ge \frac{\varepsilon}{2}\right) &= \sum_{S \cup S'} \mathbb{P}(S \cup S')\,\mathbb{P}\left(\exists h \in \mathcal H : \hat L(h,S') - \hat L(h,S) \ge \frac{\varepsilon}{2} \,\middle|\, S \cup S'\right)\\
&\le \sup_{S \cup S'} \mathbb{P}\left(\exists h \in \mathcal H : \hat L(h,S') - \hat L(h,S) \ge \frac{\varepsilon}{2} \,\middle|\, S \cup S'\right).
\end{align*}
Note that the conditional probabilities are with respect to the splitting of S ∪ S′ into S and S′.
Let Z(S ∪ S′) = {(Z1 , . . . , Z2n ) : Zi = ℓ(h(Xi ), Yi ), h ∈ H} be the set of all possible sequences of losses
that can be obtained by applying h ∈ H to S ∪ S′. Since there are at most mH (2n) distinct ways to
label S ∪ S′ we have |Z(S ∪ S′)| ≤ mH (2n). Let σ : {1, . . . , 2n} → {1, . . . , 2n} denote a permutation of
indexes. We have
\begin{align}
&\sup_{S\cup S'} \mathbb{P}\left(\exists h \in \mathcal H : \hat L(h,S') - \hat L(h,S) \ge \frac{\varepsilon}{2} \,\middle|\, S\cup S'\right) \nonumber\\
&\quad= \sup_{S\cup S'} \mathbb{P}\left(\exists \{Z_1,\dots,Z_{2n}\} \in \mathcal Z(S\cup S') : \frac{1}{n}\sum_{i=1}^n Z_{\sigma(i)} - \frac{1}{n}\sum_{i=n+1}^{2n} Z_{\sigma(i)} \ge \frac{\varepsilon}{2}\right) \tag{3.12}\\
&\quad\le \sup_{S\cup S'} \sum_{\{Z_1,\dots,Z_{2n}\}\in\mathcal Z(S\cup S')} \mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n Z_{\sigma(i)} - \frac{1}{n}\sum_{i=n+1}^{2n} Z_{\sigma(i)} \ge \frac{\varepsilon}{2}\right) \tag{3.13}\\
&\quad= \sup_{S\cup S'} \sum_{\{Z_1,\dots,Z_{2n}\}\in\mathcal Z(S\cup S')} \mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n Z_{\sigma(i)} - \frac{1}{2n}\sum_{i=1}^{2n} Z_i \ge \frac{\varepsilon}{4}\right) \tag{3.14}\\
&\quad\le \sup_{S\cup S'} \sum_{\{Z_1,\dots,Z_{2n}\}\in\mathcal Z(S\cup S')} e^{-n\varepsilon^2/8} \tag{3.15}\\
&\quad\le \sup_{S\cup S'} m_{\mathcal H}(2n)\, e^{-n\varepsilon^2/8} \tag{3.16}\\
&\quad= m_{\mathcal H}(2n)\, e^{-n\varepsilon^2/8},
\end{align}
where (3.12) follows by the fact that Z(S ∪ S 0 ) is the set of all possible losses on S ∪ S 0 and in the step
of splitting S ∪ S 0 into S and S 0 and computing L̂(h, S 0 ) and L̂(h, S) we are splitting a “big bag” of 2n
losses into two “small bags” of n and n; all that is left from H in the splitting process is Z(S ∪ S 0 );
the probability in (3.12) is over the split of S ∪ S 0 into S and S 0 , which is expressed by taking the
first n elements of a random permutation σ of indexes into S 0 and the last n elements into S and the
probability is over σ; in (3.13) we apply the union bound; for (3.14) see the illustration in Figure 3.7b; in
(3.15) we apply Hoeffding’s inequality for sampling without replacement (Theorem 2.22) to the process
of randomly sampling n losses out of 2n and observing ε/4 deviation from the average; in (3.16) we apply
the bound on |Z(S ∪ S 0 )|.

Step 4 [Putting Everything Together] All that is left for the proof of Theorem 3.8 is to put
Lemmas 3.9 and 3.10 together.
Proof of Theorem 3.8. Assuming that e^{−nε²/2} ≤ 1/2 we have by Lemmas 3.9 and 3.10:
\begin{align*}
\mathbb{P}\left(\exists h \in \mathcal H : L(h) - \hat L(h,S) \ge \varepsilon\right) &\le 2\,\mathbb{P}\left(\exists h \in \mathcal H : \hat L(h,S') - \hat L(h,S) \ge \frac{\varepsilon}{2}\right)\\
&\le 2\, m_{\mathcal H}(2n)\, e^{-n\varepsilon^2/8}.
\end{align*}
Note that if e^{−nε²/2} > 1/2 then 2 mH (2n) e^{−nε²/8} > 1 and the inequality is satisfied trivially (because
probabilities are always upper bounded by 1).
By denoting the right hand side of the inequality by δ and solving for ε we obtain the result.

3.5.2 Bounding the Growth Function: The VC-dimension


In Theorem 3.8 we relate the distance between the expected and empirical losses to the growth function
of H. Our next goal is to bound the growth function. In order to do so we introduce the concept of
shattering and the VC dimension.
Definition 3.11. A set of points x1 , . . . , xn is shattered by H if functions from H can produce all
possible binary labellings of x1 , . . . , xn or, in other words, if
\[
|\mathcal H(x_1, \dots, x_n)| = 2^n.
\]

For example, the set of homogeneous linear separators in R2 shatters the two points in Figure 3.5a,
but it does not shatter the three points in Figure 3.5b. Note that if two points lie on one line passing
through the origin, they are not shattered by the set of homogeneous linear separators, because they
always get the same label. Thus, we may have two sets of points of the same size, where one is shattered
and the other is not.
Definition 3.12. The Vapnik-Chervonenkis (VC) dimension of H, denoted by dVC (H), is the maximal
number of points that can be shattered by H. In other words,
\[
d_{\mathrm{VC}}(\mathcal H) = \max\,\{n \mid m_{\mathcal H}(n) = 2^n\}.
\]
If mH (n) = 2^n for all n, then dVC (H) = ∞.


Similar to the growth function, if we want to show that dVC (H) = d we have to show that dVC (H) ≥ d
and dVC (H) ≤ d. For example, the illustration in Figure 3.5a provides a configuration of points that
are shattered by homogeneous separating hyperplanes in R2 and thus shows that the VC-dimension of
homogeneous separating hyperplanes in R2 is at least 2. However, the illustration in Figure 3.5b does
not demonstrate that the VC-dimension of homogeneous separating hyperplanes in R2 is smaller than
3. If we want to show that the VC-dimension of homogeneous separating hyperplanes in R2 is smaller
than 3 we have to prove that no configuration of 3 points can be shattered. It is not sufficient to show
that one particular configuration of points cannot be shattered. In the same spirit, two points lying on
the same line passing through the origin cannot be shattered by homogeneous linear separators, but this
does not tell anything about the VC-dimension, because we have another configuration of two points
in Figure 3.5a that can be shattered. It is possible to show that the VC-dimension of homogeneous
separating hyperplanes in Rd is d and the VC-dimension of general separating hyperplanes in Rd (not
necessarily passing through the origin) is d + 1, see Abu-Mostafa et al. (2012, Exercise 2.4).
The next theorem bounds the growth function in terms of the VC-dimension.
Theorem 3.13 (Sauer’s Lemma).
\[
m_{\mathcal H}(n) \le \sum_{i=0}^{d_{\mathrm{VC}}(\mathcal H)} \binom{n}{i}. \tag{3.17}
\]

We remind that the binomial coefficient \binom{n}{k} counts the number of ways to pick k elements out of n
and that for n < k it is defined as \binom{n}{k} = 0. Thus, equation (3.17) is well-defined even when n < dVC (H).
We also remind that Σ_{i=0}^n \binom{n}{i} = 2^n, where 2^n is the number of all possible subsets of n elements, which
is equal to the sum over i going from 0 to n of the number of ways to select i elements out of n. For n ≤ dVC (H) we have
mH (n) = 2^n and the inequality is satisfied trivially.
The proof of Theorem 3.13 is reminiscent of the combinatorial proof of the binomial identity
\[
\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}.
\]
One way to count the number of ways to select k elements out of n on the right hand side is to take
one element aside. If that element is selected, then we have \binom{n-1}{k-1} possibilities to select k − 1 additional
elements out of the remaining n − 1. If the element is not selected, then we have \binom{n-1}{k} possibilities to
select all k elements out of the remaining n − 1. The sets including the first element are disjoint from the
sets excluding it, leading to the identity above.
We need one more definition for the proof of Theorem 3.13.
Definition 3.14. Let B(n, d) be the maximal number of possible ways to label n points, so that no d + 1
points are shattered.
By the definition, we have mH (n) ≤ B(n, dVC (H)).
Proof of Theorem 3.13. We prove by induction that
\[
B(n,d) \le \sum_{i=0}^{d} \binom{n}{i}. \tag{3.18}
\]
For the induction base we have B(n, 0) = 1 = \binom{n}{0}: if no points are shattered there is just one way to
label the points. If there were more than one way, they would differ in at least one point and that
point would be shattered. By the definition of binomial coefficients, which says that for k > n we have
\binom{n}{k} = 0, we also know that for n < d we have B(n, d) = B(n, n). In particular, B(0, d) = B(0, 0) = 1.
Now we proceed with induction on d and for each d we do an induction on n. We show that
B(n, d) ≤ B(n − 1, d) + B(n − 1, d − 1).
Let S be a maximal set of dichotomies (labeling patterns) on n points x1 , . . . , xn . We take one point
aside, xn , and split S into three disjoint subsets: S = S ∗ ∪ S + ∪ S − . The set S ∗ contains dichotomies on
n points that appear with just one sign on xn , either positive or negative. The sets S + and S − contain all
dichotomies that appear with both positive and negative sign on xn , where the positive ones are collected
in S + and the negative ones are collected in S − . Thus, the sets S + and S − are identical except in their
labeling of xn , where in S + it is always labeled as + and in S − always as −. By contradiction, the
number of points x1 , . . . , xn−1 that are shattered by S − cannot be larger than d − 1, because otherwise
the number of points that are shattered by S, which includes S + and S − , would be larger than d, since
we can use S + and S − to add xn to the set of shattered points. Therefore, |S − | ≤ B(n − 1, d − 1). At
the same time, the number of points x1 , . . . , xn−1 that are shattered by S ∗ ∪ S + cannot be larger than d,
because the total number of points shattered by S is at most d. Thus, we have |S ∗ ∪ S + | ≤ B(n − 1, d).
And overall
B(n, d) = |S| = |S ∗ ∪ S + | + |S − | ≤ B(n − 1, d) + B(n − 1, d − 1),
as desired. By the induction assumption equation (3.18) is satisfied for B(n − 1, d) and B(n − 1, d − 1),
and we have
\begin{align*}
B(n,d) &\le \sum_{i=0}^{d} \binom{n-1}{i} + \sum_{i=0}^{d-1} \binom{n-1}{i}\\
&= 1 + \sum_{i=0}^{d-1} \left(\binom{n-1}{i+1} + \binom{n-1}{i}\right)\\
&= \sum_{i=0}^{d} \binom{n}{i},
\end{align*}

as desired. Finally, as we have already observed, mH (n) ≤ B(n, dVC (H)), completing the proof.
The following lemma provides a more explicit bound on the growth function.
Lemma 3.15.
\[
\sum_{i=0}^{d} \binom{n}{i} \le n^d + 1.
\]
The proof is based on induction and left as an exercise.
By plugging the results of Theorem 3.13 and Lemma 3.15 into Theorem 3.8 we obtain the VC
generalization bound.
Theorem 3.16 (VC generalization bound). Let H be a hypothesis class with VC-dimension dVC (H) =
dVC . Then:
\[
\mathbb{P}\left(\exists h \in \mathcal H : L(h) \ge \hat L(h,S) + \sqrt{\frac{8\ln\left(2\left((2n)^{d_{\mathrm{VC}}}+1\right)/\delta\right)}{n}}\right) \le \delta.
\]

For example, the VC-dimension of linear separators in R^d is d + 1 and Theorem 3.16 provides
generalization guarantees for learning with linear separators in finite-dimensional spaces, as long as the
dimension of the space d is small in relation to the number of points n.

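The following Python sketch (not part of the original text; the empirical loss, δ, and dVC = 11 are made-up values) evaluates the bound of Theorem 3.16, illustrating that it is vacuous for small n and becomes meaningful once n is large relative to dVC.

import math

def vc_bound(emp_loss, n, d_vc, delta):
    # Theorem 3.16: L(h) <= L_hat(h, S) + sqrt( 8 * ln( 2*((2n)^d_vc + 1) / delta ) / n ).
    log_term = math.log(2 * ((2 * n) ** d_vc + 1)) + math.log(1 / delta)
    return emp_loss + math.sqrt(8 * log_term / n)

delta, d_vc = 0.05, 11          # e.g., linear separators in R^10 have d_vc = 11
for n in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    print(n, round(vc_bound(0.05, n, d_vc, delta), 3))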
3.6 VC Analysis of SVMs


Kernel Support Vector Machines (SVMs) can map the data into high and potentially infinite-dimensional
spaces. For example, Radial Basis Function (RBF) kernels map the data into an infinite-dimensional
space. In the following we provide a more refined analysis of generalization in learning with linear
separators in high-dimensional spaces. The analysis is based on the notion of separation with a margin.
We use the following definitions.
Definition 3.17 (Fat Shattering). Let Hγ = {(w, b) : ‖w‖ ≤ 1/γ} be the space of hyperplanes described
by w and b, where w is a vector in R^d (with potentially infinite dimension d) with ‖w‖ ≤ 1/γ and b ∈ R.
We say that a set of points {x1 , . . . , xn } is fat-shattered by Hγ if for any set of labels {y1 , . . . , yn } ∈
{±1}^n we have a hyperplane (w, b) ∈ Hγ that satisfies yi (⟨w, xi ⟩ + b) ≥ 1 for all i ∈ {1, . . . , n}.
Note that when y = sign(⟨w, x⟩ + b) the distance of a point x to a hyperplane h defined by (w, b) is
given by dist(h, x) = y(⟨w, x⟩ + b)/‖w‖ (Abu-Mostafa et al., 2015, Page 5, Chapter 8) and for h = (w, b) ∈ Hγ
and {x1 , . . . , xn } fat-shattered by Hγ we obtain dist(h, xi ) ≥ γ for all i ∈ {1, . . . , n}. It means that any
possible labeling of {x1 , . . . , xn } can be achieved with margin at least γ.
Definition 3.18 (Fat Shattering Dimension). We say that the fat shattering dimension dFAT (Hγ ) = d if d
is the maximal number of points that can be fat shattered by Hγ . (I.e., there exist d points that can be
fat shattered by Hγ and no d + 1 points can be fat shattered by Hγ .)
Note that dFAT (Hγ ) ≤ dVC (Hγ ) ≤ d + 1, where d is the dimension of w. (If we can shatter n points
with margin γ we can also shatter them without the margin.)
The following theorem bounds the fat shattering dimension of Hγ , see Abu-Mostafa et al. (2015) for
a proof.
Theorem 3.19 ((Abu-Mostafa et al., 2015, Theorem 8.5)). Assume that the input space X is a ball of
radius R in R^d (i.e., ‖x‖ ≤ R for all x ∈ X ), where d may potentially be infinite. Then:
\[
d_{\mathrm{FAT}}(\mathcal H_\gamma) \le \left\lceil R^2/\gamma^2 \right\rceil + 1,
\]
where ⌈R²/γ²⌉ is the smallest integer that is greater than or equal to R²/γ².
The important point is that the bound on fat shattering dimension is independent of the dimension
of the space Rd that w comes from.
We define fat losses that count as error everything that falls too close to the separating hyperplane
or on the wrong side of it.

Definition 3.20 (Fat Losses). For h = (w, b) we define the fat losses
\[
\ell_{\mathrm{FAT}}(h(x), y) = \begin{cases} 0, & \text{if } y(\langle w, x\rangle + b) \ge 1,\\ 1, & \text{otherwise,}\end{cases}
\qquad
L_{\mathrm{FAT}}(h) = \mathbb{E}\left[\ell_{\mathrm{FAT}}(h(X), Y)\right],
\qquad
\hat L_{\mathrm{FAT}}(h, S) = \frac{1}{n}\sum_{i=1}^n \ell_{\mathrm{FAT}}(h(X_i), Y_i).
\]

In relation to the fat losses the fat shattering dimension acts in the same way as the VC-dimension
in relation to the zero-one loss. In particular, we have the following result that relates LFAT (h) to
L̂FAT (h, S) via dFAT (Hγ ) (the proof is left as an exercise).
Theorem 3.21.
\[
\mathbb{P}\left(\exists h \in \mathcal H_\gamma : L_{\mathrm{FAT}}(h) \ge \hat L_{\mathrm{FAT}}(h,S) + \sqrt{\frac{8\ln\left(2\left((2n)^{d_{\mathrm{FAT}}(\mathcal H_\gamma)}+1\right)/\delta\right)}{n}}\right) \le \delta.
\]

Now we are ready to analyze generalization in learning with fat linear separation. For the analysis
we make a simplifying assumption that the data are contained within a ball of radius R = 1. The
analysis for general R is left as an exercise. Note that R refers to the radius of the ball after potential
transformation of the data through a feature mapping / kernel function. For example, the RBF kernel
maps the data into an infinite dimensional space and we consider the radius of the ball containing the
transformed data in the infinite dimensional space.

Theorem 3.22. Assume that the input space X is a ball of radius R = 1 in R^d, where d is potentially
infinite. Let H be the space of linear separators h = (w, b). Then
\[
\mathbb{P}\left(\exists h \in \mathcal H : L_{\mathrm{FAT}}(h) \ge \hat L_{\mathrm{FAT}}(h,S) + \sqrt{\frac{8\ln\left(2\left((2n)^{1+\lceil\|w\|^2\rceil}+1\right)\left(1+\lceil\|w\|^2\rceil\right)\lceil\|w\|^2\rceil/\delta\right)}{n}}\right) \le \delta.
\]

Observe that L(h) ≤ LFAT (h) and, therefore, the theorem provides a generalization bound for L(h).
(If we count correct classifications within the margin as errors we only increase the loss.)
Proof. The proof is based on a combination of the VC and Occam’s razor bounding techniques, see the
illustration in Figure 3.8. We start by noting that Theorem 3.19 is interesting when ⌈R²/γ²⌉ < d + 1,
because as we have already noted dFAT (Hγ ) ≤ dVC (Hγ ) ≤ d + 1. We slice the hypothesis space H into
a nested sequence of subspaces H1 ⊂ H2 ⊂ · · · ⊂ Hd−1 ⊂ Hd = H, where for all i < d we define Hi to
be the hypothesis space Hγ with 1/γ² = i. In other words, Hi = H_{γ=1/√i} (do not let the notation
confuse you: by Hi we denote the i-th hypothesis space in the nested sequence of hypothesis spaces and
by Hγ we denote the hypothesis space with ‖w‖ upper bounded by 1/γ). By Theorem 3.19 we have
dFAT (Hi ) ≤ i + 1 and then by Theorem 3.21:
\[
\mathbb{P}\left(\exists h \in \mathcal H_i : L_{\mathrm{FAT}}(h) \ge \hat L_{\mathrm{FAT}}(h,S) + \sqrt{\frac{8\ln\left(2\left((2n)^{1+i}+1\right)/\delta_i\right)}{n}}\right) \le \delta_i.
\]
n 

1
P∞ 1
P∞  1 1

1
 1 1
 1 1

We take δi = i(i+1) δ and note that i=1 i(i+1) = i=1 i − i+1 = 1− 2 + 2 − 3 + 3 − 4 +
d
[
· · · = 1. We also note that H = (Hi \ Hi−1 ), where H0 is defined as the empty set and Hi \ Hi−1 is
i=1
the difference between sets Hi and Hi−1 (everything that is in Hi , but not in Hi−1 ). Note that the sets

36
Figure 3.8: Illustration for the proof of Theorem 3.22

Hi \Hi−1 and Hj \Hj−1 are disjoint for i 6= j. Also note that δi is a distribution
 of our confidence budget
δ among Hi \ Hi−1 -s. Finally, note that if h = (w, b) ∈ Hi \ Hi−1 then kwk2 = i. The remainder of
the proof follows the same lines as the proof of Occam’s razor bound:
 v    
u 8 ln 2 (2n)1+dkwk2 e + 1 (1 + dkwk2 e) dkwk2 e /δ
u
 t 
∃h ∈ H : LFAT (h) ≥ L̂FAT (h, S) +
P 
n 

 v    
u 8 ln 2 (2n)1+dkwk2 e + 1 (1 + dkwk2 e) dkwk2 e /δ
u
d
[
 t 
∃h ∈
= P Hi \ Hi−1 : LFAT (h) ≥ L̂FAT (h, S) + 
i=1
n 

 v    
u 8 ln 2 (2n)1+dkwk2 e + 1 (1 + dkwk2 e) dkwk2 e /δ
u
d
X  t 
= ∃h ∈ Hi \ Hi−1
P : LFAT (h) ≥ L̂FAT (h, S) + 
i=1
n 

 v    
u
d
X u 8 ln 2 (2n)1+i + 1 (1 + i) i/δ
 t 
= ∃h ∈ Hi \ Hi−1
P : LFAT (h) ≥ L̂FAT (h, S) + 
i=1
n 

 v    
u
d u 8 ln 2 (2n)1+i + 1 /δ
X  t i 
= ∃h ∈ Hi \ Hi−1
P : LFAT (h) ≥ L̂FAT (h, S) + 
i=1
n 

 v    
u
d u 8 ln 2 (2n)1+i + 1 /δ
X  t i 
≤ ∃h ∈ Hi : LFAT (h) ≥ L̂FAT (h, S) +
P 
i=1
n 

d d d ∞
X X 1 X 1 X 1
≤ δi = δ=δ ≤δ = δ.
i=1 i=1
i(i + 1) i=1
i(i + 1) i=1
i(i + 1)

3.7 VC Lower Bound
In this section we show that when the VC-dimension is unbounded, it is impossible to bound the distance
between L(h) and L̂(h, S).
Theorem 3.23. Let H be a hypothesis class with dVC (H) = ∞. Then for any n there exists a distribution
over X and a class of target functions F, such that
\[
\mathbb{E}\left[\sup_h \left(L(h) - \hat L(h,S)\right)\right] \ge 0.25,
\]
where the expectation is over the selection of a sample of size n and a target function.

Proof. Pick n. Since dVC (H) = ∞ we know that there exist 2n points that are shattered by H. Let the
sample space X2n = {x1 , . . . , x2n } be these points and let p(x) be uniform on X2n . Let F be the set of
all possible functions from X2n to {0, 1} and let p(f ) be uniform over F. Let S be a sample of n points.
Let {Fk (S)}k be maximal subsets of F, such that F = ∪k Fk (S) and any fi , fj ∈ Fk (S) agree on S.
Note that since X2n is shattered by H, for any S, any Fk , and any fi ∈ Fk (S) that was used to label S
there exists h∗ (Fk (S), S) ∈ H such that the empirical error L̂(h∗ (Fk (S), S), S) = 0. Let p(k) and p(i) be uniform. Then:
\begin{align*}
\mathbb{E}\left[\sup_h\left(L(h) - \hat L(h,S)\right)\right]
&= \mathbb{E}_{f\sim p(f)}\left[\mathbb{E}_{S\sim p(X)^n}\left[\sup_h\left(L(h) - \hat L(h,S)\right)\,\middle|\, f\right]\right]\\
&= \mathbb{E}_{S\sim p(X)^n}\left[\mathbb{E}_{f\sim p(f)}\left[\sup_h\left(L(h) - \hat L(h,S)\right)\,\middle|\, S\right]\right]\\
&= \mathbb{E}_{S\sim p(X)^n}\left[\mathbb{E}_{k\sim p(k)}\left[\mathbb{E}_{i\sim p(i)}\left[\sup_h\left(L(h) - \hat L(h,S)\right)\,\middle|\, \mathcal F_k\right]\,\middle|\, S\right]\right]\\
&\ge \mathbb{E}_{S\sim p(X)^n}\left[\mathbb{E}_{k\sim p(k)}\left[\mathbb{E}_{i\sim p(i)}\left[L(h^*(\mathcal F_k,S)) - \hat L(h^*(\mathcal F_k,S),S)\,\middle|\, \mathcal F_k\right]\,\middle|\, S\right]\right]\\
&= \mathbb{E}_{S\sim p(X)^n}\left[\mathbb{E}_{k\sim p(k)}\left[\mathbb{E}_{i\sim p(i)}\left[L(h^*(\mathcal F_k,S))\,\middle|\, \mathcal F_k\right]\,\middle|\, S\right]\right]\\
&\ge \mathbb{E}_{S\sim p(X)^n}\left[\mathbb{E}_{k\sim p(k)}\left[0.25\,\middle|\, S\right]\right]\\
&= 0.25.
\end{align*}

Corollary 3.24. Under the assumptions of Theorem 3.23, with probability at least 0.125, sup_h (L(h) −
L̂(h, S)) ≥ 0.125. Thus, it is impossible to have high-probability bounds on sup_h (L(h) − L̂(h, S)) that
converge to zero as n goes to infinity.
Proof. Note that sup_h (L(h) − L̂(h, S)) ≤ 1, since ℓ is bounded in [0, 1]. Assume by contradiction that
P(sup_h (L(h) − L̂(h, S)) ≥ 0.125) < 0.125. Then
\[
\mathbb{E}\left[\sup_h\left(L(h) - \hat L(h,S)\right)\right] \le 0.125 \times 1 + (1 - 0.125) \times 0.125 < 2 \times 0.125 = 0.25,
\]
which is in contradiction with Theorem 3.23.

3.8 PAC-Bayesian Analysis


Occam’s razor and VC analysis consider hard selection of a single hypothesis from a hypothesis class.
In PAC-Bayesian analysis hard selection is replaced by a soft selection: instead of selecting a single
hypothesis, it is allowed to select a distribution over the hypothesis space. When the distribution is a
delta-distribution putting all the mass on a single hypothesis, hard selection is recovered and the outcome
is identical to Occam’s razor bound. However, the possibility of soft selection provides much more freedom
and control over the approximation-estimation trade-off. PAC-Bayesian generalization bounds are based
on change of measure inequality, which acts as replacement for the union bound. Change of measure

inequality has two important advantages over the union bound: (1) it is tighter (you will verify this in
a home assignment) and (2) it can be applied to uncountably infinite hypothesis classes. Furthermore,
soft selection allows application of gradient-descent type methods to optimize the distribution over H,
which in some cases leads to efficient algorithms for direct minimization of the PAC-Bayesian bounds.
Soft selection is implemented by randomized classifiers, which are formally defined below.
Definition 3.25 (Randomized Classifier). Let ρ be a distribution over H. A randomized classifier
associated with ρ (and named ρ) acts according to the following scheme. At each prediction round it:
1. Picks h ∈ H according to ρ(h)
2. Observes x
3. Returns h(x)
The expected loss of ρ is E_{h∼ρ}[L(h)] and the empirical loss is E_{h∼ρ}[L̂(h, S)]. Whenever it does not lead
to confusion, we will shorten the notation to E_ρ[L(h)] and E_ρ[L̂(h, S)].
There is a large number of different PAC-Bayesian inequalities. We start with the classical one due
to Seeger (2002).
Theorem 3.26 (PAC-Bayes-kl inequality). For any “prior” distribution π over H that is independent
of S, for all randomized classifiers (distributions) ρ simultaneously:
\[
\mathbb{P}\left(\mathrm{kl}\left(\mathbb{E}_\rho\left[\hat L(h,S)\right]\,\middle\|\,\mathbb{E}_\rho\left[L(h)\right]\right) \ge \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{n+1}{\delta}}{n}\right) \le \delta. \tag{3.19}
\]
The meaning of “prior” should be interpreted in exactly the same way as the “prior” in Occam’s
razor bound: it is any distribution over H that sums up to one and does not depend on the sample S.
The prior is an auxiliary construction for deriving the bound and unlike in Bayesian learning there is no
assumption that it reflects any real-world distribution over H.
Before proceeding to the proof of the theorem we provide a discussion of its meaning. To get some
intuition we apply Pinsker’s relaxation of kl (inequality 2.17), which leads to a more digestible (although
weaker) form of the bound: with probability greater than 1 − δ, for all ρ simultaneously
\[
\mathbb{E}_\rho[L(h)] \le \mathbb{E}_\rho\left[\hat L(h,S)\right] + \sqrt{\frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{n+1}{\delta}}{2n}}.
\]
Note that when ρ = π the KL term is zero and we recover the generalization bound for a single hypothesis. Taking ρ = π amounts to making no selection: if we start with a prior distribution π and keep using it without taking any information from the sample, we get the usual Hoeffding's or kl inequality. In order to get more intuition about the bound we decompose the KL divergence:

    KL(ρ‖π) = Eρ[ ln(ρ/π) ] = Eρ[ ln(1/π) ] − H(ρ),

where the first term, Eρ[ln(1/π)], is the average complexity and the second term, H(ρ), is the entropy of ρ.

If H is finite and π is uniform, then H(ρ) ≥ 0 and KL(ρ‖π) = ln |H| − H(ρ) ≤ ln |H|, and we recover the generalization bound for finite hypothesis sets with an improvement of −H(ρ). Recall that the entropy
H(ρ) is zero when ρ is a delta-distribution and when ρ is uniform the entropy has its maximal value,
which is ln |H|. Thus, − H(ρ) is an “award” for avoiding commitment to a single hypothesis.
Overall, the PAC-Bayesian inequality advocates for picking ρ that minimizes the trade-off between:
1. The empirical error L̂(h, S).
2. The complexity (description length, prior belief) ln(1/π(h)).

3. And has maximum entropy (it is "indifferent" between h and h′ when L̂(h, S) = L̂(h′, S) and π(h) = π(h′)). Maximization of H(ρ) corresponds to avoidance of selection whenever it is not necessary.
Reduced selection leads to improved estimation without impairing the approximation and provides
a tighter generalization bound.
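The following sketch (an illustration added here, not part of the original text) computes the bound of Theorem 3.26 numerically for a given empirical loss and KL term by inverting the kl divergence with bisection, and compares it with the Pinsker relaxation above; all numbers and names are hypothetical.

    import numpy as np

    def kl_bernoulli(p, q):
        """kl(p||q) for Bernoulli means, with the convention 0 ln 0 = 0."""
        val = 0.0
        if p > 0:
            val += p * np.log(p / q)
        if p < 1:
            val += (1 - p) * np.log((1 - p) / (1 - q))
        return val

    def kl_upper_inverse(p_hat, budget, tol=1e-9):
        """Largest q >= p_hat with kl(p_hat||q) <= budget, found by bisection."""
        lo, hi = p_hat, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if kl_bernoulli(p_hat, mid) <= budget:
                lo = mid
            else:
                hi = mid
        return lo

    def pac_bayes_kl_bound(emp_loss, kl_rho_pi, n, delta):
        budget = (kl_rho_pi + np.log((n + 1) / delta)) / n
        return kl_upper_inverse(emp_loss, budget)

    # Hypothetical numbers: the kl form is noticeably tighter than the Pinsker
    # relaxation when the empirical loss is small.
    n, delta, kl_rho_pi, emp = 1000, 0.05, 3.0, 0.05
    print(pac_bayes_kl_bound(emp, kl_rho_pi, n, delta))
    print(emp + np.sqrt((kl_rho_pi + np.log((n + 1) / delta)) / (2 * n)))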

3.8.1 Relation and Differences with other Learning Approaches
PAC-Bayesian analysis has the following relations to and differences from Bayesian learning and from VC analysis / Rademacher complexities.

Relation with Bayesian learning


1. Explicit way to incorporate prior information (via π(h)).

Difference with Bayesian learning


1. Explicit high-probability guarantee on the expected performance.
2. No belief in prior correctness (frequentist bound).
3. Explicit dependence on the loss function.
4. Different weighting of prior belief π(h) vs. evidence L̂(h).
5. Holds for any distribution ρ (including the Bayes posterior).

Relation with VC analysis / Rademacher complexities


1. Explicit high-probability guarantee on the expected performance.
2. Explicit dependence on the loss function.

Difference with VC analysis / Rademacher complexities


1. Complexity is defined individually for each h via π(h) (rather than “complexity of a hypothesis
class”).
2. Explicit way to incorporate prior knowledge.
3. The bound is defined for randomized classifiers ρ (not individual h); but workarounds exist in some
cases.
In a sense, PAC-Bayesian analysis takes the best out of Bayesian learning and VC analysis and puts it
together. And it also leads to efficient learning algorithms, since KL(ρkπ) is convex in ρ and L̂(ρ, S) is
linear in ρ.

3.8.2 A Proof of PAC-Bayes-kl Inequality


At the basis of most PAC-Bayesian bounds lies the change of measure inequality, which acts as a replacement of the union bound for uncountably infinite sets.
Theorem 3.27 (Change of measure inequality). For any measurable function f (h) on H and any
distributions ρ and π:

    E_{h∼ρ(h)}[f(h)] ≤ KL(ρ‖π) + ln E_{h∼π(h)}[ e^{f(h)} ].
Proof.

    E_{ρ(h)}[f(h)] = E_{ρ(h)}[ ln( (ρ(h)/π(h)) · e^{f(h)} · (π(h)/ρ(h)) ) ]
                  = KL(ρ‖π) + E_{ρ(h)}[ ln( e^{f(h)} · π(h)/ρ(h) ) ]
                  ≤ KL(ρ‖π) + ln E_{ρ(h)}[ e^{f(h)} · π(h)/ρ(h) ]
                  = KL(ρ‖π) + ln E_{π(h)}[ e^{f(h)} ],

where the inequality in the third step is justified by Jensen’s inequality (Theorem B.30). Note that there
is nothing probabilistic in the statement of the theorem - it is a deterministic result.
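As a sanity check, here is a small numerical illustration (added here, not from the original notes) of the inequality on a finite hypothesis class with randomly drawn ρ, π, and f; all names are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    K = 10
    f = rng.normal(size=K)               # an arbitrary function f(h) on a finite H
    pi = rng.dirichlet(np.ones(K))       # "prior" pi
    rho = rng.dirichlet(np.ones(K))      # any "posterior" rho

    lhs = rho @ f                                                   # E_{h~rho}[f(h)]
    rhs = np.sum(rho * np.log(rho / pi)) + np.log(pi @ np.exp(f))   # KL(rho||pi) + ln E_pi[e^f]
    assert lhs <= rhs + 1e-12                                       # change of measure holds
    print(lhs, rhs)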

In the next lemma we extend f to be a function of h and a sample S and apply a probabilistic argument
to the last term of change-of-measure inequality. The lemma is the foundation for most PAC-Bayesian
bounds.
Lemma 3.28 (PAC-Bayes lemma). For any measurable function f : H × (X × Y)n → R and any
distribution π on H that is independent of the sample S

    P( ∃ρ : E_{h∼ρ}[f(h, S)] ≥ KL(ρ‖π) + ln( E_{h∼π}[ E_S[ e^{f(h,S)} ] ] / δ ) ) ≤ δ,

where the probability is with respect to the draw of the sample S and ES is the expectation with respect
to the draw of S.
An equivalent way of writing the above statement is

    P( ∀ρ : E_{h∼ρ}[f(h, S)] ≤ KL(ρ‖π) + ln( E_{h∼π}[ E_S[ e^{f(h,S)} ] ] / δ ) ) ≥ 1 − δ

or, in words, with probability at least 1 − δ over the draw of S, for all ρ simultaneously

    Eρ[f(h, S)] ≤ KL(ρ‖π) + ln( Eπ[ E_S[ e^{f(h,S)} ] ] / δ ).
We first present a slightly less formal, but more intuitive proof, and then provide a formal one. By the change of measure inequality we have

    Eρ[f(h, S)] ≤ KL(ρ‖π) + ln Eπ[ e^{f(h,S)} ]
               ≤ KL(ρ‖π) + ln( E_S[ Eπ[ e^{f(h,S)} ] ] / δ )        (with probability at least 1 − δ)
               = KL(ρ‖π) + ln( Eπ[ E_S[ e^{f(h,S)} ] ] / δ ),

where in the second line we apply Markov's inequality to the random variable Z = Eπ[e^{f(h,S)}] (and the inequality holds with probability at least 1 − δ) and in the last line we can exchange the order of expectations, because π is independent of S. The key observation is that the change-of-measure inequality relates all posterior distributions ρ to a single prior distribution π in a deterministic way, and the probabilistic argument (Markov's inequality) is applied to a single random quantity Eπ[e^{f(h,S)}]. This way the change-of-measure inequality replaces the union bound and it holds even when H is uncountably infinite.
Now we provide a formal proof.
Proof.

    P( ∃ρ : Eρ[f(h, S)] ≥ KL(ρ‖π) + ln( Eπ[ E_S[ e^{f(h,S)} ] ] / δ ) )
        ≤ P( Eπ[ e^{f(h,S)} ] ≥ Eπ[ E_S[ e^{f(h,S)} ] ] / δ )        (3.20)
        = P( Eπ[ e^{f(h,S)} ] ≥ E_S[ Eπ[ e^{f(h,S)} ] ] / δ )        (3.21)
        ≤ δ,

where (3.20) follows by the change-of-measure inequality (elaborated below), in (3.21) we can exchange the order of expectations, because π is independent of S, and in the last step we apply Markov's inequality to the random variable Z = Eπ[e^{f(h,S)}].
An elaboration regarding Step (3.20). By the change of measure inequality, we have that ∀ρ : Eρ[f(h, S)] ≤ KL(ρ‖π) + ln Eπ[e^{f(h,S)}]. From the change of measure inequality, we deduce that if Eπ[e^{f(h,S)}] ≤ E_S[Eπ[e^{f(h,S)}]]/δ, then ∀ρ : Eρ[f(h, S)] ≤ KL(ρ‖π) + ln( E_S[Eπ[e^{f(h,S)}]] / δ ). Let A denote the event in the if-statement and B denote the event in the then-statement. Then we have P(A) ≤ P(B) and, therefore, P(Ā) ≥ P(B̄), where Ā denotes the complement of event A. The complement of A is Eπ[e^{f(h,S)}] > E_S[Eπ[e^{f(h,S)}]]/δ and the complement of B is ∃ρ : Eρ[f(h, S)] > KL(ρ‖π) + ln( E_S[Eπ[e^{f(h,S)}]] / δ ), which gives us the inequality in Step (3.20). (As usual, we are being a tiny bit sloppy and do not trace which inequalities are strict and which are weak; with a slight extra effort this could be done, but it does not matter in practice, so we save the effort.) The important point is that the change-of-measure inequality relates all posterior distributions ρ to a single prior distribution π in a deterministic way, and the probabilistic argument is applied to a single random variable Eπ[e^{f(h,S)}], avoiding the need to take a union bound. This way the change of measure inequality acts as a replacement of the union bound.
Different PAC-Bayesian inequalities are obtained by different choices of the function f(h, S). A key consideration in the choice of f(h, S) is the possibility to bound the moment generating function E_S[e^{f(h,S)}]. For example, we have done it for f(h, S) = n kl(L̂(h, S)‖L(h)) in Lemma 2.14 and this is the choice of f in the proof of the PAC-Bayes-kl inequality. Other choices of f are possible. For example, Hoeffding's Lemma 2.6 provides a bound on the moment generating function of f(h, S) = λ( L(h) − L̂(h, S) ), which can be used to derive the PAC-Bayes-Hoeffding inequality. We refer to Seldin et al. (2012) for more details.
The proof of PAC-Bayes-kl inequality relies on convexity of the kl-divergence. We cite the theorem
and refer to Cover and Thomas (2006) for details.
Theorem 3.29 (Cover and Thomas, 2006, Theorem 2.7.2). KL(p‖q) is convex in the pair (p, q); that is, if (p1, q1) and (p2, q2) are two pairs of probability mass functions, then

    KL( λp1 + (1 − λ)p2 ‖ λq1 + (1 − λ)q2 ) ≤ λ KL(p1‖q1) + (1 − λ) KL(p2‖q2)

for all 0 ≤ λ ≤ 1.
Corollary 3.30.

    kl( Eρ[L̂(h, S)] ‖ Eρ[L(h)] ) ≤ Eρ[ kl( L̂(h, S) ‖ L(h) ) ].

Finally, we are ready to prove Theorem 3.26.

Proof of Theorem 3.26. We provide an intuitive derivation and leave the formal one (as in the proof of
Lemma 3.28) as an exercise.
We take f(h, S) = n kl(L̂(h, S)‖L(h)). Then we have

    n kl( Eρ[L̂(h, S)] ‖ Eρ[L(h)] ) ≤ Eρ[ n kl( L̂(h, S) ‖ L(h) ) ]
                                   ≤ KL(ρ‖π) + ln( Eπ[ E_S[ e^{n kl(L̂(h,S)‖L(h))} ] ] / δ )        (with probability at least 1 − δ)
                                   ≤ KL(ρ‖π) + ln( Eπ[ n + 1 ] / δ )
                                   = KL(ρ‖π) + ln( (n + 1)/δ ),
where the first inequality is by Corollary 3.30, the second inequality is by the PAC-Bayes Lemma (and it
holds with probability at least 1 − δ over the draw of S), and the third inequality is by Lemma 2.14.

3.8.3 Application to SVMs


In order to apply PAC-Bayesian bound to a given problem we have to design a prior distribution π and
then bound the KL-divergence KL(ρkπ) for the posterior distributions of interest. Sometimes we resort
to a restricted class of ρ-s, for which we are able to bound KL(ρkπ). You can see how this is done for
SVMs in Langford (2005, Section 5.3).

3.8.4 Relaxation of PAC-Bayes-kl: PAC-Bayes-λ Inequality
Due to its implicit form, the PAC-Bayes-kl inequality is not very convenient for optimization. One way around is to replace the bound with a linear trade-off βnEρ[L̂(h, S)] + KL(ρ‖π). Since KL(ρ‖π) is convex in ρ and Eρ[L̂(h, S)] is linear in ρ, for a fixed β the trade-off is convex in ρ and can be minimized. (We note that parametrization of ρ, for example the popular restriction of ρ to a Gaussian posterior (Langford, 2005), may easily break the convexity (Germain et al., 2009). We get back to this point in Section 3.8.6.) The value of β can then be tuned by cross-validation or by substitution of ρ(β) into the bound (the former usually works better).
Below we present a more rigorous approach. We prove the following relaxation of PAC-Bayes-kl
inequality, which leads to a bound that can be optimized by alternating minimization.
Theorem 3.31 (PAC-Bayes-λ Inequality). For any probability distribution π over H that is independent
of S and any δ ∈ (0, 1), with probability greater than 1 − δ over a random draw of a sample S, for all
distributions ρ over H and all λ ∈ (0, 2) and γ > 0 simultaneously:

    Eρ[L(h)] ≤ Eρ[L̂(h, S)] / (1 − λ/2) + ( KL(ρ‖π) + ln((n + 1)/δ) ) / ( λ(1 − λ/2) n ),        (3.22)

    Eρ[L(h)] ≥ (1 − γ/2) Eρ[L̂(h, S)] − ( KL(ρ‖π) + ln((n + 1)/δ) ) / ( γ n ).        (3.23)

At the moment we focus on the upper bound in equation (3.22). Note that the theorem holds for all
values of λ ∈ (0, 2) simultaneously. Therefore, we can optimize the bound with respect to λ and pick the
best one.
Proof. We prove the upper bound in equation (3.22). Proof of the lower bound (3.23) is analogous and
left as an exercise. Proof of the statement that the upper and lower bounds hold simultaneously (require
no union bound) is also left as an exercise.
By refined Pinsker's inequality in Corollary 2.19, for p < q

    kl(p‖q) ≥ (q − p)² / (2q).        (3.24)

By PAC-Bayes-kl inequality, Theorem 3.26, with probability greater than 1 − δ for all ρ simultaneously

    kl( Eρ[L̂(h, S)] ‖ Eρ[L(h)] ) ≤ ( KL(ρ‖π) + ln((n + 1)/δ) ) / n.
By application of inequality (3.24), the above inequality can be relaxed to

    Eρ[L(h)] − Eρ[L̂(h, S)] ≤ sqrt( 2 Eρ[L(h)] ( KL(ρ‖π) + ln((n + 1)/δ) ) / n ).        (3.25)

We have that

    min_{λ>0} ( λx + y/λ ) = 2√(xy)

(we leave this statement as a simple exercise). Thus, √(xy) ≤ (1/2)( λx + y/λ ) for all λ > 0 and by applying
this inequality to (3.25) we have that with probability at least 1 − δ for all ρ and λ > 0

    Eρ[L(h)] − Eρ[L̂(h, S)] ≤ (λ/2) Eρ[L(h)] + ( KL(ρ‖π) + ln((n + 1)/δ) ) / ( λ n ).

By changing sides

    (1 − λ/2) Eρ[L(h)] ≤ Eρ[L̂(h, S)] + ( KL(ρ‖π) + ln((n + 1)/δ) ) / ( λ n ).

For λ < 2 we can divide both sides by 1 − λ/2 and obtain the theorem statement.


3.8.5 Alternating Minimization of PAC-Bayes-λ Bound
We use the term PAC-Bayes-λ bound to refer to the right hand side of PAC-Bayes-λ inequality. A great advantage of the PAC-Bayes-λ bound is that it can be conveniently minimized by alternating minimization with respect to ρ and λ. Since Eρ[L̂(h, S)] is linear in ρ and KL(ρ‖π) is convex in ρ

    ρ(h) = π(h) e^{−λn L̂(h,S)} / Eπ[ e^{−λn L̂(h′,S)} ],        (3.26)

where Eπ[e^{−λn L̂(h′,S)}] is a convenient way of writing the normalization factor, which covers continuous and discrete hypothesis spaces in a unified notation. In the discrete case, which will be of main interest for us, Eπ[e^{−λn L̂(h′,S)}] = Σ_{h′∈H} π(h′) e^{−λn L̂(h′,S)}. We leave a proof of the statement that (3.26) defines the ρ which achieves the minimum of the bound as an exercise to the reader. Furthermore, for t ∈ (0, 1) and a, b ≥ 0 the function a/(1 − t) + b/( t(1 − t) ) is convex in t (Tolstikhin and Seldin, 2013) and, therefore, for a fixed ρ the right hand side of inequality (3.22) is convex in λ for λ ∈ (0, 2) and the minimum is achieved by

    λ = 2 / ( sqrt( 2n Eρ[L̂(h, S)] / ( KL(ρ‖π) + ln((n + 1)/δ) ) + 1 ) + 1 ).        (3.27)

Note that the optimal value of λ is smaller than 1. Alternating application of the update rules (3.26) and (3.27) monotonically decreases the bound, and thus converges.
We note that while the right hand side of inequality (3.22) is convex in ρ for a fixed λ and convex in
λ for a fixed ρ, it is not jointly convex in ρ and λ. Joint convexity would have been a sufficient condition for convergence of alternating minimization to the global minimum of the bound, but it is not a necessary one. Thiemann et al. (2017) provide sufficient conditions under which the procedure converges
to the global minimum, as well as examples of situations where this does not happen.
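Below is a minimal sketch (an illustration added to these notes) of the alternating minimization for a finite hypothesis class, implementing the update rules (3.26) and (3.27); computing ρ in log-space is a numerical-stability choice of this sketch, and all names are hypothetical.

    import numpy as np

    def minimize_pac_bayes_lambda(emp_losses, n, delta, n_iter=100):
        """Alternate updates (3.26) and (3.27) to minimize the PAC-Bayes-lambda bound (3.22).

        emp_losses: array with the empirical loss L_hat(h, S) of each hypothesis.
        Returns the posterior rho, the parameter lambda, and the value of the bound."""
        m = len(emp_losses)
        pi = np.full(m, 1.0 / m)                      # uniform prior (a natural choice)
        ln_term = np.log((n + 1) / delta)
        rho, lam = pi.copy(), 1.0
        for _ in range(n_iter):
            # rho update, eq. (3.26), computed in log-space and renormalized
            log_rho = np.log(pi) - lam * n * emp_losses
            rho = np.exp(log_rho - log_rho.max())
            rho /= rho.sum()
            kl = np.sum(np.where(rho > 0, rho * np.log(np.maximum(rho, 1e-300) / pi), 0.0))
            # lambda update, eq. (3.27)
            lam = 2.0 / (np.sqrt(2 * n * (rho @ emp_losses) / (kl + ln_term) + 1) + 1)
        kl = np.sum(np.where(rho > 0, rho * np.log(np.maximum(rho, 1e-300) / pi), 0.0))
        bound = (rho @ emp_losses) / (1 - lam / 2) + (kl + ln_term) / (lam * (1 - lam / 2) * n)
        return rho, lam, bound

Each update can only decrease the bound, so monitoring the bound value across iterations gives a simple convergence check.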

3.8.6 Construction of a Hypothesis Space for PAC-Bayes-λ


If H is infinite, computation of the partition function (the denominator in (3.26)) is intractable. This
could be resolved by parametrization of ρ (for example, restriction of ρ to a Gaussian posterior), but,
as we have already mentioned, this may break the convexity of the bound in ρ. Fortunately, things get
easy when H is finite. The crucial step is to construct a sufficiently powerful finite hypothesis space H.
One possibility that we consider here is to construct H by training m hypotheses, where each hypothesis
is trained on r random points from S and validated on the remaining n − r points. This construction
resembles a cross-validation split of the data. However, in cross-validation r is typically large (close to n)
and validation sets are non-overlapping. The approach considered here works for any r and has additional
computational advantages when r is small. We do not require validation sets to be non-overlapping and
overlaps between training sets are allowed. Below we describe the construction more formally.
Let h ∈ {1, . . . , m} index the hypotheses in H. Let Sh denote the training set of h and S \ Sh the
validation set. Sh is a subset of r points from S, which are selected independently of their values (for
example, subsampled randomly or picked according to a predefined partition of the data). We define the
validation error of h by L̂_val(h, S) = (1/(n − r)) Σ_{(X,Y)∈S\S_h} ℓ(h(X), Y). Note that the validation errors are averages of (n − r) i.i.d. random variables with bias L(h) and, therefore, for f(h, S) = (n − r) kl(L̂_val(h, S)‖L(h)) we have E_S[e^{f(h,S)}] ≤ (n − r) + 1. The following result is a straightforward adaptation of Theorem 3.31 to
this setting (we leave the proof as an exercise to the reader).
Theorem 3.32 (PAC-Bayesian Aggregation). Let S be a sample of size n. Let H be a set of m hypothe-
ses, where each h ∈ H is trained on r points from S selected independently of the composition of S. For
any probability distribution π over H that is independent of S and any δ ∈ (0, 1), with probability greater
than 1 − δ over a random draw of a sample S, for all distributions ρ over H and λ ∈ (0, 2) simultaneously:

    Eρ[L(h)] ≤ Eρ[L̂_val(h, S)] / (1 − λ/2) + ( KL(ρ‖π) + ln(((n − r) + 1)/δ) ) / ( λ(1 − λ/2)(n − r) ).        (3.28)


It is natural, but not mandatory to select a uniform prior π(h) = 1/m. The bound in equation (3.28)
can be minimized by alternating application of the update rules in equations (3.26) and (3.27) with n
being replaced by n − r and L̂ by L̂val . For evaluation of the empirical performance of this learning
approach see Thiemann et al. (2017).
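A possible realization of this construction (an illustrative sketch, not the reference implementation of Thiemann et al. (2017)) trains m hypotheses on random r-point subsets and records their validation losses; the choice of decision trees is arbitrary and all names are hypothetical.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def build_hypothesis_set(X, y, m, r, rng=np.random.default_rng(0)):
        """Train m hypotheses on random r-point subsets S_h of S and return them together
        with their validation losses L_hat_val(h, S) computed on the remaining n - r points."""
        n = len(y)
        hypotheses, val_losses = [], []
        for _ in range(m):
            train_idx = rng.choice(n, size=r, replace=False)   # S_h, chosen independently of the labels
            val_mask = np.ones(n, dtype=bool)
            val_mask[train_idx] = False                        # validation set S \ S_h
            h = DecisionTreeClassifier(max_depth=3).fit(X[train_idx], y[train_idx])
            val_losses.append(np.mean(h.predict(X[val_mask]) != y[val_mask]))
            hypotheses.append(h)
        return hypotheses, np.array(val_losses)

The returned losses can be plugged into the alternating minimization sketch above, with n replaced by n − r.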

3.9 PAC-Bayesian Analysis of Ensemble Classifiers


So far in this chapter we have discussed various methods of selection of classifiers from a hypothesis
set H. We now turn to aggregation of predictions of multiple classifiers through a weighted majority
vote. The power of the majority vote is in the “cancellation of errors” effect: if predictions of different
classifiers are uncorrelated and they all predict better than a random guess (meaning that L(h) < 1/2),
the errors tend to cancel out. This can be compared to a consultation of medical experts, which tends to
predict better than the best expert in the set. Most machine learning competitions are won by strategies
that aggregate predictions of multiple classifiers. The assumptions that the errors are uncorrelated and
the predictions are better than random are important. For example, if we have three hypotheses with
L(h) = p and independent errors, the probability that a uniform majority vote MVu makes an error
equals the probability that at least two out of the three hypotheses make an error. You are welcome to
verify that in this case for p ≤ 1/2 we have L(MVu ) ≤ Eu [L(h)], where u is the uniform distribution.
If the errors are correlated, it can be shown that L(MVρ ) can be larger than Eρ [L(h)], but as we
show below it is never larger than 2Eρ [L(h)]. The reader is welcome to construct an example, where
L(MVu ) > Eu [L(h)].

3.9.1 Ensemble Classifiers and Weighted Majority Vote


We now turn to some formal definitions. Ensemble classifiers predict by taking a weighted aggregation
of predictions by hypotheses from H. In multi-class prediction (the label space Y is finite) ρ-weighted
majority vote MVρ predicts

    MVρ(X) = arg max_{Y∈Y} Σ_{(h∈H)∧(h(X)=Y)} ρ(h),

where ∧ represents the logical “and” operation and ties can be resolved arbitrarily.
In binary prediction with prediction space h(X) ∈ {±1} weighted majority vote can be written as
MVρ (X) = sign (Eρ [h(X)]) ,
where sign(x) = 1 if x > 0 and −1 otherwise (the value of sign(0) can be defined arbitrarily). For a
countable hypothesis space this becomes
!
X
MVρ (X) = sign ρ(h)h(X) .
h∈H
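For concreteness, here is a small sketch (added for illustration) of ρ-weighted majority vote prediction for a finite hypothesis set, given a matrix of base predictions; all names are hypothetical.

    import numpy as np

    def majority_vote_multiclass(preds, rho):
        """preds: (m, n) integer predictions of m hypotheses on n inputs; rho: (m,) weights.
        Returns, for every input, the label with the largest total rho-weight."""
        labels = np.unique(preds)
        weights = np.array([(rho[:, None] * (preds == l)).sum(axis=0) for l in labels])
        return labels[np.argmax(weights, axis=0)]   # ties resolved towards the smallest label

    def majority_vote_binary(preds, rho):
        """preds in {-1, +1}: MV_rho(X) = sign(E_rho[h(X)]), with sign(0) set to -1 here."""
        return np.where(rho @ preds > 0, 1, -1)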

3.9.2 First Order Oracle Bound for the Weighted Majority Vote
If majority vote makes an error, we know that at least a ρ-weighted half of the classifiers have made an error and, therefore, ℓ(MVρ(X), Y) ≤ 1( Eρ[1(h(X) ≠ Y)] ≥ 0.5 ). This observation leads to the well-known first order oracle bound for the loss of weighted majority vote.

Theorem 3.33 (First Order Oracle Bound).

    L(MVρ) ≤ 2 Eρ[L(h)].

Proof. We have L(MVρ) = E_D[ℓ(MVρ(X), Y)] ≤ P( Eρ[1(h(X) ≠ Y)] ≥ 0.5 ). By applying Markov's inequality to the random variable Z = Eρ[1(h(X) ≠ Y)] we have:

    L(MVρ) ≤ P( Eρ[1(h(X) ≠ Y)] ≥ 0.5 ) ≤ 2 E_D[ Eρ[1(h(X) ≠ Y)] ] = 2 Eρ[L(h)].

PAC-Bayesian analysis can be used to bound Eρ [L(h)] in Theorem 3.33 in terms of Eρ [L̂(h, S)], thus
turning the oracle bound into an empirical one. The disadvantage of the first order approach is that
Eρ [L(h)] ignores correlations of predictions, which is the main power of the majority vote.

3.9.3 Second Order Oracle Bound for the Weighted Majority Vote
Now we present a second order bound for the weighted majority vote, which is based on a second order

Markov’s inequality: for a non-negative random variable Z and ε > 0, we have P(Z ≥ ε) = P Z 2 ≥ ε2 ≤
ε−2 E Z 2 . We define tandem loss of two hypotheses h and h0 by
 

`(h(X), h0 (X), Y ) = 1(h(X) 6= Y ∧ h0 (X) 6= Y ).

The tandem loss counts an error on a sample (X, Y ) only if both h and h0 err on (X, Y ). We define the
expected tandem loss by
L(h, h0 ) = ED [1(h(X) 6= Y ∧ h0 (X) 6= Y )].
The following lemma relates the expectation of the second moment of the standard loss to the expected
tandem loss. We use the shorthand Eρ2 [L(h, h0 )] = Eh∼ρ,h0 ∼ρ [L(h, h0 )].
Lemma 3.34. In multiclass classification

ED [Eρ [1(h(X) 6= Y )]2 ] = Eρ2 [L(h, h0 )].

Proof.

    E_D[ Eρ[1(h(X) ≠ Y)]² ] = E_D[ Eρ[1(h(X) ≠ Y)] Eρ[1(h(X) ≠ Y)] ]        (3.29)
                            = E_D[ Eρ²[ 1(h(X) ≠ Y) 1(h′(X) ≠ Y) ] ]
                            = E_D[ Eρ²[ 1(h(X) ≠ Y ∧ h′(X) ≠ Y) ] ]
                            = Eρ²[ E_D[ 1(h(X) ≠ Y ∧ h′(X) ≠ Y) ] ]
                            = Eρ²[L(h, h′)].

A combination of the second order Markov's inequality with Lemma 3.34 leads to the following result.

Theorem 3.35 (Second Order Oracle Bound). In multiclass classification

    L(MVρ) ≤ 4 Eρ²[L(h, h′)].        (3.30)

Proof. By the second order Markov's inequality applied to Z = Eρ[1(h(X) ≠ Y)] and Lemma 3.34:

    L(MVρ) ≤ P( Eρ[1(h(X) ≠ Y)] ≥ 0.5 ) ≤ 4 E_D[ Eρ[1(h(X) ≠ Y)]² ] = 4 Eρ²[L(h, h′)].

A Specialized Bound for Binary Classification


We provide an alternative form of Theorem 3.35, which can be used to exploit unlabeled data in binary classification. We denote the expected disagreement between hypotheses h and h′ by D(h, h′) = E_D[1(h(X) ≠ h′(X))] and express the tandem loss in terms of the standard loss and the disagreement.

Lemma 3.36. In binary classification

    Eρ²[L(h, h′)] = Eρ[L(h)] − (1/2) Eρ²[D(h, h′)].

Proof of Lemma 3.36. Picking from (3.29), we have

    Eρ[1(h(X) ≠ Y)] Eρ[1(h(X) ≠ Y)] = Eρ[1(h(X) ≠ Y)] ( 1 − Eρ[ 1 − 1(h(X) ≠ Y) ] )
        = Eρ[1(h(X) ≠ Y)] − Eρ[1(h(X) ≠ Y)] Eρ[1(h(X) = Y)]
        = Eρ[1(h(X) ≠ Y)] − Eρ²[ 1(h(X) ≠ Y ∧ h′(X) = Y) ]
        = Eρ[1(h(X) ≠ Y)] − (1/2) Eρ²[ 1(h(X) ≠ h′(X)) ].

By taking expectation with respect to D on both sides and applying Lemma 3.34 to the left hand side, we obtain:

    Eρ²[L(h, h′)] = E_D[ Eρ[1(h(X) ≠ Y)] − (1/2) Eρ²[1(h(X) ≠ h′(X))] ] = Eρ[L(h)] − (1/2) Eρ²[D(h, h′)].

The lemma leads to the following result.


Theorem 3.37 (Second Order Oracle Bound for Binary Classification). In binary classification

    L(MVρ) ≤ 4 Eρ[L(h)] − 2 Eρ²[D(h, h′)].        (3.31)

Proof. The theorem follows by plugging the result of Lemma 3.36 into Theorem 3.35.
The advantage of the alternative way of writing the bound is the possibility of using unlabeled data
for estimation of D(h, h0 ) in binary prediction (see also Germain et al., 2015). We note, however, that
estimation of Eρ2 [D(h, h0 )] has a slow convergence rate, as opposed to Eρ2 [L(h, h0 )], which has a fast
convergence rate. We discuss this point in Section 3.9.7.

3.9.4 Comparison of the First and Second Order Oracle Bounds


From Theorems 3.33 and 3.37 we see that in binary classification the second order bound is tighter when
Eρ2 [D(h, h0 )] > Eρ [L(h)]. Below we provide a more detailed comparison of Theorems 3.33 and 3.35 in the
worst, the best, and the independent cases. The comparison only concerns the oracle bounds, whereas
estimation of the oracle quantities, Eρ [L(h)] and Eρ2 [L(h, h0 )], is discussed in Section 3.9.7.

The worst case Since Eρ²[L(h, h′)] ≤ Eρ[L(h)], the second order bound is at most a factor of two worse than the first order bound. The worst case happens, for example, if all hypotheses in H give identical predictions. Then Eρ²[L(h, h′)] = Eρ[L(h)] = L(MVρ) for all ρ.

The best case Imagine that H consists of M ≥ 3 hypotheses, such that each hypothesis errs on 1/M
of the sample space (according to the distribution D) and that the error regions are disjoint. Then
L(h) = 1/M for all h, L(h, h′) = 0 for all h ≠ h′, and L(h, h) = 1/M. For a uniform distribution ρ on H the first order bound is 2 Eρ[L(h)] = 2/M and the second order bound is 4 Eρ²[L(h, h′)] = 4/M², while L(MVρ) = 0. In this case the second order bound is an order of magnitude tighter than the first order.

The independent case Assume that all hypotheses in H make independent errors and have the same error rate, L(h) = L(h′) for all h and h′. Then for h ≠ h′ we have L(h, h′) = E_D[1(h(X) ≠ Y ∧ h′(X) ≠ Y)] = E_D[1(h(X) ≠ Y) 1(h′(X) ≠ Y)] = E_D[1(h(X) ≠ Y)] E_D[1(h′(X) ≠ Y)] = L(h)² and L(h, h) = L(h). For a uniform distribution ρ the second order bound is 4 Eρ²[L(h, h′)] = 4( L(h)² + (1/M) L(h)(1 − L(h)) ) and the first order bound is 2 Eρ[L(h)] = 2 L(h). Assuming that M is large, so that we can ignore the second term in the second order bound, we obtain that it is tighter for L(h) < 1/2 and looser otherwise. The former is the interesting regime, especially in binary classification.

3.9.5 Second Order PAC-Bayesian Bounds for the Weighted Majority Vote
Now we provide an empirical bound for the weighted majority vote. We define the empirical tandem loss

    L̂(h, h′, S) = (1/n) Σ_{i=1}^n 1( h(X_i) ≠ Y_i ∧ h′(X_i) ≠ Y_i )

and provide a bound on the expected loss of ρ-weighted majority vote in terms of the empirical tandem
losses.
Theorem 3.38. For any probability distribution π on H that is independent of S and any δ ∈ (0, 1),
with probability at least 1 − δ over a random draw of S, for all distributions ρ on H and all λ ∈ (0, 2)
simultaneously:

    L(MVρ) ≤ 4 ( Eρ²[L̂(h, h′, S)] / (1 − λ/2) + ( 2 KL(ρ‖π) + ln(2√n/δ) ) / ( λ(1 − λ/2) n ) ).

Proof. The theorem follows by using the bound in equation (3.22) to bound Eρ²[L(h, h′)] in Theorem 3.35. We note that KL(ρ²‖π²) = 2 KL(ρ‖π) (Germain et al., 2015, Page 814).
It is also possible to use PAC-Bayes-kl to bound Eρ2 [L(h, h0 )] in Theorem 3.35, which actually gives a
tighter bound, but the bound in Theorem 3.38 is more convenient for minimization. We refer the reader
to Masegosa et al. (2020) for a procedure for bound minimization.
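A small sketch (added for illustration) of how the empirical tandem losses and the right hand side of Theorem 3.38 can be computed for a finite hypothesis set and a given ρ and λ; all names are hypothetical.

    import numpy as np

    def tandem_loss_matrix(preds, y):
        """preds: (m, n) predictions of m hypotheses on n labeled examples; y: (n,) labels.
        Returns the (m, m) matrix of empirical tandem losses L_hat(h, h', S)."""
        errors = (preds != y).astype(float)        # errors[h, i] = 1(h(X_i) != Y_i)
        return errors @ errors.T / len(y)

    def tnd_bound(tandem, rho, kl_rho_pi, n, delta, lam):
        """Right hand side of Theorem 3.38 for a given rho and lambda in (0, 2)."""
        e_tandem = rho @ tandem @ rho                           # E_{rho^2}[L_hat(h, h', S)]
        c = 2 * kl_rho_pi + np.log(2 * np.sqrt(n) / delta)
        return 4 * (e_tandem / (1 - lam / 2) + c / (lam * (1 - lam / 2) * n))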

A specialized bound for binary classification


We define the empirical disagreement

    D̂(h, h′, S′) = (1/m) Σ_{i=1}^m 1( h(X_i) ≠ h′(X_i) ),

where S′ = {X_1, . . . , X_m}. The set S′ may overlap with the labeled set S; however, S′ may also include additional unlabeled data. The following theorem bounds the loss of weighted majority vote in terms of empirical disagreements. Due to the possibility of using unlabeled data for estimation of disagreements in
the binary case, the theorem has the potential of yielding a tighter bound when a considerable amount
of unlabeled data is available.
Theorem 3.39. In binary classification, for any probability distribution π on H that is independent of S and S′ and any δ ∈ (0, 1), with probability at least 1 − δ over a random draw of S and S′, for all distributions ρ on H and all λ ∈ (0, 2) and γ > 0 simultaneously:

    L(MVρ) ≤ 4 ( Eρ[L̂(h, S)] / (1 − λ/2) + ( KL(ρ‖π) + ln(4√n/δ) ) / ( λ(1 − λ/2) n ) )
             − 2 ( (1 − γ/2) Eρ²[D̂(h, h′, S′)] − ( 2 KL(ρ‖π) + ln(4√m/δ) ) / ( γ m ) ).

Proof. The theorem follows by using the upper bound in equation (3.22) to bound Eρ [L(h)] and the
lower bound in equation (3.23) to bound Eρ2 [D(h, h0 )] in Theorem 3.37. We replace δ by δ/2 in the
upper and lower bound and take a union bound over them.
Using PAC-Bayes-kl to bound Eρ[L(h)] and Eρ²[D(h, h′)] in Theorem 3.37 gives a tighter bound, but the bound in Theorem 3.39 is more convenient for minimization. We refer to Masegosa et al. (2020) for
a procedure for bound minimization.

3.9.6 Ensemble Construction


It is possible to use the same procedure as in Section 3.8.6 to construct an ensemble. Tandem losses can
then be estimated on overlaps of validation sets, (S \ Sh ) ∩ (S \ Sh0 ). The sample size in Theorem 3.38
should then be replaced by minh,h0 |(S \ Sh ) ∩ (S \ Sh0 )|.

3.9.7 Comparison of the Empirical Bounds
We provide a high-level comparison of the empirical first order bound (FO), the empirical second order
bound based on the tandem loss (TND, Theorem 3.38), and the new empirical second order bound based
on disagreements (DIS, Theorem 3.39). The two key quantities in the comparison are the sample size n
in the denominator of the bounds and fast and slow convergence rates for the standard (first order) loss,
the tandem loss, and the disagreements. Tolstikhin and Seldin (2013) have shown that if we optimize λ
for a given ρ, the PAC-Bayes-λ bound in equation (3.22) can be written as

    Eρ[L(h)] ≤ Eρ[L̂(h, S)] + sqrt( 2 Eρ[L̂(h, S)] ( KL(ρ‖π) + ln(2√n/δ) ) / n ) + 2 ( KL(ρ‖π) + ln(2√n/δ) ) / n.

This form of the bound, also used by McAllester (2003), is convenient for explanation of fast and slow rates. If Eρ[L̂(h, S)] is large, then the middle term on the right hand side dominates the complexity and the bound decreases at the rate of 1/√n, which is known as a slow rate. If Eρ[L̂(h, S)] is small, then the last term dominates and the bound decreases at the rate of 1/n, which is known as a fast rate.

FO vs. TND The advantage of the FO bound is that the validation sets S \Sh available for estimation of
the first order losses L̂(h, Sh ) are larger than the validation sets (S \Sh )∩(S \Sh0 ) available for estimation
of the tandem losses. Therefore, the denominator nmin = minh |S \ Sh | in the FO bound is larger than
the denominator nmin = minh,h0 |(S \ Sh ) ∩ (S \ Sh0 )| in the TND bound. The TND disadvantage can
be reduced by using data splits with large validation sets S \ Sh and small training sets Sh , as long as
small training sets do not overly impact the quality of base classifiers h. Another advantage of the FO
bound is that its complexity term has KL(ρkπ), whereas the TND bound has 2 KL(ρkπ). The advantage
of the TND bound is that Eρ2 [L(h, h0 )] ≤ Eρ [L(h)] and, therefore, the convergence rate of the tandem
loss is typically faster than the convergence rate of the first order loss. The interplay of the estimation
advantages and disadvantages, combined with the advantages and disadvantages of the underlying oracle
bounds discussed in Section 3.9.4, depends on the data and the hypothesis space.

TND vs. DIS The advantage of the DIS bound relative to the TND bound is that in presence of a
large amount of unlabeled data the disagreements D(h, h0 ) can be tightly estimated (the denominator
m is large) and the estimation complexity is governed by the first order term, Eρ [L(h)], which is ”easy”
to estimate, as discussed above. However, the DIS bound has two disadvantages. A minor one is its
reliance on estimation of two quantities, Eρ [L(h)] and Eρ2 [D(h, h0 )], which requires a union bound, e.g.,
replacement of δ by δ/2. A more substantial one is that the disagreement term is desired to be large, and thus has a slow convergence rate. Since the slow convergence rate relates to the fast convergence rate as 1/√n to 1/n, as a rule of thumb the DIS bound is expected to outperform TND only when the amount of unlabeled data is at least quadratic in the amount of labeled data, m > n².

For experimental comparison of the bounds and further details we refer the reader to Masegosa et al.
(2020).

Chapter 4

Supervised Learning - Regression

In this chapter we consider the regression problem, which is another special case of supervised learning
with X = Rd and Y = R.

4.1 Linear Least Squares


Linear regression with square loss ℓ(Y′, Y) = (Y′ − Y)² is also known as linear least squares. Let
S = {(x1 , y1 ), . . . , (xn , yn )} be our sample. We are looking for a prediction rule of a form h(x) = wT x,
where wT x is the dot-product (also known as the inner product) between a vector w ∈ Rd and a data
point x ∈ Rd . We will use w to denote the above prediction rule. Let X ∈ Rn×d be a matrix holding
x_1ᵀ, . . . , x_nᵀ as its rows,

    X = ( — x_1ᵀ — ; . . . ; — x_nᵀ — ),

and let y = (y_1, . . . , y_n)ᵀ be the vector of labels. We are looking for w that minimizes the empirical loss

    L̂(w, S) = Σ_{i=1}^n ℓ(wᵀx_i, y_i) = Σ_{i=1}^n ( wᵀx_i − y_i )² = ‖Xw − y‖².
When the number of constraints n (the number of points in S) is larger than the number of unknowns
d (the number of entries in w), most often the linear system Xw = y has no solutions (unless y by chance
falls in the linear span of the columns of X). Therefore, we are looking for the best approximation of
y by a linear combination of the columns of X, which means that we are looking for a projection of y
onto the column space of X. There are two ways to define projections, analytical and algebraic, which
lead to two ways of solving the problem. In the analytical formulation the projection is a point of a form
Xw that has minimal distance to y. In the algebraic formulation the projection is a vector Xw that is
perpendicular to the remainder y − Xw. We present both ways in detail below.

4.1.1 Analytical Approach


We are looking for
    min_w ‖Xw − y‖² = min_w (Xw − y)ᵀ(Xw − y) = min_w ( wᵀXᵀXw − 2yᵀXw + yᵀy ).

By taking a derivative of the above and equating it to zero we have¹

    d( wᵀXᵀXw − 2yᵀXw + yᵀy ) / dw = 2XᵀXw − 2Xᵀy = 0.
This gives

    XᵀXw = Xᵀy.

If we assume that the columns of X are linearly independent (dim(X) = d) then XᵀX ∈ R^{d×d} is invertible (see Appendix C) and we obtain

    w = (XᵀX)⁻¹Xᵀy.
1 See Appendix D for details on calculation of derivatives [gradients] of multidimensional functions.
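In numpy the closed-form solution can be computed as in the short sketch below (added for illustration; the data are synthetic and all names hypothetical). In practice np.linalg.lstsq is preferred over explicitly forming (XᵀX)⁻¹ for numerical stability.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 3
    X = rng.normal(size=(n, d))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    # Normal equations: solve X^T X w = X^T y (valid when the columns of X are linearly independent).
    w_normal = np.linalg.solve(X.T @ X, X.T @ y)

    # Least-squares routine; both give (up to numerical error) the same w.
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(w_normal, w_lstsq)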

Figure 4.1: Illustration of algebraic solution of linear least squares.

4.1.2 Algebraic Approach - Fast Track


The projection Xw is a vector that is orthogonal to the remainder y − Xw (so that y is a sum of the
projection and the remainder, y = Xw + (y − Xw), and there is a right angle between the two). Two
vectors are orthogonal if and only if their inner product is zero. Thus, we are looking for w that satisfies

    (Xw)ᵀ(y − Xw) = 0,

which is equivalent to wᵀXᵀ(y − Xw) = 0. It is sufficient to find w that satisfies Xᵀ(y − Xw) = 0 to solve this equation, which is equivalent to XᵀXw = Xᵀy. By multiplying both sides by (XᵀX)⁻¹ (which is defined, since the columns are linearly independent) we obtain a solution w = (XᵀX)⁻¹Xᵀy.
This solution is, actually, unique due to independence of the columns of X. Assume there is another solution w′, such that Xw′ = y. Then Xw − Xw′ = X(w − w′) = 0, but since the columns of X are linearly independent, the only linear combination of them that yields zero is the zero vector, meaning that w − w′ = 0 and w = w′.

4.1.3 Algebraic Approach - Complete Picture


Linear Least Squares is a great opportunity to revisit a number of basic concepts from linear algebra.
Once the complete picture is understood, the algebraic solution of the problem is just one line. We refer
the reader to Appendix C for a quick review of basic concepts from linear algebra. We are looking for a
solution of Xw = y, where y (most likely) lies outside of the column space of X and the equation has
no solution. Therefore, the best we can do is to solve Xw = y⋆, where y⋆ is a projection of y onto the column space of X (see Figure 4.1). We assume that dim(X) = d and thus the matrix XᵀX is invertible. The projection y⋆ is then given by y⋆ = X(XᵀX)⁻¹Xᵀy, which means that the best we can do is to solve Xw = y⋆ = X(XᵀX)⁻¹Xᵀy, and the solution is w = (XᵀX)⁻¹Xᵀy.

4.1.4 Using Linear Least Squares for Learning Coefficients of Non-linear
Models
Linear Least Squares can be used for learning coefficients of non-linear models. For example, assume
that we want to fit our data S = {(x1 , y1 ), . . . , (xn , yn )} (where both xi -s and yi -s are real numbers) with
a polynomial of degree d. I.e., we want to have a model of the form y = a_d x^d + a_{d−1} x^{d−1} + · · · + a_1 x + a_0. All we have to do is to map our features x_i into feature vectors x_i → (x_i^d, x_i^{d−1}, . . . , x_i, 1) and apply
linear least squares to the following system:

    ⎛ x_1^d   x_1^{d−1}   · · ·   x_1   1 ⎞  ⎛ a_d     ⎞     ⎛ y_1 ⎞
    ⎜ x_2^d   x_2^{d−1}   · · ·   x_2   1 ⎟  ⎜ a_{d−1} ⎟     ⎜ y_2 ⎟
    ⎜   ⋮         ⋮                  ⋮    ⎟  ⎜   ⋮     ⎟  =  ⎜  ⋮  ⎟
    ⎝ x_n^d   x_n^{d−1}   · · ·   x_n   1 ⎠  ⎝ a_0     ⎠     ⎝ y_n ⎠

to get the parameters vector (a_d, a_{d−1}, . . . , a_1, a_0)ᵀ.
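A short sketch of this recipe (added for illustration, with synthetic data and hypothetical names): map x to the monomial features and solve the resulting least squares problem.

    import numpy as np

    def fit_polynomial(x, y, degree):
        """Fit y ~ a_d x^d + ... + a_1 x + a_0 by mapping x -> (x^d, ..., x, 1) and least squares."""
        X = np.vander(x, N=degree + 1)            # row i is (x_i^d, x_i^{d-1}, ..., x_i, 1)
        coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
        return coeffs                             # (a_d, a_{d-1}, ..., a_1, a_0)

    x = np.linspace(-1, 1, 50)
    y = 2 * x**3 - x + 0.3 + 0.05 * np.random.default_rng(0).normal(size=x.size)
    print(fit_polynomial(x, y, degree=3))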

Chapter 5

Online Learning

So far in these notes we have considered batch learning. In batch learning we start with some data, we
analyze it, and then we “ship the result of the analysis into the world”. It can be a fixed classifier h, a
distribution over classifiers ρ, or anything else, the important point is that it does not change from the
moment we are done with training. It takes no new information into account. This is also the reason
why we had to assume that new samples come from the same distribution as the samples in the training
set, because the classifier was not designed to adapt.
Online learning is a learning framework, where data collection, analysis, and application of inferred
knowledge are in a perpetual loop, see Figure 5.1. Examples of problems, which fit into this framework
include:
• Investment in the stock market.

• Online advertizing and personalization.


• Online routing.
• Games.
• Robotics.

• And so on ...
The recurrent nature of online learning problems makes them closely related to repeated games. They
also borrow some of the terminology from the game theory, including calling the problems games and
every “Act - Observe - Analyze” cycle a game round. In general, we may need online learning in the
following cases:

Figure 5.1: Online learning vs. batch.

Figure 5.2: The Space of Online Learning Problems.

• Interactive learning: we are in a situation, where we continuously get new information and taking
it into account may improve the quality of our actions. Many online applications on the Internet
fall under this category.
• Adversarial or game-theoretic settings: we cannot assume that “the future behaves similarly to the
past”. For example, in spam filtering we cannot assume that new spam messages are generated
from the same distribution as the old ones. Or, in playing chess we cannot assume that the moves
of the opponent are sampled i.i.d..
As with many other problems in computer science, having loops makes things much more challenging, but
also much richer and more fun.1 For example, online learning allows us to treat adversarial environments, which is impossible to do in the batch setting.

5.1 The Space of Online Learning Problems


Online learning problems are characterized by three major parameters:
1. The amount of feedback that the algorithm received on every round of interaction with the envi-
ronment.

2. The environmental resistance to the algorithm.


3. The structural complexity of a problem.
Jointly they define the space of online learning problems, see Figure 5.2. It is not really a space, but a
convenient way to organize the material and get initial orientation in the zoo of online learning settings.
We discuss the three axes of the space with some examples below.
1 The following quote from Robbins (1952) is interesting to read: “Until recently, statistical theory has been restricted to

the design and analysis of sampling experiments in which the size and composition of the samples are completely determined
before the experimentation begins. The reasons for this are partly historical, dating back to the time when the statistician
was consulted, if at all, only after the experiment was over, and partly intrinsic in the mathematical difficulty of working
with anything but a fixed number of independent random variables. A major advance now appears to be in the making
with the creation of a theory of the sequential design of experiments, in which the size and composition of the samples are
not fixed in advance but are functions of the observations themselves.”

Feedback
Feedback refers to the amount of information that the algorithm receives on every round of interaction
with the environment. The most basic forms of feedback are full information and limited (better known
as bandit 2 ) feedback.
A classical example of a full information game is investment in the stock market. On every round
of this game we distribute wealth over a set of stocks and the next day we observe the rates of all
stocks, which is the full information. With full information we can evaluate the quality of our investment
strategy, as well as any alternative investment strategy.
A classical example of a bandit feedback game are medical treatments. We have a set of actions (in
this case treatments), but we can only apply one treatment to a given patient. We only observe the
outcome of the applied treatment, but not of any alternative treatment, thus we have limited feedback.
With limited feedback we only know the quality of the selected strategy, but we cannot directly evaluate
the quality of alternative strategies we could have selected. This leads to the exploration-exploitation
trade-off, which is the trade-mark signature of online learning. The essence of the exploration-exploitation
trade-off is that in order to estimate the quality of actions we have to try them out (to explore). If we
explore too little, we risk missing some good actions and end up performing suboptimally. However,
exploration has a cost, because trying out suboptimal actions for too long is also undesirable. The
goal is to balance exploration (trying new actions) with exploitation, which is taking actions, which
are currently believed to be the optimal ones. The “Act-Observe-Analyze” cycle comes into play here,
because unlike in batch learning the training set is not given, but is built by the algorithm for itself: if
we do not try an action we get no data from it.
There are many other problems that fall within bandit feedback framework, most notably online
advertizing. A simplistic way of modeling online advertizing is assuming that there is a pool of adver-
tizements, but on every round of the game we are only allowed to show one advertisement to a user.
Since we only observe feedback for the advertisement that was presented, the problem can be formulated
as an online learning problem with bandit feedback.
There are other feedback models, which we will only touch briefly. In the bandit feedback model the
algorithm observes a noisy estimate of the quality of selected action, for example, whether an advertise-
ment was clicked or not. In partial feedback model studied under partial monitoring the feedback has
some relation to the action, but not necessarily its quality. For example, in dynamic pricing we only
observe whether a proposed price was above or below the value of a product for a buyer, but we do not
observe the maximal price we could get for the product. Bandit feedback is a special case of partial
feedback, where the observation is the value. Another example is dueling bandit feedback, where the
feedback is a relative preference over a pair of items rather than the absolute value of the items. For
example, an answer to the question “Do you prefer fish or chicken?” is an example of dueling bandit
feedback. Dueling bandit feedback model is used in information retrieval systems, since humans are
much better in providing relative preferences rather than absolute utility values.

Environmental Resistance
Environmental resistance is concerned with how much the environment resists to the algorithm. Two
classical examples are i.i.d. (a.k.a. stochastic) and adversarial environments. An example of an i.i.d.
environment is the weather. It has a high degree of uncertainty, but it does not play against the
algorithm. Another example of an i.i.d. environment are outcomes of medical treatments. Here also
there is uncertainty in the outcomes, but the patients are not playing against the algorithm. An example
of an adversarial environment is spam filtering. Here the spammers are deliberately changing distribution
of the spam messages in order to outplay the spam filtering algorithm. Another classical example of an
adversarial environment is the stock market. Even though the stock market does not play directly against
an individual investor (assuming the investments are small), it is not stationary, because if there would
be regularity in the market it would be exploited by other investors and would be gone.
The environment may also be collaborative, for example, when several agents are jointly solving
a common task. Yet another example are slowly changing environments, where the parameters of a
2 “The name derives from an imagined slot machine . . . . (Ordinary slot machines with one arm are one-armed bandits,

since in the long run they are as effective as human bandits in separating the victim from his money.)” (Lai and Robbins,
1985)

distribution are slowly changing with time.

Structural Complexity
In structural complexity we distinguish between stateless problems, contextualized problems (or problems
with state), and Markov decision processes. In stateless problems actions are taken without taking any
additional information except the history of the outcomes into account. In contextualized problems on
every round of the game the algorithm observes a context (or state) and takes an action within the
observed context. An example of context is a medical record of a patient or, in the advertizing example,
it could be parameters of the advertisement and the user.
Markov decision processes are concerned with processes with evolving state. The difference between
contextualized problems and Markov decision processes is that in the former the actions of the algorithm
do not influence the next state, whereas in the latter they do. For example, subsequent treatments of
the same patient are changing his or her state and, therefore, depend on each other. In contrast, in
subsequent treatments of different patients treatment of one patient does not influence the state of the
next patient and, thus, can be modeled as a contextualized problem.
Markov decision processes are studied within the field of reinforcement learning. There is no clear
cut distinction between online learning and reinforcement learning and one could be seen as a subfield
of another or the other way around. But as a rule of thumb, problems involving evolution of states,
such as Markov decision processes, are part of reinforcement learning and problems that do not involve
evolution of states are part of online learning.
One of the challenges in Markov decision processes is delayed feedback. It refers to the fact that, unlike
in stateless and contextualized problems, the quality of an action cannot be evaluated instantaneously.
The reason is that actions are changing the state, which may lead to long-term consequences. Consider
a situation of sitting in a bar, where every now and then a waiter comes and asks whether you want
another beer. If you take a beer you probably feel better than if you do not, but then eventually if you
take too much you will feel very bad the next morning, whereas if you do not you may feel excellent. As
before, things get more challenging, but also more exciting, when there are loops in the state space.
In Markov decision processes we distinguish between estimation and planning. Estimation is the
same problem as in other online learning problems - the outcomes of actions are unknown and we have
to estimate them. However, in Markov decision processes even if the immediate outcomes of various
actions are known, the identity of the best action in each state may still be not evident due to the
long-term consequences. This problem is addressed by planning.
There are many other online learning problems, which do not fit directly into Figure 5.2, but can still
be discussed in terms of feedback, environmental resistance, and structural complexity. For example, in
combinatorial bandits the goal is to select a set of actions, potentially with some constraints, and the
quality of the set is evaluated jointly. An instance of a combinatorial bandit problem is selection of a path
in a graph, such as communication or transport network. In this case an action can be decomposed into
sub-actions corresponding to selection of edges in the graph. The goal is to minimize the length of a path,
which may correspond to the delay between the source and the target nodes. Various forms of feedback
can be considered, including bandit feedback, where the total length of the path is observed; semi-bandit
feedback, where the length of each of the selected edges is observed; cascading bandit feedback, where
the lengths of the edges are observed in a sequence until a terminating node (e.g., a server that is down)
or the target is reached; or a full information feedback, where the length of all edges is observed.
In the following sections we consider in detail a number of the most basic online learning problems.

5.2 A General Basic Setup


We start with four most basic problems in online learning, prediction with expert advice, stochastic
multiarmed bandits, and adversarial multiarmed bandits. Prediction with expert advice refers to the
adversarial version of the problem, but in the home assignment you will analyze its stochastic counterpart,
which gives the fourth problem. All four are stateless problems and correspond to the four red crosses
in Figure 5.3. We provide a general setup that encompasses all four problems and then specialize it. We

Figure 5.3: The four basic online learning problems.

are given a K × ∞ matrix of losses `t,a , where t ∈ {1, 2, . . . } and a ∈ {1, . . . , K} and `t,a ∈ [0, 1].
              ℓ1,1,   ℓ2,1,   · · ·   ℓt,1,   · · ·
                ⋮       ⋮               ⋮
    Losses    ℓ1,a,   ℓ2,a,   · · ·   ℓt,a,   · · ·
                ⋮       ⋮               ⋮
              ℓ1,K,   ℓ2,K,   · · ·   ℓt,K,   · · ·
              ─────────────────────────────────────→
                               time
The matrix is fixed before the game starts, but not revealed to the algorithm. There are two ways to
generate the matrix, which are specified after the definition of the game protocol.

Game Protocol

For t = 1, 2, . . . :
1. Pick a row At
2. Suffer `t,At
3. Observe . . . [the observations are defined below]

Definition of the four games There are two common ways to generate the matrix of losses. The
first is to sample `t,a -s independently, so that the mean of the losses in each row is fixed, E [`t,a ] = µ(a).
The second is to generate `t,a -s arbitrarily. The second model of generation of losses is known as an
oblivious adversary, since the generation happens before the game starts and thus does not take actions
of the algorithm into account.3
There are also two common ways to define the observations. After picking a row in round t the
algorithm may observe either the full column `t,1 , . . . , `t,K or just the selected entry `t,At . Jointly the
two ways of generating the matrix of losses and the two ways of defining the observations generate four
variants of the game.

    Matrix generation \ Observations                        | Observe ℓt,1, . . . , ℓt,K                  | Observe ℓt,At
    I.I.D.: ℓt,a-s are sampled i.i.d. with E[ℓt,a] = µ(a)   | Prediction with expert advice               | Stochastic multiarmed bandits
    Adversarial: ℓt,a-s are selected arbitrarily            | Prediction with expert advice (adversarial) | Adversarial multiarmed bandits
    (by an adversary)                                       |                                             |
3 Itis also possible to consider an adaptive adversary, which generates losses as the game proceeds and takes past actions
of the algorithm into account. We do not discuss this model in the lecture notes.

Performance Measure The goal of the algorithm is to play so that the loss it suffers will not be
significantly larger than the loss of the best row in hindsight. There are several ways to formalize this
goal. The basic performance measure is the regret defined by

    R_T = Σ_{t=1}^T ℓ_{t,A_t} − min_a Σ_{t=1}^T ℓ_{t,a}.

In adversarial problems we analyze the expected regret⁴ defined by

    E[R_T] = E[ Σ_{t=1}^T ℓ_{t,A_t} ] − E[ min_a Σ_{t=1}^T ℓ_{t,a} ].

If the sequence of losses is deterministic we can remove the second expectation and obtain a slightly simpler expression

    E[R_T] = E[ Σ_{t=1}^T ℓ_{t,A_t} ] − min_a Σ_{t=1}^T ℓ_{t,a}.

In stochastic problems we analyze the pseudo regret defined by

    R̄_T = E[ Σ_{t=1}^T ℓ_{t,A_t} ] − min_a E[ Σ_{t=1}^T ℓ_{t,a} ] = E[ Σ_{t=1}^T ℓ_{t,A_t} ] − T min_a µ(a).

Note that since for random variables X and Y we have E[min{X, Y}] ≤ min{E[X], E[Y]} [it is recommended to verify this inequality], we have R̄_T ≤ E[R_T]. A reason to consider the pseudo regret in the stochastic setting is that we can get bounds of order ln T on the pseudo regret (so called "logarithmic" regret bounds), whereas the fluctuations of Σ_{t=1}^T ℓ_{t,a} are of order √T (when we sample T random variables, the deviation of Σ_{t=1}^T ℓ_{t,a} from its expectation Tµ(a) is of order √T). Thus, it is impossible to get logarithmic bounds for the expected regret.

Explanation of the Names In the complete definition of the prediction with expert advice game, on every round of the game the player gets advice from K experts and then takes an action, which may be a function of the advice, and the player, as well as the experts, suffer a loss depending on the action taken.
Hence the name, prediction with expert advice. If we restrict the actions of the player to following the
advice of a single expert, then from the perspective of the playing strategy the actual advice does not
matter and it is only the loss that defines the strategy. We consider the restricted setting, because it
allows us to highlight the relation with multiarmed bandits.
The name multiarmed bandits comes from the analogy with slot machines, which are one-armed
bandits. In this game actions are the “arms” of a slot machine.

Losses vs. Rewards In some games it is more natural to consider rewards (also called gains) rather
than losses. In fact, in the literature on stochastic problems it is more popular to work with rewards,
whereas in the literature on adversarial problems it is more popular to work with losses. There is a
simple transformation r = 1 − `, which brings a losses game into a gains game and the other way around.
Interestingly, in the adversarial setting working with losses leads to tighter and simpler results. In the
stochastic setting the choice does not matter.

5.3 I.I.D. (stochastic) Multiarmed Bandits


In this section we consider multiarmed bandit game, where the outcomes are generated i.i.d. with fixed,
but unknown means. In this game there is no difference between working with losses or rewards, and
since most of the literature is based on games with rewards we are going to use rewards in order to be
consistent. The treatment of losses is identical - see Seldin (2015).
4 It is also possible to analyze the regret, but we do not do it here.

Notations We are given a K × ∞ matrix of rewards (or gains) rt,a , where t ∈ {1, 2, . . . } and a ∈
{1, . . . , K}.

                      r1,1,   r2,1,   · · ·   rt,1,   · · ·
                        ⋮       ⋮               ⋮
    Action rewards    r1,a,   r2,a,   · · ·   rt,a,   · · ·
                        ⋮       ⋮               ⋮
                      r1,K,   r2,K,   · · ·   rt,K,   · · ·
                      ─────────────────────────────────────→
                                       time
We assume that rt,a -s are in [0, 1] and that they are generated independently, so that E [rt,a ] = µ(a).
We use µ* = max_a µ(a) to denote the expected reward of an optimal action and ∆(a) = µ* − µ(a) to denote the suboptimality gap (or simply the gap) of action a. We use a* = arg max_a µ(a) to denote a best action (note that there may be more than one best action, in such case let a* be any of them).

Game Definition

For t = 1, 2, . . . :
1. Pick a row At
2. Observe & accumulate rt,At

Performance Measure Let Nt (a) denote the number of times action a was played up to round t. We
measure the performance using the pseudo regret and we rewrite it in the following way

    R̄_T = max_a E[ Σ_{t=1}^T r_{t,a} ] − E[ Σ_{t=1}^T r_{t,A_t} ]
        = T µ* − E[ Σ_{t=1}^T r_{t,A_t} ]
        = Σ_{t=1}^T E[ µ* − r_{t,A_t} ]
        = Σ_{t=1}^T E[ E[ µ* − r_{t,A_t} | A_t ] ]        (5.1)
        = Σ_{t=1}^T E[ ∆(A_t) ]
        = Σ_a ∆(a) E[ N_T(a) ].

In step (5.1) we note that E[r_{t,A_t}] is an expectation over two random variables, the selection of A_t, which is based on the history of the game, and the draw of r_{t,A_t}, for which E[r_{t,A_t} | A_t] = µ(A_t). We have E[r_{t,A_t}] = E[ E[r_{t,A_t} | A_t] ], where the inner expectation is with respect to the draw of r_{t,A_t} and the outer expectation is with respect to the draw of A_t. Note that in the i.i.d. setting the performance of an algorithm is compared to the best action in expectation (max_a E[ Σ_{t=1}^T r_{t,a} ]), whereas in the adversarial setting the performance of an algorithm is compared to the best action in hindsight (min_a Σ_{t=1}^T ℓ_{t,a}).

Exploration-exploitation trade-off: A simple approach I.i.d. multiarmed bandits is the simplest
problem where we face the exploration-exploitation trade-off. In general, the goal is to play a best arm
on all the rounds, but since the identity of the best arm is unknown it has to be identified first. In order to
identify a best arm we need to explore all the arms. However, rounds used for exploration of suboptimal
arms increase the regret (through the Nt (a)∆(a) term). At the same time, too greedy exploration may
lead to confusion between a best and a suboptimal arm, which may eventually lead to even higher regret
when we start exploiting a wrong arm. So let us make a first attempt to quantify this trade-off. Assume
that we know time horizon T and we start with εT exploration rounds followed by (1 − ε)T exploitation
rounds (where we play what we believe to be a best arm). Also assume that we have just two actions
and we know that for a 6= a∗ we have ∆(a) = ∆. The only thing we do not know is which of the two
actions is the best. So how should we set ε?
Let δ(ε) denote the probability that we misidentify the best arm at the end of the exploration period.
The pseudo regret can be bounded by:

    R̄_T ≤ (1/2) ∆ ε T + δ(ε) ∆ (1 − ε) T ≤ (1/2) ∆ ε T + δ(ε) ∆ T = ( (1/2) ε + δ(ε) ) ∆ T,

where the first term is a bound on the pseudo regret during the exploration phase and the second term
is a bound on the pseudo regret during the exploitation phase in case we select a wrong arm at the end
of the exploration phase. Now what is δ(ε)? Let µ̂t (a) denote the empirical mean of observed rewards
of arm a up to round t. For the exploitation phase it is natural to select the arm that maximizes µ̂εT (a)
at the end of the exploration phase. Therefore:

$$\begin{aligned}
\delta(\varepsilon) &= P\left(\hat\mu_{\varepsilon T}(a) \ge \hat\mu_{\varepsilon T}(a^*)\right) \\
&\le P\left(\hat\mu_{\varepsilon T}(a) \ge \mu(a) + \tfrac{1}{2}\Delta\right) + P\left(\hat\mu_{\varepsilon T}(a^*) \le \mu^* - \tfrac{1}{2}\Delta\right) \\
&\le 2e^{-2\frac{\varepsilon T}{2}\left(\frac{1}{2}\Delta\right)^2} = 2e^{-\varepsilon T\Delta^2/4},
\end{aligned}$$

where the last line is by Hoeffding's inequality (each arm is played $\varepsilon T/2$ times during the exploration
phase). By substituting this back into the regret bound we obtain:
$$\bar R_T \le \left(\frac{1}{2}\varepsilon + 2e^{-\varepsilon T\Delta^2/4}\right)\Delta T.$$
In order to minimize $\frac{1}{2}\varepsilon + 2e^{-\varepsilon T\Delta^2/4}$ we take a derivative and equate it to zero, which leads to $\varepsilon = \frac{\ln(T\Delta^2)}{T\Delta^2/4}$.
It is easy to check that the second derivative is positive, confirming that this is the minimum. Note that
$\varepsilon$ must be non-negative, so strictly speaking we have $\varepsilon = \max\left\{0, \frac{\ln(T\Delta^2)}{T\Delta^2/4}\right\}$. If we substitute this back
into the regret bound we obtain:
$$\bar R_T \le \max\left\{\Delta T,\ \left(\frac{2\ln(T\Delta^2)}{T\Delta^2} + 2e^{-\ln(T\Delta^2)}\right)\Delta T\right\} = \max\left\{\Delta T,\ \frac{2\ln(T\Delta^2)}{\Delta} + \frac{2}{\Delta}\right\}.$$
Note that the number of exploration rounds is $\varepsilon T = \max\left\{0, \frac{\ln(T\Delta^2)}{\Delta^2/4}\right\}$.
Note that the regret bound is larger when $\Delta$ is small. Although intuitively when $\Delta$ is small
we do not care that much about playing a suboptimal action, as opposed to the case when $\Delta$ is large,
problems with small $\Delta$ are actually harder and lead to larger regret. The reason is that the number of
rounds that it takes to identify the best action grows with $1/\Delta^2$. Even though in each exploration round
we only suffer regret $\Delta$, the fact that the number of exploration rounds grows with $1/\Delta^2$ makes
problems with small $\Delta$ harder.
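As an illustration, here is a minimal Python sketch of the explore-then-exploit scheme above for two Bernoulli arms. The simulated environment, the arm means, the horizon, and the small floor on the number of exploration rounds are assumptions made purely for illustration.

import numpy as np

def explore_then_exploit(mu, T, rng):
    # Explore-then-exploit for two arms with known horizon T and Bernoulli rewards.
    # Returns the realized pseudo regret sum_t (mu_star - mu[A_t]).
    delta = abs(mu[0] - mu[1])
    # epsilon as derived above; the floor of two exploration rounds is a practical guard (assumption)
    eps = min(1.0, max(2.0 / T, np.log(max(T * delta ** 2, np.e)) / (T * delta ** 2 / 4.0)))
    n_explore = int(eps * T)
    sums, counts = np.zeros(2), np.zeros(2)
    regret = 0.0
    for t in range(T):
        if t < n_explore:
            a = t % 2                                            # alternate arms while exploring
        else:
            a = int(np.argmax(sums / np.maximum(counts, 1.0)))   # exploit the empirical best arm
        r = rng.binomial(1, mu[a])
        sums[a] += r
        counts[a] += 1
        regret += mu.max() - mu[a]
    return regret

rng = np.random.default_rng(0)
print(explore_then_exploit(np.array([0.5, 0.6]), T=10_000, rng=rng))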
The above approach has three problems: (1) it assumes knowledge of the time horizon $T$, (2) it
assumes knowledge of the gap $\Delta$, and (3) if we tried to generalize it to more than two arms, the
length of the exploration period would depend on the smallest gap, even if there are many arms with
larger gaps that are much easier to eliminate. The following approach resolves all three problems.

Upper Confidence Bound (UCB) algorithm We now consider the UCB1 algorithm of Auer et al.
(2002a).

Algorithm 3 UCB1 (Auer et al., 2002a)
Initialization: Play each action once.
for t = K + 1, K + 2, ... do
    Play $A_t = \arg\max_a \left\{\hat\mu_{t-1}(a) + \sqrt{\frac{3\ln t}{2N_{t-1}(a)}}\right\}$.
end for

Figure 5.4: Illustration for UCB analysis.

The expression $U_t(a) = \hat\mu_{t-1}(a) + \sqrt{\frac{3\ln t}{2N_{t-1}(a)}}$ is called an upper confidence bound. Why? Because
$U_t(a)$ upper bounds $\mu(a)$ with high probability. The UCB approach follows the optimism in the face of
uncertainty principle. That is, we take an optimistic estimate of the reward of every arm by taking the
upper limit of the confidence bound. The UCB1 algorithm has the following regret guarantee.

Theorem 5.1. For any time $T$ the regret of UCB1 satisfies:
$$\bar R_T \le \sum_{a:\Delta(a)>0} \frac{6\ln T}{\Delta(a)} + \left(1 + \frac{\pi^2}{3}\right)\sum_a \Delta(a).$$

Proof. For the analysis it is convenient to have the following picture in mind; see Figure 5.4. A
suboptimal arm is played when $U_t(a) \ge U_t(a^*)$. Our goal is to show that this does not happen very
often. The analysis is based on the following three points, which bound the corresponding distances in
Figure 5.4.
1. We show that $U_t(a^*) > \mu(a^*)$ for almost all rounds. A bit more precisely, let $F(a^*)$ be the number
of rounds when $U_t(a^*) \le \mu(a^*)$; then $E[F(a^*)] \le \frac{\pi^2}{6}$.
2. In a similar way, we show that $\hat\mu_t(a) < \mu(a) + \sqrt{\frac{3\ln t}{2N_t(a)}}$ for almost all rounds. A bit more precisely,
let $F(a)$ be the number of rounds when $\hat\mu_t(a) \ge \mu(a) + \sqrt{\frac{3\ln t}{2N_t(a)}}$; then $E[F(a)] \le \frac{\pi^2}{6}$. (Note that
this is a lower confidence bound for $\mu(a)$, or, in other words, the other side of the inequality compared
to Point 1.)
3. When Point 2 holds we have $U_t(a) = \hat\mu_{t-1}(a) + \sqrt{\frac{3\ln t}{2N_{t-1}(a)}} \le \mu(a) + 2\sqrt{\frac{3\ln t}{2N_{t-1}(a)}} = \mu(a^*) - \Delta(a) + 2\sqrt{\frac{3\ln t}{2N_{t-1}(a)}}$.

Let us fix a time horizon $T$ and analyze what happens by time $T$ (note that the algorithm does not
depend on $T$). We have that for most rounds $t \le T$:
$$U_t(a) < \mu(a^*) - \Delta(a) + \sqrt{\frac{6\ln t}{N_{t-1}(a)}} \le \mu(a^*) - \Delta(a) + \sqrt{\frac{6\ln T}{N_{t-1}(a)}},$$
$$U_t(a^*) > \mu(a^*).$$
Thus, we can play a suboptimal action $a$ only in the following cases:

• Either $\sqrt{\frac{6\ln T}{N_{t-1}(a)}} \ge \Delta(a)$, which means that $N_{t-1}(a) \le \frac{6\ln T}{\Delta(a)^2}$.

• Or one of the confidence intervals in Points 1 or 2 has failed.
In other words, after a suboptimal action $a$ has been played for $\left\lceil\frac{6\ln T}{\Delta(a)^2}\right\rceil$ rounds it can only be played
again if one of the confidence intervals fails. Therefore,
$$E[N_T(a)] \le \frac{6\ln T}{\Delta(a)^2} + E[F(a^*)] + E[F(a)] \le \frac{6\ln T}{\Delta(a)^2} + 1 + \frac{\pi^2}{3},$$
and since $\bar R_T = \sum_a \Delta(a)E[N_T(a)]$ the result follows.
To complete the proof it is left to prove Points 1 and 2. We prove Point 1; the proof of Point 2 is identical. We start by looking at
$$P(U_t(a^*) \le \mu(a^*)) = P\left(\hat\mu_{t-1}(a^*) + \sqrt{\tfrac{3\ln t}{2N_{t-1}(a^*)}} \le \mu(a^*)\right) = P\left(\mu(a^*) - \hat\mu_{t-1}(a^*) \ge \sqrt{\tfrac{3\ln t}{2N_{t-1}(a^*)}}\right).$$
The delicate point is that $N_{t-1}(a^*)$ is a random variable that is not independent of $\hat\mu_{t-1}(a^*)$ and thus we cannot apply Hoeffding's inequality directly. Instead, we look at a series of random variables $X_1, X_2, \dots$, such that the $X_i$-s have the same distribution as the $r_{t,a^*}$-s. Let $\bar\mu_s = \frac{1}{s}\sum_{i=1}^s X_i$ be the average of the first $s$ elements of the sequence. Then we have:
$$\begin{aligned}
P\left(\mu(a^*) - \hat\mu_{t-1}(a^*) \ge \sqrt{\tfrac{3\ln t}{2N_{t-1}(a^*)}}\right) &\le P\left(\exists s : \mu(a^*) - \bar\mu_s \ge \sqrt{\tfrac{3\ln t}{2s}}\right) \\
&\le \sum_{s=1}^t P\left(\mu(a^*) - \bar\mu_s \ge \sqrt{\tfrac{3\ln t}{2s}}\right) \\
&\le \sum_{s=1}^t \frac{1}{t^3} = \frac{1}{t^2},
\end{aligned}$$
where in the first line we decouple the $\hat\mu_t(a^*)$-s from the $N_t(a^*)$-s via the use of the $\bar\mu_s$-s and in the last line we
apply Hoeffding's inequality (note that $3\ln t = \ln t^3$ corresponds to $\ln\frac{1}{\delta}$ in Hoeffding's inequality and
thus $\delta = \frac{1}{t^3}$). Finally, we have:
$$E[F(a^*)] = \sum_{t=1}^\infty P\left(\mu(a^*) - \hat\mu_{t-1}(a^*) \ge \sqrt{\tfrac{3\ln t}{2N_{t-1}(a^*)}}\right) \le \sum_{t=1}^\infty \frac{1}{t^2} = \frac{\pi^2}{6}.$$
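To make the algorithm concrete, here is a minimal Python sketch of UCB1 (Algorithm 3) run against i.i.d. Bernoulli arms. The simulated environment, the arm means, and the horizon are assumptions made purely for illustration.

import numpy as np

def ucb1(mu, T, rng):
    # UCB1 (Algorithm 3) on Bernoulli arms with means mu and horizon T.
    K = len(mu)
    counts = np.zeros(K)          # N_t(a)
    sums = np.zeros(K)            # cumulative reward of arm a
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1             # initialization: play each action once
        else:
            ucb = sums / counts + np.sqrt(3.0 * np.log(t) / (2.0 * counts))
            a = int(np.argmax(ucb))
        r = rng.binomial(1, mu[a])
        counts[a] += 1
        sums[a] += r
        regret += mu.max() - mu[a]
    return regret

rng = np.random.default_rng(1)
print(ucb1(np.array([0.4, 0.5, 0.6]), T=10_000, rng=rng))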

5.4 Prediction with Expert Advice

Notations We are given a $K \times \infty$ matrix of expert losses $\ell_{t,a}$, where $t \in \{1, 2, \dots\}$ and $a \in \{1, \dots, K\}$:

$$\begin{array}{ccccc}
\ell_{1,1}, & \ell_{2,1}, & \cdots & \ell_{t,1}, & \cdots \\
\vdots & \vdots & & \vdots & \\
\ell_{1,a}, & \ell_{2,a}, & \cdots & \ell_{t,a}, & \cdots \\
\vdots & \vdots & & \vdots & \\
\ell_{1,K}, & \ell_{2,K}, & \cdots & \ell_{t,K}, & \cdots
\end{array}$$

Each row holds the losses of one expert and the columns are indexed by time.

Game Definition

For t = 1, 2, . . . :
1. Pick a row At
2. Observe the column $\ell_{t,1}, \dots, \ell_{t,K}$ & suffer $\ell_{t,A_t}$

Performance Measure The performance is measured by the regret
$$R_T = \sum_{t=1}^T \ell_{t,A_t} - \min_a \sum_{t=1}^T \ell_{t,a}.$$
In the notes we analyze the expected regret $E[R_T]$.

Algorithm We consider the Hedge algorithm (a.k.a. exponential weights and weighted majority) for
playing this game.

Algorithm 4 Hedge (a.k.a. Exponential Weights), (Vovk, 1990, Littlestone and Warmuth, 1994)
Input: Learning rates $\eta_1 \ge \eta_2 \ge \cdots > 0$
$\forall a: L_0(a) = 0$
for t = 1, 2, ... do
    $\forall a: p_t(a) = \frac{e^{-\eta_t L_{t-1}(a)}}{\sum_{a'} e^{-\eta_t L_{t-1}(a')}}$
    Sample $A_t$ according to $p_t$ and play it
    Observe $\ell_{t,1}, \dots, \ell_{t,K}$ and suffer $\ell_{t,A_t}$
    $\forall a: L_t(a) = L_{t-1}(a) + \ell_{t,a}$
end for
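As an illustration, the following minimal Python sketch runs Hedge (Algorithm 4) with a fixed learning rate on a randomly generated loss matrix; the loss matrix and the chosen values of T and K are assumptions for illustration only. Subtracting the minimum cumulative loss inside the exponent leaves $p_t$ unchanged and is done only for numerical stability.

import numpy as np

def hedge(losses, eta, rng):
    # Hedge with fixed learning rate eta on a T x K matrix of losses in [0, 1].
    T, K = losses.shape
    L = np.zeros(K)                        # cumulative losses L_t(a)
    total_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (L - L.min()))   # shifting by L.min() does not change p_t
        p = w / w.sum()
        a = rng.choice(K, p=p)
        total_loss += losses[t, a]
        L += losses[t]
    return total_loss, L.min()             # algorithm's loss and best action's loss

rng = np.random.default_rng(2)
T, K = 10_000, 10
losses = rng.random((T, K))
eta = np.sqrt(2 * np.log(K) / T)
alg_loss, best_loss = hedge(losses, eta, rng)
print(alg_loss - best_loss, np.sqrt(2 * T * np.log(K)))   # realized regret vs. the bound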

Analysis We analyze the Hedge algorithm in a slightly simplified setting, where the time horizon T
is known. Unknown time horizon can be handled by using the doubling trick (see home assignment) or,
more elegantly, by a more careful analysis (see, e.g., Bubeck and Cesa-Bianchi (2012)).
The analysis is based on the following lemma.

Lemma 5.2. Let $\{X_{1,a}, X_{2,a}, \dots\}_{a\in\{1,\dots,K\}}$ be $K$ sequences of non-negative numbers ($X_{t,a} \ge 0$ for all
$a$ and $t$). Let $L_t(a) = \sum_{s=1}^t X_{s,a}$, let $L_0(a)$ be zero for all $a$, and let $\eta > 0$. Finally, let $p_t(a) = \frac{e^{-\eta L_{t-1}(a)}}{\sum_{a'} e^{-\eta L_{t-1}(a')}}$. Then:
$$\sum_{t=1}^T \sum_{a=1}^K p_t(a) X_{t,a} - \min_a L_T(a) \le \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^T \sum_{a=1}^K p_t(a)\left(X_{t,a}\right)^2.$$

Proof. We define $W_t = \sum_a e^{-\eta L_t(a)}$ and study how this quantity evolves. We start with an upper bound.
$$\begin{aligned}
\frac{W_t}{W_{t-1}} &= \frac{\sum_a e^{-\eta L_t(a)}}{\sum_a e^{-\eta L_{t-1}(a)}} \\
&= \frac{\sum_a e^{-\eta X_{t,a}} e^{-\eta L_{t-1}(a)}}{\sum_a e^{-\eta L_{t-1}(a)}} \qquad (5.2)\\
&= \sum_a e^{-\eta X_{t,a}} \frac{e^{-\eta L_{t-1}(a)}}{\sum_{a'} e^{-\eta L_{t-1}(a')}} \\
&= \sum_a e^{-\eta X_{t,a}} p_t(a) \qquad (5.3)\\
&\le \sum_a \left(1 - \eta X_{t,a} + \frac{1}{2}\eta^2 \left(X_{t,a}\right)^2\right) p_t(a) \qquad (5.4)\\
&= 1 - \eta \sum_a X_{t,a} p_t(a) + \frac{\eta^2}{2}\sum_a \left(X_{t,a}\right)^2 p_t(a) \\
&\le e^{-\eta \sum_a X_{t,a} p_t(a) + \frac{\eta^2}{2}\sum_a \left(X_{t,a}\right)^2 p_t(a)}, \qquad (5.5)
\end{aligned}$$
where in (5.2) we used the fact that $L_t(a) = X_{t,a} + L_{t-1}(a)$, in (5.3) we used the definition of $p_t(a)$, in
(5.4) we used the inequality $e^x \le 1 + x + \frac{1}{2}x^2$, which holds for $x \le 0$ (this is a delicate point, because the
inequality does not hold for $x > 0$ and, therefore, we must check that the condition $x \le 0$ is satisfied; it
is satisfied under the assumptions of the lemma), and inequality (5.5) is based on the inequality $1 + x \le e^x$,
which holds for all $x$.
Now we consider the ratio $\frac{W_T}{W_0}$. On the one hand:
$$\frac{W_T}{W_0} = \frac{W_1}{W_0}\times\frac{W_2}{W_1}\times\cdots\times\frac{W_T}{W_{T-1}} \le e^{-\eta\sum_{t=1}^T\sum_a X_{t,a}p_t(a) + \frac{\eta^2}{2}\sum_{t=1}^T\sum_a (X_{t,a})^2 p_t(a)}.$$
On the other hand:
$$\frac{W_T}{W_0} = \frac{\sum_a e^{-\eta L_T(a)}}{K} \ge \frac{\max_a e^{-\eta L_T(a)}}{K} = \frac{e^{-\eta\min_a L_T(a)}}{K},$$
where we lower-bounded the sum by its maximal element. By taking the two inequalities together and
applying a logarithm we obtain:
$$-\eta\min_a L_T(a) - \ln K \le -\eta\sum_{t=1}^T\sum_a X_{t,a}p_t(a) + \frac{\eta^2}{2}\sum_{t=1}^T\sum_a (X_{t,a})^2 p_t(a).$$
Finally, by changing sides and dividing by $\eta$ we get:
$$\sum_{t=1}^T\sum_a X_{t,a}p_t(a) - \min_a L_T(a) \le \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^T\sum_a (X_{t,a})^2 p_t(a).$$

Now we are ready to present an analysis of the Hedge algorithm.


Theorem 5.3. The expected regret of the Hedge algorithm with a fixed learning rate $\eta$ satisfies:
$$E[R_T] \le \frac{\ln K}{\eta} + \frac{\eta}{2}T.$$
The bound is minimized by $\eta = \sqrt{\frac{2\ln K}{T}}$, which leads to
$$E[R_T] \le \sqrt{2T\ln K}.$$

Proof. We note that the $\ell_{t,a}$-s are non-negative and apply Lemma 5.2 to obtain:
$$\sum_{t=1}^T\sum_{a=1}^K p_t(a)\ell_{t,a} - \min_a L_T(a) \le \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^T\sum_{a=1}^K p_t(a)\left(\ell_{t,a}\right)^2.$$
Note that $\sum_a p_t(a)\ell_{t,a}$ is the expected loss of Hedge on round $t$ and $\sum_{t=1}^T\sum_{a=1}^K p_t(a)\ell_{t,a}$ is the expected cumulative loss of Hedge after $T$ rounds. Thus, the left hand side of the inequality is the
expected regret of Hedge. Also note that $\ell_{t,a} \le 1$ and thus $(\ell_{t,a})^2 \le 1$ and $\sum_a p_t(a)(\ell_{t,a})^2 \le 1$. Thus,
$\sum_{t=1}^T\sum_{a=1}^K p_t(a)(\ell_{t,a})^2 \le T$. Altogether, we get that:
$$E[R_T] \le \frac{\ln K}{\eta} + \frac{\eta}{2}T.$$
By taking the derivative of the right hand side with respect to $\eta$ and equating it to zero we obtain $-\frac{\ln K}{\eta^2} + \frac{T}{2} = 0$
and thus $\eta = \sqrt{\frac{2\ln K}{T}}$ is an extremal point. The second derivative is $\frac{2\ln K}{\eta^3}$ and since $\eta > 0$ it is positive.
Thus, the extremal point is the minimum.

5.4.1 Lower Bound
A lower bound for the expected regret in prediction with expert advice is based on the following con-
struction. We draw a K × ∞ matrix of losses with each loss drawn according to Bernoulli distribution
with bias 1/2. In this game the expected loss of any algorithm after T rounds is T /2, irrespective of what
the algorithm is doing. However, the loss of the best action in hindsight is lower, because we are selecting
the “best” out of K rows. For each individual row the expected loss is T /2, but the expectation of the
minimum of the losses is lower. The reduction is quantified in the following theorem, see Cesa-Bianchi
and Lugosi (2006) for a proof.

Theorem 5.4. Let $\ell_{t,a}$ be i.i.d. Bernoulli random variables with bias 1/2. Then
$$\lim_{T\to\infty}\lim_{K\to\infty}\frac{T/2 - E\left[\min_a \sum_{t=1}^T \ell_{t,a}\right]}{\sqrt{\frac{1}{2}T\ln K}} = 1.$$
Note that the numerator in the above expression, $T/2 - E\left[\min_a \sum_{t=1}^T \ell_{t,a}\right]$, is the expectation with
respect to the generation of the matrix of losses of the expected regret. Thus, if the adversary generates
the matrix of losses according to the construction described above, then in expectation with respect to the
generation of the matrix, and in the limit of $K$ and $T$ going to infinity, the expected regret cannot be
smaller than $\sqrt{\frac{1}{2}T\ln K}$.
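As a quick numerical illustration (not part of the proof), the following Python sketch estimates the numerator of Theorem 5.4 by simulation for one arbitrary choice of K and T and compares it to $\sqrt{\frac{1}{2}T\ln K}$; the two quantities only coincide in the stated double limit.

import numpy as np

rng = np.random.default_rng(3)
T, K, n_runs = 10_000, 100, 200
gaps = []
for _ in range(n_runs):
    losses = rng.integers(0, 2, size=(T, K))      # i.i.d. Bernoulli(1/2) losses
    gaps.append(T / 2 - losses.sum(axis=0).min()) # T/2 minus the loss of the best row
print(np.mean(gaps), np.sqrt(0.5 * T * np.log(K)))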

5.5 Adversarial Multiarmed Bandits


Game definition We are working with the same matrix of losses as in prediction with expert advice,
but now at each round of the game we are allowed to observe only the loss of the row that we have
played:
For t = 1, 2, . . . :
1. Pick a row At

2. Observe & suffer $\ell_{t,A_t}$. (The $\ell_{t,a}$-s for $a \ne A_t$ remain unobserved.)

Algorithm The algorithm is based on using importance-weighted estimates of the losses in the Hedge
algorithm.5

Algorithm 5 EXP3 (Auer et al., 2002b)
Input: Learning rates $\eta_1 \ge \eta_2 \ge \cdots > 0$
$\forall a: \tilde L_0(a) = 0$
for t = 1, 2, ... do
    $\forall a: p_t(a) = \frac{e^{-\eta_t \tilde L_{t-1}(a)}}{\sum_{a'} e^{-\eta_t \tilde L_{t-1}(a')}}$
    Sample $A_t$ according to $p_t$ and play it
    Observe and suffer $\ell_{t,A_t}$
    Set $\tilde\ell_{t,a} = \frac{\ell_{t,a}\mathbb{1}(A_t = a)}{p_t(a)} = \begin{cases}\frac{\ell_{t,a}}{p_t(a)}, & \text{if } A_t = a \\ 0, & \text{otherwise}\end{cases}$
    $\forall a: \tilde L_t(a) = \tilde L_{t-1}(a) + \tilde\ell_{t,a}$
end for

5 We note that the original algorithm in Auer et al. (2002b) was formulated for the gains game. Here we present an
improved algorithm for the losses game (Stoltz, 2005, Bubeck, 2010). We refer to the home assignment for the difference
between the two.
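A minimal Python sketch of EXP3 (Algorithm 5) with a fixed learning rate is given below; the randomly generated loss matrix and the chosen T and K are illustrative assumptions, not part of the algorithm.

import numpy as np

def exp3(losses, eta, rng):
    # EXP3 with fixed learning rate eta on a T x K loss matrix in [0, 1].
    T, K = losses.shape
    L_tilde = np.zeros(K)                  # cumulative importance-weighted losses
    total_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (L_tilde - L_tilde.min()))
        p = w / w.sum()
        a = rng.choice(K, p=p)
        total_loss += losses[t, a]
        L_tilde[a] += losses[t, a] / p[a]  # importance-weighted estimate; unplayed arms get 0
    return total_loss

rng = np.random.default_rng(4)
T, K = 10_000, 10
losses = rng.random((T, K))
eta = np.sqrt(2 * np.log(K) / (K * T))
print(exp3(losses, eta, rng) - losses.sum(axis=0).min())   # realized regret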

Properties of importance-weighted samples Before we analyze the EXP3 algorithm we discuss a
number of important properties of importance-weighted sampling.
1. The samples $\tilde\ell_{t,a}$ are not independent in two ways. First, for a fixed $t$, the set $\{\tilde\ell_{t,1},\dots,\tilde\ell_{t,K}\}$
is dependent (if we know that one of the $\tilde\ell_{t,a}$-s is non-zero, we automatically know that all the rest
are zero). And second, $\tilde\ell_{t,a}$ depends on all $\tilde\ell_{s,a'}$ for $s < t$ and all $a'$, since $p_t(a)$ depends on
$\{\tilde\ell_{s,a}\}_{1\le s<t,\,a\in\{1,\dots,K\}}$, which is the history of the game up to round $t$. In other words, $p_t(a)$ itself
is a random variable.
2. Even though the $\tilde\ell_{t,a}$-s are not independent, they are unbiased estimates of the true losses. Specifically,
$$\begin{aligned}
E\left[\tilde\ell_{t,a}\right] &= E\left[\frac{\ell_{t,a}\mathbb{1}(A_t=a)}{p_t(a)}\right] \\
&= E\left[E\left[\frac{\ell_{t,a}\mathbb{1}(A_t=a)}{p_t(a)}\,\middle|\,A_1,\dots,A_{t-1}\right]\right] \\
&= E\left[\frac{\ell_{t,a}}{p_t(a)}E\left[\mathbb{1}(A_t=a)\,\middle|\,A_1,\dots,A_{t-1}\right]\right] \\
&= E\left[\frac{\ell_{t,a}}{p_t(a)}p_t(a)\right] \\
&= \ell_{t,a}.
\end{aligned}$$
The first expectation above is with respect to $A_1,\dots,A_t$. In the nested expectations, the external
expectation is with respect to $A_1,\dots,A_{t-1}$ and the internal is with respect to $A_t$. Note that $p_t(a)$
is a random variable depending on $A_1,\dots,A_{t-1}$, thus after the conditioning on $A_1,\dots,A_{t-1}$ it is
deterministic.
3. Since $\ell_{t,a} \in [0,1]$, we have $\tilde\ell_{t,a} \in \left[0, \frac{1}{p_t(a)}\right]$.

4. What is important is that the second moment of the $\tilde\ell_{t,a}$-s is by an order of magnitude smaller than
the second moment of a general random variable in the corresponding range. This is because the
expectation of the $\tilde\ell_{t,a}$-s is in the $[0,1]$ interval. Specifically:
$$\begin{aligned}
E\left[\left(\tilde\ell_{t,a}\right)^2\right] &= E\left[\left(\frac{\ell_{t,a}\mathbb{1}(A_t=a)}{p_t(a)}\right)^2\right] \\
&= E\left[\frac{(\ell_{t,a})^2(\mathbb{1}(A_t=a))^2}{p_t(a)^2}\right] \\
&= E\left[\frac{(\ell_{t,a})^2\mathbb{1}(A_t=a)}{p_t(a)^2}\right] \\
&\le E\left[\frac{\mathbb{1}(A_t=a)}{p_t(a)^2}\right] \\
&= E\left[E\left[\frac{\mathbb{1}(A_t=a)}{p_t(a)^2}\,\middle|\,A_1,\dots,A_{t-1}\right]\right] \\
&= E\left[\frac{1}{p_t(a)^2}E\left[\mathbb{1}(A_t=a)\,\middle|\,A_1,\dots,A_{t-1}\right]\right] \\
&= E\left[\frac{1}{p_t(a)}\right],
\end{aligned}$$
where we have used $(\mathbb{1}(A_t=a))^2 = \mathbb{1}(A_t=a)$ and $(\ell_{t,a})^2 \le 1$ (since $\ell_{t,a}\in[0,1]$).
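The following small Python check (with arbitrary illustrative values of $p_t$ and $\ell_t$, and a fixed distribution over a single round) numerically illustrates Properties 2 and 4: the empirical means of the importance-weighted estimates match the true losses and their second moments are bounded by $1/p_t(a)$.

import numpy as np

rng = np.random.default_rng(5)
p = np.array([0.2, 0.3, 0.5])      # p_t over K = 3 actions (illustrative)
ell = np.array([0.9, 0.1, 0.4])    # true losses ell_{t,a} (illustrative)
n = 1_000_000
A = rng.choice(3, size=n, p=p)                      # repeated draws of A_t
ell_tilde = np.zeros((n, 3))
ell_tilde[np.arange(n), A] = ell[A] / p[A]          # importance-weighted estimates
print(ell_tilde.mean(axis=0), ell)                  # Property 2: means approx. ell
print((ell_tilde ** 2).mean(axis=0), 1.0 / p)       # Property 4: second moments <= 1/p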

Analysis Now we are ready to present the analysis of the algorithm.

Theorem 5.5. The expected regret of the EXP3 algorithm with a fixed learning rate $\eta$ satisfies:
$$E[R_T] \le \frac{\ln K}{\eta} + \frac{\eta}{2}KT.$$
The bound is minimized by $\eta = \sqrt{\frac{2\ln K}{KT}}$, which leads to
$$E[R_T] \le \sqrt{2KT\ln K}.$$
Note that the extra payment for being able to observe just one entry rather than the full column is
the multiplicative $\sqrt{K}$ factor in the regret bound.
Proof. The proof of the theorem is based on Lemma 5.2. We note that the $\tilde\ell_{t,a}$-s are all non-negative and,
thus, by Lemma 5.2 we have:
$$\sum_{t=1}^T\sum_a p_t(a)\tilde\ell_{t,a} - \min_a \tilde L_T(a) \le \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^T\sum_a p_t(a)\left(\tilde\ell_{t,a}\right)^2.$$
By taking expectations of the two sides of the inequality we obtain:
$$E\left[\sum_{t=1}^T\sum_a p_t(a)\tilde\ell_{t,a}\right] - E\left[\min_a \tilde L_T(a)\right] \le \frac{\ln K}{\eta} + \frac{\eta}{2}E\left[\sum_{t=1}^T\sum_a p_t(a)\left(\tilde\ell_{t,a}\right)^2\right].$$
We note that $E[\min[\cdot]] \le \min[E[\cdot]]$ and thus:
$$E\left[\sum_{t=1}^T\sum_a p_t(a)\tilde\ell_{t,a}\right] - \min_a E\left[\tilde L_T(a)\right] \le \frac{\ln K}{\eta} + \frac{\eta}{2}E\left[\sum_{t=1}^T\sum_a p_t(a)\left(\tilde\ell_{t,a}\right)^2\right].$$
And now we consider the three expectation terms in this inequality.
$$E\left[\sum_{t=1}^T\sum_a p_t(a)\tilde\ell_{t,a}\right] = E\left[\sum_{t=1}^T\sum_a E\left[p_t(a)\tilde\ell_{t,a}\,\middle|\,A_1,\dots,A_{t-1}\right]\right] = E\left[\sum_{t=1}^T\sum_a p_t(a)\ell_{t,a}\right],$$
which is the expected loss of EXP3.
$$E\left[\tilde L_T(a)\right] = \sum_{t=1}^T E\left[\tilde\ell_{t,a}\right] = \sum_{t=1}^T \ell_{t,a},$$
which is the cumulative loss of row $a$ up to time $T$. And, finally,
$$E\left[\sum_{t=1}^T\sum_a p_t(a)\left(\tilde\ell_{t,a}\right)^2\right] = E\left[\sum_{t=1}^T\sum_a E\left[p_t(a)\left(\tilde\ell_{t,a}\right)^2\,\middle|\,A_1,\dots,A_{t-1}\right]\right] \le E\left[\sum_{t=1}^T\sum_a p_t(a)\frac{1}{p_t(a)}\right] = KT.$$
Putting all three together back into the inequality we obtain the first statement of the theorem. And,
as before, we find $\eta$ that minimizes the bound.

5.5.1 Lower Bound

The lower bound is based on a construction of $K+1$ games. In the 0-th game all losses are Bernoulli with
bias 1/2. In the $i$-th game for $i \in \{1,\dots,K\}$ all losses are Bernoulli with bias 1/2 except the losses of the
$i$-th arm, which are Bernoulli with bias $1/2 - \varepsilon$ for $\varepsilon = \sqrt{cK/T}$, where $c$ is a properly selected constant.
With $T/K$ pulls it is impossible to distinguish between a Bernoulli random variable with bias 1/2 and
a Bernoulli random variable with bias $1/2 - \sqrt{K/T}$, because they induce indistinguishable distributions
over sequences of length $T/K$. As a result, within $T$ pulls the player cannot distinguish between the 0-th
game and the $i$-th games. Therefore, if the adversary picks an $i$-th game at random, the player's regret
will on average (with respect to the adversary's and the player's choices) be at least $\Omega(\varepsilon T) = \Omega\left(\sqrt{KT}\right)$.
For the details of the proof see Cesa-Bianchi and Lugosi (2006), Bubeck and Cesa-Bianchi (2012).
It is possible to close the $\sqrt{\ln K}$ gap between the upper and the lower bound by modifying the
algorithm and improving the upper bound. See Bubeck and Cesa-Bianchi (2012) for details.

5.6 Adversarial Multiarmed Bandits with Expert Advice
Game setting We are, again, working with the same matrix of losses as in prediction with expert
advice. But now on every round of the game we get the advice of $N$ experts, indexed by $h$, in the form of a
distribution over the $K$ arms. More formally:
For t = 1, 2, . . . :

1. Observe $q_{t,1},\dots,q_{t,N}$, where $q_{t,h}$ is a probability distribution over $\{1,\dots,K\}$.
2. Pick a row $A_t$.
3. Observe & suffer $\ell_{t,A_t}$. (The $\ell_{t,a}$-s for $a \ne A_t$ remain unobserved.)

Performance measure We compare the expected loss of the algorithm to the expected loss of the
best expert, where the expectation of the loss of expert $h$ is taken with respect to its advice vector $q_h$.
Specifically:
$$E[R_T] = \sum_{t=1}^T\sum_a p_t(a)\ell_{t,a} - \min_h \sum_{t=1}^T\sum_a q_{t,h}(a)\ell_{t,a}.$$

Algorithm The algorithm is quite similar to the EXP3 algorithm.6 Note that now $\tilde L_t(h)$ tracks the
cumulative importance-weighted estimate of expert losses instead of individual arm losses.

Algorithm 6 EXP4 (Auer et al., 2002b)
Input: Learning rates $\eta_1 \ge \eta_2 \ge \cdots > 0$
$\forall h: \tilde L_0(h) = 0$
for t = 1, 2, ... do
    $\forall h: w_t(h) = \frac{e^{-\eta_t \tilde L_{t-1}(h)}}{\sum_{h'} e^{-\eta_t \tilde L_{t-1}(h')}}$
    Observe $q_{t,1},\dots,q_{t,N}$
    $\forall a: p_t(a) = \sum_h w_t(h)q_{t,h}(a)$
    Sample $A_t$ according to $p_t$ and play it
    Observe and suffer $\ell_{t,A_t}$
    Set $\tilde\ell_{t,a} = \frac{\ell_{t,a}\mathbb{1}(A_t=a)}{p_t(a)} = \begin{cases}\frac{\ell_{t,a}}{p_t(a)}, & \text{if } A_t = a \\ 0, & \text{otherwise}\end{cases}$
    $\forall h$: Set $\tilde\ell_{t,h} = \sum_a q_{t,h}(a)\tilde\ell_{t,a}$
    $\forall h: \tilde L_t(h) = \tilde L_{t-1}(h) + \tilde\ell_{t,h}$
end for
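A minimal Python sketch of EXP4 (Algorithm 6) with a fixed learning rate is given below; the randomly generated loss matrix and the randomly drawn expert advice are illustrative assumptions only.

import numpy as np

def exp4(loss_matrix, advice, eta, rng):
    # EXP4 with fixed eta. loss_matrix: T x K losses in [0, 1];
    # advice: T x N x K array, advice[t, h] is expert h's distribution over arms at round t.
    T, K = loss_matrix.shape
    N = advice.shape[1]
    L_tilde = np.zeros(N)                         # cumulative estimated expert losses
    total_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (L_tilde - L_tilde.min()))
        w /= w.sum()
        p = w @ advice[t]                         # p_t(a) = sum_h w_t(h) q_{t,h}(a)
        a = rng.choice(K, p=p)
        total_loss += loss_matrix[t, a]
        ell_tilde = np.zeros(K)
        ell_tilde[a] = loss_matrix[t, a] / p[a]   # importance-weighted arm losses
        L_tilde += advice[t] @ ell_tilde          # importance-weighted expert losses
    return total_loss

rng = np.random.default_rng(6)
T, K, N = 5_000, 5, 20
losses = rng.random((T, K))
advice = rng.dirichlet(np.ones(K), size=(T, N))   # random expert advice, for illustration
eta = np.sqrt(2 * np.log(N) / (K * T))
alg_loss = exp4(losses, advice, eta, rng)
expert_losses = np.einsum('tnk,tk->n', advice, losses)   # expected loss of each expert
print(alg_loss - expert_losses.min())                    # realized regret against the best expert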

Analysis The EXP4 algorithm satisfies the following regret guarantee.

Theorem 5.6. The expected regret of the EXP4 algorithm with a fixed learning rate $\eta$ satisfies:
$$E[R_T] \le \frac{\ln N}{\eta} + \frac{\eta}{2}KT.$$
The bound is minimized by $\eta = \sqrt{\frac{2\ln N}{KT}}$, which leads to
$$E[R_T] \le \sqrt{2KT\ln N}.$$
Note that the $\ln N$ term plays the role of the complexity of the class of experts in a very similar way to the
complexity terms we saw earlier in supervised learning (specifically, in the uniform union bound).
6 As with the EXP3 algorithm we present a slightly improved version of the algorithm for the game with losses. The

original algorithm was designed for the game with rewards.

Proof. The analysis is quite similar to the analysis of the EXP3 algorithm. We note that the $\tilde\ell_{t,h}$-s are all
non-negative and that $w_t$ is a distribution over $\{1,\dots,N\}$ defined in the same way as $p_t$ in Lemma 5.2.
Thus, by Lemma 5.2 we have:
$$\sum_{t=1}^T\sum_h w_t(h)\tilde\ell_{t,h} - \min_h \tilde L_T(h) \le \frac{\ln N}{\eta} + \frac{\eta}{2}\sum_{t=1}^T\sum_h w_t(h)\left(\tilde\ell_{t,h}\right)^2.$$
By taking expectations of the two sides of this expression we obtain:
$$E\left[\sum_{t=1}^T\sum_h w_t(h)\tilde\ell_{t,h}\right] - E\left[\min_h \tilde L_T(h)\right] \le \frac{\ln N}{\eta} + \frac{\eta}{2}E\left[\sum_{t=1}^T\sum_h w_t(h)\left(\tilde\ell_{t,h}\right)^2\right].$$
As before, $E[\min[\cdot]] \le \min[E[\cdot]]$ and thus:
$$E\left[\sum_{t=1}^T\sum_h w_t(h)\tilde\ell_{t,h}\right] - \min_h E\left[\tilde L_T(h)\right] \le \frac{\ln N}{\eta} + \frac{\eta}{2}E\left[\sum_{t=1}^T\sum_h w_t(h)\left(\tilde\ell_{t,h}\right)^2\right].$$
And now we consider the three expectation terms in this inequality.
$$E\left[\sum_{t=1}^T\sum_h w_t(h)\tilde\ell_{t,h}\right] = E\left[\sum_{t=1}^T\sum_h w_t(h)\sum_a q_{t,h}(a)\tilde\ell_{t,a}\right] = E\left[\sum_{t=1}^T\sum_a\left(\sum_h w_t(h)q_{t,h}(a)\right)\tilde\ell_{t,a}\right] = E\left[\sum_{t=1}^T\sum_a p_t(a)\tilde\ell_{t,a}\right] = E\left[\sum_{t=1}^T\sum_a p_t(a)\ell_{t,a}\right],$$
where the first equality is by the definition of $\tilde\ell_{t,h}$ and the last equality is due to the unbiasedness of $\tilde\ell_{t,a}$.
Thus, the first expectation is the expected loss of EXP4.
$$E\left[\tilde L_T(h)\right] = E\left[\sum_{t=1}^T \tilde\ell_{t,h}\right] = E\left[\sum_{t=1}^T\sum_a q_{t,h}(a)\tilde\ell_{t,a}\right] = E\left[\sum_{t=1}^T\sum_a q_{t,h}(a)\ell_{t,a}\right],$$
where we can remove the tilde due to the unbiasedness of $\tilde\ell_{t,a}$, and we obtain the expected cumulative loss of
expert $h$ over $T$ rounds. And, finally,
$$\begin{aligned}
E\left[\sum_{t=1}^T\sum_h w_t(h)\left(\tilde\ell_{t,h}\right)^2\right] &= E\left[\sum_{t=1}^T\sum_h w_t(h)\left(\sum_a q_{t,h}(a)\tilde\ell_{t,a}\right)^2\right] \\
&\le E\left[\sum_{t=1}^T\sum_h w_t(h)\sum_a q_{t,h}(a)\left(\tilde\ell_{t,a}\right)^2\right] \\
&= E\left[\sum_{t=1}^T\sum_a\left(\sum_h w_t(h)q_{t,h}(a)\right)\left(\tilde\ell_{t,a}\right)^2\right] \\
&= E\left[\sum_{t=1}^T\sum_a p_t(a)\left(\tilde\ell_{t,a}\right)^2\right] \\
&\le KT,
\end{aligned}$$
where the first inequality is by Jensen's inequality and convexity of $x^2$, and the last inequality is along the
same lines as the analogous inequality in the analysis of EXP3. By substituting the three expectations
back into the inequality we obtain the first statement of the theorem. And, as before, we find $\eta$ that
minimizes the bound.

5.6.1 Lower Bound
It is possible to show that the regret of adversarial multiarmed bandits with expert advice must be at
least $\Omega\left(\sqrt{KT\frac{\ln N}{\ln K}}\right)$. The lower bound is based on a construction of $\frac{\ln N}{\ln K}$ independent bandit problems,
each according to the construction of the lower bound for multiarmed bandits in Section 5.5.1, and a
construction of expert advice, so that for every possible selection of best arms for the subproblems there
is an expert that recommends that selection. For details of the proof see Agarwal et al. (2012), Seldin
and Lugosi (2016). Closing the $\sqrt{\ln K}$ gap between the upper and the lower bound is an open problem.

Appendix A

Set Theory Basics

In this chapter we provide a number of basic definitions and notation from set theory that are used
in the notes.

Countable and Uncountable sets A set is called countable if its elements can be counted or, in
other words, if every element in a set can be associated with a natural number. For example, the set of
integer numbers is countable and the set of rational numbers (ratios of two integers) is also countable.
Finite sets are countable as well. A set is called uncountable if its elements cannot be enumerated. For
example, the set of real numbers R is uncountable and the set of numbers in a [0, 1] interval is also
uncountable.

Relations between sets For two sets A and B we use A ⊆ B to denote that A is a subset of B.

Operations on sets For two sets A and B we use A ∪ B to denote the union of A and B; A ∩ B the
intersection of A and B; and A \ B the difference of A and B (the set of elements that are in A, but not
in B).

The empty set We use ∅ or {} to denote the empty set.

Disjoint sets Two sets A and B are called disjoint if A ∩ B = ∅.

Appendix B

Probability Theory Basics

This chapter provides a number of basic definitions and results from probability theory. It is partially
based on Mitzenmacher and Upfal (2005).

B.1 Axioms of Probability


We start with a definition of a probability space.
Definition B.1 (Probability space). A probability space is a tuple (Ω, F, P), where
• Ω is a sample space, which is the set of all possible outcomes of the random process modeled by the
probability space.

• F is a family of sets representing the allowable events, where each set in F is a subset of the sample
space Ω.
• P is a probability function P : F → [0, 1] satisfying Definition B.4.
Elements of Ω are called simple or elementary events.

Example B.2. For coin flips the sample space is Ω = {H, T }, where H stands for “heads” and T for
“tails”.
In dice rolling the sample space is Ω = {1, 2, 3, 4, 5, 6}, where 1,. . . ,6 label the sides of a dice (you
should consider them as labels rather than numerical values, we get back to this later in Example B.15).
If we simultaneously flip a coin and roll a dice the sample space is
Ω = {(H, 1), (T, 1), (H, 2), (T, 2), . . . , (H, 6), (T, 6)}.
If Ω is countable (including finite), the probability space is discrete. In discrete probability spaces the
family F consists of all subsets of Ω. In particular, F always includes the empty set ∅ and the complete
sample space Ω. If Ω is uncountably infinite (for example, the real line or the [0, 1] interval) a proper
definition of F requires concepts from the measure theory, which go beyond the scope of these notes.

Example B.3. In the coin flipping experiment F = {∅, {H} , {T } , {H, T }}.
Definition B.4 (Probability Axioms). A probability function is any function P : F → R that satisfies
the following conditions
1. For any event E ∈ F, 0 ≤ P(E) ≤ 1.

2. P(Ω) = 1.
3. For any finite or countably infinite sequence of mutually disjoint events $E_1, E_2, \dots$
$$P\left(\bigcup_{i\ge 1} E_i\right) = \sum_{i\ge 1} P(E_i).$$

We now consider a number of basic properties of probabilities.
Lemma B.5 (Monotonicity). Let A and B be two events, such that A ⊆ B. Then

P(A) ≤ P(B).

Proof. We have that B = A ∪ (B \ A) and the events A and B \ A are disjoint. Thus,

P(B) = P(A) + P(B \ A) ≥ P(A),

where the equality is by the third axiom of probabilities and the inequality is by the first axiom of
probabilities, since P(B \ A) ≥ 0.
The next simple, but very important result is known as the union bound.
Lemma B.6 (The union bound). For any finite or countably infinite sequence of events $E_1, E_2, \dots$,
$$P\left(\bigcup_{i\ge 1} E_i\right) \le \sum_{i\ge 1} P(E_i).$$
Proof. We have
$$\bigcup_{i\ge 1} E_i = E_1 \cup (E_2\setminus E_1) \cup (E_3\setminus(E_1\cup E_2)) \cup \cdots = \bigcup_{i\ge 1} F_i,$$
where the events $F_i = E_i \setminus \bigcup_{j=1}^{i-1}E_j$ are disjoint, $F_i \subseteq E_i$, and $\bigcup_{i\ge 1}F_i = \bigcup_{i\ge 1}E_i$. Therefore,
$$P\left(\bigcup_{i\ge 1}E_i\right) = P\left(\bigcup_{i\ge 1}F_i\right) = \sum_{i\ge 1}P(F_i) \le \sum_{i\ge 1}P(E_i),$$
where the second equality is by the third axiom of probabilities and the inequality is by monotonicity of
the probability (Lemma B.5).
Example B.7. Let $E_1 = \{1,3,5\}$ be the event that the outcome of a dice roll is odd and $E_2 = \{1,2,3\}$
be the event that the outcome is at most 3. Then $P(E_1\cup E_2) = P(\{1,2,3,5\}) \le P(E_1) + P(E_2)$. Note that
this is true irrespective of the choice of the probability measure $P$. In particular, this is true irrespective
of whether the dice is fair or not.
Definition B.8 (Independence). Two events A and B are called independent if and only if

P(A ∩ B) = P(A) · P(B).

Definition B.9 (Pairwise independence). Events E1 , . . . , En are called pairwise independent if and only
if for any pair i, j
P(Ei ∩ Ej ) = P(Ei )P(Ej ).
Definition B.10 (Mutual independence). Events $E_1,\dots,E_n$ are called mutually independent if and
only if for any subset of indices $I \subseteq \{1,\dots,n\}$
$$P\left(\bigcap_{i\in I}E_i\right) = \prod_{i\in I}P(E_i).$$

Note that pairwise independence does not imply mutual independence. Take the following example:
assume we roll a fair tetrahedron (a three-dimensional object with four faces) with faces colored in red,
blue, green, and the fourth face colored in all three colors, red, blue, and green. Let $E_1$ be the event that
we observe red color, $E_2$ be the event that we observe blue color, and $E_3$ be the event that we observe green
color. Then for all $i$ we have $P(E_i) = \frac{1}{2}$ and for any pair $i \ne j$ we have $P(E_i\cap E_j) = \frac{1}{4} = P(E_i)P(E_j)$.
However, $P(E_1\cap E_2\cap E_3) = \frac{1}{4} \ne P(E_1)P(E_2)P(E_3)$ and, thus, the events are pairwise independent, but
not mutually independent. If we say that events $E_1,\dots,E_n$ are independent without further specifications
we imply mutual independence.

Definition B.11 (Conditional probability). The conditional probability that event $A$ occurs given that
event $B$ occurs is
$$P(A|B) = \frac{P(A\cap B)}{P(B)}.$$
The conditional probability is well-defined only if $P(B) > 0$.

By the definition we have that $P(A\cap B) = P(B)P(A|B) = P(A)P(B|A)$.


Example B.12. For a fair dice let $A = \{1,6\}$ and $B = \{1,2,3,4\}$. Then
$$P(A) = \frac{1}{3},\qquad P(B) = \frac{2}{3},\qquad A\cap B = \{1\},\qquad P(A\cap B) = \frac{1}{6},\qquad P(A|B) = \frac{1/6}{2/3} = \frac{1}{4}.$$

Lemma B.13 (The law of total probability). Let $E_1, E_2,\dots,E_n$ be mutually disjoint events, such that
$\bigcup_{i=1}^n E_i = \Omega$. Then
$$P(A) = \sum_{i=1}^n P(A\cap E_i) = \sum_{i=1}^n P(A|E_i)P(E_i).$$
Proof. Since the $E_i$-s are disjoint and cover the entire space it follows that $A = \bigcup_{i=1}^n (A\cap E_i)$ and the
events $A\cap E_i$ are mutually disjoint. Therefore,
$$P(A) = P\left(\bigcup_{i=1}^n (A\cap E_i)\right) = \sum_{i=1}^n P(A\cap E_i) = \sum_{i=1}^n P(A|E_i)P(E_i).$$

B.2 Discrete Random Variables


We now define another basic concept in probability theory, a random variable.
Definition B.14. A random variable X on a sample space Ω is a real-valued function on Ω, that is
X : Ω → R. A discrete random variable is a random variable that takes on only a finite or countably
infinite number of values.
Example B.15. For a coin we can define a random variable X, such that X(H) = 1 and X(T ) = 0.
We can also define another random variable Y , such that Y (H) = 1 and Y (T ) = −1.
For a dice we can define a random variable X, such that X(1) = 1, X(2) = 2, X(3) = 3, X(4) =
4, X(5) = 5, X(6) = 6. We can also define a random variable Y , such that Y (1) = 3, Y (2) = 2.4, Y (3) =
−6, Y (4) = 8, Y (5) = 8, Y (6) = 0. This example emphasizes the difference between labeling of events
and assignment of numerical values to events. Note that the random variable Y does not distinguish
between faces 4 and 5 of the dice, even though they are separate events in the probability space.
Functions of random variables are also random variables. In the last example, a random variable
Z = X 2 takes values Z(1) = 1, Z(2) = 4, Z(3) = 9, . . . , Z(6) = 36.

Definition B.16 (Independence of random variables). Two random variables X and Y are independent
if and only if
P((X = x) ∩ (Y = y)) = P(X = x)P(Y = y).
for all values x and y.

Definition B.17 (Pairwise independence). Random variables X1 , . . . , Xn are pairwise independent if
and only if for any pair i, j and any values xi , xj

P((Xi = xi ) ∩ (Xj = xj )) = P(Xi = xi )P(Xj = xj ).

Definition B.18 (Mutual independence). Random variables $X_1,\dots,X_n$ are mutually independent if
and only if for any subset of indices $I\subseteq\{1,\dots,n\}$ and any values $x_i$, $i\in I$,
$$P\left(\bigcap_{i\in I}(X_i = x_i)\right) = \prod_{i\in I}P(X_i = x_i).$$

Similar to the example given earlier, pairwise independence of random variables does not imply their
mutual independence. If we say that random variables are independent without further specifications we
imply mutual independence.

B.3 Expectation
Expectation is the most basic characteristic of a random variable.
Definition B.19 (Expectation). Let $X$ be a discrete random variable and let $\mathcal{X}$ be the set of all possible
values that it can take. The expectation of $X$, denoted by $E[X]$, is given by
$$E[X] = \sum_{x\in\mathcal{X}} x P(X = x).$$
The expectation is finite if $\sum_{x\in\mathcal{X}}|x|P(X=x)$ converges; otherwise the expectation is unbounded.

Example B.20. For a fair dice with faces numbered 1 to 6 let $X(i) = i$ (the $i$-th face gets value $i$).
Then
$$E[X] = \sum_{i=1}^6 i\cdot\frac{1}{6} = \frac{7}{2}.$$
Take another random variable $Z = X^2$; then
$$E[Z] = E\left[X^2\right] = \sum_{i=1}^6 i^2\cdot\frac{1}{6} = \frac{91}{6}.$$

Expectation satisfies a number of important properties (these properties also hold for continuous
random variables). We leave a proof of these properties as an exercise.
Lemma B.21 (Multiplication by a constant). For any constant c

E [cX] = cE [X] .

Theorem B.22 (Linearity). For any pair of random variables X and Y , not necessarily independent,

E [X + Y ] = E [X] + E [Y ] .

Theorem B.23. If X and Y are independent random variables, then

E [XY ] = E [X] E [Y ] .

We emphasize that in contrast with Theorem B.22, this property does not hold in the general case (if X
and Y are not independent).

B.4 Variance
Variance is the second most basic characteristic of a random variable.
Definition B.24 (Variance). The variance of a random variable $X$ (discrete or continuous), denoted
by $\mathrm{Var}[X]$, is defined by
$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E\left[X^2\right] - (E[X])^2.$$
We invite the reader to prove that $E\left[(X-E[X])^2\right] = E\left[X^2\right] - (E[X])^2$.

Example B.25. For a fair dice with faces numbered 1 to 6 let $X(i) = i$ (the $i$-th face gets value $i$).
Then
$$\mathrm{Var}[X] = E\left[X^2\right] - (E[X])^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12}.$$
Theorem B.26. If $X_1,\dots,X_n$ are independent random variables then
$$\mathrm{Var}\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n \mathrm{Var}[X_i].$$

The proof is based on Theorem B.23 and the result does not necessarily hold when Xi -s are not
independent. We leave the proof as an exercise.

B.5 The Bernoulli and Binomial Random Variables

The two most basic discrete random variables are the Bernoulli and the binomial.

Definition B.27 (Bernoulli random variable). A random variable $X$ taking values in $\{0,1\}$ is called a
Bernoulli random variable. The parameter $p = P(X=1)$ is called the bias of $X$.
A Bernoulli random variable has the following property (which does not hold in general):
$$E[X] = 0\cdot(1-p) + 1\cdot p = p = P(X=1).$$
Definition B.28 (Binomial random variable). A binomial random variable $Y$ with parameters $n$ and
$p$, denoted by $B(n,p)$, is defined by the following probability distribution on $k\in\{0,1,\dots,n\}$:
$$P(Y=k) = \binom{n}{k}p^k(1-p)^{n-k}.$$
A binomial random variable can be represented as a sum of independent identically distributed Bernoulli
random variables.
Lemma B.29. Let $X_1,\dots,X_n$ be independent Bernoulli random variables with bias $p$. Then $Y = \sum_{i=1}^n X_i$ is a binomial random variable with parameters $n$ and $p$.
A proof of this lemma is left as an exercise to the reader.
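As a quick sanity check (illustrative only, with arbitrary values of n and p), the following Python snippet compares the empirical distribution of a sum of independent Bernoulli variables with the binomial probability mass function of Definition B.28.

import numpy as np
from math import comb

rng = np.random.default_rng(9)
n, p = 10, 0.3
samples = rng.binomial(1, p, size=(200_000, n)).sum(axis=1)    # sums of n Bernoulli(p) variables
k = 4
print((samples == k).mean(), comb(n, k) * p**k * (1 - p)**(n - k))   # empirical vs. exact P(Y = k)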

B.6 Jensen’s Inequality


Jensen’s inequality is one of the most basic in probability theory.
Theorem B.30 (Jensen’s inequality). If f is a convex function and X is a random variable, then

E [f (X)] ≥ f (E [X]) .

For a proof see, for example, Mitzenmacher and Upfal (2005) or Cover and Thomas (2006).

Appendix C

Linear Algebra

We revisit a number of basic concepts from linear algebra. This is only a brief revision of the main
concepts that we are using in the lecture notes. For more details, please, refer to Strang (2009) or some
other textbook on linear algebra.
We start by recalling that two vectors $u$ and $v$ are perpendicular, $u \perp v$, if and only if their inner
product $u^T v = 0$.

Matrix A matrix X ∈ Rn×d takes vectors in Rd and maps them into Rn . There are two fundamental
subspaces associated with a matrix X. The image of X, denoted Im(X) ⊆ Rn , is the space of all vectors
v ∈ Rn that can be obtained through multiplication of X with a vector w. The image Im(X) is a linear
subspace of Rn and it is also called a column space of X. The second subspace is the nullspace of X,
denoted N ull(X) ⊆ Rd , which is the space of all vectors w for which Xw = 0. The nullspace is a linear
subspace of Rd . The subspaces are illustrated in Figure C.1.

Matrix transpose Matrix transpose XT takes vectors in Rn and maps them into Rd . The corre-
sponding subspaces are Im(XT ), the row space of X, and N ull(XT ).

Orthogonality of the fundamental subspaces $\mathrm{Im}(X) \perp \mathrm{Null}(X^T)$ and $\mathrm{Im}(X^T) \perp \mathrm{Null}(X)$

There is an important and extremely beautiful relation between the four fundamental subspaces associated with a matrix $X$ and its transpose. Namely, the image of $X$ is orthogonal to the nullspace of
$X^T$ and the image of $X^T$ is orthogonal to the nullspace of $X$. It means that if we take any two vectors
$u \in \mathrm{Im}(X)$ and $v \in \mathrm{Null}(X^T)$ then $u^T v = 0$ (and the same for the second pair of subspaces). The proof
of this fact is short and elegant. Any vector $u \in \mathrm{Im}(X)$ can be represented as a linear combination of
the columns of $X$, meaning that $u = Xz$. At the same time, by the definition of a nullspace, if $v \in \mathrm{Null}(X^T)$
then $X^T v = 0$. By putting these two facts together we obtain:
$$u^T v = (Xz)^T v = z^T X^T v = z^T(X^T v) = 0.$$

Complete relation between Im(X), Im(XT ), N ull(X), and N ull(XT ) Not only the pairs Im(X)
with N ull(XT ) and Im(XT ) with N ull(X) are orthogonal, they also complement each other. Let dim(A)
denote dimension of a matrix A. The dimension is equal to the number of independent columns, which
is equal to the number of independent rows (this fact can be shown by bringing A to a diagonal form).
Then we have the following relations:
1. dim(Im(X)) = dim(Im(XT )) = dim(X).
2. dim(N ull(X)) = d − dim(Im(XT )) and dim(N ull(XT )) = n − dim(Im(X)).
3. Im(X) ⊥ N ull(XT ) and Im(XT ) ⊥ N ull(X).
Together these properties mean that a combination of bases for $\mathrm{Im}(X^T)$ and $\mathrm{Null}(X)$ makes a basis for
$\mathbb{R}^d$ and a combination of bases for $\mathrm{Im}(X)$ and $\mathrm{Null}(X^T)$ makes a basis for $\mathbb{R}^n$. It means that any vector
$v \in \mathbb{R}^d$ can be represented as $v = v_\star + v_0$, where $v_\star \in \mathrm{Im}(X^T)$ belongs to the row space of $X$ and
$v_0 \in \mathrm{Null}(X)$ belongs to the nullspace of $X$.

Figure C.1: The four fundamental subspaces of a matrix X. There is a right angle between Im(X)
and N ull(XT ), as well as between Im(XT ) and N ull(X).

The mapping between Im(XT ) and Im(X) is one-to-one and, thus, invertible Every vector
u in the column space comes from one and only one vector in the row space v. The proof of this fact
is also simple. Assume that u = Xv = Xv0 for two vectors v, v0 ∈ Im(XT ). Then X(v − v0 ) = 0 and
the vector v − v0 ∈ N ull(X). But N ull(X) is perpendicular to Im(XT ), which means that v − v0 is
orthogonal to itself and, therefore, must be the zero vector.

$X^TX$ is invertible if and only if $X$ has linearly independent columns $(X^TX)^{-1}$ is a very
important matrix. We show that $X^TX$ is invertible if and only if $X$ has linearly independent columns,
meaning that $\dim(X) = d$. We show this by proving that $X$ and $X^TX$ have the same nullspace. Let
$v \in \mathrm{Null}(X)$; then $Xv = 0$ and, therefore, $X^TXv = 0$ and $v \in \mathrm{Null}(X^TX)$. In the other direction, let
$v \in \mathrm{Null}(X^TX)$. Then $X^TXv = 0$ and we have:
$$\|Xv\|^2 = (Xv)^T(Xv) = v^T X^T X v = v^T(X^T X v) = 0.$$
Since $\|Xv\|^2 = 0$ if and only if $Xv = 0$, we have $v \in \mathrm{Null}(X)$.

$X^TX$ is a $d\times d$ square matrix, therefore $\dim(X^TX) = d - \dim(\mathrm{Null}(X^TX)) = d - \dim(\mathrm{Null}(X))$,
and the matrix $X^TX$ is invertible if and only if the dimension of the nullspace of $X$ is zero, meaning that
$X$ has linearly independent columns. (Note that unless $n = d$, $X$ itself is a rectangular matrix and that
inverses are not defined for rectangular matrices.)

Projection onto a line A line in direction $u$ is described by $\alpha u$ for $\alpha \in \mathbb{R}$. Projection of a vector $v$
onto a vector $u$ means that we are looking for a projection vector $p = \alpha u$, such that the remainder $v - p$
is orthogonal to the projection. So we have:
$$(v - \alpha u)^T \alpha u = 0, \qquad \alpha v^T u = \alpha^2 u^T u, \qquad \alpha = \frac{v^T u}{u^T u} = \frac{u^T v}{u^T u}.$$
Thus, the projection is $p = \alpha u = \frac{u^T v}{u^T u}u$. Note that $\frac{u^T v}{u^T u}$ is a scalar, thus
$$p = \frac{u^T v}{u^T u}u = u\frac{u^T v}{u^T u} = \frac{uu^T}{u^T u}v.$$
The matrix $P = \frac{uu^T}{u^T u}$ is a projection matrix. For any vector $v$ the matrix $P$ projects $v$ onto $u$.

Projection onto a subspace A subspace can be described by the set of linear combinations $Az$, where
the columns of the matrix $A$ span the subspace. Projection of a vector $v$ onto the subspace described by $A$
means that we are looking for a projection $p = Az$, such that the remainder $v - p$ is perpendicular to
the projection. The projection $p = Az$ belongs to the image of $A$, $\mathrm{Im}(A)$. Thus, the remainder must be
in the nullspace of $A^T$, meaning that $A^T(v - p) = 0$. Assuming that the columns of $A$ are independent,
we have:
$$A^T(v - Az) = 0, \qquad A^T v = A^T A z, \qquad z = (A^T A)^{-1}A^T v,$$
where we used the independence of the columns of $A$ in the last step to invert $A^TA$. The projection is
$p = Az = A(A^T A)^{-1}A^T v$ and the projection matrix is $P = A(A^T A)^{-1}A^T$. The projection matrix $P$
maps any vector $v$ onto the space spanned by the columns of $A$, $\mathrm{Im}(A)$. Note how $(A^T A)^{-1}$ plays the
role of $\frac{1}{u^T u}$ in the projection onto a line.
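A small numerical check of the projection formula (using an arbitrary random matrix, purely for illustration):

import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 2))                 # 5 x 2 matrix with (almost surely) independent columns
P = A @ np.linalg.inv(A.T @ A) @ A.T            # projection onto Im(A)
v = rng.standard_normal(5)
p = P @ v
print(np.allclose(P @ P, P))                    # P^2 = P
print(np.allclose(A.T @ (v - p), np.zeros(2)))  # the remainder is orthogonal to Im(A)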

Projection matrices Projection matrices satisfy a number of interesting properties:


1. If P is a projection matrix then P2 = P (the second projection does not change the vector).
2. If P is a projection matrix projecting onto a subspace described by A then I−P is also a projection
matrix. It projects onto a subspace that is perpendicular to the subspace described by A.

Appendix D

Calculus

We revisit some basic concepts from calculus.

D.1 Gradients

Gradients are vectors of partial derivatives. For a vector $x = \begin{pmatrix}x_1\\ \vdots\\ x_d\end{pmatrix}$ and a function $f(x)$ the gradient
of $f$ is defined as
$$\nabla f(x) = \begin{pmatrix}\frac{\partial f}{\partial x_1}\\ \vdots\\ \frac{\partial f}{\partial x_d}\end{pmatrix}.$$

Gradient of a multivariate quadratic function $f(x) = x^T A x$

Let $A$ be a matrix with entries $a_{ij}$. Then
$$f(x) = x^T A x = (x_1,\dots,x_d)\begin{pmatrix}a_{11} & \cdots & a_{1d}\\ \vdots & \ddots & \vdots\\ a_{d1} & \cdots & a_{dd}\end{pmatrix}\begin{pmatrix}x_1\\ \vdots\\ x_d\end{pmatrix} = (x_1,\dots,x_d)\begin{pmatrix}\sum_{j=1}^d a_{1j}x_j\\ \vdots\\ \sum_{j=1}^d a_{dj}x_j\end{pmatrix} = \sum_{i=1}^d\sum_{j=1}^d a_{ij}x_i x_j.$$
The partial derivative $\frac{\partial f}{\partial x_k}$ then becomes:
$$\frac{\partial f}{\partial x_k} = \frac{\partial \sum_{i=1}^d\sum_{j=1}^d a_{ij}x_i x_j}{\partial x_k} = \sum_{j=1}^d a_{kj}x_j + \sum_{i=1}^d a_{ik}x_i,$$
where the first sum corresponds to the first element in the product $x_i x_j$ being $x_k$ and the second sum
corresponds to the second element in the product $x_i x_j$ being $x_k$. Putting all the derivatives together we
obtain:
$$\nabla f(x) = \begin{pmatrix}\sum_{j=1}^d a_{1j}x_j + \sum_{i=1}^d a_{i1}x_i\\ \vdots\\ \sum_{j=1}^d a_{dj}x_j + \sum_{i=1}^d a_{id}x_i\end{pmatrix} = Ax + A^T x = (A + A^T)x.$$
A matrix $A$ is called symmetric if $A^T = A$. For a symmetric matrix we have $\nabla f(x) = 2Ax$ and for a
general matrix we have $\nabla f(x) = (A + A^T)x$. Note the similarity and dissimilarity with the derivative of
a univariate quadratic function $f(x) = ax^2$, which is $f'(x) = 2ax$.
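A small numerical check of this formula (with an arbitrary random matrix and point, purely for illustration), comparing the analytic gradient $(A + A^T)x$ with a central finite-difference approximation:

import numpy as np

rng = np.random.default_rng(8)
d = 4
A = rng.standard_normal((d, d))
x = rng.standard_normal(d)
f = lambda y: y @ A @ y                      # f(y) = y^T A y

grad_analytic = (A + A.T) @ x
eps = 1e-6
grad_numeric = np.array([
    (f(x + eps * np.eye(d)[i]) - f(x - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))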

Gradient of a linear function $f(x) = b^T x$

Let $b = \begin{pmatrix}b_1\\ \vdots\\ b_d\end{pmatrix}$ be a vector and let $f(x) = b^T x = \sum_{i=1}^d b_i x_i$. We leave it as an exercise to prove that
the gradient is $\nabla f(x) = \begin{pmatrix}\frac{\partial f}{\partial x_1}\\ \vdots\\ \frac{\partial f}{\partial x_d}\end{pmatrix} = b$.

Bibliography

Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from data. AMLbook, 2012.
Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from data. Dynamic E-
Chapters. AMLbook, 2015.
Alekh Agarwal, Miroslav Dudı́k, Satyen Kale, John Langford, and Robert E. Schapire. Contextual
bandit learning with predictable rewards. In Proceedings on the International Conference on Artificial
Intelligence and Statistics (AISTATS), 2012.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit
problem. Machine Learning, 47, 2002a.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed
bandit problem. SIAM Journal of Computing, 32(1), 2002b.
Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities A Nonasymptotic
Theory of Independence. Oxford University Press, 2013.
Sébastien Bubeck. Bandits Games and Clustering Foundations. PhD thesis, Université Lille, 2010.
Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed
bandit problems. Foundations and Trends in Machine Learning, 5, 2012.
Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press,
2006.
Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley Series in Telecommuni-
cations and Signal Processing, 2nd edition, 2006.
Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. PAC-Bayesian learning
of linear classifiers. In Proceedings of the International Conference on Machine Learning (ICML),
2009.
Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand, and Jean-Francis Roy. Risk
bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. Journal of
Machine Learning Research, 16, 2015.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American
Statistical Association, 58(301):13–30, 1963.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in
Applied Mathematics, 6, 1985.
John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning
Research, 6, 2005.
Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Com-
putation, 108, 1994.
Katalin Marton. A measure concentration inequality for contracting Markov chains. Geometric and
Functional Analysis, 6(3), 1996.

Katalin Marton. A measure concentration inequality for contracting Markov chains Erratum. Geometric
and Functional Analysis, 7(3), 1997.
Andrés R. Masegosa, Stephan S. Lorenzen, Christian Igel, and Yevgeny Seldin. Second order PAC-
Bayesian bounds for the weighted majority vote. Technical report, https://arxiv.org/abs/2007.13532, 2020.
Andreas Maurer. A note on the PAC-Bayesian theorem. www.arxiv.org, 2004.
David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51, 2003.

Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Proba-
bilistic Analysis. Cambridge University Press, 2005.
Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American
Mathematical Society, 1952.
Paul-Marie Samson. Concentration of measure inequalities for markov chains and φ-mixing processes.
The Annals of Probability, 28(1), 2000.
Matthias Seeger. PAC-Bayesian generalization error bounds for Gaussian process classification. Journal
of Machine Learning Research, 3, 2002.
Yevgeny Seldin. The space of online learning problems. ECML-PKDD Tutorial. https://sites.google.com/site/spaceofonlinelearningproblems/, 2015.
Yevgeny Seldin and Gábor Lugosi. A lower bound for multi-armed bandits with expert advice. In
Proceedings of the European Workshop on Reinforcement Learning (EWRL), 2016.
Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-
Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58, 2012.

Gilles Stoltz. Incomplete Information and Internal Regret in Prediction of Individual Sequences. PhD
thesis, Université Paris-Sud, 2005.
Gilbert Strang. Introduction to linear algebra. Wellesley-Cambridge Press, 4th edition, 2009.
Niklas Thiemann, Christian Igel, Olivier Wintenberger, and Yevgeny Seldin. A strongly quasiconvex
PAC-Bayesian bound. In Proceedings of the International Conference on Algorithmic Learning Theory
(ALT), 2017.
Ilya Tolstikhin and Yevgeny Seldin. PAC-Bayes-Empirical-Bernstein inequality. In Advances in Neural
Information Processing Systems (NIPS), 2013.

Vladimir Vovk. Aggregating strategies. In Proceedings of the Conference on Learning Theory (COLT),
1990.
