Machine Learning - The Science of Selection under Uncertainty
Yevgeny Seldin
The material was developed in the process of teaching the following courses:
I would like to thank all students who have pointed out typos and flaws in the lecture notes. There are certainly more and if you spot any, please report them to me at [email protected]. Your feedback will serve future generations of students.
Contents
1 Supervised Learning
  1.1 The Supervised Learning Setting
    1.1.1 Classification, Regression, and Other Supervised Learning Problems
    1.1.2 The Loss Function ℓ(Y′, Y)
  1.2 K Nearest Neighbors for Binary Classification
    1.2.1 How to Pick K in K-NN?
  1.3 Validation
    1.3.1 Test Set: It's not about what you call it, it's about how you use it!
    1.3.2 Cross-Validation
  1.4 Perceptron - Basic Algorithm for Linear Classification

  3.9.5 Second Order PAC-Bayesian Bounds for the Weighted Majority Vote
  3.9.6 Ensemble Construction
  3.9.7 Comparison of the Empirical Bounds

5 Online Learning
  5.1 The Space of Online Learning Problems
  5.2 A General Basic Setup
  5.3 I.I.D. (stochastic) Multiarmed Bandits
  5.4 Prediction with Expert Advice
    5.4.1 Lower Bound
  5.5 Adversarial Multiarmed Bandits
    5.5.1 Lower Bound
  5.6 Adversarial Multiarmed Bandits with Expert Advice
    5.6.1 Lower Bound

C Linear Algebra
D Calculus
  D.1 Gradients
Chapter 1
Supervised Learning
The most basic and widespread form of machine learning is supervised learning. In the classical batch
supervised learning setting the learner is given an annotated sample, which is used to derive a prediction
rule for annotating new samples. We start with a simple informal example and then formalize the
problem.
Let’s say that we want to build a prediction rule that will use the average grade of a student in
home assignments, say on a 100-points scale, to predict whether the student will pass the final exam.
Such a prediction rule could be used for preliminary filtering of students to be allowed to take the final
exam. The annotated sample could be a set of average grades of students from the previous year with
indications of whether they have passed the final exam. The prediction rule could take the form of a threshold grade (a.k.a. decision stump), above which the student is expected to pass and below which to fail.
Now assume that we want to take a more refined approach and look into individual grades in each
assignment, say, 5 assignments in total. For example, different assignments may have different relevance
for the final exam or, maybe, some students may demonstrate progression throughout the course, which
would mean that their early assignments should not be weighted equally with the later ones. In the refined
approach each student can be represented by a point in a 5-dimensional space. The one-dimensional
threshold could be replaced by a separating hyperplane, which separates the 5-dimensional space of
grades into a linear subspace, where most students are likely to pass, and the complement, where they
are likely to fail. An alternative approach is to look at “nearest neighbors” of a student in the space
of grades. Given a grade profile of a student (the point representing the student in the 5-dimensional
space) we look at students with the closest grade profile and see whether most of them passed or failed.
This is known as the K Nearest Neighbors algorithm, where K is the number of neighbors we look at.
But how many neighbors K should we look at? Considering the extremes gives some intuition about
the problem. Taking just one nearest neighbor may be unreliable. For example, we could have a good student who accidentally failed the final exam, and then all of that student's neighbors will be marked as “expected to fail”. Going to the other extreme and taking in all the students as neighbors is also undesirable, because
effectively it will ignore the individual profile altogether. So a good value of K should be somewhere
between 1 and n, where n is the size of the annotated set. But how to find it? Well, read on and you
will learn how to approach this question formally.
• h : X → Y - a hypothesis, which is a function from X to Y.
• H - a hypothesis set.
• ℓ(Y′, Y) - the loss function for predicting Y′ instead of Y.
• L̂(h, S) = (1/n) Σ_{i=1}^n ℓ(h(X_i), Y_i) - the empirical loss (a.k.a. error or risk) of h on S. (In many textbooks S is omitted from the notation and L̂(h) or L̂_n(h) is used to denote L̂(h, S).)
• L(h) = E[ℓ(h(X), Y)] - the expected loss (a.k.a. error or risk) of h, where the expectation is taken with respect to p(X, Y).
Classification A supervised learning problem is a classification problem when the output (label) space
Y is binary. The goal of the learning algorithm is to separate between two classes: yes or no; good or
bad; healthy or sick; male or female; etc. Most often the translation of the binary label into numerical
representation is done by either taking Y = {±1} or Y = {0, 1}. Sometimes the setting is called binary
classification to emphasize that Y takes just two values.
Regression A supervised learning problem is a regression problem when the output space Y = R. For
example, prediction of a person's height would be a regression problem.
Multiclass Classification When Y consists of a finite and typically unordered and relatively small set
of values, the corresponding supervised learning problem is called multiclass classification. For example,
prediction of a study program a student will apply for based on his or her grades would be a multiclass
classification problem. Finite ordered output spaces, for example, prediction of age or age group can
also be modeled as multiclass classification, but it may be possible to exploit the structure of Y to
obtain better solutions. For example, it may be possible to exploit the fact that ages 22 and 23 are
close together, whereas 22 and 70 are far apart; therefore, it may be possible to share some information
between close ages, as well as exploit the fact that predicting 22 instead of 23 is not such a big mistake
as predicting 22 instead of 70. Depending on the setting, it may be preferable to model prediction of
ordered sets as regression rather than multiclass classification.
Structured Prediction Consider the problem of machine translation. An algorithm gets a sentence
in English as an input and should produce a sentence in Danish as an output. In this case the output
(the sentence in Danish) is not merely a number, but a structured object and such prediction problems
are known as structured prediction.
ℓ(Y′, Y) = (Y′ − Y)²

ℓ(Y′, Y)          Y = no fire     Y = fire
Y′ = no fire           0          3,000,000
Y′ = fire            2,000            0
The loss for making the correct prediction is zero, but the loss of false positive (predicting fire when in
reality there is no fire) and false negative (predicting no fire when the reality is fire) are not symmetric
anymore.
Note that the loss depends on how the predictions are used and the loss table depends on
the user. For example, if the same alarm is installed in a house that is worth 10,000,000 DKK, the ratio
between the cost of false positives and false negatives will be very different and, as a result, the optimal
prediction strategy will not necessarily be the same.
Algorithm 1 K Nearest Neighbors (K-NN) for Binary Classification with Y = {±1}
1: Input: A set of labeled points {(x1 , y1 ), . . . , (xn , yn )} and a target point x that has to be classified.
2: Calculate the distances di = d(xi , x).
3: Sort the d_i-s in ascending order and let σ : {1, . . . , n} → {1, . . . , n} be the corresponding permutation of indices. In other words, for any pair of indices i < j we should have d_{σ(i)} ≤ d_{σ(j)}.
4: The output of K-NN is y = sign(Σ_{i=1}^K y_{σ(i)}). It is the majority vote of the K points that are the closest to x. Note that we can calculate the output of K-NN for all K in one shot.
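A minimal Python sketch of the procedure above (the function name is illustrative and the Euclidean distance is an assumed choice of d; any other distance can be plugged in):

import numpy as np

def knn_predict(X_train, y_train, x, K):
    # Distances d_i = d(x_i, x); here the Euclidean distance is used
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the K closest points (the first K entries of the permutation sigma)
    nearest = np.argsort(dists)[:K]
    # Majority vote among the K nearest neighbors; a tie is resolved towards +1 here
    return 1 if np.sum(y_train[nearest]) >= 0 else -1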
1.3 Validation
Whenever we select a hypothesis ĥ∗S out of a hypothesis set H based on empirical performances L̂(h, S),
the empirical performance L̂(ĥ∗S , S) becomes a biased estimate of L(ĥ∗S ). This is clearly observed in
1-NN, where L̂(h1-NN , S) = 0, but L(h1-NN ) is most often not zero (we remind that the hypothesis space
in 1-NN is the space of all possible partitions of the sample space X and h1-NN is the hypothesis that
achieves the minimal empirical error in this space). The reason is that when we do the selection we pick
ĥ∗S that is best suited for S (it achieves the minimal L̂(h, S) out of all h in H). Therefore, from the
perspective of ĥ∗S the new samples (X, Y ) are not “similar” to the samples (Xi , Yi ) in S. A bit more
precisely, (X, Y ) is not exchangeable with (Xi , Yi ), because if we would exchange (Xi , Yi ) with (X, Y ) it
is likely that ĥ∗S , the hypothesis that minimizes L̂(h, S), would be different. Again, this is very clear in
1-NN: if we change one sample (Xi , Yi ) in S we get a different prediction rule h1-NN . We get back to this
topic in much more detail in Chapter 3 after we develop some mathematical tools for analyzing the bias
in Chapter 2. For now we present a simple solution for estimating L(ĥ∗S ) and motivate why we need the
tools from Chapter 2.
The solution is to split the sample set S into training set Strain and validation set Sval . We can then
find the best hypothesis for the training set, h∗Strain , and validate it on the validation set by computing
L̂(h∗Strain , Sval ). Note that from the perspective of h∗Strain the samples in Sval are exchangeable with any
new samples (X, Y). If we exchange (Xi, Yi) ∈ Sval with another sample (X, Y) coming from the same distribution, h∗Strain will stay the same and in expectation E[ℓ(h∗Strain(Xi), Yi)] = E[ℓ(h∗Strain(X), Y)],
meaning that on average L̂(h∗Strain , Sval ) will also stay the same (only on average, the exact value may
change). Therefore, L̂(h∗Strain , Sval ) is an unbiased estimate of L(h∗Strain ). (We get back to this point in
much more detail in Chapter 3.)
1 This is because the closest point in S to a sample point xi is xi itself, and we assume that S includes no identical points with dissimilar labels, which is a reasonable assumption if X = Rd.
Now we get to the question of how to split S into Strain and Sval , and again it is very instructive to
consider the extreme cases. Imagine that we keep a single sample for validation and use the remaining
n − 1 samples for training. Let’s say that we keep the last sample, (Xn , Yn ), for validation, then
L̂(h∗Strain, Sval) = ℓ(h∗Strain(Xn), Yn) and in the case of zero-one loss it is either zero or one. Even though
L̂(h∗Strain , Sval ) is an unbiased estimate of L(h∗Strain ), it clearly does not represent it well. At the other
extreme, if we keep n − 1 points for validation and use the single remaining point for training we run into
a different kind of problem: a classifier trained on a single point is going to be extremely weak. Let’s
say that we have used the first point, (X1 , Y1 ), for training. In the case of K-NN classifier, as well as
most other classifiers, h∗Strain will always predict Y1 , no matter what input it gets. The validation error
L̂(h∗Strain , Sval ) will be a very good estimate of L(h∗Strain ), but this is definitely not a classifier we want.
So how many samples from S should go into Strain and how many into Sval ? Currently there is no
“gold answer” to this question, but in Chapters 2 and 3 we develop mathematical tools for intelligent
reasoning about it. An important observation to make is that for h independent of (X, Y ) the zero-one
loss ℓ(h(X), Y) is a Bernoulli random variable with bias P(ℓ(h(X), Y) = 1) = L(h). Furthermore, when h is independent of a set of samples {(X1, Y1), . . . , (Xm, Ym)} (i.e., these samples are not used for selecting h), the losses ℓ(h(Xi), Yi) are independent identically distributed (i.i.d.) Bernoulli random variables with
bias L(h). Therefore, when Sval is of size m, the validation loss L̂(h∗Strain , Sval ) is an average of m i.i.d.
Bernoulli random variables with bias L(h∗Strain ). The validation loss L̂(h∗Strain , Sval ) is observed, but the
expected loss that we are actually interested in is unobserved. One of the key questions that we are
interested in is how far L̂(h∗Strain , Sval ) can be from L(h∗Strain ). We have already seen that m = 1 is
too little. But how large should it be, 10, 100, 1000? Essentially this question is equivalent to asking
how many times do we need to flip a biased coin in order to get a satisfactory estimate of its bias. In
Chapter 2 we develop concentration of measure inequalities that answer this question.
Another technical question is which samples should go into Strain and which into Sval ? From the
theoretical perspective we assume that S is sampled i.i.d. and, therefore, it does not matter. We can
take the first n − m samples into Strain and the last m into Sval or split in any other way. From a
practical perspective the samples may actually not be i.i.d. and there could be some parameter that has
influenced their order in S. For example, they could have been ordered alphabetically. Therefore, from a
practical perspective it is desirable to take a random permutation of S before splitting, unless the order
carries some information we would like to preserve. For example, if S is a time-ordered series of product
reviews and we would like to build a classifier that classifies them into positive and negative, we may
want to get an estimate of temporal variation and keep the order when we do the split, i.e., train on the
earlier samples and validate on the later.
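A minimal sketch of such a split in Python (NumPy arrays are assumed; the random permutation is the practical safeguard discussed above and can be switched off when the order carries information we want to preserve):

import numpy as np

def train_val_split(X, y, n_val, shuffle=True, seed=0):
    # Split (X, y) into a training part and a validation part with n_val samples
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n) if shuffle else np.arange(n)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]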
1.3.1 Test Set: It's not about what you call it, it's about how you use it!
Assume that we have split S into Strain and Sval ; we have trained h1-NN , . . . , hn-NN on Strain ; we calculated
L̂(h1-NN , Sval ), . . . , L̂(hn-NN , Sval ) and picked the value K ∗ that minimizes L̂(hK-NN , Sval ). Is L̂(hK∗ -NN , Sval )
an unbiased estimate of L(hK∗ -NN )?
This is probably one of the most conceptually difficult points about validation, at least when you
encounter it for the first time. While for each hK-NN individually L̂(hK-NN , Sval ) is an unbiased estimate
of L(hK-NN ), the validation loss L̂(hK∗ -NN , Sval ) is a biased estimate of L(hK∗ -NN ). This is because Sval was
used for selection of K ∗ and, therefore, hK∗ -NN depends on Sval . So if we want to get an unbiased estimate
of L(hK∗ -NN ) we have to reserve some “fresh” data for that. So we need to split S into Strain , Sval , and
Stest ; train the K-NN classifiers on Strain ; pick the best K ∗ based on L̂(hK-NN , Sval ); and then compute
L̂(hK∗ -NN , Stest ) to get an unbiased estimate of L(hK∗ -NN ).
It's not about what you call it, it's about how you use it! Some people think that if you call
some data a test set it automatically makes loss estimates on this set unbiased. This is not true. Imagine
that you have split S into Strain , Sval , and Stest ; you trained K-NN on Strain , picked the best value K ∗
using Sval , and estimated the loss of hK∗ -NN on Stest . And now you are unhappy with the result and you
want to try a different learning method, say a neural network. You go through the same steps: you train
networks with various parameter settings on Strain , you validate them on Sval , and you pick the best
parameter set θ∗ based on the validation loss. Finally, you compute the test loss of the neural network
parametrized by θ∗ on Stest . It happens to be lower than the test loss of K ∗ -NN and you decide to go
with the neural network. Does the empirical loss of the neural network on Stest represent an unbiased
estimate of its expected loss? No! Why? Because our choice to pick the neural network was based on
its superior performance relative to hK∗ -NN on Stest , so Stest was used in selection of the neural network.
Therefore, there is dependence between Stest and the hypothesis we have selected, and the loss on Stest
is biased. If we want to get an unbiased estimate of the loss we have to find new “fresh” data or reserve
such data from the start and keep it in a locker until the final evaluation moment. Alternatively, we
can correct for the bias and in Chapter 3 we will learn some tools for making the correction. The main
take-home message is: It is not about what you call a data set, Strain, Sval, or Stest, it is the way you use it that determines whether you get unbiased estimates or not! In some cases it
is possible to get unbiased estimates or to correct for the bias already with Strain , and sometimes there
is bias even on Stest and we need to correct for that.
1.3.2 Cross-Validation
Sometimes it feels wasteful to use only part of the data for training and part for validation. A heuristic
way around it is cross-validation. In the standard N -fold cross-validation setup the data S are split into
N non-overlapping folds S1 , . . . , SN . Then for i ∈ {1, . . . , N } we train on all folds except the i-th and
validate on Si . We then take the average of the N validation errors and pick the parameter that achieves
the minimum (for example, the best K in K-NN). Finally, we train a model with the best parameter we
have selected in the cross-validation procedure (for example the best K ∗ in K-NN) using all the data S.
The standard cross-validation procedure described above is a heuristic and has no theoretical guar-
antees. It is fairly robust and widely used in practice, but it is possible to construct examples, where
it fails. In Chapter 3 we describe a modification of the cross-validation procedure, which comes with
theoretical generalization guarantees and is empirically competitive with the standard cross-validation
procedure.
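A sketch of the procedure in Python for selecting K in K-NN; knn_predict is the hypothetical helper sketched in Section 1.2, and scikit-learn provides equivalent utilities, but the explicit loop shows what happens in each fold:

import numpy as np

def cross_validate_K(X, y, K_values, N=5, seed=0):
    # Split the indices into N non-overlapping folds (after a random permutation)
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), N)
    avg_err = {}
    for K in K_values:
        errs = []
        for i in range(N):
            # Train on all folds except the i-th and validate on the i-th
            train_idx = np.concatenate([folds[j] for j in range(N) if j != i])
            val_idx = folds[i]
            preds = np.array([knn_predict(X[train_idx], y[train_idx], x, K) for x in X[val_idx]])
            errs.append(np.mean(preds != y[val_idx]))
        avg_err[K] = np.mean(errs)
    # Return the K with the smallest average validation error; it is then retrained on all of S
    return min(avg_err, key=avg_err.get)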
Hypothesis space The hypothesis space in linear classification is the space of all possible separating
hyperplanes. If we are talking about homogeneous linear classifiers then it is restricted to hyperplanes
passing through the origin. Thus, for homogeneous linear classifiers H = Rd and for general linear
classifiers H = Rd+1 .
Perceptron algorithm Perceptron is the simplest algorithm for learning homogeneous separating
hyperplanes. It operates under the assumption that the data are separable by a homogeneous
hyperplane, meaning that there exists a hyperplane passing through the origin that perfectly separates
positive points from negative.
Algorithm 2 Perceptron
1: Input: A training set {(x1 , y1 ), . . . , (xn , yn )}
2: Initialization: w1 = 0 (where 0 is the zero vector)
3: t=1
4: while exists (xit , yit ), such that yit (wtT xit ) ≤ 0 do
5: wt+1 = wt + yit xit
6: t=t+1
7: end while
8: Return: wt
Note that a point (x, y) is classified correctly if ywT x > 0 and misclassified if ywT x ≤ 0. Thus, the
selection step (line 4 in the pseudocode) picks a misclassified point, as long as there exists such. The
update step (line 5 in the pseudocode) rotates the hyperplane w, so that the classification is “improved”.
Specifically, the following property is satisfied: if (xit, yit) is the point selected at step t then yit(wt+1^T xit) > yit(wt^T xit) (verification of this property is left as an exercise to the reader). Note this property does not
guarantee that after the update wt+1 will classify (xit , yit ) correctly. But it will rotate in the right
direction and after sufficiently many updates (xit , yit ) will end up on the right side of the hyperplane.
Also note that while the classification of (xit , yit ) is improved, it may go the opposite way for other
points. As long as the data are linearly separable, the algorithm will eventually find the separation.
The algorithm does not specify the order in which misclassified points are selected. Two natural
choices are sequential and random. We leave it as an exercise to the reader to check which of the two
choices leads to faster convergence of the algorithm.
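A compact NumPy version of Algorithm 2; it assumes the data are linearly separable by a homogeneous hyperplane (otherwise the while-loop of the algorithm never terminates, so an update budget is added as a safeguard):

import numpy as np

def perceptron(X, y, max_updates=100000):
    w = np.zeros(X.shape[1])                 # w_1 = 0
    for _ in range(max_updates):
        margins = y * (X @ w)
        misclassified = np.where(margins <= 0)[0]
        if misclassified.size == 0:          # every point satisfies y_i (w^T x_i) > 0
            return w
        i = misclassified[0]                 # sequential choice; a random choice also works
        w = w + y[i] * X[i]                  # the update step of Algorithm 2
    raise RuntimeError("update budget exceeded; is the data linearly separable?")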
Chapter 2
Concentration of Measure
Inequalities
Concentration of measure inequalities are one of the main tools for analyzing learning algorithms. This
chapter is devoted to a number of concentration of measure inequalities that form the basis for the results
discussed in later chapters.
Markov's inequality states that for any non-negative random variable X and any ε > 0:
P(X ≥ ε) ≤ E[X]/ε.
Proof. Define a random variable Y = 1(X ≥ ε) to be the indicator function of whether X exceeds ε. Then Y ≤ X/ε (see Figure 2.1). Since Y is a Bernoulli random variable, E[Y] = P(Y = 1) (see Appendix B). We have:
P(X ≥ ε) = P(Y = 1) = E[Y] ≤ E[X/ε] = E[X]/ε.
Check yourself: where in the proof do we use the non-negativity of X and the strict positivity of ε?
Figure 2.1: Relation between the identity function and the indicator function.
By denoting the right hand side of Markov's inequality by δ we obtain the following equivalent statement. For any non-negative random variable X:
P(X ≥ (1/δ) E[X]) ≤ δ.
Example. We would like to bound the probability that we flip a fair coin 10 times and obtain 8 or more heads. Let X_1, . . . , X_10 be i.i.d. Bernoulli random variables with bias 1/2. The question is equivalent to asking what is the probability that Σ_{i=1}^{10} X_i ≥ 8. We have E[Σ_{i=1}^{10} X_i] = 5 (the reader is invited to prove this statement formally) and by Markov's inequality
P(Σ_{i=1}^{10} X_i ≥ 8) ≤ E[Σ_{i=1}^{10} X_i]/8 = 5/8.
We note that even though Markov’s inequality is weak, there are situations in which it is tight. We
invite the reader to construct an example of a random variable for which Markov’s inequality is tight.
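A quick numeric check of the example above, comparing the Markov bound 5/8 with the exact probability of 8 or more heads in 10 fair coin flips (computed directly from the binomial distribution):

from math import comb

n, p = 10, 0.5
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(8, n + 1))
print(f"exact: {exact:.4f}, Markov bound: {5/8:.4f}")  # exact is about 0.0547, so the bound holds but is loose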
Chebyshev's inequality states that for any random variable X and any ε > 0:
P(|X − E[X]| ≥ ε) ≤ Var[X]/ε².
Proof. The proof uses a transformation of a random variable. We have that P(|X − E[X]| ≥ ε) = P((X − E[X])² ≥ ε²), because the first statement holds if and only if the second holds. In addition, using Markov's inequality and the fact that (X − E[X])² is a non-negative random variable, we have
P(|X − E[X]| ≥ ε) = P((X − E[X])² ≥ ε²) ≤ E[(X − E[X])²]/ε² = Var[X]/ε².
Check yourself: where in the proof did we use the positivity of ε?
In order to illustrate the relative advantage of Chebyshev's inequality compared to Markov's, consider the following example. Let X_1, . . . , X_n be n independent identically distributed Bernoulli random variables and let µ̂_n = (1/n) Σ_{i=1}^n X_i be their average. We would like to bound the probability that µ̂_n deviates from E[µ̂_n] by more than ε (this is the central question in machine learning). We have E[µ̂_n] = E[X_1] = µ and by independence of the X_i-s and Theorem B.26 we have Var[µ̂_n] = (1/n²) Var[n µ̂_n] = (1/n²) Σ_{i=1}^n Var[X_i] = (1/n) Var[X_1]. By Markov's inequality
P(µ̂_n − E[µ̂_n] ≥ ε) = P(µ̂_n ≥ E[µ̂_n] + ε) ≤ E[µ̂_n]/(E[µ̂_n] + ε) = E[X_1]/(E[X_1] + ε).
Note that as n grows the inequality stays the same. By Chebyshev's inequality we have
P(|µ̂_n − E[µ̂_n]| ≥ ε) ≤ Var[µ̂_n]/ε² = Var[X_1]/(nε²),
so the bound decreases at the rate 1/n as the sample size grows.
2.3 Hoeffding’s Inequality
Hoeffding’s inequality is a much more powerful concentration result.
Theorem 2.3 (Hoeffding’s Inequality). Let X1 , . . . , Xn be independent real-valued random variables,
such that for each i ∈ {1, . . . , n} there exist ai ≤ bi , such that Xi ∈ [ai , bi ]. Then for every ε > 0:
P(Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i] ≥ ε) ≤ e^{−2ε²/Σ_{i=1}^n (b_i − a_i)²}   (2.1)
and
P(Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i] ≤ −ε) ≤ e^{−2ε²/Σ_{i=1}^n (b_i − a_i)²}.   (2.2)
By taking a union bound of the events in (2.1) and (2.2) we obtain the following corollary.
Corollary 2.4. Under the assumptions of Theorem 2.3:
P(|Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i]| ≥ ε) ≤ 2e^{−2ε²/Σ_{i=1}^n (b_i − a_i)²}.   (2.3)
Equations (2.1) and (2.2) are known as “one-sided Hoeffding’s inequalities” and (2.3) is known as
“two-sided Hoeffding’s inequality”.
If we assume that Xi -s are identically distributed and belong to the [0, 1] interval we obtain the
following corollary.
Corollary 2.5. Let X1 , . . . , Xn be independent random variables, such that Xi ∈ [0, 1] and E [Xi ] = µ
for all i, then for every ε > 0:
P((1/n) Σ_{i=1}^n X_i − µ ≥ ε) ≤ e^{−2nε²}   (2.4)
and
P(µ − (1/n) Σ_{i=1}^n X_i ≥ ε) ≤ e^{−2nε²}.   (2.5)
Recall that by Chebyshev's inequality µ̂_n = (1/n) Σ_{i=1}^n X_i converges to µ at the rate of n^{−1}. Hoeffding's inequality demonstrates that the convergence is actually much faster, at least at the rate of e^{−n}.
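A small numeric illustration of the difference in rates, evaluating both bounds on the deviation of the average of n Bernoulli(1/2) variables (so Var[X_1] = 1/4) from its mean by ε = 0.1:

import numpy as np

eps, var = 0.1, 0.25
for n in [10, 100, 1000, 10000]:
    chebyshev = var / (n * eps**2)              # P(|mu_hat - mu| >= eps) <= Var[X_1] / (n * eps^2)
    hoeffding = 2 * np.exp(-2 * n * eps**2)     # two-sided Hoeffding bound (2.3)
    print(f"n={n:6d}  Chebyshev: {chebyshev:.3e}  Hoeffding: {hoeffding:.3e}")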
The proof of Hoeffding’s inequality is based on Hoeffding’s lemma.
Lemma 2.6 (Hoeffding's Lemma). Let X be a random variable, such that X ∈ [a, b]. Then for any λ ∈ R:
E[e^{λX}] ≤ e^{λE[X] + λ²(b−a)²/8}.
The function f(λ) = E[e^{λX}] is known as the moment generating function of X, since f′(0) = E[X], f″(0) = E[X²], and, more generally, f^{(k)}(0) = E[X^k]. We provide the proof of the lemma immediately after the proof of the theorem.
Proof of Theorem 2.3. We prove the first inequality in Theorem 2.3. The second inequality follows by
applying the first inequality to −X1 , . . . , −Xn . The proof is based on Chernoff’s bounding technique.
For any λ > 0 the following holds:
P(Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i] ≥ ε) = P(e^{λ(Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i])} ≥ e^{λε}) ≤ E[e^{λ(Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i])}] / e^{λε},
where the first step holds since e^{λx} is a monotonically increasing function of x for λ > 0 and the second step holds by Markov's inequality. We now take a closer look at the numerator:
E[e^{λ(Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i])}] = E[e^{Σ_{i=1}^n λ(X_i − E[X_i])}]
= E[Π_{i=1}^n e^{λ(X_i − E[X_i])}]
= Π_{i=1}^n E[e^{λ(X_i − E[X_i])}]   (2.6)
≤ Π_{i=1}^n e^{λ²(b_i − a_i)²/8}   (2.7)
= e^{(λ²/8) Σ_{i=1}^n (b_i − a_i)²},
where (2.6) holds since X_1, . . . , X_n are independent and (2.7) holds by Hoeffding's lemma applied to a random variable Z_i = X_i − E[X_i] (note that E[Z_i] = 0 and that Z_i ∈ [a_i − µ_i, b_i − µ_i] for µ_i = E[X_i]). Note the crucial role that independence of X_1, . . . , X_n plays in the proof! Without independence we would not have been able to exchange the expectation with the product and the proof would break down! To complete the proof we substitute the bound on the expectation into the previous calculation and obtain:
P(Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i] ≥ ε) ≤ e^{(λ²/8)(Σ_{i=1}^n (b_i − a_i)²) − λε}.
The right hand side is minimized by λ* = 4ε / Σ_{i=1}^n (b_i − a_i)². It is important to note that the best choice of λ does not depend on the sample. In particular, it allows us to fix λ before observing the sample. By substituting λ* into the calculation we obtain the result of the theorem.
Proof of Lemma 2.6. Note that
E[e^{λX}] = E[e^{λ(X − E[X]) + λE[X]}] = e^{λE[X]} · E[e^{λ(X − E[X])}].
Hence, it is sufficient to show that for any random variable Z with E[Z] = 0 and Z ∈ [a, b] we have:
E[e^{λZ}] ≤ e^{λ²(b−a)²/8}.
By convexity of the exponential function, for z ∈ [a, b] we can write e^{λz} ≤ ((b − z)/(b − a)) e^{λa} + ((z − a)/(b − a)) e^{λb}. Taking expectations and using the fact that E[Z] = 0 gives
E[e^{λZ}] ≤ (b/(b − a)) e^{λa} − (a/(b − a)) e^{λb} = e^{φ(u)},
where u = λ(b − a), p = −a/(b − a), and φ(u) = −pu + ln(1 − p + pe^u). It is easy to verify that the derivative of φ is
φ′(u) = −p + p/(p + (1 − p)e^{−u})
and, therefore, φ(0) = φ′(0) = 0. Furthermore,
φ″(u) = p(1 − p)e^{−u} / (p + (1 − p)e^{−u})² ≤ 1/4.
By Taylor's theorem, φ(u) = φ(0) + uφ′(0) + (u²/2)φ″(θ) for some θ ∈ [0, u]. Thus, we have:
φ(u) = φ(0) + uφ′(0) + (u²/2)φ″(θ) = (u²/2)φ″(θ) ≤ u²/8 = λ²(b − a)²/8.
(Note that the ln 2 factor in the last inequality comes from the union bound over the first two inequalities: if we want to keep the same confidence we have to compromise on precision.)
In many situations we are interested in the complementary events. Thus, for example, we have
P(µ − (1/n) Σ_{i=1}^n X_i ≤ √(ln(1/δ)/(2n))) ≥ 1 − δ.
A careful reader may point out that the inequalities above should be strict (“<” and “>”). This is true,
but if it holds for strict inequalities it also holds for non-strict inequalities (“≤” and “≥”). Since strict
inequalities provide no practical advantage we will use the non-strict inequalities to avoid the headache
of remembering which inequalities should be strict and which should not.
The last inequality essentially says that with probability at least 1 − δ we have
µ ≤ (1/n) Σ_{i=1}^n X_i + √(ln(1/δ)/(2n))
and this is how we will occasionally use it. Note that the random variable is (1/n) Σ_{i=1}^n X_i and the right way of interpreting the above inequality is actually that with probability at least 1 − δ
(1/n) Σ_{i=1}^n X_i ≥ µ − √(ln(1/δ)/(2n)),
i.e., the probability is over (1/n) Σ_{i=1}^n X_i and not over µ. However, many generalization bounds that we study in Chapter 3 are written in the first form in the literature and we follow the tradition.
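For completeness, here is the short calculation behind the √(ln(1/δ)/(2n)) term (a sketch of the inversion of (2.5)): setting the right hand side of (2.5) equal to δ and solving for ε gives
e^{−2nε²} = δ  ⟺  2nε² = ln(1/δ)  ⟺  ε = √(ln(1/δ)/(2n)),
and substituting this ε back into (2.5) yields P(µ − (1/n) Σ_{i=1}^n X_i ≥ √(ln(1/δ)/(2n))) ≤ δ, whose complementary event is the statement at the beginning of this discussion.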
With a slight abuse of notation we specialize the definition of entropy to Bernoulli random variables.
Definition 2.9 (Binary entropy). Let p be the bias of a Bernoulli random variable X. We define the entropy of p as
H(p) = −p ln p − (1 − p) ln(1 − p).
Note that when we talk about Bernoulli random variables p denotes the bias of the random variable
and when we talk about more general random variables p denotes the complete distribution.
Entropy is one of the central quantities in information theory and it has numerous applications. We
start by using binary entropy to bound binomial coefficients.
Lemma 2.10.
(1/(n+1)) e^{n H(k/n)} ≤ (n choose k) ≤ e^{n H(k/n)}.
(Note that k/n ∈ [0, 1] and H(k/n) in the lemma is the binary entropy.)
Proof. By the binomial formula we know that for any p ∈ [0, 1]:
Σ_{i=0}^n (n choose i) p^i (1 − p)^{n−i} = 1.   (2.8)
We start with the upper bound. Take p = k/n. Since the sum is larger than any individual term, for the k-th term of the sum we get:
1 ≥ (n choose k) p^k (1 − p)^{n−k}
  = (n choose k) (k/n)^k (1 − k/n)^{n−k}
  = (n choose k) (k/n)^k ((n−k)/n)^{n−k}
  = (n choose k) e^{k ln(k/n) + (n−k) ln((n−k)/n)}
  = (n choose k) e^{n((k/n) ln(k/n) + ((n−k)/n) ln((n−k)/n))}
  = (n choose k) e^{−n H(k/n)}.
By changing sides of the inequality we obtain the upper bound.
For the lower bound it is possible to show that if we fix p = k/n then (n choose k) p^k (1−p)^{n−k} ≥ (n choose i) p^i (1−p)^{n−i} for any i ∈ {0, . . . , n}, see Cover and Thomas (2006, Example 11.1.3) for details. We also note that there are n + 1 elements in the sum in equation (2.8). Again, take p = k/n, then
1 ≤ (n + 1) max_i (n choose i) p^i (1−p)^{n−i} = (n + 1) (n choose k) (k/n)^k ((n−k)/n)^{n−k} = (n + 1) (n choose k) e^{−n H(k/n)},
where the last step follows the same steps as in the derivation of the upper bound.
Lemma 2.10 shows that the number of configurations of choosing k out of n objects is directly related to the entropy of the imbalance k/n between the number of objects that are selected (k) and the number of objects that are left out (n − k).
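A quick numeric sanity check of Lemma 2.10 in Python (the values of n and k are arbitrary illustrative choices):

from math import comb, exp, log

def H(p):
    # Binary entropy of a bias p (in nats)
    return 0.0 if p in (0.0, 1.0) else -p * log(p) - (1 - p) * log(1 - p)

n, k = 30, 10
lower, upper = exp(n * H(k / n)) / (n + 1), exp(n * H(k / n))
print(lower <= comb(n, k) <= upper)  # True: roughly 6.4e6 <= 3.0e7 <= 2.0e8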
We now introduce one additional quantity, the Kullback-Leibler (KL) divergence, also known as
Kullback-Leibler distance and as relative entropy.
Definition 2.11 (Relative entropy or Kullback-Leibler divergence). Let p(x) and q(x) be two probability
distributions of a random variable X (or two probability density functions, if X is a continuous random
variable), the Kullback-Leibler divergence or relative entropy is defined as:
KL(p‖q) = E_p[ln(p(X)/q(X))] =
  Σ_{x∈X} p(x) ln(p(x)/q(x)),   if X is discrete,
  ∫_{x∈X} p(x) ln(p(x)/q(x)) dx,   if X is continuous.
KL divergence is the central quantity in information theory. Although it is not a distance measure,
because it does not satisfy the triangle inequality, it is the right way of measuring distances between
probability distributions. This is illustrated by the following example.
Example 2.13. Let X_1, . . . , X_n be an i.i.d. sample of n Bernoulli random variables with bias p and let (1/n) Σ_{i=1}^n X_i be the empirical bias of the sample. (Note that (1/n) Σ_{i=1}^n X_i ∈ {0, 1/n, 2/n, . . . , n/n}.) Here and below kl(p̂‖p) denotes the KL divergence between two Bernoulli distributions with biases p̂ and p, kl(p̂‖p) = p̂ ln(p̂/p) + (1 − p̂) ln((1 − p̂)/(1 − p)). Then by Lemma 2.10:
P((1/n) Σ_{i=1}^n X_i = k/n) = (n choose k) p^k (1−p)^{n−k} ≤ e^{n H(k/n)} e^{n((k/n) ln p + ((n−k)/n) ln(1−p))} = e^{−n kl(k/n‖p)}   (2.9)
and
P((1/n) Σ_{i=1}^n X_i = k/n) ≥ (1/(n+1)) e^{−n kl(k/n‖p)}.
Thus, kl(k/n‖p) governs the probability of observing empirical bias k/n when the true bias is p. It is easy to verify that kl(p‖p) = 0 and it is also possible to show that kl(p̂‖p) is convex in p̂ and that kl(p̂‖p) ≥ 0. Thus, the probability of empirical bias is maximized when it coincides with the true bias.
2.5 kl Inequality
Example 2.13 shows that kl can be used to bound the empirical bias when the true bias is known. But
in machine learning we are usually interested in the inverse problem - how to infer the true bias p when
the empirical bias p̂ is known. Next we demonstrate that this is also possible and that it leads to an
inequality, which in most cases is tighter than Hoeffding’s inequality. We start with the following lemma.
Lemma 2.14. Let X_1, . . . , X_n be i.i.d. Bernoulli with bias p and let p̂ = (1/n) Σ_{i=1}^n X_i be the empirical bias. Then
E[e^{n kl(p̂‖p)}] ≤ n + 1.
Proof.
E[e^{n kl(p̂‖p)}] = Σ_{k=0}^n P(p̂ = k/n) e^{n kl(k/n‖p)} ≤ Σ_{k=0}^n e^{−n kl(k/n‖p)} e^{n kl(k/n‖p)} = n + 1,
where the inequality is by (2.9).
Lemma 2.16 (Pinsker's inequality).
KL(p‖q) ≥ (1/2) ‖p − q‖₁²,
where ‖p − q‖₁ = Σ_{x∈X} |p(x) − q(x)| is the L1-norm.
Corollary 2.17 (Pinsker's inequality for the binary kl divergence).
kl(p‖q) ≥ (1/2)(|p − q| + |(1 − p) − (1 − q)|)² = 2(p − q)².   (2.12)
By applying Corollary 2.17 to inequality (2.11) we obtain that with probability greater than 1 − δ
|p − p̂| ≤ √(kl(p̂‖p)/2) ≤ √(ln((n+1)/δ)/(2n)).
Recall that Hoeffding's inequality assures that with probability greater than 1 − δ
p ≤ p̂ + √(ln(1/δ)/(2n)).
Thus, in the worst case the kl inequality is only weaker by the ln(n + 1) factor and in fact the ln(n + 1)
factor can be reduced by a more careful analysis, see Maurer (2004), Langford (2005). Next we show
that the kl inequality can actually be significantly tighter than Hoeffding’s inequality. For this we use
refined Pinsker’s inequality, see Marton (1996, 1997), Samson (2000), Boucheron et al. (2013, Lemma
8.4).
Lemma 2.18 (Refined Pinsker's inequality).
kl(p‖q) ≥ (p − q)²/(2 max{p, q}) + (p − q)²/(2 max{(1 − p), (1 − q)}).
Corollary 2.19 (Refined Pinsker's inequality). If q > p then
kl(p‖q) ≥ (p − q)²/(2q).
Corollary 2.20 (Refined Pinsker's inequality). If kl(p‖q) ≤ ε then
q ≤ p + √(2pε) + 2ε.
By applying Corollary 2.20 to inequality (2.11) we obtain that with probability greater than 1 − δ
p ≤ p̂ + √(2p̂ ln((n+1)/δ)/n) + 2 ln((n+1)/δ)/n.
Note that when p̂ is close to zero, the latter inequality is much tighter than Hoeffding’s inequality. Finally,
we note that although there is no analytic inversion of kl(p̂kp) it is possible to invert it numerically to
obtain even tighter bounds than the relaxations above. Additionally, the bound in Theorem 2.15 can be
improved slightly, see Maurer (2004), Langford (2005).
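The numerical inversion mentioned above can be done by a simple binary search, using that kl(p̂‖p) is increasing in p for p ≥ p̂; a sketch in Python (the function names are illustrative):

from math import log

def kl(p_hat, p, eps=1e-12):
    # Binary kl divergence kl(p_hat || p), with p clipped away from 0 and 1
    p = min(max(p, eps), 1 - eps)
    res = p_hat * log(p_hat / p) if p_hat > 0 else 0.0
    res += (1 - p_hat) * log((1 - p_hat) / (1 - p)) if p_hat < 1 else 0.0
    return res

def kl_upper_inverse(p_hat, bound, iters=60):
    # Largest p with kl(p_hat || p) <= bound, found by binary search on [p_hat, 1]
    lo, hi = p_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl(p_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# With probability at least 1 - delta:  p <= kl_upper_inverse(p_hat, log((n + 1) / delta) / n)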
Lemma 2.21. Let X1 , . . . , Xn denote a random sample without replacement from a finite set X =
{x1 , . . . , xN } of N real values. Let Y1 , . . . , Yn denote a random sample with replacement from X . Then
for any continuous and convex function f : R → R
" n
!# " n
!#
X X
E f Xi ≤E f Yi .
i=1 i=1
In particular, the lemma can be used to prove Hoeffding’s inequality for sampling without replace-
ment.
Theorem 2.22 (Hoeffding's inequality for sampling without replacement). Let X_1, . . . , X_n denote a random sample without replacement from a finite set X = {x_1, . . . , x_N} of N values, where each element x_i is in the [0, 1] interval. Let µ = (1/N) Σ_{i=1}^N x_i be the average of the values in X. Then for all ε > 0
P((1/n) Σ_{i=1}^n X_i − µ ≥ ε) ≤ e^{−2nε²},
P(µ − (1/n) Σ_{i=1}^n X_i ≥ ε) ≤ e^{−2nε²}.
The proof is a minor adaptation of the proof of Hoeffding’s inequality for sampling with replacement
using Lemma 2.21 and is left as an exercise. (Note that it requires a small modification inside the proof,
because Lemma 2.21 cannot be applied directly to the statement of Hoeffding’s inequality.)
While the formal proof requires a bit of work, intuitively the result is quite expected. Imagine the process
of sampling without replacement. If the average of points sampled so far starts deviating from the mean
of the values in X , the average of points that are left in X deviates in the opposite direction and “applies
extra force” to new samples to bring the average back to µ. In the limit when n = N we are guaranteed
to have the average of Xi -s being equal to µ.
Chapter 3
One of the most central questions in machine learning is: “How much can we trust the predictions
of a learning algorithm?”. A way of answering this question is by providing generalization bounds on
the expected performance of the algorithm on new data points. In this chapter we derive a number of
generalization bounds for supervised classification.
2. We observe a sample S sampled i.i.d. according to a fixed, but unknown distribution p(X, Y ).
3. Based on the empirical performances L̂(h, S) of the hypotheses in H, we select a prediction rule ĥ∗S, which we consider to be the “best” in H in some sense. Typically, ĥ∗S is either the empirical risk minimizer (ERM), ĥ∗S = arg min_h L̂(h, S), or a regularized empirical risk minimizer.
In this chapter we are concerned with the question of what can be said about the expected loss L(ĥ∗S ),
which is the error we are expected to make on new samples. More precisely, we provide tools for bounding
the probability that L̂(ĥ∗S , S) is significantly smaller than L(ĥ∗S ). Recall that L̂(ĥ∗S , S) is observed and
L(ĥ∗S ) is unobserved. Having small L̂(ĥ∗S , S) and large L(ĥ∗S ) is undesired, because it means that based
on L̂(ĥ∗S , S) we believe that ĥ∗S performs well, but in reality it does not.
Assumptions There are two key assumptions we make throughout the chapter:
1. The samples in S are i.i.d..
2. The new samples (X, Y ) come from the same distribution as the samples in S.
These are the assumptions behind concentration of measure inequalities developed in Chapter 2 and
it is important to remember that if they are not satisfied the results derived in this chapter are not valid.
In a sense, it is intuitive why we have to make these assumptions. For example, if we train a language
model using data from The Wall Street Journal and then apply it to Twitter the change in prediction
accuracy can be very dramatic. Even though both are written in English and comprehensible by humans,
the language used by professional journalists writing for The Wall Street Journal is very different from
the language used in the short tweets.
The two assumptions are behind most supervised learning algorithms that you can meet in practice
and, therefore, it is important to keep them in mind. In Chapter 5 we discuss how to depart from them,
but for now we stick with them.
Given the assumptions above, for any fixed prediction rule h that is independent of S, the empirical loss is an unbiased estimate of the true loss, E[L̂(h, S)] = L(h). An intuitive way to see it is that under
the assumptions that the samples in S are i.i.d. and coming from the same distribution as new samples
(X, Y ), from the perspective of h the new samples (X, Y ) are in no way different from the samples in S:
any new sample (X, Y ) could have happened to be in S instead of some other sample (Xi , Yi ) (they are
“exchangeable”). Formally,
" n #
h i 1X
E(X1 ,Y1 ),...,(Xn ,Yn ) L̂(h, S) = E(X1 ,Y1 ),...,(Xn ,Yn ) `(h(Xi ), Yi )
n i=1
n
1X
= E(X1 ,Y1 ),...,(Xn ,Yn ) [`(h(Xi ), Yi )]
n i=1
n
1X
= E(Xi ,Yi ) [`(h(Xi ), Yi )]
n i=1
n
1X
= L(h)
n i=1
= L(h).
However, when we make the selection of ĥ∗S based on S the “exchangeability” argument no longer applies and E[L̂(ĥ∗S, S)] ≠ E[L(ĥ∗S)] (note that ĥ∗S is a random variable depending on S and we take expectation with respect to this randomness). This is because ĥ∗S is tailored to S (for example, it minimizes L̂(h, S)) and from the perspective of the selection process the samples in S are not exchangeable with new samples (X, Y). If we exchange the samples we may end up with a different ĥ∗S. In the extreme case when the hypothesis space H is so rich that it can fit any possible labeling of the data (for example, the hypothesis space corresponding to the 1-nearest-neighbor prediction rule) we may end up in a situation where L̂(ĥ∗S, S) is always zero, but E[L(ĥ∗S)] ≥ 1/4, as in the following informal example.
Informal Lower Bound Imagine that we want to learn a classifier that predicts whether a student's birthday is on an even or odd day based on the student's id. Assume that the total number of students is 2n, that the hypothesis class H includes all possible mappings from student id to even/odd, so that |H| = 2^{2n}, and that we observe a sample of n uniformly sampled students (potentially with repetitions). Since all possible mappings are within H, we have ĥ∗S ∈ H for which L̂(ĥ∗S, S) = 0. However, ĥ∗S is guaranteed to make zero error only on the samples that were observed, which constitute at most half of the total number of students. For the remaining students ĥ∗S can, at best, make a random guess, which will succeed with probability 1/2. Therefore, the expected loss of ĥ∗S is L(ĥ∗S) ≥ (1/2)·0 + (1/2)·(1/2) = 1/4, where the first term is an upper bound on the probability of observing an already seen student times the expected error ĥ∗S makes in this case and the second term is a lower bound on the probability of observing a new student times the expected error ĥ∗S makes in this case. For a more formal treatment see the lower bounds in Section 3.7.

Figure 3.2: Learning by Selection.
Considering it from the perspective of expectations, we have:
E_{(X_1,Y_1),...,(X_n,Y_n)}[L̂(ĥ∗S, S)] = E_{(X_1,Y_1),...,(X_n,Y_n)}[(1/n) Σ_{i=1}^n ℓ(ĥ∗S(X_i), Y_i)]
= (1/n) Σ_{i=1}^n E_{(X_1,Y_1),...,(X_n,Y_n)}[ℓ(ĥ∗S(X_i), Y_i)]
= (1/n) Σ_{i=1}^n E_{(X_1,Y_1),...,(X_n,Y_n)}[ℓ(ĥ∗S(X_1), Y_1)]
= E_{(X_1,Y_1),...,(X_n,Y_n)}[ℓ(ĥ∗S(X_1), Y_1)]
≠ E_{(X,Y)}[E_{(X_1,Y_1),...,(X_n,Y_n)}[ℓ(ĥ∗S(X), Y)]]
= E_{(X_1,Y_1),...,(X_n,Y_n)}[E_{(X,Y)}[ℓ(ĥ∗S(X), Y)]]
= E_{(X_1,Y_1),...,(X_n,Y_n)}[L(ĥ∗S)].
The selection leads to the approximation-estimation trade-off (a.k.a. bias-variance trade-off), see
Figure 3.2. If the hypothesis class H is small it is easy to identify a good hypothesis h in H, but since
H is small it is likely that all the hypotheses in H are weak. On the other hand, if H is large it is more
likely to contain stronger hypotheses, but at the same time the probability of confusion with a poor
hypothesis grows. This is because there is always a small chance that the empirical loss L̂(h, S) does
not represent the true loss L(h) faithfully. The more hypotheses we take, the higher is the chance that
L̂(h, S) is misleading for some of them, which increases the chance of confusion.
Finding a good balance between approximation and estimation errors is one of the central questions
in machine learning. The main tool for analyzing the trade-off from the theoretical perspective are
concentration of measure inequalities. Since concentration of measure inequalities do not apply when
the prediction rule ĥ∗S depends on S, the main approach to analyzing the prediction power of ĥ∗S is to
consider cases with no dependency and then take a union bound over selection from these cases. In this
chapter we study three different ways of implementing this idea, see Figure 3.3 for an overview. We
distinguish between hard selection, where the learning procedure returns a single hypothesis h and soft
selection, where the learning procedure returns a distribution over H.
Figure 3.3: Overview of the major approaches to derivation of generalization bounds considered in this
chapter.
1. Occam’s razor applies to hard selection from a countable hypothesis space H and it is based on a
weighted union bound over H. We know that for every fixed h the expected loss is close to the
empirical loss, meaning that L(h) − L̂(h, S) is small. When H is countable we can take a weighted
union bound and obtain that L(h) − L̂(h, S) is “small” for all h ∈ H (where the magnitude of
“small” is inversely proportional to the weight of h in the union bound) and thus it is “small” for
ĥ∗S .
2. Vapnik-Chervonenkis (VC) analysis applies to hard selection from an uncountable hypothesis space
H and it is based on projection of H onto S and a union bound over what we obtain after the
projection. The idea is that even when H is uncountably infinite, there is only a finite number
of “behaviors” (ways to label S) we can observe on a finite sample S. In other words, when we
look at H through the prism of S we can only distinguish between a finite number of subsets of
H and everything that falls within the subsets is equivalent in terms of L̂(h, S). Therefore, S
only serves for a (finite) selection of a subset of H out of a finite number of subsets, whereas the
(infinite) selection from within the subset is independent of S. Selection that is independent of S
introduces no bias. As before, the VC analysis exploits the fact that for any fixed h the distance
L(h) − L̂(h, S) is small and then takes a union bound over the potential dependencies, which are
the dependencies between the subsets (the projections) and S.
3. PAC-Bayesian analysis applies to soft selection from an uncountable hypothesis space H and it
is based on change of measure inequality, which can be seen as a refinement of the union bound.
Unlike the preceding two approaches, which return a single classifier ĥ∗S , PAC-Bayesian analysis
returns a randomized classifier defined by a distribution ρ over H. The actual classification then
happens by drawing a new classifier h from H according to ρ at each prediction round and applying
it to make a prediction. When H is countable, ρ can (but does not have to) be a delta-distribution
putting all the mass on a single hypothesis ĥ∗S and in this case the generalization guarantees are
identical to those in Occam’s razor approach. The amount of selection is measured by deviation
of ρ from a prior distribution π, where π is selected independently of S. It is natural to put more
of ρ-mass on hypotheses that perform well on S, but the more we skew ρ toward well-performing
hypotheses the more it deviates from π. This provides a more refined way of measuring the amount
of selection compared to the other two approaches. Furthermore, randomization allows us to avoid
selection when it is not necessary. The avoidance of selection reduces the variance without impairing
the bias. For example, when two hypotheses have similar empirical performance we do not have
to commit to one of them, but can instead distribute ρ equally among them. The analysis then
provides a certain “bonus” for avoiding commitment.
P(L(h) − L̂(h, S) ≥ √(ln(1/δ)/(2n))) ≤ δ   (3.1)
and
P(|L(h) − L̂(h, S)| ≥ √(ln(2/δ)/(2n))) ≤ δ.   (3.2)
Proof. For (3.1) take ε = √(ln(1/δ)/(2n)) in (2.5) and rearrange the terms. Equation (3.2) follows in a similar way from the two-sided Hoeffding's inequality. Note that in (3.1) we have 1/δ and in (3.2) we have 2/δ.
There is an alternative way to read equation (3.1): with probability at least 1 − δ we have
L(h) ≤ L̂(h, S) + √(ln(1/δ)/(2n)).
We remind the reader that the above inequality should actually be interpreted as
L̂(h, S) ≥ L(h) − √(ln(1/δ)/(2n))
and it means that with probability at least 1 − δ the empirical loss L̂(h, S) does not underestimate the expected loss L(h) by more than √(ln(1/δ)/(2n)). However, it is customary to write the inequality in the first form (as an upper bound on L(h)) and we follow the tradition (see the discussion at the end of Section 2.3.1).
Theorem 3.1 is analogous to the problem of estimating a bias of a coin based on coin flip outcomes.
There is always a small probability that the flip outcomes will not be representative of the coin bias.
For example, it may happen that we flip a fair coin 1000 times (without knowing that it is a fair coin!)
and observe “all heads” or some other misleading outcome. And if this happens we are doomed - there
is nothing we can do when the sample does not represent the reality faithfully. Fortunately for us, this
happens with a small probability that decreases exponentially with the sample size n.
Whether we use the one-sided bound (3.1) or the two-sided bound (3.2) depends on the situation.
In most cases we are interested in the upper bound on the expected performance of the prediction rule
given by (3.1).
Figure 3.4: Validation (the red part in the figure) is identical to learning with a reduced hypothesis set
H0 (most often H0 is finite).
becomes interesting when training sample S helps to improve future predictions or, equivalently, decrease
the expected loss L(h). In this section we consider the simplest non-trivial case, where H consists of
a finite number of hypotheses M . There are at least two cases, where we meet a finite H in real life.
The first is when the input space X is finite. This case is relatively rare. The second and much more
frequent case is when H itself is an outcome of a learning process. For example, this is what happens in
a validation procedure, see Figure 3.4. In validation we are using a validation set in order to select the
best hypothesis out of a finite number of candidates corresponding to different parameter values and/or
different algorithms.
And now comes the delicate point. Let ĥ∗S be a hypothesis with minimal empirical risk, ĥ∗S = arg min_h L̂(h, S) (it is natural to pick the empirical risk minimizer ĥ∗S to make predictions on new samples, but the following discussion equally applies to any other selection rule that takes sample S into account; note that there may be multiple hypotheses that achieve the minimal empirical error and in this case we can pick one arbitrarily). While for each h individually E[L̂(h, S)] = L(h), this is not true for E[L̂(ĥ∗S, S)]. In other words, E[L̂(ĥ∗S, S)] ≠ E[L(ĥ∗S)] (we have to put expectation on the right hand side, because ĥ∗S depends on the sample). The reason is that when we pick ĥ∗S that minimizes the
side, because ĥ∗S depends on the sample). The reason is that when we pick ĥ∗S that minimizes the
empirical error on S, from the perspective of ĥ∗S the samples in S no longer look identical to future
samples (X, Y ). This is because ĥ∗S is selected in a very special way - it is selected to minimize the
empirical error on S and, thus, it is tailored to S and most likely does better on S than on new random
samples (X, Y ). One way to handle this issue is to apply a union bound.
Theorem 3.2. Assume that ℓ is bounded in the [0, 1] interval and that |H| = M. Then for any δ ∈ (0, 1) we have:
P(∃h ∈ H : L(h) ≥ L̂(h, S) + √(ln(M/δ)/(2n))) ≤ δ.   (3.3)
Proof.
P(∃h ∈ H : L(h) ≥ L̂(h, S) + √(ln(M/δ)/(2n))) ≤ Σ_{h∈H} P(L(h) ≥ L̂(h, S) + √(ln(M/δ)/(2n))) ≤ Σ_{h∈H} δ/M = δ,
where the first inequality is by the union bound and the second is by Hoeffding’s inequality.
Another way of reading Theorem 3.2 is: with probability at least 1 − δ for all h ∈ H
L(h) ≤ L̂(h, S) + √(ln(M/δ)/(2n)).   (3.4)
It means that no matter which h from H is returned by the algorithm, with high probability we have the guarantee (3.4). In particular, it holds for ĥ∗S. Again, remember that the random quantity is actually L̂(h, S) and the right way to read the bound is L̂(h, S) ≥ L(h) − √(ln(M/δ)/(2n)), see the discussion in the previous section.
The price for considering M hypotheses instead of a single one is ln M. Note that it grows only logarithmically with M! Also note that there is no contradiction between the upper bound and the lower bound we have discussed in Section 3.1. In the construction of the lower bound we took M = |H| = 2^{2n}. If we substitute this value of M into (3.4) we obtain √(ln(M/δ)/(2n)) ≥ √(ln 2) ≥ 0.8, which has no contradiction with L(h) ≥ 0.25.
Similar to Theorem 3.1 it is possible to derive a two-sided bound on the error. It is also possible to derive a lower bound by using the other side of Hoeffding's inequality (2.4): with probability at least 1 − δ, for all h ∈ H we have L(h) ≥ L̂(h, S) − √(ln(M/δ)/(2n)). Typically we want the upper bound on L(h), but if we want to compare two prediction rules, h and h′, we need an upper bound for one and a lower bound for the other. The “lazy” approach is to take the two-sided bound for everything, but sometimes it is possible to save the factor of ln(2) by carefully considering which hypotheses require the lower bound and which require the upper bound and applying the union bound correspondingly (we are not getting into the details).
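To get a feeling for the numbers, the sketch below evaluates the bound (3.4) for a few sample sizes, with illustrative choices M = 100 and δ = 0.05:

from math import log, sqrt

M, delta = 100, 0.05
for n in [100, 1000, 10000]:
    slack = sqrt(log(M / delta) / (2 * n))  # the term added to the empirical loss in (3.4)
    print(f"n={n:6d}: L(h) <= empirical loss + {slack:.3f} for all h in H simultaneously")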
where δ is the probability that things go wrong and L̂(h, S) happens to be far away from L(h) because S is not representative for the performance of h. There is a dependence between the probability that things go wrong and the requirement on the closeness between L(h) and L̂(h, S). If we want them to be very close (meaning that ln(1/δ) is small) then δ has to be large, but if we can allow a larger distance then δ can be smaller.
So, δ can be seen as our “confidence budget” (or, more precisely, “uncertainty budget”) - the probability that we allow things to go wrong. The idea behind the Occam's Razor bound is to distribute this budget unevenly among the hypotheses in H. We use π(h) ≥ 0, such that Σ_{h∈H} π(h) ≤ 1, as our distribution of the confidence budget δ, where each hypothesis h is assigned a π(h) fraction of the budget. This means that for every hypothesis h ∈ H the sample S is allowed to be “non representative” with probability at most π(h)δ, so that the probability that there exists any h ∈ H for which S is not representative is at most δ (by the union bound). The price that we pay is that the precision (the closeness of L̂(h, S) to L(h)) now differs from one hypothesis to another and depends on the confidence budget π(h)δ that was assigned to it. More precisely, L̂(h, S) is allowed to underestimate L(h) by up to √(ln(1/(π(h)δ))/(2n)). The precision increases when π(h) increases, but since Σ_{h∈H} π(h) ≤ 1 we cannot afford high precision for every h and have to compromise. More on this in the next theorem and its applications that follow.
Theorem 3.3. Let ℓ be bounded in [0, 1], let H be a countable hypothesis set and let π(h) be independent of the sample and satisfy π(h) ≥ 0 for all h and Σ_{h∈H} π(h) ≤ 1. Then:
P(∃h ∈ H : L(h) ≥ L̂(h, S) + √(ln(1/(π(h)δ))/(2n))) ≤ δ.
Proof.
P(∃h ∈ H : L(h) ≥ L̂(h, S) + √(ln(1/(π(h)δ))/(2n))) ≤ Σ_{h∈H} P(L(h) ≥ L̂(h, S) + √(ln(1/(π(h)δ))/(2n)))
≤ Σ_{h∈H} π(h)δ
≤ δ,
where the first inequality is by the union bound, the second inequality is by Hoeffding's inequality, and the last inequality is by the assumption on π(h). Note that π(h) has to be selected before we observe the sample (or, in other words, independently of the sample), otherwise the second inequality does not hold. More explicitly, in Hoeffding's inequality P(E[Z_1] − (1/n) Σ_{i=1}^n Z_i ≥ √(ln(1/δ′)/(2n))) ≤ δ′ the parameter δ′ has to be independent of Z_1, . . . , Z_n. For π(h) independent of S we take δ′ = π(h)δ and apply the inequality. But if π(h) depended on S we would not be able to apply it.
Another way of reading Theorem 3.3 is that with probability at least 1 − δ, for all h ∈ H:
L(h) ≤ L̂(h, S) + √(ln(1/(π(h)δ))/(2n)).
Again, refer back to the discussion in Section 3.2 regarding the correct interpretation of the inequality.
Note that the bound on L(h) depends both on L̂(h, S) and on π(h). Therefore, according to the bound,
the best generalization is achieved by h that optimizes the trade-off between empirical performance
L̂(h, S) and π(h), where π(h) can be interpreted as a complexity measure or a prior belief. Also, note
that π(h) can be designed arbitrarily, but it should be independent of the sample S. If π(h) happens to
put more mass on h-s with low L̂(h, S) the bound will be tighter, otherwise the bound will be looser,
but it will still be a valid bound. But we cannot readjust π(h) after observing S! Some considerations
behind the choice of π(h) are provided in Section 3.4.1.
Also note that while we can select π(h) such that Σ_{h∈H} π(h) = 1 and interpret π as a probability distribution over H, it is not a requirement (we may have Σ_{h∈H} π(h) < 1) and π is used as an auxiliary construction for derivation of the bound rather than the prior distribution in the Bayesian sense (for readers who are familiar with Bayesian learning). However, we can use π to incorporate prior knowledge into the learning procedure.
Proof. We first note that |H_d| = 2^{2^d}. We define π(h) = (1/2^{d(h)+1}) · (1/2^{2^{d(h)}}). The first part of π(h) distributes the confidence budget δ among the H_d-s (we can see it as p(H_d) = 1/2^{d+1}, the share of the confidence budget that goes to H_d) and the second part of π(h) distributes the confidence budget uniformly within H_d. Since Σ_{d=0}^∞ 1/2^{d+1} = 1, the assumption Σ_{h∈H} π(h) ≤ 1 is satisfied. The result follows by application of Theorem 3.3.
(a) Subsets of homogeneous linear separators defined by two sample points. (b) Subsets of homogeneous linear separators defined by three sample points.
Figure 3.5: Subsets of homogeneous linear separators in R2 formed by (a) two and (b) three sample points. A homogeneous linear separator in R2 is defined by a vector w ∈ R2. The
sample points define a number of regions in R2 that are shown by the numbers in circles. We say that a
linear separator falls within a certain region when the vector w defining it falls within that region. All
homogeneous linear separators falling within the same region have the same empirical loss L̂(h, S) and,
therefore, any selection among them is not based on the sample S and introduces no bias. The sample
only discriminates between the subsets.
Note that the bound depends on ln(1/(π(h)δ)) and the dominating term in 1/π(h) is 2^{2^{d(h)}}. We could
have selected a different distribution of confidence over the H_d-s, for example, p(H_d) = 1/((d+1)(d+2)) (for which
we also have ∑_{d=0}^∞ 1/((d+1)(d+2)) = 1), which is perfectly fine, but makes no significant difference for the
bound. The dominating complexity term ln 2^{2^{d(h)}} comes from the uniform distribution of confidence within
Hd , which makes sense unless we have some prior information about the problem. In absence of such
information there is no reason to give preference to any of the trees within Hd , because Hd is symmetric.
The prior selected in the proof of Theorem 3.5 exploits structural symmetries within the hypothesis
class H and assigns equal weight to hypotheses that are symmetric under permutation of names of
the input variables. While we want π(h) to be as large as possible for every h, the number of such
permutation-symmetric hypotheses is the major barrier dictating how large π(h) can be (because π has
to satisfy ∑_{h∈H} π(h) ≤ 1). Deeper trees have more symmetric permutations and, therefore, get smaller
π(h) compared to shallow trees. If there is prior information that breaks the permutation symmetry it
can be used to assign higher prior to the corresponding trees and if it correctly reflects the true data
distribution it will also lead to tighter bounds. If the prior information does not match the true data
distribution such adjustments may have the opposite effect.
L̂(h, S) and selection among them is independent of S. See Figure 3.5 for an illustration.
The effective selection based on the sample S depends on the number of subsets of H with distinct
labeling patterns on S. When the number of such subsets is exponential in the size of the sample
n, the selection is too large and leads to overfitting, as we have already seen for selection from large
finite hypothesis spaces in the earlier sections. I.e., we cannot guarantee closeness of L̂(ĥ∗S , S) to L(ĥ∗S ).
However, if the number of subsets is subexponential in n, we can provide generalization guarantees for
L(ĥ∗S ). In Figure 3.5 we illustrate (informally) that at a certain point the number of subsets of the class
of homogeneous linear separators in R2 stops growing exponentially with n.1 For n = 2 the sample
defines 4 = 2n subsets, but for n = 3 the sample defines 6 < 2n subsets. It can be formally shown that
no 3 sample points can define more than 6 subsets of the space of homogeneous linear separators in R2
(some may define less, but that is even better for us) and that for n > 2 the number of subsets grows
polynomially rather than exponentially with n.
In what follows we first bound the distance between L̂(h, S) and L(h) for all h ∈ H in terms of
the number of subsets using symmetrization (Section 3.5.1) and then bound the number of subsets
(Section 3.5.2).
Definition 3.7 (The Growth Function). The growth function of H is the maximal number of dichotomies
it can generate on n points:
mH(n) = max_{x₁,...,x_n} |H(x₁, . . . , x_n)|.
Note that mH(n) is defined by the “worst-case” configuration of points x₁, . . . , x_n, for which
|H (x1 , . . . , xn )| is maximized. Thus, for lower bounding mH (n) (i.e., for showing that mH (n) ≥ v
for some value v) we have to find a configuration of points x1 , . . . , xn for which |H (x1 , . . . , xn )| ≥
v or, at least, prove that such configuration exists. For upper bounding mH (n) (i.e., for showing
that mH (n) ≤ v) we have to show that for any possible configuration of points x1 , . . . , xn we have
|H (x1 , . . . , xn )| ≤ v. In other words, coming up with an example of a particular configuration x1 , . . . , xn
for which |H (x1 , . . . , xn )| ≤ v is insufficient for proving that mH (n) ≤ v, because there may potentially
be an alternative configuration of points achieving a larger number of labeling configurations. To be
concrete, the illustration in Figure 3.5b shows that for the hypothesis space H of homogeneous linear
separators in R2 we have mH (3) ≥ 6, but it does not show that mH (3) ≤ 6. If we want to prove that
mH (3) ≤ 6 we have to show that no configuration of 3 sample points can differentiate between more
than 6 distinct subsets of the hypothesis space. More generally, if we want to show that mH (n) = v we
have to show that mH (n) ≥ v and mH (n) ≤ v. I.e., the only way to show equality is by proving a lower
and an upper bound.
The following theorem uses the growth function to bound the distance between empirical and expected
loss for all h ∈ H.
Theorem 3.8. Assume that ` is bounded in the [0, 1] interval. Then for any δ ∈ (0, 1)
P(∃h ∈ H : L(h) ≥ L̂(h, S) + √(8 ln(2 mH(2n)/δ)/n)) ≤ δ.
The result is useful when mH(2n) ≪ e^n. In Section 3.5.2 we discuss when we can and cannot expect
to have it, but for now we concentrate on the proof of the theorem.
The proof of the theorem is based on three ingredients. First we introduce a “ghost sample” S 0 ,
which is an imaginary sample of the same size as S (i.e., of size n). We do not need to have this sample
1 Homogeneous linear separators are linear separators passing through the origin.
Figure 3.6: Illustration for Step 2 of the proof of Theorem 3.8.
at hand, but we ask what would have happened if we had such a sample. Then we apply symmetrization:
we show that the probability that for any h the empirical loss L̂(h, S) is far from L(h) by more than ε is
bounded by twice the probability that L̂(h, S) is far from L̂(h, S 0 ) by more than ε/2. This allows us to
consider the behavior of H on the two samples, S and S 0 , instead of studying it over all X (because the
definition of L(h) involves all X , whereas the definition of L̂(h, S 0 ) involves only S 0 ). In the third step
we project H onto the two samples, S and S 0 . Even though H is uncountably infinite, when we look at
it through the prism of S ∪ S 0 we can only observe a finite number of distinct behaviors. More precisely,
the number of different ways H can label S ∪ S 0 is at most mH (2n). We show that the probability that
for any of the possible ways to label S ∪ S 0 the empirical losses L̂(h, S) and L̂(h, S 0 ) diverge by more
than ε/2 decreases exponentially with n.
Now we do this formally.
(a) Illustration of the split. (b) Illustration of the distances.
Figure 3.7: Illustration of the split of S ∪ S 0 into S and S 0 . On the left: First we sample the
joint sample S ∪ S 0 . Then each hypothesis hj produces a “big bag” of losses {Z1 , . . . , Z2n }, where
Zi = `(hj (Xi ), Yi ). Even though H is uncountably infinite, the number of different ways to label S ∪ S 0
is at most mH (2n) by the definition of the growth function and thus the number of different “big bags”
of losses is at most mH (2n) (in the illustration we have m ≤ mH (2n)). Finally, we split S ∪ S 0 into S
and S 0 , which corresponds to splitting the “big bags” of 2n losses into pairs of “small bags” of n losses,
corresponding to L̂(hj , S) and L̂(hj , S 0 ). On the right: we illustrate the distances between the average
losses in a pair of “small bags” and the corresponding “big bag”, which is the average of the two “small
bags”.
The inequality follows by the fact that for any two events A and B we have P(A) ≥ P(A AND B)
and the equality by P(A AND B) = P(B)P(A|B). The first term in (3.6) is the term we want and
we need to lower bound the second term. We let h∗ be any h for which, by conditioning, we have
L(h∗ ) − L̂(h∗ , S) ≥ ε. With high probability we have that L̂(h∗ , S 0 ) is close to L(h∗ ) up to ε/2. And
since we are given that L̂(h∗, S) is far from L(h∗) by more than ε it must also be far from L̂(h∗, S′) by
more than ε/2 with high probability, see the illustration in Figure 3.6. Formally, we have:
P(∃h ∈ H : L̂(h, S′) − L̂(h, S) ≥ ε/2 | ∃h ∈ H : L(h) − L̂(h, S) ≥ ε)
≥ P(L̂(h∗, S′) − L̂(h∗, S) ≥ ε/2 | L(h∗) − L̂(h∗, S) ≥ ε)   (3.7)
≥ P(L(h∗) − L̂(h∗, S′) ≤ ε/2 | L(h∗) − L̂(h∗, S) ≥ ε)   (3.8)
= P(L(h∗) − L̂(h∗, S′) ≤ ε/2)   (3.9)
≥ 1 − P(L(h∗) − L̂(h∗, S′) ≥ ε/2)
≥ 1 − e^{−2n(ε/2)²}   (3.10)
≥ 1/2.   (3.11)
Explanation of the steps: in (3.7) the event on the left hand side includes the event on the right hand
side; in (3.8) we have L̂(h, S′) − L̂(h, S) = (L(h) − L̂(h, S)) − (L(h) − L̂(h, S′)) and since we are given
that L(h) − L̂(h, S) ≥ ε the event L̂(h, S′) − L̂(h, S) ≥ ε/2 follows from L(h) − L̂(h, S′) ≤ ε/2, see
Figure 3.6; in (3.9) we can remove the conditioning on S, because the event of interest concerns S′,
which is independent of S; (3.10) follows by Hoeffding's inequality; and (3.11) follows by the lemma's
assumption on e^{−nε²/2}.
By plugging the result back into (3.6) and multiplying by 2 we obtain the statement of the lemma.
Step 3 [Projection] Now we focus on P(∃h ∈ H : L̂(h, S′) − L̂(h, S) ≥ ε/2), which concerns the be-
havior of H on two finite samples, S and S 0 . There are two possible ways to sample S and S 0 . The first
is to sample S and then S 0 . An alternative way is to sample a joint sample S2n = S ∪ S 0 and then split
it into S and S 0 by randomly assigning half of the samples into S and half into S 0 . The two procedures
are equivalent and lead to the same distribution over S and S 0 . We focus on the second procedure. Its
advantage is that once we have sampled S ∪ S 0 the number of ways to label it with hypotheses from
H is finite, even though H is uncountably infinite. This way we turn an uncountably infinite problem
into a finite problem. The number of different sequences of losses on S ∪ S 0 is at most the number of
different ways to label it, which is at most the growth function mH (2n) by definition. The probability
of having L̂(h, S 0 ) − L̂(h, S) ≥ ε/2 for a fixed h reduces to the probability of splitting a sequence of 2n
losses into n and n losses and having more than ε/2 difference between the average of the two. The latter
reduces to the problem of sampling n losses without replacement from a bag of 2n losses and obtaining
an average which deviates from the bag's average by more than ε/4, see Figure 3.7. This probability
can be bounded by Hoeffding's inequality for sampling without replacement and decreases as e^{−nε²/8}.
Putting this together we obtain the following result.
Lemma 3.10.
P(∃h ∈ H : L̂(h, S′) − L̂(h, S) ≥ ε/2) ≤ mH(2n) e^{−nε²/8}.
As you may guess, mH (2n) comes from a union bound over the number of possible sequences of losses
we may obtain with hypotheses from H on S ∪ S 0 . We now prove the lemma formally.
Proof of Lemma 3.10.
P(∃h ∈ H : L̂(h, S′) − L̂(h, S) ≥ ε/2) = ∑_{S∪S′} P(S ∪ S′) P(∃h ∈ H : L̂(h, S′) − L̂(h, S) ≥ ε/2 | S ∪ S′)
≤ sup_{S∪S′} P(∃h ∈ H : L̂(h, S′) − L̂(h, S) ≥ ε/2 | S ∪ S′).
Note that the conditional probabilities are with respect to the splitting of S ∪ S′ into S and S′.
Let Z(S ∪ S′) = {Z₁, . . . , Z_{2n} : Z_i = ℓ(h(X_i), Y_i), h ∈ H} be the set of all possible sequences of losses
that can be obtained by applying h ∈ H to S ∪ S′. Since there are at most mH(2n) distinct ways to
label S ∪ S′ we have |Z(S ∪ S′)| ≤ mH(2n). Let σ : {1, . . . , 2n} → {1, . . . , 2n} denote a permutation of
indexes. We have
sup_{S∪S′} P(∃h ∈ H : L̂(h, S′) − L̂(h, S) ≥ ε/2 | S ∪ S′)
= sup_{S∪S′} P(∃{Z₁, . . . , Z_{2n}} ∈ Z(S ∪ S′) : (1/n)∑_{i=1}^n Z_{σ(i)} − (1/n)∑_{i=n+1}^{2n} Z_{σ(i)} ≥ ε/2)   (3.12)
≤ sup_{S∪S′} ∑_{{Z₁,...,Z_{2n}}∈Z(S∪S′)} P((1/n)∑_{i=1}^n Z_{σ(i)} − (1/n)∑_{i=n+1}^{2n} Z_{σ(i)} ≥ ε/2)   (3.13)
= sup_{S∪S′} ∑_{{Z₁,...,Z_{2n}}∈Z(S∪S′)} P((1/n)∑_{i=1}^n Z_{σ(i)} − (1/(2n))∑_{i=1}^{2n} Z_i ≥ ε/4)   (3.14)
≤ sup_{S∪S′} ∑_{{Z₁,...,Z_{2n}}∈Z(S∪S′)} e^{−nε²/8}   (3.15)
≤ sup_{S∪S′} mH(2n) e^{−nε²/8}   (3.16)
= mH(2n) e^{−nε²/8},
where (3.12) follows by the fact that Z(S ∪ S 0 ) is the set of all possible losses on S ∪ S 0 and in the step
of splitting S ∪ S 0 into S and S 0 and computing L̂(h, S 0 ) and L̂(h, S) we are splitting a “big bag” of 2n
losses into two “small bags” of n and n; all that is left from H in the splitting process is Z(S ∪ S 0 );
the probability in (3.12) is over the split of S ∪ S 0 into S and S 0 , which is expressed by taking the
first n elements of a random permutation σ of indexes into S 0 and the last n elements into S and the
probability is over σ; in (3.13) we apply the union bound; for (3.14) see the illustration in Figure 3.7b; in
(3.15) we apply Hoeffding’s inequality for sampling without replacement (Theorem 2.22) to the process
of randomly sampling n losses out of 2n and observing ε/4 deviation from the average; in (3.16) we apply
the bound on |Z(S ∪ S 0 )|.
Step 4 [Putting Everything Together] All that is left for the proof of Theorem 3.8 is to put
Lemmas 3.9 and 3.10 together.
Proof of Theorem 3.8. Assuming that e^{−nε²/2} ≤ 1/2 we have by Lemmas 3.9 and 3.10:

P(∃h ∈ H : L(h) − L̂(h, S) ≥ ε) ≤ 2 P(∃h ∈ H : L̂(h, S′) − L̂(h, S) ≥ ε/2) ≤ 2 mH(2n) e^{−nε²/8}.

Note that if e^{−nε²/2} > 1/2 then 2 mH(2n) e^{−nε²/8} > 1 and the inequality is satisfied trivially (because
probabilities are always upper bounded by 1).
By denoting the right hand side of the inequality by δ and solving for ε we obtain the result.
|H(x₁, . . . , x_n)| = 2^n.
For example, the set of homogeneous linear separators in R2 shatters the two points in Figure 3.5a,
but it does not shatter the three points in Figure 3.5b. Note that if two points lie on one line passing
through the origin, they are not shattered by the set of homogeneous linear separators, because they
always get the same label. Thus, we may have two sets of points of the same size, where one is shattered
and the other is not.
Definition 3.12. The Vapnik-Chervonenkis ( VC) dimension of H, denoted by dVC (H) is the maximal
number of points that can be shattered by H. In other words,
We remind that the binomial coefficient (n choose k) counts the number of ways to pick k elements out of n
and that for n < k it is defined as (n choose k) = 0. Thus, equation (3.17) is well-defined even when n < dVC(H).
We also remind that ∑_{i=0}^{n} (n choose i) = 2^n, where 2^n is the number of all possible subsets of n elements, which
is equal to the sum over i going from 0 to n of the number of ways to select i elements out of n. For n ≤ dVC(H) we have
mH(n) = 2^n and the inequality is satisfied trivially.
The proof of Theorem 3.13 is reminiscent of the combinatorial proof of the binomial identity

(n choose k) = (n−1 choose k) + (n−1 choose k−1).

One way to count the number of ways to select k elements out of n on the right hand side is to take
one element aside. If that element is selected, then we have (n−1 choose k−1) possibilities to select k − 1 additional
elements out of the remaining n − 1. If the element is not selected, then we have (n−1 choose k) possibilities to
select all k elements out of the remaining n − 1. The sets including the first element are disjoint from the
sets excluding it, leading to the identity above.
We need one more definition for the proof of Theorem 3.13.
Definition 3.14. Let B(n, d) be the maximal number of possible ways to label n points, so that no d + 1
points are shattered.
By the definition, we have mH (n) ≤ B(n, dVC (H)).
Proof of Theorem 3.13. We prove by induction that

B(n, d) ≤ ∑_{i=0}^{d} (n choose i).   (3.18)
For the induction base we have B(n, 0) = 1 = (n choose 0): if no points are shattered there is just one way to
label the points. If there were more than one way, they would differ in at least one point and that
point would be shattered. By the definition of binomial coefficients, which says that for k > n we have
(n choose k) = 0, we also know that for n < d we have B(n, d) = B(n, n). In particular, B(0, d) = B(0, 0) = 1.
Now we proceed with induction on d and for each d we do an induction on n. We show that
B(n, d) ≤ B(n − 1, d) + B(n − 1, d − 1).
Let S be a maximal set of dichotomies (labeling patterns) on n points x1 , . . . , xn . We take one point
aside, xn , and split S into three disjoint subsets: S = S ∗ ∪ S + ∪ S − . The set S ∗ contains dichotomies on
n points that appear with just one sign on xn , either positive or negative. The sets S + and S − contain all
dichotomies that appear with both positive and negative sign on xn , where the positive ones are collected
in S + and the negative ones are collected in S − . Thus, the sets S + and S − are identical except in their
labeling of xn , where in S + it is always labeled as + and in S − always as −. By contradiction, the
number of points x1 , . . . , xn−1 that are shattered by S − cannot be larger than d − 1, because otherwise
the number of points that are shattered by S, which includes S + and S − , would be larger than d, since
we can use S + and S − to add xn to the set of shattered points. Therefore, |S − | ≤ B(n − 1, d − 1). At
the same time, the number of points x1 , . . . , xn−1 that are shattered by S ∗ ∪ S + cannot be larger than d,
because the total number of points shattered by S is at most d. Thus, we have |S ∗ ∪ S + | ≤ B(n − 1, d).
And overall
B(n, d) = |S| = |S ∗ ∪ S + | + |S − | ≤ B(n − 1, d) + B(n − 1, d − 1),
as desired. By the induction assumption equation (3.18) is satisfied for B(n − 1, d) and B(n − 1, d − 1),
and we have
B(n, d) ≤ ∑_{i=0}^{d} (n−1 choose i) + ∑_{i=0}^{d−1} (n−1 choose i)
= 1 + ∑_{i=0}^{d−1} ((n−1 choose i+1) + (n−1 choose i))
= ∑_{i=0}^{d} (n choose i),
as desired. Finally, as we have already observed, mH (n) ≤ B(n, dVC (H)), completing the proof.
The following lemma provides a more explicit bound on the growth function.
Lemma 3.15.
∑_{i=0}^{d} (n choose i) ≤ n^d + 1.
The proof is based on induction and left as an exercise.
By plugging the results of Theorem 3.13 and Lemma 3.15 into Theorem 3.8 we obtain the VC
generalization bound.
Theorem 3.16 (VC generalization bound). Let H be a hypothesis class with VC-dimension dVC(H) = dVC. Then:

P(∃h ∈ H : L(h) ≥ L̂(h, S) + √(8 ln(2((2n)^{dVC} + 1)/δ)/n)) ≤ δ.
For example, the VC-dimension of linear separators in R^d is d + 1 and Theorem 3.16 provides gen-
eralization guarantees for learning with linear separators in finite-dimensional spaces, as long as the
dimension of the space d is small in relation to the number of points n.
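As a quick sanity check of how the bound scales with n, the following sketch (assuming nothing beyond the statement of Theorem 3.16; the values of n, dVC, and δ are illustrative) evaluates the complexity term:

    import numpy as np

    def vc_bound_term(n, d_vc, delta):
        """Complexity term of Theorem 3.16:
        sqrt(8 * ln(2 * ((2n)^d_vc + 1) / delta) / n)."""
        return np.sqrt(8.0 * np.log(2.0 * ((2.0 * n) ** d_vc + 1.0) / delta) / n)

    # For linear separators in R^d the VC-dimension is d + 1, e.g. d = 2 gives d_vc = 3.
    for n in [100, 1000, 10000]:
        print(n, vc_bound_term(n, d_vc=3, delta=0.05))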
where ⌈R²/γ²⌉ is the smallest integer that is greater than or equal to R²/γ².
The important point is that the bound on fat shattering dimension is independent of the dimension
of the space Rd that w comes from.
We define fat losses that count as error everything that falls too close to the separating hyperplane
or on the wrong side of it.
Definition 3.20 (Fat Losses). For h = (w, b) we define the fat losses

ℓFAT(h(x), y) = 0 if y(⟨w, x⟩ + b) ≥ 1 and 1 otherwise,
LFAT(h) = E[ℓFAT(h(X), Y)],
L̂FAT(h, S) = (1/n) ∑_{i=1}^n ℓFAT(h(X_i), Y_i).
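The empirical fat loss is straightforward to compute directly from its definition; here is a minimal sketch (the data arrays are illustrative placeholders, not part of the text):

    import numpy as np

    def empirical_fat_loss(w, b, X, y):
        """L_hat_FAT(h, S): fraction of points with y * (<w, x> + b) < 1,
        i.e. points that are misclassified or fall inside the margin."""
        margins = y * (X @ w + b)
        return np.mean(margins < 1.0)

    # Illustrative data with labels in {-1, +1}.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
    print(empirical_fat_loss(np.array([2.0, 0.0]), 0.0, X, y))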
In relation to the fat losses the fat shattering dimension acts in the same way as the VC-dimension
in relation to the zero-one loss. In particular, we have the following result that relates LFAT (h) to
L̂FAT (h, S) via dFAT (Hγ ) (the proof is left as an exercise).
Theorem 3.21.
P(∃h ∈ Hγ : LFAT(h) ≥ L̂FAT(h, S) + √(8 ln(2((2n)^{dFAT(Hγ)} + 1)/δ)/n)) ≤ δ.
Now we are ready to analyze generalization in learning with fat linear separation. For the analysis
we make a simplifying assumption that the data are contained within a ball of radius R = 1. The
analysis for general R is left as an exercise. Note that R refers to the radius of the ball after potential
transformation of the data through a feature mapping / kernel function. For example, the RBF kernel
maps the data into an infinite dimensional space and we consider the radius of the ball containing the
transformed data in the infinite dimensional space.
Theorem 3.22. Assume that the input space X is a ball of radius R = 1 in Rd , where d is potentially
infinite. Let H be the space of linear separators h = (w, b). Then
P(∃h ∈ H : LFAT(h) ≥ L̂FAT(h, S) + √(8 ln(2((2n)^{1+⌈‖w‖²⌉} + 1)(1 + ⌈‖w‖²⌉)⌈‖w‖²⌉/δ)/n)) ≤ δ.
Observe that L(h) ≤ LFAT (h) and, therefore, the theorem provides a generalization bound for L(h).
(If we count correct classifications within the margin as errors we only increase the loss.)
Proof. The proof is based on a combination of the VC and Occam's razor bounding techniques, see the
illustration in Figure 3.8. We start by noting that Theorem 3.19 is interesting when R²/γ² < d + 1,
because as we have already noted dFAT(Hγ) ≤ dVC(Hγ) ≤ d + 1. We slice the hypothesis space H into
a nested sequence of subspaces H1 ⊂ H2 ⊂ · · · ⊂ Hd−1 ⊂ Hd = H, where for all i < d we define Hi to
be the hypothesis space Hγ with 1/γ² = i. In other words, Hi = H_{γ=1/√i} (do not let the notation
confuse you: by Hi we denote the i-th hypothesis space in the nested sequence of hypothesis spaces and
by Hγ we denote the hypothesis space with ‖w‖ upper bounded by 1/γ). By Theorem 3.19 we have
dFAT (Hi ) = i + 1 and then by Theorem 3.21:
P(∃h ∈ Hi : LFAT(h) ≥ L̂FAT(h, S) + √(8 ln(2((2n)^{1+i} + 1)/δ_i)/n)) ≤ δ_i.
We take δ_i = δ/(i(i+1)) and note that ∑_{i=1}^∞ 1/(i(i+1)) = ∑_{i=1}^∞ (1/i − 1/(i+1)) = 1 − 1/2 + 1/2 − 1/3 + 1/3 − 1/4 +
· · · = 1. We also note that H = ⋃_{i=1}^d (Hi \ Hi−1), where H0 is defined as the empty set and Hi \ Hi−1 is
the difference between the sets Hi and Hi−1 (everything that is in Hi, but not in Hi−1). Note that the sets
Figure 3.8: Illustration for the proof of Theorem 3.22
Hi \ Hi−1 and Hj \ Hj−1 are disjoint for i ≠ j. Also note that δ_i is a distribution of our confidence budget
δ among the Hi \ Hi−1-s. Finally, note that if h = (w, b) ∈ Hi \ Hi−1 then ⌈‖w‖²⌉ = i. The remainder of
the proof follows the same lines as the proof of Occam’s razor bound:
P(∃h ∈ H : LFAT(h) ≥ L̂FAT(h, S) + √(8 ln(2((2n)^{1+⌈‖w‖²⌉} + 1)(1 + ⌈‖w‖²⌉)⌈‖w‖²⌉/δ)/n))
= P(∃h ∈ ⋃_{i=1}^d (Hi \ Hi−1) : LFAT(h) ≥ L̂FAT(h, S) + √(8 ln(2((2n)^{1+⌈‖w‖²⌉} + 1)(1 + ⌈‖w‖²⌉)⌈‖w‖²⌉/δ)/n))
= ∑_{i=1}^d P(∃h ∈ Hi \ Hi−1 : LFAT(h) ≥ L̂FAT(h, S) + √(8 ln(2((2n)^{1+⌈‖w‖²⌉} + 1)(1 + ⌈‖w‖²⌉)⌈‖w‖²⌉/δ)/n))
= ∑_{i=1}^d P(∃h ∈ Hi \ Hi−1 : LFAT(h) ≥ L̂FAT(h, S) + √(8 ln(2((2n)^{1+i} + 1)(1 + i)i/δ)/n))
= ∑_{i=1}^d P(∃h ∈ Hi \ Hi−1 : LFAT(h) ≥ L̂FAT(h, S) + √(8 ln(2((2n)^{1+i} + 1)/δ_i)/n))
≤ ∑_{i=1}^d P(∃h ∈ Hi : LFAT(h) ≥ L̂FAT(h, S) + √(8 ln(2((2n)^{1+i} + 1)/δ_i)/n))
≤ ∑_{i=1}^d δ_i = ∑_{i=1}^d δ/(i(i+1)) = δ ∑_{i=1}^d 1/(i(i+1)) ≤ δ ∑_{i=1}^∞ 1/(i(i+1)) = δ.
3.7 VC Lower Bound
In this section we show that when the VC-dimension is unbounded, it is impossible to bound the distance
between L(h) and L̂(h, S).
Theorem 3.23. Let H be a hypothesis class with dV C (H) = ∞. Then for any n there exists a distribution
over X and a class of target functions F, such that
E[sup_h (L(h) − L̂(h, S))] ≥ 0.25,
where the expectation is over selection of a sample of size n and a target function.
Proof. Pick n. Since dV C (H) = ∞ we know that there exist 2n points that are shattered by H. Let the
sample space X2n = {x1 , . . . , x2n } be these points and let p(x) be uniform on X2n . Let F be the set of
all possible functions from X2n to {0, 1} and let p(f ) be uniform
S over F. Let S be a sample of n points.
Let {Fk (S)}k be maximal subsets of F, such that F = k Fk (S) and any fi , fj ∈ Fk (S) agree on S.
Note that since X2n is shattered by H, for any S, any Fk , and any fi ∈ Fk that was used to label S
there exists h∗ (Fk (S), S) ∈ H, such that for any fi ∈ Fk (S) the empirical error L̂(h∗ (fi , S), S) = 0. Let
p(k) and p(i) be uniform. Then:
E[sup_h (L(h) − L̂(h, S))] = E_{f∼p(f)}[E_{S∼p(X)^n}[sup_h (L(h) − L̂(h, S)) | f]]
= E_{S∼p(X)^n}[E_{f∼p(f)}[sup_h (L(h) − L̂(h, S)) | S]]
= E_{S∼p(X)^n}[E_{k∼p(k)}[E_{i∼p(i)}[sup_h (L(h) − L̂(h, S)) | F_k] | S]]
≥ E_{S∼p(X)^n}[E_{k∼p(k)}[E_{i∼p(i)}[L(h∗(F_k, S)) − L̂(h∗(F_k, S), S) | F_k] | S]]
= E_{S∼p(X)^n}[E_{k∼p(k)}[E_{i∼p(i)}[L(h∗(F_k, S)) | F_k] | S]]
= E_{S∼p(X)^n}[E_{k∼p(k)}[0.25 | S]]
= 0.25.
Corollary 3.24. Under the assumptions of Theorem 3.23, with probability at least 0.125, suph (L(h) −
L̂(h, S)) ≥ 0.125. Thus, it is impossible to have high-probability bounds on suph (L(h) − L̂(h, S)) that
converge to zero as n goes to infinity.
Proof. Note that sup_h(L(h) − L̂(h, S)) ≤ 1, since ℓ is bounded in [0, 1]. Assume by contradiction that
P(sup_h(L(h) − L̂(h, S)) ≥ 0.125) < 0.125. Then

E[sup_h (L(h) − L̂(h, S))] ≤ 0.125 × 1 + (1 − 0.125) × 0.125 < 2 × 0.125 = 0.25,
inequality has two important advantages over the union bound: (1) it is tighter (you will verify this in
a home assignment) and (2) it can be applied to uncountably infinite hypothesis classes. Furthermore,
soft selection allows application of gradient-descent type methods to optimize the distribution over H,
which in some cases leads to efficient algorithms for direct minimization of the PAC-Bayesian bounds.
Soft selection is implemented by randomized classifiers, which are formally defined below.
Definition 3.25 (Randomized Classifier). Let ρ be a distribution over H. A randomized classifier
associated with ρ (and named ρ) acts according to the following scheme. At each prediction round it:
1. Picks h ∈ H according to ρ(h)
2. Observes x
3. Returns h(x)
The expected loss of ρ is E_{h∼ρ}[L(h)] and the empirical loss is E_{h∼ρ}[L̂(h, S)]. Whenever it does not lead
to confusion, we will shorten the notation to Eρ[L(h)] and Eρ[L̂(h, S)].
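A randomized classifier is immediate to implement; here is a minimal sketch in which `hypotheses` is a hypothetical list of prediction functions and `rho` the corresponding weights (both are illustrative, not from the text):

    import numpy as np

    class RandomizedClassifier:
        """Implements the scheme of Definition 3.25: draw h ~ rho, then predict h(x)."""
        def __init__(self, hypotheses, rho, seed=0):
            self.hypotheses = hypotheses
            self.rho = np.asarray(rho, dtype=float)
            self.rng = np.random.default_rng(seed)

        def predict(self, x):
            h = self.rng.choice(len(self.hypotheses), p=self.rho)  # step 1: pick h ~ rho
            return self.hypotheses[h](x)                           # steps 2-3: observe x, return h(x)

    # Illustrative usage with two toy hypotheses on real-valued inputs.
    clf = RandomizedClassifier([lambda x: 1, lambda x: -1], rho=[0.7, 0.3])
    print([clf.predict(0.0) for _ in range(5)])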
There is a large number of different PAC-Bayesian inequalities. We start with the classical one due
to Seeger (2002).
Theorem 3.26 (PAC-Bayes-kl inequality). For any “prior” distribution π over H that is independent
of S, for all randomized classifiers (distributions) ρ simultaneously:
P(kl(Eρ[L̂(h, S)] ‖ Eρ[L(h)]) ≥ (KL(ρ‖π) + ln((n+1)/δ))/n) ≤ δ.   (3.19)
The meaning of “prior” should be interpreted in exactly the same way as the “prior” in Occam’s
razor bound: it is any distribution over H that sums up to one and does not depend on the sample S.
The prior is an auxiliary construction for deriving the bound and unlike in Bayesian learning there is no
assumption that it reflects any real-world distribution over H.
Before proceeding to the proof of the theorem we provide a discussion of its meaning. To get some
intuition we apply Pinsker’s relaxation of kl (inequality 2.17) that leads to a more digestible (although
weaker) form of the bound: with probability greater than 1 − δ for all ρ simultaneously
Eρ[L(h)] ≤ Eρ[L̂(h, S)] + √((KL(ρ‖π) + ln((n+1)/δ))/(2n)).
Note that when ρ = π the KL term is zero and we recover generalization bound for a single hypothesis.
Taking ρ = π amounts to making no selection. If we start with a prior distribution π and continue with
it without taking any information from the sample we get the usual Hoeffding’s or kl inequality. In order
to get more intuition about the bound we decompose the KL-divergence:
KL(ρ‖π) = Eρ[ln(ρ/π)] = Eρ[ln(1/π)] − H(ρ),

where Eρ[ln(1/π)] is the average complexity and H(ρ) is the entropy of ρ.
If H is finite and π is uniform, then H(ρ) ≥ 0 and KL(ρkπ) = ln |H| − H(ρ) ≤ ln |H| and we recover
generalization bound for finite hypothesis sets with an improvement by − H(ρ). Recall that the entropy
H(ρ) is zero when ρ is a delta-distribution and when ρ is uniform the entropy has its maximal value,
which is ln |H|. Thus, − H(ρ) is an “award” for avoiding commitment to a single hypothesis.
Overall, the PAC-Bayesian inequality advocates for picking ρ that minimizes the trade-off between:
1. The empirical error L̂(h, S).
2. The complexity (description length, prior belief) ln(1/π(h)).
3. And has maximum entropy (it has “indifference” to h and h0 when L̂(h, S) = L̂(h0 , S) and π(h) =
π(h0 )). Maximization of H(ρ) corresponds to avoidance of selection whenever it is not necessary.
Reduced selection leads to improved estimation without impairing the approximation and provides
a tighter generalization bound.
3.8.1 Relation and Differences with other Learning Approaches
PAC-Bayesian analysis has the following relations to and differences from Bayesian learning and from VC
analysis / Rademacher complexities.
where the inequality in the third step is justified by Jensen’s inequality (Theorem B.30). Note that there
is nothing probabilistic in the statement of the theorem - it is a deterministic result.
In the next lemma we extend f to be a function of h and a sample S and apply a probabilistic argument
to the last term of change-of-measure inequality. The lemma is the foundation for most PAC-Bayesian
bounds.
Lemma 3.28 (PAC-Bayes lemma). For any measurable function f : H × (X × Y)n → R and any
distribution π on H that is independent of the sample S
P(∃ρ : E_{h∼ρ}[f(h, S)] ≥ KL(ρ‖π) + ln(E_{h∼π}[E_S[e^{f(h,S)}]]/δ)) ≤ δ,

where the probability is with respect to the draw of the sample S and E_S is the expectation with respect
to the draw of S.
An equivalent way of writing the above statement is

P(∀ρ : E_{h∼ρ}[f(h, S)] ≤ KL(ρ‖π) + ln(E_{h∼π}[E_S[e^{f(h,S)}]]/δ)) ≥ 1 − δ

or, in words, with probability at least 1 − δ over the draw of S, for all ρ simultaneously

Eρ[f(h, S)] ≤ KL(ρ‖π) + ln(Eπ[E_S[e^{f(h,S)}]]/δ).
We first present a slightly less formal, but more intuitive proof and then provide a formal one. By
change of measure inequality we have
Eρ[f(h, S)] ≤ KL(ρ‖π) + ln Eπ[e^{f(h,S)}]
≤ KL(ρ‖π) + ln(E_S[Eπ[e^{f(h,S)}]]/δ)   (with probability ≥ 1 − δ)
= KL(ρ‖π) + ln(Eπ[E_S[e^{f(h,S)}]]/δ),
where in the second line we apply Markov’s inequality to the random variable Z = Eπ ef (h,S) (and
the inequality holds with probability at least 1 − δ) and in the last line we can exchange the order
of expectations, because π is independent of S. The key observation is that the change-of-measure
inequality relates all posterior distributions ρ to a single prior distribution π in a deterministic
way
and
the probabilistic argument (Markov’s inequality) is applied to a single random quantity Eπ ef (h,S) . This
way change-of-measure inequality replaces the union bound and it holds even when H is uncountably
infinite.
Now we provide a formal proof.
Proof.

P(∃ρ : Eρ[f(h, S)] ≥ KL(ρ‖π) + ln(Eπ[E_S[e^{f(h,S)}]]/δ)) ≤ P(Eπ[e^{f(h,S)}] ≥ Eπ[E_S[e^{f(h,S)}]]/δ)   (3.20)
= P(Eπ[e^{f(h,S)}] ≥ E_S[Eπ[e^{f(h,S)}]]/δ)   (3.21)
≤ δ,
where (3.20) follows by change-of-measure inequality (elaborated below), in (3.21) we can exchange the
order of expectations, because π is independent
of S, and in the last step we apply Markov’s inequality
to the random variable Z = Eπ ef (h,S) .
An elaboration regarding Step (3.20). By the change of measure inequality we have that ∀ρ : Eρ[f(h, S)] ≤
KL(ρ‖π) + ln Eπ[e^{f(h,S)}]. From the change of measure inequality we deduce that if Eπ[e^{f(h,S)}] ≤ E_S[Eπ[e^{f(h,S)}]]/δ
then ∀ρ : Eρ[f(h, S)] ≤ KL(ρ‖π) + ln(E_S[Eπ[e^{f(h,S)}]]/δ). Let A denote the event in the if-statement and B
denote the event in the then-statement. Then we have P(A) ≤ P(B) and, therefore, P(Ā) ≥ P(B̄),
where Ā denotes the complement of event A. The complement of A is Eπ[e^{f(h,S)}] > E_S[Eπ[e^{f(h,S)}]]/δ and
the complement of B is ∃ρ : Eρ[f(h, S)] > KL(ρ‖π) + ln(E_S[Eπ[e^{f(h,S)}]]/δ), which gives us the inequality
in Step (3.20) (as usual, we are being a tiny bit sloppy and do not trace which inequalities are strict
and which are weak, with a slight extra effort this could be done, but it does not matter in practice, so
we save the effort). The important point is that the change-of-measure inequality relates all posterior
distributions ρ to a single prior distribution
π in
a deterministic way, and the probabilistic argument is
applied to a single random variable Eπ[e^{f(h,S)}], avoiding the need to take a union bound. This way
the change of measure inequality acts as a replacement of the union bound.
Different PAC-Bayesian inequalities are obtained by different choices of the function f (h, S). A
key consideration in the choice of f(h, S) is the possibility to bound the moment generating function
E_S[e^{f(h,S)}]. For example, we have done it for f(h, S) = n kl(L̂(h, S)‖L(h)) in Lemma 2.14 and this is the
choice of f in the proof of the PAC-Bayes-kl inequality. Other choices of f are possible. For example, Hoeffd-
ing's Lemma 2.6 provides a bound on the moment generating function of f(h, S) = λ(L(h) − L̂(h, S)),
which can be used to derive PAC-Bayes-Hoeffding inequality. We refer to Seldin et al. (2012) for more
details.
The proof of PAC-Bayes-kl inequality relies on convexity of the kl-divergence. We cite the theorem
and refer to Cover and Thomas (2006) for details.
Theorem 3.29 (Cover and Thomas, 2006, Theorem 2.7.2). KL(p‖q) is convex in the pair (p, q); that
is, if (p1, q1) and (p2, q2) are two pairs of probability mass functions, then

KL(λp1 + (1 − λ)p2 ‖ λq1 + (1 − λ)q2) ≤ λ KL(p1‖q1) + (1 − λ) KL(p2‖q2)

for all 0 ≤ λ ≤ 1.
Corollary 3.30.
kl(Eρ[L̂(h, S)] ‖ Eρ[L(h)]) ≤ Eρ[kl(L̂(h, S) ‖ L(h))].
Proof of Theorem 3.26. We provide an intuitive derivation and leave the formal one (as in the proof of
Lemma 3.28) as an exercise.
We take f(h, S) = n kl(L̂(h, S)‖L(h)). Then we have

n kl(Eρ[L̂(h, S)] ‖ Eρ[L(h)]) ≤ Eρ[n kl(L̂(h, S)‖L(h))]
≤ KL(ρ‖π) + ln(Eπ[E_S[e^{n kl(L̂(h,S)‖L(h))}]]/δ)   (with probability ≥ 1 − δ)
≤ KL(ρ‖π) + ln(Eπ[n + 1]/δ)
= KL(ρ‖π) + ln((n + 1)/δ),
where the first inequality is by Corollary 3.30, the second inequality is by the PAC-Bayes Lemma (and it
holds with probability at least 1 − δ over the draw of S), and the third inequality is by Lemma 2.14.
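Since the PAC-Bayes-kl bound is implicit in Eρ[L(h)], in practice it is turned into an explicit bound by numerically inverting the kl, e.g. by bisection. The following sketch is our own illustration of this standard inversion (the helper names and the test values are not from the text):

    import numpy as np

    def kl(p, q):
        """Binary kl divergence kl(p || q) with the 0 log 0 = 0 convention."""
        eps = 1e-12
        p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def kl_upper_inverse(p_hat, z, tol=1e-9):
        """Largest q in [p_hat, 1] with kl(p_hat || q) <= z, found by bisection."""
        lo, hi = p_hat, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if kl(p_hat, mid) <= z:
                lo = mid
            else:
                hi = mid
        return lo

    def pac_bayes_kl_bound(emp_gibbs_loss, kl_rho_pi, n, delta):
        """Upper bound on E_rho[L(h)] implied by Theorem 3.26."""
        z = (kl_rho_pi + np.log((n + 1) / delta)) / n
        return kl_upper_inverse(emp_gibbs_loss, z)

    print(pac_bayes_kl_bound(emp_gibbs_loss=0.1, kl_rho_pi=2.0, n=1000, delta=0.05))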
3.8.4 Relaxation of PAC-Bayes-kl: PAC-Bayes-λ Inequality
Due to its implicit form, the PAC-Bayes-kl inequality is not very convenient for optimization. One way
around this is to replace the bound with a linear trade-off βn Eρ[L̂(h, S)] + KL(ρ‖π). Since KL(ρ‖π) is convex
in ρ and Eρ[L̂(h, S)] is linear in ρ, for a fixed β the trade-off is convex in ρ and can be minimized. (We
note that parametrization of ρ, for example the popular restriction of ρ to a Gaussian posterior (Langford,
note that parametrization of ρ, for example the popular restriction of ρ to a Gaussian posterior (Langford,
2005), may easily break the convexity (Germain et al., 2009). We get back to this point in Section 3.8.6.)
The value of β can then be tuned by cross-validation or substitution of ρ(β) into the bound (the former
usually works better).
Below we present a more rigorous approach. We prove the following relaxation of PAC-Bayes-kl
inequality, which leads to a bound that can be optimized by alternating minimization.
Theorem 3.31 (PAC-Bayes-λ Inequality). For any probability distribution π over H that is independent
of S and any δ ∈ (0, 1), with probability greater than 1 − δ over a random draw of a sample S, for all
distributions ρ over H and all λ ∈ (0, 2) and γ > 0 simultaneously:
Eρ[L(h)] ≤ Eρ[L̂(h, S)]/(1 − λ/2) + (KL(ρ‖π) + ln((n+1)/δ))/(λ(1 − λ/2)n),   (3.22)
Eρ[L(h)] ≥ (1 − γ/2) Eρ[L̂(h, S)] − (KL(ρ‖π) + ln((n+1)/δ))/(γn).   (3.23)
At the moment we focus on the upper bound in equation (3.22). Note that the theorem holds for all
values of λ ∈ (0, 2) simultaneously. Therefore, we can optimize the bound with respect to λ and pick the
best one.
Proof. We prove the upper bound in equation (3.22). Proof of the lower bound (3.23) is analogous and
left as an exercise. Proof of the statement that the upper and lower bounds hold simultaneously (require
no union bound) is also left as an exercise.
By the refined Pinsker's inequality in Corollary 2.19, for p < q we have q − p ≤ √(2q kl(p‖q)).   (3.24)
By PAC-Bayes-kl inequality, Theorem 3.26, with probability greater than 1 − δ for all ρ simultaneously
kl(Eρ[L̂(h, S)] ‖ Eρ[L(h)]) ≤ (KL(ρ‖π) + ln((n+1)/δ))/n.
By application of inequality (3.24), the above inequality can be relaxed to
Eρ[L(h)] − Eρ[L̂(h, S)] ≤ √(2 Eρ[L(h)] (KL(ρ‖π) + ln((n+1)/δ))/n).   (3.25)
We have that

min_{λ>0} (λx + y/λ) = 2√(xy)

(we leave this statement as a simple exercise). Thus, √(xy) ≤ (1/2)(λx + y/λ) for all λ > 0 and by applying
this inequality to (3.25) we have that with probability at least 1 − δ for all ρ and λ > 0
Eρ[L(h)] − Eρ[L̂(h, S)] ≤ (λ/2) Eρ[L(h)] + (KL(ρ‖π) + ln((n+1)/δ))/(λn).
By rearranging terms,

(1 − λ/2) Eρ[L(h)] ≤ Eρ[L̂(h, S)] + (KL(ρ‖π) + ln((n+1)/δ))/(λn).
For λ < 2 we can divide both sides by 1 − λ2 and obtain the theorem statement.
3.8.5 Alternating Minimization of PAC-Bayes-λ Bound
We use the term PAC-Bayes-λ bound to refer to the right hand side of PAC-Bayes-λ inequality. A
great advantage of the PAC-Bayes-λ bound is that it can be conveniently minimized by alternating
minimization with respect to ρ and λ. Since Eρ[L̂(h, S)] is linear in ρ and KL(ρ‖π) is convex in ρ
(Cover and Thomas, 2006), for a fixed λ the bound is convex in ρ and the minimum is achieved by

ρ(h) = π(h) e^{−λn L̂(h,S)} / Eπ[e^{−λn L̂(h′,S)}],   (3.26)

where Eπ[e^{−λn L̂(h′,S)}] is a convenient way of writing the normalization factor, which covers continuous
and discrete hypothesis spaces in a unified notation. In the discrete case, which will be of main interest
for us, Eπ[e^{−λn L̂(h′,S)}] = ∑_{h′∈H} π(h′) e^{−λn L̂(h′,S)}. We leave a proof of the statement that (3.26) defines
the ρ which achieves the minimum of the bound as an exercise to the reader. Furthermore, for t ∈ (0, 1) and
a, b ≥ 0 the function a/(1−t) + b/(t(1−t)) is convex in t (Tolstikhin and Seldin, 2013) and, therefore, for a fixed
ρ the right hand side of inequality (3.22) is convex in λ for λ ∈ (0, 2) and the minimum is achieved by

λ = 2/(√(2n Eρ[L̂(h, S)]/(KL(ρ‖π) + ln((n+1)/δ)) + 1) + 1).   (3.27)
Note that the optimal value of λ is smaller than 1. Alternating application of update rules (3.26) and
(3.27) monotonically decreases the bound, and thus converges.
We note that while the right hand side of inequality (3.22) is convex in ρ for a fixed λ and convex in
λ for a fixed ρ, it is not simultaneously convex in ρ and λ. Joint convexity would have been a sufficient,
but it is not a necessary condition for convergence of alternating minimization to the global minimum
of the bound. Thiemann et al. (2017) provide sufficient conditions under which the procedure converges
to the global minimum, as well as examples of situations where this does not happen.
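The alternating minimization just described takes only a few lines of code over a finite hypothesis set. The sketch below is a minimal illustration under the assumption that the empirical losses and the prior are given as arrays (the numbers in the usage example are placeholders); it applies (3.26) and (3.27) in turn:

    import numpy as np

    def minimize_pac_bayes_lambda(emp_loss, prior, n, delta, iters=100):
        """Alternating minimization of the PAC-Bayes-lambda bound (3.22)
        over a finite hypothesis set, using update rules (3.26) and (3.27)."""
        emp_loss = np.asarray(emp_loss, dtype=float)
        prior = np.asarray(prior, dtype=float)
        lam = 1.0
        log_term = np.log((n + 1) / delta)
        for _ in range(iters):
            # Update rho for fixed lambda (rule 3.26), in log-space for numerical stability.
            log_rho = np.log(prior) - lam * n * emp_loss
            log_rho -= log_rho.max()
            rho = np.exp(log_rho)
            rho /= rho.sum()
            # Update lambda for fixed rho (rule 3.27).
            kl_rho_pi = np.sum(rho * (np.log(np.maximum(rho, 1e-300)) - np.log(prior)))
            gibbs = np.dot(rho, emp_loss)
            lam = 2.0 / (np.sqrt(2.0 * n * gibbs / (kl_rho_pi + log_term) + 1.0) + 1.0)
            bound = gibbs / (1 - lam / 2) + (kl_rho_pi + log_term) / (lam * (1 - lam / 2) * n)
        return rho, lam, bound

    rho, lam, bound = minimize_pac_bayes_lambda(
        emp_loss=[0.10, 0.08, 0.05], prior=[1/3, 1/3, 1/3], n=1000, delta=0.05)
    print(rho, lam, bound)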
It is natural, but not mandatory to select a uniform prior π(h) = 1/m. The bound in equation (3.28)
can be minimized by alternating application of the update rules in equations (3.26) and (3.27) with n
being replaced by n − r and L̂ by L̂val . For evaluation of the empirical performance of this learning
approach see Thiemann et al. (2017).
where ∧ represents the logical “and” operation and ties can be resolved arbitrarily.
In binary prediction with prediction space h(X) ∈ {±1} weighted majority vote can be written as
MVρ (X) = sign (Eρ [h(X)]) ,
where sign(x) = 1 if x > 0 and −1 otherwise (the value of sign(0) can be defined arbitrarily). For a
countable hypothesis space this becomes
MVρ(X) = sign(∑_{h∈H} ρ(h) h(X)).
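In code, the ρ-weighted majority vote over a finite set of binary classifiers is just the sign of a weighted sum. A minimal sketch (the prediction matrix and weights below are hypothetical examples):

    import numpy as np

    def weighted_majority_vote(predictions, rho):
        """MV_rho(X) = sign(sum_h rho(h) * h(X)) for binary +-1 predictions.

        predictions: array of shape (num_hypotheses, num_points) with entries in {-1, +1}.
        rho: non-negative weights of the hypotheses, summing to one."""
        votes = np.asarray(rho, dtype=float) @ np.asarray(predictions, dtype=float)
        return np.where(votes > 0, 1, -1)  # sign(0) resolved as -1, which the text allows

    # Illustrative usage: three hypotheses, four points.
    preds = np.array([[1, 1, -1, -1],
                      [1, -1, -1, 1],
                      [1, 1, 1, -1]])
    print(weighted_majority_vote(preds, rho=[0.5, 0.25, 0.25]))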
3.9.2 First Order Oracle Bound for the Weighted Majority Vote
If majority vote makes an error, we know that at least a ρ-weighted half of the classifiers have made
an error and, therefore, ℓ(MVρ(X), Y) ≤ 1(Eρ[1(h(X) ≠ Y)] ≥ 0.5). This observation leads to the
well-known first order oracle bound for the loss of weighted majority vote.
Theorem 3.33 (First Order Oracle Bound).
L(MVρ ) ≤ 2Eρ [L(h)].
Proof. We have L(MVρ) = E_D[ℓ(MVρ(X), Y)] ≤ P(Eρ[1(h(X) ≠ Y)] ≥ 0.5). By applying Markov's
inequality to the random variable Z = Eρ[1(h(X) ≠ Y)] we have:

L(MVρ) ≤ P(Eρ[1(h(X) ≠ Y)] ≥ 0.5) ≤ 2 E_D[Eρ[1(h(X) ≠ Y)]] = 2 Eρ[L(h)].
PAC-Bayesian analysis can be used to bound Eρ [L(h)] in Theorem 3.33 in terms of Eρ [L̂(h, S)], thus
turning the oracle bound into an empirical one. The disadvantage of the first order approach is that
Eρ [L(h)] ignores correlations of predictions, which is the main power of the majority vote.
3.9.3 Second Order Oracle Bound for the Weighted Majority Vote
Now we present a second order bound for the weighted majority vote, which is based on a second order
Markov’s inequality: for a non-negative random variable Z and ε > 0, we have P(Z ≥ ε) = P Z 2 ≥ ε2 ≤
ε−2 E Z 2 . We define tandem loss of two hypotheses h and h0 by
The tandem loss counts an error on a sample (X, Y ) only if both h and h0 err on (X, Y ). We define the
expected tandem loss by
L(h, h′) = E_D[1(h(X) ≠ Y ∧ h′(X) ≠ Y)].
The following lemma relates the expectation of the second moment of the standard loss to the expected
tandem loss. We use the shorthand Eρ²[L(h, h′)] = E_{h∼ρ,h′∼ρ}[L(h, h′)].
Lemma 3.34. In multiclass classification

E_D[Eρ[1(h(X) ≠ Y)]²] = Eρ²[L(h, h′)].

Proof. E_D[Eρ[1(h(X) ≠ Y)]²] = E_D[E_{h∼ρ}[1(h(X) ≠ Y)] E_{h′∼ρ}[1(h′(X) ≠ Y)]] = E_D[Eρ²[1(h(X) ≠ Y ∧ h′(X) ≠ Y)]] = Eρ²[L(h, h′)].
A combination of second order Markov’s inequality with Lemma 3.34 leads to the following result.
Theorem 3.35 (Second Order Oracle Bound). In multiclass classification L(MVρ) ≤ 4 Eρ²[L(h, h′)].
Proof. By the second order Markov's inequality applied to Z = Eρ[1(h(X) ≠ Y)] and Lemma 3.34:

L(MVρ) ≤ P(Eρ[1(h(X) ≠ Y)] ≥ 0.5) ≤ 4 E_D[Eρ[1(h(X) ≠ Y)]²] = 4 Eρ²[L(h, h′)].
Proof of Lemma 3.36. Picking from (3.29), we have
Proof. The theorem follows by plugging the result of Lemma 3.36 into Theorem 3.35.
The advantage of the alternative way of writing the bound is the possibility of using unlabeled data
for estimation of D(h, h0 ) in binary prediction (see also Germain et al., 2015). We note, however, that
estimation of Eρ2 [D(h, h0 )] has a slow convergence rate, as opposed to Eρ2 [L(h, h0 )], which has a fast
convergence rate. We discuss this point in Section 3.9.7.
The worst case Since Eρ²[L(h, h′)] ≤ Eρ[L(h)] the second order bound is at most a factor of two larger than the
first order bound. The worst case happens, for example, if all hypotheses in H give identical predictions.
Then Eρ2 [L(h, h0 )] = Eρ [L(h)] = L(MVρ ) for all ρ.
The best case Imagine that H consists of M ≥ 3 hypotheses, such that each hypothesis errs on 1/M
of the sample space (according to the distribution D) and that the error regions are disjoint. Then
L(h) = 1/M for all h and L(h, h′) = 0 for all h ≠ h′ and L(h, h) = 1/M. For a uniform distribution ρ on
H the first order bound is 2Eρ[L(h)] = 2/M and the second order bound is 4Eρ²[L(h, h′)] = 4/M² and
L(MVρ ) = 0. In this case the second order bound is an order of magnitude tighter than the first order.
The independent case Assume that all hypotheses in H make independent errors and have the same
error rate, L(h) = L(h′) for all h and h′. Then for h ≠ h′ we have L(h, h′) = E_D[1(h(X) ≠ Y ∧ h′(X) ≠ Y)] =
E_D[1(h(X) ≠ Y) 1(h′(X) ≠ Y)] = E_D[1(h(X) ≠ Y)] E_D[1(h′(X) ≠ Y)] = L(h)² and L(h, h) = L(h).
For a uniform distribution ρ the second order bound is 4Eρ²[L(h, h′)] = 4(L(h)² + (1/M) L(h)(1 − L(h))) and
the first order bound is 2Eρ [L(h)] = 2L(h). Assuming that M is large, so that we can ignore the second
term in the second order bound, we obtain that it is tighter for L(h) < 1/2 and looser otherwise. The
former is the interesting regime, especially in binary classification.
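A tiny numeric check of the two oracle bounds in the disjoint-errors example above (pure arithmetic matching the text's M-hypothesis construction; the value of M is illustrative):

    import numpy as np

    M = 10                                   # number of hypotheses with disjoint error regions
    L = np.full(M, 1.0 / M)                  # individual losses L(h) = 1/M
    tandem = np.eye(M) / M                   # L(h, h) = 1/M, L(h, h') = 0 for h != h'
    rho = np.full(M, 1.0 / M)                # uniform rho

    first_order = 2 * rho @ L                # 2 E_rho[L(h)]       = 2/M
    second_order = 4 * rho @ tandem @ rho    # 4 E_rho^2[L(h, h')] = 4/M^2
    print(first_order, second_order)         # 0.2 vs 0.04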
3.9.5 Second Order PAC-Bayesian Bounds for the Weighted Majority Vote
Now we provide an empirical bound for the weighted majority vote. We define the empirical tandem loss
L̂(h, h′, S) = (1/n) ∑_{i=1}^n 1(h(X_i) ≠ Y_i ∧ h′(X_i) ≠ Y_i)
and provide a bound on the expected loss of ρ-weighted majority vote in terms of the empirical tandem
losses.
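The empirical tandem losses for all pairs of hypotheses can be computed from the matrix of individual errors; a minimal sketch (the error matrix and weights are hypothetical inputs, not data from the text):

    import numpy as np

    def tandem_loss_matrix(errors):
        """Empirical tandem losses L_hat(h, h', S) for all pairs.

        errors: 0/1 array of shape (num_hypotheses, n) with errors[h, i] = 1(h(X_i) != Y_i).
        Entry (h, h') of the result is (1/n) * sum_i 1(h errs on i AND h' errs on i)."""
        errors = np.asarray(errors, dtype=float)
        n = errors.shape[1]
        return errors @ errors.T / n

    def gibbs_tandem_loss(errors, rho):
        """E_{rho^2}[L_hat(h, h', S)], the quantity appearing in Theorem 3.38."""
        rho = np.asarray(rho, dtype=float)
        return rho @ tandem_loss_matrix(errors) @ rho

    # Illustrative usage with random 0/1 errors.
    rng = np.random.default_rng(0)
    errors = (rng.random((3, 200)) < 0.2).astype(int)
    print(gibbs_tandem_loss(errors, rho=[1/3, 1/3, 1/3]))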
Theorem 3.38. For any probability distribution π on H that is independent of S and any δ ∈ (0, 1),
with probability at least 1 − δ over a random draw of S, for all distributions ρ on H and all λ ∈ (0, 2)
simultaneously:
L(MVρ) ≤ 4 (Eρ²[L̂(h, h′, S)]/(1 − λ/2) + (2 KL(ρ‖π) + ln(2√n/δ))/(λ(1 − λ/2)n)).
Proof. The theorem follows by using the bound in equation (3.22) to bound Eρ2 [L(h, h0 )] in Theorem 3.35.
We note that KL(ρ2 kπ 2 ) = 2 KL(ρkπ) (Germain et al., 2015, Page 814).
It is also possible to use PAC-Bayes-kl to bound Eρ2 [L(h, h0 )] in Theorem 3.35, which actually gives a
tighter bound, but the bound in Theorem 3.38 is more convenient for minimization. We refer the reader
to Masegosa et al. (2020) for a procedure for bound minimization.
where S 0 = {X1 , . . . , Xm }. The set S 0 may overlap with the labeled set S, however, S 0 may include
additional unlabeled data. The following theorem bounds the loss of weighted majority vote in terms of
empirical disagreements. Due to possibility of using unlabeled data for estimation of disagreements in
the binary case, the theorem has the potential of yielding a tighter bound when a considerable amount
of unlabeled data is available.
Theorem 3.39. In binary classification, for any probability distribution π on H that is independent of
S and S 0 and any δ ∈ (0, 1), with probability at least 1 − δ over a random draw of S and S 0 , for all
distributions ρ on H and all λ ∈ (0, 2) and γ > 0 simultaneously:
L(MVρ) ≤ 4 (Eρ[L̂(h, S)]/(1 − λ/2) + (KL(ρ‖π) + ln(4√n/δ))/(λ(1 − λ/2)n))
− 2 ((1 − γ/2) Eρ²[D̂(h, h′, S′)] − (2 KL(ρ‖π) + ln(4√m/δ))/(γm)).
Proof. The theorem follows by using the upper bound in equation (3.22) to bound Eρ [L(h)] and the
lower bound in equation (3.23) to bound Eρ2 [D(h, h0 )] in Theorem 3.37. We replace δ by δ/2 in the
upper and lower bound and take a union bound over them.
Using PAC-Bayes-kl to bound Eρ [L(h)] and Eρ2 [D(h, h0 )] in Theorem 3.37 gives a tighter bound, but
the bound in Theorem 3.39 is more convenient for minimization. We refer to Masegosa et al. (2020) for
a procedure for bound minimization.
3.9.7 Comparison of the Empirical Bounds
We provide a high-level comparison of the empirical first order bound (FO), the empirical second order
bound based on the tandem loss (TND, Theorem 3.38), and the new empirical second order bound based
on disagreements (DIS, Theorem 3.39). The two key quantities in the comparison are the sample size n
in the denominator of the bounds and fast and slow convergence rates for the standard (first order) loss,
the tandem loss, and the disagreements. Tolstikhin and Seldin (2013) have shown that if we optimize λ
for a given ρ, the PAC-Bayes-λ bound in equation (3.22) can be written as
Eρ[L(h)] ≤ Eρ[L̂(h, S)] + √(2 Eρ[L̂(h, S)] (KL(ρ‖π) + ln(2√n/δ))/n) + 2 (KL(ρ‖π) + ln(2√n/δ))/n.
This form of the bound, also used by McAllester (2003), is convenient for explanation of fast and slow
rates. If Eρ[L̂(h, S)] is large, then the middle term on the right hand side dominates the complexity and
the bound decreases at the rate of 1/√n, which is known as a slow rate. If Eρ[L̂(h, S)] is small, then the
last term dominates and the bound decreases at the rate of 1/n, which is known as a fast rate.
FO vs. TND The advantage of the FO bound is that the validation sets S \Sh available for estimation of
the first order losses L̂(h, Sh ) are larger than the validation sets (S \Sh )∩(S \Sh0 ) available for estimation
of the tandem losses. Therefore, the denominator nmin = minh |S \ Sh | in the FO bound is larger than
the denominator nmin = minh,h0 |(S \ Sh ) ∩ (S \ Sh0 )| in the TND bound. The TND disadvantage can
be reduced by using data splits with large validation sets S \ Sh and small training sets Sh , as long as
small training sets do not overly impact the quality of base classifiers h. Another advantage of the FO
bound is that its complexity term has KL(ρkπ), whereas the TND bound has 2 KL(ρkπ). The advantage
of the TND bound is that Eρ2 [L(h, h0 )] ≤ Eρ [L(h)] and, therefore, the convergence rate of the tandem
loss is typically faster than the convergence rate of the first order loss. The interplay of the estimation
advantages and disadvantages, combined with the advantages and disadvantages of the underlying oracle
bounds discussed in Section 3.9.4, depends on the data and the hypothesis space.
TND vs. DIS The advantage of the DIS bound relative to the TND bound is that in presence of a
large amount of unlabeled data the disagreements D(h, h0 ) can be tightly estimated (the denominator
m is large) and the estimation complexity is governed by the first order term, Eρ[L(h)], which is “easy”
to estimate, as discussed above. However, the DIS bound has two disadvantages. A minor one is its
reliance on estimation of two quantities, Eρ[L(h)] and Eρ²[D(h, h′)], which requires a union bound, e.g.,
replacement of δ by δ/2. A more substantial one is that the disagreement term is desired to be large,
and thus has a slow convergence rate. Since the slow convergence rate relates to the fast convergence rate as
1/√n to 1/n, as a rule of thumb the DIS bound is expected to outperform TND only when the amount
of unlabeled data is at least quadratic in the amount of labeled data, m > n².
For experimental comparison of the bounds and further details we refer the reader to Masegosa et al.
(2020).
Chapter 4
In this chapter we consider the regression problem, which is another special case of supervised learning
with X = Rd and Y = R.
Let X be the matrix whose rows are the transposed sample points,

X = [— x₁ᵀ —; … ; — xₙᵀ —],

and let y = (y₁, . . . , yₙ)ᵀ be the vector of labels. We are looking for w that minimizes the empirical loss

L̂(w, S) = ∑_{i=1}^n ℓ(wᵀx_i, y_i) = ∑_{i=1}^n (wᵀx_i − y_i)² = ‖Xw − y‖².
When the number of constraints n (the number of points in S) is larger than the number of unknowns
d (the number of entries in w), most often the linear system Xw = y has no solutions (unless y by chance
falls in the linear span of the columns of X). Therefore, we are looking for the best approximation of
y by a linear combination of the columns of X, which means that we are looking for a projection of y
onto the column space of X. There are two ways to define projections, analytical and algebraic, which
lead to two ways of solving the problem. In the analytical formulation the projection is a point of a form
Xw that has minimal distance to y. In the algebraic formulation the projection is a vector Xw that is
perpendicular to the remainder y − Xw. We present both ways in detail below.
Figure 4.1: Illustration of algebraic solution of linear least squares.
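Numerically, both views lead to the same computation: the projection solves the normal equations XᵀXw = Xᵀy, and in practice a least-squares solver is used. A minimal sketch with illustrative data (not from the text):

    import numpy as np

    def least_squares(X, y):
        """Solve min_w ||Xw - y||^2; lstsq also handles rank-deficient X."""
        w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
        return w

    # Illustrative data: n = 50 points in R^3 with noisy linear labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=50)
    w_hat = least_squares(X, y)
    print(w_hat)                      # close to w_true

    # Algebraic view: the residual is (numerically) orthogonal to the columns of X.
    print(X.T @ (X @ w_hat - y))      # approximately the zero vector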
4.1.4 Using Linear Least Squares for Learning Coefficients of Non-linear
Models
Linear Least Squares can be used for learning coefficients of non-linear models. For example, assume
that we want to fit our data S = {(x1 , y1 ), . . . , (xn , yn )} (where both xi -s and yi -s are real numbers) with
a polynomial of degree d. I.e., we want to have a model of the form y = a_d x^d + a_{d−1} x^{d−1} + · · · + a_1 x + a_0.
All we have to do is to map our features x_i into feature vectors x_i → (x_i^d, x_i^{d−1}, . . . , x_i, 1) and apply
linear least squares to the following system:

[ x₁^d  x₁^{d−1}  …  x₁  1 ] [ a_d     ]   [ y₁ ]
[ x₂^d  x₂^{d−1}  …  x₂  1 ] [ a_{d−1} ]   [ y₂ ]
[  ⋮      ⋮            ⋮  ⋮ ] [   ⋮     ] = [  ⋮ ]
[ xₙ^d  xₙ^{d−1}  …  xₙ  1 ] [ a₀      ]   [ yₙ ]

to get the parameter vector (a_d, a_{d−1}, . . . , a₁, a₀)ᵀ.
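A sketch of this recipe in code, with illustrative data (the feature map is exactly the one above; numpy's Vandermonde matrix uses the same decreasing-power ordering):

    import numpy as np

    def fit_polynomial(x, y, degree):
        """Fit y ≈ a_d x^d + ... + a_1 x + a_0 by linear least squares on mapped features."""
        # Map each scalar x_i to the feature vector (x_i^d, x_i^{d-1}, ..., x_i, 1).
        A = np.vander(x, N=degree + 1)
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
        return coeffs                          # (a_d, a_{d-1}, ..., a_1, a_0)

    # Illustrative usage: noisy samples of a cubic polynomial.
    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 30)
    y = 2 * x**3 - x + 0.5 + 0.05 * rng.normal(size=x.shape)
    print(fit_polynomial(x, y, degree=3))      # approximately (2, 0, -1, 0.5)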
Chapter 5
Online Learning
So far in these notes we have considered batch learning. In batch learning we start with some data, we
analyze it, and then we “ship the result of the analysis into the world”. It can be a fixed classifier h, a
distribution over classifiers ρ, or anything else, the important point is that it does not change from the
moment we are done with training. It takes no new information into account. This is also the reason
why we had to assume that new samples come from the same distribution as the samples in the training
set, because the classifier was not designed to adapt.
Online learning is a learning framework, where data collection, analysis, and application of inferred
knowledge are in a perpetual loop, see Figure 5.1. Examples of problems, which fit into this framework
include:
• Investment in the stock market.
• And so on ...
The recurrent nature of online learning problems makes them closely related to repeated games. They
also borrow some of the terminology from game theory, including calling the problems games and
every “Act - Observe - Analyze” cycle a game round. In general, we may need online learning in the
following cases:
Figure 5.2: The Space of Online Learning Problems.
• Interactive learning: we are in a situation, where we continuously get new information and taking
it into account may improve the quality of our actions. Many online applications on the Internet
fall under this category.
• Adversarial or game-theoretic settings: we cannot assume that “the future behaves similarly to the
past”. For example, in spam filtering we cannot assume that new spam messages are generated
from the same distribution as the old ones. Or, in playing chess we cannot assume that the moves
of the opponent are sampled i.i.d..
As with many other problems in computer science, having loops makes things much more challenging, but
also much richer and more fun.1 For example, online learning allows us to treat adversarial environments,
which is impossible to do in the batch setting.
the design and analysis of sampling experiments in which the size and composition of the samples are completely determined
before the experimentation begins. The reasons for this are partly historical, dating back to the time when the statistician
was consulted, if at all, only after the experiment was over, and partly intrinsic in the mathematical difficulty of working
with anything but a fixed number of independent random variables. A major advance now appears to be in the making
with the creation of a theory of the sequential design of experiments, in which the size and composition of the samples are
not fixed in advance but are functions of the observations themselves.”
Feedback
Feedback refers to the amount of information that the algorithm receives on every round of interaction
with the environment. The most basic forms of feedback are full information and limited (better known
as bandit 2 ) feedback.
A classical example of a full information game is investment in the stock market. On every round
of this game we distribute wealth over a set of stocks and the next day we observe the rates of all
stocks, which is the full information. With full information we can evaluate the quality of our investment
strategy, as well as any alternative investment strategy.
A classical example of a bandit feedback game are medical treatments. We have a set of actions (in
this case treatments), but we can only apply one treatment to a given patient. We only observe the
outcome of the applied treatment, but not of any alternative treatment, thus we have limited feedback.
With limited feedback we only know the quality of the selected strategy, but we cannot directly evaluate
the quality of alternative strategies we could have selected. This leads to the exploration-exploitation
trade-off, which is the trade-mark signature of online learning. The essence of the exploration-exploitation
trade-off is that in order to estimate the quality of actions we have to try them out (to explore). If we
explore too little, we risk missing some good actions and end up performing suboptimally. However,
exploration has a cost, because trying out suboptimal actions for too long is also undesirable. The
goal is to balance exploration (trying new actions) with exploitation, which is taking actions, which
are currently believed to be the optimal ones. The “Act-Observe-Analyze” cycle comes into play here,
because unlike in batch learning the training set is not given, but is built by the algorithm for itself: if
we do not try an action we get no data from it.
There are many other problems that fall within bandit feedback framework, most notably online
advertizing. A simplistic way of modeling online advertizing is assuming that there is a pool of adver-
tizements, but on every round of the game we are only allowed to show one advertisement to a user.
Since we only observe feedback for the advertisement that was presented, the problem can be formulated
as an online learning problem with bandit feedback.
There are other feedback models, which we will only touch briefly. In the bandit feedback model the
algorithm observes a noisy estimate of the quality of selected action, for example, whether an advertise-
ment was clicked or not. In partial feedback model studied under partial monitoring the feedback has
some relation to the action, but not necessarily its quality. For example, in dynamic pricing we only
observe whether a proposed price was above or below the value of a product for a buyer, but we do not
observe the maximal price we could get for the product. Bandit feedback is a special case of partial
feedback, where the observation is the value. Another example is dueling bandit feedback, where the
feedback is a relative preference over a pair of items rather than the absolute value of the items. For
example, an answer to the question “Do you prefer fish or chicken?” is an example of dueling bandit
feedback. Dueling bandit feedback model is used in information retrieval systems, since humans are
much better in providing relative preferences rather than absolute utility values.
Environmental Resistance
Environmental resistance is concerned with how much the environment resists the algorithm. Two
classical examples are i.i.d. (a.k.a. stochastic) and adversarial environments. An example of an i.i.d.
environment is the weather. It has a high degree of uncertainty, but it does not play against the
algorithm. Another example of an i.i.d. environment are outcomes of medical treatments. Here also
there is uncertainty in the outcomes, but the patients are not playing against the algorithm. An example
of an adversarial environment is spam filtering. Here the spammers are deliberately changing the distribution
of spam messages in order to outplay the spam filtering algorithm. Another classical example of an
adversarial environment is the stock market. Even though the stock market does not play directly against
an individual investor (assuming the investments are small), it is not stationary, because any regularity
in the market would be exploited by other investors and would disappear.
The environment may also be collaborative, for example, when several agents are jointly solving
a common task. Yet another example is slowly changing environments, where the parameters of a
distribution are slowly changing with time.

2 “The name derives from an imagined slot machine . . . . (Ordinary slot machines with one arm are one-armed bandits,
since in the long run they are as effective as human bandits in separating the victim from his money.)” (Lai and Robbins,
1985)
Structural Complexity
In structural complexity we distinguish between stateless problems, contextualized problems (or problems
with state), and Markov decision processes. In stateless problems actions are taken based only on the
history of past outcomes, without any additional information. In contextualized problems on
every round of the game the algorithm observes a context (or state) and takes an action within the
observed context. An example of context is a medical record of a patient or, in the advertising example,
the parameters of the advertisement and the user.
Markov decision processes are concerned with processes with evolving state. The difference between
contextualized problems and Markov decision processes is that in the former the actions of the algorithm
do not influence the next state, whereas in the latter they do. For example, subsequent treatments of
the same patient are changing his or her state and, therefore, depend on each other. In contrast, in
subsequent treatments of different patients treatment of one patient does not influence the state of the
next patient and, thus, can be modeled as a contextualized problem.
Markov decision processes are studied within the field of reinforcement learning. There is no clear
cut distinction between online learning and reinforcement learning and one could be seen as a subfield
of another or the other way around. But as a rule of thumb, problems involving evolution of states,
such as Markov decision processes, are part of reinforcement learning and problems that do not involve
evolution of states are part of online learning.
One of the challenges in Markov decision processes is delayed feedback. It refers to the fact that, unlike
in stateless and contextualized problems, the quality of an action cannot be evaluated instantaneously.
The reason is that actions are changing the state, which may lead to long-term consequences. Consider
a situation of sitting in a bar, where every now and then a waiter comes and asks whether you want
another beer. If you take a beer you probably feel better than if you do not, but then eventually if you
take too much you will feel very bad the next morning, whereas if you do not you may feel excellent. As
before, things get more challenging, but also more exciting, when there are loops in the state space.
In Markov decision processes we distinguish between estimation and planning. Estimation is the
same problem as in other online learning problems - the outcomes of actions are unknown and we have
to estimate them. However, in Markov decision processes even if the immediate outcomes of various
actions are known, the identity of the best action in each state may still not be evident due to the
long-term consequences. This problem is addressed by planning.
There are many other online learning problems, which do not fit directly into Figure 5.2, but can still
be discussed in terms of feedback, environmental resistance, and structural complexity. For example, in
combinatorial bandits the goal is to select a set of actions, potentially with some constraints, and the
quality of the set is evaluated jointly. An instance of a combinatorial bandit problem is selection of a path
in a graph, such as a communication or transport network. In this case an action can be decomposed into
sub-actions corresponding to selection of edges in the graph. The goal is to minimize the length of a path,
which may correspond to the delay between the source and the target nodes. Various forms of feedback
can be considered, including bandit feedback, where the total length of the path is observed; semi-bandit
feedback, where the length of each of the selected edges is observed; cascading bandit feedback, where
the lengths of the edges are observed in a sequence until a terminating node (e.g., a server that is down)
or the target is reached; or a full information feedback, where the length of all edges is observed.
In the following sections we consider in detail a number of the most basic online learning problems.
Figure 5.3: The four basic online learning problems.
5.2 A General Basic Setup

Notations We are given a $K \times \infty$ matrix of losses $\ell_{t,a}$, where $t \in \{1, 2, \ldots\}$, $a \in \{1, \ldots, K\}$, and $\ell_{t,a} \in [0, 1]$. Row $a$ of the matrix contains the losses $\ell_{1,a}, \ell_{2,a}, \ldots, \ell_{t,a}, \ldots$ of action $a$, with time running to the right.
Game Protocol
For t = 1, 2, . . . :
1. Pick a row At
2. Suffer `t,At
3. Observe . . . [the observations are defined below]
Definition of the four games There are two common ways to generate the matrix of losses. The
first is to sample `t,a -s independently, so that the mean of the losses in each row is fixed, E [`t,a ] = µ(a).
The second is to generate `t,a -s arbitrarily. The second model of generation of losses is known as an
oblivious adversary, since the generation happens before the game starts and thus does not take actions
of the algorithm into account.3
There are also two common ways to define the observations. After picking a row in round t the
algorithm may observe either the full column `t,1 , . . . , `t,K or just the selected entry `t,At . Jointly the
two ways of generating the matrix of losses and the two ways of defining the observations generate four
variants of the game, summarized in the following table.

Matrix generation \ Observations                    Observe $\ell_{t,1}, \ldots, \ell_{t,K}$        Observe $\ell_{t,A_t}$
I.I.D.: $\ell_{t,a}$-s are sampled i.i.d.
with $\mathbb{E}[\ell_{t,a}] = \mu(a)$              Prediction with expert advice                   Stochastic multiarmed bandits
Adversarial: $\ell_{t,a}$-s are selected
arbitrarily (by an adversary)                       Prediction with expert advice (adversarial)     Adversarial multiarmed bandits
3 It is also possible to consider an adaptive adversary, which generates losses as the game proceeds and takes past actions
of the algorithm into account. We do not discuss this model in the lecture notes.
Performance Measure The goal of the algorithm is to play so that the loss it suffers will not be
significantly larger than the loss of the best row in hindsight. There are several ways to formalize this
goal. The basic performance measure is the regret defined by
$$R_T = \sum_{t=1}^{T} \ell_{t,A_t} - \min_a \sum_{t=1}^{T} \ell_{t,a}.$$
Since the actions $A_t$ may be randomized (and, in the stochastic setting, the losses are random as well), we also consider the expected regret
$$\mathbb{E}[R_T] = \mathbb{E}\left[\sum_{t=1}^{T} \ell_{t,A_t}\right] - \mathbb{E}\left[\min_a \sum_{t=1}^{T} \ell_{t,a}\right]$$
and the pseudo regret
$$\bar R_T = \mathbb{E}\left[\sum_{t=1}^{T} \ell_{t,A_t}\right] - \min_a \mathbb{E}\left[\sum_{t=1}^{T} \ell_{t,a}\right].$$
If the sequence of losses is deterministic we can remove the second expectation and obtain a slightly simpler expression
$$\mathbb{E}[R_T] = \mathbb{E}\left[\sum_{t=1}^{T} \ell_{t,A_t}\right] - \min_a \sum_{t=1}^{T} \ell_{t,a}.$$
Note that since for random variables $X$ and $Y$ we have $\mathbb{E}[\min\{X, Y\}] \le \min\{\mathbb{E}[X], \mathbb{E}[Y]\}$ [it is recommended to verify this inequality], we have $\bar R_T \le \mathbb{E}[R_T]$. A reason to consider pseudo regret in the stochastic setting is that we can get bounds of order $\ln T$ on the pseudo regret (so-called “logarithmic” regret bounds), whereas the fluctuations of $\sum_{t=1}^{T} \ell_{t,a}$ are of order $\sqrt{T}$ (when we sum $T$ random variables, the deviation of $\sum_{t=1}^{T} \ell_{t,a}$ from its expectation $T\mu(a)$ is of order $\sqrt{T}$). Thus, it is impossible to get logarithmic bounds for the expected regret.
Explanation of the Names In the complete definition of the prediction with expert advice game, on every
round of the game the player gets advice from K experts and then takes an action, which may be a
function of the advice; the player, as well as the experts, suffers a loss depending on the action taken.
Hence the name, prediction with expert advice. If we restrict the actions of the player to following the
advice of a single expert, then from the perspective of the playing strategy the actual advice does not
matter and it is only the loss that defines the strategy. We consider the restricted setting because it
allows us to highlight the relation with multiarmed bandits.
The name multiarmed bandits comes from the analogy with slot machines, which are one-armed
bandits. In this game actions are the “arms” of a slot machine.
Losses vs. Rewards In some games it is more natural to consider rewards (also called gains) rather
than losses. In fact, in the literature on stochastic problems it is more popular to work with rewards,
whereas in the literature on adversarial problems it is more popular to work with losses. There is a
simple transformation $r = 1 - \ell$, which turns a game with losses into a game with gains and the other way around.
Interestingly, in the adversarial setting working with losses leads to tighter and simpler results. In the
stochastic setting the choice does not matter.
5.3 I.I.D. (stochastic) Multiarmed Bandits

Notations We are given a $K \times \infty$ matrix of rewards (or gains) $r_{t,a}$, where $t \in \{1, 2, \ldots\}$ and $a \in \{1, \ldots, K\}$. Row $a$ of the matrix contains the rewards $r_{1,a}, r_{2,a}, \ldots, r_{t,a}, \ldots$ of action $a$, with time running to the right.
We assume that the $r_{t,a}$-s are in $[0, 1]$ and that they are generated independently, so that $\mathbb{E}[r_{t,a}] = \mu(a)$.
We use $\mu^* = \max_a \mu(a)$ to denote the expected reward of an optimal action and $\Delta(a) = \mu^* - \mu(a)$ to
denote the suboptimality gap (or simply the gap) of action $a$. We use $a^* = \arg\max_a \mu(a)$ to denote a best
action (note that there may be more than one best action; in such case let $a^*$ be any of them).
Game Definition
For t = 1, 2, . . . :
1. Pick a row At
2. Observe & accumulate rt,At
Performance Measure Let Nt (a) denote the number of times action a was played up to round t. We
measure the performance using the pseudo regret and we rewrite it in the following way
" T # " T #
X X
R̄T = max E rt,a − E rt,At
a
t=1 t=1
" T #
X
= T µ∗ − E rt,At
t=1
T
X
= E [µ∗ − rt,At ]
t=1
T
X
= E [E [µ∗ − rt,At |At ]] (5.1)
t=1
T
X
= E [∆(At )]
t=1
X
= ∆(a)E [NT (a)] .
a
In step (5.1) we note that E [rt,At ] is an expectation over two random variables, the selection of At ,
which is based on the history of the game, and the draw of rt,At , for which E [rt,At |At ] = µ(At ). We
have E [rt,At ] = E [E [rt,At |At ]], where the inner expectation is with respect to the draw of rt,At and the
outer expectation is with respect to the draw of $A_t$. Note that in the i.i.d. setting the performance of an
algorithm is compared to the best action in expectation ($\max_a \mathbb{E}\left[\sum_{t=1}^{T} r_{t,a}\right]$), whereas in the adversarial
setting the performance of an algorithm is compared to the best action in hindsight ($\min_a \sum_{t=1}^{T} \ell_{t,a}$).
Exploration-exploitation trade-off: A simple approach I.i.d. multiarmed bandits is the simplest
problem where we face the exploration-exploitation trade-off. In general, the goal is to play a best arm
on all the rounds, but since the identity of the best arm is unknown it has to be identified first. In order to
identify a best arm we need to explore all the arms. However, rounds used for exploration of suboptimal
arms increase the regret (through the $N_t(a)\Delta(a)$ term). At the same time, exploring too little (being too greedy) may
lead to confusion between a best and a suboptimal arm, which may eventually lead to even higher regret
when we start exploiting a wrong arm. So let us make a first attempt to quantify this trade-off. Assume
that we know the time horizon T and we start with εT exploration rounds followed by (1 − ε)T exploitation
rounds (where we play what we believe to be a best arm). Also assume that we have just two actions
and we know that for $a \neq a^*$ we have $\Delta(a) = \Delta$. The only thing we do not know is which of the two
actions is the best. So how should we set ε?
Let δ(ε) denote the probability that we misidentify the best arm at the end of the exploration period.
The pseudo regret can be bounded by:
$$\bar R_T \le \frac{1}{2}\Delta\varepsilon T + \delta(\varepsilon)\Delta(1-\varepsilon)T \le \frac{1}{2}\Delta\varepsilon T + \delta(\varepsilon)\Delta T = \left(\frac{1}{2}\varepsilon + \delta(\varepsilon)\right)\Delta T,$$
where the first term is a bound on the pseudo regret during the exploration phase and the second term
is a bound on the pseudo regret during the exploitation phase in case we select a wrong arm at the end
of the exploration phase. Now what is δ(ε)? Let µ̂t (a) denote the empirical mean of observed rewards
of arm a up to round t. For the exploitation phase it is natural to select the arm that maximizes µ̂εT (a)
at the end of the exploration phase. Therefore:
$$\delta(\varepsilon) = \mathbb{P}\left(\hat\mu_{\varepsilon T}(a) \ge \hat\mu_{\varepsilon T}(a^*)\right) \le \mathbb{P}\left(\hat\mu_{\varepsilon T}(a) \ge \mu(a) + \frac{\Delta}{2}\right) + \mathbb{P}\left(\hat\mu_{\varepsilon T}(a^*) \le \mu(a^*) - \frac{\Delta}{2}\right) \le 2e^{-2\left(\frac{\varepsilon T}{2}\right)\left(\frac{\Delta}{2}\right)^2} = 2e^{-\varepsilon T \Delta^2/4},$$
where the first inequality holds because $\hat\mu_{\varepsilon T}(a) \ge \hat\mu_{\varepsilon T}(a^*)$ implies that at least one of the two empirical means deviates from its expectation by at least $\Delta/2$, and the last inequality is by Hoeffding's inequality applied to the $\varepsilon T/2$ samples of each arm. By substituting this back into the regret bound we
obtain:
$$\bar R_T \le \left(\frac{1}{2}\varepsilon + 2e^{-\varepsilon T \Delta^2/4}\right)\Delta T.$$
In order to minimize $\frac{1}{2}\varepsilon + 2e^{-\varepsilon T \Delta^2/4}$ we take a derivative and equate it to zero, which leads to $\varepsilon = \frac{\ln(T\Delta^2)}{T\Delta^2/4}$.
It is easy to check that the second derivative is positive, confirming that this is the minimum. Note that
$\varepsilon$ must be non-negative, so strictly speaking we have $\varepsilon = \max\left\{0, \frac{\ln(T\Delta^2)}{T\Delta^2/4}\right\}$. If we substitute this back
into the regret bound we obtain:
$$\bar R_T \le \max\left\{\Delta T,\ \left(\frac{2\ln(T\Delta^2)}{T\Delta^2} + 2e^{-\ln(T\Delta^2)}\right)\Delta T\right\} = \max\left\{\Delta T,\ \frac{2\ln(T\Delta^2)}{\Delta} + \frac{2}{\Delta}\right\}.$$
Note that the number of exploration rounds is $\varepsilon T = \max\left\{0, \frac{\ln(T\Delta^2)}{\Delta^2/4}\right\}$.
Note that the regret bound is larger when ∆ is small. Although intuitively when ∆ is small
we do not care that much about playing a suboptimal action as opposed to the case when ∆ is large,
problems with small ∆ are actually harder and lead to larger regret. The reason is that the number of
rounds that it takes to identify the best action grows with $1/\Delta^2$. Even though in each exploration round
we only suffer a regret of ∆, the fact that the number of exploration rounds grows with $1/\Delta^2$ makes
problems with small ∆ harder.
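The explore-then-exploit strategy analyzed above can be sketched in a few lines of Python. The Bernoulli reward model, the function name, and the specific means below are illustrative assumptions; the number of exploration rounds follows the formula derived above.

import numpy as np

def explore_then_exploit(mu, T, delta, rng):
    """Explore-then-exploit for two arms with known horizon T and known gap delta.
    mu holds the true means (unknown to the strategy, used only to draw rewards)."""
    # Number of exploration rounds suggested by the analysis: eps*T = max{0, ln(T*delta^2)/(delta^2/4)}
    n_explore = int(max(0.0, np.log(T * delta**2) / (delta**2 / 4)))
    n_explore = min(n_explore, T)
    sums = np.zeros(2)
    counts = np.zeros(2)
    total = 0.0
    for t in range(n_explore):
        a = t % 2                                   # alternate between the two arms
        r = rng.binomial(1, mu[a])
        sums[a] += r
        counts[a] += 1
        total += r
    best = int(np.argmax(sums / np.maximum(counts, 1)))  # arm with the highest empirical mean
    for t in range(n_explore, T):                   # exploitation phase
        total += rng.binomial(1, mu[best])
    return total

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.6])                           # delta = 0.1
T = 100000
runs = [explore_then_exploit(mu, T, 0.1, rng) for _ in range(20)]
print(T * mu.max() - np.mean(runs))                 # empirical pseudo regret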
The above approach has three problems: (1) it assumes knowledge of the time horizon T , (2) it
assumes knowledge of the gap ∆, and (3) if we tried to generalize it to more than two arms, the
length of the exploration period would depend on the smallest gap, even if there are many arms with
larger gaps that are much easier to eliminate. The following approach resolves all three problems.
Upper Confidence Bound (UCB) algorithm We now consider the UCB1 algorithm of Auer et al.
(2002a).
Algorithm 3 UCB1 (Auer et al., 2002a)
Initialization: Play each action once.
for t = K + 1, K + 2, . . . do
    Play $A_t = \arg\max_a \left(\hat\mu_{t-1}(a) + \sqrt{\frac{3\ln t}{2N_{t-1}(a)}}\right)$.
end for
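A minimal Python sketch of UCB1 as defined above, assuming Bernoulli rewards for concreteness (the reward distribution and the function name are illustrative choices, not part of the algorithm's definition).

import numpy as np

def ucb1(means, T, rng):
    """Run UCB1 for T rounds on arms with Bernoulli rewards of the given means."""
    K = len(means)
    counts = np.zeros(K)          # N_t(a): number of times each arm was played
    sums = np.zeros(K)            # cumulative reward of each arm
    # Initialization: play each action once
    for a in range(K):
        sums[a] += rng.binomial(1, means[a])
        counts[a] += 1
    for t in range(K + 1, T + 1):
        mu_hat = sums / counts
        ucb = mu_hat + np.sqrt(3.0 * np.log(t) / (2.0 * counts))
        a = int(np.argmax(ucb))   # optimism in the face of uncertainty
        sums[a] += rng.binomial(1, means[a])
        counts[a] += 1
    return counts

rng = np.random.default_rng(0)
print(ucb1([0.5, 0.6, 0.7], T=10000, rng=rng))   # most pulls should go to the arm with mean 0.7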
The expression $U_t(a) = \hat\mu_{t-1}(a) + \sqrt{\frac{3\ln t}{2N_{t-1}(a)}}$ is called an upper confidence bound. Why? Because
$U_t(a)$ upper bounds $\mu(a)$ with high probability. The UCB approach follows the optimism in the face of
uncertainty principle. That is, we take an optimistic estimate of the reward of every arm by taking the
upper limit of the confidence bound. The UCB1 algorithm has the following regret guarantee.
$$\bar R_T \le \sum_{a:\Delta(a)>0} \frac{6\ln T}{\Delta(a)} + \left(1 + \frac{\pi^2}{3}\right)\sum_a \Delta(a).$$
Proof. For the analysis it is convenient to have the following picture in mind - see Figure 5.4. A
suboptimal arm is played when Ut (a) ≥ Ut (a∗ ). Our goal is to show that this does not happen very
often. The analysis is based on the following three points, which bound the corresponding distances in
Figure 5.4.
1. We show that $U_t(a^*) > \mu(a^*)$ for almost all rounds. A bit more precisely, let $F(a^*)$ be the number
of rounds when $U_t(a^*) \le \mu(a^*)$, then $\mathbb{E}[F(a^*)] \le \frac{\pi^2}{6}$.
2. In a similar way, we show that $\hat\mu_t(a) < \mu(a) + \sqrt{\frac{3\ln t}{2N_t(a)}}$ for almost all rounds. A bit more precisely,
let $F(a)$ be the number of rounds when $\hat\mu_t(a) \ge \mu(a) + \sqrt{\frac{3\ln t}{2N_t(a)}}$, then $\mathbb{E}[F(a)] \le \frac{\pi^2}{6}$. (Note that
this is a lower confidence bound for $\mu(a)$, or, in other words, the other side of the inequality compared
to Point 1.)
3. When Point 2 holds we have that $U_t(a) = \hat\mu_{t-1}(a) + \sqrt{\frac{3\ln t}{2N_{t-1}(a)}} \le \mu(a) + 2\sqrt{\frac{3\ln t}{2N_{t-1}(a)}} = \mu(a^*) - \Delta(a) + 2\sqrt{\frac{3\ln t}{2N_{t-1}(a)}}$.
Let us fix time horizon T and analyze what happens by time T (note that the algorithm does not
depend on T ). We have that for most rounds t ≤ T :
$$U_t(a) < \mu(a^*) - \Delta(a) + \sqrt{\frac{6\ln t}{N_{t-1}(a)}} \le \mu(a^*) - \Delta(a) + \sqrt{\frac{6\ln T}{N_{t-1}(a)}},$$
$$U_t(a^*) > \mu(a^*).$$
Thus, if both confidence bounds hold and $N_{t-1}(a) \ge \frac{6\ln T}{\Delta(a)^2}$, then $U_t(a) < \mu(a^*) < U_t(a^*)$ and the suboptimal arm is not played. Hence, a suboptimal arm $a$ can be played at round $t$ only if:
• $N_{t-1}(a) < \frac{6\ln T}{\Delta(a)^2}$;
• Or one of the confidence intervals in Points 1 or 2 has failed.
In other words, after a suboptimal action $a$ has been played for $\left\lceil\frac{6\ln T}{\Delta(a)^2}\right\rceil$ rounds it can only be played
again if one of the confidence intervals fails. Therefore,
$$\mathbb{E}[N_T(a)] \le \left\lceil\frac{6\ln T}{\Delta(a)^2}\right\rceil + \mathbb{E}[F(a^*)] + \mathbb{E}[F(a)] \le \frac{6\ln T}{\Delta(a)^2} + 1 + \frac{\pi^2}{3}$$
and since $\bar R_T = \sum_a \Delta(a)\mathbb{E}[N_T(a)]$ the result follows.
To complete the proof it is left to prove Points 1 and 2. We prove Point 1; the proof of Point 2 is identical. We start by looking at
$$\mathbb{P}(U_t(a^*) \le \mu(a^*)) = \mathbb{P}\left(\hat\mu_{t-1}(a^*) + \sqrt{\frac{3\ln t}{2N_{t-1}(a^*)}} \le \mu(a^*)\right) = \mathbb{P}\left(\mu(a^*) - \hat\mu_{t-1}(a^*) \ge \sqrt{\frac{3\ln t}{2N_{t-1}(a^*)}}\right).$$
The delicate point is that $N_{t-1}(a^*)$ is a random variable that is not independent of $\hat\mu_{t-1}(a^*)$ and thus we cannot apply Hoeffding's inequality directly. Instead, we look at a series of random variables $X_1, X_2, \ldots$, such that the $X_i$-s have the same distribution as the $r_{t,a^*}$-s. Let $\bar\mu_s = \frac{1}{s}\sum_{i=1}^{s} X_i$ be the average of the first $s$ elements of the sequence. Then we have:
$$\mathbb{P}\left(\mu(a^*) - \hat\mu_{t-1}(a^*) \ge \sqrt{\frac{3\ln t}{2N_{t-1}(a^*)}}\right) \le \mathbb{P}\left(\exists s: \mu(a^*) - \bar\mu_s \ge \sqrt{\frac{3\ln t}{2s}}\right) \le \sum_{s=1}^{t}\mathbb{P}\left(\mu(a^*) - \bar\mu_s \ge \sqrt{\frac{3\ln t}{2s}}\right) \le \sum_{s=1}^{t}\frac{1}{t^3} = \frac{1}{t^2},$$
where in the first line we decouple the $\hat\mu_t(a^*)$-s from the $N_t(a^*)$-s via the use of the $\bar\mu_s$-s and in the last line we
apply Hoeffding's inequality (note that $3\ln t = \ln t^3$ corresponds to $\ln\frac{1}{\delta}$ in Hoeffding's inequality and
thus $\delta = \frac{1}{t^3}$). Finally, we have:
$$\mathbb{E}[F(a^*)] = \sum_{t=1}^{\infty}\mathbb{P}\left(\mu(a^*) - \hat\mu_{t-1}(a^*) \ge \sqrt{\frac{3\ln t}{2N_{t-1}(a^*)}}\right) \le \sum_{t=1}^{\infty}\frac{1}{t^2} = \frac{\pi^2}{6}.$$
5.4 Prediction with Expert Advice

The game is played on a $K \times \infty$ matrix of losses $\ell_{t,a}$, as in the general setup: row $a$ contains the losses $\ell_{1,a}, \ell_{2,a}, \ldots, \ell_{t,a}, \ldots$ of action $a$, with time running to the right.
Game Definition
For t = 1, 2, . . . :
1. Pick a row At
2. Observe the column `t,1 , . . . , `t,K & suffer `t,At
Performance Measure The performance is measured by regret
$$R_T = \sum_{t=1}^{T} \ell_{t,A_t} - \min_a \sum_{t=1}^{T} \ell_{t,a}.$$
Algorithm We consider the Hedge algorithm (a.k.a. exponential weights and weighted majority) for
playing this game.
Algorithm 4 Hedge (a.k.a. Exponential Weights), (Vovk, 1990, Littlestone and Warmuth, 1994)
Input: Learning rates η1 ≥ η2 ≥ · · · > 0
∀a : L0 (a) = 0
for t = 1, 2, ... do
∀a : $p_t(a) = \frac{e^{-\eta_t L_{t-1}(a)}}{\sum_{a'} e^{-\eta_t L_{t-1}(a')}}$
Sample At according to pt and play it
Observe `t,1 , . . . , `t,K and suffer `t,At
∀a : Lt (a) = Lt−1 (a) + `t,a
end for
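A minimal Python sketch of the Hedge update above with a fixed learning rate. The randomly generated loss matrix and the function name are illustrative assumptions; the learning rate is the one derived in the analysis below.

import numpy as np

def hedge(losses, eta, rng):
    """Hedge on a T x K loss matrix with a fixed learning rate eta (full information)."""
    T, K = losses.shape
    L = np.zeros(K)                          # cumulative losses L_t(a)
    total_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (L - L.min()))     # weights proportional to exp(-eta * L_{t-1}(a))
        p = w / w.sum()
        a = rng.choice(K, p=p)               # sample A_t according to p_t
        total_loss += losses[t, a]
        L += losses[t]                       # full information: the whole column is observed
    return total_loss, L

rng = np.random.default_rng(0)
T, K = 10000, 10
losses = rng.random((T, K))
eta = np.sqrt(2.0 * np.log(K) / T)
total_loss, L = hedge(losses, eta, rng)
print(total_loss - L.min())                  # realized regret, of order sqrt(T ln K)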
Analysis We analyze the Hedge algorithm in a slightly simplified setting, where the time horizon T
is known. Unknown time horizon can be handled by using the doubling trick (see home assignment) or,
more elegantly, by a more careful analysis (see, e.g., Bubeck and Cesa-Bianchi (2012)).
The analysis is based on the following lemma.
Lemma 5.2. Let $\{X_{1,a}, X_{2,a}, \ldots\}_{a\in\{1,\ldots,K\}}$ be K sequences of non-negative numbers ($X_{t,a} \ge 0$ for all
$a$ and $t$). Let $L_t(a) = \sum_{s=1}^{t} X_{s,a}$, let $L_0(a)$ be zero for all $a$ and let $\eta > 0$. Finally, let $p_t(a) = \frac{e^{-\eta L_{t-1}(a)}}{\sum_{a'} e^{-\eta L_{t-1}(a')}}$. Then:
$$\sum_{t=1}^{T}\sum_{a=1}^{K} p_t(a) X_{t,a} - \min_a L_T(a) \le \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\sum_{a=1}^{K} p_t(a)(X_{t,a})^2.$$
Proof. We define $W_t = \sum_a e^{-\eta L_t(a)}$ and study how this quantity evolves. We start with an upper bound.
$$\frac{W_t}{W_{t-1}} = \frac{\sum_a e^{-\eta L_t(a)}}{\sum_a e^{-\eta L_{t-1}(a)}} = \sum_a e^{-\eta X_{t,a}} \frac{e^{-\eta L_{t-1}(a)}}{\sum_{a'} e^{-\eta L_{t-1}(a')}} \qquad (5.2)$$
$$= \sum_a e^{-\eta X_{t,a}} p_t(a) \qquad (5.3)$$
$$\le \sum_a \left(1 - \eta X_{t,a} + \frac{1}{2}\eta^2 (X_{t,a})^2\right) p_t(a) \qquad (5.4)$$
$$= 1 - \eta \sum_a X_{t,a} p_t(a) + \frac{\eta^2}{2}\sum_a (X_{t,a})^2 p_t(a)$$
$$\le e^{-\eta \sum_a X_{t,a} p_t(a) + \frac{\eta^2}{2}\sum_a (X_{t,a})^2 p_t(a)}, \qquad (5.5)$$
where in (5.2) we used the fact that $L_t(a) = X_{t,a} + L_{t-1}(a)$, in (5.3) we used the definition of $p_t(a)$, in
(5.4) we used the inequality $e^x \le 1 + x + \frac{1}{2}x^2$, which holds for $x \le 0$ (this is a delicate point, because the
inequality does not hold for $x > 0$ and, therefore, we must check that the condition $x \le 0$ is satisfied; it
is satisfied under the assumptions of the lemma), and inequality (5.5) is based on the inequality $1 + x \le e^x$,
which holds for all $x$.
Now we consider the ratio $\frac{W_T}{W_0}$. On the one hand:
$$\frac{W_T}{W_0} = \frac{W_1}{W_0}\times\frac{W_2}{W_1}\times\cdots\times\frac{W_T}{W_{T-1}} \le e^{-\eta \sum_{t=1}^{T}\sum_a X_{t,a} p_t(a) + \frac{\eta^2}{2}\sum_{t=1}^{T}\sum_a (X_{t,a})^2 p_t(a)}.$$
On the other hand, since all the terms in $W_T$ are non-negative, for any $a$ we have $W_T \ge e^{-\eta L_T(a)}$ and, in particular, $W_T \ge e^{-\eta \min_a L_T(a)}$, while $W_0 = K$. Taking the logarithm of the two bounds on $\frac{W_T}{W_0}$ and rearranging the terms completes the proof.
Theorem 5.3. The expected regret of the Hedge algorithm with a fixed learning rate $\eta$ satisfies:
$$\mathbb{E}[R_T] \le \frac{\ln K}{\eta} + \frac{\eta}{2}T.$$
The bound is minimized by $\eta = \sqrt{\frac{2\ln K}{T}}$, which leads to $\mathbb{E}[R_T] \le \sqrt{2T\ln K}$.

Proof. We note that the $\ell_{t,a}$-s are non-negative and apply Lemma 5.2 to obtain:
$$\sum_{t=1}^{T}\sum_{a=1}^{K} p_t(a)\ell_{t,a} - \min_a L_T(a) \le \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\sum_{a=1}^{K} p_t(a)(\ell_{t,a})^2.$$
Note that $\sum_a p_t(a)\ell_{t,a}$ is the expected loss of Hedge on round $t$ and $\sum_{t=1}^{T}\sum_{a=1}^{K} p_t(a)\ell_{t,a}$ is the expected
cumulative loss of Hedge after $T$ rounds. Thus, the left hand side of the inequality is the
expected regret of Hedge. Also note that $\ell_{t,a} \le 1$ and thus $(\ell_{t,a})^2 \le 1$ and $\sum_a p_t(a)(\ell_{t,a})^2 \le 1$. Thus,
$\sum_{t=1}^{T}\sum_{a=1}^{K} p_t(a)(\ell_{t,a})^2 \le T$. Altogether, we get that:
$$\mathbb{E}[R_T] \le \frac{\ln K}{\eta} + \frac{\eta}{2}T.$$
By taking the derivative of the right hand side and equating it to zero we obtain that $-\frac{\ln K}{\eta^2} + \frac{T}{2} = 0$
and thus $\eta = \sqrt{\frac{2\ln K}{T}}$ is an extremal point. The second derivative is $\frac{2\ln K}{\eta^3}$ and since $\eta > 0$ it is positive.
Thus, the extremal point is the minimum.
5.4.1 Lower Bound
A lower bound for the expected regret in prediction with expert advice is based on the following con-
struction. We draw a K × ∞ matrix of losses with each loss drawn according to a Bernoulli distribution
with bias 1/2. In this game the expected loss of any algorithm after T rounds is T /2, irrespective of what
the algorithm is doing. However, the loss of the best action in hindsight is lower, because we are selecting
the “best” out of K rows. For each individual row the expected loss is T /2, but the expectation of the
minimum of the losses is lower. The reduction is quantified in the following theorem, see Cesa-Bianchi
and Lugosi (2006) for a proof.
Theorem 5.4. Let $\ell_{t,a}$ be i.i.d. Bernoulli random variables with bias 1/2. Then
$$\lim_{K\to\infty}\lim_{T\to\infty} \frac{T/2 - \mathbb{E}\left[\min_a \sum_{t=1}^{T}\ell_{t,a}\right]}{\sqrt{\frac{1}{2}T\ln K}} = 1.$$
Note that the numerator in the above expression, $T/2 - \mathbb{E}\left[\min_a \sum_{t=1}^{T}\ell_{t,a}\right]$, is the expectation with
respect to generation of the matrix of losses of the expected regret. Thus, if the adversary generates
the matrix of losses according to the construction described above, then in expectation with respect to
generation of the matrix and in the limit of $K$ and $T$ going to infinity the expected regret cannot be
smaller than $\sqrt{\frac{1}{2}T\ln K}$.
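The construction is easy to probe numerically: we can estimate $T/2 - \mathbb{E}\left[\min_a \sum_{t=1}^{T}\ell_{t,a}\right]$ by Monte Carlo and compare it to $\sqrt{\frac{1}{2}T\ln K}$. A small sketch (the values of T, K, and the number of repetitions are arbitrary choices, and the agreement is only asymptotic in K and T):

import numpy as np

rng = np.random.default_rng(0)
T, K, reps = 10000, 64, 50
gaps = []
for _ in range(reps):
    losses = rng.binomial(1, 0.5, size=(T, K))     # Bernoulli(1/2) loss matrix
    gaps.append(T / 2 - losses.sum(axis=0).min())  # loss of any algorithm minus the best row in hindsight
print(np.mean(gaps), np.sqrt(0.5 * T * np.log(K)))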
5.5 Adversarial Multiarmed Bandits

Algorithm The algorithm, known as EXP3, is based on using importance-weighted estimates of the losses in the Hedge
algorithm.5
5 We note that the original algorithm in Auer et al. (2002b) was formulated for the gains game. Here we present an
improved algorithm for the losses game (Stoltz, 2005, Bubeck, 2010). We refer to home assignment for the difference
between the two.
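A minimal sketch of this combination: Hedge run on the importance-weighted loss estimates $\tilde\ell_{t,a} = \ell_{t,a}\mathbb{1}(A_t = a)/p_t(a)$ discussed below. The function name and the randomly generated loss matrix are illustrative assumptions; the learning rate matches the one derived in Theorem 5.5.

import numpy as np

def exp3(losses, eta, rng):
    """EXP3: Hedge on importance-weighted loss estimates, observing only the chosen entry."""
    T, K = losses.shape
    L_tilde = np.zeros(K)                    # cumulative importance-weighted loss estimates
    total_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (L_tilde - L_tilde.min()))
        p = w / w.sum()
        a = rng.choice(K, p=p)
        loss = losses[t, a]                  # bandit feedback: only this entry is observed
        total_loss += loss
        L_tilde[a] += loss / p[a]            # importance-weighted estimate of the column
    return total_loss

rng = np.random.default_rng(0)
T, K = 100000, 10
losses = rng.random((T, K))
eta = np.sqrt(2.0 * np.log(K) / (K * T))
print(exp3(losses, eta, rng) - losses.sum(axis=0).min())   # realized regret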
Properties of importance-weighted samples Before we analyze the EXP3 algorithm we discuss a
number of important properties of importance-weighted sampling.
1. The samples $\tilde\ell_{t,a}$ are not independent in two ways. First, for a fixed $t$, the set $\{\tilde\ell_{t,1}, \ldots, \tilde\ell_{t,K}\}$
is dependent (if we know that one of the $\tilde\ell_{t,a}$-s is non-zero, we automatically know that all the rest
are zero). And second, $\tilde\ell_{t,a}$ depends on all $\tilde\ell_{s,a'}$ for $s < t$ and all $a'$, since $p_t(a)$ depends on
$\{\tilde\ell_{s,a}\}_{1\le s<t,\,a\in\{1,\ldots,K\}}$, which is the history of the game up to round $t$. In other words, $p_t(a)$ itself
is a random variable.
2. Even though the $\tilde\ell_{t,a}$-s are not independent, they are unbiased estimates of the true losses. Specifically,
$$\mathbb{E}\left[\tilde\ell_{t,a}\right] = \mathbb{E}\left[\frac{\ell_{t,a}\mathbb{1}(A_t = a)}{p_t(a)}\right] = \mathbb{E}\left[\mathbb{E}\left[\frac{\ell_{t,a}\mathbb{1}(A_t = a)}{p_t(a)}\,\Big|\,A_1,\ldots,A_{t-1}\right]\right] = \mathbb{E}\left[\frac{\ell_{t,a}}{p_t(a)}\mathbb{E}\left[\mathbb{1}(A_t = a)\,|\,A_1,\ldots,A_{t-1}\right]\right]$$
$$= \mathbb{E}\left[\frac{\ell_{t,a}}{p_t(a)}p_t(a)\right] = \ell_{t,a}.$$
The first expectation above is with respect to $A_1, \ldots, A_t$. In the nested expectations, the external
expectation is with respect to $A_1, \ldots, A_{t-1}$ and the internal is with respect to $A_t$. Note that $p_t(a)$
is a random variable depending on $A_1, \ldots, A_{t-1}$, thus after the conditioning on $A_1, \ldots, A_{t-1}$ it is
deterministic.
3. Since $\ell_{t,a} \in [0, 1]$, we have $\tilde\ell_{t,a} \in \left[0, \frac{1}{p_t(a)}\right]$.
4. What is important is that the second moment of the $\tilde\ell_{t,a}$-s is by an order of magnitude smaller than
the second moment of a general random variable in the corresponding range. This is because the
expectation of the $\tilde\ell_{t,a}$-s is in the $[0, 1]$ interval. Specifically:
$$\mathbb{E}\left[\left(\tilde\ell_{t,a}\right)^2\right] = \mathbb{E}\left[\left(\frac{\ell_{t,a}\mathbb{1}(A_t = a)}{p_t(a)}\right)^2\right] = \mathbb{E}\left[\frac{(\ell_{t,a})^2(\mathbb{1}(A_t = a))^2}{p_t(a)^2}\right] = \mathbb{E}\left[\frac{(\ell_{t,a})^2\mathbb{1}(A_t = a)}{p_t(a)^2}\right] \le \mathbb{E}\left[\frac{\mathbb{1}(A_t = a)}{p_t(a)^2}\right]$$
$$= \mathbb{E}\left[\mathbb{E}\left[\frac{\mathbb{1}(A_t = a)}{p_t(a)^2}\,\Big|\,A_1,\ldots,A_{t-1}\right]\right] = \mathbb{E}\left[\frac{1}{p_t(a)^2}\mathbb{E}\left[\mathbb{1}(A_t = a)\,|\,A_1,\ldots,A_{t-1}\right]\right] = \mathbb{E}\left[\frac{1}{p_t(a)}\right],$$
where we have used $(\mathbb{1}(A_t = a))^2 = \mathbb{1}(A_t = a)$ and $(\ell_{t,a})^2 \le 1$ (since $\ell_{t,a} \in [0, 1]$).
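The unbiasedness (Point 2) and the second moment bound (Point 4) are easy to check numerically for a single round. A small sketch, assuming a fixed sampling distribution and a fixed loss vector (both chosen arbitrarily for illustration):

import numpy as np

rng = np.random.default_rng(0)
K = 5
p = np.ones(K) / K                          # sampling distribution p_t for a single round
ell = rng.random(K)                         # losses ell_{t,1}, ..., ell_{t,K} in [0, 1]
n = 200000

A = rng.choice(K, size=n, p=p)              # repeated draws of A_t
ell_tilde = np.zeros((n, K))
ell_tilde[np.arange(n), A] = ell[A] / p[A]  # importance-weighted estimates

print(ell)                                  # true losses
print(ell_tilde.mean(axis=0))               # empirical means, close to the true losses (unbiasedness)
print((ell_tilde**2).mean(axis=0), 1.0 / p) # second moments, bounded by 1/p(a)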
Theorem 5.5. The expected regret of the EXP3 algorithm with a fixed learning rate η satisfies:
$$\mathbb{E}[R_T] \le \frac{\ln K}{\eta} + \frac{\eta}{2}KT.$$
The expected regret is minimized by $\eta = \sqrt{\frac{2\ln K}{KT}}$, which leads to
$$\mathbb{E}[R_T] \le \sqrt{2KT\ln K}.$$
Note that the extra payment for being able to observe just one entry rather than the full column is
the multiplicative $\sqrt{K}$ factor in the regret bound.
Proof. The proof of the theorem is based on Lemma 5.2. We note that `˜t,a -s are all non-negative and,
thus, by Lemma 5.2 we have:
$$\sum_{t=1}^{T}\sum_a p_t(a)\tilde\ell_{t,a} - \min_a \tilde L_T(a) \le \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\sum_a p_t(a)\left(\tilde\ell_{t,a}\right)^2.$$
Putting all three together back into the inequality we obtain the first statement of the theorem. And,
as before, we find η that minimizes the bound.
5.6 Adversarial Multiarmed Bandits with Expert Advice
Game setting We are, again, working with the same matrix of losses as in prediction with expert
advice. But now on every round of the game we get the advice of N experts, indexed by h, in the form of a
distribution over the K arms. More formally:
For t = 1, 2, . . . :
1. Observe qt,1 , . . . , qt,N , where qt,h is a probability distribution over {1, . . . , K}.
2. Pick a row At .
3. Observe & suffer `t,At . (`t,a -s for a 6= At remain unobserved)
Performance measure We compare the expected loss of the algorithm to the expected loss of the
best expert, where the expectation of the loss of expert h is taken with respect to its advice vector qh .
Specifically:
$$\mathbb{E}[R_T] = \sum_{t=1}^{T}\sum_a p_t(a)\ell_{t,a} - \min_h \sum_{t=1}^{T}\sum_a q_{t,h}(a)\ell_{t,a}.$$
Algorithm The algorithm is quite similar to the EXP3 algorithm.6 Note that now L̃t (h) tracks
the cumulative importance-weighted estimates of the expert losses instead of the individual arm losses.
Note that the ln N term plays the role of the complexity of the class of experts, in a very similar way to the
complexity terms we saw earlier in supervised learning (specifically, in the uniform union bound).
6 As with the EXP3 algorithm we present a slightly improved version of the algorithm for the game with losses.
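A minimal sketch of the EXP4 idea described above: exponential weights over the N experts, with the played arm drawn from the mixture $p_t(a) = \sum_h w_t(h)q_{t,h}(a)$ and the expert losses estimated through the importance-weighted arm losses. The function name, the random advice, and the choice of learning rate are illustrative assumptions.

import numpy as np

def exp4(losses, advice, eta, rng):
    """EXP4 sketch. losses: T x K matrix; advice: T x N x K array of distributions q_{t,h}."""
    T, K = losses.shape
    N = advice.shape[1]
    L_tilde = np.zeros(N)                    # cumulative loss estimates of the experts
    total_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (L_tilde - L_tilde.min()))
        w /= w.sum()                         # distribution w_t over experts
        p = w @ advice[t]                    # p_t(a) = sum_h w_t(h) q_{t,h}(a)
        a = rng.choice(K, p=p)
        loss = losses[t, a]
        total_loss += loss
        ell_tilde = np.zeros(K)
        ell_tilde[a] = loss / p[a]           # importance-weighted arm loss estimates
        L_tilde += advice[t] @ ell_tilde     # expert estimates: sum_a q_{t,h}(a) * ell_tilde[a]
    return total_loss

rng = np.random.default_rng(0)
T, K, N = 50000, 10, 100
losses = rng.random((T, K))
advice = rng.random((T, N, K))
advice /= advice.sum(axis=2, keepdims=True)  # normalize each advice vector to a distribution
eta = np.sqrt(2.0 * np.log(N) / (K * T))
print(exp4(losses, advice, eta, rng))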
Proof. The analysis is quite similar to the analysis of the EXP3 algorithm. We note that `˜t,h -s are all
non-negative and that wt is a distribution over {1, . . . , N } defined in the same way as pt in Lemma 5.2.
Thus, by Lemma 5.2 we have:
$$\sum_{t=1}^{T}\sum_h w_t(h)\tilde\ell_{t,h} - \min_h \tilde L_T(h) \le \frac{\ln N}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\sum_h w_t(h)\left(\tilde\ell_{t,h}\right)^2.$$
where the first equality is by the definition of `˜t,h and the last equality is due to unbiasedness of `˜t,a .
Thus, the first expectation is the expected loss of EXP4.
" T # " T # " T #
h i X XX XX
E L̃T (h) = E `˜t,h = E qt (a)`˜t,a = E qt (a)`t,a ,
t=1 t=1 a t=1 a
where we can remove tilde due to unbiasedness of `˜t,a and we obtain the expected cumulative loss of
expert h over T rounds. And, finally,
" T # !2
XX 2 XT X X
E wt (h) `˜t,h = E wt (h) qt,h (a)`˜t,a
t=1 h t=1 h a
" T X
#
X X 2
≤E wt (h) qt,h (a) `˜t,a
t=1 h a
" T X X
! #
X 2
=E wt (h)qt,h (a) `˜t,a
t=1 a h
" T X
#
X 2
=E pt (a) `˜t,a
t=1 a
≤ KT,
where the first inequality is by Jensen’s inequality and convexity of x2 and the last inequality is along the
same lines as the analogous inequality in the analysis of EXP3. By substituting the three expectations
back into the inequality we obtain the first statement of the theorem. And, as before, we find η that
minimizes the bound.
5.6.1 Lower Bound
It is possible to show that the regret of adversarial multiarmed bandits with expert advice must be at
least $\Omega\left(\sqrt{KT\frac{\ln N}{\ln K}}\right)$. The lower bound is based on a construction of $\frac{\ln N}{\ln K}$ independent bandit problems,
each according to the construction of the lower bound for multiarmed bandits in Section 5.5.1, and a
construction of expert advice, so that for every possible selection of best arms for the subproblems there
is an expert that recommends that selection. For details of the proof see Agarwal et al. (2012) and Seldin
and Lugosi (2016). Closing the $\sqrt{\ln K}$ gap between the upper and the lower bound is an open problem.
Appendix A
In this chapter we provide a number of basic definitions and notations from set theory that are used
in the notes.
Countable and Uncountable sets A set is called countable if its elements can be counted or, in
other words, if every element of the set can be associated with a distinct natural number. For example, the set of
integer numbers is countable and the set of rational numbers (ratios of two integers) is also countable.
Finite sets are countable as well. A set is called uncountable if its elements cannot be enumerated. For
example, the set of real numbers R is uncountable and the set of numbers in a [0, 1] interval is also
uncountable.
Relations between sets For two sets A and B we use A ⊆ B to denote that A is a subset of B.
Operations on sets For two sets A and B we use A ∪ B to denote the union of A and B; A ∩ B the
intersection of A and B; and A \ B the difference of A and B (the set of elements that are in A, but not
in B).
Appendix B
This chapter provides a number of basic definitions and results from probability theory. It is partially
based on Mitzenmacher and Upfal (2005).
Definition B.1 (Probability space). A probability space is defined by a triple (Ω, F, P), where:
• Ω is the sample space, the set of all possible outcomes of the modeled experiment.
• F is a family of sets representing the allowable events, where each set in F is a subset of the sample
space Ω.
• P is a probability function P : F → [0, 1] satisfying Definition B.4.
Elements of Ω are called simple or elementary events.
Example B.2. For coin flips the sample space is Ω = {H, T }, where H stands for “heads” and T for
“tails”.
In dice rolling the sample space is Ω = {1, 2, 3, 4, 5, 6}, where 1,. . . ,6 label the sides of a dice (you
should consider them as labels rather than numerical values, we get back to this later in Example B.15).
If we simultaneously flip a coin and roll a dice the sample space is
Ω = {(H, 1), (T, 1), (H, 2), (T, 2), . . . , (H, 6), (T, 6)}.
If Ω is countable (including finite), the probability space is discrete. In discrete probability spaces the
family F consists of all subsets of Ω. In particular, F always includes the empty set ∅ and the complete
sample space Ω. If Ω is uncountably infinite (for example, the real line or the [0, 1] interval) a proper
definition of F requires concepts from the measure theory, which go beyond the scope of these notes.
Example B.3. In the coin flipping experiment F = {∅, {H} , {T } , {H, T }}.
Definition B.4 (Probability Axioms). A probability function is any function P : F → R that satisfies
the following conditions
1. For any event E ∈ F, 0 ≤ P(E) ≤ 1.
2. P(Ω) = 1.
3. For any finite or countably infinite sequence of mutually disjoint events E1 , E2 , . . .
$$\mathbb{P}\left(\bigcup_{i\ge1} E_i\right) = \sum_{i\ge1}\mathbb{P}(E_i).$$
We now consider a number of basic properties of probabilities.
Lemma B.5 (Monotonicity). Let A and B be two events, such that A ⊆ B. Then
P(A) ≤ P(B).
Proof. We have that $B = A \cup (B \setminus A)$ and the events $A$ and $B \setminus A$ are disjoint. Thus,
$$\mathbb{P}(B) = \mathbb{P}(A) + \mathbb{P}(B \setminus A) \ge \mathbb{P}(A),$$
where the equality is by the third axiom of probabilities and the inequality is by the first axiom of
probabilities, since $\mathbb{P}(B \setminus A) \ge 0$.
The next simple, but very important result is known as the union bound.
Lemma B.6 (The union bound). For any finite or countably infinite sequence of events E1 , E2 , . . . ,
$$\mathbb{P}\left(\bigcup_{i\ge1} E_i\right) \le \sum_{i\ge1}\mathbb{P}(E_i).$$
Proof. We have
$$\bigcup_{i\ge1} E_i = E_1 \cup (E_2 \setminus E_1) \cup (E_3 \setminus (E_1 \cup E_2)) \cup \cdots = \bigcup_{i\ge1} F_i,$$
where the events $F_i = E_i \setminus \bigcup_{j=1}^{i-1} E_j$ are disjoint, $F_i \subseteq E_i$, and $\bigcup_{i\ge1} F_i = \bigcup_{i\ge1} E_i$. Therefore,
$$\mathbb{P}\left(\bigcup_{i\ge1} E_i\right) = \mathbb{P}\left(\bigcup_{i\ge1} F_i\right) = \sum_{i\ge1}\mathbb{P}(F_i) \le \sum_{i\ge1}\mathbb{P}(E_i),$$
where the second equality is by the third axiom of probabilities and the inequality is by monotonicity of
the probability (Lemma B.5).
Example B.7. Let E1 = {1, 3, 5} be the event that the outcome of a dice roll is odd and E2 = {1, 2, 3}
be the event that the outcome is at most 3. Then P(E1 ∪ E2 ) = P(1, 2, 3, 5) ≤ P(E1 ) + P(E2 ). Note that
this is true irrespective of the choice of the probability measure P. In particular, this is true irrespective
of whether the dice is fair or not.
Definition B.8 (Independence). Two events A and B are called independent if and only if
$$\mathbb{P}(A \cap B) = \mathbb{P}(A)\mathbb{P}(B).$$
Definition B.9 (Pairwise independence). Events E1 , . . . , En are called pairwise independent if and only
if for any pair i, j
P(Ei ∩ Ej ) = P(Ei )P(Ej ).
Definition B.10 (Mutual independence). Events E1 , . . . , En are called mutually independent if and
only if for any subset of indices I ⊆ {1, . . . , n}
$$\mathbb{P}\left(\bigcap_{i\in I} E_i\right) = \prod_{i\in I}\mathbb{P}(E_i).$$
Note that pairwise independence does not imply mutual independence. Take the following example:
assume we roll a fair tetrahedron (a three-dimensional object with four faces) with faces colored in red,
blue, green, and the fourth face colored in all three colors, red, blue, and green. Let E1 be the event that
we observe red color, E2 be the event that we observe blue color, and E3 be the event that we observe green
color. Then for all i we have P(Ei) = 1/2 and for any pair i ≠ j we have P(Ei ∩ Ej) = 1/4 = P(Ei)P(Ej).
However, P(E1 ∩ E2 ∩ E3) = 1/4 ≠ P(E1)P(E2)P(E3) and, thus, the events are pairwise independent, but
not mutually independent. If we say that events E1 , . . . , En are independent without further specifications
we imply mutual independence.
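The tetrahedron example is easy to verify by simulation. A small sketch (the numeric encoding of the faces is an illustrative choice):

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
faces = rng.integers(0, 4, size=n)          # faces: 0=red, 1=blue, 2=green, 3=all three colors
red = (faces == 0) | (faces == 3)
blue = (faces == 1) | (faces == 3)
green = (faces == 2) | (faces == 3)

print(red.mean(), blue.mean(), green.mean())           # each about 1/2
print((red & blue).mean(), red.mean() * blue.mean())   # both about 1/4: pairwise independent
print((red & blue & green).mean(),
      red.mean() * blue.mean() * green.mean())         # about 1/4 vs 1/8: not mutually independent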
Definition B.11 (Conditional probability). The conditional probability that event A occurs given that
event B occurs is
$$\mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}.$$
The conditional probability is well-defined only if P(B) > 0.
Lemma B.13 (The law of total probability). Let $E_1, E_2, \ldots, E_n$ be mutually disjoint events, such that
$\bigcup_{i=1}^{n} E_i = \Omega$. Then
$$\mathbb{P}(A) = \sum_{i=1}^{n}\mathbb{P}(A \cap E_i) = \sum_{i=1}^{n}\mathbb{P}(A|E_i)\mathbb{P}(E_i).$$
Proof. Since the $E_i$-s are disjoint and cover the entire space it follows that $A = \bigcup_{i=1}^{n}(A \cap E_i)$ and the
events $A \cap E_i$ are mutually disjoint. Therefore,
$$\mathbb{P}(A) = \mathbb{P}\left(\bigcup_{i=1}^{n}(A \cap E_i)\right) = \sum_{i=1}^{n}\mathbb{P}(A \cap E_i) = \sum_{i=1}^{n}\mathbb{P}(A|E_i)\mathbb{P}(E_i).$$
Definition B.16 (Independence of random variables). Two random variables X and Y are independent
if and only if
$$\mathbb{P}((X = x) \cap (Y = y)) = \mathbb{P}(X = x)\mathbb{P}(Y = y)$$
for all values x and y.
Definition B.17 (Pairwise independence). Random variables X1 , . . . , Xn are pairwise independent if
and only if for any pair i, j and any values xi, xj
$$\mathbb{P}((X_i = x_i) \cap (X_j = x_j)) = \mathbb{P}(X_i = x_i)\mathbb{P}(X_j = x_j).$$
Similar to the example given earlier, pairwise independence of random variables does not imply their
mutual independence. If we say that random variables are independent without further specifications we
imply mutual independence.
B.3 Expectation
Expectation is the most basic characteristic of a random variable.
Definition B.19 (Expectation). Let $X$ be a discrete random variable and let $\mathcal{X}$ be the set of all possible
values that it can take. The expectation of $X$, denoted by $\mathbb{E}[X]$, is given by
$$\mathbb{E}[X] = \sum_{x\in\mathcal{X}} x\,\mathbb{P}(X = x).$$
The expectation is finite if $\sum_{x\in\mathcal{X}} |x|\,\mathbb{P}(X = x)$ converges; otherwise the expectation is unbounded.
Example B.20. For a fair dice with faces numbered 1 to 6 let X(i) = i (the i-th face gets value i).
Then
$$\mathbb{E}[X] = \sum_{i=1}^{6} i\cdot\frac{1}{6} = \frac{7}{2}.$$
Expectation satisfies a number of important properties (these properties also hold for continuous
random variables). We leave a proof of these properties as an exercise.
Lemma B.21 (Multiplication by a constant). For any constant c
E [cX] = cE [X] .
Theorem B.22 (Linearity). For any pair of random variables X and Y , not necessarily independent,
E [X + Y ] = E [X] + E [Y ] .
Theorem B.23 (Multiplication of independent random variables). For any pair of independent random variables X and Y,
$$\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y].$$
We emphasize that in contrast with Theorem B.22, this property does not hold in the general case (if X
and Y are not independent).
B.4 Variance
Variance is the second most basic characteristic of a random variable.
Definition B.24 (Variance). The variance of a random variable X (discrete or continuous), denoted
by Var [X], is defined by
$$\mathrm{Var}[X] = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}\left[X^2\right] - (\mathbb{E}[X])^2.$$
We invite the reader to prove that $\mathbb{E}\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}\left[X^2\right] - (\mathbb{E}[X])^2$.
Example B.25. For a fair dice with faces numbered 1 to 6 let X(i) = i (the i-th face gets value i).
Then
$$\mathrm{Var}[X] = \mathbb{E}\left[X^2\right] - (\mathbb{E}[X])^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12}.$$
Theorem B.26. If X1 , . . . , Xn are independent random variables then
" n # n
X X
Var Xi = Var [Xi ] .
i=1 i=1
The proof is based on Theorem B.23 and the result does not necessarily hold when Xi -s are not
independent. We leave the proof as an exercise.
Definition B.27 (Bernoulli random variable). A random variable X taking values {0, 1} is called a
Bernoulli random variable. The parameter p = P(X = 1) is called the bias of X.
A Bernoulli random variable has the following property (which does not hold in general): since X takes values in {0, 1} we have $X^2 = X$ and, therefore, $\mathbb{E}[X^2] = \mathbb{E}[X] = p$.
Definition B.28 (Binomial random variable). A binomial random variable Y with parameters n and
p, denoted by B(n, p), is defined by the following probability distribution on k ∈ {0, 1, . . . , n}:
$$\mathbb{P}(Y = k) = \binom{n}{k} p^k (1-p)^{n-k}.$$
A binomial random variable can be represented as a sum of independent identically distributed Bernoulli
random variables.
Lemma B.29. Let $X_1, \ldots, X_n$ be independent Bernoulli random variables with bias p. Then $Y = \sum_{i=1}^{n} X_i$ is a binomial random variable with parameters n and p.
A proof of this lemma is left as an exercise to the reader.
Theorem B.30 (Jensen's inequality). Let f be a convex function. Then
$$\mathbb{E}[f(X)] \ge f(\mathbb{E}[X]).$$
For a proof see, for example, Mitzenmacher and Upfal (2005) or Cover and Thomas (2006).
Appendix C
Linear Algebra
We revisit a number of basic concepts from linear algebra. This is only a brief revision of the main
concepts that we are using in the lecture notes. For more details, please, refer to Strang (2009) or some
other textbook on linear algebra.
We start by recalling that two vectors u and v are perpendicular, u ⊥ v, if and only if their inner
product uT v = 0.
Matrix A matrix X ∈ Rn×d takes vectors in Rd and maps them into Rn . There are two fundamental
subspaces associated with a matrix X. The image of X, denoted Im(X) ⊆ Rn , is the space of all vectors
v ∈ Rn that can be obtained through multiplication of X with a vector w. The image Im(X) is a linear
subspace of Rn and it is also called a column space of X. The second subspace is the nullspace of X,
denoted N ull(X) ⊆ Rd , which is the space of all vectors w for which Xw = 0. The nullspace is a linear
subspace of Rd . The subspaces are illustrated in Figure C.1.
Matrix transpose Matrix transpose XT takes vectors in Rn and maps them into Rd . The corre-
sponding subspaces are Im(XT ), the row space of X, and N ull(XT ).
Complete relation between Im(X), Im(XT ), N ull(X), and N ull(XT ) Not only the pairs Im(X)
with N ull(XT ) and Im(XT ) with N ull(X) are orthogonal, they also complement each other. Let dim(A)
denote the dimension of a matrix A. The dimension is equal to the number of independent columns, which
is equal to the number of independent rows (this fact can be shown by bringing A to a diagonal form).
Then we have the following relations:
1. dim(Im(X)) = dim(Im(XT )) = dim(X).
2. dim(N ull(X)) = d − dim(Im(XT )) and dim(N ull(XT )) = n − dim(Im(X)).
3. Im(X) ⊥ N ull(XT ) and Im(XT ) ⊥ N ull(X).
Together these properties mean that a combination of bases for Im(XT ) and N ull(X) makes a basis for
Rd and a combination of bases for Im(X) and N ull(XT ) makes a basis for Rn . It means that any vector
v ∈ Rd can be represented as v = v? + v0 , where v? ∈ Im(XT ) belongs to the row space of X and
v0 ∈ N ull(X) belongs to the nullspace of X.
Figure C.1: The four fundamental subspaces of a matrix X. There is a right angle between Im(X)
and N ull(XT ), as well as between Im(XT ) and N ull(X).
The mapping between Im(XT ) and Im(X) is one-to-one and, thus, invertible Every vector
u in the column space comes from one and only one vector v in the row space. The proof of this fact
is also simple. Assume that u = Xv = Xv0 for two vectors v, v0 ∈ Im(XT ). Then X(v − v0 ) = 0 and
the vector v − v0 ∈ N ull(X). But N ull(X) is perpendicular to Im(XT ), which means that v − v0 is
orthogonal to itself and, therefore, must be the zero vector.
XT X is invertible if and only if X has linearly independent columns (XT X)−1 is a very
important matrix. We show that XT X is invertible if and only if X has linearly independent columns,
meaning that dim(X) = d. We show this by proving that X and XT X have the same nullspace. Let
v ∈ N ull(X), then Xv = 0 and, therefore, XT Xv = 0 and v ∈ N ull(XT X). In the other direction, let
$v \in Null(X^T X)$. Then $X^T X v = 0$ and we have $v^T X^T X v = (Xv)^T(Xv) = \|Xv\|^2 = 0$, which implies $Xv = 0$ and, thus, $v \in Null(X)$. Hence X and $X^T X$ have the same nullspace. When the columns of X are linearly independent the nullspace contains only the zero vector and, therefore, $X^T X$ is invertible.

Projection onto a line Let u be a vector spanning a line and let v be a vector we would like to project onto this line. The projection $p = \alpha u$ is defined by the requirement that the remainder $v - p$ is perpendicular to the projection:
$$(v - \alpha u)^T \alpha u = 0, \qquad \alpha v^T u = \alpha^2 u^T u, \qquad \alpha = \frac{v^T u}{u^T u} = \frac{u^T v}{u^T u}.$$
Thus, the projection is $p = \alpha u = \frac{u^T v}{u^T u}u$. Note that $\frac{u^T v}{u^T u}$ is a scalar, thus
$$p = \frac{u^T v}{u^T u}u = u\frac{u^T v}{u^T u} = \frac{u u^T}{u^T u}v.$$
The matrix $P = \frac{u u^T}{u^T u}$ is a projection matrix. For any vector v the matrix P projects v onto u.
Projection onto a subspace A subspace can be described by a set of linear combinations Az, where
the columns of matrix A span the subspace. Projection of a vector v onto a subspace described by A
means that we are looking for a projection p = Az, such that the remainder v − p is perpendicular to
the projection. The projection p = Az belongs to the image of A, Im(A). Thus, the remainder must be
in the nullspace of AT , meaning that AT (v − p) = 0. Assuming that the columns of A are independent,
we have:
AT (v − Az) = 0,
AT v = AT Az,
z = (AT A)−1 AT v,
where we used independence of the columns of A in the last step to invert AT A. The projection is
p = Az = A(AT A)−1 AT v and the projection matrix is P = A(AT A)−1 AT . The projection matrix P
maps any vector v onto the space spanned by the columns of A, Im(A). Note how $(A^T A)^{-1}$ plays the
role of $\frac{1}{u^T u}$ in projection onto a line.
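The projection formulas above are easy to verify numerically. A minimal sketch (the matrix A and the vector v below are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))            # columns of A span a 2-dimensional subspace of R^5
v = rng.standard_normal(5)

P = A @ np.linalg.inv(A.T @ A) @ A.T       # projection matrix onto Im(A)
p = P @ v
print(np.allclose(A.T @ (v - p), 0))       # the remainder is perpendicular to the subspace
print(np.allclose(P @ P, P))               # P is idempotent, as a projection matrix should be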
Appendix D
Calculus
D.1 Gradients
Gradients are vectors of partial derivatives. For a vector $x = (x_1, \ldots, x_d)^T$ and a function $f(x)$ the gradient
of $f$ is defined as
$$\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_d}\right)^T.$$
Consider the quadratic function $f(x) = x^T A x = \sum_{i=1}^{d}\sum_{j=1}^{d} a_{ij}x_i x_j$. For the $k$-th partial derivative we have
$$\frac{\partial f}{\partial x_k} = \sum_{j=1}^{d} a_{kj}x_j + \sum_{i=1}^{d} a_{ik}x_i,$$
where the first sum corresponds to the first element in the product $x_i x_j$ being $x_k$ and the second sum
corresponds to the second element in the product $x_i x_j$ being $x_k$. Putting all the derivatives together we
obtain:
$$\nabla f(x) = \begin{pmatrix}\sum_{j=1}^{d} a_{1j}x_j + \sum_{i=1}^{d} a_{i1}x_i\\ \vdots\\ \sum_{j=1}^{d} a_{dj}x_j + \sum_{i=1}^{d} a_{id}x_i\end{pmatrix} = Ax + A^T x = (A + A^T)x.$$
A matrix A is called symmetric if AT = A. For a symmetric matrix we have ∇f (x) = 2Ax and for a
general matrix we have ∇f (x) = (A + AT )x. Note the similarity and dissimilarity with the derivative of
a univariate quadratic function f (x) = ax2 , which is f 0 (x) = 2ax.
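The gradient formula can be checked numerically by comparing $(A + A^T)x$ with finite differences of $f(x) = x^T A x$. A small sketch (the matrix, the point x, and the step size are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))            # a general (non-symmetric) matrix
x = rng.standard_normal(d)
f = lambda x: x @ A @ x                    # f(x) = x^T A x

eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(d)])
print(numeric)
print((A + A.T) @ x)                       # the analytic gradient; the two should match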
Similarly, for a linear function $f(x) = b^T x$ with $b = (b_1, \ldots, b_d)^T$ we have $\frac{\partial f}{\partial x_i} = b_i$ and, thus, the gradient
$$\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_d}\right)^T = b.$$
Bibliography
Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from data. AMLbook, 2012.
Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from data. Dynamic E-
Chapters. AMLbook, 2015.
Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, and Robert E. Schapire. Contextual
bandit learning with predictable rewards. In Proceedings on the International Conference on Artificial
Intelligence and Statistics (AISTATS), 2012.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit
problem. Machine Learning, 47, 2002a.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed
bandit problem. SIAM Journal of Computing, 32(1), 2002b.
Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic
Theory of Independence. Oxford University Press, 2013.
Sébastien Bubeck. Bandits Games and Clustering Foundations. PhD thesis, Université Lille, 2010.
Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed
bandit problems. Foundations and Trends in Machine Learning, 5, 2012.
Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press,
2006.
Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley Series in Telecommuni-
cations and Signal Processing, 2nd edition, 2006.
Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. PAC-Bayesian learning
of linear classifiers. In Proceedings of the International Conference on Machine Learning (ICML),
2009.
Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand, and Jean-Francis Roy. Risk
bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. Journal of
Machine Learning Research, 16, 2015.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American
Statistical Association, 58(301):13–30, 1963.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in
Applied Mathematics, 6, 1985.
John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning
Research, 6, 2005.
Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Com-
putation, 108, 1994.
Katalin Marton. A measure concentration inequality for contracting Markov chains. Geometric and
Functional Analysis, 6(3), 1996.
Katalin Marton. A measure concentration inequality for contracting Markov chains Erratum. Geometric
and Functional Analysis, 7(3), 1997.
Andrés R. Masegosa, Stephan S. Lorenzen, Christian Igel, and Yevgeny Seldin. Second order PAC-
Bayesian bounds for the weighted majority vote. Technical report, https://arxiv.org/abs/2007.13532, 2020.
Andreas Maurer. A note on the PAC-Bayesian theorem. www.arxiv.org, 2004.
David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51, 2003.
Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Proba-
bilistic Analysis. Cambridge University Press, 2005.
Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American
Mathematical Society, 1952.
Paul-Marie Samson. Concentration of measure inequalities for Markov chains and φ-mixing processes.
The Annals of Probability, 28(1), 2000.
Matthias Seeger. PAC-Bayesian generalization error bounds for Gaussian process classification. Journal
of Machine Learning Research, 3, 2002.
Yevgeny Seldin. The space of online learning problems. ECML-PKDD Tutorial. https://sites.google.
com/site/spaceofonlinelearningproblems/, 2015.
Yevgeny Seldin and Gábor Lugosi. A lower bound for multi-armed bandits with expert advice. In
Proceedings of the European Workshop on Reinforcement Learning (EWRL), 2016.
Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-
Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58, 2012.
Gilles Stoltz. Incomplete Information and Internal Regret in Prediction of Individual Sequences. PhD
thesis, Université Paris-Sud, 2005.
Gilbert Strang. Introduction to linear algebra. Wellesley-Cambridge Press, 4th edition, 2009.
Niklas Thiemann, Christian Igel, Olivier Wintenberger, and Yevgeny Seldin. A strongly quasiconvex
PAC-Bayesian bound. In Proceedings of the International Conference on Algorithmic Learning Theory
(ALT), 2017.
Ilya Tolstikhin and Yevgeny Seldin. PAC-Bayes-Empirical-Bernstein inequality. In Advances in Neural
Information Processing Systems (NIPS), 2013.
Vladimir Vovk. Aggregating strategies. In Proceedings of the Conference on Learning Theory (COLT),
1990.