
CHAPTER 2

Estimating Probabilities

Machine Learning
Copyright © 2017 Tom M. Mitchell. All rights reserved.
*DRAFT OF January 26, 2018*

*PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR’S PERMISSION*

This is a rough draft chapter intended for inclusion in the upcoming second
edition of the textbook Machine Learning, T.M. Mitchell, McGraw Hill.
You are welcome to use this for educational purposes, but do not duplicate
or repost it on the internet. For online copies of this and other materials
related to this book, visit the web site www.cs.cmu.edu/~tom/mlbook.html.
Please send suggestions for improvements, or suggested exercises, to
[email protected].

Many machine learning methods depend on probabilistic approaches. The reason is simple: when we are interested in learning some target function f : X → Y, we can more generally learn the probabilistic function P(Y|X).
By using a probabilistic approach, we can design algorithms that learn func-
tions with uncertain outcomes (e.g., predicting tomorrow’s stock price) and
that incorporate prior knowledge to guide learning (e.g., a bias that tomor-
row’s stock price is likely to be similar to today’s price). This chapter de-
scribes joint probability distributions over many variables, and shows how
they can be used to calculate a target P(Y |X). It also considers the problem
of learning, or estimating, probability distributions from training data, pre-
senting the two most common approaches: maximum likelihood estimation
and maximum a posteriori estimation.

1 Joint Probability Distributions


The key to building probabilistic models is to define a set of random variables,
and to consider the joint probability distribution over them. For example, Table
1 defines a joint probability distribution over three random variables: a person’s

1
Copyright
c 2016, Tom M. Mitchell. 2

Gender HoursWorked Wealth probability


female < 40.5 poor 0.2531
female < 40.5 rich 0.0246
female ≥ 40.5 poor 0.0422
female ≥ 40.5 rich 0.0116
male < 40.5 poor 0.3313
male < 40.5 rich 0.0972
male ≥ 40.5 poor 0.1341
male ≥ 40.5 rich 0.1059

Table 1: A Joint Probability Distribution. This table defines a joint probability distri-
bution over three random variables: Gender, HoursWorked, and Wealth.

Gender, the number of HoursWorked each week, and their Wealth. In general,
defining a joint probability distribution over a set of discrete-valued variables in-
volves three simple steps:

1. Define the random variables, and the set of values each variable can take
on. For example, in Table 1 the variable Gender can take on the value
male or female, the variable HoursWorked can take on the value "< 40.5" or "≥ 40.5," and Wealth can take on values rich or poor.

2. Create a table containing one row for each possible joint assignment of val-
ues to the variables. For example, Table 1 has 8 rows, corresponding to the 8
possible ways of jointly assigning values to three boolean-valued variables.
More generally, if we have n boolean-valued variables, there will be 2^n rows in the table.

3. Define a probability for each possible joint assignment of values to the vari-
ables. Because the rows cover every possible joint assignment of values,
their probabilities must sum to 1.

The joint probability distribution is central to probabilistic inference, because


once we know the joint distribution we can answer every possible probabilistic
question that can be asked about these variables. We can calculate conditional or
joint probabilities over any subset of the variables, given their joint distribution.
This is accomplished by operating on the probabilities for the relevant rows in the
table. For example, we can calculate:

• The probability that any single variable will take on any specific value. For
example, we can calculate that the probability P(Gender = male) = 0.6685
for the joint distribution in Table 1, by summing the four rows for which
Gender = male. Similarly, we can calculate the probability P(Wealth =
rich) = 0.2393 by adding together the probabilities for the four rows cover-
ing the cases for which Wealth=rich.

• The probability that any subset of the variables will take on a particular joint
assignment. For example, we can calculate that the probability P(Wealth=rich∧
Gender=female) = 0.0362, by summing the two table rows that satisfy this
joint assignment.

• Any conditional probability defined over subsets of the variables. Recall


the definition of conditional probability P(Y |X) = P(X ∧Y )/P(X). We can
calculate both the numerator and denominator in this definition by sum-
ming appropriate rows, to obtain the conditional probability. For example,
according to Table 1, P(Wealth=rich|Gender=female) = 0.0362/0.3315 =
0.1092.
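
To make the row-summing operations above concrete, here is a minimal Python sketch (our own illustration, not part of the original text) that represents Table 1 as a dictionary and computes the marginal, joint, and conditional probabilities just discussed:

def main():
    # Joint distribution from Table 1, keyed by (gender, hours_worked, wealth).
    joint = {
        ("female", "<40.5",  "poor"): 0.2531, ("female", "<40.5",  "rich"): 0.0246,
        ("female", ">=40.5", "poor"): 0.0422, ("female", ">=40.5", "rich"): 0.0116,
        ("male",   "<40.5",  "poor"): 0.3313, ("male",   "<40.5",  "rich"): 0.0972,
        ("male",   ">=40.5", "poor"): 0.1341, ("male",   ">=40.5", "rich"): 0.1059,
    }

    def prob(condition):
        # Sum the probabilities of all rows that satisfy `condition`.
        return sum(p for row, p in joint.items() if condition(row))

    p_male = prob(lambda r: r[0] == "male")                                  # ~0.6685
    p_rich_and_female = prob(lambda r: r[2] == "rich" and r[0] == "female")  # ~0.0362
    p_female = prob(lambda r: r[0] == "female")                              # ~0.3315
    p_rich_given_female = p_rich_and_female / p_female                       # ~0.1092
    print(p_male, p_rich_and_female, p_rich_given_female)

if __name__ == "__main__":
    main()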

To summarize, if we know the joint probability distribution over an arbitrary set of random variables {X1 . . . Xn }, then we can calculate the conditional
and joint probability distributions for arbitrary subsets of these variables (e.g.,
P(Xn |X1 . . . Xn−1 )). In theory, we can in this way solve any classification, re-
gression, or other function approximation problem defined over these variables,
and furthermore produce probabilistic rather than deterministic predictions for
any given input to the target function.1 For example, if we wish to learn to
predict which people are rich or poor based on their gender and hours worked,
we can use the above approach to simply calculate the probability distribution
P(Wealth | Gender, HoursWorked).

1.1 Learning the Joint Distribution


How can we learn joint distributions from observed training data? In the example
of Table 1 it will be easy if we begin with a large database containing, say, descrip-
tions of a million people in terms of their values for our three variables. Given a
large data set such as this, one can easily estimate a probability for each row in the
table by calculating the fraction of database entries (people) that satisfy the joint
assignment specified for that row. If thousands of database entries fall into each
row, we will obtain highly reliable probability estimates using this strategy.
In other cases, however, it can be difficult to learn the joint distribution due to
the very large amount of training data required. To see the point, consider how our
learning problem would change if we were to add additional variables to describe
a total of 100 boolean features for each person in Table 1 (e.g., we could add "do they have a college degree?", "are they healthy?"). Given 100 boolean features, the number of rows in the table would now expand to 2^100, which is greater than 10^30. Unfortunately, even if our database describes every single person on earth we would not have enough data to obtain reliable probability estimates for most rows. There are only approximately 10^10 people on earth, which means that for most of the 10^30 rows in our table, we would have zero training examples! This is a significant problem given that real-world machine learning applications often use many more than 100 features to describe each example – for example, many learning algorithms for text analysis use millions of features to describe text in a given document.

1 Of course if our random variables have continuous values instead of discrete, we would need an infinitely large table. In such cases we represent the joint distribution by a function instead of a table, but the principles for using the joint distribution remain unchanged.
To successfully address the issue of learning probabilities from available train-
ing data, we must (1) be smart about how we estimate probability parameters from
available data, and (2) be smart about how we represent joint probability distribu-
tions.

2 Estimating Probabilities
Let us begin our discussion of how to estimate probabilities with a simple exam-
ple, and explore two intuitive algorithms. It will turn out that these two intuitive
algorithms illustrate the two primary approaches used in nearly all probabilistic
machine learning algorithms.
In this simple example you have a coin, represented by the random variable
X. If you flip this coin, it may turn up heads (indicated by X = 1) or tails (X = 0).
The learning task is to estimate the probability that it will turn up heads; that is, to
estimate P(X = 1). We will use θ to refer to the true (but unknown) probability of
heads (e.g., P(X = 1) = θ), and use θ̂ to refer to our learned estimate of this true
θ. You gather training data by flipping the coin n times, and observe that it turns
up heads α1 times, and tails α0 times. Of course n = α1 + α0 .
What is the most intuitive approach to estimating θ = P(X =1) from this train-
ing data? Most people immediately answer that we should estimate the probability
by the fraction of flips that result in heads:

Probability estimation Algorithm 1 (maximum likelihood). Given observed training data producing α1 total "heads," and α0 total "tails," output the estimate

    θ̂ = α1 / (α1 + α0)
For example, if we flip the coin 50 times, observing 24 heads and 26 tails, then
we will estimate the probability P(X = 1) to be θ̂ = 0.48.
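
As a small illustration, the following Python sketch (ours, not from the text) implements Algorithm 1 directly on a list of observed flips:

def mle_estimate(flips):
    # Estimate theta = P(X = 1) as the fraction of flips that came up heads.
    alpha1 = sum(flips)             # number of heads (X = 1)
    alpha0 = len(flips) - alpha1    # number of tails (X = 0)
    return alpha1 / (alpha1 + alpha0)

flips = [1] * 24 + [0] * 26         # 24 heads, 26 tails, as in the example above
print(mle_estimate(flips))          # 0.48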
This approach is quite reasonable, and very intuitive. It is a good approach
when we have plenty of training data. However, notice that if the training data is
very scarce it can produce unreliable estimates. For example, if we observe only
3 flips of the coin, we might observe α1 = 1 and α0 = 2, producing the estimate
θ̂ = 0.33. How would we respond to this? If we have prior knowledge about the
coin – for example, if we recognize it as a government minted coin which is likely
to have θ close to 0.5 – then we might respond by still believing the probability is
closer to 0.5 than to the algorithm 1 estimate θ̂ = 0.33. This leads to our second
intuitive algorithm: an algorithm that enables us to incorporate prior assumptions
along with observed training data to produce our final estimate. In particular,
Algorithm 2 allows us to express our prior assumptions or knowledge about the

coin by adding in any number of imaginary coin flips resulting in heads or tails.
We can use this option of introducing γ1 imaginary heads, and γ0 imaginary tails,
to express our prior assumptions:

Probability estimation Algorithm 2 (maximum a posteriori probability). Given observed training data producing α1 observed "heads," and α0 observed "tails," plus prior information expressed by introducing γ1 imaginary "heads" and γ0 imaginary "tails," output the estimate

    θ̂ = (α1 + γ1) / ((α1 + γ1) + (α0 + γ0))

Note that Algorithm 2, like Algorithm 1, produces an estimate based on the proportion of coin flips that result in "heads." The only difference is that Algo-
rithm 2 allows including optional imaginary flips that represent our prior assump-
tions about θ, in addition to actual observed data. Algorithm 2 has several attrac-
tive properties:

• It is easy to incorporate our prior assumptions about the value of θ by adjusting the ratio of γ1 to γ0. For example, if we have reason to assume
that θ = 0.7 we can add in γ1 = 7 imaginary flips with X = 1, and γ0 = 3
imaginary flips with X = 0.

• It is easy to express our degree of certainty about our prior knowledge, by adjusting the total volume of imaginary coin flips. For example, if we are
highly certain of our prior belief that θ = 0.7, then we might use priors of
γ1 = 700 and γ0 = 300 instead of γ1 = 7 and γ0 = 3. By increasing the
volume of imaginary examples, we effectively require a greater volume of
contradictory observed data in order to produce a final estimate far from our
prior assumed value.

• If we set γ1 = γ0 = 0, then Algorithm 2 produces exactly the same estimate as Algorithm 1. Algorithm 1 is just a special case of Algorithm 2.

• Asymptotically, as the volume of actual observed data grows toward infinity, the influence of our imaginary data goes to zero (the fixed number of
imaginary coin flips becomes insignificant compared to a sufficiently large
number of actual observations). In other words, Algorithm 2 behaves so
that priors have the strongest influence when observations are scarce, and
their influence gradually reduces as observations become more plentiful.
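
Continuing the earlier sketch (again our own code, mirroring the Algorithm 2 formula), the MAP estimate is a one-line change that adds the imaginary counts, and its behavior with scarce data is easy to see:

def map_estimate(flips, gamma1, gamma0):
    # Estimate theta with gamma1 imaginary heads and gamma0 imaginary tails.
    alpha1 = sum(flips)
    alpha0 = len(flips) - alpha1
    return (alpha1 + gamma1) / ((alpha1 + gamma1) + (alpha0 + gamma0))

flips = [1, 0, 0]                        # 1 head, 2 tails: very little data
print(map_estimate(flips, 0, 0))         # 0.333..., identical to Algorithm 1
print(map_estimate(flips, 7, 3))         # ~0.615, pulled toward the prior belief 0.7
print(map_estimate(flips, 700, 300))     # ~0.699, a much more confident prior dominates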

Both Algorithm 1 and Algorithm 2 are intuitively quite compelling. In fact, these two algorithms exemplify the two most widely used approaches to machine
learning of probabilistic models from training data. They can be shown to follow
from two different underlying principles. Algorithm 1 follows a principle called
Maximum Likelihood Estimation (MLE), in which we seek an estimate of θ that
maximizes the probability of the observed data. In fact we can prove (and will, below) that Algorithm 1 outputs an estimate of θ that makes the observed data at least as probable as any other possible estimate of θ. Algorithm 2 follows a different principle called Maximum a Posteriori (MAP) estimation, in which we seek the estimate of θ that is itself most probable, given the observed data, plus background assumptions about its value. Thus, the difference between these two principles is that Algorithm 2 assumes background knowledge is available, whereas Algorithm 1 does not. Both principles have been widely used to derive and to justify a vast range of machine learning algorithms, from Bayesian networks, to linear regression, to neural network learning. Our coin flip example represents just one of many such learning problems.

Figure 1: MLE and MAP estimates of θ as the number of coin flips grows. Data was generated by a random number generator that output a value of 1 with probability θ = 0.3, and a value of 0 with probability of (1 − θ) = 0.7. Each plot shows the two estimates of θ as the number of observed coin flips grows. Plots on the left correspond to values of γ1 and γ0 that reflect the correct prior assumption about the value of θ, plots on the right reflect the incorrect prior assumption that θ is most probably 0.4. Plots in the top row reflect lower confidence in the prior assumption, by including only 60 = γ1 + γ0 imaginary data points, whereas bottom plots assume 120. Note as the size of the data grows, the MLE and MAP estimates converge toward each other, and toward the correct estimate for θ.
The experimental behavior of these two algorithms is shown in Figure 1. Here

the learning task is to estimate the unknown value of θ = P(X = 1) for a boolean-
valued random variable X, based on a sample of n values of X drawn indepen-
dently (e.g., n independent flips of a coin with probability θ of heads). In this
figure, the true value of θ is 0.3, and the same sequence of training examples is
used in each plot. Consider first the plot in the upper left. The blue line shows
the estimates of θ produced by Algorithm 1 (MLE) as the number n of training
examples grows. The red line shows the estimates produced by Algorithm 2, us-
ing the same training examples and using priors γ0 = 42 and γ1 = 18. This prior
assumption aligns with the correct value of θ (i.e., [γ1 /(γ1 + γ0 )] = 0.3). Note
that as the number of training example coin flips grows, both algorithms converge
toward the correct estimate of θ, though Algorithm 2 provides much better esti-
mates than Algorithm 1 when little data is available. The bottom left plot shows
the estimates if Algorithm 2 uses even more confident priors, captured by twice as
many imaginary examples (γ0 = 84 and γ1 = 36). The two plots on the right side
of the figure show the estimates produced when Algorithm 2 (MAP) uses incor-
rect priors (where [γ1 /(γ1 + γ0 )] = 0.4). The difference between the top right and
bottom right plots is again only a difference in the number of imaginary examples,
reflecting the difference in confidence that θ should be close to 0.4.
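
The experiment in Figure 1 is easy to reproduce. The sketch below (our own code; the random seed and the reported sample sizes are arbitrary choices) simulates flips with θ = 0.3 and tracks both estimates as n grows, using the correct prior of the top-left plot:

import random

random.seed(0)
theta_true = 0.3
gamma1, gamma0 = 18, 42            # correct prior: 18/60 = 0.3 (top-left plot of Figure 1)

alpha1 = alpha0 = 0
for n in range(1, 1001):
    if random.random() < theta_true:
        alpha1 += 1
    else:
        alpha0 += 1
    mle = alpha1 / (alpha1 + alpha0)
    map_ = (alpha1 + gamma1) / ((alpha1 + gamma1) + (alpha0 + gamma0))
    if n in (10, 100, 1000):
        print(f"n={n:5d}  MLE={mle:.3f}  MAP={map_:.3f}")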

2.1 Maximum Likelihood Estimation (MLE)


Maximum Likelihood Estimation, often abbreviated MLE, estimates one or more
probability parameters θ based on the principle that if we observe training data D,
we should choose the value of θ that makes D most probable. When applied to
the coin flipping problem discussed above, it yields Algorithm 1. The definition
of the MLE in general is

θ̂MLE = arg max_θ P(D|θ)    (1)

The intuition underlying this principle is simple: we are more likely to observe
data D if we are in a world where the appearance of this data is highly probable.
Therefore, we should estimate θ by assigning it whatever value maximizes the
probability of having observed D.
Beginning with this principle for choosing among possible estimates of θ, it
is possible to mathematically derive a formula for the value of θ that provably
maximizes P(D|θ). Many machine learning algorithms are defined so that they
provably learn a collection of parameter values that follow this maximum likeli-
hood principle. Below we derive Algorithm 1 for our above coin flip example,
beginning with the maximum likelihood principle.
To precisely define our coin flipping example, let X be a random variable
which can take on either value 1 or 0, and let θ = P(X = 1) refer to the true, but
possibly unknown, probability that a random draw of X will take on the value 1.2
Assume we flip the coin X a number of times to produce training data D, in which we observe X = 1 a total of α1 times, and X = 0 a total of α0 times. We further assume that the outcomes of the flips are independent (i.e., the result of one coin flip has no influence on other coin flips), and identically distributed (i.e., the same value of θ governs each coin flip). Taken together, these assumptions are that the coin flips are independent, identically distributed (which is often abbreviated to "i.i.d.").

2 A random variable defined in this way is called a Bernoulli random variable, and the probability distribution it follows, defined by θ, is called a Bernoulli distribution.
The maximum likelihood principle involves choosing θ to maximize P(D|θ).
Therefore, we must begin by writing an expression for P(D|θ), or equivalently
P(α1 , α0 |θ) in terms of θ, then find an algorithm that chooses a value for θ that
maximizes this quantity. To begin, note that if data D consists of just one coin flip,
then P(D|θ) = θ if that one coin flip results in X = 1, and P(D|θ) = (1−θ) if the
result is instead X = 0. Furthermore, if we observe a set of i.i.d. coin flips such
as D = {1, 1, 0, 1, 0}, then we can easily calculate P(D|θ) by multiplying together
the probabilities of each individual coin flip:

P(D = {1, 1, 0, 1, 0} | θ) = θ · θ · (1−θ) · θ · (1−θ) = θ^3 · (1−θ)^2

In other words, if we summarize D by the total number of observed times α1 when X = 1 and the number of times α0 that X = 0, we have in general

P(D = ⟨α1, α0⟩ | θ) = θ^α1 (1−θ)^α0    (2)

The above expression gives us a formula for P(D = ⟨α1, α0⟩ | θ). The quantity
P(D|θ) is often called the data likelihood, or the data likelihood function because
it expresses the probability of the observed data D as a function of θ. This likeli-
hood function is often written L(θ) = P(D|θ).
Our final step in this derivation is to determine the value of θ that maximizes
the data likelihood function P(D = ⟨α1, α0⟩ | θ). Notice that maximizing P(D|θ)
with respect to θ is equivalent to maximizing its logarithm, ln P(D|θ) with respect
to θ, because ln(x) increases monotonically with x:

arg max_θ P(D|θ) = arg max_θ ln P(D|θ)

It often simplifies the mathematics to maximize ln P(D|θ) rather than P(D|θ), as is the case in our current example. In fact, this log likelihood is so common that it has its own notation, ℓ(θ) = ln P(D|θ).
To find the value of θ that maximizes ln P(D|θ), and therefore also maximizes
P(D|θ), we can calculate the derivative of ln P(D = ⟨α1, α0⟩ | θ) with respect to
θ, then solve for the value of θ that makes this derivative equal to zero. Because
ln P(D|θ) is a concave function of θ, the value of θ where this derivative is zero
will be the value that maximizes ln P(D|θ). First, we calculate the derivative of
the log of the likelihood function of Eq. (2):
∂ℓ(θ)/∂θ = ∂ ln P(D|θ) / ∂θ
         = ∂ ln[θ^α1 (1−θ)^α0] / ∂θ
         = ∂ [α1 ln θ + α0 ln(1−θ)] / ∂θ
         = α1 ∂ ln θ/∂θ + α0 ∂ ln(1−θ)/∂θ
         = α1 ∂ ln θ/∂θ + α0 (∂ ln(1−θ)/∂(1−θ)) (∂(1−θ)/∂θ)
         = α1 (1/θ) + α0 (1/(1−θ)) (−1)    (3)

where the last step follows from the equality ∂ ln x/∂x = 1/x, and where the next to last step follows from the chain rule ∂f(x)/∂x = (∂f(x)/∂g(x)) (∂g(x)/∂x).
Finally, to calculate the value of θ that maximizes ℓ(θ), we set the derivative in equation (3) to zero, and solve for θ:

0 = α1 (1/θ) − α0 (1/(1−θ))
α0 (1/(1−θ)) = α1 (1/θ)
α0 θ = α1 (1 − θ)
(α1 + α0) θ = α1
θ = α1 / (α1 + α0)    (4)
Thus, we have derived in equation (4) the intuitive Algorithm 1 for estimating
θ, starting from the principle that we want to choose the value of θ that maximizes
P(D|θ).
θ̂MLE = arg max_θ P(D|θ) = arg max_θ ln P(D|θ) = α1 / (α1 + α0)    (5)
This same maximum likelihood principle is used as the basis for deriving many
machine learning algorithms for more complex problems where the solution is not
so intuitively obvious.
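
As a quick sanity check of this derivation, one can evaluate the log likelihood ℓ(θ) = α1 ln θ + α0 ln(1−θ) on a grid of θ values and confirm that it peaks at α1/(α1 + α0). A small Python sketch (ours, for illustration only), using the 50-flip example from earlier:

import math

alpha1, alpha0 = 24, 26                        # 24 heads, 26 tails

def log_likelihood(theta):
    return alpha1 * math.log(theta) + alpha0 * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]      # theta values in (0, 1)
best = max(grid, key=log_likelihood)
print(best, alpha1 / (alpha1 + alpha0))        # both are 0.48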

2.2 Maximum a Posteriori Probability Estimation (MAP)


Maximum a Posteriori Estimation, often abbreviated to MAP estimation, esti-
mates one or more probability parameters θ based on the principle that we should
choose the value of θ that is most probable, given the observed data D and our
prior assumptions summarized by P(θ).
θ̂MAP = arg max_θ P(θ|D)

When applied to the coin flipping problem discussed above, it yields Algorithm
2. Using Bayes rule, we can rewrite the MAP principle as:
θ̂MAP = arg max_θ P(θ|D) = arg max_θ P(D|θ)P(θ) / P(D)

and given that P(D) does not depend on θ, we can simplify this by ignoring the
denominator:

θ̂MAP = arg max_θ P(θ|D) = arg max_θ P(D|θ)P(θ)    (6)

Comparing this to the MLE principle described in equation (1), we see that whereas
the MLE principle is to choose θ to maximize P(D|θ), the MAP principle instead
maximizes P(D|θ)P(θ). The only difference is the extra P(θ).
To produce a MAP estimate for θ we must specify a prior distribution P(θ)
that summarizes our a priori assumptions about the value of θ. In the case where
data is generated by multiple i.i.d. draws of a Bernoulli random variable, as in our
coin flip example, the most common form of prior is a Beta distribution:

P(θ) = Beta(β0, β1) = θ^(β1−1) (1−θ)^(β0−1) / B(β0, β1)    (7)

Here β0 and β1 are parameters whose values we must specify in advance to define
a specific P(θ). As we shall see, choosing values for β0 and β1 corresponds to
choosing the number of imaginary examples γ0 and γ1 in the above Algorithm
2. The denominator B(β0 , β1 ) is a normalization term defined by the function B,
which assures the probability integrates to one, but which is independent of θ.
As defined in Eq. (6), the MAP estimate involves choosing the value of θ that
maximizes P(D|θ)P(θ). Recall we already have an expression for P(D|θ) in Eq.
(2). Combining this with the above expression for P(θ) we have:

θ̂MAP = arg max_θ P(D|θ)P(θ)
     = arg max_θ θ^α1 (1−θ)^α0 · θ^(β1−1) (1−θ)^(β0−1) / B(β0, β1)
     = arg max_θ θ^(α1+β1−1) (1−θ)^(α0+β0−1) / B(β0, β1)
     = arg max_θ θ^(α1+β1−1) (1−θ)^(α0+β0−1)    (8)

where the final line follows from the previous line because B(β0 , β1 ) is indepen-
dent of θ.
How can we solve for the value of θ that maximizes the expression in Eq. (8)?
Fortunately, we have already answered this question! Notice that the quantity we
seek to maximize in Eq. (8) can be made identical to the likelihood function in Eq.
(2) if we substitute (α1 + β1 − 1) for α1 in Eq. (2), and substitute (α0 + β0 − 1)
for α0 . We can therefore reuse the derivation of θ̂MLE beginning from Eq. (2) and
ending with Eq. (4), simply by carrying through this substitution. Applying this
same substitution to Eq. (4) implies the solution to Eq. (8) is therefore

θ̂MAP = arg max_θ P(D|θ)P(θ) = (α1 + β1 − 1) / ((α1 + β1 − 1) + (α0 + β0 − 1))    (9)

Figure 2: Prior (left) and posterior (right) probability distributions on θ in a MAP estimate for θ. Consider estimating θ = P(X = 1) for a boolean valued (Bernoulli)
random variable X. Suppose we set a prior P(θ) defined by a Beta distribution with
β0 = 3, β1 = 4, as shown on the left. Suppose further that we now observe data D con-
sisting of 100 examples: 50 where we observe X = 1 and 50 where X = 0. Then the
posterior probability P(θ|D) over θ given this observed data D, which is proportional to
P(D|θ)P(θ), is another Beta distribution with β0 = 53, β1 = 54, as shown on the right.
Notice that both distributions assign non-zero probability to every possible value of θ
between 0 and 1, though the posterior distribution has most of its probability mass near
θ = 0.5.

Thus, we have derived in Eq. (9) the intuitive Algorithm 2 for estimating θ,
starting from the principle that we want to choose the value of θ that maximizes
P(θ|D). The number γ1 of imaginary ”heads” in Algorithm 2 is equal to β1 − 1,
and the number γ0 of imaginary ”tails” is equal to β0 − 1. This same maximum
a posteriori probability principle is used as the basis for deriving many machine
learning algorithms for more complex problems where the solution is not so intu-
itively obvious as it is in our coin flipping example.
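
The correspondence between the Beta parameters and imaginary flips is easy to check numerically. Below is a small sketch (ours) that evaluates the unnormalized posterior θ^(α1+β1−1) (1−θ)^(α0+β0−1) on a grid and compares its peak with the closed-form MAP estimate of Eq. (9), using the Figure 2 setting β0 = 3, β1 = 4 and 50 observed heads, 50 observed tails:

alpha1, alpha0 = 50, 50      # observed heads and tails (Figure 2 example)
beta1, beta0 = 4, 3          # Beta prior: gamma1 = 3 imaginary heads, gamma0 = 2 imaginary tails

def unnormalized_posterior(theta):
    return theta ** (alpha1 + beta1 - 1) * (1 - theta) ** (alpha0 + beta0 - 1)

grid = [i / 10000 for i in range(1, 10000)]
peak = max(grid, key=unnormalized_posterior)

closed_form = (alpha1 + beta1 - 1) / ((alpha1 + beta1 - 1) + (alpha0 + beta0 - 1))
print(peak, closed_form)     # both are approximately 0.505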

2.2.1 MAP Priors and Posteriors


Why did we choose above to use the Beta(β0, β1) family of probability distributions to define our prior probability P(θ) when calculating the MAP estimate θ̂MAP?
We made this choice because the Beta distribution has a functional form that is
the same as the data likelihood P(D|θ) in our problem, so that when we multiply
these two forms together to get the quantity P(D|θ)P(θ) , this product is easily
expressed as yet another expression of this same functional form. Furthermore, in
this product the β0 and β1 parameters that define our Beta distribution play exactly
the same role as the observed data counts; that is, they capture the effect of the
Beta prior P(θ) in a form that is equivalent to specifying the number of imaginary examples used in our earlier Algorithm 2 (see footnote 3).
Figure 2 shows an example of the prior P(θ) and posterior P(θ|D) ∝ P(D|θ)P(θ)
distributions corresponding to a MAP estimate in our example.
3 More precisely, the number of imaginary examples γi in Algorithm 2 is given by βi − 1.

The Beta(β0, β1) distribution defined in Eq. (7) is called the conjugate prior for the binomial likelihood function θ^α1 (1−θ)^α0, because the posterior distribution P(θ|D) ∝ P(D|θ)P(θ) is also a Beta distribution. More generally, any P(θ) is called the
conjugate prior for a likelihood function L(θ) = P(D|θ) if the posterior P(θ|D) is
of the same form as P(θ).

3 Working with Other Probability Distributions


Formally, the probability distribution we considered above is called a Bernoulli
distribution: it governs a random variable X which can take on two possible val-
ues, 0 or 1, and this Bernoulli distribution is defined by the single parameter
θ (i.e., P(X = 1) = θ, and P(X = 0) = 1 − θ). We will sometimes refer to a
boolean-valued random variable which is governed by a Bernoulli distribution as
a Bernoulli random variable. As noted above, the conjugate prior for estimating
the parameter θ of a Bernoulli distribution is the Beta distribution.

3.1 Discrete Valued Random Variables with Many Values


If we have a random variable that can take on more than two values then we need
more than one parameter to describe the probability distribution for that variable.
For example, consider a six-sided die which, when rolled, can come up with any
of 6 possible results.
A common approach to characterizing such n-valued random variables is to
use a generalization of the Bernoulli distribution called a Categorical distribu-
tion, where we assign a different probability to each possible value of the ran-
dom variable. For example, we might model a six-sided die as a random vari-
able X that can take on the values 1 through 6, and represent its probability
distribution by a vector θ of six different parameters θ = ⟨θ1 . . . θ6⟩, where the
parameter θi describes the probability that X will take on its ith value; that is,
θi = P(X = i). The likelihood function L(θ) = P(D|θ) for estimating the vector θ of parameters from observed rolls of the die is a simple generalization of the likelihood for estimating a Bernoulli distribution. It takes the form of a product L(θ) = P(D|θ) = θ1^α1 θ2^α2 · · · θn^αn, where αi indicates the observed count of times
when the value X = i was observed in the data. Given a sample of data drawn
from a particular Categorical distribution for a random variable that can take on n
possible values, the maximum likelihood estimate for θi is given by
θ̂i^MLE = αi / (α1 + . . . + αn)    (10)
where α j indicates the number of times the value X = j was observed in the data.
Note that just like the case of a binary-valued random variable (see eq. 5), the
MLE is simply the fraction of times the particular value was observed in the data.
If we prefer a MAP estimate for an n-valued random variable governed by a
Categorical distribution, we use the conjugate prior for the Categorical distribu-
tion, which is called the Dirichlet distribution. Of course given that the Categorical

distribution has n different θi parameters, its prior will have to specify the proba-
bility for each joint assignment of these n parameters. The Dirichlet distribution
is a generalization of the Beta distribution, and has the form
P(θ1, . . . , θn) = θ1^(β1−1) θ2^(β2−1) · · · θn^(βn−1) / B(β1, . . . , βn)
where the denominator is again a normalizing function to assure that the total
probability mass is 1, and where this normalizing function B(β1 , . . . , βn ) is inde-
pendent of the vector of parameters θ = ⟨θ1 . . . θn⟩ and therefore can be ignored
when deriving their MAP estimates.
The MAP estimate for each θi for a Categorical distribution is given by

θ̂i^MAP = (αi + βi − 1) / ((α1 + β1 − 1) + . . . + (αn + βn − 1))    (11)
where α j indicates the number of times the value X = j was observed in the
data, and where the β j s are the parameters of the Dirichlet prior which reflects
our prior knowledge or assumptions. Here again, we can view the MAP estimate
as combining the observed data given by the α j values with β j − 1 additional
imaginary observations for X = j. Comparing this formula to the earlier formula
giving the MAP estimate for a Bernoulli random variable (eq. 9), it is easy to see
that this is a direct generalization of that simpler case, and that it again follows the
intuition of our earlier Algorithm 2.
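
As an illustration, the sketch below (our own code; the die counts are invented) computes the MLE and MAP estimates of Eqs. (10) and (11) from counts of die rolls, using a symmetric Dirichlet prior with all βj equal:

# Hypothetical counts of each face (1..6) in 60 rolls of a die.
counts = [12, 8, 10, 9, 11, 10]          # alpha_1 .. alpha_6
betas = [3, 3, 3, 3, 3, 3]               # Dirichlet prior: 2 imaginary rolls per face

n = sum(counts)
theta_mle = [a / n for a in counts]                                  # Eq. (10)

denom = sum(a + b - 1 for a, b in zip(counts, betas))
theta_map = [(a + b - 1) / denom for a, b in zip(counts, betas)]     # Eq. (11)

print(theta_mle)   # fractions of observed rolls
print(theta_map)   # pulled slightly toward the uniform prior 1/6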

4 What You Should Know


The main points of this chapter include:

• Joint probability distributions lie at the core of probabilistic machine learning approaches. Given the joint probability distribution P(X1 . . . Xn) over a
set of random variables, it is possible in principle to compute any joint or
conditional probability defined over any subset of these variables.

• Learning, or estimating, the joint probability distribution from training data can be easy if the data set is large compared to the number of distinct prob-
ability terms we must estimate. But in many practical problems the data
is more sparse, requiring methods that rely on prior knowledge or assump-
tions, in addition to observed data.

• Maximum likelihood estimation (MLE) is one of two widely used principles for estimating the parameters that define a probability distribution. This
principle is to choose the set of parameter values θ̂MLE that makes the ob-
served training data most probable (over all the possible choices of θ):

θ̂MLE = arg max_θ P(data|θ)

In many cases, maximum likelihood estimates correspond to the intuitive notion that we should base probability estimates on observed ratios. For
example, given the problem of estimating the probability that a coin will
turn up heads, given α1 observed flips resulting in heads, and α0 observed
flips resulting in tails, the maximum likelihood estimate corresponds exactly
to taking the fraction of flips that turn up heads:
θ̂MLE = arg max_θ P(data|θ) = α1 / (α1 + α0)

• Maximum a posteriori probability (MAP) estimation is the other of the two widely used principles. This principle is to choose the most probable value
of θ, given the observed training data plus a prior probability distribution
P(θ) which captures prior knowledge or assumptions about the value of θ:

θ̂MAP = arg max_θ P(θ|data) = arg max_θ P(data|θ)P(θ)

In many cases, MAP estimates correspond to the intuitive notion that we can represent prior assumptions by making up "imaginary" data which re-
flects these assumptions. For example, the MAP estimate for the above coin
flip example, assuming a prior P(θ) = Beta(γ0 + 1, γ1 + 1), yields a MAP
estimate which is equivalent to the MLE estimate if we simply add in an
imaginary γ1 heads and γ0 tails to the actual observed α1 heads and α0 tails:

θ̂MAP = arg max_θ P(data|θ)P(θ) = (α1 + γ1) / ((α1 + γ1) + (α0 + γ0))

EXERCISES
1. In the MAP estimation of θ for our Bernoulli random variable X in this
chapter, we used a Beta(β0 , β1 ) prior probability distribution to capture our
prior beliefs about the prior probability of different values of θ, before see-
ing the observed data.

• Plot this prior probability distribution over θ, corresponding to the number of imaginary examples used in the top left plot of Figure 1 (i.e., γ0 = 42, γ1 = 18). Specifically create a plot showing the prior probability (vertical axis) for each possible value of θ between 0 and 1 (horizontal axis), as represented by the prior distribution Beta(β0, β1). Recall the correspondence βi = γi + 1. Note you will want to write a simple computer program to create this plot (one possible scaffold is sketched after this exercise list).
• Above, you plotted the prior probability over possible values of θ.
Now plot the posterior probability distribution over θ given that prior,
plus observed data in which 6 heads (X = 1) were observed, along
with 9 tails (X = 0).

• View the plot you created above to visually determine the approximate
Maximum a Posterior probability estimate θMAP . What is it? What is
the exact value of the MAP estimate? What is the exact value of the
Maximum Likelihood Estimate θMLE ?
• How do you think your plot of the posterior probability would change
if you altered the Beta prior distribution to use γ0 = 420, γ1 = 180?
(hint: it’s ok to actually plot this). What if you changed the Beta prior
to γ0 = 32, γ1 = 28?
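
As the first exercise item notes, a short program is the easiest way to produce these plots. One possible scaffold (ours; it assumes numpy, scipy, and matplotlib are available, and the remaining exercise questions are left to you) is:

# Note scipy's beta(a, b) puts `a` on theta and `b` on (1 - theta), so a = beta1, b = beta0.
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

gamma1, gamma0 = 18, 42              # imaginary heads and tails from the top-left plot of Figure 1
beta1, beta0 = gamma1 + 1, gamma0 + 1

thetas = np.linspace(0, 1, 500)
plt.plot(thetas, beta.pdf(thetas, beta1, beta0), label="prior Beta")
# For the posterior, add the observed counts (6 heads, 9 tails) to the Beta parameters.
plt.plot(thetas, beta.pdf(thetas, beta1 + 6, beta0 + 9), label="posterior Beta")
plt.xlabel("theta"); plt.ylabel("density"); plt.legend(); plt.show()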

5 Acknowledgements
I very much appreciate receiving helpful comments on earlier drafts of this chapter
from Ondřej Filip, Ayush Garg, Akshay Mishra and Tao Chen. Andrew Moore
provided the data summary shown in Table 1.

REFERENCES
Mitchell, T (1997). Machine Learning, McGraw Hill.
Wasserman, L. (2004). All of Statistics, Springer-Verlag.
CHAPTER 3

GENERATIVE AND DISCRIMINATIVE CLASSIFIERS:
NAIVE BAYES AND LOGISTIC REGRESSION

Machine Learning
Copyright © 2015 Tom M. Mitchell. All rights reserved.
*DRAFT OF September 23, 2017*

*PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR’S PERMISSION*

This is a rough draft chapter intended for inclusion in the upcoming second edi-
tion of the textbook Machine Learning, T.M. Mitchell, McGraw Hill. You are
welcome to use this for educational purposes, but do not duplicate or repost it
on the internet. For online copies of this and other materials related to this book,
visit the web site www.cs.cmu.edu/~tom/mlbook.html.
Please send suggestions for improvements, or suggested exercises, to
[email protected].

1 Learning Classifiers based on Bayes Rule


Here we consider the relationship between supervised learning, or function ap-
proximation problems, and Bayesian reasoning. We begin by considering how to
design learning algorithms based on Bayes rule.
Consider a supervised learning problem in which we wish to approximate an
unknown target function f : X → Y , or equivalently P(Y |X). To begin, we will
assume Y is a boolean-valued random variable, and X is a vector containing n
boolean attributes. In other words, X = hX1 , X2 . . . , Xn i, where Xi is the boolean
random variable denoting the ith attribute of X.
Applying Bayes rule, we see that P(Y = yi |X) can be represented as

P(Y = yi | X = xk) = P(X = xk | Y = yi) P(Y = yi) / ∑_j P(X = xk | Y = yj) P(Y = yj)


where ym denotes the mth possible value for Y , xk denotes the kth possible vector
value for X, and where the summation in the denominator is over all legal values
of the random variable Y .
One way to learn P(Y |X) is to use the training data to estimate P(X|Y ) and
P(Y ). We can then use these estimates, together with Bayes rule above, to deter-
mine P(Y |X = xk ) for any new instance xk .

A NOTE ON NOTATION: We will consistently use upper case symbols (e.g., X) to refer to random variables, including both vector and non-vector variables. If
X is a vector, then we use subscripts (e.g., Xi to refer to each random variable, or
feature, in X). We use lower case symbols to refer to values of random variables
(e.g., Xi = xi j may refer to random variable Xi taking on its jth possible value). We
will sometimes abbreviate by omitting variable names, for example abbreviating
P(Xi = xij | Y = yk) to P(xij | yk). We will write E[X] to refer to the expected value of X. We use superscripts to index training examples (e.g., Xi^j refers to the value of the random variable Xi in the jth training example). We use δ(x) to denote
an “indicator” function whose value is 1 if its logical argument x is true, and
whose value is 0 otherwise. We use the #D{x} operator to denote the number of
elements in the set D that satisfy property x. We use a “hat” to indicate estimates;
for example, θ̂ indicates an estimated value of θ.

1.1 Unbiased Learning of Bayes Classifiers is Impractical


If we are going to train a Bayes classifier by estimating P(X|Y ) and P(Y ), then
it is reasonable to ask how much training data will be required to obtain reliable
estimates of these distributions. Let us assume training examples are generated
by drawing instances at random from an unknown underlying distribution P(X),
then allowing a teacher to label this example with its Y value.
A hundred independently drawn training examples will usually suffice to ob-
tain a maximum likelihood estimate of P(Y ) that is within a few percent of its cor-
rect value1 when Y is a boolean variable. However, accurately estimating P(X|Y )
typically requires many more examples. To see why, consider the number of pa-
rameters we must estimate when Y is boolean and X is a vector of n boolean
attributes. In this case, we need to estimate a set of parameters

θi j ≡ P(X = xi |Y = y j )

where the index i takes on 2^n possible values (one for each of the possible vector values of X), and j takes on 2 possible values. Therefore, we will need to estimate approximately 2^(n+1) parameters. To calculate the exact number of required parameters, note for any fixed j, the sum over i of θij must be one. Therefore, for any particular value yj, and the 2^n possible values of xi, we need compute only 2^n − 1 independent parameters. Given the two possible values for Y, we must estimate a total of 2(2^n − 1) such θij parameters. Unfortunately, this corresponds to two
1 Why? See Chapter 5 of edition 1 of Machine Learning.

distinct parameters for each of the distinct instances in the instance space for X.
Worse yet, to obtain reliable estimates of each of these parameters, we will need to
observe each of these distinct instances multiple times! This is clearly unrealistic
in most practical learning domains. For example, if X is a vector containing 30
boolean features, then we will need to estimate more than 2 billion parameters.

2 Naive Bayes Algorithm


Given the intractable sample complexity for learning Bayesian classifiers, we must
look for ways to reduce this complexity. The Naive Bayes classifier does this
by making a conditional independence assumption that dramatically reduces the
number of parameters to be estimated when modeling P(X|Y), from our original 2(2^n − 1) to just 2n.

2.1 Conditional Independence


Definition: Given three sets of random variables X,Y and Z, we say X
is conditionally independent of Y given Z, if and only if the proba-
bility distribution governing X is independent of the value of Y given
Z; that is

(∀i, j, k)P(X = xi |Y = y j , Z = zk ) = P(X = xi |Z = zk )

As an example, consider three boolean random variables to describe the current weather: Rain, Thunder and Lightning. We might reasonably assert that Thunder is independent of Rain given Lightning. Because we know Lightning causes Thunder, once we know whether or not there is Lightning, no additional information about Thunder is provided by the value of Rain. Of course there is a clear dependence of Thunder on Rain in general, but there is no conditional de-
pendence once we know the value of Lightning. Although X, Y and Z are each
single random variables in this example, more generally the definition applies to
sets of random variables. For example, we might assert that variables {A, B} are
conditionally independent of {C, D} given variables {E, F}.

2.2 Derivation of Naive Bayes Algorithm


The Naive Bayes algorithm is a classification algorithm based on Bayes rule and a
set of conditional independence assumptions. Given the goal of learning P(Y |X)
where X = ⟨X1 . . . , Xn⟩, the Naive Bayes algorithm makes the assumption that
each Xi is conditionally independent of each of the other Xk s given Y , and also
independent of each subset of the other Xk ’s given Y .
The value of this assumption is that it dramatically simplifies the representa-
tion of P(X|Y ), and the problem of estimating it from the training data. Consider,
for example, the case where X = ⟨X1, X2⟩. In this case

P(X|Y ) = P(X1 , X2 |Y )
= P(X1 |X2 ,Y )P(X2 |Y )
= P(X1 |Y )P(X2 |Y )

Where the second line follows from a general property of probabilities, and the
third line follows directly from our above definition of conditional independence.
More generally, when X contains n attributes which satisfy the conditional inde-
pendence assumption, we have
P(X1 . . . Xn | Y) = ∏_{i=1}^{n} P(Xi | Y)    (1)

Notice that when Y and the Xi are boolean variables, we need only 2n parameters to define P(Xi = xik | Y = yj) for the necessary i, j, k. This is a dramatic reduction compared to the 2(2^n − 1) parameters needed to characterize P(X|Y) if we make no conditional independence assumption.
Let us now derive the Naive Bayes algorithm, assuming in general that Y is
any discrete-valued variable, and the attributes X1 . . . Xn are any discrete or real-
valued attributes. Our goal is to train a classifier that will output the probability
distribution over possible values of Y , for each new instance X that we ask it to
classify. The expression for the probability that Y will take on its kth possible
value, according to Bayes rule, is
P(Y = yk | X1 . . . Xn) = P(Y = yk) P(X1 . . . Xn | Y = yk) / ∑_j P(Y = yj) P(X1 . . . Xn | Y = yj)
where the sum is taken over all possible values y j of Y . Now, assuming the Xi are
conditionally independent given Y , we can use equation (1) to rewrite this as
P(Y = yk | X1 . . . Xn) = P(Y = yk) ∏_i P(Xi | Y = yk) / ∑_j P(Y = yj) ∏_i P(Xi | Y = yj)    (2)
Equation (2) is the fundamental equation for the Naive Bayes classifier. Given a
new instance X^new = ⟨X1 . . . Xn⟩, this equation shows how to calculate the prob-
ability that Y will take on any given value, given the observed attribute values
of X new and given the distributions P(Y ) and P(Xi |Y ) estimated from the training
data. If we are interested only in the most probable value of Y , then we have the
Naive Bayes classification rule:
Y ← arg max_{yk} P(Y = yk) ∏_i P(Xi | Y = yk) / ∑_j P(Y = yj) ∏_i P(Xi | Y = yj)
which simplifies to the following (because the denominator does not depend on
yk ).
Y ← arg max_{yk} P(Y = yk) ∏_i P(Xi | Y = yk)    (3)

2.3 Naive Bayes for Discrete-Valued Inputs


To summarize, let us precisely define the Naive Bayes learning algorithm by de-
scribing the parameters that must be estimated, and how we may estimate them.
When the n input attributes Xi each take on J possible discrete values, and
Y is a discrete variable taking on K possible values, then our learning task is to
estimate two sets of parameters. The first is
θijk ≡ P(Xi = xij | Y = yk)    (4)
for each input attribute Xi , each of its possible values xi j , and each of the possible
values yk of Y . Note there will be nJK such parameters, and note also that only
n(J − 1)K of these are independent, given that they must satisfy 1 = ∑_j θijk for
each pair of i, k values.
In addition, we must estimate parameters that define the prior probability over
Y:
πk ≡ P(Y = yk ) (5)
Note there are K of these parameters, (K − 1) of which are independent.
We can estimate these parameters using either maximum likelihood estimates
(based on calculating the relative frequencies of the different events in the data),
or using Bayesian MAP estimates (augmenting this observed data with prior dis-
tributions over the values of these parameters).
Maximum likelihood estimates for θi jk given a set of training examples D are
given by
θ̂ijk = P̂(Xi = xij | Y = yk) = #D{Xi = xij ∧ Y = yk} / #D{Y = yk}    (6)
where the #D{x} operator returns the number of elements in the set D that satisfy
property x.
One danger of this maximum likelihood estimate is that it can sometimes re-
sult in θ estimates of zero, if the data does not happen to contain any training
examples satisfying the condition in the numerator. To avoid this, it is common to
use a “smoothed” estimate which effectively adds in a number of additional “hal-
lucinated” examples, and which assumes these hallucinated examples are spread
evenly over the possible values of Xi . This smoothed estimate is given by
θ̂ijk = P̂(Xi = xij | Y = yk) = (#D{Xi = xij ∧ Y = yk} + l) / (#D{Y = yk} + lJ)    (7)
where J is the number of distinct values Xi can take on, and l determines the
strength of this smoothing (i.e., the number of hallucinated examples is lJ). This
expression corresponds to a MAP estimate for θi jk if we assume a Dirichlet prior
distribution over the θi jk parameters, with equal-valued parameters. If l is set to
1, this approach is called Laplace smoothing.
Maximum likelihood estimates for πk are
π̂k = P̂(Y = yk) = #D{Y = yk} / |D|    (8)

where |D| denotes the number of elements in the training set D.


Alternatively, we can obtain a smoothed estimate, or equivalently a MAP es-
timate based on a Dirichlet prior over the πk parameters assuming equal priors on
each πk , by using the following expression
π̂k = P̂(Y = yk) = (#D{Y = yk} + l) / (|D| + lK)    (9)
where K is the number of distinct values Y can take on, and l again determines the
strength of the prior assumptions relative to the observed data D.
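
The following Python sketch (ours, with an invented toy data set of two discrete features) estimates the smoothed parameters of Eqs. (7) and (9) and classifies a new instance with the rule of Eq. (3). It assumes class labels and feature values are coded as 0 .. K−1 and 0 .. J−1:

from collections import defaultdict

# Toy training set: each example is (x, y) where x is a tuple of discrete feature values.
data = [((1, 0), 1), ((1, 1), 1), ((0, 0), 0), ((0, 1), 0), ((1, 0), 0)]
J, K, l = 2, 2, 1                     # values per attribute, number of classes, smoothing strength

count_y = defaultdict(int)            # #D{Y = yk}
count_xy = defaultdict(int)           # #D{Xi = xij and Y = yk}, keyed by (i, xij, yk)
for x, y in data:
    count_y[y] += 1
    for i, xi in enumerate(x):
        count_xy[(i, xi, y)] += 1

def pi_hat(y):                        # Eq. (9): smoothed class prior
    return (count_y[y] + l) / (len(data) + l * K)

def theta_hat(i, xi, y):              # Eq. (7): smoothed conditional probability
    return (count_xy[(i, xi, y)] + l) / (count_y[y] + l * J)

def classify(x):                      # Eq. (3): pick the class maximizing the product
    def score(y):
        s = pi_hat(y)
        for i, xi in enumerate(x):
            s *= theta_hat(i, xi, y)
        return s
    return max(range(K), key=score)

print(classify((1, 1)))               # most probable class for a new instance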

2.4 Naive Bayes for Continuous Inputs


In the case of continuous inputs Xi , we can of course continue to use equations
(2) and (3) as the basis for designing a Naive Bayes classifier. However, when the
Xi are continuous we must choose some other way to represent the distributions
P(Xi |Y ). One common approach is to assume that for each possible discrete value
yk of Y , the distribution of each continuous Xi is Gaussian, and is defined by a
mean and standard deviation specific to Xi and yk . In order to train such a Naive
Bayes classifier we must therefore estimate the mean and standard deviation of
each of these Gaussians:

µik = E[Xi | Y = yk]    (10)

σik² = E[(Xi − µik)² | Y = yk]    (11)

for each attribute Xi and each possible value yk of Y . Note there are 2nK of these
parameters, all of which must be estimated independently.
Of course we must also estimate the priors on Y as well

πk = P(Y = yk ) (12)

The above model summarizes a Gaussian Naive Bayes classifier, which as-
sumes that the data X is generated by a mixture of class-conditional (i.e., depen-
dent on the value of the class variable Y ) Gaussians. Furthermore, the Naive Bayes
assumption introduces the additional constraint that the attribute values Xi are in-
dependent of one another within each of these mixture components. In particular
problem settings where we have additional information, we might introduce addi-
tional assumptions to further restrict the number of parameters or the complexity
of estimating them. For example, if we have reason to believe that noise in the
observed Xi comes from a common source, then we might further assume that all
of the σik are identical, regardless of the attribute i or class k (see the homework
exercise on this issue).
Again, we can use either maximum likelihood estimates (MLE) or maximum
a posteriori (MAP) estimates for these parameters. The maximum likelihood esti-
mator for µik is
µ̂ik = (1 / ∑_j δ(Y^j = yk)) ∑_j Xi^j δ(Y^j = yk)    (13)

where the superscript j refers to the jth training example, and where δ(Y = yk ) is
1 if Y = yk and 0 otherwise. Note the role of δ here is to select only those training
examples for which Y = yk .
The maximum likelihood estimator for σ2ik is

σ̂ik² = (1 / ∑_j δ(Y^j = yk)) ∑_j (Xi^j − µ̂ik)² δ(Y^j = yk)    (14)

This maximum likelihood estimator is biased, so the minimum variance unbiased estimator (MVUE) is sometimes used instead. It is
σ̂ik² = (1 / ((∑_j δ(Y^j = yk)) − 1)) ∑_j (Xi^j − µ̂ik)² δ(Y^j = yk)    (15)
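
A compact sketch (ours; the toy data is invented) of estimating the Gaussian Naive Bayes parameters of Eqs. (12)-(14) with numpy, and classifying in log space:

import numpy as np

# Toy continuous data: rows are examples, columns are attributes Xi; y holds class labels.
X = np.array([[1.0, 2.1], [0.9, 1.9], [3.2, 0.5], [3.0, 0.7]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)
pi = {k: np.mean(y == k) for k in classes}                 # Eq. (12)
mu = {k: X[y == k].mean(axis=0) for k in classes}          # Eq. (13)
var = {k: X[y == k].var(axis=0) for k in classes}          # Eq. (14), the biased MLE

def log_gaussian(x, m, v):
    return -0.5 * np.log(2 * np.pi * v) - (x - m) ** 2 / (2 * v)

def classify(x):
    # Eq. (3) in log space: log prior plus the sum of per-attribute log likelihoods.
    scores = {k: np.log(pi[k]) + log_gaussian(x, mu[k], var[k]).sum() for k in classes}
    return max(scores, key=scores.get)

print(classify(np.array([1.1, 2.0])))    # predicts class 0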

3 Logistic Regression
Logistic Regression is an approach to learning functions of the form f : X → Y , or
P(Y |X) in the case where Y is discrete-valued, and X = ⟨X1 . . . Xn⟩ is any vector
containing discrete or continuous variables. In this section we will primarily con-
sider the case where Y is a boolean variable, in order to simplify notation. In the
final subsection we extend our treatment to the case where Y takes on any finite
number of discrete values.
Logistic Regression assumes a parametric form for the distribution P(Y |X),
then directly estimates its parameters from the training data. The parametric
model assumed by Logistic Regression in the case where Y is boolean is:
P(Y = 1|X) = 1 / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))    (16)

and
P(Y = 0|X) = exp(w0 + ∑_{i=1}^{n} wi Xi) / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))    (17)
Notice that equation (17) follows directly from equation (16), because the sum of
these two probabilities must equal 1.
One highly convenient property of this form for P(Y |X) is that it leads to a
simple linear expression for classification. To classify any given X we generally
want to assign the value yk that maximizes P(Y = yk |X). Put another way, we
assign the label Y = 0 if the following condition holds:

1 < P(Y = 0|X) / P(Y = 1|X)

substituting from equations (16) and (17), this becomes


1 < exp(w0 + ∑_{i=1}^{n} wi Xi)

Figure 1: Form of the logistic function. In Logistic Regression, P(Y |X) is as-
sumed to follow this form.

and taking the natural log of both sides we have a linear classification rule that
assigns label Y = 0 if X satisfies
0 < w0 + ∑_{i=1}^{n} wi Xi    (18)

and assigns Y = 1 otherwise.
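
In code, the model of equations (16)-(18) is just a weighted sum followed by a threshold. A minimal sketch (ours; the weights are arbitrary) following the sign convention used above, where a positive linear score favors Y = 0:

import math

def p_y1_given_x(x, w0, w):
    # Equation (16): P(Y = 1 | X) under the logistic model.
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(score))

def classify(x, w0, w):
    # Equation (18): label Y = 0 when the linear score is positive, else Y = 1.
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 0 if score > 0 else 1

w0, w = -1.0, [2.0, -0.5]            # arbitrary illustrative weights
print(p_y1_given_x([1.0, 2.0], w0, w), classify([1.0, 2.0], w0, w))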


Interestingly, the parametric form of P(Y |X) used by Logistic Regression is
precisely the form implied by the assumptions of a Gaussian Naive Bayes classi-
fier. Therefore, we can view Logistic Regression as a closely related alternative to
GNB, though the two can produce different results in many cases.

3.1 Form of P(Y |X) for Gaussian Naive Bayes Classifier


Here we derive the form of P(Y |X) entailed by the assumptions of a Gaussian
Naive Bayes (GNB) classifier, showing that it is precisely the form used by Logis-
tic Regression and summarized in equations (16) and (17). In particular, consider
a GNB based on the following modeling assumptions:

• Y is boolean, governed by a Bernoulli distribution, with parameter π = P(Y = 1)

• X = ⟨X1 . . . Xn⟩, where each Xi is a continuous random variable

• For each Xi , P(Xi |Y = yk ) is a Gaussian distribution of the form N(µik , σi )

• For all i and j ≠ i, Xi and Xj are conditionally independent given Y



Note here we are assuming the standard deviations σi vary from attribute to at-
tribute, but do not depend on Y .
We now derive the parametric form of P(Y |X) that follows from this set of
GNB assumptions. In general, Bayes rule allows us to write
P(Y = 1|X) = P(Y = 1)P(X|Y = 1) / (P(Y = 1)P(X|Y = 1) + P(Y = 0)P(X|Y = 0))
Dividing both the numerator and denominator by the numerator yields:
P(Y = 1|X) = 1 / (1 + P(Y = 0)P(X|Y = 0) / (P(Y = 1)P(X|Y = 1)))

or equivalently
P(Y = 1|X) = 1 / (1 + exp(ln [P(Y = 0)P(X|Y = 0) / (P(Y = 1)P(X|Y = 1))]))

Because of our conditional independence assumption we can write this

P(Y = 1|X) = 1 / (1 + exp(ln [P(Y = 0)/P(Y = 1)] + ∑_i ln [P(Xi|Y = 0)/P(Xi|Y = 1)]))
           = 1 / (1 + exp(ln [(1−π)/π] + ∑_i ln [P(Xi|Y = 0)/P(Xi|Y = 1)]))    (19)

Note the final step expresses P(Y = 0) and P(Y = 1) in terms of the binomial
parameter π.
Now consider just the summation in the denominator of equation (19). Given
our assumption that P(Xi |Y = yk ) is Gaussian, we can expand this term as follows:
∑_i ln [P(Xi|Y = 0)/P(Xi|Y = 1)]
  = ∑_i ln [ (1/√(2πσi²)) exp(−(Xi − µi0)²/(2σi²)) / ((1/√(2πσi²)) exp(−(Xi − µi1)²/(2σi²))) ]
  = ∑_i ln exp( ((Xi − µi1)² − (Xi − µi0)²) / (2σi²) )
  = ∑_i ((Xi − µi1)² − (Xi − µi0)²) / (2σi²)
  = ∑_i ((Xi² − 2Xi µi1 + µi1²) − (Xi² − 2Xi µi0 + µi0²)) / (2σi²)
  = ∑_i (2Xi (µi0 − µi1) + µi1² − µi0²) / (2σi²)
  = ∑_i ( ((µi0 − µi1)/σi²) Xi + (µi1² − µi0²)/(2σi²) )    (20)

Note this expression is a linear weighted sum of the Xi ’s. Substituting expression
(20) back into equation (19), we have
P(Y = 1|X) = 1 / (1 + exp(ln [(1−π)/π] + ∑_i ( ((µi0 − µi1)/σi²) Xi + (µi1² − µi0²)/(2σi²) )))    (21)

Or equivalently,
P(Y = 1|X) = 1 / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))    (22)
where the weights w1 . . . wn are given by
wi = (µi0 − µi1) / σi²
and where
w0 = ln [(1−π)/π] + ∑_i (µi1² − µi0²) / (2σi²)
Also we have
P(Y = 0|X) = 1 − P(Y = 1|X) = exp(w0 + ∑_{i=1}^{n} wi Xi) / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))    (23)
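
The mapping from GNB parameters to logistic regression weights is mechanical. A small sketch (ours; the parameter values are invented) that computes w0 and the wi from a class prior π, per-class means µik, and per-attribute variances σi² shared across classes, then evaluates Eq. (22):

import math

# Hypothetical GNB parameters: prior pi = P(Y = 1), per-class means mu[k][i], shared variances var[i].
pi = 0.4
mu = {0: [0.95, 2.0], 1: [3.1, 0.6]}
var = [0.5, 0.5]

# Weights implied by the GNB assumptions (the two expressions following Eq. (22)).
w = [(mu[0][i] - mu[1][i]) / var[i] for i in range(len(var))]
w0 = math.log((1 - pi) / pi) + sum((mu[1][i] ** 2 - mu[0][i] ** 2) / (2 * var[i])
                                   for i in range(len(var)))

def p_y1(x):
    return 1.0 / (1.0 + math.exp(w0 + sum(wi * xi for wi, xi in zip(w, x))))

print(p_y1([1.0, 1.9]))    # close to 0: this point is near the class-0 mean
print(p_y1([3.0, 0.5]))    # close to 1: this point is near the class-1 mean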

3.2 Estimating Parameters for Logistic Regression


The above subsection proves that P(Y |X) can be expressed in the parametric form
given by equations (16) and (17), under the Gaussian Naive Bayes assumptions
detailed there. It also provides the value of the weights wi in terms of the param-
eters estimated by the GNB classifier. Here we describe an alternative method
for estimating these weights. We are interested in this alternative for two reasons.
First, the form of P(Y |X) assumed by Logistic Regression holds in many problem
settings beyond the GNB problem detailed in the above section, and we wish to
have a general method for estimating it in a more broad range of cases. Second, in
many cases we may suspect the GNB assumptions are not perfectly satisfied. In
this case we may wish to estimate the wi parameters directly from the data, rather
than going through the intermediate step of estimating the GNB parameters which
forces us to adopt its more stringent modeling assumptions.
One reasonable approach to training Logistic Regression is to choose param-
eter values that maximize the conditional data likelihood. The conditional data
likelihood is the probability of the observed Y values in the training data, condi-
tioned on their corresponding X values. We choose parameters W that satisfy

W ← arg max_W ∏_l P(Y^l | X^l, W)

where W = ⟨w0, w1 . . . wn⟩ is the vector of parameters to be estimated, Y^l denotes the observed value of Y in the lth training example, and X^l denotes the observed

value of X in the lth training example. The expression to the right of the arg max
is the conditional data likelihood. Here we include W in the conditional, to em-
phasize that the expression is a function of the W we are attempting to maximize.
Equivalently, we can work with the log of the conditional likelihood:

W \leftarrow \arg\max_W \sum_l \ln P(Y^l | X^l, W)

This conditional data log likelihood, which we will denote l(W), can be written as

l(W) = \sum_l Y^l \ln P(Y^l = 1|X^l, W) + (1 - Y^l) \ln P(Y^l = 0|X^l, W)
Note here we are utilizing the fact that Y can take only values 0 or 1, so only one
of the two terms in the expression will be non-zero for any given Y l .
To keep our derivation consistent with common usage, we will in this section
flip the assignment of the boolean variable Y so that we assign
P(Y = 0|X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}    (24)

and

P(Y = 1|X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}    (25)
In this case, we can reexpress the log of the conditional likelihood as:

l(W) = \sum_l Y^l \ln P(Y^l = 1|X^l, W) + (1 - Y^l) \ln P(Y^l = 0|X^l, W)
     = \sum_l Y^l \ln \frac{P(Y^l = 1|X^l, W)}{P(Y^l = 0|X^l, W)} + \ln P(Y^l = 0|X^l, W)
     = \sum_l Y^l \left(w_0 + \sum_i^n w_i X_i^l\right) - \ln\left(1 + \exp\left(w_0 + \sum_i^n w_i X_i^l\right)\right)

where Xil denotes the value of Xi for the lth training example. Note the superscript
l is not related to the log likelihood function l(W ).
Unfortunately, there is no closed form solution to maximizing l(W ) with re-
spect to W . Therefore, one common approach is to use gradient ascent, in which
we work with the gradient, which is the vector of partial derivatives. The ith
component of the vector gradient has the form

\frac{\partial l(W)}{\partial w_i} = \sum_l X_i^l \left(Y^l - \hat{P}(Y^l = 1|X^l, W)\right)

where P̂(Y l |X l ,W ) is the Logistic Regression prediction using equations (24) and
(25) and the weights W . To accommodate weight w0 , we assume an imaginary
X0 = 1 for all l. This expression for the derivative has an intuitive interpretation:
the term inside the parentheses is simply the prediction error; that is, the difference
between the observed Y l and its predicted probability! Note if Y l = 1 then we wish
for P̂(Y l = 1|X l ,W ) to be 1, whereas if Y l = 0 then we prefer that P̂(Y l = 1|X l ,W )
be 0 (which makes P̂(Y l = 0|X l ,W ) equal to 1). This error term is multiplied by
the value of Xil , which accounts for the magnitude of the wi Xil term in making this
prediction.
Given this formula for the derivative of each wi , we can use standard gradient
ascent to optimize the weights W . Beginning with initial weights of zero, we
repeatedly update the weights in the direction of the gradient, on each iteration
changing every weight wi according to

w_i \leftarrow w_i + \eta \sum_l X_i^l \left(Y^l - \hat{P}(Y^l = 1|X^l, W)\right)

where η is a small constant (e.g., 0.01) which determines the step size. Because
the conditional log likelihood l(W ) is a concave function in W , this gradient ascent
procedure will converge to a global maximum. Gradient ascent is described in
greater detail, for example, in Chapter 4 of Mitchell (1997). In many cases where
computational efficiency is important it is common to use a variant of gradient
ascent called conjugate gradient ascent, which often converges more quickly.
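As a concrete illustration (not from the text), here is a minimal sketch of this batch gradient ascent procedure; the step size, iteration count, and function name are arbitrary choices.

import numpy as np

def train_logistic_regression(X, Y, eta=0.01, n_iters=1000):
    """Batch gradient ascent for logistic regression, following equations
    (24)-(25) and the update rule above. X is (m, n); Y holds 0/1 labels."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])      # imaginary X_0 = 1, so W[0] plays the role of w_0
    W = np.zeros(n + 1)                       # begin with initial weights of zero
    for _ in range(n_iters):
        P = 1.0 / (1.0 + np.exp(-(Xb @ W)))   # P_hat(Y = 1 | X, W), equation (25)
        W = W + eta * (Xb.T @ (Y - P))        # w_i += eta * sum_l X_i^l (Y^l - P_hat)
    return W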

3.3 Regularization in Logistic Regression


Overfitting the training data is a problem that can arise in Logistic Regression,
especially when data is very high dimensional and training data is sparse. One
approach to reducing overfitting is regularization, in which we create a modified
“penalized log likelihood function,” which penalizes large values of W . One ap-
proach is to use the penalized log likelihood function

W \leftarrow \arg\max_W \sum_l \ln P(Y^l|X^l, W) - \frac{\lambda}{2} \|W\|^2

which adds a penalty proportional to the squared magnitude of W . Here λ is a


constant that determines the strength of this penalty term.
Modifying our objective by adding in this penalty term gives us a new objec-
tive to maximize. It is easy to show that maximizing it corresponds to calculating
the MAP estimate for W under the assumption that the prior distribution P(W ) is
a Normal distribution with mean zero, and a variance related to 1/λ. Notice that
in general, the MAP estimate for W involves optimizing the objective

\sum_l \ln P(Y^l|X^l, W) + \ln P(W)

and if P(W ) is a zero mean Gaussian distribution, then ln P(W ) yields a term
proportional to ||W ||2 .
Given this penalized log likelihood function, it is easy to rederive the gradient
ascent rule. The derivative of this penalized log likelihood function is similar to
our earlier derivative, with one additional penalty term


\frac{\partial l(W)}{\partial w_i} = \sum_l X_i^l \left(Y^l - \hat{P}(Y^l = 1|X^l, W)\right) - \lambda w_i

which gives us the modified gradient ascent rule

w_i \leftarrow w_i + \eta \sum_l X_i^l \left(Y^l - \hat{P}(Y^l = 1|X^l, W)\right) - \eta\lambda w_i    (26)
In cases where we have prior knowledge about likely values for specific wi , it
is possible to derive a similar penalty term by using a Normal prior on W with a
non-zero mean.
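Continuing the sketch above (and reusing numpy as np and the bias-augmented Xb), the regularized rule of equation (26) changes only the update line. As written, equation (26) penalizes every weight, including w_0; leaving w_0 unpenalized would be an alternative design choice.

def regularized_step(W, Xb, Y, eta=0.01, lam=0.1):
    # one application of equation (26); eta and lam are illustrative values
    P = 1.0 / (1.0 + np.exp(-(Xb @ W)))
    return W + eta * (Xb.T @ (Y - P)) - eta * lam * W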

3.4 Logistic Regression for Functions with Many Discrete Values
Above we considered using Logistic Regression to learn P(Y |X) only for the case
where Y is a boolean variable. More generally, if Y can take on any of the discrete
values {y1 , . . . yK }, then the form of P(Y = yk |X) for Y = y1 ,Y = y2 , . . .Y = yK−1
is:
P(Y = y_k|X) = \frac{\exp(w_{k0} + \sum_{i=1}^n w_{ki} X_i)}{1 + \sum_{j=1}^{K-1} \exp(w_{j0} + \sum_{i=1}^n w_{ji} X_i)}    (27)

When Y = y_K, it is

P(Y = y_K|X) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(w_{j0} + \sum_{i=1}^n w_{ji} X_i)}    (28)

Here w ji denotes the weight associated with the jth class Y = y j and with input
Xi . It is easy to see that our earlier expressions for the case where Y is boolean
(equations (16) and (17)) are a special case of the above expressions. Note also
that the form of the expression for P(Y = yK |X) assures that [∑K k=1 P(Y = yk |X)] =
1.
The primary difference between these expressions and those for boolean Y is
that when Y takes on K possible values, we construct K −1 different linear expres-
sions to capture the distributions for the different values of Y . The distribution for
the final, Kth, value of Y is simply one minus the probabilities of the first K − 1
values.
In this case, the gradient ascent rule with regularization becomes:

w_{ji} \leftarrow w_{ji} + \eta \sum_l X_i^l \left(\delta(Y^l = y_j) - \hat{P}(Y^l = y_j|X^l, W)\right) - \eta\lambda w_{ji}    (29)

where δ(Y l = y j ) = 1 if the lth training value, Y l , is equal to y j , and δ(Y l = y j ) = 0


otherwise. Note our earlier learning rule, equation (26), is a special case of this
new learning rule, when K = 2. As in the case for K = 2, the quantity inside the
parentheses can be viewed as an error term which goes to zero if the estimated
conditional probability P̂(Y l = y j |X l ,W )) perfectly matches the observed value
of Y l .
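The sketch below (again, an illustration rather than part of the text) implements equations (27)-(29); it assumes classes are coded as integer indices 0..K-1, with index K-1 playing the role of y_K, and the names, step size, and regularization constant are arbitrary.

import numpy as np

def softmax_probs(W, X):
    """P(Y = y_k | X) for k = 1..K via equations (27)-(28). W is (K-1, n+1),
    row j holding (w_j0, ..., w_jn); class y_K carries no weights of its own."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    e = np.exp(Xb @ W.T)                              # exp(w_j0 + sum_i w_ji X_i)
    denom = 1.0 + e.sum(axis=1, keepdims=True)
    return np.hstack([e / denom, 1.0 / denom])        # columns y_1 .. y_{K-1}, y_K

def multiclass_step(W, X, Y, eta=0.01, lam=0.0):
    """One application of update rule (29). Y is an integer array of class indices."""
    m, Km1 = X.shape[0], W.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])
    P = softmax_probs(W, X)[:, :Km1]                  # P_hat(Y = y_j | X, W) for j < K
    delta = np.zeros((m, Km1))
    mask = Y < Km1
    delta[np.arange(m)[mask], Y[mask]] = 1.0          # delta(Y^l = y_j); zero for class y_K
    return W + eta * (delta - P).T @ Xb - eta * lam * W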

4 Relationship Between Naive Bayes Classifiers and Logistic Regression
To summarize, Logistic Regression directly estimates the parameters of P(Y |X),
whereas Naive Bayes directly estimates parameters for P(Y ) and P(X|Y ). We of-
ten call the former a discriminative classifier, and the latter a generative classifier.
We showed above that the assumptions of one variant of a Gaussian Naive
Bayes classifier imply the parametric form of P(Y |X) used in Logistic Regres-
sion. Furthermore, we showed that the parameters wi in Logistic Regression can
be expressed in terms of the Gaussian Naive Bayes parameters. In fact, if the GNB
assumptions hold, then asymptotically (as the number of training examples grows
toward infinity) the GNB and Logistic Regression converge toward identical clas-
sifiers.
The two algorithms also differ in interesting ways:

• When the GNB modeling assumptions do not hold, Logistic Regression and
GNB typically learn different classifier functions. In this case, the asymp-
totic (as the number of training examples approach infinity) classification
accuracy for Logistic Regression is often better than the asymptotic accu-
racy of GNB. Although Logistic Regression is consistent with the Naive
Bayes assumption that the input features Xi are conditionally independent
given Y , it is not rigidly tied to this assumption as is Naive Bayes. Given
data that disobeys this assumption, the conditional likelihood maximization
algorithm for Logistic Regression will adjust its parameters to maximize the
fit to (the conditional likelihood of) the data, even if the resulting parameters
are inconsistent with the Naive Bayes parameter estimates.
• GNB and Logistic Regression converge toward their asymptotic accuracies
at different rates. As Ng & Jordan (2002) show, GNB parameter estimates
converge toward their asymptotic values in order log n examples, where n
is the dimension of X. In contrast, Logistic Regression parameter estimates
converge more slowly, requiring order n examples. The authors also show
that in several data sets Logistic Regression outperforms GNB when many
training examples are available, but GNB outperforms Logistic Regression
when training data is scarce.

5 What You Should Know


The main points of this chapter include:

• We can use Bayes rule as the basis for designing learning algorithms (func-
tion approximators), as follows: Given that we wish to learn some target
function f : X → Y , or equivalently, P(Y |X), we use the training data to
learn estimates of P(X|Y ) and P(Y ). New X examples can then be classi-
fied using these estimated probability distributions, plus Bayes rule. This
type of classifier is called a generative classifier, because we can view the


distribution P(X|Y ) as describing how to generate random instances X con-
ditioned on the target attribute Y .

• Learning Bayes classifiers typically requires an unrealistic number of training examples (i.e., more than |X| training examples where X is the instance
space) unless some form of prior assumption is made about the form of
P(X|Y ). The Naive Bayes classifier assumes all attributes describing X
are conditionally independent given Y . This assumption dramatically re-
duces the number of parameters that must be estimated to learn the classi-
fier. Naive Bayes is a widely used learning algorithm, for both discrete and
continuous X.

• When X is a vector of discrete-valued attributes, Naive Bayes learning algorithms can be viewed as linear classifiers; that is, every such Naive Bayes
classifier corresponds to a hyperplane decision surface in X. The same state-
ment holds for Gaussian Naive Bayes classifiers if the variance of each fea-
ture is assumed to be independent of the class (i.e., if σik = σi ).

• Logistic Regression is a function approximation algorithm that uses training


data to directly estimate P(Y |X), in contrast to Naive Bayes. In this sense,
Logistic Regression is often referred to as a discriminative classifier because
we can view the distribution P(Y |X) as directly discriminating the value of
the target value Y for any given instance X.

• Logistic Regression is a linear classifier over X. The linear classifiers produced by Logistic Regression and Gaussian Naive Bayes are identical in
the limit as the number of training examples approaches infinity, provided
the Naive Bayes assumptions hold. However, if these assumptions do not
hold, the Naive Bayes bias will cause it to perform less accurately than Lo-
gistic Regression, in the limit. Put another way, Naive Bayes is a learning
algorithm with greater bias, but lower variance, than Logistic Regression. If
this bias is appropriate given the actual data, Naive Bayes will be preferred.
Otherwise, Logistic Regression will be preferred.

• We can view function approximation learning algorithms as statistical estimators of functions, or of conditional distributions P(Y |X). They estimate
P(Y |X) from a sample of training data. As with other statistical estima-
tors, it can be useful to characterize learning algorithms by their bias and
expected variance, taken over different samples of training data.

6 Further Reading
Wasserman (2004) describes a Reweighted Least Squares method for Logistic
Regression. Ng and Jordan (2002) provide a theoretical and experimental com-
parison of the Naive Bayes classifier and Logistic Regression.

EXERCISES
1. At the beginning of the chapter we remarked that “A hundred training ex-
amples will usually suffice to obtain an estimate of P(Y ) that is within a
few percent of the correct value.” Describe conditions under which the 95%
confidence interval for our estimate of P(Y ) will be ±0.02.
2. Consider learning a function X → Y where Y is boolean, where X = hX1 , X2 i,
and where X1 is a boolean variable and X2 a continuous variable. State the
parameters that must be estimated to define a Naive Bayes classifier in this
case. Give the formula for computing P(Y |X), in terms of these parameters
and the feature values X1 and X2 .
3. In section 3 we showed that when Y is Boolean and X = hX1 . . . Xn i is a
vector of continuous variables, then the assumptions of the Gaussian Naive
Bayes classifier imply that P(Y |X) is given by the logistic function with
appropriate parameters W . In particular:
P(Y = 1|X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}

and

P(Y = 0|X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}
Consider instead the case where Y is Boolean and X = hX1 . . . Xn i is a vec-
tor of Boolean variables. Prove for this case also that P(Y |X) follows this
same form (and hence that Logistic Regression is also the discriminative
counterpart to a Naive Bayes generative classifier over Boolean features).
Hints:
• Simple notation will help. Since the Xi are Boolean variables, you
need only one parameter to define P(Xi |Y = yk ). Define θi1 ≡ P(Xi =
1|Y = 1), in which case P(Xi = 0|Y = 1) = (1 − θi1 ). Similarly, use
θi0 to denote P(Xi = 1|Y = 0).
• Notice with the above notation you can represent P(Xi |Y = 1) as fol-
lows
P(X_i|Y = 1) = \theta_{i1}^{X_i} (1 - \theta_{i1})^{(1 - X_i)}
Note when Xi = 1 the second term is equal to 1 because its exponent
is zero. Similarly, when Xi = 0 the first term is equal to 1 because its
exponent is zero.
4. (based on a suggestion from Sandra Zilles). This question asks you to con-
sider the relationship between the MAP hypothesis and the Bayes optimal
hypothesis. Consider a hypothesis space H defined over the set of instances
X, and containing just two hypotheses, h1 and h2 with equal prior probabil-
ities P(h1) = P(h2) = 0.5. Suppose we are given an arbitrary set of training
data D which we use to calculate the posterior probabilities P(h1|D) and


P(h2|D). Based on this we choose the MAP hypothesis, and calculate the
Bayes optimal hypothesis. Suppose we find that the Bayes optimal classi-
fier is not equal to either h1 or to h2, which is generally the case because
the Bayes optimal hypothesis corresponds to “averaging over” all hypothe-
ses in H. Now we create a new hypothesis h3 which is equal to the Bayes
optimal classifier with respect to H, X and D; that is, h3 classifies each in-
stance in X exactly the same as the Bayes optimal classifier for H and D.
We now create a new hypothesis space H 0 = {h1, h2, h3}. If we train using
the same training data, D, will the MAP hypothesis from H 0 be h3? Will the
Bayes optimal classifier with respect to H 0 be equivalent to h3? (Hint: the
answer depends on the priors we assign to the hypotheses in H 0 . Can you
give constraints on these priors that assure the answers will be yes or no?)

7 Acknowledgements
I very much appreciate receiving helpful comments on earlier drafts of this chapter
from the following: Nathaniel Fairfield, Rainer Gemulla, Vineet Kumar, Andrew
McCallum, Anand Prahlad, Wei Wang, Geoff Webb, and Sandra Zilles.

REFERENCES
Mitchell, T (1997). Machine Learning, McGraw Hill.
Ng, A.Y. & Jordan, M. I. (2002). On Discriminative vs. Generative Classifiers: A comparison of Logistic Regression and Naive Bayes, Neural Information Processing Systems.
Wasserman, L. (2004). All of Statistics, Springer-Verlag.
CS229 Lecture notes
Andrew Ng

Part V
Support Vector Machines
This set of notes presents the Support Vector Machine (SVM) learning al-
gorithm. SVMs are among the best (and many believe are indeed the best)
“off-the-shelf” supervised learning algorithms. To tell the SVM story, we’ll
need to first talk about margins and the idea of separating data with a large
“gap.” Next, we’ll talk about the optimal margin classifier, which will lead
us into a digression on Lagrange duality. We’ll also see kernels, which give
a way to apply SVMs efficiently in very high dimensional (such as infinite-
dimensional) feature spaces, and finally, we’ll close off the story with the
SMO algorithm, which gives an efficient implementation of SVMs.

1 Margins: Intuition
We’ll start our story on SVMs by talking about margins. This section will
give the intuitions about margins and about the “confidence” of our predic-
tions; these ideas will be made formal in Section 3.
Consider logistic regression, where the probability p(y = 1|x; θ) is mod-
eled by hθ (x) = g(θT x). We would then predict “1” on an input x if and
only if hθ (x) ≥ 0.5, or equivalently, if and only if θT x ≥ 0. Consider a
positive training example (y = 1). The larger θT x is, the larger also is
hθ (x) = p(y = 1|x; w, b), and thus also the higher our degree of “confidence”
that the label is 1. Thus, informally we can think of our prediction as being
a very confident one that y = 1 if θT x ≫ 0. Similarly, we think of logistic
regression as making a very confident prediction of y = 0, if θT x ≪ 0. Given
a training set, again informally it seems that we’d have found a good fit to
the training data if we can find θ so that θT x(i) ≫ 0 whenever y (i) = 1, and


θT x(i) ≪ 0 whenever y (i) = 0, since this would reflect a very confident (and
correct) set of classifications for all the training examples. This seems to be
a nice goal to aim for, and we’ll soon formalize this idea using the notion of
functional margins.
For a different type of intuition, consider the following figure, in which x’s
represent positive training examples, o’s denote negative training examples,
a decision boundary (this is the line given by the equation θT x = 0, and
is also called the separating hyperplane) is also shown, and three points
have also been labeled A, B and C.

[Figure: a separating hyperplane (the line θT x = 0) with x's and o's on either side, and three labeled points A, B, and C at decreasing distances from the boundary.]

Notice that the point A is very far from the decision boundary. If we are
asked to make a prediction for the value of y at A, it seems we should be
quite confident that y = 1 there. Conversely, the point C is very close to
the decision boundary, and while it’s on the side of the decision boundary
on which we would predict y = 1, it seems likely that just a small change to
the decision boundary could easily have caused our prediction to be y = 0.
Hence, we’re much more confident about our prediction at A than at C. The
point B lies in-between these two cases, and more broadly, we see that if
a point is far from the separating hyperplane, then we may be significantly
more confident in our predictions. Again, informally we think it’d be nice if,
given a training set, we manage to find a decision boundary that allows us
to make all correct and confident (meaning far from the decision boundary)
predictions on the training examples. We’ll formalize this later using the
notion of geometric margins.

2 Notation
To make our discussion of SVMs easier, we’ll first need to introduce a new
notation for talking about classification. We will be considering a linear
classifier for a binary classification problem with labels y and features x.
From now, we’ll use y ∈ {−1, 1} (instead of {0, 1}) to denote the class labels.
Also, rather than parameterizing our linear classifier with the vector θ, we
will use parameters w, b, and write our classifier as

hw,b (x) = g(wT x + b).

Here, g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise. This “w, b” notation


allows us to explicitly treat the intercept term b separately from the other
parameters. (We also drop the convention we had previously of letting x0 = 1
be an extra coordinate in the input feature vector.) Thus, b takes the role of
what was previously θ0 , and w takes the role of [θ1 . . . θn ]T .
Note also that, from our definition of g above, our classifier will directly
predict either 1 or −1 (cf. the perceptron algorithm), without first going
through the intermediate step of estimating the probability of y being 1
(which was what logistic regression did).

3 Functional and geometric margins


Let’s formalize the notions of the functional and geometric margins. Given a
training example (x(i) , y (i) ), we define the functional margin of (w, b) with
respect to the training example

γ̂^{(i)} = y^{(i)}(w^T x^{(i)} + b).

Note that if y (i) = 1, then for the functional margin to be large (i.e., for
our prediction to be confident and correct), we need wT x + b to be a large
positive number. Conversely, if y (i) = −1, then for the functional margin
to be large, we need wT x + b to be a large negative number. Moreover, if
y (i) (wT x + b) > 0, then our prediction on this example is correct. (Check
this yourself.) Hence, a large functional margin represents a confident and a
correct prediction.
For a linear classifier with the choice of g given above (taking values in
{−1, 1}), there’s one property of the functional margin that makes it not a
very good measure of confidence, however. Given our choice of g, we note that
if we replace w with 2w and b with 2b, then since g(wT x + b) = g(2wT x + 2b),

this would not change hw,b (x) at all. I.e., g, and hence also hw,b (x), depends
only on the sign, but not on the magnitude, of wT x + b. However, replacing
(w, b) with (2w, 2b) also results in multiplying our functional margin by a
factor of 2. Thus, it seems that by exploiting our freedom to scale w and b,
we can make the functional margin arbitrarily large without really changing
anything meaningful. Intuitively, it might therefore make sense to impose
some sort of normalization condition such as that ||w||2 = 1; i.e., we might
replace (w, b) with (w/||w||2 , b/||w||2 ), and instead consider the functional
margin of (w/||w||2 , b/||w||2 ). We’ll come back to this later.
Given a training set S = {(x(i) , y (i) ); i = 1, . . . , m}, we also define the
functional margin of (w, b) with respect to S as the smallest of the functional
margins of the individual training examples. Denoted by γ̂, this can therefore
be written:
\hat\gamma = \min_{i=1,\ldots,m} \hat\gamma^{(i)}.

Next, let’s talk about geometric margins. Consider the picture below:

[Figure: the decision boundary corresponding to (w, b) and the normal vector w; a positive training point A sits at distance γ^{(i)} from the boundary, measured along the segment AB.]

The decision boundary corresponding to (w, b) is shown, along with the


vector w. Note that w is orthogonal (at 90◦ ) to the separating hyperplane.
(You should convince yourself that this must be the case.) Consider the
point at A, which represents the input x(i) of some training example with
label y (i) = 1. Its distance to the decision boundary, γ (i) , is given by the line
segment AB.
How can we find the value of γ (i) ? Well, w/||w|| is a unit-length vector
pointing in the same direction as w. Since A represents x(i) , we therefore

find that the point B is given by x(i) − γ (i) · w/||w||. But this point lies on
the decision boundary, and all points x on the decision boundary satisfy the
equation w^T x + b = 0. Hence,

w^T\left(x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|}\right) + b = 0.

Solving for \gamma^{(i)} yields

\gamma^{(i)} = \frac{w^T x^{(i)} + b}{\|w\|} = \left(\frac{w}{\|w\|}\right)^T x^{(i)} + \frac{b}{\|w\|}.
This was worked out for the case of a positive training example at A in the
figure, where being on the “positive” side of the decision boundary is good.
More generally, we define the geometric margin of (w, b) with respect to a
training example (x^{(i)}, y^{(i)}) to be

\gamma^{(i)} = y^{(i)}\left(\left(\frac{w}{\|w\|}\right)^T x^{(i)} + \frac{b}{\|w\|}\right).

Note that if ||w|| = 1, then the functional margin equals the geometric
margin—this thus gives us a way of relating these two different notions of
margin. Also, the geometric margin is invariant to rescaling of the parame-
ters; i.e., if we replace w with 2w and b with 2b, then the geometric margin
does not change. This will in fact come in handy later. Specifically, because
of this invariance to the scaling of the parameters, when trying to fit w and b
to training data, we can impose an arbitrary scaling constraint on w without
changing anything important; for instance, we can demand that ||w|| = 1, or
|w1 | = 5, or |w1 + b| + |w2 | = 2, and any of these can be satisfied simply by
rescaling w and b.
Finally, given a training set S = {(x(i) , y (i) ); i = 1, . . . , m}, we also define
the geometric margin of (w, b) with respect to S to be the smallest of the
geometric margins on the individual training examples:

\gamma = \min_{i=1,\ldots,m} \gamma^{(i)}.
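A small sketch (not from the notes) computing the functional and geometric margins of a hypothetical (w, b) on a toy training set with labels in {-1, +1}, and checking the rescaling-invariance claim:

import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

functional = y * (X @ w + b)                  # gamma_hat^(i) = y^(i)(w^T x^(i) + b)
geometric = functional / np.linalg.norm(w)    # gamma^(i) = gamma_hat^(i) / ||w||

print(functional.min())                       # functional margin of (w, b) w.r.t. the set
print(geometric.min())                        # geometric margin
print((y * (X @ (2 * w) + 2 * b) / np.linalg.norm(2 * w)).min())  # unchanged under (2w, 2b)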

4 The optimal margin classifier


Given a training set, it seems from our previous discussion that a natural
desideratum is to try to find a decision boundary that maximizes the (ge-
ometric) margin, since this would reflect a very confident set of predictions

on the training set and a good “fit” to the training data. Specifically, this
will result in a classifier that separates the positive and the negative training
examples with a “gap” (geometric margin).
For now, we will assume that we are given a training set that is linearly
separable; i.e., that it is possible to separate the positive and negative ex-
amples using some separating hyperplane. How will we find the one that
achieves the maximum geometric margin? We can pose the following opti-
mization problem:

maxγ,w,b γ
s.t. y (i) (wT x(i) + b) ≥ γ, i = 1, . . . , m
||w|| = 1.

I.e., we want to maximize γ, subject to each training example having functional margin at least γ. The ||w|| = 1 constraint moreover ensures that the
functional margin equals to the geometric margin, so we are also guaranteed
that all the geometric margins are at least γ. Thus, solving this problem will
result in (w, b) with the largest possible geometric margin with respect to the
training set.
If we could solve the optimization problem above, we’d be done. But the
“||w|| = 1” constraint is a nasty (non-convex) one, and this problem certainly
isn’t in any format that we can plug into standard optimization software to
solve. So, let’s try transforming the problem into a nicer one. Consider:
\max_{\hat\gamma, w, b} \quad \frac{\hat\gamma}{\|w\|}
\text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge \hat\gamma, \quad i = 1, \ldots, m

Here, we’re going to maximize γ̂/||w||, subject to the functional margins all
being at least γ̂. Since the geometric and functional margins are related by
γ = γ̂/||w||, this will give us the answer we want. Moreover, we've gotten rid
of the constraint ||w|| = 1 that we didn't like. The downside is that we now
have a nasty (again, non-convex) objective γ̂/||w|| function; and, we still don't
have any off-the-shelf software that can solve this form of an optimization
problem.
Let’s keep going. Recall our earlier discussion that we can add an arbi-
trary scaling constraint on w and b without changing anything. This is the
key idea we’ll use now. We will introduce the scaling constraint that the
functional margin of w, b with respect to the training set must be 1:

γ̂ = 1.

Since multiplying w and b by some constant results in the functional margin


being multiplied by that same constant, this is indeed a scaling constraint,
and can be satisfied by rescaling w, b. Plugging this into our problem above,
and noting that maximizing γ̂/||w|| = 1/||w|| is the same thing as minimizing
||w||2 , we now have the following optimization problem:
\min_{w,b} \quad \frac{1}{2}\|w\|^2
\text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1, \quad i = 1, \ldots, m
We’ve now transformed the problem into a form that can be efficiently
solved. The above is an optimization problem with a convex quadratic ob-
jective and only linear constraints. Its solution gives us the optimal mar-
gin classifier. This optimization problem can be solved using commercial
quadratic programming (QP) code.1
While we could call the problem solved here, what we will instead do is
make a digression to talk about Lagrange duality. This will lead us to our
optimization problem’s dual form, which will play a key role in allowing us to
use kernels to get optimal margin classifiers to work efficiently in very high
dimensional spaces. The dual form will also allow us to derive an efficient
algorithm for solving the above optimization problem that will typically do
much better than generic QP software.
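As one concrete illustration of handing this QP to off-the-shelf software (the notes do not prescribe a particular package), here is a sketch using the cvxpy modeling library on made-up, linearly separable toy data:

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))       # (1/2)||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]         # y^(i)(w^T x^(i) + b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)                                # the optimal margin classifier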

5 Lagrange duality
Let’s temporarily put aside SVMs and maximum margin classifiers, and talk
about solving constrained optimization problems.
Consider a problem of the following form:
minw f (w)
s.t. hi (w) = 0, i = 1, . . . , l.
Some of you may recall how the method of Lagrange multipliers can be used
to solve it. (Don’t worry if you haven’t seen it before.) In this method, we
define the Lagrangian to be
L(w, \beta) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)
[Footnote 1: You may be familiar with linear programming, which solves optimization problems that have linear objectives and linear constraints. QP software is also widely available, which allows convex quadratic objectives and linear constraints.]

Here, the βi ’s are called the Lagrange multipliers. We would then find
and set L’s partial derivatives to zero:
\frac{\partial L}{\partial w_i} = 0; \qquad \frac{\partial L}{\partial \beta_i} = 0,
and solve for w and β.
In this section, we will generalize this to constrained optimization prob-
lems in which we may have inequality as well as equality constraints. Due to
time constraints, we won’t really be able to do the theory of Lagrange duality
justice in this class,2 but we will give the main ideas and results, which we
will then apply to our optimal margin classifier’s optimization problem.
Consider the following, which we’ll call the primal optimization problem:
minw f (w)
s.t. gi (w) ≤ 0, i = 1, . . . , k
hi (w) = 0, i = 1, . . . , l.
To solve it, we start by defining the generalized Lagrangian
L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w).

Here, the αi ’s and βi ’s are the Lagrange multipliers. Consider the quantity
\theta_P(w) = \max_{\alpha, \beta\,:\,\alpha_i \ge 0} L(w, \alpha, \beta).

Here, the “P” subscript stands for “primal.” Let some w be given. If w
violates any of the primal constraints (i.e., if either gi (w) > 0 or hi (w) 6= 0
for some i), then you should be able to verify that
\theta_P(w) = \max_{\alpha, \beta\,:\,\alpha_i \ge 0}\; f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)    (1)
            = \infty.    (2)
Conversely, if the constraints are indeed satisfied for a particular value of w,
then θ_P(w) = f(w). Hence,

\theta_P(w) = \begin{cases} f(w) & \text{if } w \text{ satisfies primal constraints} \\ \infty & \text{otherwise.} \end{cases}
[Footnote 2: Readers interested in learning more about this topic are encouraged to read, e.g., R. T. Rockafellar (1970), Convex Analysis, Princeton University Press.]

Thus, θP takes the same value as the objective in our problem for all val-
ues of w that satisfies the primal constraints, and is positive infinity if the
constraints are violated. Hence, if we consider the minimization problem
\min_w \theta_P(w) = \min_w \max_{\alpha, \beta\,:\,\alpha_i \ge 0} L(w, \alpha, \beta),

we see that it is the same problem (i.e., and has the same solutions as) our
original, primal problem. For later use, we also define the optimal value of
the objective to be p∗ = minw θP (w); we call this the value of the primal
problem.
Now, let’s look at a slightly different problem. We define
\theta_D(\alpha, \beta) = \min_w L(w, \alpha, \beta).
Here, the “D” subscript stands for “dual.” Note also that whereas in the
definition of θP we were optimizing (maximizing) with respect to α, β, here
we are minimizing with respect to w.
We can now pose the dual optimization problem:
\max_{\alpha, \beta\,:\,\alpha_i \ge 0} \theta_D(\alpha, \beta) = \max_{\alpha, \beta\,:\,\alpha_i \ge 0} \min_w L(w, \alpha, \beta).

This is exactly the same as our primal problem shown above, except that the
order of the “max” and the “min” are now exchanged. We also define the
optimal value of the dual problem's objective to be d^* = \max_{\alpha, \beta\,:\,\alpha_i \ge 0} \theta_D(\alpha, \beta).
How are the primal and the dual problems related? It can easily be shown
that
d^* = \max_{\alpha, \beta\,:\,\alpha_i \ge 0} \min_w L(w, \alpha, \beta) \;\le\; \min_w \max_{\alpha, \beta\,:\,\alpha_i \ge 0} L(w, \alpha, \beta) = p^*.

(You should convince yourself of this; this follows from the “max min” of a
function always being less than or equal to the “min max.”) However, under
certain conditions, we will have
d ∗ = p∗ ,
so that we can solve the dual problem in lieu of the primal problem. Let’s
see what these conditions are.
Suppose f and the gi ’s are convex,3 and the hi ’s are affine.4 Suppose
further that the constraints gi are (strictly) feasible; this means that there
exists some w so that gi (w) < 0 for all i.
[Footnote 3: When f has a Hessian, then it is convex if and only if the Hessian is positive semi-definite. For instance, f(w) = w^T w is convex; similarly, all linear (and affine) functions are also convex. (A function f can also be convex without being differentiable, but we won't need those more general definitions of convexity here.)]
[Footnote 4: I.e., there exists a_i, b_i, so that h_i(w) = a_i^T w + b_i. "Affine" means the same thing as linear, except that we also allow the extra intercept term b_i.]

Under our above assumptions, there must exist w∗ , α∗ , β ∗ so that w∗ is the


solution to the primal problem, α∗ , β ∗ are the solution to the dual problem,
and moreover p∗ = d∗ = L(w∗ , α∗ , β ∗ ). Moreover, w∗ , α∗ and β ∗ satisfy the
Karush-Kuhn-Tucker (KKT) conditions, which are as follows:

\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1, \ldots, n    (3)
\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1, \ldots, l    (4)
\alpha_i^* g_i(w^*) = 0, \quad i = 1, \ldots, k    (5)
g_i(w^*) \le 0, \quad i = 1, \ldots, k    (6)
\alpha_i^* \ge 0, \quad i = 1, \ldots, k    (7)

Moreover, if some w∗ , α∗ , β ∗ satisfy the KKT conditions, then it is also a


solution to the primal and dual problems.
We draw attention to Equation (5), which is called the KKT dual com-
plementarity condition. Specifically, it implies that if αi∗ > 0, then gi (w∗ ) =
0. (I.e., the “gi (w) ≤ 0” constraint is active, meaning it holds with equality
rather than with inequality.) Later on, this will be key for showing that the
SVM has only a small number of “support vectors”; the KKT dual comple-
mentarity condition will also give us our convergence test when we talk about
the SMO algorithm.

6 Optimal margin classifiers


Previously, we posed the following (primal) optimization problem for finding
the optimal margin classifier:
\min_{w,b} \quad \frac{1}{2}\|w\|^2
\text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1, \quad i = 1, \ldots, m

We can write the constraints as

gi (w) = −y (i) (wT x(i) + b) + 1 ≤ 0.

We have one such constraint for each training example. Note that from the
KKT dual complementarity condition, we will have αi > 0 only for the train-
ing examples that have functional margin exactly equal to one (i.e., the ones

corresponding to constraints that hold with equality, g_i(w) = 0). Consider the figure below, in which a maximum margin separating hyperplane is
shown by the solid line.

The points with the smallest margins are exactly the ones closest to the
decision boundary; here, these are the three points (one negative and two pos-
itive examples) that lie on the dashed lines parallel to the decision boundary.
Thus, only three of the αi ’s—namely, the ones corresponding to these three
training examples—will be non-zero at the optimal solution to our optimiza-
tion problem. These three points are called the support vectors in this
problem. The fact that the number of support vectors can be much smaller
than the size the training set will be useful later.
Let’s move on. Looking ahead, as we develop the dual form of the prob-
lem, one key idea to watch out for is that we’ll try to write our algorithm
in terms of only the inner product hx(i) , x(j) i (think of this as (x(i) )T x(j) )
between points in the input feature space. The fact that we can express our
algorithm in terms of these inner products will be key when we apply the
kernel trick.
When we construct the Lagrangian for our optimization problem we have:

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[y^{(i)}(w^T x^{(i)} + b) - 1\right].    (8)

Note that there’re only “αi ” but no “βi ” Lagrange multipliers, since the
problem has only inequality constraints.
Let’s find the dual form of the problem. To do so, we need to first
minimize L(w, b, α) with respect to w and b (for fixed α), to get θD , which

we’ll do by setting the derivatives of L with respect to w and b to zero. We


have:

\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0

This implies that


w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}.    (9)

As for the derivative with respect to b, we obtain


\frac{\partial}{\partial b} L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i y^{(i)} = 0.    (10)

If we take the definition of w in Equation (9) and plug that back into the
Lagrangian (Equation 8), and simplify, we get
L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)} - b \sum_{i=1}^{m} \alpha_i y^{(i)}.

But from Equation (10), the last term must be zero, so we obtain
L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)}.

Recall that we got to the equation above by minimizing L with respect to w


and b. Putting this together with the constraints αi ≥ 0 (that we always had)
and the constraint (10), we obtain the following dual optimization problem:
\max_\alpha \quad W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle
\text{s.t.} \quad \alpha_i \ge 0, \quad i = 1, \ldots, m
\qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0,

You should also be able to verify that the conditions required for p∗ =
d∗ and the KKT conditions (Equations 3–7) to hold are indeed satisfied in
our optimization problem. Hence, we can solve the dual in lieu of solving
the primal problem. Specifically, in the dual problem above, we have a
maximization problem in which the parameters are the αi ’s. We’ll talk later

about the specific algorithm that we’re going to use to solve the dual problem,
but if we are indeed able to solve it (i.e., find the α’s that maximize W (α)
subject to the constraints), then we can use Equation (9) to go back and find
the optimal w’s as a function of the α’s. Having found w∗ , by considering
the primal problem, it is also straightforward to find the optimal value for
the intercept term b as
b^* = -\frac{\max_{i: y^{(i)} = -1} w^{*T} x^{(i)} + \min_{i: y^{(i)} = 1} w^{*T} x^{(i)}}{2}.    (11)
(Check for yourself that this is correct.)
Before moving on, let’s also take a more careful look at Equation (9),
which gives the optimal value of w in terms of (the optimal value of) α.
Suppose we’ve fit our model’s parameters to a training set, and now wish to
make a prediction at a new point input x. We would then calculate wT x + b,
and predict y = 1 if and only if this quantity is bigger than zero. But
using (9), this quantity can also be written:
w^T x + b = \left(\sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}\right)^T x + b    (12)
          = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b.    (13)
Hence, if we’ve found the αi ’s, in order to make a prediction, we have to
calculate a quantity that depends only on the inner product between x and
the points in the training set. Moreover, we saw earlier that the αi ’s will all
be zero except for the support vectors. Thus, many of the terms in the sum
above will be zero, and we really need to find only the inner products between
x and the support vectors (of which there is often only a small number) in
order to calculate (13) and make our prediction.
By examining the dual form of the optimization problem, we gained sig-
nificant insight into the structure of the problem, and were also able to write
the entire algorithm in terms of only inner products between input feature
vectors. In the next section, we will exploit this property to apply the ker-
nels to our classification problem. The resulting algorithm, support vector
machines, will be able to efficiently learn in very high dimensional spaces.
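A minimal sketch of the prediction rule (13) (not from the notes): given values of alpha and b obtained from solving the dual, classifying a new x requires only inner products with the support vectors. The 1e-8 threshold is an arbitrary numerical tolerance.

import numpy as np

def svm_predict(x, X_train, y_train, alpha, b):
    sv = alpha > 1e-8                          # alpha_i = 0 except at the support vectors
    score = np.sum(alpha[sv] * y_train[sv] * (X_train[sv] @ x)) + b
    return 1 if score > 0 else -1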

7 Kernels
Back in our discussion of linear regression, we had a problem in which the
input x was the living area of a house, and we considered performing regres-

sion using the features x, x2 and x3 (say) to obtain a cubic function. To


distinguish between these two sets of variables, we’ll call the “original” input
value the input attributes of a problem (in this case, x, the living area).
When that is mapped to some new set of quantities that are then passed to
the learning algorithm, we’ll call those new quantities the input features.
(Unfortunately, different authors use different terms to describe these two
things, but we’ll try to use this terminology consistently in these notes.) We
will also let φ denote the feature mapping, which maps from the attributes
to the features. For instance, in our example, we had

\phi(x) = \begin{bmatrix} x \\ x^2 \\ x^3 \end{bmatrix}.

Rather than applying SVMs using the original input attributes x, we may
instead want to learn using some features φ(x). To do so, we simply need to
go over our previous algorithm, and replace x everywhere in it with φ(x).
Since the algorithm can be written entirely in terms of the inner prod-
ucts hx, zi, this means that we would replace all those inner products with
hφ(x), φ(z)i. Specifically, given a feature mapping φ, we define the corre-
sponding Kernel to be

K(x, z) = φ(x)T φ(z).

Then, everywhere we previously had hx, zi in our algorithm, we could simply


replace it with K(x, z), and our algorithm would now be learning using the
features φ.
Now, given φ, we could easily compute K(x, z) by finding φ(x) and φ(z)
and taking their inner product. But what’s more interesting is that often,
K(x, z) may be very inexpensive to calculate, even though φ(x) itself may
be very expensive to calculate (perhaps because it is an extremely high di-
mensional vector). In such settings, by using in our algorithm an efficient
way to calculate K(x, z), we can get SVMs to learn in the high dimensional
feature space space given by φ, but without ever having to explicitly find or
represent vectors φ(x).
Let’s see an example. Suppose x, z ∈ Rn , and consider

K(x, z) = (xT z)2 .



We can also write this as

K(x, z) = \left(\sum_{i=1}^{n} x_i z_i\right)\left(\sum_{j=1}^{n} x_j z_j\right)
        = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j z_i z_j
        = \sum_{i,j=1}^{n} (x_i x_j)(z_i z_j)

Thus, we see that K(x, z) = φ(x)^T φ(z), where the feature mapping φ is given (shown here for the case of n = 3) by

\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \end{bmatrix}.

Note that whereas calculating the high-dimensional φ(x) requires O(n2 ) time,
finding K(x, z) takes only O(n) time—linear in the dimension of the input
attributes.
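A quick numerical check (an illustration, not part of the notes) that K(x, z) = (x^T z)^2 matches φ(x)^T φ(z) for the quadratic feature map written out above; the particular x and z are arbitrary.

import numpy as np

def phi(v):
    return np.array([v[i] * v[j] for i in range(3) for j in range(3)])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

print((x @ z) ** 2)          # K(x, z), computed in O(n) time
print(phi(x) @ phi(z))       # the same value, via the 9-dimensional feature space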
For a related kernel, also consider

K(x, z) = (x^T z + c)^2
        = \sum_{i,j=1}^{n} (x_i x_j)(z_i z_j) + \sum_{i=1}^{n} (\sqrt{2c}\, x_i)(\sqrt{2c}\, z_i) + c^2.

(Check this yourself.) This corresponds to the feature mapping (again shown

for n = 3)

\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ \sqrt{2c}\, x_3 \\ c \end{bmatrix},
and the parameter c controls the relative weighting between the xi (first
order) and the xi xj (second order) terms.
More broadly, the kernel K(x, z) = (x^T z + c)^d corresponds to a feature mapping to an \binom{n+d}{d}-dimensional feature space, corresponding to all monomials of the
form xi1 xi2 . . . xik that are up to order d. However, despite working in this
O(nd )-dimensional space, computing K(x, z) still takes only O(n) time, and
hence we never need to explicitly represent feature vectors in this very high
dimensional feature space.
Now, let’s talk about a slightly different view of kernels. Intuitively, (and
there are things wrong with this intuition, but nevermind), if φ(x) and φ(z)
are close together, then we might expect K(x, z) = φ(x)T φ(z) to be large.
Conversely, if φ(x) and φ(z) are far apart—say nearly orthogonal to each
other—then K(x, z) = φ(x)T φ(z) will be small. So, we can think of K(x, z)
as some measurement of how similar are φ(x) and φ(z), or of how similar are
x and z.
Given this intuition, suppose that for some learning problem that you’re
working on, you’ve come up with some function K(x, z) that you think might
be a reasonable measure of how similar x and z are. For instance, perhaps
you chose

K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right).
This is a reasonable measure of x and z’s similarity, and is close to 1 when
x and z are close, and near 0 when x and z are far apart. Can we use this
definition of K as the kernel in an SVM? In this particular example, the
answer is yes. (This kernel is called the Gaussian kernel, and corresponds

to an infinite dimensional feature mapping φ.) But more broadly, given some
function K, how can we tell if it’s a valid kernel; i.e., can we tell if there is
some feature mapping φ so that K(x, z) = φ(x)T φ(z) for all x, z?
Suppose for now that K is indeed a valid kernel corresponding to some
feature mapping φ. Now, consider some finite set of m points (not necessarily
the training set) {x(1) , . . . , x(m) }, and let a square, m-by-m matrix K be
defined so that its (i, j)-entry is given by Kij = K(x(i) , x(j) ). This matrix
is called the Kernel matrix. Note that we’ve overloaded the notation and
used K to denote both the kernel function K(x, z) and the kernel matrix K,
due to their obvious close relationship.
Now, if K is a valid Kernel, then Kij = K(x(i) , x(j) ) = φ(x(i) )T φ(x(j) ) =
φ(x(j) )T φ(x(i) ) = K(x(j) , x(i) ) = Kji , and hence K must be symmetric. More-
over, letting φk (x) denote the k-th coordinate of the vector φ(x), we find that
for any vector z, we have
z^T K z = \sum_i \sum_j z_i K_{ij} z_j
        = \sum_i \sum_j z_i \phi(x^{(i)})^T \phi(x^{(j)}) z_j
        = \sum_i \sum_j \sum_k z_i \phi_k(x^{(i)}) \phi_k(x^{(j)}) z_j
        = \sum_k \sum_i \sum_j z_i \phi_k(x^{(i)}) \phi_k(x^{(j)}) z_j
        = \sum_k \left(\sum_i z_i \phi_k(x^{(i)})\right)^2
        \ge 0.

The second-to-last step above used the same trick as you saw in Problem
set 1 Q1. Since z was arbitrary, this shows that K is positive semi-definite
(K ≥ 0).
Hence, we’ve shown that if K is a valid kernel (i.e., if it corresponds to
some feature mapping φ), then the corresponding Kernel matrix K ∈ Rm×m
is symmetric positive semidefinite. More generally, this turns out to be not
only a necessary, but also a sufficient, condition for K to be a valid kernel
(also called a Mercer kernel). The following result is due to Mercer.5
[Footnote 5: Many texts present Mercer's theorem in a slightly more complicated form involving L_2 functions, but when the input attributes take values in R^n, the version given here is equivalent.]

Theorem (Mercer). Let K : R^n × R^n → R be given. Then for K to be a valid (Mercer) kernel, it is necessary and sufficient that for any {x^{(1)}, . . . , x^{(m)}}, (m < ∞), the corresponding kernel matrix is symmetric positive semi-definite.

Given a function K, apart from trying to find a feature mapping φ that


corresponds to it, this theorem therefore gives another way of testing if it is
a valid kernel. You’ll also have a chance to play with these ideas more in
problem set 2.
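A sketch of this Mercer-style test (not from the notes): build the kernel matrix for a finite set of points and verify that it is symmetric positive semi-definite. The Gaussian kernel, the bandwidth, and the random points are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(1)
pts = rng.normal(size=(20, 5))
sigma = 1.0

sq_dists = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))       # K_ij = K(x^(i), x^(j))

eigvals = np.linalg.eigvalsh(K)                # K is symmetric, so eigvalsh applies
print(np.allclose(K, K.T), eigvals.min() >= -1e-10)   # symmetric and PSD (up to roundoff)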
In class, we also briefly talked about a couple of other examples of ker-
nels. For instance, consider the digit recognition problem, in which given an
image (16x16 pixels) of a handwritten digit (0-9), we have to figure out which
digit it was. Using either a simple polynomial kernel K(x, z) = (xT z)d or
the Gaussian kernel, SVMs were able to obtain extremely good performance
on this problem. This was particularly surprising since the input attributes
x were just 256-dimensional vectors of the image pixel intensity values, and
the system had no prior knowledge about vision, or even about which pixels
are adjacent to which other ones. Another example that we briefly talked
about in lecture was that if the objects x that we are trying to classify are
strings (say, x is a list of amino acids, which strung together form a protein),
then it seems hard to construct a reasonable, “small” set of features for
most learning algorithms, especially if different strings have different length-
s. However, consider letting φ(x) be a feature vector that counts the number
of occurrences of each length-k substring in x. If we’re considering strings
of English letters, then there are 26^k such strings. Hence, φ(x) is a 26^k-dimensional vector; even for moderate values of k, this is probably too big for us to efficiently work with (e.g., 26^4 ≈ 460,000). However, using (dynamic programming-ish) string matching algorithms, it is possible to efficiently compute K(x, z) = φ(x)^T φ(z), so that we can now implicitly work in this 26^k-dimensional feature space, but without ever explicitly computing feature
vectors in this space.
The application of kernels to support vector machines should already
be clear and so we won’t dwell too much longer on it here. Keep in mind
however that the idea of kernels has significantly broader applicability than
SVMs. Specifically, if you have any learning algorithm that you can write
in terms of only inner products hx, zi between input attribute vectors, then
by replacing this with K(x, z) where K is a kernel, you can “magically”
allow your algorithm to work efficiently in the high dimensional feature space
corresponding to K. For instance, this kernel trick can be applied with the
perceptron to derive a kernel perceptron algorithm. Many of the algorithms
that we’ll see later in this class will also be amenable to this method, which
has come to be known as the “kernel trick.”

8 Regularization and the non-separable case


The derivation of the SVM as presented so far assumed that the data is
linearly separable. While mapping data to a high dimensional feature space
via φ does generally increase the likelihood that the data is separable, we
can’t guarantee that it always will be so. Also, in some cases it is not clear
that finding a separating hyperplane is exactly what we’d want to do, since
that might be susceptible to outliers. For instance, the left figure below
shows an optimal margin classifier, and when a single outlier is added in the
upper-left region (right figure), it causes the decision boundary to make a
dramatic swing, and the resulting classifier has a much smaller margin.

To make the algorithm work for non-linearly separable datasets as well


as be less sensitive to outliers, we reformulate our optimization (using ℓ1
regularization) as follows:

\min_{\gamma, w, b} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i
\text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \quad i = 1, \ldots, m
\qquad \xi_i \ge 0, \quad i = 1, \ldots, m.

Thus, examples are now permitted to have (functional) margin less than 1,
and if an example has functional margin 1 − ξ_i (with ξ_i > 0), we would pay
a cost of the objective function being increased by Cξi . The parameter C
controls the relative weighting between the twin goals of making the ||w||2
small (which we saw earlier makes the margin large) and of ensuring that
most examples have functional margin at least 1.
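To see the role of C concretely, here is a sketch (an illustration, not part of the notes) using scikit-learn's SVC as one off-the-shelf soft-margin solver on made-up data with a single mislabeled outlier; a small C makes violations cheap, so the boundary is less swung by the outlier.

from sklearn import svm
import numpy as np

X = np.vstack([np.random.default_rng(2).normal([2, 2], 0.3, (20, 2)),
               np.random.default_rng(3).normal([-2, -2], 0.3, (20, 2)),
               [[-2.0, -1.5]]])                 # one outlier placed in the negative cluster
y = np.array([1] * 20 + [-1] * 20 + [1])        # ... but labeled +1

for C in (0.1, 100.0):
    clf = svm.SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.coef_, clf.intercept_)         # the boundary shifts far more for large C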

As before, we can form the Lagrangian:


L(w, b, \xi, \alpha, r) = \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[y^{(i)}(x^{(i)T} w + b) - 1 + \xi_i\right] - \sum_{i=1}^{m} r_i \xi_i.

Here, the αi ’s and ri ’s are our Lagrange multipliers (constrained to be ≥ 0).


We won’t go through the derivation of the dual again in detail, but after
setting the derivatives with respect to w and b to zero as before, substituting
them back in, and simplifying, we obtain the following dual form of the
problem:

\max_\alpha \quad W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle
\text{s.t.} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, m
\qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0,

As before, we also have that w can be expressed in terms of the αi ’s


as given in Equation (9), so that after solving the dual problem, we can
continue to use Equation (13) to make our predictions. Note that, somewhat
surprisingly, in adding ℓ1 regularization, the only change to the dual problem
is that what was originally a constraint that 0 ≤ αi has now become 0 ≤
αi ≤ C. The calculation for b∗ also has to be modified (Equation 11 is no
longer valid); see the comments in the next section/Platt’s paper.
Also, the KKT dual-complementarity conditions (which in the next sec-
tion will be useful for testing for the convergence of the SMO algorithm)
are:

αi = 0 ⇒ y (i) (wT x(i) + b) ≥ 1 (14)


αi = C ⇒ y (i) (wT x(i) + b) ≤ 1 (15)
0 < αi < C ⇒ y (i) (wT x(i) + b) = 1. (16)

Now, all that remains is to give an algorithm for actually solving the dual
problem, which we will do in the next section.

9 The SMO algorithm


The SMO (sequential minimal optimization) algorithm, due to John Platt,
gives an efficient way of solving the dual problem arising from the derivation

of the SVM. Partly to motivate the SMO algorithm, and partly because it’s
interesting in its own right, let’s first take another digression to talk about
the coordinate ascent algorithm.

9.1 Coordinate ascent


Consider trying to solve the unconstrained optimization problem

\max_\alpha W(\alpha_1, \alpha_2, \ldots, \alpha_m).

Here, we think of W as just some function of the parameters αi ’s, and for now
ignore any relationship between this problem and SVMs. We’ve already seen
two optimization algorithms, gradient ascent and Newton’s method. The
new algorithm we’re going to consider here is called coordinate ascent:

Loop until convergence: {
    For i = 1, . . . , m, {
        α_i := arg max_{α̂_i} W(α_1, . . . , α_{i−1}, α̂_i, α_{i+1}, . . . , α_m).
    }
}

Thus, in the innermost loop of this algorithm, we will hold all the vari-
ables except for some αi fixed, and reoptimize W with respect to just the
parameter αi . In the version of this method presented here, the inner-loop
reoptimizes the variables in order α1 , α2 , . . . , αm , α1 , α2 , . . .. (A more sophis-
ticated version might choose other orderings; for instance, we may choose
the next variable to update according to which one we expect to allow us to
make the largest increase in W (α).)
When the function W happens to be of such a form that the “arg max”
in the inner loop can be performed efficiently, then coordinate ascent can be
a fairly efficient algorithm. Here’s a picture of coordinate ascent in action:

[Figure: contours of a quadratic objective over the plane (axes roughly from −2 to 2.5), showing the path taken by coordinate ascent from its initialization at (2, −2) to the global maximum; each step is parallel to a coordinate axis.]

The ellipses in the figure are the contours of a quadratic function that
we want to optimize. Coordinate ascent was initialized at (2, −2), and also
plotted in the figure is the path that it took on its way to the global maximum.
Notice that on each step, coordinate ascent takes a step that’s parallel to one
of the axes, since only one variable is being optimized at a time.
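A concrete sketch of coordinate ascent (not from the notes) on a small concave quadratic W(a) = −aᵀPa + qᵀa, with P and q made up; each inner step maximizes W exactly in one coordinate while holding the others fixed.

import numpy as np

P = np.array([[2.0, 0.6], [0.6, 1.0]])       # positive definite, so W is strictly concave
q = np.array([1.0, -1.0])
a = np.array([2.0, -2.0])                    # initialized at (2, -2), as in the figure

for _ in range(50):
    for i in range(len(a)):
        # maximize W over coordinate i only: set dW/da_i = q_i - 2 (P a)_i = 0 and solve
        a[i] = (q[i] - 2.0 * (P[i] @ a - P[i, i] * a[i])) / (2.0 * P[i, i])

print(a)                                     # converges to the global maximum
print(np.linalg.solve(2.0 * P, q))           # the maximizer computed directly, for comparison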

9.2 SMO
We close off the discussion of SVMs by sketching the derivation of the SMO
algorithm. Some details will be left to the homework, and for others you
may refer to the paper excerpt handed out in class.
Here’s the (dual) optimization problem that we want to solve:
\max_\alpha \quad W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle    (17)
\text{s.t.} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, m    (18)
\qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0.    (19)

Let’s say we have set of αi ’s that satisfy the constraints (18-19). Now,
suppose we want to hold α2 , . . . , αm fixed, and take a coordinate ascent step
and reoptimize the objective with respect to α1 . Can we make any progress?
The answer is no, because the constraint (19) ensures that
\alpha_1 y^{(1)} = -\sum_{i=2}^{m} \alpha_i y^{(i)}.

Or, by multiplying both sides by y (1) , we equivalently have


\alpha_1 = -y^{(1)} \sum_{i=2}^{m} \alpha_i y^{(i)}.

(This step used the fact that y (1) ∈ {−1, 1}, and hence (y (1) )2 = 1.) Hence,
α1 is exactly determined by the other αi ’s, and if we were to hold α2 , . . . , αm
fixed, then we can’t make any change to α1 without violating the constrain-
t (19) in the optimization problem.
Thus, if we want to update some subset of the αi 's, we must update at
least two of them simultaneously in order to keep satisfying the constraints.
This motivates the SMO algorithm, which simply does the following:
Repeat till convergence {
1. Select some pair αi and αj to update next (using a heuristic that
tries to pick the two that will allow us to make the biggest progress
towards the global maximum).
2. Reoptimize W (α) with respect to αi and αj , while holding all the
other αk ’s (k 6= i, j) fixed.
}
To test for convergence of this algorithm, we can check whether the KKT
conditions (Equations 14-16) are satisfied to within some tol. Here, tol is
the convergence tolerance parameter, and is typically set to around 0.01 to
0.001. (See the paper and pseudocode for details.)
The key reason that SMO is an efficient algorithm is that the update to
αi , αj can be computed very efficiently. Let’s now briefly sketch the main
ideas for deriving the efficient update.
Let’s say we currently have some setting of the αi ’s that satisfy the con-
straints (18-19), and suppose we’ve decided to hold α3 , . . . , αm fixed, and
want to reoptimize W (α1 , α2 , . . . , αm ) with respect to α1 and α2 (subject to
the constraints). From (19), we require that
\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m} \alpha_i y^{(i)}.

Since the right hand side is fixed (as we’ve fixed α3 , . . . αm ), we can just let
it be denoted by some constant ζ:
α1 y (1) + α2 y (2) = ζ. (20)
We can thus picture the constraints on α1 and α2 as follows:

[Figure: the box [0, C] × [0, C] in the (α1, α2) plane together with the line α1 y^(1) + α2 y^(2) = ζ; the feasible values of α2 along this line lie between a lower bound L and an upper bound H.]

From the constraints (18), we know that α1 and α2 must lie within the box
[0, C] × [0, C] shown. Also plotted is the line α1 y (1) + α2 y (2) = ζ, on which we
know α1 and α2 must lie. Note also that, from these constraints, we know
L ≤ α2 ≤ H; otherwise, (α1 , α2 ) can’t simultaneously satisfy both the box
and the straight line constraint. In this example, L = 0. But depending on
what the line α1 y (1) + α2 y (2) = ζ looks like, this won’t always necessarily be
the case; but more generally, there will be some lower-bound L and some
upper-bound H on the permissible values for α2 that will ensure that α1 , α2
lie within the box [0, C] × [0, C].
Using Equation (20), we can also write α1 as a function of α2 :

α1 = (ζ − α2 y (2) )y (1) .

(Check this derivation yourself; we again used the fact that y (1) ∈ {−1, 1} so
that (y (1) )2 = 1.) Hence, the objective W (α) can be written

W (α1 , α2 , . . . , αm ) = W ((ζ − α2 y (2) )y (1) , α2 , . . . , αm ).

Treating α3 , . . . , αm as constants, you should be able to verify that this is


just some quadratic function in α2 . I.e., this can also be expressed in the
form a·α2² + b·α2 + c for some appropriate a, b, and c. If we ignore the “box”
constraints (18) (or, equivalently, that L ≤ α2 ≤ H), then we can easily
maximize this quadratic function by setting its derivative to zero and solving.
We’ll let α2new,unclipped denote the resulting value of α2 . You should also be
able to convince yourself that if we had instead wanted to maximize W with
respect to α2 but subject to the box constraint, then we can find the resulting
optimal value simply by taking α2^new,unclipped and “clipping” it to lie in the

[L, H] interval, to get

α2^new = H                  if α2^new,unclipped > H
α2^new = α2^new,unclipped   if L ≤ α2^new,unclipped ≤ H
α2^new = L                  if α2^new,unclipped < L

Finally, having found the α2new , we can use Equation (20) to go back and find
the optimal value of α1new .
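To make this step concrete, here is a minimal Python sketch (not Platt's pseudocode; the function names and the example numbers are invented for this illustration). It clips a hypothetical unclipped optimum of α2 to [L, H] and recovers α1 from Equation (20); the L and H shown are the standard bounds for the case y^(1) ≠ y^(2).

# Hedged sketch of the clipping step in an SMO inner loop (illustrative names,
# not Platt's exact pseudocode).
def clip(alpha2_unclipped, L, H):
    """Project the unconstrained optimum of the quadratic onto [L, H]."""
    if alpha2_unclipped > H:
        return H
    if alpha2_unclipped < L:
        return L
    return alpha2_unclipped

def recover_alpha1(zeta, alpha2_new, y1, y2):
    """Solve alpha1*y1 + alpha2*y2 = zeta for alpha1, using y1 in {-1, +1}."""
    return y1 * (zeta - alpha2_new * y2)

# Hypothetical example with y1 = +1, y2 = -1, C = 1.0.
y1, y2, C = 1, -1, 1.0
alpha1_old, alpha2_old = 0.4, 0.1
zeta = alpha1_old * y1 + alpha2_old * y2            # fixed by constraint (19)
L = max(0.0, alpha2_old - alpha1_old)               # bounds for the y1 != y2 case
H = min(C, C + alpha2_old - alpha1_old)
alpha2_new = clip(0.9, L, H)                        # pretend 0.9 is the unclipped optimum
alpha1_new = recover_alpha1(zeta, alpha2_new, y1, y2)
print(alpha1_new, alpha2_new)                       # (1.0, 0.7): constraint (20) still holds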
There are a couple more details that are quite easy but that we'll leave you
to read about yourself in Platt’s paper: One is the choice of the heuristics
used to select the next αi , αj to update; the other is how to update b as the
SMO algorithm is run.
10-601 Machine Learning

Maria-Florina Balcan Spring 2015

Generalization Abilities: Sample Complexity Results.

The ability to generalize beyond what we have seen in the training phase is the essence of machine
learning, essentially what makes machine learning, machine learning. In these notes we describe
some basic concepts and the classic formalization that allows us to talk about these important
concepts in a precise way.

Distributional Learning

The basic idea of the distributional learning setting is to assume that examples are being provided
from a fixed (but perhaps unknown) distribution over the instance space. The assumption of a
fixed distribution gives us hope that what we learn based on some training data will carry over
to new test data we haven’t seen yet. A nice feature of this assumption is that it provides us a
well-defined notion of the error of a hypothesis with respect to a target concept.
Specifically, in the distributional learning setting (captured by the PAC model of Valiant and Sta-
tistical Learning Theory framework of Vapnik) we assume that the input to the learning algorithm
is a set of labeled examples

S: (x1 , y1 ), . . . , (xm , ym )

where xi are drawn i.i.d. from some fixed but unknown distribution D over the instance space
X and that they are labeled by some target concept c∗ . So yi = c∗ (xi ). Here the goal is to do
optimization over the given sample S in order to find a hypothesis h : X → {0, 1}, that has small
error over whole distribution D. The true error of h with respect to a target concept c∗ and the
underlying distribution D is defined as

err(h) = Pr_{x∼D} ( h(x) ≠ c∗(x) ).

(Pr_{x∼D}(A) means the probability of event A given that x is selected according to distribution D.)
We denote by
errS(h) = Pr_{x∼S} ( h(x) ≠ c∗(x) ) = (1/m) Σ_{i=1}^m I[ h(xi ) ≠ c∗(xi ) ]

the empirical error of h over the sample S (that is, the fraction of examples in S misclassified by h).
What kind of guarantee could we hope to make?

• We converge quickly to the target concept (or equivalent). But, what if our distribution
places low weight on some part of X?

• We converge quickly to an approximation of the target concept. But, what if the examples
we see don’t correctly reflect the distribution?

• With high probability we converge to an approximation of the target concept. This is the
idea of Probably Approximately Correct learning.

Distributional Learning. Realizable case

Here is a basic result that is meaningful in the realizable case (when the target function belongs to
an a-priori known finite hypothesis space H.)

Theorem 1 Let H be a finite hypothesis space. Let D be an arbitrary, fixed unknown probability
distribution over X and let c∗ be an arbitrary unknown target function. For any ε, δ > 0, if we
draw a sample from D of size

m = (1/ε) ( ln(|H|) + ln(1/δ) ),

then with probability at least 1 − δ, all hypotheses/concepts in H with error ≥ ε are inconsistent
with the data (or alternatively, with probability at least 1 − δ any hypothesis consistent with the data
will have error at most ε).

Proof: The proof involves the following steps:

1. Consider some specific “bad” hypothesis h whose error is at least ε. The probability that this
bad hypothesis h is consistent with m examples drawn from D is at most (1 − ε)^m.

2. Notice that there are (only) at most |H| possible bad hypotheses.

3. (1) and (2) imply that given m examples drawn from D, the probability there exists a bad
hypothesis consistent with all of them is at most |H|(1 − ε)^m. Suppose that m is sufficiently
large so that this quantity is at most δ. That means that with probability (1 − δ) there is
no consistent hypothesis whose error is more than ε.

4. The final step is to calculate the value m needed to satisfy

|H|(1 − ε)^m ≤ δ.     (1)

Using the inequality 1 − x ≤ e^(−x), it is simple to verify that (1) is true as long as:

m ≥ (1/ε) ( ln(|H|) + ln(1/δ) ).

Note: Another way to write the bound in Theorem 1 is as follows:

For any δ > 0, if we draw a sample from D of size m then with probability at least 1 − δ, any
hypothesis in H consistent with the data will have error at most
(1/m) ( ln(|H|) + ln(1/δ) ).
This is the more “statistical learning theory style” way of writing the same bound.
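As a quick sanity check on the bound (an illustration added here, not part of the original notes; the numbers are arbitrary), the snippet below evaluates m = (1/ε)(ln|H| + ln(1/δ)):

# Illustrative evaluation of the realizable-case sample complexity bound.
import math

def sample_size(H_size, eps, delta):
    """Smallest integer m satisfying m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

print(sample_size(H_size=1000, eps=0.1, delta=0.05))     # 100
print(sample_size(H_size=10**6, eps=0.01, delta=0.01))   # 1843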

Distributional Learning. The Non-realizable case

In the general case, the target function might not be in the class of functions we consider. Formally,
in the non-realizable or agnostic passive supervised learning setting, we assume that the
input to a learning algorithm is a set S of labeled examples S = {(x1 , y1 ), . . . , (xm , ym )}. We
assume that these examples are drawn i.i.d. from some fixed but unknown distribution D over
the instance space X and that they are labeled by some target concept c∗ . So yi = c∗ (xi ). The
goal is just as in the realizable case to do optimization over the given sample S in order to find a
hypothesis h : X → {0, 1} of small error over the whole distribution D. Our goal is to compete with
the best function (the function of smallest true error rate) in some concept class H.
A natural hope is that picking a concept c with a small observed error rate gives us small true error
rate. It is therefore useful to find a relationship between observed error rate for a sample and the
true error rate.

Concentration Inequalities. Hoeffding Bound

Consider a hypothesis with true error rate p (or a coin of bias p) observed on m examples (the coin
is flipped m times). Let S be the number of observed errors (the number of heads seen) so S/m is
the observed error rate.
Hoeffding bounds state that for any ε ∈ [0, 1],

1. Pr[ S/m > p + ε ] ≤ e^(−2mε²), and

2. Pr[ S/m < p − ε ] ≤ e^(−2mε²).
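These bounds are easy to probe numerically. Below is a small hedged simulation (the bias, sample size, and tolerance are arbitrary illustration values) comparing the observed tail frequency with the Hoeffding bound:

# Illustrative check of Pr[S/m > p + eps] <= exp(-2*m*eps^2) for a biased coin.
import math, random

def empirical_tail(p=0.3, m=200, eps=0.05, trials=20000, seed=0):
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        heads = sum(rng.random() < p for _ in range(m))
        if heads / m > p + eps:
            exceed += 1
    return exceed / trials

print("empirical tail frequency:", empirical_tail())
print("Hoeffding bound:", math.exp(-2 * 200 * 0.05**2))   # exp(-1), about 0.37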

Simple sample complexity results for finite hypotheses spaces

We can use the Hoeffding bounds to show the following:

Theorem 2 Let H be a finite hypothesis space. Let D be an arbitrary, fixed unknown probability
distribution over X and let c∗ be an arbitrary unknown target function. For any ε, δ > 0, if we
draw a sample S from D of size

m ≥ (1/(2ε²)) ( ln(2|H|) + ln(1/δ) ),

then with probability at least (1 − δ), all hypotheses h in H have

|err(h) − errS (h)| ≤ ε.     (2)

Proof: Let us fix a hypothesis h. By Hoeffding, we get that the probability that its observed error
is not within ε of its true error is at most 2e^(−2mε²) ≤ δ/|H|. By a union bound over all h in H, we
then get the desired result.

Note: A statement of this type is called a uniform convergence result. It implies that the hypoth-
esis that minimizes the empirical error rate will be very close in generalization error to the best
hypothesis in the class. In particular, if ĥ = argmin_{h∈H} errS(h), then err(ĥ) ≤ err(h∗) + 2ε,
where h∗ is a hypothesis of smallest true error rate.
Note: The sample size grows quadratically with 1/ε. Recall that the learning sample size in the
realizable (PAC) case grew only linearly with 1/ε.
Note: Another way to write the bound in Theorem 2 is as follows:
For any δ > 0, if we draw a sample from D of size m then with probability at least 1 − δ, all
hypotheses h in H have

err(h) ≤ errS(h) + √( ( ln(2|H|) + ln(1/δ) ) / (2m) ).
This is the more “statistical learning theory style” way of writing the same bound.

Sample complexity results for infinite hypothesis spaces

In the case where H is not finite, we will replace |H| with other measures of complexity of H
(shattering coefficient, VC-dimension, Rademacher complexity).

Shattering, VC dimension

Let H be a concept class over an instance space X, i.e. a set of functions from X to
{0, 1} (where both H and X may be infinite). For any S ⊆ X, let us denote by H(S) the set of
all behaviors or dichotomies on S that are induced or realized by H, i.e. if S = {x1 , · · · , xm }, then
H(S) ⊆ {0, 1}^m and

H(S) = { (c(x1 ), · · · , c(xm )) ; c ∈ H }.
Also, for any natural number m, we consider H[m] to be the maximum number of ways to split m
points using concepts in H, that is

H[m] = max { |H(S)| ; |S| = m, S ⊆ X }.

To get a feel for what this means: if H is the class of thresholds on the line, then H[m] = m + 1; if
H is the class of intervals, then H[m] = O(m²); and for linear separators in R^d, H[m] = O(m^(d+1)).

Definition 1 If |H(S)| = 2^|S| then S is shattered by H.

Definition 2 The Vapnik-Chervonenkis dimension of H, denoted V Cdim(H), is the cardinality
of the largest set S shattered by H. If arbitrarily large finite sets can be shattered by H, then
V Cdim(H) = ∞.

Note 1 In order to show that the VC dimension of a class is at least d we must simply find some
shattered set of size d. In order to show that the VC dimension is at most d we must show that no
set of size d + 1 is shattered.

Examples

1. Let H be the concept class of thresholds on the real number line. Clearly samples of size
1 can be shattered by this class. However, no sample of size 2 can be shattered since it is
impossible to choose a threshold such that x1 is labeled positive and x2 is labeled negative for
x1 ≤ x2 . Hence V Cdim(H) = 1.

2. Let H be the concept class intervals on the real line. Here a sample of size 2 is shattered, but
no sample of size 3 is shattered, since no concept can satisfy a sample whose middle point is
negative and outer points are positive. Hence, V Cdim(H) = 2.

3. Let H be the concept class of k non-intersecting intervals on the real line. A sample of
size 2k shatters (just treat each pair of points as a separate case of example 2) but no
sample of size 2k + 1 shatters, since if the sample points are alternated positive/negative,
starting with a positive point, the positive points can’t be covered by only k intervals. Hence
V Cdim(H) = 2k.

4. Let H the class of linear separators in R2 . Three points can be shattered, but four cannot;
hence V Cdim(H) = 3. To see why four points can never be shattered, consider two cases.
The trivial case is when one point can be placed within a triangle formed by the other three;
then if the middle point is positive and the others are negative, no half space can contain
only the positive points. If however the points cannot be arranged in that pattern, then label
two points diagonally across from each other as positive, and the other two as negative; no half
space can realize this labeling. In general, one can show that the VC dimension of the class of
linear separators in R^n is n + 1.

5. The class of axis-aligned rectangles in the plane has V Cdim = 4. The trick here is to note that
for any collection of five points, at least one of them must be interior to or on the boundary
of any rectangle bounded by the other four; hence if the bounding points are positive, the
interior point cannot be made negative.

Sauer’s Lemma
Lemma 1 If d = V Cdim(H), then for all m, H[m] ≤ Φd(m), where Φd(m) = Σ_{i=0}^d (m choose i).
For m > d we have:

Φd(m) ≤ (em/d)^d.

Note that for H the class of intervals we achieve H[m] = Φd(m), where d = V Cdim(H), so the
bound in Sauer's lemma is tight.
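As a concrete check of this tightness claim (an illustration added here, not part of the original notes), the snippet below compares the number of dichotomies realizable by intervals on m points with Φd(m) for d = 2:

# Compare interval dichotomies with Sauer's bound Phi_d(m) for d = VCdim = 2.
from math import comb

def phi(d, m):
    """Sauer's bound: sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

def interval_dichotomies(m):
    """Labelings of m ordered points realizable by one interval:
    C(m+1, 2) placements of the two endpoints in the m+1 gaps, plus the all-negative labeling."""
    return comb(m + 1, 2) + 1

for m in [3, 5, 10]:
    print(m, interval_dichotomies(m), phi(2, m))   # the last two columns coincide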

Sample Complexity Results based on Shattering and VCdim

Interestingly, we can roughly replace ln(|H|) from the case where H is finite with the shattering
coefficient H[2m] when H is infinite. Specifically:

Theorem 3 Let H be an arbitrary hypothesis space. Let D be an arbitrary, fixed unknown proba-
bility distribution over X and let c∗ be an arbitrary unknown target function. For any ε, δ > 0, if
we draw a sample S from D of size

m > (2/ε) ( log₂(2 · H[2m]) + log₂(1/δ) )     (3)

then with probability (1 − δ), all bad hypotheses in H (with error > ε with respect to c∗ and D) are
inconsistent with the data.

Theorem 4 Let H be an arbitrary hypothesis space. Let D be an arbitrary, fixed unknown proba-
bility distribution over X and let c∗ be an arbitrary unknown target function. For any ε, δ > 0, if
we draw a sample S from D of size

m > (8/ε²) ( ln(2 · H[2m]) + ln(1/δ) ),

then with probability 1 − δ, all h in H have

|errD(h) − errS(h)| < ε.

We can now use Sauer’s lemma to get a nice closed form expression on sample complexity (an
upper bound on the number of samples needed to learn concepts from the class) based on the VC-
dimension of a concept class. The following is the VC dimension based sample complexity bound
for the realizable case:

Theorem 5 Let H be an arbitrary hypothesis space of VC-dimension d. Let D be an arbitrary


unknown probability distribution over the instance space and let c∗ be an arbitrary unknown target
function. For any ε, δ > 0, if we draw a sample S from D of size m satisfying

m ≥ (8/ε) ( d ln(16/ε) + ln(2/δ) ),

then with probability at least 1 − δ, all the hypotheses in H with errD(h) > ε are inconsistent with
the data, i.e., errS(h) ≠ 0.

So it is possible to learn a class C of VC-dimension d with parameters δ and ε given that the
number of samples m is at least m ≥ c (1/ε) ( d log(1/ε) + log(1/δ) ), where c is a fixed constant.
So, as long as V Cdim(H) is finite, it is possible to learn concepts from H even though H might be infinite!
One can also show that this sample complexity result is tight within a factor of O(log(1/ε)). Here
is a simplified version of the lower bound:

Theorem 6 Any algorithm for learning a concept class of VC dimension d with parameters ε and
δ ≤ 1/15 must use more than (d − 1)/(64ε) examples in the worst case.

The following is the VC dimension based sample complexity bound for the non-realizable case:

Theorem 7 Let H be an arbitrary hypothesis space of VC-dimension d. Let D be an arbitrary,


fixed unknown probability distribution over X and let c∗ be an arbitrary unknown target function.
For any ε, δ > 0, if we draw a sample S from D of size

m = O( (1/ε²) ( d + ln(1/δ) ) ),

then with probability at least (1 − δ), all hypotheses h in H have

|err(h) − errS(h)| ≤ ε.     (4)

Note: As in the finite case, we can rewrite the bounds in Theorems 5 and 7 in the “statistical
learning theory style” as follows:
Let H be an arbitrary hypothesis space of VC-dimension d. For any δ > 0, if we draw a sample
from D of size m then with probability at least 1 − δ, any hypothesis in H consistent with the data
will have error at most
O( (1/m) ( d ln(m/d) + ln(1/δ) ) ).

For any δ > 0, if we draw a sample from D of size m then with probability at least 1 − δ, all
hypotheses h in H have

err(h) ≤ errS(h) + O( √( (d + ln(1/δ)) / m ) ).

We can see from these bounds that the gap between true error and empirical error in the realizable
case is O(ln(m)/m), whereas in the general (non-realizable) case it is the larger O(1/√m).

Maximum Likelihood, Logistic Regression,
and Stochastic Gradient Training
Charles Elkan
[email protected]
January 10, 2014

1 Principle of maximum likelihood


Consider a family of probability distributions defined by a set of parameters θ.
The distributions may be either probability mass functions (pmfs) or probability
density functions (pdfs). Suppose that we have a random sample drawn from
a fixed but unknown member of this family. The random sample is a training
set of n examples x1 to xn . An example may also be called an observation, an
outcome, an instance, or a data point. In general each xj is a vector of values, and
θ is a vector of real-valued parameters. For example, for a Gaussian distribution
θ = hµ, σ 2 i.
We assume that the examples are independent, so the probability of the set is
the product of the probabilities of the individual examples:
f(x1 , . . . , xn ; θ) = ∏_j fθ(xj ; θ).

The notation above makes us think of the distribution θ as fixed and the examples
xj as unknown, or varying. However, we can think of the training data as fixed
and consider alternative parameter values. This is the point of view behind the
definition of the likelihood function:
L(θ; x1 , . . . , xn ) = f (x1 , . . . , xn ; θ).
Note that if f (x; θ) is a probability mass function, then the likelihood is always
less than one, but if f (x; θ) is a probability density function, then the likelihood
can be greater than one, since densities can be greater than one.

The principle of maximum likelihood says that given the training data, we
should use as our model the distribution f (·; θ̂) that gives the greatest possible
probability to the training data. Formally,

θ̂ = argmaxθ L(θ; x1 , . . . , xn ).

The value θ̂ is called the maximum likelihood estimator (MLE) of θ. In general


the hat notation indicates an estimated quantity; if necessary we will use notation
like θ̂MLE to indicate the nature of an estimate.

2 Examples of maximizing likelihood


As a first example of finding a maximum likelihood estimator, consider estimating
the parameter of a Bernoulli distribution. A random variable with this distribution
is a formalization of a coin toss. The value of the random variable is 1 with
probability θ and 0 with probability 1 − θ. Let X be a Bernoulli random variable,
and let x be an outcome of X. We have
 
P(X = x) = θ if x = 1, and P(X = x) = 1 − θ if x = 0.

Usually, we use the notation P (·) for a probability mass, and the notation p(·) for
a probability density. For mathematical convenience write P (X) as

P(X = x) = θ^x (1 − θ)^(1−x).

Suppose that the training data are x1 through xn where each xi ∈ {0, 1}. The
likelihood function is
L(θ; x1 , . . . , xn ) = f(x1 , . . . , xn ; θ) = ∏_{i=1}^n P(X = xi ) = θ^h (1 − θ)^(n−h)

where h = Σ_{i=1}^n xi . The maximization is performed over the possible scalar
values 0 ≤ θ ≤ 1.
We can do the maximization by setting the derivative with respect to θ equal
to zero. The derivative is
d/dθ [ θ^h (1 − θ)^(n−h) ] = h θ^(h−1) (1 − θ)^(n−h) + θ^h (n − h)(1 − θ)^(n−h−1) (−1)
                           = θ^(h−1) (1 − θ)^(n−h−1) [ h(1 − θ) − (n − h)θ ]

which has solutions θ = 0, θ = 1, and θ = h/n. The solution which is a maximum
is clearly θ = h/n while θ = 0 and θ = 1 are minima. So we have the maximum
likelihood estimate θ̂ = h/n.
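A quick hedged check (hypothetical data, added here for illustration) that the closed form θ̂ = h/n agrees with a brute-force maximization of the likelihood:

# Check that theta_hat = h/n maximizes theta^h * (1-theta)^(n-h) on a toy sample.
data = [1, 0, 1, 1, 0, 1]                  # hypothetical coin flips
n, h = len(data), sum(data)

def likelihood(theta):
    return theta**h * (1 - theta)**(n - h)

grid = [i / 1000 for i in range(1001)]     # grid search instead of calculus
theta_numeric = max(grid, key=likelihood)
print(theta_numeric, h / n)                # both approximately 4/6 = 0.667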
The log likelihood function, written l(·), is simply the logarithm of the likeli-
hood function L(·). Because logarithm is a monotonic strictly increasing function,
maximizing the log likelihood is precisely equivalent to maximizing the likeli-
hood, and also to minimizing the negative log likelihood.
For an example of maximizing the log likelihood, consider the problem of
estimating the parameters of a univariate Gaussian distribution. This distribution
is
f(x; µ, σ²) = ( 1 / (σ √(2π)) ) exp[ −(x − µ)² / (2σ²) ].
The log likelihood for one example x is
l(µ, σ²; x) = log L(µ, σ²; x) = − log σ − log √(2π) − (x − µ)² / (2σ²).
Suppose that we have training data {x1 , . . . , xn }. The maximum log likelihood
estimates are
⟨µ̂, σ̂²⟩ = argmax_{⟨µ,σ²⟩} [ −n log σ − n log √(2π) − (1/(2σ²)) Σ_{i=1}^n (xi − µ)² ].

The expression in square brackets is to be optimized simultaneously over two


variables. Fortunately, it can be simplified into two sequential univariate optimizations.
Eliminating the minus signs makes both of these optimizations minimizations. The first is

µ̂ = argmin_µ Σ_{i=1}^n (xi − µ)²

while the second is


σ̂² = argmin_{σ²} [ n log σ + T / (2σ²) ]

where T = Σ_{i=1}^n (xi − µ̂)². In order to do the first minimization, write (xi − µ)²
as (xi − x̄ + x̄ − µ)2 . Then
Σ_{i=1}^n (xi − µ)² = Σ_{i=1}^n (xi − x̄)² + 2(x̄ − µ) Σ_{i=1}^n (xi − x̄) + n(x̄ − µ)².

The first term Σ_{i=1}^n (xi − x̄)² does not depend on µ, so it is irrelevant to the min-
imization. The second term equals zero, because Σ_{i=1}^n (xi − x̄) = 0. The third
term is nonnegative and equals zero exactly when µ = x̄, so the whole expression is
minimized when µ = x̄.
To perform the second minimization, work out the derivative symbolically and
then work out when it equals zero:
∂/∂σ [ n log σ + (1/2) σ^(−2) T ] = n σ^(−1) + (1/2)(−2σ^(−3)) T
                                   = σ^(−1) ( n − T σ^(−2) )
                                   = 0   if σ² = T/n.
Maximum likelihood estimators are typically reasonable, but they may have is-
sues. Consider the Gaussian variance estimator σ̂²_MLE = Σ_{i=1}^n (xi − x̄)²/n and
the case where n = 1. In this case σ̂²_MLE = 0. This estimate is guaranteed to be
too small. Intuitively, the estimate is optimistically assuming that all future data
points x2 and so on will equal x1 exactly.
It can be proved that in general the maximum likelihood estimate of the vari-
ance of a Gaussian is too small, on average:
E[ (1/n) Σ_{i=1}^n (xi − x̄)² ; µ, σ² ] = ((n − 1)/n) σ² < σ².
This phenomenon can be considered an instance of overfitting: the observed
spread around the observed mean x̄ is less than the unknown true spread σ 2 around
the unknown true mean µ.
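A hedged simulation (the sample size, number of trials, and true parameters are arbitrary illustration values) that exhibits this (n − 1)/n shrinkage on average:

# Average the MLE variance estimate over many small Gaussian samples.
import random

def mle_variance(xs):
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / n

rng = random.Random(0)
n, trials, true_var = 5, 100000, 4.0
avg = sum(mle_variance([rng.gauss(0.0, 2.0) for _ in range(n)])
          for _ in range(trials)) / trials
print(avg, (n - 1) / n * true_var)   # both near 3.2, below the true variance 4.0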

3 Conditional likelihood
An important extension of the idea of likelihood is conditional likelihood. Re-
member that the notation p(y|x) is an abbreviation for the conditional probability
p(Y = y|X = x) where Y and X are random variables. The conditional like-
lihood of θ given data x and y is L(θ; y|x) = p(y|x) = f (y|x; θ). Intuitively,
Y follows a probability distribution that is different for different x. Technically,
for each x there is a different function f (y|x; θ), but all these functions share the
same parameters θ. We assume that x itself is never unknown, so there is no need
to have a probabilistic model of it.
Given training data consisting of ⟨xi , yi ⟩ pairs, the principle of maximum conditional likelihood
says to choose a parameter estimate θ̂ that maximizes the product ∏_i f(yi |xi ; θ). Note that we
do not need to assume that the xi are independent

in order to justify the conditional likelihood being a product; we just need to as-
sume that the yi are independent when each is conditioned on its own xi . For
any specific value of x, θ̂ can then be used to compute probabilities for alternative
values y of Y . By assumption, we never want to predict values of x.
Suppose that Y is a binary (Bernoulli) outcome and that x is a real-valued
vector. We can assume that the probability that Y = 1 is a nonlinear function of a
linear function of x. Specifically, we assume the conditional model
p(Y = 1|x; α, β) = σ( α + Σ_{j=1}^d βj xj ) = 1 / ( 1 + exp(−[ α + Σ_{j=1}^d βj xj ]) )

where σ(z) = 1/(1 + e−z ) is the nonlinear function. This model is called logistic
regression. We use j to index over the feature values x1 to xd of a single example
of dimensionality d, since we use i below to index over training examples 1 to
n. If necessary, the notation xij means the jth feature value of the ith example.
Be sure to understand the distinction between a feature and a value of a feature.
Essentially a feature is a random variable, while a value of a feature is a possible
outcome of the random variable. Features may also be called attributes, predictors,
or independent variables. The dependent random variable Y is sometimes called
a dependent variable.
The logistic regression model is easier to understand in the form
log( p / (1 − p) ) = α + Σ_{j=1}^d βj xj

where p is an abbreviation for p(Y = 1|x; α, β). The ratio p/(1 − p) is called
the odds of the event Y = 1 given X = x, and log[p/(1 − p)] is called the log
odds. Since probabilities range between 0 and 1, odds range between 0 and +∞,
and log odds range unboundedly between −∞ and +∞. A linear expression of
the form α + Σ_j βj xj can also take unbounded values, so it is reasonable to use
a linear expression as a model for log odds, but not as a model for odds or for
probabilities. Essentially, logistic regression is the simplest reasonable model for
a random yes/no outcome whose probability depends linearly on predictors x1 to
xd .
For each feature j, exp(βj xj ) is a multiplicative scaling factor on the odds
p/(1 − p). If the predictor xj is binary, then exp(βj ) is the extra odds of having
the outcome Y = 1 rather than Y = 0 when xj = 1, compared to when xj = 0.
If the predictor xj is real-valued, then exp(βj ) is the extra odds of having the

outcome Y = 1 when the value of xj increases by one unit. A major limitation
of the basic logistic regression model is that the probability p must either increase
monotonically, or decrease monotonically, as a function of each predictor xj . The
basic model does not allow the probability to depend in a U-shaped way on any
xj .
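A small hedged sketch (the parameter values are invented for illustration) confirming that the predicted probability and its log odds are related exactly as above:

# For logistic regression, the log odds of the prediction equal the linear score.
import math

def predict_prob(x, alpha, beta):
    score = alpha + sum(b * xj for b, xj in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-score))             # sigma(score)

alpha, beta = -1.0, [0.5, 2.0]                         # hypothetical parameters
x = [1.0, 0.3]
p = predict_prob(x, alpha, beta)
linear_score = alpha + sum(b * xj for b, xj in zip(beta, x))
print(math.log(p / (1 - p)), linear_score)             # both equal 0.1 here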
Given the training set {hx1 , y1 i, . . . , hxn , yn i}, we learn a logistic regression
classifier by maximizing the log joint conditional likelihood. This is the sum of
the log conditional likelihood for each training example:
LCL = Σ_{i=1}^n log L(θ; yi |xi ) = Σ_{i=1}^n log f(yi |xi ; θ).

Given a single training example hxi , yi i, the log conditional likelihood is log pi if
the true label yi = 1 and log(1 − pi ) if yi = 0, where pi = p(y = 1|xi ; θ).
To simplify the following discussion, assume from now on that α = β0 and
x0 = 1 for every example x, so the parameter vector θ is β ∈ Rd+1 . By group-
ing together the positive and negative training examples, we can write the total
conditional log likelihood as
LCL = Σ_{i:yi =1} log pi + Σ_{i:yi =0} log(1 − pi ).

The partial derivative of LCL with respect to parameter βj is


Σ_{i:yi =1} ∂/∂βj log pi  +  Σ_{i:yi =0} ∂/∂βj log(1 − pi ).

For an individual training example hx, yi, if its label y = 1 the partial derivative is
∂/∂βj log p = (1/p) ∂p/∂βj

while if y = 0 it is

∂/∂βj log(1 − p) = −( 1/(1 − p) ) ∂p/∂βj .

Let e = exp[ −Σ_{j=0}^d βj xj ], so p = 1/(1 + e) and 1 − p = (1 + e − 1)/(1 + e) = e/(1 + e).
With this notation we have

∂p/∂βj = (−1)(1 + e)^(−2) ∂e/∂βj
       = (−1)(1 + e)^(−2) (e) ∂/∂βj [ −Σ_j βj xj ]
       = (−1)(1 + e)^(−2) (e)(−xj )
       = [ 1/(1 + e) ] [ e/(1 + e) ] xj
       = p(1 − p) xj .

So

∂/∂βj log p = (1 − p) xj    and    ∂/∂βj log(1 − p) = −p xj .
For the entire training set the partial derivative of the log conditional likelihood
with respect to βj is

∂/∂βj LCL = Σ_{i:yi =1} (1 − pi ) xij + Σ_{i:yi =0} (−pi ) xij = Σ_i (yi − pi ) xij

where xij is the value of the jth feature of the ith training example. Setting the
partial derivative to zero yields
Σ_i yi xij = Σ_i pi xij .

We have one equation of this type for each parameter βj . The equations can be
used to check the correctness of a trained model.
Informally, but not precisely, the expression Σ_i yi xij is the average value over
the training set of the jth feature, where each training example is weighted 1 if its
true label is positive, and 0 otherwise. The expression Σ_i pi xij is the same aver-
age, except that each example i is weighted according to its predicted probability
pi of being positive. When the logistic regression classifier is trained correctly,
then these two averages must be the same for every feature. The special case for
j = 0 gives
(1/n) Σ_i yi = (1/n) Σ_i pi .
In words, the empirical base rate probability of being positive must equal the
average predicted probability of being positive.
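Here is a hedged illustration (toy data, plain full-batch gradient ascent; not code from the notes) of this moment-matching property:

# After training, sum_i y_i*x_ij is (nearly) equal to sum_i p_i*x_ij for every j,
# including the constant feature x_0 = 1.
import math

X = [[1.0, 0.2, 1.5], [1.0, -0.4, 0.3], [1.0, 1.1, -0.2], [1.0, 0.7, 0.9]]
y = [1, 0, 1, 0]
beta = [0.0, 0.0, 0.0]

def prob(x, beta):
    return 1.0 / (1.0 + math.exp(-sum(b * xj for b, xj in zip(beta, x))))

for _ in range(20000):                       # full-batch gradient ascent on LCL
    p = [prob(x, beta) for x in X]
    for j in range(len(beta)):
        beta[j] += 0.1 * sum((yi - pi) * xi[j] for yi, pi, xi in zip(y, p, X))

p = [prob(x, beta) for x in X]
for j in range(len(beta)):
    print(sum(yi * xi[j] for yi, xi in zip(y, X)),
          sum(pi * xi[j] for pi, xi in zip(p, X)))   # the two columns (nearly) agree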

4 Stochastic gradient training
There are several sophisticated ways of actually doing the maximization of the to-
tal conditional log likelihood, i.e. the conditional log likelihood summed over all
training examples hxi , yi i; for details see [Minka, 2007, Komarek and Moore, 2005].
However, here we consider a method called stochastic gradient ascent. This
method changes the parameter values to increase the log likelihood based on one
example at a time. It is called stochastic because the derivative based on a ran-
domly chosen single example is a random approximation to the true derivative
based on all the training data.
As explained in the previous section, the partial derivative of the log condi-
tional likelihood with respect to βj is
∂/∂βj LCL = Σ_i (yi − pi ) xij

where xij is the value of the jth feature of the ith training example. The gradient-
based update of the parameter βj is

βj := βj + λ ∂/∂βj LCL

where λ is a step size. A major problem with this approach is the time complexity
of computing the partial derivatives. Evaluating Σ_i (yi − pi ) xij for all j requires
O(nd) time where n is the number of training examples and d is their dimen-
sionality. Typically, after this evaluation, each βj can be changed by only a small
amount. The partial derivatives must then be evaluated again, at high computa-
tional cost again, before updating βj further.
The stochastic gradient idea is that we can get a random approximation of the
partial derivatives in much less than O(nd) time, so the parameters can be updated
much more rapidly. In general, for each parameter βj we want to define a random
variable Zj such that

E[Zj ] = ∂/∂βj LCL.
For logistic regression, one such Zj is n(yi − pi )xij where i is chosen randomly,
with uniform probability, from the set {1, 2, . . . , n}. Based on this, the stochastic
gradient update of βj is

βj := βj + λZ = βj + λ(yi − pi )xij

where i is selected randomly and n has been dropped since it is a constant. As be-
fore, the learning rate λ is a multiplier that controls the magnitude of the changes
to the parameters.
Stochastic gradient ascent (or descent, for a minimization problem) is a method
that is often useful in machine learning. Experience suggests some heuristics for
making it work well in practice.
• The training examples are sorted in random order, and the parameters are
updated for each example sequentially. One complete update for every ex-
ample is called an epoch. Typically, a small constant number of epochs is
used, perhaps 3 to 100 epochs.

• The learning rate is chosen by trial and error. It can be kept constant across
all epochs, e.g. λ = 0.1 or λ = 1, or it can be decreased gradually as a
function of the epoch number.

• Because the learning rate is the same for every parameter, it is useful to
scale the features xj so that their magnitudes are similar for all j. Given
that the feature x0 has constant value 1, it is reasonable to normalize every
other feature to have mean zero and variance 1, for example.
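Putting the heuristics above together, here is a hedged sketch (toy data, arbitrary step size; not code from the notes) of stochastic gradient ascent for logistic regression:

# Shuffle the examples each epoch and apply, for every feature j,
#   beta_j := beta_j + lam * (y_i - p_i) * x_ij.
import math, random

def sgd_logistic(X, y, lam=0.1, epochs=3, seed=0):
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)                       # random order each epoch
        for i in order:
            p = 1.0 / (1.0 + math.exp(-sum(b * xj for b, xj in zip(beta, X[i]))))
            for j, xij in enumerate(X[i]):
                beta[j] += lam * (y[i] - p) * xij
    return beta

# Toy data; the first column is the constant feature x_0 = 1.
X = [[1.0, 0.2], [1.0, -1.0], [1.0, 1.4], [1.0, -0.3]]
y = [1, 0, 1, 0]
print(sgd_logistic(X, y))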
Stochastic gradient ascent (or descent) has some properties that are very useful in
practice. First, suppose that xj = 0 for most features j of a training example x.
Then updating βj based on x can be skipped. This means that the time to do one
epoch is O(nf d) where n is the number of training examples, d is the number of
features, and f is the average fraction of features with a nonzero value per example. If
an example x is the bag-of-words representation of a document, then d is the size of
the vocabulary (often over 30,000) but f d is the average number of words actually
used in a document (often under 300).
Second, suppose that the number n of training examples is very large, as is the
case in many modern applications. Then, a stochastic gradient method may con-
verge to good parameter estimates in less than one epoch of training. In contrast,
a training method that computes the log likelihood of all data and uses this in the
same way regardless of n will be inefficient in how it uses the data.
For each example, a stochastic gradient method updates all parameters once.
The dual idea is to update one parameter at a time, based on all examples. This
method is called coordinate ascent (or descent). For feature j the update rule is
βj := βj + λ Σ_i (yi − pi ) xij .

The update for the whole parameter vector β̄ is

β̄ := β̄ + λ(ȳ − p̄)T X

where the matrix X is the entire training set and the column vector ȳ consists of
the 0/1 labels for every training example. Often, coordinate ascent converges too
slowly to be useful. However, it can be useful to do one update of β̄ after all
epochs of stochastic gradient ascent.
Regardless of the method used to train a model, it is important to remember
that optimizing the model perfectly on the training data usually does not lead to
the best possible performance on test examples. There are several reasons for this:

• The model with best possible performance may not belong to the family of
models under consideration. This is an instance of the principle “you cannot
learn it if you cannot represent it.”

• The training data may not be representative of the test data, i.e. the training
and test data may be samples from different populations.

• Fitting the training data as closely as possible may simply be overfitting.

• The objective function for training, namely log likelihood or conditional log
likelihood, may not be the desired objective from an application perspective;
for example, the desired objective may be classification accuracy.

5 Regularization
Consider learning a logistic regression classifier for categorizing documents. Sup-
pose that word number j appears only in documents whose labels are positive. The
partial derivative of the log conditional likelihood with respect to the parameter
for this word is
∂/∂βj LCL = Σ_i (yi − pi ) xij .

This derivative will always be positive, as long as the predicted probability pi


is not perfectly one for all these documents. Therefore, following the gradient
will send βj to infinity. Then, every test document containing this word will be
predicted to be positive with certainty, regardless of all other words in the test
document. This over-confident generalization is an example of overfitting.

There is a standard method for solving this overfitting problem that is quite
simple, but quite successful. The solution is called regularization. The idea is to
impose a penalty on the magnitude of the parameter values. This penalty should
be minimized, in a trade-off with maximizing likelihood. Mathematically, the
optimization problem to be solved is

β̂ = argmax_β [ LCL − µ ||β||₂² ]


where ||β||₂² = Σ_{j=1}^d βj² is the squared L2 norm of the parameter vector β of
length d. The constant µ quantifies the trade-off between maximizing likelihood
and making parameter values be close to zero.1
This type of regularization is called quadratic or Tikhonov regularization. A
major reason why it is popular is that it can be derived from several different points
of view. In particular, it arises as a consequence of assuming a Gaussian prior
on parameters. It also arises from theorems on minimizing generalization error,
i.e. error on independent test sets drawn from the same distribution as the training
set. And, it arises from robust classification: assuming that each training point lies
in an unknown location inside a sphere centered on its measured location.
Stochastic gradient following is easily extended to include regularization. We
simply include the penalty term when calculating the gradient for each example.
Consider
∂/∂βj [ log p(y|x; β) − µ Σ_{j=0}^d βj² ] = [ ∂/∂βj log p(y|x; β) ] − 2µβj .

Remember that for logistic regression the partial derivative of the log conditional
likelihood for one example is

∂/∂βj log p(y|x; β) = (y − p) xj

so the update rule with regularization is

βj := βj + λ[(y − p)xj − 2µβj ]


1
If β0 is the intercept of the model, then typically this parameter is not regularized. The intu-
itive reason is that every training example provides information about the intercept, that is about
the baseline probability of an example being positive, so there is enough information available to
avoid overfitting in the trained value of this parameter.

where λ is the learning rate. Update rules like the one above are often called
“weight decay” rules, since the weight βj is shrunk toward zero at each update
unless the data term (y − p)xj pushes it away from zero.
Straightforward stochastic gradient ascent for training a regularized logistic
regression model loses the desirable sparsity property described above, because
the value of every parameter βj must be decayed for every training example. How
to overcome this computational inefficiency is described in [Carpenter, 2008].
Writing the regularized optimization problem as a minimization gives
β̂ = argmin_β [ Σ_{i=1}^n − log p(yi |xi ; β) + µ Σ_{j=0}^d βj² ].

The expression − log p(yi |xi ; β) is called the “loss” for training example i. If the
predicted probability, using β, of the true label yi is close to 1, then the loss is
small. But if the predicted probability of yi is close to 0, then the loss is large.
Losses are always non-negative; we want to minimize them. We also want to
minimize the numerical magnitude of the trained parameters.
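A hedged sketch (toy values, invented helper name; not code from the notes) of a single regularized stochastic update βj := βj + λ[(y − p)xj − 2µβj]:

# One regularized stochastic gradient step for logistic regression.
# Following the footnote above, the intercept beta_0 (constant feature x_0 = 1) is not decayed.
import math

def regularized_step(beta, x, y, lam=0.1, mu=0.01):
    p = 1.0 / (1.0 + math.exp(-sum(b * xj for b, xj in zip(beta, x))))
    new_beta = []
    for j, (bj, xj) in enumerate(zip(beta, x)):
        decay = 0.0 if j == 0 else 2.0 * mu * bj
        new_beta.append(bj + lam * ((y - p) * xj - decay))
    return new_beta

print(regularized_step([0.5, -1.0, 2.0], [1.0, 0.3, -0.7], 1))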

References
[Carpenter, 2008] Carpenter, B. (2008). Lazy sparse stochastic gradient descent for regularized
multinomial logistic regression. Technical report, Alias-i. Available at
http://lingpipe-blog.com/lingpipe-white-papers/.

[Komarek and Moore, 2005] Komarek, P. and Moore, A. W. (2005). Making lo-
gistic regression a core data mining tool with TR-IRLS. In Proceedings of
the Fifth IEEE International Conference on Data Mining (ICDM’05), pages
685–688.

[Minka, 2007] Minka, T. P. (2007). A comparison of numerical optimizers for logistic regression.
First version 2001. Unpublished paper available at http://research.microsoft.com/~minka.
CSE 250B Quiz 3, January 21, 2010
Write your name:
Let fθ (x; θ) where x ∈ R be the probability density function (pdf) of the uniform
distribution over the range 0 to θ. Precisely, fθ (x; θ) = 1/θ if 0 ≤ x ≤ θ while
fθ (x; θ) = 0 otherwise.
Let x1 to xn be an independent identically distributed (iid) sample from fθ for
some unknown true parameter value θ > 0. The maximum likelihood estimator
(MLE) of θ is θ̂ = maxi xi .

[3 points] In one or two sentences, explain intuitively the reason why this is the
MLE. You do not need to use any equations.

Note: The MLE above is an example of overfitting, since the true value of θ is
almost certainly larger than the MLE.
CSE 250B Quiz 4, January 27, 2011
Write your name:
Assume that winning or losing a basketball game is similar to flipping a biased
coin. Suppose that San Diego State University (SDSU) has won all six games that
it has played.

(a) The maximum likelihood estimate (MLE) of the probability that SDSU will
win its next game is 1.0. Explain why this is the MLE. (Using equations is not
required.)

(b) This MLE can be viewed as overfitting. Explain why.


CSE 250B Quiz 4, January 28, 2010
The objective function to be minimized when training an L2 -regularized linear
regression model is
E = Σ_{i=1}^n ( f(xi ; w) − yi )² + µ Σ_{j=0}^d wj²

where the model is


f(xi ; w) = Σ_{j=0}^d wj xij .
All notation above is the same as in the class lecture notes.

[3 points] Work out the partial derivative of the objective function with respect to
weight wj .

Answer. Let fi be an abbreviation for f(xi ; w) and let di be an abbreviation for
∂/∂wj fi . Note that di = xij . The expanded objective function is

E = [ Σ_{i=1}^n ( fi² − 2 fi yi + yi² ) ] + µ Σ_{j=0}^d wj².

The partial derivative is

∂/∂wj E = [ Σ_{i=1}^n ( 2 fi di − 2 yi di + 0 ) ] + 2µwj

which is

∂/∂wj E = 2 [ µwj + Σ_{i=1}^n ( fi − yi ) xij ].
Additional note. Because the goal is to minimize E, we do gradient descent, not
ascent, with the update rule

wj := wj − λ ∂/∂wj E.
The update rule says that if the average over all training examples i of (fi − yi )xij
is positive, then wj must be decreased. Assume that xij is non-negative; this
update rule is reasonable because then fi is too big on average, and decreasing wj
will make fi decrease. The update rule also says that, because of regularization,
wj must always be decreased even more, by 2λµ times its current value.
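A hedged numerical check (hypothetical data; not part of the quiz solution) that the derived formula matches a finite-difference estimate of the derivative:

# Compare the analytic derivative dE/dw_j = 2*(mu*w_j + sum_i (f_i - y_i)*x_ij)
# with a one-sided finite difference.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]       # x_i0 = 1 is the constant feature
y = [1.2, -0.3, 0.4]
w, mu, j = [0.1, 0.3], 0.5, 1

def E(w):
    f = [sum(wk * xik for wk, xik in zip(w, xi)) for xi in X]
    return sum((fi - yi) ** 2 for fi, yi in zip(f, y)) + mu * sum(wk ** 2 for wk in w)

f = [sum(wk * xik for wk, xik in zip(w, xi)) for xi in X]
analytic = 2 * (mu * w[j] + sum((fi - yi) * xi[j] for fi, xi in zip(f, X)))
h = 1e-6
w_plus = list(w); w_plus[j] += h
numeric = (E(w_plus) - E(w)) / h
print(analytic, numeric)                         # the two values agree closely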
CSE 250B Quiz, February 3, 2011
Your name:
Suppose you are doing stochastic gradient descent to minimize the following error
function on a single training example:

E = e( f(x; w) − y )

Work out the stochastic gradient update rule as specifically as possible, when
the error function is absolute error: e(z) = |z|.
Hint: Use the notation e0 (z) for the derivative of e(z).
Answer: For each component wj of the parameter vector w, the update rule is


wj := wj − α ∂/∂wj E

where α is the learning rate. We can work out

∂/∂wj E = e′( f(x; w) − y ) · ∂/∂wj f.

If e(z) = |z| then e′(z) = sign(z), so

wj := wj − α sign( f(x; w) − y ) · ∂/∂wj f.

Intuitively, if f (x; w) is too large, and increasing wj makes f increase, then wj


should be decreased.
Theory and Applications of Boosting

Rob Schapire
Princeton University
Example: “How May I Help You?”
[Gorin et al.]
• goal: automatically categorize type of call requested by phone
customer (Collect, CallingCard, PersonToPerson, etc.)
• yes I’d like to place a collect call long distance
please (Collect)
• operator I need to make a call but I need to bill
it to my office (ThirdNumber)
• yes I’d like to place a call on my master card
please (CallingCard)
• I just called a number in sioux city and I musta
rang the wrong number because I got the wrong
party and I would like to have that taken off of
my bill (BillingCredit)
• observation:
• easy to find “rules of thumb” that are “often” correct
• e.g.: “IF ‘card’ occurs in utterance
THEN predict ‘CallingCard’ ”
• hard to find single highly accurate prediction rule
The Boosting Approach

• devise computer program for deriving rough rules of thumb


• apply procedure to subset of examples
• obtain rule of thumb
• apply to 2nd subset of examples
• obtain 2nd rule of thumb
• repeat T times
Details

• how to choose examples on each round?


• concentrate on “hardest” examples
(those most often misclassified by previous rules of
thumb)
• how to combine rules of thumb into single prediction rule?
• take (weighted) majority vote of rules of thumb
Boosting

• boosting = general method of converting rough rules of


thumb into highly accurate prediction rule
• technically:
• assume given “weak” learning algorithm that can
consistently find classifiers (“rules of thumb”) at least
slightly better than random, say, accuracy ≥ 55%
(in two-class setting)
• given sufficient data, a boosting algorithm can provably
construct single classifier with very high accuracy, say,
99%
Outline of Tutorial

• brief background
• basic algorithm and core theory
• other ways of understanding boosting
• experiments, applications and extensions
Brief Background
Strong and Weak Learnability

• boosting’s roots are in “PAC” (Valiant) learning model


• get random examples from unknown, arbitrary distribution
• strong PAC learning algorithm:
• for any distribution
with high probability
given polynomially many examples (and polynomial time)
can find classifier with arbitrarily small generalization
error
• weak PAC learning algorithm
• same, but generalization error only needs to be slightly
better than random guessing (1/2 − γ)
• [Kearns & Valiant ’88]:
• does weak learnability imply strong learnability?
Early Boosting Algorithms

• [Schapire ’89]:
• first provable boosting algorithm
• [Freund ’90]:
• “optimal” algorithm that “boosts by majority”
• [Drucker, Schapire & Simard ’92]:
• first experiments using boosting
• limited by practical drawbacks
AdaBoost
• [Freund & Schapire ’95]:
• introduced “AdaBoost” algorithm
• strong practical advantages over previous boosting algorithms
• experiments and applications using AdaBoost:
[Drucker & Cortes ’96] [Abney, Schapire & Singer ’99] [Tieu & Viola ’00]
[Jackson & Craven ’96] [Haruno, Shirai & Ooyama ’99] [Walker, Rambow & Rogati ’01]
[Freund & Schapire ’96] [Cohen & Singer’ 99] [Rochery, Schapire, Rahim & Gupta ’01]
[Quinlan ’96] [Dietterich ’00] [Merler, Furlanello, Larcher & Sboner ’01]
[Breiman ’96] [Schapire & Singer ’00] [Di Fabbrizio, Dutton, Gupta et al. ’02]
[Maclin & Opitz ’97] [Collins ’00] [Qu, Adam, Yasui et al. ’02]
[Bauer & Kohavi ’97] [Escudero, Màrquez & Rigau ’00] [Tur, Schapire & Hakkani-Tür ’03]
[Schwenk & Bengio ’98] [Iyer, Lewis, Schapire et al. ’00] [Viola & Jones ’04]
[Schapire, Singer & Singhal ’98] [Onoda, Rätsch & Müller ’00] [Middendorf, Kundaje, Wiggins et al. ’04]
.
.
.
• continuing development of theory and algorithms:
[Breiman ’98, ’99] [Duffy & Helmbold ’99, ’02] [Koltchinskii, Panchenko & Lozano ’01]
[Schapire, Freund, Bartlett & Lee ’98] [Freund & Mason ’99] [Collins, Schapire & Singer ’02]
[Grove & Schuurmans ’98] [Ridgeway, Madigan & Richardson ’99] [Demiriz, Bennett & Shawe-Taylor ’02]
[Mason, Bartlett & Baxter ’98] [Kivinen & Warmuth ’99] [Lebanon & Lafferty ’02]
[Schapire & Singer ’99] [Friedman, Hastie & Tibshirani ’00] [Wyner ’02]
[Cohen & Singer ’99] [Rätsch, Onoda & Müller ’00] [Rudin, Daubechies & Schapire ’03]
[Freund & Mason ’99] [Rätsch, Warmuth, Mika et al. ’00] [Jiang ’04]
[Domingo & Watanabe ’99] [Allwein, Schapire & Singer ’00] [Lugosi & Vayatis ’04]
[Mason, Baxter, Bartlett & Frean ’99] [Friedman ’01] [Zhang ’04]
.
.
.
Basic Algorithm and Core Theory

• introduction to AdaBoost
• analysis of training error
• analysis of test error based on
margins theory
A Formal Description of Boosting

• given training set (x1 , y1 ), . . . , (xm , ym )


• yi ∈ {−1, +1} correct label of instance xi ∈ X
• for t = 1, . . . , T :
• construct distribution Dt on {1, . . . , m}
• find weak classifier (“rule of thumb”)
ht : X → {−1, +1}
with small error εt on Dt :
εt = Pr_{i∼Dt} [ ht (xi ) ≠ yi ]
• output final classifier Hfinal
AdaBoost
[with Freund]

• constructing Dt :
• D1 (i ) = 1/m
• given Dt and ht :
Dt+1 (i) = (Dt (i)/Zt ) × e^(−αt)  if yi = ht (xi ),    (Dt (i)/Zt ) × e^(αt)  if yi ≠ ht (xi )
         = (Dt (i)/Zt ) exp(−αt yi ht (xi ))

where Zt = normalization constant and αt = (1/2) ln( (1 − εt)/εt ) > 0
• final classifier:
• Hfinal (x) = sign( Σ_t αt ht (x) )
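For concreteness, here is a hedged Python sketch of these updates (decision-stump weak learners on toy one-dimensional data; the helper names and the data are invented for this illustration, not Schapire's code):

# AdaBoost with threshold ("decision stump") weak learners on 1-D data.
import math

def best_stump(xs, ys, D):
    """Weak learner: h(x) = s if x > thr else -s, chosen to minimize weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for s in (+1, -1):
            h = lambda x, thr=thr, s=s: s if x > thr else -s
            err = sum(Di for Di, x, y in zip(D, xs, ys) if h(x) != y)
            if best is None or err < best[0]:
                best = (err, h)
    return best                                  # (weighted error eps_t, classifier h_t)

def adaboost(xs, ys, T=5):
    m = len(xs)
    D = [1.0 / m] * m                            # D_1(i) = 1/m
    ensemble = []                                # list of (alpha_t, h_t)
    for _ in range(T):
        eps, h = best_stump(xs, ys, D)
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        D = [Di * math.exp(-alpha * y * h(x)) for Di, x, y in zip(D, xs, ys)]
        Z = sum(D)
        D = [Di / Z for Di in D]                 # renormalize: Z_t is the normalization constant
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

xs = [0.1, 0.4, 0.6, 0.8, 1.0, 1.3]              # toy data (not the slides' example)
ys = [+1, +1, -1, +1, -1, -1]
H_final = adaboost(xs, ys)
print([H_final(x) for x in xs])                  # predictions of the combined classifier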
Toy Example

D1 : [figure omitted: the training examples with uniform initial weights]

weak classifiers = vertical or horizontal half-planes


Round 1

h1 , D2 : [figure omitted: weak classifier h1 and the resulting distribution D2 ; the examples h1 misclassifies are up-weighted]
ε1 = 0.30
α1 = 0.42
Round 2

h2 , D3 : [figure omitted: weak classifier h2 and the resulting distribution D3 ; the examples h2 misclassifies are up-weighted]
ε2 = 0.21
α2 = 0.65
Round 3

h3 : [figure omitted: weak classifier h3 ]
ε3 = 0.14
α3 = 0.92
Final Classifier

Hfinal = sign( 0.42 h1 + 0.65 h2 + 0.92 h3 )
[figure omitted: the decision regions of the combined classifier]
Analyzing the training error
• Theorem:
• write εt as 1/2 − γt
• then

training error(Hfinal ) ≤ ∏_t [ 2 √( εt (1 − εt) ) ]
                        = ∏_t √( 1 − 4γt² )
                        ≤ exp( −2 Σ_t γt² )

• so: if ∀t : γt ≥ γ > 0
then training error(Hfinal ) ≤ e^(−2γ²T)
• AdaBoost is adaptive:
• does not need to know γ or T a priori
• can exploit γt  γ
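The three quantities in the chain are easy to compare numerically. A small illustrative Python sketch (not part of the slides; the per-round edges gamma_t are made up):

import math

# Made-up per-round edges gamma_t (for illustration only).
gammas = [0.10, 0.05, 0.08, 0.12, 0.05, 0.07, 0.09, 0.06]
epsilons = [0.5 - g for g in gammas]            # eps_t = 1/2 - gamma_t

bound1 = math.prod(2 * math.sqrt(e * (1 - e)) for e in epsilons)
bound2 = math.prod(math.sqrt(1 - 4 * g * g) for g in gammas)
bound3 = math.exp(-2 * sum(g * g for g in gammas))

print(bound1, bound2)   # identical: two forms of the same product
print(bound3)           # at least as large, as the theorem claims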
Proof
• let f(x) = Σt αt ht(x)  ⇒  Hfinal(x) = sign(f(x))
• Step 1: unwrapping recurrence:
    Dfinal(i) = (1/m) · exp( −yi Σt αt ht(xi) ) / ∏t Zt
              = (1/m) · exp( −yi f(xi) ) / ∏t Zt
Proof (cont.)
• Step 2: training error(Hfinal) ≤ ∏t Zt
• Proof:
    training error(Hfinal) = (1/m) Σi { 1 if yi ≠ Hfinal(xi), 0 else }
                           = (1/m) Σi { 1 if yi f(xi) ≤ 0, 0 else }
                           ≤ (1/m) Σi exp(−yi f(xi))
                           = Σi Dfinal(i) ∏t Zt
                           = ∏t Zt
Proof (cont.)
• Step 3: Zt = 2 √(εt (1 − εt))
• Proof:
    Zt = Σi Dt(i) exp(−αt yi ht(xi))
       = Σ_{i: yi ≠ ht(xi)} Dt(i) e^(αt)  +  Σ_{i: yi = ht(xi)} Dt(i) e^(−αt)
       = εt e^(αt) + (1 − εt) e^(−αt)
       = 2 √(εt (1 − εt))
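The αt used throughout is exactly the minimizer of Zt(α) = εt e^α + (1 − εt) e^(−α). A quick numeric check (illustrative only; the value of εt is assumed):

import math

eps = 0.3                                       # assumed weighted error of h_t
alpha_star = 0.5 * math.log((1 - eps) / eps)    # AdaBoost's alpha_t

def Z(alpha):
    # Z_t(alpha) = eps * e^alpha + (1 - eps) * e^(-alpha)
    return eps * math.exp(alpha) + (1 - eps) * math.exp(-alpha)

print(Z(alpha_star), 2 * math.sqrt(eps * (1 - eps)))                        # equal
print(min(Z(a / 100) for a in range(-300, 301)) >= Z(alpha_star) - 1e-12)   # True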
How Will Test Error Behave? (A First Guess)
[Plot: error vs. # of rounds (T); training error dropping toward zero, test error first falling, then rising.]

expect:
• training error to continue to drop (or reach zero)
• test error to increase when Hfinal becomes “too complex”
• “Occam’s razor”
• overfitting
• hard to know when to stop training
Actual Typical Run
[Plot: training and test error vs. # of rounds (T), boosting C4.5 on the "letter" dataset; horizontal line marks the C4.5 test error.]
• test error does not increase, even after 1000 rounds
  • (total size > 2,000,000 nodes)
• test error continues to drop even after training error is zero!
      # rounds:      5      100    1000
      train error    0.0    0.0    0.0
      test error     8.4    3.3    3.1
• Occam's razor wrongly predicts "simpler" rule is better
A Better Story: The Margins Explanation
[with Freund, Bartlett & Lee]
• key idea:
  • training error only measures whether classifications are right or wrong
  • should also consider confidence of classifications
• recall: Hfinal is weighted majority vote of weak classifiers
• measure confidence by margin = strength of the vote
  = (fraction voting correctly) − (fraction voting incorrectly)
[Diagram: margin scale from −1 (high-confidence incorrect Hfinal) through 0 (low confidence) to +1 (high-confidence correct Hfinal).]
Empirical Evidence: The Margin Distribution
• margin distribution
  = cumulative distribution of margins of training examples
[Plots: training/test error vs. # of rounds (T), and cumulative distribution of margins after 5, 100 and 1000 rounds.]
      # rounds:          5       100     1000
      train error        0.0     0.0     0.0
      test error         8.4     3.3     3.1
      % margins ≤ 0.5    7.7     0.0     0.0
      minimum margin     0.14    0.52    0.55
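A small Python sketch (illustrative; the array names H, alphas and y are assumptions) of how the normalized margins and their cumulative distribution can be computed from the weak-classifier outputs and weights:

import numpy as np

def margins(H, alphas, y):
    """Normalized margins y_i * sum_t alpha_t h_t(x_i) / sum_t alpha_t.

    H:      (T, m) array, H[t, i] = h_t(x_i) in {-1, +1}
    alphas: length-T array of nonnegative weights
    y:      length-m array of labels in {-1, +1}
    """
    f = alphas @ H                      # unnormalized vote for each example
    return y * f / np.sum(alphas)       # margins lie in [-1, +1]

def cumulative_margin_distribution(m_vals, thetas):
    # Fraction of training examples with margin <= theta, for each theta.
    m_vals = np.sort(m_vals)
    return [np.mean(m_vals <= th) for th in thetas]

# Toy usage with random weak-classifier outputs.
rng = np.random.default_rng(0)
H = rng.choice([-1, 1], size=(10, 50))
alphas = rng.uniform(0.1, 1.0, size=10)
y = rng.choice([-1, 1], size=50)
print(cumulative_margin_distribution(margins(H, alphas, y), [-0.5, 0.0, 0.5]))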
Theoretical Evidence: Analyzing Boosting Using Margins
• Theorem: large margins ⇒ better bound on generalization error
  (independent of number of rounds)
  • proof idea: if all margins are large, then can approximate final
    classifier by a much smaller classifier (just as polls can predict
    not-too-close election)
• Theorem: boosting tends to increase margins of training examples
  (given weak learning assumption)
  • proof idea: similar to training error proof
• so:
  although final classifier is getting larger,
  margins are likely to be increasing,
  so final classifier actually getting close to a simpler classifier,
  driving down the test error
More Technically...
• with high probability, ∀θ > 0 :
    generalization error ≤ P̂r[ margin ≤ θ ] + Õ( √(d/m) / θ )
  (P̂r[·] = empirical probability)
• bound depends on
  • m = # training examples
  • d = "complexity" of weak classifiers
  • entire distribution of margins of training examples
• P̂r[margin ≤ θ] → 0 exponentially fast (in T) if
  (error of ht on Dt) < 1/2 − θ   (∀t)
• so: if weak learning assumption holds, then all examples
  will quickly have "large" margins
Other Ways of Understanding AdaBoost

• game theory
• loss minimization
• estimating conditional probabilities
Game Theory
• game defined by matrix M:
              Rock    Paper    Scissors
  Rock        1/2     1        0
  Paper       0       1/2      1
  Scissors    1       0        1/2
• row player chooses row i
• column player chooses column j (simultaneously)
• row player's goal: minimize loss M(i, j)
• usually allow randomized play:
  • players choose distributions P and Q over rows and columns
  • learner's (expected) loss
      = Σ_{i,j} P(i) M(i, j) Q(j)
      = Pᵀ M Q ≡ M(P, Q)
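For the Rock-Paper-Scissors matrix above, the expected loss of randomized play is just the bilinear form PᵀMQ; a tiny illustrative computation:

import numpy as np

# Loss matrix for the row player (rows/columns: Rock, Paper, Scissors).
M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])

P = np.array([1/3, 1/3, 1/3])    # row player's mixed strategy
Q = np.array([0.5, 0.25, 0.25])  # column player's mixed strategy

print(P @ M @ Q)                 # expected loss M(P, Q); uniform play gives 1/2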
The Minmax Theorem
• von Neumann's minmax theorem:
    min_P max_Q M(P, Q) = max_Q min_P M(P, Q)
                        = v
                        = "value" of game M
• in words:
  • v = min max means:
    • row player has strategy P∗
      such that ∀ column strategy Q,
      loss M(P∗, Q) ≤ v
  • v = max min means:
    • this is optimal in sense that
      column player has strategy Q∗
      such that ∀ row strategy P,
      loss M(P, Q∗) ≥ v
The Boosting Game
• let {g1, . . . , gN} = space of all weak classifiers
• row player ↔ booster
• column player ↔ weak learner
• matrix M:
  • row ↔ example (xi, yi)
  • column ↔ weak classifier gj
  • M(i, j) = 1 if yi = gj(xi), 0 else
[Diagram: M is an m × N matrix, rows indexed by the examples (x1, y1), . . . , (xm, ym) chosen by the booster, columns by the weak classifiers g1, . . . , gN chosen by the weak learner.]
Boosting and the Minmax Theorem
• if:
  • ∀ distributions over examples
    ∃ h with accuracy ≥ 1/2 + γ
• then:
  • min_P max_j M(P, j) ≥ 1/2 + γ
• by minmax theorem:
  • max_Q min_i M(i, Q) ≥ 1/2 + γ > 1/2
• which means:
  • ∃ weighted majority of classifiers which correctly classifies
    all examples with positive margin (2γ)
  • optimal margin ↔ "value" of game
AdaBoost and Game Theory
[with Freund]

• AdaBoost is special case of general algorithm for solving games through repeated play
• can show
• distribution over examples converges to (approximate)
minmax strategy for boosting game
• weights on weak classifiers converge to (approximate)
maxmin strategy
• different instantiation of game-playing algorithm gives on-line
learning algorithms (such as weighted majority algorithm)
AdaBoost and Exponential Loss
• many (most?) learning algorithms minimize a "loss" function
  • e.g. least squares regression
• training error proof shows AdaBoost actually minimizes
    ∏t Zt = (1/m) Σi exp(−yi f(xi))
  where f(x) = Σt αt ht(x)
• on each round, AdaBoost greedily chooses αt and ht to minimize loss
• exponential loss is an upper bound on 0-1 (classification) loss
• AdaBoost provably minimizes exponential loss
[Plot: exponential loss exp(−y f(x)) vs. y f(x), upper-bounding the 0-1 loss.]
Coordinate Descent
[Breiman]
• {g1, . . . , gN} = space of all weak classifiers
• want to find λ1, . . . , λN to minimize
    L(λ1, . . . , λN) = Σi exp( −yi Σj λj gj(xi) )
• AdaBoost is actually doing coordinate descent on this optimization problem:
  • initially, all λj = 0
  • each round: choose one coordinate λj (corresponding to ht) and
    update (increment by αt)
  • choose update causing biggest decrease in loss
• powerful technique for minimizing over huge space of functions
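A minimal sketch of this coordinate-descent view (illustrative; for brevity the weak-classifier outputs are assumed to be precomputed into a matrix G with G[i, j] = gj(xi)): each round picks the coordinate whose closed-form increment most reduces the exponential loss.

import numpy as np

def exp_loss(lam, G, y):
    # L(lambda) = sum_i exp(-y_i * sum_j lambda_j g_j(x_i))
    return np.sum(np.exp(-y * (G @ lam)))

def coordinate_descent(G, y, rounds=20):
    """G: (m, N) with G[i, j] = g_j(x_i) in {-1, +1}; y in {-1, +1}^m."""
    m, N = G.shape
    lam = np.zeros(N)
    for _ in range(rounds):
        w = np.exp(-y * (G @ lam))                      # current example weights
        best_j, best_loss, best_step = None, np.inf, 0.0
        for j in range(N):
            correct = G[:, j] == y
            w_plus, w_minus = w[correct].sum(), w[~correct].sum()
            if w_plus <= 0 or w_minus <= 0:             # skip degenerate columns
                continue
            step = 0.5 * np.log(w_plus / w_minus)       # best increment for lambda_j
            new_loss = 2.0 * np.sqrt(w_plus * w_minus)  # loss after that update
            if new_loss < best_loss:
                best_j, best_loss, best_step = j, new_loss, step
        if best_j is None:
            break
        lam[best_j] += best_step                        # greedy coordinate update
    return lam

# Toy usage: 8 candidate weak classifiers, labels mostly agree with the first one.
rng = np.random.default_rng(1)
G = rng.choice([-1, 1], size=(100, 8))
y = G[:, 0].copy()
flip = rng.random(100) < 0.1
y[flip] = -y[flip]
lam = coordinate_descent(G, y)
print(exp_loss(lam, G, y))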
Functional Gradient Descent
[Friedman][Mason et al.]
• want to minimize
    L(f) = L(f(x1), . . . , f(xm)) = Σi exp(−yi f(xi))
• say have current estimate f and want to improve
• to do gradient descent, would like update
    f ← f − α ∇f L(f)
• but update restricted in class of weak classifiers
    f ← f + α ht
• so choose ht "closest" to −∇f L(f)
• equivalent to AdaBoost
Benefits of Model Fitting View
• immediate generalization to other loss functions
  • e.g. squared error for regression
  • e.g. logistic regression (by only changing one line of AdaBoost)
• sensible approach for converting output of boosting into
  conditional probability estimates
• caveat: wrong to view AdaBoost as just an algorithm for
  minimizing exponential loss
  • other algorithms for minimizing same loss will (provably)
    give very poor performance
  • thus, this loss function cannot explain why AdaBoost "works"
Estimating Conditional Probabilities
[Friedman, Hastie & Tibshirani]
• often want to estimate probability that y = +1 given x
• AdaBoost minimizes (empirical version of):
    E_{x,y}[ e^(−y f(x)) ] = E_x[ P[y = +1|x] e^(−f(x)) + P[y = −1|x] e^(f(x)) ]
  where x, y random from true distribution
• over all f, minimized when
    f(x) = (1/2) · ln( P[y = +1|x] / P[y = −1|x] )
  or
    P[y = +1|x] = 1 / (1 + e^(−2 f(x)))
• so, to convert f output by AdaBoost to probability estimate,
  use same formula
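The conversion itself is a one-liner; a small illustrative sketch:

import math

def prob_positive(f_value):
    # P[y = +1 | x] estimated from the boosted score f(x).
    return 1.0 / (1.0 + math.exp(-2.0 * f_value))

print(prob_positive(0.0))    # 0.5: no evidence either way
print(prob_positive(1.5))    # close to 1
print(prob_positive(-1.5))   # close to 0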
Calibration Curve
[Plot: calibration curves for 'train' and 'test': average estimated probability vs. actual fraction of positives, with the ideal diagonal.]
• order examples by f value output by AdaBoost
• break into bins of size r
• for each bin, plot a point:
  • x-value: average estimated probability of examples in bin
  • y-value: actual fraction of positive examples in bin
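A short sketch of that binning procedure (illustrative; function and argument names are assumptions):

import numpy as np

def calibration_curve(f_scores, labels, bin_size=50):
    """Points (avg estimated P[y=+1], actual fraction of positives), one per bin.

    f_scores: boosted scores f(x);  labels: true labels in {-1, +1}."""
    order = np.argsort(f_scores)                        # order examples by f value
    probs = 1.0 / (1.0 + np.exp(-2.0 * np.asarray(f_scores)[order]))
    pos = (np.asarray(labels)[order] == 1).astype(float)
    points = []
    for start in range(0, len(probs), bin_size):        # break into bins of size r
        sl = slice(start, start + bin_size)
        points.append((probs[sl].mean(), pos[sl].mean()))
    return points

# Toy usage with synthetic scores.
rng = np.random.default_rng(2)
f = rng.normal(0, 1, 500)
y = np.where(rng.random(500) < 1 / (1 + np.exp(-2 * f)), 1, -1)
print(calibration_curve(f, y, bin_size=100)[:3])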
Other Ways to Think about AdaBoost

• dynamical systems
• statistical consistency
• maximum entropy
Experiments, Applications and Extensions

• basic experiments
• multiclass classification
• confidence-rated predictions
• text categorization /
spoken-dialogue systems
• incorporating prior knowledge
• active learning
• face detection
Practical Advantages of AdaBoost

• fast
• simple and easy to program
• no parameters to tune (except T )
• flexible — can combine with any learning algorithm
• no prior knowledge needed about weak learner
• provably effective, provided can consistently find rough rules
of thumb
→ shift in mind set — goal now is merely to find classifiers
barely better than random guessing
• versatile
• can use with data that is textual, numeric, discrete, etc.
• has been extended to learning problems well beyond
binary classification
Caveats

• performance of AdaBoost depends on data and weak learner


• consistent with theory, AdaBoost can fail if
• weak classifiers too complex
→ overfitting
• weak classifiers too weak (γt → 0 too quickly)
→ underfitting
→ low margins → overfitting
• empirically, AdaBoost seems especially susceptible to uniform
noise
UCI Experiments
[with Freund]

• tested AdaBoost on UCI benchmarks


• used:
• C4.5 (Quinlan’s decision tree algorithm)
• “decision stumps”: very simple rules of thumb that test
on single attributes

[Diagram: two example decision stumps, "eye color = brown?" (yes → predict +1, no → predict −1) and "height > 5 feet?" (yes → predict −1, no → predict +1).]
UCI Results
[Scatterplots: test error (0-30%) of boosting stumps (left) and boosting C4.5 (right) on the x-axis vs. C4.5 on the y-axis, one point per UCI benchmark.]
Multiclass Problems
[with Freund]
• say y ∈ Y = {1, . . . , k}
• direct approach (AdaBoost.M1):
    ht : X → Y
    Dt+1(i) = (Dt(i) / Zt) · { e^(−αt) if yi = ht(xi);  e^(αt) if yi ≠ ht(xi) }
    Hfinal(x) = arg max_{y∈Y} Σ_{t: ht(x)=y} αt
• can prove same bound on error if ∀t : εt ≤ 1/2
  • in practice, not usually a problem for "strong" weak learners (e.g., C4.5)
  • significant problem for "weak" weak learners (e.g., decision stumps)
• instead, reduce to binary
• instead, reduce to binary
Reducing Multiclass to Binary
[with Singer]

• say possible labels are {a, b, c, d, e}


• each training example replaced by five {−1, +1}-labeled
examples: 

 (x, a) , −1
 (x, b) , −1


x , c → (x, c) , +1
(x, d) , −1




(x, e) , −1

• predict with label receiving most (weighted) votes
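A minimal sketch of this reduction (illustrative; the label set and names are assumptions): each multiclass example expands into one ±1-labeled example per candidate label, and prediction takes the label with the largest weighted vote.

def expand_to_binary(x, true_label, labels=("a", "b", "c", "d", "e")):
    """One {-1,+1}-labeled example per candidate label."""
    return [((x, lab), +1 if lab == true_label else -1) for lab in labels]

def predict(binary_scores):
    # binary_scores: dict mapping label -> weighted vote f(x, label);
    # predict the label receiving the largest (weighted) vote.
    return max(binary_scores, key=binary_scores.get)

print(expand_to_binary("x17", "c"))
print(predict({"a": -0.4, "b": 0.1, "c": 1.3, "d": -0.2, "e": -0.9}))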


AdaBoost.MH
• can prove:
    training error(Hfinal) ≤ (k/2) · ∏t Zt
• reflects fact that small number of errors in binary predictors
  can cause overall prediction to be incorrect
• extends immediately to multi-label case
  (more than one correct label per example)
Using Output Codes
[with Allwein & Singer][Dietterich & Bakiri]
• alternative: choose "code word" for each label
         π1   π2   π3   π4
    a    −    +    −    +
    b    −    +    +    −
    c    +    −    −    +
    d    +    −    +    +
    e    −    +    −    −
• each training example mapped to one example per column
    x, c  →  { (x, π1), +1
               (x, π2), −1
               (x, π3), −1
               (x, π4), +1 }
• to classify new example x:
  • evaluate classifier on (x, π1), . . . , (x, π4)
  • choose label "most consistent" with results
Output Codes (cont.)

• training error bounds independent of # of classes


• overall prediction robust to large number of errors in binary
predictors
• but: binary problems may be harder
Ranking Problems
[with Freund, Iyer & Singer]

• other problems can also be handled by reducing to binary


• e.g.: want to learn to rank objects (say, movies) from
examples
• can reduce to multiple binary questions of form:
“is or is not object A preferred to object B?”
• now apply (binary) AdaBoost
“Hard” Predictions Can Slow Learning
• ideally, want weak classifier that says:
    h(x) = { +1 if x above L;  "don't know" else }
• problem: cannot express using "hard" predictions
  • if must predict ±1 below L, will introduce many "bad" predictions
  • need to "clean up" on later rounds
  • dramatically increases time to convergence
Confidence-rated Predictions
[with Singer]
• useful to allow weak classifiers to assign confidences to predictions
• formally, allow ht : X → R
    sign(ht(x)) = prediction
    |ht(x)| = "confidence"
• use identical update:
    Dt+1(i) = (Dt(i) / Zt) · exp(−αt yi ht(xi))
  and identical rule for combining weak classifiers
• question: how to choose αt and ht on each round
Confidence-rated Predictions (cont.)
• saw earlier:
    training error(Hfinal) ≤ ∏t Zt = (1/m) Σi exp( −yi Σt αt ht(xi) )
• therefore, on each round t, should choose αt ht to minimize:
    Zt = Σi Dt(i) exp(−αt yi ht(xi))
• in many cases (e.g., decision stumps), best confidence-rated
  weak classifier has simple form that can be found efficiently
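For example, for a two-leaf decision stump (a domain-partitioning weak classifier, with αt absorbed into the confidence values so that αt = 1), the Zt-minimizing outputs have the closed form cj = (1/2) ln(W+_j / W−_j) per leaf, giving Zt = 2 Σj √(W+_j W−_j). A small sketch under those assumptions (smoothing handled crudely):

import numpy as np

def confidence_rated_stump(x_feature, y, D, threshold):
    """Best confidence-rated outputs for a two-leaf stump on one feature.

    W+_j / W-_j are the D-weights of positive / negative examples in leaf j."""
    preds, Z = {}, 0.0
    eps = 1e-8                                   # crude smoothing for empty leaves
    for side in (False, True):
        in_leaf = (x_feature > threshold) == side
        w_pos = D[in_leaf & (y == 1)].sum()
        w_neg = D[in_leaf & (y == -1)].sum()
        preds[side] = 0.5 * np.log((w_pos + eps) / (w_neg + eps))
        Z += 2.0 * np.sqrt(w_pos * w_neg)
    return preds, Z

# Toy usage: a threshold rule with ~10% label noise, uniform D_t.
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 200)
y = np.where(x > 0.6, 1, -1)
noise = rng.random(200) < 0.1
y[noise] = -y[noise]
D = np.full(200, 1 / 200)
print(confidence_rated_stump(x, y, D, threshold=0.6))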
Confidence-rated Predictions Help a Lot
[Plot: % error vs. number of rounds (1 to 10,000), train and test curves, with and without confidence-rated predictions.]
              round first reached
  % error     conf.     no conf.    speedup
  40          268       16,938      63.2
  35          598       65,292      109.2
  30          1,888     >80,000     –
Application: Boosting for Text Categorization
[with Singer]

• weak classifiers: very simple weak classifiers that test on simple patterns, namely, (sparse) n-grams
• find parameter αt and rule ht of given form which
minimize Zt
• use efficiently implemented exhaustive search
• “How may I help you” data:
• 7844 training examples
• 1000 test examples
• categories: AreaCode, AttService, BillingCredit, CallingCard,
Collect, Competitor, DialForMe, Directory, HowToDial,
PersonToPerson, Rate, ThirdNumber, Time, TimeCharge,
Other.
Weak Classifiers / More Weak Classifiers
[Table: one weak classifier per round, each testing for a single term; the per-category output weights (columns AC, AS, BC, CC, CO, CM, DM, DI, HO, PP, RA, 3N, TI, TC, OT) were shown graphically and are not reproduced here.]
  rnd   term
  1     collect
  2     card
  3     my home
  4     person ? person
  5     code
  6     I
  7     time
  8     wrong number
  9     how
  10    call
  11    seven
  12    trying to
  13    and
  14    third
  15    to
  16    for
  17    charges
  18    dial
  19    just
Finding Outliers
examples with most weight are often outliers (mislabeled and/or
ambiguous)
• I’m trying to make a credit card call (Collect)
• hello (Rate)
• yes I’d like to make a long distance collect call
please (CallingCard)
• calling card please (Collect)
• yeah I’d like to use my calling card number (Collect)
• can I get a collect call (CallingCard)
• yes I would like to make a long distant telephone call
and have the charges billed to another number
(CallingCard DialForMe)
• yeah I can not stand it this morning I did oversea
call is so bad (BillingCredit)
• yeah special offers going on for long distance
(AttService Rate)
• mister allen please william allen (PersonToPerson)
• yes ma’am I I’m trying to make a long distance call to
a non dialable point in san miguel philippines
(AttService Other)
Application: Human-computer Spoken Dialogue
[with Rahim, Di Fabbrizio, Dutton, Gupta, Hollister & Riccardi]

• application: automatic "store front" or "help desk" for AT&T Labs' Natural Voices business
• caller can request demo, pricing information, technical
support, sales agent, etc.
• interactive dialogue
How It Works
[Diagram: dialogue loop. The caller's raw speech utterance goes to an automatic speech recognizer; the recognized text goes to the natural language understanding (NLU) module, which outputs a predicted category; the dialogue manager chooses a text response; text-to-speech speaks it back to the caller.]
• NLU's job: classify caller utterances into 24 categories
  (demo, sales rep, pricing info, yes, no, etc.)
• weak classifiers: test for presence of word or phrase
Need for Prior, Human Knowledge
[with Rochery, Rahim & Gupta]

• building NLU: standard text categorization problem


• need lots of data, but for cheap, rapid deployment, can’t wait
for it
• bootstrapping problem:
• need labeled data to deploy
• need to deploy to get labeled data
• idea: use human knowledge to compensate for insufficient
data
• modify loss function to balance fit to data against fit to
prior model
Results: AP-Titles
[Plot: % error rate vs. # training examples (100 to 10,000) for data+knowledge, knowledge only, and data only.]
Results: Helpdesk
[Plot: classification accuracy vs. # training examples (0 to 2,500) for data + knowledge, data only, and knowledge only.]
Problem: Labels are Expensive

• for spoken-dialogue task


• getting examples is cheap
• getting labels is expensive
• must be annotated by humans
• how to reduce number of labels needed?
Active Learning

• idea:
• use selective sampling to choose which examples to label
• focus on least confident examples [Lewis & Gale]
• for boosting, use (absolute) margin |f (x)| as natural
confidence measure
[Abe & Mamitsuka]
Labeling Scheme

• start with pool of unlabeled examples


• choose (say) 500 examples at random for labeling
• run boosting on all labeled examples
• get combined classifier f
• pick (say) 250 additional examples from pool for labeling
• choose examples with minimum |f (x)|
• repeat
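A sketch of that labeling loop (illustrative; train_boosting and the data pool stand in for whatever learner and corpus are actually used):

import numpy as np

def active_learning(pool_x, pool_y, train_boosting, seed_size=500, batch=250, rounds=4):
    """Selective sampling: label the least-confident examples, confidence = |f(x)|."""
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(pool_x), size=seed_size, replace=False))
    unlabeled = [i for i in range(len(pool_x)) if i not in set(labeled)]
    for _ in range(rounds):
        f = train_boosting(pool_x[labeled], pool_y[labeled])   # combined classifier f
        conf = np.abs(f(pool_x[unlabeled]))                    # |f(x)| on the pool
        chosen = np.argsort(conf)[:batch]                      # minimum-|f| examples
        newly = [unlabeled[i] for i in chosen]
        labeled.extend(newly)
        unlabeled = [i for i in unlabeled if i not in set(newly)]
    return labeled

# Toy usage with a stand-in learner (a least-squares scorer, purely illustrative).
def dummy_trainer(X, y):
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return lambda Xq: Xq @ w

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=5000) > 0, 1, -1)
print(len(active_learning(X, y, dummy_trainer, seed_size=200, batch=100, rounds=3)))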
Results: How-May-I-Help-You?
[Plot: % error rate vs. # labeled examples for random vs. active sampling.]
              first reached
  % error     random     active     % label savings
  28          11,000     5,500      50
  26          22,000     9,500      57
  25          40,000     13,000     68
Results: Letter
[Plot: % error rate vs. # labeled examples for random vs. active sampling.]
              first reached
  % error     random     active     % label savings
  10          3,500      1,500      57
  5           9,000      2,750      69
  4           13,000     3,500      73
Application: Detecting Faces
[Viola & Jones]

• problem: find faces in photograph or movie


• weak classifiers: detect light/dark rectangles in image

• many clever tricks to make extremely fast and accurate


Conclusions

• boosting is a practical tool for classification and other learning problems
• grounded in rich theory
• performs well experimentally
• often (but not always!) resistant to overfitting
• many applications and extensions
• many ways to think about boosting
• none is entirely satisfactory by itself,
but each useful in its own way
• considerable room for further theoretical and
experimental work
References

• Ron Meir and Gunnar Rätsch.


An Introduction to Boosting and Leveraging.
In Advanced Lectures on Machine Learning (LNAI2600), 2003.
http://www.boosting.org/papers/MeiRae03.pdf
• Robert E. Schapire.
The boosting approach to machine learning: An overview.
In MSRI Workshop on Nonlinear Estimation and Classification, 2002.
http://www.cs.princeton.edu/∼schapire/boost.html

The Boosting Approach to Machine Learning


An Overview
Robert E. Schapire
AT&T Labs Research
Shannon Laboratory
180 Park Avenue, Room A203
Florham Park, NJ 07932 USA
www.research.att.com/∼schapire


December 19, 2001

Abstract
Boosting is a general method for improving the accuracy of any given
learning algorithm. Focusing primarily on the AdaBoost algorithm, this
chapter overviews some of the recent work on boosting including analyses
of AdaBoost’s training error and generalization error; boosting’s connection
to game theory and linear programming; the relationship between boosting
and logistic regression; extensions of AdaBoost for multiclass classification
problems; methods of incorporating human knowledge into boosting; and
experimental and applied work using boosting.

1 Introduction
Machine learning studies automatic techniques for learning to make accurate pre-
dictions based on past observations. For example, suppose that we would like to
build an email filter that can distinguish spam (junk) email from non-spam. The
machine-learning approach to this problem would be the following: Start by gath-
ering as many examples as possible of both spam and non-spam emails. Next, feed
these examples, together with labels indicating if they are spam or not, to your
favorite machine-learning algorithm which will automatically produce a classifi-
cation or prediction rule. Given a new, unlabeled email, such a rule attempts to
predict if it is spam or not. The goal, of course, is to generate a rule that makes the
most accurate predictions possible on new test examples.

Building a highly accurate prediction rule is certainly a difficult task. On the
other hand, it is not hard at all to come up with very rough rules of thumb that
are only moderately accurate. An example of such a rule is something like the
following: “If the phrase ‘buy now’ occurs in the email, then predict it is spam.”
Such a rule will not even come close to covering all spam messages; for instance,
it really says nothing about what to predict if ‘buy now’ does not occur in the
message. On the other hand, this rule will make predictions that are significantly
better than random guessing.
Boosting, the machine-learning method that is the subject of this chapter, is
based on the observation that finding many rough rules of thumb can be a lot easier
than finding a single, highly accurate prediction rule. To apply the boosting ap-
proach, we start with a method or algorithm for finding the rough rules of thumb.
The boosting algorithm calls this “weak” or “base” learning algorithm repeatedly,
each time feeding it a different subset of the training examples (or, to be more pre-
cise, a different distribution or weighting over the training examples 1 ). Each time
it is called, the base learning algorithm generates a new weak prediction rule, and
after many rounds, the boosting algorithm must combine these weak rules into a
single prediction rule that, hopefully, will be much more accurate than any one of
the weak rules.
To make this approach work, there are two fundamental questions that must be
answered: first, how should each distribution be chosen on each round, and second,
how should the weak rules be combined into a single rule? Regarding the choice
of distribution, the technique that we advocate is to place the most weight on the
examples most often misclassified by the preceding weak rules; this has the effect
of forcing the base learner to focus its attention on the “hardest” examples. As
for combining the weak rules, simply taking a (weighted) majority vote of their
predictions is natural and effective.
There is also the question of what to use for the base learning algorithm, but
this question we purposely leave unanswered so that we will end up with a general
boosting procedure that can be combined with any base learning algorithm.
Boosting refers to a general and provably effective method of producing a very
accurate prediction rule by combining rough and moderately inaccurate rules of
thumb in a manner similar to that suggested above. This chapter presents an
overview of some of the recent work on boosting, focusing especially on the Ada-
Boost algorithm which has undergone intense theoretical study and empirical test-
ing.
1
A distribution over training examples can be used to generate a subset of the training examples
simply by sampling repeatedly from the distribution.

Given: (x1, y1), . . . , (xm, ym) where xi ∈ X, yi ∈ {−1, +1}
Initialize D1(i) = 1/m.
For t = 1, . . . , T:

  • Train base learner using distribution Dt.
  • Get base classifier ht : X → R.
  • Choose αt ∈ R.
  • Update:
        Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt
    where Zt is a normalization factor (chosen so that Dt+1 will be a distribution).

Output the final classifier:

    H(x) = sign( Σ_{t=1}^{T} αt ht(x) ).

Figure 1: The boosting algorithm AdaBoost.

2 AdaBoost
Working in Valiant’s PAC (probably approximately correct) learning model [75],
Kearns and Valiant [41, 42] were the first to pose the question of whether a “weak”
learning algorithm that performs just slightly better than random guessing can be
“boosted” into an arbitrarily accurate “strong” learning algorithm. Schapire [66]
came up with the first provable polynomial-time boosting algorithm in 1989. A
year later, Freund [26] developed a much more efficient boosting algorithm which,
although optimal in a certain sense, nevertheless suffered like Schapire’s algorithm
from certain practical drawbacks. The first experiments with these early boosting
algorithms were carried out by Drucker, Schapire and Simard [22] on an OCR task.
The AdaBoost algorithm, introduced in 1995 by Freund and Schapire [32],
solved many of the practical difficulties of the earlier boosting algorithms, and is
the focus of this paper. Pseudocode for AdaBoost is given in Fig. 1 in the slightly
generalized form given by Schapire and Singer [70].
The algorithm takes as input a training set (x1, y1), . . . , (xm, ym) where each xi belongs
to some domain or instance space X, and each label yi is in some label set Y. For most of
this paper, we assume Y = {−1, +1}; in Section 7, we discuss extensions to the multiclass
case. AdaBoost calls a given weak or base learning algorithm repeatedly in a series
of rounds t = 1, . . . , T. One of the main ideas of the algorithm is to maintain a
distribution or set of weights over the training set. The weight of this distribution on
training example i on round t is denoted Dt(i). Initially, all weights are set equally,
but on each round, the weights of incorrectly classified examples are increased so
that the base learner is forced to focus on the hard examples in the training set.
The base learner's job is to find a base classifier ht appropriate
for the distribution Dt. (Base classifiers were also called rules of thumb or weak
prediction rules in Section 1.) In the simplest case, the range of each ht is binary,
i.e., restricted to {−1, +1}; the base learner's job then is to minimize the error

    εt = Pr_{i∼Dt}[ ht(xi) ≠ yi ].

Once the base classifier ht has been received, AdaBoost chooses a parameter
αt ∈ R that intuitively measures the importance that it assigns to ht. In the figure,
we have deliberately left the choice of αt unspecified. For binary ht, we typically
set

    αt = (1/2) ln( (1 − εt) / εt )   (1)

as in the original description of AdaBoost given by Freund and Schapire [32]. More
on choosing αt follows in Section 3. The distribution Dt is then updated using the
rule shown in the figure. The final or combined classifier H is a weighted majority
vote of the base classifiers where αt is the weight assigned to ht.
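As a concrete illustration (not part of the original paper), the following minimal Python sketch implements the algorithm of Fig. 1 with exhaustive decision stumps as the base learner and the choice of αt from Eq. (1); the synthetic dataset is an assumption made for the demo:

import numpy as np

def stump_learner(X, y, D):
    """Base learner: exhaustive search for the decision stump with minimum
    weighted error under the distribution D; returns a predict function."""
    m, n = X.shape
    best = (0, 0.0, 1, np.inf)                 # (feature, threshold, sign, error)
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = D[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, sign, err)
    j, thr, sign, _ = best
    return lambda Xq: np.where(Xq[:, j] > thr, sign, -sign)

def adaboost(X, y, T=20):
    """AdaBoost as in Fig. 1, with alpha_t = (1/2) ln((1 - eps_t)/eps_t) (Eq. (1))."""
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    hs, alphas = [], []
    for _ in range(T):
        h = stump_learner(X, y, D)             # train base learner using D_t
        pred = h(X)
        eps = np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10)   # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)      # reweight ...
        D /= D.sum()                           # ... and renormalize (divide by Z_t)
        hs.append(h)
        alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hs)))

# Toy usage on a synthetic, linearly separable dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)
H = adaboost(X, y, T=40)
print("training error:", np.mean(H(X) != y))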

3 Analyzing the training error


The most basic theoretical property of AdaBoost concerns its ability to reduce
the training error, i.e., the fraction of mistakes on the training set. Specifically,
Schapire and Singer [70], in generalizing a theorem of Freund and Schapire [32],
show that the training error of the final classifier is bounded as follows:


1
$
,
9
L 

O )
 1
$
X

CFEHGI-" 
 :

\  K
6 (2)
6

where henceforth we define



:Z
: X
? 6 8 6
Z

(3)

!
6

L M
 NQPSR#T :M


" 
"
" "
so that  . (For simplicity of notation, we write and 6 as
  

#%$'&)(*,+.-/(1032
shorthand for $ Y and   
W6AY ,
 

respectively.) The inequality follows from the fact


L
that if . The equality can be proved straightforwardly by
unraveling the recursive definition of + 6 .

4
Eq. (2) suggests that the training error can be reduced most rapidly (in a greedy
way) by choosing αt and ht on each round to minimize

    Zt = Σi Dt(i) exp(−αt yi ht(xi)).   (4)

In the case of binary classifiers, this leads to the choice of αt given in Eq. (1) and
gives a bound on the training error of

    ∏t Zt = ∏t [ 2 √(εt (1 − εt)) ] = ∏t √(1 − 4 γt²) ≤ exp( −2 Σt γt² )   (5)

where we define γt = 1/2 − εt. This bound was first proved by Freund and
Schapire [32]. Thus, if each base classifier is slightly better than random so that
γt ≥ γ for some γ > 0, then the training error drops exponentially fast in T since
the bound in Eq. (5) is at most exp(−2γ²T). This bound, combined with the bounds
on generalization error given below prove that AdaBoost is indeed a boosting al-
gorithm in the sense that it can efficiently convert a true weak learning algorithm
(that can always generate a classifier with a weak edge for any distribution) into
a strong learning algorithm (that can generate a classifier with an arbitrarily low
error rate, given sufficient data).

Eq. (2) points to the fact that, at heart, AdaBoost is a procedure for finding a
linear combination of base classifiers which attempts to minimize

    Σi exp(−yi f(xi)) = Σi exp( −yi Σt αt ht(xi) ).   (6)

Essentially, on each round, AdaBoost chooses ht (by calling the base learner) and
then sets αt to add one more term to the accumulating weighted sum of base classi-
fiers in such a way that the sum of exponentials above will be maximally reduced.
In other words, AdaBoost is doing a kind of steepest descent search to minimize
Eq. (6) where the search is constrained at each step to follow coordinate direc-
tions (where we identify coordinates with the weights assigned to base classifiers).
This view of boosting and its generalization are examined in considerable detail
by Duffy and Helmbold [23], Mason et al. [51, 52] and Friedman [35]. See also
Section 6.
Schapire and Singer [70] discuss the choice of αt and ht in the case that ht
is real-valued (rather than binary). In this case, ht(x) can be interpreted as a
"confidence-rated prediction" in which the sign of ht(x) is the predicted label,
while the magnitude |ht(x)| gives a measure of confidence. Here, Schapire and
Singer advocate choosing αt and ht so as to minimize Zt (Eq. (4)) on each round.
4 Generalization error
In studying and designing learning algorithms, we are of course interested in per-
formance on examples not seen during training, i.e., in the generalization error, the
topic of this section. Unlike Section 3 where the training examples were arbitrary,
here we assume that all examples (both train and test) are generated i.i.d. from
some unknown distribution on X × Y. The generalization error is the probability
of misclassifying a new example, while the test error is the fraction of mistakes on
a newly sampled test set (thus, generalization error is expected test error). Also,
for simplicity, we restrict our attention to binary base classifiers.
Freund and Schapire [32] showed how to bound the generalization error of the
final classifier in terms of its training error, the size m of the sample, the VC-
dimension² d of the base classifier space and the number of rounds T of boosting.
Specifically, they used techniques from Baum and Haussler [5] to show that the
generalization error, with high probability, is at most

    P̂r[ H(x) ≠ y ] + Õ( √(Td / m) )

where P̂r[·] denotes empirical probability on the training sample.³ This bound sug-
gests that boosting will overfit if run for too many rounds, i.e., as T becomes large.
In fact, this sometimes does happen. However, in early experiments, several au-
thors [8, 21, 59] observed empirically that boosting often does not overfit, even
when run for thousands of rounds. Moreover, it was observed that AdaBoost would
sometimes continue to drive down the generalization error long after the training
error had reached zero, clearly contradicting the spirit of the bound above. For
instance, the left side of Fig. 2 shows the training and test curves of running boost-
ing on top of Quinlan’s C4.5 decision-tree learning algorithm [60] on the “letter”
dataset.
In response to these empirical findings, Schapire et al. [69], following the work
of Bartlett [3], gave an alternative analysis in terms of the margins of the training
examples. The margin of example (x, y) is defined to be

    margin(x, y) = y f(x) / Σt αt = ( y Σt αt ht(x) ) / Σt αt.

²The Vapnik-Chervonenkis (VC) dimension is a standard measure of the "complexity" of a space
of binary functions. See, for instance, refs. [6, 76] for its definition and relation to learning theory.
³The "soft-Oh" notation Õ(·), here used rather informally, is meant to hide all logarithmic and
constant factors (in the same way that standard "big-Oh" notation hides only constant factors).
Figure 2: Error curves and the margin distribution graph for boosting C4.5 on
the letter dataset as reported by Schapire et al. [69]. Left: the training and test
error curves (lower and upper curves, respectively) of the combined classifier as
a function of the number of rounds of boosting. The horizontal lines indicate the
test error rate of the base classifier as well as the test error of the final combined
classifier. Right: The cumulative distribution of margins of the training examples
after 5, 100 and 1000 iterations, indicated by short-dashed, long-dashed (mostly
hidden) and solid curves, respectively.

It is a number in [−1, +1] and is positive if and only if H correctly classifies the
example. Moreover, as before, the magnitude of the margin can be interpreted as a
measure of confidence in the prediction. Schapire et al. proved that larger margins
on the training set translate into a superior upper bound on the generalization error.
Specifically, the generalization error is at most

    P̂r[ margin(x, y) ≤ θ ] + Õ( √( d / (m θ²) ) )

for any θ > 0 with high probability. Note that this bound is entirely independent
of T, the number of rounds of boosting. In addition, Schapire et al. proved that
boosting is particularly aggressive at reducing the margin (in a quantifiable sense)
since it concentrates on the examples with the smallest margins (whether positive
or negative). Boosting’s effect on the margins can be seen empirically, for instance,
on the right side of Fig. 2 which shows the cumulative distribution of margins of the
training examples on the “letter” dataset. In this case, even after the training error
reaches zero, boosting continues to increase the margins of the training examples
effecting a corresponding drop in the test error.
Although the margins theory gives a qualitative explanation of the effectiveness
of boosting, quantitatively, the bounds are rather weak. Breiman [9], for instance,

shows empirically that one classifier can have a margin distribution that is uni-
formly better than that of another classifier, and yet be inferior in test accuracy. On
the other hand, Koltchinskii, Panchenko and Lozano [44, 45, 46, 58] have recently
proved new margin-theoretic bounds that are tight enough to give useful quantita-
tive predictions.
Attempts (not always successful) to use the insights gleaned from the theory
of margins have been made by several authors [9, 37, 50]. In addition, the margin
theory points to a strong connection between boosting and the support-vector ma-
chines of Vapnik and others [7, 14, 77] which explicitly attempt to maximize the
minimum margin.

5 A connection to game theory and linear programming


The behavior of AdaBoost can also be understood in a game-theoretic setting as
explored by Freund and Schapire [31, 33] (see also Grove and Schuurmans [37]
and Breiman [9]). In classical game theory, it is possible to put any two-person,
zero-sum game in the form of a matrix M. To play the game, one player chooses a
row i and the other player chooses a column j. The loss to the row player (which
is the same as the payoff to the column player) is M(i, j). More generally, the two
sides may play randomly, choosing distributions P and Q over rows or columns,
respectively. The expected loss then is PᵀMQ.
Boosting can be viewed as repeated play of a particular game matrix. Assume
that the base classifiers are binary, and let {h1, . . . , hN} be the entire base
classifier space (which we assume for now to be finite). For a fixed training set
(x1, y1), . . . , (xm, ym), the game matrix M has m rows and N columns where

    M(i, j) = 1 if hj(xi) = yi, and 0 otherwise.

The row player now is the boosting algorithm, and the column player is the base
learner. The boosting algorithm's choice of a distribution Dt over training exam-
ples becomes a distribution P over rows of M, while the base learner's choice of a
base classifier ht becomes the choice of a column j of M.
As an example of the connection between boosting and game theory, consider
von Neumann's famous minmax theorem which states that

    min_P max_Q PᵀMQ = max_Q min_P PᵀMQ

for any matrix M. When applied to the matrix just defined and reinterpreted in
the boosting setting, this can be shown to have the following meaning: If, for any
distribution over examples, there exists a base classifier with error at most 1/2 − γ,
then there exists a convex combination of base classifiers with a margin of at least
2γ on all training examples. AdaBoost seeks to find such a final classifier with
minmax theorem tells us that AdaBoost at least has the potential for success since,
given a “good” base learner, there must exist a good combination of base classi-
fiers. Going much further, AdaBoost can be shown to be a special case of a more
general algorithm for playing repeated games, or for approximately solving matrix
games. This shows that, asymptotically, the distribution over training examples as
well as the weights over base classifiers in the final classifier have game-theoretic
intepretations as approximate minmax or maxmin strategies.
The problem of solving (finding optimal strategies for) a zero-sum game is
well known to be solvable using linear programming. Thus, this formulation of the
boosting problem as a game also connects boosting to linear, and more generally
convex, programming. This connection has led to new algorithms and insights as
explored by Rätsch et al. [62], Grove and Schuurmans [37] and Demiriz, Bennett
and Shawe-Taylor [17].
In another direction, Schapire [68] describes and analyzes the generalization
of both AdaBoost and Freund’s earlier “boost-by-majority” algorithm [26] to a
broader family of repeated games called “drifting games.”

6 Boosting and logistic regression


 
Classification generally is the problem of predicting the label y of an example x
with the intention of minimizing the probability of an incorrect prediction. How-
ever, it is often useful to estimate the probability of a particular label. Friedman,
Hastie and Tibshirani [34] suggested a method for using the output of AdaBoost to
make reasonable estimates of such probabilities. Specifically, they suggested using
a logistic function, and estimating

    Pr[ y = +1 | x ] = e^(f(x)) / ( e^(f(x)) + e^(−f(x)) ) = 1 / (1 + e^(−2 f(x)))   (7)

where, as usual, f(x) is the weighted average of base classifiers produced by Ada-
Boost (Eq. (3)). The rationale for this choice is the close connection between the
log loss (negative log likelihood) of such a model, namely,

    Σi ln( 1 + e^(−2 yi f(xi)) )   (8)

and the function that, we have already noted, AdaBoost attempts to minimize:

    Σi e^(−yi f(xi))   (9)

Specifically, it can be verified that Eq. (8) is upper bounded by Eq. (9). In addition,
if we add the constant 1 − ln 2 to Eq. (8) (which does not affect its minimization),
then it can be verified that the resulting function and the one in Eq. (9) have iden-
tical Taylor expansions around zero up to second order; thus, their behavior near
zero is very similar. Finally, it can be shown that, for any distribution over pairs
(x, y), the expectations

    E[ ln( 1 + e^(−2 y f(x)) ) ]   and   E[ e^(−y f(x)) ]

are minimized by the same (unconstrained) function f, namely,

    f(x) = (1/2) ln( Pr[ y = +1 | x ] / Pr[ y = −1 | x ] ).
Thus, for all these reasons, minimizing Eq. (9), as is done by AdaBoost, can be
viewed as a method of approximately minimizing the negative log likelihood given
in Eq. (8). Therefore, we may expect Eq. (7) to give a reasonable probability
estimate.
Of course, as Friedman, Hastie and Tibshirani point out, rather than minimiz-
ing the exponential loss in Eq. (6), we could attempt instead to directly minimize
the logistic loss in Eq. (8). To this end, they propose their LogitBoost algorithm.
A different, more direct modification of AdaBoost for logistic loss was proposed
by Collins, Schapire and Singer [13]. Following up on work by Kivinen and War-
muth [43] and Lafferty [47], they derive this algorithm using a unification of logis-
tic regression and boosting based on Bregman distances. This work further con-
nects boosting to the maximum-entropy literature, particularly the iterative-scaling
family of algorithms [15, 16]. They also give unified proofs of convergence to
optimality for a family of new and old algorithms, including AdaBoost, for both
the exponential loss used by AdaBoost and the logistic loss used for logistic re-
gression. See also the later work of Lebanon and Lafferty [48] who showed that
logistic regression and boosting are in fact solving the same constrained optimiza-
tion problem, except that in boosting, certain normalization constraints have been
dropped.
For logistic regression, we attempt to minimize the loss function
    Σi ln( 1 + e^(−yi f(xi)) )   (10)
which is the same as in Eq. (8) except for an inconsequential change of constants
in the exponent. The modification of AdaBoost proposed by Collins, Schapire and
Singer to handle this loss function is particularly simple.
In AdaBoost, unraveling
the definition of Dt given in Fig. 1 shows that Dt(i) is proportional (i.e., equal up
to normalization) to

    exp( −yi ft−1(xi) )

where we define

    ft(x) = Σ_{t'=1}^{t} αt' ht'(x).

To minimize the loss function in Eq. (10), the only necessary modification is to
redefine Dt(i) to be proportional to

    1 / ( 1 + exp( yi ft−1(xi) ) ).
A very similar algorithm is described by Duffy and Helmbold [23]. Note that in
each case, the weight on the examples, viewed as a vector, is proportional to the
negative gradient of the respective loss function. This is because both algorithms
are doing a kind of functional gradient descent, an observation that is spelled out
and exploited by Breiman [9], Duffy and Helmbold [23], Mason et al. [51, 52] and
Friedman [35].
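This correspondence is easy to verify numerically: the AdaBoost weights exp(−yi f(xi)) and the logistic-loss weights 1/(1 + exp(yi f(xi))) are, in absolute value, the per-example gradients of the two losses with respect to f(xi). An illustrative finite-difference check (the values of f and y are made up):

import numpy as np

f = np.array([0.5, -1.0, 2.0])       # current scores f(x_i) (made-up values)
y = np.array([1.0, -1.0, -1.0])

w_exp = np.exp(-y * f)               # AdaBoost-style weights
w_log = 1.0 / (1.0 + np.exp(y * f))  # logistic-loss weights

exp_loss = lambda g: np.sum(np.exp(-y * g))
log_loss = lambda g: np.sum(np.log1p(np.exp(-y * g)))

def grad_check(loss, i, h=1e-6):
    # central finite difference of the loss with respect to f(x_i)
    fp, fm = f.copy(), f.copy()
    fp[i] += h
    fm[i] -= h
    return (loss(fp) - loss(fm)) / (2 * h)

print([abs(grad_check(exp_loss, i)) for i in range(3)], w_exp)
print([abs(grad_check(log_loss, i)) for i in range(3)], w_log)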
Besides logistic regression, there have been a number of approaches  taken to
apply boosting to more general regression problems in which the labels are real
numbers and the goal is to produce real-valued predictions that are close to these la-
bels. Some of these, such as those of Ridgeway [63] and Freund and Schapire [32],
attempt to reduce the regression problem to a classification problem. Others, such
as those of Friedman [35] and Duffy and Helmbold [24] use the functional gradient
descent view of boosting to derive algorithms that directly minimize a loss func-
tion appropriate for regression. Another boosting-based approach to regression
was proposed by Drucker [20].

7 Multiclass classification
There are several methods of extending AdaBoost to the multiclass case. The most
straightforward generalization [32], called AdaBoost.M1, is adequate when the
base learner is strong enough to achieve reasonably high accuracy, even on the
hard distributions created by AdaBoost. However, this method fails if the base
learner cannot achieve at least 50% accuracy when run on these hard distributions.

For the latter case, several more sophisticated methods have been developed.
These generally work by reducing the multiclass problem to a larger binary prob-
lem. Schapire and Singer's [70] algorithm AdaBoost.MH works by creating a set
of binary problems, for each example x and each possible label y, of the form:
"For example x, is the correct label y or is it one of the other labels?" Freund
and Schapire's [32] algorithm AdaBoost.M2 (which is a special case of Schapire
and Singer's [70] AdaBoost.MR algorithm) instead creates binary problems, for
each example x with correct label y and each incorrect label y′, of the form: "For
example x, is the correct label y or y′?"
example , is the correct label or ?”
These methods require additional effort in the design of the base learning algo-
rithm. A different technique [67], which incorporates Dietterich and Bakiri’s [19]
method of error-correcting output codes, achieves similar provable bounds to those
of AdaBoost.MH and AdaBoost.M2, but can be used with any base learner that
can handle simple, binary labeled data. Schapire and Singer [70] and Allwein,
Schapire and Singer [2] give yet another method of combining boosting with error-
correcting output codes.

8 Incorporating human knowledge


Boosting, like many machine-learning methods, is entirely data-driven in the sense
that the classifier it generates is derived exclusively from the evidence present in
the training data itself. When data is abundant, this approach makes sense. How-
ever, in some applications, data may be severely limited, but there may be human
knowledge that, in principle, might compensate for the lack of data.
In its standard form, boosting does not allow for the direct incorporation of such
prior knowledge. Nevertheless, Rochery et al. [64, 65] describe a modification of
boosting that combines and balances human expertise with available training data.
The aim of the approach is to allow the human’s rough judgments to be refined,
reinforced and adjusted by the statistics of the training data, but in a manner that
does not permit the data to entirely overwhelm human judgments.
The first step in this approach is for a human expert to construct by hand a
rule p mapping each instance x to an estimated probability p(x) ∈ [0, 1] that is
interpreted as the guessed probability that instance x will appear with label +1.
There are various methods for constructing such a function p, and the hope is that
this difficult-to-build function need not be highly accurate for the approach to be
effective.
Rochery et al.’s basic idea is to replace the logistic loss function in Eq. (10)

with one that incorporates prior knowledge, namely,

    Σi ln( 1 + e^(−yi f(xi)) )  +  η Σi RE( p(xi) ‖ 1 / (1 + e^(−f(xi))) )

where

    RE(p ‖ q) = p ln(p/q) + (1 − p) ln((1 − p)/(1 − q))

is binary relative
entropy. The first term is the same as that in Eq. (10). The second term gives a
measure of the distance from the model built by boosting to the human's model.
Thus, we balance the conditional likelihood of the data against the distance from
our model to the human's model. The relative importance of the two terms is
controlled by the parameter η.
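A small illustrative sketch of this combined objective (the function and parameter names, and the use of the logistic link 1/(1 + e^(−f)) for the model's probability, are assumptions consistent with the description above):

import numpy as np

def knowledge_augmented_loss(f_vals, y, p_prior, eta):
    """Logistic loss plus eta times binary relative entropy between the
    hand-built prior p_prior(x) and the model's 1/(1 + exp(-f(x)))."""
    f_vals, y, p = (np.asarray(a, dtype=float) for a in (f_vals, y, p_prior))
    data_term = np.sum(np.log1p(np.exp(-y * f_vals)))
    q = 1.0 / (1.0 + np.exp(-f_vals))            # model's estimate of P[y = +1 | x]
    tiny = 1e-12                                 # numerical guard inside the logs
    re = p * np.log((p + tiny) / (q + tiny)) + (1 - p) * np.log((1 - p + tiny) / (1 - q + tiny))
    return data_term + eta * np.sum(re)

# Toy usage: three examples, a rough prior guess, and a candidate f.
print(knowledge_augmented_loss([0.3, -1.2, 2.0], [1, -1, 1], [0.7, 0.2, 0.9], eta=0.5))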

9 Experiments and applications


Practically, AdaBoost has many advantages. It is fast, simple and easy to pro-
gram. It has no parameters to tune (except for the number of rounds T). It requires
no prior knowledge about the base learner and so can be flexibly combined with
any method for finding base classifiers. Finally, it comes with a set of theoretical
guarantees given sufficient data and a base learner that can reliably provide only
moderately accurate base classifiers. This is a shift in mind set for the learning-
system designer: instead of trying to design a learning algorithm that is accurate
over the entire space, we can instead focus on finding base learning algorithms that
only need to be better than random.
On the other hand, some caveats are certainly in order. The actual performance
of boosting on a particular problem is clearly dependent on the data and the base
learner. Consistent with theory, boosting can fail to perform well given insufficient
data, overly complex base classifiers or base classifiers that are too weak. Boosting
seems to be especially susceptible to noise [18] (more on this below).
AdaBoost has been tested empirically by many researchers, including [4, 18,
21, 40, 49, 59, 73]. For instance, Freund and Schapire [30] tested AdaBoost on a
set of UCI benchmark datasets [54] using C4.5 [60] as a base learning algorithm,
as well as an algorithm that finds the best “decision stump” or single-test decision
tree. Some of the results of these experiments are shown in Fig. 3. As can be seen
from this figure, even boosting the weak decision stumps can usually give as good
results as C4.5, while boosting C4.5 generally gives the decision-tree algorithm a
significant improvement in performance.
In another set of experiments, Schapire and Singer [71] used boosting for text
categorization tasks. For this work, base classifiers were used that test on the pres-
ence or absence of a word or phrase. Some results of these experiments comparing

Figure 3: Comparison of C4.5 versus boosting stumps and boosting C4.5 on a set
of 27 benchmark problems as reported by Freund and Schapire [30]. Each point
in each scatterplot shows the test error rate of the two competing algorithms on
a single benchmark. The y-coordinate of each point gives the test error rate (in
percent) of C4.5 on the given benchmark, and the x-coordinate gives the error rate
of boosting stumps (left plot) or boosting C4.5 (right plot). All error rates have
been averaged over multiple runs.

AdaBoost to four other methods are shown in Fig. 4. In nearly all of these ex-
periments and for all of the performance measures tested, boosting performed as
well or significantly better than the other methods tested. As shown in Fig. 5, these
experiments also demonstrated the effectiveness of using confidence-rated predic-
tions [70], mentioned in Section 3 as a means of speeding up boosting.
Boosting has also been applied to text filtering [72] and routing [39], “ranking”
problems [28], learning problems arising in natural language processing [1, 12, 25,
38, 55, 78], image retrieval [74], medical diagnosis [53], and customer monitoring
and segmentation [56, 57].
Rochery et al.’s [64, 65] method of incorporating human knowledge into boost-
ing, described in Section 8, was applied to two speech categorization tasks. In this
case, the prior knowledge took the form of a set of hand-built rules mapping key-
words to predicted categories. The results are shown in Fig. 6.
The final classifier produced by AdaBoost when used, for instance, with a
decision-tree base learning algorithm, can be extremely complex and difficult to
comprehend. With greater care, a more human-understandable final classifier can
be obtained using boosting. Cohen and Singer [11] showed how to design a base

Figure 4: Comparison of error rates for AdaBoost and four other text categoriza-
tion methods (naive Bayes, probabilistic TF-IDF, Rocchio and sleeping experts)
as reported by Schapire and Singer [71]. The algorithms were tested on two text
corpora — Reuters newswire articles (left) and AP newswire headlines (right) —
and with varying numbers of class labels as indicated on the x-axis of each figure.

learning algorithm that, when combined with AdaBoost, results in a final classifier
consisting of a relatively small set of rules similar to those generated by systems
like RIPPER [10], IREP [36] and C4.5rules [60]. Cohen and Singer’s system,
called SLIPPER, is fast, accurate and produces quite compact rule sets. In other
work, Freund and Mason [29] showed how to apply boosting to learn a generaliza-
tion of decision trees called “alternating trees.” Their algorithm produces a single
alternating tree rather than an ensemble of trees as would be obtained by running
AdaBoost on top of a decision-tree learning algorithm. On the other hand, their
learning algorithm achieves error rates comparable to those of a whole ensemble
of trees.
A nice property of AdaBoost is its ability to identify outliers, i.e., examples
that are either mislabeled in the training data, or that are inherently ambiguous and
hard to categorize. Because AdaBoost focuses its weight on the hardest examples,
the examples with the highest weight often turn out to be outliers. An example of
this phenomenon can be seen in Fig. 7 taken from an OCR experiment conducted
by Freund and Schapire [30].
When the number of outliers is very large, the emphasis placed on the hard ex-
amples can become detrimental to the performance of AdaBoost. This was demon-
strated very convincingly by Dietterich [18]. Friedman, Hastie and Tibshirani [34]
suggested a variant of AdaBoost, called “Gentle AdaBoost” that puts less emphasis
on outliers. Rätsch, Onoda and Müller [61] show how to regularize AdaBoost to
handle noisy data. Freund [27] suggested another algorithm, called “BrownBoost,”
that takes a more radical approach that de-emphasizes outliers when it seems clear
that they are “too hard” to classify correctly. This algorithm, which is an adaptive


Figure 5: Comparison of the training (left) and test (right) error using three boost-
ing methods on a six-class text classification problem from the TREC-AP collec-
tion, as reported by Schapire and Singer [70, 71]. Discrete AdaBoost.MH and
discrete AdaBoost.MR are multiclass versions of AdaBoost that require binary
({−1, +1}-valued) base classifiers, while real AdaBoost.MH is a multiclass version
that uses “confidence-rated” (i.e., real-valued) base classifiers.

version of Freund’s [26] “boost-by-majority” algorithm, demonstrates an intriguing
connection between boosting and Brownian motion.

10 Conclusion
In this overview, we have seen that there have emerged a great many views or
interpretations of AdaBoost. First and foremost, AdaBoost is a genuine boosting
algorithm: given access to a true weak learning algorithm that always performs a
little bit better than random guessing on every distribution over the training set, we
can prove arbitrarily good bounds on the training error and generalization error of
AdaBoost.
Besides this original view, AdaBoost has been interpreted as a procedure based
on functional gradient descent, as an approximation of logistic regression and as
a repeated-game playing algorithm. AdaBoost has also been shown to be re-
lated to many other topics, such as game theory and linear programming, Breg-
man distances, support-vector machines, Brownian motion, logistic regression and
maximum-entropy methods such as iterative scaling.
All of these connections and interpretations have greatly enhanced our under-
standing of boosting and contributed to its extension in ever more practical di-
rections, such as to logistic regression and other loss-minimization problems, to
multiclass problems, to incorporate regularization and to allow the integration of
prior background knowledge.

Figure 6: Comparison of percent classification accuracy on two spoken language tasks
(“How may I help you” on the left and “Help desk” on the right) as a function of the
number of training examples using data and knowledge separately or together, as reported
by Rochery et al. [64, 65].

We have also discussed a few of the growing number of applications of AdaBoost to
practical machine learning problems, such as text and speech categorization.

References
[1] Steven Abney, Robert E. Schapire, and Yoram Singer. Boosting applied to tagging
and PP attachment. In Proceedings of the Joint SIGDAT Conference on Empirical
Methods in Natural Language Processing and Very Large Corpora, 1999.
[2] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to
binary: A unifying approach for margin classifiers. Journal of Machine Learning
Research, 1:113–141, 2000.
[3] Peter L. Bartlett. The sample complexity of pattern classification with neural net-
works: the size of the weights is more important than the size of the network. IEEE
Transactions on Information Theory, 44(2):525–536, March 1998.
[4] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algo-
rithms: Bagging, boosting, and variants. Machine Learning, 36(1/2):105–139, 1999.
[5] Eric B. Baum and David Haussler. What size net gives valid generalization? Neural
Computation, 1(1):151–160, 1989.
[6] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth.
Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for
Computing Machinery, 36(4):929–965, October 1989.

Figure 7: A sample of the examples that have the largest weight on an OCR task as
reported by Freund and Schapire [30]. These examples were chosen after 4 rounds of
boosting (top line), 12 rounds (middle) and 25 rounds (bottom). Underneath each image
is a line of the form d:ℓ1/w1,ℓ2/w2, where d is the label of the example, ℓ1 and ℓ2 are the
labels that get the highest and second highest vote from the combined classifier at that
point in the run of the algorithm, and w1, w2 are the corresponding normalized scores.

[7] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm
for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on
Computational Learning Theory, pages 144–152, 1992.
[8] Leo Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–849, 1998.
[9] Leo Breiman. Prediction games and arcing classifiers. Neural Computation,
11(7):1493–1517, 1999.
[10] William Cohen. Fast effective rule induction. In Proceedings of the Twelfth Interna-
tional Conference on Machine Learning, pages 115–123, 1995.
[11] William W. Cohen and Yoram Singer. A simple, fast, and effective rule learner. In
Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.
[12] Michael Collins. Discriminative reranking for natural language parsing. In Proceed-
ings of the Seventeenth International Conference on Machine Learning, 2000.
[13] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, Ada-
Boost and Bregman distances. Machine Learning, to appear.
[14] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning,
20(3):273–297, September 1995.

[15] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models.
The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.
[16] Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features
of random fields. IEEE Transactions Pattern Analysis and Machine Intelligence,
19(4):1–13, April 1997.
[17] Ayhan Demiriz, Kristin P. Bennett, and John Shawe-Taylor. Linear programming
boosting via column generation. Machine Learning, 46(1/2/3):225–254, 2002.
[18] Thomas G. Dietterich. An experimental comparison of three methods for construct-
ing ensembles of decision trees: Bagging, boosting, and randomization. Machine
Learning, 40(2):139–158, 2000.
[19] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via
error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286,
January 1995.
[20] Harris Drucker. Improving regressors using boosting techniques. In Machine Learn-
ing: Proceedings of the Fourteenth International Conference, pages 107–115, 1997.
[21] Harris Drucker and Corinna Cortes. Boosting decision trees. In Advances in Neural
Information Processing Systems 8, pages 479–485, 1996.
[22] Harris Drucker, Robert Schapire, and Patrice Simard. Boosting performance in neural
networks. International Journal of Pattern Recognition and Artificial Intelligence,
7(4):705–719, 1993.
[23] Nigel Duffy and David Helmbold. Potential boosters? In Advances in Neural Infor-
mation Processing Systems 11, 1999.
[24] Nigel Duffy and David Helmbold. Boosting methods for regression. Machine Learn-
ing, 49(2/3), 2002.
[25] Gerard Escudero, Lluís Màrquez, and German Rigau. Boosting applied to word
sense disambiguation. In Proceedings of the 12th European Conference on Machine
Learning, pages 129–141, 2000.
[26] Yoav Freund. Boosting a weak learning algorithm by majority. Information and
Computation, 121(2):256–285, 1995.
[27] Yoav Freund. An adaptive version of the boost by majority algorithm. Machine
Learning, 43(3):293–318, June 2001.
[28] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boost-
ing algorithm for combining preferences. In Machine Learning: Proceedings of the
Fifteenth International Conference, 1998.
[29] Yoav Freund and Llew Mason. The alternating decision tree learning algorithm. In
Machine Learning: Proceedings of the Sixteenth International Conference, pages
124–133, 1999.

[30] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In
Machine Learning: Proceedings of the Thirteenth International Conference, pages
148–156, 1996.
[31] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting.
In Proceedings of the Ninth Annual Conference on Computational Learning Theory,
pages 325–332, 1996.
[32] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of Computer and System Sciences,
55(1):119–139, August 1997.
[33] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative
weights. Games and Economic Behavior, 29:79–103, 1999.
[34] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression:
A statistical view of boosting. The Annals of Statistics, 38(2):337–374, April 2000.
[35] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine.
The Annals of Statistics, 29(5), October 2001.
[36] Johannes Fürnkranz and Gerhard Widmer. Incremental reduced error pruning. In
Machine Learning: Proceedings of the Eleventh International Conference, pages 70–
77, 1994.
[37] Adam J. Grove and Dale Schuurmans. Boosting in the limit: Maximizing the mar-
gin of learned ensembles. In Proceedings of the Fifteenth National Conference on
Artificial Intelligence, 1998.
[38] Masahiko Haruno, Satoshi Shirai, and Yoshifumi Ooyama. Using decision trees to
construct a practical parser. Machine Learning, 34:131–149, 1999.
[39] Raj D. Iyer, David D. Lewis, Robert E. Schapire, Yoram Singer, and Amit Singhal.
Boosting for document routing. In Proceedings of the Ninth International Conference
on Information and Knowledge Management, 2000.
[40] Jeffrey C. Jackson and Mark W. Craven. Learning sparse perceptrons. In Advances
in Neural Information Processing Systems 8, pages 654–660, 1996.
[41] Michael Kearns and Leslie G. Valiant. Learning Boolean formulae or finite automata
is as hard as factoring. Technical Report TR-14-88, Harvard University Aiken Com-
putation Laboratory, August 1988.
[42] Michael Kearns and Leslie G. Valiant. Cryptographic limitations on learning Boolean
formulae and finite automata. Journal of the Association for Computing Machinery,
41(1):67–95, January 1994.
[43] Jyrki Kivinen and Manfred K. Warmuth. Boosting as entropy projection. In Proceed-
ings of the Twelfth Annual Conference on Computational Learning Theory, pages
134–144, 1999.

[44] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the
generalization error of combined classifiers. The Annals of Statistics, 30(1), February
2002.
[45] Vladimir Koltchinskii, Dmitriy Panchenko, and Fernando Lozano. Further explana-
tion of the effectiveness of voting methods: The game between margins and weights.
In Proceedings 14th Annual Conference on Computational Learning Theory and 5th
European Conference on Computational Learning Theory, pages 241–255, 2001.
[46] Vladimir Koltchinskii, Dmitriy Panchenko, and Fernando Lozano. Some new bounds
on the generalization error of combined classifiers. In Advances in Neural Informa-
tion Processing Systems 13, 2001.
[47] John Lafferty. Additive models, boosting and inference for generalized divergences.
In Proceedings of the Twelfth Annual Conference on Computational Learning The-
ory, pages 125–133, 1999.
[48] Guy Lebanon and John Lafferty. Boosting and maximum likelihood for exponential
models. In Advances in Neural Information Processing Systems 14, 2002.
[49] Richard Maclin and David Opitz. An empirical evaluation of bagging and boost-
ing. In Proceedings of the Fourteenth National Conference on Artificial Intelligence,
pages 546–551, 1997.
[50] Llew Mason, Peter Bartlett, and Jonathan Baxter. Direct optimization of margins
improves generalization in combined classifiers. In Advances in Neural Information
Processing Systems 12, 2000.
[51] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradi-
ent techniques for combining hypotheses. In Alexander J. Smola, Peter J. Bartlett,
Bernhard Schölkopf, and Dale Schuurmans, editors, Advances in Large Margin Clas-
sifiers. MIT Press, 1999.
[52] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting algorithms
as gradient descent. In Advances in Neural Information Processing Systems 12, 2000.
[53] Stefano Merler, Cesare Furlanello, Barbara Larcher, and Andrea Sboner. Tuning cost-
sensitive boosting and its application to melanoma diagnosis. In Multiple Classifier
Systems: Proceedings of the 2nd International Workshop, pages 32–42, 2001.
[54] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases, 1999.
www.ics.uci.edu/∼mlearn/MLRepository.html.
[55] Pedro J. Moreno, Beth Logan, and Bhiksha Raj. A boosting approach for confidence
scoring. In Proceedings of the 7th European Conference on Speech Communication
and Technology, 2001.
[56] Michael C. Mozer, Richard Wolniewicz, David B. Grimes, Eric Johnson, and Howard
Kaushansky. Predicting subscriber dissatisfaction and improving retention in the
wireless telecommunications industry. IEEE Transactions on Neural Networks,
11:690–696, 2000.

[57] Takashi Onoda, Gunnar Rätsch, and Klaus-Robert Müller. Applying support vector
machines and boosting to a non-intrusive monitoring system for household electric
appliances with inverters. In Proceedings of the Second ICSC Symposium on Neural
Computation, 2000.
[58] Dmitriy Panchenko. New zero-error bounds for voting algorithms. Unpublished
manuscript, 2001.
[59] J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth Na-
tional Conference on Artificial Intelligence, pages 725–730, 1996.
[60] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[61] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learn-
ing, 42(3):287–320, 2001.
[62] Gunnar Rätsch, Manfred Warmuth, Sebastian Mika, Takashi Onoda, Steven Lemm,
and Klaus-Robert Müller. Barrier boosting. In Proceedings of the Thirteenth Annual
Conference on Computational Learning Theory, pages 170–179, 2000.
[63] Greg Ridgeway, David Madigan, and Thomas Richardson. Boosting methodology
for regression problems. In Proceedings of the International Workshop on AI and
Statistics, pages 152–161, 1999.
[64] M. Rochery, R. Schapire, M. Rahim, N. Gupta, G. Riccardi, S. Bangalore, H. Al-
shawi, and S. Douglas. Combining prior knowledge and boosting for call classifica-
tion in spoken language dialogue. Unpublished manuscript, 2001.
[65] Marie Rochery, Robert Schapire, Mazin Rahim, and Narendra Gupta. BoosTexter for
text categorization in spoken language dialogue. Unpublished manuscript, 2001.
[66] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–
227, 1990.
[67] Robert E. Schapire. Using output codes to boost multiclass learning problems. In
Machine Learning: Proceedings of the Fourteenth International Conference, pages
313–321, 1997.
[68] Robert E. Schapire. Drifting games. Machine Learning, 43(3):265–291, June 2001.
[69] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the
margin: A new explanation for the effectiveness of voting methods. The Annals of
Statistics, 26(5):1651–1686, October 1998.
[70] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using
confidence-rated predictions. Machine Learning, 37(3):297–336, December 1999.
[71] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text
categorization. Machine Learning, 39(2/3):135–168, May/June 2000.
[72] Robert E. Schapire, Yoram Singer, and Amit Singhal. Boosting and Rocchio ap-
plied to text filtering. In Proceedings of the 21st Annual International Conference on
Research and Development in Information Retrieval, 1998.

[73] Holger Schwenk and Yoshua Bengio. Training methods for adaptive boosting of
neural networks. In Advances in Neural Information Processing Systems 10, pages
647–653, 1998.
[74] Kinh Tieu and Paul Viola. Boosting image retrieval. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2000.
[75] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–
1142, November 1984.
[76] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative
frequencies of events to their probabilities. Theory of Probability and its applications,
XVI(2):264–280, 1971.
[77] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[78] Marilyn A. Walker, Owen Rambow, and Monica Rogati. SPoT: A trainable sentence
planner. In Proceedings of the 2nd Annual Meeting of the North American Chapter
of the Association for Computational Linguistics, 2001.

Center Based Clustering: A Foundational Perspective

Pranjal Awasthi and Maria-Florina Balcan


Princeton University and Carnegie Mellon University

November 10, 2014

Abstract

In the first part of this chapter we detail center based clustering methods, namely methods based on
finding a “best” set of center points and then assigning data points to their nearest center. In particular,
we focus on k-means and k-median clustering which are two of the most widely used clustering objectives.
We describe popular heuristics for these methods and theoretical guarantees associated with them. We
also describe how to design worst case approximately optimal algorithms for these problems. In the
second part of the chapter we describe recent work on how to improve on these worst case algorithms
even further by using insights from the nature of real world clustering problems and data sets. Finally,
we also summarize theoretical work on clustering data generated from mixture models such as a mixture
of Gaussians.

1 Approximation algorithms for k-means and k-median

One of the most popular approaches to clustering is to define an objective function over the
data points and find a partitioning which achieves the optimal solution, or an approximately
optimal solution to the given objective function. Common objective functions include center
based objective functions such as k-median and k-means where one selects k center points
and the clustering is obtained by assigning each data point to its closest center point. Here
closeness is measured in terms of a pairwise distance function d(), which the clustering
algorithm has access to, encoding how dissimilar two data points are. For instance, the
data could be points in Euclidean space with d() measuring Euclidean distance, or it could
be strings with d() representing an edit distance, or some other dissimilarity score. For
mathematical convenience it is also assumed that the distance function d() is a metric. In
k-median clustering the objective is to find k center points c_1, c_2, ..., c_k, and a partitioning of
the data so as to minimize Φ_k-median = Σ_x min_i d(x, c_i). This objective is historically very
useful and well studied for facility location problems [16, 43]. Similarly the objective in
k-means is to minimize Φ_k-means = Σ_x min_i d(x, c_i)². Optimizing this objective is closely
related to fitting the maximum likelihood mixture model for a given dataset. For a given set
of centers, the optimal clustering for that set is obtained by assigning each data point to its
closest center point. This is known as the Voronoi partitioning of the data. Unfortunately,
exactly optimizing the k-median and the k-means objectives is a notoriously hard problem.
Intuitively this is expected since the objective function is a non-convex function of the
variables involved. This apparent hardness can also be formally justified by appealing to the

notion of NP completeness [43, 33, 8]. At a high level the notion of NP completeness identifies
a wide class of problems which are in principle equivalent to each other. In other words, an
efficient algorithm for exactly optimizing one of the problems in the class on all instances
would also lead to algorithms for all the problems in the class. This class contains many
optimization problems that are believed to be hard¹ to optimize exactly in the worst case,
and not surprisingly, k-median and k-means also fall into this class. Hence it is unlikely that
one would be able to optimize these objectives exactly using efficient algorithms. Naturally,
this leads to the question of recovering approximate solutions, and a lot of the work in the
theoretical community has focused on this direction [16, 11, 29, 34, 43, 47, 48, 57, 20].
Such works typically fall into two categories: a) providing formal worst case guarantees
on all instances of the problem, and b) providing better guarantees suited for nicer,
stable instances. In this chapter we discuss several stepping stone results in these directions,
focusing our attention on the k-means objective. A lot of the ideas and techniques
mentioned apply in a straightforward manner to the k-median objective as well. We will
point out crucial differences between the two objectives as and when they appear. We will
additionally discuss several practical implications of these results.
We will begin by describing a very popular heuristic for the k-means problem known as
Lloyd’s method. Lloyd’s method [51] is an iterative procedure which starts out with a set
of k seed centers and at each step computes a new set of centers with a lower k-means cost.
This is achieved by computing the Voronoi partitioning of the current set of centers and
replacing each center with the center of the corresponding partition. We will describe the
theoretical properties and limitations of Lloyd’s method which will also motivate the need
for good worst case approximation algorithms for k-means and k-median. We will see that
the method is very sensitive to the choice of the seed centers. Next we will describe a general
method based on local search which achieves constant factor approximations for both the
k-means and the k-median objectives. Similar to Lloyd’s method, the local search heuristic
starts out with a set of k seed centers and at each step swaps one of the centers for a new one
resulting in a decrease in the k-means cost. Using a clever analysis it can be shown that this
procedure outputs a good approximation to the optimal solution [47]. This is interesting,
since as mentioned above, optimizing the k-means objective is NP-complete; in fact it is
NP-complete even for k = 2 for points in Euclidean space [33]².
In the second part of the chapter we will describe some of the recent developments in the
study of clustering objectives. These works take a non-worst case analysis approach to the
problem. The basic theme is to design algorithms which give good solutions to clustering
problems only when the underlying optimal solution has a meaningful structure. We will
call such clustering instances stable instances. We will describe in detail two recently
studied notions of stability. The first one, called separability, was proposed by Ostrovsky
et al. [57]. According to this notion a k-clustering instance is stable if it is much more expensive
to cluster the data using (k − 1) or fewer clusters. For such instances Ostrovsky et al.
show that one can design a simple Lloyd's type algorithm which achieves a constant factor
approximation. A different notion called approximation stability was proposed by Balcan
et al. [20]. The motivation comes from the fact that often in practice optimizing an objective
function acts as a proxy for the real problem of getting close to the correct unknown ground
truth clustering. Hence it is only natural to assume that any good approximation to the proxy
function such as k-means or k-median will also be close to the ground truth clustering in
¹ This is the famous P vs NP problem, and there is a whole area called Computational Complexity Theory that studies this and related problems [12].
² If one restricts centers to be data points, then k-means can be solved optimally in time O(n^{k+1}) by trying all possible k-tuples of centers and choosing the best. The difficulty of k-means for k = 2 in Euclidean space comes from the fact that the optimal centers need not be data points.

terms of structure. Balcan et al. show that under this assumption one can design algorithms
that solve the end goal of getting close to the ground truth clustering. More surprisingly,
this is true even in cases where it is NP-hard to achieve a good approximation to the proxy
objective.

In the last part of the chapter we briefly review existing theoretical work on clustering
data generated from mixture models. We mainly focus on Gaussian Mixture Models (GMM)
which are the most widely studied distributional model for clustering. We will study algo-
rithms for clustering data from a GMM under the assumption that the mean vectors of
the component Gaussians are well separated. We will also see the effectiveness of spectral
techniques for GMMs. Finally, we will look at recent work on estimating the parameters of
a Gaussian mixture model under minimal assumptions.

2 Lloyd’s method for k-means

Consider a set A of n points in d-dimensional Euclidean space. We start by formally
defining Voronoi partitions.

Definition 1 (Voronoi Partition). Given a clustering instance C ⊂ R^d and k points c_1, c_2, ..., c_k,
a Voronoi partitioning using these centers consists of k disjoint clusters. Cluster i consists
of all the points x ∈ C satisfying d(x, c_i) ≤ d(x, c_j) for all j ≠ i.³

Lloyd's method, also known as the k-means algorithm, is the most popular heuristic for
k-means clustering in Euclidean space and has been shown to be one of the top ten
algorithms in data mining [69]. The method is an iterative procedure which is described
below.

Algorithm Lloyd's method

1. Seeding: Choose k seed points c_1, c_2, ..., c_k. Set Φ_old = ∞. Compute the current
   k-means cost Φ_curr using the seed points as centers, i.e.,

       Φ_curr = Σ_{i=1}^{n} min_j d²(x_i, c_j).

2. While Φ_curr < Φ_old,
   (a) Voronoi partitioning: Compute the Voronoi partitioning of the data based on
       the centers c_1, c_2, ..., c_k. In other words, create k clusters C_1, C_2, ..., C_k such that
       C_i = {x : d(x, c_i) ≤ min_{j≠i} d(x, c_j)}. Break ties arbitrarily.
   (b) Reseeding: Compute new centers ĉ_1, ĉ_2, ..., ĉ_k, where ĉ_j = mean(C_j) =
       (1/|C_j|) Σ_{x∈C_j} x. Set Φ_old = Φ_curr. Update the current k-means cost Φ_curr using
       the new centers.

3. Output: The current set of centers c_1, c_2, ..., c_k.

³ Ties can be broken arbitrarily.
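To make the iteration concrete, here is a minimal NumPy sketch of Lloyd's method as described above. The function names, the convergence tolerance, and the handling of empty clusters are illustrative choices rather than part of the original description.

```python
import numpy as np

def kmeans_cost(X, centers):
    # Sum over points of the squared distance to the nearest center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def lloyd(X, seeds, tol=1e-9):
    """One possible implementation of Lloyd's method.

    X: (n, d) array of points; seeds: (k, d) array of initial centers."""
    centers = np.array(seeds, dtype=float)
    cost_old, cost_curr = np.inf, kmeans_cost(X, centers)
    while cost_curr < cost_old - tol:
        # Voronoi partitioning: assign each point to its closest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Reseeding: replace each center by the mean of its cluster
        # (empty clusters simply keep their old center).
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        cost_old, cost_curr = cost_curr, kmeans_cost(X, centers)
    return centers, cost_curr
```

The tol guard only protects against endless loops due to floating-point noise; the idealized description stops exactly when the cost no longer strictly decreases.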

We would like to stress that although Lloyd’s method is popularly known as the k-means
algorithm, there is a difference between the underlying k-means objective (which is usually
hard to optimize) and the k-means algorithm which is a heuristic to solve the problem. An
attractive feature of Lloyd’s method is that the k-means cost of the clustering obtained never
increases. This follows from the fact that for any set of points, the 1-means cost is minimized
by choosing the mean of the set as the center. Hence for any cluster Ci in the partitioning,
choosing mean(Ci ) will never lead to a solution of higher cost. Hence if we repeat this
method until there is no change in the k-means cost, we will reach a local optimum of the
k-means cost function in finite time. In particular the number of iterations will be at most
n^{O(kd)}, which is the maximum number of Voronoi partitions of a set of n points in ℜ^d [42].
The basic method mentioned above leads to a class of algorithms depending upon the choice
of the seeding method. A simple way is to start with k randomly chosen data points. This
choice, however, can lead to arbitrarily bad solution quality, as shown in Figure 1. In addition,
it is also known that Lloyd's method can take up to 2^n iterations to converge even in 2
dimensions [14, 66].
Figure 1: Consider 4 points {A, B, C, D} on a line separated by distances x, y and z such
that z < x < y. Let k = 3. The optimal solution has centers at A, B and the centroid of
C, D, with a total cost of z²/2. When choosing random seeds, there is a constant probability
that we choose {A, C, D}. In this case the final centers will be C, D and the centroid of A, B,
with a total cost of x²/2. This ratio can be made arbitrarily bad.
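As a quick sanity check of the example in Figure 1, the tiny script below instantiates it with the illustrative values z = 1, x = 10, y = 100 (one placement of the points consistent with the figure) and compares the optimal cost with the cost reached from the bad seeding {A, C, D}.

```python
# Points on a line: A and B separated by x, B and C by y, C and D by z.
x, y, z = 10.0, 100.0, 1.0
A, B = 0.0, x
C, D = x + y, x + y + z

# Optimal 3-clustering: centers at A, B and the midpoint of {C, D}.
opt_cost = 2 * (z / 2) ** 2          # = z^2 / 2
# Bad seeding {A, C, D} converges to centers C, D and the midpoint of {A, B}.
bad_cost = 2 * (x / 2) ** 2          # = x^2 / 2

print(opt_cost, bad_cost, bad_cost / opt_cost)   # 0.5 50.0 100.0
```

Increasing x while keeping z fixed makes the ratio, and hence the solution quality of random seeding, arbitrarily bad.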

In sum, from a theoretical standpoint, k-means with random/arbitrary seeds is not a good
clustering algorithm in terms of efficiency or quality. Nevertheless, the speed and simplicity
of k-means are quite appealing in practical applications. Therefore, recent work has focused
on improving the initialization procedure: deciding on a better way to initialize the clustering
dramatically changes the performance of the Lloyd’s iteration, both in terms of quality and
convergence properties. For example, [15] showed that choosing a good set of seed points
is crucial and if done carefully can itself be a good candidate solution without the need for
further iterations. Their algorithm called k-means++ uses the following seeding procedure:
it selects only the first center uniformly at random from the data and each subsequent center
is selected with a probability proportional to its contribution to the overall error given the
previous selections. See Algorithm kmeans++ for a formal description:

Algorithm kmeans++
1. Initialize: a set S by choosing a data point uniformly at random.
2. While |S| < k,
   (a) Choose a data point x with probability proportional to min_{z∈S} d(x, z)², and add it
       to S.
3. Output: the clustering obtained by the Voronoi partitioning of the data using the
centers in S.
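A minimal NumPy sketch of this D²-sampling procedure follows; the function name and the use of squared Euclidean distance are assumptions made for illustration.

```python
import numpy as np

def kmeans_pp_seeding(X, k, rng=None):
    """Pick k centers by D^2-sampling in the spirit of Algorithm kmeans++."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # First center: chosen uniformly at random from the data.
    centers = [X[rng.integers(n)]]
    while len(centers) < k:
        # Squared distance of every point to its closest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next center proportionally to its current contribution.
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```

The Voronoi partition of the data with respect to the returned centers is then the clustering output in step 3.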

[15] showed that Algorithm kmeans++ is an O(log k)-approximation algorithm for the
k-means objective. We say that an algorithm is an α-approximation for a given objective
function Φ if for every clustering instance the algorithm outputs a solution of expected cost
at most α times the cost of the best solution. The design of approximation algorithms for
NP -hard problems has been a fruitful research direction and has led to a wide array of tools
and techniques. Formally, [15] show that:
Theorem 1 ([15]). Let S be the set of centers output by the above algorithm and Φ(S)
be the k-means cost of the clustering obtained using S as the centers. Then E[Φ(S)] ≤
O(log k)OPT, where OPT is the cost of the optimal k-means solution.

We would like to point out that in general the output of k-means++ is not a local
optimum. Hence it might be desirable in practice to run a few steps of the Lloyd’s method
starting from this solution. This could only lead to a better solution.
Subsequent work of [6] introduced a streaming algorithm inspired by the k-means++
algorithm that makes a single pass over the data. They show that if one is allowed to cluster
using a little more than k centers, specifically O(k log k) centers, then one can achieve a
constant-factor approximation in expectation to the k-means objective. The approximation
guarantee was improved in [5]. Such approximation algorithms which use more than k centers
are also known as bi-criteria approximations.
As mentioned earlier, Lloyd's method can take an exponential number of iterations to
converge to a local optimum. However, [13] showed that the method converges quickly on an
"average" instance. In order to formalize this, they study the problem under the smoothed
analysis framework of [65]. In the smoothed analysis framework the input is generated by
applying a small Gaussian perturbation to an adversarial input. [65] showed that the simplex
method takes a polynomial number of iterations on such smoothed instances. In a similar
spirit, [13] showed that for smoothed instances Lloyd's method runs in time polynomial
in n, the number of points, and 1/σ, where σ is the standard deviation of the Gaussian perturbation.
However, these works do not provide any guarantee on the quality of the final solution
produced.
We would like to point out that in principle the Lloyd’s method can be extended to the
k-median objective. A natural extension would be to replace the mean computation in the
Reseeding step with computing the median of a set of points X in Euclidean space, i.e.,
a point c ∈ ℜ^d such that Σ_{x∈X} d(x, c) is minimized. However this problem turns out to be
NP-complete [53]. For this reason, the Lloyd’s method is typically used only for the k-means
objective.

3 Properties of the k-means objective

In this section we provide some useful facts about the k-means clustering objective. We will
use C to denote the set of n points which represent a clustering instance. The first fact can
be used to show that given a Voronoi partitioning of the data, replacing a given center with
the mean of the corresponding partition can never increase the k-means cost.
Fact 2. Consider a finite set X ⊂ R^d and let c = mean(X). For any y ∈ R^d, we have that
Σ_{x∈X} d(x, y)² = Σ_{x∈X} d(x, c)² + |X| d(c, y)².

Proof. Representing each point in coordinate notation as x = (x_1, x_2, ..., x_d), we have that

    Σ_{x∈X} d(x, y)² = Σ_{x∈X} Σ_{i=1}^{d} |x_i − y_i|²
                     = Σ_{x∈X} Σ_{i=1}^{d} (|x_i − c_i|² + |c_i − y_i|² + 2(x_i − c_i)(c_i − y_i))
                     = Σ_{x∈X} d(x, c)² + |X| d(c, y)² + Σ_{i=1}^{d} 2(c_i − y_i) Σ_{x∈X} (x_i − c_i)
                     = Σ_{x∈X} d(x, c)² + |X| d(c, y)².

Here the last equality follows from the fact that for any i, c_i = (1/|X|) Σ_{x∈X} x_i, so Σ_{x∈X} (x_i − c_i) = 0.

An easy corollary of the above fact is the following.

Corollary 3. Consider a finite set X ⊂ R^d and let c = mean(X). We have
Σ_{x,y∈X} d(x, y)² = 2|X| Σ_{x∈X} d(x, c)².
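Both identities are easy to sanity-check numerically; the short script below does so on a random instance (the data, seed and comparison tolerance are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=3)
c = X.mean(axis=0)

d2 = lambda a, b: ((a - b) ** 2).sum()

# Fact 2: sum_x d(x,y)^2 = sum_x d(x,c)^2 + |X| d(c,y)^2
lhs = sum(d2(x, y) for x in X)
rhs = sum(d2(x, c) for x in X) + len(X) * d2(c, y)
assert np.isclose(lhs, rhs)

# Corollary 3 (sum over ordered pairs): sum_{x,y} d(x,y)^2 = 2|X| sum_x d(x,c)^2
lhs = sum(d2(a, b) for a in X for b in X)
rhs = 2 * len(X) * sum(d2(x, c) for x in X)
assert np.isclose(lhs, rhs)
```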

Below we prove another fact which will be useful later.


Fact 4. Let X ⊂ R^d be a finite set of points. Let ∆₁²(X) denote the 1-means cost of X. Given
a partition of X into X_1 and X_2 such that c = mean(X), c_1 = mean(X_1) and c_2 = mean(X_2),
we have that a) ∆₁²(X) = ∆₁²(X_1) + ∆₁²(X_2) + (|X_1||X_2|/|X|) d(c_1, c_2)², and
b) d(c, c_1)² ≤ ∆₁²(X) |X_2| / (|X||X_1|).

Proof. We can write ∆₁²(X) = Σ_{x∈X_1} d(x, c)² + Σ_{x∈X_2} d(x, c)². Using Fact 2 we can write

    Σ_{x∈X_1} d(x, c)² = ∆₁²(X_1) + |X_1| d(c, c_1)².

Similarly, Σ_{x∈X_2} d(x, c)² = ∆₁²(X_2) + |X_2| d(c, c_2)². Hence we have

    ∆₁²(X) = ∆₁²(X_1) + ∆₁²(X_2) + |X_1| d(c, c_1)² + |X_2| d(c, c_2)².

Part (a) follows by substituting c = (|X_1|c_1 + |X_2|c_2)/(|X_1| + |X_2|) in the above equation.

From Part (a) we have that ∆₁²(X) ≥ (|X_1||X_2|/|X|) d(c_1, c_2)². Part (b) follows by
substituting c_2 = ((|X_1| + |X_2|)/|X_2|) c − (|X_1|/|X_2|) c_1 above.

4 Local search based algorithms

In the previous section we saw that a carefully chosen seeding can lead to a good approxima-
tion for the k-means objective. In this section we will see how to design much better (constant
factor) approximation algorithms for k-means (as well as k-median). We will describe a very
generic approach based on local search. These algorithms work by making local changes to
a candidate solution and improving it at each step. They have been successfully used for
a variety of optimization problems [7, 28, 36, 40, 58, 61]. Kanungo et al. [47] analyzed a
simple local search based algorithm for k-means, described below.

Algorithm Local search

1. Initialization: Choose k data points {c_1, c_2, ..., c_k} arbitrarily from the data set D.
   Let this set be T. Let Φ(T) denote the cost of the k-means solution using T as centers,
   i.e., Φ(T) = Σ_{i=1}^{n} min_j d²(x_i, c_j). Set T_old = ∅ (so that Φ(T_old) = ∞), T_curr = T.
2. While Φ(T_curr) < Φ(T_old),
   • For x ∈ T_curr and y ∈ D \ T_curr:
     if Φ((T_curr \ {x}) ∪ {y}) < Φ(T_curr), update T_old = T_curr and T_curr ← (T_curr \ {x}) ∪ {y}.
3. Output: S = T_curr as the set of final centers.

We would like to point out that in order to make the above algorithm run in polynomial
time, one needs to change the criterion in the while loop to be Φ(T_curr) < (1 − ǫ)Φ(T_old). The
running time will then depend polynomially on n and 1/ǫ. For simplicity of analysis, we will
prove the following theorem for the idealized version of the algorithm with no ǫ.
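The sketch below is one possible Python rendering of this single-swap local search, including the (1 − ǫ) improvement criterion just mentioned; the helper names and the brute-force cost computation are illustrative, not part of the original algorithm statement.

```python
import numpy as np

def cost(X, centers):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def local_search_kmeans(X, k, eps=0.01, rng=None):
    """Single-swap local search over centers chosen among the data points."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    T = set(rng.choice(n, size=k, replace=False).tolist())   # indices of current centers
    phi = cost(X, X[list(T)])
    improved = True
    while improved:
        improved = False
        for x in list(T):
            for y in range(n):
                if y in T:
                    continue
                cand = (T - {x}) | {y}
                phi_cand = cost(X, X[list(cand)])
                if phi_cand < (1 - eps) * phi:   # accept only significant improvements
                    T, phi, improved = cand, phi_cand, True
                    break
            if improved:
                break
    return X[list(T)], phi
```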
Theorem 5 ([47]). Let S be the final set of centers returned by the above procedure. Then,
Φ(S) ≤ 50OPT.

In order to prove the above theorem we start by building up some notation. Let T be
the set of k data points returned by the local search algorithm as candidate centers. Let O
be the set of k data points which achieve the minimum value of the k-means cost function
among all sets of k data points. Note that the centers in O do not necessarily represent the
optimal solution as the optimal centers might not be data points. However using the next
lemma one can show that using data points as centers is only twice as bad as the optimal
solution.

Lemma 6. Given C ⊆ Rd , and the optimal k-means clustering of C, {C1 , C2 , · · · Ck }, there
exists a set S of k data points such that Φ(S) ≤ 2OPT.

Proof. For any finite set of points, let ∆₁²(·) denote its 1-means cost; by Fact 2 this cost is
achieved by choosing the mean of the set as the center. In order to prove the above lemma
it is enough to show that for each optimal cluster C_i with mean c_i, there exists a data point
x_i ∈ C_i such that Σ_{x∈C_i} d(x, x_i)² ≤ 2∆₁²(C_i). Let x_i be the data point in C_i which is
closest to c_i. Again using Fact 2 we have Σ_{x∈C_i} d(x, x_i)² = ∆₁²(C_i) + |C_i| d(x_i, c_i)² ≤
2∆₁²(C_i), since |C_i| d(x_i, c_i)² ≤ Σ_{x∈C_i} d(x, c_i)² = ∆₁²(C_i).
2∆1 2 (Ci ).

Hence it is enough to compare the cost of the centers returned by the algorithm to the cost
of the optimal centers using data points. In particular, we will show that Φ(T ) ≤ 25Φ(O).
We start with the simple observation that by the property of the local search algorithm, for
any t ∈ T and o ∈ O, swapping t for o does not decrease the cost. In other words,

    Φ(T − t + o) − Φ(T) ≥ 0    (4.1)

The main idea is to add up Equation 4.1 over a carefully chosen set of swaps {o, t} to get
the desired result. In order to describe the set of swaps chosen we start by defining a cover
graph.

Definition 2. A cover graph is a bipartite graph with the centers in T on one side and the
centers in O on the other side. For each o ∈ O, let t_o be the point in T which is closest to
o. The cover graph contains edges of the form (o, t_o) for all o ∈ O.

Figure 2: An example cover graph.


Next we use the cover graph to generate the set of useful swaps. For each t ∈ T which has
degree 1 in the cover graph, we output the swap pair {t, o} where o is the point connected
to t. Let T ′ be the degree 0 vertices in the cover graph. We pair the remaining vertices
o ∈ O with the vertices in T ′ such that each vertex in O has degree 1 and each vertex in
T′ has degree at most 2. To see that such a pairing exists, notice that for any t ∈ T of
degree ℓ > 1 there will exist ℓ − 1 distinct degree-0 vertices in T; these vertices can be paired
to the vertices in O connected to t while maintaining the above property. We then output all the edges
in this pairing as the set of useful swaps.

4.1 Bounding the cost of a swap

Consider a swap {o, t} output by using the cover graph. We will apply Equation 4.1 to this
pair. We will explicitly define a clustering using centers in T −t+o and upper bound its cost.

We will then use the lower bound of Φ(T ) from Equation 4.1 to get the kind of equations
we want to sum up over. Let the clustering given by centers in T be C1 , C2 , · · · Ck . Let Co ∗
be the cluster corresponding to center o in the optimal clustering given by O. Let ox be the
closest point in O to x. Similarly let tx be the closest point in T to x. The key property
satisfied by any output pair {o, t} is the following:

Fact 7. Let {o, t} be a swap pair output using the cover graph. Then we have that for any
x ∈ C_t, either o_x = o or t_{o_x} ≠ t.

Proof. Assume that for some x ∈ C_t, o_x = o′ ≠ o. By the procedure used to output swap
pairs we have that t has degree 1 or 0 in the cover graph. In addition, if t has degree 1 then
t_o = t. In both cases we have that t_{o′} ≠ t.

Next we create a new clustering by swapping o for t and assigning all the points in C_o* to
o. Next we reassign the points in C_t \ C_o*. Consider a point x ∈ C_t \ C_o*. Clearly o_x ≠ o. Let
t_{o_x} be the point in T which is connected to o_x in the cover graph. We assign x to t_{o_x}. One
needs to ensure here that t_{o_x} ≠ t, which follows from Fact 7. From Equation 4.1 the increase
in cost due to this reassignment must be non-negative. In other words we have

    Σ_{x∈C_o*} (d(x, o)² − d(x, t_x)²) + Σ_{x∈C_t \ C_o*} (d(x, t_{o_x})² − d(x, t)²) ≥ 0    (4.2)

We will add up Equation 4.2 over the set of all good swaps.

4.2 Adding it all up

In order to sum up over all swaps notice that in the first term in Equation 4.2 every point
x ∈ C appears exactly once, by being in C_o* for some o ∈ O. Hence the sum over all swaps
of the first term can be written as Σ_{x∈C} (d(x, o_x)² − d(x, t_x)²). Consider the second term
in Equation 4.2. We have that (d(x, t_{o_x})² − d(x, t)²) ≥ 0 since x is in C_t. Hence we can
replace the second summation by a summation over all x ∈ C_t without affecting the inequality. Also,
every point x ∈ C appears at most twice in the second term, by being in C_t for some t ∈ T. Hence
the sum over all swaps of the second term is at most 2 Σ_{x∈C} (d(x, t_{o_x})² − d(x, t_x)²). Adding
these up and rearranging we get that

    Φ(O) − 3Φ(T) + 2R ≥ 0    (4.3)

Here R = Σ_{x∈C} d(x, t_{o_x})².

In the last part we will upper bound the quantity R. R represents the cost of assigning
every point x to a center in T but not necessarily the closest one. Hence, R ≥ Φ(T ) ≥ Φ(O).
However we next show that this reassignment cost is not too large.
Notice that R can also be written as Σ_{o∈O} Σ_{x∈C_o*} d(x, t_o)². Also, Σ_{x∈C_o*} d(x, t_o)² =
Σ_{x∈C_o*} d(x, o)² + |C_o*| d(o, t_o)². Hence we have that R = Σ_{o∈O} Σ_{x∈C_o*} (d(x, o)² + d(o, t_o)²).

Also note that d(o, t_o) ≤ d(o, t_x) for any x. Hence

    R ≤ Σ_{o∈O} Σ_{x∈C_o*} (d(x, o)² + d(o, t_x)²) = Σ_{x∈C} (d(x, o_x)² + d(o_x, t_x)²).

Using the triangle inequality we know that d(o_x, t_x) ≤ d(o_x, x) + d(x, t_x). Substituting above
and expanding we get that

    R ≤ 2Φ(O) + Φ(T) + 2 Σ_{x∈C} d(x, o_x) d(x, t_x)    (4.4)

The last term in the above equation can be bounded using the Cauchy-Schwarz inequality
as Σ_{x∈C} d(x, o_x) d(x, t_x) ≤ √Φ(O) · √Φ(T). So we have that R ≤ 2Φ(O) + Φ(T) +
2√Φ(O)√Φ(T). Substituting this in Equation 4.3 gives Φ(O) − 3Φ(T) + 2(2Φ(O) + Φ(T) +
2√Φ(O)√Φ(T)) ≥ 0, i.e., 5Φ(O) − Φ(T) + 4√Φ(O)√Φ(T) ≥ 0. Setting α = √(Φ(T)/Φ(O))
this reads α² − 4α − 5 ≤ 0, so α ≤ 5 and we get the desired result that Φ(T) ≤ 25Φ(O).
Combining this with Lemma 6 proves Theorem 5.
A natural generalization of Algorithm Local search is to swap more than one center
at each step. This could potentially lead to a much better local optimum. This multi-swap
scheme was analyzed by [47], and using a similar analysis as above one can show the following.

Theorem 8. Let S be the final set of centers returned by the local search algorithm which swaps up to
p centers at a time. Then we have that Φ(S) ≤ 2(3 + 2/p)² OPT, where OPT is the cost of the
optimal k-means solution.

For the case of k-median the same algorithm and analysis gives [16]:

Theorem 9. Let S be the final set of centers returned by the local search algorithm which swaps up to
p centers at a time. Then we have that Φ(S) ≤ (3 + 2/p) OPT, where OPT is the cost of the
optimal k-median solution.


This approximation factor for k-median has recently been improved to (1 + √3 + ǫ) [50].
For the case of k-means in Euclidean space, [48] give an algorithm which achieves a (1 + ǫ)-
approximation to the k-means objective for any constant ǫ > 0. However, the runtime of the
algorithm depends exponentially on k and hence it is only suitable for small instances.

5 Clustering of stable instances

In this part of the chapter we delve into some of the more modern research in the theory
of clustering. In the recent past there has been increasing interest in designing clustering
algorithms that enjoy strong theoretical guarantees on non-worst-case instances. This is
of significant interest for two reasons: a) from a theoretical point of view, this helps us
understand and characterize the class of problems for which one can get optimal or close
to optimal guarantees; b) from a practical point of view, real world instances often have
additional structure that could be exploited to get better performance. Compared to worst
case analysis, the main challenge here is to formalize well motivated and interesting additional
structures of clustering instances under which good algorithms exist. In this section we
present two popular interesting notions.

5.1 ǫ-separability

This notion of stability was proposed by Ostrovsky et al. [57]. Given an instance of
k-means clustering, let OPT(k) denote the cost of the optimal k-means solution. We can also
decompose OPT(k) as OPT(k) = Σ_{i=1}^{k} OPT_i, where OPT_i denotes the 1-means cost of cluster
C_i, i.e., Σ_{x∈C_i} d(x, c_i)². Such an instance is called ǫ-separable if it satisfies OPT(k − 1) >
(1/ǫ²) OPT(k).
OPT(k).

The definition is motivated by the following issue: when approaching a clustering problem,
one typically has to decide how many clusters one wants to partition the data in, i.e., the
value of k. If the k-means objective is the underlying criterion being used to judge the quality
of a clustering, and the optimal (k − 1)-means clustering is comparable to the optimal k-
means clustering, then one can in principle also use (k − 1) clusters to describe the data set.
In fact this particular method is a very popular heuristic to find out the number of hidden
clusters in the data set. In other words choose the value of k at which there is a significant
increase in the k-means cost when going from k to k − 1. As an illustrative example consider
the case of a mixture of k spherical unit-variance Gaussians in d dimensions whose pairwise
means are separated by a distance D ≫ 1. Given n points from each Gaussian, the optimal
k-means cost with high probability is nkd. On the other hand, if we try to cluster this data
using (k − 1) clusters, the optimal cost will now become n(k − 1)d + n(D² + d). Hence, taking
the ratio of the two costs, this instance will be ǫ-separable for 1/ǫ² = ((k − 1)d + D² + d)/(kd) =
1 + D²/(kd), so ǫ = (1 + D²/(kd))^{−1/2}. Hence, if D ≫ √(kd), then the instance will be
highly separable (the separability parameter ǫ will be o(1)).
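This calculation is easy to check empirically. The snippet below, which assumes scikit-learn is available and uses its KMeans implementation as a stand-in for an (approximately) optimal solver, estimates the ratio OPT(k − 1)/OPT(k) and the corresponding separability parameter on a well-separated Gaussian mixture; all numerical choices are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
k, d, n_per, D = 4, 10, 500, 30.0            # separation D much larger than sqrt(k*d)
means = D * rng.normal(size=(k, d)) / np.sqrt(d)   # pairwise mean distances on the order of D
X = np.vstack([m + rng.normal(size=(n_per, d)) for m in means])

opt_k   = KMeans(n_clusters=k,     n_init=10, random_state=0).fit(X).inertia_
opt_km1 = KMeans(n_clusters=k - 1, n_init=10, random_state=0).fit(X).inertia_

print(opt_km1 / opt_k)            # a large ratio: the instance is eps-separable
print(np.sqrt(opt_k / opt_km1))   # the corresponding (small) separability parameter eps
```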
It was shown by Ostrovsky et al. [57] that one can design much better approximation
algorithms for ǫ-separable instances.
Theorem 10 ([57]). There is a polynomial time algorithm which, given any ǫ-separable
2-means instance, returns a clustering of cost at most OPT/(1 − ρ) with probability at least
1 − O(ρ), where c_2 ǫ² ≤ ρ ≤ c_1 ǫ² for some constants c_1, c_2 > 0.

Theorem 11 ([57]). There is a polynomial time algorithm which, given any ǫ-separable
k-means instance, returns a clustering of cost at most OPT/(1 − ρ) with probability
1 − O(ρ^{1/4}), where c_2 ǫ² ≤ ρ ≤ c_1 ǫ² for some constants c_1, c_2 > 0.

5.2 Proof Sketch and Intuition for Theorem 10

Notice that the algorithm does not need to know the value of ǫ from the separability
of the instance. Define r_i to be the radius of cluster C_i in the optimal k-means clustering,
i.e., r_i² = OPT_i/|C_i|. The main observation is that under the ǫ-separability condition, the optimal
k-means clustering is "spread out". In other words, the radius of any cluster is much smaller
than the inter-cluster distances. This can be formulated in the following lemma.

Lemma 12. ∀ i, j, d(c_i, c_j)² ≥ ((1 − ǫ²)/ǫ²) max(r_i², r_j²).

Proof. Given an ǫ-separable instance of k-means, consider any two clusters C_i and C_j in
the optimal clustering, with centers c_i and c_j respectively. Consider the (k − 1)-clustering
obtained by deleting c_j and assigning all the points in C_j to C_i. By ǫ-separability, the cost
of this new clustering must be at least OPT/ǫ². However, the increase in the cost will be exactly
|C_j| d(c_i, c_j)². This follows from the simple observation stated in Fact 2. Hence we have that
|C_j| d(c_i, c_j)² > (1/ǫ² − 1)OPT. This gives us that r_j² = OPT_j/|C_j| ≤ OPT/|C_j| ≤
(ǫ²/(1 − ǫ²)) d(c_i, c_j)². Similarly, if we delete c_i and assign all the points in C_i to C_j we get
that r_i² ≤ (ǫ²/(1 − ǫ²)) d(c_i, c_j)².

When dealing with the two means problem, if one could find two initial candidate center
points which are close to the corresponding optimal centers, then we could hope to run a
Lloyd’s type step and improve the solution quality. In particular if we could find c¯1 and c¯2
such that d(c1 , c¯1 )2 ≤ αr1 2 and d(c2 , c¯2 )2 ≤ αr2 2 , then we know from Fact 2 that using these
center points will give us a (1 + α) approximation to OPT. Lemma 12 suggests the following
approach: pick data points x, y with probability proportional to d(x, y)2. We will show that
this will lead to seed points cˆ1 and cˆ2 not too far from the optimal centers. Applying a Lloyd
type reseeding step will then lead us to the final centers which will be much closer to the
optimal centers. We start by defining the core of a cluster.
Definition 3 (Core of a cluster). Let ρ < 1 be a constant. We define X_i = {x ∈ C_i :
d(x, c_i)² ≤ r_i²/ρ}. We call X_i the core of the cluster C_i.

We next show that if we pick initial seeds {cˆ1 , cˆ2 } = {x, y} with probability proportional
to d(x, y)2 then with high probability the points lie within the core of different clusters.
Lemma 13. For sufficiently small ǫ and ρ = 100ǫ²/(1 − ǫ²), we have Pr[{ĉ_1, ĉ_2} ∩ X_1 ≠ ∅ and
{ĉ_1, ĉ_2} ∩ X_2 ≠ ∅] = 1 − O(ρ).


Figure 3: An ǫ-separable 2-means instance

Proof Sketch. For simplicity assume that the sizes of the two clusters are the same, i.e.,
|C_1| = |C_2| = n/2. In this case, we have r_1² = r_2² = 2OPT/n = r². Also, let d²(c_1, c_2) = d².
From ǫ-separability, we know that d² > ((1 − ǫ²)/ǫ²) r². Also, from the definition of the core, we
know that at least a (1 − ρ) fraction of the mass of each cluster lies within the core. Hence,
the clustering instance looks like the one shown in Figure 3. Let A = Σ_{x∈X_1, y∈X_2} d(x, y)²
and B = Σ_{x,y⊂C} d(x, y)². Then the probability of the event is exactly A/B. Let's analyze
the quantity B first. The proof goes by arguing that the pairwise distances between X_1 and
X_2 will dominate B. This is because of Lemma 12, which says that d² is much greater than
r², the average radius of a cluster. More formally, from Corollary 3 and Fact 4 we
can get that B = n∆₁²(C) = n∆₂²(C) + (n²/4) d². In addition, ǫ-separability tells us that
∆₁²(C) > (1/ǫ²) ∆₂²(C). Hence we get that B ≤ (n²/(4(1 − ǫ²))) d².

Let's analyze A = Σ_{x∈X_1, y∈X_2} d(x, y)². From the triangle inequality, we have that for any
x ∈ X_1, y ∈ X_2, d²(x, y) ≥ (d − 2r/√ρ)². Hence A ≥ (1/4)(1 − ρ)² n² (d − 2r/√ρ)². Substituting
these bounds and using the fact that ρ = O(ǫ²) gives us that A/B ≥ 1 − O(ρ).

Using these initial seeds we now show that a single step of a Lloyd's type method can
yield a good solution. Define r = d(ĉ_1, ĉ_2)/3. Define c̄_1 as the mean of the points in B(ĉ_1, r)
and c̄_2 as the mean of the points in B(ĉ_2, r). Notice that instead of taking the mean of the
Voronoi partition corresponding to ĉ_1 and ĉ_2, we take the mean of the points within a small
radius of the given seeds.

Lemma 14. Given ĉ_1 ∈ X_1 and ĉ_2 ∈ X_2, the clustering obtained using c̄_1 and c̄_2 as centers
has 2-means cost at most OPT/(1 − ρ).

Proof. We will first show that X_1 ⊆ B(ĉ_1, r) ⊆ C_1. Using Lemma 12 we know that
d(ĉ_1, c_1) ≤ (ǫ/√(ρ(1 − ǫ²))) d(c_1, c_2) ≤ d(c_1, c_2)/10 for sufficiently small ǫ. Similarly
d(ĉ_2, c_2) ≤ d(c_1, c_2)/10. Hence d(ĉ_1, ĉ_2) is between (4/5) d(c_1, c_2) and (6/5) d(c_1, c_2), so
(4/15) d(c_1, c_2) ≤ r ≤ (6/15) d(c_1, c_2). So for any z ∈ B(ĉ_1, r), d(z, c_1) ≤ d(c_1, c_2)/2. Hence
z ∈ C_1. Also, for any z ∈ X_1, d(z, ĉ_1) ≤ 2r_1/√ρ ≤ r. Similarly one can show that
X_2 ⊆ B(ĉ_2, r) ⊆ C_2. Now applying Fact 4 we can claim that d(c̄_1, c_1)² ≤ (ρ/(1 − ρ)) r_1² and
d(c̄_2, c_2)² ≤ (ρ/(1 − ρ)) r_2². So using c̄_1 and c̄_2 as centers we get a clustering of cost at most
OPT + (ρ/(1 − ρ)) OPT = OPT/(1 − ρ).

Summarizing the discussion above, we have the following simple algorithm for the 2-means
problem.

Algorithm 2-means
1. Seeding: Choose initial seeds x, y with probability proportional to d(x, y)2 .
2. Given seeds cˆ1 , cˆ2 , let r = d(cˆ1 , cˆ2 )/3. Define c¯1 = mean(B(cˆ1 , r)) and c¯2 =
mean(B(cˆ2 , r)).
3. Output: c¯1 and c¯2 as the cluster centers.
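A direct NumPy transcription of Algorithm 2-means is sketched below. Sampling a pair with probability proportional to d(x, y)² is implemented here by a two-stage draw (first x with probability proportional to Σ_y d(x, y)², then y given x), which is one convenient way, not the only one, to realize step 1.

```python
import numpy as np

def two_means_separable(X, rng=None):
    """Seed by a pair drawn with probability proportional to d(x, y)^2,
    then average the points in balls of radius d(c1_hat, c2_hat)/3."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # pairwise squared distances

    # Two-stage draw of a pair (x, y) with probability proportional to d(x, y)^2.
    row = D2.sum(axis=1)
    i = rng.choice(n, p=row / row.sum())
    j = rng.choice(n, p=D2[i] / D2[i].sum())
    c1_hat, c2_hat = X[i], X[j]

    r = np.sqrt(((c1_hat - c2_hat) ** 2).sum()) / 3
    ball1 = X[((X - c1_hat) ** 2).sum(axis=1) <= r ** 2]
    ball2 = X[((X - c2_hat) ** 2).sum(axis=1) <= r ** 2]
    return ball1.mean(axis=0), ball2.mean(axis=0)
```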

5.3 Proof Sketch and Intuition for Theorem 11

In order to generalize the above argument to the case of k clusters, one could follow a
similar approach and start with k initial seed centers. Again we start by choosing x, y
with probability proportional to d(x, y)². After choosing a set U of points, we choose
the next point z with probability proportional to min_{ĉ_i∈U} d(z, ĉ_i)². Using a similar analysis
as in Lemma 13 one can show that if we pick k seeds then with probability (1 − O(ρ))^k
they will lie within the cores of different clusters. However, this probability of success
decays exponentially in k and is not good for our purpose. The approach taken in [57] is to
sample a larger set of points and argue that with high probability it is going to contain k
seed points from the "outer" cores of different clusters. Here we define the outer core of a cluster
as X_i^out = {x ∈ C_i : d(x, c_i)² ≤ r_i²/ρ³} – so this notion is similar to the core notion for k = 2,
except that the radius of the core is bigger by a factor of 1/ρ than before. We would like
to point out that a similar seeding procedure to the one described above is used in the
k-means++ algorithm [15] (see Section 2). One can show that using k seed centers in this
way gives an O(log k)-approximation to the k-means objective in the worst case.

Lemma 15 ([57]). Let N = 2k/(1 − 5ρ) + ln(2/δ)/(2(1 − 5ρ)²), where ρ = ǫ. If we sample N
points using the sampling procedure, then Pr[∀ j = 1, ..., k, there exists some x̂_i ∈ X_j^out] ≥ 1 − δ.

Since we sample more than k points in the first step, one needs to extract k good seed
points out of this set before running the Lloyd step. This is achieved by the following greedy
procedure:

Algorithm Greedy deletion procedure

1. Let S denote the current set of candidate centers. Let Φ(S) denote the k-means cost of
   the Voronoi partition using S. Similarly, for x ∈ S let Φ(S_x) denote the k-means cost
   of the Voronoi partition using S \ {x}.
2. While |S| > k,
   • remove a point x from S such that Φ(S_x) − Φ(S) is minimum;
   • for every remaining point x ∈ S, let R(x) denote the Voronoi set corresponding to
     x, and replace x by mean(R(x)).
3. Output: S.
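A possible implementation of this greedy deletion procedure is sketched below; the quadratic brute-force cost computation is used purely for clarity and the function names are illustrative.

```python
import numpy as np

def kmeans_cost(X, centers):
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1).sum()

def greedy_deletion(X, S, k):
    """Shrink a set of candidate centers S (array of shape (m, d), m > k) down to k."""
    S = np.array(S, dtype=float)
    while len(S) > k:
        # Remove the center whose deletion increases the k-means cost the least.
        base = kmeans_cost(X, S)
        increases = [kmeans_cost(X, np.delete(S, i, axis=0)) - base for i in range(len(S))]
        S = np.delete(S, int(np.argmin(increases)), axis=0)
        # Replace every remaining center by the mean of its Voronoi set.
        labels = ((X[:, None, :] - S[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for i in range(len(S)):
            if np.any(labels == i):
                S[i] = X[labels == i].mean(axis=0)
    return S
```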

At the end of the greedy procedure we have the following guarantee.

Lemma 16. For every optimal center c_i, there is a point ĉ_i ∈ S such that d(c_i, ĉ_i) ≤ D_i/10.
Here D_i = min_{j≠i} d(c_i, c_j).

Using the above lemma and applying the same Lloyd step as in the 2-means problem,
we get a set of k good final centers. These centers have the property that for each i,
d(c_i, c̄_i)² ≤ (ρ/(1 − ρ)) r_i². Putting the above argument together formally, we get the desired result.

5.4 Approximation stability

In [20] Balcan et al. introduce and analyze a class of approximation stable instances for which
they provide polynomial time algorithms for finding accurate clusterings. The starting point
of this work is that for many problems of interest to machine learning, such as clustering
proteins by function, images by subject, or documents by topic, there is some unknown
correct target clustering. In such cases the implicit hope when pursuing an objective based
clustering approach (k-means or k-median) is that approximately optimizing the objective
function will in fact produce a clustering of low clustering error, i.e., a clustering that is
pointwise close to the target clustering. Balcan et al. have shown that by making this
implicit assumption explicit, one can efficiently compute a low-error clustering even in cases
when the approximation problem of the objective function is NP-complete! This is quite
interesting since it shows that by exploiting the properties of the problem at hand one
can solve the desired problem and bypass worst case hardness results. A similar stability
assumption, regarding additive approximations, was presented in [54]. The work of [54]
studied sufficient conditions under which the stability assumption holds true.

Formally, the approximation stability notion is defined as follows.

Definition 4 ((1 + α, ǫ)-approximation-stability). Let X be a set of n points residing in a metric space M. Given an objective function Φ (such as k-median, k-means, or min-sum), we say that the instance (M, X) satisfies (1 + α, ǫ)-approximation-stability for Φ if all clusterings C with Φ(C) ≤ (1 + α) · OPTΦ are ǫ-close to the target clustering CT for (M, X).

Here the term “target” clustering refers to the ground-truth clustering of X which one is trying to approximate. It is also important to clarify what we mean by an ǫ-close clustering. Given two k-clusterings C and C∗ of n points, the distance between them is measured as dist(C, C∗) = min_{σ∈Sk} (1/n) Σ_{i=1}^{k} |Ci \ C∗_{σ(i)}|. We say that C is ǫ-close to C∗ if the distance between them is at most ǫ.
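For concreteness, this distance can be computed exactly by optimizing over the permutation σ with a bipartite matching. The following Python sketch (using SciPy's linear_sum_assignment; the function name and the label-array input format are our own choices) is one way to do this, assuming both clusterings are given as label vectors taking values in {0, . . . , k − 1}.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_distance(labels_a, labels_b, k):
    """dist(C, C*) = min over permutations sigma of (1/n) * sum_i |C_i \ C*_{sigma(i)}|,
    for two k-clusterings of the same n points given as label arrays."""
    n = len(labels_a)
    overlap = np.zeros((k, k), dtype=int)       # overlap[i, j] = |C_i intersect C*_j|
    for a, b in zip(labels_a, labels_b):
        overlap[a, b] += 1
    sizes_a = overlap.sum(axis=1)               # |C_i|
    cost = sizes_a[:, None] - overlap           # |C_i \ C*_j|
    rows, cols = linear_sum_assignment(cost)    # best permutation sigma
    return cost[rows, cols].sum() / n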
Interestingly, this approximation-stability condition implies a lot of structure about the problem instance which can be exploited algorithmically. For example, we can show the following.
Theorem 17 ([20]). If the given instance (M, S) satisfies (1 + α, ǫ)-approximation-stability for the k-median or the k-means objective, then we can efficiently produce a clustering that is O(ǫ + ǫ/α)-close to the target clustering CT.

Notice that the above theorem is valid even for values of α for which getting a (1 + α)-approximation to k-median and k-means is NP-hard! In a recent paper, [4] show that running the k-means++ algorithm on approximation-stable instances of k-means gives a constant factor approximation with probability Ω(1/k). In the following we provide a sketch of the proof of Theorem 17 for k-means clustering.

5.5 Proof Sketch and Intuition for Theorem 17

Let C1, C2, . . . , Ck be an optimal k-means clustering of X. Let c1, c2, . . . , ck be the corresponding cluster centers. For any point x ∈ X, let w(x) be the distance of x to its cluster center. Similarly, let w2(x) be the distance of x to the second closest center. The value of the optimal solution can then be written as OPT = Σx w(x)². The main implication of

approximation stability is that most of the points are much closer to their own center than
to the centers of other clusters. Specifically:
Lemma 18. If the instance (M, X) satisfies (1 + α, ǫ)-approximation-stability then less than 6ǫn points satisfy w2(x)² − w(x)² ≤ αOPT/(2ǫn).

Proof. Let C∗ be the optimal k-means clustering. First notice that by approximation-stability, dist(C∗, CT) = ǫ∗ ≤ ǫ. Let B be the set of points that satisfy w2(x)² − w(x)² ≤ αOPT/(2ǫn), and let us assume that |B| > 6ǫn. We will create a new clustering C′ by transferring some of the points in B to their second closest center. In particular, it can be shown that there exists a subset of size |B|/3 such that for each point reassigned in this set, the distance of the clustering to C∗ increases by 1/n. Hence we obtain a clustering C′ which is at least 2ǫ away from C∗ and at least ǫ away from CT. However, the increase in cost in going from C∗ to C′ is at most αOPT. This contradicts the approximation-stability assumption.

Let us define dcrit = √(αOPT/(50ǫn)) as the critical distance. We call a point x good if it satisfies w(x)² < dcrit² and w2(x)² − w(x)² > 25dcrit²; otherwise we call x a bad point. Let B be the set of all bad points and let Gi be the good points in target cluster i. Since 25dcrit² = αOPT/(2ǫn), Lemma 18 implies that at most 6ǫn points satisfy w2(x)² − w(x)² ≤ 25dcrit². Also, from Markov's inequality, at most 50ǫn/α points can have w(x)² > dcrit². Hence |B| ≤ 6ǫn + 50ǫn/α = O(ǫ + ǫ/α)n.

Given Lemma 18, if we then define the τ-threshold graph Gτ = (S, Eτ) to be the graph produced by connecting all pairs {x, y} of points with d(x, y) < τ, and consider τ = 2dcrit, we get the following properties:

(1) For x, y ∈ Ci∗ such that x and y are good points, we have {x, y} ∈ E(Gτ).
(2) For x ∈ Ci∗ and y ∈ Cj∗ (i ≠ j) such that x and y are good points, {x, y} ∉ E(Gτ).
(3) For x ∈ Ci∗ and y ∈ Cj∗ (i ≠ j), x and y do not have any good point as a common neighbor.

Hence the threshold graph has the structure shown in Figure 4, where each Gi is a clique representing the set of good points in cluster i. This suggests the following algorithm for k-means clustering. Notice that unlike the algorithm for ǫ-separability, the algorithm for approximation stability mentioned below needs to know the values of the stability parameters α and ǫ.4
4 This is specifically for the goal of finding a clustering that nearly matches an unknown target clustering, because one may
not in general have a way to identify which of two proposed solutions is preferable. On the other hand, if the goal is to find a
solution of low cost, then one does not need to know α or ǫ: one can just try all possible values for dcrit in the algorithm and
take the solution of least total cost.

Algorithm k-means algorithm
Input: ǫ ≤ 1, α > 0, k.

1. Initialization: Define dcrit = √(αOPT/(50ǫn)).a
2. Construct the τ-threshold graph Gτ with τ = 2dcrit.
3. For j = 1 to k do:
   Pick the vertex vj of highest degree in Gτ.
   Remove vj and its neighborhood from Gτ and call this cluster C(vj).
4. Output: the k clusters C(v1), . . . , C(vk−1), S − ∪_{i=1}^{k−1} C(vi).

a For simplicity we assume here that one knows the value of OPT. If not, one can run a constant-factor approximation algorithm to produce a sufficiently good estimate.

Figure 4: The structure of the threshold graph: each Gi (for i = 1, . . . , k) is a clique on the good points of cluster i.
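A minimal Python/NumPy rendering of this threshold-graph algorithm is sketched below. It assumes dcrit is given, and it follows the variant in which the first k − 1 clusters are extracted greedily and all remaining points form the last cluster; names and bookkeeping details are illustrative rather than the exact implementation of [20].

import numpy as np

def threshold_graph_clustering(X, k, d_crit):
    """Build the tau-threshold graph with tau = 2*d_crit, then repeatedly remove
    the closed neighborhood of the highest-degree vertex as one cluster."""
    n = X.shape[0]
    tau = 2 * d_crit
    dist = np.sqrt(((X[:, None, :] - X[None, :, :])**2).sum(axis=2))
    adj = (dist < tau) & ~np.eye(n, dtype=bool)       # tau-threshold graph
    remaining = np.ones(n, dtype=bool)
    labels = np.full(n, k - 1)                        # leftover points form the last cluster
    for j in range(k - 1):
        deg = adj[:, remaining].sum(axis=1)           # degree restricted to remaining vertices
        deg[~remaining] = -1
        v = int(np.argmax(deg))                       # highest-degree remaining vertex
        cluster = remaining & (adj[v] | (np.arange(n) == v))
        labels[cluster] = j
        remaining &= ~cluster
    return labels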

The authors in [20] use the properties of the threshold graph to show that the greedy method of Step 3 of the algorithm produces an accurate clustering. In particular, if the vertex vj we pick is a good point in some cluster Ci, then we are guaranteed to extract the whole set Gi of good points in that cluster and potentially some bad points as well (see Figure 5(a)). If on the other hand the vertex vj we pick is a bad point, then we might extract only a part of a good set Gi and miss some good points in Gi, which might lead to some errors. (Note that by property (3) we never extract parts of two different good sets Gi and Gj.) However, since vj was picked to be the vertex of highest degree in Gτ, we are guaranteed to extract at least as many bad points as the number of missed good points in Gi (see Figure 5(b)). This then implies that overall we can charge the errors to the bad points, so the resulting clustering and the target clustering differ on at most O(ǫ/α) · n points, as desired.

5.6 Other notions of stability and relations between them

This notion of ǫ-separability is in fact related to (c, ǫ)-approximation-stability. Indeed, in Theorem 5.1 of their paper, [57] show that their ǫ-separatedness assumption implies that any near-optimal solution to k-means is O(ǫ²)-close to the k-means optimal clustering. However,

Figure 5: If the greedy algorithm chooses a good vertex vj as in (a), we get the entire good set of points from that cluster. If vj is a bad point as in (b), the missed good points can be charged to bad points.

the converse is not necessarily the case: an instance could satisfy approximation-stability without being ǫ-separated.5 [21] presents a specific example of points in Euclidean space with c = 2. In fact, for the case that k is much larger than 1/ǫ, the difference between the two properties can be more substantial; see Figure 6 for an example. In addition, algorithms for approximation stability have been successfully applied to clustering problems arising in computational biology [68] (see Section 5.8 for details).

[17] study center-based clustering objectives and define a notion of stability called α-weak deletion stability. A clustering instance is stable under this notion if, in the optimal clustering, merging any two clusters into one increases the cost by a multiplicative factor of (1 + α). This is a broad notion of stability that generalizes both the ǫ-separability notion studied in Section 5.1 and approximation stability in the case of large cluster sizes. Remarkably, [17] show that for such instances of k-median and k-means one can design a (1 + ǫ)-approximation algorithm for any ǫ > 0. This leads to immediate improvements over the works of [20] (for the case of large clusters) and of [57]. However, the runtime of the resulting algorithm depends polynomially on n and k and exponentially on the parameters 1/α and 1/ǫ, so the simpler algorithms of [17] and [20] are more suitable for scenarios where one expects the stronger properties to hold. See Section 5.8 for further discussion. [3] also study various notions of clusterability of a dataset and present algorithms for such stable instances.

Kumar and Kannan [49] consider the problem of recovering a target clustering under
deterministic separation conditions that are motivated by the k-means objective and by
Gaussian and related mixture models. They consider the setting of points in Euclidean
space, and show that if the projection of any data point onto the line joining the mean of
its cluster in the target clustering to the mean of any other cluster of the target is Ω(k)
standard deviations closer to its own mean than the other mean, then they can recover the
target clusters in polynomial time. This condition was further analyzed and reduced by
work of [18]. This separation condition is formally incomparable to approximation-stability
(even restricting to the case of k-means with points in Euclidean space). In particular,
5 [57] shows an implication in this direction (Theorem 5.2); however, this implication requires a substantially stronger

condition, namely that data satisfy (c, ǫ)-approximation-stability for c = 1/ǫ2 (and that target clusters be large). In contrast,
the primary interest of [21] is in the case where c is below the threshold for existence of worst-case approximation algorithms.

Figure 6: Suppose ǫ is a small constant, and consider a clustering instance in which the target consists of k = √n clusters with √n points each, such that all points in the same cluster have distance 1 and all points in different clusters have distance D + 1, where D is a large constant. Then, merging two clusters increases the cost additively by Θ(√n), since D is a constant. Consequently, the optimal (k − 1)-means/median solution is just a factor 1 + O(1/√n) more expensive than the optimal k-means/median clustering. However, for D sufficiently large compared to 1/ǫ, this example satisfies (2, ǫ)-approximation-stability or even (1/ǫ, ǫ)-approximation-stability; see [21] for formal details.

if the dimension is low and k is large compared to 1/ǫ, then this condition can require
more separation than approximation-stability (e.g., with k well-spaced clusters of unit radius
approximation-stability would require separation only O(1/ǫ) and independent of k – see [21]
for an example). On the other hand if the clusters are high-dimensional, then this condition
can require less separation than approximation-stability since the ratio of projected distances
will be more pronounced than the ratios of distances in the original space.
Bilu and Linial [25] consider inputs satisfying the condition that the optimal solution to the objective remains optimal even after bounded perturbations to the input weight matrix. This condition is known as perturbation resilience. Bilu and Linial [25] give an algorithm for a different clustering objective known as maxcut. The maxcut objective asks for a 2-partitioning of a graph such that the total number of edges going between the two pieces is maximized. The authors show that the maxcut objective is easy under the assumption that the optimal solution is stable to O(n^(2/3))-factor multiplicative perturbations to the edge weights. The work of Makarychev et al. [52] subsequently reduced the required resilience factor to O(√(log n)). In [18] the authors study perturbation resilience for center-based clustering objectives such as k-median and k-means, and give an algorithm that finds the optimal solution when the input is stable to only factor-3 perturbations. This factor is improved to 1 + √2 by [22], who also design algorithms under a relaxed (c, ǫ)-stability to perturbations condition in which the optimal solution need not be identical on the c-perturbed instance, but may change on an ǫ fraction of the points (in this case, the algorithms require c = 4). Note that for the k-median objective, (c, ǫ)-approximation-stability with respect to C∗ implies (c, ǫ)-stability to perturbations because an optimal solution in a c-perturbed instance is guaranteed to be a c-approximation on the original instance;6 so, (c, ǫ)-stability to perturbations is a weaker condition.
6 In particular, a c-perturbed instance d̃ satisfies d(x, y) ≤ d̃(x, y) ≤ c·d(x, y) for all points x, y. So, using Φ to denote cost in the original instance, Φ̃ to denote cost in the perturbed instance, and using C̃ to denote the optimal clustering under Φ̃, we have Φ(C̃) ≤ Φ̃(C̃) ≤ Φ̃(C∗) ≤ cΦ(C∗).

Similarly, for k-means, (c, ǫ)-stability to perturbations is implied by (c², ǫ)-approximation-stability. However, as noted above, the values of c known to lead to efficient clustering in the case of stability to perturbations are larger than for approximation-stability, where any constant c > 1 suffices.

5.7 Runtime Analysis

Below we provide the run time guarantees of the various algorithms discussed so far. While
these may be improved with appropriate data structures, we assume here a straightforward
implementation in which computing the distance between two data points takes time O(d),
as does adding or averaging two data points. For example, computing a step of Lloyd’s algo-
rithm requires assigning each of the n data points to its nearest center, which in turn requires
taking the minimum of k distances per data point (so O(nkd) time total), and then resetting
each center to the average of all data points assigned to it (so O(nd) time total). This gives
Lloyd’s algorithm a running time of O(nkd) per iteration. The k-means++ algorithm has
only a seed-selection step, which can be run in time O(nd) per seed by remembering the
minimum distances of each point to the previous seeds, so it has a total time of O(nkd).
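To make the O(nkd)-per-iteration accounting concrete, here is a straightforward NumPy sketch of one Lloyd step; the function name is illustrative, and the vectorized distance computation materializes an n × k table, which matches the asymptotic cost discussed above.

import numpy as np

def lloyd_step(X, centers):
    """One Lloyd iteration: O(nkd) to assign points to their nearest centers,
    O(nd) to reset each center to the mean of its assigned points."""
    d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(axis=2)   # n x k squared distances
    assign = d2.argmin(axis=1)
    new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                            else centers[j] for j in range(len(centers))])
    return new_centers, assign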
For the ǫ-separability algorithm, to obtain the sampling probabilities for the first two seeds one can compute all pairwise distances at a cost of O(n²d). Obtaining the rest of the seeds is faster since one only needs to compute distances to previous seeds, so this takes time O(ndk). Finally there is a greedy deletion procedure taking time O(ndk) per step for O(k) steps. So the overall time is O(n²d + ndk²).

For the approximation-stability algorithm, creating a graph of distances takes time O(n²d), after which creating the threshold graph takes time O(n²) if one knows the value of dcrit. For the rest of the algorithm, each step takes time O(n) to find the highest-degree vertex, and then time proportional to the number of edges examined to remove the vertex and its neighbors. Over the entire remainder of the algorithm this takes time O(n²) total. If the value of dcrit is not known, one can try O(n) values, taking the best solution. This gives an overall time of O(n³ + n²d).

Finally, for local search, one can first create a graph of distances in time O(n²d). Each local swap step has O(nk) pairs (x, y) to try, and for each pair one can compute its cost in time O(nk) by computing the minimum distance of each data point to the proposed k centers. So, the algorithm can be run in time O(n²k²) per iteration. The total number of iterations is at most poly(n),7 so the overall running time is at most O(n²d + n²k²·poly(n)). As can be seen from the table below, the algorithms become more and more computationally expensive if one needs formal guarantees on a larger instance space. For example, the local search algorithm provides worst-case approximation guarantees on all instances but is very slow. On the other hand, Lloyd's method and k-means++ are very fast but provide bad worst-case guarantees, especially when the number of clusters k is large. Algorithms based on stability notions aim to provide the best of both worlds by being fast and provably good on well-behaved instances. In the conclusion (Section 7) we outline a guideline for practitioners when working with the various clustering assumptions.
7 The actual number of iterations depends upon the cost of the initial solution and the stopping condition.

Method                     Runtime
Lloyd's                    O(nkd) × (#iterations)
k-means++                  O(nkd)
ǫ-separability             O(n²d + ndk²)
Approximation stability    O(n³ + n²d)
Local search               O(n²d + n²k²·poly(n))

Table 1: A run time analysis of the various algorithms discussed in the chapter. The running time degrades as one requires formal guarantees on larger instance spaces.

5.8 Extensions

Variants of the k-means objective


k-means clustering is the most popular method for vector quantization, which is used in encoding speech signals and data compression [39]. There are variants of the k-means algorithm, called fuzzy k-means, which allow each point to have a degree of membership in various clusters [24]. This modified k-means objective is popular for image segmentation [1, 64]. There have also been experiments on speeding up Lloyd's method by updating centers at each step using only a random sample of the entire dataset [37]. [26] present an empirical study of the convergence properties of Lloyd's method. Rasmussen [60] contains a discussion of k-means clustering for information retrieval. [35] present an empirical comparison of k-means and spectral clustering methods. [59] study a modified k-means objective with an additional penalty for the number of clusters chosen; they motivate the new objective as a way to solve the cluster selection problem. This approach is inspired by Bayesian model selection procedures [62]. For further details on the applications of k-means, refer to Chapters 1.2 and 2.3.

k-means++: Streaming and Parallel versions of k-means


As we saw in Section 2, careful seeding is crucial in order for Lloyd's method to succeed. One such method is proposed in the k-means++ algorithm. Using the seed centers output by k-means++, one can immediately guarantee an O(log k) approximation to the k-means objective. However, the seeding step of k-means++ is inherently sequential: it selects one seed point in each of k rounds. This makes it undesirable for use in applications involving massive datasets with thousands of clusters. This problem is overcome in [19], where the authors propose a scalable and parallel version of k-means++. The new algorithm runs in many fewer rounds and chooses more than one seed point at each step. The authors experimentally demonstrate that this leads to much better computational performance in practice without losing out on solution quality. In [6] the authors design an algorithm for k-means which makes a single pass over the data. This makes it much more suitable for applications where one needs to process data in the streaming model. The authors show that if one is allowed to store a little more than k centers (O(k log k)), then one can also achieve good approximation guarantees and at the same time have an extremely efficient algorithm. They experimentally demonstrate that the proposed method is much faster than known implementations of Lloyd's method. There has been subsequent work on improving the approximation factors and making the algorithms more practical [63].

Approximation Stability in practice


Motivated by clustering applications in computational biology, [68] analyze (c, ǫ)-approx-
imation-stability in a model with unknown distance information where one can only make a
limited number of one versus all queries. [68] design an algorithm that given (c, ǫ)-approx-

imation-stability for the k-median objective finds a clustering that is very close to the target
by using only O(k) one-versus-all queries in the large cluster case, and in addition is faster
than the algorithm we present here. In particular, the algorithm for the large clusters case
described in [20] (similar to the one we described in Section 5.4 for the k-means objective)
can be implemented in O(|S|³) time, while the one proposed in [68] runs in time O(|S|k(k + log |S|)). [68] use their algorithm to cluster biological datasets in the Pfam [38] and SCOP [56] databases, where the points are proteins and distances are inversely proportional to their sequence similarity. This setting nicely fits the one-versus-all query model because
one can use a fast sequence database search program to query a sequence against an entire
dataset. The Pfam [38] and SCOP [56] databases are used in biology to observe evolutionary
relationships between proteins and to find close relatives of particular proteins. [68] find
that for one of these sources they can obtain clusterings that almost exactly match the given
classification, and for the other the performance of their algorithm is comparable to that of
the best known algorithms using the full distance matrix.

6 Mixture Models

In the previous sections we saw worst case approximation algorithms for various clustering
objectives. We also saw examples of how assumptions on the nature of the optimal solution
can lead to much better approximation algorithms. In this section we will study a different
assumption on how the data is generated in the first place. In the machine learning literature,
such assumptions take the form of a probabilistic model for generating a clustering instance.
The goal is to cluster correctly (with high probability) an instance generated from the par-
ticular model. The most famous and well studied example of this is the Gaussian Mixture
Model (GMM)[46]. This will be the main focus of this section. We will illustrate conditions
under which datasets arising from such a mixture model can be provably clustered.

Gaussian Mixture Model. A univariate Gaussian random variable X with mean µ and variance σ² has the density function f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)). Similarly, a multivariate Gaussian random variable X ∈ ℜ^n has the density function

f(x) = (1/((2π)^(n/2) |Σ|^(1/2))) e^(−(1/2)(x−µ)ᵀ Σ⁻¹ (x−µ)).

Here µ ∈ ℜ^n is called the mean vector and Σ is the n × n covariance matrix. A special case is the spherical Gaussian, for which Σ = σ²In; here σ² refers to the variance of the Gaussian in any given direction. Consider k n-dimensional Gaussian distributions N(µ1, Σ1), N(µ2, Σ2), . . . , N(µk, Σk). A Gaussian mixture model M refers to the distribution obtained from a convex combination of such Gaussians. More specifically,

M = w1 N(µ1, Σ1) + w2 N(µ2, Σ2) + · · · + wk N(µk, Σk).

Here the wi ≥ 0 are called the mixing weights and satisfy Σi wi = 1. One can think of a point being generated from M by first choosing a component Gaussian i with probability wi, and then generating a point from the corresponding Gaussian distribution N(µi, Σi).
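This generative process is easy to simulate. The short NumPy sketch below (function name and return format are our own) draws m points from a mixture given its weights, means, and covariance matrices.

import numpy as np

def sample_gmm(weights, means, covs, m, rng=None):
    """Draw m points from M = sum_i w_i N(mu_i, Sigma_i): pick a component i with
    probability w_i, then draw from N(mu_i, Sigma_i)."""
    if rng is None:
        rng = np.random.default_rng()
    comps = rng.choice(len(weights), size=m, p=weights)
    X = np.array([rng.multivariate_normal(means[i], covs[i]) for i in comps])
    return X, comps        # comps are the hidden component labels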
Given a data set of m points coming from such a mixture model, a fairly natural question

is to recover the individual components of the mixture model. This is a clustering problem
where one wants to cluster the points into k clusters such that the points drawn from the
same Gaussian are in a single partition. Notice that unlike in the previous sections, the
algorithms designed for mixture models will have probabilistic guarantees. In other words,
we would like the clustering algorithm to recover, with high probability, the individual com-
ponents. Here the probability is over the draw of the m sample points. Another problem one
could ask is to approximate the parameters (mean, variance) of each individual component
Gaussian. This is known as the parameter estimation problem. It is easy to see that if one
could solve the clustering problem approximately optimally, then estimating the parameters
of each individual component is also easy. Conversely, after doing parameter estimation
one can easily compute the Bayes optimal clustering. To study the clustering problem, one
typically assumes separation conditions among the component Gaussians which limit the
amount of overlap between them. The most common among them is to assume that the
mean vectors of the component Gaussians are far apart. However, there are also scenarios when such separation conditions do not hold (consider two Gaussians which are aligned in an 'X' shape), yet the data can still be clustered well. In order to do this, one first does param-
eter estimation which needs much weaker assumptions. After estimating the parameters,
the optimal clustering can be recovered. This is an important reason to study parameter
estimation. In the next section we will see examples of some separation conditions and the
corresponding clustering algorithms that one can use. Later, we will also look at recent work
on parameter estimation under minimal separation conditions.

6.1 Clustering methods

In this section we will look at distance-based clustering algorithms for learning a mixture of Gaussians. For simplicity, we will start with the case of k spherical Gaussians in ℜ^n with means {µ1, µ2, · · · , µk} and variance Σ = σ²In. The algorithms we describe will work under the assumption that the means are far apart. We will call this the center separation property:
Definition 5 (Center Separation). A mixture of k identical spherical Gaussians satisfies center separation if for all i ≠ j,

∆i,j = ||µi − µj|| > βi,j σ.

The quantity βi,j typically depends on k, the number of clusters, n, the dimensionality of the dataset, and wmin, the minimum mixing weight. If the spherical Gaussians have different variances σi, the right-hand side is replaced by βi,j(σi + σj). For the case of general Gaussians, σi will denote the maximum variance of Gaussian i in any particular direction. One of the earliest results using center separation for clustering is by Dasgupta [32]. We will start with a simple condition, βi,j = C√n for some constant C > 4, and will also assume that wmin = Ω(1/k). Let's consider a typical point x from a particular Gaussian N(µi, σ²In). We have E[||X − µi||²] = E[Σ_{d=1}^{n} |xd − µi,d|²] = nσ². Now consider two typical points x and y from two different Gaussians N(µi, σ²In) and N(µj, σ²In). We have

E[||X − Y||²] = E[||X − µi + µi − µj − (Y − µj)||²]
             = E[||X − µi||²] + E[||Y − µj||²] + ||µi − µj||²
             ≥ 2nσ² + C²σ²n.

For C large enough (say C > 4), we will have that for any two typical points x, y in the same cluster, ||x − y||² ≤ 2σ²n, and for any two points in different clusters, ||x − y||² > 18σ²n. Using standard concentration bounds we can say that for a sample of size poly(n), with high probability, all points from a single Gaussian will be closer to each other than to points from other Gaussians. In this case one could simply create a graph by connecting any two points x, y such that ||x − y||² ≤ 2σ²n. It is easy to see that the connected components in this graph will correspond precisely to the individual components of the mixture model. If C is smaller, say 2, one needs the stronger concentration result [10] mentioned below.

Lemma 19. If x, y are picked independently from N(µi, σ²In), then with probability 1 − 1/n³,
||x − y||² ∈ [2σ²n(1 − 4√(log(n)/n)), 2σ²n(1 + 5√(log(n)/n))].

Also, as before, one can show that with high probability, for x and y from two different Gaussians, we have ||x − y||² > 2σ²n(1 + 4√(log(n)/n)). From this it follows that if r is the minimum squared distance between any two points in the sample, then for any x in Gaussian i and any y in the same Gaussian we have ||x − y||² ≤ (1 + 4.5√(log(n)/n)) r, while for a point z in any other Gaussian we have ||x − z||² > (1 + 4.5√(log(n)/n)) r. This suggests the following algorithm.
Algorithm Cluster Spherical Gaussians
1. Let D be the set of all sample points.
2. For i = 1 to k:
   (a) Let x0 and y0 be such that ||x0 − y0||² = r = min_{x,y∈D} ||x − y||².
   (b) Let T = {y ∈ D : ||x0 − y||² ≤ r(1 + 4.5√(log(n)/n))}.
   (c) Remove T from D and output T as one of the clusters.

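A direct NumPy rendering of this distance-based procedure is sketched below; the slack factor 1 + 4.5√(log(n)/n) follows the discussion above, and the function name and bookkeeping are illustrative choices.

import numpy as np

def cluster_spherical_gaussians(X, k):
    """Repeatedly take the closest remaining pair and collect every remaining point whose
    squared distance to it is within the slack factor of that minimum squared distance."""
    n = X.shape[0]
    slack = 1 + 4.5 * np.sqrt(np.log(n) / n)
    remaining = list(range(n))
    labels = np.full(n, -1)
    for j in range(k):
        if not remaining:
            break
        pts = X[remaining]
        d2 = ((pts[:, None, :] - pts[None, :, :])**2).sum(axis=2)
        np.fill_diagonal(d2, np.inf)
        i0 = np.unravel_index(np.argmin(d2), d2.shape)[0]    # one endpoint of the closest pair
        r = d2.min()
        members = [p for p in remaining if ((X[p] - pts[i0])**2).sum() <= slack * r]
        labels[members] = j
        remaining = [p for p in remaining if labels[p] == -1]
    return labels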
Handling smaller C

For smaller values of C, for example C < 1, one cannot in general say that the above strong concentration will hold true. In fact, in order to correctly classify the points, we might need to see points which are much closer to the center of a Gaussian (say at distance less than (1/2)σ√n). However, most of the mass of a Gaussian lies in a thin shell around radius σ√n. Hence, one might have to see exponentially many samples in order to get a good classification. Dasgupta [32] solves this problem by first projecting the data onto a random d = O(log(k)/ǫ²)-dimensional subspace. This has the effect that the center separation property is still preserved up to a factor of (1 − ǫ). One can now do distance-based clustering in this subspace, as the number of samples needed will be proportional to 2^d instead of 2^n.

General Gaussians. The results of Dasgupta were extended by Arora and Kannan [10] to the case of general Gaussians. They also managed to reduce the required separation between means. They assumed that βi,j = Ω(log(n))(Ri + Rj)(σi + σj). As mentioned before, σi denotes the maximum variance of Gaussian i in any direction. Ri denotes the median radius of Gaussian i.8 For the case of spherical Gaussians, this separation becomes Ω(n^(1/4) log(n)(σi + σj)).
8 The radius such that the probability mass within Ri equals 1/2.

Arora and Kannan use isoperimetric inequalities to get strong concentration results for such Gaussians. In particular they show the following.

Theorem 20. Given βi,j = Ω(log(n)(Ri + Rj)), there exists a polynomial time algorithm which, given at least m = n²k²/(δ² wmin⁶) samples from a mixture of k general Gaussians, solves the clustering problem exactly with probability (1 − δ).

Proof Intuition: The first step is to generalize Lemma 19 to the case of general Gaussians. In particular, one can show that if x, y are picked at random from a general Gaussian i with median radius Ri and maximum variance σi, then with high probability

2Ri² − 18 log(n) σi Ri ≤ ||x − y||² ≤ 2(Ri + 20 log(n) σi)².

Similarly, for x, y from different Gaussians i and j, we have with high probability

||x − y||² > 2 min(Ri², Rj²) + 120 log(n)(σi + σj)(Ri + Rj) + Ω((log(n))²(σi² + σj²)).

The above concentration results imply (w.h.p.) that pairwise distances between points from the same Gaussian i lie in an interval Ii, and distances between points from Gaussians i and j lie in an interval Ii,j. Furthermore, Ii,j will be disjoint from the interval corresponding to the Gaussian with the smaller value of Ri. In particular, if one looks at balls of increasing radius around a point from the Gaussian with minimum radius, there will be a stage at which there is a gap: increasing the radius slightly does not include any more points. From the above lemmas, this gap will be roughly Ω(σi). Hence, at this stage, we can remove this Gaussian from the data and recurse. This property suggests the following algorithm outline.

Algorithm Cluster General Gaussians

1. Let r be the smallest radius such that |B(x, r)| > (3/4) wmin |S| for some x ∈ D.
   Here |S| is the size of the dataset and B(x, r) denotes the ball of radius r around x.
2. Let σ denote the maximum variance of the Gaussian with the least radius.
   Let γ = O(√wmin σ).
3. While D is non-empty:
   (a) Let s be such that |B(x, r + sγ) ∩ D| = |B(x, r + (s − 1)γ) ∩ D|.
   (b) Remove from D the set T containing all the points which are in B(x, r + sγ log(n)).
   (c) Output: T as one of the clusters.

One point to mention is that one does not really know beforehand the value of σ at
each iteration. Arora and Kannan [10] get around this by estimating the variance from the
data in the ball B(x, r). They then show that this estimate is good enough for the algorithm
to work.

6.2 Spectral Algorithms

The algorithms mentioned in the above section need the center separation to grow polyno-
mially with n. This is prohibitively large especially in cases when k ≪ n. In this section,
we look at how spectral techniques can be used to only require the separation to grow with
k instead of n.
Algorithmic Intuition. In order to remove the dependence on n we would like to project the data such that points from the same Gaussian become much closer while still maintaining the large separation between means. One idea is to do a random projection. However, random projections from n to d dimensions scale each squared distance equally (by a factor d/n) and will not give us any advantage. Instead, consider the case of two spherical Gaussians with means µ1 and µ2 and variance σ²In, and consider projecting all the points onto the line joining µ1 and µ2. Now consider any random point x from the first Gaussian. For any unit vector v along the line joining µ1 and µ2 we have that (x − µ1)·v behaves like a 1-dimensional Gaussian with mean 0 and variance σ². Hence the expected squared distance of a point x from its mean becomes σ². This means that for any two points in the same Gaussian, the expected squared distance becomes 4σ² (as opposed to 2nσ²). However, the distance between the means remains the same. In fact the above claim is true if we project onto any subspace containing the means. This subspace is exactly characterized by the Singular Value Decomposition (SVD) of the data matrix. This suggests the following algorithm.

Algorithm Spectral Clustering
1. Compute the SVD decomposition of the data.
2. Project the data onto the space of the top-k right singular vectors.
3. Run a distance-based clustering method in this projected space.
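The projection step amounts to a single SVD. A minimal NumPy sketch (the function name is ours) is given below; the k-dimensional coordinates it returns can then be fed to any distance-based clustering routine, such as the one for spherical Gaussians above.

import numpy as np

def spectral_project(X, k):
    """Project the rows of the data matrix onto the span of its top-k right singular vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # rows of Vt are the right singular vectors
    return X @ Vt[:k].T                                # n x k coordinates in the top-k subspace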

Such spectral algorithms were proposed by Vempala and Wang [67], who reduced the separation for spherical Gaussians to βi,j = Ω(k^(1/4) (log(n/wmin))^(1/4)). The case of general Gaussians was studied in [2], who give efficient clustering algorithms for βi,j = 1/√(min(wi, wj)) + √(k log(k·min(2^k, n))). [45] give algorithms for general Gaussians for βi,j = k^(3/2)/wmin².

6.3 Parameter Estimation

In the previous sections we looked at the problem of clustering points from a Gaussian
Mixture Model. Another important problem is that of estimating the parameters of the
component Gaussians. These parameters refer to the mixture weights wi ’s, mean vectors
µi's and the covariance matrices Σi's. As mentioned before, if one could efficiently get a good clustering, then the parameter estimation problem is solved by simply producing empirical estimates from the corresponding clusters. However, there could be scenarios when it is not possible to produce a good clustering. For example, consider two one-dimensional Gaussians with mean 0 and variances σ² and 2σ². These Gaussians have a large overlap, and any clustering method will inherently have a large error. On the other hand, let's look at the statistical distance between the two Gaussians, i.e., ∫ |f1(x) − f2(x)| dx. This measures

how much the two distributions differ. In this example the Gaussian with the higher variance places noticeably more mass in the tails than the other, so the statistical distance between the two components is bounded away from zero. This suggests that, information-theoretically, one should be able to estimate the parameters of the two components. In this section, we will look at some recent work of Kalai, Moitra, and Valiant [44] and Moitra and Valiant [55] on efficient algorithms for estimating the parameters of a Gaussian mixture model. These works make minimal assumptions on the nature of the data, namely, that the component Gaussians have noticeable statistical distance. Similar results were proven in [23], who also gave algorithms for more general distributions.
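The statistical distance in question is easy to evaluate numerically. The sketch below (using SciPy; names are ours) computes ∫ |f1(x) − f2(x)| dx for two one-dimensional Gaussians; for the σ² versus 2σ² example above it returns roughly 0.33 under this convention (the usual total variation distance carries an extra factor of 1/2).

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def statistical_distance(f1, f2, lo=-50.0, hi=50.0):
    """Numerically evaluate the integral of |f1(x) - f2(x)| over [lo, hi]."""
    val, _ = quad(lambda x: abs(f1.pdf(x) - f2.pdf(x)), lo, hi)
    return val

# two zero-mean Gaussians with variances sigma^2 and 2*sigma^2 (here sigma = 1)
print(statistical_distance(norm(0, 1), norm(0, np.sqrt(2))))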
The case of two Gaussians:
We will first look at the case of 2 Gaussians in ℜ^n. We will assume that the statistical distance between the Gaussians, D(N1, N2), is noticeable, i.e., ∫ |f1(x) − f2(x)| dx > α. Kalai et al. [44] show the following theorem.

Theorem 21. Let M = w1 N1(µ1, Σ1) + w2 N2(µ2, Σ2) be an isotropic GMM where D(N1, N2) > α. Then there is an algorithm which outputs M′ = w1′ N1′(µ1′, Σ1′) + w2′ N2′(µ2′, Σ2′) such that for some permutation π : {1, 2} → {1, 2} we have

|wi − w′π(i)| ≤ ǫ,    ||µi − µ′π(i)|| ≤ ǫ,    ||Σi − Σ′π(i)|| ≤ ǫ.

The algorithm runs in time poly(n, 1/ǫ, 1/α, 1/w1, 1/w2).

The condition on the mixture being isotropic is necessary to recover a good additive
approximation for the means and the variances since otherwise, one could just scale the data
and the estimates will scale proportionately.
Reduction to a one dimensional problem
In order to estimate the mixture parameters, Kalai et al. reduce the problem to a series of one-dimensional learning problems. Consider an arbitrary unit vector v. Suppose we project the data onto the direction of v, and let the means of the Gaussians in this projected space be µ1′ and µ2′. Then we have µ1′ = E[x·v] = µ1·v. Hence, the parameters of the original mean vector are linearly related to the mean in the projected space. Similarly, let's perturb v to get v′ = v + ǫ(ei + ej). Here ei and ej denote the basis vectors corresponding to coordinates i and j. Let σ1′² be the variance of the Gaussian in the projected space v′. Then writing σ1′² = E[(x·v′)²] and expanding, we get that E[xi xj] will be linearly related to σ1′², σ1² and the µi's. Hence, by estimating the parameters correctly over a series of n² one-dimensional projections, one can efficiently recover the original parameters (by solving a system of linear equations).
Solving the one dimensional problem
The one-dimensional problem is solved by the method of moments. In particular, define Li[M] to be the ith moment of the mixture model M, i.e., Li[M] = E_{x∼M}[x^i]. Also define L̂i to be the empirical ith moment of the data. The algorithm in [44] does a brute-force search over the parameter space for the two Gaussians and, for a given candidate model M′, computes the first 6 moments. If all the moments are within ǫ of the empirical moments, then

the analysis in [44] shows that the parameters will be ǫ^(1/67)-close to the parameters of the two Gaussians. The same claim is also true for learning a mixture of k one-dimensional Gaussians if one goes up to (4k − 2) moments [55]. The search space, however, will be exponential in k. It is shown in [55] that for learning k one-dimensional Gaussians, this exponential dependence is unavoidable.
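The moment-matching test at the heart of this brute-force search is simple to write down. The sketch below (function names are ours; SciPy's norm.moment supplies the raw moments of a Gaussian) computes the first six empirical and model moments for a candidate two-component mixture and checks whether they agree to within ǫ.

import numpy as np
from scipy.stats import norm

def empirical_moments(x, num=6):
    """First `num` raw moments of a one-dimensional sample."""
    return np.array([np.mean(x**i) for i in range(1, num + 1)])

def mixture_moments(w1, m1, s1, m2, s2, num=6):
    """Raw moments of the candidate mixture w1*N(m1, s1^2) + (1 - w1)*N(m2, s2^2)."""
    return np.array([w1 * norm(m1, s1).moment(i) + (1 - w1) * norm(m2, s2).moment(i)
                     for i in range(1, num + 1)])

def candidate_matches(x, w1, m1, s1, m2, s2, eps):
    """The acceptance test used inside the brute-force search over candidate parameters."""
    return bool(np.all(np.abs(empirical_moments(x) - mixture_moments(w1, m1, s1, m2, s2)) <= eps))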
Solving the labeling problem
As noted above, the learning algorithm solves n² one-dimensional problems and gets parameter estimates for the two Gaussians in each of them. In order to solve for the parameters of the original Gaussians, we need to identify, for each Gaussian, the corresponding n² parameter estimates across the subproblems. Kalai et al. do this by arguing that if one projects the two Gaussians onto a random direction v, then with high enough probability the corresponding parameters of the two projected Gaussians will differ by poly(α). Hence, if one takes small random perturbations of this vector v, the corresponding parameter estimates will be easily distinguishable.

The overall algorithm has the following structure

Algorithm Learning a mixture of two Gaussians

1. Choose a random vector v and choose n² random perturbations vi,j.
2. For each i, j, project the data onto vi,j and solve the one-dimensional problem using the method of moments.
3. Solve the labeling problem to identify the n² parameter sets corresponding to a single Gaussian.
4. Solve a system of linear equations on this parameter set to obtain the original parameters.

Figure 7: The case of 3 Gaussians.


For the case of more than 2 Gaussians, Moitra and Valiant [55] extend the ideas mentioned
above to provide an algorithm for estimating the parameters of a mixture of k Gaussians.
For the case of k Gaussians, additional complications arise as it is not true anymore that
projecting the k Gaussians to a random 1-dimensional subspace, maintains the statistical
distance. For example, consider Figure 7. Here, projecting the data onto a random direction,

will almost surely collapse components 2 and 3. [55] solve this problem by first running a
clustering algorithm to separate components 2 and 3 from component 1 and recursively
solving the two sub-instances. Once 2 and 3 have been separated, one can scale the space to ensure that they remain separated under a random projection. The algorithm from [55] has sample complexity which depends exponentially on k; they also show that this dependence is necessary. One could use the algorithm from [55] to also cluster the points into component Gaussians under minimal assumptions. The sample complexity, however, will depend exponentially on k. In contrast, one could use algorithms from the previous sections to cluster in polynomial time under stronger separation assumptions. The work of [41, 9] removes the exponential dependence on k and designs polynomial time algorithms for clustering data from a GMM under minimal separation, assuming only that the mean vectors span a k-dimensional subspace. However, their algorithm, which is based on tensor decompositions, only works in the case when all the component Gaussians are spherical. It is an open question to get similar results for general Gaussians. There has also been work on clustering points from mixtures of other distributions. [31, 30] gave algorithms for clustering a mixture of heavy-tailed distributions. [27] gave algorithms for clustering a mixture of 2 Gaussians assuming only that the two distributions are separated by a hyperplane. The recent work of [49] studies a deterministic separation condition on a set of points and shows that any set of points satisfying this condition can be clustered accurately. Using this they easily derive many previously known results for clustering mixtures of Gaussians as corollaries.

7 Conclusion

In this chapter we presented a selection of recent work on clustering problems in the computer
science community. As is evident, the focus of all these works is on providing efficient
algorithms with rigorous guarantees for various clustering problems. In many cases, these
guarantees depend on the specific structure and properties of the instance at hand which are
captured by stability assumptions and/or distributional assumptions. The study of different stability assumptions also provides insights into the structural properties of real-world data and in some cases also leads to practically useful algorithms [68]. As discussed in Section 5.6
different assumptions are suited for different kinds of data and they relate to each other in
interesting ways. For instance, perturbation resilience is a much weaker assumption than
both ǫ-separability and approximation stability. However, we have algorithms with much
stronger guarantees for the latter two. As a practitioner one is often torn between using
algorithms with formal guarantees (which are typically slower) vs. fast heuristics like the
Lloyd’s method. When dealing with data which may satisfy any of the stability notions
proposed in this chapter, a general rule of thumb we suggest is to run the algorithms proposed
in this chapter on a smaller random subset of the data and use the solution obtained to
initialize fast heuristics like the Lloyd’s method. Current research on clustering algorithms
continues to explore more realistic notions of data stability and their implications for practical
clustering scenarios.

References
[1] Gibbs random fields, fuzzy clustering, and the unsupervised segmentation of textured
images. CVGIP: Graphical Models and Image Processing, 55(1):1 – 19, 1993.

[2] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In
Proceedings of the Eighteenth Annual Conference on Learning Theory, 2005.
[3] Margareta Ackerman and Shai Ben-David. Clusterability: A theoretical study. Journal
of Machine Learning Research - Proceedings Track, 5, 2009.
[4] Manu Agarwal, Ragesh Jaiswal, and Arindam Pal. k-means++ under approximation
stability. The 10th annual conference on Theory and Applications of Models of Compu-
tation, 2013.
[5] Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means
clustering. In Proceedings of the 12th International Workshop and 13th International
Workshop on Approximation, Randomization, and Combinatorial Optimization. Algo-
rithms and Techniques, APPROX ’09 / RANDOM ’09, 2009.
[6] N. Ailon, R. Jaiswal, and C. Monteleoni. Streaming k-means approximation. In Ad-
vances in Neural Information Processing Systems, 2009.
[7] Paola Alimonti. Non-oblivious local search for graph and hypergraph coloring problems.
In Graph-Theoretic Concepts in Computer Science, Lecture Notes in Computer Science.
1995.
[8] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning.
[9] Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgar-
sky. Tensor decompositions for learning latent variable models. Technical report,
http://arxiv.org/abs/1210.7559, 2012.
[10] S. Arora and R. Kannan. Learning mixtures of arbitrary gaussians. In Proceedings of
the 37th ACM Symposium on Theory of Computing, 2005.
[11] S. Arora, P. Raghavan, and S. Rao. Approximation schemes for Euclidean k-medians
and related problems. In Proceedings of the Thirty-First Annual ACM Symposium on
Theory of Computing. 1999.
[12] Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach. Cam-
bridge University Press, 2009.
[13] David Arthur, Bodo Manthey, and Heiko Röglin. Smoothed analysis of the k-means
method. Journal of the ACM, 58(5), October 2011.
[14] David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In Proceedings
of the twenty-second annual symposium on Computational geometry, 2006.
[15] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding.
In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms,
2007.
[16] V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local
search heuristics for k-median and facility location problems. SIAM Journal on Com-
puting, 33(3):544–562, 2004.
[17] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Stability yields a PTAS for k-median
and k-means clustering. In Proceedings of the 2010 IEEE 51st Annual Symposium on
Foundations of Computer Science, 2010.
[18] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under pertur-
bation stability. Information Processing Letters, 112(1-2), January 2012.
[19] B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable k-
means++. In Proceedings of the 38th International Conference on Very Large Databases,
2012.
[20] M.-F. Balcan, A. Blum, and A. Gupta. Approximate clustering without the approxi-
mation. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2009.
[21] M.-F. Balcan, A. Blum, and A. Gupta. Clustering under approximation stability. In
Journal of the ACM, 2013.
[22] Maria-Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. Pro-
ceedings of the 39th International Colloquium on Automata, Languages and Program-

ming, 2012.
[23] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In
Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science,
2010.
[24] James C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms.
Kluwer Academic Publishers, Norwell, MA, USA, 1981.
[25] Yonatan Bilu and Nathan Linial. Are stable instances easy? In Proceedings of the First
Symposium on Innovations in Computer Science, 2010.
[26] Léon Bottou and Yoshua Bengio. Convergence properties of the k-means algorithms. In Advances in Neural Information Processing Systems 7, pages 585–592. MIT Press, 1995.
[27] Spencer Charles Brubaker and Santosh Vempala. Isotropic PCA and affine-invariant
clustering. In Proceedings of the 2008 49th Annual IEEE Symposium on Foundations
of Computer Science, 2008.
[28] Barun Chandra, Howard Karloff, and Craig Tovey. New results on the old k-opt algo-
rithm for the tsp. In Proceedings of the fifth annual ACM-SIAM symposium on Discrete
algorithms, 1994.
[29] M. Charikar, S. Guha, E. Tardos, and D. B. Shmoy. A constant-factor approximation
algorithm for the k-median problem. In Proceedings of the Thirty-First Annual ACM
Symposium on Theory of Computing, 1999.
[30] Kamalika Chaudhuri and Satish Rao. Beyond gaussians: Spectral methods for learning
mixtures of heavy-tailed distributions. In Proceedings of the 21st Annual Conference on
Learning Theory, 2008.
[31] Kamalika Chaudhuri and Satish Rao. Learning mixtures of product distributions us-
ing correlations and independence. In Proceedings of the 21st Annual Conference on
Learning Theory, 2008.
[32] S. Dasgupta. Learning mixtures of gaussians. In Proceedings of The 40th Annual
Symposium on Foundations of Computer Science, 1999.
[33] S. Dasgupta. The hardness of k-means clustering. Technical report, University of
California, San Diego, 2008.
[34] W. Fernandez de la Vega, Marek Karpinski, Claire Kenyon, and Yuval Rabani. Approximation schemes for clustering problems. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, 2003.
[35] Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph
partitioning. In Proceedings of the seventh ACM SIGKDD international conference on
Knowledge discovery and data mining, 2001.
[36] Doratha E. Drake and Stefan Hougardy. Linear time local improvements for weighted
matchings in graphs. In Proceedings of the 2nd international conference on Experimental
and efficient algorithms, 2003.
[37] Vance Faber. Clustering and the Continuous k-Means Algorithm. 1994.
[38] R.D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J.E. Pollington, O.L. Gavin, P. Gunesekaran, G. Ceric, K. Forslund, L. Holm, E.L. Sonnhammer, S.R. Eddy, and A. Bateman. The Pfam protein families database. Nucleic Acids Research, 38:D211–222, 2010.
[39] Allen Gersho and Robert M. Gray. Vector quantization and signal compression. Kluwer
Academic Publishers, Norwell, MA, USA, 1991.
[40] Pierre Hansen and Brigitte Jaumard. Algorithms for the maximum satisfiability prob-
lem. Computing, 1990.
[41] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical gaussians: moment
methods and spectral decompositions. In Proceedings of the 4th Innovations in Theo-
retical Computer Science Conference, 2013.
[42] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted voronoi diagrams
and randomization to variance-based k-clustering: (extended abstract). In Proceedings

of the tenth annual symposium on Computational geometry, 1994.
[43] K. Jain, M. Mahdian, and A. Saberi. A new greedy approach for facility location
problems. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing,
2002.
[44] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures
of two gaussians. In Proceedings of the 42th ACM Symposium on Theory of Computing,
2010.
[45] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture
models. In Proceedings of The Eighteenth Annual Conference on Learning Theory, 2005.
[46] Ravi Kannan and Santosh Vempala. Spectral algorithms. Foundations and Trends in
Theoretical Computer Science, 4(3-4), 2009.
[47] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth
Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means
clustering. In Proceedings of the eighteenth annual symposium on Computational geom-
etry, New York, NY, USA, 2002. ACM.
[48] A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1 + ǫ)-approximation
algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual
IEEE Symposium on Foundations of Computer Science, Washington, DC, USA, 2004.
[49] Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means
algorithm. In Proceedings of the 51st Annual IEEE Symposium on Foundations of
Computer Science, 2010.
[50] Shi Li and Ola Svensson. Approximating k-median via pseudo-approximation. In Pro-
ceedings of the 45th ACM Symposium on Theory of Computing, 2013.
[51] S.P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inform. Theory,
28(2):129–137, 1982.
[52] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Bilu-Linial stable instances of max cut and minimum multiway cut. In SODA, pages 890–906. SIAM, 2014.
[53] N. Megiddo and K. Supowit. On the complexity of some common geometric location
problems. SIAM Journal on Computing, 13(1):182–196, 1984.
[54] Marina Meilă. The uniqueness of a good optimum for K-means. In Proceedings of the
International Machine Learning Conference, pages 625–632, 2006.
[55] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures
of gaussians. In Proceedings of the 51st Annual IEEE Symposium on Foundations of
Computer Science, 2010.
[56] A.G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540, 1995.
[57] R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of lloyd-type
methods for the k-means problem. In Proceedings of the 47th Annual IEEE Symposium
on Foundations of Computer Science, 2006.
[58] Christos H. Papadimitriou. On selecting a satisfying truth assignment (extended ab-
stract). In Proceedings of the 32nd annual symposium on Foundations of computer
science, 1991.
[59] Dan Pelleg and Andrew W. Moore. X-means: Extending k-means with efficient es-
timation of the number of clusters. In Proceedings of the Seventeenth International
Conference on Machine Learning, 2000.
[60] Edie M. Rasmussen. Clustering algorithms. In Information Retrieval: Data Structures
& Algorithms, pages 419–442. 1992.
[61] Petra Schuurman and Tjark Vredeveld. Performance guarantees of local search for
multiprocessor scheduling. INFORMS J. on Computing, 2007.
[62] Gideon Schwarz. Estimating the Dimension of a Model. The Annals of Statistics,

6(2):461–464, 1978.
[63] Michael Shindler, Alex Wong, and Adam Meyerson. Fast and accurate k-means for
large datasets. In Proceedings of the 25th Annual Conference on Neural Information
Processing Systems, 2011.
[64] A. H. S. Solberg, T. Taxt, and A.K. Jain. A Markov random field model for classification of multisource satellite imagery. IEEE Transactions on Geoscience and Remote Sensing, 1996.
[65] Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the
simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3), May 2004.
[66] Andrea Vattani. k-means requires exponentially many iterations even in the plane. In
Proceedings of the 25th annual symposium on Computational geometry, 2009.
[67] S. Vempala and G. Wang. A spectral algorithm for learning mixture models. Journal
of Computer and System Sciences, 68(2):841–860, 2004.
[68] K. Voevodski, M. F. Balcan, H. Roeglin, S. Teng, and Y. Xia. Efficient clustering with
limited distance information. In Proceedings of the 26th Conference on Uncertainty in
Artificial Intelligence, 2010.
[69] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan,
A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg. The
top ten algorithms in data mining. Knowledge and Information Systems, 2008.

Two faces of active learning
Sanjoy Dasgupta
[email protected]

Abstract
An active learner has a collection of data points, each with a label that is initially hidden but can be
obtained at some cost. Without spending too much, it wishes to find a classifier that will accurately map
points to labels. There are two common intuitions about how this learning process should be organized:
(i) by choosing query points that shrink the space of candidate classifiers as rapidly as possible; and (ii) by
exploiting natural clusters in the (unlabeled) data set. Recent research has yielded learning algorithms for
both paradigms that are efficient, work with generic hypothesis classes, and have rigorously characterized
labeling requirements. Here we survey these advances by focusing on two representative algorithms and
discussing their mathematical properties and empirical performance.

1 Introduction
As digital storage gets cheaper, and sensing devices proliferate, and the web grows ever larger, it gets easier
to amass vast quantities of unlabeled data – raw speech, images, text documents, and so on. But to build
classifiers from these data, labels are needed, and obtaining them can be costly and time consuming. When
building a speech recognizer for instance, the speech signal comes cheap but thereafter a human must examine
the waveform and label the beginning and end of each phoneme within it. This is tedious and painstaking,
and requires expertise.
We will consider situations in which we are given a large set of unlabeled points from some domain X ,
each of which has a hidden label, from a finite set Y, that can be queried. The idea is to find a good classifier,
a mapping h : X → Y from a pre-specified set H, without making too many queries. For instance, each
x ∈ X might be the description of a molecule, with its label y ∈ {+1, −1} denoting whether or not it binds
to a particular target of interest. If the x’s are vectors, a possible choice of H is the class of linear separators.
In this setting, a supervised learner would query a random subset of the unlabeled data and ignore the
rest. A semisupervised learner would do the same, but would keep around the unlabeled points and use them
to constrain the choice of classifier. Most ambitious of all, an active learner would try to get the most out
of a limited budget by choosing its query points in an intelligent and adaptive manner (Figure 1).

1.1 A model for analyzing sampling strategies


The practical necessity of active learning has resulted in a glut of different querying strategies over the past
decade. These differ in detail, but very often conform to the following basic paradigm:

Start with a pool of unlabeled data S ⊂ X
Pick a few points from S at random and get their labels
Repeat:
    Fit a classifier h ∈ H to the labels seen so far
    Query the unlabeled point in S closest to the boundary of h
    (or most uncertain, or most likely to decrease overall uncertainty, ...)

This high level scheme (Figure 2) has a ready and intuitive appeal. But how can it be analyzed?

Figure 1: Each circle represents an unlabeled point, while + and − denote points of known label. Left: raw
and cheap – a large reservoir of unlabeled data. Middle: supervised learning picks a few points to label and
ignores the rest. Right: Semisupervised and active learning get more use out of the unlabeled pool, by using
them to constrain the choice of classifier, or by choosing informative points to label.

Figure 2: A typical active learning strategy chooses the next query point near the decision boundary obtained
from the current set of labeled points. Here the boundary is a linear separator, and there are several unlabeled
points close to it that would be candidates for querying.

[Figure 3 shows four groups of mass 45%, 5%, 5%, and 45% on the line, with the thresholds w∗ and w marked.]

Figure 3: An illustration of sampling bias in active learning. The data lie in four groups on the line, and are
(say) distributed uniformly within each group. The two extremal groups contain 90% of the distribution.
Solids have a + label, while stripes have a − label.

To do so, we shall consider learning problems within the framework of statistical learning theory. In this
model, there is an unknown, underlying distribution P from which data points (and their hidden labels) are
drawn independently at random. If X denotes the space of data and Y the labels, this P is a distribution
over X × Y. Any classifier we build is evaluated in terms of its performance on P.
In a typical learning problem, we choose classifiers from a set of candidate hypotheses H. The best such
candidate, h∗ ∈ H, is by definition the one with smallest error on P, that is, with smallest

err(h) = P[h(X) ≠ Y ].

Since P is unknown, we cannot perform this minimization ourselves. However, if we have access to a sample
of n points from P, we can choose a classifier hn that does well on this sample. We hope, then, that hn → h∗
as n grows. If this is true, we can also talk about the rate of convergence of err(hn ) to err(h∗ ).
A special case of interest is when h∗ makes no mistakes: that is, h∗ (x) = y for all (x, y) in the support
of P. We will call this the separable case and will frequently use it in preliminary discussions because it is
especially amenable to analysis. All the algorithms we describe here, however, are designed for the more
realistic nonseparable scenario.

1.2 Sampling bias


When we consider the earlier querying scheme within the framework of statistical learning theory, we imme-
diately run into the special difficulty of active learning: sampling bias.
The initial set of random samples from S, with its labeling, is a good reflection of the underlying data
distribution P. But as training proceeds, and points are queried based on increasingly confident assessments
of their informativeness, the training set looks less and less like P. It consists of an unusual subset of points,
hardly a representative subsample; why should a classifier trained on these strange points do well on the
overall distribution?
To make this intuition concrete, let’s consider the simple one-dimensional data set depicted in Figure 3.
Most of the data lies in the two extremal groups, so an initial random sample has a good chance of coming
entirely from these. Suppose the hypothesis class consists of thresholds on the line: H = {hw : w ∈ R} where


hw (x) = +1 if x ≥ w, and −1 if x < w.

Then the initial boundary will lie somewhere in the center group, and the first query point will lie in
this group. So will every subsequent query point, forever. As active learning proceeds, the algorithm will
gradually converge to the classifier shown as w. But this has 5% error, whereas classifier w∗ has only 2.5%
error. Thus the learner is not consistent: even with infinitely many labels, it returns a suboptimal classifier.
The problem is that the second group from the left gets overlooked. It is not part of the initial random
sample, and later on, the learner is mistakenly confident that the entire group has a − label. And this is just
in one dimension; in high dimension, the problem can be expected to be worse, since there are more places


Figure 4: Two faces of active learning. Left: In the case of binary labels, each data point x cuts the hypothesis
space H into two pieces: the hypotheses that label it +, and those that label it −. If data is separable, one
of these two pieces can be discarded once the label of x is known. A series of well-chosen query points could
rapidly shrink H. Right: If the unlabeled points look like this, perhaps we just need five labels.

for this troublesome group to be hiding out. For a discussion of this problem in text classification, see the
paper of Schutze et al. [17].
Sampling bias is the most fundamental challenge posed by active learning. In this paper, we will deal
exclusively with learning strategies that are provably consistent and we will analyze their label complexity:
the number of labels queried in order to achieve a given rate of accuracy.

1.3 Two faces of active learning


Assuming sampling bias is correctly handled, how exactly is active learning helpful? The recent literature
offers two distinct narratives for explaining this. The first has to do with efficient search through the
hypothesis space. Each time a new label is seen, the current version space – the set of classifiers that are still
“in the running” given the labels seen so far – shrinks somewhat. It is reasonable to try to explicitly select
points whose labels will shrink this version space as fast as possible (Figure 4, left). Most theoretical work
in active learning attempts to formalize this intuition. There are now several learning algorithms which, on
a variety of canonical examples, provably yield significantly lower label complexity than supervised learning
[9, 13, 10, 2, 3, 7, 15, 16, 12]. We will use one particular such scheme [12] as an illustration of the general
theory.
The second argument for active learning has to do with exploiting cluster structure in data. Suppose, for
instance, that the unlabeled points form five nice clusters (Figure 4, right); with luck, these clusters will be
pure in their class labels and only five labels will be necessary! Of course, this is hopelessly optimistic. In
general, there may be no nice clusters, or there may be viable clusterings at many different resolutions. The
clusters themselves may only be mostly-pure, or they may not be aligned with labels at all. An ideal scheme
would do something sensible for any data distribution, but would especially be able to detect and exploit
any cluster structure (loosely) aligned with class labels, at whatever resolution. This type of active learning
is a bit of an unexplored wilderness as far as theory goes. We will describe one preliminary piece of work [11]
that solves some of the problems, clarifies some of the difficulties, and opens some doors to further research.

2 Efficient search through hypothesis space


The canonical example here is when the data lie on the real line and the hypotheses H are thresholds. How
many labels are needed to find a hypothesis h ∈ H whose error on the underlying distribution P is at most
some ǫ?

                Separable data                General (nonseparable) data
Aggressive      Query by committee [13]
                Splitting index [10]
Mellow          Generic mellow learner [9]    A2 algorithm [2]
                                              Disagreement coefficient [15]
                                              Reduction to supervised [12]
                                              Importance-weighted approach [5]

Figure 5: Some of the key results on active learning within the framework of statistical learning theory.
The splitting index and disagreement coefficient are parameters of a learning problem that control the label
complexity of active learning. The other entries of the table are all learning algorithms.

In supervised learning, such issues are well understood. The standard machinery of sample complexity
[6] tells us that if the data are separable—that is, if they can be perfectly classified by some hypothesis
in H—then we need approximately 1/ǫ random labeled examples from P, and it is enough to return any
classifier consistent with them.
Now suppose we instead draw 1/ǫ unlabeled samples from P:

If we lay these points down on the line, their hidden labels are a sequence of −’s followed by a sequence
of +’s, and the goal is to discover the point w at which the transition occurs. This can be accomplished
with a binary search which asks for just log 1/ǫ labels: first ask for the label of the median point; if it’s +,
move to the 25th percentile point, otherwise move to the 75th percentile point; and so on. Thus, for this
hypothesis class, active learning gives an exponential improvement in the number of labels needed, from 1/ǫ
to just log 1/ǫ. For instance, if supervised learning requires a million labels, active learning requires just
log 1,000,000 ≈ 20, literally!
This toy example is only for separable data, but with a little care something similar can be achieved for
the nonseparable case. It is a tantalizing possibility that even for more complicated hypothesis classes H, a
sort of generalized binary search is possible.
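
To make the binary search concrete, here is a minimal Python sketch for the separable threshold case. The uniform data, the hidden threshold at 0.5, and the pool size are illustrative assumptions of ours, not anything prescribed by the text.

import numpy as np

def binary_search_active(xs, hidden_label):
    """Active learning of a threshold on sorted points xs by binary search.
    Assumes separable data: the hidden labels are -1's followed by +1's."""
    xs = np.sort(xs)
    lo, hi = 0, len(xs)              # the -/+ transition index lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if hidden_label(xs[mid]) == +1:
            hi = mid                 # transition is at or before mid
        else:
            lo = mid + 1             # transition is strictly after mid
    w_hat = xs[lo] if lo < len(xs) else xs[-1] + 1.0   # any consistent threshold
    return w_hat, queries

# 1/epsilon = 1,000,000 unlabeled points; roughly 20 label queries suffice.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=1_000_000)
w_hat, queries = binary_search_active(xs, lambda x: +1 if x >= 0.5 else -1)
print(f"estimated threshold {w_hat:.6f} after {queries} label queries")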

2.1 Some results of active learning theory


There is a large body of work on active learning within the membership query model [1] and also some work
on active online learning [8]. Here we focus exclusively on the framework of statistical learning theory, as
described in the introduction. The results obtained so far can be categorized according to whether or not
they are able to handle nonseparable data distributions, and whether their querying strategies are aggressive
or mellow. This last dichotomy is imprecise, but roughly, an aggressive scheme is one that seeks out highly
informative query points while a mellow scheme queries any point that is at all informative, in the sense that
its label cannot be inferred from the data already seen.
Some representative results are shown in Figure 5. At present, the only schemes known to work in the
realistic case of nonseparable data are mellow. It might seem that such schemes would confer at best a
modest advantage over supervised learning — a constant-factor reduction in label complexity, perhaps. The
surprise is that in a wide range of cases, they offer an exponential reduction.

2.2 A generic mellow learner


Cohn, Atlas, and Ladner [9] introduced a wonderfully simple, mellow learning strategy for separable data.
This scheme, henceforth nicknamed CAL, has formed the basis for much subsequent work. It operates in a
streaming model where unlabeled data points arrive one at a time, and for each, the learner has to decide
on the spot whether or not to ask for its label.

Left (CAL):
    H1 = H
    For t = 1, 2, . . .:
        Receive unlabeled point xt
        If disagreement in Ht about xt's label:
            query label yt of xt
            Ht+1 = {h ∈ Ht : h(xt) = yt}
        else:
            Ht+1 = Ht

Right (simulation without an explicit version space):
    S = {} (points seen so far)
    For t = 1, 2, . . .:
        Receive unlabeled point xt
        If learn(S ∪ (xt, +1)) and learn(S ∪ (xt, −1)) both return an answer:
            query label yt
        else:
            set yt to whichever label succeeded
        S = S ∪ {(xt, yt)}

Figure 6: Left: CAL, a generic mellow learner for separable data. Right: A way to simulate CAL without
having to explicitly maintain the version space Ht . Here learn(·) is a black-box supervised learner that
takes as input a data set and returns any classifier from H consistent with the data, provided one exists.


Figure 7: Left: The first seven points in the data stream were labeled. How about this next point? Middle:
Some of the hypotheses in the current version space. Right: The region of disagreement.

CAL works by always maintaining the current version space: the subset of hypotheses consistent with
the labels seen so far. At time t, this is some Ht ⊂ H. When the data point xt arrives, CAL checks to see
whether there is any disagreement within Ht about its label. If there isn’t, then the label can be inferred;
otherwise it must be requested (Figure 6, left).
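
For thresholds on the line, the version space is just an interval of candidate thresholds, so the disagreement test reduces to a pair of comparisons. The sketch below is our own illustration of this special case (the stream, the target threshold, and the noise-free labels are all assumptions), not code from the paper.

import numpy as np

def cal_thresholds(stream, get_label):
    """CAL for H = {h_w : h_w(x) = +1 iff x >= w}, separable data.
    The version space is the set of thresholds in (lo, hi]."""
    lo, hi = -np.inf, np.inf
    queries = 0
    for x in stream:
        if lo < x < hi:              # hypotheses in the version space disagree on x
            queries += 1
            if get_label(x) == +1:
                hi = min(hi, x)      # the true threshold must be <= x
            else:
                lo = max(lo, x)      # the true threshold must be > x
        # otherwise the label of x is inferred; the version space is unchanged
    return (lo, hi), queries

rng = np.random.default_rng(1)
stream = rng.uniform(0, 1, size=10_000)
(lo, hi), q = cal_thresholds(stream, lambda x: +1 if x >= 0.5 else -1)
print(f"version space ({lo:.5f}, {hi:.5f}] after {q} queries out of {len(stream)} points")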
Figure 7 shows CAL at work in a setting where the data points lie in the plane, and the hypotheses
are linear separators. A key concept is that of the disagreement region, the portion of the input space X
on which there is disagreement within Ht . A data point is queried if and only if it lies in this region, and
therefore the efficacy of CAL depends upon the rate at which the P-mass of this region shrinks. As we will
see shortly, there is a broad class of situations in which this shrinkage is geometric: the P-mass halves every
constant number of labels, giving a label complexity that is exponentially better than that of supervised
learning. It is quite surprising that so mellow a scheme performs this well; and it is of interest, then, to ask
how it might be made more practical.

2.3 Upgrading CAL


As described, CAL has two major shortcomings. First, it needs to explicitly maintain the version space,
which is unmanageably large in most cases of interest. Second, it makes sense only for separable data. Two
recent papers, first [2] and then [12], have shown how to overcome these hurdles.
The first problem is easily handled. The version space can be maintained implicitly in terms of the
labeled examples seen so far (Figure 6, right). To handle the second problem, nonseparable data, we need
to change the definition of the version space Ht , because there may not be any hypotheses that agree with
all the labels. Thereafter, with the newly-defined Ht , we continue as before, requesting the label yt of any
point xt for which there is disagreement within Ht . If there is no disagreement, we “infer” the label of xt
– call this ŷt – and include it in the training set. Because of nonseparability, the inferred ŷt might differ

from the actual hidden label yt . Regardless, every point gets labeled, one way or the other. The resulting
algorithm, which we will call DHM after its authors [12], is presented in the appendix. Here we give some
rough intuition.
After t time steps, there are t labeled points (some queried, some inferred). Let errt (h) denote the
empirical error of h on these points, that is, the fraction of these t points that h gets wrong. Writing ht for
the minimizer of errt (·), define

Ht+1 = {h ∈ Ht : errt (h) ≤ errt (ht ) + ∆t },

where ∆t comes out of some standard generalization bound (DHM doesn’t do this exactly, but is similar in
spirit). Then the following assertions hold:
• The optimal hypothesis h∗ (with minimum error on the underlying distribution P) lies in Ht for all t.
• Any inferred label is consistent with h∗ (although it might disagree with the actual, hidden label).
Because all points get labeled, there is no bias introduced into the marginal distribution on X . It might
seem, however, that there is some bias in the conditional distribution of y given x, because the inferred
labels can differ from the actual labels. The saving grace is that this bias shifts the empirical error of every
hypothesis in Ht by the same amount – because all these hypotheses agree with the inferred label – and thus
the relative ordering of hypotheses is preserved.
In a typical trial of DHM (or CAL), the querying eventually concentrates near the decision boundary
of the optimal hypothesis. In what respect, then, do these methods differ from the heuristics we described
in the introduction? To understand this, let’s return to the example of Figure 3. The first few data points
drawn from this distribution may well lie in the far-left and far-right clusters. So if the learner were to choose
a single hypothesis, it would lie somewhere near the middle of the line. But DHM doesn’t do this. Instead,
it maintains the entire version space (implicitly), and this version space includes all thresholds between the
two extremal clusters. Therefore the second cluster from the left, which tripped up naive schemes, will not
be overlooked.
To summarize, DHM avoids the consistency problems of many other active learning heuristics by (i) mak-
ing confidence judgements based on the current version space, rather than the single best current hypothesis,
and (ii) labeling all points, either by query or inference, to avoid skewing the distribution on X .

2.4 Label complexity


The label complexity of supervised learning is quite well characterized by a single parameter of the hypothesis
class called the VC dimension [6]: if this dimension is d, then a classifier whose error is within ǫ of optimal
can be learned using roughly d/ǫ2 labeled examples. In the active setting, further information is needed in
order to assess label complexity [10]. For mellow active learning, Hanneke [15] identified a key parameter of
the learning problem (hypothesis class as well as data distribution) called the disagreement coefficient and
gave bounds for CAL in terms of this quantity. It proved similarly useful for analyzing DHM [12]. We will
shortly define this parameter and see examples of it, but in the meanwhile, we take a look at the bounds.
Suppose data is generated independently at random from an underlying distribution P. How many labels
does CAL or DHM need before it finds a hypothesis with error less than ǫ, with probability at least 1 − δ?
To be precise, for a specific distribution P and hypothesis class H, we define the label complexity of CAL,
denoted LCAL (ǫ, δ), to be the smallest integer t0 such that for all t ≥ t0,

P[some h ∈ Ht has err(h) > ǫ] ≤ δ.

In the case of DHM, the distribution P might not be separable, in which case we need to take into account
the best achievable error:
ν = inf_{h∈H} err(h).

LDHM (ǫ, δ) is then the smallest t0 such that

P[some h ∈ Ht has err(h) > ν + ǫ] ≤ δ

for all t ≥ t0. In typical supervised learning bounds, and here as well, the dependence of L(ǫ, δ) upon δ is
modest, at most poly log(1/δ). To avoid clutter, we will henceforth ignore δ and speak only of L(ǫ).

Theorem 1 [16] Suppose H has finite VC dimension d, and the learning problem is separable, with disagreement coefficient θ. Then

    LCAL (ǫ) ≤ Õ( θ d log(1/ǫ) ),

where the Õ notation suppresses terms logarithmic in d, θ, and log 1/ǫ.

A supervised learner would need Ω(d/ǫ) examples to achieve this guarantee, so active learning yields an
exponential improvement when θ is finite: its label requirement scales as log 1/ǫ rather than 1/ǫ. And this
is without any effort at finding maximally informative points!
In the nonseparable case, the label complexity also depends on the minimum achievable error within the
hypothesis class.
Theorem 2 [12] With parameters as defined above,

    LDHM (ǫ) ≤ Õ( θ ( d log²(1/ǫ) + d ν²/ǫ² ) ),

where ν = inf_{h∈H} err(h).
In this same setting, a supervised learner would require Ω((d/ǫ) + (dν/ǫ²)) samples. If ν is small relative
to ǫ, we again see an exponential improvement from active learning; otherwise, the improvement is by the
constant factor ν.
The second term in the label complexity is inevitable for nonseparable data.
Theorem 3 [5] Pick any hypothesis class with finite VC dimension d. Then there exists a distribution P
over X × Y for which any active learner must incur a label complexity

    L(ǫ, 1/2) ≥ Ω( ν²/ǫ² ),

where ν = inf_{h∈H} err(h).
The corresponding lower bound for supervised learning is dν/ǫ².

2.5 The disagreement coefficient


We now define the leading constant in both label complexity upper bounds: the disagreement coefficient.
To start with, the data distribution P induces a natural metric on the hypothesis class H: the distance
between any two hypotheses is simply the P-mass of points on which they disagree. Formally, for any
h, h′ ∈ H, we define
d(h, h′ ) = P[h(X) ≠ h′ (X)],
and correspondingly, the closed ball of radius r around h is

B(h, r) = {h′ ∈ H : d(h, h′ ) ≤ r}.

Now, suppose we are running either CAL or DHM, and that the current version space is some V ⊂ H.
Then the only points that will be queried are those that lie within the disagreement region

DIS(V ) = {x ∈ X : there exist h, h′ ∈ V with h(x) ≠ h′ (x)}.


Figure 8: Left: Suppose the data lie in the plane, and that hypothesis class consists of linear separators.
The distance between any two hypotheses h∗ and h is the probability mass (under P) of the region on which
they disagree. Middle: The thick line is h∗ . The thinner lines are examples of hypotheses in B(h∗ , r). Right:
DIS(B(h∗ , r)) might look something like this.

Figure 8 illustrates these notions.


If the minimum-error hypothesis is h∗ , then after a certain amount of querying we would hope that the
version space is contained within B(h∗ , r) for small-ish r. In which case, the probability that a random
point from P would get queried is at most P(DIS(B(h∗ , r))). The disagreement coefficient measures how this
probability scales with r: it is defined to be
    θ = sup_{r>0} P[DIS(B(h∗ , r))] / r.
Let’s work through an example. Suppose X = R and H consists of thresholds. For any two thresholds
h < h′ , the distance d(h, h′ ) is simply the P-mass of the interval [h, h′ ). If h∗ is the best threshold, then
B(h∗ , r) consists exactly of the interval I that contains: h∗ ; the segment to the immediate left of h∗ of
P-mass r; and the segment to the immediate right of h∗ of P-mass r. The disagreement region DIS(B(h∗ , r))
is this same interval I; and since it has mass 2r, it follows that the disagreement coefficient is 2.
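
The calculation above can also be checked numerically. The sketch below estimates sup_r P[DIS(B(h∗, r))]/r for thresholds from a finite sample; the uniform distribution, the choice of h∗, and the grid of radii are assumptions made only for this illustration.

import numpy as np

def empirical_disagreement_coefficient(xs, w_star, radii):
    """Estimate theta = sup_r P[DIS(B(h*, r))] / r for threshold classifiers,
    using the empirical distribution of the sample xs."""
    # empirical P-mass strictly between each point and w_star
    mass = np.array([np.mean((xs > min(x, w_star)) & (xs < max(x, w_star))) for x in xs])
    ratios = []
    for r in radii:
        # x lies in DIS(B(h*, r)) iff some threshold within distance r of h*
        # labels x differently from h*, i.e. the mass between x and w_star is < r
        ratios.append(np.mean(mass < r) / r)
    return max(ratios)

rng = np.random.default_rng(2)
xs = rng.uniform(0, 1, size=2000)
theta_hat = empirical_disagreement_coefficient(xs, 0.5, np.linspace(0.01, 0.4, 40))
print(f"empirical disagreement coefficient: {theta_hat:.2f}")   # close to 2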
Disagreement coefficients have been derived for various concept classes and data distributions of interest,
including:
• Thresholds in R: θ = 2, as explained above.
• Homogeneous (through-the-origin) linear separators in Rd , with a data distribution P that is uniform
over the surface of the unit sphere [15]: θ ≤ √d.
• Linear separators in Rd , with a smooth data density bounded away from zero [14]: θ = c(h∗ )d, where
c(h∗ ) is some constant depending on the target hypothesis h∗ .

2.6 Further work on mellow active learning


A more refined disagreement coefficient
When the disagreement coefficient θ is bounded, CAL and DHM offer better label complexity than supervised
learning. But there are simple instances in which θ is arbitrarily large. For instance, suppose again that
X = R but that the hypotheses consist of intervals:

H = {ha,b : a, b ∈ R},  where  ha,b (x) = +1 if a ≤ x ≤ b, and −1 otherwise.

Then the distance between any two hypotheses ha,b and ha′ ,b′ is

d(ha,b , ha′ ,b′ ) = P{x : x ∈ [a, b] ∪ [a′ , b′ ], x ∉ [a, b] ∩ [a′ , b′ ]} = P([a, b]∆[a′ , b′ ]),

where S∆T denotes the symmetric set difference (S ∪ T ) \ (S ∩ T ). Now suppose the target hypothesis is
some hα,β with α ≤ β. If r > P[α, β] then B(hα,β , r) includes all intervals of probability mass ≤ r − P[α, β].
Thus, if P is a density, the disagreement region of B(hα,β , r) is all of X ! Letting r approach P[α, β] from
above, we see that θ is at least 1/P[α, β], which is unbounded as β gets closer to α.
A saving grace is that for smaller values r ≤ P[α, β], the hypotheses in B(hα,β , r) are intervals intersecting
hα,β , and consequently the disagreement region has mass at most 4r. Thus there are two regimes in the
active learning process for H: an initial phase in which the radius of uncertainty r is brought down to P[α, β],
and a subsequent phase in which r is further decreased to O(ǫ). The first phase might be slow, but the second
should behave as if θ = 4. Moreover, the dependence of the label complexity upon ǫ should arise entirely
from the second phase. A series of recent papers [4, 14] analyzes such cases by loosening the definition of
disagreement coefficient from
    sup_{r>0} P[DIS(B(h∗ , r))] / r     to     limsup_{r→0} P[DIS(B(h∗ , r))] / r.
In the example above, the revised disagreement coefficient is 4.

Other loss functions


DHM uses a supervised learner as a black box, and assumes that this subroutine truly returns a hypothesis
minimizing the empirical error. However, in many cases of interest, such as high-dimensional linear separa-
tors, this minimization is NP-hard and is typically solved in practice by substituting a convex loss function in
place of 0 − 1 loss. Some recent work [5] develops a mellow active learning scheme for general loss functions,
using importance weighting.

2.7 Illustrative experiments


We now show DHM at work in a few toy cases. The first is our recurring example of threshold functions on
the line. Figure 9 shows the results of an experiment in which the (unlabeled) data is distributed uniformly
over the interval [0, 1] and the target threshold is 0.5. Three types of noise distribution are considered:
• No noise: every point above 0.5 gets a label of +1 and the rest get a label of −1.
• Random noise: each data point’s label is flipped with a certain probability, 10% in one experiment and
20% in another. This is a benign form of noise.
• Boundary noise: the noisy labels are concentrated near the boundary (for more details, see [12]).
The fraction of corrupted labels is 10% in one experiment and 20% in another. This kind of noise is
challenging for active learning, because the learner’s region of disagreement gets progressively more
noisy as it shrinks.
Each experiment has a stream of 10,000 unlabeled points; the figure shows how many of these are queried
during the course of learning.
Figures 10 and 11 show similar results for the hypothesis class of intervals. The distribution of queries
is initially all over the place, but eventually concentrates at the decision boundary of the target hypothesis.
Although the behavior of this active learner seems qualitatively similar to that of the heuristics we described
in the introduction, it differs in one crucial respect: here, all points get labeled to avoid bias in the training set.

The experiments so far are for one-dimensional data. There are two significant hurdles in scaling up the
DHM algorithm to real data sets:
1. The version space Ht is defined using a generalization bound. Current bounds are tight only in a few
special cases, such as small finite hypothesis classes and thresholds on the line. Otherwise they can be
extremely loose, with the result that Ht ends up being larger than necessary, and far too many points
get queried.

[Figure 9 plots the number of queries against the number of data points seen, one curve per noise model: boundary noise 0.2, boundary noise 0.1, random noise 0.2, random noise 0.1, and no noise.]

Figure 9: Here the data distribution is uniform over X = [0, 1] and H consists of thresholds on the line. The
target threshold is at 0.5. We test five different noise models for the conditional distribution of labels.

[Figure 10 plots the number of queries against the number of data points seen, one curve per setting: width 0.1 with random noise 0.1, width 0.2 with random noise 0.1, width 0.1 with no noise, and width 0.2 with no noise.]

Figure 10: The data distribution is uniform over X = [0, 1], and H consists of intervals. We vary the width
of the target interval and the noise model for the conditional distribution of labels.


Figure 11: The distribution of queries for the experiment of Figure 10, with target interval [0.4, 0.6] and
random noise of 0.1. The initial distribution of queries is shown above and the eventual distribution below.

2. Its querying policy is not at all aggressive.


The first problem is perhaps eroding with time, as better and better bounds are found. How well would
CAL/DHM perform if this problem vanished altogether? That is, what are the limits of mellow active
learning?
One way to investigate this question is to run DHM with the best possible bound. Recall the definition
of the version space:
Ht+1 = {h ∈ Ht : errt (h) ≤ errt (ht ) + ∆t },
where errt (·) is the empirical error on the first t points, and ht is the minimizer of errt (·). Instead of obtaining
∆t from large deviation theory, we can simply set it to the smallest value that retains the target hypothesis
h∗ within the version space; that is, ∆t = errt (h∗ ) − errt (ht ). This cannot be used in practice because
errt (h∗ ) is unknown; here we use it in a synthetic example to explore how effective mellow active learning
could be if the problem of loose generalization bounds were to vanish.
Figure 12 shows the result of applying this optimistic bound to a ten-dimensional learning problem with
two Gaussian classes. The hypotheses are linear separators and the best separator has an error of 5%. A
stream of 500 data points is processed, and as before, the label complexity is impressive: less than 70 of these
points get queried. The figure on the right presents a different view of the same experiment. It shows how
the test error decreases over the first 50 queries of the active learner, as compared to a supervised learner
(which asks for every point’s label). There is an initial phase where active learning offers no advantage; this
is the time during which a reasonably good classifier, with about 8% error, is learned. Thereafter, active
learning kicks in and after 50 queries reaches a level of error that the supervised learner will not attain until
after about 500 queries. However, this final error rate is still 5%, and so the benefit of active learning is
realized only during the last 3% decrease in error rate.

2.8 Where next


The single biggest open problem is to develop active learning algorithms with more aggressive querying
strategies. This would appear to significantly complicate problems of sampling bias, and has thus proved
tricky to analyze mathematically. Such schemes might be able to exploit their querying ability earlier in the
learning process, rather than having to wait until a reasonably good classifier has already been obtained.
Next, we’ll turn to an entirely different view of active learning and present a scheme that is, in fact, able
to realize early benefits.

[Figure 12, left panel: number of queries made versus number of data points seen. Right panel: test error versus number of labels seen, for active and supervised sampling.]

Figure 12: Here X = R10 and H consists of linear separators. Each class is Gaussian and the best separator
has 5% error. Left: Queries made over a stream of 500 examples. Right: Test error for active versus
supervised sampling.

3 Exploiting cluster structure in data


Figure 4, right, immediately brings to mind an active learning strategy that starts with a pool of unlabeled
data, clusters it, asks for one label per cluster, and then takes these labels as representative of their entire
respective clusters. Such a scheme is fraught with problems: (i) there might not be an obvious clustering,
or alternatively, (ii) good clusterings might exist at many granularities (for instance, there might be a good
partition into ten clusters but also into twenty clusters), or, worst of all, (iii) the labels themselves might
not be aligned with the clusters.
What is needed is a scheme that does not make assumptions about the distribution of data and labels,
but is able to exploit situations where there exist clusters that are fairly homogeneous in their labels. An
elegant example of this is due to Zhu, Lafferty, and Ghahramani [20]. Their algorithm begins by imposing a
neighborhood graph on an unlabeled data set S, and then works by locally propagating any labels it obtains,
roughly as follows:

Pick a few points from S at random and get their labels
Repeat:
    Propagate labels to "nearby" unlabeled points in the graph
    Query in an "unknown" part of the graph

This kind of nonparametric learner appears to have the usual problems of sampling bias, but differs from
the approaches of the previous section, and has been studied far less. One recently-proposed algorithm,
which we will call DH after its authors [11], attempts to capture the spirit of local propagation schemes
while maintaining sound statistics on just how “unknown” different regions are; we now turn to it.

3.1 Three rules for querying


In the previous section the unlabeled points arrived in a stream, one at a time. Here we’ll assume we have
all of them at the outset: a pool S ⊂ X . The entire learning process will then consist of requesting labels of
points from S.

13
In the DH algorithm, the only guide in deciding where to query is cluster structure. But the clusters in
use may change over time. Let Ct be the clustering operative at time t, while the tth query is being chosen.
This Ct is some partition of S into groups; more formally,
    ⋃_{C∈Ct} C = S, and any C, C′ ∈ Ct are either identical or disjoint.

To avoid biases induced by sampling, DH follows a simple rule for querying:


Rule 1. At any time t, the learner specifies a cluster C ∈ Ct and receives the label of a point
chosen uniformly at random from C (along with the identity of that point).
In other words, the learner is not allowed to specify exactly which point to label, but merely the cluster from
which it will be randomly drawn. In the case where there are binary labels {−1, +1}, it helps to think of
each cluster as a biased coin whose heads probability is simply the fraction of points within it with label +1.
The querying process is then a toss of this biased coin. The scheme also works if the set of labels Y is some
finite set, in which case samples from the cluster are modeled by a multinomial.
If the labeling budget runs out at time T , the current clustering CT is used to assign labels to all of S:
each point gets the majority label of its cluster.

[Illustration: a data set in two clusters with a handful of queried labels (left of the arrow), and the labeling obtained by giving every point the majority label of its cluster (right).]

To control the error induced by this process, it is important to find clusters that are as homogeneous as
possible in their labels. In the above example, we can be fairly sure of this. We have five random labels
from the left cluster, all of which are −. Using a tail bound for the binomial distribution, we can obtain
an interval (such as [0.8, 1.0]) in which the true bias of this cluster is very likely to lie. The DH algorithm
makes heavy use of such confidence intervals.
If the current clustering C has a cluster that is very mixed in its labels, then this cluster needs to be split
further, to get a new clustering C′ :

[Illustration: the clustering C (left of the arrow) and the refined clustering C′ (right), in which the more mixed cluster has been split.]

The lefthand cluster of C can be left alone (for the time being), since the one on the right is clearly more
troublesome. A fortunate consequence of Rule 1 is that the queries made to C can be reused for C′ : a
random label in the righthand cluster of C is also a random label for the new cluster in which it falls.
Thus the clustering of the data changes only by splitting clusters.

Rule 2. Pick any two times t′ > t. Then Ct′ must be a refinement of Ct , that is,

for all C ′ ∈ Ct′ , there exists some C ∈ Ct such that C ′ ⊂ C.

[Figure 13 shows a tree with root node 1 and numbered children, drawn above the four groups of data from Figure 3 (of mass 45%, 5%, 5%, and 45%).]

Figure 13: The top few nodes of a hierarchical clustering.

As in the example above, the nested structure of clusters makes it possible to re-use labels when the clustering
changes. If at time t, the querying process yields (x, y) for some x ∈ C ∈ Ct , then later, at t′ > t, this same
(x, y) is reusable as a random draw from the C ′ ∈ Ct′ to which x belongs.
The final rule imposes a constraint on the manner in which a clustering is refined.
Rule 3. When a cluster is split to obtain a new clustering, Ct → Ct+1 , the manner of split
cannot depend upon the labels seen.
This avoids complicated dependencies. The upshot of it is that we might as well start off with a hierarchical
clustering of S, set C1 to the root of the clustering, and gradually move down the hierarchy, as needed,
during the querying process.

3.2 An illustrative example


Given the three rules for querying, the specification of the DH cluster-based learner is quite intuitive. Let’s
get a sense of its behavior by revisiting the example in Figure 3 that presented difficulties for many active
learning heuristics.
DH would start with a hierarchical clustering of this data. Figure 13 shows how it might look: only
the top few nodes of the hierarchy are depicted, and their numbering is arbitrary. At any given time, the
learner works with a particular partition of the data set, given by a pruning of the tree. Initially, this is
just {1}, a single cluster containing everything. Random points are drawn from this cluster and their labels
are queried. Suppose one of these points, x, lies in the rightmost group. Then it is a random sample from
node 1 of the hierarchy, but also from nodes 3 and 9. Based on such random samples, each node of the tree
maintains counts of the positive and negative instances seen within it. A few samples reveal that the top
node 1 is very mixed while nodes 2 and 3 are substantially more pure. Once this transpires, the partition
{1} is replaced by {2, 3}. Subsequent random samples are chosen from either 2 or 3, according to a sampling
strategy favoring the less-pure node. A few more queries down the line, the pruning is refined to {2, 4, 9}.
This is when the benefits of the partitioning scheme become most obvious; based on the samples seen, it is
concluded that cluster 9 is (almost) pure, and thus (almost) no more queries are made from it until the rest
of the space has been partitioned into regions that are similarly pure.
The querying can be stopped at any stage; then, each cluster in the current partition is assigned the
majority label of the points queried from it. In this way, DH labels the entire data set, while trying to keep
the number of induced erroneous labels to a minimum. If desired, the labels can be used for a subsequent
round of supervised learning, with any learning algorithm and hypothesis class.

P ← {root} (current pruning of tree)
L(root) ← 1 (arbitrary starting label for root)
For t = 1, 2, . . . (until the budget runs out):
    Repeat B times:
        v ← select(P )
        Pick a random point z from subtree Tv
        Query z's label
        Update counts for all nodes u on path from z to v
    In a bottom-up pass of T , compute bound(u) for all nodes u ∈ T
    For each (selected) v ∈ P :
        Let (P ′ , L′ ) be the pruning and labeling of Tv minimizing bound(v)
        P ← (P \ {v}) ∪ P ′
        L(u) ← L′ (u) for all u ∈ P ′
For each cluster v ∈ P :
    Assign each point in Tv the label L(v)

Figure 14: The DH cluster-adaptive active learning algorithm.

3.3 The learning algorithm


Figure 14 shows the DH active learning algorithm. Its input consists of a hierarchical clustering T whose
leaves are the n data points, as well as a batch size B. This latter quantity specifies how many queries will
be asked per iteration: it is frequently more convenient to make small batches of queries at a time rather
than single queries. At the end of any iteration, the algorithm can be made to stop and output a labeling of
all the data.
The pseudocode needs some explaining. The subtree rooted at a node u ∈ T is denoted Tu . At any given
time, the algorithm works with a particular pruning P ⊂ T , a set of nodes such that the subtrees (clusters)
{Tu : u ∈ P } are disjoint and contain all the leaves. Each node u of the pruning is assigned a label L(u),
with the intention that if the algorithm were to be abruptly stopped, all of the data points in Tu would be
given this label. The resulting mislabeling error within Tu (that is, the fraction of Tu whose true label isn’t
L(u)) can be upper-bounded using the samples seen so far; bound(u) is a conservative such estimate. These
bounds can be computed directly from Hoeffding’s inequality, or better still, from the tails of the binomial
or multinomial; details are in [11]. Computationally, what matters is that after each batch of queries, all
the bounds can be updated in a linear-time bottom-up pass through the tree, at which point the pruning P
might be refined.
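
A minimal sketch of what such a bound(·) might look like, using a Hoeffding-style confidence radius on the queried labels of a single cluster; the DH analysis itself uses sharper binomial/multinomial tails, and the counts and δ below are made up for illustration.

import math

def mislabel_bound(counts, delta=0.05):
    """Conservative upper bound on the fraction of a cluster that would be
    mislabeled if the whole cluster were given its observed majority label.
    `counts` maps each label to the number of queried points with that label."""
    n = sum(counts.values())
    if n == 0:
        return 1.0                                        # nothing queried yet
    majority = max(counts, key=counts.get)
    minority_frac = 1.0 - counts[majority] / n
    slack = math.sqrt(math.log(2.0 / delta) / (2.0 * n))  # Hoeffding radius
    return min(1.0, minority_frac + slack)

# Five queries from a cluster, all of them "-": the bound is still fairly loose.
print(round(mislabel_bound({"+": 0, "-": 5}), 2))   # about 0.61 with this crude radius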
Two things remain unspecified: the manner in which the hierarchical clustering is built and the procedure
select, which picks a cluster to query. We discuss these next, but regardless of how these decisions are made,
the DH algorithm has a basic statistical soundness: it always maintains valid estimates of the error induced
by its current pruning, and it refines prunings to drive down the error. This leaves a lot of flexibility to
explore different clustering and sampling strategies.

The select procedure


The select(P ) procedure controls the selective sampling. There are many choices for how to do this:

1. Choose v ∈ P with probability ∝ wv , the fraction of the data points lying in Tv . This is similar to
random sampling.
2. Choose v ∈ P with probability ∝ wv bound(v). This is an active learning rule that reduces sampling
in regions of the space that have already been observed to be fairly pure in their labels.

3. For each subtree (Tz , z ∈ P ), find the observed majority label, and assign this label to all points in
the subtree; fit a classifier h to this data; and choose v ∈ P with probability ∝ min{|{x ∈ Tv : h(x) =
+1}|, |{x ∈ Tv : h(x) = −1}|}.
This biases sampling towards regions close to the current decision boundary.

Innumerable variations of the third strategy are possible. Such schemes have traditionally suffered from
consistency problems (recall Figure 3), for instance because entire regions of space are overconfidently over-
looked. The DH framework relieves such concerns because there is always an accurate bound on the error
induced by the current labeling.
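
As a sketch of the first two querying rules above (the cluster names, weights, and bounds are invented for the example; `weights[v]` plays the role of wv and `bounds[v]` of bound(v)):

import numpy as np

def select(pruning, weights, bounds, rng, strategy="active"):
    """Pick the next cluster to query from the current pruning P."""
    vs = list(pruning)
    if strategy == "random":
        probs = np.array([weights[v] for v in vs])                # rule 1
    elif strategy == "active":
        probs = np.array([weights[v] * bounds[v] for v in vs])    # rule 2
    else:
        raise ValueError(strategy)
    return rng.choice(vs, p=probs / probs.sum())

rng = np.random.default_rng(3)
pruning = ["A", "B", "C"]
weights = {"A": 0.5, "B": 0.3, "C": 0.2}      # fraction of the data in each cluster
bounds = {"A": 0.02, "B": 0.40, "C": 0.35}    # current mislabeling bounds
print(select(pruning, weights, bounds, rng))  # under rule 2, usually B or C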

Building a hierarchical clustering


DH works best when there is a pruning P of the tree such that |P | is small and a significant fraction of
its constituent clusters are almost-pure. There are several ways in which one might try to generate such a
tree. The first option is simply to run a standard hierarchical clustering algorithm, such as Ward’s average
linkage method. If domain knowledge is available in the form of a specialized distance function, this can be
used for the clustering; for instance, if the data is believed to lie on a low-dimensional manifold, a distance
function can be generated by running Dijkstra’s algorithm on a neighborhood graph, as in Isomap [18]. A
third option is to use a small set of labeled data to guide the construction of the hierarchical clustering, by
providing soft constraints.

Label complexity
A rudimentary label complexity result for this model is proved in [11]: if the provided hierarchical clustering
contains a pruning P whose clusters are ǫ-pure in their labels, then the learner will find a labeling that is
O(ǫ)-pure with O(|P |d(P )/ǫ) labels, where d(P ) is the maximum depth of a node in P .

3.4 An illustrative experiment


The MNIST data set1 is a widely-used benchmark for the multiclass problem of handwritten digit recognition.
It contains over 60,000 images of digits, each a vector in R784 . We began by extracting 10,000 training images
and 2,000 test images, divided equally between the digits.
The first step in the DH active learning scheme is to hierarchically cluster the training data, bearing in
mind that the efficacy of the subsequent querying process depends heavily on how well the clusters align
with true classes. Previous work has found that the use of specialized distance functions such as tangent
distance or shape context dramatically improves nearest-neighbor classification performance on this data set;
the most sensible thing to do would therefore be to cluster using one of these distance measures. Instead,
we tried our luck with Euclidean distance and a standard hierarchical agglomerative clustering procedure
called Ward’s algorithm [19]. (Briefly, this algorithm starts with each data point in its own cluster, and then
repeatedly merges pairs of clusters until there is a single cluster containing all the data. The merger chosen
at each point in time is that which occasions the smallest increase in k-means cost.)
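
For readers who want to reproduce the flavor of this pipeline, the sketch below builds a Ward hierarchy with SciPy and measures the mislabeling error induced by a 50-cluster pruning; it uses scikit-learn's small 8x8 digits set as a convenient stand-in for MNIST, which is our substitution, not the data used in the experiment above.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_digits

# Stand-in data: 1,797 8x8 digit images instead of the 10,000 MNIST images above.
X, y = load_digits(return_X_y=True)

# Ward's method: repeatedly merge the pair of clusters whose union increases the
# within-cluster sum of squares (the k-means cost) by the least.
Z = linkage(X, method="ward")

# Cut the hierarchy into a pruning of 50 clusters and measure the induced
# mislabeling error when each cluster is assigned its most frequent label.
cluster_ids = fcluster(Z, t=50, criterion="maxclust")
errors = 0
for c in np.unique(cluster_ids):
    labels = y[cluster_ids == c]
    errors += int((labels != np.bincount(labels).argmax()).sum())
print("induced mislabeling error with 50 clusters:", errors / len(y))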
It is useful to look closely at the resulting hierarchical clustering, a tree with 10,000 leaves. Since there
are ten classes (digits), the best possible scenario would be one in which there were a pruning consisting of
ten nodes (clusters), each pure in its label. The worst possible scenario would be if purity were achieved
only at the leaves – a pruning with 10,000 nodes. Reality thankfully falls much closer to the former extreme.
Figure 15, left, relates the size of a pruning to the induced mislabeling error, that is, the fraction of points
misclassified when each cluster is assigned its most-frequent label. For instance, the entry (50, 0.12) means
that there exists a pruning with 50 nodes such that the induced label error is just 12%. This isn’t too bad
given the number of classes, and bodes well for active learning.
A good pruning exists relatively high in the tree; but does the querying process find it or something
comparable? The analysis shows that it must, using a number of queries roughly proportional to the number
1 http://yann.lecun.com/exdb/mnist/

[Figure 15, left panel: fraction of labels incorrect versus number of clusters. Right panel: test error versus number of labels, for random and active sampling.]

Figure 15: Results on OCR data. Left: Errors of the best prunings in the OCR digits tree. Right: Test
error curves on classification task.

of clusters. As empirical corroboration, we ran the hierarchical sampler ten times, and on average 400 queries
were needed to discover a pruning of error rate 12% or less.
So far we have only talked about error rates on the training set. We can complete the picture by using
the final labeling from the sampling scheme as input to a supervised learner (logistic regression with ℓ2
regularization, the trade-off parameter chosen by 10-fold cross validation). A good baseline for comparison
is the same experiment but with random instead of active sampling. Figure 15, right, shows the resulting
learning curves: the tradeoff between the number of labels and the error rate on the held-out test set. The
initial advantage of cluster-adaptive sampling reflects its ability to discover and subsequently ignore relatively
pure clusters at the onset of sampling. Later on, it is left sampling from clusters of easily confused digits
(the prime culprits being 3’s, 5’s, and 8’s).
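
The final supervised step can be reproduced with scikit-learn in a few lines; here the true digit labels and a random train/test split stand in for the labeling produced by the active sampler, which is an assumption of convenience.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Logistic regression with l2 regularization; the regularization strength is
# chosen by 10-fold cross validation, as in the experiment described above.
clf = LogisticRegressionCV(cv=10, penalty="l2", max_iter=2000)
clf.fit(X_train, y_train)        # in the experiment, y_train would instead be
                                 # the labels assigned by the cluster-adaptive sampler
print("held-out accuracy:", clf.score(X_test, y_test))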

3.5 Where next


The DH algorithm attempts to capture the spirit of nonparametric local-propagation active learning, while
maintaining reliable statistics about what it does and doesn’t know. In the process, however, it loses some
flexibility in its sampling strategy – it would be useful to be able to relax this.
On the theory front, the label complexity of DH needs to be better understood. At present there is no
clean characterization of when it does better than straight random sampling, and by how much. Among
other things, this would help in choosing the querying rule (that is, the procedure select).
A very general way to describe this cluster-based active learner is to say it uses unlabeled data to create
a data-dependent hypothesis class that is particularly amenable to adaptive sampling. This broad approach
to active learning might be worth studying systematically.

Acknowledgements
The author is grateful for his collaborators – Alina Beygelzimer, Daniel Hsu, Adam Kalai, John Langford,
and Claire Monteleoni – and for the support of the National Science Foundation under grant IIS-0713540.
There were also two anonymous reviewers who gave very helpful feedback on the first draft of this paper.

S = ∅ (points with inferred labels)
T = ∅ (points with queried labels)
For t = 1, 2, . . .:
    Receive xt
    If (h+1 = learn(S ∪ {(xt , +1)}, T )) fails: Add (xt , −1) to S and break
    If (h−1 = learn(S ∪ {(xt , −1)}, T )) fails: Add (xt , +1) to S and break
    If err(h−1 , S ∪ T ) − err(h+1 , S ∪ T ) > ∆t : Add (xt , +1) to S and break
    If err(h+1 , S ∪ T ) − err(h−1 , S ∪ T ) > ∆t : Add (xt , −1) to S and break
    Request yt and add (xt , yt ) to T

Figure 16: The DHM selective sampling algorithm. Here, err(h, A) = (1/|A|) Σ_{(x,y)∈A} 1(h(x) ≠ y). A
possible setting for ∆t is shown in Equation 1. At any time, the current hypothesis is learn(S, T ).

4 Appendix: the DHM algorithm


For technical reasons, the DHM algorithm (Figure 16) makes black-box calls to a special type of supervised
learner: for A, B ⊂ X × {±1},
learn(A, B) returns a hypothesis h ∈ H consistent with A, and with minimum error on B. If
there is no hypothesis consistent with A, a failure flag is returned.
For some simple hypothesis classes like intervals on the line, or rectangles in R2 , it is easy to construct such
a learner. For more complex classes like linear separators, the main bottleneck is the hardness of minimizing
the 0 − 1 loss on B (that is, the hardness of agnostic supervised learning).
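
As an illustration of how simple such a subroutine can be for a one-dimensional class, here is a sketch of learn(A, B) for thresholds on the line; the candidate-enumeration trick and the toy inputs are ours.

import math

def learn_thresholds(A, B):
    """learn(A, B) for H = {h_w : h_w(x) = +1 iff x >= w}: return a threshold
    consistent with every (x, y) in A and of minimum 0-1 error on B, or None
    (the failure flag) if no consistent threshold exists."""
    lo = max((x for x, y in A if y == -1), default=-math.inf)   # consistency: w > lo
    hi = min((x for x, y in A if y == +1), default=math.inf)    # consistency: w <= hi
    if lo >= hi:
        return None
    # The 0-1 error on B is piecewise constant in w and changes only at points
    # of B, so it suffices to try a candidate on either side of each such point,
    # restricted to the consistent range (lo, hi].
    eps = 1e-9
    raw = [x for x, _ in B] + [x + eps for x, _ in B]
    raw += [v for v in (hi, lo + eps) if math.isfinite(v)]
    candidates = [w for w in raw if lo < w <= hi] or [0.0]
    def error_on_B(w):
        return sum((x >= w) != (y == +1) for x, y in B)
    return min(candidates, key=error_on_B)

# Toy check: A forces the boundary into (0.3, 0.7]; the best threshold makes one
# mistake on B.
A = [(0.3, -1), (0.7, +1)]
B = [(0.1, -1), (0.5, +1), (0.6, -1), (0.9, +1)]
print(learn_thresholds(A, B))   # a threshold in (0.3, 0.5], with one error on B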
During the learning process, DHM labels every point but divides them into two groups: those whose
labels are queried (denoted T in the figure) and those whose labels are inferred (denoted S). Points in S are
subsequently used as hard constraints; with high probability, all these labels are consistent with the target
hypothesis h∗ .
The generalization bound ∆t can be set to
    ∆t = βt² + βt ( √err(h+1 , S ∪ T ) + √err(h−1 , S ∪ T ) ),    βt = C √( (d log t + log(1/δ)) / t )        (1)
where C is a universal constant, d is the VC dimension of class H, and δ is the overall permissible failure
probability.

References
[1] D. Angluin. Queries revisited. In Proceedings of the Twelfth International Conference on Algorithmic
Learning Theory, pages 12–31, 2001.
[2] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In International Conference
on Machine Learning, 2006.
[3] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Conference on Learning
Theory, 2007.
[4] M.-F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In Proceed-
ings of the 21st Annual Conference on Learning Theory, 2008.
[5] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In International
Conference on Machine Learning, 2009.

[6] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Lecture Notes
in Artificial Intelligence, 3176:169–207, 2004.
[7] R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Transactions on Information
Theory, 54(5):2339–2353, 2008.
[8] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Worst-case analysis of selective sampling for linear-
threshold algorithms. In Advances in Neural Information Processing Systems, 2004.
[9] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning,
15(2):201–221, 1994.
[10] S. Dasgupta. Coarse sample complexity bounds for active learning. In Neural Information Processing
Systems, 2005.
[11] S. Dasgupta and D.J. Hsu. Hierarchical sampling for active learning. In International Conference on
Machine Learning, 2008.
[12] S. Dasgupta, D.J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Neural
Information Processing Systems, 2007.
[13] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee
algorithm. Machine Learning, 28(2):133–168, 1997.
[14] E. Friedman. Active learning for smooth problems. In Conference on Learning Theory, 2009.
[15] S. Hanneke. A bound on the label complexity of agnostic active learning. In International Conference
on Machine Learning, 2007.
[16] S. Hanneke. Theoretical Foundations of Active Learning. PhD Thesis, CMU Machine Learning Depart-
ment, 2009.
[17] H. Schutze, E. Velipasaoglu, and J. Pedersen. Performance thresholding in practical text classification.
In ACM International Conference on Information and Knowledge Management, 2006.
[18] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality
reduction. Science, 290(5500):2319–2323, 2000.
[19] J.H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical
Association, 58:236–244, 1963.
[20] X. Zhu, J. Lafferty, and Z. Ghahramani. Combining active learning and semi-supervised learning using
gaussian fields and harmonic functions. In ICML Workshop on the Continuum from Labeled to Unlabeled
Data, 2003.

Semi-Supervised Learning
Xiaojin Zhu, University of Wisconsin-Madison

Synonyms: Learning from labeled and unlabeled data, transductive learning

Definition
Semi-supervised learning uses both labeled and unlabeled data to perform
an otherwise supervised learning or unsupervised learning task.
In the former case, there is a distinction between inductive semi-supervised
learning and transductive learning. In inductive semi-supervised learning,
the learner has both labeled training data {(xi , yi )}_{i=1}^{l}, drawn i.i.d.
from p(x, y), and unlabeled training data {xi }_{i=l+1}^{l+u}, drawn i.i.d. from
p(x), and learns a predictor f : X ↦ Y, f ∈ F, where F is the hypothesis
space. Here x ∈ X is an input instance, y ∈ Y its target label (discrete for
classification or continuous for regression), p(x, y) the unknown joint
distribution and p(x) its marginal, and typically l ≪ u. The goal is to learn
a predictor that predicts future test data better than the predictor learned
from the labeled training data alone. In transductive learning, the setting
is the same except that one is solely interested in the predictions on the
unlabeled training data {xi }_{i=l+1}^{l+u}, without any intention to
generalize to future test data.
In the latter case, an unsupervised learning task is enhanced by labeled
data. For example, in semi-supervised clustering (a.k.a. constrained clus-
tering) one may have a few must-links (two instances must be in the same
cluster) and cannot-links (two instances cannot be in the same cluster) in ad-
dition to the unlabeled instances to be clustered; in semi-supervised dimen-
sionality reduction one might have the target low-dimensional coordinates
on a few instances.
This entry will focus on the former case of learning a predictor.

Motivation and Background
Semi-supervised learning is initially motivated by its practical value in learn-
ing faster, better, and cheaper. In many real world applications, it is rela-
tively easy to acquire a large amount of unlabeled data {x}. For example,
documents can be crawled from the Web, images can be obtained from
surveillance cameras, and speech can be collected from broadcast. However,
their corresponding labels {y} for the prediction task, such as sentiment
orientation, intrusion detection, and phonetic transcripts, often require slow
human annotation and expensive laboratory experiments. This labeling bot-
tleneck results in a scarcity of labeled data and a surplus of unlabeled data.
Therefore, being able to utilize the surplus unlabeled data is desirable.
Recently, semi-supervised learning also finds applications in cognitive
psychology as a computational model for human learning. In human cate-
gorization and concept formation, the environment provides unsupervised data
(e.g., a child watching surrounding objects by herself) in addition to labeled
data from a teacher (e.g., Dad points to an object and says “bird!”). There
is evidence that human beings can combine labeled and unlabeled data to
facilitate learning.
The history of semi-supervised learning goes back to at least the 70s,
when self-training, transduction, and Gaussian mixtures with the EM al-
gorithm first emerged. It has enjoyed an explosion of interest since the 1990s,
with the development of new algorithms like co-training and transductive
support vector machines, new applications in natural language processing
and computer vision, and new theoretical analyses. More discussions can be
found in section 1.1.3 in [7].

Theory
It is obvious that unlabeled data {x_i}_{i=l+1}^{l+u} by itself does not carry any
information on the mapping X 7→ Y. How can it help us learn a better
predictor f : X 7→ Y? Balcan and Blum pointed out in [2] that the key lies
in an implicit ordering of f ∈ F induced by the unlabeled data. Informally,
if the implicit ordering happens to rank the target predictor f ∗ near the top,
then one needs less labeled data to learn f ∗ . This idea will be formalized
later on using PAC learning bounds. In other contexts, the implicit ordering
is interpreted as a prior over F or as a regularizer.
A semi-supervised learning method must address two questions: what
implicit ordering is induced by the unlabeled data, and how to algorithmically
find a predictor that is near the top of this implicit ordering and also fits the
labeled data well. Many semi-supervised learning methods have been pro-
posed, with different answers to these two questions [15, 7, 1, 10]. It is
impossible to enumerate all methods in this entry. Instead, we present a few
representative methods.

Generative Models
This semi-supervised learning method assumes the form of joint probability
p(x, y | θ) = p(y | θ)p(x | y, θ). For example, the class prior distribution
p(y | θ) can be a multinomial over Y, while the class conditional distribution
p(x | y, θ) can be a multivariate Gaussian in X [6, 9]. We use θ ∈ Θ to denote
the parameters of the joint probability. Each θ corresponds to a predictor
fθ via Bayes rule:

    f_θ(x) ≡ argmax_y p(y | x, θ) = argmax_y p(x, y | θ) / Σ_{y'} p(x, y' | θ).

Therefore, F = {fθ : θ ∈ Θ}.


What is the implicit ordering of f_θ induced by the unlabeled training data
{x_i}_{i=l+1}^{l+u}? It is the large-to-small ordering of the log likelihood of θ on the
unlabeled data:

    log p({x_i}_{i=l+1}^{l+u} | θ) = Σ_{i=l+1}^{l+u} log ( Σ_{y∈Y} p(x_i, y | θ) ).

The top ranked fθ is the one whose θ (or rather the generative model with
parameters θ) best fits the unlabeled data. Therefore, this method assumes
that the form of the joint probability is correct for the task.
To identify the fθ that both fits the labeled data well and ranks high,
one maximizes the log likelihood of θ on both labeled and unlabeled data:

    argmax_θ  log p({(x_i, y_i)}_{i=1}^{l} | θ) + λ log p({x_i}_{i=l+1}^{l+u} | θ),

where λ is a balancing weight. This is a non-concave problem. A local maximum
can be found with the Expectation-Maximization (EM) algorithm, or
other numerical optimization methods.
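As a concrete illustration, the following sketch applies this generative approach to one-dimensional inputs with two (or more) Gaussian class-conditional densities, running EM with the labeled points held at their known labels. The function names, the equal weighting of labeled and unlabeled terms (effectively λ = 1), and the simple initialization are illustrative assumptions, not part of the entry.

    import numpy as np

    def gaussian_pdf(x, mu, var):
        return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

    def semi_supervised_em(xl, yl, xu, n_iter=50):
        classes = np.unique(yl)
        k = len(classes)
        # initialize parameters from the labeled data alone
        prior = np.array([np.mean(yl == c) for c in classes])
        mu = np.array([xl[yl == c].mean() for c in classes])
        var = np.array([xl[yl == c].var() + 1e-6 for c in classes])
        for _ in range(n_iter):
            # E-step: labeled points keep hard (0/1) responsibilities
            r_l = np.eye(k)[np.searchsorted(classes, yl)]            # (l, k)
            dens = np.stack([prior[j] * gaussian_pdf(xu, mu[j], var[j])
                             for j in range(k)], axis=1)              # (u, k)
            r_u = dens / dens.sum(axis=1, keepdims=True)
            r = np.vstack([r_l, r_u])
            x = np.concatenate([xl, xu])
            # M-step: weighted maximum-likelihood updates
            nk = r.sum(axis=0)
            prior = nk / nk.sum()
            mu = (r * x[:, None]).sum(axis=0) / nk
            var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        return prior, mu, var

    def predict(x, prior, mu, var):
        # returns class indices 0..k-1 (in the order of np.unique on the labels)
        dens = np.stack([prior[j] * gaussian_pdf(x, mu[j], var[j])
                         for j in range(len(prior))], axis=1)
        return dens.argmax(axis=1)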

Semi-Supervised Support Vector Machines


This semi-supervised learning method assumes that the decision boundary
f(x) = 0 is situated in a low-density region (in terms of unlabeled data)
between the two classes y ∈ {−1, 1} [12, 8]. Consider the following hat loss
function on an unlabeled instance x:

max(1 − |f (x)|, 0)

which is positive when −1 < f (x) < 1, and zero outside. The hat loss
thus measures the violation in (unlabeled) large margin separation between
f and x. Averaging over all unlabeled training instances, it induces an
implicit ordering from small to large over f ∈ F:
    (1/u) Σ_{i=l+1}^{l+u} max(1 − |f(x_i)|, 0).

The top ranked f is one whose decision boundary avoids most unlabeled
instances by a large margin.
To find the f that both fits the labeled data well and ranks high, one
typically minimizes the following objective:
    argmin_f  (1/l) Σ_{i=1}^{l} max(1 − y_i f(x_i), 0) + λ_1 ||f||^2 + λ_2 (1/u) Σ_{i=l+1}^{l+u} max(1 − |f(x_i)|, 0),

which is a combination of the objective for supervised support vector
machines and the average hat loss. Algorithmically, the optimization prob-
lem is difficult because the hat loss is non-convex. Existing solutions in-
clude semi-definite programming relaxation, deterministic annealing, con-
tinuation method, concave-convex procedure (CCCP), stochastic gradient
descent, and Branch and Bound.
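The sketch below illustrates the objective above for a linear predictor f(x) = w·x + b, minimized by plain (sub)gradient descent; because the hat loss is non-convex, this only finds a local optimum. The linear form of f, the step size, and the iteration count are illustrative assumptions.

    import numpy as np

    def s3vm_objective(w, b, Xl, yl, Xu, lam1, lam2):
        fl = Xl @ w + b
        fu = Xu @ w + b
        hinge = np.maximum(1 - yl * fl, 0).mean()       # labeled hinge loss
        hat = np.maximum(1 - np.abs(fu), 0).mean()      # unlabeled hat loss
        return hinge + lam1 * (w @ w) + lam2 * hat

    def s3vm_subgrad_descent(Xl, yl, Xu, lam1=1e-2, lam2=1e-1, lr=1e-2, n_iter=500):
        d = Xl.shape[1]
        w, b = np.zeros(d), 0.0
        for _ in range(n_iter):
            fl = Xl @ w + b
            fu = Xu @ w + b
            # subgradient of the hinge loss on the labeled data
            active_l = (1 - yl * fl) > 0
            gw = -(yl[active_l][:, None] * Xl[active_l]).sum(0) / len(yl)
            gb = -yl[active_l].sum() / len(yl)
            # subgradient of the hat loss on the unlabeled data
            active_u = np.abs(fu) < 1
            sign = np.sign(fu[active_u])
            gw += lam2 * (-(sign[:, None] * Xu[active_u]).sum(0)) / len(fu)
            gb += lam2 * (-sign.sum()) / len(fu)
            # regularizer
            gw += 2 * lam1 * w
            w -= lr * gw
            b -= lr * gb
        return w, b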

Graph-Based Models
This semi-supervised learning method assumes that there is a graph G =
{V, E} such that the vertices V are the labeled and unlabeled training
instances, and the undirected edges E connect instances i, j with weight
wij [4, 14, 3]. The graph is sometimes assumed to be a random instanti-
ation of an underlying manifold structure that supports p(x). Typically,
w_ij reflects the proximity of x_i and x_j. For example, the Gaussian edge weight
function defines w_ij = exp(−||x_i − x_j||^2 / σ^2). As another example, the kNN
edge weight function defines w_ij = 1 if x_i is within the k nearest neighbors of
x_j or vice versa, and w_ij = 0 otherwise. Other commonly used edge weight
functions include ε-radius neighbors, b-matching, and combinations of the
above.

Large wij implies a preference for the predictions f (xi ) and f (xj ) to be
the same. This can be formalized by the graph energy of a function f :
    Σ_{i,j=1}^{l+u} w_ij (f(x_i) − f(x_j))^2.

The graph energy induces an implicit ordering of f ∈ F from small to large.
The top ranked function is the smoothest with respect to the graph (in fact,
it is any constant function). The graph energy can be equivalently expressed
using the so-called unnormalized graph Laplacian matrix. Variants include
the normalized Laplacian and powers of these matrices.
To find the f that both fits the labeled data well and ranks high (i.e., be-
ing smooth on the graph or manifold), one typically minimizes the following
objective:
    argmin_f  (1/l) Σ_{i=1}^{l} c(f(x_i), y_i) + λ_1 ||f||^2 + λ_2 Σ_{i,j=1}^{l+u} w_ij (f(x_i) − f(x_j))^2,

where c(f (x), y) is a convex loss function such as the hinge loss or the squared
loss. This is a convex optimization problem with efficient solvers.
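A minimal sketch of this approach with the squared loss and the unnormalized Laplacian is shown below; for simplicity it drops the ||f||^2 term and solves directly for the function values on the l + u graph nodes (a transductive variant). The Gaussian edge weights and the bandwidth σ are illustrative choices.

    import numpy as np

    def gaussian_graph(X, sigma=1.0):
        # pairwise Gaussian edge weights w_ij = exp(-||x_i - x_j||^2 / sigma^2)
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-sq / sigma ** 2)
        np.fill_diagonal(W, 0.0)
        return W

    def graph_ssl(X, y_labeled, lam2=1.0, sigma=1.0):
        # X stacks the l labeled instances first, then the u unlabeled ones
        n, l = len(X), len(y_labeled)
        W = gaussian_graph(X, sigma)
        L = np.diag(W.sum(1)) - W                 # unnormalized graph Laplacian
        # indicator of labeled nodes and zero-padded targets
        C = np.zeros((n, n))
        C[:l, :l] = np.eye(l) / l
        y = np.zeros(n)
        y[:l] = y_labeled
        # minimize (1/l) sum_i (f_i - y_i)^2 + lam2 * f' L f  =>  (C + lam2 L) f = C y
        f = np.linalg.solve(C + lam2 * L + 1e-9 * np.eye(n), C @ y)
        return f   # f[l:] are the predictions on the unlabeled nodes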

Co-training and Multiview Models


This semi-supervised learning method assumes that there are multiple, dif-
ferent learners trained on the same labeled data, and these learners agree
on the unlabeled data. A classic algorithm is co-training [5]. Take the ex-
ample of web page classification, where each web page x is represented by
two subsets of features, or “views” x = ⟨x^(1), x^(2)⟩. For instance, x^(1) can
represent the words on the page itself, and x(2) the words on the hyper-
links (on other web pages) pointing to this page. The co-training algorithm
trains two predictors: f (1) on x(1) (ignoring the x(2) portion of the feature)
and f (2) on x(2) , both initially from the labeled data. If f (1) confidently
predicts the label of an unlabeled instance x, then the instance-label pair
(x, f (1) (x)) is added to f (2) ’s labeled training data, and vice versa. Note this
promotes f (1) and f (2) to predict the same on x. This repeats so that each
view teaches the other. Multiview models generalize co-training by utilizing
more than two predictors, and relaxing the requirement of having separate
views [11]. In either case, the final prediction is obtained from a (confidence
weighted) average or vote among the predictors.
To define the implicit ordering on the hypothesis space, we need a slight
extension. In general, let there be m predictors f^(1), ..., f^(m). Now let a
hypothesis be an m-tuple of predictors ⟨f^(1), ..., f^(m)⟩. The disagreement of
a tuple on the unlabeled data can be defined as

    Σ_{i=l+1}^{l+u} Σ_{u,v=1}^{m} c(f^(u)(x_i), f^(v)(x_i)),

where c() is a loss function. Typical choices of c() are the 0-1 loss for
classification, and the squared loss for regression. Then the disagreement
induces an implicit ordering on tuples from small to large.
It is important for these m predictors to be of diverse types, and have
different inductive biases. In general, each predictor f (u) , u = 1 . . . m may
be evaluated by its individual loss function c(u) and regularizer Ω(u) . To find
a hypothesis (i.e., m predictors) that fits the labeled data well and ranks
high, one can minimize the following objective:

    argmin_{⟨f^(1),...,f^(m)⟩}  Σ_{u=1}^{m} [ (1/l) Σ_{i=1}^{l} c^(u)(f^(u)(x_i), y_i) + λ_1 Ω^(u)(f^(u)) ]
                                + λ_2 Σ_{i=l+1}^{l+u} Σ_{u,v=1}^{m} c(f^(u)(x_i), f^(v)(x_i)).

Multiview learning typically optimizes this objective directly. When the loss
functions and regularizers are convex, numerical solution is relatively easy
to obtain. In the special cases when the loss functions are the squared loss,
and the regularizers are squared `2 norms, there is a closed form solution.
On the other hand, the co-training algorithm, as presented earlier, optimizes
the objective indirectly with an iterative procedure. One advantage of co-
training is that the algorithm is a wrapper method, in that it can use any
“blackbox” learners f (1) and f (2) without the need to modify the learners.
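A sketch of the co-training wrapper is given below. It assumes two “blackbox” learners with scikit-learn-style fit/predict_proba interfaces, feature arrays as numpy matrices, and class labels encoded as 0, 1, ...; the confidence threshold, number of rounds, and number of instances added per round are illustrative assumptions, and for simplicity confidently labeled instances are added to a single shared labeled set used by both views.

    import numpy as np

    def co_train(clf1, clf2, X1_l, X2_l, y_l, X1_u, X2_u,
                 n_rounds=10, per_round=5, conf=0.9):
        X1_l, X2_l, y_l = list(X1_l), list(X2_l), list(y_l)
        unl = list(range(len(X1_u)))              # indices of still-unlabeled instances
        for _ in range(n_rounds):
            if not unl:
                break
            clf1.fit(np.array(X1_l), np.array(y_l))
            clf2.fit(np.array(X2_l), np.array(y_l))
            for clf, Xu_view in ((clf1, X1_u), (clf2, X2_u)):
                if not unl:
                    break
                probs = clf.predict_proba(Xu_view[unl])
                best = np.argsort(probs.max(axis=1))[::-1][:per_round]
                newly_labeled = []
                for j in best:
                    idx = unl[j]
                    if probs[j].max() >= conf:
                        # the confident learner labels the instance for both views
                        # (argmax index is the label, assuming labels 0..k-1)
                        X1_l.append(X1_u[idx]); X2_l.append(X2_u[idx])
                        y_l.append(int(probs[j].argmax()))
                        newly_labeled.append(idx)
                unl = [i for i in unl if i not in newly_labeled]
        clf1.fit(np.array(X1_l), np.array(y_l))
        clf2.fit(np.array(X2_l), np.array(y_l))
        return clf1, clf2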

A PAC Bound for Semi-Supervised Learning


Previously, we presented several semi-supervised learning methods, each of
which induces an implicit ordering on the hypothesis space using the unlabeled
training data and attempts to find a hypothesis that fits the labeled training
data well and ranks high in that implicit ordering. We now present a
theoretical justification on why this is a good idea. In particular, we present
a uniform convergence bound by Balcan and Blum (Theorem 11 in [2]).
Alternative theoretical analyses on semi-supervised learning can be found
by following the recommended reading.

First, we introduce some notation. Consider the 0-1 loss for classification.
Let c* : X ↦ {0, 1} be the unknown target function, which may not be in F.
Let err(f) = E_{x∼p}[f(x) ≠ c*(x)] be the true error rate of a hypothesis f, and
err̂(f) = (1/l) Σ_{i=1}^{l} 1[f(x_i) ≠ c*(x_i)] be the empirical error rate of f on the
labeled training sample. To characterize the implicit ordering, we define an
“unlabeled error rate” err_unl(f) = 1 − E_{x∼p}[χ(f, x)], where the compatibility
function χ : F × X ↦ [0, 1] measures how “compatible” f is with an unlabeled
instance x. As an example, in semi-supervised support vector machines, if x is
far away from the decision boundary produced by f, then χ(f, x) is large; but
if x is close to the decision boundary, χ(f, x) is small. In this example, a large
err_unl(f) then means that the decision boundary of f cuts through dense
unlabeled data regions, and thus f is undesirable for semi-supervised learning.
In contrast, a small err_unl(f) means that the decision boundary of f lies in a
low-density gap, which is more desirable. In theory, the implicit ordering on
f ∈ F is to sort err_unl(f) from small to large. In practice, we use the empirical
unlabeled error rate err̂_unl(f) = 1 − (1/u) Σ_{i=l+1}^{l+u} χ(f, x_i).
Our goal is to show that if an f ∈ F “fits the labeled data well and
ranks high”, then f is almost as good as the best hypothesis in F. Let
t ∈ [0, 1]. We first consider the best hypothesis ft∗ in the subset of F
that consists of hypotheses whose unlabeled error rate is no worse than
t: f_t* = argmin_{f'∈F, err_unl(f')≤t} err(f'). Obviously, t = 1 gives the best
hypothesis in the whole F. However, the nature of the guarantee has the
form err(f ) ≤ err(ft∗ ) + EstimationError(t) + c, where the EstimationError
term increases with t. Thus, with t = 1 the bound can be loose. On the
other hand, if t is close to 0, EstimationError(t) is small, but err(ft∗ ) can
be much worse than err(f*_{t=1}). The bound will account for the optimal t.

We introduce a few more definitions. Let F(f) = {f' ∈ F : err̂_unl(f') ≤
err̂_unl(f)} be the subset of F with empirical unlabeled error no worse than
that of f. As a complexity measure, let [F(f)] be the number of different
partitions of the first l unlabeled instances x_{l+1}, ..., x_{2l} using f ∈ F(f).
Finally, let ε̂(f) = √( (24/l) log(8[F(f)]) ). Then we have the following agnostic
bound (meaning that c* may not be in F, and err̂_unl(f) may not be zero for
any f ∈ F):

Theorem 1. Given l labeled instances and sufficient unlabeled instances,
with probability at least 1 − δ, the function

    f = argmin_{f'∈F} ( err̂(f') + ε̂(f') )

satisfies the guarantee that

    err(f) ≤ min_t ( err(f_t*) + ε̂(f_t*) ) + 5 √( log(8/δ) / l ).
If a function f fits the labeled data well, it has a small err̂(f). If it ranks
high, then F(f) will be a small set, and consequently ε̂(f) is small. The
argmin operator identifies the best such function during training. The bound
accounts for the minimum over all possible t tradeoffs. Therefore, we see that
the “lucky” case is when the implicit ordering is good, such that f*_{t=1}, the
best hypothesis in F, is near the top of the ranking. This is when semi-
supervised learning is expected to perform well. Balcan and Blum also give
results addressing the key issue of how much unlabeled data is needed for
err̂_unl(f) and err_unl(f) to be close for all f ∈ F.

Applications
Because the type of semi-supervised learning discussed in this entry has the
same goal of creating a predictor as supervised learning, it is applicable to
essentially any problem where supervised learning can be applied. For ex-
ample, semi-supervised learning has been applied to natural language pro-
cessing (word sense disambiguation [13], document categorization, named
entity classification, sentiment analysis, machine translation), computer vi-
sion (object recognition, image segmentation), bioinformatics (protein func-
tion prediction), and cognitive psychology. See the recommended reading
for individual papers.

Future Directions
There are several directions in which to further enhance the value of semi-supervised
learning. First, we need guarantees that it will outperform supervised learn-
ing. Currently, the practitioner has to manually choose a particular semi-
supervised learning method, and often manually set learning parameters.
Sometimes, a bad choice that does not match the task (e.g., modeling each
class with a Gaussian when the data does not have this distribution) can
make semi-supervised learning worse than supervised learning. Second, we
need methods that benefit from unlabeled data when l, the size of the labeled data,
is large. It has been widely observed that the gain over supervised learn-
ing is the largest when l is small, but diminishes as l increases. Third, we
need good ways to combine semi-supervised learning and active learning. In

natural learning systems such as humans, we routinely observe unlabeled in-
put, which often naturally leads to questions. And finally, we need methods
that can efficiently process massive unlabeled data, especially in an online
learning setting.

Cross References
active learning, classification, constrained clustering, dimensionality reduc-
tion, online learning, regression, supervised learning, unsupervised learning

Recommended Reading

[1] S. Abney. Semisupervised Learning for Computational Linguistics. Chapman & Hall/CRC, 2007.

[2] M.-F. Balcan and A. Blum. A discriminative model for semi-supervised learning. Journal of the ACM, 2009.

[3] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, November 2006.

[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proc. 18th International Conf. on Machine Learning, 2001.

[5] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory, 1998.

[6] V. Castelli and T. Cover. The exponential value of labeled samples. Pattern Recognition Letters, 16(1):105–111, 1995.

[7] O. Chapelle, A. Zien, and B. Schölkopf, editors. Semi-supervised Learning. MIT Press, 2006.

[8] T. Joachims. Transductive inference for text classification using support vector machines. In Proc. 16th International Conf. on Machine Learning, pages 200–209. Morgan Kaufmann, San Francisco, CA, 1999.

[9] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.

[10] M. Seeger. Learning with labeled and unlabeled data. Technical report, University of Edinburgh, 2001.

[11] V. Sindhwani, P. Niyogi, and M. Belkin. A co-regularized approach to semi-supervised learning with multiple views. In Proc. of the 22nd ICML Workshop on Learning with Multiple Views, August 2005.

[12] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

[13] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, 1995.

[14] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In The 20th International Conference on Machine Learning (ICML), 2003.

[15] X. Zhu and A. B. Goldberg. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2009.
Journal of Artificial Intelligence Research 4 (1996) 237–285. Submitted 9/95; published 5/96.

Reinforcement Learning: A Survey


Leslie Pack Kaelbling [email protected]
Michael L. Littman [email protected]
Computer Science Department, Box 1910, Brown University
Providence, RI 02912-1910 USA
Andrew W. Moore [email protected]
Smith Hall 221, Carnegie Mellon University, 5000 Forbes Avenue
Pittsburgh, PA 15213 USA

Abstract
This paper surveys the field of reinforcement learning from a computer-science per-
spective. It is written to be accessible to researchers familiar with machine learning. Both
the historical basis of the field and a broad selection of current work are summarized.
Reinforcement learning is the problem faced by an agent that learns behavior through
trial-and-error interactions with a dynamic environment. The work described here has a
resemblance to work in psychology, but differs considerably in the details and in the use
of the word “reinforcement.” The paper discusses central issues of reinforcement learning,
including trading off exploration and exploitation, establishing the foundations of the field
via Markov decision theory, learning from delayed reinforcement, constructing empirical
models to accelerate learning, making use of generalization and hierarchy, and coping with
hidden state. It concludes with a survey of some implemented systems and an assessment
of the practical utility of current methods for reinforcement learning.

1. Introduction
Reinforcement learning dates back to the early days of cybernetics and work in statistics,
psychology, neuroscience, and computer science. In the last five to ten years, it has attracted
rapidly increasing interest in the machine learning and artificial intelligence communities.
Its promise is beguiling: a way of programming agents by reward and punishment without
needing to specify how the task is to be achieved. But there are formidable computational
obstacles to fulfilling the promise.
This paper surveys the historical basis of reinforcement learning and some of the current
work from a computer science perspective. We give a high-level overview of the field and a
taste of some specific approaches. It is, of course, impossible to mention all of the important
work in the field; this should not be taken to be an exhaustive account.
Reinforcement learning is the problem faced by an agent that must learn behavior
through trial-and-error interactions with a dynamic environment. The work described here
has a strong family resemblance to eponymous work in psychology, but differs considerably
in the details and in the use of the word “reinforcement.” It is appropriately thought of as
a class of problems, rather than as a set of techniques.
There are two main strategies for solving reinforcement-learning problems. The first is to
search in the space of behaviors in order to find one that performs well in the environment.
This approach has been taken by work in genetic algorithms and genetic programming,

Figure 1: The standard reinforcement-learning model (agent and environment connected by state s, input function I and input i, behavior B, action a, and reinforcement signal r).

as well as some more novel search techniques (Schmidhuber, 1996). The second is to use
statistical techniques and dynamic programming methods to estimate the utility of taking
actions in states of the world. This paper is devoted almost entirely to the second set of
techniques because they take advantage of the special structure of reinforcement-learning
problems that is not available in optimization problems in general. It is not yet clear which
set of approaches is best in which circumstances.
The rest of this section is devoted to establishing notation and describing the basic
reinforcement-learning model. Section 2 explains the trade-off between exploration and
exploitation and presents some solutions to the most basic case of reinforcement-learning
problems, in which we want to maximize the immediate reward. Section 3 considers the more
general problem in which rewards can be delayed in time from the actions that were crucial
to gaining them. Section 4 considers some classic model-free algorithms for reinforcement
learning from delayed reward: adaptive heuristic critic, TD(λ) and Q-learning. Section 5
demonstrates a continuum of algorithms that are sensitive to the amount of computation an
agent can perform between actual steps of action in the environment. Generalization, the
cornerstone of mainstream machine learning research, has the potential of considerably
aiding reinforcement learning, as described in Section 6. Section 7 considers the problems
that arise when the agent does not have complete perceptual access to the state of the
environment. Section 8 catalogs some of reinforcement learning's successful applications.
Finally, Section 9 concludes with some speculations about important open problems and
the future of reinforcement learning.

1.1 Reinforcement-Learning Model


In the standard reinforcement-learning model, an agent is connected to its environment
via perception and action, as depicted in Figure 1. On each step of interaction the agent
receives as input, i, some indication of the current state, s, of the environment; the agent
then chooses an action, a, to generate as output. The action changes the state of the
environment, and the value of this state transition is communicated to the agent through
a scalar reinforcement signal, r. The agent's behavior, B , should choose actions that tend
to increase the long-run sum of values of the reinforcement signal. It can learn to do this
over time by systematic trial and error, guided by a wide variety of algorithms that are the
subject of later sections of this paper.

Formally, the model consists of

- a discrete set of environment states, S;
- a discrete set of agent actions, A; and
- a set of scalar reinforcement signals, typically {0, 1} or the real numbers.

The figure also includes an input function I, which determines how the agent views the
environment state; we will assume that it is the identity function (that is, the agent perceives
the exact state of the environment) until we consider partial observability in Section 7.
An intuitive way to understand the relation between the agent and its environment is
with the following example dialogue.
Environment: You are in state 65. You have 4 possible actions.
Agent: I'll take action 2.
Environment: You received a reinforcement of 7 units. You are now in state
15. You have 2 possible actions.
Agent: I'll take action 1.
Environment: You received a reinforcement of -4 units. You are now in state
65. You have 4 possible actions.
Agent: I'll take action 2.
Environment: You received a reinforcement of 5 units. You are now in state
44. You have 5 possible actions.
...
The agent's job is to find a policy π, mapping states to actions, that maximizes some
long-run measure of reinforcement. We expect, in general, that the environment will be
non-deterministic; that is, taking the same action in the same state on two different
occasions may result in different next states and/or different reinforcement values. This
happens in our example above: from state 65, applying action 2 produces differing
reinforcements and differing states on two occasions. However, we assume the environment is
stationary; that is, the probabilities of making state transitions or receiving specific
reinforcement signals do not change over time.1
Reinforcement learning differs from the more widely studied problem of supervised learning
in several ways. The most important difference is that there is no presentation of
input/output pairs. Instead, after choosing an action the agent is told the immediate reward
and the subsequent state, but is not told which action would have been in its best long-term
interests. It is necessary for the agent to gather useful experience about the possible system
states, actions, transitions and rewards actively to act optimally. Another difference from
supervised learning is that on-line performance is important: the evaluation of the system
is often concurrent with learning.
1. This assumption may be disappointing; after all, operation in non-stationary environments is one of the
motivations for building learning systems. In fact, many of the algorithms described in later sections
are effective in slowly-varying non-stationary environments, but there is very little theoretical analysis
in this area.


Some aspects of reinforcement learning are closely related to search and planning issues
in artificial intelligence. AI search algorithms generate a satisfactory trajectory through a
graph of states. Planning operates in a similar manner, but typically within a construct
with more complexity than a graph, in which states are represented by compositions of
logical expressions instead of atomic symbols. These AI algorithms are less general than the
reinforcement-learning methods, in that they require a predefined model of state transitions,
and with a few exceptions assume determinism. On the other hand, reinforcement learning,
at least in the kind of discrete cases for which theory has been developed, assumes that
the entire state space can be enumerated and stored in memory, an assumption to which
conventional search algorithms are not tied.
1.2 Models of Optimal Behavior
Before we can start thinking about algorithms for learning to behave optimally, we have
to decide what our model of optimality will be. In particular, we have to specify how the
agent should take the future into account in the decisions it makes about how to behave
now. There are three models that have been the subject of the majority of work in this
area.
The finite-horizon model is the easiest to think about: at a given moment in time, the
agent should optimize its expected reward for the next h steps:

    E( Σ_{t=0}^{h} r_t );
it need not worry about what will happen after that. In this and subsequent expressions,
r_t represents the scalar reward received t steps into the future. This model can be used in
two ways. In the first, the agent will have a non-stationary policy; that is, one that changes
over time. On its first step it will take what is termed an h-step optimal action. This is
defined to be the best action available given that it has h steps remaining in which to act
and gain reinforcement. On the next step it will take an (h − 1)-step optimal action, and so
on, until it finally takes a 1-step optimal action and terminates. In the second, the agent
does receding-horizon control, in which it always takes the h-step optimal action. The agent
always acts according to the same policy, but the value of h limits how far ahead it looks
in choosing its actions. The finite-horizon model is not always appropriate. In many cases
we may not know the precise length of the agent's life in advance.
The infinite-horizon discounted model takes the long-run reward of the agent into
account, but rewards that are received in the future are geometrically discounted according
to the discount factor γ (where 0 ≤ γ < 1):

    E( Σ_{t=0}^{∞} γ^t r_t ).

We can interpret γ in several ways. It can be seen as an interest rate, a probability of living
another step, or as a mathematical trick to bound the infinite sum. The model is conceptually
similar to receding-horizon control, but the discounted model is more mathematically
tractable than the finite-horizon model. This is a dominant reason for the wide attention
this model has received.
240
Reinforcement Learning: A Survey

Another optimality criterion is the average-reward model, in which the agent is supposed
to take actions that optimize its long-run average reward:

    lim_{h→∞} E( (1/h) Σ_{t=0}^{h} r_t ).

Such a policy is referred to as a gain optimal policy; it can be seen as the limiting case of
the infinite-horizon discounted model as the discount factor approaches 1 (Bertsekas, 1995).
One problem with this criterion is that there is no way to distinguish between two policies,
one of which gains a large amount of reward in the initial phases and the other of which
does not. Reward gained on any initial prefix of the agent's life is overshadowed by the
long-run average performance. It is possible to generalize this model so that it takes into
account both the long-run average and the amount of initial reward that can be gained.
In the generalized, bias optimal model, a policy is preferred if it maximizes the long-run
average and ties are broken by the initial extra reward.
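To make the three criteria concrete, the short sketch below computes the finite-horizon, discounted, and average returns of an arbitrary example reward sequence (the sequence itself is not taken from the survey); the infinite-horizon sum is simply truncated at the end of the sequence.

    # compare the three optimality criteria on a fixed reward sequence
    def finite_horizon_return(rewards, h):
        return sum(rewards[:h])

    def discounted_return(rewards, gamma):
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    def average_reward(rewards):
        return sum(rewards) / len(rewards)

    rewards = [0, 0, 2, 0, 10, 0, 11, 0]      # arbitrary illustrative sequence
    print(finite_horizon_return(rewards, h=4))
    print(discounted_return(rewards, gamma=0.9))
    print(average_reward(rewards))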
Figure 2 contrasts these models of optimality by providing an environment in which
changing the model of optimality changes the optimal policy. In this example, circles
represent the states of the environment and arrows are state transitions. There is only
a single action choice from every state except the start state, which is in the upper left
and marked with an incoming arrow. All rewards are zero except where marked. Under a
finite-horizon model with h = 5, the three actions yield rewards of +6.0, +0.0, and +0.0, so
the first action should be chosen; under an infinite-horizon discounted model with γ = 0.9,
the three choices yield +16.2, +59.0, and +58.5, so the second action should be chosen;
and under the average reward model, the third action should be chosen since it leads to
an average reward of +11. If we change h to 1000 and γ to 0.2, then the second action is
optimal for the finite-horizon model and the first for the infinite-horizon discounted model;
however, the average reward model will always prefer the best long-term average. Since the
choice of optimality model and parameters matters so much, it is important to choose it
carefully in any application.
The finite-horizon model is appropriate when the agent's lifetime is known; one
important aspect of this model is that as the length of the remaining lifetime decreases, the
agent's policy may change. A system with a hard deadline would be appropriately modeled
this way. The relative usefulness of infinite-horizon discounted and bias-optimal models is
still under debate. Bias-optimality has the advantage of not requiring a discount parameter;
however, algorithms for finding bias-optimal policies are not yet as well-understood as those
for finding optimal infinite-horizon discounted policies.
1.3 Measuring Learning Performance
The criteria given in the previous section can be used to assess the policies learned by a
given algorithm. We would also like to be able to evaluate the quality of learning itself.
There are several incompatible measures in use.
Eventual convergence to optimal. Many algorithms come with a provable guar-
antee of asymptotic convergence to optimal behavior (Watkins & Dayan, 1992). This
is reassuring, but useless in practical terms. An agent that quickly reaches a plateau

Figure 2: Comparing models of optimality. All unlabeled arrows produce a reward of zero. (The three rows of the figure are labeled +2 for the finite-horizon model with h = 4, +10 for the infinite-horizon model with γ = 0.9, and +11 for the average-reward model.)

at 99% of optimality may, in many applications, be preferable to an agent that has a
guarantee of eventual optimality but a sluggish early learning rate.

Speed of convergence to optimality. Optimality is usually an asymptotic result,
and so convergence speed is an ill-defined measure. More practical is the speed of
convergence to near-optimality. This measure begs the definition of how near to
optimality is sufficient. A related measure is level of performance after a given time,
which similarly requires that someone define the given time.
It should be noted that here we have another difference between reinforcement learning
and conventional supervised learning. In the latter, expected future predictive accuracy
or statistical efficiency are the prime concerns. For example, in the well-known
PAC framework (Valiant, 1984), there is a learning period during which mistakes do
not count, then a performance period during which they do. The framework provides
bounds on the necessary length of the learning period in order to have a probabilistic
guarantee on the subsequent performance. That is usually an inappropriate view for
an agent with a long existence in a complex environment.
In spite of the mismatch between embedded reinforcement learning and the train/test
perspective, Fiechter (1994) provides a PAC analysis for Q-learning (described in
Section 4.2) that sheds some light on the connection between the two views.
Measures related to speed of learning have an additional weakness. An algorithm
that merely tries to achieve optimality as fast as possible may incur unnecessarily
large penalties during the learning period. A less aggressive strategy, taking longer to
achieve optimality but gaining greater total reinforcement during its learning, might
be preferable.

Regret. A more appropriate measure, then, is the expected decrease in reward gained
due to executing the learning algorithm instead of behaving optimally from the very
beginning. This measure is known as regret (Berry & Fristedt, 1985). It penalizes
mistakes wherever they occur during the run. Unfortunately, results concerning the
regret of algorithms are quite hard to obtain.


1.4 Reinforcement Learning and Adaptive Control


Adaptive control (Burghes & Graham, 1980; Stengel, 1986) is also concerned with
algorithms for improving a sequence of decisions from experience. Adaptive control is a much
more mature discipline that concerns itself with dynamic systems in which states and
actions are vectors and system dynamics are smooth: linear or locally linearizable around a
desired trajectory. A very common formulation of cost functions in adaptive control is
quadratic penalties on deviation from desired state and action vectors. Most importantly,
although the dynamic model of the system is not known in advance, and must be estimated
from data, the structure of the dynamic model is fixed, leaving model estimation
as a parameter estimation problem. These assumptions permit deep, elegant and powerful
mathematical analysis, which in turn leads to robust, practical, and widely deployed adaptive
control algorithms.

2. Exploitation versus Exploration: The Single-State Case


One major difference between reinforcement learning and supervised learning is that a
reinforcement-learner must explicitly explore its environment. In order to highlight the
problems of exploration, we treat a very simple case in this section. The fundamental issues
and approaches described here will, in many cases, transfer to the more complex instances
of reinforcement learning discussed later in the paper.
The simplest possible reinforcement-learning problem is known as the k-armed bandit
problem, which has been the subject of a great deal of study in the statistics and applied
mathematics literature (Berry & Fristedt, 1985). The agent is in a room with a collection of
k gambling machines (each called a “one-armed bandit” in colloquial English). The agent is
permitted a fixed number of pulls, h. Any arm may be pulled on each turn. The machines
do not require a deposit to play; the only cost is in wasting a pull playing a suboptimal
machine. When arm i is pulled, machine i pays off 1 or 0, according to some underlying
probability parameter p_i, where payoffs are independent events and the p_i's are unknown.
What should the agent's strategy be?
This problem illustrates the fundamental tradeoff between exploitation and exploration.
The agent might believe that a particular arm has a fairly high payoff probability; should
it choose that arm all the time, or should it choose another one that it has less information
about, but seems to be worse? Answers to these questions depend on how long the agent
is expected to play the game; the longer the game lasts, the worse the consequences of
prematurely converging on a sub-optimal arm, and the more the agent should explore.
There is a wide variety of solutions to this problem. We will consider a representative
selection of them, but for a deeper discussion and a number of important theoretical results,
see the book by Berry and Fristedt (1985). We use the term “action” to indicate the
agent's choice of arm to pull. This eases the transition into delayed reinforcement models
in Section 3. It is very important to note that bandit problems fit our definition of a
reinforcement-learning environment with a single state with only self transitions.
Section 2.1 discusses three solutions to the basic one-state bandit problem that have
formal correctness results. Although they can be extended to problems with real-valued
rewards, they do not apply directly to the general multi-state delayed-reinforcement case.

Section 2.2 presents three techniques that are not formally justified, but that have had wide
use in practice, and can be applied (with similar lack of guarantee) to the general case.
2.1 Formally Justified Techniques
There is a fairly well-developed formal theory of exploration for very simple problems.
Although it is instructive, the methods it provides do not scale well to more complex
problems.
2.1.1 Dynamic-Programming Approach
If the agent is going to be acting for a total of h steps, it can use basic Bayesian reasoning
to solve for an optimal strategy (Berry & Fristedt, 1985). This requires an assumed prior
joint distribution for the parameters {p_i}, the most natural of which is that each p_i is
independently uniformly distributed between 0 and 1. We compute a mapping from belief
states (summaries of the agent's experiences during this run) to actions. Here, a belief state
can be represented as a tabulation of action choices and payoffs: {n_1, w_1, n_2, w_2, ..., n_k, w_k}
denotes a state of play in which each arm i has been pulled n_i times with w_i payoffs. We
write V*(n_1, w_1, ..., n_k, w_k) as the expected payoff remaining, given that a total of h pulls
are available, and we use the remaining pulls optimally.
If Σ_i n_i = h, then there are no remaining pulls, and V*(n_1, w_1, ..., n_k, w_k) = 0. This is
the basis of a recursive definition. If we know the V* value for all belief states with t pulls
remaining, we can compute the V* value of any belief state with t + 1 pulls remaining:

    V*(n_1, w_1, ..., n_k, w_k) = max_i E[ future payoff if the agent takes action i, then acts optimally for the remaining pulls ]
                                = max_i [ ρ_i (1 + V*(n_1, w_1, ..., n_i + 1, w_i + 1, ..., n_k, w_k))
                                          + (1 − ρ_i) V*(n_1, w_1, ..., n_i + 1, w_i, ..., n_k, w_k) ],

where ρ_i is the posterior subjective probability of action i paying off given n_i, w_i and
our prior probability. For the uniform priors, which result in a beta distribution,
ρ_i = (w_i + 1)/(n_i + 2).
The expense of filling in the table of V* values in this way for all attainable belief states
is linear in the number of belief states times actions, and thus exponential in the horizon.
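The following sketch implements this recursion directly, assuming uniform Beta(1, 1) priors and memoizing on the belief-state tuple; even with memoization the table of belief states grows exponentially in the horizon, as noted above.

    from functools import lru_cache

    def optimal_bandit_value(k, h):
        # V(belief) = expected payoff remaining from belief (n_1, w_1, ..., n_k, w_k)
        @lru_cache(maxsize=None)
        def V(belief):
            pulls_used = sum(belief[0::2])
            if pulls_used == h:
                return 0.0
            best = 0.0
            for i in range(k):
                n, w = belief[2 * i], belief[2 * i + 1]
                rho = (w + 1) / (n + 2)      # posterior mean under the uniform prior
                win = list(belief);  win[2 * i] += 1;  win[2 * i + 1] += 1
                lose = list(belief); lose[2 * i] += 1
                value = rho * (1 + V(tuple(win))) + (1 - rho) * V(tuple(lose))
                best = max(best, value)
            return best

        return V(tuple([0, 0] * k))

    print(optimal_bandit_value(k=2, h=5))    # expected total payoff when acting optimally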
2.1.2 Gittins Allocation Indices
Gittins gives an “allocation index” method for finding the optimal choice of action at each
step in k-armed bandit problems (Gittins, 1989). The technique only applies under the
discounted expected reward criterion. For each action, consider the number of times it has
been chosen, n, versus the number of times it has paid off, w. For certain discount factors,
there are published tables of “index values,” I(n, w), for each pair of n and w. Look up
the index value for each action i, I(n_i, w_i). It represents a comparative measure of the
combined value of the expected payoff of action i (given its history of payoffs) and the value
of the information that we would get by choosing it. Gittins has shown that choosing the
action with the largest index value guarantees the optimal balance between exploration and
exploitation.

Figure 3: A Tsetlin automaton with 2N states. The top row shows the state transitions
that are made when the previous action resulted in a reward of 1; the bottom
row shows transitions after a reward of 0. In states in the left half of the figure
(states 1 through N), action 0 is taken; in those on the right (states N + 1 through 2N),
action 1 is taken.

Because of the guarantee of optimal exploration and the simplicity of the technique
(given the table of index values), this approach holds a great deal of promise for use in more
complex applications. This method proved useful in an application to robotic manipulation
with immediate reward (Salganicoff & Ungar, 1995). Unfortunately, no one has yet been
able to find an analog of index values for delayed reinforcement problems.
2.1.3 Learning Automata
A branch of the theory of adaptive control is devoted to learning automata, surveyed by
Narendra and Thathachar (1989), which were originally described explicitly as finite state
automata. The Tsetlin automaton shown in Figure 3 provides an example that solves a
2-armed bandit arbitrarily near optimally as N approaches infinity.
It is inconvenient to describe algorithms as finite-state automata, so a move was made
to describe the internal state of the agent as a probability distribution according to which
actions would be chosen. The probabilities of taking different actions would be adjusted
according to their previous successes and failures.
An example, which stands among a set of algorithms independently developed in the
mathematical psychology literature (Hilgard & Bower, 1975), is the linear reward-inaction
algorithm. Let p_i be the agent's probability of taking action i.
When action a_i succeeds,

    p_i := p_i + α(1 − p_i)
    p_j := p_j − α p_j   for j ≠ i.

When action a_i fails, p_j remains unchanged (for all j).
This algorithm converges with probability 1 to a vector containing a single 1 and the
rest 0's (choosing a particular action with probability 1). Unfortunately, it does not always
converge to the correct action, but the probability that it converges to the wrong one can
be made arbitrarily small by making α small (Narendra & Thathachar, 1974). There is no
literature on the regret of this algorithm.
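The sketch below simulates the linear reward-inaction update on a two-armed Bernoulli bandit; the payoff probabilities, step size α, and number of steps are illustrative assumptions.

    import random

    def linear_reward_inaction(payoff_probs, alpha=0.01, n_steps=100_000, seed=0):
        rng = random.Random(seed)
        k = len(payoff_probs)
        p = [1.0 / k] * k                      # action probabilities
        for _ in range(n_steps):
            i = rng.choices(range(k), weights=p)[0]
            success = rng.random() < payoff_probs[i]
            if success:                        # reward-inaction: update only on success
                p = [pj + alpha * (1 - pj) if j == i else pj - alpha * pj
                     for j, pj in enumerate(p)]
        return p

    print(linear_reward_inaction([0.2, 0.8]))  # usually converges toward the better arm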

2.2 Ad-Hoc Techniques


In reinforcement-learning practice, some simple, ad hoc strategies have been popular. They
are rarely, if ever, the best choice for the models of optimality we have used, but they may
be viewed as reasonable, computationally tractable heuristics. Thrun (1992) has surveyed
a variety of these techniques.
2.2.1 Greedy Strategies
The first strategy that comes to mind is to always choose the action with the highest
estimated payoff. The flaw is that early unlucky sampling might indicate that the best action's
reward is less than the reward obtained from a suboptimal action. The suboptimal action
will always be picked, leaving the true optimal action starved of data and its superiority
never discovered. An agent must explore to ameliorate this outcome.
A useful heuristic is optimism in the face of uncertainty, in which actions are selected
greedily, but strongly optimistic prior beliefs are put on their payoffs so that strong negative
evidence is needed to eliminate an action from consideration. This still has a measurable
danger of starving an optimal but unlucky action, but the risk of this can be made arbitrar-
ily small. Techniques like this have been used in several reinforcement learning algorithms
including the interval exploration method (Kaelbling, 1993b) (described shortly), the ex-
ploration bonus in Dyna (Sutton, 1990), curiosity-driven exploration (Schmidhuber, 1991a),
and the exploration mechanism in prioritized sweeping (Moore & Atkeson, 1993).
2.2.2 Randomized Strategies
Another simple exploration strategy is to take the action with the best estimated expected
reward by default, but with probability p, choose an action at random. Some versions of
this strategy start with a large value of p to encourage initial exploration, which is slowly
decreased.
An objection to the simple strategy is that when it experiments with a non-greedy action
it is no more likely to try a promising alternative than a clearly hopeless alternative. A
slightly more sophisticated strategy is Boltzmann exploration. In this case, the expected
reward for taking action a, ER(a) is used to choose an action probabilistically according to
the distribution

    P(a) = e^{ER(a)/T} / Σ_{a'∈A} e^{ER(a')/T}.

The temperature parameter T can be decreased over time to decrease exploration. This
method works well if the best action is well separated from the others, but suffers somewhat
when the values of the actions are close. It may also converge unnecessarily slowly unless
the temperature schedule is manually tuned with great care.
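A minimal sketch of Boltzmann exploration is shown below; the particular temperature schedule in the usage example is an illustrative assumption.

    import math
    import random

    def boltzmann_action(expected_rewards, temperature, rng=random):
        # subtract the max for numerical stability before exponentiating
        m = max(expected_rewards)
        weights = [math.exp((er - m) / temperature) for er in expected_rewards]
        return rng.choices(range(len(weights)), weights=weights)[0]

    # usage: the temperature decays over time so exploration gradually decreases
    er = [0.2, 0.5, 0.1]
    for t in range(1, 6):
        T = 1.0 / t
        print(boltzmann_action(er, T))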
2.2.3 Interval-based Techniques
Exploration is often more efficient when it is based on second-order information about the
certainty or variance of the estimated values of actions. Kaelbling's interval estimation
algorithm (1993b) stores statistics for each action a_i: w_i is the number of successes and n_i
the number of trials. An action is chosen by computing the upper bound of a 100 · (1 − α)%

confidence interval on the success probability of each action and choosing the action with
the highest upper bound. Smaller values of the α parameter encourage greater exploration.
When payoffs are boolean, the normal approximation to the binomial distribution can be
used to construct the confidence interval (though the binomial should be used for small
n). Other payoff distributions can be handled using their associated statistics or with
nonparametric methods. The method works very well in empirical trials. It is also related
to a certain class of statistical techniques known as experiment design methods (Box &
Draper, 1987), which are used for comparing multiple treatments (for example, fertilizers
or drugs) to determine which treatment (if any) is best in as small a set of experiments as
possible.
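The sketch below illustrates interval-estimation action selection for boolean payoffs using the normal approximation mentioned above; the fixed z-value and the optimistic bound assigned to untried actions are illustrative assumptions.

    import math

    def interval_estimation_action(successes, trials, z=1.96):
        best_action, best_ucb = None, -float("inf")
        for a, (w, n) in enumerate(zip(successes, trials)):
            if n == 0:
                ucb = 1.0                      # untried actions get the maximal bound
            else:
                p_hat = w / n
                ucb = p_hat + z * math.sqrt(p_hat * (1 - p_hat) / n)
            if ucb > best_ucb:
                best_action, best_ucb = a, ucb
        return best_action

    print(interval_estimation_action(successes=[3, 1], trials=[10, 2]))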
2.3 More General Problems
When there are multiple states, but reinforcement is still immediate, then any of the above
solutions can be replicated, once for each state. However, when generalization is required,
these solutions must be integrated with generalization methods (see Section 6); this is
straightforward for the simple ad-hoc methods, but it is not understood how to maintain
theoretical guarantees.
Many of these techniques focus on converging to some regime in which exploratory
actions are taken rarely or never; this is appropriate when the environment is stationary.
However, when the environment is non-stationary, exploration must continue to take place,
in order to notice changes in the world. Again, the more ad-hoc techniques can be modified
to deal with this in a plausible manner (keep temperature parameters from going to 0; decay
the statistics in interval estimation), but none of the theoretically guaranteed methods can
be applied.

3. Delayed Reward
In the general case of the reinforcement learning problem, the agent's actions determine
not only its immediate reward, but also (at least probabilistically) the next state of the
environment. Such environments can be thought of as networks of bandit problems, but
the agent must take into account the next state as well as the immediate reward when it
decides which action to take. The model of long-run optimality the agent is using determines
exactly how it should take the value of the future into account. The agent will have to be
able to learn from delayed reinforcement: it may take a long sequence of actions, receiving
insignificant reinforcement, then finally arrive at a state with high reinforcement. The agent
must be able to learn which of its actions are desirable based on reward that can take place
arbitrarily far in the future.
3.1 Markov Decision Processes
Problems with delayed reinforcement are well modeled as Markov decision processes (MDPs).
An MDP consists of
a set of states S ,
a set of actions A,
247
Kaelbling, Littman, & Moore

a reward function R : S  A ! <, and


a state transition function T : SA ! (S ), where a member of (S ) is a probability
distribution over the set S (i.e. it maps states to probabilities). We write T (s a s0)
for the probability of making a transition from state s to state s0 using action a.
The state transition function probabilistically specifies the next state of the environment as
a function of its current state and the agent's action. The reward function specifies expected
instantaneous reward as a function of the current state and action. The model is Markov if
the state transitions are independent of any previous environment states or agent actions.
There are many good references to MDP models (Bellman, 1957; Bertsekas, 1987; Howard,
1960; Puterman, 1994).
Although general MDPs may have infinite (even uncountable) state and action spaces,
we will only discuss methods for solving finite-state and finite-action problems. In Section 6,
we discuss methods for solving problems with continuous input and output spaces.
3.2 Finding a Policy Given a Model
Before we consider algorithms for learning to behave in MDP environments, we will
explore techniques for determining the optimal policy given a correct model. These dynamic
programming techniques will serve as the foundation and inspiration for the learning
algorithms to follow. We restrict our attention mainly to finding optimal policies for the
infinite-horizon discounted model, but most of these algorithms have analogs for the finite-horizon
and average-case models as well. We rely on the result that, for the infinite-horizon
discounted model, there exists an optimal deterministic stationary policy (Bellman, 1957).
We will speak of the optimal value of a state: it is the expected infinite discounted sum
of reward that the agent will gain if it starts in that state and executes the optimal policy.
Using π as a complete decision policy, it is written

    V*(s) = max_π E( Σ_{t=0}^{∞} γ^t r_t ).

This optimal value function is unique and can be defined as the solution to the simultaneous
equations

    V*(s) = max_a ( R(s, a) + γ Σ_{s'∈S} T(s, a, s') V*(s') ),   ∀s ∈ S,      (1)

which assert that the value of a state s is the expected instantaneous reward plus the
expected discounted value of the next state, using the best available action. Given the
optimal value function, we can specify the optimal policy as

    π*(s) = argmax_a ( R(s, a) + γ Σ_{s'∈S} T(s, a, s') V*(s') ).

3.2.1 Value Iteration


One way, then, to find an optimal policy is to find the optimal value function. It can
be determined by a simple iterative algorithm called value iteration that can be shown to
converge to the correct V* values (Bellman, 1957; Bertsekas, 1987).

initialize V(s) arbitrarily
loop until policy good enough
    loop for s ∈ S
        loop for a ∈ A
            Q(s, a) := R(s, a) + γ Σ_{s'∈S} T(s, a, s') V(s')
        V(s) := max_a Q(s, a)
    end loop
end loop
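A direct implementation of this pseudocode for a small finite MDP might look like the following sketch; the dictionary representation of R and T, the discount factor, and the Bellman-residual stopping threshold are illustrative assumptions.

    # R[s][a] is the expected reward; T[s][a] maps next states to probabilities
    def value_iteration(states, actions, R, T, gamma=0.9, epsilon=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                q = [R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                     for a in actions]
                new_v = max(q)
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < epsilon:           # stop when the Bellman residual is small
                break
        # greedy policy with respect to the final value function
        pi = {}
        for s in states:
            pi[s] = max(actions,
                        key=lambda a: R[s][a] + gamma *
                        sum(p * V[s2] for s2, p in T[s][a].items()))
        return V, pi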

It is not obvious when to stop the value iteration algorithm. One important result
bounds the performance of the current greedy policy as a function of the Bellman residual of
the current value function (Williams & Baird, 1993b). It says that if the maximum difference
between two successive value functions is less than ε, then the value of the greedy policy
(the policy obtained by choosing, in every state, the action that maximizes the estimated
discounted reward, using the current estimate of the value function) differs from the value
function of the optimal policy by no more than 2εγ/(1 − γ) at any state. This provides an
effective stopping criterion for the algorithm. Puterman (1994) discusses another stopping
criterion, based on the span semi-norm, which may result in earlier termination. Another
important result is that the greedy policy is guaranteed to be optimal in some finite number
of steps even though the value function may not have converged (Bertsekas, 1987). And in
practice, the greedy policy is often optimal long before the value function has converged.
Value iteration is very flexible. The assignments to V need not be done in strict order
as shown above, but instead can occur asynchronously in parallel provided that the value
of every state gets updated infinitely often on an infinite run. These issues are treated
extensively by Bertsekas (1989), who also proves convergence results.
Updates based on Equation 1 are known as full backups since they make use of
information from all possible successor states. It can be shown that updates of the form

    Q(s, a) := Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) )

can also be used as long as each pairing of a and s is updated infinitely often, s' is sampled
from the distribution T(s, a, ·), r is sampled with mean R(s, a) and bounded variance, and
the learning rate α is decreased slowly. This type of sample backup (Singh, 1993) is critical
to the operation of the model-free methods discussed in the next section.
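The sketch below shows this sample backup as it would be applied inside a learning loop (this is the update used by Q-learning, described in Section 4.2); the tabular dictionary representation of Q and the helper names env.step and choose_action are assumptions for illustration only.

    def sample_backup(Q, s, a, r, s_next, gamma, alpha):
        # Q is a dict of dicts: Q[state][action] -> estimated value
        target = r + gamma * max(Q[s_next].values())
        Q[s][a] += alpha * (target - Q[s][a])

    # usage inside a learning loop (env.step and choose_action are assumed helpers):
    #     a = choose_action(Q, s)              # e.g. Boltzmann or interval-based
    #     s_next, r = env.step(s, a)
    #     sample_backup(Q, s, a, r, s_next, gamma=0.9, alpha=0.1)
    #     s = s_next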
The computational complexity of the value-iteration algorithm with full backups, per
iteration, is quadratic in the number of states and linear in the number of actions. Com-
monly, the transition probabilities T(s, a, s') are sparse. If there are on average a constant
number of next states with non-zero probability then the cost per iteration is linear in the
number of states and linear in the number of actions. The number of iterations required to
reach the optimal value function is polynomial in the number of states and the magnitude
of the largest reward if the discount factor is held constant. However, in the worst case
the number of iterations grows polynomially in 1/(1 − γ), so the convergence rate slows
considerably as the discount factor approaches 1 (Littman, Dean, & Kaelbling, 1995b).

3.2.2 Policy Iteration


The policy iteration algorithm manipulates the policy directly, rather than finding it
indirectly via the optimal value function. It operates as follows:

choose an arbitrary policy π'
loop
    π := π'
    compute the value function of policy π:
        solve the linear equations
            V_π(s) = R(s, π(s)) + γ Σ_{s'∈S} T(s, π(s), s') V_π(s')
    improve the policy at each state:
        π'(s) := argmax_a ( R(s, a) + γ Σ_{s'∈S} T(s, a, s') V_π(s') )
until π = π'
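A sketch of this procedure, using the same dictionary-based model as the value-iteration sketch above and solving the policy-evaluation equations with numpy, is given below; the representation and the arbitrary initial policy are illustrative assumptions.

    import numpy as np

    def policy_iteration(states, actions, R, T, gamma=0.9):
        idx = {s: i for i, s in enumerate(states)}
        pi = {s: actions[0] for s in states}           # arbitrary initial policy
        while True:
            # policy evaluation: solve (I - gamma * P_pi) V = R_pi
            n = len(states)
            P = np.zeros((n, n))
            r = np.zeros(n)
            for s in states:
                r[idx[s]] = R[s][pi[s]]
                for s2, p in T[s][pi[s]].items():
                    P[idx[s], idx[s2]] = p
            V = np.linalg.solve(np.eye(n) - gamma * P, r)
            # policy improvement
            new_pi = {}
            for s in states:
                new_pi[s] = max(actions,
                                key=lambda a: R[s][a] + gamma *
                                sum(p * V[idx[s2]] for s2, p in T[s][a].items()))
            if new_pi == pi:
                return {s: V[idx[s]] for s in states}, pi
            pi = new_pi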

The value function of a policy is just the expected infinite discounted reward that will
be gained, at each state, by executing that policy. It can be determined by solving a set
of linear equations. Once we know the value of each state under the current policy, we
consider whether the value could be improved by changing the first action taken. If it can,
we change the policy to take the new action whenever it is in that situation. This step is
guaranteed to strictly improve the performance of the policy. When no improvements are
possible, then the policy is guaranteed to be optimal.
Since there are at most |A|^{|S|} distinct policies, and the sequence of policies improves at
each step, this algorithm terminates in at most an exponential number of iterations (Puterman,
1994). However, it is an important open question how many iterations policy iteration
takes in the worst case. It is known that the running time is pseudopolynomial and that, for
any fixed discount factor, there is a polynomial bound in the total size of the MDP (Littman
et al., 1995b).
3.2.3 Enhancement to Value Iteration and Policy Iteration
In practice, value iteration is much faster per iteration, but policy iteration takes fewer
iterations. Arguments have been put forth to the effect that each approach is better for
large problems. Puterman's modified policy iteration algorithm (Puterman & Shin, 1978)
provides a method for trading iteration time for iteration improvement in a smoother way.
The basic idea is that the expensive part of policy iteration is solving for the exact value
of V_π. Instead of finding an exact value for V_π, we can perform a few steps of a modified
value-iteration step in which the policy is held fixed over successive iterations. This can be
shown to produce an approximation to V_π that converges linearly in γ. In practice, this can
result in substantial speedups.
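The description above suggests a sketch like the following, in which the exact linear solve is replaced by
a small number of fixed-policy backups. The parameter k_eval and the array conventions (matching the
previous sketch) are illustrative assumptions, not the algorithm's canonical form.

    import numpy as np

    def modified_policy_iteration(T, R, gamma, k_eval=5, n_iters=100):
        n_states, n_actions = R.shape
        V = np.zeros(n_states)
        for _ in range(n_iters):
            # Greedy policy with respect to the current value estimate.
            policy = (R + gamma * np.einsum('sat,t->sa', T, V)).argmax(axis=1)
            for _ in range(k_eval):                       # partial policy evaluation
                T_pi = T[np.arange(n_states), policy]
                R_pi = R[np.arange(n_states), policy]
                V = R_pi + gamma * T_pi @ V
        return policy, V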
Several standard numerical-analysis techniques that speed the convergence of dynamic
programming can be used to accelerate value and policy iteration. Multigrid methods can
be used to quickly seed a good initial approximation to a high resolution value function
by initially performing value iteration at a coarser resolution (Rüde, 1993). State aggregation
works by collapsing groups of states to a single meta-state, solving the abstracted
problem (Bertsekas & Castañon, 1989).
3.2.4 Computational Complexity


Value iteration works by producing successive approximations of the optimal value function.
Each iteration can be performed in O(|A| |S|^2) steps, or faster if there is sparsity in the
transition function. However, the number of iterations required can grow exponentially in
the discount factor (Condon, 1992); as the discount factor approaches 1, the decisions must
be based on results that happen farther and farther into the future. In practice, policy
iteration converges in fewer iterations than value iteration, although the per-iteration costs
of O(|A| |S|^2 + |S|^3) can be prohibitive. There is no known tight worst-case bound available
for policy iteration (Littman et al., 1995b). Modified policy iteration (Puterman & Shin,
1978) seeks a trade-off between cheap and effective iterations and is preferred by some
practitioners (Rust, 1996).
Linear programming (Schrijver, 1986) is an extremely general problem, and MDPs can
be solved by general-purpose linear-programming packages (Derman, 1970; D'Epenoux,
1963; Hoffman & Karp, 1966). An advantage of this approach is that commercial-quality
linear-programming packages are available, although the time and space requirements can
still be quite high. From a theoretic perspective, linear programming is the only known
algorithm that can solve MDPs in polynomial time, although the theoretically efficient
algorithms have not been shown to be efficient in practice.

4. Learning an Optimal Policy: Model-free Methods


In the previous section we reviewed methods for obtaining an optimal policy for an MDP
assuming that we already had a model. The model consists of knowledge of the state transition
probability function T(s, a, s') and the reinforcement function R(s, a). Reinforcement
learning is primarily concerned with how to obtain the optimal policy when such a model
is not known in advance. The agent must interact with its environment directly to obtain
information which, by means of an appropriate algorithm, can be processed to produce an
optimal policy.
At this point, there are two ways to proceed.
Model-free: Learn a controller without learning a model.
Model-based: Learn a model, and use it to derive a controller.
Which approach is better? This is a matter of some debate in the reinforcement-learning
community. A number of algorithms have been proposed on both sides. This question also
appears in other fields, such as adaptive control, where the dichotomy is between direct and
indirect adaptive control.
This section examines model-free learning, and Section 5 examines model-based meth-
ods.
The biggest problem facing a reinforcement-learning agent is temporal credit assignment.
How do we know whether the action just taken is a good one, when it might have far-reaching
effects? One strategy is to wait until the "end" and reward the actions taken if the result
was good and punish them if the result was bad. In ongoing tasks, it is difficult to know
what the "end" is, and this might require a great deal of memory. Instead, we will use
insights from value iteration to adjust the estimated value of a state based on the immediate
reward and the estimated value of the next state. This class of algorithms is known as
temporal difference methods (Sutton, 1988). We will consider two different temporal-difference
learning strategies for the discounted infinite-horizon model.

Figure 4: Architecture for the adaptive heuristic critic. The critic (AHC) receives the state s
and reward r and produces a heuristic value v, which the reinforcement-learning component
(RL) uses to select an action a.
4.1 Adaptive Heuristic Critic and TD(λ)
The adaptive heuristic critic algorithm is an adaptive version of policy iteration (Barto,
Sutton, & Anderson, 1983) in which the value-function computation is no longer implemented
by solving a set of linear equations, but is instead computed by an algorithm called
TD(0). A block diagram for this approach is given in Figure 4. It consists of two components:
a critic (labeled AHC), and a reinforcement-learning component (labeled RL). The
reinforcement-learning component can be an instance of any of the k-armed bandit algorithms,
modified to deal with multiple states and non-stationary rewards. But instead of
acting to maximize instantaneous reward, it will be acting to maximize the heuristic value,
v, that is computed by the critic. The critic uses the real external reinforcement signal to
learn to map states to their expected discounted values given that the policy being executed
is the one currently instantiated in the RL component.
We can see the analogy with modified policy iteration if we imagine these components
working in alternation. The policy π implemented by RL is fixed and the critic learns the
value function V_π for that policy. Now we fix the critic and let the RL component learn a
new policy π' that maximizes the new value function, and so on. In most implementations,
however, both components operate simultaneously. Only the alternating implementation
can be guaranteed to converge to the optimal policy, under appropriate conditions. Williams
and Baird explored the convergence properties of a class of AHC-related algorithms they
call "incremental variants of policy iteration" (Williams & Baird, 1993a).
It remains to explain how the critic can learn the value of a policy. We define ⟨s, a, r, s'⟩
to be an experience tuple summarizing a single transition in the environment. Here s is the
agent's state before the transition, a is its choice of action, r the instantaneous reward it
receives, and s' its resulting state. The value of a policy is learned using Sutton's TD(0)
algorithm (Sutton, 1988), which uses the update rule

    V(s) := V(s) + α (r + γ V(s') − V(s)) .

Whenever a state s is visited, its estimated value is updated to be closer to r + γ V(s'),
since r is the instantaneous reward received and V(s') is the estimated value of the actually
occurring next state. This is analogous to the sample-backup rule from value iteration; the
only difference is that the sample is drawn from the real world rather than by simulating
a known model. The key idea is that r + γ V(s') is a sample of the value of V(s), and it is
more likely to be correct because it incorporates the real r. If the learning rate α is adjusted
properly (it must be slowly decreased) and the policy is held fixed, TD(0) is guaranteed to
converge to the optimal value function.
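As a minimal illustration, the sketch below applies the TD(0) update while following a fixed policy.
The env.reset/env.step interface, the episodic termination handling, and the constants are assumptions
made for this sketch only.

    from collections import defaultdict

    def td0_evaluate(env, policy, gamma=0.95, alpha=0.1, n_episodes=1000):
        V = defaultdict(float)
        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                target = r + (0.0 if done else gamma * V[s_next])
                V[s] += alpha * (target - V[s])   # V(s) := V(s) + alpha (r + gamma V(s') - V(s))
                s = s_next
        return V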
The TD(0) rule as presented above is really an instance of a more general class of
algorithms called TD(λ), with λ = 0. TD(0) looks only one step ahead when adjusting
value estimates; although it will eventually arrive at the correct answer, it can take quite a
while to do so. The general TD(λ) rule is similar to the TD(0) rule given above,

    V(u) := V(u) + α (r + γ V(s') − V(s)) e(u) ,

but it is applied to every state according to its eligibility e(u), rather than just to the
immediately previous state, s. One version of the eligibility trace is defined to be

    e(s) = Σ_{k=1}^{t} (λ γ)^{t−k} δ_{s,s_k} ,   where δ_{s,s_k} = 1 if s = s_k, and 0 otherwise.

The eligibility of a state s is the degree to which it has been visited in the recent past;
when a reinforcement is received, it is used to update all the states that have been recently
visited, according to their eligibility. When λ = 0 this is equivalent to TD(0). When λ = 1,
it is roughly equivalent to updating all the states according to the number of times they
were visited by the end of a run. Note that we can update the eligibility online as follows:

    e(s) := γ λ e(s) + 1   if s is the current state,
    e(s) := γ λ e(s)       otherwise.

It is computationally more expensive to execute the general TD(λ), though it often
converges considerably faster for large λ (Dayan, 1992; Dayan & Sejnowski, 1994). There
has been some recent work on making the updates more efficient (Cichosz & Mulawka, 1995)
and on changing the definition to make TD(λ) more consistent with the certainty-equivalent
method (Singh & Sutton, 1996), which is discussed in Section 5.1.
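A hedged sketch of the online TD(λ) update with accumulating eligibility traces follows; the environment
interface and all constants are again assumptions made purely for illustration.

    from collections import defaultdict

    def td_lambda(env, policy, gamma=0.95, lam=0.8, alpha=0.1, n_episodes=500):
        V = defaultdict(float)
        for _ in range(n_episodes):
            e = defaultdict(float)                    # eligibility traces, reset each episode
            s = env.reset()
            done = False
            while not done:
                s_next, r, done = env.step(policy(s))
                delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
                e[s] += 1.0                           # e(s) := gamma*lambda*e(s) + 1 at the current state
                for u in list(e):
                    V[u] += alpha * delta * e[u]      # apply the TD error to every eligible state
                    e[u] *= gamma * lam               # decay all traces
                s = s_next
        return V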
4.2 Q-learning
The work of the two components of AHC can be accomplished in a unified manner by
Watkins' Q-learning algorithm (Watkins, 1989; Watkins & Dayan, 1992). Q-learning is
typically easier to implement. In order to understand Q-learning, we have to develop some
additional notation. Let Q*(s, a) be the expected discounted reinforcement of taking action
a in state s, then continuing by choosing actions optimally. Note that V*(s) is the value
of s assuming the best action is taken initially, and so V*(s) = max_a Q*(s, a). Q*(s, a) can
hence be written recursively as

    Q*(s, a) = R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') max_{a'} Q*(s', a') .

Note also that, since V*(s) = max_a Q*(s, a), we have π*(s) = arg max_a Q*(s, a) as an
optimal policy.
Because the Q function makes the action explicit, we can estimate the Q values on-line
using a method essentially the same as TD(0), but also use them to define the policy,
because an action can be chosen just by taking the one with the maximum Q value for the
current state.
The Q-learning rule is

    Q(s, a) := Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a)) ,

where ⟨s, a, r, s'⟩ is an experience tuple as described earlier. If each action is executed in
each state an infinite number of times on an infinite run and α is decayed appropriately, the
Q values will converge with probability 1 to Q* (Watkins, 1989; Tsitsiklis, 1994; Jaakkola,
Jordan, & Singh, 1994). Q-learning can also be extended to update states that occurred
more than one step previously, as in TD(λ) (Peng & Williams, 1994).
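The rule translates directly into code. The sketch below uses epsilon-greedy exploration purely for
concreteness (the survey treats exploration separately, in Section 2.2); the environment interface, the
explicit action list, and the constants are assumptions of this sketch.

    import random
    from collections import defaultdict

    def q_learning(env, actions, gamma=0.95, alpha=0.1, epsilon=0.1, n_episodes=1000):
        Q = defaultdict(float)
        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                if random.random() < epsilon:
                    a = random.choice(actions)                    # explore
                else:
                    a = max(actions, key=lambda x: Q[(s, x)])     # exploit
                s_next, r, done = env.step(a)
                best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
                # Q(s,a) := Q(s,a) + alpha (r + gamma max_a' Q(s',a') - Q(s,a))
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q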
When the Q values are nearly converged to their optimal values, it is appropriate for
the agent to act greedily, taking, in each situation, the action with the highest Q value.
During learning, however, there is a difficult exploitation versus exploration trade-off to be
made. There are no good, formally justified approaches to this problem in the general case;
standard practice is to adopt one of the ad hoc methods discussed in Section 2.2.
AHC architectures seem to be more difficult to work with than Q-learning on a practical
level. It can be hard to get the relative learning rates right in AHC so that the two
components converge together. In addition, Q-learning is exploration insensitive: that
is, the Q values will converge to the optimal values, independent of how the agent
behaves while the data is being collected (as long as all state-action pairs are tried often
enough). This means that, although the exploration-exploitation issue must be addressed
in Q-learning, the details of the exploration strategy will not affect the convergence of the
learning algorithm. For these reasons, Q-learning is the most popular and seems to be the
most effective model-free algorithm for learning from delayed reinforcement. It does not,
however, address any of the issues involved in generalizing over large state and/or action
spaces. In addition, it may converge quite slowly to a good policy.
4.3 Model-free Learning With Average Reward
As described, Q-learning can be applied to discounted infinite-horizon MDPs. It can also
be applied to undiscounted problems as long as the optimal policy is guaranteed to reach a
reward-free absorbing state and the state is periodically reset.
Schwartz (1993) examined the problem of adapting Q-learning to an average-reward
framework. Although his R-learning algorithm seems to exhibit convergence problems for
some MDPs, several researchers have found the average-reward criterion closer to the true
problem they wish to solve than a discounted criterion and therefore prefer R-learning to
Q-learning (Mahadevan, 1994).
With that in mind, researchers have studied the problem of learning optimal average-
reward policies. Mahadevan (1996) surveyed model-based average-reward algorithms from
a reinforcement-learning perspective and found several difficulties with existing algorithms.
In particular, he showed that existing reinforcement-learning algorithms for average reward
(and some dynamic programming algorithms) do not always produce bias-optimal poli-
cies. Jaakkola, Jordan and Singh (1995) described an average-reward learning algorithm
with guaranteed convergence properties. It uses a Monte-Carlo component to estimate the
expected future reward for each state as the agent moves through the environment. In
254
Reinforcement Learning: A Survey

addition, Bertsekas presents a Q-learning-like algorithm for average-case reward in his new
textbook (1995). Although this recent work provides a much needed theoretical foundation
to this area of reinforcement learning, many important problems remain unsolved.

5. Computing Optimal Policies by Learning Models


The previous section showed how it is possible to learn an optimal policy without knowing
the models T(s, a, s') or R(s, a) and without even learning those models en route. Although
many of these methods are guaranteed to find optimal policies eventually and use very
little computation time per experience, they make extremely inefficient use of the data they
gather and therefore often require a great deal of experience to achieve good performance.
In this section we still begin by assuming that we don't know the models in advance, but
we examine algorithms that do operate by learning these models. These algorithms are
especially important in applications in which computation is considered to be cheap and
real-world experience costly.
5.1 Certainty Equivalent Methods
We begin with the most conceptually straightforward method: first, learn the T and R
functions by exploring the environment and keeping statistics about the results of each
action; next, compute an optimal policy using one of the methods of Section 3. This
method is known as certainty equivalence (Kumar & Varaiya, 1986).
There are some serious objections to this method:
It makes an arbitrary division between the learning phase and the acting phase.
How should it gather data about the environment initially? Random exploration
might be dangerous, and in some environments is an immensely inefficient method of
gathering data, requiring exponentially more data (Whitehead, 1991) than a system
that interleaves experience gathering with policy-building more tightly (Koenig &
Simmons, 1993). See Figure 5 for an example.
The possibility of changes in the environment is also problematic. Breaking up an
agent's life into a pure learning and a pure acting phase has a considerable risk that
the optimal controller based on early life becomes, without detection, a suboptimal
controller if the environment changes.
A variation on this idea is certainty equivalence, in which the model is learned continually
through the agent's lifetime and, at each step, the current model is used to compute an
optimal policy and value function. This method makes very effective use of available data,
but still ignores the question of exploration and is extremely computationally demanding,
even for fairly small state spaces. Fortunately, there are a number of other model-based
algorithms that are more practical.
5.2 Dyna
Sutton's Dyna architecture (1990, 1991) exploits a middle ground, yielding strategies that
are both more effective than model-free learning and more computationally efficient than
the certainty-equivalence approach. It simultaneously uses experience to build a model (T̂
and R̂), uses experience to adjust the policy, and uses the model to adjust the policy.

Figure 5: A linear chain of states 1, 2, 3, ..., n leading to a goal. In this environment, due
to Whitehead (1991), random exploration would take O(2^n) steps to reach the goal even
once, whereas a more intelligent exploration strategy (e.g., "assume any untried action
leads directly to goal") would require only O(n^2) steps.
Dyna operates in a loop of interaction with the environment. Given an experience tuple
⟨s, a, s', r⟩, it behaves as follows:
Update the model, incrementing statistics for the transition from s to s' on action a
and for receiving reward r for taking action a in state s. The updated models are T̂
and R̂.

Update the policy at state s based on the newly updated model using the rule

    Q(s, a) := R̂(s, a) + γ Σ_{s'} T̂(s, a, s') max_{a'} Q(s', a') ,

which is a version of the value-iteration update for Q values.

Perform k additional updates: choose k state-action pairs at random and update them
according to the same rule as before:

    Q(s_k, a_k) := R̂(s_k, a_k) + γ Σ_{s'} T̂(s_k, a_k, s') max_{a'} Q(s', a') .

Choose an action a' to perform in state s', based on the Q values but perhaps modified
by an exploration strategy.
The Dyna algorithm requires about k times the computation of Q-learning per instance,
but this is typically vastly less than for the naive model-based method. A reasonable value
of k can be determined based on the relative speeds of computation and of taking action.
Figure 6 shows a grid world in which in each cell the agent has four actions (N, S, E,
W) and transitions are made deterministically to an adjacent cell, unless there is a block,
in which case no movement occurs. As we will see in Table 1, Dyna requires an order of
magnitude fewer steps of experience than does Q-learning to arrive at an optimal policy.
Dyna requires about six times more computational effort, however.
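A minimal sketch of the Dyna loop is given below. For simplicity it stores only the last observed outcome
for each state-action pair, which is adequate for a deterministic grid world like the one in Figure 6; the
survey's Dyna instead keeps transition and reward statistics. The environment interface, action list, and
constants are assumptions of this sketch.

    import random
    from collections import defaultdict

    def dyna_q(env, actions, gamma=0.95, alpha=0.5, epsilon=0.1, k=20, n_steps=10000):
        Q = defaultdict(float)
        model = {}                                    # (s, a) -> (s', r), learned from experience
        s = env.reset()
        for _ in range(n_steps):
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            model[(s, a)] = (s2, r)                   # update the (deterministic) model
            # Update the policy from the real experience.
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)] for x in actions) - Q[(s, a)])
            # k additional updates on randomly chosen previously seen state-action pairs.
            for (sk, ak) in random.sample(list(model), min(k, len(model))):
                s2k, rk = model[(sk, ak)]
                Q[(sk, ak)] += alpha * (rk + gamma * max(Q[(s2k, x)] for x in actions) - Q[(sk, ak)])
            s = env.reset() if done else s2
        return Q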
Figure 6: A 3277-state grid world. This was formulated as a shortest-path reinforcement-learning
problem, which yields the same result as if a reward of 1 is given at the goal, a
reward of zero elsewhere, and a discount factor is used.

                            Steps before    Backups before
                            convergence     convergence
    Q-learning                  531,000           531,000
    Dyna                         62,000         3,055,000
    prioritized sweeping         28,000         1,010,000

Table 1: The performance of three algorithms described in the text. All methods used
the exploration heuristic of "optimism in the face of uncertainty": any state not previously
visited was assumed by default to be a goal state. Q-learning used its optimal learning rate
parameter for a deterministic maze: α = 1. Dyna and prioritized sweeping were permitted
to take k = 200 backups per transition. For prioritized sweeping, the priority queue often
emptied before all backups were used.

5.3 Prioritized Sweeping / Queue-Dyna


Although Dyna is a great improvement on previous methods, it suffers from being relatively
undirected. It is particularly unhelpful when the goal has just been reached or when the
agent is stuck in a dead end; it continues to update random state-action pairs, rather than
concentrating on the "interesting" parts of the state space. These problems are addressed
by prioritized sweeping (Moore & Atkeson, 1993) and Queue-Dyna (Peng & Williams,
1993), which are two independently-developed but very similar techniques. We will describe
prioritized sweeping in some detail.
The algorithm is similar to Dyna, except that updates are no longer chosen at random
and values are now associated with states (as in value iteration) instead of state-action pairs
(as in Q-learning). To make appropriate choices, we must store additional information in
the model. Each state remembers its predecessors: the states that have a non-zero transition
probability to it under some action. In addition, each state has a priority, initially set to
zero.
Instead of updating k random state-action pairs, prioritized sweeping updates k states
with the highest priority. For each high-priority state s, it works as follows:

Remember the current value of the state: V_old = V(s).

Update the state's value

    V(s) := max_a ( R̂(s, a) + γ Σ_{s'} T̂(s, a, s') V(s') ) .

Set the state's priority back to 0.

Compute the value change Δ = |V_old − V(s)|.

Use Δ to modify the priorities of the predecessors of s.

If we have updated the V value for state s' and it has changed by amount Δ, then the
immediate predecessors of s' are informed of this event. Any state s for which there exists
an action a such that T̂(s, a, s') ≠ 0 has its priority promoted to Δ · T̂(s, a, s'), unless its
priority already exceeded that value.
The global behavior of this algorithm is that when a real-world transition is "surprising"
(the agent happens upon a goal state, for instance), then lots of computation is directed
to propagate this new information back to relevant predecessor states. When the real-world
transition is "boring" (the actual result is very similar to the predicted result), then
computation continues in the most deserving part of the space.
Running prioritized sweeping on the problem in Figure 6, we see a large improvement
over Dyna. The optimal policy is reached in about half the number of steps of experience
and one-third the computation as Dyna required (and therefore about 20 times fewer steps
and twice the computational effort of Q-learning).
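The inner loop of prioritized sweeping might be sketched as follows. This version simplifies priority
promotion by pushing duplicate queue entries rather than raising priorities in place, and it assumes the
model structures T_hat, R_hat, and the predecessor map preds have been learned elsewhere; states are
assumed to be comparable values such as integers or tuples.

    import heapq

    def prioritized_sweep(V, T_hat, R_hat, preds, queue, actions, gamma=0.95, k=20):
        # queue holds (-priority, state) pairs; heapq is a min-heap, so priorities are negated.
        for _ in range(k):
            if not queue:
                break                                 # the priority queue often empties early
            _, s = heapq.heappop(queue)
            v_old = V.get(s, 0.0)
            # V(s) := max_a ( R_hat(s,a) + gamma * sum_s' T_hat(s,a,s') V(s') )
            V[s] = max(R_hat.get((s, a), 0.0) +
                       gamma * sum(p * V.get(s2, 0.0)
                                   for s2, p in T_hat.get((s, a), {}).items())
                       for a in actions)
            delta = abs(v_old - V[s])
            # Inform predecessors: any (s_pred, a) with T_hat(s_pred, a, s) != 0 is promoted to
            # priority delta * T_hat(s_pred, a, s); pushing duplicates is a simplification of
            # "unless its priority already exceeded that value".
            for (s_pred, a) in preds.get(s, []):
                priority = delta * T_hat.get((s_pred, a), {}).get(s, 0.0)
                if priority > 0.0:
                    heapq.heappush(queue, (-priority, s_pred))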
5.4 Other Model-Based Methods


Methods proposed for solving MDPs given a model can be used in the context of model-
based methods as well.
RTDP (real-time dynamic programming) (Barto, Bradtke, & Singh, 1995) is another
model-based method that uses Q-learning to concentrate computational effort on the areas
of the state-space that the agent is most likely to occupy. It is specific to problems in which
the agent is trying to achieve a particular goal state and the reward everywhere else is 0.
By taking into account the start state, it can find a short path from the start to the goal,
without necessarily visiting the rest of the state space.
The Plexus planning system (Dean, Kaelbling, Kirman, & Nicholson, 1993; Kirman,
1994) exploits a similar intuition. It starts by making an approximate version of the MDP
which is much smaller than the original one. The approximate MDP contains a set of states,
called the envelope, that includes the agent's current state and the goal state, if there is one.
States that are not in the envelope are summarized by a single "out" state. The planning
process is an alternation between finding an optimal policy on the approximate MDP and
adding useful states to the envelope. Action may take place in parallel with planning, in
which case irrelevant states are also pruned out of the envelope.

6. Generalization
All of the previous discussion has tacitly assumed that it is possible to enumerate the state
and action spaces and store tables of values over them. Except in very small environments,
this means impractical memory requirements. It also makes inefficient use of experience. In
a large, smooth state space we generally expect similar states to have similar values and similar
optimal actions. Surely, therefore, there should be some more compact representation
than a table. Most problems will have continuous or large discrete state spaces; some will
have large or continuous action spaces. The problem of learning in large spaces is addressed
through generalization techniques, which allow compact storage of learned information and
transfer of knowledge between "similar" states and actions.
The large literature of generalization techniques from inductive concept learning can be
applied to reinforcement learning. However, techniques often need to be tailored to specific
details of the problem. In the following sections, we explore the application of standard
function-approximation techniques, adaptive resolution models, and hierarchical methods
to the problem of reinforcement learning.
The reinforcement-learning architectures and algorithms discussed above have included
the storage of a variety of mappings, including S → A (policies), S → ℝ (value functions),
S × A → ℝ (Q functions and rewards), S × A → S (deterministic transitions), and
S × A × S → [0, 1] (transition probabilities). Some of these mappings, such as transitions and
immediate rewards, can be learned using straightforward supervised learning, and can be
handled using any of the wide variety of function-approximation techniques for supervised
learning that support noisy training examples. Popular techniques include various neural-network
methods (Rumelhart & McClelland, 1986), fuzzy logic (Berenji, 1991; Lee, 1991),
CMAC (Albus, 1981), and local memory-based methods (Moore, Atkeson, & Schaal, 1995),
such as generalizations of nearest neighbor methods. Other mappings, especially the policy
mapping, typically need specialized algorithms because training sets of input-output pairs
are not available.
6.1 Generalization over Input
A reinforcement-learning agent's current state plays a central role in its selection of reward-
maximizing actions. Viewing the agent as a state-free black box, a description of the
current state is its input. Depending on the agent architecture, its output is either an
action selection, or an evaluation of the current state that can be used to select an action.
The problem of deciding how the different aspects of an input affect the value of the output
is sometimes called the "structural credit-assignment" problem. This section examines
approaches to generating actions or evaluations as a function of a description of the agent's
current state.
The first group of techniques covered here is specialized to the case when reward is not
delayed; the second group is more generally applicable.
6.1.1 Immediate Reward
When the agent's actions do not influence state transitions, the resulting problem becomes
one of choosing actions to maximize immediate reward as a function of the agent's current
state. These problems bear a resemblance to the bandit problems discussed in Section 2
except that the agent should condition its action selection on the current state. For this
reason, this class of problems has been described as associative reinforcement learning.
The algorithms in this section address the problem of learning from immediate boolean
reinforcement where the state is vector valued and the action is a boolean vector. Such
algorithms can and have been used in the context of a delayed reinforcement, for instance,
as the RL component in the AHC architecture described in Section 4.1. They can also be
generalized to real-valued reward through reward comparison methods (Sutton, 1984).
CRBP The complementary reinforcement backpropagation algorithm (Ackley & Littman,
1990) (crbp) consists of a feed-forward network mapping an encoding of the state to an
encoding of the action. The action is determined probabilistically from the activation of
the output units: if output unit i has activation y_i, then bit i of the action vector has value
1 with probability y_i, and 0 otherwise. Any neural-network supervised training procedure
can be used to adapt the network as follows. If the result of generating action a is r = 1,
then the network is trained with input-output pair ⟨s, a⟩. If the result is r = 0, then the
network is trained with input-output pair ⟨s, ā⟩, where ā = (1 − a_1, ..., 1 − a_n).
The idea behind this training rule is that whenever an action fails to generate reward,
crbp will try to generate an action that is different from the current choice. Although it
seems like the algorithm might oscillate between an action and its complement, that does
not happen. One step of training a network will only change the action slightly and since
the output probabilities will tend to move toward 0.5, this makes action selection more
random and increases search. The hope is that the random distribution will generate an
action that works better, and then that action will be reinforced.
ARC The associative reinforcement comparison (arc) algorithm (Sutton, 1984) is an
instance of the ahc architecture for the case of boolean actions, consisting of two feed-
forward networks. One learns the value of situations, the other learns a policy. These can
be simple linear networks or can have hidden units.
In the simplest case, the entire system learns only to optimize immediate reward. First,
let us consider the behavior of the network that learns the policy, a mapping from a vector
describing s to a 0 or 1. If the output unit has activation y, then a, the action generated,
will be 1 if y + ν > 0, where ν is normal noise, and 0 otherwise.
The adjustment for the output unit is, in the simplest case,

    e = r (a − 1/2) ,

where the first factor is the reward received for taking the most recent action and the second
encodes which action was taken. The actions are encoded as 0 and 1, so a − 1/2 always has
the same magnitude; if the reward and the action have the same sign, then action 1 will be
made more likely, otherwise action 0 will be.
As described, the network will tend to seek actions that give positive reward. To extend
this approach to maximize reward, we can compare the reward to some baseline, b. This
changes the adjustment to

    e = (r − b)(a − 1/2) ,

where b is the output of the second network. The second network is trained in a standard
supervised mode to estimate r as a function of the input state s.
Variations of this approach have been used in a variety of applications (Anderson, 1986;
Barto et al., 1983; Lin, 1993b; Sutton, 1984).
REINFORCE Algorithms Williams (1987, 1992) studied the problem of choosing actions
to maximize immediate reward. He identified a broad class of update rules that perform
gradient descent on the expected reward and showed how to integrate these rules with
backpropagation. This class, called reinforce algorithms, includes linear reward-inaction
(Section 2.1.3) as a special case.
The generic reinforce update for a parameter w_ij can be written

    Δw_ij = α_ij (r − b_ij) ∂ ln(g_i) / ∂ w_ij ,

where α_ij is a non-negative factor, r the current reinforcement, b_ij a reinforcement baseline,
and g_i is the probability density function used to randomly generate actions based on unit
activations. Both α_ij and b_ij can take on different values for each w_ij; however, when α_ij
is constant throughout the system, the expected update is exactly in the direction of the
expected reward gradient. Otherwise, the update is in the same half space as the gradient
but not necessarily in the direction of steepest increase.
Williams points out that the choice of baseline, b_ij, can have a profound effect on the
convergence speed of the algorithm.
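To make the update concrete, the sketch below applies a REINFORCE-style step to a single Bernoulli
output unit with a logistic activation, for which the gradient of the log-probability has a simple closed
form. The linear parameterization, the reward_fn callable, and the constants are assumptions of this
sketch rather than anything specified by Williams.

    import math
    import random

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def reinforce_step(w, state, reward_fn, alpha=0.1, baseline=0.0):
        # w and state are equal-length lists of floats; the action is sampled from a
        # Bernoulli distribution with parameter g = sigmoid(w . state).
        g = sigmoid(sum(wi * si for wi, si in zip(w, state)))
        a = 1 if random.random() < g else 0
        r = reward_fn(state, a)
        # For this unit, d/dw_i ln P(a | state) = (a - g) * state_i.
        grad_log = [(a - g) * si for si in state]
        # Delta w_i = alpha * (r - baseline) * d ln(g) / d w_i
        new_w = [wi + alpha * (r - baseline) * gi for wi, gi in zip(w, grad_log)]
        return new_w, r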
Logic-Based Methods Another strategy for generalization in reinforcement learning is
to reduce the learning problem to an associative problem of learning boolean functions.
A boolean function has a vector of boolean inputs and a single boolean output. Taking
inspiration from mainstream machine learning work, Kaelbling developed two algorithms
for learning boolean functions from reinforcement: one uses the bias of k-DNF to drive
the generalization process (Kaelbling, 1994b); the other searches the space of syntactic
descriptions of functions using a simple generate-and-test method (Kaelbling, 1994a).
The restriction to a single boolean output makes these techniques difficult to apply. In
very benign learning situations, it is possible to extend this approach to use a collection
of learners to independently learn the individual bits that make up a complex output. In
general, however, that approach suffers from the problem of very unreliable reinforcement:
if a single learner generates an inappropriate output bit, all of the learners receive a low
reinforcement value. The cascade method (Kaelbling, 1993b) allows a collection of learners
to be trained collectively to generate appropriate joint outputs; it is considerably more
reliable, but can require additional computational effort.
6.1.2 Delayed Reward
Another method to allow reinforcement-learning techniques to be applied in large state
spaces is modeled on value iteration and Q-learning. Here, a function approximator is used
to represent the value function by mapping a state description to a value.
Many researchers have experimented with this approach: Boyan and Moore (1995) used
local memory-based methods in conjunction with value iteration; Lin (1991) used backpropagation
networks for Q-learning; Watkins (1989) used CMAC for Q-learning; Tesauro (1992,
1995) used backpropagation for learning the value function in backgammon (described in
Section 8.1); Zhang and Dietterich (1995) used backpropagation and TD(λ) to learn good
strategies for job-shop scheduling.
Although there have been some positive examples, in general there are unfortunate in-
teractions between function approximation and the learning rules. In discrete environments
there is a guarantee that any operation that updates the value function (according to the
Bellman equations) can only reduce the error between the current value function and the
optimal value function. This guarantee no longer holds when generalization is used. These
issues are discussed by Boyan and Moore (1995), who give some simple examples of value
function errors growing arbitrarily large when generalization is used with value iteration.
Their solution to this, applicable only to certain classes of problems, discourages such diver-
gence by only permitting updates whose estimated values can be shown to be near-optimal
via a battery of Monte-Carlo experiments.
Thrun and Schwartz (1993) theorize that function approximation of value functions
is also dangerous because the errors in value functions due to generalization can become
compounded by the "max" operator in the definition of the value function.
Several recent results (Gordon, 1995; Tsitsiklis & Van Roy, 1996) show how the appropriate
choice of function approximator can guarantee convergence, though not necessarily to
the optimal values. Baird's residual gradient technique (Baird, 1995) provides guaranteed
convergence to locally optimal solutions.
Perhaps the gloominess of these counter-examples is misplaced. Boyan and Moore (1995)
report that their counter-examples can be made to work with problem-specific hand-tuning
despite the unreliability of untuned algorithms that provably converge in discrete domains.
Sutton (1996) shows how modied versions of Boyan and Moore's examples can converge
successfully. An open question is whether general principles, ideally supported by theory,
can help us understand when value function approximation will succeed. In Sutton's com-
parative experiments with Boyan and Moore's counter-examples, he changes four aspects
of the experiments:
1. Small changes to the task specifications.
2. A very different kind of function approximator (CMAC (Albus, 1975)) that has weak
generalization.
3. A different learning algorithm: SARSA (Rummery & Niranjan, 1994) instead of value
iteration.
4. A different training regime. Boyan and Moore sampled states uniformly in state space,
whereas Sutton's method sampled along empirical trajectories.
There are intuitive reasons to believe that the fourth factor is particularly important, but
more careful research is needed.
Adaptive Resolution Models In many cases, what we would like to do is partition
the environment into regions of states that can be considered the same for the purposes of
learning and generating actions. Without detailed prior knowledge of the environment, it
is very difficult to know what granularity or placement of partitions is appropriate. This
problem is overcome in methods that use adaptive resolution; during the course of learning,
a partition is constructed that is appropriate to the environment.
Decision Trees In environments that are characterized by a set of boolean or discrete-
valued variables, it is possible to learn compact decision trees for representing Q values. The
G-learning algorithm (Chapman & Kaelbling, 1991) works as follows. It starts by assuming
that no partitioning is necessary and tries to learn Q values for the entire environment as
if it were one state. In parallel with this process, it gathers statistics based on individual
input bits; it asks the question whether there is some bit b in the state description such
that the Q values for states in which b = 1 are significantly different from Q values for
states in which b = 0. If such a bit is found, it is used to split the decision tree. Then,
the process is repeated in each of the leaves. This method was able to learn very small
representations of the Q function in the presence of an overwhelming number of irrelevant,
noisy state attributes. It outperformed Q-learning with backpropagation in a simple video-game
environment and was used by McCallum (1995) (in conjunction with other techniques
for dealing with partial observability) to learn behaviors in a complex driving-simulator. It
cannot, however, acquire partitions in which attributes are only significant in combination
(such as those needed to solve parity problems).
Variable Resolution Dynamic Programming The VRDP algorithm (Moore, 1991)
enables conventional dynamic programming to be performed in real-valued multivariate
state-spaces where straightforward discretization would fall prey to the curse of dimension-
ality. A kd-tree (similar to a decision tree) is used to partition state space into coarse
regions. The coarse regions are refined into detailed regions, but only in parts of the state
space which are predicted to be important. This notion of importance is obtained by running
"mental trajectories" through state space. This algorithm proved effective on a number
of problems for which full high-resolution arrays would have been impractical. It has the
disadvantage of requiring a guess at an initially valid trajectory through state-space.
Figure 7: (a) A two-dimensional maze problem. The point robot must find a path from
start to goal without crossing any of the barrier lines. (b) The path taken by PartiGame
during the entire first trial. It begins with intense exploration to find a route out of the
almost entirely enclosed start region. Having eventually reached a sufficiently high resolution,
it discovers the gap and proceeds greedily towards the goal, only to be temporarily
blocked by the goal's barrier region. (c) The second trial.

PartiGame Algorithm Moore's PartiGame algorithm (Moore, 1994) is another solution
to the problem of learning to achieve goal configurations in deterministic high-dimensional
continuous spaces by learning an adaptive-resolution model. It also divides the environment
into cells; but in each cell, the actions available consist of aiming at the neighboring cells
(this aiming is accomplished by a local controller, which must be provided as part of the
problem statement). The graph of cell transitions is solved for shortest paths in an online
incremental manner, but a minimax criterion is used to detect when a group of cells is
too coarse to prevent movement between obstacles or to avoid limit cycles. The offending
cells are split to higher resolution. Eventually, the environment is divided up just enough to
choose appropriate actions for achieving the goal, but no unnecessary distinctions are made.
An important feature is that, as well as reducing memory and computational requirements,
it also structures exploration of state space in a multi-resolution manner. Given a failure,
the agent will initially try something very different to rectify the failure, and only resort to
small local changes when all the qualitatively different strategies have been exhausted.
Figure 7a shows a two-dimensional continuous maze. Figure 7b shows the performance
of a robot using the PartiGame algorithm during the very first trial. Figure 7c shows the
second trial, started from a slightly different position.
This is a very fast algorithm, learning policies in spaces of up to nine dimensions in less
than a minute. The restriction of the current implementation to deterministic environments
limits its applicability, however. McCallum (1995) suggests some related tree-structured
methods.
6.2 Generalization over Actions


The networks described in Section 6.1.1 generalize over state descriptions presented as
inputs. They also produce outputs in a discrete, factored representation and thus could be
seen as generalizing over actions as well.
In cases such as this when actions are described combinatorially, it is important to
generalize over actions to avoid keeping separate statistics for the huge number of actions
that can be chosen. In continuous action spaces, the need for generalization is even more
pronounced.
When estimating Q values using a neural network, it is possible to use either a distinct
network for each action, or a network with a distinct output for each action. When the
action space is continuous, neither approach is possible. An alternative strategy is to use a
single network with both the state and action as input and Q value as the output. Training
such a network is not conceptually difficult, but using the network to find the optimal action
can be a challenge. One method is to do a local gradient-ascent search on the action in
order to find one with high value (Baird & Klopf, 1993).
Gullapalli (1990, 1992) has developed a "neural" reinforcement-learning unit for use in
continuous action spaces. The unit generates actions with a normal distribution; it adjusts
the mean and variance based on previous experience. When the chosen actions are not
performing well, the variance is high, resulting in exploration of the range of choices. When
an action performs well, the mean is moved in that direction and the variance decreased,
resulting in a tendency to generate more action values near the successful one. This method
was successfully employed to learn to control a robot arm with many continuous degrees of
freedom.
6.3 Hierarchical Methods
Another strategy for dealing with large state spaces is to treat them as a hierarchy of
learning problems. In many cases, hierarchical solutions introduce slight sub-optimality in
performance, but potentially gain a good deal of eciency in execution time, learning time,
and space.
Hierarchical learners are commonly structured as gated behaviors, as shown in Figure 8.
There is a collection of behaviors that map environment states into low-level actions and
a gating function that decides, based on the state of the environment, which behavior's
actions should be switched through and actually executed. Maes and Brooks (1990) used
a version of this architecture in which the individual behaviors were xed a priori and the
gating function was learned from reinforcement. Mahadevan and Connell (1991b) used the
dual approach: they xed the gating function, and supplied reinforcement functions for the
individual behaviors, which were learned. Lin (1993a) and Dorigo and Colombetti (1995,
1994) both used this approach, rst training the behaviors and then training the gating
function. Many of the other hierarchical learning methods can be cast in this framework.
6.3.1 Feudal Q-learning
Feudal Q-learning (Dayan & Hinton, 1993; Watkins, 1989) involves a hierarchy of learning
modules. In the simplest case, there is a high-level master and a low-level slave. The master
receives reinforcement from the external environment. Its actions consist of commands that
it can give to the low-level learner. When the master generates a particular command to
the slave, it must reward the slave for taking actions that satisfy the command, even if they
do not result in external reinforcement. The master, then, learns a mapping from states to
commands. The slave learns a mapping from commands and states to external actions. The
set of "commands" and their associated reinforcement functions are established in advance
of the learning.
This is really an instance of the general "gated behaviors" approach, in which the slave
can execute any of the behaviors depending on its command. The reinforcement functions
for the individual behaviors (commands) are given, but learning takes place simultaneously
at both the high and low levels.
6.3.2 Compositional Q-learning
Singh's compositional Q-learning (1992b, 1992a) (C-QL) consists of a hierarchy based on
the temporal sequencing of subgoals. The elemental tasks are behaviors that achieve some
recognizable condition. The high-level goal of the system is to achieve some set of condi-
tions in sequential order. The achievement of the conditions provides reinforcement for the
elemental tasks, which are trained first to achieve individual subgoals. Then, the gating
function learns to switch the elemental tasks in order to achieve the appropriate high-level
sequential goal. This method was used by Tham and Prager (1994) to learn to control a
simulated multi-link robot arm.
6.3.3 Hierarchical Distance to Goal
Especially if we consider reinforcement learning modules to be part of larger agent archi-
tectures, it is important to consider problems in which goals are dynamically input to the
learner. Kaelbling's HDG algorithm (1993a) uses a hierarchical approach to solving prob-
lems when goals of achievement (the agent should get to a particular state as quickly as
possible) are given to an agent dynamically.
The HDG algorithm works by analogy with navigation in a harbor. The environment
is partitioned (a priori, but more recent work (Ashar, 1994) addresses the case of learning
the partition) into a set of regions whose centers are known as "landmarks." If the agent is
currently in the same region as the goal, then it uses low-level actions to move to the goal.
If not, then high-level information is used to determine the next landmark on the shortest
path from the agent's closest landmark to the goal's closest landmark. Then, the agent uses
low-level information to aim toward that next landmark. If errors in action cause deviations
in the path, there is no problem; the best aiming point is recomputed on every step.

7. Partially Observable Environments


In many real-world environments, it will not be possible for the agent to have perfect and
complete perception of the state of the environment. Unfortunately, complete observability
is necessary for learning methods based on MDPs. In this section, we consider the case in
which the agent makes observations of the state of the environment, but these observations
may be noisy and provide incomplete information. In the case of a robot, for instance,
it might observe whether it is in a corridor, an open room, a T-junction, etc., and those
observations might be error-prone. This problem is also referred to as the problem of
"incomplete perception," "perceptual aliasing," or "hidden state."
In this section, we will consider extensions to the basic MDP framework for solving
partially observable problems. The resulting formal model is called a partially observable
Markov decision process or POMDP.
7.1 State-Free Deterministic Policies
The most naive strategy for dealing with partial observability is to ignore it. That is, to
treat the observations as if they were the states of the environment and try to learn to
behave. Figure 9 shows a simple environment in which the agent is attempting to get to
the printer from an oce. If it moves from the oce, there is a good chance that the agent
will end up in one of two places that look like \hall", but that require dierent actions for
getting to the printer. If we consider these states to be the same, then the agent cannot
possibly behave optimally. But how well can it do?
The resulting problem is not Markovian, and Q-learning cannot be guaranteed to con-
verge. Small breaches of the Markov requirement are well handled by Q-learning, but it is
possible to construct simple environments that cause Q-learning to oscillate (Chrisman &
Littman, 1993). It is possible to use a model-based approach, however: act according to
some policy and gather statistics about the transitions between observations, then solve for
the optimal policy based on those observations. Unfortunately, when the environment is not
Markovian, the transition probabilities depend on the policy being executed, so this new
policy will induce a new set of transition probabilities. This approach may yield plausible
results in some cases, but again, there are no guarantees.
It is reasonable, though, to ask what the optimal policy (mapping from observations to
actions, in this case) is. It is NP-hard (Littman, 1994b) to find this mapping, and even the
best mapping can have very poor performance. In the case of our agent trying to get to the
printer, for instance, any deterministic state-free policy takes an infinite number of steps to
reach the goal on average.
7.2 State-Free Stochastic Policies
Some improvement can be gained by considering stochastic policies; these are mappings
from observations to probability distributions over actions. If there is randomness in the
agent's actions, it will not get stuck in the hall forever. Jaakkola, Singh, and Jordan (1995)
have developed an algorithm for finding locally-optimal stochastic policies, but finding a
globally optimal policy is still NP-hard.
In our example, it turns out that the optimal stochastic policy is for the agent, when
in a state that looks like a hall, to go east with probability 2 − √2 ≈ 0.6 and west with
probability √2 − 1 ≈ 0.4. This policy can be found by solving a simple (in this case)
quadratic program. The fact that such a simple example can produce irrational numbers
gives some indication that it is a difficult problem to solve exactly.
7.3 Policies with Internal State
The only way to behave truly effectively in a wide range of environments is to use memory
of previous actions and observations to disambiguate the current state. There are a variety
of approaches to learning policies with internal state.
Recurrent Q-learning One intuitively simple approach is to use a recurrent neural network
to learn Q values. The network can be trained using backpropagation through time (or
some other suitable technique) and learns to retain "history features" to predict value. This
approach has been used by a number of researchers (Meeden, McGraw, & Blank, 1993; Lin
& Mitchell, 1992; Schmidhuber, 1991b). It seems to work effectively on simple problems,
but can suffer from convergence to local optima on more complex problems.
Classifier Systems Classifier systems (Holland, 1975; Goldberg, 1989) were explicitly
developed to solve problems with delayed reward, including those requiring short-term
memory. The internal mechanism typically used to pass reward back through chains of
decisions, called the bucket brigade algorithm, bears a close resemblance to Q-learning. In
spite of some early successes, the original design does not appear to handle partially observed
environments robustly.
Recently, this approach has been reexamined using insights from the reinforcement-learning
literature, with some success. Dorigo did a comparative study of Q-learning and
classifier systems (Dorigo & Bersini, 1994). Cliff and Ross (1994) start with Wilson's zeroth-
level classifier system (Wilson, 1995) and add one- and two-bit memory registers. They find
that, although their system can learn to use short-term memory registers effectively, the
approach is unlikely to scale to more complex environments.
Dorigo and Colombetti applied classifier systems to a moderately complex problem of
learning robot behavior from immediate reinforcement (Dorigo, 1995; Dorigo & Colombetti,
1994).
Finite-history-window Approach One way to restore the Markov property is to allow
decisions to be based on the history of recent observations and perhaps actions. Lin and
Mitchell (1992) used a fixed-width finite history window to learn a pole balancing task.
McCallum (1995) describes the "utile suffix memory," which learns a variable-width window
that serves simultaneously as a model of the environment and a finite-memory policy. This
system has had excellent results in a very complex driving-simulation domain (McCallum,
1995). Ring (1994) has a neural-network approach that uses a variable history window,
adding history when necessary to disambiguate situations.

Figure 10: Structure of a POMDP agent: a state estimator (SE) maps the last action a and
the current observation i to a belief state b, which the policy π maps to an action.
POMDP Approach Another strategy consists of using hidden Markov model (HMM)
techniques to learn a model of the environment, including the hidden state, then to use that
model to construct a perfect memory controller (Cassandra, Kaelbling, & Littman, 1994;
Lovejoy, 1991; Monahan, 1982).
Chrisman (1992) showed how the forward-backward algorithm for learning HMMs could
be adapted to learning POMDPs. He, and later McCallum (1993), also gave heuristic state-
splitting rules to attempt to learn the smallest possible model for a given environment. The
resulting model can then be used to integrate information from the agent's observations in
order to make decisions.
Figure 10 illustrates the basic structure for a perfect-memory controller. The component
on the left is the state estimator, which computes the agent's belief state, b, as a function of
the old belief state, the last action a, and the current observation i. In this context, a belief
state is a probability distribution over states of the environment, indicating the likelihood,
given the agent's past experience, that the environment is actually in each of those states.
The state estimator can be constructed straightforwardly using the estimated world model
and Bayes' rule.
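A hedged sketch of such a state estimator is shown below: it applies Bayes' rule to the previous belief,
the learned transition model, and an assumed observation model O(s', a, o) giving the probability of
observing o in state s' after action a. All names are placeholders for whatever estimated world model
the agent maintains.

    def update_belief(belief, a, obs, states, T, O):
        # b'(s') is proportional to O(s', a, obs) * sum_s T(s, a, s') * b(s)
        new_belief = {}
        for s2 in states:
            new_belief[s2] = O.get((s2, a, obs), 0.0) * sum(
                T.get((s, a, s2), 0.0) * belief.get(s, 0.0) for s in states)
        total = sum(new_belief.values())
        if total > 0.0:
            for s2 in new_belief:
                new_belief[s2] /= total              # normalize to a probability distribution
        return new_belief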
Now we are left with the problem of finding a policy mapping belief states into action.
This problem can be formulated as an MDP, but it is difficult to solve using the techniques
described earlier, because the input space is continuous. Chrisman's approach (1992) does
not take into account future uncertainty, but yields a policy after a small amount of com-
putation. A standard approach from the operations-research literature is to solve for the
optimal policy (or a close approximation thereof) based on its representation as a piecewise-
linear and convex function over the belief space. This method is computationally intractable,
but may serve as inspiration for methods that make further approximations (Cassandra
et al., 1994; Littman, Cassandra, & Kaelbling, 1995a).

8. Reinforcement Learning Applications


One reason that reinforcement learning is popular is that it serves as a theoretical tool for
studying the principles of agents learning to act. But it is unsurprising that it has also
been used by a number of researchers as a practical computational tool for constructing
autonomous systems that improve themselves with experience. These applications have
ranged from robotics, to industrial manufacturing, to combinatorial search problems such
as computer game playing.
Practical applications provide a test of the efficacy and usefulness of learning algorithms.
They are also an inspiration for deciding which components of the reinforcement learning
framework are of practical importance. For example, a researcher with a real robotic task
can provide a data point to questions such as:
How important is optimal exploration? Can we break the learning period into explo-
ration phases and exploitation phases?
What is the most useful model of long-term reward: Finite horizon? Discounted?
Infinite horizon?
How much computation is available between agent decisions and how should it be
used?
What prior knowledge can we build into the system, and which algorithms are capable
of using that knowledge?
Let us examine a set of practical applications of reinforcement learning, while bearing these
questions in mind.
8.1 Game Playing
Game playing has dominated the Artificial Intelligence world as a problem domain ever since
the field was born. Two-player games do not fit into the established reinforcement-learning
framework since the optimality criterion for games is not one of maximizing reward in the
face of a fixed environment, but one of maximizing reward against an optimal adversary
(minimax). Nonetheless, reinforcement-learning algorithms can be adapted to work for a
very general class of games (Littman, 1994a) and many researchers have used reinforcement
learning in these environments. One application, spectacularly far ahead of its time, was
Samuel's checkers playing system (Samuel, 1959). This learned a value function represented
by a linear function approximator, and employed a training scheme similar to the updates
used in value iteration, temporal differences and Q-learning.
More recently, Tesauro (1992, 1994, 1995) applied the temporal difference algorithm
to backgammon. Backgammon has approximately 10^20 states, making table-based reinforcement
learning impossible. Instead, Tesauro used a backpropagation-based three-layer
270
Reinforcement Learning: A Survey

Training Hidden Results


Games Units
Basic Poor
TD 1.0 300,000 80 Lost by 13 points in 51
games
TD 2.0 800,000 40 Lost by 7 points in 38
games
TD 2.1 1,500,000 80 Lost by 1 point in 40
games

Table 2: TD-Gammon's performance in games against the top human professional players.
A backgammon tournament involves playing a series of games for points until one
player reaches a set target. TD-Gammon won none of these tournaments but came
suciently close that it is now considered one of the best few players in the world.

neural network as a function approximator for the value function


Board Position ! Probability of victory for current player:
Two versions of the learning algorithm were used. The first, which we will call Basic TD-Gammon, used very little predefined knowledge of the game, and the representation of a board position was virtually a raw encoding, sufficiently powerful only to permit the neural network to distinguish between conceptually different positions. The second, TD-Gammon, was provided with the same raw state information supplemented by a number of hand-crafted features of backgammon board positions. Providing hand-crafted features in this manner is a good example of how inductive biases from human knowledge of the task can be supplied to a learning algorithm.
The training of both learning algorithms required several months of computer time, and was achieved by constant self-play. No exploration strategy was used: the system always greedily chose the move with the largest expected probability of victory. This naive exploration strategy proved entirely adequate for this environment, which is perhaps surprising given the considerable work in the reinforcement-learning literature which has produced numerous counter-examples to show that greedy exploration can lead to poor learning performance. Backgammon, however, has two important properties. Firstly, whatever policy is followed, every game is guaranteed to end in finite time, meaning that useful reward information is obtained fairly frequently. Secondly, the state transitions are sufficiently stochastic that, independent of the policy, all states will occasionally be visited; a wrong initial value function has little danger of starving us from visiting a critical part of state space from which important information could be obtained.
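To illustrate the kind of temporal-difference update driving this training, the sketch below applies TD(lambda) over one self-play game to a linear value function V(s) = w·phi(s). TD-Gammon itself used a multilayer network trained by backpropagation, and the step size, trace-decay parameter, and feature encoding here are assumptions made only to keep the update rule visible.

    import numpy as np

    def td_lambda_game(features, outcome, w, alpha=0.1, lam=0.7, gamma=1.0):
        """One game's worth of TD(lambda) updates for a linear value function.

        features : list of feature vectors phi(s_0), ..., phi(s_T) for the
                   positions visited during the game
        outcome  : final result used as the terminal target (e.g. 1.0 for a win)
        w        : weight vector defining V(s) = w . phi(s)
        """
        e = np.zeros_like(w)                                 # eligibility trace
        for t in range(len(features) - 1):
            phi, phi_next = features[t], features[t + 1]
            delta = gamma * (w @ phi_next) - (w @ phi)       # no reward mid-game
            e = gamma * lam * e + phi
            w = w + alpha * delta * e
        # Terminal step: the game outcome replaces the bootstrapped target.
        e = gamma * lam * e + features[-1]
        w = w + alpha * (outcome - w @ features[-1]) * e
        return w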
The results (Table 2) of TD-Gammon are impressive. It has competed at the very top
level of international human play. Basic TD-Gammon played respectably, but not at a
professional standard.
Figure 11: Schaal and Atkeson's devil-sticking robot. The tapered stick is hit alternately by each of the two hand sticks. The task is to keep the devil stick from falling for as many hits as possible. The robot has three motors indicated by torque vectors τ1, τ2, τ3.

Although experiments with other games have in some cases produced interesting learning
behavior, no success close to that of TD-Gammon has been repeated. Other games that
have been studied include Go (Schraudolph, Dayan, & Sejnowski, 1994) and Chess (Thrun,
1995). It remains an open question whether and how the success of TD-Gammon can be repeated in other domains.
8.2 Robotics and Control
In recent years there have been many robotics and control applications that have used reinforcement learning. Here we will concentrate on the following examples, although many other interesting robotics investigations are ongoing.

1. Schaal and Atkeson (1994) constructed a two-armed robot, shown in Figure 11, that learns to juggle a device known as a devil-stick. This is a complex non-linear control task involving a six-dimensional state space and less than 200 msecs per control decision. After about 40 initial attempts the robot learns to keep juggling for hundreds of hits. A typical human learning the task requires an order of magnitude more practice to achieve proficiency at mere tens of hits.
The juggling robot learned a world model from experience, which was generalized to unvisited states by a function approximation scheme known as locally weighted regression (Cleveland & Devlin, 1988; Moore & Atkeson, 1992). Between each trial, a form of dynamic programming specific to linear control policies and locally linear transitions was used to improve the policy. The form of dynamic programming is known as linear-quadratic-regulator design (Sage & White, 1977).
2. Mahadevan and Connell (1991a) discuss a task in which a mobile robot pushes large boxes for extended periods of time. Box-pushing is a well-known difficult robotics problem, characterized by immense uncertainty in the results of actions. Q-learning was used in conjunction with some novel clustering techniques designed to enable a higher-dimensional input than a tabular approach would have permitted. The robot learned to perform competitively with a human-programmed solution. Another aspect of this work, mentioned in Section 6.3, was a pre-programmed breakdown of the monolithic task description into a set of lower-level tasks to be learned.
3. Mataric (1994) describes a robotics experiment with, from the viewpoint of theoretical reinforcement learning, an unthinkably high-dimensional state space, containing many dozens of degrees of freedom. Four mobile robots traveled within an enclosure collecting small disks and transporting them to a destination region. There were three enhancements to the basic Q-learning algorithm. Firstly, pre-programmed signals called progress estimators were used to break the monolithic task into subtasks. This was achieved in a robust manner in which the robots were not forced to use the estimators, but had the freedom to profit from the inductive bias they provided. Secondly, control was decentralized. Each robot learned its own policy independently without explicit communication with the others. Thirdly, state space was brutally quantized into a small number of discrete states according to values of a small number of pre-programmed boolean features of the underlying sensors. The performance of the Q-learned policies was almost as good as that of a simple hand-crafted controller for the job.
4. Q-learning has been used in an elevator dispatching task (Crites & Barto, 1996). The problem, which has been implemented in simulation only at this stage, involved four elevators servicing ten floors. The objective was to minimize the average squared wait time for passengers, discounted into future time. The problem can be posed as a discrete Markov system, but there are 10^22 states even in the most simplified version of the problem. Crites and Barto used neural networks for function approximation and provided an excellent comparison study of their Q-learning approach against the most popular and the most sophisticated elevator dispatching algorithms. The squared wait time of their controller was approximately 7% less than the best alternative algorithm ("Empty the System" heuristic with a receding horizon controller) and less than half the squared wait time of the controller most frequently used in real elevator systems.
5. The final example concerns an application of reinforcement learning by one of the authors of this survey to a packaging task from a food processing industry. The problem involves filling containers with variable numbers of non-identical products. The product characteristics also vary with time, but can be sensed. Depending on the task, various constraints are placed on the container-filling procedure. Here are three examples:
• The mean weight of all containers produced by a shift must not be below the manufacturer's declared weight W.
• The number of containers below the declared weight must be less than P%.
• No containers may be produced below weight W'.
Such tasks are controlled by machinery which operates according to various setpoints. Conventional practice is that setpoints are chosen by human operators, but this choice is not easy as it is dependent on the current product characteristics and the current task constraints. The dependency is often difficult to model and highly non-linear. The task was posed as a finite-horizon Markov decision task in which the state of the system is a function of the product characteristics, the amount of time remaining in the production shift, and the mean wastage and percent below declared in the shift so far. The system was discretized into 200,000 discrete states and locally weighted regression was used to learn and generalize a transition model. Prioritized sweeping was used to maintain an optimal value function as each new piece of transition information was obtained (a minimal sketch of prioritized sweeping appears after this list). In simulated experiments the savings were considerable, typically with wastage reduced by a factor of ten. Since then the system has been deployed successfully in several factories within the United States.
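For concreteness, here is a minimal sketch of the prioritized-sweeping loop referred to in the packaging example, assuming a discrete learned model stored in Python dictionaries. The data structures, priority threshold, and backup budget are illustrative assumptions, not details of the deployed system.

    import heapq
    import itertools

    def prioritized_sweeping(V, model, predecessors, start_state,
                             gamma=0.95, theta=1e-3, max_backups=1000):
        """Back up the states whose values look most out of date first.

        V            : dict state -> value estimate
        model        : model[s][a] = list of (prob, reward, next_state) triples
        predecessors : predecessors[s] = set of (state, action) pairs leading to s
        """
        counter = itertools.count()   # tie-breaker so states are never compared directly
        queue = [(0.0, next(counter), start_state)]
        queued = {start_state}

        def backup(s):
            # Full Bellman backup over the learned model for state s.
            return max(sum(p * (r + gamma * V.get(s2, 0.0)) for p, r, s2 in model[s][a])
                       for a in model[s])

        for _ in range(max_backups):
            if not queue:
                break
            _, _, s = heapq.heappop(queue)
            queued.discard(s)
            new_v = backup(s)
            change = abs(new_v - V.get(s, 0.0))
            V[s] = new_v
            # A large change at s suggests its predecessors' values are stale too.
            for sp, a in predecessors.get(s, ()):
                priority = max(p for p, r, s2 in model[sp][a] if s2 == s) * change
                if priority > theta and sp not in queued:
                    heapq.heappush(queue, (-priority, next(counter), sp))  # min-heap, so negate
                    queued.add(sp)
        return V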
Some interesting aspects of practical reinforcement learning come to light from these examples. The most striking is that in all cases, to make a real system work it proved necessary to supplement the fundamental algorithm with extra pre-programmed knowledge. Supplying extra knowledge comes at a price: more human effort and insight are required and the system is subsequently less autonomous. But it is also clear that for tasks such as these, a knowledge-free approach would not have achieved worthwhile performance within the finite lifetime of the robots.
What forms did this pre-programmed knowledge take? It included an assumption of linearity for the juggling robot's policy and a manual breaking up of the task into subtasks for the two mobile-robot examples; the box-pusher also used a clustering technique for the Q values which assumed locally consistent Q values. The four disk-collecting robots
additionally used a manually discretized state space. The packaging example had far fewer
dimensions and so required correspondingly weaker assumptions, but there, too, the as-
sumption of local piecewise continuity in the transition model enabled massive reductions
in the amount of learning data required.
The exploration strategies are interesting too. The juggler used careful statistical analysis to judge where to profitably experiment. However, both mobile robot applications were able to learn well with greedy exploration: always exploiting without deliberate exploration. The packaging task used optimism in the face of uncertainty. None of these strategies mirrors theoretically optimal (but computationally intractable) exploration, and yet all proved adequate.
Finally, it is also worth considering the computational regimes of these experiments. They were all very different, which indicates that the differing computational demands of various reinforcement learning algorithms do indeed have an array of differing applications. The juggler needed to make very fast decisions with low latency between each hit, but had long periods (30 seconds and more) between each trial to consolidate the experiences collected on the previous trial and to perform the more aggressive computation necessary to produce a new reactive controller on the next trial. The box-pushing robot was meant to
operate autonomously for hours and so had to make decisions with a uniform length control cycle. The cycle was sufficiently long for quite substantial computations beyond simple Q-learning backups. The four disk-collecting robots were particularly interesting. Each robot had a short life of less than 20 minutes (due to battery constraints), meaning that substantial number crunching was impractical, and any significant combinatorial search would have used a significant fraction of the robot's learning lifetime. The packaging task had easy constraints. One decision was needed every few minutes. This provided opportunities for fully computing the optimal value function for the 200,000-state system between every control cycle, in addition to performing massive cross-validation-based optimization of the transition model being learned.
A great deal of further work is currently in progress on practical implementations of reinforcement learning. The insights and task constraints that they produce will have an important effect on shaping the kind of algorithms that are developed in future.

9. Conclusions
There are a variety of reinforcement-learning techniques that work effectively on a variety of small problems. But very few of these techniques scale well to larger problems. This is not because researchers have done a bad job of inventing learning techniques, but because it is very difficult to solve arbitrary problems in the general case. In order to solve highly complex problems, we must give up tabula rasa learning techniques and begin to incorporate bias that will give leverage to the learning process.
The necessary bias can come in a variety of forms, including the following:
shaping: The technique of shaping is used in training animals (Hilgard & Bower, 1975): a teacher presents very simple problems to solve first, then gradually exposes the learner to more complex problems. Shaping has been used in supervised-learning systems, and can be used to train hierarchical reinforcement-learning systems from the bottom up (Lin, 1991), and to alleviate problems of delayed reinforcement by decreasing the delay until the problem is well understood (Dorigo & Colombetti, 1994; Dorigo, 1995).
local reinforcement signals: Whenever possible, agents should be given reinforcement signals that are local. In applications in which it is possible to compute a gradient, rewarding the agent for taking steps up the gradient, rather than just for achieving the final goal, can speed learning significantly (Mataric, 1994); a minimal sketch of such a signal appears after this list.
imitation: An agent can learn by "watching" another agent perform the task (Lin, 1991). For real robots, this requires perceptual abilities that are not yet available. But another strategy is to have a human supply appropriate motor commands to a robot through a joystick or steering wheel (Pomerleau, 1993).
problem decomposition: Decomposing a huge learning problem into a collection of smaller
ones, and providing useful reinforcement signals for the subproblems is a very power-
ful technique for biasing learning. Most interesting examples of robotic reinforcement
learning employ this technique to some extent (Connell & Mahadevan, 1993).
reflexes: One thing that keeps agents that know nothing from learning anything is that they have a hard time even finding the interesting parts of the space; they wander around at random, never getting near the goal, or they are always "killed" immediately. These problems can be ameliorated by programming a set of "reflexes" that cause the agent to act initially in some way that is reasonable (Mataric, 1994; Singh, Barto, Grupen, & Connolly, 1994). These reflexes can eventually be overridden by more detailed and accurate learned knowledge, but they at least keep the agent alive and pointed in the right direction while it is trying to learn. Recent work by Millan (1996) explores the use of reflexes to make robot learning safer and more efficient.
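As a toy illustration of the local reinforcement signals mentioned above, the sketch below adds a small bonus for progress along a known gradient, here the reduction in Euclidean distance to a goal point. The particular form and the scale constant are illustrative assumptions, not the scheme used by Mataric.

    import math

    def locally_shaped_reward(state, next_state, goal, task_reward, scale=0.1):
        """Task reward plus a local signal for moving down the distance-to-goal
        gradient; states and the goal are coordinate tuples (a toy sketch)."""
        progress = math.dist(state, goal) - math.dist(next_state, goal)
        return task_reward + scale * progress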
With appropriate biases, supplied by human programmers or teachers, complex reinforcement-learning problems will eventually be solvable. There is still much work to be done and many interesting questions remain for learning techniques, especially regarding methods for approximating, decomposing, and incorporating bias into problems.

Acknowledgements
Thanks to Marco Dorigo and three anonymous reviewers for comments that have helped
to improve this paper. Also thanks to our many colleagues in the reinforcement-learning
community who have done this work and explained it to us.
Leslie Pack Kaelbling was supported in part by NSF grants IRI-9453383 and IRI-
9312395. Michael Littman was supported in part by Bellcore. Andrew Moore was supported
in part by an NSF Research Initiation Award and by 3M Corporation.

References
Ackley, D. H., & Littman, M. L. (1990). Generalization and scaling in reinforcement learning. In Touretzky, D. S. (Ed.), Advances in Neural Information Processing Systems 2, pp. 550–557 San Mateo, CA. Morgan Kaufmann.
Albus, J. S. (1975). A new approach to manipulator control: Cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement and Control, 97, 220–227.
Albus, J. S. (1981). Brains, Behavior, and Robotics. BYTE Books, Subsidiary of McGraw-Hill, Peterborough, New Hampshire.
Anderson, C. W. (1986). Learning and Problem Solving with Multilayer Connectionist Systems. Ph.D. thesis, University of Massachusetts, Amherst, MA.
Ashar, R. R. (1994). Hierarchical learning in stochastic domains. Master's thesis, Brown University, Providence, Rhode Island.
Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37 San Francisco, CA. Morgan Kaufmann.
Baird, L. C., & Klopf, A. H. (1993). Reinforcement learning with high-dimensional, continuous actions. Tech. rep. WL-TR-93-1147, Wright-Patterson Air Force Base Ohio: Wright Laboratory.
Barto, A. G., Bradtke, S. J., & Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72 (1), 81–138.
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13 (5), 834–846.
Bellman, R. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.
Berenji, H. R. (1991). Artificial neural networks and approximate reasoning for intelligent control in space. In American Control Conference, pp. 1075–1080.
Berry, D. A., & Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London, UK.
Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ.
Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control. Athena Scientific, Belmont, Massachusetts. Volumes 1 and 2.
Bertsekas, D. P., & Castañon, D. A. (1989). Adaptive aggregation for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34 (6), 589–598.
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.
Box, G. E. P., & Draper, N. R. (1987). Empirical Model-Building and Response Surfaces. Wiley.
Boyan, J. A., & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In Tesauro, G., Touretzky, D. S., & Leen, T. K. (Eds.), Advances in Neural Information Processing Systems 7 Cambridge, MA. The MIT Press.
Burghes, D., & Graham, A. (1980). Introduction to Control Theory including Optimal Control. Ellis Horwood.
Cassandra, A. R., Kaelbling, L. P., & Littman, M. L. (1994). Acting optimally in partially observable stochastic domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence Seattle, WA.
Chapman, D., & Kaelbling, L. P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the International Joint Conference on Artificial Intelligence Sydney, Australia.
Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 183–188 San Jose, CA. AAAI Press.

Chrisman, L., & Littman, M. (1993). Hidden state and short-term memory. Presentation at Reinforcement Learning Workshop, Machine Learning Conference.
Cichosz, P., & Mulawka, J. J. (1995). Fast and efficient reinforcement learning with truncated temporal differences. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 99–107 San Francisco, CA. Morgan Kaufmann.
Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83 (403), 596–610.
Cliff, D., & Ross, S. (1994). Adding temporary memory to ZCS. Adaptive Behavior, 3 (2), 101–150.
Condon, A. (1992). The complexity of stochastic games. Information and Computation, 96 (2), 203–224.
Connell, J., & Mahadevan, S. (1993). Rapid task learning for real robots. In Robot Learning. Kluwer Academic Publishers.
Crites, R. H., & Barto, A. G. (1996). Improving elevator performance using reinforcement learning. In Touretzky, D., Mozer, M., & Hasselmo, M. (Eds.), Neural Information Processing Systems 8.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8 (3), 341–362.
Dayan, P., & Hinton, G. E. (1993). Feudal reinforcement learning. In Hanson, S. J., Cowan, J. D., & Giles, C. L. (Eds.), Advances in Neural Information Processing Systems 5 San Mateo, CA. Morgan Kaufmann.
Dayan, P., & Sejnowski, T. J. (1994). TD(λ) converges with probability 1. Machine Learning, 14 (3).
Dean, T., Kaelbling, L. P., Kirman, J., & Nicholson, A. (1993). Planning with deadlines in stochastic domains. In Proceedings of the Eleventh National Conference on Artificial Intelligence Washington, DC.
D'Epenoux, F. (1963). A probabilistic production and inventory problem. Management Science, 10, 98–108.
Derman, C. (1970). Finite State Markovian Decision Processes. Academic Press, New York.
Dorigo, M., & Bersini, H. (1994). A comparison of Q-learning and classifier systems. In From Animals to Animats: Proceedings of the Third International Conference on the Simulation of Adaptive Behavior Brighton, UK.
Dorigo, M., & Colombetti, M. (1994). Robot shaping: Developing autonomous agents through learning. Artificial Intelligence, 71 (2), 321–370.
Dorigo, M. (1995). Alecsys and the AutonoMouse: Learning to control a real robot by distributed classifier systems. Machine Learning, 19.
Fiechter, C.-N. (1994). Efficient reinforcement learning. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pp. 88–97. Association of Computing Machinery.
Gittins, J. C. (1989). Multi-armed Bandit Allocation Indices. Wiley-Interscience series in systems and optimization. Wiley, Chichester, NY.
Goldberg, D. (1989). Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, MA.
Gordon, G. J. (1995). Stable function approximation in dynamic programming. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268 San Francisco, CA. Morgan Kaufmann.
Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3, 671–692.
Gullapalli, V. (1992). Reinforcement learning and its application to control. Ph.D. thesis, University of Massachusetts, Amherst, MA.
Hilgard, E. R., & Bower, G. H. (1975). Theories of Learning (fourth edition). Prentice-Hall, Englewood Cliffs, NJ.
Hoffman, A. J., & Karp, R. M. (1966). On nonterminating stochastic games. Management Science, 12, 359–370.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI.
Howard, R. A. (1960). Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA.
Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6 (6).
Jaakkola, T., Singh, S. P., & Jordan, M. I. (1995). Monte-carlo reinforcement learning in non-Markovian decision problems. In Tesauro, G., Touretzky, D. S., & Leen, T. K. (Eds.), Advances in Neural Information Processing Systems 7 Cambridge, MA. The MIT Press.
Kaelbling, L. P. (1993a). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning Amherst, MA. Morgan Kaufmann.
Kaelbling, L. P. (1993b). Learning in Embedded Systems. The MIT Press, Cambridge, MA.
Kaelbling, L. P. (1994a). Associative reinforcement learning: A generate and test algorithm. Machine Learning, 15 (3).

Kaelbling, L. P. (1994b). Associative reinforcement learning: Functions in k-DNF. Machine Learning, 15 (3).
Kirman, J. (1994). Predicting Real-Time Planner Performance by Domain Characterization. Ph.D. thesis, Department of Computer Science, Brown University.
Koenig, S., & Simmons, R. G. (1993). Complexity analysis of real-time reinforcement learning. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pp. 99–105 Menlo Park, California. AAAI Press/MIT Press.
Kumar, P. R., & Varaiya, P. P. (1986). Stochastic Systems: Estimation, Identification, and Adaptive Control. Prentice Hall, Englewood Cliffs, New Jersey.
Lee, C. C. (1991). A self learning rule-based controller employing approximate reasoning and neural net concepts. International Journal of Intelligent Systems, 6 (1), 71–93.
Lin, L.-J. (1991). Programming robots using reinforcement learning and teaching. In Proceedings of the Ninth National Conference on Artificial Intelligence.
Lin, L.-J. (1993a). Hierarchical learning of robot skills by reinforcement. In Proceedings of the International Conference on Neural Networks.
Lin, L.-J. (1993b). Reinforcement Learning for Robots Using Neural Networks. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA.
Lin, L.-J., & Mitchell, T. M. (1992). Memory approaches to reinforcement learning in non-Markovian domains. Tech. rep. CMU-CS-92-138, Carnegie Mellon University, School of Computer Science.
Littman, M. L. (1994a). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pp. 157–163 San Francisco, CA. Morgan Kaufmann.
Littman, M. L. (1994b). Memoryless policies: Theoretical limitations and practical results. In Cliff, D., Husbands, P., Meyer, J.-A., & Wilson, S. W. (Eds.), From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior Cambridge, MA. The MIT Press.
Littman, M. L., Cassandra, A., & Kaelbling, L. P. (1995a). Learning policies for partially observable environments: Scaling up. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 362–370 San Francisco, CA. Morgan Kaufmann.
Littman, M. L., Dean, T. L., & Kaelbling, L. P. (1995b). On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-95) Montreal, Québec, Canada.
Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28, 47–66.
Maes, P., & Brooks, R. A. (1990). Learning to coordinate behaviors. In Proceedings Eighth National Conference on Artificial Intelligence, pp. 796–802. Morgan Kaufmann.
Mahadevan, S. (1994). To discount or not to discount in reinforcement learning: A case study comparing R learning and Q learning. In Proceedings of the Eleventh International Conference on Machine Learning, pp. 164–172 San Francisco, CA. Morgan Kaufmann.
Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22 (1).
Mahadevan, S., & Connell, J. (1991a). Automatic programming of behavior-based robots using reinforcement learning. In Proceedings of the Ninth National Conference on Artificial Intelligence Anaheim, CA.
Mahadevan, S., & Connell, J. (1991b). Scaling reinforcement learning to robotics by exploiting the subsumption architecture. In Proceedings of the Eighth International Workshop on Machine Learning, pp. 328–332.
Mataric, M. J. (1994). Reward functions for accelerated learning. In Cohen, W. W., & Hirsh, H. (Eds.), Proceedings of the Eleventh International Conference on Machine Learning. Morgan Kaufmann.
McCallum, A. K. (1995). Reinforcement Learning with Selective Perception and Hidden State. Ph.D. thesis, Department of Computer Science, University of Rochester.
McCallum, R. A. (1993). Overcoming incomplete perception with utile distinction memory. In Proceedings of the Tenth International Conference on Machine Learning, pp. 190–196 Amherst, Massachusetts. Morgan Kaufmann.
McCallum, R. A. (1995). Instance-based utile distinctions for reinforcement learning with hidden state. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 387–395 San Francisco, CA. Morgan Kaufmann.
Meeden, L., McGraw, G., & Blank, D. (1993). Emergent control and planning in an autonomous vehicle. In Touretsky, D. (Ed.), Proceedings of the Fifteenth Annual Meeting of the Cognitive Science Society, pp. 735–740. Lawrence Erlbaum Associates, Hillsdale, NJ.
Millan, J. d. R. (1996). Rapid, safe, and incremental learning of navigation strategies. IEEE Transactions on Systems, Man, and Cybernetics, 26 (3).
Monahan, G. E. (1982). A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28, 1–16.
Moore, A. W. (1991). Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued spaces. In Proc. Eighth International Machine Learning Workshop.

Moore, A. W. (1994). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. In Cowan, J. D., Tesauro, G., & Alspector, J. (Eds.), Advances in Neural Information Processing Systems 6, pp. 711–718 San Mateo, CA. Morgan Kaufmann.
Moore, A. W., & Atkeson, C. G. (1992). An investigation of memory-based function approximators for learning control. Tech. rep., MIT Artificial Intelligence Laboratory, Cambridge, MA.
Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13.
Moore, A. W., Atkeson, C. G., & Schaal, S. (1995). Memory-based learning for control. Tech. rep. CMU-RI-TR-95-18, CMU Robotics Institute.
Narendra, K., & Thathachar, M. A. L. (1989). Learning Automata: An Introduction. Prentice-Hall, Englewood Cliffs, NJ.
Narendra, K. S., & Thathachar, M. A. L. (1974). Learning automata: A survey. IEEE Transactions on Systems, Man, and Cybernetics, 4 (4), 323–334.
Peng, J., & Williams, R. J. (1993). Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1 (4), 437–454.
Peng, J., & Williams, R. J. (1994). Incremental multi-step Q-learning. In Proceedings of the Eleventh International Conference on Machine Learning, pp. 226–232 San Francisco, CA. Morgan Kaufmann.
Pomerleau, D. A. (1993). Neural network perception for mobile robot guidance. Kluwer Academic Publishing.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY.
Puterman, M. L., & Shin, M. C. (1978). Modified policy iteration algorithms for discounted Markov decision processes. Management Science, 24, 1127–1137.
Ring, M. B. (1994). Continual Learning in Reinforcement Environments. Ph.D. thesis, University of Texas at Austin, Austin, Texas.
Rüde, U. (1993). Mathematical and computational techniques for multilevel adaptive methods. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania.
Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Parallel Distributed Processing: Explorations in the microstructures of cognition. Volume 1: Foundations. The MIT Press, Cambridge, MA.
Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Tech. rep. CUED/F-INFENG/TR166, Cambridge University.
Rust, J. (1996). Numerical dynamic programming in economics. In Handbook of Computational Economics. Elsevier, North Holland.
Sage, A. P., & White, C. C. (1977). Optimum Systems Control. Prentice Hall.
Salganicoff, M., & Ungar, L. H. (1995). Active exploration and learning in real-valued spaces using multi-armed bandit allocation indices. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 480–487 San Francisco, CA. Morgan Kaufmann.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 211–229. Reprinted in E. A. Feigenbaum and J. Feldman, editors, Computers and Thought, McGraw-Hill, New York 1963.
Schaal, S., & Atkeson, C. (1994). Robot juggling: An implementation of memory-based learning. Control Systems Magazine, 14.
Schmidhuber, J. (1996). A general method for multi-agent learning and incremental self-improvement in unrestricted environments. In Yao, X. (Ed.), Evolutionary Computation: Theory and Applications. Scientific Publ. Co., Singapore.
Schmidhuber, J. H. (1991a). Curious model-building control systems. In Proc. International Joint Conference on Neural Networks, Singapore, Vol. 2, pp. 1458–1463. IEEE.
Schmidhuber, J. H. (1991b). Reinforcement learning in Markovian and non-Markovian environments. In Lippman, D. S., Moody, J. E., & Touretzky, D. S. (Eds.), Advances in Neural Information Processing Systems 3, pp. 500–506 San Mateo, CA. Morgan Kaufmann.
Schraudolph, N. N., Dayan, P., & Sejnowski, T. J. (1994). Temporal difference learning of position evaluation in the game of Go. In Cowan, J. D., Tesauro, G., & Alspector, J. (Eds.), Advances in Neural Information Processing Systems 6, pp. 817–824 San Mateo, CA. Morgan Kaufmann.
Schrijver, A. (1986). Theory of Linear and Integer Programming. Wiley-Interscience, New York, NY.
Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, pp. 298–305 Amherst, Massachusetts. Morgan Kaufmann.
Singh, S. P., Barto, A. G., Grupen, R., & Connolly, C. (1994). Robust reinforcement learning in motion planning. In Cowan, J. D., Tesauro, G., & Alspector, J. (Eds.), Advances in Neural Information Processing Systems 6, pp. 655–662 San Mateo, CA. Morgan Kaufmann.
Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22 (1).

Singh, S. P. (1992a). Reinforcement learning with a hierarchy of abstract models. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 202–207 San Jose, CA. AAAI Press.
Singh, S. P. (1992b). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8 (3), 323–340.
Singh, S. P. (1993). Learning to Solve Markovian Decision Processes. Ph.D. thesis, Department of Computer Science, University of Massachusetts. Also, CMPSCI Technical Report 93-77.
Stengel, R. F. (1986). Stochastic Optimal Control. John Wiley and Sons.
Sutton, R. S. (1996). Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. In Touretzky, D., Mozer, M., & Hasselmo, M. (Eds.), Neural Information Processing Systems 8.
Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning. Ph.D. thesis, University of Massachusetts, Amherst, MA.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3 (1), 9–44.
Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning Austin, TX. Morgan Kaufmann.
Sutton, R. S. (1991). Planning by incremental dynamic programming. In Proceedings of the Eighth International Workshop on Machine Learning, pp. 353–357. Morgan Kaufmann.
Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8, 257–277.
Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6 (2), 215–219.
Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38 (3), 58–67.
Tham, C.-K., & Prager, R. W. (1994). A modular Q-learning architecture for manipulator task decomposition. In Proceedings of the Eleventh International Conference on Machine Learning San Francisco, CA. Morgan Kaufmann.
Thrun, S. (1995). Learning to play the game of chess. In Tesauro, G., Touretzky, D. S., & Leen, T. K. (Eds.), Advances in Neural Information Processing Systems 7 Cambridge, MA. The MIT Press.
Thrun, S., & Schwartz, A. (1993). Issues in using function approximation for reinforcement learning. In Mozer, M., Smolensky, P., Touretzky, D., Elman, J., & Weigend, A. (Eds.), Proceedings of the 1993 Connectionist Models Summer School Hillsdale, NJ. Lawrence Erlbaum.
Thrun, S. B. (1992). The role of exploration in learning control. In White, D. A., & Sofge, D. A. (Eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold, New York, NY.
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16 (3).
Tsitsiklis, J. N., & Van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22 (1).
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27 (11), 1134–1142.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, King's College, Cambridge, UK.
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8 (3), 279–292.
Whitehead, S. D. (1991). Complexity and cooperation in Q-learning. In Proceedings of the Eighth International Workshop on Machine Learning Evanston, IL. Morgan Kaufmann.
Williams, R. J. (1987). A class of gradient-estimating algorithms for reinforcement learning in neural networks. In Proceedings of the IEEE First International Conference on Neural Networks San Diego, CA.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8 (3), 229–256.
Williams, R. J., & Baird, III, L. C. (1993a). Analysis of some incremental variants of policy iteration: First steps toward understanding actor-critic learning systems. Tech. rep. NU-CCS-93-11, Northeastern University, College of Computer Science, Boston, MA.
Williams, R. J., & Baird, III, L. C. (1993b). Tight performance bounds on greedy policies based on imperfect value functions. Tech. rep. NU-CCS-93-14, Northeastern University, College of Computer Science, Boston, MA.
Wilson, S. (1995). Classifier fitness based on accuracy. Evolutionary Computation, 3 (2), 147–173.
Zhang, W., & Dietterich, T. G. (1995). A reinforcement learning approach to job-shop scheduling. In Proceedings of the International Joint Conference on Artificial Intelligence.
