CHAPTER 4
Estimating Probabilities
Machine Learning
Copyright © 2017. Tom M. Mitchell. All rights reserved.
*DRAFT OF January 26, 2018*
This is a rough draft chapter intended for inclusion in the upcoming second
edition of the textbook Machine Learning, T.M. Mitchell, McGraw Hill.
You are welcome to use this for educational purposes, but do not duplicate
or repost it on the internet. For online copies of this and other materials
related to this book, visit the web site www.cs.cmu.edu/∼tom/mlbook.html.
Please send suggestions for improvements, or suggested exercises, to
[email protected].
Table 1: A Joint Probability Distribution. This table defines a joint probability distribution over three random variables: Gender, HoursWorked, and Wealth.
Table 1 describes a set of people in terms of their Gender, the number of HoursWorked each week, and their Wealth. In general,
defining a joint probability distribution over a set of discrete-valued variables in-
volves three simple steps:
1. Define the random variables, and the set of values each variable can take
on. For example, in Table 1 the variable Gender can take on the value
male or female, the variable HoursWorked can take on the value “< 40.5” or “≥ 40.5,” and Wealth can take on values rich or poor.
2. Create a table containing one row for each possible joint assignment of val-
ues to the variables. For example, Table 1 has 8 rows, corresponding to the 8
possible ways of jointly assigning values to three boolean-valued variables.
More generally, if we have n boolean-valued variables, there will be 2^n rows in the table.
3. Define a probability for each possible joint assignment of values to the vari-
ables. Because the rows cover every possible joint assignment of values,
their probabilities must sum to 1.
• The probability that any single variable will take on any specific value. For
example, we can calculate that the probability P(Gender = male) = 0.6685
for the joint distribution in Table 1, by summing the four rows for which
Gender = male. Similarly, we can calculate the probability P(Wealth =
rich) = 0.2393 by adding together the probabilities for the four rows cover-
ing the cases for which Wealth=rich.
• The probability that any subset of the variables will take on a particular joint assignment. For example, we can calculate that the probability P(Wealth = rich ∧ Gender = female) = 0.0362, by summing the two table rows that satisfy this joint assignment (see the sketch below).
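The short Python sketch below illustrates both computations (the marginal and the joint-subset probability). The eight row probabilities used here are hypothetical placeholders, not the values from Table 1; only the pattern of summing the rows that match an assignment matters.

```python
# A minimal sketch of working with a joint distribution like Table 1.
joint = {
    # (gender, hours_worked, wealth): probability  -- hypothetical values
    ("male",   "<40.5",  "rich"): 0.02,
    ("male",   "<40.5",  "poor"): 0.05,
    ("male",   ">=40.5", "rich"): 0.18,
    ("male",   ">=40.5", "poor"): 0.42,
    ("female", "<40.5",  "rich"): 0.01,
    ("female", "<40.5",  "poor"): 0.10,
    ("female", ">=40.5", "rich"): 0.04,
    ("female", ">=40.5", "poor"): 0.18,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9   # row probabilities must sum to 1

# Marginal: P(Gender = male), by summing the four rows with Gender = male.
p_male = sum(p for (g, h, w), p in joint.items() if g == "male")

# Joint subset: P(Wealth = rich and Gender = female), by summing the two
# rows that satisfy both conditions.
p_rich_female = sum(p for (g, h, w), p in joint.items()
                    if w == "rich" and g == "female")

print(p_male, p_rich_female)
```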
2 Estimating Probabilities
Let us begin our discussion of how to estimate probabilities with a simple exam-
ple, and explore two intuitive algorithms. It will turn out that these two intuitive
algorithms illustrate the two primary approaches used in nearly all probabilistic
machine learning algorithms.
In this simple example you have a coin, represented by the random variable
X. If you flip this coin, it may turn up heads (indicated by X = 1) or tails (X = 0).
The learning task is to estimate the probability that it will turn up heads; that is, to
estimate P(X = 1). We will use θ to refer to the true (but unknown) probability of
heads (e.g., P(X = 1) = θ), and use θ̂ to refer to our learned estimate of this true
θ. You gather training data by flipping the coin n times, and observe that it turns
up heads α1 times, and tails α0 times. Of course n = α1 + α0 .
What is the most intuitive approach to estimating θ = P(X = 1) from this training data? Most people immediately answer that we should estimate the probability by the fraction of flips that result in heads:

θ̂ = α1 / (α1 + α0)
We can express prior assumptions or knowledge about the coin by adding in any number of imaginary coin flips resulting in heads or tails. We can use this option of introducing γ1 imaginary heads, and γ0 imaginary tails, to express our prior assumptions:
θ̂ = (α1 + γ1) / ((α1 + γ1) + (α0 + γ0))
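As a minimal sketch, both estimates can be computed directly from the observed and imaginary counts; the particular counts below are invented for illustration.

```python
def theta_mle(a1, a0):
    """Fraction of observed flips that came up heads."""
    return a1 / (a1 + a0)

def theta_smoothed(a1, a0, g1, g0):
    """Estimate after adding g1 imaginary heads and g0 imaginary tails."""
    return (a1 + g1) / ((a1 + g1) + (a0 + g0))

# Hypothetical data: 9 heads and 21 tails in 30 flips, plus a prior
# expressed as 18 imaginary heads and 42 imaginary tails.
print(theta_mle(9, 21))               # 0.30
print(theta_smoothed(9, 21, 18, 42))  # 0.30
```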
Figure 1: MLE and MAP estimates of θ as the number of coin flips grows. Data was
generated by a random number generator that output a value of 1 with probability θ = 0.3,
and a value of 0 with probability of (1 − θ) = 0.7. Each plot shows the two estimates of θ
as the number of observed coin flips grows. Plots on the left correspond to values of γ1 and
γ0 that reflect the correct prior assumption about the value of θ, plots on the right reflect
the incorrect prior assumption that θ is most probably 0.4. Plots in the top row reflect
lower confidence in the prior assumption, by including only 60 = γ1 + γ0 imaginary data
points, whereas bottom plots assume 120. Note as the size of the data grows, the MLE
and MAP estimates converge toward each other, and toward the correct estimate for θ.
maximizes the probability of the observed data. In fact we can prove (and will,
below) that Algorithm 1 outputs an estimate of θ that makes the observed data at
least as probable as any other possible estimate of θ. Algorithm 2 follows a dif-
ferent principle called Maximum a Posteriori (MAP) estimation, in which we seek
the estimate of θ that is itself most probable, given the observed data, plus back-
ground assumptions about its value. Thus, the difference between these two prin-
ciples is that Algorithm 2 assumes background knowledge is available, whereas
Algorithm 1 does not. Both principles have been widely used to derive and to
justify a vast range of machine learning algorithms, from Bayesian networks, to
linear regression, to neural network learning. Our coin flip example represents
just one of many such learning problems.
The experimental behavior of these two algorithms is shown in Figure 1. Here
the learning task is to estimate the unknown value of θ = P(X = 1) for a boolean-
valued random variable X, based on a sample of n values of X drawn indepen-
dently (e.g., n independent flips of a coin with probability θ of heads). In this
figure, the true value of θ is 0.3, and the same sequence of training examples is
used in each plot. Consider first the plot in the upper left. The blue line shows
the estimates of θ produced by Algorithm 1 (MLE) as the number n of training
examples grows. The red line shows the estimates produced by Algorithm 2, us-
ing the same training examples and using priors γ0 = 42 and γ1 = 18. This prior
assumption aligns with the correct value of θ (i.e., [γ1 /(γ1 + γ0 )] = 0.3). Note
that as the number of training example coin flips grows, both algorithms converge
toward the correct estimate of θ, though Algorithm 2 provides much better esti-
mates than Algorithm 1 when little data is available. The bottom left plot shows
the estimates if Algorithm 2 uses even more confident priors, captured by twice as
many imaginary examples (γ0 = 84 and γ1 = 36). The two plots on the right side
of the figure show the estimates produced when Algorithm 2 (MAP) uses incor-
rect priors (where [γ1 /(γ1 + γ0 )] = 0.4). The difference between the top right and
bottom right plots is again only a difference in the number of imaginary examples,
reflecting the difference in confidence that θ should be close to 0.4.
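A rough simulation of this experiment is sketched below. The seed, sample size, and printed summary are arbitrary choices, not the setup used to produce Figure 1, but the qualitative behavior (both estimates converging to 0.3, with the MAP estimate far more stable for small n) should match.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.3
g1, g0 = 18, 42            # imaginary heads / tails (prior centered at 0.3)

flips = rng.random(1000) < theta_true   # boolean array of coin flips
mle, mapest = [], []
heads = 0
for n, flip in enumerate(flips, start=1):
    heads += int(flip)
    mle.append(heads / n)
    mapest.append((heads + g1) / (n + g1 + g0))

# The MAP estimate is far more stable for small n because of the 60
# imaginary flips; both sequences approach 0.3 as n grows.
print(mle[9], mapest[9], mle[-1], mapest[-1])
```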
The intuition underlying this principle is simple: we are more likely to observe
data D if we are in a world where the appearance of this data is highly probable.
Therefore, we should estimate θ by assigning it whatever value maximizes the
probability of having observed D.
Beginning with this principle for choosing among possible estimates of θ, it
is possible to mathematically derive a formula for the value of θ that provably
maximizes P(D|θ). Many machine learning algorithms are defined so that they
provably learn a collection of parameter values that follow this maximum likeli-
hood principle. Below we derive Algorithm 1 for our above coin flip example,
beginning with the maximum likelihood principle.
To precisely define our coin flipping example, let X be a random variable which can take on either value 1 or 0, and let θ = P(X = 1) refer to the true, but possibly unknown, probability that a random draw of X will take on the value 1.² Assume we flip the coin X a number of times to produce training data D, in which X turns up heads (X = 1) a total of α1 times and tails (X = 0) a total of α0 times.

² A random variable defined in this way is called a Bernoulli random variable, and the probability distribution it follows, defined by θ, is called a Bernoulli distribution.
The probability of observing this particular sequence of α1 heads and α0 tails, given θ, is

P(D = ⟨α1, α0⟩|θ) = θ^{α1} (1 − θ)^{α0}    (2)

The quantity P(D|θ) is often called the data likelihood, or the data likelihood function, because it expresses the probability of the observed data D as a function of θ. This likelihood function is often written L(θ) = P(D|θ).
Our final step in this derivation is to determine the value of θ that maximizes
the data likelihood function P(D = hα1 , α0 i|θ). Notice that maximizing P(D|θ)
with respect to θ is equivalent to maximizing its logarithm, ln P(D|θ) with respect
to θ, because ln(x) increases monotonically with x:
θ̂_MLE = arg max_θ P(D|θ) = arg max_θ ln P(D|θ)

Writing ℓ(θ) ≡ ln P(D|θ) = α1 ln θ + α0 ln(1 − θ), we have

∂ℓ(θ)/∂θ = ∂[α1 ln θ + α0 ln(1 − θ)] / ∂θ
         = α1 ∂ ln θ / ∂θ + α0 ∂ ln(1 − θ) / ∂θ
         = α1 ∂ ln θ / ∂θ + α0 [∂ ln(1 − θ) / ∂(1 − θ)] · [∂(1 − θ) / ∂θ]
         = α1 (1/θ) + α0 (1/(1 − θ)) · (−1)    (3)

where the last step follows from the equality ∂ ln x/∂x = 1/x, and where the next to last step follows from the chain rule ∂f(x)/∂x = [∂f(x)/∂g(x)] · [∂g(x)/∂x].
Finally, to calculate the value of θ that maximizes `(θ), we set the derivative
in equation (3) to zero, and solve for θ.
0 = α1 (1/θ) − α0 (1/(1 − θ))

α0 (1/(1 − θ)) = α1 (1/θ)

α0 θ = α1 (1 − θ)

(α1 + α0) θ = α1

θ = α1 / (α1 + α0)    (4)
Thus, we have derived in equation (4) the intuitive Algorithm 1 for estimating
θ, starting from the principle that we want to choose the value of θ that maximizes
P(D|θ).
θ̂_MLE = arg max_θ P(D|θ) = arg max_θ ln P(D|θ) = α1 / (α1 + α0)    (5)
This same maximum likelihood principle is used as the basis for deriving many
machine learning algorithms for more complex problems where the solution is not
so intuitively obvious.
The Maximum a Posteriori (MAP) principle, when applied to the coin flipping problem discussed above, yields Algorithm 2. Using Bayes rule, we can rewrite the MAP principle as:
θ̂_MAP = arg max_θ P(θ|D) = arg max_θ [P(D|θ) P(θ)] / P(D)
and given that P(D) does not depend on θ, we can simplify this by ignoring the denominator:

θ̂_MAP = arg max_θ P(D|θ) P(θ)    (6)
Comparing this to the MLE principle described in equation (1), we see that whereas
the MLE principle is to choose θ to maximize P(D|θ), the MAP principle instead
maximizes P(D|θ)P(θ). The only difference is the extra P(θ).
To produce a MAP estimate for θ we must specify a prior distribution P(θ)
that summarizes our a priori assumptions about the value of θ. In the case where
data is generated by multiple i.i.d. draws of a Bernoulli random variable, as in our
coin flip example, the most common form of prior is a Beta distribution:
P(θ) = Beta(β0, β1) = θ^{β1 − 1} (1 − θ)^{β0 − 1} / B(β0, β1)    (7)
Here β0 and β1 are parameters whose values we must specify in advance to define
a specific P(θ). As we shall see, choosing values for β0 and β1 corresponds to
choosing the number of imaginary examples γ0 and γ1 in the above Algorithm
2. The denominator B(β0 , β1 ) is a normalization term defined by the function B,
which assures the probability integrates to one, but which is independent of θ.
As defined in Eq. (6), the MAP estimate involves choosing the value of θ that
maximizes P(D|θ)P(θ). Recall we already have an expression for P(D|θ) in Eq.
(2). Combining this with the above expression for P(θ) we have:
θ̂_MAP = arg max_θ P(D|θ) P(θ)
       = arg max_θ θ^{α1} (1 − θ)^{α0} · θ^{β1 − 1} (1 − θ)^{β0 − 1} / B(β0, β1)
       = arg max_θ θ^{α1 + β1 − 1} (1 − θ)^{α0 + β0 − 1} / B(β0, β1)
       = arg max_θ θ^{α1 + β1 − 1} (1 − θ)^{α0 + β0 − 1}    (8)
where the final line follows from the previous line because B(β0 , β1 ) is indepen-
dent of θ.
How can we solve for the value of θ that maximizes the expression in Eq. (8)?
Fortunately, we have already answered this question! Notice that the quantity we
seek to maximize in Eq. (8) can be made identical to the likelihood function in Eq.
(2) if we substitute (α1 + β1 − 1) for α1 in Eq. (2), and substitute (α0 + β0 − 1)
for α0 . We can therefore reuse the derivation of θ̂MLE beginning from Eq. (2) and
ending with Eq. (4), simply by carrying through this substitution. Applying this
same substitution to Eq. (4) implies the solution to Eq. (8) is therefore
θ̂_MAP = arg max_θ P(D|θ) P(θ) = (α1 + β1 − 1) / [(α1 + β1 − 1) + (α0 + β0 − 1)]    (9)
Thus, we have derived in Eq. (9) the intuitive Algorithm 2 for estimating θ,
starting from the principle that we want to choose the value of θ that maximizes
P(θ|D). The number γ1 of imaginary “heads” in Algorithm 2 is equal to β1 − 1, and the number γ0 of imaginary “tails” is equal to β0 − 1. This same maximum
a posteriori probability principle is used as the basis for deriving many machine
learning algorithms for more complex problems where the solution is not so intu-
itively obvious as it is in our coin flipping example.
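As a quick sanity check of Eq. (9), the sketch below compares the closed-form MAP estimate with a brute-force grid search over θ of the unnormalized posterior θ^{α1+β1−1}(1 − θ)^{α0+β0−1}; the counts and prior parameters are made up.

```python
import numpy as np

a1, a0 = 9, 21        # hypothetical observed heads / tails
b1, b0 = 19, 43       # hypothetical Beta prior parameters

# Closed-form MAP estimate from Eq. (9).
theta_map = (a1 + b1 - 1) / ((a1 + b1 - 1) + (a0 + b0 - 1))

# Brute-force check: maximize the (log of the) unnormalized posterior on a grid.
grid = np.linspace(1e-6, 1 - 1e-6, 100001)
log_post = (a1 + b1 - 1) * np.log(grid) + (a0 + b0 - 1) * np.log(1 - grid)
theta_grid = grid[np.argmax(log_post)]

print(theta_map, theta_grid)   # the two values agree to grid precision
```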
The Beta(β0, β1) distribution defined in Eq. (7) is called the conjugate prior for the binomial likelihood function θ^{α1}(1 − θ)^{α0}, because the posterior distribution P(θ|D) ∝ P(D|θ)P(θ) is also a Beta distribution. More generally, any P(θ) is called the conjugate prior for a likelihood function L(θ) = P(D|θ) if the posterior P(θ|D) is of the same form as P(θ).
When the distribution has n different θi parameters, its prior will have to specify the probability for each joint assignment of these n parameters. The Dirichlet distribution is a generalization of the Beta distribution, and has the form
P(θ1, . . . , θn) = θ1^{β1 − 1} θ2^{β2 − 1} · · · θn^{βn − 1} / B(β1, . . . , βn)
where the denominator is again a normalizing function to assure that the total
probability mass is 1, and where this normalizing function B(β1 , . . . , βn ) is inde-
pendent of the vector of parameters θ = ⟨θ1 . . . θn⟩ and therefore can be ignored when deriving their MAP estimates.
The MAP estimate for each θi for a Categorical distribution is given by
θ̂i_MAP = (αi + βi − 1) / [(α1 + β1 − 1) + · · · + (αn + βn − 1)]    (11)
where α j indicates the number of times the value X = j was observed in the
data, and where the β j s are the parameters of the Dirichlet prior which reflects
our prior knowledge or assumptions. Here again, we can view the MAP estimate
as combining the observed data given by the α j values with β j − 1 additional
imaginary observations for X = j. Comparing this formula to the earlier formula
giving the MAP estimate for a Bernoulli random variable (eq. 9), it is easy to see
that this is a direct generalization of that simpler case, and that it again follows the
intuition of our earlier Algorithm 2.
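A minimal sketch of Eq. (11) for a discrete variable with several values; the observed counts and Dirichlet parameters are invented for illustration.

```python
def categorical_map(counts, betas):
    """MAP estimates of the Categorical parameters theta_i under a
    Dirichlet prior, following Eq. (11): each value j effectively
    receives beta_j - 1 additional imaginary observations."""
    adjusted = [a + b - 1 for a, b in zip(counts, betas)]
    total = sum(adjusted)
    return [x / total for x in adjusted]

# Hypothetical 3-valued variable: observed counts, plus a prior that
# expects the values to be roughly equally likely.
counts = [12, 7, 31]
betas = [11, 11, 11]        # 10 imaginary observations per value
print(categorical_map(counts, betas))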
θ̂_MAP = arg max_θ P(data|θ) P(θ) = (α1 + γ1) / [(α1 + γ1) + (α0 + γ0)]
EXERCISES
1. In the MAP estimation of θ for our Bernoulli random variable X in this chapter, we used a Beta(β0, β1) prior probability distribution to capture our prior beliefs about the probability of different values of θ, before seeing the observed data.
• View the plot you created above to visually determine the approximate Maximum a Posteriori probability estimate θMAP. What is it? What is the exact value of the MAP estimate? What is the exact value of the Maximum Likelihood Estimate θMLE?
• How do you think your plot of the posterior probability would change
if you altered the Beta prior distribution to use γ0 = 420, γ1 = 180?
(hint: it’s ok to actually plot this). What if you changed the Beta prior
to γ0 = 32, γ1 = 28?
5 Acknowledgements
I very much appreciate receiving helpful comments on earlier drafts of this chapter
from Ondřej Filip, Ayush Garg, Akshay Mishra and Tao Chen. Andrew Moore
provided the data summary shown in Table 1.
REFERENCES
Mitchell, T (1997). Machine Learning, McGraw Hill.
Wasserman, L. (2004). All of Statistics, Springer-Verlag.
CHAPTER 3
Machine Learning
Copyright © 2015. Tom M. Mitchell. All rights reserved.
*DRAFT OF September 23, 2017*
This is a rough draft chapter intended for inclusion in the upcoming second edi-
tion of the textbook Machine Learning, T.M. Mitchell, McGraw Hill. You are
welcome to use this for educational purposes, but do not duplicate or repost it
on the internet. For online copies of this and other materials related to this book,
visit the web site www.cs.cmu.edu/∼tom/mlbook.html.
Please send suggestions for improvements, or suggested exercises, to
[email protected].
P(Y = yi|X = xk) = P(X = xk|Y = yi) P(Y = yi) / ∑_j P(X = xk|Y = yj) P(Y = yj)
where ym denotes the mth possible value for Y , xk denotes the kth possible vector
value for X, and where the summation in the denominator is over all legal values
of the random variable Y .
One way to learn P(Y |X) is to use the training data to estimate P(X|Y ) and
P(Y ). We can then use these estimates, together with Bayes rule above, to deter-
mine P(Y |X = xk ) for any new instance xk .
θij ≡ P(X = xi|Y = yj)

where the index i takes on 2^n possible values (one for each of the possible vector values of X), and j takes on 2 possible values. Therefore, we will need to estimate approximately 2^{n+1} parameters. To calculate the exact number of required parameters, note for any fixed j, the sum over i of θij must be one. Therefore, for any particular value yj, and the 2^n possible values of xi, we need compute only 2^n − 1 independent parameters. Given the two possible values for Y, we must estimate a total of 2(2^n − 1) such θij parameters. Unfortunately, this corresponds to two
1 Why? See Chapter 5 of edition 1 of Machine Learning.
distinct parameters for each of the distinct instances in the instance space for X.
Worse yet, to obtain reliable estimates of each of these parameters, we will need to
observe each of these distinct instances multiple times! This is clearly unrealistic
in most practical learning domains. For example, if X is a vector containing 30
boolean features, then we will need to estimate more than 2 billion parameters.
P(X|Y ) = P(X1 , X2 |Y )
= P(X1 |X2 ,Y )P(X2 |Y )
= P(X1 |Y )P(X2 |Y )
where the second line follows from a general property of probabilities, and the third line follows directly from our above definition of conditional independence.
More generally, when X contains n attributes which satisfy the conditional inde-
pendence assumption, we have
P(X1 . . . Xn|Y) = ∏_{i=1}^{n} P(Xi|Y)    (1)
Notice that when Y and the Xi are boolean variables, we need only 2n parameters
to define P(Xi = xik |Y = y j ) for the necessary i, j, k. This is a dramatic reduction
compared to the 2(2^n − 1) parameters needed to characterize P(X|Y) if we make no conditional independence assumption.
Let us now derive the Naive Bayes algorithm, assuming in general that Y is
any discrete-valued variable, and the attributes X1 . . . Xn are any discrete or real-
valued attributes. Our goal is to train a classifier that will output the probability
distribution over possible values of Y , for each new instance X that we ask it to
classify. The expression for the probability that Y will take on its kth possible
value, according to Bayes rule, is
P(Y = yk|X1 . . . Xn) = P(Y = yk) P(X1 . . . Xn|Y = yk) / ∑_j P(Y = yj) P(X1 . . . Xn|Y = yj)
where the sum is taken over all possible values y j of Y . Now, assuming the Xi are
conditionally independent given Y , we can use equation (1) to rewrite this as
P(Y = yk|X1 . . . Xn) = P(Y = yk) ∏_i P(Xi|Y = yk) / ∑_j P(Y = yj) ∏_i P(Xi|Y = yj)    (2)
Equation (2) is the fundamental equation for the Naive Bayes classifier. Given a
new instance X new = hX1 . . . Xn i, this equation shows how to calculate the prob-
ability that Y will take on any given value, given the observed attribute values
of X new and given the distributions P(Y ) and P(Xi |Y ) estimated from the training
data. If we are interested only in the most probable value of Y , then we have the
Naive Bayes classification rule:
Y ← arg max_{yk}  P(Y = yk) ∏_i P(Xi|Y = yk) / ∑_j P(Y = yj) ∏_i P(Xi|Y = yj)
which simplifies to the following (because the denominator does not depend on
yk ).
Y ← arg max_{yk}  P(Y = yk) ∏_i P(Xi|Y = yk)    (3)
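The following sketch implements the classification rule of Eq. (3) for discrete attributes, using simple relative frequencies for P(Y) and P(Xi|Y); the tiny data set is invented, and in practice one would smooth these estimates using the MAP approach of the previous chapter.

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate P(Y = yk) and P(Xi = v | Y = yk) by relative frequencies."""
    n = len(X[0])
    prior = {yk: c / len(y) for yk, c in Counter(y).items()}
    cond = defaultdict(lambda: defaultdict(Counter))   # cond[yk][i][v] = count
    for xs, yk in zip(X, y):
        for i, v in enumerate(xs):
            cond[yk][i][v] += 1
    class_counts = Counter(y)
    return prior, cond, class_counts, n

def predict(x, prior, cond, class_counts, n):
    """Return arg max_yk P(Y = yk) * prod_i P(Xi | Y = yk), as in Eq. (3)."""
    best, best_score = None, -1.0
    for yk, p in prior.items():
        score = p
        for i in range(n):
            score *= cond[yk][i][x[i]] / class_counts[yk]
        if score > best_score:
            best, best_score = yk, score
    return best

# Hypothetical boolean data: two attributes, boolean label.
X = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1), (0, 0)]
y = [1, 1, 0, 0, 1, 0]
model = train_naive_bayes(X, y)
print(predict((1, 0), *model))
```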
for each attribute Xi and each possible value yk of Y. Note there are 2nK of these parameters (the means µik and the standard deviations σik), all of which must be estimated independently.
Of course we must also estimate the priors on Y as well
πk = P(Y = yk ) (12)
The above model summarizes a Gaussian Naive Bayes classifier, which as-
sumes that the data X is generated by a mixture of class-conditional (i.e., depen-
dent on the value of the class variable Y ) Gaussians. Furthermore, the Naive Bayes
assumption introduces the additional constraint that the attribute values Xi are in-
dependent of one another within each of these mixture components. In particular
problem settings where we have additional information, we might introduce addi-
tional assumptions to further restrict the number of parameters or the complexity
of estimating them. For example, if we have reason to believe that noise in the
observed Xi comes from a common source, then we might further assume that all
of the σik are identical, regardless of the attribute i or class k (see the homework
exercise on this issue).
Again, we can use either maximum likelihood estimates (MLE) or maximum
a posteriori (MAP) estimates for these parameters. The maximum likelihood esti-
mator for µik is
µ̂ik = [1 / ∑_j δ(Y^j = yk)] ∑_j Xi^j δ(Y^j = yk)    (13)
where the superscript j refers to the jth training example, and where δ(Y = yk ) is
1 if Y = yk and 0 otherwise. Note the role of δ here is to select only those training
examples for which Y = yk .
The maximum likelihood estimator for σ2ik is
σ̂ik² = [1 / ∑_j δ(Y^j = yk)] ∑_j (Xi^j − µ̂ik)² δ(Y^j = yk)    (14)
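The sketch below computes the estimates of Eqs. (12)-(14) with NumPy on a made-up data set. Note Eq. (14) is the maximum likelihood variance, i.e., it divides by the number of examples in the class rather than that number minus one.

```python
import numpy as np

def gnb_fit(X, y):
    """Per-class, per-attribute means and ML variances, Eqs. (13)-(14),
    plus the class priors pi_k of Eq. (12)."""
    classes = np.unique(y)
    mu = np.array([X[y == k].mean(axis=0) for k in classes])
    var = np.array([X[y == k].var(axis=0) for k in classes])   # ML variance
    pi = np.array([(y == k).mean() for k in classes])
    return classes, mu, var, pi

# Hypothetical data: 6 examples, 2 continuous attributes, boolean labels.
X = np.array([[1.0, 2.1], [0.9, 1.8], [1.2, 2.4],
              [3.1, 0.2], [2.8, 0.1], [3.3, 0.4]])
y = np.array([0, 0, 0, 1, 1, 1])
print(gnb_fit(X, y))
```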
3 Logistic Regression
Logistic Regression is an approach to learning functions of the form f : X → Y , or
P(Y |X) in the case where Y is discrete-valued, and X = hX1 . . . Xn i is any vector
containing discrete or continuous variables. In this section we will primarily con-
sider the case where Y is a boolean variable, in order to simplify notation. In the
final subsection we extend our treatment to the case where Y takes on any finite
number of discrete values.
Logistic Regression assumes a parametric form for the distribution P(Y |X),
then directly estimates its parameters from the training data. The parametric
model assumed by Logistic Regression in the case where Y is boolean is:
P(Y = 1|X) = 1 / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))    (16)
and
P(Y = 0|X) = exp(w0 + ∑_{i=1}^{n} wi Xi) / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))    (17)
Notice that equation (17) follows directly from equation (16), because the sum of
these two probabilities must equal 1.
One highly convenient property of this form for P(Y |X) is that it leads to a
simple linear expression for classification. To classify any given X we generally
want to assign the value yk that maximizes P(Y = yk |X). Put another way, we
assign the label Y = 0 if the following condition holds:
1 < P(Y = 0|X) / P(Y = 1|X)
Figure 1: Form of the logistic function. In Logistic Regression, P(Y |X) is as-
sumed to follow this form.
and taking the natural log of both sides we have a linear classification rule that
assigns label Y = 0 if X satisfies
0 < w0 + ∑_{i=1}^{n} wi Xi    (18)
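A small sketch of Eqs. (16)-(18), keeping the chapter's sign convention in which P(Y = 1|X) = 1/(1 + exp(w0 + ∑ wi Xi)) and label Y = 0 is assigned exactly when w0 + ∑ wi Xi > 0; the weights and input are placeholders.

```python
import math

def p_y1(x, w0, w):
    """Eq. (16): P(Y = 1 | X) under the chapter's sign convention."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

def classify(x, w0, w):
    """Eq. (18): assign Y = 0 exactly when w0 + sum_i wi*xi > 0."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 0 if z > 0 else 1

w0, w = -1.0, [2.0, -0.5]      # hypothetical weights
x = [0.3, 1.2]
print(p_y1(x, w0, w), classify(x, w0, w))
```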
Note here we are assuming the standard deviations σi vary from attribute to at-
tribute, but do not depend on Y .
We now derive the parametric form of P(Y |X) that follows from this set of
GNB assumptions. In general, Bayes rule allows us to write
P(Y = 1|X) = P(Y = 1) P(X|Y = 1) / [P(Y = 1) P(X|Y = 1) + P(Y = 0) P(X|Y = 0)]
Dividing both the numerator and denominator by the numerator yields:
P(Y = 1|X) = 1 / (1 + [P(Y = 0) P(X|Y = 0)] / [P(Y = 1) P(X|Y = 1)])

or equivalently

P(Y = 1|X) = 1 / (1 + exp(ln [P(Y = 0) P(X|Y = 0)] / [P(Y = 1) P(X|Y = 1)]))
Applying our conditional independence assumption, and writing π for P(Y = 1), this becomes

P(Y = 1|X) = 1 / (1 + exp( ln[(1 − π)/π] + ∑_i ln [P(Xi|Y = 0) / P(Xi|Y = 1)] ))    (19)

Note the final step expresses P(Y = 0) and P(Y = 1) in terms of the binomial parameter π.
Now consider just the summation in the denominator of equation (19). Given
our assumption that P(Xi |Y = yk ) is Gaussian, we can expand this term as follows:
∑_i ln [P(Xi|Y = 0) / P(Xi|Y = 1)]
  = ∑_i ln [ (1/√(2πσi²)) exp(−(Xi − µi0)²/(2σi²)) / ( (1/√(2πσi²)) exp(−(Xi − µi1)²/(2σi²)) ) ]
  = ∑_i ln exp( [(Xi − µi1)² − (Xi − µi0)²] / (2σi²) )
  = ∑_i [(Xi − µi1)² − (Xi − µi0)²] / (2σi²)
  = ∑_i [(Xi² − 2Xi µi1 + µi1²) − (Xi² − 2Xi µi0 + µi0²)] / (2σi²)
  = ∑_i [2Xi (µi0 − µi1) + µi1² − µi0²] / (2σi²)
  = ∑_i ( [(µi0 − µi1)/σi²] Xi + (µi1² − µi0²)/(2σi²) )    (20)
Note this expression is a linear weighted sum of the Xi ’s. Substituting expression
(20) back into equation (19), we have
P(Y = 1|X) = 1 / (1 + exp( ln[(1 − π)/π] + ∑_i ( [(µi0 − µi1)/σi²] Xi + (µi1² − µi0²)/(2σi²) ) ))    (21)
Or equivalently,
P(Y = 1|X) = 1 / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))    (22)

where the weights w1 . . . wn are given by

wi = (µi0 − µi1) / σi²

and where

w0 = ln[(1 − π)/π] + ∑_i (µi1² − µi0²)/(2σi²)

Also we have

P(Y = 0|X) = 1 − P(Y = 1|X) = exp(w0 + ∑_{i=1}^{n} wi Xi) / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))    (23)
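The sketch below converts a set of Gaussian Naive Bayes parameters into the logistic-form weights of Eqs. (22)-(23), using the expressions for wi and w0 just derived. The parameter values are placeholders, and the per-attribute variances are shared across classes, matching the assumption that σi does not depend on Y.

```python
import numpy as np

def gnb_to_logistic(mu0, mu1, sigma2, pi):
    """Weights w0, w of Eq. (22) from GNB parameters.
    mu0, mu1: per-attribute means for Y=0 and Y=1.
    sigma2:   per-attribute variances, shared across classes.
    pi:       P(Y = 1)."""
    w = (mu0 - mu1) / sigma2
    w0 = np.log((1 - pi) / pi) + np.sum((mu1**2 - mu0**2) / (2 * sigma2))
    return w0, w

def p_y1(x, w0, w):
    return 1.0 / (1.0 + np.exp(w0 + np.dot(w, x)))   # Eq. (22)

# Hypothetical GNB parameters for two attributes.
mu0 = np.array([3.0, 0.2])
mu1 = np.array([1.0, 2.0])
sigma2 = np.array([1.0, 0.5])
w0, w = gnb_to_logistic(mu0, mu1, sigma2, pi=0.5)
print(w0, w, p_y1(np.array([1.1, 1.9]), w0, w))   # high P(Y=1) near mu1
```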
value of X in the lth training example. The expression to the right of the arg max
is the conditional data likelihood. Here we include W in the conditional, to em-
phasize that the expression is a function of the W we are attempting to maximize.
Equivalently, we can work with the log of the conditional likelihood:

W ← arg max_W ∑_l ln P(Y^l|X^l, W)
This conditional data log likelihood, which we will denote l(W), can be written as

l(W) = ∑_l [ Y^l ln P(Y^l = 1|X^l, W) + (1 − Y^l) ln P(Y^l = 0|X^l, W) ]
Note here we are utilizing the fact that Y can take only values 0 or 1, so only one
of the two terms in the expression will be non-zero for any given Y l .
To keep our derivation consistent with common usage, we will in this section
flip the assignment of the boolean variable Y so that we assign
P(Y = 0|X) = 1 / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))    (24)

and

P(Y = 1|X) = exp(w0 + ∑_{i=1}^{n} wi Xi) / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))    (25)
In this case, we can reexpress the log of the conditional likelihood as:

l(W) = ∑_l [ Y^l (w0 + ∑_{i=1}^{n} wi Xi^l) − ln(1 + exp(w0 + ∑_{i=1}^{n} wi Xi^l)) ]

where Xi^l denotes the value of Xi for the lth training example. Note the superscript l is not related to the log likelihood function l(W).
Unfortunately, there is no closed form solution to maximizing l(W ) with re-
spect to W . Therefore, one common approach is to use gradient ascent, in which
we work with the gradient, which is the vector of partial derivatives. The ith
component of the vector gradient has the form
∂l(W)/∂wi = ∑_l Xi^l (Y^l − P̂(Y^l = 1|X^l, W))
where P̂(Y l |X l ,W ) is the Logistic Regression prediction using equations (24) and
(25) and the weights W . To accommodate weight w0 , we assume an imaginary
X0 = 1 for all l. This expression for the derivative has an intuitive interpretation:
the term inside the parentheses is simply the prediction error; that is, the difference
between the observed Y l and its predicted probability! Note if Y l = 1 then we wish
for P̂(Y l = 1|X l ,W ) to be 1, whereas if Y l = 0 then we prefer that P̂(Y l = 1|X l ,W )
be 0 (which makes P̂(Y l = 0|X l ,W ) equal to 1). This error term is multiplied by
the value of Xil , which accounts for the magnitude of the wi Xil term in making this
prediction.
Given this formula for the derivative of each wi , we can use standard gradient
ascent to optimize the weights W . Beginning with initial weights of zero, we
repeatedly update the weights in the direction of the gradient, on each iteration
changing every weight wi according to

wi ← wi + η ∑_l Xi^l (Y^l − P̂(Y^l = 1|X^l, W))
where η is a small constant (e.g., 0.01) which determines the step size. Because
the conditional log likelihood l(W ) is a concave function in W , this gradient ascent
procedure will converge to a global maximum. Gradient ascent is described in
greater detail, for example, in Chapter 4 of Mitchell (1997). In many cases where
computational efficiency is important it is common to use a variant of gradient
ascent called conjugate gradient ascent, which often converges more quickly.
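A compact sketch of this training procedure, using the convention of Eqs. (24)-(25) (so P(Y = 1|X) = exp(z)/(1 + exp(z)) with z = w0 + ∑ wi Xi) and the gradient ascent update above; the data, step size, and iteration count are arbitrary.

```python
import numpy as np

def train_logistic(X, y, eta=0.01, iters=5000):
    """Batch gradient ascent on the conditional log likelihood.
    X: (m, n) feature matrix, y: (m,) array of 0/1 labels."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])        # imaginary X0 = 1 for w0
    w = np.zeros(n + 1)
    for _ in range(iters):
        p1 = 1.0 / (1.0 + np.exp(-(Xb @ w)))    # P(Y=1|X,W), Eq. (25)
        w += eta * Xb.T @ (y - p1)              # wi += eta * sum_l Xi^l (Y^l - p1^l)
        # (an L2 penalty would additionally subtract eta * lam * w here)
    return w

# Hypothetical linearly separable data.
X = np.array([[0.5, 1.0], [1.0, 1.5], [2.5, 0.2], [3.0, 0.5]])
y = np.array([1, 1, 0, 0])
print(train_logistic(X, y))
```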
One common approach to reducing overfitting is to add a regularization penalty to this objective, choosing weights according to

W ← arg max_W ∑_l ln P(Y^l|X^l, W) − (λ/2) ||W||²

This penalized log likelihood can be interpreted as a MAP estimate of W, maximizing

∑_l ln P(Y^l|X^l, W) + ln P(W)

and if P(W) is a zero mean Gaussian distribution, then ln P(W) yields a term proportional to ||W||².
Given this penalized log likelihood function, it is easy to rederive the gradient ascent rule. The derivative of this penalized log likelihood function is similar to
Here w ji denotes the weight associated with the jth class Y = y j and with input
Xi . It is easy to see that our earlier expressions for the case where Y is boolean
(equations (16) and (17)) are a special case of the above expressions. Note also
that the form of the expression for P(Y = yK |X) assures that [∑K k=1 P(Y = yk |X)] =
1.
The primary difference between these expressions and those for boolean Y is
that when Y takes on K possible values, we construct K −1 different linear expres-
sions to capture the distributions for the different values of Y . The distribution for
the final, Kth, value of Y is simply one minus the probabilities of the first K − 1
values.
In this case, the gradient ascent rule with regularization becomes:

w_ji ← w_ji + η ∑_l Xi^l ( δ(Y^l = y_j) − P̂(Y^l = y_j|X^l, W) ) − η λ w_ji    (29)
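A sketch of the update in Eq. (29). The exact form of P̂(Y = yj|X, W) is not reproduced in the text above, so the softmax-style expression used below (K − 1 linear functions, with the Kth class receiving the remaining probability mass, as the chapter describes) should be treated as an assumption; the data and hyperparameters are placeholders.

```python
import numpy as np

def probs(Xb, W):
    """P(Y = y_j | X) for j = 1..K, built from K-1 weight vectors in W,
    with the K-th class given one minus the others (assumed softmax form)."""
    scores = np.exp(Xb @ W.T)                        # (m, K-1)
    denom = 1.0 + scores.sum(axis=1, keepdims=True)
    return np.hstack([scores / denom, 1.0 / denom])  # (m, K)

def update(Xb, y_onehot, W, eta=0.05, lam=0.1):
    """One regularized gradient step per Eq. (29) for the K-1 weight vectors."""
    P = probs(Xb, W)
    err = y_onehot[:, :-1] - P[:, :-1]               # delta(Y = yj) - P_hat
    return W + eta * err.T @ Xb - eta * lam * W

# Hypothetical 3-class problem with 2 attributes (plus X0 = 1 for the intercept).
Xb = np.hstack([np.ones((4, 1)), np.array([[0.1, 1.2], [1.5, 0.3],
                                           [2.0, 2.1], [0.4, 0.2]])])
y_onehot = np.eye(3)[[0, 1, 2, 0]]
W = np.zeros((2, 3))        # K-1 = 2 rows of weights (w_j0, w_j1, w_j2)
for _ in range(100):
    W = update(Xb, y_onehot, W)
print(W)
```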
• When the GNB modeling assumptions do not hold, Logistic Regression and
GNB typically learn different classifier functions. In this case, the asymp-
totic (as the number of training examples approach infinity) classification
accuracy for Logistic Regression is often better than the asymptotic accu-
racy of GNB. Although Logistic Regression is consistent with the Naive
Bayes assumption that the input features Xi are conditionally independent
given Y , it is not rigidly tied to this assumption as is Naive Bayes. Given
data that disobeys this assumption, the conditional likelihood maximization
algorithm for Logistic Regression will adjust its parameters to maximize the
fit to (the conditional likelihood of) the data, even if the resulting parameters
are inconsistent with the Naive Bayes parameter estimates.
• GNB and Logistic Regression converge toward their asymptotic accuracies
at different rates. As Ng & Jordan (2002) show, GNB parameter estimates
converge toward their asymptotic values in order log n examples, where n
is the dimension of X. In contrast, Logistic Regression parameter estimates
converge more slowly, requiring order n examples. The authors also show
that in several data sets Logistic Regression outperforms GNB when many
training examples are available, but GNB outperforms Logistic Regression
when training data is scarce.
• We can use Bayes rule as the basis for designing learning algorithms (func-
tion approximators), as follows: Given that we wish to learn some target
function f : X → Y , or equivalently, P(Y |X), we use the training data to
learn estimates of P(X|Y ) and P(Y ). New X examples can then be classi-
fied using these estimated probability distributions, plus Bayes rule. This
6 Further Reading
Wasserman (2004) describes a Reweighted Least Squares method for Logistic
Regression. Ng and Jordan (2002) provide a theoretical and experimental com-
parison of the Naive Bayes classifier and Logistic Regression.
EXERCISES
1. At the beginning of the chapter we remarked that “A hundred training ex-
amples will usually suffice to obtain an estimate of P(Y ) that is within a
few percent of the correct value.” Describe conditions under which the 95%
confidence interval for our estimate of P(Y ) will be ±0.02.
2. Consider learning a function X → Y where Y is boolean, where X = hX1 , X2 i,
and where X1 is a boolean variable and X2 a continuous variable. State the
parameters that must be estimated to define a Naive Bayes classifier in this
case. Give the formula for computing P(Y |X), in terms of these parameters
and the feature values X1 and X2 .
3. In section 3 we showed that when Y is Boolean and X = hX1 . . . Xn i is a
vector of continuous variables, then the assumptions of the Gaussian Naive
Bayes classifier imply that P(Y |X) is given by the logistic function with
appropriate parameters W . In particular:
P(Y = 1|X) = 1 / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))

and

P(Y = 0|X) = exp(w0 + ∑_{i=1}^{n} wi Xi) / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))
Consider instead the case where Y is Boolean and X = hX1 . . . Xn i is a vec-
tor of Boolean variables. Prove for this case also that P(Y |X) follows this
same form (and hence that Logistic Regression is also the discriminative
counterpart to a Naive Bayes generative classifier over Boolean features).
Hints:
• Simple notation will help. Since the Xi are Boolean variables, you
need only one parameter to define P(Xi |Y = yk ). Define θi1 ≡ P(Xi =
1|Y = 1), in which case P(Xi = 0|Y = 1) = (1 − θi1 ). Similarly, use
θi0 to denote P(Xi = 1|Y = 0).
• Notice with the above notation you can represent P(Xi |Y = 1) as fol-
lows
P(Xi|Y = 1) = θi1^{Xi} (1 − θi1)^{(1 − Xi)}
Note when Xi = 1 the second term is equal to 1 because its exponent
is zero. Similarly, when Xi = 0 the first term is equal to 1 because its
exponent is zero.
4. (based on a suggestion from Sandra Zilles). This question asks you to con-
sider the relationship between the MAP hypothesis and the Bayes optimal
hypothesis. Consider a hypothesis space H defined over the set of instances
X, and containing just two hypotheses, h1 and h2 with equal prior probabil-
ities P(h1) = P(h2) = 0.5. Suppose we are given an arbitrary set of training
7 Acknowledgements
I very much appreciate receiving helpful comments on earlier drafts of this chapter
from the following: Nathaniel Fairfield, Rainer Gemulla, Vineet Kumar, Andrew
McCallum, Anand Prahlad, Wei Wang, Geoff Webb, and Sandra Zilles.
REFERENCES
Mitchell, T (1997). Machine Learning, McGraw Hill.
Ng, A.Y. & Jordan, M. I. (2002). On Discriminative vs. Generative Classifiers: A comparison of Logistic Regression and Naive Bayes, Neural Information Processing Systems.
Wasserman, L. (2004). All of Statistics, Springer-Verlag.
CS229 Lecture notes
Andrew Ng
Part V
Support Vector Machines
This set of notes presents the Support Vector Machine (SVM) learning al-
gorithm. SVMs are among the best (and many believe are indeed the best)
“off-the-shelf” supervised learning algorithms. To tell the SVM story, we’ll
need to first talk about margins and the idea of separating data with a large
“gap.” Next, we’ll talk about the optimal margin classifier, which will lead
us into a digression on Lagrange duality. We’ll also see kernels, which give
a way to apply SVMs efficiently in very high dimensional (such as infinite-
dimensional) feature spaces, and finally, we’ll close off the story with the
SMO algorithm, which gives an efficient implementation of SVMs.
1 Margins: Intuition
We’ll start our story on SVMs by talking about margins. This section will
give the intuitions about margins and about the “confidence” of our predic-
tions; these ideas will be made formal in Section 3.
Consider logistic regression, where the probability p(y = 1|x; θ) is mod-
eled by hθ (x) = g(θT x). We would then predict “1” on an input x if and
only if hθ (x) ≥ 0.5, or equivalently, if and only if θT x ≥ 0. Consider a
positive training example (y = 1). The larger θT x is, the larger also is hθ(x) = p(y = 1|x; θ), and thus also the higher our degree of “confidence”
that the label is 1. Thus, informally we can think of our prediction as being
a very confident one that y = 1 if θT x ≫ 0. Similarly, we think of logistic
regression as making a very confident prediction of y = 0, if θT x ≪ 0. Given
a training set, again informally it seems that we’d have found a good fit to
the training data if we can find θ so that θT x(i) ≫ 0 whenever y (i) = 1, and
θT x(i) ≪ 0 whenever y (i) = 0, since this would reflect a very confident (and
correct) set of classifications for all the training examples. This seems to be
a nice goal to aim for, and we’ll soon formalize this idea using the notion of
functional margins.
For a different type of intuition, consider the following figure, in which x’s
represent positive training examples, o’s denote negative training examples,
a decision boundary (this is the line given by the equation θT x = 0, and
is also called the separating hyperplane) is also shown, and three points
have also been labeled A, B and C.
[Figure: a separating hyperplane (the line θT x = 0) with three labeled points A, B, and C at varying distances from the boundary.]
Notice that the point A is very far from the decision boundary. If we are
asked to make a prediction for the value of y at A, it seems we should be
quite confident that y = 1 there. Conversely, the point C is very close to
the decision boundary, and while it’s on the side of the decision boundary
on which we would predict y = 1, it seems likely that just a small change to
the decision boundary could easily have caused our prediction to be y = 0.
Hence, we’re much more confident about our prediction at A than at C. The
point B lies in-between these two cases, and more broadly, we see that if
a point is far from the separating hyperplane, then we may be significantly
more confident in our predictions. Again, informally we think it’d be nice if,
given a training set, we manage to find a decision boundary that allows us
to make all correct and confident (meaning far from the decision boundary)
predictions on the training examples. We’ll formalize this later using the
notion of geometric margins.
2 Notation
To make our discussion of SVMs easier, we’ll first need to introduce a new
notation for talking about classification. We will be considering a linear
classifier for a binary classification problem with labels y and features x.
From now, we’ll use y ∈ {−1, 1} (instead of {0, 1}) to denote the class labels.
Also, rather than parameterizing our linear classifier with the vector θ, we
will use parameters w, b, and write our classifier as

h_{w,b}(x) = g(wT x + b)

where g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise. Given a training example (x^(i), y^(i)), we define the functional margin of (w, b) with respect to that example to be γ̂^(i) = y^(i)(wT x^(i) + b).
Note that if y (i) = 1, then for the functional margin to be large (i.e., for
our prediction to be confident and correct), we need wT x + b to be a large
positive number. Conversely, if y (i) = −1, then for the functional margin
to be large, we need wT x + b to be a large negative number. Moreover, if
y (i) (wT x + b) > 0, then our prediction on this example is correct. (Check
this yourself.) Hence, a large functional margin represents a confident and a
correct prediction.
For a linear classifier with the choice of g given above (taking values in
{−1, 1}), there’s one property of the functional margin that makes it not a
very good measure of confidence, however. Given our choice of g, we note that
if we replace w with 2w and b with 2b, then since g(wT x + b) = g(2wT x + 2b),
this would not change hw,b (x) at all. I.e., g, and hence also hw,b (x), depends
only on the sign, but not on the magnitude, of wT x + b. However, replacing
(w, b) with (2w, 2b) also results in multiplying our functional margin by a
factor of 2. Thus, it seems that by exploiting our freedom to scale w and b,
we can make the functional margin arbitrarily large without really changing
anything meaningful. Intuitively, it might therefore make sense to impose
some sort of normalization condition such as that ||w||2 = 1; i.e., we might
replace (w, b) with (w/||w||2 , b/||w||2 ), and instead consider the functional
margin of (w/||w||2 , b/||w||2 ). We’ll come back to this later.
Given a training set S = {(x^(i), y^(i)); i = 1, . . . , m}, we also define the functional margin of (w, b) with respect to S as the smallest of the functional margins of the individual training examples. Denoted by γ̂, this can therefore be written:

γ̂ = min_{i=1,...,m} γ̂^(i).
Next, let’s talk about geometric margins. Consider the picture below:
[Figure: the decision boundary for (w, b) together with the normal vector w; a positive training point A = x^(i) lies at distance γ^(i) from the boundary, with B its projection onto the boundary.]

Since w/||w|| is a unit-length vector pointing in the same direction as w, we find that the point B is given by x^(i) − γ^(i) · w/||w||. But this point lies on
the decision boundary, and all points x on the decision boundary satisfy the
equation wT x + b = 0. Hence,
wT (x^(i) − γ^(i) w/||w||) + b = 0.

Solving for γ^(i) yields γ^(i) = (w/||w||)T x^(i) + b/||w||. More generally, to cover both positive and negative examples, we define the geometric margin of (w, b) with respect to a training example (x^(i), y^(i)) to be γ^(i) = y^(i) ( (w/||w||)T x^(i) + b/||w|| ).
Note that if ||w|| = 1, then the functional margin equals the geometric
margin—this thus gives us a way of relating these two different notions of
margin. Also, the geometric margin is invariant to rescaling of the parame-
ters; i.e., if we replace w with 2w and b with 2b, then the geometric margin
does not change. This will in fact come in handy later. Specifically, because
of this invariance to the scaling of the parameters, when trying to fit w and b
to training data, we can impose an arbitrary scaling constraint on w without
changing anything important; for instance, we can demand that ||w|| = 1, or
|w1 | = 5, or |w1 + b| + |w2 | = 2, and any of these can be satisfied simply by
rescaling w and b.
Finally, given a training set S = {(x(i) , y (i) ); i = 1, . . . , m}, we also define
the geometric margin of (w, b) with respect to S to be the smallest of the
geometric margins on the individual training examples:
γ = min_{i=1,...,m} γ^(i).
on the training set and a good “fit” to the training data. Specifically, this
will result in a classifier that separates the positive and the negative training
examples with a “gap” (geometric margin).
For now, we will assume that we are given a training set that is linearly
separable; i.e., that it is possible to separate the positive and negative ex-
amples using some separating hyperplane. How will we find the one that
achieves the maximum geometric margin? We can pose the following opti-
mization problem:
max_{γ,w,b}  γ
s.t.  y^(i)(wT x^(i) + b) ≥ γ,  i = 1, . . . , m
      ||w|| = 1.
Here, we're going to maximize γ̂/||w||, subject to the functional margins all being at least γ̂. Since the geometric and functional margins are related by γ = γ̂/||w||, this will give us the answer we want. Moreover, we've gotten rid of the constraint ||w|| = 1 that we didn't like. The downside is that we now have a nasty (again, non-convex) objective function γ̂/||w||; and, we still don't have any off-the-shelf software that can solve this form of an optimization problem.
Let’s keep going. Recall our earlier discussion that we can add an arbi-
trary scaling constraint on w and b without changing anything. This is the
key idea we’ll use now. We will introduce the scaling constraint that the
functional margin of w, b with respect to the training set must be 1:
γ̂ = 1.
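With γ̂ = 1, maximizing γ̂/||w|| is equivalent to minimizing ||w||² subject to every functional margin being at least 1, which is a convex quadratic program. The sketch below solves that QP with the cvxpy modeling library on a toy separable data set; the data, and the choice of cvxpy as solver, are illustrative assumptions rather than part of these notes.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()
print(w.value, b.value)
```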
5 Lagrange duality
Let’s temporarily put aside SVMs and maximum margin classifiers, and talk
about solving constrained optimization problems.
Consider a problem of the following form:
minw f (w)
s.t. hi (w) = 0, i = 1, . . . , l.
Some of you may recall how the method of Lagrange multipliers can be used
to solve it. (Don’t worry if you haven’t seen it before.) In this method, we
define the Lagrangian to be
L(w, β) = f(w) + ∑_{i=1}^{l} βi hi(w)
1
You may be familiar with linear programming, which solves optimization problems
that have linear objectives and linear constraints. QP software is also widely available,
which allows convex quadratic objectives and linear constraints.
Here, the βi ’s are called the Lagrange multipliers. We would then find
and set L’s partial derivatives to zero:
∂L/∂wi = 0;   ∂L/∂βi = 0,
and solve for w and β.
In this section, we will generalize this to constrained optimization prob-
lems in which we may have inequality as well as equality constraints. Due to
time constraints, we won’t really be able to do the theory of Lagrange duality
justice in this class,2 but we will give the main ideas and results, which we
will then apply to our optimal margin classifier’s optimization problem.
Consider the following, which we’ll call the primal optimization problem:
min_w  f(w)
s.t.  gi(w) ≤ 0,  i = 1, . . . , k
      hi(w) = 0,  i = 1, . . . , l.
To solve it, we start by defining the generalized Lagrangian
L(w, α, β) = f(w) + ∑_{i=1}^{k} αi gi(w) + ∑_{i=1}^{l} βi hi(w).
Here, the αi ’s and βi ’s are the Lagrange multipliers. Consider the quantity
θ_P(w) = max_{α,β : αi ≥ 0} L(w, α, β).
Here, the “P” subscript stands for “primal.” Let some w be given. If w
violates any of the primal constraints (i.e., if either gi(w) > 0 or hi(w) ≠ 0
for some i), then you should be able to verify that
θ_P(w) = max_{α,β : αi ≥ 0} [ f(w) + ∑_{i=1}^{k} αi gi(w) + ∑_{i=1}^{l} βi hi(w) ]    (1)
        = ∞.    (2)
Conversely, if the constraints are indeed satisfied for a particular value of w,
then θP (w) = f (w). Hence,
θ_P(w) = f(w) if w satisfies the primal constraints, and θ_P(w) = ∞ otherwise.
2 Readers interested in learning more about this topic are encouraged to read, e.g., R. T. Rockafellar (1970), Convex Analysis, Princeton University Press.
Thus, θ_P takes the same value as the objective in our problem for all values of w that satisfy the primal constraints, and is positive infinity if the constraints are violated. Hence, if we consider the minimization problem
min_w θ_P(w) = min_w max_{α,β : αi ≥ 0} L(w, α, β),
we see that it is the same problem (i.e., it has the same solutions) as our original, primal problem. For later use, we also define the optimal value of
the objective to be p∗ = minw θP (w); we call this the value of the primal
problem.
Now, let’s look at a slightly different problem. We define
θ_D(α, β) = min_w L(w, α, β).
Here, the “D” subscript stands for “dual.” Note also that whereas in the
definition of θP we were optimizing (maximizing) with respect to α, β, here
we are minimizing with respect to w.
We can now pose the dual optimization problem:
max_{α,β : αi ≥ 0} θ_D(α, β) = max_{α,β : αi ≥ 0} min_w L(w, α, β).
This is exactly the same as our primal problem shown above, except that the
order of the “max” and the “min” are now exchanged. We also define the optimal value of the dual problem's objective to be d* = max_{α,β : αi ≥ 0} θ_D(α, β).
How are the primal and the dual problems related? It can easily be shown
that
d* = max_{α,β : αi ≥ 0} min_w L(w, α, β) ≤ min_w max_{α,β : αi ≥ 0} L(w, α, β) = p*.
(You should convince yourself of this; this follows from the “max min” of a
function always being less than or equal to the “min max.”) However, under
certain conditions, we will have
d ∗ = p∗ ,
so that we can solve the dual problem in lieu of the primal problem. Let’s
see what these conditions are.
Suppose f and the gi ’s are convex,3 and the hi ’s are affine.4 Suppose
further that the constraints gi are (strictly) feasible; this means that there
exists some w so that gi (w) < 0 for all i.
3
When f has a Hessian, then it is convex if and only if the Hessian is positive semi-
definite. For instance, f (w) = wT w is convex; similarly, all linear (and affine) functions
are also convex. (A function f can also be convex without being differentiable, but we
won’t need those more general definitions of convexity here.)
4
I.e., there exist ai, bi, so that hi(w) = ai^T w + bi. “Affine” means the same thing as linear, except that we also allow the extra intercept term bi.
We have one such constraint for each training example. Note that from the KKT dual complementarity condition, we will have αi > 0 only for the training examples that have functional margin exactly equal to one (i.e., the ones corresponding to constraints that hold with equality).
The points with the smallest margins are exactly the ones closest to the
decision boundary; here, these are the three points (one negative and two pos-
itive examples) that lie on the dashed lines parallel to the decision boundary.
Thus, only three of the αi ’s—namely, the ones corresponding to these three
training examples—will be non-zero at the optimal solution to our optimiza-
tion problem. These three points are called the support vectors in this
problem. The fact that the number of support vectors can be much smaller than the size of the training set will be useful later.
Let’s move on. Looking ahead, as we develop the dual form of the prob-
lem, one key idea to watch out for is that we’ll try to write our algorithm
in terms of only the inner product ⟨x^(i), x^(j)⟩ (think of this as (x^(i))T x^(j))
between points in the input feature space. The fact that we can express our
algorithm in terms of these inner products will be key when we apply the
kernel trick.
When we construct the Lagrangian for our optimization problem we have:
L(w, b, α) = (1/2)||w||² − ∑_{i=1}^{m} αi [ y^(i)(wT x^(i) + b) − 1 ].    (8)
Note that there’re only “αi ” but no “βi ” Lagrange multipliers, since the
problem has only inequality constraints.
Let's find the dual form of the problem. To do so, we need to first minimize L(w, b, α) with respect to w and b (for fixed α), to get θ_D, which we do by setting the derivatives of L with respect to w and b to zero. Setting the derivative with respect to w to zero gives

w = ∑_{i=1}^{m} αi y^(i) x^(i)    (9)

and setting the derivative with respect to b to zero gives

∑_{i=1}^{m} αi y^(i) = 0.    (10)
If we take the definition of w in Equation (9) and plug that back into the
Lagrangian (Equation 8), and simplify, we get
L(w, b, α) = ∑_{i=1}^{m} αi − (1/2) ∑_{i,j=1}^{m} y^(i) y^(j) αi αj (x^(i))T x^(j) − b ∑_{i=1}^{m} αi y^(i).
But from Equation (10), the last term must be zero, so we obtain
L(w, b, α) = ∑_{i=1}^{m} αi − (1/2) ∑_{i,j=1}^{m} y^(i) y^(j) αi αj (x^(i))T x^(j).
You should also be able to verify that the conditions required for p∗ =
d∗ and the KKT conditions (Equations 3–7) to hold are indeed satisfied in
our optimization problem. Hence, we can solve the dual in lieu of solving
the primal problem. Specifically, in the dual problem above, we have a
maximization problem in which the parameters are the αi ’s. We’ll talk later
about the specific algorithm that we’re going to use to solve the dual problem,
but if we are indeed able to solve it (i.e., find the α’s that maximize W (α)
subject to the constraints), then we can use Equation (9) to go back and find
the optimal w’s as a function of the α’s. Having found w∗ , by considering
the primal problem, it is also straightforward to find the optimal value for
the intercept term b as
b* = −[ max_{i : y^(i) = −1} w*T x^(i) + min_{i : y^(i) = 1} w*T x^(i) ] / 2.    (11)
(Check for yourself that this is correct.)
Before moving on, let’s also take a more careful look at Equation (9),
which gives the optimal value of w in terms of (the optimal value of) α.
Suppose we've fit our model's parameters to a training set, and now wish to make a prediction at a new input point x. We would then calculate wT x + b,
and predict y = 1 if and only if this quantity is bigger than zero. But
using (9), this quantity can also be written:
wT x + b = ( ∑_{i=1}^{m} αi y^(i) x^(i) )T x + b    (12)
         = ∑_{i=1}^{m} αi y^(i) ⟨x^(i), x⟩ + b.    (13)
Hence, if we’ve found the αi ’s, in order to make a prediction, we have to
calculate a quantity that depends only on the inner product between x and
the points in the training set. Moreover, we saw earlier that the αi ’s will all
be zero except for the support vectors. Thus, many of the terms in the sum
above will be zero, and we really need to find only the inner products between x and the support vectors (of which there is often only a small number) in order to calculate (13) and make our prediction.
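A small sketch of the prediction rule in Eq. (13): once the αi's and b are known, the decision value depends only on inner products between the new point and the support vectors. The α values, b, and data below are placeholders rather than the result of actually solving the dual.

```python
import numpy as np

def svm_decision(x, alphas, ys, Xs, b, kernel=np.dot):
    """Eq. (13): w^T x + b = sum_i alpha_i y_i <x_i, x> + b.
    Only points with alpha_i > 0 (the support vectors) contribute."""
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alphas, ys, Xs) if a > 0) + b

# Hypothetical solution: two support vectors out of four training points.
Xs = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
ys = np.array([1.0, 1.0, -1.0, -1.0])
alphas = np.array([0.25, 0.0, 0.25, 0.0])
b = -1.0
x_new = np.array([2.5, 1.5])
print(1 if svm_decision(x_new, alphas, ys, Xs, b) > 0 else -1)
```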
By examining the dual form of the optimization problem, we gained sig-
nificant insight into the structure of the problem, and were also able to write
the entire algorithm in terms of only inner products between input feature
vectors. In the next section, we will exploit this property to apply the ker-
nels to our classification problem. The resulting algorithm, support vector
machines, will be able to efficiently learn in very high dimensional spaces.
7 Kernels
Back in our discussion of linear regression, we had a problem in which the
input x was the living area of a house, and we considered performing regression using the features x, x², and x³ (say) to obtain a cubic function.
Rather than applying SVMs using the original input attributes x, we may
instead want to learn using some features φ(x). To do so, we simply need to
go over our previous algorithm, and replace x everywhere in it with φ(x).
Since the algorithm can be written entirely in terms of the inner products ⟨x, z⟩, this means that we would replace all those inner products with ⟨φ(x), φ(z)⟩. Specifically, given a feature mapping φ, we define the corresponding Kernel to be

K(x, z) = φ(x)T φ(z).

For instance, consider the kernel K(x, z) = (xT z)², which can be expanded as (∑_{i=1}^{n} xi zi)(∑_{j=1}^{n} xj zj) = ∑_{i,j=1}^{n} (xi xj)(zi zj).
Thus, we see that K(x, z) = φ(x)T φ(z), where the feature mapping φ is given
(shown here for the case of n = 3) by
φ(x) = [ x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3 ]T .
Note that whereas calculating the high-dimensional φ(x) requires O(n²) time, finding K(x, z) takes only O(n) time—linear in the dimension of the input attributes.
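A quick numerical check of this identity for the quadratic kernel: computing (xT z)² directly agrees with the explicit O(n²)-dimensional feature map while doing only O(n) work. The vectors are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: all products x_i * x_j (n^2 entries)."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, z = rng.random(3), rng.random(3)

k_direct = np.dot(x, z) ** 2            # O(n) work
k_explicit = np.dot(phi(x), phi(z))     # O(n^2) work
print(np.isclose(k_direct, k_explicit)) # True
```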
For a related kernel, also consider

K(x, z) = (xT z + c)².

(Check this yourself.) This corresponds to the feature mapping (again shown
for n = 3)

φ(x) = [ x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3, √(2c) x1, √(2c) x2, √(2c) x3, c ]T ,
and the parameter c controls the relative weighting between the xi (first
order) and the xi xj (second order) terms.
More broadly, the kernel K(x, z) = (xT z + c)^d corresponds to a feature mapping to an (n + d choose d)-dimensional feature space, corresponding to all monomials of the form xi1 xi2 . . . xik that are up to order d. However, despite working in this O(n^d)-dimensional space, computing K(x, z) still takes only O(n) time, and hence we never need to explicitly represent feature vectors in this very high dimensional feature space.
Now, let’s talk about a slightly different view of kernels. Intuitively, (and
there are things wrong with this intuition, but nevermind), if φ(x) and φ(z)
are close together, then we might expect K(x, z) = φ(x)T φ(z) to be large.
Conversely, if φ(x) and φ(z) are far apart—say nearly orthogonal to each
other—then K(x, z) = φ(x)T φ(z) will be small. So, we can think of K(x, z)
as some measurement of how similar are φ(x) and φ(z), or of how similar are
x and z.
Given this intuition, suppose that for some learning problem that you’re
working on, you’ve come up with some function K(x, z) that you think might
be a reasonable measure of how similar x and z are. For instance, perhaps
you chose

K(x, z) = exp( −||x − z||² / (2σ²) ).
This is a reasonable measure of x and z’s similarity, and is close to 1 when
x and z are close, and near 0 when x and z are far apart. Can we use this
definition of K as the kernel in an SVM? In this particular example, the
answer is yes. (This kernel is called the Gaussian kernel, and corresponds
to an infinite dimensional feature mapping φ.) But more broadly, given some
function K, how can we tell if it’s a valid kernel; i.e., can we tell if there is
some feature mapping φ so that K(x, z) = φ(x)T φ(z) for all x, z?
Suppose for now that K is indeed a valid kernel corresponding to some
feature mapping φ. Now, consider some finite set of m points (not necessarily
the training set) {x(1) , . . . , x(m) }, and let a square, m-by-m matrix K be
defined so that its (i, j)-entry is given by Kij = K(x(i) , x(j) ). This matrix
is called the Kernel matrix. Note that we’ve overloaded the notation and
used K to denote both the kernel function K(x, z) and the kernel matrix K,
due to their obvious close relationship.
Now, if K is a valid Kernel, then Kij = K(x(i) , x(j) ) = φ(x(i) )T φ(x(j) ) =
φ(x(j) )T φ(x(i) ) = K(x(j) , x(i) ) = Kji , and hence K must be symmetric. More-
over, letting φk (x) denote the k-th coordinate of the vector φ(x), we find that
for any vector z, we have
zT K z = ∑_i ∑_j zi Kij zj
       = ∑_i ∑_j zi φ(x^(i))T φ(x^(j)) zj
       = ∑_i ∑_j ∑_k zi φk(x^(i)) φk(x^(j)) zj
       = ∑_k ∑_i ∑_j zi φk(x^(i)) φk(x^(j)) zj
       = ∑_k ( ∑_i zi φk(x^(i)) )²
       ≥ 0.
The second-to-last step above used the same trick as you saw in Problem
set 1 Q1. Since z was arbitrary, this shows that K is positive semi-definite
(K ≥ 0).
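The sketch below checks this property numerically for the Gaussian kernel mentioned above: the kernel matrix built from a handful of random points is symmetric, and its eigenvalues are non-negative up to round-off. The points and bandwidth are arbitrary.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
pts = rng.random((5, 3))                        # 5 arbitrary points in R^3
K = np.array([[gaussian_kernel(a, b) for b in pts] for a in pts])

print(np.allclose(K, K.T))                      # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # eigenvalues >= 0, i.e. PSD
```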
Hence, we’ve shown that if K is a valid kernel (i.e., if it corresponds to
some feature mapping φ), then the corresponding Kernel matrix K ∈ Rm×m
is symmetric positive semidefinite. More generally, this turns out to be not
only a necessary, but also a sufficient, condition for K to be a valid kernel
(also called a Mercer kernel). The following result is due to Mercer.5
5 Many texts present Mercer's theorem in a slightly more complicated form involving L2 functions, but when the input attributes take values in Rn, the version given here is equivalent.
that we’ll see later in this class will also be amenable to this method, which
has come to be known as the “kernel trick.”
Thus, examples are now permitted to have (functional) margin less than 1, and if an example has functional margin 1 − ξi (with ξi > 0), we would pay
a cost of the objective function being increased by Cξi . The parameter C
controls the relative weighting between the twin goals of making the ||w||2
small (which we saw earlier makes the margin large) and of ensuring that
most examples have functional margin at least 1.
Now, all that remains is to give an algorithm for actually solving the dual
problem, which we will do in the next section.
of the SVM. Partly to motivate the SMO algorithm, and partly because it’s
interesting in its own right, let’s first take another digression to talk about
the coordinate ascent algorithm.
Consider trying to solve the unconstrained optimization problem

max_α  W(α1, α2, . . . , αm).
Here, we think of W as just some function of the parameters αi ’s, and for now
ignore any relationship between this problem and SVMs. We’ve already seen
two optimization algorithms, gradient ascent and Newton’s method. The
new algorithm we’re going to consider here is called coordinate ascent:
Repeat until convergence {
   For i = 1, . . . , m, {
      αi := arg max_{α̂i} W(α1, . . . , αi−1, α̂i, αi+1, . . . , αm).
   }
}
Thus, in the innermost loop of this algorithm, we will hold all the vari-
ables except for some αi fixed, and reoptimize W with respect to just the
parameter αi . In the version of this method presented here, the inner-loop
reoptimizes the variables in order α1 , α2 , . . . , αm , α1 , α2 , . . .. (A more sophis-
ticated version might choose other orderings; for instance, we may choose
the next variable to update according to which one we expect to allow us to
make the largest increase in W (α).)
When the function W happens to be of such a form that the “arg max”
in the inner loop can be performed efficiently, then coordinate ascent can be
a fairly efficient algorithm. Here’s a picture of coordinate ascent in action:
[figure: contours of a quadratic function, with the zig-zag path taken by coordinate ascent from the starting point (2, −2) to the global maximum]
The ellipses in the figure are the contours of a quadratic function that
we want to optimize. Coordinate ascent was initialized at (2, −2), and also
plotted in the figure is the path that it took on its way to the global maximum.
Notice that on each step, coordinate ascent takes a step that’s parallel to one
of the axes, since only one variable is being optimized at a time.
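To make the procedure concrete, here is a small sketch (my illustration, not part of the notes) that runs coordinate ascent on an arbitrary concave quadratic in two variables; for a quadratic, each inner arg max has a closed form obtained by setting the corresponding partial derivative to zero:

```python
import numpy as np

def W(a1, a2):
    # an arbitrary concave quadratic chosen purely for illustration
    return -(a1 ** 2 + 2 * a2 ** 2 + a1 * a2) + a1 + a2

# coordinate ascent: repeatedly maximize W over one variable with the other held fixed
a1, a2 = 2.0, -2.0              # same starting point as in the figure
path = [(a1, a2)]
for _ in range(20):
    a1 = (1.0 - a2) / 2.0        # arg max over a1 with a2 fixed (set dW/da1 = 0)
    a2 = (1.0 - a1) / 4.0        # arg max over a2 with a1 fixed (set dW/da2 = 0)
    path.append((a1, a2))

print("first steps:", [(round(p, 3), round(q, 3)) for p, q in path[:4]])
print("final point:", (round(a1, 4), round(a2, 4)), "W =", round(W(a1, a2), 4))
```

Each update moves parallel to one coordinate axis, which is exactly the staircase path described above.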
9.2 SMO
We close off the discussion of SVMs by sketching the derivation of the SMO
algorithm. Some details will be left to the homework, and for others you
may refer to the paper excerpt handed out in class.
Here’s the (dual) optimization problem that we want to solve:
max_α W(α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} y^(i) y^(j) αi αj ⟨x^(i), x^(j)⟩    (17)

s.t.  0 ≤ αi ≤ C,  i = 1, . . . , m    (18)

      Σ_{i=1}^{m} αi y^(i) = 0.    (19)
Let’s say we have set of αi ’s that satisfy the constraints (18-19). Now,
suppose we want to hold α2 , . . . , αm fixed, and take a coordinate ascent step
and reoptimize the objective with respect to α1 . Can we make any progress?
The answer is no, because the constraint (19) ensures that

α1 y^(1) = − Σ_{i=2}^{m} αi y^(i),

or, equivalently, multiplying both sides by y^(1),

α1 = −y^(1) Σ_{i=2}^{m} αi y^(i).
(This step used the fact that y (1) ∈ {−1, 1}, and hence (y (1) )2 = 1.) Hence,
α1 is exactly determined by the other αi ’s, and if we were to hold α2 , . . . , αm
fixed, then we can't make any change to α1 without violating constraint (19) in the optimization problem.
Thus, if we want to update some subset of the αi 's, we must update at
least two of them simultaneously in order to keep satisfying the constraints.
This motivates the SMO algorithm, which simply does the following:
Repeat till convergence {
1. Select some pair αi and αj to update next (using a heuristic that
tries to pick the two that will allow us to make the biggest progress
towards the global maximum).
2. Reoptimize W (α) with respect to αi and αj , while holding all the
other αk 's (k ≠ i, j) fixed.
}
To test for convergence of this algorithm, we can check whether the KKT
conditions (Equations 14-16) are satisfied to within some tol. Here, tol is
the convergence tolerance parameter, and is typically set to around 0.01 to
0.001. (See the paper and pseudocode for details.)
The key reason that SMO is an efficient algorithm is that the update to
αi , αj can be computed very efficiently. Let’s now briefly sketch the main
ideas for deriving the efficient update.
Let’s say we currently have some setting of the αi ’s that satisfy the con-
straints (18-19), and suppose we’ve decided to hold α3 , . . . , αm fixed, and
want to reoptimize W (α1 , α2 , . . . , αm ) with respect to α1 and α2 (subject to
the constraints). From (19), we require that
α1 y^(1) + α2 y^(2) = − Σ_{i=3}^{m} αi y^(i).

Since the right hand side is fixed (as we've fixed α3, . . . , αm), we can just let it be denoted by some constant ζ:

α1 y^(1) + α2 y^(2) = ζ.    (20)
We can thus picture the constraints on α1 and α2 as follows:
[figure: the box [0, C] × [0, C] in the (α1, α2) plane, the line α1 y^(1) + α2 y^(2) = ζ crossing it, and the resulting lower and upper bounds L and H on α2]
From the constraints (18), we know that α1 and α2 must lie within the box
[0, C] × [0, C] shown. Also plotted is the line α1 y (1) + α2 y (2) = ζ, on which we
know α1 and α2 must lie. Note also that, from these constraints, we know
L ≤ α2 ≤ H; otherwise, (α1 , α2 ) can’t simultaneously satisfy both the box
and the straight line constraint. In this example, L = 0. But depending on
what the line α1 y (1) + α2 y (2) = ζ looks like, this won’t always necessarily be
the case; but more generally, there will be some lower-bound L and some
upper-bound H on the permissible values for α2 that will ensure that α1 , α2
lie within the box [0, C] × [0, C].
Using Equation (20), we can also write α1 as a function of α2 :
α1 = (ζ − α2 y (2) )y (1) .
(Check this derivation yourself; we again used the fact that y (1) ∈ {−1, 1} so
that (y (1) )2 = 1.) Hence, the objective W (α) can be written
Finally, having found the α2new , we can use Equation (20) to go back and find
the optimal value of α1new .
There are a couple more details that are quite easy but that we'll leave you
to read about yourself in Platt’s paper: One is the choice of the heuristics
used to select the next αi , αj to update; the other is how to update b as the
SMO algorithm is run.
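As a rough sketch of the pairwise step just described (this is an illustration, not Platt's implementation; pair selection and the update of b are omitted, and the closed-form step below uses the standard expression in terms of the prediction errors Ei, Ej and the curvature η = K(i,i) + K(j,j) − 2K(i,j)):

```python
import numpy as np

def smo_pair_update(i, j, alpha, X, y, b, C, kernel):
    """One simplified SMO step on the pair (alpha_i, alpha_j); returns an updated copy.
    A sketch of the core update only; heuristics and the update of b are omitted."""
    alpha = alpha.copy()
    Kii, Kjj, Kij = kernel(X[i], X[i]), kernel(X[j], X[j]), kernel(X[i], X[j])

    # prediction errors E_k = f(x_k) - y_k under the current alphas
    def f(x):
        return sum(alpha[k] * y[k] * kernel(X[k], x) for k in range(len(y))) + b
    Ei, Ej = f(X[i]) - y[i], f(X[j]) - y[j]

    # ends L, H of the feasible segment for alpha_j (box [0, C] intersected with the line)
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = Kii + Kjj - 2.0 * Kij          # curvature of W along the constraint line
    if L == H or eta <= 0:
        return alpha                      # no progress possible on this pair

    # unconstrained optimum for alpha_j along the line, then clip to [L, H]
    aj_new = np.clip(alpha[j] + y[j] * (Ei - Ej) / eta, L, H)
    ai_new = alpha[i] + y[i] * y[j] * (alpha[j] - aj_new)   # keep sum alpha_k y^(k) fixed
    alpha[i], alpha[j] = ai_new, aj_new
    return alpha

# tiny usage example with a linear kernel and arbitrary data
rng = np.random.default_rng(0)
Xd = rng.normal(size=(6, 2))
yd = np.array([1, -1, 1, -1, 1, -1], dtype=float)
a = np.zeros(6)
a = smo_pair_update(0, 1, a, Xd, yd, b=0.0, C=1.0, kernel=lambda u, v: u @ v)
print(a)
```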
10-601 Machine Learning
The ability to generalize beyond what we have seen in the training phase is the essence of machine
learning, essentially what makes machine learning, machine learning. In these notes we describe
some basic concepts and the classic formalization that allows us to talk about these important
concepts in a precise way.
Distributional Learning
The basic idea of the distributional learning setting is to assume that examples are being provided
from a fixed (but perhaps unknown) distribution over the instance space. The assumption of a
fixed distribution gives us hope that what we learn based on some training data will carry over
to new test data we haven’t seen yet. A nice feature of this assumption is that it provides us a
well-defined notion of the error of a hypothesis with respect to a target concept.
Specifically, in the distributional learning setting (captured by the PAC model of Valiant and Sta-
tistical Learning Theory framework of Vapnik) we assume that the input to the learning algorithm
is a set of labeled examples
S: (x1 , y1 ), . . . , (xm , ym )
where xi are drawn i.i.d. from some fixed but unknown distribution D over the instance space
X and that they are labeled by some target concept c∗ . So yi = c∗ (xi ). Here the goal is to do
optimization over the given sample S in order to find a hypothesis h : X → {0, 1}, that has small
error over the whole distribution D. The true error of h with respect to a target concept c∗ and the underlying distribution D is defined as

err_D(h) = Pr_{x∼D}( h(x) ≠ c∗(x) ).

(Pr_{x∼D}(A) means the probability of event A given that x is selected according to distribution D.) We denote by

err_S(h) = Pr_{x∼S}( h(x) ≠ c∗(x) ) = (1/m) Σ_{i=1}^{m} I[ h(xi) ≠ c∗(xi) ]

the empirical error of h over the sample S (that is, the fraction of examples in S misclassified by h).
What kind of guarantee could we hope to make?
• We converge quickly to the target concept (or equivalent). But, what if our distribution
places low weight on some part of X?
• We converge quickly to an approximation of the target concept. But, what if the examples
we see don’t correctly reflect the distribution?
• With high probability we converge to an approximation of the target concept. This is the
idea of Probably Approximately Correct learning.
Here is a basic result that is meaningful in the realizable case (when the target function belongs to
an a-priori known finite hypothesis space H.)
Theorem 1 Let H be a finite hypothesis space. Let D be an arbitrary, fixed unknown probability
distribution over X and let c∗ be an arbitrary unknown target function. For any ε, δ > 0, if we draw a sample from D of size

m = (1/ε) [ ln(|H|) + ln(1/δ) ],

then with probability at least 1 − δ, all hypotheses/concepts in H with error ≥ ε are inconsistent with the data (or alternatively, with probability at least 1 − δ, any hypothesis consistent with the data will have error at most ε).
1. Consider some specific "bad" hypothesis h whose error is at least ε. The probability that this bad hypothesis h is consistent with m examples drawn from D is at most (1 − ε)^m.
2. Notice that there are (only) at most |H| possible bad hypotheses.
3. (1) and (2) imply that given m examples drawn from D, the probability there exists a bad hypothesis consistent with all of them is at most |H|(1 − ε)^m. Suppose that m is sufficiently large so that this quantity is at most δ. That means that with probability at least 1 − δ there is no consistent hypothesis whose error is more than ε.
Using the inequality 1 − x ≤ e^{−x}, it is simple to verify that this holds as long as:

m ≥ (1/ε) [ ln(|H|) + ln(1/δ) ].
For any δ > 0, if we draw a sample from D of size m, then with probability at least 1 − δ, any hypothesis in H consistent with the data will have error at most

(1/m) [ ln(|H|) + ln(1/δ) ].
This is the more “statistical learning theory style” way of writing the same bound.
In the general case, the target function might not be in the class of functions we consider. Formally, in the non-realizable or agnostic passive supervised learning setting, we assume that the input to a learning algorithm is a set S of labeled examples S = {(x1, y1), . . . , (xm, ym)}. We assume that these examples are drawn i.i.d. from some fixed but unknown distribution D over the instance space X and that they are labeled by some target concept c∗, so yi = c∗(xi). The goal, just as in the realizable case, is to do optimization over the given sample S in order to find a hypothesis h : X → {0, 1} of small error over the whole distribution D. Our goal is to compete with the best function (the function of smallest true error rate) in some concept class H.
A natural hope is that picking a concept c with a small observed error rate gives us small true error
rate. It is therefore useful to find a relationship between observed error rate for a sample and the
true error rate.
Consider a hypothesis with true error rate p (or a coin of bias p) observed on m examples (the coin
is flipped m times). Let S be the number of observed errors (the number of heads seen) so S/m is
the observed error rate.
Hoeffding bounds state that for any ε ∈ [0, 1],

1. Pr[ S/m > p + ε ] ≤ e^{−2mε²}, and
2. Pr[ S/m < p − ε ] ≤ e^{−2mε²}.
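A quick simulation (not part of the notes; the values of p, m, ε and the number of trials are arbitrary) confirms that the empirical frequency of a large deviation stays below the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, eps, trials = 0.3, 200, 0.05, 100_000

# S/m for many independent experiments of m coin flips each
S_over_m = rng.binomial(m, p, size=trials) / m

empirical = np.mean(S_over_m > p + eps)
bound = np.exp(-2 * m * eps ** 2)
print(f"Pr[S/m > p + eps] ~ {empirical:.4f}  <=  Hoeffding bound {bound:.4f}")
```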
Theorem 2 Let H be a finite hypothesis space. Let D be an arbitrary, fixed unknown probability
distribution over X and let c∗ be an arbitrary unknown target function. For any ε, δ > 0, if we draw a sample S from D of size

m ≥ (1/(2ε²)) [ ln(2|H|) + ln(1/δ) ],

then with probability at least 1 − δ, all hypotheses h in H have |err_D(h) − err_S(h)| ≤ ε.
Proof: Let us fix a hypothesis h. By Hoeffding, we get that the probability that its observed error is not within ε of its true error is at most 2e^{−2mε²} ≤ δ/|H|. By a union bound over all h in H, we then get the desired result.
Note: A statement of this type is called a uniform convergence result. It implies that the hypothesis that minimizes the empirical error rate will be very close in generalization error to the best hypothesis in the class. In particular, if ĥ = argmin_{h∈H} err_S(h), we have err(ĥ) ≤ err(h∗) + 2ε, where h∗ is a hypothesis of smallest true error rate.
Note: The sample size grows quadratically with 1/ε. Recall that the sample size in the realizable (PAC) case grew only linearly with 1/ε.
Note: Another way to write the bound in Theorem 2 is as follows:
For any δ > 0, if we draw a sample from D of size m, then with probability at least 1 − δ, all hypotheses h in H have

err(h) ≤ err_S(h) + √( [ ln(2|H|) + ln(1/δ) ] / (2m) ).

This is the more "statistical learning theory style" way of writing the same bound.
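To get a feel for the numbers, here is a small calculator (an illustration, not from the notes) for the two finite-class sample sizes above, Theorem 1 for the realizable case and Theorem 2 for the agnostic case, at arbitrary values of ε, δ and |H|:

```python
import math

def m_realizable(eps, delta, H_size):
    # Theorem 1: m = (1/eps) * (ln|H| + ln(1/delta))
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

def m_agnostic(eps, delta, H_size):
    # Theorem 2: m >= (1/(2 eps^2)) * (ln(2|H|) + ln(1/delta))
    return math.ceil((math.log(2 * H_size) + math.log(1.0 / delta)) / (2 * eps ** 2))

eps, delta, H_size = 0.05, 0.01, 10 ** 6   # arbitrary illustrative values
print("realizable:", m_realizable(eps, delta, H_size))
print("agnostic:  ", m_agnostic(eps, delta, H_size))
```

The agnostic requirement comes out much larger, reflecting the quadratic versus linear dependence on 1/ε noted above.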
In the case where H is not finite, we will replace |H| with other measures of complexity of H
(shattering coefficient, VC-dimension, Rademacher complexity).
Shattering, VC dimension
Let H be a concept class over an instance space X, i.e. a set of functions from X to {0, 1} (where both H and X may be infinite). For any S ⊆ X, let's denote by H(S) the set of all behaviors or dichotomies on S that are induced or realized by H, i.e. if S = {x1, · · · , xm}, then H(S) ⊆ {0, 1}^m and

H(S) = { (c(x1), · · · , c(xm)) ; c ∈ H }.

Also, for any natural number m, we consider H[m] to be the maximum number of ways to split m points using concepts in H, that is

H[m] = max_{S⊆X, |S|=m} |H(S)|.
To instantiate this and get a feel for what these quantities look like: if H is the class of thresholds on the line, then H[m] = m + 1; if H is the class of intervals, then H[m] = O(m²); and for linear separators in R^d, H[m] = O(m^{d+1}).
Note 1 In order to show that the VC dimension of a class is at least d we must simply find some
shattered set of size d. In order to show that the VC dimension is at most d we must show that no
set of size d + 1 is shattered.
Examples
1. Let H be the concept class of thresholds on the real number line. Clearly samples of size 1 can be shattered by this class. However, no sample of size 2 can be shattered, since it is impossible to choose a threshold such that x1 is labeled positive and x2 is labeled negative for x1 ≤ x2. Hence VCdim(H) = 1.
2. Let H be the concept class of intervals on the real line. Here a sample of size 2 can be shattered, but no sample of size 3 can be shattered, since no concept can satisfy a sample whose middle point is negative and outer points are positive. Hence VCdim(H) = 2.
3. Let H be the concept class of k non-intersecting intervals on the real line. A sample of size 2k can be shattered (just treat each pair of points as a separate case of example 2), but no sample of size 2k + 1 can be shattered, since if the sample points are alternated positive/negative, starting with a positive point, the positive points can't be covered by only k intervals. Hence VCdim(H) = 2k.
4. Let H be the class of linear separators in R². Three points can be shattered, but four cannot; hence VCdim(H) = 3. To see why four points can never be shattered, consider two cases. The trivial case is when one point can be placed within the triangle formed by the other three; then if the middle point is positive and the others are negative, no half space can contain only the positive points. If however the points cannot be arranged in that pattern, then label two points diagonally across from each other as positive, and the other two as negative; again no half space can contain exactly the positive pair. In general, one can show that the VC dimension of the class of linear separators in R^n is n + 1.
5. The class of axis-aligned rectangles in the plane has VCdim(H) = 4. The trick here is to note that for any collection of five points, at least one of them must be interior to or on the boundary of any rectangle bounded by the other four; hence if the bounding points are positive, the interior point cannot be made negative.
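The definition of shattering can also be checked mechanically on small cases. The brute-force sketch below (mine, not from the notes) tests whether a set of points on the real line is shattered by the class of intervals, reproducing example 2: two points can be shattered, three cannot:

```python
from itertools import product

def interval_realizes(points, labels):
    # is there an interval [a, b] with: x in [a, b] <=> label(x) == 1 ?
    candidates = sorted(points) + [min(points) - 1.0, max(points) + 1.0]
    for a in candidates:
        for b in candidates:
            if all((a <= x <= b) == bool(l) for x, l in zip(points, labels)):
                return True
    return False

def shattered_by_intervals(points):
    # shattered <=> every labeling in {0, 1}^m is realized by some interval
    return all(interval_realizes(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered_by_intervals([1.0, 2.0]))        # True: VCdim >= 2
print(shattered_by_intervals([1.0, 2.0, 3.0]))   # False: +,-,+ cannot be realized
```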
Sauer’s Lemma
Lemma 1 If d = VCdim(H), then for all m, H[m] ≤ Φd(m), where

Φd(m) = Σ_{i=0}^{d} ( m choose i ).

For m > d we have:

Φd(m) ≤ (em/d)^d.

Note that for H the class of intervals we achieve H[m] = Φd(m), where d = VCdim(H), so the bound in Sauer's lemma is tight.
Sample Complexity Results based on Shattering and VCdim
Interestingly, we can roughly replace ln(|H|) from the case where H is finite with the shattering
coefficient H[2m] when H is infinite. Specifically:
Theorem 3 Let H be an arbitrary hypothesis space. Let D be an arbitrary, fixed unknown proba-
bility distribution over X and let c∗ be an arbitrary unknown target function. For any ε, δ > 0, if we draw a sample S from D of size

m > (2/ε) [ log₂(2 · H[2m]) + log₂(1/δ) ]

then with probability 1 − δ, all bad hypotheses in H (with error > ε with respect to c∗ and D) are inconsistent with the data.
Theorem 4 Let H be an arbitrary hypothesis space. Let D be an arbitrary, fixed unknown proba-
bility distribution over X and let c∗ be an arbitrary unknown target function. For any ε, δ > 0, if
we draw a sample S from D of size
We can now use Sauer’s lemma to get a nice closed form expression on sample complexity (an
upper bound on the number of samples needed to learn concepts from the class) based on the VC-
dimension of a concept class. The following is the VC dimension based sample complexity bound
for the realizable case:
So it is possible to learn a class C of VC-dimension d with parameters ε and δ, given that the number of samples m is at least m ≥ c (1/ε) [ d log(1/ε) + log(1/δ) ], where c is a fixed constant. So, as long
as V Cdim(H) is finite, it is possible to learn concepts from H even though H might be infinite!
One can also show that this sample complexity result is tight within a factor of O(log(1/)). Here
is a simplified version of the lower bound:
Theorem 6 Any algorithm for learning a concept class of VC dimension d with parameters ε and δ ≤ 1/15 must use more than (d − 1)/(64ε) examples in the worst case.
The following is the VC dimension based sample complexity bound for the non-realizable case:
Note: As in the finite case, we can rewrite the bounds in Theorems 5 and 7 in the “statistical
learning theory style” as follows:
Let H be an arbitrary hypothesis space of VC-dimension d. For any δ > 0, if we draw a sample
from D of size m then with probability at least 1 − δ, any hypothesis in H consistent with the data
will have error at most

O( (1/m) [ d ln(m/d) + ln(1/δ) ] ).
For any δ > 0 if we draw a sample from D of size m then with probability at least 1 − δ, all
hypotheses h in H have

err(h) ≤ err_S(h) + O( √( (d + ln(1/δ)) / m ) ).
We can see from these bounds that the gap between true error and empirical error in the realizable case is O(ln(m)/m), whereas in the general (non-realizable) case it is the (larger) O(1/√m).
Maximum Likelihood, Logistic Regression,
and Stochastic Gradient Training
Charles Elkan
[email protected]
January 10, 2014
The notation above makes us think of the distribution θ as fixed and the examples
xj as unknown, or varying. However, we can think of the training data as fixed
and consider alternative parameter values. This is the point of view behind the
definition of the likelihood function:
L(θ; x1 , . . . , xn ) = f (x1 , . . . , xn ; θ).
Note that if f (x; θ) is a probability mass function, then the likelihood is always
less than one, but if f (x; θ) is a probability density function, then the likelihood
can be greater than one, since densities can be greater than one.
The principle of maximum likelihood says that given the training data, we
should use as our model the distribution f (·; θ̂) that gives the greatest possible
probability to the training data. Formally,
θ̂ = argmaxθ L(θ; x1 , . . . , xn ).
Usually, we use the notation P (·) for a probability mass, and the notation p(·) for
a probability density. For mathematical convenience, write P(X) as

P(X = x) = θ^x (1 − θ)^{1−x}.
Suppose that the training data are x1 through xn where each xi ∈ {0, 1}. The
likelihood function is
L(θ; x1, . . . , xn) = f(x1, . . . , xn; θ) = ∏_{i=1}^{n} P(X = xi) = θ^h (1 − θ)^{n−h}

where h = Σ_{i=1}^{n} xi. The maximization is performed over the possible scalar
values 0 ≤ θ ≤ 1.
We can do the maximization by setting the derivative with respect to θ equal
to zero. The derivative is
(d/dθ) [ θ^h (1 − θ)^{n−h} ] = h θ^{h−1} (1 − θ)^{n−h} − θ^h (n − h) (1 − θ)^{n−h−1}
                             = θ^{h−1} (1 − θ)^{n−h−1} [ h(1 − θ) − (n − h)θ ]
which has solutions θ = 0, θ = 1, and θ = h/n. The solution which is a maximum
is clearly θ = h/n while θ = 0 and θ = 1 are minima. So we have the maximum
likelihood estimate θ̂ = h/n.
The log likelihood function, written l(·), is simply the logarithm of the likeli-
hood function L(·). Because logarithm is a monotonic strictly increasing function,
maximizing the log likelihood is precisely equivalent to maximizing the likeli-
hood, and also to minimizing the negative log likelihood.
For an example of maximizing the log likelihood, consider the problem of
estimating the parameters of a univariate Gaussian distribution. This distribution
is
f(x; µ, σ²) = (1 / (σ √(2π))) exp[ −(x − µ)² / (2σ²) ].
The log likelihood for one example x is
l(µ, σ²; x) = log L(µ, σ²; x) = − log σ − log √(2π) − (x − µ)² / (2σ²).
Suppose that we have training data {x1 , . . . , xn }. The maximum log likelihood
estimates are
⟨µ̂, σ̂²⟩ = argmax_{⟨µ,σ²⟩} [ −n log σ − n log √(2π) − (1/(2σ²)) Σ_{i=1}^{n} (xi − µ)² ].
Expanding Σ_{i=1}^{n}(xi − µ)² = Σ_{i=1}^{n}(xi − x̄)² + 2(x̄ − µ) Σ_{i=1}^{n}(xi − x̄) + n(x̄ − µ)², the first term Σ_{i=1}^{n}(xi − x̄)² does not depend on µ, so it is irrelevant to the minimization. The second term equals zero, because Σ_{i=1}^{n}(xi − x̄) = 0. The third term is always positive unless µ = x̄, so it is clear that the whole expression is minimized when µ = x̄.
To perform the second minimization, let T = Σ_{i=1}^{n}(xi − x̄)², work out the derivative symbolically, and then find where it equals zero:

(∂/∂σ) [ n log σ + (1/2) σ^{−2} T ] = n σ^{−1} + (1/2)(−2σ^{−3}) T
                                    = σ^{−1} ( n − T σ^{−2} )
                                    = 0  if σ² = T/n.
Maximum likelihood estimators are typically reasonable, but they may have issues. Consider the Gaussian variance estimator σ̂²_MLE = Σ_{i=1}^{n} (xi − x̄)²/n and the case where n = 1. In this case σ̂²_MLE = 0. This estimate is guaranteed to be too small. Intuitively, the estimate is optimistically assuming that all future data points x2 and so on will equal x1 exactly.
It can be proved that in general the maximum likelihood estimate of the vari-
ance of a Gaussian is too small, on average:
E[ (1/n) Σ_{i=1}^{n} (xi − x̄)² ; µ, σ² ] = ((n − 1)/n) σ² < σ².
This phenomenon can be considered an instance of overfitting: the observed
spread around the observed mean x̄ is less than the unknown true spread σ 2 around
the unknown true mean µ.
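A small simulation (an illustration, not from the text; the values of µ, σ², n are arbitrary) makes this bias visible: averaged over many samples, the MLE variance estimate comes out near (n − 1)/n · σ² rather than σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, trials = 0.0, 4.0, 5, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
# MLE of the variance: mean squared deviation around the sample mean
var_mle = np.mean((samples - samples.mean(axis=1, keepdims=True)) ** 2, axis=1)

print("average MLE variance:", var_mle.mean())        # close to (n-1)/n * sigma^2
print("(n-1)/n * sigma^2   :", (n - 1) / n * sigma2)  # = 3.2 here
```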
3 Conditional likelihood
An important extension of the idea of likelihood is conditional likelihood. Re-
member that the notation p(y|x) is an abbreviation for the conditional probability
p(Y = y|X = x) where Y and X are random variables. The conditional like-
lihood of θ given data x and y is L(θ; y|x) = p(y|x) = f (y|x; θ). Intuitively,
Y follows a probability distribution that is different for different x. Technically,
for each x there is a different function f (y|x; θ), but all these functions share the
same parameters θ. We assume that x itself is never unknown, so there is no need
to have a probabilistic model of it.
Given training data consisting of ⟨xi, yi⟩ pairs, the principle of maximum conditional likelihood says to choose a parameter estimate θ̂ that maximizes the product ∏_i f(yi|xi; θ). Note that we do not need to assume that the xi are independent
in order to justify the conditional likelihood being a product; we just need to as-
sume that the yi are independent when each is conditioned on its own xi . For
any specific value of x, θ̂ can then be used to compute probabilities for alternative
values y of Y . By assumption, we never want to predict values of x.
Suppose that Y is a binary (Bernoulli) outcome and that x is a real-valued
vector. We can assume that the probability that Y = 1 is a nonlinear function of a
linear function of x. Specifically, we assume the conditional model
p(Y = 1|x; α, β) = σ( α + Σ_{j=1}^{d} βj xj ) = 1 / ( 1 + exp(−[ α + Σ_{j=1}^{d} βj xj ]) )
where σ(z) = 1/(1 + e−z ) is the nonlinear function. This model is called logistic
regression. We use j to index over the feature values x1 to xd of a single example
of dimensionality d, since we use i below to index over training examples 1 to
n. If necessary, the notation xij means the jth feature value of the ith example.
Be sure to understand the distinction between a feature and a value of a feature.
Essentially a feature is a random variable, while a value of a feature is a possible
outcome of the random variable. Features may also be called attributes, predictors,
or independent variables. The dependent random variable Y is sometimes called
a dependent variable.
The logistic regression model is easier to understand in the form
log[ p / (1 − p) ] = α + Σ_{j=1}^{d} βj xj
where p is an abbreviation for p(Y = 1|x; α, β). The ratio p/(1 − p) is called
the odds of the event Y = 1 given X = x, and log[p/(1 − p)] is called the log
odds. Since probabilities range between 0 and 1, odds range between 0 and +∞, and log odds range unboundedly between −∞ and +∞. A linear expression of the form α + Σ_j βj xj can also take unbounded values, so it is reasonable to use
a linear expression as a model for log odds, but not as a model for odds or for
probabilities. Essentially, logistic regression is the simplest reasonable model for
a random yes/no outcome whose probability depends linearly on predictors x1 to
xd .
For each feature j, exp(βj xj ) is a multiplicative scaling factor on the odds
p/(1 − p). If the predictor xj is binary, then exp(βj ) is the extra odds of having
the outcome Y = 1 rather than Y = 0 when xj = 1, compared to when xj = 0.
If the predictor xj is real-valued, then exp(βj ) is the extra odds of having the
outcome Y = 1 when the value of xj increases by one unit. A major limitation
of the basic logistic regression model is that the probability p must either increase
monotonically, or decrease monotonically, as a function of each predictor xj . The
basic model does not allow the probability to depend in a U-shaped way on any
xj .
Given the training set {hx1 , y1 i, . . . , hxn , yn i}, we learn a logistic regression
classifier by maximizing the log joint conditional likelihood. This is the sum of
the log conditional likelihood for each training example:
LCL = Σ_{i=1}^{n} log L(θ; yi|xi) = Σ_{i=1}^{n} log f(yi|xi; θ).
Given a single training example hxi , yi i, the log conditional likelihood is log pi if
the true label yi = 1 and log(1 − pi ) if yi = 0, where pi = p(y = 1|xi ; θ).
To simplify the following discussion, assume from now on that α = β0 and
x0 = 1 for every example x, so the parameter vector θ is β ∈ Rd+1 . By group-
ing together the positive and negative training examples, we can write the total
conditional log likelihood as
LCL = Σ_{i: yi=1} log pi + Σ_{i: yi=0} log(1 − pi).
For an individual training example hx, yi, if its label y = 1 the partial derivative is
(∂/∂βj) log p = (1/p) (∂/∂βj) p

while if y = 0 it is

(∂/∂βj) log(1 − p) = − (1/(1 − p)) (∂/∂βj) p.
(∂/∂βj) p = (−1)(1 + e^{−Σ_j βj xj})^{−2} · e^{−Σ_j βj xj} · (∂/∂βj)[ −Σ_j βj xj ]
          = (1 + e^{−Σ_j βj xj})^{−2} · e^{−Σ_j βj xj} · xj
          = p (1 − p) xj.
So
∂ ∂
log p = (1 − p)xj and log(1 − p) = −pxj .
∂βj ∂βj
For the entire training set the partial derivative of the log conditional likelihood
with respect to βj is
(∂/∂βj) LCL = Σ_{i: yi=1} (1 − pi) xij + Σ_{i: yi=0} (−pi) xij = Σ_i (yi − pi) xij
where xij is the value of the jth feature of the ith training example. Setting the
partial derivative to zero yields
Σ_i yi xij = Σ_i pi xij.
We have one equation of this type for each parameter βj . The equations can be
used to check the correctness of a trained model.
Informally, but not precisely, the expression Σ_i yi xij is the average value over the training set of the jth feature, where each training example is weighted 1 if its true label is positive, and 0 otherwise. The expression Σ_i pi xij is the same average, except that each example i is weighted according to its predicted probability pi of being positive. When the logistic regression classifier is trained correctly, then these two averages must be the same for every feature. The special case for j = 0 gives
(1/n) Σ_i yi = (1/n) Σ_i pi.
In words, the empirical base rate probability of being positive must equal the
average predicted probability of being positive.
4 Stochastic gradient training
There are several sophisticated ways of actually doing the maximization of the to-
tal conditional log likelihood, i.e. the conditional log likelihood summed over all
training examples hxi , yi i; for details see [Minka, 2007, Komarek and Moore, 2005].
However, here we consider a method called stochastic gradient ascent. This
method changes the parameter values to increase the log likelihood based on one
example at a time. It is called stochastic because the derivative based on a ran-
domly chosen single example is a random approximation to the true derivative
based on all the training data.
As explained in the previous section, the partial derivative of the log condi-
tional likelihood with respect to βj is
(∂/∂βj) LCL = Σ_i (yi − pi) xij
where xij is the value of the jth feature of the ith training example. The gradient-
based update of the parameter βj is
βj := βj + λ (∂/∂βj) LCL
where λ is a step size. A major problem with this approach is the time complexity of computing the partial derivatives. Evaluating Σ_i (yi − pi) xij for all j requires
O(nd) time where n is the number of training examples and d is their dimen-
sionality. Typically, after this evaluation, each βj can be changed by only a small
amount. The partial derivatives must then be evaluated again, at high computa-
tional cost again, before updating βj further.
The stochastic gradient idea is that we can get a random approximation of the
partial derivatives in much less than O(nd) time, so the parameters can be updated
much more rapidly. In general, for each parameter βj we want to define a random
variable Zj such that
E[Zj] = (∂/∂βj) LCL.
For logistic regression, one such Zj is n(yi − pi )xij where i is chosen randomly,
with uniform probability, from the set {1, 2, . . . , n}. Based on this, the stochastic
gradient update of βj is
βj := βj + λ (yi − pi) xij
where i is selected randomly and n has been dropped since it is a constant. As be-
fore, the learning rate λ is a multiplier that controls the magnitude of the changes
to the parameters.
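Putting the pieces together, here is a minimal stochastic gradient ascent sketch for logistic regression on synthetic data (an illustration, not code from these notes). It applies the per-example update βj := βj + λ(yi − pi)xij derived above, with x0 = 1 folded in as the intercept, and afterwards checks (approximately, since SGD does not reach the exact maximum) the moment-matching property Σ_i yi xij = Σ_i pi xij from the previous section:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: n examples, d features, plus a constant feature x0 = 1
n, d = 1000, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
beta_true = np.array([-0.5, 1.0, -2.0, 0.5])
p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = (rng.uniform(size=n) < p_true).astype(float)

beta = np.zeros(d + 1)
lam = 0.1                                # learning rate (chosen by trial and error)
for epoch in range(10):
    for i in rng.permutation(n):         # random order each epoch
        p_i = 1.0 / (1.0 + np.exp(-X[i] @ beta))
        beta += lam * (y[i] - p_i) * X[i]    # stochastic gradient ascent step

p = 1.0 / (1.0 + np.exp(-X @ beta))
print("learned beta:", np.round(beta, 2))
# moment-matching check: these two vectors should be approximately equal
print("sum_i y_i x_ij:", np.round(y @ X, 1))
print("sum_i p_i x_ij:", np.round(p @ X, 1))
```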
Stochastic gradient ascent (or descent, for a minimization problem) is a method
that is often useful in machine learning. Experience suggests some heuristics for
making it work well in practice.
• The training examples are sorted in random order, and the parameters are
updated for each example sequentially. One complete update for every ex-
ample is called an epoch. Typically, a small constant number of epochs is
used, perhaps 3 to 100 epochs.
• The learning rate is chosen by trial and error. It can be kept constant across
all epochs, e.g. λ = 0.1 or λ = 1, or it can be decreased gradually as a
function of the epoch number.
• Because the learning rate is the same for every parameter, it is useful to
scale the features xj so that their magnitudes are similar for all j. Given
that the feature x0 has constant value 1, it is reasonable to normalize every
other feature to have mean zero and variance 1, for example.
Stochastic gradient ascent (or descent) has some properties that are very useful in
practice. First, suppose that xj = 0 for most features j of a training example x.
Then updating βj based on x can be skipped. This means that the time to do one
epoch is O(nf d) where n is the number of training examples, d is the number of features, and f is the average fraction of nonzero feature values per example. If an example x is the bag-of-words representation of a document, then d is the size of the vocabulary (often over 30,000) but f d is the average number of words actually used in a document (often under 300).
Second, suppose that the number n of training examples is very large, as is the
case in many modern applications. Then, a stochastic gradient method may con-
verge to good parameter estimates in less than one epoch of training. In contrast,
a training method that computes the log likelihood of all data and uses this in the
same way regardless of n will be inefficient in how it uses the data.
For each example, a stochastic gradient method updates all parameters once.
The dual idea is to update one parameter at a time, based on all examples. This
method is called coordinate ascent (or descent). For feature j the update rule is
βj := βj + λ Σ_i (yi − pi) xij.
The update for the whole parameter vector β̄ is
β̄ := β̄ + λ(ȳ − p̄)T X
where the matrix X is the entire training set and the column vector ȳ consists of
the 0/1 labels for every training example. Often, coordinate ascent converges too
slowly to be useful. However, it can be useful to do one update of β̄ after all
epochs of stochastic gradient ascent.
Regardless of the method used to train a model, it is important to remember
that optimizing the model perfectly on the training data usually does not lead to
the best possible performance on test examples. There are several reasons for this:
• The model with best possible performance may not belong to the family of
models under consideration. This is an instance of the principle “you cannot
learn it if you cannot represent it.”
• The training data may not be representative of the test data, i.e. the training
and test data may be samples from different populations.
• The objective function for training, namely log likelihood or conditional log
likelihood, may not be the desired objective from an application perspective;
for example, the desired objective may be classification accuracy.
5 Regularization
Consider learning a logistic regression classifier for categorizing documents. Sup-
pose that word number j appears only in documents whose labels are positive. The
partial derivative of the log conditional likelihood with respect to the parameter
for this word is
(∂/∂βj) LCL = Σ_i (yi − pi) xij.

For this word, every term in the sum is non-negative (it is (1 − pi)xij for the positive documents containing the word and zero elsewhere), so the derivative stays positive no matter how large βj becomes, and maximizing the likelihood drives βj toward +∞.
There is a standard method for solving this overfitting problem that is quite simple, but quite successful. The solution is called regularization. The idea is to impose a penalty on the magnitude of the parameter values. This penalty should be minimized, in a trade-off with maximizing likelihood. Mathematically, the optimization problem to be solved is

β̂ = argmax_β [ Σ_{i=1}^{n} log p(yi|xi; β) − µ Σ_{j=0}^{d} βj² ]

where µ is a constant that controls the strength of the penalty.
Remember that for logistic regression the partial derivative of the log conditional
likelihood for one example is
(∂/∂βj) log p(y|x; β) = (y − p) xj
so the per-example stochastic gradient update takes the form

βj := βj + λ [ (y − p) xj − 2µ βj ]

where λ is the learning rate. Update rules like the one above are often called "weight decay" rules, since the weight βj is made smaller at each update unless y − p has the same sign as xj.
Straightforward stochastic gradient ascent for training a regularized logistic
regression model loses the desirable sparsity property described above, because
the value of every parameter βj must be decayed for every training example. How
to overcome this computational inefficiency is described in [Carpenter, 2008].
Writing the regularized optimization problem as a minimization gives
β̂ = argmin_β [ − Σ_{i=1}^{n} log p(yi|xi; β) + µ Σ_{j=0}^{d} βj² ].
The expression − log p(yi |xi ; β) is called the “loss” for training example i. If the
predicted probability, using β, of the true label yi is close to 1, then the loss is
small. But if the predicted probability of yi is close to 0, then the loss is large.
Losses are always non-negative; we want to minimize them. We also want to
minimize the numerical magnitude of the trained parameters.
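For concreteness, a short sketch (mine, not from the text) of the regularized objective written as a minimization, together with the corresponding per-example weight-decay step; the values of λ and µ are arbitrary, and x, y, beta stand for one example and the current parameter vector:

```python
import numpy as np

def regularized_loss(beta, X, y, mu):
    # sum of per-example losses -log p(y_i | x_i; beta) plus the L2 penalty
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return -log_lik + mu * np.sum(beta ** 2)

def weight_decay_step(beta, x, y, lam=0.1, mu=0.01):
    # one stochastic step on a single example: data term plus shrinkage of every weight
    p = 1.0 / (1.0 + np.exp(-x @ beta))
    return beta + lam * ((y - p) * x - 2 * mu * beta)

# tiny usage example on made-up numbers
X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, 0.0])
beta = np.zeros(2)
beta = weight_decay_step(beta, X[0], y[0])
print(beta, regularized_loss(beta, X, y, mu=0.01))
```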
References
[Carpenter, 2008] Carpenter, B. (2008). Lazy sparse stochastic gradi-
ent descent for regularized multinomial logistic regression. Techni-
cal report, Alias-i. Available at http://lingpipe-blog.com/
lingpipe-white-papers/.
[Komarek and Moore, 2005] Komarek, P. and Moore, A. W. (2005). Making lo-
gistic regression a core data mining tool with TR-IRLS. In Proceedings of
the Fifth IEEE International Conference on Data Mining (ICDM’05), pages
685–688.
[3 points] In one or two sentences, explain intuitively the reason why this is the
MLE. You do not need to use any equations.
Note: The MLE above is an example of overfitting, since the true value of θ is
almost certainly larger than the MLE.
CSE 250B Quiz 4, January 27, 2011
Write your name:
Assume that winning or losing a basketball game is similar to flipping a biased
coin. Suppose that San Diego State University (SDSU) has won all six games that
it has played.
(a) The maximum likelihood estimate (MLE) of the probability that SDSU will
win its next game is 1.0. Explain why this is the MLE. (Using equations is not
required.)
[3 points] Work out the partial derivative of the objective function with respect to
weight wj .
E = [ Σ_{i=1}^{n} ( fi² − 2 fi yi + yi² ) ] + µ Σ_{j=0}^{d} wj².
Work out the stochastic gradient update rule as specifically as possible, when
the error function is absolute error: e(z) = |z|.
Hint: Use the notation e′(z) for the derivative of e(z).
Answer: For each component wj of the parameter vector w, the update rule is

wj := wj − α (∂/∂wj) E

which, for a single example with absolute error, becomes

wj := wj − α sign(f(x; w) − y) (∂/∂wj) f.
Rob Schapire
Princeton University
Example: “How May I Help You?”
[Gorin et al.]
• goal: automatically categorize type of call requested by phone
customer (Collect, CallingCard, PersonToPerson, etc.)
• yes I’d like to place a collect call long distance
please (Collect)
• operator I need to make a call but I need to bill
it to my office (ThirdNumber)
• yes I’d like to place a call on my master card
please (CallingCard)
• I just called a number in sioux city and I musta
rang the wrong number because I got the wrong
party and I would like to have that taken off of
my bill (BillingCredit)
• observation:
• easy to find “rules of thumb” that are “often” correct
• e.g.: “IF ‘card’ occurs in utterance
THEN predict ‘CallingCard’ ”
• hard to find single highly accurate prediction rule
The Boosting Approach
• brief background
• basic algorithm and core theory
• other ways of understanding boosting
• experiments, applications and extensions
Brief Background
Strong and Weak Learnability
• [Schapire ’89]:
• first provable boosting algorithm
• [Freund ’90]:
• “optimal” algorithm that “boosts by majority”
• [Drucker, Schapire & Simard ’92]:
• first experiments using boosting
• limited by practical drawbacks
AdaBoost
• [Freund & Schapire ’95]:
introduced “AdaBoost” algorithm
•
strong practical advantages over previous boosting
•
algorithms
• experiments and applications using AdaBoost:
[Drucker & Cortes ’96] [Abney, Schapire & Singer ’99] [Tieu & Viola ’00]
[Jackson & Craven ’96] [Haruno, Shirai & Ooyama ’99] [Walker, Rambow & Rogati ’01]
[Freund & Schapire ’96] [Cohen & Singer’ 99] [Rochery, Schapire, Rahim & Gupta ’01]
[Quinlan ’96] [Dietterich ’00] [Merler, Furlanello, Larcher & Sboner ’01]
[Breiman ’96] [Schapire & Singer ’00] [Di Fabbrizio, Dutton, Gupta et al. ’02]
[Maclin & Opitz ’97] [Collins ’00] [Qu, Adam, Yasui et al. ’02]
[Bauer & Kohavi ’97] [Escudero, Màrquez & Rigau ’00] [Tur, Schapire & Hakkani-Tür ’03]
[Schwenk & Bengio ’98] [Iyer, Lewis, Schapire et al. ’00] [Viola & Jones ’04]
[Schapire, Singer & Singhal ’98] [Onoda, Rätsch & Müller ’00] [Middendorf, Kundaje, Wiggins et al. ’04]
...
• continuing development of theory and algorithms:
[Breiman ’98, ’99] [Duffy & Helmbold ’99, ’02] [Koltchinskii, Panchenko & Lozano ’01]
[Schapire, Freund, Bartlett & Lee ’98] [Freund & Mason ’99] [Collins, Schapire & Singer ’02]
[Grove & Schuurmans ’98] [Ridgeway, Madigan & Richardson ’99] [Demiriz, Bennett & Shawe-Taylor ’02]
[Mason, Bartlett & Baxter ’98] [Kivinen & Warmuth ’99] [Lebanon & Lafferty ’02]
[Schapire & Singer ’99] [Friedman, Hastie & Tibshirani ’00] [Wyner ’02]
[Cohen & Singer ’99] [Rätsch, Onoda & Müller ’00] [Rudin, Daubechies & Schapire ’03]
[Freund & Mason ’99] [Rätsch, Warmuth, Mika et al. ’00] [Jiang ’04]
[Domingo & Watanabe ’99] [Allwein, Schapire & Singer ’00] [Lugosi & Vayatis ’04]
[Mason, Baxter, Bartlett & Frean ’99] [Friedman ’01] [Zhang ’04]
...
Basic Algorithm and Core Theory
• introduction to AdaBoost
• analysis of training error
• analysis of test error based on
margins theory
A Formal Description of Boosting
• constructing Dt :
• D1 (i ) = 1/m
AdaBoost
[with Freund]
• constructing Dt :
  • D1(i) = 1/m
  • given Dt and ht :

    Dt+1(i) = (Dt(i)/Zt) × { e^{−αt} if yi = ht(xi) ; e^{αt} if yi ≠ ht(xi) }
            = (Dt(i)/Zt) exp(−αt yi ht(xi))

    where Zt = normalization constant and

    αt = (1/2) ln( (1 − εt)/εt ) > 0
• final classifier:

  Hfinal(x) = sign( Σ_t αt ht(x) )
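The update rule on these slides translates almost line for line into code. Below is a compact AdaBoost sketch over decision stumps on a synthetic problem (my illustration, not the lecture's code): it computes εt as the weighted error, sets αt = ½ ln((1 − εt)/εt), reweights with Dt+1(i) ∝ Dt(i) exp(−αt yi ht(xi)), and predicts with sign(Σt αt ht(x)):

```python
import numpy as np

rng = np.random.default_rng(0)

def best_stump(X, y, D):
    """Weak learner: the single-feature threshold stump with lowest weighted error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = np.sum(D[pred != y])
                if err < best_err:
                    best, best_err = (j, thr, s), err
    return best, best_err

def stump_predict(stump, X):
    j, thr, s = stump
    return s * np.where(X[:, j] <= thr, 1.0, -1.0)

# synthetic data: +1 inside one quadrant, -1 outside (no single stump is perfect)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where((X[:, 0] > 0) & (X[:, 1] > 0), 1.0, -1.0)

m = len(y)
D = np.full(m, 1.0 / m)                   # D1(i) = 1/m
stumps, alphas = [], []
for t in range(20):
    stump, eps = best_stump(X, y, D)
    eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard against division by zero
    alpha = 0.5 * np.log((1 - eps) / eps)
    pred = stump_predict(stump, X)
    D = D * np.exp(-alpha * y * pred)
    D /= D.sum()                           # Z_t normalization
    stumps.append(stump)
    alphas.append(alpha)

f = sum(a * stump_predict(s, X) for a, s in zip(alphas, stumps))
print("training error:", np.mean(np.sign(f) != y))
```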
Toy Example
[figure: three rounds of AdaBoost on a toy 2-D dataset; round 1: weak classifier h1 with ε1 = 0.30, α1 = 0.42; round 2: h2 with ε2 = 0.21, α2 = 0.65; round 3: h3 with ε3 = 0.14, α3 = 0.92; distributions D1, D2, D3 reweight the misclassified points]
Final Classifier
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
final
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
$ $ $ $ $ $ $ $ $ & &
! ! % % % % % % % % % ' '
( ( ( " ( " ( " ( " ( " ( " ( " ( " ( " ( " " " "
) ) ) ) # ) # ) # ) # ) # ) # ) # ) # ) # # # #
( ( ( " ( " ( " ( " ( " ( " ( " ( " ( " ( " " " "
) ) ) ) # ) # ) # ) # ) # ) # ) # ) # ) # # # #
( ( ( " ( " ( " ( " ( " ( " ( " ( " ( " ( " " " "
) ) ) ) # ) # ) # ) # ) # ) # ) # ) # ) # # # #
( ( ( " ( " ( " ( " ( " ( " ( " ( " ( " ( " " " "
) ) ) ) # ) # ) # ) # ) # ) # ) # ) # ) # # # #
( ( ( " ( " ( " ( " ( " ( " ( " ( " ( " ( " " " "
) ) ) ) # ) # ) # ) # ) # ) # ) # ) # ) # # # #
( ( ( " ( " ( " ( " ( " ( " ( " ( " ( " ( " " " "
) ) ) ) # ) # ) # ) # ) # ) # ) # ) # ) # # # #
( ( ( " ( " ( " ( " ( " ( " ( " ( " ( " ( " " " "
) ) ) ) # ) # ) # ) # ) # ) # ) # ) # ) # # # #
( ( ( " ( " ( " ( " ( " ( " ( " ( " ( " ( " " " "
) ) ) ) # ) # ) # ) # ) # ) # ) # ) # ) # # # #
( ( ( " ( " ( " ( " ( " ( " ( " ( " ( " ( " " " "
) ) ) ) # ) # ) # ) # ) # ) # ) # ) # ) # # # #
( ( ( " ( " ( " ( " ( " ( " ( " ( " ( " ( " " " "
) ) ) ) # ) # ) # ) # ) # ) # ) # ) # ) # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
=
"
"
"
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
"
#
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
" " " " " " " " " " " " "
# # # # # # # # # # # #
Analyzing the training error
• Theorem:
  • write εt as 1/2 − γt
  • then

    training error(Hfinal) ≤ ∏_t [ 2√(εt(1 − εt)) ]
                           = ∏_t √(1 − 4γt²)
                           ≤ exp( −2 Σ_t γt² )

• so: if ∀t : γt ≥ γ > 0
  then training error(Hfinal) ≤ e^{−2γ²T}
• AdaBoost is adaptive:
  • does not need to know γ or T a priori
  • can exploit γt ≫ γ
Proof
• let f(x) = Σ_t αt ht(x)  ⇒  Hfinal(x) = sign(f(x))
• Step 1: unwrapping the recurrence:

  Dfinal(i) = (1/m) · exp( −yi Σ_t αt ht(xi) ) / ∏_t Zt
Proof (cont.)
• Step 2: training error(Hfinal) ≤ ∏_t Zt
• Proof:

  training error(Hfinal) = (1/m) Σ_i [ 1 if yi ≠ Hfinal(xi), else 0 ]
                         = (1/m) Σ_i [ 1 if yi f(xi) ≤ 0, else 0 ]
                         ≤ (1/m) Σ_i exp(−yi f(xi))
                         = Σ_i Dfinal(i) ∏_t Zt
                         = ∏_t Zt
Proof (cont.)
• Step 3: Zt = 2√(εt(1 − εt))
• Proof:

  Zt = Σ_i Dt(i) exp(−αt yi ht(xi))
     = Σ_{i: yi ≠ ht(xi)} Dt(i) e^{αt} + Σ_{i: yi = ht(xi)} Dt(i) e^{−αt}
     = εt e^{αt} + (1 − εt) e^{−αt}
     = 2√(εt(1 − εt))
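A quick numeric check of Steps 2 and 3 (mine, not from the slides), using the εt values from the toy example above: the αt values match the slides, and the product of the Zt gives an upper bound of roughly 0.52 on the training error after three rounds:

```python
import math

eps = [0.30, 0.21, 0.14]                          # weak-learner errors from the toy example
Z = [2 * math.sqrt(e * (1 - e)) for e in eps]     # Z_t = 2 sqrt(eps_t (1 - eps_t))
alphas = [0.5 * math.log((1 - e) / e) for e in eps]

print("alpha_t:", [round(a, 2) for a in alphas])  # ~ 0.42, 0.65, 0.92 as on the slides
print("prod Z_t:", round(math.prod(Z), 3))        # upper bound on the training error
```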
How Will Test Error Behave? (A First Guess)
[figure: idealized plot of training and test error versus the number of rounds T, with training error decreasing toward zero and test error eventually increasing]
expect:
• training error to continue to drop (or reach zero)
• test error to increase when Hfinal becomes “too complex”
• “Occam’s razor”
• overfitting
• hard to know when to stop training
Actual Typical Run
[figure: boosting C4.5 on the “letter” dataset; training error reaches zero, yet test error keeps decreasing even out to 1000 rounds]
A Better Story: The Margins Explanation
[with Freund, Bartlett & Lee]
• key idea:
• training error only measures whether classifications are
right or wrong
• should also consider confidence of classifications
• recall: Hfinal is weighted majority vote of weak classifiers
• measure confidence by margin = strength of the vote
= (fraction voting correctly) − (fraction voting incorrectly)
[figure: left, train and test error of boosting C4.5 on the “letter” dataset versus number of rounds; right, cumulative distributions of margins after 5, 100, and 1000 rounds]

# rounds            5      100    1000
train error (%)     0.0    0.0    0.0
test error (%)      8.4    3.3    3.1
% margins ≤ 0.5     7.7    0.0    0.0
minimum margin      0.14   0.52   0.55
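Margins are easy to compute from a trained ensemble. In the sketch below (an illustration, not from the slides), preds holds the ±1 votes of each weak classifier on each example and alphas their weights; the normalized margin y · Σt αt ht(x) / Σt αt equals the weighted fraction voting correctly minus the weighted fraction voting incorrectly:

```python
import numpy as np

def margins(preds, alphas, y):
    """Normalized voting margins in [-1, 1].

    preds  : shape (T, m), entry (t, i) is h_t(x_i) in {-1, +1}
    alphas : shape (T,), the vote weights alpha_t
    y      : shape (m,), true labels in {-1, +1}
    """
    preds, alphas, y = np.asarray(preds), np.asarray(alphas), np.asarray(y)
    f = alphas @ preds                  # weighted vote for each example
    return y * f / np.sum(alphas)       # correct-vote weight minus incorrect-vote weight

# tiny illustration with 3 weak classifiers and 4 examples
preds = [[+1, -1, +1, +1],
         [+1, +1, -1, +1],
         [-1, +1, +1, +1]]
alphas = [0.42, 0.65, 0.92]
y = [+1, +1, +1, -1]
print(np.round(margins(preds, alphas, y), 2))
```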
Theoretical Evidence: Analyzing Boosting Using Margins
generalization error ≤ P̂r[ margin ≤ θ ] + Õ( √(d/m) / θ )
• game theory
• loss minimization
• estimating conditional probabilities
Game Theory
• game defined by matrix M:
Rock Paper Scissors
Rock 1/2 1 0
Paper 0 1/2 1
Scissors 1 0 1/2
• row player chooses row i
• column player chooses column j
(simultaneously)
• row player’s goal: minimize loss M(i , j)
• usually allow randomized play:
• players choose distributions P and Q over rows and
columns
• learner’s (expected) loss = Σ_{i,j} P(i) M(i, j) Q(j) = Pᵀ M Q ≡ M(P, Q)
The Minmax Theorem
• von Neumann’s minmax theorem:

  min_P max_Q M(P, Q) = max_Q min_P M(P, Q) ≡ v  (the “value” of the game)
• in words:
• v = min max means:
• row player has strategy P∗
such that ∀ column strategy Q
loss M(P∗ , Q) ≤ v
• v = max min means:
  • this is optimal in the sense that the
    column player has strategy Q∗
    such that ∀ row strategy P,
    loss M(P, Q∗) ≥ v
The Boosting Game
• let {g1 , . . . , gN } = space of all weak classifiers
• row player ↔ booster
• column player ↔ weak learner
• matrix M:
• row ↔ example (xi , yi )
• column ↔ weak classifier gj
• M(i, j) = { 1 if yi = gj(xi) ; 0 else }
[figure: the m × N game matrix M, rows indexed by the training examples (xi, yi) (the booster’s choices), columns by the weak classifiers g1, . . . , gN (the weak learner’s choices), with entry M(i, j)]
Boosting and the Minmax Theorem
• if:
  • ∀ distributions over examples
    ∃ h with accuracy ≥ 1/2 + γ
• then:
  • min_P max_j M(P, j) ≥ 1/2 + γ
• by minmax theorem:
  • max_Q min_i M(i, Q) ≥ 1/2 + γ > 1/2
• which means:
  • ∃ weighted majority of classifiers which correctly classifies
    all examples with positive margin (2γ)
• optimal margin ↔ “value” of game
AdaBoost and Game Theory
[with Freund]
Coordinate Descent
[Breiman]
f ← f − α∇f L(f )
Functional Gradient Descent
[Friedman][Mason et al.]
• want to minimize
  L(f) = L(f(x1), . . . , f(xm)) = Σ_i exp(−yi f(xi))
f ← f − α∇f L(f )
f ← f + αht
• dynamical systems
• statistical consistency
• maximum entropy
Experiments, Applications and Extensions
• basic experiments
• multiclass classification
• confidence-rated predictions
• text categorization /
spoken-dialogue systems
• incorporating prior knowledge
• active learning
• face detection
Practical Advantages of AdaBoost
• fast
• simple and easy to program
• no parameters to tune (except T )
• flexible — can combine with any learning algorithm
• no prior knowledge needed about weak learner
• provably effective, provided can consistently find rough rules
of thumb
→ shift in mind set — goal now is merely to find classifiers
barely better than random guessing
• versatile
• can use with data that is textual, numeric, discrete, etc.
• has been extended to learning problems well beyond
binary classification
Caveats
[figure: two scatter plots of test error (%), comparing boosting against C4.5 across benchmark datasets (both axes 0–30)]
Multiclass Problems
[with Freund]
• say y ∈ Y = {1, . . . , k}
• direct approach (AdaBoost.M1):
  ht : X → Y

  Dt+1(i) = (Dt(i)/Zt) · { e^{−αt} if yi = ht(xi) ; e^{αt} if yi ≠ ht(xi) }

  Hfinal(x) = arg max_{y∈Y} Σ_{t: ht(x)=y} αt
• can prove:

  training error(Hfinal) ≤ (k/2) · ∏_t Zt
  Dt+1(i) = (Dt(i)/Zt) · exp(−αt yi ht(xi))

  and identical rule for combining weak classifiers
• question: how to choose αt and ht on each round
Confidence-rated Predictions (cont.)
• saw earlier:

  training error(Hfinal) ≤ ∏_t Zt = (1/m) Σ_i exp( −yi Σ_t αt ht(xi) )
2 card
3 my home
4 person ? person
5 code
6 I
More Weak Classifiers
rnd term AC AS BC CC CO CM DM DI HO PP RA 3N TI TC OT
7 time
8 wrong number
9 how
10 call
11 seven
12 trying to
13 and
More Weak Classifiers
rnd term AC AS BC CC CO CM DM DI HO PP RA 3N TI TC OT
14 third
15 to
16 for
17 charges
18 dial
19 just
Finding Outliers
examples with most weight are often outliers (mislabeled and/or
ambiguous)
• I’m trying to make a credit card call (Collect)
• hello (Rate)
• yes I’d like to make a long distance collect call
please (CallingCard)
• calling card please (Collect)
• yeah I’d like to use my calling card number (Collect)
• can I get a collect call (CallingCard)
• yes I would like to make a long distant telephone call
and have the charges billed to another number
(CallingCard DialForMe)
• yeah I can not stand it this morning I did oversea
call is so bad (BillingCredit)
• yeah special offers going on for long distance
(AttService Rate)
• mister allen please william allen (PersonToPerson)
• yes ma’am I I’m trying to make a long distance call to
a non dialable point in san miguel philippines
(AttService Other)
Application: Human-computer Spoken Dialogue
[with Rahim, Di Fabbrizio, Dutton, Gupta, Hollister & Riccardi]
[diagram: spoken-dialogue system loop: automatic speech recognizer → text → natural language understanding → predicted category → dialogue manager → text response → text-to-speech]
[figure: % error rate versus number of training examples (100 to 10000), comparing data+knowledge, knowledge only, and data only]
Results: Helpdesk
[figure: classification accuracy versus number of training examples (0 to 2500), comparing data + knowledge, data only, and knowledge only]
Problem: Labels are Expensive
• idea:
• use selective sampling to choose which examples to label
• focus on least confident examples [Lewis & Gale]
• for boosting, use (absolute) margin |f (x)| as natural
confidence measure
[Abe & Mamitsuka]
Labeling Scheme
[figures: % error rate versus number of labeled examples on two tasks, illustrating the margin-based selective sampling scheme]
Abstract
Boosting is a general method for improving the accuracy of any given
learning algorithm. Focusing primarily on the AdaBoost algorithm, this
chapter overviews some of the recent work on boosting including analyses
of AdaBoost’s training error and generalization error; boosting’s connection
to game theory and linear programming; the relationship between boosting
and logistic regression; extensions of AdaBoost for multiclass classification
problems; methods of incorporating human knowledge into boosting; and
experimental and applied work using boosting.
1 Introduction
Machine learning studies automatic techniques for learning to make accurate pre-
dictions based on past observations. For example, suppose that we would like to
build an email filter that can distinguish spam (junk) email from non-spam. The
machine-learning approach to this problem would be the following: Start by gath-
ering as many examples as possible of both spam and non-spam emails. Next, feed
these examples, together with labels indicating if they are spam or not, to your
favorite machine-learning algorithm which will automatically produce a classifi-
cation or prediction rule. Given a new, unlabeled email, such a rule attempts to
predict if it is spam or not. The goal, of course, is to generate a rule that makes the
most accurate predictions possible on new test examples.
Building a highly accurate prediction rule is certainly a difficult task. On the
other hand, it is not hard at all to come up with very rough rules of thumb that
are only moderately accurate. An example of such a rule is something like the
following: “If the phrase ‘buy now’ occurs in the email, then predict it is spam.”
Such a rule will not even come close to covering all spam messages; for instance,
it really says nothing about what to predict if ‘buy now’ does not occur in the
message. On the other hand, this rule will make predictions that are significantly
better than random guessing.
Boosting, the machine-learning method that is the subject of this chapter, is
based on the observation that finding many rough rules of thumb can be a lot easier
than finding a single, highly accurate prediction rule. To apply the boosting ap-
proach, we start with a method or algorithm for finding the rough rules of thumb.
The boosting algorithm calls this “weak” or “base” learning algorithm repeatedly,
each time feeding it a different subset of the training examples (or, to be more pre-
cise, a different distribution or weighting over the training examples 1 ). Each time
it is called, the base learning algorithm generates a new weak prediction rule, and
after many rounds, the boosting algorithm must combine these weak rules into a
single prediction rule that, hopefully, will be much more accurate than any one of
the weak rules.
To make this approach work, there are two fundamental questions that must be
answered: first, how should each distribution be chosen on each round, and second,
how should the weak rules be combined into a single rule? Regarding the choice
of distribution, the technique that we advocate is to place the most weight on the
examples most often misclassified by the preceding weak rules; this has the effect
of forcing the base learner to focus its attention on the “hardest” examples. As
for combining the weak rules, simply taking a (weighted) majority vote of their
predictions is natural and effective.
There is also the question of what to use for the base learning algorithm, but
this question we purposely leave unanswered so that we will end up with a general
boosting procedure that can be combined with any base learning algorithm.
Boosting refers to a general and provably effective method of producing a very
accurate prediction rule by combining rough and moderately inaccurate rules of
thumb in a manner similar to that suggested above. This chapter presents an
overview of some of the recent work on boosting, focusing especially on the Ada-
Boost algorithm which has undergone intense theoretical study and empirical test-
ing.
1
A distribution over training examples can be used to generate a subset of the training examples
simply by sampling repeatedly from the distribution.
!#"%$&'($*)
Given: ,-
./$01
where ,
Initialize +
$&
43
.
For 2 :
5
Train base learner using<distribution
;>=
+76 .
5
Get base classifier
=
8 6:9 .
5
Choose ? 6 .
5
Update: ,-
DCFEHG-"
,-
B +(6 ?I6 8J6
+ 6A@ K
6
K
where 6 is a normalization factor (chosen so that + 6A@ will be a distribu-
tion).
L M
.ONQPSR#TVUX W Z
-[\
? 6 8 6
6AY
2 AdaBoost
Working in Valiant’s PAC (probably approximately correct) learning model [75],
Kearns and Valiant [41, 42] were the first to pose the question of whether a “weak”
learning algorithm that performs just slightly better than random guessing can be
“boosted” into an arbitrarily accurate “strong” learning algorithm. Schapire [66]
came up with the first provable polynomial-time boosting algorithm in 1989. A
year later, Freund [26] developed a much more efficient boosting algorithm which,
although optimal in a certain sense, nevertheless suffered like Schapire’s algorithm
from certain practical drawbacks. The first experiments with these early boosting
algorithms were carried out by Drucker, Schapire and Simard [22] on an OCR task.
The AdaBoost algorithm, introduced in 1995 by Freund and Schapire [32],
solved many of the practical difficulties of the earlier boosting algorithms, and is
the focus of this paper. Pseudocode for AdaBoost is given in Fig. 1 in the slightly
generalized form given by Schapire and Singer [70].
The algorithm takes as input a training set (x_1, y_1), . . . , (x_m, y_m) where each x_i belongs
to some domain or instance space X, and each label y_i is in some label set Y. For most of this paper,
we assume Y = {−1, +1}; in Section 7, we discuss extensions to the multiclass
case. AdaBoost calls a given weak or base learning algorithm repeatedly in a series
of rounds t = 1, . . . , T. One of the main ideas of the algorithm is to maintain a
distribution or set of weights over the training set. The weight of this distribution on
training example i on round t is denoted D_t(i). Initially, all weights are set equally, but on
each round the weights of incorrectly classified examples are increased so that the base
learner is forced to focus on the hard examples in the training set.
Once the base classifier h_t has been received, AdaBoost chooses a parameter
α_t ∈ ℝ that intuitively measures the importance that it assigns to h_t. In the figure,
we have deliberately left the choice of α_t unspecified. For binary h_t, we typically set

      α_t = (1/2) ln( (1 − ε_t) / ε_t )   (1)

where ε_t is the weighted error of h_t with respect to D_t, as in the original description of
AdaBoost given by Freund and Schapire [32]. More on choosing α_t follows in Section 3.
The distribution D_t is then updated using the rule shown in the figure. The final or
combined classifier H is a weighted majority vote of the base classifiers where α_t is the
weight assigned to h_t.
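To make the procedure of Fig. 1 concrete, here is a minimal Python sketch (ours, not from the original papers). It assumes a hypothetical train_stump weak learner that fits a one-feature threshold classifier ("decision stump") to the weighted sample, and it sets α_t by Eq. (1).

import numpy as np

def train_stump(X, y, D):
    # Hypothetical weak learner: pick the feature, threshold and sign whose
    # weighted error under the distribution D is smallest.
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    h = lambda Z, j=j, thr=thr, sign=sign: np.where(Z[:, j] <= thr, sign, -sign)
    return h, err

def adaboost(X, y, T=100):
    # X: m-by-d array of examples; y: length-m array of labels in {-1, +1}.
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    hs, alphas = [], []
    for t in range(T):
        h, eps = train_stump(X, y, D)          # base classifier and weighted error
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # guard against eps = 0 or 1
        alpha = 0.5 * np.log((1 - eps) / eps)  # Eq. (1)
        D *= np.exp(-alpha * y * h(X))         # update rule of Fig. 1
        D /= D.sum()                           # normalization by Z_t
        hs.append(h); alphas.append(alpha)
    H = lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))
    return H, hs, alphas

Calling H, hs, alphas = adaboost(X, y) and evaluating np.mean(H(X_test) != y_test) gives test error curves of the kind discussed below.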
3 Analyzing the training error

The most basic theoretical property of AdaBoost concerns its ability to reduce the training
error. Schapire and Singer [70] show that the training error of the final classifier H is
bounded as

      (1/m) |{ i : H(x_i) ≠ y_i }| ≤ (1/m) Σ_i exp(−y_i f(x_i)) = ∏_t Z_t   (2)

where henceforth we define

      f(x) = Σ_t α_t h_t(x)   (3)

so that H(x) = sign(f(x)). (For simplicity of notation, we write Σ_i and Σ_t as
shorthand for Σ_{i=1}^{m} and Σ_{t=1}^{T}, respectively.)
Eq. (2) suggests that the training error can be reduced most rapidly (in a greedy
way) by choosing α_t and h_t on each round to minimize

      Z_t = Σ_i D_t(i) exp(−α_t y_i h_t(x_i)).   (4)

In the case of binary classifiers, this leads to the choice of α_t given in Eq. (1) and
gives a bound on the training error of

      ∏_t Z_t = ∏_t [ 2 √( ε_t (1 − ε_t) ) ] = ∏_t √( 1 − 4 γ_t² ) ≤ exp( −2 Σ_t γ_t² )   (5)

where we define γ_t = 1/2 − ε_t. This bound was first proved by Freund and
Schapire [32]. Thus, if each base classifier is slightly better than random so that
γ_t ≥ γ for some γ > 0, then the training error drops exponentially fast in T since
the bound in Eq. (5) is at most e^{−2γ²T}. This bound, combined with the bounds
on generalization error given below, proves that AdaBoost is indeed a boosting al-
gorithm in the sense that it can efficiently convert a true weak learning algorithm
(that can always generate a classifier with a weak edge for any distribution) into
a strong learning algorithm (that can generate a classifier with an arbitrarily low
error rate, given sufficient data).
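To get a feel for the rate in Eq. (5) (our own arithmetic, not an experiment from the paper): if every base classifier has edge γ_t ≥ 0.1, i.e., weighted error at most 0.4, then after T = 1000 rounds the bound gives a training error of at most e^{−2·(0.1)²·1000} = e^{−20} ≈ 2 × 10^{−9}, even though each individual rule of thumb is only slightly better than random guessing.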
Eq. (2) points to the fact that, at heart, AdaBoost is a procedure for finding a
linear combination of base classifiers which attempts to minimize
      Σ_i exp(−y_i f(x_i)) = Σ_i exp( −y_i Σ_t α_t h_t(x_i) ).   (6)

Essentially, on each round, AdaBoost chooses h_t (by calling the base learner) and
then sets α_t to add one more term to the accumulating weighted sum of base classi-
fiers in such a way that the sum of exponentials above will be maximally reduced.
In other words, AdaBoost is doing a kind of steepest descent search to minimize
Eq. (6) where the search is constrained at each step to follow coordinate direc-
tions (where we identify coordinates with the weights assigned to base classifiers).
This view of boosting and its generalization are examined in considerable detail
by Duffy and Helmbold [23], Mason et al. [51, 52] and Friedman [35]. See also
Section 6.
Schapire and Singer [70] discuss the choice of α_t and h_t in the case that h_t is
real-valued rather than binary; in that case h_t(x) can be interpreted as a
"confidence-rated prediction" whose sign is the predicted label and whose magnitude
is a measure of confidence.
4 Generalization error
In studying and designing learning algorithms, we are of course interested in per-
formance on examples not seen during training, i.e., in the generalization error, the
topic of this section. Unlike Section 3 where the training examples were arbitrary,
here we assume that all examples (both train and test) are generated i.i.d. from
some unknown distribution on X × Y. The generalization error is the probability
of misclassifying a new example, while the test error is the fraction of mistakes on
a newly sampled test set (thus, generalization error is expected test error). Also,
for simplicity, we restrict our attention to binary base classifiers.
Freund and Schapire [32] showed how to bound the generalization error of the
final classifier in terms of its training error, the size m of the sample, the VC-dimension²
d of the base classifier space, and the number T of rounds of boosting.
Specifically, they used techniques from Baum and Haussler [5] to show that the
generalization error, with high probability, is at most

      P̂r[ H(x) ≠ y ] + Õ( √( Td / m ) )

where P̂r[·] denotes empirical probability on the training sample.³ This bound sug-
gests that boosting will overfit if run for too many rounds, i.e., as T becomes large.
In fact, this sometimes does happen. However, in early experiments, several au-
thors [8, 21, 59] observed empirically that boosting often does not overfit, even
when run for thousands of rounds. Moreover, it was observed that AdaBoost would
sometimes continue to drive down the generalization error long after the training
error had reached zero, clearly contradicting the spirit of the bound above. For
instance, the left side of Fig. 2 shows the training and test curves of running boost-
ing on top of Quinlan’s C4.5 decision-tree learning algorithm [60] on the “letter”
dataset.
In response to these empirical findings, Schapire et al. [69], following the work
of Bartlett [3], gave an alternative analysis in terms of the margins of the training
examples. The margin of a labeled example (x, y) is defined to be

      margin(x, y) = y Σ_t α_t h_t(x) / Σ_t α_t .
² The Vapnik-Chervonenkis (VC) dimension is a standard measure of the "complexity" of a space of binary functions. See, for instance, refs. [6, 76] for its definition and relation to learning theory.
³ The "soft-Oh" notation Õ(·), here used rather informally, is meant to hide all logarithmic and constant factors (in the same way that standard "big-Oh" notation hides only constant factors).
Figure 2: Error curves and the margin distribution graph for boosting C4.5 on
the letter dataset as reported by Schapire et al. [69]. Left: the training and test
error curves (lower and upper curves, respectively) of the combined classifier as
a function of the number of rounds of boosting. The horizontal lines indicate the
test error rate of the base classifier as well as the test error of the final combined
classifier. Right: The cumulative distribution of margins of the training examples
after 5, 100 and 1000 iterations, indicated by short-dashed, long-dashed (mostly
hidden) and solid curves, respectively.
It is a number in [−1, +1] and is positive if and only if H correctly classifies the
example. Moreover, the magnitude of the margin can be interpreted as a measure of
confidence in the prediction, and Schapire et al. proved that larger margins on the
training set translate into a superior upper bound on the generalization error.
Specifically, the generalization error is at most
      P̂r[ margin(x, y) ≤ θ ] + Õ( √( d / (m θ²) ) )

for any θ > 0 with high probability. Note that this bound is entirely independent
of T, the number of rounds of boosting. In addition, Schapire et al. proved that
boosting is particularly aggressive at reducing the margin (in a quantifiable sense)
since it concentrates on the examples with the smallest margins (whether positive
or negative). Boosting’s effect on the margins can be seen empirically, for instance,
on the right side of Fig. 2 which shows the cumulative distribution of margins of the
training examples on the “letter” dataset. In this case, even after the training error
reaches zero, boosting continues to increase the margins of the training examples
effecting a corresponding drop in the test error.
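The margin statistics plotted in Fig. 2 can be computed directly from the quantities AdaBoost maintains. A small sketch of ours, reusing the hs and alphas lists returned by the adaboost sketch in Section 2:

import numpy as np

def margin_distribution(X, y, hs, alphas):
    # Normalized margin y_i * f(x_i) / sum_t alpha_t for each training example:
    # a number in [-1, +1] that is positive exactly when H classifies x_i correctly.
    f = sum(a * h(X) for a, h in zip(alphas, hs))
    return y * f / sum(alphas)

# The cumulative distribution in the right panel of Fig. 2 is then
# np.mean(margin_distribution(X, y, hs, alphas) <= theta) over a grid of theta.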
Although the margins theory gives a qualitative explanation of the effectiveness
of boosting, quantitatively, the bounds are rather weak. Breiman [9], for instance,
shows empirically that one classifier can have a margin distribution that is uni-
formly better than that of another classifier, and yet be inferior in test accuracy. On
the other hand, Koltchinskii, Panchenko and Lozano [44, 45, 46, 58] have recently
proved new margin-theoretic bounds that are tight enough to give useful quantita-
tive predictions.
Attempts (not always successful) to use the insights gleaned from the theory
of margins have been made by several authors [9, 37, 50]. In addition, the margin
theory points to a strong connection between boosting and the support-vector ma-
chines of Vapnik and others [7, 14, 77] which explicitly attempt to maximize the
minimum margin.
The row player now is the boosting algorithm, and the column player is the base
learner. The boosting algorithm's choice of a distribution D_t over training exam-
ples becomes a distribution over rows of the game matrix M, while the base learner's
choice of a base classifier h_t becomes the choice of a column of M.
As an example of the connection between boosting and game theory, consider
von Neumann's famous minmax theorem, which states that

      min_P max_Q P^⊤ M Q = max_Q min_P P^⊤ M Q

for any matrix M. When applied to the matrix just defined and reinterpreted in
the boosting setting, this can be shown to have the following meaning: if, for any
distribution over examples, there exists a base classifier with error at most 1/2 − γ,
then there exists a convex combination of base classifiers with a margin of at least
2γ on every training example.
Turning to the connection between boosting and logistic regression: Friedman, Hastie
and Tibshirani [34] observe that the combined classifier can be used to estimate
conditional label probabilities via the logistic model

      Pr[ y = +1 | x ] = 1 / ( 1 + exp(−2 f(x)) )   (7)

where, as usual, f(x) = Σ_t α_t h_t(x). Maximizing the conditional likelihood of the
observed labels under this model is equivalent to minimizing the logistic loss

      Σ_i ln( 1 + exp(−2 y_i f(x_i)) )   (8)

and this loss is closely related to the function that, we have already noted, AdaBoost
attempts to minimize:

      Σ_i exp(−y_i f(x_i)).   (9)
Specifically, it can be verified that Eq. (8) is upper bounded by Eq. (9). In addition,
if we add the constant 1 − ln 2 to Eq. (8) (which does not affect its minimization),
then it can be verified that the resulting function and the one in Eq. (9) have iden-
tical Taylor expansions around zero up to second order; thus, their behavior near
zero is very similar. Finally, it can be shown that, for any distribution over pairs
(x, y), the expectations

      E[ ln(1 + exp(−2 y f(x))) ]   and   E[ exp(−y f(x)) ]

are minimized by the same (unconstrained) function f, namely,

      f(x) = (1/2) ln( Pr[y = +1 | x] / Pr[y = −1 | x] ).
Thus, for all these reasons, minimizing Eq. (9), as is done by AdaBoost, can be
viewed as a method of approximately minimizing the negative log likelihood given
in Eq. (8). Therefore, we may expect Eq. (7) to give a reasonable probability
estimate.
Of course, as Friedman, Hastie and Tibshirani point out, rather than minimiz-
ing the exponential loss in Eq. (6), we could attempt instead to directly minimize
the logistic loss in Eq. (8). To this end, they propose their LogitBoost algorithm.
A different, more direct modification of AdaBoost for logistic loss was proposed
by Collins, Schapire and Singer [13]. Following up on work by Kivinen and War-
muth [43] and Lafferty [47], they derive this algorithm using a unification of logis-
tic regression and boosting based on Bregman distances. This work further con-
nects boosting to the maximum-entropy literature, particularly the iterative-scaling
family of algorithms [15, 16]. They also give unified proofs of convergence to
optimality for a family of new and old algorithms, including AdaBoost, for both
the exponential loss used by AdaBoost and the logistic loss used for logistic re-
gression. See also the later work of Lebanon and Lafferty [48] who showed that
logistic regression and boosting are in fact solving the same constrained optimiza-
tion problem, except that in boosting, certain normalization constraints have been
dropped.
For logistic regression, we attempt to minimize the loss function
      Σ_i ln( 1 + exp(−y_i f(x_i)) )   (10)
which is the same as in Eq. (8) except for an inconsequential change of constants
in the exponent. The modification of AdaBoost proposed by Collins, Schapire and
Singer to handle this loss function is particularly simple.
In AdaBoost, unraveling the definition of D_t given in Fig. 1 shows that D_t(i) is
proportional (i.e., equal up to normalization) to

      exp( −y_i f_{t−1}(x_i) )

where we define

      f_{t−1} = Σ_{t′=1}^{t−1} α_{t′} h_{t′}.

To minimize the loss function in Eq. (10), the only necessary modification is to
redefine D_t(i) to be proportional to

      1 / ( 1 + exp( y_i f_{t−1}(x_i) ) ).
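In code, the two weightings differ only in one line. A small sketch of ours (f_prev holds the values f_{t−1}(x_i) on the training examples, y the labels in {−1, +1}):

import numpy as np

def weights_exponential(y, f_prev):
    # AdaBoost: D_t(i) proportional to exp(-y_i f_{t-1}(x_i))
    w = np.exp(-y * f_prev)
    return w / w.sum()

def weights_logistic(y, f_prev):
    # Modification for the logistic loss of Eq. (10):
    # D_t(i) proportional to 1 / (1 + exp(y_i f_{t-1}(x_i)))
    w = 1.0 / (1.0 + np.exp(y * f_prev))
    return w / w.sum()

In both cases the weight vector is proportional to the negative gradient of the respective loss, which is the functional-gradient-descent view discussed just below.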
A very similar algorithm is described by Duffy and Helmbold [23]. Note that in
each case, the weight on the examples, viewed as a vector, is proportional to the
negative gradient of the respective loss function. This is because both algorithms
are doing a kind of functional gradient descent, an observation that is spelled out
and exploited by Breiman [9], Duffy and Helmbold [23], Mason et al. [51, 52] and
Friedman [35].
Besides logistic regression, there have been a number of approaches taken to
apply boosting to more general regression problems in which the labels are real
numbers and the goal is to produce real-valued predictions that are close to these la-
bels. Some of these, such as those of Ridgeway [63] and Freund and Schapire [32],
attempt to reduce the regression problem to a classification problem. Others, such
as those of Friedman [35] and Duffy and Helmbold [24] use the functional gradient
descent view of boosting to derive algorithms that directly minimize a loss func-
tion appropriate for regression. Another boosting-based approach to regression
was proposed by Drucker [20].
7 Multiclass classification
There are several methods of extending AdaBoost to the multiclass case. The most
straightforward generalization [32], called AdaBoost.M1, is adequate when the
base learner is strong enough to achieve reasonably high accuracy, even on the
hard distributions created by AdaBoost. However, this method fails if the base
learner cannot achieve at least 50% accuracy when run on these hard distributions.
For the latter case, several more sophisticated methods have been developed.
These generally work by reducing the multiclass problem to a larger binary prob-
lem. Schapire and Singer’s [70] algorithm
AdaBoost.MH works by creating a set
of binary problems,
for each example and each possible label , of the form:
“For example , is the correct label or is it one of the other labels?” Freund
and Schapire’s [32] algorithm AdaBoost.M2 (which is a special case of Schapire
and Singer’s [70]
AdaBoost.MR algorithm)
instead creates binary
problems, for
each example
with correct label
and each incorrect label of the form: “For
example , is the correct label or ?”
These methods require additional effort in the design of the base learning algo-
rithm. A different technique [67], which incorporates Dietterich and Bakiri’s [19]
method of error-correcting output codes, achieves similar provable bounds to those
of AdaBoost.MH and AdaBoost.M2, but can be used with any base learner that
can handle simple, binary labeled data. Schapire and Singer [70] and Allwein,
Schapire and Singer [2] give yet another method of combining boosting with error-
correcting output codes.
The approach of Rochery et al. [64, 65] is to replace the logistic loss in Eq. (10)
with one that incorporates prior knowledge, namely,

      Σ_i [ ln( 1 + exp(−y_i f(x_i)) ) + η RE( p(x_i) ‖ σ(f(x_i)) ) ]

where σ(u) = (1 + exp(−u))^{−1}, p(x) denotes the prior model's estimate of the
probability that the label of x is +1, and

      RE(p ‖ q) = p ln(p/q) + (1 − p) ln( (1 − p)/(1 − q) )

is binary relative entropy. The first term is the same as that in Eq. (10). The second term gives a
measure of the distance from the model built by boosting to the human’s model.
Thus, we balance the conditional likelihood of the data against the distance from
our model to the human’s model. The relative importance of the two terms is
controlled by the parameter η.
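As a rough illustration only (the symbols η, p and σ follow our rendering of the loss above, not necessarily the authors' notation), the combined objective can be evaluated as:

import numpy as np

def prior_knowledge_loss(y, f, p, eta):
    # y: labels in {-1, +1}; f: boosted scores f(x_i); p: prior estimates of Pr[y_i = +1 | x_i].
    logistic = np.sum(np.log1p(np.exp(-y * f)))        # first term: Eq. (10)
    q = 1.0 / (1.0 + np.exp(-f))                        # boosted model sigma(f)
    eps = 1e-12                                         # guard against log(0)
    re = np.sum(p * np.log((p + eps) / (q + eps))
                + (1 - p) * np.log((1 - p + eps) / (1 - q + eps)))
    return logistic + eta * re                          # balance data fit vs. prior

Larger η pulls the boosted model toward the human-built rules; smaller η lets the data dominate.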
Figure 3: Comparison of C4.5 versus boosting stumps and boosting C4.5 on a set
of 27 benchmark problems as reported by Freund and Schapire [30]. Each point
in each scatterplot shows the test error rate of the two competing algorithms on
a single benchmark. The y-coordinate of each point gives the test error rate (in
percent) of C4.5 on the given benchmark, and the x-coordinate gives the error rate
of boosting stumps (left plot) or boosting C4.5 (right plot). All error rates have
been averaged over multiple runs.
Results comparing AdaBoost to four other text categorization methods are shown in Fig. 4. In nearly all of these ex-
periments and for all of the performance measures tested, boosting performed as
well or significantly better than the other methods tested. As shown in Fig. 5, these
experiments also demonstrated the effectiveness of using confidence-rated predic-
tions [70], mentioned in Section 3 as a means of speeding up boosting.
Boosting has also been applied to text filtering [72] and routing [39], “ranking”
problems [28], learning problems arising in natural language processing [1, 12, 25,
38, 55, 78], image retrieval [74], medical diagnosis [53], and customer monitoring
and segmentation [56, 57].
Rochery et al.’s [64, 65] method of incorporating human knowledge into boost-
ing, described in Section 8, was applied to two speech categorization tasks. In this
case, the prior knowledge took the form of a set of hand-built rules mapping key-
words to predicted categories. The results are shown in Fig. 6.
The final classifier produced by AdaBoost when used, for instance, with a
decision-tree base learning algorithm, can be extremely complex and difficult to
comprehend. With greater care, a more human-understandable final classifier can
be obtained using boosting. Cohen and Singer [11] showed how to design a base
Figure 4: Comparison of error rates for AdaBoost and four other text categoriza-
tion methods (naive Bayes, probabilistic TF-IDF, Rocchio and sleeping experts)
as reported by Schapire and Singer [71]. The algorithms were tested on two text
corpora — Reuters newswire articles (left) and AP newswire headlines (right) —
and with varying numbers of class labels as indicated on the x-axis of each figure.
learning algorithm that, when combined with AdaBoost, results in a final classifier
consisting of a relatively small set of rules similar to those generated by systems
like RIPPER [10], IREP [36] and C4.5rules [60]. Cohen and Singer’s system,
called SLIPPER, is fast, accurate and produces quite compact rule sets. In other
work, Freund and Mason [29] showed how to apply boosting to learn a generaliza-
tion of decision trees called “alternating trees.” Their algorithm produces a single
alternating tree rather than an ensemble of trees as would be obtained by running
AdaBoost on top of a decision-tree learning algorithm. On the other hand, their
learning algorithm achieves error rates comparable to those of a whole ensemble
of trees.
A nice property of AdaBoost is its ability to identify outliers, i.e., examples
that are either mislabeled in the training data, or that are inherently ambiguous and
hard to categorize. Because AdaBoost focuses its weight on the hardest examples,
the examples with the highest weight often turn out to be outliers. An example of
this phenomenon can be seen in Fig. 7 taken from an OCR experiment conducted
by Freund and Schapire [30].
When the number of outliers is very large, the emphasis placed on the hard ex-
amples can become detrimental to the performance of AdaBoost. This was demon-
strated very convincingly by Dietterich [18]. Friedman, Hastie and Tibshirani [34]
suggested a variant of AdaBoost, called “Gentle AdaBoost” that puts less emphasis
on outliers. Rätsch, Onoda and Müller [61] show how to regularize AdaBoost to
handle noisy data. Freund [27] suggested another algorithm, called “BrownBoost,”
that takes a more radical approach that de-emphasizes outliers when it seems clear
that they are “too hard” to classify correctly. This algorithm, which is an adaptive
Figure 5: Comparison of the training (left) and test (right) error using three boost-
ing methods on a six-class text classification problem from the TREC-AP collec-
tion, as reported by Schapire and Singer [70, 71]. Discrete AdaBoost.MH and
discrete AdaBoost.MR are multiclass versions of AdaBoost that require binary
({−1, +1}-valued) base classifiers, while real AdaBoost.MH is a multiclass ver-
sion that uses “confidence-rated” (i.e., real-valued) base classifiers.
10 Conclusion
In this overview, we have seen that there have emerged a great many views or
interpretations of AdaBoost. First and foremost, AdaBoost is a genuine boosting
algorithm: given access to a true weak learning algorithm that always performs a
little bit better than random guessing on every distribution over the training set, we
can prove arbitrarily good bounds on the training error and generalization error of
AdaBoost.
Besides this original view, AdaBoost has been interpreted as a procedure based
on functional gradient descent, as an approximation of logistic regression and as
a repeated-game playing algorithm. AdaBoost has also been shown to be re-
lated to many other topics, such as game theory and linear programming, Breg-
man distances, support-vector machines, Brownian motion, logistic regression and
maximum-entropy methods such as iterative scaling.
All of these connections and interpretations have greatly enhanced our under-
standing of boosting and contributed to its extension in ever more practical di-
rections, such as to logistic regression and other loss-minimization problems, to
multiclass problems, to incorporate regularization and to allow the integration of
prior background knowledge.
Figure 6: Classification accuracy (%) as a function of the number of training sentences/examples on the two speech categorization tasks, for the hand-built rules alone ("knowledge"), boosting on the data alone ("data"), and boosting with prior knowledge ("knowledge + data").
References
[1] Steven Abney, Robert E. Schapire, and Yoram Singer. Boosting applied to tagging
and PP attachment. In Proceedings of the Joint SIGDAT Conference on Empirical
Methods in Natural Language Processing and Very Large Corpora, 1999.
[2] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to
binary: A unifying approach for margin classifiers. Journal of Machine Learning
Research, 1:113–141, 2000.
[3] Peter L. Bartlett. The sample complexity of pattern classification with neural net-
works: the size of the weights is more important than the size of the network. IEEE
Transactions on Information Theory, 44(2):525–536, March 1998.
[4] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algo-
rithms: Bagging, boosting, and variants. Machine Learning, 36(1/2):105–139, 1999.
[5] Eric B. Baum and David Haussler. What size net gives valid generalization? Neural
Computation, 1(1):151–160, 1989.
[6] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth.
Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for
Computing Machinery, 36(4):929–965, October 1989.
17
[7] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm
for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on
Computational Learning Theory, pages 144–152, 1992.
[8] Leo Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–849, 1998.
[9] Leo Breiman. Prediction games and arcing classifiers. Neural Computation,
11(7):1493–1517, 1999.
[10] William Cohen. Fast effective rule induction. In Proceedings of the Twelfth Interna-
tional Conference on Machine Learning, pages 115–123, 1995.
[11] William W. Cohen and Yoram Singer. A simple, fast, and effective rule learner. In
Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.
[12] Michael Collins. Discriminative reranking for natural language parsing. In Proceed-
ings of the Seventeenth International Conference on Machine Learning, 2000.
[13] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, Ada-
Boost and Bregman distances. Machine Learning, to appear.
[14] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning,
20(3):273–297, September 1995.
18
[15] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models.
The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.
[16] Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features
of random fields. IEEE Transactions Pattern Analysis and Machine Intelligence,
19(4):1–13, April 1997.
[17] Ayhan Demiriz, Kristin P. Bennett, and John Shawe-Taylor. Linear programming
boosting via column generation. Machine Learning, 46(1/2/3):225–254, 2002.
[18] Thomas G. Dietterich. An experimental comparison of three methods for construct-
ing ensembles of decision trees: Bagging, boosting, and randomization. Machine
Learning, 40(2):139–158, 2000.
[19] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via
error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286,
January 1995.
[20] Harris Drucker. Improving regressors using boosting techniques. In Machine Learn-
ing: Proceedings of the Fourteenth International Conference, pages 107–115, 1997.
[21] Harris Drucker and Corinna Cortes. Boosting decision trees. In Advances in Neural
Information Processing Systems 8, pages 479–485, 1996.
[22] Harris Drucker, Robert Schapire, and Patrice Simard. Boosting performance in neural
networks. International Journal of Pattern Recognition and Artificial Intelligence,
7(4):705–719, 1993.
[23] Nigel Duffy and David Helmbold. Potential boosters? In Advances in Neural Infor-
mation Processing Systems 11, 1999.
[24] Nigel Duffy and David Helmbold. Boosting methods for regression. Machine Learn-
ing, 49(2/3), 2002.
[25] Gerard Escudero, Lluís Màrquez, and German Rigau. Boosting applied to word
sense disambiguation. In Proceedings of the 12th European Conference on Machine
Learning, pages 129–141, 2000.
[26] Yoav Freund. Boosting a weak learning algorithm by majority. Information and
Computation, 121(2):256–285, 1995.
[27] Yoav Freund. An adaptive version of the boost by majority algorithm. Machine
Learning, 43(3):293–318, June 2001.
[28] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boost-
ing algorithm for combining preferences. In Machine Learning: Proceedings of the
Fifteenth International Conference, 1998.
[29] Yoav Freund and Llew Mason. The alternating decision tree learning algorithm. In
Machine Learning: Proceedings of the Sixteenth International Conference, pages
124–133, 1999.
19
[30] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In
Machine Learning: Proceedings of the Thirteenth International Conference, pages
148–156, 1996.
[31] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting.
In Proceedings of the Ninth Annual Conference on Computational Learning Theory,
pages 325–332, 1996.
[32] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of Computer and System Sciences,
55(1):119–139, August 1997.
[33] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative
weights. Games and Economic Behavior, 29:79–103, 1999.
[34] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression:
A statistical view of boosting. The Annals of Statistics, 38(2):337–374, April 2000.
[35] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine.
The Annals of Statistics, 29(5), October 2001.
[36] Johannes Fürnkranz and Gerhard Widmer. Incremental reduced error pruning. In
Machine Learning: Proceedings of the Eleventh International Conference, pages 70–
77, 1994.
[37] Adam J. Grove and Dale Schuurmans. Boosting in the limit: Maximizing the mar-
gin of learned ensembles. In Proceedings of the Fifteenth National Conference on
Artificial Intelligence, 1998.
[38] Masahiko Haruno, Satoshi Shirai, and Yoshifumi Ooyama. Using decision trees to
construct a practical parser. Machine Learning, 34:131–149, 1999.
[39] Raj D. Iyer, David D. Lewis, Robert E. Schapire, Yoram Singer, and Amit Singhal.
Boosting for document routing. In Proceedings of the Ninth International Conference
on Information and Knowledge Management, 2000.
[40] Jeffrey C. Jackson and Mark W. Craven. Learning sparse perceptrons. In Advances
in Neural Information Processing Systems 8, pages 654–660, 1996.
[41] Michael Kearns and Leslie G. Valiant. Learning Boolean formulae or finite automata
is as hard as factoring. Technical Report TR-14-88, Harvard University Aiken Com-
putation Laboratory, August 1988.
[42] Michael Kearns and Leslie G. Valiant. Cryptographic limitations on learning Boolean
formulae and finite automata. Journal of the Association for Computing Machinery,
41(1):67–95, January 1994.
[43] Jyrki Kivinen and Manfred K. Warmuth. Boosting as entropy projection. In Proceed-
ings of the Twelfth Annual Conference on Computational Learning Theory, pages
134–144, 1999.
20
[44] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the
generalization error of combined classifiers. The Annals of Statistics, 30(1), February
2002.
[45] Vladimir Koltchinskii, Dmitriy Panchenko, and Fernando Lozano. Further explana-
tion of the effectiveness of voting methods: The game between margins and weights.
In Proceedings 14th Annual Conference on Computational Learning Theory and 5th
European Conference on Computational Learning Theory, pages 241–255, 2001.
[46] Vladimir Koltchinskii, Dmitriy Panchenko, and Fernando Lozano. Some new bounds
on the generalization error of combined classifiers. In Advances in Neural Informa-
tion Processing Systems 13, 2001.
[47] John Lafferty. Additive models, boosting and inference for generalized divergences.
In Proceedings of the Twelfth Annual Conference on Computational Learning The-
ory, pages 125–133, 1999.
[48] Guy Lebanon and John Lafferty. Boosting and maximum likelihood for exponential
models. In Advances in Neural Information Processing Systems 14, 2002.
[49] Richard Maclin and David Opitz. An empirical evaluation of bagging and boost-
ing. In Proceedings of the Fourteenth National Conference on Artificial Intelligence,
pages 546–551, 1997.
[50] Llew Mason, Peter Bartlett, and Jonathan Baxter. Direct optimization of margins
improves generalization in combined classifiers. In Advances in Neural Information
Processing Systems 12, 2000.
[51] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradi-
ent techniques for combining hypotheses. In Alexander J. Smola, Peter J. Bartlett,
Bernhard Schölkopf, and Dale Schuurmans, editors, Advances in Large Margin Clas-
sifiers. MIT Press, 1999.
[52] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting algorithms
as gradient descent. In Advances in Neural Information Processing Systems 12, 2000.
[53] Stefano Merler, Cesare Furlanello, Barbara Larcher, and Andrea Sboner. Tuning cost-
sensitive boosting and its application to melanoma diagnosis. In Multiple Classifier
Systems: Proceedings of the 2nd International Workshop, pages 32–42, 2001.
[54] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases, 1999.
www.ics.uci.edu/∼mlearn/MLRepository.html.
[55] Pedro J. Moreno, Beth Logan, and Bhiksha Raj. A boosting approach for confidence
scoring. In Proceedings of the 7th European Conference on Speech Communication
and Technology, 2001.
[56] Michael C. Mozer, Richard Wolniewicz, David B. Grimes, Eric Johnson, and Howard
Kaushansky. Predicting subscriber dissatisfaction and improving retention in the
wireless telecommunications industry. IEEE Transactions on Neural Networks,
11:690–696, 2000.
21
[57] Takashi Onoda, Gunnar Rätsch, and Klaus-Robert Müller. Applying support vector
machines and boosting to a non-intrusive monitoring system for household electric
appliances with inverters. In Proceedings of the Second ICSC Symposium on Neural
Computation, 2000.
[58] Dmitriy Panchenko. New zero-error bounds for voting algorithms. Unpublished
manuscript, 2001.
[59] J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth Na-
tional Conference on Artificial Intelligence, pages 725–730, 1996.
[60] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[61] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learn-
ing, 42(3):287–320, 2001.
[62] Gunnar Rätsch, Manfred Warmuth, Sebastian Mika, Takashi Onoda, Steven Lemm,
and Klaus-Robert Müller. Barrier boosting. In Proceedings of the Thirteenth Annual
Conference on Computational Learning Theory, pages 170–179, 2000.
[63] Greg Ridgeway, David Madigan, and Thomas Richardson. Boosting methodology
for regression problems. In Proceedings of the International Workshop on AI and
Statistics, pages 152–161, 1999.
[64] M. Rochery, R. Schapire, M. Rahim, N. Gupta, G. Riccardi, S. Bangalore, H. Al-
shawi, and S. Douglas. Combining prior knowledge and boosting for call classifica-
tion in spoken language dialogue. Unpublished manuscript, 2001.
[65] Marie Rochery, Robert Schapire, Mazin Rahim, and Narendra Gupta. BoosTexter for
text categorization in spoken language dialogue. Unpublished manuscript, 2001.
[66] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–
227, 1990.
[67] Robert E. Schapire. Using output codes to boost multiclass learning problems. In
Machine Learning: Proceedings of the Fourteenth International Conference, pages
313–321, 1997.
[68] Robert E. Schapire. Drifting games. Machine Learning, 43(3):265–291, June 2001.
[69] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the
margin: A new explanation for the effectiveness of voting methods. The Annals of
Statistics, 26(5):1651–1686, October 1998.
[70] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using
confidence-rated predictions. Machine Learning, 37(3):297–336, December 1999.
[71] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text
categorization. Machine Learning, 39(2/3):135–168, May/June 2000.
[72] Robert E. Schapire, Yoram Singer, and Amit Singhal. Boosting and Rocchio ap-
plied to text filtering. In Proceedings of the 21st Annual International Conference on
Research and Development in Information Retrieval, 1998.
22
[73] Holger Schwenk and Yoshua Bengio. Training methods for adaptive boosting of
neural networks. In Advances in Neural Information Processing Systems 10, pages
647–653, 1998.
[74] Kinh Tieu and Paul Viola. Boosting image retrieval. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2000.
[75] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–
1142, November 1984.
[76] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative
frequencies of events to their probabilities. Theory of Probability and its applications,
XVI(2):264–280, 1971.
[77] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[78] Marilyn A. Walker, Owen Rambow, and Monica Rogati. SPoT: A trainable sentence
planner. In Proceedings of the 2nd Annual Meeting of the North American Chapter
of the Association for Computational Linguistics, 2001.
Center Based Clustering: A Foundational Perspective
Abstract
In the first part of this chapter we detail center based clustering methods, namely methods based on
finding a “best” set of center points and then assigning data points to their nearest center. In particular,
we focus on k-means and k-median clustering which are two of the most widely used clustering objectives.
We describe popular heuristics for these methods and theoretical guarantees associated with them. We
also describe how to design worst case approximately optimal algorithms for these problems. In the
second part of the chapter we describe recent work on how to improve on these worst case algorithms
even further by using insights from the nature of real world clustering problems and data sets. Finally,
we also summarize theoretical work on clustering data generated from mixture models such as a mixture
of Gaussians.
One of the most popular approaches to clustering is to define an objective function over the
data points and find a partitioning which achieves the optimal solution, or an approximately
optimal solution to the given objective function. Common objective functions include center
based objective functions such as k-median and k-means where one selects k center points
and the clustering is obtained by assigning each data point to its closest center point. Here
closeness is measured in terms of a pairwise distance function d(), which the clustering
algorithm has access to, encoding how dissimilar two data points are. For instance, the
data could be points in Euclidean space with d() measuring Euclidean distance, or it could
be strings with d() representing an edit distance, or some other dissimilarity score. For
mathematical convenience it is also assumed that the distance function d() is a metric. In
k-median clustering the objective is to find k center points c_1, c_2, · · · , c_k, and a partitioning of
the data so as to minimize Φ_{k−median} = Σ_x min_i d(x, c_i). This objective is historically very
useful and well studied for facility location problems [16, 43]. Similarly the objective in
k-means is to minimize Φ_{k−means} = Σ_x min_i d(x, c_i)². Optimizing this objective is closely
related to fitting the maximum likelihood mixture model for a given dataset. For a given set
of centers, the optimal clustering for that set is obtained by assigning each data point to its
closest center point. This is known as the Voronoi partitioning of the data. Unfortunately,
exactly optimizing the k-median and the k-means objectives is a notoriously hard problem.
Intuitively this is expected since the objective function is a non-convex function of the
variables involved. This apparent hardness can also be formally justified by appealing to the
notion of NP completeness [43, 33, 8]. At a high level the notion of NP completeness identifies
a wide class of problems which are in principle equivalent to each other. In other words, an
efficient algorithm for exactly optimizing one of the problems in the class on all instances
would also lead to algorithms for all the problems in the class. This class contains many
optimization problems that are believed to be hard1 to exactly optimize in the worst case
and not surprisingly, k-median and k-means also fall into the class. Hence it is unlikely that
one would be able to optimize these objectives exactly using efficient algorithms. Naturally,
this leads to the question of recovering approximate solutions and a lot of the work in the
theoretical community has focused on this direction [16, 11, 29, 34, 43, 47, 48, 57, 20].
Such works typically fall into two categories, a) providing formal worst case guarantees
on all instances of the problem, and b) providing better guarantees suited for nicer,
stable instances. In this chapter we discuss several stepping stone results in these directions,
focusing our attention on the k-means objective. A lot of the ideas and techniques
mentioned apply in a straightforward manner to the k-median objective as well. We will
point out crucial differences between the two objectives as and when they appear. We will
additionally discuss several practical implications of these results.
We will begin by describing a very popular heuristic for the k-means problem known as
Lloyd’s method. Lloyd’s method [51] is an iterative procedure which starts out with a set
of k seed centers and at each step computes a new set of centers with a lower k-means cost.
This is achieved by computing the Voronoi partitioning of the current set of centers and
replacing each center with the center of the corresponding partition. We will describe the
theoretical properties and limitations of Lloyd’s method which will also motivate the need
for good worst case approximation algorithms for k-means and k-median. We will see that
the method is very sensitive to the choice of the seed centers. Next we will describe a general
method based on local search which achieves constant factor approximations for both the
k-means and the k-median objectives. Similar to Lloyd’s method, the local search heuristic
starts out with a set of k seed centers and at each step swaps one of the centers for a new one
resulting in a decrease in the k-means cost. Using a clever analysis it can be shown that this
procedure outputs a good approximation to the optimal solution [47]. This is interesting
since, as mentioned above, optimizing the k-means objective is NP-complete; in fact it is NP-complete
even for k = 2, for points in Euclidean space [33]².
In the second part of the chapter we will describe some of the recent developments in the
study of clustering objectives. These works take a non-worst case analysis approach to the
problem. The basic theme is to design algorithms which give good solutions to clustering
problems only when the underlying optimal solution has a meaningful structure. We will
call such clustering instances stable instances. We will describe in detail two recently
studied notions of stability. The first one, called separability, was proposed by Ostrovsky et
al. [57]. According to this notion a k-clustering instance is stable if it is much more expensive
to cluster the data using (k − 1) or fewer clusters. For such instances Ostrovsky et al.
show that one can design a simple Lloyd's-type algorithm which achieves a constant factor
approximation. A different notion, called approximation stability, was proposed by Balcan et
al. [20]. The motivation comes from the fact that often in practice optimizing an objective
function acts as a proxy for the real problem of getting close to the correct unknown ground
truth clustering. Hence it is only natural to assume that any good approximation to the proxy
function such as k-means or k-median will also be close to the ground truth clustering in
¹ This is the famous P vs NP problem, and there is a whole area called Computational Complexity Theory that studies this question.
² … k-tuples of centers and choosing the best. The difficulty of k-means for k = 2 in Euclidean space comes from the fact that the optimal centers need not be data points.
terms of structure. Balcan et al. show that under this assumption one can design algorithms
that solve the end goal of getting close to the ground truth clustering. More surprisingly
this is true even in cases where it is NP-hard to achieve a good approximation to the proxy
objective.
In the last part of the chapter we briefly review existing theoretical work on clustering
data generated from mixture models. We mainly focus on Gaussian Mixture Models (GMM)
which are the most widely studied distributional model for clustering. We will study algo-
rithms for clustering data from a GMM under the assumption that the mean vectors of
the component Gaussians are well separated. We will also see the effectiveness of spectral
techniques for GMMs. Finally, we will look at recent work on estimating the parameters of
a Gaussian mixture model under minimal assumptions.
Lloyd’s method, also known as the k-means algorithm is the most popular heuristic for
k-means clustering in the Euclidean space which has been shown to be one of the top ten
algorithms in data mining [69]. The method is an iterative procedure which is described
below.
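Concretely, a minimal sketch of the procedure (ours, for points in Euclidean space, with arbitrary random seeds; the choice of seeds is discussed further below):

import numpy as np

def lloyds_method(X, k, max_iters=100, rng=None):
    # X: n-by-d array of points. Returns k centers and the cluster assignment.
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random seeding
    for _ in range(max_iters):
        # Voronoi step: assign each point to its nearest current center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Re-seeding step: replace each center by the mean of its cluster.
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign

def kmeans_cost(X, centers):
    # The k-means objective: sum over points of squared distance to the nearest center.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).sum()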
We would like to stress that although Lloyd’s method is popularly known as the k-means
algorithm, there is a difference between the underlying k-means objective (which is usually
hard to optimize) and the k-means algorithm which is a heuristic to solve the problem. An
attractive feature of Lloyd’s method is that the k-means cost of the clustering obtained never
increases. This follows from the fact that for any set of points, the 1-means cost is minimized
by choosing the mean of the set as the center. Hence for any cluster Ci in the partitioning,
choosing mean(Ci ) will never lead to a solution of higher cost. Hence if we repeat this
method until there is no change in the k-means cost, we will reach a local optimum of the
k-means cost function in finite time. In particular the number of iterations will be at most
n^{O(kd)}, which is the maximum number of Voronoi partitions of a set of n points in ℜ^d induced by k centers [42].
The basic method mentioned above leads to a class of algorithms depending upon the choice
of the seeding method. A simple way is to start with k randomly chosen data points. This
choice however can lead to arbitrarily bad solution quality as shown in Figure 1. In addition
it is also known that the Lloyd’s method can take upto 2n iterations to converge even in 2
dimensions [14, 66].
[Figure 1: illustration with points A, B, C, D and seed centers x, y of how randomly chosen seeds can lead to an arbitrarily bad solution.]
In sum, from a theoretical standpoint, k-means with random/arbitrary seeds is not a good
clustering algorithm in terms of efficiency or quality. Nevertheless, the speed and simplicity
of k-means are quite appealing in practical applications. Therefore, recent work has focused
on improving the initialization procedure: deciding on a better way to initialize the clustering
dramatically changes the performance of the Lloyd’s iteration, both in terms of quality and
convergence properties. For example, [15] showed that choosing a good set of seed points
is crucial and if done carefully can itself be a good candidate solution without the need for
further iterations. Their algorithm called k-means++ uses the following seeding procedure:
it selects only the first center uniformly at random from the data and each subsequent center
is selected with a probability proportional to its contribution to the overall error given the
previous selections. See Algorithm kmeans++ for a formal description:
Algorithm kmeans++
1. Initialize: a set S by choosing a data point at random.
2. While |S| < k,
(a) Choose a data point x with probability proportional to minz∈S d(x, z)2 , and add it
to S.
3. Output: the clustering obtained by the Voronoi partitioning of the data using the
centers in S.
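In code, the D²-sampling seeding step looks roughly as follows (our sketch, not the implementation of [15]):

import numpy as np

def kmeans_pp_seeding(X, k, rng=None):
    # Each new center is chosen with probability proportional to its squared
    # distance to the closest center chosen so far (step 2 of Algorithm kmeans++).
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]          # step 1: uniform first center
    while len(centers) < k:
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

Step 3 then assigns each point to its nearest center in S; in practice one often follows this with a few Lloyd's iterations, which can only decrease the cost.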
[15] showed that Algorithm kmeans++ is an O(log k)-approximation algorithm for the
k-means objective. We say that an algorithm is an α-approximation for a given objective
function Φ if for every clustering instance the algorithm outputs a solution of expected cost
at most α times the cost of the best solution. The design of approximation algorithms for
NP -hard problems has been a fruitful research direction and has led to a wide array of tools
and techniques. Formally, [15] show that:
Theorem 1 ([15]). Let S be the set of centers output by the above algorithm and Φ(S)
be the k-means cost of the clustering obtained using S as the centers. Then E[Φ(S)] ≤
O(log k)OPT, where OPT is the cost of the optimal k-means solution.
We would like to point out that in general the output of k-means++ is not a local
optimum. Hence it might be desirable in practice to run a few steps of the Lloyd’s method
starting from this solution. This could only lead to a better solution.
Subsequent work of [6] introduced a streaming algorithm inspired by the k-means++
algorithm that makes a single pass over the data. They show that if one is allowed to cluster
using a little more than k centers, specifically O(k log k) centers, then one can achieve a
constant-factor approximation in expectation to the k-means objective. The approximation
guarantee was improved in [5]. Such approximation algorithms which use more than k centers
are also known as bi-criteria approximations.
As mentioned earlier, Lloyd’s method can take up to exponential iterations in order to
converge to a local optimum. However [13] showed that the method converges quickly on an
“average” instance. In order to formalize this, they study the problem under the smoothed
analysis framework of [65]. In the smoothed analysis framework the input is generated by
applying a small Gaussian perturbation to an adversarial input. [65] showed that the simplex
method takes polynomial number of iterations on such smoothed instances. In a similar
spirit, [13] showed that for smoothed instances Lloyd’s method runs in time polynomial
in n, the number of points, and 1/σ, where σ is the standard deviation of the Gaussian perturbation.
However, these works do not provide any guarantee on the quality of the final solution
produced.
We would like to point out that in principle the Lloyd’s method can be extended to the
k-median objective. A natural extension would be to replace the mean computation in the
Reseeding step with computing the median of a set of points X in the Euclidean space, i.e.,
a point c ∈ ℜ^d such that Σ_{x∈X} d(x, c) is minimized. However this problem turns out to be
NP-complete [53]. For this reason, the Lloyd’s method is typically used only for the k-means
objective.
3 Properties of the k-means objective
In this section we provide some useful facts about the k-means clustering objective. We will
use C to denote the set of n points which represent a clustering instance. The first fact can
be used to show that given a Voronoi partitioning of the data, replacing a given center with
the mean of the corresponding partition can never increase the k-means cost.
Fact 2. Consider a finite set X ⊂ ℜ^d and c = mean(X). For any y ∈ ℜ^d, we have that
Σ_{x∈X} d(x, y)² = Σ_{x∈X} d(x, c)² + |X| d(c, y)².
Proof. We have

      Σ_{x∈X} d(x, y)² = Σ_{x∈X} d(x, c)² + |X| d(c, y)² + 2 Σ_{i=1}^{d} (c_i − y_i) Σ_{x∈X} (x_i − c_i)
                       = Σ_{x∈X} d(x, c)² + |X| d(c, y)².

Here the last equality follows from the fact that for any coordinate i, c_i = Σ_{x∈X} x_i / |X|, so Σ_{x∈X} (x_i − c_i) = 0.
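A quick numerical sanity check of the identity (ours, not part of the chapter):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # a finite set of points in R^3
y = rng.normal(size=3)                     # an arbitrary point y
c = X.mean(axis=0)                         # c = mean(X)
lhs = ((X - y) ** 2).sum()                 # sum over x of d(x, y)^2
rhs = ((X - c) ** 2).sum() + len(X) * ((c - y) ** 2).sum()
assert np.isclose(lhs, rhs)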
Part (a) follows by substituting c = (|X_1| c_1 + |X_2| c_2) / (|X_1| + |X_2|) in the above equation.
In the previous section we saw that a carefully chosen seeding can lead to a good approxima-
tion for the k-means objective. In this section we will see how to design much better (constant
factor) approximation algorithms for k-means (as well as k-median). We will describe a very
generic approach based on local search. These algorithms work by making local changes to
a candidate solution and improving it at each step. They have been successfully used for
a variety of optimization problems [7, 28, 36, 40, 58, 61]. Kanungo et al. [47] analyzed a
simple local search based algorithm for k-means, a sketch of which is given below.
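The following is our minimal rendering of the single-swap local search, with candidate centers restricted to data points (as in the analysis that follows) and the idealized acceptance rule that takes any strict improvement:

import numpy as np

def kmeans_cost(X, centers):
    # Sum over all points of the squared distance to the nearest center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def local_search_kmeans(X, k, rng=None):
    # Start from k arbitrary data points; keep swapping one current center for
    # one data point whenever doing so strictly lowers the k-means cost.
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    cost = kmeans_cost(X, centers)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for x in X:
                trial = centers.copy()
                trial[i] = x
                trial_cost = kmeans_cost(X, trial)
                if trial_cost < cost:
                    centers, cost, improved = trial, trial_cost, True
    return centers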
We would like to point out that in order to make the above algorithm run in polynomial
time, one needs to change the criterion in the while loop to be Φ(T_curr) < (1 − ǫ)Φ(T_old). The
running time will then depend polynomially on n and 1/ǫ. For simplicity of analysis, we will
prove the following theorem for the idealized version of the algorithm with no ǫ.
Theorem 5 ([47]). Let S be the final set of centers returned by the above procedure. Then,
Φ(S) ≤ 50OPT.
In order to prove the above theorem we start by building up some notation. Let T be
the set of k data points returned by the local search algorithm as candidate centers. Let O
be the set of k data points which achieve the minimum value of the k-means cost function
among all sets of k data points. Note that the centers in O do not necessarily represent the
optimal solution as the optimal centers might not be data points. However using the next
lemma one can show that using data points as centers is only twice as bad as the optimal
solution.
Lemma 6. Given C ⊆ Rd , and the optimal k-means clustering of C, {C1 , C2 , · · · Ck }, there
exists a set S of k data points such that Φ(S) ≤ 2OPT.
Proof. For a given set C ⊆ ℜ^d, let ∆_1²(C) represent the 1-means cost of C. From Fact 2 it is easy
to see that this cost is achieved by choosing the mean of C as the center. In order to prove the
above lemma it is enough to show that for each optimal cluster C_i with mean c_i, there exists
a data point x_i ∈ C_i such that Σ_{x∈C_i} d(x, x_i)² ≤ 2∆_1²(C_i). Let x_i be the data point in C_i
which is closest to c_i. Again using Fact 2 we have Σ_{x∈C_i} d(x, x_i)² = ∆_1²(C_i) + |C_i| d(x_i, c_i)² ≤
2∆_1²(C_i), where the last inequality holds because x_i is the closest data point to c_i, so
|C_i| d(x_i, c_i)² ≤ Σ_{x∈C_i} d(x, c_i)² = ∆_1²(C_i).
Hence it is enough to compare the cost of the centers returned by the algorithm to the cost
of the optimal centers using data points. In particular, we will show that Φ(T ) ≤ 25Φ(O).
We start with the simple observation that by the property of the local search algorithm, for
any t ∈ T , and o ∈ O, swapping t for o results in an increase in cost. In other words
Φ(T − t + o) − Φ(T ) ≥ 0 (4.1)
The main idea is to add up Equation 4.1 over a carefully chosen set of swaps {o, t} to get
the desired result. In order to describe the set of swaps chosen we start by defining a cover
graph.
Definition 2. A cover graph is a bipartite graph with the centers in T on one side and the
centers in O on the other side. For each o ∈ O, let to be the point in T which is closest to
o. The cover graph contains edges of the form (o, t_o) for all o ∈ O.
Consider a swap {o, t} output by using the cover graph. We will apply Equation 4.1 to this
pair. We will explicitly define a clustering using centers in T −t+o and upper bound its cost.
8
We will then use the lower bound of Φ(T ) from Equation 4.1 to get the kind of equations
we want to sum up over. Let the clustering given by centers in T be C1 , C2 , · · · Ck . Let Co ∗
be the cluster corresponding to center o in the optimal clustering given by O. Let ox be the
closest point in O to x. Similarly let tx be the closest point in T to x. The key property
satisfied by any output pair {o, t} is the following
Fact 7. Let {o, t} be a swap pair output using the cover graph. Then we have that for any
x ∈ C_t either o_x = o or t_{o_x} ≠ t.
Proof. Assume that for some x ∈ C_t, o_x = o′ ≠ o. By the procedure used to output swap
pairs we have that t has degree 1 or 0 in the cover graph. In addition, if t has degree 1 then
t_o = t. In both cases we have that t_{o′} ≠ t.
Next we create a new clustering by swapping o for t and assigning all the points in C_o* to
o. Next we reassign the points in C_t \ C_o*. Consider a point x ∈ C_t \ C_o*. Clearly o_x ≠ o. Let
t_{o_x} be the point in T which is connected to o_x in the cover graph. We assign x to t_{o_x}. One
needs to ensure here that t_{o_x} ≠ t, which follows from Fact 7. From Equation 4.1 the increase
in cost due to this reassignment must be non-negative. In other words we have
      Σ_{x∈C_o*} ( d(x, o)² − d(x, t_x)² ) + Σ_{x∈C_t \ C_o*} ( d(x, t_{o_x})² − d(x, t)² ) ≥ 0   (4.2)
We will add up Equation 4.2 over the set of all good swaps.
In order to sum up over all swaps notice that in the first term in Equation 4.2 every point
x ∈ C appears exactly once, by being in C_o* for some o ∈ O. Hence the sum over all swaps
of the first term can be written as Σ_{x∈C} (d(x, o_x)² − d(x, t_x)²). Consider the second term
in Equation 4.2. We have that (d(x, t_{o_x})² − d(x, t)²) ≥ 0 since x is in C_t. Hence we can
extend the second summation to all x ∈ C_t without affecting the inequality. Also every
point x ∈ C appears at most twice in the second term, by being in C_t for some t ∈ T. Hence
the sum over all swaps of the second term is at most 2 Σ_{x∈C} (d(x, t_{o_x})² − d(x, t_x)²). Adding
these up and rearranging we get that

      Φ(O) − 3Φ(T) + 2R ≥ 0   (4.3)

where R = Σ_{x∈C} d(x, t_{o_x})².
In the last part we will upper bound the quantity R. R represents the cost of assigning
every point x to a center in T but not necessarily the closest one. Hence, R ≥ Φ(T ) ≥ Φ(O).
However we next show that this reassignment cost is not too large.
Notice that R can also be written as Σ_{o∈O} Σ_{x∈C_o*} d(x, t_o)². Also Σ_{x∈C_o*} d(x, t_o)² =
Σ_{x∈C_o*} d(x, o)² + |C_o*| d(o, t_o)². Hence we have that R = Σ_{o∈O} Σ_{x∈C_o*} (d(x, o)² + d(o, t_o)²).
Also note that d(o, t_o) ≤ d(o, t_x) for any x. Hence

      R ≤ Σ_{o∈O} Σ_{x∈C_o*} ( d(x, o)² + d(o, t_x)² ) = Σ_{x∈C} ( d(x, o_x)² + d(o_x, t_x)² ).
Using triangle inequality we know that d(ox , tx ) ≤ d(ox , x) + d(x, tx ). Substituting above
and expanding we get that
      R ≤ 2Φ(O) + Φ(T) + 2 Σ_{x∈C} d(x, o_x) d(x, t_x)   (4.4)
The last term in the above equation can be bounded using the Cauchy-Schwarz inequality as
Σ_{x∈C} d(x, o_x) d(x, t_x) ≤ √Φ(O) √Φ(T). So we have that R ≤ 2Φ(O) + Φ(T) +
2√Φ(O)√Φ(T). Substituting this in Equation 4.3 and solving we get the desired result
that Φ(T) ≤ 25Φ(O). Combining this with Lemma 6 proves Theorem 5.
A natural generalization of Algorithm Local search is to swap more than one center
at each step. This could potentially lead to a much better local optimum. This multi-swap
scheme was analyzed by [47], and using a similar analysis as above one can show the following:
Theorem 8. Let S be the final set of centers output by the local search algorithm which swaps up to
p centers at a time. Then we have that Φ(S) ≤ 2(3 + 2/p)² OPT, where OPT is the cost of the
optimal k-means solution.
For the case of k-median the same algorithm and analysis gives [16]
Theorem 9. Let S be the final set of centers output by the local search algorithm which swaps up to
p centers at a time. Then we have that Φ(S) ≤ (3 + 2/p) OPT, where OPT is the cost of the
optimal k-median solution.
This approximation factor for k-median has recently been improved to (1 + √3 + ǫ) [50].
For the case of k-means in Euclidean space [48] give an algorithm which achieves a (1 + ǫ)
approximation to the k-means objective for any constant ǫ > 0. However the runtime of the
algorithm depends exponentially in k and hence it is only suitable for small instances.
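To make the local search procedure concrete, the following is a minimal Python sketch (with numpy) of the single-swap heuristic for k-means with centers restricted to data points. It is only an illustration of the scheme analyzed above, not the exact procedure from [47]; all function and variable names are ours.

import numpy as np

def kmeans_cost(X, centers):
    # total squared distance of every point to its nearest center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def local_search_single_swap(X, k, seed=0, max_iters=50):
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = list(rng.choice(n, size=k, replace=False))   # indices of current centers
    cost = kmeans_cost(X, X[centers])
    for _ in range(max_iters):
        best_cost, best_centers = cost, None
        for i in range(k):                                  # try every single swap (t, o)
            for o in range(n):
                if o in centers:
                    continue
                trial = centers.copy()
                trial[i] = o
                c = kmeans_cost(X, X[trial])
                if c < best_cost:
                    best_cost, best_centers = c, trial
        if best_centers is None:                            # local optimum reached
            break
        cost, centers = best_cost, best_centers
    return X[centers], cost

In the worst-case analyses cited above, a swap is typically accepted only if it improves the cost by a fixed factor; that refinement is what keeps the number of iterations polynomial.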
In this part of the chapter we delve into some of the more modern research in the theory of clustering. In the recent past there has been increasing interest in designing clustering algorithms that enjoy strong theoretical guarantees on non-worst-case instances. This is of significant interest for two reasons: a) from a theoretical point of view, it helps us understand and characterize the class of problems for which one can get optimal or close to optimal guarantees; b) from a practical point of view, real world instances often have additional structure that can be exploited to get better performance. Compared to worst case analysis, the main challenge here is to formalize well motivated and interesting additional structures of clustering instances under which good algorithms exist. In this section we present two popular and interesting notions.
5.1 ǫ-separability

Consider a k-means instance and let OPT(k) denote the cost of its optimal clustering into k clusters, where each cluster Ci with center ci contributes its k-means cost, i.e., ∑_{x∈Ci} d(x, ci)^2. Such an instance is called ǫ-separable if it satisfies OPT(k − 1) > (1/ǫ^2) OPT(k).
The definition is motivated by the following issue: when approaching a clustering problem, one typically has to decide how many clusters one wants to partition the data into, i.e., the value of k. If the k-means objective is the underlying criterion being used to judge the quality of a clustering, and the optimal (k − 1)-means clustering is comparable to the optimal k-means clustering, then one can in principle also use (k − 1) clusters to describe the data set. In fact this particular method is a very popular heuristic to find the number of hidden clusters in a data set: choose the value of k at which there is a significant increase in the k-means cost when going from k to k − 1. As an illustrative example, consider the case of a mixture of k spherical unit variance Gaussians in d dimensions whose pairwise means are separated by a distance D ≫ 1. Given n points from each Gaussian, the optimal k-means cost with high probability is nkd. On the other hand, if we try to cluster this data using (k − 1) clusters, the optimal cost will now become n(k − 1)d + n(D^2 + d). Hence, taking the ratio of the two costs, this instance will be ǫ-separable for 1/ǫ^2 = ((k − 1)d + D^2 + d)/(kd) = 1 + D^2/(kd), so ǫ = (1 + D^2/(kd))^{−1/2}. Hence, if D ≫ √(kd), then the instance will be highly separable (the separability parameter ǫ will be o(1)).
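As a quick numerical illustration of this calculation, one can estimate the separability of a synthetic mixture by comparing approximately optimal k-means and (k − 1)-means costs. The Python sketch below is ours and uses scikit-learn's KMeans as a stand-in for the optimal clusterings, so the returned ratio is only an estimate of ǫ^2.

import numpy as np
from sklearn.cluster import KMeans

def separability_estimate(X, k, seed=0):
    # epsilon^2 is approximately OPT(k) / OPT(k-1); a small ratio means a separable instance
    cost_k = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
    cost_km1 = KMeans(n_clusters=k - 1, n_init=10, random_state=seed).fit(X).inertia_
    return cost_k / cost_km1

# k well separated spherical Gaussians in d dimensions
rng = np.random.default_rng(0)
k, d, n_per, spread = 5, 10, 200, 20.0
means = spread * rng.normal(size=(k, d))
X = np.vstack([m + rng.normal(size=(n_per, d)) for m in means])
print(separability_estimate(X, k))   # small ratio: the instance is highly separable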
It was shown by Ostrovsky et al. [57] that one can design much better approximation
algorithms for ǫ-separable instances.
Theorem 10 ([57]). There is a polynomial time algorithm which, given any ǫ-separable 2-means instance, returns a clustering of cost at most OPT/(1 − ρ) with probability at least 1 − O(ρ), where c2 ǫ^2 ≤ ρ ≤ c1 ǫ^2 for some constants c1, c2 > 0.

Theorem 11 ([57]). There is a polynomial time algorithm which, given any ǫ-separable k-means instance, returns a clustering of cost at most OPT/(1 − ρ) with probability at least 1 − O(ρ^{1/4}), where c2 ǫ^2 ≤ ρ ≤ c1 ǫ^2 for some constants c1, c2 > 0.
Notice that the above algorithm does not need to know the value of ǫ from the separability of the instance. Define ri to be the radius of cluster Ci in the optimal k-means clustering, i.e., ri^2 = OPT_i/|Ci|, where OPT_i denotes the contribution of cluster Ci to OPT. The main observation is that under the ǫ-separability condition, the optimal k-means clustering is “spread out”: the radius of any cluster is much smaller than the inter-cluster distances. This can be formulated in the following lemma.

Lemma 12. For all i ≠ j, d(ci, cj)^2 ≥ ((1 − ǫ^2)/ǫ^2) max(ri^2, rj^2).
Proof. Given an ǫ-separable instance of k-means, consider any two clusters Ci and Cj in the optimal clustering, with centers ci and cj respectively. Consider the (k − 1) clustering obtained by deleting cj and assigning all the points in Cj to Ci. By ǫ-separability, the cost of this new clustering must be at least OPT/ǫ^2. However, the increase in the cost will be exactly |Cj| d(ci, cj)^2; this follows from the simple observation stated in Fact 2. Hence we have that |Cj| d(ci, cj)^2 > (1/ǫ^2 − 1) OPT. This gives us that rj^2 ≤ OPT/|Cj| < (ǫ^2/(1 − ǫ^2)) d(ci, cj)^2. Similarly, if we delete ci and assign all the points in Ci to Cj we get that ri^2 ≤ (ǫ^2/(1 − ǫ^2)) d(ci, cj)^2.
When dealing with the two-means problem, if one could find two initial candidate center points which are close to the corresponding optimal centers, then one could hope to run a Lloyd's-type step and improve the solution quality. In particular, if we could find c̄1 and c̄2 such that d(c1, c̄1)^2 ≤ α r1^2 and d(c2, c̄2)^2 ≤ α r2^2, then we know from Fact 2 that using these center points will give us a (1 + α) approximation to OPT. Lemma 12 suggests the following approach: pick data points x, y with probability proportional to d(x, y)^2. We will show that this will lead to seed points ĉ1 and ĉ2 not too far from the optimal centers. Applying a Lloyd-type reseeding step will then lead us to the final centers, which will be much closer to the optimal centers. We start by defining the core of a cluster.
Definition 3 (Core of a cluster). Let ρ < 1 be a constant. We define Xi = {x ∈ Ci : d(x, ci)^2 ≤ ri^2/ρ}. We call Xi the core of the cluster Ci.
We next show that if we pick initial seeds {ĉ1, ĉ2} = {x, y} with probability proportional to d(x, y)^2, then with high probability the points lie within the cores of different clusters.

Lemma 13. For sufficiently small ǫ and ρ = 100ǫ^2/(1 − ǫ^2), we have Pr[{ĉ1, ĉ2} ∩ X1 ≠ ∅ and {ĉ1, ĉ2} ∩ X2 ≠ ∅] = 1 − O(ρ).
[Figure 3: the two clusters, with centers c1 and c2 at distance d and a core of squared radius r^2/ρ around each center.]
Proof Sketch. For simplicity assume that the sizes of the two clusters are the same, i.e., |C1| = |C2| = n/2. In this case, we have r1^2 = r2^2 = 2OPT/n = r^2. Also, let d(c1, c2)^2 = d^2. From ǫ-separability, we know that d^2 > ((1 − ǫ^2)/ǫ^2) r^2. Also, from the definition of the core, we know that at least a (1 − ρ) fraction of the mass of each cluster lies within the core. Hence, the clustering instance looks like the one shown in Figure 3. Let A = ∑_{x∈X1, y∈X2} d(x, y)^2 and B = ∑_{{x,y}⊂C} d(x, y)^2. Then the probability of the event is exactly A/B. Let's analyze the quantity B first. The proof goes by arguing that the pairwise distances between X1 and X2 will dominate B. This is because of Lemma 12, which says that d^2 is much greater than r^2, the average squared radius of a cluster. More formally, from Corollary 3 and from Fact 4 we can get that B = n∆1^2(C) = n∆2^2(C) + (n^2/4) d^2. In addition, ǫ-separability tells us that ∆1^2(C) > (1/ǫ^2) ∆2^2(C). Hence we get that B ≤ n^2 d^2 / (4(1 − ǫ^2)). Let's analyze A = ∑_{x∈X1, y∈X2} d(x, y)^2. From the triangle inequality, we have that for any x ∈ X1, y ∈ X2, d(x, y)^2 ≥ (d − 2r/√ρ)^2. Hence A ≥ (1/4)(1 − ρ)^2 n^2 (d − 2r/√ρ)^2. Substituting these bounds and using the fact that ρ = O(ǫ^2) gives us that A/B ≥ 1 − O(ρ).
Using these initial seeds we now show that a single step of a Lloyd's-type method can yield a good solution. Define r = d(ĉ1, ĉ2)/3. Define c̄1 as the mean of the points in B(ĉ1, r) and c̄2 as the mean of the points in B(ĉ2, r). Notice that instead of taking the mean of the Voronoi partition corresponding to ĉ1 and ĉ2, we take the mean of the points within a small radius of the given seeds.

Lemma 14. Given ĉ1 ∈ X1 and ĉ2 ∈ X2, the clustering obtained using c̄1 and c̄2 as centers has 2-means cost at most OPT/(1 − ρ).

Proof. We will first show that X1 ⊆ B(ĉ1, r) ⊆ C1. Using Lemma 12 we know that d(ĉ1, c1) ≤ (ǫ/√(ρ(1 − ǫ^2))) d(c1, c2) ≤ d(c1, c2)/10 for sufficiently small ǫ. Similarly d(ĉ2, c2) ≤ d(c1, c2)/10. Hence we get that (4/5) d(c1, c2) ≤ d(ĉ1, ĉ2) ≤ (6/5) d(c1, c2), i.e., (4/15) d(c1, c2) ≤ r ≤ (2/5) d(c1, c2). So for any z ∈ B(ĉ1, r), d(z, c1) ≤ d(c1, c2)/2, hence z ∈ C1. Also, for any z ∈ X1, d(z, ĉ1) ≤ 2 r1/√ρ ≤ r, so X1 ⊆ B(ĉ1, r). Similarly one can show that X2 ⊆ B(ĉ2, r) ⊆ C2. Now applying Fact 4 we can claim that d(c̄1, c1)^2 ≤ (ρ/(1 − ρ)) r1^2 and d(c̄2, c2)^2 ≤ (ρ/(1 − ρ)) r2^2. So using c̄1 and c̄2 as centers we get a clustering of cost at most OPT + (ρ/(1 − ρ)) OPT = OPT/(1 − ρ).
Summarizing the discussion above, we have the following simple algorithm for the 2-means
problem.
Algorithm 2-means
1. Seeding: Choose initial seeds ĉ1 = x and ĉ2 = y with probability proportional to d(x, y)^2.
2. Given seeds ĉ1, ĉ2, let r = d(ĉ1, ĉ2)/3. Define c̄1 = mean(B(ĉ1, r)) and c̄2 = mean(B(ĉ2, r)).
3. Output: c̄1 and c̄2 as the cluster centers.
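A direct Python rendering of Algorithm 2-means might look as follows. This is a minimal sketch of ours in which the seeding step samples the pair {x, y} with probability proportional to d(x, y)^2 by forming all pairwise squared distances explicitly, which is fine for moderate n.

import numpy as np

def two_means(X, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # seeding: pick the pair {x, y} with probability proportional to d(x, y)^2
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    iu, ju = np.triu_indices(n, k=1)
    probs = d2[iu, ju] / d2[iu, ju].sum()
    pair = rng.choice(len(probs), p=probs)
    c1_hat, c2_hat = X[iu[pair]], X[ju[pair]]
    # one Lloyd-type step restricted to small balls around the seeds
    r = np.linalg.norm(c1_hat - c2_hat) / 3.0
    c1_bar = X[np.linalg.norm(X - c1_hat, axis=1) <= r].mean(axis=0)
    c2_bar = X[np.linalg.norm(X - c2_hat, axis=1) <= r].mean(axis=0)
    return c1_bar, c2_bar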
5.3 Proof Sketch and Intuition for Theorem 11
In order to generalize the above argument to the case of k clusters, one could follow a similar approach and start with k initial seed centers. Again we start by choosing x, y with probability proportional to d(x, y)^2. After choosing a set U of points, we choose the next point z with probability proportional to min_{ĉi∈U} d(z, ĉi)^2. Using a similar analysis as in Lemma 13 one can show that if we pick k seeds then with probability (1 − O(ρ))^k they will lie within the cores of different clusters. However, this probability of success is exponentially small in k and is not good for our purpose. The approach taken in [57] is to sample a larger set of points and argue that with high probability it is going to contain k seed points from the “outer” cores of different clusters. Here we define the outer core of a cluster as Xi^{out} = {x ∈ Ci : d(x, ci)^2 ≤ ri^2/ρ^3}; this notion is similar to the core notion for k = 2, except that the radius of the core is bigger by a factor of 1/ρ than before. We would like to point out again that a similar seeding procedure as the one described above is used in the k-means++ algorithm [15] (see Section 2). One can show that using k seed centers in this way gives an O(log(k))-approximation to the k-means objective in the worst case.
Lemma 15 ([57]). Let N = 2k/(1 − 5ρ) + ln(2/δ)/(2(1 − 5ρ)^2), where ρ = √ǫ. If we sample N points using the sampling procedure, then Pr[for all j = 1, . . . , k, there exists some sampled point x̂ ∈ Xj^{out}] ≥ 1 − δ.
Since we sample more than k points in the first step, one needs to extract k good seed points out of this set before running the Lloyd step. This is achieved by a greedy deletion procedure that repeatedly removes seed points until only k remain (see [57] for the details). Using the above lemma and applying the same Lloyd step as in the 2-means problem, we get a set of k good final centers. These centers have the property that for each i, d(ci, c̄i)^2 ≤ (ρ/(1 − ρ)) ri^2. Putting the above argument together formally we get the desired result.
5.4 Approximation stability
In [20] Balcan et al. introduce and analyze a class of approximation-stable instances for which they provide polynomial time algorithms for finding accurate clusterings. The starting point of this work is that for many problems of interest to machine learning, such as clustering proteins by function, images by subject, or documents by topic, there is some unknown correct target clustering. In such cases the implicit hope when pursuing an objective-based clustering approach (k-means or k-median) is that approximately optimizing the objective function will in fact produce a clustering of low clustering error, i.e., a clustering that is pointwise close to the target clustering. Balcan et al. have shown that by making this implicit assumption explicit, one can efficiently compute a low-error clustering even in cases where approximating the objective function is NP-hard! This is quite interesting since it shows that by exploiting the properties of the problem at hand one can solve the desired problem and bypass worst case hardness results. A similar stability assumption, regarding additive approximations, was presented in [54]. The work of [54] studied sufficient conditions under which the stability assumption holds true.
Here the term “target” clustering refers to the ground truth clustering of X which one is trying to approximate. It is also important to clarify what we mean by an ǫ-close clustering. Given two k-clusterings C and C* of n points, the distance between them is measured as dist(C, C*) = min_{σ∈S_k} (1/n) ∑_{i=1}^{k} |Ci \ C*_{σ(i)}|. We say that C is ǫ-close to C* if dist(C, C*) ≤ ǫ. An instance then satisfies (1 + α, ǫ)-approximation-stability if every clustering whose objective cost is at most (1 + α) times the optimal cost is ǫ-close to the target clustering.
Notice that the above theorem is valid even for values of α for which getting a (1 + α)-approximation to k-median and k-means is NP-hard! In a recent paper, [4] show that running the k-means++ algorithm on approximation-stable instances of k-means gives a constant factor approximation with probability Ω(1/k). In the following we will provide a sketch of the proof of Theorem 17 for k-means clustering.
A key consequence of approximation stability is that most of the points are much closer to their own center than to the centers of other clusters. Specifically:
Lemma 18. If the instance (M, X) satisfies (1 + α, ǫ)-approximation-stability then fewer than 6ǫn points satisfy w2(x)^2 − w(x)^2 ≤ αOPT/(2ǫn), where w(x) denotes the distance of x to its closest center in the optimal k-means clustering and w2(x) its distance to the second closest such center.
Proof. Let C* be the optimal k-means clustering. First notice that by approximation-stability dist(C*, CT) = ǫ* ≤ ǫ. Let B be the set of points that satisfy w2(x)^2 − w(x)^2 ≤ αOPT/(2ǫn), and let us assume that |B| > 6ǫn. We will create a new clustering C′ by transferring some of the points in B to their second closest center. In particular, it can be shown that there exists a subset of B of size at least |B|/3 > 2ǫn such that reassigning any point in this subset increases the distance of the clustering from C* by 1/n. Reassigning 2ǫn such points therefore yields a clustering C′ which is more than 2ǫ away from C* and hence at least ǫ away from CT. However, the increase in cost in going from C* to C′ is at most 2ǫn · αOPT/(2ǫn) = αOPT, so C′ is a (1 + α)-approximation. This contradicts the approximation stability assumption.
Let us define dcrit = √(αOPT/(50ǫn)) as the critical distance. We call a point x good if it satisfies w(x)^2 < dcrit^2 and w2(x)^2 − w(x)^2 > 25 dcrit^2; otherwise we call x a bad point. Let B be the set of all bad points and let Gi be the good points in target cluster i. By Lemma 18, at most 6ǫn points satisfy w2(x)^2 − w(x)^2 ≤ 25 dcrit^2. Also, from Markov's inequality, at most 50ǫn/α points can have w(x)^2 > dcrit^2. Hence |B| = O((ǫ/α) n).
Given Lemma 18, if we then define the τ-threshold graph Gτ = (S, Eτ) to be the graph produced by connecting all pairs {x, y} of points with d(x, y) < τ, and consider τ = 2dcrit, we get the following properties:
(1) For x, y ∈ Ci* such that x and y are good points, we have {x, y} ∈ E(Gτ).
(2) For x ∈ Ci* and y ∈ Cj* with i ≠ j such that x and y are good points, {x, y} ∉ E(Gτ).
(3) For x ∈ Ci* and y ∈ Cj* with i ≠ j, x and y do not have any good point as a common neighbor.
Hence the threshold graph has the structure shown in Figure 4, where each Gi is a clique representing the set of good points in cluster i. This suggests the following algorithm for k-means clustering. Notice that unlike the algorithm for ǫ-separability, the algorithm for approximation stability mentioned below needs to know the values of the stability parameters α and ǫ.4
4 This is specifically for the goal of finding a clustering that nearly matches an unknown target clustering, because one may not in general have a way to identify which of two proposed solutions is preferable. On the other hand, if the goal is to find a solution of low cost, then one does not need to know α or ǫ: one can just try all possible values for dcrit in the algorithm and take the solution of least total cost.
Algorithm k-means (approximation stability)
Input: ǫ ≤ 1, α > 0, k.
1. Initialization: Define dcrit = √(αOPT/(50ǫn)).a
2. Threshold graph: Construct the threshold graph Gτ by connecting every pair of points at distance less than τ = 2dcrit.
3. Greedy extraction: For j = 1, . . . , k: pick the vertex vj of highest degree in the current graph, output vj together with its neighborhood as one cluster, and remove these vertices from the graph.
a For simplicity we assume here that one knows the value of OPT. If not, one can run a constant-factor approximation algorithm to estimate it.

[Figure 4: structure of the threshold graph: the good points of each target cluster form cliques G1, . . . , Gk, and no good point is adjacent to good points of two different clusters.]
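Under the stated assumptions, a minimal Python sketch of the threshold-graph construction and the greedy extraction step could look as follows. It takes dcrit as an argument (corresponding to the variant that tries several candidate values when OPT is unknown), and the final reassignment of leftover points to the nearest extracted cluster is a simplification of ours rather than a step taken from [20].

import numpy as np

def approx_stability_kmeans(X, k, d_crit):
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    adj = dist < 2.0 * d_crit                     # threshold graph with tau = 2 * d_crit
    np.fill_diagonal(adj, False)
    active = np.ones(n, dtype=bool)
    labels = np.full(n, -1)
    for j in range(k):
        deg = (adj & active[None, :]).sum(axis=1)
        deg[~active] = -1
        v = int(deg.argmax())                     # highest-degree active vertex
        members = active & (adj[v] | (np.arange(n) == v))
        labels[members] = j
        active &= ~members
    for i in np.where(labels == -1)[0]:           # our simplification: attach leftover points
        dists = [dist[i][labels == j].min() if (labels == j).any() else np.inf
                 for j in range(k)]
        labels[i] = int(np.argmin(dists))
    return labels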
The authors in [20] use the properties of the threshold graph to show that the greedy method of Step 3 of the algorithm produces an accurate clustering. In particular, if the vertex vj we pick is a good point in some cluster Ci, then we are guaranteed to extract the whole set Gi of good points in that cluster, and potentially some bad points as well (see Figure 5(a)). If on the other hand the vertex vj we pick is a bad point, then we might extract only a part of a good set Gi and miss some good points in Gi, which might lead to some errors. (Note that by property (3) we never extract parts of two different good sets Gi and Gj.) However, since vj was picked to be the vertex of highest degree in Gτ, we are guaranteed to extract at least as many bad points as the number of missed good points in Gi (see Figure 5(b)). This then implies that overall we can charge the errors to the bad points, so the resulting clustering disagrees with the target clustering on at most O((ǫ/α) n) points, as desired.
Figure 5: If the greedy algorithm chooses a good vertex vj as in (a), we get the entire good
set of points from that cluster. If vj is a bad point as in (b), the missed good points can be
charged to bad points.
While ǫ-separability implies approximation-stability for suitable parameters, the converse is not necessarily the case: an instance could satisfy approximation-stability without being ǫ-separated.5 [21] presents a specific example of points in Euclidean space with c = 2. In fact, for the case that k is much larger than 1/ǫ, the difference between the two properties can be more substantial. See Figure 6 for an example. In addition, algorithms for approximation stability have been successfully applied to clustering problems arising in computational biology [68] (see Section 5.8 for details).
[17] study center-based clustering objectives and define a notion of stability called α-weak deletion stability. A clustering instance is stable under this notion if, in the optimal clustering, merging any two clusters into one increases the cost by a multiplicative factor of (1 + α). This is a broad notion of stability that generalizes both the ǫ-separability notion studied in Section 5.1 and approximation stability in the case of large cluster sizes. Remarkably, [17] show that for such instances of k-median and k-means one can design a (1 + ǫ) approximation algorithm for any ǫ > 0. This leads to immediate improvements over the works of [20] (for the case of large clusters) and of [57]. However, the runtime of the resulting algorithm depends polynomially on n and k and exponentially on the parameters 1/α and 1/ǫ, so the simpler algorithms of [57] and [20] are more suitable for scenarios where one expects the stronger properties to hold. See Section 5.8 for further discussion. [3] also study various notions of clusterability of a dataset and present algorithms for such stable instances.
Kumar and Kannan [49] consider the problem of recovering a target clustering under
deterministic separation conditions that are motivated by the k-means objective and by
Gaussian and related mixture models. They consider the setting of points in Euclidean
space, and show that if the projection of any data point onto the line joining the mean of
its cluster in the target clustering to the mean of any other cluster of the target is Ω(k)
standard deviations closer to its own mean than the other mean, then they can recover the
target clusters in polynomial time. This condition was further analyzed and reduced by
work of [18]. This separation condition is formally incomparable to approximation-stability
(even restricting to the case of k-means with points in Euclidean space). In particular,
5 [57] shows an implication in this direction (Theorem 5.2); however, this implication requires a substantially stronger condition, namely that the data satisfy (c, ǫ)-approximation-stability for c = 1/ǫ^2 (and that target clusters be large). In contrast, the primary interest of [21] is in the case where c is below the threshold for existence of worst-case approximation algorithms.
Figure 6: Suppose ǫ is a small constant, and consider a clustering instance in which the target consists of k = √n clusters with √n points each, such that all points in the same cluster have distance 1 and all points in different clusters have distance D + 1, where D is a large constant. Then, merging two clusters increases the cost additively by Θ(√n), since D is a constant. Consequently, the optimal (k − 1)-means/median solution is just a factor 1 + O(1/√n) more expensive than the optimal k-means/median clustering. However, for D sufficiently large compared to 1/ǫ, this example satisfies (2, ǫ)-approximation-stability or even (1/ǫ, ǫ)-approximation-stability; see [21] for formal details.
if the dimension is low and k is large compared to 1/ǫ, then this condition can require
more separation than approximation-stability (e.g., with k well-spaced clusters of unit radius
approximation-stability would require separation only O(1/ǫ) and independent of k – see [21]
for an example). On the other hand if the clusters are high-dimensional, then this condition
can require less separation than approximation-stability since the ratio of projected distances
will be more pronounced than the ratios of distances in the original space.
Bilu and Linial [25] consider inputs satisfying the condition that the optimal solution to the objective remains optimal even after bounded perturbations to the input weight matrix. This condition is known as perturbation resilience. Bilu and Linial [25] give an algorithm for a different clustering objective known as maxcut. The maxcut objective asks for a 2-partitioning of a graph such that the total number of edges going between the two pieces is maximized. The authors show that the maxcut objective is easy under the assumption that the optimal solution is stable to O(n^{2/3})-factor multiplicative perturbations to the edge weights. The work of Makarychev et al. [52] subsequently reduced the required resilience factor to O(√(log n)). In [18] the authors study perturbation resilience for center-based clustering objectives such as k-median and k-means, and give an algorithm that finds the optimal solution when the input is stable to only factor-3 perturbations. This factor is improved to 1 + √2 by [22], who also design algorithms under a relaxed (c, ǫ)-stability to perturbations condition in which the optimal solution need not be identical on the c-perturbed instance, but may change on an ǫ fraction of the points (in this case, the algorithms require c = 4).
Note that for the k-median objective, (c, ǫ)-approximation-stability with respect to C* implies (c, ǫ)-stability to perturbations, because an optimal solution in a c-perturbed instance is guaranteed to be a c-approximation on the original instance;6 so, (c, ǫ)-stability to perturbations is a weaker condition. Similarly, for k-means, (c, ǫ)-stability to perturbations is implied by (c^2, ǫ)-approximation-stability. However, as noted above, the values of c known to lead to efficient clustering in the case of stability to perturbations are larger than for approximation-stability, where any constant c > 1 suffices.
6 In particular, a c-perturbed instance d̃ satisfies d(x, y) ≤ d̃(x, y) ≤ c·d(x, y) for all points x, y. So, using Φ to denote cost in the original instance, Φ̃ to denote cost in the perturbed instance, and C̃ to denote the optimal clustering under Φ̃, we have Φ(C̃) ≤ Φ̃(C̃) ≤ Φ̃(C*) ≤ cΦ(C*).
Below we provide the run time guarantees of the various algorithms discussed so far. While
these may be improved with appropriate data structures, we assume here a straightforward
implementation in which computing the distance between two data points takes time O(d),
as does adding or averaging two data points. For example, computing a step of Lloyd’s algo-
rithm requires assigning each of the n data points to its nearest center, which in turn requires
taking the minimum of k distances per data point (so O(nkd) time total), and then resetting
each center to the average of all data points assigned to it (so O(nd) time total). This gives
Lloyd’s algorithm a running time of O(nkd) per iteration. The k-means++ algorithm has
only a seed-selection step, which can be run in time O(nd) per seed by remembering the
minimum distances of each point to the previous seeds, so it has a total time of O(nkd).
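The two bookkeeping ideas behind these bounds, an O(nkd) Lloyd iteration and O(nd)-per-seed k-means++ selection obtained by maintaining each point's squared distance to its nearest seed so far, can be sketched in Python as follows (a minimal illustration of ours, not a tuned implementation).

import numpy as np

def lloyd_step(X, centers):
    # O(nkd): assign each point to its nearest center, then recompute the means
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    return np.stack([X[assign == j].mean(axis=0) if (assign == j).any() else centers[j]
                     for j in range(len(centers))])

def kmeanspp_seeds(X, k, seed=0):
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]
    min_d2 = ((X - seeds[0]) ** 2).sum(axis=1)      # squared distance to nearest seed so far
    for _ in range(k - 1):
        probs = min_d2 / min_d2.sum()
        seeds.append(X[rng.choice(len(X), p=probs)])
        min_d2 = np.minimum(min_d2, ((X - seeds[-1]) ** 2).sum(axis=1))   # O(nd) update
    return np.stack(seeds)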
For the ǫ-separability algorithm, to obtain the sampling probabilities for the first two seeds one can compute all pairwise distances at a cost of O(n^2 d). Obtaining the rest of the seeds is faster since one only needs to compute distances to previous seeds, so this takes time O(ndk). Finally, there is a greedy deletion procedure at time O(ndk) per step for O(k) steps. So the overall time is O(n^2 d + ndk^2).
For the approximation-stability algorithm, creating a graph of distances takes time O(n2 d),
after which creating the threshold graph takes time O(n2) if one knows the value of dcrit .
For the rest of the algorithm, each step takes time O(n) to find the highest-degree vertex,
and then time proportional to the number of edges examined to remove the vertex and its
neighbors. Over the entire remainder of the algorithm this takes time O(n2 ) total. If the
value of dcrit is not known, one can try O(n) values, taking the best solution. This gives an
overall time of O(n3 + n2 d).
Finally, for local search, one can first create a graph of distances in time O(n2d). Each
local swap step has O(nk) pairs (x, y) to try, and for each pair one can compute its cost
in time O(nk) by computing the minimum distance of each data point to the proposed k
centers. So, the algorithm can be run in time O(n2 k 2 ) per iteration. The total number of
iterations is at most poly(n),7 so the overall running time is at most O(n^2 d + n^2 k^2 · poly(n)).
As can be seen from the table below, the algorithms become more and more computationally expensive if one needs formal guarantees on a larger instance space. For example, the local search algorithm provides worst case approximation guarantees on all instances but is very slow. On the other hand, Lloyd's method and k-means++ are very fast but provide bad worst case guarantees, especially when the number of clusters k is large. Algorithms based on stability notions aim to provide the best of both worlds by being fast and provably good on well behaved instances. In the concluding Section 7 we outline guidelines for practitioners when working with the various clustering assumptions.
7 The actual number of iterations depends on the cost of the initial solution and the stopping condition.
Method                      Runtime
Lloyd's                     O(nkd) × (# iterations)
k-means++                   O(nkd)
ǫ-separability              O(n^2 d + ndk^2)
Approximation stability     O(n^3 + n^2 d)
Local search                O(n^2 d + n^2 k^2 · poly(n))

Table 1: A run time analysis of various algorithms discussed in the chapter. The running time degrades as one requires formal guarantees on larger instance spaces.
5.8 Extensions
One appealing extension concerns settings where only limited distance information is available. [68] consider a model in which the algorithm can make one-versus-all distance queries, and show that an algorithm exploiting approximation-stability for the k-median objective finds a clustering that is very close to the target by using only O(k) one-versus-all queries in the large cluster case, and in addition is faster than the algorithm we present here.
described in [20] (similar to the one we described in Section 5.4 for the k-means objective)
can be implemented in O(|S|3) time, while the one proposed in [68] runs in time O(|S|k(k +
log |S|)). [68] use their algorithm to cluster biological datasets in the Pfam [38] and SCOP
[56] databases, where the points are proteins and distances are inversely proportional to
their sequence similarity. This setting nicely fits the one-versus all queries model because
one can use a fast sequence database search program to query a sequence against an entire
dataset. The Pfam [38] and SCOP [56] databases are used in biology to observe evolutionary
relationships between proteins and to find close relatives of particular proteins. [68] find
that for one of these sources they can obtain clusterings that almost exactly match the given
classification, and for the other the performance of their algorithm is comparable to that of
the best known algorithms using the full distance matrix.
6 Mixture Models
In the previous sections we saw worst case approximation algorithms for various clustering
objectives. We also saw examples of how assumptions on the nature of the optimal solution
can lead to much better approximation algorithms. In this section we will study a different
assumption on how the data is generated in the first place. In the machine learning literature,
such assumptions take the form of a probabilistic model for generating a clustering instance.
The goal is to cluster correctly (with high probability) an instance generated from the par-
ticular model. The most famous and well studied example of this is the Gaussian Mixture
Model (GMM) [46]. This will be the main focus of this section. We will illustrate conditions
under which datasets arising from such a mixture model can be provably clustered.
Gaussian Mixture Model A univariate Gaussian random variable X with mean µ and variance σ^2 has the density function f(x) = (1/(σ√(2π))) e^{−(x−µ)^2/(2σ^2)}. Similarly, a multivariate Gaussian random variable X ∈ ℜ^n has the density function

f(x) = (1/((2π)^{n/2} |Σ|^{1/2})) e^{−(1/2)(x−µ)^T Σ^{−1}(x−µ)}.
Here µ ∈ ℜ^n is called the mean vector and Σ is the n × n covariance matrix. A special case is the spherical Gaussian, for which Σ = σ^2 I_n. Here σ^2 refers to the variance of the Gaussian in any given direction. Consider k n-dimensional Gaussian distributions N(µ1, Σ1), N(µ2, Σ2), . . . , N(µk, Σk). A Gaussian mixture model M refers to the distribution obtained from a convex combination of such Gaussians. More specifically,

M = w1 N(µ1, Σ1) + w2 N(µ2, Σ2) + · · · + wk N(µk, Σk).

Here the wi ≥ 0 are called the mixing weights and satisfy ∑_i wi = 1. One can think of a point being generated from M by first choosing a component Gaussian i with probability wi, and then generating a point from the corresponding Gaussian distribution N(µi, Σi).
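The generative process just described is straightforward to simulate. The following Python sketch (ours) draws m points from a mixture of spherical Gaussians and also returns the hidden component labels, which is convenient for experimenting with the clustering algorithms discussed below.

import numpy as np

def sample_spherical_gmm(m, weights, means, sigma, seed=0):
    # weights: length-k mixing weights, means: (k, n) array, sigma: common standard deviation
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    comps = rng.choice(len(weights), size=m, p=weights / weights.sum())
    X = means[comps] + sigma * rng.normal(size=(m, means.shape[1]))
    return X, comps

X, z = sample_spherical_gmm(1000, [0.5, 0.3, 0.2],
                            means=np.array([[0, 0], [10, 0], [0, 10]]), sigma=1.0)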
Given a data set of m points coming from such a mixture model, a fairly natural question is to recover the individual components of the mixture model. This is a clustering problem where one wants to cluster the points into k clusters such that the points drawn from the same Gaussian are in a single partition. Notice that unlike in the previous sections, the
algorithms designed for mixture models will have probabilistic guarantees. In other words,
we would like the clustering algorithm to recover, with high probability, the individual com-
ponents. Here the probability is over the draw of the m sample points. Another problem one
could ask is to approximate the parameters (mean, variance) of each individual component
Gaussian. This is known as the parameter estimation problem. It is easy to see that if one
could solve the clustering problem approximately optimally, then estimating the parameters
of each individual component is also easy. Conversely, after doing parameter estimation
one can easily compute the Bayes optimal clustering. To study the clustering problem, one
typically assumes separation conditions among the component Gaussians which limit the
amount of overlap between them. The most common among them is to assume that the
mean vectors of the component Gaussians are far apart. However, there are also scenarios
where such separation conditions do not hold (consider two Gaussians which are aligned in an 'X' shape), yet the data can be clustered well. In order to do this, one first does param-
eter estimation which needs much weaker assumptions. After estimating the parameters,
the optimal clustering can be recovered. This is an important reason to study parameter
estimation. In the next section we will see examples of some separation conditions and the
corresponding clustering algorithms that one can use. Later, we will also look at recent work
on parameter estimation under minimal separation conditions.
In this section we will look at distance based clustering algorithms for learning a mixture
of Gaussians. For simplicity, we will start with the case of k spherical Gaussians in ℜn
with means {µ1 , µ2 , · · · , µk } and variance Σ = σ 2 In . The algorithms we describe will work
under the assumption that the means are far apart. We will call this the center separation property:
Definition 5 (Center Separation). A mixture of k identical spherical Gaussians satisfies
center separation if ∀i 6= j,
∆i,j = ||µi − µj || > βi,j σ
The quantity βi,j typically depends on k, the number of clusters; n, the dimensionality of the dataset; and wmin, the minimum mixing weight. If the spherical Gaussians have different
variances σi ’s, the R.H.S. is replaced by βi,j (σi + σj ). For the case of general Gaussians,
σi will denote the maximum variance of Gaussian i in any particular direction. One of the
earliest results using center separation for clustering is by Dasgupta [32]. We will start with a simple condition that βi,j = C√n, for some constant C > 4, and will also assume that wmin = Ω(1/k). Let's consider a typical point x from a particular Gaussian N(µi, σ^2 I_n). We have E[||x − µi||^2] = E[∑_{d=1}^{n} |x_d − µ_{i,d}|^2] = nσ^2. Now consider two typical points x and y drawn from two different Gaussians i and j; then E[||x − y||^2] = 2nσ^2 + ||µi − µj||^2 > (2 + C^2) nσ^2. For C large enough (say C > 4), we will have that for any two typical points x, y in the same cluster, ||x − y||^2 ≤ 2σ^2 n, while for any two points in different clusters ||x − y||^2 > 18σ^2 n.
Using standard concentration bounds we can say that for a sample of size poly(n), with high probability, all points from a single Gaussian will be closer to each other than to points from other Gaussians. In this case one could simply create a graph by connecting any two points x, y such that ||x − y||^2 ≤ 2σ^2 n. It is easy to see that the connected components in this graph will correspond precisely to the individual components of the mixture model. If C is smaller, say 2, one needs the stronger concentration result of [10] stated below.
Lemma 19. If x, y are picked independently from N(µi, σ^2 I_n), then with probability 1 − 1/n^3,
||x − y||^2 ∈ [2σ^2 n (1 − 4 log(n)/√n), 2σ^2 n (1 + 5 log(n)/√n)].
Also, as before, one can show that with high probability, for x and y from two different Gaussians, we have ||x − y||^2 > 2σ^2 n (1 + 4 log(n)/√n). From this it follows that if r is the minimum squared distance between any two points in the sample, then for any x in Gaussian i and any y in the same Gaussian, we have ||x − y||^2 ≤ (1 + 4.5 log(n)/√n) r, while for a point z in any other Gaussian we have ||x − z||^2 > (1 + 4.5 log(n)/√n) r. This suggests the following algorithm.
Algorithm Cluster Spherical Gaussians
1. Let D be the set of all sample points.
2. For i = 1 to k:
(a) Let x0 and y0 be such that ||x0 − y0||^2 = r = min_{x,y∈D} ||x − y||^2.
(b) Let T = {y ∈ D : ||x0 − y||^2 ≤ r (1 + 4.5 log(n)/√n)}.
(c) Output T as one of the clusters and remove it from D.
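A Python rendering of this distance-based procedure is given below. It is a minimal sketch of ours that, as in the algorithm above, repeatedly finds the closest remaining pair, grows a cluster around one of its endpoints using the threshold r(1 + 4.5 log(n)/√n), and removes that cluster before the next iteration.

import numpy as np

def cluster_spherical_gaussians(X, k):
    n = len(X)
    remaining = list(range(n))
    clusters = []
    for _ in range(k):
        if not remaining:
            break
        pts = X[remaining]
        d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
        np.fill_diagonal(d2, np.inf)
        i, j = np.unravel_index(np.argmin(d2), d2.shape)
        r = d2[i, j]                                        # minimum squared distance
        thresh = r * (1 + 4.5 * np.log(n) / np.sqrt(n))
        members = np.where(((pts - pts[i]) ** 2).sum(axis=1) <= thresh)[0]
        clusters.append([remaining[m] for m in members])
        member_set = set(members.tolist())
        remaining = [p for m, p in enumerate(remaining) if m not in member_set]
    return clusters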
Handling smaller C
For smaller values of C, for example C < 1, one cannot in general say that the above strong concentration will hold true. In fact, in order to correctly classify the points, we might need to see points which are much closer to the center of a Gaussian (say at distance less than (1/2)σ√n). However, most of the mass of a Gaussian lies in a thin shell around radius σ√n. Hence, one might have to see exponentially many samples in order to get a good classification. Dasgupta [32] solves this problem by first projecting the data onto a random d = O(log(k)/ǫ^2)-dimensional subspace. This has the effect that the center separation property is still preserved up to a factor of (1 − ǫ). One can now do distance based clustering in this subspace, as the number of samples needed will be proportional to 2^d instead of 2^n.
General Gaussians The results of Dasgupta were extended by Arora and Kannan [10] to the case of general Gaussians. They also managed to reduce the required separation between means. They assumed that βi,j = Ω(log(n))(Ri + Rj)(σi + σj). As mentioned before, σi denotes the maximum variance of Gaussian i in any direction. Ri denotes the median radius of Gaussian i.8 For the case of spherical Gaussians, this separation becomes Ω(n^{1/4} log(n)(σi + σj)). Arora and Kannan use isoperimetric inequalities to get strong concentration results for such Gaussians. In particular they show that
8 The radius such that the probability mass within Ri equals 1/2.
Theorem 20. Given βi,j = Ω(log(n)(Ri + Rj)), there exists a polynomial time algorithm which, given at least m = n^2 k^2/(δ^2 wmin^6) samples from a mixture of k general Gaussians, solves the clustering problem exactly with probability (1 − δ).
Proof Intuition: The first step is to generalize Lemma 19 to the case of general Gaussians. In particular, one can show that for x, y picked at random from a general Gaussian i with median radius Ri and maximum variance σi, we have with high probability

2Ri^2 − 18 log(n) σi Ri ≤ ||x − y||^2 ≤ 2(Ri + 20 log(n) σi)^2.

Similarly, for x, y from different Gaussians i and j, we have with high probability

||x − y||^2 > 2 min(Ri^2, Rj^2) + 120 log(n)(σi + σj)(Ri + Rj) + Ω((log(n))^2 (σi^2 + σj^2)).

The above concentration results imply (w.h.p.) that pairwise distances between points from a single Gaussian i lie in an interval Ii, and distances between points from Gaussians i and j lie in an interval Ii,j. Furthermore, Ii,j will be disjoint from the interval corresponding to the Gaussian with the smaller value of Ri. In particular, if one looks at balls of increasing radius around a point from the Gaussian with the minimum radius, there will be a stage at which there is a gap: increasing the radius slightly does not include any more points. From the above lemmas, this gap will be roughly Ω(σi). Hence, at this stage, we can remove this Gaussian from the data and recurse. This suggests the following algorithm outline: repeatedly find such a gap around the smallest remaining Gaussian, output the points inside it as one cluster, remove them, and recurse on the rest.
One point to mention is that one does not really know beforehand the value of σi at each iteration. Arora and Kannan [10] get around this by estimating the variance from the data in the ball B(x, r). They then show that this estimate is good enough for the algorithm to work.
6.2 Spectral Algorithms
The algorithms mentioned in the above section need the center separation to grow polyno-
mially with n. This is prohibitively large especially in cases when k ≪ n. In this section,
we look at how spectral techniques can be used to only require the separation to grow with
k instead of n.
Algorithmic Intuition In order to remove the dependence on n we would like to project the data such that points from the same Gaussian become much closer while still maintaining the large separation between means. One idea is to do a random projection. However, random projections from n to d dimensions scale each squared distance equally (by a factor d/n) and will not give us any advantage. Instead, consider the case of two spherical Gaussians with means µ1 and µ2 and variance σ^2 I_n, and consider projecting all the points onto the line joining µ1 and µ2. For the unit vector v along this line, (x − µ1)·v behaves like a one-dimensional Gaussian with mean 0 and variance σ^2 for any point x from the first Gaussian. Hence the expected squared distance of a projected point from its projected mean becomes σ^2. This means that for any two points in the same Gaussian, the expected squared distance in the projection becomes at most 4σ^2 (as opposed to 2nσ^2 in the original space). However, the distance between the means remains the same. In fact the above claim is true if we project onto any subspace containing the means. This subspace is exactly characterized by the Singular Value Decomposition (SVD) of the data matrix. This suggests the following algorithm.
Algorithm Spectral Clustering
1. Compute the SVD decomposition of the data.
2. Project the data onto the space of top-k right singular vectors.
3. Run a distance based clustering method in this projected space.
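In code, the projection step is just a truncated SVD; a minimal numpy sketch (ours) is given below, after which one can run any of the distance-based routines above on the projected points.

import numpy as np

def spectral_project(X, k):
    # SVD of the n x d data matrix; keep the top-k right singular vectors
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T        # n x k coordinates in the top-k singular subspace

# usage: Y = spectral_project(X, k); then cluster Y with a distance-based method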
Such spectral algorithms were proposed by Vempala and Wang [67], who reduced the separation for spherical Gaussians to βi,j = Ω(k^{1/4} (log(n/wmin))^{1/4}). The case of general Gaussians was studied in [2], who give efficient clustering algorithms for βi,j = 1/√(min(wi, wj)) + √(k log(k·min(2^k, n))). [45] give algorithms for general Gaussians for βi,j = k^{3/2}/wmin^2.
In the previous sections we looked at the problem of clustering points from a Gaussian
Mixture Model. Another important problem is that of estimating the parameters of the
component Gaussians. These parameters refer to the mixture weights wi ’s, mean vectors
µi's and the covariance matrices Σi's. As mentioned before, if one could efficiently obtain a good clustering, then the parameter estimation problem is solved by simply producing empirical estimates from the corresponding clusters. However, there could be scenarios where it is not possible to produce a good clustering. For example, consider two one-dimensional Gaussians with mean 0 and variances σ^2 and 2σ^2. These Gaussians have a large overlap and any clustering method will inherently have a large error. On the other hand, let's look at the statistical distance between the two Gaussians, i.e., ∫ |f1(x) − f2(x)| dx. This measures
how different the two distributions are. Even though the two Gaussians overlap heavily, their densities differ noticeably, and one can check that the statistical distance between them is a constant bounded away from zero. This suggests that, information theoretically, one should be able to estimate the parameters of such a mixture. In this section, we will
look at some recent work of Kalai, Moitra and Valiant [44] and of Moitra and Valiant [55] on efficient algorithms for estimating the parameters of a Gaussian mixture model. These works make minimal assumptions on the nature of the data, namely that the component Gaussians have noticeable statistical distance. Similar results were proven in [23], who also gave algorithms for more general distributions.
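For the concrete one-dimensional example above (variances σ^2 = 1 and 2σ^2 = 2), the statistical distance is easy to evaluate numerically; the short Python sketch below (ours) integrates |f1 − f2| on a grid and returns a value of roughly 0.33, a constant bounded away from zero even though the two components cannot be clustered apart.

import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = np.linspace(-20.0, 20.0, 200001)
f1 = gaussian_pdf(x, 0.0, 1.0)       # N(0, sigma^2) with sigma^2 = 1
f2 = gaussian_pdf(x, 0.0, 2.0)       # N(0, 2 sigma^2)
stat_dist = np.abs(f1 - f2).sum() * (x[1] - x[0])   # grid approximation of the integral
print(stat_dist)                     # roughly 0.33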
The case of two Gaussians:
We will first look at the case of 2 Gaussians in ℜ^n. We will assume that the statistical distance between the Gaussians, D(N1, N2), is noticeable, i.e., ∫ |f1(x) − f2(x)| dx > α. Kalai et al. [44] show the following theorem.

Theorem 21. Let M = w1 N1(µ1, Σ1) + w2 N2(µ2, Σ2) be an isotropic GMM where D(N1, N2) > α. Then, for any ǫ > 0, there is a polynomial time algorithm which outputs M′ = w′1 N′1(µ′1, Σ′1) + w′2 N′2(µ′2, Σ′2) such that for some permutation π : {1, 2} → {1, 2} we have

|wi − w′_{π(i)}| ≤ ǫ,   ||µi − µ′_{π(i)}|| ≤ ǫ,   ||Σi − Σ′_{π(i)}|| ≤ ǫ.
The condition that the mixture be isotropic is necessary to recover a good additive approximation for the means and the variances, since otherwise one could just scale the data and the estimates would scale proportionately.
Reduction to a one dimensional problem
In order to estimate the mixture parameters, Kalai et al. reduce the problem to a series of one dimensional learning problems. Consider an arbitrary unit vector v. Suppose we project the data onto the direction of v, and let the means of the Gaussians in this projected space be µ′1 and µ′2. Then we have that µ′1 = E[x · v] = µ1 · v. Hence, the coordinates of the original mean vector are linearly related to the mean in the projected space. Similarly, let's perturb v to get v′ = v + ǫ(ei + ej). Here ei and ej denote the basis vectors corresponding to coordinates i and j. Let σ′1^2 be the variance of the Gaussian in the projected space v′. Writing σ′1^2 in terms of E[(x · v′)^2] and expanding, we get that E[xi xj] will be linearly related to σ′1^2, σ1^2 and the µi's. Hence, by estimating the parameters correctly for a series of n^2 one-dimensional projections, one can efficiently recover the original parameters (by solving a system of linear equations).
Solving the one dimensional problem
The one dimensional problem is solved by the method of moments. In particular, define Li[M] to be the ith moment of the mixture model M, i.e., Li[M] = E_{x∼M}[x^i]. Also define L̂i to be the empirical ith moment of the data. The algorithm in [44] does a brute force search over the parameter space for the two Gaussians and, for a given candidate model M′, computes the first 6 moments. If all the moments are within ǫ of the empirical moments, then the analysis in [44] shows that the parameters of M′ will be ǫ^{1/67}-close to the parameters of the two Gaussians. The same claim is also true for learning a mixture of k one-dimensional Gaussians if one goes up to (4k − 2) moments [55]. The search space however will be exponential in k. It is shown in [55] that for learning k one-dimensional Gaussians, this exponential dependence is unavoidable.
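A toy version of the one-dimensional moment test is sketched below in Python (ours): it computes the first six raw moments of the data and of a candidate two-component model (using the standard Gaussian moment recursion), and accepts the candidate only if every moment is within ǫ of its empirical counterpart. The brute-force search over candidate parameters, which [44] carry out over a suitable discretization, is omitted.

import numpy as np

def gaussian_raw_moments(mu, var, num=6):
    # E[X^i] for X ~ N(mu, var) via the recursion M_i = mu*M_{i-1} + (i-1)*var*M_{i-2}
    M = [1.0, float(mu)]
    for i in range(2, num + 1):
        M.append(mu * M[-1] + (i - 1) * var * M[-2])
    return np.array(M[1:num + 1])

def mixture_raw_moments(w1, mu1, var1, mu2, var2, num=6):
    return (w1 * gaussian_raw_moments(mu1, var1, num)
            + (1 - w1) * gaussian_raw_moments(mu2, var2, num))

def empirical_moments(samples, num=6):
    return np.array([np.mean(samples ** i) for i in range(1, num + 1)])

def candidate_matches(samples, w1, mu1, var1, mu2, var2, eps):
    # accept the candidate model only if all first six moments are within eps
    gap = np.abs(empirical_moments(samples) - mixture_raw_moments(w1, mu1, var1, mu2, var2))
    return bool(np.all(gap <= eps))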
Solving the labeling problem
As noted above, the learning algorithm will solve n^2 one-dimensional problems and get parameter estimates for the two Gaussians for each one-dimensional problem. In order to solve for the parameters of the original Gaussians, we need to identify, for each Gaussian, the corresponding n^2 parameters across the subproblems. Kalai et al. do this by arguing that if one projects the two Gaussians onto a random direction v, then with high enough probability the corresponding parameters of the two projected Gaussians will differ by poly(α). Hence, if one takes small random perturbations of this vector v, the corresponding parameter estimates will be easily distinguishable.
When there are more than two components, an additional difficulty arises: some components may be much closer to each other than to the rest of the mixture. Consider, for instance, three components in which components 2 and 3 are much closer to each other than either is to component 1; a random one-dimensional projection will almost surely collapse components 2 and 3. [55] solve this problem by first running a clustering algorithm to separate components 2 and 3 from component 1 and recursively solving the two sub-instances. Once components 2 and 3 have been separated, one can scale the space
to ensure that they remain separated over a random projection. The algorithm from [55]
has a sample complexity which depends exponentially on k. They also show that this dependence is necessary. One could use the algorithm from [55] to also cluster the points into component Gaussians under minimal assumptions; the sample complexity, however, will depend exponentially on k. In contrast, one could use the algorithms from the previous sections to cluster in polynomial time under stronger separation assumptions. The work of [41, 9] removes the exponential dependence on k and designs polynomial time algorithms for clustering data from a GMM under minimal separation, assuming only that the mean vectors span a k-dimensional subspace. However, their algorithm, which is based on tensor decompositions, only works in the case when all the component Gaussians are spherical. It is an open question to get similar results for general Gaussians. There has also been work on clustering points from mixtures of other distributions. [31, 30] gave algorithms for clustering a mixture of heavy-tailed distributions. [27] gave algorithms for clustering a mixture of 2 Gaussians assuming only that the two distributions are separated by a hyperplane. The recent work of [49] studies a deterministic separation condition on a set of points and shows that any set of points satisfying this condition can be clustered accurately. Using this they easily derive many previously known results for clustering mixtures of Gaussians as corollaries.
7 Conclusion
In this chapter we presented a selection of recent work on clustering problems in the computer
science community. As is evident, the focus of all these works is on providing efficient
algorithms with rigorous guarantees for various clustering problems. In many cases, these
guarantees depend on the specific structure and properties of the instance at hand which are
captured by stability assumptions and/or distributional assumptions. The study of different stability assumptions also provides insights into the structural properties of real world data and in some cases also leads to practically useful algorithms [68]. As discussed in Section 5.6,
different assumptions are suited for different kinds of data and they relate to each other in
interesting ways. For instance, perturbation resilience is a much weaker assumption than
both ǫ-separability and approximation stability. However, we have algorithms with much
stronger guarantees for the latter two. As a practitioner, one is often torn between using algorithms with formal guarantees (which are typically slower) and fast heuristics like Lloyd's method. When dealing with data which may satisfy any of the stability notions proposed in this chapter, a general rule of thumb we suggest is to run the algorithms proposed in this chapter on a smaller random subset of the data and use the solution obtained to initialize fast heuristics like Lloyd's method. Current research on clustering algorithms
continues to explore more realistic notions of data stability and their implications for practical
clustering scenarios.
References
[1] Gibbs random fields, fuzzy clustering, and the unsupervised segmentation of textured
images. CVGIP: Graphical Models and Image Processing, 55(1):1 – 19, 1993.
[2] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In
Proceedings of the Eighteenth Annual Conference on Learning Theory, 2005.
[3] Margareta Ackerman and Shai Ben-David. Clusterability: A theoretical study. Journal
of Machine Learning Research - Proceedings Track, 5, 2009.
[4] Manu Agarwal, Ragesh Jaiswal, and Arindam Pal. k-means++ under approximation
stability. The 10th annual conference on Theory and Applications of Models of Compu-
tation, 2013.
[5] Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means
clustering. In Proceedings of the 12th International Workshop and 13th International
Workshop on Approximation, Randomization, and Combinatorial Optimization. Algo-
rithms and Techniques, APPROX ’09 / RANDOM ’09, 2009.
[6] N. Ailon, R. Jaiswal, and C. Monteleoni. Streaming k-means approximation. In Ad-
vances in Neural Information Processing Systems, 2009.
[7] Paola Alimonti. Non-oblivious local search for graph and hypergraph coloring problems.
In Graph-Theoretic Concepts in Computer Science, Lecture Notes in Computer Science.
1995.
[8] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. Np-hardness of
euclidean sum-of-squares clustering. Mach. Learn.
[9] Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgar-
sky. Tensor decompositions for learning latent variable models. Technical report,
http://arxiv.org/abs/1210.7559, 2012.
[10] S. Arora and R. Kannan. Learning mixtures of arbitrary gaussians. In Proceedings of
the 37th ACM Symposium on Theory of Computing, 2005.
[11] S. Arora, P. Raghavan, and S. Rao. Approximation schemes for Euclidean k-medians
and related problems. In Proceedings of the Thirty-First Annual ACM Symposium on
Theory of Computing. 1999.
[12] Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach. Cam-
bridge University Press, 2009.
[13] David Arthur, Bodo Manthey, and Heiko Röglin. Smoothed analysis of the k-means
method. Journal of the ACM, 58(5), October 2011.
[14] David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In Proceedings
of the twenty-second annual symposium on Computational geometry, 2006.
[15] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding.
In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms,
2007.
[16] V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local
search heuristics for k-median and facility location problems. SIAM Journal on Com-
puting, 33(3):544–562, 2004.
[17] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Stability yields a PTAS for k-median
and k-means clustering. In Proceedings of the 2010 IEEE 51st Annual Symposium on
Foundations of Computer Science, 2010.
[18] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under pertur-
bation stability. Information Processing Letters, 112(1-2), January 2012.
[19] B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable k-
means++. In Proceedings of the 38th International Conference on Very Large Databases,
2012.
[20] M.-F. Balcan, A. Blum, and A. Gupta. Approximate clustering without the approxi-
mation. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2009.
[21] M.-F. Balcan, A. Blum, and A. Gupta. Clustering under approximation stability. In
Journal of the ACM, 2013.
[22] Maria-Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. Pro-
ceedings of the 39th International Colloquium on Automata, Languages and Program-
ming, 2012.
[23] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In
Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science,
2010.
[24] James C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms.
Kluwer Academic Publishers, Norwell, MA, USA, 1981.
[25] Yonatan Bilu and Nathan Linial. Are stable instances easy? In Proceedings of the First
Symposium on Innovations in Computer Science, 2010.
[26] Lon Bottou and Yoshua Bengio. Convergence properties of the k-means algorithms. In
Advances in Neural Information Processing Systems 7, pages 585–592. MIT Press, 1995.
[27] Spencer Charles Brubaker and Santosh Vempala. Isotropic PCA and affine-invariant
clustering. In Proceedings of the 2008 49th Annual IEEE Symposium on Foundations
of Computer Science, 2008.
[28] Barun Chandra, Howard Karloff, and Craig Tovey. New results on the old k-opt algo-
rithm for the tsp. In Proceedings of the fifth annual ACM-SIAM symposium on Discrete
algorithms, 1994.
[29] M. Charikar, S. Guha, E. Tardos, and D. B. Shmoys. A constant-factor approximation
algorithm for the k-median problem. In Proceedings of the Thirty-First Annual ACM
Symposium on Theory of Computing, 1999.
[30] Kamalika Chaudhuri and Satish Rao. Beyond gaussians: Spectral methods for learning
mixtures of heavy-tailed distributions. In Proceedings of the 21st Annual Conference on
Learning Theory, 2008.
[31] Kamalika Chaudhuri and Satish Rao. Learning mixtures of product distributions us-
ing correlations and independence. In Proceedings of the 21st Annual Conference on
Learning Theory, 2008.
[32] S. Dasgupta. Learning mixtures of gaussians. In Proceedings of The 40th Annual
Symposium on Foundations of Computer Science, 1999.
[33] S. Dasgupta. The hardness of k-means clustering. Technical report, University of
California, San Diego, 2008.
[34] W. Fernandez de la Vega, Marek Karpinski, Claire Kenyon, and Yuval Rabani. Ap-
proximation schemes for clustering problems. In Proceedins of the Thirty-Fifth Annual
ACM Symposium on Theory of Computing, 2003.
[35] Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph
partitioning. In Proceedings of the seventh ACM SIGKDD international conference on
Knowledge discovery and data mining, 2001.
[36] Doratha E. Drake and Stefan Hougardy. Linear time local improvements for weighted
matchings in graphs. In Proceedings of the 2nd international conference on Experimental
and efficient algorithms, 2003.
[37] Vance Faber. Clustering and the Continuous k-Means Algorithm. 1994.
[38] R.D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J.E. Pollington, O.L. Gavin,
P. Gunesekaran, G. Ceric, K. Forslund, L. Holm, E.L. Sonnhammer, S.R. Eddy, and
A. Bateman. The pfam protein families database. Nucleic Acids Research, 38:D211–222,
2010.
[39] Allen Gersho and Robert M. Gray. Vector quantization and signal compression. Kluwer
Academic Publishers, Norwell, MA, USA, 1991.
[40] Pierre Hansen and Brigitte Jaumard. Algorithms for the maximum satisfiability prob-
lem. Computing, 1990.
[41] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical gaussians: moment
methods and spectral decompositions. In Proceedings of the 4th Innovations in Theo-
retical Computer Science Conference, 2013.
[42] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted voronoi diagrams
and randomization to variance-based k-clustering: (extended abstract). In Proceedings
of the tenth annual symposium on Computational geometry, 1994.
[43] K. Jain, M. Mahdian, and A. Saberi. A new greedy approach for facility location
problems. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing,
2002.
[44] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures
of two gaussians. In Proceedings of the 42th ACM Symposium on Theory of Computing,
2010.
[45] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture
models. In Proceedings of The Eighteenth Annual Conference on Learning Theory, 2005.
[46] Ravi Kannan and Santosh Vempala. Spectral algorithms. Foundations and Trends in
Theoretical Computer Science, 4(3-4), 2009.
[47] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth
Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means
clustering. In Proceedings of the eighteenth annual symposium on Computational geom-
etry, New York, NY, USA, 2002. ACM.
[48] A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1 + ǫ)-approximation
algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual
IEEE Symposium on Foundations of Computer Science, Washington, DC, USA, 2004.
[49] Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means
algorithm. In Proceedings of the 51st Annual IEEE Symposium on Foundations of
Computer Science, 2010.
[50] Shi Li and Ola Svensson. Approximating k-median via pseudo-approximation. In Pro-
ceedings of the 45th ACM Symposium on Theory of Computing, 2013.
[51] S.P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inform. Theory,
28(2):129–137, 1982.
[52] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Bilu-linial
stable instances of max cut and minimum multiway cut. In SODA, pages 890–906.
SIAM, 2014.
[53] N. Megiddo and K. Supowit. On the complexity of some common geometric location
problems. SIAM Journal on Computing, 13(1):182–196, 1984.
[54] Marina Meilă. The uniqueness of a good optimum for K-means. In Proceedings of the
International Machine Learning Conference, pages 625–632, 2006.
[55] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures
of gaussians. In Proceedings of the 51st Annual IEEE Symposium on Foundations of
Computer Science, 2010.
[56] A.G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. Scop: a structural classifi-
cation of proteins database for the investigation of sequences and structures. Journal
of Molecular Biology, 247:536–540, 1995.
[57] R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of lloyd-type
methods for the k-means problem. In Proceedings of the 47th Annual IEEE Symposium
on Foundations of Computer Science, 2006.
[58] Christos H. Papadimitriou. On selecting a satisfying truth assignment (extended ab-
stract). In Proceedings of the 32nd annual symposium on Foundations of computer
science, 1991.
[59] Dan Pelleg and Andrew W. Moore. X-means: Extending k-means with efficient es-
timation of the number of clusters. In Proceedings of the Seventeenth International
Conference on Machine Learning, 2000.
[60] Edie M. Rasmussen. Clustering algorithms. In Information Retrieval: Data Structures
& Algorithms, pages 419–442. 1992.
[61] Petra Schuurman and Tjark Vredeveld. Performance guarantees of local search for
multiprocessor scheduling. INFORMS J. on Computing, 2007.
[62] Gideon Schwarz. Estimating the Dimension of a Model. The Annals of Statistics,
32
6(2):461–464, 1978.
[63] Michael Shindler, Alex Wong, and Adam Meyerson. Fast and accurate k-means for
large datasets. In Proceedings of the 25th Annual Conference on Neural Information
Processing Systems, 2011.
[64] A. H S Solberg, T. Taxt, and A.K. Jain. A markov random field model for classification
of multisource satellite imagery. IEEE Transactions on Geoscience and Remote Sensing,
1996.
[65] Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the
simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3), May 2004.
[66] Andrea Vattani. k-means requires exponentially many iterations even in the plane. In
Proceedings of the 25th annual symposium on Computational geometry, 2009.
[67] S. Vempala and G. Wang. A spectral algorithm for learning mixture models. Journal
of Computer and System Sciences, 68(2):841–860, 2004.
[68] K. Voevodski, M. F. Balcan, H. Roeglin, S. Teng, and Y. Xia. Efficient clustering with
limited distance information. In Proceedings of the 26th Conference on Uncertainty in
Artificial Intelligence, 2010.
[69] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan,
A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg. The
top ten algorithms in data mining. Knowledge and Information Systems, 2008.
33
Two faces of active learning
Sanjoy Dasgupta
[email protected]
Abstract
An active learner has a collection of data points, each with a label that is initially hidden but can be
obtained at some cost. Without spending too much, it wishes to find a classifier that will accurately map
points to labels. There are two common intuitions about how this learning process should be organized:
(i) by choosing query points that shrink the space of candidate classifiers as rapidly as possible; and (ii) by
exploiting natural clusters in the (unlabeled) data set. Recent research has yielded learning algorithms for
both paradigms that are efficient, work with generic hypothesis classes, and have rigorously characterized
labeling requirements. Here we survey these advances by focusing on two representative algorithms and
discussing their mathematical properties and empirical performance.
1 Introduction
As digital storage gets cheaper, and sensing devices proliferate, and the web grows ever larger, it gets easier
to amass vast quantities of unlabeled data – raw speech, images, text documents, and so on. But to build
classifiers from these data, labels are needed, and obtaining them can be costly and time consuming. When
building a speech recognizer for instance, the speech signal comes cheap but thereafter a human must examine
the waveform and label the beginning and end of each phoneme within it. This is tedious and painstaking,
and requires expertise.
We will consider situations in which we are given a large set of unlabeled points from some domain X ,
each of which has a hidden label, from a finite set Y, that can be queried. The idea is to find a good classifier,
a mapping h : X → Y from a pre-specified set H, without making too many queries. For instance, each
x ∈ X might be the description of a molecule, with its label y ∈ {+1, −1} denoting whether or not it binds
to a particular target of interest. If the x’s are vectors, a possible choice of H is the class of linear separators.
In this setting, a supervised learner would query a random subset of the unlabeled data and ignore the
rest. A semisupervised learner would do the same, but would keep around the unlabeled points and use them
to constrain the choice of classifier. Most ambitious of all, an active learner would try to get the most out
of a limited budget by choosing its query points in an intelligent and adaptive manner (Figure 1).
This high level scheme (Figure 2) has a ready and intuitive appeal. But how can it be analyzed?
Figure 1: Each circle represents an unlabeled point, while + and − denote points of known label. Left: raw
and cheap – a large reservoir of unlabeled data. Middle: supervised learning picks a few points to label and
ignores the rest. Right: Semisupervised and active learning get more use out of the unlabeled pool, by using
them to constrain the choice of classifier, or by choosing informative points to label.
Figure 2: A typical active learning strategy chooses the next query point near the decision boundary obtained
from the current set of labeled points. Here the boundary is a linear separator, and there are several unlabeled
points close to it that would be candidates for querying.
[Figure 3 graphic omitted: four groups of points on a line carrying 45%, 5%, 5%, and 45% of
the probability mass, with two thresholds marked w and w∗.]
Figure 3: An illustration of sampling bias in active learning. The data lie in four groups on the line, and are
(say) distributed uniformly within each group. The two extremal groups contain 90% of the distribution.
Solids have a + label, while stripes have a − label.
To do so, we shall consider learning problems within the framework of statistical learning theory. In this
model, there is an unknown, underlying distribution P from which data points (and their hidden labels) are
drawn independently at random. If X denotes the space of data and Y the labels, this P is a distribution
over X × Y. Any classifier we build is evaluated in terms of its performance on P.
In a typical learning problem, we choose classifiers from a set of candidate hypotheses H. The best such
candidate, h∗ ∈ H, is by definition the one with smallest error on P, that is, with smallest
err(h) = P[h(X) ≠ Y ].
Since P is unknown, we cannot perform this minimization ourselves. However, if we have access to a sample
of n points from P, we can choose a classifier hn that does well on this sample. We hope, then, that hn → h∗
as n grows. If this is true, we can also talk about the rate of convergence of err(hn ) to err(h∗ ).
A special case of interest is when h∗ makes no mistakes: that is, h∗ (x) = y for all (x, y) in the support
of P. We will call this the separable case and will frequently use it in preliminary discussions because it is
especially amenable to analysis. All the algorithms we describe here, however, are designed for the more
realistic nonseparable scenario.
Consider again the data of Figure 3, and suppose the hypotheses are thresholds on the line:
    h_w(x) = +1 if x ≥ w,  and  h_w(x) = −1 if x < w.
Then the initial boundary will lie somewhere in the center group, and the first query point will lie in
this group. So will every subsequent query point, forever. As active learning proceeds, the algorithm will
gradually converge to the classifier shown as w. But this has 5% error, whereas classifier w∗ has only 2.5%
error. Thus the learner is not consistent: even with infinitely many labels, it returns a suboptimal classifier.
The problem is that the second group from the left gets overlooked. It is not part of the initial random
sample, and later on, the learner is mistakenly confident that the entire group has a − label. And this is just
in one dimension; in high dimension, the problem can be expected to be worse, since there are more places
Figure 4: Two faces of active learning. Left: In the case of binary labels, each data point x cuts the hypothesis
space H into two pieces: the hypotheses that label it +, and those that label it −. If data is separable, one
of these two pieces can be discarded once the label of x is known. A series of well-chosen query points could
rapidly shrink H. Right: If the unlabeled points look like this, perhaps we just need five labels.
for this troublesome group to be hiding out. For a discussion of this problem in text classification, see the
paper of Schutze et al. [17].
Sampling bias is the most fundamental challenge posed by active learning. In this paper, we will deal
exclusively with learning strategies that are provably consistent and we will analyze their label complexity:
the number of labels queried in order to achieve a given rate of accuracy.
             Separable data                          General (nonseparable) data
Aggressive   Query by committee [13];                A2 algorithm [2]
             Splitting index [10]
Mellow       Generic mellow learner [9]              Disagreement coefficient [15];
                                                     Reduction to supervised [12];
                                                     Importance-weighted approach [5]
Figure 5: Some of the key results on active learning within the framework of statistical learning theory.
The splitting index and disagreement coefficient are parameters of a learning problem that control the label
complexity of active learning. The other entries of the table are all learning algorithms.
In supervised learning, such issues are well understood. The standard machinery of sample complexity
[6] tells us that if the data are separable—that is, if they can be perfectly classified by some hypothesis
in H—then we need approximately 1/ǫ random labeled examples from P, and it is enough to return any
classifier consistent with them.
Now suppose we instead draw 1/ǫ unlabeled samples from P:
If we lay these points down on the line, their hidden labels are a sequence of −’s followed by a sequence
of +’s, and the goal is to discover the point w at which the transition occurs. This can be accomplished
with a binary search which asks for just log 1/ǫ labels: first ask for the label of the median point; if it’s +,
move to the 25th percentile point, otherwise move to the 75th percentile point; and so on. Thus, for this
hypothesis class, active learning gives an exponential improvement in the number of labels needed, from 1/ǫ
to just log 1/ǫ. For instance, if supervised learning requires a million labels, active learning requires just
log 1,000,000 ≈ 20, literally!
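This binary search is straightforward to write down in code. The following is a minimal sketch of the idea (our own illustration, not code from the paper); it assumes the unlabeled points have been sorted along the line and that a label oracle query_label is available, both of which are hypothetical names.

def learn_threshold(points, query_label):
    """Actively learn a 1-D threshold with about log2(n) label queries.

    points: unlabeled values sorted in increasing order whose hidden labels are
            a run of -1's followed by a run of +1's (the separable case).
    query_label: oracle returning the hidden label (+1 or -1) of a point.
    Returns a threshold w consistent with every queried label.
    """
    lo, hi = 0, len(points)              # the -/+ transition lies in points[lo:hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(points[mid]) == +1:
            hi = mid                     # transition is at or before mid
        else:
            lo = mid + 1                 # transition is strictly after mid
    return points[lo] if lo < len(points) else float("inf")

On a pool of a million sorted points this makes about 20 queries, matching the logarithmic label count described above.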
This toy example is only for separable data, but with a little care something similar can be achieved for
the nonseparable case. It is a tantalizing possibility that even for more complicated hypothesis classes H, a
sort of generalized binary search is possible.
Left (CAL):
    H1 = H
    For t = 1, 2, . . .:
        Receive unlabeled point xt
        If disagreement in Ht about xt's label:
            query label yt of xt
            Ht+1 = {h ∈ Ht : h(xt) = yt}
        else:
            Ht+1 = Ht

Right (simulation without an explicit version space):
    S = {} (points seen so far)
    For t = 1, 2, . . .:
        Receive unlabeled point xt
        If learn(S ∪ (xt, +1)) and learn(S ∪ (xt, −1)) both return an answer:
            query label yt
        else:
            set yt to whichever label succeeded
        S = S ∪ {(xt, yt)}
Figure 6: Left: CAL, a generic mellow learner for separable data. Right: A way to simulate CAL without
having to explicitly maintain the version space Ht . Here learn(·) is a black-box supervised learner that
takes as input a data set and returns any classifier from H consistent with the data, provided one exists.
Figure 7: Left: The first seven points in the data stream were labeled. How about this next point? Middle:
Some of the hypotheses in the current version space. Right: The region of disagreement.
CAL works by always maintaining the current version space: the subset of hypotheses consistent with
the labels seen so far. At time t, this is some Ht ⊂ H. When the data point xt arrives, CAL checks to see
whether there is any disagreement within Ht about its label. If there isn’t, then the label can be inferred;
otherwise it must be requested (Figure 6, left).
Figure 7 shows CAL at work in a setting where the data points lie in the plane, and the hypotheses
are linear separators. A key concept is that of the disagreement region, the portion of the input space X
on which there is disagreement within Ht . A data point is queried if and only if it lies in this region, and
therefore the efficacy of CAL depends upon the rate at which the P-mass of this region shrinks. As we will
see shortly, there is a broad class of situations in which this shrinkage is geometric: the P-mass halves every
constant number of labels, giving a label complexity that is exponentially better than that of supervised
learning. It is quite surprising that so mellow a scheme performs this well; and it is of interest, then, to ask
how it might be made more practical.
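As a concrete illustration of CAL, here is a minimal sketch (ours, with hypothetical names) for the simple case where H is a finite set of threshold classifiers on the line and the version space is kept explicitly; a label is requested only when the surviving thresholds disagree about the incoming point.

def cal(stream, thresholds, query_label):
    """CAL (Figure 6, left) for a finite class of threshold classifiers h_w,
    where h_w(x) = +1 if x >= w and -1 otherwise (separable data assumed)."""
    def h(w, x):
        return +1 if x >= w else -1

    version_space = set(thresholds)
    num_queries = 0
    for x in stream:
        labels = {h(w, x) for w in version_space}
        if len(labels) > 1:                 # disagreement within Ht: must query
            y = query_label(x)
            num_queries += 1
        else:                               # all of Ht agrees: label is inferred
            y = labels.pop()
        version_space = {w for w in version_space if h(w, x) == y}
    return version_space, num_queries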
from the actual hidden label yt . Regardless, every point gets labeled, one way or the other. The resulting
algorithm, which we will call DHM after its authors [12], is presented in the appendix. Here we give some
rough intuition.
After t time steps, there are t labeled points (some queried, some inferred). Let errt (h) denote the
empirical error of h on these points, that is, the fraction of these t points that h gets wrong. Writing ht for
the minimizer of errt(·), define
    Ht = {h ∈ H : errt(h) ≤ errt(ht) + ∆t},
where ∆t comes out of some standard generalization bound (DHM doesn't do this exactly, but is similar in
spirit). Then the following assertions hold:
• The optimal hypothesis h∗ (with minimum error on the underlying distribution P) lies in Ht for all t.
• Any inferred label is consistent with h∗ (although it might disagree with the actual, hidden label).
Because all points get labeled, there is no bias introduced into the marginal distribution on X . It might
seem, however, that there is some bias in the conditional distribution of y given x, because the inferred
labels can differ from the actual labels. The saving grace is that this bias shifts the empirical error of every
hypothesis in Ht by the same amount – because all these hypotheses agree with the inferred label – and thus
the relative ordering of hypotheses is preserved.
In a typical trial of DHM (or CAL), the querying eventually concentrates near the decision boundary
of the optimal hypothesis. In what respect, then, do these methods differ from the heuristics we described
in the introduction? To understand this, let’s return to the example of Figure 3. The first few data points
drawn from this distribution may well lie in the far-left and far-right clusters. So if the learner were to choose
a single hypothesis, it would lie somewhere near the middle of the line. But DHM doesn’t do this. Instead,
it maintains the entire version space (implicitly), and this version space includes all thresholds between the
two extremal clusters. Therefore the second cluster from the left, which tripped up naive schemes, will not
be overlooked.
To summarize, DHM avoids the consistency problems of many other active learning heuristics by (i) mak-
ing confidence judgements based on the current version space, rather than the single best current hypothesis,
and (ii) labeling all points, either by query or inference, to avoid skewing the distribution on X .
In the case of DHM, the distribution P might not be separable, in which case we need to take into account
the best achievable error:
    ν = inf_{h∈H} err(h).
LDHM(ǫ, δ) is then the smallest t₀ such that
for all t ≥ t₀. In typical supervised learning bounds, and here as well, the dependence of L(ǫ, δ) upon δ is
modest, at most poly log(1/δ). To avoid clutter, we will henceforth ignore δ and speak only of L(ǫ).
Theorem 1 [16] Suppose H has finite VC dimension d, and the learning problem is separable, with dis-
agreement coefficient θ. Then
    LCAL(ǫ) ≤ Õ( θ d log(1/ǫ) ),
where the Õ notation suppresses terms logarithmic in d, θ, and log 1/ǫ.
A supervised learner would need Ω(d/ǫ) examples to achieve this guarantee, so active learning yields an
exponential improvement when θ is finite: its label requirement scales as log 1/ǫ rather than 1/ǫ. And this
is without any effort at finding maximally informative points!
In the nonseparable case, the label complexity also depends on the minimum achievable error within the
hypothesis class.
Theorem 2 [12] With parameters as defined above,
    LDHM(ǫ) ≤ Õ( θ ( d log²(1/ǫ) + dν²/ǫ² ) ),
where ν = inf_{h∈H} err(h).
In this same setting, a supervised learner would require Ω((d/ǫ) + (dν/ǫ2 )) samples. If ν is small relative
to ǫ, we again see an exponential improvement from active learning; otherwise, the improvement is by the
constant factor ν.
The second term in the label complexity is inevitable for nonseparable data.
Theorem 3 [5] Pick any hypothesis class with finite VC dimension d. Then there exists a distribution P
over X × Y for which any active learner must incur a label complexity
    L(ǫ, 1/2) ≥ Ω( dν²/ǫ² ),
where ν = inf_{h∈H} err(h).
The corresponding lower bound for supervised learning is dν/ǫ².
Now, suppose we are running either CAL or DHM, and that the current version space is some V ⊂ H.
Then the only points that will be queried are those that lie within the disagreement region of V,
    DIS(V) = {x ∈ X : h(x) ≠ h′(x) for some h, h′ ∈ V}.
Figure 8: Left: Suppose the data lie in the plane, and that hypothesis class consists of linear separators.
The distance between any two hypotheses h∗ and h is the probability mass (under P) of the region on which
they disagree. Middle: The thick line is h∗ . The thinner lines are examples of hypotheses in B(h∗ , r). Right:
DIS(B(h∗ , r)) might look something like this.
Then the distance between any two hypotheses h_{a,b} and h_{a′,b′} is
    d(h_{a,b}, h_{a′,b′}) = P{x : x ∈ [a, b] ∪ [a′, b′], x ∉ [a, b] ∩ [a′, b′]} = P([a, b] ∆ [a′, b′]),
where S∆T denotes the symmetric set difference (S ∪ T ) \ (S ∩ T ). Now suppose the target hypothesis is
some hα,β with α ≤ β. If r > P[α, β] then B(hα,β , r) includes all intervals of probability mass ≤ r − P[α, β].
Thus, if P is a density, the disagreement region of B(hα,β , r) is all of X ! Letting r approach P[α, β] from
above, we see that θ is at least 1/P[α, β], which is unbounded as β gets closer to α.
A saving grace is that for smaller values r ≤ P[α, β], the hypotheses in B(hα,β , r) are intervals intersecting
hα,β , and consequently the disagreement region has mass at most 4r. Thus there are two regimes in the
active learning process for H: an initial phase in which the radius of uncertainty r is brought down to P[α, β],
and a subsequent phase in which r is further decreased to O(ǫ). The first phase might be slow, but the second
should behave as if θ = 4. Moreover, the dependence of the label complexity upon ǫ should arise entirely
from the second phase. A series of recent papers [4, 14] analyzes such cases by loosening the definition of
disagreement coefficient from
    sup_{r>0} P[DIS(B(h∗, r))] / r        to        limsup_{r→0} P[DIS(B(h∗, r))] / r.
In the example above, the revised disagreement coefficient is 4.
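The two regimes can be checked numerically. The sketch below is our own illustration (the names and the grid resolution are arbitrary choices): with P uniform on [0, 1] and interval hypotheses enumerated on a grid, it estimates P[DIS(B(h_{α,β}, r))]/r for a given radius r. For r somewhat below the target width β − α the ratio comes out near 4, while for r just above the width the disagreement region is essentially all of [0, 1] and the ratio is about 1/r.

import numpy as np

def dis_mass_ratio(alpha, beta, r, grid=200):
    """Estimate P[DIS(B(h_{alpha,beta}, r))] / r for interval hypotheses under
    the uniform distribution on [0, 1], by enumerating intervals on a grid."""
    ends = np.linspace(0.0, 1.0, grid + 1)
    xs = np.linspace(0.0, 1.0, 2 * grid + 1)

    def inside(a, b, x):                    # h_{a,b} labels x positive iff x lies in [a, b]
        return (x >= a) & (x <= b)

    target = inside(alpha, beta, xs)
    in_dis = np.zeros_like(xs, dtype=bool)
    for a in ends:
        for b in ends[ends >= a]:
            overlap = max(0.0, min(b, beta) - max(a, alpha))
            dist = (b - a) + (beta - alpha) - 2.0 * overlap   # mass of symmetric difference
            if dist <= r:                   # h_{a,b} lies in the ball B(h*, r)
                in_dis |= (inside(a, b, xs) != target)
    return in_dis.mean() / r

# For the target interval [0.4, 0.6] (width 0.2):
#   dis_mass_ratio(0.4, 0.6, 0.05) is roughly 4, while
#   dis_mass_ratio(0.4, 0.6, 0.21) is roughly 1/0.21, since DIS covers almost all of [0, 1].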
The experiments so far are for one-dimensional data. There are two significant hurdles in scaling up the
DHM algorithm to real data sets:
1. The version space Ht is defined using a generalization bound. Current bounds are tight only in a few
special cases, such as small finite hypothesis classes and thresholds on the line. Otherwise they can be
extremely loose, with the result that Ht ends up being larger than necessary, and far too many points
get queried.
[Figure 9 plot omitted: number of queries versus number of data points seen, with one curve per
noise model (no noise, random noise 0.1, random noise 0.2, boundary noise 0.1, boundary noise 0.2).]
Figure 9: Here the data distribution is uniform over X = [0, 1] and H consists of thresholds on the line. The
target threshold is at 0.5. We test five different noise models for the conditional distribution of labels.
[Figure 10 plot omitted: number of queries versus number of data points seen, for several target
widths and noise levels (e.g., width 0.2 with random noise 0.1).]
Figure 10: The data distribution is uniform over X = [0, 1], and H consists of intervals. We vary the width
of the target interval and the noise model for the conditional distribution of labels.
Figure 11: The distribution of queries for the experiment of Figure 10, with target interval [0.4, 0.6] and
random noise of 0.1. The initial distribution of queries is shown above and the eventual distribution below.
Figure 12: Here X = R10 and H consists of linear separators. Each class is Gaussian and the best separator
has 5% error. Left: Queries made over a stream of 500 examples. Right: Test error for active versus
supervised sampling.
This kind of nonparametric learner appears to have the usual problems of sampling bias, but differs from
the approaches of the previous section, and has been studied far less. One recently-proposed algorithm,
which we will call DH after its authors [11], attempts to capture the spirit of local propagation schemes
while maintaining sound statistics on just how “unknown” different regions are; we now turn to it.
In the DH algorithm, the only guide in deciding where to query is cluster structure. But the clusters in
use may change over time. Let Ct be the clustering operative at time t, while the tth query is being chosen.
This Ct is some partition of S into groups; more formally,
    ⋃_{C ∈ Ct} C = S, and any C, C′ ∈ Ct are either identical or disjoint.
[Illustration omitted: a data set partitioned into two clusters, each with a few queried labels;
every point is then assigned its cluster's majority label.]
To control the error induced by this process, it is important to find clusters that are as homogeneous as
possible in their labels. In the above example, we can be fairly sure of this. We have five random labels
from the left cluster, all of which are −. Using a tail bound for the binomial distribution, we can obtain
an interval (such as [0.8, 1.0]) in which the true bias of this cluster is very likely to lie. The DH algorithm
makes heavy use of such confidence intervals.
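As a concrete illustration (our own sketch, not the exact bound used by DH), a Hoeffding-style tail bound turns the queried labels from a cluster into such an interval:

import math

def cluster_bias_interval(num_queried, num_positive, delta=0.05):
    """Confidence interval for a cluster's true fraction of + labels.

    By Hoeffding's inequality, with probability at least 1 - delta the true
    fraction lies within sqrt(log(2/delta) / (2 * num_queried)) of the observed
    fraction. (The exact tail bound used by DH may differ; this is illustrative.)
    """
    p_hat = num_positive / num_queried
    margin = math.sqrt(math.log(2.0 / delta) / (2.0 * num_queried))
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)

# Five queried labels, all negative: cluster_bias_interval(5, 0) gives roughly
# (0.0, 0.61) for the fraction of + labels, i.e. the cluster is mostly negative
# with high confidence; more queries tighten the interval.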
If the current clustering C has a cluster that is very mixed in its labels, then this cluster needs to be split
further, to get a new clustering C′ :
[Illustration omitted: the label-mixed cluster of C is split into purer subclusters, yielding the
refined clustering C′.]
The lefthand cluster of C can be left alone (for the time being), since the one on the right is clearly more
troublesome. A fortunate consequence of Rule 1 is that the queries made to C can be reused for C′ : a
random label in the righthand cluster of C is also a random label for the new cluster in which it falls.
Thus the clustering of the data changes only by splitting clusters.
Rule 2. Pick any two times t′ > t. Then Ct′ must be a refinement of Ct; that is, every
cluster of Ct′ is contained within some cluster of Ct.
[Illustration omitted: a hierarchical clustering, drawn as a tree, over the four-group data of
Figure 3, whose groups carry 45%, 5%, 5%, and 45% of the probability mass.]
As in the example above, the nested structure of clusters makes it possible to re-use labels when the clustering
changes. If at time t, the querying process yields (x, y) for some x ∈ C ∈ Ct , then later, at t′ > t, this same
(x, y) is reusable as a random draw from the C ′ ∈ Ct′ to which x belongs.
The final rule imposes a constraint on the manner in which a clustering is refined.
Rule 3. When a cluster is split to obtain a new clustering, Ct → Ct+1 , the manner of split
cannot depend upon the labels seen.
This avoids complicated dependencies. The upshot of it is that we might as well start off with a hierarchical
clustering of S, set C1 to the root of the clustering, and gradually move down the hierarchy, as needed,
during the querying process.
P ← {root} (current pruning of tree)
L(root) ← 1 (arbitrary starting label for root)
For t = 1, 2, . . . (until the budget runs out):
    Repeat B times:
        v ← select(P)
        Pick a random point z from subtree Tv
        Query z's label
        Update counts for all nodes u on path from z to v
    In a bottom-up pass of T, compute bound(u) for all nodes u ∈ T
    For each (selected) v ∈ P:
        Let (P′, L′) be the pruning and labeling of Tv minimizing bound(v)
        P ← (P \ {v}) ∪ P′
        L(u) ← L′(u) for all u ∈ P′
For each cluster v ∈ P:
    Assign each point in Tv the label L(v)
3. For each subtree (Tz , z ∈ P ), find the observed majority label, and assign this label to all points in
the subtree; fit a classifier h to this data; and choose v ∈ P with probability ∝ min{|{x ∈ Tv : h(x) =
+1}|, |{x ∈ Tv : h(x) = −1}|}.
This biases sampling towards regions close to the current decision boundary.
Innumerable variations of the third strategy are possible. Such schemes have traditionally suffered from
consistency problems (recall Figure 3), for instance because entire regions of space are overconfidently over-
looked. The DH framework relieves such concerns because there is always an accurate bound on the error
induced by the current labeling.
Label complexity
A rudimentary label complexity result for this model is proved in [11]: if the provided hierarchical clustering
contains a pruning P whose clusters are ǫ-pure in their labels, then the learner will find a labeling that is
O(ǫ)-pure with O(|P |d(P )/ǫ) labels, where d(P ) is the maximum depth of a node in P .
[Figure 15 plots omitted. Left panel: fraction of labels incorrect versus number of clusters.
Right panel: test error versus number of labels, for random versus active sampling.]
Figure 15: Results on OCR data. Left: Errors of the best prunings in the OCR digits tree. Right: Test
error curves on classification task.
of clusters. As empirical corroboration, we ran the hierarchical sampler ten times, and on average 400 queries
were needed to discover a pruning of error rate 12% or less.
So far we have only talked about error rates on the training set. We can complete the picture by using
the final labeling from the sampling scheme as input to a supervised learner (logistic regression with ℓ2
regularization, the trade-off parameter chosen by 10-fold cross validation). A good baseline for comparison
is the same experiment but with random instead of active sampling. Figure 15, right, shows the resulting
learning curves: the tradeoff between the number of labels and the error rate on the held-out test set. The
initial advantage of cluster-adaptive sampling reflects its ability to discover and subsequently ignore relatively
pure clusters at the onset of sampling. Later on, it is left sampling from clusters of easily confused digits
(the prime culprits being 3’s, 5’s, and 8’s).
Acknowledgements
The author is grateful for his collaborators – Alina Beygelzimer, Daniel Hsu, Adam Kalai, John Langford,
and Claire Monteleoni – and for the support of the National Science Foundation under grant IIS-0713540.
There were also two anonymous reviewers who gave very helpful feedback on the first draft of this paper.
S = ∅ (points with inferred labels)
T = ∅ (points with queried labels)
For t = 1, 2, . . .:
    Receive xt
    If (h+1 = learn(S ∪ {(xt, +1)}, T)) fails: Add (xt, −1) to S and break
    If (h−1 = learn(S ∪ {(xt, −1)}, T)) fails: Add (xt, +1) to S and break
    If err(h−1, S ∪ T) − err(h+1, S ∪ T) > ∆t: Add (xt, +1) to S and break
    If err(h+1, S ∪ T) − err(h−1, S ∪ T) > ∆t: Add (xt, −1) to S and break
    Request yt and add (xt, yt) to T
Figure 16: The DHM selective sampling algorithm. Here, err(h, A) = (1/|A|) Σ_{(x,y)∈A} 1(h(x) ≠ y). A
possible setting for ∆t is shown in Equation 1. At any time, the current hypothesis is learn(S, T).
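The loop of Figure 16 can be instantiated directly for a simple hypothesis class. The sketch below is our own illustration (the names, the grid of thresholds, and the particular choice of ∆t are assumptions, since Equation 1 is not reproduced here); learn(S, T) is realized by brute force as the threshold consistent with S that makes the fewest mistakes on T, or None when no threshold is consistent with S.

import math

def dhm_thresholds(stream, grid, query_label, delta=0.05):
    """Selective sampling in the spirit of Figure 16 for threshold classifiers
    h_w(x) = +1 if x >= w else -1, with thresholds w drawn from a finite grid."""
    def h(w, x):
        return +1 if x >= w else -1

    def err(w, data):                              # empirical error of h_w on data
        return sum(h(w, x) != y for x, y in data) / len(data) if data else 0.0

    def learn(S, T):                               # consistent with S, best on T
        candidates = [w for w in grid if all(h(w, x) == y for x, y in S)]
        return min(candidates, key=lambda w: err(w, T)) if candidates else None

    S, T = [], []                                  # inferred and queried labels
    for t, x in enumerate(stream, start=1):
        delta_t = math.sqrt(math.log(2.0 * t / delta) / t)   # illustrative choice of Delta_t
        h_plus = learn(S + [(x, +1)], T)
        h_minus = learn(S + [(x, -1)], T)
        if h_plus is None:
            S.append((x, -1)); continue
        if h_minus is None:
            S.append((x, +1)); continue
        data = S + T
        if err(h_minus, data) - err(h_plus, data) > delta_t:
            S.append((x, +1)); continue
        if err(h_plus, data) - err(h_minus, data) > delta_t:
            S.append((x, -1)); continue
        T.append((x, query_label(x)))
    return S, T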
References
[1] D. Angluin. Queries revisited. In Proceedings of the Twelfth International Conference on Algorithmic
Learning Theory, pages 12–31, 2001.
[2] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In International Conference
on Machine Learning, 2006.
[3] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Conference on Learning
Theory, 2007.
[4] M.-F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In Proceed-
ings of the 21st Annual Conference on Learning Theory, 2008.
[5] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In International
Conference on Machine Learning, 2009.
[6] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Lecture Notes
in Artificial Intelligence, 3176:169–207, 2004.
[7] R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Transactions on Information
Theory, 54(5):2339–2353, 2008.
[8] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Worst-case analysis of selective sampling for linear-
threshold algorithms. In Advances in Neural Information Processing Systems, 2004.
[9] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning,
15(2):201–221, 1994.
[10] S. Dasgupta. Coarse sample complexity bounds for active learning. In Neural Information Processing
Systems, 2005.
[11] S. Dasgupta and D.J. Hsu. Hierarchical sampling for active learning. In International Conference on
Machine Learning, 2008.
[12] S. Dasgupta, D.J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Neural
Information Processing Systems, 2007.
[13] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee
algorithm. Machine Learning, 28(2):133–168, 1997.
[14] E. Friedman. Active learning for smooth problems. In Conference on Learning Theory, 2009.
[15] S. Hanneke. A bound on the label complexity of agnostic active learning. In International Conference
on Machine Learning, 2007.
[16] S. Hanneke. Theoretical Foundations of Active Learning. PhD Thesis, CMU Machine Learning Depart-
ment, 2009.
[17] H. Schutze, E. Velipasaoglu, and J. Pedersen. Performance thresholding in practical text classification.
In ACM International Conference on Information and Knowledge Management, 2006.
[18] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality
reduction. Science, 290(5500):2319–2323, 2000.
[19] J.H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical
Association, 58:236–244, 1963.
[20] X. Zhu, J. Lafferty, and Z. Ghahramani. Combining active learning and semi-supervised learning using
gaussian fields and harmonic functions. In ICML Workshop on the Continuum from Labeled to Unlabeled
Data, 2003.
Semi-Supervised Learning
Xiaojin Zhu, University of Wisconsin-Madison
Definition
Semi-supervised learning uses both labeled and unlabeled data to perform
an otherwise supervised learning or unsupervised learning task.
In the former case, there is a distinction between inductive semi-supervised
learning and transductive learning. In inductive semi-supervised learning,
the learner has both labeled training data {(x_i, y_i)}_{i=1}^{l}, drawn i.i.d. from p(x, y), and
unlabeled training data {x_i}_{i=l+1}^{l+u}, drawn i.i.d. from p(x), and learns a predictor f : X → Y,
f ∈ F, where F is the hypothesis space. Here x ∈ X is an input instance,
y ∈ Y its target label (discrete for classification or continuous for regression),
p(x, y) the unknown joint distribution and p(x) its marginal, and typically
l ≪ u. The goal is to learn a predictor that predicts future test data better
than the predictor learned from the labeled training data alone. In trans-
ductive learning, the setting is the same except that one is solely interested
in the predictions on the unlabeled training data {x_i}_{i=l+1}^{l+u}, without any
intention to generalize to future test data.
In the latter case, an unsupervised learning task is enhanced by labeled
data. For example, in semi-supervised clustering (a.k.a. constrained clus-
tering) one may have a few must-links (two instances must be in the same
cluster) and cannot-links (two instances cannot be in the same cluster) in ad-
dition to the unlabeled instances to be clustered; in semi-supervised dimen-
sionality reduction one might have the target low-dimensional coordinates
on a few instances.
This entry will focus on the former case of learning a predictor.
Motivation and Background
Semi-supervised learning was initially motivated by its practical value in learn-
ing faster, better, and cheaper. In many real-world applications, it is rela-
tively easy to acquire a large amount of unlabeled data {x}. For example,
documents can be crawled from the Web, images can be obtained from
surveillance cameras, and speech can be collected from broadcast. However,
the corresponding labels {y} for the prediction task, such as sentiment
orientation, intrusion detection, and phonetic transcripts, often require slow
human annotation and expensive laboratory experiments. This labeling bot-
tleneck results in a scarcity of labeled data and a surplus of unlabeled data.
Therefore, being able to utilize the surplus unlabeled data is desirable.
Recently, semi-supervised learning has also found applications in cognitive
psychology as a computational model for human learning. In human cate-
gorization and concept forming, the environment provides unsupervised data
(e.g., a child watching surrounding objects by herself) in addition to labeled
data from a teacher (e.g., Dad points to an object and says “bird!”). There
is evidence that human beings can combine labeled and unlabeled data to
facilitate learning.
The history of semi-supervised learning goes back to at least the 70s,
when self-training, transduction, and Gaussian mixtures with the EM al-
gorithm first emerged. It enjoyed an explosion of interest since the 90s,
with the development of new algorithms like co-training and transductive
support vector machines, new applications in natural language processing
and computer vision, and new theoretical analyses. More discussions can be
found in section 1.1.3 in [7].
Theory
It is obvious that unlabeled data {x_i}_{i=l+1}^{l+u} by itself does not carry any
information on the mapping X → Y. How can it help us learn a better
predictor f : X → Y? Balcan and Blum pointed out in [2] that the key lies
in an implicit ordering of f ∈ F induced by the unlabeled data. Informally,
if the implicit ordering happens to rank the target predictor f ∗ near the top,
then one needs less labeled data to learn f ∗ . This idea will be formalized
later on using PAC learning bounds. In other contexts, the implicit ordering
is interpreted as a prior over F or as a regularizer.
A semi-supervised learning method must address two questions: what
implicit ordering is induced by the unlabeled data, and how to algorith-
mically find a predictor that is near the top of this implicit ordering and fits the
labeled data well. Many semi-supervised learning methods have been pro-
posed, with different answers to these two questions [15, 7, 1, 10]. It is
impossible to enumerate all methods in this entry. Instead, we present a few
representative methods.
Generative Models
This semi-supervised learning method assumes the form of joint probability
p(x, y | θ) = p(y | θ)p(x | y, θ). For example, the class prior distribution
p(y | θ) can be a multinomial over Y, while the class conditional distribution
p(x | y, θ) can be a multivariate Gaussian in X [6, 9]. We use θ ∈ Θ to denote
the parameters of the joint probability. Each θ corresponds to a predictor
fθ via Bayes rule:
    f_θ(x) ≡ argmax_y p(y | x, θ) = argmax_y  p(x, y | θ) / Σ_{y′} p(x, y′ | θ).
The top ranked fθ is the one whose θ (or rather the generative model with
parameters θ) best fits the unlabeled data. Therefore, this method assumes
that the form of the joint probability is correct for the task.
To identify the fθ that both fits the labeled data well and ranks high,
one maximizes the log likelihood of θ on both labeled and unlabeled data:
    argmax_θ  Σ_{i=1}^{l} log p(x_i, y_i | θ) + Σ_{i=l+1}^{l+u} log p(x_i | θ),
where p(x | θ) = Σ_y p(x, y | θ) marginalizes over the unobserved label.
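To make this objective concrete, the sketch below (ours; a one-dimensional Gaussian class-conditional model is assumed purely for illustration) evaluates the semi-supervised log likelihood for a given θ consisting of class priors, means, and variances. A learner would maximize this quantity, for instance with the EM algorithm.

import math

def log_gauss(x, mu, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)

def semi_supervised_loglik(labeled, unlabeled, priors, means, variances):
    """Log likelihood of theta on labeled pairs (x, y) and unlabeled points x,
    under p(x, y | theta) = p(y | theta) * N(x; mu_y, var_y)."""
    ll = 0.0
    for x, y in labeled:                        # sum_i log p(x_i, y_i | theta)
        ll += math.log(priors[y]) + log_gauss(x, means[y], variances[y])
    for x in unlabeled:                         # sum_i log p(x_i | theta), marginalizing y
        ll += math.log(sum(priors[y] * math.exp(log_gauss(x, means[y], variances[y]))
                           for y in priors))
    return ll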
Semi-Supervised Support Vector Machines
This method assumes that the decision boundary should lie in a low-density region
between the two classes y ∈ {−1, 1} [12, 8]. Consider the following hat loss
function on an unlabeled instance x:
max(1 − |f (x)|, 0)
which is positive when −1 < f (x) < 1, and zero outside. The hat loss
thus measures the violation in (unlabeled) large margin separation between
f and x. Averaging over all unlabeled training instances, it induces an
implicit ordering from small to large over f ∈ F:
    (1/u) Σ_{i=l+1}^{l+u} max(1 − |f(x_i)|, 0).
The top ranked f is one whose decision boundary avoids most unlabeled
instances by a large margin.
To find the f that both fits the labeled data well and ranks high, one
typically minimizes the following objective:
    argmin_f  (1/l) Σ_{i=1}^{l} max(1 − y_i f(x_i), 0) + λ1 ‖f‖² + λ2 (1/u) Σ_{i=l+1}^{l+u} max(1 − |f(x_i)|, 0).
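For a linear predictor f(x) = w·x + b the objective can be written out directly. The short sketch below is our own illustration (with ‖f‖² taken as ‖w‖², a common choice); it could serve, for example, as the inner evaluation of a simple optimizer.

import numpy as np

def s3vm_objective(w, b, X_lab, y_lab, X_unlab, lam1=1.0, lam2=1.0):
    """Hinge loss on labeled data + lam1 * ||w||^2 + lam2 * hat loss on unlabeled
    data, for the linear predictor f(x) = w . x + b."""
    f_lab = X_lab @ w + b
    f_unlab = X_unlab @ w + b
    hinge = np.maximum(1.0 - y_lab * f_lab, 0.0).mean()     # labeled term
    hat = np.maximum(1.0 - np.abs(f_unlab), 0.0).mean()     # unlabeled hat-loss term
    return hinge + lam1 * np.dot(w, w) + lam2 * hat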
Graph-Based Models
This semi-supervised learning method assumes that there is a graph G =
{V, E} such that the vertices V are the labeled and unlabeled training
instances, and the undirected edges E connect instances i, j with weight
wij [4, 14, 3]. The graph is sometimes assumed to be a random instanti-
ation of an underlying manifold structure that supports p(x). Typically,
wij reflects the proximity of xi, xj. For example, the Gaussian edge weight
function defines w_ij = exp(−‖x_i − x_j‖²/σ²). As another example, the kNN
Large wij implies a preference for the predictions f (xi ) and f (xj ) to be
the same. This can be formalized by the graph energy of a function f :
    Σ_{i,j=1}^{l+u} w_ij (f(x_i) − f(x_j))².
To find an f that both fits the labeled data well and ranks high (i.e., has low graph
energy), one typically minimizes an objective of the form
    argmin_f  (1/l) Σ_{i=1}^{l} c(f(x_i), y_i) + λ Σ_{i,j=1}^{l+u} w_ij (f(x_i) − f(x_j))²,
where c(f(x), y) is a convex loss function such as the hinge loss or the squared
loss. This is a convex optimization problem with efficient solvers.
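A classical special case is the harmonic-function solution with the squared loss: the predictions on labeled points are clamped to their labels, and the predictions on unlabeled points are obtained by solving a linear system in the graph Laplacian. The sketch below is a minimal illustration with our own variable names.

import numpy as np

def harmonic_label_propagation(W, y_lab, l):
    """Graph-based prediction with the squared loss.

    W: (n, n) symmetric nonnegative weight matrix; the first l points are
       labeled and the remaining u = n - l points are unlabeled.
    y_lab: length-l vector of labels (e.g. +1 / -1) for the labeled points.
    Returns a length-n vector f that equals the labels on the labeled points
    and minimizes sum_ij w_ij (f_i - f_j)^2 over the unlabeled points.
    """
    D = np.diag(W.sum(axis=1))
    L = D - W                                    # graph Laplacian
    L_uu = L[l:, l:]
    L_ul = L[l:, :l]
    f_unlab = np.linalg.solve(L_uu, -L_ul @ np.asarray(y_lab, dtype=float))
    return np.concatenate([np.asarray(y_lab, dtype=float), f_unlab])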
hypothesis be an m-tuple of predictors ⟨f^(1), . . . , f^(m)⟩. The disagreement of
a tuple on the unlabeled data can be defined as
    Σ_{i=l+1}^{l+u} Σ_{u,v=1}^{m} c(f^(u)(x_i), f^(v)(x_i)),
where c() is a loss function. Typical choices of c() are the 0-1 loss for
classification, and the squared loss for regression. Then the disagreement
induces an implicit ordering on tuples from small to large.
It is important for these m predictors to be of diverse types, and have
different inductive biases. In general, each predictor f (u) , u = 1 . . . m may
be evaluated by its individual loss function c(u) and regularizer Ω(u) . To find
a hypothesis (i.e., m predictors) that fits the labeled data well and ranks
high, one can minimize the following objective:
    argmin_{⟨f^(1),...,f^(m)⟩}  Σ_{u=1}^{m} ( (1/l) Σ_{i=1}^{l} c^(u)(f^(u)(x_i), y_i) + λ1 Ω^(u)(f^(u)) )
                                + λ2 Σ_{i=l+1}^{l+u} Σ_{u,v=1}^{m} c(f^(u)(x_i), f^(v)(x_i)).        (1)
Multiview learning typically optimizes this objective directly. When the loss
functions and regularizers are convex, numerical solution is relatively easy
to obtain. In the special cases when the loss functions are the squared loss,
and the regularizers are squared ℓ2 norms, there is a closed-form solution.
On the other hand, the co-training algorithm, as presented earlier, optimizes
the objective indirectly with the iterative procedure. One advantage of co-
training is that the algorithm is a wrapper method, in that it can use any
“blackbox” learners f (1) and f (2) without the need to modify the learners.
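A minimal co-training wrapper looks roughly as follows; this is our own sketch, and it assumes scikit-learn-style learners exposing fit, predict_proba, and classes_, which is an interface assumption rather than part of this entry.

import numpy as np

def co_train(f1, f2, X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, rounds=20):
    """Co-training with two black-box learners on two views of the data.

    Each round, each learner labels the single unlabeled instance it is most
    confident about, and that instance (with its predicted label) is added to
    the shared training set used by both learners.
    """
    X1, X2, y = list(X1_lab), list(X2_lab), list(y_lab)
    pool = list(range(len(X1_unlab)))            # indices of still-unlabeled instances
    for _ in range(rounds):
        if not pool:
            break
        f1.fit(np.array(X1), np.array(y))
        f2.fit(np.array(X2), np.array(y))
        for model, view in ((f1, X1_unlab), (f2, X2_unlab)):
            if not pool:
                break
            probs = model.predict_proba(np.array([view[i] for i in pool]))
            j = int(probs.max(axis=1).argmax())  # most confident unlabeled instance
            i = pool.pop(j)
            X1.append(X1_unlab[i])
            X2.append(X2_unlab[i])
            y.append(model.classes_[probs[j].argmax()])
    return f1, f2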
First, we introduce some notation. Consider the 0-1 loss for classifica-
tion. Let c∗ : X → {0, 1} be the unknown target function, which may not
be in F. Let err(f) = E_{x∼p}[f(x) ≠ c∗(x)] be the true error rate of a hy-
pothesis f, and êrr(f) = (1/l) Σ_{i=1}^{l} 1(f(x_i) ≠ c∗(x_i)) be the empirical error rate
of f on the labeled training sample. To characterize the implicit ordering,
we define an “unlabeled error rate” err_unl(f) = 1 − E_{x∼p}[χ(f, x)], where
the compatibility function χ : F × X → [0, 1] measures how “compatible” f
is to an unlabeled instance x. As an example, in semi-supervised support
vector machines, if x is far away from the decision boundary produced by
f, then χ(f, x) is large; but if x is close to the decision boundary, χ(f, x)
is small. In this example, a large err_unl(f) then means that the decision
boundary of f cuts through dense unlabeled data regions, and thus f is un-
desirable for semi-supervised learning. In contrast, a small err_unl(f) means
that the decision boundary of f lies in a low density gap, which is more
desirable. In theory, the implicit ordering on f ∈ F is to sort err_unl(f)
from small to large. In practice, we use the empirical unlabeled error rate
    êrr_unl(f) = 1 − (1/u) Σ_{i=l+1}^{l+u} χ(f, x_i).
Our goal is to show that if an f ∈ F “fits the labeled data well and
ranks high”, then f is almost as good as the best hypothesis in F. Let
t ∈ [0, 1]. We first consider the best hypothesis f_t∗ in the subset of F
that consists of hypotheses whose unlabeled error rate is no worse than
t: f_t∗ = argmin_{f′∈F, err_unl(f′)≤t} err(f′). Obviously, t = 1 gives the best
hypothesis in the whole F. However, the nature of the guarantee has the
form err(f) ≤ err(f_t∗) + EstimationError(t) + c, where the EstimationError
term increases with t. Thus, with t = 1 the bound can be loose. On the
other hand, if t is close to 0, EstimationError(t) is small, but err(f_t∗) can
be much worse than err(f∗_{t=1}). The bound will account for the optimal t.
The hypothesis
    f = argmin_{f′∈F} ( êrr(f′) + ε̂(f′) )
satisfies the guarantee that
    err(f) ≤ min_t ( err(f_t∗) + ε̂(f_t∗) ) + 5 √( log(8/δ) / l ).
If a function f fits the labeled data well, it has a small êrr(f). If it
ranks high, then F(f) will be a small set, and consequently ε̂(f) is small. The
argmin operator identifies the best such function during training. The bound
accounts for the minimum over all possible t tradeoffs. Therefore, we see that
the “lucky” case is when the implicit ordering is good, such that f∗_{t=1}, the
best hypothesis in F, is near the top of the ranking. This is when semi-
supervised learning is expected to perform well. Balcan and Blum also give
results addressing the key issue of how much unlabeled data is needed for
êrr_unl(f) and err_unl(f) to be close for all f ∈ F.
Applications
Because the type of semi-supervised learning discussed in this entry has the
same goal of creating a predictor as supervised learning, it is applicable to
essentially any problems where supervised learning can be applied. For ex-
ample, semi-supervised learning has been applied to natural language pro-
cessing (word sense disambiguation [13], document categorization, named
entity classification, sentiment analysis, machine translation), computer vi-
sion (object recognition, image segmentation), bioinformatics (protein func-
tion prediction), and cognitive psychology. See the recommended reading
for individual papers.
Future Directions
There are several directions to further enhance the value of semi-supervised
learning. First, we need guarantees that it will outperform supervised learn-
ing. Currently, the practitioner has to manually choose a particular semi-
supervised learning method, and often manually set learning parameters.
Sometimes, a bad choice that does not match the task (e.g., modeling each
class with a Gaussian when the data does not have this distribution) can
make semi-supervised learning worse than supervised learning. Second, we
need methods that benefit from unlabeled data when l, the size of the labeled data,
is large. It has been widely observed that the gain over supervised learn-
ing is the largest when l is small, but diminishes as l increases. Third, we
need good ways to combine semi-supervised learning and active learning. In
natural learning systems such as humans, we routinely observe unlabeled in-
put, which often naturally leads to questions. And finally, we need methods
that can efficiently process massive unlabeled data, especially in an online
learning setting.
Cross References
active learning, classification, constrained clustering, dimensionality reduc-
tion, online learning, regression, supervised learning, unsupervised learning
Recommended Reading
[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data
using graph mincuts. In Proc. 18th International Conf. on Machine
Learning, 2001.
[5] A. Blum and T. Mitchell. Combining labeled and unlabeled data with
co-training. In COLT: Proceedings of the Workshop on Computational
Learning Theory, 1998.
[9] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text clas-
sification from labeled and unlabeled documents using EM. Machine
Learning, 39(2/3):103–134, 2000.
[10] M. Seeger. Learning with labeled and unlabeled data. Technical report,
University of Edinburgh, 2001.
Journal of Artificial Intelligence Research 4 (1996) 237-285. Submitted 9/95; published 5/96.
Reinforcement Learning: A Survey
Kaelbling, Littman, & Moore
Abstract
This paper surveys the field of reinforcement learning from a computer-science per-
spective. It is written to be accessible to researchers familiar with machine learning. Both
the historical basis of the field and a broad selection of current work are summarized.
Reinforcement learning is the problem faced by an agent that learns behavior through
trial-and-error interactions with a dynamic environment. The work described here has a
resemblance to work in psychology, but differs considerably in the details and in the use
of the word “reinforcement.” The paper discusses central issues of reinforcement learning,
including trading off exploration and exploitation, establishing the foundations of the field
via Markov decision theory, learning from delayed reinforcement, constructing empirical
models to accelerate learning, making use of generalization and hierarchy, and coping with
hidden state. It concludes with a survey of some implemented systems and an assessment
of the practical utility of current methods for reinforcement learning.
1. Introduction
Reinforcement learning dates back to the early days of cybernetics and work in statistics,
psychology, neuroscience, and computer science. In the last five to ten years, it has attracted
rapidly increasing interest in the machine learning and artificial intelligence communities.
Its promise is beguiling: a way of programming agents by reward and punishment without
needing to specify how the task is to be achieved. But there are formidable computational
obstacles to fulfilling the promise.
This paper surveys the historical basis of reinforcement learning and some of the current
work from a computer science perspective. We give a high-level overview of the field and a
taste of some specific approaches. It is, of course, impossible to mention all of the important
work in the field; this should not be taken to be an exhaustive account.
Reinforcement learning is the problem faced by an agent that must learn behavior
through trial-and-error interactions with a dynamic environment. The work described here
has a strong family resemblance to eponymous work in psychology, but differs considerably
in the details and in the use of the word “reinforcement.” It is appropriately thought of as
a class of problems, rather than as a set of techniques.
There are two main strategies for solving reinforcement-learning problems. The first is to
search in the space of behaviors in order to find one that performs well in the environment.
This approach has been taken by work in genetic algorithms and genetic programming,
as well as some more novel search techniques (Schmidhuber, 1996). The second is to use
statistical techniques and dynamic programming methods to estimate the utility of taking
actions in states of the world. This paper is devoted almost entirely to the second set of
techniques because they take advantage of the special structure of reinforcement-learning
problems that is not available in optimization problems in general. It is not yet clear which
set of approaches is best in which circumstances.
The rest of this section is devoted to establishing notation and describing the basic
reinforcement-learning model. Section 2 explains the trade-off between exploration and
exploitation and presents some solutions to the most basic case of reinforcement-learning
problems, in which we want to maximize the immediate reward. Section 3 considers the more
general problem in which rewards can be delayed in time from the actions that were crucial
to gaining them. Section 4 considers some classic model-free algorithms for reinforcement
learning from delayed reward: adaptive heuristic critic, TD(λ), and Q-learning. Section 5
demonstrates a continuum of algorithms that are sensitive to the amount of computation an
agent can perform between actual steps of action in the environment. Generalization, the
cornerstone of mainstream machine learning research, has the potential of considerably
aiding reinforcement learning, as described in Section 6. Section 7 considers the problems
that arise when the agent does not have complete perceptual access to the state of the
environment. Section 8 catalogs some of reinforcement learning's successful applications.
Finally, Section 9 concludes with some speculations about important open problems and
the future of reinforcement learning.
Some aspects of reinforcement learning are closely related to search and planning issues
in artificial intelligence. AI search algorithms generate a satisfactory trajectory through a
graph of states. Planning operates in a similar manner, but typically within a construct
with more complexity than a graph, in which states are represented by compositions of
logical expressions instead of atomic symbols. These AI algorithms are less general than the
reinforcement-learning methods, in that they require a predefined model of state transitions,
and with a few exceptions assume determinism. On the other hand, reinforcement learning,
at least in the kind of discrete cases for which theory has been developed, assumes that
the entire state space can be enumerated and stored in memory, an assumption to which
conventional search algorithms are not tied.
1.2 Models of Optimal Behavior
Before we can start thinking about algorithms for learning to behave optimally, we have
to decide what our model of optimality will be. In particular, we have to specify how the
agent should take the future into account in the decisions it makes about how to behave
now. There are three models that have been the subject of the majority of work in this
area.
The finite-horizon model is the easiest to think about: at a given moment in time, the
agent should optimize its expected reward for the next h steps:
    E( Σ_{t=0}^{h} r_t );
it need not worry about what will happen after that. In this and subsequent expressions,
rt represents the scalar reward received t steps into the future. This model can be used in
two ways. In the first, the agent will have a non-stationary policy, that is, one that changes
over time. On its first step it will take what is termed an h-step optimal action. This is
defined to be the best action available given that it has h steps remaining in which to act
and gain reinforcement. On the next step it will take an (h − 1)-step optimal action, and so
on, until it finally takes a 1-step optimal action and terminates. In the second, the agent
does receding-horizon control, in which it always takes the h-step optimal action. The agent
always acts according to the same policy, but the value of h limits how far ahead it looks
in choosing its actions. The finite-horizon model is not always appropriate. In many cases
we may not know the precise length of the agent's life in advance.
The infinite-horizon discounted model takes the long-run reward of the agent into ac-
count, but rewards that are received in the future are geometrically discounted according
to discount factor γ (where 0 ≤ γ < 1):
    E( Σ_{t=0}^{∞} γ^t r_t ).
We can interpret γ in several ways. It can be seen as an interest rate, a probability of living
another step, or as a mathematical trick to bound the infinite sum. The model is conceptu-
ally similar to receding-horizon control, but the discounted model is more mathematically
tractable than the finite-horizon model. This is a dominant reason for the wide attention
this model has received.
Another optimality criterion is the average-reward model, in which the agent is supposed
to take actions that optimize its long-run average reward:
    lim_{h→∞} E( (1/h) Σ_{t=0}^{h} r_t ).
Such a policy is referred to as a gain optimal policy; it can be seen as the limiting case of
the infinite-horizon discounted model as the discount factor approaches 1 (Bertsekas, 1995).
One problem with this criterion is that there is no way to distinguish between two policies,
one of which gains a large amount of reward in the initial phases and the other of which
does not. Reward gained on any initial prefix of the agent's life is overshadowed by the
long-run average performance. It is possible to generalize this model so that it takes into
account both the long-run average and the amount of initial reward that can be gained.
In the generalized, bias optimal model, a policy is preferred if it maximizes the long-run
average and ties are broken by the initial extra reward.
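The three criteria are easy to compare on a concrete stream of scalar rewards; the snippet below (our own illustration) evaluates each of them on a finite reward sequence.

def finite_horizon_return(rewards, h):
    """Sum of the rewards r_0 through r_h."""
    return sum(rewards[: h + 1])

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over the observed prefix of the reward stream."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    """Average reward over the observed prefix (the h -> infinity limit above)."""
    return sum(rewards) / len(rewards)

# For rewards = [0, 0, 6, 0, 0, 0]: finite_horizon_return(rewards, 5) = 6,
# discounted_return(rewards, 0.9) = 6 * 0.9**2, and average_reward(rewards) = 1.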
Figure 2 contrasts these models of optimality by providing an environment in which
changing the model of optimality changes the optimal policy. In this example, circles
represent the states of the environment and arrows are state transitions. There is only
a single action choice from every state except the start state, which is in the upper left
and marked with an incoming arrow. All rewards are zero except where marked. Under a
finite-horizon model with h = 5, the three actions yield rewards of +6.0, +0.0, and +0.0, so
the first action should be chosen; under an infinite-horizon discounted model with γ = 0.9,
the three choices yield +16.2, +59.0, and +58.5, so the second action should be chosen;
and under the average reward model, the third action should be chosen, since it leads to
an average reward of +11. If we change h to 1000 and γ to 0.2, then the second action is
optimal for the finite-horizon model and the first for the infinite-horizon discounted model;
however, the average reward model will always prefer the best long-term average. Since the
choice of optimality model and parameters matters so much, it is important to choose it
carefully in any application.
The finite-horizon model is appropriate when the agent's lifetime is known; one im-
portant aspect of this model is that as the length of the remaining lifetime decreases, the
agent's policy may change. A system with a hard deadline would be appropriately modeled
this way. The relative usefulness of infinite-horizon discounted and bias-optimal models is
still under debate. Bias-optimality has the advantage of not requiring a discount parameter;
however, algorithms for finding bias-optimal policies are not yet as well-understood as those
for finding optimal infinite-horizon discounted policies.
1.3 Measuring Learning Performance
The criteria given in the previous section can be used to assess the policies learned by a
given algorithm. We would also like to be able to evaluate the quality of learning itself.
There are several incompatible measures in use.
Eventual convergence to optimal. Many algorithms come with a provable guar-
antee of asymptotic convergence to optimal behavior (Watkins & Dayan, 1992). This
is reassuring, but useless in practical terms. An agent that quickly reaches a plateau
Figure 2: Comparing models of optimality. All unlabeled arrows produce a reward of zero.
Regret. A more appropriate measure, then, is the expected decrease in reward gained
due to executing the learning algorithm instead of behaving optimally from the very
beginning. This measure is known as regret (Berry & Fristedt, 1985). It penalizes
mistakes wherever they occur during the run. Unfortunately, results concerning the
regret of algorithms are quite hard to obtain.
Section 2.2 presents three techniques that are not formally justified, but that have had wide
use in practice, and can be applied (with similar lack of guarantee) to the general case.
2.1 Formally Justified Techniques
There is a fairly well-developed formal theory of exploration for very simple problems.
Although it is instructive, the methods it provides do not scale well to more complex
problems.
2.1.1 Dynamic-Programming Approach
If the agent is going to be acting for a total of h steps, it can use basic Bayesian reasoning
to solve for an optimal strategy (Berry & Fristedt, 1985). This requires an assumed prior
joint distribution for the parameters {p_i}, the most natural of which is that each p_i is
independently uniformly distributed between 0 and 1. We compute a mapping from belief
states (summaries of the agent's experiences during this run) to actions. Here, a belief state
can be represented as a tabulation of action choices and payoffs: {n_1, w_1, n_2, w_2, ..., n_k, w_k}
denotes a state of play in which each arm i has been pulled n_i times with w_i payoffs. We
write V(n_1, w_1, ..., n_k, w_k) as the expected payoff remaining, given that a total of h pulls
are available, and we use the remaining pulls optimally.
If Σ_i n_i = h, then there are no remaining pulls, and V(n_1, w_1, ..., n_k, w_k) = 0. This is
the basis of a recursive definition. If we know the V value for all belief states with t pulls
remaining, we can compute the V value of any belief state with t + 1 pulls remaining:

    V(n_1, w_1, ..., n_k, w_k) = max_i E[ future payoff if agent takes action i, then acts optimally for remaining pulls ]
                               = max_i [ ρ_i (1 + V(n_1, w_1, ..., n_i + 1, w_i + 1, ..., n_k, w_k))
                                         + (1 − ρ_i) V(n_1, w_1, ..., n_i + 1, w_i, ..., n_k, w_k) ] ,

where ρ_i is the posterior subjective probability of action i paying off given n_i, w_i and
our prior probability. For the uniform priors, which result in a beta distribution, ρ_i =
(w_i + 1)/(n_i + 2).
The expense of filling in the table of V values in this way for all attainable belief states
is linear in the number of belief states times actions, and thus exponential in the horizon.
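As a concrete illustration of this recursion, here is a small Python sketch (not from the original paper; the two-armed setup, the horizon, and all names are illustrative assumptions) that fills in the V table for a Bernoulli bandit with uniform priors by memoizing over belief states.

from functools import lru_cache

K = 2     # number of arms (illustrative)
H = 10    # total number of pulls available (illustrative)

@lru_cache(maxsize=None)
def V(belief):
    # belief = (n1, w1, ..., nK, wK): each arm i pulled n_i times with w_i payoffs
    if sum(belief[0::2]) == H:
        return 0.0                                   # no pulls remain
    best = 0.0
    for i in range(K):
        n, w = belief[2 * i], belief[2 * i + 1]
        rho = (w + 1) / (n + 2)                      # posterior probability arm i pays off
        win = list(belief); win[2 * i] += 1; win[2 * i + 1] += 1
        lose = list(belief); lose[2 * i] += 1
        # expected payoff of pulling arm i now, then acting optimally afterwards
        best = max(best, rho * (1 + V(tuple(win))) + (1 - rho) * V(tuple(lose)))
    return best

print(V((0, 0, 0, 0)))   # expected payoff of the optimal exploration strategy

The memo table has one entry per attainable belief state, so the cost grows exponentially with the horizon, as noted above.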
2.1.2 Gittins Allocation Indices
Gittins gives an "allocation index" method for finding the optimal choice of action at each
step in k-armed bandit problems (Gittins, 1989). The technique only applies under the
discounted expected reward criterion. For each action, consider the number of times it has
been chosen, n, versus the number of times it has paid off, w. For certain discount factors,
there are published tables of "index values," I(n, w), for each pair of n and w. Look up
the index value for each action i, I(n_i, w_i). It represents a comparative measure of the
combined value of the expected payoff of action i (given its history of payoffs) and the value
of the information that we would get by choosing it. Gittins has shown that choosing the
action with the largest index value guarantees the optimal balance between exploration and
exploitation.
Figure 3: A Tsetlin automaton with 2N states. The top row shows the state transitions
that are made when the previous action resulted in a reward of 1; the bottom
row shows transitions after a reward of 0. In states in the left half of the figure,
action 0 is taken; in those on the right, action 1 is taken.
Because of the guarantee of optimal exploration and the simplicity of the technique
(given the table of index values), this approach holds a great deal of promise for use in more
complex applications. This method proved useful in an application to robotic manipulation
with immediate reward (Salganicoff & Ungar, 1995). Unfortunately, no one has yet been
able to find an analog of index values for delayed reinforcement problems.
2.1.3 Learning Automata
A branch of the theory of adaptive control is devoted to learning automata, surveyed by
Narendra and Thathachar (1989), which were originally described explicitly as finite-state
automata. The Tsetlin automaton shown in Figure 3 provides an example that solves a
2-armed bandit arbitrarily near optimally as N approaches infinity.
It is inconvenient to describe algorithms as finite-state automata, so a move was made
to describe the internal state of the agent as a probability distribution according to which
actions would be chosen. The probabilities of taking different actions would be adjusted
according to their previous successes and failures.
An example, which stands among a set of algorithms independently developed in the
mathematical psychology literature (Hilgard & Bower, 1975), is the linear reward-inaction
algorithm. Let p_i be the agent's probability of taking action i.
When action a_i succeeds,
    p_i := p_i + α (1 − p_i)
    p_j := p_j − α p_j        for j ≠ i .
When action a_i fails, p_j remains unchanged (for all j).
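In Python, one step of this update might look like the following sketch (the step size α, the two-armed environment, and the function names are illustrative assumptions, not taken from the paper):

import random

def lri_update(p, i, success, alpha=0.1):
    # Linear reward-inaction update after taking action i with the given outcome.
    if not success:
        return p                                    # on failure, nothing changes
    new_p = [pj - alpha * pj for pj in p]           # p_j := p_j - alpha * p_j  (j != i)
    new_p[i] = p[i] + alpha * (1 - p[i])            # p_i := p_i + alpha * (1 - p_i)
    return new_p

# toy 2-armed bandit: arm 0 pays off with probability 0.8, arm 1 with probability 0.4
payoff = [0.8, 0.4]
p = [0.5, 0.5]
for _ in range(2000):
    i = random.choices(range(2), weights=p)[0]
    p = lri_update(p, i, random.random() < payoff[i])
print(p)   # usually close to [1, 0], but occasionally converges to the wrong arm

Note that the update keeps the probabilities normalized, since the mass removed from the other actions exactly matches the mass added to the successful one.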
This algorithm converges with probability 1 to a vector containing a single 1 and the
rest 0's (choosing a particular action with probability 1). Unfortunately, it does not always
converge to the correct action, but the probability that it converges to the wrong one can
be made arbitrarily small by making α small (Narendra & Thathachar, 1974). There is no
literature on the regret of this algorithm.
The temperature parameter T can be decreased over time to decrease exploration. This
method works well if the best action is well separated from the others, but suffers somewhat
when the values of the actions are close. It may also converge unnecessarily slowly unless
the temperature schedule is manually tuned with great care.
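For reference, a minimal Python sketch of Boltzmann (softmax) action selection with a decreasing temperature; the reward estimates and the particular cooling schedule are illustrative assumptions.

import math, random

def boltzmann_action(values, T):
    # Choose action i with probability proportional to exp(values[i] / T).
    weights = [math.exp(v / T) for v in values]
    return random.choices(range(len(values)), weights=weights)[0]

values = [0.20, 0.25, 0.90]          # current estimates of expected reward per action
for step in range(1, 6):
    T = 1.0 / step                   # a simple cooling schedule
    print(T, boltzmann_action(values, T))

As the temperature falls, the selection becomes increasingly greedy with respect to the current estimates.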
2.2.3 Interval-based Techniques
Exploration is often more efficient when it is based on second-order information about the
certainty or variance of the estimated values of actions. Kaelbling's interval estimation
algorithm (1993b) stores statistics for each action a_i: w_i is the number of successes and n_i
the number of trials. An action is chosen by computing the upper bound of a 100 · (1 − α)%
confidence interval on the success probability of each action and choosing the action with
the highest upper bound. Smaller values of the parameter α encourage greater exploration.
When payoffs are boolean, the normal approximation to the binomial distribution can be
used to construct the confidence interval (though the binomial itself should be used for small
n). Other payoff distributions can be handled using their associated statistics or with
nonparametric methods. The method works very well in empirical trials. It is also related
to a certain class of statistical techniques known as experiment design methods (Box &
Draper, 1987), which are used for comparing multiple treatments (for example, fertilizers
or drugs) to determine which treatment (if any) is best in as small a set of experiments as
possible.
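A sketch of the action-selection step in Python, using the normal approximation mentioned above (the bound constant z and the helper names are assumptions; z ≈ 1.96 corresponds to roughly a 95% interval, and larger z means more exploration):

import math

def interval_estimation_action(wins, trials, z=1.96):
    # Pick the action with the largest upper confidence bound on its success probability.
    best, best_ub = None, -1.0
    for i, (w, n) in enumerate(zip(wins, trials)):
        if n == 0:
            ub = 1.0                                  # untried actions get the benefit of the doubt
        else:
            p = w / n
            ub = p + z * math.sqrt(p * (1 - p) / n)   # normal approximation to the binomial
        if ub > best_ub:
            best, best_ub = i, ub
    return best

print(interval_estimation_action(wins=[6, 1], trials=[10, 2]))   # the rarely tried arm wins on its upper bound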
2.3 More General Problems
When there are multiple states, but reinforcement is still immediate, then any of the above
solutions can be replicated, once for each state. However, when generalization is required,
these solutions must be integrated with generalization methods (see Section 6); this is
straightforward for the simple ad hoc methods, but it is not understood how to maintain
theoretical guarantees.
Many of these techniques focus on converging to some regime in which exploratory
actions are taken rarely or never; this is appropriate when the environment is stationary.
However, when the environment is non-stationary, exploration must continue to take place
in order to notice changes in the world. Again, the more ad hoc techniques can be modified
to deal with this in a plausible manner (keep temperature parameters from going to 0; decay
the statistics in interval estimation), but none of the theoretically guaranteed methods can
be applied.
3. Delayed Reward
In the general case of the reinforcement learning problem, the agent's actions determine
not only its immediate reward, but also (at least probabilistically) the next state of the
environment. Such environments can be thought of as networks of bandit problems, but
the agent must take into account the next state as well as the immediate reward when it
decides which action to take. The model of long-run optimality the agent is using determines
exactly how it should take the value of the future into account. The agent will have to be
able to learn from delayed reinforcement: it may take a long sequence of actions, receiving
insignificant reinforcement, then finally arrive at a state with high reinforcement. The agent
must be able to learn which of its actions are desirable based on reward that can take place
arbitrarily far in the future.
3.1 Markov Decision Processes
Problems with delayed reinforcement are well modeled as Markov decision processes (MDPs).
An MDP consists of
a set of states S ,
a set of actions A,
which assert that the value of a state s is the expected instantaneous reward plus the
expected discounted value of the next state, using the best available action. Given the
optimal value function, we can specify the optimal policy as
    π*(s) = argmax_a [ R(s, a) + γ Σ_{s′ ∈ S} T(s, a, s′) V*(s′) ] .
    initialize V(s) arbitrarily
    loop until policy good enough
        loop for s ∈ S
            loop for a ∈ A
                Q(s, a) := R(s, a) + γ Σ_{s′ ∈ S} T(s, a, s′) V(s′)
            V(s) := max_a Q(s, a)
        end loop
    end loop
It is not obvious when to stop the value iteration algorithm. One important result
bounds the performance of the current greedy policy as a function of the Bellman residual of
the current value function (Williams & Baird, 1993b). It says that if the maximum difference
between two successive value functions is less than ε, then the value of the greedy policy
(the policy obtained by choosing, in every state, the action that maximizes the estimated
discounted reward, using the current estimate of the value function) differs from the value
function of the optimal policy by no more than 2εγ/(1 − γ) at any state. This provides an
effective stopping criterion for the algorithm. Puterman (1994) discusses another stopping
criterion, based on the span semi-norm, which may result in earlier termination. Another
important result is that the greedy policy is guaranteed to be optimal in some finite number
of steps even though the value function may not have converged (Bertsekas, 1987). And in
practice, the greedy policy is often optimal long before the value function has converged.
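The following Python sketch puts the algorithm and the stopping rule together for a toy MDP (the two-state example and the data-structure layout are assumptions made for illustration):

def value_iteration(S, A, T, R, gamma, epsilon):
    # T[s][a] is a list of (s_next, prob); R[s][a] is the expected immediate reward.
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a]) for a in A)
                 for s in S}
        # stop when the Bellman residual is small; the greedy policy is then near-optimal
        if max(abs(V_new[s] - V[s]) for s in S) < epsilon:
            return V_new
        V = V_new

S, A = [0, 1], ['stay', 'go']
T = {0: {'stay': [(0, 1.0)], 'go': [(1, 1.0)]},
     1: {'stay': [(1, 1.0)], 'go': [(0, 1.0)]}}
R = {0: {'stay': 0.0, 'go': 1.0},
     1: {'stay': 0.0, 'go': 0.0}}
print(value_iteration(S, A, T, R, gamma=0.9, epsilon=1e-6))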
Value iteration is very flexible. The assignments to V need not be done in strict order
as shown above, but instead can occur asynchronously in parallel, provided that the value
of every state gets updated infinitely often on an infinite run. These issues are treated
extensively by Bertsekas (1989), who also proves convergence results.
Updates based on Equation 1 are known as full backups since they make use of infor-
mation from all possible successor states. It can be shown that updates of the form
    Q(s, a) := Q(s, a) + α (r + γ max_{a′} Q(s′, a′) − Q(s, a))
can also be used as long as each pairing of a and s is updated infinitely often, s′ is sampled
from the distribution T(s, a, s′), r is sampled with mean R(s, a) and bounded variance, and
the learning rate α is decreased slowly. This type of sample backup (Singh, 1993) is critical
to the operation of the model-free methods discussed in the next section.
The computational complexity of the value-iteration algorithm with full backups, per
iteration, is quadratic in the number of states and linear in the number of actions. Com-
monly, the transition probabilities T(s, a, s′) are sparse. If there are on average a constant
number of next states with non-zero probability then the cost per iteration is linear in the
number of states and linear in the number of actions. The number of iterations required to
reach the optimal value function is polynomial in the number of states and the magnitude
of the largest reward if the discount factor γ is held constant. However, in the worst case
the number of iterations grows polynomially in 1/(1 − γ), so the convergence rate slows
considerably as the discount factor approaches 1 (Littman, Dean, & Kaelbling, 1995b).
    until π = π′
The value function of a policy is just the expected infinite discounted reward that will
be gained, at each state, by executing that policy. It can be determined by solving a set
of linear equations. Once we know the value of each state under the current policy, we
consider whether the value could be improved by changing the first action taken. If it can,
we change the policy to take the new action whenever it is in that situation. This step is
guaranteed to strictly improve the performance of the policy. When no improvements are
possible, then the policy is guaranteed to be optimal.
Since there are at most |A|^|S| distinct policies, and the sequence of policies improves at
each step, this algorithm terminates in at most an exponential number of iterations (Puter-
man, 1994). However, it is an important open question how many iterations policy iteration
takes in the worst case. It is known that the running time is pseudopolynomial and that for
any fixed discount factor, there is a polynomial bound in the total size of the MDP (Littman
et al., 1995b).
3.2.3 Enhancements to Value Iteration and Policy Iteration
In practice, value iteration is much faster per iteration, but policy iteration takes fewer
iterations. Arguments have been put forth to the effect that each approach is better for
large problems. Puterman's modified policy iteration algorithm (Puterman & Shin, 1978)
provides a method for trading iteration time for iteration improvement in a smoother way.
The basic idea is that the expensive part of policy iteration is solving for the exact value
of V_π. Instead of finding an exact value for V_π, we can perform a few steps of a modified
value-iteration step in which the policy is held fixed over successive iterations. This can be
shown to produce an approximation to V_π that converges linearly in γ. In practice, this can
result in substantial speedups.
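A rough Python sketch of the idea, under the same toy-MDP conventions as the value-iteration sketch above (the number of evaluation sweeps m and all names are assumptions):

def modified_policy_iteration(S, A, T, R, gamma, m=5, iters=100):
    # m fixed-policy backup sweeps stand in for exact policy evaluation.
    policy = {s: A[0] for s in S}
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        for _ in range(m):                          # approximate evaluation of the current policy
            V = {s: R[s][policy[s]] + gamma * sum(p * V[s2] for s2, p in T[s][policy[s]])
                 for s in S}
        new_policy = {s: max(A, key=lambda a, s=s: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a]))
                      for s in S}                   # greedy improvement step
        if new_policy == policy:
            return policy, V
        policy = new_policy
    return policy, V

Making m large recovers something close to policy iteration, while very small m behaves roughly like value iteration restricted to the current policy between improvement steps.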
Several standard numerical-analysis techniques that speed the convergence of dynamic
programming can be used to accelerate value and policy iteration. Multigrid methods can
be used to quickly seed a good initial approximation to a high-resolution value function
by initially performing value iteration at a coarser resolution (Rüde, 1993). State aggre-
gation works by collapsing groups of states to a single meta-state, solving the abstracted
problem (Bertsekas & Castañon, 1989).
[Figure 4: Block diagram of the adaptive heuristic critic architecture. The critic (AHC) receives the state s and reward r and produces a heuristic value v, which the reinforcement-learning component (RL) uses to select the action a.]
the immediate reward and the estimated value of the next state. This class of algorithms
is known as temporal difference methods (Sutton, 1988). We will consider two different
temporal-difference learning strategies for the discounted infinite-horizon model.
4.1 Adaptive Heuristic Critic and TD(λ)
The adaptive heuristic critic algorithm is an adaptive version of policy iteration (Barto,
Sutton, & Anderson, 1983) in which the value-function computation is no longer imple-
mented by solving a set of linear equations, but is instead computed by an algorithm called
TD(0). A block diagram for this approach is given in Figure 4. It consists of two compo-
nents: a critic (labeled AHC), and a reinforcement-learning component (labeled RL). The
reinforcement-learning component can be an instance of any of the k-armed bandit algo-
rithms, modified to deal with multiple states and non-stationary rewards. But instead of
acting to maximize instantaneous reward, it will be acting to maximize the heuristic value,
v, that is computed by the critic. The critic uses the real external reinforcement signal to
learn to map states to their expected discounted values given that the policy being executed
is the one currently instantiated in the RL component.
We can see the analogy with modified policy iteration if we imagine these components
working in alternation. The policy π implemented by RL is fixed and the critic learns the
value function V_π for that policy. Now we fix the critic and let the RL component learn a
new policy π′ that maximizes the new value function, and so on. In most implementations,
however, both components operate simultaneously. Only the alternating implementation
can be guaranteed to converge to the optimal policy, under appropriate conditions. Williams
and Baird explored the convergence properties of a class of AHC-related algorithms they
call "incremental variants of policy iteration" (Williams & Baird, 1993a).
It remains to explain how the critic can learn the value of a policy. We define ⟨s, a, r, s′⟩
to be an experience tuple summarizing a single transition in the environment. Here s is the
agent's state before the transition, a is its choice of action, r the instantaneous reward it
receives, and s′ its resulting state. The value of a policy is learned using Sutton's TD(0)
algorithm (Sutton, 1988), which uses the update rule
    V(s) := V(s) + α (r + γ V(s′) − V(s)) .
Whenever a state s is visited, its estimated value is updated to be closer to r + γ V(s′),
since r is the instantaneous reward received and V(s′) is the estimated value of the actually
occurring next state. This is analogous to the sample-backup rule from value iteration; the
only difference is that the sample is drawn from the real world rather than by simulating
a known model. The key idea is that r + γ V(s′) is a sample of the value of V(s), and it is
more likely to be correct because it incorporates the real r. If the learning rate α is adjusted
properly (it must be slowly decreased) and the policy is held fixed, TD(0) is guaranteed to
converge to the optimal value function.
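In code, the TD(0) backup is essentially a one-liner; the following Python sketch (names and constants are illustrative) applies it to a single observed transition.

def td0_update(V, s, r, s_next, alpha, gamma):
    # Move the estimate V(s) toward the sample r + gamma * V(s_next).
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

V = {0: 0.0, 1: 0.0, 2: 0.0}
# experience tuple <s, a, r, s'> gathered while following the fixed policy
V = td0_update(V, s=0, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
print(V)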
The TD(0) rule as presented above is really an instance of a more general class of
algorithms called TD(λ), with λ = 0. TD(0) looks only one step ahead when adjusting
value estimates; although it will eventually arrive at the correct answer, it can take quite a
while to do so. The general TD(λ) rule is similar to the TD(0) rule given above,
    V(u) := V(u) + α (r + γ V(s′) − V(s)) e(u) ,
but it is applied to every state according to its eligibility e(u), rather than just to the
immediately previous state, s. One version of the eligibility trace is defined to be
    e(s) = Σ_{k=1}^{t} (λγ)^{t−k} δ_{s,s_k} ,   where   δ_{s,s_k} = 1 if s = s_k, and 0 otherwise.
The eligibility of a state s is the degree to which it has been visited in the recent past;
when a reinforcement is received, it is used to update all the states that have been recently
visited, according to their eligibility. When λ = 0 this is equivalent to TD(0). When λ = 1,
it is roughly equivalent to updating all the states according to the number of times they
were visited by the end of a run. Note that we can update the eligibility online as follows:
    e(s) := γλ e(s) + 1   if s is the current state,
    e(s) := γλ e(s)        otherwise.
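Putting the pieces together, a minimal Python sketch of one online TD(λ) step, using the accumulating trace just described (parameter values and names are illustrative assumptions):

def td_lambda_step(V, e, s, r, s_next, alpha, gamma, lam):
    # One online TD(lambda) update applied to every state according to its eligibility.
    delta = r + gamma * V[s_next] - V[s]                  # TD error for this transition
    e = {u: gamma * lam * eu for u, eu in e.items()}      # decay all eligibilities
    e[s] = e.get(s, 0.0) + 1.0                            # bump the current state
    for u in V:
        V[u] += alpha * delta * e.get(u, 0.0)
    return V, e

V, e = {0: 0.0, 1: 0.0, 2: 0.0}, {}
V, e = td_lambda_step(V, e, s=0, r=1.0, s_next=1, alpha=0.1, gamma=0.9, lam=0.8)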
Note also that, since V*(s) = max_a Q*(s, a), we have π*(s) = argmax_a Q*(s, a) as an
optimal policy.
Because the Q function makes the action explicit, we can estimate the Q values on-
line using a method essentially the same as TD(0), but also use them to define the policy,
because an action can be chosen just by taking the one with the maximum Q value for the
current state.
The Q-learning rule is
    Q(s, a) := Q(s, a) + α (r + γ max_{a′} Q(s′, a′) − Q(s, a)) ,
where ⟨s, a, r, s′⟩ is an experience tuple as described earlier. If each action is executed in
each state an infinite number of times on an infinite run and α is decayed appropriately, the
Q values will converge with probability 1 to Q* (Watkins, 1989; Tsitsiklis, 1994; Jaakkola,
Jordan, & Singh, 1994). Q-learning can also be extended to update states that occurred
more than one step previously, as in TD(λ) (Peng & Williams, 1994).
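A tabular Python sketch of the update, together with one common ad hoc exploration rule (ε-greedy); the defaults and names here are illustrative assumptions rather than anything specified in the paper.

import random
from collections import defaultdict

Q = defaultdict(float)                      # tabular Q values, default 0

def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # One Q-learning backup from the experience tuple <s, a, r, s'>.
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # A common ad hoc exploration rule: act greedily except with probability epsilon.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])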
When the Q values are nearly converged to their optimal values, it is appropriate for
the agent to act greedily, taking, in each situation, the action with the highest Q value.
During learning, however, there is a difficult exploitation versus exploration trade-off to be
made. There are no good, formally justified approaches to this problem in the general case;
standard practice is to adopt one of the ad hoc methods discussed in Section 2.2.
AHC architectures seem to be more difficult to work with than Q-learning on a practical
level. It can be hard to get the relative learning rates right in AHC so that the two
components converge together. In addition, Q-learning is exploration insensitive: that
is, the Q values will converge to the optimal values, independent of how the agent
behaves while the data is being collected (as long as all state-action pairs are tried often
enough). This means that, although the exploration-exploitation issue must be addressed
in Q-learning, the details of the exploration strategy will not affect the convergence of the
learning algorithm. For these reasons, Q-learning is the most popular and seems to be the
most effective model-free algorithm for learning from delayed reinforcement. It does not,
however, address any of the issues involved in generalizing over large state and/or action
spaces. In addition, it may converge quite slowly to a good policy.
4.3 Model-free Learning With Average Reward
As described, Q-learning can be applied to discounted infinite-horizon MDPs. It can also
be applied to undiscounted problems as long as the optimal policy is guaranteed to reach a
reward-free absorbing state and the state is periodically reset.
Schwartz (1993) examined the problem of adapting Q-learning to an average-reward
framework. Although his R-learning algorithm seems to exhibit convergence problems for
some MDPs, several researchers have found the average-reward criterion closer to the true
problem they wish to solve than a discounted criterion and therefore prefer R-learning to
Q-learning (Mahadevan, 1994).
With that in mind, researchers have studied the problem of learning optimal average-
reward policies. Mahadevan (1996) surveyed model-based average-reward algorithms from
a reinforcement-learning perspective and found several difficulties with existing algorithms.
In particular, he showed that existing reinforcement-learning algorithms for average reward
(and some dynamic programming algorithms) do not always produce bias-optimal poli-
cies. Jaakkola, Jordan and Singh (1995) described an average-reward learning algorithm
with guaranteed convergence properties. It uses a Monte-Carlo component to estimate the
expected future reward for each state as the agent moves through the environment. In
addition, Bertsekas presents a Q-learning-like algorithm for average-case reward in his new
textbook (1995). Although this recent work provides a much needed theoretical foundation
to this area of reinforcement learning, many important problems remain unsolved.
Figure 5: In this environment, due to Whitehead (1991), random exploration would take
O(2^n) steps to reach the goal even once, whereas a more intelligent exploration
strategy (e.g. "assume any untried action leads directly to goal") would require
only O(n^2) steps.
Choose an action a′ to perform in state s′, based on the Q values but perhaps modified
by an exploration strategy.
The Dyna algorithm requires about k times the computation of Q-learning per instance,
but this is typically vastly less than for the naive model-based method. A reasonable value
of k can be determined based on the relative speeds of computation and of taking action.
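A compact Python sketch of the Dyna idea in the deterministic case: remember observed transitions in a model and perform k extra backups on replayed transitions after each real step (the data structures and parameter names are assumptions made for illustration).

import random

def dyna_q_step(Q, model, s, a, r, s_next, actions, alpha, gamma, k):
    # Q is assumed to be a defaultdict(float) keyed by (state, action).
    # Record the real transition, do one real backup, then k model-based backups.
    model[(s, a)] = (r, s_next)
    def backup(s1, a1, r1, s2):
        target = r1 + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s1, a1)] += alpha * (target - Q[(s1, a1)])
    backup(s, a, r, s_next)
    for _ in range(k):
        (s1, a1), (r1, s2) = random.choice(list(model.items()))
        backup(s1, a1, r1, s2)
    return Q, model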
Figure 6 shows a grid world in which, in each cell, the agent has four actions (N, S, E,
W) and transitions are made deterministically to an adjacent cell, unless there is a block,
in which case no movement occurs. As we will see in Table 1, Dyna requires an order of
magnitude fewer steps of experience than does Q-learning to arrive at an optimal policy.
Dyna requires about six times more computational effort, however.
Table 1: The performance of three algorithms described in the text. All methods used
the exploration heuristic of "optimism in the face of uncertainty": any state not
previously visited was assumed by default to be a goal state. Q-learning used
its optimal learning rate parameter for a deterministic maze: α = 1. Dyna and
prioritized sweeping were permitted to take k = 200 backups per transition. For
prioritized sweeping, the priority queue often emptied before all backups were
used.
6. Generalization
All of the previous discussion has tacitly assumed that it is possible to enumerate the state
and action spaces and store tables of values over them. Except in very small environments,
this means impractical memory requirements. It also makes inefficient use of experience. In
a large, smooth state space we generally expect similar states to have similar values and sim-
ilar optimal actions. Surely, therefore, there should be some more compact representation
than a table. Most problems will have continuous or large discrete state spaces; some will
have large or continuous action spaces. The problem of learning in large spaces is addressed
through generalization techniques, which allow compact storage of learned information and
transfer of knowledge between "similar" states and actions.
The large literature of generalization techniques from inductive concept learning can be
applied to reinforcement learning. However, techniques often need to be tailored to specific
details of the problem. In the following sections, we explore the application of standard
function-approximation techniques, adaptive resolution models, and hierarchical methods
to the problem of reinforcement learning.
The reinforcement-learning architectures and algorithms discussed above have included
the storage of a variety of mappings, including S → A (policies), S → ℜ (value functions),
S × A → ℜ (Q functions and rewards), S × A → S (deterministic transitions), and S ×
A × S → [0, 1] (transition probabilities). Some of these mappings, such as transitions and
immediate rewards, can be learned using straightforward supervised learning, and can be
handled using any of the wide variety of function-approximation techniques for supervised
learning that support noisy training examples. Popular techniques include various neural-
network methods (Rumelhart & McClelland, 1986), fuzzy logic (Berenji, 1991; Lee, 1991),
CMAC (Albus, 1981), and local memory-based methods (Moore, Atkeson, & Schaal, 1995),
such as generalizations of nearest neighbor methods. Other mappings, especially the policy
mapping, typically need specialized algorithms because training sets of input-output pairs
are not available.
6.1 Generalization over Input
A reinforcement-learning agent's current state plays a central role in its selection of reward-
maximizing actions. Viewing the agent as a state-free black box, a description of the
current state is its input. Depending on the agent architecture, its output is either an
action selection, or an evaluation of the current state that can be used to select an action.
The problem of deciding how the different aspects of an input affect the value of the output
is sometimes called the "structural credit-assignment" problem. This section examines
approaches to generating actions or evaluations as a function of a description of the agent's
current state.
The first group of techniques covered here is specialized to the case when reward is not
delayed; the second group is more generally applicable.
6.1.1 Immediate Reward
When the agent's actions do not influence state transitions, the resulting problem becomes
one of choosing actions to maximize immediate reward as a function of the agent's current
state. These problems bear a resemblance to the bandit problems discussed in Section 2
except that the agent should condition its action selection on the current state. For this
reason, this class of problems has been described as associative reinforcement learning.
The algorithms in this section address the problem of learning from immediate boolean
reinforcement where the state is vector valued and the action is a boolean vector. Such
algorithms can and have been used in the context of delayed reinforcement, for instance,
as the RL component in the AHC architecture described in Section 4.1. They can also be
generalized to real-valued reward through reward comparison methods (Sutton, 1984).
CRBP The complementary reinforcement backpropagation algorithm (Ackley & Littman,
1990) (crbp) consists of a feed-forward network mapping an encoding of the state to an
encoding of the action. The action is determined probabilistically from the activation of
the output units: if output unit i has activation yi , then bit i of the action vector has value
1 with probability yi , and 0 otherwise. Any neural-network supervised training procedure
can be used to adapt the network as follows. If the result of generating action a is r = 1,
then the network is trained with input-output pair ⟨s, a⟩. If the result is r = 0, then the
network is trained with input-output pair ⟨s, ā⟩, where ā = (1 − a_1, ..., 1 − a_n).
The idea behind this training rule is that whenever an action fails to generate reward,
crbp will try to generate an action that is different from the current choice. Although it
seems like the algorithm might oscillate between an action and its complement, that does
not happen. One step of training will change the action only slightly, and since the output
probabilities tend to move toward 0.5, action selection becomes more random, which
increases search. The hope is that the random distribution will generate an
action that works better, and then that action will be reinforced.
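The target-construction step of crbp can be sketched in a few lines of Python; the network and trainer are abstracted away, and all names here are illustrative.

import random

def sample_action(y):
    # y: output activations, read as bit probabilities
    return [1 if random.random() < yi else 0 for yi in y]

def crbp_target(a, reward):
    # r = 1: train toward the action taken; r = 0: train toward its complement
    return a if reward == 1 else [1 - ai for ai in a]

# usage: a = sample_action(y); observe reward r; train the network on the pair (state, crbp_target(a, r))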
ARC The associative reinforcement comparison (arc) algorithm (Sutton, 1984) is an
instance of the ahc architecture for the case of boolean actions, consisting of two feed-
forward networks. One learns the value of situations; the other learns a policy. These can
be simple linear networks or can have hidden units.
In the simplest case, the entire system learns only to optimize immediate reward. First,
let us consider the behavior of the network that learns the policy, a mapping from a vector
describing s to a 0 or 1. If the output unit has activation y, then a, the action generated,
will be 1 if y + ν > 0, where ν is normal noise, and 0 otherwise.
The adjustment for the output unit is, in the simplest case,
    e = r (a − 1/2) ,
where the first factor is the reward received for taking the most recent action and the second
encodes which action was taken. The actions are encoded as 0 and 1, so a − 1/2 always has
the same magnitude; if the reward and the action have the same sign, then action 1 will be
made more likely, otherwise action 0 will be.
As described, the network will tend to seek actions that give positive reward. To extend
this approach to maximize reward, we can compare the reward to some baseline, b. This
changes the adjustment to
    e = (r − b)(a − 1/2) ,
where b is the output of the second network. The second network is trained in a standard
supervised mode to estimate r as a function of the input state s.
Variations of this approach have been used in a variety of applications (Anderson, 1986;
Barto et al., 1983; Lin, 1993b; Sutton, 1984).
REINFORCE Algorithms Williams (1987, 1992) studied the problem of choosing ac-
tions to maximize immediate reward. He identified a broad class of update rules that per-
form gradient descent on the expected reward and showed how to integrate these rules with
backpropagation. This class, called reinforce algorithms, includes linear reward-inaction
(Section 2.1.3) as a special case.
The generic reinforce update for a parameter w_ij can be written
    Δw_ij = α_ij (r − b_ij) ∂ ln g_i / ∂ w_ij ,
where α_ij is a non-negative factor, r the current reinforcement, b_ij a reinforcement baseline,
and g_i is the probability density function used to randomly generate actions based on unit
activations. Both α_ij and b_ij can take on different values for each w_ij; however, when α_ij
is constant throughout the system, the expected update is exactly in the direction of the
expected reward gradient. Otherwise, the update is in the same half space as the gradient
but not necessarily in the direction of steepest increase.
Williams points out that the choice of baseline, b_ij, can have a profound effect on the
convergence speed of the algorithm.
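As one concrete instance of the rule (an illustration, not the paper's own formulation): for a single Bernoulli output unit whose firing probability is the logistic of a weighted input sum, the characteristic eligibility ∂ ln g / ∂ w_j works out to (a − p) x_j, which gives the following Python sketch.

import math, random

def sample_action(w, x):
    # Firing probability is the logistic of the weighted input sum.
    p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
    a = 1 if random.random() < p else 0
    return a, p

def reinforce_update(w, x, a, p, r, b, alpha=0.1):
    # delta w_j = alpha * (r - b) * (a - p) * x_j   (the gradient of ln g for this unit)
    return [wj + alpha * (r - b) * (a - p) * xj for wj, xj in zip(w, x)]

# usage: a, p = sample_action(w, x); observe reward r; w = reinforce_update(w, x, a, p, r, b)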
Logic-Based Methods Another strategy for generalization in reinforcement learning is
to reduce the learning problem to an associative problem of learning boolean functions.
A boolean function has a vector of boolean inputs and a single boolean output. Taking
inspiration from mainstream machine learning work, Kaelbling developed two algorithms
for learning boolean functions from reinforcement: one uses the bias of k-DNF to drive
the generalization process (Kaelbling, 1994b); the other searches the space of syntactic
descriptions of functions using a simple generate-and-test method (Kaelbling, 1994a).
The restriction to a single boolean output makes these techniques difficult to apply. In
very benign learning situations, it is possible to extend this approach to use a collection
of learners to independently learn the individual bits that make up a complex output. In
general, however, that approach suffers from the problem of very unreliable reinforcement:
if a single learner generates an inappropriate output bit, all of the learners receive a low
reinforcement value. The cascade method (Kaelbling, 1993b) allows a collection of learners
to be trained collectively to generate appropriate joint outputs; it is considerably more
reliable, but can require additional computational effort.
6.1.2 Delayed Reward
Another method to allow reinforcement-learning techniques to be applied in large state
spaces is modeled on value iteration and Q-learning. Here, a function approximator is used
to represent the value function by mapping a state description to a value.
Many researchers have experimented with this approach: Boyan and Moore (1995) used
local memory-based methods in conjunction with value iteration; Lin (1991) used backprop-
agation networks for Q-learning; Watkins (1989) used CMAC for Q-learning; Tesauro (1992,
1995) used backpropagation for learning the value function in backgammon (described in
Section 8.1); Zhang and Dietterich (1995) used backpropagation and TD(λ) to learn good
strategies for job-shop scheduling.
Although there have been some positive examples, in general there are unfortunate in-
teractions between function approximation and the learning rules. In discrete environments
there is a guarantee that any operation that updates the value function (according to the
Bellman equations) can only reduce the error between the current value function and the
optimal value function. This guarantee no longer holds when generalization is used. These
issues are discussed by Boyan and Moore (1995), who give some simple examples of value
function errors growing arbitrarily large when generalization is used with value iteration.
Their solution to this, applicable only to certain classes of problems, discourages such diver-
gence by only permitting updates whose estimated values can be shown to be near-optimal
via a battery of Monte-Carlo experiments.
Thrun and Schwartz (1993) theorize that function approximation of value functions
is also dangerous because the errors in value functions due to generalization can become
compounded by the "max" operator in the definition of the value function.
Several recent results (Gordon, 1995; Tsitsiklis & Van Roy, 1996) show how the appro-
priate choice of function approximator can guarantee convergence, though not necessarily to
the optimal values. Baird's residual gradient technique (Baird, 1995) provides guaranteed
convergence to locally optimal solutions.
Perhaps the gloominess of these counter-examples is misplaced. Boyan and Moore (1995)
report that their counter-examples can be made to work with problem-specific hand-tuning
despite the unreliability of untuned algorithms that provably converge in discrete domains.
Sutton (1996) shows how modified versions of Boyan and Moore's examples can converge
successfully. An open question is whether general principles, ideally supported by theory,
can help us understand when value function approximation will succeed. In Sutton's com-
parative experiments with Boyan and Moore's counter-examples, he changes four aspects
of the experiments:
1. Small changes to the task specifications.
2. A very different kind of function approximator (CMAC (Albus, 1975)) that has weak
generalization.
3. A different learning algorithm: SARSA (Rummery & Niranjan, 1994) instead of value
iteration.
4. A different training regime. Boyan and Moore sampled states uniformly in state space,
whereas Sutton's method sampled along empirical trajectories.
There are intuitive reasons to believe that the fourth factor is particularly important, but
more careful research is needed.
Adaptive Resolution Models In many cases, what we would like to do is partition
the environment into regions of states that can be considered the same for the purposes of
learning and generating actions. Without detailed prior knowledge of the environment, it
is very difficult to know what granularity or placement of partitions is appropriate. This
problem is overcome in methods that use adaptive resolution; during the course of learning,
a partition is constructed that is appropriate to the environment.
Decision Trees In environments that are characterized by a set of boolean or discrete-
valued variables, it is possible to learn compact decision trees for representing Q values. The
G-learning algorithm (Chapman & Kaelbling, 1991) works as follows. It starts by assuming
that no partitioning is necessary and tries to learn Q values for the entire environment as
if it were one state. In parallel with this process, it gathers statistics based on individual
input bits; it asks whether there is some bit b in the state description such
that the Q values for states in which b = 1 are significantly different from Q values for
states in which b = 0. If such a bit is found, it is used to split the decision tree. Then,
the process is repeated in each of the leaves. This method was able to learn very small
representations of the Q function in the presence of an overwhelming number of irrelevant,
noisy state attributes. It outperformed Q-learning with backpropagation in a simple video-
game environment and was used by McCallum (1995) (in conjunction with other techniques
for dealing with partial observability) to learn behaviors in a complex driving-simulator. It
cannot, however, acquire partitions in which attributes are only significant in combination
(such as those needed to solve parity problems).
Variable Resolution Dynamic Programming The VRDP algorithm (Moore, 1991)
enables conventional dynamic programming to be performed in real-valued multivariate
state-spaces where straightforward discretization would fall prey to the curse of dimension-
ality. A kd-tree (similar to a decision tree) is used to partition state space into coarse
regions. The coarse regions are refined into detailed regions, but only in parts of the state
space which are predicted to be important. This notion of importance is obtained by running
"mental trajectories" through state space. This algorithm proved effective on a number
of problems for which full high-resolution arrays would have been impractical. It has the
disadvantage of requiring a guess at an initially valid trajectory through state-space.
Figure 7: (a) A two-dimensional maze problem. The point robot must find a path from
start to goal without crossing any of the barrier lines. (b) The path taken by
PartiGame during the entire first trial. It begins with intense exploration to find a
route out of the almost entirely enclosed start region. Having eventually reached
a sufficiently high resolution, it discovers the gap and proceeds greedily towards
the goal, only to be temporarily blocked by the goal's barrier region. (c) The
second trial.
it can give to the low-level learner. When the master generates a particular command to
the slave, it must reward the slave for taking actions that satisfy the command, even if they
do not result in external reinforcement. The master, then, learns a mapping from states to
commands. The slave learns a mapping from commands and states to external actions. The
set of \commands" and their associated reinforcement functions are established in advance
of the learning.
This is really an instance of the general "gated behaviors" approach, in which the slave
can execute any of the behaviors depending on its command. The reinforcement functions
for the individual behaviors (commands) are given, but learning takes place simultaneously
at both the high and low levels.
6.3.2 Compositional Q-learning
Singh's compositional Q-learning (1992b, 1992a) (C-QL) consists of a hierarchy based on
the temporal sequencing of subgoals. The elemental tasks are behaviors that achieve some
recognizable condition. The high-level goal of the system is to achieve some set of condi-
tions in sequential order. The achievement of the conditions provides reinforcement for the
elemental tasks, which are trained rst to achieve individual subgoals. Then, the gating
function learns to switch the elemental tasks in order to achieve the appropriate high-level
sequential goal. This method was used by Tham and Prager (1994) to learn to control a
simulated multi-link robot arm.
6.3.3 Hierarchical Distance to Goal
Especially if we consider reinforcement learning modules to be part of larger agent archi-
tectures, it is important to consider problems in which goals are dynamically input to the
learner. Kaelbling's HDG algorithm (1993a) uses a hierarchical approach to solving prob-
lems when goals of achievement (the agent should get to a particular state as quickly as
possible) are given to an agent dynamically.
The HDG algorithm works by analogy with navigation in a harbor. The environment
is partitioned (a priori, but more recent work (Ashar, 1994) addresses the case of learning
the partition) into a set of regions whose centers are known as "landmarks." If the agent is
Figure 9: An example of a partially observable environment.
currently in the same region as the goal, then it uses low-level actions to move to the goal.
If not, then high-level information is used to determine the next landmark on the shortest
path from the agent's closest landmark to the goal's closest landmark. Then, the agent uses
low-level information to aim toward that next landmark. If errors in action cause deviations
in the path, there is no problem; the best aiming point is recomputed on every step.
level classifier system (Wilson, 1995) and add one- and two-bit memory registers. They find
that, although their system can learn to use short-term memory registers effectively, the
approach is unlikely to scale to more complex environments.
Dorigo and Colombetti applied classifier systems to a moderately complex problem of
learning robot behavior from immediate reinforcement (Dorigo, 1995; Dorigo & Colombetti,
1994).
Finite-history-window Approach One way to restore the Markov property is to allow
decisions to be based on the history of recent observations and perhaps actions. Lin and
Mitchell (1992) used a fixed-width finite history window to learn a pole balancing task.
McCallum (1995) describes the "utile suffix memory," which learns a variable-width window
that serves simultaneously as a model of the environment and a finite-memory policy. This
system has had excellent results in a very complex driving-simulation domain (McCallum,
1995). Ring (1994) has a neural-network approach that uses a variable history window,
adding history when necessary to disambiguate situations.
POMDP Approach Another strategy consists of using hidden Markov model (HMM)
techniques to learn a model of the environment, including the hidden state, then to use that
model to construct a perfect memory controller (Cassandra, Kaelbling, & Littman, 1994;
Lovejoy, 1991; Monahan, 1982).
Chrisman (1992) showed how the forward-backward algorithm for learning HMMs could
be adapted to learning POMDPs. He, and later McCallum (1993), also gave heuristic state-
splitting rules to attempt to learn the smallest possible model for a given environment. The
resulting model can then be used to integrate information from the agent's observations in
order to make decisions.
Figure 10 illustrates the basic structure for a perfect-memory controller. The component
on the left is the state estimator, which computes the agent's belief state, b, as a function of
the old belief state, the last action a, and the current observation i. In this context, a belief
state is a probability distribution over states of the environment, indicating the likelihood,
given the agent's past experience, that the environment is actually in each of those states.
The state estimator can be constructed straightforwardly using the estimated world model
and Bayes' rule.
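Concretely, the belief update is one application of Bayes' rule per step; here is a small Python sketch (the dictionary-based model layout and the names are assumptions):

def update_belief(b, a, obs, states, T, O):
    # b: dict state -> probability; T[(s, a, s2)]: transition probability;
    # O[(s2, a, obs)]: probability of the observation in the resulting state.
    b_new = {}
    for s2 in states:
        pred = sum(T[(s, a, s2)] * b[s] for s in states)   # predict where we ended up
        b_new[s2] = O[(s2, a, obs)] * pred                  # weight by observation likelihood
    z = sum(b_new.values())                                 # normalize
    return {s2: v / z for s2, v in b_new.items()}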
Now we are left with the problem of finding a policy mapping belief states into action.
This problem can be formulated as an MDP, but it is difficult to solve using the techniques
described earlier, because the input space is continuous. Chrisman's approach (1992) does
not take into account future uncertainty, but yields a policy after a small amount of com-
putation. A standard approach from the operations-research literature is to solve for the
optimal policy (or a close approximation thereof) based on its representation as a piecewise-
linear and convex function over the belief space. This method is computationally intractable,
but may serve as inspiration for methods that make further approximations (Cassandra
et al., 1994; Littman, Cassandra, & Kaelbling, 1995a).
Table 2: TD-Gammon's performance in games against the top human professional players.
A backgammon tournament involves playing a series of games for points until one
player reaches a set target. TD-Gammon won none of these tournaments but came
sufficiently close that it is now considered one of the best few players in the world.
Although experiments with other games have in some cases produced interesting learning
behavior, no success close to that of TD-Gammon has been repeated. Other games that
have been studied include Go (Schraudolph, Dayan, & Sejnowski, 1994) and Chess (Thrun,
1995). It is still an open question whether and how the success of TD-Gammon can be
repeated in other domains.
8.2 Robotics and Control
In recent years there have been many robotics and control applications that have used
reinforcement learning. Here we will concentrate on the following five examples, although
many other interesting robotics investigations are underway.
1. Schaal and Atkeson (1994) constructed a two-armed robot, shown in Figure 11, that
learns to juggle a device known as a devil-stick. This is a complex non-linear control
task involving a six-dimensional state space and less than 200 msecs per control deci-
sion. After about 40 initial attempts the robot learns to keep juggling for hundreds of
hits. A typical human learning the task requires an order of magnitude more practice
to achieve proficiency at mere tens of hits.
The juggling robot learned a world model from experience, which was generalized
to unvisited states by a function approximation scheme known as locally weighted
regression (Cleveland & Devlin, 1988; Moore & Atkeson, 1992). Between each trial,
a form of dynamic programming specic to linear control policies and locally linear
transitions was used to improve the policy. The form of dynamic programming is
known as linear-quadratic-regulator design (Sage & White, 1977).
2. Mahadevan and Connell (1991a) discuss a task in which a mobile robot pushes large
boxes for extended periods of time. Box-pushing is a well-known difficult robotics
problem, characterized by immense uncertainty in the results of actions. Q-learning
was used in conjunction with some novel clustering techniques designed to enable a
higher-dimensional input than a tabular approach would have permitted. The robot
learned to perform competitively with a human-programmed solution. Another
aspect of this work, mentioned in Section 6.3, was a pre-programmed
breakdown of the monolithic task description into a set of lower level tasks to be
learned.
3. Mataric (1994) describes a robotics experiment with, from the viewpoint of theoret-
ical reinforcement learning, an unthinkably high dimensional state space, containing
many dozens of degrees of freedom. Four mobile robots traveled within an enclo-
sure collecting small disks and transporting them to a destination region. There were
three enhancements to the basic Q-learning algorithm. Firstly, pre-programmed sig-
nals called progress estimators were used to break the monolithic task into subtasks.
This was achieved in a robust manner in which the robots were not forced to use
the estimators, but had the freedom to profit from the inductive bias they provided.
Secondly, control was decentralized. Each robot learned its own policy independently
without explicit communication with the others. Thirdly, state space was brutally
quantized into a small number of discrete states according to values of a small num-
ber of pre-programmed boolean features of the underlying sensors. The performance
of the Q-learned policies was almost as good as that of a simple hand-crafted controller
for the job.
4. Q-learning has been used in an elevator dispatching task (Crites & Barto, 1996). The
problem, which has been implemented in simulation only at this stage, involved four
elevators servicing ten oors. The objective was to minimize the average squared
wait time for passengers, discounted into future time. The problem can be posed as a
discrete Markov system, but there are 10^22 states even in the most simplified version of
the problem. Crites and Barto used neural networks for function approximation and
provided an excellent comparison study of their Q-learning approach against the most
popular and the most sophisticated elevator dispatching algorithms. The squared wait
time of their controller was approximately 7% less than the best alternative algorithm
("Empty the System" heuristic with a receding horizon controller) and less than half
the squared wait time of the controller most frequently used in real elevator systems.
5. The final example concerns an application of reinforcement learning by one of the
authors of this survey to a packaging task from a food processing industry. The
problem involves filling containers with variable numbers of non-identical products.
The product characteristics also vary with time, but can be sensed. Depending on
the task, various constraints are placed on the container-filling procedure. Here are
three examples:
The mean weight of all containers produced by a shift must not be below the
manufacturer's declared weight W .
The number of containers below the declared weight must be less than P %.
No containers may be produced below weight W′.
Such tasks are controlled by machinery which operates according to various setpoints.
Conventional practice is that setpoints are chosen by human operators, but this choice
is not easy as it is dependent on the current product characteristics and the current
task constraints. The dependency is often difficult to model and highly non-linear.
The task was posed as a finite-horizon Markov decision task in which the state of the
system is a function of the product characteristics, the amount of time remaining in
the production shift, and the mean wastage and percent below declared in the shift
so far. The system was discretized into 200,000 discrete states, and locally weighted
regression was used to learn and generalize a transition model. Prioritized sweep-
ing was used to maintain an optimal value function as each new piece of transition
information was obtained. In simulated experiments the savings were considerable,
typically with wastage reduced by a factor of ten. Since then the system has been
deployed successfully in several factories within the United States.
Some interesting aspects of practical reinforcement learning come to light from these
examples. The most striking is that in all cases, to make a real system work it proved
necessary to supplement the fundamental algorithm with extra pre-programmed knowledge.
Supplying extra knowledge comes at a price: more human effort and insight is required, and
the system is subsequently less autonomous. But it is also clear that for tasks such as
these, a knowledge-free approach would not have achieved worthwhile performance within
the nite lifetime of the robots.
What forms did this pre-programmed knowledge take? It included an assumption of
linearity for the juggling robot's policy and a manual breaking up of the task into subtasks
for the two mobile-robot examples; the box-pusher also used a clustering technique for
the Q values which assumed locally consistent Q values. The four disk-collecting robots
additionally used a manually discretized state space. The packaging example had far fewer
dimensions and so required correspondingly weaker assumptions, but there, too, the as-
sumption of local piecewise continuity in the transition model enabled massive reductions
in the amount of learning data required.
The exploration strategies are interesting too. The juggler used careful statistical anal-
ysis to judge where to profitably experiment. However, both mobile robot applications
were able to learn well with greedy exploration, always exploiting without deliberate ex-
ploration. The packaging task used optimism in the face of uncertainty. None of these
strategies mirrors theoretically optimal (but computationally intractable) exploration, and
yet all proved adequate.
Finally, it is also worth considering the computational regimes of these experiments.
They were all very different, which indicates that the differing computational demands of
various reinforcement learning algorithms do indeed have an array of differing applications.
The juggler needed to make very fast decisions with low latency between each hit, but
had long periods (30 seconds and more) between each trial to consolidate the experiences
collected on the previous trial and to perform the more aggressive computation necessary
to produce a new reactive controller on the next trial. The box-pushing robot was meant to
operate autonomously for hours and so had to make decisions with a uniform length control
cycle. The cycle was sufficiently long for quite substantial computations beyond simple Q-
learning backups. The four disk-collecting robots were particularly interesting. Each robot
had a short life of less than 20 minutes (due to battery constraints), meaning that substantial
number crunching was impractical, and any significant combinatorial search would have
used a significant fraction of the robot's learning lifetime. The packaging task had easy
constraints. One decision was needed every few minutes. This provided opportunities for
fully computing the optimal value function for the 200,000-state system between every
control cycle, in addition to performing massive cross-validation-based optimization of the
transition model being learned.
A great deal of further work is currently in progress on practical implementations of
reinforcement learning. The insights and task constraints that they produce will have an
important effect on shaping the kinds of algorithms that are developed in the future.
9. Conclusions
There are a variety of reinforcement-learning techniques that work effectively on a variety
of small problems. But very few of these techniques scale well to larger problems. This is
not because researchers have done a bad job of inventing learning techniques, but because
it is very difficult to solve arbitrary problems in the general case. In order to solve highly
complex problems, we must give up tabula rasa learning techniques and begin to incorporate
bias that will give leverage to the learning process.
The necessary bias can come in a variety of forms, including the following:
shaping: The technique of shaping is used in training animals (Hilgard & Bower, 1975); a
teacher presents very simple problems to solve first, then gradually exposes the learner
to more complex problems. Shaping has been used in supervised-learning systems,
and can be used to train hierarchical reinforcement-learning systems from the bottom
up (Lin, 1991), and to alleviate problems of delayed reinforcement by decreasing the
delay until the problem is well understood (Dorigo & Colombetti, 1994; Dorigo, 1995).
local reinforcement signals: Whenever possible, agents should be given reinforcement
signals that are local. In applications in which it is possible to compute a gradient,
rewarding the agent for taking steps up the gradient, rather than just for achieving
the final goal, can speed learning significantly (Mataric, 1994).
imitation: An agent can learn by "watching" another agent perform the task (Lin, 1991).
For real robots, this requires perceptual abilities that are not yet available. But
another strategy is to have a human supply appropriate motor commands to a robot
through a joystick or steering wheel (Pomerleau, 1993).
problem decomposition: Decomposing a huge learning problem into a collection of smaller
ones and providing useful reinforcement signals for the subproblems is a very power-
ful technique for biasing learning. Most interesting examples of robotic reinforcement
learning employ this technique to some extent (Connell & Mahadevan, 1993).
reflexes: One thing that keeps agents that know nothing from learning anything is that
they have a hard time even finding the interesting parts of the space: they wander
around at random, never getting near the goal, or they are always "killed" immediately.
These problems can be ameliorated by programming a set of "reflexes" that cause the
agent to act initially in some way that is reasonable (Mataric, 1994; Singh, Barto,
Grupen, & Connolly, 1994). These reflexes can eventually be overridden by more
detailed and accurate learned knowledge, but they at least keep the agent alive and
pointed in the right direction while it is trying to learn. Recent work by Millan (1996)
explores the use of reflexes to make robot learning safer and more efficient. The sketch
following this list combines such a reflex with a locally shaped reward.
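As a concrete, if toy, illustration of the last two forms of bias, the following minimal Python
sketch (ours, not taken from any of the systems surveyed) applies them to a short corridor
task: a hand-coded reflex drives the agent toward the goal during early episodes, and a
locally shaped reward gives credit for each step of progress rather than only at the final
goal. The task, the function names, and all constants are illustrative assumptions.

    import random

    # Toy corridor task: cells 0..N_STATES-1, with the goal at the right end.
    N_STATES = 10
    ACTIONS = [-1, +1]                 # step left or step right
    ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
    REFLEX_EPISODES = 20               # how long the hand-coded reflex stays in charge

    # Tabular Q-values, initialised to zero.
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    def reflex(state):
        # Hand-coded prior knowledge: always drift toward the goal.
        return +1

    def shaped_reward(state, next_state):
        # Local reinforcement signal: credit for each step of progress,
        # plus a larger bonus on actually reaching the goal.
        progress = next_state - state
        bonus = 10.0 if next_state == N_STATES - 1 else 0.0
        return progress + bonus

    def choose_action(state, episode):
        if episode < REFLEX_EPISODES:      # early on, trust the reflex
            return reflex(state)
        if random.random() < EPSILON:      # afterwards, epsilon-greedy on Q
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])

    for episode in range(200):
        state = 0
        while state != N_STATES - 1:
            action = choose_action(state, episode)
            next_state = min(max(state + action, 0), N_STATES - 1)
            reward = shaped_reward(state, next_state)
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
            state = next_state

    # After training, the greedy action in the start state should be +1 (toward the goal).
    print(max(ACTIONS, key=lambda a: Q[(0, a)]))

In a real system the reflex would more plausibly be blended with, and only gradually
overridden by, the learned policy rather than switched off after a fixed number of episodes,
and shaping could be layered on top by starting episodes near the goal and slowly moving
the start state back.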
With appropriate biases, supplied by human programmers or teachers, complex reinforcement-
learning problems will eventually be solvable. There is still much work to be done, and many
interesting questions remain for learning techniques, especially regarding methods for
approximating, decomposing, and incorporating bias into problems.
Acknowledgements
Thanks to Marco Dorigo and three anonymous reviewers for comments that have helped
to improve this paper. Also thanks to our many colleagues in the reinforcement-learning
community who have done this work and explained it to us.
Leslie Pack Kaelbling was supported in part by NSF grants IRI-9453383 and IRI-
9312395. Michael Littman was supported in part by Bellcore. Andrew Moore was supported
in part by an NSF Research Initiation Award and by 3M Corporation.
References
Ackley, D. H., & Littman, M. L. (1990). Generalization and scaling in reinforcement learn-
ing. In Touretzky, D. S. (Ed.), Advances in Neural Information Processing Systems
2, pp. 550–557 San Mateo, CA. Morgan Kaufmann.
Albus, J. S. (1975). A new approach to manipulator control: Cerebellar model articulation
controller (CMAC). Journal of Dynamic Systems, Measurement and Control, 97, 220–
227.
Albus, J. S. (1981). Brains, Behavior, and Robotics. BYTE Books, Subsidiary of McGraw-
Hill, Peterborough, New Hampshire.
Anderson, C. W. (1986). Learning and Problem Solving with Multilayer Connectionist
Systems. Ph.D. thesis, University of Massachusetts, Amherst, MA.
Ashar, R. R. (1994). Hierarchical learning in stochastic domains. Master's thesis, Brown
University, Providence, Rhode Island.
Baird, L. (1995). Residual algorithms: Reinforcement learning with function approxima-
tion. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International
Conference on Machine Learning, pp. 30–37 San Francisco, CA. Morgan Kaufmann.
Baird, L. C., & Klopf, A. H. (1993). Reinforcement learning with high-dimensional, con-
tinuous actions. Tech. rep. WL-TR-93-1147, Wright-Patterson Air Force Base, Ohio:
Wright Laboratory.
Barto, A. G., Bradtke, S. J., & Singh, S. P. (1995). Learning to act using real-time dynamic
programming. Artificial Intelligence, 72 (1), 81–138.
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that
can solve difficult learning control problems. IEEE Transactions on Systems, Man,
and Cybernetics, SMC-13 (5), 834–846.
Bellman, R. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.
Berenji, H. R. (1991). Artificial neural networks and approximate reasoning for intelligent
control in space. In American Control Conference, pp. 1075–1080.
Berry, D. A., & Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments.
Chapman and Hall, London, UK.
Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models.
Prentice-Hall, Englewood Cliffs, NJ.
Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control. Athena Scientific,
Belmont, Massachusetts. Volumes 1 and 2.
Bertsekas, D. P., & Castañon, D. A. (1989). Adaptive aggregation for infinite horizon
dynamic programming. IEEE Transactions on Automatic Control, 34 (6), 589–598.
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numer-
ical Methods. Prentice-Hall, Englewood Cliffs, NJ.
Box, G. E. P., & Draper, N. R. (1987). Empirical Model-Building and Response Surfaces.
Wiley.
Boyan, J. A., & Moore, A. W. (1995). Generalization in reinforcement learning: Safely
approximating the value function. In Tesauro, G., Touretzky, D. S., & Leen, T. K.
(Eds.), Advances in Neural Information Processing Systems 7 Cambridge, MA. The
MIT Press.
Burghes, D., & Graham, A. (1980). Introduction to Control Theory including Optimal
Control. Ellis Horwood.
Cassandra, A. R., Kaelbling, L. P., & Littman, M. L. (1994). Acting optimally in partially
observable stochastic domains. In Proceedings of the Twelfth National Conference on
Artificial Intelligence Seattle, WA.
Chapman, D., & Kaelbling, L. P. (1991). Input generalization in delayed reinforcement
learning: An algorithm and performance comparisons. In Proceedings of the Interna-
tional Joint Conference on Artificial Intelligence Sydney, Australia.
Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual
distinctions approach. In Proceedings of the Tenth National Conference on Artificial
Intelligence, pp. 183–188 San Jose, CA. AAAI Press.
Chrisman, L., & Littman, M. (1993). Hidden state and short-term memory. Presentation
at Reinforcement Learning Workshop, Machine Learning Conference.
Cichosz, P., & Mulawka, J. J. (1995). Fast and efficient reinforcement learning with trun-
cated temporal differences. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the
Twelfth International Conference on Machine Learning, pp. 99–107 San Francisco,
CA. Morgan Kaufmann.
Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: An approach to
regression analysis by local fitting. Journal of the American Statistical Association,
83 (403), 596–610.
Cliff, D., & Ross, S. (1994). Adding temporary memory to ZCS. Adaptive Behavior, 3 (2),
101–150.
Condon, A. (1992). The complexity of stochastic games. Information and Computation,
96 (2), 203–224.
Connell, J., & Mahadevan, S. (1993). Rapid task learning for real robots. In Robot Learning.
Kluwer Academic Publishers.
Crites, R. H., & Barto, A. G. (1996). Improving elevator performance using reinforcement
learning. In Touretzky, D., Mozer, M., & Hasselmo, M. (Eds.), Neural Information
Processing Systems 8.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8 (3), 341–
362.
Dayan, P., & Hinton, G. E. (1993). Feudal reinforcement learning. In Hanson, S. J., Cowan,
J. D., & Giles, C. L. (Eds.), Advances in Neural Information Processing Systems 5
San Mateo, CA. Morgan Kaufmann.
Dayan, P., & Sejnowski, T. J. (1994). TD(λ) converges with probability 1. Machine Learn-
ing, 14 (3).
Dean, T., Kaelbling, L. P., Kirman, J., & Nicholson, A. (1993). Planning with deadlines in
stochastic domains. In Proceedings of the Eleventh National Conference on Artificial
Intelligence Washington, DC.
D'Epenoux, F. (1963). A probabilistic production and inventory problem. Management
Science, 10, 98–108.
Derman, C. (1970). Finite State Markovian Decision Processes. Academic Press, New York.
Dorigo, M., & Bersini, H. (1994). A comparison of Q-learning and classifier systems. In
From Animals to Animats: Proceedings of the Third International Conference on the
Simulation of Adaptive Behavior Brighton, UK.
Dorigo, M., & Colombetti, M. (1994). Robot shaping: Developing autonomous agents
through learning. Artificial Intelligence, 71 (2), 321–370.
Dorigo, M. (1995). Alecsys and the AutonoMouse: Learning to control a real robot by
distributed classifier systems. Machine Learning, 19.
Fiechter, C.-N. (1994). Efficient reinforcement learning. In Proceedings of the Seventh
Annual ACM Conference on Computational Learning Theory, pp. 88–97. Association
of Computing Machinery.
Gittins, J. C. (1989). Multi-armed Bandit Allocation Indices. Wiley-Interscience series in
systems and optimization. Wiley, Chichester, NY.
Goldberg, D. (1989). Genetic algorithms in search, optimization, and machine learning.
Addison-Wesley, MA.
Gordon, G. J. (1995). Stable function approximation in dynamic programming. In Priedi-
tis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on
Machine Learning, pp. 261–268 San Francisco, CA. Morgan Kaufmann.
Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued
functions. Neural Networks, 3, 671–692.
Gullapalli, V. (1992). Reinforcement learning and its application to control. Ph.D. thesis,
University of Massachusetts, Amherst, MA.
Hilgard, E. R., & Bower, G. H. (1975). Theories of Learning (fourth edition). Prentice-Hall,
Englewood Cliffs, NJ.
Hoffman, A. J., & Karp, R. M. (1966). On nonterminating stochastic games. Management
Science, 12, 359–370.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan
Press, Ann Arbor, MI.
Howard, R. A. (1960). Dynamic Programming and Markov Processes. The MIT Press,
Cambridge, MA.
Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative
dynamic programming algorithms. Neural Computation, 6 (6).
Jaakkola, T., Singh, S. P., & Jordan, M. I. (1995). Monte-carlo reinforcement learning in
non-Markovian decision problems. In Tesauro, G., Touretzky, D. S., & Leen, T. K.
(Eds.), Advances in Neural Information Processing Systems 7 Cambridge, MA. The
MIT Press.
Kaelbling, L. P. (1993a). Hierarchical learning in stochastic domains: Preliminary results.
In Proceedings of the Tenth International Conference on Machine Learning Amherst,
MA. Morgan Kaufmann.
Kaelbling, L. P. (1993b). Learning in Embedded Systems. The MIT Press, Cambridge, MA.
Kaelbling, L. P. (1994a). Associative reinforcement learning: A generate and test algorithm.
Machine Learning, 15 (3).
Maes, P., & Brooks, R. A. (1990). Learning to coordinate behaviors. In Proceedings Eighth
National Conference on Artificial Intelligence, pp. 796–802. Morgan Kaufmann.
Mahadevan, S. (1994). To discount or not to discount in reinforcement learning: A case
study comparing R learning and Q learning. In Proceedings of the Eleventh Inter-
national Conference on Machine Learning, pp. 164–172 San Francisco, CA. Morgan
Kaufmann.
Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms,
and empirical results. Machine Learning, 22 (1).
Mahadevan, S., & Connell, J. (1991a). Automatic programming of behavior-based robots
using reinforcement learning. In Proceedings of the Ninth National Conference on
Artificial Intelligence Anaheim, CA.
Mahadevan, S., & Connell, J. (1991b). Scaling reinforcement learning to robotics by ex-
ploiting the subsumption architecture. In Proceedings of the Eighth International
Workshop on Machine Learning, pp. 328–332.
Mataric, M. J. (1994). Reward functions for accelerated learning. In Cohen, W. W., &
Hirsh, H. (Eds.), Proceedings of the Eleventh International Conference on Machine
Learning. Morgan Kaufmann.
McCallum, A. K. (1995). Reinforcement Learning with Selective Perception and Hidden
State. Ph.D. thesis, Department of Computer Science, University of Rochester.
McCallum, R. A. (1993). Overcoming incomplete perception with utile distinction memory.
In Proceedings of the Tenth International Conference on Machine Learning, pp. 190–
196 Amherst, Massachusetts. Morgan Kaufmann.
McCallum, R. A. (1995). Instance-based utile distinctions for reinforcement learning with
hidden state. In Proceedings of the Twelfth International Conference on Machine
Learning, pp. 387–395 San Francisco, CA. Morgan Kaufmann.
Meeden, L., McGraw, G., & Blank, D. (1993). Emergent control and planning in an au-
tonomous vehicle. In Touretsky, D. (Ed.), Proceedings of the Fifteenth Annual Meeting
of the Cognitive Science Society, pp. 735–740. Lawrence Erlbaum Associates, Hills-
dale, NJ.
Millan, J. d. R. (1996). Rapid, safe, and incremental learning of navigation strategies. IEEE
Transactions on Systems, Man, and Cybernetics, 26 (3).
Monahan, G. E. (1982). A survey of partially observable Markov decision processes: Theory,
models, and algorithms. Management Science, 28, 1–16.
Moore, A. W. (1991). Variable resolution dynamic programming: Efficiently learning ac-
tion maps in multivariate real-valued spaces. In Proc. Eighth International Machine
Learning Workshop.
Moore, A. W. (1994). The parti-game algorithm for variable resolution reinforcement learn-
ing in multidimensional state-spaces. In Cowan, J. D., Tesauro, G., & Alspector, J.
(Eds.), Advances in Neural Information Processing Systems 6, pp. 711–718 San Mateo,
CA. Morgan Kaufmann.
Moore, A. W., & Atkeson, C. G. (1992). An investigation of memory-based function ap-
proximators for learning control. Tech. rep., MIT Artificial Intelligence Laboratory,
Cambridge, MA.
Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with
less data and less real time. Machine Learning, 13.
Moore, A. W., Atkeson, C. G., & Schaal, S. (1995). Memory-based learning for control.
Tech. rep. CMU-RI-TR-95-18, CMU Robotics Institute.
Narendra, K., & Thathachar, M. A. L. (1989). Learning Automata: An Introduction.
Prentice-Hall, Englewood Cliffs, NJ.
Narendra, K. S., & Thathachar, M. A. L. (1974). Learning automata: A survey. IEEE
Transactions on Systems, Man, and Cybernetics, 4 (4), 323–334.
Peng, J., & Williams, R. J. (1993). Efficient learning and planning within the Dyna frame-
work. Adaptive Behavior, 1 (4), 437–454.
Peng, J., & Williams, R. J. (1994). Incremental multi-step Q-learning. In Proceedings of the
Eleventh International Conference on Machine Learning, pp. 226–232 San Francisco,
CA. Morgan Kaufmann.
Pomerleau, D. A. (1993). Neural network perception for mobile robot guidance. Kluwer
Academic Publishing.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Pro-
gramming. John Wiley & Sons, Inc., New York, NY.
Puterman, M. L., & Shin, M. C. (1978). Modified policy iteration algorithms for discounted
Markov decision processes. Management Science, 24, 1127–1137.
Ring, M. B. (1994). Continual Learning in Reinforcement Environments. Ph.D. thesis,
University of Texas at Austin, Austin, Texas.
Rude, U. (1993). Mathematical and computational techniques for multilevel adaptive meth-
ods. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania.
Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Parallel Distributed Processing:
Explorations in the microstructures of cognition. Volume 1: Foundations. The MIT
Press, Cambridge, MA.
Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems.
Tech. rep. CUED/F-INFENG/TR166, Cambridge University.
Thrun, S., & Schwartz, A. (1993). Issues in using function approximation for reinforcement
learning. In Mozer, M., Smolensky, P., Touretzky, D., Elman, J., & Weigend, A.
(Eds.), Proceedings of the 1993 Connectionist Models Summer School Hillsdale, NJ.
Lawrence Erlbaum.
Thrun, S. B. (1992). The role of exploration in learning control. In White, D. A., &
Sofge, D. A. (Eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive
Approaches. Van Nostrand Reinhold, New York, NY.
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine
Learning, 16 (3).
Tsitsiklis, J. N., & Van Roy, B. (1996). Feature-based methods for large scale dynamic
programming. Machine Learning, 22 (1).
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27 (11),
1134–1142.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, King's College,
Cambridge, UK.
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8 (3), 279–292.
Whitehead, S. D. (1991). Complexity and cooperation in Q-learning. In Proceedings of the
Eighth International Workshop on Machine Learning Evanston, IL. Morgan Kauf-
mann.
Williams, R. J. (1987). A class of gradient-estimating algorithms for reinforcement learning
in neural networks. In Proceedings of the IEEE First International Conference on
Neural Networks San Diego, CA.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine Learning, 8 (3), 229–256.
Williams, R. J., & Baird, III, L. C. (1993a). Analysis of some incremental variants of policy
iteration: First steps toward understanding actor-critic learning systems. Tech. rep.
NU-CCS-93-11, Northeastern University, College of Computer Science, Boston, MA.
Williams, R. J., & Baird, III, L. C. (1993b). Tight performance bounds on greedy policies
based on imperfect value functions. Tech. rep. NU-CCS-93-14, Northeastern Univer-
sity, College of Computer Science, Boston, MA.
Wilson, S. (1995). Classifier fitness based on accuracy. Evolutionary Computation, 3 (2),
147–173.
Zhang, W., & Dietterich, T. G. (1995). A reinforcement learning approach to job-shop
scheduling. In Proceedings of the International Joint Conference on Artificial Intel-
ligence.