
Introduction to Boosting

Cynthia Rudin
PACM, Princeton University

Advisors: Ingrid Daubechies and Robert Schapire
Say you have a database of news articles…

[Figure: a collection of news articles, each paired with a label, e.g. (article, +1), (article, +1), (article, −1), (article, −1), …]

where articles are labeled ‘+1’ if the category is “entertainment”, and ‘-1’ otherwise.

Your goal is: given a new article, find its label.

This is not easy: datasets are noisy and high-dimensional.


Examples of Statistical Learning Tasks:
• Optical Character Recognition (OCR) (post office, banks), object recognition in images
• Bioinformatics (analysis of gene array data for tumor detection, protein classification, etc.)
• Webpage classification (search engines), email filtering, document retrieval
• Semantic classification for speech, automatic .mp3 sorting
• Time-series prediction (regression)

Huge number of applications, but all have high-dimensional data.
Examples of classification algorithms:
• SVMs (Support Vector Machines – large margin classifiers)
• Neural Networks
• Decision Trees / Decision Stumps (CART)
• RBF Networks
• Nearest Neighbors
• Bayes Net

Which is the best?

It depends on the amount and type of data, and on the application!

For general applications, it’s a tie between SVMs and Boosted Decision Trees/Stumps.

One can always find a problem where a particular algorithm is the best. Boosted convolutional neural nets are the best for OCR (Yann LeCun et al.).
Training Data: $\{(x_i, y_i)\}_{i=1}^m$, where each $(x_i, y_i)$ is chosen i.i.d. from an unknown probability distribution on $X \times \{-1, 1\}$, where $X$ is the “space of all possible articles” and $\{-1, 1\}$ is the set of “labels”.

Huge Question: Given a new random example x, can we predict its correct label with high probability? That is, can we generalize from our training data?

[Figure: training points in X marked + and −, with a new unlabeled point marked “?”]

Yes!!! That’s what the field of statistical learning is all about.

The goal of statistical learning is to characterize points from an unknown probability distribution when given a representative sample from that distribution.
How do we construct a classifier?
• Divide the space X into two sections, based on the sign of a function f : X → R.
• The decision boundary is the zero-level set of f, i.e. the set where f(x) = 0.

[Figure: the curve f(x) = 0 splits X into a “+” region and a “−” region; the training points and the new point “?” fall on either side]

Classifiers divide the space into two pieces for binary classification. Multiclass classification can always be reduced to binary.
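To make this concrete, here is a tiny Python sketch (my own illustration, with a made-up linear score function f) of a classifier defined by the sign of f:

# A classifier built from a real-valued function f; the decision
# boundary is the set of points where f(x) = 0.

def f(x):
    # Toy linear score function on 2-D points x = (x1, x2).
    return 2.0 * x[0] - x[1] + 0.5

def classify(x):
    # Label +1 on one side of the boundary f(x) = 0, and -1 on the other.
    return 1 if f(x) >= 0 else -1

print(classify((1.0, 0.0)))   # +1, since f = 2.5 > 0
print(classify((-1.0, 1.0)))  # -1, since f = -2.5 < 0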
Overview of Talk

• The Statistical Learning Problem (done)

• Introduction to Boosting and AdaBoost

• AdaBoost as Coordinate Descent

• The Margin Theory and Generalization


Say we have a “weak” learning algorithm:
• A weak learning algorithm produces weak classifiers.
• (Think of a weak classifier as a “rule of thumb”)

Examples of weak classifiers for the “entertainment” application:

h1(article) = +1 if the article contains the term “movie”, −1 otherwise
h2(article) = +1 if the article contains the term “actor”, −1 otherwise
h3(article) = +1 if the article contains the term “drama”, −1 otherwise
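As an illustration (my own sketch, not code from the talk), these rules of thumb could be written as keyword checks in Python:

# Keyword-based "rules of thumb": each weak classifier checks for one term.
def make_keyword_classifier(term):
    def h(article):
        # +1 if the article contains the term, -1 otherwise.
        return 1 if term in article.lower() else -1
    return h

h1 = make_keyword_classifier("movie")
h2 = make_keyword_classifier("actor")
h3 = make_keyword_classifier("drama")

article = "A new movie adapts the classic drama for the screen."
print(h1(article), h2(article), h3(article))   # prints: 1 -1 1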

Wouldn’t it be nice to combine the weak classifiers?


Boosting algorithms combine weak
classifiers in a meaningful way.
Example:
f(article) = sign[ .4 h1(article) + .3 h2(article) + .3 h3(article) ]

So if the article contains the term “movie” and the word “drama”, but not the word “actor”, the value of f is sign[.4 − .3 + .3] = 1, so we label it +1.

A boosting algorithm takes as input:
- the weak learning algorithm which produces the weak classifiers
- a large training database

and outputs:
- the coefficients of the weak classifiers to make the combined classifier
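A quick sanity check of that arithmetic in Python (the ±1 votes below are simply the outputs of h1, h2, h3 for such an article):

# Weighted vote of the three weak classifiers for an article containing
# "movie" (+1) and "drama" (+1) but not "actor" (-1).
weights = [0.4, 0.3, 0.3]
votes   = [+1, -1, +1]          # outputs of h1, h2, h3

score = sum(w * v for w, v in zip(weights, votes))   # 0.4 - 0.3 + 0.3 = 0.4
label = 1 if score >= 0 else -1
print(score, label)             # prints: 0.4 1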
Two ways to use a Boosting Algorithm:

• As a way to increase the performance of already “strong” classifiers.
  • Ex. neural networks, decision trees
• “On their own” with a really basic weak classifier.
  • Ex. decision stumps

AdaBoost
(Freund and Schapire ’95)

- Start with a uniform distribution (“weights”) over training examples.
  (The weights tell the weak learning algorithm which examples are important.)

- Request a weak classifier from the weak learning algorithm, $h_j : X \to \{-1, 1\}$.

- Increase the weights on the training examples that were misclassified.

- (Repeat)

At the end, make (carefully!) a linear combination of the weak classifiers obtained at all iterations:

$f_{final}(x) = \mathrm{sign}(\lambda_1 h_1(x) + \dots + \lambda_n h_n(x))$
AdaBoost
Define three important things:

$d_t \in \mathbb{R}^m$ := distribution (“weights”) over examples at time t
   e.g., $d_t = [.25,\ .3,\ .2,\ .25]$ over examples 1, 2, 3, 4

$\lambda_t \in \mathbb{R}^n$ := coefficients of the weak classifiers for the linear combination
   $f_t(x) = \mathrm{sign}(\lambda_{t,1} h_1(x) + \dots + \lambda_{t,n} h_n(x))$

$M \in \mathbb{R}^{m \times n}$ := matrix of hypotheses and data

Enumerate every possible weak classifier which can be produced by the weak learning algorithm: column j of M corresponds to weak classifier $h_j$ (e.g. “movie”, “actor”, “drama”), and row i corresponds to data point $x_i$, so M has one row per data point. The entries are

$M_{ij} := h_j(x_i)\, y_i$,  which is $+1$ if weak classifier $h_j$ classifies point $x_i$ correctly, and $-1$ otherwise.

The matrix M has too many columns to actually be enumerated. M acts as the only input to AdaBoost:

[Diagram: M → AdaBoost → $\lambda_{final}$, with $d_t$ and $\lambda_t$ maintained internally during the run]
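A small Python/NumPy sketch of how M could be built when the weak classifiers can be enumerated explicitly (an assumption that, as noted above, fails in general because M has far too many columns):

import numpy as np

def build_M(X, y, weak_classifiers):
    # M[i, j] = h_j(x_i) * y_i: +1 if weak classifier j is correct on example i, else -1.
    m, n = len(X), len(weak_classifiers)
    M = np.empty((m, n))
    for j, h in enumerate(weak_classifiers):
        for i, (x_i, y_i) in enumerate(zip(X, y)):
            M[i, j] = h(x_i) * y_i
    return M

# Example (using the keyword classifiers sketched earlier):
# M = build_M(articles, labels, [h1, h2, h3])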
AdaBoost (Freund and Schapire ’95)

$\lambda_1 = 0$   (initialize the coefficients to 0)
for $t = 1, \dots, T_{final}$:
   $d_{t,i} = e^{-(M\lambda_t)_i} \Big/ \sum_{i'=1}^m e^{-(M\lambda_t)_{i'}}$ for all $i$   (calculate the normalized distribution)
   $j_t \in \arg\max_j (d_t^T M)_j$   (request a weak classifier from the weak learning algorithm)
   $r_t = (d_t^T M)_{j_t}$
   $\alpha_t = \frac{1}{2} \ln\!\left(\frac{1+r_t}{1-r_t}\right)$
   $\lambda_{t+1} = \lambda_t + \alpha_t e_{j_t}$   (update the linear combination of weak classifiers)
end for
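The loop above translates almost line for line into NumPy. The sketch below is my own transcription; the clipping of r_t is an added guard against division by zero when a weak classifier is perfect on the weighted data:

import numpy as np

def adaboost(M, T):
    """Run the iteration above, where M[i, j] = h_j(x_i) * y_i."""
    m, n = M.shape
    lam = np.zeros(n)                       # lambda_1 = 0
    for t in range(T):
        scores = M @ lam                    # (M lambda_t)_i for every example
        d = np.exp(-scores)
        d /= d.sum()                        # normalized distribution d_t
        edges = d @ M                       # (d_t^T M)_j for every weak classifier
        j_t = int(np.argmax(edges))         # direction with the largest edge
        r_t = np.clip(edges[j_t], -1 + 1e-12, 1 - 1e-12)
        alpha_t = 0.5 * np.log((1 + r_t) / (1 - r_t))
        lam[j_t] += alpha_t                 # lambda_{t+1} = lambda_t + alpha_t e_{j_t}
    return lam

# Final classifier: f(x) = sign( sum_j lam[j] * h_j(x) ).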
In this loop, $r_t = (d_t^T M)_{j_t}$ is the “edge” or “correlation” of weak classifier $j_t$:

$(d_t^T M)_{j_t} = \sum_{i=1}^m d_{t,i}\, y_i\, h_{j_t}(x_i) = E_{i \sim d_t}[\, y_i\, h_{j_t}(x_i)\,]$
AdaBoost as Coordinate Descent

Breiman, Mason et al., Duffy and Helmbold, etc. noticed that AdaBoost is a coordinate descent algorithm.

• Coordinate descent is a minimization algorithm like gradient descent, except that we only move along coordinates.
• We cannot calculate the gradient because of the high dimensionality of the space!
• “coordinates” = weak classifiers
  “distance to move in that direction” = the update $\alpha_t$
AdaBoost minimizes the following function via coordinate descent:

$F(\lambda) := \sum_{i=1}^m e^{-(M\lambda)_i}$

Choose a direction:   $j_t \in \arg\max_j (d_t^T M)_j$

Choose a distance to move in that direction:

$r_t = (d_t^T M)_{j_t}$,   $\alpha_t = \frac{1}{2} \ln\!\left(\frac{1+r_t}{1-r_t}\right)$,   $\lambda_{t+1} = \lambda_t + \alpha_t e_{j_t}$
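For completeness, a short calculation (standard, though not spelled out on the slide) showing that this $\alpha_t$ is exactly the minimizer of $F$ along coordinate $j_t$:

\[
F(\lambda_t + \alpha e_{j_t}) \;=\; \sum_{i=1}^m e^{-(M\lambda_t)_i - \alpha M_{i j_t}}
\;=\; e^{-\alpha} \!\!\sum_{i:\, M_{i j_t}=+1}\!\! e^{-(M\lambda_t)_i} \;+\; e^{\alpha} \!\!\sum_{i:\, M_{i j_t}=-1}\!\! e^{-(M\lambda_t)_i}.
\]

Since $d_{t,i} \propto e^{-(M\lambda_t)_i}$ and $r_t = (d_t^T M)_{j_t}$, the examples classified correctly by $h_{j_t}$ carry total weight $(1+r_t)/2$ under $d_t$ and the misclassified ones carry $(1-r_t)/2$. Setting $\partial F / \partial \alpha = 0$ therefore gives $e^{2\alpha} = (1+r_t)/(1-r_t)$, i.e. $\alpha_t = \frac{1}{2}\ln\!\left(\frac{1+r_t}{1-r_t}\right)$.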
The function $F(\lambda) := \sum_{i=1}^m e^{-(M\lambda)_i}$ is convex:

1) If the data is non-separable by the weak classifiers, the minimizer of F occurs when the size of $\lambda$ is finite.
   (This case is ok. AdaBoost converges to something we understand.)

2) If the data is separable, the infimum of F is 0, approached only as the size of $\lambda$ grows without bound.
   (This case is confusing!)

The original paper suggested that AdaBoost would probably overfit…

But it didn’t in practice!

Why not?

The margin theory!


Boosting and Margins

• We want the boosted classifier (defined via λ) to generalize well, i.e., we want it to perform well on data that is not in the training set.

• The margin theory: the margin of a boosted classifier indicates whether it will generalize well. (Schapire et al. ’98)

• Large margin classifiers work well in practice (but there’s more to this story).

Think of the margin as the confidence of a prediction.


Generalization Ability of Boosted Classifiers

Can we guess whether a boosted classifier f generalizes well?
• We cannot calculate $\Pr_{error}(f)$ directly.
• Instead, we minimize the right-hand side of a (loose) inequality such as this one (Schapire et al.): when there are no training errors, with probability at least $1-\delta$,

$\Pr_{error}(f) \;\le\; O\!\left( \left[ \frac{1}{m} \left( \frac{d \log^2(m/d)}{(\mu(f))^2} + \log\frac{1}{\delta} \right) \right]^{1/2} \right)$

where $\Pr_{error}(f)$ is the probability that classifier f makes an error on a random point $x \in X$, m is the number of training examples, $\mu(f)$ is the margin of f, and d is the VC dimension of the hypothesis space ($d \le m$).
The margin theory: when there are no training errors, with high probability (Schapire et al. ’98),

$\Pr_{error}(f) \;\le\; \tilde{O}\!\left( \frac{\sqrt{d/m}}{\mu(f)} \right)$

where, as above, d is the VC dimension of the hypothesis space ($d \le m$), m is the number of training examples, and $\mu(f)$ is the margin of f.

Large margin = better generalization = smaller probability of error


For Boosting, the margin of the combined classifier $f_\lambda$ (where $f_\lambda := \mathrm{sign}(\lambda_1 h_1 + \dots + \lambda_n h_n)$) is defined by

$\text{margin} := \mu(f_\lambda) := \min_i \frac{(M\lambda)_i}{\|\lambda\|_1}$.
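A one-line NumPy helper (my own, for illustration) that computes this margin from M and λ:

import numpy as np

def margin(M, lam):
    # mu(f_lambda) = min_i (M lambda)_i / ||lambda||_1
    return np.min(M @ lam) / np.sum(np.abs(lam))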
Does AdaBoost produce maximum margin classifiers?

(AdaBoost was invented before the margin theory…)

(Grove and Schuurmans ’98)
- yes, empirically.

(Schapire et al. ’98)
- proved AdaBoost achieves at least half the maximum possible margin.

(Rätsch and Warmuth ’03)
- yes, empirically.
- improved the bound.

(R, Daubechies, Schapire ’04)
- no, it doesn’t.
AdaBoost performs mysteriously well!

AdaBoost performs better than algorithms which are designed to maximize the margin.
Still open:
• Why does AdaBoost work so well?
• Does AdaBoost converge?
• Better / more predictable boosting algorithms!
