Introduction To Boosting
Cynthia Rudin
PACM, Princeton University
Advisors: Ingrid Daubechies and Robert Schapire
Say you have a database of news articles…
[Slide figure: example news articles, each shown with a label of +1 or -1.]
One can always find a problem where a particular algorithm is the best. Boosted convolutional neural nets are the best for OCR (Yann LeCun et al.).
Training data: $\{(x_i, y_i)\}_{i=1}^{m}$, where each $(x_i, y_i)$ is chosen iid from an
unknown probability distribution on $X \times \{-1, +1\}$.
Here $X$ is the "space of all possible articles" and $\{-1, +1\}$ are the "labels".
[Slide figure: the space X containing training examples marked + and -, plus a new unlabeled point marked "?".]
Huge Question: Given a new random example x, can we predict
its correct label with high probability? That is, can we generalize
from our training data?
Classifiers divide the space into two pieces for binary classification. Multiclass classification can always be reduced to binary.
Overview of Talk
So if the article contains the term "movie" and the word "drama", but not the word "actor",
the value of f is sign[.4 - .3 + .3] = 1, so we label it +1.

A boosting algorithm takes as input:
- the weak learning algorithm, which produces the weak classifiers
- a large training database
and outputs:
- the final classifier: a linear combination of the weak classifiers
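To make the weighted vote above concrete, here is a minimal Python sketch (not from the slides); the word-presence rules and the weights 0.4, 0.3, 0.3 are illustrative stand-ins for weak classifiers and coefficients like those in the example.

```python
import numpy as np

# Hypothetical word-presence weak classifiers: h_j(article) = +1 if the word
# appears, -1 otherwise.  Weights are illustrative; for an article containing
# "movie" and "drama" but not "actor" this evaluates to sign[.4 + .3 - .3] = 1.
def make_word_rule(word):
    return lambda article: 1 if word in article else -1

weak_classifiers = [make_word_rule("movie"),
                    make_word_rule("drama"),
                    make_word_rule("actor")]
lam = np.array([0.4, 0.3, 0.3])   # coefficients of the linear combination

def f(article):
    # Final classifier: sign of the weighted vote of the weak classifiers.
    votes = np.array([h(article) for h in weak_classifiers])
    return int(np.sign(lam @ votes))

print(f("a movie billed as a courtroom drama"))   # -> 1, i.e., label +1
```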
AdaBoost
Define three important things:

1) d_t: a distribution (weights) over the m training examples at iteration t,
   e.g., d_t = [ .25  .3  .2  .25 ] over training examples 1, 2, 3, 4.

2) M: the matrix recording how each weak classifier does on each training example,
   with entries M_ij = y_i h_j(x_i) (row i = training example, column j = weak classifier).
   The matrix M has too many columns to actually be enumerated. M acts as the only
   input to AdaBoost.

3) λ_t: the vector of coefficients of the weak classifiers at iteration t.

[Slide diagram: M → AdaBoost → λ_final; internally, AdaBoost maintains d_t and λ_t.]
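As a shape check, here is a tiny numpy sketch (hypothetical data, not from the slides) of the objects above, under the convention M_ij = y_i h_j(x_i); in practice M is never formed explicitly.

```python
import numpy as np

# Tiny hypothetical example: 3 training examples, 3 weak classifiers.
H = np.array([[ 1,  1, -1],    # H[i, j] = h_j(x_i)
              [ 1, -1,  1],
              [ 1, -1, -1]])
y = np.array([1, 1, -1])       # labels y_i
M = y[:, None] * H             # M_ij = y_i * h_j(x_i), entries in {-1, +1}

lam = np.zeros(M.shape[1])     # lambda_1 = 0: coefficients of the weak classifiers
d = np.exp(-(M @ lam))
d /= d.sum()                   # normalized distribution d_1 over training examples
print(M)
print(d)                       # at lambda = 0 the distribution is uniform: [1/3 1/3 1/3]
```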
AdaBoost (Freund and Schapire '95)

$\lambda_1 = 0$                                                     Initialize coeffs to 0
for $t = 1, \ldots, T_{\mathrm{final}}$:
    $d_{t,i} = \dfrac{e^{-(M\lambda_t)_i}}{\sum_{i'=1}^{m} e^{-(M\lambda_t)_{i'}}}$ for all $i$        Calculate (normalized) distribution
    $j_t \in \arg\max_j \, (d_t^T M)_j$                             Choose the weak classifier with the largest edge
    $r_t = (d_t^T M)_{j_t}$                                         Edge of the chosen weak classifier
    $\alpha_t = \frac{1}{2}\ln\!\left(\frac{1+r_t}{1-r_t}\right)$   Weight for the chosen weak classifier
    $\lambda_{t+1} = \lambda_t + \alpha_t e_{j_t}$                  Update linear combo of weak classifiers
end for
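A compact numpy sketch of this loop (illustrative only, assuming the matrix M is given explicitly; in practice the weak learner supplies the best column j_t without enumerating M):

```python
import numpy as np

def adaboost(M, T_final):
    """Matrix-form AdaBoost sketch.  M[i, j] = y_i * h_j(x_i)."""
    m, n = M.shape
    lam = np.zeros(n)                       # lambda_1 = 0
    for t in range(T_final):
        d = np.exp(-(M @ lam))
        d /= d.sum()                        # d_t: normalized distribution over examples
        edges = d @ M                       # (d_t^T M)_j: edge of each weak classifier
        j = int(np.argmax(edges))           # j_t: weak classifier with the largest edge
        r = edges[j]                        # r_t
        if abs(r) >= 1:                     # a perfect weak classifier; step size would be infinite
            break
        lam[j] += 0.5 * np.log((1 + r) / (1 - r))   # lambda_{t+1} = lambda_t + alpha_t e_{j_t}
    return lam

# Hypothetical 3-example, 3-weak-classifier data (same M as the earlier sketch).
M = np.array([[ 1,  1, -1],
              [ 1, -1,  1],
              [-1,  1,  1]])
lam_final = adaboost(M, T_final=20)
print(np.sign(M @ lam_final))   # all +1: every training example is classified correctly
```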
Each AdaBoost iteration is a coordinate descent step on the exponential loss
$$F(\lambda) := \sum_{i=1}^{m} e^{-(M\lambda)_i}.$$
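A quick numerical check (illustrative, using the same hypothetical 3x3 M) that each update lowers F(λ):

```python
import numpy as np

# Illustrative check that each AdaBoost step decreases F(lambda) = sum_i exp(-(M lambda)_i).
M = np.array([[ 1,  1, -1],
              [ 1, -1,  1],
              [-1,  1,  1]])          # hypothetical M_ij = y_i * h_j(x_i)
F = lambda lam: np.exp(-(M @ lam)).sum()

lam = np.zeros(M.shape[1])
print(f"t=0  F = {F(lam):.4f}")       # F = 3.0 at lambda = 0
for t in range(1, 6):
    d = np.exp(-(M @ lam)); d /= d.sum()
    j = int(np.argmax(d @ M))
    r = (d @ M)[j]
    lam[j] += 0.5 * np.log((1 + r) / (1 - r))
    print(f"t={t}  F = {F(lam):.4f}")  # strictly decreasing as long as r_t > 0
```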
Why not minimize the rhs of a (loose) inequality such as this one instead? (Schapire et al.)
When there are no training errors, with probability at least $1-\delta$,
$$\Pr_{\mathrm{error}}(f) \;\le\; O\!\left(\left(\frac{1}{m}\left(\frac{d\,\log^2(m/d)}{(\mu(f))^2} + \log\tfrac{1}{\delta}\right)\right)^{1/2}\right),$$
where $\Pr_{\mathrm{error}}(f)$ is the probability that classifier $f$ makes an error on a random
$x \in X$, $m$ is the number of training examples, $\mu(f)$ is the margin of $f$, and $d$ is the
VC dimension of the hypothesis space ($d \le m$).
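For concreteness, a small sketch (assuming the usual convention that μ(f) is the minimum normalized margin over the training set, min_i (Mλ)_i / ||λ||_1) of how one would measure the margin appearing in these bounds:

```python
import numpy as np

# Sketch of computing the margin mu(f) of a combined classifier, assuming
# mu(f) = min_i (M lambda)_i / ||lambda||_1.  M and lambda are hypothetical.
M = np.array([[ 1,  1, -1],
              [ 1, -1,  1],
              [-1,  1,  1]])
lam = np.array([0.3466, 0.5493, 0.8047])      # e.g., coefficients after a few AdaBoost steps

margins = (M @ lam) / np.abs(lam).sum()        # per-example normalized margins in [-1, 1]
mu = margins.min()                             # margin of f: larger is better in the bound
print(mu)                                      # ~0.054 for these numbers
```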
The margin theory (Schapire et al., '98): when there are no training errors, with high probability,
$$\Pr_{\mathrm{error}}(f) \;\le\; \tilde{O}\!\left(\frac{\sqrt{d/m}}{\mu(f)}\right),$$
where $\Pr_{\mathrm{error}}(f)$ is the probability that classifier $f$ makes an error on a random
$x \in X$, $m$ is the number of training examples, $\mu(f)$ is the margin of $f$, and $d$ is the
VC dimension of the hypothesis space ($d \le m$).
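To connect the two bounds (my gloss, not from the slides): dropping the $\log(1/\delta)$ term and hiding the log factor in $\tilde{O}$ turns the detailed bound into the simplified form above,
$$\left(\frac{1}{m}\cdot\frac{d\,\log^2(m/d)}{(\mu(f))^2}\right)^{1/2} \;=\; \frac{\sqrt{d/m}\,\log(m/d)}{\mu(f)} \;=\; \tilde{O}\!\left(\frac{\sqrt{d/m}}{\mu(f)}\right).$$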