Boosting
Toy Example
• Computer receives telephone call
• Measures pitch of voice
• Decides gender of caller

Generative modeling
[Figure: probability vs. voice pitch, modeled by two Gaussians: male human voice (mean1, var1) and female human voice (mean2, var2).]
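A minimal sketch of this generative approach, assuming one Gaussian per class and equal class priors; the pitch values and parameter names below are hypothetical, not from the slides:

```python
# Illustrative sketch (not from the slides): a generative classifier for the
# toy example, with pitch modeled by one Gaussian per class.
import math

def fit_gaussian(samples):
    """Estimate mean and variance of one class from its pitch samples."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(pitch, male_params, female_params):
    """Pick the class whose fitted density assigns the pitch higher probability."""
    if gaussian_pdf(pitch, *male_params) >= gaussian_pdf(pitch, *female_params):
        return "male"
    return "female"

# Hypothetical pitch measurements (Hz), just to make the sketch runnable.
male_params = fit_gaussian([110, 120, 130, 125, 115])
female_params = fit_gaussian([200, 210, 220, 215, 205])
print(classify(150, male_params, female_params))
```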
Discriminative approach
[Figure: number of mistakes as a function of the voice-pitch decision threshold.]

Ill-behaved data
[Figure: probability vs. voice pitch for data where the Gaussian fit (mean1, mean2) is misleading, alongside the number of mistakes vs. voice pitch.]
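A minimal sketch of the discriminative alternative, assuming a single pitch threshold chosen to minimize the number of mistakes directly; the data below is hypothetical:

```python
# Illustrative sketch (not from the slides): the discriminative approach picks a
# pitch threshold directly, by minimizing the number of training mistakes.
def best_threshold(pitches, labels):
    """labels are +1/-1; returns (threshold, sign) of the rule with fewest mistakes."""
    best = None
    for thr in sorted(set(pitches)):
        for sign in (+1, -1):
            mistakes = sum(
                1 for x, y in zip(pitches, labels)
                if (sign if x > thr else -sign) != y
            )
            if best is None or mistakes < best[0]:
                best = (mistakes, thr, sign)
    return best[1], best[2]

# Hypothetical data: label +1 = female voice, -1 = male voice.
pitches = [110, 120, 130, 180, 200, 210, 220]
labels  = [-1, -1, -1, +1, +1, +1, +1]
print(best_threshold(pitches, labels))
```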
Traditional Statistics vs. Machine Learning
Statistics: Data → Estimated world state → Decision Theory → Actions
Machine Learning: Data → Predictions / Actions

Comparison of methodologies
                        Statistics               Machine Learning
Model                   Generative               Discriminative
Goal                    Probability estimates    Classification rule
Performance measure     Likelihood               Misclassification rate
Mismatch problems       Outliers                 Misclassifications
[Figure: the boosting process. Training instances x1,x2,x3,…,xn (feature vectors) with binary labels y1,y2,y3,…,yn. Each example carries a non-negative weight, and the weights sum to 1. The weighted sample (x1,y1,w1), …, (xn,yn,wn) is passed to the weak learner round after round, with the weights updated between rounds, producing weak rules h1, h2, h3, …, hT.]
The weak requirement: each weak rule must beat random guessing on the weighted sample.
Final rule: Sign[ α1 h1 + α2 h2 + … + αT hT ]
Note that the weak learner MUST do better than random guessing (error strictly below 50% on binary classification)!
Adaboost
• Binary labels y = −1, +1
• margin(x,y) = y [Σt αt ht(x)]
• P(x,y) = (1/Z) exp(−margin(x,y))
• Given ht, we choose αt to minimize Σ(x,y) exp(−margin(x,y))
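A minimal Adaboost sketch along these lines; the weak-learner interface is assumed, and the choice αt = ½ ln((1−εt)/εt), where εt is the weighted error of ht, is the standard closed-form minimizer of the exponential loss above:

```python
# Illustrative Adaboost sketch (not from the slides). Assumes weak_learner(X, y, w)
# returns a classifier h with h(x) in {-1, +1} and weighted error below 1/2.
import math

def adaboost(X, y, weak_learner, T):
    n = len(X)
    w = [1.0 / n] * n                              # non-negative weights, sum to 1
    rules, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)
        eps = sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)
        eps = min(max(eps, 1e-12), 1 - 1e-12)      # guard against division by zero
        alpha = 0.5 * math.log((1 - eps) / eps)    # minimizes the exponential loss
        rules.append(h)
        alphas.append(alpha)
        # Re-weight: w_i proportional to exp(-margin(x_i, y_i)), then renormalize.
        w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    def final_rule(x):
        return 1 if sum(a * h(x) for a, h in zip(alphas, rules)) >= 0 else -1
    return final_rule
```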
Main property of Adaboost
• If the advantages of the weak rules over random guessing are γ1, γ2, …, γT, then the in-sample error of the final rule (w.r.t. the initial weights) is at most exp(−2 Σt γt²).

Adaboost as gradient descent
• Discriminator class: a linear discriminator in the space of “weak hypotheses”
• Original goal: find the hyperplane with the smallest number of mistakes
  – Known to be an NP-hard problem (no algorithm that runs in time polynomial in d, where d is the dimension of the space)
• Computational method: use the exponential loss as a surrogate and perform gradient descent.
Margins view
x, w ∈ Rⁿ ; y ∈ {−1, +1}
Prediction = sign(w·x)
Margin = y(w·x) / (‖w‖·‖x‖)
[Figure: positive and negative examples in feature space with the weight vector w; projecting onto w separates mistakes (negative margin) from correct predictions (positive margin). Cumulative # of examples plotted against the margin.]

Adaboost et al.
Loss as a function of the margin y(w·x):
Adaboost: e^(−y(w·x))
[Figure: the Adaboost, Logitboost and Brownboost losses and the 0-1 loss plotted against the margin; margin < 0 corresponds to mistakes, margin > 0 to correct predictions.]
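For reference, the margin-based losses usually compared in this kind of plot, written as functions of the margin m = y(w·x); Brownboost's loss is omitted since it also depends on the algorithm's remaining-time parameter, and Logitboost is based on the logistic loss (up to a rescaling of the margin):

```latex
% Margin-based losses as functions of m = y(w . x):
% 0-1 loss, exponential (Adaboost) loss, and logistic loss.
\[
\ell_{0\text{-}1}(m) = \mathbf{1}[\,m \le 0\,], \qquad
\ell_{\exp}(m) = e^{-m}, \qquad
\ell_{\log}(m) = \ln\!\bigl(1 + e^{-m}\bigr).
\]
```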
One coordinate at a time
• Adaboost performs gradient descent on the exponential loss.
• Adds one coordinate (“weak learner”) at each iteration.
• Weak learning in binary classification = slightly better than random guessing.
• Weak learning in regression – unclear.
• Uses example weights to communicate the gradient direction to the weak learner (see the note after the next list).
• Solves a computational problem.

What is a good weak learner?
• The set of weak rules (features) should be flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
• Small enough to allow exhaustive search for the minimal weighted training error.
• Small enough to avoid over-fitting.
• Should be able to calculate the predicted label very efficiently.
• Rules can be “specialists” – predict only on a small subset of the input space and abstain from predicting on the rest (output 0).
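A short note on the gradient-direction bullet above: writing the combined rule so far as F(x) = Σs αs hs(x), the derivative of the exponential loss in the direction of a candidate weak rule h is

```latex
% Gradient of the exponential loss with respect to the coefficient of a
% candidate weak rule h, evaluated at alpha = 0.
\[
\left.\frac{\partial}{\partial \alpha}
\sum_{i} e^{-y_i\,(F(x_i) + \alpha\,h(x_i))}\right|_{\alpha = 0}
= -\sum_{i} y_i\, h(x_i)\, e^{-y_i F(x_i)}
\;\propto\; -\sum_{i} w_i\, y_i\, h(x_i),
\]
```

so handing the weak learner the weights wi ∝ e^(−yi F(xi)) and asking it to maximize the weighted correlation Σi wi yi h(xi) (equivalently, minimize the weighted error) is exactly a gradient-descent step along one coordinate.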
Decision Trees
[Figure: a decision tree over features X and Y. Root split X>3: if no, predict −1; if yes, split Y>5: if no, predict −1; if yes, predict +1. The corresponding axis-parallel partition of the (X,Y) plane is shown alongside.]

Decision tree as a sum
[Figure: the same tree rewritten as a sum of base rules, with real-valued scores (−0.2, +0.2, −0.1, +0.1, −0.3) attached to the root and to the branches of the X>3 and Y>5 splits; the prediction is the sign of the summed scores in each region.]
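A sketch of the “tree as a sum” reading, using hypothetical scores in the spirit of the figure (the exact placement of the scores on the slide is not recoverable here):

```python
# Illustrative sketch: the same two-split decision tree evaluated (a) as a
# conventional tree and (b) as the sign of a sum of per-node scores.
# The scores below are hypothetical, chosen so both forms agree.

def tree_predict(x, y):
    """Conventional reading: follow one root-to-leaf path."""
    if x > 3:
        return +1 if y > 5 else -1
    return -1

def sum_predict(x, y):
    """Sum reading: add a score at the root and at each split reached."""
    score = -0.2                                  # root score (hypothetical)
    score += +0.2 if x > 3 else -0.1              # contribution of the X>3 split
    if x > 3:                                     # Y>5 split only exists under X>3
        score += +0.1 if y > 5 else -0.3
    return +1 if score >= 0 else -1

for x, y in [(1, 1), (4, 2), (4, 7)]:
    assert tree_predict(x, y) == sum_predict(x, y)
print("tree and sum forms agree on the sample points")
```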
An alternating decision tree
[Figure: an alternating decision tree over X and Y, with prediction nodes carrying scores such as +0.7, +0.2 and −0.3; the classification is the sign of the sum of the scores reached (here +1).]

Example: Medical Diagnostics
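A minimal sketch of how an alternating decision tree is evaluated, using a hypothetical node layout (the actual tree in the figure is not recoverable from the text): every prediction node the instance reaches contributes its score, and the classification is the sign of the total.

```python
# Illustrative ADTree evaluation sketch with a hypothetical structure:
# each rule is (precondition, condition, score_if_true, score_if_false),
# and every rule whose precondition holds contributes one of its two scores.

ROOT_SCORE = 0.5  # hypothetical base score

RULES = [
    # (precondition,          condition,            score_true, score_false)
    (lambda r: True,          lambda r: r["X"] > 3,  +0.7,       -0.3),
    (lambda r: r["X"] > 3,    lambda r: r["Y"] > 5,  +0.2,       -0.4),
]

def adtree_score(instance):
    score = ROOT_SCORE
    for precond, cond, s_true, s_false in RULES:
        if precond(instance):                      # only reached rules contribute
            score += s_true if cond(instance) else s_false
    return score

def adtree_predict(instance):
    return +1 if adtree_score(instance) >= 0 else -1

print(adtree_predict({"X": 4, "Y": 7}))   # score 0.5 + 0.7 + 0.2 = 1.4 -> +1
```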
Adtree for the Cleveland heart-disease diagnostics problem

Cross-validated accuracy
[Table: learning algorithm | number of splits | average test error | test error variance.]
Curious phenomenon
Boosting decision trees
Explanation using margins
Experimental Evidence

Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998)
Suggested optimization problem
[Figure: optimization problem involving the margin and the quantities d and m.]

Idea of Proof
Applications of Boosting
• Academic research
• Applied research

Academic research
% test error rates
[Table: database | other | boosting | error reduction.]
Schapire, Singer & Gorin 1998
Weak rules generated by “boostexter”
[Table: weak rules, one per row, with a score for each category (Calling card, Collect call, Third party) depending on whether the rule's word occurs or does not occur in the utterance.]

Results
• 7844 training examples
  – hand transcribed
• 1000 test examples
  – hand / machine transcribed
• Accuracy with 20% rejected
  – Machine transcribed: 75%
  – Hand transcribed: 90%
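A toy sketch of a boostexter-style weak rule for a single category; real boostexter rules output a score vector over all categories, and the word and scores here are hypothetical:

```python
# Illustrative sketch of a boostexter-style weak rule: a single word test that
# outputs one score when the word occurs and another when it does not.

class WordRule:
    def __init__(self, word, score_if_present, score_if_absent):
        self.word = word
        self.score_if_present = score_if_present
        self.score_if_absent = score_if_absent

    def __call__(self, utterance_words):
        """utterance_words: set of words in the (transcribed) utterance."""
        present = self.word in utterance_words
        return self.score_if_present if present else self.score_if_absent

collect_rule = WordRule("collect", score_if_present=+1.2, score_if_absent=-0.4)
print(collect_rule({"i", "want", "to", "make", "a", "collect", "call"}))  # 1.2
```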
Freund, Mason, Rogers, Pregibon & Cortes 2000
Alternating tree for “buizocity”

Alternating Tree (Detail)
Precision/recall graphs
[Figure: accuracy vs. score.]

Business impact
• Increased coverage from 44% to 56%
• Accuracy ~94%
Summary
• Boosting is a computational method for learning accurate classifiers
• Resistance to over-fitting is explained by margins
• Underlying explanation – large “neighborhoods” of good classifiers
• Boosting has been applied successfully to a variety of classification problems

Come talk with me!
• [email protected]
• http://www.cs.huji.ac.il/~yoavf