
An Introduction to Boosting

Plan of talk
• Generative vs. non-generative modeling
• Boosting
• Alternating decision trees
• Boosting and over-fitting
• Applications
Toy Example
• Computer receives telephone call
• Measures pitch of voice
• Decides gender of caller
[Figure: male and female human voice distributions over voice pitch]

Generative modeling
[Figure: class-conditional probability curves over voice pitch, with parameters mean1, var1 and mean2, var2]
Discriminative approach
[Figure: number of mistakes as a function of the decision threshold on voice pitch]

Ill-behaved data
[Figure: class-conditional probabilities over voice pitch with fitted means mean1 and mean2, and the resulting number of mistakes]
Traditional Statistics vs. Machine Learning
[Diagram: Statistics maps Data to an estimated world state, Decision Theory maps the estimated world state to Actions; Machine Learning maps Data to Actions (predictions) directly]

Comparison of methodologies

Model                 Generative              Discriminative
Goal                  Probability estimates   Classification rule
Performance measure   Likelihood              Misclassification rate
Mismatch problems     Outliers                Misclassifications
A weak learner

A weak learner takes as input a weighted training set
(x1,y1,w1),(x2,y2,w2), … ,(xn,yn,wn)
– instances x1,x2,…,xn with binary labels y1,y2,…,yn and non-negative example weights that sum to 1 –
and outputs a weak rule h.

The boosting process

(x1,y1,1/n), … ,(xn,yn,1/n)   → weak learner → h1
(x1,y1,w1), … ,(xn,yn,wn)     → weak learner → h2
(x1,y1,w1), … ,(xn,yn,wn)     → weak learner → h3
          …
(x1,y1,w1), … ,(xn,yn,wn)     → weak learner → hT

Final rule: Sign[ α1 h1 + α2 h2 + … + αΤ hT ]

The weak requirement: the weak learner MUST do better than
random (error less than 50-50 on binary classification)!
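A minimal sketch of this loop in Python, for illustration only (not code from the talk): it uses a depth-1 decision tree ("stump") from scikit-learn as the weak learner, and assumes binary labels in {-1, +1}.

```python
# Illustrative AdaBoost-style boosting loop (a sketch, not the talk's code).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, T=50):
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                      # uniform initial weights, summing to 1
    rules, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.clip(np.sum(w * (pred != y)), 1e-12, None)   # weighted training error
        if eps >= 0.5:                           # weak requirement violated: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)  # weight of this weak rule
        w = w * np.exp(-alpha * y * pred)        # up-weight mistakes, down-weight correct
        w = w / w.sum()                          # renormalize so the weights sum to 1
        rules.append(h)
        alphas.append(alpha)

    def final_rule(X_new):
        # Final rule: Sign[ alpha_1 h_1 + alpha_2 h_2 + ... + alpha_T h_T ]
        score = sum(a * h.predict(X_new) for a, h in zip(alphas, rules))
        return np.sign(score)

    return final_rule
```

The weight update concentrates the next round's attention on the examples the current combined rule gets wrong, which is exactly the role of the reweighted training sets above.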
Adaboost
• Binary labels y = -1,+1
• margin(x,y) = y [Σt αt ht(x)]
• P(x,y) = (1/Z) exp(-margin(x,y))
• Given ht, we choose αt to minimize Σ(x,y) exp(-margin(x,y))
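As a worked step (standard AdaBoost algebra, filled in here rather than taken from the slide): writing w(x,y) for the normalized weights before round t and εt for the weighted error of ht, the quantity to minimize is

\[
\sum_{(x,y)} w(x,y)\, e^{-\alpha_t\, y\, h_t(x)}
  \;=\; (1-\varepsilon_t)\, e^{-\alpha_t} \;+\; \varepsilon_t\, e^{\alpha_t},
\]

and setting its derivative with respect to \(\alpha_t\) to zero gives the closed form

\[
\alpha_t \;=\; \tfrac{1}{2}\,\ln\!\frac{1-\varepsilon_t}{\varepsilon_t}.
\]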
Main property of Adaboost
• If the advantages of the weak rules over random guessing are γ1,γ2,…,γT, then the in-sample error of the final rule is at most exp(-2 Σt γt²) (w.r.t. the initial weights).

Adaboost as gradient descent
• Discriminator class: a linear discriminator in the space of “weak hypotheses”
• Original goal: find the hyperplane with the smallest number of mistakes
  – Known to be an NP-hard problem (no algorithm that runs in time polynomial in d, where d is the dimension of the space)
• Computational method: use the exponential loss as a surrogate, perform gradient descent.
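A one-line justification of this view (standard, not spelled out on the slide): with combined score F(x) = Σt αt ht(x), the gradient of the exponential loss with respect to the score on example i is

\[
\frac{\partial}{\partial F(x_i)} \sum_{j} e^{-y_j F(x_j)}
  \;=\; -\,y_i\, e^{-y_i F(x_i)} \;\propto\; -\,y_i\, w_i ,
\]

so handing the weak learner the current example weights is exactly handing it the direction of steepest descent, one coordinate (one weak hypothesis) at a time.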
Margins view
• x, w ∈ R^n ; y ∈ {-1,+1}
• Prediction = sign(w • x)
• Margin = y (w • x) / (||w|| ||x||)
[Figure: + and - examples on either side of the separating hyperplane w; mistakes and correct classifications projected onto the margin axis]

Adaboost et al.
• Loss as a function of the margin: Adaboost loss = e^(-y (w • x))
• Logitboost, Brownboost, and the 0-1 loss
[Figure: loss vs. margin curves for Adaboost, Logitboost, Brownboost and the 0-1 loss; cumulative # examples over mistakes (negative margin) and correct (positive margin)]
One coordinate at a time
• Adaboost performs gradient descent on exponential loss
• Adds one coordinate (“weak learner”) at each iteration.
• Weak learning in binary classification = slightly better than random guessing.
• Weak learning in regression – unclear.
• Uses example-weights to communicate the gradient direction to the weak learner
• Solves a computational problem

What is a good weak learner?
• The set of weak rules (features) should be flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
• Small enough to allow exhaustive search for the minimal weighted training error.
• Small enough to avoid over-fitting.
• Should be able to calculate predicted label very efficiently.
• Rules can be “specialists” – predict only on a small subset of the input space and abstain from predicting on the rest (output 0).
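To make "small enough to allow exhaustive search for the minimal weighted training error" concrete, here is a sketch of a threshold-rule ("decision stump") weak learner; the function name and interface are my own illustration, not from the talk.

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustive search over rules of the form s * sign(x_j > theta):
    every feature j, every observed threshold theta, both orientations s.
    X: (n, d) array, y: labels in {-1, +1}, w: non-negative weights summing to 1."""
    n, d = X.shape
    best_err, best_rule = np.inf, None
    for j in range(d):
        for theta in np.unique(X[:, j]):
            base = np.where(X[:, j] > theta, 1, -1)
            for s in (1, -1):                       # allow +1 on either side of the threshold
                err = np.sum(w * (s * base != y))   # weighted training error of this rule
                if err < best_err:
                    best_err, best_rule = err, (j, theta, s)
    j, theta, s = best_rule
    predict = lambda X_new: s * np.where(X_new[:, j] > theta, 1, -1)
    return predict, best_err
```

A class of this size can be searched exhaustively at every boosting round yet is still weakly correlated with many feature-label relationships; the "Boost Stumps" results later in the talk boost rules of roughly this form.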
Decision Trees
[Figure: a decision tree that splits on X>3 and then Y>5, with leaf predictions +1/-1, and the corresponding partition of the X-Y plane]

Decision tree as a sum
[Figure: the same tree rewritten as a sum of real-valued node contributions (-0.2, +0.2, -0.1, +0.1, -0.3); the classification is the sign of the summed contributions along the path an instance follows]
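A small sketch of the "decision tree as a sum" idea (the constants are illustrative, loosely borrowed from the figure, and not guaranteed to match it exactly): each node contributes a real value, and the class is the sign of the total contribution along the path an instance follows.

```python
def tree_as_sum(x, y):
    # Root contributes a constant; each branch taken adds its own contribution.
    score = -0.2                              # root contribution (illustrative value)
    if x > 3:
        score += +0.2                         # contribution of the X > 3 branch
        score += +0.1 if y > 5 else -0.3      # contribution of the Y > 5 test
    else:
        score += -0.1                         # contribution of the X <= 3 branch
    return 1 if score > 0 else -1             # the sign plays the role of the leaf label
```

Written this way, an ordinary decision tree is already a sum of simple rules whose sign is taken at the end, which is the form the alternating decision trees on the next slides generalize.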
An alternating decision tree
[Figure: an alternating decision tree with a real-valued root prediction and decision nodes X>3, Y<1 and Y>5, each with real-valued prediction nodes (e.g. -0.2, +0.2, 0.0, +0.7, -0.1, +0.1, -0.3); the classification is the sign of the sum of predictions along all paths consistent with the instance, shown together with the induced partition of the X-Y plane]

Example: Medical Diagnostics
• Cleve dataset from UC Irvine database.
• Heart disease diagnostics (+1 = healthy, -1 = sick)
• 13 features from tests (real valued and discrete).
• 303 instances.
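A minimal sketch of how an alternating decision tree is evaluated (my own data structure, not the ADtree implementation behind the talk): unlike an ordinary tree, an instance can satisfy the preconditions of several decision nodes at once, every prediction node it reaches contributes to the score, and the class is the sign of the total.

```python
# Each rule: (precondition, condition, value_if_true, value_if_false).
# Preconditions and conditions are predicates over the instance x.
def adtree_score(x, root_value, rules):
    score = root_value                         # the root contributes a constant prediction
    for precondition, condition, v_true, v_false in rules:
        if precondition(x):                    # the rule only fires where its precondition holds
            score += v_true if condition(x) else v_false
    return score

# Example in the spirit of the X>3 / Y>5 figure (constants are illustrative):
rules = [
    (lambda x: True,       lambda x: x["X"] > 3, +0.2, -0.1),
    (lambda x: x["X"] > 3, lambda x: x["Y"] > 5, +0.1, -0.3),
]
label = 1 if adtree_score({"X": 4, "Y": 6}, -0.2, rules) > 0 else -1
```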
ADtree for the Cleveland heart-disease diagnostics problem
[Figure: the learned alternating decision tree]

Cross-validated accuracy

Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                0.5%
Boost Stumps         16                 16.5%                0.8%
Curious phenomenon: Boosting decision trees

Using <10,000 training examples we fit >2,000,000 parameters.
Explanation using margins
[Figure: 0-1 loss plotted against the margin]

Explanation using margins
[Figure: 0-1 loss plotted against the margin – no examples with small margins!!]
Experimental Evidence
[Figure]

Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics ’98)
For any convex combination and any threshold, the probability of mistake is bounded by the fraction of training examples with small margin plus a term that depends on the size of the training sample and the VC dimension of the weak rules – with no dependence on the number of weak rules that are combined!!!
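For reference, the bound behind the theorem (restated, up to constants, from the Schapire, Freund, Bartlett & Lee paper, since the slide's formula is an image): if f(x) = Σt αt ht(x) with αt ≥ 0 and Σt αt = 1 is a convex combination of weak rules from a class of VC dimension d, then with probability at least 1-δ over a training sample S of size m, for every such f and every threshold θ > 0,

\[
\Pr_{\mathcal D}\big[\, y f(x) \le 0 \,\big]
\;\le\;
\Pr_{S}\big[\, y f(x) \le \theta \,\big]
\;+\;
O\!\left(\sqrt{\frac{d\,\log^2(m/d)}{m\,\theta^{2}} \;+\; \frac{\log(1/\delta)}{m}}\right).
\]

The left side is the probability of a mistake, the first term on the right is the fraction of training examples with small margin, and the remainder depends only on the sample size m and the VC dimension d of the weak rules – not on the number of weak rules combined.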
Suggested optimization problem
[Figure: an optimization problem over the margin involving the training sample size m and the VC dimension d]

Idea of Proof
[Figure]
Applications of Boosting
• Academic research
• Applied research
• Commercial deployment

Academic research
% test error rates

Database     Other              Boosting   Error reduction
Cleveland    27.2 (DT)          16.5       39%
Promoters    22.0 (DT)          11.8       46%
Letter       13.8 (DT)          3.5        74%
Reuters 4    5.8, 6.0, 9.8      2.95       ~60%
Reuters 8    11.3, 12.1, 13.4   7.4        ~40%
Applied research (Schapire, Singer, Gorin 98)
• “AT&T, How may I help you?”
• Classify voice requests
• Voice -> text -> category
• Fourteen categories: Area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge, time

Examples
• “Yes I’d like to place a collect call long distance please” → collect
• “Operator I need to make a call but I need to bill it to my office” → third party
• “Yes I’d like to place a call on my master card please” → calling card
• “I just called a number in Sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill” → billing credit
Weak rules generated by “boostexter”
[Table: example weak rules for the categories calling card, collect call and third party, each keyed on whether a particular word occurs or does not occur in the utterance]

Results
• 7844 training examples – hand transcribed
• 1000 test examples – hand / machine transcribed
• Accuracy with 20% rejected
  – Machine transcribed: 75%
  – Hand transcribed: 90%
Commercial deployment (Freund, Mason, Rogers, Pregibon, Cortes 2000)
• Distinguish business/residence customers
• Using statistics from call-detail records
• Alternating decision trees
  – Similar to boosting decision trees, more flexible
  – Combines very simple rules
  – Can over-fit, cross validation used to stop

Massive datasets
• 260M calls / day
• 230M telephone numbers
• Label unknown for ~30%
• Hancock: software for computing statistical signatures.
• 100K randomly selected training examples, ~10K is enough
• Training takes about 2 hours.
• Generated classifier has to be both accurate and efficient
Alternating tree for “bizocity”
[Figure: the learned alternating decision tree]

Alternating Tree (Detail)
[Figure: detail of the tree]
Precision/recall graphs
[Figure: accuracy as a function of the score threshold]

Business impact
• Increased coverage from 44% to 56%
• Accuracy ~94%
• Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.
Summary
• Boosting is a computational method for learning accurate classifiers
• Resistance to over-fit explained by margins
• Underlying explanation – large “neighborhoods” of good classifiers
• Boosting has been applied successfully to a variety of classification problems

Come talk with me!
• [email protected]
• http://www.cs.huji.ac.il/~yoavf
