
Introduction to Boosting

Cynthia Rudin
PACM, Princeton University

Advisors: Ingrid Daubechies and Robert Schapire
Say you have a database of news articles…

[Figure: a collection of news articles, each paired with a label, e.g. (article, +1), (article, +1), (article, −1), (article, −1), …]

where articles are labeled ‘+1’ if the category is “entertainment”, and ‘-1’ otherwise.

Your goal is: given a new article, find its label.

This is not easy: datasets are noisy and high-dimensional.


Examples of Statistical Learning Tasks:
• Optical Character Recognition (OCR) (post office, banks), object recognition in images
• Bioinformatics (analysis of gene array data for tumor detection, protein classification, etc.)
• Webpage classification (search engines), email filtering, document retrieval
• Semantic classification for speech, automatic .mp3 sorting
• Time-series prediction (regression)

Huge number of applications, but all have high-dimensional data.
Examples of classification algorithms:
• SVMs (Support Vector Machines – large margin classifiers)
• Neural Networks
• Decision Trees / Decision Stumps (CART)
• RBF Networks
• Nearest Neighbors
• Bayes Net

Which is the best?

It depends on the amount and type of data, and on the application!

For general applications, it’s a tie between SVMs and Boosted Decision Trees/Stumps.

One can always find a problem where a particular algorithm is the best. Boosted convolutional neural nets are the best for OCR (Yann LeCun et al.).
Training Data: $\{(x_i, y_i)\}_{i=1}^m$, where each $(x_i, y_i)$ is chosen i.i.d. from an unknown probability distribution on $X \times \{-1, 1\}$, where $X$ is the “space of all possible articles” and $\{-1, 1\}$ is the set of “labels”.

Huge Question: Given a new random example x, can we predict its correct label with high probability? That is, can we generalize from our training data?

[Figure: training points in X marked + and −, with a new unlabeled point marked “?”]

Yes!!! That’s what the field of statistical learning is all about.

The goal of statistical learning is to characterize points from an unknown probability distribution when given a representative sample from that distribution.
How do we construct a classifier?
• Divide the space X into two sections, based on the sign of a function f : X → R.
• The decision boundary is the zero-level set of f, i.e. the set where f(x) = 0.

[Figure: the curve f(x) = 0 splits X into a “+” region and a “−” region; the training points and the new point “?” fall on either side]

Classifiers divide the space into two pieces for binary classification. Multiclass classification can always be reduced to binary.
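To make this concrete, here is a tiny Python sketch (my own illustration, with a made-up linear score function f) of a classifier defined by the sign of f:

# A classifier built from a real-valued function f; the decision
# boundary is the set of points where f(x) = 0.

def f(x):
    # Toy linear score function on 2-D points x = (x1, x2).
    return 2.0 * x[0] - x[1] + 0.5

def classify(x):
    # Label +1 on one side of the boundary f(x) = 0, and -1 on the other.
    return 1 if f(x) >= 0 else -1

print(classify((1.0, 0.0)))   # +1, since f = 2.5 > 0
print(classify((-1.0, 1.0)))  # -1, since f = -2.5 < 0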
Overview of Talk

• The Statistical Learning Problem (done)

• Introduction to Boosting and AdaBoost

• AdaBoost as Coordinate Descent

• The Margin Theory and Generalization


Say we have a “weak” learning algorithm:
• A weak learning algorithm produces weak classifiers.
• (Think of a weak classifier as a “rule of thumb”)

Examples of weak classifiers for the “entertainment” application:

h1(article) = +1 if the article contains the term “movie”, −1 otherwise
h2(article) = +1 if the article contains the term “actor”, −1 otherwise
h3(article) = +1 if the article contains the term “drama”, −1 otherwise
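As an illustration (my own sketch, not code from the talk), these rules of thumb could be written as keyword checks in Python:

# Keyword-based "rules of thumb": each weak classifier checks for one term.
def make_keyword_classifier(term):
    def h(article):
        # +1 if the article contains the term, -1 otherwise.
        return 1 if term in article.lower() else -1
    return h

h1 = make_keyword_classifier("movie")
h2 = make_keyword_classifier("actor")
h3 = make_keyword_classifier("drama")

article = "A new movie adapts the classic drama for the screen."
print(h1(article), h2(article), h3(article))   # prints: 1 -1 1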

Wouldn’t it be nice to combine the weak classifiers?


Boosting algorithms combine weak
classifiers in a meaningful way.
Example:
f(article) = sign[ .4 h1(article) + .3 h2(article) + .3 h3(article) ]

So if the article contains the term “movie” and the word “drama”, but not the word “actor”, the value of f is sign[.4 − .3 + .3] = 1, so we label it +1.

A boosting algorithm takes as input:
- the weak learning algorithm which produces the weak classifiers
- a large training database

and outputs:
- the coefficients of the weak classifiers to make the combined classifier
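A quick sanity check of that arithmetic in Python (the ±1 votes below are simply the outputs of h1, h2, h3 for such an article):

# Weighted vote of the three weak classifiers for an article containing
# "movie" (+1) and "drama" (+1) but not "actor" (-1).
weights = [0.4, 0.3, 0.3]
votes   = [+1, -1, +1]          # outputs of h1, h2, h3

score = sum(w * v for w, v in zip(weights, votes))   # 0.4 - 0.3 + 0.3 = 0.4
label = 1 if score >= 0 else -1
print(score, label)             # prints: 0.4 1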
Two ways to use a Boosting Algorithm:

• As a way to increase the performance of already “strong” classifiers.
  • Ex. neural networks, decision trees
• “On their own” with a really basic weak classifier.
  • Ex. decision stumps

AdaBoost
(Freund and Schapire ’95)

- Start with a uniform distribution (“weights”) over training examples.
  (The weights tell the weak learning algorithm which examples are important.)

- Request a weak classifier from the weak learning algorithm, $h_j : X \to \{-1, 1\}$.

- Increase the weights on the training examples that were misclassified.

- (Repeat)

At the end, make (carefully!) a linear combination of the weak classifiers obtained at all iterations:

$f_{final}(x) = \mathrm{sign}(\lambda_1 h_1(x) + \dots + \lambda_n h_n(x))$
AdaBoost
Define three important things:

$d_t \in \mathbb{R}^m$ := distribution (“weights”) over examples at time t
   e.g., $d_t = [.25,\ .3,\ .2,\ .25]$ over examples 1, 2, 3, 4

$\lambda_t \in \mathbb{R}^n$ := coefficients of the weak classifiers for the linear combination
   $f_t(x) = \mathrm{sign}(\lambda_{t,1} h_1(x) + \dots + \lambda_{t,n} h_n(x))$

$M \in \mathbb{R}^{m \times n}$ := matrix of hypotheses and data

Enumerate every possible weak classifier which can be produced by the weak learning algorithm: column j of M corresponds to weak classifier $h_j$ (e.g. “movie”, “actor”, “drama”), and row i corresponds to data point $x_i$, so M has one row per data point. The entries are

$M_{ij} := h_j(x_i)\, y_i$,  which is $+1$ if weak classifier $h_j$ classifies point $x_i$ correctly, and $-1$ otherwise.

The matrix M has too many columns to actually be enumerated. M acts as the only input to AdaBoost:

[Diagram: M → AdaBoost → $\lambda_{final}$, with $d_t$ and $\lambda_t$ maintained internally during the run]
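A small Python/NumPy sketch of how M could be built when the weak classifiers can be enumerated explicitly (an assumption that, as noted above, fails in general because M has far too many columns):

import numpy as np

def build_M(X, y, weak_classifiers):
    # M[i, j] = h_j(x_i) * y_i: +1 if weak classifier j is correct on example i, else -1.
    m, n = len(X), len(weak_classifiers)
    M = np.empty((m, n))
    for j, h in enumerate(weak_classifiers):
        for i, (x_i, y_i) in enumerate(zip(X, y)):
            M[i, j] = h(x_i) * y_i
    return M

# Example (using the keyword classifiers sketched earlier):
# M = build_M(articles, labels, [h1, h2, h3])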
AdaBoost (Freund and Schapire ’95)

$\lambda_1 = 0$   (initialize the coefficients to 0)
for $t = 1, \dots, T_{final}$:
   $d_{t,i} = e^{-(M\lambda_t)_i} \Big/ \sum_{i'=1}^m e^{-(M\lambda_t)_{i'}}$ for all $i$   (calculate the normalized distribution)
   $j_t \in \arg\max_j (d_t^T M)_j$   (request a weak classifier from the weak learning algorithm)
   $r_t = (d_t^T M)_{j_t}$
   $\alpha_t = \frac{1}{2} \ln\!\left(\frac{1+r_t}{1-r_t}\right)$
   $\lambda_{t+1} = \lambda_t + \alpha_t e_{j_t}$   (update the linear combination of weak classifiers)
end for
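The loop above translates almost line for line into NumPy. The sketch below is my own transcription; the clipping of r_t is an added guard against division by zero when a weak classifier is perfect on the weighted data:

import numpy as np

def adaboost(M, T):
    """Run the iteration above, where M[i, j] = h_j(x_i) * y_i."""
    m, n = M.shape
    lam = np.zeros(n)                       # lambda_1 = 0
    for t in range(T):
        scores = M @ lam                    # (M lambda_t)_i for every example
        d = np.exp(-scores)
        d /= d.sum()                        # normalized distribution d_t
        edges = d @ M                       # (d_t^T M)_j for every weak classifier
        j_t = int(np.argmax(edges))         # direction with the largest edge
        r_t = np.clip(edges[j_t], -1 + 1e-12, 1 - 1e-12)
        alpha_t = 0.5 * np.log((1 + r_t) / (1 - r_t))
        lam[j_t] += alpha_t                 # lambda_{t+1} = lambda_t + alpha_t e_{j_t}
    return lam

# Final classifier: f(x) = sign( sum_j lam[j] * h_j(x) ).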
In this loop, $r_t = (d_t^T M)_{j_t}$ is the “edge” or “correlation” of weak classifier $j_t$:

$(d_t^T M)_{j_t} = \sum_{i=1}^m d_{t,i}\, y_i\, h_{j_t}(x_i) = E_{i \sim d_t}[\, y_i\, h_{j_t}(x_i)\,]$
AdaBoost as Coordinate Descent

Breiman, Mason et al., Duffy and Helmbold, etc. noticed that AdaBoost is a coordinate descent algorithm.

• Coordinate descent is a minimization algorithm like gradient descent, except that we only move along coordinates.
• We cannot calculate the gradient because of the high dimensionality of the space!
• “coordinates” = weak classifiers
  “distance to move in that direction” = the update $\alpha_t$
AdaBoost minimizes the following function via coordinate descent:

$F(\lambda) := \sum_{i=1}^m e^{-(M\lambda)_i}$

Choose a direction:   $j_t \in \arg\max_j (d_t^T M)_j$

Choose a distance to move in that direction:

$r_t = (d_t^T M)_{j_t}$,   $\alpha_t = \frac{1}{2} \ln\!\left(\frac{1+r_t}{1-r_t}\right)$,   $\lambda_{t+1} = \lambda_t + \alpha_t e_{j_t}$
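For completeness, a short calculation (standard, though not spelled out on the slide) showing that this $\alpha_t$ is exactly the minimizer of $F$ along coordinate $j_t$:

\[
F(\lambda_t + \alpha e_{j_t}) \;=\; \sum_{i=1}^m e^{-(M\lambda_t)_i - \alpha M_{i j_t}}
\;=\; e^{-\alpha} \!\!\sum_{i:\, M_{i j_t}=+1}\!\! e^{-(M\lambda_t)_i} \;+\; e^{\alpha} \!\!\sum_{i:\, M_{i j_t}=-1}\!\! e^{-(M\lambda_t)_i}.
\]

Since $d_{t,i} \propto e^{-(M\lambda_t)_i}$ and $r_t = (d_t^T M)_{j_t}$, the examples classified correctly by $h_{j_t}$ carry total weight $(1+r_t)/2$ under $d_t$ and the misclassified ones carry $(1-r_t)/2$. Setting $\partial F / \partial \alpha = 0$ therefore gives $e^{2\alpha} = (1+r_t)/(1-r_t)$, i.e. $\alpha_t = \frac{1}{2}\ln\!\left(\frac{1+r_t}{1-r_t}\right)$.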
The function $F(\lambda) := \sum_{i=1}^m e^{-(M\lambda)_i}$ is convex:

1) If the data is non-separable by the weak classifiers, the minimizer of F occurs when the size of $\lambda$ is finite.
   (This case is ok. AdaBoost converges to something we understand.)

2) If the data is separable, the infimum of F is 0, approached only as the size of $\lambda$ grows without bound.
   (This case is confusing!)

The original paper suggested that AdaBoost would probably overfit…

But it didn’t in practice!

Why not?

The margin theory!


Boosting and Margins

• We want the boosted classifier (defined via λ) to generalize well, i.e., we want it to perform well on data that is not in the training set.

• The margin theory: the margin of a boosted classifier indicates whether it will generalize well. (Schapire et al. ’98)

• Large margin classifiers work well in practice (but there’s more to this story).

Think of the margin as the confidence of a prediction.


Generalization Ability of Boosted Classifiers

Can we guess whether a boosted classifier f generalizes well?
• We cannot calculate $\Pr_{error}(f)$ directly.
• Instead, we minimize the right-hand side of a (loose) inequality such as this one (Schapire et al.): when there are no training errors, with probability at least $1-\delta$,

$\Pr_{error}(f) \;\le\; O\!\left( \left[ \frac{1}{m} \left( \frac{d \log^2(m/d)}{(\mu(f))^2} + \log\frac{1}{\delta} \right) \right]^{1/2} \right)$

where $\Pr_{error}(f)$ is the probability that classifier f makes an error on a random point $x \in X$, m is the number of training examples, $\mu(f)$ is the margin of f, and d is the VC dimension of the hypothesis space ($d \le m$).
The margin theory: when there are no training errors, with high probability (Schapire et al. ’98),

$\Pr_{error}(f) \;\le\; \tilde{O}\!\left( \frac{\sqrt{d/m}}{\mu(f)} \right)$

where, as above, d is the VC dimension of the hypothesis space ($d \le m$), m is the number of training examples, and $\mu(f)$ is the margin of f.

Large margin = better generalization = smaller probability of error


For Boosting, the margin of the combined classifier $f_\lambda$ (where $f_\lambda := \mathrm{sign}(\lambda_1 h_1 + \dots + \lambda_n h_n)$) is defined by

$\text{margin} := \mu(f_\lambda) := \min_i \frac{(M\lambda)_i}{\|\lambda\|_1}$.
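A one-line NumPy helper (my own, for illustration) that computes this margin from M and λ:

import numpy as np

def margin(M, lam):
    # mu(f_lambda) = min_i (M lambda)_i / ||lambda||_1
    return np.min(M @ lam) / np.sum(np.abs(lam))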
Does AdaBoost produce maximum margin classifiers?

(AdaBoost was invented before the margin theory…)

(Grove and Schuurmans ’98)
- yes, empirically.

(Schapire et al. ’98)
- proved AdaBoost achieves at least half the maximum possible margin.

(Rätsch and Warmuth ’03)
- yes, empirically.
- improved the bound.

(R, Daubechies, Schapire ’04)
- no, it doesn’t.
AdaBoost performs mysteriously well!

AdaBoost performs better than algorithms which are designed to maximize the margin.
Still open:
• Why does AdaBoost work so well?
• Does AdaBoost converge?
• Better / more predictable boosting algorithms!
