
Adaboost

Derek Hoiem

March 31, 2004


Outline
 Background

 Adaboost Algorithm

 Theory/Interpretations

 Practical Issues

 Face detection experiments


What’s So Good About Adaboost

 Improves classification accuracy

 Can be used with many different classifiers

 Commonly used in many areas

 Simple to implement

 Not prone to overfitting


A Brief History

 Bootstrapping

 Bagging

 Boosting (Schapire 1989)

 Adaboost (Schapire 1995)


Bootstrap Estimation
 Repeatedly draw n samples from D
 For each set of samples, estimate a
statistic
 The bootstrap estimate is the mean of the
individual estimates
 Used to estimate a statistic (parameter)
and its variance
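
A minimal sketch of the bootstrap procedure above; treating the data as a 1-D numpy array and using the median as the statistic are illustrative assumptions, not choices from the slides:

import numpy as np

def bootstrap_estimate(D, statistic, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(D)
    # Repeatedly draw n samples from D with replacement and compute the statistic
    estimates = np.array([statistic(rng.choice(D, size=n, replace=True)) for _ in range(B)])
    # The bootstrap estimate is the mean of the individual estimates;
    # their spread gives an estimate of the statistic's variance
    return estimates.mean(), estimates.var(ddof=1)

D = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=200)  # toy data
est, var = bootstrap_estimate(D, np.median)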
Bagging - Bootstrap Aggregating

 For i = 1 .. M
 Draw n*<n samples from D with replacement
 Learn classifier Ci

 Final classifier is a vote of C1 .. CM


 Increases classifier stability/reduces
variance
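
A short sketch of the bagging loop above; the sklearn decision tree as classifier Ci, the subset fraction, and labels in {-1, +1} are illustrative assumptions:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, M=25, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    classifiers = []
    for _ in range(M):
        idx = rng.choice(n, size=int(frac * n), replace=True)   # draw n* < n samples with replacement
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # learn classifier Ci
    return classifiers

def bagging_predict(classifiers, X):
    votes = np.stack([c.predict(X) for c in classifiers])  # one row of votes per classifier
    return np.sign(votes.sum(axis=0))                       # majority vote of C1 .. CM (labels in {-1, +1})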
Boosting (Schapire 1989)
 Randomly select n1 < n samples from D without replacement to
obtain D1
 Train weak learner C1

 Select n2 < n samples from D, half of them misclassified by C1, to obtain D2
 Train weak learner C2

 Select all samples from D that C1 and C2 disagree on


 Train weak learner C3

 Final classifier is vote of weak learners
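
A rough sketch of the three-classifier recipe above; the depth-1 sklearn tree as weak learner, the subset sizes, and labels in {-1, +1} are assumptions for illustration (the sketch also assumes C1 makes at least some mistakes and that C1 and C2 disagree somewhere):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_by_filtering(X, y, n1, n2, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # C1: train on n1 < n samples drawn without replacement
    i1 = rng.choice(n, size=n1, replace=False)
    C1 = DecisionTreeClassifier(max_depth=1).fit(X[i1], y[i1])
    # C2: train on n2 samples, half of them misclassified by C1
    wrong = np.flatnonzero(C1.predict(X) != y)
    right = np.flatnonzero(C1.predict(X) == y)
    i2 = np.concatenate([rng.choice(wrong, size=n2 // 2),
                         rng.choice(right, size=n2 // 2)])
    C2 = DecisionTreeClassifier(max_depth=1).fit(X[i2], y[i2])
    # C3: train on all samples where C1 and C2 disagree
    i3 = np.flatnonzero(C1.predict(X) != C2.predict(X))
    C3 = DecisionTreeClassifier(max_depth=1).fit(X[i3], y[i3])
    # Final classifier: majority vote of the three weak learners
    return lambda Xq: np.sign(C1.predict(Xq) + C2.predict(Xq) + C3.predict(Xq))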


Adaboost - Adaptive Boosting

 Instead of sampling, re-weight


 After re-weighting, the previous weak learner has only 50% accuracy over the new distribution

 The re-weighted distribution can be used to learn new weak classifiers

 Final classification is based on a weighted vote of the weak classifiers
Adaboost Terms
 Learner = Hypothesis = Classifier

 Weak Learner: < 50% error over any distribution

 Strong Classifier: thresholded linear combination of weak learner outputs
Discrete Adaboost (DiscreteAB)
(Friedman’s wording)
Discrete Adaboost (DiscreteAB)
(Freund and Schapire’s wording)
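
The algorithm listings on these two slides were images; the sketch below is a standard Discrete Adaboost loop in the Freund and Schapire style, with decision stumps (depth-1 sklearn trees) and labels in {-1, +1} as illustrative assumptions, not the slides' exact listing:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, T=50):
    n = len(X)
    w = np.full(n, 1.0 / n)                        # start with uniform example weights
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w * (pred != y))              # weighted training error
        if err >= 0.5:                             # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)             # up-weight mistakes, down-weight correct examples
        w /= w.sum()                               # renormalize to a distribution
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def strong_classify(learners, alphas, X):
    # Strong classifier: sign (threshold) of the weighted vote of weak learners
    return np.sign(sum(a * h.predict(X) for h, a in zip(learners, alphas)))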
Adaboost with Confidence-Weighted Predictions (RealAB)
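
The confidence-weighted round shown on this slide as an algorithm image has, in Friedman et al.'s description of RealAB, roughly this form (a sketch, not the slide's exact listing): fit a weighted class-probability estimate, convert it to a half-log-ratio, and re-weight:
\[
p_m(x) = P_w(y{=}1 \mid x), \qquad
f_m(x) = \tfrac{1}{2}\ln\frac{p_m(x)}{1 - p_m(x)},
\]
\[
w_i \leftarrow w_i\, e^{-y_i f_m(x_i)} \ \text{(then renormalize)}, \qquad
F(x) = \operatorname{sign}\Big(\sum_m f_m(x)\Big).
\]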
Comparison: 2-Node Trees
Bound on Training Error (Schapire)
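
The bound itself was an equation image on the slide; in the standard statement of Schapire's result (with D_t the weight distribution at round t), the training error of the final classifier H satisfies
\[
\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\big[H(x_i) \ne y_i\big] \;\le\; \prod_{t=1}^{T} Z_t,
\qquad
Z_t = \sum_{i=1}^{n} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} .
\]
For \pm 1-valued weak learners with weighted error \varepsilon_t, Z_t = 2\sqrt{\varepsilon_t(1-\varepsilon_t)} < 1 whenever \varepsilon_t < 1/2, so the training error drops exponentially in the number of rounds.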
Finding a weak hypothesis
 Train classifier (as usual) on weighted training data

 Some weak learners can minimize Z by gradient descent

 Sometimes we can ignore alpha (when the weak learner output can be freely scaled)
Choosing Alpha
 Choose alpha to minimize Z

 Re-weighting with this alpha leaves the latest weak learner at a 50% error rate on the new distribution

 In general, alpha must be computed numerically (see the closed form for the ±1 case below)
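
For weak learners with outputs in \{-1, +1\}, minimizing Z has the familiar closed form (standard result, stated here for completeness):
\[
\alpha_t = \frac{1}{2}\ln\frac{1 - \varepsilon_t}{\varepsilon_t},
\]
where \varepsilon_t is the weighted error of h_t; re-weighting with this \alpha_t is exactly what drives h_t to a 50% error rate on the new distribution. For confidence-rated outputs there is generally no closed form, hence the numerical search.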


Special Case
Domain-partitioning weak hypothesis (e.g. decision trees,
histograms)

[Figure: a 2-D set of x and o examples split into three partitions P1, P2, P3]
Partition 1: Wx1 = 5/13, Wo1 = 0/13
Partition 2: Wx2 = 1/13, Wo2 = 2/13
Partition 3: Wx3 = 1/13, Wo3 = 4/13

Z = 2 (0 + sqrt(2)/13 + 2/13) = 0.525
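
The general rule behind these numbers, following Schapire and Singer: with W_x^j and W_o^j the total weight of x and o examples falling in partition j, the best confidence for partition j and the resulting normalizer are
\[
c_j = \frac{1}{2}\ln\frac{W_x^{\,j}}{W_o^{\,j}}, \qquad
Z = 2\sum_j \sqrt{W_x^{\,j}\, W_o^{\,j}} ,
\]
which reproduces the example:
\[
Z = 2\Big(\sqrt{\tfrac{5}{13}\cdot 0} + \sqrt{\tfrac{1}{13}\cdot\tfrac{2}{13}} + \sqrt{\tfrac{1}{13}\cdot\tfrac{4}{13}}\Big) = 2\Big(0 + \tfrac{\sqrt{2}}{13} + \tfrac{2}{13}\Big) \approx 0.525 .
\]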
Smoothing Predictions
 Equivalent to adding a prior in the partitioning case

 Bounds the confidence of each prediction (the bound shown on the slide is sketched below)
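
The smoothed prediction was an equation image on the slide; Schapire and Singer's version (assumed here to be the form intended) adds a small \varepsilon, e.g. \varepsilon = 1/(2n), to both weight totals, which also caps the confidence:
\[
c_j = \frac{1}{2}\ln\frac{W_+^{\,j} + \varepsilon}{W_-^{\,j} + \varepsilon},
\qquad
|c_j| \;\le\; \frac{1}{2}\ln\frac{1 + \varepsilon}{\varepsilon} \;\approx\; \frac{1}{2}\ln\frac{1}{\varepsilon} .
\]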
Justification for the Error Function
 Adaboost minimizes:

 Provides a differentiable upper bound on the training error (Schapire)

 Minimizing J(F) is equivalent, up to a 2nd-order Taylor expansion about F = 0, to maximizing the expected binomial log-likelihood (Friedman)
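
The criterion left as an equation image above is, in Friedman et al.'s notation, the exponential criterion
\[
J(F) = E\big[e^{-yF(x)}\big],
\]
which bounds the 0/1 training error pointwise because \mathbf{1}\big[y \ne \operatorname{sign}(F(x))\big] \le e^{-yF(x)}.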
Probabilistic Interpretation
(Friedman)

Proof:
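
The lemma and its proof on this slide were images; the standard argument minimizes the exponential criterion pointwise:
\[
E\big[e^{-yF(x)} \mid x\big]
= P(y{=}1 \mid x)\, e^{-F(x)} + P(y{=}{-1} \mid x)\, e^{F(x)} .
\]
Setting the derivative with respect to F(x) to zero gives
\[
F(x) = \frac{1}{2}\ln\frac{P(y{=}1 \mid x)}{P(y{=}{-1} \mid x)}
\qquad\Longleftrightarrow\qquad
P(y{=}1 \mid x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}} .
\]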
Misinterpretations of the
Probabilistic Interpretation
 Lemma 1 applies to the true distribution
 Only applies to the training set
 Note that P(y|x) = {0, 1} for the training set in most
cases

 Adaboost will converge to the global minimum

 Greedy process → local minimum
 Complexity of strong learner limited by complexities of weak learners
Example of Joint Distribution that Cannot Be
Learned with Naïve Weak Learners

[Figure: the four corners of the (x1, x2) grid, labeled o at (0,0) and (1,1), x at (0,1) and (1,0)]
P(o | 0,0) = 1
P(x | 0,1) = 1
P(x | 1,0) = 1
P(o | 1,1) = 1

H(x1,x2) = a*x1 + b*(1-x1) + c*x2 + d*(1-x2) = (a-b)*x1 + (c-d)*x2 + (b+d)

 Linear decision for a non-linear problem when using naïve weak learners
Immunity to Overfitting?
Adaboost as Logistic Regression
(Friedman)
 Additive Logistic Regression: Fitting class conditional probability log
ratio with additive terms

 DiscreteAB builds an additive logistic regression model via Newton-like updates for minimizing the exponential criterion J(F) = E[exp(-yF(x))]

 RealAB fits an additive logistic regression model by stage-wise, approximate optimization of the same criterion

 Even after the training error reaches zero, AB keeps producing a “purer” solution probabilistically
Adaboost as Margin Maximization
(Schapire)
Bound on Generalization Error
[Equation image: the generalization error is bounded by a margin confidence term measured on the training examples (the margin being the confidence of a correct decision), plus a complexity term involving the VC dimension of the weak learners and the number of training examples]

Too loose to be of practical value
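
In rough form, the Schapire et al. margin bound annotated above reads: with high probability, for every margin threshold \theta > 0,
\[
P_{\text{test}}\big[\, y f(x) \le 0 \,\big]
\;\le\;
\hat{P}_{\text{train}}\big[\, y f(x) \le \theta \,\big]
+ \tilde{O}\!\left(\sqrt{\frac{d}{m\,\theta^{2}}}\right),
\]
where d is the VC dimension of the weak learner class and m is the number of training examples; the first term is the margin confidence term, and the second is the complexity term that makes the bound too loose in practice.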
Maximizing the margin…
 But Adaboost doesn’t necessarily maximize the margin
on the test set (Ratsch)
 Ratsch proposes an algorithm (Adaboost*) that does so
Adaboost and Noisy Data
 Examples with the largest gap between the label and the
classifier prediction get the most weight

 What if the label is wrong?

Opitz: synthetic data with 20% one-sided noisy labeling
[Chart: results vs. number of networks in the ensemble]

WeightBoost (Jin)
 Uses input-dependent weighting factors for weak
learners
Text Categorization
BrownBoost (Freund)
 Non-monotonic weighting function
 Examples far from boundary decrease in weight

 Set a target error rate – the algorithm attempts to achieve that error rate

 No results posted (ever)


Adaboost Variants Proposed By
Friedman
 LogitBoost
 Solves for the additive model that maximizes the binomial log-likelihood, via Newton steps on weighted least-squares problems (see below)
 Requires care to avoid numerical problems

 GentleBoost
 Update is fm(x) = P(y=1 | x) – P(y=0 | x) instead of the half-log-ratio (Newton) update
 Update is bounded in [-1, 1]
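
For reference (the formulas on this slide appeared as images), Friedman et al.'s steps are roughly as follows, writing p(x) = e^{F(x)} / (e^{F(x)} + e^{-F(x)}) for the current model's probability estimate; this is a sketch of the published algorithms, not the slide's exact notation.

LogitBoost Newton step:
\[
z_i = \frac{y_i^{*} - p(x_i)}{p(x_i)\,\big(1 - p(x_i)\big)}, \qquad
w_i = p(x_i)\,\big(1 - p(x_i)\big), \qquad
F \leftarrow F + \tfrac{1}{2} f_m,
\]
where y^{*} \in \{0, 1\} and f_m is the weighted least-squares fit of z_i on x_i; the division by p(1-p) is where numerical care is needed.

GentleBoost step:
\[
f_m(x) = P_w(y{=}1 \mid x) - P_w(y{=}{-1} \mid x), \qquad F \leftarrow F + f_m .
\]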
Comparison (Friedman)

Synthetic, non-additive decision boundary


Complexity of Weak Learner
 Complexity of strong classifier limited by
complexities of weak learners

 Example:
 Weak Learner: stumps → WL DimVC = 2
 Input space: R^N → Strong DimVC = N+1
 N+1 partitions can have arbitrary confidences assigned
Complexity of the Weak Learner
Complexity of the Weak Learner

Non-Additive Decision Boundary


Adaboost in Face Detection
Detector       Adaboost Variant   Weak Learner
Viola-Jones    DiscreteAB         Stumps
FloatBoost     FloatBoost         1-D Histograms
KLBoost        KLBoost            1-D Histograms
Schneiderman   RealAB             One Group of N-D Histograms
Boosted Trees – Face Detection
[Charts: stumps (2 nodes) vs. 8-node trees]

Boosted Trees – Face Detection
[Charts: ten stumps vs. one 20-node tree; trees with 2, 8, and 20 nodes]

Boosted Trees vs. Bayes Net
FloatBoost (Li)
[Chart: false-positive rate at a 95.5% detection rate vs. number of iterations (5–30), comparing trees with 2, 8, and 20 bins (L2) and 20 bins (L1+L2)]

1. A less greedy approach (FloatBoost) yields better results

2. Stumps are not sufficient for vision
When to Use Adaboost

 Give it a try for any classification problem

 Be wary if using noisy/unlabeled data


How to Use Adaboost
 Start with RealAB (easy to implement)

 Possibly try a variant, such as FloatBoost, WeightBoost, or LogitBoost

 Try varying the complexity of the weak learner

 Try forming the weak learner to minimize Z (or some similar goal)
Conclusion
 Adaboost can improve classifier accuracy
for many problems

 Complexity of weak learner is important

 Alternative boosting algorithms exist to deal with many of Adaboost’s problems
References
 Duda, Hart, and Stork – Pattern Classification

 Freund – “An adaptive version of the boost by majority algorithm”

 Freund, Schapire – “Experiments with a new boosting algorithm”

 Freund, Schapire – “A decision-theoretic generalization of on-line learning and an application to boosting”

 Friedman, Hastie, Tibshirani – “Additive Logistic Regression: A Statistical View of Boosting”

 Jin, Liu, et al. (CMU) – “A New Boosting Algorithm Using Input-Dependent Regularizer”

 Li, Zhang, et al. – “FloatBoost Learning for Classification”

 Opitz, Maclin – “Popular Ensemble Methods: An Empirical Study”

 Ratsch, Warmuth – “Efficient Margin Maximization with Boosting”

 Schapire, Freund, et al. – “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods”

 Schapire, Singer – “Improved Boosting Algorithms Using Confidence-rated Predictions”

 Schapire – “The Boosting Approach to Machine Learning: An Overview”

 Zhang, Li, et al. – “Multi-view Face Detection with FloatBoost”


