
Bagging and Boosting

9.520 Class 10, 13 March 2006


Sasha Rakhlin
Plan
• Bagging and sub-sampling methods
• Bias-Variance and stability for bagging
• Boosting and correlations of machines
• Gradient descent view of boosting
Bagging (Bootstrap AGGregatING)

Given a training set D = {(x_1, y_1), ..., (x_n, y_n)}:

• sample T sets of n elements from D (with replacement):
  D_1, D_2, ..., D_T, i.e. T quasi-replica training sets;

• train a machine on each D_i, i = 1, ..., T, and obtain a
  sequence of T outputs f_1(x), ..., f_T(x).
Bagging (cont.)
The final aggregate classifier can be

• for regression,

  \bar{f}(x) = \frac{1}{T} \sum_{i=1}^{T} f_i(x),

  the average of the f_i for i = 1, ..., T;

• for classification,

  \bar{f}(x) = \mathrm{sign}\left( \sum_{i=1}^{T} f_i(x) \right)

  or the majority vote

  \bar{f}(x) = \mathrm{sign}\left( \sum_{i=1}^{T} \mathrm{sign}(f_i(x)) \right).
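
As a concrete illustration (not part of the original slides), here is a minimal sketch of bagging for regression in Python, assuming scikit-learn regression trees as the base machines; the names bag_regressors and bagged_predict are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_regressors(X, y, T=50, seed=None):
    """Train T machines on bootstrap replicas D_1, ..., D_T of the data."""
    rng = np.random.default_rng(seed)
    n = len(X)
    machines = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)        # sample n points with replacement
        machines.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return machines

def bagged_predict(machines, X):
    """Aggregate by averaging: f_bar(x) = (1/T) sum_i f_i(x)."""
    return np.mean([f.predict(X) for f in machines], axis=0)

For classification, one would instead take the sign of the sum or a majority vote, as in the formulas above.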
Variation I: Sub-sampling methods

- “Standard” bagging: each of the T subsamples has size n
  and is created with replacement.

- “Sub-bagging”: create T subsamples of size α only (α < n).

- No replacement: same as bagging or sub-bagging, but
  using sampling without replacement.

- Overlap vs. non-overlap: should the T subsamples overlap?
  E.g., create T non-overlapping subsamples, each with n/T
  training points (a small sketch of these variants follows below).
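
A minimal sketch of how these sampling variants could be generated (the helper make_subsamples and its parameters are illustrative, not from the slides):

import numpy as np

def make_subsamples(n, T, alpha=None, replace=True, overlap=True, seed=None):
    """Return T index sets for bagging, sub-bagging, or non-overlapping variants."""
    rng = np.random.default_rng(seed)
    if not overlap:                                # T disjoint subsamples of size n // T
        return np.array_split(rng.permutation(n), T)
    size = n if alpha is None else alpha           # "standard" bagging vs. sub-bagging
    return [rng.choice(n, size=size, replace=replace) for _ in range(T)]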
Bias-Variance for Regression (Breiman, 1996)

Let

  I[f] = \int (f(x) - y)^2 \, p(x, y) \, dx \, dy

be the expected risk and f_0 the regression function. With
\bar{f}(x) = E_S f_S(x), if we define the bias as

  \int (f_0(x) - \bar{f}(x))^2 \, p(x) \, dx

and the variance as

  E_S \left\{ \int (f_S(x) - \bar{f}(x))^2 \, p(x) \, dx \right\},

we have the decomposition

  E_S \{ I[f_S] \} = I[f_0] + \text{bias} + \text{variance}.
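
A short derivation of this decomposition (standard, only sketched here, and not spelled out on the slide): split f_S - y into (f_S - f_0) + (f_0 - y) and f_S - f_0 into (f_S - \bar{f}) + (\bar{f} - f_0); the cross terms vanish because E[y | x] = f_0(x) and E_S f_S(x) = \bar{f}(x), giving

  E_S \{ I[f_S] \} = I[f_0] + \int (f_0(x) - \bar{f}(x))^2 \, p(x) \, dx
                   + E_S \left\{ \int (f_S(x) - \bar{f}(x))^2 \, p(x) \, dx \right\}.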


Bagging reduces variance (Intuition)

If each single classifier is unstable – that is, it has high
variance – the aggregated classifier \bar{f} has a smaller
variance than a single original classifier.

The aggregated classifier \bar{f} can be thought of as an
approximation to the true average E_S f_S, obtained by
replacing the probability distribution p with the bootstrap
approximation to p, which concentrates mass 1/n at each
point (x_i, y_i).
Variation II: weighting and combining alternatives

- No subsampling; instead, each machine uses different
  weights on the training data.

- Instead of equal voting, use weighted voting.

- Instead of voting, combine using other schemes.


Weak and strong learners

Kearns and Valiant (1988/1989) asked whether there exist two
distinct types of hypothesis spaces of classifiers:

• Strong learners: given a large enough dataset, the classifier
  can learn the target function arbitrarily accurately
  (accuracy 1 − τ for any τ > 0);

• Weak learners: given a large enough dataset, the classifier
  can barely learn the target function, i.e. only slightly
  better than chance (accuracy 1/2 + τ for a small τ > 0).

The hypothesis boosting problem: are the above equivalent?
The original Boosting (Schapire, 1990):
For Classification Only

1. Train a first classifier f_1 on a training set drawn from
   a probability distribution p(x, y). Let ε_1 be the obtained
   training performance;

2. Train a second classifier f_2 on a training set drawn from
   a probability distribution p_2(x, y) that has half its measure
   on the event that f_1 makes a mistake and half on the rest.
   Let ε_2 be the obtained performance;

3. Train a third classifier f_3 on the disagreements of the first
   two – that is, drawn from a probability distribution p_3(x, y)
   whose support is the event that f_1 and f_2 disagree.
   Let ε_3 be the obtained performance.
Boosting (cont.)

Main result: if ε_i < p for all i, the boosted hypothesis

  g = MajorityVote(f_1, f_2, f_3)

has training performance no worse than ε = 3p^2 − 2p^3.

[Figure: plot over p ∈ [0, 0.5] (only the axis tick marks survive extraction), presumably showing the boosted error bound 3p^2 − 2p^3 against the single-classifier error p.]
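
A quick sanity check, not on the original slide, that this bound is an improvement:

  p - (3p^2 - 2p^3) = p\,(1 - 3p + 2p^2) = p\,(1 - p)(1 - 2p) > 0
  \quad \text{for } 0 < p < 1/2,

so the majority vote of the three classifiers has a strictly smaller error bound than p whenever the individual error is below 1/2.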
Adaboost (Freund and Schapire, 1996)

The idea is to adaptively resample the data:

• Maintain a probability distribution over the training set;

• Generate a sequence of classifiers in which the “next”
  classifier focuses on the samples where the “previous”
  classifier failed;

• Weigh machines according to their performance.


Adaboost
Given: a class F = {f : X → {−1, 1}} of weak learners and
the data {(x_1, y_1), ..., (x_n, y_n)}, y_i ∈ {−1, 1}. Initialize the
weights as w_1(i) = 1/n.

For t = 1, ..., T:

1. Find a weak learner f_t based on the weights w_t(i);

2. Compute the weighted error
   \epsilon_t = \sum_{i=1}^{n} w_t(i) \, I(y_i \neq f_t(x_i));

3. Compute the importance of f_t as
   \alpha_t = \frac{1}{2} \ln\left( \frac{1 - \epsilon_t}{\epsilon_t} \right);

4. Update the distribution
   w_{t+1}(i) = \frac{w_t(i) \, e^{-\alpha_t y_i f_t(x_i)}}{Z_t},
   where Z_t = \sum_{i=1}^{n} w_t(i) \, e^{-\alpha_t y_i f_t(x_i)}.
Adaboost (cont.)

Adopt as the final hypothesis

  g(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t f_t(x) \right).
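
A minimal sketch of these steps in Python, assuming decision stumps from scikit-learn as the weak learners (the helper names adaboost and predict are illustrative, not from the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=100):
    """Adaboost with depth-1 trees; y must take values in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                  # w_1(i) = 1/n
    machines, alphas = [], []
    for t in range(T):
        f = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = f.predict(X)
        eps = np.sum(w * (pred != y))        # weighted error epsilon_t
        if eps == 0 or eps >= 0.5:           # no longer a useful weak learner
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        w = w * np.exp(-alpha * y * pred)    # up-weight the mistakes
        w /= w.sum()                         # normalize by Z_t
        machines.append(f)
        alphas.append(alpha)
    return machines, alphas

def predict(machines, alphas, X):
    """g(x) = sign( sum_t alpha_t f_t(x) )."""
    return np.sign(sum(a * f.predict(X) for a, f in zip(alphas, machines)))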
Theory of Boosting

We define the margin of (x_i, y_i) according to the real-valued
function g to be

  margin(x_i, y_i) = y_i g(x_i).

Note that this notion of margin is different from the SVM
margin. This defines a margin for each training point!
Performance of Adaboost

Theorem: Let \gamma_t = 1/2 - \epsilon_t (how much better f_t is on the
weighted sample than tossing a coin). Then

  \frac{1}{n} \sum_{i=1}^{n} I(y_i g(x_i) < 0) \le \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}.
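
A standard consequence, not spelled out on the slide: since 1 - x \le e^{-x},

  \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \le \exp\left( -2 \sum_{t=1}^{T} \gamma_t^2 \right),

so if every weak learner beats coin tossing by at least γ > 0 (i.e. \gamma_t \ge \gamma), the training error of g is at most e^{-2\gamma^2 T} and decays exponentially in the number of rounds.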
Gradient descent view of boosting

We would like to minimize

  \frac{1}{n} \sum_{i=1}^{n} I(y_i g(x_i) < 0)

over the linear span of some base class F. Think of F as
the weak learners.

Two problems: (a) the linear span of F can be huge, and
searching for the minimizer directly is intractable; (b) the
indicator is non-convex, and the problem can be shown to be
NP-hard even for simple F.

Solution to (b): replace the indicator I(yg(x) < 0) with a
convex upper bound \phi(yg(x)).

Solution to (a)?
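
Standard examples of such convex upper bounds (these particular choices are not listed on the slide), each satisfying I(v < 0) \le \phi(v) for v = yg(x):

  \phi(v) = e^{-v} \ \text{(exponential loss, Adaboost)}, \quad
  \phi(v) = \log_2(1 + e^{-v}) \ \text{(logistic loss)}, \quad
  \phi(v) = \max(0, 1 - v) \ \text{(hinge loss, SVM)}.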
Gradient descent view of boosting

Let’s search over the linear span of F step by step. At
each step t, we add a new function f_t ∈ F to the existing

  g = \sum_{i=1}^{t-1} \alpha_i f_i.

Let C_\phi(g) = \frac{1}{n} \sum_{i=1}^{n} \phi(y_i g(x_i)). We wish to find f_t ∈ F
to add to g such that C_\phi(g + \epsilon f_t) decreases. The desired
direction is -\nabla C_\phi(g). We choose the new function f_t such
that it has the greatest inner product with -\nabla C_\phi(g), i.e. it
maximizes

  -\langle \nabla C_\phi(g), f_t \rangle.
Gradient descent view of boosting

One can verify that

  -\langle \nabla C_\phi(g), f_t \rangle = -\frac{1}{n^2} \sum_{i=1}^{n} y_i f_t(x_i) \, \phi'(y_i g(x_i)).

Hence, finding f_t maximizing -\langle \nabla C_\phi(g), f_t \rangle is equivalent
to minimizing the weighted error

  \sum_{i=1}^{n} w_t(i) \, I(f_t(x_i) \neq y_i),

where

  w_t(i) := \frac{\phi'(y_i g(x_i))}{\sum_{j=1}^{n} \phi'(y_j g(x_j))}.

For \phi(yg(x)) = e^{-yg(x)} this becomes Adaboost.
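
A short check of the last claim, not worked out on the slide: for \phi(v) = e^{-v} we have \phi'(v) = -e^{-v}, so (the negative signs cancel in the ratio)

  w_t(i) = \frac{e^{-y_i g(x_i)}}{\sum_{j=1}^{n} e^{-y_j g(x_j)}}
         = \frac{\prod_{s<t} e^{-\alpha_s y_i f_s(x_i)}}{\sum_{j=1}^{n} \prod_{s<t} e^{-\alpha_s y_j f_s(x_j)}},

which is exactly the distribution maintained by Adaboost's update w_{t+1}(i) = w_t(i) \, e^{-\alpha_t y_i f_t(x_i)} / Z_t.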
