
Lecture 22: Ensemble Learning, Bagging and Boosting

Instructor: Prof. Ganesh Ramakrishnan

November 6, 2019
Ensembles
An ensemble is simply a collection of models that are all trained to perform the same task.
An ensemble can consist of many different versions of the same model, or many different
types of models.
The final output for an ensemble of classifiers is typically obtained through a (weighted)
average or vote of the predictions of the different models in the ensemble.
An ensemble of different models that all achieve similar generalization performance often
outperforms any of the individual models.
Question: How is this possible?

Example: Netflix Prize (2009)

The winning team used an ensemble of more than 450 different models.

Ensemble Intuition
Suppose we have an ensemble of binary classification functions f_k(x) for k = 1, ..., K.
Suppose that on average they have the same expected error rate ε = E_{p(x,y)}[y ≠ f_k(x)] < 0.5, but that the errors they make are independent.
The intuition is that the majority of the K classifiers in the ensemble will be correct on many examples where any individual classifier makes an error.
A simple majority vote can significantly improve classification performance by decreasing variance in this setting.
Question: How can we come up with such an ensemble?
1 Eg 1: Dropout in Deep NN - Performs Regularization
2 Eg 2: Bagging (motivated by Independent Datasets) - Balances Bias & Variance
3 Eg 3: Boosting - Balances Bias & Variance
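Before turning to these three examples, here is a small simulation of the majority-vote intuition above (my own sketch, not from the lecture; the values K = 15 and ε = 0.3 are arbitrary): classifiers that each err independently 30% of the time yield a majority vote that errs only about 5% of the time.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, eps = 15, 100_000, 0.3   # ensemble size, number of test points, per-classifier error rate

# Each classifier independently errs with probability eps on every example.
errors = rng.random((K, n)) < eps            # True where classifier k is wrong on example i
majority_wrong = errors.sum(axis=0) > K / 2  # majority vote is wrong iff more than K/2 classifiers err

print("individual error   :", errors.mean())          # ~0.30
print("majority-vote error:", majority_wrong.mean())  # ~0.05 for K = 15, much smaller
```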
Eg 1: Dropout in Deep NN & Ensemble Learning (Section 7.12)

[Figure: feedforward network in which each activation σ_i^{l-1} of layer l-1 is multiplied by its own binary mask µ_i^{l-1} before being fed through the weights w_{ij}^l to layer l]

Masking binary variables µ_i^l ∈ {0, 1} are sampled independently of each other.
The probability Pr(µ_i^{l-1} = 1) = β is a hyperparameter.
Usually β = 0.5 for the hidden layers (l > 1) and β = 0.8 for the input (l = 1).
The resulting σ_j^l is multiplied by 1/β to ensure probabilistic semantics and then multiplied with its own mask µ_j^l.
Equivalent to randomly selecting one of the sub-networks from the complete network and running forward propagation through it ⇒ the final network is a combination (ensemble) of an exponentially large number of similar neural networks.
Dropout training ⇒ minimizing the expected value of the error wrt the random variable µ: E_µ[E(w, b)]
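A minimal NumPy sketch of this masking scheme (my own illustration, not the lecture's code): each unit is kept with probability β and the surviving activations are rescaled by 1/β, exactly as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(sigma, beta, train=True):
    """Apply a dropout mask to the activations of one layer.

    sigma : activations σ of the layer, shape (batch, units)
    beta  : keep probability β = Pr(µ = 1)
    During training each unit is kept with probability β and the surviving
    activations are scaled by 1/β; at test time the layer is left unchanged,
    so the expected activation matches its training-time value.
    """
    if not train:
        return sigma
    mu = (rng.random(sigma.shape) < beta).astype(sigma.dtype)  # mask µ ∈ {0, 1}
    return (sigma * mu) / beta

# toy usage: one hidden layer with β = 0.5
h = np.maximum(0, rng.normal(size=(4, 8)))   # some ReLU activations
print(dropout_forward(h, beta=0.5))
```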
Eg 1: Ensemble Learning through Dropouts

Note: This is an ensemble (with arithmetic mean) of an exponentially large number of neural networks.
See Section 7.12 for other means, such as the geometric mean.
(Source: "Deep Learning" by Ian Goodfellow, Yoshua Bengio and Aaron Courville)
Backpropagation with Dropouts

[Figure: the same feedforward network, with masks µ^{l-1} applied to the activations of layer l-1 and masks µ^{l+1} applied to layer l+1]

With dropout, the gradient of the error E with respect to an activation σ_j^l is propagated only through the units that were kept:

∂E/∂σ_j^l = Σ_{p=1}^{s_{l+1}} (∂E/∂σ_p^{l+1}) (∂σ_p^{l+1}/∂sum_p^{l+1}) µ_p^{l+1} w_{jp}^{l+1}   if µ_j^l = 1
∂E/∂σ_j^l = 0   otherwise
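In the same spirit, a small sketch of the backward pass (my illustration): gradients are simply routed through the sampled mask, so dropped units receive zero gradient; the extra 1/β factor comes from the forward-pass rescaling convention on the previous slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(sigma, beta):
    """Training-time dropout: keep each unit with probability β, rescale by 1/β."""
    mu = (rng.random(sigma.shape) < beta).astype(sigma.dtype)
    out = sigma * mu / beta
    return out, (mu, beta)

def dropout_backward(d_out, cache):
    """Backward pass: the gradient flows only through units whose mask µ = 1."""
    mu, beta = cache
    return d_out * mu / beta   # zero gradient wherever µ_j^l = 0

# toy check: units that were dropped receive exactly zero gradient
sigma = rng.normal(size=(2, 5))
out, cache = dropout_forward(sigma, beta=0.5)
d_sigma = dropout_backward(np.ones_like(out), cache)
print(cache[0])    # the sampled mask µ
print(d_sigma)     # nonzero (= 1/β) only where µ = 1
```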
Eg 2: Independent Training Sets
Suppose we collect multiple independent training sets Tr_1, ..., Tr_K and use each of these training sets to train a different instance of the same classifier, obtaining K classification functions f_1(x), ..., f_K(x).
Classifiers trained in this way are guaranteed to make independent errors on test data.
If the expected error of each classifier is less than 0.5, then the weighted majority vote is guaranteed to reduce the expected generalization error.
Question: What is the weakness of this approach?
Eg 2: Bagging, i.e., Bootstrap AGgregation
Bootstrap AGgregation (Bagging) is an approximation to the previous method that takes a single training set Tr and randomly sub-samples from it K times (with replacement) to form K training sets Tr_1, ..., Tr_K.
Each of these training sets is used to train a different instance of the same classifier ⇒ K classification functions f_1(x), ..., f_K(x).
The errors won't be totally independent because the data sets aren't independent, but the random re-sampling usually introduces enough diversity to decrease the variance and give improved performance.
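A minimal sketch of the resampling step (my illustration; X and y are assumed to be NumPy arrays holding the training set Tr):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(X, y):
    """Draw n indices uniformly at random *with replacement* from an n-point training set."""
    n = len(X)
    idx = rng.integers(0, n, size=n)   # repeated indices are expected; ~36.8% of points are left out
    return X[idx], y[idx]

# toy usage: K bootstrap replicates of a 10-point dataset
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) % 2
K = 3
replicates = [bootstrap_sample(X, y) for _ in range(K)]
print([np.unique(Xk).tolist() for Xk, _ in replicates])  # each replicate misses some points
```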
Eg 2: Bagging and Random Forests
Bagging: Particularly useful for high-variance, high-capacity models.
Historically, closely associated with decision tree models (covered in class)
Random forest classifier: popular extension of bagged trees
- The random forests algorithm further decorrelates the learned trees by only considering a random sub-set of the available features when deciding which variable to split on at each node in the tree.

Lab 8, Q2, Task 1 associates Bagging with Perceptron (as weak learner)
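As a concrete sketch of these pairings (my own illustration, not the lab's actual code), scikit-learn's BaggingClassifier can wrap a Perceptron weak learner, and RandomForestClassifier adds the per-split feature sub-sampling described above. Note that in scikit-learn versions before 1.2 the keyword is base_estimator rather than estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging with a Perceptron as the (high-variance) weak learner
bag = BaggingClassifier(estimator=Perceptron(), n_estimators=25, random_state=0)
bag.fit(X_tr, y_tr)
print("bagged perceptrons:", bag.score(X_te, y_te))

# Random forest: bagged trees plus a random feature sub-set at every split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("random forest     :", rf.score(X_te, y_te))
```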
Eg 2: Bagging on Deep NNs

Uniformly at random (with replacement), sample subsets D_s ⊆ D of the training data and Φ_s ⊆ Φ of the input layer features, and construct a (deep) model N_s for each such random subset.
Bagging Algorithm: For s = 1 to B repeat:
1 Bagging: Draw a bootstrap sample D_s of size n_s from the training data D of size n
2 Learn a random (deep) model N_s on D_s
Output: Ensemble of (deep) models {N_s}_{s=1}^B
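A minimal sketch of this loop (my own illustration; the lecture does not prescribe a library), using a small scikit-learn MLP as a stand-in for the deep model N_s and sub-sampling both the training rows D_s and the input features Φ_s:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

B, feat_frac = 5, 0.7
ensemble = []
for s in range(B):
    rows = rng.integers(0, len(X), size=len(X))        # bootstrap sample D_s of size n_s (= n here)
    feats = rng.choice(X.shape[1], size=int(feat_frac * X.shape[1]),
                       replace=False)                  # feature subset Φ_s
    net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=s)
    net.fit(X[np.ix_(rows, feats)], y[rows])           # learn model N_s on (D_s, Φ_s)
    ensemble.append((net, feats))

print(f"trained {len(ensemble)} bagged networks")
```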
Eg 2: Bagging applied to Query (Test) data

Ensemble of Models {M_s}_{s=1}^B
Consider Pr_n(c | x) for each model n ∈ N and for each class c = 1, ..., K
Decision for a new test point x: Pr(c | x) = (1/N) Σ_{n=1}^N Pr_n(c | x)
For m data points, with |N| = √m, consistency results have been proved (consistency: convergence of the loss to the Bayes Risk, i.e., the expectation of the loss under the distribution of the parameters θ).
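A tiny standalone illustration of this averaging rule (the class probabilities below are made-up values for N = 3 models and K = 4 classes):

```python
import numpy as np

# hypothetical probabilities Pr_n(c | x) from N = 3 models over K = 4 classes for one query point x
probs_per_model = np.array([[0.10, 0.60, 0.20, 0.10],
                            [0.05, 0.70, 0.15, 0.10],
                            [0.20, 0.40, 0.30, 0.10]])

pr = probs_per_model.mean(axis=0)   # Pr(c | x) = (1/N) Σ_n Pr_n(c | x)
print("averaged Pr(c | x):", pr)
print("predicted class   :", pr.argmax())
```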
Eg 2: Bagging – Balancing Bias and Variance

Ensemble of Models {M_s}_{s=1}^B
Decision for a new test point x: Pr(c | x) = (1/N) Σ_{n=1}^N Pr_n(c | x)
Each single (deep) model, viewed as an estimator of the ideal model, has high variance but very little bias (few restrictive assumptions).
But since the models N_i and N_j are uncorrelated, when the decision is averaged out across them, it tends to be very accurate.
Eg 3: Boosting
Boosting is based on iteratively re-weighting the data set instead of randomly resampling it.
Main idea:
- Up-weight the importance of data cases that are misclassified by the classifiers currently in the ensemble.
- Then add a new classifier that will focus on the data cases that are causing the errors.

Can be shown to decrease error, assuming that the base classifier can always achieve an error of less than 0.5 on any data sample.
AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier (covered in class)
Lab 8, Q2, Task 2 associates Boosting with Perceptron (as a weak learner)
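For orientation, a hedged scikit-learn sketch of boosted decision stumps (illustrative only, not the lab's code; in scikit-learn versions before 1.2 the keyword is base_estimator rather than estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# AdaBoost over depth-1 decision trees ("stumps") as the weak learners
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=200, random_state=0)
ada.fit(X_tr, y_tr)
print("AdaBoost + stumps:", ada.score(X_te, y_te))
```

The same wrapper can in principle boost a Perceptron, as in Lab 8, Q2, Task 2, since AdaBoost only needs a weak learner that accepts per-sample weights; with a discrete-output learner such as Perceptron the discrete SAMME variant is used, because there are no class probabilities to combine.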
Weak Models: From Bagging to Boosting

Bagging: Ensemble of Independently Weakly Learnt Models (Eg: Trees {T_s}_{s=1}^B):
Pr(c | x) = (1/B) Σ_{t=1}^B Pr_t(c | x)

Boosting: Weighted combinations of Iteratively Weakly Learnt Models (Eg: Trees {α_t, T_t}_{t=1}^B):
Pr(c | x) = (1/B) Σ_{t=1}^B α_t Pr_t(c | x), where α_t = (1/2) ln((1 − err_t)/err_t)
Adaptive Boosting of Iteratively Learnt Weak Models

Error driven weighted linear combinations of models: α_t = (1/2) ln((1 − err_t)/err_t)

Reweighting of each data instance x^(i) before learning the next model T_t:
ξ_i = ξ_i exp(α_t δ(y^(i) ≠ T_t(x^(i)))).
Note that err_t = Σ_{i=1}^m ξ_i δ(y^(i) ≠ T_t(x^(i))) / Σ_{i=1}^m ξ_i
Adaboost Algorithm

Initialize each instance weight ξ_i = 1/m. For t = 1 to B do:
1 Learn the t-th model T_t by weighing example x^(i) by ξ_i
2 Compute the corresponding error on the training set: err_t = Σ_{i=1}^m ξ_i δ(y^(i) ≠ T_t(x^(i))) / Σ_{i=1}^m ξ_i
3 Compute the error driven weighted linear factor for T_t: α_t = (1/2) ln((1 − err_t)/err_t)
4 Reweigh each data instance x^(i) before learning the next model: ξ_i = ξ_i exp(α_t δ(y^(i) ≠ T_t(x^(i)))).
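A compact NumPy sketch of these four steps (my own illustration), using depth-1 decision stumps from scikit-learn as the weak learners T_t and labels y^(i) ∈ {−1, +1}; the final decision is sign(Σ_t α_t T_t(x)):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, B=50):
    """AdaBoost with decision stumps; y must be in {-1, +1}."""
    m = len(X)
    xi = np.full(m, 1.0 / m)                            # instance weights ξ_i, initialised to 1/m
    models, alphas = [], []
    for t in range(B):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=xi)               # step 1: weigh x^(i) by ξ_i
        wrong = (stump.predict(X) != y)                 # δ(y^(i) ≠ T_t(x^(i)))
        err = xi[wrong].sum() / xi.sum()                # step 2: weighted training error err_t
        err = np.clip(err, 1e-12, 1 - 1e-12)            # guard against log(0) / division by zero
        alpha = 0.5 * np.log((1 - err) / err)           # step 3: α_t = (1/2) ln((1-err_t)/err_t)
        xi = xi * np.exp(alpha * wrong)                 # step 4: up-weight the misclassified points
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    scores = sum(a * clf.predict(X) for clf, a in zip(models, alphas))  # C_t(x) = Σ_j α_j T_j(x)
    return np.sign(scores)

# toy usage: a rule no single axis-aligned stump captures exactly
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.where(X.sum(axis=1) > 0, 1, -1)
models, alphas = adaboost_fit(X, y, B=100)
print("training accuracy:", (adaboost_predict(models, alphas, X) == y).mean())
```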
Adaboost Algorithm: Motivation (Tutorial 10)
Freund & Schapire, 1995: Converting a "weak" PAC¹ learning algorithm that performs just slightly better than random guessing into one with arbitrarily high accuracy.
Let C_t(x) = Σ_{j=1}^t α_j T_j(x) be the boosted linear combination of classifiers until the t-th iteration.
Let the error to be minimized over α_t be the sum of its exponential loss on each data point,

E_t = Σ_{i=1}^m δ(y^(i) ≠ sign(C_t(x^(i)))) ≤ Σ_{i=1}^m exp(−y^(i) C_t(x^(i)))

Claim 1: The error that is the sum of exponential loss on each data point is an upper bound on the simple sum of training errors on each data point.
Claim 2: α_t = (1/2) ln((1 − err_t)/err_t) actually minimizes this upper bound.
Claim 3: If each classifier is slightly better than random, that is if err_t < 1/K, Adaboost achieves zero training error exponentially fast.
¹ http://web.cs.iastate.edu/~honavar/pac.pdf
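A sketch of why Claim 2 holds, assuming labels y^(i) ∈ {−1, +1} and writing ξ_i ∝ exp(−y^(i) C_{t−1}(x^(i))) for the current instance weights (this weight identity is implicit in, rather than stated on, the slide). The exponential-loss bound separates into correctly and incorrectly classified points:

```latex
\sum_{i=1}^{m} e^{-y^{(i)} C_t(x^{(i)})}
  = \sum_{i=1}^{m} e^{-y^{(i)} C_{t-1}(x^{(i)})}\, e^{-\alpha_t\, y^{(i)} T_t(x^{(i)})}
  \;\propto\; e^{-\alpha_t} \sum_{i:\, y^{(i)} = T_t(x^{(i)})} \xi_i
        \;+\; e^{\alpha_t} \sum_{i:\, y^{(i)} \neq T_t(x^{(i)})} \xi_i
% dividing by \sum_i \xi_i and setting the derivative w.r.t. \alpha_t to zero:
-e^{-\alpha_t}\,(1-\mathrm{err}_t) + e^{\alpha_t}\,\mathrm{err}_t = 0
  \;\Longrightarrow\; e^{2\alpha_t} = \frac{1-\mathrm{err}_t}{\mathrm{err}_t}
  \;\Longrightarrow\; \alpha_t = \tfrac{1}{2}\ln\frac{1-\mathrm{err}_t}{\mathrm{err}_t}
```

For the binary case, Claim 3 then follows because, after normalizing, each boosting round multiplies this bound by 2√(err_t(1 − err_t)) < 1 whenever err_t is bounded away from 1/2, so the bound, and hence the training error, decays exponentially.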