Boosting
Ensembles
An ensemble is simply a collection of models that are all trained to perform the same task.
An ensemble can consist of many different versions of the same model, or many different
types of models.
The final output for an ensemble of classifiers is typically obtained through a (weighted)
average or vote of the predictions of the different models in the ensemble.
An ensemble of different models that all achieve similar generalization performance often
outperforms any of the individual models.
Question: How is this possible?
Example: Netflix Prize (2009)
Ensemble Intuition
Suppose we have an ensemble of binary classification functions $f_k(x)$ for $k = 1, \ldots, K$.
Suppose that on average they have the same expected error rate $\epsilon = \mathbb{E}_{p(x,y)}\left[y \neq f_k(x)\right] < 0.5$, but that the errors they make are independent.
The intuition is that the majority of the K classifiers in the ensemble will be correct on
many examples where any individual classifier makes an error.
A simple majority vote can significantly improve classification performance by decreasing
variance in this setting.
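To make this concrete, here is a quick numerical check of the majority-vote intuition (a minimal sketch, assuming $K = 25$ independent classifiers that each err with probability $\epsilon = 0.3$; the specific numbers are illustrative):

```python
import numpy as np

# Simulate K independent classifiers, each wrong with probability eps < 0.5,
# and compare the individual error rate with the majority-vote error rate.
rng = np.random.default_rng(0)
K, eps, n_trials = 25, 0.3, 100_000

# One row per test example; an entry is True when that classifier is wrong.
errors = rng.random((n_trials, K)) < eps

# The majority vote errs only when more than half of the classifiers err.
majority_wrong = errors.sum(axis=1) > K / 2

print(f"individual error rate: {eps}")
print(f"majority-vote error rate: {majority_wrong.mean():.4f}")   # far below eps
```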
Question: How can we come up with such an ensemble?
1. Eg 1: Dropout in Deep NN - Performs Regularization
2. Eg 2: Bagging (motivated by Independent Datasets) - Balances Bias & Variance
3. Eg 3: Boosting - Balances Bias & Variance
Eg 1: Dropout in Deep NN & Ensemble Learning (Section 7.12)
[Figure: a fully connected layer in which each activation $\sigma_i^{l-1}$ is multiplied by its own mask $\mu_i^{l-1}$ before being propagated through the weights $w_{ij}^{l}$ to unit $j$ of layer $l$.]
Masking binary variables $\mu_i^l \in \{0, 1\}$ are sampled independently from each other.
The probability $\Pr(\mu_i^{l-1} = 1) = \beta$ is a hyperparameter.
Usually $\beta = 0.5$ for the hidden layers ($l > 1$) and $\beta = 0.8$ for the input ($l = 1$).
The resulting $\sigma_j^l$ is multiplied by $\frac{1}{\beta}$ to ensure probabilistic semantics and then multiplied with its own mask $\mu_j^l$.
Equivalent to randomly selecting one of the sub-networks from the complete network and running forward propagation through it ⇒ the final network is a combination (ensemble) of an exponentially large number of similar neural networks.
Dropout training ⇒ minimizing the expected value of the error w.r.t. the random variable $\mu$: $\mathbb{E}_{\mu}\left[E(\mathbf{w}, \mathbf{b})\right]$
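A minimal sketch of this forward pass for one layer, using the inverted $1/\beta$ scaling described above (assuming NumPy and a sigmoid activation; the function and argument names are illustrative, not from the course code):

```python
import numpy as np

def dropout_forward(sigma_prev, W, b, beta=0.5, rng=None):
    """One dropout layer forward pass with inverted 1/beta scaling.

    sigma_prev : activations of layer l-1, shape (s_prev,)
    W, b       : weights of shape (s_l, s_prev) and biases of shape (s_l,)
    beta       : keep probability, Pr(mu = 1) = beta
    """
    rng = np.random.default_rng() if rng is None else rng

    # Sample an independent Bernoulli(beta) mask entry per incoming activation.
    mu_prev = (rng.random(sigma_prev.shape) < beta).astype(sigma_prev.dtype)

    # Masked input: equivalent to forward-propagating a randomly chosen sub-network.
    summ = W @ (mu_prev * sigma_prev) + b
    sigma = 1.0 / (1.0 + np.exp(-summ))            # sigmoid activation of layer l

    # Scale by 1/beta so the expected activation matches the unmasked network,
    # then apply this layer's own mask.
    mu = (rng.random(sigma.shape) < beta).astype(sigma.dtype)
    return (sigma / beta) * mu
```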
Eg 1: Ensemble Learning through Dropouts
Backpropagation with Dropouts
[Figure: units of layers $l-1$, $l$, and $l+1$ with their dropout masks $\mu$ and weights $w$, showing the paths that contribute to $\partial E / \partial \sigma_j^l$.]

$$\frac{\partial E}{\partial \sigma_j^l} = \sum_{p=1}^{s^{l+1}} \frac{\partial E}{\partial \sigma_p^{l+1}}\, \frac{\partial \sigma_p^{l+1}}{\partial \mathrm{sum}_p^{l+1}}\, \mu_p^{l+1} w_{jp}^{l+1} \qquad \text{if } \mu_j^l = 1$$

$$\frac{\partial E}{\partial \sigma_j^l} = 0 \qquad \text{otherwise}$$
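A sketch of that gradient computation (assuming NumPy; the argument names mirror the symbols in the formula and are illustrative):

```python
import numpy as np

def dropout_backward_sigma(dE_dsigma_next, dsigma_dsum_next, mu_next, mu, W_next):
    """Gradient of E w.r.t. the activations sigma^l of layer l under dropout.

    dE_dsigma_next   : dE / dsigma_p^{l+1},            shape (s_next,)
    dsigma_dsum_next : dsigma_p^{l+1} / dsum_p^{l+1},  shape (s_next,)
    mu_next, mu      : dropout masks of layers l+1 and l (0/1 arrays)
    W_next           : weights with W_next[p, j] = w_{jp}^{l+1}, shape (s_next, s_l)
    """
    # Per-unit contribution from layer l+1; dropped units contribute nothing.
    upstream = dE_dsigma_next * dsigma_dsum_next * mu_next      # shape (s_next,)

    # Sum over p of upstream_p * w_{jp}^{l+1}, for every unit j of layer l.
    dE_dsigma = W_next.T @ upstream                             # shape (s_l,)

    # Units masked out in layer l receive zero gradient (the "otherwise" case).
    return dE_dsigma * mu
```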
Eg 2: Independent Training Sets
Suppose we collect multiple independent training sets $\mathrm{Tr}_1, \ldots, \mathrm{Tr}_K$ and use each of these training sets to train a different instance of the same classifier, obtaining K classification functions $f_1(x), \ldots, f_K(x)$.
Classifiers trained in this way are guaranteed to make independent errors on test
data.
If the expected error of each classifier is less than 0.5, then the weighted majority
vote is guaranteed to reduce the expected generalization error.
Question: What is the weakness of this approach?
Eg 2: Bagging, i.e., Bootstrap AGgregation
Bootstrap AGgregation (Bagging) is an approximation to the previous method that takes a single training set Tr and randomly sub-samples from it K times (with replacement) to form K training sets $\mathrm{Tr}_1, \ldots, \mathrm{Tr}_K$.
Each of these training sets is used to train a different instance of the same classifier ⇒ K classification functions $f_1(x), \ldots, f_K(x)$.
The errors won't be totally independent because the data sets aren't independent, but the random re-sampling usually introduces enough diversity to decrease the variance and give improved performance.
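A minimal sketch of bagging with a majority vote (assuming scikit-learn decision trees as the base classifier and non-negative integer class labels; the helper names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, K=25, rng=None):
    """Train K trees, each on a bootstrap sample (drawn with replacement) of (X, y)."""
    rng = np.random.default_rng() if rng is None else rng
    m = len(X)
    models = []
    for _ in range(K):
        idx = rng.integers(0, m, size=m)                   # bootstrap sample Tr_k
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the ensemble's predictions (labels must be 0, 1, 2, ...)."""
    votes = np.stack([f.predict(X) for f in models])       # shape (K, n_examples)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```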
Eg 2: Bagging and Random Forests
Bagging: Particularly useful for high-variance, high-capacity models.
Historically, closely associated with decision tree models (covered in class)
Random forest classifier: popular extension of bagged trees
The random forests algorithm further decorrelates the learned trees by only considering a random subset of the available features when deciding which variable to split on at each node in the tree (see the sketch below).
Lab 8, Q2, Task 1 associates Bagging with Perceptron (as a weak learner)
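For the feature-subsampling idea, a hedged sketch using scikit-learn's random forest (the dataset here is synthetic and purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# max_features="sqrt": at each split only a random subset of about sqrt(20) ~ 4
# features is considered, which further decorrelates the bagged trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print("training accuracy:", forest.score(X, y))
```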
Eg 2: Bagging on Deep NNs
Eg 2: Bagging applied to Query (Test) data
Ensemble of models $\{M_s\}_{s=1}^{B}$
Eg 2: Bagging – Balancing Bias and Variance
Ensemble of models $\{M_s\}_{s=1}^{B}$
Can be shown to decrease error, assuming that the base classifier can always achieve an error of less than 0.5 on any data sample.
AdaBoost (with decision trees as the weak learners) is often referred to as the best
out-of-the-box classifier (covered in class)
Lab 8, Q2, Task 2 associates Boosting with Perceptron (as a weak learner)
From Bagging to Boosting
Weak Models: From Bagging to Boosting
Bagging: Ensemble of Independently Weakly Learnt Models (e.g., Trees $\{T_s\}_{s=1}^{B}$):
$$\Pr(c \mid x) = \frac{1}{|B|} \sum_{t=1}^{B} \Pr_t(c \mid x)$$
Boosting, by contrast, uses error-driven weighted linear combinations of models: $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\mathrm{err}_t}{\mathrm{err}_t}\right)$
Adaptive Boosting of Iteratively Learnt Weak Models
Error-driven weighted linear combination of models: $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\mathrm{err}_t}{\mathrm{err}_t}\right)$
Initialize each instance weight $\xi_i = \frac{1}{m}$. For $t = 1$ to $B$ do:
1. Learn the $t$-th model $T_t$ by weighing example $x^{(i)}$ by $\xi_i$.
2. Compute the corresponding error on the training set: $\mathrm{err}_t = \dfrac{\sum_{i=1}^{m} \xi_i\, \delta\!\left(y^{(i)} \neq T_t(x^{(i)})\right)}{\sum_{i=1}^{m} \xi_i}$
3. Compute the error-driven weighted linear factor for $T_t$: $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\mathrm{err}_t}{\mathrm{err}_t}\right)$
4. Reweigh each data instance $x^{(i)}$ before learning the next model: $\xi_i = \xi_i \exp\!\left(\alpha_t\, \delta\!\left(y^{(i)} \neq T_t(x^{(i)})\right)\right)$.
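A compact sketch of these four steps (assuming scikit-learn decision stumps as the weak models $T_t$ and labels $y^{(i)} \in \{-1, +1\}$; variable names mirror the slide, and the small clipping of $\mathrm{err}_t$ is an added numerical guard):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, B=50):
    """AdaBoost with decision stumps as weak models T_t; y must be in {-1, +1}."""
    m = len(X)
    xi = np.full(m, 1.0 / m)                        # instance weights xi_i = 1/m
    models, alphas = [], []
    for t in range(B):
        # 1. Learn T_t, weighing example x^(i) by xi_i.
        T_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=xi)

        # 2. Weighted training error err_t.
        wrong = (T_t.predict(X) != y)
        err_t = np.sum(xi * wrong) / np.sum(xi)
        err_t = np.clip(err_t, 1e-12, 1 - 1e-12)    # numerical guard, not on the slide

        # 3. Error-driven combination factor alpha_t.
        alpha_t = 0.5 * np.log((1.0 - err_t) / err_t)

        # 4. Reweigh misclassified instances before learning the next model.
        xi = xi * np.exp(alpha_t * wrong)

        models.append(T_t)
        alphas.append(alpha_t)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Sign of the boosted linear combination C_B(x) = sum_t alpha_t T_t(x)."""
    scores = sum(a * f.predict(X) for a, f in zip(alphas, models))
    return np.sign(scores)
```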
Adaboost Algorithm: Motivation (Tutorial 10)
Freund & Schapire, 1995: converting a "weak" PAC¹ learning algorithm that performs just slightly better than random guessing into one with arbitrarily high accuracy.
Let $C_t(x) = \sum_{j=1}^{t} \alpha_j T_j(x)$ be the boosted linear combination of classifiers up to the $t$-th iteration.
Let the error to be minimized over $\alpha_t$ be the sum of the exponential loss on each data point; it upper-bounds the number of training mistakes $E_t$:
$$E_t = \sum_{i=1}^{m} \delta\!\left(y^{(i)} \neq \mathrm{sign}\!\left(C_t(x^{(i)})\right)\right) \;\le\; \sum_{i=1}^{m} \exp\!\left(-y^{(i)} C_t(x^{(i)})\right)$$
Claim 1: The error that is the sum of the exponential losses on each data point is an upper bound on the simple sum of training errors on each data point.
Claim 2: $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\mathrm{err}_t}{\mathrm{err}_t}\right)$ actually minimizes this upper bound (see the derivation sketch below).
Claim 3: If each classifier is slightly better than random, that is, if $\mathrm{err}_t < 1/2$, AdaBoost achieves zero training error exponentially fast.
¹ http://web.cs.iastate.edu/~honavar/pac.pdf
AdaBoost Algorithm
Example: AdaBoost
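Continuing the AdaBoost sketch from the algorithm section above, a usage illustration on hypothetical synthetic data (labels mapped to $\{-1, +1\}$):

```python
from sklearn.datasets import make_classification

# Synthetic, purely illustrative data; adaboost_fit / adaboost_predict are the
# sketch functions defined earlier.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y - 1                                   # map labels {0, 1} to {-1, +1}

models, alphas = adaboost_fit(X, y, B=50)
pred = adaboost_predict(models, alphas, X)
print("training accuracy:", (pred == y).mean())
```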