DM - Lecture 4

Gradient boosted ensembles learn multiple weak models sequentially to improve predictions. Each new model focuses on instances the existing ensemble mispredicts by weighting them based on the gradient of the loss function. This gradient boosting approach can optimize any differentiable loss function, making it applicable to both regression and classification tasks. Popular implementations like XGBoost have achieved strong performance through techniques like sparsity-aware splitting and efficient data storage.


1

GRADIENT BOOSTED
ENSEMBLES

Jesse Davis
Ensemble Methods: Learn Multiple
Models and Combine Their Output
2

Key: Models in ensemble are “different”

[Figure: a data set is used to learn several models whose predictions are combined]

Key questions:
1. How do we generate multiple different models?
2. How do we learn the models efficiently?
Canonical Approach: Boosting
3

 Focuses on combining “weak” learners


 Learns an additive ensemble in an iterative,
stagewise manner
F(X) = α1h1(X) + α2h2(X) + … + αtht(X)

Each αi is a real-valued weight; each hi is a weak model, e.g., a depth-bounded tree

 Two big ideas:


 Idea 1: Assign weights to examples to focus
attention on misclassified examples
 Idea 2: Prediction is a weighted vote based on
how accurate each hi is
Recall: AdaBoost
4

 Approach: Learn model iteratively


 Given: Fm-1(X) = α1h1(X) + α2h2(X) + … + αm-1hm-1(X)
 Add: αmhm(X) that minimizes the exponential error
$E = \sum_{j=1}^{N} e^{-y_j \left(F_{m-1}(x_j) + \alpha_m h_m(x_j)\right)}$

 Two big problems


 Just binary classification problems
 A specific loss function, namely exponential loss
Gradient Tree Boosting
5

 The base algorithm is old but very hyped now


 1996: AdaBoost, the first practical boosting
algorithm [Freund et al.]
 1998: Formulate AdaBoost as gradient descent
with a special loss function [Breiman et al.]
 2000: Generalize AdaBoost to gradient boosting,
which works with any differentiable loss [Friedman et al.]
 Since: MART, XGBoost, LightGBM, BitBoost, etc.

 Will focus on least squares regression case

 Will ignore some of the mathematical details


Gradient Boosting is Popular!
6

N° of pages on Kaggle.com containing the term:

Linear models   21,100
TensorFlow      16,900
PyTorch          5,500
AdaBoost         2,290
LightGBM        12,700
XGBoost         17,400
Gradient Boosting Big Picture
7

 Gradient Boosting =
Gradient Descent + Boosting

 Fit an additive model (ensemble) in a greedy forward stage-wise manner

 Each stage introduces a weak learner to address the shortcomings of the current model

 Shortcomings are identified by gradients

Formal Definition: Gradient Boosting
for Squared Loss
8

 Given: {(x1,y1), (x2,y2),…,(xn,yn)}

 Goal: Learn function F: X ↦ Y

 Least squares objective: $\ell = \frac{1}{2}\sum_{i=1}^{n}\left(y_i - F(x_i)\right)^2$

 Representation of F: $F(x_i) = \sum_{t=1}^{T} \eta_t h_t(x_i)$
Intuition
9

 Suppose we start with a simple h: $F(x_i) = \bar{Y}$, the mean of the training labels

 Cannot change F in any way
(e.g., remove a tree, change a parameter)

 Ideal scenario: Add a new h such that:


$F(x_1) + h(x_1) = y_1$
$F(x_2) + h(x_2) = y_2$
$\vdots$
$F(x_n) + h(x_n) = y_n$

Such an h typically won't exist, but we can approximate it


Learning h
10

 Learning h: the two conditions are equivalent
$F(x_i) + h(x_i) = y_i \;\Leftrightarrow\; h(x_i) = y_i - F(x_i), \quad i = 1, \dots, n$

 Construct a new data set and learn h on it:
$\{(x_1, y_1 - F(x_1)),\, (x_2, y_2 - F(x_2)),\, \dots,\, (x_n, y_n - F(x_n))\}$

 Add learned ℎ to 𝐹

 Repeat this procedure
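To make this loop concrete, here is a minimal sketch of gradient boosting for squared loss with regression trees. It assumes NumPy and scikit-learn are available; the number of rounds, the learning rate η, and the tree depth are illustrative choices, not values prescribed by the lecture.

```python
# Minimal residual-fitting loop for squared loss (a sketch, not a full library).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_squared_loss(X, y, n_rounds=100, eta=0.1, max_depth=3):
    f0 = y.mean()                                # start from a constant model: F(x) = mean(y)
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                     # targets for the next weak model h
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                   # learn h on {(x_i, y_i - F(x_i))}
        pred += eta * tree.predict(X)            # F <- F + eta * h
        trees.append(tree)
    return f0, trees

def predict_gbm(f0, trees, X, eta=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += eta * tree.predict(X)
    return pred
```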


Pictorial Representation
11

Additive Model: F(X) = h0(X) + h1(X) + … + hm(X)


[Figure: trees from the function space of all decision trees are added one at a time (h0 + h1 + … + hm); each added tree shrinks an instance's residual toward 0]
Connection to Gradient Descent
12

 So far, we refit on the residual: $r_i = y_i - F(x_i)$

 Gradient descent: Minimize a function by moving in the opposite direction of the gradient

 By changing F, we minimize our loss function


$\ell = \frac{1}{2}\sum_i \left(y_i - F(x_i)\right)^2$

 View F(xi)s as parameters and take derivative


$\frac{\partial \ell}{\partial F(x_i)} = \frac{\partial\, \frac{1}{2}\sum_i \left(y_i - F(x_i)\right)^2}{\partial F(x_i)}$
Squared Error: Can Interpret Residuals
as Negative Gradient
13

$\frac{\partial \ell}{\partial F(x_i)} = \frac{\partial\, \frac{1}{2}\left[\left(y_1 - F(x_1)\right)^2 + \dots + \left(y_i - F(x_i)\right)^2 + \dots + \left(y_n - F(x_n)\right)^2\right]}{\partial F(x_i)}$

No other term involves F(x_i), so those terms' derivatives are 0:

$\frac{\partial \ell}{\partial F(x_i)} = \frac{\partial\, \frac{1}{2}\left(y_i - F(x_i)\right)^2}{\partial F(x_i)}$

By the chain rule:

$\frac{\partial \ell}{\partial F(x_i)} = F(x_i) - y_i$

Negative of the gradient: $-\frac{\partial \ell}{\partial F(x_i)} = y_i - F(x_i)$

Note: the negative gradient equals the residual only for squared loss; for other loss functions it does not


Details on Chain Rule
14

$\frac{\partial\, \frac{1}{2}\left(y_i - F(x_i)\right)^2}{\partial F(x_i)}$

View the numerator as $f(g(F(x_i)))$ with $g(F(x_i)) = y_i - F(x_i)$ and $f(u) = \frac{1}{2}u^2$

By the chain rule, the derivative is $f'(g(\cdot))\, g'(\cdot)$

$f'(g(\cdot)) = y_i - F(x_i)$ and $g'(\cdot) = -1$, thus

$\frac{\partial\, \frac{1}{2}\left(y_i - F(x_i)\right)^2}{\partial F(x_i)} = F(x_i) - y_i$
Illustration of Loss Function
15

Each tree fits a step towards a better prediction:

[Figure: loss curves for L2 loss (regression) and log loss (classification); each new tree takes a step down the loss toward its minimum]
The Power of Gradient Boosting
16

 Abstract away the algorithm from the loss function, and hence from the task

 Thus we can plug in any differentiable loss function and use the same algorithm (see the sketch below):
  Other regression loss functions
  Classification
  Ranking
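As a rough sketch of this loss-agnostic view, only the negative-gradient ("pseudo-residual") function changes per task. The logistic-loss gradient below matches the loss used on the leaf-value slide later on; the sketch assumes that fitting a plain regression tree to the negative gradient is good enough and skips the leaf-value refinements XGBoost and friends add on top.

```python
# Loss-agnostic boosting sketch: swap in a different negative-gradient function per task.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def neg_grad_squared(y, pred):
    return y - pred                              # residual: -d/dF of 1/2 (y - F)^2

def neg_grad_logistic(y, pred):
    # y in {-1, +1}; loss = log(1 + exp(-2 y F(x)))
    return 2.0 * y / (1.0 + np.exp(2.0 * y * pred))

def boost(X, y, neg_grad, n_rounds=100, eta=0.1, max_depth=3):
    pred = np.zeros(len(y), dtype=float)
    trees = []
    for _ in range(n_rounds):
        g = neg_grad(y, pred)                    # pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, g)
        pred += eta * tree.predict(X)
        trees.append(tree)
    return trees
```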
AdaBoost vs. Gradient Boosting
17

 Similarities
 Stage-wise greedy learning of an additive model
 Focus on mispredicted examples

 Typically use decision trees

 Difference 1: Focus on mispredictions


 AdaBoost: High-weight data points
 Gradient Boosting: Gradient of loss function

 Difference 2: Generality
 AdaBoost: Just classification
 Gradient Boosting: Any differentiable loss!
18 Anatomy of a Boosting System
XGBoost
19

 A systems ML/DM paper that is hugely successful in practice
 Wins Kaggle competitions
 Very good and easy to use implementation

 Spurred a number of follow up papers


 LightGBM (Microsoft)
 CatBoost (Yandex)

 BitBoost (DTAI)

 Commonalities in design decisions


Key Design Features
20

 Tree structure: Internal and leaf nodes

 Criteria for evaluating splits

 Optimizing evaluations of the splits

 Add randomization

 XGBoost specific features


 Data storage
 Sparsity aware splitting
Only Consider Binary Trees!
21

 Reals: use as is
 Binary: use as is
 Discrete: one-hot encoding, e.g., Color = {r, y, b}:

          Color = r   Color = y   Color = b
  Xi,r        1           0           0
  Xi,y        0           1           0
  Xi,b        0           0           1

[Figure: example binary tree with internal node Age < 35 (T → leaf 20, F → node HasAuto) and node HasAuto (T → leaf 10, F → leaf 60)]
 Ordinal: Two choices
 One-hot encoding
 Convert to integers

Size = {small, medium, large} → Size = {0, 1, 2}


Leaf Nodes Always Real-Valued
22

Value of leaf node depends on loss function


Let $g_i = \frac{\partial \ell(F(x_i), y_i)}{\partial F(x_i)}$

 Squared loss: $\ell(F(x_i), y_i) = \frac{1}{2}\left(y_i - F(x_i)\right)^2$
Value of leaf j: $w_j = -\frac{\sum_{i \in I_j} g_i}{|I_j|}$, where $I_j$ = instances sorted to leaf j

 Logistic loss: $\ell(F(x_i), y_i) = \log\left(1 + e^{-2 y_i F(x_i)}\right)$
Value of leaf j: $w_j = \frac{\sum_{i \in I_j} -g_i}{\sum_{i \in I_j} |g_i|\,(2 - |g_i|)}$
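A small sketch of these two leaf-value formulas, assuming g is a NumPy array holding the gradients g_i of the instances sorted to one leaf:

```python
import numpy as np

def leaf_value_squared_loss(g):
    # w_j = -(sum of g_i) / |I_j|, i.e. the mean residual of the leaf's instances
    return -g.sum() / len(g)

def leaf_value_logistic_loss(g):
    # w_j = sum(-g_i) / sum(|g_i| * (2 - |g_i|))
    return -g.sum() / np.sum(np.abs(g) * (2.0 - np.abs(g)))
```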
Predictions Always Real-Valued!
23
$F(x_i) = \sum_{t=1}^{T} \eta_t h_t(x_i)$
 Not problematic for regression tasks

 Threshold the raw score for classification tasks
E.g.: $\mathrm{sign}(F(x_i))$
 Convert the raw score into a probability
 Train a logistic regression model to convert the raw
score to a probability (aka Platt scaling)
 Loss-function-specific conversions
E.g.: logistic loss: $p(y_i = 1 \mid x_i) = \frac{1}{1 + \exp(-2F(x_i))}$
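A tiny sketch of these two conversions; raw_score stands for F(x_i), and the factor 2 matches the logistic loss used on these slides:

```python
import numpy as np

def classify(raw_score):
    # Threshold the raw score: sign(F(x))
    return np.sign(raw_score)

def probability(raw_score):
    # Logistic-loss-specific conversion: p(y = 1 | x) = 1 / (1 + exp(-2 F(x)))
    return 1.0 / (1.0 + np.exp(-2.0 * raw_score))
```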


Evaluating Splits: Loss Reduction
24

Split($I_P$: instance set, split conditions S)
  for each s ∈ S do
    Let $I_L$ ($I_R$) be the instance set of the left (right) child
    $\Delta\text{loss} = \frac{\left(\sum_{i \in I_L} g_i\right)^2}{|I_L|} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{|I_R|} - \frac{\left(\sum_{i \in I_P} g_i\right)^2}{|I_P|}$

Bottleneck: ~90% of training time spent evaluating splits


Trick 1: Exploit Tree Structure
25

[Figure: a node Age < 35 whose child is being split further; the parent-node quantities below were already computed, so reuse them]

$\text{Loss} = \frac{\left(\sum_{i \in I_L} g_i\right)^2}{|I_L|} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{|I_R|} - \frac{\left(\sum_{i \in I_P} g_i\right)^2}{|I_P|}$

Let $\Sigma_P = \sum_{i \in I_P} g_i$
Let $\Sigma_L = \sum_{i \in I_L} g_i$
Let $\Sigma_R = \Sigma_P - \Sigma_L$   (exploit that splits are binary: saves add operations)

$\text{Loss} = \frac{\Sigma_L^2}{|I_L|} + \frac{\Sigma_R^2}{|I_P| - |I_L|} - \frac{\Sigma_P^2}{|I_P|}$
Biggest Problem: Continuous Features
26

  X1   …   Xd    Y
  0    …   -1    5
  0    …    5  -10
  …    …    …    …
  1    …   10   10

1. Copy the feature column: -1 5 95 -5 -1 … 10
2. Sort the array: -5 -1 -1 5 10 … 95
3. Try all possible splits: Xd < -1, Xd < 5, …

Problem: Lots of thresholds to try
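Before the histogram trick on the next slide, here is a sketch combining the last two slides: sort one feature, sweep every threshold, and compute the parent gradient sum once so each right-child statistic follows by subtraction. It illustrates the idea only; it is not XGBoost's actual implementation.

```python
# Exhaustive split search over one sorted feature, with the gradient-sum trick.
import numpy as np

def best_split(feature, g):
    order = np.argsort(feature)
    f_sorted, g_sorted = feature[order], g[order]
    sigma_p, n_p = g.sum(), len(g)
    parent_term = sigma_p ** 2 / n_p

    best_gain, best_threshold = 0.0, None
    sigma_l = 0.0
    for i in range(n_p - 1):
        sigma_l += g_sorted[i]                   # running left-child gradient sum
        if f_sorted[i] == f_sorted[i + 1]:
            continue                             # cannot split between equal values
        sigma_r = sigma_p - sigma_l              # right child by subtraction
        n_l, n_r = i + 1, n_p - (i + 1)
        gain = sigma_l**2 / n_l + sigma_r**2 / n_r - parent_term
        if gain > best_gain:
            best_gain = gain
            best_threshold = (f_sorted[i] + f_sorted[i + 1]) / 2.0
    return best_threshold, best_gain
```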
Trick 2: Histograms
27

 Determine a limited set of split points per node
 Can evaluate these in one pass over the data

1. Pick a small number of equal-width bins (e.g., 256)
2. Pass over the data and fill the bins
3. Only consider splits at bin boundaries

[Figure: each bin (e.g., -5 ≤ Xd < 5, 5 ≤ Xd < 15, …, 85 ≤ Xd ≤ 95) stores a (count, sum of gi) pair, e.g., (22, 50), (3, -5), …, (6, 20); candidate splits are Xd < 5, Xd < 15, …]
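A short sketch of histogram-based split finding under the assumptions above (equal-width bins, splits only at bin boundaries); the bin count of 256 is the illustrative number from the slide.

```python
import numpy as np

def histogram_best_split(feature, g, n_bins=256):
    edges = np.linspace(feature.min(), feature.max(), n_bins + 1)
    bin_idx = np.digitize(feature, edges[1:-1])  # bin index in [0, n_bins - 1]

    # One pass over the data: per-bin instance counts and gradient sums.
    counts = np.bincount(bin_idx, minlength=n_bins)
    grad_sums = np.bincount(bin_idx, weights=g, minlength=n_bins)

    sigma_p, n_p = g.sum(), len(g)
    parent_term = sigma_p**2 / n_p
    best_gain, best_edge = 0.0, None
    sigma_l, n_l = 0.0, 0
    for b in range(n_bins - 1):                  # splits only at bin boundaries
        sigma_l += grad_sums[b]
        n_l += counts[b]
        n_r = n_p - n_l
        if n_l == 0 or n_r == 0:
            continue
        gain = sigma_l**2 / n_l + (sigma_p - sigma_l)**2 / n_r - parent_term
        if gain > best_gain:
            best_gain, best_edge = gain, edges[b + 1]
    return best_edge, best_gain
```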
Add Randomization
28

 Bagging: Bootstrap replicate S’ by sampling |S|


examples with replacement from S
[Figure: Data → Bootstrap Replicate_t; each example has a 63.2% chance of appearing in the replicate]


 Feature bagging: Randomly select subset of
columns when learning each tree
 Randomness in splits: Randomly select
subset of splits at each internal node
XGBoost Data Representation:
(Compressed) Column Format
29

Matrix format (one row per example):

  Married  Age  Type     Class
  Y        45   Sedan    -10
  N        20   SUV        5
  Y        30   Sport     10
  Y        60   Berline  -15

Column format: the same data, stored one column (feature) at a time

☺ Easier and faster to randomly select a feature
☺ Can presort continuous features
☹ More record keeping (if you presort)
Overhead with Presorting
30

Unsorted:             Presorted:
  Feature    Y          Feature    Y
  45       -10          20       -10
  20         5          30         5
  30        10          45        10
  60       -15          60       -15

 Unsorted: array indices are aligned between feature and Y, so it is easy to look up the Y value
 Presorted: alignment is broken, so a pointer to the Y value must be stored for each feature
XGBoost: Sparsity-Aware Splitting
31

 Many entries “zero” due to one-hot encoding,


missing data, natural sparseness, etc.
 Do not store “zero” entries
[Figure: a dense feature column with many zero entries next to its sparse representation; the zero entries are dropped and only the non-zero entries are stored]

Fewer entries to iterate over:


Can result in 50x speed ups!
Which Details Were Skipped?
32

 Regularization to avoid overfitting


 Restrict depth of trees
 Restrict number of leaves

 Add penalty term to loss function

 Setting the learning rate η [see Friedman 2011]

 Derivations and discussion of all loss functions


33 Issues with Ensembles
Two Problems with Ensembles
34

 Problem: Over multiple models, how can I


determine which features are interesting?

Solution: Feature importances

 Problem: Ensembles have multiple models


 Predictions take longer
 Models take up more space

Solution: Model compression


Feature Importance:
Mean Decrease in Impurity
35

$\text{MDI}(v) = \frac{1}{|E|} \sum_{m \in E} \sum_{n_v \in m} \frac{|S_n|}{|D|}\, \text{Gain}(v, S_n)$

E = {$m_1$, …, $m_t$} is an ensemble of models
$n_v$ is a node splitting on variable v
$S_n$ is the set of training examples reaching node $n_v$
D is the training data
Gain(v, $S_n$) is the impurity reduction of splitting $S_n$ on v
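For reference, scikit-learn's impurity-based feature_importances_ attribute follows this mean-decrease-in-impurity idea, averaged over the trees in the ensemble; the data set and model settings below are purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Fit a small gradient boosted ensemble on synthetic data.
X, y = make_regression(n_samples=500, n_features=8, random_state=0)
model = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, y)

# Impurity-based importances, one value per feature (they sum to 1).
for idx, imp in enumerate(model.feature_importances_):
    print(f"feature {idx}: {imp:.3f}")
```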
Feature Importance:
Permutation Test
36

 For each variable, randomly permute its values
in the out-of-bag examples
 Out of bag: Examples not selected in the bootstrap
 Permutation of Type:

  Original:                        Type permuted:
  Married  Age  Type     Class     Married  Age  Type     Class
  N        20   SUV      N         N        20   Sport    N
  Y        30   Sport    N         Y        30   Berline  N
  Y        60   Berline  Y         Y        60   SUV      Y

 Measure decrease in accuracy
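A minimal sketch of the permutation test, assuming a fitted model with a score() method and some held-out (e.g., out-of-bag or validation) data X_val, y_val:

```python
import numpy as np

def permutation_importance(model, X_val, y_val, seed=0):
    rng = np.random.default_rng(seed)
    baseline = model.score(X_val, y_val)
    importances = []
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        perm = rng.permutation(X_val.shape[0])
        X_perm[:, j] = X_val[perm, j]            # permute one column's values
        importances.append(baseline - model.score(X_perm, y_val))
    return np.array(importances)                 # drop in score per feature
```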


Model Compression
37

 Idea: Compress ensemble by mimicking its


behavior with a model that is smaller and
executes quickly
 Build a new data set D' = {(xj', E(xj'))}
 xj' is an example
 E(xj') is the ensemble's prediction for xj'

 Train a new model M on D’


Two questions:
1. How to generate data?
2. What model?
Approach
38

 Generate data: For each example (xj,yj) in D,


create new example (xj’, E(xj’)):
 Let (xn,yn) be (xj,yj)’s nearest neighbor in D
 For i = 0 to d
◼ r ~ U[0,1]
◼ If (r < 0.5) then xj,i' = xj,i
◼ Else xj,i' = xn,i

 Add (xj’, E(xj’)) to D’


 Train a neural network on D’
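A rough sketch of this compression recipe, assuming the ensemble exposes a predict() method; the nearest-neighbour search and the small scikit-learn neural network are illustrative stand-ins for the components the slide leaves open.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.neural_network import MLPRegressor

def compress_ensemble(ensemble, X, hidden=(64, 64), seed=0):
    rng = np.random.default_rng(seed)

    # For each example, mix its features with those of its nearest neighbour.
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)                    # idx[:, 0] is the point itself
    mask = rng.random(X.shape) < 0.5             # r ~ U[0,1] per feature
    X_new = np.where(mask, X, X[idx[:, 1]])

    # Label the synthetic examples with the ensemble's predictions and mimic them.
    y_new = ensemble.predict(X_new)
    mimic = MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000)
    mimic.fit(X_new, y_new)
    return mimic
```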
39 Applications
Web Search
40

Query: “107.7 the end”


How Helpful Is an On-the-Ball Action in
a Soccer Match?
42

A soccer match has ±1600 on-the-ball actions


Problem: 99% of actions do not directly affect the score

[Figure: pitch diagram of a pass toward the goal]

Question: How valuable is an action (e.g., pass, dribble,…)?


Contribution Rating: How Much Did an
Action Contribute to the Scoreline?
43

 Insight: Action changes game state


 Assign value to each game state si
 Value of an action: $CR(s_i, a_i) = V(s_{i+1}) - V(s_i)$
Example: V(si) = 0.01 and V(si+1) = 0.05, so Value(pass) = 0.04 ≈ the pass' expected Δ in goal difference
Valuing a Game State
44

Intuition: Good actions either


1) Increase the short-term chance of scoring
2) Decrease the short-term chance of conceding
$V(s_i) = P_{\text{scores}}(s_i) - P_{\text{concedes}}(s_i)$

Estimate these from historical data


 Game state uses last 3 actions: si = {si-2,si-1,si}
 An action's effect is temporally limited:
Positive example: a goal by either team in the next 10 actions
Negative example: no goals in the next 10 actions

 Train a gradient boosted probability estimator
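A hedged sketch of this rating, assuming two fitted gradient boosted classifiers (hypothetical names scorer and conceder) that output the short-term scoring and conceding probabilities from game-state features:

```python
def state_value(scorer, conceder, state_features):
    # V(s) = P_scores(s) - P_concedes(s)
    p_score = scorer.predict_proba(state_features)[:, 1]
    p_concede = conceder.predict_proba(state_features)[:, 1]
    return p_score - p_concede

def contribution_rating(scorer, conceder, state_before, state_after):
    # CR(s_i, a_i) = V(s_{i+1}) - V(s_i)
    return (state_value(scorer, conceder, state_after)
            - state_value(scorer, conceder, state_before))
```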


Represent Game State Using
> 20 Features
45

3 types of features:
1) Simple: describe one action (e.g., Type = Pass, Result = Success, Start Location = (60,20), End Location = (75,60), Body Part = Foot)
2) Complex: compare consecutive actions (e.g., distance to goal, time difference between consecutive actions such as a pass and a tackle)
3) Contextual: game info (e.g., Home: 2; Away: 0; GD: +2; Time = 80 min)
Example: Barcelona’s 3-0 goal versus
Real Madrid (Dec 23, 2017)

[Figure: the sequence of actions leading to the goal; an annotation marks where the phase starts]


Application Scouting: Top-5 U21
players in the 2017/18 Dutch League
47

  Rank  Player             Team     Age  VAEP rating  June 2018  June 2019  Price delta
  1     David Neres        Ajax     21   0.62         €20m       €45m       +€25m
  2     Mason Mount        Vitesse  19   0.62         €4m        €12m       +€8m
  3     Frenkie de Jong    Ajax     20   0.50         €7m        €85m       +€78m
  4     Steven Bergwijn    PSV      20   0.49         €12m       €35m       +€23m
  5     Donny van de Beek  Ajax     21   0.47         €14m       €40m       +€26m


Summary
49

 Gradient boosting generalizes AdaBoost to work
with any differentiable loss function

 Mainly done with trees, though with some tweaks
to the standard tree learner

 Many highly performant implementations

 Widely applied to real problems


Questions?
50
51
AdaBoost from Principles of ML
for Easy Reference / Recall



Boosting
52

 Arose in the theoretical PAC learning community


 Strong PAC learner ≈ For arbitrary ε and δ, with
probability 1-δ, produce model with error of < ε
 Boosting assumes a weak learner:
 Cannot PAC learn for arbitrary ε and δ
 Models are (slightly) better than random guessing

 General approach
 Learns an additive model
 Greedily adds model at a time

 Focuses on the current model correcting
examples that are incorrectly predicted
AdaBoost: First Practical Booster
53

 Works for binary classification problems


 Learns additive model iteratively
F(X) = α1h1(X) + α2h2(X) + … + αtht(X)

 Two big ideas:


 Idea 1: Assign weights to examples to focus
attention on misclassified examples
 Idea 2: Prediction is a weighted vote based on
how accurate each hi is
Boosting Example
54

[Figure: 2-D feature space with positive (+) and negative (-) examples]
 Assume that we are going to make one axis-parallel cut through feature space



Boosting Example
55

[Figure: the first axis-parallel split of the data; it misclassifies 3 examples]

 Errors: 3
 Upweight the mistakes, downweight everything
else
Boosting Example
56

[Figure: the reweighted data, with the misclassified examples enlarged, and the next split learned on it]


Boosting Example
57

[Figure: the data with the splits learned so far]

Three key questions for AdaBoost:


1. What model should we pick (in theory)?
2. How should we set α ?
3. What practical details are important?
AdaBoost Setting
58

 Given: S = {(xj,yj)} with j ∊ {1,…,n} and y ∊ {-1,+1}

 Learn: F(X) = α1h1(X) + α2h2(X) + … + αtht(X)

 Prediction: $F(x_i) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x_i)\right)$, i.e., $-1$ (negative) or $+1$ (positive)

1. What model should we pick (in theory)?


2. How should we set α ?
3. What practical details are important?
What Classifier Should We Pick?
59

 Suppose our current function is:


Fm-1(X) = α1h1(X) + α2h2(X) + … + αm-1hm-1(X)

 Goal:
 Pick αmhm(X) to add to the model
 To minimize the error: $E = \sum_{i=1}^{N} e^{-y_i\left(F_{m-1}(x_i) + \alpha_m h_m(x_i)\right)}$



Understanding the Error:
60
Exponential Loss Function
$E = \sum_{i=1}^{N} e^{-y_i\left(F_{m-1}(x_i) + \alpha_m h_m(x_i)\right)}$

 The loss is small if the predicted and true label have the same sign:
$\mathrm{sign}(y) = \mathrm{sign}(\hat{y}) \Rightarrow -y\hat{y} < 0$
$\mathrm{sign}(y) \neq \mathrm{sign}(\hat{y}) \Rightarrow -y\hat{y} > 0$

 The loss drops quickly; e.g., for $y = 1$: $e^{-(1)(-2)} = 7.4$ and $e^{-(1)(-1)} = 2.7$

 The loss is always > 0

[Figure: exponential loss plotted against $\hat{y}$ for the case $y = 1$]
What Classifier Should We Pick?
61
 Goal: Minimize $E = \sum_{i=1}^{N} e^{-y_i\left(F_{m-1}(x_i) + \alpha_m h_m(x_i)\right)}$

 Let $w_i^{(m)} = e^{-y_i F_{m-1}(x_i)}$

This is a fixed per-example weight (the example's current error contribution), because we cannot change classifiers 1 to m-1 in the ensemble



What Classifier Should We Pick?
62
 Goal: Minimize $E = \sum_{i=1}^{N} e^{-y_i\left(F_{m-1}(x_i) + \alpha_m h_m(x_i)\right)}$

 Let $w_i^{(m)} = e^{-y_i F_{m-1}(x_i)}$ and rewrite the error:

$E = \underbrace{\sum_{y_i = h_m(x_i)} w_i^{(m)} e^{-\alpha_m}}_{\text{weight of correct predictions}} + \underbrace{\sum_{y_i \neq h_m(x_i)} w_i^{(m)} e^{\alpha_m}}_{\text{weight of incorrect predictions}}$

$E = \underbrace{\sum_{i=1}^{N} w_i^{(m)} e^{-\alpha_m}}_{\text{as if all of } h_m\text{'s predictions were correct}} + \left(e^{\alpha_m} - e^{-\alpha_m}\right) \underbrace{\sum_{y_i \neq h_m(x_i)} w_i^{(m)}}_{\text{the } h_m \text{ that minimizes this sum minimizes } E}$
Setting α𝑚
63

 Let $W_c = \sum_{y_i = h_m(x_i)} w_i^{(m)}$ and $W_{ic} = \sum_{y_i \neq h_m(x_i)} w_i^{(m)}$

 Then $E = W_c e^{-\alpha_m} + W_{ic} e^{\alpha_m}$

$\frac{dE}{d\alpha_m} = -W_c e^{-\alpha_m} + W_{ic} e^{\alpha_m} = 0$

$-W_c + W_{ic} e^{2\alpha_m} = 0$

$\alpha_m = \frac{1}{2} \ln \frac{W_c}{W_{ic}}$

$\alpha_m = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m}$ with $\varepsilon_m = \frac{\sum_{y_i \neq h_m(x_i)} w_i^{(m)}}{\sum_i w_i^{(m)}}$
AdaBoost
64

Given S = {(xj, yj)} where j ∊ {1,…,n}, integer T

$w_i^{(1)} = 1/n$   (all examples start with the same weight)
for t = 1 to T:
    Find a classifier $h_t$ with small weighted error $\epsilon_t = \sum_{h_t(x_i) \neq y_i} w_i^{(t)}$
    if ($\epsilon_t > 1/2$) then break
    $\beta_t = \epsilon_t / (1 - \epsilon_t)$
    if ($h_t(x_i) = y_i$) then $w_i^{(t+1)} = w_i^{(t)} \beta_t$   (down-weight correct predictions)
    $w_i^{(t+1)} = \frac{w_i^{(t+1)}}{\sum_j w_j^{(t+1)}}$   (normalize weights)
    $\alpha_t = \ln \frac{1}{\beta_t}$
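A compact sketch of this loop, using depth-1 decision stumps from scikit-learn as the weak learner and assuming labels y in {-1, +1}; the guard against a zero-error stump is an added practical detail, not part of the slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # all examples start with the same weight
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()                 # weighted error
        if eps > 0.5 or eps == 0.0:              # stop if too weak (or perfect)
            break
        beta = eps / (1.0 - eps)
        w = np.where(pred == y, w * beta, w)     # down-weight correct predictions
        w /= w.sum()                             # normalize weights
        stumps.append(stump)
        alphas.append(np.log(1.0 / beta))
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```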
AdaBoost in Practice
65

 Typically use a depth-bounded decision tree


 Sometimes stumps: Just a single split
 Often depth 5 or 6

 Dealing with weighted instances


 Approach 1: Adapt the learner to learn from
weighted instances; trivial for decision trees: just
use weighted counts in the split criteria
 Approach 2: Sample a large (≫n) set of
unweighted instances according to the weight
distribution and run the learner
