DM - Lecture 4

Gradient boosted ensembles learn multiple weak models sequentially to improve predictions. Each new model focuses on instances the existing ensemble mispredicts by weighting them based on the gradient of the loss function. This gradient boosting approach can optimize any differentiable loss function, making it applicable to both regression and classification tasks. Popular implementations like XGBoost have achieved strong performance through techniques like sparsity-aware splitting and efficient data storage.


1

GRADIENT BOOSTED
ENSEMBLES

Jesse Davis
Ensemble Methods: Learn Multiple
Models and Combine Their Output
2

Key: Models in ensemble are “different”

[Figure: a data set is used to learn several models whose predictions are combined]

Key questions:
1. How do we generate multiple different models?
2. How do we learn the models efficiently?
Canonical Approach: Boosting
3

 Focuses on combining “weak” learners


 Learns an additive ensemble in an iterative,
stagewise manner
F(X) = α1h1(X) + α2h2(X) + … + αtht(X)

Each αi is a real-valued weight; each hi is a weak model, e.g., a depth-bounded tree

 Two big ideas:


 Idea 1: Assign weights to examples to focus
attention on misclassified examples
 Idea 2: Prediction is a weighted vote based on
how accurate each hi is
Recall: AdaBoost
4

 Approach: Learn model iteratively


 Given: Fm-1(X) = α1h1(X) + α2h2(X) + … + αm-1hm-1(X)
 Add: αmhm(X) that minimizes the exponential error
$E = \sum_{j=1}^{N} e^{-y_j \left(F_{m-1}(x_j) + \alpha_m h_m(x_j)\right)}$

 Two big problems


 Just binary classification problems
 A specific loss function, namely exponential loss
Gradient Tree Boosting
5

 The base algorithm is old but very hyped now


 1996: AdaBoost, the first practical boosting
algorithm [Freund et al.]
 1998: Formulate AdaBoost as gradient descent
with a special loss function [Breiman et al.]
 2000: Generalize AdaBoost to gradient boosting,
which works with any differentiable loss [Friedman et al.]
 Since: MART, XGBoost, LightGBM, BitBoost, etc.

 Will focus on least squares regression case

 Will ignore some of the mathematical details


Gradient Boosting is Popular!
6

N° of pages on Kaggle.com containing the term:

Linear models   21,100
TensorFlow      16,900
PyTorch          5,500
AdaBoost         2,290
LightGBM        12,700
XGBoost         17,400
Gradient Boosting Big Picture
7

 Gradient Boosting =
Gradient Descent + Boosting

 Fit an additive model (ensemble) in a greedy forward stage-wise manner

 Each stage introduces a weak learner to address the shortcomings of the current model

 Shortcomings are identified by gradients

Formal Definition: Gradient Boosting
for Squared Loss
8

 Given: {(x1,y1), (x2,y2),…,(xn,yn)}

 Goal: Learn function F: X ↦ Y

 Least squares objective: $\ell = \frac{1}{2}\sum_{i=1}^{n}\left(y_i - F(x_i)\right)^2$

 Representation of F: $F(x_i) = \sum_{t=1}^{T} \eta_t h_t(x_i)$
Intuition
9

 Suppose we start with a simple h: $F(x_i) = \bar{Y}$, the mean of the training labels

 Cannot change F in any way
(e.g., remove a tree, change a parameter)

 Ideal scenario: Add a new h such that:


$F(x_1) + h(x_1) = y_1$
$F(x_2) + h(x_2) = y_2$
$\vdots$
$F(x_n) + h(x_n) = y_n$

Such an h typically won't exist, but we can approximate it


Learning h
10

 Learning h: the two conditions are equivalent
$F(x_i) + h(x_i) = y_i \;\Leftrightarrow\; h(x_i) = y_i - F(x_i), \quad i = 1, \dots, n$

 Construct a new data set and learn h on it:
$\{(x_1, y_1 - F(x_1)),\, (x_2, y_2 - F(x_2)),\, \dots,\, (x_n, y_n - F(x_n))\}$

 Add learned ℎ to 𝐹

 Repeat this procedure
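To make this loop concrete, here is a minimal sketch of gradient boosting for squared loss with regression trees. It assumes NumPy and scikit-learn are available; the number of rounds, the learning rate η, and the tree depth are illustrative choices, not values prescribed by the lecture.

```python
# Minimal residual-fitting loop for squared loss (a sketch, not a full library).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_squared_loss(X, y, n_rounds=100, eta=0.1, max_depth=3):
    f0 = y.mean()                                # start from a constant model: F(x) = mean(y)
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                     # targets for the next weak model h
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                   # learn h on {(x_i, y_i - F(x_i))}
        pred += eta * tree.predict(X)            # F <- F + eta * h
        trees.append(tree)
    return f0, trees

def predict_gbm(f0, trees, X, eta=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += eta * tree.predict(X)
    return pred
```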


Pictorial Representation
11

Additive Model: F(X) = h0(X) + h1(X) + … + hm(X)


[Figure: trees from the function space of all decision trees are added one at a time (h0 + h1 + … + hm); each added tree shrinks an instance's residual toward 0]
Connection to Gradient Descent
12

 So far, we refit on the residual: $r_i = y_i - F(x_i)$

 Gradient descent: Minimize a function by moving in the opposite direction of the gradient

 By changing F, we minimize our loss function


$\ell = \frac{1}{2}\sum_i \left(y_i - F(x_i)\right)^2$

 View F(xi)s as parameters and take derivative


$\frac{\partial \ell}{\partial F(x_i)} = \frac{\partial\, \frac{1}{2}\sum_i \left(y_i - F(x_i)\right)^2}{\partial F(x_i)}$
Squared Error: Can Interpret Residuals
as Negative Gradient
13

$\frac{\partial \ell}{\partial F(x_i)} = \frac{\partial\, \frac{1}{2}\left[\left(y_1 - F(x_1)\right)^2 + \dots + \left(y_i - F(x_i)\right)^2 + \dots + \left(y_n - F(x_n)\right)^2\right]}{\partial F(x_i)}$

No other term involves F(x_i), so those terms' derivatives are 0:

$\frac{\partial \ell}{\partial F(x_i)} = \frac{\partial\, \frac{1}{2}\left(y_i - F(x_i)\right)^2}{\partial F(x_i)}$

By the chain rule:

$\frac{\partial \ell}{\partial F(x_i)} = F(x_i) - y_i$

Negative of the gradient: $-\frac{\partial \ell}{\partial F(x_i)} = y_i - F(x_i)$

Note: the negative gradient equals the residual only for squared loss; for other loss functions it does not


Details on Chain Rule
14

$\frac{\partial\, \frac{1}{2}\left(y_i - F(x_i)\right)^2}{\partial F(x_i)}$

View the numerator as $f(g(F(x_i)))$ with $g(F(x_i)) = y_i - F(x_i)$ and $f(u) = \frac{1}{2}u^2$

By the chain rule, the derivative is $f'(g(\cdot))\, g'(\cdot)$

$f'(g(\cdot)) = y_i - F(x_i)$ and $g'(\cdot) = -1$, thus

$\frac{\partial\, \frac{1}{2}\left(y_i - F(x_i)\right)^2}{\partial F(x_i)} = F(x_i) - y_i$
Illustration of Loss Function
15

Each tree fits a step towards a better prediction:

[Figure: loss curves for L2 loss (regression) and log loss (classification); each new tree takes a step down the loss toward its minimum]
The Power of Gradient Boosting
16

 Abstract away the algorithm from the loss function, and hence from the task

 Thus we can plug in any differentiable loss function and use the same algorithm (see the sketch below):
  Other regression loss functions
  Classification
  Ranking
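As a rough sketch of this loss-agnostic view, only the negative-gradient ("pseudo-residual") function changes per task. The logistic-loss gradient below matches the loss used on the leaf-value slide later on; the sketch assumes that fitting a plain regression tree to the negative gradient is good enough and skips the leaf-value refinements XGBoost and friends add on top.

```python
# Loss-agnostic boosting sketch: swap in a different negative-gradient function per task.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def neg_grad_squared(y, pred):
    return y - pred                              # residual: -d/dF of 1/2 (y - F)^2

def neg_grad_logistic(y, pred):
    # y in {-1, +1}; loss = log(1 + exp(-2 y F(x)))
    return 2.0 * y / (1.0 + np.exp(2.0 * y * pred))

def boost(X, y, neg_grad, n_rounds=100, eta=0.1, max_depth=3):
    pred = np.zeros(len(y), dtype=float)
    trees = []
    for _ in range(n_rounds):
        g = neg_grad(y, pred)                    # pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, g)
        pred += eta * tree.predict(X)
        trees.append(tree)
    return trees
```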
AdaBoost vs. Gradient Boosting
17

 Similarities
 Stage-wise greedy learning of an additive model
 Focus on mispredicted examples

 Typically use decision trees

 Difference 1: Focus on mispredictions


 AdaBoost: High-weight data points
 Gradient Boosting: Gradient of loss function

 Difference 2: Generality
 AdaBoost: Just classification
 Gradient Boosting: Any differentiable loss!
18 Anatomy of a Boosting System
XGBoost
19

 A systems ML/DM paper that is hugely successful in practice
 Wins Kaggle competitions
 Very good and easy to use implementation

 Spurred a number of follow up papers


 LightGBM (Microsoft)
 CatBoost (Yandex)

 BitBoost (DTAI)

 Commonalities in design decisions


Key Design Features
20

 Tree structure: Internal and leaf nodes

 Criteria for evaluating splits

 Optimizing evaluations of the splits

 Add randomization

 XGBoost specific features


 Data storage
 Sparsity aware splitting
Only Consider Binary Trees!
21

 Reals: use as is
 Binary: use as is
 Discrete: one-hot encoding, e.g., Color = {r, y, b}:

          Color = r   Color = y   Color = b
  Xi,r        1           0           0
  Xi,y        0           1           0
  Xi,b        0           0           1

[Figure: example binary tree with internal node Age < 35 (T → leaf 20, F → node HasAuto) and node HasAuto (T → leaf 10, F → leaf 60)]
 Ordinal: Two choices
 One-hot encoding
 Convert to integers

Size = {small, medium, large} → Size = {0, 1, 2}


Leaf Nodes Always Real-Valued
22

Value of leaf node depends on loss function


Let $g_i = \frac{\partial \ell(F(x_i), y_i)}{\partial F(x_i)}$

 Squared loss: $\ell(F(x_i), y_i) = \frac{1}{2}\left(y_i - F(x_i)\right)^2$
Value of leaf j: $w_j = -\frac{\sum_{i \in I_j} g_i}{|I_j|}$, where $I_j$ = instances sorted to leaf j

 Logistic loss: $\ell(F(x_i), y_i) = \log\left(1 + e^{-2 y_i F(x_i)}\right)$
Value of leaf j: $w_j = \frac{\sum_{i \in I_j} -g_i}{\sum_{i \in I_j} |g_i|\,(2 - |g_i|)}$
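A small sketch of these two leaf-value formulas, assuming g is a NumPy array holding the gradients g_i of the instances sorted to one leaf:

```python
import numpy as np

def leaf_value_squared_loss(g):
    # w_j = -(sum of g_i) / |I_j|, i.e. the mean residual of the leaf's instances
    return -g.sum() / len(g)

def leaf_value_logistic_loss(g):
    # w_j = sum(-g_i) / sum(|g_i| * (2 - |g_i|))
    return -g.sum() / np.sum(np.abs(g) * (2.0 - np.abs(g)))
```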
Predictions Always Real-Valued!
23
$F(x_i) = \sum_{t=1}^{T} \eta_t h_t(x_i)$
 Not problematic for regression tasks

 Threshold the raw score for classification tasks
E.g.: $\mathrm{sign}(F(x_i))$
 Convert the raw score into a probability
 Train a logistic regression model to convert the raw
score to a probability (aka Platt scaling)
 Loss-function-specific conversions
E.g.: logistic loss: $p(y_i = 1 \mid x_i) = \frac{1}{1 + \exp(-2F(x_i))}$
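A tiny sketch of these two conversions; raw_score stands for F(x_i), and the factor 2 matches the logistic loss used on these slides:

```python
import numpy as np

def classify(raw_score):
    # Threshold the raw score: sign(F(x))
    return np.sign(raw_score)

def probability(raw_score):
    # Logistic-loss-specific conversion: p(y = 1 | x) = 1 / (1 + exp(-2 F(x)))
    return 1.0 / (1.0 + np.exp(-2.0 * raw_score))
```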


Evaluating Splits: Loss Reduction
24

Split($I_P$: instance set, split conditions S)
  for each s ∈ S do
    Let $I_L$ ($I_R$) be the instance set of the left (right) child
    $\Delta\text{loss} = \frac{\left(\sum_{i \in I_L} g_i\right)^2}{|I_L|} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{|I_R|} - \frac{\left(\sum_{i \in I_P} g_i\right)^2}{|I_P|}$

Bottleneck: ~90% of training time spent evaluating splits


Trick 1: Exploit Tree Structure
25

[Figure: a node Age < 35 whose child is being split further; the parent-node quantities below were already computed, so reuse them]

$\text{Loss} = \frac{\left(\sum_{i \in I_L} g_i\right)^2}{|I_L|} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{|I_R|} - \frac{\left(\sum_{i \in I_P} g_i\right)^2}{|I_P|}$

Let $\Sigma_P = \sum_{i \in I_P} g_i$
Let $\Sigma_L = \sum_{i \in I_L} g_i$
Let $\Sigma_R = \Sigma_P - \Sigma_L$   (exploit that splits are binary: saves add operations)

$\text{Loss} = \frac{\Sigma_L^2}{|I_L|} + \frac{\Sigma_R^2}{|I_P| - |I_L|} - \frac{\Sigma_P^2}{|I_P|}$
Biggest Problem: Continuous Features
26

  X1   …   Xd    Y
  0    …   -1    5
  0    …    5  -10
  …    …    …    …
  1    …   10   10

1. Copy the feature column: -1 5 95 -5 -1 … 10
2. Sort the array: -5 -1 -1 5 10 … 95
3. Try all possible splits: Xd < -1, Xd < 5, …

Problem: Lots of thresholds to try
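Before the histogram trick on the next slide, here is a sketch combining the last two slides: sort one feature, sweep every threshold, and compute the parent gradient sum once so each right-child statistic follows by subtraction. It illustrates the idea only; it is not XGBoost's actual implementation.

```python
# Exhaustive split search over one sorted feature, with the gradient-sum trick.
import numpy as np

def best_split(feature, g):
    order = np.argsort(feature)
    f_sorted, g_sorted = feature[order], g[order]
    sigma_p, n_p = g.sum(), len(g)
    parent_term = sigma_p ** 2 / n_p

    best_gain, best_threshold = 0.0, None
    sigma_l = 0.0
    for i in range(n_p - 1):
        sigma_l += g_sorted[i]                   # running left-child gradient sum
        if f_sorted[i] == f_sorted[i + 1]:
            continue                             # cannot split between equal values
        sigma_r = sigma_p - sigma_l              # right child by subtraction
        n_l, n_r = i + 1, n_p - (i + 1)
        gain = sigma_l**2 / n_l + sigma_r**2 / n_r - parent_term
        if gain > best_gain:
            best_gain = gain
            best_threshold = (f_sorted[i] + f_sorted[i + 1]) / 2.0
    return best_threshold, best_gain
```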
Trick 2: Histograms
27

 Determine a limited set of split points per node
 Can evaluate these in one pass over the data

1. Pick a small number of equal-width bins (e.g., 256)
2. Pass over the data and fill the bins
3. Only consider splits at bin boundaries

[Figure: each bin (e.g., -5 ≤ Xd < 5, 5 ≤ Xd < 15, …, 85 ≤ Xd ≤ 95) stores a (count, sum of gi) pair, e.g., (22, 50), (3, -5), …, (6, 20); candidate splits are Xd < 5, Xd < 15, …]
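A short sketch of histogram-based split finding under the assumptions above (equal-width bins, splits only at bin boundaries); the bin count of 256 is the illustrative number from the slide.

```python
import numpy as np

def histogram_best_split(feature, g, n_bins=256):
    edges = np.linspace(feature.min(), feature.max(), n_bins + 1)
    bin_idx = np.digitize(feature, edges[1:-1])  # bin index in [0, n_bins - 1]

    # One pass over the data: per-bin instance counts and gradient sums.
    counts = np.bincount(bin_idx, minlength=n_bins)
    grad_sums = np.bincount(bin_idx, weights=g, minlength=n_bins)

    sigma_p, n_p = g.sum(), len(g)
    parent_term = sigma_p**2 / n_p
    best_gain, best_edge = 0.0, None
    sigma_l, n_l = 0.0, 0
    for b in range(n_bins - 1):                  # splits only at bin boundaries
        sigma_l += grad_sums[b]
        n_l += counts[b]
        n_r = n_p - n_l
        if n_l == 0 or n_r == 0:
            continue
        gain = sigma_l**2 / n_l + (sigma_p - sigma_l)**2 / n_r - parent_term
        if gain > best_gain:
            best_gain, best_edge = gain, edges[b + 1]
    return best_edge, best_gain
```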
Add Randomization
28

 Bagging: Bootstrap replicate S’ by sampling |S|


examples with replacement from S
[Figure: Data → Bootstrap Replicate_t; each example has a 63.2% chance of appearing in the replicate]


 Feature bagging: Randomly select subset of
columns when learning each tree
 Randomness in splits: Randomly select
subset of splits at each internal node
XGBoost Data Representation:
(Compressed) Column Format
29

Matrix format (one row per example):

  Married  Age  Type     Class
  Y        45   Sedan    -10
  N        20   SUV        5
  Y        30   Sport     10
  Y        60   Berline  -15

Column format: the same data, stored one column (feature) at a time

☺ Easier and faster to randomly select a feature
☺ Can presort continuous features
☹ More record keeping (if you presort)
Overhead with Presorting
30

Unsorted:             Presorted:
  Feature    Y          Feature    Y
  45       -10          20       -10
  20         5          30         5
  30        10          45        10
  60       -15          60       -15

 Unsorted: array indices are aligned between feature and Y, so it is easy to look up the Y value
 Presorted: alignment is broken, so a pointer to the Y value must be stored for each feature
XGBoost: Sparsity-Aware Splitting
31

 Many entries “zero” due to one-hot encoding,


missing data, natural sparseness, etc.
 Do not store “zero” entries
[Figure: a dense feature column with many zero entries next to its sparse representation; the zero entries are dropped and only the non-zero entries are stored]

Fewer entries to iterate over:


Can result in 50x speed ups!
Which Details Were Skipped?
32

 Regularization to avoid overfitting


 Restrict depth of trees
 Restrict number of leaves

 Add penalty term to loss function

 Setting the learning rate η [see Friedman 2011]

 Derivations and discussion of all loss functions


33 Issues with Ensembles
Two Problems with Ensembles
34

 Problem: Over multiple models, how can I


determine which features are interesting?

Solution: Feature importances

 Problem: Ensembles have multiple models


 Predictions take longer
 Models take up more space

Solution: Model compression


Feature Importance:
Mean Decrease in Impurity
35

$\text{MDI}(v) = \frac{1}{|E|} \sum_{m \in E} \sum_{n_v \in m} \frac{|S_n|}{|D|}\, \text{Gain}(v, S_n)$

E = {$m_1$, …, $m_t$} is an ensemble of models
$n_v$ is a node splitting on variable v
$S_n$ is the set of training examples reaching node $n_v$
D is the training data
Gain(v, $S_n$) is the impurity reduction of splitting $S_n$ on v
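For reference, scikit-learn's impurity-based feature_importances_ attribute follows this mean-decrease-in-impurity idea, averaged over the trees in the ensemble; the data set and model settings below are purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Fit a small gradient boosted ensemble on synthetic data.
X, y = make_regression(n_samples=500, n_features=8, random_state=0)
model = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, y)

# Impurity-based importances, one value per feature (they sum to 1).
for idx, imp in enumerate(model.feature_importances_):
    print(f"feature {idx}: {imp:.3f}")
```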
Feature Importance:
Permutation Test
36

 For each variable, randomly permute its values
in the out-of-bag examples
 Out of bag: Examples not selected in the bootstrap
 Permutation of Type:

  Original:                        Type permuted:
  Married  Age  Type     Class     Married  Age  Type     Class
  N        20   SUV      N         N        20   Sport    N
  Y        30   Sport    N         Y        30   Berline  N
  Y        60   Berline  Y         Y        60   SUV      Y

 Measure decrease in accuracy
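A minimal sketch of the permutation test, assuming a fitted model with a score() method and some held-out (e.g., out-of-bag or validation) data X_val, y_val:

```python
import numpy as np

def permutation_importance(model, X_val, y_val, seed=0):
    rng = np.random.default_rng(seed)
    baseline = model.score(X_val, y_val)
    importances = []
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        perm = rng.permutation(X_val.shape[0])
        X_perm[:, j] = X_val[perm, j]            # permute one column's values
        importances.append(baseline - model.score(X_perm, y_val))
    return np.array(importances)                 # drop in score per feature
```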


Model Compression
37

 Idea: Compress ensemble by mimicking its


behavior with a model that is smaller and
executes quickly
 Build a new data set D' = {(xj', E(xj'))}
 xj' is an example
 E(xj') is the ensemble's prediction for xj'

 Train a new model M on D’


Two questions:
1. How to generate data?
2. What model?
Approach
38

 Generate data: For each example (xj,yj) in D,


create new example (xj’, E(xj’)):
 Let (xn,yn) be (xj,yj)’s nearest neighbor in D
 For i = 0 to d
◼ r ~ U[0,1]
◼ If (r < 0.5) then xj,i' = xj,i
◼ Else xj,i' = xn,i

 Add (xj’, E(xj’)) to D’


 Train a neural network on D’
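A rough sketch of this compression recipe, assuming the ensemble exposes a predict() method; the nearest-neighbour search and the small scikit-learn neural network are illustrative stand-ins for the components the slide leaves open.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.neural_network import MLPRegressor

def compress_ensemble(ensemble, X, hidden=(64, 64), seed=0):
    rng = np.random.default_rng(seed)

    # For each example, mix its features with those of its nearest neighbour.
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)                    # idx[:, 0] is the point itself
    mask = rng.random(X.shape) < 0.5             # r ~ U[0,1] per feature
    X_new = np.where(mask, X, X[idx[:, 1]])

    # Label the synthetic examples with the ensemble's predictions and mimic them.
    y_new = ensemble.predict(X_new)
    mimic = MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000)
    mimic.fit(X_new, y_new)
    return mimic
```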
39 Applications
Web Search
40

Query: “107.7 the end”


How Helpful Is an On-the-Ball Action in
a Soccer Match?
42

A soccer match has ±1600 on-the-ball actions


Problem: 99% of actions do not directly affect the score

[Figure: pitch diagram of a pass toward the goal]

Question: How valuable is an action (e.g., pass, dribble,…)?


Contribution Rating: How Much Did an
Action Contribute to the Scoreline?
43

 Insight: Action changes game state


 Assign value to each game state si
 Value of an action: $CR(s_i, a_i) = V(s_{i+1}) - V(s_i)$
Example: V(si) = 0.01 and V(si+1) = 0.05, so Value(pass) = 0.04 ≈ the pass' expected Δ in goal difference
Valuing a Game State
44

Intuition: Good actions either


1) Increase the short-term chance of scoring
2) Decrease the short-term chance of conceding
$V(s_i) = P_{\text{scores}}(s_i) - P_{\text{concedes}}(s_i)$

Estimate these from historical data


 Game state uses last 3 actions: si = {si-2,si-1,si}
 An action's effect is temporally limited:
Positive example: a goal by either team in the next 10 actions
Negative example: no goals in the next 10 actions

 Train a gradient boosted probability estimator
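A hedged sketch of this rating, assuming two fitted gradient boosted classifiers (hypothetical names scorer and conceder) that output the short-term scoring and conceding probabilities from game-state features:

```python
def state_value(scorer, conceder, state_features):
    # V(s) = P_scores(s) - P_concedes(s)
    p_score = scorer.predict_proba(state_features)[:, 1]
    p_concede = conceder.predict_proba(state_features)[:, 1]
    return p_score - p_concede

def contribution_rating(scorer, conceder, state_before, state_after):
    # CR(s_i, a_i) = V(s_{i+1}) - V(s_i)
    return (state_value(scorer, conceder, state_after)
            - state_value(scorer, conceder, state_before))
```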


Represent Game State Using
> 20 Features
45

3 types of features:
1) Simple: describe one action (e.g., Type = Pass, Result = Success, Start Location = (60,20), End Location = (75,60), Body Part = Foot)
2) Complex: compare consecutive actions (e.g., distance to goal, time difference between consecutive actions such as a pass and a tackle)
3) Contextual: game info (e.g., Home: 2; Away: 0; GD: +2; Time = 80 min)
Example: Barcelona’s 3-0 goal versus
Real Madrid (Dec 23, 2017)

[Figure: the sequence of actions leading to the goal; an annotation marks where the phase starts]


Application Scouting: Top-5 U21
players in the 2017/18 Dutch League
47

  Rank  Player             Team     Age  VAEP rating  June 2018  June 2019  Price delta
  1     David Neres        Ajax     21   0.62         €20m       €45m       +€25m
  2     Mason Mount        Vitesse  19   0.62         €4m        €12m       +€8m
  3     Frenkie de Jong    Ajax     20   0.50         €7m        €85m       +€78m
  4     Steven Bergwijn    PSV      20   0.49         €12m       €35m       +€23m
  5     Donny van de Beek  Ajax     21   0.47         €14m       €40m       +€26m


Summary
49

 Gradient boosting generalizes AdaBoost to work
with any differentiable loss function

 Mainly done with trees, though with some tweaks
to the standard tree learner

 Many highly performant implementations

 Widely applied to real problems


Questions?
50
51
AdaBoost from Principles of ML
for Easy Reference / Recall



Boosting
52

 Arose in the theoretical PAC learning community


 Strong PAC learner ≈ For arbitrary ε and δ, with
probability 1-δ, produce model with error of < ε
 Boosting assumes a weak learner:
 Cannot PAC learn for arbitrary ε and δ
 Models are (slightly) better than random guessing

 General approach
 Learns an additive model
 Greedily adds model at a time

 Focuses on the current model correcting
examples that are incorrectly predicted
AdaBoost: First Practical Booster
53

 Works for binary classification problems


 Learns additive model iteratively
F(X) = α1h1(X) + α2h2(X) + … + αtht(X)

 Two big ideas:


 Idea 1: Assign weights to examples to focus
attention on misclassified examples
 Idea 2: Prediction is a weighted vote based on
how accurate each hi is
Boosting Example
54

[Figure: 2-D feature space with positive (+) and negative (-) examples]
 Assume that we are going to make one axis-parallel cut through feature space



Boosting Example
55

[Figure: the first axis-parallel split of the data; it misclassifies 3 examples]

 Errors: 3
 Upweight the mistakes, downweight everything
else
Boosting Example
56

[Figure: the reweighted data, with the misclassified examples enlarged, and the next split learned on it]


Boosting Example
57

[Figure: the data with the splits learned so far]

Three key questions for AdaBoost:


1. What model should we pick (in theory)?
2. How should we set α ?
3. What practical details are important?
AdaBoost Setting
58

 Given: S = {(xj,yj)} with j ∊ {1,…,n} and y ∊ {-1,+1}

 Learn: F(X) = α1h1(X) + α2h2(X) + … + αtht(X)

 Prediction: $F(x_i) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x_i)\right)$, i.e., $-1$ (negative) or $+1$ (positive)

1. What model should we pick (in theory)?


2. How should we set α ?
3. What practical details are important?
What Classifier Should We Pick?
59

 Suppose our current function is:


Fm-1(X) = α1h1(X) + α2h2(X) + … + αm-1hm-1(X)

 Goal:
 Pick αmhm(X) to add to the model
 To minimize the error: $E = \sum_{i=1}^{N} e^{-y_i\left(F_{m-1}(x_i) + \alpha_m h_m(x_i)\right)}$



Understanding the Error:
60
Exponential Loss Function
$E = \sum_{i=1}^{N} e^{-y_i\left(F_{m-1}(x_i) + \alpha_m h_m(x_i)\right)}$

 The loss is small if the predicted and true label have the same sign:
$\mathrm{sign}(y) = \mathrm{sign}(\hat{y}) \Rightarrow -y\hat{y} < 0$
$\mathrm{sign}(y) \neq \mathrm{sign}(\hat{y}) \Rightarrow -y\hat{y} > 0$

 The loss drops quickly; e.g., for $y = 1$: $e^{-(1)(-2)} = 7.4$ and $e^{-(1)(-1)} = 2.7$

 The loss is always > 0

[Figure: exponential loss plotted against $\hat{y}$ for the case $y = 1$]
What Classifier Should We Pick?
61
 Goal: Minimize $E = \sum_{i=1}^{N} e^{-y_i\left(F_{m-1}(x_i) + \alpha_m h_m(x_i)\right)}$

 Let $w_i^{(m)} = e^{-y_i F_{m-1}(x_i)}$

This is a fixed per-example weight (the example's current error contribution), because we cannot change classifiers 1 to m-1 in the ensemble



What Classifier Should We Pick?
62
 Goal: Minimize $E = \sum_{i=1}^{N} e^{-y_i\left(F_{m-1}(x_i) + \alpha_m h_m(x_i)\right)}$

 Let $w_i^{(m)} = e^{-y_i F_{m-1}(x_i)}$ and rewrite the error:

$E = \underbrace{\sum_{y_i = h_m(x_i)} w_i^{(m)} e^{-\alpha_m}}_{\text{weight of correct predictions}} + \underbrace{\sum_{y_i \neq h_m(x_i)} w_i^{(m)} e^{\alpha_m}}_{\text{weight of incorrect predictions}}$

$E = \underbrace{\sum_{i=1}^{N} w_i^{(m)} e^{-\alpha_m}}_{\text{as if all of } h_m\text{'s predictions were correct}} + \left(e^{\alpha_m} - e^{-\alpha_m}\right) \underbrace{\sum_{y_i \neq h_m(x_i)} w_i^{(m)}}_{\text{the } h_m \text{ that minimizes this sum minimizes } E}$
Setting α𝑚
63

 Let $W_c = \sum_{y_i = h_m(x_i)} w_i^{(m)}$ and $W_{ic} = \sum_{y_i \neq h_m(x_i)} w_i^{(m)}$

 Then $E = W_c e^{-\alpha_m} + W_{ic} e^{\alpha_m}$

$\frac{dE}{d\alpha_m} = -W_c e^{-\alpha_m} + W_{ic} e^{\alpha_m} = 0$

$-W_c + W_{ic} e^{2\alpha_m} = 0$

$\alpha_m = \frac{1}{2} \ln \frac{W_c}{W_{ic}}$

$\alpha_m = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m}$ with $\varepsilon_m = \frac{\sum_{y_i \neq h_m(x_i)} w_i^{(m)}}{\sum_i w_i^{(m)}}$
AdaBoost
64

Given S = {(xj, yj)} where j ∊ {1,…,n}, integer T

$w_i^{(1)} = 1/n$   (all examples start with the same weight)
for t = 1 to T:
    Find a classifier $h_t$ with small weighted error $\epsilon_t = \sum_{h_t(x_i) \neq y_i} w_i^{(t)}$
    if ($\epsilon_t > 1/2$) then break
    $\beta_t = \epsilon_t / (1 - \epsilon_t)$
    if ($h_t(x_i) = y_i$) then $w_i^{(t+1)} = w_i^{(t)} \beta_t$   (down-weight correct predictions)
    $w_i^{(t+1)} = \frac{w_i^{(t+1)}}{\sum_j w_j^{(t+1)}}$   (normalize weights)
    $\alpha_t = \ln \frac{1}{\beta_t}$
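A compact sketch of this loop, using depth-1 decision stumps from scikit-learn as the weak learner and assuming labels y in {-1, +1}; the guard against a zero-error stump is an added practical detail, not part of the slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # all examples start with the same weight
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()                 # weighted error
        if eps > 0.5 or eps == 0.0:              # stop if too weak (or perfect)
            break
        beta = eps / (1.0 - eps)
        w = np.where(pred == y, w * beta, w)     # down-weight correct predictions
        w /= w.sum()                             # normalize weights
        stumps.append(stump)
        alphas.append(np.log(1.0 / beta))
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```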
AdaBoost in Practice
65

 Typically use a depth-bounded decision tree


 Sometimes stumps: Just a single split
 Often depth 5 or 6

 Dealing with weighted instances


 Approach 1: Adapt the learner to learn from
weighted instances; trivial for decision trees: just
use weighted counts in the split criteria
 Approach 2: Sample a large (≫n) set of
unweighted instances according to the weight
distribution and run the learner
