DM - Lecture 4
GRADIENT BOOSTED
ENSEMBLES
Jesse Davis
Ensemble Methods: Learn Multiple
Models and Combine Their Output
Key questions:
1. How do we generate multiple different models?
2. How do we learn the models efficiently?
Canonical Approach: Boosting
AdaBoost minimizes the exponential loss:
$E = \sum_{j=1}^{N} e^{-y_j \left(F_{m-1}(x_j) + \alpha_m h_m(x_j)\right)}$
Kaggle popularity:
  TensorFlow   16900
  PyTorch       5500
  AdaBoost      2290
  LightGBM     12700
  XGBoost      17400
Gradient Boosting Big Picture
Gradient Boosting =
Gradient Descent + Boosting
Least squares objective: $\ell = \frac{1}{2} \sum_{i=1}^{n} \left(y_i - F(x_i)\right)^2$

Representation of F: $F(x_i) = \sum_{t=1}^{T} \eta_t h_t(x_i)$
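A minimal numeric sketch of these two definitions (the names `ensemble_predict`, `base_learners`, and the use of a single constant shrinkage `eta` are illustrative assumptions):

```python
import numpy as np

def ensemble_predict(x, base_learners, eta=0.1):
    """F(x) = sum_t eta_t * h_t(x), here with a constant shrinkage eta."""
    return sum(eta * h(x) for h in base_learners)

def least_squares_objective(y, predictions):
    """l = 1/2 * sum_i (y_i - F(x_i))^2."""
    return 0.5 * np.sum((np.asarray(y) - np.asarray(predictions)) ** 2)
```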
Intuition
Learning h: Equivalent
$F(x_1) + h(x_1) = y_1 \;\Rightarrow\; h(x_1) = y_1 - F(x_1)$
$F(x_2) + h(x_2) = y_2 \;\Rightarrow\; h(x_2) = y_2 - F(x_2)$
…
$F(x_n) + h(x_n) = y_n \;\Rightarrow\; h(x_n) = y_n - F(x_n)$

Construct a new data set and learn h on it:
$\{(x_1,\, y_1 - F(x_1)),\ (x_2,\, y_2 - F(x_2)),\ \dots,\ (x_n,\, y_n - F(x_n))\}$

Add the learned $h$ to $F$.
[Diagram: models are added one at a time, each fit to the residuals of the current ensemble.]
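A hedged sketch of one such round using scikit-learn regression trees (the function name, the shrinkage of 0.1, and the tree depth are illustrative choices, not from the lecture):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_round(X, y, current_pred, eta=0.1, max_depth=3):
    """Fit h on the residual data set {(x_i, y_i - F(x_i))} and add eta * h to F."""
    residuals = y - current_pred                      # y_i - F(x_i)
    h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
    return current_pred + eta * h.predict(X), h

# Usage sketch: start from the mean label and add trees one at a time;
# the residuals shrink as more rounds are run.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)
F = np.full(len(y), y.mean())
for _ in range(50):
    F, _ = boosting_round(X, y, F)
```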
Connection to Gradient Descent
Negative of the gradient: $-\dfrac{\partial \ell}{\partial F(x_i)} = y_i - F(x_i)$

To see this, view $\frac{1}{2}\left(y_i - F(x_i)\right)^2$ as $f(g(F(x_i)))$ with $g(F(x_i)) = y_i - F(x_i)$ and apply the chain rule:

$\dfrac{\partial\, \frac{1}{2}\left(y_i - F(x_i)\right)^2}{\partial F(x_i)} = -\left(y_i - F(x_i)\right) = F(x_i) - y_i$

So the residual $y_i - F(x_i)$ is exactly the negative gradient of the least squares loss: fitting $h$ to the residuals takes a gradient descent step in function space.
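A quick numerical check of this identity with a finite difference (the values are arbitrary):

```python
def half_squared_loss(y, f):
    return 0.5 * (y - f) ** 2

y_i, f_i, eps = 3.0, 1.2, 1e-6
# Finite-difference estimate of the derivative of 1/2 (y_i - F(x_i))^2 w.r.t. F(x_i)
grad = (half_squared_loss(y_i, f_i + eps) - half_squared_loss(y_i, f_i - eps)) / (2 * eps)
print(grad)    # approx.  f_i - y_i = -1.8
print(-grad)   # approx.  y_i - f_i =  1.8, i.e. the residual
```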
Illustration of Loss Function
The Power of Gradient Boosting
Ranking
AdaBoost vs. Gradient Boosting
Similarities
Stage-wise greedy learning of an additive model
Focus on mispredicted examples
Difference 2: Generality
AdaBoost: just classification
Gradient Boosting: any differentiable loss!
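A sketch of what that generality looks like in code: each round fits a tree to the negative gradient of whichever differentiable loss is plugged in. The loss dictionary, names, and hyperparameters below are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Negative gradient of the loss w.r.t. the current prediction F(x_i).
NEG_GRADIENTS = {
    "squared": lambda y, f: y - f,            # least squares: the residual
    "absolute": lambda y, f: np.sign(y - f),  # absolute error (subgradient at 0)
}

def fit_gradient_boosting(X, y, loss="squared", n_rounds=50, eta=0.1):
    pred = np.full(len(y), np.mean(y))
    trees = []
    for _ in range(n_rounds):
        pseudo_residuals = NEG_GRADIENTS[loss](y, pred)   # generalized residuals
        h = DecisionTreeRegressor(max_depth=3).fit(X, pseudo_residuals)
        pred = pred + eta * h.predict(X)
        trees.append(h)
    return trees, pred
```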
Anatomy of a Boosting System
XGBoost
BitBoost (DTAI)
Add randomization
$F(x_i) = \sum_{t=1}^{T} \eta_t h_t(x_i)$
Not problematic for regression tasks
Example split: Age < 35?

$\text{Loss} = \dfrac{\left(\sum_{i \in I_L} g_i\right)^2}{|I_L|} + \dfrac{\left(\sum_{i \in I_R} g_i\right)^2}{|I_R|} - \dfrac{\left(\sum_{i \in I_P} g_i\right)^2}{|I_P|}$

Let $\Sigma_P = \sum_{i \in I_P} g_i$ and $\Sigma_L = \sum_{i \in I_L} g_i$; the parent quantities were already computed when splitting the parent node: reuse them!

Exploit that splits are binary: let $\Sigma_R = \Sigma_P - \Sigma_L$ and $|I_R| = |I_P| - |I_L|$, saving add operations!

$\text{Loss} = \dfrac{\Sigma_L^2}{|I_L|} + \dfrac{\Sigma_R^2}{|I_P| - |I_L|} - \dfrac{\Sigma_P^2}{|I_P|}$
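A hedged sketch of this bookkeeping (variable names are made up): only the left child's gradient sum and count are accumulated while scanning candidate splits; the right child's values follow by subtraction from the parent totals.

```python
import numpy as np

def split_score(sum_parent, n_parent, sum_left, n_left):
    """Loss of a binary split from parent totals and left-child sums only:
    Sigma_R = Sigma_P - Sigma_L and |I_R| = |I_P| - |I_L|."""
    sum_right = sum_parent - sum_left
    n_right = n_parent - n_left
    return (sum_left ** 2 / n_left
            + sum_right ** 2 / n_right
            - sum_parent ** 2 / n_parent)

# Gradients g_i of the instances in the parent node, ordered by the split feature (e.g. Age).
g = np.array([0.5, -1.0, 0.3, 0.8, -0.2])
sum_p, n_p = g.sum(), len(g)
scores = [split_score(sum_p, n_p, g[:k].sum(), k) for k in range(1, n_p)]
```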
Biggest Problem: Continuous Features
X1    …    Xd     Y
0     …    -1     5
0     …     5   -10
…     …     …     …
1     …    10    10

1. Copy the feature column into an array:  -1  5  95  -5  -1  …  10
2. Sort the array:                         -5  -1  -1  5  10  …  95
…
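A minimal sketch of these steps for a single continuous feature, scoring every threshold with the split loss above (the function name and the gradient values in the usage lines are made up for illustration):

```python
import numpy as np

def best_threshold(feature, g):
    """Sort one feature column once, then score every candidate threshold
    using prefix sums of the gradients."""
    order = np.argsort(feature)                     # step 2: sort the copied column
    f_sorted, g_sorted = feature[order], g[order]
    prefix = np.cumsum(g_sorted)                    # Sigma_L for every prefix
    sum_p, n = prefix[-1], len(g)
    best_score, best_t = -np.inf, None
    for k in range(1, n):                           # threshold between positions k-1 and k
        if f_sorted[k] == f_sorted[k - 1]:          # cannot split between equal values
            continue
        sum_l = prefix[k - 1]
        score = sum_l**2 / k + (sum_p - sum_l)**2 / (n - k) - sum_p**2 / n
        if score > best_score:
            best_score = score
            best_t = (f_sorted[k - 1] + f_sorted[k]) / 2
    return best_t, best_score

threshold, score = best_threshold(np.array([-1., 5., 95., -5., -1., 10.]),
                                  np.array([0.2, -0.4, 1.0, 0.1, -0.3, 0.5]))
```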
Feature bagging: randomly select a subset of the columns when learning each tree
Randomness in splits: randomly select a subset of the candidate splits at each internal node
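A small sketch of both kinds of randomization (the sampling fractions are arbitrary example values, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_columns(n_features, colsample=0.8):
    """Feature bagging: each tree is grown on a random subset of the columns."""
    k = max(1, int(colsample * n_features))
    return rng.choice(n_features, size=k, replace=False)

def sample_splits(candidate_thresholds, fraction=0.5):
    """Randomized splits: evaluate only a random subset of the candidate
    splits at an internal node."""
    k = max(1, int(fraction * len(candidate_thresholds)))
    return rng.choice(candidate_thresholds, size=k, replace=False)
```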
XGBoost Data Representation:
(Compressed) Column Format
Unsorted              Presorted
Feature     Y         Feature     Y
45        -10         20        -10
20          5         30          5
30         10         45         10
60        -15         60        -15
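A rough sketch of the idea (this is not XGBoost's actual data structure, just an illustration): store each feature column once in sorted order together with the original row indices, so labels and gradients can be looked up without re-sorting for every split.

```python
import numpy as np

def presort_columns(X):
    """For each column, store (sorted feature values, original row indices)."""
    presorted = []
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j], kind="stable")
        presorted.append((X[order, j], order))
    return presorted

X = np.array([[45.0], [20.0], [30.0], [60.0]])
y = np.array([-10.0, 5.0, 10.0, -15.0])
values, rows = presort_columns(X)[0]
# values -> [20. 30. 45. 60.]; y[rows] gives the labels in that order.
```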
$\dfrac{1}{|E|} \sum_{m \in E} \sum_{n_v \in m} \dfrac{|S_n|}{|D|}\, \text{Gain}(v, S_n)$
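Read literally, this averages over the models m in an ensemble E the gains of the nodes n_v that split on feature v, each weighted by the fraction |S_n|/|D| of the data reaching that node. A hedged sketch, assuming each node is a record with hypothetical fields `feature`, `gain`, and `n_instances`:

```python
def average_weighted_gain(ensemble, v, n_total):
    """(1/|E|) * sum over models m in E, over nodes n_v in m that split on v,
    of (|S_n| / |D|) * Gain(v, S_n)."""
    total = 0.0
    for model in ensemble:                     # m in E
        for node in model:                     # internal nodes of m
            if node["feature"] == v:           # nodes that split on feature v
                total += (node["n_instances"] / n_total) * node["gain"]
    return total / len(ensemble)
```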
How Helpful Is an On-the-Ball Action in
a Soccer Match?
[Illustration: example involving a pass and a goal]
General approach
Learns an additive model
Greedily adds one model at a time
[Figure: toy 2D data set of positive (+) and negative (−) examples]
Assume that we are going to make one axis-parallel cut through feature space.
[Figure: the chosen axis-parallel cut on the toy data; it misclassifies 3 examples (Errors: 3)]
Upweight the mistakes, downweight everything
else
Boosting Example
[Figure: further rounds of the boosting example on the reweighted toy data]
Prediction is a weighted vote of the base classifiers:
$F(x_i) = \operatorname{sign}\!\left(\sum_t \alpha_t h_t(x_i)\right)$, i.e. $+1$ (positive) if the weighted sum is positive and $-1$ (negative) otherwise.
Goal: pick $\alpha_m h_m(x)$ to add to the model so as to minimize the error

$E = \sum_{i=1}^{N} e^{-y_i \left(F_{m-1}(x_i) + \alpha_m h_m(x_i)\right)}$

Let $W_c = \sum_{i:\, y_i = h_m(x_i)} w_i^{(m)}$ and $W_{ic} = \sum_{i:\, y_i \neq h_m(x_i)} w_i^{(m)}$.
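Splitting the sum E into correctly and incorrectly classified instances and minimizing over α_m yields the standard AdaBoost weight α_m = ½ ln(W_c / W_ic); that closed form is standard AdaBoost rather than something shown in this excerpt. A small sketch of one round under that assumption:

```python
import numpy as np

def adaboost_step(y, h_pred, w):
    """One AdaBoost round. y and h_pred are in {-1, +1}; w are the current
    (normalized) instance weights w_i^(m)."""
    correct = (y == h_pred)
    W_c, W_ic = w[correct].sum(), w[~correct].sum()    # weight of correct / incorrect
    alpha = 0.5 * np.log(W_c / W_ic)                   # minimizes the exponential loss E
    w_new = w * np.exp(-alpha * y * h_pred)            # upweight mistakes, downweight the rest
    return alpha, w_new / w_new.sum()

y      = np.array([ 1, -1,  1,  1, -1])
h_pred = np.array([ 1, -1, -1,  1, -1])
w      = np.full(5, 0.2)
alpha, w_next = adaboost_step(y, h_pred, w)
```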