Chapter 7
Classification And Regression Trees
Overview
➢Trees
➢Bagging
➢Random Forest
➢Boosting
TREES
• Regression
• Classification
TREES
Since the set of splitting rules used to segment the predictor space can
be summarized in a tree, these types of approaches are known as
decision-tree methods.
TREES
Regression Trees
TREES
Example: Hitter dataset
A data frame with 322 observations of major league players on 20 variables.
The goal is to predict the salary (log salary) of a baseball player, based on the number
of years that he has played in the major leagues and the number of hits that
he made in the previous year.
TREES
Example: Hitter dataset
Salary is color-coded from low (blue, green) to high (yellow, red).
TREES
Example: Hitter dataset
Overall, the tree stratifies or segments the players into three regions of
predictor space.
TREES
Terminology
• In keeping with the tree analogy, the extremities are known as terminal
nodes or leaves
• The points along the tree where the predictor space is split are referred to as
internal nodes
• At a given internal node, the label 𝑋𝑗 < 𝑡𝑘 indicates the left-hand branch
emanating from that split, and the right-hand branch corresponds to 𝑋𝑗 ≥ 𝑡𝑘.
• The number in each leaf is the mean of the response for the observations in
the corresponding region.
TREES
• Given that a player is less experienced, the number of Hits that he made in the
previous year seems to play little role in his Salary.
• But among players who have been in the major leagues for five or more years, the
number of Hits made in the previous year does affect Salary, and players who made
more Hits last year tend to have higher salaries.
TREES
Tree-building process
1. Divide the predictor space - that is, the set of possible values for
X1, X2, …, Xp - into J distinct and non-overlapping regions, R1, R2, …, RJ.
2. For every observation that falls into the region Rj, the prediction is
simply the mean of the response values for the training observations in Rj.
TREES
Tree-building process
• The goal is to find boxes R1,…,RJ that minimize the residual sum of
squares (RSS), given by
$$\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$$
where $\hat{y}_{R_j}$ is the mean response for the training observations within the jth box.
TREES
Tree-building process
• Computationally infeasible to consider every possible partition of the feature
space into J boxes.
• Top-down : begins at the top of the tree and then successively splits the
predictor space; each split is indicated via two new branches further down on
the tree.
• Greedy : at each step of the tree-building process, the best split is made at that
particular step, rather than looking ahead and picking a split that will lead to a
better tree in some future step.
TREES
Tree-building process
• Select the predictor Xj and the cutpoint s such that splitting the predictor
space into the regions $\{X \mid X_j < s\}$ and $\{X \mid X_j \geq s\}$ leads to the greatest
possible reduction in residual sum of squares (RSS).
Prediction
Predict the response for a given test observation using the mean of the training
observations in the region to which that test observation belongs.
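The greedy selection of the predictor and cutpoint can be made concrete with a short sketch. The code below is illustrative only (the slides do not include code); the helper names rss and best_split are made up, and the search simply scans every predictor and every observed cutpoint, keeping the split with the smallest total RSS. A full tree-builder would apply this search recursively within each resulting region.

```python
# A minimal sketch of one greedy split: scan every predictor X_j and cutpoint s,
# keep the pair giving the largest reduction in RSS. Illustrative names only.
import numpy as np

def rss(y):
    """Residual sum of squares of y around its mean (0 for an empty region)."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(X, y):
    """Return (j, s, total_rss) minimizing RSS over {X_j < s} and {X_j >= s}."""
    best = (None, None, np.inf)
    n, p = X.shape
    for j in range(p):
        for s in np.unique(X[:, j]):
            left = X[:, j] < s
            total = rss(y[left]) + rss(y[~left])
            if total < best[2]:
                best = (j, s, total)
    return best

# Toy usage: two predictors, e.g. Years and Hits
rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(100, 2))
y = np.where(X[:, 0] < 4.5, 5.0, 6.5) + rng.normal(0, 0.3, 100)
print(best_split(X, y))   # expect a split on predictor 0 near 4.5
```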
TREES
Figure: a partition of 2-dimensional space that could not result from recursive binary
splitting, and a partition of 2-dimensional space that could result from recursive binary splitting.
TREES
• Possible solution (1) to limit the complexity of the tree: set some threshold. If
the RSS does not decrease by more than the threshold after a split, then stop the
process.
• Drawback of (1): a split with a low decrease in RSS could be followed by a
split with a large decrease in RSS.
TREES
Other solution (2) to limit the complexity of the tree: build a large tree, then cut
the “weak” links --> pruning
TREES
Pruning
• Grow a very large tree T0, and then prune it in order to obtain a subtree
• Use cost complexity pruning, also known as weakest link pruning, to prune
TREES
Cost complexity pruning
• Consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$.
• To each value of $\alpha$ there corresponds a subtree $T \subseteq T_0$ such that
$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$
is as small as possible.
$|T|$: number of terminal nodes of the tree T
$R_m$: rectangle (subset of predictor space) corresponding to the mth terminal node
$\hat{y}_{R_m}$: the mean of the training observations in $R_m$.
TREES
Choosing the best subtree
• Use cross-validation to choose the tuning parameter $\alpha$, yielding $\hat{\alpha}$
• Return to the full data set and obtain the subtree corresponding to $\hat{\alpha}$
TREES
Choosing the best subtree: minimize $\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$
Given $T_0$, having N = 17 leaves.
Let $\alpha_0 = 0$, $\alpha_1 = 10$, $\alpha_2 = 100$, $\alpha_3 = 1000$, etc.
For $\alpha_0 = 0$: $E_0 = \sum_{m=1}^{17} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2$
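One possible way to carry out cost complexity pruning in practice is sketched below, assuming Python with scikit-learn (the slides do not prescribe any library or data set). Here ccp_alpha plays the role of the tuning parameter $\alpha$, and cross-validation picks $\hat{\alpha}$ before refitting on the full data set.

```python
# A hedged sketch of cost complexity pruning with scikit-learn on synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Grow a large tree T0 and obtain the sequence of alphas indexing its subtrees.
full_tree = DecisionTreeRegressor(random_state=0)
path = full_tree.cost_complexity_pruning_path(X, y)

# Score the subtree corresponding to each alpha by cross-validation.
cv_scores = [
    cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a), X, y,
                    cv=5, scoring="neg_mean_squared_error").mean()
    for a in path.ccp_alphas
]
alpha_hat = path.ccp_alphas[int(np.argmax(cv_scores))]

# Return to the full data set and refit with the chosen alpha-hat.
final_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha_hat).fit(X, y)
print(alpha_hat, final_tree.get_n_leaves())
```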
TREES
Example: Hitter dataset (using 263 observations and 9 attributes)
• Build a large regression tree on the training data and vary $\alpha$ in order to
create subtrees with different numbers of terminal nodes.
TREES
Example: Hitter dataset
Unpruned tree
TREES
Example: Hitter dataset
Final subtree
TREES
Classification Trees
TREES
Classification Trees
• Just as in the regression setting, recursive binary splitting is used to grow a
classification tree.
• In the classification setting, RSS cannot be used as a criterion for making the
binary splits
• Alternative to RSS is the classification error rate, the fraction of the training
observations in that region that do not belong to the most common class:
$E = 1 - \max_k(\hat{p}_{mk})$
where $\hat{p}_{mk}$ is the proportion of training observations in the mth region that are from the
kth class.
TREES
Classification Trees
• Gini Index
• Cross Entropy
TREES
Gini Index and Deviance
• The Gini index is defined by $G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$.
• The Gini index takes on a small value if all of the $\hat{p}_{mk}$ are close to 0 or 1.
• For this reason the Gini index is referred to as a measure of node purity: a small
value indicates that a node contains predominantly observations from a single
class.
TREES
Example: Gini Index calculations

Node 1: C1 = 0, C2 = 6
$\hat{p}_{m0} = 0/6 = 0$, $\hat{p}_{m1} = 6/6 = 1$
$G = 0(1 - 0) + 1(1 - 1) = 0$

Node 2: C1 = 1, C2 = 5
$\hat{p}_{m0} = 1/6$, $\hat{p}_{m1} = 5/6$
$G = \frac{1}{6}\cdot\frac{5}{6} + \frac{5}{6}\cdot\frac{1}{6} = 0.278$

Node 3: C1 = 2, C2 = 4
$\hat{p}_{m0} = 2/6$, $\hat{p}_{m1} = 4/6$
$G = \frac{2}{6}\cdot\frac{4}{6} + \frac{4}{6}\cdot\frac{2}{6} = 0.444$
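The calculations above can be checked with a few lines of code. The gini helper below is illustrative only; it computes $G = \sum_k \hat{p}_{mk}(1 - \hat{p}_{mk})$ directly from class counts.

```python
# Quick check of the node-purity calculations for the three nodes above.
def gini(counts):
    n = sum(counts)
    props = [c / n for c in counts]
    return sum(p * (1 - p) for p in props)

print(round(gini([0, 6]), 3))  # 0.0   -> pure node
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444 -> most impure of the three
```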
TREES
Cross Entropy
• The Cross Entropy is defined by $D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$
TREES
Similarities for 2 Class problem
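The comparison behind this slide can be reproduced numerically. The snippet below (illustrative only) evaluates the classification error rate, the Gini index, and the cross entropy for a two-class node as functions of $\hat{p}_{m1}$; using base-2 logarithms is an arbitrary choice here.

```python
# Compare the three node-impurity measures for a two-class node.
import math

def error_rate(p): return 1 - max(p, 1 - p)
def gini(p):       return 2 * p * (1 - p)
def entropy(p):    return 0.0 if p in (0, 1) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p={p:.1f}  error={error_rate(p):.3f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```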
TREES
Example: Heart dataset
• These data contain a binary outcome HD for 303 patients who presented with
chest pain.
Unpruned Tree
TREES
Example: Heart dataset
Final subtree
TREES
Trees versus Linear models
TREES
Advantages
Trees can easily handle qualitative predictors without the need to create
dummy variables
TREES
Disadvantages
Unfortunately, trees are typically not competitive with the best
supervised learning approaches in terms of prediction accuracy.
Trees can be very non-robust: a small change in the data can cause a large
change in the final estimated tree.
Overview
➢Trees
➢Bagging
➢Random Forest
➢Boosting
BAGGING
https://bradleyboehmke.github.io/HOML/process.html
BAGGING
Regression
• Generate B different bootstrapped training datasets.
• Train the method on the bth bootstrapped training set in order to get $\hat{f}^{*b}(x)$,
the prediction at a point x.
• Average all the predictions: $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)$
(a sketch of this procedure is given below).
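A minimal sketch of this bagging recipe, assuming Python with scikit-learn trees and a synthetic data set, is given below; scikit-learn's BaggingRegressor automates the same loop, and the manual version is shown only to mirror the slide.

```python
# Bagging for regression: B bootstrap samples, one unpruned tree per sample,
# predictions averaged across the B trees.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=1)
rng = np.random.default_rng(1)
B = 100
trees = []
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))        # bootstrap sample b
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# f_bag(x) = average of the B individual tree predictions f^{*b}(x)
f_bag = np.mean([t.predict(X) for t in trees], axis=0)
print(f_bag[:3])
```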
BAGGING
Remarks
• Bagging can improve predictions for many regression methods, but it is particularly
useful for decision trees.
• The bagged trees are grown deep and are not pruned; hence each individual tree has high
variance. Averaging these B trees reduces the variance.
Classification
• For each test observation, record the class predicted by each of the B trees,
and take a majority vote: the overall prediction is the most commonly
occurring class among the B predictions.
BAGGING
Out-of-bag Error Estimation
• Usually, a model is evaluated on held-out test data.
• However, there is a very straightforward way to estimate the test error of a bagged model
without cross-validation or a separate test set.
• It can be shown that, on average, each bagged tree makes use of around 2/3 of
the observations.
BAGGING
Out-of-bag Error Estimation
• The remaining 1/3 of the observations not used to fit a given bagged tree are
referred to as the out-of-bag (OOB) observations.
• The response for the ith observation can be predicted using each of the trees
in which that observation was OOB.
• This will yield around B/3 predictions for the ith observation.
BAGGING
Variable Importance
• With a large number of trees in bagging, it is not clear which variables are most
important to the procedure.
• For regression, record the total amount that the RSS is decreased due to splits over a given
predictor, averaged over all B trees.
• Similarly, for classification, add up the total amount that the Gini index is
decreased by splits over a given predictor, averaged over all B trees.
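One way to obtain both the OOB error estimate and these importance measures, assuming scikit-learn and synthetic data, is sketched below. Setting max_features=None makes the random forest equivalent to bagging (all p predictors are split candidates at every split), oob_score_ gives the out-of-bag estimate, and feature_importances_ reports the average impurity (RSS/Gini) decrease per predictor.

```python
# OOB error and variable importance for bagged regression trees.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=10.0, random_state=2)

bagged = RandomForestRegressor(n_estimators=200, max_features=None,
                               oob_score=True, random_state=2).fit(X, y)

print("OOB R^2 estimate:", round(bagged.oob_score_, 3))
print("Variable importance (impurity decrease):", bagged.feature_importances_.round(3))
```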
BAGGING
Example: Heart dataset
BAGGING
Limitation
• Suppose that there is one very strong predictor in the data set, along with a
number of other moderately strong predictors.
• Then in the collection of bagged trees, most or all of the trees will use this
strong predictor in the top split.
• Consequently, all of the bagged trees will look quite similar to each other -->
highly correlated predictions.
• Averaging many highly correlated quantities does not lead to a large reduction
in variance.
• In this case, bagging will not lead to a substantial reduction in variance.
Overview
➢Trees
➢Bagging
➢Random Forest
➢Boosting
RANDOM FOREST
https://builtin.com/data-science/random-forest-algorithm
RANDOM FOREST
• As in bagging, build a number of decision trees on bootstrapped training
samples.
• But when building these decision trees, each time a split in a tree is
considered, a random selection of m predictors is chosen as split candidates
from the full set of p predictors. The split is allowed to use only one of those
m predictors.
• A fresh selection of m predictors is taken at each split, and typically $m \approx \sqrt{p}$
(a sketch of this is given below).
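A minimal sketch of this idea with scikit-learn is given below (the library, data set, and values of m are assumptions of this example, not prescribed by the slides); max_features is the m above, and "sqrt" corresponds to the common choice $m \approx \sqrt{p}$.

```python
# Random forests with different numbers of split candidates m per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=3)

for m in ["sqrt", 0.5, None]:        # m = sqrt(p), m = p/2, m = p (i.e. bagging)
    rf = RandomForestClassifier(n_estimators=200, max_features=m, random_state=3)
    acc = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_features={m}: CV accuracy = {acc:.3f}")
```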
RANDOM FOREST
Example: Gene expression data
• There are around 20,000 genes in humans, and individual genes have different
levels of activity, or expression, in particular cells, tissues, and biological
conditions.
• Each of the patient samples has a qualitative label with 15 different levels:
either normal or one of 14 different types of cancer.
RANDOM FOREST
Example: Gene expression data
• Use random forests to predict cancer type based on the 500 genes that have the
largest variance in the training set.
• Randomly divide the observations into a training and a test set, and apply
random forest to the training set for 3 different values of the number of
splitting variables m.
Overview
➢Trees
➢Bagging
➢Random Forest
➢Boosting
BOOSTING
• Recall that bagging fits a separate decision tree (independent of the other
trees) to each bootstrap, and then combines all of the trees to create a single
predictive model.
• Boosting works in a similar way, except that the trees are grown sequentially:
each tree is grown using information from previously grown trees.
• Boosting does not involve bootstrap sampling; instead each tree is fit on a
modified version of the original data set.
BOOSTING
• Uses voting/averaging but models are weighted according to their
performance
• There are several variants of this algorithm (e.g. AdaBoost, LPBoost, etc.)
BOOSTING
https://en.wikipedia.org/wiki/Boosting_(machine_learning)#/media/File:Ensemble_Boosting.svg
BOOSTING
https://towardsdatascience.com/boosting-algorithms-explained-d38f56ef3f30
BOOSTING
• Fitting a single large decision tree to the data (i.e., fitting the data hard) could potentially
result in overfitting.
• Instead, boosting learns slowly: given the current model, boosting fits a decision tree to the
residuals from the model. The new decision tree is added into the fitted function in order to
update the residuals, as sketched below.
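The residual-fitting idea can be written out directly. The sketch below assumes squared-error loss, a shrinkage parameter lam, and shallow trees of depth d; it is illustrative only, not code from the slides.

```python
# Boosting for regression: repeatedly fit a small tree to the current residuals
# and add a shrunken copy of it to the model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=4)

B, lam, d = 500, 0.01, 1          # number of trees, shrinkage, tree depth
f_hat = np.zeros(len(y))          # start with f_hat(x) = 0
residuals = y.copy()              # so the residuals are just y
trees = []

for b in range(B):
    tree = DecisionTreeRegressor(max_depth=d).fit(X, residuals)
    update = tree.predict(X)
    f_hat += lam * update         # add a shrunken version of the new tree
    residuals -= lam * update     # update the residuals
    trees.append(tree)

print("training MSE:", round(float(np.mean((y - f_hat) ** 2)), 2))
```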
BOOSTING
Tuning parameters
• Unlike bagging and random forests, boosting can overfit if B is too large
(although this overfitting tends to occur slowly if at all). Cross-validation could
be used to select the number of trees B.
• The shrinkage parameter $\lambda$, a small positive number, controls the rate at
which boosting learns (typical values are 0.01 or 0.001, and the right choice
can depend on the problem). A very small $\lambda$ can require using a very large value
of B in order to achieve good performance.
BOOSTING
Tuning parameters
• The number of splits d in each tree, which controls the complexity of the
boosted ensemble. Often d = 1 works well, in which case each tree is a stump,
consisting of a single split and resulting in an additive model. More generally d
is the interaction depth, and controls the interaction order of the boosted
model, since d splits can involve at most d variables.
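One possible way to expose these three tuning parameters, assuming scikit-learn's gradient boosting on synthetic data, is sketched below: n_estimators is B (chosen here by cross-validation), learning_rate is the shrinkage parameter, and max_depth is the interaction depth d.

```python
# Selecting the boosting tuning parameters B, lambda, and d by cross-validation.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=5)

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=5),
    param_grid={
        "n_estimators": [100, 500, 1000],    # B
        "learning_rate": [0.1, 0.01, 0.001], # shrinkage parameter
        "max_depth": [1, 2, 3],              # d = 1 gives stumps (additive model)
    },
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_)
```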
BOOSTING
Example: Gene expression data
SUMMARY
• Decision trees are simple and interpretable models for regression and classification.
• However, they are often not competitive with other methods in terms of prediction
accuracy.
• Bagging, random forests, and boosting are good methods for improving the prediction
accuracy of trees. They work by growing many trees on the training data and then
combining the predictions of the resulting ensemble of trees.
• Random forests and boosting are among the state-of-the-art methods for supervised
learning. However, their results can be difficult to interpret.