Chapter 7 - Trees

This chapter discusses tree-based methods for classification and regression problems. Trees involve segmenting the predictor space into simple regions to make predictions. Regression trees are used for continuous output variables, while classification trees are used for categorical outputs. The document describes the tree-building process, which recursively splits the data space to minimize residual error at each step. It also discusses pruning trees to limit complexity and overfitting. Cross-validation is used to select the optimal pruned subtree.

FEM 2063 - Data Analytics

Chapter 7
Classification And Regression Trees

1
Overview
➢Trees

➢Bagging

➢Random Forest

➢Boosting

2
TREES

• We describe tree-based methods for regression and classification.

• These methods involve stratifying or segmenting the predictor space into a number of simple regions.

3
TREES
Since the set of splitting rules used to segment the predictor space can
be summarized in a tree, these types of approaches are known as
decision-tree methods.

4
TREES

Regression Trees

Used when dealing with a continuous output variable.

Examples: Income, Height, Volume etc.

5
TREES
Example: Hitter dataset

A data frame with 322 observations of major league players on the following 20
variables.

AtBat: Number of times at bat in 1986


Hits: Number of hits in 1986
Runs: Number of runs in 1986
Years: Number of years in the major league
PutOuts: Number of put outs in 1986
Assists: Number of assists in 1986
Salary: 1987 annual salary on opening day in thousands of dollars
...etc.
6
TREES
Example: Hitter dataset

Predicting the salary (log salary) of a baseball player, based on the number
of years that he has played in the major leagues and the number of hits that
he made in the previous year.

Hits: Number of hits in 1986


Years: Number of years in the major league
Salary: 1987 annual salary on opening day in thousands of dollars

7
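As a rough illustration of this prediction task, the sketch below fits a small regression tree to log Salary using only Years and Hits with scikit-learn. It assumes the Hitters data is available locally as a CSV file named "Hitters.csv" (a hypothetical path) with those column names.

# A minimal sketch of the slide's example, assuming the ISLR Hitters data is
# available locally as "Hitters.csv" (hypothetical path) with columns
# Years, Hits and Salary.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

hitters = pd.read_csv("Hitters.csv").dropna(subset=["Salary"])
X = hitters[["Years", "Hits"]]
y = np.log(hitters["Salary"])          # model log Salary, as on the slide

# Restrict the tree to 3 leaves to mirror the 3 regions R1, R2, R3
tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Years", "Hits"]))

Restricting the tree to 3 leaves mirrors the three regions described on the following slides.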
TREES
Example: Hitter dataset

Salary is color-coded
from low (blue, green) to
high (yellow, red)

8
TREES
Example: Hitter dataset
Overall, the tree stratifies or segments the players into three regions of
predictor space:

R1 = {X | Years < 4.5}

R2 = {X | Years ≥ 4.5, Hits < 117.5}

R3 = {X | Years ≥ 4.5, Hits ≥ 117.5}

9
TREES
Example: Hitter dataset

Based on the 3 regions, the following tree is generated.

10
TREES
Terminology
• In keeping with the tree analogy, the extremities are known as terminal
nodes or leaves

• The points along the tree where the predictor space is split are referred to as
internal nodes

• At a given internal node, the label 𝑋𝑗 < 𝑡𝑘 indicates the left-hand branch
emanating from that split, and the right-hand branch corresponds to 𝑋𝑗 ≥ 𝑡𝑘.

• The number in each leaf is the mean of the response for the observations in
the corresponding region.
11
TREES
Example: Hitter dataset

The tree has 2 internal nodes and 3 terminal nodes (leaves).
12


TREES
Hitter dataset: Interpretation of Results
• Years is the most important factor in determining Salary, and players with less
experience earn lower salaries than more experienced players.

• Given that a player is less experienced, the number of Hits that he made in the
previous year seems to play little role in his Salary.

• But among players who have been in the major leagues for five or more years, the
number of Hits made in the previous year does affect Salary, and players who made
more Hits last year tend to have higher salaries.

• It is an over-simplification, but compared to a regression model, the tree is easy to display, interpret and explain.
13
TREES
Tree-building process

1. Divide the predictor space - that is, the set of possible values for X1, X2, …, Xp - into J distinct and non-overlapping regions, R1, R2, …, RJ.

2. For every observation that falls into the region Rj , the prediction is
simply the mean of the response values for the training observations in
Rj .

14
TREES
Tree-building process

• In theory, the regions could have any shape. However, high-dimensional rectangles, or boxes, are simpler and make the resulting predictive model easier to interpret.

• The goal is to find boxes R1, …, RJ that minimize the residual sum of squares (RSS), given by

  Σ_{j=1}^{J} Σ_{i ∈ Rj} (yi − ŷ_Rj)²

  where ŷ_Rj is the mean response for the training observations within the jth box.
15
TREES
Tree-building process
• Computationally infeasible to consider every possible partition of the feature
space into J boxes.

• Top-down, greedy approach known as recursive binary splitting.

• Top-down: begins at the top of the tree and then successively splits the predictor space; each split is indicated via two new branches further down on the tree.

• Greedy: at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
16
TREES
Tree-building process
• Select the predictor Xj and the cutpoint s such that splitting the predictor space into the regions {X | Xj < s} and {X | Xj ≥ s} leads to the greatest possible reduction in residual sum of squares (RSS).

• In each of the two previously identified regions: repeat the process of looking for the best predictor and best cutpoint in order to split the data further, so as to minimize the RSS within each of the resulting regions.

• The process continues until a stopping criterion is reached; for instance, until no region contains more than five observations.
17
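The loop below is a minimal, illustrative sketch (not the course's code) of a single greedy step of recursive binary splitting: it scans every predictor and cutpoint and keeps the pair giving the largest reduction in RSS. The function name best_split and the synthetic data are assumptions made for illustration.

# Illustrative sketch of one greedy split: scan every predictor j and
# cutpoint s and keep the pair that minimizes the total RSS of the two
# resulting regions.
import numpy as np

def best_split(X, y):
    """Return (j, s, rss) for the best single binary split of (X, y)."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + rng.normal(scale=0.1, size=100)
print(best_split(X, y))   # should pick predictor 0 with a cutpoint near 0.5

Applying the same search recursively inside each resulting region, until a stopping criterion is met, gives the full tree.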
TREES

Prediction

Predict the response for a given test observation using the mean of the training
observations in the region to which that test observation belongs.

18
TREES

Left: a partition of 2-dimensional space that could not result from recursive binary splitting.
Right: a partition of 2-dimensional space that could result from recursive binary splitting.
19
TREES

Left: a partition of 2-dimensional space from recursive binary splitting.
Right: the corresponding tree.
20
TREES

Left: the tree.
Right: a perspective plot of the prediction surface corresponding to the tree.
21
TREES

• The previous process could produce a complex tree --> overfit.

• Possible solution (1) to limit the complexity of the tree: set some threshold. If the RSS does not decrease by more than the threshold after a split, then stop the process.

• Drawback of (1): a split with a small decrease in RSS could be followed by a split with a large decrease in RSS.

22
TREES
Other solution (2) to limit the complexity of the tree: build a large tree, then cut the "weak" links --> Pruning

23
TREES
Pruning

• Grow a very large tree T0, and then prune it in order to obtain a subtree

• How to decide the leaves to remove? (What is the best subtree?)

• Not realistic to check all possible subtrees!

• Use cost complexity pruning, also known as weakest link pruning, to prune the tree.

24
TREES
Cost complexity pruning
• Consider a sequence of trees indexed by a nonnegative tuning parameter α.
• To each value of α there corresponds a subtree T ⊂ T0 such that

  Σ_{m=1}^{|T|} Σ_{i: xi ∈ Rm} (yi − ŷ_Rm)² + α|T|

  is as small as possible.

|T|: number of terminal nodes of the tree T
Rm: rectangle (subset of predictor space) corresponding to the mth terminal node
ŷ_Rm: the mean of the training observations in Rm

25
TREES
Choosing the best subtree

• The tuning parameter α controls a trade-off between the subtree's complexity and its fit to the training data.

• Select an optimal value α̂ using cross-validation.

• Return to the full data set and obtain the subtree corresponding to α̂.

26
TREES
Choosing the best subtree: minimize  Σ_{m=1}^{|T|} Σ_{i: xi ∈ Rm} (yi − ŷ_Rm)² + α|T|

Given T0, having N = 17 leaves.
Let α0 = 0, α1 = 10, α2 = 100, α3 = 1000, …etc.

For α0 = 0, the only subtree with N leaves is T0, and

  E0 = Σ_{m=1}^{17} Σ_{i: xi ∈ Rm} (yi − ŷ_Rm)²

27
TREES
Choosing the best subtree: minimize  Σ_{m=1}^{|T|} Σ_{i: xi ∈ Rm} (yi − ŷ_Rm)² + α|T|

For α1 = 10, find the subtree T1 ⊂ T0 with N−1 = 16 leaves that minimizes

  Σ_{m=1}^{16} Σ_{i: xi ∈ Rm} (yi − ŷ_Rm)² + 160        (α|T| = 10 × 16)

For α2 = 100, find the subtree T2 ⊂ T1 with N−2 = 15 leaves that minimizes

  Σ_{m=1}^{15} Σ_{i: xi ∈ Rm} (yi − ŷ_Rm)² + 1500       (α|T| = 100 × 15)

…etc
28
TREES

For each subtree Ti, perform a K-fold cross-validation.

Evaluate the mean squared prediction error.

The subtree with the smallest cross-validated MSE is selected as the final model.


29
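One possible way to carry out cost complexity pruning with cross-validation in scikit-learn is sketched below; ccp_alpha is scikit-learn's name for the tuning parameter α, and the synthetic data stands in for a real training set.

# Sketch of cost complexity pruning with cross-validation, using
# scikit-learn's ccp_alpha (its name for the tuning parameter alpha).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Grow a large tree T0, then ask for the alpha values at which leaves drop off
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

cv_mse = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
    scores = cross_val_score(tree, X, y, cv=6, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())

best_alpha = path.ccp_alphas[int(np.argmin(cv_mse))]
final_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, final_tree.get_n_leaves())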
TREES
Example: Hitter dataset (using 263 observations and 9 attributes)

• Random division of the dataset in half: 132 observations in the training set and 131 observations in the test set.

• Build a large regression tree on the training data and vary α in order to create subtrees with different numbers of terminal nodes.

• Perform 6-fold cross-validation in order to estimate the cross-validated MSE of the subtrees.

30
TREES
Example: Hitter dataset

Unpruned tree
31
TREES
Example: Hitter dataset

32
TREES
Example: Hitter dataset

Final subtree
33
TREES

Classification Trees

Used when dealing with a qualitative output variable.

Examples: Gender, Yes/No, etc.

34
TREES

Classification Trees

Works the same way as regression trees in terms of finding regions.

An observation is classified into the most commonly occurring class of training observations in the region to which it belongs.

35
TREES
Classification Trees
• Just as in the regression setting, recursive binary splitting is used to grow a
classification tree.

• In the classification setting, RSS cannot be used as a criterion for making the
binary splits

• An alternative to RSS is the classification error rate: the fraction of the training observations in that region that do not belong to the most common class,

  E = 1 − max_k (p̂mk)

  where p̂mk is the proportion of training observations in the mth region that are from the kth class.
36
TREES
Classification Trees

However, classification error is not sufficiently sensitive for tree-growing, and in practice two other measures are preferable:

• Gini Index

• Cross Entropy

37
TREES
Gini Index and Deviance

• The Gini index is defined by

  G = Σ_{k=1}^{K} p̂mk (1 − p̂mk),

  a measure of total variance across the K classes.

The Gini index takes on a small value if all of the p̂mk are close to 0 or 1.

• For this reason the Gini index is referred to as a measure of node purity: a small
value indicates that a node contains predominantly observations from a single
class.
38
TREES
Example: Gini Index calculations

Node with counts C1 = 0, C2 = 6:   p̂m0 = 0/6 = 0,  p̂m1 = 6/6 = 1
  G = 0(1 − 0) + 1(1 − 1) = 0

Node with counts C1 = 1, C2 = 5:   p̂m0 = 1/6,  p̂m1 = 5/6
  G = (1/6)(5/6) + (5/6)(1/6) = 0.278

Node with counts C1 = 2, C2 = 4:   p̂m0 = 2/6,  p̂m1 = 4/6
  G = (2/6)(4/6) + (4/6)(2/6) = 0.444
39
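The small helper below reproduces the Gini calculations above (and the cross-entropy analogue defined on the next slide); the function names are illustrative.

# Reproducing the slide's Gini calculations (and the cross-entropy analogue)
# for a node containing counts of two classes C1 and C2.
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    return float((p * (1 - p)).sum())

def cross_entropy(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    p = p[p > 0]                       # 0 * log(0) is treated as 0
    return float(-(p * np.log(p)).sum())

for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, round(gini(counts), 3), round(cross_entropy(counts), 3))
# Gini: 0.0, 0.278, 0.444 -- matching the slide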
TREES
Cross Entropy
• The cross entropy is defined by

  D = − Σ_{k=1}^{K} p̂mk log(p̂mk)

• Gini index and the cross-entropy are very similar numerically.

40
TREES
Similarities for a 2-class problem

41
TREES
Example: Heart dataset
• These data contain a binary outcome HD for 303 patients who presented with chest pain.

• An outcome value of Yes indicates the presence of heart disease based on an angiographic test, while No means no heart disease.

• There are 13 predictors including Age, Sex, Chol (a cholesterol measurement), and other heart and lung function measurements.

• Remark: decision trees can be constructed even in the presence of qualitative predictor variables.
42
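A minimal sketch of fitting a classification tree to the Heart data with scikit-learn is given below. It assumes the ISLR file "Heart.csv" is available locally and that the outcome column is named AHD (both assumptions); unlike R's tree functions, scikit-learn needs qualitative predictors to be dummy-encoded first.

# Sketch of a classification tree on the Heart data, assuming the file
# "Heart.csv" is available locally (hypothetical path).  scikit-learn trees
# need numeric inputs, so qualitative predictors are dummy-encoded first.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

heart = pd.read_csv("Heart.csv").dropna()
y = heart["AHD"]                              # Yes / No heart disease
X = pd.get_dummies(heart.drop(columns=["AHD"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))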
TREES
Example: Heart dataset

Unpruned Tree
43
TREES
Example: Heart dataset

CV, training and test errors


44
TREES
Example: Heart dataset

Final Subtree
45
TREES
Trees versus Linear models

Top Row: True linear boundary;


Bottom row: true non-linear boundary.
Left column: linear model;
Right column: tree-based model

46
TREES
Advantages

Easy to explain (easier to explain than linear regression)!

Trees can be displayed graphically, and are easily interpreted.

Trees can easily handle qualitative predictors without the need to create
dummy variables

47
TREES
Disadvantages
Unfortunately, trees are typically not competitive with the best supervised learning approaches in terms of prediction accuracy.

Trees can be very non-robust: a small change in the data can cause a large change in the final estimated tree.

By aggregating many decision trees, the predictive performance of trees can be substantially improved!

48
Overview
➢Trees

➢Bagging

➢Random Forest

➢Boosting

49
BAGGING

• Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.

• It is particularly useful and frequently used in the context of decision trees.

• Given a set of n independent observations Z1, …, Zn, each with variance σ², the variance of the mean Z̄ of the observations is σ²/n.

• In other words, averaging a set of observations reduces variance.


50
BAGGING
Regression
• Averaging over many training sets is not practical, as in general multiple training sets are not available.
• Bootstrap: take repeated samples from the (single) training data set.

https://bradleyboehmke.github.io/HOML/process.html
51
BAGGING
Regression
• Generate B different bootstrapped training datasets.

• Train the method on the bth bootstrapped training set in order to get f̂*b(x), the prediction at a point x.

• Bagging: average all the predictions to obtain

  f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂*b(x)

52
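A minimal sketch of bagging for regression, written out explicitly rather than with a library ensemble class: draw B bootstrap samples, fit a deep tree to each, and average the B predictions. The synthetic data is only for illustration.

# Minimal sketch of bagging for regression: fit one deep tree per bootstrap
# sample and average the B predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
B = 100

preds = []
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))       # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx])   # deep, unpruned tree
    preds.append(tree.predict(X))

f_bag = np.mean(preds, axis=0)                       # f_bag(x) = (1/B) * sum_b f*b(x)
print(f_bag[:5])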
BAGGING
Remarks
• Bagging can improve predictions for many regression methods, but it is particularly
useful for decision trees.

• These trees are grown deep, and are not pruned.

• Hence each individual tree has high variance. Averaging these B trees reduces the
variance.

• Bagging has been demonstrated to give impressive improvements in accuracy by combining together hundreds or even thousands of trees into a single procedure.
53
BAGGING
Classification

For each test observation, record the class predicted by each of the B trees,
and take a majority vote: the overall prediction is the most commonly
occurring class among the B predictions.

54
BAGGING
Out-of-bag Error Estimation
• A bagged model can be evaluated on test data.

• There is also a very straightforward way to estimate the test error of a bagged model.

• In bagging, trees are repeatedly fit to bootstrapped subsets of the observations.

• It can be shown that, on average, each bagged tree makes use of around 2/3 of
the observations.

55
BAGGING
Out-of-bag Error Estimation
• The remaining 1/3 of the observations not used to fit a given bagged tree are
referred to as the out-of-bag (OOB) observations.

• The response for the ith observation can be predicted using each of the trees
in which that observation was OOB.

• This will yield around B/3 predictions for the ith observation.

• Take the average (or majority vote) of these predictions to obtain the OOB prediction for the ith observation.


56
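One way to obtain the OOB error in practice is sketched below using scikit-learn's BaggingRegressor with oob_score=True, which keeps, for each observation, the predictions of the trees whose bootstrap sample did not contain it (assumed setup, synthetic data).

# Sketch of out-of-bag error estimation with scikit-learn's bagging ensemble.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=200,
                       oob_score=True, random_state=0).fit(X, y)
print("OOB R^2:", bag.oob_score_)
print("OOB MSE:", mean_squared_error(y, bag.oob_prediction_))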
BAGGING
Variable Importance Measures
• Bagging typically results in improved accuracy over prediction using a single tree.

• One advantage of trees was an easy interpretation of results

• With a large number of trees in bagging, it is no longer clear which variables are most important to the procedure.

• Bagging improves prediction accuracy at the expense of interpretability.


57
BAGGING
Variable Importance Measures
• An overall summary of the importance of each predictor can be obtained
using the RSS (regression) or the Gini index (classification).

• Record the total amount that the RSS is decreased due to splits over a given
predictor, averaged over all B trees.

• Similarly, for classification, add up the total amount that the Gini index is
decreased by splits over a given predictor, averaged over all B trees.

• A large value indicates an important predictor.

58
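A sketch of reading off impurity-based variable importances from a bagged ensemble; here RandomForestClassifier with max_features=None is used as a stand-in for plain bagging, and scikit-learn reports the importances normalized to sum to 1.

# Total decrease in node impurity (Gini here) from splits on each predictor,
# averaged over all trees in the ensemble.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=500, max_features=None,  # max_features=None -> plain bagging
                                random_state=0).fit(data.data, data.target)

importance = pd.Series(forest.feature_importances_, index=data.feature_names)
print(importance.sort_values(ascending=False).head(10))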
BAGGING
Example: Heart dataset

59
BAGGING
Limitation

• Suppose that there is one very strong predictor in the data set, along with a
number of other moderately strong predictors.
• Then in the collection of bagged trees, most or all of the trees will use this
strong predictor in the top split.
• Consequently, all of the bagged trees will look quite similar to each other -->
highly correlated predictions.
• Averaging many highly correlated quantities does not lead to a large reduction
in variance
• In this case, bagging will not lead to a substantial reduction in variance
60
Overview
➢Trees

➢Bagging

➢Random Forest

➢Boosting

61
RANDOM FOREST

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. This reduces the variance after averaging the trees.

https://builtin.com/data-science/random-forest-algorithm

62
RANDOM FOREST
• As in bagging, build a number of decision trees on bootstrapped training
samples.
• But when building these decision trees, each time a split in a tree is
considered, a random selection of m predictors is chosen as split candidates
from the full set of p predictors.

• The split is allowed to use only one of those m predictors.

63
RANDOM FOREST
• As in bagging, build a number of decision trees on bootstrapped training
samples.

• But when building these decision trees, each time a split in a tree is
considered, a random selection of m predictors is chosen as split candidates
from the full set of p predictors. The split is allowed to use only one of those
m predictors.

• A fresh selection of m predictors is taken at each split, and typically m ≈ √p (e.g. m = 4 for the Heart data, where p = 13).
64
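In scikit-learn, the only change needed to turn bagging into a random forest is the max_features argument, as sketched below on a built-in dataset (used here purely for illustration).

# Contrasting bagging (m = p) with a random forest (m ~ sqrt(p)):
# only max_features changes.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bagging = RandomForestClassifier(n_estimators=300, max_features=None, random_state=0)
forest = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)

for name, model in [("bagging (m = p)", bagging), ("random forest (m ~ sqrt(p))", forest)]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(name, round(acc, 3))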
RANDOM FOREST
Example: Gene expression data
• A high-dimensional biological data set consisting of expression measurements
of 4,718 genes measured on tissue samples from 349 patients.

• There are around 20,000 genes in humans, and individual genes have different
levels of activity, or expression, in particular cells, tissues, and biological
conditions.

• Each of the patient samples has a qualitative label with 15 different levels: either normal or one of 14 different types of cancer.
65
RANDOM FOREST
Example: Gene expression data
• Use random forests to predict cancer type based on the 500 genes that have the
largest variance in the training set.

• Randomly divide the observations into a training and a test set, and apply
random forest to the training set for 3 different values of the number of
splitting variables m.

66
RANDOM FOREST
Example: Gene expression data

67
Overview
➢Trees

➢Bagging

➢Random Forest

➢Boosting

68
BOOSTING

• Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification.

• Recall that bagging fits a separate decision tree (independent of the other trees) to each bootstrap sample, and then combines all of the trees to create a single predictive model.

• Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees.

• Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.
69
BOOSTING
• Uses voting/averaging, but models are weighted according to their performance.

• Iterative procedure: new models are influenced by the performance of previously built ones.

• New model is encouraged to become an expert for instances classified incorrectly by earlier models.

• Intuitive justification: models should be experts that complement each other.

• There are several variants of this algorithm (e.g. AdaBoost, LPBoost, etc.)
BOOSTING

https://en.wikipedia.org/wiki/Boosting_(machine_learning)#/media/File:Ensemble_Boosting.svg
BOOSTING

https://towardsdatascience.com/boosting-algorithms-explained-d38f56ef3f30
72
BOOSTING

• Fitting a single large decision tree to the data (fitting hard) could potentially
result in overfitting

• The boosting approach is to learn slowly.

• Given the current model, Boosting fits a decision tree to the residuals from the
model. The new decision tree is added into the fitted function in order to
update the residuals.

73
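A minimal sketch of this learn-slowly idea for regression: each small tree is fit to the current residuals, and a shrunken version of its prediction is added to the model. The shrinkage value and tree depth below are illustrative choices.

# Repeatedly fit a small tree to the current residuals and add a shrunken
# version of it to the fitted function.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
B, lam, d = 500, 0.01, 1                    # number of trees, shrinkage, tree depth

prediction = np.zeros(len(y))
residuals = y.copy()
trees = []
for b in range(B):
    tree = DecisionTreeRegressor(max_depth=d).fit(X, residuals)   # fit to residuals
    update = lam * tree.predict(X)
    prediction += update                    # update the fitted function ...
    residuals -= update                     # ... and the residuals
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))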
BOOSTING
Tuning parameters
• Unlike bagging and random forests, boosting can overfit if B is too large
(although this overfitting tends to occur slowly if at all). Cross-validation could
be used to select the number of trees B.

• The shrinkage parameter λ, a small positive number. This controls the rate at which boosting learns (typical values are 0.01 or 0.001, and the right choice can depend on the problem). Very small λ can require using a very large value of B in order to achieve good performance.

74
BOOSTING
Tuning parameters

• The number of splits d in each tree, which controls the complexity of the
boosted ensemble. Often d = 1 works well, in which case each tree is a stump,
consisting of a single split and resulting in an additive model. More generally d
is the interaction depth, and controls the interaction order of the boosted
model, since d splits can involve at most d variables.

75
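One possible way to set these three tuning parameters with scikit-learn's gradient boosting is sketched below: B maps to n_estimators, the shrinkage λ to learning_rate, and d to max_depth, with B chosen by cross-validation as suggested on the previous slide.

# Mapping the three tuning parameters to scikit-learn's gradient boosting:
# B -> n_estimators, lambda -> learning_rate, d -> max_depth.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

grid = GridSearchCV(
    GradientBoostingRegressor(learning_rate=0.01, max_depth=1, random_state=0),
    param_grid={"n_estimators": [100, 500, 1000, 2000]},   # choose B by cross-validation
    cv=5, scoring="neg_mean_squared_error",
).fit(X, y)
print(grid.best_params_)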
BOOSTING
Example: Gene expression data

76
BOOSTING
Example

77
BOOSTING
Example

78
SUMMARY
• Decision trees are simple and interpretable models for regression and classification

• However they are often not competitive with other methods in terms of prediction
accuracy

• Bagging, random forest and boosting are good methods for improving the prediction
accuracy of trees. They work by growing many trees on the training data and then
combining the predictions of the resulting ensemble of trees.

• Random forest and boosting are among the state-of-the-art methods for supervised
learning. However their results can be difficult to interpret.
79
80
