
CHL5230H - Applied Machine Learning for Health Data

LECTURE 7: TREE BASED METHODS

Nicholas Mitsakakis
[email protected]

Dalla Lana School of Public Health


University of Toronto

February 26, 2025


Introduction

- Tree-based methods split the feature space in a recursive way, using one feature at a time
- The segmentation of the space can be summarized and visualized by a tree-like structure
- They can be used for both classification and regression
- They are often called decision trees
- One of the most popular methods, Classification And Regression Trees (CART), was introduced by Leo Breiman and colleagues around 1984


Classification tree example


Recursive partitioning

- The whole set of data is partitioned in an iterative way
- Each partition is made based on:
  - the selection of a feature, and
  - a threshold, if the feature is numerical, or
  - a value, if it is categorical


Recursive partitioning

- E.g., for the babies dataset, if wt1 is selected with threshold 100, the space is split into R1 = {Data: wt1 < 100} and R2 = {Data: wt1 ≥ 100}
- Or, e.g., if "race = black" is selected, the space is split into R1 = {Data: race = black} and R2 = {Data: race ≠ black}
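A minimal sketch of such a split in Python, assuming a pandas DataFrame named babies with columns wt1 and race (the data below is a hypothetical stand-in, following the example above):

    import pandas as pd

    # Hypothetical stand-in for the babies dataset
    babies = pd.DataFrame({
        "wt1": [95, 110, 130, 87, 142],
        "race": ["black", "white", "black", "asian", "white"],
    })

    # Numerical feature: split on a threshold
    R1 = babies[babies["wt1"] < 100]   # {Data: wt1 < 100}
    R2 = babies[babies["wt1"] >= 100]  # {Data: wt1 >= 100}

    # Categorical feature: split on a value
    S1 = babies[babies["race"] == "black"]  # {Data: race = black}
    S2 = babies[babies["race"] != "black"]  # {Data: race != black}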


Recursive partitioning

- After the initial partition is made, each one of the two regions can be further partitioned
- This is done by again selecting a feature and threshold within each one of the partitions
- The result has a tree-like structure


Recursive (and non-recursive) partitioning


How do we apply this algorithmic method for regression and classification?

- How is the partitioning feature selected?
- If it is continuous, how is the threshold selected?
- When does this procedure stop?
- What does the trained model represent?
- Given that we have a trained model, how do we make predictions?


For classification

- The feature and the threshold are chosen based on some criterion
- The goal is to generate regions that are as "pure" as possible
- There are different mathematical functions for measuring the "purity" of a "data region" (or node) m in a classification problem (e.g., the Gini index, cross-entropy)
- The general idea of purity is the homogeneity of a node (i.e., region): it should contain data from the same class, as much as possible


Measuring purity

- Gini index: G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})
- Cross-entropy: D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}
- \hat{p}_{mk} represents the proportion of training observations in the m-th region that are from the k-th class
- Both of these measures take values close to 0 if the \hat{p}_{mk} are either close to 0 or close to 1
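As a sketch, both impurity measures can be computed directly from the class proportions of a node; the function names below are ours, not from the lecture:

    import numpy as np

    def gini(p):
        """Gini index G = sum_k p_k (1 - p_k) for class proportions p."""
        p = np.asarray(p, dtype=float)
        return float(np.sum(p * (1.0 - p)))

    def cross_entropy(p):
        """Cross-entropy D = -sum_k p_k log p_k, treating 0 log 0 as 0."""
        p = np.asarray(p, dtype=float)
        nz = p[p > 0]  # drop zero proportions to avoid log(0)
        return float(-np.sum(nz * np.log(nz)))

    # A nearly pure node scores close to 0; a 50/50 node scores much higher
    print(gini([0.95, 0.05]), cross_entropy([0.95, 0.05]))  # ~0.095, ~0.199
    print(gini([0.5, 0.5]), cross_entropy([0.5, 0.5]))      # 0.5, ~0.693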


Purity of partitions


When does the splitting stop?

- Different rules can be applied
- E.g., stop when no further split of any of the nodes improves the "purity"
- Also, stop before any node ends up with fewer than n observations (e.g., 5 data points)
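In scikit-learn, for instance, stopping rules of this kind are exposed as hyperparameters of the tree learner; a sketch (the specific values are arbitrary illustrations, not recommendations from the lecture):

    from sklearn.tree import DecisionTreeClassifier

    # Growing stops wherever a split would violate one of these constraints
    clf = DecisionTreeClassifier(
        min_samples_leaf=5,           # no leaf may end up with fewer than 5 observations
        min_impurity_decrease=0.001,  # skip splits that barely improve purity
        criterion="gini",             # impurity measure ("entropy" is also available)
    )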


What does the trained model look like?

- It is a tree-like structure
- Every node corresponds to a split based on a feature and, possibly, some threshold
- The terminal nodes are called leaves
- Each leaf is assigned class membership probabilities, based on the proportions of the classes in the node


Classification tree example

Here, P(survival) for the leaves is 0.73, 0.17, 0.05, and 0.89, respectively.


How to make predictions

- Suppose we have a trained tree model and a new observation that we want to classify using its feature values
- Given the values of its features, the new observation "goes down the tree" and ends up in one of the leaves
- It is then assigned the membership probabilities of that particular leaf
- The membership probabilities can then be used for classification
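A short scikit-learn sketch of this prediction step, using a small synthetic dataset (the data and variable names are hypothetical illustrations):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Tiny synthetic training set, for illustration only
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 2))
    y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

    clf = DecisionTreeClassifier(min_samples_leaf=5).fit(X_train, y_train)

    # Each new observation goes down the tree; predict_proba returns the
    # class proportions of the leaf it lands in
    X_new = np.array([[0.5, -0.2], [-1.0, -1.0]])
    print(clf.predict_proba(X_new))  # membership probabilities per example
    print(clf.predict(X_new))        # most probable class per example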


Missing data

- Predictions can be generated for new examples with missing values for some of the variables
- The observation goes down the tree as far as it can go
- i.e., it stops at the first node whose split involves a variable with a missing value
- The membership probabilities of that node are used for the prediction
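scikit-learn trees do not stop at internal nodes this way, so here is a small pure-Python sketch of the idea, using a hypothetical nested-dict tree in which every node stores the class proportions of the training data that reached it:

    # Hypothetical tree: internal nodes carry a (feature, threshold) split
    # plus class proportions; leaves carry proportions only
    tree = {
        "feature": "wt1", "threshold": 100, "probs": {"yes": 0.6, "no": 0.4},
        "left": {"probs": {"yes": 0.9, "no": 0.1}},  # wt1 < 100 (leaf)
        "right": {"feature": "age", "threshold": 30,
                  "probs": {"yes": 0.4, "no": 0.6},
                  "left": {"probs": {"yes": 0.7, "no": 0.3}},
                  "right": {"probs": {"yes": 0.2, "no": 0.8}}},
    }

    def predict_with_missing(node, x):
        """Descend until a leaf, or until the split variable is missing in x."""
        while "feature" in node:
            value = x.get(node["feature"])
            if value is None:  # missing value: stop and use this node's proportions
                break
            node = node["left"] if value < node["threshold"] else node["right"]
        return node["probs"]

    print(predict_with_missing(tree, {"wt1": 120, "age": None}))  # stops at the age node
    print(predict_with_missing(tree, {"wt1": 90, "age": 25}))     # reaches a leaf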


Pruning the tree

- The procedure described above can generate a very large tree
- As a model this can be very complex and sensitive to the training data, leading to overfitting
- We fix this by pruning the tree
- After the full tree is generated, some of the terminal branches are eliminated (cut off, pruned)
- This gives a simpler model


Choosing the amount of pruning

- Suppose T_full is the fully developed tree and T ⊂ T_full is a pruned subtree of T_full
- The chosen T is the one that minimizes the loss function

  (# of misclassified training observations) + α|T|,

  where |T| is the number of terminal nodes (leaves) of T
- The selection depends on the value of the complexity parameter α
- Notice the similarity with lasso and ridge regression
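scikit-learn implements this idea as minimal cost-complexity pruning through the ccp_alpha parameter; its criterion uses total leaf impurity rather than the raw misclassification count, but the role of α is the same. A sketch with arbitrary synthetic data:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)

    # A larger ccp_alpha penalizes leaves more heavily, giving a smaller tree
    full = DecisionTreeClassifier(ccp_alpha=0.0).fit(X, y)
    pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)
    print(full.get_n_leaves(), pruned.get_n_leaves())  # |T| shrinks with pruning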


Choosing the amount of pruning

- How do we decide on the value of α?
- It is decided using k-fold cross-validation. For each candidate value of α:
  - trees are developed and pruned using k − 1 folds
  - the misclassification error is estimated using the remaining fold
  - the error is averaged over all k iterations
- The α value that minimizes that error is selected
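One way to carry this out in scikit-learn is to cross-validate over the candidate α values returned by cost_complexity_pruning_path; a sketch with synthetic data:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Candidate alphas from the cost-complexity pruning path of the full tree
    path = DecisionTreeClassifier().cost_complexity_pruning_path(X, y)
    alphas = np.unique(np.clip(path.ccp_alphas, 0, None))  # guard against tiny negatives

    # k-fold cross-validation (here k = 5) picks the alpha with the lowest
    # average validation error
    search = GridSearchCV(DecisionTreeClassifier(), {"ccp_alpha": alphas}, cv=5)
    search.fit(X, y)
    print(search.best_params_["ccp_alpha"])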


Pruning example (from ISLR)


[Figure from ISLR, Heart data: the full unpruned classification tree (splits on Thal, Ca, Slope, Oldpeak, MaxHR, ChestPain, Age, RestECG, RestBP, Chol, and Sex), the pruned subtree (keeping the splits on Thal, Ca, MaxHR, and ChestPain), and a plot of training, cross-validation, and test error against tree size.]

Regression trees

- They behave very similarly to classification trees
- The split is selected by minimizing the error

  \sum_{i: x_i \in R_1} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2} (y_i - \hat{y}_{R_2})^2

- The predictions \hat{y}_{R_1}, \hat{y}_{R_2} are the mean values of y inside R_1 and R_2
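A minimal illustration of this criterion for a single numeric feature (our own sketch, not code from the lecture): scan the candidate thresholds and keep the one minimizing the two-region sum of squared errors.

    import numpy as np

    def best_split(x, y):
        """Return the threshold on x minimizing SSE(R1) + SSE(R2)."""
        best_t, best_sse = None, np.inf
        for t in np.unique(x)[1:]:  # candidate thresholds between observed values
            left, right = y[x < t], y[x >= t]
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if sse < best_sse:
                best_t, best_sse = t, sse
        return best_t, best_sse

    x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
    y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.0])
    print(best_split(x, y))  # splits at x = 10, separating the two clusters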


Predictions

- Like in classification trees, every new example "goes down the tree" to a terminal node
- The average value of y for the training data in that node is the predicted value for the new example


Comparison with linear model

[Figure: four panels plotting X2 against X1, comparing a linear decision boundary with the axis-parallel rectangular partitions produced by a tree on a two-dimensional feature space.]


Discussion

- Trees are very interpretable
- They offer a nice visualization
- They can model non-linear relationships
- They can model non-symmetric interactions
- But they often do not have very competitive predictive performance
- For that reason, they are often combined to form other, more complex and flexible models
- These are often called ensemble models