
Tree Based Methods

Which ML technique is most appropriate?

2
Which ML technique is most appropriate?

3
Tree Based Methods
• Capable of representing non-linear relationships
• Easy to interpret (if … then … rules)

4
Decision Trees
• Supervised learning algorithm
• Applicable to continuous and discrete responses
• Example:

5
Decision Trees: Advantages
• Advantages
– Can handle various situations:
• Sparse, skewed, continuous, and categorical data
• Less influenced by outliers and missing values
– No need to guess the form of the relationship between the response variable and the predictors
• e.g. linear, as required in Linear Regression
– Simple to understand
– Easy to interpret
– Minimal data preparation
– Low algorithmic complexity: prediction cost is roughly logarithmic in the number of data points
– Implicitly performs feature selection
6
Decision Trees: Limitations
• Limitations
– Over-complexity at times: the tree may not be a generalized representation of the data (over-fitting)
– Instability: small changes in the data can lead to large structural changes in the tree
– If the predictor → response relationship is not well approximated by rectangular sub-spaces, high prediction errors will result
– Algorithms are heuristic: a global optimum is not guaranteed
– Biased trees if some classes dominate
– Not really appropriate for continuous variables (?)
• Many of these limitations are overcome by methods like:
• Random Forest, Bootstrap Aggregation (BAGGING)
7
Tree Based Methods
• Belong to “non-parametric” techniques
– No assumptions about the nature of the relationship between the predictors and the response variable

• Tree based models
– Decision trees
– Random Forest
– Boosted trees

• Can be used for
– Regression
– Classification
8
Tree Based Methods
• Basic idea
– Partition the solution space into rectangular areas
• Which predictors to use? Where to split?
• Decided by minimizing a cost function
– Within each rectangle, fit a model (… a constant)
• Training stops when each leaf node contains at least a specified minimum number of training instances

• Used for both Classification and Regression (CART)
– CART is also the name of the algorithm

9
Classification and Regression Trees
Algorithm
• All “features” are “searched” to find the optimum split
• Once a split happens, each resulting partition is then recursively split
– By searching all “features” for the optimum split, as above
– Until a termination criterion is reached
• Termination criteria:
– Number of terminal nodes
– Depth of the tree
10
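The exhaustive search can be written out directly. Below is a minimal sketch, assuming a NumPy feature matrix X and a continuous response y; the function names (best_split, grow) and the stopping settings are illustrative, not taken from any particular library.

```python
import numpy as np

def best_split(X, y):
    """Search every feature and every threshold; return the split with the lowest SSE."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t)
    return best  # (SSE, feature index, threshold) or None

def grow(X, y, depth=0, max_depth=3, min_leaf=5):
    """Recursively partition until a termination criterion (depth / rough node size) is hit."""
    split = best_split(X, y)
    if split is None or depth == max_depth or len(y) < 2 * min_leaf:
        return {"prediction": y.mean()}          # leaf: fit a constant
    _, j, t = split
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": grow(X[mask], y[mask], depth + 1, max_depth, min_leaf),
            "right": grow(X[~mask], y[~mask], depth + 1, max_depth, min_leaf)}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 2))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
    print(grow(X, y, max_depth=2))
```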
Regression Trees: Explanation
Consider the following data set

Type of learning:
– Supervised

Technique to be used:
– Recursively Partitioned Regression Tree

11
Regression Tree: Algorithm

[Slides 12–19: step-by-step illustration of the recursive split search; figures not reproduced]
Regression Tree: Algorithm
SPLIT   SSE
X=1     164
X=2     123.354
X=3     84.823
X=4     63.548
X=5     59.948
X=6     57.328
X=7     71.35
X=8     120.84

The candidate split at X = 6 gives the lowest SSE (57.328) and is therefore selected.

20
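A table like the one above is produced by computing the SSE of every candidate split of a single predictor X. The snippet below shows the calculation pattern; the (x, y) values are hypothetical stand-ins, since the actual data behind the slides is not reproduced here.

```python
import numpy as np

# Hypothetical 1-D data; the actual values behind the table above are not shown in the slides.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([2.0, 3.5, 4.1, 8.0, 8.5, 9.2, 15.0, 16.3, 17.1])

for split in range(1, 9):                      # candidate splits X = 1 .. 8
    left, right = y[x <= split], y[x > split]
    sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
    print(f"X = {split}: SSE = {sse:.3f}")     # pick the split with the smallest SSE
```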
Regression Tree: Algorithm

21
Pruning of Regression Trees
Pruning of the tree is required when
– The size of the tree is large
– And hence there is a possibility of over-fitting the training data set
Pruning is carried out using
– Cost-complexity tuning
– That is, minimize: SSE(T) + cp × (number of terminal nodes in T)
– cp is known as the complexity parameter
– The best pruned tree is found by varying cp over a range
– SSE or RMSE is used as the selection criterion

22
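cp is the complexity parameter of R's rpart package; scikit-learn exposes the analogous ccp_alpha. A minimal sketch of tuning it over the pruning path, assuming synthetic data in X and y (the data and scoring choice are illustrative only):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Candidate complexity values come from the full tree's pruning path;
# the best one is chosen by cross-validated squared error (an RMSE-style criterion).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
scores = [
    (alpha, cross_val_score(DecisionTreeRegressor(ccp_alpha=alpha, random_state=0),
                            X, y, cv=5, scoring="neg_mean_squared_error").mean())
    for alpha in path.ccp_alphas
]
best_alpha = max(scores, key=lambda s: s[1])[0]
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, pruned.get_n_leaves())
```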
Pruning of Regression Trees
Consider the following data

23
Full Tree

24
Trees with cp = 0, 0.1, 0.2

25
Trees with cp = 0, 0.1, 0.2

26
Surrogate Splits
A technique for handling missing values
• Missing data is ignored when the splits are made
• However, alternate splits whose results are similar are also remembered
• If the primary predictor is not available at some split
– One of the surrogate splits is used instead

A number of surrogate splits may be stored for each split in the tree

27
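A toy sketch of the idea (not rpart's actual bookkeeping): the primary split is stored alongside one surrogate that partitioned the training data similarly, and the surrogate is consulted only when the primary predictor is missing. Feature names and thresholds are hypothetical.

```python
import math

# Hypothetical primary split and one stored surrogate whose partition was similar on training data.
PRIMARY   = ("income", 50_000)   # go left if income <= 50,000
SURROGATE = ("age", 40)          # used only when income is missing

def route(row):
    feature, threshold = PRIMARY
    value = row.get(feature)
    if value is None or (isinstance(value, float) and math.isnan(value)):
        feature, threshold = SURROGATE          # fall back to the surrogate split
        value = row[feature]
    return "left" if value <= threshold else "right"

print(route({"income": 42_000, "age": 55}))     # primary split used -> left
print(route({"income": None, "age": 30}))       # income missing: surrogate used -> left
```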
Classification Trees

28
How do splits happen in Classification Trees?
Criteria for creating splits in trees

• Continuous response variable: splitting based on
– Variance

• Categorical response variable: splitting based on
– Classification error rates
– Gini Index
– Entropy

29
Classification Trees: Gini Score
• In classification trees
– Goal: Partition data into smaller homogeneous groups
– “Class” of the data point is determined by “mode”
– One of the methods for branching: Gini Score

• Given p1 and p2 as the probabilities of Class-1 and Class-2 respectively in a node, the Gini Score of the node is defined as:

G = p1 * (1-p1) + p2 * (1-p2)
G = 2 * p1 * p2 … for a two-class problem
30
Derivation of Gini Score for 2 Class Situation
For a two-class node, p2 = 1 – p1, so
G = p1 * (1-p1) + p2 * (1-p2) = p1 * p2 + p2 * p1 = 2 * p1 * p2

Gini: Sounds like "Jee-nee" (with a soft 'g' like in "giraffe" and the 'i' like in "jean").
31
Understanding the Gini Score
• The Gini Score represents the impurity of the data
• The lower the score, the higher the purity, as illustrated in the figure below

32
Calculating and Interpreting Gini Scores
The following figures illustrate Gini Score calculations when a dataset is subdivided

• When sub-divided, as shown, the overall Gini Score is the weighted sum of the Gini Scores of the parts. After the split we see that the overall score has decreased.
• We say that the split has improved the classification quality by increasing class purity in the resulting subsets.
33
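A small illustration of the weighted-sum calculation, using hypothetical class counts (40/40 in the parent node, 30/10 and 10/30 in the two children):

```python
def gini(counts):
    """Gini impurity of a node given class counts: G = sum_k p_k * (1 - p_k)."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

# Hypothetical parent node with 40 of Class-1 and 40 of Class-2,
# split into a left child (30, 10) and a right child (10, 30).
parent, left, right = [40, 40], [30, 10], [10, 30]
n_left, n_right = sum(left), sum(right)
weighted = (n_left * gini(left) + n_right * gini(right)) / (n_left + n_right)
print(gini(parent), weighted)   # 0.5 before the split, 0.375 after: impurity has dropped
```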
Example: Decision based on Gini Index
• The overall Gini Score of a split is the weighted sum of the Gini Index of the individual nodes

[Slides 34–38: worked example figures not reproduced]
Example: Decision based on Gini Index
If the main node is split on the basis of Gender

39
Classification Trees: Cross-Entropy
• Entropy: an alternative to the Gini Index
• Cross-entropy (for a two-class node):

D = – [ p1 * log(p1) + p2 * log(p2) ]

• Interpretation:
– The purer the node, the closer the cross-entropy is to zero
– Cross-entropy and the Gini Index take similar values

40
Classification Trees: Classification Error Rates
Classification error rate
– The fraction of training observations in the candidate region that do not belong to the most common class
– That is: E = 1 – max_k(p_k)

Observation:
– The classification error rate is not sensitive enough for growing trees

41
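The three criteria can be compared directly for a two-class node. The sketch below uses hypothetical class proportions; all three measures are zero for a pure node and largest at a 50/50 mix.

```python
import math

def error_rate(p):   # 1 - proportion of the most common class
    return 1 - max(p, 1 - p)

def gini(p):         # 2 * p1 * p2 for two classes
    return 2 * p * (1 - p)

def entropy(p):      # cross-entropy / deviance; defined as 0 for a pure node
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

for p in (0.5, 0.75, 0.9, 1.0):   # p = proportion of Class-1 in the node
    print(f"p1={p:4}  error={error_rate(p):.3f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```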
Governing Parameters in Tree-based Algorithms
• Parameters that need tuning:
– Minimum samples for a node-split
– Minimum samples for a terminal node
– Maximum depth of the tree
– Maximum number of terminal nodes
– Maximum features to consider for split
• Eg. sqrt(total_number_of_features)

42
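These parameters map directly onto scikit-learn's tree estimators. A short illustration follows (the argument names are scikit-learn's; the dataset and numeric values are illustrative, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    min_samples_split=10,   # minimum samples required to attempt a node split
    min_samples_leaf=5,     # minimum samples required in a terminal node
    max_depth=4,            # maximum depth of the tree
    max_leaf_nodes=16,      # maximum number of terminal nodes
    max_features="sqrt",    # features considered per split, e.g. sqrt(total features)
    random_state=0,
).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```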
Ensemble Techniques in Trees
• Ensemble
– A group of items viewed as a whole rather than
individually
– Eg. a group of musicians who play together
• In the context of trees
– Methods that combine many models’ predictions
• Specific techniques
– BAGGING
– Random Forest
– Boosting

43
Bootstrap Aggregation (BAGGING)
• Bootstrap Aggregation (BAGGING)
– General method for reducing the variance of a statistical
learning technique
• Method to reduce variance:
– Take a number of samples from the population:
• But this is not always possible
– Solution:
• Take repeated samples from the available data set
– Sampling with replacement
• Construct decision tree using each such sample
– Trees are grown deep, not pruned
– Such trees will have high variance but low bias
– Overall variance reduced by aggregating output from the trees
• Statistical basis of BAGGING: averaging B independent estimates, each with variance σ², gives an estimate with variance σ²/B
44
Bootstrap Aggregation (BAGGING)
• Aggregation
– Regression: take the mean of individual results
– Classification: majority vote
• Note:
– Results are more accurate
– But interpretability goes down
• Bagging improves prediction at the cost of interpretability
– Multiple full trees are generated:
• Hundreds, even thousands of trees may be generated
• Computational time increases

45
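A minimal hand-rolled sketch of the procedure: bootstrap samples drawn with replacement, deep unpruned trees, and predictions averaged at the end. The data is synthetic and the settings illustrative; scikit-learn's BaggingRegressor wraps the same idea.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=300)

n_trees, preds = 25, []
grid = np.linspace(0, 10, 200).reshape(-1, 1)
for _ in range(n_trees):
    idx = rng.integers(0, len(y), size=len(y))          # bootstrap: sample with replacement
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])  # deep, unpruned tree: low bias, high variance
    preds.append(tree.predict(grid))

bagged = np.mean(preds, axis=0)   # regression: average; classification would use a majority vote
print(bagged[:5])
```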
Data Set and Normal (Full) Tree

46
BAGGING of Trees
BAGGING with 25 iterations

47
Random Forest
• Essentially based on the BAGGING concept
– A number of trees are built by bootstrapping samples of the training data
– Handles the situation of “strong predictors”
• A few strong predictors make the bagged trees highly correlated
• ➔ Variance is not reduced as much as expected
• With one key tweak
– At each node, select a random subset of m predictors from the total p
– And search only those m predictors for the best split
• Advantages
– Lower variance (the advantage of BAGGING)
– Low bias (a large number of predictors can be considered)
– A very large number of predictors can be handled
48
Random Forest Algorithm

Applied Predictive Modeling, Kuhn and Johnson

49
Random Forest

50
Random Forest
• Tuning parameters
– Node size: can be kept small => goal: reduce bias
– Number of trees
– Number of predictors sampled per split:
• Guideline: one third of the predictors for regression
• Guideline: square root of the number of predictors for classification
• Limitations
– Extrapolation beyond the training range is not possible
– Training and prediction can be slow
– For classification: may not perform well when there are too many classes
51
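A short scikit-learn sketch following the guidelines above; the dataset and parameter values are only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

clf = RandomForestClassifier(
    n_estimators=500,        # number of trees
    max_features="sqrt",     # predictors sampled per split (classification guideline)
    min_samples_leaf=1,      # small node size keeps bias low
    random_state=0,
).fit(X, y)
print(clf.score(X, y))       # training accuracy

# For regression the usual guideline is roughly one third of the predictors, e.g.
# RandomForestRegressor(n_estimators=500, max_features=1/3, ...)
```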
Data Set 2: With Single Tree

52
Data Set 2: BAGGED Tree

53
Data Set 2: Random Forest

54
Data Set 2: Random Forest – Error Trend

55
Boosted Trees
• Essential idea
– Use a set of (potentially weak) classifiers to boost the overall results
• Boosting:
– There is no bootstrap sampling
– Each tree is grown using information from the previously grown trees
– Boosting uses the residuals from the previous tree to fit a new tree
• Boosted trees are very popular
– Originally developed for CLASSIFICATION
– Example: Adaptive Boosting (ADABOOST)

56
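A minimal sketch of the residual-fitting idea for regression described above, on synthetic data with shallow trees and a small learning rate; names and settings are illustrative, not a production implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=300)

n_trees, lr = 100, 0.1
pred = np.full_like(y, y.mean())       # start from a constant prediction
trees = []
for _ in range(n_trees):
    residual = y - pred                                      # what the ensemble still gets wrong
    t = DecisionTreeRegressor(max_depth=2).fit(X, residual)  # fit a small tree to the residuals
    pred += lr * t.predict(X)                                # shrink and add its contribution
    trees.append(t)

print(np.mean((y - pred) ** 2))        # training MSE after boosting
```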
Single Tree: All points with equal weights

57
Single Tree: First 100 points with weights=5

58
Single Tree: points 1–100 with weights = 5; points 500–720 with weights = 0

59
Boosting Algorithms
• First, create a tree with equal weights assigned to all observations
• Find the prediction error for each observation
• For observations in error, increase the weights; for observations not in error, decrease the weights
• Re-fit a new tree using the newly weighted observations
• Continue until the pre-decided number of trees has been fit
• Predicted value = weighted / voted value from all the trees generated
60
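A simplified AdaBoost-style loop implementing these steps for a two-class problem, passing the observation weights to each tree through scikit-learn's sample_weight argument; the data is synthetic and the loop omits the refinements of production implementations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=400, n_features=5, random_state=0)
y = 2 * y01 - 1                                   # recode labels to {-1, +1}

n_trees, w = 50, np.full(len(y), 1 / len(y))      # start with equal weights
trees, alphas = [], []
for _ in range(n_trees):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = stump.predict(X) != y
    err = w[miss].sum() / w.sum()
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # tree weight: better trees count more
    w *= np.exp(alpha * np.where(miss, 1.0, -1.0))      # up-weight errors, down-weight correct ones
    w /= w.sum()
    trees.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote of all the trees
votes = sum(a * t.predict(X) for a, t in zip(alphas, trees))
print(np.mean(np.sign(votes) == y))               # training accuracy
```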
Boosted Tree (100 trees)

61
Boosted Tree (500 trees)

62
Boosted Tree (100 trees)

63
Boosted Tree (500 trees)

64
