Tree Based Methods
Which ML technique is most appropriate?
Tree Based Methods
• Capable of representing non-linear relationships
• Easy to interpret (if … then … rules)
Decision Trees
• Supervised learning algorithm
• Applicable to continuous and discrete responses
• Example:
Decision Trees: Advantages
• Advantages
– Can handle various situations:
• Sparse, skewed, continuous, categorical
• Less influenced by outliers and missing values
– No need to assume the form of the relationship of the response
variable to the predictors
• E.g. linear, as required in Linear Regression
– Simple to understand
– Easy to interpret
– Data preparation required is minimal
– Algorithmic complexity is not high: ~log(data_points)
– Implicitly performs feature selection
Decision Trees: Limitations
• Limitations
– Over-complexity at times: the tree may not be a “generalized
representation” of the data
– Instability: small changes in the data can lead to large structural changes
– If the predictor → response relationship does not follow
rectangular sub-spaces
• high prediction errors will result
– Algorithms are heuristic: a global optimum is not guaranteed
– Biased trees, if some classes dominate
– Not really appropriate for continuous variables (?)
• Overcome using methods like:
• Random Forest, Bootstrap Aggregation (BAGGING)
Tree Based Methods
• Belong to “non-parametric” techniques
– No assumptions about the nature of the relationship
between the predictors and the response variable
• Tree based models
– Decision trees
– Random Forest
– Boosted trees
• Can be used for
– Regression
– Classification
Tree Based Methods
• Basic idea
– Partition the solution space into rectangular areas
• Which predictors to use? Where to split?
• Decided by minimizing a cost function
– Within each rectangle, fit a model (… a constant), as in the sketch below
• Training stops when at least a specified number of training
instances is assigned to each leaf node
• Use: Classification and Regression (CART)
– CART: also the name of the algorithm
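A minimal sketch of this idea, assuming scikit-learn's DecisionTreeRegressor and synthetic data (both assumptions, not the slides' own tooling): the tree partitions the predictor space and predicts a constant, the mean of the training responses, within each leaf.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 10, 80)).reshape(-1, 1)   # one predictor
    y = np.sin(X).ravel() + rng.normal(0, 0.2, 80)       # non-linear response

    # Splitting stops once a leaf would receive fewer than min_samples_leaf
    # training instances, mirroring the stopping rule described above.
    tree = DecisionTreeRegressor(min_samples_leaf=10).fit(X, y)

    print(tree.predict([[2.5]]))   # prediction = mean of the training y in that leaf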
Classification and Regression Trees
Algorithm
• All “features” are “searched” to find the optimum
split
• Once a split happens, each resulting partition is
then recursively split
– By searching all “features” for the optimum split, as above
– Until a termination criterion is reached
• Termination criteria:
– Number of terminal nodes
– Depth of the tree
Regression Trees: Explanation
Consider the following data set
Type of learning:
– Supervised
Technique to be used:
– Recursively Partitioned Regression Tree
Regression Tree: Algorithm
The split search: for each candidate split of the predictor X, the SSE of the
resulting partitions is computed (see the sketch after the table)
SPLIT   SSE
X=1     164
X=2     123.354
X=3     84.823
X=4     63.548
X=5     59.948
X=6     57.328
X=7     71.35
X=8     120.84
The split at X = 6, which gives the minimum SSE (57.328), is chosen.
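A sketch of how the SSE column in the table above can be computed: for each candidate split point, a constant (the mean) is fitted on each side and the squared errors are summed. The toy arrays are placeholders, since the slides' data set is not reproduced here.

    import numpy as np

    def split_sse(x, y, split):
        """SSE of a split: squared deviations from the mean on each side."""
        left, right = y[x <= split], y[x > split]
        sse = 0.0
        for part in (left, right):
            if part.size:
                sse += np.sum((part - part.mean()) ** 2)
        return sse

    x = np.arange(1, 11, dtype=float)                        # placeholder predictor
    y = np.array([2, 3, 5, 4, 6, 7, 12, 13, 14, 15], float)  # placeholder response

    sse_table = {s: split_sse(x, y, s) for s in range(1, 9)}  # candidates X = 1 .. 8
    best = min(sse_table, key=sse_table.get)                  # split with minimum SSE
    print(sse_table, best)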
Pruning of Regression Trees
Pruning of the tree is required when
– The size of the tree is large
– And hence there is a possibility of over-fitting the training
data set
Pruning is carried out using
– Cost complexity tuning
– That is: minimize SSE + cp × (number of terminal nodes)
– cp is known as the complexity parameter
– The best pruned tree is found by varying cp over a range
– SSE or RMSE is used as the selection criterion (see the sketch below)
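A minimal pruning sketch. The slides' cp is the rpart-style complexity parameter; scikit-learn (assumed here as the implementation) exposes the analogous knob as ccp_alpha, and the data below is synthetic.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 10, (120, 1))
    y = np.where(X.ravel() < 6, 2.0, 8.0) + rng.normal(0, 1.0, 120)

    # Candidate complexity parameters derived from the full (unpruned) tree
    path = DecisionTreeRegressor().cost_complexity_pruning_path(X, y)

    best_alpha, best_rmse = None, np.inf
    for alpha in path.ccp_alphas:
        scores = cross_val_score(DecisionTreeRegressor(ccp_alpha=alpha), X, y,
                                 scoring="neg_root_mean_squared_error", cv=5)
        rmse = -scores.mean()              # RMSE as the selection criterion
        if rmse < best_rmse:
            best_alpha, best_rmse = alpha, rmse

    pruned = DecisionTreeRegressor(ccp_alpha=best_alpha).fit(X, y)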
Pruning of Regression Trees
Consider the following data
Full Tree
Trees with cp = 0, 0.1, 0.2
Surrogate Splits
Technique for handling missing values
• Missing data is ignored when the splits are made
• However, alternate splits whose results are similar
are also remembered
• If predictors are not available at some split
– One of the surrogate splits is used
A number of surrogate splits may be stored
for each split in the tree
Classification Trees
How do splits happen in Classification Trees?
Criteria for creating splits in trees
• Continuous response variable: splitting based on
– Variance
• Categorical response variable: splitting based on
– Classification error rates
– Gini Index
– Entropy
Classification Trees: Gini Score
• In classification trees
– Goal: Partition data into smaller homogeneous groups
– The “class” predicted for a data point is the “mode” (most frequent class) of its leaf
– One of the methods for branching: Gini Score
• Given p1 and p2 as the probabilities of Class-1 and
Class-2 respectively of a node, the Gini Score of
the node is defined as:
G = p1 * (1-p1) + p2 * (1-p2)
G = (2 * p1 * p2) … for the two-class problem
Derivation of Gini Score for 2 Class Situation
Gini: Sounds like "Jee-nee" (with a soft 'g' like in "giraffe" and the 'i' like in "jean").
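With two classes, p2 = 1 − p1, so the two-class form quoted above follows in one line:

G = p1 * (1-p1) + p2 * (1-p2) = p1 * p2 + p2 * p1 = 2 * p1 * p2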
Understanding the Gini Score
• The Gini Score represents the impurity of the data
• The lower the score, the higher the purity (a pure node has G = 0;
an evenly mixed two-class node has G = 0.5)
Calculating and Interpreting Gini Scores
The following figures illustrate Gini Score calculations when a dataset is subdivided
• When sub-divided, as shown, the overall Gini Score is the weighted sum of the
Gini Score of each part. After the split we see that the overall score has reduced.
• We say that the split has improved the classification quality by increasing class
purity in the resulting subsets.
Example: Decision based on Gini Index
• The overall Gini Score of a split is the weighted
sum of the Gini Indices of the individual nodes
• If the main node is split on the basis of Gender,
the score of the split is computed as in the sketch below
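A sketch of the weighted-sum rule for a hypothetical Gender split; the class counts below are illustrative assumptions, not the numbers from the slides' figures.

    def gini(counts):
        """Gini Score of a node, given the class counts in that node."""
        n = sum(counts)
        return sum((c / n) * (1 - c / n) for c in counts)

    def split_gini(children):
        """Overall Gini of a split: weighted sum of the children's scores."""
        total = sum(sum(c) for c in children)
        return sum(sum(c) / total * gini(c) for c in children)

    parent = [20, 10]                  # e.g. 20 of Class-1, 10 of Class-2
    male, female = [13, 2], [7, 8]     # hypothetical split on Gender

    print(gini(parent))                # impurity before the split (~0.444)
    print(split_gini([male, female]))  # lower value (~0.364) => purer children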
Classification Trees: Cross-Entropy
• Entropy: An alternative to the Gini Index
• Cross-entropy of a node: D = - Σk pk * log(pk),
where pk is the proportion of class k in the node
• Interpretation:
– The purer the node, the closer the cross-entropy is to zero
– Cross-entropy and the Gini Index take similar values
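A small companion to the Gini sketch above, computing the cross-entropy of a node from its class counts (the counts are illustrative); like the Gini Score, it is zero for a pure node.

    from math import log

    def cross_entropy(counts):
        n = sum(counts)
        return -sum((c / n) * log(c / n) for c in counts if c > 0)

    print(cross_entropy([15, 0]))   # pure node -> 0.0
    print(cross_entropy([10, 10]))  # evenly mixed two-class node -> log(2) ~ 0.693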
Classification Trees: Classification Error Rates
Classification error rate
– The fraction of training observations in the candidate
region that do not belong to the most common class:
E = 1 - max_k(pk)
Observation:
– The classification error rate is not sensitive enough for
growing the trees
Governing Parameters in Tree-based Algorithms
• Parameters that need tuning:
– Minimum samples for a node-split
– Minimum samples for a terminal node
– Maximum depth of the tree
– Maximum number of terminal nodes
– Maximum features to consider for split
• E.g. sqrt(total_number_of_features), as in the sketch below
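How the parameters listed above map onto one concrete API; the scikit-learn parameter names are an assumption of this sketch and the values are purely illustrative.

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(
        min_samples_split=20,   # minimum samples required to split a node
        min_samples_leaf=5,     # minimum samples required at a terminal node
        max_depth=6,            # maximum depth of the tree
        max_leaf_nodes=30,      # maximum number of terminal nodes
        max_features="sqrt",    # e.g. sqrt(total_number_of_features) per split
    )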
Ensemble Techniques in Trees
• Ensemble
– A group of items viewed as a whole rather than
individually
– E.g. a group of musicians who play together
• In the context of trees
– Methods that combine many models’ predictions
• Specific techniques
– BAGGING
– Random Forest
– Boosting
Bootstrap Aggregation (BAGGING)
• Bootstrap Aggregation (BAGGING)
– General method for reducing the variance of a statistical
learning technique
• Method to reduce variance:
– Take a number of samples from the population
• But this is not always possible
– Solution:
• Take repeated samples from the available data set
– Sampling with replacement (bootstrapping)
• Construct a decision tree using each such sample
– Trees are grown deep, not pruned
– Such trees have high variance but low bias
– The overall variance is reduced by aggregating the output of the trees
• Statistical basis of BAGGING: averaging B independent estimates,
each with variance σ², yields an estimate with variance σ²/B
Bootstrap Aggregation (BAGGING)
• Aggregation
– Regression: take the mean of the individual results (as in the sketch below)
– Classification: majority vote
• Note:
– Results are more accurate
– But interpretability goes down
• Bagging improves prediction at the cost of interpretability
– Multiple full trees are generated:
• Hundreds, even thousands of trees may be generated
• Computational time increases
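A minimal sketch of the BAGGING loop described above: repeated bootstrap samples (sampling with replacement), a deep unpruned tree on each, and the mean of the individual predictions as the aggregate. The synthetic data, B = 25 and the scikit-learn trees are assumptions of the sketch.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(2)
    X = rng.uniform(0, 10, (200, 1))
    y = np.sin(X).ravel() + rng.normal(0, 0.3, 200)

    B = 25
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), len(X))   # bootstrap: sample with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # deep, unpruned

    X_new = np.array([[2.5], [7.0]])
    y_hat = np.mean([t.predict(X_new) for t in trees], axis=0)  # aggregate = mean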
Data Set and Normal (Full) Tree
BAGGING of Trees
BAGGING with 25 iterations
Random Forest
• Essentially based on the BAGGING concept
– A number of trees are built from bootstrapped samples of
the training data
– Handles the situation of “strong predictors”
• Strong predictors make the bagged trees highly correlated
• ➔ Variance is not reduced as much as expected
• With one key tweak
– At each node, select a random subset of m predictors out of the total p
– And search for the best split only among those m
• Advantages
– Lower variance (the advantage of BAGGING)
– Lower bias (the number of predictors considered can be tuned)
– A very large number of predictors can be handled
Random Forest Algorithm
(Source: Applied Predictive Modeling, Kuhn and Johnson)
Random Forest
• Tuning parameters (see the sketch after this list)
– Node size: can be small (goal: reduce bias)
– Number of trees
– Number of predictors sampled at each split (m):
• Guideline: one-third of the predictors (p/3) for regression
• Guideline: square root of the predictors (√p) for classification
• Limitations
– Cannot extrapolate beyond the range of the training data
– Training and prediction can be slow
– For classification, performance degrades when there are too
many classes
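A sketch of the guidelines above using scikit-learn's random-forest estimators (an assumed implementation, not the slides' own tooling); the hyperparameter values are illustrative.

    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    # Regression: consider roughly one-third of the predictors at each split
    reg = RandomForestRegressor(n_estimators=500, max_features=1/3,
                                min_samples_leaf=5)

    # Classification: consider roughly sqrt(p) predictors at each split
    clf = RandomForestClassifier(n_estimators=500, max_features="sqrt")

    # reg.fit(X_train, y_train); clf.fit(X_train, y_train)   # with your own data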
Data Set 2: With Single Tree
Data Set 2: BAGGED Tree
Data Set 2: Random Forest
Data Set 2: Random Forest – Error Trend
Boosted Trees
• Essential idea
– Use a set of (potentially weak) classifiers to boost the
overall result
• Boosting:
– There is no bootstrap sampling
– Each tree is grown using information from the previously grown trees
– Boosting fits each new tree to the residuals of the previous tree
• Boosted trees are very popular …
– Originally developed for CLASSIFICATION
– Example: Adaptive Boosting (ADABOOST)
Single Tree: All points with equal weights
Single Tree: First 100 points with weights=5
Single Tree: 1:100->wts=5; 500:720->wts=0
Boosting Algorithms
• First, build a tree with equal weights assigned
to all observations
• Compute the prediction error for each
observation
• For observations in error, increase the weights; for
observations not in error, decrease the weights
• Re-fit a new tree with the newly weighted
observations
• Continue until the pre-decided number of trees has
been fit
• Predicted value = weighted / voted value from all the
trees generated (see the sketch below)
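A sketch of the reweighting loop above in the style of ADABOOST; the specific weight-update rule, the synthetic data and the use of depth-1 trees are simplifying assumptions, not the exact scheme behind the slides' figures.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(3)
    X = rng.normal(size=(300, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    n_trees = 100
    w = np.full(len(X), 1 / len(X))           # start with equal weights
    trees, alphas = [], []
    for _ in range(n_trees):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        err = np.sum(w[miss]) / np.sum(w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # this tree's vote weight
        w *= np.exp(np.where(miss, alpha, -alpha))  # up-weight errors, down-weight the rest
        w /= w.sum()
        trees.append(stump)
        alphas.append(alpha)

    # Predicted value: weighted vote over all the trees generated
    votes = sum(a * (2 * t.predict(X) - 1) for a, t in zip(alphas, trees))
    y_hat = (votes > 0).astype(int)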
Boosted Tree (100 trees)
Boosted Tree (500 Trees)
Boosted Tree (100 trees)
Boosted Tree (500 trees)