15.1 Random Forest and Decision Tree
Definition
• A tree-like model that illustrates a series of events leading to certain decisions
• Each node represents a test on an attribute and each branch is an outcome of that test
Who to loan?
• Not a student
• 45 years old
• Medium income
• Fair credit record
➢ Yes
• Student
• 27 years old
• Low income
• Excellent credit record
➢ No
Decision Tree Learning
• We use labeled data to obtain a suitable decision tree for future predictions
➢ We want a decision tree that works well on unseen data, while asking as few
questions as possible
Decision Tree Learning
• Basic step: choose an attribute and, based on its values, split the data into
smaller sets
➢ Recursively repeat this step until we can surely decide the label
What is a good attribute?

gain(X, a) = entropy(X) − Σ_{v ∈ Values(a)} (|X_v| / |X|) · entropy(X_v)
Best attribute = highest information gain
In practice, we compute entropy(X) only once!

entropy(X) = −p_mammal · log2(p_mammal) − p_bird · log2(p_bird) = −(3/7) · log2(3/7) − (4/7) · log2(4/7) ≈ 0.985

entropy(X_color=brown) = −(1/3) · log2(1/3) − (2/3) · log2(2/3) ≈ 0.918        entropy(X_color=white) = 1
gain(X, color) = 0.985 − (3/7) · 0.918 − (4/7) · 1 ≈ 0.020

entropy(X_fly=yes) = 0        entropy(X_fly=no) = −(3/4) · log2(3/4) − (1/4) · log2(1/4) ≈ 0.811
gain(X, fly) = 0.985 − (3/7) · 0 − (4/7) · 0.811 ≈ 0.521
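For concreteness, a minimal Python sketch (my own, not from the slides) that reproduces these numbers on a hypothetical 7-sample dataset chosen to match the counts above (3 mammals / 4 birds; color: 3 brown / 4 white; fly: 3 can fly / 4 cannot):

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    # entropy(X) minus the weighted entropy of the subsets induced by `attribute`
    n = len(labels)
    total = entropy(labels)
    for v in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == v]
        total -= (len(subset) / n) * entropy(subset)
    return total

rows = [{"color": "brown", "fly": "no"},   # mammal
        {"color": "white", "fly": "no"},   # mammal
        {"color": "white", "fly": "no"},   # mammal
        {"color": "brown", "fly": "yes"},  # bird
        {"color": "brown", "fly": "yes"},  # bird
        {"color": "white", "fly": "yes"},  # bird
        {"color": "white", "fly": "no"}]   # bird
labels = ["mammal"] * 3 + ["bird"] * 4

print(round(entropy(labels), 3))                          # 0.985
print(round(information_gain(rows, labels, "color"), 3))  # 0.02
print(round(information_gain(rows, labels, "fly"), 3))    # 0.522 (the slide rounds intermediates, giving 0.521)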
ID3 Algorithm (Python)
# ID3 = Iterative Dichotomiser 3
def ID3(X):
    node = TreeNode(X)
    if all_points_have_same_class(X):
        node.label = majority_label(X)
    else:
        a = select_attribute_with_highest_information_gain(X)
        if gain(X, a) == 0:
            # no attribute is informative: stop and keep the majority label
            node.label = majority_label(X)
        else:
            node.attribute = a  # remember which attribute this node splits on
            for v in values(a):
                X_v = [x for x in X if x[a] == v]  # subset of X where attribute a takes value v
                node.children.append(ID3(X_v))
    return node
Gini Impurity
• Gini impurity measures how often a randomly chosen example would be incorrectly labeled if it was randomly labeled according to the label distribution

gini(X) = 1 − Σ_i p_i²

Δgini(X, a) = gini(X) − Σ_{v ∈ Values(a)} (|X_v| / |X|) · gini(X_v)

gini(X) = 1 − (3/7)² − (4/7)² ≈ 0.489

gini(X_color=brown) = 1 − (1/3)² − (2/3)² ≈ 0.444        gini(X_color=white) = 0.5
Δgini(X, color) = 0.489 − (3/7) · 0.444 − (4/7) · 0.5 ≈ 0.013

gini(X_fly=yes) = 0        gini(X_fly=no) = 1 − (3/4)² − (1/4)² ≈ 0.375
Δgini(X, fly) = 0.489 − (3/7) · 0 − (4/7) · 0.375 ≈ 0.274
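A minimal Python sketch (my own, not from the slides) that reproduces these values, reusing the hypothetical rows / labels from the information-gain sketch above:

from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(rows, labels, attribute):
    # gini(X) minus the weighted impurity of the subsets induced by `attribute`
    n = len(labels)
    total = gini(labels)
    for v in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == v]
        total -= (len(subset) / n) * gini(subset)
    return total

print(round(gini(labels), 3))                          # 0.49
print(round(gini_decrease(rows, labels, "color"), 3))  # 0.014 (slide: 0.013 with rounded intermediates)
print(round(gini_decrease(rows, labels, "fly"), 3))    # 0.276 (slide: 0.274 with rounded intermediates)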
Entropy versus Gini Impurity
• Entropy and Gini Impurity give similar results in practice
➢ They only disagree in about 2% of cases
“Theoretical Comparison between the Gini Index and Information Gain
Criteria” [Răileanu & Stoffel, AMAI 2004]
➢ Entropy might be slower to compute, because of the log
Pruning
• Pruning is a technique that reduces the size of a decision tree by
removing branches of the tree which provide little predictive power
• It is a regularization method that reduces the complexity of the
final model, thus reducing overfitting
➢ Decision trees are prone to overfitting!
• Pruning methods:
➢ Pre-pruning: Stop the tree building algorithm before it fully
classifies the data
➢ Post-pruning: Build the complete tree, then replace some non-
leaf nodes with leaf nodes if this improves validation error
Pre-pruning
• Pre-pruning implies early stopping:
➢ If some condition is met, the current node will not be split, even if it is not 100% pure
➢ It will become a leaf node with the label of the majority class in the current set
(the class distribution could be used as prediction confidence)
• Common stopping criteria include setting a threshold on:
➢ Entropy (or Gini Impurity) of the current set
➢ Number of samples in the current set
➢ Gain of the best-splitting attribute
➢ Depth of the tree
[Figure: with a minimum entropy threshold θ_ent = 0.4, a node with class distribution 54|5 has entropy(54|5) = 0.29 < θ_ent, so we stop even if the node could still be split, producing a leaf with 91.52% confidence]
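As a concrete illustration (my own, not from the slides), scikit-learn's DecisionTreeClassifier exposes exactly these pre-pruning criteria as hyperparameters; the threshold values below are arbitrary:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(
    criterion="entropy",         # or "gini"
    max_depth=4,                 # threshold on the depth of the tree
    min_samples_split=10,        # threshold on the number of samples in the current set
    min_impurity_decrease=0.01,  # threshold on the gain of the best-splitting attribute
    random_state=0,
)
tree.fit(X_train, y_train)
print("validation accuracy:", tree.score(X_val, y_val))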
Post-pruning
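The original slide here is a worked figure; as a rough sketch only (my own, following the reduced-error post-pruning described on the Pruning slide, and assuming the TreeNode from the ID3 pseudocode stores its data in node.data, plus hypothetical helpers validation_error and majority_label):

def post_prune(node, tree_root, validation_data):
    # Bottom-up: prune the children first, then try to turn this node into a leaf
    for child in node.children:
        post_prune(child, tree_root, validation_data)
    if not node.children:
        return  # already a leaf
    error_before = validation_error(tree_root, validation_data)
    saved_children, saved_label = node.children, node.label
    node.children, node.label = [], majority_label(node.data)  # tentatively replace the subtree with a leaf
    if validation_error(tree_root, validation_data) > error_before:
        node.children, node.label = saved_children, saved_label  # pruning hurt validation error: undo it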
Handling numerical attributes
• Numerical attributes have to be treated differently
➢ Find the best splitting value (threshold) t

gain(X, a, t) = entropy(X) − (|X_a≤t| / |X|) · entropy(X_a≤t) − (|X_a>t| / |X|) · entropy(X_a>t)

➢ Sort the values of the attribute, take the mean of each consecutive pair as a candidate threshold, and compute the gain for every candidate
➢ For the humidity attribute, 83.5 is the best splitting value, with an information gain of 0.152:

gain(X, humidity, 83.5) = 0.94 − (7/14) · 0.59 − (7/14) · 0.98 ≈ 0.152
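A minimal Python sketch (my own) of this threshold search, reusing the entropy() helper from the earlier sketch:

def best_threshold(values, labels):
    # Sort the (value, label) pairs, form candidate thresholds as the mean of each
    # consecutive pair of distinct values, and keep the threshold with the highest gain.
    pairs = sorted(zip(values, labels))
    sorted_values = [v for v, _ in pairs]
    sorted_labels = [lab for _, lab in pairs]
    candidates = [(a + b) / 2 for a, b in zip(sorted_values, sorted_values[1:]) if a != b]
    n, base = len(labels), entropy(labels)
    best_t, best_gain = None, -1.0
    for t in candidates:
        left = [lab for v, lab in zip(sorted_values, sorted_labels) if v <= t]
        right = [lab for v, lab in zip(sorted_values, sorted_labels) if v > t]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain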
Handling missing values at training time
• Data sets might have samples with missing values for some attributes
• Simply ignoring them would mean throwing away a lot of information
• There are better ways of handling missing values:
➢ Set them to the most common value
➢ Set them to the most probable value given the label
(e.g., P(brown | mammal) = 0 in the running example)
➢ Add a new instance for each possible value
➢ Leave them unknown, but discard the sample when evaluating the gain of that attribute
(if the attribute is chosen for splitting, send the instances with unknown values to all children)
e.g., with the one sample of unknown color discarded:
entropy(X_color=brown) = 0        entropy(X_color=white) = 1
gain(X, color) = 0.985 − (2/6) · 0 − (4/6) · 1 ≈ 0.318
➢ Build a decision tree on all other attributes (including the label) to predict missing values
(use instances where the attribute is defined as training data)
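A minimal Python sketch (my own) of the first two strategies, with rows as dicts and None marking a missing value:

from collections import Counter

def most_common_value(rows, attribute):
    known = [r[attribute] for r in rows if r[attribute] is not None]
    return Counter(known).most_common(1)[0][0]

def most_probable_value_given_label(rows, labels, attribute, label):
    known = [r[attribute] for r, lab in zip(rows, labels)
             if r[attribute] is not None and lab == label]
    return Counter(known).most_common(1)[0][0]

def impute(rows, labels, attribute, per_label=False):
    # Fill in missing values in place, with either the global mode or the per-class mode
    for r, lab in zip(rows, labels):
        if r[attribute] is None:
            r[attribute] = (most_probable_value_given_label(rows, labels, attribute, lab)
                            if per_label else most_common_value(rows, attribute))
    return rows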
Handling missing values at inference time
• When we encounter a node that checks an attribute with a missing value, we
explore all possibilities
• We explore all branches and take the final prediction based on a (weighted)
vote of the corresponding leaf nodes
Loan?
• Not a student
• 49 years old
• Unknown income
• Fair credit record
➢ Yes
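A minimal Python sketch (my own) of this idea; node.class_distribution, node.attribute, node.child_for and child.weight (the fraction of training samples routed to that child) are hypothetical fields, not part of the ID3 pseudocode above:

def predict_distribution(node, sample):
    # Leaf: return its stored class distribution, e.g. {"yes": 0.8, "no": 0.2}
    if not node.children:
        return node.class_distribution
    # Known value: follow the single matching branch
    if sample.get(node.attribute) is not None:
        return predict_distribution(node.child_for(sample[node.attribute]), sample)
    # Missing value: explore every branch and take a weighted vote of the leaves
    combined = {}
    for child in node.children:
        for label, p in predict_distribution(child, sample).items():
            combined[label] = combined.get(label, 0.0) + child.weight * p
    return combined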
C4.5 Algorithm
• The C4.5 algorithm is an extension of the ID3 algorithm that brings several improvements:
➢ Ability to handle both categorical (discrete) and numerical
(continuous) attributes
(continuous attributes are split by finding a best-splitting threshold)
➢ Ability to handle missing values both at training and inference time
(missing values at training are not used when information gain is
computed; missing values at inference time are handled by exploring
all corresponding branches)
➢ Ability to handle attributes with different costs
➢ Post-pruning in a bottom-up manner, removing branches whose removal decreases validation error (i.e., improves generalization)
Decision Boundaries
• Decision trees produce non-linear decision boundaries
[Figure: training data and the learned decision boundary applied at inference time]
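A minimal sketch (my own, not from the slides) that visualizes the piecewise, axis-aligned boundary a tree learns on 2D data, assuming scikit-learn and matplotlib are available:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# Evaluate the tree on a grid to reveal its decision regions
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
zz = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)        # predicted regions (inference)
plt.scatter(X[:, 0], X[:, 1], c=y, s=15)   # training points
plt.show()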
History of Decision Trees
• The first regression tree algorithm
➢ “Automatic Interaction Detection (AID)” [Morgan & Sonquist, 1963]
• The first classification tree algorithm
➢ “Theta Automatic Interaction Detection (THAID)” [Messenger & Mandel,
1972]
• Decision trees become popular
➢ “Classification and regression trees (CART)” [Breiman et al., 1984]
• Introduction of the ID3 algorithm
➢ “Induction of Decision Trees” [Quinlan, 1986]
• Introduction of the C4.5 algorithm
➢ “C4.5: Programs for Machine Learning” [Quinlan, 1993]
Summary
• Decision trees represent a tool based on a tree-like graph of decisions
and their possible outcomes
• Decision tree learning is a machine learning method that employs a
decision tree as a predictive model
• ID3 builds a decision tree by iteratively splitting the data based on the
values of an attribute with the largest information gain (decrease in
entropy)
➢ Using the decrease of Gini Impurity is also a commonly-used option in
practice
• C4.5 is an extension of ID3 that handles attributes with continuous
values, missing values and adds regularization by pruning branches likely
to overfit
Random Forests
(Ensemble learning with decision trees)
Random Forests
• Random Forests:
➢ Instead of building a single decision tree and using it to make predictions, build many slightly different trees and combine their predictions
• We have a single data set, so how do we obtain slightly different trees?
1. Bagging (Bootstrap Aggregating):
➢ Take random subsets of data points from the training set to create N
smaller data sets
➢ Fit a decision tree on each subset
2. Random Subspace Method (also known as Feature Bagging):
➢ Fit N different decision trees by constraining each one to operate on a
random subset of features
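A minimal sketch (my own, not from the slides) that combines bagging with the random subspace method, using scikit-learn's DecisionTreeClassifier as the base tree; class labels are assumed to be small non-negative integers:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleRandomForest:
    def __init__(self, n_trees=100, max_features=None, random_state=0):
        self.n_trees, self.max_features = n_trees, max_features
        self.rng = np.random.default_rng(random_state)
        self.trees, self.feature_sets = [], []

    def fit(self, X, y):
        n_samples, n_features = X.shape
        k = self.max_features or max(1, int(np.sqrt(n_features)))
        for _ in range(self.n_trees):
            rows = self.rng.integers(0, n_samples, n_samples)          # bootstrap sample (with replacement)
            cols = self.rng.choice(n_features, size=k, replace=False)  # random feature subset
            self.trees.append(DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows]))
            self.feature_sets.append(cols)
        return self

    def predict(self, X):
        # Majority vote over the N trees, each seeing only its own feature subset
        votes = np.stack([t.predict(X[:, cols]) for t, cols in zip(self.trees, self.feature_sets)])
        return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

# Usage sketch: forest = SimpleRandomForest(n_trees=50).fit(X_train, y_train)

Note that scikit-learn's own RandomForestClassifier samples a random feature subset at every split rather than once per tree; the per-tree version above follows the random subspace description given here.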
Bagging at training time
[Figure: the training set is resampled with replacement into N subsets and one decision tree is fit on each]
Bagging at inference time
[Figure: a test sample is run through all N trees and their votes are aggregated, e.g. a prediction with 75% confidence]
Random Subspace Method at training time
[Figure: each tree is fit on the training data restricted to a random subset of the features]
Random Subspace Method at inference time
[Figure: a test sample is run through all trees and their votes are aggregated, e.g. a prediction with 66% confidence]
Random Forests
[Figure: a random forest combines the predictions of Tree 1, Tree 2, …, Tree N]
History of Random Forests