Decision Trees and Random Forests
Definition
• A tree-like model that illustrates a series of events leading to certain decisions
• Each node represents a test on an attribute and each branch is an outcome of that test
Who to loan?
• Not a student, 45 years old, medium income, fair credit record → Yes
• Student, 27 years old, low income, excellent credit record → No
Decision Tree Learning
• We use labeled data to obtain a suitable decision tree for future predictions
• We want a decision tree that works well on unseen data, while asking as few questions as possible
Decision Tree Learning
• Basic step: choose an attribute and, based on its values, split the data into smaller sets
• Recursively repeat this step until we can surely decide the label
Decision Tree Learning (Python)
def make_tree(X):
    node = TreeNode(X)
    if should_be_leaf_node(X):
        # Leaf: predict the majority class of the samples that reached this node
        node.label = majority_label(X)
    else:
        a = select_best_splitting_attribute(X)
        for v in values(a):
            # X_v: the subset of samples whose value for attribute a equals v
            X_v = [x for x in X if x[a] == v]
            node.children.append(make_tree(X_v))
    return node
What is a good attribute?
• Entropy measures the impurity of a set $X$ of labeled samples; for two classes with proportions $p_{+}$ and $p_{-}$:
  $\mathrm{entropy}(X) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$
• The information gain of an attribute $a$ is the decrease in entropy obtained by splitting $X$ on its values:
  $\mathrm{gain}(X, a) = \mathrm{entropy}(X) - \sum_{v \in \mathrm{values}(a)} \frac{|X_v|}{|X|}\,\mathrm{entropy}(X_v)$
• Best attribute = highest information gain
• In practice, we compute $\mathrm{entropy}(X)$ only once!
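As a minimal sketch of these formulas (not part of the original slides), the helpers below compute entropy and information gain for a categorical attribute; the function names and the list-of-dicts data layout are assumptions for illustration.

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(X, y, attribute):
    # X is assumed to be a list of dicts mapping attribute names to values
    base = entropy(y)  # computed only once, as noted above
    n = len(y)
    remainder = 0.0
    for v in set(x[attribute] for x in X):
        y_v = [label for x, label in zip(X, y) if x[attribute] == v]
        remainder += (len(y_v) / n) * entropy(y_v)
    return base - remainder

# Toy example: "student" perfectly separates the two labels, so the gain is 1.0
X = [{"student": "no"}, {"student": "yes"}, {"student": "yes"}, {"student": "no"}]
y = ["yes", "no", "no", "yes"]
print(information_gain(X, y, "student"))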
ID3 Algorithm (Python)
# ID3 = Iterative Dichotomiser 3
def ID3(X):
    node = TreeNode(X)
    if all_points_have_same_class(X):
        # Pure node: every sample has the same label
        node.label = majority_label(X)
    else:
        a = select_attribute_with_highest_information_gain(X)
        if gain(X, a) == 0:
            # No attribute reduces the entropy: stop and predict the majority class
            node.label = majority_label(X)
        else:
            for v in values(a):
                # X_v: the subset of samples whose value for attribute a equals v
                X_v = [x for x in X if x[a] == v]
                node.children.append(ID3(X_v))
    return node
Gini Impurity
• Gini impurity measures how often a randomly chosen example would be incorrectly labeled if it were randomly labeled according to the label distribution
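For comparison with entropy, a minimal sketch (illustrative names, not from the original slides) of the Gini impurity of a set of labels, i.e. $1 - \sum_c p_c^2$:

from collections import Counter

def gini_impurity(labels):
    # 1 minus the sum of squared class probabilities; 0 for a pure set
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity(["yes", "yes", "no", "no"]))  # 0.5, maximally impure for two classes
print(gini_impurity(["yes", "yes", "yes"]))       # 0.0, pure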
Pruning
• Decision trees can overfit the training data, so we prune them
• Pruning methods:
  Pre-pruning: Stop the tree-building algorithm before it fully classifies the data
  Post-pruning: Build the complete tree, then replace some non-leaf nodes with leaf nodes if this improves validation error
Pre-pruning
• Pre-pruning implies early stopping:
  If some condition is met, the current node will not be split, even if it is not 100% pure
  It will become a leaf node with the label of the majority class in the current set
  (the class distribution could be used as prediction confidence)
• Common stopping criteria include setting a threshold on:
  Entropy (or Gini impurity) of the current set
  Number of samples in the current set
  Gain of the best-splitting attribute
  Depth of the tree
(Figure: with a minimum entropy threshold $\theta_{ent} = 0.4$, a node whose entropy falls below the threshold is not split even though it could be; it becomes a leaf predicting the majority class with 91.52% confidence)
Post-pruning
• Build the complete tree, then replace some non-leaf nodes with leaf nodes if this improves validation error
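Both pruning styles have counterparts in scikit-learn's DecisionTreeClassifier; the sketch below is illustrative (the concrete threshold values are assumptions, and scikit-learn implements post-pruning as cost-complexity pruning rather than the validation-error rule described above).

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop splitting early via thresholds on depth, sample count, or impurity decrease
pre_pruned = DecisionTreeClassifier(
    max_depth=5,                 # threshold on the depth of the tree
    min_samples_split=20,        # threshold on the number of samples in the current set
    min_impurity_decrease=0.01,  # threshold on the gain of the best-splitting attribute
)

# Post-pruning: grow the full tree, then prune it back (cost-complexity pruning)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02)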
Handling missing values at training time
• Data sets might have samples with missing values for some
attributes
• Simply ignoring them would mean throwing away a lot of
information
• There are better ways of handling missing values:
Set them to the most common value
Set them to the most probable value given the label
Add a new instance for each possible value
Leave them unknown, but discard the sample when
evaluating the gain of that attribute
(if the attribute is chosen for splitting, send the instances
with unknown values to all children)
Build a decision tree on all other attributes (including label)
to predict missing values
(use instances where the attribute is defined as training data)
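A minimal sketch of the first two strategies using pandas; the column names and values are hypothetical.

import pandas as pd

df = pd.DataFrame({
    "income": ["low", None, "medium", "high", None, "low"],
    "label":  ["no",  "no",  "yes",   "yes",  "yes", "no"],
})

# Strategy 1: set missing values to the most common value overall
most_common = df["income"].mode()[0]
df["income_filled"] = df["income"].fillna(most_common)

# Strategy 2: set missing values to the most common value given the label
per_label_mode = df.groupby("label")["income"].transform(lambda s: s.mode()[0])
df["income_filled_by_label"] = df["income"].fillna(per_label_mode)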
Handling missing values at inference time
• When we encounter a node that checks an attribute with a missing value, we explore all possibilities
• We explore all branches and take the final prediction based on a (weighted) vote of the corresponding leaf nodes
Loan?
• Not a student, 49 years old, unknown income, fair credit record → Yes
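A minimal sketch of this strategy (not from the original slides): when the tested attribute is missing, descend into every child and combine the leaf distributions, weighting each branch by the fraction of training samples that followed it. The TreeNode fields used here (attribute, children as a value-to-node dict, n_samples, class_counts) are assumptions for illustration.

from collections import Counter

def predict_distribution(node, sample):
    if not node.children:
        # Leaf node: return its class distribution (label -> training-sample count)
        return Counter(node.class_counts)
    value = sample.get(node.attribute)
    if value is not None and value in node.children:
        # Attribute observed: follow the matching branch only
        return predict_distribution(node.children[value], sample)
    # Attribute missing (or value unseen): explore all branches,
    # weighting each child by the share of training samples it received
    total = Counter()
    for child in node.children.values():
        weight = child.n_samples / node.n_samples
        for label, count in predict_distribution(child, sample).items():
            total[label] += weight * count
    return total

# Final prediction: the label with the highest weighted vote
# prediction = predict_distribution(root, sample).most_common(1)[0][0]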
Decision Boundaries
• Decision trees produce non-linear decision boundaries
(Figure: decision regions learned at training time and applied at inference time)
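One way to visualize this (an illustrative sketch, not the original figure) is to fit a tree on 2-D data and colour a grid of points by the tree's predictions; the resulting regions are axis-aligned rectangles, i.e. a non-linear boundary.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
clf = DecisionTreeClassifier(max_depth=4).fit(X, y)

# Evaluate the tree on a dense grid and draw the resulting decision regions
xx, yy = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-1.5, 2, 300))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.show()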
History of Decision Trees
• The first regression tree algorithm
“Automatic Interaction Detection (AID)” [Morgan & Sonquist, 1963]
• The first classification tree algorithm
“Theta Automatic Interaction Detection (THAID)” [Messenger & Mandell, 1972]
• Decision trees become popular
“Classification and regression trees (CART)” [Breiman et al., 1984]
• Introduction of the ID3 algorithm
“Induction of Decision Trees” [Quinlan, 1986]
• Introduction of the C4.5 algorithm
“C4.5: Programs for Machine Learning” [Quinlan, 1993]
Summary
• Decision trees are a predictive tool based on a tree-like graph of decisions and their possible outcomes
• Decision tree learning is a machine learning method that employs a
decision tree as a predictive model
• ID3 builds a decision tree by iteratively splitting the data on the attribute with the largest information gain (decrease in entropy)
  Using the decrease in Gini impurity instead is also a common option in practice
• C4.5 is an extension of ID3 that handles attributes with continuous values and missing values, and adds regularization by pruning branches that are likely to overfit
Random Forests
(Ensemble learning with decision trees)
Random Forests
• Random Forests:
  Instead of building a single decision tree and using it to make predictions, build many slightly different trees and combine their predictions
• We have a single data set, so how do we obtain slightly different trees?
1. Bagging (Bootstrap Aggregating):
Take random subsets of data points from the training set to create N
smaller data sets
Fit a decision tree on each subset
2. Random Subspace Method (also known as Feature Bagging):
Fit N different decision trees by constraining each one to operate on a
random subset of features
Bagging at training time
(Figure: the training set is resampled with replacement into N smaller subsets, and one decision tree is fit on each)
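A minimal sketch of bagging at training time (illustrative names; X and y are assumed to be NumPy arrays): draw N bootstrap samples with replacement and fit one tree per sample.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_trees=10, random_state=0):
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # sample indices with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees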
Bagging at inference time
(Figure: a test sample is classified by every tree and the majority vote gives the final prediction, here with 75% confidence)
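Continuing the sketch above, at inference time every tree votes on the test sample, and the fraction of agreeing trees can be reported as a confidence (e.g., 0.75 if 75% of the trees agree).

from collections import Counter

def bagging_predict(trees, x):
    # x is a single sample as a 1-D NumPy array
    votes = Counter(tree.predict(x.reshape(1, -1))[0] for tree in trees)
    label, count = votes.most_common(1)[0]
    return label, count / len(trees)  # predicted label and vote-based confidence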
Random Subspace Method at training time
(Figure: each tree is trained on a random subset of the features of the training data)
Random Subspace Method at inference time
(Figure: a test sample is classified by every tree, each using its own feature subset, and the majority vote gives the final prediction, here with 66% confidence)
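A minimal sketch of the random subspace method (illustrative names; X and y are assumed to be NumPy arrays): each tree is trained on, and later queried with, its own random subset of feature columns.

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def random_subspace_fit(X, y, n_trees=10, n_features=2, random_state=0):
    rng = np.random.default_rng(random_state)
    models = []
    for _ in range(n_trees):
        cols = rng.choice(X.shape[1], size=n_features, replace=False)
        models.append((cols, DecisionTreeClassifier().fit(X[:, cols], y)))
    return models

def random_subspace_predict(models, x):
    # Each tree only sees the columns it was trained on; majority vote decides
    votes = Counter(tree.predict(x[cols].reshape(1, -1))[0] for cols, tree in models)
    return votes.most_common(1)[0][0]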
Random Forests
(Figure: a random forest combines the predictions of Tree 1, Tree 2, …, Tree N)
History of Random Forests
Boosting
• Models are trained iteratively; the training samples are reweighted based on the current model's mistakes, so that the next model focuses on them
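As a hedged illustration (not part of the original slides), scikit-learn's AdaBoostClassifier implements this kind of sample reweighting, by default with shallow decision trees as weak learners:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)
# Each new weak learner focuses on the samples the current ensemble misclassifies
booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(booster.score(X, y))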
Summary
• Ensemble learning methods combine multiple learning algorithms to obtain performance improvements over their individual components
• Commonly-used ensemble methods:
Bagging (multiple models on random subsets of data samples)
Random Subspace Method (multiple models on random subsets of
features)
Boosting (train models iteratively, while making the current model
focus on the mistakes of the previous ones by increasing the weight
of misclassified samples)
• Random Forests are an ensemble learning method that employs decision tree learning to build multiple trees through bagging and the random subspace method
  They greatly reduce the overfitting problem of decision trees!
Decision Trees and Random Forest (Python)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
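Reusing the two imports above, a minimal usage sketch on a toy data set (the data set choice and parameter values are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

print("Decision tree accuracy:", tree.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))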