
DECISION TREE

Contents

 Introduction
 Working
 Formula
 Example
Introduction

 In machine learning, decision trees are a widely used supervised learning technique. They
excel at creating models for both classification (predicting discrete categories) and
regression (predicting continuous values) tasks.
 Their structure, resembling an actual tree, makes them intuitive to understand and
interpret, even for those without a machine learning background.
 A decision tree is a non-parametric supervised learning algorithm for classification
and regression tasks.
 It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and
leaf nodes.
 Decision trees are used for classification and regression tasks, providing easy-to-
understand models.
Continue

 A decision tree is a hierarchical model used in decision support that depicts decisions and
their potential outcomes, incorporating chance events, resource expenses, and utility.
 This algorithmic model utilizes conditional control statements and is a non-parametric, supervised learning method, useful for both classification and regression tasks.
 The tree structure comprises a root node, branches, internal nodes, and leaf nodes, forming a hierarchical, tree-like structure.
Terminology

 Root Node: The initial node at the beginning of a decision tree, where the entire population or
dataset starts dividing based on various features or conditions.
 Decision Nodes: Nodes resulting from the splitting of root nodes are known as decision nodes.
These nodes represent intermediate decisions or conditions within the tree.
 Leaf Nodes: Nodes where further splitting is not possible, often indicating the final classification
or outcome. Leaf nodes are also referred to as terminal nodes.
 Sub-Tree: Similar to a subsection of a graph being called a sub-graph, a sub-section of a decision
tree is referred to as a sub-tree. It represents a specific portion of the decision tree.
 Pruning: The process of removing or cutting down specific nodes in a decision tree to prevent
overfitting and simplify the model.
 Branch / Sub-Tree: A subsection of the entire decision tree is referred to as a branch or sub-tree.
It represents a specific path of decisions and outcomes within the tree.
Continue

 Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is known
as a parent node, and the sub-nodes emerging from it are referred to as child nodes. The
parent node represents a decision or condition, while the child nodes represent the
potential outcomes or further decisions based on that condition.
Example of Decision Tree
Continue

 Decision trees are drawn upside down, meaning the root is at the top, and this root is then split into several nodes.
 In layman's terms, decision trees are nothing but a bunch of if-else statements.
 The tree checks whether a condition is true, and if it is, it moves to the next node attached to that decision.
 In the diagram below, the tree first asks: what is the weather? Is it sunny, cloudy, or rainy? Depending on the answer, it moves on to the next feature, which is humidity or wind.
 It then checks whether the wind is strong or weak; if the wind is weak and it is rainy, the person may go and play.
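The same tree can be written directly as nested if-else rules. The sketch below mirrors the weather example above; the exact rule for the sunny branch (playing only when humidity is "normal") is assumed from the diagram rather than stated in the slides.

```python
# A decision tree is just nested if-else rules. This mirrors the weather
# example above; the sunny/humidity rule is assumed from the diagram.
def play_decision(weather: str, humidity: str, wind: str) -> str:
    if weather == "sunny":
        return "play" if humidity == "normal" else "don't play"
    elif weather == "cloudy":
        return "play"
    else:  # rainy
        return "play" if wind == "weak" else "don't play"

print(play_decision("rainy", "high", "weak"))  # -> play
```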
Continue
How does a decision tree algorithm work?

 The decision tree algorithm works in a few simple steps:


 Starting at the Root: The algorithm begins at the top, called the “root node,”
representing the entire dataset.
 Asking the Best Questions: It looks for the most important feature or question that
splits the data into the most distinct groups. This is like asking a question at a fork in the
tree.
 Branching Out: Based on the answer to that question, it divides the data into smaller
subsets, creating new branches. Each branch represents a possible route through the tree.
 Repeating the Process: The algorithm continues asking questions and splitting the data
at each branch until it reaches the final “leaf nodes,” representing the predicted outcomes
or classifications.
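These four steps are what a library implementation performs internally. Below is a minimal, illustrative sketch using scikit-learn (assumed to be installed); the tiny weather-style dataset is made up purely to show the API and is not taken from the slides.

```python
# Minimal sketch of the four steps using scikit-learn (assumed installed).
# The toy data is invented for illustration only.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [outlook (0=sunny, 1=cloudy, 2=rainy), wind (0=weak, 1=strong)]
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["no", "no", "yes", "yes", "yes", "no"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)  # steps 1-4: start at the root, pick best splits, branch, repeat

print(export_text(tree, feature_names=["outlook", "wind"]))  # the learned rules
print(tree.predict([[2, 0]]))  # rainy + weak wind -> predicted class
```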
Decision Tree Assumptions

 Several assumptions are made to build effective models when creating decision trees.
These assumptions help guide the tree’s construction and impact its performance. Here
are some common assumptions and considerations when creating decision trees:
 Binary Splits
 Decision trees typically make binary splits, meaning each node divides the data into
two subsets based on a single feature or condition. This assumes that each decision
can be represented as a binary choice.
 Recursive Partitioning
 Decision trees use a recursive partitioning process, where each node is divided into
child nodes, and this process continues until a stopping criterion is met. This assumes
that data can be effectively subdivided into smaller, more manageable subsets.
Continue

 Feature Independence
 Decision trees often assume that the features used for splitting nodes are independent. In
practice, feature independence may not hold, but decision trees can still perform well if
features are correlated.
 Homogeneity
 Decision trees aim to create homogeneous subgroups in each node, meaning that the samples
within a node are as similar as possible regarding the target variable. This assumption helps in
achieving clear decision boundaries.
 Top-Down Greedy Approach
 Decision trees are constructed using a top-down, greedy approach, where each split is chosen
to maximize information gain or minimize impurity at the current node. This may not always
result in the globally optimal tree.
 Categorical and Numerical Features
 Decision trees can handle both categorical and numerical features. However, they may require
different splitting strategies for each type.
Continue

 Overfitting
 Decision trees are prone to overfitting when they capture noise in the data. Pruning and setting appropriate stopping criteria are used to address this problem.
 Impurity Measures
 Decision trees use impurity measures such as Gini impurity or entropy to evaluate
how well a split separates classes. The choice of impurity measure can impact tree
construction.
 No Missing Values
 Decision trees assume that there are no missing values in the dataset or that missing
values have been appropriately handled through imputation or other methods.
 Equal Importance of Features
 Decision trees may assume equal importance for all features unless feature scaling or
weighting is applied to emphasize certain features.
Continue

 No Outliers
 Decision trees are sensitive to outliers, and extreme values can influence their
construction. Preprocessing or robust methods may be needed to handle outliers
effectively.
 Sensitivity to Sample Size
 Small datasets may lead to overfitting, and large datasets may result in overly complex
trees. The sample size and tree depth should be balanced.
Entropy

 Entropy is nothing but the uncertainty in our dataset or measure of disorder.


 Suppose you have a group of friends deciding which movie to watch together on Sunday. There are two choices of movie, "Lucy" and "Titanic", and everyone has to state their choice.
 After everyone gives their answer, we see that "Lucy" gets 4 votes and "Titanic" gets 5 votes. Which movie do we watch now? It is hard to choose one movie, because the votes for both movies are roughly equal.
 This is exactly what we call disorder: there is an almost equal number of votes for both movies, and we cannot really decide which movie to watch. It would have been much easier if "Lucy" had received 8 votes and "Titanic" only 2. Then we could easily say that the majority of votes are for "Lucy", so everyone will be watching that movie.
 In a decision tree, the output is usually "yes" or "no".
Continue

 The formula for Entropy is:

Entropy(S) = - p+ * log2(p+) - p- * log2(p-)

 Here,
 p+ is the probability of the positive class
 p– is the probability of the negative class
 S is the subset of the training examples
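As a quick sanity check of this formula, here is a small helper that evaluates it for a two-class split; the example probabilities reuse the 4-vs-5 and 8-vs-2 movie votes from the previous slide.

```python
# Evaluates the entropy formula above for a binary split
# (written directly from the formula on this slide).
import math

def entropy(p_pos: float, p_neg: float) -> float:
    """E(S) = -p+ * log2(p+) - p- * log2(p-), with 0 * log2(0) taken as 0."""
    total = 0.0
    for p in (p_pos, p_neg):
        if p > 0:
            total -= p * math.log2(p)
    return total

print(entropy(4/9, 5/9))    # 4 vs 5 movie votes -> ~0.99 (high disorder)
print(entropy(8/10, 2/10))  # 8 vs 2 votes       -> ~0.72 (much lower disorder)
```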
Continue

 Entropy basically measures the impurity of a node. Impurity is the degree of randomness; it tells us how random our data is.
 A pure sub-split means that you should be getting either all "yes" or all "no".
 Suppose a feature has 8 "yes" and 4 "no" initially; after the first split the left node gets 5 "yes" and 2 "no", whereas the right node gets 3 "yes" and 2 "no".
 We see that the split is not pure. Why? Because we can still see some negative class in both nodes. In order to build a decision tree, we need to calculate the impurity of each split, and when the purity is 100%, we make the node a leaf node.
Continue

 To check the impurity of feature 2 and feature 3, we will take the help of the Entropy formula.
For feature 3
Continue

 We can clearly see from the tree itself that the left node has lower entropy, i.e. more purity, than the right node, since the left node has a greater number of "yes" and it is easy to decide here.
 Always remember: the higher the entropy, the lower the purity and the higher the impurity.
 As mentioned earlier, the goal of machine learning is to decrease the uncertainty or impurity in the dataset. Entropy gives us the impurity of a particular node, but it does not tell us whether the entropy has decreased relative to the parent node.
 For this, we bring in a new metric called "information gain", which tells us how much the parent entropy has decreased after splitting on some feature.
Information Gain

 Information gain measures the reduction in uncertainty given some feature, and it is also the deciding factor for which attribute should be selected as a decision node or the root node.

 It is simply the entropy of the full dataset minus the entropy of the dataset given some feature.
 To understand this better, let's consider an example: suppose our entire population has a total of 30 instances. The dataset is used to predict whether a person will go to the gym or not. Let's say 16 people go to the gym and 14 people don't.
 Now we have two features to predict whether he/she will go to the gym or not.
 Feature 1 is "Energy", which takes two values, "high" and "low".
 Feature 2 is "Motivation", which takes three values: "No motivation", "Neutral" and "Highly motivated".
Continue

 Let’s see how our decision tree will be made using these 2 features. We’ll use information
gain to decide which feature should be the root node and which feature should be placed
after the split.
Continue

 Let’s calculate the entropy

 To see the weighted average of entropy of each node we will do as follows:


Continue

 Now that we have the values of E(Parent) and E(Parent|Energy), the information gain is:

Information Gain = E(Parent) - E(Parent|Energy) ≈ 0.99 - 0.62 = 0.37

 Our parent entropy was near 0.99, and looking at this value of information gain, we can say that the entropy of the dataset will decrease by 0.37 if we make "Energy" our root node.
Continue

 Similarly, we will do this with the other feature “Motivation” and calculate its information
gain.
Continue

 Let’s calculate the entropy here:

 To see the weighted average of entropy of each node we will do as follows:


Continue

 Now we have the value of E(Parent) and E(Parent|Motivation), information gain will be:

 We now see that the "Energy" feature gives a larger reduction (0.37) than the "Motivation" feature. Hence we select the feature with the highest information gain and then split the node based on that feature.
 In this example "Energy" will be our root node, and we do the same for the sub-nodes. Here we can see that when the energy is "high" the entropy is low, and hence we can say a person will very likely go to the gym if he has high energy. But what if the energy is low? We then split that node again based on the remaining feature, which is "Motivation".
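The calculation described above can be captured in a small helper that takes class counts for the parent node and for each child node. Note that the exact per-node counts for the "Energy" and "Motivation" splits come from figures that are not reproduced in these slides, so the counts in the example call below are hypothetical and only illustrate the mechanics.

```python
# Information gain = entropy(parent) - weighted average entropy of the children.
import math

def entropy_from_counts(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    n = sum(parent_counts)
    weighted_child_entropy = sum(
        sum(child) / n * entropy_from_counts(child)
        for child in child_counts_list
    )
    return entropy_from_counts(parent_counts) - weighted_child_entropy

parent = [16, 14]  # 16 go to the gym, 14 don't -> entropy ~0.997, as above
# Hypothetical "Energy" split: high -> [12 go, 1 doesn't], low -> [4 go, 13 don't]
print(information_gain(parent, [[12, 1], [4, 13]]))  # ~0.38, close to the ~0.37 quoted above
```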
Complete Example

Outlook Temp Humidity Windy Play Tennis


Sunny Hot High False No
Sunny Hot High True No
Overcast Hot Normal False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool High True No
Overcast Cool Normal True Yes
Sunny Mild High False Yes
Rainy Mild Normal False Yes
Sunny Cool Normal False No
Calculations

 Total instances: 10 (6 Yes, 4 No)
 Entropy (Before Splitting) = - (6/10) * log2(6/10) - (4/10) * log2(4/10) ≈ 0.971
2. Information Gain (Outlook)

Outlook    Yes  No  Proportion  Entropy(Outlook_i)  Proportion * Entropy
Sunny      1    3   4/10        0.811               0.325
Overcast   2    0   2/10        0.000               0.000
Rainy      3    1   4/10        0.811               0.325

Entropy(Sunny)    = - (1/4) * log2(1/4) - (3/4) * log2(3/4) ≈ 0.811
Entropy(Overcast) = - (2/2) * log2(2/2) = 0
Entropy(Rainy)    = - (3/4) * log2(3/4) - (1/4) * log2(1/4) ≈ 0.811
Continue

 Information Gain (Outlook) = Entropy (Before Splitting) - Σ [ Proportion(Outlook_i) * Entropy(Outlook_i) ] = 0.971 - (0.325 + 0 + 0.325) ≈ 0.322
3. Information Gain (Temp)

Temp   Yes  No  Proportion  Entropy(Temp_i)  Proportion * Entropy
Hot    1    2   3/10        0.918            0.275
Mild   3    0   3/10        0.000            0.000
Cool   2    2   4/10        1.000            0.400

Entropy(Hot)  = - (1/3) * log2(1/3) - (2/3) * log2(2/3) ≈ 0.918
Entropy(Mild) = - (3/3) * log2(3/3) = 0
Entropy(Cool) = - (2/4) * log2(2/4) - (2/4) * log2(2/4) = 1
Continue
 Information Gain (Temp) = Entropy (Before Splitting) - Σ [ Proportion(Temp_i) * Entropy(Temp_i) ] = 0.971 - (0.275 + 0 + 0.400) ≈ 0.296
4. Information Gain (Humidity)

Humidity  Yes  No  Proportion  Entropy(Humidity_i)  Proportion * Entropy
High      2    3   5/10        0.971                0.486
Normal    4    1   5/10        0.722                0.361

Entropy(High)   = - (2/5) * log2(2/5) - (3/5) * log2(3/5) ≈ 0.971
Entropy(Normal) = - (4/5) * log2(4/5) - (1/5) * log2(1/5) ≈ 0.722
Continue

 Information Gain (Humidity) = Entropy (Before Splitting) - Σ [ Proportion(Humidity_i) * Entropy(Humidity_i) ] = 0.971 - (0.486 + 0.361) ≈ 0.124

5. Information Gain (Windy)

Windy  Yes  No  Proportion  Entropy(Windy_i)  Proportion * Entropy
False  5    2   7/10        0.863             0.604
True   1    2   3/10        0.918             0.275

Entropy(False) = - (5/7) * log2(5/7) - (2/7) * log2(2/7) ≈ 0.863
Entropy(True)  = - (1/3) * log2(1/3) - (2/3) * log2(2/3) ≈ 0.918
 Information Gain (Windy) = Entropy (Before Splitting) - Σ [ Proportion(Windy_i) * Entropy(Windy_i) ] = 0.971 - (0.604 + 0.275) ≈ 0.092

 Summary of Information Gain


 Outlook: 0.322 (Highest)
 Temp: 0.296
 Humidity: 0.124
 Windy: 0.092
 Since "Outlook" has the highest information gain, it is selected as the root node for this dataset.
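These information-gain values can also be reproduced programmatically from the 10-row table above. A minimal sketch (pandas assumed available) follows; it simply recomputes the weighted child entropies for each attribute.

```python
# Recompute the information gain of each attribute from the 10-row table
# (pandas assumed available).
import math
import pandas as pd

data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy",
                 "Rainy", "Overcast", "Sunny", "Rainy", "Sunny"],
    "Temp":     ["Hot", "Hot", "Hot", "Mild", "Cool",
                 "Cool", "Cool", "Mild", "Mild", "Cool"],
    "Humidity": ["High", "High", "Normal", "High", "Normal",
                 "High", "Normal", "High", "Normal", "Normal"],
    "Windy":    [False, True, False, False, False, True, True, False, False, False],
    "Play":     ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No"],
})

def entropy(labels):
    probs = labels.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(df, feature, target="Play"):
    weighted = sum(len(subset) / len(df) * entropy(subset[target])
                   for _, subset in df.groupby(feature))
    return entropy(df[target]) - weighted

for col in ["Outlook", "Temp", "Humidity", "Windy"]:
    print(col, round(info_gain(data, col), 3))
# Outlook has the highest gain (~0.32), followed by Temp, Humidity, Windy.
```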
A second worked example uses the following 10-instance dataset:

Outlook   Temperature  Humidity  Windy  Play Tennis?
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Cool         High      False  Yes
Rainy     Cool         Normal    False  Yes
Rainy     Hot          Normal    False  Yes
Rainy     Hot          High      True   No
Sunny     Mild         High      False  Yes
Overcast  Mild         High      True   Yes
Sunny     Hot          Normal    False  Yes
Rainy     Mild         Normal    False  Yes
Continue

 Target Variable: Play Tennis? (Yes/No)

 Step 1: Calculate Entropy (Before Splitting). Entropy measures the randomness or uncertainty in the target variable.
 Here, it tells us how predictable "Play Tennis?" is based on the current data.
 We use the formula: Entropy = -Σ (pi * log2(pi)), where:
 Σ (summation): we sum over each class (Yes and No)
 pi: probability of each class (Yes and No)
 log2: logarithm base 2
 First, calculate the probability of each class:
 Yes: 7/10 (7 Yes instances out of 10 total)
 No: 3/10 (3 No instances)

 Then calculate the entropy: Entropy = - ((7/10) * log2(7/10)) - ((3/10) * log2(3/10)) ≈ 0.881
Continue

 Step 2: Calculate Information Gain for an Attribute (e.g., Outlook)

 Information gain tells us how much the entropy reduces after splitting the data based on a particular attribute (like Outlook). We'll calculate it for "Outlook."
 Here's the breakdown:
 Entropy (Outlook = Sunny):
 Yes: 2/4 (2 Yes instances out of 4 Sunny examples)
 No: 2/4 (2 No instances)
 Entropy (Sunny) = - ((2/4) * log2(2/4)) - ((2/4) * log2(2/4)) = 1 (Perfectly balanced)
 Entropy (Outlook = Overcast):
 Yes: 2/2 (Both Overcast instances are Yes)
 No: 0/2 (No No instances)
 Entropy (Overcast) = - ((2/2) * log2(2/2)) = 0 (Perfectly predictable)
 Entropy (Outlook = Rainy):
 Yes: 3/4 (3 Yes instances out of 4 Rainy examples)
 No: 1/4 (1 No instance)
 Entropy (Rainy) = - ((3/4) * log2(3/4)) - ((1/4) * log2(1/4)) ≈ 0.811
Continue

 Calculate Information Gain (Outlook):

 Information Gain (Outlook) = Entropy (Before Splitting) - Σ [ (Entropy (Split Value) * Proportion of Examples in Split Value) ]
 Information Gain (Outlook) = 0.881 - [(1 * 4/10) + (0 * 2/10) + (0.811 * 4/10)] ≈ 0.881 - 0.725 ≈ 0.157
 Interpretation:
 The information gain for "Outlook" is only about 0.157. Splitting on "Outlook" removes relatively little of the original uncertainty, because the Sunny branch stays perfectly balanced and the Rainy branch is still mixed, so "Outlook" is not a very informative first split for this dataset.
Continue

 Step 3: Choose the Attribute with Highest Information Gain

 In this example, "Outlook" gave only a small information gain. We repeat this process for the other attributes (Temperature, Humidity, Windy) and choose the one that maximizes information gain. The attribute with the highest gain becomes the root node of the decision tree, and the process continues to split the data further based on the most informative attributes.
Continue
1. Information Gain (Outlook):
We already calculated this in the previous step: approximately 0.157. "Outlook" provides relatively little information gain, because the Sunny branch remains perfectly balanced and the Rainy branch is still mixed (Yes and No for playing tennis).
2. Information Gain (Temperature):
Following the same approach as for "Outlook":
•Entropy (Temperature = Hot):
•Yes: 2/5 (2 Yes instances out of 5 Hot examples)
•No: 3/5 (3 No instances)
•Entropy (Hot) = - ((2/5) * log2(2/5)) - ((3/5) * log2(3/5)) ≈ 0.971
•Entropy (Temperature = Mild):
•Yes: 3/3 (All 3 Mild instances are Yes)
•No: 0/3 (No No instances)
•Entropy (Mild) = 0 (Perfectly predictable)
•Entropy (Temperature = Cool):
•Yes: 2/2 (All 2 Cool instances are Yes)
•No: 0/2 (No No instances)
•Entropy (Cool) = 0 (Perfectly predictable)
Information Gain (Temperature):
= 0.881 - [(0.971 * 5/10) + (0 * 3/10) + (0 * 2/10)] ≈ 0.881 - 0.486 ≈ 0.396
Continue

 3. Information Gain (Humidity):

 Entropy (Humidity = High):
 Yes: 3/6 (3 Yes out of 6 High Humidity examples)
 No: 3/6 (3 No instances)
 Entropy (High) = - ((3/6) * log2(3/6)) - ((3/6) * log2(3/6)) = 1
 Entropy (Humidity = Normal):
 Yes: 4/4 (All 4 Normal Humidity instances are Yes)
 No: 0/4 (No No instances)
 Entropy (Normal) = 0 (Perfectly predictable)
 Information Gain (Humidity): = 0.881 - [(1 * 6/10) + (0 * 4/10)] = 0.881 - 0.6 ≈ 0.281
Continue

 4. Information Gain (Windy):

 Entropy (Windy = True):
 Yes: 1/3 (1 Yes out of 3 windy examples)
 No: 2/3 (2 No instances)
 Entropy (True) = - ((1/3) * log2(1/3)) - ((2/3) * log2(2/3)) ≈ 0.918
 Entropy (Windy = False):
 Yes: 6/7 (6 Yes out of 7 non-windy examples)
 No: 1/7 (1 No instance)
 Entropy (False) = - ((6/7) * log2(6/7)) - ((1/7) * log2(1/7)) ≈ 0.592
 Information Gain (Windy): = 0.881 - [(0.918 * 3/10) + (0.592 * 7/10)] ≈ 0.881 - 0.690 ≈ 0.191
Continue

 Interpretation:
 "Temperature" has the highest information gain (≈ 0.396) among the attributes. This means it is the most informative attribute for predicting whether someone will play tennis in this dataset, so it would be chosen as the root node here.
 "Humidity" has a moderate information gain (≈ 0.281), indicating some predictability based on humidity.
 "Windy" (≈ 0.191) and "Outlook" (≈ 0.157) have the lowest information gain, suggesting they have the least influence on the decision to play tennis in this dataset.
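Putting the two steps together, a compact ID3-style sketch of the full procedure is shown below: compute the information gain of every attribute, split on the best one, and recurse until a node is pure. This is an illustrative implementation, not the exact code behind any particular library; `rows` is assumed to be a list of dictionaries keyed by the column names used above.

```python
# Compact ID3-style sketch: pick the attribute with the highest information
# gain, split on it, and recurse until a node is pure. Illustrative only.
import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(row[target] for row in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr, target):
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    remainder = sum(len(g) / len(rows) * entropy(g, target) for g in groups.values())
    return entropy(rows, target) - remainder

def build_tree(rows, attrs, target):
    labels = {row[target] for row in rows}
    if len(labels) == 1 or not attrs:  # pure node, or nothing left to split on
        return Counter(row[target] for row in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    subtree = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        remaining = [a for a in attrs if a != best]
        subtree[value] = build_tree(subset, remaining, target)
    return {best: subtree}

# Usage (rows = list of dicts built from the table above):
# tree = build_tree(rows, ["Outlook", "Temperature", "Humidity", "Windy"], "Play Tennis?")
```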
When to Stop Splitting?

 There are two main reasons to stop growing a decision tree:


 Purity: The decision tree reaches a point where all data points in a node belong to the
same class. This is ideal because there's no more uncertainty to split on. However, this can
lead to overfitting if the tree is too specific to the training data.
 Cost-Complexity Trade-off: Continuing to grow the tree can improve accuracy on the
training data, but it can also lead to overfitting. Overfitting means the tree becomes too
specific to the training data and performs poorly on unseen data.
Continue

 Here are some common strategies to decide when to stop growing a decision tree:
 Minimum Samples per Split: Set a minimum number of data points required in a node
before splitting it further. This helps prevent the tree from becoming too specific to small
subsets of the data.
 Minimum Samples per Leaf: Set a minimum number of data points required in a leaf
node (terminal node). This avoids creating overly specific leaf nodes that might not
generalize well.
 Maximum Depth: Limit the maximum depth of the tree. This prevents the tree from
becoming too complex and potentially overfitting.
 Pruning: Prune the tree after it's grown by removing branches that don't contribute
significantly to the overall accuracy. This helps to simplify the tree and reduce overfitting.
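For reference, these stopping strategies map directly onto parameters of scikit-learn's DecisionTreeClassifier (a sketch; the particular values below are illustrative, not recommendations):

```python
# These stopping rules map onto scikit-learn's DecisionTreeClassifier
# parameters; the values here are illustrative, not recommendations.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=4,           # Maximum Depth: cap how deep the tree may grow
    min_samples_split=10,  # Minimum Samples per Split
    min_samples_leaf=5,    # Minimum Samples per Leaf
    ccp_alpha=0.01,        # Pruning: cost-complexity pruning strength
)
```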
Hyperparameters

 There are many ways to tackle this problem through hyperparameter tuning. We can set the maximum depth of our decision tree using the max_depth parameter.
 The higher the value of max_depth, the more complex your tree will be.
 The training error will of course decrease as we increase max_depth, but when our test data comes into the picture, we will get very bad accuracy. Hence you need a value that neither overfits nor underfits the data, and for this you can use GridSearchCV.
 Another way is to set the minimum number of samples for each split. It is denoted by min_samples_split.
 Here we specify the minimum number of samples required to make a split. For example, we can require a minimum of 10 samples to reach a decision. That means if a node has fewer than 10 samples, then using this parameter we stop further splitting of that node and make it a leaf node.
 There are more hyperparameters, such as:
 min_samples_leaf – represents the minimum number of samples required to be in a leaf node. The more you increase this number, the more you constrain the tree, which reduces the possibility of overfitting.
 max_features – it helps us decide how many features to consider when looking for the best split.
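A hedged sketch of such a search with GridSearchCV is shown below; `X` and `y` are assumed to be an existing feature matrix and label vector, so the fit call is left commented out.

```python
# Tuning max_depth, min_samples_split, min_samples_leaf and max_features with
# GridSearchCV. X and y are assumed to exist, so the fit call is commented out.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 5, 10],
    "max_features": [None, "sqrt"],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
# search.fit(X, y)            # X, y: your feature matrix and labels
# print(search.best_params_)  # best combination found by cross-validation
```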
