Decision Tree
Contents
Introduction
Working
Formula
Example
Introduction
In machine learning, decision trees are a widely used supervised learning technique. They
excel at creating models for both classification (predicting discrete categories) and
regression (predicting continuous values) tasks.
Their structure, resembling an actual tree, makes them intuitive to understand and
interpret, even for those without a machine learning background.
A decision tree is a non-parametric supervised learning algorithm for classification
and regression tasks.
It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and
leaf nodes, and it produces models that are easy to understand and interpret.
More generally, a decision tree is a hierarchical model used in decision support that depicts
decisions and their potential outcomes, incorporating chance events, resource costs, and utility.
Algorithmically, the model is built from conditional control statements; it is a non-parametric,
supervised learning method, useful for both classification and regression tasks.
Terminology
Root Node: The initial node at the beginning of a decision tree, where the entire population or
dataset starts dividing based on various features or conditions.
Decision Nodes: Nodes resulting from the splitting of root nodes are known as decision nodes.
These nodes represent intermediate decisions or conditions within the tree.
Leaf Nodes: Nodes where further splitting is not possible, often indicating the final classification
or outcome. Leaf nodes are also referred to as terminal nodes.
Branch / Sub-Tree: Just as a subsection of a graph is called a sub-graph, a subsection of a decision
tree is referred to as a branch or sub-tree. It represents a specific path of decisions and outcomes within the tree.
Pruning: The process of removing or cutting down specific nodes in a decision tree to prevent
overfitting and simplify the model.
Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is known
as a parent node, and the sub-nodes emerging from it are referred to as child nodes. The
parent node represents a decision or condition, while the child nodes represent the
potential outcomes or further decisions based on that condition.
Example of Decision Tree
Decision trees are drawn upside down, which means the root is at the top, and this root is
then split into several nodes.
In layman's terms, a decision tree is nothing but a bunch of if-else statements: it checks
whether a condition is true and, if it is, moves on to the next node attached to that
decision.
For example, the tree first asks: what is the weather? Is it sunny, cloudy, or rainy?
Depending on the answer, it moves on to the next feature, such as humidity or wind.
It then checks whether the wind is strong or weak; if the wind is weak and the weather is
rainy, the person may go out and play.
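This if-else view can be written out directly in code. The sketch below is only an illustration of the rules described above; the function name and the sunny/cloudy branches are assumptions chosen to complete the example, not a trained model.

def will_play(weather, humidity, wind):
    # Hand-written decision rules mirroring the weather example above.
    # weather is "sunny", "cloudy", or "rainy"; humidity is "high"/"normal";
    # wind is "strong"/"weak" (illustrative values).
    if weather == "sunny":
        # For sunny days, the next question is about humidity (assumed branch).
        return humidity == "normal"
    elif weather == "cloudy":
        # Cloudy days: play regardless of the other features (assumed branch).
        return True
    else:  # rainy
        # For rainy days, the next question is about the wind.
        return wind == "weak"

print(will_play("rainy", "normal", "weak"))   # True: weak wind on a rainy day
print(will_play("sunny", "high", "strong"))   # False: sunny but humid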
How do decision tree algorithms work?
Several assumptions are made to build effective models when creating decision trees.
These assumptions help guide the tree’s construction and impact its performance. Here
are some common assumptions and considerations when creating decision trees:
Binary Splits
Decision trees typically make binary splits, meaning each node divides the data into
two subsets based on a single feature or condition. This assumes that each decision
can be represented as a binary choice.
Recursive Partitioning
Decision trees use a recursive partitioning process, where each node is divided into
child nodes, and this process continues until a stopping criterion is met. This assumes
that data can be effectively subdivided into smaller, more manageable subsets (a short
code sketch of this process appears after this list of assumptions).
Feature Independence
Decision trees often assume that the features used for splitting nodes are independent. In
practice, feature independence may not hold, but decision trees can still perform well if
features are correlated.
Homogeneity
Decision trees aim to create homogeneous subgroups in each node, meaning that the samples
within a node are as similar as possible regarding the target variable. This assumption helps in
achieving clear decision boundaries.
Top-Down Greedy Approach
Decision trees are constructed using a top-down, greedy approach, where each split is chosen
to maximize information gain or minimize impurity at the current node. This may not always
result in the globally optimal tree.
Categorical and Numerical Features
Decision trees can handle both categorical and numerical features. However, they may require
different splitting strategies for each type.
Overfitting
Decision trees are prone to overfitting when they capture noise in the data. Pruning
and setting appropriate stopping criteria are used to address this issue.
Impurity Measures
Decision trees use impurity measures such as Gini impurity or entropy to evaluate
how well a split separates classes. The choice of impurity measure can impact tree
construction.
No Missing Values
Decision trees assume that there are no missing values in the dataset or that missing
values have been appropriately handled through imputation or other methods.
Equal Importance of Features
Decision trees may assume equal importance for all features unless feature scaling or
weighting is applied to emphasize certain features.
No Outliers
Decision trees are sensitive to outliers, and extreme values can influence their
construction. Preprocessing or robust methods may be needed to handle outliers
effectively.
Sensitivity to Sample Size
Small datasets may lead to overfitting, and large datasets may result in overly complex
trees. The sample size and tree depth should be balanced.
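Several of these ideas (binary splits, recursive partitioning, the top-down greedy approach, impurity measures, and a stopping criterion) can be combined into a short sketch. The code below is a simplified illustration only; the function names and the tiny dataset are my own, and it uses entropy as the impurity measure, which is defined formally in the next section.

import math
from collections import Counter

def entropy(labels):
    # Impurity of a node: -sum(p * log2(p)) over the classes present.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    # Greedy search over every (feature, threshold) pair for the binary split
    # with the lowest weighted child entropy (top-down greedy approach).
    best, best_score = None, float("inf")
    for f in range(len(rows[0])):
        for t in set(row[f] for row in rows):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    # Stop when the node is pure, the depth limit is reached, or no valid split exists.
    split = None if len(set(labels)) == 1 or depth == max_depth else best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf node: majority class
    f, t = split
    left = [(row, y) for row, y in zip(rows, labels) if row[f] <= t]
    right = [(row, y) for row, y in zip(rows, labels) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

# Tiny numeric example: two features, binary labels (illustrative data).
X = [[2.0, 1.0], [1.0, 3.0], [3.0, 2.5], [4.0, 0.5]]
y = ["no", "yes", "yes", "no"]
print(build_tree(X, y))   # {'feature': 1, 'threshold': 1.0, 'left': 'no', 'right': 'yes'}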
Entropy
Entropy(S) = - p+ * log2(p+) - p- * log2(p-)
Here,
p+ is the probability of the positive class
p- is the probability of the negative class
S is the subset of the training examples
Entropy basically measures the impurity of a node. Impurity is the degree of randomness;
it tells us how random our data is.
A pure sub-split means that you should be getting either all "yes" or all "no".
Suppose a feature has 8 "yes" and 4 "no" initially; after the first split, the left node gets
5 "yes" and 2 "no", whereas the right node gets 3 "yes" and 2 "no".
We can see that this split is not pure. Why? Because we can still see some negative classes in
both nodes. In order to build a decision tree, we need to calculate the impurity of
each split, and when the purity is 100%, we make it a leaf node.
To check the impurity of this split, we take the help of the entropy formula for each child node.
For the left node (5 "yes", 2 "no"): Entropy = - (5/7) * log2(5/7) - (2/7) * log2(2/7) ≈ 0.863
For the right node (3 "yes", 2 "no"): Entropy = - (3/5) * log2(3/5) - (2/5) * log2(2/5) ≈ 0.971
We can clearly see that the left node has lower entropy, i.e. more purity, than the right
node, since the left node has a greater proportion of "yes" and it is easier to make a decision there.
Always remember: the higher the entropy, the lower the purity and the higher the impurity.
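A quick way to check these numbers is to compute the entropy of each node directly. The snippet below is a minimal sketch using only the counts from the split above (left node: 5 "yes", 2 "no"; right node: 3 "yes", 2 "no").

import math

def entropy(p_pos, p_neg):
    # Entropy(S) = - p+ * log2(p+) - p- * log2(p-); a term is 0 when its probability is 0.
    return sum(-p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

left = entropy(5 / 7, 2 / 7)    # left node: 5 "yes", 2 "no"
right = entropy(3 / 5, 2 / 5)   # right node: 3 "yes", 2 "no"
print(round(left, 3), round(right, 3))   # 0.863 0.971 -> the left node is purer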
As mentioned earlier, the goal of machine learning is to decrease the uncertainty or
impurity in the dataset. Entropy gives us the impurity of a particular node, but on its own
it does not tell us whether the entropy has actually decreased compared with the parent node.
For this, we bring in a new metric called "information gain", which tells us how much the
parent entropy has decreased after splitting on some feature.
Information Gain
Information gain measures the reduction of uncertainty given some feature, and it is also the
deciding factor for which attribute should be selected as a decision node or root node.
It is simply the entropy of the full dataset minus the weighted entropy of the dataset given some feature:
Information Gain(S, A) = E(S) - E(S | A)
where E(S | A) is the weighted average of the entropies of the subsets obtained by splitting S on feature A.
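As a sketch of how this is computed in practice, the helper below (the function names and structure are my own, for illustration) takes the class counts of the parent node and of each child node and returns E(parent) minus the weighted average child entropy, applied here to the earlier 8 "yes" / 4 "no" split.

import math

def entropy(counts):
    # Entropy of a node from its class counts, e.g. counts = [8, 4].
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    # IG = E(parent) - sum_i (n_i / n) * E(child_i)
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# The split from the entropy section: parent 8 "yes" / 4 "no",
# children (5 "yes", 2 "no") and (3 "yes", 2 "no").
print(round(information_gain([8, 4], [[5, 2], [3, 2]]), 3))   # ≈ 0.01: a weak split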
To understand this better, let's consider an example. Suppose our entire population has a total of 30
instances, and the task is to predict whether a person will go to the gym or not. Let's say 16
people go to the gym and 14 people don't.
Now we have two features to predict whether he/she will go to the gym or not:
Feature 1 is "Energy", which takes two values, "high" and "low".
Feature 2 is "Motivation", which takes three values, "No motivation", "Neutral" and "Highly motivated".
Let’s see how our decision tree will be made using these 2 features. We’ll use information
gain to decide which feature should be the root node and which feature should be placed
after the split.
Now that we have the values of E(Parent) and E(Parent|Energy), the information gain will be:
Information Gain = E(Parent) - E(Parent|Energy) ≈ 0.99 - 0.62 = 0.37
Our parent entropy was near 0.99, and after looking at this value of information gain, we
can say that the entropy of the dataset will decrease by 0.37 if we make "Energy" our
root node.
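The parent entropy quoted above can be verified from nothing more than the class counts stated earlier (16 people go to the gym, 14 don't):

import math

# E(Parent) = - (16/30) * log2(16/30) - (14/30) * log2(14/30)
e_parent = -(16 / 30) * math.log2(16 / 30) - (14 / 30) * math.log2(14 / 30)
print(round(e_parent, 3))   # 0.997, i.e. roughly 0.99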
Similarly, we will do this with the other feature “Motivation” and calculate its information
gain.
Now that we have the values of E(Parent) and E(Parent|Motivation), the information gain for
"Motivation" is computed in the same way:
Information Gain = E(Parent) - E(Parent|Motivation)
We see that the "Energy" feature gives a larger reduction (0.37) than the "Motivation"
feature. Hence we select the feature with the highest information gain and then split the
node based on that feature.
In this example “Energy” will be our root node and we’ll do the same for sub-nodes.
Here we can see that when the energy is "high" the entropy is low, and hence we can say that a
person will almost certainly go to the gym if their energy is high. But what if the energy is low?
In that case we will again split the node, this time based on the other feature, "Motivation".
Complete Example
Total instances: 10
1. Entropy (Before Splitting) = - (5/10) * log2(5/10) - (5/10) * log2(5/10) = 1.0
2. Information Gain (Outlook)

Outlook     Yes  No  Proportion  Entropy(Outlook_i)                               Proportion * Entropy
Sunny        1    3     4/10     - (1/4) * log2(1/4) - (3/4) * log2(3/4) ≈ 0.811         0.325
Overcast     2    0     2/10     - (2/2) * log2(2/2) = 0                                 0
Rainy        2    2     4/10     - (2/4) * log2(2/4) - (2/4) * log2(2/4) = 1             0.400
3. Information Gain (Humidity)

Humidity    Yes  No  Proportion  Entropy(Humidity_i)                              Proportion * Entropy
High         3    3     6/10     - (3/6) * log2(3/6) - (3/6) * log2(3/6) = 1             0.600
Normal       2    2     4/10     - (2/4) * log2(2/4) - (2/4) * log2(2/4) = 1             0.400
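The same calculation can be reproduced in a few lines of code using only the counts from the Outlook table above; the helper function is illustrative, not part of any library.

import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# (yes, no) counts per Outlook value, taken from the table above.
outlook = {"Sunny": (1, 3), "Overcast": (2, 0), "Rainy": (2, 2)}
total = sum(sum(c) for c in outlook.values())           # 10 instances

parent_entropy = entropy((5, 5))                        # 5 "yes", 5 "no" -> 1.0
weighted = sum(sum(c) / total * entropy(c) for c in outlook.values())
print(round(parent_entropy - weighted, 3))              # information gain for Outlook ≈ 0.275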
Interpretation:
"Temperature" has the highest information gain (0.486) among the attributes. This means
it's the most informative attribute for predicting whether someone will play tennis based
on this dataset.
"Humidity" has a moderate information gain (0.218), indicating some predictability based
on humidity.
"Windy" has the lowest information gain (0.1), suggesting it has the least influence on the
decision of playing tennis in this dataset.
When to Stop Splitting?
Here are some common strategies to decide when to stop growing a decision tree:
Minimum Samples per Split: Set a minimum number of data points required in a node
before splitting it further. This helps prevent the tree from becoming too specific to small
subsets of the data.
Minimum Samples per Leaf: Set a minimum number of data points required in a leaf
node (terminal node). This avoids creating overly specific leaf nodes that might not
generalize well.
Maximum Depth: Limit the maximum depth of the tree. This prevents the tree from
becoming too complex and potentially overfitting.
Pruning: Prune the tree after it's grown by removing branches that don't contribute
significantly to the overall accuracy. This helps to simplify the tree and reduce overfitting.
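In scikit-learn, these stopping strategies map directly onto constructor parameters of DecisionTreeClassifier. The sketch below is illustrative only; the parameter values are arbitrary examples, not recommendations.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each argument corresponds to one of the stopping strategies above.
tree = DecisionTreeClassifier(
    min_samples_split=10,   # minimum samples a node needs before it may be split
    min_samples_leaf=5,     # minimum samples that must end up in each leaf
    max_depth=4,            # hard limit on the depth of the tree
    ccp_alpha=0.01,         # cost-complexity pruning strength (prunes weak branches)
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())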
Hyperparameters
There are many ways to tackle the overfitting problem through hyperparameter tuning. We can set the
maximum depth of our decision tree using the max_depth parameter.
The higher the value of max_depth, the more complex your tree will be.
The training error will of course decrease if we increase the max_depth value, but when our test
data comes into the picture, we will get very bad accuracy. Hence you need a value that will neither
overfit nor underfit our data, and for this you can use GridSearchCV.
Another way is to set the minimum number of samples for each split. It is denoted
by min_samples_split.
Here we specify the minimum number of samples required to make a split. For example, we can require a
minimum of 10 samples to reach a decision. That means if a node has fewer than 10 samples, then
using this parameter we stop further splitting of this node and make it a leaf node.
There are more hyperparameters, such as:
min_samples_leaf – represents the minimum number of samples required to be in a leaf node.
Increasing this number makes the tree more constrained, which reduces the risk of overfitting
(at the cost of possible underfitting).
max_features – helps us decide how many features to consider when looking for the best
split.
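Putting these hyperparameters together, a GridSearchCV sweep over a decision tree might look like the sketch below; the grid values are arbitrary examples for illustration, not tuned recommendations.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for the hyperparameters discussed above (illustrative only).
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
    "max_features": [None, "sqrt", "log2"],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)          # hyperparameter combination with the best CV score
print(round(search.best_score_, 3)) # mean cross-validated accuracy of that combination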