Decision Tree
1. ID3: Ross Quinlan is credited with the development of ID3, which is shorthand for “Iterative
Dichotomiser 3.” This algorithm leverages entropy and information gain as metrics to evaluate
candidate splits. Quinlan published his research on this algorithm in 1986.
2. C4.5: This algorithm is a later iteration of ID3, also developed by Quinlan. It can use
information gain or gain ratio to evaluate split points within decision trees.
3. CART: The term CART is an abbreviation for “classification and regression trees” and was
introduced by Leo Breiman. This algorithm typically utilizes Gini impurity to identify the ideal
attribute to split on. Gini impurity measures how often a randomly chosen sample would be
misclassified if it were labeled according to the class distribution of the node. When evaluating
splits with Gini impurity, a lower value is better.
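A minimal sketch of how these split criteria appear in practice, assuming scikit-learn (not referenced in the text above): its DecisionTreeClassifier implements an optimized CART-style algorithm, and the criterion parameter selects between Gini impurity and entropy-based splitting.

```python
# Sketch: same tree learner, two split criteria (Gini vs. entropy).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# CART-style tree using Gini impurity (the default criterion)
gini_tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)

# Same tree learner, but scoring splits with entropy / information gain
entropy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

print(gini_tree.score(X, y), entropy_tree.score(X, y))
```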
Decision Tree Assumptions
• Several assumptions are made when building decision trees to produce effective
models. These assumptions guide the tree’s construction and impact its
performance. Here are some common assumptions and considerations:
1. Binary Splits
➢Decision trees typically make binary splits, meaning each node divides the data
into two subsets based on a single feature or condition. This assumes that each
decision can be represented as a binary choice.
2. Recursive Partitioning
➢Decision trees use a recursive partitioning process, where each node is divided
into child nodes, and this process continues until a stopping criterion is met. This
assumes that data can be effectively subdivided into smaller, more manageable
subsets.
3. Feature Independence
➢These trees often assume that the features used for splitting nodes are independent.
In practice, feature independence may not hold, but decision trees can still perform
well even when features are correlated.
4. Homogeneity
➢Decision trees aim to create homogeneous subgroups in each node, meaning that the
samples within a node are as similar as possible with respect to the target variable.
This assumption helps in achieving clear decision boundaries.
5. Top-Down Greedy Approach
➢They are constructed using a top-down, greedy approach, where each split is
chosen to maximize information gain or minimize impurity at the current node.
This may not always result in the globally optimal tree.
❖A leaf node in a decision tree is the terminal node at the bottom of the tree, where
no further splits are made. Leaf nodes represent the final output or prediction of
the decision tree. Once a data point reaches a leaf node, a decision or prediction is
made based on the majority class (for classification) or the average value (for
regression) of the data points that reach that leaf.
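As a minimal sketch of how a leaf’s prediction is formed (the labels and values below are hypothetical, not from the text):

```python
from collections import Counter
from statistics import mean

# Hypothetical samples that ended up in one leaf node
leaf_labels = ["yes", "yes", "no", "yes"]   # classification leaf
leaf_targets = [12.0, 15.5, 14.0]           # regression leaf

# Classification: predict the majority class among samples at the leaf
prediction_cls = Counter(leaf_labels).most_common(1)[0][0]   # "yes"

# Regression: predict the average target value among samples at the leaf
prediction_reg = mean(leaf_targets)                          # ≈ 13.83
```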
❖To check mathematically whether a split is pure, we use entropy or Gini impurity.
Information gain helps us determine which features should be selected for splitting.
Entropy
➢Entropy is a concept borrowed from information theory and is commonly used as
a measure of uncertainty or disorder in a set of data. In the context of decision
trees, entropy is often employed as a criterion to decide how to split data points at
each node, aiming to create subsets that are more homogeneous with respect to the
target variable.
➢Entropy is a measure of uncertainty or disorder. A low entropy indicates a
more ordered or homogeneous set, while a high entropy signifies greater
disorder or diversity.
➢In the context of a decision tree, the goal is to reduce entropy by selecting features
and split points that result in more ordered subsets.
➢For binary classification, entropy values range from 0 to 1. The minimum entropy (0)
occurs when all instances belong to a single class, making the set perfectly ordered. The
maximum entropy (1 for two classes, and log₂(k) for k classes in general) occurs when
instances are evenly distributed across all classes, creating a state of maximum disorder.
➢Entropy values can fall between 0 and 1. If all samples in a dataset S belong to
one class, entropy equals zero. If half of the samples belong to one class and the
other half to another, entropy is at its highest value of 1. To select the best feature
to split on and find the optimal decision tree, the attribute whose split produces the
smallest (weighted) entropy should be used, which is equivalent to choosing the
largest information gain.
➢At each node of a decision tree, the algorithm evaluates the entropy for each
feature and split point. The feature and split point that result in the largest
reduction in entropy are chosen for the split. The reduction in entropy is often
referred to as Information Gain and is calculated as the difference between the
entropy before and after the split.
Entropy and Gini impurity formulas
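The formulas referred to here are the standard definitions, where pᵢ is the proportion of samples belonging to class i and k is the number of classes:

```latex
\mathrm{Entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i
\qquad
\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^{2}
```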
The minimum value of entropy is 0. The maximum value depends on the number of classes:
➢ For 2 classes (binary classification): maximum entropy is 1.
➢ For 3 classes: maximum entropy is log₂(3) ≈ 1.585.
➢ For 4 classes: maximum entropy is log₂(4) = 2, and so on.
➢ G = 0 indicates a perfectly pure node (all elements belong to the same class).
➢ G = 0.5 indicates maximum impurity for binary classification (elements are evenly split between the two classes); in general, the maximum Gini impurity is 1 − 1/k for k classes.
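A minimal Python sketch (the label lists are hypothetical, not data from the text) that reproduces the boundary values listed above:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    h = 0.0
    for count in Counter(labels).values():
        p = count / n
        h -= p * math.log2(p)
    return h

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

print(entropy(["yes"] * 4))        # 0.0   -> pure node
print(entropy(["yes", "no"] * 2))  # 1.0   -> balanced binary classes
print(entropy(["a", "b", "c"]))    # 1.585 -> log2(3), three balanced classes
print(gini(["yes"] * 4))           # 0.0   -> pure node
print(gini(["yes", "no"] * 2))     # 0.5   -> balanced binary classes
```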
Gini impurity
➢Gini impurity is a measure of the impurity or disorder in a set of elements,
commonly used in decision tree algorithms, especially for classification tasks. It
quantifies the likelihood of misclassification of a randomly chosen element in the
dataset.
➢A lower Gini impurity suggests a more homogeneous set of elements within the
node, making it an attractive split in a decision tree. Decision tree algorithms aim
to minimize the Gini impurity at each node, selecting the feature and split point
that results in the lowest impurity.
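As a sketch of how a candidate split is scored (the class counts below are hypothetical, not from the text): the split whose child nodes have the lowest weighted Gini impurity is preferred.

```python
def gini_from_counts(counts):
    """Gini impurity of a node given its per-class sample counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_gini(children):
    """Weighted average Gini impurity over child nodes (per-class counts each)."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini_from_counts(c) for c in children)

# Two hypothetical candidate splits of a node holding 10 "yes" / 10 "no":
split_a = [(9, 1), (1, 9)]   # nearly pure children
split_b = [(6, 4), (4, 6)]   # mixed children

print(weighted_gini(split_a))  # 0.18 -> lower impurity, preferred split
print(weighted_gini(split_b))  # 0.48
```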
❑ To check the impurity of feature 2 and feature 3, we take the help of the
entropy formula.
For feature 3,
➢We can clearly see from the tree itself that the left node has lower entropy (more
purity) than the right node, since the left node has a greater number of “yes” samples,
making the decision easier there.
➢Always remember that the higher the entropy, the lower the purity and the higher
the impurity.
➢As mentioned earlier, the goal is to decrease the uncertainty or impurity in the
dataset. Entropy only tells us the impurity of a particular node; it does not tell us
whether the entropy has decreased relative to the parent node.
➢For this, we bring in a new metric called “Information gain”, which tells us how
much the parent entropy has decreased after splitting on some feature.
Information Gain
➢Information gain represents the difference in entropy before and after a split on a
given attribute. The attribute with the highest information gain will produce the
best split as it’s doing the best job at classifying the training data according to its
target classification. Information gain is usually represented with the following
formula,
where;
➢If the dataset is huge, we should choose Gini impurity, as its calculation is much simpler than entropy (it avoids computing logarithms).
Example 2: Imagine that we have the following arbitrary dataset:
➢For this dataset, the entropy is 0.94. This can be calculated by finding the
proportion of days where “Play Tennis” is “Yes”, which is 9/14, and the proportion
of days where “Play Tennis” is “No”, which is 5/14. Then, these values can be
plugged into the entropy formula above.
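Plugging these proportions into the entropy formula gives:

```latex
\mathrm{Entropy}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940
```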
➢We can then compute the information gain for each of the attributes individually.
For example, the information gain for the attribute “Humidity” would be the
following:
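The dataset table is not reproduced here, so the counts below are an assumption based on the classic Play Tennis example (Humidity = High: 3 “Yes” / 4 “No”; Humidity = Normal: 6 “Yes” / 1 “No”), which is consistent with the 9/14 and 5/14 totals above:

```python
import math

def entropy(pos, neg):
    """Entropy of a node containing `pos` positive and `neg` negative samples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

parent = entropy(9, 5)                       # ≈ 0.940 (whole dataset)
high, normal = entropy(3, 4), entropy(6, 1)  # ≈ 0.985 and ≈ 0.592 (assumed counts)
gain = parent - (7 / 14) * high - (7 / 14) * normal
print(round(gain, 3))                        # ≈ 0.151
```

Under these assumed counts, the information gain for “Humidity” works out to roughly 0.151.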
Advantages of Decision Trees
• Easy to Understand: They are simple to visualize and interpret, making them easy
to understand even for non-experts.
• Handles Both Numerical and Categorical Data: They can work with both types of
data without needing much preprocessing.
• No Need for Data Scaling: These trees do not require normalization or scaling of
data.
• Automated Feature Selection: They automatically identify the most important
features for decision-making.
• Handles Non-Linear Relationships: They can capture non-linear patterns in the
data effectively.
Disadvantages of Decision Trees
• Overfitting Risk: Decision trees can easily overfit the training data, especially if they
are too deep.
• Unstable with Small Changes: Small changes in data can lead to completely
different trees.
• Biased with Imbalanced Data: They tend to be biased if one class dominates the
dataset.
• Limited to Axis-Parallel Splits: They struggle with diagonal or complex decision
boundaries.
• Can Become Complex: Large trees can become hard to interpret and may lose
their simplicity.