Decision Tree and Random Forest
Decision Tree
• "A decision tree in machine learning is a flowchart structure in which each node
represents a "test" on the attribute and each branch represents the outcome of the
test."
• The end node (called leaf node) represents a class label.
• Decision Tree is a supervised learning technique that can be used for
both classification and regression problems, but it is mostly preferred
for solving classification problems.
• In a decision tree, there are two types of nodes: decision nodes and
leaf nodes. Decision nodes are used to make decisions and have
multiple branches, whereas leaf nodes are the outputs of those
decisions and do not contain any further branches.
• To make a prediction using the decision tree, we start at the root
node and follow the branches based on the feature values of the new
instance (see the sketch after this list).
• While implementing a decision tree, the main issue is how to select
the best attribute for the root node and for the sub-nodes. To solve
this problem, a technique called Attribute Selection Measure (ASM) is
used.
• By this measure, we can easily select the best attribute for the
nodes of the tree. The two popular ASM techniques are:
• Information Gain
• Gini Index
• The decision tree algorithm starts with the entire dataset and selects
the best feature to split the data based on a criterion, such as
information gain or Gini impurity.
• The selected feature becomes the root node of the tree, and the data
is split into different branches based on the possible attribute values.
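A minimal sketch of these two ideas (splitting on a chosen criterion, then predicting by walking the tree from the root), assuming scikit-learn is available; the tiny feature matrix and labels below are invented purely for illustration:
```python
# Minimal sketch: train a decision tree and predict for a new instance.
from sklearn.tree import DecisionTreeClassifier

# Toy dataset: each row is [feature_1, feature_2], labels are 0/1.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [1, 2], [2, 2]]
y = [0, 0, 0, 1, 1, 1]

# criterion="entropy" uses information gain; criterion="gini" uses the Gini index.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Prediction starts at the root node and follows branches
# according to the feature values of the new instance.
print(tree.predict([[2, 1]]))  # e.g. [1]
```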
1. Information gain
• Information gain is the measure of the change in entropy after a
dataset is split on an attribute.
• It calculates how much information a feature provides about the class.
• According to the value of information gain, we split the node and build
the decision tree.
• A decision tree algorithm always tries to maximize the value of
information gain, and the node/attribute with the highest information
gain is split first.
• It can be calculated using the below formula, where S_v is the subset
of S for which the attribute takes value v and |S_v| / |S| is its weight:
Information Gain = Entropy(S) − Σ_v (|S_v| / |S|) × Entropy(S_v)
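A small Python sketch of this formula (entropy uses base-2 logarithms; the attribute and label values below are made up for illustration):
```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p * log2(p)) over the class proportions p."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Information gain from splitting `labels` by the attribute `values`."""
    total = len(labels)
    subsets = {}
    for v, label in zip(values, labels):
        subsets.setdefault(v, []).append(label)
    weighted = sum((len(s) / total) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical example: one attribute's values and the corresponding class labels.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
play    = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
print(round(information_gain(outlook, play), 3))  # -> 0.667
```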
2. The Gini Index
• The Gini Index is calculated by subtracting the sum of the squared
probabilities of each class from one.
• Gini index is a measure of impurity or purity used while creating a
decision tree in the CART(Classification and Regression Tree)
algorithm.
• An attribute with a low Gini index should be preferred over one with
a high Gini index. For a perfectly classified node, the Gini index is zero.
• Gini index can be calculated using the below formula, where p_j is the
proportion of class j in the node:
Gini Index = 1 − Σ_j (p_j)²
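A short Python sketch of the same calculation (the example labels are invented for illustration):
```python
from collections import Counter

def gini_index(labels):
    """Gini = 1 - sum(p_j^2) over the class proportions p_j."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_index(["yes", "yes", "yes"]))    # 0.0 -> perfectly classified node
print(round(gini_index(["yes", "no"]), 2))  # 0.5 -> maximally impure two-class node
```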
Decision Tree Terminology
• Entropy E(X): the "average amount of information" contained in a random variable X is
called entropy. It is denoted by E or H.
• In other words, entropy is the "measure of randomness" of a variable: entropy E is the
measure of impurity or uncertainty associated with a random variable X.
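The standard formula (not written out in the slides; log2 is used, so entropy is measured in bits):
E(X) = − Σ_i p(x_i) × log2 p(x_i)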
ID3 Algorithm
ID3 Steps
1. Calculate the Information Gain of each feature.
2. Considering that all rows don’t belong to the same class, split the dataset S into subsets
using the feature for which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information gain.
4. If all rows belong to the same class, make the current node as a leaf node with the class as
its label.
5. Repeat for the remaining features until we run out of features, or the decision tree
consists entirely of leaf nodes.
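A compact Python sketch of these steps (a simplified ID3 for categorical features; the function and variable names are my own, and each row is assumed to be a dict mapping feature name to value):
```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    weighted = sum((len(s) / total) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

def id3(rows, labels, features):
    # Step 4: all rows belong to the same class -> leaf node labelled with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # No features left -> leaf node labelled with the majority class.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1-3: pick the feature with maximum information gain and make it a node.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    node = {best: {}}
    remaining = [f for f in features if f != best]
    # Step 2: split the dataset into subsets, one per value of the chosen feature.
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                remaining)
    return node

# Hypothetical two-feature example (not the 14-row table referred to below).
rows = [{"outlook": "sunny", "windy": "yes"},
        {"outlook": "sunny", "windy": "no"},
        {"outlook": "rain",  "windy": "yes"},
        {"outlook": "rain",  "windy": "no"}]
labels = ["no", "yes", "no", "yes"]
print(id3(rows, labels, ["outlook", "windy"]))  # e.g. {'windy': {'yes': 'no', 'no': 'yes'}}
```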
Here, there are nine "Yes" and five "No" examples, i.e. 9 positive and 5 negative examples, in Table 1.
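As a quick worked check on that split of 9 positive and 5 negative examples (14 rows in total), the entropy of the full dataset is roughly 0.94 bits:
Entropy(S) = −(9/14) × log2(9/14) − (5/14) × log2(5/14) ≈ 0.940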
Overfitting in Decision Trees
• An overfit condition arises when the model memorizes the noise in the
training data and fails to capture the important patterns.
• A decision tree that fits the training data perfectly performs well on the
training data but poorly on unseen test data (see the sketch below).
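A minimal sketch of this effect, assuming scikit-learn is available (the dataset is synthetic and the exact scores will vary, but an unconstrained tree typically scores near 100% on the training split and noticeably lower on the held-out split):
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (invented purely for illustration).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown tree with no depth limit tends to memorize the training noise.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # usually ~1.0
print("test accuracy: ", tree.score(X_test, y_test))    # noticeably lower
```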
Random Forest Technique
• Random Forest is a popular machine learning algorithm that belongs
to the family of supervised learning techniques.
• Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority vote of those
predictions, it outputs the final prediction (see the sketch at the end of
this section).
• A greater number of trees in the forest generally leads to higher accuracy
and helps prevent the problem of overfitting.
• Random Forest employs a technique called bootstrapping, where
multiple random subsets (with replacement) are created from the
original training dataset.
• They can handle both numerical and categorical features without requiring
extensive data preprocessing.
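A minimal sketch of the two mechanisms described above, bootstrapping and majority voting, built from scikit-learn decision trees (in practice one would simply use sklearn.ensemble.RandomForestClassifier; the synthetic data and tree count here are chosen purely for illustration):
```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)

# Bootstrapping: each tree is trained on a random subset drawn with replacement.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # sample row indices with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority voting: the forest's prediction is the most common tree prediction.
def forest_predict(x):
    votes = [int(t.predict(x.reshape(1, -1))[0]) for t in trees]
    return Counter(votes).most_common(1)[0][0]

print(forest_predict(X[0]), "vs. true label", y[0])
```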