Decision Trees Implementation
A Comprehensive Guide
Decision trees are one of the most popular and interpretable algorithms in machine learning,
commonly used for both classification and regression tasks. They work by recursively
splitting the dataset based on feature thresholds, creating a tree-like structure where each
internal node represents a decision based on a feature, and each leaf node corresponds to
an output label or value.
The main advantages of decision trees are their simplicity, ease of visualization, and ability to
handle both numerical and categorical data. However, when implemented from scratch,
careful handling of split criteria, stopping conditions, and data preprocessing is required to
ensure the model performs optimally.
By understanding and building a decision tree from the ground up, we gain valuable insights
into the mechanics of tree-based algorithms and lay a strong foundation for extending these
concepts to advanced methods like Random Forests and Gradient Boosted Trees.
2. Entropy Formula
$$\mathrm{Entropy}(y) = -\sum_{i=1}^{k} p_i \log_2(p_i)$$
Explanation:
$p_i$ is the proportion of class $i$ in the dataset, and $k$ is the number of distinct classes.
Candidate splits are then scored by the information gain, i.e. the reduction in entropy achieved by partitioning the samples into a left and a right child:
$$\mathrm{Gain} = \mathrm{Entropy}(y) - \frac{n_{\mathrm{left}}}{n}\,\mathrm{Entropy}(y_{\mathrm{left}}) - \frac{n_{\mathrm{right}}}{n}\,\mathrm{Entropy}(y_{\mathrm{right}})$$
where $n_{\mathrm{left}}, n_{\mathrm{right}}$ are the numbers of samples in the left and right child nodes and $n = n_{\mathrm{left}} + n_{\mathrm{right}}$.
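To make the formulas concrete, here is a small NumPy sketch of how entropy and information gain can be computed (the function names entropy and information_gain are illustrative and assume integer class labels):

import numpy as np

def entropy(y):
    # p_i = proportion of each class; drop zero proportions to avoid log2(0)
    counts = np.bincount(y)
    p = counts[counts > 0] / len(y)
    return -np.sum(p * np.log2(p))

def information_gain(y, y_left, y_right):
    # Reduction in entropy obtained by splitting y into a left and a right child
    n, n_left, n_right = len(y), len(y_left), len(y_right)
    weighted_child_entropy = (n_left / n) * entropy(y_left) + (n_right / n) * entropy(y_right)
    return entropy(y) - weighted_child_entropy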
The Node class represents a single node in the decision tree. Each node can either be:
1. An internal node: Contains information about a feature index and threshold used for
splitting the data, along with pointers to its left and right child nodes.
2. A leaf node: Contains a classification value when further splits are no longer possible or
desirable.
1. feature_index
Purpose:
Stores the index of the feature used for splitting at this node.
Example: If feature_index = 2 , it means this node splits based on the third feature in the dataset.
Type: Integer or None.
Usage:
Set only for internal nodes; during prediction it selects which feature of a sample is compared against the threshold. For leaf nodes it stays None.
2. threshold
Purpose:
Stores the threshold value for the feature used to split the data at this node.
Example: If threshold = 5.5 , this node splits data into:
Left child: Samples where the feature value is ≤ 5.5.
Right child: Samples where the feature value is > 5.5.
Type: Float or None.
Usage:
Compared against a sample's feature value during traversal to decide whether to descend into the left or the right child. For leaf nodes it stays None.
3. left
Purpose:
Holds a reference to the left child Node, which receives the samples where the feature value is ≤ the threshold.
Usage:
Followed during prediction when a sample's feature value is ≤ the threshold; None for leaf nodes.
4. right
Purpose:
Holds a reference to the right child Node, which receives the samples where the feature value is > the threshold.
Usage:
Followed during prediction when a sample's feature value is > the threshold; None for leaf nodes.
5. value
Purpose:
Stores the value of the prediction (or class label) at a leaf node.
For classification tasks:
It’s the most common label in the data at this node.
For regression tasks:
It’s the mean or another aggregation metric of the target values at this node.
Type: Depends on the task: a class label for classification, a float for regression; None for internal nodes.
Node Behavior
Internal Nodes:
Have feature_index , threshold , left , and right set, while value is None.
Leaf Nodes:
Have value set, while all splitting-related attributes remain None.
Example Usage
Internal Node Example:
An internal node that splits based on the feature at index 2 with a threshold of 3.5 (left_subtree and right_subtree being previously built child nodes):
node = Node(feature_index=2, threshold=3.5, left=left_subtree, right=right_subtree)
Leaf Node Example:
A leaf node that predicts class 1:
leaf = Node(value=1)
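Putting these attributes together, a minimal Node class consistent with the description above could look like this (a sketch; the original definition may differ in detail):

class Node:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
        self.feature_index = feature_index  # feature used for splitting (internal nodes only)
        self.threshold = threshold          # split threshold (internal nodes only)
        self.left = left                    # child for samples with feature value <= threshold
        self.right = right                  # child for samples with feature value > threshold
        self.value = value                  # predicted class label (leaf nodes only)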
2. _build_tree(self, X, y, depth)
Purpose: Recursively builds the decision tree.
Steps:
1. Check the Stopping Criteria: If the maximum depth is reached, there are fewer than min_samples_split samples, or all labels in y are identical, create a leaf node with the most common label in y .
2. Find the Best Split: Evaluate candidate features and thresholds and keep the split with the highest information gain.
3. Handle Failed Splits: If no split improves the gain, create a leaf node with the most common label in y .
4. Split Data and Recur: Partition the samples into left (≤ threshold) and right (> threshold) subsets and recursively build each child with depth + 1. A sketch of the full method follows below.
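A sketch of these steps in code, assuming a helper method _best_split(X, y) that returns the best (feature_index, threshold, gain) triple; the helper name and signature are illustrative:

def _build_tree(self, X, y, depth=0):
    n_samples = X.shape[0]

    # 1. Stop criteria: maximum depth, too few samples, or a pure node
    if depth >= self.max_depth or n_samples < self.min_samples_split or len(np.unique(y)) == 1:
        return Node(value=self._most_common_label(y))

    # 2. Find the split with the highest information gain
    best_feature, best_threshold, best_gain = self._best_split(X, y)

    # 3. If no split improves the gain, fall back to a leaf with the majority label
    if best_gain <= 0:
        return Node(value=self._most_common_label(y))

    # 4. Partition the samples and recursively build the children
    left_mask = X[:, best_feature] <= best_threshold
    left = self._build_tree(X[left_mask], y[left_mask], depth + 1)
    right = self._build_tree(X[~left_mask], y[~left_mask], depth + 1)
    return Node(feature_index=best_feature, threshold=best_threshold, left=left, right=right)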
3. _most_common_label(self, y)
Purpose: Finds the most common class in the given labels y .
Steps:
Counts the occurrences of each label in y and returns the label with the highest count.
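A minimal sketch using collections.Counter:

from collections import Counter

def _most_common_label(self, y):
    # Return the label that occurs most often in y
    return Counter(y).most_common(1)[0][0]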
4. fit(self, X, y)
Purpose: Fits (or trains) the decision tree on the training data.
Steps:
Calls _build_tree with the training data X , labels y , and an initial depth of 0 .
Stores the resulting tree in self.root .
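In code this is essentially a two-line wrapper (a sketch):

def fit(self, X, y):
    # Build the tree from the training data and keep a reference to the root
    self.root = self._build_tree(X, y, depth=0)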
5. _predict(self, x, node)
Purpose: Predicts the class for a single sample x by traversing the tree.
Steps:
If the current node is a leaf (its value is not None ), return that value.
Otherwise, compare the sample's value at feature_index with the node's threshold and recurse into the left child (≤ threshold) or the right child (> threshold).
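A sketch of the traversal logic:

def _predict(self, x, node):
    # Leaf node: return the stored class label
    if node.value is not None:
        return node.value
    # Internal node: follow the branch indicated by the feature/threshold comparison
    if x[node.feature_index] <= node.threshold:
        return self._predict(x, node.left)
    return self._predict(x, node.right)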
6. predict(self, X)
Purpose: Predicts the class for all samples in the dataset X .
Steps:
Calls _predict for every sample in X , starting from self.root , and collects the results into an array of predictions.
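This reduces to a short comprehension over the samples (sketch):

def predict(self, X):
    # Traverse the tree from the root for every sample in X
    return np.array([self._predict(x, self.root) for x in X])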
7. _count_nodes(self, node, counts)
Purpose: Recursively walks the tree and counts the different kinds of nodes:
Root node
Internal nodes
Leaf nodes
Steps:
If the node is a leaf node (its value is not None ), increment the leaves count.
Otherwise, increment the internal_nodes count.
Recurse into the left and right children.
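One possible sketch of this recursive helper, following the steps above:

def _count_nodes(self, node, counts):
    if node is None:
        return
    # Leaf nodes store a value; every other node is internal
    if node.value is not None:
        counts["leaves"] += 1
    else:
        counts["internal_nodes"] += 1
    self._count_nodes(node.left, counts)
    self._count_nodes(node.right, counts)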
8. count_nodes(self)
Purpose: Provides a summary of the number of different types of nodes in the tree.
Steps:
Initializes a counts dictionary with root = 1 and zero internal nodes and leaves, calls _count_nodes on the root, and returns the counts.
9. _print_tree(self, node, depth)
Purpose: Recursively prints the structure of the tree, indenting each line according to the node's depth.
Steps:
For leaf nodes, print "Leaf Node: Class = ..." with indentation proportional to depth.
For internal nodes, print "Internal Node: Feature[...] <= ..." with the feature and
threshold.
Recurse into the left and right children, increasing the depth.
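A sketch following these steps, using two spaces of indentation per level (the exact formatting is an assumption):

def _print_tree(self, node, depth=0):
    indent = "  " * depth
    if node.value is not None:
        # Leaf node: print the predicted class
        print(f"{indent}Leaf Node: Class = {node.value}")
        return
    # Internal node: print the feature index and threshold, then recurse deeper
    print(f"{indent}Internal Node: Feature[{node.feature_index}] <= {node.threshold}")
    self._print_tree(node.left, depth + 1)
    self._print_tree(node.right, depth + 1)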
10. print_tree(self)
Purpose: Prints the entire tree structure starting from the root.
Steps:
Calls _print_tree with the root node and an initial depth of 0 .
Class Summary
This DecisionTree class:
1. Builds a Tree:
Using recursive splitting based on the best information gain.
2. Predicts Classes:
Traverses the tree to make predictions for given inputs.
3. Analyzes the Tree:
Counts the types of nodes.
Prints the tree structure.
# Stop criteria
if depth >= self.max_depth or n_samples < self.min_samples_split or len(np.unique(y)) == 1:
    leaf_value = self._most_common_label(y)
    return Node(value=leaf_value)
def count_nodes(self):
    counts = {"root": 1, "internal_nodes": 0, "leaves": 0}
    self._count_nodes(self.root, counts)
    return counts

def print_tree(self):
    print("Decision Tree Structure:")
    self._print_tree(self.root)
import pandas as pd

# Load dataset
# data = load_iris()
# X, y = data.data, data.target
df = pd.read_csv('heart.csv')
df
Out[24]: DataFrame preview (truncated): columns age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, …
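The cell that separates the features from the label and creates the train/test split is not shown above; a typical version, assuming the label column in heart.csv is named target and an 80/20 split, would be:

from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)   # 'target' as the label column is an assumption
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)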
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for column in X_train.columns:
    if X_train[column].dtype == 'object':  # Encode only non-numerical columns
        X_train[column] = le.fit_transform(X_train[column])
        X_test[column] = le.transform(X_test[column])
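The cell that creates and fits the scratch tree is also not shown; assuming the constructor takes max_depth and min_samples_split (the parameters referenced in the stop criteria above), it would look roughly like:

tree = DecisionTree(max_depth=10, min_samples_split=2)
tree.fit(X_train.values, y_train.values)   # .values assumes the class indexes NumPy arrays

Depending on how the class indexes its inputs, X_test may similarly need to be passed as a NumPy array to predict.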
# Make predictions
y_pred = tree.predict(X_test)
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Scratch Tree : {accuracy}")
from sklearn.tree import DecisionTreeClassifier

tree1 = DecisionTreeClassifier(max_depth=10)
tree1.fit(X_train, y_train)
predictions1 = tree1.predict(X_test)
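For a like-for-like comparison, the scikit-learn tree's accuracy can be computed the same way:

accuracy1 = accuracy_score(y_test, predictions1)
print(f"Accuracy of Sklearn Tree : {accuracy1}")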
Root Node: 1
Internal Nodes: 2
Leaf Nodes: 3