Decision tree classifier
Dishant Kumar Yadav 2021BCS0136
Implementation:
General Terms: Let us first discuss a few statistical concepts used in this post.
Entropy: The entropy of a dataset is a measure of its impurity. Entropy can also be thought of
as a measure of uncertainty, and we should try to minimize it. The goal of machine learning
models is to reduce uncertainty, or entropy, as far as possible.
Information Gain: Information gain is a measure of how much information a feature gives us
about the classes. The decision tree algorithm always tries to maximize information gain.
A feature that perfectly partitions the data gives the maximum information, and the feature
with the highest information gain is used for the first split. A small worked illustration follows below.
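To make the definitions concrete: for class proportions p_i, entropy is -sum(p_i * log2(p_i)), and information gain is the dataset entropy minus the weighted entropy of the subsets a feature produces. The small sketch below (illustrative only, not part of the assignment code; the toy label array is made up) computes the entropy of a label column that is split 50/50 between two classes.
# Illustrative sketch: entropy of a perfectly balanced two-class label
import numpy as np

toy_labels = np.array([0, 0, 1, 1])                        # hypothetical toy labels
counts = np.unique(toy_labels, return_counts=True)[1]      # samples per class
probabilities = counts / counts.sum()                      # class proportions p_i
entropy = -np.sum(probabilities * np.log2(probabilities))
print(entropy)  # 1.0 -- maximum uncertainty for a two-class label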
Import Libraries:
We are going to import NumPy and pandas, and then upload the dataset from the local machine using the Colab files helper.
# Import the required libraries
import pandas as pd
import numpy as np
from google.colab import files
uploaded = files.upload()
diabetes11.csv (text/csv) - 7491 bytes, last modified: 17/1/2024 - 100% done
Saving diabetes11.csv to diabetes11.csv
import shutil
# Move the uploaded file 'diabetes11.csv' into the /content directory
shutil.move('diabetes11.csv', '/content/diabetes11.csv')
'/content/diabetes11.csv'
import os
# List files in the /content directory
os.listdir('/content')
['.config',
'diabetes (1).csv',
'diabetes.csv',
'diabetes11.csv',
'sample_data']
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv('/content/diabetes11.csv')
# Display the first few rows of the DataFrame
df.head()
index Glucose BloodPressure diabetes
0 148 72 1
1 85 66 0
2 183 64 1
3 89 66 0
4 137 40 1
# Define the calculate entropy function
def calculate_entropy(df_label):
    # Count how many samples fall into each class
    classes, class_counts = np.unique(df_label, return_counts=True)
    # Entropy = -sum(p_i * log2(p_i)) over the class proportions p_i
    entropy_value = np.sum([(-class_counts[i] / np.sum(class_counts)) *
                            np.log2(class_counts[i] / np.sum(class_counts))
                            for i in range(len(classes))])
    return entropy_value
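As a quick sanity check (this call is a sketch and not part of the original notebook), the function can be applied directly to the label column of the loaded DataFrame:
# Hypothetical usage: entropy of the 'diabetes' label column
print(calculate_entropy(df['diabetes']))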
# Define the calculate information gain function
def calculate_information_gain(dataset, feature, label):
    # Calculate the entropy of the whole dataset
    dataset_entropy = calculate_entropy(dataset[label])
    values, feat_counts = np.unique(dataset[feature], return_counts=True)
    # Calculate the weighted feature entropy: call calculate_entropy on the subset
    # of rows that take each value of the feature
    weighted_feature_entropy = np.sum([(feat_counts[i] / np.sum(feat_counts)) *
                                       calculate_entropy(dataset.where(dataset[feature] == values[i]).dropna()[label])
                                       for i in range(len(values))])
    feature_info_gain = dataset_entropy - weighted_feature_entropy
    return feature_info_gain
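To see which attribute the tree would split on first, the function can be called for every feature; the short loop below is a hypothetical illustration, not part of the original notebook, and simply prints the gain of each column against the 'diabetes' label.
# Hypothetical usage: information gain of each feature with respect to the label
for feature in ['Glucose', 'BloodPressure']:
    print(feature, calculate_information_gain(df, feature, 'diabetes'))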
# Set the features and label
features = df.columns[:-1]
label = 'diabetes'
parent=None
features
Index(['Glucose', 'BloodPressure'], dtype='object')
import numpy as np

def create_decision_tree(dataset, df, features, label, parent=None):
    # Class counts over the full dataframe, used for the majority-class fallback
    datum = np.unique(df[label], return_counts=True)
    unique_data = np.unique(dataset[label])
    if len(unique_data) <= 1:
        # All remaining samples belong to one class: return that class as a leaf
        return unique_data[0]
    elif len(dataset) == 0:
        # No samples left: return the majority class
        return unique_data[np.argmax(datum[1])]
    elif len(features) == 0:
        # No features left to split on: return the parent node's class
        return parent
    else:
        parent = unique_data[np.argmax(datum[1])]
        # Call the calculate_information_gain function for each candidate feature
        item_values = [calculate_information_gain(dataset, feature, label) for feature in features]
        # Split on the feature with the highest information gain
        optimum_feature = features[np.argmax(item_values)]