
Sardar Patel Institute of Technology, Mumbai

Department of Electronics and Telecommunication Engineering


B.E. Sem-VII- PE-IV (2024-2025)
IT 24 - AI in Healthcare

Experiment 2: Decision Tree (ID3) Algorithm

Name: Sanika Tiwarekar Date: 26/08/2024

Objective: Write a Python program to demonstrate the working of the decision-tree-based
ID3 algorithm, using an appropriate medical data set to build the decision tree, and
apply this knowledge to forecast outcomes.
Outcomes:
1. Find the entropy of the data and follow the steps of the algorithm to construct a tree.
2. Represent a hypothesis using a decision tree.
3. Apply the Decision Tree algorithm to classify the given data.
4. Interpret the output of the Decision Tree.

System Requirements:
Linux OS with Python and the required libraries, or R, or Windows with MATLAB
Theory:

The decision tree builds classification or regression models in the form of a tree structure. It breaks down
a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is
incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node
(e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy). A leaf node (e.g., Play)
represents a classification or decision. The topmost decision node in a tree, which corresponds to the best
predictor, is called the root node. Decision trees can handle both categorical and numerical data.
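
As a minimal illustration of this structure (not part of the original experiment; the feature names and values below are invented for the example), a tiny tree can be fitted and its decision nodes and leaf nodes printed with scikit-learn's export_text:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data, purely for illustration
toy = pd.DataFrame({
    'tumour_size': [1.2, 3.5, 2.8, 0.9, 4.1, 1.5],
    'age': [45, 62, 58, 39, 70, 50],
    'diagnosis': ['B', 'M', 'M', 'B', 'M', 'B'],
})

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(toy[['tumour_size', 'age']], toy['diagnosis'])

# export_text lists the decision nodes (feature <= threshold) and the
# leaf nodes (class: ...) that make up the fitted tree
print(export_text(clf, feature_names=['tumour_size', 'age']))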

Entropy
A decision tree is built top-down from a root node and involves partitioning the data into subsets that
contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the
homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample
is equally divided, it has an entropy of one.

E(S) = Σ_i −p_i log2(p_i) is the entropy of the entire set S, where p_i is the proportion of class i in S. The second term, E(S, A) = Σ_v (|S_v| / |S|) · E(S_v), is the weighted entropy of the subsets S_v obtained by splitting S on the values v of an attribute A.
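
As a minimal sketch (not part of the original listing; the labels used here are invented), the entropy of a set of class labels can be computed in Python as follows:

import numpy as np
import pandas as pd

def entropy(labels):
    # E(S) = sum over classes i of -p_i * log2(p_i)
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

# A completely homogeneous sample has entropy 0
print(entropy(pd.Series(['M', 'M', 'M', 'M'])))
# An equally divided sample has entropy 1
print(entropy(pd.Series(['M', 'M', 'B', 'B'])))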

Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute:
Gain(S, A) = E(S) − E(S, A). Constructing a decision tree is all about finding the attribute that returns
the highest information gain (i.e., the most homogeneous branches).
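
A short, self-contained sketch of this computation (illustrative only; the attribute and label names are hypothetical, not taken from the experiment's dataset). It computes Gain(S, A) = E(S) − E(S, A) by weighting the entropy of each subset produced by splitting on the attribute:

import numpy as np
import pandas as pd

def entropy(labels):
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(df, attribute, target):
    # Gain(S, A) = E(S) - sum over values v of A of (|S_v| / |S|) * E(S_v)
    total_entropy = entropy(df[target])
    weighted_entropy = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(attribute)
    )
    return total_entropy - weighted_entropy

# Hypothetical example: the attribute splits the labels perfectly, so the gain is 1.0
toy = pd.DataFrame({
    'tumour_grade': ['high', 'high', 'low', 'low'],
    'diagnosis': ['M', 'M', 'B', 'B'],
})
print(information_gain(toy, 'tumour_grade', 'diagnosis'))
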
Dataset Description:
https://www.kaggle.com/datasets/reihanenamdari/breast-cancer
The dataset involves female patients with infiltrating duct and lobular carcinoma breast cancer diagnosed
in 2006-2010. Patients with unknown tumour size, unknown examined regional lymph nodes (LNs), unknown
positive regional LNs, or a survival time of less than 1 month were excluded; thus, 4024 patients were
ultimately included.
https://www.kaggle.com/datasets/amirhosseinmirzaie/countries-life-expectancy
This dataset contains 18 columns that can be used to uncover the factors behind differences in longevity
among countries.

Decision Tree Classifier Code:


# Imports needed for the classifier experiment
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Load the breast cancer dataset
df = pd.read_csv('/content/drive/MyDrive/AIH C4/Exp2/Breast Cancer Prediction.zip')
x = df.drop(columns=['diagnosis'])
y = df['diagnosis']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(random_state=2)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

#result before pruning the decision tree


print(classification_report(y_test, y_pred))

# plot the decision tree


from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plt.figure(figsize=(20,20))
plot_tree(clf, filled=True)

# plot the confusion matrix


from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')

# Remove 40% of the benign tumours (keep a fraction of 0.6) to make the data more balanced
df_benign = df[df['diagnosis'] == 'B']
df_malignant = df[df['diagnosis'] == 'M']
df_benign = df_benign.sample(frac=0.6)
df = pd.concat([df_benign, df_malignant])

# Re-create the train/test split on the rebalanced data
x = df.drop(columns=['diagnosis'])
y = df['diagnosis']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Iterate over different tree depths and print the train and test accuracy
for i in range(1, 11):
    clf = DecisionTreeClassifier(max_depth=i, random_state=1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(f'Accuracy of test set with max_depth={i}: {clf.score(x_test, y_test)}')
    print(f'Accuracy of train set with max_depth={i}: {clf.score(x_train, y_train)}')
    print("-------------------------------------------------------------------------")

# choosing depth as 3
clf = DecisionTreeClassifier(max_depth=3, random_state=1)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(classification_report(y_test, y_pred))

# result after pruning


print(f'Accuracy of test set: {clf.score(x_test, y_test)}')
print(f'Accuracy of train set: {clf.score(x_train, y_train)}')

Decision Tree Classifier Output:


Printing all the columns and first 5 rows of the dataset:
Results before pruning the Decision Tree: accuracy 90%

Plotting the Decision Tree


Plotting the Confusion Matrix

Checking for Class Imbalance


Checking which max_depth gives the best test accuracy; here it is max_depth = 3

Results after pruning the Decision Tree: accuracy 96%, a 6% increase over the unpruned tree

Decision Tree Regressor Code:


# Imports needed for the regression experiment
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              ExtraTreesRegressor)
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

train_data = pd.read_csv("/content/drive/MyDrive/AIH C4/Exp2/LifeExpectancy.zip")

# The cell that splits train_data into X_train, X_val, y_train, y_val is not
# reproduced in this listing; the split is assumed to have been created earlier
# (e.g. with train_test_split on the life-expectancy target column).

def drop_col(df):
    return df.drop(labels=['Country', 'Year'], axis=1)

X_train = drop_col(X_train)
X_val = drop_col(X_val)

def preprocess_data(data):
    # Create dummy variables for the 'Status' column
    dummies = pd.get_dummies(data['Status'], dtype=int)
    data = pd.concat([data, dummies], axis=1)

    # Drop the original 'Status' column
    data = data.drop(labels='Status', axis=1)

    return data

# Preprocess train_data
X_train = preprocess_data(X_train)

# Preprocess test_data
X_val = preprocess_data(X_val)

scaler = StandardScaler()

# Fit the scaler on the training features and transform both sets.
# scaled_x_train_val is the scaled training data, kept separately so the final
# model can also be evaluated on the training set later on.
X_V = X_val.values
X_VV = X_train.values
scaled_x_train = scaler.fit_transform(X_train)
scaled_x_val = scaler.transform(X_V)
scaled_x_train_val = scaler.transform(X_VV)

models = [
LinearRegression(),
DecisionTreeRegressor(),
RandomForestRegressor(),
GradientBoostingRegressor(),
ExtraTreesRegressor(),
]

# Train and evaluate each model
for model in models:
    model.fit(scaled_x_train, y_train)
    y_pred = model.predict(scaled_x_val)
    mse = mean_squared_error(y_val, y_pred)
    mae = mean_absolute_error(y_val, y_pred)
    r2 = r2_score(y_val, y_pred)
    print(f"Model: {type(model).__name__}, r2: {r2}")

# Hyperparameter grid for tuning the DecisionTreeRegressor
# ('auto' for max_features is deprecated/removed in recent scikit-learn versions)
param_grid = {
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt', 'log2']
}

model = DecisionTreeRegressor()

grid_search = GridSearchCV(model, param_grid, scoring='r2', cv=8)


grid_search.fit(scaled_x_train, y_train)

print("Best Parameters:", grid_search.best_params_)


print("Best R2 Score:", grid_search.best_score_)

# Train the DecisionTreeRegressor with the best parameters found by GridSearchCV
# and evaluate it on the training set
model_DTR = DecisionTreeRegressor(max_depth=None, max_features='log2',
                                  min_samples_leaf=4, min_samples_split=10)
model_DTR.fit(scaled_x_train, y_train)
y_pred_train = model_DTR.predict(scaled_x_train_val)
mse = mean_squared_error(y_train, y_pred_train)
mae = mean_absolute_error(y_train, y_pred_train)
r2 = r2_score(y_train, y_pred_train)
print(f"Model: {type(model_DTR).__name__}, mse: {mse}")
print(f"Model: {type(model_DTR).__name__}, mae: {mae}")
print(f"Model: {type(model_DTR).__name__}, r2: {r2}")

# Evaluate the tuned model on the validation set
model_DTR = DecisionTreeRegressor(max_depth=None, max_features='log2',
                                  min_samples_leaf=4, min_samples_split=10)
model_DTR.fit(scaled_x_train, y_train)
y_pred = model_DTR.predict(scaled_x_val)
mse = mean_squared_error(y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)
print(f"Model: {type(model_DTR).__name__}, mse: {mse}")
print(f"Model: {type(model_DTR).__name__}, mae: {mae}")
print(f"Model: {type(model_DTR).__name__}, r2: {r2}")

Decision Tree Regressor Output:
Printing all the columns and first 5 rows of the Dataset

Printing the Relationship Plot for various features of the dataset

Results of the various models on the regression task

Output from GridSearchCV for the best parameters


Training results after applying best parameters

Testing results after applying best parameters

Plotting the results

Conclusion
● We used scikit-learn to run the Decision Tree algorithm on a larger dataset and
estimated the accuracy of the resulting models.
● In a Decision Tree, as the depth of the tree increases, the model overfits the data and
the test accuracy reduces; to avoid this, parameters for pruning the tree should be passed
to the classifier.
● We learnt about pruning and how to choose the best parameters to prune the Decision
Tree so that the test accuracy and performance of the model are enhanced.
