AIH_Lab2
Objective: Write a Python program to demonstrate the working of the decision-tree-based
ID3 algorithm, using an appropriate medical data set to build the decision tree, and
apply this knowledge to make predictions on new samples.
Outcomes:
1. Find the entropy of the data and follow the steps of the algorithm to construct a tree.
2. Represent a hypothesis using a decision tree.
3. Apply the Decision Tree algorithm to classify the given data.
4. Interpret the output of the Decision Tree.
System Requirements:
Linux OS with Python and its libraries, or R, or Windows with MATLAB
Theory:
The decision tree builds classification or regression models in the form of a tree structure. It breaks down
a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is
incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node
(e.g., Outlook) has two or more branches (e.g., Sunny, Overcast, and Rainy). A leaf node (e.g., Play)
represents a classification or decision. The topmost decision node in the tree, which corresponds to the best
predictor, is called the root node. Decision trees can handle both categorical and numerical data.
Entropy
A decision tree is built top-down from a root node and involves partitioning the data into subsets that
contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the
homogeneity of a sample. If the sample is completely homogeneous the entropy is zero, and if the sample
is equally divided it has an entropy of one. For a set S whose classes occur with proportions p_i, the entropy is
E(S) = -Σ_i p_i log2(p_i)
and the entropy of S after it is split on an attribute A with values v is
E(S, A) = Σ_v P(v) · E(S_v).
E(S) is the entropy of the entire set, while the second term E(S, A) is the entropy relative to an attribute A.
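As a worked example (using the classic play-tennis figures rather than this lab's dataset): for a set of
14 examples with 9 positive and 5 negative,
E(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940.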
Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute:
Gain(S, A) = E(S) - E(S, A).
Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the
most homogeneous branches).
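The following is a minimal sketch of these two computations, assuming a pandas DataFrame with a
categorical target column (the function and column names here are illustrative, not part of the lab code):

import numpy as np
import pandas as pd

def entropy(labels):
    # E(S) = -sum(p_i * log2(p_i)) over the class proportions
    probs = labels.value_counts(normalize=True)
    return -(probs * np.log2(probs)).sum()

def information_gain(df, attribute, target):
    # Gain(S, A) = E(S) - sum(P(v) * E(S_v)) over the values v of attribute A
    total = entropy(df[target])
    weighted = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(attribute)
    )
    return total - weighted

ID3 picks the attribute with the highest information_gain at each node and recurses on each branch.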
Dataset Description:
https://www.kaggle.com/datasets/reihanenamdari/breast-cancer
The dataset involves female patients with infiltrating duct and lobular carcinoma breast cancer diagnosed
in 2006-2010. Patients with unknown tumour size, unknown examined regional LNs, unknown positive
regional LNs, and patients whose survival was less than 1 month were excluded; thus, 4024 patients were
ultimately included.
https://www.kaggle.com/datasets/amirhosseinmirzaie/countries-life-expectancy
This dataset contains 18 columns that can be used to uncover the reasons for differences in longevity
among countries.
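Before the classification snippet below, the breast-cancer data must be loaded and split. A minimal
sketch, assuming the Kaggle CSV is saved locally as 'Breast_Cancer.csv' with 'Status' as the target
column (the file name and encoding choices are assumptions about the download):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Load the data and separate features from the target
df = pd.read_csv('Breast_Cancer.csv')
X = pd.get_dummies(df.drop(columns=['Status']))  # one-hot encode categorical features
y = df['Status']

# Hold out 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)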
# Iterate over different tree depths and print the train and test accuracy
for i in range(1, 11):
    clf = DecisionTreeClassifier(max_depth=i, random_state=1)
    clf.fit(x_train, y_train)
    print(f'Accuracy of test set with max_depth={i}: {clf.score(x_test, y_test)}')
    print(f'Accuracy of train set with max_depth={i}: {clf.score(x_train, y_train)}')
    print('-' * 79)

# Choosing depth as 3 and reporting per-class precision, recall, and F1
clf = DecisionTreeClassifier(max_depth=3, random_state=1)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(classification_report(y_test, y_pred))
Results after pruning the Decision Tree: accuracy of 96%, a 6% increase over the unpruned
tree.
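Limiting max_depth, as above, is a simple form of pre-pruning; scikit-learn also supports post-pruning
via cost-complexity pruning. A minimal sketch of searching over the candidate alphas (this search loop
is illustrative, not part of the original lab listing):

# Get the candidate ccp_alpha values suggested by the training data
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(x_train, y_train)
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=1)
    pruned.fit(x_train, y_train)
    score = pruned.score(x_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score
print(f'Best ccp_alpha={best_alpha}, test accuracy={best_score}')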
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              ExtraTreesRegressor)

def drop_col(df):
    # Drop identifier columns that carry no predictive signal
    return df.drop(labels=['Country', 'Year'], axis=1)

X_train = drop_col(X_train)
X_val = drop_col(X_val)

def preprocess_data(data):
    # Replace the categorical 'Status' column with dummy variables
    dummies = pd.get_dummies(data['Status'], dtype=int)
    data = pd.concat([data, dummies], axis=1)
    return data.drop(labels=['Status'], axis=1)  # drop the original string column

# Preprocess train and validation data
X_train = preprocess_data(X_train)
X_val = preprocess_data(X_val)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
scaled_x_train = scaler.fit_transform(X_train)
scaled_x_val = scaler.transform(X_val)

# Candidate regressors to compare
models = [
    LinearRegression(),
    DecisionTreeRegressor(),
    RandomForestRegressor(),
    GradientBoostingRegressor(),
    ExtraTreesRegressor(),
]
model = DecisionTreeRegressor()
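A minimal sketch of fitting and comparing these models on the scaled data (the y_train/y_val life
expectancy targets and the R^2 comparison loop are assumptions, not shown in the original listing):

# Fit each candidate model and report its R^2 score on the validation set
for m in models:
    m.fit(scaled_x_train, y_train)
    print(f'{type(m).__name__}: R^2 = {m.score(scaled_x_val, y_val):.3f}')

# Fit the chosen decision-tree regressor the same way
model.fit(scaled_x_train, y_train)
print('DecisionTreeRegressor validation R^2:', model.score(scaled_x_val, y_val))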
Conclusion
● We used scikit-learn to run the Decision Tree algorithm on a larger dataset and estimate the
accuracy of the resulting model.
● In a Decision Tree, as the depth of the tree increases the model overfits the data and test
accuracy falls; to avoid this, pruning parameters should be passed to the classifier.
● We learnt about pruning and how to choose the best parameters to prune the Decision
Tree so that it improves the test accuracy and performance of the model.