
Prathmesh Gaikwad

TUS3F202128 C36

PART A
(PART A: TO BE REFERRED BY STUDENTS)

Experiment No. 8

A.1 Aim:
To implement CART.

A.2 Prerequisite:
Python Basic Concepts

A.3 Outcome:
Students will be able to implement a decision tree using CART.

A.4 Theory:

CART (Classification and Regression Tree) is a variation of the decision tree algorithm. It
can handle both classification and regression tasks. Scikit-Learn uses the Classification and
Regression Tree (CART) algorithm to train Decision Trees (training a tree is also called
"growing" it). CART was first introduced by Leo Breiman, Jerome Friedman, Richard Olshen,
and Charles Stone in 1984.
CART Algorithm
CART is a predictive algorithm used in machine learning that describes how the values of the
target variable can be predicted from the other variables. It is a decision tree in which each
fork is a split on a predictor variable and each leaf node holds a prediction for the target
variable.
In the decision tree, nodes are split into sub-nodes on the basis of a threshold value of an
attribute. The root node holds the full training set and is split in two by choosing the best
attribute and threshold value. The resulting subsets are then split using the same logic. This
continues until pure sub-sets are reached or until the maximum number of leaves allowed in
the growing tree is hit.
The CART algorithm works via the following process:
• The best split point of each input variable is obtained.
• From these per-variable candidates, the overall "best" split point is identified.
• The chosen input is split at that "best" split point.
• Splitting continues until a stopping rule is satisfied or no further desirable split is
available.

The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does so by
searching for the most homogeneous sub-nodes, with the help of the Gini index criterion. A
minimal sketch of this search is shown below.
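
The sketch below is illustrative only and is not part of the lab program that follows; the helper names gini and best_split are hypothetical. Under these assumptions, it scans every feature and every candidate threshold and returns the split whose two sub-nodes have the lowest weighted Gini impurity, which is the per-input search described above.

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels: 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Scan every feature and threshold; return the split whose two
    # sub-nodes have the lowest weighted Gini impurity.
    n_samples, n_features = X.shape
    best = {"impurity": np.inf, "feature": None, "threshold": None}
    for feature in range(n_features):
        for threshold in np.unique(X[:, feature]):
            left = y[X[:, feature] <= threshold]
            right = y[X[:, feature] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue  # not a real split
            weighted = (len(left) * gini(left) +
                        len(right) * gini(right)) / n_samples
            if weighted < best["impurity"]:
                best = {"impurity": weighted,
                        "feature": feature, "threshold": threshold}
    return best

# Toy data (assumed values): feature 0 separates the classes, feature 1 is noise.
X_toy = np.array([[1.0, 7.0], [2.0, 3.0], [8.0, 5.0], [9.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])
print(best_split(X_toy, y_toy))  # -> feature 0, threshold 2.0, impurity 0.0

In a full CART implementation this search would be applied recursively to each resulting sub-node until a stopping rule is met.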

Gini index/Gini impurity


The Gini index is the metric CART uses for classification tasks. It is based on the sum of
squared class probabilities and measures how likely a randomly chosen element is to be
misclassified if it were labelled at random according to the class distribution; it is a
variation of the Gini coefficient. It works on categorical targets, produces outcomes of
either "success" or "failure", and hence performs binary splits only.

The degree of the Gini index varies from 0 to 1,

• A value of 0 indicates that all the elements belong to a single class, i.e. the node is
pure.
• A value of 1 signifies that the elements are randomly distributed across many classes,
and
• A value of 0.5 denotes that the elements are equally distributed between two classes.

Mathematically, we can write the Gini impurity as:

Gini = 1 − Σ (pi)²

where pi is the probability of an object being classified to a particular class.
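
As a quick worked check of this formula with assumed class counts (8 samples of one class and 2 of the other, purely illustrative numbers):

# Gini impurity for class probabilities p = (0.8, 0.2)
p = [8 / 10, 2 / 10]
print(1 - sum(pi ** 2 for pi in p))       # 1 - (0.64 + 0.04) = 0.32
# A pure node (all samples in one class) gives 0
print(1 - sum(pi ** 2 for pi in [1.0]))   # 0.0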



Program:
# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


# Function importing the dataset
def importdata():
    balance_data = pd.read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-' +
        'databases/balance-scale/balance-scale.data',
        sep=',', header=None)
    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)
    # Printing the first few dataset observations
    print(balance_data.head())
    return balance_data


# Function to split the dataset
def splitdataset(balance_data):
    # Separating the target variable
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]

    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)

    return X, Y, X_train, X_test, y_train, y_test


# Function to perform training with the Gini index
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
                                      random_state=100, max_depth=3,
                                      min_samples_leaf=5)
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini


# Function to perform training with entropy
def train_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100, max_depth=3,
        min_samples_leaf=5)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy


# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test set
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred


# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
    print("Accuracy : ", accuracy_score(y_test, y_pred) * 100)
    print("Report : ", classification_report(y_test, y_pred))


# Driver code
def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)

    # Operational Phase
    print("Results Using Gini Index:")
    # Prediction using Gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("Results Using Entropy:")
    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)


# Calling main function
if __name__ == "__main__":
    main()

Output:

PART B
(PART B: TO BE COMPLETED BY STUDENTS)

Roll No: BE-C36 Name: Prathmesh Krishna Gaikwad


Class: BE-Comps Batch: C2
Date of Experiment: 22/08/2023 Date of Submission: 22/08/2023
Grade:

B.1 Software Code written by student:


# Modeling using CART
import warnings
import joblib
# import pydotplus
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import (train_test_split, GridSearchCV,
                                     cross_validate, validation_curve)
# from skompiler import skompile

# Reading the dataset
df = pd.read_csv("diabetes.csv")
# The first 5 observation units of the data set were accessed.
df.head()

y = df["Outcome"]
X = df.drop(["Outcome"], axis=1)

# Model
cart_model = DecisionTreeClassifier(random_state=17).fit(X, y)

# y_pred for the confusion matrix:
y_pred = cart_model.predict(X)
# y_prob for AUC:
y_prob = cart_model.predict_proba(X)[:, 1]
# Confusion matrix / classification report
print(classification_report(y, y_pred))
# AUC
roc_auc_score(y, y_prob)

# Evaluation of success with the holdout method
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=85)
cart_model = DecisionTreeClassifier(random_state=17).fit(X_train, y_train)

# Train error
y_pred = cart_model.predict(X_train)
y_prob = cart_model.predict_proba(X_train)[:, 1]
print(classification_report(y_train, y_pred))
roc_auc_score(y_train, y_prob)

# Test error
y_pred = cart_model.predict(X_test)
y_prob = cart_model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
roc_auc_score(y_test, y_prob)

# Evaluation of success with cross-validation
cart_model = DecisionTreeClassifier(random_state=17).fit(X, y)
cv_results = cross_validate(cart_model,
                            X, y,
                            cv=10,
                            scoring=["accuracy", "f1", "roc_auc"])
cv_results['test_accuracy'].mean()
cv_results['test_f1'].mean()
cv_results['test_roc_auc'].mean()

# Hyperparameter optimization with GridSearchCV
cart_model.get_params()

# Hyperparameter set to search:
cart_params = {'max_depth': range(1, 11),
               "min_samples_split": range(2, 20)}

# GridSearchCV
cart_best_grid = GridSearchCV(cart_model,
                              cart_params,
                              cv=5,
                              n_jobs=-1,
                              verbose=True).fit(X, y)
# Best hyperparameter values:
cart_best_grid.best_params_
# Best score:
cart_best_grid.best_score_

random = X.sample(1, random_state=45)
print(random)
cart_best_grid.predict(random)

# 5. Final model
cart_final = DecisionTreeClassifier(**cart_best_grid.best_params_,
                                    random_state=17).fit(X, y)
cart_final.get_params()
# Another way to assign the best parameters to the model:
cart_final = cart_model.set_params(**cart_best_grid.best_params_).fit(X, y)

# CV error of the final model:
cv_results = cross_validate(cart_final,
                            X, y,
                            cv=10,
                            scoring=["accuracy", "f1", "roc_auc"])
cv_results['test_accuracy'].mean()
cv_results['test_f1'].mean()
cv_results['test_roc_auc'].mean()

# 6. Feature importance
def plot_importance(model, features, num=len(X), save=False):
    feature_imp = pd.DataFrame({'Value': model.feature_importances_,
                                'Feature': features.columns})
    plt.figure(figsize=(10, 10))
    sns.set(font_scale=1)
    sns.barplot(x="Value", y="Feature",
                data=feature_imp.sort_values(by="Value", ascending=False)[0:num])
    plt.title('Features')
    plt.tight_layout()
    plt.show()
    if save:
        plt.savefig('importances.png')

plot_importance(cart_final, X, 15)

# 7. Analyzing model complexity with learning curves
train_score, test_score = validation_curve(
    cart_final, X=X, y=y,
    param_name='max_depth',
    param_range=range(1, 11),
    scoring="roc_auc",
    cv=10)

mean_train_score = np.mean(train_score, axis=1)
mean_test_score = np.mean(test_score, axis=1)

plt.plot(range(1, 11), mean_train_score,
         label="Training Score", color='b')
plt.plot(range(1, 11), mean_test_score,
         label="Validation Score", color='g')
plt.title("Validation Curve for CART")
plt.xlabel("Number of max_depth")
plt.ylabel("AUC")
plt.tight_layout()
plt.legend(loc='best')
plt.show()

# 8. Extracting decision rules
tree_rules = export_text(cart_model, feature_names=list(X.columns))
print(tree_rules)

B.2 Input and Output:



B.3 Observations and learning:


The CART algorithm is widely used to build decision trees for both classification and
regression. Decision trees are widely used in data mining to create a model that predicts the
value of a target based on the values of many input (independent) variables.

B.4 Conclusion:
Hence, we successfully studied and implemented CART.

B.5 Questions of Curiosity (Handwritten, any 3)


1. Explain and write CART algorithm for drawing decision trees.
2. How does CART differ from the other decision tree algorithms?
3. What are the main advantages of using CART for classification and regression techniques?
4. What are the key steps involved in implementing CART in Python?
5. How does CART handle categorical and numerical features in the data?
6. What are the techniques or approaches for visualizing and interpreting CART trees?
