CSET301 LabW8L2

Decision Tree Visualization

Decision Tree
A decision tree is one of the most powerful and popular tools for classification and prediction. It is a flowchart-like tree structure in which each internal node tests a feature against a threshold (chosen to best separate the classes), each branch represents an outcome of that test, and each leaf node holds a class label, as sketched below.
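As a minimal sketch of this structure, the top of the Iris tree trained later in this notebook reduces to nested threshold tests (thresholds taken from the exported tree below; the two lower branches are simplified to their majority classes):

# Hand-written fragment mirroring the top of the fitted Iris tree.
# Feature order: [sepal length, sepal width, petal length, petal width].
def classify_iris(sample):
    if sample[2] <= 2.45:        # internal node: threshold test on petal length
        return 0                 # leaf: setosa
    elif sample[3] <= 1.75:      # branch: petal width test
        return 1                 # simplified leaf: mostly versicolor
    else:
        return 2                 # simplified leaf: mostly virginica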

In [1]:
from matplotlib import pyplot as plt # For plotting
from sklearn import datasets # For loading standard datasets
from sklearn.tree import DecisionTreeClassifier # To run decision tree model
from sklearn import tree # To visualize decision trees

Iris Dataset Description:


Classes: 3
Samples per class: 50
Samples total: 150
Dimensionality: 4
Source: https://archive.ics.uci.edu/ml/datasets/iris

Quick Tip: sklearn.datasets ships several toy datasets; the package also has helpers to fetch larger datasets commonly used by the machine learning community, as sketched below.
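For instance (a sketch only, not run in this lab; it downloads data on first call), fetch_openml pulls a named dataset from openml.org:

# Fetch a larger community dataset by name (requires internet access)
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
print(mnist.data.shape)  # expected: (70000, 784)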

In [2]:
# Prepare the data
iris = datasets.load_iris()
X = iris.data
y = iris.target

In [3]:
# Initialize the model
clf = tree.DecisionTreeClassifier()
# Fit the model
clf.fit(iris.data, iris.target)

Out[3]: DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                               max_depth=None, max_features=None, max_leaf_nodes=None,
                               min_impurity_decrease=0.0, min_impurity_split=None,
                               min_samples_leaf=1, min_samples_split=2,
                               min_weight_fraction_leaf=0.0, presort='deprecated',
                               random_state=None, splitter='best')

Task
Train your own decision tree and experiment with the following hyper-parameters, then state your observations for at least 15 different hyper-parameter settings. The following are only some of the parameters:

Must read: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html


max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
min_samples_split: The minimum number of samples required to split an internal node.
min_samples_leaf: The minimum number of samples required to be at a leaf node. This may have the effect of smoothing the model, especially in regression.
random_state: Controls the randomness of the estimator.
Write a function to calculate the accuracy (a minimal sketch follows this task description).

Print the accuracy for each hyper-parameter setting used, in the following format:
1. PARAMS[random_state=1, max_depth=....] , Accuracy=0.97
2. PARAMS[random_state=42, min_samples_split=....] , Accuracy=0.94
...
Perform the same set of activities on a different dataset: https://gist.github.com/kudaliar032/b8cf65d84b73903257ed603f6c1a2508
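A minimal sketch of the requested accuracy function, assuming y_true and y_pred are equal-length array-likes of class labels:

import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))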

In [4]:
# Initialise and then fit the classifier
clf = tree.DecisionTreeClassifier()
clf.fit(X, y)

Out[4]: DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                               max_depth=None, max_features=None, max_leaf_nodes=None,
                               min_impurity_decrease=0.0, min_impurity_split=None,
                               min_samples_leaf=1, min_samples_split=2,
                               min_weight_fraction_leaf=0.0, presort='deprecated',
                               random_state=None, splitter='best')

In [5]:
# Gives a text representation of the trained decision tree
text_representation = tree.export_text(clf)
print(text_representation)

|--- feature_2 <= 2.45
|   |--- class: 0
|--- feature_2 >  2.45
|   |--- feature_3 <= 1.75
|   |   |--- feature_2 <= 4.95
|   |   |   |--- feature_3 <= 1.65
|   |   |   |   |--- class: 1
|   |   |   |--- feature_3 >  1.65
|   |   |   |   |--- class: 2
|   |   |--- feature_2 >  4.95
|   |   |   |--- feature_3 <= 1.55
|   |   |   |   |--- class: 2
|   |   |   |--- feature_3 >  1.55
|   |   |   |   |--- feature_0 <= 6.95
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- feature_0 >  6.95
|   |   |   |   |   |--- class: 2
|   |--- feature_3 >  1.75
|   |   |--- feature_2 <= 4.85
|   |   |   |--- feature_1 <= 3.10
|   |   |   |   |--- class: 2
|   |   |   |--- feature_1 >  3.10
|   |   |   |   |--- class: 1
|   |   |--- feature_2 >  4.85
|   |   |   |--- class: 2

In [6]:
# Save the above text representation to a file
with open("decision_tree.log", "w") as fout:
    fout.write(text_representation)
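For more readable output, export_text also accepts the real feature names; a small variation on the cell above:

# Same text export, labelled with Iris feature names instead of feature_0..feature_3
print(tree.export_text(clf, feature_names=iris.feature_names))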

How to Visualize Decision Trees using Matplotlib


Scikit-learn version >=0.21.0 allows Decision Trees to be plotted with matplotlib using 'sklearn.tree.plot_tree'

In [7]:
# Visualize the results using sklearn's plot_tree
# See the documentation for modifying fonts: https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html
fig = plt.figure(figsize=(25, 20))
_ = tree.plot_tree(clf,
                   feature_names=iris.feature_names,
                   class_names=iris.target_names,
                   filled=True)

In the figure above, the color of each node represents its majority class.

In [8]:
# TODO: Write accuracy function here
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split

# Hold out 25% of the data for testing
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(X, y, test_size=0.25, random_state=6)
y_pred = clf.predict(X_s_test)

print("Accuracy:", metrics.accuracy_score(y_s_test, y_pred))

Accuracy: 1.0
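Note that clf was fitted on all 150 samples in In [4], so scoring it on a subset of that same data is optimistic (hence the perfect accuracy). A leakage-free sketch fits a fresh tree on the training split only:

# Fit on the training split, then score on the held-out split
clf_holdout = DecisionTreeClassifier(random_state=6)
clf_holdout.fit(X_s_train, y_s_train)
print("Held-out accuracy:", metrics.accuracy_score(y_s_test, clf_holdout.predict(X_s_test)))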

In [9]:
# TODO: Print 15 hyperparam settings along with accuracy
import sklearn.metrics as metrics

# One fixed split, reused for every setting (random_state=6 keeps it reproducible)
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(X, y, test_size=0.25, random_state=6)

for i in range(5, 13):
    clf1 = DecisionTreeClassifier(criterion="gini", splitter="random", max_leaf_nodes=10, min_samples_leaf=5, max_depth=i)
    clf1.fit(X, y)
    y_pred = clf1.predict(X_s_test)
    print('PARAMS[criterion="gini", splitter="random", max_leaf_nodes=10, min_samples_leaf=5, max_depth=' + str(i) + ']',
          'Accuracy:', metrics.accuracy_score(y_s_test, y_pred))

for i in range(35, 43):
    clf1 = DecisionTreeClassifier(criterion="entropy", splitter="random", min_samples_split=4, max_leaf_nodes=5, min_samples_leaf=5, max_depth=i)
    clf1.fit(X, y)
    y_pred = clf1.predict(X_s_test)
    print('PARAMS[criterion="entropy", splitter="random", min_samples_split=4, max_leaf_nodes=5, min_samples_leaf=5, max_depth=' + str(i) + ']',
          'Accuracy:', metrics.accuracy_score(y_s_test, y_pred))

PARAMS[criterion="gini", splitter="random", max_leaf_nodes=10, min_samples_leaf=5, max_depth=5] Accuracy: 0.9473684210526315
PARAMS[criterion="gini", splitter="random", max_leaf_nodes=10, min_samples_leaf=5, max_depth=6] Accuracy: 1.0
PARAMS[criterion="gini", splitter="random", max_leaf_nodes=10, min_samples_leaf=5, max_depth=7] Accuracy: 0.9736842105263158
PARAMS[criterion="gini", splitter="random", max_leaf_nodes=10, min_samples_leaf=5, max_depth=8] Accuracy: 0.9210526315789473
PARAMS[criterion="gini", splitter="random", max_leaf_nodes=10, min_samples_leaf=5, max_depth=9] Accuracy: 0.9473684210526315
PARAMS[criterion="gini", splitter="random", max_leaf_nodes=10, min_samples_leaf=5, max_depth=10] Accuracy: 0.9473684210526315
PARAMS[criterion="gini", splitter="random", max_leaf_nodes=10, min_samples_leaf=5, max_depth=11] Accuracy: 1.0
PARAMS[criterion="gini", splitter="random", max_leaf_nodes=10, min_samples_leaf=5, max_depth=12] Accuracy: 0.9210526315789473
PARAMS[criterion="entropy", splitter="random", min_samples_split=4, max_leaf_nodes=5, min_samples_leaf=5, max_depth=35] Accuracy: 0.9210526315789473
PARAMS[criterion="entropy", splitter="random", min_samples_split=4, max_leaf_nodes=5, min_samples_leaf=5, max_depth=36] Accuracy: 0.8947368421052632
PARAMS[criterion="entropy", splitter="random", min_samples_split=4, max_leaf_nodes=5, min_samples_leaf=5, max_depth=37] Accuracy: 0.9210526315789473
PARAMS[criterion="entropy", splitter="random", min_samples_split=4, max_leaf_nodes=5, min_samples_leaf=5, max_depth=38] Accuracy: 0.868421052631579
PARAMS[criterion="entropy", splitter="random", min_samples_split=4, max_leaf_nodes=5, min_samples_leaf=5, max_depth=39] Accuracy: 0.9736842105263158
PARAMS[criterion="entropy", splitter="random", min_samples_split=4, max_leaf_nodes=5, min_samples_leaf=5, max_depth=40] Accuracy: 0.8157894736842105
PARAMS[criterion="entropy", splitter="random", min_samples_split=4, max_leaf_nodes=5, min_samples_leaf=5, max_depth=41] Accuracy: 0.9210526315789473
PARAMS[criterion="entropy", splitter="random", min_samples_split=4, max_leaf_nodes=5, min_samples_leaf=5, max_depth=42] Accuracy: 0.9210526315789473
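The sweep above can also be written by iterating over parameter dictionaries, which makes the requested numbered PARAMS[...] format easy to print; a sketch using the same train/test split:

# Sketch: list the settings once, then fit and report each in turn
settings = [
    {"random_state": 1, "max_depth": 3},
    {"random_state": 42, "min_samples_split": 4},
    {"criterion": "entropy", "max_leaf_nodes": 5},
    # ... extend to at least 15 settings
]
for idx, params in enumerate(settings, start=1):
    model = DecisionTreeClassifier(**params)
    model.fit(X_s_train, y_s_train)
    acc = metrics.accuracy_score(y_s_test, model.predict(X_s_test))
    print(f"{idx}. PARAMS[{params}], Accuracy={acc:.2f}")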

In [10]:
# Save the figure
fig.savefig("decision_tree.png")

How to visualize decision trees using graphviz


If you get a runtime error with graphviz, refer to

https://stackoverflow.com/questions/35064304/runtimeerror-make-sure-the-graphviz-executables-are-on-your-systems-path-aft

Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks.

In [11]:
import graphviz

# DOT data - graphviz accepts data in DOT format, so we convert our tree into a compatible format
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True)

# Draw graph
graph = graphviz.Source(dot_data, format="png")
graph

Out[11]: [Rendered graphviz tree. Root: petal length (cm) <= 2.45, gini = 0.667, samples = 150, value = [50, 50, 50]. The True branch is a pure setosa leaf (50 samples); the False branch splits on petal width (cm) <= 1.75, then on petal length (cm) <= 4.95 / 4.85, petal width (cm) <= 1.65 / 1.55, sepal width (cm) <= 3.1, and sepal length (cm) <= 6.95, ending in pure leaves labelled versicolor and virginica.]

In [12]:
graph.render("decision_tree_graphviz")

Out[12]: 'decision_tree_graphviz.png'

Resources
https://mljar.com/blog/visualize-decision-tree/ (source code)
https://towardsdatascience.com/visualizing-decision-trees-with-python-scikit-learn-graphviz-matplotlib-1c50b4aa68dc
https://explained.ai/decision-tree-viz/
https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html

In [ ]:
