Decision tree classifier
In this tutorial, you will learn about a popular machine learning algorithm, Decision Trees. You will use this classification algorithm to build a
model from historical data of patients and their responses to different medications. Then you will use the trained decision tree to
predict the class of an unknown patient, or to find a proper drug for a new patient.
Let's download the dataset and import it using the pandas read_csv() method.
Download Dataset
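A minimal loading sketch follows; the local file name drug200.csv is an assumption about how the downloaded dataset is saved.

import pandas as pd

# File name is an assumption about where the downloaded CSV was saved
df = pd.read_csv("drug200.csv")
df.head()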
Part of your job is to build a model that finds out which drug might be appropriate for a future patient with the same illness. The features of
this dataset are the Age, Sex, Blood Pressure, and Cholesterol of the patients, and the target is the drug that each patient responded to.
This is an example of a multiclass classification problem: you can use the training part of the dataset to build a decision tree, and then use it
to predict the class of an unknown patient or to prescribe a drug to a new patient.
df.shape
(200, 6)
Data Preprocessing
df.columns
df.dtypes
Age int64
Sex object
BP object
Cholesterol object
Na_to_K float64
Drug object
dtype: object
As you can see, some features in this dataset are categorical, such as Sex or BP. Unfortunately, sklearn decision trees do not handle
categorical variables directly, so we convert these features to numerical values. Here we use sklearn's LabelEncoder, which maps each
category to an integer code (pandas.get_dummies(), which creates dummy/indicator variables, is an alternative).
from sklearn import preprocessing
# Feature matrix (the column selection is implied by the output below)
X = df[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values

# Encode Sex (F/M) as integer codes
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F','M'])
X[:,1] = le_sex.transform(X[:,1])

# Encode Blood Pressure (LOW/NORMAL/HIGH) as integer codes
le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X[:,2] = le_BP.transform(X[:,2])

# Encode Cholesterol (NORMAL/HIGH) as integer codes
le_Chol = preprocessing.LabelEncoder()
le_Chol.fit(['NORMAL', 'HIGH'])
X[:,3] = le_Chol.transform(X[:,3])
X[0:5]
array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.114],
       [28, 0, 2, 0, 7.798],
       [61, 0, 1, 0, 18.043]], dtype=object)
Next, we split the data into training and testing sets with train_test_split, as in the sketch below. X and y are the arrays required before
the split, test_size represents the ratio of the testing dataset, and random_state ensures that we obtain the same split every time.
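The split itself isn't shown above; a minimal sketch, assuming the target vector is the Drug column and a 30% test split (consistent with the 140/60 shapes below). The random_state value is an assumption; any fixed value reproduces the same split.

from sklearn.model_selection import train_test_split

# Target vector: the drug each patient responded to
y = df['Drug']

# 30% of the 200 rows form the test set (60 rows), the rest the training set (140 rows)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)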
X_train.shape
(140, 5)
X_test.shape
(60, 5)
y_train.shape
(140,)
y_test.shape
(60,)
Modeling
We will first create an instance of the DecisionTreeClassifier called drugTree.
Inside the classifier, specify criterion="entropy" so that the information gain (entropy reduction) is used to choose the split at each node.
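The instantiation cell itself isn't shown; a minimal sketch that matches the estimator output below:

from sklearn.tree import DecisionTreeClassifier

# Create the classifier; max_depth=4 limits the tree to four levels of splits
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
drugTree  # displaying the estimator shows its non-default parameters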
DecisionTreeClassifier(criterion='entropy', max_depth=4)
Next, we will fit the model with the training feature matrix X_train and the training response vector y_train.
drugTree.fit(X_train,y_train)
DecisionTreeClassifier(criterion='entropy', max_depth=4)
Prediction
Let's make some predictions on the testing dataset and store them in a variable called y_pred.
y_pred = drugTree.predict(X_test)
Evaluation
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_test, y_pred))
accuracy_score computes the accuracy classification score: the fraction of samples whose predicted label exactly matches the
corresponding label in y_true. In multilabel classification, the function returns the subset accuracy instead: a sample counts as correct
only if its entire set of predicted labels strictly matches the true set of labels, so its score is 1.0 in that case and 0.0 otherwise.
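A minimal illustration of the metric on hand-written labels (not taken from this dataset):

from sklearn import metrics

# 3 of the 4 predicted labels match the true labels, so the accuracy is 0.75
y_true_demo = ['drugY', 'drugX', 'drugC', 'drugY']
y_pred_demo = ['drugY', 'drugX', 'drugY', 'drugY']
print(metrics.accuracy_score(y_true_demo, y_pred_demo))  # 0.75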
Visualization
Let's visualize the trained tree.
from sklearn import tree
import matplotlib.pyplot as plt

tree.plot_tree(drugTree)
plt.show()
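For a more readable plot, plot_tree can also label the nodes; a sketch, assuming the feature order used when building X:

plt.figure(figsize=(12, 8))
tree.plot_tree(drugTree,
               feature_names=['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K'],
               class_names=drugTree.classes_,
               filled=True)
plt.show()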
Thank you
Author
Moazzam Ali