Decision Trees and Random Forests
By the end of this module, you will learn how to implement these two extremely powerful
Machine Learning models using Python and scikit-learn. For every classification model built
with scikit-learn, you should follow four main steps:
1. Splitting the available data into a training set and a test set.
2. Building (fitting) the classification model on the training data.
3. Making predictions for the unseen test data.
4. Reporting metrics and evaluating the performance and generalisation ability of the
constructed model.
Validation techniques will be applied throughout these steps to avoid cases of overfitting (or
underfitting). Finally, you will learn how to optimise the hyperparameters of a model as a way of
boosting its overall performance.
Decision Trees
Decision Trees are one of the most widely used and popular classification techniques in Machine
Learning due to their:
Simplicity: not many parameters need to be tuned and there’s no need to normalise the data
before using them.
Scalability: the classification process requires fewer operations than other classification models
(such as KNN).
Interpretability: a decision tree is easy to visualise and interpret, and can provide valuable
insights about the data.
Efficiency: decision trees can handle both numerical and categorical data, as well as
missing values.
Decision Tree classifiers construct classification models in the form of a tree structure. A
decision tree progressively splits the training set into smaller subsets. Each node of the tree
represents a subset of the data, and three types of nodes can be distinguished:
A root node that has no incoming edges and zero or more outgoing edges.
Internal nodes, each of which has exactly one incoming edge, and two or more outgoing
edges.
Finally, leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing
edges. In a decision tree, each leaf node is assigned a class label.
Root and internal nodes contain feature test conditions that separate samples with different
characteristics, while leaf nodes are linked to the final decision of the model. Once a new sample
is presented to the model, the test conditions are applied until a leaf node is reached, and the
class linked to that leaf node is reported as the result.
Leaf nodes need to be as pure as possible, which means that the samples at a leaf
node should mostly belong to the same class.
Nodes at the top, by contrast, are expected to be impure: the root node contains
samples of both classes, so each class still has a similar chance of being selected there.
Each subsequent split is chosen so that the resulting subsets become progressively
purer (the impurity of the tree is minimised).
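To make the notion of impurity concrete, the snippet below (our own illustration, not part of the module's code) computes the Gini index, one of the standard impurity measures used by decision-tree learners:

PYTHON
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

print(gini_impurity([1, 1, 1, 1]))  # 0.0: a pure node (all samples in one class)
print(gini_impurity([0, 0, 1, 1]))  # 0.5: a maximally impure node (balanced classes)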
Before we start with the actual model building process, we need to ensure the generalisation
ability of our classifier (remember that generalisation is the capacity of a model to perform well
on data that has not been used for the training phase).
Training and testing a classification model on the same dataset is a methodological mistake: a
model that would just repeat the labels of the samples that it has just seen would overestimate
the score and would fail to predict anything useful on yet-unseen data, leading to poor
generalisation performance.
To use different datasets for training and testing, we need to split the online retail dataset into
two mutually exclusive datasets, the training set and the test set; this validation approach is
referred to as the holdout method and is depicted as follows:
Figure 2. Holdout approach (random split into two disjoint datasets, the train and test set).
PYTHON
# Split into training and test sets (X and y are assumed to hold the features and
# labels of the online retail dataset prepared in the previous modules)
from sklearn.model_selection import train_test_split
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)  (1)
1. The random_state argument specifies a value for the seed of the random generator. By
setting this seed to a particular value, each time the code is run, the split between train and
test datasets will be exactly the same. If this value is not specified, a different split will be
output each time since the random generator driving the split will be seeded by a pseudo-
random number.
The output of train_test_split consists of four arrays. XTrain and yTrain are the two arrays
you use to train your model. XTest and yTest are the two arrays that you use to evaluate your
model. By default, scikit-learn splits the data so that 25% of it is used for testing, but you can also
specify the proportion of data you want to use for training and testing.
As previously, you can check the sizes of the different training and test sets by using the shape
attribute:
PYTHON
# Print the dimensionality of the individual splits
print("XTrain dimensions:", XTrain.shape)
print("yTrain dimensions:", yTrain.shape)
print("XTest dimensions:", XTest.shape)
print("yTest dimensions:", yTest.shape)
PYTHON
> XTrain dimensions: (1498, 10)
> yTrain dimensions: (1498, )
> XTest dimensions: (500, 10)
> yTest dimensions: (500, )
You can also investigate how the class labels are distributed within the yTest vector by using the
itemfreq function from module 2:
PYTHON
# Calculate the frequency of classes in yTest
import scipy.stats

yFreq = scipy.stats.itemfreq(yTest)
print(yFreq)
> [[ 0 59]
[ 1 441]] (1)
1. In this case, we can see that yTest includes 59 random samples of class 0 (non-returning
customers) and 441 random samples of class 1 (returning customers).
Try building a simple decision tree with 3 layers (See Decision Tree documentation
(http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for the
arguments that can be passed to the classifier).
PYTHON
# Building the classification model using a pre-defined parameter
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(max_depth=3)  (1)
dtc.fit(XTrain, yTrain)
predDT = dtc.predict(XTest)
1. By setting the parameter max_depth to 3, we stipulate that the decision tree will have no
more than 3 links from the root node to the leaves.
To evaluate the performance of the classifier on the test data, we report four metrics:
precision,
recall,
F1-score,
overall accuracy.
As opposed to overall classification accuracy, the first three metrics are class-specific: they may
differ for the two classes, whereas the overall accuracy may remain the same. To understand
these metrics, it is useful to create a confusion matrix, which records all the true positive, true
negative, false positive and false negative values.
We can compute the confusion matrix for our classifier using the confusion_matrix function
in the metrics module.
PYTHON
# Get the confusion matrix for your classifier using metrics.confusion_matrix
from sklearn import metrics
mat = metrics.confusion_matrix(yTest, predDT)  (1)
print(mat)

> [[ 32 27]
[ 19 422]] (2)
1. Compute the confusion matrix for our predictions. Remember that the test data contain
observations that are not in the training data.
2. The first value in the first row (32) is the number of True Positives (TP); the second value in
the first row (27) is the number of False Negatives (FN); the first value in the second row (19)
is the number of False Positives (FP), and the second value in the second row (422) is the
number of True Negatives (TN).
Validation Metrics
Accuracy: Accuracy is the overall "correctness" of the model and is calculated as the
number of correctly classified observations divided by the total number of
observations. Accuracy is defined by:
Accuracy = (tp + tn)/total
where tp and tn are the numbers of true positive and true negative predictions and
total is the total number of instances.
Precision (for a class): Precision is a measure of the accuracy for a specific class; it
reports the proportion of correct classifications for a specific class. It is defined by:
Precision = tp/(tp + fp)
where tp and fp are the numbers of true positive and false positive predictions for the
considered class, e.g. the ability to correctly classify a customer as being returning or non-
returning. tp + fp is the total number of elements labelled by the classifier as belonging
to the considered class.
Recall, aka Sensitivity or True Positive Rate (for a class): Recall reports the ability of a
model to select instances of a certain class from a dataset, e.g. a classifier that has high
sensitivity with regard to the non-returning class will do well at correctly classifying
customers as being non-returning (although this may make it more likely to incorrectly
include more returning customers in this class). It is defined by:
Recall = tp/(tp + fn)
where tp and fn are the numbers of true positive and false negative predictions for the
considered class. tp + fn is the total number of elements that actually belong to the
considered class.
Specificity, aka True Negative Rate (for a class): Specificity reports the ability of the
model to correctly exclude class non-members in a dataset from the class, e.g. a
classifier that has high specificity with regard to the non-returning class will do well at
correctly excluding returning customers from the class (although this may make it more
likely to miss non-returning customers). It is defined by:
Specificity = tn/(tn + fp)
where tn and fp are the numbers of true negative and false positive predictions for the
considered class. tn + fp is the total number of elements that should not be included in
the class.
F1-score (for a class): This measures the accuracy of the model with respect to a
particular class, and is the harmonic mean of precision and recall. It is defined by:
F1 = 2 × (Precision × Recall)/(Precision + Recall)
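To make these definitions concrete, the sketch below (our own illustration) recomputes all five metrics directly from the confusion matrix counts reported above, treating class 0 (the non-returning customers) as the positive class:

PYTHON
# Counts taken from the confusion matrix above (class 0 as the positive class)
tp, fn, fp, tn = 32, 27, 19, 422

accuracy    = (tp + tn) / (tp + tn + fp + fn)                # 0.908
precision   = tp / (tp + fp)                                 # ~0.63
recall      = tp / (tp + fn)                                 # ~0.54
specificity = tn / (tn + fp)                                 # ~0.96
f1          = 2 * precision * recall / (precision + recall)  # ~0.58
print(accuracy, precision, recall, specificity, f1)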
Because performance metrics are such an important step of model evaluation, scikit-learn offers
a wrapper around these functions, metrics.classification_report , to facilitate their
computation. It also offers the function metrics.accuracy_score to compute the overall
accuracy.
PYTHON
# Report the metrics using metrics.classification_report
print(metrics.classification_report(yTest, predDT))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, predDT), 2))

PYTHON
             precision    recall  f1-score   support

          0       0.63      0.54      0.58        59
          1       0.94      0.96      0.95       441

avg / total       0.90      0.91      0.90       500

Overall Accuracy: 0.91
You can visualise the classification boundary created by the decision tree using the
visplots.dtDecisionPlot function. You can check the arguments passed to this function by
using the help command:

PYTHON
#### Check the arguments of the function
help(visplots.dtDecisionPlot)
Boundary interpretation
The decision region for a decision tree is rectilinear ("stair-like" or "box-like" surfaces) with
segments parallel to the input axes since each test condition involves only a single
attribute. In this case, the boundary defines a decision region for each class.
The more splits are performed (i.e. the deeper the tree), the more detailed the model becomes.
Despite the power and simplicity of this algorithm, decision-tree learners have three main
drawbacks:
They can be very unstable as they are sensitive to small changes in the training data: a small
change can lead to a totally different tree (high variance).
They can easily overfit. Decision-tree learners can create over-complex trees that do not
generalise the data well.
They are trained greedily: each split is chosen to be locally optimal, which does not
guarantee a globally optimal tree.
Bias-variance trade-off
One of the fundamental concepts of Machine Learning is that of overfitting and generalisation.
When a trained model performs extremely well on the training dataset but fails to predict new,
unseen data, the model is suffering from the effect of high variance. This situation is called
overfitting, and it leads to poor generalisation performance.
Similarly, a model might be too "simple", unable to capture the true relationship in the data that
we observe. These models would be said to have high bias; they do not fit the data very well,
which leads to a high generalisation error on new test data. This situation is called underfitting.
There is a fundamental trade-off between model complexity and the possibility of high bias or
high variance for all Machine Learning algorithms. Such an example is presented in Figure 3-4.
We can see from this picture that initially both the training and test error are quite high (hence
the accuracy will be low) as the model is too simplistic and thus unable to learn and predict the
data accurately (case of under-fitting). However, as the boundaries become more and more
complex, capturing noise in the data, the test error tends to increase while the training error
steadily decreases; this is a case of overfitting.
The main task during the optimisation of any Machine Learning model is to find the optimal
"sweet spot" where the model simultaneously minimises both the bias and the variance. One
very powerful tool at our disposal is validation, a common approach used to help us detect
and avoid cases of over- and underfitting.
Figure Interpretation
The green lines across the panel represent different models (Model 1-3) that have been
trained on a dataset (top row). The prediction errors are recorded below each panel. From
left to right, the models have increasing complexity. At first glance Model 3 seems perfect
with a prediction error of 0. However when submitting new data (test data - second row) to
this model it soon becomes apparent that this model does not perform well anymore. This
is referred to as overfitting. On the other end of the spectrum, Model 1 is too simple and
does not capture the relationship underlying the data; it performs poorly on both the
training and the test data (underfitting). Model 2 strikes the right balance: it has a low
prediction error and generalises well.
Ensemble models
Ensemble learning (or modelling) involves the combination of several accurate and diverse
models to solve a single prediction problem. It works by generating multiple models, which
learn and make predictions independently. Those predictions are then combined into a single
(mega) prediction that should be as good as or better than the prediction made by any one
classifier. An ensemble is itself a supervised learning model, as it can be trained and then used
to make predictions. Two families of ensemble methods are usually distinguished:
Averaging methods: several estimators are built independently and then their
predictions are averaged or combined by a voting scheme. Averaging methods attempt to
reduce the variance of the single base estimators. Examples include Bagging methods and
Forests of randomised trees, among others.
Boosting methods: in this ensemble model, base estimators are built sequentially, with the
motivation of combining several weak models to produce a powerful ensemble. Boosting
methods attempt to reduce the bias of the combined estimator. Examples include AdaBoost
and Gradient Tree Boosting, among others.
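As a quick illustration of the two families (our own sketch, not one of the module's listings), the snippet below builds one ensemble of each kind on the training data from above, using scikit-learn's BaggingClassifier and AdaBoostClassifier:

PYTHON
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Averaging: many trees trained independently on bootstrap samples of the data
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=1)
bagging.fit(XTrain, yTrain)

# Boosting: weak trees (stumps) trained sequentially, reweighting hard samples
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50,
                              random_state=1)
boosting.fit(XTrain, yTrain)

print("Bagging accuracy:", bagging.score(XTest, yTest))
print("Boosting accuracy:", boosting.score(XTest, yTest))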
Random Forests
The random forests model is an ensemble method since it aggregates a group of decision trees
into an ensemble (http://scikit-learn.org/stable/modules/ensemble.html). Unlike single decision trees,
which are likely to suffer from high variance or high bias (depending on how they are tuned),
Random Forests use averaging to find a natural balance between the two extremes.
A forest of uncorrelated trees is built using a CART-like procedure, combined with
randomised node optimisation and "bagging" (bootstrap aggregating). Bagging helps reduce
variance, improve unstable procedures and avoid overfitting. So, for each tree to learn, we:
1. Draw a random subsample of the training data with replacement (a bootstrap sample).
2. Randomly select a subset of the features.
3. Apply the learning procedure only to the subsample drawn and the features selected.
4. Once many models are generated, their predictions can be combined into a single (mega)
prediction, using majority vote or averaging, that should be better, on average, than the
prediction made by the single models.
The percentage of data used to grow each tree is arbitrary, but a widely used choice is 63% (the .632 rule).
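The 63% figure comes from the bootstrap itself: drawing n samples with replacement from n observations leaves each observation with a probability of 1 - (1 - 1/n)^n ≈ 0.632 of being selected at least once. A small sketch (our own) verifies this empirically:

PYTHON
import numpy as np

rng = np.random.RandomState(1)
n = 10000

# One bootstrap sample: n draws with replacement from n observations
bootstrap_idx = rng.choice(n, size=n, replace=True)
print(len(np.unique(bootstrap_idx)) / n)  # ~0.632, matching the .632 rule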
Boosting
Bagging is not the only option for creating an ensemble of trees; a very popular alternative is
Boosting. Boosting does not subsample the data: it uses the entire dataset multiple times,
giving increasing importance to the samples that are hard to classify. This strategy also
assigns a different importance to each tree, which is used to weight its vote during
classification.
PYTHON
# Build a Random Forest classifier with 150 decision trees
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=150, random_state=1)  (1)
rf.fit(XTrain, yTrain)
predRF = rf.predict(XTest)
print(metrics.classification_report(yTest, predRF))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, predRF), 2))
PYTHON
precision recall f1-score support
1. By using the random_state attribute, you ensure that your results remain the same every
time you re-run this script. Otherwise, if removed, you may notice that your results are
different to the ones presented here (and will be different every time you run this script).
This is due to the random nature of random forests, since every predictor is trained with a
bootstrap of the data, which is a random sampling with replacement. Also, in every tree
there is some randomness in how the subset of attributes for training is selected.
PYTHON
# Visualise average accuracy with an increasing number of trees
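The plotting code is not reproduced here; a minimal sketch of the underlying computation (our own, assuming the train/test split from above and an illustrative range of forest sizes) could look as follows, with the resulting accuracies then plotted as in Figure 6:

PYTHON
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

# Illustrative range of forest sizes
n_trees = [1, 5, 10, 25, 50, 100, 150, 200]
accuracies = []
for n in n_trees:
    model = RandomForestClassifier(n_estimators=n, random_state=1)
    model.fit(XTrain, yTrain)
    accuracies.append(metrics.accuracy_score(yTest, model.predict(XTest)))
print(list(zip(n_trees, accuracies)))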
Figure 6. Average testing accuracy with the increasing number of decision trees in the Random
Forest.
Feature importance
Random forests allow you to compute a heuristic for determining how “important” a feature is
in predicting a target. This heuristic measures the change in prediction accuracy when a split is
introduced in a given feature. The more the accuracy drops when the feature is split, the
more “important” we deem the feature to be.
You can use the feature_importances_ attribute of the RF classifier to get the relative
importance of each feature, which you can then visualise using a simple bar plot.
PYTHON
# Display the importance of the features in a barplot
import numpy as np
from plotly.offline import iplot
from plotly.graph_objs import Bar, Layout, Margin, Figure

# names holds the feature names of the dataset (defined in a previous module)
importance_sorted_idx = np.argsort(rf.feature_importances_)  (1)

data = [
    Bar(
        x=rf.feature_importances_[importance_sorted_idx],
        y=names[importance_sorted_idx],
        orientation='h',
    )
]
layout = Layout(
    xaxis=dict(title="Importance of features"),
    yaxis=dict(title="Features"),
    width=800,
    margin=Margin(
        l=250,
        r=50,
        b=100,
        t=50,
        pad=4
    )
)
fig = Figure(data=data, layout=layout)
iplot(fig)
1. argsort returns the indices that would sort the features by their importance.
Boundary visualisation
You can visualise the classification boundary created by the Random Forest using the
visplots.rfDecisionPlot function. You can check the arguments passed in this function by
using the help command. In addition to the mandatory arguments, the function
visplots.rfDecisionPlot takes as optional arguments the ones from the
RandomForestClassifier function, so you can have a look at the documentation
(http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
PYTHON
#### Check the arguments of the function
help(visplots.rfDecisionPlot)
Boundary interpretation
A Random Forest boundary combines (through majority voting) the decision boundaries
of its individual trees (remember that the decision region for a decision tree is rectilinear
with "stair-like" or "box-like" surfaces). Therefore, the decision region of a Random Forest
also has a rectilinear boundary composed of axis-parallel segments.
Selecting the optimal values of a model's hyperparameters - a process commonly referred to as
"tuning" - for a given classification problem is no trivial task, and can greatly affect the model's
performance.
A common approach towards optimising the hyperparameters is to apply a "three-way split",
where one subset (usually 30% of the data using a holdout approach) of the original data is left
aside during the whole training process as a test set to evaluate the generalisation ability of the
model, whereas the remaining data are further split into training (used to fit the model) and
validation (used to tune the model parameters) dataset(s) using holdout or cross-validation (or
other validation techniques).
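As a sketch (our own, with illustrative proportions and variable names), such a three-way split can be obtained by applying train_test_split twice:

PYTHON
from sklearn.model_selection import train_test_split

# Hold out 30% of the data as a test set for the final evaluation
XTrainVal, XTestHold, yTrainVal, yTestHold = train_test_split(
    X, y, test_size=0.3, random_state=1)
# Further split the remaining 70% into training and validation sets
XTrainSub, XVal, yTrainSub, yVal = train_test_split(
    XTrainVal, yTrainVal, test_size=0.25, random_state=1)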
Figure 9. Summary of the process of optimal parameter selection. For each parameter
combination, the data are split into training and test sets a specified number of times. The
accuracy is then calculated from the mean of the models trained and evaluated on these splits,
giving a cross-validation accuracy score. The best parameter combination is the one with the
highest cross-validation accuracy score. The figure has been extracted from this tutorial
(http://online.cambridgecoding.com/notebooks/cca_admin/misleading-modelling-overfitting-crossvalidation-and-the-biasvariance-tradeoff).
K-fold cross-validation
K-fold cross-validation allows you to estimate the performance of your model on completely
unseen (real) test data using only the dataset you are given. You do this by repeatedly splitting
your dataset into training and (pseudo-)"test" sets and evaluating the accuracy each time, i.e.
each time, one portion of your dataset will "pretend" to be coming from outside the dataset. By
repeating the validation process for different splits of the data, you can have more confidence
that your estimate of the model's accuracy will generalise and that it will be "robust".
The K in K-fold validation refers to the number of times you will be splitting the data. For
example, if K=3:
Figure 10. K-fold validation where K=3. Accuracy is calculated for the model trained from each
split so that in the case of K=3, a1 is the accuracy of the model trained on the first split, a2 is the
accuracy of the model trained on the second split, and a3 is the accuracy of the model trained on
the third split.
The accuracy is then calculated by taking the average of the accuracies of the models trained on
each split, i.e. the mean of a1, a2 and a3.
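In scikit-learn this procedure is available off the shelf via cross_val_score; the short sketch below (our own) estimates the accuracy of a Random Forest with 3-fold cross-validation on the training data:

PYTHON
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Three folds yield three accuracy estimates: a1, a2, a3
scores = cross_val_score(RandomForestClassifier(n_estimators=150, random_state=1),
                         XTrain, yTrain, cv=3)
print(scores)           # [a1, a2, a3]
print(np.mean(scores))  # the cross-validated accuracy estimate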
Grid search
The traditional way of performing hyperparameter optimisation is to apply grid search,
which is simply an exhaustive search through a manually specified subset of the
hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some
performance metric, typically measured by cross-validation on the training set or evaluation on
a held-out validation set.
Figure 11. In the context of hyperparameter optimisation, you perform k-fold cross-validation
together with grid search to get a more robust estimate of the model performance associated
with specific hyperparameter values. The figure has been extracted from
https://blog.cambridgecoding.com/2016/04/03/scanning-hyperspace-how-to-tune-machine-learning-models/
PYTHON
# View the list of arguments to be optimised
help(RandomForestClassifier())
n_estimators: The number of trees in the forest. A larger number of trees is preferable as it
will decrease the variance in predictions, but it is also more computationally expensive. In
addition, results will stop getting significantly better beyond a critical number of trees.
max_features: The size of the random subsets of features to consider when splitting a node.
The smaller the subset of features is, the greater the reduction of variance, but also the
greater the increase in bias. Empirically, it has been found that good default values are
max_features=sqrt(n_features) (default case) for classification tasks (where
n_features is the number of features in the data).
max_depth: The maximum number of links between the root of the tree and the leaves. The
smaller it is, the simpler the decision boundary will be.
Good results are often achieved when setting max_depth=None in combination with
min_samples_split=1 . Bear in mind though that using these values might result in models that
consume a lot of memory. In addition, note that with scikit-learn, the RandomForestClassifier
uses bootstrap samples by default ( bootstrap=True ). (This is not the case for the
ExtraTreesClassifier, where bootstrap=False by default.)
Remember, the best parameter values should always be cross-validated. The grid search
evaluates the RF classifier for every possible combination of parameters specified by the
dictionary parameters using 5-fold cross-validation. Precision is used as the measure to
evaluate the model and find the best parameter set:
PYTHON
# Conduct a grid search with 5-fold cross-validation using the dictionary of parameters
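The full listing is not reproduced here; the sketch below is our own reconstruction, with an illustrative parameter grid (the exact values tried in the module may differ):

PYTHON
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grids for three RF hyperparameters
n_estimators = [10, 25, 50, 100, 150]
max_depth = [2, 4, 6, 8, 10]
parameters = {'n_estimators': n_estimators,
              'max_depth': max_depth,
              'max_features': ['sqrt', 'log2']}

grid = GridSearchCV(RandomForestClassifier(),  # no random_state, see note 2 below
                    parameters,
                    cv=5,                      # 5-fold cross-validation
                    scoring='precision',
                    n_jobs=-1)                 # parallelise the search, see note 1 below
grid.fit(XTrain, yTrain)

print("Best parameters:", grid.best_params_)
best_n_estim      = grid.best_params_['n_estimators']
best_max_depth    = grid.best_params_['max_depth']
best_max_features = grid.best_params_['max_features']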
1. The n_jobs argument can be used for parallelisation to speed up the tuning process.
2. You may notice that your results are different from the ones presented here (affecting also
the overall accuracy and relevant metrics) since we haven’t used the random_state
argument within the RandomForestClassifier to ensure the results remain the same
every time we run this script.
You may also choose to include in your dictionary of parameters and grid search any of the
following options:
PYTHON
# Additional parameters to investigate could include
# max_features = [1, 3, 10] or max_features: ['auto', 'sqrt', 'log2']
# min_samples_split = [1, 3, 10]
# min_samples_leaf = [1, 3, 10]
# bootstrap = [True, False]
# criterion = ["gini", "entropy"]
PYTHON
# Create a heatmap to visualise the results of the grid search with cross-validation
from plotly.offline import iplot
from plotly.graph_objs import Heatmap, Layout, Figure

# scores: matrix of mean cross-validated accuracies with one row per n_estimators
# value and one column per max_depth value (computed from the grid search results)
data = [
    Heatmap(
        x=n_estimators,
        y=max_depth,
        z=scores.T,
        colorscale='Blues',
        reversescale=True,
        colorbar=dict(
            title="Classification Accuracy",
            nticks=10
        )
    )
]
layout = Layout(
    xaxis = dict(title="Number of estimators", tickvals=n_estimators),
    yaxis = dict(title="Max Depth", tickvals=max_depth),
    height = 700,
)
fig = Figure(data=data, layout=layout)
iplot(fig)
Figure 12. Grid search scores from the tuning of the RF hyperparameters
PYTHON
# Build the classifier using the *optimal* parameters detected by grid search
clfRDF = RandomForestClassifier(n_estimators=best_n_estim,
                                max_depth=best_max_depth,
                                max_features=best_max_features)
clfRDF.fit(XTrain, yTrain)
predRF = clfRDF.predict(XTest)
print(metrics.classification_report(yTest, predRF))
PYTHON
precision recall f1-score support
Wrap up of Module 4
A decision tree model consists of a root node, internal nodes and terminal (leaf) nodes. Root
and internal nodes contain feature test conditions, while terminal nodes assign the
class label.
Random Forests reduce variance by combining many independently trained decision trees
through averaging or majority voting. The individual trees are trained on data drawn by
bootstrap sampling with replacement.
Random Forests, like other "ensemble learning" methods, build many models from the
training data and combine their predictions. In the case of random forests, the individual
models are decision trees.