Data Science - Decision Tree - Random Forest
• Training a chosen “base model” (here, regression decision trees, because the variables on the samples are numeric)
• Testing: for each test example
The training dataset was split into 25 subsets. The average mean squared error
is 0.23; in other words, this estimates the prediction error for the response
variable CARAVAN.
It is interesting that the BE model has the same significant independent variables
as the RP. As we will see in Section 7.3.7 (Table 7.8 and the ROC graph in Figure 7.5),
their explanatory and predictive powers are also very similar. It would be interesting
to explore the extent to which systematic similarities between the two classification
methods can explain this outcome.
Classification
Jiawei Han, ... Jian Pei, in Data Mining (Third Edition), 2012
Random forests can be built using bagging (Section 8.6.2) in tandem with random
attribute selection. A training set, D, of d tuples is given. The general procedure
to generate k decision trees for the ensemble is as follows. For each iteration, i
(i = 1, 2, …, k), a training set, Di, of d tuples is sampled with replacement from D. That is, each Di
is a bootstrap sample of D (Section 8.5.4), so that some tuples may occur more
than once in Di, while others may be excluded. Let F be the number of attributes
to be used to determine the split at each node, where F is much smaller than the
number of available attributes. To construct a decision tree classifier, Mi, randomly
select, at each node, F attributes as candidates for the split at the node. The CART
methodology is used to grow the trees. The trees are grown to maximum size and
are not pruned. Random forests formed this way, with random input selection, are
called Forest-RI.
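The Forest-RI procedure above can be sketched as follows. This is a simplified illustration, not Breiman's implementation: for brevity, a depth-1 "stump" stands in for a full, unpruned CART tree, the threshold is simply the attribute mean, and the toy dataset and all names are invented for the example.

```python
import random
from collections import Counter

def bootstrap(D):
    """Sample |D| tuples with replacement (a bootstrap sample of D)."""
    return [random.choice(D) for _ in D]

def train_stump(Di, F):
    """Grow a one-level tree: randomly select F candidate attributes
    at the node and keep the split that classifies Di best."""
    n_attrs = len(Di[0][0])
    candidates = random.sample(range(n_attrs), F)
    best = None
    for a in candidates:
        thresh = sum(x[a] for x, _ in Di) / len(Di)
        # majority class on each side of the split
        left = Counter(y for x, y in Di if x[a] <= thresh)
        right = Counter(y for x, y in Di if x[a] > thresh)
        correct = max(left.values(), default=0) + max(right.values(), default=0)
        if best is None or correct > best[0]:
            l = left.most_common(1)[0][0] if left else right.most_common(1)[0][0]
            r = right.most_common(1)[0][0] if right else l
            best = (correct, a, thresh, l, r)
    _, a, t, l, r = best
    return lambda x: l if x[a] <= t else r

def forest_ri(D, k, F):
    """Train k classifiers, each on its own bootstrap sample Di."""
    return [train_stump(bootstrap(D), F) for _ in range(k)]

def predict(forest, x):
    """Majority vote over the k classifiers."""
    return Counter(m(x) for m in forest).most_common(1)[0][0]

random.seed(0)
D = [((i, 10 - i), int(i >= 5)) for i in range(10)]  # toy training set
forest = forest_ri(D, k=25, F=1)
print(predict(forest, (8, 2)))  # prints 1 (majority vote for class 1)
```

The essential ingredients of Forest-RI survive the simplification: each classifier sees its own bootstrap sample, and each split considers only F randomly chosen attributes rather than all of them.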
Another form of random forest, called Forest-RC, uses random linear combinations
of the input attributes. Instead of randomly selecting a subset of the attributes,
it creates new attributes (or features) that are a linear combination of the existing
attributes. That is, an attribute is generated by specifying L, the number of original
attributes to be combined. At a given node, L attributes are randomly selected and
added together with coefficients that are uniform random numbers on [−1, 1]. F linear
combinations are generated, and a search is made over these for the best split. This
form of random forest is useful when there are only a few attributes available, so as
to reduce the correlation between individual classifiers.
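The Forest-RC feature construction can be sketched as below: pick L of the original attributes, combine them with coefficients drawn uniformly from [−1, 1], and generate F such candidate features to search for the best split at a node. The function names are illustrative.

```python
import random

def random_combination(n_attrs, L):
    """Pick L original attributes and coefficients uniform on [-1, 1];
    return a function mapping an input vector to the new feature value."""
    idx = random.sample(range(n_attrs), L)
    coef = [random.uniform(-1.0, 1.0) for _ in idx]
    return lambda x: sum(c * x[j] for c, j in zip(coef, idx))

def candidate_features(n_attrs, L, F):
    """The F linear combinations searched for the best split at a node."""
    return [random_combination(n_attrs, L) for _ in range(F)]

random.seed(1)
feats = candidate_features(n_attrs=4, L=2, F=3)
x = (1.0, 2.0, 3.0, 4.0)
values = [f(x) for f in feats]
print(len(values))  # prints 3: one candidate feature value per combination
```

Because each new feature blends several original attributes, two trees are less likely to pick effectively identical splits, which is how Forest-RC lowers inter-classifier correlation when few attributes are available.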
Random forests are comparable in accuracy to AdaBoost, yet are more robust to
errors and outliers. The generalization error for a forest converges as long as the
number of trees in the forest is large. Thus, overfitting is not a problem. The accuracy
of a random forest depends on the strength of the individual classifiers and a
measure of the dependence between them. The ideal is to maintain the strength
of individual classifiers without increasing their correlation. Random forests are
insensitive to the number of attributes selected for consideration at each split.
Typically, up to log2(d) + 1 attributes, where d is the number of available attributes, are chosen. (An interesting empirical observation was that using
a single random input attribute may result in good accuracy that is often higher
than when using several attributes.) Because random forests consider many fewer
attributes for each split, they are efficient on very large databases. They can be
faster than either bagging or boosting. Random forests give internal estimates of
variable importance.
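One common way to obtain such importance estimates is by permutation: shuffle one attribute's values and measure how much accuracy drops. The sketch below is a simplified, non-out-of-bag version on a hypothetical model; all names and data are invented for illustration.

```python
import random

def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, j, seed=0):
    """Shuffle column j and report the resulting drop in accuracy:
    the larger the drop, the more the model relies on attribute j."""
    rng = random.Random(seed)
    col = [x[j] for x in X]
    rng.shuffle(col)
    X_perm = [tuple(col[i] if k == j else v for k, v in enumerate(x))
              for i, x in enumerate(X)]
    return accuracy(model, X, y) - accuracy(model, X_perm, y)

# toy model that only looks at attribute 0
model = lambda x: int(x[0] >= 5)
X = [(i, random.random()) for i in range(10)]
y = [int(i >= 5) for i in range(10)]
print(permutation_importance(model, X, y, j=1))  # prints 0.0: attribute 1 is ignored
```

Permuting an attribute the model never uses changes nothing, so its importance is exactly zero, while permuting a decisive attribute produces a measurable accuracy drop.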
The gains mentioned above are based on the direct measurements at the interfaces.
In the field of analytics, the input and output are data and/or information, and so, the
system gains can be defined based on the data (content gains) or based on attributes
of the data (context gains). Information gain was earlier defined as an attribute/context gain, as it was associated with an increase in some measurable system entropy
(information being extracted increases entropy). If the system information is written
to a histogram, with each element, i, in the histogram having probability p(i), then its
information gain is determined from the change in the entropy (Eq. 3.19, rephrased
in Eq. 10.1) from input to output, a familiar motif now for anyone who has read these
chapters consecutively:
Gain = H(output) − H(input) = −Σi pout(i) log2 pout(i) + Σi pin(i) log2 pin(i)    (10.1)
Eq. (10.1) provides a relationship between input and output and, if negative, indicates
a loss of information (meaning we overreached in our modeling, modularization,
etc.). This quantity is a gain, but it is obviously no longer a ratio.
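The entropy change described above can be computed directly from histograms. A minimal sketch, assuming base-2 logarithms and histograms given as bin counts:

```python
import math

def entropy(hist):
    """Shannon entropy (in bits) of a histogram given as bin counts."""
    total = sum(hist)
    ps = [c / total for c in hist if c > 0]
    return -sum(p * math.log2(p) for p in ps)

def information_gain(hist_in, hist_out):
    """Eq. (10.1)-style gain: change in entropy from input to output.
    Negative values indicate a loss of information."""
    return entropy(hist_out) - entropy(hist_in)

print(entropy([1, 1, 1, 1]))                          # prints 2.0 bits (uniform, 4 bins)
print(information_gain([4, 0, 0, 0], [1, 1, 1, 1]))   # prints 2.0 (entropy increased)
print(information_gain([1, 1, 1, 1], [4, 0, 0, 0]))   # prints -2.0 (information lost)
```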
Ensemble Method               R²     Q²
Hybrid GA-DT                  0.81   0.86
Random Forests                0.81   0.76
Regularized Random Forests    0.81   0.74
Extremely Randomized Trees    0.91   0.73
Gradient Boosted Trees        0.57   0.48
Table 20.3. Example of Fifteen Bootstrap Samples from the Training Set Shown in
Table 20.2
Bootstrap Sample Project IDs
S1 7 6 3 3 2 6 2 6 6 3
S2 10 5 3 5 3 8 8 4 5 9
S3 4 9 9 4 5 7 10 5 5 1
S4 5 7 5 2 1 1 10 1 7 7
S5 10 10 9 2 3 3 10 3 8 2
S6 7 5 9 6 1 5 5 3 10 3
S7 5 4 7 7 5 5 2 8 7 3
S8 5 1 9 8 7 6 1 7 8 7
S9 10 6 7 10 7 10 9 9 7 5
S10 1 9 8 5 5 8 8 7 3 2
S11 7 8 8 2 3 4 9 2 4 5
S12 7 7 9 10 7 7 8 7 5 9
S13 7 6 4 2 2 5 7 8 10 5
S14 10 8 5 2 10 2 5 5 9 2
S15 10 10 1 1 6 9 10 7 5 8
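Bootstrap samples like those in Table 20.3 can be drawn as below (the specific IDs will differ from the table, since the random seed there is unknown):

```python
import random

def bootstrap_samples(ids, n_samples):
    """Draw n_samples bootstrap samples, each containing len(ids) project
    IDs sampled with replacement from the training set."""
    return [[random.choice(ids) for _ in ids] for _ in range(n_samples)]

random.seed(42)
samples = bootstrap_samples(list(range(1, 11)), 15)
print(len(samples), len(samples[0]))  # prints 15 10: fifteen samples of ten IDs
# Within a sample some IDs repeat and others are missing, as in Table 20.3.
```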
After the ensemble is created, it can be used to make predictions for future instances
based on their input features. In the case of SEE, future instances are new projects
for which we do not know the true required effort and wish to estimate it. For
regression tasks, where the value to be estimated is numeric, the prediction given
by the ensemble is the simple average of the predictions given by its base models.
This is the case for SEE. However, bagging can also be used for classification tasks,
where the value to be predicted is a class. An example of a classification task would
be predicting whether a certain software module is faulty or nonfaulty. For
classification tasks, the prediction given by the ensemble is the majority vote,
i.e., the class most often predicted by its base models.
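The two aggregation rules, averaging for regression and majority vote for classification, can be sketched as:

```python
from collections import Counter

def aggregate_regression(predictions):
    """Bagged regression (e.g., SEE effort estimates): simple average."""
    return sum(predictions) / len(predictions)

def aggregate_classification(predictions):
    """Bagged classification (e.g., faulty vs. nonfaulty): majority vote."""
    return Counter(predictions).most_common(1)[0][0]

print(aggregate_regression([10.0, 12.0, 11.0]))                     # prints 11.0
print(aggregate_classification(["faulty", "nonfaulty", "faulty"]))  # prints faulty
```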
The concept of the bootstrap method with its aggregation, popularly known as
bagging, is the key behind this predictive modeling. The bootstrap is a common yet
effective statistical technique that estimates quantities from a data sample by
resampling it to create multiple samples. Some algorithms [21], [22], such as
Classification and Regression Trees (CART), are quite potent but have high variance.
Bootstrap aggregation can be used to reduce this variance by using the collective
prediction of the group. A decision tree's structure and predictions vary with the
training dataset, so when we create multiple decision trees with little or no
correlation and aggregate them into an ensemble, we obtain better accuracy. Low
correlation between the trees is important for varied tree structures, and randomness
is used to achieve it. Random Forest uses randomness in one further place: whereas a
decision tree considers all available features when selecting a feature for a split
[23], Random Forest selects a random subset of the available features. Randomness in
RF thus lies mainly in selecting random observations for growing each tree and random
features for splitting its nodes.
Machine learning
Patrick Schneider, Fatos Xhafa, in Anomaly Detection and Complex Event Processing
over IoT Data Streams, 2022
The methods that handle concept drift can be based either on one learner or on a set
of learners; in the latter case, they are called ensemble methods. Single learners
or ensemble-based learners can handle concept drift using three techniques:
• By exploiting the training set.
The final decision can be issued either by combining the individual learners' weighted
decisions, selecting one of the individual learners, or combining the decisions of
a subset of selected individual learners. Finally, in this chapter, the criteria that
may indicate the evaluation outcome of a machine learning or data mining scheme that
handles concept drift are defined. They allow evaluating the autonomy of the scheme
(the involvement of a human being in the learning), the reliability of drift detection
and description, the independence and influence of parameter settings on the learning
scheme's performance, and the time and memory requirements for computing decisions.
Table 8.1 summarizes the guidelines to help readers choose the methods
suitable for handling concept drift according to its kind and characteristics.
Table 8.1. Guidelines to select methods to handle concept drift according to the drift
kind and characteristics [86].
For more information on drift concept for data streams, the reader is referred to
Section 1.4 in Chapter 1.
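As a closing illustration of the first combination option above, combining the individual learners' weighted decisions can be sketched as below. The weights are hypothetical, for example each learner's recent accuracy on the stream; the class labels are invented for the example.

```python
from collections import defaultdict

def weighted_vote(decisions, weights):
    """Combine individual learners' decisions, each weighted (e.g., by its
    recent accuracy); return the class with the largest total weight."""
    score = defaultdict(float)
    for cls, w in zip(decisions, weights):
        score[cls] += w
    return max(score, key=score.get)

# three learners vote; the two lighter learners together outvote the heavier one
print(weighted_vote(["drifted", "stable", "stable"], [0.5, 0.3, 0.3]))  # prints stable
```

Updating the weights as the stream evolves lets the ensemble gradually discount learners trained before a drift, which is exactly why weighted combination suits concept-drift settings.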