Data Science - Decision Tree - Random Forest

Ensemble Method

Related terms:

Machine Learning, Decision Tree, Classification (Machine Learning), Classification, Support Vector Machine, Clustering Algorithm, Learning Method


Caravan Insurance Customer Profile Modeling with R
Mukesh Patel, Mudit Gupta, in Data Mining Applications with R, 2014

7.3.4 Bagging Ensemble


The bagging ensemble (BE) method, proposed by Breiman (1996a,b, 1998), combines classification and regression tree methods to stabilize a single tree. Briefly, it works by splitting the data into multiple (training) data sets, to each of which a class of learning or optimizing methods—such as decision trees or neural networks—is applied. The method trains multiple (k) models on the different sets and then averages the predictions of the models, hence "bootstrap aggregating," or bagging. The goal is to obtain a model more accurate than any single model. The rationale is that averaging misclassification errors over different data splits gives a better estimate of the predictive ability of a learning method. The bagging algorithm steps are:

• Training—In each iteration i, i = 1, …, k:

• Randomly sample, with replacement, N examples from the training set.

• Train a chosen "base model" (here, regression decision trees, because the variables are numeric) on the samples.

• Testing—For each test example:

• Run all trained base models.

• Predict by combining the results of all k trained models: averaging for regression, a majority vote for classification.
The following is the R command sequence to run this analysis:

> library(ipred)   # load the ipred package (Peters et al., 2002b)

> cust.ip <- bagging(CARAVAN ~ ., data = dataset1, coob = TRUE)

> cust.ip.prob <- predict(cust.ip, type = "prob", newdata = dataset2)

The output is:

Bagging regression trees with 25 bootstrap replications

Call: bagging.data.frame(formula = CARAVAN ~ ., data = dataset1, coob = TRUE)

Out-of-bag estimate of root mean squared error: 0.2303

The model was fit on 25 bootstrap replications of the training dataset, and the out-of-bag root mean squared error is 0.23. In other words, this value estimates the prediction error for the response variable CARAVAN.

Next, we generated BE model details of all significant independent variables and their relative importance using the following R command. The results are presented in the histogram in Figure 7.2:

> cust.ip.var.imp <- varImp(cust.ip, useModel = bagging)

Figure 7.2. BE model significant variables.

It is interesting that the BE model has the same significant independent variables as the RP model. As we will see in Section 7.3.7 (Table 7.8 and the ROC graph in Figure 7.5), their explanatory and predictive powers are also very similar. It would be interesting to explore the extent to which systematic similarities between the two classification methods can explain this outcome.

Table 7.8. Summary Table of AUC of Classifier Models

Modeling Method                No. Ind. Variables   AUC of the ROC Curve   Criteria for Selecting Variables
Recursive Partitioning (RP)    12                   0.68                   Mean squared error at each split
Bagging Ensemble (BE)          12                   0.68                   Prediction by combining results of trained models (regression, classification)
Support Vector Machine (SVM)   67                   0.66                   Margin maximization and feature weight greater than 1
Logistic Regression (LR)       28                   0.72                   Absolute t-value statistic


Classification
Jiawei Han, ... Jian Pei, in Data Mining (Third Edition), 2012

8.6.4 Random Forests


We now present another ensemble method called random forests. Imagine that each
of the classifiers in the ensemble is a decision tree classifier so that the collection of
classifiers is a “forest.” The individual decision trees are generated using a random
selection of attributes at each node to determine the split. More formally, each tree
depends on the values of a random vector sampled independently and with the same
distribution for all trees in the forest. During classification, each tree votes and the
most popular class is returned.

Random forests can be built using bagging (Section 8.6.2) in tandem with random attribute selection. A training set, D, of d tuples is given. The general procedure to generate k decision trees for the ensemble is as follows. For each iteration, i (i = 1, …, k), a training set, Di, of d tuples is sampled with replacement from D. That is, each Di is a bootstrap sample of D (Section 8.5.4), so that some tuples may occur more than once in Di, while others may be excluded. Let F be the number of attributes to be used to determine the split at each node, where F is much smaller than the number of available attributes. To construct a decision tree classifier, Mi, randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees. The trees are grown to maximum size and are not pruned. Random forests formed this way, with random input selection, are called Forest-RI.
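
As a rough illustration, Forest-RI-style training is available in R through the randomForest package. The sketch below reuses dataset1 and CARAVAN from the earlier bagging example; the value of mtry (the F random attributes tried at each split) is an arbitrary illustrative choice, not a recommendation:

library(randomForest)   # Breiman and Cutler's random forest implementation

set.seed(42)
cust.rf <- randomForest(
  CARAVAN ~ .,       # response and all available input attributes
  data  = dataset1,  # training set (placeholder name from the earlier example)
  ntree = 500,       # k: trees, each grown on its own bootstrap sample
  mtry  = 4          # F: attributes tried at each split, << total attributes
)
print(cust.rf)       # includes the out-of-bag error estimate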

Another form of random forest, called Forest-RC, uses random linear combinations
of the input attributes. Instead of randomly selecting a subset of the attributes,
it creates new attributes (or features) that are a linear combination of the existing
attributes. That is, an attribute is generated by specifying L, the number of original
attributes to be combined. At a given node, L attributes are randomly selected and
added together with coefficients that are uniform random numbers on [-1, 1]. F linear
combinations are generated, and a search is made over these for the best split. This
form of random forest is useful when there are only a few attributes available, so as
to reduce the correlation between individual classifiers.
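
No widely used R package exposes Forest-RC directly, but the feature construction it relies on is easy to sketch. The following illustrative helper (hypothetical, not from the chapter) builds candidate features at a node as random linear combinations of L original attributes, with coefficients drawn uniformly from [-1, 1]:

# Build num_new Forest-RC-style candidate features from numeric matrix X.
# Each combines L randomly chosen attributes with uniform [-1, 1] weights.
make_rc_features <- function(X, L = 3, num_new = 5) {
  sapply(seq_len(num_new), function(j) {
    cols  <- sample(ncol(X), L)           # pick L original attributes
    coefs <- runif(L, min = -1, max = 1)  # uniform random coefficients on [-1, 1]
    as.matrix(X[, cols]) %*% coefs        # one linear-combination feature
  })
}
# The best split at the node is then searched over these new features.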

Random forests are comparable in accuracy to AdaBoost, yet are more robust to
errors and outliers. The generalization error for a forest converges as long as the
number of trees in the forest is large. Thus, overfitting is not a problem. The accuracy
of a random forest depends on the strength of the individual classifiers and a
measure of the dependence between them. The ideal is to maintain the strength
of individual classifiers without increasing their correlation. Random forests are
insensitive to the number of attributes selected for consideration at each split.
Typically, up to log2(d) + 1 are chosen. (An interesting empirical observation was that using
a single random input attribute may result in good accuracy that is often higher
than when using several attributes.) Because random forests consider many fewer
attributes for each split, they are efficient on very large databases. They can be
faster than either bagging or boosting. Random forests give internal estimates of
variable importance.


System design optimization


Steven Simske, in Meta-Analytics, 2019

10.1.1 System considerations—Revisiting the system gains


In Chapter 3, we described how ensemble methods, such as those described in Ref.
[Sims13], tend to move the correct answer higher in the overall ranked list (rank)
more so than simply moving the correct result to the top of the list (accuracy). In
terms of system design theory, the co-occurrence and similarity-based ensemble
approaches of previous chapters are designated as rank-biased systems, not accu-
racy-biased systems. This is a well-known benefit of meta-algorithmic approaches,
and it has a huge positive impact on overall system cost models, since the highest
costs are often associated with recovering from errors. Rank-biased systems make fewer errors over time, even if they make (slightly) more primary errors (have slightly
lower rank = 1 accuracy). This is because intelligently designed ensemble systems
generally have higher percentages of rank = 2 and rank = 3 results than individual
algorithms.
System gain is a ratio of a measured output variable of interest to the system users,
divided by a measured input variable also of interest to the system users. Because
module-to-module interfaces should be fully specified as part of the system design,
the two parameters used in the gain formula should be explicitly listed and explained
in the system specifications. This is especially important where there are multiple
relevant gains for a single module and its input and output interfaces. For example,
an instrumentation amplifier, which is often used for medical applications because
an instrumentational amplifier, which is often used for medical applications because
of its very high input impedance, excellent common-mode rejection ratio (CMRR),
and stable voltage gain, has multiple gains that are of interest to the electronic
system designer, including voltage, current, and power gain. These ratios are based
on two inputs and two outputs (input and output current and voltage), and CMRR
is effectively the same except that the input voltage is that across the positive
and negative terminals of the operational amplifier. From these same input and
output values, impedance, admittance, resistance, and conductance gain can also be
computed. As mentioned in Chapter 4, if there are N inputs and P outputs specified,
there are 2NP simple gains for the module. More can be defined if more composite
inputs and outputs can be used; for example, power gain is really the product of two
other gains: voltage and current gain.

The gains mentioned above are based on the direct measurements at the interfaces.
In the field of analytics, the input and output are data and/or information, and so, the
system gains can be defined based on the data (content gains) or based on attributes
of the data (context gains). Information gain was earlier defined as an attribute/con-
text gain, as it was associated with an increase in some measurable system entropy
(information being extracted increases entropy). If the system information is written
to a histogram, with each element, i, in the histogram having probability p(i), then its
information gain is determined from the change in the entropy (Eq. 3.19, rephrased
in Eq. 10.1) from input to output, a familiar motif now for anyone who has read these
chapters consecutively:

(10.1)  Information gain = H_output − H_input, where H = −Σ_i p(i) log2 p(i)

Eq. (10.1) provides a relationship between input and output; if negative, it indicates a loss of information (meaning we overreached in our modeling, modularization, etc.). This quantity is a gain, but it is obviously no longer a ratio.
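
As a small numerical illustration (with made-up histograms), assuming the usual Shannon entropy over histogram bins, the gain of Eq. (10.1) can be computed as follows:

# Shannon entropy of a histogram given as a vector of bin probabilities.
entropy <- function(p) {
  p <- p[p > 0]              # empty bins contribute 0 (0 * log 0 -> 0)
  -sum(p * log2(p))
}

p_in  <- c(0.70, 0.10, 0.10, 0.10)       # hypothetical input histogram
p_out <- c(0.25, 0.25, 0.25, 0.25)       # hypothetical output histogram

gain <- entropy(p_out) - entropy(p_in)   # Eq. (10.1): change in entropy
gain   # positive here: entropy increased, i.e., information was extracted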

Extending the concept of a gain to that of differences in information allows us to create a category designated earlier as functional gains. Functional gains represent
an upgrade in the value of data in some problem space and so embody a functional
relationship between measurable content produced by the analytic system (output)
and the content as entered into the system (input). In Chapter 3, two functional gains were discussed: knowledge gain and efficiency gain. We revisit these in the context of all
that we have explored in the intervening five chapters.
Knowledge gain is a direct measurement of the product of an analytic system or
else a measurement comparing it with alternative products. In text analytics, the
knowledge gained can be assessed, among other ways, by comparing the entropy
and coverage the analytics have to a specific, salient text reference such as a dictio-
nary, taxonomy, or ontology. Similar knowledge gains can be defined in terms of
metadata entries, tags, meaningful coverage of a set of search queries, inclusion of
foreign words and phrases, etc. Knowledge gain can also be qualitative—increased
use of a system, increased downloading from a mirror site, etc. may indicate that its
knowledge base has improved in value.

A second functional gain of interest is efficiency gain, which is indicative of the improved ability of an information system to achieve its processing goals, nor-
malized by the resources required for this achievement. This is not the same as
the performance, or throughput, of a system, although it certainly relates to it.
Efficiency in most cases more closely correlates with the rank efficiency of a
system, that is, its ability to yield the correct answer more quickly. Efficiency can
come from simple improvement in indexing or in adding contextual information
(user and her historical behavior, GPS location, time of day, etc.). Efficiency from
the perspective of applying meta-analytic approaches may be viewed as parsimony of
algorithm—the most efficient meta-analytic is the one with the most streamlined
design or the design that can be conveyed with the shortest amount of description.
This could also be efficiency in terms of the number and complexity of modules
that are required to design the analytic, the minimum set of coefficients and other
wildcard expressions within the modules, the simplest meta-analytic pattern (e.g.,
weighted voting is a simpler approach than reinforcement-void), etc. As discussed in Chapter 4, we can now look at these gains in light of the design choices we made for the system.

A third functional gain of interest, not previously introduced, is robustness gain. We are ready to explore this gain based on what we learned from the synonym-antonym
and reinforcement-void patterns explored in Chapter 8 and the analytics around
analytics in Chapter 9. A gain in robustness is an improvement in a system ar-
chitecture that allows the system to respond better to changing input, including
changes in the nature (depth and breadth) of the input, the scale of the input, and
the context of the input. In previous chapters, we have discussed several ways in
which to improve robustness. The simplest conceptually is to employ hybrid analytics
by design, benefiting from the fact that ensemble analytics, meta-analytics, and
other combinatorial approaches tend to cover the input spaces more completely, and
with entropy and variance measurements to support this, more evenly. Straightfor-
ward ways of representing system robustness include data indicating the worst-case
throughput, the worst-case accuracy, the worst-case costs, etc. The most robust
systems are generally the most reliable, and thus, the variability in response to input
is lowest.


27th European Symposium on Computer Aided Process Engineering
Vikrant A. Dev, ... Mario R. Eden, in Computer Aided Chemical Engineering, 2017

3 Results and Discussion


The R2 and Q2 values obtained for the four tree-based ensemble methods are listed in Table 1. With respect to the training set, all the utilized ensemble methods except gradient boosted regression trees performed as well as or better than the hybrid GA-DT of Datta et al. (2017). However, their performance on the test set was lower than that of the hybrid GA-DT. Gradient boosted regression trees in general fared poorly; this could be due to the small sample size used in our work. Randomization-based methods performed well overall.

Table 1. Comparison of R2 and Q2 values of the hybrid GA-DT method versus the different ensemble methods utilized in this work

Ensemble Method              R2     Q2
Hybrid GA-DT                 0.81   0.86
Random Forests               0.81   0.76
Regularized Random Forests   0.81   0.74
Extremely Randomized Trees   0.91   0.73
Gradient Boosted Trees       0.57   0.48


Ensembles of Learning Machines


Tim Menzies, ... Burak Turhan, in Sharing Data and Models in Software Engineering, 2015

20.2.1 How bagging works


Bootstrap aggregating (bagging) [46] is an ensemble method that uses a single
type of base learner to produce different base models. Figure 20.1 illustrates how
it works. Consider a training data set D of size |D|. In the case of SEE, each of
the |D| training examples that composes D could be a completed project with
known required effort. The input features of this project could be, for example, its
functional size, development type, language type, team expertise, etc. The target
output would be the true required effort of this project. An illustrative
training set is given in Table 20.2. Consider also that one would like to create an
ensemble with N base models, where N is a predefined value, by using a certain
base learning algorithm. The procedure to create the ensemble's base models is as
follows. Generate N bootstrap samples Si (1 ≤ i ≤ N) of size |D| by sampling training
examples uniformly, with replacement, from D. For example, let's say that D is the
training set shown in Table 20.2 and N = 15. An example of fifteen bootstrap samples
is given in Table 20.3. The procedure then uses the base learning algorithm to create
each base model i using sample Si as the training set.

Figure 20.1. Bagging scheme. BL stands for base learner.

Table 20.2. An Illustrative Software Effort Estimation Training Set

Project ID   Functional Size   Development Type   Language Type   True Effort
1            100               New                3GL             520
2            102               Enhancement        3GL             530
3            111               New                4GL             300
4            130               Enhancement        4GL             310
5            203               Enhancement        3GL             900
6            210               New                3GL             910
7            215               New                4GL             700
8            300               Enhancement        3GL             1500
9            295               New                4GL             2000
10           300               Enhancement        4GL             1340

(Functional Size, Development Type, and Language Type are the input features; True Effort is the target output.)

The project IDs are usually not used for learning.

Table 20.3. Example of Fifteen Bootstrap Samples from the Training Set Shown in
Table 20.2
Bootstrap Sample   Project IDs
S1                 7 6 3 3 2 6 2 6 6 3
S2                 10 5 3 5 3 8 8 4 5 9
S3                 4 9 9 4 5 7 10 5 5 1
S4                 5 7 5 2 1 1 10 1 7 7
S5                 10 10 9 2 3 3 10 3 8 2
S6                 7 5 9 6 1 5 5 3 10 3
S7                 5 4 7 7 5 5 2 8 7 3
S8                 5 1 9 8 7 6 1 7 8 7
S9                 10 6 7 10 7 10 9 9 7 5
S10                1 9 8 5 5 8 8 7 3 2
S11                7 8 8 2 3 4 9 2 4 5
S12                7 7 9 10 7 7 8 7 5 9
S13                7 6 4 2 2 5 7 8 10 5
S14                10 8 5 2 10 2 5 5 9 2
S15                10 10 1 1 6 9 10 7 5 8

Only the project IDs are shown for simplicity.

After the ensemble is created, it can start being used to make predictions for future instances based on their input features. In the case of SEE, future instances are new projects for which we do not know the true required effort and wish to estimate it. In the case of regression tasks, where the value to be estimated is numeric, the prediction given by the ensemble is the simple average of the predictions given by its base models. This would be the case for SEE. However, bagging can also be used for classification tasks, where the value to be predicted is a class. An example of a classification task would be to predict whether a certain module of a software system is faulty or nonfaulty. In the case of classification tasks, the prediction given by the ensemble is the majority vote, i.e., the class most often predicted by its base models.
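
As a concrete sketch of this procedure in R, the following uses rpart regression trees as base models; effort.data and TrueEffort are placeholder names for a data set like Table 20.2, not identifiers from the chapter:

library(rpart)   # CART-style trees as the base learning algorithm

set.seed(1)
N <- 15                  # number of base models, as in Table 20.3
n <- nrow(effort.data)   # |D|: size of the training set (placeholder data)

# Training: one base model per bootstrap sample Si of size |D|.
models <- lapply(seq_len(N), function(i) {
  Si <- effort.data[sample(n, n, replace = TRUE), ]
  rpart(TrueEffort ~ ., data = Si)
})

# Regression prediction: simple average of the base models' predictions.
bagged_predict <- function(models, newdata) {
  rowMeans(sapply(models, predict, newdata = newdata))
}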


Mining challenges in large-scale IoT data framework – a machine learning perspective
Gaurav Mohindru, ... Haider Banka, in Advanced Data Mining Tools and Methods for Social Computing, 2022

12.4.3 Concept of random forest


Random Forest (abbreviated as RF) is a widely used and effective bagging-based machine learning ensemble method. Bagging, or bootstrap aggregation, is a basic ensemble technique in which predictions from multiple models are combined using a model-averaging technique such as majority vote, weighted average, or simple average to decide the result, instead of relying on one model. Random Forest creates multiple decision trees with randomness in sampling and feature selection. The uncorrelated trees together produce more accurate predictions and reduce variance through the value added by the ensemble. In the first stage of Random Forest, we create a random forest of decision trees, while in the second we make a prediction using the models created. Random Forest handles the overfitting issues of boosting algorithms efficiently. The Random Forest algorithm can be used for classification as well as regression tasks. It also helps in reducing variance. Because of its built-in ensemble capacity, the task of building a generalized model for all datasets becomes much easier. It is capable of effectively handling large datasets with many variables. The average decrease in impurity attributable to each feature can help determine the importance of that feature.
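
For instance, with the randomForest package in R, the mean decrease in node impurity is reported per feature and can be read directly as an importance score (rf and iot.data below are illustrative names, not from the chapter):

library(randomForest)

rf <- randomForest(label ~ ., data = iot.data, importance = TRUE)
importance(rf, type = 2)   # type 2: mean decrease in node impurity per feature
varImpPlot(rf)             # plot the resulting feature-importance ranking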

The concept of the bootstrap method with its aggregation, popularly known as bagging, is the key behind this predictive modeling. Bootstrap is a common yet effective statistical technique for estimating from a data sample by using resampling to create multiple samples. There are algorithms [21][22], such as Classification and Regression Trees (abbreviated as CART), that are quite potent but have high variance. Bootstrap aggregation can be used to reduce this variance by using the collective prediction of the group. A decision tree's structure and predictions vary with the training dataset; when we create multiple decision trees with little or no correlation and aggregate them as an ensemble, we can get better accuracy. Low correlation is important for varied tree structures, and randomness is used to achieve it. There is another place where randomness is used in Random Forest: in a plain decision tree, all the available features are considered when selecting a feature for a split [23]; in Random Forest, by contrast, a random subset of features is selected from the available set. Randomness in RF thus lies mainly in selecting random observations for growing each tree and random features for splitting the nodes.



Computational Analysis and Understanding of Natural Languages: Principles, Methods and Applications
Ehsan Fathi, Babak Maleki Shoja, in Handbook of Statistics, 2018

5.2.1 Bagging and Other Ensemble Methods


The idea of bootstrap aggregating, or bagging, is rooted in a general strategy in ML called model averaging (ensemble methods). In this strategy, several different models are trained separately and each model votes on the output for test examples. Usually, the different models do not all make the same errors on the test set; that is why ensemble methods are useful. There are different ensemble methods: it is possible to train models that are completely different in algorithm and objective function, or the same kind of model can be reused several times (bagging). In bagging, different datasets are constructed by sampling from the original dataset; thus, each constructed dataset is missing some examples from the original dataset and also contains duplicate examples. These differences in datasets lead to different trained models.

In addition, due to random initialization, random selection of minibatches, etc., neural networks reach different solution points even if all models are trained on the same dataset. Consequently, they can benefit from model averaging.

The cost of model averaging is increased computation and memory. Therefore, it is usually discouraged when benchmarking algorithms for scientific papers.


31st European Symposium on Computer Aided Process Engineering
Nikolaus I. Vollmer, ... Gürkan Sin, in Computer Aided Chemical Engineering, 2021

2.4 Regression Tree Ensemble


The last machine learning concept introduced is the so-called regression tree ensemble (RTE), which belongs to the class of ensemble methods: several decision trees are fitted to the data, and a weighted combination of these trees serves as the basis for regression. A single decision tree is easy to depict visually: the input data are divided, based on several input criteria, onto the different branches of the tree. Tree ensembles are quite powerful, as they can effectively fit sparse datasets with exceptional predictive capacity (Thebelt et al., 2020).


31st European Symposium on Computer Aided Process Engineering
Christos Chatzilenas, ... Ioannis Marinos, in Computer Aided Chemical Engineering, 2021

4.1 Prediction Methods


We implemented a group of models using 7 prediction methods, grouped into 3 major categories: (a) Multiple Linear Regression, (b) Ensemble Methods, and (c) Hyperplanes. Multiple Linear Regression includes the classic form of linear regression, which calculates the coefficients of a linear function by minimizing the mean square error between the predicted and the actual values. Ridge and Lasso regression are both variants of linear regression that minimize the mean square error plus a regularization term that keeps the coefficients in a low range. However, the regularization terms differ: the Lasso regularizer leads to a more aggressive reduction of the range of the coefficients, often producing zero coefficients for characteristics that have a small effect on the target value. Practically, it performs feature selection.
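
As a brief sketch (x and y below are a placeholder predictor matrix and response vector), the glmnet package in R implements both penalties; alpha = 0 gives ridge and alpha = 1 gives Lasso, whose stronger shrinkage zeroes out weak coefficients:

library(glmnet)

ridge <- glmnet(x, y, alpha = 0)   # L2 penalty: shrinks all coefficients smoothly
lasso <- glmnet(x, y, alpha = 1)   # L1 penalty: can set coefficients exactly to 0

coef(lasso, s = 0.1)   # at penalty strength s, weak predictors drop to zero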

In the Ensemble Methods, AdaBoost Regression (Freund and Schapire, 1995; Drucker, 1997) and XGBoost Regression (T. Hastie and Friedman, 2009) estimate the target by combining the estimates of many individual prediction models based on decision trees. By contrast, Random Forest Regression (Breiman, 2001) estimates the target value by combining the average estimates of several individual prediction models based on decision trees built for a number of subsets of the data set. The difference between AdaBoost Regression and XGBoost Regression is how the combination of the estimations is done. In AdaBoost Regression the combination is done serially, so each new model corrects the previous one and is trained by giving more weight to the training data for which the previous model showed underfitting behavior. In XGBoost Regression, on the other hand, every new model is trained to estimate the prediction errors of the previous model, so that the prediction of the previous model is corrected based on the new one.
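
A minimal sketch of the XGBoost behavior described above, using the xgboost R package (X and y are placeholders; each boosting round fits a new tree to the current ensemble's prediction errors):

library(xgboost)

bst <- xgboost(
  data      = as.matrix(X),        # numeric feature matrix (placeholder)
  label     = y,                   # numeric target (placeholder)
  nrounds   = 100,                 # rounds: each adds a tree fit to current errors
  objective = "reg:squarederror",  # squared-error regression
  verbose   = 0
)
pred <- predict(bst, as.matrix(X))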
Finally, the Hyperplanes category includes Support Vector Regression (SVR), which is based on the support vector machine classification model, modified to predict continuous values. The main difference between SVR and linear regression is that the latter aims to minimize the mean square error between the predicted and actual values of the target, while SVR aims to limit the error to a range of values.
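
A corresponding sketch with the e1071 package in R (df and target are placeholder names); epsilon sets the error band within which deviations incur no loss:

library(e1071)

svr <- svm(target ~ ., data = df,
           type    = "eps-regression",  # epsilon-insensitive SVR
           epsilon = 0.1)               # errors within +/- 0.1 are not penalized
pred <- predict(svr, df)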


Machine learning
Patrick Schneider, Fatos Xhafa, in Anomaly Detection and Complex Event Processing
over IoT Data Streams, 2022

8.6 Handling concept drifts


In [86], a general overview of the methods, tools and techniques for dealing with
concept drift has been presented. Depending on the specific application, these
methods have been classified according to predefined meaningful criteria to provide
the reader with a guide for designing an efficient self-adaptive machine learning
and data mining scheme. The different approaches to deal with concept drift have
been classified into two main categories, depending on how they update the learner
in response to the occurrence of a drift. The first category is the informed methods,
where the learner is only updated when a drift is detected and confirmed. Therefore,
the informed methods use a set of change indicators to trigger the updating of
the learner. Depending on the availability of the true labels, these indicators are
divided into supervised and unsupervised change indicators. Supervised change
indicators assume that the true labels of the incoming patterns are immediately
available, while unsupervised indicators monitor changes in the learner complexity
or the properties of the data distribution in the feature space. The second category is that of blind methods, in which the learner is continuously updated on the incoming
data patterns regardless of whether drift has occurred. The methods that deal with
concept drift were also divided into two categories: Sequential and Window-based
approaches. Sequential methods handle each data sample as soon as it arrives and
then discard it, while window-based approaches process data samples within a time
window.

The methods that handle concept drift can be based on either one learner or a set of learners; in the latter case, they are called ensemble methods. Single learners or ensemble-based learners can be managed using three techniques to handle concept drift:
• By exploiting the training set.

• By integrating a fixed or variable number of learners trained using the same learning method but with different parameter settings or using different learning methods.
• By managing the decisions of the ensemble's individual learners.

The final decision can be issued either by combining the individual learners' weight-
ed decisions, selecting one of the individual learners, or combining the decisions of
a subset of selected individual learners. Finally, in this chapter, the criteria that may
indicate the evaluation outcome of the machine learning and data mining scheme to
handle concept drift are defined. They allow evaluating the autonomy (involvement
of a human being in the learning scheme), the reliability of drift detection and
description, the independence and influence of parameter setting on the learning
scheme performance, and the time and memory requirements for computing decisions. Table 8.1 summarizes guidelines to help readers choose the methods
suitable for handling concept drift according to its kind and characteristics.

Table 8.1. Guidelines to select methods to handle concept drift according to the drift kind and characteristics [86].

Drift Type              Method
Abrupt                  Sequential methods + single learner + ensemble (fixed size)
Gradual Probabilistic   Window-based (variable size) + ensemble (variable size)
Gradual Continuous      Window-based (variable size) + ensemble (variable size)
Global                  Sequential methods + single learner + ensemble (fixed size)
Local                   Window-based (variable size) + ensemble (variable size)
Real                    Informed (supervised drift indicators)
Virtual                 Informed (unsupervised drift indicators) + blind methods
Cyclic                  Ensemble (fixed size)
Acyclic                 Sequential methods + single learner + ensemble (variable size)
Predictable             Informed methods
Unpredictable           Informed (unsupervised drift indicators) + blind methods

For more information on concept drift for data streams, the reader is referred to Section 1.4 in Chapter 1.

