
DS 535: ADVANCED DATA MINING FOR BUSINESS

Lecture Notes #6: Classification and Regression Trees


Random Forests and Boosted Trees

(Textbook reading: Chapter 9)

Trees and Rules

Goal: Classify or predict an outcome based on a set of predictors


The output is a set of rules

Example:
• Goal: classify a record as “will accept credit card offer” or “will not accept”
• Rule might be “IF (Income >= 106) AND (Education < 1.5) AND (Family <= 2.5) THEN
Class = 0 (nonacceptor)
• Also called CART, Decision Trees, or just Trees
• Rules are represented by tree diagrams

Key Ideas

Recursive partitioning: Repeatedly split the records into two parts so as to achieve maximum
homogeneity of outcome within each new part

Pruning the tree: Simplify the tree by pruning peripheral branches to avoid overfitting

Recursive Partitioning

Recursive Partitioning Steps


• Pick one of the predictor variables, x_i
• Pick a value of x_i, say s_i, that divides the training data into two (not necessarily equal)
portions
• Measure how "pure" or homogeneous each of the resulting portions is
• "Pure" = containing records of mostly one class (or, for prediction, records with similar
outcome values)
• Algorithm tries different values of x_i and s_i to maximize purity in initial split

• After you get a “maximum purity” split, repeat the process for a second split (on any
variable), and so on

Example: Riding Mowers


• Goal: Classify 24 households as owning or not owning riding mowers
• Predictors = Income, Lot Size
Use library rpart to fit trees and the function prp() in library rpart.plot to plot them

Income Lot_Size Ownership
60.0 18.4 owner
85.5 16.8 owner
64.8 21.6 owner
61.5 20.8 owner
87.0 23.6 owner
110.1 19.2 owner
108.0 17.6 owner
82.8 22.4 owner
69.0 20.0 owner
93.0 20.8 owner
51.0 22.0 owner
81.0 20.0 owner
75.0 19.6 non-owner
52.8 20.8 non-owner
64.8 17.2 non-owner
43.2 20.4 non-owner
84.0 17.6 non-owner
49.2 17.6 non-owner
59.4 16.0 non-owner
66.0 18.4 non-owner
47.4 16.4 non-owner
33.0 18.8 non-owner
51.0 14.0 non-owner
63.0 14.8 non-owner

How to split
• Order records according to one variable, say income
• Take a predictor value, say 60 (the first record) and divide records into those with income >=
60 and those < 60
• Measure resulting purity (homogeneity) of class in each resulting portion
• Try all other split values
• Repeat for other variable(s)
• Select the one variable & split that yields the most purity
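The split search described above can be sketched directly in R. The following is a minimal illustration (not the rpart implementation), assuming mower.df has been read in as in the R code later in this section; here "purity" is simply the proportion of the majority class in each portion (the Gini index and entropy measures defined next refine this idea):

# a sketch of the exhaustive split search on Income; purity = proportion of the majority class
purity <- function(y) max(prop.table(table(y)))
split.purity <- function(value, df) {
  left  <- df$Ownership[df$Income <  value]
  right <- df$Ownership[df$Income >= value]
  w <- length(left) / nrow(df)                     # weight each portion by its number of records
  w * purity(left) + (1 - w) * purity(right)
}
candidates <- sort(unique(mower.df$Income))[-1]    # skip the minimum so both portions are non-empty
weighted.purity <- sapply(candidates, split.purity, df = mower.df)
candidates[which.max(weighted.purity)]             # the Income value giving the purest split
# the same search would be repeated for Lot_Size, and the better of the two splits chosen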
Note: Categorical Variables
• Examine all possible ways in which the categories can be split.
• E.g., categories A, B, C can be split 3 ways
{A} and {B, C}
{B} and {A, C}
{C} and {A, B}
• With many categories, # of splits becomes huge
The first split: Income = 60

Second Split: Lot size = 21

After All Splits

Measuring Impurity

Gini Index:
Gini index for rectangle A:

I(A) = 1 - \sum_{k=1}^{m} p_k^2

p_k = proportion of cases in rectangle A that belong to class k (out of m classes)

• I(A) = 0 when all cases belong to same class


• Max value when all classes are equally represented (= 0.50 in binary case)
Note: XLMiner uses a variant called “delta splitting rule”

Entropy:
entropy(A) = - \sum_{k=1}^{m} p_k \log_2(p_k)

p_k = proportion of cases in rectangle A that belong to class k (out of m classes)

• Entropy ranges between 0 (most pure) and log_2(m) (equal representation of classes)
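As a quick illustration of these two measures, here is a minimal sketch of both as R functions (p is a vector of class proportions in a rectangle):

gini    <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(p * log2(p))
gini(1);    entropy(1)                    # 0 and 0: a pure rectangle (one class only)
gini(c(0.5, 0.5)); entropy(c(0.5, 0.5))   # 0.5 and 1 (= log2(2)): maximum impurity for two classes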

Impurity and Recursive Partitioning
• Obtain overall impurity measure (weighted avg. of individual rectangles)
• At each successive stage, compare this measure across all possible splits in all variables
• Choose the split that reduces impurity the most
• Chosen split points become nodes on the tree
R codes:
library(rpart)
library(rpart.plot)
mower.df <- read.csv("RidingMowers.csv")

# use rpart() to run a classification tree.


# define rpart.control() in rpart() to determine the depth of the tree.
# maxdepth Set the maximum depth of any node of the final tree, with the root node counted as
# depth 0.
class.tree <- rpart(Ownership ~ ., data = mower.df,
control = rpart.control(maxdepth = 2), method = "class")
# in this example, maxdepth = 1 gives the same results.
## plot tree
# use prp() to plot the tree. You can control plotting parameters such as color, shape,
# and information displayed (which and where).
prp(class.tree, type = 1, extra = 1, split.font = 1, varlen = -10)

First Split – The Tree

> prp(class.tree) # gives the following plot

The impurity measures for this rectangle are:


• Gini_left = 1 − (7/8)^2 − (1/8)^2 = 0.219
• entropy_left = −(7/8) log_2(7/8) − (1/8) log_2(1/8) = 0.544
The right rectangle contains 11 owners and five nonowners. The impurity measures of
the right rectangle are therefore
• Gini_right = 1 − (11/16)^2 − (5/16)^2 = 0.430
• entropy_right = −(11/16) log_2(11/16) − (5/16) log_2(5/16) = 0.896
The combined impurity of the two rectangles that were created by the split is a weighted
average of the two impurity measures, weighted by the number of records in each:
• Gini = (8/24)(0.219) + (16/24)(0.430) = 0.359
• entropy = (8/24)(0.544) + (16/24)(0.896) = 0.779
Thus, the Gini impurity index decreased from 0.5 before the split to 0.359 after the split.
Similarly, the entropy impurity measure decreased from 1 before the split to 0.779 after
the split.
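These figures can be checked with a few lines of R; a sketch, assuming mower.df as read in the code above and the Income = 60 split described earlier:

gini    <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(p * log2(p))
left  <- prop.table(table(mower.df$Ownership[mower.df$Income <  60]))   # 1 owner, 7 non-owners
right <- prop.table(table(mower.df$Ownership[mower.df$Income >= 60]))   # 11 owners, 5 non-owners
c(gini(left), gini(right))          # 0.219  0.430
c(entropy(left), entropy(right))    # 0.544  0.896
(8/24) * gini(left)    + (16/24) * gini(right)       # 0.359
(8/24) * entropy(left) + (16/24) * entropy(right)    # 0.779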

Tree after all splits

> # plot tree after all splits


> class.tree <- rpart(Ownership ~ ., data = mower.df,
+ method = "class", cp = 0, minsplit = 1)
> prp(class.tree, type = 1, extra = 1, split.font = 1, varlen = -10)
>

# Cp: complexity parameter. Any split that does not decrease the overall lack of fit by a factor of cp
is not attempted. For instance, with anova splitting, this means that the overall R-squared must
increase by cp at each step. The main role of this parameter is to save computing time by pruning
off splits that are obviously not worthwhile. Essentially, the user informs the program that any split
which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence
the program need not pursue it. Default value cp=0.01.

# minsplit the minimum number of observations that must exist in a node in order for a split to be
attempted. Default value minsplit = 20

The first split is on Income, then the next split is on Lot Size for both the low income group (at
lot size 21) and the high income group (at lot size 20)

Example 2: Acceptance of Personal Loan
code for creating a default classification tree

#### Figure 9.9

library(rpart)
library(rpart.plot)

bank.df <- read.csv("UniversalBank.csv")


bank.df <- bank.df[ , -c(1, 5)] # Drop ID and zip code columns.

# partition
set.seed(1)
train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)
train.df <- bank.df[train.index, ]
valid.df <- bank.df[-train.index, ]

# classification tree
default.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class")
# plot tree
prp(default.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10)

The Overfitting Problem

Full trees are complex and overfit the data


• Natural end of process is 100% purity in each leaf
• This overfits the data – the model ends up fitting noise in the data
• Consider Example 2, Loan Acceptance with more records and more variables than the
Riding Mower data – the full tree is very complex

code for creating a deeper classification tree

#### Figure 9.10

deeper.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class", cp = 0,


minsplit = 1)

# count number of leaves


length(deeper.ct$frame$var[deeper.ct$frame$var == "<leaf>"])

# plot tree

prp(deeper.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10,

box.col=ifelse(deeper.ct$frame$var == "<leaf>", 'gray', 'white'))

> length(deeper.ct$frame$var[deeper.ct$frame$var == "<leaf>"])

[1] 53

Full trees are too complex – they end up fitting noise, overfitting the data

code for classifying the validation data using a tree and computing the
confusion matrices and accuracy for the training and validation data

> #### Table 9.3


> library(caret)
Loading required package: lattice
Loading required package: ggplot2
Warning message:
package ‘caret’ was built under R version 3.4.4
>
>
> # classify records in the validation data.
> # set argument type = "class" in predict() to generate predicted class membership.
> default.ct.point.pred.train <- predict(default.ct,train.df,type = "class")

> # generate confusion matrix for training data

> table(default.ct.point.pred.train, train.df$Personal.Loan)

default.ct.point.pred.train 0 1
0 2696 26
1 13 265

Note:
Accuracy : 0.987          note: = (2696+265)/(2696+26+13+265)

Sensitivity : 0.9952
Specificity : 0.9107

'Positive' Class : 0

> ### repeat the code for the validation set,

> default.ct.point.pred.valid <- predict(default.ct, valid.df,type = "class")


>
> table(default.ct.point.pred.valid, valid.df$Personal.Loan)

default.ct.point.pred.valid 0 1
0 1792 18
1 19 171
>

> ### repeat the code for using the deeper tree
> default.ct.point.pred.train <- predict(deeper.ct, train.df,type = "class")
> table(default.ct.point.pred.train, train.df$Personal.Loan)

default.ct.point.pred.train 0 1
0 2709 0
1 0 291
>
> default.ct.point.pred.valid <- predict(deeper.ct, valid.df,type = "class")
> table(default.ct.point.pred.valid, valid.df$Personal.Loan)

default.ct.point.pred.valid 0 1
0 1788 19
1 23 170
>
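The Accuracy, Sensitivity, and Specificity figures quoted above come from caret's confusionMatrix(); a minimal sketch for the training data (both arguments must be factors, and positive = "0" matches the 'Positive' Class shown earlier):

library(caret)
pred.train <- predict(default.ct, train.df, type = "class")
confusionMatrix(pred.train, as.factor(train.df$Personal.Loan), positive = "0")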

Suppose a 2x2 confusion matrix table is denoted as:

                     Reference
Predicted       Event     No Event
  Event           A          B
  No Event        C          D

Some other common metrics are:

Sensitivity=A/(A+C); Specificity=D/(B+D); Prevalence=(A+C)/(A+B+C+D);

PPV=(sensitivity∗prevalence)/((sensitivity∗prevalence)+((1−specificity)∗(1−prevalence)));

NPV=(specificity∗(1−prevalence))/(((1−sensitivity)∗prevalence)+((specificity)∗(1−prevalence)));

DetectionRate=A/(A+B+C+D); DetectionPrevalence=(A+B)/(A+B+C+D);

BalancedAccuracy=(sensitivity+specificity)/2; Precision=A/(A+B); Recall=A/(A+C);

F1=(1+beta^2)∗precision∗recall/((beta^2∗precision)+recall), where beta = 1 for this function.
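As an illustration, these formulas can be evaluated for the validation confusion matrix of the default tree shown above (with '0' as the positive class, so A = 1792, B = 18, C = 19, D = 171); a sketch:

A <- 1792; B <- 18; C <- 19; D <- 171
(A + D) / (A + B + C + D)                        # accuracy    = 0.9815
sensitivity <- A / (A + C)                       # sensitivity = 0.9895
specificity <- D / (B + D)                       # specificity = 0.9048
precision   <- A / (A + B)
recall      <- sensitivity
2 * precision * recall / (precision + recall)    # F1 with beta = 1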

9.4 Avoiding Overfitting

Overfitting produces poor predictive performance – past a certain point in tree complexity, the
error rate on new data starts to increase.

Stopping tree growth - CHAID
• CHAID, older than CART, uses chi-square statistical test to limit tree growth
• Splitting stops when purity improvement is not statistically significant
One can think of different criteria for stopping the tree growth before it starts overfitting the
data. Examples are tree depth (i.e., number of splits), minimum number of records in a terminal
node, and minimum reduction in impurity. In R's rpart(), for example, we can limit tree growth
with arguments such as maxdepth, minsplit, and the complexity parameter (cp). The problem is
that it is not simple to determine what a good stopping point is using such rules.
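For reference, these stopping rules map onto arguments of rpart.control(); a minimal sketch reusing train.df from the loan example above (the parameter values are illustrative only):

library(rpart)
stopped.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class",
                    control = rpart.control(maxdepth = 5,     # max depth of any node (root = depth 0)
                                            minsplit = 20,    # min records in a node to attempt a split
                                            minbucket = 7,    # min records in a terminal node
                                            cp = 0.01))       # min required improvement for a split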

Pruning the tree

• CART lets tree grow to full extent, then prunes it back


• Idea is to find that point at which the validation error is at a minimum
• Generate successively smaller trees by pruning leaves
• At each pruning stage, multiple trees are possible
• Use cost complexity to choose the best tree at that stage
Which branch to cut at each stage of pruning?

CC(T) = Err(T) + a L(T)

CC(T) = cost complexity of a tree
Err(T) = proportion of misclassified records
L(T) = number of leaves (terminal nodes) of the tree
a = penalty factor attached to tree size (set by user)

• Among trees of given size, choose the one with lowest CC


• Do this for each size of tree (stage of pruning)
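A toy numeric illustration of the cost-complexity trade-off in R (all values are made up for illustration only):

err    <- c(bigger = 0.10, smaller = 0.12)   # Err(T): proportion of misclassified records
leaves <- c(bigger = 6,    smaller = 3)      # L(T): number of leaves
a <- 0.01                                    # penalty per leaf
err + a * leaves                             # CC(T): 0.16 vs 0.15 -> the smaller tree is preferred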

Pruning the tree with the validation data solves the problem of overfitting, but it does not
address the problem of instability. Recall that the CART algorithm may be unstable in choosing
one or another variable for the top-level splits, and this effect then cascades down and produces
highly variable rule sets. The solution is to avoid relying on just one partition of the data into
training and validation. Rather, we do so repeatedly using cross-validation (see below), then
pool the results. Of course, just accumulating a set of different trees with their different rules
will not do much by itself. However, we can use the results from all those trees to learn how deep
to grow the original tree. In this process, we introduce a parameter that can measure, and
control, how deep we grow the tree. We will note this parameter value for each minimum-error
tree in the cross-validation process, take an average, then apply that average to limit tree growth
to this optimal depth when working with new data.

Tree instability

• If 2 or more variables are of roughly equal importance, which one CART chooses for the
first split can depend on the initial partition into training and validation
• A different partition into training/validation could lead to a different initial split
• This can cascade down and produce a very different tree from the first training/validation
partition
• Solution is to try many different training/validation splits – “cross validation”
Cross validation
• Do many different partitions ("folds*") into training and validation, grow & prune a tree for
each
• Problem: We end up with lots of different pruned trees. Which one to choose?
• Solution: Don't choose a tree, choose a tree size:
  – For each iteration, record the cp that corresponds to the minimum validation error
  – Average these cp's
  – With future data, grow the tree to that optimum cp value

*typically folds are non-overlapping, i.e. data used in one validation fold will not be used in
others

Cross validation, “best pruned”
• In the above procedure, we select the cp for the minimum-error tree
• But… simpler is better: a slightly smaller tree might do just as well
• Solution: add a cushion to the minimum error
  – Calculate the standard error of the cv estimate – this gives a rough range for chance variation
  – Add the standard error to the actual error to allow for chance variation
  – Choose the smallest tree within one std. error of the minimum error
  – You can then use the corresponding cp to set cp for future data
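A minimal sketch of this one-standard-error rule using rpart's cptable, assuming cv.ct is a tree grown with built-in cross-validation as in the code that follows:

cpt <- cv.ct$cptable
min.row  <- which.min(cpt[, "xerror"])                      # row with minimum cross-validation error
thresh   <- cpt[min.row, "xerror"] + cpt[min.row, "xstd"]   # add one standard error as a cushion
best.row <- which(cpt[, "xerror"] <= thresh)[1]             # smallest tree within one SE of the minimum
best.pruned.ct <- prune(cv.ct, cp = cpt[best.row, "CP"])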

code for pruning the tree (cv.ct is the cross-validated tree fitted in the Table 9.4 code below)

> #### Figure 9.12


>
> # prune by lower cp
> pruned.ct <- prune(cv.ct,
+ cp = cv.ct$cptable[which.min(cv.ct$cptable[,"xerror"]),"CP"])
> length(pruned.ct$frame$var[pruned.ct$frame$var == "<leaf>"])
[1] 26
> prp(pruned.ct, type = 1, extra = 1, split.font = 1, varlen = -10)

code for tabulating tree error as a function of the complexity parameter (CP)

> #### Table 9.4


>
> # argument xval refers to the number of folds to use in rpart's built-in
> # cross-validation procedure
> # argument cp sets the smallest value for the complexity parameter.
> cv.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class",
+ cp = 0.00001, minsplit = 5, xval = 5)
> # use printcp() to print the table.
> printcp(cv.ct)

Classification tree:
rpart(formula = Personal.Loan ~ ., data = train.df, method = "class",
cp = 1e-05, minsplit = 5, xval = 5)

Variables actually used in tree construction:


[1] Age CCAvg CD.Account Education Family Income
Mortgage Online

Root node error: 291/3000 = 0.097

n= 3000

CP nsplit rel error xerror xstd


1 0.3350515 0 1.000000 1.00000 0.055705
2 0.1340206 2 0.329897 0.36082 0.034591
3 0.0154639 3 0.195876 0.20962 0.026565
4 0.0068729 7 0.134021 0.20275 0.026135
5 0.0051546 12 0.099656 0.19931 0.025917
6 0.0034364 14 0.089347 0.19244 0.025475
7 0.0022910 19 0.072165 0.18900 0.025251
8 0.0000100 25 0.058419 0.21306 0.026777
>

Best-pruned tree obtained by fitting a full tree to the training data, pruning it using the cross-
validation data, and choosing the smallest tree within one standard error of the minimum xerror
tree

> #### Figure 9.13


>
> # prune by lower cp
> pruned.ct <- prune(cv.ct,
+ cp = cv.ct$cptable[which.min(cv.ct$cptable[,"xerror"]),"CP"])
> length(pruned.ct$frame$var[pruned.ct$frame$var == "<leaf>"])
[1] 13
> prp(pruned.ct, type = 1, extra = 1, split.font = 1, varlen = -10)
>

>
> set.seed(1)
> cv.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class", cp = 0.00001, minsplit =
+ 1, xval = 5)
>
> # minsplit is the minimum number of observations in a node for a split to be attempted.
> # xval is number K of folds in a K-fold cross-validation.
>
> printcp(cv.ct)

Classification tree:
rpart(formula = Personal.Loan ~ ., data = train.df, method = "class",
cp = 1e-05, minsplit = 1, xval = 5)

Variables actually used in tree construction:


[1] Age CCAvg CD.Account Education Experience Family Income
Mortgage Online

Root node error: 291/3000 = 0.097

n= 3000

CP nsplit rel error xerror xstd


1 0.3350515 0 1.000000 1.00000 0.055705
2 0.1340206 2 0.329897 0.37457 0.035220
3 0.0154639 3 0.195876 0.19931 0.025917
4 0.0068729 7 0.134021 0.17182 0.024096
5 0.0051546 12 0.099656 0.17182 0.024096
6 0.0034364 14 0.089347 0.17182 0.024096
7 0.0022910 25 0.051546 0.16495 0.023617
8 0.0020619 28 0.044674 0.16495 0.023617
9 0.0017182 38 0.024055 0.16838 0.023858
10 0.0000100 52 0.000000 0.16838 0.023858

> # Print out the cp table of cross-validation errors. The R-squared for a regression tree is
> # 1 minus rel error. xerror (or relative cross-validation error, where "x" stands for "cross")
> # is a scaled version of the overall average of the 5 out-of-sample errors across the 5 folds.
>
> pruned.ct <- prune(cv.ct, cp = 0.0154639)
> prp(pruned.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10,
+ box.col=ifelse(pruned.ct$frame$var == "<leaf>", 'gray', 'white'))
>

>

The tree method can also be used for a numerical outcome variable. Regression trees for
prediction operate in much the same fashion as classification trees. The outcome variable (Y) is
a numerical variable in this case, but both the principle and the procedure are the same: Many
splits are attempted, and for each, we measure “impurity” in each branch of the resulting tree

Regression Trees

Regression Trees for Prediction


• Used with continuous outcome variable
• Procedure similar to classification tree
• Many splits attempted, choose the one that minimizes impurity
Differences from CT
• Prediction is computed as the average of numerical target variable in the rectangle (in CT it
is majority vote)

• Impurity measured by sum of squared deviations from leaf mean


• Performance measured by RMSE (root mean squared error)
Advantages of trees
• Easy to use, understand
• Produce rules that are easy to interpret & implement
• Variable selection & reduction is automatic
• Do not require the assumptions of statistical models
• Can work without extensive handling of missing data
Disadvantage of single trees: instability and poor predictive performance
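A minimal regression-tree sketch, assuming the ToyotaCorolla.csv file used in Problem 3 at the end of these notes (the variable names Price, Age_08_04, KM, and HP are taken from that problem):

library(rpart)
library(rpart.plot)
car.df <- read.csv("ToyotaCorolla.csv")
set.seed(1)
car.train <- sample(1:nrow(car.df), floor(0.6 * nrow(car.df)))
car.train.df <- car.df[car.train, ]
car.valid.df <- car.df[-car.train, ]
# method = "anova" grows a regression tree; the prediction in each leaf is the leaf mean
reg.tree <- rpart(Price ~ Age_08_04 + KM + HP, data = car.train.df, method = "anova")
prp(reg.tree, type = 1, extra = 1)
pred <- predict(reg.tree, car.valid.df)
sqrt(mean((car.valid.df$Price - pred)^2))     # RMSE on the validation set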

9.8 Improving Prediction: Random Forests and Boosted Trees
Notwithstanding the transparency advantages of a single tree as described above, in a pure
prediction application, where visualizing a set of rules does not matter, better performance is
provided by several extensions to trees that combine results from multiple trees. These are
examples of ensembles (see Chapter 13). One popular multitree approach is random forests,
introduced by Breiman and Cutler.1 Random forests are a special case of bagging, a method for
improving predictive power by combining multiple classifiers or prediction algorithms. See
Chapter 13 for further details on bagging.

Random Forests and Boosted Trees


• Examples of "ensemble" methods, "Wisdom of the Crowd" (Chap 13)
• Predictions from many trees are combined
• Very good predictive performance, better than single trees (often the top choice for
predictive modeling)
• Cost: loss of rules you can explain and implement (since you are dealing with many trees, not a
single tree)
• However, RF does produce "variable importance scores" (using information about how
predictors reduce Gini scores over all the trees in the forest)

Random Forests (library randomForest)

1. Draw multiple bootstrap resamples of cases from the data


2. For each resample, use a random subset of predictors and produce a tree
3. Combine the predictions/classifications from all the trees (the “forest”)

• Voting for classification


• Averaging for prediction
code for running a random forest, plotting variable importance plot, and
computing accuracy

Description
randomForest implements Breiman’s random forest algorithm (based on Breiman and Cutler’s original
Fortran code) for classification and regression. It can also be used in unsupervised mode for
assessing proximities among data points.

randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500, mtry=if (!is.null(y) && !is.factor(y))


max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))), replace=TRUE, classwt=NULL, cutoff, strata, sampsize =
if (replace) nrow(x) else ceiling(.632*nrow(x)), nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
maxnodes = NULL, importance=FALSE, localImp=FALSE, nPerm=1, proximity, oob.prox=proximity,
norm.votes=TRUE, do.trace=FALSE, keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
keep.inbag=FALSE, ...)

Arguments:
xtest a data frame or matrix (like x) containing predictors for the test set.
ytest response for the test set.
ntree Number of trees to grow. This should not be set to too small a number, to ensure that every input
row gets predicted at least a few times.
mtry Number of variables randomly sampled as candidates at each split. Note that the default values are
different for classification (sqrt(p) where p is number of variables in x) and regression (p/3)
replace Should sampling of cases be done with or without replacement?
classwt Priors of the classes. Need not add up to one. Ignored for regression.
cutoff (Classification only) A vector of length equal to number of classes. The ‘winning’ class for an
observation is the one with the maximum ratio of proportion of votes to cutoff. Default is 1/k where k is
the number of classes (i.e., majority vote wins).
strata A (factor) variable that is used for stratified sampling.
sampsize Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of
strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be
drawn from the strata.
nodesize Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown
(and thus take less time). Note that the default values are different for classification (1) and regression
(5).
importance Should importance of predictors be assessed?

>
> library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:ggplot2’:

margin

Warning message:
package ‘randomForest’ was built under R version 3.5.3
>
> ## random forest
> rf <- randomForest(as.factor(Personal.Loan) ~ ., data = train.df, ntree = 500,
+                    mtry = 4, nodesize = 5, importance = TRUE)
>

Unlike a single tree, results from a random forest cannot be displayed in a tree-like diagram,
thereby losing the interpretability that a single tree provides. However, random forests can
produce “variable importance” scores, which measure the relative contribution of the different
predictors. The importance score for a particular predictor is computed by summing up the
decrease in the Gini index for that predictor over all the trees in the forest. Figure 9.15 shows the
variable importance plots generated from the random forest model for the personal loan
example. We see that Income and Education have the highest scores, with Family being third.
Importance scores for the other predictors are considerably lower.
> ## variable importance plot
> varImpPlot(rf, type = 1)

>
> ## confusion matrix
> rf.pred <- predict(rf, valid.df)
> table(rf.pred, valid.df$Personal.Loan)

rf.pred 0 1
0 1801 19
1 10 170

>
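The numeric scores behind the variable importance plot can also be inspected with randomForest's importance() function; a brief sketch using the rf object fitted above (MeanDecreaseGini is the Gini-based measure described in the text, MeanDecreaseAccuracy the permutation-based measure plotted by varImpPlot(rf, type = 1)):

imp <- importance(rf)                                     # matrix of importance measures
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]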

Boosted Trees (library adabag)

The second type of multitree improvement is boosted trees. Here a sequence of trees is fitted, so
that each tree concentrates on misclassified records from the previous tree.

Boosting, like RF, is an ensemble method – but uses an iterative approach in which each
successive tree focuses its attention on the misclassified records from the prior tree.

1. Fit a single tree


2. Draw a bootstrap sample of records with higher selection probability for misclassified
records
3. Fit a new tree to the bootstrap sample
4. Repeat steps 2 & 3 multiple times
5. Use weighted voting (classification) or averaging (prediction) with heavier weights for later
trees
> #### Table 9.5
>
>
> library(adabag)
Loading required package: foreach
Loading required package: doParallel
Loading required package: iterators
Loading required package: parallel
> library(rpart)
> library(caret)
>
> train.df$Personal.Loan <- as.factor(train.df$Personal.Loan)
>
> set.seed(1)
> boost <- boosting(Personal.Loan ~ ., data = train.df)
> pred <- predict(boost, valid.df)
> table(pred$class, valid.df$Personal.Loan)

0 1
0 1803 17
1 8 172
>

Table 9.5 shows the result of running a boosted tree on the loan acceptance example that
we saw earlier. We can see that compared to the performance of the single pruned tree
(Table 9.3), the boosted tree has better performance on the validation data in terms of
overall accuracy and especially in terms of correct classification of 1’s—the rare class of
special interest. Where does boosting’s special talent for finding 1’s come from? When
one class is dominant (0’s constitute over 90% of the data here), basic classifiers are
tempted to classify cases as belonging to the dominant class, and the 1’s in this case
constitute most of the misclassifications with the single best-pruned tree. The boosting
algorithm concentrates on the misclassifications (which are mostly 1’s), so it is naturally
going to do well in reducing the misclassification of 1's (from 18 in the single tree to 17 in
the boosted tree, in the validation set).

Description of Bagging and Boosting in Chapter 13


Bagging
Another form of ensembles is based on averaging across multiple random data
samples. Bagging, short for “bootstrap aggregating,” comprises two steps:
1. Generate multiple random samples (by sampling with replacement from the original
data)—this method is called “bootstrap sampling.”
2. Run an algorithm on each sample and produce scores.
Bagging improves the performance stability of a model and helps avoid overfitting by
separately modeling different data samples and then combining the results. It is therefore
especially useful for algorithms such as trees and neural networks.
Boosting
Boosting is a slightly different approach to creating ensembles. Here the goal is to directly
improve areas in the data where our model makes errors, by forcing the model to pay more
attention to those records. The steps in boosting are:
1. Fit a model to the data.
2. Draw a sample from the data so that misclassified records (or records with large prediction
errors) have higher probabilities of selection.
3. Fit the model to the new sample.
4. Repeat Steps 2–3 multiple times.

Bagging and Boosting in R


In Chapter 9, we described random forests, an ensemble based on bagged trees. We
illustrated a random forest implementation for the personal loan example.

The adabag package in R can be used to generate bagged and boosted trees. Tables
13.1 and 13.2 show the R code and output producing a bagged tree and a boosted tree for
the personal loan data, and how they are used to generate classifications for the validation
set.

Example Of Bagging and Boosting Classification Trees on the Personal Loan Data: R code
> library(adabag)
> library(rpart)
> library(caret)
>
> bank.df <- read.csv("UniversalBank.csv")
> bank.df <- bank.df[ , -c(1, 5)] # Drop ID and zip code columns.
>
> # transform Personal.Loan into categorical variable
> bank.df$Personal.Loan = as.factor(bank.df$Personal.Loan)
>
> # partition the data
> train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)
> train.df <- bank.df[train.index, ]
> valid.df <- bank.df[-train.index, ]
>
> # single tree
> tr <- rpart(Personal.Loan ~ ., data = train.df)
> pred <- predict(tr, valid.df, type = "class")
> table(pred, valid.df$Personal.Loan)

pred 0 1
0 1803 13
1 10 174
>

Usage
bagging(formula, data, mfinal = 100, control, par=FALSE,...)

Arguments
formula a formula, as in the lm function.
data a data frame in which to interpret the variables named in the formula
mfinal an integer, the number of iterations for which boosting is run or the number
of trees to use. Defaults to mfinal=100 iterations.
control options that control details of the rpart algorithm. See rpart.control for
more details

> # bagging
> bag <- bagging(Personal.Loan ~ ., data = train.df)
> pred <- predict(bag, valid.df, type = "class")
> table(pred$class, valid.df$Personal.Loan)

0 1
0 1806 25
1 7 162
>

Boosting:
Description Fits the AdaBoost.M1 (Freund and Schapire, 1996) and SAMME (Zhu et al., 2009) algorithms
using classification trees as single classifiers.
Usage
boosting(formula, data, boos = TRUE, mfinal = 100, coeflearn = 'Breiman', control,...)
Arguments
formula a formula, as in the lm function.
data a data frame in which to interpret the variables named in formula.
boos if TRUE (by default), a bootstrap sample of the training set is drawn using the weights for each
observation on that iteration. If FALSE, every observation is used with its weights.
mfinal an integer, the number of iterations for which boosting is run or the number of trees to use.
Defaults to mfinal=100 iterations.
coeflearn if ’Breiman’(by default), alpha=1/2ln((1-err)/err) is used. If ’Freund’ alpha=ln((1-err)/err) is
used. In both cases the AdaBoost.M1 algorithm is used and alpha is the weight updating coefficient. On
the other hand, if coeflearn is ’Zhu’ the SAMME algorithm is implemented with alpha=ln((1-err)/err)+
ln(nclasses-1).
control options that control details of the rpart algorithm. See rpart.control for more details.

> # boosting
> boost <- boosting(Personal.Loan ~ ., data = train.df)
> pred <- predict(boost, valid.df, type = "class")
> table(pred$class, valid.df$Personal.Loan)

0 1
0 1810 18
1 3 169
>

Summary
• Classification and Regression Trees are an easily understandable and transparent method for
predicting or classifying new records
• A single tree is a graphical representation of a set of rules
• Tree growth must be stopped to avoid overfitting of the training data – cross-validation helps
you pick the right cp level to stop tree growth
• Ensembles (random forests, boosting) improve predictive performance, but you lose
interpretability and the rules embodied in a single tree

Problems
1. Competitive Auctions on eBay.com. The file eBayAuctions.csv contains
information on 1972 auctions that transacted on eBay.com during May–June 2004.
The goal is to use these data to build a model that will classify auctions as
competitive or noncompetitive. A competitive auction is defined as an auction with
at least two bids placed on the item auctioned. The data include variables that
describe the item (auction category), the seller (his/her eBay rating), and the
auction terms that the seller selected (auction duration, opening price, currency,
day-of-week of auction close). In addition, we have the price at which the auction
closed. The task is to predict whether or not the auction will be competitive.
Data Preprocessing. Convert variable Duration into a categorical variable. Split the
data into training (60%) and validation (40%) datasets.
a. Fit a classification tree using all predictors, using the best-pruned tree. To avoid
overfitting, set the minimum number of records in a terminal node to 50 (in
R: minbucket = 50). Also, set the maximum number of levels to be displayed at seven
(in R: maxdepth = 7). Write down the results in terms of rules. (Note: If you had to
slightly reduce the number of predictors due to software limitations, or for clarity of
presentation, which would be a good variable to choose?)
b. Is this model practical for predicting the outcome of a new auction?
c. Describe the interesting and uninteresting information that these rules provide.
d. Fit another classification tree (using the best-pruned tree, with a minimum number
of records per terminal node = 50 and maximum allowed number of displayed levels
= 7), this time only with predictors that can be used for predicting the outcome of a
new auction. Describe the resulting tree in terms of rules. Make sure to report the
smallest set of rules required for classification.
e. Plot the resulting tree on a scatter plot: Use the two axes for the two best
(quantitative) predictors. Each auction will appear as a point, with coordinates
corresponding to its values on those two predictors. Use different colors or symbols
to separate competitive and noncompetitive auctions. Draw lines (you can sketch
these by hand or use R) at the values that create splits. Does this splitting seem
reasonable with respect to the meaning of the two predictors? Does it seem to do a
good job of separating the two classes?
f. Examine the lift chart and the confusion matrix for the tree. What can you say about
the predictive performance of this model?
g. Based on this last tree, what can you conclude from these data about the chances of
an auction obtaining at least two bids and its relationship to the auction settings set
by the seller (duration, opening price, ending day, currency)? What would you
recommend for a seller as the strategy that will most likely lead to a competitive
auction?

2. Predicting Delayed Flights. The file FlightDelays.csv contains information


on all commercial flights departing the Washington, DC area and arriving at New
York during January 2004. For each flight, there is information on the departure and
arrival airports, the distance of the route, the scheduled time and date of the flight,
and so on. The variable that we are trying to predict is whether or not a flight is
delayed. A delay is defined as an arrival that is at least 15 minutes later than
scheduled.
Data Preprocessing. Transform variable day of week (DAY_WEEK) into a
categorical variable. Bin the scheduled departure time into eight bins (in R use
function cut()). Use these and all other columns as predictors (excluding
DAY_OF_MONTH). Partition the data into training and validation sets.
a. Fit a classification tree to the flight delay variable using all the relevant predictors.
Do not include DEP_TIME (actual departure time) in the model because it is
unknown at the time of prediction (unless we are generating our predictions of
delays after the plane takes off, which is unlikely). Use a pruned tree with maximum
of 8 levels, setting cp = 0.001. Express the resulting tree as a set of rules.
b. If you needed to fly between DCA and EWR on a Monday at 7:00 AM, would you be
able to use this tree? What other information would you need? Is it available in
practice? What information is redundant?
c. Fit the same tree as in (a), this time excluding the Weather predictor. Display both
the pruned and unpruned tree. You will find that the pruned tree contains a single
terminal node.
i. How is the pruned tree used for classification? (What is the rule for
classifying?)
ii. To what is this rule equivalent?
iii. Examine the unpruned tree. What are the top three predictors according to
this tree?
iv. Why, technically, does the pruned tree result in a single node?
v. What is the disadvantage of using the top levels of the unpruned tree as
opposed to the pruned tree?
vi. Compare this general result to that from logistic regression in the example in
Chapter 10. What are possible reasons for the classification tree’s failure to
find a good predictive model?

3. Predicting Prices of Used Cars (Regression Trees). The
file ToyotaCorolla.csv contains the data on used cars (Toyota Corolla) on sale
during late summer of 2004 in the Netherlands. It has 1436 records containing
details on 38 attributes, including Price, Age, Kilometers, HP, and other
specifications. The goal is to predict the price of a used Toyota Corolla based on
its specifications. (The example in Section 9.19 is a subset of this dataset).
Data Preprocessing. Split the data into training (60%), and validation (40%)
datasets.
a. Run a regression tree (RT) with outcome variable Price and predictors Age_08_04,
KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee,
Guarantee_Period, Airco, Automatic_Airco, CD_Player, Powered_Windows,
Sport_Model, and Tow_Bar. Keep the minimum number of records in a terminal
node to 1, maximum number of tree levels to 30, and cp = 0.001, to make the run
least restrictive.
i. Which appear to be the three or four most important car specifications for
predicting the car’s price?
ii. Compare the prediction errors of the training and validation sets by
examining their RMS error and by plotting the two boxplots. How does the
predictive performance of the validation set compare to the training set?
Why does this occur?
iii. How might we achieve better validation predictive performance at the
expense of training performance?
iv. Create a less deep tree by leaving the arguments cp, minbucket, and
maxdepth at their defaults. Compared to the deeper tree, what is the
predictive performance on the validation set?
b. Let us see the effect of turning the price variable into a categorical
variable. First, create a new variable that categorizes price into 20 bins. Now
repartition the data keeping Binned_Price instead of Price. Run a
classification tree with the same set of input variables as in the RT, and with
Binned_Price as the output variable. As in the less deep regression tree, leave
the arguments cp, minbucket, and maxdepth at their defaults.
i. Compare the tree generated by the CT with the one generated by the less
deep RT. Are they different? (Look at structure, the top predictors, size of
tree, etc.) Why?
ii. Predict the price, using the less deep RT and the CT, of a used Toyota Corolla
with the specifications listed in Table 9.6.
iii. Compare the predictions in terms of the predictors that were used, the
magnitude of the difference between the two predictions, and the
advantages and disadvantages of the two methods.
Table 9.6 Specifications For A Particular Toyota Corolla
Variable Value
Age_08_04 77
KM 117,000
Fuel_Type Petrol
HP 110
Automatic No
Doors 5
Quarterly_Tax 100
Mfr_Guarantee No
Guarantee_Period 3
Airco Yes
Automatic_Airco No
CD_Player No
Powered_Windows No
Sport_Model No
Tow_Bar Yes
