Random Forests
Ensemble methods
We need to make sure the individual learners do not all just learn the same thing.
Bagging
If we split the data in different random ways, decision trees give different results: they have high variance.
Bagging (bootstrap aggregating) is a method that results in lower variance.
If we had multiple realizations of the data (or multiple samples), we could calculate the predictions multiple times and take their average; averaging multiple noisy estimates produces a less uncertain result.
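As an illustration, a minimal bagging sketch in R, using trees from the rpart package and the built-in mtcars data as a stand-in:

library(rpart)
set.seed(1)
B <- 100
preds <- matrix(NA, nrow = nrow(mtcars), ncol = B)
for (b in 1:B) {
  idx <- sample(nrow(mtcars), replace = TRUE)    # bootstrap resample of row indices
  tree <- rpart(mpg ~ ., data = mtcars[idx, ])   # fit one (high-variance) tree
  preds[, b] <- predict(tree, newdata = mtcars)  # predict for every observation
}
bagged <- rowMeans(preds)  # averaging the B trees gives the bagged prediction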
Bootstrap
No cross-validation?
Remember, in bootstrapping we sample with replacement, and therefore not all observations are used for each bootstrap sample. On average about 1/3 of them are not used, since each observation is left out of a given bootstrap sample with probability (1 - 1/n)^n ≈ e^{-1} ≈ 0.37.
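A quick simulation sketch of this fact:

set.seed(1)
n <- 1000
# average fraction of observations that never appear in a bootstrap sample
mean(replicate(500, {
  boot <- sample(n, replace = TRUE)
  mean(!(1:n %in% boot))
}))
# about 0.368, i.e. roughly 1/3 of the observations are out-of-bag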
These left-out observations are called out-of-bag (OOB) samples.
We can predict the response for the i-th observation using each of the trees in which that observation was OOB, and do this for all n observations.
Then we calculate the overall OOB MSE or classification error.
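With the randomForest package this bookkeeping is done automatically; a minimal sketch, using the built-in mtcars data as a stand-in, of computing the OOB error from the fitted object:

library(randomForest)
set.seed(1)
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500)
# fit$predicted holds, for each observation, the prediction made only by
# trees for which that observation was out-of-bag
oob_mse <- mean((mtcars$mpg - fit$predicted)^2)
oob_mse  # the OOB estimate of the MSE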
Bagging: correlated trees?
Suppose that there is one very strong predictor in the data set, along with a
number of other moderately strong predictors.
Then all bagged trees will select the strong predictor at the top of the tree and
therefore all trees will look similar.
How do we avoid this?
What if we consider only a subset of the predictors at each split?
We will still get correlated trees unless … we randomly select the subset!
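One way to see this with the randomForest package: setting mtry equal to the number of predictors gives plain bagging, while a smaller mtry gives a random forest. A small sketch on the built-in mtcars data:

library(randomForest)
set.seed(1)
p <- ncol(mtcars) - 1                                           # 10 predictors of mpg
bag <- randomForest(mpg ~ ., data = mtcars, mtry = p)           # mtry = p: plain bagging
rf  <- randomForest(mpg ~ ., data = mtcars, mtry = floor(p/3))  # smaller mtry: random forest
c(bagging = tail(bag$mse, 1), random_forest = tail(rf$mse, 1))  # compare final OOB MSE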
Random Forest, Ensemble Model
The random forest (Breiman, 2001) is an ensemble approach that can also be
thought of as a form of nearest neighbor predictor.
Ensembles are a divide-and-conquer approach used to improve performance. The
main principle behind ensemble methods is that a group of “weak learners” can
come together to form a “strong learner”.
Trees and Forests
The random forest starts with a standard machine learning technique called a
“decision tree” which, in ensemble terms, corresponds to our weak learner. In a
decision tree, an input is entered at the top and as it traverses down the tree the
data gets bucketed into smaller and smaller sets.
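A single tree, our weak learner, can be grown and inspected directly; a small sketch using R's built-in iris data:

library(rpart)
tree <- rpart(Species ~ ., data = iris)  # a single classification tree
print(tree)  # shows how each split sends the data into smaller and smaller buckets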
Random Forest
For b = 1 to B:
(a) Draw a bootstrap sample Z∗ of size N from the training data.
(b) Grow a random-forest tree to the bootstrapped data, by recursively repeating the
following steps for each terminal node of the tree, until the minimum node size
nmin is reached.
i. Select m variables at random from the p variables.
ii. Pick the best variable/split-point among the m.
iii. Split the node into two daughter nodes.
Output the ensemble of trees.
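These steps map directly onto the arguments of randomForest(); a minimal sketch using the built-in mtcars data as a stand-in training set:

library(randomForest)
set.seed(1)
fit <- randomForest(mpg ~ ., data = mtcars,
                    ntree = 500,   # B: number of bootstrap trees
                    mtry = 3,      # m: variables sampled at each split
                    nodesize = 5)  # nmin: minimum size of terminal nodes
fit  # prints a summary of the ensemble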
Random Forest Algorithm
When a new input is entered into the system, it is run down all of the trees. The
result may either be an average or weighted average of all of the terminal nodes
that are reached, or, in the case of categorical variables, a voting majority.
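With the randomForest package the per-tree outputs that get averaged (or voted over) can be inspected directly; a small sketch, again on the built-in mtcars data:

library(randomForest)
set.seed(1)
fit  <- randomForest(mpg ~ ., data = mtcars, ntree = 100)
pred <- predict(fit, newdata = mtcars, predict.all = TRUE)
# pred$individual has one column per tree; pred$aggregate is their average
all.equal(unname(rowMeans(pred$individual)), unname(pred$aggregate))  # should be TRUE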
Note that:
With a large number of predictors, the eligible predictor set will be quite different
from node to node.
The greater the inter-tree correlation, the greater the random forest error rate, so
one pressure on the model is to have the trees as uncorrelated as possible.
As m goes down, both inter-tree correlation and the strength of individual trees go
down. So some optimal value of m must be discovered.
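The randomForest package provides tuneRF() for exactly this search over m; a minimal sketch on the built-in mtcars data:

library(randomForest)
set.seed(1)
# search over mtry around the default, comparing OOB error at each value
tuneRF(x = mtcars[, -1], y = mtcars$mpg,
       ntreeTry = 200, stepFactor = 1.5, improve = 0.01)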
Differences to a standard tree
Train each tree on a bootstrap resample of the data (a bootstrap resample of a data set with N samples is a new data set made by drawing N samples with replacement; some samples will typically occur multiple times in the new data set).
For each split, consider only m randomly selected variables.
Don't prune.
Fit B trees in this way and use the average or majority vote to aggregate the results.
Random Forests Tuning
library(e1071)
library(caret)

Confusion matrix output:

           Reference
Prediction  no yes
       no  100   0
       yes   0 100

Accuracy : 1
95% CI : (0.9817, 1)
Kappa : 1
Sensitivity : 1.0
Specificity : 1.0
Prevalence : 0.5
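Such a table is produced by caret's confusionMatrix(). A minimal sketch of tuning a random forest with caret; the data frame names train_df and test_df (with a factor response y) are hypothetical:

library(caret)
library(randomForest)
set.seed(101)
# train_df / test_df are hypothetical data frames with a factor response y
ctrl <- trainControl(method = "cv", number = 5)
rf_fit <- train(y ~ ., data = train_df, method = "rf",
                tuneGrid = expand.grid(mtry = c(2, 4, 6)),
                trControl = ctrl)
pred <- predict(rf_fit, newdata = test_df)
confusionMatrix(pred, test_df$y)  # prints a table like the one above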
Let's use the Boston dataset
require(randomForest)
require(MASS)  # package which contains the Boston housing dataset
attach(Boston)
set.seed(101)
?Boston  # open the documentation for the dataset
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.
We are going to use the variable 'medv' as the response variable, which is the median housing value.
We will fit 500 trees.
dim(Boston)
[1] 506 14
# training sample with 300 observations
train = sample(1:nrow(Boston), 300)
Fitting the Random Forest
We will use all the predictors in the dataset.
Boston.rf = randomForest(medv ~ ., data = Boston, subset = train)
Boston.rf
Number of trees: 500
No. of variables tried at each split: 4
oob.err = double(13)   # to hold OOB MSE for mtry = 1..13
test.err = double(13)  # to hold test-set MSE for mtry = 1..13
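One way such a loop is commonly written, following the setup above (500 trees per forest, mtry ranging over the 13 predictors); a sketch:

for (mtry in 1:13) {
  fit <- randomForest(medv ~ ., data = Boston, subset = train,
                      mtry = mtry, ntree = 500)
  oob.err[mtry]  <- fit$mse[500]                          # OOB MSE with 500 trees
  pred <- predict(fit, Boston[-train, ])
  test.err[mtry] <- mean((Boston$medv[-train] - pred)^2)  # held-out test MSE
}
matplot(1:13, cbind(oob.err, test.err), pch = 19, type = "b",
        col = c("red", "blue"), xlab = "mtry", ylab = "Mean squared error")
legend("topright", legend = c("OOB", "Test"), pch = 19, col = c("red", "blue"))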