Random Forest Reference Code

The document discusses random forest classification models. It shows a random forest built with 500 trees, trying 3 randomly selected variables at each split. The out-of-bag (OOB) error estimate gives an assessment of the model's performance without a separate test set. Variable importance is assessed with a variable importance plot that sorts variables by MeanDecreaseGini. The random forest achieves 99% accuracy on both the training and test sets, which suggests the model is stable.

Uploaded by

Rajat Shetty

Classification

Random Forest

#Random Forest model
modelrf <- randomForest(as.factor(left) ~ . , data = trainSplit, [Link]=T)
modelrf

The model output shows that the forest contains 500 trees and that 3 randomly selected variables were tried at each split.
The out-of-bag (OOB) estimate of the generalization error is the error rate obtained when each training observation is classified only by the trees that did not see it in their bootstrap sample.
The OOB estimate is reported to be about as accurate as using a test set of the same size as the training set, so it can reduce the need for a set-aside test set.
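To see why an out-of-bag estimate comes for free, it helps to look at a single bootstrap sample. The sketch below uses only base R and a hypothetical training-set size n (not taken from the slides). It draws one bootstrap sample and reports the share of rows that the corresponding tree would never see, which should be close to exp(-1), roughly 37%.

```r
# Minimal sketch (base R only): how many rows are "out of bag" for one tree?
# n is a hypothetical training-set size chosen for illustration.
set.seed(1)
n <- 10000

# Each tree in a random forest is grown on a bootstrap sample:
# n rows drawn with replacement from the training data.
in_bag <- sample.int(n, size = n, replace = TRUE)

# Rows that were never drawn are out of bag (OOB) for this tree,
# so they can serve as test rows for it.
oob_rows <- setdiff(seq_len(n), in_bag)
oob_fraction <- length(oob_rows) / n

oob_fraction    # observed share of OOB rows
(1 - 1 / n)^n   # probability that a given row is never drawn, close to exp(-1)
```

On a fitted classification forest such as modelrf, the randomForest package stores the running OOB error in modelrf$err.rate, so the OOB figure printed by the model can also be read from there.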

#Checking variable importance in Random Forest
importance(modelrf)

varImpPlot(modelrf)

The variable importance plot displays the variables sorted by MeanDecreaseGini.
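MeanDecreaseGini is built from the drop in Gini impurity that a variable achieves when it is used to split a node. As a rough illustration, the base R sketch below computes that drop for one split. The labels and the split are made-up toy data, and gini() is a helper written for this example, not a function from randomForest.

```r
# Minimal sketch (base R only): the Gini impurity decrease produced by one split.
# The labels and the split below are made-up toy data, not the HR dataset.
gini <- function(y) {
  p <- table(y) / length(y)   # class proportions in the node
  1 - sum(p^2)                # Gini impurity: 0 means the node is pure
}

left_flag <- c(1, 1, 1, 0, 0, 0, 0, 0)                 # hypothetical class labels
goes_left <- c(TRUE, TRUE, TRUE, TRUE,
               FALSE, FALSE, FALSE, FALSE)             # hypothetical split

parent  <- gini(left_flag)
child_l <- gini(left_flag[goes_left])
child_r <- gini(left_flag[!goes_left])

# Impurity of the children, weighted by the share of rows in each child
weighted_children <- mean(goes_left) * child_l + mean(!goes_left) * child_r

# The decrease credited to the variable used for this split
gini_decrease <- parent - weighted_children
gini_decrease
```

Broadly speaking, the forest accumulates such decreases over every split made on a variable and averages them over the trees, so variables that often produce purer child nodes appear at the top of the plot.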

# Prediction and Model Evaluation using Confusion Matrix
predrf_tr <- predict(modelrf, trainSplit) #Train Data
predrf_test <- predict(modelrf, testSplit) #Test Data

confusionMatrix(predrf_tr,trainSplit$left) #Train Data


confusionMatrix(predrf_test,testSplit$left) #TestData

The confusion matrix on the Train data gives an accuracy of 99%.
The confusion matrix on the Test data also gives an accuracy of 99%.

The model performs similarly on the Train and Test data, which suggests that our Random Forest model is stable.
Comparing ROC curves for Decision Tree and Random Forest

# ROC curves and AUC for the Decision Tree and Random Forest models

#Decision Tree ROC
auc1 <- roc(as.numeric(testSplit$left), as.numeric(predtest))
plot(auc1, col = 'blue', main = paste('AUC:', round(auc1$auc[[1]], 3)))

#Random Forest ROC
aucrf <- roc(as.numeric(testSplit$left), as.numeric(predrf), ci=TRUE)
plot(aucrf, ylim=c(0,1), [Link]=TRUE,
     main=paste('Random Forest AUC:', round(aucrf$auc[[1]], 3)), col = 'blue')

#Comparing both ROC curves
plot(aucrf, ylim=c(0,1), main='ROC Comparison: RF (blue), C5.0 (black)', col = 'blue')
par(new = TRUE)
plot(auc1)

The ROC curve for the Random Forest is better than that of the Decision Tree.
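The AUC values used in this comparison can be understood without any plotting: the AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The base R sketch below computes it through the rank-sum identity. The labels and scores are made-up toy values, and auc_rank() is a helper written for this example, not part of pROC.

```r
# Minimal sketch (base R only): AUC via the rank-sum (Mann-Whitney) identity.
# Labels and scores are made-up toy values, not output from the models above.
auc_rank <- function(labels, scores) {
  r     <- rank(scores)        # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # Sum of positive ranks, minus the smallest value that sum could take,
  # divided by the number of positive/negative pairs
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.70)

auc_value <- auc_rank(labels, scores)
auc_value   # 8 of the 9 positive/negative pairs are ordered correctly
```

Scores that take many distinct values, such as predicted class probabilities, trace out a full curve. Hard class labels give a ROC curve with a single interior point, so the AUC comparison is usually more informative when probabilities are passed to roc().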

Classification Model
Naïve Bayes

#Naive Bayes
modelnb <- naiveBayes(as.factor(left) ~ . , data = trainSplit)
modelnb

The model output lists the a priori probabilities of each class of left, followed by the conditional probabilities of each predictor given the class.
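The two kinds of probabilities in a Naïve Bayes model can be reproduced by hand with simple frequency tables. The base R sketch below does this for a made-up data frame; the column names and values are hypothetical and are not taken from the HR dataset.

```r
# Minimal sketch (base R only): a priori class probabilities and one
# conditional probability table. toy is a made-up data frame.
toy <- data.frame(
  left   = factor(c("0", "0", "0", "1", "1", "0", "1", "0")),
  salary = factor(c("low", "high", "low", "low", "low", "high", "low", "high"))
)

# A priori probabilities: the share of each class in the training data
priors <- prop.table(table(toy$left))
priors

# Conditional probabilities P(salary | left): each row sums to 1
cond <- prop.table(table(toy$left, toy$salary), margin = 1)
cond
```

For a categorical predictor, Naïve Bayes combines the prior of a class with the matching entries of such conditional tables, and predicts the class with the largest product.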
#Performance of Naïve Bayes using Confusion Matrix
prednb_tr <- predict(modelnb,trainSplit) #Train Data
prednb_test <- predict(modelnb,testSplit) #Test Data

confusionMatrix(prednb_tr,trainSplit$left) #Train Data


confusionMatrix(prednb_test,testSplit$left) #Test Data

The confusion matrices on the Train and Test data give accuracies of 78.84% and 78.58%, respectively.

The model performs similarly on the Train and Test data, which suggests that our Naïve Bayes model is stable.

Classification Model
kNN Algorithm

#Data Preparation for kNN Algorithm

library(dummies)
#Creating dummy variables for Factor variable
dummy_df = dummy.data.frame(hr_data1[, c('role_code', '[Link]')])

hr_data2 = hr_data1
hr_data2 = data.frame(hr_data2, dummy_df)

#Removing role_code and [Link] since we have created dummy variables
hr_data2 = hr_data2[, !(names(hr_data2) %in% c('role_code', '[Link]'))]

#Converting variables to numeric datatype
hr_data2$Work_accident = as.numeric(hr_data2$Work_accident)
hr_data2$promotion_last_5years = as.numeric(hr_data2$promotion_last_5years)

#Scale the variables and check their final structure
X = hr_data2[, !(names(hr_data2) %in% c('left'))]
hr_data2_scaled = as.data.frame(scale(X))

str(hr_data2_scaled)

#Splitting the data for the model building


hr_train <- hr_data2_scaled[splitIndex,]
hr_test <- hr_data2_scaled[-splitIndex,]

hr_train_labels <- hr_data2[splitIndex, 'left']


hr_test_labels <- hr_data2[-splitIndex, 'left']

#Applying kNN Algorithm on the dataset
library(class)
library(gmodels)

test_pred_1 <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k=1)


CrossTable(x=hr_test_labels ,y=test_pred_1 ,prop.chisq = FALSE)

From this crosstab, we can compute the accuracy of the model for k = 1.

Accuracy = (TP+TN)/Total
         = (3311+1030)/4499
         = 96.49%
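The same calculation can be scripted rather than read off the crosstab by hand. The base R sketch below first shows the general pattern on made-up predictions (the toy vectors are not model output), then repeats the arithmetic with the three counts quoted on the slide. The off-diagonal cells of the slide's crosstab are not given, so only their sum can be derived.

```r
# Minimal sketch (base R only): accuracy from a confusion table.
# actual and predicted are made-up toy vectors, not the model output.
actual    <- factor(c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 1, 1, 1, 0, 0, 1, 0), levels = c(0, 1))

cm <- table(Predicted = predicted, Actual = actual)
cm

# Accuracy = correctly classified cases (the diagonal) / all cases
toy_accuracy <- sum(diag(cm)) / sum(cm)
toy_accuracy

# The same formula with the counts quoted on the slide for k = 1
correct       <- 3311 + 1030        # TN + TP, the diagonal of the crosstab
total         <- 4499               # number of test observations
k1_accuracy   <- correct / total
misclassified <- total - correct    # sum of the two off-diagonal cells

round(100 * k1_accuracy, 2)
misclassified
```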

Having calculated the accuracy for k = 1, we calculate it in the same way for k = 5, 10, 50, 100 and 122.
The accuracies for these k values are summarized below.

K      Accuracy
5      94.46%
10     94.17%
50     90.19%
100    86.48%
122    85.06%

From the above accuracy table, we can observe that the accuracy goes down as the k value increases.
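A table like this can be produced in one pass by looping over the candidate k values. The sketch below is a stand-in, not the code behind the slide: it uses the built-in iris data instead of the HR data, a hypothetical 100/50 train/test split, and a shorter list of k values because the training set is much smaller. It relies on knn() from the class package, as in the slides.

```r
# Minimal sketch: test accuracy of k-NN for several values of k.
# iris stands in for the HR data; the split and the k values are illustrative.
library(class)

set.seed(42)
X   <- as.data.frame(scale(iris[, 1:4]))   # scale the features, as in the slides
idx <- sample(nrow(X), size = 100)         # hypothetical 100/50 train/test split

train_x <- X[idx, ]
test_x  <- X[-idx, ]
train_y <- iris$Species[idx]
test_y  <- iris$Species[-idx]

k_values <- c(1, 5, 10, 50)
acc <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)                     # share of correct predictions
})

data.frame(k = k_values, accuracy = round(100 * acc, 2))
```

Very large k values average over many distant neighbours and drift towards predicting the majority class, which is consistent with the falling accuracy seen in the table.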

# Rule of thumb to decide on k for k-NN: sqrt(n)/2
k = sqrt(nrow(hr_train))/2
k
#51.2347 (which can be approximated to 51)
k = round(k)

test_pred_rule <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k=k)
CrossTable(x=hr_test_labels ,y=test_pred_rule ,prop.chisq = FALSE)
# accuracy = 4050/4499 = 90.02%

# Another method to determine the k for k-NN: repeated cross-validation with caret
set.seed(400)
ct <- trainControl(method="repeatedcv",repeats = 3)
fit <- train(left ~ ., data = hr_data2, method = "knn", trControl = ct,
             preProcess = c("center","scale"), tuneLength = 20)
fit

# Checking accuracy of the model with k = 7
test_pred_7 <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k=7)
CrossTable(x=hr_test_labels ,y=test_pred_7 ,prop.chisq = FALSE)
# accuracy = 4357/4499 = 96.84%

#or alternatively we can use this below command
#(confusionMatrix takes the predictions first, then the reference labels)
confusionMatrix(test_pred_7, hr_test_labels)


The output of the above code indicates that k = 7 is the best value for this data. It is better to go with this value because it has been cross-validated.
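To make the idea of a cross-validated k concrete, the sketch below carries out a plain 5-fold cross-validation by hand. It is a simplified stand-in for caret's repeated cross-validation, not a reproduction of it: the data (iris), the number of folds and the grid of k values are all assumptions chosen for illustration.

```r
# Minimal sketch: choosing k for k-NN by 5-fold cross-validation.
# iris, n_folds and k_grid are illustrative assumptions, not the slide's setup.
library(class)

set.seed(400)
X <- as.data.frame(scale(iris[, 1:4]))
y <- iris$Species

n_folds <- 5
fold <- sample(rep(seq_len(n_folds), length.out = nrow(X)))  # random fold labels

k_grid <- c(1, 3, 5, 7, 9, 11)
cv_acc <- sapply(k_grid, function(k) {
  fold_acc <- sapply(seq_len(n_folds), function(f) {
    # Train on every fold except f, predict the held-out fold f
    pred <- knn(train = X[fold != f, ], test = X[fold == f, ],
                cl = y[fold != f], k = k)
    mean(pred == y[fold == f])
  })
  mean(fold_acc)   # average accuracy over the held-out folds
})

best_k <- k_grid[which.max(cv_acc)]
data.frame(k = k_grid, cv_accuracy = round(cv_acc, 4))
best_k
```

Because every observation is held out exactly once, the chosen k does not depend on one particular train/test split. For brevity the sketch scales the full data once up front, whereas the preProcess argument in caret applies centering and scaling within the resampling.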

Step 6: Model Summarization

Summary of Model Performance

Model                          Accuracy
Decision Tree                  97.09%
Random Forest                  99%
Naïve Bayes                    78.84%
kNN Algorithm (Using k = 7)    96.84%

Appendix
Packages used for the Classification Analysis:

•[Link]
•reshape2
•randomForest
•party # for decision tree
•rpart # for rpart decision trees
•rpart.plot # for plotting rpart trees
•lattice # for data visualization
•caret # for data pre-processing, model tuning and confusion matrix
•pROC # for ROC curve
•corrplot # for correlation plot
•e1071 # for Naïve Bayes
•RColorBrewer
•dummies
•class
•gmodels

Thank You.