Classification
Random Forest
Random Forest
#Random Forest model
modelrf <- randomForest(as.factor(left) ~ . , data = trainSplit, importance = TRUE)
modelrf
The random forest output shows that 500 trees were built and that 3 variables were tried at each split.
The out-of-bag (OOB) estimate of the generalization error is the error rate of the out-of-bag classifier on the training set.
The OOB estimate is about as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
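As a quick check, the OOB estimate can also be read directly from the fitted object. A minimal sketch, assuming the modelrf object built above with the randomForest package loaded:
#Final OOB error rate after all trees (err.rate has one row per tree and an 'OOB' column)
modelrf$err.rate[modelrf$ntree, "OOB"]
#Error rate versus number of trees; a flat OOB curve suggests enough trees were grown
plot(modelrf)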
Random Forest
#Checking variable importance in Random Forest
importance(modelrf)
varImpPlot(modelrf)
The variable importance plot shows the predictors sorted by MeanDecreaseGini: variables with a larger mean decrease in Gini impurity contribute more to the classification.
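For a tabular view of the same ranking, the importance matrix can be sorted by MeanDecreaseGini. A minimal sketch, assuming modelrf from the previous slide:
#Sort variables so the most important ones appear first
imp <- importance(modelrf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]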
Random Forest
# Prediction and Model Evaluation using Confusion Matrix
predrf_tr <- predict(modelrf, trainSplit) #Train Data
predrf_test <- predict(modelrf, testSplit) #Test Data
confusionMatrix(predrf_tr,trainSplit$left) #Train Data
confusionMatrix(predrf_test,testSplit$left) #TestData
The confusion matrix on the train data gives an accuracy of 99%, and the confusion matrix on the test data also gives an accuracy of about 99%.
As we observe, the model shows similar performance on the train and test data, which assures us of the stability of our Random Forest model.
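To put the stability check in numbers, the overall accuracies can be pulled out of the two confusion matrices and compared side by side. A small sketch, assuming the predictions and splits from the code above:
#Store the confusion matrices and compare train vs. test accuracy
cm_rf_tr <- confusionMatrix(predrf_tr, trainSplit$left)
cm_rf_test <- confusionMatrix(predrf_test, testSplit$left)
round(c(train = cm_rf_tr$overall["Accuracy"],
        test = cm_rf_test$overall["Accuracy"]), 4)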
Comparing ROC curves for Decision Tree and Random Forest
# Model Evaluation using ROC curves
#Decision Tree ROC (predtest holds the decision tree predictions on testSplit from the earlier slides)
auc1 <- roc(as.numeric(testSplit$left), as.numeric(predtest))
plot(auc1, col = 'blue', main = paste('AUC:', round(auc1$auc[[1]], 3)))
#Random Forest ROC
aucrf <- roc(as.numeric(testSplit$left), as.numeric(predrf_test), ci = TRUE)
plot(aucrf, ylim = c(0,1), print.auc = TRUE,
     main = paste('Random Forest AUC:', round(aucrf$auc[[1]], 3)), col = 'blue')
#Comparing both ROC curves
plot(aucrf, ylim = c(0,1), main = 'ROC Comparison: RF (blue), C5.0 (black)', col = 'blue')
par(new = TRUE)
plot(auc1)
The ROC curve for the Random Forest is better than that for the Decision Tree.
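pROC can also compare the two curves formally. A minimal sketch, assuming the auc1 and aucrf objects built above, that reports both AUC values and a DeLong test for the difference between them:
#AUC values of the two models
auc(auc1)
auc(aucrf)
#DeLong test on the paired ROC curves; a small p-value indicates the AUCs differ significantly
roc.test(aucrf, auc1)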
Classification Model
Naïve Bayes
Naïve Bayes
#Naive Bayes
modelnb <- naiveBayes(as.factor(left) ~ . , data = trainSplit)
modelnb
The output shows the a-priori probabilities of the target classes and the conditional probabilities of each predictor given the class.
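The a-priori probabilities are simply the class proportions of left in the training data, which can be verified directly. A quick check, assuming the trainSplit data from the earlier split:
#Class proportions of the target; these should match the a-priori probabilities in the model output
prop.table(table(as.factor(trainSplit$left)))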
Naïve Bayes
#Performance of Naïve Bayes using Confusion Matrix
prednb_tr <- predict(modelnb,trainSplit) #Train Data
prednb_test <- predict(modelnb,testSplit) #Test Data
confusionMatrix(prednb_tr,trainSplit$left) #Train Data
confusionMatrix(prednb_test,testSplit$left) #Test Data
The confusion matrix on the train data gives an accuracy of 78.84%, and the confusion matrix on the test data gives an accuracy of 78.58%.
As we observe, the model shows similar performance on the train and test data, which assures us of the stability of our Naïve Bayes model.
Classification Model
kNN Algorithm
kNN Algorithm
#Data Preparation for kNN Algorithm
library(dummies)
#Creating dummy variables for the factor variables (role_code and salary)
dummy_df = dummy.data.frame(hr_data1[, c('role_code', 'salary')])
hr_data2 = hr_data1
hr_data2 = data.frame(hr_data2, dummy_df)
#Removing role_code and salary since we have created dummy variables for them
hr_data2 = hr_data2[, !(names(hr_data2) %in% c('role_code', 'salary'))]
#Converting variables to numeric datatype
hr_data2$Work_accident = as.numeric(hr_data2$Work_accident)
hr_data2$promotion_last_5years = as.numeric(hr_data2$promotion_last_5years)
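Before scaling, it is worth confirming that the dummy columns were added and that every column is now numeric, since kNN works on numeric distances. A quick check, assuming hr_data2 as built above:
#All remaining columns should be numeric (original numeric features plus the dummy variables)
str(hr_data2)
sapply(hr_data2, is.numeric)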
kNN Algorithm
#Data Preparation for kNN Algorithm
#Scale the variables and check their final structure
X = hr_data2[, !(names(hr_data2) %in% c('left'))]
hr_data2_scaled = as.data.frame(scale(X))
str(hr_data2_scaled)
#Splitting the data for the model building
hr_train <- hr_data2_scaled[splitIndex,]
hr_test <- hr_data2_scaled[-splitIndex,]
hr_train_labels <- hr_data2[splitIndex, 'left']
hr_test_labels <- hr_data2[-splitIndex, 'left']
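Because scale() standardizes each column to mean 0 and standard deviation 1, a quick sanity check on the scaled training features confirms the preprocessing. A minimal sketch, assuming hr_train from above (the means are only approximately 0 here because scaling was done before the train/test split):
#Column means should be close to 0 and standard deviations close to 1
round(colMeans(hr_train), 3)
round(apply(hr_train, 2, sd), 3)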
kNN Algorithm
#Applying kNN Algorithm on the dataset
library(class)
library(gmodels)
test_pred_1 <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k=1)
CrossTable(x = hr_test_labels, y = test_pred_1, prop.chisq = FALSE)
From this crosstab, we can compute the accuracy of the model for k = 1:
Accuracy = (TP+TN)/Total
= (3311+1030)/4499
= 96.48%
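The same accuracy can be computed directly in R instead of reading the counts off the crosstab. A short sketch, assuming test_pred_1 and hr_test_labels from above:
#Accuracy as the proportion of test predictions that match the true labels
mean(test_pred_1 == hr_test_labels)
#Equivalently, the sum of the diagonal of the confusion table over the total
tab <- table(hr_test_labels, test_pred_1)
sum(diag(tab)) / sum(tab)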
kNN Algorithm
#Applying kNN Algorithm on the dataset
As we calculated the accuracy for k = 1, we similarly calculate it for k = 5, 10, 50, 100, and 122.
The accuracies for these k values are summarized below:
K Accuracy
5 94.46%
10 94.17%
50 90.19%
100 86.48%
122 85.06%
From the above accuracy table, we can observe that as the k value increases, the accuracy goes down.
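The accuracies in the table can be reproduced with a short loop over the candidate k values. A sketch, assuming hr_train, hr_test and the label vectors from the earlier slides (exact numbers may differ slightly because knn breaks ties at random):
#Evaluate kNN accuracy for several values of k
for (k_val in c(5, 10, 50, 100, 122)) {
  pred_k <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = k_val)
  cat("k =", k_val, " accuracy =", round(mean(pred_k == hr_test_labels) * 100, 2), "%\n")
}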
kNN Algorithm
# Thumb rule to decide on k for k-NN is sqrt(n)/2
k = sqrt(nrow(hr_train))/2
k
#51.2347, which can be approximated to 51
test_pred_rule <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = round(k))
CrossTable(x = hr_test_labels, y = test_pred_rule, prop.chisq = FALSE)
# accuracy = 4050/4499 = 90.02%
# Another method to determine the k for k-NN
set.seed(400)
ct <- trainControl(method = "repeatedcv", repeats = 3)
fit <- train(as.factor(left) ~ ., data = hr_data2, method = "knn", trControl = ct,
             preProcess = c("center", "scale"), tuneLength = 20)
fit
# Checking accuracy of the model with k = 7
test_pred_7 <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k=7)
CrossTable(x = hr_test_labels, y = test_pred_7, prop.chisq = FALSE)
# accuracy = 4357/4499 = 96.84%
#or alternatively we can use the command below
confusionMatrix(test_pred_7, as.factor(hr_test_labels))
Output on the next slide.
The caret output indicates that k = 7 is the best value for this data, and it is preferable to go with this value because it has been selected through cross-validation.
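The chosen k and the cross-validated accuracy profile can also be read straight from the caret fit object. A minimal sketch, assuming the fit object from the train() call above:
#Best tuning parameter selected by repeated cross-validation
fit$bestTune
#Cross-validated accuracy for each candidate k tried during tuning
plot(fit)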
Step 6
Model Summarization
Summary of Model Performance
Model                          Accuracy
Decision Tree                  97.09%
Random Forest                  99%
Naïve Bayes                    78.84%
kNN Algorithm (using k = 7)    96.84%
Appendix
Packages used for the Classification Analysis:
•[Link]
•reshape2
•randomForest
•party # For decision tree
•rpart # for Rpart
•rpart.plot # for rpart plots
•lattice # Used for Data Visualization
•caret # for data pre-processing
•pROC # for ROC curve
•corrplot # for correlation plot
•e1071 # for Naïve Bayes and the confusion matrix
•RColorBrewer
•dummies
•class
•gmodels
Thank You.