Classification
Random Forest
Random Forest
#Random Forest model
modelrf <- randomForest(as.factor(left) ~ . , data = trainSplit, importance = TRUE)
modelrf
The random forest output shows that 500 trees were built and that 3 variables were tried at each split.
The out-of-bag (OOB) estimate of the generalization error is the error rate of the out-of-bag classifier on the training set.
The OOB estimate is about as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
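As a quick check, the OOB estimate can also be read directly from the fitted object. A minimal sketch, assuming the modelrf object built above with the randomForest package loaded:
#Final OOB error rate after all trees (err.rate has one row per tree and an 'OOB' column)
modelrf$err.rate[modelrf$ntree, "OOB"]
#Error rate versus number of trees; a flat OOB curve suggests enough trees were grown
plot(modelrf)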
Random Forest
#Checking variable importance in Random Forest
importance(modelrf)
varImpPlot(modelrf)
The variable importance plot shows the predictors sorted by MeanDecreaseGini: variables with a larger mean decrease in Gini impurity contribute more to the classification.
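For a tabular view of the same ranking, the importance matrix can be sorted by MeanDecreaseGini. A minimal sketch, assuming modelrf from the previous slide:
#Sort variables so the most important ones appear first
imp <- importance(modelrf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]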
Random Forest
# Prediction and Model Evaluation using Confusion Matrix
predrf_tr <- predict(modelrf, trainSplit) #Train Data
predrf_test <- predict(modelrf, testSplit) #Test Data
confusionMatrix(predrf_tr,trainSplit$left) #Train Data
confusionMatrix(predrf_test,testSplit$left) #TestData
The confusion matrix on the train data gives an accuracy of 99%, and the confusion matrix on the test data also gives an accuracy of about 99%.
As we observe, the model shows similar performance on the train and test data, which assures us of the stability of our Random Forest model.
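To put the stability check in numbers, the overall accuracies can be pulled out of the two confusion matrices and compared side by side. A small sketch, assuming the predictions and splits from the code above:
#Store the confusion matrices and compare train vs. test accuracy
cm_rf_tr <- confusionMatrix(predrf_tr, trainSplit$left)
cm_rf_test <- confusionMatrix(predrf_test, testSplit$left)
round(c(train = cm_rf_tr$overall["Accuracy"],
        test = cm_rf_test$overall["Accuracy"]), 4)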
Comparing ROC curves for Decision Tree and Random Forest
# Model Evaluation using ROC curves
#Decision Tree ROC (predtest holds the decision tree predictions on testSplit from the earlier slides)
auc1 <- roc(as.numeric(testSplit$left), as.numeric(predtest))
plot(auc1, col = 'blue', main = paste('AUC:', round(auc1$auc[[1]], 3)))
#Random Forest ROC
aucrf <- roc(as.numeric(testSplit$left), as.numeric(predrf_test), ci = TRUE)
plot(aucrf, ylim = c(0,1), print.auc = TRUE,
     main = paste('Random Forest AUC:', round(aucrf$auc[[1]], 3)), col = 'blue')
#Comparing both ROC curves
plot(aucrf, ylim = c(0,1), main = 'ROC Comparison: RF (blue), C5.0 (black)', col = 'blue')
par(new = TRUE)
plot(auc1)
The ROC curve for the Random Forest is better than that for the Decision Tree.
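pROC can also compare the two curves formally. A minimal sketch, assuming the auc1 and aucrf objects built above, that reports both AUC values and a DeLong test for the difference between them:
#AUC values of the two models
auc(auc1)
auc(aucrf)
#DeLong test on the paired ROC curves; a small p-value indicates the AUCs differ significantly
roc.test(aucrf, auc1)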
Classification Model
Naïve Bayes
Naïve Bayes
#Naive Bayes
modelnb <- naiveBayes(as.factor(left) ~ . , data = trainSplit)
modelnb
The output shows the a-priori probabilities of the target classes and the conditional probabilities of each predictor given the class.
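The a-priori probabilities are simply the class proportions of left in the training data, which can be verified directly. A quick check, assuming the trainSplit data from the earlier split:
#Class proportions of the target; these should match the a-priori probabilities in the model output
prop.table(table(as.factor(trainSplit$left)))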
Naïve Bayes
#Performance of Naïve Bayes using Confusion Matrix
prednb_tr <- predict(modelnb,trainSplit) #Train Data
prednb_test <- predict(modelnb,testSplit) #Test Data
confusionMatrix(prednb_tr,trainSplit$left) #Train Data
confusionMatrix(prednb_test,testSplit$left) #Test Data
The confusion matrix on the train data gives an accuracy of 78.84%, and the confusion matrix on the test data gives an accuracy of 78.58%.
As we observe, the model shows similar performance on the train and test data, which assures us of the stability of our Naïve Bayes model.
Classification Model
kNN Algorithm
kNN Algorithm
#Data Preparation for kNN Algorithm
library(dummies)
#Creating dummy variables for the factor variables (role_code and salary)
dummy_df = dummy.data.frame(hr_data1[, c('role_code', 'salary')])
hr_data2 = hr_data1
hr_data2 = data.frame(hr_data2, dummy_df)
#Removing role_code and salary since we have created dummy variables for them
hr_data2 = hr_data2[, !(names(hr_data2) %in% c('role_code', 'salary'))]
#Converting variables to numeric datatype
hr_data2$Work_accident = as.numeric(hr_data2$Work_accident)
hr_data2$promotion_last_5years = as.numeric(hr_data2$promotion_last_5years)
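Before scaling, it is worth confirming that the dummy columns were added and that every column is now numeric, since kNN works on numeric distances. A quick check, assuming hr_data2 as built above:
#All remaining columns should be numeric (original numeric features plus the dummy variables)
str(hr_data2)
sapply(hr_data2, is.numeric)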
kNN Algorithm
#Data Preparation for kNN Algorithm
#Scale the variables and check their final structure
X = hr_data2[, !(names(hr_data2) %in% c('left'))]
hr_data2_scaled = as.data.frame(scale(X))
str(hr_data2_scaled)
#Splitting the data for the model building
hr_train <- hr_data2_scaled[splitIndex,]
hr_test <- hr_data2_scaled[-splitIndex,]
hr_train_labels <- hr_data2[splitIndex, 'left']
hr_test_labels <- hr_data2[-splitIndex, 'left']
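Because scale() standardizes each column to mean 0 and standard deviation 1, a quick sanity check on the scaled training features confirms the preprocessing. A minimal sketch, assuming hr_train from above (the means are only approximately 0 here because scaling was done before the train/test split):
#Column means should be close to 0 and standard deviations close to 1
round(colMeans(hr_train), 3)
round(apply(hr_train, 2, sd), 3)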
kNN Algorithm
#Applying kNN Algorithm on the dataset
library(class)
library(gmodels)
test_pred_1 <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k=1)
CrossTable(x = hr_test_labels, y = test_pred_1, prop.chisq = FALSE)
From this crosstab, we can compute the accuracy of the model for k = 1:
Accuracy = (TP+TN)/Total
= (3311+1030)/4499
= 96.48%
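The same accuracy can be computed directly in R instead of reading the counts off the crosstab. A short sketch, assuming test_pred_1 and hr_test_labels from above:
#Accuracy as the proportion of test predictions that match the true labels
mean(test_pred_1 == hr_test_labels)
#Equivalently, the sum of the diagonal of the confusion table over the total
tab <- table(hr_test_labels, test_pred_1)
sum(diag(tab)) / sum(tab)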
kNN Algorithm
#Applying kNN Algorithm on the dataset
As we calculated the accuracy for k = 1, we similarly calculate it for k = 5, 10, 50, 100, and 122.
The accuracies for these k values are summarized below:
K Accuracy
5 94.46%
10 94.17%
50 90.19%
100 86.48%
122 85.06%
From the above accuracy table, we can observe that as the k value increases, the accuracy goes down.
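The accuracies in the table can be reproduced with a short loop over the candidate k values. A sketch, assuming hr_train, hr_test and the label vectors from the earlier slides (exact numbers may differ slightly because knn breaks ties at random):
#Evaluate kNN accuracy for several values of k
for (k_val in c(5, 10, 50, 100, 122)) {
  pred_k <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = k_val)
  cat("k =", k_val, " accuracy =", round(mean(pred_k == hr_test_labels) * 100, 2), "%\n")
}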
kNN Algorithm
# Thumb rule to decide on k for k-NN is sqrt(n)/2
k = sqrt(nrow(hr_train))/2
k
#51.2347, which can be approximated to 51
test_pred_rule <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = round(k))
CrossTable(x = hr_test_labels, y = test_pred_rule, prop.chisq = FALSE)
# accuracy = 4050/4499 = 90.02%
# Another method to determine the k for k-NN
set.seed(400)
ct <- trainControl(method = "repeatedcv", repeats = 3)
fit <- train(as.factor(left) ~ ., data = hr_data2, method = "knn", trControl = ct,
             preProcess = c("center", "scale"), tuneLength = 20)
fit
# Checking accuracy of the model with k = 7
test_pred_7 <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k=7)
CrossTable(x = hr_test_labels, y = test_pred_7, prop.chisq = FALSE)
# accuracy = 4357/4499 = 96.84%
#or alternatively we can use the command below
confusionMatrix(test_pred_7, as.factor(hr_test_labels))
Output on the next slide.
The caret output indicates that k = 7 is the best value for this data, and it is preferable to go with this value because it has been selected through cross-validation.
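The chosen k and the cross-validated accuracy profile can also be read straight from the caret fit object. A minimal sketch, assuming the fit object from the train() call above:
#Best tuning parameter selected by repeated cross-validation
fit$bestTune
#Cross-validated accuracy for each candidate k tried during tuning
plot(fit)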
Step 6
Model Summarization
Summary of Model Performance
Model                          Accuracy
Decision Tree                  97.09%
Random Forest                  99%
Naïve Bayes                    78.84%
kNN Algorithm (using k = 7)    96.84%
Appendix
Packages used for the Classification Analysis:
•[Link]
•reshape2
•randomForest
•party # For decision tree
•rpart # for Rpart
•rpart.plot # for rpart plots
•lattice # Used for Data Visualization
•caret # for data pre-processing
•pROC # for ROC curve
•corrplot # for correlation plot
•e1071 # for Naïve Bayes and the confusion matrix
•RColorBrewer
•dummies
•class
•gmodels
Thank You.