Binary Classification and Unknown Classes with Random Forest in R

Last Updated : 26 Jun, 2024

Random Forest is a powerful and versatile machine-learning algorithm that handles both classification and regression tasks. It builds many decision trees during training and combines their outputs: for classification it returns the class that most trees vote for, and for regression it returns the mean of the individual tree predictions. In this article, we will focus on using Random Forest for binary classification and on handling unknown classes in R.
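The aggregation step described above can be illustrated without fitting any trees. The sketch below uses a small, made-up matrix of per-tree votes (the labels "spam" and "ham" and all values are hypothetical) and takes the majority vote per observation; it is only meant to show the idea, not how the randomForest package implements it internally.

```r
# Hypothetical votes from five trees (columns) for three observations (rows).
votes <- matrix(
  c("spam", "spam", "ham",  "spam", "ham",
    "ham",  "ham",  "ham",  "spam", "ham",
    "spam", "spam", "spam", "spam", "ham"),
  nrow = 3, byrow = TRUE
)

# Classification: the forest's prediction is the most frequent vote in a row.
majority_vote <- function(row) names(which.max(table(row)))
forest_pred <- apply(votes, 1, majority_vote)
print(forest_pred)

# Regression: the per-tree outputs are averaged instead (hypothetical values).
tree_outputs <- c(2.0, 2.5, 3.0, 2.5, 2.0)
forest_mean <- mean(tree_outputs)
print(forest_mean)
```

With an odd number of trees and two classes, as here, a tie cannot occur.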

What is Binary Classification?

Binary classification is a type of classification task that outputs one of two possible classes. It is commonly used in applications like spam detection, disease diagnosis (predicting whether a patient has a certain disease), and sentiment analysis (positive or negative sentiment).

Setting Up Random Forest for Binary Classification in R

The following steps show how to set up a Random Forest model for binary classification in the R programming language.

Step 1: Install and Load Necessary Libraries

First, ensure that you have the necessary libraries installed and loaded in R. The primary libraries needed are randomForest and caret.

R
install.packages("randomForest")
install.packages("caret")

library(randomForest)
library(caret)

Step 2: Load and Prepare the Data

For illustration, we will use the famous Iris dataset. Although it is a multi-class dataset with three species, we will turn it into a binary problem by relabeling every flower as either setosa or non-setosa.

R
# Load the Iris dataset
data(iris)

# Convert Species to a binary factor (setosa vs. non-setosa)
iris$Species <- ifelse(iris$Species == "setosa", "setosa", "non-setosa")
iris$Species <- as.factor(iris$Species)

# Split the data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.8, 
                                  list = FALSE, 
                                  times = 1)
irisTrain <- iris[trainIndex,]
irisTest <- iris[-trainIndex,]
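createDataPartition samples within each class so that the class proportions are preserved in both sets. The same idea can be sketched in base R, which also makes it easy to check the resulting class counts. The names rows_by_class, train_rows, train_set and test_set are illustrative and not part of the steps above, and the exact rows chosen will differ from those picked by caret.

```r
data(iris)
iris$Species <- factor(ifelse(iris$Species == "setosa", "setosa", "non-setosa"))

set.seed(123)
# Sample 80% of the row indices within each class, then combine them.
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_rows <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

train_set <- iris[train_rows, ]
test_set  <- iris[-train_rows, ]

# Class counts in each set; the 2:1 ratio of the full data is preserved.
print(table(train_set$Species))
print(table(test_set$Species))
```

The training set holds 80 non-setosa and 40 setosa rows, and the test set holds the remaining 20 and 10.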

Step 3: Train the Random Forest Model

Fit a Random Forest model to the training data.

R
# Train the Random Forest model
rf_model <- randomForest(Species ~ ., data = irisTrain, importance = TRUE, ntree = 500)

# Print the model summary
print(rf_model)

Output:

Call:
 randomForest(formula = Species ~ ., data = irisTrain, importance = TRUE,      ntree = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 0%
Confusion matrix:
           non-setosa setosa class.error
non-setosa         80      0           0
setosa              0     40           0
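The model is fitted with importance = TRUE, and the printed summary reports an out-of-bag (OOB) error estimate. Both can be inspected directly on the fitted object. The sketch below refits a forest on all 150 relabeled rows to stay self-contained (an assumption made for brevity; rf_fit is an illustrative name) and reads the importance table and the final OOB error.

```r
library(randomForest)

data(iris)
iris$Species <- factor(ifelse(iris$Species == "setosa", "setosa", "non-setosa"))

set.seed(123)
rf_fit <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 500)

# One row per predictor; columns include MeanDecreaseAccuracy
# (permutation importance) and MeanDecreaseGini.
imp <- importance(rf_fit)
print(imp)

# err.rate has one row per tree; the last row is the OOB error of the
# full forest, the figure shown in the printed summary.
oob_final <- rf_fit$err.rate[rf_fit$ntree, "OOB"]
print(oob_final)
```

The OOB estimate is computed from trees that did not see a given row during training, so it acts as a built-in validation score before any test set is touched.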

Step 4: Predict and Evaluate the Model

Use the model to make predictions on the test set and evaluate its performance.

R
# Make predictions on the test set
predictions <- predict(rf_model, newdata = irisTest)

# Confusion Matrix
conf_matrix <- confusionMatrix(predictions, irisTest$Species)
print(conf_matrix)

Because predictions and irisTest$Species both have only the two levels non-setosa and setosa at this point, confusionMatrix prints a 2x2 table over the 30 held-out rows (20 non-setosa and 10 setosa), followed by overall statistics such as accuracy and kappa and by sensitivity, specificity and related per-class measures.
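The main statistics reported by confusionMatrix can be reproduced by hand from a plain cross-tabulation, which helps when reading its output. The sketch below uses made-up labels (the vectors truth and pred and the class names "pos" and "neg" are hypothetical) and treats the first factor level as the positive class, which is also caret's default.

```r
# Hypothetical true labels and predictions for ten observations.
truth <- factor(c("pos", "pos", "pos", "pos", "neg",
                  "neg", "neg", "neg", "neg", "neg"),
                levels = c("pos", "neg"))
pred  <- factor(c("pos", "pos", "pos", "neg", "neg",
                  "neg", "neg", "neg", "pos", "neg"),
                levels = c("pos", "neg"))

# Rows are predictions, columns are the reference labels.
cm <- table(Prediction = pred, Reference = truth)
print(cm)

tp <- cm["pos", "pos"]; fn <- cm["neg", "pos"]
fp <- cm["pos", "neg"]; tn <- cm["neg", "neg"]

accuracy    <- (tp + tn) / sum(cm)   # share of correct predictions
sensitivity <- tp / (tp + fn)        # true positive rate
specificity <- tn / (tn + fp)        # true negative rate
print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity))
```

Here 8 of 10 predictions are correct (accuracy 0.8), 3 of 4 positives are found (sensitivity 0.75), and 5 of 6 negatives are found (specificity about 0.83).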

Handling Unknown Classes

A forest trained on two classes can only ever predict those two classes, so observations whose true class is unknown or missing need separate handling in a real-world scenario. Here’s one approach to dealing with unknown classes in the data:

Step 5: Handle Unknown Classes in the Dataset

Suppose we have some unknown species labeled as "unknown" in the dataset. We can follow these steps:

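R
# Introduce an "unknown" label in the test data
set.seed(124)
unknown_indices <- sample(1:nrow(irisTest), 10)

# Species is a factor without an "unknown" level, so convert it to character
# before assigning; assigning directly would yield NA with a warning
irisTest$Species <- as.character(irisTest$Species)
irisTest$Species[unknown_indices] <- "unknown"

# Rebuild the factor with all three levels
irisTest$Species <- factor(irisTest$Species, 
                           levels = c("setosa", "non-setosa", "unknown"))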
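One detail matters whenever a new label is written into a factor column: a value that is not among the factor's existing levels is stored as NA (with a warning) rather than as the new label. The short base-R sketch below demonstrates this on a toy factor and shows two common ways around it; the object names are illustrative.

```r
f <- factor(c("setosa", "non-setosa", "setosa"))

# Direct assignment of an unseen label stores NA (R also emits an
# "invalid factor level" warning, suppressed here to keep the output clean).
bad <- f
suppressWarnings(bad[1] <- "unknown")
print(bad)

# Option 1: extend the levels first, then assign.
ok1 <- f
levels(ok1) <- c(levels(ok1), "unknown")
ok1[1] <- "unknown"
print(ok1)

# Option 2: go through character and rebuild the factor.
ok2 <- as.character(f)
ok2[1] <- "unknown"
ok2 <- factor(ok2, levels = c("setosa", "non-setosa", "unknown"))
print(ok2)
```

Rows that end up as NA are silently left out of a cross-tabulation, so checking sum(is.na(x)) after relabeling is a cheap safeguard.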

Step 6: Modify Predictions for Unknown Classes

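If a class is unknown, we may decide to label it separately or use some heuristic to handle it. Here we predict normally and then overwrite the prediction with "unknown" for every row whose reference label is "unknown". Note that this step copies the reference labels, so it keeps those rows out of the setosa and non-setosa counts but does not test whether the model itself can recognize an unknown class.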

R
# Make predictions on the test set including unknowns
predictions <- predict(rf_model, newdata = irisTest, type = "response")

# Reclassify unknowns
predictions <- ifelse(irisTest$Species == "unknown", "unknown", 
                      as.character(predictions))
predictions <- factor(predictions, levels = c("setosa", "non-setosa", "unknown"))

# Confusion Matrix including unknown class
conf_matrix <- confusionMatrix(predictions, irisTest$Species)
print(conf_matrix)

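The output below comes from a run in which the ten relabeled rows were stored as NA, which happens when "unknown" is assigned to a factor that lacks that level. confusionMatrix drops such rows, so only 20 of the 30 test rows appear and the unknown row and column are all zero. When the labels are stored as "unknown", those ten rows fall in the unknown/unknown cell instead.

Output: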

Confusion Matrix and Statistics

            Reference
Prediction   setosa non-setosa unknown
  setosa          5          0       0
  non-setosa      0         15       0
  unknown         0          0       0

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.8316, 1)
    No Information Rate : 0.75       
    P-Value [Acc > NIR] : 0.003171   
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: non-setosa Class: unknown
Sensitivity                   1.00              1.00             NA
Specificity                   1.00              1.00              1
Pos Pred Value                1.00              1.00             NA
Neg Pred Value                1.00              1.00             NA
Prevalence                    0.25              0.75              0
Detection Rate                0.25              0.75              0
Detection Prevalence          0.25              0.75              0
Balanced Accuracy             1.00              1.00             NA
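A reject option is one way to let the model itself flag uncertain cases instead of relying on the reference labels. The sketch below is a minimal illustration under stated assumptions: it fits its own forest on a random 120-row subset of the relabeled Iris data, uses the class vote fractions returned by predict with type = "prob", and labels a row "unknown" when the largest fraction falls below a cut-off. The 0.8 threshold and the names rf_fit, new_data and pred_with_reject are assumptions chosen for illustration, not values taken from the steps above.

```r
library(randomForest)

data(iris)
iris$Species <- factor(ifelse(iris$Species == "setosa", "setosa", "non-setosa"))

set.seed(123)
train_rows <- sample(nrow(iris), 120)
rf_fit <- randomForest(Species ~ ., data = iris[train_rows, ], ntree = 500)
new_data <- iris[-train_rows, ]

# Fraction of trees voting for each class, one row per observation.
probs <- predict(rf_fit, newdata = new_data, type = "prob")

reject_threshold <- 0.8  # assumed cut-off; tune on validation data
top_prob  <- apply(probs, 1, max)
top_class <- colnames(probs)[apply(probs, 1, which.max)]

# Keep the predicted class only when the forest is confident enough.
pred_with_reject <- ifelse(top_prob < reject_threshold, "unknown", top_class)
pred_with_reject <- factor(pred_with_reject,
                           levels = c("setosa", "non-setosa", "unknown"))
print(table(pred_with_reject))
```

On data this easy to separate, few or no rows may be rejected. A confidence threshold also only catches inputs on which the trees disagree; an observation from a genuinely new class can still receive a confident vote, which is why anomaly detection is sometimes used alongside it.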

Conclusion

Random Forest is a robust and flexible algorithm for binary classification in R. A forest trained on two classes can only predict those two classes, so handling unknown classes requires additional steps, such as implementing a reject option, adding an "unknown" class to the training data, or using anomaly detection. The steps above show how to build and evaluate a binary Random Forest model and how to keep rows labeled as unknown separate during evaluation.
