Best R Packages for Machine Learning

Machine learning is a subset of artificial intelligence that focuses on building programs that learn from data and use what they learn to make predictions.

R is widely used for building machine learning models because of its flexibility, its efficient packages and its ability to run deep learning models with cloud integration. As an open-source language, it offers a large ecosystem of packages. The following are some popular R packages that are widely used in industry.

1. data.table

The data.table package provides an enhanced version of base R's data.frame and is designed for high performance. It is known for its memory efficiency and its ability to perform complex data manipulations at high speed. Some key features of data.table are:

Fast file reading and writing
Scalable data aggregation with parallelism support
Feature-rich data reshaping
Simplified syntax for subsetting and merging data

install.packages("data.table")
library(data.table)

iris_dt <- as.data.table(iris)

result <- iris_dt[Species == "setosa" & Sepal.Length > 5][1:5]
result
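
The aggregation, reshaping and merging features listed above can be sketched on the same iris data. This is a minimal illustration rather than a benchmark; the object names (mean_by_species, iris_long, species_means, joined) are chosen here for the example and are not part of the package.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# Grouped aggregation: mean of every numeric column per species
mean_by_species <- iris_dt[, lapply(.SD, mean), by = Species]
print(mean_by_species)

# Reshape from wide to long: one row per (flower, measurement) pair
iris_long <- melt(iris_dt,
                  id.vars = "Species",
                  variable.name = "measurement",
                  value.name = "value")
print(head(iris_long))

# Merge: attach the per-species mean Sepal.Length to every row
species_means <- iris_dt[, .(mean_sepal_length = mean(Sepal.Length)), by = Species]
joined <- merge(iris_dt, species_means, by = "Species")
print(head(joined))
```

Here .SD stands for the subset of data in each group, so lapply(.SD, mean) averages every remaining column within each species.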

2. dplyr

The dplyr package is one of the most widely used data manipulation tools in R. It provides a consistent, easy-to-use set of functions for data transformation. The key functions in dplyr are:

select(): Choose columns by name
filter(): Subset rows based on conditions
arrange(): Sort rows by column values
mutate(): Add new variables

Select and mutate functions:

install.packages("dplyr")  # Run only once
library(dplyr)

data("mtcars")

cat("---- Select ----\n")
selected <- dplyr::select(mtcars, mpg, cyl)
head(selected)
cat("\n------------------\n")

cat("---- Mutate ----\n")
mutated <- dplyr::mutate(mtcars, power_to_weight = hp / wt)
head(mutated)
cat("\n------------------\n")

Filter and arrange functions:

cat("---- Filter ----\n")
filtered <- dplyr::filter(mtcars, cyl == 6)
head(filtered)
cat("\n------------------\n")

cat("---- Arrange ----\n")
arranged <- dplyr::arrange(mtcars, desc(mpg))
head(arranged)
cat("\n------------------\n")
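
In practice the four verbs are usually chained into a single pipeline instead of being called one at a time. The sketch below assumes dplyr is installed and uses the %>% pipe that dplyr re-exports; the name six_cyl is made up for this example.

```r
library(dplyr)

# Chain the four verbs; each step receives the result of the previous one
six_cyl <- mtcars %>%
  filter(cyl == 6) %>%                      # keep 6-cylinder cars
  mutate(power_to_weight = hp / wt) %>%     # add a derived column
  select(mpg, hp, wt, power_to_weight) %>%  # keep the columns of interest
  arrange(desc(power_to_weight))            # highest ratio first

print(six_cyl)
```

Reading the pipeline top to bottom gives the order in which the transformations are applied, which is often easier to follow than nested function calls.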

3. ggplot2

ggplot2 is an open-source visualization package based on the Grammar of Graphics. It is widely regarded as one of the most popular and flexible visualization libraries in R. With ggplot2, users can create a wide range of static visualizations, including:

Bar charts
Scatter plots
Line graphs
Histograms
Boxplots

The syntax is straightforward and the visualizations are highly customizable, making it a go-to package for data visualization in R.

install.packages("ggplot2")  # Run only once
library(ggplot2)

ggplot(data = mtcars,  
       aes(x = hp, y = mpg, 
           col = disp)) + geom_point()
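
Two of the other chart types listed above, a histogram and a boxplot, follow the same pattern of data, aesthetics and a geom layer. This is a small sketch on mtcars; the bin width, colours and labels are arbitrary choices made for illustration.

```r
library(ggplot2)

# Histogram of fuel efficiency; binwidth = 2 is an arbitrary choice
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "steelblue", colour = "white") +
  labs(title = "Distribution of mpg", x = "Miles per gallon", y = "Count")

# Boxplot of mpg by cylinder count; factor() makes cyl categorical
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(title = "mpg by cylinder count", x = "Cylinders", y = "Miles per gallon") +
  theme_minimal()

print(p_hist)
print(p_box)
```

Only the aes() mapping and the geom change between the two plots, which is what makes the Grammar of Graphics approach easy to extend.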

4. caret

The caret package (Classification and Regression Training) provides a comprehensive framework for building machine learning models in R. It includes tools for:

Data splitting
Preprocessing
Feature selection
Model training
Model evaluation

caret supports numerous machine learning algorithms and is commonly used in industry due to its ease of use and flexibility.

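install.packages("e1071")
install.packages("caTools")
install.packages("caret")

library(e1071)
library(caTools)
library(caret)

data(iris)

# Set the seed before splitting so the train/test split is reproducible
set.seed(120)

# sample.split() expects the label vector, not the whole data frame
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_cl <- subset(iris, split == TRUE)
test_cl <- subset(iris, split == FALSE)

# Fit a naive Bayes classifier (from e1071) on the training set
classifier_cl <- naiveBayes(Species ~ ., data = train_cl)

# Predict the species of the held-out test rows
y_pred <- predict(classifier_cl, newdata = test_cl)

# Evaluate with caret's confusionMatrix(), which also reports accuracy
cm <- confusionMatrix(data = y_pred, reference = test_cl$Species)
print(classifier_cl)
print(cm)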
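
The tools listed above (data splitting, preprocessing, model training and model evaluation) can also be combined using caret's own functions. The sketch below is one possible workflow, not the only one: it assumes a k-nearest-neighbours model (method = "knn"), 5-fold cross-validation and a 70/30 split, all chosen here for illustration.

```r
library(caret)

set.seed(120)

# Data splitting: stratified 70/30 split on the label
train_idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_df <- iris[train_idx, ]
test_df  <- iris[-train_idx, ]

# Resampling scheme: 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Preprocessing and training in one call; tuneLength = 5 tries 5 values of k
knn_fit <- train(Species ~ ., data = train_df,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 trControl = ctrl,
                 tuneLength = 5)

# Model evaluation on the held-out test set
pred <- predict(knn_fit, newdata = test_df)
eval_cm <- confusionMatrix(pred, test_df$Species)
print(knn_fit$bestTune)
print(eval_cm$overall["Accuracy"])
```

Because the centering and scaling are passed through preProcess, train() learns them on the training data and predict() applies the same transformation to the test data.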

5. e1071

The e1071 package implements a range of machine learning algorithms, including support vector machines (SVM), naive Bayes and clustering methods. It is widely used for classification, regression and clustering tasks. The example below uses knn() from the class package to run a K-Nearest Neighbors (KNN) classifier.

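install.packages("e1071")
install.packages("caTools")
install.packages("class")

library(e1071)
library(caTools)
library(class)

data(iris)

# Split on the label vector so the 70/30 split preserves the species ratios
set.seed(120)
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_cl <- subset(iris, split == TRUE)
test_cl <- subset(iris, split == FALSE)

# Scale the test set with the training set's mean and standard deviation
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4],
                    center = attr(train_scale, "scaled:center"),
                    scale = attr(train_scale, "scaled:scale"))

# knn() comes from the class package
classifier_knn <- knn(train = train_scale,
                      test = test_scale,
                      cl = train_cl$Species,
                      k = 1)

cm <- table(test_cl$Species, classifier_knn)
print(cm)

misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1 - misClassError))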
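
Since support vector machines are the algorithm e1071 is best known for, a short SVM sketch may be useful alongside the KNN example. It assumes a radial kernel with cost = 1 and a simple random 70/30 split; these settings are illustrative defaults, not tuned values.

```r
library(e1071)

set.seed(120)

# Simple random 70/30 split using base R
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train_df <- iris[train_idx, ]
test_df  <- iris[-train_idx, ]

# Radial-kernel SVM; svm() scales the predictors by default
svm_fit <- svm(Species ~ ., data = train_df, kernel = "radial", cost = 1)

# Predict on the held-out rows and tabulate actual vs predicted labels
svm_pred <- predict(svm_fit, newdata = test_df)
svm_cm <- table(Actual = test_df$Species, Predicted = svm_pred)
print(svm_cm)

svm_accuracy <- mean(svm_pred == test_df$Species)
print(paste("Accuracy =", round(svm_accuracy, 3)))
```

The same svm() call handles regression when the response is numeric instead of a factor.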

6. XGBoost

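XGBoost is an implementation of gradient boosting and is well suited to large datasets. It is widely used in machine learning because of its performance and scalability. XGBoost builds decision trees sequentially, with each new tree correcting the errors of the trees before it, and uses regularization together with row and column subsampling to improve accuracy and reduce overfitting.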

install.packages("data.table") 
install.packages("caret") 
install.packages("xgboost") 
install.packages("e1071") 

library(data.table)  # fast data manipulation
library(caret)       # dummyVars() and preProcess()
library(xgboost)     # gradient boosting
library(e1071)       # skewness()

# Assumption: train and test are data.tables that have already been loaded
# and contain the columns referenced below (test has no Item_Outlet_Sales yet).
test[, Item_Outlet_Sales := NA]  
combi = rbind(train, test) 

# Impute missing Item_Weight with the mean weight of the same item
missing_index = which(is.na(combi$Item_Weight)) 
for(i in missing_index){ 
  item = combi$Item_Identifier[i] 
  combi$Item_Weight[i] = mean(combi$Item_Weight[combi$Item_Identifier == item], na.rm = TRUE) 
} 

# Replace zero Item_Visibility with the mean visibility of the same item
zero_index = which(combi$Item_Visibility == 0) 
for(i in zero_index){ 
  item = combi$Item_Identifier[i] 
  combi$Item_Visibility[i] = mean(combi$Item_Visibility[combi$Item_Identifier == item], na.rm = TRUE) 
} 

# Label-encode the two ordinal columns, then drop the original text columns
combi[, Outlet_Size_num := ifelse(Outlet_Size == "Small", 0, ifelse(Outlet_Size == "Medium", 1, 2))] 
combi[, Outlet_Location_Type_num := ifelse(Outlet_Location_Type == "Tier 3", 0, ifelse(Outlet_Location_Type == "Tier 2", 1, 2))] 
combi[, c("Outlet_Size", "Outlet_Location_Type") := NULL] 

# One-hot encode the remaining categorical columns
ohe_1 = dummyVars("~.", data = combi[, -c("Item_Identifier", "Outlet_Establishment_Year", "Item_Type")], fullRank = TRUE) 
ohe_df = data.table(predict(ohe_1, combi[, -c("Item_Identifier", "Outlet_Establishment_Year", "Item_Type")])) 
combi = cbind(combi[, "Item_Identifier"], ohe_df) 

# Check the skewness of Item_Visibility, then apply a log transform to reduce it
skewness(combi$Item_Visibility)  

combi[, Item_Visibility := log(Item_Visibility + 1)]  

# Center and scale every numeric column except the target
num_vars = which(sapply(combi, is.numeric))  
num_vars_names = names(num_vars) 
combi_numeric = combi[, setdiff(num_vars_names, "Item_Outlet_Sales"), with = FALSE] 
prep_num = preProcess(combi_numeric, method = c("center", "scale")) 
combi_numeric_norm = predict(prep_num, combi_numeric) 
combi[, setdiff(num_vars_names, "Item_Outlet_Sales") := NULL]  
combi = cbind(combi, combi_numeric_norm) 

train = combi[1:nrow(train)] 
test = combi[(nrow(train) + 1):nrow(combi)] 
test[, Item_Outlet_Sales := NULL]  

param_list = list( 
  objective = "reg:squarederror",  # replaces the deprecated "reg:linear"
  eta = 0.01, 
  gamma = 1, 
  max_depth = 6, 
  subsample = 0.8, 
  colsample_bytree = 0.5 
) 

Dtrain = xgb.DMatrix(data = as.matrix(train[, -c("Item_Identifier", "Item_Outlet_Sales")]), label = train$Item_Outlet_Sales) 
Dtest = xgb.DMatrix(data = as.matrix(test[, -c("Item_Identifier")])) 

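set.seed(112) 
# 5-fold cross-validation with early stopping to choose the number of rounds
xgbcv = xgb.cv(params = param_list, data = Dtrain, nrounds = 1000, nfold = 5, print_every_n = 10, early_stopping_rounds = 30, maximize = FALSE) 

# nrounds is hard-coded here; set it to the best iteration reported by xgb.cv above
xgb_model = xgb.train(data = Dtrain, params = param_list, nrounds = 428) 
xgb_model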
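
The example above depends on train and test tables that are not defined in this article, so a smaller self-contained sketch may help show the core XGBoost calls. It predicts mpg on the built-in mtcars data, which is far too small for a realistic model and is used only for illustration. The parameter values and the 24/8 split are assumptions, and parameter names may vary slightly between xgboost versions.

```r
library(xgboost)

set.seed(112)

# Toy regression task: predict mpg (column 1) from the other mtcars columns
train_idx <- sample(nrow(mtcars), size = 24)
x_train <- as.matrix(mtcars[train_idx, -1])
y_train <- mtcars$mpg[train_idx]
x_test  <- as.matrix(mtcars[-train_idx, -1])
y_test  <- mtcars$mpg[-train_idx]

# xgboost works on its own matrix type
dtrain <- xgb.DMatrix(data = x_train, label = y_train)
dtest  <- xgb.DMatrix(data = x_test, label = y_test)

params <- list(
  objective = "reg:squarederror",  # squared-error regression
  eta = 0.1,                       # learning rate
  max_depth = 3,                   # shallow trees for a tiny dataset
  subsample = 0.8                  # row subsampling per tree
)

# Each boosting round adds a tree fitted to the current errors
model <- xgb.train(params = params, data = dtrain, nrounds = 50)

# Root mean squared error on the held-out rows
pred <- predict(model, dtest)
rmse <- sqrt(mean((pred - y_test)^2))
print(rmse)
```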

7. randomForest

The randomForest package implements random forest, an ensemble learning method that builds multiple decision trees and combines them to produce more accurate predictions. It is especially useful for classification and regression tasks. Each decision tree is trained on a bootstrap sample of the data, and predictions are made by aggregating the results of all trees (a majority vote for classification, an average for regression).

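install.packages("caTools")
install.packages("randomForest")

library(caTools) 
library(randomForest) 

data(iris) 

# Seed first so both the split and the forest are reproducible
set.seed(120) 

# sample.split() expects the label vector, not the whole data frame
split <- sample.split(iris$Species, SplitRatio = 0.7) 
train <- subset(iris, split == TRUE) 
test <- subset(iris, split == FALSE) 

# Train on the four measurement columns; column 5 is Species
classifier_RF = randomForest(x = train[-5], y = train$Species, ntree = 500) 

classifier_RF 

# Predict on the test set and build the confusion matrix
y_pred = predict(classifier_RF, newdata = test[-5]) 
cm = table(test[, 5], y_pred) 
cm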
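
Because each tree is trained on only part of the data, the rows a tree did not see provide a built-in error estimate, and the fitted forest can also rank the predictors. The sketch below shows both on the full iris data; the object names (rf_fit, oob_error, imp) are chosen for this example.

```r
library(randomForest)

set.seed(120)

# importance = TRUE also records permutation-based importance
rf_fit <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)

# Out-of-bag (OOB) error after the final tree: an internal estimate of test error
oob_error <- rf_fit$err.rate[rf_fit$ntree, "OOB"]
print(oob_error)

# Importance of each predictor (mean decrease in accuracy and in Gini impurity)
imp <- importance(rf_fit)
print(imp)

# Dot chart of the same importance scores
varImpPlot(rf_fit)
```

The OOB error is often close to what a separate test set would report, so it can serve as a quick check before setting up a full train/test split.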
