
How to Use SMOTE for Imbalanced Data in R

Last Updated : 11 Jul, 2025

SMOTE (Synthetic Minority Over-sampling Technique) is a method for handling imbalanced data by creating new samples for the minority class. Instead of copying existing rows, it builds each new example by interpolating between a minority-class point and one of its nearest minority-class neighbors. This gives the model more minority-class examples to learn from and can improve how well that class is detected.
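The interpolation idea can be sketched in a few lines of base R. This is a minimal illustration under simplifying assumptions (numeric features only, Euclidean distance), not the implementation used by any package; `smote_sketch` and the toy `minority` matrix are hypothetical names introduced here.

```r
# smote_sketch() is a hypothetical helper written for illustration only.
# X: numeric matrix of minority-class rows; n_new: synthetic rows to create;
# k: number of nearest neighbours to choose from.
smote_sketch <- function(X, n_new, k = 5) {
  X <- as.matrix(X)
  n <- nrow(X)
  stopifnot(n >= 2)
  k <- min(k, n - 1)            # cannot use more neighbours than other points
  d <- as.matrix(dist(X))       # pairwise Euclidean distances
  diag(d) <- Inf                # a point is not its own neighbour
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X),
                  dimnames = list(NULL, colnames(X)))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1)             # random minority point
    nn <- order(d[base, ])[seq_len(k)]   # its k nearest minority neighbours
    nb <- nn[sample.int(k, 1)]           # pick one neighbour at random
    gap <- runif(1)                      # position along the connecting segment
    synth[i, ] <- X[base, ] + gap * (X[nb, ] - X[base, ])
  }
  synth
}

set.seed(1)
# Made-up minority-class points, for illustration only
minority <- cbind(x1 = rnorm(10, mean = 2), x2 = rnorm(10, mean = 2))
new_points <- smote_sketch(minority, n_new = 20, k = 3)
dim(new_points)
```

Because every synthetic row lies on a line segment between two real minority points, the new rows stay inside the region the minority class already occupies.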

What is Imbalanced Data?

Imbalanced data means one class has many more samples than the other(s). A model trained on such data can reach high accuracy by mostly predicting the larger class while rarely identifying the smaller one. It’s a common problem in areas like fraud detection or medical diagnosis, where missing the smaller class can lead to serious errors.
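A small base R sketch makes the problem concrete. The labels below are made up for illustration (950 negatives, 50 positives), and the "model" is simply a rule that always predicts the majority class.

```r
# Made-up labels: 950 majority (0) and 50 minority (1) cases
actual <- factor(c(rep(0, 950), rep(1, 50)), levels = c(0, 1))

# A trivial "model" that always predicts the majority class
predicted <- factor(rep(0, 1000), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are true labels
cm <- table(Predicted = predicted, Actual = actual)

accuracy <- sum(diag(cm)) / sum(cm)               # 0.95
recall_minority <- cm["1", "1"] / sum(cm[, "1"])  # 0

accuracy
recall_minority
```

The accuracy looks good, yet not a single minority case is found, which is why metrics such as recall are worth checking alongside accuracy on imbalanced data.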

Implementation of SMOTE for Imbalanced Data in Diabetes Dataset

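We demonstrate how to use the ROSE package to address class imbalance in the diabetes dataset. Note that the ROSE function does not run the original SMOTE algorithm: it generates synthetic samples with a smoothed bootstrap, which serves the same purpose of balancing the classes with new rows rather than duplicated ones.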

1. Installing and Loading Required Packages

We install and load the necessary packages to perform SMOTE and modeling.

  • ROSE: Provides the ROSE function, which balances imbalanced datasets by generating synthetic samples.
  • set.seed: Ensures reproducibility by setting a random seed.
R
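# Install ROSE only if it is not already available
if (!requireNamespace("ROSE", quietly = TRUE)) install.packages("ROSE")
library(ROSE)

# Fix the random seed so the resampled data is reproducible
set.seed(199)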

2. Loading the Diabetes Dataset

We load the diabetes dataset and check the class distribution to confirm whether the dataset is imbalanced.

You can download the dataset from here.

  • read.csv: Reads the dataset from the specified path.
  • table: Displays the class distribution to check for imbalance.
R
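# Replace the path with the location of your copy of the dataset
diabetes <- read.csv("path_to_diabetes.csv")

# Count the rows in each class of the target column
table(diabetes$Outcome)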

Output: (image not recovered; it showed the class counts returned by table(diabetes$Outcome))
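Raw counts are easier to judge as proportions or as a ratio. The sketch below shows one way to summarize them; `class_balance` is a hypothetical helper, and `toy_outcome` is a made-up label vector rather than the diabetes data.

```r
# class_balance() is a hypothetical helper, not part of base R or ROSE.
class_balance <- function(labels) {
  counts <- table(labels)
  list(
    counts = counts,
    proportions = round(prop.table(counts), 3),
    imbalance_ratio = max(counts) / min(counts)
  )
}

# Made-up labels for illustration only
toy_outcome <- c(rep(0, 90), rep(1, 10))
balance <- class_balance(toy_outcome)

balance$proportions      # 0.9 for class 0, 0.1 for class 1
balance$imbalance_ratio  # 9
```

The same call could be made on a real target column, for example class_balance(diabetes$Outcome), once the dataset has been loaded.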

3. Applying SMOTE to Balance the Dataset

We use the ROSE function to generate a new, synthetic dataset in which the two classes are roughly balanced.

  • ROSE: A function from the ROSE package that generates synthetic rows for both classes using a smoothed bootstrap (a SMOTE-like alternative, not the original SMOTE algorithm).
  • Outcome ~ .: Specifies the target variable (Outcome) and all other columns as predictors.
  • N = 2000: Specifies the total number of rows in the generated dataset.
  • p = 0.5: Specifies the probability that a generated row belongs to the minority class.
R
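# Generate a synthetic, roughly balanced dataset of 2000 rows
# (the data argument must be the data frame loaded earlier)
smote_data <- ROSE(Outcome ~ ., data = diabetes, N = 2000, p = 0.5)$data

# Check the new class distribution
table(smote_data$Outcome)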

Output: (image not recovered; it showed the class counts returned by table(smote_data$Outcome))

The output shows that the generated dataset is nearly balanced, with 1021 instances of the majority class (Outcome = 0) and 979 instances of the minority class (Outcome = 1). The split is close to, but not exactly, 50/50 because p is a probability applied to each generated row rather than a fixed quota.
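To get a feel for how much the class counts can vary from run to run, the sketch below simulates the minority count directly. It assumes that the number of minority rows behaves like a binomial draw with size N and probability p; this is a reading of how the p argument is documented, not a quote of the package's internals.

```r
# Assumption: the minority-class count behaves like Binomial(N, p)
set.seed(199)
N <- 2000
p <- 0.5

# Simulate the minority count for 1000 hypothetical runs
minority_counts <- rbinom(1000, size = N, prob = p)

range(minority_counts)     # smallest and largest simulated counts
mean(minority_counts)      # should sit near N * p = 1000
sd(minority_counts)        # should sit near sqrt(N * p * (1 - p))
sqrt(N * p * (1 - p))      # theoretical standard deviation, about 22.4
```

Under this assumption, a count of 979 is about one standard deviation below 1000, so a split like 1021 versus 979 is ordinary sampling variation rather than a sign that balancing failed.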

