How to Use SMOTE for Imbalanced Data in R

Last Updated : 11 Jul, 2025

SMOTE (Synthetic Minority Over-sampling Technique) is a method for handling imbalanced data by creating new samples for the minority class. Instead of copying existing rows, it builds each new example by interpolating between a minority-class point and one of its nearest minority-class neighbors. This gives the model more minority-class examples to learn from and can improve how well it predicts that class.

What is Imbalanced Data?

Imbalanced data means one class has many more samples than the other(s). A model trained on such data can reach high accuracy by mostly predicting the larger class while performing poorly on the smaller one. It's a common problem in areas like fraud detection or medical diagnosis, where missing the smaller class can lead to serious errors.

Implementation of SMOTE for Imbalanced Data in Diabetes Dataset

We demonstrate how to use the ROSE package to address class imbalance in the diabetes dataset. Note that ROSE (Random Over-Sampling Examples) generates synthetic rows with a smoothed bootstrap rather than with SMOTE's nearest-neighbor interpolation. It is a closely related synthetic sampling method that serves the same purpose here.

1. Installing and Loading Required Packages

We install and load the package needed for resampling and set a random seed.

- ROSE: Provides the ROSE function for generating a synthetic, balanced dataset.
- set.seed: A base R function that fixes the random seed so the results are reproducible.

    install.packages("ROSE")  # only needed once
    library(ROSE)
    set.seed(199)             # make the random sampling reproducible

2. Loading the Diabetes Dataset

We load the diabetes dataset and check the class distribution to confirm whether the dataset is imbalanced. Download the dataset and replace the placeholder path below with its location.

- read.csv: Reads the dataset from the specified path.
- table: Displays the class distribution to check for imbalance.

    diabetes <- read.csv("path_to_diabetes.csv")  # placeholder path
    table(diabetes$Outcome)                       # rows per class

The output is a table with the number of rows in each Outcome class.
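The class distribution check can be taken one step further by looking at proportions and a simple imbalance ratio. The sketch below uses a small made-up data frame named toy as a stand-in for the diabetes data (the values are illustrative assumptions, not taken from the real dataset), and relies only on base R.

```r
# Toy stand-in for the diabetes data; all values are made up for illustration.
toy <- data.frame(
  Glucose = c(85, 90, 150, 99, 170, 88, 92, 101, 160, 95),
  Outcome = c(0, 0, 1, 0, 1, 0, 0, 0, 1, 0)
)

# Absolute number of rows per class
class_counts <- table(toy$Outcome)
print(class_counts)

# Share of rows per class
class_props <- prop.table(class_counts)
print(round(class_props, 2))

# Imbalance ratio: majority count divided by minority count
imbalance_ratio <- max(class_counts) / min(class_counts)
print(imbalance_ratio)
```

A ratio close to 1 means the classes are roughly balanced. The larger the ratio, the more a model can gain accuracy simply by favoring the majority class.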
3. Applying SMOTE to Balance the Dataset

We use the ROSE function to generate a synthetic dataset in which the two classes are approximately balanced.

- ROSE: A function from the ROSE package that generates synthetic rows around the existing observations (a smoothed-bootstrap alternative to SMOTE).
- Outcome ~ .: Specifies the target variable (Outcome) and all other columns as predictors.
- data = diabetes: The data frame loaded in the previous step.
- N = 2000: Specifies the total number of rows in the generated dataset.
- p = 0.5: Specifies the probability that a generated row belongs to the minority class.

    # Generate 2000 synthetic rows, about half of them from the minority class
    smote_data <- ROSE(Outcome ~ ., data = diabetes, N = 2000, p = 0.5)$data
    table(smote_data$Outcome)

The output shows that after resampling, the dataset contains 1021 instances of the majority class (Outcome = 0) and 979 instances of the minority class (Outcome = 1). The split is close to 50/50 but not exact, which is consistent with each row's class being drawn at random with probability p.

Author: poojashu00qn
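For comparison with the ROSE call, the interpolation idea behind classic SMOTE can be sketched in a few lines of base R. This is a simplified illustration, not the implementation of any package: the helper smote_sketch and the minority matrix are hypothetical names with made-up values, and the sketch assumes numeric features only, unscaled Euclidean distance, and a matrix that holds only minority-class rows.

```r
# Hypothetical helper (not from any package): bare-bones SMOTE for numeric features.
# X     : numeric matrix containing only minority-class rows
# n_new : number of synthetic rows to create
# k     : number of nearest minority neighbours to choose from
smote_sketch <- function(X, n_new, k = 5) {
  X <- as.matrix(X)
  stopifnot(nrow(X) >= 2)
  k <- min(k, nrow(X) - 1)

  # Pairwise Euclidean distances between minority rows
  d <- as.matrix(dist(X))
  diag(d) <- Inf  # a row must not count as its own neighbour

  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(X),
                      dimnames = list(NULL, colnames(X)))
  for (i in seq_len(n_new)) {
    i_base <- sample.int(nrow(X), 1)                # random minority row
    neighbours <- order(d[i_base, ])[seq_len(k)]    # its k nearest neighbours
    i_nb <- neighbours[sample.int(k, 1)]            # pick one neighbour at random
    gap <- runif(1)                                 # position along the segment
    synthetic[i, ] <- X[i_base, ] + gap * (X[i_nb, ] - X[i_base, ])
  }
  synthetic
}

set.seed(199)
# Made-up minority-class rows with two numeric features
minority <- cbind(Glucose = c(150, 170, 160, 155),
                  BMI     = c(33.1, 35.4, 30.2, 31.8))
new_rows <- smote_sketch(minority, n_new = 6, k = 2)
print(new_rows)
```

Each synthetic row lies on the line segment between an existing minority row and one of its neighbors, so it stays inside the range of the observed minority data. For real work, a maintained implementation such as the SMOTE function in the smotefamily package is a safer choice than this sketch.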