A key challenge in machine learning classification tasks is handling imbalanced data, which is characterized by a skewed class distribution in which one class heavily outnumbers the others. The difficulty posed by this imbalance is that models tend to become biased towards the majority class, optimizing overall accuracy at the expense of correctly recognizing minority-class instances.
This problem can be addressed with specialized strategies such as resampling (oversampling the minority class, undersampling the majority class), using alternative evaluation metrics (F1-score, precision, recall), and applying algorithms designed to work with imbalanced datasets.
What is Imbalanced Data and How to handle it?
Imbalanced data pertains to datasets where the distribution of observations in the target class is uneven. In other words, one class label has a significantly higher number of observations, while the other has a notably lower count.
When one class greatly outnumbers the others in a classification problem, the data is imbalanced. As a result, machine learning models may become biased in their predictions, favoring the majority class. Resampling techniques, such as oversampling the minority class or undersampling the majority class, are used to remedy this.
Furthermore, it is possible to evaluate model performance more precisely by substituting other assessment measures, such as precision, recall, or F1-score, for accuracy. To further improve the handling of imbalanced datasets for more reliable and equitable predictions, specialized techniques such as ensemble approaches and the incorporation of synthetic data generation can be used.
Problem with Handling Imbalanced Data for Classification
- Algorithms may become biased towards the majority class and thus tend to predict the majority class for most inputs.
- Minority-class observations can look like noise to the model and end up being ignored.
- An imbalanced dataset yields a misleading accuracy score (see the baseline example below).
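To see why accuracy can be misleading, consider a minimal sketch that uses a synthetic 90/10 dataset and scikit-learn's DummyClassifier as a stand-in for a classifier that always predicts the majority class:
Python3
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic dataset with roughly 10% minority class (label 0)
X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=42)
print("Class distribution:", Counter(y))

# A "classifier" that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

# Accuracy looks impressive, but recall for the minority class is zero
print("Accuracy:", accuracy_score(y, y_pred))
print("Minority-class recall:", recall_score(y, y_pred, pos_label=0))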
Ways to handle Imbalanced Data for Classification
Addressing imbalanced data in classification is crucial for fair model performance. Techniques include resampling (oversampling or undersampling), synthetic data generation, specialized algorithms, and alternative evaluation metrics. Implementing these strategies ensures more accurate and unbiased predictions across all classes.
1. Different Evaluation Metric
Classifier accuracy is calculated by dividing the total correct predictions by the overall number of predictions, which is suitable for balanced classes but misleading for imbalanced datasets. Precision gauges how accurate a classifier's predictions for a specific class are, while recall assesses its ability to correctly identify all instances of that class. In imbalanced datasets, the F1 score emerges as a preferred metric, striking a balance between precision and recall and providing a more comprehensive evaluation of a classifier's performance. It is the harmonic mean of precision and recall.
F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
Precision and F1 score both decrease when the classifier incorrectly predicts the minority class, increasing the number of false positives. Recall and F1 score also drop if the classifier has trouble correctly identifying the minority class, leading to more false negatives. The F1 score therefore improves only when both precision and recall improve.
F1 score is essentially a comprehensive statistic that takes into account the trade-off between precision and recall, which is critical for assessing classifier performance in datasets that are imbalanced.
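As a quick illustration of how these metrics are computed in practice, the sketch below (assuming a synthetic 90/10 dataset and a Random Forest, similar to the examples later in this article) reports accuracy alongside precision, recall, and F1 for the minority class:
Python3
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (about 10% minority class)
X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Report metrics for the minority class (label 0)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label=0))
print("Recall   :", recall_score(y_test, y_pred, pos_label=0))
print("F1 score :", f1_score(y_test, y_pred, pos_label=0))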
2. Resampling (Undersampling and Oversampling)
This method adjusts the balance between minority and majority classes through upsampling or downsampling. Oversampling randomly duplicates minority-class records (sampling with replacement) until their count matches the majority class. Conversely, undersampling randomly removes rows from the majority class to align with the minority class.
This sampling approach yields a balanced dataset, ensuring comparable representation for both majority and minority classes. Achieving a similar number of records for both classes in the dataset signifies that the classifier will assign equal importance to each class during training.
Python3
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Create a synthetic imbalanced dataset (roughly 10% minority class)
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
print("Original class distribution:", Counter(y))

# Oversample the minority class until it matches the majority class
oversample = RandomOverSampler(sampling_strategy='minority')
X_over, y_over = oversample.fit_resample(X, y)
print("Oversampled class distribution:", Counter(y_over))

# Undersample the majority class until it matches the minority class
undersample = RandomUnderSampler(sampling_strategy='majority')
X_under, y_under = undersample.fit_resample(X, y)
print("Undersampled class distribution:", Counter(y_under))
Output:
Original class distribution: Counter({1: 900, 0: 100})
Oversampled class distribution: Counter({1: 900, 0: 900})
Undersampled class distribution: Counter({0: 100, 1: 100})
3. BalancedBaggingClassifier
When dealing with imbalanced datasets, traditional classifiers tend to favor the majority class, neglecting the minority class due to its lower representation. The BalancedBaggingClassifier, an ensemble meta-estimator from the imbalanced-learn library that is compatible with scikit-learn classifiers, addresses this imbalance by incorporating additional balancing during training. It introduces parameters like "sampling_strategy", which determines the type of resampling (e.g., 'majority' for resampling only the majority class, 'all' for resampling all classes), and "replacement", which dictates whether sampling should occur with or without replacement. This classifier ensures a more equitable treatment of classes, which is particularly beneficial when handling imbalanced datasets.
Importing Libraries
Python3
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.metrics import accuracy_score, classification_report
This code demonstrates the usage of a BalancedBaggingClassifier from the imbalanced-learn library to handle imbalanced datasets. It creates an imbalanced dataset, splits it, and trains a Random Forest classifier with balanced bagging, assessing the model’s performance through accuracy and a classification report.
Creating imbalanced dataset and splitting
Python3
# Synthetic imbalanced dataset (about 10% minority class) and train/test split
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
This code creates a two-class, imbalanced dataset, divides it into training and testing sets, and uses a fixed random state to guarantee reproducibility. The resulting dataset has 20 features, and the minority class has a weight of 0.1, indicating a notable class imbalance.
Creating a random forest classifier
Python3
base_classifier = RandomForestClassifier(random_state=42)
This initializes a Random Forest classifier with a fixed random state, which will serve as the base estimator for the ensemble in the next step. The random state guarantees reproducibility in model training.
Creating a balanced bagging classifier
Python3
# Wrap the base classifier in a BalancedBaggingClassifier that resamples
# each bootstrap sample so the classes are balanced during training
balanced_bagging_classifier = BalancedBaggingClassifier(base_classifier,
                                                        sampling_strategy='auto',
                                                        replacement=False,
                                                        random_state=42)
This code creates a BalancedBaggingClassifier by starting with a RandomForestClassifier that was previously defined. A random state is established for reproducibility, and options like “sampling_strategy” and “replacement” are supplied to address class imbalance.
Fitting the model and making predictions
Python3
balanced_bagging_classifier.fit(X_train, y_train)
y_pred = balanced_bagging_classifier.predict(X_test)
This code uses the training data (X_train, y_train) to train the BalancedBaggingClassifier. It then predicts labels for the test data (X_test), storing the results in the variable y_pred.
Evaluation
Python3
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Output:
Accuracy: 1.0
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 13
1 1.00 1.00 1.00 187
accuracy 1.00 200
macro avg 1.00 1.00 1.00 200
weighted avg 1.00 1.00 1.00 200
This code computes and outputs the balanced bagging classifier's accuracy on the test set. It also prints a comprehensive classification report with each class's precision, recall, and F1-score.
4. SMOTE
The Synthetic Minority Oversampling Technique (SMOTE) addresses imbalanced datasets by synthetically generating new instances for the minority class. Unlike simply duplicating records, SMOTE enhances diversity by creating artificial instances. In simpler terms, SMOTE takes a minority-class instance, selects one of its k nearest minority-class neighbors at random, and generates a synthetic instance at a random point along the line segment between the two in feature space.
Python3
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced dataset and train/test split
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print("Class distribution before SMOTE:", Counter(y_train))

# Generate synthetic minority-class samples on the training set only
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
print("Class distribution after SMOTE:", Counter(y_train_resampled))
Output:
Class distribution before SMOTE: Counter({1: 713, 0: 87})
Class distribution after SMOTE: Counter({1: 713, 0: 713})
This code demonstrates how to rectify class imbalance in a dataset using SMOTE. Initially, an imbalanced dataset is produced, with 10% of the data belonging to the minority class. After dividing the data into training and testing sets, it shows the class distribution before SMOTE. The minority class is then oversampled with SMOTE to produce synthetic instances, and printing the class distribution after SMOTE shows an equal representation of both classes in the resampled training data.
5. Threshold Moving
In classifiers, predictions are often expressed as probabilities of class membership. The conventional threshold for assigning predictions to classes is typically set at 0.5. However, in the context of imbalanced class problems, this default threshold may not yield optimal results. To enhance classifier performance, it is essential to adjust the threshold to a value that efficiently discriminates between the two classes.
Techniques such as ROC Curves and Precision-Recall Curves are employed to identify the optimal threshold. Additionally, grid search methods or exploration within a specified range of values can be utilized to pinpoint the most suitable threshold for the classifier.
Python3
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Synthetic imbalanced dataset and train/test split
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predicted probability of the positive class (label 1)
y_proba = model.predict_proba(X_test)[:, 1]

# Scan thresholds downwards from 0.5 in steps of 0.02 and report the F1 score
threshold = 0.5
while threshold >= 0:
    y_pred = (y_proba >= threshold).astype(int)
    f1 = f1_score(y_test, y_pred)
    print(f"Threshold: {threshold:.2f}, F1 Score: {f1:.4f}")
    threshold -= 0.02
Output:
Threshold: 0.50, F1 Score: 1.0000
Threshold: 0.48, F1 Score: 1.0000
Threshold: 0.46, F1 Score: 1.0000
Threshold: 0.44, F1 Score: 1.0000
Threshold: 0.42, F1 Score: 1.0000
Threshold: 0.40, F1 Score: 1.0000
Threshold: 0.38, F1 Score: 1.0000
Threshold: 0.36, F1 Score: 1.0000
Threshold: 0.34, F1 Score: 1.0000
Threshold: 0.32, F1 Score: 1.0000
Threshold: 0.30, F1 Score: 1.0000
Threshold: 0.28, F1 Score: 0.9973
Threshold: 0.26, F1 Score: 0.9973
Threshold: 0.24, F1 Score: 0.9973
Threshold: 0.22, F1 Score: 0.9947
Threshold: 0.20, F1 Score: 0.9947
Threshold: 0.18, F1 Score: 0.9947
Threshold: 0.16, F1 Score: 0.9920
Threshold: 0.14, F1 Score: 0.9920
Threshold: 0.12, F1 Score: 0.9894
Threshold: 0.10, F1 Score: 0.9842
Threshold: 0.08, F1 Score: 0.9740
Threshold: 0.06, F1 Score: 0.9664
Threshold: 0.04, F1 Score: 0.9664
Threshold: 0.02, F1 Score: 0.9664
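As an alternative to scanning thresholds in a fixed-step loop, the precision-recall curve mentioned above can be used to locate the threshold that maximizes the F1 score. A minimal sketch, assuming the y_test and y_proba arrays from the example above:
Python3
import numpy as np
from sklearn.metrics import precision_recall_curve

# One threshold per precision/recall pair (the last pair has no threshold)
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

# F1 score at each candidate threshold; a small epsilon avoids division by zero
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

best_idx = np.argmax(f1_scores)
print(f"Best threshold: {thresholds[best_idx]:.2f}, F1 Score: {f1_scores[best_idx]:.4f}")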
6. Using Tree Based Models
The hierarchical structure of tree-based models, such as Decision Trees, Random Forests, and Gradient Boosted Trees, allows them to handle imbalanced datasets better than many non-tree-based models (a short class-weighting sketch follows the list below).
- Decision Trees: Decision trees partition the feature space into regions based on feature values, building a tree-like structure. By adjusting decision boundaries to capture minority-class patterns, they can adapt to imbalanced data, though they are prone to overfitting.
- Random Forests: Random Forests consist of many Decision Trees trained on random subsets of features and data. By combining numerous trees, they reduce overfitting, improve generalization, and are more robust to imbalanced datasets.
- Gradient Boosted Trees: Gradient Boosted Trees are built sequentially, with each new tree correcting the errors of the previous ones. Because this sequential learning concentrates on misclassified instances, they perform well on imbalanced data, although they can be sensitive to noise.
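Beyond their structure, tree-based models in scikit-learn also expose a class_weight parameter that weights classes inversely to their frequency during training, which is another common way to handle imbalance without resampling. A minimal sketch, assuming the same synthetic imbalanced dataset used in the earlier examples:
Python3
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# class_weight='balanced' weights each class inversely to its frequency
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))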
7. Using Anomaly Detection Algorithms
- Anomaly or outlier detection algorithms are 'one-class classification' algorithms that help identify outliers (rare data points) in a dataset.
- In an imbalanced dataset, treat the majority-class records as normal data and the minority-class records as outlier data.
- These algorithms are trained only on the normal data.
- A trained model can then predict whether a new record is normal or an outlier (a minimal sketch follows this list).
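A minimal sketch of this idea, using scikit-learn's IsolationForest trained only on the majority ("normal") class of the same synthetic dataset used throughout this article, and then asked to flag records as normal or outlier:
Python3
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

# Treat majority-class records (label 1) as "normal" data and train only on them
X_normal = X[y == 1]
detector = IsolationForest(random_state=42)
detector.fit(X_normal)

# predict() returns 1 for inliers (normal) and -1 for outliers
pred = detector.predict(X)
print("Records flagged as outliers:", np.sum(pred == -1))
print("Actual minority-class records:", np.sum(y == 0))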