Handling Imbalanced Classes in CatBoost: Techniques and Solutions
Last Updated: 05 Jul, 2024
Gradient boosting algorithms have become a cornerstone in machine learning, particularly in handling complex datasets with heterogeneous features and noisy data. One of the most prominent gradient boosting libraries is CatBoost, known for its ability to process categorical features effectively. However, like other boosting algorithms, CatBoost faces the challenge of dealing with imbalanced datasets, where one class significantly outnumbers the other. This article delves into the techniques and solutions CatBoost offers to tackle the issue of imbalanced classes.
The Problem of Imbalanced Classes
Imbalanced datasets are common in many real-world applications, such as fraud detection, medical diagnosis, and customer churn prediction. In these scenarios, one class (e.g., the positive class) is significantly underrepresented compared to the other class (e.g., the negative class). This imbalance can lead to biased models that favor the majority class, resulting in poor performance on the minority class.
Techniques for Handling Imbalanced Data in CatBoost
CatBoost provides several built-in mechanisms to handle imbalanced datasets. These include:
- Class Weights
- Auto Class Weights
- Sampling Techniques
Let's walk through a practical example demonstrating how to handle an imbalanced dataset using CatBoost, and then validate its performance. We'll use a synthetic dataset and evaluate the effectiveness of different techniques.
1. Dataset Preparation
First, let's generate a synthetic imbalanced dataset for demonstration purposes using make_classification from scikit-learn:
Python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import pandas as pd
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8,
n_redundant=2, n_classes=2, weights=[0.95, 0.05], random_state=42)
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['target'] = y
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42)
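Before applying any balancing technique, it is worth confirming how skewed the training split actually is. A quick check of the class distribution, using the y_train produced above:
Python
# Inspect the class distribution of the training split
print(y_train.value_counts())
print(y_train.value_counts(normalize=True))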
2. Class Weights
Class weights are used to assign different importance to different classes. By increasing the weight of the minority class, the model is penalized more for misclassifying minority class instances, thus improving its performance on the minority class.
Python
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report
# Define CatBoost model with class weights
model_class_weights = CatBoostClassifier(class_weights={0: 1, 1: 10}, random_state=42, verbose=0)
model_class_weights.fit(X_train, y_train)
y_pred_class_weights = model_class_weights.predict(X_test)
print("Classification Report - Class Weights:")
print(classification_report(y_test, y_pred_class_weights))
Output:
Classification Report - Class Weights:
precision recall f1-score support
0 0.98 1.00 0.99 1868
1 0.94 0.73 0.83 132
accuracy 0.98 2000
macro avg 0.96 0.87 0.91 2000
weighted avg 0.98 0.98 0.98 2000
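The minority-class weight of 10 used above is simply an illustrative value. A common heuristic is to set the minority weight to the ratio of majority to minority examples in the training data. The following sketch derives that ratio from y_train and reuses it; the variable names here are ours, not part of the original example:
Python
import numpy as np

# Count majority (0) and minority (1) examples and derive a weight ratio
neg, pos = np.bincount(y_train)
minority_weight = neg / pos

model_ratio = CatBoostClassifier(class_weights={0: 1, 1: minority_weight},
                                 random_state=42, verbose=0)
model_ratio.fit(X_train, y_train)
print(classification_report(y_test, model_ratio.predict(X_test)))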
3. Auto Class Weights
CatBoost also offers an automatic way to balance class weights through the auto_class_weights parameter. Setting it to 'Balanced' (CatBoost also accepts 'SqrtBalanced' for a milder correction) automatically calculates and assigns weights based on the class distribution in the training data.
Python
# Initialize CatBoostClassifier with auto class weights
model = CatBoostClassifier(auto_class_weights='Balanced', verbose=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Predictions:", y_pred)
Output:
Predictions: [0 0 0 ... 0 0 0]
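Printing the raw predictions alone does not show whether the minority class actually benefits from balancing. As with the class-weights model, you can score the balanced model on the held-out test set; a minimal sketch:
Python
# Evaluate the auto-balanced model with a full classification report
print("Classification Report - Auto Class Weights:")
print(classification_report(y_test, y_pred))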
4. Sampling Techniques
Sampling techniques such as oversampling the minority class or undersampling the majority class can also be used to balance the dataset. These techniques can be combined with CatBoost to improve model performance.
Oversampling:
Python
from imblearn.over_sampling import SMOTE
# Apply SMOTE to oversample the minority class
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
model = CatBoostClassifier(verbose=0)
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)
print("Predictions:", y_pred)
Output:
Predictions: [0 0 0 ... 0 0 0]
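To confirm that SMOTE actually balanced the training data, you can compare the class counts before and after resampling; a minimal sketch:
Python
from collections import Counter

# Class counts before and after SMOTE oversampling
print("Before resampling:", Counter(y_train))
print("After resampling: ", Counter(y_resampled))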
Undersampling:
Python
from imblearn.under_sampling import RandomUnderSampler
# Apply RandomUnderSampler to undersample the majority class
rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
model = CatBoostClassifier(verbose=0)
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)
print("Predictions:", y_pred)
Output:
Predictions: [0 0 0 ... 0 0 0]
Handling an Imbalanced Dataset in CatBoost: A Practical Example
Problem Statement: You have a dataset containing customer information such as demographics, visit frequency, and spending behavior, along with whether each customer made a purchase. The goal is to build a model that predicts whether a customer will purchase, where purchasers form only a small minority of the records.
Step-by-Step Example: Predicting Customer Purchase
1. Generate Random Dataset
Let's generate a random dataset using Python's numpy and pandas libraries. Because the target is drawn at random, independently of the features, the model has no real signal to learn; the point of this example is the mechanics of handling imbalance, which is also why the AUC scores reported below hover around 0.5:
Python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from catboost import CatBoostClassifier
np.random.seed(42)
n_samples = 1000
# Features: customer demographics and behavior
age = np.random.randint(18, 70, size=n_samples)
income = np.random.normal(50000, 20000, size=n_samples)
days_since_last_purchase = np.random.randint(0, 365, size=n_samples)
num_visits_last_month = np.random.randint(1, 30, size=n_samples)
avg_purchase_amount = np.random.normal(50, 20, size=n_samples)
customer_type = np.random.choice(['Regular', 'Premium'], size=n_samples)
location = np.random.choice(['Urban', 'Rural'], size=n_samples)
# Target: whether customer made a purchase (binary: 0 or 1)
purchase = np.random.choice([0, 1], size=n_samples, p=[0.8, 0.2])
data = pd.DataFrame({
'Age': age,
'Income': income,
'Days_Since_Last_Purchase': days_since_last_purchase,
'Num_Visits_Last_Month': num_visits_last_month,
'Avg_Purchase_Amount': avg_purchase_amount,
'Customer_Type': customer_type,
'Location': location,
'Purchase': purchase
})
# Display class distribution
print(data['Purchase'].value_counts())
Output:
Purchase
0 802
1 198
Name: count, dtype: int64
2. Data Preprocessing
Now, let's preprocess the dataset:
Python
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
label_encoder = LabelEncoder()
data['Customer_Type'] = label_encoder.fit_transform(data['Customer_Type'])
data['Location'] = label_encoder.fit_transform(data['Location'])
X = data.drop('Purchase', axis=1)
y = data['Purchase']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
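As an aside, CatBoost can consume categorical columns directly, so label encoding and scaling are not strictly required. The sketch below shows this alternative; note that Customer_Type and Location were already label-encoded above (CatBoost also accepts the original string values), and the variable names here are illustrative:
Python
# Alternative: let CatBoost treat the categorical columns natively via cat_features
X_raw = data.drop('Purchase', axis=1)
y_raw = data['Purchase']
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.2, random_state=42)

native_model = CatBoostClassifier(auto_class_weights='Balanced', verbose=0)
native_model.fit(X_tr, y_tr, cat_features=['Customer_Type', 'Location'])
print(classification_report(y_te, native_model.predict(X_te)))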
3. Handling Imbalanced Classes with CatBoost
Using Class Weights:
You can manually adjust the class_weights parameter in CatBoost to handle class imbalance:
Python
# Initialize CatBoost classifier with adjusted class weights
catboost_model_weights = CatBoostClassifier(iterations=1000,
learning_rate=0.1,
depth=6,
eval_metric='AUC',
random_seed=42,
class_weights=[1, 4], # Adjusted for imbalance (example)
verbose=100)
catboost_model_weights.fit(X_train, y_train, eval_set=(X_test, y_test))
y_pred_weights = catboost_model_weights.predict(X_test)
print("CatBoost with Class Weights:")
print(classification_report(y_test, y_pred_weights))
print("ROC AUC Score:", roc_auc_score(y_test, catboost_model_weights.predict_proba(X_test)[:, 1]))
Output:
0: test: 0.5014094 best: 0.5014094 (0) total: 4.61ms remaining: 4.6s
100: test: 0.4773669 best: 0.5684795 (1) total: 346ms remaining: 3.08s
200: test: 0.4621124 best: 0.5684795 (1) total: 762ms remaining: 3.03s
300: test: 0.4672525 best: 0.5684795 (1) total: 1.21s remaining: 2.81s
400: test: 0.4790250 best: 0.5684795 (1) total: 1.67s remaining: 2.49s
500: test: 0.4821754 best: 0.5684795 (1) total: 2.24s remaining: 2.23s
600: test: 0.4846626 best: 0.5684795 (1) total: 2.83s remaining: 1.88s
700: test: 0.4839993 best: 0.5684795 (1) total: 3.65s remaining: 1.56s
800: test: 0.4936163 best: 0.5684795 (1) total: 4.35s remaining: 1.08s
900: test: 0.4975958 best: 0.5684795 (1) total: 5.01s remaining: 551ms
999: test: 0.4985906 best: 0.5684795 (1) total: 5.42s remaining: 0us
bestTest = 0.5684795225
bestIteration = 1
Shrink model to first 2 iterations.
CatBoost with Class Weights:
precision recall f1-score support
0 0.82 0.65 0.72 163
1 0.19 0.35 0.24 37
accuracy 0.59 200
macro avg 0.50 0.50 0.48 200
weighted avg 0.70 0.59 0.63 200
ROC AUC Score: 0.5684795224672525
Using Auto Class Weights:
CatBoost provides an option to automatically calculate class weights based on the training data using auto_class_weights='Balanced':
Python
# Initialize CatBoost classifier with auto class weights
catboost_model_auto_weights = CatBoostClassifier(iterations=1000,
learning_rate=0.1,
depth=6,
eval_metric='AUC',
random_seed=42,
auto_class_weights='Balanced', # Automatically balance classes
verbose=100)
catboost_model_auto_weights.fit(X_train, y_train, eval_set=(X_test, y_test))
y_pred_auto_weights = catboost_model_auto_weights.predict(X_test)
print("CatBoost with Auto Class Weights:")
print(classification_report(y_test, y_pred_auto_weights))
print("ROC AUC Score:", roc_auc_score(y_test, catboost_model_auto_weights.predict_proba(X_test)[:, 1]))
Output:
0: test: 0.5014094 best: 0.5014094 (0) total: 7.61ms remaining: 7.6s
100: test: 0.4574697 best: 0.5691428 (1) total: 279ms remaining: 2.49s
200: test: 0.4793567 best: 0.5691428 (1) total: 503ms remaining: 2s
300: test: 0.4752114 best: 0.5691428 (1) total: 850ms remaining: 1.97s
400: test: 0.4757088 best: 0.5691428 (1) total: 1.21s remaining: 1.8s
500: test: 0.4820096 best: 0.5691428 (1) total: 1.53s remaining: 1.52s
600: test: 0.4785276 best: 0.5691428 (1) total: 2.02s remaining: 1.34s
700: test: 0.4798541 best: 0.5691428 (1) total: 2.31s remaining: 986ms
800: test: 0.4836677 best: 0.5691428 (1) total: 2.64s remaining: 655ms
900: test: 0.4808489 best: 0.5691428 (1) total: 2.95s remaining: 324ms
999: test: 0.4813464 best: 0.5691428 (1) total: 3.16s remaining: 0us
bestTest = 0.5691427624
bestIteration = 1
Shrink model to first 2 iterations.
CatBoost with Auto Class Weights:
precision recall f1-score support
0 0.82 0.65 0.72 163
1 0.19 0.35 0.24 37
accuracy 0.59 200
macro avg 0.50 0.50 0.48 200
weighted avg 0.70 0.59 0.63 200
ROC AUC Score: 0.5691427623942962
Choosing the Right Strategy
- Class Weights Adjustment: Ideal for scenarios where the dataset size is manageable and imbalance is moderate. Adjust weights to penalize misclassifications of the minority class more heavily during training.
- Auto Class Weights (Balanced): Suitable for datasets with severe imbalance or where the distribution of classes varies significantly. Automatically adjusts class weights based on class frequencies in the training data.
- Sampling Techniques:
- Over-sampling (SMOTE): Effective when the minority class is underrepresented and needs augmentation.
- Under-sampling (RandomUnderSampler): Useful when dataset size is large and computational efficiency is a concern.
Choosing Based on Scenario:
- For Moderate Imbalance: Start with adjusting class weights or using Auto Class Weights in CatBoost. Evaluate model performance metrics like precision, recall, and F1-score to fine-tune.
- For Severe Imbalance: Consider combining techniques like SMOTE with class weights adjustment or using Auto Class Weights (a sketch of one such combination follows this list). Evaluate both model performance and computational feasibility.
- Model Sensitivity Considerations: Experiment with different strategies and assess how each affects model behavior and performance metrics specific to your task.
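As one illustration of combining techniques for severe imbalance, resampling and weighting can be chained in an imbalanced-learn pipeline so that SMOTE is applied only when the model is fit. This is a minimal sketch on the data from the first example, with an illustrative minority weight, not a tuned recipe:
Python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from catboost import CatBoostClassifier

# SMOTE resamples the data during fit only; CatBoost additionally up-weights the minority class
combo = Pipeline(steps=[
    ('smote', SMOTE(random_state=42)),
    ('catboost', CatBoostClassifier(class_weights={0: 1, 1: 2}, random_state=42, verbose=0))
])
combo.fit(X_train, y_train)
print(classification_report(y_test, combo.predict(X_test)))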
Conclusion
Handling imbalanced datasets is a critical aspect of building robust machine learning models. CatBoost provides several effective techniques to address this challenge, including class weights, auto class weights, and sampling techniques. By leveraging these methods and choosing appropriate evaluation metrics, you can significantly improve the performance of your models on imbalanced datasets. Whether you're working on fraud detection, medical diagnosis, or customer churn prediction, CatBoost offers powerful tools to help you achieve better results.