
Handling Imbalanced Classes in CatBoost: Techniques and Solutions

Last Updated : 05 Jul, 2024

Gradient boosting algorithms have become a cornerstone in machine learning, particularly in handling complex datasets with heterogeneous features and noisy data. One of the most prominent gradient boosting libraries is CatBoost, known for its ability to process categorical features effectively. However, like other boosting algorithms, CatBoost faces the challenge of dealing with imbalanced datasets, where one class significantly outnumbers the other. This article delves into the techniques and solutions CatBoost offers to tackle the issue of imbalanced classes.

The Problem of Imbalanced Classes

Imbalanced datasets are common in many real-world applications, such as fraud detection, medical diagnosis, and customer churn prediction. In these scenarios, one class (e.g., the positive class) is significantly underrepresented compared to the other class (e.g., the negative class). This imbalance can lead to biased models that favor the majority class, resulting in poor performance on the minority class.

Techniques for Handling Imbalanced Data in CatBoost

CatBoost provides several built-in mechanisms to handle imbalanced datasets. These include:

  1. Class Weights
  2. Auto Class Weights
  3. Sampling Techniques

Let's walk through a practical example demonstrating how to handle an imbalanced dataset using CatBoost, and then validate its performance. We'll use a synthetic dataset and evaluate the effectiveness of different techniques.

1. Dataset Preparation

First, let's generate a synthetic imbalanced dataset for demonstration purposes using make_classification from scikit-learn:

Python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import pandas as pd

X, y = make_classification(n_samples=10000, n_features=10, n_informative=8,
                           n_redundant=2, n_classes=2, weights=[0.95, 0.05], random_state=42)

df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['target'] = y

X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42)
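The classes were generated with a roughly 95/5 split, so before fitting any model it is worth confirming the imbalance in the training labels. A minimal check, using the variables defined above:

Python
# Inspect the class distribution of the training labels
print(y_train.value_counts(normalize=True))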

2. Class Weights

Class weights are used to assign different importance to different classes. By increasing the weight of the minority class, the model is penalized more for misclassifying minority class instances, thus improving its performance on the minority class.

Python
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report

# Define CatBoost model with class weights
model_class_weights = CatBoostClassifier(class_weights={0: 1, 1: 10}, random_state=42, verbose=0)
model_class_weights.fit(X_train, y_train)
y_pred_class_weights = model_class_weights.predict(X_test)
print("Classification Report - Class Weights:")
print(classification_report(y_test, y_pred_class_weights))

Output:

Classification Report - Class Weights:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1868
           1       0.94      0.73      0.83       132

    accuracy                           0.98      2000
   macro avg       0.96      0.87      0.91      2000
weighted avg       0.98      0.98      0.98      2000

3. Auto Class Weights

CatBoost also offers an automatic way to balance class weights using the auto_class_weights parameter. This parameter can be set to 'Balanced' to automatically calculate and assign weights based on the class distribution.

Python
# Initialize CatBoostClassifier with auto class weights
model = CatBoostClassifier(auto_class_weights='Balanced', verbose=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Predictions:", y_pred)

Output:

Predictions: [0 0 0 ... 0 0 0]
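Printing the raw predictions alone does not show whether the minority class is actually being recognised. Reusing the classification_report import from the previous snippet, a quick validation of this model might look like the following (a sketch; the exact figures depend on the run):

Python
# Evaluate the auto-class-weights model on the held-out test set
print("Classification Report - Auto Class Weights:")
print(classification_report(y_test, y_pred))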

4. Sampling Techniques

Sampling techniques such as oversampling the minority class or undersampling the majority class can also be used to balance the dataset. These techniques can be combined with CatBoost to improve model performance.

Oversampling:

Python
from imblearn.over_sampling import SMOTE

# Apply SMOTE to oversample the minority class
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
model = CatBoostClassifier(verbose=0)
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)
print("Predictions:", y_pred)

Output:

Predictions: [0 0 0 ... 0 0 0]

Undersampling:

Python
from imblearn.under_sampling import RandomUnderSampler

# Apply RandomUnderSampler to undersample the majority class
rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
model = CatBoostClassifier(verbose=0)
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)
print("Predictions:", y_pred)

Output:

Predictions: [0 0 0 ... 0 0 0]
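The same check applies to the resampled models. Note that SMOTE and RandomUnderSampler are applied only to the training split, so the untouched test set can still be used for evaluation. A sketch for the undersampled model fitted above:

Python
# Evaluate the model trained on the undersampled training data
print("Classification Report - Undersampling:")
print(classification_report(y_test, y_pred))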

Handling an Imbalanced Dataset in CatBoost: Practical Example

Problem Statement: You have a dataset containing customer information such as demographics and behavioural features (visit frequency, spending, time since last purchase), together with a label indicating whether the customer made a purchase. Purchases are rare, so the classes are imbalanced, and the goal is to build a model that predicts whether a customer will make a purchase based on these features.

Step-by-Step Example: Predicting Customer Purchase

1. Generate Random Dataset

Let's generate a random dataset using Python's numpy and pandas libraries:

Python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from catboost import CatBoostClassifier

np.random.seed(42)
n_samples = 1000

# Features: customer demographics and behavior
age = np.random.randint(18, 70, size=n_samples)
income = np.random.normal(50000, 20000, size=n_samples)
days_since_last_purchase = np.random.randint(0, 365, size=n_samples)
num_visits_last_month = np.random.randint(1, 30, size=n_samples)
avg_purchase_amount = np.random.normal(50, 20, size=n_samples)
customer_type = np.random.choice(['Regular', 'Premium'], size=n_samples)
location = np.random.choice(['Urban', 'Rural'], size=n_samples)

# Target: whether customer made a purchase (binary: 0 or 1)
purchase = np.random.choice([0, 1], size=n_samples, p=[0.8, 0.2])
data = pd.DataFrame({
    'Age': age,
    'Income': income,
    'Days_Since_Last_Purchase': days_since_last_purchase,
    'Num_Visits_Last_Month': num_visits_last_month,
    'Avg_Purchase_Amount': avg_purchase_amount,
    'Customer_Type': customer_type,
    'Location': location,
    'Purchase': purchase
})

# Display class distribution
print(data['Purchase'].value_counts())

Output:

Purchase
0    802
1    198
Name: count, dtype: int64

2. Data Preprocessing

Now, let's preprocess the dataset:

Python
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

label_encoder = LabelEncoder()
data['Customer_Type'] = label_encoder.fit_transform(data['Customer_Type'])
data['Location'] = label_encoder.fit_transform(data['Location'])
X = data.drop('Purchase', axis=1)
y = data['Purchase']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
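Label encoding and scaling work here, but they are not strictly required for a tree-based model, and CatBoost can consume categorical columns natively through the cat_features argument. A minimal alternative sketch (the names X_cat, native_model, etc. are illustrative; at this point Customer_Type and Location are already integer-coded, which CatBoost accepts as categorical values):

Python
# Alternative: let CatBoost handle the categorical columns directly
X_cat = data.drop('Purchase', axis=1)
y_cat = data['Purchase']
X_tr, X_te, y_tr, y_te = train_test_split(X_cat, y_cat, test_size=0.2, random_state=42)

native_model = CatBoostClassifier(auto_class_weights='Balanced', random_seed=42, verbose=0)
native_model.fit(X_tr, y_tr, cat_features=['Customer_Type', 'Location'])
print("ROC AUC Score:", roc_auc_score(y_te, native_model.predict_proba(X_te)[:, 1]))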

3. Handling Imbalanced Classes with CatBoost

Using Class Weights:

You can manually adjust the class_weights parameter in CatBoost to handle class imbalance:

Python
# Initialize CatBoost classifier with adjusted class weights
catboost_model_weights = CatBoostClassifier(iterations=1000,
                                            learning_rate=0.1,
                                            depth=6,
                                            eval_metric='AUC',
                                            random_seed=42,
                                            class_weights=[1, 4],  # Adjusted for imbalance (example)
                                            verbose=100)

catboost_model_weights.fit(X_train, y_train, eval_set=(X_test, y_test))
y_pred_weights = catboost_model_weights.predict(X_test)
print("CatBoost with Class Weights:")
print(classification_report(y_test, y_pred_weights))
print("ROC AUC Score:", roc_auc_score(y_test, catboost_model_weights.predict_proba(X_test)[:, 1]))

Output:

0:	test: 0.5014094	best: 0.5014094 (0)	total: 4.61ms	remaining: 4.6s
100: test: 0.4773669 best: 0.5684795 (1) total: 346ms remaining: 3.08s
200: test: 0.4621124 best: 0.5684795 (1) total: 762ms remaining: 3.03s
300: test: 0.4672525 best: 0.5684795 (1) total: 1.21s remaining: 2.81s
400: test: 0.4790250 best: 0.5684795 (1) total: 1.67s remaining: 2.49s
500: test: 0.4821754 best: 0.5684795 (1) total: 2.24s remaining: 2.23s
600: test: 0.4846626 best: 0.5684795 (1) total: 2.83s remaining: 1.88s
700: test: 0.4839993 best: 0.5684795 (1) total: 3.65s remaining: 1.56s
800: test: 0.4936163 best: 0.5684795 (1) total: 4.35s remaining: 1.08s
900: test: 0.4975958 best: 0.5684795 (1) total: 5.01s remaining: 551ms
999: test: 0.4985906 best: 0.5684795 (1) total: 5.42s remaining: 0us

bestTest = 0.5684795225
bestIteration = 1

Shrink model to first 2 iterations.
CatBoost with Class Weights:
              precision    recall  f1-score   support

           0       0.82      0.65      0.72       163
           1       0.19      0.35      0.24        37

    accuracy                           0.59       200
   macro avg       0.50      0.50      0.48       200
weighted avg       0.70      0.59      0.63       200

ROC AUC Score: 0.5684795224672525

Note that the AUC stays close to 0.5 (and the best iteration is the very first one) because the synthetic target was generated independently of the features, so there is no real signal to learn; the point of this example is the workflow, not the score.

Using Auto Class Weights:

CatBoost provides an option to automatically calculate class weights based on the training data using auto_class_weights='Balanced':

Python
# Initialize CatBoost classifier with auto class weights
catboost_model_auto_weights = CatBoostClassifier(iterations=1000,
                                                learning_rate=0.1,
                                                depth=6,
                                                eval_metric='AUC',
                                                random_seed=42,
                                                auto_class_weights='Balanced',  # Automatically balance classes
                                                verbose=100)

catboost_model_auto_weights.fit(X_train, y_train, eval_set=(X_test, y_test))
y_pred_auto_weights = catboost_model_auto_weights.predict(X_test)
print("CatBoost with Auto Class Weights:")
print(classification_report(y_test, y_pred_auto_weights))
print("ROC AUC Score:", roc_auc_score(y_test, catboost_model_auto_weights.predict_proba(X_test)[:, 1]))

Output:

0:	test: 0.5014094	best: 0.5014094 (0)	total: 7.61ms	remaining: 7.6s
100: test: 0.4574697 best: 0.5691428 (1) total: 279ms remaining: 2.49s
200: test: 0.4793567 best: 0.5691428 (1) total: 503ms remaining: 2s
300: test: 0.4752114 best: 0.5691428 (1) total: 850ms remaining: 1.97s
400: test: 0.4757088 best: 0.5691428 (1) total: 1.21s remaining: 1.8s
500: test: 0.4820096 best: 0.5691428 (1) total: 1.53s remaining: 1.52s
600: test: 0.4785276 best: 0.5691428 (1) total: 2.02s remaining: 1.34s
700: test: 0.4798541 best: 0.5691428 (1) total: 2.31s remaining: 986ms
800: test: 0.4836677 best: 0.5691428 (1) total: 2.64s remaining: 655ms
900: test: 0.4808489 best: 0.5691428 (1) total: 2.95s remaining: 324ms
999: test: 0.4813464 best: 0.5691428 (1) total: 3.16s remaining: 0us

bestTest = 0.5691427624
bestIteration = 1

Shrink model to first 2 iterations.
CatBoost with Auto Class Weights:
              precision    recall  f1-score   support

           0       0.82      0.65      0.72       163
           1       0.19      0.35      0.24        37

    accuracy                           0.59       200
   macro avg       0.50      0.50      0.48       200
weighted avg       0.70      0.59      0.63       200

ROC AUC Score: 0.5691427623942962

Choosing the Right Strategy

  • Class Weights Adjustment: Ideal for scenarios where the dataset size is manageable and imbalance is moderate. Adjust weights to penalize misclassifications of the minority class more heavily during training.
  • Auto Class Weights (Balanced): Suitable for datasets with severe imbalance or where the distribution of classes varies significantly. Automatically adjusts class weights based on class frequencies in the training data.
  • Sampling Techniques:
    • Over-sampling (SMOTE): Effective when the minority class is underrepresented and needs augmentation.
    • Under-sampling (RandomUnderSampler): Useful when dataset size is large and computational efficiency is a concern.

Choosing Based on Scenario:

  • For Moderate Imbalance: Start with adjusting class weights or using Auto Class Weights in CatBoost. Evaluate model performance metrics like precision, recall, and F1-score to fine-tune.
  • For Severe Imbalance: Consider combining techniques such as SMOTE with class weight adjustment, or using Auto Class Weights (see the sketch after this list). Evaluate both model performance and computational feasibility.
  • Model Sensitivity Considerations: Experiment with different strategies and assess how each affects model behavior and performance metrics specific to your task.
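As an illustration of the combined approach mentioned above, the sketch below oversamples the training split with SMOTE and then trains CatBoost with a mild manual class weight. It assumes the X_train/X_test split from step 2; the weight values, iteration count, and the name combined_model are illustrative choices, not tuned settings:

Python
from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training split only
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Train with an additional (mild) penalty on minority-class errors
combined_model = CatBoostClassifier(iterations=500,
                                    learning_rate=0.1,
                                    depth=6,
                                    eval_metric='AUC',
                                    random_seed=42,
                                    class_weights=[1, 2],  # illustrative weights
                                    verbose=0)
combined_model.fit(X_train_res, y_train_res, eval_set=(X_test, y_test))

# Evaluate on the untouched test set
print(classification_report(y_test, combined_model.predict(X_test)))
print("ROC AUC Score:", roc_auc_score(y_test, combined_model.predict_proba(X_test)[:, 1]))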

Conclusion

Handling imbalanced datasets is a critical aspect of building robust machine learning models. CatBoost provides several effective techniques to address this challenge, including class weights, auto class weights, and sampling techniques. By leveraging these methods and choosing appropriate evaluation metrics, you can significantly improve the performance of your models on imbalanced datasets. Whether you're working on fraud detection, medical diagnosis, or customer churn prediction, CatBoost offers powerful tools to help you achieve better results.

