HistGradientBoostingClassifier in Sklearn
Last Updated: 06 Jun, 2024
The HistGradientBoostingClassifier is an advanced implementation of the Gradient Boosting algorithm provided by the Scikit-Learn library. It leverages histogram-based techniques to enhance the efficiency and scalability of gradient boosting, making it particularly suitable for large datasets. This article delves into the key features, advantages, and practical applications of the HistGradientBoostingClassifier.
What is Histogram-Based Gradient Boosting?
Gradient Boosting is an ensemble machine learning technique that builds models sequentially, with each new model attempting to correct the errors made by the previous ones. It is widely used for both classification and regression tasks due to its high predictive accuracy.
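To see this sequential error-correction idea in code, here is a minimal sketch of boosting for squared-error regression (the simplest case; classification variants fit trees to gradients of a log loss instead). This illustrates the principle only and is not Scikit-Learn's internal implementation; the toy data is made up:
Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Start from a constant prediction, then let each shallow tree
# fit the residual errors left by the ensemble so far
pred = np.full_like(y, y.mean())
learning_rate = 0.1
for _ in range(100):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)
    pred += learning_rate * tree.predict(X)

print(f'Final training MSE: {np.mean((y - pred) ** 2):.4f}')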
Histogram-based gradient boosting is a variant that improves the efficiency of the traditional gradient boosting algorithm by discretizing continuous input features into bins (histograms). Instead of evaluating every unique feature value as a candidate split point, the algorithm only evaluates the bin boundaries. This significantly reduces computational complexity and memory usage, making it feasible to train models on large datasets.
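The binning step can be illustrated directly. The following sketch is conceptual only, not Scikit-Learn's internal code: it discretizes one continuous feature into 255 roughly equal-frequency bins with NumPy, after which split points need only be searched over the bin boundaries rather than over every unique value:
Python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=10_000)  # one continuous feature

# Compute 254 interior bin edges at evenly spaced quantiles,
# yielding 255 roughly equal-frequency bins
edges = np.quantile(feature, np.linspace(0, 1, 256)[1:-1])

# Map each value to its bin index; a bin index fits in a single byte,
# which is also what makes this approach memory-friendly
binned = np.searchsorted(edges, feature).astype(np.uint8)

print(binned.min(), binned.max())  # bin indices range from 0 to 254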
Key Features of HistGradientBoostingClassifier
- Efficiency: By using histograms, the HistGradientBoostingClassifier can handle large datasets more efficiently than traditional gradient boosting methods. This is particularly beneficial when dealing with tens of thousands of samples.
- Handling Missing Data: The classifier has built-in support for missing values, which allows it to handle datasets with incomplete data without requiring imputation (see the sketch after this list).
- Scalability: The algorithm is designed to scale well with the number of samples and features, making it suitable for high-dimensional data.
- Experimental to Stable: Initially introduced as an experimental feature in Scikit-Learn v0.21.0, the HistGradientBoostingClassifier became a stable estimator in v1.0.0.
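As a quick illustration of the built-in missing-value support, this hedged example (synthetic data, illustrative names) masks 10% of the entries as NaN and fits the classifier directly, with no imputation step:
Python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Randomly mask 10% of the entries as missing
rng = np.random.default_rng(0)
mask = rng.random(X.shape) < 0.10
X[mask] = np.nan

# The estimator handles NaNs natively: during training it learns which
# child node samples with missing values should be routed to
clf = HistGradientBoostingClassifier(random_state=0).fit(X, y)
print(f'Training accuracy: {clf.score(X, y):.3f}')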
Implementing HistGradientBoostingClassifier in Sklearn
In Scikit-Learn versions prior to 1.0, the HistGradientBoostingClassifier was experimental and had to be enabled explicitly before importing:
Python
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
From version 1.0 onward, the estimator is stable and can be imported directly from sklearn.ensemble without the experimental import. Here is a basic example of how to use the HistGradientBoostingClassifier for a classification task:
Python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=10000, n_features=100, n_informative=50,
                           n_redundant=50, random_state=1)

# Hold out 20% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with up to 255 bins per feature and 100 boosting iterations
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.3f}')
Output:
Accuracy: 0.950
Performance Evaluation
The performance of the HistGradientBoostingClassifier can be evaluated more robustly with repeated stratified k-fold cross-validation, which averages accuracy over many train/test splits.
Python
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from numpy import mean, std

# Define the evaluation procedure: 10-fold CV repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Run all folds in parallel and report mean and standard deviation
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print(f'Accuracy: {mean(n_scores):.3f} ({std(n_scores):.3f})')
Output:
Accuracy: 0.948 (0.003)
Comparison with Other Libraries
The HistGradientBoostingClassifier can be compared with other popular gradient boosting libraries such as XGBoost and LightGBM, both of which also support histogram-based gradient boosting and offer high efficiency and scalability.
1. XGBoost
XGBoost is known for its speed and performance. It uses a similar histogram-based approach to improve the efficiency of gradient boosting.
Python
from xgboost import XGBClassifier

# Initialize the XGBoost classifier with its histogram-based tree method
xgb_model = XGBClassifier(tree_method='hist', max_bin=255, n_estimators=100)

# Evaluate with the same cross-validation procedure as above
xgb_scores = cross_val_score(xgb_model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print(f'XGBoost Accuracy: {mean(xgb_scores):.3f} ({std(xgb_scores):.3f})')
Output:
XGBoost Accuracy: 0.947 (0.004)
2. LightGBM
LightGBM is another highly efficient gradient boosting library that uses histogram-based techniques.
Python
from lightgbm import LGBMClassifier

# Initialize the LightGBM classifier with the same bin and iteration budget
lgbm_model = LGBMClassifier(max_bin=255, n_estimators=100)

# Evaluate with the same cross-validation procedure as above
lgbm_scores = cross_val_score(lgbm_model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print(f'LightGBM Accuracy: {mean(lgbm_scores):.3f} ({std(lgbm_scores):.3f})')
Output:
LightGBM Accuracy: 0.948 (0.003)
Handling Imbalanced Data with HistGradientBoostingClassifier
One of the challenges with the HistGradientBoostingClassifier is handling imbalanced datasets: while the classifier performs well when classes are balanced, its performance can degrade when one class heavily outnumbers the other. To address this, you can use the class_weight parameter, introduced in Scikit-Learn version 1.2:
Python
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100, class_weight='balanced')
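To make this concrete, here is a minimal, hedged sketch on a synthetic imbalanced dataset (the 95/5 split and all names below are illustrative, not from the original example). Balanced accuracy is reported because plain accuracy can look deceptively high on skewed data; note that class_weight requires Scikit-Learn 1.2 or newer:
Python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score

# Synthetic dataset with a 95/5 class imbalance (illustrative)
X_imb, y_imb = make_classification(n_samples=10000, n_features=20,
                                   weights=[0.95, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_imb, y_imb, test_size=0.2,
                                          stratify=y_imb, random_state=42)

# 'balanced' reweights samples inversely to class frequency
clf = HistGradientBoostingClassifier(class_weight='balanced', random_state=0)
clf.fit(X_tr, y_tr)

print(f'Balanced accuracy: {balanced_accuracy_score(y_te, clf.predict(X_te)):.3f}')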
Conclusion
The HistGradientBoostingClassifier in Scikit-Learn is a powerful tool for efficient and scalable gradient boosting. Its ability to handle large datasets and missing values makes it a versatile choice for many machine learning tasks. By leveraging histogram-based techniques, it offers significant performance improvements over traditional gradient boosting methods. For those dealing with imbalanced datasets, the class_weight parameter provides a way to improve model performance.