HistGradientBoostingClassifier in Sklearn

Last Updated : 06 Jun, 2024

The HistGradientBoostingClassifier is an advanced implementation of the Gradient Boosting algorithm provided by the Scikit-Learn library. It leverages histogram-based techniques to enhance the efficiency and scalability of gradient boosting, making it particularly suitable for large datasets. This article delves into the key features, advantages, and practical applications of the HistGradientBoostingClassifier.

What is Histogram-Based Gradient Boosting?

Gradient Boosting is an ensemble machine learning technique that builds models sequentially, with each new model attempting to correct the errors made by the previous ones. It is widely used for both classification and regression tasks due to its high predictive accuracy.
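
To make the sequential error-correction idea concrete, here is a toy, hand-rolled sketch for regression with squared loss, where each new tree is fit to the residuals of the current ensemble. The depth, learning rate, and number of rounds are illustrative choices, not what HistGradientBoostingClassifier uses internally:

Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y)  # start from a constant (zero) model
for _ in range(50):
    residual = y - prediction  # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)

print(f'Training MSE after boosting: {np.mean((y - prediction) ** 2):.4f}')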

Histogram-based gradient boosting is a variant that improves the efficiency of the traditional gradient boosting algorithm by discretizing continuous input features into bins (histograms).

  • This approach significantly reduces computational complexity and memory usage.
  • It makes training on large datasets feasible (a short binning sketch follows this list).
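
As a rough illustration of the binning step, KBinsDiscretizer can mimic the discretization of a continuous feature into at most 255 bins, matching the estimator's default max_bins=255. This is purely for intuition; the estimator bins features internally and far more efficiently:

Python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 1))  # one continuous feature

# Discretize into 255 quantile bins, mirroring the default max_bins=255
binner = KBinsDiscretizer(n_bins=255, encode='ordinal', strategy='quantile')
X_binned = binner.fit_transform(X)

print(X[:3].ravel())         # raw continuous values
print(X_binned[:3].ravel())  # integer bin indices that tree splits operate on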

Key Features of HistGradientBoostingClassifier

  1. Efficiency: By using histograms, the HistGradientBoostingClassifier handles large datasets far more efficiently than traditional gradient boosting methods. The advantage is most pronounced on datasets with tens of thousands of samples or more.
  2. Handling Missing Data: The classifier has built-in support for missing values, so it can train on incomplete data without requiring imputation (see the sketch after this list).
  3. Scalability: The algorithm is designed to scale well with the number of samples and features, making it suitable for high-dimensional data.
  4. Experimental to Stable: Initially introduced as an experimental feature in Scikit-Learn v0.21.0, the HistGradientBoostingClassifier became a stable estimator in v1.0.0.
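
The snippet below is a minimal sketch of the missing-value support from point 2: NaNs are passed to fit directly, with no imputation step. The dataset and the fraction of missing values are illustrative:

Python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Randomly blank out 10% of the values to simulate missing data
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.10] = np.nan

# No imputation step is needed: during training, samples with missing
# values are sent to whichever side of each split reduces the loss more
clf = HistGradientBoostingClassifier(random_state=0).fit(X, y)
print(f'Training accuracy with 10% missing values: {clf.score(X, y):.3f}')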

Implementing HistGradientBoostingClassifier in Sklearn

In scikit-learn versions before 1.0, the estimator was experimental and required an explicit enabling import. From version 1.0 onwards you can import it directly from sklearn.ensemble:

Python
# The first import is only needed for scikit-learn < 1.0;
# in later versions it is unnecessary (and emits a warning)
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

Here is a basic example of how to use the HistGradientBoostingClassifier for a classification task:

Python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic binary classification dataset
X, y = make_classification(n_samples=10000, n_features=100,
                           n_informative=50, n_redundant=50, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# max_bins sets the number of histogram bins per feature,
# max_iter the number of boosting iterations (trees)
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.3f}')

Output:

Accuracy: 0.950

Performance Evaluation

The performance of the HistGradientBoostingClassifier can be evaluated more robustly with repeated stratified cross-validation. The snippet below reuses the model, X, and y defined above.

Python
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from numpy import mean, std

# Define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print(f'Accuracy: {mean(n_scores):.3f} ({std(n_scores):.3f})')

Output:

Accuracy: 0.948 (0.003)

Comparison with Other Libraries

The HistGradientBoostingClassifier can be compared with other popular gradient boosting libraries like XGBoost and LightGBM. Both of these libraries also support histogram-based gradient boosting and offer high efficiency and scalability.

1. XGBoost

XGBoost is known for its speed and performance. Its tree_method='hist' option uses the same histogram-based splitting strategy to speed up training.

Python
from xgboost import XGBClassifier

# tree_method='hist' selects XGBoost's histogram-based tree builder;
# max_bin matches the 255-bin setting used above
xgb_model = XGBClassifier(tree_method='hist', max_bin=255, n_estimators=100)
xgb_scores = cross_val_score(xgb_model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print(f'XGBoost Accuracy: {mean(xgb_scores):.3f} ({std(xgb_scores):.3f})')

Output:

XGBoost Accuracy: 0.947 (0.004)

2. LightGBM

LightGBM is another highly efficient gradient boosting library that uses histogram-based techniques.

Python
from lightgbm import LGBMClassifier

# Initialize the LightGBM classifier
lgbm_model = LGBMClassifier(max_bin=255, n_estimators=100)
lgbm_scores = cross_val_score(lgbm_model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print(f'LightGBM Accuracy: {mean(lgbm_scores):.3f} ({std(lgbm_scores):.3f})')

Output:

LightGBM Accuracy: 0.948 (0.003)

Handling Imbalanced Data with HistGradientBoostingClassifier

One challenge with the HistGradientBoostingClassifier is imbalanced datasets: while it performs well on balanced data, its performance can degrade when one class dominates. To address this, you can use the class_weight parameter, added in scikit-learn version 1.2:

Python
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100, class_weight='balanced')
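
Here is a quick sketch of the effect on a synthetic imbalanced dataset (requires scikit-learn >= 1.2; the dataset, parameters, and exact scores are illustrative and will vary):

Python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

# Synthetic dataset where roughly 95% of samples belong to one class
X, y = make_classification(n_samples=10000, n_features=20,
                           weights=[0.95, 0.05], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

for cw in (None, 'balanced'):
    clf = HistGradientBoostingClassifier(class_weight=cw, random_state=0)
    clf.fit(X_train, y_train)
    score = balanced_accuracy_score(y_test, clf.predict(X_test))
    print(f'class_weight={cw!r}: balanced accuracy = {score:.3f}')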

Conclusion

The HistGradientBoostingClassifier in Scikit-Learn is a powerful tool for efficient and scalable gradient boosting. Its ability to handle large datasets and missing values makes it a versatile choice for many machine learning tasks. By leveraging histogram-based techniques, it offers significant performance improvements over traditional gradient boosting methods. For those dealing with imbalanced datasets, the class_weight parameter provides a way to improve model performance.

