HistGradientBoostingClassifier in Sklearn
Last Updated: 06 Jun, 2024
The HistGradientBoostingClassifier is an advanced implementation of the Gradient Boosting algorithm provided by the Scikit-Learn library. It leverages histogram-based techniques to enhance the efficiency and scalability of gradient boosting, making it particularly suitable for large datasets. This article delves into the key features, advantages, and practical applications of the HistGradientBoostingClassifier.
What is Histogram-Based Gradient Boosting?
Gradient Boosting is an ensemble machine learning technique that builds models sequentially, with each new model attempting to correct the errors made by the previous ones. It is widely used for both classification and regression tasks due to its high predictive accuracy.
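To see this sequential error-correction idea in code, here is a minimal sketch of boosting for squared-error regression (the simplest case; classification variants fit trees to gradients of a log loss instead). This illustrates the principle only and is not Scikit-Learn's internal implementation; the toy data is made up:
Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Start from a constant prediction, then let each shallow tree
# fit the residual errors left by the ensemble so far
pred = np.full_like(y, y.mean())
learning_rate = 0.1
for _ in range(100):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)
    pred += learning_rate * tree.predict(X)

print(f'Final training MSE: {np.mean((y - pred) ** 2):.4f}')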
Histogram-based gradient boosting is a variant that improves the efficiency of the traditional gradient boosting algorithm by discretizing continuous input features into bins (histograms). Instead of evaluating every unique feature value as a candidate split point, the algorithm only evaluates the bin boundaries. This significantly reduces computational complexity and memory usage, making it feasible to train models on large datasets.
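The binning step can be illustrated directly. The following sketch is conceptual only, not Scikit-Learn's internal code: it discretizes one continuous feature into 255 roughly equal-frequency bins with NumPy, after which split points need only be searched over the bin boundaries rather than over every unique value:
Python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=10_000)  # one continuous feature

# Compute 254 interior bin edges at evenly spaced quantiles,
# yielding 255 roughly equal-frequency bins
edges = np.quantile(feature, np.linspace(0, 1, 256)[1:-1])

# Map each value to its bin index; a bin index fits in a single byte,
# which is also what makes this approach memory-friendly
binned = np.searchsorted(edges, feature).astype(np.uint8)

print(binned.min(), binned.max())  # bin indices range from 0 to 254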
Key Features of HistGradientBoostingClassifier
- Efficiency: By using histograms, the HistGradientBoostingClassifier can handle large datasets more efficiently than traditional gradient boosting methods. This is particularly beneficial when dealing with tens of thousands of samples.
- Handling Missing Data: The classifier has built-in support for missing values, which allows it to handle datasets with incomplete data without requiring imputation (see the sketch after this list).
- Scalability: The algorithm is designed to scale well with the number of samples and features, making it suitable for high-dimensional data.
- Experimental to Stable: Initially introduced as an experimental feature in Scikit-Learn v0.21.0, the HistGradientBoostingClassifier became a stable estimator in v1.0.0.
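As a quick illustration of the built-in missing-value support, this hedged example (synthetic data, illustrative names) masks 10% of the entries as NaN and fits the classifier directly, with no imputation step:
Python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Randomly mask 10% of the entries as missing
rng = np.random.default_rng(0)
mask = rng.random(X.shape) < 0.10
X[mask] = np.nan

# The estimator handles NaNs natively: during training it learns which
# child node samples with missing values should be routed to
clf = HistGradientBoostingClassifier(random_state=0).fit(X, y)
print(f'Training accuracy: {clf.score(X, y):.3f}')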
Implementing HistGradientBoostingClassifier in Sklearn
In Scikit-Learn versions prior to 1.0, the HistGradientBoostingClassifier was experimental and had to be enabled explicitly before importing:
Python
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
From version 1.0 onward, the estimator is stable and can be imported directly from sklearn.ensemble without the experimental import. Here is a basic example of how to use the HistGradientBoostingClassifier for a classification task:
Python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=10000, n_features=100, n_informative=50,
                           n_redundant=50, random_state=1)

# Hold out 20% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with up to 255 bins per feature and 100 boosting iterations
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.3f}')
Output:
Accuracy: 0.950
Performance Evaluation
The performance of the HistGradientBoostingClassifier can be evaluated more robustly with repeated stratified k-fold cross-validation, which averages accuracy over many train/test splits.
Python
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from numpy import mean, std

# Define the evaluation procedure: 10-fold CV repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Run all folds in parallel and report mean and standard deviation
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print(f'Accuracy: {mean(n_scores):.3f} ({std(n_scores):.3f})')
Output:
Accuracy: 0.948 (0.003)
Comparison with Other Libraries
The HistGradientBoostingClassifier can be compared with other popular gradient boosting libraries such as XGBoost and LightGBM, both of which also support histogram-based gradient boosting and offer high efficiency and scalability.
1. XGBoost
XGBoost is known for its speed and performance. It uses a similar histogram-based approach to improve the efficiency of gradient boosting.
Python
from xgboost import XGBClassifier

# Initialize the XGBoost classifier with its histogram-based tree method
xgb_model = XGBClassifier(tree_method='hist', max_bin=255, n_estimators=100)

# Evaluate with the same cross-validation procedure as above
xgb_scores = cross_val_score(xgb_model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print(f'XGBoost Accuracy: {mean(xgb_scores):.3f} ({std(xgb_scores):.3f})')
Output:
XGBoost Accuracy: 0.947 (0.004)
2. LightGBM
LightGBM is another highly efficient gradient boosting library that uses histogram-based techniques.
Python
from lightgbm import LGBMClassifier

# Initialize the LightGBM classifier with the same bin and iteration budget
lgbm_model = LGBMClassifier(max_bin=255, n_estimators=100)

# Evaluate with the same cross-validation procedure as above
lgbm_scores = cross_val_score(lgbm_model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print(f'LightGBM Accuracy: {mean(lgbm_scores):.3f} ({std(lgbm_scores):.3f})')
Output:
LightGBM Accuracy: 0.948 (0.003)
Handling Imbalanced Data with HistGradientBoostingClassifier
One of the challenges with the HistGradientBoostingClassifier is handling imbalanced datasets: while the classifier performs well when classes are balanced, its performance can degrade when one class heavily outnumbers the other. To address this, you can use the class_weight parameter, introduced in Scikit-Learn version 1.2:
Python
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100, class_weight='balanced')
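To make this concrete, here is a minimal, hedged sketch on a synthetic imbalanced dataset (the 95/5 split and all names below are illustrative, not from the original example). Balanced accuracy is reported because plain accuracy can look deceptively high on skewed data; note that class_weight requires Scikit-Learn 1.2 or newer:
Python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score

# Synthetic dataset with a 95/5 class imbalance (illustrative)
X_imb, y_imb = make_classification(n_samples=10000, n_features=20,
                                   weights=[0.95, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_imb, y_imb, test_size=0.2,
                                          stratify=y_imb, random_state=42)

# 'balanced' reweights samples inversely to class frequency
clf = HistGradientBoostingClassifier(class_weight='balanced', random_state=0)
clf.fit(X_tr, y_tr)

print(f'Balanced accuracy: {balanced_accuracy_score(y_te, clf.predict(X_te)):.3f}')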
Conclusion
The HistGradientBoostingClassifier in Scikit-Learn is a powerful tool for efficient and scalable gradient boosting. Its ability to handle large datasets and missing values makes it a versatile choice for many machine learning tasks. By leveraging histogram-based techniques, it offers significant performance improvements over traditional gradient boosting methods. For those dealing with imbalanced datasets, the class_weight parameter provides a way to improve model performance.