Gradient Boosting is an ensemble learning method used for classification and regression tasks. It is a boosting algorithm that combines multiple weak learners into a single strong predictive model. It works by sequentially training models, where each new model tries to correct the errors made by its predecessors.
In gradient boosting, each new model is trained to reduce a loss function, such as mean squared error or cross-entropy, evaluated on the current ensemble's predictions. In each iteration the algorithm computes the negative gradient of the loss function with respect to the predictions and trains a new weak model to fit it. The new model's predictions are then added to the ensemble (the sum of all models' predictions) and the process is repeated until a stopping criterion is met.
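In symbols, if F_{m-1}(x) denotes the ensemble after m-1 iterations, L(y, F) the loss function and h_m the new weak learner fitted at step m, the update can be written as:
h_m(x) \approx -\left[\frac{\partial L(y, F(x))}{\partial F(x)}\right]_{F = F_{m-1}}, \qquad F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)
where \eta is the learning rate discussed in the next section.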
Shrinkage and Model Complexity
A key feature of Gradient Boosting is shrinkage, which scales the contribution of each new model by a learning rate (denoted as η).
- Smaller learning rates: the contribution of each tree is smaller, which reduces the risk of overfitting but requires more trees to achieve the same performance.
- Larger learning rates: each tree has a more significant impact, but this can lead to overfitting.
There is a trade-off between the learning rate and the number of estimators (trees): a smaller learning rate usually means more trees are required to reach optimal performance, as illustrated in the sketch below.
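As a rough illustration of this trade-off, the following sketch (with hypothetical parameter values, assuming scikit-learn is available and using the Diabetes dataset purely for demonstration) compares a small learning rate with many trees against a larger learning rate with fewer trees:
Python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Small learning rate paired with many trees (illustrative values)
slow = GradientBoostingRegressor(learning_rate=0.05, n_estimators=600, random_state=0)
# Larger learning rate with fewer trees
fast = GradientBoostingRegressor(learning_rate=0.3, n_estimators=100, random_state=0)

for name, model in [("lr=0.05, 600 trees", slow), ("lr=0.30, 100 trees", fast)]:
    score = cross_val_score(model, X, y, cv=5).mean()   # default scoring is R^2
    print("{}: mean CV R^2 = {:.3f}".format(name, score))
In practice the two configurations often reach similar scores, but the small-learning-rate model typically needs many more trees to get there.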
Working of Gradient Boosting
1. Sequential Learning Process
The ensemble consists of multiple trees, each trained to correct the errors of the previous one. In the first iteration, Tree 1 is trained on the original features x and the true labels y. Its predictions are used to compute the errors (residuals).
2. Residuals Calculation
In the second iteration, Tree 2 is trained using the same feature matrix x, with the errors from Tree 1 as its labels. This means Tree 2 learns to predict the errors of Tree 1. The process continues for all the trees in the ensemble: each subsequent tree is trained to predict the remaining errors of the ensemble built so far.
[Figure: Gradient Boosted Trees]
3. Shrinkage
After each tree is trained, its predictions are shrunk by multiplying them by the learning rate η, which ranges from 0 to 1. This prevents overfitting by ensuring each individual tree has a smaller impact on the final model.
Once all trees are trained, predictions are made by summing the contributions of all the trees. The final prediction is given by the formula:
y_{\text{pred}} = y_1 + \eta \cdot r_1 + \eta \cdot r_2 + \cdots + \eta \cdot r_N
where y_1 is the prediction of the first tree and r_1, r_2, \dots, r_N are the errors (residuals) predicted by the subsequent trees.
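The loop below is a minimal sketch of this procedure written from scratch with shallow scikit-learn regression trees (the dataset, tree depth and number of rounds are illustrative choices, not part of the original example):
Python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
eta, n_rounds = 0.1, 100                      # learning rate and boosting rounds

# Tree 1 is fit on the true labels; its predictions start the ensemble
first_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
y_pred = first_tree.predict(X)

trees = []
for _ in range(n_rounds):
    residuals = y - y_pred                    # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    y_pred += eta * tree.predict(X)           # add the shrunken correction
    trees.append(tree)

print("Training MSE:", np.mean((y - y_pred) ** 2))
Each iteration fits a tree to the current residuals and adds its shrunken predictions, which corresponds term by term to the sum in the formula above.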
Difference between AdaBoost and Gradient Boosting
The main differences between AdaBoost and Gradient Boosting are summarized in the table below; a short code comparison follows the table.
Features | AdaBoost | Gradient Boosting
--- | --- | ---
Weight Update Strategy | Increases the weights of misclassified samples so that the next learner focuses more on them. | Updates predictions by minimizing a loss function using the negative gradient.
Base learners | Uses simple one-split decision trees, known as decision stumps, as weak learners. | Can use a wide range of base learners, such as decision trees and linear models.
Sensitivity to Noise | More sensitive to noisy data and outliers due to aggressive re-weighting. | Less sensitive, as it smooths updates using gradients.
Optimization Technique | No explicit loss function; it focuses on classification error. | Explicitly minimizes a differentiable loss function.
Boosting Mechanism | Learners are trained sequentially with sample re-weighting. | Learners are trained sequentially with residual fitting (gradient descent).
Interpretability | Easier to interpret due to simple weak learners. | Harder to interpret if complex models are used.
Use case | Suitable for clean datasets with fewer outliers. | Suitable for complex problems with varying loss functions.
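To make the contrast concrete, the sketch below (with illustrative settings; it assumes scikit-learn and reuses the Digits dataset from the classification example later in this article) trains both ensembles on the same data:
Python
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# AdaBoost re-weights samples; by default it boosts decision stumps
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
# Gradient Boosting fits each new tree to the negative gradient of the loss
gb = GradientBoostingClassifier(n_estimators=100, random_state=0)

for name, model in [("AdaBoost", ada), ("Gradient Boosting", gb)]:
    acc = cross_val_score(model, X, y, cv=3).mean()
    print("{}: mean CV accuracy = {:.3f}".format(name, acc))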
Implementing Gradient Boosting for Classification and Regression
Here are two examples demonstrating how Gradient Boosting works for both classification and regression. Before that, let's understand the main gradient boosting parameters.
- n_estimators: This specifies the number of trees (estimators) to be built. A higher value typically improves model performance but increases computation time.
- learning_rate: This is the shrinkage parameter. It scales the contribution of each tree.
- random_state: It ensures reproducibility of results. Setting a fixed value for random_state ensures that you get the same results every time you run the model.
- max_features: This parameter limits the number of features each tree can use for splitting. It helps prevent overfitting by limiting the complexity of each tree and promoting diversity in the model.
Now we start building our models with Gradient Boosting.
Example 1: Classification
We use the Gradient Boosting Classifier to predict digits from the Digits dataset.
Steps:
- Import the necessary libraries
- Set a SEED for reproducibility.
- Load the Digits dataset and split it into train and test sets.
- Instantiate Gradient Boosting classifier and fit the model.
- Predict the test set and compute the accuracy score.
Python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_digits

SEED = 23

# Load the Digits dataset and hold out 25% of it for testing
X, y = load_digits(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=SEED)

# 300 trees, each scaled by a learning rate of 0.05,
# with at most 5 features considered per split
gbc = GradientBoostingClassifier(n_estimators=300,
                                 learning_rate=0.05,
                                 random_state=100,
                                 max_features=5)
gbc.fit(train_X, train_y)

# Evaluate on the held-out test set
pred_y = gbc.predict(test_X)
acc = accuracy_score(test_y, pred_y)
print("Gradient Boosting Classifier accuracy is : {:.2f}".format(acc))
Output:
Gradient Boosting Classifier accuracy is : 0.98
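As an optional follow-up (a sketch that assumes gbc, test_X and test_y from the code above are still in scope), staged_predict reports the ensemble's predictions after each tree, which makes the effect of n_estimators visible:
Python
from sklearn.metrics import accuracy_score

# Accuracy after every 50 trees, using gbc, test_X and test_y from above
for i, staged_pred in enumerate(gbc.staged_predict(test_X), start=1):
    if i % 50 == 0:
        print("{} trees: accuracy = {:.3f}".format(i, accuracy_score(test_y, staged_pred)))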
Example 2: Regression
We use the Gradient Boosting Regressor on the Diabetes dataset to predict continuous values:
Steps:
- Import the necessary libraries
- Set a SEED for reproducibility.
- Load the Diabetes dataset and split it into train and test sets.
- Instantiate Gradient Boosting Regressor and fit the model.
- Predict on the test set and compute RMSE.
Python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_diabetes

SEED = 23

# Load the Diabetes dataset and hold out 25% of it for testing
X, y = load_diabetes(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=SEED)

# 300 depth-1 trees (stumps) trained on the absolute-error loss,
# with at most 5 features considered per split
gbr = GradientBoostingRegressor(loss='absolute_error',
                                learning_rate=0.1,
                                n_estimators=300,
                                max_depth=1,
                                random_state=SEED,
                                max_features=5)
gbr.fit(train_X, train_y)

# Evaluate with root mean squared error
pred_y = gbr.predict(test_X)
test_rmse = mean_squared_error(test_y, pred_y) ** (1 / 2)
print('Root mean Square error: {:.2f}'.format(test_rmse))
Output:
Root mean Square error: 56.39
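As a small follow-up (again a sketch, assuming gbr from the code above is still in scope), the fitted regressor exposes feature_importances_, which shows which inputs drive its predictions:
Python
from sklearn.datasets import load_diabetes

# Rank the Diabetes features by importance, using gbr from the example above
feature_names = load_diabetes().feature_names
ranked = sorted(zip(feature_names, gbr.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:3]:
    print("{}: {:.3f}".format(name, importance))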
Gradient Boosting is an effective and widely used machine learning technique for both classification and regression problems. It builds models sequentially, focusing on correcting the errors made by previous models, which leads to improved performance.