Unit-I (Ensemble Learning)
Ensemble Learning
What is Ensemble Learning?
Ensemble learning is a machine learning technique that combines the predictions from
multiple individual models to obtain a better predictive performance than any single model.
The basic idea behind ensemble learning is to leverage the wisdom of the crowd by
aggregating the predictions of multiple models, each of which may have its own strengths
and weaknesses. This can lead to improved performance and generalization.
Ensemble learning can be thought of as compensating for weak learning algorithms by performing extra computation: an ensemble is computationally more expensive than a single model, but it is often more efficient than a single non-ensemble model that has been given a comparable amount of extra training.
Several individual base models (experts) are fitted to learn from the same data and produce
an aggregation of output based on which a final decision is taken. These base models can
be machine learning algorithms such as decision trees (mostly used), linear models, support
vector machines (SVM), neural networks, or any other model that is capable of making
predictions.
The most commonly used ensembles include techniques such as Bagging, which is used to build Random Forest models, and Boosting, which is used to build algorithms such as AdaBoost and XGBoost.
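As a quick illustration (a minimal scikit-learn sketch on a synthetic dataset; the dataset and parameter values are illustrative and not taken from the text), a bagging ensemble and a boosting ensemble can be trained and compared in a few lines:

```python
# Minimal sketch: a bagging ensemble (Random Forest) and a boosting ensemble (AdaBoost)
# trained on a synthetic classification dataset. Dataset and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

bagging_model = RandomForestClassifier(n_estimators=100, random_state=42)   # bagging of decision trees
boosting_model = AdaBoostClassifier(n_estimators=100, random_state=42)      # sequential boosting of stumps

for name, model in [("Random Forest (bagging)", bagging_model),
                    ("AdaBoost (boosting)", boosting_model)]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))
```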
Key advantages of ensemble learning:
1. Improved Accuracy
By combining the outputs of multiple models, ensemble methods can reduce errors in
prediction. Techniques like bagging and boosting (e.g., Random Forests, Gradient
Boosting) have been proven to yield models with significantly higher accuracy than
individual models alone.
2. Reduction of Overfitting
Some ensemble methods, like bagging, help reduce overfitting. For instance, Random
Forests combine the predictions of multiple decision trees trained on different subsets of
data, making the final model less likely to fit to noise in any single subset, resulting in
better generalization on new data.
3. Reduction of Variance
Ensemble methods reduce variance, especially in models prone to high variance (e.g.,
decision trees). Aggregating predictions from multiple models can lead to a smoother, more
stable prediction curve, lowering the risk of extreme predictions that deviate due to outliers
or specific patterns in individual training samples.
4. Enhanced Robustness
By aggregating different models, ensemble learning creates a more robust model, meaning
it can handle anomalies or noise in the data better. This resilience is particularly valuable
when data quality is inconsistent.
5. Adaptability to Complex Data Patterns
Ensemble methods are particularly useful in capturing complex relationships in data. For
example, boosting models sequentially adapt by focusing more on hard-to-predict
instances, gradually improving the model's ability to learn complex patterns.
6. Flexibility with Different Algorithms
Ensembles can combine models from different algorithmic families (e.g., decision trees, logistic regression, neural networks), allowing diverse approaches to contribute to the final prediction. Techniques like stacking combine the strengths of various types of models, which is especially helpful when no single algorithm performs well on its own (a minimal stacking sketch is given after this list).
7. Better Performance with Small or Imbalanced Datasets
Some ensemble techniques, particularly boosting, can perform well even on small or
imbalanced datasets. Boosting methods focus on difficult instances, improving predictive
performance in scenarios where class imbalance or sparse data might otherwise lead to
poor performance.
8. Parallelizable (in Bagging)
Techniques like bagging can be trained in parallel, making it computationally efficient to
create a strong model by training independent models concurrently. This characteristic is
helpful when working with large datasets and high-dimensional data.
9. Diversity of Models Leading to Higher Generalization
When ensemble models are composed of diverse learners, they are more likely to
generalize well across various patterns in the data. Diversity in the models allows
ensembles to handle a broader range of inputs, leading to increased flexibility and
improved performance across multiple domains.
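As referenced in point 6, here is a minimal stacking sketch (scikit-learn's StackingClassifier on a synthetic dataset; the model choices and parameter values are illustrative only):

```python
# Minimal stacking sketch: models from different algorithmic families are combined,
# and a meta-learner (logistic regression) learns how to weigh their predictions.
# Dataset and model choices are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

base_learners = [
    ("tree_ensemble", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
stack = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression())
print("Stacked CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```

The meta-learner decides how much to trust each base model, which is what lets stacking exploit their complementary strengths.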
Types of boosting:
1. AdaBoost (Adaptive Boosting)
2. Gradient Tree Boosting
3. XGBoost
AdaBoost:
AdaBoost, short for Adaptive Boosting, is an ensemble learning technique used in machine learning for classification and regression problems. The main idea behind AdaBoost is to iteratively train weak classifiers on the training dataset, with each successive classifier giving more weight to the data points that were misclassified. The final AdaBoost model is obtained by combining all the weak classifiers used during training, weighted according to their accuracies: the weak model with the highest accuracy is given the highest weight, while the model with the lowest accuracy is given a lower weight.
Box 1: You can see that we have assigned equal weights to each data point and applied a decision stump to classify them as + (plus) or – (minus). The first decision stump (D1) has generated a vertical line on the left side to classify the data points. We see that this vertical line has incorrectly predicted three + (plus) as – (minus). In such a case, we will assign higher weights to these three + (plus) and apply another decision stump.
Box 2: Here, you can see that the size of the three incorrectly predicted + (plus) is bigger compared to the rest of the data points. In this case, the second decision stump (D2) will try to predict them correctly. Now, a vertical line (D2) on the right side of this box has classified the three misclassified + (plus) correctly. But again, it has caused misclassification errors, this time with three – (minus). Again, we will assign higher weights to the three – (minus) and apply another decision stump.
Box 3: Here, the three – (minus) are given higher weights. A decision stump (D3) is applied to predict these misclassified observations correctly. This time a horizontal line is generated to classify + (plus) and – (minus), based on the higher weights of the misclassified observations.
Box 4: Here, we have combined D1, D2 and D3 to form a strong prediction with a more complex rule than any individual weak learner. You can see that this combination has classified the observations quite well compared to any of the individual weak learners.
The weak classifier we are training should have an accuracy greater than 0.5, which means it should perform better than random guessing.
Step 3 – Calculate the error rate and importance of each weak model Mk
Calculate the error rate for every weak classifier Mk on the training dataset: error_k is the sum of the weights of the data points that Mk misclassifies.
After applying the weak classifier to the training data, we update the weights assigned to the points using the importance of the model, α_k = ½ ln((1 − error_k)/error_k). The formula for updating the weights is w_i ← w_i × e^(α_k) if point i is misclassified and w_i ← w_i × e^(−α_k) if it is correctly classified.
We normalize the instance weights so that they sum up to 1, using the formula w_i ← w_i / Σ_j w_j.
We train K classifiers, and at every round we calculate the model importance and update the instance weights using the formulas above.
The final model M(X) is an ensemble obtained by combining these weak models weighted by their model weights: M(X) = sign(Σ_k α_k Mk(X)).
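The steps above can be summarized in a short from-scratch sketch (a minimal illustration using depth-1 scikit-learn trees as the weak classifiers and binary labels in {−1, +1}; the variable names and the parameter K are illustrative, not from the text):

```python
# Minimal from-scratch AdaBoost sketch following the steps above (binary labels in {-1, +1}).
# Decision stumps (depth-1 trees) are used as the weak classifiers; X and y are numpy arrays.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, K=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # Step 1: equal instance weights summing to 1
    models, alphas = [], []
    for _ in range(K):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # Step 2: fit the weak learner on weighted data
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))             # Step 3: weighted error rate
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))   # model importance ("amount of say")
        w = w * np.exp(-alpha * y * pred)         # increase weights of misclassified points
        w = w / w.sum()                           # normalize so the weights sum to 1
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # Final model: sign of the alpha-weighted sum of the weak classifiers' votes
    votes = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(votes)
```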
Worked Example with a Decision Stump
The total error is nothing but the sum of the sample weights of the misclassified data points. Here in our dataset, let's assume there is 1 wrong output, so our total error will be 1/5, and the alpha (performance of the stump) will be α = ½ ln((1 − 1/5)/(1/5)) = ½ ln(4) ≈ 0.69.
Note: the total error will always be between 0 and 1; 0 indicates a perfect stump, and 1 indicates a horrible stump.
From the graph above, we can see that when there is no misclassification (Total Error = 0), the "amount of say" (alpha) will be a large positive number.
When the classifier predicts half right and half wrong (Total Error = 0.5), the importance (amount of say) of the classifier will be 0.
If all the samples have been incorrectly classified, then the error will be very high (close to 1), and hence the alpha value will be a large negative number.
Step 4: Calculate TE and Performance
You might be wondering about the significance of calculating the Total Error (TE) and
performance of an Adaboost stump. The reason is straightforward – updating the weights is crucial.
If identical weights are maintained for the subsequent model, the output will mirror what was
obtained in the initial model.
The wrong predictions will be given more weight, whereas the correct predictions weights will be
decreased. Now when we build our next model after updating the weights, more preference will
be given to the points with higher weights.
After finding the importance of the classifier and the total error, we finally need to update the weights, and for this we use the following formula: new sample weight = old weight × e^(±α).
The exponent (the amount of say, alpha) is taken as negative when the sample is correctly classified, and positive when the sample is misclassified.
There are four correctly classified samples and one wrong. Here, the sample weight of that data point is 1/5 = 0.2, and the amount of say/performance of the Gender stump is 0.69.
The new weight for each correctly classified sample is 0.2 × e^(−0.69) ≈ 0.1004, and for the misclassified sample it is 0.2 × e^(0.69) ≈ 0.3988.
Note: after substituting the values, the exponent of alpha is negative when the data point is correctly classified, which decreases the sample weight from 0.2 to 0.1004; it is positive when there is a misclassification, which increases the sample weight from 0.2 to 0.3988.
The sample weights must sum to 1, but if we add up all the new sample weights we get 0.8004. To bring this sum to 1, we normalize the weights by dividing each of them by the total sum of the updated weights, 0.8004. After normalizing the sample weights, the sum is equal to 1.
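These numbers can be verified directly (a short check using only the values from the worked example; small differences come from rounding alpha to 0.69):

```python
# Checking the worked example: total error = 1/5, initial sample weight = 0.2
import math

total_error = 1 / 5
alpha = 0.5 * math.log((1 - total_error) / total_error)   # "amount of say" ≈ 0.69

w_wrong   = 0.2 * math.exp(+alpha)   # misclassified point: 0.2 -> ≈ 0.40 (0.3988 in the text, which rounds alpha to 0.69)
w_correct = 0.2 * math.exp(-alpha)   # correct point:       0.2 -> ≈ 0.10 (0.1004 in the text)

total = w_wrong + 4 * w_correct      # ≈ 0.80 (0.8004 in the text); dividing by this sum normalizes the weights
print(round(alpha, 2), round(w_wrong, 4), round(w_correct, 4), round(total, 4))
```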
Resampling with these normalized weights gives our new dataset, and we see that the data point which was wrongly classified has been selected 3 times because it has a higher weight.
Step 7: Repeat Previous Steps
Now this acts as our new dataset, and we need to repeat all the above steps, i.e.:
Assign equal weights to all the data points.
Find the stump that does the best job of classifying the new collection of samples by computing their Gini indices and selecting the one with the lowest Gini index.
Calculate the "Amount of Say" and "Total Error" to update the previous sample weights.
Normalize the new sample weights.
Iterate through these steps until a low training error is achieved.
Finally, we need to talk about how the forest of stumps created by AdaBoost makes a classification.
Imagine that 6 stumps are created by the AdaBoost algorithm. Out of the 6 stumps, 4 classify the patient as ill, and the other 2 classify the patient as not ill. The Amounts of Say of the 4 "ill" stumps add up to 0.97 + 0.32 + 0.78 + 0.63 = 2.7, and the Amounts of Say of the other 2 stumps add up to 0.41 + 0.82 = 1.23.
Ultimately, the patient is classified as ill because of the larger total Amount of Say (2.7).
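The weighted vote above can be reproduced in a tiny snippet (values taken directly from the 6-stump example):

```python
# Weighted vote from the 6-stump example: "ill" stumps vs "not ill" stumps
say_ill     = [0.97, 0.32, 0.78, 0.63]   # stumps voting "patient is ill"
say_not_ill = [0.41, 0.82]               # stumps voting "patient is not ill"

prediction = "ill" if sum(say_ill) > sum(say_not_ill) else "not ill"
print(sum(say_ill), sum(say_not_ill), prediction)   # 2.7 vs 1.23 -> "ill"
```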
Suppose, with respect to our dataset, we have constructed 3 decision trees (DT1, DT2, DT3) in a sequential manner. If we send our test data now, it will pass through all the decision trees, and finally we will see which class has the majority; based on that, we will make predictions for our test dataset.
https://datamapu.com/posts/classical_ml/adaboost_example_reg/
Understanding the Working of the AdaBoost Algorithm (Regression):
Consider a dataset containing 10 samples. It includes the features 'age', 'likes height', and 'likes goats'. The target variable is 'climbed meters'; that is, we want to estimate how many meters a person has climbed depending on their age and on whether they like heights and goats.
We start by assigning weights to each sample. Initially, the weights are all equal to 1/N, with N the number of data samples; in our case the initial weights are 0.1 for all samples.
We now fit a Decision Tree with a maximum depth of three to this dataset.
Following the decision paths of the tree, we find that the samples (age=35, likes height=0, likes goats=0) and (age=42, likes height=0, likes goats=0) lead to wrong predictions. The true target values are 300m and 200m, respectively, but the predicted value is 250m in both cases. The other eight samples are correctly predicted. The total error is thus 0.2, the sum of the weights of the two wrongly predicted samples. The influence of this tree is therefore ½ ln((1 − 0.2)/0.2) ≈ 0.69, using the same formula as in the classification case.
Note that different implementations of the AdaBoost algorithm for regression exist.
Usually the prediction does not need to match exactly; instead a margin is given, and the prediction is counted as an error only if it falls outside this margin.
For the sake of simplicity, we keep this definition analogous to a classification problem. The main idea of calculating the influence of each tree remains the same, but the way the error is calculated exactly may differ between implementations.
The dataset with the updated weights assigned to each sample.
We now want a prediction for a new sample with age = 45, likes height = 0, likes goats = 1.
To make the final prediction, we need to consider the individual predictions of all the models. The weighted mean of these predictions is then the prediction of the constructed ensemble AdaBoost model, where the influence values are used as the weights. Following the decision path of the first tree results in a prediction of 300m, the second tree predicts 233.33m, and the third tree again predicts 300m. The final prediction is then calculated as the influence-weighted mean of these three values.
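For completeness, here is a minimal scikit-learn sketch of AdaBoost regression in the spirit of this example (the feature names follow the example, but most of the numeric values below are placeholders rather than the article's actual table; by default the base learner is a depth-3 decision tree, as above):

```python
# Minimal AdaBoost regression sketch with scikit-learn. The feature columns are
# (age, likes height, likes goats); only the first two rows echo values mentioned
# in the text, the rest are placeholders.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

X = np.array([[35, 0, 0], [42, 0, 0], [25, 1, 1], [30, 1, 0],
              [50, 0, 1], [28, 1, 1], [45, 0, 1], [38, 1, 0]])
y = np.array([300, 200, 700, 400, 350, 650, 500, 450])   # "climbed meters" (placeholder targets)

# The default base learner is a depth-3 decision tree, matching the example above.
model = AdaBoostRegressor(n_estimators=3, random_state=0)
model.fit(X, y)

print(model.predict([[45, 0, 1]]))   # prediction for the query sample age=45, likes height=0, likes goats=1
```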
Gradient Boosting:
The ensemble consists of M trees. Tree1 is trained using the feature matrix X and the labels y. Its predictions, labeled y1(hat), are used to determine the training-set residual errors r1. Tree2 is then trained using the feature matrix X and the residual errors r1 of Tree1 as labels. The predicted results r1(hat) are then used to determine the residuals r2. The process is repeated until all the M trees forming the ensemble are trained. There is an important parameter used in this technique known as shrinkage: the prediction of each tree in the ensemble is shrunk by multiplying it by the learning rate (eta), which ranges between 0 and 1. There is a trade-off between eta and the number of estimators; decreasing the learning rate needs to be compensated for by increasing the number of estimators in order to reach a certain model performance. Once all the trees are trained, predictions can be made: the final prediction is the initial prediction plus the learning-rate-scaled sum of the individual tree predictions.
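The procedure just described can be sketched from scratch in a few lines (a minimal illustration for squared-error regression; function and variable names are illustrative):

```python
# Minimal from-scratch gradient boosting sketch for regression (squared-error loss):
# start from the mean, repeatedly fit a tree to the residuals, and add its (shrunk) prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, eta=0.1, max_depth=3):
    f0 = y.mean()                                 # initial prediction (minimizes squared error)
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        residuals = y - pred                      # pseudo-residuals for squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                    # Tree_m is trained on the residuals
        pred += eta * tree.predict(X)             # shrinkage: scale each tree's contribution by eta
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(f0, trees, X, eta=0.1):
    # Final prediction: initial value plus the shrunk contribution of every tree
    return f0 + eta * sum(tree.predict(X) for tree in trees)
```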
Why did I say we take the average of the target column? Well, there is math involved in this.
Mathematically the first step can be written as:
Let's see how to do this with the help of our example. Remember that y_i is our observed value and γ is our predicted value; by plugging the values into the above formula we get:
We end up with the average of the observed car prices, and this is why I asked you to take the average of the target column and assume it to be your first prediction.
Hence, for gamma = 14500 the loss function is minimum, so this value becomes our prediction for the base model.
Step 2: Compute Pseudo Residuals
The next step is to calculate the pseudo residuals which are (observed value – predicted value).
Again the question comes why only observed – predicted? Everything is mathematically proven.
Let’s see where this formula comes from. This step can be written as:
Here F(x_i) is the previous model's prediction and m is the index of the decision tree being built.
We are just taking the derivative of loss function w.r.t the predicted value and we have already
calculated this derivative:
If you see the formula of residuals above, we see that the derivative of the loss function is
multiplied by a negative sign, so now we get:
The predicted value here is the prediction made by the previous model. In our example the
prediction made by the previous model (initial base model prediction) is 14500, to calculate the
residuals our formula becomes:
Here hm(xi) is the DT made on residuals and m is the number of DT. When m=1 we are talking
about the 1st DT and when it is “M” we are talking about the last DT.
The output value for the leaf is the value of gamma that minimizes the loss function. The left-hand side "gamma" is the output value of a particular leaf. On the right-hand side, F_{m-1}(x_i) + γ h_m(x_i) is similar to step 1, but here the difference is that we are taking the previous predictions into account, whereas earlier there was no previous prediction.
Example of Calculating Regression Tree Output
Let’s understand this even better with the help of an example. Suppose this is our regressor tree:
We see that the 1st residual goes in R1,1, the 2nd and 3rd residuals go in R2,1, and the 4th residual goes in R3,1.
Let's calculate the output for the first leaf, R1,1.
Now we need to find the value of gamma for which this function is minimum. So we take the derivative of this equation w.r.t. gamma and set it equal to 0.
Hence the leaf R1,1 has an output value of -2500. Now let's solve for R2,1.
Let's take the derivative to get the value of gamma for which this function is minimum:
We end up with the average of the residuals in the leaf R2,1. Hence, if we get any leaf with more than one residual, we can simply take the average of the residuals in that leaf, and that will be its output value.
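In symbols, the argument above for a squared-error loss is (with $R_{j,m}$ the $j$-th leaf of the $m$-th tree and $r_{i,m} = y_i - F_{m-1}(x_i)$ the residuals):

$$\gamma_{j,m} = \arg\min_{\gamma} \sum_{x_i \in R_{j,m}} \tfrac{1}{2}\bigl(y_i - (F_{m-1}(x_i) + \gamma)\bigr)^2, \qquad \frac{d}{d\gamma}: \; -\sum_{x_i \in R_{j,m}} (r_{i,m} - \gamma) = 0 \;\Longrightarrow\; \gamma_{j,m} = \frac{1}{|R_{j,m}|}\sum_{x_i \in R_{j,m}} r_{i,m},$$

i.e. the leaf output is simply the average of the residuals that fall into that leaf.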
Now after calculating the output of all the leaves, we get:
Here F_{m-1}(x) is the prediction of the previous model; since F_{1-1} = F_0 is our base model, the previous prediction is 14500.
nu (ν) is the learning rate, usually selected between 0 and 1. It reduces the effect each tree has on the final prediction, and this improves accuracy in the long run. Let's take nu = 0.1 in this example.
h_m(x) is the most recent DT made on the residuals.
Let’s calculate the new prediction now:
Suppose we want to find a prediction of our first data point which has a car height of 48.8. This
data point will go through this decision tree and the output it gets will be multiplied by the learning
rate and then added to the previous prediction.
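Concretely, using the values already computed above (base prediction 14500, leaf output -2500 for the first data point, learning rate ν = 0.1):

$$F_1(x_1) = F_0(x_1) + \nu\, h_1(x_1) = 14500 + 0.1 \times (-2500) = 14250.$$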
Now let’s say m=2 which means we have built 2 decision trees and now we want to have new
predictions.
This time we will add the previous prediction that is F1(x) to the new DT made on residuals. We
will iterate through these steps again and again till the loss is negligible.
I am taking a hypothetical example here just to help you understand how the model predicts for a new dataset: if a new data point comes in, say height = 1.40, it will go through all the trees and then give the prediction. Here we have only 2 trees, hence the data point will go through these 2 trees and the final output will be F2(x).
Gradient Boosting for Classification:
Note: Please bear in mind that we have rounded off everything to one decimal place here, and hence the log(odds) and the probability are the same, which may not always be the case.
If the probability of surviving is greater than 0.5, then we first classify everyone in the training
dataset as survivors. (0.5 is a common threshold used for classification decisions made based on
probability; note that the threshold can easily be taken as something else.)
Now we need to calculate the Pseudo Residual, i.e, the difference between the observed value and
the predicted value. Let us draw the residuals on a graph.
The blue and the yellow dots are the observed values. The blue dots are the passengers who did
not survive with the probability of 0 and the yellow dots are the passengers who survived with a
probability of 1. The dotted line here represents the predicted probability which is 0.7
We need to find the residual which would be :
Transformed tree
Now that we have transformed it, we can add our initial leaf to our new tree, scaled by a learning rate.
The learning rate is used to scale the contribution from the new tree. This results in a small step in the right direction of prediction. Empirical evidence has shown that taking lots of small steps in the right direction results in better predictions on a testing dataset, i.e., a dataset that the model has never seen, compared to making a perfect prediction in the 1st step. The learning rate is usually a small number like 0.1.
We can now calculate new log(odds) prediction and hence a new probability.
For example, for the first passenger, Old Tree = 0.7. Learning Rate which remains the same for all
records is equal to 0.1 and by scaling the new tree, we find its value to be -0.16. Hence, substituting
in the formula we get:
Similarly, we substitute and find the new log(odds) for each passenger and hence find the
probability. Using the new probability, we will calculate the new residuals.
This process repeats until we have made the maximum number of trees specified or the residuals
get super small.
A Mathematical Understanding
We shall go through each step, one at a time and try to understand them.
yi is observed value ( 0 or 1 ).
p is the predicted probability.
The goal would be to maximize the log likelihood function. Hence, if we want to use the log(likelihood) as our loss function, where smaller values represent better-fitting models, we need to multiply it by -1:
Substituting,
Now,
Hence,
Here, yi is the observed values, L is the loss function, and gamma is the value for log(odds).
We are summating the loss function i.e. we add up the Loss Function for each observed value.
argmin over gamma means that we need to find a log(odds) value that minimizes this sum.
Then, we take the derivative of each loss function:
… and so on.
Step 2: for m = 1 to M
(A)
This step needs you to calculate the residual using the given formula. We have already found the
Loss Function to be as :
Hence,
(B) Fit a regression tree to the residual values and create terminal regions
Because the number of leaves is limited, we might have more than one value in a particular terminal region.
In our first tree, m=1 and j will be the unique number for each terminal node. So R11, R21 and so
on.
(C)
For each leaf in the new tree, we calculate gamma, which is the output value. The summation should be only over those records which go into making that leaf. In theory, we could find the derivative with respect to gamma to obtain the value of gamma, but that could be extremely wearisome due to the many variables included in our loss function.
Substituting the loss function and i=1 in the equation above, we get:
There are three terms in our approximation. Taking derivative with respect to gamma gives us:
Equating this to 0 and subtracting the single derivative term from both the sides.
The gamma equation may look humongous but in simple terms, it is:
Now we shall solve for the second derivative of the Loss Function. After some heavy
computations, we get:
We have simplified the numerator as well as the denominator. The final gamma solution looks like:
We were trying to find the value of gamma that when added to the most recent predicted log(odds)
minimizes our Loss Function. This gamma works when our terminal region has only one residual
value and hence one predicted probability. But, do recall from our example above that because of
the restricted leaves in Gradient Boosting, it is possible that one terminal region has many values.
Then the generalized formula would be:
Hence, we have calculated the output values for each leaf in the tree.
(D)
This formula asks us to update our predictions now. In the first pass, m = 1, and we substitute F0(x), the common prediction for all samples (i.e. the initial leaf value), plus the learning rate ν times the output value of the tree we built previously. The summation is for the cases where a single sample ends up in multiple leaves.
Now we will use this new F1(x) value to get new predictions for each sample.
The new predicted value should get us a little closer to the actual value. It is to be noted that, in contrast to the single tree in our consideration, gradient boosting builds many trees, and M could be as large as 100 or more.
This completes the loop in Step 2 and we are ready for the final step of Gradient Boosting.
Step 3: Output
If we get new data, then we shall use this value to predict whether the passenger survived or not. This would give us the log(odds) that the person survived. Plugging it into the 'p' formula:
If the resultant value lies above our threshold then the person survived, else they did not.
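That final conversion can be written in a couple of lines (a minimal sketch; the example log(odds) value passed in is illustrative only):

```python
# Converting a final log(odds) prediction to a probability and applying the 0.5 threshold.
import math

def predict_survival(log_odds, threshold=0.5):
    p = math.exp(log_odds) / (1 + math.exp(log_odds))   # p = e^{log(odds)} / (1 + e^{log(odds)})
    return ("survived" if p > threshold else "did not survive"), p

print(predict_survival(0.9))    # example log(odds) value (illustrative only)
```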
Comparing and Contrasting AdaBoost and GradientBoost
Both AdaBoost and Gradient Boost learn sequentially from a weak set of learners. A strong learner
is obtained from the additive model of these weak learners. The main focus here is to learn from
the shortcomings at each step in the iteration.
AdaBoost requires users to specify a set of weak learners (alternatively, it will randomly generate a set of weak learners before the real learning process). It increases the weights of the wrongly predicted instances and decreases those of the correctly predicted instances. The weak learner thus focuses more on the difficult instances. After being trained, the weak learner is added to the strong one according to its performance (the so-called alpha weight). The better it performs, the more it contributes to the strong learner.
On the other hand, gradient boosting doesn’t modify the sample distribution. Instead of training
on a newly sampled distribution, the weak learner trains on the remaining errors of the strong
learner. It is another way to give more importance to the difficult instances. At each iteration, the
pseudo-residuals are computed and a weak learner is fitted to these pseudo-residuals. Then, the
contribution of the weak learner to the strong one isn’t computed according to its performance on
the newly distributed sample but using a gradient descent optimization process. The computed
contribution is the one minimizing the overall error of the strong learner.
Adaboost is more about ‘voting weights’ and gradient boosting is more about
‘adding gradient optimization’.
AdaBoost:
• An additive model where the shortcomings of previous models are identified by high-weight data points.
• The trees are usually grown as decision stumps.
• Each classifier has a different weight assigned to the final prediction based on its performance.
Gradient Boost:
• An additive model where the shortcomings of previous models are identified by the gradient.
• The trees are grown to a greater depth, usually ranging from 8 to 32 terminal nodes.
• All classifiers are weighed equally, and their predictive capacity is restricted with a learning rate to increase accuracy.
XGBoost:
It is an optimized distributed gradient boosting library designed for efficient and scalable
training of machine learning models. It is an ensemble learning method that combines
the predictions of multiple weak models to produce a stronger prediction. XGBoost
stands for “Extreme Gradient Boosting” and it has become one of the most popular and
widely used machine learning algorithms due to its ability to handle large datasets and
its ability to achieve state-of-the-art performance in many machine learning tasks such
as classification and regression.
One of the key features of XGBoost is its efficient handling of missing values, which
allows it to handle real-world data with missing values without requiring significant pre-
processing. Additionally, XGBoost has built-in support for parallel processing, making
it possible to train models on large datasets in a reasonable amount of time.
XGBoost Features
XGBoost is used for these two reasons: execution speed and model performance.
Execution speed is crucial because it's essential to working with large datasets. When you use
XGBoost, there are no restrictions on the size of your dataset, so you can work with datasets that
are larger than what would be possible with other algorithms.
Model performance is also essential because it allows you to create models that can perform
better than other models. XGBoost has been compared to different algorithms such as random
forest (RF), gradient boosting machines (GBM), and gradient boosting decision trees (GBDT).
These comparisons show that XGBoost outperforms these other algorithms in execution speed
and model performance.
We are given input features (X) and target feature (Y). Now we start with a default set of
predictions (by default set to 0.5 in both classification and regression but you can start from other
values as well)
Calculating Gain
• We now calculate the gain value for each of the splits created in step 3 as shown below.
We go through all of the splits in step 3 and then take the split which gave us the highest
gain. i.e. we select the one which best splits the observations.
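For reference, the standard XGBoost quantities for squared-error regression (consistent with the derivation in the 'Maths Behind XGBoost Algorithm' part of this unit, where the residual sum and residual count appear; summarized here since the original figures are not reproduced) are:

$$\text{Similarity} = \frac{\left(\sum_i r_i\right)^2}{n + \lambda}, \qquad \text{Gain} = \text{Similarity}_{\text{left}} + \text{Similarity}_{\text{right}} - \text{Similarity}_{\text{root}}, \qquad O_{value} = \frac{\sum_i r_i}{n + \lambda},$$

where $r_i$ are the residuals reaching the node, $n$ is their count, and a split is pruned when $\text{Gain} - \gamma < 0$.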
We go bottom-up through the tree while pruning. If the Gain of a parent node is less than the gamma (γ) value, then we prune its children (more formally, if Gain − γ < 0). We only move further up if we pruned at that particular point; otherwise we stop there.
In this case, if gamma were 130 we would not prune (dosage < 30), since gamma is less than the gain value. Since we didn't prune it, we can't go to its parent, i.e. (dosage < 15), to check for pruning.
Note: setting gamma to 0 does not turn off pruning, because there may be nodes with negative gain values, and Gain − γ will be < 0 in that case, leading to pruning.
Now we’ll calculate the output values for each of the child nodes in the tree using the following
formula.
Using this calculate the output values for each child node. Illustrated below.
Updating the residuals and getting the output
Now that we have a tree that can predict residuals, we can update our initial default vector of 0.5 predictions by adding the tree's residual predictions to it (multiplied by a learning rate, of course; we don't want to jump straight to the predicted value and overfit).
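In practice, all of the above (similarity-score splits, gamma-based pruning, lambda regularization, shrinkage) is handled by the xgboost library; a minimal usage sketch with a synthetic single-feature "dosage" dataset and illustrative parameter values:

```python
# Minimal XGBoost regression sketch: the library handles the similarity-score splits,
# gamma-based pruning, lambda regularization and shrinkage described above.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 40, size=(200, 1))          # a single "dosage"-like feature (illustrative data)
y = np.sin(X[:, 0] / 10) * 10 + rng.normal(0, 1, 200)

model = XGBRegressor(
    n_estimators=100,     # number of boosted trees
    learning_rate=0.3,    # eta (shrinkage)
    max_depth=6,          # trees are grown to this depth, then pruned backward
    gamma=0,              # minimum gain required to keep a split (pruning penalty)
    reg_lambda=1,         # lambda: L2 regularization on the leaf output values
    base_score=0.5,       # initial prediction (0.5 by default)
)
model.fit(X, y)
print(model.predict(np.array([[10.0]])))       # prediction for dosage = 10
```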
Advantages of XGBoost:
1. Performance: XGBoost has a strong track record of producing high-quality results in
various machine learning tasks, especially in Kaggle competitions, where it has been a
popular choice for winning solutions.
2. Scalability: XGBoost is designed for efficient and scalable training of machine learning
models, making it suitable for large datasets.
3. Customizability: XGBoost has a wide range of hyperparameters that can be adjusted to
optimize performance, making it highly customizable.
4. Handling of Missing Values: XGBoost has built-in support for handling missing values,
making it easy to work with real-world data that often has missing values.
5. Interpretability: Unlike some machine learning algorithms that can be difficult to interpret,
XGBoost provides feature importances, allowing for a better understanding of which
variables are most important in making predictions.
Disadvantages of XGBoost:
1. Computational Complexity: XGBoost can be computationally intensive, especially when
training large models, making it less suitable for resource-constrained systems.
2. Overfitting: XGBoost can be prone to overfitting, especially when trained on small datasets
or when too many trees are used in the model.
3. Hyperparameter Tuning: XGBoost has many hyperparameters that can be adjusted, making
it important to properly tune the parameters to optimize performance. However, finding the
optimal set of parameters can be time-consuming and requires expertise.
4. Memory Requirements: XGBoost can be memory-intensive, especially when working with
large datasets, making it less suitable for systems with limited memory resources.
The first step is to use an initial prediction that can be anything, but by default it is 0.5, regardless of whether it is regression or classification.
It can be observed that the output value is inversely proportional to λ, i.e. λ > 0 reduces the amount that an individual observation adds to the overall prediction. Thus λ (the regularization parameter) reduces the prediction's sensitivity to individual observations.
Prediction is
For λ=0
Prediction for dosage=10 is given by
At the end of the first round the residuals are reduced. The same steps are followed for the remaining M models.
Maths Behind XGBoost Algorithm:
Loss function for regression
where T is the number of terminal nodes (leaves) in a tree and γ is a user-defined penalty used to encourage pruning.
Setting γ = 0 does not turn off pruning, because there may be nodes with negative gain values, and Gain − γ will be < 0 in that case, leading to pruning.
Pruning takes place after the full tree is built, and it plays no role in deriving the optimal output value or the similarity score.
The loss function $\sum_{i=1}^{n} L(y_i, p_i) + \frac{1}{2}\lambda O_{value}^{2}$ contains a regularization term (the second term, $\frac{1}{2}\lambda O_{value}^{2}$). The aim is to find the output value $O_{value}$ for the leaf that minimizes the whole equation, which is similar to ridge regression.
Since we are optimizing the output value from the first tree, we can replace the prediction $p_i$ with the initial prediction $p_i^{0}$ plus the output $O_{value}$ of the new tree.
The first term of the XGBoost loss function is similar to the Gradient Boost loss function.
To simplify the loss function, Gradient Boost uses the first derivative for regression and a Taylor approximation for classification, but XGBoost uses a second-order Taylor approximation for both regression and classification.
Since the first derivative of a function is related to the gradient, XGBoost uses g to represent the first derivative of the loss function, and h is used for the second derivative, as it is related to the Hessian.
$$\sum_{i=1}^{n} L(y_i,\, p_i + O_{value}) + \frac{1}{2}\lambda O_{value}^{2} = L(y_1,\, p_1^{0} + O_{value}) + L(y_2,\, p_2^{0} + O_{value}) + \cdots + L(y_n,\, p_n^{0} + O_{value}) + \frac{1}{2}\lambda O_{value}^{2}$$

Applying the second-order Taylor approximation to each term:

$$= L(y_1, p_1^{0}) + g_1 O_{value} + \tfrac{1}{2} h_1 O_{value}^{2} + L(y_2, p_2^{0}) + g_2 O_{value} + \tfrac{1}{2} h_2 O_{value}^{2} + \cdots + L(y_n, p_n^{0}) + g_n O_{value} + \tfrac{1}{2} h_n O_{value}^{2} + \tfrac{1}{2}\lambda O_{value}^{2}$$

Since $L(y_1, p_1^{0}), L(y_2, p_2^{0}), \ldots, L(y_n, p_n^{0})$ are not affected by $O_{value}$, we ignore them for now:

$$\sum_{i=1}^{n} L(y_i,\, p_i + O_{value}) + \frac{1}{2}\lambda O_{value}^{2} \;\approx\; (g_1 + g_2 + \cdots + g_n)\, O_{value} + \frac{1}{2}\,(h_1 + h_2 + \cdots + h_n + \lambda)\, O_{value}^{2}$$

Setting the derivative with respect to $O_{value}$ to zero:

$$\frac{d}{dO_{value}}\left\{(g_1 + g_2 + \cdots + g_n)\, O_{value} + \frac{1}{2}\,(h_1 + h_2 + \cdots + h_n + \lambda)\, O_{value}^{2}\right\} = 0$$

$$O_{value} = \frac{-(g_1 + g_2 + \cdots + g_n)}{h_1 + h_2 + \cdots + h_n + \lambda}$$

For the regression loss function, $-(g_1 + g_2 + \cdots + g_n)$ is the sum of the residuals and $h_1 + h_2 + \cdots + h_n + \lambda$ is the number of residuals $+\,\lambda$.
Loss function for classification:
Similarity score:
$O_{value}$ in the graph below marks the minimum of both parabolas. To calculate the similarity score, the simplified equation is used.
Multiplying the simplified equation by -1 flips the parabola over the horizontal line y = 0.
The optimal $O_{value}$ is then the x-axis coordinate of the highest point on the parabola, and the y-axis coordinate of that highest point is the similarity score.
Let us substitute

$$O_{value} = \frac{-(g_1 + g_2 + \cdots + g_n)}{h_1 + h_2 + \cdots + h_n + \lambda}$$

into

$$(g_1 + g_2 + \cdots + g_n)\, O_{value} + \frac{1}{2}\,(h_1 + h_2 + \cdots + h_n + \lambda)\, O_{value}^{2}$$

which gives

$$(g_1 + g_2 + \cdots + g_n)\,\frac{-(g_1 + g_2 + \cdots + g_n)}{h_1 + h_2 + \cdots + h_n + \lambda} + \frac{1}{2}\,(h_1 + h_2 + \cdots + h_n + \lambda)\left(\frac{-(g_1 + g_2 + \cdots + g_n)}{h_1 + h_2 + \cdots + h_n + \lambda}\right)^{2} = -\frac{1}{2}\,\frac{(g_1 + g_2 + \cdots + g_n)^{2}}{h_1 + h_2 + \cdots + h_n + \lambda}.$$
This is the actual similarity score (after flipping the sign as described above), but since the similarity score is only used for comparing splits, the constant factor of 1/2 can be ignored, leaving $\text{Similarity} = \frac{(g_1 + g_2 + \cdots + g_n)^{2}}{h_1 + h_2 + \cdots + h_n + \lambda}$.
1. Regularization
Regularization is a technique in machine learning to avoid overfitting. It is a collection of methods that keep the model from becoming overcomplicated and losing generalization power. It has become an important technique, as many models fit the training data too well.
GBM doesn't implement regularization in its algorithm, which makes the algorithm focus only on achieving the minimum loss function. Compared to GBM, XGBoost implements regularization methods to penalize an overfitting model.
There are two kinds of regularization that XGBoost can apply: L1 regularization (Lasso) and L2 regularization (Ridge). L1 regularization tries to shrink the feature weights or coefficients to zero (effectively becoming a form of feature selection), while L2 regularization tries to shrink the coefficients evenly (which helps deal with multicollinearity). By implementing both regularizations, XGBoost can avoid overfitting better than GBM.
2. Parallelization
GBM tends to have a slower training time than the XGBoost because the latter algorithm
implements parallelization during the training process. The boosting technique might be
sequential, but parallelization could still be done within the XGBoost process.
The parallelization aims to speed up the tree-building process, mainly during the splitting event.
By utilizing all the available processing cores, the XGBoost training time can be shortened.
Speaking of speeding up the XGBoost process, the developers also preprocess the data into their own data format, DMatrix, for memory efficiency and improved training speed.
3. Missing Data Handling
Our training dataset could contain missing data, which we would normally have to handle explicitly before passing it to the algorithm. However, XGBoost has its own in-built missing data handler, whereas GBM doesn't.
XGBoost implements its own technique to handle missing data, called Sparsity-aware Split Finding. For any sparse data that XGBoost encounters (missing data, dense zeros, one-hot encoded features), the model learns from that data and finds the optimal split: it decides in which direction the missing values should be sent during splitting by seeing which direction minimizes the loss.
4. Tree Pruning
The growth strategy of GBM is to stop splitting once the algorithm arrives at a negative loss for a split. This strategy can lead to suboptimal results because it is based only on local optimization and might neglect the overall picture.
XGBoost avoids the GBM strategy: it grows the tree up to the set max depth parameter and then prunes backward. A split with negative loss is pruned, but there is a case in which a negative-loss split is not removed: when a split arrives at a negative loss but a further split below it is positive, it is still retained if the overall gain is positive.
5. In-Built Cross-Validation
Cross-validation is a technique to assess a model's generalization and robustness by systematically splitting the data over several iterations. Collectively, the results show whether the model is overfitting or not.
Normally, a machine learning algorithm requires external help to implement cross-validation, but XGBoost has an in-built cross-validation that can be used during the training session. The cross-validation is performed at each boosting iteration and helps ensure that the produced trees are robust.
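As a minimal sketch of the DMatrix format and the built-in cross-validation (synthetic data and illustrative parameter values; xgb.cv reports the evaluation metric at every boosting round):

```python
# Minimal sketch of XGBoost's built-in cross-validation with the DMatrix data format.
# Data and parameter values are illustrative only.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y)                 # XGBoost's memory-efficient data format
params = {"objective": "binary:logistic", "max_depth": 3,
          "eta": 0.1, "lambda": 1.0, "alpha": 0.0}

# Cross-validation is performed at each boosting round; the result shows train/test metrics.
cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5, metrics="logloss", seed=0)
print(cv_results.tail())
```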