
Unit-I

Ensemble Learning
What is Ensemble Learning?
 Ensemble learning is a machine learning technique that combines the predictions from
multiple individual models to obtain a better predictive performance than any single model.
The basic idea behind ensemble learning is to leverage the wisdom of the crowd by
aggregating the predictions of multiple models, each of which may have its own strengths
and weaknesses. This can lead to improved performance and generalization.
 Ensemble learning can be thought of as a way of compensating for the weaknesses of individual learning algorithms. Ensembles are computationally more expensive to train than a single model, but they can be more effective than a single non-ensemble model that has undergone extensive training.
 Several individual base models (experts) are fitted to learn from the same data and produce
an aggregation of output based on which a final decision is taken. These base models can
be machine learning algorithms such as decision trees (mostly used), linear models, support
vector machines (SVM), neural networks, or any other model that is capable of making
predictions.
 Most commonly used ensembles include techniques such as Bagging, used to build Random Forest algorithms, and Boosting, used to build algorithms such as AdaBoost, XGBoost, etc.
Key advantages of ensemble learning:
1. Improved Accuracy
 By combining the outputs of multiple models, ensemble methods can reduce errors in
prediction. Techniques like bagging and boosting (e.g., Random Forests, Gradient
Boosting) have been proven to yield models with significantly higher accuracy than
individual models alone.
2. Reduction of Overfitting
 Some ensemble methods, like bagging, help reduce overfitting. For instance, Random
Forests combine the predictions of multiple decision trees trained on different subsets of
data, making the final model less likely to fit to noise in any single subset, resulting in
better generalization on new data.
3. Reduction of Variance
 Ensemble methods reduce variance, especially in models prone to high variance (e.g.,
decision trees). Aggregating predictions from multiple models can lead to a smoother, more
stable prediction curve, lowering the risk of extreme predictions that deviate due to outliers
or specific patterns in individual training samples.
4. Enhanced Robustness
 By aggregating different models, ensemble learning creates a more robust model, meaning
it can handle anomalies or noise in the data better. This resilience is particularly valuable
when data quality is inconsistent.
5. Adaptability to Complex Data Patterns
 Ensemble methods are particularly useful in capturing complex relationships in data. For
example, boosting models sequentially adapt by focusing more on hard-to-predict
instances, gradually improving the model's ability to learn complex patterns.
6. Flexibility with Different Algorithms
 Ensembles can combine models from different algorithmic families (e.g., decision trees,
logistic regression, neural networks), allowing diverse approaches to contribute to the final
prediction. Techniques like stacking allow combining the strengths of various types of
models, which is especially helpful when no single algorithm performs well on its own.
7. Better Performance with Small or Imbalanced Datasets
 Some ensemble techniques, particularly boosting, can perform well even on small or
imbalanced datasets. Boosting methods focus on difficult instances, improving predictive
performance in scenarios where class imbalance or sparse data might otherwise lead to
poor performance.
8. Parallelizable (in Bagging)
 Techniques like bagging can be trained in parallel, making it computationally efficient to
create a strong model by training independent models concurrently. This characteristic is
helpful when working with large datasets and high-dimensional data.
9. Diversity of Models Leading to Higher Generalization
 When ensemble models are composed of diverse learners, they are more likely to
generalize well across various patterns in the data. Diversity in the models allows
ensembles to handle a broader range of inputs, leading to increased flexibility and
improved performance across multiple domains.

Wisdom of the Crowd


The phrase “wisdom of the crowd” describes the occurrence in which the judgment of a group of
people as a whole is more accurate or trustworthy than that of any one member of the group. In
other words, the Wisdom of the crowd is the principle that explains how collective knowledge is
better than knowledge of the few. By combining predictions or labels from various models or
human annotators, Wisdom of the Crowd can be used to enhance model performance in the context
of machine learning. The overall precision and resilience of the system can be improved by fusing
various viewpoints and making use of collective knowledge. When individual models or
annotators may produce biases or errors, this strategy is especially useful because it makes use of
the strength of collaboration and collective intelligence to enhance the results of machine learning.
In simple terms, what it means is that asking many people who individually have less knowledge
is better than asking a few who have a lot of knowledge. Seems counter-intuitive, right?
Consider prediction and betting markets. These markets enable people to bet on or forecast a variety of outcomes, including stock prices, sports results, and election results. By combining the forecasts of a huge number of participants, such markets have proven to be very accurate at predicting events, frequently beating individual experts or opinion polls. Because the participants bring a diversity of viewpoints, expertise, and information, the collective intelligence of the crowd, as reflected in the aggregated bets or predictions, tends to produce a more accurate approximation of the actual outcome.
This phenomenon is known as the Wisdom of the Crowd. And the similar phenomenon is used
by ensemble methods.
Types of Ensemble techniques

Simple Ensemble Techniques


Simple ensemble techniques combine predictions from multiple models to produce a final
prediction. These techniques are straightforward to implement and can often improve performance
compared to individual models.
Max Voting
In this technique, the final prediction is the most frequent prediction among the base models. For
example, if three base models predict the classes A, B, and A for a given sample, the final
prediction using max voting would be class A, as it appears more frequently.
Averaging
Averaging involves taking the average of predictions from multiple models. This can be
particularly useful for regression problems, where the final prediction is the mean of predictions
from all models. For classification, averaging can be applied to the predicted probabilities for a
more confident prediction.
Weighted Averaging
Weighted averaging is similar, but each model's prediction is given a different weight. The weights
can be assigned based on each model's performance on a validation set or tuned using grid or
randomized search techniques. This allows models with higher performance to have a greater
influence on the final prediction.
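To make these three techniques concrete, here is a minimal scikit-learn sketch of hard voting (max voting), soft voting (probability averaging), and weighted averaging; the dataset, base models, and weights are illustrative assumptions rather than recommendations.

# Illustrative sketch of max voting, averaging and weighted averaging with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=3)),
    ("svm", SVC(probability=True)),   # probability=True so soft voting can average probabilities
]

# Max voting: each model votes for a class and the majority wins.
hard_vote = VotingClassifier(estimators=base_models, voting="hard").fit(X_train, y_train)

# Averaging: predicted class probabilities are averaged (soft voting).
soft_vote = VotingClassifier(estimators=base_models, voting="soft").fit(X_train, y_train)

# Weighted averaging: better-performing models get a larger say (the weights here are assumed).
weighted_vote = VotingClassifier(estimators=base_models, voting="soft",
                                 weights=[2, 1, 2]).fit(X_train, y_train)

for name, model in [("hard", hard_vote), ("soft", soft_vote), ("weighted", weighted_vote)]:
    print(name, model.score(X_test, y_test))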
Advanced Ensemble Techniques
Advanced ensemble techniques go beyond basic methods like bagging and boosting to enhance
model performance further. Here are explanations of stacking, blending, bagging, and boosting:
Stacking
 Stacking, or stacked generalization, combines multiple base models with a meta-model to
make predictions.
 Instead of using simple methods like averaging or voting, stacking trains a meta-model to
learn how to combine the base models' predictions best.
 The base models can be diverse to capture different aspects of the data, and the meta-model
learns to weight its predictions based on its performance.
Blending
 Blending is similar to stacking but more straightforward.
 Instead of a meta-model, blending uses a simple method like averaging or a linear model to
combine the predictions of the base models.
 Blending is often used in competitions where simplicity and efficiency are important.
Bagging (Bootstrap Aggregating)
 Bagging is a technique where multiple subsets of the dataset are created through
bootstrapping (sampling with replacement).
 A base model (often a decision tree) is trained on each subset, and the final prediction is
the average (for regression) or majority vote (for classification) of the individual
predictions.
 Bagging helps reduce variance and overfitting, especially for unstable models.
Boosting
 Boosting is an ensemble technique where base models are trained sequentially, with each
subsequent model focusing on the mistakes of the previous ones.
 The final prediction is a weighted sum of the individual models' predictions, with higher
weights given to more accurate models.
 Boosting algorithms like AdaBoost, Gradient Boosting, and XGBoost are popular because
they improve model performance.
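As a rough illustration of stacking, scikit-learn's StackingClassifier trains a meta-model on out-of-fold predictions of the base models; the particular base models and meta-model below are only examples.

# Sketch of stacking: a logistic-regression meta-model combines two diverse base models.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50)),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(),   # the meta-model
    cv=5,                                    # base-model predictions come from cross-validation
)
stack.fit(X_train, y_train)
print("stacking accuracy:", stack.score(X_test, y_test))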
Bagging:
Bagging, also known as bootstrap aggregation, is the ensemble learning method that is commonly
used to reduce variance within a noisy data set.
Bootstrap sampling and aggregation are fundamental concepts in machine learning, especially
in ensemble methods like bagging (Bootstrap Aggregating) and Random Forests. Here’s a
breakdown of each concept and how they work together:
Bootstrap Sampling
Bootstrap sampling is a statistical technique where multiple samples (called "bootstrap samples")
are created by repeatedly sampling data points from the original dataset with replacement. Here’s
how it works:
1. Sampling with Replacement:
o A new sample is created by randomly selecting data points from the original dataset,
with each data point having the possibility of being selected more than once. This
is called "sampling with replacement," meaning that once a data point is picked, it
goes back into the pool and can be chosen again.
2. Size of Bootstrap Samples:
o Each bootstrap sample is typically the same size as the original dataset. However,
because sampling is done with replacement, each sample will contain duplicates
and omit some data points from the original dataset.
o On average, around one-third of the original dataset is left out in each bootstrap
sample. These unselected data points are known as out-of-bag (OOB) samples.
3. Purpose of Bootstrap Sampling:
o The idea is to create diverse datasets by introducing randomness, which reduces the
likelihood that any single data point will disproportionately influence the model.
o Each model trained on a bootstrap sample learns slightly different patterns, which
helps create a diverse ensemble of models.
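The sampling step described above can be sketched in a few lines of NumPy; the "dataset" here is just a range of row indices standing in for real data.

# Bootstrap sampling: draw N indices with replacement and check which points are left out (OOB).
import numpy as np

rng = np.random.default_rng(0)
N = 1000
data = np.arange(N)                              # stand-in for a dataset of N rows

boot_idx = rng.choice(N, size=N, replace=True)   # sampling with replacement
bootstrap_sample = data[boot_idx]

oob_mask = ~np.isin(data, boot_idx)              # points never selected are out-of-bag
print("out-of-bag fraction:", oob_mask.mean())   # about one-third (1/e ≈ 0.368) on average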
Aggregation
Aggregation is the process of combining the outputs (predictions) from each model trained on
different bootstrap samples to produce a final prediction. The specific aggregation method depends
on the type of machine learning problem:
1. For Classification:
o Majority Voting: In classification tasks, each model makes a class prediction, and
the final predicted class is the one with the most votes across all models. For
instance, if 7 out of 10 models predict "Class A," then "Class A" is the ensemble’s
final prediction.
2. For Regression:
o Averaging: In regression tasks, the final prediction is obtained by averaging the
individual predictions from each model. This helps to smooth out extreme
predictions and yield a more accurate result.
Combining Bootstrap Sampling and Aggregation: Bagging
When bootstrap sampling and aggregation are used together in a machine learning model, it’s
called bagging (short for Bootstrap Aggregating). Here’s how it works:
1. Create Multiple Models with Bootstrap Sampling:
o The training dataset is split into multiple bootstrap samples. Each sample is used to
train a separate model (often decision trees).
2. Diversity through Random Sampling:
o Since each model is trained on a different subset of data, the models are diverse.
They capture different aspects of the data and are less likely to overfit as an
ensemble.
3. Aggregate Predictions:
o The final prediction is obtained by combining predictions from all the models
through majority voting (for classification) or averaging (for regression).
Example: Random Forests
A Random Forest is a popular algorithm that uses bagging with decision trees. Each tree is trained
on a different bootstrap sample, with an additional layer of randomness added by selecting only a
random subset of features at each split. The final prediction is the aggregated result of all trees in
the forest.
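A brief scikit-learn sketch of this idea, using the out-of-bag samples as a built-in validation set; the hyperparameters are illustrative.

# Random Forest = bagging over decision trees + a random feature subset at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

forest = RandomForestClassifier(
    n_estimators=200,      # number of bootstrapped trees
    max_features="sqrt",   # random subset of features considered at each split
    oob_score=True,        # evaluate on the out-of-bag samples
    random_state=1,
)
forest.fit(X, y)
print("OOB accuracy estimate:", forest.oob_score_)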
Benefits of Bagging
 Variance Reduction: By training multiple models on different data subsets, Bagging
reduces variance, leading to more stable and reliable predictions.
 Overfitting Mitigation: The diversity among base models helps the ensemble generalize
better to new data.
 Robustness to Outliers: Aggregating multiple models’ predictions reduces the impact
of outliers and noisy data points.
 Parallel Training: Training individual models can be parallelized, speeding up the
process, especially with large datasets or complex models.
 Versatility: Bagging can be applied to various base learners, making it a flexible
technique.
 Simplicity: The concept of random sampling with replacement and combining predictions
is easy to understand and implement.
Boosting is an ensemble modelling technique that attempts to build a strong classifier from the
number of weak classifiers. It is done by building a model by using weak models in series. Firstly,
a model is built from the training data. Then the second model is built which tries to correct the
errors present in the first model. This procedure is continued and models are added until either the
complete training data set is predicted correctly or the maximum number of models are added.
The term ‘Boosting’ refers to a family of algorithms which convert weak learners into strong learners.
Let’s understand this definition in detail by solving a problem of spam email identification:
How would you classify an email as SPAM or not? Like everyone else, our initial approach would
be to identify ‘spam’ and ‘not spam’ emails using following criteria. If:
1. Email has only one image file (promotional image), It’s a SPAM
2. Email has only link(s), It’s a SPAM
3. Email body consist of sentence like “You won a prize money of $ xxxxxx”, It’s a SPAM
4. Email from our official domain “Analyticsvidhya.com” , Not a SPAM
5. Email from known source, Not a SPAM
Above, we’ve defined multiple rules to classify an email into ‘spam’ or ‘not spam’. But, do you
think these rules individually are strong enough to classify an email successfully? No.
Individually, these rules are not powerful enough to classify an email into ‘spam’ or ‘not spam’. Therefore, these rules are called weak learners.
To convert weak learners into a strong learner, we’ll combine the prediction of each weak learner using methods like:
 Using the average / weighted average
 Considering the prediction that has the higher vote
For example: Above, we have defined 5 weak learners. Out of these 5, 3 are voted as ‘SPAM’ and
2 are voted as ‘Not a SPAM’. In this case, by default, we’ll consider an email as SPAM because
we have higher (3) vote for ‘SPAM’.
How Does Boosting Algorithms Work?
Now we know that boosting combines weak learners (a.k.a. base learners) to form a strong rule. An immediate question that should pop into your mind is, ‘How does boosting identify weak rules?’
To find a weak rule, we apply a base learning (ML) algorithm with a different distribution each time. Each time the base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.
Here’s another question which might haunt you, ‘How do we choose different distribution for each
round?’
For choosing the right distribution, here are the following steps:
 Step 1: The base learner takes all the distributions and assigns equal weight or attention to each observation.
 Step 2: If there is any prediction error caused by first base learning algorithm, then we pay
higher attention to observations having prediction error. Then, we apply the next base
learning algorithm.
 Step 3: Iterate Step 2 till the limit of base learning algorithm is reached or higher accuracy
is achieved.
Finally, it combines the outputs from the weak learners and creates a strong learner, which eventually improves the prediction power of the model. Boosting pays more attention to examples which are misclassified or have higher errors from the preceding weak rules.
Weak learners
Weak learners have low prediction accuracy, similar to random guessing. They are prone to
overfitting—that is, they can't classify data that varies too much from their original dataset. For
example, if you train the model to identify cats as animals with pointed ears, it might fail to
recognize a cat whose ears are curled.
Strong learners
Strong learners have higher prediction accuracy. Boosting converts a system of weak learners into
a single strong learning system. For example, to identify the cat image, it combines a weak learner
that guesses for pointy ears and another learner that guesses for cat-shaped eyes. After analyzing
the animal image for pointy ears, the system analyzes it once again for cat-shaped eyes. This
improves the system's overall accuracy.
Advantages of Boosting
 Improved Accuracy – Boosting can improve the accuracy of the model by combining
several weak models’ accuracies and averaging them for regression or voting over them for
classification to increase the accuracy of the final model.
 Robustness to Overfitting – Boosting can reduce the risk of overfitting by reweighting the
inputs that are classified wrongly.
 Better handling of imbalanced data – Boosting can handle imbalanced data by focusing more on the data points that are misclassified.
 Better Interpretability – Boosting can increase the interpretability of the model by breaking
the model decision process into multiple processes.

Types of boosting:
1. AdaBoost (Adaptive Boosting)
2. Gradient Tree Boosting
3. XGBoost
AdaBoost:
AdaBoost, short for Adaptive Boosting, is an ensemble learning technique used in machine learning for classification and regression problems. The main idea behind AdaBoost is to iteratively train weak classifiers on the training dataset, with each successive classifier giving more weight to the data points that were misclassified. The final AdaBoost model is obtained by combining all the weak classifiers used during training, weighted according to their accuracies. The weak model with the highest accuracy is given the highest weight, while the model with the lowest accuracy is given a lower weight.

This diagram aptly explains Ada-boost. Let’s understand it closely:

Box 1: You can see that we have assigned equal weights to each data point and applied a decision
stump to classify them as + (plus) or – (minus). The decision stump (D1) has generated vertical
line at left side to classify the data points. We see that, this vertical line has incorrectly predicted
three + (plus) as – (minus). In such case, we’ll assign higher weights to these three + (plus) and
apply another decision stump.

Box 2: Here, you can see that the size of three incorrectly predicted + (plus) is bigger as compared
to rest of the data points. In this case, the second decision stump (D2) will try to predict them
correctly. Now, a vertical line (D2) at right side of this box has classified three mis-classified +
(plus) correctly. But again, it has caused mis-classification errors. This time with three -(minus).
Again, we will assign higher weight to three – (minus) and apply another decision stump.
Box 3: Here, three – (minus) are given higher weights. A decision stump (D3) is applied to predict
these mis-classified observation correctly. This time a horizontal line is generated to classify +
(plus) and – (minus) based on higher weight of mis-classified observation.

Box 4: Here, we have combined D1, D2 and D3 to form a strong prediction having complex rule
as compared to individual weak learner. You can see that this algorithm has classified these
observation quite well as compared to any of individual weak learner.

Intuition behind the AdaBoost Algorithm

AdaBoost combines many weak machine-learning models to create a powerful classification model. The steps to build and combine these models are as follows:

Step1 – Initialize the weights

 For a dataset with N training instances, initialize a weight W_i for each data point with W_i = 1/N

Step2 – Train weak classifiers

 Train a weak classifier Mk where k is the current iteration

 The weak classifier we are training should have an accuracy greater than 0.5 which means
it should be performing better than a naive guess

Step3 – Calculate the error rate and importance of each weak model Mk

 Calculate the weighted error rate, error_k, for every weak classifier Mk on the training dataset

 Calculate the importance of each model using the formula α_k = (1/2) · ln((1 − error_k) / error_k)

Step4 – Update the data point weight Wi for each data point

 After applying the weak classifier to the training data, we update the weight assigned to each point using the accuracy of the model. The formula for updating the weights is W_i ← W_i · exp(−α_k · y_i · M_k(X_i)), where y_i is the true output (coded as ±1) and X_i is the corresponding input vector

Step5 – Normalize the instance weights

 We normalize the instance weights so that they sum to 1, using the formula W_i ← W_i / Σ_j W_j

Step6 – Repeat steps 2-5 for K iterations

 We will train K classifiers and will calculate model importance and update the instance
weights using the above formula

 The final model M(X) will be an ensemble model which is obtained by combining these
weak models weighted by their model weights.
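Taken together, Steps 1–6 can be sketched from scratch in a few lines. This follows the standard AdaBoost scheme with labels coded as −1/+1 and decision stumps as the weak classifiers; it is a simplified illustration, not a production implementation.

# Minimal AdaBoost sketch: stumps trained on reweighted data, combined by their alpha weights.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
y = np.where(y == 0, -1, 1)               # AdaBoost works with labels -1 / +1

K = 20                                    # number of boosting rounds
N = len(y)
w = np.full(N, 1.0 / N)                   # Step 1: equal weights W_i = 1/N
stumps, alphas = [], []

for k in range(K):
    stump = DecisionTreeClassifier(max_depth=1)           # Step 2: weak classifier M_k
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    err = np.sum(w * (pred != y)) / np.sum(w)             # Step 3: weighted error rate
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))       # importance of this stump

    w = w * np.exp(-alpha * y * pred)                     # Step 4: up-weight the mistakes
    w = w / w.sum()                                       # Step 5: normalise to sum to 1

    stumps.append(stump)
    alphas.append(alpha)

# Final model: sign of the alpha-weighted sum of stump predictions.
ensemble_pred = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", (ensemble_pred == y).mean())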

Steps for Adaboost.


 Initially, AdaBoost selects a training subset randomly.
 It iteratively trains the AdaBoost machine learning model by selecting the training set based on the accuracy of the previous round's predictions.
 It assigns higher weights to wrongly classified observations so that in the next iteration these observations get a higher probability of being selected.
 It also assigns a weight to the trained classifier in each iteration according to the classifier's accuracy: the more accurate the classifier, the higher its weight.
 This process iterates until the complete training data fits without error or until the specified maximum number of estimators is reached.
 To classify, perform a "vote" across all of the learning algorithms you built.
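The same procedure is available off the shelf; a minimal scikit-learn usage sketch follows (the hyperparameters are just example values).

# Library version of the above: scikit-learn's AdaBoostClassifier uses decision stumps by default.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=7)
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))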
Note:
The algorithm creates a set of models, just like in random forests. However, the critical difference
is that AdaBoost models consist of nodes with only two leaves, known as “stumps.” These stumps
are considered weak learners, and AdaBoost prefers them. The order in which stumps are created
is vital in AdaBoost because the error of the first stump influences how subsequent stumps are
built.

Stump

Understanding the Working of the AdaBoost Algorithm (Classification):


Let’s understand what and how this algorithm works with the following example.
Step 1: Assigning Weights
The Image shown below is the actual representation of our dataset. Since the target column is
binary, it is a classification problem. First of all, these data points will be assigned some weights.
Initially, all the weights will be equal.
The formula to calculate the sample weights is w(i) = 1/N, where N is the total number of data points.
Here, since we have 5 data points, the sample weight assigned to each will be 1/5.
Step 2: Classify the Samples
We start by seeing how well “Gender” classifies the samples and will see how the variables (Age,
Income) classify the samples.
We’ll create a decision stump for each of the features and then calculate the Gini Index of each
tree. The tree with the lowest Gini Index will be our first stump.
Here in our dataset, let’s say Gender has the lowest gini index, so it will be our first stump.
Step 3: Calculate the Influence
We’ll now calculate the “Amount of Say” or “Importance” or “Influence” for this classifier in classifying the data points using this formula:
α = (1/2) · ln((1 − Total Error) / Total Error)
The total error is nothing but the sum of the sample weights of the misclassified data points.
Here in our dataset, let’s assume there is 1 wrong output, so our total error will be 1/5, and the alpha (performance of the stump) will be α = (1/2) · ln(0.8 / 0.2) ≈ 0.69.
Note: Total error will always be between 0 and 1.
0 indicates a perfect stump, and 1 indicates a horrible stump.

When there is no misclassification (Total Error = 0), the “amount of say” (alpha) will be a large positive number.
When the classifier predicts half right and half wrong (Total Error = 0.5), the importance (amount of say) of the classifier will be 0.
If all the samples have been incorrectly classified, the error will be very high (close to 1), and hence the alpha value will be a large negative number.
Step 4: Calculate TE and Performance
You might be wondering about the significance of calculating the Total Error (TE) and
performance of an Adaboost stump. The reason is straightforward – updating the weights is crucial.
If identical weights are maintained for the subsequent model, the output will mirror what was
obtained in the initial model.
The wrong predictions will be given more weight, whereas the correct predictions weights will be
decreased. Now when we build our next model after updating the weights, more preference will
be given to the points with higher weights.
After finding the importance of the classifier and the total error, we finally need to update the weights. For this, we use the following formula:
New sample weight = old weight × e^(±α)
The exponent takes −α when the sample is correctly classified.
The exponent takes +α when the sample is misclassified.
There are four correctly classified samples and 1 wrong one. Here, the sample weight of each data point is 1/5 = 0.2, and the amount of say (performance) of the Gender stump is 0.69.
New weights for correctly classified samples: 0.2 × e^(−0.69) ≈ 0.1004.
For the wrongly classified sample, the updated weight is: 0.2 × e^(+0.69) ≈ 0.3988.
Note
After substituting the values, the exponent is negative when the data point is correctly classified, which decreases the sample weight from 0.2 to 0.1004. It is positive when there is a misclassification, which increases the sample weight from 0.2 to 0.3988.

The sample weights must sum to 1, but here, if we add up all the new sample weights, we get 0.8004. To bring this sum to 1, we normalize these weights by dividing all the
weights by the total sum of updated weights, which is 0.8004. So, after normalizing the sample
weights, we get this dataset, and now the sum is equal to 1.
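The arithmetic of this weight update and normalisation can be checked in a few lines; the small differences against the text come from rounding alpha to 0.69.

# Reproducing the worked example: 5 samples, 1 misclassified, Gender stump.
import numpy as np

w = np.full(5, 1 / 5)                           # initial sample weights, 0.2 each
total_error = 0.2                               # weight of the single misclassified point
alpha = round(0.5 * np.log((1 - total_error) / total_error), 2)   # ≈ 0.69, as in the text

correct = 0.2 * np.exp(-alpha)                  # correctly classified: 0.2 * e^(-0.69) ≈ 0.100
wrong = 0.2 * np.exp(+alpha)                    # misclassified:        0.2 * e^(+0.69) ≈ 0.399

new_w = np.array([correct] * 4 + [wrong])
print(new_w.sum())                              # ≈ 0.80 (the text's 0.8004, up to rounding)
new_w = new_w / new_w.sum()                     # normalise so the weights sum to 1
print(np.round(new_w, 4), new_w.sum())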

Step 5: Decrease Errors


Now, we need to make a new dataset to see if the errors decreased or not. For this, we will remove
the “sample weights” and “new sample weights” columns and then, based on the “new sample
weights,” divide our data points into buckets.
Step 6: New Dataset
The algorithm selects random numbers from 0-1. Since incorrectly classified records have higher
sample weights, the probability of selecting those records is very high.
Suppose the 5 random numbers our algorithm takes are 0.38, 0.25, 0.34, 0.40, 0.55.
Now we will see where these random numbers fall in the bucket, and according to it, we’ll make
our new dataset shown below.

This comes out to be our new dataset, and we see the data point, which was wrongly classified,
has been selected 3 times because it has a higher weight.
Step 7: Repeat Previous Steps
Now this acts as our new dataset, and we need to repeat all the above steps, i.e.
 Assign equal weights to all the data points.
 Find the stump that does the best job classifying the new collection of samples by finding
their Gini Index and selecting the one with the lowest Gini index.
 Calculate the “Amount of Say” and “Total error” to update the previous sample weights.
 Normalize the new sample weights.
Iterate through these steps until and unless a low training error is achieved.
Finally, we need to talk about how a forest of stumps created by AdaBoost makes a classification.
Imagine 6 stumps are created by the AdaBoost algorithm. Out of these 6 stumps, 4 classify the patient as ill and the other 2 classify the patient as not ill. The total Amount of Say of the 4 "ill" stumps is 0.97 + 0.32 + 0.78 + 0.63 = 2.7, and the total Amount of Say of the other 2 stumps is 0.41 + 0.82 = 1.23.
Ultimately, the patient is classified as ill because of the larger total Amount of Say (2.7).
Suppose, with respect to our dataset, we have constructed 3 decision trees (DT1, DT2, DT3) in
a sequential manner. If we send our test data now, it will pass through all the decision trees, and
finally, we will see which class has the majority, and based on that, we will do predictions
for our test dataset.
https://datamapu.com/posts/classical_ml/adaboost_example_reg/
Understanding the Working of the AdaBoost Algorithm (Regression):
Consider a dataset containing 10 samples. It includes the features ‘age’, ‘likes height’, and ‘likes goats’. The target variable is ‘climbed meters’. That is, we want to estimate how many meters a person has climbed depending on their age and on whether they like heights and goats.

 We start by assigning weights to each sample. Initially, the weights are all equal to 1/N, with N the number of data samples; in our case the initial weights are therefore 0.1 for all samples.
We now fit a Decision Tree with maximum depth of three to this dataset.

Following the decision paths of the tree, we can find that the samples age=35, likes height=0, likes goats=0 and age=42, likes height=0, likes goats=0 lead to wrong predictions. The true target values are 300m and 200m, respectively, but the predicted value is 250m in both cases. The other eight samples are correctly predicted. The total error is thus 210. The influence of this tree is therefore

Note that different implementations of the AdaBoost algorithm for regression exist. Usually the prediction does not need to match exactly; instead, a margin is given, and the prediction is counted as an error if it falls outside this margin. For the sake of simplicity, we keep this definition analogous to a classification problem. The main idea of calculating the influence of each tree remains the same, but the way the error is calculated exactly may differ between implementations.
The dataset with the updated weights assigned to each sample.

Let’s assume the random numbers drawn are [0.2, 0.8, 0.4, 0.3, 0.6, 0.5, 0.05, 0.1, 0.25], which refer to the samples [3, 6, 3, 3, 4, 4, 5, 0, 1, 3]. The modified dataset is shown in the next plot.
Let’s now use the model to make a prediction. Consider the following sample.
Let’s now use the model to make a prediction. Consider the following sample.

Feature Value

age 45

likes height 0

likes goats 1

To make the final prediction, we need to consider the individual predictions of all the models. The weighted mean of these predictions, using the influence values as weights, is then the prediction of the constructed ensemble AdaBoost model. Following the decision path of the first tree results in a prediction of 300m, the second tree predicts 233.33m, and the third tree again predicts 300m. The final prediction is then calculated as

The true value of this sample is 300m.
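For completeness, scikit-learn exposes AdaBoost for regression through AdaBoostRegressor; the sketch below uses synthetic data and illustrative hyperparameters, with depth-3 trees as in the example above.

# AdaBoost for regression: shallow trees fitted sequentially, combined by a weighted median.
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=3, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada_reg = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=3),   # base learner, as in the worked example above
    n_estimators=100,
    learning_rate=0.5,
    loss="linear",        # how the per-sample error within its margin is measured
    random_state=0,
)
ada_reg.fit(X_train, y_train)
print("R^2 on test data:", ada_reg.score(X_test, y_test))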


What are the differences between Random Forest and Adaboost?
1. The first difference between random forest and adaboost is that random forest comes under
the bagging ensemble technique and adaboost comes under boosting ensemble technique.
If we elaborate, random forest runs a collection of machine learning models in parallel but
adaboost runs a collection of machine learning models in sequence.
2. Random Forest uses shallow or moderate depth of decision trees whereas adaboost uses
decision stumps.
3. Both algorithms aim to produce a final model with low bias and low variance, but random forest starts from low-bias, high-variance models (deep trees) and reduces their variance, whereas AdaBoost starts from high-bias, low-variance models (stumps) and reduces their bias.
4. In random forest, the models are all given the same weight, whereas in AdaBoost a weight is assigned to each model according to its accuracy and performance.
Advantages:
 It is easy to use, as we do not have to do much hyperparameter tuning compared to other algorithms.
 AdaBoost increases the accuracy of weak machine learning models.
 AdaBoost is relatively resistant to overfitting, as it trains the models in sequence and associates a weight with each of them.
Disadvantages:
 AdaBoost is sensitive to noise data.
 It is highly affected by outliers because it tries to fit each point perfectly.
 AdaBoost is slower compared to XGBoost.
Gradient Boosting
Gradient Boosting is a powerful boosting algorithm that combines several weak learners into
strong learners, in which each new model is trained to minimize the loss function such as mean
squared error or cross-entropy of the previous model using gradient descent. In each iteration,
the algorithm computes the gradient of the loss function with respect to the predictions of the
current ensemble and then trains a new weak model to minimize this gradient. The predictions
of the new model are then added to the ensemble, and the process is repeated until a stopping
criterion is met.
In contrast to AdaBoost, the weights of the training instances are not tweaked, instead, each
predictor is trained using the residual errors of the predecessor as labels. There is a technique
called the Gradient Boosted Trees whose base learner is CART (Classification and Regression
Trees). The below diagram explains how gradient-boosted trees are trained for regression
problems.

Gradient Boosted Trees for Regression

The ensemble consists of M trees. Tree1 is trained using the feature matrix X and the labels y.
The predictions labeled y1(hat) are used to determine the training set residual errors r1. Tree2 is
then trained using the feature matrix X and the residual errors r1 of Tree1 as labels. The
predicted results r1(hat) are then used to determine the residual r2. The process is repeated until
all the M trees forming the ensemble are trained. There is an important parameter used in this
technique known as Shrinkage. Shrinkage refers to the fact that the prediction of each tree in
the ensemble is shrunk after it is multiplied by the learning rate (eta) which ranges between 0 to
1. There is a trade-off between eta and the number of estimators, decreasing learning rate needs
to be compensated with increasing estimators in order to reach certain model performance. Since
all trees are trained now, predictions can be made. Each tree predicts a label and the final
prediction is given by the formula,

y(pred) = y1 + (υ* r1) + (υ * r2) + ....... + (υ * rN)
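The training loop just described can be sketched from scratch: start from the mean, repeatedly fit a tree to the current residuals, and add its shrunken prediction. The data and settings below are illustrative.

# Gradient boosted trees for regression, built manually: F_0 = mean, then fit trees to residuals.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=4, noise=15, random_state=3)

eta = 0.1                          # learning rate (shrinkage)
M = 100                            # number of trees
pred = np.full(len(y), y.mean())   # base prediction y1: the mean of the target
trees = []

for m in range(M):
    residuals = y - pred                   # r_m: what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)                 # the next tree is trained on the residuals
    pred = pred + eta * tree.predict(X)    # y(pred) = y1 + eta*r1_hat + eta*r2_hat + ...
    trees.append(tree)

print("training MSE:", np.mean((y - pred) ** 2))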

Gradient Boosting Algorithm


Errors play a major role in any machine learning algorithm. There are mainly two types of errors:
bias error and variance error. The gradient boost algorithm helps us minimize the bias error of the
model. The main idea behind this algorithm is to build models sequentially and these subsequent
models try to reduce the errors of the previous model. But how do we do that? How do we reduce
the error? This is done by building a new model on the errors or residuals of the previous model.
When the target column is continuous, we use Gradient Boosting Regressor whereas when it is a
classification problem, we use Gradient Boosting Classifier. The only difference between the two
is the “Loss function”. The objective here is to minimize this loss function by adding weak learners
using gradient descent. Since it is based on the loss function, for regression problems, we’ll have
different loss functions like Mean squared error (MSE) and for classification, we will have
different functions, like log-likelihood.
Understanding Gradient Boosting Regression Algorithm with an Example
Let’s understand the intuition behind the Gradient Boosting algorithm in Machine Learning with the help of an example. Here our target column is continuous, hence we will use a gradient boosting regressor.
Following is a sample from a random dataset where we have to predict the car price based on
various features. The target column is price and other features are independent features.

Step 1: Build a Base Model


The first step in gradient boosting is to build a base model to predict the observations in the training
dataset. For simplicity, we take an average of the target column and assume that to be the predicted
value as shown below:

Why did I say we take the average of the target column? Well, there is math involved in this.
Mathematically the first step can be written as:

Here L is our loss function,


Gamma is our predicted value, and
arg min means we have to find a predicted value/gamma for which the loss function is minimum.
Since the target column is continuous our loss function will be:

Here yi is the observed value, and gamma is the predicted value.


Now we need to find a minimum value of gamma such that this loss function is minimum. We
differentiate this loss function and then put it equal to 0.

Let’s see how to do this with the help of our example. Remember that y_i is our observed value and γ is our predicted value. By plugging the values into the above formula, we end up with the average of the observed car prices, and this is why I asked you to take the average of the target column and assume it to be your first prediction.
Hence for gamma = 14500, the loss function is at its minimum, so this value becomes the prediction of our base model.
Step 2: Compute Pseudo Residuals
The next step is to calculate the pseudo residuals which are (observed value – predicted value).

Again the question comes why only observed – predicted? Everything is mathematically proven.
Let’s see where this formula comes from. This step can be written as:

Here F(xi) is the previous model and m is the index of the decision tree being built.
We are just taking the derivative of loss function w.r.t the predicted value and we have already
calculated this derivative:
If you see the formula of residuals above, we see that the derivative of the loss function is
multiplied by a negative sign, so now we get:

The predicted value here is the prediction made by the previous model. In our example the
prediction made by the previous model (initial base model prediction) is 14500, to calculate the
residuals our formula becomes:

Step 3: Build a Model on Calculated Residuals


In the next step, we will build a model on these pseudo residuals and make predictions. Why do we do this? Because we want to minimize these residuals; minimizing the residuals will eventually improve our model's accuracy and prediction power. So, using the residual as the target and the
original feature Cylinder number, cylinder height, and Engine location we will generate new
predictions. Note that the predictions, in this case, will be the error values, not the predicted car
price values since our target column is an error now.
Let’s say hm(x) is our Decision tree made on these residuals.
Step 4: Compute Decision Tree Output
In this step, we find the output values for each leaf of our decision tree. That means there might be
a case where 1 leaf gets more than 1 residual, hence we need to find the final output of all the
leaves. To find the output we can simply take the average of all the numbers in a leaf, doesn’t
matter if there is only 1 number or more than 1.
Let’s see why we take the average of all the numbers. Mathematically this step can be represented
as:

Here hm(xi) is the DT made on residuals and m is the number of DT. When m=1 we are talking
about the 1st DT and when it is “M” we are talking about the last DT.
The output value for the leaf is the value of gamma that minimizes the loss function. The left-hand side "Gamma" is the output value of a particular leaf. The right-hand side, [F_{m-1}(x_i) + γ·h_m(x_i)], is similar to Step 1, but here the difference is that we are using the previous predictions, whereas earlier there was no previous prediction.
Example of Calculating Regression Tree Output
Let’s understand this even better with the help of an example. Suppose this is our regressor tree:

We see that the 1st residual goes in R1,1, the 2nd and 3rd residuals go in R2,1, and the 4th residual goes in R3,1.
Let’s calculate the output for the first leave that is R1,1

Now we need to find the value for gamma for which this function is minimum. So we find the
derivative of this equation w.r.t gamma and put it equal to 0.

Hence the leaf R1,1 has an output value of -2500. Now let’s solve for the R2,1.

Let’s take the derivative to get the minimum value of gamma for which this function is minimum:
We end up with the average of the residuals in the leaf R2,1 . Hence if we get any leaf with more
than 1 residual, we can simply find the average of that leaf and that will be our final output.
Now after calculating the output of all the leaves, we get:

Step 5: Update Previous Model Predictions


This is finally the last step where we have to update the predictions of the previous model. It can
be updated as:

where m is the number of decision trees made.


Since we have just started building our model, m = 1. Now, to make a new DT, our new predictions will be:

Here F_{m-1}(x) is the prediction of the base model (the previous prediction); since F_{1-1} = F_0 is our base model, the previous prediction is 14500.
nu is the learning rate, usually selected between 0 and 1. It reduces the effect each tree has on the final prediction, and this improves accuracy in the long run. Let's take nu = 0.1 in this example.
h_m(x) is the most recent DT made on the residuals.
Let’s calculate the new prediction now:
Suppose we want to find a prediction of our first data point which has a car height of 48.8. This
data point will go through this decision tree and the output it gets will be multiplied by the learning
rate and then added to the previous prediction.
Now let’s say m=2 which means we have built 2 decision trees and now we want to have new
predictions.
This time we will add the previous prediction that is F1(x) to the new DT made on residuals. We
will iterate through these steps again and again till the loss is negligible.
I am taking a hypothetical example here just to
make you understand how this predicts for a new dataset:

If a new data point comes, say, height = 1.40, it’ll go through all the trees and then will give the
prediction. Here we have only 2 trees hence the datapoint will go through these 2 trees and the
final output will be F2(x).
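The whole procedure above is what a library implementation such as scikit-learn's GradientBoostingRegressor performs internally; a minimal usage sketch (the parameters are examples, not tuned values):

# Library version: squared-error gradient boosting with shrinkage.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=6, noise=20, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

gbr = GradientBoostingRegressor(
    n_estimators=200,      # M: number of trees
    learning_rate=0.1,     # nu: shrinks each tree's contribution
    max_depth=3,
    loss="squared_error",  # the MSE loss used in the derivation above
)
gbr.fit(X_train, y_train)
print("R^2 on test data:", gbr.score(X_test, y_test))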

Gradient Boosting in Classification:


Gradient Boosting has three main components:
 Loss Function - The role of the loss function is to estimate how good the model is at
making predictions with the given data. This could vary depending on the problem at hand.
For example, if we’re trying to predict the weight of a person depending on some input
variables (a regression problem), then the loss function would be something that helps us
find the difference between the predicted weights and the observed weights. On the other
hand, if we’re trying to categorize if a person will like a certain movie based on their
personality, we’ll require a loss function that helps us understand how accurate our model
is at classifying people who did or didn’t like certain movies.
 Weak Learner - A weak learner is one that classifies our data but does so poorly, perhaps no better than random guessing. In other words, it has a high error rate. These are typically very shallow decision trees (often decision stumps, which are much less complicated than typical decision trees).
 Additive Model - This is the iterative and sequential approach of adding the trees (weak
learners) one step at a time. After each iteration, we need to be closer to our final model.
In other words, each iteration should reduce the value of our loss function.
An Intuitive Understanding: Visualizing Gradient Boost
Let’s start with looking at one of the most common binary classification machine learning
problems. It aims at predicting the fate of the passengers on Titanic based on a few features: their
age, gender, etc. We will take only a subset of the dataset and choose certain columns, for
convenience. Our dataset looks something like this:

Titanic Passenger Data


 Pclass, or Passenger Class, is categorical: 1, 2, or 3.
 Age is the age of the passenger when they were on the Titanic.
 Fare is the Passenger Fare.
 Sex is the gender of the person.
 Survived refers to whether or not the person survived the crash; 0 if they did not, 1 if they
did.
Now let’s look at how the Gradient Boosting algorithm solves this problem.
We start with one leaf node that predicts the initial value for every individual passenger. For a classification problem, it will be the log(odds) of the target value. log(odds) is the equivalent of the average in a classification problem. Since four passengers in our case survived and two did not survive, the log(odds) that a passenger survived would be log(4/2) ≈ 0.7.
This becomes our initial leaf.

Initial Leaf Node


The easiest way to use the log(odds) for classification is to convert it to a probability. To do so, we'll use this formula:
p = e^(log(odds)) / (1 + e^(log(odds)))
Note: Please bear in mind that we have rounded off everything to one decimal place here, and
hence the log(odds) and probability are the same, which may not be the case always.
If the probability of surviving is greater than 0.5, then we first classify everyone in the training
dataset as survivors. (0.5 is a common threshold used for classification decisions made based on
probability; note that the threshold can easily be taken as something else.)
Now we need to calculate the Pseudo Residual, i.e, the difference between the observed value and
the predicted value. Let us draw the residuals on a graph.
The blue and the yellow dots are the observed values. The blue dots are the passengers who did
not survive with the probability of 0 and the yellow dots are the passengers who survived with a
probability of 1. The dotted line here represents the predicted probability which is 0.7
We need to find the residuals, which would be:
Residual = Observed value − Predicted probability
Here, 1 denotes Yes (survived) and 0 denotes No.


We will use this residual to get the next tree. It may seem absurd that we are considering the
residual instead of the actual value, but we shall throw more light ahead.
Branching out data points using the residual values
We use a limit of two leaves here to simplify our example, but in reality, Gradient Boost typically uses between 8 and 32 leaves.
Because of the limit on leaves, one leaf can have multiple values. Predictions are in terms of log(odds), but these leaves are derived from probabilities, which causes a disparity. So, we can't just add the single leaf we got earlier and this tree to get new predictions, because they are derived from different sources. We have to use some kind of transformation. The most common form of transformation used in Gradient Boost for classification is:
γ = (sum of residuals in the leaf) / (sum of [previous prediction probability × (1 − previous prediction probability)] over those residuals)
The numerator in this equation is the sum of residuals in that particular leaf.
The denominator is the sum of (previous prediction probability for each residual) × (1 − the same previous prediction probability).
The first leaf has only one residual value that is 0.3, and since this is the first tree, the previous
probability will be the value from the initial leaf, thus, same for all residuals. Hence,

For the second leaf,


Similarly, for the last leaf:

Now the transformed tree looks like:

Transformed tree
Now that we have transformed it, we can add our initial leaf to our new tree with a learning rate.

The learning rate is used to scale the contribution from the new tree. This results in a small step in the right direction of prediction. Empirical evidence shows that taking lots of small steps in the right direction results in better predictions on a testing dataset (i.e., data the model has never seen) than trying to make a perfect prediction in the first step. The learning rate is usually a small number like 0.1.
We can now calculate new log(odds) prediction and hence a new probability.
For example, for the first passenger, Old Tree = 0.7. Learning Rate which remains the same for all
records is equal to 0.1 and by scaling the new tree, we find its value to be -0.16. Hence, substituting
in the formula we get:

Similarly, we substitute and find the new log(odds) for each passenger and hence find the
probability. Using the new probability, we will calculate the new residuals.
This process repeats until we have made the maximum number of trees specified or the residuals
get super small.
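A small numeric sketch of these mechanics, using the 4-survived / 2-died counts from the example: the initial log(odds), the conversion to a probability, the pseudo residuals, and the leaf transformation sum(residuals) / sum(p·(1−p)). The two-leaf split assumed here is hypothetical.

# One round of gradient boosting for classification, done by hand on the Titanic-style example.
import numpy as np

y = np.array([1, 1, 1, 1, 0, 0])                     # 4 survivors, 2 non-survivors

log_odds = np.log(y.sum() / (len(y) - y.sum()))      # initial leaf: log(4/2) ≈ 0.7
p = np.full(len(y), np.exp(log_odds) / (1 + np.exp(log_odds)))   # ≈ 0.67 (the text rounds to 0.7)
residuals = y - p                                    # pseudo residuals (observed - predicted)

# Assume a two-leaf stump that happens to send samples 0-3 to leaf 1 and samples 4-5 to leaf 2.
leaves = {1: [0, 1, 2, 3], 2: [4, 5]}
log_odds_per_sample = np.full(len(y), log_odds)
learning_rate = 0.1

for idx in leaves.values():
    idx = np.array(idx)
    # Leaf output = sum of residuals / sum of p*(1-p), as described above.
    gamma = residuals[idx].sum() / (p[idx] * (1 - p[idx])).sum()
    log_odds_per_sample[idx] += learning_rate * gamma

new_p = np.exp(log_odds_per_sample) / (1 + np.exp(log_odds_per_sample))
print(np.round(new_p, 3))   # survivors move toward 1, non-survivors toward 0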
A Mathematical Understanding
We shall go through each step, one at a time and try to understand them.

xi - This is the input variables that we feed into our model.


yi- This is the target variable that we are trying to predict.
We can predict the log likelihood of the data given the predicted probability

yi is observed value ( 0 or 1 ).
p is the predicted probability.
The goal would be to maximize the log likelihood function. Hence, if we use the negative log(likelihood) as our loss function, where smaller values represent better-fitting models, then:

Now the log(likelihood) is a function of the predicted probability p, but we need it to be a function of the predicted log(odds). So, let us try to convert the formula:
We know that:

Substituting,

Now,

Hence,

Loss function in terms of log odds

L(y_i, log(odds)) = −y_i · log(odds) + log(1 + e^(log(odds)))


Now that we have converted the p to log(odds), this becomes our Loss Function.
We have to show that this is differentiable.
This can also be written as:

Now we can proceed to the actual steps of the model building.


Step 1: Initialize model with a constant value

Here, y_i is the observed value, L is the loss function, and gamma is the value for log(odds).
We are summing the loss function, i.e., we add up the loss function for each observed value.
argmin over gamma means that we need to find a log(odds) value that minimizes this sum.
Then, we take the derivative of each loss function:

… and so on.
Step 2: for m = 1 to M
(A)

This step needs you to calculate the residual using the given formula. We have already found the
Loss Function to be as :
Hence,

(B) Fit a regression tree to the residual values and create terminal regions

Because the number of leaves per branch is limited, we might have more than one value in a particular terminal region.
In our first tree, m=1 and j will be the unique number for each terminal node. So R11, R21 and so
on.
C)

For each leaf in the new tree, we calculate gamma, which is the output value. The summation should be only over those records which go into making that leaf. In theory, we could find the derivative with respect to gamma to obtain the value of gamma, but that could be extremely tedious due to the terms involved in our loss function.
Substituting the loss function and i=1 in the equation above, we get:

We use second order Taylor Polynomial to approximate this Loss Function:

There are three terms in our approximation. Taking derivative with respect to gamma gives us:
Equating this to 0 and subtracting the single derivative term from both the sides.

Then, gamma will be equal to:

The gamma equation may look humongous but in simple terms, it is:

We will just substitute the value of derivative of Loss Function

Now we shall solve for the second derivative of the Loss Function. After some heavy
computations, we get:

We have simplified the numerator as well as the denominator. The final gamma solution looks like:
We were trying to find the value of gamma that when added to the most recent predicted log(odds)
minimizes our Loss Function. This gamma works when our terminal region has only one residual
value and hence one predicted probability. But, do recall from our example above that because of
the restricted leaves in Gradient Boosting, it is possible that one terminal region has many values.
Then the generalized formula would be:

Hence, we have calculated the output values for each leaf in the tree.
(D)

This formula asks us to update our predictions. In the first pass, m = 1, and we substitute F0(x), the common prediction for all samples (i.e., the initial leaf value), plus ν, the learning rate, multiplied by the output value from the tree we built previously. The summation is for the cases where a single sample ends up in multiple leaves.
Now we will use this new F1(x) value to get new predictions for each sample.
The new predicted value should get us a little closer to the actual value. Note that, in contrast to the single tree in our example, gradient boosting builds a lot of trees and M could be as large as 100 or more.
This completes the loop in Step 2 and we are ready for the final step of Gradient Boosting.
Step 3: Output

If we get new data, then we shall use this value to predict if the passenger survived or not. This would give us the log(odds) that the person survived. Plugging it into the ‘p’ formula:
If the resultant value lies above our threshold then the person survived, else did not.
Comparing and Contrasting AdaBoost and GradientBoost
Both AdaBoost and Gradient Boost learn sequentially from a weak set of learners. A strong learner
is obtained from the additive model of these weak learners. The main focus here is to learn from
the shortcomings at each step in the iteration.
AdaBoost requires users to specify a set of weak learners (alternatively, it will randomly generate a set of weak learners before the real learning process). It increases the weights of the wrongly
predicted instances and decreases the ones of the correctly predicted instances. The weak learner
thus focuses more on the difficult instances. After being trained, the weak learner is added to the
strong one according to its performance (so-called alpha weight). The higher it performs, the more
it contributes to the strong learner.
On the other hand, gradient boosting doesn’t modify the sample distribution. Instead of training
on a newly sampled distribution, the weak learner trains on the remaining errors of the strong
learner. It is another way to give more importance to the difficult instances. At each iteration, the
pseudo-residuals are computed and a weak learner is fitted to these pseudo-residuals. Then, the
contribution of the weak learner to the strong one isn’t computed according to its performance on
the newly distributed sample but using a gradient descent optimization process. The computed
contribution is the one minimizing the overall error of the strong learner.
Adaboost is more about ‘voting weights’ and gradient boosting is more about
‘adding gradient optimization’.
Adaboost vs. Gradient Boost:
 AdaBoost is an additive model where the shortcomings of previous models are identified by high-weight data points; Gradient Boost is an additive model where the shortcomings of previous models are identified by the gradient.
 In AdaBoost the trees are usually grown as decision stumps; in Gradient Boost the trees are grown to a greater depth, usually ranging from 8 to 32 terminal nodes.
 In AdaBoost each classifier has different weights assigned to the final prediction based on its performance; in Gradient Boost all classifiers are weighed equally and their predictive capacity is restricted with a learning rate to increase accuracy.
 AdaBoost gives weights to both classifiers and observations, thus capturing maximum variance within the data; Gradient Boost builds trees on the previous classifier’s residuals, thus capturing variance in the data.

Advantages of Gradient Boosting


 Often provides predictive accuracy that cannot be trumped.
 Lots of flexibility - can optimize on different loss functions and provides several hyper
parameter tuning options that make the function fit very flexible.
 No data pre-processing required - often works great with categorical and numerical values
as is.
 Handles missing data - imputation not required.

Disadvantages of Gradient Boosting


 Gradient Boosting models will continue improving to minimize all errors. This can overemphasize outliers and cause overfitting.
 Computationally expensive - they often require many trees (>1000), which can be time and memory exhaustive.
 The high flexibility results in many parameters that interact and heavily influence the behavior of the approach (number of iterations, tree depth, regularization parameters, etc.). This typically requires a large grid search during tuning (a sketch of such a search follows this list).
 Less interpretable in nature, although this is easily addressed with various tools.
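As a rough sketch of the kind of hyperparameter search mentioned in the disadvantages above (the parameter grid and synthetic data are illustrative assumptions, not recommended settings):

```python
# A minimal grid-search sketch for tuning a gradient boosting model.
# The parameter grid and synthetic data are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=0.3, random_state=0)

param_grid = {
    "n_estimators": [100, 500],
    "learning_rate": [0.01, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print("Best parameters:", search.best_params_)
```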
XGBoost

 It is an optimized distributed gradient boosting library designed for efficient and scalable training of machine learning models. It is an ensemble learning method that combines the predictions of multiple weak models to produce a stronger prediction. XGBoost stands for "Extreme Gradient Boosting", and it has become one of the most popular and widely used machine learning algorithms because it can handle large datasets and achieve state-of-the-art performance in many machine learning tasks such as classification and regression.
 One of the key features of XGBoost is its efficient handling of missing values, which
allows it to handle real-world data with missing values without requiring significant pre-
processing. Additionally, XGBoost has built-in support for parallel processing, making
it possible to train models on large datasets in a reasonable amount of time.

XGBoost Features

XGBoost is a widespread implementation of gradient boosting. Let's discuss some features of XGBoost that make it so attractive.

 XGBoost offers regularization, which allows you to control overfitting by introducing L1/L2 penalties on the weights and biases of each tree. This feature is not available in many other implementations of gradient boosting.
 XGBoost also has a block structure for parallel learning, which makes it easy to scale up on multicore machines or clusters. It also uses cache-aware access patterns, which help make efficient use of memory and speed up training on large datasets.
 Finally, XGBoost offers out-of-core computing capabilities using disk-based data
structures instead of in-memory ones during the computation phase.
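A minimal sketch of how some of these features are exposed through the XGBoost Python package (the data and parameter values are illustrative assumptions):

```python
# A minimal sketch of XGBoost's regularization and parallelism options.
# Parameter values are illustrative assumptions, not tuned settings.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=1000)

model = xgb.XGBRegressor(
    n_estimators=200,
    reg_alpha=0.1,       # L1 penalty on leaf weights
    reg_lambda=1.0,      # L2 penalty on leaf weights
    n_jobs=-1,           # use all cores (block-structured parallel learning)
    tree_method="hist",  # histogram-based split finding for larger datasets
)
model.fit(X, y)
print(model.predict(X[:3]))
```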
Why XGBoost?

XGBoost is used for these two reasons: execution speed and model performance.

Execution speed is crucial when working with large datasets. XGBoost imposes few practical restrictions on dataset size, so you can work with datasets that would be impractical for many other implementations.

Model performance is also essential because it allows you to create models that perform better than other models. XGBoost has been compared with algorithms such as random forest (RF), gradient boosting machines (GBM), and gradient boosted decision trees (GBDT), and these comparisons generally show XGBoost ahead in both execution speed and model performance.

Working of XGBoost – Regression:

1. Given data and initial predictions

We are given input features (X) and a target feature (Y). We start with an initial set of predictions (set to 0.5 by default in both classification and regression, though you can start from other values as well).

2. Calculating Pseudo Residuals

Calculate the (pseudo) residuals by subtracting the initial predictions from Y (residual = observed value minus predicted value).

3. Build XGBoost trees


Calculate similarity score using the following formula.
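The formula referred to here is not shown in the text; for regression it is the standard XGBoost similarity score (it also falls out of the derivation in the maths section later in this unit):

\[ \text{Similarity Score} = \frac{\left(\sum \text{residuals}\right)^{2}}{\text{number of residuals} + \lambda} \]

where λ is a regularization parameter.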

4. Finding the best splitting feature and its values.

• Now we go through all the input features one by one.


• Select the first feature, sort its values in ascending order and go through those values one
by one.
• Take 2 points at a time from the start, get the mean of those 2 values and then divide the
leaf residuals according to the mean value, i.e. put residuals of elements with less than
mean feature value to one node and the others to a different node.
• Do this for all features and all values in each feature.

Calculating Gain

• We now calculate the gain value for each of the candidate splits created in step 4, as shown below. We go through all of the candidate splits and then take the split that gives us the highest gain, i.e. we select the one that best separates the observations.
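The gain of a candidate split is computed from the similarity scores of the two children and of the node being split:

\[ \text{Gain} = \text{Similarity}_{\text{left}} + \text{Similarity}_{\text{right}} - \text{Similarity}_{\text{parent}} \]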

Creating the tree


We continue doing steps 3 and 4 for the children of the tree to split it further. The stopping conditions are that either the maximum depth has been reached (6 by default) or the leaf contains only the minimum allowed number of residuals.
Pruning the tree

We use a hyperparameter gamma (γ) to prune trees.

We prune bottom-up through the tree. If the Gain of a node is less than the gamma (γ) value, then we prune its children (more formally, if Gain − γ < 0). We only move up to the parent if we pruned at that particular point; otherwise we stop there.

In this case, if gamma were 130 we would not prune the (Dosage < 30) node, since gamma is less than its gain value. Since we didn't prune it, we don't go up to its parent, i.e. (Dosage < 15), to check for pruning.

Note: setting gamma to 0 does not turn off pruning, because there may be nodes with negative gain values; Gain − γ will be < 0 in that case, leading to pruning.

Calculating the output Value

Now we’ll calculate the output values for each of the child nodes in the tree using the following
formula.
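The formula itself is not shown in the text; for regression, each leaf's output value is

\[ \text{Output Value} = \frac{\sum \text{residuals}}{\text{number of residuals} + \lambda} \]

which is consistent with the derivation in the maths section later in this unit.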

Using this calculate the output values for each child node. Illustrated below.
Updating the residuals and getting the output

Now that we have a tree that can predict residuals, we can update our initial default vector of 0.5 predictions by adding the tree's output values to it (multiplied by a learning rate, of course; we don't want to jump directly to the predicted value and overfit).
Advantages of XGBoost:
1. Performance: XGBoost has a strong track record of producing high-quality results in
various machine learning tasks, especially in Kaggle competitions, where it has been a
popular choice for winning solutions.
2. Scalability: XGBoost is designed for efficient and scalable training of machine learning
models, making it suitable for large datasets.
3. Customizability: XGBoost has a wide range of hyperparameters that can be adjusted to
optimize performance, making it highly customizable.
4. Handling of Missing Values: XGBoost has built-in support for handling missing values,
making it easy to work with real-world data that often has missing values.
5. Interpretability: Unlike some machine learning algorithms that can be difficult to interpret,
XGBoost provides feature importances, allowing for a better understanding of which
variables are most important in making predictions.
Disadvantages of XGBoost:
1. Computational Complexity: XGBoost can be computationally intensive, especially when
training large models, making it less suitable for resource-constrained systems.
2. Overfitting: XGBoost can be prone to overfitting, especially when trained on small datasets
or when too many trees are used in the model.
3. Hyperparameter Tuning: XGBoost has many hyperparameters that can be adjusted, making
it important to properly tune the parameters to optimize performance. However, finding the
optimal set of parameters can be time-consuming and requires expertise.
4. Memory Requirements: XGBoost can be memory-intensive, especially when working with
large datasets, making it less suitable for systems with limited memory resources.

XGBoost for Regression:


Dosage Drug effectiveness Residual
10 -10 -10.5
20 7 6.5
23 8 7.5
37 -7 -7.5

The first step is to make an initial prediction, which can be anything, but by default it is 0.5, regardless of whether the task is regression or classification.

λ is the regularization parameter. Let's assume λ = 0.


Now we calculate the similarity score for the root using all the residuals.
Next we split the root using Dosage thresholds. Consider the average of the first two dosages, which is 15, and split the root on Dosage < 15; the other candidate thresholds considered are Dosage < 22.5 and Dosage < 30.
Since the Gain for Dosage < 22.5 (Gain = 4) and the Gain for Dosage < 30 (Gain = 56.33) are both less than the Gain for Dosage < 15 (Gain = 120.33), Dosage < 15 is best at splitting the residuals into clusters of similar values.
So we use the threshold that gives the largest Gain, in this case Dosage < 15.
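For completeness, these Gain values follow from the similarity and gain formulas above, with λ = 0:

\[ \text{Similarity}_{\text{root}} = \frac{(-10.5 + 6.5 + 7.5 - 7.5)^{2}}{4} = 4 \]
\[ \text{Gain}_{\text{Dosage}<15} = \frac{(-10.5)^{2}}{1} + \frac{(6.5 + 7.5 - 7.5)^{2}}{3} - 4 = 110.25 + 14.08 - 4 = 120.33 \]
\[ \text{Gain}_{\text{Dosage}<22.5} = \frac{(-10.5 + 6.5)^{2}}{2} + \frac{(7.5 - 7.5)^{2}}{2} - 4 = 8 + 0 - 4 = 4 \]
\[ \text{Gain}_{\text{Dosage}<30} = \frac{(-10.5 + 6.5 + 7.5)^{2}}{3} + \frac{(-7.5)^{2}}{1} - 4 = 4.08 + 56.25 - 4 = 56.33 \]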
For further splitting, the leaf on the left has only one residual, so no further split is possible there. For the leaf on the right we can split again by considering the candidate thresholds between its dosage values.

The first split that we try is Dosage < 22.5.

The Gain for Dosage < 22.5 is less than the Gain for Dosage < 30, so the final split will be Dosage < 30.
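Again for completeness, the right-hand leaf contains the residuals 6.5, 7.5 and −7.5 (similarity ≈ 14.08), and the same formulas give:

\[ \text{Gain}_{\text{Dosage}<22.5} = \frac{(6.5)^{2}}{1} + \frac{(7.5 - 7.5)^{2}}{2} - 14.08 = 28.17 \]
\[ \text{Gain}_{\text{Dosage}<30} = \frac{(6.5 + 7.5)^{2}}{2} + \frac{(-7.5)^{2}}{1} - 14.08 = 98 + 56.25 - 14.08 = 140.17 \]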

Now, let's see the process of pruning.

For pruning, we first decide the threshold value gamma (γ) and calculate Gain − γ for each node, starting from the lowest branch and moving upward.
If Gain − γ > 0, do not prune the branch.
If Gain − γ < 0, prune the branch.
Setting γ = 0 does not turn off pruning, because there may be nodes with negative Gain values; Gain − γ will be < 0 in that case, leading to pruning.
In this case assume γ = 130. Since Gain − γ is positive for the lowest branch, we do not remove that branch, and we therefore do not move up to its parent to check for pruning.
Now consider λ = 1. In the resulting tree one can observe that increasing λ decreases the Gain values. When λ > 0, it is easier to prune the leaves because the values for Gain are smaller.

Now, calculate the output value for each of the leaves using the output value formula given earlier.

One can observe that the output value is inversely related to λ, i.e. λ > 0 reduces the amount that an individual observation adds to the overall prediction. Thus λ (the regularization parameter) reduces the prediction's sensitivity to individual observations.
Prediction is

For λ=0
Prediction for dosage=10 is given by

Prediction for dosage=20 is given by

Dosage    Drug effectiveness    Residual    New prediction
10        -10                   -10.5       -2.65
20        7                     6.5         2.6
23        8                     7.5         2.6
37        -7                    -7.5        -1.75

At the end of the first round the residuals are reduced. We follow the same steps for the remaining m trees, each time fitting a new tree to the updated residuals.
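The sketch below reproduces this toy example end to end in plain Python/NumPy: the gains for the three candidate root splits, the leaf output values and the updated predictions. It assumes λ = 0 and a learning rate η = 0.3, and it is a hand-rolled illustration of the mechanics rather than the actual XGBoost implementation.

```python
# Hand-rolled sketch of one XGBoost regression round on the dosage toy data.
# Assumes lambda = 0 and learning rate eta = 0.3; not the real XGBoost code.
import numpy as np

dosage = np.array([10.0, 20.0, 23.0, 37.0])
effect = np.array([-10.0, 7.0, 8.0, -7.0])

base_pred, lam, eta = 0.5, 0.0, 0.3
residuals = effect - base_pred                      # [-10.5, 6.5, 7.5, -7.5]

def similarity(res):
    # Similarity Score = (sum of residuals)^2 / (number of residuals + lambda)
    return res.sum() ** 2 / (len(res) + lam)

def gain(threshold):
    left = residuals[dosage < threshold]
    right = residuals[dosage >= threshold]
    return similarity(left) + similarity(right) - similarity(residuals)

for t in [15, 22.5, 30]:                            # thresholds used in the text
    print(f"Dosage < {t}: Gain = {gain(t):.2f}")    # 120.33, 4.00, 56.33

# Best root split: Dosage < 15; its right child splits again on Dosage < 30.
# With lambda = 0, each leaf's output value is simply the mean of its residuals.
leaf_out = np.where(dosage < 15, residuals[dosage < 15].mean(),
           np.where(dosage < 30, residuals[(dosage >= 15) & (dosage < 30)].mean(),
                    residuals[dosage >= 30].mean()))
new_pred = base_pred + eta * leaf_out
print("New predictions:", new_pred)                 # [-2.65, 2.6, 2.6, -1.75]
```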
Maths Behind XGBoost Algorithm:
Loss function for regression (squared error):

\[ L(y_i, p_i) = \frac{1}{2}(y_i - p_i)^{2} \]

Loss function for classification (negative log-likelihood):

\[ L(y_i, p_i) = -\big[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \big] \]

XGBoost builds each tree by minimizing a loss function. The actual function is

\[ \sum_{i=1}^{n} L(y_i, p_i) + \gamma T + \frac{1}{2}\lambda O_{value}^{2} \]

where T is the number of terminal nodes (leaves) in a tree and γ is a user-defined penalty used to encourage pruning.
Setting γ = 0 does not turn off pruning, because there may be nodes with negative Gain values; Gain − γ will be < 0 in that case, leading to pruning.
Pruning takes place after the full tree is built, and γ plays no role in deriving the optimal output value or the similarity score.
The second term in the loss function

\[ \sum_{i=1}^{n} L(y_i, p_i) + \frac{1}{2}\lambda O_{value}^{2} \]

is the regularization term. The aim is to find the output value \(O_{value}\) for the leaf that minimizes the whole equation, which is similar to ridge regression.
Since we are optimizing the output value of the first tree, we can replace the prediction \(p_i\) with the initial prediction \(p_i^{0}\) plus the output \(O_{value}\) of the new tree.

For simplification, let λ = 0.

For λ = 0, the optimal \(O_{value}\) is at the bottom of the parabola, where the derivative is zero.
As λ increases, the optimal \(O_{value}\) gets closer to 0 and regularization plays its role.

The first term of the XGBoost loss function is similar to the Gradient Boost loss function. In Gradient Boost, the regression loss is simplified using the ordinary derivative, and a Taylor approximation is used for classification. XGBoost, however, uses a second-order Taylor approximation for both regression and classification.

Since the first derivative of a function is related to the gradient, XGBoost uses g to represent the first derivative of the loss function; h is used for the second derivative, as it is related to the Hessian.

\[
\sum_{i=1}^{n} L(y_i, p_i^{0} + O_{value}) + \frac{1}{2}\lambda O_{value}^{2}
= L(y_1, p_1^{0} + O_{value}) + L(y_2, p_2^{0} + O_{value}) + \dots + L(y_n, p_n^{0} + O_{value}) + \frac{1}{2}\lambda O_{value}^{2}
\]

Applying the second-order Taylor approximation to each term:

\[
\approx L(y_1, p_1^{0}) + g_1 O_{value} + \frac{1}{2} h_1 O_{value}^{2}
+ L(y_2, p_2^{0}) + g_2 O_{value} + \frac{1}{2} h_2 O_{value}^{2} + \dots
+ L(y_n, p_n^{0}) + g_n O_{value} + \frac{1}{2} h_n O_{value}^{2} + \frac{1}{2}\lambda O_{value}^{2}
\]

Since \(L(y_1, p_1^{0}), L(y_2, p_2^{0}), \dots, L(y_n, p_n^{0})\) are not affected by \(O_{value}\), we can ignore them for now:

\[
\sum_{i=1}^{n} L(y_i, p_i^{0} + O_{value}) + \frac{1}{2}\lambda O_{value}^{2}
\approx (g_1 + g_2 + \dots + g_n) O_{value} + \frac{1}{2}(h_1 + h_2 + \dots + h_n + \lambda) O_{value}^{2}
\]

Setting the derivative with respect to \(O_{value}\) to zero,

\[
\frac{d}{dO_{value}}\left\{ (g_1 + g_2 + \dots + g_n) O_{value} + \frac{1}{2}(h_1 + h_2 + \dots + h_n + \lambda) O_{value}^{2} \right\} = 0
\]

gives the optimal output value

\[
O_{value} = \frac{-(g_1 + g_2 + \dots + g_n)}{h_1 + h_2 + \dots + h_n + \lambda}
\]

For the regression loss function \(L(y_i, p_i) = \frac{1}{2}(y_i - p_i)^{2}\), we have \(g_i = -(y_i - p_i)\) and \(h_i = 1\), so

\[ -(g_1 + g_2 + \dots + g_n) = \text{sum of residuals} \]
\[ h_1 + h_2 + \dots + h_n + \lambda = \text{number of residuals} + \lambda \]

Substituting these into the optimal output value gives the formula used earlier:

\[ O_{value} = \frac{\text{sum of residuals}}{\text{number of residuals} + \lambda} \]
For the classification loss function (negative log-likelihood), \(g_i = p_i - y_i\) (the negative residual) and \(h_i = p_i(1 - p_i)\), so

\[ -(g_1 + g_2 + \dots + g_n) = \text{sum of residuals} \]
\[ h_1 + h_2 + \dots + h_n + \lambda = \sum \big[\text{previous probability} \times (1 - \text{previous probability})\big] + \lambda \]
Similarity score:

In the graph referred to here (not reproduced), \(O_{value}\) marks the minimum of the parabolas given by the simplified equation above; the similarity score is calculated from this simplified equation.
Multiplying the simplified equation by −1 flips the parabola over the horizontal line y = 0.
The optimal \(O_{value}\) then represents the x-axis coordinate of the highest point on the flipped parabola, and the y-axis coordinate of that highest point is the similarity score.
Substituting

\[ O_{value} = \frac{-(g_1 + g_2 + \dots + g_n)}{h_1 + h_2 + \dots + h_n + \lambda} \]

into the flipped equation gives

\[
-(g_1 + \dots + g_n)\,\frac{-(g_1 + \dots + g_n)}{h_1 + \dots + h_n + \lambda}
- \frac{1}{2}(h_1 + \dots + h_n + \lambda)\left(\frac{-(g_1 + \dots + g_n)}{h_1 + \dots + h_n + \lambda}\right)^{2}
= \frac{1}{2}\,\frac{(g_1 + g_2 + \dots + g_n)^{2}}{h_1 + h_2 + \dots + h_n + \lambda}
\]

This is the actual similarity score, but since similarity scores are only used for comparison, the constant factor of 1/2 can be ignored.

For the regression loss function, this gives

\[ \text{Similarity Score} = \frac{(\text{sum of residuals})^{2}}{\text{number of residuals} + \lambda} \]

For the classification loss function, this gives

\[ \text{Similarity Score} = \frac{(\text{sum of residuals})^{2}}{\sum \big[\text{previous probability} \times (1 - \text{previous probability})\big] + \lambda} \]
Difference between Gradient Boost and XGBoost:

1. Regularization
Regularization is a technique in machine learning to avoid overfitting. It is a collection of methods that constrain the model from becoming overcomplicated and losing generalization power. It has become an important technique because many models fit the training data too well.
GBM doesn't implement regularization in its algorithm, which makes the algorithm focus only on minimizing the loss function. Compared to GBM, XGBoost implements regularization methods to penalize an overfitting model.
There are two kinds of regularization that XGBoost can apply: L1 regularization (Lasso) and L2 regularization (Ridge). L1 regularization tries to shrink the feature weights or coefficients to zero (effectively becoming a form of feature selection), while L2 regularization tries to shrink the coefficients evenly (which helps deal with multicollinearity). By implementing both regularizations, XGBoost can avoid overfitting better than GBM.

2. Parallelization
GBM tends to have a slower training time than the XGBoost because the latter algorithm
implements parallelization during the training process. The boosting technique might be
sequential, but parallelization could still be done within the XGBoost process.
The parallelization aims to speed up the tree-building process, mainly during the splitting event.
By utilizing all the available processing cores, the XGBoost training time can be shortened.
Speaking of speeding up the XGBoost process, the developers also preprocess the data into their own data format, DMatrix, for memory efficiency and improved training speed.
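A minimal sketch of training through the DMatrix interface (the data and parameter values are illustrative assumptions):

```python
# Minimal sketch of training with XGBoost's native DMatrix interface.
# Data and parameter values are illustrative assumptions.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 8)), rng.normal(size=500)

dtrain = xgb.DMatrix(X, label=y)   # XGBoost's optimized internal data format
params = {"objective": "reg:squarederror", "max_depth": 3, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=100)
preds = booster.predict(xgb.DMatrix(X[:5]))
print(preds)
```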
3. Missing Data Handling
Our training dataset could contain missing data, which we must explicitly handle before passing
them into the algorithm. However, XGBoost has its own in-built missing data handler, whereas
GBM doesn’t.
XGBoost implements its own technique to handle missing data, called Sparsity-aware Split Finding. For any sparsity pattern that XGBoost encounters (missing data, dense zeros, one-hot encoded features), the model learns from the data and finds the optimal split. During splitting, the model decides in which direction the missing values should be sent by checking which direction minimizes the loss.
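As a small illustration (the toy data is an assumption), XGBoost can be fitted directly on a feature matrix containing NaN values, with no imputation step:

```python
# Sketch: XGBoost trains directly on data containing missing values (NaN).
import numpy as np
import xgboost as xgb

X = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 1.5], [4.0, np.nan]])
y = np.array([1.0, 2.0, 3.0, 4.0])

model = xgb.XGBRegressor(n_estimators=10)
model.fit(X, y)     # no imputation needed; NaNs follow the learned default direction
print(model.predict(X))
```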

4. Tree Pruning
The growth strategy for the GBM is to stop splitting after the algorithm arrives at the negative loss
in the split. The strategy could lead to suboptimal results because it’s only based on local
optimization and might neglect the overall picture.
XGBoost avoids the GBM strategy: it grows the tree until the configured maximum depth and then prunes backward. A split with negative loss reduction is pruned, but there is a case in which such a split is not removed: when a split has a negative loss reduction but a deeper split below it is positive, the branch is retained if its overall gain is positive.

5. In-Built Cross-Validation
Cross-validation is a technique to assess our model generalization and robustness ability by
splitting the data systematically during several iterations. Collectively, their result would show if
the model is overfitting or not.
Normally, a machine learning algorithm requires external tooling to implement cross-validation, but XGBoost has built-in cross-validation that can be used during training. The cross-validation is evaluated at each boosting iteration and helps ensure that the produced trees are robust.
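A minimal sketch of the built-in cross-validation via xgb.cv (the data, fold count and early-stopping setting are illustrative assumptions):

```python
# Sketch of XGBoost's built-in cross-validation (xgb.cv).
# Data, fold count and early stopping are illustrative assumptions.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 8)), rng.normal(size=500)
dtrain = xgb.DMatrix(X, label=y)

cv_results = xgb.cv(
    params={"objective": "reg:squarederror", "max_depth": 3, "eta": 0.1},
    dtrain=dtrain,
    num_boost_round=200,
    nfold=5,                   # 5-fold CV evaluated at every boosting round
    early_stopping_rounds=10,  # stop when the test metric stops improving
)
print(cv_results.tail())       # train/test metric per boosting round (DataFrame)
```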

Role of learning rate in gradient boosting:


The learning rate in gradient boosting plays a crucial role in controlling how much each new
tree (or model) contributes to the final prediction. Gradient boosting works by building models
sequentially, with each one correcting the errors of the previous models. The learning rate
determines the size of the step taken in this correction process.
Key Roles of Learning Rate in Gradient Boosting
1. Controls the Contribution of Each Model:
o A smaller learning rate reduces the impact of each new tree on the overall model.
This forces the algorithm to rely on a large number of trees to minimize the error
gradually.
o A larger learning rate makes each tree more influential but risks overshooting the
optimal solution.
2. Balances Underfitting and Overfitting:
o Low learning rate: Leads to a slower learning process, requiring more trees to
achieve good performance. This can improve generalization and reduce the risk of
overfitting.
o High learning rate: Can lead to faster convergence but increases the risk of
overfitting the training data, especially with complex datasets.
3. Interacts with the Number of Trees:
o A low learning rate often requires more trees to achieve good performance.
o A high learning rate works with fewer trees but might not generalize well.
4. Provides Stability in Optimization:
o Gradient boosting optimizes the loss function in a stage-wise manner. A smaller
learning rate ensures that each step is conservative, allowing the algorithm to
converge more reliably to an optimal solution.

Choosing the Learning Rate


1. Common Default Values:
o Typical default values range from 0.01 to 0.1.
o For fine-tuning, start with 0.1 and experiment with smaller values like 0.01 or
0.001.
2. Trade-off with the Number of Trees:
o Lower learning rates (e.g., 0.01) often require more trees (n_estimators) to achieve
similar performance compared to higher learning rates (e.g., 0.1).
3. Dataset Complexity:
o For complex datasets with high noise, a lower learning rate is preferred to prevent
overfitting.
4. Cross-Validation:
o Use cross-validation to tune the learning rate along with other hyperparameters like the number of estimators, maximum depth of trees, and regularization parameters (a brief sketch follows this list).
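A small sketch of the learning-rate versus number-of-trees trade-off discussed above (the synthetic data and configurations are illustrative assumptions): a low rate with many trees and a moderate rate with fewer trees often reach similar accuracy, while very aggressive rates tend to generalize worse.

```python
# Sketch of the learning_rate vs. n_estimators trade-off in gradient boosting.
# Synthetic data and configurations are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=800, n_features=15, noise=0.5, random_state=1)

configs = [
    {"learning_rate": 0.01, "n_estimators": 1000},  # slow learning, many trees
    {"learning_rate": 0.1,  "n_estimators": 200},   # common default region
    {"learning_rate": 1.0,  "n_estimators": 50},    # aggressive steps, few trees
]

for cfg in configs:
    model = GradientBoostingRegressor(max_depth=3, random_state=1, **cfg)
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"learning_rate={cfg['learning_rate']}, n_estimators={cfg['n_estimators']}: "
          f"mean CV R^2 = {score:.3f}")
```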
Role of gamma in xgboost
In XGBoost, gamma is indeed used for pruning the tree and controls how much improvement
in the objective function is needed to make a split. It plays a crucial role in deciding whether to
grow the tree deeper or prune it. Specifically, gamma adds a regularization term to control the
complexity of the model by limiting the number of splits (branches) made during tree construction.
How Gamma Works for Pruning Trees:
 Gamma represents the minimum loss reduction required to make a further split at a node.
 If the gain from a split (i.e., the improvement in the objective function) is greater than
gamma, then the split is allowed, and the tree continues to grow.
 If the gain from the split is less than gamma, the algorithm prunes the tree by stopping
further splitting at that node, even if the split might still reduce the loss to some degree.
This mechanism helps prune unnecessary branches and results in simpler trees, which helps in
preventing overfitting by avoiding overly complex trees.
Pruning Process in XGBoost (Using Gamma):
1. When Gamma is 0 (No Pruning):
o The algorithm will continue to split nodes as long as the gain is positive, meaning
that even small improvements will lead to further splits.
o This can lead to deeper, more complex trees with more branches, which might
overfit the training data.
2. When Gamma is Positive (Pruning Occurs):
o If the gain from a split is smaller than the specified gamma, the algorithm will stop
splitting further at that node, effectively pruning the tree.
o This helps control the depth of the tree and the number of branches, preventing
overfitting and improving generalization.
Effect of Gamma on Tree Complexity:
 Lower gamma values (e.g., 0) allow more splits and create deeper, more complex trees.
 Higher gamma values (e.g., 1 or higher) restrict the creation of new branches and force
the tree to be shallower and simpler.
 By increasing gamma, the algorithm is forced to make more significant improvements in
loss before making a split. This regularization reduces the risk of overfitting since the
model is less likely to create unnecessary splits.
 Lower gamma values may result in a model that overfits by creating deeper trees with too
many branches, capturing noise in the data.
 Gamma helps minimize the impact of small splits that don't significantly reduce the loss.
Small splits are typically less meaningful and could lead to overfitting if allowed to persist
in the model.
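A small sketch showing how gamma restricts tree growth in XGBoost (the data and gamma values are illustrative assumptions); higher gamma typically yields fewer splits and simpler trees:

```python
# Sketch: increasing gamma (min loss reduction per split) simplifies the trees.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] ** 2 + 0.3 * rng.normal(size=1000)

for gamma in [0.0, 1.0, 10.0]:
    model = xgb.XGBRegressor(n_estimators=50, max_depth=6, gamma=gamma, random_state=0)
    model.fit(X, y)
    # Count the lines of the dumped trees as a rough proxy for model complexity.
    n_lines = sum(len(tree.splitlines()) for tree in model.get_booster().get_dump())
    print(f"gamma={gamma}: total dumped tree lines (complexity proxy) = {n_lines}")
```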
