Machine Learning With Boosting
Boosting
A Beginner’s Guide
By Scott Hartshorn
What Is In This Book
The goal of this book is to provide you with a working
understanding of how the machine learning algorithm "Gradient
Boosted Trees" works. Gradient Boosted Trees, which is one of the
most commonly used types of the more general "Boosting"
algorithm, is a type of supervised machine learning. What that
means is that we will initially pass the algorithm a set of data with a
bunch of independent variables, plus the solution that we care
about. We will use the known solution and the known independent
variables to develop a method of using those variables to derive that
solution (or at least get as close as we can). Later on, after we train
the algorithm, we will use the method we derived to predict the
unknown results for new sets of independent variables.
This is an example driven book, rather than a theory driven book.
That means we will be showing the actual algorithms within the
code that executes gradient boosted trees, instead of showing the
high level equations about which loss functions are being optimized.
The most common explanation for boosting is "Boosting is a
collection of weak learners combined to form a strong learner." The
goal of this book is to provide a more tangible and intuitive
explanation than that. This book starts with some analogies that
provide a rough framework of how boosting works, and then goes
into a step by step explanation of gradient boosted trees. It turns out
that the actual boosting algorithms are a straightforward application
of algebra, except for the decision trees that are one part of the
process for most boosting algorithms. (The decision trees are
reasonably straightforward, but are not algebra.)
The examples that will be shown focus on two types of
problems. One is a regression analysis, where we are trying to
predict a real value given a set of data. One real life example of
regression is Zillow predicting a house's value based on publicly
available data. The example regression analysis we will show isn't
that complicated. We will try to predict the value of a sine wave,
shown below as the black dots.
And we will show how that single blue line, which results from a
decision tree, can be improved using boosting to match the sine
wave values more closely, as shown below.
Obviously this sine wave isn’t as complicated as Zillow’s house
price prediction, but it turns out that once we understand how the
boosting algorithm works, it is simple to increase the complexity
with more data or more layers of boosting.
The other example we will show is a categorization problem. With
categorization we are trying to predict discrete results. For example,
instead of Zillow predicting a house's value, it could be an investment
company trying to determine whether they should invest in an asset:
yes or no. In that example we will show how we can take the
categorical data shown below, plotted as either red triangles or blue squares,
And make predictions about the values of the entire design space,
shown below.
Once we understand how to group two different categories using
boosting, we will extend that to how to work with any number of
categories.
http://www.fairlynerdy.com/decision-trees-cheat-sheet/
Table of Contents
Introduction
A Quick Example Of Boosting
Why Do People Care About Boosting?
An Analogy of How Boosting Works
How Boosting Really Works – Skipping Over The Details
Final Thoughts
If You Found Errors Or Omissions
More Books
Thank You
A Quick Example Of Boosting
There are several different boosting algorithms. This
book focuses on one of them, gradient boosted trees. The
exact math differs between the different boosting algorithms, but
they are all characterized by two key features.
1. Multiple iterations
2. Each subsequent iteration focuses on the parts of the
problem that previous iterations got wrong.
A real life example of a boosting algorithm in progress might
be a high school band teacher teaching a class of 20 students.
This specific band teacher wants to make the average quality
of his class as good as possible. So what does he do?
On day 1 he knows nothing about the quality of the musicians
that he is teaching, so he simply teaches a standard class.
After that, he knows exactly how well each student is doing.
So he then tailors his instruction to focus on whichever
students he can help the most that day. Typically that will be
the worst students in the class. After all, his goal isn’t to
make his best students perfect, he is trying to bring up the
average. It is usually easier to turn Bad into Acceptable than
it is to turn Good into Exceptional. So this music teacher will
tend to ignore the advanced students and spend more time with
the worst students.
How is this an example of boosting put into practice? Simple,
the iterations take place each subsequent day at each new
class. And the requirement that the algorithm focuses on
fixing the errors from previous iterations is satisfied since the
teacher is finding the students with the largest amount of error
and working to improve them.
Why Do People Care About Boosting?
Machine learning is a field with an increasing number of
applications. Large companies like Google and Amazon are
using it for their personal assistant products (e.g. Alexa), and it
will likely revolutionize a number of different fields, for example
when autonomous cars become available. That is machine
learning on a grand scale, done with large dedicated teams, and
it is more advanced than we will cover in this book. On an
individual's scale, machine learning has a number of
interesting applications as well. One of the easiest places to
see them applied is on the machine learning competition site
Kaggle.
On Kaggle, individuals or small teams compete to take
different sets of data and extract the most information possible
out of them using whatever techniques they choose. The
winners frequently receive cash, as well as bragging rights.
But the important point here is not the competitors, but the
companies who are generating the data sets. They have real
problems where a better analysis of data can open up new
business opportunities, and they are willing to give away cash
prizes (typically in the tens of thousands of dollars, sometimes
more) to get better answers.
It is difficult to know which machine learning algorithm will
work best for any given problem. However, in recent competitions
people have found that boosting (especially XGBoost, which
uses gradient boosted trees along with other improvements)
has done very well. Here are some competitions that have
been won using boosting, or boosting in conjunction with
other techniques:
Liberty Mutual Property Inspection – Use some
property information to predict the hazards in a
home, for insurance purposes.
Caterpillar Tube Pricing – Attempt to predict how
much a supplier will charge for different orders of
metal tubing.
Avito Duplicate Ads Detection – Identify duplicate
ads from an online marketplace so that they can be
removed.
Facebook Robot Detection – An online auction site
(likely a penny auction site) has been flooded with
robot bidders which is causing the real customers to
leave the site. Can you identify the robots?
Otto Product Classification – Use a set of provided
features to figure out what category different
products should be grouped into.
The type of boosting shown in this book, Gradient Boosted
Trees, uses multiple decision trees (which are a series of if-
then questions that split the data into two different branches,
shown in detail later) sequentially in order to improve on one
of the main limitations that decision trees have, which is
overfitting.
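As a point of reference before the analogies, here is a minimal sketch (my own illustration with made-up data, using scikit-learn rather than any code from this book) of how a gradient boosted tree model is typically trained and then used to predict:

    # A minimal sketch of training a gradient boosted tree regressor.
    # The data here is made up purely for illustration.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    X = np.random.rand(200, 3)              # 200 rows, 3 independent variables
    y = X[:, 0] + 2 * X[:, 1] - X[:, 2]     # a known "solution" to train against

    model = GradientBoostingRegressor(n_estimators=100, max_depth=2, learning_rate=0.1)
    model.fit(X, y)                         # train on the known solutions
    print(model.predict(X[:5]))             # later, predict solutions for new data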
An Analogy of How Boosting Works
Boosting, when applied to computer science, has a more
formal definition than the music teacher analogy listed above.
One of the most common descriptions of boosted learning is
that a group of “weak learners” can be combined to form a
“strong learner”.
Applying that description to the music teacher analogy would
be that a group of daily instructions (the weak learners) can be
combined to produce a good quality course (the strong learner).
That combination-of-weak-learners-into-a-strong-learner
description is accurate, and it does give some information.
However, it leaves out some important points, such as:
What is a weak learner?
How are they combined?
Are some weak learners better than others?
The following analogies are intended to build your intuition on
those points above before getting into the actual math of how
boosting algorithms work.
In the chart above, the average value of all the data points is 6.0. If
you subtract the average value from each point, you get the error
shown in the chart. 3 points have an error of -5 (1-6), 2 points have
an error of -4, 3 have an error of -3, and 8 have an error of 4. The
squares of those errors are 25, 16, 9, and 16. The sum of the
squared errors over all the points is 3*25 + 2*16 + 3*9 + 8*16 = 262,
which is the total sum squared error.
The regression tree will split the data into more groups and calculate
the total sum squared error using a different average value for each
group. (The fact that each group gets to use its own average value
instead of the global average is what reduces the resulting sum
squared error after a split). The objective of the regression tree is to
choose splits that minimize that sum squared error.
If we passed these points through a depth 1 regression decision tree,
using x as our feature with the objective of estimating y, the decision
tree would split the data at x = 8.5, creating two groups. This is
shown in the chart below.
After this first split, the summed squared error is much reduced.
That is because the points less than x = 8.5 use their average value of
2.0 to calculate their squared error, and the points greater than x =
8.5 use their average value of 10.0 to calculate their squared error.
The summed squared error for the left group of points is 6.0, and for
the right group of points is 0.0.
Therefore the total summed squared error is 6, which is the best you
can do with this data and only two lines. That was a good split and
it made a lot of improvement in the summed squared error (SSE).
Going from 262 to 6 was a large reduction in the SSE. The next
split will make another improvement.
Here the left line split again at x=4.5. The right line was already
perfect so it didn’t split. This grouping with 3 groups of points has
an SSE of 1.2, which is an improvement over the previous step but
isn’t perfect.
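To make those numbers concrete, here is a short Python check (the 16 point values are the ones implied by the counts above) that reproduces the sum squared error of 262 with one global average, 6 after the first split, and 1.2 after the second split:

    import numpy as np

    # The 16 point values implied by the counts above:
    # 3 points at 1, 2 points at 2, 3 points at 3, and 8 points at 10
    y = np.array([1, 1, 1, 2, 2, 3, 3, 3, 10, 10, 10, 10, 10, 10, 10, 10])

    def sse(values):
        """Sum squared error of a group measured against its own average."""
        return np.sum((values - values.mean()) ** 2)

    print(sse(y))                                  # 262.0 -> one global average
    print(sse(y[:8]) + sse(y[8:]))                 # 6.0   -> after the split at x = 8.5
    print(sse(y[:5]) + sse(y[5:8]) + sse(y[8:]))   # 1.2   -> after the second split at x = 4.5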
With 2 splits we theoretically could have gotten 4 horizontal lines,
(since depth 2 trees can make 4 groups), which could have matched
the data perfectly. However, since the regression makes the best
choice at every split, and not necessarily the best choice overall, the
first split put the right side of the graph into its own group. In the
second set of splits, the left side split again, but the right side cannot
since it was already perfect. As a result, we end up with 3 groups.
If we made this a depth 3 tree, we would get another split and a
perfect fit.
The way a regression decision tree would work would be to split the
data based on weight, then split the data based on volume, then
split the data based on weight again, switching back and
forth between weight and volume as it attempts to optimize the
regression at each step.
Just looking at whether the object floats or sinks, and ignoring how
much is actually floating, the final result of the decision tree would
be to break up the data as shown below, in a series of horizontal and
vertical lines.
This would have taken several branches and split the data into a
number of smaller segments that are not shown.
If we also tried to break up the top left section into how much of
each data point is above the water, we would get an even more
complicated set of horizontal and vertical lines. The key here is that
we are getting horizontal and vertical lines, not diagonal ones.
What we really need for this data is the ratio of mass to volume,
which is density. Such a split would look like the chart below.
This is how a human would split the data, breaking it into two
groups with a single line. However, a decision tree cannot capture
that relationship between multiple variables. It cannot take the ratio
of two features and make the split based on that value. If you were to do that
calculation outside of the decision tree and pass it density, it could
solve the regression analysis quickly and competently. But since the
decision tree only splits on one variable at a time, and only at one
location on that variable, if it only has weight and volume, the
decision tree will have to do a brute force solution instead of an
elegant one. Creating better variables for the machine learning
algorithm to operate on is called feature engineering.
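As a hedged illustration of that last point (the data here is made up, and float-or-sink is treated as a simple 0/1 label), computing density yourself and handing it to the tree turns a brute force problem into a single split:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    weight = rng.uniform(1, 10, 200)                    # hypothetical object weights
    volume = rng.uniform(1, 10, 200)                    # hypothetical object volumes
    floats = (weight / volume < 1.0).astype(int)        # floats if density < 1 (water)

    # Raw features: the tree needs many horizontal/vertical splits
    raw = DecisionTreeClassifier(max_depth=3).fit(np.column_stack([weight, volume]), floats)

    # Engineered feature: a single split on density separates the data
    density = (weight / volume).reshape(-1, 1)
    engineered = DecisionTreeClassifier(max_depth=1).fit(density, floats)

    print(raw.score(np.column_stack([weight, volume]), floats))   # usually below 1.0
    print(engineered.score(density, floats))                      # 1.0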
These are the actual values that we will try to match using regression.
The regression tree split the data at x = 3.14. It did that to minimize
the resulting sum squared error. For each possible split location, the
regression tree calculates the squared error at every point, sums those
errors, and picks the split location which minimizes that sum squared error.
In this case, clearly, the blue lines are not a very good fit for the
black sine wave dots. Trying to fit the sine wave with just two
horizontal lines in a single step isn’t practical. The error is the
difference between the sine wave curve and what we predicted, and
there is a lot of error after the first step.
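You can see that first split for yourself with a depth 1 regression tree fit to points on a sine wave. This sketch uses scikit-learn's DecisionTreeRegressor, which is a stand-in for whatever code generated the book's charts:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    x = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
    y = np.sin(x).ravel()

    tree = DecisionTreeRegressor(max_depth=1).fit(x, y)
    print(tree.tree_.threshold[0])    # split location, approximately 3.14 (pi)
    print(tree.tree_.value.ravel())   # node averages: [root, left leaf, right leaf]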
By making that change, the result is that instead of trying to fit the
curves of the error with 2 horizontal lines, we are trying to fit them
with up to 4 horizontal lines. Why are there 4 lines? Because the
max depth parameter controls the depth of the tree. As we saw in the
decision tree section, a tree with a depth of 1 can end up with 2
branches. A tree with a depth of 2 can end up with 4 branches. A
tree with a depth of 3 can end up with 8 branches. The formula is
branches = 2 to the power of the depth.
So what are the results after increasing the depth of the decision
trees? This is the curve fit after a single step, i.e. just the very first
decision tree
Here the decision tree split the data at x = 3.14 and then split the left
branch again at x = .476 and the right branch again at x = 5.807.
The 4 horizontal lines represent the average error of each of the 4
groups. You can see how that is an improvement over the
comparable curve fit with a depth of 1, shown below
Still, it isn’t really surprising that the regression with 4 horizontal
lines is better than the regression with 2 horizontal lines after a
single iteration. Let’s compare the model with a maximum depth
of 2 after 5 iterations vs the model with the maximum depth of 1
after 10 iterations. One of the methods has more iterations, the
other has more detail in each of the iterations. Which turns out
better in this case?
Here is the chart with a maximum depth of 2, after 5 iterations
And here is the curve fit with a maximum depth of 1 after 10
iterations
The regression with the depth 2 tree over 5 iterations is the better
curve fit. That difference is especially pronounced near the middle
of the chart. If you let the boosting algorithm using a tree with max
depth 2 have a couple more steps, up to 10, the fit continues to
improve, whereas the algorithm using a tree with depth 1 has very
little improvement. A fit with depth 2 and 10 iterations is shown
below.
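Here is a hedged sketch of that comparison using scikit-learn's GradientBoostingRegressor (the book's own code may differ; the learning rate is set to 1.0 so that each tree's full correction is added, matching the description so far):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error

    x = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
    y = np.sin(x).ravel()

    # Depth 2 trees, 5 boosting iterations
    deep_few = GradientBoostingRegressor(max_depth=2, n_estimators=5, learning_rate=1.0).fit(x, y)

    # Depth 1 trees, 10 boosting iterations
    shallow_many = GradientBoostingRegressor(max_depth=1, n_estimators=10, learning_rate=1.0).fit(x, y)

    print(mean_squared_error(y, deep_few.predict(x)))      # typically the lower error, per the charts above
    print(mean_squared_error(y, shallow_many.predict(x)))  # typically the higher error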
Why is the regression with a depth 2 tree so much better?
The key difference is that the analysis with a maximum depth of 2,
with 4 horizontal lines, is able to significantly improve the error in at
least one section of the graph for each boosting iteration. Even if
only a few data points get improved each iteration, as long as the
analysis can consistently isolate and improve subsections, the
boosting regression will continue to get better. We can see that
below in the iteration 10 regression fit of the current error.
There were two key statements in the previous paragraph. A good
weak learner needs to isolate and improve subsections. To put it a
different way, it needs to make at least one section a lot better,
without making the rest of the graph any worse.
What we see in the chart above is that most of the graph had no
change. Everything to the left of x = 5.807 has an average error of
near zero. That section didn’t get any better, but it also didn’t get
any worse. However, two limited sections between x = 5.807 and
x=5.997 as well as x=5.997 and x=6.188 improved a lot. For those
points, the error has been reduced to almost zero. We can see that
when looking at the chart for error remaining after iteration 10
shown below.
What that means is for the next boosting iteration there is a lot less
error on the right side of the graph. So the next boosting iteration
will focus on getting the minimum square error on another section of
the graph, even if that is also a small section.
Boosting iteration 11 shown below also makes improvements on a
very limited section of the graph. But on that limited section, it
makes very good improvements and does not make any other part of
the data set worse while doing so.
What is happening is that the 4 horizontal lines have enough fidelity
to “peel off” limited sections of the problem and get them exactly
right. By “peel off” what we mean is that this regression makes
almost no change to most of the data points in the problem, but 4
lines are enough so that the regression tree can make a large
improvement to a small section of the problem while leaving the rest
of the problem nearly unchanged.
The algorithm could not do that with only 2 horizontal lines since it
couldn’t target points in the middle of the graph with only two lines.
With two lines there was no way to make changes to the middle of
the graph without also affecting one or both edges. However, with
the more refined decision trees, this algorithm can continue to make
improvements.
Below is the same result, after a single step, using a learning rate of
.1. So this learning rate is only 10% of the previous learning rate.
What occurs here is that the decision tree splits the data into groups,
the same as always, the groups each calculate their average error,
same as always, but then instead of adding that full average error to
the previous value, we only add 10% of it.
Obviously, this curve fit isn’t as good as when we used the learning
rate of 1.0, but it wasn’t really a fair comparison. A high learning
rate will always converge faster, but the small learning rate might
give better results in the end. Here are the results after 10 steps
with a learning rate of .1
This is certainly an improvement, but still not great. Mathematically
this makes sense, however. A learning rate of .1 means that 90% of
the error remains after every step, assuming that the 10% correction
we apply at each step is exactly right. So after 10 steps, .9
raised to the 10th power is .35, so we still expect to have about 35% of the
error left.
After 30 steps with a learning rate of .1, the results are
This result starts to look pretty good. Everything, except the peaks
of the curves, is a close fit. Additional iterations improve the fit at
the peaks as well. It ends up taking about 100 iterations with a
learning rate of .1 to get a curve fit that has almost no discernable
error to the eye
The curve fit is roughly equivalent to the curve fit that we would
have after 45 steps with a learning rate of 1.0, shown below
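The mechanics described above, fit a tree to the current error and then add only a fraction of its output to the running prediction, can be written as a short manual loop. This is a simplified sketch of the idea, not the exact library implementation:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    x = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
    y = np.sin(x).ravel()

    learning_rate = 0.1
    prediction = np.zeros_like(y)        # start from a flat prediction (the average of a sine wave is about zero)

    for step in range(100):
        error = y - prediction           # the current error at every point
        tree = DecisionTreeRegressor(max_depth=2).fit(x, error)
        # Add only 10% of each group's average error to the running prediction
        prediction += learning_rate * tree.predict(x)

    print(np.mean((y - prediction) ** 2))   # small remaining mean squared error after ~100 steps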
Why A Smaller Learning Rate Can Be Better
So far we have seen how the learning rate parameter works, but we
haven’t yet seen why a smaller learning rate can be advantageous
over a learning rate of 1.0. The reason is that learning rate is a
parameter intended to protect against overfitting. Overfitting occurs
when sections of your training data are not representative of the full
data set, i.e. if there are some errors or outliers in the training data.
But if there are no errors in the training data, you can't overfit.
The sine wave plots that are shown above have no error in them. We
generated a sine wave, and then matched the regression model to it.
Since there are no errors in the data that we matched against, there
was no way to overfit the model, so a high learning rate was best.
However, almost all real world data has errors in it. Below is a
more complicated example. It has a signal that we are trying to
match, plus random noise in the data that will confuse the
machine learning algorithm.
Here we will try to create a regression against Z, which is the sum of
three different sine and cosine functions. There isn't anything
special about the equation; it was just generated to be more
complicated than a simple sine wave. Z, plotted as different colors
against X & Y, ends up being
The plot above shows the actual values of Z that we are trying to match.
However, in addition to those values, we have random error added to
the data that we are feeding into the regression algorithm. The
random error is 1.1 times a value drawn from a normal distribution centered at
zero. So the data we are using to train the model is not completely
representative of the true data as a whole.
This means that when we train our model and then predict with it
using X and Y values, there will be some error in the prediction vs
the true Z values. If we take that error, square it, and average it
across all data points, we get mean squared error (MSE). When we
plot MSE against number of boosting iterations for both a learning
rate of 1.0 and a learning rate of .1, we get a plot that looks like this
The learning rate of 1.0 gets pretty close really fast. However, in
later boosting iterations, it starts to get a little bit worse as it begins
to overfit the data. The smaller learning rate of .1 takes additional
boosting iterations to get good results. But after 1200 boosting
iterations, the error is lower than the error for any of the iterations
with the higher learning rate.
The clear conclusion is that a lower learning rate is better, and will
help mitigate overfitting, assuming that you can afford the additional
computation required.
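Here is a hedged sketch of that experiment. The noisy target below is a stand-in, not the book's exact Z function; the point is only to compare the two learning rates iteration by iteration using staged_predict:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 2 * np.pi, (500, 2))
    z_true = np.sin(X[:, 0]) + np.cos(X[:, 1])        # stand-in for the true signal
    z_noisy = z_true + 1.1 * rng.normal(size=500)     # 1.1 * normally distributed noise

    def mse_per_iteration(learning_rate, n_estimators):
        model = GradientBoostingRegressor(max_depth=2, n_estimators=n_estimators,
                                          learning_rate=learning_rate).fit(X, z_noisy)
        # staged_predict yields the prediction after each boosting iteration
        return [np.mean((z_true - pred) ** 2) for pred in model.staged_predict(X)]

    fast = mse_per_iteration(1.0, 200)    # typically drops quickly, then creeps up as it overfits the noise
    slow = mse_per_iteration(0.1, 1200)   # typically improves more slowly but reaches a lower error
    print(min(fast), fast[-1], min(slow), slow[-1])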
Learning Rate Should Be Between 0 and 1.0
One thing to be sure of is that your learning rate is always greater
than 0.0 and less than or equal to 1.0. A number greater than 1.0
will overshoot the correction it is trying to make, while a negative
learning rate will go in the wrong direction.
A Quick Correction Of A Likely
Misconception
At this point, we need to stop and review what could be an error in
understanding how the boosting works in practice. That
misunderstanding would have been primarily driven by some sloppy
plotting that I did in the sine wave graphs above. That sloppy
plotting was due to the fact that I used a relatively dense amount of
data, and frankly, I am not the world’s leading expert in the Python
plotting tool Matplotlib. Specifically, I showed plots on the sine
wave regression which were continuous. This is not what would
actually occur.
For instance, take the data points shown below. There are 20 data
points arranged between 0 and 2 Pi, not evenly spaced, but
reasonably arranged along the x-axis. The y value is the Sine of
each x point. Given those points, if you did a boosting regression of
them, and assumed that you had a sufficiently large number of steps
and large enough tree depth, and then tested the values of a much
more fine mesh of data, maybe with 500 points instead of 20, what
would the resulting plot look like?
How would you draw it on the graph below?
If you thought it would be a continuous sine wave, such as the one I
plotted below, then that is an error driven by the fact that all of my
plots in the graphs above used continuous lines.
What would really result is shown in this plot below. There would
be a series of steps that go halfway between each pair of adjacent
data points.
With a large enough number of iterations and a large enough tree
depth, the regression trees would eventually put splits between every
two adjacent points. Each split would be exactly halfway between
any two adjacent points. Effectively each and every point would
carve out a zone where it set the value in that zone. Each of those
zones would abruptly shift to the next zone at a point halfway
between one data point and the next.
The result would be a series of stair steps. The width of the stair
would depend on how much distance was between any two points,
and the y value of the stair would match whichever point was in it.
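You can reproduce the stair-step behavior directly with a sketch like the one below (again using scikit-learn as a stand-in for the code that generated the book's plots): train on 20 scattered sine points, then predict on a much finer mesh of 500 points:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(2)
    x_train = np.sort(rng.uniform(0, 2 * np.pi, 20)).reshape(-1, 1)   # 20 scattered points
    y_train = np.sin(x_train).ravel()

    model = GradientBoostingRegressor(max_depth=3, n_estimators=300,
                                      learning_rate=1.0).fit(x_train, y_train)

    x_fine = np.linspace(0, 2 * np.pi, 500).reshape(-1, 1)            # much finer mesh
    y_fine = model.predict(x_fine)

    # The prediction only changes value at split points, which sit halfway between
    # adjacent training points, so y_fine is a series of flat stair steps.
    print(len(np.unique(np.round(y_fine, 6))))   # roughly 20 distinct step values, not 500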
Note, there are a few parameters that could affect that result, which
we will get into near the end of the book. Specifically, there are
some tree parameters that would limit where splits got made, i.e. you
could set a parameter so that you never have a regression tree leaf
with only one point. Additionally, there is a parameter that would
allow you to not use all the data points for every regression tree (i.e.
use a different subset for different trees). That would increase the
number of possible steps because on different trees there would be
different points adjacent to each other (since some points would be
missing) and hence there could be different splits halfway between
those adjacent points.
However, the key takeaway you should get from this is that boosting
will carve out a series of discontinuous zones, each zone having its
own value. The next section shows what those zones would look
like in 2 dimensions.
Regression Tree Splits In Two Dimensions
Let’s look at how the data space gets broken up when we have
two features. For this example, assume the x and y values in
the plot represent the values of feature 1 and feature 2, and that
the actual value we are trying to find using the boosting
algorithm is not represented on the plot. However assume that
every data point has a unique value, as is common in
regression analysis.
If you used a very large number of boosting iterations, how
would this data space get broken up?
The steps outlined in the flow chart above are shown below in
more detail.
This is the starting estimate for what your values are, i.e.
just a naïve average. Call this value your "Current
Prediction".
From this step onward we will repeat for every boosting
iteration.
It may seem like we wasted a lot of effort with those Expit and Logit
functions. In one of the examples, we converted 0.6 to 0.4055 and
then back again without doing anything else. However, when we do
those two conversions for actual calculations, we will put an
additional step of addition/subtraction between them so it won’t just
be converting the same number back and forth.
This chart shows that no matter how large or small a number gets in the
infinity range, the corresponding number in the other range is still bounded
by zero and one.
For general knowledge of how boosting works, you don’t need to
memorize these functions. The important thing to know is that there
is a way to map the range from negative infinity to positive infinity
onto the range from zero to one. The numbers that we will be
adding to are the numbers in the range from negative infinity to
positive infinity. We will be using the infinity range when we
adjust the values of our predictions for each data point, and we will
be using the zero to one range to calculate the amount of error
remaining in that prediction.
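For reference, both functions are available directly in SciPy, and the 0.6 to 0.4055 example above is just the logit function (the natural log of the odds) and its inverse:

    from scipy.special import expit, logit

    print(logit(0.6))              # 0.4055 -> maps the 0-1 range onto the infinity range
    print(expit(0.4055))           # 0.6    -> maps the infinity range back onto the 0-1 range
    print(expit(-50), expit(50))   # always bounded between 0 and 1, no matter how extreme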
The equation is: find the maximum of x & y, turn that number into
an integer, and find the modulus of that number vs 2. If the result is a
zero, it is in a red zone; if the result is a 1, it is in a blue zone.
So the colors are the true function, however, you don’t know the true
function. You don’t have results for every X & Y, you only have
results for 100 data points, and you need to use those points to train
a model that will predict the value anywhere in the design space.
Here are the 100 points that we have, with red triangles in the red
zones, and blue squares in the blue zones.
Of those 100 data points, we might want to cross-validate whatever
results we get. Cross-validation is basically setting aside some data
that we know the answer for and not training on it, so that we can use
that data to score how good a fit the trained model made. With
cross-validation, you can split the data wherever you want, and train
on one portion and test on the other. For this example, I split it into
60 points for training and 40 points for cross-validation. (Which is
probably a higher percentage for cross validation than you would
want for most problems.) Here are the 60 data points to train on.
(note these plots were generated by
Classification_Gradient_Boost_Stripes.py located here
http://www.fairlynerdy.com/boosting-examples )
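If you would rather recreate similar data than download it, a minimal stand-in (my own sketch, not the contents of that script, and assuming the design space spans 0 to 6) is to label random points with the max-and-modulus rule and then hold some of them out for cross-validation:

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(3)
    points = rng.uniform(0, 6, (100, 2))                   # 100 random (x, y) points
    labels = (np.floor(np.maximum(points[:, 0], points[:, 1])) % 2).astype(int)   # 0 = red, 1 = blue

    # 60 points to train on, 40 points held out for cross-validation
    X_train, X_cv, y_train, y_cv = train_test_split(points, labels, test_size=40, random_state=0)
    print(X_train.shape, X_cv.shape)   # (60, 2) (40, 2)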
The data above is the full data set we will use to train the boosting
model. We will use the red triangles and blue squares and attempt to
replicate the results from the full design space, shown with the
shaded colors. We can, and will, put this training data directly into
the boosting algorithm, and it will work well as is. But for this first
example, in order to understand the boosting algorithm, there is one
simplification we will make. Instead of passing two variables into
the algorithm, X & Y, we will only pass a single variable: the
maximum of X & Y.
The reason this simplification will benefit our understanding is that
the boosting classifier uses a regression routine at its core. We have
already seen one regression algorithm. This one is slightly
different. But to show what it is doing, a 2D plot of a single value vs
result, similar to the chart we had for the sine wave example, is very
useful, while a 3D plot of X & Y vs the result would be hard to
understand.
So if you plot Max(X, Y) instead of X vs. Y this is the design space
Basically what we now have is a single line that changes colors as it
progresses along the one value.
Great, so we want to use this modified data for classification. How
does the algorithm work? We will work down along the boosting
steps shown in the flowchart.
Boosting iteration 1
Step 3 – Calculate Current Error
Now that we have the initial prediction, we can calculate the error in
that prediction. The points all have the same initial prediction, but
not the same true values, so different points can have different
current error.
We subtract the current prediction for each point from the true value
to get the current error at each point. Since the true value of every
point is either zero or one and the current prediction at every point is
0.6, the current error at each of those points is either -0.6, or positive
0.4. This is shown as a plot below.
The above chart shows the initial values coming into this first
boosting state as red diamonds, and shows the true values as black
triangles. The actual regression tree split the data into the three
groups based on the current error of black triangles minus red
diamonds. (Current error is not shown on this chart)
We are skipping over the actual process of the regression tree at this
point because all that really matters is the groups the points get
broken into. (Note, we’ll touch on how the regression was done in
a couple of pages just to tidy up the loose ends after hitting the main
points on how the boosting equations work).
Notice that there are 6 different regions of points in the data (based on
their true value) that we are trying to fit the regression to. With this
depth 2 decision tree, we only have the ability to generate up to 4
groups. That means there will inevitably be some areas that are not
an exact match. What we end up with from the regression tree is 3
groups of points. (We could have gotten 4 groups from this depth 2
regression tree, but ended up only getting 3.)
Zone 1: has 25 points. 17 are ones and 8 are zeros
Zone 2: has 16 points, all zeros
Zone 3: has 17 points, all ones
The amount of change for each group is a ratio of two sums: a
numerator divided by a denominator. The numerator is the sum of
the current error of all of the training points in that group. The
current error for any given data point can be positive or negative, so
the numerator, which is the sum of all those values, can be either
positive or negative.
The denominator also sums a value across all data points in the group.
The value that it sums for each point is the product of two differences:
(True Value minus Current Error) multiplied by (1 minus that same
quantity). Note that the True Value in this equation is always either 0
or 1. If the true value is 1 then the current error is always positive. If
the true value is 0 then the current error is always negative. If we
recognize that True Value minus Current Error is the same as the
Current Prediction, the value summed for each point becomes
CP * (1 - CP), where CP is the Current Prediction. Since the current
prediction is always a value between 0 and 1, the resulting product for
each data point is always positive, which means that the denominator
is always positive.
The amount of change is the numerator divided by the denominator.
Since the denominator is always positive, the sign of the amount of
change is determined by the sign of the numerator. This basically
means you can add up all the current error in a given group to
determine if the new current prediction will be higher or lower than
the current prediction for those points.
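Putting the prose above into a single formula (my reconstruction of the equations the text is describing, written for one group of points):

\[
\text{Amount of Change} = \frac{\sum_{i \in \text{group}} \left( \text{TrueValue}_i - \text{CurrentPrediction}_i \right)}{\sum_{i \in \text{group}} \text{CurrentPrediction}_i \left( 1 - \text{CurrentPrediction}_i \right)}
\]

Here the numerator is the sum of the current error in the group, and the denominator, as noted above, is always positive.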
Since this value is a positive number, the end result will be that the
points move up. We need to calculate the denominator of the
equation to finish determining how much the points move.
After adding the resulting amount of change to the starting value of
.4055, all the data points in this group have a value of 2.0721 in the
infinity range, which converts back to a current prediction of about
0.888 in the 0-1 range.
Keep in mind this is still the first boosting iteration, still on steps 5-
7. So we will be doing these steps in the flow chart again.
However, now we are doing those calculations on a different group
in the regression tree. In zone 2 there are 16 points that have a
current prediction of 0.6 and a true value of 0. That makes their
value for current error -0.6
Notice that zone 1 has the smallest change of all three zones. This is
because it was the only zone that was not purely of one category,
so some of the errors canceled each other out.
The new current prediction for all the points in zone 1 is .6767.
For some points, that is an improvement; for others, it is worse than
it was before this iteration.
We have now generated new current predictions for each of the
three regression groups, so we have completed boosting iteration
number 1. What is the end result after this boosting iteration?
Zone 3 moved up a lot
Zone 2 moved down a lot
Zone 1 moved up a little bit
Of course, these results just become the input to the second boosting
iteration.
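The zone calculations above can be checked with a few lines of arithmetic. The counts come from the zones listed earlier, and expit and logit are the range conversions from the previous section:

    from scipy.special import expit, logit

    start = logit(0.6)    # 0.4055, the initial prediction in the infinity range

    def amount_of_change(n_ones, n_zeros, current_prediction=0.6):
        numerator = n_ones * (1 - current_prediction) + n_zeros * (0 - current_prediction)
        denominator = (n_ones + n_zeros) * current_prediction * (1 - current_prediction)
        return numerator / denominator

    # Zone 1: 25 points, 17 ones and 8 zeros -> small positive change
    print(expit(start + amount_of_change(17, 8)))    # 0.6767
    # Zone 2: 16 points, all zeros -> large negative change
    print(expit(start + amount_of_change(0, 16)))    # about 0.11, moved down a lot
    # Zone 3: 17 points, all ones -> large positive change, 0.4055 + 1.667 = 2.0721
    print(expit(start + amount_of_change(17, 0)))    # about 0.888, moved up a lot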
Boosting Iteration # 2
With boosting you typically keep adding iterations until the results
stop improving or you run out of computing resources. For this
problem, there are obvious improvements left to be made, so let’s
look at the second boosting iteration.
Once again the chart below is set up to show the true values as black
triangles, and the current prediction coming into this boosting
iteration as red diamonds. The red diamonds in this chart match the
blue circles on the boosting iteration 1 chart we saw previously.
This chart shows the data broken into three zones. This was done by
the regression decision tree in iteration 2. However, these three
zones are not the same zones as boosting iteration 1 had. The reason
we get different groups is that we have different error going into
iteration 2 than we had for iteration 1.
Zone 1 purely consists of zeroes.
Zone 2 purely consists of ones.
Zone 3 contains a mix of the two types of data.
The amount of change calculations that are performed are the same
as for the first boosting iteration, and they are shown below in a
more compact table. This table does all the same calculations that
we walked through for each group in the first iteration. See below
the table for an explanation of how it is set up.
In the table:
There are 6 columns of data because there are 6 different
blocks of data in our data set, i.e. all the numbers between
0-1 have the same true value, as do the numbers between
1-2, etc.
The row labeled Step 3 is the current error at any given
data point.
Step 4, splitting the data into groups, isn't explicitly
labeled, but is shown with the thick black lines splitting up
the columns.
The block of rows labeled Step 5 calculates the amount of
change for each zone, using the same equations we saw
before. The rows labeled numerator or denominator are
the sum value for a given column (i.e. the number of
points in that column multiplied by the current error for
the numerator, or the number of points in the column times
(Current Prediction) * (1 - Current Prediction) for the
denominator).
The Total Numerator or Total Denominator rows add together
multiple columns within a given zone. The amount of
change is the ratio of those two values.
The row labeled Step 6 converts the current prediction into
the infinity range.
The row labeled Step 7 adds that to the amount of
change in that zone.
The row labeled Step 8 converts back to the 0-1 range.
In the first boosting iteration we had a big improvement in the data
on the right side of the chart, zones 2 and 3, and not much change to
the data on the left. For this iteration, we are getting a big
improvement in the left groups and not much change on the right.
We see that result in the chart below.
Boosting Iteration # 3
Boosting iteration number 3 takes the input from boosting step
number 2 (red diamonds) and continues to improve it. The math will
obviously be the same as the previous boosting steps, with different
values and groups. However, there are a couple of points worth
highlighting in the chart below.
If you want to test a new point based on the classifier that we fit, it is
as simple as finding which block it would fall into (i.e. look at the
second row of the table), taking the starting value of .4055, and adding
the amount of change from each of the iterations. That results in a final
value in the infinity range, which can be converted to the 0-1 range
for a final prediction.
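In code form, predicting a new point from the fitted model is just this (a sketch that assumes the per-iteration amounts of change for each block have been stored, as the table does; the example numbers are illustrative):

    from scipy.special import expit

    def predict_point(block_changes, start=0.4055):
        """block_changes: the amount of change for this point's block from each boosting iteration."""
        infinity_value = start + sum(block_changes)   # accumulate in the infinity range
        return expit(infinity_value)                  # convert to the 0-1 range at the end

    # Illustrative values: a block whose changes over three iterations were all positive
    print(predict_point([1.67, 0.9, 0.4]))            # close to 1.0 -> predicted category 1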
For the purposes of the book, it was easiest to present this data as a
table, since there are only 6 blocks of data. (Resulting from the fact
that our original plot only had 6 red and blue stripes). However, the
actual software will save the same data by saving the splits that
generated these results from each regression tree, as well as the
amount of change that was derived at each leaf node.
To the extent that there is a disconnect between the numbers in the
first row in the table above, which is how the data should have been
broken up, and the second row in the table above, which is how the
data was actually broken up based on the training data, we will have
some errors. Those errors are expanded upon in the following
pages.
Cross Validation
Initially, we set aside 40 points from our training data out of the
starting 100 points. We can use those 40 points to measure how
good the boosting result was. The metric that I am using as a
measure is mean squared error, which works OK for this 2 category
problem but isn’t very good for classification in general. MSE
means that if the true value for a point was 1.0, and we predicted
0.9, the error for that point would be 0.1, and the squared error
would be .01. Since it is Mean Square Error, it is the average of the
squared error across all points. The mean squared error values
shown below are for the cross-validation data.
What happened was some of the points that were on the boundary
between two zones were improperly classified. The cause of this is
based on where the decision trees picked their splits. They put a
split mid-way between the data that they had in the training set. For
instance, one true split should have been at Max(X, Y) equal to 3.0,
but instead, the split was at 3.167. This is because the algorithm did
not have data at the exact boundary. As a result, the boosting
algorithm placed its splits in slightly incorrect locations. So any
testing data that has values between 3.0 and 3.167 will get
incorrectly classified. The same is true, to some extent, at all of the
splits. The table below shows where the splits were placed vs
where they should have been placed.
As you can see, every block of data has some error between where it
should have been split, and where it actually was split. For most of
the blocks, that error is small, for instance splitting at .986 vs. 1.0.
However, some of the discrepancies are larger, which makes it more
likely that points will fall in that area and get misclassified.
Truthfully, it is unlikely that a human could do very much better
given the 60 data points that were fed into the boosting algorithm.
If a person were presented with the same data, on an unshaded and
unlabeled graph, and asked to draw lines splitting groups, they would
likely have some error as well.
Back To The 2D Example
In the previous example, we worked through a set of boosting
iterations with a single feature as input. That single feature was the
maximum of two other features in order to simplify the data as much
as possible. But recall that what we really were trying to solve was
the 2D example shown on this chart
The first split makes the right column of blue squares into its own
group. The second split, on the left side of the first split, makes the
top row of blue squares into its own group. Notice that there is no
second split on the right side because that is already only one
category.
After the splits, the boosting equations would work the same as we
saw in the 1D example. Each of the tree's groups would calculate the
amount of change to apply to that group using the same equation we
saw before, just with different groups of points.
One thing to know is that with the tree depth of 2, we won’t be able
to get really good groups for this 2D problem. 2 splits are not
enough to peel off the middle sections of the data. 2 splits would be
enough to peel the edges of the data, like we showed above or like
the bottom left corner, but not the middle.
To get the middle of the graph into its own category would require a
depth of 3 or more and look something like this
This would effectively split the red triangles between X values 4 & 5
into their own category. (Note, the horizontal split line number 3 on
the left side of split number 2 would probably be higher, between the
top blue group and the red group. I put it in a different spot to show
that it is a unique split from the split line number 3 to the right of
number 2)
Now let’s say that you don’t have enough fidelity in the regression
model to cut out a pure section from the middle. What would
happen?
The boosting could still get good results. For instance, by focusing
several boosting iterations on the outside edges of the data until the
error there is very low and does not affect the results very much, and
then moving inside slowly.
More realistically what would likely happen would be that the
regression tree would split off rows that were as pure as possible in
some iterations, and then in other iterations make columns that were
as pure as possible to correct any resulting error from the rows.
With a regression depth of 2, the chart below might be the best you
can do for generating groups based on columns in the center of the
data.
This isn’t ideal, and would still have some errors. However, those
errors would be different errors than if you did the same thing based
on rows like the chart below
By combining those results, and other results that were as good as
possible, over multiple iterations, the boosting based on a decision
tree depth of 2 could still get pretty good answers. However, just
like we saw with the sine wave regression at the very beginning of
the book when there was only 1 split in the regression tree, the
results with an inadequate tree depth would take more iterations
and might end up not being as good. (The sine wave chart from the first
example is shown again below to highlight the limitation of an inadequate
decision tree.)
The Predictions With 2 Features
Recall that there are 60 data points in the training data below since
we initially started with 100 points and kept 60 of them to do the
training on.
After we have fit the classifier, we can cross validate with the other
40 points. This is what you would do if you did not actually know
what the design space is. You would use the classifier to predict the
value of the 40 points you do know and compare those results
against the true values of the 40 points. You can then use a metric
such as mean squared error on the cross-validation to dial in the
parameters, such as tree depth or number of estimators.
However, we actually do know what the design space is since we
have plotted the red and blue stripes. When we use a classifier to
make a prediction over the entire design space (by making a
prediction every .02 spaces apart, and then using the 90,000
predictions to shade the graph), this is the result that we get.
This result was generated with a max tree depth of 2 and 10 boosting
iterations.
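A hedged sketch of how a result like that can be generated (using scikit-learn's GradientBoostingClassifier as a stand-in for the book's script, with stand-in training data): fit on the training points, then predict on a grid spaced every .02 across the 0 to 6 design space, which gives 300 x 300 = 90,000 predictions:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    rng = np.random.default_rng(3)
    X_train = rng.uniform(0, 6, (60, 2))                         # stand-in training points
    y_train = (np.floor(X_train.max(axis=1)) % 2).astype(int)    # 0/1 stripe labels

    model = GradientBoostingClassifier(max_depth=2, n_estimators=10).fit(X_train, y_train)

    # Predict every .02 across the 0-6 design space: 300 x 300 = 90,000 predictions
    xx, yy = np.meshgrid(np.arange(0, 6, 0.02), np.arange(0, 6, 0.02))
    grid = np.column_stack([xx.ravel(), yy.ravel()])
    shading = model.predict(grid).reshape(xx.shape)   # used to shade the design space plot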
This result is not the worst, but it is not the best. You can see some
of the two-dimensional stripes coming through; however, it clearly
doesn't exactly match the actual design space. There might be some
improvement to be had from changing the max depth to 3 and using 20
boosting iterations; however, those results turn out ambiguous as
well. That is shown below.
What we are really seeing is that the design space using only 60
points is not very well defined. In the initial plot of the 60 points in
the training data, there are wide areas that do not have any data in
them. Those areas are shown with black boxes in the plot below.
It is in those areas that we are getting results that are not matching
our actual design space. For this example, the easiest way to fix
that is to use more data. Instead of starting with 100 points and
keeping 60, we can start with 300 points and keep 180. This results
in a much more populated design space as shown below
When this data is used as the input, the end result is that the
prediction with 20 boosting iterations and a Max Depth of 3 ends up
being quite good.
The challenge for real world problems is that you usually can’t get
more data and have to make do with the data that you have. Finding
the troublesome areas in the fit model is usually done with cross-
validation. Improving those areas ends up depending on the skill of
the data scientist and the problem under consideration. For this
problem, one of the biggest improvements that could be made would
be to collapse the two features X and Y into one feature of MAX(X,
Y), which is the example that we showed first.
3 Or More Features
We simplified the example problem to have one feature to start
with. That was mainly just to make the plots and graphs
easier. Then we showed an example with two features and
saw that the only real difference was that the regression tree
had more complicated splits. There is fundamentally no
difference between 2 feature data or 3 feature data. If there
was 3D data, the regression trees would break the data into
groups based on all three features. After that, the process
would be exactly the same. In each group, the algorithm
would count how many of each category there were, how large
the error was for each data point, and apply the same equation
to adjust the values.
If we were to continue to increase the number of features the
only difference would be how the regression tree breaks up the
data. (And the time it would take to generate the trees would
increase linearly with the number of features.)
Boosting With 3 Or More Categories
Up until now, we have done boosting for classification with only
two categories, red and blue. Now we will look at how to do
classification with 3 or more categories. How can we change our
boosting process to work for 3 or more categories?
The obvious, but incorrect, solution might be to use additional
values in the regression part of the algorithm. I.e. instead of using
only 0 and 1 as the true values, use 0, 1, and 2. However, there are
multiple reasons that wouldn’t work, including our inability to
directly map the infinity range onto a category that goes up to 2.
Because of this, classification with 3 or more categories is somewhat
more complicated than it was with 2 categories. Our old tricks with
the numbering ranges don’t quite work. Where do you put the third
category? Or the fourth one?
Instead, we are going to do something a little bit different. What we
are going to do is boosting that is similar to what we did for two
categories, but we are going to do it multiple times for every
iteration. For each of those multiple times, we will re-categorize the
data into only two categories. One will have all the data from
exactly one of the original categories, and the other new category
will have the data from all the rest of the categories.
This means we will be using a lot more decision trees and taking a
lot more steps. If we have 10 boosting iterations trying to
distinguish 7 categories, we will end up using 70 trees. (side note –
the 2 category process that we showed earlier only used one tree
total per boosting iteration, not two. It was a special case)
What do you do here? You find the magnitude of the total distance
by taking the square root of the sum of the squares.
The process is
Square each component
Sum the squared values
Take the square root of the sum
If you wanted to find the difference between X and the total distance
you can subtract the two of them. You can use that difference
between X and the total distance as a metric for how sure you are
that X is the main direction you traveled in. If the difference
between X and the total distance is small, then X is the dominant
direction.
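In code, with hypothetical distance components, that three step process and the comparison against X look like this:

    import numpy as np

    distances = np.array([3.0, 4.0, 12.0])       # components traveled in X, Y, and Z
    total = np.sqrt(np.sum(distances ** 2))      # square, sum, square root -> 13.0
    print(total - distances[0])                  # difference between X and the total distance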
Analogy Summary
The process we will use for multi-category classification
is the logarithm of the sum of the exponentials
This is similar to using the square root of the sum of the
squares that we do in geometry.
For both of them, you do an operation on each component
(exponential or square), take the sum of the results, and
then do the inverse of the original operation (logarithm or
square root).
Both of those processes are a way of normalizing an
arbitrary number of orthogonal options. In geometry, you
might have three mutually exclusive directions, X, Y, and
Z that are all at right angles to each other. Here, for
classification, we might have three mutually exclusive
categories, A, B, C.
The value that we use from the table above is the natural logarithm
of the sum which is 9.127. We use it by taking the difference
between each current prediction and the natural log of the sum, i.e.
(1 – 9.127), (7 – 9.127) and (9 – 9.127)
Since the natural logarithm of the sum will always be greater than
each of the current predictions, all of the resulting values in the table
above are negative. The next step is to take the exponential of each
of those results (i.e. raise e to the power of -8.127, -2.127, and
-0.127)
This results in values that will always be between zero and one and
can be subtracted to find the current error for this data point.
(Conveniently, the sum of the values in the table above will always
be 1.0). The final step of subtracting the prediction above from the
true values to get the current error is shown below.
Notice that since the sum of the current prediction for any given
point in the 0-1 range is always 1.0, and the true value is always an
array that has all zeroes except for a single 1, the sum of the
current error for any given point will always be exactly 0.0.
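Those steps, with the same numbers used above, look like this in code (SciPy's logsumexp does the log-of-the-sum-of-the-exponentials in one call):

    import numpy as np
    from scipy.special import logsumexp

    current_prediction = np.array([1.0, 7.0, 9.0])     # infinity-range values for one point
    log_sum = logsumexp(current_prediction)            # 9.127
    normalized = np.exp(current_prediction - log_sum)  # about [0.0003, 0.1192, 0.8805], sums to 1.0

    true_value = np.array([0.0, 0.0, 1.0])             # this point's true category is the third one
    current_error = true_value - normalized            # third entry is about 0.1195
    print(log_sum, normalized.sum(), current_error)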
Although we found the current error for all three categories in the
array in the table above, the algorithm would typically only process
a single category based on where it was in its sub-loop. So if it was
focusing on the third category, the single piece of data used from the
table above would be that this data point had a current error of .1195.
This current error result gets used to fit the regression tree. Like we
have seen for all the regression trees, the regression tree will use the
current error that we just generated and attempt to group data points
with similar current error together by splitting based on the available
features.
Let’s keep the same data as we used for the 2 category classification
example, except instead of assigning them to category 0 and 1 using
the modulus 2 we will assign them into 0, 1, 2 using the modulus 3.
We then assign the colors as
0 = Red triangles
1 = Blue squares
2 = Green stars
That results in the data below
Once again we have 60 data points to do the boosting on. Since we
saw earlier that reducing the two dimensions to one dimension
worked well for the 2 category problem, we will do it again for this
data and get this chart
Now that we have the data we can start the boosting routine.
Multi-Category Boosting Process
The flow chart for how multi-category boosting works is shown
below
1. Make an initial prediction for every data point.
From this step onward we will repeat for every boosting iteration.
Think of this as an outer loop.
From this step onward we will also repeat one time for each
category. Think of this as an inner loop.
2. Determine what category we are focusing on based on where
we are in the inner loop. Assign every data point that is part of
this category the value of 1. Give every other data point the value
of 0. This is the “True Value”
3. “Normalize” the current prediction (which is just the initial
prediction the first time) using the log of the sum of the
exponential method. This converts the current prediction, which
is in the infinity range, into values in the 0-1 range. For each
point, the current prediction in the infinity range is an array that
has the same length as the number of categories. The current
prediction in the 0-1 range is also an array of the same length.
However, we only care about a single value in that array which
corresponds to the category we are analyzing from the inner loop.
Call that single value the “Normalized Current Prediction”
4. Subtract the “Normalized Current Prediction” from the “True
Value” to calculate the “Current Error” for every data point.(Do
this only for the index in the array corresponding to the current
category in the loop)
5. Use a Decision Tree regression analysis to fit a minimum
mean squared error tree to the “Current Error”. This is exactly
like we showed in the regression section. Just keep the groups
coming out of the regression tree.
6. For each group, generate an equation based on how many
points have positive error, and need to be moved up, and how
many have negative error, and need to be moved down. Account
for both the quantity of points and the magnitude of the error.
Additionally, account for the number of categories in the
boosting analysis. (this is different than the 2 category analysis)
This equation will generate either a positive or negative value in
the range from negative infinity to positive infinity. Call this the
“Amount of Change”. The amount of change is a single value in
the infinity range for each data point, not an array.
7. Add the “Amount of Change” to the “Current Prediction” in
the infinity range for each point. The current prediction is an
array for each point, only modify the value associated with the
current category being analyzed in the inner loop.
At this point, we have completed one cycle through the inner
loop. Continue going through the inner loop for each subsequent
category. Note that even though we only changed the values for
the current category in the infinity range, that change will affect
the calculation for converting the current prediction from the
infinity range to the 0-1 range for each point. I.e. the results from
each cycle of the inner loop will affect later cycles.
Once we have completed going through the inner loop for each
category, then we have completed a full boosting iteration.
8. We can either take the new “Current Prediction” and start
another boosting iteration at step 2, including going through the
inner loop again for each category, or we can be done. If we are
done, the model has now been fitted, and you can use it to predict
new data points.
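Pulling those steps together, here is a compact sketch of the whole procedure in Python. This is my own simplified rendering of the flow chart, not the book's code or the exact library implementation; the variable names and the use of scikit-learn's DecisionTreeRegressor for the regression step are my choices, but the outer loop over boosting iterations and the inner loop over categories follow the steps above:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_multicategory_boosting(X, y, n_categories, n_iterations=10, max_depth=2):
        n_points = len(y)
        # Step 1: initial prediction, one value per category per point (infinity range)
        priors = np.array([np.mean(y == k) for k in range(n_categories)])
        current = np.tile(priors, (n_points, 1))       # e.g. [.15, .45, .4] for every point

        trees = []
        for iteration in range(n_iterations):          # outer loop: boosting iterations
            for k in range(n_categories):              # inner loop: one pass per category
                # Step 2: true value is 1 for this category, 0 otherwise
                true_value = (y == k).astype(float)
                # Step 3: normalize with the log of the sum of the exponentials (softmax)
                normalized = np.exp(current[:, k] - np.log(np.sum(np.exp(current), axis=1)))
                # Step 4: current error for this category
                error = true_value - normalized
                # Step 5: a regression tree groups points with similar current error
                tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, error)
                leaf = tree.apply(X)                   # which group each point fell into
                # Step 6: amount of change per group, with the (K-1)/K multi-category factor
                change = np.zeros(n_points)
                for leaf_id in np.unique(leaf):
                    in_leaf = leaf == leaf_id
                    numerator = np.sum(error[in_leaf])
                    denominator = np.sum(normalized[in_leaf] * (1 - normalized[in_leaf]))
                    factor = (n_categories - 1) / n_categories
                    change[in_leaf] = factor * numerator / max(denominator, 1e-9)
                # Step 7: add the amount of change to this category's infinity-range prediction
                current[:, k] += change
                trees.append((k, tree))
        return current, trees

    # Usage sketch with stand-in striped data (labels 0, 1, 2 from int(max(x, y)) % 3)
    rng = np.random.default_rng(4)
    X = rng.uniform(0, 6, (60, 2))
    y = (np.floor(X.max(axis=1)) % 3).astype(int)
    raw, trees = fit_multicategory_boosting(X, y, n_categories=3)
    predicted = raw.argmax(axis=1)                     # highest infinity-range score wins
    print(np.mean(predicted == y))                     # training accuracy of the sketch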
A Worked Example of 3 Category Boosting
Let’s look at the multi-category boosting for the red-blue-green
example data that we showed above.
Step 1: Make Initial Prediction
The first step is to make our initial prediction. That initial prediction
is simply the normalized count of the number of points in each
category
Out of our 60 data points, we have 9 that are category 0 (red), 27
that are category 1 (blue), and 24 that are category 2 (green). If we
divide all of those by the 60 total data points, we get a starting value
of [.15, .45, .4] for every data point.
These are the initial predictions in the infinity range. One thing that
is different with this multi-category boosting is that we will not be
converting back and forth between the 0-1 range and the infinity
range. What we will be doing is saving the results in the infinity
range, and then converting to the 0-1 range as required, but not
necessarily saving the values in that range for later use.
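As a quick illustration of that starting arithmetic in code (the y array below just recreates the 9 / 27 / 24 category counts from above):

import numpy as np

y = np.array([0] * 9 + [1] * 27 + [2] * 24)    # 9 red, 27 blue, 24 green points
initial_prediction = np.bincount(y) / len(y)
print(initial_prediction)                      # prints [0.15 0.45 0.4]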
For multiple categories, the only change we will have is adding one
more term to account for the number of different categories we
have. The two terms that we called numerator and denominator will
stay exactly the same, but we will include one additional term to the
overall equation. The new total equation becomes

Amount of Change = (Number of Categories - 1) / (Number of Categories) × (Numerator / Denominator)

so with 3 categories, the old Numerator / Denominator result is simply multiplied by 2/3.
Once we have the numerator and denominator for any given block, it
is a simple matter to add them together based on what zone each
block got grouped into. The first zone is the sum of only the first
data block. The second zone is the sum of the 3 data blocks in the
middle, and the third zone is the sum of the final two data blocks.
These are shown as the “Total Numerator” and “Total Denominator”
rows in the table below.
To calculate the final amount of change for each zone, we simply
need to divide the numerator by the denominator and multiply by the
value of 2/3 based on the total number of categories
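To illustrate that arithmetic with made-up numbers (these are hypothetical values, not the ones from the table above), a zone with a total numerator of 1.2 and a total denominator of 0.9 would get:

n_categories = 3
numerator, denominator = 1.2, 0.9     # hypothetical zone totals
amount_of_change = ((n_categories - 1) / n_categories) * numerator / denominator
print(amount_of_change)               # roughly 0.889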
The final row in the chart above is the new prediction in the 0-1 range. However, we technically aren’t actually using that value at this step, so the computer would likely go on to the next category without calculating the new prediction in the 0-1 range.
For our purposes, it is useful so that we can plot it and see how
much change occurred in our predictions for this category. That is
shown below.
Iteration 1 - Category 2: Repeat the Steps
Above With The Second Category
Step 2: Assign True Values
At this point, we have done the first of three sub-loops of the first
boosting iteration. We have gone through the inner loop a single
time. We now need to repeat the same process, but operate on
category 2 this time. That will mean
27 points will have a true value of 1
33 points will have a true value of 0
In graphical form, we are on the middle subplot in the charts shown
below
We can see the resulting change in the 0-1 range in the chart below,
which shows the current true values and current prediction for
category 2.
You can see that even though we haven’t done a loop focusing on category 2 yet, there are different current predictions based on where the regression tree split the groups for the category 1 loop.
The three different zones each get their own amount of change
which gets added to the current prediction for category 2 in the
infinity range. After this step, the current prediction of the six
blocks of data is in the table below
Notice that we still haven’t modified the results for category 3 in the
infinity range.
The plot for category 2 current prediction after this sub-loop is
Once again, even though we haven’t yet edited the current prediction in the infinity range for any of these points (they all still have the starting prediction of 0.40), the current predictions for the other categories have changed. That means that different points now have different predictions in the 0-1 range, which is what we are looking at above.
That data is used to calculate the current error, which is used by the
regression tree to split the data. That is shown below for category 3.
Step 4 & Step 5 – Calculate The Current Error, And Use A
Regression Tree To Split Into Groups
However, the 2-3 block and the 3-4 block basically did not improve
at all. This is because the depth 2 decision trees never isolated
those blocks for their actual categories during the first iteration. If
we did this analysis for a few more iterations, what would happen is
that the edges of the data would start to have a really small error, and
the regression tree would focus on the data in the center.
But we are not going to go through another full iteration since it
would just be repeating the analysis we showed several times. If you
want to see those calculations, this free downloadable Excel file has
the boosting calculations for the 2nd and 3rd full iterations.
Predicting With Multi-Category Algorithm
Predicting with multiple categories is the same as with one
category. Each regression tree saved the amount of change
each leaf resulted in. That amount of change is applied to each
point being tested depending on what leaf they would end up
on in the regression trees.
At the end of all boosting iterations, the category with the
highest value is the one that is predicted.
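Here is a rough sketch of that prediction logic in Python. It assumes we saved, for each boosting iteration and each category, the fitted regression tree along with the amount of change for each of its leaves; the data structure and names are my own, not sklearn’s.

import numpy as np

def predict_categories(X_new, initial_prediction, saved_trees):
    # saved_trees maps (iteration, category) to a (tree, leaf_changes) pair,
    # where leaf_changes maps a leaf id to that leaf's amount of change

    # every new point starts at the initial prediction
    raw_scores = np.tile(initial_prediction, (len(X_new), 1))

    for (iteration, category), (tree, leaf_changes) in saved_trees.items():
        leaves = tree.apply(X_new)    # which leaf each new point lands in
        change = np.array([leaf_changes[leaf] for leaf in leaves])
        raw_scores[:, category] += change

    # the predicted category is the one with the highest value
    return raw_scores.argmax(axis=1)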
We have now completed going over how the default implementation of gradient boosted trees works. The next
section reviews some of the parameters that you can adjust to
determine what works best for your data.
Gradient Boosted Tree Parameters
Gradient Boosted Trees have a number of parameters that you
can tune. This section goes over those parameters. The
naming and syntax used in this section are based on the Python
sklearn names. However, in general, there are similar
parameters available in R.
There are 3 main categories of parameters that you can control
Parameters that control how the decision tree is
generated, i.e. how deep it goes, how it decides
which splits to make and which to keep, etc.
Parameters which control the boosting algorithm,
i.e. how many boosting steps to make, how fast to
incorporate the results
Miscellaneous parameters that don’t fit elsewhere
Boosting Parameters
learning_rate: This has been discussed in depth
earlier in the book. It controls how quickly changes
from the boosting results are rolled into the current
prediction. A large learning rate will get a close
result more quickly but is more prone to overfitting.
By default, the learning rate is .1
n_estimators: How many boosting iterations to
make. By default, there are 100. More boosting
iterations will tend to give better results. Eventually, however, they will overfit the data, so there is a sweet spot.
subsample: This controls if all the data is evaluated
each boosting iteration. By default, the value is 1.0,
but it could be a lower value like .8 if you want to
only evaluate 80% of the data each iteration. A
lower value is a control against overfitting. (note,
this is discussed in depth below)
loss: The loss function is what the boosting algorithm is attempting to optimize. For regression this defaults to least squares, which is essentially minimum mean squared error. Other options for the regression loss function are ‘least absolute deviation’, which is essentially minimum absolute error; ‘huber’, which is a combination of least squares and least absolute deviation; and ‘quantile’, which lets you predict a particular quantile of the data rather than the mean (useful for things like prediction intervals). For a gradient boosting classifier, the choices are logistic or exponential loss. Logistic is the default; exponential makes the algorithm more similar to AdaBoost, which is an older type of boosting that works by re-weighting data points instead of fitting each subsequent tree to the residual error. (Link with a brief explanation of the difference between AdaBoost and Gradient Boosted Trees: https://stats.stackexchange.com/questions/164233/intuitive-explanations-of-differences-between-gradient-boosting-trees-gbm-ad) An example of setting these boosting parameters in code is sketched below.
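As an example, here is how those boosting parameters could be set with sklearn. The data is just a random sine wave sample for illustration, and the loss parameter is left at its default since the accepted string names have changed between sklearn versions.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X).ravel()

model = GradientBoostingRegressor(
    learning_rate=0.1,   # how quickly each boosting result is rolled in
    n_estimators=100,    # number of boosting iterations
    subsample=0.8,       # evaluate 80% of the data each iteration
    random_state=0,
)
model.fit(X, y)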
Other Parameters
random_state : This is an integer that controls the
random seed of the tree. If you want results that you
can duplicate, you should use this value to control
the random state of the algorithm, or set the random seed earlier in your code, for instance with numpy in Python.
warm_start: This gives you the option to add more
boosting steps to an already created boosting fit.
You might use this if you fit the boosting algorithm,
and then have a step where you evaluate how good
the fit is before deciding if you want to invest more
time in fitting additional boosting iterations.
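A sketch of that warm_start pattern (X_train and y_train here are placeholders for whatever training data you are using):

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(n_estimators=100, warm_start=True, random_state=0)
model.fit(X_train, y_train)

# ... evaluate the fit, then decide it is worth more boosting iterations ...

model.n_estimators = 150       # keep the existing 100 trees and add 50 more
model.fit(X_train, y_train)    # only the additional trees are fitted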
Decision Tree Parameters – More Detail
There are 6 different parameters that you can control that drive the
final shape of the decision tree. Here is a diagram showing some of
the different parameters
If the algorithm found leaf nodes 3 and 4 to be the least useful, and you only wanted 3 leaf nodes, you would get
The method it uses to determine which of the nodes is the
most useful is the improvement in summed squared error from
that split. It will investigate every split, determine the change
in summed squared error before vs after the split, and roll back
the least beneficial splits. Sometimes it will roll back multiple
levels if a branch has many splits that end up being less useful
than the others.
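In sklearn, this kind of “keep only the most useful leaf nodes” behavior corresponds to the max_leaf_nodes parameter on the underlying trees. A small sketch, reusing the X and y from the earlier example:

from sklearn.tree import DecisionTreeRegressor

# keep only the 3 most useful leaf nodes, judged by the
# improvement in squared error that each split provides
tree = DecisionTreeRegressor(max_leaf_nodes=3)
tree.fit(X, y)
print(tree.get_n_leaves())    # 3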
How Max Features Parameter Works
The max features parameter is an interesting parameter. It
controls which features a given split looks at when the
regression tree is generated. By default, each split looks at all
of the features and picks the best split that could be selected
from any of the features. However, there are times when
looking at fewer features can give better final results.
The Random Forest machine learning algorithm makes use of
this as part of its randomness. Random Forests also heavily
use decision trees. By default, they only look at the square
root of the number of features at each split. So if you have
100 features in a Random Forest decision tree, any given split
will only evaluate 10 of them. However, since there are
typically a lot of splits in a decision tree, and there are a lot of
trees in either a random forest or a gradient boosted tree
algorithm, all of the features do get evaluated multiple times.
Looking at fewer than all the features at each split can have two benefits. The first is that it can help minimize overfitting.
Looking at a subset of features can help ensure that more
features are evaluated instead of relying on a few features. It
means that errors in any single feature will not influence the
model as heavily.
The second benefit of not using every feature at every split is avoiding local optima in the regression tree generation. Remember, regression trees do not produce the globally best result. We saw that when we created regression trees with 3 zones when 4 zones would have been better. Regression Trees
pick the best split at each stage. However, when you have
two or more splits in series the best combination of those splits
is not necessarily the same as the best first split, and then the
best second split. If you always pick a single feature or one of
a small set of features for the early splits, you might be
missing the opportunity for really good splits later on. Only
evaluating a subset of the features helps with that since it
ensures that the locally best features aren’t always available,
so other features are evaluated.
The options that you have for the number of features to
evaluate are
All of them
The square root of the number of all features
The base-2 logarithm of the number of all the features (the ‘log2’ option)
A floating point number representing a percentage of
all the features. i.e. 0.3 for 30% of the features.
Whether you should use all the features or should pick fewer
features is something that can be evaluated with cross-
validation. By default, the parameter uses all of the features
for boosting. Using all features was set to the default for
boosting because it has been found to work well for many
problems. Whether it works the best for your problem is
something to be evaluated.
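A sketch of that cross-validation comparison using sklearn’s cross_val_score (again on whatever X and y you are working with):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

for max_features in [None, 'sqrt', 'log2', 0.3]:
    model = GradientBoostingRegressor(max_features=max_features, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)
    print(max_features, scores.mean())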
How Sub-sampling Works
Sub-sampling is another parameter that you can tune for
gradient boosted trees. Sub-sampling simply means to not use
all the data at any individual boosting iteration.
The purpose of sub-sampling is to limit overfitting errors that
are in the training data. Most real life data has some amount
of errors or inconsistencies in it. Training the classifier on the
erroneous data points will lead to errors when testing.
Sub-sampling addresses this by leaving out a portion of the
data each time. Some of the data that is left out will be the
erroneous data points. Since most of the data points won’t be
errors, the iterations where the erroneous points are left out will tend to correct for them.
By default the sub-sampling parameter is set to 1.0, i.e.
use 100% of the data at each iteration. If you were to set it to
something lower, such as .7, it would only use a portion of the
data, in this case, 70%. However, it would choose a different
70% at every boosting iteration. So after a handful of boosts,
all the data will have been used.
The sub-sample is drawn without replacement, i.e. you can’t
get the same data point twice. This is different than how some
other machine learning algorithms, such as Random Forests,
work since they do use replacement.
A sub-sample parameter of lower than 1.0 is typically only
beneficial if the learning rate is also lower than 1.0. With a
learning rate of 1.0, the algorithm would make too large of
steps on incomplete data. However, with a smaller learning
rate, there is sufficient opportunity for all of the data to be
incorporated on multiple boosting iterations.
This paper by Jerome Friedman shows an example of sub-
sampling outperforming no sub-sampling for several different
data sets.
https://statweb.stanford.edu/~jhf/ftp/stobst.pdf (Plots on page
9). The exact value of sub-sampling to use varies with the
data, with sub-sampling on the order of .6 being the best
shown on the plots in that paper.
An additional advantage of using sub-sampling is that it
speeds up the generation of trees which improves the
performance of the whole process. Using a sub-sample of .5
will speed up tree creation by an approximate factor of 2.
Feature Engineering
The only thing that we have discussed so far is the actual machine
learning algorithms. This section will briefly discuss feature
engineering.
Feature engineering is the generation of features for the machine
learning algorithm to operate on. We actually already did some of
that early in the book, when we reduced this plot
You will recall that regression trees always split data in straight
lines. In two dimensions that would either be horizontal or vertical
lines. That worked in our favor for the graph with square stripes,
since the splits for the true data were all either horizontal or vertical.
In this graph with the circular plots, that will work against us.
With the square stripes, after we saw that generating a new feature of
the maximum of x and y worked well, we tripled the number of
points in the training set up to 180 and used 20 boosting iterations
with a maximum depth of 3 to get this result.
We considered that result to be pretty good but noticed that it took
more data and more steps than when we had the ideal feature,
Max(x,y). With this circular data, 180 points won’t be nearly
enough. 180 points plotted on the circular data is shown below.
To a human, even without the background shading, a pattern starts to emerge. But remember, the boosting algorithm does
not recognize the pattern, all it is doing is splitting on the available
features. It is not modifying those features in any way, nor is it
looking at the interactions of multiple features.
If we train the boosting model on the 180 points above using 20
iterations of a depth 3 decision tree, this is the result that we predict
To put it bluntly, that result is outright terrible. With this circular
data, even having triple the amount of data, and a relatively high
number of trees and tree depth, the results look nothing like the
circles of the true data. If instead, we had been able to deduce that
the data was driven by the total distance from the origin, a mere few
dozen data points and 10 iterations would have been enough to get a
pretty good result. That is the power of feature engineering if it is
done very well.
Even feature engineering that is just OK can still be powerful. Let’s
say that you did not recognize that the key feature was the distance
from the origin, and instead thought that the data was arrayed as
diagonal stripes. Bear in mind that you would just be looking at the
actual data points, not the colored background, so it would be a reasonable conclusion to reach.
If you thought that the data was based on diagonal stripes like the
ones shown below
Then a feature you could generate would be the value of x + y. That
would separate points based on which of those diagonal lines they
fell on. If you were to train on this data including the features of x,
y, and x+y, with 180 points and use 60 iterations with a tree depth of
4, the prediction that you would get is shown below.
Now, this isn’t an ideal prediction by any means. But it actually is
not that bad. The overall rounded, striped behavior is starting to
come through, and it looks much more like the true results than the
prediction when we used no feature engineering at all.
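As a sketch of what this kind of feature engineering looks like in code (assuming the raw features are the x and y coordinates stored as the two columns of a feature matrix; that column layout is my assumption):

import numpy as np

x_coord, y_coord = X[:, 0], X[:, 1]

# engineered features: distance from the origin (the ideal feature for the
# circular data) and x + y (the "diagonal stripes" guess)
distance = np.sqrt(x_coord ** 2 + y_coord ** 2)
diagonal = x_coord + y_coord

# train on the original coordinates plus the engineered features
X_engineered = np.column_stack([x_coord, y_coord, distance, diagonal])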
I did not spend a lot of time trying to determine exactly how much
data, and how many iterations would be required to get a good result
without any feature engineering at all. However, if we trained on the
1,800 data points shown below
Using 100 iterations with a maximum tree depth of 5, we start to get
results that are OK, but these results definitely still have error on the
boundaries. This is shown below
If we were to dig into the sine wave regression analysis again, we would see a similar effect. The sine wave data only carried 5 pieces of information. It had
The fact that it was a sine wave
The amplitude
The frequency
The phase
The offset
To do a reasonable job of matching those 5 pieces of key
information, the regression analysis needed between 45 and 100 iterations with a tree depth of 2 or more. If the sine wave had extended
over a larger range, for instance, 100 waves instead of a single wave,
we would have needed even more iterations.
All this is not to say that boosting is not a good machine learning
algorithm. It is, in fact, a very powerful algorithm that frequently
wins data analysis competitions. But it is important to know the
limitations. It is just as important to invest effort into improving the
data and the features that the algorithm will use as it is to adjust the
algorithm parameters or the number of iterations.
Feature Importances
The last topic in this book is on feature importances. One
interesting piece of data that you can extract from the boosting
algorithm is the “feature importances”, (the syntax is
feature_importances_ in Python). This reports the relative
significance of all the features that were used to generate the
regression trees that the boosting iterations were based on.
If you extract feature importances, what you end up with is a
normalized array that is the same length as the number of
features you have. The larger numbers correspond to features
that were more important to the analysis.
The actual calculation used for generating feature importances
is to calculate the improvement in mean squared error at each
stage of a regression tree, and then keep a sum of that
improvement for each feature, depending on which feature
was used to generate the split in the regression tree. That sum
of importance is then averaged over all of the regression trees
from all of the boosting iterations.
If you are looking for a more technical description of how
feature importances are calculated, with links to the raw
python and cython code, this Stack Exchange question has an excellent answer:
http://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting
However, for general knowledge the important things to know
are
The regression trees keep track of improvement in
mean squared error based on what feature the split
was on
The larger the improvement was, summed across all
of the trees, the more important that feature is
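A minimal sketch of pulling those importances out of a fitted model (the feature names here are placeholders for whatever features you trained on):

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=0)
model.fit(X, y)

feature_names = ['x', 'y', 'x_plus_y']    # placeholder names
for name, importance in zip(feature_names, model.feature_importances_):
    print(name, round(importance, 3))
# the importances are normalized, so they add up to 1.0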
So why would you care about feature importances?
As we have already seen, boosting is a useful algorithm, but it
is only as good as the data that you feed into it. If you can
manipulate that data to make it more useful, you can get
significantly better results. However, where should you focus
your attention? Looking at the feature importances will tell
you which features are already the most useful. Usually
focusing on the most useful features, and making them even
more useful will yield the largest improvement in your results.
Making them more useful might include things like
Finding ways to scrub errors and outliers out of
those features
Changing those features from a discrete range to a
continuous range. i.e. instead of classifying dog
breeds as “small”, “medium” and “large” actually
put in the breeds’ adult weight. Or vice versa
Finding ways to mix some of the most important
features. Should you take the ratio of 2 features or
the absolute value of those features? The product or
difference of those features? Perhaps they are
orthogonal distances and you need to take the square
root of the sum of the squares to get the magnitude
of distance?
If You Found Errors Or Omissions
If you found any errors or omissions in this book, please let us know. If you do, then let us know if you would like free copies of our future books. Also, a big thank you!
More Books
If you liked this book, you may be interested in checking out
some of my other books such as
Machine Learning With Random Forests And
Decision Trees – If you like the book you just read
on boosting, you will probably like this book on
Random Forests. Random Forests are another type
of Machine learning algorithm where you combine a
bunch of decision trees that were generated in
parallel, as opposed to in series like we did with
boosting.
Before you go, I’d like to say thank you for purchasing my
eBook. I know you have a lot of options online to learn this
kind of information. So a big thank you for downloading this
book and reading all the way to the end.
If you like this book, then I need your help. Please take a
moment to leave a review for this book on Amazon. It really
does make a difference and will help me continue to write
quality eBooks on Math, Statistics, and Computer Science.
P.S.
I would love to hear from you. It is easy for you to connect
with us on Facebook here
https://www.facebook.com/FairlyNerdy
or on our webpage here
http://www.FairlyNerdy.com
But it’s often better to have one-on-one conversations. So I
encourage you to reach out over email with any questions you
have or just to say hi!
Simply write here:
~ Scott Hartshorn