
Machine Learning With Boosting

A Beginner’s Guide

By Scott Hartshorn
What Is In This Book
The goal of this book is to provide you with a working
understanding of how the machine learning algorithm “Gradient
Boosted Trees” works. Gradient Boosted Trees, which is one of the
most commonly used types of the more general “Boosting”
algorithm, is a type of supervised machine learning. What that
means is that we will initially pass the algorithm a set of data with a
bunch of independent variables, plus the solution that we care
about. We will use the known solution and the known independent
variables to develop a method of using those variables to derive that
solution (or at least get as close as we can). Later on, after we train
the algorithm, we will use the method we derived to calculate
solutions for unknown results from different independent variables.
This is an example driven book, rather than a theory driven book.
That means we will be showing the actual algorithms within the
code that executes gradient boosted trees, instead of showing the
high level equations about which loss functions are being optimized.
The most common explanation for boosting is “Boosting is a
collection of weak learners combined to form a strong learner”. The
goal of this book is to provide a more tangible and intuitive
explanation than that. This book starts with some analogies that
provide a rough framework of how boosting works. And then goes
into a step by step explanation of gradient boosted trees. It turns out
that the actual boosting algorithms are a straightforward application
of algebra, except for the decision trees that are one part of the
process for most boosting algorithms. (The decision trees are
reasonably straightforward, but they are not algebra.)
The examples that will be shown will focus on two types of
problems. One is a regression analysis, where we are trying to
predict a real value given a set of data. One real life example of
regression is Zillow predicting a house’s value based on publicly
available data. The example regression analysis we will show isn’t
that complicated. We will try to predict the value of a sine wave,
shown below as the black dots.
And we will show how the single blue line, which results from a
decision tree, can be improved using boosting to match the sine
wave values more closely, as shown below.
Obviously this sine wave isn’t as complicated as Zillow’s house
price prediction, but it turns out that once we understand how the
boosting algorithm works, it is simple to increase the complexity
with more data or more layers of boosting.
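As a rough stand-in for that sine wave example, here is a sketch using scikit-learn. This is an assumption on my part (the book does not say here which library its examples use), but it shows the same idea: a boosted collection of small trees fitting the sine wave.

```python
# Sketch (assuming scikit-learn and numpy): fit gradient boosted
# trees to points sampled from a sine wave, then measure the fit.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

x = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y = np.sin(x).ravel()  # the "black dots" to be predicted

# A single shallow tree would give a coarse step-function fit (the
# "single blue line"); many boosted trees refine it step by step.
model = GradientBoostingRegressor(n_estimators=100, max_depth=2,
                                  learning_rate=0.1)
model.fit(x, y)
pred = model.predict(x)
print("largest absolute error:", np.abs(pred - y).max())
```

Increasing `n_estimators` adds more layers of boosting, which is exactly the knob the paragraph above refers to.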
The other example we will show is a categorization problem. With
categorization we are trying to predict discrete results. For example,
instead of Zillow predicting a house’s value, it could be an
investment company trying to determine whether they should invest
in an asset, yes or no. In that example we will show how we can
take the categorical data shown below as either red triangles or blue
squares,
and make predictions about the values of the entire design space,
shown below.
Once we understand how to group two different categories using
boosting, we will extend that to how to work with any number of
categories.
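A sketch of that classification example, again assuming scikit-learn (the data here is synthetic, made up to stand in for the red triangles and blue squares):

```python
# Sketch: classify two groups of 2-D points, then predict a label for
# every location in the design space via a grid of points.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
red = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))   # "triangles"
blue = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))  # "squares"
X = np.vstack([red, blue])
y = np.array([0] * 50 + [1] * 50)  # 0 = red, 1 = blue

clf = GradientBoostingClassifier(n_estimators=50, max_depth=2)
clf.fit(X, y)

# Predict over a grid covering the whole design space
xx, yy = np.meshgrid(np.linspace(-2, 4, 50), np.linspace(-2, 4, 50))
grid = np.column_stack([xx.ravel(), yy.ravel()])
labels = clf.predict(grid)
print("training accuracy:", clf.score(X, y))
```

Plotting `labels` over the grid would reproduce the kind of filled-in design-space picture described above.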

Get The Data And Examples Used In This Book


The algebra of the boosting algorithms can be duplicated in Excel,
and I’ve included an Excel file that does just that in the free
downloadable bonus material here:
http://www.fairlynerdy.com/boosting-examples
That bonus material also includes all of the Python code used to
generate the examples shown.
If you want to help us produce more material like this, then please
leave a positive review for this book on Amazon. It really does make
a difference!
If you spot any errors in this book, think of topics that we should
include, or have any suggestions for future books, then I would love
to hear from you. Please email me at
~ Scott Hartshorn
Your Free Gift
As a way of saying thank you for your purchase, I’m
offering this free cheat sheet on Decision Trees that’s
exclusive to my readers.
Decision trees form the heart of Gradient Boosted Trees, so it
is important to understand how they work. They are also a
useful machine learning technique in their own right. This is
a 2 page PDF document that I encourage you to print, save,
and share. You can download it by going here

http://www.fairlynerdy.com/decision-trees-cheat-sheet/
Table of Contents
Introduction
A Quick Example Of Boosting
Why Do People Care About Boosting?
An Analogy of How Boosting Works
How Boosting Really Works – Skipping Over The Details

Gradient Boosted Trees For Regression


Regression vs. Classification
Regression Decision Trees
Gradient Boosted Trees Regression
Regression Boosting Example
Learning Rate: How Fast Boosting Incorporates Improvements

Gradient Boosted Trees For Classification


Classification
Numbering Ranges – 0 to 1 & Negative Infinity to Positive Infinity
Classification Example
Predicting With A Model
Boosting With 3 Or More Categories

Model Tuning And Improvement


Gradient Boosted Tree Parameters
Feature Engineering

Final Thoughts
If You Found Errors Or Omissions
More Books
Thank You
A Quick Example Of Boosting
There are several different boosting algorithms. This book
focuses on one of them, gradient boosted trees. The exact math
differs between the different boosting algorithms, but they are all
characterized by two key features:
1. Multiple iterations
2. Each subsequent iteration focuses on the parts of the
problem that previous iterations got wrong.
A real life example of a boosting algorithm in progress might
be a high school band teacher teaching a class of 20 students.
This specific band teacher wants to make the average quality
of his class as good as possible. So what does he do?
On day 1 he knows nothing about the quality of the musicians
that he is teaching, so he simply teaches a standard class.
After that, he knows exactly how well each student is doing.
So he then tailors his instruction to focus on whichever
students he can help the most that day. Typically that will be
the worst students in the class. After all, his goal isn’t to
make his best students perfect, he is trying to bring up the
average. It is usually easier to turn Bad into Acceptable than
it is to turn Good into Exceptional. So this music teacher will
tend to ignore the advanced students and spend more time with
the worst students.
How is this an example of boosting put into practice? Simple:
the iterations take place each subsequent day at each new
class. And the requirement that the algorithm focus on
fixing the errors from previous iterations is satisfied, since the
teacher is finding the students with the largest amount of error
and working to improve them.
Why Do People Care About Boosting?
Machine learning is a field with an increasing number of
applications. Large companies like Google and Amazon are
using it for their personal assistant products (i.e. Alexa), and it
will likely revolutionize a number of different fields, such as
when autonomous cars become available. That is machine
learning on a grand scale, done with large dedicated teams, and
it is more advanced than we will cover in this book. On an
individual’s scale, machine learning has a number of
interesting applications as well. One of the easiest places to
see them applied is on the machine learning competition site
Kaggle.
On Kaggle, individuals or small teams compete to take
different sets of data and extract the most information possible
out of them using whatever techniques they choose. The
winners frequently receive cash, as well as bragging rights.
But the important point here is not the competitors, but the
companies who are generating the data sets. They have real
problems where a better analysis of data can open up new
business opportunities, and they are willing to give away cash
prizes (typically in the tens of thousands of dollars, sometimes
more) to get better answers.
It is difficult to know which machine learning algorithm will
work best for any given problem. However, in recent problems
people have found that boosting (especially XGBoost, which
uses gradient boosted trees along with other improvements)
has done very well. Here are some competitions that have
been won using boosting, or boosting in conjunction with
other techniques:
Liberty Mutual Property Inspection – Use some
property information to predict the hazards in a
home, for insurance purposes.
Caterpillar Tube Pricing – Attempt to predict how
much a supplier will charge for different orders of
metal tubing.
Avito Duplicate Ads Detection – Identify duplicate
ads from an online marketplace so that they can be
removed.
Facebook Robot Detection – An online auction site
(likely a penny auction site) has been flooded with
robot bidders which is causing the real customers to
leave the site. Can you identify the robots?
Otto Product Classification – Use a set of provided
features to figure out what category different
products should be grouped into.
The type of boosting shown in this book, Gradient Boosted
Trees, uses multiple decision trees (which are a series of if-
then questions that split the data into two different branches,
shown in detail later) sequentially in order to improve on one
of the main limitations that decision trees have, which is
overfitting.
An Analogy of How Boosting Works
Boosting, when applied to computer science, has a more
formal definition than the music teacher analogy listed above.
One of the most common descriptions of boosted learning is
that a group of “weak learners” can be combined to form a
“strong learner”.
Applying that description to the music teacher analogy would
be that a group of daily instructions (the weak learners) can be
combined to produce a good quality course (the strong learner).
The “weak learners combined into a strong learner”
description is accurate, and it does give some information.
However, it leaves out some important points, such as:
What is a weak learner?
How are they combined?
Are some weak learners better than others?
The following analogies are intended to build your intuition on
those points above before getting into the actual math of how
boosting algorithms work.

What Is A Good Weak Learner?


A weak learner is any machine learning algorithm that gives
better accuracy than simply guessing. For instance, if you are
trying to classify animals at a zoo, you might have an
algorithm that can correctly identify zebras most of the time,
but it simply guesses for any other animal. That algorithm
would be a weak learner because it is better than guessing.
If you had an algorithm that identified every animal as a zebra,
then that probably is not better than guessing and so it would
not be a weak learner.
For boosting problems, the best kinds of weak learners are
ones that are very accurate, even if it is only over a limited
scope of the problem. For instance, the algorithm that
correctly identifies zebras would be good. It allows you to
confidently identify at least most of the zebras, allowing other
weak learners to focus on the remaining animals.
A weak learner that would not be as useful is one that simply
counted the number of legs an animal has. If the animal has
four legs it could be a zebra, horse, alligator, or panda. If it
has zero legs, it could be a fish, snake, or a worm. That kind
of identification helps some, but it doesn’t really narrow the
scope of the problem that much for future learners.

How Are Weak Learners Combined?


Boosting algorithms typically work by solving subsections of
the problem and peeling them away, so that future boosting
iterations can solve the remaining sections.
Here is another analogy. Imagine you are hiring people to
build your house, and you have 10 different big jobs that need
to be done. A great way of doing it would be to get someone
who is really good at foundations to build the foundation.
Then hire a very good carpenter to focus on the framing. Then
hire a great roofer and plumber to focus on their sections of
the house. At each stage, a small subsection of the project is
getting completely solved.
The roofer may be completely useless at laying foundations,
but as long as you use him at the right time you will be in good
shape.
The contrast to that method would be to hire 10 people who
are all decent at most things, but not great at anything. None
of them can build you a good foundation, and if you start with
one, the next one will have to come in and fix some problems
with it, while at the same time doing a shoddy job framing.
The third person will have to come in and make corrections to
the errors that the first two left behind. You might get a good
product at the end, but more likely you will have adequate
results that still have errors in them.
The takeaway is that weak learners are best combined in a way
that allows each one to solve a limited section of the
problem. Any machine learning routine can be used as a
weak learner. Neural nets, support vector machines or any
other would work, but the most commonly used weak learner
is the decision tree.
How Boosting Really Works – Skipping Over The Details
With boosting, and with many other types of machine learning, there
are two stages to using the algorithm. The first stage is to train the
algorithm on data that you know the answer to. You train the model
using data that has some “features” that you think will be useful in
getting your desired results, as well as a final answer that you also
know. This is known as “fitting” the model or the classifier using
training data. By fitting the boosting model, it learns how to use the
features of the data in order to create groups of data points with
similar final answers. It also learns how to adjust the results within
each group in order to get the known final answer.
The second stage is to use that fitted model on a set of data that has
the same features as the training data, except that you don’t know
the final answer. The machine learning algorithm will then operate
on the unknown data the same way that it learned to based on the
known data, and draw conclusions from that. The theory is that if it
groups the second set of data using the same method as it did for the
first set of data, and then adjusts the values in the second set of
groups using the method developed by looking at the final answers
in the first set of groups, the algorithm can determine good
estimated values for the second set of data.
The first stage, fitting the model, is somewhat more complicated
than the second stage of predicting with a fit model. At a high level,
the process we use to fit the boosting model is to start with the
training data and make an initial estimate of a value for all the data
points. Then we calculate how much error was in that initial
estimate. (This is only possible because we know the actual final
values of the training data.) Next, we attempt to group data points
together, using their features to generate the groups, with the
objective of making groups that have a similar amount of error
within each group. Then, for each group, we calculate a single value
and adjust all the data points in the group by that value. This creates
a new value for every data point. That completes a boosting
iteration. We could then be done with the training of the boosting
algorithm, or we could use the new value to calculate new errors and
go through the whole cycle repeatedly until the results stop
improving.
In the simplest terms, the cycle looks like this
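That cycle (make an initial estimate, calculate the error, group points with similar error, adjust each group, repeat) can be sketched in a few lines. This is a simplified sketch assuming numpy and scikit-learn, not the full algorithm; it omits details such as the learning rate that are covered later.

```python
# Simplified sketch of one full boosting training loop.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(x).ravel()                  # known answers for the training data

current = np.full_like(y, y.mean())    # initial estimate for every point
for _ in range(50):                    # boosting iterations
    error = y - current                # error in the current estimate
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(x, error)                 # group points with similar error
    current += tree.predict(x)         # adjust each group by its value

print("largest remaining error:", np.abs(y - current).max())
```

Each pass fits a small tree to the remaining error and adds that tree's group values to the current estimates, which is exactly the estimate / error / group / adjust loop described above.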
Splitting The Groups
The most computationally expensive part of gradient boosting is
determining the best way to group the data. The process used to
group the data is regression decision trees.
The images below show an example of how the decision tree splits
the data into groups. What happens is that data starts at the head
node (the top) and gets grouped according to how much error
remains in each data point, based on the features that can be used to
split the data up. The resulting groups are shown below as 1, 2, 3,
and 4. (Real problems could have more or fewer groups.)

Each split occurs on a single feature at a specific value of that
feature. Subsequent splits could use different features or different
values of the same feature. At any given split, a data point will take
one branch or the other based on the value it has for the feature
being operated on. As a result, the splits result in multiple groups.
Each of those groups will calculate a value that will get added to or
subtracted from all of the data points that fall within that group.
That is shown in the image below.
This value is important. It is, in fact, the whole point of making
these groups. That value is a real number that will get added to (or
subtracted from) the current value of every member of the group,
which updates the current value to a new value.
The change value will be calculated for each group in order to give
the most improvement possible to the members of that group, but it
will frequently be the case that some data points improve while
others get worse on any given boosting cycle.
All the data points start with the same initial estimate value, but after
the first boosting cycle they will have different values depending on
the amount of change imposed by the group they are in. Since
different points have different estimated values, they will have
different error relative to their true values. As a result, in subsequent
boosting cycles any given data point will not necessarily end up
grouped together with the same points as it was in previous cycles.
This means that, given enough boosting cycles, each data point in
the training set can be improved to very closely match its true value.
Multiple sequential boosting cycles might look something like this,
where every data point in the training set follows its own path
through each tree into the groups and changes its value depending
on which group it lands in. Each tree will end up with groups using
different features, and the groups will have different amounts of
change applied to their values. Different trees could end up with
different numbers of groups.
The final results from a trained set of gradient boosted trees are the
splits that created the groups for each tree, and the amount of change
that gets applied to each group. Once the algorithm is trained and
you have those results, you can take a different set of data, determine
which groups each data point would fall into for each tree, and apply
the appropriate change for each data point in the new set of data.
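As a toy illustration of that prediction stage, suppose each trained tree were only one split deep, so each tree stores just its split and the change value for each of its two groups. All the numbers below are made up for illustration, not taken from a real trained model:

```python
# Toy sketch: predicting with an already-trained set of boosted trees.
initial_estimate = 6.0
trees = [
    # (feature index, split value, change if below, change if above)
    (0, 8.5, -4.0, +4.0),
    (0, 4.5, -1.0, +0.75),
]

def predict(point):
    """Start from the initial estimate, then apply each tree's group change."""
    value = initial_estimate
    for feature, split, below, above in trees:
        value += below if point[feature] < split else above
    return value

print(predict([3.0]))   # below both splits: 6.0 - 4.0 - 1.0 = 1.0
print(predict([10.0]))  # above both splits: 6.0 + 4.0 + 0.75 = 10.75
```

New data points simply fall into a group in every tree and accumulate that group's change, which is all a fitted gradient boosted trees model needs to do at prediction time.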
Regression vs. Classification
There are two types of problems that gradient boosted trees are
typically used for, regression problems and classification
problems. The way that the gradient boosted trees algorithm
handles those two types of problems is different for each, but
similar in many aspects.
Regression is the type of analysis you would use when your
final result is a range of numbers. Typically they would be
real numbers in a continuous range. For instance, if you
wanted to make an algorithm to estimate the value of houses in
a certain area, you would likely use a regression algorithm
because the end result is a continuous range of possible
housing values.
Classification is the type of analysis you would use if you
want to break your data into categories. If you wanted to
develop an algorithm to take certain attributes of a car and
determine what kind of car it is (make and model), that is a
classification problem because you have discrete categories in
the final result.
Some examples of previous regression problems on the data
competition site Kaggle are:
Zillow Home Price – In this competition you are
asked to determine how much error there is in the
price estimate Zillow makes in its house values.
(This competition is ongoing as of summer 2017)
Allstate Claim Severity – In this competition people
were asked to estimate the amount a claim would be
for different real life insurance events
https://www.kaggle.com/c/allstate-claims-severity/kernels
Winners Interview
Winton Stock Market Returns – Can you predict the
stock market returns of different stocks over the next
5 days?
Winners Interview
And some examples of classification problems on that same
site are:
Bosch Production Quality – A series of quality
control measurements have been made on various
components as they move through production. Can
you identify if that component will pass or fail?
Winners Interview
Expedia Hotel Recommendations – Hotels are
segmented into different types of hotels (quality,
location, etc.). Can you predict which type of hotel a
customer will book?
Identify Cancer Variations – Can you identify which
genetic variations are contributing to cancer and
which are not? (This competition is ongoing as of
summer 2017)
The algorithms for gradient boosted trees are similar for
regression vs classification problems. They both use the same
method for splitting the data into different groups. That
method is regression decision trees. Yes, even the
classification boosting problems use regression decision trees,
not classification decision trees.
Both the regression and classification boosting algorithms take
their grouped data and calculate a value that gets added to or
subtracted from the data points in that group. The method for
calculating the amount of change to apply is different for
regression boosting vs classification boosting.
Using a regression decision tree in a regression boosting
analysis is pretty straightforward. Since the final answer is a
range of real numbers, it is simple to split the training data
based on the remaining error at each step. Additionally, it
makes sense to add or subtract the values that come out of the
regression tree to a range of numbers and get a new value. As
you will see when we go into the regression section in more
detail, the algorithm for boosting with regression is almost
trivially simple.
What is not as obvious is how you use a regression decision
tree to solve a classification problem. For instance, if your
classification problem is trying to split items into three
different colors, red, blue, and yellow, how do you split those
items up based on a numeric value? And what good does it do
you to add or subtract a real value from a color?
The algorithm for classification boosting is more complicated
than it is for regression boosting, simply because we need to
change the categories into real values and then back again. If
you have two categories that you are trying to group the items
into, for instance yes/no, true/false, or hot/cold, the way the
algorithm works is to assign one of the categories a value of
1.0, the other one a value of 0.0, and then keep track of where
any given data point falls within that range.
When the values are in a real range between 0 and 1, then
regression decision trees are applicable. It is clear how you
can add or subtract values from that real range, as well as how
you can calculate the error in your current prediction. If a data
point’s true value is 1.0, and you are predicting 0.6, then the
error is 0.4. (There are a couple more subtleties that we will
get into in more detail in the classification section.)
Classification gets more complicated with three or more
categories instead of two, because you can’t just assign each
category a value and keep track of a single range. For instance, if
you have three categories you can’t assign category A to be
0.0, category B to be 1.0, and category C to be 2.0. The
reason that this doesn’t work is that an item for which you are
not sure if it is category A or category C might end up as
the average between the two, and get a value of 1.0, resulting
in category B. For multiple categories we will need to split
the problem up to be a series of 2 category problems. So the
problem will become: ‘Is a given data point category A or is it
one of any of the other categories?’ Then ‘is it category B, or
is it one of any of the other categories’? Then ‘is it category
C, or one of any of the other categories’?
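That series of two-category questions can be sketched in plain Python. The labels below are hypothetical, just to show the one-vs-rest conversion:

```python
# Turning one 3-category problem into three two-category problems.
labels = ["A", "C", "B", "A", "C"]

# One binary target per category: 1.0 if the point is that category,
# 0.0 if it is any of the others.
binary_targets = {
    category: [1.0 if label == category else 0.0 for label in labels]
    for category in ["A", "B", "C"]
}

print(binary_targets["A"])  # [1.0, 0.0, 0.0, 1.0, 0.0]
print(binary_targets["B"])  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

Each of those three binary target lists can then be handed to the two-category boosting machinery described above, one run per category.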
At this point, we have a good overview of everything that we
are going to cover in the book, and it is time to start getting
into more detail. For boosting algorithms there are two things
to know in order to understand how they work:
1. How the data points are split into groups
2. How adjustments are made to the values in each of
the groups to feed into the next cycle
The way adjustments are made to the values differs between
classification and regression problems, and even within
classification for two categories vs. three or more categories.
However, the way the data is split into groups is the same for
all the different flavors of gradient boosted trees, so we will
cover that next. And the way it is done is with Decision Trees.
Regression Decision Trees
This section gives a moderate level of detail on how regression
decision trees work before we get into boosting for regression.
Below is a quick summary of the most important aspects of
decision trees, and then the following few pages expand on
that with some explanations and pictures. Since this book is
on boosting rather than decision trees specifically, this will be
a several page explanation of how decision trees work, rather
than 40-50 pages of detailed examples. If you find the rest of
this book useful and want more detail on how decision trees
work, you might be interested in my book “Machine Learning
with Random Forests and Decision Trees” that covers them in
more detail. (Although that book mainly focuses on
classification).
If you want to skip this section on decision trees, and go
straight to the first example of regression using boosting, click
this link Gradient Boosted Trees Regression
A decision tree is a series of if-then decisions on your data.
The “if statements” branch out, and following them will
collect the data into a set of groups based on how each data
point answered each question. The if-then is based on the
value of one, and only one, of the variables in the data and
splits the data into exactly two branches.
It is called a decision tree because the resulting shape is tree-like.
What we care about coming out of a decision tree is which
data points end up at the same leaf together, i.e. end up in the
same group.
The Important Details To Know About Decision Trees
There are regression decision trees and classification
decision trees. They are similar but use different functions
to split the data. The boosting algorithms in this book use
regression decision trees; even the classification boosting
algorithms use regression decision trees, not classification
trees.
By default, the regression decision trees attempt to
minimize summed squared error at every step.
Summed squared error is the value you get if, for every
data point, you subtract the predicted value resulting from
the regression from the true value, square that result, and
then add all those squared values together.
Each split of a set of data is the equivalent of deciding
how to best fit a series of points with two horizontal
lines. You can decide the value of both of the lines, and
where the lines start and stop, but you only get two lines
per split, they can’t overlap, and they must cover the entire
range of interest.
However, every single horizontal line can be split again.
Each one into two new horizontal lines. This continues
until you hit a criterion for stopping.
The stopping criteria could be a maximum number of
splits, maximum number of results to keep, minimum
number of points required to do a split, or simply the fact
that there are no more improvements to make.
All the bullets above relate to creating a decision tree i.e.
fitting a model. Once you have created the decision tree
you can use it to test new data points by following the
branches down, answering the questions, and seeing which
leaf the new data point ends up in.
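A minimal sketch of those two uses, fitting a tree and then dropping new points down its branches, assuming scikit-learn (the data is made up):

```python
# Fit a regression decision tree on known (x, y) points, then let new
# points follow the branches down to a leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x_train = np.array([[1], [2], [3], [4], [5], [6]])
y_train = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

# max_depth=1 is the stopping criterion here: exactly one split,
# i.e. two horizontal lines.
tree = DecisionTreeRegressor(max_depth=1)
tree.fit(x_train, y_train)

# Each new point ends up in a leaf and gets that leaf's average value
print(tree.predict([[2.5], [4.5]]))
```

Here the one split separates the three points with value 1.0 from the three with value 5.0, so a new point at x = 2.5 predicts 1.0 and one at x = 4.5 predicts 5.0.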

Regression Trees In More Detail


Boosting algorithms use a decision tree regressor as their base. The
way the decision tree regressor works is that each split is made in a
way that minimizes the resulting summed squared error of the two
sides.
The regression tree looks at every possible split to find the split that
gives the lowest summed squared error. (Note: this is the
computationally expensive part, since it involves sorting the data for
every feature. With N data points and M features this would be
O(M N lg N) per tree, although some software, such as xgboost,
saves time using techniques such as caching. See section 2 of the
paper http://learningsys.org/papers/LearningSys_2015_paper_32.pdf,
as well as the plots on page 5 of that paper, for information on how
xgboost implements time-saving methods of generating the trees.)

Regression Trees Optimize Sum Squared Error


One important thing to know is that a regression tree optimizes each
step to result in the minimum sum squared error. For instance, if
you have a tree of depth 2, you have 3 independent splits for each
tree. Each split is the same as placing the two horizontal lines which
minimize the summed squared error of the data set. Each split
attempts to minimize the summed squared error at that split. This
results in a local optimum but not necessarily a global optimum.
For instance, in the chart below, there is initially a summed squared
error of 262, calculated by subtracting the mean value of 6.0 from
each data point, squaring the result and summing all the squares.

In the chart above, the average value of all the data points is 6.0. If
you subtract the average value from each point, you get the error
shown in the chart. 3 points have an error of -5 (1-6), 2 points have
an error of -4, 3 have an error of -3, and 8 have an error of 4. The
squares of those errors are 25, 16, 9, and 16. And the sum of the
squares over all the points is 3*25 + 2*16 + 3*9 + 8*16 = 262,
which is the total sum squared error.
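That arithmetic can be checked directly in a few lines of Python:

```python
# 16 data points: three 1s, two 2s, three 3s, and eight 10s.
points = [1] * 3 + [2] * 2 + [3] * 3 + [10] * 8

mean = sum(points) / len(points)
sse = sum((p - mean) ** 2 for p in points)

print(mean)  # 6.0
print(sse)   # 262.0
```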
The regression tree will split the data into more groups and calculate
the total sum squared error using a different average value for each
group. (The fact that each group gets to use its own average value
instead of the global average is what reduces the resulting sum
squared error after a split). The objective of the regression tree is to
choose splits that minimize that sum squared error.
If we passed these points through a depth 1 regression decision tree,
using x as our feature with the objective of estimating y, the decision
tree would split the data at x = 8.5, creating two groups. This is
shown in the chart below.

After this first split, the summed squared error is much reduced.
That is because the points less than x = 8.5 use their average value of
2.0 to calculate their squared error, and the points greater than x =
8.5 use their average value of 10.0 to calculate their squared error.
The summed squared error for the left group of points is 6.0, and for
the right group of points is 0.0.
Therefore the total summed squared error is 6, which is the best you
can do with this data using only two lines. That was a good split:
going from 262 to 6 is a large reduction in the summed squared error
(SSE). The next split will make another improvement.

Here the left group split again at x = 4.5. The right group was
already perfect, so it didn’t split. This arrangement, with 3 groups of
points, has an SSE of 1.2, which is an improvement over the previous
step but isn’t perfect.
At depth 2 we theoretically could have gotten 4 horizontal lines
(since depth 2 trees can make 4 groups), which could have matched
this data perfectly. However, since the tree makes the best choice at
every split, and not necessarily the best choice overall, the
first split put the right side of the graph into its own group. In the
second round of splits, the left side split again, but the right side
could not, since it was already perfect. As a result, we end up with 3
groups. If we made this a depth 3 tree, we would get another split
and a perfect fit.
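A depth 1 split like the ones described can be found by brute force: try the midpoint between every pair of adjacent x values and keep the one whose two group means give the lowest total SSE. This is a from-scratch sketch, not the library implementation, and the x coordinates below are made up to reproduce the chart's split at 8.5.

```python
def sse(ys):
    """Sum squared error of a group about its own mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(points):
    """Try every midpoint between adjacent x values and return the
    (threshold, total SSE) pair that minimizes the summed error."""
    points = sorted(points)
    best = None
    for k in range(1, len(points)):
        t = (points[k - 1][0] + points[k][0]) / 2
        total = (sse([y for _, y in points[:k]]) +
                 sse([y for _, y in points[k:]]))
        if best is None or total < best[1]:
            best = (t, total)
    return best

# Hypothetical x positions for the 16 points from the SSE example:
# the low values sit at x = 1..8 and the eight 10s at x = 9..16.
ys = [1, 1, 1, 2, 2, 3, 3, 3] + [10] * 8
points = list(zip(range(1, 17), ys))
print(best_split(points))  # (8.5, 6.0)
```

The split lands at 8.5 with a remaining SSE of 6, matching the chart.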

Sum Squared Error vs. Mean Squared Error


In most of this book I discuss the regression trees optimizing sum
squared error (SSE). Some other references state that they optimize
mean squared error (MSE). For the purposes of how regression
trees choose their splits, the two are exactly the same: mean squared
error is just sum squared error divided by the number of data points,
and since the number of data points does not change, optimizing SSE
is the same as optimizing MSE.
I chose to primarily use SSE in this book because I found it easier to
show the sum rather than the average on some of the charts.
However, the fact that MSE is insensitive to the number of data
points makes it a better choice for some purposes, such as using it as
a metric for the results you get on cross validation data.

Depth Of The Decision Tree


One of the key parameters for the decision trees that we will be
using in this book is maximum depth. Maximum depth limits the
number of splits the decision tree can make by limiting the height of
the decision tree. The maximum total number of resulting leaf
nodes is 2 raised to the power of the maximum depth. For instance, a
tree with a maximum depth of 3 could have 2^3 = 8 leaf nodes. A
depth 3 tree is shown below.
Side note: 8 leaf nodes is the maximum we could get with a depth 3
tree, but a tree can have fewer branches than the max depth would
allow. This occurs when a branch stops improving and therefore
stops splitting. The result is a lopsided tree, which is completely
fine, and it occurs more often in classification than in regression.
An example of a tree that does not use the full depth is shown below.
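That early stopping can be demonstrated with a small from-scratch tree builder (a sketch with made-up data, not the library implementation). The data below has three flat levels; a depth 2 tree stops with 3 leaf nodes even though 4 are allowed, because one branch is matched perfectly by the first split.

```python
def sse(ys):
    """Sum squared error of a group about its own mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def count_leaves(points, depth):
    """Greedily split until max depth, but stop early once a branch's
    SSE is already zero -- that branch cannot improve any further."""
    ys = [y for _, y in points]
    if depth == 0 or sse(ys) == 0:
        return 1  # this branch becomes a leaf
    points = sorted(points)
    best = None
    for k in range(1, len(points)):
        cost = (sse([y for _, y in points[:k]]) +
                sse([y for _, y in points[k:]]))
        if best is None or cost < best[0]:
            best = (cost, k)
    k = best[1]
    return (count_leaves(points[:k], depth - 1) +
            count_leaves(points[k:], depth - 1))

# Three flat levels: the right-hand level is matched perfectly by the
# first split, so that branch stops and the tree ends up lopsided.
points = ([(x, 1) for x in range(1, 4)] + [(x, 3) for x in range(4, 7)]
          + [(x, 10) for x in range(9, 17)])
print(count_leaves(points, depth=2))  # 3
```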

Building Intuition About Decision Tree Splits


At each branch of the decision tree, the split occurs on one, and only
one, of the variables. This means that if your data has a relationship
that is dependent on the interaction of two or more variables, the
only way a decision tree can capture it is as a series of splits. It
cannot be done in one step.
As an example, imagine you are trying to use a regression decision
tree to determine how much of an object is floating above the water.
You pass the decision tree two variables, weight and volume, along
with the value that the regression is done on: what percentage of the
object is above the water (zero percent if it sinks).
Below is an example of what a chart of that data looks like. The red
circles show objects that float, and the size of the circle shows how
much of the object is floating. Black squares show objects that
sink.

A regression decision tree would split the data based on weight, then
split it based on volume, then split on weight again, switching back
and forth between weight and volume as it attempts to optimize the
regression at each step.
Just looking at whether the object floats or sinks, and ignoring how
much is actually floating, the final result of the decision tree would
be to break up the data as shown below, in a series of horizontal and
vertical lines.

This would have taken several branches and split the data into a
number of smaller segments that are not shown.
If we also tried to break up the top left section into how much of
each data point is above the water, we would get an even more
complicated set of horizontal and vertical lines. The key here is that
we are getting horizontal and vertical lines, not diagonal ones.
What we really need for this data is the ratio of mass to volume,
which is density. Such a split would look like the chart below.
This is how a human would split the data, breaking it into two
groups with a single line. However, a decision tree cannot capture
the relationship between multiple variables. It cannot take the ratio
and make the split based on that value. If you were to do that
calculation outside of the decision tree and pass it density, it could
solve the regression analysis quickly and competently. But since the
decision tree only splits on one variable at a time, and only at one
location on that variable, if it only has weight and volume, the
decision tree will have to do a brute force solution instead of an
elegant one. Creating better variables for the machine learning
algorithm to operate on is called feature engineering.
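Here is a sketch of that idea with made-up numbers. No single split on weight or volume alone separates the floaters from the sinkers, but after engineering the density feature, one split at 1.0 does.

```python
# Hypothetical objects: (weight in kg, volume in liters).
# With water at 1 kg per liter, an object floats when weight / volume < 1.
objects = [(2, 4), (3, 5), (6, 8), (9, 10),    # floaters
           (5, 4), (7, 5), (9, 8), (12, 10)]   # sinkers
floats = [w / v < 1.0 for w, v in objects]

def single_split_separates(values, labels):
    """Can one threshold on this feature put all True on one side?"""
    return any(
        all((v <= t) == lab for v, lab in zip(values, labels)) or
        all((v <= t) == (not lab) for v, lab in zip(values, labels))
        for t in values
    )

weights = [w for w, _ in objects]
volumes = [v for _, v in objects]
density = [w / v for w, v in objects]   # the engineered feature

print(single_split_separates(weights, floats))  # False
print(single_split_separates(volumes, floats))  # False
print(single_split_separates(density, floats))  # True
```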

Which Variables Does The Decision Tree Look At When Doing The Splits?


There is a parameter in the decision trees which controls how many
variables are evaluated to find the best split location. By default, it
will evaluate every variable that it is passed and pick the best split
location out of all of those variables. However, you can change the
parameters so that each tree only looks at some of the variables each
time, similar to how decision trees work in Random Forests. The
different options that you have for parameters will be discussed in
the parameters section near the end of the book.
Recall the starting description of boosting, which was weak learners
combined to form a strong learner. Each individual decision tree is a
weak learner, and we have spent the past section seeing how they
work. The next section shows how they are combined in a
regression analysis to form a total result that is better than any single
decision tree.
Gradient Boosted Trees Regression
We will start by looking at how gradient boosted trees work
for regression analysis. Classification uses some of the
regression algorithms, so it makes sense to understand
regression first.
A regressor attempts to fit a numeric value to something. The
something might be fitting the population of a country to GPS
coordinates, or the stock price of a company to its monthly sales.
The end result of a fitted regression analysis is that you pass in
the known features and can predict the unknown output value.
Here is the process that boosting regression follows:
1. Predict an initial estimate of 0.0
2. Use the true values to calculate the error in the initial
prediction
3. Split the data into groups using the features of the
data, with the goal of putting data with similar error
into the same group.
4. For each group, find the average error
5. For every data point in that group, add the average
error to the current prediction
6. Calculate the new error for each point for the new
prediction
Then repeat the cycle, starting again at step 3, as many times as
desired.
As a flow chart, the process looks like this
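The six steps can be sketched end to end in a few lines. This is a from-scratch illustration, not the implementation in the book's downloadable code: the grouping step is a single depth 1 split, and the 100-point sampling of a sine-wave-plus-one target is an assumption made for the demonstration.

```python
import math

def fit_stump(xs, residuals):
    """Steps 3-4: split the data at the threshold whose two group
    averages minimize the sum squared error of the residuals."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(xs)):
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = (sum((r - lm) ** 2 for r in left) +
                sum((r - rm) ** 2 for r in right))
        if best is None or cost < best[0]:
            t = (xs[order[k - 1]] + xs[order[k]]) / 2
            best = (cost, t, lm, rm)
    _, t, lm, rm = best
    return t, lm, rm

xs = [i * 2 * math.pi / 99 for i in range(100)]
ys = [math.sin(x) + 1 for x in xs]              # true values to match

pred = [0.0] * len(xs)                          # step 1: initial estimate 0.0
for _ in range(10):                             # repeat the cycle
    resid = [y - p for y, p in zip(ys, pred)]   # steps 2/6: current error
    t, lm, rm = fit_stump(xs, resid)            # steps 3-4: group and average
    pred = [p + (lm if x <= t else rm)          # step 5: add the average
            for x, p in zip(xs, pred)]          # error to the prediction

mse = sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)
print(mse < 0.5)  # True: far better than the initial all-zero prediction
```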
Regression Boosting Example
This is most easily understood with an example. The plot below is a
sine wave, plus one.

These are the actual values that we will try to match using regression

Our objective is to train a gradient boosted trees regression model


using the data shown in the plot above so that we can pass the model
the x value and predict the y value.
Step 1 & 2 – Calculate Initial Error
The initial prediction that we make on this data is 0.0. When we
subtract that from the true values, we get an initial error that matches
the true values. In the next step, we will try to group the data points
based on that error.
(Note, it may seem like we didn’t do anything here, i.e. subtracting
0.0, but it is useful to think of it this way because for classification
our initial prediction won’t be 0.0. Furthermore, you could use an
initial prediction of something other than 0.0 for regression if you
chose. The mean value of the data would be an obvious choice)

Step 3 – Split The Data Into Groups


We will attempt to split this data into groups using a regression
decision tree with a maximum depth of 1. This means we can have 2
different groups, and each of those groups will make an estimate for
the data in its group. Essentially the question becomes: what is the
best approximation you can make of that sine wave using two
horizontal lines?
Using a decision tree, with a least squares fit, the result is this

The regression tree split the data at x = 3.14. It did that to minimize
the resulting sum squared error: the tree calculates the error at every
point, squares and sums those errors, and picks the split location
which minimizes that sum squared error.
In this case, clearly, the blue lines are not a very good fit for the
black sine wave dots. Trying to fit the sine wave with just two
horizontal lines in a single step isn’t practical. The error is the
difference between the sine wave curve and what we predicted, and
there is a lot of error after the first step.

Step 4 – For Each Group Find The Average Error


The blue lines are the average error of each group. Right now, at the
beginning of the first iteration, our current prediction for every point
is still 0.0. That means the error at each point is the black dots,
which are the true values. The average value of all the points left of
3.14 is 1.63, and the average value of all points to the right of 3.14 is
0.37.
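Those two averages can be checked numerically. This sketch samples the sine wave at 200 evenly spaced points, which is an assumption; the book does not state its exact sampling.

```python
import math

# Sample sin(x) + 1 at 200 evenly spaced points over one full period
xs = [i * 2 * math.pi / 199 for i in range(200)]
ys = [math.sin(x) + 1 for x in xs]

left = [y for x, y in zip(xs, ys) if x < 3.14]    # points left of the split
right = [y for x, y in zip(xs, ys) if x >= 3.14]  # points right of the split

print(round(sum(left) / len(left), 2))    # 1.63
print(round(sum(right) / len(right), 2))  # 0.37
```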

Step 5 – Adjust The Current Prediction


We use the average error that we found for each group in the
previous step to adjust our prediction for each group. Our current
prediction for each of the points is 0.0. When we add the average
value for each group to our current prediction, we get the blue lines
shown above as our new prediction.

Step 6 – Calculate The New Error


Now we need to use our new current prediction to calculate our new
error. That is equivalent to subtracting the blue line from the black
dots in the chart below.
When we do that we get the new current error shown below.
Obviously, after the first step, there is still a large amount of error.
But now we come to the key point of boosting. The key point is,
you don’t have to do it all in one step. You can make your best
estimate, then measure the error and make corrections. The answer
we got after the first step isn’t very good, but what if we build on
that answer?
Repeating The Cycle
One way to build on our answer is to keep the two horizontal lines
that we just got as a starting point and then modify them. The way
that we will modify them is by using information about where our
error is. We will do that by fitting a regression tree to the error and
adding the resulting predictions to the original lines.

Iteration 2 – Step 3 – Split The Data Into Groups


We pass the decision tree the error dots shown above. The resulting
split occurs at x = .476 and is shown below.

Iteration 2 Step 4 – Find The Average Error Of Each Group


The blue lines in the chart above represent the average values of
their groups. The left side of the split has a large negative average
error. The right side of the split only changes a little bit. This is true
even though there is still a lot of error on the right side. The right
side has both positive and negative error which mostly cancel each
other out.

Iteration 2 – Step 5 – Add The Average Error To The Current Prediction


We can then add the average error we just calculated for each group
to the previous fit from the first cycle and the result is more detailed
than the first prediction. This is shown below.

Instead of 2 horizontal lines fitting the sine wave, we have 3. If you
were to calculate mean squared error on this chart, it would be an
improvement on the first iteration. Obviously, however, there is
still error in this chart.

Iteration 2 – Step 6 – Calculate The Remaining Error


If we subtract the blue line from the black dots, the remaining error is shown below.
Repeat the Cycle Again
There is still significant error in the model, and it seems like we are
still improving the fit, so you have probably already guessed where
we are going. We will make a curve fit on the error from this step,
and use that result to improve our prediction. When we do that we
are adding another estimator. So far we have 2 estimators.
However, we can keep adding more and more estimators until our
results stop improving.
The next several charts show the outcomes from adding more
estimators. However, the charts are reformatted to be somewhat
more compact than the charts above. Three charts are combined as
subplots into a single chart. For each one, the top subplot shows the
full boosting result after that iteration plotted against the original
curve that is being fit. The middle plot shows just that single
boosting step plotted against the remaining error it was trying to fit
to for that iteration. The bottom chart shows the total remaining
error after that boosting iteration.
The three previous charts showed the results from Boosting Iteration
2. When those charts are combined into the new compact format,
they are shown below.
The results after the 3rd iteration are
And the results from the 4th boosting iteration are
You can see how each boosting iteration is making a small
improvement to the overall model. The improvements are adding up
over time and resulting in an OK model, even after only 4 iterations.
The above plot shows the results after 4 iterations. The plots below
jump ahead to the results after 9 iterations and then 10 iterations in
order to illustrate a few key points.
The boosting results after 9 iterations
The boosting results after 10 iterations
The first key point is that the final match after 10 iterations is OK,
but not fantastic. There are still clear differences between the sine
wave that we are trying to match and the ending curve fit.
However, more importantly, there was almost no improvement in the
curve fit between iteration 9 and iteration 10. The reason for this
becomes apparent if we look at the boosting iteration 10 plot of the
fit to the error at that iteration (middle subplot above), which is
shown by itself below.
The plot of the error (shown with dots) is very complicated.
Depending on how you count them, there are around 9 different
curves of dots. With the depth 1 regression tree, we are trying
to find two horizontal lines that give the minimum mean squared
error. Unsurprisingly, two horizontal lines are not a
good fit for this set of curves. As a result, we get very little
improvement in overall results from this boosting iteration. If we
plot mean squared error vs boosting iteration, we can see the plateau
in improvement below.
Each iteration still results in some improvement, but it levels off
fairly quickly with a fair amount of error remaining. Why is that?
Well, what we have hit on is that the method we are using to
generate the splits at each boosting iteration is not good enough. I
chose to start with a decision tree with exactly 1 decision point.
This was the simplest way to show the basics of how boosting
works, but now it has hit a roadblock where that method can’t
improve.

Depth 2 Decision Tree


So let’s see what would happen if, instead of using a maximum
decision depth of 1, we used a maximum decision depth of 2. The
code I used to create this example can be downloaded for free here:
http://www.fairlynerdy.com/boosting-examples. In the code, this is
as simple as changing a single parameter controlling the decision
tree depth.
The image below shows changing the variable I used for the tree
depth parameter to depth 2.

By making that change, the result is that instead of trying to fit the
curves of the error with 2 horizontal lines, we are trying to fit them
with up to 4 horizontal lines. Why are there 4 lines? Because the
max depth parameter controls the depth of the tree, and as we saw in
the decision tree section, a tree with a depth of 1 can end up with 2
leaf nodes, a tree with a depth of 2 can end up with 4, and a tree with
a depth of 3 can end up with 8. The formula is leaf nodes = 2 to the
power of the depth.

So what are the results after increasing the depth of the decision
trees? This is the curve fit after a single step, i.e. just the very first
decision tree
Here the decision tree split the data at x = 3.14 and then split the left
branch again at x = .476 and the right branch again at x = 5.807.
The 4 horizontal lines represent the average error of each of the 4
groups. You can see how that is an improvement over the
comparable curve fit with a depth of 1, shown below
Still, it isn’t really surprising that the regression with 4 horizontal
lines is better than the regression with 2 horizontal lines after a
single iteration. Let’s compare the model with a maximum depth
of 2 after 5 iterations vs the model with the maximum depth of 1
after 10 iterations. One of the methods has more iterations, the
other has more detail in each of the iterations. Which turns out
better in this case?
Here is the chart with a maximum depth of 2, after 5 iterations
And here is the curve fit with a maximum depth of 1 after 10
iterations
The regression with the depth 2 tree over 5 iterations is the better
curve fit. That difference is especially pronounced near the middle
of the chart. If you let the boosting algorithm using a tree with max
depth 2 have a couple more steps, up to 10, the fit continues to
improve, whereas the algorithm using a tree with depth 1 has very
little improvement. A fit with depth 2 and 10 iterations is shown
below.
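That depth comparison can be reproduced with a small from-scratch sketch. This is an illustration, not the book's downloadable code: it samples the sine wave at 100 points (an assumption) and uses greedy midpoint splits fit to the current residuals.

```python
import math

def sse(ys):
    """Sum squared error of a group about its own mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def fit_tree(points, depth):
    """Greedy regression tree on (x, residual) pairs; returns predict(x)."""
    ys = [y for _, y in points]
    mean = sum(ys) / len(ys)
    if depth == 0 or len(points) < 2:
        return lambda x: mean
    points = sorted(points)
    k = min(range(1, len(points)),
            key=lambda k: sse([y for _, y in points[:k]]) +
                          sse([y for _, y in points[k:]]))
    t = (points[k - 1][0] + points[k][0]) / 2
    left = fit_tree(points[:k], depth - 1)
    right = fit_tree(points[k:], depth - 1)
    return lambda x: left(x) if x <= t else right(x)

def boost_mse(xs, ys, depth, n_iters):
    """Run the boosting cycle and return the final mean squared error."""
    pred = [0.0] * len(xs)
    for _ in range(n_iters):
        resid = [y - p for y, p in zip(ys, pred)]
        tree = fit_tree(list(zip(xs, resid)), depth)
        pred = [p + tree(x) for p, x in zip(pred, xs)]
    return sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)

xs = [i * 2 * math.pi / 99 for i in range(100)]
ys = [math.sin(x) + 1 for x in xs]

# Depth 2 trees reach a better fit than depth 1 in the same 10 iterations
mse_depth2 = boost_mse(xs, ys, depth=2, n_iters=10)
mse_depth1 = boost_mse(xs, ys, depth=1, n_iters=10)
print(mse_depth2 < mse_depth1)  # True
```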
Why is the regression with a depth 2 tree so much better?
The key difference is that the analysis with a maximum depth of 2,
with its 4 horizontal lines, is able to significantly improve the error
in at least one section of the graph on each boosting iteration. Even
if only a few data points get improved each iteration, as long as the
analysis can consistently isolate and improve subsections, the
boosting regression will continue to get better. We can see that
below in the iteration 10 regression fit of the current error.
There were two key statements in the previous paragraph. A good
weak learner needs to isolate and improve subsections. To put it a
different way, it needs to make at least one section a lot better,
without making the rest of the graph any worse.
What we see in the chart above is that most of the graph had no
change. Everything to the left of x = 5.807 has an average error of
near zero. That section didn’t get any better, but it also didn’t get
any worse. However, two limited sections between x = 5.807 and
x=5.997 as well as x=5.997 and x=6.188 improved a lot. For those
points, the error has been reduced to almost zero. We can see that
when looking at the chart for error remaining after iteration 10
shown below.
What that means is for the next boosting iteration there is a lot less
error on the right side of the graph. So the next boosting iteration
will focus on getting the minimum square error on another section of
the graph, even if that is also a small section.
Boosting iteration 11 shown below also makes improvements on a
very limited section of the graph. But on that limited section, it
makes very good improvements and does not make any other part of
the data set worse while doing so.
What is happening is that the 4 horizontal lines have enough fidelity
to “peel off” limited sections of the problem and get them exactly
right. By “peel off” what we mean is that this regression makes
almost no change to most of the data points in the problem, but 4
lines are enough so that the regression tree can make a large
improvement to a small section of the problem while leaving the rest
of the problem nearly unchanged.
The algorithm could not do that with only 2 horizontal lines since it
couldn’t target points in the middle of the graph with only two lines.
With two lines there was no way to make changes to the middle of
the graph without also affecting one or both edges. However, with
the more refined decision trees, this algorithm can continue to make
improvements.

Comparison Of Mean Squared Error


Looking at the mean squared error for the depth 2 boosting analysis,
there is a substantially faster convergence than we had with the
depth 1 analysis. A plot of MSE for both decision tree depths is
shown below.
Going back to the starting analogy
We started this book with a couple of analogies and said that
good weak learners were ones that could get a limited section
of the data exactly right. I.e. ones that could always classify a
zebra at the zoo, or a construction crew that could build a
perfect foundation even if they couldn’t do anything else.
This is what we are seeing with the depth 2 decision trees.
With this data, 4 horizontal lines are enough to allow the
decision tree to get a small section of the data exactly right.
On the next iteration, the next decision tree can focus on a
different small section and get it exactly right. Each decision
tree is a “weak learner”. But they are good enough that they
can be stacked together for an end result that is quite good.
A decision tree depth of 2 is good enough that the algorithm
would eventually exactly match this regression of Y based on
a single variable of X. However, a lot of problems will be
more complicated than that. You might be trying to match a
result based on several different independent variables, such as
matching the fuel economy of your car vs how fast you are
driving and how much you are accelerating. As you add more
dimensions to the plot, a tree depth of 2 might not be enough
to cut out limited sections of the decision space for the
algorithm to fit to.

In That Case Why Not Use Decision Trees With A Very High Depth?


Using a decision tree with more branches/leaves can cause a
large improvement in the fit of the boosting algorithm.
Optimizing the fit of the tree at each boosting step is one of the
first things that you should do when you tune the parameters in
your model. By default, boosted trees in Python sklearn use a
decision depth of 3. If you set up a good cross-validation test,
you might find that increasing the decision depth yields a
significant improvement in your results. However, there are
some reasons why increased decision depth is not a categorical
improvement, and needs to be studied on a case by case basis
for the problem at hand.
Time: Increasing the decision depth increases the
complexity of the problem. It will take more time
to generate the decision trees for each iteration,
which means that with limited computing resources
you will be able to run fewer iterations overall. As
a result, there is a tradeoff between more iterations
and more fidelity at each iteration.
Boosting vs Decision Trees: Decision trees are a
useful and well-known machine learning approach.
However, they have their limitations. One of the
largest limitations is their tendency to overfit the
data. Boosting problems are characterized by
analyzing the data in a series of improving steps. As
you run with more detailed decision trees and fewer
boosting steps, you are reducing the impact of the
boosting algorithm, and increasing the impact of the
decision tree algorithm, which may or may not be
beneficial for you depending on the data that you are
analyzing. This paper by Jerome Friedman showed
that trees with 6 terminal nodes (approximately
depth 3) outperformed those with 11, 21, or 41
terminal nodes:
https://statweb.stanford.edu/~jhf/ftp/stobst.pdf (plot
on page 7). This is why a depth 3 decision tree is the
default in Python sklearn.
Learning Rate: How Fast Boosting Incorporates
Improvements
Gradient Boosted Trees have a lot of parameters that can be tuned.
There are all of the parameters for decision trees, as well as
boosting, and a few others. We will discuss most of them later in the
book, after detailing how classification works. However, it makes
sense to discuss one of the parameters, learning rate, now since it is
easier to understand with the regression plots we already have.
One of the most important parameters in boosted learning is the
learning rate, also known as shrinkage. What this parameter controls
is how quickly error is corrected. In every iteration of the gradient
boosting, a correction factor is applied based on the error in the data
points in each group. The question is: do you roll in all of the
correction each iteration or only part of it?
If you roll in the entire correction factor each iteration, you have a
learning rate of 1.0. That is the learning rate we have used in all the
examples we have seen so far. Otherwise, you have a learning rate of
less than 1.0.

An Analogy For Learning Rate


As an analogy, imagine trying to navigate somewhere using a
compass that gives you a direction and a distance. After you take
each measurement, you have a decision: how far do you travel
before stopping and taking a new measurement?
You take your first measurement, and it tells you to go 10 miles
west. A high learning rate would be: travel the full 10 miles, then
take a new measurement, which might tell you to go 1.5 miles
southeast. A lower learning rate would be: travel only 2 of the 10
miles west, then remeasure, and find that you now need to go 7 more
miles, mostly west but also a little bit south.
Measuring and then just going for it with a high learning rate will
get you close faster, but it is challenging for later boosting iterations
to make up for the mistakes of early iterations.
A smaller learning rate will tend to give better results, assuming you
can afford the time for extra boosting iterations. This is analogous
to building a complicated piece of furniture. A high learning rate is
like making a measurement, then making a cut. A smaller learning
rate is more like measuring, then making a small cut, checking the
fit, and making another small cut if required.
The takeaway here is that learning rate should be one of the first
parameters to tune. The learning rate typically defaults to 0.1, but it
can be any number greater than 0 and less than or equal to 1.0.
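What the parameter does in the update step can be sketched with the tree stripped away entirely. This is an idealized assumption: a single data point whose "tree" predicts the current residual exactly, so only the scaled update matters.

```python
def boosted_estimate(target, learning_rate, n_iters):
    """Each iteration the correction is the full remaining residual,
    but only learning_rate * residual is rolled into the prediction."""
    pred = 0.0
    for _ in range(n_iters):
        residual = target - pred          # the current error
        pred += learning_rate * residual  # apply part of the correction
    return pred

print(round(boosted_estimate(10.0, 1.0, 1), 2))   # 10.0: one full step
print(round(boosted_estimate(10.0, 0.1, 10), 2))  # 6.51: ~35% error left
```

With a rate of 0.1, each step leaves 90% of the remaining error, so after 10 steps 0.9^10, or about 35%, of the error is still there.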

Learning Rate Example


This is the result of the sine wave curve fit we saw before with a
tree depth of 2, a learning rate of 1.0, and only a single step.
So this is the result using a high learning rate.

Below is the same result, after a single step, using a learning rate of
.1. So this learning rate is only 10% of the previous learning rate.
What occurs here is that the decision tree splits the data into groups
and each group calculates its average error, the same as always, but
then instead of adding that full average error to the previous
prediction, we only add 10% of it.
Obviously, this curve fit isn’t as good as when we used the learning
rate of 1.0, but it wasn’t really a fair comparison. A high learning
rate will always converge faster, but the small learning rate might
give better results in the end. Here are the results after 10 steps
with a learning rate of .1
This is certainly an improvement, but still not great. Mathematically,
however, this makes sense. A learning rate of .1 means that 90% of
the error remains after every step, assuming that the 10% correction
we do apply is exactly right. After 10 steps, .9 raised to the 10th
power is .35, so we still expect to have about 35% of the error left.
After 30 steps with a learning rate of .1, the results are
This result starts to look pretty good. Everything, except the peaks
of the curves, is a close fit. Additional iterations improve the fit at
the peaks as well. It ends up taking about 100 iterations with a
learning rate of .1 to get a curve fit that has almost no discernible
error to the eye.
The curve fit is roughly equivalent to the one that we would
have after 45 steps with a learning rate of 1.0, shown below.
Why A Smaller Learning Rate Can Be Better
So far we have seen how the learning rate parameter works, but we
haven’t yet seen why a smaller learning rate can be advantageous
over a learning rate of 1.0. This is because learning rate is a
parameter intended to protect against overfitting. Overfitting occurs
when sections of your training data are not representative of the full
data set, i.e. when there are errors or outliers in the training data.
But if there are no errors in the training data, you can’t overfit.
The sine wave plots that are shown above have no error in them. We
generated a sine wave, and then matched the regression model to it.
Since there are no errors in the data that we matched against, there
was no way to overfit the model, so a high learning rate was best.
However, almost all real world data has errors in it. Below is a
more complicated example. It has a signal that we are trying to
match, plus random noise in the data that will confuse the
machine learning algorithm.
Here we will try to create a regression against Z, which is the sum of
three different sine and cosine functions. There isn’t anything
special about the equation; it was just generated to be more
complicated than a simple sine wave. Z plotted as different colors
against X & Y ends up being

The plot above is the actual values of Z that we are trying to match.
However, in addition to those values, we have random error added to
the data that we are feeding into the regression algorithm. The
random error is 1.1 times a value drawn from a normal distribution
centered around zero. So the data we are using to train the model is
not completely representative of the true data as a whole.
This means that when we train our model and then predict with it
using X and Y values, there will be some error in the prediction vs
the true Z values. If we take that error, square it, and average it
across all data points, we get mean squared error (MSE). When we
plot MSE against number of boosting iterations for both a learning
rate of 1.0 and a learning rate of .1, we get a plot that looks like this

The learning rate of 1.0 gets pretty close really fast. However, in
later boosting iterations, it starts to get a little bit worse as it begins
to overfit the data. The smaller learning rate of .1 takes additional
boosting iterations to get good results. But after 1200 boosting
iterations, the error is lower than the error for any of the iterations
with the higher learning rate.
The clear conclusion is that a lower learning rate is better, and will
help mitigate overfitting, assuming that you can afford the additional
computation required.
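The mechanism behind that plot can be sketched from scratch on one-dimensional noisy data (the sine-plus-noise data, noise level, and iteration counts below are all made-up assumptions, not the book's X, Y, Z example). The full-correction rate drives the training error down much faster, and with noisy training data that speed is exactly what lets it start fitting the noise; the test-set consequence is what the MSE plot above shows.

```python
import math
import random

random.seed(0)

def fit_stump(xs, rs):
    """Depth 1 split: threshold and two group means minimizing SSE."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(xs)):
        left = [rs[i] for i in order[:k]]
        right = [rs[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = (sum((r - lm) ** 2 for r in left) +
                sum((r - rm) ** 2 for r in right))
        if best is None or cost < best[0]:
            t = (xs[order[k - 1]] + xs[order[k]]) / 2
            best = (cost, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def training_mse(xs, ys, learning_rate, n_iters):
    """Boost with scaled corrections; return MSE on the training data."""
    pred = [0.0] * len(xs)
    for _ in range(n_iters):
        rs = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, rs)
        pred = [p + learning_rate * stump(x) for p, x in zip(pred, xs)]
    return sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)

# Noisy training data: a sine wave signal plus random error
xs = [i / 10 for i in range(100)]
ys = [math.sin(x) + 1 + random.gauss(0, 0.5) for x in xs]

# In the same 20 iterations, the full-correction rate chases the
# training data (noise included) much faster than the 0.1 rate does
m_full = training_mse(xs, ys, 1.0, 20)
m_small = training_mse(xs, ys, 0.1, 20)
print(m_full < m_small)  # True
```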
Learning Rate Should Be Between 0 and 1.0
One thing to be sure of is that your learning rate is always greater
than 0.0 and less than or equal to 1.0. A number greater than 1.0
will overshoot the correction it is trying to make, while a negative
learning rate will go in the wrong direction.
A Quick Correction Of A Likely
Misconception
At this point, we need to stop and review what could be an error in
understanding how the boosting works in practice. That
misunderstanding would have been primarily driven by some sloppy
plotting that I did in the sine wave graphs above. That sloppy
plotting was due to the fact that I used a relatively dense amount of
data, and frankly, I am not the world’s leading expert in the Python
plotting tool Matplotlib. Specifically, I showed plots on the sine
wave regression which were continuous. This is not what would
actually occur.
For instance, take the data points shown below. There are 20 data
points arranged between 0 and 2 Pi, not evenly spaced, but
reasonably arranged along the x-axis. The y value is the Sine of
each x point. Given those points, if you did a boosting regression of
them, and assumed that you had a sufficiently large number of steps
and large enough tree depth, and then tested the values on a much
finer mesh of data, maybe with 500 points instead of 20, what
would the resulting plot look like?
How would you draw it on the graph below?
If you thought it would be a continuous sine wave, such as the one I
plotted below, then that is an error driven by the fact that all of my
plots in the graphs above used continuous lines.

What would really result is shown in this plot below. There would
be a series of steps that go halfway between each pair of adjacent
data points.
With a large enough number of iterations and a large enough tree
depth, the regression trees would eventually put splits between every
two adjacent points. Each split would be exactly halfway between
any two adjacent points. Effectively each and every point would
carve out a zone where it set the value in that zone. Each of those
zones would abruptly shift to the next zone at a point halfway
between one data point and the next.
The result would be a series of stair steps. The width of the stair
would depend on how much distance was between any two points,
and the y value of the stair would match whichever point was in it.
Note, there are a few parameters that could affect that result, which
we will get into near the end of the book. Specifically, there are
some tree parameters that would limit where splits got made, i.e. you
could set a parameter so that you never have a regression tree leaf
with only one point. Additionally, there is a parameter that would
allow you to not use all the data points for every regression tree (i.e.
use a different subset for different trees). That would increase the
number of possible steps because on different trees there would be
different points adjacent to each other (since some points would be
missing) and hence there could be different splits halfway between
those adjacent points.
However, the key takeaway you should get from this is that boosting
will carve out a series of discontinuous zones, each zone having its
own value. The next section shows what those zones would look
like in 2 dimensions.
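You can verify the stair-step behavior with a short experiment. This sketch (my own, using scikit-learn rather than the book's script) trains on 20 sine-wave points and then predicts on a 500-point mesh; the prediction takes on only a couple dozen distinct values, one per carved-out zone.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(1)

# 20 unevenly spaced training points on a sine wave, as in the text
x_train = np.sort(rng.uniform(0, 2 * np.pi, 20)).reshape(-1, 1)
y_train = np.sin(x_train).ravel()

# Plenty of iterations and depth so splits land between every pair of points
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.5,
                                  max_depth=4, random_state=1)
model.fit(x_train, y_train)

# Predict on a much finer mesh of 500 points
x_fine = np.linspace(0, 2 * np.pi, 500).reshape(-1, 1)
y_fine = model.predict(x_fine)

# The prediction is piecewise constant: far fewer distinct output values
# than mesh points, because each training point carves out its own zone
print("distinct predicted values:", len(np.unique(np.round(y_fine, 6))))
```

Plotting `x_fine` against `y_fine` would show the stair steps directly, with each step boundary halfway between two training points.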
Regression Tree Splits In Two Dimensions
Let’s look at how the data space gets broken up when we have
two features. For this example, assume the x and y values in
the plot represent the values of feature 1 and feature 2, and that
the actual value we are trying to find using the boosting
algorithm is not represented on the plot. However assume that
every data point has a unique value, as is common in
regression analysis.
If you used a very large number of boosting iterations, how
would this data space get broken up?

My first thought for this problem was incorrect. I thought that
each data point would carve out a space around itself halfway
between all of its neighbors. That incorrect space would look
something like this
Effectively what would happen in the plot above (which isn’t
what actually happens in boosting) is that a point in the test
data would end up finding the point in the training data it was
closest to, and get adjusted to that value. That is basically how
the machine learning algorithm “Nearest Neighbors” works,
but not gradient boosted trees.
There are at least two fundamental mistakes I made with that
initial plot above. (And there may be more). Take a look at
the data again and see if you can determine how the design
space would be split up, knowing what you know about how
decision trees work. The solution is on the next page.
It would, in fact, be split up like this.

Basically, if you sort the data on any feature there is a split
halfway between any two adjacent points. That split:
Operates on only one feature
Goes the full distance of all the other features
In two dimensions that means we get purely horizontal or
vertical lines, making a hash of the data space. In 3
dimensions there would be planes that split that data either
vertically, horizontally, or out of the page, resulting in a bunch
of 3-dimensional rectangles.
Just like we saw in the 1D sine wave plot, each of these
rectangles (or hyper-rectangles in higher dimensions) could
potentially have their own value, and there could be a
discontinuity in values at every rectangle boundary.
In a real machine learning analysis, there would not
necessarily always be as dense of a mesh as is shown. The
chart above shows every possible split between two adjacent
points. Whether a split actually occurred at any given potential
location would depend on the data. There would never be a
split between two points that had the exact same value, nor
would there be a split on a given feature between two points if
they have the same value on that feature.
There are some interesting things worth noting in the 2D grid
we made above. One is that many, if not most, of the resulting
rectangular areas do not have an actual data point in them.
They were created by splitting other points some distance
away which results in many areas which are empty. Any
given split occurs between two points that are adjacent when
any given feature is sorted. And wherever splits on all the
features intersect, they carve out a region. Sometimes that
region has a single point in it, and sometimes it is empty.
Either one is ok.
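A quick way to see the axis-aligned splits is to inspect a fitted scikit-learn tree directly. This is an illustrative sketch with made-up data: the `tree_.feature` and `tree_.threshold` arrays record, for each internal node, the single feature tested and the threshold it is tested against.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(2)
X = rng.uniform(0, 10, size=(8, 2))   # 8 training points, 2 features
y = rng.uniform(0, 1, size=8)         # 8 distinct target values

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
t = tree.tree_

# Every internal node tests exactly ONE feature against a threshold,
# so in 2-D each split is a purely horizontal or vertical line
splits = [(int(t.feature[n]), float(t.threshold[n]))
          for n in range(t.node_count) if t.children_left[n] != -1]
for feature, threshold in splits:
    print(f"split on feature {feature} at {threshold:.3f}")

# The root split lands halfway between two training values that are
# adjacent when you sort on the chosen feature
root_feature, root_threshold = splits[0]
vals = np.sort(X[:, root_feature])
midpoints = (vals[:-1] + vals[1:]) / 2
print("root threshold is a midpoint:",
      bool(np.any(np.isclose(midpoints, root_threshold))))
```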
Sub-sampling
If you have N points and keep all of them, you will have at
most N-1 possible splits on any given feature. You might have
fewer splits if some of those points have the exact same value.
Each of those splits represents a discontinuity in the resulting
regression.
There is one way that you can get even more regions than the
number of points you have. Counter-intuitively, that method is
to leave out some of your data on any given tree. That process
is called sub-sampling and is controlled by a parameter.
Remember, that splits occur halfway between adjacent points.
So if I have points located at 1.0, 2.0, 4.0, and 5.0 I would get
splits at 1.5, 3.0, and 4.5. Now, what if on some of those trees
I left out a single point? That means that some trees would get
passed 1.0, 4.0, 5.0. Those trees would have a split at 2.5 (in
addition to other splits). Other trees would get passed 1.0, 2.0,
5.0. Those trees would have a split at 3.5 (in addition to other
splits). Those splits at 2.5 and 3.5 did not exist in the original
tree. They were created in the new tree because omitting some
points made other points adjacent that were not adjacent
previously.
The additional possible splits make for denser possible
regions, which theoretically could make for less discontinuity
between regions.
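The midpoint arithmetic above is easy to check by fitting full-depth regression trees to the example points. The helper name `split_thresholds` is made up for this sketch; it just collects the thresholds out of a fitted scikit-learn tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def split_thresholds(x_values):
    """Fit a full-depth regression tree on 1-D data and collect its splits."""
    X = np.array(x_values, dtype=float).reshape(-1, 1)
    y = np.arange(len(x_values), dtype=float)  # distinct targets force splits
    t = DecisionTreeRegressor(random_state=0).fit(X, y).tree_
    return sorted(float(v) for v in t.threshold[t.children_left != -1])

# All four points: splits fall halfway between adjacent points
print(split_thresholds([1.0, 2.0, 4.0, 5.0]))   # [1.5, 3.0, 4.5]

# Leave out 2.0 (as sub-sampling might): 1.0 and 4.0 become adjacent,
# creating a brand-new split location at 2.5
print(split_thresholds([1.0, 4.0, 5.0]))        # [2.5, 4.5]

# Leave out 4.0 instead: now there is a split at 3.5
print(split_thresholds([1.0, 2.0, 5.0]))        # [1.5, 3.5]
```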
The Final Point Before We Get Into
Classification
There is one more key point to look at in the regression section
before moving onto classification. Recall at the beginning that
we said all boosting algorithms have two key characteristics.
They are
1. Multiple iterations
2. Each subsequent iteration focuses on the parts of the
problem that previous iterations got wrong.
The multiple iterations are pretty clear in the examples above,
we can keep doing additional iterations as long as the results
keep improving. But how is the algorithm focusing on the
parts of the problem that the previous iterations got wrong?
For this regression algorithm, there are two ways it is doing
that focus.
The first is by subtracting the current prediction from the true
value at every single iteration and then fitting the regression
tree to the current error. This means that at every single
iteration, the data points that have the largest error will stand
out the most.
The second way of focusing on the error is buried within the
regression tree. By default, the regression analysis attempts
to minimize the summed squared error. (Or mean squared
error, which is an equivalent objective.) The result is that a data
point with an error of 10 would have four times the squared
error as a data point with an error of 5. (i.e. 100 vs 25). Since
the objective is to minimize the summed squared error, the
algorithm naturally focuses on the largest sources of error
since they have an outsized impact. This is not to say that the
regression algorithm always changes the current largest error
because sometimes it can improve many other data points
simultaneously, but it does tend to focus the algorithm where
the error is. (Note, summed squared error is the typical
regression metric, but there are other options such as mean
absolute error, which uses an absolute value in place of the
squared value)
This is the method of focusing on the error that gradient
boosted trees use. As a side note, there is another common
method that other algorithms, such as AdaBoost, use instead.
That method involves weighting the data points differently. In
the other method all data points start out with equal weights,
and then, data points with more error have their weights
increased, and those with less error have their weights
decreased. In gradient boosted trees, all data points have a
weighting of 1.0.
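The residual-fitting loop described above can be written out in a few lines. This is a bare-bones sketch of gradient boosted regression, not a production implementation: at every iteration a small tree is fit to the current error, and a fraction of its prediction (the learning rate) is added to the running total.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(3)
X = rng.uniform(0, 2 * np.pi, size=(100, 1))
y = np.sin(X).ravel()

learning_rate = 0.1
prediction = np.full(len(y), y.mean())   # start from the plain average

for iteration in range(200):
    residual = y - prediction            # current error at every point
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)                # least-squares fit TO THE ERROR
    prediction += learning_rate * tree.predict(X)

print("final mean squared error:", np.mean((y - prediction) ** 2))
```

Because each tree is fit to the residuals with a squared-error criterion, the points with the biggest leftover error dominate each new tree, which is exactly the "focus on what previous iterations got wrong" behavior.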
Classification
This section focuses on using gradient boosted trees for
classification. Classification is splitting items into two or
more categories. The main difference between classification
and regression is that regression uses a continuous range of
numbers, and classification uses discrete groups.
Using gradient boosted trees for classification is a little more
complicated than it was for regression. There are a few more
steps. However, you don’t need to understand every equation
to have a good big picture understanding of how the boosting
is working. The general concept is that we will change the
categories into numbers and keep a running estimate of which
categories we are predicting for each data point. We will use a
regression tree to group the data points based on their error in
what they are predicting vs the true value and adjust the results
accordingly. Here is the flow chart for a 2 category
classification problem. The following pages will fill in the
details.
You will notice a few extra steps compared to the flow chart
we had for regression. Two of those steps are converting
intermediate results between numbering ranges. Part of what
makes classification boosting more complicated than
regression boosting is that we will be carrying two different
numbering ranges and converting back and forth between
them for each data point.
One of those ranges goes from zero to one, setting one
category equal to zero and the other equal to one. The other
range goes from negative infinity to positive infinity. (note,
understanding and keeping track of when and why to use two
numbering ranges is probably the hardest part of this)
We will have a function to convert back and forth between the
two numbering ranges. The conversion will change numbers
like this:
One important thing to know is that you can map any number
from one of the ranges onto the other range. They are
completely convertible.

The steps outlined in the flow chart above are shown below in
more detail.

Two Category Gradient Boosting Classification


1. Renumber the categories to be 0 & 1 instead of
whatever discrete values they have. Their 0 or 1
value will be the “True Value” for each data point.
It doesn’t matter which category becomes zero and
which becomes one.
2. Make an initial prediction for each data point by
dividing the total number of ones in your training
data by the total number of data points overall.

For instance, if you have 3 ones and 1 zero, the initial
prediction is 3 / 4 = .75

This is the starting estimate for what your values are. i.e.
just a naïve average. Call this value your “Current
Prediction”
From this step onward we will repeat for every boosting
iteration

3. Subtract your “Current Prediction” for each data
point from the “True Value”. This is your “Current
Error”.
4. Use a Decision Tree regression analysis to fit a
minimum mean squared error tree to the “Current
Error”. This is exactly like we showed in the
regression section. Ignore the actual average values
that the regression tree fit to the “Current Error”, and
just keep the groups.
5. For each group, generate an equation based on how
many points have positive error, and need to be
moved up, and how many have negative error, and
need to be moved down. Account for both the
quantity of points and the magnitude of the error.
This equation will generate either a positive or
negative value in the range from negative infinity to
positive infinity. Call this the “Amount of
Change”
6. Convert your current prediction for each data point
(from step 2 on the first iteration, from step 8 on
subsequent iterations) to the negative infinity /
positive infinity range. Call this the “Modified
Current Prediction”
7. For each point, add the “Modified Current
Prediction” (step 6) to the “Amount of
Change”(step 5)
8. For each point, convert the result from step 7 from
the negative infinity / positive infinity range back to
the zero to one range. This is the new “Current
Prediction”

Finish The Model Or Iterate Again


9. Take the new “Current Prediction” and either start
another boosting iteration at step 3 or be done. If
you are done, the model has now been fitted, and
you can use it to predict new data points.

The process may seem somewhat complicated due to the fact
that there are 9 steps. However, the only actually complicated
logic is being done in step #4, which is the decision tree.
Every other step is simply algebra that needs to be done in the
correct order. If you export the results from the decision trees
out of Python or R you can do all the other steps, except the
decision trees, in Excel fairly easily.
So easily in fact, that I have done every other boosting step in
Excel for 10 boosting iterations for the example shown in the
pages below. That excel file can be downloaded for free here
http://www.fairlynerdy.com/boosting-examples, to supplement
the example. It doesn’t use any function more complicated
than a CountIf. (note, Python and R do quite a bit of error
checking that was not replicated in the Excel boosting
duplication).
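The nine steps can also be sketched directly in Python. This is my own minimal illustration, not the scikit-learn implementation: it uses a scikit-learn regression tree for step 4 and plain arithmetic for everything else. The names `fit_two_class_boost`, `logit`, and `expit` are made up for the sketch, and it returns the fitted training predictions rather than a reusable model.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logit(p):
    return np.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_two_class_boost(X, labels, n_iterations=10, depth=2):
    y = labels.astype(float)                  # Step 1: categories already 0/1
    current_pred = np.full(len(y), y.mean())  # Step 2: initial prediction
    for _ in range(n_iterations):
        error = y - current_pred              # Step 3: current error
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, error)  # Step 4
        leaf = tree.apply(X)                  # which group each point fell in
        modified = logit(current_pred)        # Step 6: to the infinity range
        for g in np.unique(leaf):             # Step 5: change per group
            in_g = leaf == g
            numer = error[in_g].sum()
            denom = (current_pred[in_g] * (1 - current_pred[in_g])).sum()
            modified[in_g] += numer / denom   # Step 7: add the change
        current_pred = expit(modified)        # Step 8: back to 0-1 range
    return current_pred                       # Step 9: done (or iterate more)

# Tiny example: points below 0.5 are category 0, above are category 1
X = np.linspace(0, 1, 20).reshape(-1, 1)
labels = (X.ravel() > 0.5).astype(int)
pred = fit_two_class_boost(X, labels, n_iterations=10)
print(np.round(pred, 3))
```

On this separable toy data, a handful of iterations pushes every prediction very close to its true 0 or 1 value.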
Compared to regression boosting, there are a few additional
steps. The new steps are assigning numeric values to the
categories at the beginning, and then converting back and forth
between the two numbering ranges when we calculate the
amount of change for each group. Additionally, the equation
for calculating how much to change the data in each group is
different for regression than classification.
For both regression and classification, we would fit a model
using training data and then use it to predict results on a
separate set of data that we don’t have the solution to. The
steps for predicting results with new data points are a bit
simpler than fitting the model. However, before going through
those steps, we are going to look at an example of fitting a
model, and go over how the 0 to 1 and negative infinity to
positive infinity numbering ranges work.
Numbering Ranges – 0 to 1 & Negative Infinity to
Positive Infinity
Before getting into the example, let’s look at how the two numbering
ranges work.
Why Have Two Ranges?
There are two numbering ranges to keep track of for this process.
One is a range from zero to one, based on the fact that we have
numbered the two categories to be zero and one. The other range is
from negative infinity to positive infinity and is just the standard real
number range we use in everyday math.
Why do we need two numbering ranges?
We need the 0 to 1 range to track how likely a data point is
to be part of the category represented by 0 or 1
We need a negative infinity to positive infinity range
because we don’t know how many boosting iterations we
will do. Each iteration will produce a value to add to or
subtract from each of the data points. We need a
numbering range where we can keep adding values to the
data points an unlimited number of times if we keep
running more and more boosting iterations.
We could do both things with some more complicated math and a
single range, but it is more mathematically convenient to use the two
separate numbering ranges. Then all we have to worry about is how
to convert between the two ranges.

How To Convert Between The Two Ranges


There is a formula to convert a number to the infinity range, and a
different formula to convert back to the zero to one range. If you
have a number in the zero to one range, you would use the LOGIT
function to convert to the negative infinity to positive infinity range,
(logit function on Wikipedia https://en.wikipedia.org/wiki/Logit )
The LOGIT function is

    logit(p) = ln( p / (1 - p) )

Where p is the value in the 0 to 1 range and ln is the natural
logarithm function. For instance, if you wanted to convert 0.6 to
the infinity range, the equation would be

    logit(0.6) = ln( 0.6 / 0.4 ) = ln(1.5) ≈ 0.4055

If you wanted to convert .9999 to the infinity range the equation
would be

    logit(0.9999) = ln( 0.9999 / 0.0001 ) = ln(9999) ≈ 9.21
We use the numbers in the infinity range when we add or subtract the
amount of change for each group (Step 7). However, we need the
result of that addition/subtraction back in the zero to one range,
since we use the zero to one range to calculate the current error
(Step 3). So we need a way to convert back. The function that we
use to do that is the inverse of the logit function we saw before.
This function is known as the Expit function.

The Expit function is

    expit(x) = e^x / (1 + e^x) = 1 / (1 + e^(-x))

Where e is the mathematical constant approximately equal to
2.71828, and x is the value being operated on. So if we had a value
of .4055 in the infinity range, we could convert it back to the zero to
one range as shown below.

    expit(0.4055) = e^0.4055 / (1 + e^0.4055) = 1.5 / 2.5 = 0.6

If we had a value of -10.0 in the infinity range, it would convert
back to the zero to one range as

    expit(-10.0) = 1 / (1 + e^10) ≈ 0.0000454
It may seem like we wasted a lot of effort with those Expit and Logit
functions. In one of the examples, we converted 0.6 to 0.4055 and
then back again without doing anything else. However, when we do
those two conversions for actual calculations, we will put an
additional step of addition/subtraction between them so it won’t just
be converting the same number back and forth.
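Both conversion functions are one-liners in code. Here is a minimal sketch using the book's own example values:

```python
import math

def logit(p):
    """Convert from the 0-1 range to the infinity range."""
    return math.log(p / (1.0 - p))

def expit(x):
    """Convert from the infinity range back to the 0-1 range."""
    return 1.0 / (1.0 + math.exp(-x))

print(round(logit(0.6), 4))          # 0.4055
print(round(logit(0.9999), 2))       # 9.21
print(round(expit(0.4055), 4))       # 0.6
print(expit(-10.0))                  # about 0.0000454
print(round(expit(logit(0.6)), 10))  # round trip gives back 0.6
```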

Logit and Expit Graphically


Seeing the equations to convert between the two numbering ranges
is one thing. However, sometimes a visualization makes more
sense. Here is the chart of that conversion from the zero to one
range into the infinity range.
This chart shows a few things
There is a smooth conversion between every number in
the zero to one range and every number in the negative
infinity to positive infinity range
The conversion goes asymptotic near the boundaries of 0.0
and 1.0

The asymptotes at the boundaries mean that as you approach
certainty that a point is in either category, you need more and more
information to make you more certain. To go from .5, which is
even odds between the two categories, to .75, which says you are
75% confident the point is in the 1 category, you would need to go
from 0 to 1.09 in the negative infinity to positive infinity range.
However, to go from 75% confidence to 90% confidence, you need
another 1.09 of change. A third 1.09 change would make the
confidence just over 96%, a fourth would change the confidence to
just under 99%. As you get more and more certain on a given data
point, there are diminishing returns to additional boosting iterations
on that data point.
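Those diminishing returns can be checked numerically. Each equal step of about 1.09 is ln(3), so n steps give a confidence of 3^n / (3^n + 1):

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Each equal step of about 1.09 (= ln 3) in the infinity range buys
# less and less additional confidence in the 0 to 1 range
step = math.log(3)   # about 1.0986
for n in range(1, 5):
    print(f"{n} step(s): confidence = {expit(n * step):.4f}")
# prints 0.7500, 0.9000, 0.9643, 0.9878
```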
The chart of the Expit function converting back to the zero to one
range is shown below.

This chart shows that no matter how large or small you get in the
infinity range, the number is still bounded by zero and one in the
other range.
For general knowledge of how boosting works, you don’t need to
memorize these functions. The important thing to know is that there
is a way to map the range from negative infinity to positive infinity
onto the range from zero to one. The numbers that we will be
adding to are the numbers in the range from negative infinity to
positive infinity. We will be using the infinity range when we
adjust the values of our predictions for each data point, and we will
be using the zero to one range to calculate the amount of error
remaining in that prediction.

More Uses For The Expit Function


The expit function, also known as the logistic function (
https://en.wikipedia.org/wiki/Logistic_function ), the sigmoid, or the
S-curve is a very important function that finds its way into many
areas, not just boosting. It is found in nature in phase changes of all
kinds, for instance, the melting of an ice cube. In machine learning,
this function is one of the key features in neural nets and deep
learning. In that usage, it models how fast a given neuron fires its
output based on its input.
Classification Example
Let’s look at an example classification problem that has 2
categories. You have a design space with two types of data, red and
blue. The full design space looks like this

It is basically stripes in two dimensions. The equation that makes
the chart is

    value = mod( int( max(x, y) ), 2 )

The equation is: find the maximum of x & y, turn that number into
an integer, and find the modulus of that number vs 2. If the result is a
zero, it is in a red zone; if the result is a 1, it is in a blue zone.
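The equation translates directly into code. A short sketch (the function name is made up for illustration):

```python
def stripe_class(x, y):
    """Return 0 for a red zone, 1 for a blue zone."""
    return int(max(x, y)) % 2

print(stripe_class(0.3, 0.7))  # max is 0.7 -> int 0 -> 0 (red)
print(stripe_class(1.2, 0.4))  # max is 1.2 -> int 1 -> 1 (blue)
print(stripe_class(2.5, 3.7))  # max is 3.7 -> int 3 -> 1 (blue)
```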
So the colors are the true function, however, you don’t know the true
function. You don’t have results for every X & Y, you only have
results for 100 data points, and you need to use those points to train
a model that will predict the value anywhere in the design space.
Here are the 100 points that we have, with red triangles in the red
zones, and blue squares in the blue zones.
Of those 100 data points, we might want to cross-validate whatever
results we get. Cross-validation is basically setting aside some data
that we know the answer for and not training on it so that we can use
that data to score how good of a fit the training model made. With
cross-validation, you can split the data wherever you want, and train
on one portion and test on the other. For this example, I split it into
60 points for training and 40 points for cross-validation. (Which is
probably a higher percentage for cross validation than you would
want for most problems.) Here are the 60 data points to train on.
(note these plots were generated by
Classification_Gradient_Boost_Stripes.py located here
http://www.fairlynerdy.com/boosting-examples )
The data above is the full data set we will use to train the boosting
model. We will use the red triangles and blue squares and attempt to
replicate the results from the full design space, shown with the
shaded colors. We can, and will, put this training data directly into
the boosting algorithm, and it will work well as is. But for this first
example, in order to understand the boosting algorithm, there is one
simplification we will make. Instead of passing two variables into
the algorithm, X & Y, we will only pass a single variable, the
maximum of X & Y.
The reason this simplification will benefit our understanding is that
the boosting classifier uses a regression routine at its core. We have
already seen one regression algorithm. This one is slightly
different. But to show what it is doing, a 2D plot of a single value vs
result, similar to the chart we had for the sine wave example, is very
useful, while a 3D plot of X & Y vs the result would be hard to
understand.
So if you plot Max(X, Y) instead of X vs. Y, this is the design space.
Basically what we now have is a single line that changes colors as it
progresses along the one value.
Great, so we want to use this modified data for classification. How
does the algorithm work? We will work down along the boosting
steps shown in the flowchart.

Step 1 – Renumber classes 0 to 1


The first step is to determine the number of unique categories that
you have. Here we only have 2, which is beneficial since the
process gets more complicated with more categories (we’ll see that
later in the book). We assign those two categories a value of either
zero or one. Here the red points correspond to zero, and the blue
points correspond to one
For our 60 data points, plotting MAX(X, Y) vs the value for the
category each point is in gives us the plot below. This shows our
“True Value” for each of the points.
Step 2 – Establish Initial Predictions
The next step is to estimate an initial value for the data points. The
initial prediction is the value you would guess for a point if you did
not do any machine learning at all and just used an average value.
All data points will get the same initial prediction.
The equation for our initial prediction is the number of data points in
category 1 divided by the total number of data points. This equation
is basically a glorified average value of the points. Here we have 24
points that are red corresponding to the zero category and 36 points
that are blue and correspond to category one.

    Initial Prediction = 36 / 60 = 0.6

Note we used the 36 as the numerator in this equation because it is
the category numbered 1. The resulting value of .6 is our “Current
Prediction”.

The Repeated Part Of The Boosting Loop


The calculations that were done above are only done one time to set
the initial values for the boosting iterations. The calculations that
are shown below are done for every boosting iteration. The
following pages walk through the calculations for the first 3
boosting iterations.

Boosting iteration 1
Step 3 – Calculate Current Error
Now that we have the initial prediction, we can calculate the error in
that prediction. The points all have the same initial prediction, but
not the same true values, so different points can have different
current error.
We subtract the current prediction for each point from the true value
to get the current error at each point. Since the true value of every
point is either zero or one and the current prediction at every point is
0.6, the current error at each of those points is either -0.6, or positive
0.4. This is shown as a plot below.

Step 4: Use a Regression Decision Tree on the Current Error to Break the Data Into Groups
We will use a regression tree to attempt to group points with similar
error together. For this problem, I have selected a decision tree
with a depth of 2. That means that the decision tree will generate a
minimum mean squared error fit using a maximum of 4 groups.
We have discussed how regression trees work earlier in the book and
are going to sidestep additional discussion for a couple of pages in
order to remain focused on the actual boosting algorithm. For our
purposes, the important thing that comes out of the regression
decision tree is what groups the data is put into. The first boosting
iteration breaks the data into 3 groups (not 4, surprisingly) labeled in
the chart below

The above chart shows the initial values coming into this first
boosting stage as red diamonds, and shows the true values as black
triangles. The actual regression tree split the data into the three
groups based on the current error of black triangles minus red
diamonds. (Current error is not shown on this chart)
We are skipping over the actual process of the regression tree at this
point because all that really matters is the groups the points get
broken into. (Note, we’ll touch on how the regression was done in
a couple of pages just to tidy up the loose ends after hitting the main
points on how the boosting equations work).
Notice that there are 6 different regions of points in the data (based on
their true value) that we are trying to fit the regression to. With this
depth 2 decision tree, we only have the ability to generate up to 4
groups. That means there will inevitably be some areas that are not
an exact match. What we end up with from the regression tree is 3
groups of points. (We could have gotten 4 groups from this depth 2
regression tree, but ended up only getting 3)
Zone 1: has 25 points. 17 are ones and 8 are zeros
Zone 2: has 16 points, all zeros
Zone 3: has 19 points, all ones

The next step we will do is step 5. However, we will have to do
steps 5, 6, and 7 three times because we need to do them one time
for each of the groups that we created with the Regression tree in
Step 4.

Step 5 - Calculate Amount Of Change


Equation For Calculating The Amount Of Change
We need to calculate the amount of change for each of the three
zones separately. The result that gets calculated will be a value in the
infinity range.
The equation that we will use to calculate the change for each group
is

    Amount of Change = Numerator / Denominator

Where

    Numerator = Σ (Current Error)

That is, the numerator is the sum of the current error of all of the
training points in that group. The current error for any given data
point can be positive or negative. The numerator, which is the sum
of all those values, can be either positive or negative.
The equation for the denominator is

    Denominator = Σ (True Value - Current Error) × (1 - True Value + Current Error)

That equation also sums a value across all data points in the group.
The value that it sums is the product of two differences. Note that
the True Value in this equation is always either 0 or 1. If the true
value is 1 then the current error is always positive. If the true value
is 0 then the current error is always negative. If we recognize that
True Value minus Current Error is the same as the Current
Prediction, this equation becomes (CP is Current Prediction)

    Denominator = Σ CP × (1 - CP)

Since the current prediction is always a value between 0 and 1, the
resulting product for each data point is always positive, which means
that the denominator is always positive.
The amount of change is the numerator divided by the denominator.
Since the denominator is always positive, the sign of the amount of
change is determined by the sign of the numerator. This basically
means you can add up all the current error in a given group to
determine if the new current prediction will be higher or lower than
the current prediction for those points.

Amount of Change Calculation in Zone 3


I will do the next steps starting with zone 3, not zone 1, just because
zone 3 is all the same category which makes it a little easier. The
order we do each group doesn’t actually matter to the results.
In zone 3 there are 19 points all with a current prediction of .6 and a
true value of 1.0. So the current error is .4 for each of those 19
points.

Step 5 Continued – Calculate Amount Of Change For This Example
To calculate the numerator for the amount of change, we need to
sum up all of the values of current error. That will be

    Numerator = Σ (Current Error)

So we need to sum the current error of .4 across all 19 points

    Numerator = 19 × 0.4 = 7.6

Since this value is a positive number, the end result will be that the
points move up. We need to calculate the denominator of the
equation to finish determining how much the points move.
For the denominator the equation is

    Denominator = Σ CP × (1 - CP)

For each point, the current prediction is 0.6. So the denominator
value for each point is

    0.6 × (1 - 0.6) = 0.24

Since we have 19 points, the total denominator value is

    Denominator = 19 × 0.24 = 4.56

The amount of change is

    Amount of Change = 7.6 / 4.56 = 1.6667

That is how much the points move in the infinity range.

Step 6 – Convert The Current Prediction Into The Infinity Range, Becoming “Modified Current Prediction”
Each of the points has a current prediction in the 0-1 range of .6.
Using the Logit equation below

    logit(p) = ln( p / (1 - p) )

to convert to the infinity range, that value becomes

    logit(0.6) = ln( 0.6 / 0.4 ) = 0.4055
Step 7 – Add Modified Current Prediction To Amount of Change
Adding the modified current prediction from the previous step to the
amount of change, we get

    0.4055 + 1.6667 = 2.0721
Step 8 – Convert Back To Zero To One Range
To find the new current prediction we need to convert back to the zero to one range using the Expit function.

Expit(x) = 1 / (1 + e^(−x))

All the data points in this group have a value of 2.0721, which results in

Expit(2.0721) = 1 / (1 + e^(−2.0721)) = 0.8882

So the new current prediction of the 19 points in zone 3 is .8882 after 1 boosting iteration.
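The whole zone 3 calculation can be sketched in a few lines of Python. This is an illustrative implementation of steps 5 through 8 for a single group, not the actual library code; Logit and Expit are written out with the math module for self-containment.

```python
import math

def logit(p):
    # Convert from the 0-1 range to the infinity range
    return math.log(p / (1.0 - p))

def expit(x):
    # Convert from the infinity range back to the 0-1 range
    return 1.0 / (1.0 + math.exp(-x))

def update_group(current_predictions, true_values):
    # Steps 5-8 for one group (one leaf of the regression tree)
    errors = [t - p for t, p in zip(true_values, current_predictions)]
    numerator = sum(errors)
    denominator = sum(p * (1.0 - p) for p in current_predictions)
    change = numerator / denominator  # amount of change, in the infinity range
    # Convert each point, add the change, and convert back
    return [expit(logit(p) + change) for p in current_predictions]

# Zone 3: 19 points with a current prediction of 0.6 and a true value of 1.0
new_preds = update_group([0.6] * 19, [1.0] * 19)
print(round(new_preds[0], 4))  # 0.8882
```

Running the same function on zone 2's 16 points with true values of 0 reproduces the .1096 result worked out below.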
Amount Of Change Calculation for Zone 2
The previous block of calculations was zone 3 out of this chart. This
block of calculations solves for the new values in Zone 2.

Keep in mind this is still the first boosting iteration, still on steps 5-7. So we will be doing these steps in the flow chart again.
However, now we are doing those calculations on a different group
in the regression tree. In zone 2 there are 16 points that have a
current prediction of 0.6 and a true value of 0. That makes their
value for current error -0.6

Step 5 – Calculate Amount Of Change


The numerator for the amount of change equation is

Numerator = Σ (Current Error)

For zone 2 that results in

Numerator = 16 × (−0.6) = −9.6

Since this value is negative the points will move down.


The equation for the denominator is

Denominator = Σ (CP) × (1 − CP)

Where CP is the current prediction, which is 0.6. So for a single point

0.6 × (1 − 0.6) = 0.24

So the sum of the denominator for all 16 points is

16 × 0.24 = 3.84

The amount of change is

Amount of Change = −9.6 / 3.84 = −2.5

Step 6 – Convert The Current Prediction Into The Infinity Range, Becoming “Modified Current Prediction”
The current prediction of all the points in the 0 to 1 range is .6, which converts to .4055 using the logit function as shown for the zone 3 calculation.

Step 7 – Add Modified Current Prediction To Amount of Change
The two previous steps have results in the infinity range. We add them together to get the new value in the infinity range.

0.4055 + (−2.5) = −2.0945

As a side note, the reason we only have one equation here is that all
of the points in this group have the same current prediction. In
reality, we should do this step and the previous step once per data
point, but we don’t need to for this group since all the data points
have the same value.

Step 8 – Convert Back To Zero To One Range
Converting that value back to the zero to one range using the Expit function

Expit(−2.0945) = 1 / (1 + e^(2.0945)) = 0.1096

So the new current prediction for this group is .1096


For the first grouping, we saw the current prediction go from .6 to
.8882. For the second grouping, we saw the current prediction go
from .6 to .1096. Both changes were substantial improvements for
those groups. However, since neither group is purely 0.0 or 1.0,
there are additional improvements that could be made in later
boosting iterations. (Note, the actual current prediction can never
actually reach 0.0 or 1.0 due to the asymptote when converting from
the infinity range to the zero to one range. However, it can always
get closer.)
Step 5- Amount of Change Calculation For Zone 1
We now repeat the previous steps for the third and final group.
Unlike the other two groups, zone 1 has data from both categories.

There are 25 points total all with a current prediction of .6


8 points have a true value of 0 and a current error of -0.6
17 points have a true value of 1 and a current error of 0.4
This is the first time that there have been multiple values of current error in the calculation, but the process will remain the same: we will sum up the individual results of each data point.
The numerator is

Numerator = 8 × (−0.6) + 17 × 0.4 = −4.8 + 6.8 = 2.0

Since this value is positive, the points will all move up. All the
points will move up, even though some of them have a true value of
zero. Those points will actually have more error after this boosting
iteration than they do now, and later iterations will have to improve
them. The direction of movement for a group is decided simply by
the sum of all the current errors. If that sum is positive the points
will move up. If that sum is negative, the points will move down.
However, if points mostly cancel each other out, the numerator will
be a small value so the change will be small.
The equation for any given point in the denominator is the same as
one of the points from either zone 2 or zone 3. The current
prediction for each of the points is 0.6, and there are 25 points in all.

So the denominator value for a single point is

0.6 × (1 − 0.6) = 0.24

And the sum of all 25 points is

25 × 0.24 = 6.0

The amount of change is

Amount of Change = 2.0 / 6.0 = 0.3333

Notice that this is the smallest change of all three zones. This is
because this was the only zone that was not purely of one category,
so some of the errors canceled each other out.

Step 6 – Convert The Current Prediction Into The Infinity Range, Becoming “Modified Current Prediction”
The current prediction of 0.6 in the zero to one range becomes .4055 in the infinity range. This is the same calculation as the previous two zones because the current prediction is the same as it was in the previous two zones (since it is still the first boosting iteration).

Step 7 – Add Modified Current Prediction To Amount of Change

0.4055 + 0.3333 = 0.7388

Step 8 – Convert Back To Zero To One Range
Changing back to the zero to one range

Expit(0.7388) = 1 / (1 + e^(−0.7388)) = 0.6767

So the new current prediction for all the points in zone 1 is .6767
For some points, that is an improvement, for others, it is worse than
it was before this iteration.
We have now generated the new current predictions for each of the three regression groups, so we have completed boosting iteration number 1. The end result after this boosting iteration:
Zone 3 moved up a lot
Zone 2 moved down a lot
Zone 1 moved up a little bit
Of course, these results just become the input to the second boosting
iteration.

Going Back To How The Regression Decision Tree Worked


At the start of this boosting iteration, we showed the results for what
groups the data was broken into by the regression decision tree but
skipped over exactly how it was done in the interest of remaining
focused on the boosting calculations themselves. But before looking
at the second boosting iteration, let’s look back at the chart for this
first iteration and see how the groups were generated and answer the
question of, why are there 3 groups, not 4?
This is the chart of the data with the true values and the current
prediction going into iteration number 1.
The groups that the points were split into were generated using a
regression decision tree with depth 2. A depth 2 decision tree can
have up to four groups since the first split generates two groups, and
the second set of splits generates two groups from each of the first
two. Remember, the regression tree operates on the current error,
which is the true value minus the current prediction. Here is a chart
of the current error at the start of this boosting step.

Just like we saw in the section on regression boosting, the objective of the decision tree is to split the data into groups which minimize summed squared error.
The first thing that occurred in the regression tree was to split the far rightmost points off into their own group (which I later labeled zone 3). That initial split made a big improvement in zone 3 because its
regression line could exactly match all of the points, and it made a
smaller improvement in the zone to the left of the split. That first
split and the regression lines that resulted from that first split are
shown below.
It is difficult to see the regression line on the right, in what became
zone 3 between x values of 5 & 6 because it falls exactly on top of
those points. (There is an arrow pointing to it). That regression is
perfect and no further improvement can be made for that group.
Obviously the same cannot be said for the regression to the left of
the split. Even though the actual regression value isn’t used, the fact
that the horizontal line is so far off from all of the points
demonstrates that it is not a good grouping.
Then there is the second level of splits. The group on the left gets split again and becomes what I have labeled zone 1 & zone 2. These splits and regression lines are shown below.
This split made a big improvement in the new zone 2, because its
regression line is a perfect match, and the second split makes a
smaller improvement in the new zone 1. However, the regression
line for zone 3 was already a perfect match. There was no benefit
to be had by splitting it, so the end result was 3 regression lines
creating 3 zones.
The two rightmost zones are purely a single category. That is why
we saw so much improvement in these groups when we calculated
the boosting values. The leftmost zone is a mix of categories, and
we end up seeing very little improvement.
The fact that we got 3 groups out of this depth 2 decision tree
instead of 4 is an example of how the decision tree optimizes the
results at every split, which does not necessarily give the optimum
regression. If you could have manually picked the splits, you could
have done it differently and gotten 3 zones that were a perfect match
and a fourth zone that had data from both categories. That would
have resulted in a little bit more improvement from this boosting
iteration than we ended up getting. However, there is not much to be
done about how the regression tree chooses the splits. There are a few tunable parameters that we will discuss in the section on parameters, but those will not affect the global vs. local optimum issue we encountered.
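The split-selection behavior described above can be sketched in Python. This is a simplified, hypothetical version of how a regression tree might choose one split on one feature by minimizing summed squared error; real tree implementations are more elaborate, but the greedy idea is the same.

```python
def sse(values):
    # Summed squared error of a group around its mean (the group's regression value)
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, errors):
    # Try a split halfway between each pair of adjacent x values and keep
    # the one that minimizes the total summed squared error of both sides.
    pairs = sorted(zip(xs, errors))
    best_total, best_threshold = float("inf"), None
    for i in range(1, len(pairs)):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [e for x, e in pairs if x < threshold]
        right = [e for x, e in pairs if x >= threshold]
        total = sse(left) + sse(right)
        if total < best_total:
            best_total, best_threshold = total, threshold
    return best_threshold

# Tiny illustration: current errors of -0.6 on the left, +0.4 on the right
print(best_split([1, 2, 3, 4, 5, 6], [-0.6, -0.6, -0.6, 0.4, 0.4, 0.4]))  # 3.5
```

Note that, just like the book's example, the chosen threshold lands halfway between two training points rather than at the "true" boundary, which is the source of the misclassifications discussed later in the cross-validation section.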
The Next Boosting Iteration
Fortunately, even though we didn’t get perfect results after this
iteration, this is a boosting analysis, so we can just take whatever
error we get out of the first iteration and see what improvements can
be made in boosting iteration number two.

Boosting Iteration # 2
With boosting you typically keep adding iterations until the results
stop improving or you run out of computing resources. For this
problem, there are obvious improvements left to be made, so let’s
look at the second boosting iteration.
Once again the chart below is set up to show the true values as black
triangles, and the current prediction coming into this boosting
iteration as red diamonds. The red diamonds in this chart match the
blue circles on the boosting iteration 1 chart we saw previously.

This chart shows the data broken into three zones. This was done by
the regression decision tree in iteration 2. However, these three
zones are not the same zones as boosting iteration 1 had. The reason
we get different groups is that we have different error going into
iteration 2 than we had for iteration 1.
Zone 1 purely consists of zeroes.
Zone 2 purely consists of ones.
Zone 3 contains a mix of the two types of data.
The amount of change calculations that are performed are the same
as for the first boosting iteration, and they are shown below in a
more compact table. This table does all the same calculations that
we walked through for each group in the first iteration. See below
the table for an explanation of how it is set up.

In the table
There are 6 columns of data because there are 6 different
blocks of data in our data set. I.e. all the numbers between
0-1 all have the same true value, as do the numbers
between 1-2 etc.
The row labeled Step 3 is the current error at any given
data point
Step 4, splitting the data into groups isn’t explicitly
labeled, but is shown with the thick black lines splitting up
the columns
The block of rows labeled step 5 calculates the amount of change for each zone, using the same equations we saw before. The rows labeled numerator or denominator are the sum value for a given column. (I.e. the number of points in that column multiplied by the current error for the numerator, or the number of points in the column times (Current Prediction) × (1 − Current Prediction) for the denominator.)
The total numerator or total denominator adds together
multiple columns within a given zone. The amount of
change is the ratio of those two values.
The row labeled Step 6 converts the current prediction into
the infinity range
The row labeled Step 7 adds that with the amount of
change in that zone
The row labeled Step 8 converts back to the 0-1 range
In the first boosting iteration we had a big improvement in the data
on the right side of the chart, zones 2 and 3, and not much change to
the data on the left. For this iteration, we are getting a big
improvement in the left groups and not much change on the right.
We see that result in the chart below.
Boosting Iteration # 3
Boosting iteration number 3 takes the input from boosting iteration number 2 (red diamonds) and continues to improve it. The math will obviously be the same as the previous boosting steps, with different
values and groups. However, there are a couple of points worth
highlighting in the chart below.

The first is in Zone 2 and Zone 3. The previous boosting iterations focused on improving the two edges of the problem because those
pieces of the data were easiest for the decision tree algorithm to peel
off. However by boosting iteration 3, the edges of the data are
already mostly correct, but the middle of the data still has large
errors, as is shown in the chart below.
Remember that the boosting algorithm is set up to focus on the largest remaining error in the problem; it does that by trying to minimize summed squared error in the regression decision tree. This
is why this boosting iteration ended up with 4 groups. The first split
that was put in was the split in the middle, splitting zone 2 from zone
3 shown above. Since the zone 2 and the zone 3 data both had large
errors, but they were errors in different directions, they were split
from each other.
After the split down the middle, the left half and the right half are
effectively separate problems. As a result, the regression tree took
the left half of the problem and split zone 2 from zone 1, and it took
the right half of the problem and split zone 3 from zone 4. This is
shown below.
Zone 2 and Zone 3 both came into this boosting iteration with large
error. As a result, the regression tree algorithm peeled them off from
the rest of the problem. Since those groups are purely of one type
after they were peeled off, there was a lot of improvement gained
from this boosting iteration.
Zone 1 and Zone 4 got very little improvement out of this boosting
iteration because both of them had data that was very mixed
together. As a result, some of the data points got better and some
got worse in those zones.
The calculations for this iteration are shown below
This is the chart for boosting iteration 3 just to visualize the results
after this step, which are the blue dots.

After 3 boosting iterations, every data point has a large amount of improvement relative to its initial starting value of .6. Since there are no erroneous data points in this graph, we could continue with
additional boosting iterations and continue getting improvements in
these results without overfitting. However, it would be repetitive to
continue showing the calculation for more boosting iterations since
it is just more of the same. Instead, let’s see how we would use this
model to predict values for data that we don’t know.
Predicting With A Model
We just saw how the boosting model was fit using data that we
know. In Python, the top-level code that you would use to fit this
classifier would be something like
clf.fit()
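Fleshed out, a minimal version of that fit might look like the sketch below, using scikit-learn's GradientBoostingClassifier. The data and parameter values here are illustrative stand-ins, not the book's actual run.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in for the training data: one feature (e.g. Max(X, Y))
# and a 0/1 category for each point
X_train = [[0.5], [1.5], [2.5], [3.5], [4.5], [5.5]]
y_train = [0, 1, 0, 1, 0, 1]

# max_depth and n_estimators match the kinds of values discussed in this chapter
clf = GradientBoostingClassifier(max_depth=2, n_estimators=10)
clf.fit(X_train, y_train)

# Predict the category of a new point
print(clf.predict([[5.2]]))
```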
But now that you have a trained model, how do you use the
classifier to predict the category that new data falls into?
After you have fit a model with training data, it is fairly
straightforward to get results for new data. The way it works is
1. Start with the same initial prediction as was made for the
training data, in the infinity range. (.4055 in this case)
2. Use the regression trees that were created when the model
was fit (they were saved automatically)
3. Pass each test point through all the iterations of the
decision trees. Keep track of which leaf each point ends
up in for each tree. Whatever amount of change resulted
from that leaf when the model was created, apply that
same amount of change to that test data point.
4. Use the final resulting value to determine which category
the point ends up going in
It is actually quite a bit easier to test new data on a boosting
classifier that has already been generated than it was to fit it in the
first place. When the boosting classifier was made the trees at each
boosting stage were saved, and so were the amounts of change that
resulted at each leaf on those trees. With new data, it just needs to be
passed through each of the regression trees to determine what leaf it
ends up at, and that amount of change summed up for each iteration.
Interestingly, when testing with the tree we don’t need to convert
back and forth between the two numbering ranges at each iteration
like we did when we generated the trees. We just need to use the
initial prediction in the infinity range and sum up all the amounts of
change from the appropriate leaf nodes in all of the boosting
iterations. Those amounts of change are all already in the infinity
range. Then after all the boosting iterations are complete, convert
back to the zero to one range a single time at the very end. The
reason we are not converting back and forth when we are testing is
that we are not calculating the current error, which is in the 0-1
range.
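This prediction procedure reduces to just a few lines of Python. The initial value of .4055 is the one from the book; the per-iteration amounts of change below are placeholders for whatever values the saved leaf nodes would supply for a given test point.

```python
import math

def predict_probability(initial_infinity, leaf_changes):
    # Sum the initial prediction and every iteration's amount of change
    # in the infinity range, then convert to the 0-1 range once at the end.
    total = initial_infinity + sum(leaf_changes)
    return 1.0 / (1.0 + math.exp(-total))  # Expit

# Hypothetical leaf changes for one test point over three boosting iterations
p = predict_probability(0.4055, [1.6667, 0.2, 0.1])
print(round(p, 4))
```

The single Expit call at the end is the only conversion between the two numbering ranges, which is exactly why testing is so much cheaper than fitting.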
In this case, there are 6 different blocks of data. So we can keep
track of the values of change for each of those blocks, and then
know what the resulting value of any point we test will be. Those 6
blocks are based on the training data, so they won’t exactly match
what the actual true values of the design space should be. For
instance, one of the blocks of data should contain any data point
with a Max(x,y) value between 5.0 and 6.0. However since we
created the splits halfway between actual data points, the block
actually spans between 4.957 and 6.0. This is shown in the table
below.

If you want to test a new point based on the classifier that we fit, it is
as simple as finding which block it would fall into (i.e. look at the
second row of the table), take the starting value of .4055 and add the
amount of change from each of the iterations. That results in a final
value in the infinity range, which can be converted to the 0-1 range
for a final prediction.
For the purposes of the book, it was easiest to present this data as a
table, since there are only 6 blocks of data. (Resulting from the fact
that our original plot only had 6 red and blue stripes). However, the
actual software will save the same data by saving the splits that
generated these results from each regression tree, as well as the
amount of change that was derived at each leaf node.
To the extent that there is a disconnect between the numbers in the
first row in the table above, which is how the data should have been
broken up, and the second row in the table above, which is how the
data was actually broken up based on the training data, we will have
some errors. Those errors are expanded upon in the following
pages.
Cross Validation
Initially, we set aside 40 points from our training data out of the
starting 100 points. We can use those 40 points to measure how
good the boosting result was. The metric that I am using as a
measure is mean squared error, which works OK for this 2 category
problem but isn’t very good for classification in general. MSE
means that if the true value for a point was 1.0, and we predicted
0.9, the error for that point would be 0.1, and the squared error
would be .01. Since it is Mean Square Error, it is the average of the
squared error across all points. The mean squared error values
shown below are for the cross-validation data.

What we end up seeing is a good improvement in the results at the beginning, and then a plateau at around .075 MSE. That means the average error in our prediction is the square root of .075, or approximately .274.
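The metric itself is simple to compute by hand. As a sketch, with illustrative values rather than the book's actual 40 cross-validation points:

```python
import math

def mean_squared_error(true_values, predictions):
    # Average of the squared differences across all points
    squared = [(t - p) ** 2 for t, p in zip(true_values, predictions)]
    return sum(squared) / len(squared)

# Illustrative values: true categories vs. predicted probabilities
y_true = [1.0, 0.0, 1.0, 0.0]
y_pred = [0.9, 0.1, 0.6, 0.4]

mse = mean_squared_error(y_true, y_pred)
print(round(mse, 3), round(math.sqrt(mse), 3))
```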
Having cross-validation data is always a tradeoff. It means you have
less data for training, but that you can evaluate which results are
incorrect. If we plot the data points against Max(X, Y), and cheat
by shading the results with the values that we know are correct, we
can see why the mean squared error stopped improving.
The arrows in the plot below point to data points which were
incorrectly classified.

What happened was some of the points that were on the boundary
between two zones were improperly classified. The cause of this is
based on where the decision trees picked their splits. They put a
split mid-way between the data that they had in the training set. For
instance, one true split should have been at Max(X, Y) equal to 3.0,
but instead, the split was at 3.167. This is because the algorithm did
not have data at the exact boundary. As a result, the boosting
algorithm placed its splits in slightly incorrect locations. So any
testing data that has values between 3.0 and 3.167 will get
incorrectly classified. The same is true, to some extent, at all of the
splits. The table below shows where the splits were placed vs
where they should have been placed.

As you can see, every block of data has some error between where it
should have been split, and where it actually was split. For most of
the blocks, that error is small, for instance splitting at .986 vs. 1.0.
However, some of the discrepancies are larger, which makes it more
likely that points will fall in that area and get misclassified.
Truthfully it is unlikely that a human could do very much better given the 60 data points that were fed into the boosting algorithm.
If a person were presented with the same data, on an unshaded and
unlabeled graph and asked to draw lines splitting groups, they would
likely have some error as well.
Back To The 2D Example
In the previous example, we worked through a set of boosting
iterations with a single feature as input. That single feature was the
maximum of two other features in order to simplify the data as much
as possible. But recall that what we really were trying to solve was
the 2D example shown on this chart

So how is this 2D problem different than the 1D simplification that we already did? By and large, the problem isn't different at all.
These aspects of the boosting problem are all completely the same
The regression tree still operates on the current error
The regression tree is still used to turn the points into
groups
The same calculation for the amount of change with the
two numbering ranges is performed.
It turns out the only difference is how the regression tree generates
the groups. With a single feature, a regression tree can only split on
that feature, which we showed as a vertical line on the plot. With
two features we can have a split on either of them. By default, the
regression tree will examine both features and find the best possible
split. If we have a tree with a depth of two, a likely split pattern on
the 2D data is shown below

The first split makes the right column of blue squares into its own
group. The second split, on the left side of the first split, makes the
top row of blue squares into its own group. Notice that there is no
second split on the right side because that is already only one
category.
After the splits, the boosting equations would work the same as we saw in the 1D example. Each of the tree's groups would calculate the amount of change to apply to that group using the same equation we saw before, just with different groups.
One thing to know is that with the tree depth of 2, we won’t be able
to get really good groups for this 2D problem. 2 splits are not
enough to peel off the middle sections of the data. 2 splits would be
enough to peel the edges of the data, like we showed above or like
the bottom left corner, but not the middle.
To get the middle of the graph into its own category would require a
depth of 3 or more and look something like this
This would effectively split the red triangles between X values 4 & 5
into their own category. (Note, the horizontal split line number 3 on
the left side of split number 2 would probably be higher, between the
top blue group and the red group. I put it in a different spot to show
that it is a unique split from the split line number 3 to the right of
number 2)
Now let’s say that you don’t have enough fidelity in the regression
model to cut out a pure section from the middle. What would
happen?
The boosting could still get good results. For instance, by focusing
several boosting iterations on the outside edges of the data until the
error there is very low and does not affect the results very much, and
then moving inside slowly.
More realistically what would likely happen would be that the
regression tree would split off rows that were as pure as possible in
some iterations, and then in other iterations make columns that were
as pure as possible to correct any resulting error from the rows.
With a regression depth of 2, the chart below might be the best you
can do for generating groups based on columns in the center of the
data.
This isn’t ideal, and would still have some errors. However, those
errors would be different errors than if you did the same thing based
on rows like the chart below
By combining those results, and other results that were as good as
possible, over multiple iterations, the boosting based on a decision
tree depth of 2 still could get pretty good answers. However, just like we saw with the sine wave regression at the very beginning of the book, when there was only 1 split in the regression tree, the results with an inadequate tree depth would take more iterations and might end up being not as good. (Sine wave chart from the first
example shown again below to highlight limitation of inadequate
decision tree)
The Predictions With 2 Features
Recall that there are 60 data points in the training data below since
we initially started with 100 points and kept 60 of them to do the
training on.

After we have fit the classifier, we can cross validate with the other
40 points. This is what you would do if you did not actually know
what the design space is. You would use the classifier to predict the
value of the 40 points you do know and compare those results
against the true values of the 40 points. You can then use a metric
such as mean squared error on the cross-validation to dial in the
parameters, such as tree depth or number of estimators.
However, we actually do know what the design space is since we
have plotted the red and blue stripes. When we use a classifier to
make a prediction over the entire design space (by making a
prediction every .02 spaces apart, and then using the 90,000 predictions to shade the graph) this is the result that we get.
This result was generated with a max tree depth of 2 and 10 boosting
iterations.
This result is not the worst, but it is not the best. You can see some
of the two-dimensional stripes coming through, however, it clearly
doesn’t exactly match the actual design space. There might some
improvement to be had from changing max depth to 3, and using 20
boosting iterations, however, those results turn out ambiguous as
well. That is shown below
What we are really seeing is that the design space using only 60
points is not very well defined. In the initial plot of the 60 points in
the training data, there are wide areas that do not have any data in
them. Those areas are shown by the black boxes below.
It is in those areas that we are getting results that are not matching
our actual design space. For this example, the easiest way to fix
that is to use more data. Instead of starting with 100 points and
keeping 60, we can start with 300 points and keep 180. This results
in a much more populated design space as shown below
When this data is used as the input, the end result is that the
prediction with 20 boosting iterations and a Max Depth of 3 ends up
being quite good.
The challenge for real world problems is that you usually can’t get
more data and have to make do with the data that you have. Finding
the troublesome areas in the fit model is usually done with cross-
validation. Improving those areas ends up depending on the skill of
the data scientist and the problem under consideration. For this
problem, one of the biggest improvements that could be made would
be to collapse the two features X and Y into one feature of MAX(X,
Y), which is the example that we showed first.
3 Or More Features
We simplified the example problem to have one feature to start
with. That was mainly just to make the plots and graphs
easier. Then we showed an example with two features and
saw that the only real difference was that the regression tree
had more complicated splits. There is fundamentally no
difference between 2 feature data or 3 feature data. If there
was 3D data, the regression trees would break the data into
groups based on all three features. After that, the process
would be exactly the same. In each group, the algorithm
would count how many of each category there were, how large
the error was for each data point, and apply the same equation
to adjust the values.
If we were to continue to increase the number of features the
only difference would be how the regression tree breaks up the
data. (And the time it would take to generate the trees would
increase linearly with the number of features.)
Boosting With 3 Or More Categories
Up until now, we have done boosting for classification with only
two categories, red and blue. Now we will look at how to do
classification with 3 or more categories. How can we change our
boosting process to work for 3 or more categories?
The obvious, but incorrect, solution might be to use additional
values in the regression part of the algorithm. I.e. instead of using
only 0 and 1 as the true values, use 0, 1, and 2. However, there are
multiple reasons that wouldn’t work, including our inability to
directly map the infinity range onto a category that goes up to 2.
Because of this, classification with 3 or more categories is somewhat
more complicated than it was with 2 categories. Our old tricks with
the numbering ranges don’t quite work. Where do you put the third
category? Or the fourth one?
Instead, we are going to do something a little bit different. What we
are going to do is boosting that is similar to what we did for two
categories, but we are going to do it multiple times for every
iteration. For each of those multiple times, we will re-categorize the
data into only two categories. One will have all the data from
exactly one of the original categories, and the other new category
will have the data from all the rest of the categories.
This means we will be using a lot more decision trees and taking a
lot more steps. If we have 10 boosting iterations trying to
distinguish 7 categories, we will end up using 70 trees. (side note –
the 2 category process that we showed earlier only used one tree
total per boosting iteration, not two. It was a special case)

Multi-Category Boosting vs. Two Categories


For the most part, the boosting process with multiple categories is
the same as with two categories.
The big differences are
Although we still have two numbering ranges, zero to one
and an infinity range, we won't use Expit and Logit to
convert back and forth. Instead we will do all the work in
the infinity range, and then convert it to the zero to one
range using a different normalization equation (and then
not convert it back).
Every boosting iteration will have multiple trees equal to
the total number of categories. Each tree will compare a
single category to all other categories.
A couple of smaller differences are
The initial values are in the infinity range, not zero to one
range.
The equation to determine the amount of movement
(numerator/denominator) gets one more term related to the
number of categories. (the numerator and denominator
sub-equations stay the same)

A High-Level Example Of How To Do 4 Categories


If we had two categories, A and B, then every iteration the algorithm
would adjust each data point closer to 1, representing A, or 0,
representing B. With 4 categories, A, B, C, D we will have another
loop within each iteration. Each of those sub-loops will have their
own current prediction and error. I.e. instead of each data point
having one current prediction, each point will have an array of
current predictions that is 4 numbers long (since there are 4
categories).
On the first pass through the sub-loop, we would set the true value
of every data point in category A equal to 1, and the true value of
everything else equal to 0. Then the algorithm would adjust the first
current prediction closer to 1 or 0.
On the second pass through the sub-loop, we would set the true
value of every data point in category B equal to 1, and the true value
of every other point equal to 0. Then the algorithm would adjust the
second current prediction (which is different than the one for the
first sub-loop) closer to 1 or 0.
We would do the same thing a third time for category C, and a fourth
time for category D. At the end of the first iteration, we would have
gone through 4 sub-loops and each data point would have modified
each value in its length-4 array of current predictions a single time.
One thing that becomes obvious is that boosting for a large number
of categories will take a lot of trees and hence a lot of time.
In the flow chart below, the section in the hashed box is the section
that gets repeated for every iteration, for every category.
Some Important Math For Multi-Categories
Before going through an iteration of boosting with three categories,
there is a little bit of math to explain, just like we did with the Expit
and Logit before the two-category boosting. The purpose of this
math is to answer the question “How do you generate the current
error for multiple categories?”
We still have the same issue that we had for the two category
problem. We still need an infinity range so we can add or subtract
from it an unlimited number of times, and we still need a way to
convert that infinity range into a zero to one range so that we can
subtract from our true values and get current error. However, the
additional issue that we have is that each data point will have
multiple values for its current prediction, and we need to keep those
values correctly proportioned after we convert to the zero to one
range. To account for multiple values we will not be using Expit and
Logit, but instead, will be using a different method.
Let’s say that you have 3 categories, A, B, and C, and your current
prediction for the data point in question for those three categories is
[-1.0, 7.0, 9.0]
Note, this is in the multi-category equivalent of the infinity range,
which isn’t used exactly the same as it was with the two categories
but is pretty similar. Just like we saw in the two category example,
the larger the value, the more strongly we are predicting a data point
to be in that category. However what really matters, in this case, is
how large a given value is in relation to all the values in the other
categories.
Suppose that the true value of this data point is category C. Based
on the values above, we are currently predicting category C since the
value of 9.0 is the highest in the array. But how strong is that
prediction? The 7.0 in category B seems pretty close.
The way we are going to answer that question is by using the natural
logarithm of the sum of the exponentials. That is, raise e (the
constant with the value of approximately 2.7182) to the power of all
of those numbers, in turn, take the sum of that result for all of the
numbers, and then take the natural logarithm of that sum and use
that result to normalize the values from each of the categories.
That math is likely not very intuitive to you. So before getting into it
let’s do an analogy from geometry.

Using Geometry As An Analogy To The Exponential & Logarithm Equation

Let’s say you have two distances in orthogonal directions (i.e. at right
angles to each other, X and Y). You want to determine how large a
single distance is compared to the total distance if you were to travel
both of them simultaneously. In other words, how does the distance you
would travel in X compare to the total distance?

What do you do here? You find the magnitude of the total distance
by taking the square root of the sum of the squares.

In two dimensions this is commonly known as the Pythagorean
theorem. This also works in multi-dimensional space. If you had a
distance in X, Y, and Z, the equation for total distance is

Total Distance = sqrt(X^2 + Y^2 + Z^2)
The process is
Square each component
Sum the squared values
Take the square root of the sum
If you wanted to find the difference between X and the total distance
you can subtract the two of them. You can use that difference
between X and the total distance as a metric for how sure you are
that X is the main direction you traveled in. If the difference
between X and the total distance is small, then X is the dominant
direction.
Analogy Summary
The process we will use for multi-category classification
is the logarithm of the sum of the exponentials
This is similar to using the square root of the sum of the
squares that we do in geometry.
For both of them, you do an operation on each component
(exponential or square), take the sum of the results, and
then do the inverse of the original operation (logarithm or
square root).
Both of those processes are a way of normalizing an
arbitrary number of orthogonal options. In geometry, you
might have three mutually exclusive directions, X, Y, and
Z that are all at right angles to each other. Here, for
classification, we might have three mutually exclusive
categories, A, B, C.
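The parallel between the two normalizations can be shown in a few lines of Python (the numbers are arbitrary; the helper names are my own):

```python
import math

def root_sum_square(components):
    # Geometry: square each component, sum, then invert with a square root
    return math.sqrt(sum(c ** 2 for c in components))

def log_sum_exp(components):
    # Classification: exponentiate each component, sum, then invert with a log
    return math.log(sum(math.exp(c) for c in components))

distance = root_sum_square([3.0, 4.0])      # classic 3-4-5 triangle -> 5.0
magnitude = log_sum_exp([-1.0, 7.0, 9.0])   # the example array used shortly
print(distance, magnitude)
```

Both functions collapse several orthogonal components into a single magnitude; only the operation and its inverse differ.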

An Example Of Logarithm of Sum of Exponential


Going into this equation we will have an array of current predictions
based on the process we discussed where each category will do a
boosting iteration vs. every other category. We will use this
equation to turn the current prediction into current error so that we
can use that current error in a regression tree. We will calculate the
current error by normalizing the values using the logarithm of the
sum of exponentials, and then subtracting the normalized result from
the true value.
In equation form it is

Current Error = True Value – Normalized Current Prediction

where Normalized Current Prediction = e^(Current Prediction – ln(sum of e^Current Prediction))
And one thing to be aware of for multi-category classification,
which is different than what we did for 2 category classification, is
that for every data point, all of the variables in the equation above
are actually arrays with a length equal to the number of categories.
I.e. if you are trying to predict between three different categories,
then for each data point the true value will be an array that is of
length three. It would have a value of 1 in the location matching the
actual category for each data point, and a value of 0 in the other
locations. The current error and current prediction would each also
be of length 3, with values that depend on the results of the
boosting iterations.
For this example let’s say that our true value is [0, 0, 1] i.e. that
the data is actually a member of category 3.
Now let’s say that our current prediction after some number of
boosting iterations is [-1.0, 7.0, 9.0] That current prediction
obviously isn’t in the 0 to 1 range that we need in order to calculate
the current error, so we need to normalize the current prediction. We
will use the logarithm of the sum of the exponential method we
described earlier.
The table below shows us taking the exponential of each of the
current predictions, summing those exponentials, and then taking the
natural logarithm of that sum.

Current Prediction:      -1.0      7.0       9.0
Exponential (e^x):        0.37     1096.6    8103.1
Sum of exponentials:      9200.1
Natural log of sum:       9.127
The value that we use from the table above is the natural logarithm
of the sum which is 9.127. We use it by taking the difference
between each current prediction and the natural log of the sum, i.e.
(-1 – 9.127), (7 – 9.127) and (9 – 9.127)
Since the natural logarithm of the sum will always be greater than
each of the current predictions, all of the resulting values in the table
above are negative. The next step is to take the exponential of each
of those results (i.e. raise e to the power of -10.127, -2.127, and
-0.127)

Difference from log of sum:     -10.127    -2.127    -0.127
Exponential (e^x):                0.000     0.119     0.881
This results in values that will always be between zero and one and
can be subtracted to find the current error for this data point.
(Conveniently, the sum of the values in the table above will always
be 1.0). The final step of subtracting the prediction above from the
true values to get the current error is shown below.

True Value:                  0         0         1
Normalized Prediction:       0.000     0.119     0.881
Current Error:               0.000    -0.119     0.119
Notice that since the sum of the current prediction for any given
point in the 0-1 range is always 1.0, and the true value is always an
array that has all zeroes except for a single 1, the sum of the
current error for any given point will always be exactly 0.0.
Although we found the current error for all three categories in the
array in the table above, the algorithm would typically only process
a single category based on where it was in its sub-loop. So if it was
focusing on the third category, the single piece of data used from the
table above would be that this data point had a current error of .119.
This current error result gets used to fit the regression tree. Like we
have seen for all the regression trees, the regression tree will use the
current error that we just generated and attempt to group data points
with similar current error together by splitting based on the available
features.
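The worked example above can be reproduced in a few lines of Python. The exact decimals come out slightly different from the book’s hand-worked tables because of rounding:

```python
import math

true_value = [0, 0, 1]                  # the data point is actually category 3
current_prediction = [-1.0, 7.0, 9.0]   # infinity-range current prediction

# Natural log of the sum of the exponentials
log_sum = math.log(sum(math.exp(p) for p in current_prediction))

# Normalize: subtract the log of the sum, then exponentiate (sums to 1.0)
normalized = [math.exp(p - log_sum) for p in current_prediction]

# Current error = true value minus normalized prediction
current_error = [t - n for t, n in zip(true_value, normalized)]

print(round(log_sum, 3))                    # 9.127
print([round(n, 4) for n in normalized])    # [0.0, 0.1192, 0.8808]
print(round(current_error[2], 4))           # 0.1192
```

The third value of `current_error` is the single number this sub-loop would feed into the regression tree.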

A Side Note On The Logarithm & Exponential Process We Just Did
The method above is how the math works, but there are a few pieces
of mathematical beauty worth highlighting
The first is that the results for the normalized current
prediction will always add up to 1. For instance, the
numbers we input of [-1.0, 7.0, 9.0] were chosen without
any special reason. But the result of [0, .119, .881] sums
to 1.0. This will be true for any numbers you pick, for any
number of categories.
The second interesting point flows from that. If you have
two equal numbers for the input you will get equal output
(unsurprisingly). If those numbers are the largest of the
categories you will get two results that are pretty close to
one-half.
For instance, [-1, 9, 9] would have become approximately [ 0, .5,
.5] So this normalized current prediction is pretty much the current
probability of a value being a certain category.
If you had 3 equal numbers, for instance, 5, 5, 5 your results would
all be one-third.
The objective of the boosting process is to make the value for one
category much larger than the values for the other categories.
Ideally, one category is a large positive number and the others are
large negative numbers.
If you have one number that is a fair bit bigger than the others, the
prediction becomes nearly one for the big category, and nearly zero
for all the others, which is what you are going for.
[Note, obviously “a fair bit” is a loose term. A more precise value
would be a category that is at least ~6 larger than the next largest
category tends to dominate. Not 6 times larger, just the value of
approximately 6 or more. i.e. categories with predictions of 2, 14,
20 would be dominated by the 20 because it is at least 6 larger than
the 14. Play with different numbers in the spreadsheet you can
download here http://www.fairlynerdy.com/boosting-examples to
see for yourself]
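This dominance behavior is easy to check numerically. A short sketch (the `normalize` helper is my own name for the log-sum-exp normalization described above):

```python
import math

def normalize(predictions):
    # Log-sum-exp normalization: values in each position become probabilities
    log_sum = math.log(sum(math.exp(p) for p in predictions))
    return [math.exp(p - log_sum) for p in predictions]

print(normalize([5, 5, 5]))     # three equal inputs -> each about one-third
print(normalize([-1, 9, 9]))    # two equal leaders -> approximately [0, .5, .5]
print(normalize([2, 14, 20]))   # 20 is 6 larger than 14, so it dominates
```

With inputs [2, 14, 20] the largest category ends up with a normalized prediction above 0.99, which illustrates the "at least ~6 larger" rule of thumb.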
A Full Example Of 3 Category Boosting
For the 2 category boosting we used 2 colors of stripes

Generated with this equation

Let’s keep the same data as we used for the 2 category classification
example, except instead of assigning them to category 0 and 1 using
the modulus 2 we will assign them into 0, 1, 2 using the modulus 3.
We then assign the colors as
0 = Red triangles
1 = Blue squares
2 = Green stars
That results in the data below
Once again we have 60 data points to do the boosting on. Since we
saw earlier that reducing the two dimensions to one dimension
worked well for the 2 category problem, we will do it again for this
data and get this chart
Now that we have the data we can start the boosting routine.
Multi-Category Boosting Process
The flow chart for how multi-category boosting works is shown
below

The outline of the process is


1. Determine “Initial Prediction” based on the weighting of
each category. The “Initial Prediction” for each point
will be an array with a length equal to the number of
categories. The resulting value is in the infinity range.

From this step onward we will repeat for every boosting iteration.
Think of this as an outer loop.
From this step onward we will also repeat one time for each
category. Think of this as an inner loop.
2. Determine what category we are focusing on based on where
we are in the inner loop. Assign every data point that is part of
this category the value of 1. Give every other data point the value
of 0. This is the “True Value”
3. “Normalize” the current prediction (which is just the initial
prediction the first time) using the log of the sum of the
exponential method. This converts the current prediction, which
is in the infinity range, into values in the 0-1 range. For each
point, the current prediction in the infinity range is an array that
has the same length as the number of categories. The current
prediction in the 0-1 range is also an array of the same length.
However, we only care about a single value in that array which
corresponds to the category we are analyzing from the inner loop.
Call that single value the “Normalized Current Prediction”
4. Subtract the “Normalized Current Prediction” from the “True
Value” to calculate the “Current Error” for every data point. (Do
this only for the index in the array corresponding to the current
category in the loop)
5. Use a Decision Tree regression analysis to fit a minimum
mean squared error tree to the “Current Error”. This is exactly
like we showed in the regression section. Just keep the groups
coming out of the regression tree.
6. For each group, generate an equation based on how many
points have positive error, and need to be moved up, and how
many have negative error, and need to be moved down. Account
for both the quantity of points and the magnitude of the error.
Additionally, account for the number of categories in the
boosting analysis. (this is different than the 2 category analysis)
This equation will generate either a positive or negative value in
the range from negative infinity to positive infinity. Call this the
“Amount of Change”. The amount of change is a single value in
the infinity range for each data point, not an array.
7. Add the “Amount of Change” to the “Current Prediction” in
the infinity range for each point. The current prediction is an
array for each point, only modify the value associated with the
current category being analyzed in the inner loop.
At this point, we have completed one cycle through the inner
loop. Continue going through the inner loop for each subsequent
category. Note that even though we only changed the values for
the current category in the infinity range, that change will affect
the calculation for converting the current prediction from the
infinity range to the 0-1 range for each point. I.e. the results from
each cycle of the inner loop will affect later cycles.
Once we have completed going through the inner loop for each
category, then we have completed a full boosting iteration.
8. We can either take the new “Current Prediction” and start
another boosting iteration at step 2, including going through the
inner loop again for each category, or we can be done. If we are
done, the model has now been fitted, and you can use it to predict
new data points.
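The eight steps above can be condensed into a runnable sketch. This is my own minimal illustration rather than the book’s (or sklearn’s) actual code: it substitutes a depth-1 “stump” for the depth-2 regression trees, uses a toy dataset, and skips refinements such as a learning rate.

```python
import math

def fit_stump(x, residuals):
    # Minimal depth-1 regression "tree": find the single split on x that
    # minimizes sum squared error of the residuals (Step 5 stand-in).
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    best = None
    for k in range(1, n):
        threshold = (x[order[k - 1]] + x[order[k]]) / 2
        left = [i for i in range(n) if x[i] <= threshold]
        right = [i for i in range(n) if x[i] > threshold]
        if not left or not right:
            continue
        sse = 0.0
        for group in (left, right):
            mean = sum(residuals[i] for i in group) / len(group)
            sse += sum((residuals[i] - mean) ** 2 for i in group)
        if best is None or sse < best[0]:
            best = (sse, left, right)
    return best[1], best[2]

def boost(x, y, n_categories, n_iterations):
    n = len(x)
    # Step 1: initial prediction is each category's share of the data
    shares = [sum(1 for yi in y if yi == c) / n for c in range(n_categories)]
    scores = [list(shares) for _ in range(n)]   # infinity-range array per point
    for _ in range(n_iterations):               # outer loop: boosting iterations
        for c in range(n_categories):           # inner loop: one tree per category
            # Step 2: true value is 1 for this category's points, else 0
            true_vals = [1.0 if yi == c else 0.0 for yi in y]
            # Step 3: normalize with the log of the sum of the exponentials.
            # Recomputed each sub-loop, so earlier categories' changes are felt.
            cp = []
            for i in range(n):
                log_sum = math.log(sum(math.exp(s) for s in scores[i]))
                cp.append(math.exp(scores[i][c] - log_sum))
            # Step 4: current error
            err = [true_vals[i] - cp[i] for i in range(n)]
            # Step 5: group points by fitting a regression tree to the error
            for group in fit_stump(x, err):
                # Step 6: change = (#cat-1)/#cat * numerator / denominator
                num = sum(err[i] for i in group)
                den = max(sum(cp[i] * (1 - cp[i]) for i in group), 1e-12)
                change = (n_categories - 1) / n_categories * num / den
                # Step 7: add the change to this category's score only
                for i in group:
                    scores[i][c] += change
    return scores

# Toy data: six points, two per category, separable along x
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 1, 2, 2]
scores = boost(x, y, n_categories=3, n_iterations=5)
predictions = [max(range(3), key=lambda c: scores[i][c]) for i in range(6)]
print(predictions)   # [0, 0, 1, 1, 2, 2]
```

Running it on the six-point toy dataset recovers all three categories after a handful of iterations.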
A Worked Example of 3 Category Boosting
Let’s look at the multi-category boosting for the red-blue-green
example data that we showed above.
Step 1: Make Initial Prediction
The first step is to make our initial prediction. That initial prediction
is simply the normalized count of the number of points in each
category
Out of our 60 data points, we have 9 that are category 0 (red), 27
that are category 1 (blue), and 24 that are category 2 (green). If we
divide all of those by the 60 total data points we get a starting value
of
[.15, .45, .4]
for every data point.
These are the initial predictions in the infinity range. One thing that
is different with this multi-category boosting is that we will not be
converting back and forth between the 0-1 range and the infinity
range. What we will be doing is saving the results in the infinity
range, and then converting to the 0-1 range as required, but not
necessarily saving the values in that range for later use.
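The initial prediction can be checked with a couple of lines (category counts taken from the text):

```python
# Step 1 sketch: the initial prediction is each category's share of the data.
# Counts from the text: 9 red, 27 blue, 24 green, out of 60 points total.
counts = [9, 27, 24]
total = sum(counts)
initial_prediction = [c / total for c in counts]
print(initial_prediction)   # [0.15, 0.45, 0.4]
```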

Now we will begin looping through for every iteration, and
furthermore, we will be looping through for every category.
Initially, I had planned to compress the looping through for each
category and just show all the steps for different categories
simultaneously in order to save my time writing and your time
reading this section.
I had thought that what I could do was convert the current prediction
in the infinity range of [.15, .45, .40] into the 0-1 range for all
categories, and then just do all the regression trees simultaneously.
However, as it turns out, the loop through the first category will use
the current prediction values of [.15, .45, .40] for all data points, but
the loop through the second category will use values of [ YYYY,
.45, .40] as its current predictions, where the YYYY stands for
whatever results come out of the loop through the first category.
And the loop through the third category will use results of [ YYYY,
ZZZZ, .40] as its current predictions, where the ZZZZ are the results
that come out of the loop through the second category. (Note, the
YYYY, ZZZZ values will be different for different points depending
on what groups they get split into)

Step 2: Assign True Values


Starting with the first category, we set the true value of every point
in that category to 1, and the true value of every other point to 0.
Since there are 9 points in the first category we end up with
9 ones
51 zeros
Notice that every other category except the first got grouped
together. This is because this part of this boosting iteration is
focused only on the first. The other categories will each get their
turns to be in the focus as we loop through each one.
The resulting true values are shown in the top subplot below. The
other two subplots show what the true values will get set to the
second time through the loop when we focus on the second category,
and the third time through the loop when we focus on the third
category.
Step 3: Normalize The Current Prediction To Convert From
Infinity Range to 0-1 Range
Right now all data points have the same current prediction in the
infinity range, which is the initial prediction of [.15, .45, .40]. We
need a value in the 0-1 range so that we can get current error. When
we normalize those points into the 0-1 range the result is

[.275, .371, .353]

Since we are looking at the first category at this moment we will
only use the normalized value of .275. Since all the data points have
the same current prediction at this time, they all have the same .275
as their normalized current prediction. (note that this is the only
time all the points will all have the same value. After going through
this first iteration for the first category, different data points will get
different amounts of change and as a result, will all have different
current values)
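This normalization step can be checked directly:

```python
import math

# Current (initial) prediction for every point, in the infinity range
prediction = [0.15, 0.45, 0.40]

# Log-sum-exp normalization into the 0-1 range
log_sum = math.log(sum(math.exp(p) for p in prediction))
normalized = [math.exp(p - log_sum) for p in prediction]

print([round(n, 3) for n in normalized])   # [0.275, 0.371, 0.353]
```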

Step 4: Calculate The Current Error


This is a plot of what the data looks like at this step.
All of the data points have a current value of .275. 9 of those points
have a true value of 1, so their current error is (1 – .275 = .725). The
other 51 points have a true value of 0, so their current error is
(0 – .275 = -.275).

Step 5: Use A Regression Tree To Break The Data Into Groups
Based On Current Error
Those values get fed into a regression tree, which breaks them into
groups in order to minimize sum squared error. Like previously, I
utilized a depth 2 regression tree which theoretically could have split
the data into 4 groups. However, due to the way the regression tree
works by picking the best split at every step (effectively searching
for a local optimum instead of a global optimum) the tree only
generated 3 groups in this case.
Those 3 groups are shown below
How To Calculate The Amount of Change
The equation for calculating the amount of change for 3 or more
categories is nearly the same as for two categories. For two
categories the equation was

Amount of Change = Numerator / Denominator

Where the numerator was

Numerator = sum of (Current Error)

And the denominator was

Denominator = sum of (CP * (1 – CP))

(where CP stands for the current prediction of any given point, in the
0-1 range).

For multiple categories, the only change we will have is adding one
more term to account for the number of different categories we
have. The two terms that we called numerator and denominator will
stay exactly the same, but we will include one additional term to the
overall equation. The new total equation becomes

Amount of Change = ((#cat – 1) / #cat) * Numerator / Denominator
#cat stands for the total number of categories in the boosting
analysis. In this case, we have three categories, so the new term
would be (number of categories minus 1) divided by the number of
categories, which is (3 – 1) / 3, or 2/3.

Step 6: Use The Groups To Calculate Amount Of Change


If we list the data in 6 different columns based on their values on the
X axis, with values between 0.0 and 1.0 being a data block, between
1.0-2.0 being a data block, 2.0-3.0 being a data block etc, we get the
columns below. We can then calculate the current error for each of
those columns, shown in the table below in the row labeled Step 4,
and break the data into groups, shown with the thick black lines
between columns.

We know the number of points in each of those data blocks, as well
as the current prediction and current error of each of those blocks.
Additionally based on the regression tree above, we know how each
of those blocks gets broken down into the three different groups,
Zone 1, Zone 2 and Zone 3.
For any given block we can calculate the numerator term of our
equation just by multiplying the number of points in that block by
the current error (or technically, summing up the total current error),
and we can do a similar thing to calculate the denominator for each
block by multiplying the number of points by ( Current Prediction *
(1 – Current Prediction) ). This is shown as step 6 in the table
below.

Once we have the numerator and denominator for any given block, it
is a simple matter to add them together based on what zone each
block got grouped into. The first zone is the sum of only the first
data block. The second zone is the sum of the 3 data blocks in the
middle, and the third zone is the sum of the final two data blocks.
These are shown as the “Total Numerator” and “Total Denominator”
rows in the table below.
To calculate the final amount of change for each zone, we simply
need to divide the numerator by the denominator and multiply by the
value of 2/3 based on the total number of categories

This gives us a change in the infinity range of 2.423 for Zone 1,
-.008 for Zone 2, and -.920 for Zone 3. Basically, we see that Zone
1 moves strongly towards recognizing that its true value is the
current category being analyzed (category 1), Zone 3 moves strongly
away from the current category, and Zone 2 doesn’t change much at
all. Zone 1 and Zone 3 had a lot of change because they were pure
categories, Zone 2 was mixed and mostly canceled itself out.
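The zone calculations can be sketched as a function. The point counts below are placeholders; for a pure zone they cancel out of the ratio, so any count gives the same answer. The last digit differs slightly from the book's 2.423 because of intermediate rounding in the book's tables.

```python
def amount_of_change(errors, predictions, n_categories):
    # change = (#cat - 1)/#cat * (sum of errors) / (sum of CP * (1 - CP))
    numerator = sum(errors)
    denominator = sum(cp * (1 - cp) for cp in predictions)
    return (n_categories - 1) / n_categories * numerator / denominator

# Zone 1: pure category-1 points, current prediction .275, error .725 each
zone1 = amount_of_change([0.725] * 9, [0.275] * 9, 3)
# Zone 3: pure non-category-1 points, current prediction .275, error -.275 each
zone3 = amount_of_change([-0.275] * 16, [0.275] * 16, 3)
print(round(zone1, 3), round(zone3, 3))   # 2.424 -0.92
```

Zone 2 mixes positive and negative errors, so its numerator nearly cancels, which is why its change is close to zero.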

Step 7: Adjust The Current Prediction Based On The Calculated Change
The final step is to add that amount of change to the current
prediction for this category, in the infinity range. Since all points
have a current prediction of .15 for this category, all points will have
a new prediction of .15 plus the amount of change for their zone.
The final result is that we get a current prediction for this category of
2.573 for Zone 1, 0.142 for Zone 2, and -0.770 for Zone 3. Since the
current prediction is actually an array for each point, and we have
not yet changed the current prediction for the other categories, that
means that the full array of current predictions in the infinity range
is
Zone 1 : [ 2.573, 0.45, 0.40]
Zone 2 : [ 0.142, 0.45, 0.40]
Zone 3 : [-0.770, 0.45, 0.40]

The final row in the chart above is the new prediction in the 0-1
range. However, we technically aren’t using that value at this step,
so the computer would likely go on to the next category without
calculating the new prediction in the 0-1 range.
For our purposes, it is useful so that we can plot it and see how
much change occurred in our predictions for this category. That is
shown below.
Iteration 1 - Category 2: Repeat the Steps
Above With The Second Category
Step 2: Assign True Values
At this point, we have done the first of three sub-loops of the first
boosting iteration. We have gone through the inner loop a single
time. We now need to repeat the same process, but operate on
category 2 this time. That will mean
27 points will have a true value of 1
33 points will have a true value of 0
In graphical form, we are on the middle subplot in the charts shown
below

Step 3: Normalize The Current Prediction To Convert From
Infinity Range to 0-1 Range
If you recall, the initial prediction for all the points in the infinity
range was [.15, .45, .4] before the first sub-loop. This corresponded
to an initial prediction in the 0-1 range of [.275, .371, .353]. After
the loop through the first category, we changed the current
prediction in the infinity range of all the points. I.e. the .15 value
changed depending on what zone a point was in. As a result, the
current prediction of the first category in the 0-1 range also changed,
i.e. the .275 value changed depending on what zone a point was in.
However, one thing I didn’t point out before was that the current
prediction in the 0-1 range changed for every category, not just the
first category.
I.e. not only did the .275 current prediction change, but the .371, and
the .353 current predictions for the other categories also changed,
even though we haven’t looped through to focus on those categories
yet. That is because when the prediction changed in the infinity
range for the first category, it changed the normalized value for all
the categories. This is exactly why I couldn’t do the process for all
three categories at the same time but instead had to do them one
after the other.
This table shows the current prediction in both the infinity range and
the 0-1 range for each block of data. (Notice that the current
predictions for category 2 and category 3 in infinity range are still
the same as the initial predictions of .45 and .40)

We can see the resulting change in the 0-1 range in the chart below,
which shows the current true values and current prediction for
category 2.
You can see that even though we haven’t done a loop focusing on
category 2 yet, that there are different current predictions based on
where the regression tree split the groups for the category 1 loop.

Step 4: Calculate The Current Error


The process that occurs for category 2 is the same steps that we saw
before. We need to convert the current prediction in the infinity
range into the 0-1 range, as is shown above. We then need to
subtract from the true values for category 2 to get the current error
for category 2.
As we have seen before, the regression tree splits the points based on
their current error in order to minimize the total sum squared error.
The current error and the splits that the regression tree generates for
this category are shown in the chart below.
What occurred was that the regression tree put the first split at
approximately 5.0, which breaks what I labeled as Zone 3 out into a
pure category. The second split was put at approximately 4.0,
breaking the data between 4.0-5.0 into a pure category.
Step 6 & 7 – Calculate The Amount Of Change, And Add It To
The Current Prediction
I’ll show the calculation for the new prediction as a single chart
here, instead of breaking into a series of charts like we did last time

The three different zones each get their own amount of change
which gets added to the current prediction for category 2 in the
infinity range. After this step, the current prediction of the six
blocks of data is in the table below
Notice that we still haven’t modified the results for category 3 in the
infinity range.
The plot for category 2 current prediction after this sub-loop is

As you can see, we got a lot of improvement in Zone 2 and Zone 3
for this category, but not very much change in Zone 1.
Iteration 1 - Category 3: Repeat the Steps
Above With The Third Category
Step 2: Assign True Values
The final step for the first iteration is to complete the whole process
again while focusing on the third category. These are the true values
and the current prediction of category 3 going into this boosting
loop.

Once again, even though we haven’t yet edited the current prediction
in the infinity range for any of these points, i.e. all points still have
the starting prediction of 0.40, because the current prediction for all
the other categories have changed, that means that different points
have different predictions in the 0-1 range, which is what we are
looking at above.
That data is used to calculate the current error, which is used by the
regression tree to split the data. That is shown below for category 3.
Step 4 & Step 5 – Calculate The Current Error, And Use A
Regression Tree To Split Into Groups

Step 6 & 7 – Calculate The Amount Of Change, And Add It To
The Current Prediction
The calculations to get the amount of change and to modify the
current prediction are shown in the table below.
The final result for this category is

Now we have completed a single full boosting iteration. There are 6
distinct blocks that the data falls into, and the results after a single
boosting iteration is shown below for each of those blocks
Those are the final results after 1 iteration, but what do they mean?
Well, what we see is that the 0-1 block, the 1-2 block, the 4-5 block
and the 5-6 block all had reasonably strong movement towards their
actual categories. I’ve highlighted those values in the table below.

However, the 2-3 block and the 3-4 block basically did not improve
at all. This is because the depth 2 decision trees never isolated
those blocks for their actual categories during the first iteration. If
we did this analysis for a few more iterations, what would happen is
that the edges of the data would start to have a really small error, and
the regression tree would focus on the data in the center.
But we are not going to go through another full iteration since it
would just be repeating the analysis we showed several times. If you
want to see those calculations, this free downloadable Excel file has
the boosting calculations for the 2nd and 3rd full iterations.
Predicting With Multi-Category Algorithm
Predicting with multiple categories is the same as with one
category. Each regression tree saved the amount of change
each leaf resulted in. That amount of change is applied to each
point being tested depending on what leaf they would end up
on in the regression trees.
At the end of all boosting iterations, the category with the
highest value is the one that is predicted.
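As a sketch, prediction is just accumulating each tree's saved leaf value into a per-category score and taking the largest. The leaf values below are made up for illustration:

```python
# Initial prediction for the test point (category shares, infinity range)
initial = [0.15, 0.45, 0.40]

# One entry per boosting iteration: the saved leaf value that this test
# point lands on for each category's tree (hypothetical numbers)
leaf_values = [
    [2.423, -1.131, -1.000],
    [0.810, -0.420, 0.150],
]

scores = list(initial)
for changes in leaf_values:
    scores = [s + c for s, c in zip(scores, changes)]

# The predicted category is the index with the highest accumulated score
category = scores.index(max(scores))
print(category)   # 0
```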
We have now completed going over how the default
implementation of gradient boosted trees work. The next
section reviews some of the parameters that you can adjust to
determine what works best for your data.
Gradient Boosted Tree Parameters
Gradient Boosted Trees have a number of parameters that you
can tune. This section goes over those parameters. The
naming and syntax used in this section are based on the Python
sklearn names. However, in general, there are similar
parameters available in R.
There are 3 main categories of parameters that you can control
Parameters that control how the decision tree is
generated, i.e. how deep it goes, how it decides
which splits to make and which to keep, etc.
Parameters which control the boosting algorithm,
i.e. how many boosting steps to make, how fast to
incorporate the results
Miscellaneous parameters that don’t fit elsewhere

Here is a summary of the available parameters, some of which
are covered in greater depth in the following pages.

Decision Tree Parameters


max_depth : Max depth controls the number of
decision levels that the decision tree can have. By
default, it is 3. The maximum number of leaf nodes
that the decision tree can have is 2 raised to the
power of max_depth
max_leaf_nodes: This is similar to max_depth
except that it controls the final number of resulting
leaf nodes. Note that in Python only one of
max_leaf_nodes or max_depth can take effect, since
the implementation chooses between them with an
if-else statement. If max_leaf_nodes is set, the tree
will grow as many leaves as it can, given the other
parameters (ignoring max_depth), and then keep the
best ones based on their improvement in quality. By
default, this is not set.
min_samples_split: This controls when a branch
can be split into multiple branches. By default, this
is 2, which means that if there are at least 2 data
points in a branch, and they are different, it can be
split. If you have a larger value, the tree will be
more robust against overfitting, since it will require
more data points to make a split, however, it might
not drill down to the same levels of depth
min_samples_leaf: This is very similar to the
min_samples_split parameter. Except it says there
have to be a minimum number of data points after
the split rather than before the split. For instance, if
you set this parameter to 2, that means that every
resulting branch or leaf has to have at least 2 data
points in it. By default, this parameter is 1
min_weight_fraction_leaf: This is similar to the
min_samples_leaf, except that instead of giving a
count, you are giving a fraction of the total data. So
if you set this value to .01 you are saying that each
resulting leaf needs to have at least 1% of the data.
By default, this value is 0
min_impurity_split: This determines the minimum
Gini impurity before making a split. By default, the
value is 1e-7. If you want to control the tree splits,
it is frankly more intuitive to use the other
parameters like max depth, or min samples
criterion: Here you can select how you want the
decision tree to measure the goodness of a split.
Your choices are either “mse” for mean squared
error or “mae” for mean absolute error. (there is also
“friedman_mse” which is the same as “mse” for all
practical purposes) What has been shown in this
book was mean squared error. MSE puts a
somewhat larger focus on the higher errors because
you are squaring the level of error. Mean absolute
error is similar except you are taking the absolute
value of each error instead of squaring it. When we
were placing regression lines in the decision trees we
saw that the minimum mean squared error is the
average value of all points in a group. As it turns
out, minimum mean absolute error is the median of all
points in the group. So MSE takes the average,
MAE takes the median.
max_features: This controls how many of the
variables the tree will look at before choosing the
best split. By default, the option is all of them, but
you could also select a certain percentage of the
number of variables, the square root of the
number of variables, or the base-2 logarithm of the
number of variables. Using fewer than all of the
variables is a method of controlling overfitting, and
increasing the speed of tree generation.

Boosting Parameters
learning_rate: This has been discussed in depth
earlier in the book. It controls how quickly changes
from the boosting results are rolled into the current
prediction. A large learning rate will get a close
result more quickly but is more prone to overfitting.
By default, the learning rate is .1
n_estimators: How many boosting iterations to
make. By default, there are 100. More boosting
iterations will tend to give better results. Eventually,
however, they will overfit the data, so there is a
sweet spot.
subsample: This controls whether all the data is
evaluated at each boosting iteration. By default, the value is 1.0,
but it could be a lower value like .8 if you want to
only evaluate 80% of the data each iteration. A
lower value is a control against overfitting. (note,
this is discussed in depth below)
loss: The loss function is what the boosting
algorithm is attempting to optimize. For regression
this defaults to least squares, which is pretty much
the same as minimum mean squared error. Other
options for the loss function for regression are “least
absolute deviation”, which is pretty much minimum
absolute error; “huber”, which is a combination of
least squares and least absolute deviation; and
“quantile”, which lets you target a particular
percentile of the data rather than the mean. For a
gradient classifier, the choices are either logistic or
exponential loss. Logistic is the default;
exponential makes the algorithm more similar to
AdaBoost, which is an older type of boosting that
works by weighting data points differently instead of
keeping the residual error for each subsequent tree.
(Link with a brief explanation of the difference
between AdaBoost and Gradient Boosted Trees:
https://stats.stackexchange.com/questions/164233/intuitive-explanations-of-differences-between-gradient-boosting-trees-gbm-ad)

Other Parameters
random_state : This is an integer that controls the
random seed of the tree. If you want results that you
can duplicate, you should use this value to control
the random state of the algorithm, or you should be
setting it earlier in the code, for instance using
numpy in Python.
warm_start: This gives you the option to add more
boosting steps to an already created boosting fit.
You might use this if you fit the boosting algorithm,
and then have a step where you evaluate how good
the fit is before deciding if you want to invest more
time in fitting additional boosting iterations.
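A sketch of the warm_start workflow just described (toy data and values are illustrative): fit 50 trees, check the score, then grow the same model to 100 trees without re-fitting the first 50.

```python
# Sketch: warm_start lets you add boosting iterations to an existing fit.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)

model = GradientBoostingRegressor(n_estimators=50, warm_start=True,
                                  random_state=0)
model.fit(X, y)
score_50 = model.score(X, y)

# Decide the fit is worth more effort, and add 50 more boosting steps.
model.set_params(n_estimators=100)
model.fit(X, y)  # reuses the existing 50 trees, trains only the new ones
score_100 = model.score(X, y)
print(score_50, score_100)
```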
Decision Tree Parameters – More Detail
There are 6 different parameters that you can control that drive the
final shape of the decision tree. Here is a diagram showing some of
the different parameters

In addition to those, min_weight_fraction_leaf means that every
final node must contain at least a certain fraction of the total data points.
How Does Pruning Work?
Let’s say you go with a max leaf nodes option. How does it
work?
This is not as obvious as if you selected a max_depth option or
one of the other options. Except for max leaf nodes, all of the
other options stop generating splits based on a parameter.
Max leaf nodes does something different. It generates all of
the leaves until it is stopped for some other reason, and then
recursively deletes the ones that are the least useful. The way
it deletes the nodes is by deleting the split that generated
them. For instance, if your tree was

And it found leaf nodes 3 and 4 to be the least useful, and you
only wanted 3 leaf nodes you would get
The method it uses to determine which of the nodes is the
most useful is the improvement in summed squared error from
that split. It will investigate every split, determine the change
in summed squared error before vs after the split, and roll back
the least beneficial splits. Sometimes it will roll back multiple
levels if a branch has many splits that end up being less useful
than the others.
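A quick way to see this parameter in action on a single regression tree (note: scikit-learn actually reaches this result by growing the tree best-first on improvement in error rather than by literally growing everything and rolling splits back, but either way the tree that comes out keeps the most useful splits):

```python
# Sketch: max_leaf_nodes caps a tree at its most useful splits.
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=3, random_state=0)

tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0)
tree.fit(X, y)
print(tree.get_n_leaves())  # → 3
```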
How Max Features Parameter Works
The max features parameter is an interesting parameter. It
controls which features a given split looks at when the
regression tree is generated. By default, each split looks at all
of the features and picks the best split that could be selected
from any of the features. However, there are times when
looking at fewer features can give better final results.
The Random Forest machine learning algorithm makes use of
this as part of its randomness. Random Forests also heavily
use decision trees. By default, they only look at the square
root of the number of features at each split. So if you have
100 features in a Random Forest decision tree, any given split
will only evaluate 10 of them. However, since there are
typically a lot of splits in a decision tree, and there are a lot of
trees in either a random forest or a gradient boosted tree
algorithm, all of the features do get evaluated multiple times.
Looking at fewer than all the features each split can have two
benefits. The first is it can help minimize overfitting.
Looking at a subset of features can help ensure that more
features are evaluated instead of relying on a few features. It
means that errors in any single feature will not influence the
model as heavily.
The second benefit of not using every feature every split is
avoiding local optimums in the regression tree generation.
Remember, regression trees do not find the globally best
result. We saw that when we created regression trees with 3
zones when 4 zones would have been better. Regression Trees
pick the best split at each stage. However, when you have
two or more splits in series the best combination of those splits
is not necessarily the same as the best first split, and then the
best second split. If you always pick a single feature or one of
a small set of features for the early splits, you might be
missing the opportunity for really good splits later on. Only
evaluating a subset of the features helps with that since it
ensures that the locally best features aren’t always available,
so other features are evaluated.
The options that you have for the number of features to
evaluate are
All of them
The square root of the number of features
The base-2 logarithm of the number of features
A floating point number representing a fraction of
all the features, i.e. 0.3 for 30% of the features.
Whether you should use all the features or should pick fewer
features is something that can be evaluated with cross-
validation. By default, the parameter uses all of the features
for boosting. Using all features was set to the default for
boosting because it has been found to work well for many
problems. Whether it works the best for your problem is
something to be evaluated.
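One sketch of that evaluation, using grid search with cross-validation over the max_features options (the candidate values and dataset are illustrative choices):

```python
# Sketch: cross-validating how many features each split should consider.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"max_features": [None, "sqrt", "log2", 0.3]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```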
How Sub-sampling Works
Sub-sampling is another parameter that you can tune for
gradient boosted trees. Sub-sampling simply means to not use
all the data at any individual boosting iteration.
The purpose of sub-sampling is to limit overfitting errors that
are in the training data. Most real life data has some amount
of errors or inconsistencies in it. Training the classifier on the
erroneous data points will lead to errors when testing.
Sub-sampling addresses this by leaving out a portion of the
data each time. Some of the data that is left out will be the
erroneous data points. Since most of the data points won’t be
errors, when the errors are left out the model will tend to
correct those errors.
By default, the sub-sampling parameter is set to 1.0, i.e.
use 100% of the data at each iteration. If you were to set it to
something lower, such as .7, it would only use a portion of the
data, in this case, 70%. However, it would choose a different
70% at every boosting iteration. So after a handful of boosts,
all the data will have been used.
The sub-sample is drawn without replacement, i.e. you can’t
get the same data point twice. This is different than how some
other machine learning algorithms, such as Random Forests,
work since they do use replacement.
A sub-sample parameter of lower than 1.0 is typically only
beneficial if the learning rate is also lower than 1.0. With a
learning rate of 1.0, the algorithm would make too large of
steps on incomplete data. However, with a smaller learning
rate, there is sufficient opportunity for all of the data to be
incorporated on multiple boosting iterations.
This paper by Jerome Friedman shows an example of sub-sampling
outperforming no sub-sampling for several different data sets:
https://statweb.stanford.edu/~jhf/ftp/stobst.pdf (plots on page
9). The exact value of sub-sampling to use varies with the
data, with sub-sampling on the order of .6 being the best
shown on the plots in that paper.
An additional advantage of using sub-sampling is that it
speeds up the generation of trees which improves the
performance of the whole process. Using a sub-sample of .5
will speed up tree creation by an approximate factor of 2.
Feature Engineering
The only thing that we have discussed so far is the actual machine
learning algorithms. This section will briefly discuss feature
engineering.
Feature engineering is the generation of features for the machine
learning algorithm to operate on. We actually already did some of
that early in the book, when we reduced this plot into a single
line by taking the maximum of X and Y and running the boosting
algorithm against that value instead of both X and Y. In
fact, we saw that the boosting algorithm worked a lot better with that
engineered feature, rather than the two features (X and Y) that we
started with. In many cases, you will find that working with the data
to extract the salient bits of information is more important to getting
a good result than what you do with the machine learning algorithm.
Frequently the parameters that you choose, or even the algorithm
itself, are much less important than the data that you pass it.
A more extreme example of the same plot that we have above is
shown below. This has the same data points as the plot above, but
instead of coloring the design space based on taking the modulus of
the maximum X or Y value vs 2, I took the modulus of the distance
the point is away from the origin vs 2. The result is circular stripes
instead of square stripes.

You will recall that regression trees always split data in straight
lines. In two dimensions that would either be horizontal or vertical
lines. That worked in our favor for the graph with square stripes,
since the splits for the true data were all either horizontal or vertical.
In this graph with the circular plots, that will work against us.
With the square stripes, after we saw that generating a new feature of
the maximum of x and y worked well, we tripled the number of
points in the training set up to 180 and used 20 boosting iterations
with a maximum depth of 3 to get this result.
We considered that result to be pretty good but noticed that it took
more data and more steps than when we had the ideal feature,
Max(x,y). With this circular data, 180 points won’t be nearly
enough. 180 points plotted on the circular data is shown below.
To a human, even if there wasn’t the background shading, a pattern
is starting to emerge. But remember, the boosting algorithm does
not recognize the pattern, all it is doing is splitting on the available
features. It is not modifying those features in any way, nor is it
looking at the interactions of multiple features.
If we train the boosting model on the 180 points above using 20
iterations of a depth 3 decision tree, this is the result that we predict
To put it bluntly, that result is outright terrible. With this circular
data, even having triple the amount of data, and a relatively high
number of trees and tree depth, the results look nothing like the
circles of the true data. If instead, we had been able to deduce that
the data was driven by the total distance from the origin, a mere few
dozen data points and 10 iterations would have been enough to get a
pretty good result. That is the power of feature engineering if it is
done very well.
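The circular-stripes example can be sketched in code. The stripe rule below (the floor of the distance from the origin, modulo 2) is my reconstruction of the pattern described above, and the point counts and iteration counts are illustrative:

```python
# Sketch: engineering the distance-from-origin feature for circular stripes.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
xy = rng.uniform(-5, 5, size=(500, 2))
distance = np.hypot(xy[:, 0], xy[:, 1])        # the engineered feature
labels = (np.floor(distance) % 2).astype(int)  # circular stripe pattern

scores = {}
for name, features in [("raw x,y", xy),
                       ("distance", distance.reshape(-1, 1))]:
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels,
                                              random_state=0)
    clf = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                     random_state=0)
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)

print(scores)  # the engineered feature should score far higher
```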
Even feature engineering that is just OK can still be powerful. Let’s
say that you did not recognize that the key feature was the distance
from the origin, and instead thought that the data was arrayed as
diagonal stripes. Bear in mind that you would just be looking at the
actual data points, not the colored background, so it would be a
reasonable conclusion to reach.
If you thought that the data was based on diagonal stripes like the
ones shown below
Then a feature you could generate would be the value of x + y. That
would separate points based on which of those diagonal lines they
fell on. If you were to train on this data including the features of x,
y, and x+y, with 180 points and use 60 iterations with a tree depth of
4, the prediction that you would get is shown below.
Now, this isn’t an ideal prediction by any means. But it actually is
not that bad. The overall rounded, striped behavior is starting to
come through, and it looks much more like the true results than the
prediction when we used no feature engineering at all.
I did not spend a lot of time trying to determine exactly how much
data, and how many iterations would be required to get a good result
without any feature engineering at all. However, if we trained on the
1,800 data points shown below
Using 100 iterations with a maximum tree depth of 5, we start to get
results that are OK, but these results definitely still have error on the
boundaries. This is shown below
If we were to dig into the sine wave regression analysis again, we
would see a similar effect. The sine wave only carried 5 pieces
of information in it. It had
The fact that it was a sine wave
The amplitude
The frequency
The phase
The offset
To do a reasonable job of matching those 5 pieces of key
information, the regression analysis needed between 45-100
iterations with a tree depth 2 or more. If the sine wave had extended
over a larger range, for instance, 100 waves instead of a single wave,
we would have needed even more iterations.
All this is not to say that boosting is not a good machine learning
algorithm. It is, in fact, a very powerful algorithm that frequently
wins data analysis competitions. But it is important to know the
limitations. It is just as important to invest effort into improving the
data and the features that the algorithm will use as it is to adjust the
algorithm parameters or the number of iterations.
Feature Importances
The last topic in this book is on feature importances. One
interesting piece of data that you can extract from the boosting
algorithm is the “feature importances”, (the syntax is
feature_importances_ in Python). This reports the relative
significance of all the features that were used to generate the
regression trees that the boosting iterations were based on.
If you extract feature importances, what you end up with is a
normalized array that is the same length as the number of
features you have. The larger numbers correspond to features
that were more important to the analysis.
The actual calculation used for generating feature importances
is to calculate the improvement in mean squared error at each
stage of a regression tree, and then keep a sum of that
improvement for each feature, depending on which feature
was used to generate the split in the regression tree. That sum
of importance is then averaged over all of the regression trees
from all of the boosting iterations.
If you are looking for a more technical description of how
feature importances are calculated, with links to the raw
python and cython code, this stack overflow question has an
excellent answer
https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting
However, for general knowledge the important things to know
are
The regression trees keep track of improvement in
mean squared error based on what feature the split
was on
The larger the improvement was, summed across all
of the trees, the more important that feature is
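As a sketch, here is how extracting the importances looks in Python. In this toy data only the first feature drives the target, so it should dominate the (normalized) importances:

```python
# Sketch: extracting feature_importances_ from a fitted model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 0] + 0.1 * rng.normal(size=300)  # only feature 0 matters

model = GradientBoostingRegressor(random_state=0).fit(X, y)
importances = model.feature_importances_

print(importances)           # normalized array, one entry per feature
print(importances.argmax())  # → 0, the feature that drove the target
```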
So why would you care about feature importances?
As we have already seen, boosting is a useful algorithm, but it
is only as good as the data that you feed into it. If you can
manipulate that data to make it more useful, you can get
significantly better results. However, where should you focus
your attention? Looking at the feature importances will tell
you which features are already the most useful. Usually
focusing on the most useful features, and making them even
more useful will yield the largest improvement in your results.
Making them more useful might include things like
Finding ways to scrub errors and outliers out of
those features
Changing those features from a discrete range to a
continuous range. i.e. instead of classifying dog
breeds as “small”, “medium” and “large” actually
put in the breeds’ adult weight. Or vice versa
Finding ways to mix some of the most important
features. Should you take the ratio of 2 features or
the absolute value of those features? The product or
difference of those features? Perhaps they are
orthogonal distances and you need to take the square
root of the sum of the squares to get the magnitude
of distance?

One thing to know about feature importances is that as you
add more similar features you will dilute the importance of the
existing features. For instance, if you are trying to classify
different types of airplanes, the length of the airplane would
be a really important feature. If you take that information and
also add features that represent the height of the airplane, and
the volume, and the wingspan, and other dimensions of the
airplane you will likely end up with final results that are a
significant improvement. However, the relative importance of
the height feature itself will be diminished due to all of the
other similar features sharing the limelight.
If You Found Errors Or Omissions
We put some effort into trying to make this book as bug-free as
possible, and including what we thought was the most important
information. However, if you have found some errors or
significant omissions that we should address please email us
here

And let us know. If you do, then let us know if you would like
free copies of our future books. Also, a big thank you!
More Books
If you liked this book, you may be interested in checking out
some of my other books such as
Machine Learning With Random Forests And
Decision Trees – If you like the book you just read
on boosting, you will probably like this book on
Random Forests. Random Forests are another type
of Machine learning algorithm where you combine a
bunch of decision trees that were generated in
parallel, as opposed to in series like we did with
boosting.

Linear Regression and Correlation: Linear
Regression is a way of simplifying a set of data into
a single equation. For instance, we all know
Moore’s law: that the number of transistors on a
computer chip doubles every two years. This law
was derived by using regression analysis to simplify
the progress of dozens of computer manufacturers
over the course of decades into a single equation.
This book walks through how to do regression
analysis, including multiple regression when you
have more than one independent variable. It also
demonstrates how to find the correlation between
two sets of numbers.

Hypothesis Testing: A Visual Introduction To
Statistical Significance – This book demonstrates
how to tell the difference between events that have
occurred by random chance, and outcomes that are
driven by an outside event. This book contains
examples of all the major types of statistical
significance tests, including the Z test and the 5
different variations of a T-test.
Thank You

Before you go, I’d like to say thank you for purchasing my
eBook. I know you have a lot of options online to learn this
kind of information. So a big thank you for downloading this
book and reading all the way to the end.
If you like this book, then I need your help. Please take a
moment to leave a review for this book on Amazon. It really
does make a difference and will help me continue to write
quality eBooks on Math, Statistics, and Computer Science.

P.S.
I would love to hear from you. It is easy for you to connect
with us on Facebook here
https://www.facebook.com/FairlyNerdy
or on our webpage here
http://www.FairlyNerdy.com
But it’s often better to have one-on-one conversations. So I
encourage you to reach out over email with any questions you
have or just to say hi!
Simply write here:

~ Scott Hartshorn
