Understanding Random Forest
Tony Yiu
Jun 12
In this post, we will examine how basic decision trees work, how individual decision trees are combined to make a random forest, and why random forests are so good at what they do.
Decision Trees
Let’s quickly go over decision trees as they are the building
blocks of the random forest model. Fortunately, they are pretty
intuitive. I’d be willing to bet that most people have used a
decision tree, knowingly or not, at some point in their lives.
The two 1s that are underlined go down the Yes subbranch, the 0 that is not underlined goes down the right subbranch, and we are all done. Our decision tree was able to use the two features to split up the data perfectly. Victory!
Obviously, in real life our data will not be this clean, but the logic that a decision tree employs remains the same. At each node, it will ask: which feature will allow me to split the observations at hand so that the resulting groups are as different from each other as possible (and the members of each resulting subgroup are as similar to each other as possible)?
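To make that splitting logic concrete, below is a minimal sketch that fits scikit-learn's DecisionTreeClassifier on a tiny, invented two-feature dataset and prints the splits it chooses (the data values are made up for illustration; they are not the example from the figure above):

# A minimal sketch: fit a decision tree on a tiny, invented two-feature
# dataset and inspect the splits it chooses. The data here is hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[0, 1], [1, 1], [1, 0], [0, 0], [1, 1], [0, 0]])  # two features per row
y = np.array([1, 1, 0, 0, 1, 0])                                # class labels

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# Show the feature and threshold chosen at each node.
print(export_text(tree, feature_names=["feature_1", "feature_2"]))

The printed rules show exactly this question being answered: at each node the tree picks the feature and threshold that best separate the classes at hand.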
The low correlation between models is the key. Just like how
investments with low correlations (like stocks and bonds) come
together to form a portfolio that is greater than the sum of its
parts, uncorrelated models can produce ensemble predictions
that are more accurate than any of the individual
predictions. The reason for this wonderful effect is that
the trees protect each other from their individual
errors (as long as they don’t constantly all err in the same
direction). While some trees may be wrong, many other trees
will be right, so as a group the trees are able to move in the
correct direction. So the prerequisites for random forest to perform well are:
1. There needs to be some actual signal in our features, so that models built using those features do better than random guessing.
2. The predictions (and therefore the errors) made by the individual trees need to have low correlations with each other.
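To put a rough number on the second point, here is a small sketch that treats each tree as an independent classifier and computes how often a majority vote of the forest is correct; the 60% per-tree accuracy and the assumption of perfectly uncorrelated trees are simplifications chosen purely for illustration:

# Probability that a majority vote of n independent trees is correct,
# assuming each tree is right with probability p_correct (an assumed figure).
from math import comb

def majority_vote_accuracy(n_trees, p_correct):
    k_needed = n_trees // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(k_needed, n_trees + 1)
    )

for n in (1, 11, 101):  # odd counts so there are no ties
    print(n, round(majority_vote_accuracy(n, 0.60), 3))

Under these assumptions a single tree is right 60% of the time, while a vote over 101 independent trees is right far more often; correlation between the trees is what erodes this benefit.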
Which would you pick? The expected value of each game is the same.
So even though the games share the same expected value, their
outcome distributions are completely different. The more we
split up our $100 bet into different plays, the more confident we
can be that we will make money. As mentioned previously, this
works because each play is independent of the other ones.
Random forest is the same — each tree is like one play in our
game earlier. We just saw how our chances of making money
increased the more times we played. Similarly, with a random
forest model, our chances of making correct predictions
increase with the number of uncorrelated trees in our model.
If you would like to run the code for simulating the game
yourself you can find it on my GitHub here.
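For a rough idea of what such a simulation looks like, here is a minimal sketch; the 60% win probability and the win-or-lose-your-stake payoff are assumptions made for illustration:

# Monte Carlo sketch: split a $100 bankroll evenly across n independent plays
# and estimate the chance of ending ahead. The 60% win probability is assumed.
import random

def prob_of_profit(n_plays, p_win=0.60, bankroll=100, n_sims=20_000):
    stake = bankroll / n_plays
    ahead = 0
    for _ in range(n_sims):
        total = sum(stake if random.random() < p_win else -stake
                    for _ in range(n_plays))
        ahead += total > 0
    return ahead / n_sims

for n in (1, 10, 100):
    print(n, prob_of_profit(n))

Splitting the same bankroll across more independent plays leaves the expected value unchanged, but the estimated chance of walking away with a profit climbs steadily toward certainty.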
Notice that with bagging we are not subsetting the training data
into smaller chunks and training each tree on a different chunk.
Rather, if we have a sample of size N, we are still feeding each
tree a training set of size N (unless specified otherwise). But
instead of the original training data, we take a random sample
of size N with replacement. For example, if our training data
was [1, 2, 3, 4, 5, 6] then we might give one of our trees the
following list [1, 2, 2, 3, 6, 6]. Notice that both lists are of length
six and that “2” and “6” are both repeated in the randomly
selected training data we give to our tree (because we sample
with replacement).
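In code, drawing a bootstrap sample like this is a one-liner; here is a sketch using NumPy (the exact repeats you get will differ from the [1, 2, 2, 3, 6, 6] example above, since the draw is random):

# Draw a bootstrap sample: N observations sampled *with replacement*,
# so some values repeat and some are left out entirely.
import numpy as np

rng = np.random.default_rng(seed=0)
training_data = np.array([1, 2, 3, 4, 5, 6])

bootstrap_sample = rng.choice(training_data, size=len(training_data), replace=True)
print(bootstrap_sample)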
Node splitting in a random forest model is based on a random subset of features for each tree.
Now let’s take a look at our random forest. We will just examine
two of the forest’s trees in this example. When we check out
random forest Tree 1, we find that it can only consider
Features 2 and 3 (selected randomly) for its node splitting
decision. We know from our traditional decision tree (in blue)
that Feature 1 is the best feature for splitting, but Tree 1 cannot
see Feature 1 so it is forced to go with Feature 2 (black and
underlined). Tree 2, on the other hand, can only see Features 1
and 3 so it is able to pick Feature 1.
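Both ingredients, bagging and per-split feature randomness, are exposed directly in scikit-learn's RandomForestClassifier; the sketch below uses an invented dataset and illustrative parameter values:

# A minimal sketch of a random forest in scikit-learn. The dataset and the
# parameter values are illustrative choices, not the article's example.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_features="sqrt",  # each split may only consider a random subset of features
    bootstrap=True,       # each tree is trained on a bootstrap sample of the data
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))

Note that scikit-learn re-draws the random feature subset at every split rather than once per tree, but the effect is the same: individual trees are forced to look at the data differently, which keeps their errors less correlated.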
Conclusion
Random forests are a personal favorite of mine. Coming from
the world of finance and investments, the holy grail was always
to build a bunch of uncorrelated models, each with a positive
expected return, and then put them together in a portfolio to
earn massive alpha (alpha = market beating returns). Much
easier said than done!