Random Forest Regression
Training → Testing
Regression
• Regression is a statistical technique used to predict a
continuous outcome variable based on one or more
independent variables. It establishes a relationship
between the input variables (predictors) and the output
variable (target). Regression analyzes the correlations
between variables in a dataset and determines their
statistical significance.
• The two basic types of regression are simple linear
regression and multiple linear regression, although there
are non-linear regression methods for more complicated
data and analysis. Simple linear regression uses one
independent variable to explain or predict the outcome of
the dependent variable Y, while multiple linear regression
uses two or more independent variables to predict the
outcome (while holding all others constant).
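To illustrate the two flavors just described, here is a minimal scikit-learn sketch (an assumed example, not from the slides; the feature names and synthetic data are made up):

```python
# Simple vs. multiple linear regression on synthetic data (assumed illustration).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
temperature = rng.uniform(10, 35, size=200)                              # predictor 1
humidity = rng.uniform(20, 90, size=200)                                 # predictor 2
load = 50 + 3.0 * temperature - 0.5 * humidity + rng.normal(0, 2, 200)   # target

# Simple linear regression: one independent variable
simple = LinearRegression().fit(temperature.reshape(-1, 1), load)

# Multiple linear regression: two or more independent variables
multiple = LinearRegression().fit(np.column_stack([temperature, humidity]), load)

print(simple.coef_)     # slope for temperature alone
print(multiple.coef_)   # slopes for temperature and humidity jointly
```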
Linear Regression
Time-Series with Machine Learning Models
• Sliding window → turns the forecasting problem into a supervised machine learning problem (see the sketch below).
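A minimal sketch of the sliding-window transformation (an assumed example, not course code): lagged copies of the series become the input features and the next value becomes the target.

```python
# Sliding-window framing (assumed illustration): turn a univariate series into a
# supervised-learning table of lagged features plus a one-step-ahead target.
import pandas as pd

def make_supervised(series: pd.Series, window: int = 3) -> pd.DataFrame:
    frame = pd.DataFrame({f"lag_{i}": series.shift(i) for i in range(window, 0, -1)})
    frame["target"] = series.values
    return frame.dropna()          # first `window` rows have no complete history

load = pd.Series([310, 295, 300, 320, 315, 330], name="Load")   # toy load values
print(make_supervised(load, window=3))
# Each row: [value at t-3, t-2, t-1] -> value at t
```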
Multivariate vs Univariate Time-Series

Univariate example:
Datetime              Load
2012-01-01 00:00:00   …
2012-01-02 00:00:00   …
2012-01-03 00:00:00   …
2012-01-04 00:00:00   …

Multivariate example (Temperature and Humidity are exogenous variables):
Datetime              Temperature   Humidity   Load
2012-01-01 00:00:00   …             …          …
2012-01-02 00:00:00   …             …          …
2012-01-03 00:00:00   …             …          …
2012-01-04 00:00:00   …             …          …
Time-Series Models
- ARIMA
- SARIMAX
- Prophet
- Vector Autoregression

Shallow Learning Models
- Linear Regression
- SVR
- Random Forest Regression

Deep Learning Models
- CNN
- RNN
- LSTM
- Transformer
Decision Tree Training
“Decision tree regression uses a fast divide and conquer greedy algorithm that recursively splits
the data into smaller parts. This greedy algorithm can cause poor decisions in lower levels of
the tree because of the instability of the estimations. But the decision tree is one of the
machine learning algorithms that is very fast and performs well.”
Decision Trees
• Can represent any Boolean Function
• Can be viewed as a way to compactly represent a lot of data.
• Natural representation (think of the game of 20 questions)
• Evaluating a Decision Tree classifier is easy
• Labels: here a binary classification task, Y = {+, -}
Will I play tennis today?

#   O   T   H   W   Play?
1   S   H   H   W   -
2   S   H   H   S   -
3   O   H   H   W   +
4   R   M   H   W   +
5   R   C   N   W   +
6   R   C   N   S   -
7   O   C   N   S   +
8   S   M   H   W   -
9   S   C   N   W   +
10  R   M   N   W   +
11  S   M   N   S   +
12  O   M   H   S   +
13  O   H   N   W   +
14  R   M   H   S   -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)
Basic Decision Trees Learning Algorithm
• Data is processed in batch (i.e., all the data is available). Algorithm?
• Recursively build a decision tree top down.

Resulting tree for the "play tennis" data above:

Outlook
  Sunny    → Humidity:  High → No,  Normal → Yes
  Overcast → Yes
  Rain     → Wind:  Strong → No,  Weak → Yes
Basic Decision Tree Algorithm
• Let S be the set of examples
• Label is the target attribute (the prediction)
• Attributes is the set of measured attributes

ID3(S, Attributes, Label)
  If all examples in S have the same label: return a single-node tree with that Label
  Otherwise:
    A = the attribute in Attributes that best classifies S (create a Root node for the tree)
    For each possible value v of A:
      Add a new tree branch corresponding to A = v
      Let Sv be the subset of examples in S with A = v
      If Sv is empty: add a leaf node with the most common value of Label in S
        (why? so that unseen values can still be handled at evaluation time)
      Else: below this branch add the subtree ID3(Sv, Attributes − {A}, Label)
    End For
  Return Root
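A minimal runnable sketch of ID3 in Python (an assumed illustration, not the course's reference code), selecting the splitting attribute by information gain; on the play-tennis table above it should reproduce the tree shown earlier.

```python
# ID3 sketch (assumed illustration): pick the attribute with the highest
# information gain at each node and recurse on the induced subsets.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    n = len(labels)
    remainder = 0.0
    for v in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, attributes):
    if len(set(labels)) == 1:                          # all examples share one label
        return labels[0]
    if not attributes:                                 # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}
    for v in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        tree[best][v] = id3([examples[i] for i in idx],
                            [labels[i] for i in idx],
                            [a for a in attributes if a != best])
    return tree

# Play-tennis data from the table above (O, T, H, W -> Play?)
rows = [("S","H","H","W","-"), ("S","H","H","S","-"), ("O","H","H","W","+"),
        ("R","M","H","W","+"), ("R","C","N","W","+"), ("R","C","N","S","-"),
        ("O","C","N","S","+"), ("S","M","H","W","-"), ("S","C","N","W","+"),
        ("R","M","N","W","+"), ("S","M","N","S","+"), ("O","M","H","S","+"),
        ("O","H","N","W","+"), ("R","M","H","S","-")]
examples = [dict(zip(["Outlook", "Temp", "Humidity", "Wind"], r[:4])) for r in rows]
labels = [r[4] for r in rows]
print(id3(examples, labels, ["Outlook", "Temp", "Humidity", "Wind"]))
```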
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
• But, finding the minimal decision tree consistent with the data is NP-hard
• The recursive algorithm is a greedy heuristic search for a simple
tree, but cannot guarantee optimality.
• The main decision in the algorithm is the selection of the next
attribute to condition on.
Picking the Root Attribute
• Consider data with two Boolean attributes (A, B):
  < (A=0, B=0), − >: 50 examples
  < (A=0, B=1), − >: 50 examples
  < (A=1, B=0), − >: 0 examples
  < (A=1, B=1), + >: 100 examples
• What should be the first attribute we select?
• Splitting on A: we get purely labeled nodes (A=1 → +, A=0 → −).
• Splitting on B: we do not get purely labeled nodes (B=0 → −, B=1 → still need A).
• What if we have < (A=1, B=0), − >: 3 examples?
• (One way to think about it: the number of queries required to label a random data point.)
Picking the Root Attribute
• Consider data with two Boolean attributes (A, B):
  < (A=0, B=0), − >: 50 examples
  < (A=0, B=1), − >: 50 examples
  < (A=1, B=0), − >: 3 examples
  < (A=1, B=1), + >: 100 examples
• What should be the first attribute we select?
• The trees look structurally similar; which attribute should we choose?
  Advantage A. But… we need a way to quantify things.
  Splitting on A: A=0 → 100 −;  A=1 → 100 +, 3 −
  Splitting on B: B=0 → 53 −;   B=1 → 100 +, 50 −
• One way to think about it: the number of queries required to label a random data point.
• If we choose A we have less uncertainty about the labels.
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
• The main decision in the algorithm is the selection of the next attribute
to condition on.
• We want attributes that split the examples into sets that are relatively pure in one label; this way we are closer to a leaf node.
• The most popular heuristic is based on information gain, originating with the ID3 system of Quinlan.
Entropy
• Entropy (impurity, disorder) of a set of examples S, relative to a binary classification, is:
  $Entropy(S) = -p_+ \log p_+ - p_- \log p_-$
• p+ is the proportion of positive examples in S and
• p− is the proportion of negative examples in S
• If all the examples belong to the same category: Entropy = 0
• If the examples are equally mixed (0.5, 0.5): Entropy = 1
• Entropy = level of uncertainty.
• In general, when pi is the fraction of examples labeled i:
  $Entropy(S) = Entropy(p_1, p_2, \ldots, p_k) = -\sum_{i=1}^{k} p_i \log p_i$
• Entropy can be viewed as the number of bits required, on average, to encode the class labels. If the probability of + is 0.5, a single bit is required for each example; if it is 0.8, we can use less than 1 bit.
Entropy
(Convince yourself that the maximum value is log k.)
(Also note that the base of the log only introduces a constant factor; therefore, we will work in base 2.)
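As a quick check of these properties, a minimal entropy sketch in Python (an assumed illustration, not course code):

```python
# Entropy sketch (assumed illustration): impurity of a labeled set of examples.
import math
from collections import Counter

def entropy(labels, base=2):
    n = len(labels)
    probs = [c / n for c in Counter(labels).values()]
    return sum(-p * math.log(p, base) for p in probs if p > 0)

print(entropy(["+"] * 10))                        # 0.0: all examples in one class
print(entropy(["+", "-"] * 5))                    # 1.0: an even (0.5, 0.5) mix
print(round(entropy(["+"] * 8 + ["-"] * 2), 3))   # 0.722: a (0.8, 0.2) mix needs < 1 bit on average
```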
Information Gain
• The information gain of an attribute a is the expected reduction in entropy caused by partitioning on this attribute:
  $Gain(S, a) = Entropy(S) - \sum_{v \in values(a)} \frac{|S_v|}{|S|}\, Entropy(S_v)$
• Where:
  • Sv is the subset of S for which attribute a has value v, and
  • the entropy of partitioning the data is calculated by weighting the entropy of each partition by its size relative to the original set.
  (Example split: Outlook → Sunny / Overcast / Rain)
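A minimal information-gain sketch (an assumed illustration, not course code); run on the Outlook column of the play-tennis data it reproduces the value worked out on the slides below.

```python
# Information gain sketch (assumed illustration): expected reduction in entropy
# obtained by partitioning the examples on one attribute.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    n = len(labels)
    expected = 0.0
    for v in set(attribute_values):
        subset = [lab for a, lab in zip(attribute_values, labels) if a == v]
        expected += len(subset) / n * entropy(subset)   # weight by |Sv| / |S|
    return entropy(labels) - expected

# Outlook column and Play? labels from the play-tennis table
outlook = list("SSORRROSSRSOOR")
play    = list("--+++-+-+++++-")
print(round(information_gain(outlook, play), 3))   # ≈ 0.247 (the slides round intermediate values and get 0.246)
```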
Will I play tennis today?
Calculate the current entropy of the 14 examples:
• $p_+ = \frac{9}{14}$, $p_- = \frac{5}{14}$
• $Entropy(Play) = -p_+ \log_2 p_+ - p_- \log_2 p_- = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94$
Information Gain: Outlook
$Gain(S, a) = Entropy(S) - \sum_{v \in values(a)} \frac{|S_v|}{|S|}\, Entropy(S_v)$

Outlook = Sunny:    p+ = 2/5, p− = 3/5  →  Entropy(O = S) = 0.971
Outlook = Overcast: p+ = 4/4, p− = 0    →  Entropy(O = O) = 0
Outlook = Rainy:    p+ = 3/5, p− = 2/5  →  Entropy(O = R) = 0.971

Expected entropy = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.694

Information gain = 0.940 − 0.694 = 0.246
Information Gain: Humidity
$Gain(S, a) = Entropy(S) - \sum_{v \in values(a)} \frac{|S_v|}{|S|}\, Entropy(S_v)$

Humidity = High:   p+ = 3/7, p− = 4/7  →  Entropy(H = H) = 0.985
Humidity = Normal: p+ = 6/7, p− = 1/7  →  Entropy(H = N) = 0.592

Expected entropy = (7/14)×0.985 + (7/14)×0.592 = 0.789

Information gain = 0.940 − 0.789 = 0.151
Which feature to split on?
Information gain:
• Outlook: 0.246
• Humidity: 0.151
• Wind: 0.048
• Temperature: 0.029
→ Split on Outlook
An Illustrative Example (III)
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
(Outlook becomes the root of the tree.)
An Illustrative Example (III)
Outlook
  Sunny:    examples 1,2,8,9,11   (2+, 3−)  → ?
  Overcast: examples 3,7,12,13    (4+, 0−)  → Yes
  Rain:     examples 4,5,6,10,14  (3+, 2−)  → ?
An Illustrative Example (III)
Continue splitting until:
• every attribute is included in the path, or
• all examples in the leaf have the same label.
An Illustrative Example (IV)
For the Sunny branch (examples 1,2,8,9,11; 2+, 3−):
Gain(S_sunny, Humidity) = 0.97 − (3/5)·0 − (2/5)·0 = 0.97
Gain(S_sunny, Temperature) = 0.97 − (2/5)·0 − (2/5)·1 − (1/5)·0 = 0.57
Gain(S_sunny, Wind) = 0.97 − (2/5)·1 − (3/5)·0.92 = 0.02
→ Split on Humidity
An Illustrative Example (V)
Outlook
  Sunny:    Humidity (High → No, Normal → Yes)
  Overcast: Yes
  Rain:     ?
induceDecisionTree(S)
1. Does S uniquely define a class?
   if all s ∈ S have the same label y: return S;
2. Find the attribute Xi that best splits S (e.g., by information gain).
3. Add children to S:
   for k in Values(Xi):
     Sk = {s ∈ S | xi = k}
     addChild(S, Sk)
     induceDecisionTree(Sk)
   return S;
An Illustrative Example (VI)
Final tree:
Outlook
  Sunny:    Humidity (High → No, Normal → Yes)
  Overcast: Yes
  Rain:     Wind (Strong → No, Weak → Yes)
Decision Tree Regression
• A leaf node (e.g., Hours Played) represents a decision on the numerical target.
Decision Tree Algorithm in Regression
• Standard deviation for one attribute: splits are chosen to maximize the reduction in the standard deviation of the target (see https://www.saedsayad.com/decision_tree_reg.htm and the sketch below).
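A minimal sketch of the standard-deviation-reduction split criterion (an assumed illustration, not the course code; the "Hours Played" values are toy numbers made up for the example):

```python
# Standard deviation reduction (SDR) sketch for regression-tree splits
# (assumed illustration; attribute and target values are made up).
import statistics
from collections import defaultdict

def sdr(attribute_values, targets):
    """Reduction in the target's standard deviation achieved by splitting on the attribute."""
    groups = defaultdict(list)
    for a, y in zip(attribute_values, targets):
        groups[a].append(y)
    weighted = sum(len(g) / len(targets) * statistics.pstdev(g) for g in groups.values())
    return statistics.pstdev(targets) - weighted

outlook = ["S", "S", "O", "R", "R", "R", "O", "S", "S", "R", "S", "O", "O", "R"]
hours   = [25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30]   # toy "Hours Played" target
print(round(sdr(outlook, hours), 2))   # the attribute with the largest SDR is chosen for the split
```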
Underfitting and Overfitting
[Figure: expected error, bias, and variance vs. model complexity; underfitting on the left, overfitting on the right.]
Source: González, S., García, S., Del Ser, J., Rokach, L., & Herrera, F. (2020). A practical tutorial on bagging and boosting based ensembles for machine learning: Algorithms, software tools, performance study, practical perspectives and opportunities. Information Fusion, 64, 205-237.
Random Forest
Algorithm (a code sketch follows below):
• Bootstrap Aggregating (Bagging)
  Bootstrap samples help improve predictive stability;
  ensemble learning over several decision trees.
• Aggregated Prediction
  Averaging for regression, (majority) voting for classification.
• Feature Importance
  Provides an evaluation of feature quality.
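A minimal scikit-learn sketch of these three ingredients (an assumed illustration; the synthetic data and parameter values are made up):

```python
# Random Forest Regression sketch (assumed illustration with synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(size=(500, 3))                               # e.g., temperature, humidity, lagged load
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 500)      # target, e.g., load

# Bagging: each tree is trained on a bootstrap sample of the data.
forest = RandomForestRegressor(n_estimators=200, bootstrap=True, random_state=0)
forest.fit(X, y)

# Aggregated prediction: the forest averages the individual trees' outputs.
print(forest.predict(X[:3]))
print(np.mean([tree.predict(X[:3]) for tree in forest.estimators_], axis=0))  # same values (up to rounding)

# Feature importance: impurity-based evaluation of feature quality.
print(forest.feature_importances_)
```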
Feature Importance
• Train Data
  Mean Absolute Error on Actual and Forecasted values: 113.66473723475617
  Mean Absolute Error on Actual and Forecasted values: 40393.45235376421
• Validation Data
  Mean Absolute Error on Actual and Forecasted values: 397.9795535050539
  Mean Absolute Error on Actual and Forecasted values: 386917.71175304166
• Test Data
  Mean Absolute Error on Actual and Forecasted values: 407.28569900058994
  Mean Absolute Error on Actual and Forecasted values: 415946.52320662094
Feature Importance