
Random Forest Regression

Dr. Eng. Adi Wibowo, S.Si., M.Kom

Departemen Ilmu Komputer/ Informatika


Universitas Diponegoro, Semarang
Machine Learning
Machine learning is a set of methods that can automatically detect patterns in data.
There are two types of machine learning:

• Supervised: We are given input samples (X) and output samples (y) of a function
  y = f(X). We would like to “learn” f and evaluate it on new data. Types:
  • Classification: y is discrete (class labels).
  • Regression: y is continuous, e.g. linear regression.

• Unsupervised: Given only samples X of the data, we compute a function f such that
  y = f(X) is “simpler”.
Supervised Learning
Supervised Learning (cont.)
Machine Learning: Bias-Variance Trade-off in Training
[Figure: training vs. testing behaviour]

Regression
• Regression is a statistical technique used to predict a
continuous outcome variable based on one or more
independent variables. It establishes a relationship
between the input variables (predictors) and the output
variable (target). Regression analyzes the correlations
between variables in a dataset and determines their
statistical significance.
• The two basic types of regression are simple linear
regression and multiple linear regression, although there
are non-linear regression methods for more complicated
data and analysis. Simple linear regression uses one
independent variable to explain or predict the outcome of
the dependent variable Y, while multiple linear regression
uses two or more independent variables to predict the
outcome (while holding all others constant).
Linear Regression
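As a quick illustration of simple linear regression, here is a minimal sketch using scikit-learn on synthetic data (the numbers and names are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on one predictor plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # one independent variable
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1.0, 100)  # true relationship: y = 3x + 5 + noise

model = LinearRegression()
model.fit(X, y)                                    # learn f: X -> y from (X, y) pairs

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction at x = 7:", model.predict([[7.0]])[0])
```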
Time-Series with Machine Learning Model
• Sliding windows turn a forecasting problem into a supervised machine learning problem (a sketch follows below).
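A minimal sketch of the sliding-window idea, assuming a univariate series stored as a plain Python list (names and values are illustrative):

```python
import numpy as np

def sliding_window(series, window_size):
    """Turn a 1-D series into (X, y) pairs: each row of X holds the previous
    `window_size` values, and y holds the value to forecast next."""
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i:i + window_size])   # lagged inputs
        y.append(series[i + window_size])     # next value = supervised target
    return np.array(X), np.array(y)

# Example: a short series of hourly load values (made-up numbers)
load = [310, 295, 280, 300, 330, 360, 390, 410]
X, y = sliding_window(load, window_size=3)
# X[0] = [310, 295, 280]  ->  y[0] = 300, and so on
```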
Multivariate vs Univariate Time-Series

Univariate example:
  Datetime               Load
  2012-01-01 00:00:00    …
  2012-01-02 00:00:00    …
  2012-01-03 00:00:00    …
  2012-01-04 00:00:00    …

Multivariate example (Temperature and Humidity are exogenous variables):
  Datetime               Temperature   Humidity   Load
  2012-01-01 00:00:00    …             …          …
  2012-01-02 00:00:00    …             …          …
  2012-01-03 00:00:00    …             …          …
  2012-01-04 00:00:00    …             …          …
Time-Series Models

Traditional Time-Series Models:
- ARIMA
- SARIMAX
- Prophet
- Vector Autoregression

Machine Learning Models:
  Shallow Learning Models:
  - Linear Regression
  - SVR
  - Random Forest Regression

  Deep Learning Models:
  - CNN
  - RNN
  - LSTM
  - Transformer
Decision Tree Training

“Decision tree regression uses a fast divide and conquer greedy algorithm that recursively splits
the data into smaller parts. This greedy algorithm can cause poor decisions in lower levels of
the tree because of the instability of the estimations. But the decision tree is one of the
machine learning algorithms that is very fast and performs well.”
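A quick sketch of fitting a decision tree regressor with scikit-learn (synthetic data; the depth limit is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: a noisy sine wave (illustrative only)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 80)

# Each greedy split partitions the data to reduce the squared error within the parts
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X, y)

print(tree.predict([[1.5], [4.0]]))   # piecewise-constant predictions
```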
Decision Trees
• Can represent any Boolean Function
• Can be viewed as a way to compactly represent a lot of data.
• Natural representation: (20 questions)
• The evaluation of the Decision Tree Classifier is easy

• Clearly, given data, there are many ways to represent it as a decision tree.
• Learning a good representation from data is the challenge.

Example tree (“Will I play tennis?”):
  Outlook
  ├── Sunny    → Humidity: High → No;  Normal → Yes
  ├── Overcast → Yes
  └── Rain     → Wind: Strong → No;  Weak → Yes

https://www.seas.upenn.edu/~cis5190/fall2019/assets/lectures/lecture-2/Lecture2-DT.pptx
Will I play tennis today?
• Features
• Outlook: {Sun, Overcast, Rain}
• Temperature: {Hot, Mild, Cool}
• Humidity: {High, Normal, Low}
• Wind: {Strong, Weak}

• Labels
• Binary classification task: Y = {+, -}

Will I play tennis today?
 #   O  T  H  W   Play?
 1   S  H  H  W    -
 2   S  H  H  S    -
 3   O  H  H  W    +
 4   R  M  H  W    +
 5   R  C  N  W    +
 6   R  C  N  S    -
 7   O  C  N  S    +
 8   S  M  H  W    -
 9   S  C  N  W    +
 10  R  M  N  W    +
 11  S  M  N  S    +
 12  O  M  H  S    +
 13  O  H  N  W    +
 14  R  M  H  S    -

Legend: Outlook: S(unny), O(vercast), R(ainy); Temperature: H(ot), M(edium), C(ool);
Humidity: H(igh), N(ormal), L(ow); Wind: S(trong), W(eak)
Basic Decision Trees Learning Algorithm
• Data is processed in batch (i.e. all the data is available).
• Recursively build a decision tree top down.

(Uses the “Will I play tennis?” dataset above.)

Resulting tree:
  Outlook
  ├── Sunny    → Humidity: High → No;  Normal → Yes
  ├── Overcast → Yes
  └── Rain     → Wind: Strong → No;  Weak → Yes
Basic Decision Tree Algorithm
• Let S be the set of examples.
• Label is the target attribute (the prediction).
• Attributes is the set of measured attributes.

ID3(S, Attributes, Label):
  If all examples are labeled the same, return a single-node tree with that Label.
  Otherwise:
    A = the attribute in Attributes that best classifies S (create a Root node for the tree)
    For each possible value v of A:
      Add a new tree branch corresponding to A = v
      Let Sv be the subset of examples in S with A = v
      If Sv is empty: add a leaf node with the most common value of Label in S
        (why? so the tree can still return a prediction at evaluation time)
      Else: below this branch add the subtree ID3(Sv, Attributes − {A}, Label)
    End For
  Return Root
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
• But, finding the minimal decision tree consistent with the data is NP-hard
• The recursive algorithm is a greedy heuristic search for a simple
tree, but cannot guarantee optimality.
• The main decision in the algorithm is the selection of the next
attribute to condition on.

Picking the Root Attribute
• Consider data with two Boolean attributes (A, B):
  < (A=0, B=0), - >: 50 examples
  < (A=0, B=1), - >: 50 examples
  < (A=1, B=0), - >: 0 examples
  < (A=1, B=1), + >: 100 examples
• What should be the first attribute we select?
  • Splitting on A: we get purely labeled nodes (A=1 → +, A=0 → -).
  • Splitting on B: we don’t get purely labeled nodes (B=0 → -, B=1 → still mixed, needs a further split on A).
• What if we have < (A=1, B=0), - >: 3 examples?
• (One way to think about it: the number of queries required to label a random data point.)
Picking the Root Attribute
• Consider data with two Boolean attributes (A, B):
  < (A=0, B=0), - >: 50 examples
  < (A=0, B=1), - >: 50 examples
  < (A=1, B=0), - >: 3 examples (instead of 0)
  < (A=1, B=1), + >: 100 examples
• What should be the first attribute we select?
• The trees look structurally similar; which attribute should we choose?
• Advantage: A. But we need a way to quantify things.
  • One way to think about it: the number of queries required to label a random data point.
  • If we choose A we have less uncertainty about the labels:
      splitting on A: A=1 → 100 +, 3 -;   A=0 → 100 -
      splitting on B: B=1 → 100 +, 50 -;  B=0 → 53 -
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
• The main decision in the algorithm is the selection of the next attribute
to condition on.
• We want attributes that split the examples into sets that are relatively pure in one
  label; this way we are closer to a leaf node.
• The most popular heuristic is based on information gain, which originated with the
  ID3 system of Quinlan.
Entropy
• Entropy (impurity, disorder) of a set of examples S, relative to a binary
  classification, is:
      Entropy(S) = -p_+ log(p_+) - p_- log(p_-)
  where p_+ is the proportion of positive examples in S and p_- is the proportion of
  negative examples in S.
• If all the examples belong to the same category: Entropy = 0.
• If the examples are equally mixed (0.5, 0.5): Entropy = 1.
• Entropy = level of uncertainty.
• In general, when p_i is the fraction of examples labeled i:
      Entropy(S) = Entropy(p_1, p_2, …, p_k) = -Σ_{i=1}^{k} p_i log(p_i)
• Entropy can be viewed as the number of bits required, on average, to encode the
  class of labels. If the probability for + is 0.5, a single bit is required for each
  example; if it is 0.8, we can use less than 1 bit.
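A small sketch of the binary entropy computation (base-2 logarithm, as in the worked examples that follow):

```python
import math

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples (base 2)."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                  # convention: 0 * log(0) = 0
            h -= p * math.log2(p)
    return h

print(entropy(9, 5))   # ~0.94  (the tennis dataset: 9 "+" and 5 "-")
print(entropy(7, 7))   # 1.0    (equally mixed)
print(entropy(4, 0))   # 0.0    (pure)
```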
Entropy (cont.)
[Figure: three example +/- label distributions and their entropies]
Entropy (cont.)
• (Convince yourself that the maximum value would be log k.)
• (Also note that the base of the log only introduces a constant factor; therefore,
  we’ll think in base 2.)
Information Gain
• The information gain of an attribute a is the expected reduction in entropy caused
  by partitioning on this attribute:

      Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|S_v| / |S|) · Entropy(S_v)

  where:
  • S_v is the subset of S for which attribute a has value v, and
  • the entropy of the partitioning is calculated by weighting the entropy of each
    partition by its size relative to the original set.
• Partitions of low entropy (imbalanced label distributions) lead to high gain.
• High entropy: high level of uncertainty. Low entropy: no uncertainty.
• Go back and check which of the A, B splits is better.
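A minimal sketch of the gain computation on a small dataset of (attribute-dict, label) pairs; entropy_of and information_gain are hypothetical helper names, not a library API:

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Gain(S, a) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    labels = [label for _, label in examples]
    gain = entropy_of(labels)
    for v in {x[attribute] for x, _ in examples}:
        subset = [label for x, label in examples if x[attribute] == v]
        gain -= (len(subset) / len(examples)) * entropy_of(subset)
    return gain
```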
Will I play tennis today?
Calculate the current entropy (using the tennis dataset above):
• p_+ = 9/14,  p_- = 5/14
• Entropy(Play) = -p_+ log2(p_+) - p_- log2(p_-)
                = -(9/14) log2(9/14) - (5/14) log2(5/14)
                ≈ 0.94
Information Gain: Outlook
Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|S_v| / |S|) · Entropy(S_v)

Outlook = Sunny:    p_+ = 2/5, p_- = 3/5   →  Entropy(O = S) = 0.971
Outlook = Overcast: p_+ = 4/4, p_- = 0/4   →  Entropy(O = O) = 0
Outlook = Rainy:    p_+ = 3/5, p_- = 2/5   →  Entropy(O = R) = 0.971

Expected entropy = Σ_{v ∈ values(a)} (|S_v| / |S|) · Entropy(S_v)
                 = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.694

Information gain = 0.940 - 0.694 = 0.246
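The same numbers can be reproduced with the hypothetical information_gain helper sketched earlier, encoding the tennis rows as dicts keyed by O, T, H, W:

```python
tennis = [
    ({"O": "S", "T": "H", "H": "H", "W": "W"}, "-"),
    ({"O": "S", "T": "H", "H": "H", "W": "S"}, "-"),
    ({"O": "O", "T": "H", "H": "H", "W": "W"}, "+"),
    ({"O": "R", "T": "M", "H": "H", "W": "W"}, "+"),
    ({"O": "R", "T": "C", "H": "N", "W": "W"}, "+"),
    ({"O": "R", "T": "C", "H": "N", "W": "S"}, "-"),
    ({"O": "O", "T": "C", "H": "N", "W": "S"}, "+"),
    ({"O": "S", "T": "M", "H": "H", "W": "W"}, "-"),
    ({"O": "S", "T": "C", "H": "N", "W": "W"}, "+"),
    ({"O": "R", "T": "M", "H": "N", "W": "W"}, "+"),
    ({"O": "S", "T": "M", "H": "N", "W": "S"}, "+"),
    ({"O": "O", "T": "M", "H": "H", "W": "S"}, "+"),
    ({"O": "O", "T": "H", "H": "N", "W": "W"}, "+"),
    ({"O": "R", "T": "M", "H": "H", "W": "S"}, "-"),
]
print(information_gain(tennis, "O"))   # ≈ 0.247 for Outlook (≈ 0.246 with the slide's rounding)
print(information_gain(tennis, "H"))   # ≈ 0.152 for Humidity (≈ 0.151 with the slide's rounding)
```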
Information Gain: Humidity
Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|S_v| / |S|) · Entropy(S_v)

Humidity = High:   p_+ = 3/7, p_- = 4/7   →  Entropy(H = H) = 0.985
Humidity = Normal: p_+ = 6/7, p_- = 1/7   →  Entropy(H = N) = 0.592

Expected entropy = Σ_{v ∈ values(a)} (|S_v| / |S|) · Entropy(S_v)
                 = (7/14) × 0.985 + (7/14) × 0.592 = 0.789

Information gain = 0.940 - 0.789 = 0.151
Which feature to split on?
Information gain:
  Outlook:     0.246
  Humidity:    0.151
  Wind:        0.048
  Temperature: 0.029

→ Split on Outlook
An Illustrative Example (III)
Gain(S, Outlook)     = 0.246
Gain(S, Humidity)    = 0.151
Gain(S, Wind)        = 0.048
Gain(S, Temperature) = 0.029
An Illustrative Example (III)
  Outlook
  ├── Sunny:    examples 1, 2, 8, 9, 11   (2+, 3-)  →  ?
  ├── Overcast: examples 3, 7, 12, 13     (4+, 0-)  →  Yes
  └── Rain:     examples 4, 5, 6, 10, 14  (3+, 2-)  →  ?
An Illustrative Example (III)
  Outlook
  ├── Sunny:    examples 1, 2, 8, 9, 11   (2+, 3-)  →  ?
  ├── Overcast: examples 3, 7, 12, 13     (4+, 0-)  →  Yes
  └── Rain:     examples 4, 5, 6, 10, 14  (3+, 2-)  →  ?

Continue until:
• every attribute is included in the path, or
• all examples in the leaf have the same label.
An Illustrative Example (IV)
  Outlook
  ├── Sunny:    examples 1, 2, 8, 9, 11   (2+, 3-)  →  ?
  ├── Overcast: examples 3, 7, 12, 13     (4+, 0-)  →  Yes
  └── Rain:     examples 4, 5, 6, 10, 14  (3+, 2-)  →  ?

For the Sunny branch:
  Gain(S_sunny, Humidity) = 0.97 - (3/5)·0 - (2/5)·0           = 0.97
  Gain(S_sunny, Temp)     = 0.97 - (2/5)·0 - (2/5)·1 - (1/5)·0 = 0.57
  Gain(S_sunny, Wind)     = 0.97 - (2/5)·1 - (3/5)·0.92        = 0.02

→ Split on Humidity
An Illustrative Example (V)
  Outlook
  ├── Sunny:    examples 1, 2, 8, 9, 11   (2+, 3-)  →  Humidity
  │               ├── High   → No
  │               └── Normal → Yes
  ├── Overcast: examples 3, 7, 12, 13     (4+, 0-)  →  Yes
  └── Rain:     examples 4, 5, 6, 10, 14  (3+, 2-)  →  ?
induceDecisionTree(S)
1. Does S uniquely define a class?
   if all s ∈ S have the same label y: return S;

2. Find the feature with the most information gain:
   i = argmax_i Gain(S, X_i)

3. Add children to S:
   for k in Values(X_i):
     S_k = {s ∈ S | x_i = k}
     addChild(S, S_k)
     induceDecisionTree(S_k)
   return S;
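A compact, runnable sketch of this recursion in Python, reusing the hypothetical information_gain helper from above (a leaf simply stores the majority label):

```python
from collections import Counter

def induce_decision_tree(examples, attributes):
    """ID3-style recursion over (feature-dict, label) pairs."""
    labels = [label for _, label in examples]
    # 1. Does the set uniquely define a class (or are there no attributes left)?
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]      # leaf: (majority) label
    # 2. Pick the attribute with the most information gain
    best = max(attributes, key=lambda a: information_gain(examples, a))
    # 3. Add one child per value of the chosen attribute
    tree = {best: {}}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        remaining = [a for a in attributes if a != best]
        tree[best][v] = induce_decision_tree(subset, remaining)
    return tree

# e.g. induce_decision_tree(tennis, ["O", "T", "H", "W"]) gives
# {'O': {'S': {'H': {'H': '-', 'N': '+'}}, 'O': '+', 'R': {'W': {'S': '-', 'W': '+'}}}}
```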
An Illustrative Example (VI)
  Outlook
  ├── Sunny:    examples 1, 2, 8, 9, 11   (2+, 3-)  →  Humidity
  │               ├── High   → No
  │               └── Normal → Yes
  ├── Overcast: examples 3, 7, 12, 13     (4+, 0-)  →  Yes
  └── Rain:     examples 4, 5, 6, 10, 14  (3+, 2-)  →  Wind
                  ├── Strong → No
                  └── Weak   → Yes
Decision Tree Regression
• A leaf node (e.g., Hours Played) represents a decision on the numerical target.
Decision Tree Algorithm in Regression
• Standard deviation for one attribute (the target):
      Avg = x̄ = (Σ x) / n,    S = sqrt( Σ (x - x̄)² / n ),    CV = (S / x̄) × 100%
• Standard Deviation (S) is used for tree building (branching).
• Coefficient of Deviation (CV) decides when to stop branching. We can use the
  Count (n) as well.
• Average (Avg) is the value in the leaf nodes.
Decision Tree Algorithm in Regression
• Standard deviation for two attributes (target T and predictor X): the weighted sum
  of the standard deviations of the target within each branch of X,
      S(T, X) = Σ_{c ∈ X} P(c) · S(c)
Decision Tree Algorithm in Regression
• Standard Deviation Reduction (SDR)
  • Step 1: The standard deviation of the target is calculated.
    Standard deviation (Hours Played) = 9.32
  • Step 2: The dataset is then split on the different attributes. The standard
    deviation for each branch is calculated, and the resulting weighted standard
    deviation is subtracted from the standard deviation before the split. The result
    is the standard deviation reduction: SDR(T, X) = S(T) - S(T, X).
  • Step 3: The attribute with the largest standard deviation reduction is chosen for
    the decision node.
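A small sketch of standard deviation reduction for one candidate attribute; the rows are (attribute-dict, numeric target) pairs and sdr is a hypothetical helper name:

```python
import statistics

def sdr(examples, attribute):
    """Standard deviation reduction: S(T) - sum_c P(c) * S(T | attribute = c)."""
    targets = [t for _, t in examples]
    before = statistics.pstdev(targets)          # S(T), population standard deviation
    after = 0.0
    for v in {x[attribute] for x, _ in examples}:
        branch = [t for x, t in examples if x[attribute] == v]
        after += (len(branch) / len(examples)) * statistics.pstdev(branch)
    return before - after

# The attribute with the largest SDR becomes the decision node, e.g.
# best = max(["Outlook", "Temp", "Humidity", "Windy"], key=lambda a: sdr(rows, a))
```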
Decision Tree Algorithm in Regression
• Standard Deviation Reduction
• Step 4a: The dataset is divided based on the values of the selected attribute. This
  process is run recursively on the non-leaf branches until all data is processed.
  • In practice, we need some termination criteria: for example, when the coefficient
    of deviation (CV) for a branch becomes smaller than a certain threshold (e.g.,
    10%) and/or when too few instances (n) remain in the branch (e.g., 3).
• Step 4b: The "Overcast" subset does not need any further splitting because its CV
  (8%) is less than the threshold (10%). The related leaf node gets the average of
  the "Overcast" subset.
Decision Tree Algorithm in Regression
• Standard Deviation Reduction
• Step 4c: However, the "Sunny" branch has a CV (28%) greater than the threshold
  (10%), so it needs further splitting. We select "Temp" as the best node after
  "Outlook" because it has the largest SDR.
  • Because the number of data points in both branches (FALSE and TRUE) is three or
    fewer, we stop further branching and assign the average of each branch to the
    related leaf node.
Decision Tree Algorithm in Regression
• Standard Deviation Reduction
• Step 4d: Moreover, the "Rainy" branch has a CV (22%), which is more than the
  threshold (10%), so this branch needs further splitting. We select "Temp" as the
  best node because it has the largest SDR.
  • Because the number of data points in all three branches (Cool, Hot and Mild) is
    three or fewer, we stop further branching and assign the average of each branch
    to the related leaf node.

https://www.saedsayad.com/decision_tree_reg.htm
Underfitting and Overfitting
[Figure: expected error, bias and variance vs. model complexity; underfitting on the
left, overfitting on the right]

• Simple models: high bias and low variance.
• Complex models: high variance and low bias.

This can be made more accurate for some loss functions. We will discuss a more precise
and general theory that trades the expressivity of models against empirical error.
History of Decision Tree Research
• Hunt and colleagues in psychology used full-search decision tree methods to model
  human concept learning in the 60s.
• Quinlan developed ID3, with the information gain heuristic, in the late 70s to learn
  expert systems from examples.
• Breiman, Friedman and colleagues in statistics developed CART (classification and
  regression trees) at around the same time.
• A variety of improvements in the 80s: coping with noise, continuous attributes,
  missing data, non-axis-parallel splits, etc.
• Quinlan’s updated algorithm, C4.5 (1993), is commonly used (newer: C5).
• Boosting (or bagging) over decision trees is a very good general-purpose algorithm.
Ensemble Learning

González, S., García, S., Del Ser, J., Rokach, L., & Herrera, F. (2020). A practical
tutorial on bagging and boosting based ensembles for machine learning: Algorithms,
software tools, performance study, practical perspectives and opportunities.
Information Fusion, 64, 205-237.
Random Forest

Algorithm:
• Bootstrap Aggregating (Bagging)
  Bootstrap samples help improve predictive stability, using ensemble learning over
  several decision trees.
• Random Feature Selection
  Prevents the dominance of a single strong feature in each tree.
• Prediction Aggregation
  Uses averaging for regression and voting for classification.
• Feature Importance
  Provides an evaluation of feature quality.
A minimal sketch of a random forest regressor follows below.
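A minimal sketch with scikit-learn's RandomForestRegressor on synthetic data (the hyperparameter values are illustrative, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic data: the target depends on three features plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 0.2, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

forest = RandomForestRegressor(
    n_estimators=200,      # number of bagged trees (bootstrap samples)
    max_features="sqrt",   # random feature selection at each split
    random_state=0,
)
forest.fit(X_train, y_train)

pred = forest.predict(X_test)      # average of the individual trees' predictions
print("MAE:", mean_absolute_error(y_test, pred))
```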
Feature Importance

“Feature importance measures the extent to which a feature contributes to reducing
impurity or increasing the quality of the splits in each decision tree that forms a
Random Forest. Features with higher Mean Decrease in Impurity (MDI) scores are
considered more influential in the model's predictions.”
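Continuing the sketch above, scikit-learn exposes the MDI-based scores through the feature_importances_ attribute:

```python
# Mean-decrease-in-impurity importances of the forest fitted above
for name, score in zip(["x0", "x1", "x2"], forest.feature_importances_):
    print(f"{name}: {score:.3f}")
# The importances sum to 1; higher values indicate more influential features.
```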
Coding
Working with Colab
Load Data
Exploratory Data Analysis
Data Preparation
Data Generation (Splitting and Normalization)
Modeling
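A minimal end-to-end sketch of the data preparation and modeling steps (the file name, column names and split ratios are assumptions for illustration; the error values reported under Evaluation below come from the lecture's own dataset, not from this sketch):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor

# Load data (hypothetical CSV with a datetime column and a "Load" target)
df = pd.read_csv("load_data.csv", parse_dates=["Datetime"], index_col="Datetime")

# Sliding-window lag features (univariate example)
for lag in range(1, 25):
    df[f"lag_{lag}"] = df["Load"].shift(lag)
df = df.dropna()
X, y = df.drop(columns=["Load"]), df["Load"]

# Chronological train / validation / test split (no shuffling for time series)
n = len(df)
i1, i2 = int(0.7 * n), int(0.85 * n)
X_train, y_train = X.iloc[:i1], y.iloc[:i1]
X_val, y_val = X.iloc[i1:i2], y.iloc[i1:i2]
X_test, y_test = X.iloc[i2:], y.iloc[i2:]

# Normalization: fit the scaler on the training split only
scaler = MinMaxScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(s) for s in (X_train, X_val, X_test))

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
```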
Evaluation

•Train Data
Mean Absolute Error on Actual and Forecasted values: 113.66473723475617
Mean Absolute Error on Actual and Forecasted values: 40393.45235376421

•Validation Data
Mean Absolute Error on Actual and Forecasted values: 397.9795535050539
Mean Absolute Error on Actual and Forecasted values: 386917.71175304166

•Test Data
Mean Absolute Error on Actual and Forecasted values: 407.28569900058994
Mean Absolute Error on Actual and Forecasted values: 415946.52320662094
Feature Importance
