8. Linear Regression
Splitting the Given Data into Training Data (TRD) and Test Data (TSD)
The given dataset is split into training data and test data.
• Different training data sets (TRDs) can be treated as different samples from the same distribution (population).
• For each TRD, an ML model can be trained.
• Each learnt model can be subjected to the test data (TSD) for assessment of accuracy.
Linear Regression
Challenge: Learning the Source Curve (Black) from Training Data Points Incorporating Irreducible Error
For a given input $x_i$, the irreducible error $\epsilon_{IR}$ corresponds to the difference between
• the actual output $y_i$ as per the data set (the given training data points), and
• the expected $y_i$ based on the black curve (the perceived reality).
If the data suggests a linear distribution (indicating that the black curve is linear), we approximate it using a straight line with a specified intercept and slope.
Simple Linear Regression (one independent variable)
Assumptions
$A_1$: The data can be well represented by a straight line:
$$Y = \rho_0 + \rho_1 X + \epsilon$$
where
• $X$ is known exactly (it is not a random variable),
• $\epsilon$ is a random variable, independent of $X$,
• $Y$ is a random variable.
$A_2$: The RV $\epsilon \sim N(0, \sigma^2)$:
I. The population mean of $\epsilon$ is zero, $E(\epsilon) = 0$, i.e. the $\epsilon$'s above and below the population generator (the black straight line) negate each other.
II. The variance of $\epsilon$ is an unknown constant: $Var(\epsilon) = \sigma^2$.
Consequently, for each $x_j$:
$$E(y_j) = \rho_0 + \rho_1 x_j, \qquad y_j \sim N(\rho_0 + \rho_1 x_j, \sigma^2)$$
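Assumptions $A_1$–$A_2$ describe a simple generative process. Below is a minimal sketch of drawing one data set from such a population, assuming illustrative parameter values ($\rho_0 = 2$, $\rho_1 = 0.5$, $\sigma = 1$) that are not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative population parameters (not from the slides).
rho0, rho1, sigma = 2.0, 0.5, 1.0

# A1: X is known exactly (a fixed design), not a random variable.
x = np.linspace(0, 10, 50)

# A2: eps ~ N(0, sigma^2), independent of X, so y_j ~ N(rho0 + rho1*x_j, sigma^2).
eps = rng.normal(loc=0.0, scale=sigma, size=x.size)
y = rho0 + rho1 * x + eps
```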
Simple Linear Regression (one independent variable)
Finding the Best-fit Linear Model for any TRD: the Principle
The principle of least squares: choose the intercept $\beta_0$ and slope $\beta_1$ that minimize the sum of the squared residuals over the TRD.
Simple Linear Regression (one independent variable)
Finding the Best-fit Linear Model for any TRD: the underlying Mathematical formulation
Residual: $e_j = y_j - \hat{y}_j$
The least-squares solution makes the residuals satisfy:
$P_1$: $\sum_{j=1}^{n} e_j = 0$
$P_2$: $\sum_{j=1}^{n} x_j e_j = 0$
$P_3$: $\sum_{j=1}^{n} \hat{y}_j e_j = \sum (\beta_0 + \beta_1 x_j)\, e_j = \beta_0 \sum e_j + \beta_1 \sum x_j e_j = 0$, by $P_1$ and $P_2$
From $P_1$: $\sum_{j=1}^{n} y_j = \sum_{j=1}^{n} \hat{y}_j$
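These three residual properties can be verified numerically on any least-squares fit. A short sketch, using synthetic illustrative data (the data-generating values here are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, x.size)  # illustrative data

# Least-squares fit; np.polyfit returns [slope, intercept] for degree 1.
b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)                          # residuals e_j

print(np.sum(e))                  # P1: ~0 (up to floating point)
print(np.sum(x * e))              # P2: ~0
print(np.sum((b0 + b1 * x) * e))  # P3: ~0, by P1 and P2
```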
Simple Linear Regression (one independent variable)
An example on Least Squares fit
Based on the past student data, tabulated on the RHS:
• Use the method of least squares to find the equation for the prediction of a student's end-term marks based on the student's mid-term marks.
• Predict the end-term marks of a student who has received 86 marks in the mid-term exam of the Statistical Machine Learning course.
$$\beta_1 = \frac{\sum_{j=1}^{n}(x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^{n}(x_j - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$
$$\beta_1 = \frac{2004}{3445.67} = 0.5816, \qquad \beta_0 = 32.02$$
$$\hat{y} = 32.02 + 0.5816\,x$$
At $x = 86$: $\hat{y} = 82.04$.
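A small sketch reproducing the slide's arithmetic. The raw marks table is not reproduced in this extract, so the code starts from the summary statistics given above:

```python
# Summary statistics as given on the slide (the raw marks table is not
# reproduced in this extract).
S_XY = 2004.0     # sum over j of (x_j - xbar)(y_j - ybar)
S_XX = 3445.67    # sum over j of (x_j - xbar)^2

beta1 = S_XY / S_XX            # 0.5816
beta0 = 32.02                  # ybar - beta1 * xbar, as given on the slide

y_hat_86 = beta0 + beta1 * 86  # prediction at x = 86
print(beta1, y_hat_86)         # ~0.5816, ~82.04
```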
Simple Linear Regression (one independent variable)
Statistical Properties of Least Squares Estimators: $\beta_0$ and $\beta_1$
It is important to note that the given data is just a sample, drawn from the population distribution, where given an $x_j$:
$$y_j = \rho_0 + \rho_1 x_j + \epsilon_j, \qquad E(y_j) = \rho_0 + \rho_1 x_j, \qquad \epsilon_j \sim N(0, \sigma^2)$$
$P_1$: Both $\beta_0$ and $\beta_1$ are linear combinations of the observations $y_j$.
$$\beta_1 = \frac{\sum_{j=1}^{n}(x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^{n}(x_j - \bar{x})^2} = \frac{\sum_{j=1}^{n}(x_j - \bar{x})\,y_j}{\sum_{j=1}^{n}(x_j - \bar{x})^2} - \frac{\bar{y}\sum_{j=1}^{n}(x_j - \bar{x})}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$$
The second term vanishes, since $\sum_{j=1}^{n}(x_j - \bar{x}) = \sum_j x_j - n\bar{x} = \sum_j x_j - n\,\frac{\sum_j x_j}{n} = 0$. Hence
$$\beta_1 = \sum_{j=1}^{n} c_j y_j, \qquad \text{where } c_j = \frac{x_j - \bar{x}}{\sum_{j=1}^{n}(x_j - \bar{x})^2} = \frac{x_j - \bar{x}}{S_{XX}}$$
$$\beta_0 = \bar{y} - \beta_1 \bar{x} = \frac{1}{n}\sum_{j=1}^{n} y_j - \bar{x}\sum_{j=1}^{n} c_j y_j = \sum_{j=1}^{n} y_j \left[\frac{1}{n} - \bar{x}\,c_j\right] \rightarrow \text{a linear combination of the } y_j$$
In $P_2$ we will use: $\sum_{j=1}^{n}(x_j - \bar{x}) = 0 \;\Rightarrow\; \sum_{j=1}^{n} c_j = \frac{1}{S_{XX}}\sum_{j=1}^{n}(x_j - \bar{x}) = 0$; also $\sum_{j=1}^{n}(x_j - \bar{x})\,\bar{x} = 0$.
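The weights $c_j$ make this easy to check numerically. A sketch on illustrative synthetic data (the data values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(0, 1, x.size)  # illustrative data

S_XX = np.sum((x - x.mean()) ** 2)
c = (x - x.mean()) / S_XX        # the weights c_j

beta1_lc = np.sum(c * y)         # beta1 as a linear combination of the y_j
beta1_ls, _ = np.polyfit(x, y, 1)
print(beta1_lc, beta1_ls)        # identical up to floating point

print(np.sum(c), np.sum(c * x))  # sum c_j = 0 and sum c_j x_j = 1 (used in P2)
```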
Simple Linear Regression (one independent variable)
Statistical Properties of Least Squares Estimators: $\beta_0$ and $\beta_1$
$P_2$: Both $\beta_0$ and $\beta_1$ are unbiased estimators of $\rho_0$ and $\rho_1$, respectively.
With $\beta_1 = \sum_{j=1}^{n} c_j y_j$, $c_j = \frac{x_j - \bar{x}}{S_{XX}}$, and $E(y_j) = \rho_0 + \rho_1 x_j$ (which uses $E(\epsilon_j) = 0$):
$$E(\beta_1) = E\left(\sum_{j=1}^{n} c_j y_j\right) = \sum_{j=1}^{n} c_j E(y_j) = \rho_0 \sum_{j=1}^{n} c_j + \rho_1 \sum_{j=1}^{n} c_j x_j$$
The two sums evaluate to 0 and 1:
$$\sum_{j=1}^{n} c_j = \frac{1}{S_{XX}}\sum_{j=1}^{n}(x_j - \bar{x}) = \frac{1}{S_{XX}}\left[\sum_{j} x_j - n\bar{x}\right] = \frac{1}{S_{XX}}\left[\sum_j x_j - n\,\frac{\sum_j x_j}{n}\right] = 0$$
$$\sum_{j=1}^{n} c_j x_j = \frac{1}{S_{XX}}\sum_{j=1}^{n}(x_j - \bar{x})\,x_j = \frac{1}{S_{XX}}\left[\sum_{j=1}^{n}(x_j - \bar{x})(x_j - \bar{x}) + \sum_{j=1}^{n}(x_j - \bar{x})\,\bar{x}\right] = \frac{1}{S_{XX}}\left[S_{XX} + \bar{x}\cdot 0\right] = 1$$
(In $P_1$ we used $\sum_{j=1}^{n}(x_j - \bar{x})\,\bar{y} = 0$; in $P_2$ we use $\sum_{j=1}^{n}(x_j - \bar{x})\,\bar{x} = 0$.)
Hence $E(\beta_1) = \rho_0 \cdot 0 + \rho_1 \cdot 1 = \rho_1$.
For the intercept, since $E(\bar{y}) = \rho_0 + \rho_1\bar{x}$ and $E(\beta_1) = \rho_1$:
$$E(\beta_0) = E(\bar{y} - \beta_1\bar{x}) = \rho_0 + \rho_1\bar{x} - \rho_1\bar{x} = \rho_0$$
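Unbiasedness can be illustrated by Monte Carlo: draw many TRDs from one population (assumed, illustrative parameters below) and average the fitted coefficients. A sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
rho0, rho1, sigma = 2.0, 0.5, 1.0   # assumed population parameters
x = np.linspace(0, 10, 20)          # fixed design, reused for every TRD

b0s, b1s = [], []
for _ in range(20000):              # many TRDs from the same population
    y = rho0 + rho1 * x + rng.normal(0, sigma, x.size)
    b1, b0 = np.polyfit(x, y, 1)
    b0s.append(b0)
    b1s.append(b1)

# The averages of beta0 and beta1 over TRDs approach rho0 and rho1.
print(np.mean(b0s), np.mean(b1s))   # ~2.0, ~0.5
```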
Simple Linear Regression (one independent variable): Preparation for Hypothesis Testing
Expectation and Variance of Least Squares Estimators ($\beta_0$ and $\beta_1$); Gauss-Markov Theorem
$$E(\beta_1) = \rho_1, \qquad E(\beta_0) = \rho_0$$
Just like the derivations on expectations, the derivations for variance lead to
$$Var(\beta_1) = \frac{Var(\epsilon_j)}{S_{XX}} = \frac{\sigma^2}{S_{XX}}, \qquad Var(\beta_0) = \sigma^2\left[\frac{\bar{x}^2}{S_{XX}} + \frac{1}{n}\right]$$
The Standard Errors (SEs) are used:
• in hypothesis testing,
• to generate confidence intervals for the population parameters $\rho_0$ and $\rho_1$ based on $\beta_0$ and $\beta_1$, respectively.
$$SE(\beta_1) = \sqrt{Var(\beta_1)} = \sqrt{\frac{\sigma^2}{S_{XX}}}, \qquad SE(\beta_0) = \sqrt{Var(\beta_0)} = \sqrt{\sigma^2\left[\frac{\bar{x}^2}{S_{XX}} + \frac{1}{n}\right]}$$
The Gauss-Markov property assures that the least-squares estimators, which are linear combinations of the $y_j$, are BLUE: Best Linear Unbiased Estimators, i.e. the ones with minimum variance in the class of unbiased linear estimators. However, there is no guarantee that the minimum variance will be small (covered later).
Simple Linear Regression (one independent variable): Preparation for Hypothesis Testing
Residual Mean Square (RMS) = RSS/(n-2) = Unbiased estimator for the Error Variance $\sigma^2$
When $\sigma$ is known:
$$SE(\beta_1) = \sqrt{\frac{\sigma^2}{S_{XX}}}, \qquad SE(\beta_0) = \sqrt{\sigma^2\left[\frac{\bar{x}^2}{S_{XX}} + \frac{1}{n}\right]}$$
When $\sigma$ is unknown, we estimate it. $E(RSS) = E(S_{YY}) - E(\beta_1^2 S_{XX})$, where some maths shows $E(S_{YY}) = (n-1)\sigma^2 + \rho_1^2 S_{XX}$ and $E(\beta_1^2 S_{XX}) = \sigma^2 + \rho_1^2 S_{XX}$. Hence
$$E(RSS) = (n-2)\,\sigma^2$$
$$\text{Residual Mean Square: } RMS = \frac{RSS}{df} \equiv \frac{RSS}{n-2}$$
$$E(RMS) = E\left(\frac{RSS}{n-2}\right) = \sigma^2$$
RMS is an unbiased estimator of $\sigma^2$, hence RMS can be used in place of $\sigma^2$:
$$SE(\beta_1) = \sqrt{\frac{RMS}{S_{XX}}}, \qquad SE(\beta_0) = \sqrt{RMS\left[\frac{\bar{x}^2}{S_{XX}} + \frac{1}{n}\right]}$$
We now need to understand the distribution of RMS, setting the context for the t-test.
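A quick Monte Carlo check (assumed, illustrative parameters) that RMS averages to $\sigma^2$ across TRDs:

```python
import numpy as np

rng = np.random.default_rng(4)
rho0, rho1, sigma = 2.0, 0.5, 1.5   # assumed population parameters
x = np.linspace(0, 10, 12)
n = x.size

rms_values = []
for _ in range(20000):
    y = rho0 + rho1 * x + rng.normal(0, sigma, n)
    b1, b0 = np.polyfit(x, y, 1)
    rss = np.sum((y - (b0 + b1 * x)) ** 2)
    rms_values.append(rss / (n - 2))        # RMS = RSS / (n - 2)

print(np.mean(rms_values), sigma ** 2)      # E(RMS) ~ sigma^2 = 2.25
```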
Simple Linear Regression (one independent variable): Preparation for Hypothesis Testing
Residual Mean Square (RMS): Knowing its Distribution
The sum of squares of $K$ independent standard normal ($Z$) variables follows a $\chi^2$ distribution with $K$ degrees of freedom:
$$\chi^2_{df=K} = \sum_{i=1}^{K} Z_i^2$$
$$RSS = \sum_{j=1}^{n} e_j^2, \qquad \text{where each } e_j \sim N(0, \sigma^2) \;\rightarrow\; \text{each } \frac{e_j - 0}{\sigma} \sim N(0,1) \sim Z \;\rightarrow\; \text{each } \left(\frac{e_j}{\sigma}\right)^2 \sim Z^2 \sim \chi^2_1$$
Naively, $\frac{RSS}{\sigma^2} = \sum_{j=1}^{n}\left(\frac{e_j}{\sigma}\right)^2$ should follow a $\chi^2$ distribution with $dof = n$. But the residuals satisfy two linear constraints ($\sum_j e_j = 0$ and $\sum_j x_j e_j = 0$, from estimating $\beta_0$ and $\beta_1$), so only $n-2$ of them are independent.
$$\text{Therefore: } \frac{RSS}{\sigma^2} = \sum \left(\frac{e_j}{\sigma}\right)^2 \;\rightarrow\; \text{sum of } (n-2) \text{ independent } Z^2 \text{ variables} \;\rightarrow\; \chi^2_{n-2}$$
$$\text{Since } RMS = \frac{RSS}{n-2}: \qquad \frac{(n-2)\,RMS}{\sigma^2} = \frac{RSS}{\sigma^2} \sim \chi^2_{n-2}$$
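A sketch comparing the empirical moments of $RSS/\sigma^2$ against $\chi^2_{n-2}$, again with assumed, illustrative parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
rho0, rho1, sigma = 2.0, 0.5, 1.0   # assumed population parameters
x = np.linspace(0, 10, 12)
n = x.size

samples = []
for _ in range(20000):
    y = rho0 + rho1 * x + rng.normal(0, sigma, n)
    b1, b0 = np.polyfit(x, y, 1)
    samples.append(np.sum((y - (b0 + b1 * x)) ** 2) / sigma ** 2)

# Empirical mean/variance of RSS/sigma^2 vs. chi2(n-2): mean n-2, variance 2(n-2).
print(np.mean(samples), np.var(samples))                    # ~10, ~20 for n = 12
print(stats.chi2.mean(df=n - 2), stats.chi2.var(df=n - 2))
```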
Simple Linear Regression (one independent variable): Preparation for Hypothesis Testing
Setting the context for the t-test: using the Expectation & Variance of $\beta_0$, $\beta_1$ + the RMS distribution
Important results:
$$E(\beta_1) = \rho_1 \;\rightarrow\; \beta_1 \sim N\left(\rho_1, \frac{\sigma^2}{S_{XX}}\right) \;\rightarrow\; \frac{\beta_1 - \rho_1}{\sqrt{\sigma^2 / S_{XX}}} \sim N(0,1)$$
$$E(\beta_0) = \rho_0 \;\rightarrow\; \beta_0 \sim N\left(\rho_0, \sigma^2\left[\frac{\bar{x}^2}{S_{XX}} + \frac{1}{n}\right]\right) \;\rightarrow\; \frac{\beta_0 - \rho_0}{\sqrt{\sigma^2\left[\frac{\bar{x}^2}{S_{XX}} + \frac{1}{n}\right]}} \sim N(0,1)$$
$$\frac{RSS}{\sigma^2} = \frac{(n-2)\,RMS}{\sigma^2} \sim \chi^2_{n-2}$$
Important fact: let $X$ and $Y$ be independent RVs s.t. $X \sim N(0,1)$ and $Y \sim \chi^2_n$. Then $\frac{X}{\sqrt{Y/n}} \sim t_n$ (t distribution with $dof = n$).
For $\beta_1$: let $X = \frac{\beta_1 - \rho_1}{\sqrt{\sigma^2/S_{XX}}} \sim N(0,1)$ and $Y = \frac{(n-2)\,RMS}{\sigma^2} \sim \chi^2_{n-2}$. The unknown $\sigma^2$ cancels:
$$\frac{X}{\sqrt{Y/(n-2)}} = \frac{\beta_1 - \rho_1}{\sqrt{RMS / S_{XX}}} \sim t_{n-2}$$
Similarly for $\beta_0$, with $X = \frac{\beta_0 - \rho_0}{\sqrt{\sigma^2\left[\frac{\bar{x}^2}{S_{XX}} + \frac{1}{n}\right]}}$:
$$\frac{\beta_0 - \rho_0}{\sqrt{RMS\left[\frac{\bar{x}^2}{S_{XX}} + \frac{1}{n}\right]}} \sim t_{n-2}$$
Simple Linear Regression (one independent variable)
Decomposing the Total Sum of Squares: TSS = RSS + ESS
$$\sum_{j=1}^{n}(y_j - \bar{y})^2 = \sum_{j=1}^{n}(y_j - \hat{y}_j)^2 + \sum_{j=1}^{n}(\hat{y}_j - \bar{y})^2$$
$$\textbf{TSS} = \textbf{RSS} + \textbf{ESS}$$
The cross-product term vanishes. Using
• $\hat{y}_j - \bar{y} = \beta_1(x_j - \bar{x})$
• $y_j - \hat{y}_j = y_j - (\beta_0 + \beta_1 x_j)$, where $\beta_0 = \bar{y} - \beta_1\bar{x}$ $\rightarrow$ $y_j - \hat{y}_j = (y_j - \bar{y}) - \beta_1(x_j - \bar{x})$
$$Prod = \sum_{j=1}^{n} \beta_1(x_j - \bar{x})\left[(y_j - \bar{y}) - \beta_1(x_j - \bar{x})\right] = \beta_1 S_{XY} - \beta_1^2 S_{XX} = \beta_1 S_{XY} - \beta_1 S_{XY} = 0, \quad \because \beta_1 = \frac{S_{XY}}{S_{XX}}$$
Coefficient of Determination (TM = the trained model):
$$R^2 = \frac{TSS - RSS}{TSS} = \frac{\text{Res without TM} - \text{Res NOT explained by TM}}{\text{Res without TM}} = \frac{\text{Res explained by TM}}{\text{Res without TM}} = \frac{ESS}{TSS}$$
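Both the decomposition and the two equivalent forms of $R^2$ can be confirmed numerically; a sketch on illustrative (assumed) data:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(0, 1, x.size)  # illustrative data

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)
ess = np.sum((y_hat - y.mean()) ** 2)

print(tss, rss + ess)                # TSS = RSS + ESS
print((tss - rss) / tss, ess / tss)  # the two equivalent forms of R^2
```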
Simple Linear Regression (one independent variable)
Setting the context for the F-test: Its computation
TSS = RSS + ESS, with one Mean Square ($\equiv$ Residual/df $\rightarrow$ a variance) per component:

Component                      Residual                                        df      Mean Square
Total                          $TSS = \sum_{j=1}^{n}(y_j - \bar{y})^2$         $n-1$   $TMS = \frac{TSS}{n-1}$
Not explained by Regression    $RSS = \sum_{j=1}^{n}(y_j - \hat{y}_j)^2$       $n-2$   $RMS = \frac{RSS}{n-2}$
Explained by Regression        $ESS = \sum_{j=1}^{n}(\hat{y}_j - \bar{y})^2$   $1$     $EMS = \frac{ESS}{1} = ESS$

• $df_{TSS} = n-1$ because $\sum_{j=1}^{n}(y_j - \bar{y}) = 0$ (one constraint).
• $df_{RSS} = n-2$ because $\sum_j e_j^2 / \sigma^2 \sim \chi^2_{n-2}$.
• $df_{ESS} = df_{TSS} - df_{RSS} = 1$; indeed $ESS = \sum_{j=1}^{n}\beta_1^2(x_j - \bar{x})^2 = \beta_1^2 S_{XX}$ is determined by the single estimate $\beta_1$.
Just as we proved: $E(RMS) = \sigma^2$.
Important fact: if $X$ and $Y$ are independent RVs s.t. $X \sim \chi^2_m$ and $Y \sim \chi^2_n$, then $\frac{X/m}{Y/n} \sim F_{m,n}$ (F distribution with $df_1 = m$, $df_2 = n$).
Under $H_0$, let $X = \frac{EMS}{\sigma^2} \sim \chi^2_1$ and $Y = \frac{(n-2)\,RMS}{\sigma^2} \sim \chi^2_{n-2}$. Then
$$\frac{EMS}{RMS} \sim F_{1,\,n-2} \equiv F_{H_0}$$
$$\text{If } F_{H_0} > F_{critical}: \text{ Reject } H_0$$
Now we have all the formulations needed; next, we apply them to model evaluation.
Simple Linear Regression (one independent variable)
Model Evaluation at Two Levels
Either of these approaches could be applied to EACH TRD, leading to as many ML models as there are TRDs.
Individual Evaluation (Local: for each TRD): the learnt ML model can be evaluated w.r.t. the WORST possible ML model (the constant at $\bar{y}$) for the same TRD + inferences on the population.
Diagnostic checks:
1. Coefficient of determination $R^2$: how good is our model compared to the worst model (w.r.t. the specific TRD itself)?
2. Test of significance (t-test or F-test): check whether the simple/multiple regression model with one/several predictors is significantly better than a model without any predictors, i.e. whether it captures a real linear relation in the population.
3. Confidence interval for coefficients: finding the lower and upper bound for a population parameter with confidence level $\psi$, based on the corresponding sample regression coefficient.
Simple Linear Regression (one independent variable)
Is the Population distribution captured better by the Trained Model or the Worst Model: t-test
Suppose we have a least-squares fit. The question is: does this fit reflect population behaviour? → Is this model a good estimate of the real linear relation between x and y in the population, vis-à-vis the worst model (y independent of x)?
❖ I. Set up the hypotheses: $H_0: \rho_1 = 0$ (no linear relation) vs. $H_1: \rho_1 \neq 0$.
❖ II. Compute the test statistic $t_{H_0} = \frac{\beta_1}{\sqrt{RMS / S_{XX}}}$ (the pivot above, with $\rho_1 = 0$), with $df = n-2$.
❖ III. Compare the computed $t_{H_0}$ against $t_{critical}$ corresponding to significance level $\alpha = 0.05$.
❑ Alternatively, refer to the p table and find the p-value ($pv$) corresponding to the given (i) $t_{H_0}$ value and (ii) $df = n-2$. → If $pv < 0.05$: Reject $H_0$.
Simple Linear Regression (one independent variable)
Is the Population distribution captured better by the Trained Model or the Worst Model: t-test
Is Y related linearly with X?

$x_j$   $y_j$   $\hat{y}_j = -0.1 + 0.7x_j$   $e_j = y_j - \hat{y}_j$
1       1       0.6                           0.4
2       1       1.3                           -0.3
3       2       2.0                           0.0
4       2       2.7                           -0.7
5       4       3.4                           0.6

$$RMS = \frac{RSS}{n-2} = \frac{\sum e_j^2}{3} = \frac{1.1}{3} = 0.3666$$
$$S_{XX} = \sum\left(x_j - \bar{X}\right)^2 = \sum x_j^2 - n\bar{X}^2 = 55 - 5 \cdot 3^2 = 10$$
$$t_{H_0:\,n-2} = \frac{\beta_1}{\sqrt{RMS / S_{XX}}} = \frac{0.7}{\sqrt{0.3666 / 10}} = 3.655, \qquad \text{where } df = n-2 = 3$$
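A sketch reproducing this worked example end to end, with a two-tailed p-value added via scipy (the p-value itself is not on the slide):

```python
import numpy as np
from scipy import stats

# The slide's data.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 1, 2, 2, 4], dtype=float)
n = len(x)

S_XX = np.sum((x - x.mean()) ** 2)              # 10
S_XY = np.sum((x - x.mean()) * (y - y.mean()))  # 7
beta1 = S_XY / S_XX                             # 0.7
beta0 = y.mean() - beta1 * x.mean()             # -0.1

e = y - (beta0 + beta1 * x)
rms = np.sum(e ** 2) / (n - 2)                  # 1.1 / 3 = 0.3667
t_H0 = beta1 / np.sqrt(rms / S_XX)              # ~3.655 with df = 3

p_value = 2 * stats.t.sf(abs(t_H0), df=n - 2)   # two-tailed p-value, ~0.035
print(t_H0, p_value)                            # p < 0.05 -> reject H0
```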
Simple Linear Regression (one independent variable)
Is the Population distribution captured better by the Trained Model or the Worst Model: F-test
Applying the F-test on the previous problem:
$$TSS = 6, \qquad RSS = \sum e_j^2 = 1.1, \qquad ESS = TSS - RSS = 4.9; \quad \text{alternatively } ESS = \beta_1^2 S_{XX} = 0.7^2 \times 10 = 4.9$$
Note: these numbers suggest a reasonably good fit, as the model explains 4.9 out of the total variation of 6.
$$RMS = \frac{RSS}{n-2} = 0.367, \qquad EMS = ESS = 4.9, \qquad F = \frac{EMS}{RMS} = \frac{4.9}{0.367} = 13.35$$
From the critical F table, with $df_1 = 1$ and $df_2 = 3$: $F_{critical} = 10.128$.
Since $F_{H_0} = 13.35 > F_{critical} = 10.128$: Reject $H_0$ → a linear relation exists.
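The same F-test in code, including the critical value that the slide reads off the F table:

```python
from scipy import stats

# Numbers from the worked example (n = 5).
n, beta1, S_XX = 5, 0.7, 10.0
tss, rss = 6.0, 1.1

ess = tss - rss            # 4.9; equivalently beta1**2 * S_XX
rms = rss / (n - 2)        # ~0.367
ems = ess / 1              # df_ESS = 1
F_H0 = ems / rms           # ~13.36 (13.35 on the slide, via rounded RMS)

F_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)  # 10.128
print(F_H0, F_crit, F_H0 > F_crit)            # True -> reject H0
```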
Finding the Confidence Interval for Coefficients with confidence level $\psi$
Finding the confidence interval for coefficients with confidence level $\psi$ implies finding the lower and upper bounds (LB & UB) within which the population parameter $\rho$ will lie, based on the corresponding sample coefficient $\beta$.
$$\text{Significance level } \alpha = 1 - \frac{\text{Confidence Level in \% } (\psi)}{100}$$
The use of $\alpha$ depends on the use of a one- or two-tailed t-test.
When $\sigma$ is known / unknown (replace $\sigma^2$ by RMS):
$$SE(\beta_1) = \sqrt{\frac{\sigma^2}{S_{XX}}} \;\rightarrow\; \sqrt{\frac{RMS}{S_{XX}}}, \qquad SE(\beta_0) = \sqrt{\sigma^2\left[\frac{\bar{x}^2}{S_{XX}} + \frac{1}{n}\right]} \;\rightarrow\; \sqrt{RMS\left[\frac{\bar{x}^2}{S_{XX}} + \frac{1}{n}\right]}$$
From $P\left(-t_{\frac{\alpha}{2},\,n-2} \le \frac{\beta_1 - \rho_1}{SE} \le t_{\frac{\alpha}{2},\,n-2}\right) = 1 - \alpha$:
$$\rho_1 = \beta_1 \pm t_{\frac{\alpha}{2},\,n-2}\sqrt{\frac{RMS}{S_{XX}}}, \qquad \rho_0 = \beta_0 \pm t_{\frac{\alpha}{2},\,n-2}\sqrt{RMS\left[\frac{\bar{x}^2}{S_{XX}} + \frac{1}{n}\right]}$$
Linear Regression: Individual Evaluation of Each ML model trained on TRDs w.r.t. the WORST Model
Diagnostic Check - 3: Confidence Interval for Coefficients
The previous problem:
• $n = 5$, $df = n - 2 = 3$
• $t_{0.025,\,3} = 3.182$
• $RSS = \sum e_j^2 = 1.1$, so $RMS = \frac{RSS}{n-2} = 0.367$
• $S_{XX} = \sum\left(x_j - \bar{X}\right)^2 = \sum x_j^2 - n\bar{X}^2 = 55 - 5 \cdot 3^2 = 10$
$$\rho_1 = \beta_1 \pm t_{\frac{\alpha}{2},\,n-2}\sqrt{\frac{RMS}{S_{XX}}}$$
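A sketch completing the computation; the slide stops at the formula, so the numeric interval below is computed here rather than quoted:

```python
import numpy as np
from scipy import stats

# Numbers from the previous problem.
n, beta1, S_XX, rss = 5, 0.7, 10.0, 1.1
rms = rss / (n - 2)                           # ~0.367

t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)  # t_{0.025,3} = 3.182
half_width = t_crit * np.sqrt(rms / S_XX)     # ~0.61

# 95% confidence interval for rho1.
print(beta1 - half_width, beta1 + half_width)  # ~(0.09, 1.31)
```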
(Covered so far: individual evaluation of each trained model. Discussed next: collective evaluation of all models.)
Linear Regression: Collective Evaluation of ALL ML models vis-à-vis the Test Data (TSD)
Bias and Variance
For a given $x \in TSD$, let
• the target be $y$,
• the estimate of the target by an ML model be $\hat{y}$,
• the Average Prediction over the ML models (here, four of them) be
$$E(\hat{y}) = \frac{\hat{y}_1 + \hat{y}_2 + \hat{y}_3 + \hat{y}_4}{4}$$
Bias (Actual vs. Average prediction): a measure of the inaccuracy of the estimate ($\hat{y}$) of the target function.
Variance (Expected [Individual prediction vs. Average prediction]): a measure of the estimator's jumpiness, i.e. how much the estimate changes with a change in the training data, vis-à-vis the average prediction:
$$Var(\hat{y}) = E\left[\left(\hat{y} - E(\hat{y})\right)^2\right] = E\left[\hat{y}^2 + E(\hat{y})^2 - 2\,\hat{y}\,E(\hat{y})\right] = E[\hat{y}^2] + E(\hat{y})^2 - 2\,E(\hat{y})\,E(\hat{y}) = E[\hat{y}^2] - E(\hat{y})^2$$
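The identity $Var(\hat{y}) = E[\hat{y}^2] - E(\hat{y})^2$ and the near-zero bias of the least-squares fit can both be illustrated by training one model per simulated TRD (assumed, illustrative population parameters) and examining the predictions at a fixed test input:

```python
import numpy as np

rng = np.random.default_rng(7)
rho0, rho1, sigma = 2.0, 0.5, 1.0   # assumed population parameters
x_train = np.linspace(0, 10, 15)
x0 = 4.0                            # a fixed test input x in the TSD
y0 = rho0 + rho1 * x0               # noiseless target at x0

# Train one model per TRD; record each model's prediction at x0.
preds = []
for _ in range(5000):
    y = rho0 + rho1 * x_train + rng.normal(0, sigma, x_train.size)
    b1, b0 = np.polyfit(x_train, y, 1)
    preds.append(b0 + b1 * x0)
preds = np.array(preds)

bias = y0 - preds.mean()                                # actual vs. average prediction
var_direct = np.mean((preds - preds.mean()) ** 2)       # E[(yhat - E[yhat])^2]
var_identity = np.mean(preds ** 2) - preds.mean() ** 2  # E[yhat^2] - E[yhat]^2
print(bias, var_direct, var_identity)                   # bias ~0; variances agree
```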