8. Linear Regression

Linear Regression
Splitting the Given Data into Training Data (TRD) and Test Data (TSD)

[Figure: the given dataset is split into training data sets TRD1, TRD2, TRD3, TRD4 and test data TSD]

• Different training data sets (TRDs) can be treated as different samples from the same distribution (population)
• For each TRD, an ML model can be trained
• Each learnt model can be subjected to the test data (TSD) for assessment of accuracy

1
Linear Regression
Challenge: Learning the Source curve (black) based on training data points incorporating Irreducible Error

For any TRD:

    True reality: Y_T = f(X_T)        Perceived reality: Y_P = f(X_P) + Irreducible error [ε_IR]

[Figure: the given data points (training data, grey) scattered around the black source curve]

For a given input x_i, ε_IR corresponds to the difference between
• the actual output y_i as per the data set
• the expected y_i based on the black curve

Task/Challenge in Regression: within the confines of perceived reality, we try to learn/approximate the underlying source (black curve) based on the available training data (grey points), conscious of the fact that the grey points 'also' incorporate irreducible error.

... just that if the data suggests a linear distribution (indicating that the black curve is linear), we approximate it using a straight line with a specified intercept and slope.

2
Simple Linear Regression (one independent variable)
Assumptions

A1: Data can be well represented by a Straight line
    Y = ρ0 + ρ1 X + ε, where
    • X is known exactly (not a RV)
    • ε is a random variable, independent of X
    • Y is a random variable

A2: The RV ε ~ N(0, σ²)
    I.  The population mean of ε is zero: E(ε) = 0, i.e. the ε's above and below the population generator (the black straight line) negate each other
    II. The variance of ε is an unknown constant: Var(ε) = σ²

A3: For any x_i and x_j, ε_i and ε_j get sampled ~ N(0, σ²) independently
    I.  ε_i and ε_j are independent → Cov(ε_i, ε_j) = 0; the converse is not true
    II. y_i and y_j are independent

3
Simple Linear Regression (one independent variable)
Understanding Input Data: Expected Value and Variance

    y_j = ρ0 + ρ1 x_j + ε_j
    • x_j: a well-defined input, not a RV
    • ε_j is a RV → y_j is a RV

    E(y_j) = E(ρ0 + ρ1 x_j + ε_j) = E(ρ0 + ρ1 x_j) + E(ε_j) = ρ0 + ρ1 x_j

    V(y_j) = V(ρ0 + ρ1 x_j + ε_j) = V(ρ0 + ρ1 x_j) + V(ε_j) = V(const) + σ² = σ²

    E(y_j) = ρ0 + ρ1 x_j and V(y_j) = σ² together imply

    y_j ~ N(ρ0 + ρ1 x_j, σ²)

4
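As a quick illustration of these assumptions, the sketch below simulates the data-generating model y_j = ρ0 + ρ1 x_j + ε_j in Python/NumPy for one fixed input x and checks E(y) and V(y) against ρ0 + ρ1 x and σ². The values ρ0 = 2, ρ1 = 0.5, σ = 1 and x = 4 are assumed for illustration only; they are not from the slides.

# Minimal sketch (assumed parameter values, not from the slides):
# simulate y_j = rho0 + rho1*x_j + eps_j with eps_j ~ N(0, sigma^2) and x fixed (not a RV).
import numpy as np

rng = np.random.default_rng(0)
rho0, rho1, sigma = 2.0, 0.5, 1.0     # hypothetical population parameters
x = 4.0                               # a fixed, exactly known input

# Draw many realisations of y at the same x: y ~ N(rho0 + rho1*x, sigma^2)
eps = rng.normal(loc=0.0, scale=sigma, size=100_000)
y = rho0 + rho1 * x + eps

print("mean of y:", y.mean(), "expected:", rho0 + rho1 * x)   # ~ 4.0
print("var of y :", y.var(ddof=1), "expected:", sigma**2)     # ~ 1.0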
Simple Linear Regression (one independent variable)
Finding the Best-fit Linear Model for any TRD: the Principle

The data points (x1, y1), (x2, y2), ..., (xi, yi), ..., (xn, yn) give one equation per point:

    y1 = β0 + β1 x1
    y2 = β0 + β1 x2
    ...
    yi = β0 + β1 xi
    ...
    yn = β0 + β1 xn

Each equation is of a line where β0 is the intercept and β1 is the slope of the line. There are only 2 unknowns but far more equations, so in general there is no exact solution; we seek the least-error solution.

Determine β0 (intercept) and β1 (slope) such that the difference between the given y_j for each input x_j and the approximated ŷ_j = β0 + β1 x_j is minimum in some form.

Once β0 and β1 are determined, you get a line, and a Model: M

5
Simple Linear Regression (one independent variable)
Finding the Best-fit Linear Model for any TRD: the underlying Mathematical formulation

Goal: determine β0 (intercept) and β1 (slope) such that the residual y_j ∼ ŷ_j is minimum in some form.

Approach-1 (Least Squares Approach: Minimize Variance ≡ RSS)

    Least Squares method:  argmin over (β0, β1) of  Σ_{j=1}^n (y_j − ŷ_j)²  =  argmin over (β0, β1) of  Σ_{j=1}^n (y_j − β0 − β1 x_j)²

    With x̄ the mean of x and ȳ the mean of y, the solution is

        β1 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²  ≡  Cov(X, Y) / Var(X)
        β0 = ȳ − β1 x̄

Approach-2 (S.O.L.E.): write the system of linear equations Y = XB, where

    Y = [y1, y2, ..., yn]ᵀ  (n × 1)
    X = [[1, x1], [1, x2], ..., [1, xn]]  (n × 2)
    B = [β0, β1]ᵀ  (2 × 1)

    The solution of this SOLE with minimum error is

        B = [β0, β1]ᵀ = (XᵀX)⁻¹ XᵀY
        dimensions: XᵀX is (2 × n)(n × 2) = (2 × 2); XᵀY is (2 × n)(n × 1) = (2 × 1); so B is (2 × 2)⁻¹ (2 × 1) = (2 × 1)

6
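Both approaches can be sketched in a few lines of Python/NumPy. The small dataset below is made up purely for illustration (it is not from the slides); the closed-form estimates β1 = S_XY/S_XX, β0 = ȳ − β1 x̄ and the normal-equation solution B = (XᵀX)⁻¹XᵀY should agree.

# Minimal sketch (illustrative data, not from the slides): least-squares fit two ways.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.4, 5.2])

# Approach-1: closed form, beta1 = S_XY / S_XX, beta0 = ybar - beta1*xbar
xbar, ybar = x.mean(), y.mean()
S_xx = np.sum((x - xbar) ** 2)
S_xy = np.sum((x - xbar) * (y - ybar))
beta1 = S_xy / S_xx
beta0 = ybar - beta1 * xbar

# Approach-2: system of linear equations Y = X B, minimum-error solution B = (X^T X)^-1 X^T Y
X = np.column_stack([np.ones_like(x), x])   # n x 2 design matrix with a column of 1s for the intercept
B = np.linalg.inv(X.T @ X) @ (X.T @ y)      # 2-vector [beta0, beta1]

print(beta0, beta1)   # closed form
print(B)              # normal equations give the same values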
Simple Linear Regression (one independent variable)
Finding the Best-fit Linear Model for any TRD: the underlying Mathematical formulation
Derivation: Approach-1 (Least Squares Approach: Minimize Variance ≡ RSS)

Least Squares method: argmin over (β0, β1) of Σ_{j=1}^n (y_j − ŷ_j)²
Error: E = Σ_{j=1}^n (y_j − [β0 + β1 x_j])²
First-order Necessary Condition: ∇E = [∂E/∂β0, ∂E/∂β1] = [0, 0]

∂E/∂β0 = 0 → −2 Σ_{j=1}^n (y_j − β0 − β1 x_j) = 0
    → Σ y_j − Σ β0 − β1 Σ x_j = 0
    → n β0 = Σ y_j − β1 Σ x_j
    → β0 = (Σ y_j)/n − β1 (Σ x_j)/n = ȳ − β1 x̄

∂E/∂β1 = 0 → −2 Σ_{j=1}^n (y_j − β0 − β1 x_j) x_j = 0
    → Σ x_j y_j − β0 Σ x_j − β1 Σ x_j² = 0
    → substituting β0 = ȳ − β1 x̄:  Σ x_j y_j − Σ (ȳ − β1 x̄) x_j − β1 Σ x_j² = 0
    → Σ (x_j y_j − ȳ x_j) − β1 Σ (x_j² − x̄ x_j) = 0

    β1 = Σ (x_j y_j − ȳ x_j) / Σ (x_j² − x̄ x_j)
       = [ (Σ x_j y_j)/n − (Σ x_j)/n · (Σ y_j)/n ] / [ (Σ x_j²)/n − ((Σ x_j)/n)² ]
       = (E[xy] − E[x] E[y]) / (E[xx] − E[x] E[x])

Recall:
    V(X) = E[(X − X̄)²]  →  V(X) = E[X²] − (E[X])² = E[X·X] − E[X]·E[X]
    Cov(X, Y) = E[(X − X̄)(Y − Ȳ)]  →  Cov(X, Y) = E[X·Y] − E[X]·E[Y]

So:
    β0 = ȳ − β1 x̄
    β1 = (E[xy] − E[x] E[y]) / (E[xx] − E[x] E[x]) = Cov(X, Y)/Var(X) = S_XY / S_XX = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²   (most compact form for hand calculation)

Note β1's partial resemblance with the Pearson Correlation ρ_XY:
    ρ_XY = Cov(X, Y)/(σ_X · σ_Y) = Cov(X, Y)/√(V[X] V[Y]) = S_XY / √(S_XX S_YY)

7
Simple Linear Regression (one independent variable)
Useful properties of Least Squares fit

Residual: e_j = y_j − ŷ_j

Equations (e1) and (e2) below are often referred to as the Normal Equations:

    (e1) → Σ_{j=1}^n [y_j − (β0 + β1 x_j)] = Σ_{j=1}^n [y_j − ŷ_j] = Σ_{j=1}^n e_j = 0        ... (P1)
    (e2) → Σ_{j=1}^n [y_j − (β0 + β1 x_j)] x_j = Σ_{j=1}^n [y_j − ŷ_j] x_j = Σ_{j=1}^n x_j e_j = 0        ... (P2)

P1: The sum of residuals in any regression model that contains an intercept β0 is always 0: Σ e_j = 0; consequently Σ_{j=1}^n y_j = Σ_{j=1}^n ŷ_j
P2: Σ x_j e_j = 0
P3: Σ ŷ_j e_j = 0, since Σ ŷ_j e_j = Σ (β0 + β1 x_j) e_j = β0 Σ e_j + β1 Σ x_j e_j = 0, by (P1) + (P2)

8
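These properties are easy to verify numerically. The sketch below refits the small illustrative dataset used earlier (made up, not from the slides) and checks Σ e_j, Σ x_j e_j and Σ ŷ_j e_j, which should all be zero up to floating-point error; it also confirms the consequence of P1 that Σ y_j = Σ ŷ_j.

# Minimal sketch (illustrative data): verify P1, P2, P3 for a least-squares fit with an intercept.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.4, 5.2])

xbar, ybar = x.mean(), y.mean()
beta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta0 = ybar - beta1 * xbar

y_hat = beta0 + beta1 * x
e = y - y_hat                       # residuals e_j

print(np.sum(e))                    # P1: ~ 0
print(np.sum(x * e))                # P2: ~ 0
print(np.sum(y_hat * e))            # P3: ~ 0
print(np.sum(y), np.sum(y_hat))     # consequence of P1: the two sums match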
Simple Linear Regression (one independent variable)
An example on Least Squares fit

Based on the past student data, tabulated on the RHS [marks table not reproduced here]:
• Use the method of Least Squares to find the equation for the prediction of a student's end-term marks based on the student's mid-term marks.
• Predict the end-term marks of a student who has received 86 marks in the mid-term exam of the Statistical Machine Learning course.

    β1 = Σ_{j=1}^n (x_j − x̄)(y_j − ȳ) / Σ_{j=1}^n (x_j − x̄)²        β0 = ȳ − β1 x̄

    β1 = 2004 / 3445.67 = 0.5816        β0 = 32.02

    ŷ = 32.02 + 0.5816 x

    At x = 86: ŷ = 82.04

9
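A short check of the arithmetic, using only the summary quantities quoted on the slide (S_XY = 2004, S_XX = 3445.67, β0 = 32.02; the underlying marks table itself is not reproduced here):

# Minimal sketch: reproduce the fitted line and the prediction at x = 86.
S_xy, S_xx = 2004.0, 3445.67    # summary statistics quoted on the slide
beta0 = 32.02                   # intercept quoted on the slide (ybar - beta1*xbar)

beta1 = S_xy / S_xx             # ~ 0.5816
y_hat_86 = beta0 + beta1 * 86   # ~ 82.04 predicted end-term marks
print(beta1, y_hat_86)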
Simple Linear Regression (one independent variable)
Statistical Properties of Least Squares Estimators: β0 and β1

It is important to note that the given data is just a sample, drawn from the population distribution, where for a given x_j:
    y_j = ρ0 + ρ1 x_j + ε_j,    E(y_j) = ρ0 + ρ1 x_j,    ε_j ~ N(0, σ²)

P1: Both β0 and β1 are linear combinations of the observations y_j

    β1 = Σ_{j=1}^n (x_j − x̄)(y_j − ȳ) / Σ_{j=1}^n (x_j − x̄)²
       = Σ_{j=1}^n (x_j − x̄) y_j / Σ_{j=1}^n (x_j − x̄)²  −  Σ_{j=1}^n (x_j − x̄) ȳ / Σ_{j=1}^n (x_j − x̄)²

    The second term vanishes because Σ_{j=1}^n (x_j − x̄) = Σ x_j − n·(Σ x_j)/n = 0, so ȳ Σ (x_j − x̄) = 0. Hence

    β1 = Σ_{j=1}^n c_j y_j,   where c_j = (x_j − x̄) / Σ_{j=1}^n (x_j − x̄)² = (x_j − x̄)/S_XX

    β0 = ȳ − β1 x̄ = (1/n) Σ_{j=1}^n y_j − x̄ Σ_{j=1}^n c_j y_j = Σ_{j=1}^n y_j [1/n − x̄ c_j]  →  a linear combination of the y_j

Note: here in P1 we saw Σ_{j=1}^n (x_j − x̄) = 0 → Σ_{j=1}^n (x_j − x̄) ȳ = 0.
In P2 we will use: Σ_{j=1}^n (x_j − x̄) = 0 → Σ_{j=1}^n c_j = (1/S_XX) Σ_{j=1}^n (x_j − x̄) = 0; and Σ_{j=1}^n (x_j − x̄) x̄ = 0.

10
Simple Linear Regression (one independent variable)
Statistical Properties of Least Squares Estimators: β0 and β1

It is important to note that the given data is just a sample, drawn from the population distribution, where for a given x_j:
    y_j = ρ0 + ρ1 x_j + ε_j,    E(y_j) = ρ0 + ρ1 x_j,    ε_j ~ N(0, σ²)

P2: Both β0 and β1 are unbiased estimators of ρ0 and ρ1, respectively

    β1 = Σ_{j=1}^n c_j y_j, where c_j = (x_j − x̄) / Σ_{j=1}^n (x_j − x̄)²

    E(β1) = E(Σ_{j=1}^n c_j y_j) = Σ_{j=1}^n c_j E(y_j)
          = Σ_{j=1}^n c_j E(ρ0 + ρ1 x_j + ε_j)
          = ρ0 (Σ_{j=1}^n c_j) + ρ1 (Σ_{j=1}^n c_j x_j) + Σ_{j=1}^n c_j E(ε_j)
          = ρ0 · 0 + ρ1 · 1 + 0 = ρ1

    because E(ε_j) = 0, and:

    Σ_{j=1}^n c_j = (1/S_XX) Σ_{j=1}^n (x_j − x̄) = (1/S_XX)[Σ x_j − n x̄] = (1/S_XX)[Σ x_j − n (Σ x_j)/n] = 0

    Σ_{j=1}^n c_j x_j = (1/S_XX) Σ_{j=1}^n (x_j − x̄) x_j = (1/S_XX)[Σ (x_j − x̄)(x_j − x̄) + Σ (x_j − x̄) x̄]
                      = (1/S_XX)[S_XX + x̄ (Σ x_j − n x̄)] = (1/S_XX)[S_XX + x̄ · 0] = 1

    (In P1 we used Σ (x_j − x̄) ȳ = 0; in P2 we use Σ (x_j − x̄) x̄ = 0.)

    E(β0) = E(ȳ − β1 x̄) = E(ρ0 + ρ1 x̄ − β1 x̄) = ρ0, since E(β1) = ρ1

11
Simple Linear Regression (one independent variable): Preparation for Hypothesis Testing
Expectation and Variance of Least Squares Estimators (β0 and β1); Gauss-Markov Theorem

From the previous slides: E(β1) = ρ1 and E(β0) = ρ0.
Just like the derivations on expectations, the derivations for variance lead to

    Var(β1) = Var(ε_j)/Var(X) = σ²/S_XX        Var(β0) = σ² [x̄²/S_XX + 1/n]

The Standard Errors (SEs) are used:
• in Hypothesis Testing
• to generate Confidence Intervals for the population parameters ρ0 and ρ1 based on β0 and β1, respectively

    SE(β1) = √Var(β1) = √(σ²/S_XX)        SE(β0) = √Var(β0) = √(σ² [x̄²/S_XX + 1/n])

Important inferences
❑ E(β1) = ρ1 → β1 ~ N(ρ1, σ²/S_XX) → (β1 − ρ1)/√(σ²/S_XX) ~ N(0, 1)
❑ E(β0) = ρ0 → β0 ~ N(ρ0, σ² [x̄²/S_XX + 1/n]) → (β0 − ρ0)/√(σ² [x̄²/S_XX + 1/n]) ~ N(0, 1)

Gauss-Markov theorem (which applies to Multiple Linear Regression in general, not just simple linear regression) states that for the regression model y_i = ρ0 + ρ1 x_i + ε_i, with the assumptions that
1. ε_i ~ N(0, σ²) → E(ε) = 0; Var(ε) = σ²
2. ε_i and ε_j are uncorrelated
the least squares estimators β0 and β1: (a) are unbiased, and (b) have minimum variance when compared to all other unbiased estimators that are linear combinations of the y_i.

The Gauss-Markov property assures that the least-squares estimator is BLUE: Best Linear Unbiased Estimator, where Best → the one with minimum variance within the class of unbiased linear estimators. However, there is no guarantee that the minimum variance will be small (covered later).

12
Simple Linear Regression (one independent variable): Preparation for Hypothesis Testing
Residual Mean Square (RMS) = RSS/(n−2) = Unbiased estimator for the Error Variance σ²

When σ is known:
    SE(β1) = √Var(β1) = √(σ²/S_XX)        SE(β0) = √Var(β0) = √(σ² [x̄²/S_XX + 1/n])

What if σ is NOT known? Well, σ needs to be estimated.

Residual sum of squares:
    RSS = Σ e_j² = Σ (y_j − ŷ_j)² = Σ (y_j − β0 − β1 x_j)² = Σ (y_j − (ȳ − β1 x̄) − β1 x_j)² = Σ [(y_j − ȳ) − β1 (x_j − x̄)]²
        = Σ (y_j − ȳ)² + β1² Σ (x_j − x̄)² − 2 β1 Σ (x_j − x̄)(y_j − ȳ)

    RSS = S_YY + β1² S_XX − 2 β1 S_XY;  where β1 = S_XY/S_XX → RSS = S_YY + β1² S_XX − 2 β1² S_XX → RSS = S_YY − β1² S_XX

    E(RSS) = E(S_YY) − E(β1² S_XX);  where some maths shows E(S_YY) = (n − 1) σ² + β1² S_XX, and E(β1² S_XX) = σ² + β1² S_XX

    E(RSS) = (n − 2) σ²

    Residual Mean Square: RMS = RSS/df ≡ RSS/(n − 2)

    E(RMS) = E(RSS/(n − 2)) = σ²  →  RMS is an unbiased estimator of σ², hence RMS can be used in place of σ²

    SE(β1) = √Var(β1) = √(RMS/S_XX)        SE(β0) = √Var(β0) = √(RMS [x̄²/S_XX + 1/n])

We now need to understand the distribution of RMS, setting the context for the t-test.

13
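The unbiasedness E(RMS) = σ² can also be seen by simulation. The sketch below assumes hypothetical population values (ρ0 = 1, ρ1 = 2, σ = 0.5; not from the slides), repeatedly draws samples from the population model, refits the line, and averages RSS/(n − 2); the average settles near σ², whereas dividing by n instead would be biased low.

# Minimal sketch (assumed population values): show E[RSS/(n-2)] ~ sigma^2 by Monte Carlo.
import numpy as np

rng = np.random.default_rng(1)
rho0, rho1, sigma = 1.0, 2.0, 0.5
x = np.linspace(0.0, 10.0, 12)    # fixed inputs, n = 12
n = len(x)

rms_values = []
for _ in range(20_000):
    y = rho0 + rho1 * x + rng.normal(0.0, sigma, size=n)
    xbar, ybar = x.mean(), y.mean()
    beta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    beta0 = ybar - beta1 * xbar
    rss = np.sum((y - (beta0 + beta1 * x)) ** 2)
    rms_values.append(rss / (n - 2))

print(np.mean(rms_values), "vs sigma^2 =", sigma ** 2)   # ~ 0.25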
Simple Linear Regression (one independent variable): Preparation for Hypothesis Testing
Residual Mean Square (RMS): Knowing its Distribution

The sum of squares of K independent Z (standard normal) distributions amounts to a χ² distribution with K degrees of freedom:
    χ²_{df=K} = Σ_{i=1}^K Z_i²

RSS = Σ_{j=1}^n e_j², where each e_j ~ N(0, σ²) → each (e_j − 0)/σ ~ N(0, 1) ~ Z → each (e_j/σ)² ~ Z² ~ χ²_1

So RSS/σ² = Σ_{j=1}^n (e_j/σ)² should follow a χ² distribution with dof = n.

However, this is not true, since not all e_j's are independent:
• Only n − 2 out of the n errors e_j are independent
• Two of them depend on the constraints resulting from the normal equations: Σ e_j = 0 and Σ e_j x_j = 0

Distinguish between the use of ε in the Assumptions, and the use of the error e here:
• Y = ρ0 + ρ1 X + ε → ε ≡ Data ∼ Function generator ≡ Y ∼ (ρ0 + ρ1 X)
• e = Y ∼ Ŷ = Actual ∼ Predictions

Therefore: RSS/σ² = Σ (e_j/σ)² → a sum of (n − 2) independent Z² distributions → a χ² distribution with dof = n − 2

    RSS/σ² = Σ (e_j/σ)² ~ χ²_{n−2}    and, since RMS = RSS/(n − 2):    (n − 2) RMS / σ² ~ χ²_{n−2}

14
Simple Linear Regression (one independent variable): Preparation for Hypothesis Testing
Setting the context for the t-test: using Expectation & Variance of β0, β1 + the RMS distribution for SLR

Important Results
❑ E(β1) = ρ1 → β1 ~ N(ρ1, σ²/S_XX) → (β1 − ρ1)/√(σ²/S_XX) ~ N(0, 1)
❑ E(β0) = ρ0 → β0 ~ N(ρ0, σ² [x̄²/S_XX + 1/n]) → (β0 − ρ0)/√(σ² [x̄²/S_XX + 1/n]) ~ N(0, 1)
❑ RSS/σ² = (n − 2) RMS / σ² ~ χ²_{n−2}

Important Fact
Let X and Y be independent RVs s.t. X ~ N(0, 1) and Y ~ χ²_n. Then X/√(Y/n) ~ t_n (t distribution with dof = n).

For β1: let X = (β1 − ρ1)/√(σ²/S_XX) ~ N(0, 1) and Y = (n − 2) RMS / σ² ~ χ²_{n−2}. Then

    X/√(Y/(n − 2)) = [(β1 − ρ1)/√(σ²/S_XX)] / √(RMS/σ²) = (β1 − ρ1)/√(RMS/S_XX) ~ t_{n−2}

For β0: let X = (β0 − ρ0)/√(σ² [x̄²/S_XX + 1/n]) ~ N(0, 1) and Y = (n − 2) RMS / σ² ~ χ²_{n−2}. Then

    (β0 − ρ0)/√(RMS [x̄²/S_XX + 1/n]) ~ t_{n−2}

In slide 13 we declared these denominators as SE(β1) and SE(β0), respectively, but there we did not know the distribution that the whole ratio would follow; hence the results from slides 12, 14 and 15 (above) were needed.

15
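The claim that (β1 − ρ1)/√(RMS/S_XX) follows t_{n−2} can likewise be checked empirically. The sketch below (hypothetical population values, not from the slides) simulates many samples, forms the studentized slope, and compares an empirical quantile against scipy's t distribution with n − 2 degrees of freedom.

# Minimal sketch (assumed population values): (beta1 - rho1)/sqrt(RMS/S_XX) behaves like t_{n-2}.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
rho0, rho1, sigma = 1.0, 2.0, 0.5
x = np.linspace(0.0, 10.0, 8)      # fixed inputs, n = 8
n = len(x)
S_xx = np.sum((x - x.mean()) ** 2)

t_vals = []
for _ in range(20_000):
    y = rho0 + rho1 * x + rng.normal(0.0, sigma, size=n)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
    beta0 = y.mean() - beta1 * x.mean()
    rms = np.sum((y - (beta0 + beta1 * x)) ** 2) / (n - 2)
    t_vals.append((beta1 - rho1) / np.sqrt(rms / S_xx))

# empirical vs theoretical 97.5% quantile of t with n-2 dof
print(np.quantile(t_vals, 0.975), stats.t.ppf(0.975, df=n - 2))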
Simple Linear Regression (one independent variable)
Setting the context for the F-test: Coefficient of Determination (R²)

In general: y = constant = β0 (β1 = 0) → y is independent of x → a horizontal line ≡ the mean: possibly the worst model, since it gives a constant output (dependent variable), disregarding the independent variables.

Decompose each deviation from the mean:  y_j − ȳ = (y_j − ŷ_j) + (ŷ_j − ȳ)

    Σ_{j=1}^n (y_j − ȳ)² = Σ_{j=1}^n [(y_j − ŷ_j) + (ŷ_j − ȳ)]²
                         = Σ_{j=1}^n (y_j − ŷ_j)² + Σ_{j=1}^n (ŷ_j − ȳ)² + 2 Σ_{j=1}^n (y_j − ŷ_j)(ŷ_j − ȳ)

The cross term Prod = Σ (y_j − ŷ_j)(ŷ_j − ȳ) = 0 (shown below), so

    Σ_{j=1}^n (y_j − ȳ)² = Σ_{j=1}^n (y_j − ŷ_j)² + Σ_{j=1}^n (ŷ_j − ȳ)²
    TSS = RSS + ESS

Why Prod = 0: using
    • ȳ = β0 + β1 x̄
    • ŷ_j = β0 + β1 x_j
    • ŷ_j − ȳ = β1 (x_j − x̄)
    • y_j − ŷ_j = y_j − (β0 + β1 x_j), where β0 = ȳ − β1 x̄ → y_j − ŷ_j = (y_j − ȳ) − β1 (x_j − x̄)

    Prod = Σ_{j=1}^n β1 (x_j − x̄) [(y_j − ȳ) − β1 (x_j − x̄)] = β1 S_XY − β1² S_XX = β1 S_XY − β1 S_XY = 0,  ∵ β1 = S_XY/S_XX

Coefficient of Determination:
    R² = (TSS − RSS)/TSS
       = (Residual without the Trained Model − Residual NOT explained by the TM) / (Residual without the TM)
       = (Residual Explained by the TM) / (Residual without the TM)
       = ESS/TSS

16
Simple Linear Regression (one independent variable)
Setting the context for the F-test: its computation

TSS = RSS + ESS:   Σ_{j=1}^n (y_j − ȳ)² = Σ_{j=1}^n (y_j − ŷ_j)² + Σ_{j=1}^n (ŷ_j − ȳ)²

Degrees of freedom:
    df_TSS = n − 1, because Σ_{j=1}^n (y_j − ȳ) = 0
    df_RSS = n − 2, because Σ_{j=1}^n e_j² / σ² ~ χ²_{n−2}
    df_ESS = df_TSS − df_RSS = 1, because Σ_{j=1}^n β1² (x_j − x̄)² = β1² S_XX (a single component)

    Residual                        | Residual sum | df     | Mean Square ≡ Residual/df → Variance
    Total                           | TSS          | n − 1  | Total Mean Square       TMS = TSS/(n − 1)
    Not explained by Regression     | RSS          | n − 2  | Residual Mean Square    RMS = RSS/(n − 2)
    Explained by Regression         | ESS          | 1      | Explained Mean Square   EMS = ESS/1 = ESS

Important Fact
If X and Y are independent RVs s.t. X ~ χ²_m and Y ~ χ²_n, then (X/m)/(Y/n) ~ F_{m,n} (F distribution with df1 = m, df2 = n).

Just as we proved E(RMS) = σ², it can be proved that:
    • E(EMS) = σ² + β1² S_XX
    • EMS/σ² ~ χ²_1 under H0: β1 = 0

With X = EMS/σ² ~ χ²_1 and Y = (n − 2) RMS / σ² ~ χ²_{n−2}:
    (X/1)/(Y/(n − 2)) = EMS/RMS ~ F_{1, n−2} ≡ F_H0   (F distribution with df1 = 1, df2 = n − 2)

If F_H0 > F_critical: Reject H0

17
Simple Linear Regression (one independent variable)
Setting the context for the F-test: its computation

    Residual                        | Residual sum                    | df     | Mean Square ≡ Residual/df → Variance
    Total                           | TSS = Σ_{j=1}^n (y_j − ȳ)²      | n − 1  | Total Mean Square       TMS = TSS/(n − 1)
    Not explained by Regression     | RSS = Σ_{j=1}^n (y_j − ŷ_j)²    | n − 2  | Residual Mean Square    RMS = RSS/(n − 2)
    Explained by Regression         | ESS = Σ_{j=1}^n (ŷ_j − ȳ)²      | 1      | Explained Mean Square   EMS = ESS/1 = ESS

With X = EMS/σ² ~ χ²_1 and Y = (n − 2) RMS / σ² ~ χ²_{n−2}:   EMS/RMS ~ F_{1, n−2} ≡ F_H0

If F_H0 > F_critical: Reject H0

18
Now we have all the formulations and ingredients for Hypothesis Testing ready ... so let's go.

19
Simple Linear Regression (one independent variable)
Model Evaluation at Two Levels

Either of these approaches could be applied to EACH TRD, leading to as many ML models as there are TRDs.

Individual Evaluation (Local: for each TRD)
The learnt ML model can be evaluated w.r.t. the WORST possible ML model (constant at ȳ) for the same TRD, plus inferences on the population.
Diagnostic checks:
1. Coefficient of determination R²: how good is our model compared to the worst model (w.r.t. the specific TRD itself)?
2. Test of Significance (t-test or F-test): check whether the simple/multiple regression model with one/several predictors is significantly better than a model without any predictors at capturing the real linear relation in the population.
3. Confidence interval for coefficients: finding the lower and upper bound for the population parameter with confidence level ψ based on the corresponding sample regression coefficient.

Collective Evaluation (ALL TRDs collectively)
The performance of ALL the ML models COLLECTIVELY can be evaluated against the TSD → Bias; Variance.

20
Simple Linear Regression (one independent variable)
Pre-requisites for Coefficient of Determination (R²) and ANOVA

In general: y = constant → y is independent of x → a horizontal line ≡ the mean could be the worst possible model, since it gives a constant dependent variable, disregarding the independent variables.

    TSS = RSS + ESS

    R² = (TSS − RSS)/TSS = (Res without TM − Res NOT explained by the TM)/(Res without TM) = (Res Explained by TM)/(Res without TM) = ESS/TSS

Worst fit: RSS = TSS (Area doesn't explain any residuals in Price).   Perfect fit: RSS = 0 (Area explains 100% of the residuals in Price).

Example:   R² = (TSS − RSS)/TSS = (11.1 − 4.4)/11.1 = 6.7/11.1 = 0.6 → Area explains a 60% reduction in the Price residuals

21
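The Area-Price numbers above plug in directly; a minimal sketch (the TSS and RSS values are those quoted on the slide):

# Minimal sketch: coefficient of determination from TSS and RSS.
TSS, RSS = 11.1, 4.4    # residual without / not explained by the trained model (from the slide)
ESS = TSS - RSS         # 6.7, residual explained by the trained model
R2 = ESS / TSS          # ~ 0.60: Area explains ~60% of the Price residuals
print(ESS, R2)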
Simple Linear Regression (one independent variable)
Is the Population distribution captured better by the Trained Model or the Worst Model: t-test

Suppose we have a Least Squares fit. The question is: does this fit reflect population behaviour → is this model a good estimate of the real linear relation between x and y in the population vis-à-vis the Worst model (y independent of x)?

For the Population, let us say the real relation is: y = ρ0 + ρ1 x + ε
For the Sample:
❑ the least squares fit is: ŷ = β0 + β1 x
❑ the worst fit could be: ŷ = β0
Test instance: whether there is a linear relationship between X and Y → test the significance of the slope ρ1

❖ (I) Define the Hypothesis
i.  Null Hypothesis: no linear relationship → ρ1 = 0 → y = constant = ρ0 (worst model)
ii. Alternate Hypothesis: there is a linear relationship → ρ1 ≠ 0

❖ (II) Define the test statistic
    t = (β1 − ρ1)/√(RMS/S_XX) ~ t_{n−2}   →   under H0 (ρ1 = 0):   t_{H0, n−2} = β1/√(RMS/S_XX)

❖ (III) Compare the computed t_H0 against t_critical corresponding to significance level α = 0.05
❑ If |t_H0| > t_critical: Reject H0; else retain it
❑ Alternatively, refer to the p table, find the p value (pv) corresponding to the given (i) t_H0 value and (ii) df = n − 2 → if pv < 0.05: Reject H0

22
Simple Linear Regression (one independent variable)
Is the Population distribution captured better by the Trained Model or the Worst Model: t-test
Is Y related linearly with X?

    X    Y    ŷ_j = −0.1 + 0.7 x_j    e_j = y_j − ŷ_j
    1    1    0.6                      0.4
    2    1    1.3                     −0.3
    3    2    2.0                      0.0
    4    2    2.7                     −0.7
    5    4    3.4                      0.6

    RMS = RSS/(n − 2) = Σ e_j² / 3 = 1.1/3 = 0.3666
    S_XX = Σ (x_j − x̄)² = Σ x_j² − n x̄² = 55 − 5·3² = 10

    t_{H0, n−2} = β1/√(RMS/S_XX) = 0.7/√(0.3666/10) = 3.655, where df = n − 2 = 3

Notably: t_critical = 3.18; t_H0 = 3.655.
∵ t_H0 = 3.655 > t_critical = 3.18: Reject H0 → a linear relation exists, since ρ1 is significant.

Since t_critical is picked from a two-sided α: both sides together account for α → each side accounts for α/2 → t_critical is written as t_{n−2, α/2}.

23
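The same computation can be scripted. The sketch below uses the five (X, Y) pairs from the table above and lets scipy supply the critical value; the 3.18 quoted on the slide is the two-sided t_{3, α/2} at α = 0.05.

# Minimal sketch: t-test for the slope on the 5-point dataset from the slide.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 1.0, 2.0, 2.0, 4.0])
n = len(x)

S_xx = np.sum((x - x.mean()) ** 2)                        # 10
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx    # 0.7
beta0 = y.mean() - beta1 * x.mean()                       # -0.1

e = y - (beta0 + beta1 * x)
rms = np.sum(e ** 2) / (n - 2)                            # 1.1/3 ~ 0.3667

t_H0 = beta1 / np.sqrt(rms / S_xx)                        # ~ 3.655
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)              # ~ 3.182
print(t_H0, t_crit, "Reject H0" if abs(t_H0) > t_crit else "Retain H0")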
Simple Linear Regression (one independent variable)
Is the Population distribution captured better by the Trained Model or the Worst Model: F-test
Applying the F-test on the previous problem

    TSS = 6    RSS = Σ e_j² = 1.1    ESS = TSS − RSS = 4.9    Alternatively: ESS = β1² S_XX = 0.7² × 10 = 4.9

Note: these numbers suggest a reasonably good fit, as the model explains 4.9 out of the total residual of 6.

    RMS = RSS/(n − 2) = 0.367    EMS = ESS = 4.9    F = EMS/RMS = 4.9/0.367 = 13.35
    df1 = 1; df2 = 3; from the Critical F Table: F_critical = 10.128

Since F_H0 = 13.35 > F_critical = 10.128: Reject H0 → a linear relation exists.

Observation: for this problem (2 groups): t = 3.655; F = 13.35 → F = t²

24
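The F-test on the same data can be scripted the same way, with scipy giving F_critical for df1 = 1, df2 = 3; the output also shows F_H0 = t_H0².

# Minimal sketch: F-test on the same 5-point dataset.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 1.0, 2.0, 2.0, 4.0])
n = len(x)

S_xx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
beta0 = y.mean() - beta1 * x.mean()

TSS = np.sum((y - y.mean()) ** 2)                 # 6
RSS = np.sum((y - (beta0 + beta1 * x)) ** 2)      # 1.1
ESS = TSS - RSS                                   # 4.9  (also beta1**2 * S_xx)

RMS = RSS / (n - 2)                               # ~ 0.367
EMS = ESS / 1.0                                   # 4.9
F_H0 = EMS / RMS                                  # ~ 13.35 (= t_H0 squared)
F_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)      # ~ 10.128
print(F_H0, F_crit, "Reject H0" if F_H0 > F_crit else "Retain H0")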
Linear Regression: Individual Evaluation of Each ML model trained on TRDs w.r.t. the WORST Model
Diagnostic Check 3: Confidence Interval for Coefficients
The following was discussed while covering the Preliminaries of Hypothesis Testing.

Finding the Confidence Interval for Coefficients implies finding the lower and upper bound in which the population parameter ρ will lie, with confidence level ψ, based on the corresponding sample regression coefficient β.

    Significance level: α = 1 − ψ/100, where ψ is the confidence level in %.
    The use of α depends on the use of a one- or two-tailed t-test.

• Population parameter: ρ
• Sample statistic: β
• Standard Error: SE_β = √(σ²/n)

    One-tailed t-dist:  ρ = β ± t_α · SE_β        (t_α is used ∵ the entire α is on one side)
    Two-tailed t-dist:  ρ = β ± t_{α/2} · SE_β    (t_{α/2} is used ∵ α is equally split on both sides)

25
Linear Regression: Individual Evaluation of Each ML model trained on TRDs w.r.t. the WORST Model
Diagnostic Check 3: Confidence Interval for Coefficients

If we just follow the general principles in the current context, we can simply plug in the Standard Error values.

Assuming σ is known:
    SE(β1) = √Var(β1) = √(σ²/S_XX)
    SE(β0) = √Var(β0) = √(σ² [x̄²/S_XX + 1/n])

Assuming σ is unknown:
    SE(β1) = √Var(β1) = √(RMS/S_XX)
    SE(β0) = √Var(β0) = √(RMS [x̄²/S_XX + 1/n])

If we want to independently derive the confidence intervals for the ρ's based on the β's:

    (β1 − ρ1)/√(RMS/S_XX) ~ t_{n−2}        (β0 − ρ0)/√(RMS [x̄²/S_XX + 1/n]) ~ t_{n−2}

    P( −t_{α/2, n−2} ≤ (β1 − ρ1)/SE ≤ t_{α/2, n−2} ) = 1 − α

    ρ1 = β1 ± t_{α/2, n−2} √(RMS/S_XX)        ρ0 = β0 ± t_{α/2, n−2} √(RMS [x̄²/S_XX + 1/n])

26
Linear Regression: Individual Evaluation of Each ML model trained on TRDs w.r.t. the WORST Model
Diagnostic Check 3: Confidence Interval for Coefficients
The previous problem:

• n = 5, df = n − 2 = 3
• t_{0.025, 3} = 3.182
• RSS = Σ e_j² = 1.1 → RMS = RSS/(n − 2) = 0.367
• S_XX = Σ (x_j − x̄)² = Σ x_j² − n x̄² = 55 − 5·3² = 10

    ρ1 = β1 ± t_{α/2, n−2} √(RMS/S_XX) = 0.7 ± 3.182 √(0.367/10)

    0.1 ≤ ρ1 ≤ 1.3 with 95% confidence level

27
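The interval can also be computed directly; a sketch reusing the five-point dataset, with scipy supplying t_{0.025, 3}:

# Minimal sketch: 95% confidence interval for rho1 on the 5-point dataset.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 1.0, 2.0, 2.0, 4.0])
n = len(x)

S_xx = np.sum((x - x.mean()) ** 2)                        # 10
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx    # 0.7
beta0 = y.mean() - beta1 * x.mean()
RMS = np.sum((y - (beta0 + beta1 * x)) ** 2) / (n - 2)    # ~ 0.367

t_crit = stats.t.ppf(0.975, df=n - 2)                     # ~ 3.182
half_width = t_crit * np.sqrt(RMS / S_xx)                 # ~ 0.61
print(beta1 - half_width, beta1 + half_width)             # ~ (0.09, 1.31): 0.1 <= rho1 <= 1.3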
Reiterating the Flow

Covered so far

Discussed next

28
Linear Regression: Collective Evaluation of ALL ML models vis-à-vis the Test Data (TSD)
Bias and Variance

For a given x ∈ TSD, let
• the target be y
• the estimate of the target by an ML model be ŷ
• the Average Prediction over the (four) ML models be E[ŷ] = (ŷ1 + ŷ2 + ŷ3 + ŷ4)/4

Variance ≡ how far each number in the set is from the mean (average): s² = Σ (x_i − x̄)²/(n − 1).
Variance = the expected value of the squared deviation from the mean of a RV.

    Bias(ŷ) = y − E[ŷ]
        Actual ∼ Average prediction
        A measure of the inaccuracy of the estimate of the target function (ŷ)

    Var(ŷ) = E[(ŷ − E[ŷ])²] = E[ŷ²] − (E[ŷ])²
        Expected [Individual Prediction ∼ Average prediction]
        A measure of the estimator's jumpiness (how much the estimate changes with a change in the training data, vis-à-vis the Average prediction)

    Derivation: E[(ŷ − E[ŷ])²] = E[ŷ² + (E[ŷ])² − 2 ŷ E[ŷ]] = E[ŷ²] + (E[ŷ])² − 2 E[ŷ] E[ŷ] = E[ŷ²] − (E[ŷ])²

29
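A tiny sketch of these two quantities for a single test input, using four made-up model predictions (the slide defines only the formulas; the numbers below are hypothetical):

# Minimal sketch (made-up numbers): bias and variance of the model ensemble at one test input x.
import numpy as np

y = 3.0                                          # target for this x in the TSD
y_hat = np.array([2.6, 3.4, 2.9, 3.5])           # predictions of the 4 ML models (hypothetical)

avg_pred = y_hat.mean()                          # E[y_hat], the average prediction
bias = y - avg_pred                              # Actual ~ Average prediction
variance = np.mean(y_hat ** 2) - avg_pred ** 2   # E[y_hat^2] - (E[y_hat])^2

print(avg_pred, bias, variance)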
