Topic - chapter 12 - Regression models
[1] A scatter plot can be a helpful tool in determining the strength of the relationship between two variables.

Sample correlation coefficient (also called the Pearson correlation coefficient), where all sums run over i = 1, …, n:

    r = Σ(xi − x̄)(yi − ȳ) / [√Σ(xi − x̄)² · √Σ(yi − ȳ)²]

If r = 0 then there is seemingly no linear correlation between the two variables x and y.

Linear model: y = β0 + β1x + ε. The random error ε appears in the model because other, unspecified variables may also affect Y and there may be measurement error in Y.

Simple regression equation: E(Y|x) = β0 + β1x. Estimated regression model: ŷ = b0 + b1x.
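As a quick check of this definition, here is a minimal Python sketch (NumPy assumed); the six (x, y) pairs are made-up illustration data, not values from the chapter.

```python
import numpy as np

# Made-up illustration data: any paired sample (xi, yi) would do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.3])

# Pearson correlation coefficient, computed from the definition above.
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
r = num / den

print(f"r = {r:.4f}")  # agrees with np.corrcoef(x, y)[0, 1]
```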
Evaluation metrics for regression:

    Maximum error:          E∞ = max over 1 ≤ i ≤ n of |ŷi − yi|
    Average error:          E1 = (1/n) Σ|ŷi − yi|
    Root-mean-square error: E2 = √[(1/n) Σ(yi − ŷi)²] = √[(1/n) Σ(yi − b0 − b1xi)²]

Ordinary least squares method (OLS) [Carl F. Gauss, 1809]: E2 is minimized when

    b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²   and   b0 = ȳ − b1x̄,

where x̄ = (Σxi)/n and ȳ = (Σyi)/n.
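Continuing the sketch above, the OLS estimates and the three error metrics fall out directly (same made-up data, NumPy assumed):

```python
import numpy as np

# Same made-up illustration data as in the previous sketch.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.3])
n = len(x)

# OLS estimates of slope b1 and intercept b0.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x  # fitted values

# The three evaluation metrics defined above.
e_inf = np.max(np.abs(y_hat - y))         # maximum error E_inf
e_1 = np.mean(np.abs(y_hat - y))          # average error E_1
e_2 = np.sqrt(np.mean((y - y_hat) ** 2))  # root-mean-square error E_2

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"E_inf = {e_inf:.4f}, E_1 = {e_1:.4f}, E_2 = {e_2:.4f}")
```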
[2] Test for significant correlation using Student's t

Hypotheses: H0: ρ = 0; H1: ρ ≠ 0

    Test statistic: t_calc = r √[(n − 2)/(1 − r²)]

Compare with t_crit using d.f. = n − 2.

We measure the model's fit by comparing the variance we can explain with the variance we cannot explain:

    Var(y) = Var(β0 + β1x + ε) = Var(β1x) + Var(ε)

In a regression, we seek to explain the variation in the dependent variable around its mean.

Source of variation in y (all sums over i = 1, …, n):

    Total variation about the mean:        Σ(yi − ȳ)²  ≡ SST
    Variation explained by the regression: Σ(ŷi − ȳ)²  ≡ SSR
    Unexplained or error variation:        Σ(yi − ŷi)² ≡ SSE
    SST = SSR + SSE

Another evaluation metric for regression is the coefficient of determination:

    R² = SSR/SST = 1 − SSE/SST

This number represents the percent of variation explained.
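Both the correlation t test and the SST = SSR + SSE identity can be verified numerically; the sketch below assumes SciPy is available for the Student's t distribution and reuses the same made-up data.

```python
import numpy as np
from scipy import stats

# Same made-up illustration data as in the earlier sketches.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.3])
n = len(x)

# Test for significant correlation: H0: rho = 0 vs H1: rho != 0.
r = np.corrcoef(x, y)[0, 1]
t_calc = r * np.sqrt((n - 2) / (1 - r ** 2))
p_value = 2 * stats.t.sf(abs(t_calc), df=n - 2)  # two-tailed p-value

# OLS fit, then the variance decomposition SST = SSR + SSE.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variation about the mean
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the regression
sse = np.sum((y - y_hat) ** 2)         # unexplained or error variation
r2 = ssr / sst                         # equals 1 - sse/sst

print(f"t_calc = {t_calc:.3f}, p-value = {p_value:.4f} (d.f. = {n - 2})")
print(f"SST = {sst:.3f}, SSR + SSE = {ssr + sse:.3f}, R^2 = {r2:.4f}")
```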
[3] Critical value for the correlation coefficient

    r_crit = t_crit / √(t_crit² + n − 2), with d.f. = n − 2

Compare r to r_crit. If r is not between the positive and negative critical values, then the correlation coefficient is significant. If r is significant, then you may want to use the line for prediction.

If the fit is good, SSE is relatively small compared to SST. A measure of overall fit is the standard error:

    s_e = √[SSE/(n − 2)]

Test for significance of the coefficients:

    Coefficient | Hypotheses             | Test statistic                                                             | Standard error                      | Confidence interval
    Slope       | H0: β1 = 0; H1: β1 ≠ 0 | t_calc = (estimated slope − 0)/(standard error of slope) = b1/s_b1         | s_b1 = s_e / √[Σ(xi − x̄)²]          | b1 ± t_{α/2} s_b1
    Intercept   | H0: β0 = 0; H1: β0 ≠ 0 | t_calc = (estimated intercept − 0)/(standard error of intercept) = b0/s_b0 | s_b0 = s_e √[1/n + x̄²/Σ(xi − x̄)²]  | b0 ± t_{α/2} s_b0

ANOVA: Note that in a simple regression, the F-test always yields the same p-value as a two-tailed t test for zero slope, which in turn always gives the same p-value as a two-tailed test for zero correlation. The relationship between the test statistics is F_calc = t_calc².
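A sketch tying these pieces together: r_crit, the standard error s_e, the coefficient standard errors, 95% confidence intervals, and the F_calc = t_calc² identity (SciPy assumed; same made-up data).

```python
import numpy as np
from scipy import stats

# Same made-up illustration data as before.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.3])
n = len(x)

# Critical values at alpha = 0.05 (two-tailed), d.f. = n - 2.
t_crit = stats.t.ppf(0.975, df=n - 2)
r_crit = t_crit / np.sqrt(t_crit ** 2 + n - 2)

# OLS fit and the standard error of the regression.
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)
s_e = np.sqrt(sse / (n - 2))

# Standard errors, t statistics, and 95% confidence intervals.
s_b1 = s_e / np.sqrt(sxx)
s_b0 = s_e * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)
t_b1, t_b0 = b1 / s_b1, b0 / s_b0
ci_b1 = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
ci_b0 = (b0 - t_crit * s_b0, b0 + t_crit * s_b0)

# ANOVA F statistic: equals t_calc**2 for the slope in simple regression.
ssr = np.sum(((b0 + b1 * x) - y.mean()) ** 2)
f_calc = ssr / (sse / (n - 2))

print(f"r_crit = {r_crit:.4f}, s_e = {s_e:.4f}")
print(f"slope: t = {t_b1:.3f}, 95% CI = ({ci_b1[0]:.4f}, {ci_b1[1]:.4f})")
print(f"intercept: t = {t_b0:.3f}, 95% CI = ({ci_b0[0]:.4f}, {ci_b0[1]:.4f})")
print(f"F_calc = {f_calc:.3f} vs t_b1^2 = {t_b1 ** 2:.3f}")
```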
Caveat: In large samples, small correlations may be significant, even if the scatter plot shows little evidence of linearity. Thus, a significant correlation may lack practical importance.

A few assumptions about the random error term ε are made when we use linear regression to fit a line to data: (1) the errors are normally distributed; (2) the errors have constant variance; (3) the errors are independent. Because we cannot observe the errors, we must rely on the residuals e1 = y1 − ŷ1, e2 = y2 − ŷ2, …, en = yn − ŷn from the estimated regression for clues about possible violations of these assumptions.

While formal tests exist for identifying assumption violations, many analysts rely on simple visual tools to determine when an assumption has not been met and how serious the violation is. For more details, please consult the textbook.
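As one example of such visual tools, the sketch below (matplotlib assumed, same made-up data) draws three standard residual plots, one per assumption: a histogram for normality, residuals versus fitted values for constant variance, and residuals in observation order for independence.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up illustration data; in practice, use your own sample.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.3])

# OLS fit and residuals e_i = y_i - y_hat_i.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Normality: the histogram of residuals should look roughly bell-shaped.
axes[0].hist(resid, bins="auto")
axes[0].set_title("Histogram of residuals")

# Constant variance: residuals vs fitted values should show no funnel shape.
axes[1].scatter(b0 + b1 * x, resid)
axes[1].axhline(0.0, linestyle="--")
axes[1].set_title("Residuals vs fitted")

# Independence: residuals in observation order should show no pattern.
axes[2].plot(resid, marker="o")
axes[2].axhline(0.0, linestyle="--")
axes[2].set_title("Residuals vs order")

plt.tight_layout()
plt.show()
```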