
Chapter 12. Simple regression
Nguyen Thi Thu Van - August 3, 2023


Two main objectives of a linear regression model:
 Establish whether there is a linear relationship between two variables by fitting a linear equation to observed data. More precisely, establish whether there is a statistically significant relationship between the two variables.
 Forecast new observations: can we use what we know about the relationship to forecast unobserved values?

Variables' roles: one variable, here x, is considered the independent (explanatory, predictor) variable, and the other, here y, is considered the dependent (response) variable.

Two basic steps to build a linear regression model: (1) Before attempting to fit a linear model to observed data, first determine whether there is a relationship between the variables of interest. This does not necessarily imply that one variable causes the other (for example, higher SAT scores do not cause higher college grades), but that there is some significant association between the two variables [Correlation analysis]. (2) Fit a linear model to the observed data [Regression models].
Correlation analysis

[1] A scatter plot can be a helpful tool in determining the strength of the relationship between two variables.

Sample correlation coefficient (also called the Pearson correlation coefficient), with sums running over i = 1, …, n:

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / [ √Σᵢ (xᵢ − x̄)² × √Σᵢ (yᵢ − ȳ)² ]

If r = 0, there is seemingly no linear correlation between the two variables x and y.

Regression models

Linear model: y = β₀ + β₁x + ε

The random error ε is included in the model because other, unspecified variables may also affect y, and there may be measurement error in y. Even though the error term cannot be observed, we assume that:
(A1) The errors are normally distributed.
(A2) The errors have constant variance σ².
(A3) The errors are independent of each other.

We measure the model's fit by comparing the variance we can explain with the variance we cannot explain:

Var(y) = Var(β₀ + β₁x + ε) = Var(β₁x) + Var(ε)

Simple regression equation: E(Y|x) = β₀ + β₁x
Estimated regression model: ŷ = b₀ + b₁x

Evaluation metrics for regression (sums over i = 1, …, n):
 Maximum error: E∞ = maxᵢ |ŷᵢ − yᵢ|
 Average error: E₁ = (1/n) Σᵢ |ŷᵢ − yᵢ|
 Root-mean-square error: E₂ = √[(1/n) Σᵢ (yᵢ − ŷᵢ)²] = √[(1/n) Σᵢ (yᵢ − b₀ − b₁xᵢ)²]

Ordinary least squares method (OLS) [Carl F. Gauss, 1809]: E₂ is minimized when

b₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²  and  b₀ = ȳ − b₁x̄,

where x̄ = (Σᵢ xᵢ)/n and ȳ = (Σᵢ yᵢ)/n.
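The correlation coefficient, the OLS estimates, and the error metrics can be computed directly from their formulas. The following is a minimal sketch in plain Python; the small data set is invented purely for illustration:

```python
# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Building blocks for both r and the OLS estimates
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

# Pearson correlation coefficient
r = sxy / (sxx ** 0.5 * syy ** 0.5)

# OLS estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
b1 = sxy / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

# Evaluation metrics
E_max = max(abs(yh - yi) for yh, yi in zip(yhat, y))              # maximum error
E1 = sum(abs(yh - yi) for yh, yi in zip(yhat, y)) / n             # average error
E2 = (sum((yi - yh) ** 2 for yi, yh in zip(y, yhat)) / n) ** 0.5  # RMSE

print(round(r, 4), round(b1, 4), round(b0, 4))
```

For this invented data set r ≈ 0.999, b₁ = 1.95 and b₀ = 0.15, so the fitted line is ŷ = 0.15 + 1.95x.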

[2] Test for significant correlation using Student's t.
Hypotheses: H₀: ρ = 0; H₁: ρ ≠ 0
Test statistic: t_calc = r √[(n − 2)/(1 − r²)]; compare with t_crit using d.f. = n − 2.

[3] Critical value for the correlation coefficient:

r_crit = t_crit / √(t_crit² + n − 2) with d.f. = n − 2

Compare r to r_crit. If r is not between the negative and positive critical values, then the correlation coefficient is significant. If r is significant, then you may want to use the line for prediction.

Regression models (continued)

In a regression, we seek to explain the variation in the dependent variable around its mean. Sources of variation in y (sums over i = 1, …, n):
 Total variation about the mean: SST = Σᵢ (yᵢ − ȳ)²
 Variation explained by the regression: SSR = Σᵢ (ŷᵢ − ȳ)²
 Unexplained or error variation: SSE = Σᵢ (yᵢ − ŷᵢ)²
with SST = SSR + SSE.

Another evaluation metric for regression is the coefficient of determination:

R² = SSR/SST = 1 − SSE/SST

This number represents the percent of variation explained. If the fit is good, SSE is relatively small compared with SST.

A measure of overall fit is the standard error: sₑ = √[SSE/(n − 2)]

Tests for significance of the coefficients, each with d.f. = n − 2:

Slope: H₀: β₁ = 0; H₁: β₁ ≠ 0
Test statistic: t_calc = (estimated slope − 0)/(standard error of slope) = b₁/s_b1, where s_b1 = sₑ / √Σᵢ (xᵢ − x̄)²
Confidence interval: b₁ ± t_{α/2} × s_b1

Intercept: H₀: β₀ = 0; H₁: β₀ ≠ 0
Test statistic: t_calc = (estimated intercept − 0)/(standard error of intercept) = b₀/s_b0, where s_b0 = sₑ √[1/n + x̄² / Σᵢ (xᵢ − x̄)²]
Confidence interval: b₀ ± t_{α/2} × s_b0

ANOVA: Note that in a simple regression, the F test always yields the same p-value as a two-tailed t test for zero slope, which in turn always gives the same p-value as a two-tailed test for zero correlation. The relationship between the test statistics is F_calc = t_calc².
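The identities SST = SSR + SSE, R² = SSR/SST, and F_calc = t_calc² can be verified numerically. Here is a minimal plain-Python sketch; the data set is invented for illustration:

```python
# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# OLS fit
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

# Sums of squares: SST = SSR + SSE
SST = sum((yi - ybar) ** 2 for yi in y)
SSR = sum((yh - ybar) ** 2 for yh in yhat)
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))

R2 = SSR / SST               # equivalently 1 - SSE / SST
se = (SSE / (n - 2)) ** 0.5  # standard error, d.f. = n - 2

# t test for zero slope, and its F counterpart
s_b1 = se / sxx ** 0.5
t_calc = b1 / s_b1
F_calc = t_calc ** 2         # in simple regression, F_calc = t_calc**2

print(round(R2, 4), round(t_calc, 2), round(F_calc, 1))
```

With this data the slope is highly significant (t_calc = 39, far beyond any conventional t_crit for d.f. = 3), matching the near-perfect R² ≈ 0.998.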

Caveat: In large samples, small correlations may be significant even if the scatter plot shows little evidence of linearity. Thus, a significant correlation may lack practical importance.

A few assumptions about the random error term ε are made when we use linear regression to fit a line to data: (1) the errors are normally distributed, (2) the errors have constant variance, and (3) the errors are independent. Because we cannot observe the error, we must rely on the residuals e₁ = y₁ − ŷ₁, e₂ = y₂ − ŷ₂, …, eₙ = yₙ − ŷₙ from the estimated regression for clues about possible violations of these assumptions. While formal tests exist for identifying assumption violations, many analysts rely on simple visual tools to help them determine whether an assumption has been violated and how serious the violation is. For more details, please consult the textbook.
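A minimal sketch of computing the residuals used in such visual checks (plain Python with an invented data set; in practice one would then plot the residuals, e.g. with matplotlib):

```python
# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# OLS fit
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Residuals e_i = y_i - yhat_i
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# With OLS the residuals always sum to (essentially) zero, so diagnostics
# focus on their pattern, not their mean: plot residuals against x (or yhat)
# and look for funnels (non-constant variance) or curves (non-linearity).
print([round(e, 3) for e in residuals])
```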
