STAT 445-Lecture 1_2021
REGRESSION ANALYSIS
Dr. Godwin Debrah
Linear Regression
Introduction
• Regression analysis is a statistical technique for investigating and modeling the relationship between variables

y = β₀ + β₁ x + u

y: dependent variable, explained variable, response variable, predicted variable, regressand
x: independent variable, explanatory variable, control variable, predictor variable, regressor, …
u: error term, disturbance, unobservables, …
The Simple Regression Model
• The errors are assumed to have mean zero and unknown variance σ²
• Example: in a regression of crop yield on fertilizer applied, β₁ measures the effect of fertilizer on yield, holding all other factors fixed; the error term captures unobserved factors such as rainfall, land quality, and the presence of parasites, …
• A regression model does not imply a cause-and-effect relationship between the variables, even though a strong empirical relationship may exist
• To establish causality, the relationship between the regressors and the response variable must have a basis outside the sample data; for example, the relationship may be suggested by theoretical considerations
• Regression analysis can aid in confirming a cause-and-effect relationship, but it cannot be the sole basis of such a claim
The Simple Regression Model
• Population regression function (PRF)
• Recall that there is a probability distribution for y at each possible value of x. The mean of this distribution is

E(y | x) = E(β₀ + β₁ x + u | x)
         = β₀ + β₁ x + E(u | x)
         = β₀ + β₁ x

provided we assume E(u | x) = 0
• This means that the average value of the dependent variable can be
expressed as a linear function of the explanatory variable
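To make the population regression function concrete, here is a minimal simulation sketch in Python/NumPy (the parameter values and error distribution are invented for illustration): for each fixed x, the sample mean of y settles on β₀ + β₁x because E(u | x) = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 2.0, 0.5            # illustrative population parameters

# For each fixed x, draw many y's and compare their average to beta0 + beta1*x
for x in [1.0, 2.0, 3.0]:
    u = rng.normal(0.0, 1.0, size=100_000)   # errors with E(u|x) = 0
    y = beta0 + beta1 * x + u
    print(x, y.mean(), beta0 + beta1 * x)    # sample mean tracks E(y|x)
```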
The Simple Regression Model
β̂₁ = ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / ∑ᵢ₌₁ⁿ (xᵢ − x̄)²,  β̂₀ = ȳ − β̂₁ x̄

• We often conveniently write β̂₁ = S_xy/S_xx or β̂₁ = ∑ᵢ₌₁ⁿ cᵢ yᵢ, where
• S_xx = ∑ᵢ₌₁ⁿ (xᵢ − x̄)², S_xy = ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) = ∑ᵢ₌₁ⁿ (xᵢ − x̄) yᵢ, and cᵢ = (xᵢ − x̄)/S_xx
• Using simple algebra, we can also write β̂₁ = ρ̂_xy (σ̂_y/σ̂_x), where ρ̂_xy is the sample correlation between x and y and σ̂_x, σ̂_y denote the sample standard deviations
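As a concrete check of these formulas, here is a minimal Python/NumPy sketch (the x and y arrays are invented example data): it computes β̂₁ and β̂₀ from S_xy and S_xx and confirms that the correlation form gives the same slope.

```python
import numpy as np

# Invented example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum((x - xbar) * (y - ybar))

b1 = Sxy / Sxx                     # slope estimate
b0 = ybar - b1 * xbar              # intercept estimate

# Equivalent form: slope = sample correlation * (sd_y / sd_x)
rho = np.corrcoef(x, y)[0, 1]
b1_alt = rho * (y.std(ddof=1) / x.std(ddof=1))

print(b0, b1, b1_alt)              # b1 and b1_alt agree
```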
Properties of the Least-Squares Estimators and the Fitted Regression Model
• The sum of the residuals, ûᵢ, in any regression model that contains an intercept is always zero. This property follows directly from the first normal equation
• The sum of the observed values yᵢ is equal to the sum of the fitted values ŷᵢ
• The least-squares regression line always passes through the centroid (x̄, ȳ) of the data
• The sum of the residuals weighted by the corresponding value of the regressor variable always equals zero, that is, ∑ᵢ₌₁ⁿ xᵢ ûᵢ = 0
• The sum of the residuals weighted by the corresponding fitted value always equals zero, that is, ∑ᵢ₌₁ⁿ ûᵢ ŷᵢ = 0
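All five properties can be verified numerically; a sketch using the same invented data as above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x
uhat = y - yhat                     # residuals

print(uhat.sum())                   # ~0: residuals sum to zero
print(y.sum(), yhat.sum())          # equal: observed and fitted totals match
print(b0 + b1 * x.mean(), y.mean()) # line passes through the centroid
print(np.sum(x * uhat))             # ~0: regressor-weighted residual sum
print(np.sum(yhat * uhat))          # ~0: fitted-value-weighted residual sum
```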
The Simple Regression Model
• CEO salary and return on equity
• Fitted regression: salary^ = 963.191 + 18.501 roe
• Fitted regression: wage^ = −0.90 + 0.54 educ
• Fitted regression: vote^ = 26.81 + 0.464 shareA
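As an illustration of how such a fitted line is used (the value roe = 30 is chosen arbitrarily, in the units of the original data), the first equation predicts salary^ = 963.191 + 18.501(30) ≈ 1518.2.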
β̂₁ = ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / ∑ᵢ₌₁ⁿ (xᵢ − x̄)²,  β̂₀ = ȳ − β̂₁ x̄

• The data are random and depend on the particular sample that has been drawn
• The question is what the estimators will estimate on average and how large their
variability in repeated samples is
• (Unbiasedness of OLS): SLR.1–SLR.3 ⇒ E(β̂₀) = β₀, E(β̂₁) = β₁
• Interpretation of unbiasedness
• The estimated coefficients may be smaller or larger, depending on the sample
that is the result of a random draw
• However, on average, they will be equal to the values that characterize the true relationship between y and x in the population
• “On average” means: if sampling were repeated, i.e. if drawing the random sample and doing the estimation were repeated many times
• In a given sample, estimates may differ considerably from true values
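A minimal repeated-sampling sketch of this idea (the true parameter values and data-generating process are invented for illustration): estimate β̂₀ and β̂₁ on many independently drawn samples and average the estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, n = 1.0, 3.0, 50     # assumed true population values

b0s, b1s = [], []
for _ in range(5_000):             # repeat: draw a sample, estimate, record
    x = rng.uniform(0.0, 10.0, n)
    u = rng.normal(0.0, 2.0, n)    # errors with E(u|x) = 0
    y = beta0 + beta1 * x + u
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0s.append(y.mean() - b1 * x.mean())
    b1s.append(b1)

# Averages over repeated samples are close to (1.0, 3.0): unbiasedness
print(np.mean(b0s), np.mean(b1s))
```

Individual estimates scatter around the true values; only their average across repeated samples pins them down.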
Properties of the Least-Squares Estimators and the Fitted Regression Model
• Recall we can write β̂₁ = ∑ᵢ₌₁ⁿ cᵢ yᵢ, where cᵢ = (xᵢ − x̄)/S_xx, and β̂₀ = ȳ − β̂₁ x̄
• Thus the least-squares estimators are linear combinations of the observations yᵢ
• The least-squares estimators of β₀ and β₁ are UNBIASED ESTIMATORS: since ∑ᵢ₌₁ⁿ cᵢ = 0 and ∑ᵢ₌₁ⁿ cᵢ xᵢ = 1, taking expectations gives E(β̂₁) = ∑ᵢ₌₁ⁿ cᵢ (β₀ + β₁ xᵢ) = β₁

⇒ E(β̂₁) = β₁
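The two weight identities that drive this argument, ∑cᵢ = 0 and ∑cᵢxᵢ = 1, hold for any data with non-constant x; a quick numerical check on invented values:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])                 # invented regressor values
c = (x - x.mean()) / np.sum((x - x.mean()) ** 2)   # OLS weights c_i

print(c.sum())        # ~0: weights sum to zero
print(np.sum(c * x))  # ~1: weights applied to x sum to one
```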
The Simple Regression Model
• Variances of the OLS estimators
• Depending on the sample, the estimates will be nearer or farther away from
the true population values
• How far can we expect our estimates to be away from the true population
values on average (= sampling variability)?
• Sampling variability is measured by the estimators’ variances, var(β̂₁) and var(β̂₀)
• Conclusion:
• The sampling variability of the estimated regression coefficients increases with the variability of the unobserved factors, and decreases with the variation in the explanatory variable
Variances of the OLS estimators

Var(β̂₁) = σ²/S_xx,  Var(β̂₀) = Var(ȳ) + x̄² Var(β̂₁) = σ² (1/n + x̄²/S_xx)
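A sketch checking the slope-variance formula by simulation (parameter values invented; x is held fixed across replications to match the conditional variance):

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, sigma = 1.0, 3.0, 2.0
x = np.linspace(0.0, 10.0, 40)             # regressor held fixed
Sxx = np.sum((x - x.mean()) ** 2)

b1s = []
for _ in range(20_000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, x.size)
    b1s.append(np.sum((x - x.mean()) * (y - y.mean())) / Sxx)

# Empirical variance of the slope estimates vs. sigma^2 / S_xx
print(np.var(b1s), sigma**2 / Sxx)
```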
The Simple Regression Model
• Estimating the error variance
• The variance of u does not depend on x, i.e. Var(uᵢ | xᵢ) = σ² = Var(uᵢ), the unconditional variance
• One could estimate the variance of the errors by calculating the variance of the residuals in the sample,

σ̂² = (1/n) ∑ᵢ₌₁ⁿ (ûᵢ − û̄)² = (1/n) ∑ᵢ₌₁ⁿ ûᵢ²

(the two forms agree because residuals from a model with an intercept have mean zero); unfortunately, this estimate would be biased
• An unbiased estimate of the error variance,

σ̂² = (1/(n − 2)) ∑ᵢ₌₁ⁿ ûᵢ²,

can be obtained by subtracting the number of estimated regression coefficients from the number of observations in the divisor
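A sketch of why the n − 2 divisor matters (all values invented): across simulated samples, dividing ∑û² by n underestimates σ², while dividing by n − 2 hits it on average.

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1, sigma, n = 1.0, 3.0, 2.0, 10

biased, unbiased = [], []
for _ in range(20_000):
    x = rng.uniform(0.0, 10.0, n)
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    r = y - (b0 + b1 * x)
    biased.append(np.sum(r**2) / n)          # divides by n: biased downward
    unbiased.append(np.sum(r**2) / (n - 2))  # divides by n-2: unbiased

print(np.mean(biased), np.mean(unbiased), sigma**2)  # ~3.2, ~4.0, 4.0
```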
The Simple Regression Model
• (Unbiasedness of the error variance): SLR.1–SLR.4 ⇒ E(σ̂²) = σ²
• Calculation of standard errors for regression coefficients:

se(β̂₁) = √(σ̂²/S_xx)   (the formula for Var(β̂₁) with σ̂² plugged in for the unknown σ²)

• The estimated standard deviations of the regression coefficients are called “standard errors.” They measure how precisely the regression coefficients are estimated.
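A sketch of the standard-error calculation on a single (invented) sample:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.8, 4.1, 4.4, 5.2, 6.1])
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)   # unbiased error-variance estimate

se_b1 = np.sqrt(sigma2_hat / Sxx)                            # se of slope
se_b0 = np.sqrt(sigma2_hat * (1 / n + x.mean() ** 2 / Sxx))  # se of intercept
print(se_b0, se_b1)
```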
The Simple Regression Model
• Goodness-of-Fit
“How well does the explanatory variable explain the dependent variable?”
• Measures of variation:

SST = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)²,  SSR = ∑ᵢ₌₁ⁿ (ŷᵢ − ȳ)²,  SSE = ∑ᵢ₌₁ⁿ ûᵢ²,  with SST = SSR + SSE

R² = SSR/SST = 1 − SSE/SST

• R-squared measures the fraction of the total variation that is explained by the regression
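A sketch of the decomposition and of R² on the same invented data as above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.8, 4.1, 4.4, 5.2, 6.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)       # total variation in y
SSR = np.sum((yhat - y.mean()) ** 2)    # variation explained by the fit
SSE = np.sum((y - yhat) ** 2)           # variation left in the residuals

print(SST, SSR + SSE)                   # decomposition: SST = SSR + SSE
print(SSR / SST, 1 - SSE / SST)         # two equal ways to compute R^2
```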
The Simple Regression Model
• CEO salary and return on equity

salary^ = 963.191 + 18.501 roe,  R² ≈ 0.013

• The regression explains only 1.3% of the total variation in salaries