Simple Linear Regression
Part I
Background Review
Regression Analysis
A statistical model is a mathematical description of the data
structure / data-generating mechanism
Parametric model
Easier to fit, interpret, infer
More powerful (statistically)
Model complexity is fixed
Nonparametric model
No distributional assumption
More flexible
Model complexity may grow
Semiparametric model
Regression Analysis
Example: exam scores
Parametric: approximate the class distribution by a normal
distribution with certain parameters (mean and variance)
(hence we can say mean +/- one standard deviation ~ 68%)
Nonparametric: use the histogram
Regression Analysis
Regression studies the relationship between
Response/outcome/dependent variables; and
Predictor/explanatory/independent variables
Types of Variables
Qualitative/Categorical
Nominal: no ordering in categories
• Marital Status
• Eye Color
Binary: only two categories
• Yes/No
• Male/Female
Ordinal: categories are naturally ordered
• Likert/rating scale
• Letter grades
Quantitative/Numerical
Discrete
• Number of children
• Defects per hour
Continuous
• Weight
• Voltage
Let’s look at the simplest case
To study the relationship between two numerical
variables, such as
Exam score vs. Time spent on doing revision
Apartment price vs. Gross floor area
Electricity consumption vs. Air temperature
Linear Correlation Analysis
Scatter plot
Linear Correlation Analysis Cont’d
(Sample) linear correlation coefficient, r
r = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / √[ Σᵢ₌₁ⁿ (Xᵢ − X̄)² · Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² ]
Dimensionless
−1 ≤ 𝑟 ≤ +1
“Sign” indicates the direction (positive / negative) of a linear
relationship
“Magnitude” measures the strength of a linear relationship
X̄ = Σᵢ₌₁ⁿ Xᵢ / n and Ȳ = Σᵢ₌₁ⁿ Yᵢ / n are the sample means
S_X² = Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1) and S_Y² = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² / (n − 1)
are the sample variances
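As a quick check, the formula above can be verified in R against the built-in cor() function. The data here are made up for illustration, not taken from the slides:

```r
# Hypothetical data: exam score (Y) vs. hours of revision (X)
X <- c(2, 4, 5, 7, 8, 10)
Y <- c(50, 55, 60, 70, 72, 85)

# r computed directly from the definition
r_manual <- sum((X - mean(X)) * (Y - mean(Y))) /
  sqrt(sum((X - mean(X))^2) * sum((Y - mean(Y))^2))

# r from the built-in function
r_builtin <- cor(X, Y)

all.equal(r_manual, r_builtin)  # TRUE
```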
Linear Correlation Analysis Cont’d
t-test for correlation coefficient
𝐻0 : 𝜌 = 0 (no linear correlation)
𝐻1 : 𝜌 ≠ 0 (linear correlation exists)
t-statistic t = (r − ρ) / √[(1 − r²)/(n − 2)]
p-value = 2𝑃(𝑡ₙ₋₂ ≥ |t|)
𝑡ₙ₋₂ denotes a 𝑡 distribution with (𝑛 − 2) degrees of freedom (d.f.)
Reject 𝐻0 if |t| > C.V. = t_{α/2,(n−2)} or p-value < 𝛼
Important!! Note the slight abuse of notation:
• upright t denotes the value of the statistic
• 𝑡ₙ₋₂ denotes the distribution itself
• t_{α/2,(n−2)} denotes its upper-tail quantile
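The test can be reproduced in R; cor.test() computes the same t-statistic and p-value as the formulas above (hypothetical data again):

```r
X <- c(2, 4, 5, 7, 8, 10)
Y <- c(50, 55, 60, 70, 72, 85)
n <- length(X)
r <- cor(X, Y)

# t-statistic under H0: rho = 0
t_stat <- (r - 0) / sqrt((1 - r^2) / (n - 2))
# two-sided p-value from the t distribution with n - 2 d.f.
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# cor.test() reproduces the same statistic and p-value
ct <- cor.test(X, Y)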
Example
Is residential apartment price related to its gross floor
area and age of the building?
Example
• R will not process code after #; use # for comments
#set working directory
setwd("C:/Users/chiwchu/Google Drive/Academic/CityU/MS3252/Lecture")
Example Cont’d
Example Cont’d
[R output: regression summary; p-value < 2×10⁻¹⁶]
Conditional Distribution
Probability/density -> Distribution
Conditional probability/density -> Conditional distribution
e.g. Let 𝑌 denote the random variable of whether it will
rain tomorrow (1=yes, 0=no)
If the probability of raining tomorrow is 0.4, the (marginal)
distribution of 𝑌 is Bernoulli(0.4), denoted by 𝑌 ∼ 𝐵𝑒𝑟𝑛(0.4)
But what if we know whether a typhoon is coming?
Let 𝑋 denote the random variable of whether a typhoon is
coming (1=yes, 0=no)
𝑋 can be random itself, but we can think of it as fixed
Conditional Distribution
Given the information of 𝑋, the probability of raining
tomorrow and hence the distribution of 𝑌 may change!
Say, conditional probability 𝑃 𝑌 = 1 𝑋 = 1 = 0.9, then
the conditional distribution of 𝑌|𝑋 = 1 is 𝐵𝑒𝑟𝑛(0.9)
Similarly, the conditional distribution of 𝑌|𝑋 = 0 could be
𝐵𝑒𝑟𝑛(0.3)
𝑌|𝑋 ∼ 𝐵𝑒𝑟𝑛(0.3 + 0.6𝑋)
The conditional distribution of 𝑌, particularly the
conditional mean, varies across different values of 𝑋
Regression is about the study of conditional distribution!
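A small simulation sketch of this setup, using the probabilities from the example above (P(rain | no typhoon) = 0.3, P(rain | typhoon) = 0.9):

```r
set.seed(1)
# typhoon indicator X, then rain indicator Y with P(Y = 1 | X) = 0.3 + 0.6 X
X <- rbinom(100000, size = 1, prob = 0.5)
Y <- rbinom(100000, size = 1, prob = 0.3 + 0.6 * X)

# conditional means of Y vary across values of X
mean(Y[X == 0])  # close to 0.3
mean(Y[X == 1])  # close to 0.9
```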
Part II
Formulation and Estimation
Overview of Regression Analysis
Input
Response / outcome / dependent variable, 𝑌
The variable we wish to explain or predict
Predictor / covariate / explanatory / independent variable, 𝑋
The variable used to explain the response variable
Output
A (linear) function that allows us to
Model association: Explain the variation of the response caused by the
predictor(s)
Provide prediction: Estimate the value of the response based on value(s)
of the predictor(s)
Simple Linear Regression - Formulation
A simple linear regression model consists of two
components
Regression line: A straight line that describes the
dependence of the average value (conditional mean) of the
𝑌-variable on one 𝑋-variable
Random error: The unexpected deviation of the observed
value from the expected value
[Annotated model: 𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝜀𝑖 , where 𝛽0 is the population intercept, 𝛽1 the population slope coefficient, 𝑋𝑖 the predictor, and 𝛽0 + 𝛽1 𝑋𝑖 the regression line]
Simple Linear Regression - Formulation
(Linear) regression model 𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝜀𝑖
Assumptions: E(𝑌𝑖 |𝑋𝑖 ) = 𝛽0 + 𝛽1 𝑋𝑖 ; the errors 𝜀𝑖 follow a
normal distribution and are independent
Simple Linear Regression - Formulation
Equivalently, the linear regression model can be written as
E(𝑌𝑖 |𝑋𝑖 ) = 𝛽0 + 𝛽1 𝑋𝑖 (mean function)
Var(𝑌𝑖 |𝑋𝑖 ) = σ²ε (variance function)
𝑌𝑖 |𝑋𝑖 are independent and normally distributed
In other words, 𝑌𝑖 |𝑋𝑖 are independent N(𝛽0 + 𝛽1 𝑋𝑖 , σ²ε)
N(μ, σ²) denotes a normal distribution with mean μ and
variance σ²
We also call it a mean regression model
Simple Linear Regression - Formulation
Framework: we have one response 𝑌 and 𝐾 predictors 𝑋
𝐾 = 1 here because we only have one 𝑋
We obtain a random sample of size 𝑛, containing the
values of 𝑌𝑖 and 𝑋𝑖 for each individual/subject/observation
𝑖, 𝑖 = 1, ⋯ , 𝑛
Our goal is to model/infer about the conditional mean of 𝑌
given 𝑋
As the conditional mean is characterized by 𝛽0 and 𝛽1 , that
means we need to estimate 𝛽0 and 𝛽1 from the data
Simple Linear Regression - Estimation
Goal: estimate 𝛽0 and 𝛽1
Let’s denote these estimates by 𝑏0 and 𝑏1
Our notation for parameters: Greek letters (𝛽0 , 𝛽1 )
represent the population/true versions; Roman letters
(𝑏0 , 𝑏1 ) represent the sample/estimated analogues.
Two methods (turn out to be equivalent for linear
regression):
Least Squares Estimator (LSE)/Ordinary Least Squares (OLS)
Maximum Likelihood Estimator (MLE)
Simple Linear Regression - Estimation
[Figure: scatter plot of Y against X with the fitted line Ŷᵢ = 𝑏0 + 𝑏1 𝑋𝑖 , showing the residual 𝑒𝑖 between 𝑌𝑖 and Ŷᵢ at 𝑋𝑖 ; we are assuming (conditional) normality of Y for every level of X]
𝑏0 represents the sample intercept
𝑏1 represents the sample slope coefficient
𝑒𝑖 represents the sample residual error
Simple Linear Regression - Estimation
𝑏0 and 𝑏1 are estimated using the least squares method,
which minimizes the sum of squared errors (SSE)
SSE = Σᵢ₌₁ⁿ 𝑒𝑖² = Σᵢ₌₁ⁿ (𝑌𝑖 − Ŷᵢ)² = Σᵢ₌₁ⁿ (𝑌𝑖 − 𝑏0 − 𝑏1 𝑋𝑖 )²
Simple Linear Regression - Estimation
The solutions for 𝑏0 and 𝑏1 can be obtained by
differentiating the SSE with respect to 𝑏0 and 𝑏1
That is, to solve
∂(Σᵢ₌₁ⁿ 𝑒𝑖²)/∂𝑏0 = −2 Σᵢ₌₁ⁿ (𝑌𝑖 − 𝑏0 − 𝑏1 𝑋𝑖 ) = 0
and
∂(Σᵢ₌₁ⁿ 𝑒𝑖²)/∂𝑏1 = −2 Σᵢ₌₁ⁿ 𝑋𝑖 (𝑌𝑖 − 𝑏0 − 𝑏1 𝑋𝑖 ) = 0
simultaneously
Simple Linear Regression - Estimation
The solutions are
𝑏1 = Σᵢ₌₁ⁿ (𝑋𝑖 − X̄)(𝑌𝑖 − Ȳ) / Σᵢ₌₁ⁿ (𝑋𝑖 − X̄)²
   = r √[ Σᵢ₌₁ⁿ (𝑌𝑖 − Ȳ)² / Σᵢ₌₁ⁿ (𝑋𝑖 − X̄)² ] = r (S_Y / S_X)
and
𝑏0 = Ȳ − 𝑏1 X̄
Also, the estimate for the error variance σ²ε is given by
S_e² = MSE = SSE / (n − K − 1)
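The closed-form solutions can be checked in R against lm(). The data below are simulated for illustration, not the apartment data from the example:

```r
set.seed(42)
X <- runif(50, 300, 1200)                      # e.g. gross floor area
Y <- 1.36 + 0.0048 * X + rnorm(50, sd = 0.5)   # e.g. price, with noise

# closed-form least squares estimates
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b0 <- mean(Y) - b1 * mean(X)

# lm() gives the same coefficients
m <- lm(Y ~ X)
coef(m)

# error variance estimate: MSE = SSE / (n - K - 1), with K = 1 here
SSE <- sum(residuals(m)^2)
MSE <- SSE / (length(Y) - 1 - 1)
all.equal(MSE, summary(m)$sigma^2)  # TRUE
```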
Simple Linear Regression - Estimation
Maximum likelihood estimation finds the parameters that
maximize the likelihood/probability of observing the sample
Recall that 𝑌𝑖 |𝑋𝑖 ∼ N(𝛽0 + 𝛽1 𝑋𝑖 , σ²ε)
The density function of N(μ, σ²) is (1/√(2πσ²)) exp(−(𝑦𝑖 − μ)² / (2σ²))
Assume σ²ε is known and equals 1 for simplicity…
The joint likelihood of observing these 𝑌𝑖 given these 𝑋𝑖 is
∏ᵢ₌₁ⁿ (1/√(2π)) exp(−(𝑌𝑖 − 𝛽0 − 𝛽1 𝑋𝑖 )² / 2)
Maximizing this likelihood function is equivalent to minimizing
Σᵢ₌₁ⁿ (𝑌𝑖 − 𝛽0 − 𝛽1 𝑋𝑖 )², which is exactly the SSE, so MLE = LSE!
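A numerical sketch of the MLE = LSE equivalence: minimizing the negative log-likelihood with optim() recovers (essentially) the lm() coefficients. Simulated data; σ²ε is fixed at 1 as above:

```r
set.seed(1)
X <- runif(40)
Y <- 2 + 3 * X + rnorm(40)

# negative log-likelihood with sigma^2 fixed at 1:
# minimizing it is the same as minimizing the SSE
negloglik <- function(beta) {
  sum((Y - beta[1] - beta[2] * X)^2) / 2 + length(Y) * log(2 * pi) / 2
}
mle <- optim(c(0, 0), negloglik)$par

ols <- coef(lm(Y ~ X))
round(mle, 3)
round(ols, 3)  # essentially identical
```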
Example Cont’d
[R summary output, annotated: 𝑏0 , 𝑏1 , S_e = √MSE, r² or R²]
Example – The Model &
Interpretation of Coefficients Cont’d
The estimated simple linear regression equation
Ŷ = 1.3584 + 0.0048𝑋
where Ŷ = Price in million HK$
𝑋 = Gross floor area in ft²
The estimated slope coefficient, 𝑏1
Measures the estimated change in the average value of 𝑌
as a result of a one-unit increase in 𝑋
𝑏1 = 0.0048 says that the price of an apartment increases
by 𝐻𝐾$4,800(= 0.0048 × 𝐻𝐾$1,000,000), on average, for
each square foot increase in gross floor area
Example – The Model &
Interpretation of Coefficients Cont’d
The estimated simple linear regression equation
Ŷ = 1.3584 + 0.0048𝑋
where Ŷ = Price in million HK$
𝑋 = Gross floor area in ft²
The estimated intercept coefficient, 𝑏0
Denotes the estimated average value of 𝑌 when 𝑋 is zero
𝑏0 = 1.3584 says that the price of an apartment is
𝐻𝐾$1,358,400(= 1.3584 × 𝐻𝐾$1,000,000), on average,
when the gross floor area is zero (any problem?)
Interpret with caution when the 𝑋-value is out of range!!
Example Cont’d
Regress Price against Age
Example Cont’d
The relationship between apartment price and age of the
building is
Ŷ = 6.1478 − 0.1078𝑍
where Ŷ = Price in million HK$
𝑍 = Age of building in years
If the building gets 1 year older, the average apartment
price decreases by 𝐻𝐾$107,800
Confidence Interval (CI)
Confidence interval estimate for slope coefficient
𝑏1 ± 𝑡𝛼Τ2,𝑛−𝐾−1 𝑆𝑏1
R program
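In R, confint() applies exactly this formula. The slide's R output is not reproduced here, so the sketch below refits a model named m1 on simulated data (the names GrossFA/Price mirror the example, the numbers do not):

```r
set.seed(3)
GrossFA <- runif(80, 300, 1200)
Price <- 1.36 + 0.0048 * GrossFA + rnorm(80, sd = 0.7)
m1 <- lm(Price ~ GrossFA)

# 95% CI for the coefficients
confint(m1, level = .95)

# reproduce the slope CI by hand: b1 +/- t_{alpha/2, n-K-1} * S_b1
s <- summary(m1)$coefficients
b1 <- s["GrossFA", "Estimate"]
se1 <- s["GrossFA", "Std. Error"]
b1 + c(-1, 1) * qt(.975, df = 80 - 1 - 1) * se1
```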
Special Case II: Two Groups
Now, the 𝑋𝑖 are either 0 or 1 indicating which group the
observation belongs to
The linear regression model assumes that
𝑌𝑖 |𝑋𝑖 = 0 are independent 𝑁 𝛽0 , 𝜎𝜀2
𝑌𝑖 |𝑋𝑖 = 1 are independent 𝑁 𝛽0 + 𝛽1 , 𝜎𝜀2
This is equivalent to fitting two normal distributions to the
two groups respectively!!
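This equivalence is easy to verify in R with a simulated two-group sample: the fitted intercept is the group-0 mean, and intercept plus slope is the group-1 mean.

```r
set.seed(2)
X <- rep(c(0, 1), each = 30)   # group indicator
Y <- 5 + 2 * X + rnorm(60)

m <- lm(Y ~ X)
# intercept = mean of group 0; intercept + slope = mean of group 1
all.equal(unname(coef(m)[1]), mean(Y[X == 0]))  # TRUE
all.equal(unname(sum(coef(m))), mean(Y[X == 1]))  # TRUE
```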
Part III
Goodness of Fit,
Parameter Inference,
and Model Significance
Goodness of Fit and Model Significance
We want to compare the fitted model with 𝑋 against the
null model without 𝑋
Fitted/Full model = the model you considered
Null model = special case I = a horizontal line at 𝑌ത
(Saturated model = data = the model with perfect fit)
Analysis of Variance (ANOVA) Cont’d
Coefficient of determination, R²
R² = SSR / SST, where SST = SSR + SSE
0 ≤ R² ≤ 1
Measures the proportion of variation of 𝑌𝑖 explained by
the regression equation with the predictor 𝑋
Measures the “goodness of fit” of the regression model
Remark!! R² = r² in simple linear regression, i.e. when
there is one 𝑋-variable
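The identities SST = SSR + SSE and R² = r² can be verified in R (simulated data for illustration):

```r
set.seed(4)
X <- runif(50)
Y <- 1 + 2 * X + rnorm(50, sd = 0.3)
m <- lm(Y ~ X)

SST <- sum((Y - mean(Y))^2)   # total variation
SSE <- sum(residuals(m)^2)    # unexplained variation
R2 <- 1 - SSE / SST           # = SSR / SST

all.equal(R2, summary(m)$r.squared)  # TRUE
all.equal(R2, cor(X, Y)^2)           # TRUE: R^2 = r^2 with one predictor
```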
Example
Which independent variable, GrossFA or Age, provides a
better explanation of the variation in apartment price?
[anova output, annotated: SSR, SSE, S_e² = MSE]
For SST, use either of
• sum(anova(m1)[,2])
• var(Price)*(length(Price)-1)
Inferences about the Parameters –
𝑋-Variable Significance
t-test for a slope coefficient
𝐻0 : 𝛽1 = 0 (no linear relationship)
𝐻1 : 𝛽1 ≠ 0 (linear relationship exists)
t-statistic t = (𝑏1 − 𝛽1 ) / S_b₁
where S_b₁ = standard error of the slope
Inferences about the Parameters –
𝑋-Variable Significance Cont’d
S_b₁ measures the variation in the slope of regression lines from
different possible samples (one color denotes one sample)
S_b₁² = S_e² / Σ(𝑋𝑖 − X̄)² = S_e² / ((n − 1) S_X²)
S_e² = SSE / (n − K − 1) = variation of the errors around the regression line
Inferences about the Parameters –
𝑋-Variable Significance Cont’d
Recall 𝑏1 = r (S_Y / S_X); we can show that
𝑏1 / S_b₁ = r / √[(1 − r²)/(n − 2)]!!
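This identity — the slope t-statistic equals the correlation t-statistic — can be confirmed numerically (simulated data):

```r
set.seed(5)
X <- runif(30)
Y <- 2 - X + rnorm(30)
n <- length(X)
r <- cor(X, Y)
m <- lm(Y ~ X)

t_slope <- summary(m)$coefficients["X", "t value"]  # b1 / S_b1
t_corr  <- r / sqrt((1 - r^2) / (n - 2))            # correlation t-statistic
all.equal(t_slope, t_corr)  # TRUE
```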
Example
Is GrossFA significantly affecting the apartment price?
[R summary output, annotated: 𝑏1 , S_b₁ , t, p-value; d.f. = n − K − 1]
Example Cont’d
Is GrossFA significantly affecting the apartment price?
𝐻0 : 𝛽GrossFA = 0
𝐻1 : 𝛽GrossFA ≠ 0
t = (0.0048 − 0) / 0.000448 = 10.812
At 𝛼 = 5%
d.f. = (80 − 1 − 1) = 78
C.V. = 1.9908
Reject 𝐻0 ; GrossFA significantly affects apartment price.
In R, use
• qt(.975,78) to obtain C.V.
• 2*(1-pt(10.81,78)) to obtain p-value
In exam,
• use t-table to obtain C.V.
• p-value is not computable by hand, but a range can be found at best
[anova output, annotated: F-statistic with d.f. = K, n − K − 1; p-value; SSR and MSR; SSE and MSE]
Example Cont’d
Is the model significant?
𝐻0 : 𝛽GrossFA = 0
𝐻1 : 𝛽GrossFA ≠ 0
F = 55.96517 / 0.47876 = 116.90
At 𝛼 = 5%
d.f. = 1, (80 − 1 − 1) = 1, 78
C.V. = 4.00
In R, use
• qf(.95,1,78) to obtain C.V.
• 1-pf(116.90,1,78) to obtain p-value
In exam,
• use F-table to obtain C.V.
• p-value is not computable by hand
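The F = t² relationship for a single predictor can be checked in R (simulated data):

```r
set.seed(6)
X <- runif(80)
Y <- 1 + 0.5 * X + rnorm(80)
m <- lm(Y ~ X)

t_stat <- summary(m)$coefficients["X", "t value"]
F_stat <- anova(m)["X", "F value"]
all.equal(F_stat, t_stat^2)  # TRUE: F = t^2 with a single predictor
```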
Part IV
Prediction and Diagnostics
Prediction of New Observations –
Point Prediction
Convert the given 𝑋-value into the same measurement
scale as the observed 𝑋-value
As the estimated slope coefficient is scale dependent
Ideally, only use the regression equation to predict the 𝑌-
value when the given 𝑋-value is inside the observed data
range
As we are not sure whether the linear relationship will go
beyond the range of observed 𝑋-value
Example Cont’d
What is the estimated price for an apartment with gross floor
area 764 ft²?
Prediction given by the simple linear regression equation
Ŷ = 1.3585 + 0.0049𝑋
  = 1.3585 + 0.0049 × 764 = 5.1021
where Ŷ = Price in million HK$
𝑋 = Gross floor area in ft²
The expected price for an apartment with gross floor area
764 ft² is 𝐻𝐾$5,102,100
What is the estimated mean price for apartments with gross
floor area 764 ft²? – same estimate, but any differences?
Prediction of New Observations Cont’d
The predictions given by regression models raised from
different possible samples will vary
[Figure: several fitted lines from different samples give different Ŷᵢ at the same 𝑋𝑖 – which prediction should we trust?]
Prediction of New Observations –
Interval Prediction Cont’d
Confidence interval estimate for the mean of the 𝑌-variable
given an 𝑋-value
Ŷ ± t_{α/2, n−K−1} S_m
where S_m² = S_e² [1/n + (𝑋 − X̄)² / Σ(𝑋𝑖 − X̄)²], and 𝑋 is the given 𝑋-value
R program
predict(m1,level=.95,interval="confidence")
Note that Σ(𝑋𝑖 − X̄)² = (n − 1) S_X², where S_X² is the
sample variance of 𝑋
Prediction of New Observations –
Interval Prediction Cont’d
Prediction interval estimate for an individual 𝑌-value
given an 𝑋-value
Ŷ ± t_{α/2, n−K−1} S_p
where S_p² = S_e² [1 + 1/n + (𝑋 − X̄)² / Σ(𝑋𝑖 − X̄)²] = S_e² + S_m²
R program
predict(m1,level=.95,interval="prediction")
It is still a type of confidence interval, although we use the
term prediction interval to differentiate them
This interval is wider because there is more uncertainty about
the prediction for a single 𝑌 compared to the average
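Both intervals come from predict() in R. The sketch below refits m1 on simulated data in place of the original apartment dataset, then compares the two interval widths at 𝑋 = 764:

```r
set.seed(7)
GrossFA <- runif(80, 300, 1200)
Price <- 1.36 + 0.0048 * GrossFA + rnorm(80, sd = 0.7)
m1 <- lm(Price ~ GrossFA)

new <- data.frame(GrossFA = 764)
ci <- predict(m1, newdata = new, level = .90, interval = "confidence")
pi <- predict(m1, newdata = new, level = .90, interval = "prediction")

# same point prediction, but the prediction interval is wider
(pi[, "upr"] - pi[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"])  # TRUE
```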
Prediction of New Observations –
Interval Prediction Cont’d
[Figure: prediction interval for an individual 𝑌-value, wider than the confidence interval for the mean, plotted against 𝑋]
Example Cont’d
Determine a 90% confidence interval for the mean
apartment price for flats of 764 ft2 gross area
Also, construct a 90% prediction interval for the
apartment price for a flat of 764 ft2 gross area
Regression Assumptions
Linearity of regression equation
𝛽0 + 𝛽1 𝑋𝑖 is a linear function
Error normality
𝜀𝑖 has a normal distribution for all 𝑖
Constant variances of errors
Var 𝜀𝑖 |𝑋𝑖 = 𝜎𝜀2
Error independence
𝜀𝑖 are independent for all 𝑖
Residual Analysis
Check the regression assumptions by examining the
residuals
Residuals (or errors), 𝑒𝑖 = 𝑌𝑖 − Ŷᵢ
Plot
Residuals against the predictor for checking linearity and
constant variances
Residuals against index for checking error independence
Histogram of the residuals for examining error normality
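A minimal R sketch of these diagnostic plots, using base graphics on simulated data:

```r
set.seed(8)
X <- runif(60)
Y <- 1 + 2 * X + rnorm(60, sd = 0.2)
m <- lm(Y ~ X)
e <- residuals(m)

# residuals vs predictor: look for curvature / non-constant spread
plot(X, e); abline(h = 0)
# residuals vs index: look for time-related patterns
plot(seq_along(e), e, type = "b"); abline(h = 0)
# histogram of residuals: look for approximate normality
hist(e)
```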
Residual Analysis Cont’d
[Residual plots against 𝑋]
If the residuals have a systematic pattern, the 𝑌- and 𝑋-variables
do not have a linear relationship, but a curved one
Error variance increases with the 𝑋-value
Residual Analysis Cont’d
[Residual plots against index (time)]
Residuals displaying a random pattern vs. negative residuals
associated mainly with the early trials and positive residuals
with the later trials: the time the data were collected affects
the residuals and 𝑌-values
Residual Analysis Cont’d
[Two histograms of the residuals, e, in % frequency, for examining error normality]
Summary
Population version: response 𝑌𝑖 , predictor 𝑋𝑖 , correlation 𝜌, error 𝜀𝑖
Sample analogy: correlation 𝑟, error 𝑒𝑖
Variance of estimator (take square root to get standard error):
• Response: S_Y² = SST/(n − 1) = Σ(𝑌𝑖 − Ȳ)²/(n − 1)
• Predictor: S_X² = Σ(𝑋𝑖 − X̄)²/(n − 1)
• Correlation: (1 − r²)/(n − 2)
• Error: S_e² = SSE/(n − K − 1)
Summary
SST = SSR + SSE is the breakdown of variance / variation
R² = SSR/SST = 1 − SSE/SST = a single number in [0, 1] that quantifies
the model-explained variation / measures the goodness of fit
t-statistic t = (𝑏1 − 𝛽1 )/S_b₁ tests the significance of a single predictor,
i.e. whether 𝛽1 = 0
F-statistic F = MSR/MSE = (SSR/K) / (SSE/(n − K − 1)) tests the significance
of the entire model, i.e. whether 𝛽1 = ⋯ = 𝛽K = 0, where K is the number of
predictors
In this chapter with a single 𝑋, K = 1 and F = t²
Point prediction and confidence interval prediction