
CB2200 Business Statistics

Topic 8
Simple Linear Regression
Tutorial Week 13

Outline

◼ Covariance and the Coefficient of Correlation


◼ Simple Linear Regression
❑ Least Squares Estimation
❑ Predictions in Regression Analysis
❑ Coefficient of Determination
❑ Inferences about the Slope
◼ Applications of Linear Regression

Covariance
◼ How do we measure the degree of linear association
between two variables 𝑋 and 𝑌?
◼ The answer to this question is the covariance
❑ A quantity that measures the linear association

◼ Population covariance
$$\sigma_{XY} = \frac{\sum_{i=1}^{N}(X_i - \mu_X)(Y_i - \mu_Y)}{N}$$
◼ Sample covariance
$$S_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$
❑ An estimator of 𝜎𝑋𝑌 based on 𝑛 pairs of sample values
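As a quick supplement to the formulas (outside the deck's Excel workflow), here is a minimal Python sketch computing the sample covariance; the values of x and y are hypothetical, not from the lecture:

```python
import numpy as np

# Hypothetical paired sample, n = 5 (illustration only)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 4.0, 6.0, 9.0])
n = len(x)

# Sample covariance from the definition: cross-deviations over (n - 1)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 sample covariance matrix; the off-diagonal
# entry is S_XY (np.cov divides by n - 1 by default)
assert np.isclose(s_xy, np.cov(x, y)[0, 1])
print(s_xy)  # 9.5 for this data: a positive linear association
```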

Coefficient of Correlation

◼ Population coefficient of correlation ($\rho$, pronounced "rho")
$$\rho_{XY} = \frac{\sum_{i=1}^{N}(X_i - \mu_X)(Y_i - \mu_Y)}{\sqrt{\sum_{i=1}^{N}(X_i - \mu_X)^2}\,\sqrt{\sum_{i=1}^{N}(Y_i - \mu_Y)^2}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$
◼ Sample coefficient of correlation
$$r_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}} = \frac{S_{XY}}{S_X S_Y}$$

❑ An estimator of 𝜌𝑋𝑌
◼ The sign of 𝜌𝑋𝑌 (𝑟𝑋𝑌 ) is the same as that of 𝜎𝑋𝑌 (𝑆𝑋𝑌 )
❑ As the denominator of 𝜌𝑋𝑌 is always non-negative
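Continuing the same hypothetical data, a short sketch of the sample correlation as the covariance scaled by the two sample standard deviations:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 4.0, 6.0, 9.0])

# r_XY = S_XY / (S_X * S_Y); ddof=1 gives sample standard deviations
s_xy = np.cov(x, y)[0, 1]
r_xy = s_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Cross-check against numpy's built-in correlation matrix
assert np.isclose(r_xy, np.corrcoef(x, y)[0, 1])
print(r_xy)  # close to +1: a strong positive linear relationship
```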
Coefficient of Correlation
Cont’d
◼ It can be shown that it is always the case that
−1 ≤ 𝜌𝑋𝑌 ≤ 1 and −1 ≤ 𝑟𝑋𝑌 ≤ 1
◼ Three special values of 𝜌𝑋𝑌 and 𝑟𝑋𝑌 are of interest
❑ When 𝜌𝑋𝑌 = 0 (𝑟𝑋𝑌 = 0), 𝑋 and 𝑌 are not linearly related, and
we say that 𝑋 and 𝑌 are uncorrelated in the population (sample)
❑ When all population (sample) values of 𝑋 and 𝑌 lie exactly on a
straight line having a positive slope, then 𝜌𝑋𝑌 = 1 (𝑟𝑋𝑌 = 1)
❑ When all population (sample) values of 𝑋 and 𝑌 lie exactly on a
straight line having a negative slope, then 𝜌𝑋𝑌 = −1 (𝑟𝑋𝑌 = −1)
◼ If the population (sample) values of 𝑋 and 𝑌 lie close to a
straight line, then 𝜌𝑋𝑌 (𝑟𝑋𝑌 ) will be close to 1 or -1

Coefficient of Correlation
[Scatter plot illustrations: the ($X$, $Y$) plane is divided into Quadrants I–IV by a vertical line at $\bar{X}$ and a horizontal line at $\bar{Y}$. As r → +1: strong positive linear relationship; as r → −1: strong negative linear relationship; as r → 0: weak linear relationship (though some other, non-linear relationship may exist)]
Linear Regression Model
◼ The population (or true) regression line is defined as
$$Y_i = \hat{Y}_i + \varepsilon_i = \beta_0 + \beta_1 X_i + \varepsilon_i \quad\Rightarrow\quad \varepsilon_i = Y_i - \hat{Y}_i$$
◼ The response of 𝑌𝑖 to a particular value 𝑋𝑖 will be the
sum of two parts
❑ An expectation (𝛽0 +𝛽1 𝑋𝑖 ) reflecting their systematic
relationship
❑ A discrepancy 𝜀𝑖 from the expectation, often called the error
term
◼ Since the population regression line involves only one independent variable (𝑋𝑖), the model is sometimes called the simple linear regression model
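To make the two parts of the model concrete, here is a minimal simulation sketch; the parameter values (𝛽0 = 5, 𝛽1 = −1, 𝜎 = 2) are made up for illustration:

```python
import numpy as np

# Simulate Y_i = beta0 + beta1 * X_i + eps_i with assumed parameters
rng = np.random.default_rng(0)
beta0, beta1, sigma = 5.0, -1.0, 2.0

x = np.linspace(0.0, 10.0, 50)
eps = rng.normal(0.0, sigma, size=x.size)  # error terms: mean 0, sd sigma
y = beta0 + beta1 * x + eps                # systematic part plus discrepancy
print(y[:5])
```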

Least Squares Estimation
Cont’d

◼ We must consider the entire set of (𝑋𝑖, 𝑌𝑖), 𝑖 = 1, …, 𝑛, when determining the goodness of fit
◼ For an observed set of (𝑋𝑖, 𝑌𝑖), 𝑖 = 1, …, 𝑛, suppose there exists a straight line
$$\hat{Y}_i = b_0 + b_1 X_i$$
such that it minimizes the sum of squared errors (SSE)
$$\min\ \text{SSE} = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n} e_i^2$$
❑ The least-squares criterion is about finding such values of 𝑏0 and 𝑏1
❑ The resulting line is often called the least-squares regression line

Least Squares Estimation
Cont’d

◼ It is possible to show using calculus that the least-squares estimates 𝑏0 and 𝑏1 can be determined as
$$b_0 = \bar{Y} - b_1\bar{X} \qquad b_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
❑ 𝑏0 and 𝑏1 are the least squares estimates for 𝛽0 and 𝛽1 respectively
❑ Equivalently,
$$b_1 = r_{XY}\,\frac{S_Y}{S_X} = r_{XY}\,\frac{\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$$
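A sketch of the closed-form estimates on the hypothetical data used earlier; it also checks the equivalent $r_{XY}\,S_Y/S_X$ form of $b_1$:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 4.0, 6.0, 9.0])

# Closed-form least-squares estimates
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Equivalent form: b1 = r_XY * (S_Y / S_X)
r_xy = np.corrcoef(x, y)[0, 1]
assert np.isclose(b1, r_xy * np.std(y, ddof=1) / np.std(x, ddof=1))

y_hat = b0 + b1 * x              # fitted values on the least-squares line
sse = np.sum((y - y_hat) ** 2)   # the quantity the criterion minimizes
print(b0, b1, sse)               # b0 = -1.1, b1 = 0.95 for this data
```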

Developing Regression Model
in Excel Cont’d

◼ Output
[Excel regression output screenshot; annotations mark $|r_{XY}|$, SSE, $b_0$, and $b_1$ in the output]
Coefficient of Determination
Cont’d

◼ A better way to measure the goodness of fit for a least-squares regression line is to compare its SSE value to that of another regression line based on the same set of 𝑌
◼ A natural second line to compare with is $\hat{Y}_i = \bar{Y}$, that is, estimating the mean value of 𝑌 without using 𝑋
◼ The corresponding SSE is
$$\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i^{*}\right)^2 = \sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2 = \text{SST}$$
❑ SST is called the total variation in 𝑌 or the total sum of squares

Coefficient of Determination
◼ The goal is to determine by how much the SSE is smaller
than SST
❑ Or, the amount of improvement in using the regression line and
the independent variable 𝑋 rather than just the sample mean to
predict 𝑌
◼ This measure is provided through a statistic called the
coefficient of determination (𝑅2 )
$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}}$$
❑ $R^2$ is unit-free, with values between 0 and 1 inclusive
❑ The higher the $R^2$, the better the fit (the stronger the linear association between 𝑋 and 𝑌)
❑ However, a high $R^2$ does not mean that 𝑋 causes 𝑌
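A short sketch computing SSE, SST, and $R^2$ for the hypothetical fit above (not the deck's days-off example):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 4.0, 6.0, 9.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)     # unexplained variation around the line
sst = np.sum((y - y.mean()) ** 2)  # total variation around the sample mean
r_squared = 1.0 - sse / sst
print(r_squared)  # about 0.97 here: the line explains most of the variation
```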

Coefficient of Determination

◼ In our example, $R^2 = 0.7456$. Commonly, the coefficient of determination is interpreted as
❑ 74.56% of the sample variability in 𝑌 is explained by its linear dependency on 𝑋
❑ Or, alternatively, by taking the linear dependence on 𝑋 into account, the SSE is reduced by 74.56%

Coefficient of Determination
Cont’d

◼ In a regression model containing only one 𝑋 variable,
$$R^2 = (r_{XY})^2$$
◼ Hence, in our example, the sample correlation coefficient between 𝑋 and 𝑌 is $r_{XY} = -\sqrt{0.7456} = -0.8635$
❑ We know 𝑟𝑋𝑌 has a negative sign because 𝑏1 is negative
❑ 𝑟𝑋𝑌 would have a positive sign if 𝑏1 were positive
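To see the $R^2 = (r_{XY})^2$ identity and the sign convention numerically, a self-contained check on the same hypothetical data (here 𝑏1 is positive, so r carries a + sign):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 4.0, 6.0, 9.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
r_xy = np.corrcoef(x, y)[0, 1]
r_squared = r_xy ** 2  # equals the R^2 of the one-variable regression

# Recovering r from R^2 requires attaching the sign of the slope b1
assert np.isclose(r_xy, np.sign(b1) * np.sqrt(r_squared))
```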

Inferences about the Slope
◼ At times, tests concerning 𝛽1 are of interest, particularly
one of the forms: H0: 𝛽1 = 0 vs H1: 𝛽1 ≠ 0
◼ If 𝛽1 = 0, there is no linear relationship between 𝑋 and 𝑌
❑ The means of the probability distribution of 𝑌 are all equal, namely $E(Y \mid X = x) = \beta_0 + 0 \cdot x = \beta_0$ for all levels of 𝑋
❑ A change in 𝑋 does not induce any change in 𝑌
◼ Similar to those discussed in Topics 6 & 7, we need to
consider the sampling distribution of 𝑏1, the least squares
point estimate of 𝛽1 , in order to perform the inferences
on 𝛽1

Inferences about the Slope
Cont’d

◼ The population regression line is defined as
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
◼ It is very common to assume that the error terms 𝜀𝑖 are
independent and normally distributed with mean 0 and
variance $\sigma^2$, 𝑖 = 1, …, 𝑛
❑ This assumption can be relaxed, but it will make the inference on
the slope parameter (and others) more complicated
◼ Under this assumption, the dependent variables 𝑌𝑖 are
also independent and normally distributed with mean
$E(Y_i) = \beta_0 + \beta_1 X_i$ and variance $\sigma^2$, 𝑖 = 1, …, 𝑛
❑ We are treating 𝑋𝑖 as known constants
Inferences about the Slope
Cont’d

◼ Sampling distribution of 𝑏1
❑ Since the 𝑌𝑖 are normal, the estimator 𝑏1 is also normal. It can be shown that 𝑏1 has mean and variance
$$E(b_1) = \beta_1 \qquad \sigma_{b_1}^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
❑ The variance $\sigma_{b_1}^2$ can be estimated by $S_{b_1}^2$ as
$$S_{b_1}^2 = \frac{S_e^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{\text{SSE}/(n-2)}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2/(n-2)}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$

◼ $S_{b_1}$ measures the variability in the slope of regression lines arising from different possible samples
◼ $S_e^2$ is called the mean squared error (MSE) of the regression model. It measures the variance of the errors around the regression line. It is an unbiased estimator of $\sigma^2$
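A sketch of the MSE and the slope's standard error on the hypothetical data from the earlier snippets:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 4.0, 6.0, 9.0])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)

s_e2 = sse / (n - 2)  # MSE: unbiased estimator of sigma^2
s_b1 = np.sqrt(s_e2 / np.sum((x - x.mean()) ** 2))  # std. error of b1
print(s_e2, s_b1)
```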
Inferences about the Slope
Cont’d

◼ Confidence intervals for the population regression slope
❑ Since 𝑏1 is normally distributed, when $\sigma_{b_1}$ is estimated by $S_{b_1}$, the statistic
$$\frac{b_1 - \beta_1}{S_{b_1}} \sim t \text{ with } n-2 \text{ degrees of freedom}$$
❑ If the error terms 𝜀𝑖 are normally distributed as assumed, a 100(1−𝛼)% confidence interval for the population regression slope 𝛽1 is given by
$$\left[\,b_1 - t_{\alpha/2,\,n-2}\,S_{b_1},\; b_1 + t_{\alpha/2,\,n-2}\,S_{b_1}\,\right]$$
where $t_{\alpha/2,\,n-2}$ is the value corresponding to an upper-tail probability of 𝛼/2 from the 𝑡 distribution at 𝑛 − 2 degrees of freedom
Inferences about the Slope
Cont’d

◼ The confidence interval for the population regression slope is interpreted as
❑ The 100(1−𝛼)% confidence interval for the expected change in 𝑌 resulting from a one-unit increase in 𝑋 is $\left[\,b_1 - t_{\alpha/2,\,n-2}\,S_{b_1},\; b_1 + t_{\alpha/2,\,n-2}\,S_{b_1}\,\right]$

Inferences about the Slope –
Exercise Cont’d

◼ Refer to our example on the number of days taken off work, given $b_1 = -1.09$ and $S_{b_1} = 0.2842$
◼ A 95% CI for 𝛽1 is
$$b_1 \pm t_{\alpha/2,\,n-2}\,S_{b_1} = -1.09 \pm 2.5706 \times 0.2842 = [-1.821, -0.359]$$
The 95% CI for the expected decrease in the number of days taken off work resulting from one additional year of service is between 0.359 and 1.821 days
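The interval can be reproduced with scipy's t quantile function; this sketch uses the $b_1$, $S_{b_1}$, and 𝑛 given in the exercise:

```python
from scipy import stats

b1, s_b1, n = -1.09, 0.2842, 7

# Upper-tail alpha/2 quantile of the t distribution with n - 2 = 5 df
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)  # about 2.5706

lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print(round(lower, 3), round(upper, 3))  # [-1.821, -0.359], as on the slide
```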
Inferences about the Slope
◼ Hypothesis testing for 𝛽1
❑ For hypotheses $H_0\!: \beta_1 = 0$ and $H_1\!: \beta_1 \neq 0$, the 𝑡 test statistic is
$$t = \frac{b_1}{S_{b_1}}$$
which follows a 𝑡 distribution with 𝑛 − 2 degrees of freedom under $H_0$

❑ Critical value approach
◼ At the 𝛼 significance level, reject $H_0$ if $t < \text{critical value}_L$ or $t > \text{critical value}_U$, where the critical values are obtained from the 𝑡 distribution table at 𝑛 − 2 degrees of freedom
❑ 𝑝-value approach
◼ 𝑝-value $= P(T \leq -|t|) + P(T \geq |t|)$
◼ Reject 𝐻0 if 𝑝-value < 𝛼

❑ The same 𝑡 can also be used for testing the hypotheses $H_0\!: \beta_1 \leq 0$ vs $H_1\!: \beta_1 > 0$, or $H_0\!: \beta_1 \geq 0$ vs $H_1\!: \beta_1 < 0$

Inferences about the Slope –
Exercise Cont’d

◼ In the example on the number of days taken off work, test at the 5% level of significance: is years of service linearly influencing the number of days taken off work?
$H_0\!: \beta_1 = 0$ vs $H_1\!: \beta_1 \neq 0$, at 𝛼 = 0.05, with 𝑛 = 7 and 𝑑𝑓 = 5
Given $b_1 = -1.09$ and $S_{b_1} = 0.2842$,
$$t = \frac{b_1}{S_{b_1}} = \frac{-1.09}{0.2842} = -3.835$$
Critical values $= \pm 2.5706$; reject $H_0$ if $t < -2.5706$ or $t > +2.5706$. Also, $0.01 < p\text{-value} < 0.02$
At 𝛼 = 0.05, reject $H_0$. There is evidence that years of service is linearly related to the number of days taken off work
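The test statistic, p-value bracket, and critical values in the exercise can be verified with a few lines of scipy:

```python
from scipy import stats

b1, s_b1, n = -1.09, 0.2842, 7
df = n - 2

t_stat = b1 / s_b1                         # about -3.835
p_value = 2 * stats.t.sf(abs(t_stat), df)  # P(T <= -|t|) + P(T >= |t|)
t_crit = stats.t.ppf(1 - 0.05 / 2, df)     # +/- 2.5706 at alpha = 0.05

print(round(t_stat, 3), round(p_value, 4), round(t_crit, 4))
# p-value is about 0.012, between 0.01 and 0.02, so H0 is rejected at 5%
```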
Developing Regression Model
in Excel Cont’d

◼ Output
[Excel regression output screenshot; annotations mark $|r_{XY}|$, $R^2$, $S_e$, 𝑛, SSE, SST, $b_0$, $b_1$, $S_{b_1}$, the 𝑡 statistic and 𝑝-value for testing $\beta_1 = 0$, and the 95% and 90% CIs for 𝛽1]
Multiple Linear Regression
◼ In many situations, two or more independent variables
may be included in a regression model to provide an
adequate description of the process under study or to
yield sufficiently precise inferences
◼ For example, a regression model for predicting the
demands for a firm’s product in different countries uses
socioeconomic variables (mean household income,
average years of schooling of head of household),
demographic variables (average family size, percentage
of retired population), and environmental variables
(mean daily temperature, pollution index), etc.

Multiple Linear Regression

◼ Linear regression models containing two or more independent variables are called multiple linear regression models
◼ The simple linear regression model can be extended to include 𝑘 independent variables
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$
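A minimal multiple-regression sketch with 𝑘 = 2 hypothetical predictors, fitted by least squares via numpy; the true coefficients (3, 2, −1.5) are made up for the simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.uniform(0.0, 10.0, n)
x2 = rng.uniform(0.0, 5.0, n)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0.0, 1.0, n)

# Design matrix [1, X1, X2]; the column of ones yields the intercept b0
X = np.column_stack([np.ones(n), x1, x2])
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # least-squares estimates of (beta0, beta1, beta2)
```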

