
Chapter 2

STATS 101A Introduction to Data Analysis and Regression


Maria Cha
Story of ‘Regression’
• Dictionary definition : The act of returning or stepping
back to a previous stage.

• Why do we use the term ‘regression’ to describe the statistical procedure of fitting a linear model?

• The term ‘regression’ was coined by Francis Galton, who used the regression model to study genetic relationships.

• He noticed that even though taller-than-average fathers tended to have taller-than-average sons, the sons were somewhat closer to average than the fathers were.
Story of ‘Regression’
• Galton called this phenomenon ‘regression toward
mediocrity’, which later came to be known as
‘regression toward the mean’.

• For Galton, “regression” referred only to the tendency of extreme data values to "revert" to the overall mean value.

• Later, as he and other statisticians built on the methodology to quantify linear relationships and to fit lines to data values, the term “regression” became associated with the statistical analysis that we now call regression.
Simple Linear Regression Model
• The ultimate purpose of the analysis is to model the
relationship between the two variables as a straight line.

• In particular, 𝑌 is modeled as a linear function of 𝑋.

• Data structure: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$


• 𝑌 : response / dependent variable
𝑋 : predictor / independent variable.
Simple Linear Regression Model
• Recall: The population linear model is
$$Y_i = \beta_0 + \beta_1 X_i + e_i$$

• (Random) error term ($e_i$): the variation in $Y$ that cannot be predicted or explained.

• Assumptions about the error terms:
  • 1. they are random and independent of one another,
  • 2. they are normally distributed,
  • 3. their mean is 0 and their variance is $\sigma^2$, which is usually unknown.

• In summary, $e_i \sim N(0, \sigma^2)$.
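As a concrete illustration (not part of the original slides), the sketch below simulates data from this population model with hypothetical values $\beta_0 = 2$, $\beta_1 = 0.5$, and $\sigma = 1$, then recovers the line with lm(); all numeric values are assumptions chosen for the example.

```r
# Minimal sketch (assumed values): simulate Y_i = beta0 + beta1*X_i + e_i,
# with e_i ~ N(0, sigma^2), then fit the line by least squares with lm().
set.seed(101)
n     <- 100
beta0 <- 2      # hypothetical population intercept
beta1 <- 0.5    # hypothetical population slope
sigma <- 1      # hypothetical error standard deviation

x <- runif(n, 0, 10)
e <- rnorm(n, mean = 0, sd = sigma)   # errors: independent, normal, mean 0
y <- beta0 + beta1 * x + e

fit <- lm(y ~ x)
coef(fit)        # b0 and b1 should be close to beta0 and beta1
```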
Simple Linear Regression Model
• Note: When $\sigma^2$ is also unknown, we estimate it with
$$S^2 = \frac{1}{n-2}\sum_{i=1}^{n} \hat{e}_i^2,$$
which is an unbiased estimator of $\sigma^2$. (HW2)


Simple Linear Regression Model
• The linear regression model specifies the expected value of $Y$ when $X$ takes the specific value $x$:

$$E(Y \mid X = x) = E(\beta_0 + \beta_1 X + e \mid X = x) = \beta_0 + \beta_1 x + E(e) = \beta_0 + \beta_1 x,$$

where $\beta_0$ and $\beta_1$ are the population intercept and population slope, which are unknown.

• With the given (sample) data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we want to estimate $\beta_0$ and $\beta_1$.
Simple Linear Regression Model
• Let’s say $b_0$ and $b_1$ are (good) estimates of $\beta_0$ and $\beta_1$. Then we have, for observation $i$,
$$\hat{y}_i = b_0 + b_1 x_i.$$
This is called the ‘predicted value’ or the ‘fitted value’ of $y_i$ (the observed value).

• Residual: the difference between the observed and predicted values of $y$,
$$\hat{e}_i = y_i - \hat{y}_i.$$

• We wish to make these differences as small as possible. (Least squares estimates)
Least Squares Estimates
• How do we know which line has the smallest residuals overall?

• Quantify the residuals of the model:

• Option 1: Sum of residuals ($\sum \hat{e}_i$). Does not work: the sum of residuals is always zero. (HW2)

• Option 2: Sum of squared residuals ($\sum \hat{e}_i^2$).
  • The smaller the sum, the better the fit.
  • The line of best fit is the line for which the sum of the squared residuals is smallest.
Least Squares Estimates
• Least squares estimates choose the $b_0$ and $b_1$ that minimize the sum of squared residuals.

• Sum of squared residuals (or residual sum of squares, $RSS$):

$$RSS = \sum_{i=1}^{n} \hat{e}_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2$$
Least Squares Estimates
• To minimize the RSS with respect to $b_0$ and $b_1$, we require

$$\frac{\partial RSS}{\partial b_0} = -2\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i) = 0, \qquad \frac{\partial RSS}{\partial b_1} = -2\sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0.$$

• By rearranging the two equations, we get the two equations called the normal equations:

$$\sum_{i=1}^{n} y_i = n b_0 + b_1 \sum_{i=1}^{n} x_i, \qquad \sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2.$$
Least Squares Estimates
• Solving the normal equations for $b_0$ and $b_1$ gives the least squares estimates.

• Note: The full derivation is left as an HW2 question.
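Although the algebraic derivation is left for HW2, the resulting estimates are easy to check numerically. A minimal sketch, using R's built-in cars data purely as example numbers, computes the slope and intercept from the solutions of the normal equations ($b_1 = SXY/SXX$, $b_0 = \bar{y} - b_1\bar{x}$) and compares them with lm():

```r
# Minimal sketch: compute the least squares estimates directly and compare
# with lm(). Uses R's built-in 'cars' data purely as example numbers.
x <- cars$speed
y <- cars$dist

sxx <- sum((x - mean(x))^2)
sxy <- sum((x - mean(x)) * (y - mean(y)))

b1 <- sxy / sxx                 # slope estimate
b0 <- mean(y) - b1 * mean(x)    # intercept estimate

c(b0 = b0, b1 = b1)
coef(lm(y ~ x))                 # should match (up to rounding)
```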
Least Squares Estimates
• Example: Among possible regression lines, find the line that minimizes the sum of squared residuals.

[Scatterplot of Fathers' age (y-axis, 0 to 60) vs. Mothers' age (x-axis, 0 to 50) with several candidate regression lines]

• Which one looks the best?
Least Squares Estimates

[Scatterplot of Fathers' age vs. Mothers' age with the three candidate lines below]

Line Color | Linear Equation           | Sum of squared residuals
Black      | $\hat{y} = 11.54 + 0.68x$ | 38905.1
Green      | $\hat{y} = 1 + x$         | 54545
Red        | $\hat{y} = 15 + 0.6x$     | 47887.32
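The RSS column above can be reproduced with a short helper function like the one sketched below. The mothers'/fathers' age data are not included in these notes, so the vectors mothers_age and fathers_age are hypothetical placeholders.

```r
# Minimal sketch: RSS for a candidate line y-hat = b0 + b1*x.
rss <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

# With the (not included) mothers'/fathers' age data loaded as
# mothers_age (x) and fathers_age (y), the table's lines would be compared as:
# rss(11.54, 0.68, mothers_age, fathers_age)  # black: least squares line
# rss(1,     1,    mothers_age, fathers_age)  # green
# rss(15,    0.6,  mothers_age, fathers_age)  # red
```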
Statistical inference
• Statistical Inference is the process of drawing
conclusions about populations or scientific truths from
data.
• It includes point estimation, interval estimation, and hypothesis testing for the parameters.
Inference about 𝛽0 and 𝛽1
• 1. Point estimation for $\beta_0$ and $\beta_1$
  • Find $\hat{\beta}_0$ and $\hat{\beta}_1$ using the least squares method.
  • Find the sampling distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$.

• 2. Interval estimation for $\beta_0$ and $\beta_1$
  • Find confidence intervals for $\beta_0$ and $\beta_1$.

• 3. Hypothesis testing for $\beta_0$ and $\beta_1$
  • Test $H_0: \beta_1 = 0$.
  • Test $H_0: \beta_0 = \beta_0^*$.
Inference about the slope ($\beta_1$)
• Using the least squares method, we find

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
= \frac{\sum_{i=1}^{n}(x_i - \bar{x})\, y_i - \bar{y}\sum_{i=1}^{n}(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
= \frac{\sum_{i=1}^{n}(x_i - \bar{x})\, y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
= \sum_{i=1}^{n} c_i y_i,$$

where $c_i = \dfrac{x_i - \bar{x}}{SXX}$ and $SXX = \sum_{i=1}^{n}(x_i - \bar{x})^2$.
Inference about the slope
• (Sampling) distribution of $\hat{\beta}_1$:
$$\hat{\beta}_1 \mid X \sim N\!\left(\beta_1,\; \frac{\sigma^2}{SXX}\right)$$

• The distribution implies that
$$E(\hat{\beta}_1 \mid X) = \beta_1, \qquad Var(\hat{\beta}_1 \mid X) = \frac{\sigma^2}{SXX}$$

* See Section 2.7.1 for the detailed derivation.

** For a brief summary of the properties of the expectation and variance operations, see
https://ucla.box.com/s/6yi0nl8mqyhj5hbzzyvyhhe7t5xbx8yp
Inference about the slope
• Standardize to $Z$:

$$Z = \frac{\hat{\beta}_1 - \beta_1}{\sigma / \sqrt{SXX}} \sim N(0, 1)$$

• Since $\sigma$ is unknown, it is replaced with its estimator $S$. The test statistic then follows a $t$-distribution with $n - 2$ degrees of freedom:

$$T = \frac{\hat{\beta}_1 - \beta_1}{S / \sqrt{SXX}} = \frac{\hat{\beta}_1 - \beta_1}{se(\hat{\beta}_1)} \sim T_{n-2},$$

where $se(\hat{\beta}_1) = \dfrac{S}{\sqrt{SXX}}$ is the estimated standard error (se) of $\hat{\beta}_1$.
Inference about the slope
• Hypothesis testing for the significance of $\beta_1$: “The slope ($\beta_1$) is significant” means that the two variables have a statistically significant linear association.

• This is equivalent to testing the hypotheses

$$H_0: \beta_1 = 0 \quad vs. \quad H_A: \beta_1 \neq 0$$

• The test statistic under the null hypothesis is

$$T = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)} \sim T_{n-2}$$
Inference about the slope
• Interpretation of the result of the hypothesis test:

• If we reject the null hypothesis, we conclude that the slope is significant, and the two variables have a statistically significant linear association.
• If we fail to reject the null hypothesis, we conclude that the slope is not significant, and the two variables have no statistically significant linear association.
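In R, the t statistic and p-value for $H_0: \beta_1 = 0$ are reported in the coefficient table of summary(); a minimal sketch, using the built-in cars data as stand-in numbers:

```r
# Minimal sketch: test H0: beta1 = 0 using the t statistic reported by summary().
fit <- lm(dist ~ speed, data = cars)
summary(fit)$coefficients
# The row for 'speed' gives the estimate, its standard error,
# t value = estimate / se, and the two-sided p-value (df = n - 2).
```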
Inference about the slope
• Note: inference about population mean (𝜇) vs. inference
about the slope (𝛽1) in simple linear regression
Inference about the slope
• Confidence interval for $\beta_1$:

$$\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \cdot se(\hat{\beta}_1)$$
• Interpretation of the 95% confidence interval: We are 95% confident that the population slope (the true $\beta_1$) falls within the interval.

• The interval tells us the plausible values for the true $\beta_1$.
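In R, confint() returns this interval directly; the sketch below also rebuilds it from the formula, again using the built-in cars data as stand-in numbers.

```r
# Minimal sketch: 95% confidence interval for the slope, two ways.
fit <- lm(dist ~ speed, data = cars)

confint(fit, "speed", level = 0.95)

# By hand: b1 +/- t_{alpha/2, n-2} * se(b1)
est <- summary(fit)$coefficients["speed", ]
n   <- nrow(cars)
est["Estimate"] + c(-1, 1) * qt(0.975, df = n - 2) * est["Std. Error"]
```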
Inference about the intercept
• Using the least squares method, we find

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

• (Sampling) distribution of $\hat{\beta}_0$:

$$\hat{\beta}_0 \mid X \sim N\!\left(\beta_0,\; \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{SXX}\right)\right),$$

and $se(\hat{\beta}_0) = S\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{SXX}}$ is the estimated standard error of $\hat{\beta}_0$.
Inference about the intercept
• Test statistic for $H_0: \beta_0 = \beta_0^*$:

$$T = \frac{\hat{\beta}_0 - \beta_0^*}{se(\hat{\beta}_0)} \sim T_{n-2}$$

• Confidence interval for $\beta_0$:

$$\hat{\beta}_0 \pm t_{\alpha/2,\, n-2} \cdot se(\hat{\beta}_0)$$
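A minimal sketch of this intercept test, assuming a hypothesized value $\beta_0^* = 0$ (any other value could be substituted) and using the built-in cars data as stand-in numbers:

```r
# Minimal sketch: test H0: beta0 = beta0_star via T = (b0 - beta0_star) / se(b0).
fit        <- lm(dist ~ speed, data = cars)
beta0_star <- 0                      # hypothesized intercept (assumption)

b0    <- coef(summary(fit))["(Intercept)", "Estimate"]
se_b0 <- coef(summary(fit))["(Intercept)", "Std. Error"]
n     <- nrow(cars)

t_stat <- (b0 - beta0_star) / se_b0
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
c(t = t_stat, p = p_val)
```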

Inference about 𝑌 with given X = 𝑥
• Consider the two similar, but different, questions about the regression model:

$$\widehat{Weight} = -229.077 + 5.577 \cdot Height$$

• 1. What will be the average weight of the people who are 71


inches tall?

• 2. What will be the weight of a person who is 71 inches tall?



Inference about 𝑌 with given X = 𝑥
• For both questions, we answer in the same way:

$$\widehat{Weight} = -229.077 + 5.577 \cdot 71 = 166.89$$

• However, they are asking about mathematically different quantities:

• 1. The average weight of people who are 71 inches tall (a value on the regression line):
$$E(Y \mid X = 71)$$

• 2. The weight of an individual who is 71 inches tall:
$$Y^* = Y \mid (X = 71)$$

Inference about 𝑌 with given X = 𝑥
• Which question is harder to answer?

• Individuals have more variability than do means of


groups. Thus, the second question is more difficult to
answer.

Inference about 𝑌 with given X = 𝑥
• Parameters of interest:
• 1. Mean $Y$ (regression line) at $X = x^*$:
$$E(Y \mid X = x^*) = \beta_0 + \beta_1 x^*$$

• 2. Single $Y$ at $X = x^*$:
$$Y^* = Y \mid (X = x^*) = \beta_0 + \beta_1 x^* + e^*$$

• Point estimate (the same for both):
$$\hat{y}^* = \hat{y} \mid (X = x^*) = \hat{\beta}_0 + \hat{\beta}_1 x^*$$

Inference about 𝑌 with given X = 𝑥
• Interval estimation
• 1. Mean $Y$ (regression line) at $X = x^*$:
$$\hat{y}^* \pm t_{\alpha/2,\, n-2} \cdot S\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX}}$$
This is called the confidence interval for the mean response at a given $x^*$.

• 2. Single $Y$ at $X = x^*$:
$$\hat{y}^* \pm t_{\alpha/2,\, n-2} \cdot S\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX}}$$
This is called the prediction interval for a new response at a given $x^*$.
Confidence vs. Prediction interval
• Example: In 1966, Cyril Burt published a paper called “The genetic determination of differences in intelligence: A study of monozygotic twins reared apart.” The data consist of IQ scores for an assumed random sample of 27 identical twins, one raised by foster parents, the other by the biological parents. Note that the average IQ score of the 27 biological twins in the sample is 95.3 points, with a standard deviation of 15.74 points.
Confidence vs. Prediction interval
• Consider predicting the foster twin’s IQ ($Y^*$) when the biological twin’s IQ is 100.
Inference about Y with given X=x*
• R result for the linear model

• The predicted $\hat{y}^*$ for $x^* = 100$ is

$$\hat{y}^* = 9.2076 + 0.9014 \cdot x^* = 9.2076 + 0.9014 \cdot 100 = 99.35$$
Confidence vs. Prediction interval
• Find a 95% confidence interval for $E(Y|X)$ at $x^* = 100$.

$$\hat{y}^* \pm t_{\alpha/2,\, n-2} \cdot S\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX}}
= 99.35 \pm 2.06 \cdot 7.729\sqrt{\frac{1}{27} + \frac{(100 - 95.3)^2}{6441.438}}
= (96.15,\; 102.55)$$

• Hint: $SXX = (n-1)\, s_x^2$ and $t_{0.975,\, 25} = 2.06$.


Confidence vs. Prediction interval
• Find a 95% prediction interval for $Y^*$ at $x^* = 100$.

$$\hat{y}^* \pm t_{\alpha/2,\, n-2} \cdot S\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX}}
= 99.35 \pm 2.06 \cdot 7.729\sqrt{1 + \frac{1}{27} + \frac{(100 - 95.3)^2}{6441.438}}
= (83.11,\; 115.59)$$
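Both intervals can be reproduced from the summary numbers quoted above (n = 27, x̄ = 95.3, sx = 15.74, S = 7.729, ŷ* = 99.35); a minimal sketch:

```r
# Minimal sketch: reproduce the CI and PI from the quoted summary statistics.
n     <- 27
xbar  <- 95.3        # mean biological-twin IQ
sx    <- 15.74       # sd of biological-twin IQ
S     <- 7.729       # residual standard error from the fitted model
yhat  <- 99.35       # predicted foster-twin IQ at x* = 100
xstar <- 100
tval  <- qt(0.975, df = n - 2)   # about 2.06

SXX <- (n - 1) * sx^2

ci <- yhat + c(-1, 1) * tval * S * sqrt(1/n + (xstar - xbar)^2 / SXX)
pi <- yhat + c(-1, 1) * tval * S * sqrt(1 + 1/n + (xstar - xbar)^2 / SXX)
ci   # roughly (96.15, 102.55)
pi   # roughly (83.11, 115.59)
```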
Confidence vs. Prediction interval
• Confidence interval for $E(Y|X)$ at $x^* = 100$ vs. prediction interval for $Y^*$ at $x^* = 100$.
Confidence vs. Prediction interval
• Confidence vs. Prediction band
Confidence vs. Prediction interval
• A prediction interval is designed to cover a “moving target”, the random future value of $y$, while the confidence interval is designed to cover the “fixed target”, the expected value of $y$, $E(Y|X)$, for a given $x^*$.

• Although both are centered at $\hat{y}^*$, the prediction interval is wider than the confidence interval. The prediction interval must account for the tendency of $y$ to fluctuate from its mean value, while the confidence interval only needs to account for the uncertainty in estimating the mean value.

• The error in estimating $E(Y|X)$ and $Y^*$ grows as $x^*$ moves away from $\bar{x}$. The further $x^*$ is from $\bar{x}$, the wider the confidence and prediction intervals will be.
Confidence vs. Prediction interval
• (Exercise) Answer the two questions for the regression model

$$\widehat{Weight} = -229.077 + 5.577 \cdot Height$$

• 1. What will be the average weight of the people who are 71 inches tall? Construct the 95% confidence interval for the parameter of interest.

• 2. What will be the weight of a person who is 71 inches tall? Construct the 95% prediction interval for the parameter of interest.
Confidence vs. Prediction interval
• In R,
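The R output originally shown on this slide is not reproduced here. As a sketch of the usual pattern, predict() returns both interval types; the built-in cars data stand in for the twin data, which are not included.

```r
# Minimal sketch: confidence interval for the mean response and
# prediction interval for a new response at a chosen x*.
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 15)   # x* = 15, an arbitrary example value

predict(fit, newdata = new, interval = "confidence", level = 0.95)
predict(fit, newdata = new, interval = "prediction", level = 0.95)
# For the twin example, the analogous calls would use a model such as
# lm(Foster ~ Biological) (column names assumed) with
# newdata = data.frame(Biological = 100).
```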
Confidence vs. Prediction interval
• Visualizing CI and PI in R:
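The plot itself is not reproduced here; a minimal base-R sketch that overlays both bands on a scatterplot, again with the cars data standing in:

```r
# Minimal sketch: confidence band (narrow) and prediction band (wide).
fit  <- lm(dist ~ speed, data = cars)
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 100))

ci <- predict(fit, newdata = grid, interval = "confidence")
pi <- predict(fit, newdata = grid, interval = "prediction")

plot(dist ~ speed, data = cars)
abline(fit)
lines(grid$speed, ci[, "lwr"], lty = 2)   # confidence band
lines(grid$speed, ci[, "upr"], lty = 2)
lines(grid$speed, pi[, "lwr"], lty = 3)   # prediction band
lines(grid$speed, pi[, "upr"], lty = 3)
```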
Review: Testing for slope
• Recall: “The slope ($\beta_1$) is significant (non-zero)” implies that the two variables have a statistically significant linear association.

• It is equivalent to testing the hypotheses

$$H_0: \beta_1 = 0 \quad vs. \quad H_A: \beta_1 \neq 0,$$

which is also equivalent to testing the hypotheses

$$H_0: Y = \beta_0 + e \quad vs. \quad H_A: Y = \beta_0 + \beta_1 x + e$$
Testing between the two models
• Hypotheses in terms of the models:

$$H_0: Y = \beta_0 + e \quad vs. \quad H_A: Y = \beta_0 + \beta_1 x + e$$

• Two scenarios:
  • If $H_0$ is plausible, then $X$ and $Y$ do not have a linear association. Thus, the best estimate of $Y$ is $\bar{Y}$.
  • If $H_A$ is plausible, then $X$ is a significant explanatory variable, and the best estimate of $Y$ is $\hat{Y}$.
Variations in the linear model
• Total deviation = Unexplained deviation + Explained deviation

$$Y_i - \bar{Y} = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$$
Variations in the linear model
• In terms of the sums of squares of these terms:

$$\sum(Y_i - \bar{Y})^2 = \sum(Y_i - \hat{Y}_i)^2 + \sum(\hat{Y}_i - \bar{Y})^2$$
$$SST = RSS + SSreg$$

• Note: We will show this equation in HW2.

• Each term represents:
  • SST (same as SYY): total sum of squares, the total variation in $Y$.
  • RSS: residual sum of squares, the unexplained variation.
  • SSreg: regression sum of squares, the explained part of SST.
Variations in the linear model
• Again, $H_0$ supports $\bar{Y}$ as an estimate of $Y$, while $H_A$ supports $\hat{Y}$ as an estimate of $Y$.

• If the $H_A$ model is plausible, the RSS of the model is $\sum(Y_i - \hat{Y}_i)^2$.

• However, if the $H_0$ model is plausible, the RSS of the model will be $\sum(Y_i - \bar{Y})^2$, which is SST.

• Since SSreg = SST - RSS, a large SSreg implies that $H_A$ is more plausible, and a small SSreg implies that $H_0$ is more plausible. Thus, SSreg measures the improvement from adding the slope to the model.
Variations in the linear model
• Variations in the hypotheses
Variations in the linear model
• Two measures of the goodness of fit, i.e. the validity of the regression model:

• 1. $R^2 = \dfrac{SSreg}{SST}$: the proportion of the variation in $Y$ explained by the regression model.

  § $R^2 = \dfrac{SSreg}{SST} = \dfrac{SST - RSS}{SST} = 1$ if RSS = 0, which implies that the model fits the data perfectly.

  § $R^2 = r^2$ if we have one predictor ($X$), i.e. simple linear regression, where $r$ is the sample correlation between $X$ and $Y$.
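These identities are easy to verify numerically; a minimal sketch using the built-in cars data confirms that SST = RSS + SSreg and that $R^2$ equals both SSreg/SST and the squared correlation:

```r
# Minimal sketch: decompose the variation in Y and compute R^2.
fit  <- lm(dist ~ speed, data = cars)
y    <- cars$dist
yhat <- fitted(fit)

SST   <- sum((y - mean(y))^2)      # total sum of squares (SYY)
RSS   <- sum((y - yhat)^2)         # residual sum of squares
SSreg <- sum((yhat - mean(y))^2)   # regression sum of squares

c(SST = SST, RSS_plus_SSreg = RSS + SSreg)   # equal
c(R2 = SSreg / SST,
  r2 = cor(cars$speed, cars$dist)^2,
  from_summary = summary(fit)$r.squared)     # all equal
```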
Variations in the linear model
2. Test statistic for $H_0: Y = \beta_0 + e$ vs. $H_A: Y = \beta_0 + \beta_1 x + e$:

$$F = \frac{SSreg / df_1}{RSS / df_2} \sim F_{df_1,\, df_2}$$

• $df_1$ is the degree of freedom associated with SSreg, which equals 1.

• $df_2$ is the degree of freedom associated with RSS, which equals $n - 2$.

• If SSreg is relatively large, it implies that the regression model is valid. Then the F-statistic will be large, so we are more likely to reject the null hypothesis.
F distribution
• It is positively skewed, and the skewness depends on $df_1$ and $df_2$.
• Values of F must always be positive or zero, since it is
the ratio of two variances.
F distribution
• Again: If SSreg is relatively large, it implies that the regression model is valid. Then the F-statistic will be large, so we are more likely to reject the null hypothesis.
• In terms of the direction of the hypothesis test, it is a one-sided test in the right tail of the F distribution.
• Thus, the p-value of the F-statistic is always
$$P(F_{df_1,\, df_2} > F)$$

• Note: The F statistic for the ANOVA is the same as the squared T statistic for $H_0: \beta_1 = 0$:

$$\left(\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}\right)^2 = \frac{SSreg}{RSS/(n-2)}$$
F distribution
• Proof:
$$\begin{aligned}
F &= \frac{SSreg}{RSS/(n-2)} = \frac{\sum(\hat{Y}_i - \bar{Y})^2}{\sum(Y_i - \hat{Y}_i)^2/(n-2)} = \frac{\sum(\hat{Y}_i - \bar{Y})^2}{S^2} \\
  &= \frac{\sum(\hat{\beta}_0 + \hat{\beta}_1 X_i - \bar{Y})^2}{S^2} = \frac{\sum(\bar{Y} - \hat{\beta}_1 \bar{X} + \hat{\beta}_1 X_i - \bar{Y})^2}{S^2} \\
  &= \frac{\hat{\beta}_1^2 \sum(X_i - \bar{X})^2}{S^2} = \frac{\hat{\beta}_1^2\, SXX}{S^2} \\
  &= \frac{\hat{\beta}_1^2}{S^2/SXX} = \frac{\hat{\beta}_1^2}{se(\hat{\beta}_1)^2} = T^2
\end{aligned}$$
ANOVA table
• Analysis of Variance Table: summarizes the variations in the linear model.

Source of Variation | Degrees of Freedom | Sum of Squares | Mean Squares  | F                           | p-value
Regression          | (1) 1              | (4) SSreg      | (7) SSreg/1   | (9) (SSreg/1) / (RSS/(n-2)) | (10)
Residual            | (2) n-2            | (5) RSS        | (8) RSS/(n-2) |                             |
Total               | (3) n-1            | (6) SST        |               |                             |

• Degrees of freedom (1)-(3) are associated with (4)-(6), respectively.
• Mean squares are sums of squares divided by the corresponding degrees of freedom: (7) = (4)/(1) and (8) = (5)/(2).
• F statistic: (9) = (7)/(8).
• p-value: (10) = P(F(1),(2) > F(9)).
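In R, anova() prints this table for a fitted simple linear regression; the sketch below (built-in cars data) also checks numerically that F equals the squared t statistic for the slope:

```r
# Minimal sketch: ANOVA table for the simple linear regression and F = T^2.
fit <- lm(dist ~ speed, data = cars)
anova(fit)                               # Df, Sum Sq, Mean Sq, F value, Pr(>F)

Fstat <- anova(fit)["speed", "F value"]
tstat <- summary(fit)$coefficients["speed", "t value"]
c(F = Fstat, t_squared = tstat^2)        # identical
```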
Recap with a special case
• Special case: X is a dummy variable, i.e. a categorical variable with two categories.

• Example: "changeover_times.txt” data – we consider a


large food processing center that needs to be able to
switch from one type of package to another quickly to
react to changes in order patterns. Consultants have
developed a new method for changing the production
line and used it to produce a sample of 48 change-over-
times, as well as 72 change-over times for the existing
method.

• Research question: Is the method(X: existing vs. new)


significantly related to the change-over times(Y)?
Recap with a special case
• Note that 𝑋 is a categorical variable with 0 (existing
method) and 1 (new method)

• Data structure:
Recap with a special case
• Scatter plot and Boxplot for Y vs. X

• Question: Do they seem to be related?


Recap with a special case
• Approach 1: Simple linear regression
  • To answer the question, we test for the slope:
$$H_0: \beta_1 = 0 \quad vs. \quad H_A: \beta_1 \neq 0$$

  • Also, we may refer to the ANOVA test for the linear model:
$$H_0: Y = \beta_0 + e \quad vs. \quad H_A: Y = \beta_0 + \beta_1 x + e$$

• Approach 2: Two-sample t-test
$$H_0: \mu_{new} = \mu_{old} \quad vs. \quad H_A: \mu_{new} \neq \mu_{old}$$
Recap with a special case
• The estimated slope from the linear equation is the same as the mean difference in $Y$ between the new and existing methods, i.e.
$$\hat{\beta}_1 = \bar{Y}_{new} - \bar{Y}_{old}$$

• Recall that the interpretation of the slope is “the average change in $Y$ when $X$ increases by 1 unit.”
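A minimal sketch of both approaches on simulated data standing in for changeover_times.txt (the real data are not included, and the group means and effect size below are made up): the fitted slope equals the difference in sample means, and the regression t-test matches the pooled two-sample t-test.

```r
# Minimal sketch with simulated data standing in for changeover_times.txt:
# X = 0 (existing method) or 1 (new method), Y = change-over time in minutes.
set.seed(1)
method <- c(rep(0, 72), rep(1, 48))
time   <- 18 - 2 * method + rnorm(120, sd = 4)   # hypothetical effect of -2 minutes

fit <- lm(time ~ method)
coef(fit)["method"]                                  # estimated slope b1
mean(time[method == 1]) - mean(time[method == 0])    # same number

# Approach 2: two-sample t-test (equal variances, matching the regression test)
t.test(time[method == 1], time[method == 0], var.equal = TRUE)
summary(fit)$coefficients["method", ]                # same t statistic and p-value
```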
