Inference For Regression
The regression standard error is:

$$s_e = \sqrt{\frac{\sum \text{residual}^2}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$$
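As a minimal sketch, the regression standard error can be computed directly from the residuals of a least-squares fit (the data here are made up for illustration):

```python
import numpy as np

# Hypothetical sample: x and y roughly linear, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit; polyfit returns coefficients highest degree first.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Divide by n - 2: two parameters (slope, intercept) were estimated.
n = len(y)
s_e = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))
print(s_e)
```

The divisor n − 2 (rather than n − 1) reflects the two degrees of freedom lost to estimating the slope and intercept.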
Conditions for inference
The observations are independent.
The relationship is indeed linear.
The standard deviation of y, $\sigma$, is the same for all values of x.
The response y varies normally around its mean.
Using residual plots to check for regression validity
The residuals $(y - \hat{y})$ give useful information about the contribution of
individual data points to the overall pattern of scatter.
We view the residuals in a residual plot:
If residuals are scattered randomly around 0 with uniform variation, it indicates that the data fit a linear model, have normally distributed residuals for each value of x, and a constant standard deviation $\sigma$.
Residuals randomly scattered around 0: good!
Curved pattern: the relationship is not linear.
Change in variability across the plot: $\sigma$ is not equal for all values of x.
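These visual checks can also be probed numerically. Below is a sketch, on simulated data that satisfies the linear-model conditions, of two crude diagnostics: least-squares residuals always average to (numerically) zero, and under constant $\sigma$ the residual spread should be similar across the range of x:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data meeting the conditions: linear trend, constant sigma.
x = rng.uniform(0, 10, 200)
y = 1.5 + 2.0 * x + rng.normal(0, 1.0, 200)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Compare residual SD in the lower and upper halves of the x range;
# a large imbalance would suggest non-constant variability.
lo = residuals[x < 5].std()
hi = residuals[x >= 5].std()
print(residuals.mean(), lo, hi)
```

In practice one would look at the residual plot itself; these summaries only supplement it.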
What is the relationship between the average speed a car is driven and its fuel efficiency?

We plot fuel efficiency (in miles per gallon, MPG) against average speed (in miles per hour, MPH) for a random sample of 60 cars. The relationship is curved.

When speed is log-transformed (log of miles per hour, LOGMPH), the new scatterplot shows a positive, linear relationship.
Residual plot: the spread of the residuals is reasonably random, with no clear pattern, so the relationship is indeed linear. But we see one low residual (3.8, −4) and one potentially influential point (2.5, 0.5).
Normal quantile plot of the residuals: the plot is fairly straight, supporting the assumption of normally distributed residuals.

The data are okay for inference.
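The transform-then-fit step can be sketched as follows. The 60-car data set is not reproduced in the text, so synthetic data stand in for it here:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the car sample: MPG rises roughly with log(speed).
speed = rng.uniform(10, 70, 60)                     # MPH
mpg = 5 + 12 * np.log10(speed) + rng.normal(0, 1, 60)

# Fit MPG against LOGMPH = log10(speed) instead of raw speed.
logmph = np.log10(speed)
b1, b0 = np.polyfit(logmph, mpg, 1)

residuals = mpg - (b0 + b1 * logmph)
print(b1, residuals.std())
```

After the transform, the residual plot (not drawn here) would be checked for random scatter, as in the example above.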
Confidence interval for regression parameters
Estimating the regression parameters $\beta_0$, $\beta_1$ is a case of one-sample inference with unknown population variance, so we rely on the t distribution, with n − 2 degrees of freedom.

A level C confidence interval for the slope, $\beta_1$, is centered on $b_1$ with width proportional to the standard error of the least-squares slope:

$$b_1 \pm t^* \, SE(b_1)$$

A level C confidence interval for the intercept, $\beta_0$, is centered on $b_0$ with width proportional to the standard error of the least-squares intercept:

$$b_0 \pm t^* \, SE(b_0)$$

$t^*$ is the critical value for the $t(n-2)$ distribution with area C between $-t^*$ and $+t^*$.
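A minimal sketch of the slope confidence interval, using made-up data and `scipy.stats.t` for the critical value:

```python
import numpy as np
from scipy import stats

# Illustrative data (not the MPG example from the text).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.3, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
s_e = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))
se_b1 = s_e / np.sqrt(np.sum((x - x.mean()) ** 2))

# For C = 0.95, t* leaves area 0.025 in each tail of t(n - 2).
t_star = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
print(ci)
```

The same recipe gives the intercept interval, swapping in $b_0$ and $SE(b_0)$.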
The standard error of the regression coefficients
To estimate the parameters of the regression, we calculate the standard
errors for the estimated regression coefficients.
The standard error of the least-squares slope $b_1$ is:

$$SE(b_1) = \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{s_e}{s_x \sqrt{n-1}}$$

The standard error of the intercept $b_0$ is:

$$SE(b_0) = s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$$
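A short sketch, on illustrative data, of both standard errors, including a check that the two forms of $SE(b_1)$ agree (since $\sqrt{\sum (x_i - \bar{x})^2} = s_x\sqrt{n-1}$):

```python
import numpy as np

# Illustrative data, not from the text's example.
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 11.0])
y = np.array([3.1, 5.2, 6.0, 8.3, 9.9, 12.2])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
s_e = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

sxx = np.sum((x - x.mean()) ** 2)
se_b1 = s_e / np.sqrt(sxx)
se_b0 = s_e * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)

# Equivalent form: sqrt(sum (x_i - xbar)^2) = s_x * sqrt(n - 1).
s_x = x.std(ddof=1)
se_b1_alt = s_e / (s_x * np.sqrt(n - 1))
print(se_b0, se_b1, se_b1_alt)
```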
Significance test for the slope
We can test the hypothesis $H_0\colon \beta_1 = 0$ versus a one- or two-sided alternative.

We calculate $t = b_1 / SE(b_1)$, which has the $t(n-2)$ distribution, to find the p-value of the test.

Note: software typically provides two-sided p-values.
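A sketch of the test on made-up data, computing the two-sided p-value that software would report:

```python
import numpy as np
from scipy import stats

# Illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 2.8, 4.3, 5.1, 5.8])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
s_e = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
se_b1 = s_e / np.sqrt(np.sum((x - x.mean()) ** 2))

t_stat = b1 / se_b1
# Two-sided p-value from t(n - 2): twice the upper-tail area beyond |t|.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(t_stat, p_value)
```

For a one-sided alternative, the p-value would be half of this (in the direction of the alternative).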
Testing the hypothesis of no relationship
We may look for evidence of a significant relationship
between variables x and y in the population from which our
data were drawn.
For that, we can test the hypothesis that the regression slope
parameter is equal to zero.
$H_0\colon \beta_1 = 0$ vs. $H_a\colon \beta_1 \neq 0$

Testing $H_0\colon \beta_1 = 0$ also allows us to test the hypothesis of no correlation between x and y in the population.
This is because the slope estimate is proportional to the correlation:

$$b_1 = r \, \frac{s_y}{s_x}$$

The standard error of the mean response $\hat{\mu}_y$ at a given value $x^*$ is:

$$SE(\hat{\mu}) = \sqrt{SE^2(b_1)\,(x^* - \bar{x})^2 + \frac{s_e^2}{n}} = s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

The standard error for predicting an individual response $\hat{y}$ at $x^*$ is:

$$SE(\hat{y}) = \sqrt{SE^2(\hat{\mu}) + s_e^2} = s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
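A sketch of both standard errors at a hypothetical new value $x^*$, on illustrative data; note that the prediction error always exceeds the mean-response error, since $SE^2(\hat{y}) = SE^2(\hat{\mu}) + s_e^2$:

```python
import numpy as np

# Illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.0, 2.7, 4.1, 4.6, 6.2, 6.8, 8.1])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
s_e = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
sxx = np.sum((x - x.mean()) ** 2)

x_star = 4.5  # hypothetical new value of x
se_mu = s_e * np.sqrt(1.0 / n + (x_star - x.mean()) ** 2 / sxx)
se_pred = s_e * np.sqrt(1.0 + 1.0 / n + (x_star - x.mean()) ** 2 / sxx)
print(se_mu, se_pred)
```

Multiplying each standard error by $t^*$ from $t(n-2)$ gives, respectively, the confidence interval for the mean response and the (wider) prediction interval for an individual response at $x^*$.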