Simple Regression Analysis
Simple Linear Regression

[Figures: scatterplots of Rating (vertical axis, 20–80) versus Sugar (horizontal axis, 0–15)]
The Simple Linear Regression Model

The simplest mathematical relationship between two variables x and y is a linear relationship:

y = β₀ + β₁x.

The objective of this section is to develop an equivalent linear probabilistic model.
The Simple Linear Regression Model

Definition (The Simple Linear Regression Model): There are parameters β₀, β₁, and σ² such that for any fixed value of the independent variable x, the dependent variable Y is a random variable related to x through the model equation

Y = β₀ + β₁x + ε

The quantity ε in the model equation is the "error" term, a random variable assumed to be distributed as

ε ~ N(0, σ²)
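To make the model concrete, here is a minimal R sketch that simulates observations from Y = β₀ + β₁x + ε; the parameter values β₀ = 10, β₁ = 4, and σ = 2 are illustrative assumptions, not values from these slides:

# Simulate n observations from Y = beta0 + beta1*x + epsilon,
# with epsilon ~ N(0, sigma^2). All parameter values are illustrative.
set.seed(1)
n     <- 50
beta0 <- 10   # assumed intercept
beta1 <- 4    # assumed slope
sigma <- 2    # assumed error standard deviation

x   <- runif(n, 0, 15)                 # fixed predictor values
eps <- rnorm(n, mean = 0, sd = sigma)  # random deviations
y   <- beta0 + beta1 * x + eps         # responses scatter about the true line

plot(x, y)
abline(a = beta0, b = beta1)           # the true regression line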
The Simple Linear Regression Model

x: the independent, predictor, or explanatory variable (usually known). NOT RANDOM.

ε: the random deviation or random error term. For fixed x, ε IS RANDOM.
The Simple Linear Regression Model

The points (x₁, y₁), …, (xₙ, yₙ) resulting from n independent observations will then be scattered about the true regression line:

[Figure: observations (x₁, y₁), …, (xₙ, yₙ) scattered about the true regression line]
The Simple Linear Regression Model

How do we know simple linear regression is appropriate?
- Theoretical considerations
- Scatterplots
The Simple Linear Regression Model

Interpreting parameters: β₁, the slope of the true regression line, is the expected change in Y associated with a one-unit increase in x; β₀, the intercept, is the expected value of Y when x = 0.
The Error Term

[Figure: distribution of ε]

The variance parameter σ² determines the extent to which each normal curve spreads out about the regression line.
The Error Term

Homoscedasticity: εᵢ ~ N(0, σ²)

[Figure: (a) distribution of ε; (b) distribution of Y for different values of x]

We assume the variance (amount of variability) of the distribution of Y values to be the same at each different value of fixed x (i.e., the homogeneity of variance assumption).

The variance parameter σ² determines the extent to which each normal curve spreads out about the regression line.
Estimating Model Parameters

The variance of our model, σ², will be smallest when the differences between the estimated line and each observed point are smallest. This is our goal: minimize σ².
Estimating Model Parameters

The "best fit" line is motivated by the principle of least squares, which can be traced back to the German mathematician Gauss (1777–1855).
Estimating Model Parameters

The sum of squared vertical deviations from the points (x₁, y₁), …, (xₙ, yₙ) to the line y = b₀ + b₁x is

Q = f(b₀, b₁) = Σ [yᵢ − (b₀ + b₁xᵢ)]²

The point estimates of β₀ and β₁, denoted by β̂₀ and β̂₁, are called the least squares estimates: they are the values that minimize f(b₀, b₁).
Estimating Model Parameters

The fitted regression line or least squares line is then the line whose equation is

ŷ = β̂₀ + β̂₁x

The minimizing values of b₀ and b₁ are found by taking partial derivatives of Q with respect to both b₀ and b₁, equating both to zero [this is regular old Calculus!], and solving the resulting two equations in two unknowns.
Estimating Model Parameters

There are shortcut notations we use to express the estimates of the parameters:

Sxx = Σ(xᵢ − x̄)² = Σxᵢ² − (Σxᵢ)²/n

Sxy = Σ(xᵢ − x̄)(yᵢ − ȳ) = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n

In this notation, the least squares estimates are

β̂₁ = Sxy/Sxx and β̂₀ = ȳ − β̂₁x̄

Your book doesn't use this notation, but we're going to because it's very convenient and commonly used.
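As a sketch of how the shortcut formulas turn into computation, here is R code with made-up x and y vectors (lm() is used only as a cross-check):

# Least squares estimates via the shortcut formulas (illustrative data)
x <- c(1, 3, 4, 6, 8)          # made-up predictor values
y <- c(2, 5, 7, 11, 15)        # made-up responses
n <- length(x)

Sxx <- sum(x^2) - sum(x)^2 / n            # Sxx = sum(xi^2) - (sum xi)^2 / n
Sxy <- sum(x * y) - sum(x) * sum(y) / n   # Sxy = sum(xi yi) - (sum xi)(sum yi)/n

b1 <- Sxy / Sxx                # slope estimate
b0 <- mean(y) - b1 * mean(x)   # intercept estimate

coef(lm(y ~ x))                # agrees with R's built-in fit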
Example
The cetane number is a critical property in specifying the
ignition quality of a fuel used in a diesel engine.
Example
The iodine value (x) is the amount of iodine necessary to
saturate a sample of 100 g of oil. The article’s authors fit the
simple linear regression model to this data, so let’s do the same.
Example
Scatter plot with the least squares line superimposed.
Fitted Values

Fitted values: the fitted (or predicted) values ŷ₁, …, ŷₙ are obtained by substituting x₁, …, xₙ into the equation of the estimated regression line:

ŷᵢ = β̂₀ + β̂₁xᵢ

Residuals: the differences e₁ = y₁ − ŷ₁, …, eₙ = yₙ − ŷₙ between the observed and fitted y values.
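In R, fitted values and residuals come straight from a fitted lm object; a minimal sketch with made-up data:

# Fitted values and residuals from a fitted model (illustrative data)
x   <- c(1, 3, 4, 6, 8)
y   <- c(2, 5, 7, 11, 15)
fit <- lm(y ~ x)

fitted(fit)   # yhat_i = b0hat + b1hat * x_i
resid(fit)    # e_i = y_i - yhat_i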
Example
We now calculate the first three fitted values and residuals
from the cetane example.
Estimating σ² and σ

The parameter σ² determines the amount of spread about the true regression line. Two separate examples:

[Figures: two regression settings with small and large spread about the true line]
Estimating σ² and σ

An estimate of σ² will be used in confidence interval (CI) formulas and hypothesis-testing procedures presented in the next two sections.
Estimating σ² and σ

The error sum of squares (equivalently, residual sum of squares), denoted by SSE, is

SSE = Σ(yᵢ − ŷᵢ)² = Σeᵢ²

and the estimate of σ² is

s² = SSE/(n − 2)
Estimating σ² and σ

The divisor is n − 2 because, to obtain s², the two parameters β₀ and β₁ must first be estimated, which results in a loss of 2 df (just as µ had to be estimated in one-sample problems, resulting in an estimated variance based on n − 1 df in our previous t tests).

Replacing each yᵢ in the formula for s² by the r.v. Yᵢ gives the estimator S². It can be shown that the r.v. S² is an unbiased estimator for σ².
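A minimal R sketch of this estimate, again with made-up data:

# SSE and the estimate of sigma^2 (illustrative data)
x   <- c(1, 3, 4, 6, 8)
y   <- c(2, 5, 7, 11, 15)
fit <- lm(y ~ x)
n   <- length(y)

SSE <- sum(resid(fit)^2)   # sum of squared residuals
s2  <- SSE / (n - 2)       # two df lost estimating beta0 and beta1
s   <- sqrt(s2)

summary(fit)$sigma         # matches the "Residual standard error" above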
Example
Suppose we have the following data on filtration rate (x)
versus moisture content (y):
Example

The corresponding error sum of squares (SSE) is

SSE = (–.200)² + (–.188)² + ··· + (1.099)² = 7.968

The estimate of σ² is then s² = 7.968/(20 – 2) = .4427, and the estimated standard deviation is s = √.4427 ≈ .665.
Estimating σ² and σ

The last method for computing the estimate of σ² was terribly complex and would realistically produce calculation errors if done by hand.
The Coefficient of Determination

Different variability in observed y values:

[Figures: three scatterplots with fitted lines, showing increasing unexplained variation]

In the first plot SSE = 0 and there is no unexplained variation; unexplained variation is small for the second plot and large for the third.
The Coefficient of Determination

The total sum of squares, SST = Σ(yᵢ − ȳ)², is the sum of squared deviations about the sample mean of the observed y values, when no predictors are taken into account. Thus the same number ȳ is subtracted from each yᵢ in SST, whereas SSE involves subtracting each different predicted value ŷᵢ from the corresponding observed yᵢ.

SST is in some sense as bad as SSE can get: if there is no regression model (i.e., the slope is 0), then

β̂₀ = ȳ − β̂₁x̄ ⇒ ŷ = β̂₀ + β̂₁x = β̂₀ = ȳ (since β̂₁ = 0)

The ratio SSE/SST is the proportion of total variation that cannot be explained by the simple linear regression model, and

r² = 1 – SSE/SST

is the coefficient of determination. The higher the value of r², the more successful the simple linear regression model is in explaining y variation.
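A small R sketch of the SSE/SST decomposition, with made-up data; the hand computation matches summary(fit)$r.squared:

# Coefficient of determination from SSE and SST (illustrative data)
x   <- c(1, 3, 4, 6, 8)
y   <- c(2, 5, 7, 11, 15)
fit <- lm(y ~ x)

SSE <- sum(resid(fit)^2)       # unexplained variation
SST <- sum((y - mean(y))^2)    # total variation about ybar
r2  <- 1 - SSE / SST

r2
summary(fit)$r.squared         # same value from the built-in fit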
Example

Let's calculate the coefficient of determination r² for the two previous examples we've done.
Iodine Example
The iodine value (x) is the amount of iodine necessary to
saturate a sample of 100 g of oil.
Filtration Example
Filtration rate (x) versus moisture content (y):
Inferences About the Slope Parameter β₁

In virtually all of our inferential work thus far, the notion of sampling variability has been pervasive. The same idea applies here: the value of any quantity calculated from sample data (which is random) will vary from one sample to another.
Inferences About the Slope Parameter β₁

The estimators are obtained by replacing the observed yᵢ with the random variables Yᵢ in the least squares formulas:

β̂₁ = Σ(xᵢ − x̄)(Yᵢ − Ȳ)/Sxx

and

β̂₀ = Ȳ − β̂₁x̄
Inferences About the Slope Parameter β₁

Invoking properties of a linear function of random variables, as discussed earlier, leads to the following results:

1. E(β̂₁) = β₁, so β̂₁ is an unbiased estimator of β₁.

2. The variance and standard deviation of β̂₁ are

V(β̂₁) = σ²/Sxx and σ_β̂₁ = σ/√Sxx

Substituting s for σ gives the estimated standard deviation s_β̂₁ = s/√Sxx.
A Confidence Interval for β₁

As in the derivation of previous CIs, we begin with a probability statement: the standardized variable

T = (β̂₁ − β₁)/S_β̂₁

has a t distribution with n − 2 df. A 100(1 − α)% CI for the slope β₁ of the true regression line is

β̂₁ ± t_{α/2, n−2} · s_β̂₁
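In R, confint() on a fitted lm object produces this interval; a minimal sketch with made-up data:

# 95% CI for the slope (illustrative data)
x   <- c(1, 3, 4, 6, 8)
y   <- c(2, 5, 7, 11, 15)
fit <- lm(y ~ x)

confint(fit, "x", level = 0.95)   # b1hat +/- t * se(b1hat)

# The same interval computed by hand:
b1 <- coef(fit)["x"]
se <- summary(fit)$coefficients["x", "Std. Error"]
n  <- length(y)
b1 + c(-1, 1) * qt(0.975, df = n - 2) * se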
Example
Variations in clay brick masonry weight have implications not
only for structural and acoustical design but also for design of
heating, ventilating, and air conditioning systems.
Example cont'd

The scatter plot of the data in Figure 12.14 certainly suggests the appropriateness of the simple linear regression model; there appears to be a substantial negative linear relationship between air content and density, one in which density tends to decrease as air content increases.
Hypothesis-Testing Procedures

The most commonly encountered pair of hypotheses about β₁ is H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0. When this null hypothesis is true, µ_Y·x = β₀ (independent of x); then knowledge of x gives no information about the value of the dependent variable.
Hypothesis-Testing Procedures
Alternative Hypothesis Alternative Hypothesis
If H0: b1 = 0, then the test statistic is the t ratio t = .
54
Regression in R
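The code from this slide isn't preserved in the extracted text, so here is a minimal sketch of fitting and summarizing a simple linear regression in R, using made-up data:

# Fit a simple linear regression and inspect the output (illustrative data)
x   <- c(1, 3, 4, 6, 8)
y   <- c(2, 5, 7, 11, 15)
fit <- lm(y ~ x)

coef(fit)      # the least squares estimates b0hat and b1hat
summary(fit)   # standard errors, t ratios, p-values, s, and r^2

The row for x in the coefficient table of summary(fit) reports exactly the t ratio t = β̂₁/s_β̂₁ and its two-sided p-value for H₀: β₁ = 0.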
Residuals and Standardized Residuals

The standardized residuals are given by

eᵢ* = eᵢ / [s·√(1 − 1/n − (xᵢ − x̄)²/Sxx)]

i.e., each residual divided by its estimated standard deviation.
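R computes standardized residuals directly; a quick sketch, again with made-up data:

# Standardized residuals (illustrative data)
x   <- c(1, 3, 4, 6, 8)
y   <- c(2, 5, 7, 11, 15)
fit <- lm(y ~ x)

rstandard(fit)   # residuals divided by their estimated standard deviations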
Diagnostic Plots

The basic plots that many statisticians recommend for an assessment of model validity and usefulness are the following:

1. eᵢ* (or eᵢ) on the vertical axis versus xᵢ on the horizontal axis
2. eᵢ* (or eᵢ) on the vertical axis versus ŷᵢ on the horizontal axis
3. ŷᵢ on the vertical axis versus yᵢ on the horizontal axis
Diagnostic Plots

Plots 1 and 2 are called residual plots (against the independent variable and fitted values, respectively), whereas Plot 3 is fitted against observed values.

Provided that the model is correct, neither residual plot should exhibit distinct patterns.

If Plot 3 yields points close to the 45-degree line [slope +1 through (0, 0)], then the estimated regression function gives accurate predictions of the values actually observed. The R sketch below produces all three plots.
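A minimal sketch of the three recommended plots in R, again with made-up data (a real analysis would use the study's actual observations):

# The three diagnostic plots described above (illustrative data)
x   <- c(1, 3, 4, 6, 8, 10, 12, 14)
y   <- c(2, 5, 7, 11, 15, 18, 24, 27)
fit <- lm(y ~ x)

par(mfrow = c(1, 3))

# Plot 1: standardized residuals versus x
plot(x, rstandard(fit), ylab = "standardized residual")
abline(h = 0)

# Plot 2: standardized residuals versus fitted values
plot(fitted(fit), rstandard(fit), xlab = "fitted value",
     ylab = "standardized residual")
abline(h = 0)

# Plot 3: fitted versus observed, with the 45-degree line
plot(y, fitted(fit), xlab = "observed y", ylab = "fitted y")
abline(a = 0, b = 1)   # slope +1 through (0, 0)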
Example (Plot Type #2 and #3)

[Figures: residual plot against fitted values, and fitted versus observed values, for the example data]
Heteroscedasticity

The residual plot below suggests that, although a straight-line relationship may be reasonable, the assumption that V(Yᵢ) = σ² for each i is of doubtful validity.

[Figure: residual plot showing nonconstant spread]