Regression After Midterm 5
LEARNING GOALS
Explain the simple linear regression model.
Obtain and interpret the simple linear regression equation for a set of data.
Describe R² as a measure of the explanatory power of the regression model.
Use a regression equation for prediction.
Copyright 2009 Pearson Education, Inc.
Definition
A correlation exists between two variables when higher values of one variable consistently go with higher values of the other, or when higher values of one variable consistently go with lower values of the other.
Scatter Diagrams
Definition
A scatter diagram (or scatterplot) is a graph in
which each point represents the values of two
variables.
Types of Correlation
(Note: detailed descriptions of these graphs appear in the next few slides.)
Types of Correlation

Positive correlation: Both variables tend to increase (or decrease) together.

Negative correlation: The two variables tend to change in opposite directions, with one increasing while the other decreases.

No correlation: There is no apparent (linear) relationship between the two variables.

Nonlinear relationship: The two variables are related, but the relationship results in a scatter diagram that does not follow a straight-line pattern.
Beware of Outliers
If you calculate the correlation coefficient for these data, you'll find that it is a relatively high r = 0.880, suggesting a very strong correlation (Figure 7.10).

However, if you cover the data point in the upper right corner of Figure 7.10, the apparent correlation disappears. In fact, without this data point, the correlation coefficient is r = 0.
Solution (cont.):

We might therefore suspect that these two women either recorded their data incorrectly or were not following their usual habits during the two-week study. If we can confirm this suspicion, then we would have reason to delete the two data points as invalid.

Figure 7.12 (the data from Figure 7.11 without the two outliers) shows that the correlation is quite strong without those two outlier points, and suggests that the number of calories consumed rises by a little more than 500 calories for each hour of cycling.

Of course, we should not remove the outliers without confirming our suspicion that they were invalid data points, and we should report our reasons for leaving them out.
Figure 7.14 These scatter diagrams show the same data as Figure 7.13,
separated into the two groups identified in Table 7.4.
Figure 7.15 Scatter diagram for the car weight and price data.
Definition
The best-fit line (or regression line) on a scatter
diagram is a line that lies closer to the data points
than any other possible line (according to a
standard statistical measure of closeness).
The least-squares estimates of the regression line are:

ŷ = b0 + b1·x

b1 = Cov(x, y) / s²x

b0 = ȳ - b1·x̄

Copyright 2010 Pearson Education, Inc. Publishing as Prentice Hall
Introduction to Regression Analysis
The population regression model:

Yi = β0 + β1·Xi + εi

where β0 is the population intercept, β1 is the population slope coefficient, Xi is the independent variable, and εi is the random error term. β0 + β1·Xi is the linear component; εi is the random error component.
Yi = β0 + β1·Xi + εi

(Diagram: for a given Xi, the observed value of Y lies a vertical distance εi, the random error for that Xi value, above or below the predicted value on the population line, which has slope β1 and intercept β0.)
The estimated regression equation is

ŷi = b0 + b1·xi

where b0 is the estimate of the regression intercept, b1 is the estimate of the regression slope, and xi is the value of x for observation i.

The residual for observation i is

ei = (yi - ŷi) = yi - (b0 + b1·xi)
The least-squares slope can be written equivalently as

b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = Cov(x, y) / s²x = rxy · (sy / sx)

(sums over i = 1, …, n), and the intercept is

b0 = ȳ - b1·x̄
Model error assumptions: E[εi] = 0 and E[εi²] = σ², for i = 1, …, n.
Interpretation of the Slope and the Intercept
Example: a sample of n = 10 houses, with house price as the dependent variable (Y) and square feet as the independent variable (X):

House Price (Y)   Square Feet (X)
245               1400
312               1600
279               1700
308               1875
199               1100
219               1550
405               2350
324               2450
319               1425
255               1700
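As a quick check on the least-squares formulas, the coefficients for this data set can be computed directly. The following is a minimal Python sketch (variable names are my own, not from the slides):

```python
# Least-squares fit of house price (Y) on square feet (X), using
# b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2) and b0 = ybar - b1*xbar.
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)

b1 = sxy / sxx           # slope estimate
b0 = ybar - b1 * xbar    # intercept estimate

print(round(b0, 5), round(b1, 5))  # → 98.24833 0.10977
```

These agree with the Intercept and Square Feet coefficients in the Excel output shown later.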
Graphical Presentation

(Scatter diagram: house price versus square feet, with the fitted regression line.)
Excel Output

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

ANOVA
             df    SS           MS           F         Significance F
Regression    1    18934.9348   18934.9348   11.0848   0.01039
Residual      8    13665.5652   1708.1957
Total         9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580
Graphical Presentation

(Scatter diagram with the fitted line; the line meets the vertical axis at the intercept, b0 = 98.248.)
Interpretation of the Intercept, b0

house price = 98.24833 + 0.10977 · (square feet)

b0 = 98.24833 is the estimated mean value of Y when X = 0. Since no house in the sample had zero square feet, the intercept has no direct practical interpretation here.
Interpretation of the Slope Coefficient, b1

house price = 98.24833 + 0.10977 · (square feet)

b1 = 0.10977 estimates the change in the mean value of Y for each one-unit increase in X: on average, each additional square foot adds about 0.10977 (in the units of house price) to the predicted price.
Measures of Variation

Total variation is made up of two parts:

SST = SSR + SSE

Total Sum of Squares:       SST = Σ(yi - ȳ)²
Regression Sum of Squares:  SSR = Σ(ŷi - ȳ)²
Error Sum of Squares:       SSE = Σ(yi - ŷi)²

where ȳ is the average value of the dependent variable, yi is the observed value, and ŷi is the predicted value of y for the given xi.
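For the house-price data, the three sums of squares defined above can be computed and checked against the Excel ANOVA table. A minimal Python sketch (names are my own):

```python
# Decompose total variation into explained and unexplained parts,
# using the fitted line yhat_i = b0 + b1*x_i for the house-price data.
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)               # total sum of squares
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # regression sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error sum of squares

print(round(sst, 4), round(ssr, 4), round(sse, 4))
# → 32600.5 18934.9348 13665.5652
```

The three values match the SS column of the ANOVA table, and SST = SSR + SSE holds up to rounding.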
(Diagram: for an observation (xi, yi), the deviation yi - ȳ splits into an explained part ŷi - ȳ and a residual yi - ŷi; squaring and summing these deviations gives SST, SSR, and SSE = Σ(yi - ŷi)².)
Coefficient of Determination, R²

R² = SSR / SST = regression sum of squares / total sum of squares

note: 0 ≤ R² ≤ 1
Examples of Approximate r² Values

r² = 1: perfect linear relationship between X and Y.

0 < r² < 1: a weaker linear relationship between X and Y.

r² = 0: no linear relationship between X and Y.
Excel Output

From the ANOVA table above, R² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082, matching the "R Square" entry in the Regression Statistics.
Correlation and R²

In simple linear regression, R² = r²xy: the coefficient of determination equals the square of the sample correlation between x and y.
Estimation of Model Error Variance

An estimator of the model error variance is

s²e = SSE / (n - 2) = Σ e²i / (n - 2)

(sum over i = 1, …, n). The square root, se, is called the standard error of the estimate.
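For the house-price regression, se follows directly from SSE = 13665.5652 and n = 10 (values taken from the Excel output). A minimal Python sketch:

```python
import math

# Standard error of the estimate: se = sqrt(SSE / (n - 2)),
# using SSE and n from the house-price regression output.
sse = 13665.5652
n = 10

s2e = sse / (n - 2)   # estimated model error variance (the "MS Residual")
se = math.sqrt(s2e)

print(round(se, 5))  # → 41.33032
```

This reproduces the "Standard Error" value in the Regression Statistics.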
Excel Output

From the output above, se = 41.33032, reported as "Standard Error" in the Regression Statistics.
(Diagram: a small se means the data points lie close to the fitted line; a large se means they are widely scattered around it.)
The standard error of the slope estimate is

sb1 = se / √(Σ(xi - x̄)²) = se / √((n - 1)·s²x)

where se = √(SSE / (n - 2)) is the standard error of the estimate.
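Applying this formula to the square-feet values and the se computed above gives the slope's standard error. A minimal Python sketch (names are my own):

```python
import math

# Standard error of the slope: sb1 = se / sqrt(sum((xi - xbar)^2)),
# using the square-feet values from the house-price example.
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
se = 41.33032  # standard error of the estimate, from the regression output

xbar = sum(x) / len(x)
sxx = sum((xi - xbar) ** 2 for xi in x)  # sum of squared deviations of x

sb1 = se / math.sqrt(sxx)
print(round(sb1, 5))  # → 0.03297
```

This matches the "Standard Error" reported for the Square Feet coefficient.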
Excel Output

From the output above, sb1 = 0.03297, the "Standard Error" reported for the Square Feet coefficient.
(Diagram: a small sb1 means the slope is estimated precisely; a large sb1 means the slope estimate is imprecise.)
Test statistic:

t = (b1 - β1) / sb1,  with d.f. = n - 2

where b1 = regression slope coefficient, β1 = hypothesized slope, and sb1 = standard error of the slope.
(The same house-price data shown earlier, n = 10, are used to test whether square footage is related to house price.)
H0: β1 = 0    H1: β1 ≠ 0

From the regression output, b1 = 0.10977 and sb1 = 0.03297, so

t = (b1 - β1) / sb1 = (0.10977 - 0) / 0.03297 = 3.32938
H0: β1 = 0    H1: β1 ≠ 0

d.f. = 10 - 2 = 8

For a two-tail test with α/2 = .025, the critical value is t8,.025 = 2.3060. The test statistic t = 3.329 falls in the upper rejection region (3.329 > 2.3060).

Decision: Reject H0.

Conclusion: There is sufficient evidence that square footage affects house price.
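The arithmetic of this test can be sketched in a few lines of Python (a check on the numbers above, not part of the slides):

```python
# Two-tail t test for the slope: t = (b1 - beta1) / sb1, compared
# with the critical value t(8, .025) = 2.3060 given above.
b1 = 0.10977     # slope estimate from the regression output
sb1 = 0.03297    # standard error of the slope
beta1_H0 = 0     # hypothesized slope under H0

t = (b1 - beta1_H0) / sb1
t_crit = 2.3060  # two-tail critical value, d.f. = 8, alpha = .05

print(round(t, 4), abs(t) > t_crit)  # → 3.3294 True  (reject H0)
```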
P-value = 0.01039

H0: β1 = 0    H1: β1 ≠ 0

From the output, the P-value for the Square Feet slope is 0.01039. Since P-value = 0.01039 < α = .05, reject H0.
Confidence Interval Estimate of the Slope

From the output, the 95% confidence interval for the Square Feet slope is (0.03374, 0.18580), taken from the "Lower 95%" and "Upper 95%" columns.
Since this 95% confidence interval for the slope does not include 0, we conclude at the .05 level that square footage is related to house price.
Prediction

The prediction for a new observation xn+1 is

ŷn+1 = b0 + b1·xn+1
Predictions Using Regression Analysis

Predict the price for a house with 2000 square feet:

house price = 98.24833 + 0.10977 · (2000) ≈ 317.79
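The prediction above follows by plugging x = 2000 into the fitted equation. A minimal Python sketch:

```python
# Predicted house price at x = 2000 square feet,
# using yhat = b0 + b1*x with the fitted coefficients.
b0 = 98.24833
b1 = 0.10977

x_new = 2000
yhat = b0 + b1 * x_new

print(round(yhat, 2))  # → 317.79
```

Note that 2000 square feet lies inside the range of observed X values (1100 to 2450), so this is an interpolation rather than an extrapolation.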
It is risky to try to extrapolate far beyond the range of observed X values.
Correlation Analysis

The sample correlation coefficient is

r = sxy / (sx·sy)

where

sxy = Σ(x - x̄)(y - ȳ) / (n - 1)
To test H0: ρ = 0, use the test statistic

t = r·√(n - 2) / √(1 - r²)
Decision Rules

Hypothesis Test for Correlation

Lower-tail test:            Upper-tail test:            Two-tail test:
H0: ρ ≥ 0                   H0: ρ ≤ 0                   H0: ρ = 0
H1: ρ < 0                   H1: ρ > 0                   H1: ρ ≠ 0
Reject H0 if t < -tn-2,α    Reject H0 if t > tn-2,α     Reject H0 if t < -tn-2,α/2 or t > tn-2,α/2

The test statistic t = r·√(n - 2) / √(1 - r²) has n - 2 d.f.
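For the house-price example, this correlation test can be checked numerically with r = 0.76211 (the "Multiple R" from the Excel output) and n = 10. A minimal Python sketch:

```python
import math

# t statistic for H0: rho = 0, using t = r*sqrt(n - 2) / sqrt(1 - r^2)
# with the sample correlation from the house-price regression output.
r = 0.76211
n = 10

t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 3))  # → 3.329
```

As expected, this equals the t statistic from the slope test, since R² = r²xy in simple linear regression.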