4-Biol 605-Regression Models (1)
REGRESSION AND CORRELATION
Correlation
A correlation is a relationship between two variables. The
data can be represented by the ordered pairs (x, y) where
x is the independent (or explanatory) variable, and y is
the dependent (or response) variable.
A scatter plot can be used to determine whether a linear correlation exists between the two variables.
Example:
x    1    2    3    4    5
y   –4   –2   –1    0    2
Linear Correlation
Figure: four scatter plots of y against x, one for each case below.
Negative Linear Correlation: As x increases, y tends to decrease.
Positive Linear Correlation: As x increases, y tends to increase.
No Correlation
Nonlinear Correlation
Correlation Coefficient
The correlation coefficient is a measure of the strength
and the direction of a linear relationship between two
variables. The symbol r represents the sample correlation
coefficient. The formula for r is
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} \]
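As a rough illustration of the formula above, here is a minimal Python sketch that evaluates r from the five sums. The function name `pearson_r` is an assumption made for this example and does not come from the slides. The data are the five (x, y) pairs from the scatter-plot table earlier in this section.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Five-point data from the scatter-plot example.
xs = [1, 2, 3, 4, 5]
ys = [-4, -2, -1, 0, 2]
r = pearson_r(xs, ys)
print(round(r, 3))  # 0.99
```

A value this close to 1 indicates a strong positive linear correlation for these five points.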
Figure: four scatter plots of y against x, illustrating different values of r.
r = –0.91: Strong negative correlation
r = 0.88: Strong positive correlation
r = 0.42: Weak positive correlation
r = 0.07: Nonlinear correlation
The correlation between X and Y may be:
Hours, x 0 1 2 3 3 5 5 5 6 7 7 10
Test score, y 96 85 82 74 95 68 76 84 58 65 75 50
Figure: scatter plot of test score (y) against hours watching TV (x).
Correlation Coefficient
Example continued:
Hours, x         0     1     2     3     3     5     5     5     6     7     7    10
Test score, y   96    85    82    74    95    68    76    84    58    65    75    50
xy               0    85   164   222   285   340   380   420   348   455   525   500
x²               0     1     4     9     9    25    25    25    36    49    49   100
y²            9216  7225  6724  5476  9025  4624  5776  7056  3364  4225  5625  2500
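For reference, the rows of the table can be totaled and substituted into the formula for r. The sums below were computed from the table values; this is a worked sketch and not a reproduction of the original slide.

```latex
\[
\sum x = 54,\quad \sum y = 908,\quad \sum xy = 3724,\quad
\sum x^2 = 332,\quad \sum y^2 = 70836,\quad n = 12
\]
\[
r = \frac{12(3724) - (54)(908)}
         {\sqrt{12(332) - 54^2}\,\sqrt{12(70836) - 908^2}}
  = \frac{-4344}{\sqrt{1068}\,\sqrt{25568}}
  \approx -0.831
\]
```

The negative sign matches the downward trend in the data: more hours of TV go with lower test scores.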
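Figure: a scatter plot with a fitted line, marking the observed y-value, the predicted y-value, and the difference d between them at a given x.
For each data point, d_i represents the difference between the observed y-value and the predicted y-value for a given x-value on the line. These differences are called residuals.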
Regression Line
A regression line, also called a line of best fit, is the line
for which the sum of the squares of the residuals is a
minimum.
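To make the "minimum" concrete, the following Python sketch compares the sum of squared residuals for three candidate lines on a five-point data set. The helper name `sum_squared_residuals` is an assumption for this example. The slope 1.4 and intercept –5.2 are the least-squares values for this particular data, computed separately, and the two comparison lines are arbitrary.

```python
def sum_squared_residuals(xs, ys, m, b):
    """Sum of squared differences between observed y and predicted m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Illustrative five-point data set.
xs = [1, 2, 3, 4, 5]
ys = [-4, -2, -1, 0, 2]

# Least-squares slope and intercept for this data (computed separately).
best = sum_squared_residuals(xs, ys, 1.4, -5.2)

# Two nearby lines, chosen arbitrarily for comparison.
steeper = sum_squared_residuals(xs, ys, 1.5, -5.2)
shifted = sum_squared_residuals(xs, ys, 1.4, -5.0)

print(best < steeper and best < shifted)  # True
```

Any other choice of slope and intercept gives a larger sum of squares on this data than the least-squares line does.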
The Equation of a Regression Line
The equation of a regression line for an independent variable
x and a dependent variable y is
ŷ = mx + b
where ŷ is the predicted y-value for a given x-value. The slope
m and y-intercept b are given by
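\[ m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \quad\text{and}\quad b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \]
where ȳ is the mean of the y-values and x̄ is the mean of the x-values. The regression line always passes through (x̄, ȳ).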
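The slope and intercept formulas translate directly into code. The sketch below is one possible Python rendering. The function name `regression_line` is an assumption for this example, and the data are the hours and test scores from the TV example earlier in this section.

```python
def regression_line(xs, ys):
    """Return (m, b) for the least-squares line y-hat = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n  # b = y-bar - m * x-bar
    return m, b

hours = [0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10]
scores = [96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50]

m, b = regression_line(hours, scores)
x_bar = sum(hours) / len(hours)
y_bar = sum(scores) / len(scores)

print(round(m, 2), round(b, 2))             # -4.07 93.97
print(abs((m * x_bar + b) - y_bar) < 1e-9)  # True: line passes through the mean point
```

The second check illustrates the property stated above: evaluating the fitted line at x̄ returns ȳ, up to floating-point rounding.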
Regression Line
Example:
Find the equation of the regression line.
x     y     xy    x²    y²
1    –3    –3     1     9
2    –1    –2     4     1
3     0     0     9     0
4     1     4    16     1
5     2    10    25     4
Σx = 15, Σy = –1, Σxy = 9, Σx² = 55, Σy² = 15
The mean point is (x̄, ȳ) = (3, –1/5).
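Substituting the column sums into the slope and intercept formulas completes this example. The working below is reconstructed from the table values (n = 5, Σx = 15, Σy = –1, Σxy = 9, Σx² = 55) and should be read as a sketch of the calculation.

```latex
\[
m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}
  = \frac{5(9) - (15)(-1)}{5(55) - 15^2}
  = \frac{60}{50} = 1.2
\]
\[
b = \bar{y} - m\bar{x} = -\frac{1}{5} - (1.2)(3) = -3.8
\]
\[
\hat{y} = 1.2x - 3.8
\]
```

As a check, evaluating the line at x = 3 gives 1.2(3) – 3.8 = –0.2, which is ȳ.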
Regression Line
Example:
The following data represent the number of hours 12
different students watched television during the
weekend and the score each student received on a test
the following Monday.
a.) Find the equation of the regression line.
b.) Use the equation to find the expected test score
for a student who watches 9 hours of TV.
Hours, x         0     1     2     3     3     5     5     5     6     7     7    10
Test score, y   96    85    82    74    95    68    76    84    58    65    75    50
xy               0    85   164   222   285   340   380   420   348   455   525   500
x²               0     1     4     9     9    25    25    25    36    49    49   100
y²            9216  7225  6724  5476  9025  4624  5776  7056  3364  4225  5625  2500
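With slope m ≈ –4.067, the y-intercept is
\[ b = \bar{y} - m\bar{x} = \frac{908}{12} - (-4.067)\,\frac{54}{12} \approx 93.97 \]
The mean point is
\[ (\bar{x}, \bar{y}) = \left(\frac{54}{12}, \frac{908}{12}\right) \approx (4.5, 75.7) \]
ŷ = –4.07x + 93.97
Figure: plot of test score (y) against hours watching TV (x).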
Continued.
Regression Line
Example continued:
Using the equation ŷ = –4.07x + 93.97, we can predict
the test score for a student who watches 9 hours of TV.
ŷ = –4.07x + 93.97
= –4.07(9) + 93.97
= 57.34
Variation About a Regression Line
The total variation about a regression line is the sum of the
squares of the differences between the y-value of each ordered
pair and the mean of y.
\[ \text{Total variation} = \sum (y_i - \bar{y})^2 \]
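As a small illustration of this definition, the Python sketch below computes the total variation for the test scores from the TV example. The variable names are assumptions made for this example.

```python
# Test scores from the TV-watching example.
scores = [96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50]

y_bar = sum(scores) / len(scores)  # mean of the y-values
total_variation = sum((y - y_bar) ** 2 for y in scores)

print(round(total_variation, 2))  # 2130.67
```

Note that the total variation depends only on the y-values; the x-values and the fitted line do not enter into it.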
Example:
The correlation coefficient for the data that represent
the number of hours students watched television and the
test scores of each student is r ≈ –0.831. Find the
coefficient of determination.
r² ≈ (–0.831)² ≈ 0.691
About 69.1% of the variation in the test scores can be explained by the variation in the hours of TV watched. About 30.9% of the variation is unexplained.
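The 69.1% figure can also be checked by splitting the total variation directly. The Python sketch below assumes the standard decomposition, in which the unexplained variation is the sum of squared residuals about the least-squares line and the explained variation is the remainder. That decomposition is not spelled out in the text above, and the variable names are assumptions for this example.

```python
hours = [0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10]
scores = [96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50]
n = len(hours)

# Least-squares slope and intercept from the computational formulas.
sum_x, sum_y = sum(hours), sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n

y_bar = sum_y / n
total = sum((y - y_bar) ** 2 for y in scores)
unexplained = sum((y - (m * x + b)) ** 2 for x, y in zip(hours, scores))
explained = total - unexplained

r_squared = explained / total
print(round(r_squared, 3))      # 0.691
print(round(1 - r_squared, 3))  # 0.309
```

Computed this way, the ratio of explained to total variation agrees with squaring r ≈ –0.831.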