Correlation and Regression
Chapter Outline
9.1 Correlation
9.2 Linear Regression
9.3 Measures of Regression and Prediction Intervals
9.4 Multiple Regression
Section 9.1 Correlation
Section 9.1 Objectives
1. An introduction to linear correlation, independent and
dependent variables, and the types of correlation
2. How to find a correlation coefficient
3. How to test a population correlation coefficient
using a table
4. How to perform a hypothesis test for a population
correlation coefficient
5. How to distinguish between correlation and causation
Correlation (1 of 3)
Correlation
• A relationship between two variables.
• The data can be represented by ordered pairs (x, y)
– x is the independent (or explanatory) variable
– y is the dependent (or response) variable
Correlation (2 of 3)
• In a scatter plot, the ordered pairs (x, y) are graphed as
points in a coordinate plane.
• The independent (explanatory) variable x is measured on
the horizontal axis, and the dependent (response)
variable y is measured on the vertical axis.
• A scatter plot can be used to determine whether a linear
(straight line) correlation exists between two variables.
Correlation (3 of 3)
Hours of exercise, x 12 3 0 6 10 2 18 14 15 5
GPA, y 3.6 4.0 3.9 2.5 2.4 2.2 3.7 3.0 1.8 3.1
Example: Constructing a Scatter Plot (3 of 3)
Solution:
From the scatter plot, it appears that there is no linear
correlation between the variables.
The correlation coefficient r is
\[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}. \]
$\sum x = 25.7$
$\sum y = 5269.5$
$\sum xy = 16{,}687.99$
$\sum x^2 = 82.81$
$\sum y^2 = 3{,}548{,}633.25$
Solution: Calculating the Correlation
Coefficient (2 of 3)
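With these sums and n = 10, the correlation coefficient is
\[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{31{,}453.75}{\sqrt{167.61}\,\sqrt{7{,}718{,}702.25}} \approx 0.874. \]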
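The calculation above can be checked with a short script. The following is a minimal sketch in Python, assuming the ten (x, y) pairs are the gross domestic product and carbon dioxide emissions values tabulated in this chapter. The function name `correlation_coefficient` is illustrative and does not come from the slides.

```python
import math

# GDP (trillions of dollars), x, and CO2 emissions (millions of metric tons), y,
# taken from the chapter's data table.
x = [1.7, 2.4, 3.0, 1.2, 4.1, 2.3, 0.9, 1.8, 2.9, 5.4]
y = [620.1, 475.2, 457.6, 389.7, 810.8, 352.9, 235.0, 297.8, 413.9, 1216.5]


def correlation_coefficient(xs, ys):
    """Compute r from the five sums used in the formula above."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation_coefficient(x, y)
print(round(r, 3))  # 0.874
```

Computing the sums once and substituting them follows the same order as the hand calculation, which makes it easy to compare intermediate values.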
Solution: Calculating the Correlation
Coefficient (3 of 3)
• The result $r \approx 0.874$ suggests a strong positive linear
correlation.
• As the gross domestic product increases, the carbon
dioxide emissions tend to increase.
Hypothesis Testing for a Population
Correlation Coefficient Rho (1 of 2)
• A hypothesis test can also be used to determine whether
the sample correlation coefficient r provides enough
evidence to conclude that the population correlation
coefficient $\rho$ is significant at a specified level of
significance.
• A hypothesis test can be one-tailed or two-tailed.
Hypothesis Testing for a Population
Correlation Coefficient Rho (2 of 2)
• Left-tailed test
• Two-tailed test
$H_0: \rho = 0$ (no significant correlation)
$H_a: \rho \ne 0$ (significant correlation)
The t-Test for the Correlation
Coefficient
• A t-test can be used to test whether the correlation
between two variables is significant. The test statistic is
r and the standardized test statistic
\[ t = \frac{r}{\sigma_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} \]
follows a t-distribution with $n - 2$ degrees of freedom,
where n is the number of pairs of data. (Note that there are
n – 2 degrees of freedom because one degree of freedom is
lost for each variable.)
Using the t-Test for Rho (1 of 2)
In Words                                          In Symbols
1. State the null and alternative hypotheses.     State $H_0$ and $H_a$.
2. Specify the level of significance.             Identify $\alpha$.
3. Identify the degrees of freedom.               d.f. $= n - 2$
$H_0: \rho = 0$ (no correlation) and
$H_a: \rho \ne 0$ (significant correlation).
Solution: t-Test for a Correlation
Coefficient (1 of 3)
Because there are 10 pairs of data in the sample, there are
$10 - 2 = 8$ degrees of freedom. Because the test is a two-tailed
test, $\alpha = 0.05$, and d.f. $= 8$, the critical values are
$-t_0 = -2.306$ and $t_0 = 2.306$. The rejection regions are
$t < -2.306$ and $t > 2.306$.
Solution: t-Test for a Correlation
Coefficient (2 of 3)
Using the t-test, the standardized test statistic is
\[ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.874}{\sqrt{\dfrac{1 - (0.874)^2}{10 - 2}}} \approx 5.087. \]
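The standardized test statistic can be reproduced in a few lines. This is a minimal sketch in Python, assuming the rounded value r = 0.874, n = 10, and the critical value 2.306 quoted above for a two-tailed test with d.f. = 8. The helper name `t_statistic` is illustrative.

```python
import math


def t_statistic(r, n):
    """Standardized test statistic t = r / sqrt((1 - r^2) / (n - 2))."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r = 0.874           # sample correlation coefficient (rounded, as on the slide)
n = 10              # number of data pairs
t_critical = 2.306  # two-tailed critical value quoted above, d.f. = 8

t = t_statistic(r, n)
# In a two-tailed test, reject H0 when t lies in either rejection region.
reject_h0 = abs(t) > t_critical
print(round(t, 3), reject_h0)  # 5.087 True
```

Because 5.087 is greater than 2.306, the statistic falls in the right-hand rejection region, which is consistent with the later statement that the correlation is significant.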
Solution: t-Test for a Correlation
Coefficient (3 of 3)
The figure shows the location of the rejection regions and the
standardized test statistic.
Find the equation of the regression line for the gross domestic
products and carbon dioxide emissions data.

GDP (in trillions of dollars), x    CO2 emissions (in millions of metric tons), y
1.7                                 620.1
2.4                                 475.2
3.0                                 457.6
1.2                                 389.7
4.1                                 810.8
2.3                                 352.9
0.9                                 235.0
1.8                                 297.8
2.9                                 413.9
5.4                                 1216.5
Solution: Finding the Equation of a
Regression Line (1 of 3)
Recall that there is a significant linear correlation between
gross domestic products and carbon dioxide emissions.
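Also, you found that $n = 10$, $\sum x = 25.7$, $\sum y = 5269.5$,
$\sum xy = 16{,}687.99$, and $\sum x^2 = 82.81$. You can use these
values to calculate the slope m of the regression line
\[ m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{10(16{,}687.99) - (25.7)(5269.5)}{10(82.81) - (25.7)^2} \approx 187.660343 \]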
Solution: Finding the Equation of a
Regression Line (2 of 3)
• and its y-intercept b.
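\[ b = \bar{y} - m\bar{x} \approx \frac{5269.5}{10} - (187.660343)\,\frac{25.7}{10} \approx 44.663 \]
So, the equation of the regression line is $\hat{y} = 187.660x + 44.663$.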
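The slope and y-intercept can be recomputed directly from the data. The following is a minimal sketch in Python, assuming the ten (x, y) pairs from the chapter's data table. The function name `regression_line` is illustrative and not from the slides.

```python
# GDP (trillions of dollars), x, and CO2 emissions (millions of metric tons), y.
x = [1.7, 2.4, 3.0, 1.2, 4.1, 2.3, 0.9, 1.8, 2.9, 5.4]
y = [620.1, 475.2, 457.6, 389.7, 810.8, 352.9, 235.0, 297.8, 413.9, 1216.5]


def regression_line(xs, ys):
    """Return slope m and y-intercept b of the least-squares regression line."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n  # b = y-bar - m * x-bar
    return m, b


m, b = regression_line(x, y)
print(f"y-hat = {m:.3f}x + {b:.3f}")  # y-hat = 187.660x + 44.663
```

Keeping the unrounded slope when computing b, as the hand calculation does, avoids carrying rounding error into the intercept.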
Solution: Finding the Equation of a
Regression Line (3 of 3)
To sketch the regression line, first choose two x-values
between the least and greatest x-values in the data set. Next,
calculate the corresponding y-values using the regression
equation. Draw a line through the two points. Notice that the
line passes through the point $(\bar{x}, \bar{y}) = (2.57, 526.95)$.
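The sketching procedure can be illustrated numerically. This is a minimal sketch in Python, assuming the rounded regression equation with slope 187.660 and intercept 44.663. The two x-values 1.0 and 5.0 are hypothetical choices between the least (0.9) and greatest (5.4) x-values in the data set.

```python
def predict(x, m=187.660, b=44.663):
    """Evaluate the regression equation y-hat = m*x + b."""
    return m * x + b


# Two hypothetical x-values inside the data range [0.9, 5.4].
x1, x2 = 1.0, 5.0
points = [(x1, predict(x1)), (x2, predict(x2))]
print(points)

# The line should pass through the point of means (x-bar, y-bar).
x_bar, y_bar = 2.57, 526.95
gap = abs(predict(x_bar) - y_bar)
print(gap < 0.01)  # True; the small gap comes from rounding m and b
```

Any two distinct x-values in the range would define the same line; choosing values near the ends of the range makes the hand-drawn line less sensitive to plotting error.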
Example: Predicting y-Values Using
Regression Equations
The regression equation for the gross domestic products (in
trillions of dollars) and carbon dioxide emissions (in millions
of metric tons) data is $\hat{y} = 187.660x + 44.663$. Use this
equation to predict the expected carbon dioxide emissions
for the following gross domestic products. (Recall from
Section 9.1 that x and y have a significant linear correlation.)
1. $1.2 trillion
2. $2.0 trillion
3. $2.6 trillion
Solution: Predicting y-Values Using
Regression Equations (1 of 3)
To predict the expected carbon dioxide emissions, substitute
each gross domestic product for x in the regression equation.
Then calculate $\hat{y}$.
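The substitution step can be sketched as a short script. This is a minimal sketch in Python, assuming the rounded regression equation from the example. The function name `predict_emissions` is illustrative.

```python
def predict_emissions(gdp_trillions):
    """Predicted CO2 emissions (millions of metric tons) for a given GDP
    (trillions of dollars), using y-hat = 187.660x + 44.663."""
    return 187.660 * gdp_trillions + 44.663


# The three gross domestic products from the example.
predictions = {gdp: predict_emissions(gdp) for gdp in (1.2, 2.0, 2.6)}
for gdp, emissions in predictions.items():
    print(f"GDP {gdp} trillion -> {emissions:.3f} million metric tons")
```

The predicted values are about 269.855, 419.983, and 532.579 million metric tons. All three GDP values lie within the range of the original x-data (0.9 to 5.4), which is where predictions from a regression line are meaningful.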