
Correlation and Regression

The document discusses correlation and regression, outlining key concepts such as correlation coefficients, hypothesis testing for population correlation coefficients, and the distinction between correlation and causation. It includes examples of constructing scatter plots and calculating correlation coefficients, as well as performing t-tests to determine the significance of correlations. The document emphasizes that correlation does not imply causation and explores various scenarios that could explain observed correlations.

Uploaded by

ogutahamphrey

Statistics

Correlation and
Regression
Chapter Outline
9.1 Correlation
9.2 Linear Regression
9.3 Measures of Regression and Prediction Intervals
9.4 Multiple Regression
Section 9.1 Correlation
Section 9.1 Objectives
1. An introduction to linear correlation, independent and
dependent variables, and the types of correlation
2. How to find a correlation coefficient
3. How to test a population correlation coefficient 
using a table
4. How to perform a hypothesis test for a population
correlation coefficient 
5. How to distinguish between correlation and causation
Correlation (1 of 3)
Correlation
• A relationship between two variables.
• The data can be represented by ordered pairs (x, y)
– x is the independent (or explanatory) variable
– y is the dependent (or response) variable
Correlation (2 of 3)
• In a scatter plot, the ordered pairs (x, y) are graphed as
points in a coordinate plane.
• The independent (explanatory) variable x is measured on
the horizontal axis, and the dependent (response)
variable y is measured on the vertical axis.
• A scatter plot can be used to determine whether a linear
(straight line) correlation exists between two variables.
Correlation (3 of 3)

[Figure: four scatter plots showing negative linear correlation, positive linear correlation, no correlation, and nonlinear correlation.]


Example: Constructing a Scatter Plot (1 of 3)
A researcher wants to determine whether there is a linear relationship between a country’s gross domestic product (GDP) and carbon dioxide (CO₂) emissions. The data for 10 different countries in a recent year are shown in the table. Display the data in a scatter plot and describe the type of correlation. (Source: World Bank and U.S. Energy Information Administration)

GDP (in trillions of dollars), x    CO₂ emissions (in millions of metric tons), y
1.7                                 620.1
2.4                                 475.2
3.0                                 457.6
1.2                                 389.7
4.1                                 810.8
2.3                                 352.9
0.9                                 235.0
1.8                                 297.8
2.9                                 413.9
5.4                                 1216.5
Solution: Constructing a Scatter Plot

There appears to be a positive linear correlation. Reading from left to right, as the gross domestic products increase, the carbon dioxide emissions tend to increase.
Example: Constructing a Scatter Plot (2 of 3)
A student conducts a study to determine whether there is a
linear relationship between the number of hours a student
exercises each week and the student’s grade point average
(GPA). The data are shown in the table below. Display the
data in a scatter plot and describe the type of correlation.

Hours of exercise, x 12 3 0 6 10 2 18 14 15 5
GPA, y 3.6 4.0 3.9 2.5 2.4 2.2 3.7 3.0 1.8 3.1
Example: Constructing a Scatter Plot (3 of 3)
Solution:
From the scatter plot, it appears that there is no linear
correlation between the variables.

The number of hours a student exercises each week does not appear to be related to the student’s grade point average.
Correlation Coefficient (1 of 2)
Correlation coefficient
• A measure of the strength and the direction of a linear
relationship between two variables.
• The symbol r represents the sample correlation coefficient.
• A formula for r is
$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$
where n is the number of data pairs.
• The population correlation coefficient is represented by ρ (rho).
Correlation Coefficient (2 of 2)
• The range of the correlation coefficient is –1 to 1.

• If r = −1, there is a perfect negative correlation.
• If r is close to 0, there is no linear correlation.
• If r = 1, there is a perfect positive correlation.
Linear Correlation

[Figure: six scatter plots. Perfect positive correlation, r = 1; strong positive correlation, r = 0.81; weak positive correlation, r = 0.45; perfect negative correlation, r = −1; strong negative correlation, r = −0.92; no correlation, r = 0.04.]
Calculating a Correlation Coefficient (1 of 2)
In Words | In Symbols
1. Find the sum of the x-values. | ∑x
2. Find the sum of the y-values. | ∑y
3. Multiply each x-value by its corresponding y-value and find the sum. | ∑xy
Calculating a Correlation Coefficient (2 of 2)
In Words | In Symbols
4. Square each x-value and find the sum. | ∑x²
5. Square each y-value and find the sum. | ∑y²
6. Use these five sums to calculate the correlation coefficient.
$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$
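The six steps can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the function name `correlation_coefficient` and the sample data are assumptions chosen for the example.

```python
from math import sqrt


def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r, computed from the five sums."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x = sum(xs)                               # step 1
    sum_y = sum(ys)                               # step 2
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # step 3
    sum_x2 = sum(x * x for x in xs)               # step 4
    sum_y2 = sum(y * y for y in ys)               # step 5
    # Step 6: combine the five sums.
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Made-up data (not from the slides): every point lies on y = 2x + 1,
# so r should come out at, or within rounding error of, 1.
print(correlation_coefficient([1, 2, 3, 4], [3, 5, 7, 9]))
```

The function divides by zero when all x-values or all y-values are equal; a production version would need to handle that case.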
Example: Calculating the Correlation
Coefficient
Calculate the correlation coefficient for the gross domestic products and carbon dioxide emissions data. What can you conclude?

GDP (in trillions of dollars), x    CO₂ emissions (in millions of metric tons), y
1.7                                 620.1
2.4                                 475.2
3.0                                 457.6
1.2                                 389.7
4.1                                 810.8
2.3                                 352.9
0.9                                 235.0
1.8                                 297.8
2.9                                 413.9
5.4                                 1216.5
Solution: Calculating the Correlation
Coefficient (1 of 3)
Solution:
x is GDP (in trillions of dollars); y is CO₂ emissions (in millions of metric tons).

x      y         xy          x²       y²
1.7    620.1     1,054.17    2.89     384,524.01
2.4    475.2     1,140.48    5.76     225,815.04
3.0    457.6     1,372.8     9        209,397.76
1.2    389.7     467.64      1.44     151,866.09
4.1    810.8     3,324.28    16.81    657,396.64
2.3    352.9     811.67      5.29     124,538.41
0.9    235.0     211.5       0.81     55,225
1.8    297.8     536.04      3.24     88,684.84
2.9    413.9     1,200.31    8.41     171,313.21
5.4    1216.5    6,569.1     29.16    1,479,872.25

∑x = 25.7
∑y = 5269.5
∑xy = 16,687.99
∑x² = 82.81
∑y² = 3,548,633.25
Solution: Calculating the Correlation
Coefficient (2 of 3)
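With these sums and n = 10, the correlation coefficient is
$$\begin{aligned}
r &= \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} \\
  &= \frac{10(16{,}687.99) - (25.7)(5269.5)}{\sqrt{10(82.81) - (25.7)^2}\,\sqrt{10(3{,}548{,}633.25) - (5269.5)^2}} \\
  &= \frac{31{,}453.75}{\sqrt{167.61}\,\sqrt{7{,}718{,}702.25}} \\
  &\approx 0.874
\end{aligned}$$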
Solution: Calculating the Correlation
Coefficient (3 of 3)
• The result r ≈ 0.874 suggests a strong positive linear correlation.
• As the gross domestic product increases, the carbon
dioxide emissions tend to increase.
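As a cross-check on the arithmetic in this example, the sketch below recomputes the five sums and r directly from the GDP and CO₂ table. The variable names are illustrative choices, not from the slides.

```python
from math import sqrt

# Data from the example: GDP (trillions of dollars) and
# CO2 emissions (millions of metric tons) for 10 countries.
gdp = [1.7, 2.4, 3.0, 1.2, 4.1, 2.3, 0.9, 1.8, 2.9, 5.4]
co2 = [620.1, 475.2, 457.6, 389.7, 810.8, 352.9, 235.0, 297.8, 413.9, 1216.5]

n = len(gdp)
sum_x = sum(gdp)
sum_y = sum(co2)
sum_xy = sum(x * y for x, y in zip(gdp, co2))
sum_x2 = sum(x * x for x in gdp)
sum_y2 = sum(y * y for y in co2)

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator

# Values are rounded for display; floating-point sums can differ from the
# hand-computed ones in the last few digits.
print(f"sum x = {sum_x:.1f}, sum y = {sum_y:.1f}, sum xy = {sum_xy:.2f}")
print(f"sum x^2 = {sum_x2:.2f}, sum y^2 = {sum_y2:.2f}")
print(f"r = {r:.3f}")  # the slides report r of about 0.874
```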
Hypothesis Testing for a Population
Correlation Coefficient Rho (1 of 2)
• A hypothesis test can also be used to determine whether the sample correlation coefficient r provides enough evidence to conclude that the population correlation coefficient ρ is significant at a specified level of significance.
• A hypothesis test can be one-tailed or two-tailed.
Hypothesis Testing for a Population
Correlation Coefficient Rho (2 of 2)
• Left-tailed test
H₀: ρ ≥ 0 (no significant negative correlation)
Hₐ: ρ < 0 (significant negative correlation)
• Right-tailed test
H₀: ρ ≤ 0 (no significant positive correlation)
Hₐ: ρ > 0 (significant positive correlation)
• Two-tailed test
H₀: ρ = 0 (no significant correlation)
Hₐ: ρ ≠ 0 (significant correlation)
The t-Test for the Correlation
Coefficient
• A t-test can be used to test whether the correlation between two variables is significant. The test statistic is r and the standardized test statistic
$$t = \frac{r}{\sigma_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
follows a t-distribution with n − 2 degrees of freedom, where n is the number of pairs of data. (Note that there are n − 2 degrees of freedom because one degree of freedom is lost for each variable.)
Using the t-Test for Rho (1 of 2)
In Words | In Symbols
1. State the null and alternative hypotheses. | State H₀ and Hₐ.
2. Specify the level of significance. | Identify α.
3. Identify the degrees of freedom. | d.f. = n − 2
4. Determine the critical value(s) and rejection region(s). | Use the t-table.
Using the t-Test for Rho (2 of 2)
In Words | In Symbols
5. Find the standardized test statistic.
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
6. Make a decision to reject or fail to reject the null hypothesis. | If t is in the rejection region, reject H₀. Otherwise, fail to reject H₀.
7. Interpret the decision in the context of the original claim.
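Steps 5 and 6 can be sketched in Python for the two-tailed case. This is an illustration under stated assumptions: the function names are hypothetical, and the critical value t₀ is passed in by the caller after being read from a t-table (step 4), since the sketch does not compute it.

```python
from math import sqrt


def t_statistic(r, n):
    """Standardized test statistic t = r / sqrt((1 - r^2) / (n - 2))."""
    return r / sqrt((1 - r ** 2) / (n - 2))


def two_tailed_decision(r, n, t0):
    """Return (t, reject), where reject is True when t falls in a rejection
    region of a two-tailed test, that is, t < -t0 or t > t0."""
    t = t_statistic(r, n)
    return t, abs(t) > t0


# Usage with the GDP example: r = 0.874 from n = 10 pairs. The value
# t0 = 2.306 is the two-tailed t-table entry for alpha = 0.05 and d.f. = 8.
t, reject = two_tailed_decision(0.874, 10, 2.306)
print(f"t = {t:.3f}, reject H0: {reject}")
```

A left-tailed or right-tailed test would compare t against a single critical value instead of using the absolute value.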
Example: t-Test for a Correlation
Coefficient
Previously you used 10 pairs of data to find r ≈ 0.874. Test the significance of this correlation coefficient. Use α = 0.05.
Solution:
The null and alternative hypotheses are

H₀: ρ = 0 (no correlation) and
Hₐ: ρ ≠ 0 (significant correlation).
Solution: t-Test for a Correlation
Coefficient (1 of 3)
Because there are 10 pairs of data in the sample, there are 10 − 2 = 8 degrees of freedom. Because the test is a two-tailed test, α = 0.05, and d.f. = 8, the critical values are −t₀ = −2.306 and t₀ = 2.306. The rejection regions are t < −2.306 and t > 2.306.
Solution: t-Test for a Correlation
Coefficient (2 of 3)
Using the t-test, the standardized test statistic is
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.874}{\sqrt{\dfrac{1 - (0.874)^2}{10 - 2}}} \approx 5.087.$$
Solution: t-Test for a Correlation
Coefficient (3 of 3)
The figure shows the location of the rejection regions and the standardized test statistic.

Because t is in the rejection region, you reject the null hypothesis. There is enough evidence at the 5% level of significance to conclude that there is a significant linear correlation between gross domestic products and carbon dioxide emissions.
Correlation and Causation (1 of 3)
• The fact that two variables are strongly correlated does
not in itself imply a cause-and-effect relationship between
the variables.
• When there is a significant correlation between two
variables, you should consider the following possibilities.
1. Is there a direct cause-and-effect relationship between
the variables?
– Does x cause y?
Correlation and Causation (2 of 3)
2. Is there a reverse cause-and-effect relationship between
the variables?
– Does y cause x?
3. Is it possible that the relationship between the variables
can be caused by a third variable or by a combination of
several other variables?
– Variables that have an effect on the variables being
studied but are not included in the study are called
lurking variables.
Correlation and Causation (3 of 3)
4. Is it possible that the relationship between two variables
may be a coincidence?
Section 9.2 Linear Regression
Section 9.2 Objectives
1. How to find the equation of a regression line
2. How to predict y-values using a regression equation
Regression Lines
• After verifying that the linear correlation between two
variables is significant, the next step is to determine the
equation of the line that best models the data.
• This line is called a regression line, and its equation can
be used to predict the value of y for a given value of x.
Residuals
• For each data point, d_i represents the difference between the observed y-value and the predicted y-value for a given x-value.
• These differences are called residuals and can be
positive, negative, or zero.
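A short Python sketch of residuals, using made-up data and two hypothetical candidate lines (none of these values come from the slides). It also prints the sum of the squared residuals for each line, the quantity used to compare how well different lines fit the same data.

```python
def residuals(xs, ys, m, b):
    """Residuals d_i: observed y minus the y predicted by the line y = m*x + b."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]


# Made-up data and candidate lines, chosen only for illustration.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

d_a = residuals(xs, ys, m=2, b=0)  # candidate line y = 2x
d_b = residuals(xs, ys, m=1, b=2)  # candidate line y = x + 2

ssr_a = sum(d * d for d in d_a)
ssr_b = sum(d * d for d in d_b)

print(d_a, ssr_a)  # [0, 0, -1, 0] 1
print(d_b, ssr_b)  # [-1, 0, 0, 2] 5
```

The residuals here are zero, negative, and positive in turn. Neither candidate is claimed to be the best possible line; the smaller sum only shows that y = 2x fits these four points more closely than y = x + 2.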
Regression Line
Regression line (line of best fit)
• The line for which the sum of the squares of the residuals, ∑d_i², is a minimum.
The Equation of a Regression Line
• The equation of a regression line for an independent variable x and a dependent variable y is
$$\hat{y} = mx + b$$
where ŷ is the predicted y-value for a given x-value. The slope m and y-intercept b are given by
$$m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}, \qquad b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n}$$
where ȳ is the mean of the y-values in the data set, x̄ is the mean of the x-values, and n is the number of pairs of data. The regression line always passes through the point (x̄, ȳ).
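The slope and intercept formulas translate directly into a few lines of Python. The sketch below is illustrative: the function name `regression_line` and the sample data are assumptions, not part of the slides. It also checks numerically that the fitted line passes through the point of means.

```python
def regression_line(xs, ys):
    """Return (m, b) for the least-squares line y-hat = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n  # b = y-bar - m * x-bar
    return m, b


# Made-up data for illustration only.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
m, b = regression_line(xs, ys)

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
print(f"m = {m:.3f}, b = {b:.3f}")
# The line evaluated at x-bar should equal y-bar, up to rounding error.
print(abs((m * x_bar + b) - y_bar) < 1e-9)
```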
Example: Finding the Equation of a
Regression Line
Find the equation of the regression line for the gross domestic products and carbon dioxide emissions data.

GDP (in trillions of dollars), x    CO₂ emissions (in millions of metric tons), y
1.7                                 620.1
2.4                                 475.2
3.0                                 457.6
1.2                                 389.7
4.1                                 810.8
2.3                                 352.9
0.9                                 235.0
1.8                                 297.8
2.9                                 413.9
5.4                                 1216.5
Solution: Finding the Equation of a
Regression Line (1 of 3)
Recall that there is a significant linear correlation between gross domestic products and carbon dioxide emissions. Also, you found that n = 10, ∑x = 25.7, ∑y = 5269.5, ∑xy = 16,687.99, and ∑x² = 82.81. You can use these values to calculate the slope m of the regression line
$$m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{10(16{,}687.99) - (25.7)(5269.5)}{10(82.81) - (25.7)^2} \approx 187.660343$$
Solution: Finding the Equation of a
Regression Line (2 of 3)
• and its y-intercept b.
$$b = \bar{y} - m\bar{x} \approx \frac{5269.5}{10} - (187.660343)\left(\frac{25.7}{10}\right) \approx 44.663$$
• So, the equation of the regression line is
$$\hat{y} = 187.660x + 44.663.$$
Solution: Finding the Equation of a
Regression Line (3 of 3)
To sketch the regression line, first choose two x-values between the least and greatest x-values in the data set. Next, calculate the corresponding y-values using the regression equation. Draw a line through the two points. Notice that the line passes through the point (x̄, ȳ) = (2.57, 526.95).
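The sketch below recomputes the slope and intercept from the GDP and CO₂ data, evaluates the line at two x-values for sketching, and checks the value at x̄. The x-values 1.0 and 5.0 are arbitrary choices inside the data range, and the variable names are illustrative.

```python
# Data from the example: GDP (trillions of dollars) and
# CO2 emissions (millions of metric tons).
gdp = [1.7, 2.4, 3.0, 1.2, 4.1, 2.3, 0.9, 1.8, 2.9, 5.4]
co2 = [620.1, 475.2, 457.6, 389.7, 810.8, 352.9, 235.0, 297.8, 413.9, 1216.5]

n = len(gdp)
sum_x, sum_y = sum(gdp), sum(co2)
sum_xy = sum(x * y for x, y in zip(gdp, co2))
sum_x2 = sum(x * x for x in gdp)

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n
print(f"m = {m:.3f}, b = {b:.3f}")  # the slides report 187.660 and 44.663


def predict(x):
    """Predicted CO2 emissions for a GDP of x, using the unrounded m and b."""
    return m * x + b


# Two points for sketching the line; 1.0 and 5.0 are arbitrary x-values
# between the least (0.9) and greatest (5.4) x-values in the data.
for x in (1.0, 5.0):
    print(f"({x}, {predict(x):.2f})")

# The line should pass through the point of means.
x_bar, y_bar = sum_x / n, sum_y / n
print(f"x-bar = {x_bar:.2f}, y-bar = {y_bar:.2f}, y-hat at x-bar = {predict(x_bar):.2f}")
```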
Example: Predicting y-Values Using
Regression Equations
The regression equation for the gross domestic products (in trillions of dollars) and carbon dioxide emissions (in millions of metric tons) data is ŷ = 187.660x + 44.663. Use this equation to predict the expected carbon dioxide emissions for the following gross domestic products. (Recall from Section 9.1 that x and y have a significant linear correlation.)
1. $1.2 trillion
2. $2.0 trillion
3. $2.6 trillion
Solution: Predicting y-Values Using
Regression Equations (1 of 3)
To predict the expected carbon dioxide emissions, substitute each gross domestic product for x in the regression equation. Then calculate ŷ.

1. ŷ = 187.660x + 44.663 = 187.660(1.2) + 44.663 = 269.855

When the gross domestic product is $1.2 trillion, the predicted CO₂ emissions are 269.855 million metric tons.
Solution: Predicting y-Values Using
Regression Equations (2 of 3)
To predict the expected carbon dioxide emissions, substitute each gross domestic product for x in the regression equation. Then calculate ŷ.

2. ŷ = 187.660x + 44.663 = 187.660(2.0) + 44.663 = 419.983

When the gross domestic product is $2.0 trillion, the predicted CO₂ emissions are 419.983 million metric tons.
Solution: Predicting y-Values Using
Regression Equations (3 of 3)
To predict the expected carbon dioxide emissions, substitute each gross domestic product for x in the regression equation. Then calculate ŷ.

3. ŷ = 187.660x + 44.663 = 187.660(2.6) + 44.663 = 532.579

When the gross domestic product is $2.6 trillion, the predicted CO₂ emissions are 532.579 million metric tons.
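The three substitutions can be reproduced with a small helper. This sketch uses the rounded coefficients from the regression equation; the function name `predict_co2` is an illustrative assumption.

```python
def predict_co2(gdp_trillions):
    """Predicted CO2 emissions (millions of metric tons) from the rounded
    regression equation y-hat = 187.660x + 44.663."""
    return 187.660 * gdp_trillions + 44.663


for gdp in (1.2, 2.0, 2.6):
    print(f"GDP ${gdp} trillion -> {predict_co2(gdp):.3f} million metric tons")
```

All three GDP values lie between the least (0.9) and greatest (5.4) x-values in the data set, so each prediction stays within the range the line was fitted on.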
