
Lesson 11 Pearsons R

N/A

Uploaded by

limjana490
Copyright
© © All Rights Reserved
Lesson 11
Correlation (Pearson’s r)

INTRODUCTION:

Another area of inferential statistics involves determining whether a
relationship exists between two or more numerical or quantitative
variables. For example, educators are interested in determining whether
the number of hours a student studies is related to the student’s score on
a particular exam. Medical researchers are interested in questions such as,
Is caffeine related to heart damage? or Is there a relationship between a
person’s age and his or her blood pressure? These are only a few of the
many questions that can be answered by using the techniques of
correlation and regression analysis.

OBJECTIVES:

At the end of this lesson, you should be able to:

1. Investigate the linear relationship between two variables by
measuring the strength of association.
2. Perform hypothesis testing involving Pearson’s r correlation
coefficient.

Correlation

Correlation is a statistical method used to determine whether a
relationship between variables exists. Regression is a statistical method used to
describe the nature of the relationship between variables, that is, positive or
negative, linear or nonlinear.

The purpose of this chapter is to answer these questions statistically:

1. Are two variables related?
2. If so, what is the strength of the relationship?
3. What type of relationship exists?

To answer the first two questions, statisticians use a numerical measure to
determine whether two or more variables are related and to determine the
strength of the relationship between or among the variables. This measure is
called a correlation coefficient. For example, there are many variables that
contribute to heart disease, among them lack of exercise, smoking, heredity,
age, stress, and diet. Of these variables, some are more important than others;
therefore, a physician who wants to help a patient must know which factors are
most important.

To answer the third question, you must ascertain what type of relationship
exists. There are two types of relationships: simple and multiple. In a simple
relationship, there are two variables—an independent variable, also called
an explanatory variable or a predictor variable, and a dependent variable, also
called a response variable. A simple relationship analysis is called simple
regression, and there is one independent variable that is used to predict the
dependent variable. For example, a manager may wish to see whether the
number of years the salespeople have been working for the company has
anything to do with the amount of sales they make. This type of study involves a
simple relationship, since there are only two variables—years of experience and
amount of sales.

In a multiple relationship, called multiple regression, two or more
independent variables are used to predict one dependent variable. For example,
an educator may wish to investigate the relationship between a student’s success
in college and factors such as the number of hours devoted to studying, the
student’s GPA, and the student’s high school background. This type of study
involves several variables.

Simple relationships can also be positive or negative. A positive
relationship exists when both variables increase or decrease at the same time.
For instance, a person’s height and weight are related; and the relationship is
positive, since the taller a person is, generally, the more the person weighs. In a
negative relationship, as one variable increases, the other variable decreases,
and vice versa. For example, if you measure the strength of people over 60 years
of age, you will find that as age increases, strength generally decreases. The
word generally is used here because there are exceptions.

Pearson’s r Correlation Coefficient

Statisticians use a measure called the correlation coefficient to
determine the strength of the linear relationship between two variables. There
are several types of correlation coefficients. The one explained in this lesson is
called the Pearson product moment correlation coefficient (PPMC),
named after statistician Karl Pearson, who pioneered the research in this area.

The correlation coefficient computed from the sample data measures the
strength and direction of a linear relationship between two variables. The symbol
for the sample correlation coefficient is 𝑟. The symbol for the population
correlation coefficient is 𝜌 (Greek letter rho).

The range of the correlation coefficient is from −1 to +1. If there is a
strong positive linear relationship between the variables, the value of r will be
close to +1. If there is a strong negative linear relationship between the
variables, the value of r will be close to −1. When there is no linear relationship
between the variables or only a weak relationship, the value of r will be close to
0.

The graphs in the figure below show the relationship between the correlation
coefficients and their corresponding scatter plots. Notice that as the value of the
correlation coefficient increases from 0 to +1 (parts a, b, and c), the data values
fall closer to a straight line, which indicates an increasingly strong positive
relationship. As the value of the correlation coefficient decreases from 0 to −1
(parts d, e, and f), the data values also become closer to a straight line. Again
this suggests a stronger relationship.

Formula for the Correlation Coefficient r

$$
r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}}
$$

where 𝒏 is the number of data pairs.

Rounding Rule for the Correlation Coefficient: Round the value of 𝑟 to three
decimal places.

The formula looks somewhat complicated, but using a table to
compute the values, as shown in Example 1, makes it somewhat easier to
determine the value of 𝑟. There are no units associated with 𝑟, and the value of
𝑟 will remain unchanged if the 𝑥 and 𝑦 values are switched.
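
The formula and the two properties just mentioned (no units, and the same value when 𝑥 and 𝑦 are switched) can be checked with a short script. This is a minimal sketch: the function name `pearson_r` and the small data set are made up for illustration and do not come from the lesson.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r from paired data (computational formula)."""
    if len(xs) != len(ys):
        raise ValueError("x and y must have the same number of values")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Made-up illustrative data (an assumption, not taken from the lesson).
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

r_xy = pearson_r(hours, score)
r_yx = pearson_r(score, hours)                          # x and y switched
r_minutes = pearson_r([60 * h for h in hours], score)   # hours converted to minutes

print(round(r_xy, 3), round(r_yx, 3), round(r_minutes, 3))
```

All three printed values agree, since switching the variables or changing the unit of measurement does not change 𝑟. Note that the sketch does not guard against a variable whose values are all equal, in which case the denominator is zero and 𝑟 is undefined.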

Example 1: Car Rental Companies. Compute the correlation coefficient for
car rental companies in the United States for a recent year.

Solution:

Step 1: Make a table as shown here.

Step 2: Find the values of 𝑥𝑦, 𝑥², and 𝑦² and place these values in the
corresponding columns of the table.

The completed table is shown.

Step 3: Substitute in the formula and solve for 𝑟.

$$
r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}}
$$

$$
r = \frac{(6)(682.77) - (153.8)(18.7)}{\sqrt{[(6)(5859.26) - (153.8)^2][(6)(80.67) - (18.7)^2]}} = 0.982
$$

The correlation coefficient suggests a strong positive linear relationship
between the number of cars a rental agency has and its annual income.
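
As a check on the arithmetic in Step 3, the intermediate values can be written out as below. This is a worked sketch that uses only the sums quoted in the substitution, with the square root rounded to two decimal places.

```latex
\begin{align*}
n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right) &= 4096.62 - 2876.06 = 1220.56 \\
n\left(\sum x^2\right) - \left(\sum x\right)^2 &= 35155.56 - 23654.44 = 11501.12 \\
n\left(\sum y^2\right) - \left(\sum y\right)^2 &= 484.02 - 349.69 = 134.33 \\
r &= \frac{1220.56}{\sqrt{(11501.12)(134.33)}} \approx \frac{1220.56}{1242.96} \approx 0.982
\end{align*}
```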

Example 2: Absences and Final Grades. Compute the value of the
correlation coefficient for the data obtained in the study of the number
of absences and the final grade of the seven students in the statistics
class.

Solution:

Step 1: Make a table.

Step 2: Find the values of 𝑥𝑦, 𝑥², and 𝑦²; place these values in the
corresponding columns of the table.

Step 3: Substitute in the formula and solve for 𝑟.

$$
r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}}
$$

$$
r = \frac{(7)(3745) - (57)(511)}{\sqrt{[(7)(579) - (57)^2][(7)(38{,}993) - (511)^2]}} = -0.944
$$

The value of 𝑟 suggests a strong negative relationship between a student’s final
grade and the number of absences a student has. That is, the more absences a
student has, the lower his or her grade.
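
When only the column totals are available, as in the substitution above, 𝑟 can be computed directly from the six sums. The following is a minimal sketch; the helper name `r_from_sums` is hypothetical, and the inputs are the totals quoted in Step 3 of Example 2.

```python
from math import sqrt

def r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Return the numerator, the two bracketed terms, and r, given the sums."""
    numerator = n * sum_xy - sum_x * sum_y
    x_term = n * sum_x2 - sum_x ** 2   # first bracket under the square root
    y_term = n * sum_y2 - sum_y ** 2   # second bracket under the square root
    r = numerator / sqrt(x_term * y_term)
    return numerator, x_term, y_term, r

# Totals from Example 2 (absences x, final grade y, seven students).
numerator, x_term, y_term, r = r_from_sums(
    n=7, sum_x=57, sum_y=511, sum_xy=3745, sum_x2=579, sum_y2=38993
)

print(numerator, x_term, y_term, round(r, 3))
```

The negative numerator is what makes 𝑟 negative; the two bracketed terms are always nonnegative.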

Example 3: Exercise and Milk Consumption. Compute the value of the
correlation coefficient for the data on the number of hours a person
exercises and the amount of milk a person consumes per week.

Subject    No. of Hours    Amount of Milk
A          3               48
B          0               8
C          2               32
D          5               64
E          8               10
F          5               32
G          10              56
H          2               72
I          1               48

Solution:

Step 1: Make a table.

Step 2: Find the values of 𝑥𝑦, 𝑥², and 𝑦²; place these values in the
corresponding columns of the table.

Step 3: Substitute in the formula and solve for 𝑟.

$$
r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}}
$$

$$
r = \frac{(9)(1520) - (36)(370)}{\sqrt{[(9)(232) - (36)^2][(9)(19{,}236) - (370)^2]}} = 0.067
$$

The value of r indicates a very weak positive relationship between the variables.
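
Steps 1 to 3 of Example 3 can be reproduced from the data table with a short script. This is a sketch only: the variable names are arbitrary, and the layout of the printed table is one possible arrangement of the columns 𝑥, 𝑦, 𝑥𝑦, 𝑥², and 𝑦².

```python
from math import sqrt

# Data copied from the Example 3 table: hours of exercise (x), amount of milk (y).
subjects = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]
hours = [3, 0, 2, 5, 8, 5, 10, 2, 1]
milk = [48, 8, 32, 64, 10, 32, 56, 72, 48]

# Step 2: one row per subject with the columns x, y, xy, x^2, y^2.
rows = [(x, y, x * y, x * x, y * y) for x, y in zip(hours, milk)]

# Column totals give the five sums used in the formula.
sum_x, sum_y, sum_xy, sum_x2, sum_y2 = (sum(col) for col in zip(*rows))
n = len(rows)

# Step 3: substitute the sums into the formula for r.
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(f"{'Subject':<8}{'x':>4}{'y':>5}{'xy':>6}{'x^2':>6}{'y^2':>7}")
for name, row in zip(subjects, rows):
    print(f"{name:<8}{row[0]:>4}{row[1]:>5}{row[2]:>6}{row[3]:>6}{row[4]:>7}")
print(f"{'Total':<8}{sum_x:>4}{sum_y:>5}{sum_xy:>6}{sum_x2:>6}{sum_y2:>7}")
print("r =", round(r, 3))
```

The totals row matches the sums substituted in Step 3 (36, 370, 1520, 232, and 19,236).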

The Significance of the Correlation Coefficient.

As stated before, the range of the correlation coefficient is between −1 and
+1. When the value of r is near −1 or +1, there is a strong linear relationship.
When the value of 𝑟 is near 0, the linear relationship is weak or non-existent.
Since the value of 𝑟 is computed from data obtained from samples, there are two
possibilities when 𝑟 is not equal to zero: either the value of 𝑟 is high enough to
conclude that there is a significant linear relationship between the variables, or
the value of 𝑟 is due to chance.

The population correlation coefficient is computed from taking all possible
(𝑥, 𝑦) pairs; it is designated by the Greek letter 𝜌 (rho). The sample correlation
coefficient can then be used as an estimator of 𝜌 if the following assumptions are
valid.

1. The variables 𝑥 and 𝑦 are linearly related.
2. The variables are random variables.
3. The two variables have a bivariate normal distribution.

A bivariate normal distribution means that for the pairs of (𝑥, 𝑦) data
values, the corresponding 𝑦 values have a bell-shaped distribution for any
given 𝑥 value, and the 𝑥 values for any given 𝑦 value have a bell-shaped
distribution.

Formally defined, the population correlation coefficient 𝜌 is the correlation
computed by using all possible pairs of data values (𝑥, 𝑦) taken from a
population.

In hypothesis testing, one of these is true:

𝐻0 : 𝜌 = 0 This null hypothesis means that there is no correlation
between the 𝑥 and 𝑦 variables in the population.

𝐻1 : 𝜌 ≠ 0 This alternative hypothesis means that there is a significant
correlation between the variables in the population.

When the null hypothesis is rejected at a specific level, it means that there
is a significant difference between the value of 𝑟 and 0. When the null hypothesis
is not rejected, it means that the value of 𝑟 is not significantly different from 0
(zero) and is probably due to chance.

Example 4: Using the PPMC table, test the significance of the correlation
coefficient 𝑟 = 0.982, obtained in Example 1, at 𝛼 = 0.01.

Solution:

𝐻0 : 𝜌 = 0 and 𝐻1 : 𝜌 ≠ 0

Since the sample size is 6 and 𝛼 = 0.01, the critical value obtained from
the PPMC table is 0.917. For a significant relationship, a value of 𝑟 greater than
+0.917 or less than −0.917 is needed. Since 𝑟 = 0.982 is greater than 0.917, the
null hypothesis is rejected. Hence, there is enough evidence to say that there is
a significant linear relationship between the variables. See figure.
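
The decision rule in Example 4 can be sketched in a few lines. The lesson looks the critical value up in the PPMC table; the script below instead derives it from the standard relation between 𝑟 and the t statistic with 𝑛 − 2 degrees of freedom, which is not stated in the lesson and is used here only as a cross-check. The function names are hypothetical, and the value 4.604 is assumed to be the two-tailed t critical value for 𝛼 = 0.01 and 4 degrees of freedom, as read from a standard t table.

```python
from math import sqrt

def critical_r(t_critical, n):
    """Convert a two-tailed t critical value (n - 2 d.f.) into a critical r."""
    df = n - 2
    return t_critical / sqrt(t_critical ** 2 + df)

def is_significant(r, r_critical):
    """Reject H0: rho = 0 when r falls beyond +/- the critical value."""
    return abs(r) > r_critical

n = 6
r = 0.982            # from Example 1
t_critical = 4.604   # assumption: t table value for alpha = 0.01 (two-tailed), 4 d.f.

r_crit = critical_r(t_critical, n)
t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)   # equivalent t statistic for r

print("critical r:", round(r_crit, 3))
print("reject H0:", is_significant(r, r_crit))
print("t statistic exceeds t critical:", t_stat > t_critical)
```

Under these assumptions the derived critical value rounds to the 0.917 quoted from the PPMC table, and both forms of the test lead to rejecting the null hypothesis.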

Exercises: Perform the following steps.

a. Compute the value of the correlation coefficient.
b. State the hypotheses.
c. Test the significance of the correlation coefficient at 𝛼 = 0.05, using the
PPMC table.
d. Give a brief explanation of the type of relationship.

1. Emergency Calls and Temperature. An emergency service wishes to see
whether a relationship exists between the outside temperature and the
number of emergency calls it receives for a 7-hour period. The data are
shown.

2. Calories and Cholesterol. The number of calories and the number of
milligrams of cholesterol for a random sample of fast-food chicken
sandwiches from seven restaurants are shown here. Is there a relationship
between the variables?

3. Tall Buildings. An architect wants to determine the relationship between
the height (in feet) of a building and the number of stories in the building.
The data for a sample of 10 buildings in Pittsburgh are shown. Explain the
relationship.

4. Distribution of Population in U.S. Cities. A random sample of U.S. cities
is selected to determine if there is a relationship between the population (in
thousands) of people under 5 years of age and the population (in thousands)
of those 65 years of age and older. The data for the sample are shown here.

5. Commercial Movie Releases. The yearly data have been published
showing the number of releases for each of the commercial movie studios
and the gross receipts for those studios thus far. Based on these data, can it
be concluded that there is a relationship between the number of releases
and the gross receipts?

Critical Values Pearson Product Moment Correlation Coefficient (PPMC)
