What I Know
Directions: Read the statement carefully and choose the best answer.
1. What does the value of r indicate?
a. The degree of relationship between two variables.
b. The direction of the correlation coefficient.
c. The line closest to the points.
d. The relationship of the points to the trend line.
2. All of the following statements are true, EXCEPT:
a. Perfect correlation happens when other variables are controlled like we do in our
experiments.
b. The direction of the line tells the direction of correlation that exists between the
variables.
c. Direction of the correlation indicates the closeness of the points to the trend line.
d. The relationship between two variables can also be described in terms of its strength.
3. What is the range of moderately high correlation?
a. > 0 to < ± 0.25 b. ± 0.50 to < ± 0.75 c. ± 0.75 to < ± 1 d. ± 0.25 to < ± 0.50
4. In the interpretation of a computed r, what two elements should be emphasized?
a. It focuses on the strength of association that corresponds to the range.
b. It focuses on both the direction and strength of correlation.
c. It focuses on the sign of the correlation coefficient.
d. It focuses on the absolute value of the computed value only.
5. Suppose r1 = 0.7 while r2 = -0.7. Which one indicates a stronger correlation?
a. -0.7 b. both 0.7 and -0.7 c. 0.7 d. none
LESSON 1: THE PEARSON PRODUCT-MOMENT CORRELATION
What’s In
Directions: Identify the direction and the strength of each correlation shown. Choose
your answer from the box.
a. Strong positive correlation b. Moderate positive correlation
c. No correlation d. Moderate negative correlation
e. Strong negative correlation f. Perfect correlation
[Scatterplots 1 to 5 appear here in the original module.]
1. 2. 3. 4. 5.
What’s New
In the previous lesson, we learned about bivariate data. We also learned how to
draw the scatterplot of a pair of variables and interpret it qualitatively in terms of its
direction and strength of association using the trend of the points. Sometimes, a scatterplot
does not clearly show whether a correlation exists between the two variables. This happens
when the correlation is very weak, which makes the trend line very difficult to identify.
Thus, we need a more accurate interpretation of the scatterplot using quantitative
methods. Here, we will compute a value that indicates whether a correlation between the
two variables exists, and we will describe its strength using an arbitrary scale that we will
set up. So, brace yourself for the next lessons.
TASK: Research the life of Karl Pearson and his important contributions to the
field of statistics. Do not forget to copy and study the formula he proposed for
computing the coefficient of correlation (r).
What is It
LESSON 1
The correlation coefficient, computed from sample data, measures the strength and
direction of a linear relationship between two variables. There are several coefficients of
correlation. The one most commonly used for linear correlation is the Pearson
product-moment coefficient of correlation, symbolized by r and named in honor of Karl
Pearson, the statistician who did extensive research in this area.
The symbol for the sample Correlation Coefficient is “r”. To compute r, we use the
formula,
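$$r = \frac{n\sum XY - \sum X \cdot \sum Y}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}$$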
where r is the Pearson correlation coefficient, which indicates the degree of
relationship between the two variables,
X represents the values in the first set of data,
Y represents the values in the second set of data, and
n is the total number of data pairs.
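The formula can be sketched in Python as shown below. This is only an illustration: the function name `pearson_r` and the small data set (`hours`, `marks`) are made up for this example and are not part of the module.

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed with the raw-score formula given above."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must contain the same number of values")
    n = len(xs)                                   # number of data pairs
    sum_x = sum(xs)                               # sum of X
    sum_y = sum(ys)                               # sum of Y
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of XY
    sum_x2 = sum(x * x for x in xs)               # sum of X squared
    sum_y2 = sum(y * y for y in ys)               # sum of Y squared
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return numerator / denominator

# Made-up data, used only to exercise the function.
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 5, 4, 5]
r = pearson_r(hours, marks)
print(round(r, 4))  # 0.7746
```

For this made-up data the numerator is 5(66) − (15)(20) = 30 and the denominator is √(50 × 30) ≈ 38.73, which gives r ≈ 0.77.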
Analyze the diagram below:
The Pearson correlation coefficient, r, can take a range of values from +1 to -1.
▪ A value greater than 0 indicates a positive correlation; that is, as the value of one
variable increases, so does the value of the other variable.
▪ A value less than 0 indicates a negative association; that is, as the value of one
variable increases, the value of the other variable decreases.
▪ A value of 0 indicates that there is no correlation between the two variables.
The direction of the scattered points tells the direction of the correlation that exists
between the variables.
Explore the Correlation Scale.
The stronger the association of the two variables, the closer the Pearson correlation
coefficient, r, will be to either +1 or -1 depending on whether the relationship is positive or
negative, respectively. See table below (Table of range of values).
PEARSON r                QUALITATIVE DESCRIPTION
± 1                      Perfect
± 0.75 to < ± 1          Very high
± 0.50 to < ± 0.75       Moderately high
± 0.25 to < ± 0.50       Moderately low
> 0 to < ± 0.25          Very low
0                        No correlation
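As a rough sketch, the scale in the table can be written as a lookup in Python. The function names `describe_strength` and `describe_direction` are hypothetical and chosen only for this illustration.

```python
def describe_strength(r):
    """Map a computed r to the qualitative description in the table of ranges."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)  # strength ignores the sign
    if size == 1:
        return "Perfect"
    if size >= 0.75:
        return "Very high"
    if size >= 0.50:
        return "Moderately high"
    if size >= 0.25:
        return "Moderately low"
    if size > 0:
        return "Very low"
    return "No correlation"

def describe_direction(r):
    """The sign of r gives the direction of the correlation."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"

print(describe_direction(-0.3), describe_strength(-0.3))  # negative Moderately low
print(describe_direction(0.9), describe_strength(0.9))    # positive Very high
```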
Different relationships and their correlation coefficients are shown in the diagram
below:
Achieving a value of +1 or -1 means that all your data points are included on the line
of best fit – there are no data points that show any variation away from this line. Values
for r between +1 and -1 (for example, r = 0.7 or -0.3) indicate that there is variation around
the line of best fit. The closer the value of r is to 0, the greater the variation around the
line of best fit.
The strength of the correlation reflects how close the points are to the trend line. The
closer the points are to the trend line, the stronger the relationship is.
The following data show the scores of five students in Statistics and Physics.
Determine if there is a relationship between the scores in Physics and Statistics. Interpret
the results.
STUDENT    SCORE IN STATISTICS (X)    SCORE IN PHYSICS (Y)
Alfonso    3                          5
Frances    9                          8
Rafael     10                         10
James      12                         9
Loida      7                          8
STEPS AND SOLUTION
1. Construct a table as shown below.
Student    X     Y     X²    Y²    XY
Alfonso    3     5
Frances    9     8
Rafael     10    10
James      12    9
Loida      7     8
2. Complete the table.
Square all entries in the X column. Put them under the X² column.
Square all entries in the Y column. Put them under the Y² column.
Multiply the entries in the X and Y columns. Put them under the XY column.
Student    X     Y     X²    Y²    XY
Alfonso    3     5     9     25    15
Frances    9     8     81    64    72
Rafael     10    10    100   100   100
James      12    9     144   81    108
Loida      7     8     49    64    56
3. Get the sum of all entries in the X column. This is ∑X.
Get the sum of all entries in the Y column. This is ∑Y.
Get the sum of all entries in the X² column. This is ∑X².
Get the sum of all entries in the Y² column. This is ∑Y².
Get the sum of all entries in the XY column. This is ∑XY.
Student    X     Y     X²    Y²    XY
Alfonso    3     5     9     25    15
Frances    9     8     81    64    72
Rafael     10    10    100   100   100
James      12    9     144   81    108
Loida      7     8     49    64    56
Sum        41    40    383   334   351
           ∑X    ∑Y    ∑X²   ∑Y²   ∑XY
4. Substitute the values obtained from step 3 in the formula and solve.
Here n = 5 because there are five pairs of values.
$$r = \frac{n\sum XY - \sum X \cdot \sum Y}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}$$
$$r = \frac{5(351) - (41)(40)}{\sqrt{\left[5(383) - (41)^{2}\right]\left[5(334) - (40)^{2}\right]}} = \frac{115}{\sqrt{(234)(70)}} \approx \frac{115}{127.98}$$
$$r = 0.90$$
Therefore, there is a strong positive correlation between the students' scores in
Statistics and Physics.
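The arithmetic in steps 2 to 4 can be checked with a short Python sketch. The helper name `r_from_sums` and the variable names are assumptions made for this illustration; the scores are the ones from the table of five students.

```python
import math

def r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson r from the column totals produced in step 3."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

statistics_scores = [3, 9, 10, 12, 7]   # X
physics_scores = [5, 8, 10, 9, 8]       # Y

n = len(statistics_scores)
sum_x = sum(statistics_scores)
sum_y = sum(physics_scores)
sum_x2 = sum(x * x for x in statistics_scores)
sum_y2 = sum(y * y for y in physics_scores)
sum_xy = sum(x * y for x, y in zip(statistics_scores, physics_scores))

r = r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)  # 41 40 383 334 351
print(round(r, 2))                           # 0.9
```

The printed totals match the last row of the table, and the rounded result agrees with r = 0.90.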
LESSON 2
The formula for computing r is,
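$$r = \frac{n\sum XY - \sum X \cdot \sum Y}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}$$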
The correlation coefficient formula is used to find how strong the relationship between
two sets of data is. The formula returns a value between -1 and 1, where:
• 1 indicates a perfect positive relationship.
• -1 indicates a perfect negative relationship.
• A result of zero indicates no relationship at all.
Meaning
✓ A correlation coefficient of 1 means that for every increase in one variable, there is
an increase of a fixed proportion in the other. For example, shoe sizes go up in
(almost) perfect correlation with foot length.
✓ A correlation coefficient of -1 means that for every increase in one variable, there is
a decrease of a fixed proportion in the other. For example, the amount of gas in a tank
decreases in (almost) perfect correlation with speed.
✓ Zero means that an increase in one variable is not accompanied by a consistent increase
or decrease in the other. The two just aren’t linearly related.
The absolute value of the correlation coefficient gives us the strength of the
relationship. The larger the absolute value, the stronger the relationship. For example,
|-0.75| = 0.75, which indicates a stronger relationship than 0.65.
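A minimal Python sketch of this comparison follows. The function name `stronger` is hypothetical; it simply compares absolute values and ignores the sign.

```python
def stronger(r1, r2):
    """Return the coefficient with the larger absolute value.

    Returns None when both coefficients are equally strong.
    """
    if abs(r1) > abs(r2):
        return r1
    if abs(r2) > abs(r1):
        return r2
    return None

print(stronger(-0.75, 0.65))  # -0.75
print(stronger(0.4, -0.4))    # None
```

The second call returns None because the two values differ only in direction, not in strength.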
Let’s find the value of the correlation coefficient from the table below.
SUBJECT AGE X GLUCOSE LEVEL Y
1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81
STEP 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².
Subject Age x Glucose level y xy x² y²
1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81
STEP 2: Multiply x and y together to fill the xy column. For example, row 1 would be
43 × 99 = 4,257.
Subject Age x Glucose level y xy x² y²
1 43 99 4257
2 21 65 1365
3 25 79 1975
4 42 75 3150
5 57 87 4959
6 59 81 4779
STEP 3: Take the square of the numbers in the x column, and put the results in the
x² column.
Subject Age x Glucose level y xy x² y²
1 43 99 4257 1849
2 21 65 1365 441
3 25 79 1975 625
4 42 75 3150 1764
5 57 87 4959 3249
6 59 81 4779 3481
STEP 4: Take the square of the numbers in the y column, and put the results in the
y² column.
Subject Age x Glucose level y xy x² y²
1 43 99 4257 1849 9801
2 21 65 1365 441 4225
3 25 79 1975 625 6241
4 42 75 3150 1764 5625
5 57 87 4959 3249 7569
6 59 81 4779 3481 6561
STEP 5: Add up all of the numbers in the columns and put the result at the bottom of the
column. The Greek letter sigma (Σ) is a short way of saying “sum of.”
Subject Age x Glucose level y xy x² y²
1 43 99 4257 1849 9801
2 21 65 1365 441 4225
3 25 79 1975 625 6241
4 42 75 3150 1764 5625
5 57 87 4959 3249 7569
6 59 81 4779 3481 6561
Σ 247 486 20485 11409 40022
STEP 6: Use the following correlation coefficient formula.
$$r = \frac{n\sum XY - \sum X \cdot \sum Y}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}$$
From our table:
• Σx = 247
• Σy = 486
• Σxy = 20,485
• Σx² = 11,409
• Σy² = 40,022
• n is the sample size, in our case n = 6
$$r = \frac{6(20{,}485) - (247 \times 486)}{\sqrt{\left[6(11{,}409) - 247^{2}\right]\left[6(40{,}022) - 486^{2}\right]}} = 0.5298$$
The answer is: 2868 / 5413.27 = 0.529809
The range of the correlation coefficient is from -1 to 1. Our result is 0.5298, which
means there is a moderately high positive correlation between the variables.
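Steps 1 to 6 can be reproduced with a short Python sketch that builds the same chart and totals. The variable names (`ages`, `glucose`, `rows`, `totals`) are assumptions made for this illustration; the data are the six subjects from the table.

```python
import math

ages = [43, 21, 25, 42, 57, 59]       # x
glucose = [99, 65, 79, 75, 87, 81]    # y

# Steps 1 to 4: one row per subject with the columns x, y, xy, x^2, y^2.
rows = [(x, y, x * y, x * x, y * y) for x, y in zip(ages, glucose)]

# Step 5: add up each column.
totals = tuple(sum(column) for column in zip(*rows))

print(f"{'x':>5} {'y':>5} {'xy':>7} {'x^2':>7} {'y^2':>7}")
for row in rows:
    print(f"{row[0]:>5} {row[1]:>5} {row[2]:>7} {row[3]:>7} {row[4]:>7}")
print(f"{totals[0]:>5} {totals[1]:>5} {totals[2]:>7} {totals[3]:>7} {totals[4]:>7}")

# Step 6: substitute the totals into the formula.
n = len(rows)
sum_x, sum_y, sum_xy, sum_x2, sum_y2 = totals
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
print(numerator, round(denominator, 2), round(r, 4))  # 2868 5413.27 0.5298
```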
✓ Assumptions
For the Pearson r correlation, both variables should be normally distributed (normally
distributed variables have a bell-shaped curve). Other assumptions include linearity and
homoscedasticity. Linearity assumes a straight-line relationship between the two
variables, and homoscedasticity assumes that the data are equally distributed about the
regression line.
Solve for the value of the correlation coefficient for the data obtained in the study of
age and blood pressure given below.
SOLUTION:
STEP 1. Make a table.
Subject Age x Pressure y xy x² y²
A 43 128
B 48 120
C 56 135
D 61 143
E 67 141
F 70 152
STEP 2. Find the values of xy, x², and y², and place these values in the corresponding
columns of the table.
Subject Age x Pressure y xy x² y²
A 43 128 5504 1849 16384
B 48 120 5760 2304 14400
C 56 135 7560 3136 18225
D 61 143 8723 3721 20449
E 67 141 9447 4489 19881
F 70 152 10640 4900 23104
∑ ∑x = 345 ∑y = 819 ∑xy = 47,634 ∑x² = 20,399 ∑y² = 112,443
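STEP 3. Substitute in the formula and solve for r.
$$r = \frac{n\sum XY - \sum X \cdot \sum Y}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}$$
$$r = \frac{6(47{,}634) - (345 \times 819)}{\sqrt{\left[6(20{,}399) - (345)^{2}\right]\left[6(112{,}443) - (819)^{2}\right]}} = 0.897$$
The correlation coefficient suggests a strong positive relationship between age and blood
pressure.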
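As a cross-check, the same value can be obtained in Python in two ways: with the raw-score formula used in this lesson, and with the equivalent deviation-from-the-mean form, r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]. The deviation form is not stated in this module and is included here only as an assumption-labeled check; the function names are made up for this sketch.

```python
import math

ages = [43, 48, 56, 61, 67, 70]                # x
pressures = [128, 120, 135, 143, 141, 152]     # y

def r_raw_scores(xs, ys):
    """Raw-score formula used throughout this lesson."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

def r_deviations(xs, ys):
    """Equivalent form based on deviations from the means (not from the module)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

print(round(r_raw_scores(ages, pressures), 3))  # 0.897
print(round(r_deviations(ages, pressures), 3))  # 0.897
```

Both forms agree because multiplying the numerator and denominator of the deviation form by n gives the raw-score formula.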
What’s More
I. Directions: Calculate r and make a generalization regarding the information that you get
from the computed correlation coefficient for each of the following:
a. ∑X = 225, ∑Y = 22, ∑X² = 9653, ∑Y² = 143, ∑XY = 651, n = 6
b. ∑X = 32, ∑Y = 1105, ∑X² = 220, ∑Y² = 364525, ∑XY = 3402, n = 6
c. ∑X = 180, ∑Y = 147, ∑X² = 6914, ∑Y² = 5273, ∑XY = 4013, n = 7
II. Directions: Solve the Problem.
The following are the heights of ten fathers and their eldest sons, in inches:
Heights of the Father 71 69 67 68 68 66 70 72 65 60
Heights of the Eldest Son 71 69 69 65 66 63 68 70 60 58
QUESTION: Do the data support the hypothesis that height is hereditary? Explain.
Accompany your explanation with statistical computations.
What I Have Learned
KEYPOINTS
▪ The formula for computing r is
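$$r = \frac{n\sum XY - \sum X \cdot \sum Y}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}$$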
▪ The interpretation of r focuses on both the direction and strength of correlation.
▪ The direction of correlation is indicated by the sign of r while its strength is indicated by
the absolute value of the computed value.
▪ To make the interpretation of the strength of association more objective, we use a
correlation scale showing the ranges of r and the corresponding qualitative description.
The Meaning of the Correlation Coefficient
1. If the trend line contains all the points in the scatterplot and the line rises to the right,
we conclude that there is a perfect positive correlation between the two variables. The
computed r is 1.
2. If all the points fall on a trend line that falls to the right, then there exists a perfect
negative correlation between the pair of variables. The computed value of r is -1.
3. If the trend line does not exist, there is no correlation between the pair of variables. This
is confirmed by the computed value of r, which is 0.
4. The absolute value of r indicates the strength of correlation between the two variables.
The direction of correlation is indicated by the sign (positive or negative) of r.
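Points 1 and 2 can be illustrated with a small Python sketch in which every point lies exactly on a straight line. The data and the function name `pearson_r` are made up for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed with the raw-score formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]     # every point on a line that rises
falling = [10 - 3 * x for x in xs]   # every point on a line that falls

print(pearson_r(xs, rising))    # 1.0
print(pearson_r(xs, falling))   # -1.0
```

Because no point varies away from the line, the computed r reaches the ends of the scale.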
What I Can Do
Directions: Briefly answer the Self-Assessment Questions (SAQ) below.
1. Why do we study Pearson’s Correlation Coefficient?
2. How can we determine the strength of association based on the Pearson correlation
coefficient?
3. How do we make our interpretation of the strength of relationship more objective?
4. Cite a real-life application where Pearson’s Correlation Coefficient is used.
Assessment
I. Directions: Read the statement carefully and choose the best answer.
For items 1 to 5, complete the table below.
Consider the scores obtained in Math (X) and Statistics (Y) by 10 students.
Observation Math Score (X) Stat Score (Y) X2 Y2 XY
1 5 2 25 4 10
2 8 7 64 49 56
3 10 8 100 64 80
4 12 9 144 81 108
5 12 10 144 100 120
6 14 12 196 144 168
7 15 14 225 196 210
8 16 10 256 100 160
9 18 16 324 256 288
10 20 12 400 144 240
Sum
1. The ∑X2 is equal to ________.
a. 1118 b. 1138 c. 1878 d. 1873
2. Find ∑XY.
a. 1440 b. 1040 c. 1400 d. 1140
3. How many respondents are being observed?
a. 20 b. 12 c. 10 d. 6
4. Based on the given data, solve for the Pearson’s correlation coefficient.
a. 0.78 b. 0.87 c. 0.86 d. 0.76
5. Evaluate what conclusion can be derived from the result of r obtained in the data.
a. There is no relationship between math scores and statistics scores of the
students.
b. There is a strong negative relationship between math scores and statistics scores
of the students.
c. There is a moderately positive relationship between math scores and statistics
scores of the students.
d. There is a strong positive relationship between math scores and statistics scores
of the students.