Unit-1
OF CORRELATION
Structure
1.0 Introduction
1.1 Objectives
1.2 Correlation: Meaning and Interpretation
1.2.1 Scatter Diagram: Graphical Presentation of Relationship
1.2.2 Correlation: Linear and Non-Linear Relationship
1.2.3 Direction of Correlation: Positive and Negative
1.2.4 Correlation: The Strength of Relationship
1.2.5 Measurements of Correlation
1.2.6 Correlation and Causality
1.0 INTRODUCTION
We measure psychological attributes of people by using tests and scales in order to
describe individuals. At times you will notice that an increase in one characteristic is
associated with an increase in another. For example, individuals who are more
optimistic about the future are more likely to be happy, whereas those who are less
optimistic about the future (i.e., pessimistic about it) are less likely to be happy. As
one variable increases, the other also increases, and as one decreases, the other also
decreases. In statistical language this is referred to as correlation. It is a description
of the “relationship” or “association” between two variables (more than two variables
can also be correlated, as we will see in multiple correlation).
Correlation and Regression
In this unit you will learn about the direction of correlation, that is, positive,
negative and zero correlation. You will also learn about the strength of correlation
and how to measure it. Specifically, you will learn Pearson’s Product Moment
Coefficient of Correlation and how to interpret this correlation coefficient. You will
also learn about the ramifications of Pearson’s r and work through the equations for
the coefficient of correlation with numerical examples.
1.1 OBJECTIVES
After reading and doing exercises in this unit, you will be able to:
• describe and explain concept of correlation;
• plot the scatter diagram;
• explain the concept of direction, and strength of relationship;
• differentiate between various measures of correlations;
• analyse conceptual issues in correlation and causality;
• describe problems suitable for correlation analysis;
• describe and explain concept of Pearson’s Product Moment Correlation;
• compute and interpret Pearson’s correlation by deviation score method and raw score method; and
• test the significance and apply the correlation to the real data.
(Scatterplot: Intelligence on the x-axis, 100 to 140; scores on reasoning task on the y-axis.)
The graph shown above is a scatterplot representing the relationship between intelligence
and the scores on a reasoning task. We have plotted intelligence on the x-axis because it
is treated as the cause of performance on the reasoning task. The intelligence axis
starts from 100 instead of zero simply because the smallest score on intelligence is
104, which is far away from zero. Similarly, the reasoning axis starts from 10 because
the lowest score on reasoning is 12. Then we have plotted each pair of scores. For
example, subject A has a score of 104 on intelligence and 12 on reasoning, so we get
the (x, y) pair (104, 12). We have plotted this pair as a dot at the point where these
two scores intersect in the graph. This is the lowest dot at the left side of the graph.
You can practise plotting a scatter diagram by using the data given in the practice
exercises.
Fig. 2: Scatter showing linearity of the relationship between Intelligence and Scores on
Reasoning Task
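The plotting procedure described above can be sketched in a few lines of Python. This is only an illustration: apart from subject A (104, 12), the scores are hypothetical, and the helper name `text_scatter` is an assumption made for this sketch. It draws a rough character-grid scatter instead of a proper graph, with each axis starting at the smallest observed score.

```python
# Hypothetical scores: only subject A (104, 12) is taken from the text;
# the other five pairs are invented for illustration.
intelligence = [104, 110, 118, 125, 132, 139]
reasoning = [12, 16, 19, 24, 30, 36]


def text_scatter(xs, ys, width=40, height=10):
    """Return a character-grid scatterplot: x runs left to right, y bottom to top."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each score to a column (x) and a row (y) of the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is printed at the bottom
    return "\n".join("".join(line) for line in grid)


plot = text_scatter(intelligence, reasoning)
print(plot)
```

Subject A appears as the lowest dot at the left, and the dots rise towards the right, which is the pattern expected for a positive linear relationship.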
Non-linear Relationship
There are other forms of relationships as well. They are called curvilinear or
non-linear relationships. The Yerkes-Dodson Law, Stevens’ Power Law in psychophysics,
etc. are good examples of non-linear relationships. The relationship between stress
and performance is popularly known as the Yerkes-Dodson Law. It suggests that
performance is poor when stress is too little or too much, and improves when stress
is moderate. Figure 3 shows this relationship. Non-linear relationships cannot be
represented by a straight line.
The performance is poor at extremes and improves with moderate stress. This is one
type of curvilinear relationship.
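A small sketch may help to show why a straight-line measure does not suit such a relationship. The stress and performance scores below are hypothetical, chosen to form a perfect inverted U, and `pearson_r` is a helper written only for this illustration. Although performance is completely determined by stress, the linear correlation comes out as essentially zero.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sum_cross / sqrt(ss_x * ss_y)


# Hypothetical data: performance peaks at moderate stress (stress = 5).
stress = [1, 2, 3, 4, 5, 6, 7, 8, 9]
performance = [25 - (s - 5) ** 2 for s in stress]

r = pearson_r(stress, performance)
print(abs(r) < 1e-9)  # the linear correlation is essentially zero
```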
Product Moment Coefficient of Correlation
(Scatterplot: Intelligence on the x-axis, 80 to 150; Marks Obtained on the y-axis, 40 to 90.)
Fig. 4: Positive correlation: Scatter showing the positive correlation between intelligence
and marks obtained.
Negative Correlation
Negative correlation indicates that as the values of one variable increase, the
values of the other variable decrease. Conversely, as the values of one variable
decrease, the values of the other variable increase. This means that the two variables
move in opposite directions. For example,
a) As intelligence (IQ) increases, the errors on a reasoning task decrease.
b) As hope increases, depression decreases.
Figure 5 shows a scatterplot of a negative relationship. You will notice that
higher scores on the X axis are associated with lower scores on the Y axis, and lower
scores on the X axis are generally associated with higher scores on the Y axis.
In example ‘a’, higher scores on intelligence are associated with lower scores
on errors on the reasoning task. Similarly, as the scores on intelligence go down,
the errors on the reasoning task go up.
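The direction of a relationship can be read from the sign of the summed cross-products of deviations. The following sketch uses hypothetical intelligence and error scores (not data from this unit) to show a negative direction.

```python
# Hypothetical scores for five subjects, invented for illustration.
iq = [90, 100, 110, 120, 130]
errors = [14, 11, 9, 6, 4]

mean_iq = sum(iq) / len(iq)
mean_errors = sum(errors) / len(errors)

# Sum of cross-products of deviations: its sign gives the direction.
sum_cross = sum((x - mean_iq) * (y - mean_errors) for x, y in zip(iq, errors))

if sum_cross < 0:
    direction = "negative"
elif sum_cross > 0:
    direction = "positive"
else:
    direction = "zero"

print(direction)
```

Above-average intelligence scores are paired with below-average error scores, so most cross-products are negative and so is their sum.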
(Scatterplot: Intelligence on the x-axis, -3 to 3; Errors on Reasoning Task on the y-axis.)
Fig. 5: Negative correlation: Scatter showing the negative correlation between intelligence
and errors on reasoning task
No Relationship
Until now you have learned about positive and negative correlations. It is also
possible that there is no relationship between x and y, that is, the two variables do
not share any relationship. If they do not share any relationship (technically, the
correlation coefficient is zero), then the direction of the correlation is neither
positive nor negative. This is often called zero correlation or no correlation.
(Please note that ‘zero order correlation’ is a different term from ‘zero correlation’;
we will discuss it afterwards.)
For example, what would you guess is the relationship between shoe size and intelligence?
This sounds like an odd question because there is no reason to expect any relationship
between them. So there is no relationship between these two variables.
The data of one hundred individuals are plotted in Figure 6. It shows the scatterplot
for no relationship.
(Figure 6: scatterplot with Intelligence on the x-axis, 60 to 160, and shoe size on the y-axis, 0 to 10.)
You can understand the strength of association as the common variance between
two correlated variables. The correlation coefficient is NOT a percentage. Let us
explain this point. Every variable has variance. We denote it as $S_X^2$ (variance
of X). Similarly, Y also has its own variance ($S_Y^2$). In the previous block you have
learned to compute them. Out of its total variance, X shares some variance with Y.
This shared variance is called covariance.
Figure 8 below explains the concept of shared variance. The circle X
indicates the variance of X. Similarly, the circle Y indicates the variance of Y. The
overlapping part of X and Y, indicated by shaded lines, shows the shared variance
between X and Y. One can compute the shared variance.
(Figure 8: two overlapping circles, labelled Variance of X and Variance of Y; the shaded overlap is the variance shared by X and Y.)
Percentage of common variance between X and Y = $r_{XY}^2 \times 100$ (eq. 1.2)
For instance, if the correlation between X and Y is 0.50 then the percent of variation
shared by X and Y can be calculated by using equation 1.2 as follows.
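The calculation for r = 0.50 can be sketched as follows. The function name `percent_shared_variance` is only an illustrative choice; the arithmetic is equation 1.2.

```python
def percent_shared_variance(r):
    """Percentage of common variance (eq. 1.2): r squared times 100."""
    return r ** 2 * 100


print(percent_shared_variance(0.50))  # 25.0
```

A correlation of 0.50 therefore means that X and Y share 25 percent of their variance, not 50 percent, which is why the coefficient itself should not be read as a percentage.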
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$ (eq. 1.3)
You have learned this in the first block. We will need to use this as a basic element
to compute correlation.
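Variance
$S_X^2 = \frac{\sum (X - \bar{X})^2}{n}$ (eq. 1.4)
$\mathrm{Cov}_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n}$ (eq. 1.5)
The correlation coefficient is the covariance divided by the product of the two
standard deviations:
$r = \frac{\mathrm{Cov}_{XY}}{S_X S_Y}$ (eq. 1.6)
Where,
$S_X$ is the standard deviation of X
$S_Y$ is the standard deviation of Y.
Since it can be shown that the absolute value of $\mathrm{Cov}_{XY}$ is always smaller
than or equal to $S_X S_Y$, the absolute value of the correlation coefficient cannot
exceed 1. The denominator of this formula ($S_X S_Y$) is always positive, so the sign
of r is the sign of the covariance. This is the reason for the -1 to +1 range of the
correlation coefficient. By substituting the covariance equation (eq. 1.5) for
covariance, we can rewrite equation 1.6 as
$r = \frac{\sum (X - \bar{X})(Y - \bar{Y}) / n}{S_X S_Y}$ (eq. 1.7)
$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n S_X S_Y}$ (eq. 1.8)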
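The chain from covariance to r can be sketched in Python as below. The scores are hypothetical and the function name is an assumption made for illustration; the variance, covariance and standard deviations all divide by n, as in the equations above.

```python
from math import sqrt


def pearson_r_deviation(xs, ys):
    """Pearson's r by the deviation score method (eq. 1.5 to eq. 1.8)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)  # SD of X, dividing by n
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)  # SD of Y, dividing by n
    covariance = sum_cross / n                          # eq. 1.5
    r_from_cov = covariance / (s_x * s_y)               # eq. 1.6
    r_direct = sum_cross / (n * s_x * s_y)              # eq. 1.8
    return r_from_cov, r_direct


# Hypothetical scores, invented for illustration.
x_scores = [2, 4, 6, 8, 10]
y_scores = [1, 3, 2, 5, 4]

r_from_cov, r_direct = pearson_r_deviation(x_scores, y_scores)
print(round(r_from_cov, 3), round(r_direct, 3))  # 0.8 0.8
```

Both routes give the same value because equation 1.8 is only equation 1.6 with the covariance written out.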
1.3.3 Numerical Example
Now we shall use this formula to compute Pearson’s correlation coefficient. For this
purpose we will use the following data. The cognitive theory of depression argues
that hopelessness is associated with depression. Aaron Beck developed instruments
to measure depression and hopelessness. The BHS (Beck Hopelessness Scale) and
the BDI (Beck Depression Inventory) are measures of hopelessness and depression,
respectively.
Let us take hypothetical data of 10 individuals to whom these scales were
administered. (In reality, such a small data set is not sufficient to make sense of a
correlation; roughly, at least 50 to 100 observations are required.) We can hypothesize
that the correlation between hopelessness and depression will be positive. These
hypothetical data are given below in Table 2.
Table 2: Hypothetical data of 10 subjects on BHS and BDI
n = 10, $\sum X = 110$, $\sum Y = 120$, $\sum (X - \bar{X})^2 = 156$, $\sum (Y - \bar{Y})^2 = 100$, $\sum (X - \bar{X})(Y - \bar{Y}) = 117$
$\bar{X} = 11$, $\bar{Y} = 12$
$S_X = \sqrt{\sum (X - \bar{X})^2 / n} = \sqrt{156 / 10} = 3.95$
$S_Y = \sqrt{\sum (Y - \bar{Y})^2 / n} = \sqrt{100 / 10} = 3.16$
Step 1. You need scores of subjects on two variables. We have scores of ten
subjects on two variables, BHS and BDI.
Step 2. Then list the pairs of scores on the two variables in two columns. The order
will not make any difference. Remember, the two scores of the same individual should
be kept together. Label one variable as X and the other as Y. We label BHS as X and
BDI as Y.
Step 3. Compute the mean of variable X and variable Y. They were found to be 11 and
12 respectively.
Step 4. Compute the deviation of each X score from its mean ($\bar{X}$) and each Y
score from its own mean ($\bar{Y}$). These are shown in the columns labeled
$X - \bar{X}$ and $Y - \bar{Y}$. As you have learned earlier, the sum of each of these
columns has to be zero.
Step 5. Square each of these deviations.
Step 6. Then compute the sum of these squared deviations of X and Y. The sum of
squared deviations for X is 156 and for Y it is 100.
Step 7. Divide each sum by n and take the square root to obtain the standard
deviations of X and Y. $S_X$ was found to be $\sqrt{156/10} = 3.95$. Similarly, $S_Y$
was found to be $\sqrt{100/10} = 3.16$.
Step 8. Compute the cross-product of the deviations of X and Y. These cross-products
are shown in the last column labeled $(X - \bar{X})(Y - \bar{Y})$.
Step 9. Then obtain the sum of these cross-products. It was found to be 117. Now,
we have all the elements required for computing r.
Step 10. Use the formula of r to compute the correlation. The sum of the cross-products
of deviations is the numerator, and the product of n, $S_X$ and $S_Y$ is the
denominator: $r = 117 / (10 \times 3.95 \times 3.16) = 0.937$. The value of r is 0.937
in this example.
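The final steps can be checked with a short sketch that starts from the sums reported for the BHS and BDI data (n = 10, sums of squared deviations 156 and 100, sum of cross-products 117). The variable names are illustrative, and the standard deviations are obtained by dividing the sums of squares by n (not n - 1) before taking the square root.

```python
from math import sqrt

n = 10
ss_x = 156       # sum of squared deviations of X (BHS)
ss_y = 100       # sum of squared deviations of Y (BDI)
sum_cross = 117  # sum of cross-products of deviations

# Standard deviations, dividing by n before taking the square root.
s_x = sqrt(ss_x / n)
s_y = sqrt(ss_y / n)

# r is the sum of cross-products over n * S_X * S_Y.
r = sum_cross / (n * s_x * s_y)

print(round(s_x, 2), round(s_y, 2), round(r, 3))  # 3.95 3.16 0.937
```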
$H_0: \rho_{\mathrm{BHS\,BDI}} = 0$
$H_A: \rho_{\mathrm{BHS\,BDI}} \neq 0$
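One common way to test this null hypothesis, assumed here rather than taken from the text, is the t test for a correlation with n - 2 degrees of freedom, $t = r\sqrt{n - 2} / \sqrt{1 - r^2}$. The sketch below applies it to r = 0.937 and n = 10; the critical value 2.306 is the tabled two-tailed value at the .05 level for 8 degrees of freedom.

```python
from math import sqrt

r = 0.937  # sample correlation between BHS and BDI
n = 10     # sample size

df = n - 2
t = r * sqrt(df) / sqrt(1 - r ** 2)

# Tabled two-tailed critical t at the .05 level for df = 8.
critical_t = 2.306
reject_null = abs(t) > critical_t

print(round(t, 2), reject_null)
```

Since the obtained t is far larger than the critical value, the null hypothesis of zero population correlation would be rejected under these assumptions.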
1.3.5 Adjusted r
Pearson’s correlation coefficient (r) calculated on a sample is not an unbiased
estimate of the population coefficient (ρ). When the number of observations (sample
size) is small, the sample correlation is a biased estimate of the population correlation.
In order to reduce this bias, the calculated correlation coefficient is adjusted. This is
called the adjusted correlation coefficient ($r_{adj}$).
$r_{adj} = \sqrt{1 - \frac{(1 - r^2)(n - 1)}{n - 2}}$
Where,
$r_{adj}$ = adjusted r
$r^2$ = the square of Pearson’s correlation coefficient obtained on the sample,
n = sample size
In the case of our data, presented in table 1.2, the correlation between BHS and BDI
is +.937, obtained on a sample of 10. The adjusted r can be calculated as follows
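A minimal sketch of this calculation is given below. It assumes the square-root form of the adjustment, $r_{adj} = \sqrt{1 - (1 - r^2)(n - 1)/(n - 2)}$; the function name is illustrative.

```python
from math import sqrt


def adjusted_r(r, n):
    """Adjusted r, assuming the square-root form of the adjustment formula."""
    return sqrt(1 - (1 - r ** 2) * (n - 1) / (n - 2))


r_adj = adjusted_r(0.937, 10)
print(round(r_adj, 3))  # 0.929
```

The adjusted value is slightly smaller than the sample value of .937, as expected when the sample is small.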
(Two scatterplots of MARKS, on the y-axis, against hours studied, on the x-axis.)
Fig. 1.9a: Scatter showing full range on both variables
Fig. 1.9b: Scatter with restricted range on hours studied
1.4.1 Outliers
Outliers are extreme scores on one or both of the variables. The presence of
outliers can distort the value of the correlation. The strength of the correlation is
affected by the presence of an outlier. Suppose you want to compute the correlation
between height and weight. They are known to correlate positively. Look at the figure
below. One of the individuals has a low score on weight and a high score on
height (probably an anorexia patient).
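The effect of a single outlier can be sketched with hypothetical height and weight scores (these are not the data behind the figure). The helper `pearson_r` is written only for this illustration; the last pair added is a tall but very light individual.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sum_cross / sqrt(ss_x * ss_y)


# Hypothetical scores, invented for illustration.
heights = [3.5, 4.0, 4.5, 5.0, 5.5, 6.0]
weights = [28, 35, 41, 50, 56, 64]

r_without = pearson_r(heights, weights)
# Add one outlier: high on height, low on weight.
r_with = pearson_r(heights + [6.3], weights + [30])

print(round(r_without, 2), round(r_with, 2))
```

With only seven observations, one extreme pair is enough to pull the coefficient down sharply, which is why the scatter should be inspected before r is interpreted.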
Figure 1.10. Impact of an outlier observation on correlation. Without the outlier, the
correlation is 0.95. The presence of an outlier has drastically reduced the correlation
coefficient to 0.45.
(Scatterplot: Height on the x-axis, 3.0 to 6.5; Weight on the y-axis, 20 to 70; r = +.45.)
1.4.2 Curvilinearity
We have already discussed the issue of linearity of the relationship. Pearson’s
product moment correlation is appropriate if the relationship between two variables
is linear. If the relationship is curvilinear, then other techniques need to be used. If
the degree of curvilinearity is not very high, that is, high scores on both variables go
together and low scores go together but the pattern is not linear, then a useful option
is Spearman’s rho.
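As a rough illustration, Spearman’s rho can be obtained by ranking each variable and computing Pearson’s r on the ranks. The sketch below assumes there are no tied scores and uses hypothetical data in which Y always rises with X, but not along a straight line; the helper names are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sum_cross / sqrt(ss_x * ss_y)


def ranks(values):
    """Rank scores from 1 (lowest) upwards; assumes no tied scores."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


# Hypothetical data: Y increases with X, but along a curve.
x_scores = [1, 2, 3, 4, 5, 6]
y_scores = [x ** 3 for x in x_scores]

r = pearson_r(x_scores, y_scores)
rho = pearson_r(ranks(x_scores), ranks(y_scores))

print(round(r, 3), round(rho, 3))
```

Pearson’s r falls short of 1 because the points do not lie on a straight line, while rho equals 1 because the rank order of the two variables agrees perfectly.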
$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n S_X S_Y}$
The denominator of the correlation formula can be written as
$\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}$ (eq. 1.10)
which is $\sqrt{SS_X SS_Y}$, where
$SS_X = \sum (X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}$ (eq. 1.12)
and
$SS_Y = \sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$ (eq. 1.13)
The numerator of the correlation formula can be written as
$\sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n}$ (eq. 1.14)
So r can be calculated by the following formula, which is a raw score formula:
$r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{SS_X SS_Y}}$ (eq. 1.15)
For the BHS and BDI data:
$\sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 117$
$r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{SS_X SS_Y}} = \frac{117}{\sqrt{156 \times 100}} = 0.937$
Readers may find one of the two methods easier. There is nothing special about either
method; both give the same value, and what matters is computing the correlation correctly.
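That the two methods agree can be verified with a short sketch. The scores are hypothetical and the function names are illustrative; the first function follows the raw score formula and the second the deviation score formula.

```python
from math import sqrt


def r_raw_score(xs, ys):
    """Pearson's r by the raw score method (eq. 1.12 to eq. 1.15)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    ss_x = sum(x ** 2 for x in xs) - sum_x ** 2 / n
    ss_y = sum(y ** 2 for y in ys) - sum_y ** 2 / n
    return (sum_xy - sum_x * sum_y / n) / sqrt(ss_x * ss_y)


def r_deviation_score(xs, ys):
    """Pearson's r by the deviation score method."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return sum_cross / (n * s_x * s_y)


# Hypothetical scores, invented for illustration.
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 5]

r_raw = r_raw_score(x_scores, y_scores)
r_dev = r_deviation_score(x_scores, y_scores)
print(round(r_raw, 3), round(r_dev, 3))
```

The raw score method avoids computing deviations one by one, which is convenient for hand calculation; the deviation score method keeps the link with covariance visible.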
X Y
12 20
13 22
15 28
17 31
11 22
9 24
8 18
10 21
11 23
7 16
2) Plot the scatter for the following example. The data were collected on perceived
stress and anxiety for 10 subjects. Compute Pearson’s correlation between them. State
the null hypothesis and test it. Repeat the exercise after deleting the pair that
clearly looks like an outlier observation.
Perceived stress    Anxiety
9 12
8 11
7 9
4 5
8 9
4 6
6 8
14 2
7 11
11 9
9 11
3) Data showing scores on time taken to complete 200 meters race and duration
of practice for 5 runners. Plot the scatter. Compute mean, variance, SD, and
covariance. Compute correlation coefficient. Write the null hypothesis.
Dissatisfaction with work    Irritability scores
12 5
16 7
19 9
27 13
30 16
25 11
22 6
26 14
11 7
17 9
19 14
21 18
23 19