Chapter 16
Correlation
Consider the following two statements:
1. There is a positive relationship between cigarette smoking and lung damage.
2. There is a negative relationship between being overweight and life expectancy.
The first statement implies that there is evidence that if you score high on one
variable “cigarette smoking” you are likely to score high on the other variable “lung
damage”.
The second statement describes the finding that scoring high on the variable “overweight”
tends to be associated with low measures on the variable “life expectancy”.
A correlation coefficient is a statistic which expresses numerically the magnitude and
direction of the association between two variables.
Example: scattergram
To provide a visual representation of the relationship between two variables, we can plot the data below on a scattergram. A scattergram is a graph of the paired scores for each subject on the two variables.
Student   Anatomy (X)   Physiology (Y)
1         3             2.5
2         4             3.5
3         1             0
4         8             6
5         2             1
[Figure: scattergram of the five students' paired scores, with physiology (Y) on the vertical axis]
Figures (a) and (b) represent a linear correlation between the variables x and y. That
is, a straight line is the most appropriate representation of the relationship between
x and y.
Figure (c) represents a non-linear correlation, where a curve best represents the
relationship between x and y.
Figure (a) represents a positive correlation, indicating that high scores on x are related to high
scores on y. For example, the relationship between cigarette smoking and lung damage is a
positive correlation.
Figure (b) represents a negative correlation, where high scores on x are associated with low
scores on y. For example, the correlation between the variables “overweight” and “life
expectancy” is negative, meaning that the more overweight you are, the lower your life
expectancy.
Correlation coefficients
The correlation coefficient expresses quantitatively the magnitude and direction
of the correlation.
Selection of correlation coefficients
We will examine only the commonly used Pearson's r and Spearman's ρ (rho).
Regardless of which correlation coefficient we employ, these statistics share the following
characteristics:
1. Correlation coefficients are calculated from pairs of measurements on variables x and y
for the same group of individuals.
2. A positive correlation is denoted by + and a negative correlation by -.
3. The values of the correlation coefficient range from +1 to -1, where +1 means a
perfect positive correlation, 0 means no correlation, and -1 a perfect negative correlation.
4. The square of the correlation coefficient represents the coefficient of determination.
The correlation coefficients φ, ρ and r are all appropriate for quantifying linear relationships
between variables.
Coefficient   Conditions where appropriate
φ (phi)       Both x and y measured on a nominal scale.
ρ (rho)       Both x and y measured on an ordinal scale, or at least one of the variables measured on an ordinal scale.
r             Both x and y measured on an interval or ratio scale (i.e., continuous).
There are other correlation coefficients, such as η (eta), which are used for quantifying
non-linear relationships.
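To see why a linear coefficient suits only linear relationships, the sketch below computes Pearson's r for a straight line and for a U-shaped curve. The x values and both sets of y values are made up purely for illustration, and the helper `pearson_r` is a hypothetical name introduced here, not something from the chapter.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up illustrative values, not data from the chapter.
xs = [-3, -2, -1, 0, 1, 2, 3]
straight_line = [2 * x + 1 for x in xs]   # perfectly linear in x
curve = [x ** 2 for x in xs]              # fully determined by x, but U-shaped

r_line = pearson_r(xs, straight_line)
r_curve = pearson_r(xs, curve)
print(r_line, r_curve)
```

For the straight line r comes out as a perfect positive correlation, while for the curve r is zero even though y depends entirely on x. A linear coefficient can therefore miss a strong non-linear relationship.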
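Calculation of correlation coefficients
Pearson's r
We have already stated that Pearson's r is the appropriate correlation coefficient when both
variables x and y are measured on an interval or a ratio scale.
To calculate r we need to represent the position of each paired score within its own distribution, so
we convert each raw score to a z score.
The formula for calculating Pearson's r is:
$r = \frac{\sum z_x z_y}{n}$
where n is the number of pairs of scores. This form assumes that the z scores are based on the standard deviation with divisor n; if the sample standard deviation (divisor n − 1) is used, the sum is divided by n − 1 instead.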
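The z-score formula can be unpacked into a raw-score form, which makes clear what is being summed. The derivation below is a sketch that assumes the z scores use the standard deviation with divisor n; the symbols $\bar{x}$, $\bar{y}$, $s_x$ and $s_y$ (means and standard deviations of x and y) are notation introduced here.

```latex
\begin{aligned}
z_x &= \frac{x - \bar{x}}{s_x}, \qquad
z_y = \frac{y - \bar{y}}{s_y}, \qquad
s_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}, \qquad
s_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n}} \\[4pt]
r &= \frac{\sum z_x z_y}{n}
   = \frac{\sum (x - \bar{x})(y - \bar{y})}{n \, s_x s_y}
   = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}
\end{aligned}
```

The last form no longer depends on which divisor is used for the standard deviation, provided the same divisor is used in the z scores and in the final division.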
The table below gives the calculations for the correlation coefficient for the data given in the earlier
examination scores example.
Example calculation of the correlation coefficient
                     Raw scores               z scores
Student              x           y            z_x         z_y         z_x z_y
1                    3           2.5          -0.22207    -0.04293    0.009534
2                    4           3.5          0.148047    0.386405    0.057206
3                    1           0            -0.9623     -1.11628    1.074201
4                    8           6            1.628513    1.459752    2.377225
5                    2           1            -0.59219    -0.68694    0.406798
Sum                                                                   3.924964
Mean                 3.6         2.6
Standard deviation   2.7018512   2.32916294
The standard deviations shown are sample standard deviations (divisor n − 1), so the sum of products is divided by n − 1: r = 3.924964 / 4 ≈ 0.98.
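The table's arithmetic can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names `z_scores` and `pearson_r` are hypothetical helpers, and the sample standard deviation (divisor n − 1) is assumed because it reproduces the values 2.7018512 and 2.32916294 shown in the table.

```python
from statistics import mean, stdev

# Raw scores from the examination example (anatomy x, physiology y).
anatomy = [3, 4, 1, 8, 2]
physiology = [2.5, 3.5, 0, 6, 1]


def z_scores(values):
    """Convert raw scores to z scores using the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # divisor n - 1, matching the table's 2.7018512
    return [(v - m) / s for v in values]


def pearson_r(xs, ys):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    products = [a * b for a, b in zip(zx, zy)]
    return sum(products) / (len(xs) - 1)


r = pearson_r(anatomy, physiology)
print(round(r, 3))  # 0.981
```

Dividing the same sum of products by n instead of n − 1 would give about 0.785, which is why the divisor must match the one used for the standard deviations.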
Classification of correlation coefficients
Value of the coefficient   Strength                 Direction
-1                         Perfect                  Negative
-1 to -0.7                 Strong (high)            Negative
-0.7 to -0.3               Moderate (acceptable)    Negative
-0.3 to 0                  Weak (low)               Negative
0                          No correlation
0 to 0.3                   Weak (low)               Positive
0.3 to 0.7                 Moderate (acceptable)    Positive
0.7 to 1                   Strong (high)            Positive
1                          Perfect                  Positive
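The classification above can be expressed as a small lookup. This is a sketch only: `describe_correlation` is a hypothetical helper, and the treatment of the exact boundary values 0.3 and 0.7 (assigned here to the stronger band) is an assumption, since the classification does not say which band they belong to.

```python
def describe_correlation(r):
    """Label a coefficient using the 0.3 and 0.7 cut-offs from the classification."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    if r == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        return f"perfect {direction}"
    if size >= 0.7:      # assumption: 0.7 itself counts as strong
        strength = "strong"
    elif size >= 0.3:    # assumption: 0.3 itself counts as moderate
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction}"


print(describe_correlation(0.71))   # strong positive
print(describe_correlation(-0.5))   # moderate negative
```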
Assumptions for using Pearson's correlation coefficient r
It was pointed out earlier that r is used when:
1. the two variables are measured on interval or ratio scales;
2. they are shown to be linearly associated;
3. in addition, the sets of scores on each variable are approximately normally distributed.
Spearman's ρ
When the obtained data are such that at least one of the variables x or y was measured on an
ordinal scale and the other on an ordinal scale or higher, we use ρ to calculate the correlation
between the two variables. The higher scale can be readily reduced to an ordinal scale.
If one or both variables were measured on a nominal scale, ρ is no longer appropriate as a
statistic.
The formula for calculating ρ is:
$\rho = 1 - \frac{6 \sum d^2}{n^3 - n}$
where d = difference in a pair of ranks and n = number of pairs.
Patient   Socioeconomic status (rank)   Severity of illness (rank)
1         6                             5
2         7                             8
3         2                             4
4         3                             3
5         5                             7
6         4                             1
7         1                             2
8         8                             6
$\rho = 1 - \frac{6 \sum d^2}{n^3 - n} = 1 - \frac{6 \times 24}{8^3 - 8} = 0.71$
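The calculation for the eight patients can be reproduced as follows. This is a minimal sketch: the two rank lists are the values read from the patient table (they give Σd² = 24, as in the worked formula), and `spearman_rho` is a hypothetical helper name.

```python
# Ranks for the eight patients in the example.
ses_rank = [6, 7, 2, 3, 5, 4, 1, 8]        # socioeconomic status
severity_rank = [5, 8, 4, 3, 7, 1, 2, 6]   # severity of illness


def spearman_rho(rank_x, rank_y):
    """Spearman's rho from two lists of ranks: 1 - 6*sum(d^2) / (n^3 - n)."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


sum_d2 = sum((a - b) ** 2 for a, b in zip(ses_rank, severity_rank))
rho = spearman_rho(ses_rank, severity_rank)
print(sum_d2, round(rho, 2))  # 24 0.71
```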
Example
In a study of the relationship between level of education and income, the following
data were obtained. Find the relationship between them and comment.
Sample   Level of education (X)   Income (Y)
A        Preparatory              25
B        Primary                  10
C        University               8
D        Secondary                10
E        Secondary                15
F        Illiterate               50
G        University               60
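One possible way to work this example is with Spearman's ρ, since level of education is an ordinal variable. The sketch below rests on two assumptions that are not stated in the source: the ordering of the education levels (illiterate < primary < preparatory < secondary < university), and the use of average ranks for tied values. `EDUCATION_ORDER` and `average_ranks` are hypothetical names introduced for illustration.

```python
# Assumed ordering of education levels, lowest to highest (not stated in the source).
EDUCATION_ORDER = ["illiterate", "primary", "preparatory", "secondary", "university"]

education = ["preparatory", "primary", "university", "secondary",
             "secondary", "illiterate", "university"]   # samples A..G
income = [25, 10, 8, 10, 15, 50, 60]                     # samples A..G


def average_ranks(values):
    """Rank values from 1 (lowest); tied values share the mean of their ranks."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1   # rank of the first of the tied values
        count = ordered.count(v)       # how many values are tied at v
        ranks.append(first + (count - 1) / 2)
    return ranks


edu_rank = average_ranks([EDUCATION_ORDER.index(e) for e in education])
inc_rank = average_ranks(income)

n = len(income)
sum_d2 = sum((a - b) ** 2 for a, b in zip(edu_rank, inc_rank))
rho = 1 - 6 * sum_d2 / (n ** 3 - n)
print(sum_d2, round(rho, 2))  # 64.0 -0.14
```

Under these assumptions ρ is about -0.14, a weak negative correlation, so in this small sample higher education is not associated with higher income. Because the data contain ties, the simple formula is only approximate.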
Uses of correlation in the health sciences
Prediction
When the correlation coefficient has been calculated, it may be used to predict the value of one
variable (y) given the value of the other variable (x).
The smaller the correlation coefficient, the greater the probability of making an error in
prediction.
A more appropriate and precise way of making predictions is in terms of regression analysis,
but this topic is not covered in this book.
Reliability and predictive validity of assessment
Reliability refers to measurements made with instruments, or subjective judgments, remaining
relatively the same on repeated administration. This is called test-retest reliability, and its degree
is measured by a correlation coefficient.
Estimating proportion of variance (coefficient of determination)
A useful statistic is the square of the correlation coefficient (r²), which represents the proportion of
variance in one variable accounted for by the other. This is also called the coefficient of determination.
Example:
If r = 0.8, then r² = 0.64, which means that 64% of the variability of y is explained (accounted for) by x.
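The same arithmetic can be sketched in code. The function name `coefficient_of_determination` is a hypothetical helper, and the values -0.8 and 0.3 are extra illustrative inputs that do not come from the chapter.

```python
def coefficient_of_determination(r):
    """Proportion of variance in one variable accounted for by the other."""
    return r ** 2


for r in (0.8, -0.8, 0.3):
    share = coefficient_of_determination(r)
    print(f"r = {r:+.1f} -> r^2 = {share:.2f} ({share:.0%} of variance)")
```

Squaring removes the sign, so r = -0.8 accounts for the same proportion of variance as r = 0.8. A coefficient of 0.3 accounts for only about 9%.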
Remarks
In the remarks below, "higher" and "lower" refer to the magnitude of the coefficient, regardless of its sign.
1. If the correlation coefficient is higher, then the predictive validity is higher and consequently the possible error decreases.
2. If the correlation coefficient is lower, then the predictive validity is lower and consequently the possible error increases.
3. If the correlation coefficient is higher, then the reliability is higher (reliable).
4. If the correlation coefficient is lower, then the reliability is lower (unreliable).
5. If the correlation coefficient is higher, then the coefficient of determination is higher.
6. If the correlation coefficient is lower, then the coefficient of determination is lower.
Correlation and causation
As an example, let us take cigarette smoking (x) and lung damage (y). A high positive correlation
could result from any of the following circumstances:
1. x causes y.
2. y causes x.
3. There is a third variable, which causes changes in both x and y.
4. The correlation represents a spurious or chance association.
Some associations between variables are completely spurious. For example, there might be a correlation
between the amount of margarine consumed and the number of cases of influenza over a period
in a community, but each of the two events might have entirely different, unrelated causes.
Even a high correlation between two variables does not necessarily imply a causal relationship.
The end of chapter 16