Correlation Analysis
By:
Er Gaurav Goyal
Assistant Professor
Introduction
• So far we have dealt with the analysis of one variable only, using
measures such as the mean, median and mode.
When the relationship is of a quantitative nature, the appropriate statistical tool for
discovering and measuring the relationship and expressing it in a brief formula
is known as correlation.
By Croxton and Cowden.
Whenever some definite connection exists between two or more groups,
classes or series of data, there is said to be correlation.
By Boddington.
If two or more quantities vary in sympathy, so that movements in one tend to
be accompanied by corresponding movements in the other, then they are said
to be correlated.
By L.R. Connor.
Importance of Correlation
• The correlation coefficient helps in measuring the
extent of relationship between two variables in one
figure only
• The existence of a relationship between two or more
variables enables us to predict what will happen in
the future; e.g., if the production of wheat has increased,
other factors remaining constant, we may expect a fall in
the price of wheat.
• Correlation facilitates decision-making in business
organizations. Expectations about the behavior of
certain variables are also based on correlation analysis.
Kinds of Correlation
Based on Direction of Change
a) Positive
b) Negative
• In correlation analysis we deal not with a single series but with the
association or relationship between two or more series.
Steps (Karl Pearson's coefficient, deviations taken from actual means)
a) Calculate the arithmetic mean (AM) of the X and Y series.
b) Find the deviations of the X series from the mean of X and denote
these deviations by x.
c) Square these deviations and obtain the total, i.e., Ʃx^2.
d) Find the deviations of the Y series from the mean of Y and denote
these deviations by y.
e) Square these deviations and obtain the total, i.e., Ʃy^2.
f) Multiply the corresponding deviations of the X and Y series and find
the total, i.e., Ʃxy.
g) Substitute these totals in the formula r = Ʃxy / sqrt(Ʃx^2 × Ʃy^2).
Example (Ex1)
Calculate the correlation coefficient between
the height of father and height of son from
the given data:
Height of father (in inches)   64  65  66  67  68  69  70
Height of son (in inches)      66  67  65  68  70  68  72
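The steps for Karl Pearson's coefficient can be sketched in Python on the height data of Ex1. This is a minimal illustration rather than the lecturer's own working; it assumes the standard formula r = Ʃxy / sqrt(Ʃx^2 × Ʃy^2) with deviations taken from the actual means, and the function and variable names are made up for this sketch.

```python
from math import sqrt

def pearson_from_deviations(xs, ys):
    """Karl Pearson's r using deviations from the actual means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [v - mean_x for v in xs]                      # deviations x
    dev_y = [v - mean_y for v in ys]                      # deviations y
    sum_x2 = sum(d * d for d in dev_x)                    # sum of x^2
    sum_y2 = sum(d * d for d in dev_y)                    # sum of y^2
    sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))     # sum of xy
    return sum_xy / sqrt(sum_x2 * sum_y2)

father = [64, 65, 66, 67, 68, 69, 70]
son = [66, 67, 65, 68, 70, 68, 72]

r_heights = pearson_from_deviations(father, son)
print(round(r_heights, 4))  # 0.8103
```

Here the means are 67 and 68, so Ʃx^2 = 28, Ʃy^2 = 34 and Ʃxy = 25, which gives r = 25 / sqrt(952), roughly 0.81: a strong positive correlation between the two heights.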
Direct Method
• When the mean values of the two series in
bivariate data are fractional, and the
number of observations and their magnitudes
are not very large, the direct method is useful:
r is computed from the raw values of X and Y
without taking deviations.
Note: If the correlation between X and Y is
r, then the correlation between –X and Y is –r.
Ex2
Calculate coefficient of correlation between
birth rate and death rate from the following
data:
Year         1931  1941  1951  1961  1971  1981  1991
Birth Rate     24    26    32    33    35    30    32
Death Rate     15    20    22    24    27    24    20
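As a rough sketch of the direct method applied to Ex2, the code below works from the raw values only. The slides do not print the formula, so this assumes the usual raw-score form r = (NƩXY - ƩXƩY) / sqrt((NƩX^2 - (ƩX)^2)(NƩY^2 - (ƩY)^2)); the function name is hypothetical.

```python
from math import sqrt

def pearson_direct(xs, ys):
    """Pearson's r from raw values (no deviations from a mean)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(v * v for v in xs)
    sum_y2 = sum(v * v for v in ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

birth = [24, 26, 32, 33, 35, 30, 32]
death = [15, 20, 22, 24, 27, 24, 20]

r_rates = pearson_direct(birth, death)
print(round(r_rates, 4))  # 0.8486

# Reversing the sign of one series reverses the sign of r.
r_negated = pearson_direct([-b for b in birth], death)
```

With these figures ƩX = 212, ƩY = 152, ƩX^2 = 6514, ƩY^2 = 3390 and ƩXY = 4681, so r = 543 / sqrt(654 × 626), roughly 0.85. The last line checks the note that the correlation between –X and Y is –r.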
Short-cut Method
• When the mean values are fractional and the
number of observations is large, and the
observations have large values, computations
of r can be simplified by using deviations of
the observations from some suitably chosen
constant or constants.
Steps
a) Calculate the deviations of X series from an assumed mean
and denote them by dx and find out the total, i.e., Ʃdx
b) Calculate the deviations of Y series from an assumed mean
and denote them by dy and find out the total, i.e., Ʃdy
c) Square the deviation of X series and obtain the total, i.e.,
Ʃdx^2
d) Square the deviation of Y series and obtain the total, i.e.,
Ʃdy^2
e) Multiply dx and dy and find out the total, i.e., Ʃdxdy
EX5
Calculate Karl Pearson’s coefficient of correlation
from the following data using 20 as the
working mean for price, and 70 as the working
mean for demand.
Price 14 16 17 18 19 20 21 22 23
Demand 84 78 70 75 66 67 62 58 60
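The short-cut steps can be sketched for this example as follows. The slides list the totals to compute but not the final formula, so the code assumes the usual assumed-mean form r = (NƩdxdy - ƩdxƩdy) / sqrt((NƩdx^2 - (Ʃdx)^2)(NƩdy^2 - (Ʃdy)^2)); the function and parameter names are illustrative only.

```python
from math import sqrt

def pearson_shortcut(xs, ys, assumed_x, assumed_y):
    """Pearson's r using deviations from assumed (working) means."""
    n = len(xs)
    dx = [v - assumed_x for v in xs]
    dy = [v - assumed_y for v in ys]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dx2 = sum(d * d for d in dx)
    sum_dy2 = sum(d * d for d in dy)
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    numerator = n * sum_dxdy - sum_dx * sum_dy
    denominator = sqrt((n * sum_dx2 - sum_dx ** 2) * (n * sum_dy2 - sum_dy ** 2))
    return numerator / denominator

price = [14, 16, 17, 18, 19, 20, 21, 22, 23]
demand = [84, 78, 70, 75, 66, 67, 62, 58, 60]

r_price_demand = pearson_shortcut(price, demand, assumed_x=20, assumed_y=70)
print(round(r_price_demand, 4))  # -0.9542
```

With working means 20 and 70 the totals are Ʃdx = -10, Ʃdy = -10, Ʃdx^2 = 80, Ʃdy^2 = 618 and Ʃdxdy = -184, giving r of about -0.95. Any other pair of working means gives the same r, because the coefficient does not depend on the origin chosen.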
Assumptions
• Linear Relationship: In this method, a linear
relationship between the two variables is
assumed.
• Causal Relationship: In studying correlation,
we expect a cause and effect relationship
between the forces affecting the values in the
two series.
Properties of Correlation Coefficient
• The value of r lies between -1 and +1
• The correlation coefficient is independent of the change of origin and scale
• It is the ratio of two quantities having same units, thus it is a pure number
having no units
• The value of r doesn’t change if all the values of either variable are
converted to a different scale.
• The value of r is symmetric: it is not affected by which variable is labelled X and which Y.
• r measures the strength of linear relationship.
• If the sign of all the values of one of the variables is changed, the sign of the
correlation coefficient changes.
• If a constant is added to or subtracted from each value of X and Y, the
correlation coefficient remains unchanged, i.e., the correlation coefficient is
independent of the change of origin.
• If each value of X and Y is multiplied or divided by a positive constant, the
correlation coefficient remains unchanged, i.e., the correlation coefficient is
independent of the change of scale.
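A short numerical check of these properties, using the father and son heights from Ex1. It is only a sketch: the helper pearson_r and the particular shift and scale constants (60 and 2.54) are arbitrary choices made for this illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

father = [64, 65, 66, 67, 68, 69, 70]
son = [66, 67, 65, 68, 70, 68, 72]

r_base = pearson_r(father, son)
r_swapped = pearson_r(son, father)                     # roles of X and Y exchanged
r_shifted = pearson_r([v - 60 for v in father], son)   # change of origin
r_scaled = pearson_r(father, [v * 2.54 for v in son])  # change of scale (positive factor)
r_flipped = pearson_r([-v for v in father], son)       # sign of one variable changed

print(r_base, r_swapped, r_shifted, r_scaled, r_flipped)
```

The first four values agree up to floating-point rounding, while the last has the same magnitude with the opposite sign.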
Spearman’s Rank Correlation Coefficient
Demerits
e) This method can be applied to ungrouped data only; it cannot be used for a
grouped frequency distribution.
f) The ranking procedure ignores the actual magnitude of the data and, as
such, the results obtained are only approximate, because the effect of
extreme values is almost ignored.
g) The computation becomes difficult as the number of paired
observations increases.
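Demerit (f) can be seen in a small sketch. The slides do not show the rank formula, so the code assumes the usual no-ties form 1 - 6Ʃd^2 / (n(n^2 - 1)), and it reuses the price and demand figures from EX5 because neither series has tied values. The altered value 840 is a made-up extreme, introduced only for this illustration.

```python
from math import sqrt

def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman's rank correlation for untied data."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def pearson_r(xs, ys):
    """Pearson's r, for comparison with the rank coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / sqrt(sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys))

price = [14, 16, 17, 18, 19, 20, 21, 22, 23]
demand = [84, 78, 70, 75, 66, 67, 62, 58, 60]
demand_extreme = [840] + demand[1:]   # hypothetical outlier: 84 replaced by 840

rho = spearman_rho(price, demand)
rho_extreme = spearman_rho(price, demand_extreme)
print(rho, rho_extreme)
print(pearson_r(price, demand), pearson_r(price, demand_extreme))
```

Both rank coefficients come out at about -0.95 because the ranks are unchanged by the outlier, whereas Pearson's r moves noticeably. This is the sense in which ranking ignores actual magnitudes.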
Measures of Association
Between Two Variables
• Covariance
• Correlation Coefficient
Covariance
The covariance is a measure of the linear association between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.
The covariance is computed as follows:
$$ s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \quad \text{for samples} $$
$$ \sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N} \quad \text{for populations} $$
Correlation Coefficient
Correlation is a measure of linear association and not necessarily causation.
Just because two variables are highly correlated, it does not mean that one
variable is the cause of the other.
The correlation coefficient is computed as follows:
$$ r_{xy} = \frac{s_{xy}}{s_x s_y} \quad \text{for samples} $$
$$ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \quad \text{for populations} $$
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear relationship.
Values near +1 indicate a strong positive linear relationship.
Covariance and Correlation Coefficient
Average Driving Distance (yds.)   Average 18-Hole Score
277.6                             69
259.5                             71
269.1                             70
267.0                             70
255.6                             71
272.9                             69
Sample Covariance
$$ s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08 $$
Sample Correlation Coefficient
$$ r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(0.8944)} = -0.9631 $$
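The driving-distance figures can be reproduced with a few lines of Python. This is a sketch that applies the sample formulas (divisor n - 1) to the six observations in the table; the variable names are chosen for this illustration.

```python
from math import sqrt

distance = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]  # average driving distance (yds.)
score = [69, 71, 70, 70, 71, 69]                        # average 18-hole score

n = len(distance)
mean_x = sum(distance) / n
mean_y = sum(score) / n

# Sample covariance: sum of cross-products of deviations, divided by n - 1.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distance, score)) / (n - 1)

# Sample standard deviations of each variable.
s_x = sqrt(sum((x - mean_x) ** 2 for x in distance) / (n - 1))
s_y = sqrt(sum((y - mean_y) ** 2 for y in score) / (n - 1))

r_xy = s_xy / (s_x * s_y)
print(round(s_xy, 2), round(s_x, 4), round(s_y, 4), round(r_xy, 4))
```

The covariance is negative (about -7.08) and the coefficient is close to -1 (about -0.963), so in this sample longer average drives go with lower scores.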
The Weighted Mean and
Working with Grouped Data
• Weighted Mean
• Mean for Grouped Data
• Variance for Grouped Data
• Standard Deviation for Grouped Data
Weighted Mean
When the mean is computed by giving each data
value a weight that reflects its importance, it is
referred to as a weighted mean.
In the computation of a grade point average (GPA),
the weights are the number of credit hours earned
for each grade.
When data values vary in importance, the analyst
must choose the weight that best reflects the
importance of each value.
$$ \bar{x} = \frac{\sum w_i x_i}{\sum w_i} $$
where:
xi = value of observation i
wi = weight for observation i
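A minimal sketch of the weighted mean, applied to a GPA-style calculation. The grade points and credit hours below are hypothetical figures made up for this illustration; they do not come from the slides.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical example: grade points earned and the credit hours of each course.
grade_points = [4, 3, 2]
credit_hours = [3, 4, 2]

gpa = weighted_mean(grade_points, credit_hours)
print(round(gpa, 2))  # 3.11
```

Here the result is 28 / 9, while the unweighted mean of the three grades would be exactly 3, so courses that carry more credit hours pull the average toward their grade.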
Grouped Data
The weighted mean computation can be used to
obtain approximations of the mean, variance, and
standard deviation for the grouped data.
To compute the weighted mean, we treat the
midpoint of each class as though it were the mean
of all items in the class.
We compute a weighted mean of the class midpoints
using the class frequencies as weights.
Similarly, in computing the variance and standard
deviation, the class frequencies are used as weights.
Mean for Grouped Data
For sample data
$$ \bar{x} = \frac{\sum f_i M_i}{n} $$
For population data
$$ \mu = \frac{\sum f_i M_i}{N} $$
where:
fi = frequency of class i
Mi = midpoint of class i
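Sample Mean for Grouped Data
For the rent data tabulated below, $\bar{x} = \sum f_i M_i / n = 34{,}525 / 70 = 493.21$.
Variance for Grouped Data
For sample data
$$ s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1} $$
For population data
$$ \sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N} $$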
Sample Variance for Grouped Data
Rent ($)   f_i   M_i     M_i - x̄   (M_i - x̄)^2   f_i (M_i - x̄)^2
420-439     8    429.5    -63.7      4058.96        32471.71
440-459    17    449.5    -43.7      1910.56        32479.59
460-479    12    469.5    -23.7       562.16         6745.97
480-499     8    489.5     -3.7        13.76          110.11
500-519     7    509.5     16.3       265.36         1857.55
520-539     4    529.5     36.3      1316.96         5267.86
540-559     2    549.5     56.3      3168.56         6337.13
560-579     4    569.5     76.3      5820.16        23280.66
580-599     2    589.5     96.3      9271.76        18543.53
600-619     6    609.5    116.3     13523.36        81140.18
Total      70                                      208234.29
Sample Variance
s^2 = 208,234.29 / (70 - 1) = 3,017.89
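The grouped-data calculation for the rent table can be sketched in Python as below. It treats each class midpoint as the value of every item in the class and uses the class frequencies as weights, following the formulas for sample data; the variable names are illustrative. The code keeps the mean unrounded, so its sum of squares differs from the tabulated total only in the third decimal place.

```python
from math import sqrt

# (lower limit, upper limit, frequency) for each rent class, taken from the table.
classes = [
    (420, 439, 8), (440, 459, 17), (460, 479, 12), (480, 499, 8), (500, 519, 7),
    (520, 539, 4), (540, 559, 2), (560, 579, 4), (580, 599, 2), (600, 619, 6),
]

midpoints = [(low + high) / 2 for low, high, _ in classes]
freqs = [f for _, _, f in classes]
n = sum(freqs)

# Sample mean: frequency-weighted mean of the class midpoints.
mean_rent = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Sample variance and standard deviation, with divisor n - 1.
sum_sq = sum(f * (m - mean_rent) ** 2 for f, m in zip(freqs, midpoints))
var_rent = sum_sq / (n - 1)
sd_rent = sqrt(var_rent)

print(n, round(mean_rent, 2), round(var_rent, 2), round(sd_rent, 2))
```

The standard deviation, listed in the outline but not worked on the slides, is the square root of the sample variance, which comes to roughly 54.94 for these data.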