
CORRELATION ANALYSIS

By:
Er Gaurav Goyal
Assistant Professor
Introduction
• So far we have been dealing with the analysis of one variable only, e.g.
mean, median, mode, etc.

• The study of correlation deals with the degree of mutual statistical
relationship between two or more variables, i.e. correlation studies the
correspondence of movement ("going togetherness") between two
variables or series of paired items. Some examples are:
a) If the price increases, demand decreases.
b) If the income of a family increases, its expenditure increases.
c) The sale of woollen garments increases as winter approaches.
d) If the price increases, supply increases.
Introduction
• In correlation analysis we do not deal with one series alone but with the
association or relationship between two series, and we do not measure
variation in one series but rather compare variation in two or more series.

• The two series may vary together:

a) In the same direction
b) In the opposite direction
c) Not at all (they do not vary together).
Introduction
• Guidelines

To measure the association of series through correlation we must have:
a) A sufficient number of items in the series (2 or 3 pairs are not
enough).
b) No blank in one series where there is a value in the other series;
there must be pairing throughout.
Definition of Correlation
Correlation measures the closeness of the relationship between two variables,
more exactly the closeness of the linear relationship.

"When the relationship is of a quantitative nature, the appropriate statistical
tool for discovering and measuring the relationship and expressing it in a
brief formula is known as correlation."
By Croxton and Cowden.

"Whenever some definite connection exists between two or more groups, classes
or series of data, they are said to be correlated."
By Boddington.

"If two or more quantities vary in sympathy, so that movements in one tend to
be accompanied by corresponding movements in the other, then they are said
to be correlated."
By L.R. Connor.
Importance of Correlation
• The correlation coefficient helps in measuring the
extent of the relationship between two variables in a
single figure.
• The existence of a relationship between two or more
variables enables us to predict what will happen in the
future, e.g., if the production of wheat has increased,
other factors remaining constant, we may expect a fall
in the price of wheat.
• Correlation facilitates decision-making in business
organizations. Expectations about the behaviour of
certain variables are also based on correlation analysis.
Kinds of Correlation
Based on Direction of Change
a) Positive
b) Negative

Based on Change in Proportion
a) Linear
b) Non-Linear

Based on Number of Variables
a) Simple
b) Partial
c) Multiple
Some Important Points
• There should be a sufficient number of items in the series. On the basis of
two or three pairs of values we cannot generalize their 'going togetherness'.

• In correlation analysis we do not deal with one series alone but with the
association or relationship between two or more series.

• We do not measure variation in one series only but rather compare
variation in two or more series: do the series vary together in the same
direction, do they vary together in the opposite direction, or do they not
vary together at all?

• To measure correlation, i.e., the degree of association between the
movements of two or more series, it is not necessary that there be a
proportionate change, i.e., it is not necessary that for every unit change in
one variable there is a fixed change in the other.
Some Important Points
• We study only linear correlation.

• Correlation does not necessarily mean a cause and effect relationship.
For example, in the heating of an iron rod, heating is the cause and the
hotness of the rod is the effect.

• The sign of r indicates the type of linear relationship, whether positive
or negative.

• The value of r, without regard to sign, indicates the strength of the
linear relationship.

• Is correlation a cause and effect relationship?
Is Correlation Cause and Effect
Relationship?
• Correlation is often used in the sense of mutual
dependence of two or more variables, yet it is not at
all necessary that this should always be so.
• Even a very high degree of correlation between two
variables does not necessarily indicate a cause and
effect relationship between them.
Reasons
• There is a cause and effect relationship between the two
variables.
• Both the correlated variables are being affected by a
third variable, or by more than one other variable.
• The related variables might be mutually affecting each
other, so that neither of them can be designated as a
cause or an effect.
• The correlation may be due to random or chance factors.
• There might be a situation of nonsense or spurious
correlation between the two variables under study.
Measures of Correlation
• Scatter Diagram
• Karl Pearson’s coefficient for measuring linear
correlation
• Method of Rank differences (Spearman’s Rank
Correlation Coefficient)
Scatter Diagram
• A scatter diagram or dot diagram is a graphical representation of the
pairs of numerical values of the two variables.
• Each pair of values is represented by a dot on the graph.
• The scatter of the points and the direction of the scatter reveal the
nature and degree of correlation between the two variables.
• If all the points lie on a straight line having positive slope (i.e. a rising
line), the correlation is said to be perfect positive, r = +1; if they lie on a
straight line having negative slope (a falling line), it is perfect negative,
r = -1.
• In brief, if the low values of one variable go with the low values of the
other variable and high values go with high values, the path runs roughly
from the lower left corner to the upper right corner; the relationship is
direct and is called positive, and vice-versa (a short plotting sketch
follows below).
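As a rough illustration, the Python sketch below plots a set of paired values with matplotlib; the data and labels are illustrative only and are not taken from the slides.

```python
# A scatter-diagram sketch with illustrative data (assumes matplotlib is installed).
import matplotlib.pyplot as plt

x = [2, 4, 5, 7, 8, 10, 11]    # X series (e.g., price)
y = [3, 5, 6, 8, 8, 11, 12]    # Y series (e.g., supply)

plt.scatter(x, y)              # each pair of values becomes a dot
plt.xlabel("X series")
plt.ylabel("Y series")
plt.title("Scatter diagram")
plt.show()                     # dots rising to the upper right suggest positive correlation
```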
Scatter Diagram
• Limited Degree of Correlation
A limited degree of correlation can be:
a) High: between ±0.75 and ±1
b) Moderate: between ±0.25 and ±0.75
c) Low: between 0 and ±0.25
Scatter Diagram
Merits
a) It is a non-mathematical and easy way of finding the nature of the
correlation between two variables.
b) By drawing a line of best fit by the free-hand method through the plotted
dots, the method can be used for estimating missing values of the
dependent variable for a given value of the independent variable.
c) The shape of the scatter diagram reveals whether the correlation is linear
or non-linear, which enables us to know the pattern of relationship
existing between the two variables.
d) The method is not affected by the values of extreme observations.
Demerits
a) This method does not give any quantitative measure of the degree or
extent of correlation.
Karl Pearson’s Coefficient

• Actual mean method


• Direct method
• Short-cut method.
Actual Mean Method
• This method is suitable in cases where the mean values of X and Y
are whole numbers and not fractional.

Steps
a) Calculate the arithmetic mean of the X and Y series.
b) Find the deviations of the X series from the mean of X and denote
these deviations by x.
c) Square these deviations and obtain the total, i.e., Ʃx^2.
d) Find the deviations of the Y series from the mean of Y and denote
these deviations by y.
e) Square these deviations and obtain the total, i.e., Ʃy^2.
f) Multiply the corresponding deviations of the X and Y series and find
the total, i.e., Ʃxy.
g) The coefficient of correlation is then r = Ʃxy / √(Ʃx^2 · Ʃy^2).
Example (Ex1)
Calculate the correlation coefficient between
the height of father and height of son from
the given data:

Height of Father (in inches):  64  65  66  67  68  69  70
Height of Son (in inches):     66  67  65  68  70  68  72
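As a worked illustration (not part of the original slides), the Python sketch below applies the actual mean method steps to the Ex1 data:

```python
# Actual mean method applied to the father/son heights of Ex1 (a worked sketch).
from math import sqrt

X = [64, 65, 66, 67, 68, 69, 70]    # height of father (inches)
Y = [66, 67, 65, 68, 70, 68, 72]    # height of son (inches)

mean_x = sum(X) / len(X)            # arithmetic mean of X (= 67, a whole number)
mean_y = sum(Y) / len(Y)            # arithmetic mean of Y (= 68, a whole number)

x = [xi - mean_x for xi in X]       # deviations of X from its mean
y = [yi - mean_y for yi in Y]       # deviations of Y from its mean

sum_xy = sum(a * b for a, b in zip(x, y))   # Ʃxy
sum_x2 = sum(a * a for a in x)              # Ʃx^2
sum_y2 = sum(b * b for b in y)              # Ʃy^2

r = sum_xy / sqrt(sum_x2 * sum_y2)          # r = Ʃxy / √(Ʃx^2 · Ʃy^2)
print(round(r, 4))                          # ≈ 0.81: a fairly strong positive correlation
```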
Direct Method
• When the mean values of the two series in a bivariate data set are
fractional, and the number of observations and their magnitudes are
not very large, the direct method is useful. It works directly with the
raw values: r = (nƩXY - ƩX·ƩY) / √((nƩX^2 - (ƩX)^2)(nƩY^2 - (ƩY)^2)).
Note: If the correlation between X and Y is r, then the correlation
between -X and Y is -r.
Ex2
Calculate coefficient of correlation between
birth rate and death rate from the following
data:
Year:        1931  1941  1951  1961  1971  1981  1991
Birth Rate:    24    26    32    33    35    30    32
Death Rate:    15    20    22    24    27    24    20
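For illustration, the sketch below applies the direct (raw-score) formula stated above to the Ex2 data; it is a worked sketch, not taken from the slides.

```python
# Direct method on the birth-rate / death-rate data of Ex2 (a sketch).
# r = (nƩXY - ƩX·ƩY) / √((nƩX^2 - (ƩX)^2)(nƩY^2 - (ƩY)^2))
from math import sqrt

X = [24, 26, 32, 33, 35, 30, 32]    # birth rate
Y = [15, 20, 22, 24, 27, 24, 20]    # death rate
n = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(a * b for a, b in zip(X, Y))
sum_x2 = sum(a * a for a in X)
sum_y2 = sum(b * b for b in Y)

r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 4))    # positive here: years with higher birth rates also show higher death rates
```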
Short-cut Method
• When the mean values are fractional and the
number of observations is large, and the
observations have large values, computations
of r can be simplified by using deviations of
the observations from some suitably chosen
constant or constants.
Short-cut Method
Steps
a) Calculate the deviations of the X series from an assumed mean
and denote them by dx, and find the total, i.e., Ʃdx.
b) Calculate the deviations of the Y series from an assumed mean
and denote them by dy, and find the total, i.e., Ʃdy.
c) Square the deviations of the X series and obtain the total, i.e.,
Ʃdx^2.
d) Square the deviations of the Y series and obtain the total, i.e.,
Ʃdy^2.
e) Multiply dx and dy and find the total, i.e., Ʃdxdy.
f) The coefficient is then
r = (nƩdxdy - Ʃdx·Ʃdy) / √((nƩdx^2 - (Ʃdx)^2)(nƩdy^2 - (Ʃdy)^2)).
EX5
Calculate Karl Pearson’s coefficient of correlation
from the following data using 20 as the
working mean for price, and 70 as the working
mean for demand.
Price:   14 16 17 18 19 20 21 22 23
Demand:  84 78 70 75 66 67 62 58 60
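As a sketch (not from the original slides), the short-cut method with working means 20 and 70 can be carried out in Python as follows:

```python
# Short-cut (assumed mean) method with working means A = 20 (price), B = 70 (demand).
# r = (nƩdxdy - Ʃdx·Ʃdy) / √((nƩdx^2 - (Ʃdx)^2)(nƩdy^2 - (Ʃdy)^2))
from math import sqrt

price  = [14, 16, 17, 18, 19, 20, 21, 22, 23]
demand = [84, 78, 70, 75, 66, 67, 62, 58, 60]
A, B = 20, 70
n = len(price)

dx = [p - A for p in price]     # deviations of price from the working mean A
dy = [d - B for d in demand]    # deviations of demand from the working mean B

sum_dx, sum_dy = sum(dx), sum(dy)
sum_dxdy = sum(a * b for a, b in zip(dx, dy))
sum_dx2 = sum(a * a for a in dx)
sum_dy2 = sum(b * b for b in dy)

r = (n * sum_dxdy - sum_dx * sum_dy) / sqrt(
    (n * sum_dx2 - sum_dx ** 2) * (n * sum_dy2 - sum_dy ** 2)
)
print(round(r, 4))    # strongly negative: demand falls as price rises
```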
Assumptions
• Linear Relationship: In this method, a linear
relationship between the two variables is
assumed.
• Causal Relationship: In studying correlation,
we expect a cause and effect relationship
between the forces affecting the values in the
two series.
Properties of Correlation Coefficient
• The value of r lies between -1 and +1.
• The correlation coefficient is independent of the change of origin and scale.
• It is the ratio of two quantities having the same units, and thus it is a pure
number having no units.
• The value of r does not change if all the values of either variable are
converted to a different scale.
• The value of r is not affected by the choice of X or Y.
• r measures the strength of the linear relationship.
• If the sign of all the values of one of the variables is changed, the sign of
the correlation coefficient changes.
• If a constant amount is added to or subtracted from each value of X and Y,
the correlation coefficient remains unchanged, i.e. the correlation coefficient
is independent of the change of origin.
• If each value of X and Y is multiplied or divided by a constant, the
correlation coefficient remains unchanged, i.e., the correlation coefficient is
independent of the change of scale (a numerical illustration follows below).
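The short sketch below illustrates the origin/scale and sign-change properties numerically, using numpy with arbitrary data:

```python
# Numerical check of two properties of r, using arbitrary data:
# r is unchanged by a change of origin or scale, and changes sign
# when the sign of one variable is changed.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 4.0, 8.0, 9.0])

r_original = np.corrcoef(x, y)[0, 1]
r_shifted  = np.corrcoef(x + 100, y - 50)[0, 1]   # change of origin
r_scaled   = np.corrcoef(10 * x, y / 3)[0, 1]     # change of scale
r_negated  = np.corrcoef(-x, y)[0, 1]             # sign change of one variable

print(r_original, r_shifted, r_scaled, r_negated)
# the first three values agree; the last has the opposite sign
```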
Spearman’s Rank Correlation Coefficient

• To calculate the rank correlation coefficient:


a) We first rank the two series say X’s and Y’s individually
among themselves, giving rank 1 to the largest (or
smallest) value, rank 2 to the second largest (or second
smallest) and so on in each series separately.
b) Find the differences ‘D’ of the corresponding rank of X and
Y.
c) Square these differences and find the sum of the squares
of these differences, i.e., ƩD^2
d) Calculate the rank correlation coefficient by using the formula
r_s = 1 - (6ƩD^2) / (n(n^2 - 1)), where n is the number of pairs.
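A minimal Python sketch of these steps follows; the data is illustrative, and tied values would receive the average of their ranks, as discussed in the notes on the next slide.

```python
# Spearman's rank correlation: rank each series, take the rank differences D,
# then apply r_s = 1 - 6*ΣD^2 / (n(n^2 - 1)).  The data below is illustrative.
# (When tied values occur, many textbooks also add a small correction term;
# it is omitted in this sketch.)

def ranks(values):
    """Ranks with 1 for the smallest value; ties receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            rank[order[k]] = avg_rank
        i = j + 1
    return rank

X = [48, 33, 40, 9, 16, 65, 24, 57]    # illustrative X series
Y = [13, 12, 24, 6, 15, 20, 9, 19]     # illustrative Y series

rx, ry = ranks(X), ranks(Y)
n = len(X)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # ΣD^2
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(round(r_s, 4))
```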
Spearman’s Rank Correlation Coefficient
Notes
a) Ranks can be allotted either in ascending order or in descending
order, but whichever method is selected must be used for both
variables.
b) If two or more data items have the same value, then the ranks
which would have been allotted separately must be averaged and
this average rank given to each item.
c) The highest rank in a series is equal to the number of values in
that series. If the two series contain an equal number of values,
the highest rank in each series is equal to the number of pairs of
values. Sometimes, however, a question as given appears to contain
a rank greater than the number of pairs of values.
Spearman’s Rank Correlation Coefficient
Merits
a) This method is simple and easier as compared to Karl Pearson's method.
b) This method is especially useful when precise measurements of the variables
under study are not given or cannot be obtained.
c) This method is also applicable to qualitative data.
d) This method can be applied to irregular data also.

Demerits
a) This method cannot be applied to grouped data.
b) The ranking procedure ignores the actual magnitudes of the data and, as
such, the results obtained are only approximate, because the effect of
extreme values is almost entirely ignored.
c) The computation procedure becomes difficult as the number of paired
observations increases.
Measures of Association
Between Two Variables

• Covariance
• Correlation Coefficient
Covariance

 The covariance is a measure of the linear association between two variables.

 Positive values indicate a positive relationship.

 Negative values indicate a negative relationship.
Covariance

 The covariance is computed as follows:

s_xy = Ʃ(xi - x̄)(yi - ȳ) / (n - 1)        for samples

σ_xy = Ʃ(xi - μx)(yi - μy) / N            for populations
Correlation Coefficient

 Correlation is a measure of linear association and not necessarily
causation.

 Just because two variables are highly correlated, it does not mean that
one variable is the cause of the other.
Correlation Coefficient

 The correlation coefficient is computed as follows:

r_xy = s_xy / (s_x · s_y)        for samples

ρ_xy = σ_xy / (σ_x · σ_y)        for populations
Correlation Coefficient
 The coefficient can take on values between -1 and +1.

 Values near -1 indicate a strong negative linear relationship.

 Values near +1 indicate a strong positive linear relationship.
Covariance and Correlation Coefficient

A golfer is interested in investigating the relationship, if any, between
driving distance and 18-hole score.

Average Driving Distance (yds.)    Average 18-Hole Score
            277.6                           69
            259.5                           71
            269.1                           70
            267.0                           70
            255.6                           71
            272.9                           69
Covariance and Correlation Coefficient

    x         y      (xi - x̄)   (yi - ȳ)   (xi - x̄)(yi - ȳ)
  277.6      69       10.65      -1.0          -10.65
  259.5      71       -7.45       1.0           -7.45
  269.1      70        2.15       0               0
  267.0      70        0.05       0               0
  255.6      71      -11.35       1.0          -11.35
  272.9      69        5.95      -1.0           -5.95
Average     267.0     70.0               Total -35.40
Std. Dev.     8.2192   .8944
Covariance and Correlation Coefficient

 Sample Covariance

s_xy = Ʃ(xi - x̄)(yi - ȳ) / (n - 1) = -35.40 / (6 - 1) = -7.08

 Sample Correlation Coefficient

r_xy = s_xy / (s_x · s_y) = -7.08 / ((8.2192)(.8944)) = -.9631
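As a cross-check of the hand computation above, the numpy sketch below reproduces the same sample covariance and correlation coefficient for the golfer data:

```python
# Cross-checking the golfer example with numpy (sample covariance uses n - 1).
import numpy as np

distance = np.array([277.6, 259.5, 269.1, 267.0, 255.6, 272.9])   # avg driving distance (yds.)
score    = np.array([69.0, 71.0, 70.0, 70.0, 71.0, 69.0])         # avg 18-hole score

s_xy = np.cov(distance, score, ddof=1)[0, 1]    # sample covariance ≈ -7.08
r_xy = np.corrcoef(distance, score)[0, 1]       # sample correlation ≈ -0.9631
print(round(s_xy, 2), round(r_xy, 4))
```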
The Weighted Mean and
Working with Grouped Data

• Weighted Mean
• Mean for Grouped Data
• Variance for Grouped Data
• Standard Deviation for Grouped Data
Weighted Mean
 When the mean is computed by giving each data
value a weight that reflects its importance, it is
referred to as a weighted mean.
 In the computation of a grade point average (GPA),
the weights are the number of credit hours earned for
each grade.
 When data values vary in importance, the analyst
must choose the weight that best reflects the
importance of each value.
Weighted Mean

x̄ = Ʃ(wi · xi) / Ʃwi

where:
xi = value of observation i
wi = weight for observation i
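A small sketch of this formula follows, using a hypothetical GPA computation in which credit hours serve as the weights (the courses and grades are invented for illustration):

```python
# Weighted mean: x̄ = Ʃ(wi * xi) / Ʃwi.
# Hypothetical GPA record: grade points (xi) weighted by credit hours (wi).
grade_points = [4.0, 3.0, 3.7, 2.0]   # xi: grade earned in each course
credit_hours = [3, 4, 3, 2]           # wi: credit hours for each course

gpa = sum(w * x for w, x in zip(credit_hours, grade_points)) / sum(credit_hours)
print(round(gpa, 2))
```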
Grouped Data
 The weighted mean computation can be used to
obtain approximations of the mean, variance, and
standard deviation for the grouped data.
 To compute the weighted mean, we treat the
midpoint of each class as though it were the mean
of all items in the class.
 We compute a weighted mean of the class midpoint
using the class frequencies as weights.
 Similarly, in computing the variance and standard
deviation, the class frequencies are used as weights.
Mean for Grouped Data

 For sample data

x̄ = Ʃ(fi · Mi) / n

 For population data

μ = Ʃ(fi · Mi) / N

where:
fi = frequency of class i
Mi = midpoint of class i
Sample Mean for Grouped Data

Given below is the previous sample of monthly rents for 70 efficiency
apartments, presented here as grouped data in the form of a frequency
distribution.
Rent ($) Frequency
420-439 8
440-459 17
460-479 12
480-499 8
500-519 7
520-539 4
540-559 2
560-579 4
580-599 2
600-619 6
Sample Mean for Grouped Data

Rent ($)     fi      Mi       fi·Mi
420-439       8     429.5     3436.0
440-459      17     449.5     7641.5
460-479      12     469.5     5634.0
480-499       8     489.5     3916.0
500-519       7     509.5     3566.5
520-539       4     529.5     2118.0
540-559       2     549.5     1099.0
560-579       4     569.5     2278.0
580-599       2     589.5     1179.0
600-619       6     609.5     3657.0
Total        70              34525.0

x̄ = 34,525 / 70 = $493.21

This approximation differs by $2.41 from the actual sample mean of $490.80.
Variance for Grouped Data

 For sample data

s^2 = Ʃfi(Mi - x̄)^2 / (n - 1)

 For population data

σ^2 = Ʃfi(Mi - μ)^2 / N
Sample Variance for Grouped Data

Rent ($)     fi      Mi      Mi - x̄    (Mi - x̄)^2    fi·(Mi - x̄)^2
420-439       8     429.5     -63.7      4058.96       32471.71
440-459      17     449.5     -43.7      1910.56       32479.59
460-479      12     469.5     -23.7       562.16        6745.97
480-499       8     489.5      -3.7        13.76         110.11
500-519       7     509.5      16.3       265.36        1857.55
520-539       4     529.5      36.3      1316.96        5267.86
540-559       2     549.5      56.3      3168.56        6337.13
560-579       4     569.5      76.3      5820.16       23280.66
580-599       2     589.5      96.3      9271.76       18543.53
600-619       6     609.5     116.3     13523.36       81140.18
Total        70                                        208234.29
continued
Sample Variance for Grouped Data

 Sample Variance

s^2 = 208,234.29 / (70 - 1) = 3,017.89

 Sample Standard Deviation

s = √3,017.89 = $54.94

This approximation differs by only $.20 from the actual standard deviation
of $54.74.
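The grouped-data approximations for this rent example can be reproduced with the short Python sketch below, using the frequencies and class midpoints from the tables above:

```python
# Grouped-data approximations for the rent example:
# x̄ = Ʃfi·Mi / n   and   s^2 = Ʃfi·(Mi - x̄)^2 / (n - 1)
from math import sqrt

freq      = [8, 17, 12, 8, 7, 4, 2, 4, 2, 6]                 # class frequencies fi
midpoints = [429.5, 449.5, 469.5, 489.5, 509.5,
             529.5, 549.5, 569.5, 589.5, 609.5]              # class midpoints Mi

n = sum(freq)                                                # 70 apartments
mean = sum(f * m for f, m in zip(freq, midpoints)) / n       # ≈ 493.21
var = sum(f * (m - mean) ** 2 for f, m in zip(freq, midpoints)) / (n - 1)   # ≈ 3,017.89
std = sqrt(var)                                              # ≈ 54.94

print(round(mean, 2), round(var, 2), round(std, 2))
```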
