Notes For Correlation Unit - 3 Business Statistics
CORRELATION ANALYSIS
CORRELATION/ COVARIATION
It is the relationship that exists between two or more variables. If two variables are
related to each other in such a way that change in one creates a corresponding
change in other, then the variables are said to be correlated.
Example
Types of Correlation
Positive correlation
If both the variables vary in the same direction, correlation is said to be positive.
If one variable increases, the other increases or, if one variable decreases, the other
decreases, then the correlation between the two variables is said to be positive
correlation.
Example
As height is increasing we observe that the weight is also increasing. Hence, the
direction of change is same. Height and weight share a positive correlation.
Negative correlation
If both the variables vary in the opposite direction, correlation is said to be
negative.
If one variable increases, the other decreases or, if one variable decreases, the other
increases, then the correlation between the two variables is said to be negative
correlation.
Example
Price (Rs.) 5 4 3 2 1
Demand (units) 100 200 300 400 500
Simple Correlation
When only two variables are studied, it is the case of simple correlation.
Example
Relationship between the wheat output per acre and the amount of rainfall.
Multiple Correlation
When three or more variables are studied, it is a case of multiple correlation.
Example
Relationship between the wheat output per acre, amount of rainfall and the amount
of fertilizers used.
They are of two types: Partial or Total
Partial Multiple Correlation
Three or more variables are studied, but only two are considered to be influencing
each other, with the effect of the other influencing variables held constant. The
order of the partial correlation depends on the number of variables held constant,
i.e. if one variable is held constant, it is called first order partial correlation.
Total Multiple Correlation
Three or more variables are studied together, without holding the effect of any
variable constant.
3. Depending upon the constancy of the ratio of change between the variables:
Linear Correlation
Non-Linear / Curvilinear Correlation
Linear Correlation
Correlation is linear if the amount of change in one variable bears a constant ratio
to the amount of change in the other variable.
If such variables are plotted, all the points fall on a straight line.
Example
Milk (L) 10 20 30 40 50
Cheese (Kg) 2 4 6 8 10
The ratio of the change in milk quantity to the change in cheese quantity is
constant at 10:2.
Non-Linear / Curvilinear Correlation
Correlation is non-linear if the amount of change in one variable does not bear a
constant ratio to the amount of change in the other variable.
If such variables are plotted, the points fall on a curve and not on a straight line.
Example
Advertising Expenditure 2 4 6 8 10
Sales 10 12 15 15 16
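The difference between the two examples above can be checked numerically. The following is a minimal Python sketch; the helper name `change_ratios` is an illustrative choice, not part of the notes. It computes the ratio of change in the second variable to change in the first between consecutive observations.

```python
def change_ratios(xs, ys):
    """Ratio (change in y) / (change in x) between consecutive observations."""
    pairs = list(zip(xs, ys))
    return [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pairs, pairs[1:])]

milk = [10, 20, 30, 40, 50]
cheese = [2, 4, 6, 8, 10]
advertising = [2, 4, 6, 8, 10]
sales = [10, 12, 15, 15, 16]

linear_ratios = change_ratios(milk, cheese)
nonlinear_ratios = change_ratios(advertising, sales)

print(linear_ratios)     # [0.2, 0.2, 0.2, 0.2] -> constant ratio, linear
print(nonlinear_ratios)  # [1.0, 1.5, 0.0, 0.5] -> varying ratio, non-linear
```

A constant list of ratios corresponds to points on a straight line; a varying list corresponds to a curve.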
Method of Studying Correlation
On the other hand, a situation where you might find a strong but not perfect
positive correlation would be if you examined the number of hours students spent
studying for an exam versus the grade received. This won't be a perfect correlation
because two people could spend the same amount of time studying and get
different grades. But in general the rule will hold true that as the amount of time
studying increases so does the grade received.
Limitations of a Scatter Diagram
Question (2014)
Question (2015)
There are all sorts of correlations we can look at. Sometimes variables increase or
decrease over time. For example, the earth’s temperature is increasing over
time. So are the levels of greenhouse gases. If you run a correlation analysis on
these two variables, you will find that global temperature correlates strongly to the
level of greenhouse gases. But does this mean that one is the cause of the
other? Not necessarily. When two variables are trending up or down, a correlation
analysis will often show there is a significant relationship – simply because of the
trend – not necessarily because there is a cause and effect relationship between the
two variables.
COVARIANCE
Given a set of N pairs of observations (X1, Y1), (X2, Y2), (X3, Y3), ..., (XN, YN)
relating to two variables X and Y, the covariance of X and Y, usually represented
by Cov(X, Y), is defined as:
$$\mathrm{Cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N} = \frac{\sum xy}{N}$$
where $x = X - \bar{X}$ and $y = Y - \bar{Y}$.
The covariance is a measure for how two variables are related to each other, i.e.,
how two variables vary with each other.
Example
X 1 2 3 4 5
Y 10 20 30 50 40
Solution
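Here x = X - X̄ and y = Y - Ȳ.
X       x       Y       y       xy
1      -2      10     -20       40
2      -1      20     -10       10
3       0      30       0        0
4       1      50      20       20
5       2      40      10       20
∑X = 15   ∑x = 0   ∑Y = 150   ∑y = 0   ∑xy = 90
N = number of pairs = 5
$$\bar{X} = \frac{\sum X}{N} = \frac{15}{5} = 3$$
$$\bar{Y} = \frac{\sum Y}{N} = \frac{150}{5} = 30$$
$$\mathrm{Cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N} = \frac{\sum xy}{N} = \frac{90}{5} = 18$$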
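The calculation above can be reproduced with a few lines of Python. This is a minimal sketch that divides by N, as the definition in these notes does; the function name `covariance` is an illustrative choice.

```python
def covariance(xs, ys):
    """Covariance as defined in the notes: sum of products of deviations / N."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

X = [1, 2, 3, 4, 5]
Y = [10, 20, 30, 50, 40]

cov_xy = covariance(X, Y)
print(cov_xy)  # 18.0
```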
Limitation
Covariance is a direct measure of how two variables vary together, but it cannot
be used for meaningfully measuring the strength of the relationship between the
two variables. This is because covariance is unbounded: it can take any negative
value, any positive value, or zero, and its size depends on the units in which
X and Y are measured.
Given a set of N pairs of Observations (X1, Y1), (X2, Y2), (X3, Y3), ……….., (XN,
YN) relating to two variables X and Y, the Coefficient of correlation between X
and Y, usually represented by r, is defined as:
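$$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
where
Cov(X, Y) = Covariance of X and Y
$\sigma_X$ = Standard Deviation of X
$\sigma_Y$ = Standard Deviation of Y
$$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y} = \frac{\frac{\sum (X - \bar{X})(Y - \bar{Y})}{N}}{\sqrt{\frac{\sum (X - \bar{X})^2}{N}} \sqrt{\frac{\sum (Y - \bar{Y})^2}{N}}} = \frac{\frac{\sum xy}{N}}{\sqrt{\frac{\sum x^2}{N}} \sqrt{\frac{\sum y^2}{N}}} = \frac{\sum xy}{\sqrt{\sum x^2} \sqrt{\sum y^2}} = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$$
$$\Longrightarrow \; r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$$
where $x = X - \bar{X}$ and $y = Y - \bar{Y}$.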
By dividing the covariance by the standard deviations of the two variables, we
ensure that the correlation lies in the range [-1, 1], which makes it more
interpretable than the unbounded covariance. However, note that the covariance
and the correlation are exactly the same if the variables are normalized to unit
variance (e.g., via standardization or z-score normalization). Two variables are
perfectly positively correlated if r = 1 and perfectly negatively correlated if
r = -1. No linear correlation is observed if r = 0.
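The statement that covariance and correlation coincide for standardized variables can be illustrated numerically. The sketch below reuses the X and Y values from the covariance example and assumes the population standard deviation (dividing by N), to stay consistent with the covariance definition in these notes; all function names are illustrative.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pstdev(values):
    """Population standard deviation (divides by N)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def covariance(xs, ys):
    """Population covariance (divides by N)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def standardize(values):
    """Convert values to z-scores: zero mean, unit variance."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]

X = [1, 2, 3, 4, 5]
Y = [10, 20, 30, 50, 40]

r = covariance(X, Y) / (pstdev(X) * pstdev(Y))
cov_z = covariance(standardize(X), standardize(Y))

print(round(r, 4), round(cov_z, 4))  # 0.9 0.9
```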
Independent of the choice of origin: The value of r is not affected even if each of
the individual values of X and Y is increased or decreased by some constant.
Independent of the choice of scale: The value of r is not affected if each of the
individual values of X and Y is multiplied or divided by some positive constant.
(If the constants used for X and Y have opposite signs, only the sign of r changes,
not its magnitude.)
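The effect of a change of origin or scale on r can be checked directly. The following Python sketch uses the X and Y values from the covariance example; the function name `pearson_r` and the particular shift and scale constants are arbitrary illustrative choices. In this check, r is unchanged by adding constants and by multiplying with positive constants, while multiplying one variable by a negative constant only reverses the sign.

```python
import math

def pearson_r(xs, ys):
    """r = sum(xy) / sqrt(sum(x^2) * sum(y^2)), with x, y as deviations from the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mx) ** 2 for x in xs)
    sum_y2 = sum((y - my) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

X = [1, 2, 3, 4, 5]
Y = [10, 20, 30, 50, 40]

base = pearson_r(X, Y)
shifted = pearson_r([x + 100 for x in X], [y - 7 for y in Y])  # change of origin
scaled = pearson_r([x * 3 for x in X], [y / 10 for y in Y])    # positive scale factors
flipped = pearson_r([x * -3 for x in X], Y)                    # negative scale on X only

print(round(base, 4), round(shifted, 4), round(scaled, 4), round(flipped, 4))
# 0.9 0.9 0.9 -0.9
```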
Interpretation
Example
X 1 2 3 4 5
Y 10 20 30 50 40
Solution
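Here x = X - X̄ and y = Y - Ȳ.
X       x       Y       y       xy      x2      y2
1      -2      10     -20       40       4     400
2      -1      20     -10       10       1     100
3       0      30       0        0       0       0
4       1      50      20       20       1     400
5       2      40      10       20       4     100
∑X = 15   ∑x = 0   ∑Y = 150   ∑y = 0   ∑xy = 90   ∑x2 = 10   ∑y2 = 1000
N = number of pairs = 5
$$\bar{X} = \frac{\sum X}{N} = \frac{15}{5} = 3$$
$$\bar{Y} = \frac{\sum Y}{N} = \frac{150}{5} = 30$$
$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{90}{\sqrt{10 \times 1000}} = \frac{90}{100} = 0.9$$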
Example
Calculate the correlation coefficient from the following data:
X 6 8 12 15 18 20 24 28 31
Y 10 12 15 15 18 25 22 26 28
Solution
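Here x = X - X̄ and y = Y - Ȳ.
X       Y       x       y       xy      x2      y2
6      10     -12      -9      108     144      81
8      12     -10      -7       70     100      49
12     15      -6      -4       24      36      16
15     15      -3      -4       12       9      16
18     18       0      -1        0       0       1
20     25       2       6       12       4      36
24     22       6       3       18      36       9
28     26      10       7       70     100      49
31     28      13       9      117     169      81
∑X = 162   ∑Y = 171   ∑x = 0   ∑y = 0   ∑xy = 431   ∑x2 = 598   ∑y2 = 338
N = number of pairs = 9
$$\bar{X} = \frac{\sum X}{N} = \frac{162}{9} = 18$$
$$\bar{Y} = \frac{\sum Y}{N} = \frac{171}{9} = 19$$
$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{431}{\sqrt{598 \times 338}} = \frac{431}{449.58} = 0.959$$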
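The column totals and the final value of r for this example can be verified with a short script. This is a sketch that follows the deviation method used in the solution; the variable names are illustrative.

```python
import math

X = [6, 8, 12, 15, 18, 20, 24, 28, 31]
Y = [10, 12, 15, 15, 18, 25, 22, 26, 28]

n = len(X)
mean_x = sum(X) / n  # 18.0
mean_y = sum(Y) / n  # 19.0

# Deviations from the means (the x and y columns of the table)
x = [v - mean_x for v in X]
y = [v - mean_y for v in Y]

sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

r = sum_xy / math.sqrt(sum_x2 * sum_y2)

print(sum_xy, sum_x2, sum_y2)  # 431.0 598.0 338.0
print(round(r, 3))             # 0.959
```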
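Example
Age (years)        15    16    17    18    19    20
No. of students   250   200   150   120   100    80
Regular players   200   150    90    48    30    12
Let us first find the percentage of regular players and then calculate the coefficient
of correlation between the age and the percentage so obtained.
Solution
Let X = age and Y = percentage of regular players, with x = X - X̄ and y = Y - Ȳ.
X     Y (%)                      x       y       xy       x2       y2
15    200/250 × 100 = 80       -2.5     30     -75       6.25     900
16    150/200 × 100 = 75       -1.5     25     -37.5     2.25     625
17    90/150 × 100 = 60        -0.5     10      -5       0.25     100
18    48/120 × 100 = 40         0.5    -10      -5       0.25     100
19    30/100 × 100 = 30         1.5    -20     -30       2.25     400
20    12/80 × 100 = 15          2.5    -35     -87.5     6.25    1225
∑xy = -240   ∑x2 = 17.5   ∑y2 = 3350
$\bar{X} = 17.5$, $\bar{Y} = 50$
$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{-240}{\sqrt{17.5 \times 3350}} = \frac{-240}{242.13} = -0.991$$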
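The two steps of this example (convert counts to percentages, then correlate age with the percentage) can be sketched in Python as follows. The variable and function names are illustrative assumptions.

```python
import math

ages = [15, 16, 17, 18, 19, 20]
students = [250, 200, 150, 120, 100, 80]
players = [200, 150, 90, 48, 30, 12]

# Step 1: percentage of regular players in each age group
percent = [100 * p / s for p, s in zip(players, students)]

def pearson_r(xs, ys):
    """Deviation-method correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sum_xy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sum_x2 = sum((a - mx) ** 2 for a in xs)
    sum_y2 = sum((b - my) ** 2 for b in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

# Step 2: correlation between age and percentage of regular players
r = pearson_r(ages, percent)

print(percent)      # [80.0, 75.0, 60.0, 40.0, 30.0, 15.0]
print(round(r, 3))  # -0.991
```

The strongly negative value says that, in this data, the share of regular players falls as age rises.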
Question
Advantage
Gives direction as well as the degree of the relationship between the variables.
Helps in estimating the value of the dependent variables from the known value of
independent variables.
Limitations
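RANK CORRELATION
1. It uses ranks rather than actual observations. The correlation coefficient between
two series of ranks is called the Rank Correlation Coefficient:
$$R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$$
where D is the difference between the two ranks of an item and N is the number of
pairs.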
2. It is suitable for qualitative data, i.e. for association between variables that
cannot be quantified but can only be ranked in some order. Example: it is
possible for two judges to rank 10 girls by preference in terms of beauty,
whereas it may be difficult to give them numerical grades in terms of beauty.
When to Use
Steps
Example
(2015)
Two judges in a beauty competition rank the 12 entries as follows:
X 1 2 3 4 5 6 7 8 9 10 11 12
Y 12 9 6 10 3 5 4 7 8 2 11 1
What degree of agreement is there between the judgement of the two judges?
Solution
X = R1    Y = R2    D = R1 - R2    D2
1         12        -11            121
2         9         -7             49
3         6         -3             9
4         10        -6             36
5         3         2              4
6         5         1              1
7         4         3              9
8         7         1              1
9         8         1              1
10        2         8              64
11        11        0              0
12        1         11             121
∑D2 = 416
N = 12
$$R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 416}{12(12^2 - 1)} = 1 - \frac{2496}{1716} = 1 - 1.4545 = -0.4545$$
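When the data are already ranks, as in this example, the rank correlation follows directly from the squared rank differences. The sketch below applies the formula R = 1 - 6∑D²/(N(N² - 1)) to the two judges' rankings; the function name `rank_correlation` is an illustrative choice.

```python
def rank_correlation(ranks_1, ranks_2):
    """R = 1 - 6 * sum(D^2) / (N * (N^2 - 1)) for two untied rankings."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

judge_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
judge_y = [12, 9, 6, 10, 3, 5, 4, 7, 8, 2, 11, 1]

sum_d2 = sum((a - b) ** 2 for a, b in zip(judge_x, judge_y))
R = rank_correlation(judge_x, judge_y)

print(sum_d2)       # 416
print(round(R, 4))  # -0.4545
```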
Example
X 1 2 3 4 5
Y 5 4 3 2 1
Z 3 5 2 1 4
Which pair of judges has the nearest approach to common tastes in beauty.
Solution
N=5
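Rank correlation between the judgement of the first and second judges (∑D2 = 40):
$$R_{XY} = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 40}{5(5^2 - 1)} = 1 - \frac{240}{120} = 1 - 2 = -1$$
Rank correlation between the judgement of the second and third judges (∑D2 = 16):
$$R_{YZ} = 1 - \frac{6 \times 16}{5(5^2 - 1)} = 1 - \frac{96}{120} = 1 - 0.8 = 0.2$$
Rank correlation between the judgement of the first and third judges (∑D2 = 24):
$$R_{XZ} = 1 - \frac{6 \times 24}{5(5^2 - 1)} = 1 - \frac{144}{120} = 1 - 1.2 = -0.2$$
The second and third judges have the nearest approach to common tastes in beauty,
since theirs is the only pair for which the correlation coefficient is positive.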
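The comparison of all three pairs of judges can be automated. The following Python sketch computes the rank correlation for each pair and picks the pair with the highest coefficient; the dictionary layout and function name are illustrative assumptions.

```python
from itertools import combinations

def rank_correlation(ranks_1, ranks_2):
    """R = 1 - 6 * sum(D^2) / (N * (N^2 - 1)) for two untied rankings."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

ranks = {
    "X": [1, 2, 3, 4, 5],
    "Y": [5, 4, 3, 2, 1],
    "Z": [3, 5, 2, 1, 4],
}

# Rank correlation for every pair of judges
results = {
    (a, b): rank_correlation(ranks[a], ranks[b])
    for a, b in combinations(ranks, 2)
}

# The pair with the largest coefficient agrees most closely
closest_pair = max(results, key=results.get)

for pair, value in results.items():
    print(pair, round(value, 2))
print("Closest pair:", closest_pair)
```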
Example
X 59 69 39 49 29
Y 79 69 59 49 39
Solution
X Y R1 R2 D D2
59 79 4 5 -1 1
69 69 5 4 1 1
39 59 2 3 -1 1
49 49 3 2 1 1
29 39 1 1 0 0
∑D2 = 4
N=5
Rank correlation
$$R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 4}{5(5^2 - 1)} = 1 - \frac{24}{120} = 1 - 0.2 = 0.8$$
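When actual values are given instead of ranks, the ranks must be assigned first. The sketch below ranks each series in ascending order (rank 1 for the smallest value, as in the table above) and assumes there are no tied values; the function names are illustrative.

```python
def ascending_ranks(values):
    """Rank 1 for the smallest value. Assumes all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def rank_correlation(ranks_1, ranks_2):
    """R = 1 - 6 * sum(D^2) / (N * (N^2 - 1)) for two untied rankings."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

X = [59, 69, 39, 49, 29]
Y = [79, 69, 59, 49, 39]

R1 = ascending_ranks(X)
R2 = ascending_ranks(Y)
R = rank_correlation(R1, R2)

print(R1)           # [4, 5, 2, 3, 1]
print(R2)           # [5, 4, 3, 2, 1]
print(round(R, 2))  # 0.8
```

Ranking both series in descending order instead would give the same value of R, since every difference D only changes sign.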
Question
Solution
X Y R1 R2 D D2
1111 1717 3 5 -2 4
2020 1919 7 6 1 1
2222 2323 8 8 0 0
1818 1616 5 4 1 1
1919 2020 6 7 -1 1
1111 1010 2 2 0 0
1010 1111 1 3 -2 4
1515 18 4 1 3 9
∑D2 = 20
N=8
Rank correlation
$$R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 20}{8(8^2 - 1)} = 1 - \frac{120}{504} = 1 - 0.238 = 0.762$$
Case3 : When values of some variables are equal
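In case there is more than one item with the same value in the series, the
average rank is usually allotted to each of these items, and the factor
$\frac{m^3 - m}{12}$ is added to $\sum D^2$ for each such group of tied items,
where m is the number of items sharing the value. Thus, in case of tied ranks, the
modified formula for the rank correlation coefficient becomes:
$$R = 1 - \frac{6 \left[ \sum D^2 + \frac{m_1^3 - m_1}{12} + \frac{m_2^3 - m_2}{12} + \cdots \right]}{N(N^2 - 1)}$$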
Example
X 49 69 39 49 29
Y 59 59 59 49 39
Solution
X     Y     R1     R2    D = R1 - R2    D2
49    59    3.5    4     -0.5           0.25
69    59    5      4     1              1
39    59    2      4     -2             4
49    49    3.5    2     1.5            2.25
29    39    1      1     0              0
∑D2 = 7.5
X has 49 two times (m = 2), so assign the average of the ranks at the 3rd and 4th
positions as the rank of 49:
(3 + 4)/2 = 7/2 = 3.5
Y has 59 three times (m = 3), so assign the average of the ranks at the 3rd, 4th and
5th positions as the rank of 59:
(3 + 4 + 5)/3 = 12/3 = 4
N = 5
$$R = 1 - \frac{6 \left[ \sum D^2 + \frac{m_1^3 - m_1}{12} + \frac{m_2^3 - m_2}{12} \right]}{N(N^2 - 1)} = 1 - \frac{6 \left[ 7.5 + \frac{2^3 - 2}{12} + \frac{3^3 - 3}{12} \right]}{5(5^2 - 1)} = 1 - \frac{6 [7.5 + 0.5 + 2]}{120}$$
$$= 1 - \frac{60}{120} = 1 - 0.5 = 0.5$$
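The tied-rank procedure (average ranks plus the correction factor (m³ - m)/12 for each group of tied values) can be sketched in Python as below. The function names are illustrative, and the code follows the modified formula as used in this worked example rather than any particular library routine.

```python
from collections import Counter

def average_ranks(values):
    """Ascending ranks; tied values share the average of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every value that occurs m > 1 times."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

def rank_correlation_tied(xs, ys):
    """R = 1 - 6 * [sum(D^2) + tie corrections] / (N * (N^2 - 1))."""
    n = len(xs)
    r1, r2 = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    adjusted = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n * (n ** 2 - 1))

X = [49, 69, 39, 49, 29]
Y = [59, 59, 59, 49, 39]

R1 = average_ranks(X)
R2 = average_ranks(Y)
R = rank_correlation_tied(X, Y)

print(R1)  # [3.5, 5.0, 2.0, 3.5, 1.0]
print(R2)  # [4.0, 4.0, 4.0, 2.0, 1.0]
print(R)   # 0.5
```

The same functions can be applied to the exercise data that follows, where 49 appears three times in X and both 59 and 39 repeat in Y.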
Exercise
X 49 69 39 49 29 49 79
Y 59 59 59 49 39 39 69
Questions
1. 16
2. 15
3. A. 9.6 B. 0.7 C. 49