Correlation and Regression: Jaipur National University
ASSIGNMENT
CORRELATION
Introduction to Correlation :
Meaning of Correlation:
In daily practice we come across a large number of problems involving the use of two or more variables. If two quantities vary in such a way that movements in one are accompanied by movements in the other, these quantities are said to be correlated. The degree of relationship between the variables under consideration is measured through correlation analysis. The measure of correlation, called the correlation coefficient or correlation index, summarizes in one figure the direction and degree of correlation.
Definitions of Correlation :
“Statistical inquiry into concomitant variation is correlation analysis.”
-Boddington
“If two or more quantities vary in sympathy so that movements in one tend to be accompanied by corresponding movements in the other, then they are said to be correlated.”
-L. R. Conner
“Correlation means that between two series or groups of data there exists some causal connection.”
-Prof. King
Most variables show some kind of relationship with one another. With the help of correlation analysis we can measure in one figure the degree of relationship existing between the variables.
Once we know that two variables are closely related, we can estimate the value of one variable given the value of the other. This is done with the help of regression analysis.
Correlation analysis contributes to the understanding of economic behaviour, aids in locating the critically important variables on which others depend, and may reveal to the economist the connections by which disturbances spread, suggesting the paths through which stabilizing forces may become effective.
Types of Correlation :
For example :
Relationship between the yield of rice per acre and both the amount of rainfall and the amount of fertilizers used.
[Graph: the plotted pairs fall on a single straight line]
It is clear that the ratio of change between the two variables is the same.
If such variables are plotted on a graph paper all the plotted points would fall
on a straight line.
1) If changes in two series of variables are in the same direction and having a
constant ratio, the correlation is linear positive.
2) If changes in two groups of variables are in an opposite direction in a
constant ratio, the correlation will be known as linear negative.
3) If changes in two groups of variables are in the same direction but not in a
constant ratio, the correlation is positive non-linear.
The various methods of ascertaining whether two variables are correlated or not are :
Scatter Diagram
Karl Pearson’s Coefficient of Correlation
Rank Correlation
Scatter diagram is the most elementary method of ascertaining the direction of the relationship between two variables. We take the independent variable on the X-axis and the dependent variable on the Y-axis and plot the data as a dotted chart, i.e. for each pair of X and Y values we put a dot, and thus obtain as many points as the number of observations. By looking at the scatter we can form an idea as to whether the variables are related or not.
The greater the scatter of the plotted points on the chart, the lesser is the relationship between the two variables. If all the points lie on a straight line rising from the lower left-hand corner to the upper right-hand corner, correlation is said to be perfectly positive (i.e. r = 1, as in Fig 1).
On the other hand, if all the points lie on a straight line falling from the upper left-hand corner to the lower right-hand corner of the diagram, correlation is said to be perfectly negative (i.e. r = -1, as in Fig 5).
If the plotted points fall in a narrow band there would be a high degree of
correlation between the variables- correlation shall be positive if the points
show a tendency from the lower left-hand corner to the upper-right hand corner
(0<r<1 as in Fig 2) and negative if the points show a declining tendency from
the upper left-hand corner to the lower right-hand corner of the diagram(-1<r<0
as in Fig 4).
On the other hand, if the points are widely scattered over the diagram it
indicates very little relationship between the variables.
If the plotted points lie on a straight line parallel to the X-axis, or are placed in a haphazard manner, it shows the absence of any relationship between the variables (i.e. r = 0, as in Fig 3).
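The three cases above (r = 1, r = -1, r = 0) can be checked numerically. Below is a minimal sketch in Python; the helper name `pearson_r` and the data points are our own, chosen only to reproduce the patterns of Figs 1, 5 and 3:

```python
def pearson_r(x, y):
    """Karl Pearson's coefficient of correlation, r = COV(X, Y) / (SD(X) * SD(Y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n                       # actual means
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # sum of products of deviations
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Points on a line rising from lower left to upper right -> perfectly positive
print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
# Points on a line falling from upper left to lower right -> perfectly negative
print(pearson_r([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0
# Points with no linear tendency -> r = 0
print(pearson_r([1, 2, 3, 4], [1, -1, -1, 1]))    # 0.0
```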
Illustration 1 :
Represent the following sales and advertisement data by scatter diagram and
comment whether there is any correlation between them :
Scatter Diagram
[Scatter diagram of the sales (X) and advertisement (Y) data]
On inspecting the plotted points it is clear that they run from the lower left-hand side to the upper right-hand side and do not scatter much. Therefore, the two variables have a high degree of positive correlation.
Also,
COV(X, Y) = ∑(X − X̄)(Y − Ȳ) / n
SD(X) = √[∑(X − X̄)² / n] = √COV(X, X)
SD(Y) = √[∑(Y − Ȳ)² / n] = √COV(Y, Y)      (n = number of observations)
Note : The absolute value of the covariance of two variables can never exceed the product of their standard deviations, so the value of the coefficient of correlation always lies between -1 and 1 :
-1 ≤ r ≤ 1
So substituting these values in the following equation, we get :
r = COV(X, Y) / [SD(X) . SD(Y)]
  = [∑(X − X̄)(Y − Ȳ) / n] / [√(∑(X − X̄)² / n) . √(∑(Y − Ȳ)² / n)]
  = ∑(X − X̄)(Y − Ȳ) / √[∑(X − X̄)² . ∑(Y − Ȳ)²]      ...(ii)
Or, writing ∂(X) = X − X̄ and ∂(Y) = Y − Ȳ for the deviations from the actual means,
r = ∑∂(X) . ∂(Y) / √[∑{∂(X)}² . ∑{∂(Y)}²]
Expanding the covariance,
COV(X, Y) = ∑(X − X̄)(Y − Ȳ) / n = (1/n)(∑XY − nX̄Ȳ) = ∑XY / n − X̄Ȳ
Note : Because ∑XȲ = Ȳ∑X, and X̄ = ∑X / n can also be written as nX̄ = ∑X, we have ∑XȲ = nX̄Ȳ (and similarly ∑X̄Y = nX̄Ȳ).
And again,
COV(X, Y) = ∑XY / n − X̄Ȳ = ∑XY / n − (∑X / n)(∑Y / n)
Therefore, COV(X, Y) = [n∑XY − ∑X . ∑Y] / n²
Also, SD(X) = √COV(X, X) = √[∑X² / n − (∑X / n)²] = √[n∑X² − (∑X)²] / n
SD(Y) = √COV(Y, Y) = √[∑Y² / n − (∑Y / n)²] = √[n∑Y² − (∑Y)²] / n
So, r = {[n∑XY − ∑X . ∑Y] / n²} / {(√[n∑X² − (∑X)²] / n) . (√[n∑Y² − (∑Y)²] / n)}, or we can say that Karl Pearson’s coefficient of correlation can also be determined by the following formula :
r = [n∑XY − ∑X . ∑Y] / {√[n∑X² − (∑X)²] . √[n∑Y² − (∑Y)²]}      ...(iii)
This is the derivation of the DIRECT METHOD of finding out the coefficient of correlation, and this method is applied only where deviations of items are taken from the actual mean and not from an assumed mean. Of the three formulas derived above for the correlation coefficient, any may be used to determine the value of “r” when the actual mean is used for obtaining deviations. This will become clearer from the following illustration.
Steps to follow :
o Find out the actual mean of both the series i.e., X̄ and Ȳ.
o Take the deviations of both the series from their actual means and sum them up separately i.e., firstly X − X̄ and Y − Ȳ and then, ∑(X − X̄) and ∑(Y − Ȳ).
o Now find the squares of these deviations and then get their individual totals i.e., firstly (X − X̄)² and (Y − Ȳ)² and then ∑(X − X̄)² and ∑(Y − Ȳ)².
o Then substitute the values in the stated formula and correlation
coefficient is so obtained.
Illustration 2 :
No. of Unemployed : 15 12 13 11 12 12 19 26
Now by applying the formula :
r = ∑(X − X̄)(Y − Ȳ) / [√∑(X − X̄)² . √∑(Y − Ȳ)²]
  = -92 / √(120 × 184) = -0.619
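With the sums already obtained in the illustration (∑(X − X̄)(Y − Ȳ) = -92, ∑(X − X̄)² = 120, ∑(Y − Ȳ)² = 184), the arithmetic can be verified in a couple of lines of Python:

```python
from math import sqrt

# sums of deviations from the actual means, as given in Illustration 2
sxy, sxx, syy = -92, 120, 184
r = sxy / sqrt(sxx * syy)
print(round(r, 3))  # -0.619
```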
Now in case the deviations of items are taken from an assumed mean, we use the assumed mean method for finding out the coefficient of correlation. The formula for this method is :
r = [n∑∂(X) . ∂(Y) − ∑∂(X) . ∑∂(Y)] / {√[n∑{∂(X)}² − {∑∂(X)}²] . √[n∑{∂(Y)}² − {∑∂(Y)}²]}
where ∂(X) and ∂(Y) now denote deviations from the assumed means.
Steps to follow :
o Take the deviations of both the X & Y series from their assumed means and obtain their totals separately i.e. firstly ∂(X) and ∂(Y) and then, ∑∂(X) and ∑∂(Y).
o Then, obtain the squares of the deviations ∂(X) and ∂(Y) i.e., {∂(X)}² and {∂(Y)}².
o Also, obtain the squares of the totals of the deviations i.e., {∑∂(X)}² and {∑∂(Y)}².
o Now obtain the sum of the products of the individual deviations i.e., ∑∂(X) . ∂(Y).
o And now substitute the values in the stated formula and the resultant is
the required correlation coefficient.
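The steps above can be sketched in Python. The data and the function name below are our own; the point of the example is that the choice of assumed means (here 15 and 5, or any other values) drops out of the formula, so the result agrees with the actual-mean method:

```python
def r_assumed_mean(X, Y, ax, ay):
    """Pearson's r from deviations taken from assumed means ax and ay."""
    n = len(X)
    dx = [x - ax for x in X]   # deviations of X from its assumed mean
    dy = [y - ay for y in Y]   # deviations of Y from its assumed mean
    num = n * sum(p * q for p, q in zip(dx, dy)) - sum(dx) * sum(dy)
    den = ((n * sum(p * p for p in dx) - sum(dx) ** 2) ** 0.5 *
           (n * sum(q * q for q in dy) - sum(dy) ** 2) ** 0.5)
    return num / den

X = [15, 12, 13, 11, 12, 12, 19, 26]  # hypothetical data
Y = [3, 5, 4, 6, 5, 5, 2, 1]
# any choice of assumed means gives the same coefficient
print(abs(r_assumed_mean(X, Y, 15, 5) - r_assumed_mean(X, Y, 0, 0)) < 1e-9)  # True
```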
Illustration 3 :
The following table gives the distribution of items of production and also the
relatively defective items among them, according to the size-groups. Find the
correlation coefficient between the size and defect in quality and its probable
error.
Group (size) | Avg. size (X) = (LL+UL)/2 | ∂(X) = X − 17.5 | {∂(X)}² | % of defective items (Y) = (No. of defective items / No. of items) × 100 | ∂(Y) = Y − 50 | {∂(Y)}² | ∂(X) . ∂(Y)
15-16 | 15.5 | -2 | 4 | 75 | 25 | 625 | -50
16-17 | 16.5 | -1 | 1 | 60 | 10 | 100 | -10
17-18 | 17.5 | 0 | 0 | 50 | 0 | 0 | 0
18-19 | 18.5 | 1 | 1 | 50 | 0 | 0 | 0
19-20 | 19.5 | 2 | 4 | 45 | -5 | 25 | -10
20-21 | 20.5 | 3 | 9 | 38 | -12 | 144 | -36
∑ |  | 3 | 19 |  | 18 | 894 | -106
P(E) = (2/3) . (1 − r²) / √n
Also, P(E) = 0.6745 (1 − r²) / √n
Where, n = no. of pairs of observations
r = coefficient of correlation
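Using the column totals of Illustration 3 (n = 6, ∑∂(X) = 3, ∑∂(Y) = 18, ∑{∂(X)}² = 19, ∑{∂(Y)}² = 894, ∑∂(X).∂(Y) = -106), a short Python check of r and its probable error (the variable names are ours):

```python
from math import sqrt

# column totals from Illustration 3 (assumed-mean method)
n = 6
sdx, sdy = 3, 18
sdx2, sdy2 = 19, 894
sdxdy = -106

r = (n * sdxdy - sdx * sdy) / (sqrt(n * sdx2 - sdx ** 2) * sqrt(n * sdy2 - sdy ** 2))
pe = 0.6745 * (1 - r * r) / sqrt(n)  # probable error of r
print(round(r, 4), round(pe, 4))  # -0.9485 0.0276
```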
(a) When ranks are not given : For quantitative data, ranks are not given and we have to assign them.
(b) When ranks are given : For qualitative data, ranks are already given.
Again, in either case the data can be in two conditions – without repeated (tied) values, or with repeated values. When no value is repeated, the rank correlation coefficient is :
R = 1 − 6∑d² / [n(n² − 1)]
where n = no. of items and d = the difference between the two ranks assigned to the same item.
Steps to follow :
Note : In case of qualitative data, ranks are already given. So there we don’t
need to go for the first step and rest of the procedure is same.
Illustration 4 : (for qualitative data)
Two ladies were asked to rank 7 different types of lipsticks. The ranks given by them are as follows :
Lipsticks : A B C D E F G
Neelu : 2 1 4 3 5 7 6
Neena : 1 3 2 4 5 6 7
Here d = 1, -2, 2, -1, 0, 1, -1, so ∑d² = 12.
R = 1 − 6∑d² / [n(n² − 1)]
  = 1 − (6 × 12) / [7(7² − 1)] = 1 − 72/336 = 0.786
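The same computation as a small Python function (the function name is ours):

```python
def rank_correlation(r1, r2):
    """Spearman's R = 1 - 6*sum(d^2) / (n*(n^2 - 1)) for untied ranks."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n * n - 1))

neelu = [2, 1, 4, 3, 5, 7, 6]
neena = [1, 3, 2, 4, 5, 6, 7]
print(round(rank_correlation(neelu, neena), 3))  # 0.786
```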
Illustration 5: (for quantitative data)
Using rank correlation method, find out the relationship between debenture
prices and share prices.
R = 1 − 6∑d² / [n(n² − 1)]
  = 1 − (6 × 62) / [7(7² − 1)] = 1 − 372/336 = -0.107
When ranks are tied, a correction term is added for each repeated value :
R = 1 − 6[∑d² + m(m² − 1)/12 + m(m² − 1)/12 + …] / [n(n² − 1)]
where n = no. of items and m = the number of times an item is repeated.
Steps to follow :
Illustration 6:
Obtain the rank correlation coefficient between the variables X and Y from the
following pairs of observed values :
X : 50 55 65 50 55 60 50 65 70 75
Y : 110 110 115 125 140 115 130 120 115 160
X RX Y RY d=R X−R Y d2
50 2 110 1.5 0.5 0.25
55 4.5 110 1.5 3 9
65 7.5 115 4 3.5 12.25
50 2 125 7 -5 25
55 4.5 140 9 -4.5 20.25
60 6 115 4 2 4
50 2 130 8 -6 36
65 7.5 120 6 1.5 2.25
70 9 115 4 5 25
75 10 160 10 0 0
∑ d 2=134
In the series X, 50 is repeated 3 times (m = 3), 55 is repeated 2 times (m = 2), and 65 is also repeated 2 times (m = 2). In series Y, 110 is repeated 2 times (m = 2) and 115 is repeated 3 times (m = 3).
R = 1 − 6[∑d² + m(m² − 1)/12 + …] / [n(n² − 1)]
  = 1 − 6[134 + 3(3² − 1)/12 + 2(2² − 1)/12 + 2(2² − 1)/12 + 2(2² − 1)/12 + 3(3² − 1)/12] / [10(10² − 1)]
  = 1 − 6[134 + 2 + 0.5 + 0.5 + 0.5 + 2] / 990
  = 1 − 837/990
R = 0.155
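Illustration 6 can be reproduced end to end in Python, including the assignment of average ranks to repeated values and the m(m² − 1)/12 correction terms (the helper names are ours):

```python
from collections import Counter

def avg_ranks(series):
    """Assign ranks, giving tied values the average of the ranks they occupy."""
    srt = sorted(series)
    return [srt.index(v) + (srt.count(v) + 1) / 2 for v in series]

def tie_correction(series):
    """Sum of m*(m^2 - 1)/12 over every value repeated m > 1 times."""
    return sum(m * (m * m - 1) / 12 for m in Counter(series).values() if m > 1)

X = [50, 55, 65, 50, 55, 60, 50, 65, 70, 75]
Y = [110, 110, 115, 125, 140, 115, 130, 120, 115, 160]
n = len(X)
d2 = sum((rx - ry) ** 2 for rx, ry in zip(avg_ranks(X), avg_ranks(Y)))
R = 1 - 6 * (d2 + tie_correction(X) + tie_correction(Y)) / (n * (n * n - 1))
print(d2, round(R, 3))  # 134.0 0.155
```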
This method cannot be used for finding out correlation in a grouped frequency distribution.
Where the number of items exceeds 30, the calculations become quite tedious and require a lot of time. So this method should not be used where “n” exceeds 30 unless we are given the ranks and not the actual values of the variables.
REGRESSION ANALYSIS
Introduction to Regression :
For example :
If we know that the yield of rice and rainfall are closely related, we may find
out the amount of rain required to achieve a certain production figure by
regression.
Regression is that method of statistical analysis with the help of which the value
of other series can be estimated from the known value of one series. Regression
analysis reveals average relationship between two variables and this makes the
estimation or prediction possible.
Meaning of Regression :
The dictionary meaning of the term regression is the ‘act of returning’ or ‘going back’. The meaning of regression is just the reverse of progression: progression, in general, means to move forward, while regression means to move backward or, in statistical terms, the return to the mean value. Regression is a
statistical technique to construct a mathematical relationship in the form of
equations between two correlated variables. It is a statistical device with the
help of which we are in a position to estimate the unknown values of one
variable from known values of another variable. The variable which is used to
predict the variable of interest is called the independent variable or
explanatory variable denoted by X and the variable we are trying to predict is
called the dependent variable or explained variable denoted by Y. The
analysis used is called the simple linear regression analysis – simple because
there is only one predictor or independent variable and linear because of the
assumed linear relationship between the dependent and independent variables.
The term “linear” means that the equation of a straight line of the form Y = a + bX, where a and b are constants, is used to describe the average relationship that exists between the two variables.
Definitions of Regression :
-Taro Yamane
Utility of Regression Analysis :
For example :
1. If we know that the two variables, price(X) and demand(Y), are closely
related we can find out the most probable value of X for a given value of
Y or the most probable value of Y for the given value of X.
2. If we know that the amount of tax and the rise in price of the commodity
are closely related, we can find out the expected price for a certain
amount of tax levy.
Regression lines :
The lines of best fit drawn to show the mutual relationship between the X and Y variables are known as Regression Lines. Every linear regression problem has two lines on the same graph, one representing the regression of X on Y and the other of Y on X, each minimizing the deviations in one of the two variables. These two regression lines always intersect each other at (X̄, Ȳ), i.e., at the means of the two series.
Regression Equation of Y on X :
Regression equation of Y on X is : (Y − Ȳ) = r . [SD(Y) / SD(X)] . (X − X̄)
Where, X̄ and Ȳ are the actual means of the X and Y variables.
Regression Equation of X on Y :
Similarly, the regression equation of X on Y is : (X − X̄) = r . [SD(X) / SD(Y)] . (Y − Ȳ)
Illustration 7:
The following data gives the experience of machine operators and their performance ratings as given by the number of good parts turned out per 100 pieces :
Operator : 1 2 3 4 5 6 7 8
Experience : 16 12 18 4 3 10 5 12
Performance Rating : 87 88 89 68 78 80 75 83
Regression eq. of Y on X : (Y − Ȳ) = [∑(X − X̄)(Y − Ȳ) / ∑(X − X̄)²] . (X − X̄)
(Y − 81) = (247/218)(X − 10)
So we get, Y = 69.67 + 1.133X
Regression eq. of X on Y : (X − X̄) = [∑(X − X̄)(Y − Ȳ) / ∑(Y − Ȳ)²] . (Y − Ȳ)
(X − 10) = (247/368)(Y − 81)
So we get, X = -44.37 + 0.671Y
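Both regression lines of Illustration 7 can be recomputed in Python. The script below verifies the sums used above (247, 218, 368) and the fitted equations; the variable names are our own:

```python
X = [16, 12, 18, 4, 3, 10, 5, 12]     # experience
Y = [87, 88, 89, 68, 78, 80, 75, 83]  # performance rating
n = len(X)
mx, my = sum(X) / n, sum(Y) / n                       # means: 10.0, 81.0
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))  # 247.0
sxx = sum((x - mx) ** 2 for x in X)                   # 218.0
syy = sum((y - my) ** 2 for y in Y)                   # 368.0

b_yx = sxy / sxx  # slope of the regression of Y on X
b_xy = sxy / syy  # slope of the regression of X on Y

# Y on X: Y = my + b_yx*(X - mx);  X on Y: X = mx + b_xy*(Y - my)
print(round(my - b_yx * mx, 2), round(b_yx, 3))  # 69.67 1.133
print(round(mx - b_xy * my, 2), round(b_xy, 3))  # -44.37 0.671
```

The two lines intersect at the point of means (10, 81), as stated in the section on regression lines.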