Correlation and Regression
Structure
5.0 Objectives
5.1 Introduction
5.2 Scatter Diagram
5.3 Covariance
5.4 Correlation Coefficient
5.5 Interpretation of Correlation Coefficient
5.6 Rank Correlation Coefficient
5.7 The Concept of Regression
5.8 Linear Relationship: Two-Variables Case
5.9 Minimisation of Errors
5.10 Method of Least Squares
5.11 Prediction
5.12 Relationship between Regression and Correlation
5.13 Multiple Regression
5.14 Non-Linear Regression
5.15 Let Us Sum Up
5.16 Answers/Hints to Check Your Progress Exercises
5.0 OBJECTIVES
After going through this unit, you will be in a position to:
plot scatter diagram;
compute correlation coefficient and state its properties;
compute rank correlation;
explain the concept of regression;
explain the method of least squares;
identify the limitations of linear regression;
apply linear regression models to given data; and
use the regression equation for prediction.
5.1 INTRODUCTION
The word ‘bivariate’ is used to describe situations in which two characters are
measured on each individual or item, the characters being represented by two
variables: for example, the measurement of height ( X_i ) and weight ( Y_i ) of
students in a school. The subscript i in this case represents the student concerned.
* Prof. Kaustuva Barik, School of Social Sciences, Indira Gandhi National Open University.
Thus, for example, (X_5, Y_5) represents the height and weight of the fifth student.
Statistical data relating to simultaneous measurement of two variables are called
bivariate data. The observations on each individual are paired, one for each
variable: (X_1, Y_1), (X_2, Y_2), ......, (X_n, Y_n).
In statistical studies with several variables, there are generally two types of
problems. In some problems it is of interest to study how the variables are
interrelated; such problems are tackled by using correlation techniques. For
instance, an economist may be interested in studying the relationship between the
stock prices of various companies; for this he may use correlation techniques. In
other problems there is a variable Y of basic interest, and the problem is to find out
what information the other variables provide on Y; such problems are tackled
using regression techniques. For instance, an economist may be interested in
studying what factors determine the pay of an employed person; in particular,
he may be interested in exploring what role factors such as education,
experience, market demand, etc. play in determining the pay. In this
situation he may use regression techniques to set up a prediction formula for pay
based on education, experience, etc.
5.2 SCATTER DIAGRAM

A representation of data of this type on a graph is a useful device which will help
us to understand the nature and form of the relationship between the two
variables: whether there is a discernible relationship or not and, if so, whether it is
linear or not. For this let us denote the score in Economics by X and the score in
Statistics by Y and plot the data of Table 5.1 on the x-y plane. (Table 5.1 gives the
scores of 20 students in Economics and Statistics; the scores are reproduced as the
X and Y columns of Table 5.2 below.) It does not matter which variable is called X
and which Y for this purpose. Such a plot is called a Scatter Plot or Scatter Diagram.
For the data of Table 5.1 the scatter diagram is given in Fig. 5.1.
An inspection of Table 5.1 and Fig. 5.1 shows that there is a positive relationship
between x and y. This means that larger values of x are associated with larger values
of y, and smaller values of x with smaller values of y. Further, the points seem to lie
scattered around both sides of a straight line. Thus, it appears that a linear
relationship exists between x and y. This relationship, however, is not perfect in the
sense that there are deviations from such a relationship in the case of certain
observations. It would indeed be useful to get a measure of the strength of this
linear relationship.
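If you would like to reproduce such a scatter diagram on a computer, here is a minimal sketch in Python (assuming the matplotlib library is available; any plotting tool would do). The score pairs used are the X and Y columns of Table 5.2 below.

```python
# A minimal scatter diagram: one point per student (data from Table 5.2).
import matplotlib.pyplot as plt

X = [82, 70, 34, 80, 66, 84, 74, 84, 60, 86, 76, 76, 92, 72, 64, 86, 84, 60, 82, 90]
Y = [64, 40, 35, 48, 54, 56, 62, 66, 52, 82, 58, 66, 72, 46, 44, 76, 52, 40, 60, 60]

plt.scatter(X, Y)
plt.xlabel("Score in Economics (X)")
plt.ylabel("Score in Statistics (Y)")
plt.title("Scatter Diagram")
plt.show()
```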
5.3 COVARIANCE
In the case of a single variable we have learnt the concept of variance, which is
defined as

σ_x² = (1/n) Σ_{i=1}^{n} (X_i − X̄)² … (5.1)

You may recall that standard deviation is always positive, since it is defined as the
positive square root of variance. The covariance between X and Y is defined
analogously as the average of the products of the deviations of the two variables
from their respective means:

σ_xy = (1/n) Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) … (5.2)

In the case of covariance there are two terms, (X_i − X̄) and (Y_i − Ȳ), which
represent the deviations of X from X̄ and of Y from Ȳ respectively.
Moreover, (X_i − X̄) can be positive or negative depending on whether X_i is less
than or greater than X̄. Similarly, (Y_i − Ȳ) can be positive or negative. It is not
necessary that whenever (X_i − X̄) is positive, (Y_i − Ȳ) will also be positive.
Therefore, the product (X_i − X̄)(Y_i − Ȳ) can be either positive or negative. A
positive value of (X_i − X̄)(Y_i − Ȳ) implies that whenever X_i > X̄, we have
Y_i > Ȳ; thus a higher value of X_i is associated with a relatively higher value of
Y_i. On the other hand, (X_i − X̄)(Y_i − Ȳ) < 0 implies that a lower value of X_i is
associated with a relatively higher value of Y_i. When we sum these products over
all the observations and divide by the number of observations, we may obtain a
negative or a positive value. Therefore, covariance can assume both positive and
negative values.
When covariance between x and y is negative (σ_xy < 0) we can say that the
relationship could be inverse. Similarly, σ_xy > 0 implies a positive relationship
between x and y. A major limitation of covariance is that it is not independent of
the unit of measurement: if we change the unit of measurement of the variables
we will get a different value for σ_xy.
A convenient computational form of (5.2) is obtained as follows:

σ_xy = (1/n) Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) = (1/n) Σ_{i=1}^{n} (X_iY_i − X̄Y_i − X_iȲ + X̄Ȳ)

     = (1/n) Σ_{i=1}^{n} X_iY_i − X̄ · (1/n) Σ_{i=1}^{n} Y_i − Ȳ · (1/n) Σ_{i=1}^{n} X_i + X̄Ȳ

Since (1/n) Σ Y_i = Ȳ and (1/n) Σ X_i = X̄, we have

σ_xy = (1/n) Σ_{i=1}^{n} X_iY_i − X̄Ȳ … (5.3)
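As a quick check on the algebra, the short sketch below computes the covariance both from the definition (5.2) and from the computational form (5.3); the two values coincide. It is written in plain Python and uses the rainfall-production data of Table 5.4 later in this unit.

```python
# Covariance computed two ways: definition (5.2) and computational form (5.3).
X = [60, 62, 65, 71, 73, 75, 81, 85, 88, 90]   # rainfall (Table 5.4)
Y = [33, 37, 38, 42, 42, 45, 49, 52, 55, 57]   # production (Table 5.4)
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# (5.2): average product of deviations from the means
cov_def = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / n

# (5.3): mean of the products minus the product of the means
cov_alt = sum(x * y for x, y in zip(X, Y)) / n - x_bar * y_bar

print(cov_def, cov_alt)   # both print 77.6
```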
5.4 CORRELATION COEFFICIENT

A measure of relationship that does not depend on the units of measurement can be
obtained by standardising each variable, that is, by considering (X − X̄)/σ_x and
(Y − Ȳ)/σ_y, where X̄ and Ȳ are the means of X and Y respectively and σ_x and
σ_y are their standard deviations.
Let us denote these standardised variables by u and v respectively. Let us also use
the notation (X_i, Y_i) to denote the scores of the ith student in Economics and
Statistics respectively, i ranging from 1 to n, the number of students, n being 20 in
our example. Similarly, let (u_i, v_i) denote the standardised scores of the ith
student. Then recall the following formulae for mean and standard deviation:
X̄ = (1/n) Σ_{i=1}^{n} X_i ;   σ_x² = (1/n) Σ_{i=1}^{n} (X_i − X̄)² ;

Ȳ = (1/n) Σ_{i=1}^{n} Y_i ;   σ_y² = (1/n) Σ_{i=1}^{n} (Y_i − Ȳ)²
Fig. 5.2 is the scatter diagram in terms of the standardised variables u and v. Let us
observe that in this example there is a positive association between the two
scores: on the whole, the larger one score is, the larger the other score also is, and
the smaller one score is, the smaller the other score is. In view of this, most of the
points are either in the first quadrant or in the third quadrant. The first quadrant
represents the cases where both scores are above their respective means and the third
quadrant represents the cases where both scores are below their respective means.
There are only a very few points in the second and fourth quadrants, which represent
the cases where one score is above its mean and the other is below its mean. Thus
the product of the u, v values is a suitable indicator of the strength of the
relationship; this product is positive in the first and third quadrants and negative
in the second and fourth. Thus the product of u and v, averaged over all the points,
may be considered to be a suitable measure of the strength of linear relationship
between X and Y.
This measure is called the correlation coefficient between X and Y and is usually
denoted by r_xy, or simply by r when it is clear from the context what x and y are.
It is also called Pearson’s Product-Moment Correlation Coefficient, to distinguish
it from other types of correlation coefficients. In terms of covariance and standard
deviations,

r = σ_xy / (σ_x σ_y) … (5.4)

and in terms of the standardised scores,

r = (1/n) Σ_{i=1}^{n} u_i v_i … (5.5)

Written out fully in terms of the original variables, this becomes
r = [ (1/n) Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) ] / [ √{(1/n) Σ_{i=1}^{n} (X_i − X̄)²} · √{(1/n) Σ_{i=1}^{n} (Y_i − Ȳ)²} ] … (5.6)
Or, alternatively
r = [ n Σ_{i=1}^{n} X_iY_i − (Σ_{i=1}^{n} X_i)(Σ_{i=1}^{n} Y_i) ] / [ √{n Σ_{i=1}^{n} X_i² − (Σ_{i=1}^{n} X_i)²} · √{n Σ_{i=1}^{n} Y_i² − (Σ_{i=1}^{n} Y_i)²} ] … (5.7)
Let us go back to the data given in Table 5.1 and work out the value of r. You can
use any of the formulae (5.4), (5.5) or (5.7) to get the value of r. Since all the
formulae are derived from the same concept, we obtain the same value for r
whichever formula we use. For the data set in Table 5.1 we have calculated it by
using (5.4) and (5.7). We construct Table 5.2 for this purpose.
Table 5.2: Calculation of Correlation Coefficient

Observation No.   X     Y     X²      Y²      XY
1 82 64 6724 4096 5248
2 70 40 4900 1600 2800
3 34 35 1156 1225 1190
4 80 48 6400 2304 3840
5 66 54 4356 2916 3564
6 84 56 7056 3136 4704
7 74 62 5476 3844 4588
8 84 66 7056 4356 5544
9 60 52 3600 2704 3120
10 86 82 7396 6724 7052
11 76 58 5776 3364 4408
12 76 66 5776 4356 5016
13 92 72 8464 5184 6624
14 72 46 5184 2116 3312
15 64 44 4096 1936 2816
16 86 76 7396 5776 6536
17 84 52 7056 2704 4368
18 60 40 3600 1600 2400
19 82 60 6724 3600 4920
20 90 60 8100 3600 5400
Total 1502 1133 116292 67141 87450
Σ X_i = 1502;  X̄ = 1502/20 = 75.1
Σ Y_i = 1133;  Ȳ = 1133/20 = 56.65
Σ X_i² = 116292;  σ_x² = 116292/20 − (75.1)² = 174.59;  σ_x = 13.21
Σ Y_i² = 67141;  σ_y² = 67141/20 − (56.65)² = 147.83;  σ_y = 12.16
Σ X_iY_i = 87450;  σ_xy = 87450/20 − 75.1 × 56.65 = 118.09

By formula (5.4), r = σ_xy/(σ_x σ_y) = 118.09/(13.21 × 12.16) = 0.735.

By formula (5.7), r = (20 × 87450 − 1502 × 1133) / (√(20 × 116292 − 1502²) × √(20 × 67141 − 1133²)) = 47234/(264.26 × 243.17) = 0.735.
Thus we see that both the formulae provide the same value of the correlation
coefficient r. You can check yourself that the same value of r is obtained by using
the formula (5.5); for this purpose you will need the values of the standardised
scores u_i and v_i.
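The arithmetic of Table 5.2 is easily mechanised. The sketch below, in plain Python, applies formula (5.7) to the twenty score pairs and reproduces r = 0.735.

```python
import math

# Correlation coefficient by formula (5.7) for the data of Table 5.2.
X = [82, 70, 34, 80, 66, 84, 74, 84, 60, 86, 76, 76, 92, 72, 64, 86, 84, 60, 82, 90]
Y = [64, 40, 35, 48, 54, 56, 62, 66, 52, 82, 58, 66, 72, 46, 44, 76, 52, 40, 60, 60]
n = len(X)

sx, sy = sum(X), sum(Y)                   # 1502, 1133
sxx = sum(x * x for x in X)               # 116292
syy = sum(y * y for y in Y)               # 67141
sxy = sum(x * y for x, y in zip(X, Y))    # 87450

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 3))                        # 0.735
```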
4) Find the correlation coefficient between toughness and nickel content and
comment on the result.
……………………………………………………………………………..
……………………………………………………………………………..
……………………………………………………………………………..
……………………………………………………………………………..
……………………………………………………………………………..
……………………………………………………………………………..
4) Determine the correlation coefficient between x and y.
x : 5 7 9 11 13 15
5.6 RANK CORRELATION COEFFICIENT

(Table 5.3 gives two rankings of ten cases and the differences D_i between the
ranks, with Σ D_i = 0 and Σ D_i² = 87.50.)
Let us consider the data of Table 5.3. Here there are some ties; the tied cases are
given the same rank in such a way that their total is the same as when there is no
tie. For example, when there are two cases tied for rank 6, each is given a rank of
6.5 and there is no case with rank either 6 or 7. Similarly, if there are three cases
tied for rank 5, then each is given a rank of 6 and there is no case with rank 5 or 7.

Spearman’s rank correlation coefficient, called Spearman’s Rho and denoted by ρ, is
based on the difference D_i (i for the ith observation) between the two rankings. If the
two rankings completely coincide, then D_i is zero for every case. The larger the
values of D_i, the greater is the difference between the two rankings and the smaller
is the association. Thus, the association can be measured by considering the
magnitudes of D_i. Since the sum of the D_i is always zero, to find a single index on
the basis of the D_i values we should remove the sign of D_i and consider only the
magnitude. In Spearman’s ρ, this is done by taking D_i².
However, the largeness or smallness of Σ_{i=1}^{n} D_i², where n is the number of
cases, will depend on n. Thus, in order to be able to interpret this value, we form a
ratio by dividing the sum by a quantity that depends only on n, namely

6 Σ_{i=1}^{n} D_i² / n(n² − 1)

However, this ratio is zero for perfect association and 2 for lack of association, i.e.,
perfect negative association, while we would like it to be the other way around. So
we subtract this ratio from 1. Thus

ρ = 1 − 6 Σ_{i=1}^{n} D_i² / n(n² − 1) … (5.8)
is defined as Spearman’s rank correlation.
Let us calculate the value of ρ from the data given in Table 5.3.

ρ = 1 − (6 × 87.5) / (10(10² − 1)) = 1 − 525/990 = 1 − 0.53 = 0.47
Like Karl Pearson’s coefficient of correlation, Spearman’s rank correlation has
a value +1 for perfect matching of ranks, −1 for perfect mismatching of ranks and
0 for the lack of relation between the ranks.
There are other measures of association suitable for use when the variables are of
nominal, ordinal and other types. We do not discuss them here.
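A short sketch in plain Python may help you check your answers to the exercises below. It applies formula (5.8) to two rankings without ties; the rankings shown are those of exercise 1 of Check Your Progress 2 (for tied ranks, assign average ranks first, as explained above).

```python
# Spearman's rank correlation (5.8) for two rankings without ties.
rank1 = [5, 2, 8, 1, 4, 6, 3, 7]   # first judge
rank2 = [4, 5, 7, 3, 2, 8, 1, 6]   # second judge
n = len(rank1)

d2 = sum((a - b) ** 2 for a, b in zip(rank1, rank2))   # sum of squared rank differences
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(rho)   # 0.666..., i.e. 2/3
```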
Check Your Progress 2
1) In a contest, two judges ranked eight candidates A, B, C, D, E, F, G and H in
order of their preference, as shown in the following table. Find the rank
correlation coefficient.
A B C D E F G H
First Judge 5 2 8 1 4 6 3 7
Second Judge 4 5 7 3 2 8 1 6
…………………………………………………………………………….
…………………………………………………………………………….
…………………………………………………………………………..…
…………………………………………………………………………..…
…………………………………………………………………………..…
…………………………………………………………………………..…
……………………………………………………………………………..
Roll Nos. 1 2 3 4 5 6 7 8 9 10
Rank in B. Com. Exam. 1 5 8 6 7 4 2 3 9 10
Ranks by A : 1 6 5 10 3 2 4 9 7 8
Ranks by B : 3 5 8 4 7 10 2 1 6 9
Ranks by C : 6 4 9 8 1 2 3 10 5 7
Using the rank correlation method, discuss which pair of judges has the
nearest approach to common liking in music.
…………………………………………………………………....………….....
………………………………………………………………….......…………..
………………………………………………………………….......……..……
………………………………………………………………….......………..…
………………………………………………………………….......……..……
5.7 THE CONCEPT OF REGRESSION
In the previous section we noted that the correlation coefficient does not reflect cause
and effect relationship between two variables. Thus we cannot predict the value
of one variable for a given value of the other variable. This limitation is removed
by regression analysis. In regression analysis, the relationship between variables
is expressed in the form of a mathematical equation. It is assumed that one
variable is the cause and the other is the effect. You should remember that
regression is a statistical tool which helps us understand the relationship between
variables and predict the unknown values of the dependent variable from known
values of the independent variable.
In regression analysis we have two types of variables: i) dependent (or explained)
variable, and ii) independent (or explanatory) variable. As the name (explained
and explanatory) suggests the dependent variable is explained by the independent
variable.
In the simplest case of regression analysis there is one dependent variable and one
independent variable. Let us assume that consumption expenditure of a household
is related to the household income. For example, it can be postulated that as
household income increases, expenditure also increases. Here consumption
expenditure is the dependent variable and household income is the independent
variable.
Usually we denote the dependent variable as Y and the independent variable as X.
Suppose we took up a household survey and collected n pairs of observations on X
and Y. The next step is to find out the nature of relationship between X and Y.
The relationship between X and Y can take many forms. The general practice is
to express the relationship in terms of some mathematical equation. The simplest
of these equations is the linear equation. This means that the relationship between
X and Y is in the form of a straight line and is termed linear regression. When the
equation represents curves (not a straight line) the regression is called non-linear
or curvilinear.
Now the question arises, ‘How do we identify the equation form?’ There is no
hard and fast rule as such. The form of the equation depends upon the reasoning
and assumptions made by us. However, we may plot the X and Y variables on a
graph paper to prepare a scatter diagram. From the scatter diagram, the location of
the points on the graph paper helps in identifying the type of equation to be fitted.
If the points are more or less in a straight line, then a linear equation is assumed. On
the other hand, if the points are not in a straight line and are in the form of a
curve, a suitable non-linear equation (which resembles the scatter) is assumed.
We have to take another decision, that is, the identification of dependent and
independent variables. This again depends on the logic put forth and the purpose of
analysis: whether ‘Y depends on X’ or ‘X depends on Y’. Thus there can be two
regression equations from the same set of data. These are i) Y is assumed to be
dependent on X (this is termed the ‘Y on X’ line), and ii) X is assumed to be
dependent on Y (this is termed the ‘X on Y’ line).
Regression analysis can be extended to cases where one dependent variable is
explained by a number of independent variables. Such a case is termed multiple
regression. In advanced regression models there can be a number of dependent as
well as independent variables.
You may by now be wondering why the term ‘regression’, which literally means
‘going back’. This name is associated with a phenomenon observed in a study on the
relationship between the stature of father (x) and son (y). It was observed that the
average stature of sons of the tallest fathers has a tendency to be less than the
average stature of these fathers. On the other hand, the average stature of sons of
the shortest fathers has a tendency to be more than the average stature of these
fathers. This phenomenon was called regression towards the mean. Although this
appeared somewhat strange at the time, it was found later that it is due to
natural variation within subgroups of a group, and the same phenomenon occurred
in most problems and data sets. The explanation is that many tall men come from
families with average stature due to vagaries of natural variation, and they produce
sons who are shorter than them on the whole. A similar phenomenon takes place
at the lower end of the scale.
5.8 LINEAR RELATIONSHIP: TWO-VARIABLES CASE

The simplest relationship between Y and X is the deterministic linear equation

Y_i = a + bX_i … (5.9)

Is such an exact relationship plausible for economic data? Consider the
consumption-income example again. For the months when your income is the same,
does your consumption remain the same? The point we are trying to make is that
economic relationships involve a certain randomness.
Therefore, we assume the relationship between Y and X to be stochastic and add
an error term to (5.9). Thus our stochastic model is

Y_i = a + bX_i + e_i … (5.10)

where e_i is the error term. In real life situations e_i represents randomness in
human behaviour and excluded variables, if any, in the model. Remember that
the right hand side of (5.10) has two parts, viz., i) the deterministic part (that is,
a + bX_i), and ii) the stochastic or random part (that is, e_i). Equation (5.10)
implies that even if X_i remains the same for two observations, Y_i need not be the
same, because of different e_i. Thus, if we plot (5.10) on a graph paper the
observations will not all lie on a straight line.
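A small simulation can make the role of the error term concrete. The sketch below, in plain Python with illustrative parameter values, generates observations from model (5.10); even when X_i repeats, the simulated Y_i differ because of e_i.

```python
import random

# Simulating the stochastic model (5.10): Y = a + b*X + e (illustrative a, b).
random.seed(1)
a, b = 2.0, 0.5
X = [10, 10, 20, 20, 30, 30]                       # X values repeat in pairs
Y = [a + b * x + random.gauss(0, 1) for x in X]    # e drawn from N(0, 1)
print(Y)   # equal X values, yet unequal Y values, because of the random term
```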
Example 5.1
The amount of rainfall and agricultural production for ten years are given in Table
5.4.
Table 5.4: Rainfall and Agricultural Production

Rainfall (in mm)     Agricultural production (in '000 tonnes)
60 33
62 37
65 38
71 42
73 42
75 45
81 49
85 52
88 55
90 57
We plot the data on a graph paper. The scatter diagram looks something like Fig.
5.4. We observe from Fig. 5.4 that the points do not lie strictly on a straight line.
But they show an upward rising tendency where a straight line can be fitted. Let
us draw the regression line along with the scatter plot.
The line to be fitted is of the form given in (5.9), Y = a + bX, where b is the slope
and a is the intercept on the y-axis. The location of a straight line depends on the
values of a and b, called parameters. Therefore, the task before us is to estimate
these parameters from the collected data. (You will learn more about the concept of
estimation in Block 4.) In order to obtain the line of best fit to the data we should
find estimates of a and b in such a way that the error e_i is minimum.
5.9 MINIMISATION OF ERRORS

In Fig. 5.4 the differences between observed and predicted values of Y are
marked with straight lines from the observed points, parallel to the y-axis, meeting
the regression line. The lengths of these segments are the errors at the observed
points.
Let us denote the n observations as before by (X_i, Y_i), i = 1, 2, ....., n. In Example
5.1 on agricultural production and rainfall, n = 10.
Let us denote the predicted value of Y_i at X_i by Ŷ_i (the notation Ŷ_i is pronounced
as ‘Y_i-cap’ or ‘Y_i-hat’). Thus

Ŷ_i = a + bX_i , i = 1, 2, ....., n

and the error at the ith point is

e_i = Y_i − Ŷ_i … (5.11)
It would be nice if we could determine a and b in such a way that each of the e_i, i =
1, 2, ....., n is zero. But this is impossible unless it so happens that all the n points
lie on a straight line, which is very unlikely. Thus we have to be content with
minimising a combination of the e_i, i = 1, 2, ....., n. What are the options before us?
n
It is tempting to think that the total of all the ei , i = 1, 2, ….., n, that is, ei
i1
is a suitable choice. But it is not. Because, ei for points above the line are
positive and below the line are negative. Thus by having a combination of
n
large positive and large negative errors, it is possible for ei to be very
i 1
small.
A second possibility is that if we take a = Ȳ (the arithmetic mean of the Y_i’s)
and b = 0, Σ_{i=1}^{n} e_i could be made zero. In this case, however, we do not need
the value of X at all for prediction! The predicted value is the same irrespective of
the observed value of X. This evidently is wrong.
n
What then is wrong with the criterion ei ? It takes into account the sign of
i 1
ei . What matters is the magnitude of the error and whether the error is on the
n
positive side or negative side is really immaterial. Thus, the criterion ei is
i 1
5.10 METHOD OF LEAST SQUARES
In the least squares method we minimise the sum of squares of the error terms,
that is,

Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (Y_i − a − bX_i)² … (5.12)
The next question is: How do we obtain the values of a and b to minimise (5.12)?
Those of you who are familiar with the concept of differentiation will
remember that the value of a function is minimum when the first derivative of
the function is zero and the second derivative is positive. Here we have to choose
the values of a and b. Hence, Σ_{i=1}^{n} e_i² will be minimum when its partial
derivatives with respect to a and b are zero. The partial derivatives of
Σ_{i=1}^{n} e_i² are obtained as follows:
∂/∂a Σ_i e_i² = ∂/∂a Σ_i (Y_i − a − bX_i)² = −2 Σ_i (Y_i − a − bX_i) … (5.13)

∂/∂b Σ_i e_i² = ∂/∂b Σ_i (Y_i − a − bX_i)² = −2 Σ_i X_i (Y_i − a − bX_i) … (5.14)
By equating (5.13) and (5.14) to zero and re-arranging the terms we get the
following two equations:
Σ_{i=1}^{n} Y_i = na + b Σ_{i=1}^{n} X_i … (5.15)

Σ_{i=1}^{n} X_iY_i = a Σ_{i=1}^{n} X_i + b Σ_{i=1}^{n} X_i² … (5.16)
These two equations, (5.15) and (5.16), are called the normal equations of
least squares. These are two simultaneous linear equations in two unknowns.
These can be solved to obtain the values of a and b.
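For the rainfall data of Example 5.1, the normal equations can be set up and solved in a few lines of code. The plain-Python sketch below eliminates a between (5.15) and (5.16) to get b, then recovers a; it reproduces b = 0.743 and a close to −10.73 (the exact value is −10.75; −10.73 results from rounding b first).

```python
# Solving the normal equations (5.15)-(5.16) for the data of Table 5.4.
X = [60, 62, 65, 71, 73, 75, 81, 85, 88, 90]   # rainfall in mm
Y = [33, 37, 38, 42, 42, 45, 49, 52, 55, 57]   # production in '000 tonnes
n = len(X)

sx, sy = sum(X), sum(Y)                        # 750, 450
sxx = sum(x * x for x in X)                    # 57294
sxy = sum(x * y for x, y in zip(X, Y))         # 34526

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # from eliminating a
a = (sy - b * sx) / n                          # back-substitution into (5.15)
print(round(a, 2), round(b, 3))                # -10.75 0.743
```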
Those of you who are not familiar with the concept of differentiation can use
a rule of thumb (we suggest that you learn the concept of differentiation,
which is very useful in Economics). The normal equations given at (5.15) and
(5.16) are obtained by multiplying the linear equation by the coefficients of a and
b respectively and summing over all observations. Here the linear equation is
Y_i = a + bX_i. The first normal equation is simply the linear equation summed
over all observations (since the coefficient of a is 1):

Σ Y_i = Σ (a + bX_i)  or  Σ Y_i = na + b Σ X_i
The second normal equation is the linear equation multiplied by X_i (since the
coefficient of b is X_i) and then summed over all observations:

X_iY_i = aX_i + bX_i²  or  Σ X_iY_i = a Σ X_i + b Σ X_i²
After obtaining the normal equations we calculate the values of a and b from the
set of data we have.
Example 5.2: Assume that quantity of agricultural production depends on the
amount of rainfall and fit a linear regression to the data given in Example 5.1.
In this case dependent variable (Y) is quantity of agricultural production and
independent variable (X) is amount of rainfall. The regression equation to be
fitted is
Y_i = a + bX_i + e_i
For the above equation we find out the normal equations by the method of least
squares. These equations are given at (5.15) and (5.16). Next we construct a table
as follows:
Table 5.5: Computation of Regression Line

X_i    Y_i    X_i²     X_iY_i    Ŷ_i      e_i
60     33     3600     1980      33.85    −0.85
62     37     3844     2294      35.34     1.66
65     38     4225     2470      37.57     0.43
71     42     5041     2982      42.02    −0.02
73     42     5329     3066      43.51    −1.51
75     45     5625     3375      45.00     0.00
81     49     6561     3969      49.45    −0.45
85     52     7225     4420      52.43    −0.43
88     55     7744     4840      54.65     0.35
90     57     8100     5130      56.14     0.86
Total  750    450      57294     34526
By substituting values from Table 5.5 in the normal equations (5.15) and (5.16)
we get the following:
450 = 10a + 750b

34526 = 750a + 57294b

Solving these two equations, we obtain b = 0.743 and a = −10.73 (approximately).
Thus the estimated regression line is Ŷ_i = −10.73 + 0.743X_i.
Notice that the sum of errors Σ_i e_i for the estimated regression equation is zero
(apart from rounding).
The computation given in Table 5.5 often involves large numbers and poses
difficulty. Hence we have a short-cut method for calculating the values of a and b
from the normal equations.
Let us take x = X − X̄ and y = Y − Ȳ, where X̄ and Ȳ are the arithmetic means of
X and Y respectively; hence x_iy_i = (X_i − X̄)(Y_i − Ȳ). In terms of these
deviations the solution of the normal equations can be written as

b = Σ_{i=1}^{n} x_iy_i / Σ_{i=1}^{n} x_i² … (5.17)

a = Ȳ − bX̄ … (5.18)
You may recall that covariance is given by σ_xy = (1/n) Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ)
= (1/n) Σ_{i=1}^{n} x_iy_i. Moreover, the variance of X is given by
σ_x² = (1/n) Σ_{i=1}^{n} (X_i − X̄)² = (1/n) Σ_{i=1}^{n} x_i². Since
b = Σ_{i=1}^{n} x_iy_i / Σ_{i=1}^{n} x_i², we can say that

b = σ_xy / σ_x² … (5.19)
Since these formulae are derived from the normal equations we get the same
values for a and b in this method also. For the data given in Table 5.4 we compute
the values of a and b by this method. For this purpose we construct Table 5.6.
Table 5.6: Computation of Regression Line (short-cut method)

X_i    Y_i    x_i    y_i    x_i²    x_iy_i
60 33 -15 -12 225 180
62 37 -13 -8 169 104
65 38 -10 -7 100 70
71 42 -4 -3 16 12
73 42 -2 -3 4 6
75 45 0 0 0 0
81 49 6 4 36 24
85 52 10 7 100 70
88 55 13 10 169 130
90 57 15 12 225 180
Total = 750 450 0 0 1044 776
b = Σ x_iy_i / Σ x_i² = 776/1044 = 0.743

a = Ȳ − bX̄ = 45 − 0.743 × 75 = −10.73

Thus the regression line obtained by this method is also Ŷ_i = −10.73 + 0.743X_i … (5.20)
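The short-cut method is equally easy to mechanise, since it only needs the deviations from the means. A plain-Python sketch for the data of Table 5.6:

```python
# Short-cut method: b = sum(x*y) / sum(x^2), a = Y_bar - b * X_bar (Table 5.6).
X = [60, 62, 65, 71, 73, 75, 81, 85, 88, 90]
Y = [33, 37, 38, 42, 42, 45, 49, 52, 55, 57]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n          # 75, 45

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))   # 776
sxx = sum((x - x_bar) ** 2 for x in X)                       # 1044

b = sxy / sxx              # 0.743
a = y_bar - b * x_bar      # about -10.75 (-10.73 when b is rounded first)
print(round(a, 2), round(b, 3))
```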
5.11 PREDICTION
A major interest in studying regression lies in its ability to forecast. In Example
5.1 we assumed that the quantity of agricultural production is dependent on the
amount of rainfall. We fitted a linear equation to the observed data and got the
relationship
Ŷ_i = −10.73 + 0.743X_i

From this equation we can predict the quantity of agricultural output given the
amount of rainfall. Thus when rainfall is 60 mm, agricultural production is
(−10.73 + 0.743 × 60) = 33.85 thousand tonnes. This figure is the predicted value on
the basis of the regression equation.
the basis of regression equation. In a similar manner we can find the predicted
values of Y for different values of X.
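A fitted line like (5.20) is, in effect, a prediction formula. A minimal sketch:

```python
# Prediction from the fitted line (5.20): Y_hat = -10.73 + 0.743 * X.
def predict(rainfall_mm):
    """Predicted agricultural production in thousand tonnes."""
    return -10.73 + 0.743 * rainfall_mm

print(predict(60))    # 33.85, as computed above
print(predict(75))    # 45.0 (approximately)
```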
Let us compare the predicted value with the observed value. From Table 5.4,
where observed values are given, we find that when rainfall is 60 mm,
agricultural production is 33 thousand tonnes. The predicted values Ŷ_i for the
observed values of X are given in the fifth column of Table 5.5. Thus when
rainfall is 60 mm, the predicted value is 33.85 thousand tonnes, and the error value
e_i is −0.85 thousand tonnes.
Now a question arises, ‘Which one, between observed and predicted values,
should we believe?’ In other words, what will be the quantity of agricultural
production if there is a rainfall of 60 mm. in future? On the basis of our regression
line it is given to be 33.85 thousand tonnes. And we accept this value because it is based on
the overall data. The error of –0.85 is considered as a random fluctuation which
may not be repeated.
The second question that comes to our mind is, ‘Is the prediction valid for any
value of X?’ For example, we find from the regression equation that when rainfall
is zero, agricultural production is −10.73 thousand tonnes. But common sense tells
us that agricultural production cannot be negative! Is there anything wrong with
our regression equation? In fact, the regression equation here is estimated on the
basis of rainfall data in the range of 60–90 mm. Thus the prediction is valid only in
this range of X; our prediction should not be for far-off values of X.
A third question that arises here is, ‘Will the predicted value come true?’ This
depends upon the coefficient of determination. If the coefficient of determination
is closer to one, there is greater likelihood that the prediction will be realised.
However, the predicted value is constrained by elements of randomness involved
with human behaviour and other unforeseen factors.
5.12 RELATIONSHIP BETWEEN REGRESSION AND CORRELATION

As noted earlier, two regression lines can be fitted to the same set of data:

a) the Y on X line, Y_i = a + bX_i

b) the X on Y line, X_i = a′ + b′Y_i
You may ask, ‘What is the need for having two different lines?’ By rearrangement
of terms of the Y on X line we obtain X_i = −(a/b) + (1/b)Y_i. Thus we might expect
a′ = −a/b and b′ = 1/b. However, the observations are not on a straight line and the
relation between X and Y is not an exact mathematical one. You may recall that
estimates of the parameters are obtained by the method of least squares. Thus the
regression line Ŷ_i = a + bX_i is obtained by minimising Σ_i (Y_i − a − bX_i)²,
whereas the line X̂_i = a′ + b′Y_i is obtained by minimising Σ_i (X_i − a′ − b′Y_i)²;
in general the two lines are different. By the same reasoning as before,
b = σ_xy/σ_x² and b′ = σ_xy/σ_y².
Thus bb′ = σ_xy² / (σ_x² σ_y²), which is the same as r².
This r2 is called the coefficient of determination. Thus the product of the two
regression coefficients of Y on X and X on Y is the square of the correlation
coefficient. This gives a relationship between correlation and regression. Notice,
however, that the coefficient of determination of either regression is the same,
i.e., r2; this means that although the two regression lines are different, their
predictive powers are the same. Note that the coefficient of determination r2
ranges between 0 and 1, i.e., the maximum value it can assume is unity and the
minimum value is zero; it cannot be negative.
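The relation just stated is easy to verify numerically. The plain-Python sketch below computes both regression coefficients for the score data of Table 5.2 and confirms that their product equals r²:

```python
# Verifying b * b' = r^2 with the data of Table 5.2.
X = [82, 70, 34, 80, 66, 84, 74, 84, 60, 86, 76, 76, 92, 72, 64, 86, 84, 60, 82, 90]
Y = [64, 40, 35, 48, 54, 56, 62, 66, 52, 82, 58, 66, 72, 46, 44, 76, 52, 40, 60, 60]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

cov   = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / n   # 118.09
var_x = sum((x - x_bar) ** 2 for x in X) / n                       # 174.59
var_y = sum((y - y_bar) ** 2 for y in Y) / n                       # 147.83

b_yx = cov / var_x             # slope of the Y on X line
b_xy = cov / var_y             # slope of the X on Y line
print(round(b_yx * b_xy, 3))   # 0.540, which is r^2 = (0.735)^2
```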
From the previous discussions, two points emerge clearly:
1) If the points in the scatter lie close to a straight line, then there is a strong
relationship between X and Y and the correlation coefficient is high.
2) If the points in the scatter diagram lie close to a straight line, then the observed
values and the predicted values of Y by least squares are very close, and the
prediction errors (Y_i − Ŷ_i) are small.
Thus, the prediction errors by least squares seem to be related to the correlation
coefficient. We explain this relationship here. The sum of squares of errors at the
various points upon using the least squares linear regression is Σ_{i=1}^{n} (Y_i − Ŷ_i)².

On the other hand, if we had not used the value of observed X to predict Y, then
the prediction would be a constant, say a. The best value of a by the least squares
criterion is the a that minimises Σ_{i=1}^{n} (Y_i − a)²; the solution is seen to be
a = Ȳ. Thus the sum of squares of errors of prediction at the various points without
using X is Σ_{i=1}^{n} (Y_i − Ȳ)².
The ratio Σ_{i=1}^{n} (Y_i − Ŷ_i)² / Σ_{i=1}^{n} (Y_i − Ȳ)² can then be used as an
index of how much has been gained by the use of X: the smaller the ratio, the
greater the gain. In fact, this ratio equals 1 − r², so that the coefficient of
determination mentioned above is r² = 1 − Σ(Y_i − Ŷ_i)² / Σ(Y_i − Ȳ)². Since both
the numerator and denominator of the ratio are non-negative, the ratio is greater
than or equal to zero.
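The same identity can be checked for the rainfall regression. Using the fitted line (5.20) for the data of Table 5.4, a plain-Python sketch:

```python
# Checking r^2 = 1 - SSE/SST for the rainfall regression of Example 5.2.
X = [60, 62, 65, 71, 73, 75, 81, 85, 88, 90]
Y = [33, 37, 38, 42, 42, 45, 49, 52, 55, 57]
y_bar = sum(Y) / len(Y)

Y_hat = [-10.73 + 0.743 * x for x in X]                # predictions from (5.20)
sse = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))    # error sum of squares, using X
sst = sum((y - y_bar) ** 2 for y in Y)                 # total sum of squares, ignoring X
print(round(1 - sse / sst, 3))                         # about 0.988: a very close fit
```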
5) Obtain the equation of the line of regression of yield of rice (y) on water (x)
from the data given in the following table :
Water in inches (x) 12 18 24 30 36 42 48
Yield in tons (y) 5.27 5.68 6.25 7.21 8.02 8.71 8.42
Estimate the most probable yield of rice for 40 inches of water.
…………………………………………………………………....………….....
………………………………………………………………….......…………..
………………………………………………………………….......……..……
………………………………………………………………….......………..…
………………………………………………………………….......……..……
5.13 MULTIPLE REGRESSION

Regression analysis can be extended to the case where one dependent variable Y is
explained by two independent variables, say X₁ and X₂, through the model
Y = α + βX₁ + γX₂ + e. The method of least squares again leads to a set of normal
equations, listed in the steps below. By solving these equations we obtain estimates
for α, β and γ. The regression equation that we obtain is

Ŷ = α + βX₁ + γX₂ … (5.23)
In the bivariate case (Y, X) we could plot the regression line on a graph paper.
However, it is quite complex to plot the three-variable case (Y, X₁, X₂) on graph
paper because it will require three dimensions. However, the intuitive idea
remains the same and we have to minimise the sum of squared errors. In fact, when
we add up all the error terms (e₁, e₂, ........, e_n) they sum to zero.
In many cases the number of explanatory variables may be more than two. In
such cases we have to follow the basic principle of least squares: minimise Σe².
Thus if Y = a₀ + a₁X₁ + a₂X₂ + ............... + a_nX_n + e, then we have to minimise

Σe² = Σ(Y − a₀ − a₁X₁ − a₂X₂ − ........ − a_nX_n)²
The steps involved in the estimation of the regression line are:
i) Find out the regression equation to be estimated. In this case it is given by
Y = α + βX₁ + γX₂ + e.

ii) Find out the normal equations for the regression equation to be estimated.
In this case the normal equations are

Σ Y = nα + β Σ X₁ + γ Σ X₂

Σ X₁Y = α Σ X₁ + β Σ X₁² + γ Σ X₁X₂

Σ X₂Y = α Σ X₂ + β Σ X₁X₂ + γ Σ X₂²
iii) Construct a table from the observed data with columns for Y, X₁, X₂, X₁Y,
X₂Y, X₁², X₂², X₁X₂ (and, after fitting, Ŷ and e_i), and obtain the column totals.

iv) Put the values from the table in the normal equations.

v) Solve for the estimates of α, β and γ.
By applying the above-mentioned steps we obtain the estimated regression line as
Ŷ = 4.80 + 0.45X₁ + 0.09X₂.
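A sketch of the mechanics, using hypothetical data for illustration and assuming the numpy library is available: it builds the coefficient matrix of the three normal equations and solves it.

```python
import numpy as np

# Solving the normal equations for Y = alpha + beta*X1 + gamma*X2 + e.
# Hypothetical data, for illustration only.
Y  = np.array([10.0, 12.0, 15.0, 18.0, 21.0, 25.0])
X1 = np.array([ 2.0,  3.0,  5.0,  6.0,  8.0, 10.0])
X2 = np.array([ 1.0,  2.0,  2.0,  3.0,  4.0,  4.0])
n = len(Y)

# Left-hand side and right-hand side of the three normal equations.
A = np.array([[n,        X1.sum(),        X2.sum()],
              [X1.sum(), (X1 * X1).sum(), (X1 * X2).sum()],
              [X2.sum(), (X1 * X2).sum(), (X2 * X2).sum()]])
rhs = np.array([Y.sum(), (X1 * Y).sum(), (X2 * Y).sum()])

alpha, beta, gamma = np.linalg.solve(A, rhs)
print(alpha, beta, gamma)   # the least squares estimates
```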
5.14 NON-LINEAR REGRESSION
The equation fitted in regression can be non-linear or curvilinear also. In fact, it
can take numerous forms. A simpler form involving two variables is the
quadratic form. The equation is
Y = a + bX + cX²
There are three parameters here, viz., a, b and c, and the normal equations are:

Σ Y = na + b Σ X + c Σ X²

Σ XY = a Σ X + b Σ X² + c Σ X³

Σ X²Y = a Σ X² + b Σ X³ + c Σ X⁴
By solving these equations we obtain the values of a, b and c.
Certain non-linear equations can be transformed into linear equations by taking
logarithms. Finding out the optimum values of the parameters from the
transformed linear equations is the same as the process discussed in the previous
section. We give below some of the frequently used non-linear equations and the
respective transformed linear equations.
1) Y = ae^{bX}

By taking natural log (ln), it can be written as

ln Y = ln a + bX

or Y′ = α + βX′

where Y′ = ln Y, α = ln a, X′ = X and β = b
2) Y = aX^b

By taking logarithm (log), the equation can be transformed into

log Y = log a + b log X

or Y′ = α + βX′

where Y′ = log Y, α = log a, β = b and X′ = log X
3) Y = 1/(a + bX)

If we take Y′ = 1/Y, then

Y′ = a + bX
4) Y = a + b√X

If we take X′ = √X, then

Y = a + bX′
Once the non-linear equation is transformed, the fitting of a regression line is as
per the method discussed in the beginning of this Unit.
We derive the normal equations and substitute the values calculated from the
observed data. From the transformed parameters, the actual parameters can be
obtained by making the reverse transformation.
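As an illustration of the reverse transformation, the sketch below fits Y = aX^b by ordinary least squares on the log-transformed equation, then recovers a from log a. Plain Python, with illustrative data.

```python
import math

# Fitting Y = a * X**b via the transformed linear equation log Y = log a + b log X.
X = [1.0, 2.0, 3.0, 4.0, 5.0]      # illustrative data only
Y = [2.1, 4.2, 5.9, 8.1, 9.8]
n = len(X)

u = [math.log(x) for x in X]       # X' = log X (natural log used throughout)
v = [math.log(y) for y in Y]       # Y' = log Y
u_bar, v_bar = sum(u) / n, sum(v) / n

b = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v)) / \
    sum((ui - u_bar) ** 2 for ui in u)        # slope of Y' on X'
a = math.exp(v_bar - b * u_bar)               # reverse transformation from log a
print(round(a, 3), round(b, 3))               # Y is approximately a * X**b
```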
5.16 ANSWERS/HINTS TO CHECK YOUR PROGRESS EXERCISES
Check Your Progress 1
1) + 0.47
2) + 0.996
3) + 0.98
4) + 0.995
5) – 0.84
Check Your Progress 2

1) 2/3
2) + 0.64
4) + 0.82