Unit 17 Correlation and Regression
Structure
17.1 Introduction
Objectives
17.2 Correlation and Scatter Diagram
17.3 Correlation Coefficient
17.4 Regression
17.5 Fitting of Linear Regression
17.6 Summary
17.7 Solutions/Answers
17.1 INTRODUCTION
So far we have been dealing with distributions of data involving only one variable. Such a distribution is called a univariate distribution. Very often, we have to deal with situations where more than one variable is involved. For example, we may like to study the relationship between the heights and weights of adult males, the quantum of rainfall and the yield of wheat in India over a number of years, the doses of a drug and the response (e.g., the dose of insulin and the blood sugar level in a person), the age of individuals and their blood pressure, etc.
In such situations, our main purpose is to determine whether or not a relationship exists between the two variables. If such a relationship can be expressed by a mathematical formula, then we shall be able to use it for analysis and hence make certain predictions.
Correlation and regression are methods that deal with the analysis of such
relationships between various variables and possible predictions. In this unit, we shall
confine ourselves to analysing the linear relationship between two variables. However,
we can extend the methods for two variables to the situations where more than two
variables are studied simultaneously.
Objectives
After reading this unit, you should be able to:
• describe the correlation between two variables,
• compute and interpret the correlation coefficient,
• describe the simple linear regression line, and
• explain how to fit a linear regression line using the least squares method.
17.2 CORRELATION AND SCATTER DIAGRAM

Consider the following three sets of ten paired observations of two variables X and Y:

TABLE 1

                    1    2    3    4    5    6    7    8    9   10
    Set I    X:     5   10   20   15    6   18   17   12   15   25
             Y:    20    8   10    5   12   16   22   17   12   17
    Set II   X:     5   10    7   15   17   12   20   25   22   15
             Y:     4    6    9   12   27   14   17   24   21   17
    Set III  X:     5   10    8   15   13   18   20   23   25   27
             Y:    25   22   17   20   15   13   17   12   19   10
Now plot the values of X and Y as (x, y) coordinates in the X-Y plane, as shown in Fig. 1. Figure 1 presents the scatter diagrams of these three sets of data, set by set. Study these scatter diagrams carefully.
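If you wish to draw such scatter diagrams with a computer, the following is a minimal Python sketch (using the matplotlib library, which is an assumption; any plotting tool would do). It simply plots the data of Table 1, one panel per set.

    import matplotlib.pyplot as plt

    # Paired observations from Table 1
    sets = {
        "Set I":   ([5, 10, 20, 15, 6, 18, 17, 12, 15, 25],
                    [20, 8, 10, 5, 12, 16, 22, 17, 12, 17]),
        "Set II":  ([5, 10, 7, 15, 17, 12, 20, 25, 22, 15],
                    [4, 6, 9, 12, 27, 14, 17, 24, 21, 17]),
        "Set III": ([5, 10, 8, 15, 13, 18, 20, 23, 25, 27],
                    [25, 22, 17, 20, 15, 13, 17, 12, 19, 10]),
    }

    # One panel per set, as in Fig. 1 (a), (b) and (c)
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, (name, (x, y)) in zip(axes, sets.items()):
        ax.scatter(x, y)          # each (x, y) pair becomes one point
        ax.set_title(name)
        ax.set_xlabel("X")
        ax.set_ylabel("Y")
    plt.show()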
(i) Set I represents a situation in which we do not see any relationship between the values of X and Y. High values of X do not appear to be associated with either high or low values of Y. This indicates that the sample values of X and Y vary independently, or that there is no evidence of a relationship between X and Y. When such a relationship is absent, as in the case of the scatter diagram of Set I (Fig. 1(a)), we say that the two variables are uncorrelated (or not correlated). (Note that in studying the relationship between X and Y, we are studying pairs of values of X and Y.)
(ii) If the two variables increase or decrease simultaneously (i.e., when one increases, the other also increases, or when one decreases, the other also decreases), then the correlation between the two variables is said to be a positive correlation. In this case, as shown for Set II in Fig. 1(b), the points follow a line of positive slope.
Figure 1: Scatter diagrams of Set I, Set II and Set III (panels (a), (b) and (c))
For example, the correlation between the heights and weights of a group of persons is a positive correlation.
(iii) In the scatter diagram of Set III (Fig. 1(c)), it is seen that high values of X pair with low values of Y and low values of X pair with high values of Y. In such a case, we say that X and Y are negatively correlated. In other words, the scatter diagram of Set III shows a negative correlation. Thus, we have the following definition:
Definition: If two variables deviate in opposite directions, then they are said to be negatively correlated or inversely correlated.

In other words, if an increase in one variable creates a decrease in the other, or a decrease in one creates an increase in the other, then the correlation between the two variables is said to be a negative correlation. In this case, the points of the scatter diagram follow a line of negative slope.
For example, the correlation between the price and demand of a commodity is a negative correlation.

Thus, you see how useful the scatter diagram is for studying the correlation between two variables. Now, you should try the following exercise:
E 1) Plot the following two sets of paired variables in a scatter diagram and comment on their relationship.

Set I:

Set II:
17.3 CORRELATION COEFFICIENT

Thus, we have seen that a correlation between two variables may or may not exist. In case a correlation exists between two variables, it is either a positive correlation or a negative correlation. How do we measure such a correlation, if it exists? The most popular method used for this purpose is the Pearson Product-Moment Correlation Coefficient method, due to Karl Pearson, a noted statistician already mentioned in Unit 14.
Recall that the variance of x is

    Variance(x) = Σ(x - x̄)² / n

where x̄ is the mean of the sample and 'n' is the number of observations in the sample.
We now define the covariance analogously as

    Cov(x, y) = [(x₁ - x̄)(y₁ - ȳ) + (x₂ - x̄)(y₂ - ȳ) + ... + (x₁₀ - x̄)(y₁₀ - ȳ)] / 10

where (x₁, y₁) forms the first pair, (x₂, y₂) forms the second pair, and so on, till the 10th pair (x₁₀, y₁₀).
We can measure the deviation of each x value from its mean x̄ and the deviation of each y value from its mean ȳ. If x and y vary together positively, then whenever (x - x̄) is positive, (y - ȳ) will also be positive, and whenever (x - x̄) is negative, (y - ȳ) will also be negative. However, if x and y vary together negatively, then whenever (x - x̄) is positive, (y - ȳ) will be negative, and vice-versa. Then you take the sum of the products of (x - x̄) and (y - ȳ). Since there are 10 pairs of x and y values, we divide this sum by 10 to get an average measure of covariance as

    Cov(x, y) = Σ(x - x̄)(y - ȳ) / 10
In general, when n pairs of values of the two variables x and y are given, the covariance is given as

    Cov(x, y) = Σ(x - x̄)(y - ȳ) / n
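As an illustration of this formula, here is a small Python sketch (not part of the original text) that computes the covariance of n pairs of values; the data used are the ten pairs of Set II of Table 1.

    def covariance(x, y):
        """Cov(x, y) = sum of (x - x_bar)(y - y_bar), divided by n."""
        n = len(x)
        x_bar = sum(x) / n
        y_bar = sum(y) / n
        return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

    # Set II of Table 1
    x = [5, 10, 7, 15, 17, 12, 20, 25, 22, 15]
    y = [4, 6, 9, 12, 27, 14, 17, 24, 21, 17]
    print(covariance(x, y))   # a positive value, as the scatter diagram suggests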
We would, however, like to measure the strength of the relationship between two
variables by means of a single number. The correlation coefficient is such a number
with the property that its value will vary from -1 to +1, taking the value of -1 when
the relationship between the two variables is perfectly negative, the value of +1 when
the relationship is perfectly positive (refer back to Fig. 1 of this unit), and the value of
0 when there is no relationship.
In other words, a value r = -1 will occur when all the points lie exactly on a straight line having a negative slope. The value r = +1 will indicate that all the points lie on a straight line having a positive slope. If r is close to +1 or -1, then we say that the linear relationship (related by a linear equation) between the two variables is strong and we have high correlation. However, if r is close to zero, then we say that the relationship between the two variables is weak or perhaps even does not exist.
The correlation coefficient is denoted by r and is computed by the formula

    r = Cov(x, y) / √[(Variance of x)(Variance of y)]
Note that the symbol 'r' is used to denote the correlation coefficient calculated from a sample of a population (see Unit 15). The symbol for the correlation coefficient in the population is ρ (pronounced 'rho'). In other words, r is usually used for the Sample Correlation Coefficient, which means a value computed from a sample of n pairs of measurements, whereas ρ is generally referred to as the Population Correlation Coefficient, which means the linear correlation coefficient for the entire population. Sometimes, the correlation coefficient r is also mentioned as the product-moment correlation coefficient to distinguish it from the notation ρ in the following way:
The correlation coefficient 'r' between n pairs of observations whose values are (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ) is

    r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² · Σ(y - ȳ)²]
In any case the two are equivalent to each other. For computational purposes, 'r' can be written as

    r = [Σxy - (Σx · Σy)/n] / √{[Σx² - (Σx)²/n] · [Σy² - (Σy)²/n]}

where

i) Σxy is the sum of products of the 'n' pairs of x and y,
ii) Σx² and Σy² are the sums of squares of the x's and the y's respectively,
iii) Σx and Σy are the sums of the x's and the y's respectively,
iv) 'n' is the number of pairs of observations.
Often this correlation coefficient is referred to as Pearson's correlation coefficient.
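To make the computational formula concrete, the following Python sketch (an illustration only, not part of the original unit) computes r from the sums Σx, Σy, Σxy, Σx² and Σy², using Set III of Table 1 as example data.

    from math import sqrt

    def pearson_r(x, y):
        """Product-moment correlation coefficient from the computational formula."""
        n = len(x)
        sum_x, sum_y = sum(x), sum(y)
        sum_xy = sum(xi * yi for xi, yi in zip(x, y))
        sum_x2 = sum(xi * xi for xi in x)
        sum_y2 = sum(yi * yi for yi in y)
        numerator = sum_xy - sum_x * sum_y / n
        denominator = sqrt((sum_x2 - sum_x**2 / n) * (sum_y2 - sum_y**2 / n))
        return numerator / denominator

    # Set III of Table 1: high X values pair with low Y values
    x = [5, 10, 8, 15, 13, 18, 20, 23, 25, 27]
    y = [25, 22, 17, 20, 15, 13, 17, 12, 19, 10]
    print(pearson_r(x, y))    # a value between -1 and 0 (negative correlation)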
There is one important point to be noted in using Pearson's correlation coefficient. It is applicable only when the relationship between the two variables under consideration is linear, or in other words, when the two variables have a straight line relationship. If the relationship between the two variables is non-linear or curvilinear, Pearson's correlation coefficient should not be used to measure the strength of the relationship. The scatter diagram, which we learnt about in Section 17.2, is very helpful in deciding whether the relationship is linear or not.
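To see why this caution matters, consider the following small illustration (the numbers are artificial, chosen only to make the point): y is completely determined by x through y = x², yet the covariance, and hence r, works out to 0 because the relationship is not linear.

    x = [-3, -2, -1, 0, 1, 2, 3]
    y = [xi * xi for xi in x]        # a perfect, but curvilinear, relationship

    n = len(x)
    x_bar = sum(x) / n               # 0 here, since x is symmetric about zero
    y_bar = sum(y) / n
    # covariance = sum of (x - x_bar)(y - y_bar) / n, which is 0 for this data,
    # so r = 0 even though y is completely determined by x
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    print(cov)                       # 0.0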
Example 1: Compute the correlation coefficient for the following data:
E 3) Draw the scatter diagram of the following data and comment on whether it is valid to represent the relationship seen by Pearson's correlation coefficient.
17.4 REGRESSION
In the previous section you have seen that the data giving the corresponding values of
two variables can be graphically represented by a scatter diagram. Also, you were
introduced to a method of finding the relationship between these two variables in
terms of the correlation coefficient. Very often, in the study of the relationship between two variables, we come across instances where one of the two variables depends on the other; in other words, we want to know the possible value of the dependent variable when the value of the independent variable is known. For example, the body weight of a growing child depends on the nutrient intake of the child, or the weight of an individual may be dependent on his height, or the response to a drug can be dependent on the dose of the drug, or the agricultural yield may depend on the quantum of rainfall. In such
situations, where one of the variables is dependent and the other independent, you
may ask "can we find a method of estimating the numerical relationship between two
variables so that given a value of the independent variable, we can predict the average
value of the dependent variable?".
Note that we are trying to predict or estimate the average value of the dependent
variable for a given value of the independent variable. We cannot determine the exact
value of the dependent variable when the value of the independent variable is known.
What perhaps we can do is just to make an estimation of the value of the dependent
variable, knowing fully well that there could be an error in our estimation. This is
because of the reason that there is no certainty that the estimated value of the variable
would be exactly the same as the value actually observed. This is also because for a
given value of the independent variable, the dependent variable will usually show
some variations in its values. For example, not all persons of a given height, say of
5' 6" have the same weight. Some will be heavier than others. This is why we talk of
predicting the average value of the dependent variable for a given value of the
independent variable.
Let us consider another situation, where it is not logically meaningful to consider one
variable as dependent on the other. The heights of brothers and sisters, we expect, will
be related. However, it would not be logical to consider the height of one member of a
pair to be dependent on the height of the other member. The height of a sister is not
dependent on the height of the brother while both of them may be related through the
heights of their parents.
When two variables are so related that it is logical to think of estimating the value of
one variable as dependent on the other, then the relationship can be described by
what is called a Regression Function or Regression Equation, a term introduced by
Sir Francis Galton (1822-1911).
The words 'dependent' and 'independent' are used in the study of probability with a different meaning nowadays. Therefore, in the context of regression, the dependent variable is called the "outcome" variable and the independent variable is called the "predictor" variable. We shall stick to this terminology in studying regression, since in regression, our interest is in prediction. Hence Y will be called the outcome variable, and X the predictor variable.
The simplest form of the regression function relating the predictor and outcome variables is that of a straight line, as shown in Fig. 2 below. That is, the nature of the relationship between Y and X is linear.

In this situation, the regression function takes the form Y = a + bX, where Y is the outcome variable, X the predictor variable, and a and b are constants to be estimated from a given set of data. Figure 3 below presents a straight line relationship.
In Fig. 3,

i) 'a' is the intercept that the straight line Y = a + bX makes on the Y-axis, that is, Y = a when X = 0.

ii) 'b' is the slope of the line, that is, the increase in the value of Y for every unit increase in the value of X.

Note that b is marked on the graph as the height that the straight line gains corresponding to an increase of one unit in the X value. Further, b is positive when Y increases as X increases, and negative when Y decreases as X increases. If Y takes the value 0 when X is 0, then 'a' becomes equal to 0, so that the regression line passes through the origin, that is, the point where both X and Y are 0. The regression equation is then written as

    Y = bX
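As a small numerical illustration (the intercept and slope below are hypothetical, not taken from any data in this unit), the following sketch shows how a and b are read off from the equation Y = a + bX.

    # Hypothetical intercept and slope, purely for illustration
    a, b = 3.0, 2.0

    def predict(x):
        """Predicted (average) value of the outcome Y for a given predictor X."""
        return a + b * x

    print(predict(0))   # 3.0  -> the intercept a: value of Y when X = 0
    print(predict(1))   # 5.0  -> one unit increase in X adds b = 2 to Y
    print(predict(5))   # 13.0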
17.5 FITTING OF LINEAR REGRESSION

Fig. 4: Scatter diagram of the data of Table 2 (heights on the X-axis) with a straight line fitted to the points
Consider the difference between the value y₁ and the corresponding value determined from the straight line fitted to the data. We denote this difference by d₁, which is generally called the deviation. Of all the straight lines that can be fitted to this data, we define that line as the "best" which minimises the sum of squares of the deviations of the individual points from the line, the deviations being measured along the Y-axis, i.e., if dᵢ denote the deviations, then Σdᵢ² is minimum. In Figure 4, the deviations of the points from the best fitting line are marked as dotted lines. Since we are attempting to
fit a straight line of the form Y = a + bX, the deviations of the observed values of Y from the line have to be measured in the
units in which the dependent variable Y is expressed. That is why in the diagram, you
find the deviations of the observed Y values from the line marked (in dotted lines) in
the direction of the Y axis. But then you may raise the following questions: "why
should we sum the squares of these deviations and choose that line which minimises
this sum? Why not just add up these deviations, some of which are positive and others
negative, and take that line which gives a minimum value to this sum?"
The reason why we do not take just the sum of these deviations is that more than one
straight line may satisfy the condition of this sum having the same minimum value.
For example, we may be able to fit two different straight lines to this data both of
which may give the sum of the deviations as zero. We want our method to give a unique line. Minimising the sum of squares of the deviations gives such a unique line, that is, only a single line. Let us express the above idea in simple algebraic terms.
We want Σ(Y - a - bX)² for the 16 pairs of observations in our example to be a minimum. In the equation Y = a + bX we can substitute the values of X and Y, but we have to estimate the two unknown constants a and b.
That is, we want to find the values of a and b such that Σ(Y - a - bX)² is a minimum. This can be done by using calculus. We are not actually doing this here; we only give the expressions for a and b which minimise Σ(Y - a - bX)². The estimates of a and b are
    b = [Σxy - (Σx · Σy)/n] / [Σx² - (Σx)²/n]

    a = ȳ - b·x̄
Thus a and b can be calculated from the given set of data. For the example of 16 pairs of heights and weights given in Table 2, this gives the line drawn on the scatter diagram in Figure 4. The Method of Least Squares as described here is the best method for fitting a straight line regression provided the following assumptions are satisfied:

1) For every value of X, the predictor variable, the values of Y have a mean that is given by the equation α + βX, where α and β are parameters corresponding to the a and b that we estimated from the sample. In the example, we considered 16 pairs of individuals. If we had measured all the individuals in the population from which we got the 16 pairs of observations, then corresponding to each height, the mean weight would be given by α + βX.

2) The distribution of Y values around this mean α + βX corresponding to each X will have the same standard deviation (i.e., the variability of Y remains constant at all values of X).
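The calculation of a and b described above can be sketched in a few lines of Python (this is only an illustration; the 16 pairs of Table 2 are not reproduced here, so Set II of Table 1 is used as example data).

    def fit_line(x, y):
        """Least squares estimates: b = [Σxy - Σx·Σy/n] / [Σx² - (Σx)²/n], a = ȳ - b·x̄."""
        n = len(x)
        sum_x, sum_y = sum(x), sum(y)
        sum_xy = sum(xi * yi for xi, yi in zip(x, y))
        sum_x2 = sum(xi * xi for xi in x)
        b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x**2 / n)
        a = sum_y / n - b * sum_x / n
        return a, b

    # Illustration with Set II of Table 1 (any set of paired data will do)
    x = [5, 10, 7, 15, 17, 12, 20, 25, 22, 15]
    y = [4, 6, 9, 12, 27, 14, 17, 24, 21, 17]
    a, b = fit_line(x, y)
    print(f"Y = {a:.3f} + {b:.3f} X")   # the fitted regression line of Y on X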
E 4) The data below show the initial weights (grams) and gains in weight of 12 female rats on a high protein diet from the 20th to the 80th day after their birth. Examine the data to decide if the gain in weight depends on the initial weight. Compute the linear regression equation of gain in weight on initial weight. Present your results on a graph, taking initial weight on the X-axis and gain in weight on the Y-axis.

    Rat number:              1    2    3    4    5    6    7    8    9   10   11   12
    Gain in weight (gm):    93  128  106  126  154   79  112  128  120  118  120   98
E 5) The data below give the percentage of fruits affected by worms and the corresponding number of fruits on a sample of 14 apple trees. Examine, through fitting a linear regression to this data, whether the percentage of fruits affected by worms is dependent on the number of fruits on the tree. Present the data on a graph paper.

    Tree number:                      1   2   3   4   5   6   7   8   9  10  11  12  13  14
    Number of fruits (in hundreds):   5   8   9  15  12   6  10   8   7  16   4  11  13  14
    Per cent affected by worms:      60  44  41  11  28  54  37  46  52   4  65  32  24  17
In our earlier discussion, we have considered one of the variables, Y, as the outcome variable and X as the predictor variable. What happens if we reverse the variables, i.e., what happens if we consider X as the outcome variable and Y as the predictor variable? We get a regression of the form

    X = a' + b'Y

where a' and b' are different from the a and b of the regression equation Y = a + bX. In other words, the regression of Y on X and the regression of X on Y are two different lines. It is easy to see why this is so. Recall that for fitting the regression of Y on X we minimised Σ(Y - a - bX)², that is, we minimised the sum of squares of the deviations of the observed Y values from those predicted by the "best" regression line. These deviations, we noted, were measured along the Y-axis or, in other words, the deviations were the vertical deviations from the line along the Y-axis. If we fit the regression line of X on Y, then we have to minimise Σ(X - a' - b'Y)², or, in other words, minimise the deviations of the observed X values from the "best" line, measuring the deviations along the X-axis. We consider horizontal instead of vertical deviations, which amounts to interchanging the X-axis and the Y-axis. That is why we get two different lines for the regression of Y on X and of X on Y.
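This can also be checked numerically. The sketch below (again only an illustration, using Set II of Table 1) fits both lines and shows that the slope of the regression of X on Y is not the reciprocal of the slope of the regression of Y on X, so the two lines are different.

    def fit(x, y):
        """Least squares fit of y on x: returns (a, b) in y = a + b*x."""
        n = len(x)
        sx, sy = sum(x), sum(y)
        sxy = sum(xi * yi for xi, yi in zip(x, y))
        sx2 = sum(xi * xi for xi in x)
        b = (sxy - sx * sy / n) / (sx2 - sx**2 / n)
        a = sy / n - b * sx / n
        return a, b

    x = [5, 10, 7, 15, 17, 12, 20, 25, 22, 15]   # Set II of Table 1
    y = [4, 6, 9, 12, 27, 14, 17, 24, 21, 17]

    a, b = fit(x, y)        # regression of Y on X:  Y = a + b X
    a2, b2 = fit(y, x)      # regression of X on Y:  X = a' + b' Y
    print(b, 1 / b2)        # unequal, so the two regression lines are different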
We now end this unit by giving a summary of what we have covered in it.
17.6 SUMMARY
In this unit, we have
5 ) described the method of least squares for fitting a linear regression line to a given
set of data and defined the regression coefficient,
17.7 SOLUTIONS/ANSWERS

Fig. 5(b) and Fig. 5(c): scatter diagrams of the paired data
    b = [Σxy - (Σx · Σy)/n] / [Σx² - (Σx)²/n]

    a = ȳ - b·x̄ = Σy/n - b·(Σx/n) = 24.5924
The regression equation is Y = 24.5924 + 1.4030X
    Σxy - (Σx · Σy)/n