Correlation and Regression
Module 3
• Correlation is used to study the degree of relationship among
two or more variables.
• Regression, on the other hand, is used to estimate the value of one
variable for a given value of another.
In practice, we come across many situations that call for the
statistical analysis of one or more variables.
The data concerned with one variable only is called univariate
data. For eg: price, income, demand, production, weight,
height, marks etc. are concerned with one variable only. The
analysis of such data is called univariate analysis.
The data concerned with two variables are called bivariate
data. For eg: rainfall & agriculture; income & consumption;
price & demand; height & weight etc. The analysis of these
two sets of data is called bivariate analysis.
The data concerned with three or more variables are called
multivariate data. For eg: Agriculture production is
influenced by rainfall, quality of soil, fertilisers used etc. The
analysis of three or more variables is called multivariate
analysis.
Correlation Analysis
Definition: Two or more variables are said to be correlated if the
change in one variable results in a corresponding change in the
other variable.
According to Simpson and Kafka “ Correlation analysis deals
with the association between two or more variables”.
Ya-Lun Chou defines: “Correlation analysis attempts to determine
the degree of relationship between variables”.
Boddington states that “ Whenever some definite connection
exists between two or more groups or classes of series of data,
there is said to be correlation”
Correlation Coefficient
Correlation analysis is actually an attempt to find a numerical
value expressing the extent of the relationship that exists between
two or more variables.
The numerical measurement showing the degree of correlation
between two or more variable is called correlation coefficient.
Concurrent Deviation Method
r = ±√(±(2C − N) / N)
Where:
N = No. of pairs of deviations compared
C = No. of concurrent deviations (i.e. the no. of + signs in the
Dx × Dy column)
Method of calculation
1. An increase over the preceding value is denoted by a + sign and a
decrease by a − sign.
2. The first value has no preceding value, so no sign is assigned to it.
3. C stands for the number of + signs in the Dx × Dy column, i.e. the
concurrent deviations.
4. N is the no. of pairs compared, i.e. the total no. of + and − signs
in the Dx × Dy column.
5. When 2C > N, r is positive.
6. When 2C < N, r is negative.
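The steps above can be sketched in Python; the paired observations below are illustrative:

```python
import math

# Illustrative paired observations.
x = [65, 40, 35, 75, 63, 80]
y = [60, 55, 50, 56, 30, 70]

# Sign of change from the preceding value: +1 for an increase, -1 for a decrease.
# (The first value has no preceding value, so it gets no sign.)
dx = [1 if b > a else -1 for a, b in zip(x, x[1:])]
dy = [1 if b > a else -1 for a, b in zip(y, y[1:])]

n = len(dx)                                      # N: no. of pairs of deviations
c = sum(1 for a, b in zip(dx, dy) if a * b > 0)  # C: no. of + signs in Dx*Dy

# r = +/- sqrt(+/- (2C - N) / N); the sign follows the sign of (2C - N).
inner = (2 * c - n) / n
r = math.copysign(math.sqrt(abs(inner)), inner)
```

In this data set both series always move together, so C = N and r = +1.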
Mean: Add up all the data points and then divide by the total no.
of observations.
Median: The middle value, the midpoint of the data when
arranged in order.
Mode: The value that appears the most often.
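These three averages can be checked quickly with Python's standard statistics module (the data set is an example):

```python
import statistics

data = [4, 7, 7, 9, 13]

mean = statistics.mean(data)      # (4 + 7 + 7 + 9 + 13) / 5 = 8
median = statistics.median(data)  # middle value of the ordered data = 7
mode = statistics.mode(data)      # most frequent value = 7
```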
Population and Sample variance
• When you have collected data from every member of the
population that you are interested in, you can get an exact value
for the population variance.
• When you collect data from a sample, the sample variance is
calculated and used to make estimates or inferences about the
population variance.
Population Variance
σ² = Σ(X − μ)² / N
Where,
σ² = Population variance
X = Variable
μ = Population mean
N = Population size.
Sample Variance
S² = Σ(X − X̅)² / (n − 1)
Where,
S² = Sample variance
X = Variable
X̅ = Sample mean
n = Sample size
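The distinction between the two denominators (N for the population, n − 1 for a sample) can be seen with the standard statistics module; the data set is an example:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean = 5, sum of squared deviations = 32

pop_var = statistics.pvariance(data)   # divides by N = 8   -> 32 / 8 = 4.0
samp_var = statistics.variance(data)   # divides by n - 1 = 7 -> 32 / 7
```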
Standard deviation & Variance
• SD is a measure of how spread out numbers are
• SD measures the dispersion of a data set relative to its mean
• Its symbol is σ (sigma)
• It is the square root of the variance
SD = √Variance
• The variance is a measure of variability. It is calculated by
taking the average of squared deviations from the mean.
• Variance tells you the degree of spread in your data set. The
more spread the data, the larger the variance is in relation to
the mean.
The characteristics of Range, Variance and SD
• The more the data are spread out or dispersed, the larger the
range, variance and SD.
• The more the data are concentrated or homogeneous, the
smaller the range, variance and SD.
• If the values are all the same (so that there is no variation in the
data), the range, variance and SD are all equal to zero.
• None of the measures of variation (range, variance and SD) can
ever be negative.
Coefficient of Variation
Coefficient of variation is a relative measure of variation that is
always expressed as a percentage rather than in terms of the
units of the particular data. The coefficient of variation, denoted
by the symbol ‘CV’, measures the scatter in the data relative to
the mean.
CV = (S / X̅) × 100
Where,
S = Sample standard deviation
X̅ = Sample mean
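A short sketch of the CV calculation (the data are illustrative):

```python
import statistics

data = [12, 15, 17, 20, 16]

s = statistics.stdev(data)    # sample standard deviation
xbar = statistics.mean(data)  # sample mean = 16
cv = (s / xbar) * 100         # a percentage; unit-free, so comparable across data sets
```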
Karl Pearson’s Coefficient of Correlation
Karl Pearson, the great biologist and statistician has given a formula
for calculation of coefficient of correlation, popularly known as
Pearsonian coefficient of correlation and is denoted by the symbol
‘r’. The formula for computing the Pearsonian coefficient of
correlation is
r = Σxy / (N σx σy)
Where, x = X − X̅ and y = Y − Y̅ (deviations from the actual means),
σx = SD of the X series, σy = SD of the Y series,
N = No. of pairs of observations.
This is also known as product moment correlation coefficient.
This method is to be applied only where deviations of items are
taken from actual means and not from assumed means.
The above formula for computing Pearsonian coefficient of
correlation can be transformed to the following form which is
easier to apply.
r = Σxy / √(Σx² × Σy²)
Direct method for calculating correlation coefficient
Correlation coefficient can also be calculated without taking
deviations of items either from actual mean or assumed mean.
The standard formula in such a case is:
r = (N ΣXY − ΣX ΣY) / √[(N ΣX² − (ΣX)²) (N ΣY² − (ΣY)²)]
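The direct formula translates straightforwardly into Python; the toy data below are illustrative:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Direct method: no deviations from the mean are needed.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2)
)
```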
Degree of Correlation (Interpretation of ‘r’)
The degree or the intensity of the relationship between two
variables can be ascertained by finding the value of coefficient of
correlation. The degree of correlation can be classified into:
1. Perfect Correlation: When the change in the two variables is
such that with an increase in the value of one, the value of
the other increases in a fixed proportion, correlation is said
to be perfect. Perfect correlation may be positive or
negative. Coefficient of correlation is +1 for perfect positive
correlation and it is -1 for perfect negative correlation.
2. No correlation: If changes in the value of one variable are
not associated with changes in the value of the other
variable, there will be No correlation. When there is no
correlation, the coefficient of correlation is Zero.
3. Limited degree of correlation: In between perfect
correlation and no correlation there may be limited degree
of correlation. Limited degree of correlation may also be
positive or negative. Limited degree of correlation may be
termed as high, moderate or low. For limited degree of
correlation the coefficient of correlation lies between 0 and 1
numerically.
Properties of Correlation Coefficient
1. Correlation coefficient has a well defined formula.
2. Correlation coefficient is a pure number and is independent
of the units of measurement.
3. It lies between -1 and +1.
4. Correlation coefficient does not change with reference to
change of origin or change of scale.
5. Correlation coefficient between x and y is same as that
between y and x.
Probable Error
• Probable error (PE) of the coefficient of correlation is a
statistical measure which measures the reliability and
dependability of the value of coefficient of correlation.
• It is mainly used for interpretation and determination of
limits.
• Probable error is used to interpret whether ‘r’ is significant or
not.
If |r| < PE, there is no evidence of correlation (r is not significant)
If |r| > PE, there is correlation
If |r| > 6 PE, there is significant correlation.
Formula for finding PE
PE = 0.6745 × (1 − r²) / √n
Where, ‘r’ = coefficient of correlation
‘n’ = No. of pairs of observations.
Standard Error
• It is basically the standard deviation of the sampling distribution
of a statistic
• Standard error is denoted by SE
The formula for finding the SE of the correlation coefficient is
SE = (1 − r²) / √n
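A small numeric sketch of SE, PE and the 6·PE significance check; the values of r and n are illustrative:

```python
import math

r = 0.8    # coefficient of correlation (illustrative)
n = 25     # no. of pairs of observations (illustrative)

se = (1 - r**2) / math.sqrt(n)   # standard error = 0.36 / 5 = 0.072
pe = 0.6745 * se                 # probable error

# r is taken as significant when it exceeds 6 times its probable error.
significant = r > 6 * pe
```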
Spearman’s Rank Correlation
• Karl Pearson’s coefficient is applicable when variables are
measured in quantitative form. But in many cases
measurement is not possible because they are in qualitative
form.
• For example, we cannot measure the beauty or intelligence
quantitatively. But it may be possible, in their case, to rank the
individuals in some order.
• The correlation coefficient obtained from the ranks so obtained
is called rank correlation.
• Therefore, rank correlation is the correlation obtained from
ranks, instead of their quantitative measurement.
• Thus when the values of two variables are expressed in ranks
and therefrom correlation is obtained, that correlation is
known as rank correlation.
• Spearman has devised a formula known as Spearman’s rank
correlation coefficient to find the correlation coefficient from
the ranks.
According to Spearman’s method, the formula for the rank
correlation coefficient is:
r = 1 − 6ΣD² / (n(n² − 1))
Where, ‘D’ is the difference between the ranks of corresponding
items and ‘n’ is the number of pairs of items.
Equal/ Repeated Ranks (Tie in Rank)
When values repeat in one or both of the series x and y, the rank
correlation coefficient is obtained using the formula:
r = 1 − 6[ΣD² + Σ(m³ − m)/12] / (n(n² − 1))
Where, ‘m’ stands for the no. of times a value repeats (the no. of
equal ranks in a group); the correction term (m³ − m)/12 is added
once for each such group.
• While assigning ranks, if two or more items have equal values
(i.e. if a tie occurs) they are given the mid rank.
• Thus if two items tie for the fifth rank, each is ranked
(5 + 6)/2 = 5.5 and the next item in the order of size is
ranked seventh.
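A sketch of rank correlation with mid-rank tie handling; the helper function names are our own:

```python
from collections import Counter

def mid_ranks(values):
    """Rank values (largest = rank 1); tied values share the average of their positions."""
    ordered = sorted(values, reverse=True)
    first_pos = {}
    for pos, v in enumerate(ordered, start=1):
        first_pos.setdefault(v, pos)
    counts = Counter(values)
    # a group of m equal values occupying positions p .. p+m-1 gets mid rank p + (m-1)/2
    return [first_pos[v] + (counts[v] - 1) / 2 for v in values]

def rank_correlation(x, y):
    rx, ry = mid_ranks(x), mid_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # tie correction: add (m^3 - m)/12 once for each group of m equal values
    correction = sum((m**3 - m) / 12
                     for m in list(Counter(x).values()) + list(Counter(y).values())
                     if m > 1)
    return 1 - 6 * (d2 + correction) / (n * (n**2 - 1))
```

For identical rankings the coefficient is +1 and for perfectly opposite rankings it is −1, as expected.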
Merits and Demerits of Rank Correlation
Merits
1. It is easy to calculate
2. It is simple to understand
3. It can be applied to both quantitative and qualitative data.
Demerits
1. The rank correlation coefficient is only an approximate measure,
as the actual values are not used.
2. It is not convenient when ‘n’ is large.
3. Further algebraic treatment is not possible.
Partial Correlation
• Suppose there are many variables and we want to study the
relationship between only two of them; then we have partial
correlation.
• In partial correlation we consider only two variables; the others
are treated as constant (normal), so that their effect is
eliminated.
• For eg: consider three variables; yield, rainfall and
temperature. Here the correlation between yield and rainfall
treating temperature as normal, is partial correlation.
Partial Correlation Coefficient
Partial correlation coefficient measures the relationship between
one variable and one of the other variables assuming that the
effect of the rest of the variables is eliminated.
Let X₁, X₂ and X₃ be three variables; then r₁₂.₃ is the partial
correlation coefficient between X₁ and X₂, treating X₃ as constant
or normal. Similarly we have r₁₃.₂ and r₂₃.₁.
These partial correlation coefficients can be computed from
the simple correlation coefficients as shown below:
r₁₂.₃ = (r₁₂ − r₁₃ r₂₃) / √[(1 − r₁₃²)(1 − r₂₃²)]
r₁₃.₂ = (r₁₃ − r₁₂ r₂₃) / √[(1 − r₁₂²)(1 − r₂₃²)]
r₂₃.₁ = (r₂₃ − r₁₂ r₁₃) / √[(1 − r₁₂²)(1 − r₁₃²)]
Where, r₁₂, r₁₃ and r₂₃ respectively stand for the simple coefficients of
correlation between X₁ & X₂, X₁ & X₃, and X₂ & X₃.
r₁₂ & r₂₁ are the same.
r₁₃ & r₃₁ are the same, likewise
r₂₃ & r₃₂ are the same.
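The first formula can be evaluated directly; the three simple correlation values below are illustrative:

```python
import math

# Illustrative simple (zero-order) correlations among X1, X2, X3.
r12, r13, r23 = 0.8, 0.6, 0.5

# Partial correlation between X1 and X2, treating X3 as constant.
r12_3 = (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))
```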
Multiple Correlation
When there are many variables and we want to study relation of
one variable with all the other variables taken together, the
correlation obtained is called Multiple correlation.
For example, if the variables are yield, rainfall and temperature
and we want to study the relation of yield with both rainfall and
temperature taken together we find the multiple correlation.
So in multiple correlation one variable is on one side and all
other variables together on the other side.
Multiple Correlation Coefficient
If X₁, X₂ and X₃ are three variables, then the coefficient of multiple
correlation between X₁ on one hand and X₂ and X₃ together on the
other hand is denoted by R₁.₂₃. The three multiple correlation
coefficients are:
R₁.₂₃ = √[(r₁₂² + r₁₃² − 2 r₁₂ r₁₃ r₂₃) / (1 − r₂₃²)]
R₂.₁₃ = √[(r₁₂² + r₂₃² − 2 r₁₂ r₁₃ r₂₃) / (1 − r₁₃²)]
R₃.₁₂ = √[(r₁₃² + r₂₃² − 2 r₁₂ r₁₃ r₂₃) / (1 − r₁₂²)]
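The multiple correlation coefficient of X₁ on X₂ and X₃ can be computed from the three simple correlations; the values below are illustrative:

```python
import math

# Illustrative simple correlations among X1, X2, X3.
r12, r13, r23 = 0.8, 0.6, 0.5

# Multiple correlation of X1 with X2 and X3 taken together.
R1_23 = math.sqrt((r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2))
```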
Regression Analysis
• The estimation or prediction of future production,
consumption, prices, investments, sales, profits, income etc.,
is of paramount importance to a businessman or an
economist.
• Regression analysis is a scientific technique for making such
predictions.
Definition: Regression analysis, in the general sense, means the
estimation or the prediction of the unknown value of one
variable from the known value of the other variable. It is a
statistical device used to study the relationship between two or
more variables that are related.
In the words of MM Blair “ Regression analysis is a mathematical
measure of the average relationship between two or more
variables in terms of the original units of the data”
Dependent and Independent Variables
In regression analysis there are two types of variables. The
variable whose value is influenced or is to be predicted is called
dependent variable and the variable which influences the values
or is used for prediction, is called independent variable.
Types of regression analysis
On the basis of number of variables – Simple and Multiple
On the basis of proportion of change in variable – Linear and
Non- linear.
Simple and Multiple regressions
When there are only two variables the regression equation
obtained is called simple regression equation.
In multiple regression analysis there are more than two variables
and we try to find out the effect of two or more independent
variables on one dependent variable.
Let X, Y and Z be three variables, with X and Y as the independent
variables and Z depending on them. Then we use multiple
regression analysis to study the movement of Z for a unit
movement in X and Y.
For example, consider the three variables yield, rainfall and
temperature. If yield depends on rainfall and temperature,
then we get the regression equation of Z on X and Y, where
Z is yield, X is rainfall and Y is temperature.
Linear and Non –Linear Regression
• On the basis of proportion of changes in the variables, the
regression can be classified into Linear and Non-Linear
regressions.
• If the given bivariate data are plotted on a graph, the points
so obtained on the scatter diagram will more or less
concentrate around a curve called ‘curve of regression’.
• If the regression curve is a straight line, we say that there is
linear regression between the variables under study. The
equation of such a curve is the first degree equation in the
variables x and y.
• Mathematically, the relation between x and y in a linear
regression, can be expressed in the form, y = a + bx.
• In a linear regression, the change in the dependent variable is
proportionate to the changes in the independent variable.
• If the curve of regression is not a straight line, then the
regression is termed as curved or Non-Linear regression. The
regression equation in such cases is not of first degree. In
this case the dependent variable does not change by a
constant amount of change in the independent variable.
Line of best fit (Regression lines)
• When the given bivariate data are plotted on a graph, we get
the scatter diagram. If the points of the scatter diagram
concentrate around a straight line, that line is called the line
of best fit. The line of best fit is that line which is closer to
the points of the scatter diagram.
• This line is also known as Regression line.
• So, a regression line is a graphic technique to show the
functional relationship between the dependent and the
independent variables. It shows average relationship
between the variables.
Method of drawing regression lines- Free hand curve method
Under this method, original data are plotted on a graph paper.
Usually original data when plotted on a graph gives a wave like
curve but it depicts a general tendency of the data.
Independent variable is taken along the horizontal axis and
dependent variable along the vertical axis.
We draw smooth free hand line in such a way that it clearly
indicates the tendency of the original data.
This line is fitted by inspection. Care is taken to draw the line in
such a way that the areas of the curve below and above the line
are approximately equal.
Two Regression lines
While estimating the value of ‘y’ for any given value of ‘x’, we
take y as dependent variable and x as independent variable.
Then we get the line of regression of y on x.
Similarly for estimating x for any given value of y, we use the
regression of x on y. Here x is dependent variable and y is
independent variable.
Thus there are two regression lines.
Regression Equations: Regression equation is a mathematical
relation between the dependent and independent variables.
There are two regression lines and hence there are two
regression equations; Regression equation of y on x and
Regression equation of x on y
Regression equation of y on x
y − y̅ = byx (x − x̅)
Where, byx = Σxy / Σx² (x and y being deviations from the
respective means)
Regression equation of x on y
x − x̅ = bxy (y − y̅)
Where, bxy = Σxy / Σy²
Relationship between Correlation Coefficient and Regression
Coefficients
byx = r (σy / σx)
bxy = r (σx / σy)
r = ±√(byx × bxy)
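The two regression coefficients and their relation to r can be verified numerically; the toy data and function name are our own:

```python
import math
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

xbar, ybar = statistics.mean(x), statistics.mean(y)

# covariance and (population) variances from deviations about the means
cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / len(x)
var_x = statistics.pvariance(x)
var_y = statistics.pvariance(y)

byx = cov / var_x   # regression coefficient of y on x
bxy = cov / var_y   # regression coefficient of x on y

# both coefficients are positive here, so r takes the + sign
r = math.sqrt(byx * bxy)

def predict_y(xv):
    """Regression line of y on x: y - ybar = byx * (x - xbar)."""
    return ybar + byx * (xv - xbar)
```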
Distinction between correlation and Regression
1. In correlation analysis we study degree of relationship between the
variables whereas in regression analysis we study the nature of
relationship between the variables so that we may be able to
predict the value of one on the basis of another.
2. Correlation is merely a tool for ascertaining the degree of
relationship between two variables; therefore, we cannot say
that one variable is the cause and the other the effect. In
regression analysis, however, one variable is taken as dependent
and the other as independent, making it possible to study the
cause-and-effect relationship.
3. Correlation analysis is not for the purpose of prediction whereas the
regression analysis is basically used for prediction purposes.
4. There may be nonsense correlation between two variables
which is purely due to chance and has no practical relevance
such as increase in income and increase in weight of a group
of people. However, there is nothing like nonsense
regression.
5. Correlation is a measure of the direction and degree of the
linear relationship between two variables X and Y. It is
immaterial which of X and Y is the dependent variable and which
is the independent variable (rxy = ryx). In regression analysis the
regression coefficients byx and bxy are not equal, and hence it
definitely makes a difference as to which variable is
dependent and which is independent.
Usefulness of the study of regression
• Regression analysis is a branch of statistical theory that is widely
used in almost all the statistical disciplines.
• In economics it is the basic technique for measuring or estimating
the relationship among economic variables that constitute the
essence of economic theory and economic life. For example, if
we know that two variables price(X) and demand (Y) are closely
related we can find out the most probable value of X for a given
value of Y or the most probable value of Y for a given value of X.
• Regression analysis is widely used in business. The success of
every businessman depends upon his correct estimation about
future production, sales, profit etc.
• The utility of regression is high in physical sciences where the
data are generally in functional relationship. Therefore, it is
always possible to exactly calculate the value of one variable
for a given value of the other variable by studying their
regression.
• With the help of regression coefficients we can calculate the
correlation coefficient. The square of correlation coefficient,
called coefficient of determination, measures the degree of
association or correlation that exists between the two
variables. It assesses the proportion of variance in the
dependent variable that has been accounted for by the
regression equation.
The Properties of regression lines
1. The two lines intersect at (x̅, y̅)
2. When r = ±1, the two lines coincide.
3. When r = 0, the two lines are mutually perpendicular.
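Property 1 can be checked numerically: evaluating the line of y on x at x = x̅ yields y̅, and evaluating the line of x on y at y = y̅ yields x̅. The toy data below are illustrative:

```python
import statistics

x = [2, 4, 6, 8]
y = [3, 7, 5, 9]

xbar, ybar = statistics.mean(x), statistics.mean(y)
cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / len(x)

byx = cov / statistics.pvariance(x)   # slope of y on x
bxy = cov / statistics.pvariance(y)   # slope of x on y

# Evaluate each line at the mean of its independent variable.
y_on_x_at_mean = ybar + byx * (xbar - xbar)   # equals ybar
x_on_y_at_mean = xbar + bxy * (ybar - ybar)   # equals xbar
```

Both lines therefore pass through the point (x̅, y̅), which is where they intersect.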