Statistics
K. Veerendra Patil
Quantitative Aptitude: Correlation & Regression Analysis
CHAPTER III
Correlation and Regression
Correlation:
Meaning and Definition of correlation:
The concept of correlation analysis and the term "correlation" originated with Galton in 1888.
Correlation is the relationship that exists between two or more variables. If two variables
are related to each other in such a way that change in one creates a corresponding change in
the other, then the variables are said to be correlated.
Correlation is the relationship between two variables in which a change in one variable results in a positive or negative change in the other, and a greater change in one variable results in a correspondingly greater or smaller change in the other.
Correlation is sometimes termed as Co-variation. The measure of correlation can be
absolute or relative. Absolute measure is known as covariance and the relative measure is
called coefficient of correlation.
Characteristics:
1. The movements in one variable are accompanied by corresponding movements in the
other variable.
2. Correlation indicates the degree of relationship between two variables.
3. Correlation shows the direction of change in one variable when there is a change in
another variable.
4. Correlation is an analysis of co-variation between two or more variables.
5. Correlation expresses the relationship between groups of items or values of the variables, not between the individual items or values. The relationship between the two variables is not functional.
Meaning of Correlation analysis:
Correlation Analysis is a statistical technique used to measure the degree and direction of
relationship between the variables. Correlation analysis is a statistical procedure by which
we can determine the degree of association or relationship between two or more variables.
Significance and Uses of Correlation
The correlation is of great significance in practical life, because of the following reasons:
1. The study of correlation enables us to know the nature, direction and degree of
relationship between two or more variables. Correlation analysis is used in deriving
precisely the degree and direction of relationship between variables like price and
demand, advertising expenditure and sales, rainfalls and crops yield etc.
2. It is used in presenting the average relationship between any two variables through a
single value of co-efficient of correlation.
3. Correlation analysis helps us in understanding the behaviour of certain events under
specific circumstances. For example, we can identify the factors for rainfall in a given
area and how these factors influence paddy production.
4. Correlation facilitates the decision making in the business world. It reduces element of
uncertainty in decision-making. Economic theory and business studies show
relationships between variables like price and quantity demanded, advertising
expenditure and sales promotion measures, etc.
5. The measure of coefficient of correlation is a relative measure of change.
6. Correlation is used in developing the concept of regression and ratio of variation which
help in estimating the values of one variable for a given value of another variable. It
helps in making predictions. The effect of correlation is to reduce the range of
uncertainty of our prediction. The prediction based on correlation analysis will be more
reliable and near to reality.
Types of distributions:
Bi-variate and Multivariate Distributions:
a. Bi-variate Distribution:
The distribution in which each unit of the series assumes two values is called bivariate
distribution. In a bivariate distribution the two variables are said to be correlated: a change in one variable results in a corresponding change in the other. The two variables can be measured on the same individual, e.g., length and weight, or oxygen consumption and body weight; or the same variable can be measured in two or more related groups, such as intelligence quotient in siblings, or height in parents and offspring.
b. Multivariate Distribution:
If more than two variables of each unit of distribution are measured, it is called a
multivariate distribution. In multivariate distribution the relationship between three or
more variables are studied. One variable will be dependent variable and all the other
variables will be independent variables.
Covariance:
Covariance is the sum of the products of the deviations of the variables (x, y) from their respective arithmetic means, divided by the number of observations. Covariance is an absolute measure of correlation. It is denoted by Cov(x, y).
When there are two variables (x, y) then covariance can be calculated by using the
following formula.
Cov(x, y) = Σ(X − X̄)(Y − Ȳ) / n
OR
Cov(x, y) = E(XY) − E(X)E(Y)
Where, E(x), E(y), E (xy) are the expectations of X, Y, and XY respectively.
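As a quick illustration (not part of the original text), both definitions of covariance can be checked numerically on a small made-up data set:

```python
# Covariance computed two ways on illustrative data (x, y are made up).

def covariance(x, y):
    # Definition 1: mean of the products of deviations from the arithmetic means
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def covariance_via_expectations(x, y):
    # Definition 2: Cov(x, y) = E(XY) - E(X)E(Y)
    n = len(x)
    e_xy = sum(a * b for a, b in zip(x, y)) / n
    return e_xy - (sum(x) / n) * (sum(y) / n)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(covariance(x, y))                   # 4.0
print(covariance_via_expectations(x, y))  # 4.0, as expected
```

Both routes give the same value, which is the content of the identity Cov(x, y) = E(XY) − E(X)E(Y).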
Types of correlation:
Positive and Negative Correlation
Depending upon the direction of change of the variables, correlation may be positive
correlation or negative correlation.
1. Positive Correlation
If both the variables vary in the same direction, correlation is said to be positive. In other
words, if one variable increases, the other also increases or, if one variable decreases the
other also decreases, then the correlation between the two variables is said to be a positive
correlation.
Example: The correlation between heights and weights of a group of persons is a positive correlation.
Height(Cms) 158 161 163 166 168 171 173 176
Weight (kg) 60 62 64 65 67 69 71 73
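Treating the table above as data, the strength of this positive relationship can be checked with Karl Pearson's coefficient, which is introduced later in this chapter; the function below is an illustrative sketch, not the author's own computation:

```python
import math

def pearson_r(x, y):
    # Pearson's r via the direct (raw-sums) formula
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sx2 - sx * sx) * (n * sy2 - sy * sy))
    return num / den

heights = [158, 161, 163, 166, 168, 171, 173, 176]
weights = [60, 62, 64, 65, 67, 69, 71, 73]
print(pearson_r(heights, weights))  # close to +1: strong positive correlation
```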
a. Perfect Positive Correlation:
When two variables move proportionately in the same direction, i.e., the increase in the
values of one variable leads to corresponding increase in the values of other variable, the
correlation between them is called Perfect Positive. It is also called direct correlation.
Examples of perfect positive correlation are rare in reality.
b. Moderately Positive Correlation:
When two variables are partially positively correlated, the correlation is termed moderately positive correlation, e.g., the height of plants and the quantity of manure used, or the mortality rate and overcrowding.
2. Negative Correlation:
If both the variables vary in opposite direction, the correlation is said to be negative or
opposite correlation. In other words, if one variable increases, but the other variable
decreases or, if one variable decreases but the other variable increases, then the correlation
between the two variables is said to be a negative correlation.
Example: The correlation between the price and demand of a commodity is a negative correlation.
Price (Rs. per unit) 5 4 3 2 1
Demand (units) 100 200 300 400 500
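For the price and demand figures above, demand falls by exactly 100 units for every one-rupee rise in price, so Pearson's coefficient (a sketch using the formula given later in this chapter) comes out exactly −1:

```python
import math

def pearson_r(x, y):
    # Pearson's r from deviations about the means
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

price = [5, 4, 3, 2, 1]
demand = [100, 200, 300, 400, 500]
print(pearson_r(price, demand))  # -1.0: perfect negative correlation
```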
a. Perfect Negative Correlation:
When two variables move proportionately in opposite directions, i.e., an increase in one variable leads to a corresponding proportionate decrease in the other, the correlation between them is called perfect negative. This is also called inverse correlation. Examples of perfect negative correlation are very rare in reality.
b. Moderately Negative Correlation:
When two variables are partially negatively correlated the correlation is termed as
moderately negative correlation, e.g. standard of living and cost of living.
3. Absolutely no Correlation:
When two variables are completely independent of each other, there is said to be absolutely no correlation, e.g., body weight and I.Q. In this case no imaginary mean line is formed which could indicate a trend of correlation.
Simple and Multiple Correlation:
Depending upon the study of number of variables, correlation may be simple or multiple.
1. Simple Correlation:
When only two variables are studied, it is a case of simple correlation. For example, when
one studies relationship between the yield of wheat per acre and the amount of rainfall, it is
a problem of simple correlation.
2. Multiple Correlation:
When three or more variables are studied, it is a case of multiple correlation. For example,
when one studies the relationship between yield of wheat per acre, amount of rainfall and
amount of fertilizers used, it is a problem of multiple correlation.
Partial Multiple and Total Multiple Correlation
1. Partial Multiple Correlation
In Partial Multiple Correlation one studies three or more variables but considers only two of them to be influencing each other, the effect of the other influencing variables being held constant. Its order depends on the number of variables held constant, e.g., if one variable is kept constant, it is called first-order partial correlation.
2. Total Multiple Correlation
In Total Multiple Correlation one studies three or more variables without holding the effect of any variable constant.
Linear and Non-linear (or Curvi-Linear) Correlation:
Depending upon the constancy of the ratio of change between the variables, the correlation
may be linear or Non-Linear.
1. Linear Correlation
If the amount of change in one variable bears a constant ratio to the amount of change in the other variable, the correlation is said to be linear. If such variables are plotted on graph paper, all the plotted points would fall on a straight line.
Example:
Milk (l) 10 20 30 40 50
Cheese (kg.) 2 4 6 8 10
2. Non-Linear (Curvilinear) Correlation
If the amount of change in one variable does not bear a constant ratio to the amount of
change in the other variable, then correlation is said to be non-linear. If such variables are
plotted on a graph, the point would fall on a curve and not on a straight line. For example,
if we double the amount of advertising expenditure, the sales would not necessarily be
doubled.
Example:
Advertising expenditure (Rs. in lakhs) 2 4 6 8 10
Sales (Rs. in lakhs) 10 12 15 15 16
Methods of Studying Correlation:
Correlation analysis measures the degree of association of two variables. Following
methods are used to measure the correlation between two variables.
The various methods of studying correlation are given below:
A. Graphic Method
1. Scatter diagram method.
2. Graphic Method.
B. Algebraic methods:
1. Karl Pearson's coefficient of correlation.
2. Spearman's rank correlation coefficient.
3. Concurrent deviation method.
A. Graphic Method
1. Scatter diagram method or scatter plot method.
Scatter diagram is the simplest method of studying relationship between two variables.
It is in the form of graphic representation of degree and direction of correlation between
two variables.
This is a very simple graphical method of studying correlation between two variables. One
of the variables is taken up on the X axis and the other on the Y axis. Each pair of the
variables is marked on the graph by means of a dot. So, we get as many dots as the number
of pairs. If the dots form a straight line, it means that correlation is present between the two
variables. If the straight line so formed slopes upwards from left to right, correlation is said
to be positive. On the other hand, if the line slopes downwards from left to right,
correlation is said to be negative. If the dots are scattered widely over the entire graph so
that they cannot form a straight line, it means that no correlation exists between the two
variables.
Merits of scatter Diagram:
1. Simple method:
This is a simple and nonmathematical method for studying correlation.
2. Easy to understand:
It is easy to understand and easy to interpret. It provides quick idea of correlation between
two variables just by a glance on the diagram.
3. First Step:
It can be regarded as the first step in studying relation between two variables.
4. Uninfluenced:
The results are not influenced by the size of extreme observations.
Demerits of scatter diagram:
1. Scatter diagram gives just an idea about the direction of correlations. It does not
establish the exact degree of correlation between two variables.
2. It is just a qualitative method of showing relationship between two variables.
3. This method is suitable for small number of observations. When number of observation
is very large, this method becomes tedious and complicated.
2. Graphic Method:
The presence or absence of correlation between two variables can be studied by drawing
two separate curves one for each variable on a graph. If the two curves run parallel to each
other in the same direction, it is an indication of positive correlation. If, on the other hand,
the two curves run in opposite directions, it is an indication of negative correlation.
For example, if it is desired to study the correlation between the value and volume of
exports of tea from India, during the last decade, the two variables i.e., volume and value
should be shown on the ordinate and the years on the X axis. The ordinate should have two scales: one for volume in kilograms and another for value in rupees. The relationship between the curves can be easily studied when the two curves are very close to each other. For this purpose, the two scales should be so adjusted that the averages of both the variables are at the same level. If necessary, a false base line may be taken to bring the curves nearer each other.
Merits of the graphic method:
1. Simple method:
This is a simple and nonmathematical method for studying correlation.
2. Easy to understand:
It is easy to understand and easy to interpret. It provides quick idea of correlation between
two variables just by a glance on the graph.
3. First Step:
It can be regarded as the first step in studying relation between two variables.
4. Uninfluenced:
The results are not influenced by the size of extreme observations.
Demerits of the graphic method:
1. The graphic method gives just an idea about the direction of correlation. It does not establish the exact degree of correlation between two variables.
2. It is just a qualitative method of showing relationship between two variables.
3. This method is suitable for small number of observations. When number of observation
is very large, this method becomes tedious and complicated.
Algebraic methods:
1. Karl Pearson's Coefficient of Correlation:
Meaning of coefficient of correlation:
Correlation may be defined as a tendency towards interrelated variation, and the coefficient of correlation is a measure of that tendency, i.e., the degree to which the two variables are interrelated is measured by a coefficient called the coefficient of correlation. It gives the degree of correlation.
The coefficient of correlation is thus the degree to which two variables are inter-related. It is a mathematical method for measuring the strength of the linear relationship between two variables. The British biometrician Prof. Karl Pearson devised several algebraic formulae for measuring not only the nature of the correlation but also its exact extent in numerical form. He represents the coefficient of correlation by the letter r and asserts that the value of r must lie between −1 and +1. His interpretation of the different values of r is as follows:
Interpretation of coefficient of correlation:
1. Coefficient of correlation is a measure of closeness between two variables.
2. The correlation may be positive or negative.
3. The range of correlation coefficient is from -1 to +1.
4. If r = +1, the correlation between two variables is perfect and positive.
5. If r = -1, the correlation between two variables is perfect and negative.
6. If there is a strong positive linear relationship between two variables, the value of r will
be close to +1.
7. If there is a strong negative linear relationship between two variables, the value of r will
be close to -1.
8. If r = 0, there is no linear correlation between the two variables.
9. The correlation coefficient is a pure number and is not affected by a change of origin and
scale in magnitude.
10. It is a relative measure of association between two or more variables.
Value of r          Degree of Correlation
r = ±1              Perfect correlation
0.90 ≤ |r| < 1      Very high degree of correlation
0.75 ≤ |r| < 0.90   Fairly high degree of correlation
0.50 ≤ |r| < 0.75   Moderate degree of correlation
0.25 ≤ |r| < 0.50   Low degree of correlation
0 < |r| < 0.25      Very low degree of correlation
r = 0               No correlation
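The table above can be mirrored in a small helper function; this is an illustrative sketch, with the category names taken from the table:

```python
def degree_of_correlation(r):
    # Map |r| to the verbal degree given in the table above.
    a = abs(r)
    if a == 1:
        return "Perfect correlation"
    if a >= 0.90:
        return "Very high degree of correlation"
    if a >= 0.75:
        return "Fairly high degree of correlation"
    if a >= 0.50:
        return "Moderate degree of correlation"
    if a >= 0.25:
        return "Low degree of correlation"
    if a > 0:
        return "Very low degree of correlation"
    return "No correlation"

print(degree_of_correlation(-0.8))  # sign is ignored; only |r| matters
```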
Algebraic properties of Pearson's coefficient of correlation:
Prof. Karl Pearson's coefficient of correlation discussed above has the following algebraic properties:
1. Its value must lie between +1 and −1, i.e., −1 ≤ r ≤ +1. This property provides a yardstick for checking the accuracy of the calculations.
2. It is independent of change of origin and of scale. By change of origin we mean subtraction or addition of some constant value from/to each value of a variable; such a constant may be the same or different for the two variables X and Y. By change of scale we mean dividing or multiplying each value of a variable by some constant, which again may be the same or different for the two variables. This property implies that the value of the coefficient of correlation remains the same even if there is a change of origin or a change of scale, and it helps in simplifying the calculations.
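Property 2 can be verified numerically; the data and the shifting/scaling constants below are made up for illustration:

```python
import math

def pearson_r(x, y):
    # Pearson's r from deviations about the means
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

x = [10, 12, 15, 19, 24]
y = [3, 5, 4, 8, 9]
u = [(xi - 15) / 5 for xi in x]   # change of origin (-15) and scale (/5)
v = [2 * yi + 7 for yi in y]      # change of scale (*2) and origin (+7)
print(pearson_r(x, y), pearson_r(u, v))  # identical values
```

Note the scale constants used here are positive; multiplying one variable by a negative constant would flip the sign of r.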
3. It is independent of the units of measurement. Even if the two variables are expressed in different units of measurement, viz. rainfall in inches and yield of crops in quintals, the coefficient of correlation comes out as a pure number. Thus it does not require that the units of measurement of both variables be the same.
4. It is independent of the order of comparison of the two variables; symbolically, rxy = ryx.
5. It is geometric mean of the two regression coefficients.
Assumptions of Pearson's Coefficient of Correlation:
Prof. Karl Pearson's coefficient of correlation is based on the following assumptions:
1. Linear relationship:
In devising the formulae Prof. Pearson has assumed that there is a linear relationship
between the variables which means that if the values of the two variables are plotted on a
scatter diagram it will give rise to a straight line.
2. Cause and effect relationship:
Prof. Pearson assumed that there is a cause and effect relationship between the correlated variables, i.e., a change in the value of one variable causes a change in the value of the other. According to him, without such a relationship correlation would carry no meaning at all.
3. Normality of distribution:
It is assumed that the population from which the data are collected is normally distributed. The two variables under correlation study are affected by a large number of independent causes so as to form a normal distribution.
4. Multiplicity of causes:
Prof. Pearson has assumed further that each of the variables under study is affected by
multiplicity of causes so as to form a normal distribution. Variables like age, height,
weight, price demand supply yield, temperature etc., which are usually taken to study
correlation, are affected by multiplicity of causes.
5. Probable error of measurement:
Prof. Pearson further assumed that some error may creep into the measurement of the coefficient of correlation, but the magnitude of such error must lie within a limit obtained by the following formula:
P.E.(r) = 0.6745 × (1 − r²) / √n
Where, r = coefficient of correlation, and n = number of pairs of the two variables.
If the constant 0.6745 is omitted from the above formula, we get the standard error:
S.E.(r) = (1 − r²) / √n
The probable error helps in interpreting the significance of the coefficient of correlation as follows:
1. The correlation is taken to be almost absent if r < P.E.(r).
2. The correlation is taken to be significant if r > 6 × P.E.(r).
3. The correlation is taken to be moderate if r > P.E.(r) but less than 6 × P.E.(r).
4. The limits of the correlation coefficient of the population are r ± P.E.(r).
Computation of Karl Pearson's coefficient of correlation:
I. Individual series:
It can be expressed in different forms as follows:
1. A ratio of the covariance between two variables to the product of their standard deviations is called Karl Pearson's correlation coefficient. It can be written as:
a) r = Cov(x, y) / (σx σy)
b) r = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² × Σ(Y − Ȳ)²]
c) r = Σ(X − X̄)(Y − Ȳ) / (n σx σy)
2. Under this method the actual values of the variables are taken to calculate the coefficient of correlation. This method is known as the direct method.
r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²] × [nΣy² − (Σy)²]}
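The direct-method formula can be checked on the milk and cheese figures from the linear-correlation example earlier in this chapter; the code is an illustrative sketch, not from the original text:

```python
import math

def pearson_r_direct(x, y):
    # Direct method: r = [nSxy - Sx*Sy] / sqrt([nSx2 - Sx^2][nSy2 - Sy^2])
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

milk = [10, 20, 30, 40, 50]
cheese = [2, 4, 6, 8, 10]
print(pearson_r_direct(milk, cheese))  # 1.0: perfectly linear, positive
```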
3. Under this method deviations from assumed means are taken to calculate the coefficient of correlation. This method is known as the deviations from assumed mean method.
r = [nΣdxdy − (Σdx)(Σdy)] / √{[nΣdx² − (Σdx)²] × [nΣdy² − (Σdy)²]}
Where dx = X − A and dy = Y − B are the deviations from the assumed means A and B.
II. Discrete and continuous series:
Under this method deviations from assumed means are taken, weighted by their frequencies, to calculate the coefficient of correlation.
r = [NΣf dxdy − (Σf dx)(Σf dy)] / √{[NΣf dx² − (Σf dx)²] × [NΣf dy² − (Σf dy)²]}
Where f = frequency and N = Σf = total frequency.
Merits and Demerits of Pearson's method of studying correlation
Merits:
The following are the chief merits of Karl Pearson's method of correlation:
1. This method not only indicates the presence or absence of correlation between two
variables but also determines the exact extent or degree to which they are correlated.
2. Under this method we can also ascertain the direction of the correlation i.e. whether
the correlation between the two variables is positive or negative.
3. This method enables us to estimate the value of a dependent variable for a particular value of an independent variable through regression equations.
4. This method has many useful algebraic properties, which make easy the calculation of the coefficient of correlation and of related measures, viz. the coefficients of determination and non-determination.
Demerits:
1. It is comparatively difficult to calculate as its computation involves intricate algebraic
methods of calculations.
2. It is very much affected by the values of the extreme items.
3. It is based on a large number of assumptions viz, linear relationship, cause and effect
relationship etc. which may not always hold well.
4. It is very much likely to be misinterpreted particularly in case of homogeneous data.
5. In comparison to the other methods, it takes much time to arrive at the results.
6. It is subject to probable error, which its propounder himself admits; therefore it is always advisable to compute the probable error while interpreting its results.
7. This method can be used only for those variables and attributes which have
quantitative measurements.
Probable Error:
Probable error of correlation coefficient
The probable error of the coefficient of correlation helps in interpreting its value. Probable error (P.E.) is an old measure of testing the reliability of an observed value of the correlation coefficient. It indicates the limits of variation between the coefficient of correlation of one sample and those of other samples obtained from the same population. It is calculated by applying the following formula:
P.E. = 0.6745 × (1 − r²) / √n
Where, P.E. = probable error, r = coefficient of correlation and n = number of observations.
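A worked illustration (the values r = 0.8 and n = 25 are chosen for the example, not taken from the text):

```python
import math

def probable_error(r, n):
    # P.E. = 0.6745 * (1 - r^2) / sqrt(n)
    return 0.6745 * (1 - r * r) / math.sqrt(n)

pe = probable_error(0.8, 25)
print(pe)            # 0.6745 * 0.36 / 5 = 0.048564
print(0.8 > 6 * pe)  # r exceeds 6 * P.E., so the correlation is significant
```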
The reliability of coefficient of correlation may be assessed as follows:
1. If coefficient of correlation is less than probable error, it is to be inferred that there is no
correlation between the variables. In other words, the value of r is to be taken as
insignificant.
2. If coefficient of correlation is more than six times the probable error, it is to be inferred
that there is significant correlation between the variables.
3. By adding and subtracting the value of probable error from the coefficient of correlation,
the upper and lower limits within which the coefficient of correlation can be expected in
the population can be determined.
Significance of Probable Error
1. Probable error of the correlation coefficient may be used to determine the limits within which the correlation coefficient of the population may be expected to lie:
Correlation in the population = r ± P.E.
2. P.E. is used to test whether an observed value of a sample correlation coefficient indicates significant correlation in the population.
3. By adding and subtracting the value of probable error from r, we get respectively the
upper and lower limits within which the coefficient of correlation in the population can
be expected to lie.
P.E. can be used under the following two conditions:
1. Data must have been drawn from a normal population.
2. The conditions of random sampling should prevail in selecting sampled observations.
Coefficient of determination
It is the square of the coefficient of correlation, i.e., r². It depicts what percentage of the total variance is explained by the correlation. It is also called the index of determination. For example, if r = 0.8, then the coefficient of determination r² = 0.64. It means that 64% of the variation in the dependent variable is explained by the given independent variable(s); the rest is unexplained and may be due to other factors. The coefficient of determination falls rapidly as r falls. It is more useful for economic analysis than r itself because it shows the degree of dependence.
The formula for the coefficient of determination is:
r² = Explained Variation / Total Variation
Coefficient of non-determination:
It is the complement of the coefficient of determination, i.e., 1 − r². For example, if the coefficient of determination is 0.64, then the coefficient of non-determination is (1 − 0.64) = 0.36. Whereas the coefficient of determination depicts the degree of dependence of a dependent variable on the concerned independent variable, the coefficient of non-determination shows the lack of dependence on the given independent variable.
1 − r² = 1 − Explained Variation / Total Variation
Coefficient of Alienation:
The coefficient of alienation is the square root of the coefficient of non-determination. It is calculated as follows:
Coefficient of Alienation = √(1 − r²)
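The three related measures can be computed together; r = 0.8 below is the worked value used in the text:

```python
import math

r = 0.8
determination = r ** 2                # explained share of the variation
non_determination = 1 - r ** 2        # unexplained share: 1 - r^2
alienation = math.sqrt(1 - r ** 2)    # sqrt of the coefficient of non-determination
print(determination, non_determination, alienation)  # roughly 0.64, 0.36, 0.6
```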
2. Spearman's rank correlation method:
Karl Pearson's method for calculating correlation can be applied to those cases which have quantitative measurements. Spearman's rank correlation method is used for qualitative variables such as efficiency, intelligence, honesty, ability, bravery, beauty, tolerance, colour, etc. These variables cannot be measured quantitatively, but they can be arranged serially. The correlation coefficient between such qualitative variables cannot be calculated by Karl Pearson's method. The method to determine the coefficient of correlation in such cases was developed by Charles Edward Spearman in 1904 and is called Spearman's rank correlation coefficient method. The correlation coefficient between two variables observed on n individuals can be calculated by this method.
In this method the different expressions of an attribute are arranged serially in order of preference. Such an ordered arrangement is called ranking, and the ordinal number indicating the position of a given item in the ranking is called its rank. For example, the admission committee of a school prepares a list of successful candidates interviewed in order of preference based on two attributes, i.e., marks in English language and performance in sports. This case will have two sets of ranks for the two variables.
Computation of Rank Correlation Coefficient by Spearmans method:
The value of rank correlation coefficient (R) ranges between +1 and -1. It can be calculated
in three different situations:
1. When actual ranks are given.
2. When actual ranks are not given.
3. When ranks are equal.
Computation of Spearman's rank correlation coefficient:
R = 1 − 6ΣD² / [n(n² − 1)] = 1 − 6ΣD² / (n³ − n)
Where R = coefficient of rank correlation
D = differences of ranks between paired items
n = number of paired observations.
Here also the rank correlation coefficient varies between +1 and −1: +1 indicates complete agreement in the order of the ranks in the same direction, and −1 indicates complete agreement in the order of the ranks in the opposite direction.
In case the items of a series have the same value, such items should be assigned the average of the ranks they would have received had they not tied. For example, 5 students scoring marks 50, 40, 40, 35, and 30 would be ranked 1, 2.5, 2.5, 4, and 5. Here, there is a tie between the
second student and the third. Hence they are assigned the average rank (2 + 3)/2 = 2.5, but the next student still gets the fourth rank. The following formula is used when there are ties in ranks:
R = 1 − 6[ΣD² + (1/12)(m³ − m) + (1/12)(m³ − m) + ...] / (N³ − N)
Where m = number of items with common ranks. If there is more than one group of items with a common rank, the term (1/12)(m³ − m) is added as many times as there are such groups.
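A sketch combining the tie-averaging rule and the corrected formula, using the marks example above; the second series is assumed identical to the first purely to isolate the effect of the correction term:

```python
from collections import Counter

def average_ranks(values):
    # Rank in descending order; tied values share the average of the
    # positions they would otherwise occupy.
    order = sorted(values, reverse=True)
    return [sum(i + 1 for i, v in enumerate(order) if v == x) / order.count(x)
            for x in values]

def spearman(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # One correction term (m^3 - m)/12 per group of m tied values
    cf = lambda vals: sum((m ** 3 - m) / 12
                          for m in Counter(vals).values() if m > 1)
    return 1 - 6 * (d2 + cf(x) + cf(y)) / (n ** 3 - n)

marks = [50, 40, 40, 35, 30]
print(average_ranks(marks))    # [1.0, 2.5, 2.5, 4.0, 5.0]
print(spearman(marks, marks))  # 0.95, not 1, because of the tie correction
```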
Features of rank correlation:
1. The value of the coefficient of rank correlation lies between +1 and −1.
2. The sum of the differences between the corresponding ranks is zero, i.e., ΣD = 0.
3. It is independent of the nature of the distribution from which the sample data are collected.
4. It is calculated on the basis of the ranks of the individual items rather than their actual values.
5. Its result equals that of Karl Pearson's coefficient of correlation unless any rank is repeated, because Spearman's correlation is nothing more than Pearson's coefficient of correlation computed between the ranks.
Merits and demerits of Spearman's rank correlation method:
Merits:
1. In comparison to Karl Pearson's method, this method is much easier to understand and simpler to calculate.
2. This method can be applied to phenomena of a qualitative nature, viz. honesty, beauty, efficiency, etc., which can be ranked in some order.
3. The rank correlation method can also be used at times on actual quantitative data.
4. This method is not affected by extreme items.
5. This method is considered indispensable when the data are given in the form of ranks rather than their real values.
6. This method does not need the assumption that the population from which the samples are taken is normally distributed.
Demerits:
1. This method is not suitable in the case of frequency distributions, i.e. grouped data.
2. This method is not suitable when the number of pairs of the variables is large, because
the work of ranking in that case becomes very cumbersome.
3. The result obtained by this method differs from that of Pearson's method when there
are repetitions of the ranks.
4. This method is not capable of further algebraic treatment like Pearson's method.
5. This method is not based on the original values of the observations.
5. This method is not based on the original values of observations.
3. Meaning of Concurrent Correlation Coefficient:
The concurrent correlation coefficient is a measure of correlation in which only directions of
change (Positive/Negative) in the variables are taken into account.
This is a simple and crude method of studying correlation. It is based on the fact that the
curves of the variables plotted on graph paper will move in the same direction if the
correlation is positive, and in the opposite direction if the correlation is negative. To
calculate the coefficient of correlation under this method, only the direction of the
deviations is taken into consideration; the magnitude of the deviations is ignored. Further,
the deviation is recorded from the preceding value and not from any measure of central
tendency. If a value shows an increase when compared to the preceding one, it is given a +
(plus) sign. If it shows a decrease, it is given a − (minus) sign. If there is no increase or
decrease, an = sign is marked against the value.
Symbolically, r = ± √[± (2c − n) / n]
Where r = coefficient of concurrent deviations
n = the number of pairs of deviations
c = the number of concurrent deviations
If the quantity (2c − n) / n is negative, a minus sign should be placed before it and also
before the radical.
In this method also, the coefficient of correlation varies from +1 to −1: +1 denotes perfect
positive correlation, −1 indicates perfect negative correlation and 0 the absence of correlation.
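The procedure above can be sketched in a few lines of Python. The price and sales series below are made-up data whose deviations all move together, so the coefficient comes out as +1:

```python
# Coefficient of concurrent deviations: only the direction of change counts.
import math

def concurrent_deviation_coefficient(x, y):
    """r = +/- sqrt(+/- (2c - n) / n)."""
    # sign of each deviation from the preceding value: +1, -1 or 0
    dx = [(b > a) - (b < a) for a, b in zip(x, x[1:])]
    dy = [(b > a) - (b < a) for a, b in zip(y, y[1:])]
    n = len(dx)                                   # number of pairs of deviations
    c = sum(1 for a, b in zip(dx, dy) if a == b)  # concurrent deviations
    q = (2 * c - n) / n
    # a negative (2c - n)/n puts the minus sign inside and outside the radical
    return math.copysign(math.sqrt(abs(q)), q)

prices = [100, 104, 103, 107, 110, 108]
sales = [40, 42, 41, 45, 47, 46]
print(concurrent_deviation_coefficient(prices, sales))  # 1.0
```

Note that n here is one less than the number of observations, since deviations are taken from each preceding value.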
The presence of correlation between two variables does not necessarily imply cause and
effect relationship. In certain cases, correlation may be due to causation. For example, as
rainfall increases, yield also increases. So, there is correlation between rainfall and yield.
Rainfall is the cause and yield is the effect. Even in this example, it is not correct to think
that rainfall is the only factor that determines yield. There are other factors such as fertility
of the soil, quality of the seed, weather conditions, etc., which have an effect on the yield.
Each of the above factors is correlated with the yield.
Purpose of concurrent Correlation
Its main purpose is to give only a rough or crude estimate of the presence or absence of
correlation between the variables.
Properties of concurrent correlations:
1. Its value ranges from −1 to +1.
2. When c = 0, then r = −1, showing perfect negative correlation between the variables.
3. When c = n, then r = +1, indicating that perfect positive correlation exists between the
variables.
4. In case of random behaviour, r will be near zero.
Merits:
1. It is very easy to understand.
2. It is very simple to calculate irrespective of the form, size and number of the data.
3. It does not depend upon the assumption that the population from which the data have
been collected is normally distributed.
4. It gives the correct indication as to the direction of the correlation.
Demerits:
1. It does not give the correct degree of correlation as given by the methods of
Pearson and Spearman.
2. It does not differentiate between small and big changes in values. Thus, an
increase from 10 to 11 is treated in the same way as an increase from 10 to 1,920, since in
both cases only a plus sign is noted, regardless of the size of the difference.
3. It is not capable of further algebraic treatment.
4. It is not suitable for grouped data.
Regression
Meaning of Regression:
Regression is the measure of average relationship between two or more variables in terms
of the original units of the data. For example, after having established that two variables
(say sales and advertising expenditure) are correlated, one may find out the average
relationship between the two to estimate the unknown values of dependent variable (say
sales) from the known values of independent variable (say advertising expenditure)
Regression shows a relationship between the average values of two variables.
Regression is very helpful in estimating and predicting the average value of one variable
for a given value of the other variable. The estimate or prediction may be made with the
help of a regression line, which shows the average value of one variable x for a given value
of the other variable y. The best average value of one variable associated with a given value
of the other variable may also be estimated or predicted by means of an equation, and this
equation is known as the regression equation.
The term regression was coined by F. Galton in 1885 to explain the data obtained during his
study of inheritance. Galton observed the heights of offspring over a few generations of a
family and came to the conclusion that the heights of offspring tend to occupy the median
position. He expressed regression as the tendency to remain towards the central position.
Meaning of Regression Analysis:
Regression Analysis is a statistical tool to study the nature and extent of the functional
relationship between two or more variables and to estimate (or predict) the unknown values
of the dependent variable from the known values of the independent variable.
Dependent Variable (or Explained Variable): The variable which is predicted on the basis
of another variable is called the dependent variable or explained variable. The dependent
variable is usually denoted by Y.
Independent Variable (or Explanatory Variable): The variable which is used to predict
another variable is called the independent variable or explanatory variable. The independent
variable is usually denoted by X.
Example: When sales are predicted on the basis of advertising expenditure, sales is
dependent variable and advertisement expenditure is independent variable.
Note: The terms dependence and independence do not mean that there is necessarily any
cause and effect relationship between the variables.
Characteristics:
1. It consists of mathematical devices that are used to measure the average relationship
between two or more closely related variables.
2. It is used for estimating the unknown values of some dependent variable with
reference to the known values of its related independent variables.
3. It provides a mechanism for prediction or forecast of the values of one variable in terms
of the values of the other variable.
4. It consists of two regression equations, viz. (i) the equation of X on Y and (ii) the equation of Y on X.
Utilities and Limitations of Regression Analysis:
The regression analysis as a statistical tool has a number of uses or utilities for which it is
widely used in various fields relating to almost all the natural, physical and social sciences.
The specific uses or utilities of such a technique may be outlined as under:
1. It provides a functional relationship between two or more related variables with the
help of which we can easily estimate or predict the unknown values of one variable
from the known values of another variable.
2. It provides a measure of the errors of estimates made through the regression lines. A
little scatter of the observed (actual) values around the relevant regression line indicates
good estimates of the values of a variable, and a lesser degree of error involved therein.
On the other hand, a great deal of scatter of the observed values around the relevant
regression line indicates inaccurate estimates of the values of a variable and a high
degree of error involved therein.
3. It provides a measure of the coefficient of correlation between the two variables, which
can be calculated by taking the square root of the product of the two regression coefficients.
4. It provides a measure of the coefficient of determination, which speaks of the effect of
the independent variable (explanatory or regressing variable) on the dependent
variable (explained or regressed variable), and which in turn gives us an idea about the
predictive value of the regression analysis. This coefficient of determination is
computed by taking the product of the two regression coefficients. The greater the value
of the coefficient of determination (r²), the better is the fit and the more useful are the
regression equations as estimating devices.
5. It provides a formidable tool of statistical analysis in the field of business and commerce,
where people are interested in predicting future events viz. consumption, production,
investment, prices, sales, profits, etc., and the success of businessmen depends very
much on the degree of accuracy in their various estimates.
6. It provides a valuable tool for measuring and estimating the cause and effect
relationship among the economic variables that constitute the essence of economic
theory and economic life. It is highly used in the estimation of demand curves, supply
curves, production functions, cost functions, consumption functions, etc. In fact,
economists have propounded many types of production functions by fitting regression
lines to input and output data.
7. This technique is highly used in our day-to-day life and in sociological studies as well,
to estimate various factors viz. birth rate, death rate, tax rate, yield rate, etc.
Limitations
Despite the above utilities and usefulness, the technique of regression analysis suffers from
the following serious limitations:
1. It is assumed that the cause and effect relationship between the variables remains
unchanged. This assumption may not always hold good, and hence estimates of the
values of a variable made on the basis of the regression equation may lead to erroneous
and misleading results.
2. The functional relationship that is established between any two or more variables on
the basis of some limited data may not hold good if more and more data are taken into
consideration.
3. It involves very lengthy and complicated procedure of calculations and analysis.
4. It cannot be used in case of qualitative phenomenon viz. honesty, crime etc.
Type of Regression Analysis:
The regression can be of two types: simple and multiple.
1. Simple Regression:
The regression analysis confined to the study of only two variables at a time is known as
simple regression.
2. Multiple Regression:
The regression analysis for studying more than two variables at a time is known as multiple
regression.
Regression Lines and Linear Regression:
When there are two variables, there will be two regression lines: the regression line of x on y
and the regression line of y on x. When observations of the two variables are plotted on a
graph and the points so obtained fall in a straight line, the relationship is linear and there is
said to be linear regression between the variables under study. However, if the line is not a
straight line, the regression is termed non-linear regression.
When the points are plotted on a scatter diagram, the process of deciding the line of best fit
to summarize a particular set of points on the graph is called regression analysis. This is
worked out by deriving an equation called the regression equation.
Regression Equation:
A linear equation of a line obtained by applying the least squares principle is called a
regression equation. It is the mathematical form of the regression line: the equation that
describes the position of a line on a graph. For a linear regression, the equation of a
dependent variable y against an independent variable x can be given as follows:
Y = a + bX
Here, the values of a and b are constants and are fixed for a particular line. If the values of a
and b are known, y can be obtained for any corresponding value of x. The values of a and
b can be calculated by using different equations. After obtaining the value of b, the value
of a can be calculated.
The constant a is known as the intercept, and denotes the value of y when the value of x is
zero.
The constant b measures the slope of the line and is called the regression coefficient. The
constant b gives an idea of how the variable y changes when the variable x varies by 1 unit.
For instance, if the value of b is 5.8, then a change in x by one unit will bring about a change
in y by 5.8 units. A positive value of b indicates that an increase in the value of y is
associated with an increase in the value of x, while a negative value indicates a decrease in
y with an increase in x.
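The slope interpretation above can be verified numerically. The intercept and slope below are hypothetical values, with b = 5.8 as in the worked example:

```python
# With an assumed line y = a + bx where b = 5.8, a unit step in x
# changes y by exactly b, whatever the starting point of the step.
a, b = 2.0, 5.8            # hypothetical intercept and slope

def y(x):
    return a + b * x

print(round(y(4) - y(3), 6))   # 5.8: one-unit increase in x raises y by b
```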
There are two regression equations:
In case of simple linear regression model (i.e. when there is only one independent variable
and there is linear relationship between the dependent variable and independent variable),
there are two regression lines as follows:
1. Regression line of X on Y and
2. Regression line of Y on X.
Method I: Least squares method.
Computation of regression equations:
By using the regression equations Y = a + bX and X = a + bY, it is possible to fit two
regression lines for the given data. Regression equations can be computed by determining
the values of a and b in the following normal equations:
Regression of X on Y:
The regression equation is X = a + bY.
Where, X = dependent variable, Y = independent variable, a = X intercept (i.e. the value of the
dependent variable when the value of the independent variable is zero),
b = slope of the said line (i.e. the amount of change in the value of the dependent variable
per unit change in the independent variable).
The values of the two constants a and b can be calculated for the given data of the X and Y
variables by solving the following normal equations:
ΣX = Na + bΣY ..........(i)
ΣXY = aΣY + bΣY² ..........(ii)
Regression of Y on X:
The regression equation is Y = a + bX.
Where, X = independent variable, Y = dependent variable,
a = Y intercept (i.e. the value of the dependent variable when the value of the independent
variable is zero),
b = slope of the said line (i.e. the amount of change in the value of the dependent variable
per unit change in the independent variable).
The values of the two constants a and b can be calculated for the given data of the X and Y
variables by solving the following normal equations:
ΣY = Na + bΣX ..........(i)
ΣXY = aΣX + bΣX² ..........(ii)
Steps:
1. Summate the items in the X series (ΣX).
2. Summate the items in the Y series (ΣY).
3. Square the items in both the series and summate (ΣX² and ΣY²).
4. Multiply the items in the X series with their corresponding items in the Y series and
summate the products (ΣXY).
5. Count the number of pairs in the series (N).
6. Substitute the above values in the equations.
7. Equalize the equations.
8. Subtract the new (third) equation from the second equation and obtain the values of a and b.
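The steps above can be sketched in plain Python for the regression of Y on X; solving the two normal equations simultaneously gives closed-form expressions for b and a. The data below are made-up:

```python
# Fitting Y = a + bX by solving the normal equations:
#   sum(Y)  = Na      + b*sum(X)        ...(i)
#   sum(XY) = a*sum(X) + b*sum(X^2)     ...(ii)

def fit_y_on_x(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    # eliminating a between (i) and (ii) yields b, then (i) yields a
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]            # lies exactly on y = 1 + 2x
print(fit_y_on_x(x, y))     # (1.0, 2.0)
```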
Method II: Direct Method.
Regression of X on Y:
X − X̄ = bxy (Y − Ȳ)
Where X̄ = the mean of X
Ȳ = the mean of Y
bxy = regression coefficient of x on y.
Regression of Y on X:
Y − Ȳ = byx (X − X̄)
The above equations provide not only values for plotting regression lines, but also a
mathematical method of finding out the most probable value of X for a given value of Y
and the most probable value of Y for a given value of X.
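The direct method can be sketched as a short prediction routine: compute byx from deviations about the means, then use Y − Ȳ = byx (X − X̄) to estimate Y for a new X. The series below are illustrative:

```python
# Predicting the most probable Y for a given X via the direct method.

def predict_y(x_new, x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    # byx = sum (X - Xbar)(Y - Ybar) / sum (X - Xbar)^2
    byx = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
           / sum((a - xbar) ** 2 for a in x))
    return ybar + byx * (x_new - xbar)

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
print(predict_y(5, x, y))   # 11.0
```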
Regression coefficients:
Regression coefficient is a numerical measure which shows the probable change in the value
of the dependent variable for a unit change in the value of the independent variable. The
coefficient of regression gives the value by which one variable changes for a unit change in
the other variable.
The coefficient of regression of X on Y gives the value by which X variable changes for a
unit change in Y.
The coefficient of regression Y on X, gives the value by which Y variable changes for a unit
change in X.
Computation of regression coefficients:
S.No  Regression coefficient of x on y                Regression coefficient of y on x
1.    bxy = Cov(x, y) / σy²                           byx = Cov(x, y) / σx²
2.    bxy = r (σx / σy)                               byx = r (σy / σx)
3.    bxy = Σ(X − X̄)(Y − Ȳ) / (n σy²)                byx = Σ(X − X̄)(Y − Ȳ) / (n σx²)
4.    bxy = Σ(X − X̄)(Y − Ȳ) / Σ(Y − Ȳ)²             byx = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
5.    bxy = (nΣxy − ΣxΣy) / (nΣy² − (Σy)²)           byx = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
6.    bxy = (nΣdxdy − ΣdxΣdy) / (nΣdy² − (Σdy)²)     byx = (nΣdxdy − ΣdxΣdy) / (nΣdx² − (Σdx)²)
Here σx and σy are the standard deviations of x and y, and dx, dy in formula 6 are deviations
taken from assumed means.
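Formula 5 of the table can be put into code directly, and it also lets us check the identity bxy × byx = r² that appears among the properties below. The data are made-up:

```python
# Regression coefficients via formula 5 (raw sums), plus the r^2 identity.

def regression_coefficients(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    num = n * sxy - sx * sy          # shared numerator of both coefficients
    bxy = num / (n * sy2 - sy ** 2)
    byx = num / (n * sx2 - sx ** 2)
    return bxy, byx

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
bxy, byx = regression_coefficients(x, y)
print(bxy * byx)            # 0.6, the coefficient of determination r^2
```

Here bxy = 1.0 and byx = 0.6, so r = √0.6 ≈ 0.775, taking the common (positive) sign of the coefficients.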
Properties of the Linear Regression:
1. There are two regression equations: the regression equation of X on Y,
X − X̄ = bxy (Y − Ȳ), and the regression equation of Y on X, Y − Ȳ = byx (X − X̄).
2. The product of the two regression coefficients is equal to the square of the correlation
coefficient: bxy × byx = r².
3. The regression coefficients and the correlation coefficient will have the same signs. If the
coefficient of correlation is 0, the regression coefficients bxy and byx will also be 0.
4. The regression lines always intersect at their arithmetic means.
5. The slopes of the regression line of X on Y and the regression line of Y on X are
respectively bxy and byx.
6. The angle between the two regression lines depends on the correlation coefficient.
Value of r and the angle between the regression lines:
(i) If r = 0, the regression lines are perpendicular to each other.
(ii) If r = +1 or −1, the regression lines coincide (i.e. become identical).
As the value of r increases numerically from 0 to 1, the angle between the regression lines
decreases from 90° to 0°. In other words, the farther the two regression lines are from
each other, the lesser is the degree of correlation; and the nearer the two regression lines
are to each other, the higher is the degree of correlation.
7. The value of X or Y can be estimated from the linear equations if r ≠ 0.
Properties of the Regression coefficients:
1. The two regression coefficients will have the same signs; both of them will be either
positive or negative.
2. If the regression coefficients are positive, the correlation will also be positive, and if the
regression coefficients are negative, the correlation will also be negative.
3. If one of the regression coefficients is greater than unity, then the other is less than
unity; both of them cannot be greater than unity.
4. The arithmetic mean of the regression coefficients is greater than the correlation coefficient.
5. The correlation coefficient is the geometric mean of the regression coefficients:
r = √(bxy × byx)
6. Regression coefficients are independent of change of origin but not of scale.
7. When byx = bxy, the coefficient of correlation will be equal to the regression coefficients:
r = bxy = byx
8. Both regression coefficients will have the same sign, i.e., either both are positive or both
are negative.
9. The sign of the correlation coefficient is the same as that of the regression coefficients.
Standard Error of Estimate:
The regression lines or equations relating the two variables are nothing but lines or
equations of estimates. With these equations or lines we estimate the best probable value of
one variable, say x, on the basis of some given value of the other variable, say y. The
estimates may or may not be exact or accurate. The difference between the exact values and
the estimated values is known as error. The probable amount of error expected in the
estimates is called the standard error of estimate. The standard error of estimate is the
dispersion about an average line called the regression line. It indicates how precise the
prediction of y based on x, or of x based on y, is. It is calculated as follows:
1. Standard error of X values from Xc: Sxy = √[Σ(X − Xc)² / n]
2. Standard error of Y values from Yc: Syx = √[Σ(Y − Yc)² / n]
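The second formula can be sketched in Python: fit the regression line of Y on X, compute the estimated values Yc, and take the root mean square of the residuals. The data are illustrative; a perfectly linear series gives a standard error of zero:

```python
# Standard error of estimate Syx = sqrt(sum (Y - Yc)^2 / n).

def standard_error_yx(x, y):
    n = len(x)
    # least-squares fit of Y = a + bX
    b = ((n * sum(a * c for a, c in zip(x, y)) - sum(x) * sum(y))
         / (n * sum(a * a for a in x) - sum(x) ** 2))
    a0 = (sum(y) - b * sum(x)) / n
    yc = [a0 + b * v for v in x]                      # estimated (Yc) values
    return (sum((yi - yci) ** 2 for yi, yci in zip(y, yc)) / n) ** 0.5

print(standard_error_yx([1, 2, 3], [2, 4, 6]))        # 0.0: a perfect fit
```

A small standard error means the observed points cluster tightly around the regression line, i.e. the estimates are precise.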
Explained and Unexplained Variation:
The variation of a variable is the sum of the squares of the deviations of its values from its
arithmetic mean. Symbolically, it is represented by:
1. For variable X, the total variation is: Σx² = Σ(X − X̄)²
2. For variable Y, the total variation is: Σy² = Σ(Y − Ȳ)²
Unexplained variation:
Unexplained variation = total variation − explained variation
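The split of total variation into explained and unexplained parts can be checked numerically: the explained variation is that of the fitted values Yc about Ȳ, and the unexplained variation is that of Y about Yc. The data below are made-up:

```python
# Splitting the total variation of Y around a fitted line of Y on X.

def variation_split(x, y):
    n = len(x)
    # least-squares fit of Y = a + bX
    b = ((n * sum(a * c for a, c in zip(x, y)) - sum(x) * sum(y))
         / (n * sum(a * a for a in x) - sum(x) ** 2))
    a0 = (sum(y) - b * sum(x)) / n
    yc = [a0 + b * v for v in x]              # estimated (trend) values
    ybar = sum(y) / n
    total = sum((v - ybar) ** 2 for v in y)
    explained = sum((v - ybar) ** 2 for v in yc)
    unexplained = sum((v - w) ** 2 for v, w in zip(y, yc))
    return total, explained, unexplained

total, explained, unexplained = variation_split([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(total, 6), round(explained, 6), round(unexplained, 6))  # 6.0 3.6 2.4
```

Note that explained/total = 3.6/6 = 0.6, which is exactly the coefficient of determination r² for this data.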
Differences between Regression Analysis and Correlation Analysis:
1. Correlation is the relationship between two or more variables which vary in sympathy
with each other, in the same or the opposite direction. Regression is a mathematical
measure showing the average relationship between the two variables.
2. In correlation, both the variables, i.e. x and y, are random variables. In regression, x is a
random variable and y is fixed; sometimes both the variables may be random variables.
3. Correlation finds out the degree of relationship between two variables, not the cause and
effect of the variables. Regression indicates the cause and effect relationship between the
variables.
4. Correlation is used for testing and verifying the relationship between two variables.
Regression is used for prediction of one value in respect of the other given value.
5. The coefficient of correlation is a relative measure; the range of the relationship lies
between −1 and +1. The regression coefficient is an absolute figure; if we know the value
of the independent variable, we can find the value of the dependent variable.
6. Correlation has limited application, because its study is confined only to linear variables.
Regression has wide application, because it studies the relationship between linear and
non-linear variables.
7. If the coefficient of correlation is positive, then the two variables are positively correlated,
and vice versa. The regression coefficient explains whether a decrease in one variable is
associated with an increase in the other variable.