
Correlation and Regression

Module 3
• Correlation is used to study the degree of relationship among
two or more variables.
• On the other hand, Regression technique is used to estimate
the value of one variable for a given value of another.
In practice, we come across many situations that call for statistical analysis of one or more variables.
 The data concerned with one variable only is called univariate
data. For eg: price, income, demand, production, weight,
height, marks etc. are concerned with one variable only. The
analysis of such data is called univariate analysis.
 The data concerned with two variables are called bivariate
data. For eg: rainfall & agriculture; income & consumption;
price & demand; height & weight etc. The analysis of these
two sets of data is called bivariate analysis.
 The data concerned with three or more variables are called
multivariate data. For eg: Agriculture production is
influenced by rainfall, quality of soil, fertilisers used etc. The
analysis of three or more variables is called multivariate
analysis.
Correlation Analysis
Definition: Two or more variables are said to be correlated if the
change in one variable results in a corresponding change in the
other variable.
According to Simpson and Kafka, “Correlation analysis deals with the association between two or more variables”.
Ya-Lun Chou defines it thus: “Correlation analysis attempts to determine the degree of relationship between variables”.
Boddington states that “Whenever some definite connection exists between two or more groups or classes of series of data, there is said to be correlation”.
Correlation Coefficient
Correlation analysis is actually an attempt to find a numerical value to express the extent of the relationship that exists between two or more variables.
The numerical measure showing the degree of correlation between two or more variables is called the correlation coefficient.

Correlation coefficient ranges between -1 and +1.


Significance of correlation analysis
1. Correlation analysis helps us to find a single figure to measure
the degree of relationship existing between the variables.
2. Correlation analysis helps to understand the economic
behaviour.
3. Correlation analysis enables the business executives to
estimate cost, price and other variables.
4. Correlation analysis can be used as a basis for the study of
regression. Once we know that two variables are closely
related, we can estimate the value of one variable if the value
of other variable is known.
5. Correlation analysis helps to reduce the range of uncertainty associated with decision making. Predictions based on correlation analysis tend to be close to reality.
6. Studies of the inter-relationship between variables can be used as a tool for promoting research and opening new areas of knowledge.
Types of Correlation : The different types of correlation are:
Positive & Negative
Simple , Partial and Multiple
Linear and Non linear
Positive Correlation: A positive correlation is a relationship between two variables in which both variables move in the same direction; i.e. an increase in the value of one variable results in an increase in the value of the other variable, and a decrease in the value of one variable results in a decrease in the value of the other.
Negative correlation: If the values of two variables move in opposite directions, so that an increase in the value of one variable results in a decrease in the value of the other, or a decrease in the value of one variable results in an increase in the value of the other, the correlation is said to be negative. An example would be height above sea level and temperature.
Zero Correlation: A zero correlation exists when there is no
relationship between two variables.
For eg: there is no relationship between the amount of tea drunk
and level of intelligence.
Simple Correlation: In the study of relationship between
variables, if there are only two variables, the correlation is said
to be simple. For eg: the correlation between price and demand
is simple.
Multiple Correlation: When we measure the degree of
association between one variable on one side and all the other
variables(two or more) together on the other side, the
correlation is said to be multiple. For Eg: the relationship
between yield with both rainfall and temperature is multiple
correlation.
Partial Correlation: In partial correlation, we study the
relationship of one variable with one of the other variables
presuming that the other variables remain constant. For eg: let
there be three variables yield, rainfall and temperature. Each is
related with the other. Then the relationship between yield and
rainfall (assuming temperature is constant) is the partial
correlation.
Linear Correlation: When the amount of change in one variable
leads to a constant ratio of change in the other variable,
correlation is said to be linear. When there is linear correlation,
the points plotted on a graph will give a straight line.
Non Linear Correlation: When the amount of change in one
variable is not in constant ratio to the change in the other
variable, correlation is said to be non linear or curvilinear. In the
case of curvilinear correlation, the ratio of change fluctuates and
is never constant.
Methods of studying correlation (Measures of correlation)
1. Graphic Method :
a) Scatter diagram
b) Correlation graph
2. Algebraic Method :
a) Karl pearson’s coefficient of correlation
b) Spearman’s Rank correlation coefficient
c) Concurrent deviation method
d) Method of least squares
Scatter diagram method
• One of the variables is shown on the X axis and the other on the Y axis.
• Each pair of values is plotted on the graph with dot marks.
• After all the items are plotted , we get as many dots on the graph
paper as the number of pairs.
• If these plotted points show some trend either upward or
downward, the two variables are said to be correlated.
• If the plotted points do not show any trend, then the two
variables are not correlated.
• If the trend is upward – Positive correlation
• If the trend is downward – Negative correlation.
Merits:
1. It is easy to plot.
2. It can be easily understood and interpreted.
3. It is a non-mathematical method, and the values of extreme items do not affect it; such points stand out as isolated dots in the diagram.
4. First step for investigating the relationship between two
variables.
Demerits:
1. The degree of correlation cannot be easily estimated.
2. Algebraic treatment is not possible. This chart does not show
the relationship for more than two variables.
3. When the number of pairs of observations is either very big
or very small, the method is not easy.
4. We cannot establish the degree of relationship exactly
between two variables.
Correlation Graph
• This is an extension of linear graphs. In this case two or more
variables are plotted on graph paper.
• Under this method, separate curves are drawn for the X
variable and Y variable on the same graph paper.
• If both the curves move in the same direction (upward or
downward) , correlation is said to be positive. If the curves are
moving in the opposite direction correlation is said to be
negative.
Merits
• It is easy to understand and simple to use.
• Relation between two variables can be studied in a non
mathematical way.
Demerits
• Being a non-mathematical method, its results are approximate and inexact.
• It gives only an approximate idea of the relationship.
Concurrent deviation method
• It is the simplest method of calculating correlation.
• It is used to know the directional changes between two
variables.
• It is suitable only when the variables show short-term fluctuations.
• It lies between -1 and +1
• Under this method the nature of correlation is known from the
direction of change in the values of variable.
• If the deviations of the two variables are concurrent then they
move in the same direction, otherwise in the opposite direction.
r = ± √( ± (2C − N) / N )

Where:
N = No. of pairs of deviations compared
C = No. of concurrent deviations (i.e. no. of + signs in the Dx·Dy column)
Method of calculation
1. Increase in the value is denoted by +sign and decrease by -
sign.
2. For the first value, the increase or decrease is unknown.
3. C stands for the number of + signs in Dx*Dy i.e. the
concurrent deviations.
4. N is the no. of pairs compared, i.e. the no. of +signs and –
signs in the Dx* Dy column.
5. When 2C > N, r is positive
6. When 2C < N, r is negative
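The steps above can be sketched in Python; the series values are made up for illustration, not taken from the text.

```python
import math

def concurrent_deviation_r(x, y):
    # Deviations: +1 for an increase over the previous value, -1 for a decrease.
    dx = [1 if b > a else -1 for a, b in zip(x, x[1:])]
    dy = [1 if b > a else -1 for a, b in zip(y, y[1:])]
    n = len(dx)                                       # no. of pairs compared
    c = sum(1 for a, b in zip(dx, dy) if a * b > 0)   # concurrent deviations (+ signs in Dx*Dy)
    inner = (2 * c - n) / n
    # The sign outside the square root follows the sign of (2C - N) / N.
    return math.copysign(math.sqrt(abs(inner)), inner)

# Both series rise and fall together here, so r = +1.
print(concurrent_deviation_r([10, 12, 15, 14, 18], [20, 22, 27, 24, 30]))
```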
Mean: Add up all the data points and then divide by the total no. of observations.
Median: The middle value, the midpoint of the data when
arranged in order.
Mode: The value that appears the most often.
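As a quick sketch in Python (the data values are made up for illustration), the three averages can be computed as:

```python
import statistics

data = [4, 2, 7, 4, 9, 4, 6]         # illustrative values only

mean = sum(data) / len(data)          # add all points, divide by the count
median = statistics.median(data)      # midpoint of the sorted data
mode = statistics.mode(data)          # most frequent value

print(mean, median, mode)
```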
Population and Sample variance
• When you have collected data from every member of the
population that you are interested, you can get an exact value
for population variance.
• When you collect data from a sample, the sample variance is
calculated and used to make estimates or inferences about the
population variance.
Population Variance

σ² = Σ(X − μ)² / N

Where,
σ² = Population variance
X = Variable
μ = Population mean
N = Population size.
Sample Variance

S² = Σ(X − X̅)² / (n − 1)

Where,
S² = Sample variance
X = Variable
X̅ = Sample mean
n = Sample size
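A minimal Python sketch of the two variance formulas, using made-up sample values:

```python
data = [12, 15, 9, 14, 10]               # illustrative values only
n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations from the mean

population_variance = ss / n             # divide by N: data cover the whole population
sample_variance = ss / (n - 1)           # divide by n - 1: estimating from a sample
```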
Standard deviation & Variance
• SD is a measure of how spread out numbers are
• SD measures the dispersion of a data set relative to its mean
• Its symbol is ‘σ’ (sigma)
• It is the square root of variance
SD = √Variance
• The variance is a measure of variability. It is calculated by
taking the average of squared deviations from the mean.
• Variance tells you the degree of spread in your data set. The
more spread the data, the larger the variance is in relation to
the mean.
The characteristics of Range, Variance and SD
• The more the data are spread out or dispersed, the larger the
range, variance and SD.
• The more the data are concentrated or homogeneous, the
smaller the range, variance and SD.
• If the values are all the same (so that there is no variation in the data), the range, variance and SD are equal to zero.
• None of the measures of variation (range, variance and SD) can ever be negative.
Coefficient of Variation
Coefficient of variation is a relative measure of variation that is
always expressed as a percentage rather than in terms of the
units of the particular data. The coefficient of variation, denoted
by the symbol ‘CV’, measures the scatter in the data relative to
the mean.
CV = (S / X̅) × 100
S = Sample standard deviation
X = Sample mean
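Continuing with illustrative values, the sample SD and CV can be computed together:

```python
import math

data = [12, 15, 9, 14, 10]               # illustrative sample
n = len(data)
mean = sum(data) / n

# Sample standard deviation: square root of the sample variance (n - 1 denominator).
s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

cv = s / mean * 100                      # coefficient of variation, as a percentage
```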
Karl Pearson’s Coefficient of Correlation
Karl Pearson, the great biologist and statistician has given a formula
for calculation of coefficient of correlation, popularly known as
Pearsonian coefficient of correlation and is denoted by the symbol
‘r’. The formula for computing the pearsonian coefficient of
correlation is
r = Σxy / (N σx σy)
Where, x = deviation of X from X̅, y = deviation of Y from Y̅
σx = SD of X series, σy = SD of Y series
N = No. of pairs of observations.
This is also known as product moment correlation coefficient.
This method is to be applied only where deviations of items are
taken from actual means and not from assumed means.
The above formula for computing Pearsonian coefficient of
correlation can be transformed to the following form which is
easier to apply.
r = Σxy / √(Σx² × Σy²)
Direct method for calculating correlation coefficient
Correlation coefficient can also be calculated without taking
deviations of items either from actual mean or assumed mean.
The standard formula in such a case is:
r = (N ΣXY − ΣX ΣY) / √[(N ΣX² − (ΣX)²) (N ΣY² − (ΣY)²)]
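The direct method translates straightforwardly into Python; the `pearson_r` name and the data values are illustrative:

```python
import math

def pearson_r(x, y):
    # Direct method: works on the raw values; no deviations from the mean needed.
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
    return num / den

# y is exactly 2x, so the correlation is perfectly positive: 1.0
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```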
Degree of Correlation (Interpretation of ‘r’)
The degree or the intensity of the relationship between two
variables can be ascertained by finding the value of coefficient of
correlation. The degree of correlation can be classified into:
1. Perfect Correlation: When the change in the two variables is
such that with an increase in the value of one, the value of
the other increases in a fixed proportion, correlation is said
to be perfect. Perfect correlation may be positive or
negative. Coefficient of correlation is +1 for perfect positive
correlation and it is -1 for perfect negative correlation.
2. No correlation: If changes in the value of one variable are
not associated with changes in the value of the other
variable, there will be No correlation. When there is no
correlation, the coefficient of correlation is Zero.
3. Limited degree of correlation: In between perfect
correlation and no correlation there may be limited degree
of correlation. Limited degree of correlation may also be
positive or negative. Limited degree of correlation may be
termed as high, moderate or low. For limited degree of
correlation the coefficient of correlation lies between 0 and 1
numerically.
Properties of Correlation Coefficient
1. Correlation coefficient has a well defined formula.
2. Correlation coefficient is a pure number and is independent
of the units of measurement.
3. It lies between -1 and +1.
4. Correlation coefficient is independent of change of origin and change of scale.
5. Correlation coefficient between x and y is same as that
between y and x.
Probable Error
• Probable error (PE) of the coefficient of correlation is a
statistical measure which measures the reliability and
dependability of the value of coefficient of correlation.
• It is mainly used for interpretation and determination of
limits.
• Probable error is used to interpret whether ‘r’ is significant or
not.
• If |r| < PE, there is no evidence of correlation
• If |r| > PE, there is correlation
• If |r| > 6 PE, the correlation is significant.
Formula for finding PE
PE = 0.6745 × (1 − r²) / √n
Where, ‘r’ = coefficient of correlation
‘n’ = No. of pairs of observation.
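A small Python sketch of the PE test, with illustrative values r = 0.8 and n = 25:

```python
import math

def probable_error(r, n):
    # PE = 0.6745 * (1 - r^2) / sqrt(n)
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

r, n = 0.8, 25
pe = probable_error(r, n)
# Here r (0.8) exceeds 6 * PE (about 0.29), so the correlation is significant.
```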
Standard Error
• The standard error of the correlation coefficient measures the variability of ‘r’
• Standard error is denoted by SE
Formula for finding SE is
SE = (1 − r²) / √n
Spearman’s Rank Correlation
• Karl Pearson’s coefficient is applicable when variables are
measured in quantitative form. But in many cases
measurement is not possible because they are in qualitative
form.
• For example, we cannot measure the beauty or intelligence
quantitatively. But it may be possible, in their case, to rank the
individuals in some order.
• The correlation coefficient obtained from the ranks so obtained
is called rank correlation.
• Therefore, rank correlation is the correlation obtained from
ranks, instead of their quantitative measurement.
• Thus when the values of two variables are expressed in ranks
and therefrom correlation is obtained, that correlation is
known as rank correlation.
• Spearman has devised a formula known as Spearman’s rank
correlation coefficient to find the correlation coefficient from
the ranks.
According to spearman’s method, the formula for Rank
Correlation coefficient is:
r = 1 − (6 ΣD²) / (n(n² − 1))
Where, ‘D’ is the difference between ranks and ‘n’ number of
items.
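Spearman's formula can be sketched in Python; the `ranks` helper below is a simplified, hypothetical implementation that does not handle ties:

```python
def ranks(values):
    # Rank 1 goes to the largest value; ties are NOT handled in this sketch.
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_r(x, y):
    n = len(x)
    # D is the difference between the two ranks of each item.
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Rankings that are exact opposites give r = -1.
print(spearman_r([10, 20, 30, 40], [8, 6, 4, 2]))
```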
Equal/ Repeated Ranks (Tie in Rank)
When the values repeat in one or both the series i.e. x and y,
Rank correlation coefficient is obtained using the formula:
r = 1 − 6[ΣD² + Σ(m³ − m)/12] / (n(n² − 1))
Where, ‘m’ stands for the no. of times each value repeats (the no. of equal ranks).
• While assigning ranks, if two or more items have equal values (i.e. if a tie occurs), they may each be given the mid rank.
• Thus if two items tie for the fifth rank, each may be ranked (5+6)/2 = 5.5, and the next item in the order of size would be ranked seventh.
Merits and Demerits of Rank Correlation
Merits
1. It is easy to calculate
2. It is simple to understand
3. It can be applied to both quantitative and qualitative data.
Demerits
1. Rank correlation coefficient is only an approximate measure, as the actual values are not used.
2. It is not convenient when ‘n’ is large.
3. Further algebraic treatment is not possible.
Partial Correlation
• Suppose there are many variables and we want to study
relationship between only two of them, then we have partial
correlation.
• In partial correlation we consider only two variables; the others are treated as constant or having no effect, and so are ignored.
• For eg: consider three variables; yield, rainfall and
temperature. Here the correlation between yield and rainfall
treating temperature as normal, is partial correlation.
Partial Correlation Coefficient
Partial correlation coefficient measures the relationship between
one variable and one of the other variables assuming that the
effect of the rest of the variables is eliminated.
Let X1, X2 and X3 be three variables. Then r12.3 is the partial correlation coefficient between X1 and X2, treating X3 as constant or normal. Similarly we have r13.2 and r23.1.
These partial correlation coefficients can be computed from the simple correlation coefficients as shown below:
r12.3 = (r12 − r13 r23) / √[(1 − r13²)(1 − r23²)]
r13.2 = (r13 − r12 r23) / √[(1 − r12²)(1 − r23²)]
r23.1 = (r23 − r12 r13) / √[(1 − r12²)(1 − r13²)]
Where r12, r13 and r23 respectively stand for the simple coefficients of correlation between X1 & X2, X1 & X3, and X2 & X3.
r12.3 & r21.3 are the same;
r13.2 & r31.2 are the same; likewise
r23.1 & r32.1 are the same.
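The partial correlation coefficient r12.3 can be sketched in Python; the function name and the coefficient values are illustrative:

```python
import math

def partial_r(r12, r13, r23):
    # r12.3: correlation between variables 1 and 2, holding variable 3 constant.
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Illustrative simple correlations: yield-rainfall 0.8, yield-temperature 0.5,
# rainfall-temperature 0.4.
print(partial_r(0.8, 0.5, 0.4))
```

If the third variable is uncorrelated with both others (r13 = r23 = 0), the partial correlation reduces to the simple correlation r12, as expected.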
Multiple Correlation
When there are many variables and we want to study relation of
one variable with all the other variables taken together, the
correlation obtained is called Multiple correlation.
For example, if the variables are yield, rainfall and temperature
and we want to study the relation of yield with both rainfall and
temperature taken together we find the multiple correlation.
So in multiple correlation one variable is on one side and all
other variables together on the other side.
Multiple Correlation Coefficient
If X1, X2 and X3 are three variables, then the coefficient of multiple correlation between X1 on one hand and X2 and X3 together on the other hand is denoted by R1.23. The three multiple correlation coefficients are:
R1.23 = √[(r12² + r13² − 2 r12 r13 r23) / (1 − r23²)]
R2.13 = √[(r12² + r23² − 2 r12 r13 r23) / (1 − r13²)]
R3.12 = √[(r13² + r23² − 2 r12 r13 r23) / (1 − r12²)]
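The multiple correlation coefficient R1.23 can be sketched in Python (the function name is illustrative):

```python
import math

def multiple_R(r12, r13, r23):
    # R1.23: correlation of variable 1 with variables 2 and 3 taken together.
    return math.sqrt((r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2))

# When the two predictors are uncorrelated (r23 = 0), R reduces to
# sqrt(r12^2 + r13^2).
print(multiple_R(0.6, 0.8, 0.0))
```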
Regression Analysis
• The estimation or prediction of future production, consumption, prices, investments, sales, profits, income etc. is of paramount importance to a businessman or an economist.
• Regression analysis is one of the very scientific techniques for
making such predictions.
Definition: Regression analysis, in the general sense, means the
estimation or the prediction of the unknown value of one
variable from the known value of the other variable. It is a
statistical device used to study the relationship between two or
more variables that are related.
In the words of MM Blair “ Regression analysis is a mathematical
measure of the average relationship between two or more
variables in terms of the original units of the data”
Dependent and Independent Variables
In regression analysis there are two types of variables. The
variable whose value is influenced or is to be predicted is called
dependent variable and the variable which influences the values
or is used for prediction, is called independent variable.
Types of regression analysis
On the basis of number of variables – Simple and Multiple
On the basis of proportion of change in variable – Linear and
Non- linear.
Simple and Multiple regressions
When there are only two variables the regression equation
obtained is called simple regression equation.
In multiple regression analysis there are more than two variables
and we try to find out the effect of two or more independent
variables on one dependent variable.
Let X,Y and Z be three variables. Let X and Y be the independent
variables and Z be depending on them. Then we use multiple
regression analysis to study the relative movement of Z, for a
unit movement in X and Y.
For Eg: Suppose there are three variables: yield, rainfall and temperature, and yield depends on rainfall and temperature. Then we get the regression equation of Z on X and Y, where Z is yield, X is rainfall and Y is temperature.
Linear and Non –Linear Regression
• On the basis of proportion of changes in the variables, the
regression can be classified into Linear and Non-Linear
regressions.
• If the given bivariate data are plotted on a graph, the points
so obtained on the scatter diagram will more or less
concentrate around a curve called ‘curve of regression’.
• If the regression curve is a straight line, we say that there is
linear regression between the variables under study. The
equation of such a curve is the first degree equation in the
variables x and y.
• Mathematically, the relation between x and y in a linear
regression, can be expressed in the form, y = a + bx.
• In a linear regression, the change in the dependent variable is
proportionate to the changes in the independent variable.
• If the curve of regression is not a straight line, then the
regression is termed as curved or Non-Linear regression. The
regression equation in such cases is not of first degree. In
this case the dependent variable does not change by a
constant amount of change in the independent variable.
Line of best fit (Regression lines)
• When the given bivariate data are plotted on a graph, we get
the scatter diagram. If the points of the scatter diagram
concentrate around a straight line, that line is called the line
of best fit. The line of best fit is that line which is closer to
the points of the scatter diagram.
• This line is also known as Regression line.
• So, a regression line is a graphic technique to show the
functional relationship between the dependent and the
independent variables. It shows average relationship
between the variables.
Method of drawing regression lines- Free hand curve method
Under this method, the original data are plotted on graph paper. Usually the original data, when plotted, give a wave-like curve, but it depicts the general tendency of the data.
Independent variable is taken along the horizontal axis and
dependent variable along the vertical axis.
We draw smooth free hand line in such a way that it clearly
indicates the tendency of the original data.
This line is fitted by inspection. Care is taken that the line is
drawn in such a way that the area of the curve below and above
the line are approximately equal.
Two Regression lines
While estimating the value of ‘y’ for any given value of ‘x’, we
take y as dependent variable and x as independent variable.
Then we get the line of regression of y on x.
Similarly for estimating x for any given value of y, we use the
regression of x on y. Here x is dependent variable and y is
independent variable.
Thus there are two regression lines.
Regression Equations: Regression equation is a mathematical
relation between the dependent and independent variables.
There are two regression lines and hence there are two
regression equations; Regression equation of y on x and
Regression equation of x on y
Regression equation of y on x
y – y̅ = byx (x – x̅)
Where, byx = r (σy / σx) = Σxy / Σx²
Regression equation of x on y
x – x̅ = bxy (y – y̅)
Where, bxy = r (σx / σy) = Σxy / Σy²
Relationship between correlation coefficient and Regression
coefficient.
byx = r (σy / σx)
bxy = r (σx / σy)
r = ± √(byx × bxy)
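The relationship between the regression coefficients and ‘r’ can be verified numerically; the data values below are made up for illustration:

```python
import math

def regression_coefficients(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    byx = sxy / sum((a - mx) ** 2 for a in x)   # regression coefficient of y on x
    bxy = sxy / sum((b - my) ** 2 for b in y)   # regression coefficient of x on y
    return byx, bxy

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]                            # y = 2x + 1, so the fit is exact
byx, bxy = regression_coefficients(x, y)

# r carries the common sign of byx and bxy; here byx * bxy = 1, so r = 1.
r = math.copysign(math.sqrt(byx * bxy), byx)
```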
Distinction between correlation and Regression
1. In correlation analysis we study degree of relationship between the
variables whereas in regression analysis we study the nature of
relationship between the variables so that we may be able to
predict the value of one on the basis of another.
2. Correlation is merely a tool of ascertaining the degree of
relationship between two variables and therefore, we cannot say
that one variable is the cause and other the effect. However, In
regression analysis, one variable is taken as dependent while the
other as independent , thus making it possible to study the cause
and effect relationship.
3. Correlation analysis is not for the purpose of prediction whereas the
regression analysis is basically used for prediction purposes.
4. There may be nonsense correlation between two variables
which is purely due to chance and has no practical relevance
such as increase in income and increase in weight of a group
of people. However, there is nothing like nonsense
regression.
5. Correlation analysis measures the direction and degree of the linear relationship between two variables X and Y. It is immaterial which of X and Y is the dependent variable and which is the independent variable (rxy = ryx). In regression analysis the regression coefficients byx and bxy are not equal, and hence it definitely makes a difference as to which variable is dependent and which is independent.
Usefulness of the study of regression
• Regression analysis is a branch of statistical theory that is widely
used in almost all the statistical disciplines.
• In economics it is the basic technique for measuring or estimating
the relationship among economic variables that constitute the
essence of economic theory and economic life. For example, if
we know that two variables price(X) and demand (Y) are closely
related we can find out the most probable value of X for a given
value of Y or the most probable value of Y for a given value of X.
• Regression analysis is widely used in business. The success of
every businessman depends upon his correct estimation about
future production, sales, profit etc.
• The utility of regression is high in physical sciences where the
data are generally in functional relationship. Therefore, it is
always possible to exactly calculate the value of one variable
for a given value of the other variable by studying their
regression.
• With the help of regression coefficients we can calculate the
correlation coefficient. The square of correlation coefficient,
called coefficient of determination, measures the degree of
association or correlation that exists between the two
variables. It assesses the proportion of variance in the
dependent variable that has been accounted for by the
regression equation.
The Properties of regression lines
1. The two lines intersect at (x̅, y̅)
2. When r = 1, the two lines coincide.
3. When r = 0, the two lines are mutually perpendicular.

Properties of Regression Coefficient
byx = r (σy / σx) and bxy = r (σx / σy)
byx is the regression coefficient of y on x and bxy is the regression coefficient of x on y.
1. The sign of both regression coefficients will be the same, i.e. both will be positive or both will be negative.
2. The product of the regression coefficients is the square of the correlation coefficient, i.e. byx × bxy = r²
3. byx and bxy will have the same sign as ‘r’
4. When there is perfect correlation, byx and bxy are reciprocals of each other.
5. byx = Σxy / Σx² and bxy = Σxy / Σy²
6. Both regression coefficients cannot be greater than 1. That is, one of them can be greater than 1, but then the other must be less than 1, since their product cannot exceed 1.
Coefficient of determination
• Coefficient of determination gives the percentage variation in the
dependent variable in relation with the independent variable.
• In other words coefficient of determination gives the ratio of the
explained variance to the total variance.
• The coefficient of determination is the square of the correlation
coefficient.
• The coefficient of determination is a more useful measure for interpreting the value of ‘r’
• Coefficient of determination states what percentage of variations
in the dependent variable is explained by the independent
variable.
For eg: If the value of r = 0.8, we cannot conclude that 80% of the variation in the dependent variable is due to the variation in the independent variable. The coefficient of determination in this case is r² = 0.64, which implies that only 64% of the variation in the dependent variable has been explained by the independent variable, and the remaining 36% of the variation is due to other factors.
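The arithmetic of this example can be checked in a few lines of Python:

```python
r = 0.8
r2 = r ** 2                     # coefficient of determination

explained = r2 * 100            # 64% of the variation explained by the independent variable
unexplained = (1 - r2) * 100    # remaining 36% of the variation due to other factors
```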
