
Statistics Unit 4 (1)

The document covers correlation, regression, and time series analysis, detailing methods such as correlation analysis, Spearman's Rank Correlation, and the Method of Least Squares. It provides formulas for calculating correlation coefficients and regression equations, along with examples to illustrate the concepts. The document emphasizes the relationship between variables and the statistical methods used to analyze these relationships.

Uploaded by

Kaustuv
Copyright
© © All Rights Reserved

UNIT 4

CORRELATION, REGRESSION & TIME SERIES ANALYSIS

Correlation analysis, estimation of regression line, Spearman’s Rank Correlation and
Method of Least Squares in Time Series Analysis.
CORRELATION ANALYSIS
• According to Ya Lun Chou, “Correlation analysis attempts to
determine the degree of relationship between variables”.
• A change in the value of one variable is reflected in the other.
• Two variables may have positive correlation, negative
correlation or no correlation between them.
• Scatter diagrams: a graphical representation of the relationship between
the variables.
• Multiple Correlation: studying the degree of relationship among
more than 2 variables.
• For quantitative measurement of the degree of relationship
between 2 variables, Karl Pearson’s formula is used. We
calculate the correlation coefficient, ‘r’, using the formula.
TYPES OF CORRELATION
• When the values of one variable
increase with an increase in another variable, the
correlation is said to be positive.
• On the other hand, if the values of one variable
increase with a decrease in another variable,
the correlation is said to be negative.
• There might be a case where a change in one
variable is not accompanied by any change in the other.
In this case, there is said to be no
correlation between the two.
[Figure: Types of Correlation]
CORRELATION ANALYSIS - FORMULAS
1. When the deviations are measured from the
mean values (x and y denote deviations from the means),
r = ∑xy / [√∑x² × √∑y²]
2. If no assumed average is taken for X and Y series (X and Y denote the actual values),
r = [N∑XY - ∑X∑Y] / [√{N∑X² - (∑X)²} × √{N∑Y² - (∑Y)²}]
3. If the deviations of X and Y series are taken from
assumed values,
r = [N∑dxdy - ∑dx∑dy] / [√{N∑dx² - (∑dx)²} × √{N∑dy² - (∑dy)²}]
CORRELATION ANALYSIS - FORMULAS
4. In a bivariate frequency distribution,
r = [N∑fdxdy - ∑fdx∑fdy] / [√{N∑fdx² - (∑fdx)²} × √{N∑fdy² - (∑fdy)²}]
Properties:
1. The value of ‘r’ always lies between -1 and +1.
2. The correlation coefficient is unaffected by a
change of origin and a change of scale of
reference.
Correlation - Formula
1. When the deviations are measured from the mean
values,
r = ∑xy / [√∑x² × √∑y²]
Example:
Find Karl Pearson’s coefficient of correlation from
the following data:
Wages            100  101  102  102  100  99  97  98  96  95
Cost of living    98   99   99   97   95  92  95  94  90  91
• Solution:
Wages (X)   x = X - 99   x²    Cost of living (Y)   y = Y - 95   y²    xy
100          1            1    98                    3            9    3
101          2            4    99                    4           16    8
102          3            9    99                    4           16   12
102          3            9    97                    2            4    6
100          1            1    95                    0            0    0
99           0            0    92                   -3            9    0
97          -2            4    95                    0            0    0
98          -1            1    94                   -1            1    1
96          -3            9    90                   -5           25   15
95          -4           16    91                   -4           16   16
∑X = 990    ∑x² = 54    ∑Y = 950    ∑y² = 96    ∑xy = 61

• r = ∑xy / [√∑x² × √∑y²] = 61 / √(54 × 96) = 61/72
• r = 0.847
• There is a high degree of positive relationship between the variables wages
and cost of living stated in the problem.
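The calculation above can be checked with a short script. This is a minimal sketch in Python (the function name `pearson_from_means` is chosen here for illustration and is not part of the original notes); it takes deviations from the actual means and applies the first formula.

```python
from math import sqrt

wages = [100, 101, 102, 102, 100, 99, 97, 98, 96, 95]
cost_of_living = [98, 99, 99, 97, 95, 92, 95, 94, 90, 91]

def pearson_from_means(xs, ys):
    """Pearson's r using deviations from the actual means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # deviations of X from its mean
    dy = [y - mean_y for y in ys]          # deviations of Y from its mean
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)

r = pearson_from_means(wages, cost_of_living)
print(round(r, 3))  # 0.847
```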
Correlation - Formula
2. If no assumed average is taken for X and Y series,
r = [N∑XY - ∑X∑Y] / [√{N∑X² - (∑X)²} × √{N∑Y² - (∑Y)²}]
Example:
Determine the correlation coefficient value for the
given set of X and Y values:
X Values   21    23    37    19    24    33
Y Values   2.5   3.1   4.2   5.6   6.4   8.4
Solution
X Values   Y Values   XY       X²      Y²
21         2.5        52.5     441     6.25
23         3.1        71.3     529     9.61
37         4.2        155.4    1369    17.64
19         5.6        106.4    361     31.36
24         6.4        153.6    576     40.96
33         8.4        277.2    1089    70.56
∑X = 157   ∑Y = 30.2   ∑XY = 816.4   ∑X² = 4365   ∑Y² = 176.38
Solution ctd..
• Obtained values:
∑X = 157
∑Y = 30.2
∑XY = 816.4
∑X² = 4365
∑Y² = 176.38
N = 6
Formula: Correlation coefficient,
• r = [N∑XY - ∑X∑Y] / [√{N∑X² - (∑X)²} × √{N∑Y² - (∑Y)²}]
• On substituting the values,
r = [6(816.4) - (157)(30.2)] / [√{6(4365) - (157)²} × √{6(176.38) - (30.2)²}]
r = 157 / √(1541 × 146.24) = 0.33
• There exists a low to moderate degree of positive relationship between the 2
variables.
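The substitution can be reproduced directly from the raw sums. The following is a small Python sketch of the second formula (the helper name `pearson_direct` is an illustrative choice, not something defined in these notes).

```python
from math import sqrt

def pearson_direct(xs, ys):
    """Pearson's r from raw sums, with no assumed average."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

x_values = [21, 23, 37, 19, 24, 33]
y_values = [2.5, 3.1, 4.2, 5.6, 6.4, 8.4]
r = pearson_direct(x_values, y_values)
print(round(r, 2))  # 0.33
```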
Correlation - Formula
3. If the deviations of X and Y series are taken from assumed
values,
r = [N∑dxdy - ∑dx∑dy] / [√{N∑dx² - (∑dx)²} × √{N∑dy² - (∑dy)²}]
Example:
Calculate the correlation coefficient for the following case,
taking 29 and 119 as assumed means for fertiliser used
and productivity respectively (all the values given are in tonnes):
Fertiliser used   15   18   20   24    30    35    40    50
Productivity      85   93   95   105   120   130   150   160
• Solution:
Fertiliser used (x)   dx = x - 29   dx²    Productivity (y)   dy = y - 119   dy²    dx dy
15                    -14           196    85                 -34            1156   476
18                    -11           121    93                 -26            676    286
20                    -9            81     95                 -24            576    216
24                    -5            25     105                -14            196    70
30                    1             1      120                1              1      1
35                    6             36     130                11             121    66
40                    11            121    150                31             961    341
50                    21            441    160                41             1681   861
∑dx = 0   ∑dx² = 1022   ∑dy = -14   ∑dy² = 5368   ∑dx dy = 2317
Solution ctd...
Formula:
r = [N∑dxdy - ∑dx∑dy] / [√{N∑dx² - (∑dx)²} × √{N∑dy² - (∑dy)²}]
• Substituting the values,
r = [8(2317) - (0)(-14)] / [√{8(1022) - (0)²} × √{8(5368) - (-14)²}]
r = 18536 / √(8176 × 42748)
Therefore, r = 0.9915
– There exists a very high degree of positive relationship between
the 2 variables, fertiliser used and productivity, stated
in the problem.
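Because r is unaffected by a change of origin, any pair of assumed means should lead to the same coefficient. The Python sketch below illustrates this; the function name and the second pair of assumed means (0 and 100) are arbitrary choices made here for the check, not values from the notes.

```python
from math import sqrt

def pearson_assumed_mean(xs, ys, assumed_x, assumed_y):
    """Pearson's r using deviations from assumed (not actual) means."""
    n = len(xs)
    dx = [x - assumed_x for x in xs]
    dy = [y - assumed_y for y in ys]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    numerator = n * sum_dxdy - sum_dx * sum_dy
    denominator = sqrt(n * sum_dx2 - sum_dx ** 2) * sqrt(n * sum_dy2 - sum_dy ** 2)
    return numerator / denominator

fertiliser = [15, 18, 20, 24, 30, 35, 40, 50]
productivity = [85, 93, 95, 105, 120, 130, 150, 160]

r_given = pearson_assumed_mean(fertiliser, productivity, 29, 119)  # means used in the example
r_other = pearson_assumed_mean(fertiliser, productivity, 0, 100)   # arbitrary alternative origin
print(abs(r_given - r_other) < 1e-9)  # True
```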
Spearman's rank correlation
coefficient
• Spearman's rank correlation coefficient allows
us to identify easily the strength of correlation
within a data set of two variables, and
whether the correlation is positive or
negative.
• The Spearman coefficient is denoted with the
Greek letter rho (ρ).
• FORMULA, ρ = 1 – {6∑d² / [n(n² − 1)]}, where d is the difference between the two ranks of an item and n is the number of pairs.
Spearman's rank correlation
coefficient - SOLVED EXAMPLE
1. The following are the ranks obtained by 10 students in
Statistics and Mathematics. To what extent is the
knowledge of the students in the two subjects related?
Statistics Mathematics
1 1
2 4
3 2
4 5
5 3
6 9
7 7
8 10
9 6
10 8
Solution
Statistics (X) Mathematics (Y) d = X-Y d2
1 1 0 0
2 4 -2 4
3 2 1 1
4 5 -1 1
5 3 2 4
6 9 -3 9
7 7 0 0
8 10 -2 4
9 6 3 9
10 8 2 4
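∑d² = 36
Solution
• FORMULA, ρ = 1 – {6∑d² / [n(n² − 1)]}
• On substituting the values,
ρ = 1 – {(6 × 36) / [10(100 − 1)]}
ρ = 1 – 0.2182
So, ρ = 0.7818
• The knowledge of the students in the two subjects is positively related to a fairly high degree.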
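The rank-difference formula is easy to verify in code. This is a minimal Python sketch (the name `spearman_rho` is illustrative); it assumes the ranks contain no ties, which is the case the formula in these notes covers.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rho from two lists of ranks (assumes no tied ranks)."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

statistics_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mathematics_rank = [1, 4, 2, 5, 3, 9, 7, 10, 6, 8]

rho = spearman_rho(statistics_rank, mathematics_rank)
print(round(rho, 4))  # 0.7818
```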
Spearman Rank Correlation – Missing
Data Problem
Example:
The coefficient of rank correlation of marks
obtained by 10 students in English and
Economics was found to be 0.5. It was later
discovered that the difference in ranks in the two
subjects obtained by one of the students was
wrongly taken as 3 instead of 7. Find the
correct coefficient of rank correlation.
Solution:
• Given: ρ = 0.5 & n = 10
• FORMULA, ρ = 1 – {6∑d² / [n(n² − 1)]}
Substituting the values in the formula,
0.5 = 1 – {6∑d² / [10(100 − 1)]}
∑d² = 82.5
Correcting the value (remove the wrong d² = 3² and add the correct d² = 7²):
∑d² = 82.5 - 3² + 7² = 122.5
Corrected rank correlation:
Where ∑d² = 122.5 and n = 10
ρ = 1 – {6∑d² / [n(n² − 1)]} = 1 - {6 × 122.5 / [10(100 − 1)]}
ρ = +0.2576
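The correction works by inverting the formula to recover ∑d², adjusting it, and applying the formula again. A short Python sketch of these three steps follows; the two helper names are hypothetical and introduced only for this illustration.

```python
def sum_d2_from_rho(rho, n):
    """Invert rho = 1 - 6*sum_d2 / (n*(n^2 - 1)) to recover sum of d^2."""
    return (1 - rho) * n * (n ** 2 - 1) / 6

def rho_from_sum_d2(sum_d2, n):
    """Spearman's rho from the sum of squared rank differences."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

n = 10
reported_sum_d2 = sum_d2_from_rho(0.5, n)            # 82.5
corrected_sum_d2 = reported_sum_d2 - 3 ** 2 + 7 ** 2  # swap the wrong d for the right one
rho_corrected = rho_from_sum_d2(corrected_sum_d2, n)
print(corrected_sum_d2, round(rho_corrected, 4))  # 122.5 0.2576
```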
REGRESSION ANALYSIS
• Correlation is the study of the degree of relationship
between 2 variables, while regression is the study of the
nature (functional form) of the relationship between the variables.
• According to Blair, “Regression is the measure of the
average relationship between two or more variables in
terms of the original units of data”.
• If Y is a dependent variable and X is an independent
variable, the linear relationship between the variables
is the regression equation of Y on X, i.e., Y = a + bX.
• The parameters a and b of the equation are determined using
the principle of least squares.
ESTIMATION OF REGRESSION LINE
1. Least Squares Method:
• The regression equation of X on Y is: X = a + bY
Where,
X = Dependent variable
Y = Independent variable
• The regression equation of Y on X is: Y = a + bX
Where,
Y = Dependent variable
X = Independent variable
• The values of a and b in the above equations are found by the
method of least squares, with the help of the normal equations given below:
(I) For X on Y:
∑X = na + b∑Y
∑XY = a∑Y + b∑Y²
(II) For Y on X:
∑Y = na + b∑X
∑XY = a∑X + b∑X²
ESTIMATION OF REGRESSION LINE
2. Deviation from the Arithmetic Mean method:
• The calculations by the least squares method are
quite cumbersome when the values of X and Y
are large. So the work can be simplified by using
this method.
• The formulas for the calculation of the regression
equations by this method are:
• Regression Equation of X on Y: (X - X̄) = bxy (Y - Ȳ)
• Regression Equation of Y on X: (Y - Ȳ) = byx (X - X̄)
ESTIMATION OF REGRESSION LINE
• Regression Coefficients, bxy & byx:
bxy = ∑xy / ∑y²
byx = ∑xy / ∑x²
where x = X - X̄ and y = Y - Ȳ.
• Properties of Regression coefficients:
1. r = ±√(bxy × byx)
The value of r is positive if both the coefficients are
positive, and negative if both are negative.
2. If one of the regression coefficients is greater than 1,
the other must be less than 1, because bxy × byx = r² ≤ 1; for positive coefficients, byx ≤ (1/bxy).
Solved Example: From the given data, obtain the regression equations by
taking deviations from the actual means of the X and Y series.

X   3   2   7   4   8
Y   6   1   8   5   9
Solution:
X     Y     x = X - X̄   y = Y - Ȳ   x²      y²      xy
3     6     -1.8         0.2          3.24    0.04    -0.36
2     1     -2.8         -4.8         7.84    23.04   13.44
7     8     2.2          2.2          4.84    4.84    4.84
4     5     -0.8         -0.8         0.64    0.64    0.64
8     9     3.2          3.2          10.24   10.24   10.24
∑X = 24   ∑Y = 29   ∑x = 0   ∑y = 0   ∑x² = 26.8   ∑y² = 38.8   ∑xy = 28.8
X̄ = 24/5 = 4.8 and Ȳ = 29/5 = 5.8
Regression Equation of X on Y is
(X - X̄) = bxy (Y - Ȳ)
bxy = ∑xy / ∑y²
X - 4.8 = (28.8 / 38.8) (Y - 5.8)
X - 4.8 = 0.74 (Y - 5.8)
X = 0.74Y + 0.49 ………….(I)
Regression Equation of Y on X is
(Y - Ȳ) = byx (X - X̄)
byx = ∑xy / ∑x²
Y - 5.8 = (28.8 / 26.8) (X - 4.8)
Y - 5.8 = 1.07 (X - 4.8)
Y = 1.07X + 0.66 ………….(II)
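Both regression lines can be obtained in a few lines of code from the deviation sums. The sketch below is in Python, and the function name `regression_lines` is an illustrative choice; it returns the intercept and slope for each line.

```python
def regression_lines(xs, ys):
    """Return ((a, b_xy), (a, b_yx)) for X = a + b_xy*Y and Y = a + b_yx*X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    b_xy = sum_xy / sum_y2              # regression coefficient of X on Y
    b_yx = sum_xy / sum_x2              # regression coefficient of Y on X
    a_x = mean_x - b_xy * mean_y        # intercept of the X-on-Y line
    a_y = mean_y - b_yx * mean_x        # intercept of the Y-on-X line
    return (a_x, b_xy), (a_y, b_yx)

(a_x, b_xy), (a_y, b_yx) = regression_lines([3, 2, 7, 4, 8], [6, 1, 8, 5, 9])
print(round(b_xy, 2), round(b_yx, 2))  # 0.74 1.07
```

The intercepts here are computed from the unrounded slopes, so they can differ in the second decimal place from a hand calculation that rounds the slope first.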
Solved Example
• Question: For two variables X and Y, the
equations of the regression lines are 9y-x-288=0
and x-4y+38=0. Find:
(i) The mean values of x and y
(ii) Regression coefficients and the correlation coefficient
(iii) Ratio of SD of y to that of x
(iv) Most probable value of y when x=145 &
(v) Most probable value of x when y=35
Solution
(i) Mean values of x and y:
• Both regression lines pass through the point of means, so we solve the two equations:
x - 9y = -288 and x - 4y = -38
• Subtracting, -5y = -250, so y = 50 and x = 4(50) - 38 = 162. These are the mean
values for x and y.
(ii) Regression Coefficients,
Formula, r = ±√(bxy × byx)
• Using the equations of the lines, bxy & byx are read off and
the value of r is calculated.
9y-x-288=0
• Then, y=(1/9)x+32. So, byx = 1/9
Similarly, x-4y+38=0
• Then, x=4y-38. So, bxy = 4
• Substituting the values of bxy & byx, r = √(4 × 1/9) = 2/3 (positive, since both coefficients are positive)
Solution
(iii) Ratio of SD of y to that of x:
byx = r (σy / σx) = 1/9
& bxy = r (σx / σy) = 4
Dividing byx by bxy gives (σy / σx)² = (1/9) / 4 = 1/36.
Taking the square root, σy / σx = 1/6, i.e., the required ratio is 1 : 6
(iv) x=145: substitute the value of x in the equation of y on x, 9y-x-288=0,
so the most probable value of y = 433/9 = 48.11
(v) y=35: substitute the value of y in the equation of x on y, x-4y+38=0,
so the most probable value of x = 4(35) - 38 = 102
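The five answers can be checked numerically. The Python sketch below writes each line in slope-intercept form (the variable names are illustrative) and uses the fact that the two regression lines intersect at the point of means.

```python
from math import sqrt

# y on x: 9y - x - 288 = 0  ->  y = (1/9)x + 32
b_yx, c_yx = 1 / 9, 32.0
# x on y: x - 4y + 38 = 0   ->  x = 4y - 38
b_xy, c_xy = 4.0, -38.0

# Substituting x = b_xy*y + c_xy into the first line gives the mean of y.
mean_y = (b_yx * c_xy + c_yx) / (1 - b_yx * b_xy)
mean_x = b_xy * mean_y + c_xy

r = sqrt(b_yx * b_xy)            # positive because both coefficients are positive
sd_ratio = sqrt(b_yx / b_xy)     # sigma_y / sigma_x
y_at_145 = b_yx * 145 + c_yx     # estimate of y from the y-on-x line
x_at_35 = b_xy * 35 + c_xy       # estimate of x from the x-on-y line

print(round(mean_x, 2), round(mean_y, 2))     # 162.0 50.0
print(round(r, 4), round(sd_ratio, 4))        # 0.6667 0.1667
print(round(y_at_145, 2), round(x_at_35, 2))  # 48.11 102.0
```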
Basic Differences between Correlation
and Regression
TIME SERIES ANALYSIS
• According to Morris Hamburg “A time series is a set of
observations arranged in chronological order”
• Examples: weather records, stock exchange prices, output of a
production unit.
• In such series the observed values of the variable
fluctuate from time to time. Thus, analysis of a time
series involves an examination of the past observations
and estimation of future values.
• These variations are broadly grouped under the
following four categories.
Components of Time Series
1. Secular trends - direction of a time series movement over a
long period of time usually represented by a straight line or
a smooth curve.
2. Seasonal variation - repeating periodic movement of a time
series
3. Cyclical fluctuations or “business cycles” - expansions (ups)
and contractions (downs) of business activities around the
normal value
4. Irregular movements - erratic movements, including all
types of time series movements other than secular,
seasonal, or cyclical
• These components provide a basis for the explanation of the
past behaviour. They help us to predict the future behaviour.
Uses of Time Series Analysis
• It helps in understanding the past behaviour.
• It helps in planning future operations.
• It helps in analysing the current
accomplishments.
• It facilitates comparison of data.
Mathematical Models of Time Series Analysis
• Classical Approach:
• Additive Model:
Y = T + S + C + I
Where: Y = Original Data
T = Trend Value
S = Seasonal Fluctuation
C = Cyclical Fluctuation
I = Irregular Variation
• Multiplicative Model:
Y = T x S x C x I
or
Y = TSCI
This model assumes that the output of an economy is the product of various
forces operating on one another. In this model S, C and I are given as
percentages.
Measurement of Secular trend:-
• The following methods are used for calculation of
trend:
  - Free Hand Curve Method
  - Semi-Average Method
  - Moving Average Method
  - Least Squares Method
Free hand Curve Method:-
• In this method the data are plotted on graph paper. We
take “Time” on the x-axis and “Data” on the y-axis. The
graph will have one point for every point of time. We then
draw a smooth freehand curve through these plotted
points.
Example:
Draw a free hand curve on the basis of the
following data:
Years              1989   1990   1991    1992   1993    1994    1995    1996
Profit (in ‘000)   148    149    149.5   149    150.5   152.2   153.7   153
[Figure: actual data and free-hand trend line; x-axis: Years (1989 to 1996), y-axis: Profit ('000), scale 145 to 155]
Semi – Average Method:-
• In this method the given data are divided into two parts,
preferably with an equal number of years. The average of each part is
plotted against the middle of its period, and the line joining the two points is the trend line.
• For example, if we are given data from 1991 to 2008,
i.e., over a period of 18 years, the two equal parts will be the
first nine years, i.e., 1991 to 1999, and the last nine years, from 2000 to
2008. In the case of an odd number of years, like 9, 13, 17, etc.,
two equal parts can be made simply by ignoring the
middle year. For example, if data are given for 19 years
from 1990 to 2008, the two equal parts would be from
1990 to 1998 and from 2000 to 2008; the middle year
1999 will be ignored.
• Example:
Find the trend line from the following data
by Semi – Average Method:-
Year                  1989   1990   1991   1992   1993   1994   1995   1996
Production (M.Ton.)   150    152    153    151    154    153    156    158

There are 8 years in total in the data. So the given data are split into
two parts, i.e., the first 4 years and the next 4 years, and the
average value is calculated for each part.
First Part = (150 + 152 + 153 + 151) / 4 = 151.50
Second Part = (154 + 153 + 156 + 158) / 4 = 155.25
Year Production Arithmetic Mean
(1) (2) (3)

1989 150

1990 152
151.50
1991 153

1992 151

1993 154

1994 153
155.25
1995 156

1996 158
[Figure: production plotted against year with the semi-average trend line through 151.50 and 155.25; x-axis: Year (1989 to 1996), y-axis: Production, scale 146 to 160]
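The two semi-averages for this example can be computed as follows. This is a minimal Python sketch; the function name `semi_averages` is illustrative, and dropping the middle value for an odd-length series follows the rule described for this method.

```python
production = [150, 152, 153, 151, 154, 153, 156, 158]  # 1989 to 1996

def semi_averages(values):
    """Average of the first and second halves; the middle value is skipped if n is odd."""
    n = len(values)
    half = n // 2
    first_part = values[:half]
    second_part = values[n - half:]
    return sum(first_part) / half, sum(second_part) / half

first_avg, second_avg = semi_averages(production)
print(first_avg, second_avg)  # 151.5 155.25
```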
Moving Average Method:-
• It is one of the most popular methods for calculating the long-term trend.
This method is also used for studying ‘seasonal fluctuation’, ‘cyclical
fluctuation’ & ‘irregular fluctuation’. In this method we calculate the
moving average over a fixed number of years.
• For example, if we are calculating a three-year moving average, then
according to this method the successive averages are:
[(1)+(2)+(3)] / 3, [(2)+(3)+(4)] / 3, [(3)+(4)+(5)] / 3, ……………..
Where (1), (2), (3), ………. are the values for the successive years of the time series.
Example: Find the five-year moving average:
Year    1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996
Price   20   25   33   30   27   35   40   43   35   32   37   48   50   37   45
Year    Price of       Five-year moving   Five-year moving
        sugar (Rs.)    total              average (Col 3 / 5)
(1)     (2)            (3)                (4)
1982    20             -                  -
1983    25             -                  -
1984    33             135                27
1985    30             150                30
1986    27             165                33
1987    35             175                35
1988    40             180                36
1989    43             185                37
1990    35             187                37.4
1991    32             195                39
1992    37             202                40.4
1993    48             204                40.8
1994    50             217                43.4
1995    37             -                  -
1996    45             -                  -
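The moving totals and averages can be generated with a short loop. The Python sketch below uses the prices as listed in the table (1985 = 30); the function name `moving_average` is an illustrative choice, and it leaves `None` at the ends where no centred average exists.

```python
def moving_average(values, window):
    """Centred moving average for an odd window; positions without a full window get None."""
    half = window // 2
    result = [None] * len(values)
    for i in range(half, len(values) - half):
        result[i] = sum(values[i - half:i + half + 1]) / window
    return result

price = [20, 25, 33, 30, 27, 35, 40, 43, 35, 32, 37, 48, 50, 37, 45]  # 1982 to 1996
ma5 = moving_average(price, 5)
print(ma5[2], ma5[12])  # 27.0 43.4  (centred on 1984 and 1994)
```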
Least Square Method:-
• This method is the most widely used in practice. When this method is
applied, a trend line is fitted to the data in such a manner that the
following two conditions are satisfied:
  1. The sum of the deviations of the actual values of Y from the computed values
  Yc is zero:
  ∑(Y - Yc) = 0
  2. The sum of the squares of the deviations of the actual values from the
  computed values is the least for this line:
  ∑(Y - Yc)² is least
  That is why the method is called
  the method of least squares. The line obtained by this method is
  known as the line of `best fit`.
The method of least squares can be used either to fit a straight line
trend or a parabolic trend.
The straight line trend is represented by the equation:

Yc = a + bX
Where, Yc = Trend value to be computed
X = Unit of time (Independent Variable)
a = Constant to be calculated
b = Constant to be calculated
Formulas (normal equations):

∑Y = Na + b∑X
∑XY = a∑X + b∑X²
Example:-

Fit a straight line trend and estimate the trend
value for 1996, taking 1990 as the base year (origin).

Year         1991   1992   1993   1994   1995
Production   8      9      8      9      16
Solution:
Year    Deviation from 1990   Y     XY    X²    Trend Yc = a + bX
(1)     X (2)                 (3)   (4)   (5)   (6)
1991    1                     8     8     1     5.2 + 1.6(1) = 6.8
1992    2                     9     18    4     5.2 + 1.6(2) = 8.4
1993    3                     8     24    9     5.2 + 1.6(3) = 10.0
1994    4                     9     36    16    5.2 + 1.6(4) = 11.6
1995    5                     16    80    25    5.2 + 1.6(5) = 13.2
N = 5   ∑X = 15               ∑Y = 50   ∑XY = 166   ∑X² = 55
Now we calculate the values of the two constants ‘a’ and ‘b’ with the help
of the two normal equations:
∑Y = Na + b∑X
∑XY = a∑X + b∑X²
Substituting the values of ∑X, ∑Y, ∑XY, ∑X² & N:
50 = 5a + 15b ……………. (i)
166 = 15a + 55b ……………… (ii)
Or 5a + 15b = 50 ……………… (iii)
15a + 55b = 166 …………………. (iv)
By solving equations (iii) and (iv) (multiply (iii) by 3 and subtract from (iv): 10b = 16),
a = 5.2 & b = 1.6
Equation of straight line trend, Yc = a + bX
Yc = 5.2 + 1.6X
Now we calculate the trend line value for the year 1996 (X = 6):
Yc = 5.2 + 1.6 (6) = 14.8
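The two normal equations have a closed-form solution, which makes the fit easy to reproduce. The following Python sketch (the function name `fit_trend` is illustrative) solves them for the example, with X measured as years since 1990.

```python
def fit_trend(xs, ys):
    """Solve the normal equations for the straight-line trend Yc = a + b*X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

x = [1, 2, 3, 4, 5]            # 1991 to 1995, measured from 1990
production = [8, 9, 8, 9, 16]

a, b = fit_trend(x, production)
trend_1996 = a + b * 6
print(round(a, 2), round(b, 2), round(trend_1996, 2))  # 5.2 1.6 14.8

# First least-squares condition: deviations from the trend sum to zero.
residual_sum = sum(y - (a + b * xi) for xi, y in zip(x, production))
```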
Methods Of Seasonal Variation:-

• Seasonal Average Method
• Link Relative Method
• Ratio To Trend Method
• Ratio To Moving Average Method
Seasonal Average Method
• Seasonal Average = Total of Seasonal Values / No. of Years
• General Average = Total of Seasonal Averages / No. of Seasons
• Seasonal Index = (Seasonal Average / General Average) x 100
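The three formulas can be sketched in Python as below. The quarterly figures are hypothetical, invented purely for illustration, and the index is expressed as a percentage (multiplied by 100), which is an assumption about the intended scale.

```python
# Hypothetical quarterly data (rows = years, columns = Q1..Q4); not from the notes.
quarterly = [
    [30, 40, 36, 34],
    [34, 52, 50, 44],
    [40, 58, 54, 48],
]

def seasonal_indices(table):
    """Seasonal index per season = seasonal average / general average * 100."""
    n_years = len(table)
    n_seasons = len(table[0])
    seasonal_avg = [sum(row[s] for row in table) / n_years for s in range(n_seasons)]
    general_avg = sum(seasonal_avg) / n_seasons
    return [avg / general_avg * 100 for avg in seasonal_avg]

indices = seasonal_indices(quarterly)
print([round(i, 1) for i in indices])
```

By construction the indices average to 100, so a value above 100 marks a season that is typically above the yearly level.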
Link Relative Method:

• In this method the following steps are taken for
calculating the seasonal variation indices:
• We calculate the link relatives of the seasonal figures.
Link Relative = (Current Season’s Figure / Previous Season’s Figure) x 100
• We calculate the average of the link relatives for each
season.
• Convert these averages into chain relatives on
the basis of the first season.
• Calculate the chain relative of the first season on
the basis of the last season. There will be some
difference between this chain relative of the first
season and the one obtained in the
previous step.
• This difference will be due to the effect of long-term
changes.
• For the correction, the chain relative of the first
season calculated by the first method is deducted from
the chain relative calculated by the second
method.
• Then express the corrected chain relatives as
percentages of their average.
Ratio To Moving Average Method:
• In this method seasonal variation indices are
calculated in the following steps:
• We calculate the 12-monthly or 4-quarterly
moving average.
• We use the following formula for calculating the
moving average ratio:
Moving Average Ratio = (Original Data / Moving Average) x 100
• Then we calculate the seasonal variation indices on
the basis of the averages of these ratios for each season.
Ratio To Trend Method:-
• This method is based on the multiplicative model of time
series. In it we use the following steps:
• We calculate the trend values for the various time
periods (monthly or quarterly) with the help of the
Least Squares method.
• Then we express all the original data as
percentages of the trend on the basis of the following
formula:
= (Original Data / Trend Value) x 100
• The rest of the process is the same as in the Ratio To Moving Average
Method.
Methods Of Cyclical Variation:-

• Residual Method
• Reference Cycle Analysis Method
• Direct Method
• Harmonic Analysis Method
Residual Method:-
• Cyclical variations can be calculated by the Residual Method. This
method is based on the multiplicative model of the time series. The
process is as below:
• (a) When yearly data are given:
In the case of yearly data there are no seasonal variations, so the
original data are affected by three components:
• Trend Value
• Cyclical
• Irregular
(b) When monthly or quarterly data are given:
First we calculate the seasonal variation indices according to the
ratio to moving average method.
Finally, we express the cyclical and irregular variations as the ratio of the
original data to the product of the trend values and the seasonal variation indices,
i.e., C x I = Y / (T x S).
Measurement of Irregular Variations

• The irregular components in a time series
represent the residue of fluctuations after trend,
cycle and seasonal movements have been
accounted for. Thus, if the original data are divided
by T, S and C, we get I, i.e., I = Y / (T x S x C). In practice, the cycle
itself is so erratic and so interwoven with
irregular movements that it is impossible to
separate them.
