
Unit 3 - Statistics

The document discusses key concepts in statistics including moments, measures of central tendency, and relationships between raw and central moments. Specifically, it defines: - Raw (or uncentralized) moments as measures of the shape of a distribution about a point like the origin. - Central moments as measures about the mean, including the variance as the second central moment. - Relationships between raw and central moments using equations that express one in terms of the other. - How to calculate moments for both ungrouped and grouped data using formulas involving the data values and frequencies.

Uploaded by

Nilanjan Kundu

Department of Mathematics

UNIT-III

STATISTICS

Topic Learning Objectives:

Upon completion of this unit, students will be able to:

• Extend their knowledge of statistical concepts and apply them to the needs of statistical data analysis.
• Understand central moments, skewness and kurtosis.
• Describe the principle of least squares.
• Fit data using several types of curves.
• Describe and evaluate correlation and regression coefficients.
• Investigate the strength and direction of a relationship between two variables by collecting measurements and using appropriate statistical analysis.

Introduction:

In many fields of applied mathematics and engineering we encounter problems and experiments
involving two variables. In this chapter we consider the mathematical theory of statistics
through an elementary treatment of central moments, mean, variance, the coefficients of
skewness and kurtosis in terms of moments, curve fitting, correlation and regression.
In mathematics, a moment is a specific quantitative measure of the shape of a function. It is
used in both mechanics and statistics. If the function represents physical density, then the
zeroth moment is the total mass, the first moment divided by the total mass is the center of
mass, and the second moment is the rotational inertia. If the function is a probability
distribution, then the zeroth moment is the total probability (i.e. one), the first moment is
the mean, the second central moment is the variance, the third standardized moment is the
skewness, and the fourth standardized moment is the kurtosis.

Moments:
In mechanics, a moment refers to the turning or rotating effect of a force, whereas in
statistics moments describe the characteristics of a frequency distribution. Moments can be
used to measure the central tendency of a set of observations, and they also help in measuring
the scatter, asymmetry and peakedness of the curve of a distribution.
A moment is the arithmetic mean of the deviations of the observations from the mean (or from
some other value), each raised to a given power. When the deviations are taken from the mean,
the results are called the moments of the distribution about the mean (central moments); these
are the moments most commonly used in statistics.

Fourth Semester 1 Statistics (18MA41B)


Moments for ungrouped data:
We first define the moments for ungrouped data. For n observations $x_1, x_2, \ldots, x_n$, the rth moment about the origin (the rth raw moment) is denoted by $\mu'_r$ and defined by

$$\mu'_r = \frac{1}{n}\sum_{i=1}^{n} x_i^{\,r}, \qquad r = 1, 2, 3, \ldots \qquad (1)$$

Thus for r = 1, 2, 3 and 4 we get the first four raw moments about the origin:

$$\mu'_1 = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \mu'_2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2, \quad \mu'_3 = \frac{1}{n}\sum_{i=1}^{n} x_i^3, \quad \mu'_4 = \frac{1}{n}\sum_{i=1}^{n} x_i^4.$$

Similarly, the rth moment about the arithmetic mean $\bar{x}$, also called the rth central moment, is denoted by $\mu_r$ and defined by

$$\mu_r = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r, \qquad r = 1, 2, 3, \ldots \qquad (2)$$

Thus for r = 1 the first central moment is

$$\mu_1 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}) = 0.$$

Similarly, for r = 2 the second central moment is

$$\mu_2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2,$$

which is equal to the variance.

Moments for grouped data:


Suppose the observations $x_1, x_2, \ldots, x_n$ are the mid-points of the class intervals and $f_1, f_2, \ldots, f_n$ are the corresponding frequencies. Then the rth moment about the origin is denoted by $\mu'_r$ and defined by

$$\mu'_r = \frac{1}{N}\sum_{i=1}^{n} f_i x_i^{\,r}, \qquad r = 1, 2, 3, \ldots, \quad N = \sum_{i=1}^{n} f_i. \qquad (3)$$

Similarly, the rth moment about the arithmetic mean is denoted by $\mu_r$ and defined by

$$\mu_r = \frac{1}{N}\sum_{i=1}^{n} f_i (x_i - \bar{x})^r, \qquad r = 1, 2, 3, \ldots \qquad (4)$$

Also, the rth moment about any point A is denoted by $\mu'_r$ and defined by

$$\mu'_r = \frac{1}{N}\sum_{i=1}^{n} f_i (x_i - A)^r, \qquad r = 1, 2, 3, \ldots \qquad (5)$$

Note: If $d_i = \dfrac{x_i - A}{h}$ or $d_i = \dfrac{x_i - \bar{x}}{h}$, then the rth order moments about an arbitrary point A and about the mean $\bar{x}$ are given respectively by

$$\mu'_r = \frac{h^r}{N}\sum_{i=1}^{n} f_i d_i^{\,r} \quad \text{and} \quad \mu_r = \frac{h^r}{N}\sum_{i=1}^{n} f_i d_i^{\,r}, \qquad r = 1, 2, 3, \ldots$$

Relation between raw (Moments about origin or any point) and Central Moments

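The central moments can be expressed in terms of raw moments and vice versa. The general relation giving the moments about the mean in terms of the moments about any point is

$$\mu_r = \mu'_r - {}^{r}C_1\, \mu'_{r-1}\,\mu'_1 + {}^{r}C_2\, \mu'_{r-2}\,\mu_1'^{\,2} - \cdots + (-1)^r \mu_1'^{\,r}, \qquad r = 1, 2, 3, \ldots \qquad (6)$$

In particular, on putting r = 2, 3 and 4 in equation (6), we get

$$\mu_2 = \mu'_2 - \mu_1'^{\,2}, \quad \mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2\mu_1'^{\,3}, \quad \mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2\mu_1'^{\,2} - 3\mu_1'^{\,4}.$$

Conversely,

$$\mu'_r = \mu_r + {}^{r}C_1\, \mu_{r-1}\,\mu'_1 + {}^{r}C_2\, \mu_{r-2}\,\mu_1'^{\,2} + \cdots + \mu_1'^{\,r}, \qquad r = 1, 2, 3, \ldots \qquad (7)$$

In particular, on putting r = 2, 3 and 4 in equation (7) and using $\mu_1 = 0$, we get

$$\mu'_2 = \mu_2 + \mu_1'^{\,2}, \quad \mu'_3 = \mu_3 + 3\mu_2\mu'_1 + \mu_1'^{\,3}, \quad \mu'_4 = \mu_4 + 4\mu_3\mu'_1 + 6\mu_2\mu_1'^{\,2} + \mu_1'^{\,4}.$$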

Example 1: The first four moments of a distribution about the value 4 of the variable are
-1.5, 17, -30 and 108. Find the moments about the mean.

Solution: Given A = 4, $\mu'_1 = -1.5$, $\mu'_2 = 17$, $\mu'_3 = -30$ and $\mu'_4 = 108$.

Moments about the mean:

$$\mu_2 = \mu'_2 - \mu_1'^{\,2} = 17 - (-1.5)^2 = 14.75,$$

$$\mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2\mu_1'^{\,3} = -30 - 3(17)(-1.5) + 2(-1.5)^3 = 39.75,$$

$$\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2\mu_1'^{\,2} - 3\mu_1'^{\,4} = 108 - 4(-30)(-1.5) + 6(17)(-1.5)^2 - 3(-1.5)^4 = 142.3125.$$
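The conversion from moments about an arbitrary point to central moments can be checked numerically. The following is a minimal Python sketch of the binomial relation between raw and central moments, with the convention that the zeroth raw moment equals 1; the function name `central_from_raw` is an illustrative choice, not something defined in these notes.

```python
from math import comb

def central_from_raw(raw):
    """Convert moments about a point A into central moments.

    raw[k-1] is the k-th moment about A (k = 1, ..., r_max).
    Returns [mu_1, ..., mu_r_max] using the binomial expansion
    mu_r = sum_k C(r, k) * (-1)^k * mu'_{r-k} * (mu'_1)^k, with mu'_0 = 1.
    """
    m = [1.0] + list(raw)   # m[k] = mu'_k, and m[0] = 1 by convention
    m1 = m[1]
    central = []
    for r in range(1, len(m)):
        total = sum(comb(r, k) * (-1) ** k * m[r - k] * m1 ** k
                    for k in range(r + 1))
        central.append(total)
    return central

# Moments about A = 4 from Example 1
mu = central_from_raw([-1.5, 17, -30, 108])
print(mu)  # first entry is 0 (up to rounding), then mu_2, mu_3, mu_4
```

For the data of Example 1 this reproduces 14.75, 39.75 and 142.3125 for the second, third and fourth central moments.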

Example 2: Calculate the first four moments of the following distribution about the mean.

x:  0   1   2    3    4    5    6    7   8
f:  1   8   28   56   70   56   28   8   1

Solution: Here N = Σf = 256 and the mean is $\bar{x} = 4$, so $d = x - \bar{x} = x - 4$.

x      f      d      fd      fd²     fd³      fd⁴
0      1      -4     -4      16      -64      256
1      8      -3     -24     72      -216     648
2      28     -2     -56     112     -224     448
3      56     -1     -56     56      -56      56
4      70     0      0       0       0        0
5      56     1      56      56      56       56
6      28     2      56      112     224      448
7      8      3      24      72      216      648
8      1      4      4       16      64       256
Total  256           0       512     0        2816

Moments about the mean $\bar{x} = 4$ are

$$\mu_1 = \frac{\sum fd}{N} = 0, \quad \mu_2 = \frac{\sum fd^2}{N} = 2, \quad \mu_3 = \frac{\sum fd^3}{N} = 0, \quad \mu_4 = \frac{\sum fd^4}{N} = 11.$$

Example 3: Wages of workers are given in the following table:

Wages:      1.5 - 2.5   2.5 - 3.5   3.5 - 4.5   4.5 - 5.5   5.5 - 6.5
Frequency:  1           3           7           3           1

Calculate the first four central moments of this distribution.

Solution: Here N = Σf = 15 and the mean of the mid-points is $\bar{x} = 4$.

Wages       f     Mid-point x    d = x - 4    fd     fd²    fd³    fd⁴
1.5 - 2.5   1     2              -2           -2     4      -8     16
2.5 - 3.5   3     3              -1           -3     3      -3     3
3.5 - 4.5   7     4              0            0      0      0      0
4.5 - 5.5   3     5              1            3      3      3      3
5.5 - 6.5   1     6              2            2      4      8      16
Total       15                                0      14     0      38

Moments about the mean:

$$\mu_1 = \frac{\sum fd}{N} = 0, \quad \mu_2 = \frac{\sum fd^2}{N} = 0.933, \quad \mu_3 = \frac{\sum fd^3}{N} = 0, \quad \mu_4 = \frac{\sum fd^4}{N} = 2.533.$$
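The tabular calculation for grouped data can be sketched in a few lines of Python. This is only an illustration of the grouped-data definition of central moments (frequency-weighted powers of deviations from the mean, divided by N); the helper name `grouped_central_moments` is hypothetical.

```python
def grouped_central_moments(values, freqs, orders=(1, 2, 3, 4)):
    """Return (mean, [mu_r for r in orders]) for a frequency distribution.

    values are the x values (or class mid-points), freqs the frequencies.
    """
    n_total = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n_total
    moments = [sum(f * (x - mean) ** r for x, f in zip(values, freqs)) / n_total
               for r in orders]
    return mean, moments

# Example 2: x = 0..8 with binomial-type frequencies
mean2, mu2 = grouped_central_moments(range(9), [1, 8, 28, 56, 70, 56, 28, 8, 1])
print(mean2, mu2)

# Example 3: class mid-points 2..6 with frequencies 1, 3, 7, 3, 1
mean3, mu3 = grouped_central_moments([2, 3, 4, 5, 6], [1, 3, 7, 3, 1])
print(mean3, mu3)
```

Both distributions are symmetric about 4, which is why the odd central moments come out as zero.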
Skewness and Kurtosis:
Averages tell us about the central value of a distribution, and measures of dispersion tell us
about the concentration of the items around that central value. These measures do not reveal
whether the dispersal of values on either side of an average is symmetrical or not. If the
observations are arranged symmetrically around a measure of central tendency, we get a
symmetrical distribution; otherwise the distribution is asymmetrical.
Measures of skewness and kurtosis, like measures of central tendency and dispersion, describe
the characteristics of a frequency distribution. Skewness measures the degree and direction of
departure from symmetry.
A symmetrical distribution gives a ‘symmetrical curve’, in which the mean, median and mode
are exactly equal. In an asymmetrical distribution, on the other hand, the mean, median and
mode are not equal. When two or more symmetrical distributions are compared, the difference
between them is studied through kurtosis. When two or more asymmetrical distributions are
compared, they show different degrees of skewness. The two measures describe different aspects
of shape: skewness describes asymmetry and kurtosis describes peakedness, so a distribution
may exhibit both.
Measures of Kurtosis:
Kurtosis gives an idea of the flatness or peakedness of the curve. It is measured by the Karl
Pearson coefficient $\beta_2$, given by

$$\beta_2 = \frac{\mu_4}{\mu_2^{\,2}}.$$

Kurtosis studies the concentration of the items at the central part of a series. In the
accompanying figure, the three curves A, B and C are all symmetrical about the mean.

Curve of the type ‘A’ which is neither flat nor peaked is called the normal curve or
‘MESOKURTIC’ curve (β2 = 3). If items concentrate too much at the center (more peaked
than the normal curve), the curve of the type ‘C’ becomes ‘LEPTOKURTIC’ curve (β2 > 3).
If the concentration at the center is comparatively less (flatter than the normal curve), the
curve of the type ‘B’ becomes ‘PLATYKURTIC’ curve (β2 < 3).
Measures of Skewness:
Literally, skewness means ‘lack of symmetry’. A distribution is said to be skewed if
(i) Mean, Median and Mode fall at different points.
(ii) The curve drawn with the help of the given data is not symmetrical but stretched more to
one side than to the other.

Karl Pearson’s coefficient of skewness: This is the method most frequently used for measuring
skewness. The coefficient of skewness is

$$S_k = \frac{\text{Mean} - \text{Mode}}{\sigma},$$

where σ is the standard deviation of the distribution.
Based upon moments, the coefficient of skewness is defined as

$$S_k = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2\,(5\beta_2 - 6\beta_1 - 9)}, \qquad \text{where } \beta_1 = \frac{\mu_3^{\,2}}{\mu_2^{\,3}} \text{ and } \beta_2 = \frac{\mu_4}{\mu_2^{\,2}}.$$
Nature of Skewness:
Skewness can be positive or negative or zero. The direction of skewness is determined by
observing whether the mean is greater than the mode (positive skewness) or less than the
mode (negative skewness).
(i) When the values of mean, median and mode are equal, there is no skewness.
(ii) When mean > median > mode, skewness will be positive.
(iii) When mean < median < mode, skewness will be negative.

Characteristics of a good measure of skewness:


1. It should be a pure number in the sense that its value should be independent of the unit of
the series and also degree of variation in the series.
2. It should have zero-value, when the distribution is symmetrical.
3. It should have a meaningful scale of measurement so that we could easily interpret the
measured value.


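Example: Wages of workers are given in the following table:

Wages:      10-12   12-14   14-16   16-18   18-20   20-22   22-24
Frequency:  1       3       7       12      12      4       3

Calculate the first four central moments of this distribution. Also compute $\beta_1$ and $\beta_2$.

Solution: Take A = 17 and h = 2, so that d = (x - 17)/2. Here N = Σf = 42.

Wages    f     Mid-point x    d = (x - 17)/2    fd     fd²    fd³    fd⁴
10-12    1     11             -3                -3     9      -27    81
12-14    3     13             -2                -6     12     -24    48
14-16    7     15             -1                -7     7      -7     7
16-18    12    17             0                 0      0      0      0
18-20    12    19             1                 12     12     12     12
20-22    4     21             2                 8      16     32     64
22-24    3     23             3                 9      27     81     243
Total    42                                     13     83     67     455

Moments about A = 17:

$$\mu'_1 = \frac{\sum fd}{N}\,h = \frac{13}{42}(2) = 0.61905, \qquad \mu'_2 = \frac{\sum fd^2}{N}\,h^2 = \frac{83}{42}(4) = 7.90476,$$

$$\mu'_3 = \frac{\sum fd^3}{N}\,h^3 = \frac{67}{42}(8) = 12.76190, \qquad \mu'_4 = \frac{\sum fd^4}{N}\,h^4 = \frac{455}{42}(16) = 173.33333.$$

Moments about the mean:

$$\mu_1 = 0, \qquad \mu_2 = \mu'_2 - \mu_1'^{\,2} = 7.90476 - 0.38322 = 7.5215,$$

$$\mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2\mu_1'^{\,3} = 12.76190 - 3(7.90476)(0.61905) + 2(0.61905)^3 = -1.4439,$$

$$\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2\mu_1'^{\,2} - 3\mu_1'^{\,4} = 173.33333 - 4(12.76190)(0.61905) + 6(7.90476)(0.61905)^2 - 3(0.61905)^4 = 159.467.$$

So we have

$$\beta_1 = \frac{\mu_3^{\,2}}{\mu_2^{\,3}} = 0.0049, \qquad \beta_2 = \frac{\mu_4}{\mu_2^{\,2}} = 2.8188.$$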
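As a cross-check on hand calculations of this kind, the coefficients $\beta_1 = \mu_3^2/\mu_2^3$ and $\beta_2 = \mu_4/\mu_2^2$ can be computed directly from the class mid-points and frequencies, without the step-deviation shortcut. The sketch below is illustrative only; the function names `shape_coefficients` and `kurtosis_type` are assumptions made for this example.

```python
def shape_coefficients(midpoints, freqs):
    """Return ({2: mu_2, 3: mu_3, 4: mu_4}, beta_1, beta_2) for grouped data."""
    n_total = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n_total
    mu = {r: sum(f * (x - mean) ** r for x, f in zip(midpoints, freqs)) / n_total
          for r in (2, 3, 4)}
    beta1 = mu[3] ** 2 / mu[2] ** 3
    beta2 = mu[4] / mu[2] ** 2
    return mu, beta1, beta2

def kurtosis_type(beta2, tol=1e-9):
    """Classify a curve from beta_2: 3 is mesokurtic, above is lepto-, below platykurtic."""
    if abs(beta2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if beta2 > 3 else "platykurtic"

# Wage table from the example: mid-points 11, 13, ..., 23 (N = 42)
midpoints = [11, 13, 15, 17, 19, 21, 23]
freqs = [1, 3, 7, 12, 12, 4, 3]
mu, beta1, beta2 = shape_coefficients(midpoints, freqs)
print(mu, beta1, beta2, kurtosis_type(beta2))
```

Working directly with mid-points avoids any slip in the powers of h, at the cost of larger intermediate numbers.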
Exercise:

1. The first four raw moments of a distribution are 2, 136, 320 and 40,000. Find the
coefficients of skewness and kurtosis.
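Ans. $\beta_1 = \dfrac{\mu_3^{\,2}}{\mu_2^{\,3}} = 0.1002$, $\beta_2 = \dfrac{\mu_4}{\mu_2^{\,2}} = 2.333$ (from $\mu_2 = 132$, $\mu_3 = -480$, $\mu_4 = 40656$).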
2. Find the second, third and fourth central moments of the frequency distribution given
below. Hence, find (i) a measure of skewness and (ii) a measure of kurtosis.

Class limits       Frequency
110.0 – 114.9      5
115.0 – 119.9      15
120.0 – 124.9      20
125.0 – 129.9      35
130.0 – 134.9      10
135.0 – 139.9      10
140.0 – 144.9      5

Ans. $\mu_2 = 2.16$, $\mu_3 = 0.804$, $\mu_4 = 12.5232$;
$\gamma_1 = \sqrt{\beta_1} = 0.25298$; $\gamma_2 = \beta_2 - 3 = -0.317$.

3. Find the second, third and fourth central moments of the frequency distribution
given below. Hence, find (i) a measure of skewness and (ii) a measure of kurtosis.
5 10 15 20 25 30 35
4 10 20 36 16 12 2
Ans. $\mu_2 = 44.41$, $\mu_3 = -12.504$, $\mu_4 = 5423.5057$, $\beta_1 = 0.001785$,
$\beta_2 = 2.7499$, $\gamma_1 = \sqrt{\beta_1} = 0.0422$; $\gamma_2 = \beta_2 - 3 = -0.2501$.
4. Compute the first four moments about mean from the following data. Hence, find (i) a
measure of skewness and (ii) a measure of kurtosis.
Class Intervals: 0 -10 10 – 20 20 – 30 30 – 40
Frequency: 1 3 4 2
Ans. $\mu_1 = 0$, $\mu_2 = 81$, $\mu_3 = -144$, $\mu_4 = 14817$, $\beta_1 = 0.03902$,
$\beta_2 = 2.2583$, $\gamma_1 = \sqrt{\beta_1} = 0.1975$; $\gamma_2 = \beta_2 - 3 = -0.7417$.

Method of Least squares:


Suppose we are given n values $x_1, x_2, x_3, \ldots, x_n$ of an independent variable x and the
corresponding values $y_1, y_2, y_3, \ldots, y_n$ of a variable y depending on x. Then the pairs
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ give us n points in the xy-plane. Generally it is not
possible to find the actual curve y = f(x) that passes through all these points. Hence we try
to find a curve that serves as the best approximation to the curve y = f(x). Such a curve is
referred to as the curve of best fit. The process of determining a curve of best fit is called
curve fitting. One method of finding the curve of best fit is the method of least squares.
The method of least squares requires the curve to pass as closely as possible to all the
points. Let y = f(x) be an approximate relation that fits the data $(x_i, y_i)$. The $y_i$ are
called the observed values and $Y_i = f(x_i)$ are called the expected values. Then
$E_i = y_i - Y_i$ are called the estimated errors or residuals.
The method of least squares provides a relationship y = f(x) such that the sum of the squares
of the residuals is least. Such a curve is known as the least squares curve.

Fitting of polynomial:
Approximating a data set using a polynomial equation is useful when conducting engineering
calculations as it allows results to be quickly updated when inputs change without the need
for manual lookup of the dataset. The most common method to generate a polynomial
equation from a given data set is the least squares method. We will discuss the fitting of the
following types of the curves.
Fitting of a straight line: y = a + bx

Let y = a + bx be the equation of the straight line.

The error at a data point is $E_i = y - (a + bx) = y - a - bx$.
By the principle of least squares we have to determine the constants a and b such that

$$E = \sum_{1}^{n} (y - a - bx)^2 \text{ is minimum.}$$

For E to be minimum the two necessary conditions are

$$\frac{\partial E}{\partial a} = 0, \qquad \frac{\partial E}{\partial b} = 0.$$

The first condition gives

$$\frac{\partial E}{\partial a} = 0 \;\Rightarrow\; 2\sum_{1}^{n} (y - a - bx)(-1) = 0 \;\Rightarrow\; \sum y - na - b\sum x = 0 \;\Rightarrow\; \sum y = na + b\sum x,$$

and the second gives

$$\frac{\partial E}{\partial b} = 0 \;\Rightarrow\; 2\sum_{1}^{n} (y - a - bx)(-x) = 0 \;\Rightarrow\; \sum xy = a\sum x + b\sum x^2.$$

The normal equations for estimating the values of a and b are

$$\sum y = na + b\sum x,$$

$$\sum xy = a\sum x + b\sum x^2.$$

Solving the above normal equations we estimate the values of a and b. With these values of a
and b, y = a + bx is the line of best fit.
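The two normal equations for the straight line form a 2×2 linear system that can be solved in closed form. The following Python sketch does exactly that; the function name `fit_line` and the small data set are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x from the two normal equations.

    sum(y)  = n*a      + b*sum(x)
    sum(xy) = a*sum(x) + b*sum(x^2)
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx            # determinant of the 2x2 system
    if det == 0:
        raise ValueError("all x values are equal; the line is not determined")
    b = (n * sxy - sx * sy) / det
    a = (sy - b * sx) / n
    return a, b

# Illustrative data: sums are n = 6, sum x = 21, sum y = 24, sum xy = 75, sum x^2 = 91
a, b = fit_line([1, 2, 3, 4, 5, 6], [6, 4, 3, 5, 4, 2])
print(round(a, 4), round(b, 4))
```

The determinant check guards against the degenerate case in which every x value is the same, where the normal equations have no unique solution.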

Fitting of a second degree equation (quadratic): y = a + bx + cx²

Let $y = a + bx + cx^2$ be the equation of the curve.

The error at a data point is $E_i = y - a - bx - cx^2$.

By the principle of least squares we have to determine the constants a, b and c such that

$$E = \sum_{1}^{n} (y - a - bx - cx^2)^2 \text{ is minimum.}$$

For E to be minimum,

$$\frac{\partial E}{\partial a} = 0, \qquad \frac{\partial E}{\partial b} = 0, \qquad \frac{\partial E}{\partial c} = 0.$$

$$\frac{\partial E}{\partial a} = 0 \;\Rightarrow\; 2\sum_{1}^{n} (y - a - bx - cx^2)(-1) = 0 \;\Rightarrow\; \sum y = na + b\sum x + c\sum x^2,$$

$$\frac{\partial E}{\partial b} = 0 \;\Rightarrow\; 2\sum_{1}^{n} (y - a - bx - cx^2)(-x) = 0 \;\Rightarrow\; \sum xy = a\sum x + b\sum x^2 + c\sum x^3,$$

$$\frac{\partial E}{\partial c} = 0 \;\Rightarrow\; 2\sum_{1}^{n} (y - a - bx - cx^2)(-x^2) = 0 \;\Rightarrow\; \sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4.$$

The normal equations for estimating the values of a, b and c are

$$\sum y = na + b\sum x + c\sum x^2,$$

$$\sum xy = a\sum x + b\sum x^2 + c\sum x^3,$$

$$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4.$$

Solving the above equations we estimate the values of a, b and c. With these values of a, b
and c, $y = a + bx + cx^2$ is the curve of best fit.
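The three normal equations for the parabola can be assembled from the power sums and solved as a 3×3 linear system. The sketch below uses plain Gaussian elimination so that it needs no external library; the names `solve_linear` and `fit_quadratic` and the sample data are assumptions made for illustration.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * sol = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [list(map(float, row)) + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            factor = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= factor * m[col][j]
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):            # back substitution
        s = sum(m[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (m[i][n] - s) / m[i][i]
    return sol

def fit_quadratic(xs, ys):
    """Least-squares parabola y = a + b*x + c*x^2 via the three normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                   # s[k] = sum x^k, s[0] = n
    t = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]  # t[k] = sum x^k * y
    matrix = [[s[i + j] for j in range(3)] for i in range(3)]
    return solve_linear(matrix, t)

a, b, c = fit_quadratic([0, 1, 2, 3, 4], [1, 3, 4, 5, 6])
print(round(a, 4), round(b, 4), round(c, 4))
```

For larger or badly scaled data sets, shifting x by its mean before forming the power sums usually improves the conditioning of the system.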

Fitting of a curve of the form: y = a e^(bx)

Let $y = a e^{bx}$ be the equation of the given curve.

Taking logarithms to base e on both sides we get

$$\log y = \log a + bx \log e = \log a + bx,$$

i.e. $u = A + bx$, where $A = \log a$ and $u = \log y$.

This is linear in u and x. (If common logarithms are used instead, the coefficient of x in the
linear relation is $b \log_{10} e$ rather than b.)
Then the normal equations for estimating the values of A and b are

$$\sum u = nA + b\sum x,$$

$$\sum xu = A\sum x + b\sum x^2.$$

By solving these equations, we get the values of A and b.

But $a = e^{A}$.

With these values of a and b, $y = a e^{bx}$ is the curve of best fit.
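The log-linearization can be sketched in Python as a straight-line fit to the pairs (x, ln y), followed by exponentiating the intercept. The function name `fit_exponential` and the three data points are illustrative assumptions; natural logarithms are used so that the fitted slope is b itself.

```python
from math import exp, log

def fit_exponential(xs, ys):
    """Fit y = a * exp(b*x) by least squares on u = ln(y) = A + b*x.

    Requires every y to be positive. Note that this minimizes the squared
    error in ln(y), not in y itself.
    """
    us = [log(y) for y in ys]
    n = len(xs)
    sx, su = sum(xs), sum(us)
    sxx = sum(x * x for x in xs)
    sxu = sum(x * u for x, u in zip(xs, us))
    det = n * sxx - sx * sx
    b = (n * sxu - sx * su) / det
    big_a = (su - b * sx) / n      # A = ln(a)
    return exp(big_a), b

a, b = fit_exponential([1, 2, 4], [5, 10, 30])
print(round(a, 4), round(b, 4))
```

Because the fit is done on the logarithms, points with small y carry relatively more weight than they would in a direct least-squares fit of the exponential.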

Fitting of a curve of the form: y = a x^b

Let $y = a x^{b}$.
Taking logarithms on both sides we get

$$\log y = \log a + b \log x,$$

i.e. $Y = A + bX$, where $Y = \log y$, $A = \log a$ and $X = \log x$.
The normal equations are

$$\sum Y = nA + b\sum X,$$

$$\sum XY = A\sum X + b\sum X^2.$$

Solving the above equations we estimate the values of A and b, and then a = antilog(A). With
these values of a and b, $y = a x^{b}$ is the curve of best fit.
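The power-law case differs from the exponential case only in that both variables are transformed. A minimal Python sketch, assuming positive x and y and using the hypothetical name `fit_power`, is given below; the test data are generated from y = 2x³ purely for illustration.

```python
from math import exp, log

def fit_power(xs, ys):
    """Fit y = a * x**b by least squares on Y = ln(y), X = ln(x), Y = A + b*X."""
    big_x = [log(x) for x in xs]
    big_y = [log(y) for y in ys]
    n = len(xs)
    sx, sy = sum(big_x), sum(big_y)
    sxx = sum(v * v for v in big_x)
    sxy = sum(v * w for v, w in zip(big_x, big_y))
    det = n * sxx - sx * sx
    b = (n * sxy - sx * sy) / det
    big_a = (sy - b * sx) / n      # A = ln(a)
    return exp(big_a), b

# Synthetic data lying exactly on y = 2 * x^3
xs = [1, 2, 3, 4]
ys = [2 * x ** 3 for x in xs]
a, b = fit_power(xs, ys)
print(round(a, 6), round(b, 6))
```

Any base of logarithm may be used, provided the same base is used for both variables and for the antilog of A.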

Examples:

1. Fit a straight line to the following data.

x:  1    6    11   16   20   26
y:  13   16   17   23   24   31

Let y = a + bx be the straight line.

The normal equations for estimating the values of a and b are

$$\sum y = na + b\sum x, \qquad \sum xy = a\sum x + b\sum x^2.$$

Given n = 6.

x      y      x²     xy
1      13     1      13
6      16     36     96
11     17     121    187
16     23     256    368
20     24     400    480
26     31     676    806
Σx = 80   Σy = 124   Σx² = 1490   Σxy = 1950

Substituting the above values in the normal equations we get

6a + 80b = 124
80a + 1490b = 1950

Solving, we get a = 11.3228, b = 0.7008.

Therefore the equation of best fit is y = 11.3228 + 0.7008x.

2. Fit a straight line to the following data.

x:  1   2   3   4   5   6
y:  6   4   3   5   4   2

Soln:
Let y = a + bx be the straight line.
The normal equations for estimating the values of a and b are

$$\sum y = na + b\sum x, \qquad \sum xy = a\sum x + b\sum x^2.$$

Here n = 6 and following the procedure as in example 1 we get
Σx = 21, Σy = 24, Σxy = 75, Σx² = 91.
Therefore, we get 24 = 6a + 21b, 75 = 21a + 91b.
Solving, we get a = 5.8, b = –0.514.
Therefore the equation of best fit is y = 5.8 – 0.514x.

3. Fit a straight line of the form y = ax + b to the following data by the method of least
squares.

x:  5    10   15   20   25
y:  16   19   23   26   30

Soln:
Let y = ax + b be the given straight line.
The normal equations are

$$\sum y = a\sum x + nb, \qquad \sum xy = a\sum x^2 + b\sum x.$$

Here n = 5 and following the procedure as in example 1 we get

Σy = 114, Σx = 75, Σxy = 1885, Σx² = 1375.

Substituting in the above equations we get a = 0.7, b = 12.3.
The best fit is y = 0.7x + 12.3.
4. Fit an exponential curve of the type $y = a e^{bx}$ to the following data by the method of
least squares.

x:  1   2    4
y:  5   10   30

Let $y = a e^{bx}$ ………….(1) be the required curve.

Taking common logarithms on both sides of (1) and simplifying we get

Y = A + Bx, where $A = \log_{10} a$, $B = b \log_{10} e$ and $Y = \log_{10} y$.

The normal equations for estimating the values of A and B are

$$\sum Y = nA + B\sum x \quad \text{and} \quad \sum xY = A\sum x + B\sum x^2.$$

x      y      Y = log y    xY        x²
1      5      0.6990       0.6990    1
2      10     1.0000       2.0000    4
4      30     1.4771       5.9085    16
Σx = 7        ΣY = 3.1761  ΣxY = 8.6075   Σx² = 21

Substituting the above values in the normal equations we get

3A + 7B = 3.1761
7A + 21B = 8.6075

Solving, we get A = 0.4604 and B = 0.2564. Hence a = antilog(0.4604) = 2.8867 and
$b = B / \log_{10} e = 0.2564 / 0.4343 = 0.5904$.

Therefore the curve of best fit is $y = 2.8867\, e^{0.5904x}$.

5. Fit a curve of the form $y = a e^{bx}$ to the data by the method of least squares.

x:  0      2    4
y:  8.12   10   31.82

Soln:

Let $y = a e^{bx}$ ………….(1) be the required curve.

Taking natural logarithms on both sides of (1) and simplifying we get

Y = A + bx, where A = log a, Y = log y.

The normal equations for estimating the values of A and b are

$$\sum Y = nA + b\sum x \quad \text{and} \quad \sum xY = A\sum x + b\sum x^2.$$

Here n = 3 and following the procedure as in example 4 we get

Σx = 6, ΣY = 7.85, ΣxY = 18.44, Σx² = 20.

Substituting the above values in the normal equations we get

3A + 6b = 7.85
6A + 20b = 18.44

By solving these equations, we get A = 1.932, so that a = antilog(A) = 6.903, and b = 0.3425.
Therefore $y = 6.903\, e^{0.3425x}$ is the curve of best fit.
6. Fit a curve of the form $y = a b^{x}$ to the data by the method of least squares.

x:  2   4   6   8    10
y:  1   3   6   12   24

Soln:
Let $y = a b^{x}$ ………….(1) be the required curve.
Taking common logarithms on both sides of (1) and simplifying we get

Y = A + Bx, where A = log a, B = log b and Y = log y.

The normal equations for estimating the values of A and B are

$$\sum Y = nA + B\sum x \quad \text{and} \quad \sum xY = A\sum x + B\sum x^2.$$

Here n = 5 and following the procedure as in example 4 we get

Σx = 30, ΣY = 3.7147, ΣxY = 29.0130, Σx² = 220.

Substituting the above values in the normal equations we get

5A + 30B = 3.7147, 30A + 220B = 29.0130.

By solving these equations, we get A = -0.26566, so that a = antilog(A) = 0.5424, and
B = 0.1681, so that b = antilog(B) = 1.4727.
Therefore $y = (0.5424)(1.4727)^{x}$ is the curve of best fit.
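A fit of the form y = a·bˣ can be checked with a short Python sketch that performs the same log-linear regression on (x, log₁₀ y) and then takes antilogs of both coefficients. The function name `fit_ab_power_x` is a hypothetical choice for this illustration.

```python
from math import log10

def fit_ab_power_x(xs, ys):
    """Fit y = a * b**x by least squares on Y = log10(y) = A + B*x."""
    big_y = [log10(y) for y in ys]
    n = len(xs)
    sx, sy = sum(xs), sum(big_y)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * v for x, v in zip(xs, big_y))
    det = n * sxx - sx * sx
    big_b = (n * sxy - sx * sy) / det     # B = log10(b)
    big_a = (sy - big_b * sx) / n         # A = log10(a), may be negative
    return 10 ** big_a, 10 ** big_b

a, b = fit_ab_power_x([2, 4, 6, 8, 10], [1, 3, 6, 12, 24])
print(round(a, 4), round(b, 4))
# Sanity check: the fitted curve should pass near the first data point (2, 1)
print(round(a * b ** 2, 3))
```

A negative value of A simply means that a is less than 1; the antilog must be taken with the sign of A retained.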
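7. At constant temperature the pressure P and the volume V of a gas are connected by the
relation $PV^{\gamma} = K$ (constant). Find the best fitting equation of this form to the
following data and estimate V when P = 4.

P:  0.5    1.0    1.5   2.0   2.5   3.0
V:  1620   1000   750   620   520   460

Soln: Let $PV^{\gamma} = K$ …….. (1) be the given relation. Taking natural logarithms on both
sides of (1) and simplifying we get

$$\log P = \log K - \gamma \log V,$$

which is of the form Y = A + BX with Y = log P, X = log V, A = log K and B = -γ.
Here n = 6 and following the procedure as in example 4 we get

Σ log V = 39.73, Σ log P = 2.42, Σ (log V)(log P) = 14.4786, Σ (log V)² = 264.1689.

Substituting in the normal equations ΣY = nA + BΣX and ΣXY = AΣX + BΣX² gives

2.42 = 6A + 39.73B, 14.4786 = 39.73A + 264.1689B.

Solving, γ = -B = 1.42 and K = antilog(A) = 18144.
Therefore $PV^{1.42} = 18144$ is the curve of best fit.
At P = 4, V = 375.9428 ≈ 376.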
8. Fit a second degree parabola to the following data.

x:  0   1   2   3   4
y:  1   3   4   5   6

Soln: Let $y = a + bx + cx^2$ be the second degree polynomial; we have to determine a, b
and c.
Normal equations for the second degree parabola are

$$\sum y = na + b\sum x + c\sum x^2,$$

$$\sum xy = a\sum x + b\sum x^2 + c\sum x^3,$$

$$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4.$$

x     y     xy     x²     x²y     x³      x⁴
0     1     0      0      0       0       0
1     3     3      1      3       1       1
2     4     8      4      16      8       16
3     5     15     9      45      27      81
4     6     24     16     96      64      256
Σx = 10   Σy = 19   Σxy = 50   Σx² = 30   Σx²y = 160   Σx³ = 100   Σx⁴ = 354

Substituting the above values in the normal equations (with n = 5) we get

5a + 10b + 30c = 19, 10a + 30b + 100c = 50, 30a + 100b + 354c = 160.

Solving, we get a = 1.1143, b = 1.7714, c = –0.1429.
Therefore the second degree parabola of best fit is y = 1.1143 + 1.7714x – 0.1429x².

9. Fit a curve of the form $y = a + bx + cx^2$ to the data by the method of least squares.

x:  0   1     2     3     4
y:  1   1.8   1.3   2.5   6.3

Soln: Let $y = a + bx + cx^2$ be the second degree parabola; we have to determine a, b and
c.
Normal equations for the second degree parabola are

$$\sum y = na + b\sum x + c\sum x^2,$$

$$\sum xy = a\sum x + b\sum x^2 + c\sum x^3,$$

$$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4.$$

Here n = 5 and following the procedure as in example 8 we get

Σx = 10, Σy = 12.9, Σxy = 37.1, Σx² = 30, Σx³ = 100, Σx⁴ = 354, Σx²y = 130.3.

Substituting these values in the normal equations gives

5a + 10b + 30c = 12.9, 10a + 30b + 100c = 37.1, 30a + 100b + 354c = 130.3.

Solving, we get a = 1.42, b = –1.07, c = 0.55.

Then the curve of best fit is y = 1.42 – 1.07x + 0.55x².
10. The following table gives the production (in thousand units) of a certain commodity in
different years:

Year (x):        1968   1978   1988   1998   2008
Production (y):  8      10     12     10     16

Fit a straight line to the data and estimate the production in the year 2015.
Soln:
For convenience in computations, let us set X = x – 1967 and let y = a + bX be the straight
line.
The normal equations for estimating the values of a and b are

$$\sum y = na + b\sum X, \qquad \sum Xy = a\sum X + b\sum X^2.$$

Here n = 5 and following the procedure as in example 1 we get

Σy = 56, ΣX = 105, ΣXy = 1336, ΣX² = 3205.

Substituting these values in the normal equations we get

56 = 5a + 105b,
1336 = 105a + 3205b.
Solving these equations, we get a = 7.84 and b = 0.16. Therefore the line of best fit is given
by y = a + bX = 7.84 + 0.16X = 7.84 + 0.16(x – 1967) = 0.16x – 306.88.
For x = 2015, this gives y = 15.52.
Thus, for the year 2015, the estimated production is 15.52 (thousand units).

Exercise:
1. An experiment gave the following data:
x 1 3 4 6 8 9 11 14
y 1 2 4 4 5 7 8 9

It is known that x and y are connected by the relation y = a0 + a1x. Find the best
values of a0 and a1 using the least squares method.
Ans. a0 = 0.5455, a1 = 0.6364 and y = 0.5455 + 0.6364x
2. The number y of bacteria per unit volume present in a culture after x hours is given by
the following table :
x 0 1 2 3 4 5 6
y 32 47 65 92 132 190 275

Fit a curve of the form $y = a b^{x}$ to the data. Estimate the value of y when x = 7.
Ans. a = 32.14, b = 1.4270 and $y = 32.14\,(1.4270)^{x}$; y(7) = 387.
3. The following table gives the production (in thousands units) of a certain commodity in
different years:
Year (x) 1941 1951 1961 1971 1981 1991 2001
Production ( y) 3.9 5.3 7.3 9.6 12.9 17.1 23.2

Fit a curve of the form $y = a b^{x}$ to this data and estimate the production in the year
2006.
Ans. a = 9.5735, b = 1.3433 and $y = 9.5735\,(1.3433)^{x}$; y(2006) = 27.5 × 1000 quintals

4. The latent heat of vaporization of steam r is given in the following table at different
temperatures t: For this range of temperature fit a relation of the form r = a + b t using
the method of least squares.
t 40 50 60 70 80 90 100 110
r 1069.1 1063.6 1058.2 1052.7 1049.3 1041.8 1036.3 1030.8

Ans. a = 1090.26, b = -0.534 and r = 1090.26 – 0.534 t.


5. The following table gives the results of the measurements of train resistances; V is the
velocity in mile per hour and R is the resistance in pound per ton.

V 20 30 40 50 60 70
R 54 90 138 206 292 396

If R is related to V by the relation R = a + bV + cV², find a, b and c by the method
of least squares and estimate R when V = 45 miles/hour.
Ans. a = 41.77, b = –1.096 and c = 0.08786, so R = 41.77 – 1.096V + 0.08786V²;
R = 170 pounds per ton when V = 45 miles/hour.

Correlation and Regression:


The word correlation is used in everyday life to denote some form of association. In statistical
terms we use correlation to denote association between two quantitative variables. We also
assume that the association is linear, that one variable increases or decreases a fixed amount
for a unit increase or decrease in the other. The other technique that is often used in these
circumstances is regression, which involves estimating the best straight line to summarize the
association.
Correlation:
Correlation means simply a relation between two or more variables.
Two variables are said to be correlated if the change in one variable results in a
corresponding change in the other.
Ex: 1. x: supply y: price
2. x: demand y: Price.
Positive correlation:
If an increase or decrease in one variable corresponds to an increase or decrease in the other
then the correlation is said to be positive correlation or direct correlation.
Ex: 1. Demand and price of commodity. 2. Income and expenditure.

Negative correlation:
If an increase or decrease in one variable corresponds to a decrease or increase,
respectively, in the other, then the correlation is said to be negative or inverse correlation.
Ex: 1. Supply and price of a commodity.
2. Volume and pressure of a perfect gas.
No correlation
If there exists no relationship between two variables then they are said to be uncorrelated.
Scatter diagram
To obtain a measure of relationship between two variables x and y we plot their
corresponding values in the xy - plane. The resulting diagram showing the collection of the
dots is called the dot diagram or scatter diagram.

Correlation Coefficient (Karl Pearson correlation coefficient)


The degree of association is measured by a correlation coefficient, denoted by r. It is
sometimes called Karl Pearson's correlation coefficient and is a measure of linear association.
If a curved line is needed to express the relationship, other and more complicated measures of
the correlation must be used.
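Let $x_1, x_2, x_3, \ldots, x_n$ be n values of x and $y_1, y_2, y_3, \ldots, y_n$ be the corresponding n values of y. Then the coefficient of correlation between x and y is

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{n\,\sigma_x \sigma_y},$$

where $\sigma_x^2$ is the variance of the x series, $\sigma_y^2$ is the variance of the y series, $\bar{x} = \dfrac{\sum x}{n}$ is the mean of the x series and $\bar{y} = \dfrac{\sum y}{n}$ is the mean of the y series.
For computation purposes we can use the formula

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}.$$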

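The computational formula for r translates directly into code. The Python sketch below offers two hypothetical helpers: `pearson_r_from_sums`, which works from the five sums that usually appear in hand calculations, and `pearson_r`, which forms those sums from raw data. The age data are used only as an illustration.

```python
from math import sqrt

def pearson_r_from_sums(n, sx, sy, sxx, syy, sxy):
    """Karl Pearson's r from n, sum x, sum y, sum x^2, sum y^2 and sum xy."""
    numerator = n * sxy - sx * sy
    denominator = sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator

def pearson_r(xs, ys):
    """Karl Pearson's r computed from paired observations."""
    return pearson_r_from_sums(
        len(xs), sum(xs), sum(ys),
        sum(x * x for x in xs), sum(y * y for y in ys),
        sum(x * y for x, y in zip(xs, ys)),
    )

# Ages of 10 married couples (husband, wife)
husband = [23, 27, 28, 29, 30, 31, 33, 35, 36, 39]
wife = [18, 22, 23, 24, 25, 26, 28, 29, 30, 32]
print(round(pearson_r(husband, wife), 4))

# r from corrected sums: n = 25, sum x = 125, sum y = 102,
# sum x^2 = 650, sum y^2 = 488, sum xy = 532
print(round(pearson_r_from_sums(25, 125, 102, 650, 488, 532), 5))
```

The sums-based form is convenient when a few observations have to be corrected after the totals were computed, since only the sums need adjusting.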
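Limits for correlation coefficient
The coefficient of correlation numerically does not exceed unity ($-1 \le r \le 1$).
Proof:
We have

$$r = \frac{\frac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum (x_i - \bar{x})^2}\;\sqrt{\frac{1}{n}\sum (y_i - \bar{y})^2}}, \qquad i = 1, 2, \ldots, n.$$

Writing $a_i = x_i - \bar{x}$ and $b_i = y_i - \bar{y}$,

$$r = \frac{\frac{1}{n}\sum a_i b_i}{\sqrt{\frac{1}{n}\sum a_i^2 \cdot \frac{1}{n}\sum b_i^2}}, \qquad r^2 = \frac{\left(\sum a_i b_i\right)^2}{\sum a_i^2 \, \sum b_i^2}. \qquad (1)$$

The Cauchy–Schwarz inequality states that if $a_i$, $b_i$ (i = 1, 2, …, n) are real quantities, then

$$\left(\sum a_i b_i\right)^2 \le \sum a_i^2 \, \sum b_i^2,$$

with equality holding if and only if

$$\frac{a_1}{b_1} = \frac{a_2}{b_2} = \frac{a_3}{b_3} = \cdots = \frac{a_n}{b_n}.$$

Using this, equation (1) becomes

$$r^2 \le 1 \;\Rightarrow\; |r| \le 1 \;\Rightarrow\; -1 \le r \le 1.$$

Hence the correlation coefficient cannot exceed unity numerically.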


Note:

Figure 1.1 Correlation illustrated.


1. If r =-1 there is a perfect negative correlation.
2. If r =1 there is a perfect positive correlation.
3. If r =0 then the variables are non-correlated.

4. When r = 0, $\theta = \dfrac{\pi}{2}$, where θ is the angle between the two lines of regression; i.e. when the variables are uncorrelated the two lines of regression are perpendicular to each other.
5. When $r = \pm 1$, θ = 0 or π; i.e. the lines of regression coincide.

Examples:
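1. If r is the correlation coefficient between x and y and z = ax + by, show that

$$r = \frac{\sigma_z^2 - \left(a^2\sigma_x^2 + b^2\sigma_y^2\right)}{2ab\,\sigma_x\sigma_y}.$$

Soln:
Let z = ax + by. Then

$$\bar{z} = \frac{1}{n}\sum z = \frac{a}{n}\sum x + \frac{b}{n}\sum y = a\bar{x} + b\bar{y},$$

so that $z - \bar{z} = a(x - \bar{x}) + b(y - \bar{y})$. Squaring, summing and dividing by n,

$$\frac{1}{n}\sum (z - \bar{z})^2 = a^2\,\frac{1}{n}\sum (x - \bar{x})^2 + b^2\,\frac{1}{n}\sum (y - \bar{y})^2 + 2ab\,\frac{1}{n}\sum (x - \bar{x})(y - \bar{y}),$$

$$\Rightarrow\; \sigma_z^2 = a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\,r\,\sigma_x\sigma_y,$$

$$\Rightarrow\; r = \frac{\sigma_z^2 - \left(a^2\sigma_x^2 + b^2\sigma_y^2\right)}{2ab\,\sigma_x\sigma_y}.$$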
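2. While calculating the correlation coefficient between x and y from 25 pairs of observations, a person obtained the following values: $\sum x_i = 125$, $\sum x_i^2 = 650$, $\sum y_i = 100$, $\sum y_i^2 = 460$, $\sum x_i y_i = 508$. It was later discovered that he had copied down the pairs (8, 12) and (6, 8) as (6, 12) and (8, 6) respectively. Obtain the correct value of the correlation coefficient.
Soln:
Subtracting the contributions of the wrong pairs and adding those of the correct pairs,
Correct $\sum x_i = 125 - 6 - 8 + 8 + 6 = 125$,
Correct $\sum x_i^2 = 650 - 36 - 64 + 64 + 36 = 650$,
Correct $\sum y_i = 100 - 12 - 6 + 12 + 8 = 102$,
Correct $\sum y_i^2 = 460 - 144 - 36 + 144 + 64 = 488$,
Correct $\sum x_i y_i = 508 - 72 - 48 + 96 + 48 = 532$.
With n = 25,
$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{25(532) - (125)(102)}{\sqrt{\left[25(650) - 125^2\right]\left[25(488) - 102^2\right]}} = 0.51912. $$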
3. The following Table gives the age (in years) of 10 married couples. Calculate the coefficient
of correlation between these ages.

Age of Husband(x) 23 27 28 29 30 31 33 35 36 39
Age of wife(y) 18 22 23 24 25 26 28 29 30 32

Soln:
Here n = 10. We find
$$ \bar{x} = \frac{1}{n}\sum x_i = \frac{311}{10} = 31.1, \qquad \bar{y} = \frac{1}{n}\sum y_i = \frac{257}{10} = 25.7. $$
With $X_i = x_i - \bar{x}$ and $Y_i = y_i - \bar{y}$:

$x_i$    $X_i$    $X_i^2$    $Y_i$    $Y_i^2$    $X_i Y_i$
23       -8.1     65.61      -7.7     59.29      62.37
27       -4.1     16.81      -3.7     13.69      15.17
28       -3.1      9.61      -2.7      7.29       8.37
29       -2.1      4.41      -1.7      2.89       3.57
30       -1.1      1.21      -0.7      0.49       0.77
31       -0.1      0.01       0.3      0.09      -0.03
33        1.9      3.61       2.3      5.29       4.37
35        3.9     15.21       3.3     10.89      12.87
36        4.9     24.01       4.3     18.49      21.07
39        7.9     62.41       6.3     39.69      49.77
Total            202.90               158.10     178.30

$$ r = \frac{\sum X_i Y_i}{\sqrt{\sum X_i^2 \, \sum Y_i^2}} = \frac{178.30}{\sqrt{202.90 \times 158.10}} = 0.9955 \approx 1, $$

i.e., the ages of husbands and wives are almost perfectly correlated.
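
The deviation table can be rebuilt in a few lines. This sketch uses the ages from the example; the variable names are illustrative.

```python
import math

husband = [23, 27, 28, 29, 30, 31, 33, 35, 36, 39]  # x
wife = [18, 22, 23, 24, 25, 26, 28, 29, 30, 32]     # y

n = len(husband)
x_bar = sum(husband) / n
y_bar = sum(wife) / n

# Deviations from the means, X_i = x_i - x_bar and Y_i = y_i - y_bar.
X = [x - x_bar for x in husband]
Y = [y - y_bar for y in wife]

sum_X2 = sum(v * v for v in X)
sum_Y2 = sum(v * v for v in Y)
sum_XY = sum(u * v for u, v in zip(X, Y))

r = sum_XY / math.sqrt(sum_X2 * sum_Y2)
print(round(x_bar, 1), round(y_bar, 1))                      # 31.1 25.7
print(round(sum_X2, 2), round(sum_Y2, 2), round(sum_XY, 2))  # 202.9 158.1 178.3
print(round(r, 4))                                           # 0.9955
```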

Regression:
Correlation describes the strength of an association between two variables and is completely symmetrical: the correlation between A and B is the same as the correlation between B and A. However, if the two variables are related, then when one changes by a certain amount the other changes, on average, by a certain amount. This relationship can be represented by a simple equation called the regression equation. In this context "regression" (the term is a historical anomaly) simply means that the average value of y is a "function" of x, that is, it changes with x.
Regression analysis is a mathematical measure of the average relationship between two or more variables in terms of the original units of the data.

Line of regression:
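The line of regression is the line that gives the best estimate of the value of one variable for any specific value of the other variable. The line of regression is therefore the line of best fit.
Regression line of y on x:
Let the regression line of y on x be y = a + bx.
The normal equations given by the method of least squares are
$$ \sum y = na + b\sum x, $$
$$ \sum xy = a\sum x + b\sum x^2. $$
Dividing the first equation by n,
$$ \frac{1}{n}\sum y = a + \frac{b}{n}\sum x \;\Rightarrow\; \bar{y} = a + b\bar{x}, $$
so the regression line passes through $(\bar{x}, \bar{y})$. Writing $X = x - \bar{x}$ and $Y = y - \bar{y}$,
$$ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\sum XY}{\sum X^2} = \frac{\sum XY}{n\sigma_x^2} = r\,\frac{\sigma_y}{\sigma_x}, $$
and hence
$$ y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x}) \;\Rightarrow\; Y = b_{yx} X $$
is the regression line of y on x.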
σx
Note:
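1. Regression coefficient of y on x:
$$ b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = r\,\frac{\sigma_y}{\sigma_x}. $$
2. Regression coefficient of x on y:
$$ b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = r\,\frac{\sigma_x}{\sigma_y}. $$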
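
The two coefficients can be computed directly from the raw sums. This is a sketch that reuses the couples' ages from the correlation example as sample data; the function name `regression_coefficients` is illustrative. It also uses the fact that $b_{yx} b_{xy} = r^2$, so r is the geometric mean of the two coefficients with their common sign.

```python
import math


def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy) using the raw-sum forms of the two coefficients."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    b_yx = numerator / (n * sum_x2 - sum_x ** 2)
    b_xy = numerator / (n * sum_y2 - sum_y ** 2)
    return b_yx, b_xy


# Ages of the ten married couples from the correlation example.
xs = [23, 27, 28, 29, 30, 31, 33, 35, 36, 39]
ys = [18, 22, 23, 24, 25, 26, 28, 29, 30, 32]

b_yx, b_xy = regression_coefficients(xs, ys)

# r is the geometric mean of the two coefficients and carries their common sign.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

# Intercept of the line of y on x, from y_bar = a + b * x_bar.
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
a = y_bar - b_yx * x_bar
print(round(b_yx, 4), round(b_xy, 4), round(r, 4))
print(f"y = {a:.3f} + {b_yx:.3f} x")
```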

Examples:
1. If the two regression equations of the variables x and y are x = 19.13 - 0.87y and y = 11.64 - 0.5x, find
(a) the mean of x,
(b) the mean of y,
(c) the correlation coefficient between x and y.

Soln:
Since the point $(\bar{x}, \bar{y})$ lies on both regression lines,

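$$ \bar{x} = 19.13 - 0.87\bar{y}, \qquad \bar{y} = 11.64 - 0.5\bar{x}. $$
Solving, we get $\bar{x} = 15.93$, $\bar{y} = 3.67$.
The regression coefficients are $b_{yx} = -0.5$ and $b_{xy} = -0.87$. Since both are negative, r is negative:
$$ r = -\sqrt{0.5 \times 0.87} = -0.66. $$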
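
The means are the intersection of the two regression lines, so they can be found by solving a 2 x 2 linear system. This is a minimal sketch using Cramer's rule, taking the lines as x = 19.13 - 0.87y and y = 11.64 - 0.5x; the helper `solve_2x2` is an illustrative name.

```python
import math


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("the two lines are parallel or coincide")
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det


# x = 19.13 - 0.87 y  ->  1.00 x + 0.87 y = 19.13   (x on y, so b_xy = -0.87)
# y = 11.64 - 0.50 x  ->  0.50 x + 1.00 y = 11.64   (y on x, so b_yx = -0.50)
x_bar, y_bar = solve_2x2(1.0, 0.87, 19.13, 0.5, 1.0, 11.64)

b_xy, b_yx = -0.87, -0.5
# r has the common sign of the two regression coefficients.
r = math.copysign(math.sqrt(b_xy * b_yx), b_yx)
print(round(x_bar, 2), round(y_bar, 2), round(r, 2))
```

Substituting the computed means back into both equations is a quick way to check a hand solution.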


2. The following table shows the test scores obtained by salesmen on an intelligence test and their weekly sales.
Test scores(x) 1 2 3 4 5 6 7 8 9 10
sales(y) 2.5 6 4.5 5 4.5 2 5.5 3 4.5 3
Calculate the regression line of sales on test scores and estimate the most probable weekly sales volume if a salesman scores 70.
Soln:
$\bar{x} = 60$, $\bar{y} = 4.05$. The regression line of y on x is
$$ y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x}), $$
which gives y = 0.06x + 0.45.
When x = 70, y = 4.65.

3. In a partially destroyed laboratory record of an analysis of correlation data, only the following results are legible:
Variance of x = 9; regression equations 8x - 10y + 66 = 0, 40x - 18y = 214.
What are (i) the mean values of x and y,
(ii) the correlation coefficient between x and y,
(iii) the standard deviation of y?
Soln:
(i) Since both lines of regression pass through the point $(\bar{x}, \bar{y})$,
$$ 8\bar{x} - 10\bar{y} + 66 = 0, $$
$$ 40\bar{x} - 18\bar{y} - 214 = 0. $$
Solving these equations, we get $\bar{x} = 13$, $\bar{y} = 17$.
(ii) $\sigma_x^2 = 9 \;\Rightarrow\; \sigma_x = 3$.
Let 8x - 10y + 66 = 0 and 40x - 18y = 214 be the lines of regression of y on x and x on y respectively.

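Then
$$ b_{yx} = \frac{8}{10} = \frac{4}{5}, \qquad b_{xy} = \frac{18}{40} = \frac{9}{20}. $$
Hence
$$ r^2 = b_{yx}\, b_{xy} = \frac{9}{25}, \qquad r = \pm\frac{3}{5} = \pm 0.6. $$
Since both regression coefficients are positive, we take r = 0.6.
(iii) From $b_{yx} = r\,\sigma_y/\sigma_x$ we get $\sigma_y = b_{yx}\,\sigma_x / r = (0.8 \times 3)/0.6 = 4$, so the standard deviation of y is 4.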
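
The steps of this example can be sketched with exact rational arithmetic. The sketch also checks why the first line is taken as y on x: the opposite reading would give $r^2 > 1$. The variable names are illustrative.

```python
from fractions import Fraction
import math

# 8x - 10y + 66 = 0 read as y on x:  y = (8/10) x + 66/10
# 40x - 18y = 214   read as x on y:  x = (18/40) y + 214/40
b_yx = Fraction(8, 10)
b_xy = Fraction(18, 40)
r_squared = b_yx * b_xy
assert r_squared <= 1  # a valid assignment must give r^2 <= 1

# The opposite reading (first line as x on y, second as y on x) fails that check.
swapped = Fraction(10, 8) * Fraction(40, 18)

r = math.copysign(math.sqrt(float(r_squared)), float(b_yx))  # sign follows the coefficients
sigma_x = math.sqrt(9)                                       # variance of x is 9
sigma_y = float(b_yx) * sigma_x / r                          # from b_yx = r * sigma_y / sigma_x

# Means: intersection of 8x - 10y = -66 and 40x - 18y = 214, by Cramer's rule.
det = Fraction(8 * (-18) - (-10) * 40)
x_bar = Fraction((-66) * (-18) - (-10) * 214) / det
y_bar = Fraction(8 * 214 - 40 * (-66)) / det
print(x_bar, y_bar, r_squared, round(r, 2), round(sigma_y, 2))  # 13 17 9/25 0.6 4.0
```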
4. The following table gives the stopping distance y (in meters) of a motorbike moving at a speed of x km/hour when the brakes are applied.
x 16 24 32 40 48 56
y 0.39 0.75 1.23 1.91 2.77 3.81
Find the correlation coefficient between the speed and the stopping distance, and the equations of the regression lines. Hence estimate the maximum speed at which the motorbike could be driven if the stopping distance is not to exceed 5 meters.
Soln:
$\bar{x} = 36$, $\bar{y} = 1.81$, $\sigma_x = 13.663$, $\sigma_y = 1.1831$,
$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = 0.983, $$
$b_{yx} = 0.0851$, $b_{xy} = 11.352$.
The equation of the line of regression of y on x is y = 0.0851x - 1.2536, (i)
and the equation of the line of regression of x on y is x = 11.352y + 15.453. (ii)
For y = 5, equation (ii) gives x = 72.213.
Accordingly, for the stopping distance not to exceed 5 meters, the speed must not exceed 72 km/hour.
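
The calculation can be repeated without intermediate rounding. This is a sketch using the tabulated speeds and stopping distances; because the hand calculation rounds r and the standard deviations first, small differences in the third or fourth significant figure of $b_{xy}$ and the estimated speed are to be expected.

```python
import math

speed = [16, 24, 32, 40, 48, 56]                 # x, km/hour
distance = [0.39, 0.75, 1.23, 1.91, 2.77, 3.81]  # y, meters

n = len(speed)
x_bar = sum(speed) / n
y_bar = sum(distance) / n

# Sums of products and squares of deviations from the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
s_xx = sum((x - x_bar) ** 2 for x in speed)
s_yy = sum((y - y_bar) ** 2 for y in distance)

r = s_xy / math.sqrt(s_xx * s_yy)
b_yx = s_xy / s_xx  # slope of the line of y on x
b_xy = s_xy / s_yy  # slope of the line of x on y

# Line of x on y: x - x_bar = b_xy * (y - y_bar), evaluated at a 5 m stopping distance.
max_speed = x_bar + b_xy * (5 - y_bar)
print(round(r, 3), round(b_yx, 4), round(b_xy, 3), round(max_speed, 1))
```

The line of x on y is used for the estimate because the speed x is the quantity being predicted from a given stopping distance y.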
Exercise:
1. If the coefficient of correlation between the variables x and y is 0.5 and the acute angle between their lines of regression is $\tan^{-1}\left(\frac{3}{5}\right)$, find the ratio of the standard deviations of x and y.
Ans. $\frac{\sigma_x}{\sigma_y} = \frac{1}{2}$ or $\frac{\sigma_x}{\sigma_y} = \frac{2}{1}$.
2. Prove the following formulas for the coefficient of correlation r (in the usual notation):
a) $r = 1 - \frac{1}{2n}\sum\left(\frac{X_i}{\sigma_x} - \frac{Y_i}{\sigma_y}\right)^2$,
b) $r = -1 + \frac{1}{2n}\sum\left(\frac{X_i}{\sigma_x} + \frac{Y_i}{\sigma_y}\right)^2$.


3. The following table shows the ages x and the systolic pressures of 12 persons.

Age (x) 56 42 72 36 63 47 55 49 38 42 68 60
Blood Pressure (y) 147 125 160 118 149 128 150 145 115 140 152 155

Calculate the coefficient of correlation between x and y. Estimate the blood pressure
of a person whose age is 45 years.
Ans. r = 0.8961, y = 80.78 + 1.138 x , when x = 45, y = 132.
4. The height (inches) and weight (pounds) of baseball players are given below:
(76, 212), (76, 224), (72, 180), (74, 210), (75, 215), (71, 200), (77, 235), (78, 235),
(77, 194), (76, 185).
(i) Estimate the coefficient of correlation between weight and height of baseball
players.
(ii) Find the regression line between weight and height. Use the regression equation to
find the weight of a baseball player that is 68 inches tall.
Ans. r = 0.5529, y = 4.737x - 147.227, x = 0.064y + 61.712; when x = 68, y = 174.89.
5. The equations of the regression lines of two variables x and y are 4x - 5y + 33 = 0 and 20x - 9y = 107. Find the correlation coefficient and the means of x and y.
Ans. r = 0.6, mean of x = 13 and mean of y = 17.
6. If the tangent of the angle between the lines of regression of y on x and x on y is 0.6 and the standard deviation of y is twice the standard deviation of x, find the coefficient of correlation between x and y.
Ans. r = 0.5.

Resources:
1. https://nptel.ac.in/courses/111105042/
2. http://www.nptelvideos.in/2012/12/regression-analysis.html
3. https://nptel.ac.in/courses/111104074/
