Unit 3 - Statistics
UNIT-III
STATISTICS
Expand knowledge and skills in statistical concepts and gain experience
relevant to the needs of statistical data analysis.
Understand the Central Moments, Skewness and Kurtosis.
Describe the principle of least squares.
Fit data using several types of curves.
Describe & evaluate the concept of correlation and regression coefficients.
Investigate the strength and direction of a relationship between two variables by
collecting measurements and using appropriate statistical analysis.
Introduction:
In many fields of Applied Mathematics and Engineering we encounter problems and perform
experiments involving two variables. In this chapter, we consider the mathematical theory of
statistics, by presenting an elementary treatment of Central moments, mean, variance,
coefficients of skewness and kurtosis in terms of moments, curve fitting, correlation and
regression. In mathematics, a moment is a specific quantitative measure of the shape of a
function. It is used in both mechanics and statistics. If the function represents
physical density, then the zeroth moment is the total mass, the first moment divided by the
total mass is the center of mass, and the second moment is the rotational inertia. If the
function is a probability distribution, then the zeroth moment is the total probability (i.e. one),
the first moment is the mean, the second central moment is the variance, the
third standardized moment is the skewness, and the fourth standardized moment is
the kurtosis.
Moments:
In mechanics, moment refers to the turning or the rotating effect of a force whereas it is used
to describe the peculiarities of a frequency distribution in statistics. We can measure the
central tendency of a set of observations by using moments. Moments also help in measuring
the scatteredness, asymmetry and peakedness of a curve for a particular distribution.
Moments refer to the average of the deviations from the mean, or from some other value, raised to a
certain power. The arithmetic mean of the various powers of these deviations in any distribution
is called the moments of the distribution about the mean. Moments about the mean are generally
used in statistics.
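Note: If $d_i = \dfrac{x_i - A}{h}$ or $d_i = \dfrac{x_i - \bar{x}}{h}$, then the $r$th order moments about an arbitrary point $A$
and about the mean $\bar{x}$ are defined respectively by
$$\mu'_r = \frac{1}{N}\sum_{i=1}^{n} f_i d_i^{\,r}\, h^r \quad \text{and} \quad \mu_r = \frac{1}{N}\sum_{i=1}^{n} f_i d_i^{\,r}\, h^r, \qquad r = 1, 2, 3, \ldots$$
where $N = \sum f_i$.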
Relation between raw (Moments about origin or any point) and Central Moments
The central moments can be expressed in terms of raw moments and vice-versa. The general
relation between the moments about mean in terms of moments about any point is given by,
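$\mu_2 = \mu'_2 - \mu_1'^2$, $\mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2\mu_1'^3$ and $\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2\mu_1'^2 - 3\mu_1'^4$,
where $\mu'_r$ denotes the $r$th moment about an arbitrary point and $\mu_r$ the $r$th moment about the mean.
Conversely,
$\mu'_r = \mu_r + {}^{r}C_1\,\mu_{r-1}\,\mu'_1 + {}^{r}C_2\,\mu_{r-2}\,\mu_1'^2 + \cdots + \mu_1'^r$, $r = 1, 2, 3, \ldots$ (7)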
Example 1: The first four moments of a distribution about the value 4 of the variables are
-1.5, 17, -30 and 108. Find the moments about the mean.
Solution: Given $A = 4$, $\mu'_1 = -1.5$, $\mu'_2 = 17$, $\mu'_3 = -30$ and $\mu'_4 = 108$.
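The moments about the mean follow by substituting the given values into the relations between raw and central moments stated earlier. The short Python sketch below carries out that substitution; the function name `central_from_raw` is only illustrative and is not part of these notes.

```python
def central_from_raw(m1, m2, m3, m4):
    """Convert the first four moments about an arbitrary point A
    into the first four moments about the mean."""
    mu1 = 0.0                      # the first central moment is always zero
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return mu1, mu2, mu3, mu4


# Example 1: moments about the value A = 4
A = 4
raw = (-1.5, 17, -30, 108)
mu1, mu2, mu3, mu4 = central_from_raw(*raw)
mean = A + raw[0]                  # mean = A + first moment about A
print(mean, mu2, mu3, mu4)
```

For the data of Example 1 this gives mean = 2.5, $\mu_2 = 14.75$, $\mu_3 = 39.75$ and $\mu_4 = 142.3125$.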
Example 2: Calculate the first four moments of the following distribution about the mean.
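x: 0 1 2 3 4 5 6 7 8
f: 1 8 28 56 70 56 28 8 1
Solution: With $d = x - \bar{x}$ and $N = \sum f$,
$\mu_1 = \dfrac{\sum fd}{N} = 0$, $\mu_2 = \dfrac{\sum fd^2}{N} = 2$, $\mu_3 = \dfrac{\sum fd^3}{N} = 0$, $\mu_4 = \dfrac{\sum fd^4}{N} = 11$.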
Calculate the first four central moments of the following distribution.
Wages:      1.5 - 2.5   2.5 - 3.5   3.5 - 4.5   4.5 - 5.5   5.5 - 6.5
Frequency:  1           3           7           3           1

Wages       f    Mid-point x   d = (x - x̄)   fd    fd²   fd³   fd⁴
1.5 - 2.5   1    2             -2            -2    4     -8    16
2.5 - 3.5   3    3             -1            -3    3     -3    3
3.5 - 4.5   7    4             0             0     0     0     0
4.5 - 5.5   3    5             1             3     3     3     3
5.5 - 6.5   1    6             2             2     4     8     16
Total       N = 15                           ∑fd = 0   ∑fd² = 14   ∑fd³ = 0   ∑fd⁴ = 38
$\mu_1 = \dfrac{\sum fd}{N} = 0$, $\mu_2 = \dfrac{\sum fd^2}{N} = 0.933$, $\mu_3 = \dfrac{\sum fd^3}{N} = 0$, $\mu_4 = \dfrac{\sum fd^4}{N} = 2.533$.
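The tabular computations above can be checked mechanically. The following is a minimal sketch, assuming the data are given as values (or class mid-points) with matching frequencies; the function name `grouped_central_moments` is an illustrative choice.

```python
def grouped_central_moments(xs, fs, orders=(1, 2, 3, 4)):
    """Mean and central moments of a frequency distribution.

    xs: values or class mid-points, fs: corresponding frequencies.
    """
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    moments = [
        sum(f * (x - mean) ** r for x, f in zip(xs, fs)) / n
        for r in orders
    ]
    return mean, moments


# Example 2: x = 0..8 with the given frequencies
mean_a, moments_a = grouped_central_moments(
    list(range(9)), [1, 8, 28, 56, 70, 56, 28, 8, 1]
)

# Wages example: class mid-points 2..6 with frequencies 1, 3, 7, 3, 1
mean_b, moments_b = grouped_central_moments([2, 3, 4, 5, 6], [1, 3, 7, 3, 1])

print(mean_a, moments_a)
print(mean_b, [round(m, 3) for m in moments_b])
```

The first call reproduces the moments 0, 2, 0 and 11 of Example 2; the second reproduces 0, 0.933, 0 and 2.533 for the wages data.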
Skewness and Kurtosis:
Averages tell us about the central value of the distribution and measures of dispersion tell us
about the concentration of the items around a central value. These measures do not reveal
whether the dispersal of values on either side of an average is symmetrical or not. If
observations are arranged in a symmetrical manner around a measure of central tendency, we
get a symmetrical distribution; otherwise, they are arranged in an asymmetrical order, which
gives an asymmetrical distribution.
Fourth Semester 4 Statistics (18MA41B)
Department of Mathematics
Measures of Skewness and Kurtosis, like measures of central tendency and dispersion, study
the characteristics of a frequency distribution. Thus, skewness is a measure that studies the
degree and direction of departure from symmetry.
A symmetrical distribution gives a ‘symmetrical curve’, where the values of the mean, median
and mode are exactly equal. On the other hand, in an asymmetrical distribution, the values of
mean, median and mode are not equal. When two or more symmetrical distributions are
compared, the difference between them is studied with ‘Kurtosis’. On the other hand, when two or
more asymmetrical distributions are compared, they will give different degrees of Skewness.
In this sense the two measures play complementary roles: kurtosis compares distributions that
are symmetrical, while skewness measures the departure from symmetry.
Measures of Kurtosis:
Kurtosis enables us to have an idea about the flatness or peakedness of the curve. It is
measured by the Karl Pearson coefficient $\beta_2$, which is given by
$\beta_2 = \dfrac{\mu_4}{\mu_2^2}$.
Kurtosis studies the concentration of the items at the central part of a series. In the following
figure, all three curves A, B and C are symmetrical about the mean.
Curve of the type ‘A’ which is neither flat nor peaked is called the normal curve or
‘MESOKURTIC’ curve (β2 = 3). If items concentrate too much at the center (more peaked
than the normal curve), the curve of the type ‘C’ becomes ‘LEPTOKURTIC’ curve (β2 > 3).
If the concentration at the center is comparatively less (flatter than the normal curve), the
curve of the type ‘B’ becomes ‘PLATYKURTIC’ curve (β2 < 3).
Measures of Skewness:
Literally, skewness means ‘lack of symmetry’. A distribution is said to be skewed if
(i) Mean, Median and Mode fall at different points.
(ii) The curve drawn with the help of the given data is not symmetrical but stretched more to
one side than to the other.
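Karl Pearson's coefficient of skewness is
$S_k = \dfrac{\text{Mean} - \text{Mode}}{\sigma}$, where $\sigma$ is the standard deviation of the distribution.
Based upon moments, the coefficient of skewness is defined as follows:
$S_k = \dfrac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$, where $\beta_1 = \dfrac{\mu_3^2}{\mu_2^3}$ and $\beta_2 = \dfrac{\mu_4}{\mu_2^2}$.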
Nature of Skewness:
Skewness can be positive or negative or zero. The direction of skewness is determined by
observing whether the mean is greater than the mode (positive skewness) or less than the
mode (negative skewness).
(i) When the values of mean, median and mode are equal, there is no skewness.
(ii) When mean > median > mode, skewness will be positive.
(iii) When mean < median < mode, skewness will be negative.
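The coefficients $\beta_1$ and $\beta_2$ can be computed directly from a frequency distribution. The sketch below does this and classifies the kurtosis using the thresholds given above. The function names are illustrative, the data are those of Exercise 3 of this section, and reading the direction of skewness from the sign of $\mu_3$ is an assumption of this sketch (a common moment-based convention), whereas the notes compare mean and mode.

```python
def skewness_kurtosis(xs, fs):
    """Return (mu3, beta1, beta2) for values xs with frequencies fs."""
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n

    def mu(r):
        # r-th central moment of the frequency distribution
        return sum(f * (x - mean) ** r for x, f in zip(xs, fs)) / n

    mu2, mu3, mu4 = mu(2), mu(3), mu(4)
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    return mu3, beta1, beta2


def kurtosis_type(beta2, tol=1e-9):
    """Classify a curve by comparing beta2 with 3."""
    if abs(beta2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if beta2 > 3 else "platykurtic"


def skewness_direction(mu3, tol=1e-9):
    """Assumed convention: the sign of mu3 gives the direction of skewness."""
    if abs(mu3) <= tol:
        return "symmetrical"
    return "positively skewed" if mu3 > 0 else "negatively skewed"


xs = [5, 10, 15, 20, 25, 30, 35]
fs = [4, 10, 20, 36, 16, 12, 2]
mu3, beta1, beta2 = skewness_kurtosis(xs, fs)
print(round(beta1, 6), round(beta2, 4))
print(skewness_direction(mu3), kurtosis_type(beta2))
```

For this data the sketch gives $\beta_1 \approx 0.001785$ and $\beta_2 \approx 2.75$, so the curve is very nearly symmetrical and slightly platykurtic.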
Wages   f    Mid-point x   d = (x - 17)/2   fd    fd²   fd³   fd⁴
10-12   1    11            -3               -3    9     -27   81
12-14   3    13            -2               -6    12    -24   48
14-16   7    15            -1               -7    7     -7    7
16-18   20   17            0                0     0     0     0
18-20   12   19            1                12    12    12    12
20-22   4    21            2                8     16    32    64
22-24   3    23            3                9     27    81    243
Total   N = 50                              ∑fd = 13   ∑fd² = 83   ∑fd³ = 67   ∑fd⁴ = 455
Moments about the point A = 17 (with h = 2):
$\mu'_1 = \dfrac{\sum fd}{N}\, h = 0.52$, $\mu'_2 = \dfrac{\sum fd^2}{N}\, h^2 = 6.64$, $\mu'_3 = \dfrac{\sum fd^3}{N}\, h^3 = 10.72$, $\mu'_4 = \dfrac{\sum fd^4}{N}\, h^4 = 145.6$.
Moments about mean:
$\mu_1 = 0$, $\mu_2 = \mu'_2 - \mu_1'^2 = 6.64 - 0.2704 = 6.3696$,
$\mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2\mu_1'^3 = 10.72 - 3(6.64)(0.52) + 2(0.52)^3 = 0.6428$,
$\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2\mu_1'^2 - 3\mu_1'^4 = 145.6 - 4(10.72)(0.52) + 6(6.64)(0.52)^2 - 3(0.52)^4 = 133.8558$.
So, we have $\beta_1 = \dfrac{\mu_3^2}{\mu_2^3} = 0.0016$, $\beta_2 = \dfrac{\mu_4}{\mu_2^2} = 3.2992$.
Exercise:
1. The first four raw moments of a distribution are 2, 136, 320 and 40,000. Find the
coefficients of skewness and kurtosis.
Ans. $\beta_1 = \dfrac{\mu_3^2}{\mu_2^3} = 0.1002$, $\beta_2 = \dfrac{\mu_4}{\mu_2^2} = 2.333$.
2. Find the second, third and fourth central moments of the frequency distribution given
below. Hence, find (i) a measure of skewness and (ii) a measure of kurtosis.
Ans.
$\mu_2 = 2.16$, $\mu_3 = 0.804$, $\mu_4 = 12.5232$,
$\gamma_1 = \sqrt{\beta_1} = 0.25298$; $\gamma_2 = \beta_2 - 3 = -0.317$.
3. Find the second, third and fourth central moments of the frequency distribution
given below. Hence, find (i) a measure of skewness and (ii) a measure of kurtosis.
x: 5 10 15 20 25 30 35
f: 4 10 20 36 16 12 2
Ans.
$\mu_2 = 44.41$, $\mu_3 = -12.504$, $\mu_4 = 5423.5057$, $\beta_1 = 0.001785$,
$\beta_2 = 2.7499$, $\gamma_1 = \sqrt{\beta_1} = 0.0422$; $\gamma_2 = \beta_2 - 3 = -0.2501$.
4. Compute the first four moments about mean from the following data. Hence, find (i) a
measure of skewness and (ii) a measure of kurtosis.
Class Intervals: 0 -10 10 – 20 20 – 30 30 – 40
Frequency: 1 3 4 2
Ans.
$\mu_1 = 0$, $\mu_2 = 81$, $\mu_3 = -144$, $\mu_4 = 14817$, $\beta_1 = 0.03902$,
$\beta_2 = 2.2583$, $\gamma_1 = \sqrt{\beta_1} = 0.1975$; $\gamma_2 = \beta_2 - 3 = -0.7417$.
Fitting of polynomial:
Approximating a data set using a polynomial equation is useful when conducting engineering
calculations as it allows results to be quickly updated when inputs change without the need
for manual lookup of the dataset. The most common method to generate a polynomial
equation from a given data set is the least squares method. We will discuss the fitting of the
following types of the curves.
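Fitting of a straight line: $y = a + bx$
Let $(x_i, y_i)$, $i = 1, 2, \ldots, n$ be the given data. By the principle of least squares, $a$ and $b$ are
chosen so that the sum of the squares of the errors, $E = \sum_{1}^{n} (y - a - bx)^2$, is a minimum.
$\dfrac{\partial E}{\partial a} = 0 \Rightarrow -2\sum_{1}^{n} (y - a - bx) = 0 \Rightarrow \sum y = na + b\sum x$,
$\dfrac{\partial E}{\partial b} = 0 \Rightarrow 2\sum_{1}^{n} (y - a - bx)(-x) = 0 \Rightarrow \sum xy = a\sum x + b\sum x^2$.
The normal equations for estimating the values of a and b are
$\sum y = na + b\sum x$,
$\sum xy = a\sum x + b\sum x^2$.
Solving the above normal equations we estimate the values of a & b. With these values of a
and b, $y = a + bx$ is the line of best fit.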
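Fitting of a second degree parabola: $y = a + bx + cx^2$
Here $E = \sum_{1}^{n} (y - a - bx - cx^2)^2$ is to be a minimum.
$\dfrac{\partial E}{\partial a} = 0 \Rightarrow \sum y = na + b\sum x + c\sum x^2$,
$\dfrac{\partial E}{\partial b} = 0 \Rightarrow 2\sum_{1}^{n} (y - a - bx - cx^2)(-x) = 0 \Rightarrow \sum xy = a\sum x + b\sum x^2 + c\sum x^3$,
$\dfrac{\partial E}{\partial c} = 0 \Rightarrow 2\sum_{1}^{n} (y - a - bx - cx^2)(-x^2) = 0 \Rightarrow \sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$.
The normal equations for estimating the values of a, b, c are
$\sum y = na + b\sum x + c\sum x^2$,
$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$,
$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$.
Solving the above equations we estimate the values of a, b & c. With these values of a, b &
c, $y = a + bx + cx^2$ is the parabola of best fit.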
$\sum xY = A\sum x + b\sum x^2$.
$\sum XY = A\sum X + b\sum X^2$.
Solving the above equations we estimate the values of a & b. With these values of a and b,
x: 1 6 11 16 20 26
y: 13 16 17 23 24 31
$\sum xy = a\sum x + b\sum x^2$.
Given n = 6
x         y          x²          xy
1         13         1           13
6         16         36          96
11        17         121         187
16        23         256         368
20        24         400         480
26        31         676         806
∑x = 80   ∑y = 124   ∑x² = 1490   ∑xy = 1950
6a + 80b = 124
80a + 1490b = 1950
x: 1 2 3 4 5 6
y: 6 4 3 5 4 2
Soln:
Let y = a + bx be the straight line.
The normal equations for estimating the values of a and b are
$\sum y = na + b\sum x$, $\sum xy = a\sum x + b\sum x^2$.
Here n = 6 and following the procedure as in example 1 we get
$\sum x = 21$, $\sum y = 24$, $\sum xy = 75$, $\sum x^2 = 91$.
Therefore, we get 24 = 6a + 21b, 75 = 21a + 91b.
Solving, we get a = 5.8, b = -0.514.
Therefore the equation of best fit is y = 5.8 - 0.514x.
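The normal equations for a straight line can be solved in closed form, which makes the hand computations easy to check. The following is a minimal sketch; the function name `fit_line` is illustrative and the two data sets are the ones tabulated in the straight-line examples above.

```python
def fit_line(xs, ys):
    """Least squares line y = a + b*x from the two normal equations
    sum(y) = n*a + b*sum(x) and sum(xy) = a*sum(x) + b*sum(x^2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx          # determinant of the 2x2 system
    a = (sy * sxx - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b


# Data with sums 6a + 80b = 124 and 80a + 1490b = 1950
a1, b1 = fit_line([1, 6, 11, 16, 20, 26], [13, 16, 17, 23, 24, 31])

# Data with sums 24 = 6a + 21b and 75 = 21a + 91b
a2, b2 = fit_line([1, 2, 3, 4, 5, 6], [6, 4, 3, 5, 4, 2])

print(round(a1, 4), round(b1, 4))
print(round(a2, 4), round(b2, 4))
```

For the first data set this gives a of about 11.32 and b of about 0.70; for the second it gives a of about 5.8 and b of about -0.514.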
$\sum Y = nA + b\sum x$ and $\sum xY = A\sum x + b\sum x^2$
x       y     Y = log y     xY           x²
1       5     0.6990        0.6990       1
2       10    1.0000        2.0000       4
4       30    1.4771        5.9085       16
∑x = 7        ∑Y = 3.1761   ∑xY = 8.6095   ∑x² = 21
3A + 7b = 3.1761
7A + 21b = 8.6095
x: 0 2 4
y: 8.12 10 31.82
Soln:
$\sum Y = nA + b\sum x$ and $\sum xY = A\sum x + b\sum x^2$
Here n = 3 and following the procedure as in example 4 we get
$\sum Y = 7.85$, $\sum xY = 18.44$, $\sum x = 6$, $\sum x^2 = 20$.
3A + 6b = 7.85
6A + 20b = 18.44
By solving these equations, we get A = 1.932 and b = 0.3425, so that a = antilog(A) = 6.903.
Therefore $y = 6.903\, e^{0.3425x}$ is the curve of best fit.
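The same procedure can be sketched in Python: take logarithms of y, fit a straight line to (x, Y), and transform the intercept back. This is a minimal sketch assuming the curve $y = a e^{bx}$ with natural logarithms, so that $Y = \ln y$ and $A = \ln a$; the function name `fit_exponential` is illustrative.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares on Y = ln(y) = A + b*x."""
    n = len(xs)
    Y = [math.log(y) for y in ys]    # natural logarithm of each y
    sx, sY = sum(xs), sum(Y)
    sxx = sum(x * x for x in xs)
    sxY = sum(x * v for x, v in zip(xs, Y))
    det = n * sxx - sx * sx
    A = (sY * sxx - sx * sxY) / det  # intercept of the fitted line
    b = (n * sxY - sx * sY) / det    # slope of the fitted line
    return math.exp(A), b            # a = antilog(A)


a, b = fit_exponential([0, 2, 4], [8.12, 10, 31.82])
print(round(a, 3), round(b, 4))
```

Because no intermediate sums are rounded, the values come out near a = 6.93 and b = 0.341, slightly different from the hand computation above, which rounds the sums to two decimals before solving.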
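6. Fit a curve of the form $y = ab^x$ to the data by the method of least squares.
x: 2 4 6 8 10
y: 1 3 6 12 24
Soln:
Let $y = ab^x$ ………….(1) be the required curve.
Taking log on both sides of (1) we get $\log y = \log a + x \log b$, i.e. $Y = A + Bx$, where
$Y = \log y$, $A = \log a$ and $B = \log b$. The normal equations are
$\sum Y = nA + B\sum x$ and $\sum xY = A\sum x + B\sum x^2$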
Here n = 5 and following the procedure as in example 4 we get
relation $PV^{\gamma} = K$ (constant). Find the best fitting equation of this form to the following data
and estimate V when P = 4.
P: 0.5 1.0 1.5 2.0 2.5 3.0
V: 1620 1000 750 620 520 460
Soln: Let $PV^{\gamma} = K$ …….. (1) be the given relation. Taking log on both sides of (1) and
simplifying we get
$\sum \log V = 39.73$, $\sum \log P = 2.42$,
x: 0 1 2 3 4
y: 1 3 4 5 6
Soln: Let $y = a + bx + cx^2$ be the second degree polynomial; we have to determine a, b
and c.
Normal equations for the second degree parabola are
$\sum y = na + b\sum x + c\sum x^2$,
$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$,
$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$.
x    y    xy    x²    x²y    x³    x⁴
0    1    0     0     0      0     0
1    3    3     1     3      1     1
2    4    8     4     16     8     16
3    5    15    9     45     27    81
4    6    24    16    96     64    256
∑x = 10   ∑y = 19   ∑xy = 50   ∑x² = 30   ∑x²y = 160   ∑x³ = 100   ∑x⁴ = 354
Substituting the above values in the normal equations and solving we get a = 1.114,
b = 1.7714, c = -0.1429.
Therefore the second degree parabola of best fit is $y = 1.114 + 1.7714x - 0.1429x^2$.
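The three normal equations form a linear system in a, b and c. The sketch below builds that system from the power sums and solves it by Gaussian elimination; the helper names `solve_linear` and `fit_parabola` are illustrative, and any linear solver could be substituted.

```python
def solve_linear(matrix, rhs):
    """Solve a small square linear system by Gaussian elimination
    with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        s = sum(m[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (m[i][n] - s) / m[i][i]
    return sol


def fit_parabola(xs, ys):
    """Least squares parabola y = a + b*x + c*x^2 via the normal equations."""
    def sx(p):
        return sum(x ** p for x in xs)

    def sxy(p):
        return sum((x ** p) * y for x, y in zip(xs, ys))

    normal = [
        [len(xs), sx(1), sx(2)],
        [sx(1), sx(2), sx(3)],
        [sx(2), sx(3), sx(4)],
    ]
    return solve_linear(normal, [sxy(0), sxy(1), sxy(2)])


a, b, c = fit_parabola([0, 1, 2, 3, 4], [1, 3, 4, 5, 6])
print(round(a, 4), round(b, 4), round(c, 4))
```

For the data above this gives a of about 1.1143, b of about 1.7714 and c of about -0.1429.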
9. Fit a curve of the form $y = a + bx + cx^2$ to the data by the method of least squares.
x: 0 1 2 3 4
y: 1 1.8 1.3 2.5 6.3
Soln: Let $y = a + bx + cx^2$ be the second degree parabola; we have to determine a, b and
c.
Normal equations for the second degree parabola are
$\sum y = na + b\sum x + c\sum x^2$,
$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$,
$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$.
Fit a straight line to the data and estimate the production in the year 2015.
Soln:
For convenience in computations, let us set X = x - 1967 and let y = a + bX be the straight
line.
The normal equations for estimating the values of a and b are
$\sum y = na + b\sum X$, $\sum Xy = a\sum X + b\sum X^2$.
Exercise:
1. An experiment gave the following data:
x 1 3 4 6 8 9 11 14
y 1 2 4 4 5 7 8 9
It is known that x and y are connected by the relation $y = a_0 + a_1 x$. Find the best
values of $a_0$ and $a_1$ using the least squares method.
Ans. $a_0 = 0.5455$, $a_1 = 0.6364$ and $y = 0.5455 + 0.6364x$
2. The number y of bacteria per unit volume present in a culture after x hours is given by
the following table :
x 0 1 2 3 4 5 6
y 32 47 65 92 132 190 275
Fit a curve of the form $y = ab^x$ to the data. Estimate the value of y when x = 7.
Ans. a = 32.14, b = 1.4270 and $y = 32.14\,(1.4270)^x$, $y_7 = 387$.
3. The following table gives the production (in thousands units) of a certain commodity in
different years:
Year (x) 1941 1951 1961 1971 1981 1991 2001
Production ( y) 3.9 5.3 7.3 9.6 12.9 17.1 23.2
Fit a curve of the form $y = ab^x$ to this data and estimate the production in the year
2006.
Ans. a = 9.5735, b = 1.3433 and $y = 9.5735\,(1.3433)^x$, $y_{2006} = 27.5 \times 1000$ quintals
V 20 30 40 50 60 70
R 54 90 138 206 292 396
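$\bar{x} = \dfrac{\sum x}{n}$ → mean of the x series, $\bar{y} = \dfrac{\sum y}{n}$ → mean of the y series.
For computation purposes we can use the formula
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\;\sqrt{n\sum y^2 - (\sum y)^2}}.$$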
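To show that $-1 \le r \le 1$, write $a_i = x_i - \bar{x}$ and $b_i = y_i - \bar{y}$. Then
$$r = \frac{\frac{1}{n}\sum a_i b_i}{\sqrt{\frac{1}{n}\sum a_i^2}\;\sqrt{\frac{1}{n}\sum b_i^2}}, \qquad r^2 = \frac{\left(\sum a_i b_i\right)^2}{\sum a_i^2 \sum b_i^2}. \quad (1)$$
By the Schwarz inequality, which states that if $a_i$, $b_i$, $i = 1, 2, \ldots, n$ are real quantities then
$$\left(\sum a_i b_i\right)^2 \le \sum a_i^2 \sum b_i^2,$$
with equality if and only if
$$\frac{a_1}{b_1} = \frac{a_2}{b_2} = \frac{a_3}{b_3} = \cdots = \frac{a_n}{b_n}.$$
Using this, equation (1) becomes $r^2 \le 1$,
i.e. $|r| \le 1$,
i.e. $-1 \le r \le 1$.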
Examples:
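1. If r is the correlation coefficient between x and y and z = ax + by, show that
$$r = \frac{\sigma_z^2 - (a^2\sigma_x^2 + b^2\sigma_y^2)}{2ab\,\sigma_x\sigma_y}.$$
Soln:
Let $z = ax + by$. Then $\dfrac{1}{n}\sum z = a\,\dfrac{1}{n}\sum x + b\,\dfrac{1}{n}\sum y$, i.e. $\bar{z} = a\bar{x} + b\bar{y}$,
so that $z - \bar{z} = a(x - \bar{x}) + b(y - \bar{y})$. Squaring, summing over the data and dividing by n gives
$\sigma_z^2 = a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\,r\,\sigma_x\sigma_y$, since $\dfrac{1}{n}\sum (x - \bar{x})(y - \bar{y}) = r\,\sigma_x\sigma_y$. Hence
$$r = \frac{\sigma_z^2 - (a^2\sigma_x^2 + b^2\sigma_y^2)}{2ab\,\sigma_x\sigma_y}.$$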
2. While calculating the correlation coefficient between x and y from 25 pairs of
down the pairs (8,12) and (6,8) as (6,12) and (8,6) respectively. Obtain the correct
value of the correlation coefficient.
Soln:
n = 25,
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\;\sqrt{n\sum y^2 - (\sum y)^2}} = 0.51912.$$
3. The following Table gives the age (in years) of 10 married couples. Calculate the coefficient
of correlation between these ages.
Age of Husband(x) 23 27 28 29 30 31 33 35 36 39
Age of wife(y) 18 22 23 24 25 26 28 29 30 32
With $X_i = x_i - \bar{x}$ and $Y_i = y_i - \bar{y}$,
$$r = \frac{\sum X_i Y_i}{\sqrt{\sum X_i^2 \sum Y_i^2}} = 0.9955 \approx 1,$$
i.e, the ages of husbands and wives are almost perfectly correlated.
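The computational formula for r translates directly into a few lines of Python. The sketch below applies it to the ages of the ten couples; the function name `correlation` is illustrative.

```python
import math


def correlation(xs, ys):
    """Karl Pearson's coefficient of correlation using the
    computational formula with raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator


husband = [23, 27, 28, 29, 30, 31, 33, 35, 36, 39]
wife = [18, 22, 23, 24, 25, 26, 28, 29, 30, 32]
r = correlation(husband, wife)
print(round(r, 4))
```

The result agrees with the value r = 0.9955 obtained above, and swapping the two lists leaves it unchanged, since r is symmetrical in x and y.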
Regression :
Correlation describes the strength of an association between two variables, and is completely
symmetrical: the correlation between A and B is the same as the correlation between B and
A. However, if the two variables are related it means that when one changes by a certain
amount the other changes on an average by a certain amount. The relationship can be
represented by a simple equation called the regression equation. In this context "regression"
(the term is a historical anomaly) simply means that the average value of y is a "function" of
x, that is, it changes with x.
Regression analysis is a mathematical measure of the average relationship between two or
more variables in terms of the original units of data.
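The normal equations for the line $y = a + bx$ are
$\sum y = na + b\sum x$,
$\sum xy = a\sum x + b\sum x^2$.
Dividing the first equation by n gives $\dfrac{1}{n}\sum y = a + b\,\dfrac{1}{n}\sum x$, i.e. $\bar{y} = a + b\bar{x}$. Hence
$y = a + bx$ is the regression line passing through $(\bar{x}, \bar{y})$.
1. Regression coefficient of y on x
$$b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = r\,\frac{\sigma_y}{\sigma_x}.$$
2. Regression coefficient of x on y
$$b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = r\,\frac{\sigma_x}{\sigma_y}.$$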
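The two regression coefficients share the same numerator and differ only in the denominator. The sketch below computes both from raw data and uses the standard identity $b_{yx}\, b_{xy} = r^2$ as a check; the function name `regression_coefficients` is illustrative and the data are the ages of the ten married couples from the correlation example.

```python
import math


def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy), the regression coefficients of y on x
    and of x on y, using the raw-sum formulas."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sxy - sx * sy            # common to both coefficients
    b_yx = numerator / (n * sxx - sx ** 2)
    b_xy = numerator / (n * syy - sy ** 2)
    return b_yx, b_xy


x = [23, 27, 28, 29, 30, 31, 33, 35, 36, 39]
y = [18, 22, 23, 24, 25, 26, 28, 29, 30, 32]
b_yx, b_xy = regression_coefficients(x, y)

# r carries the common sign of the two regression coefficients
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

# The line of y on x passes through the point of means
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
intercept = y_bar - b_yx * x_bar
print(round(b_yx, 4), round(b_xy, 4), round(r, 4), round(intercept, 4))
```

For this data $b_{yx}$ is about 0.879 and $b_{xy}$ is about 1.128, and their geometric mean recovers r = 0.9955.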
Examples:
1. If two regression equations of the variables x and y are x = 19.13 - 0.87y, y = 11.6 - 0.5x,
find
(a) mean of x
(b) mean of y
(c)The correlation coefficient between x and y.
Soln:
Since the means $\bar{x}$ and $\bar{y}$ lie on both regression lines,
8 x -10 y + 66 = 0,
40 x -18 y - 214 = 0.
(ii) $\sigma_x^2 = 9 \Rightarrow \sigma_x = 3$
Let 8x - 10y + 66 = 0 and 40x - 18y = 214 be the lines of regression of y on x
and x on y respectively
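Given the two regression lines, the means are found as their point of intersection and r from the product of the two regression coefficients. The sketch below does this for the pair of lines just stated, rewritten as y = 0.8x + 6.6 and x = 0.45y + 5.35; the function name `means_and_r` and its argument names are illustrative.

```python
import math


def means_and_r(b_yx, c_y, b_xy, c_x):
    """Means and correlation coefficient from the two regression lines
    y = b_yx * x + c_y  (y on x)  and  x = b_xy * y + c_x  (x on y)."""
    # Substitute the first line into the second and solve for x
    x_bar = (b_xy * c_y + c_x) / (1 - b_yx * b_xy)
    y_bar = b_yx * x_bar + c_y
    # r^2 = b_yx * b_xy, and r has the sign of the regression coefficients
    r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
    return x_bar, y_bar, r


# 8x - 10y + 66 = 0  ->  y = 0.8x + 6.6
# 40x - 18y = 214    ->  x = 0.45y + 5.35
x_bar, y_bar, r = means_and_r(0.8, 6.6, 0.45, 5.35)
print(round(x_bar, 4), round(y_bar, 4), round(r, 4))
```

This gives means of 13 and 17 and r = 0.6. If the roles of the two lines were interchanged, the product of the slopes would exceed 1, which is how the assignment of "y on x" and "x on y" can be checked.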
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\;\sqrt{n\sum y^2 - (\sum y)^2}} = 0.983.$$
The equation of the line of regression of y on x is y = 0.0851x - 1.2536 (i)
and the equation of the line of regression of x on y is x =11.352y + 15.453. (ii)
For y = 5, equation (ii) gives x = 72.213.
Accordingly, for the stopping distance not to exceed 5 meters, the speed must not
exceed 72 km/hour.
Exercise:
1. If the coefficient of correlation between the variables x and y is 0.5 and the acute
angle between their lines of regression is $\tan^{-1}\left(\dfrac{3}{5}\right)$, find the ratio of the standard
deviations of x and y.
Ans. $\dfrac{\sigma_x}{\sigma_y} = \dfrac{1}{2}$ or $\dfrac{\sigma_x}{\sigma_y} = \dfrac{2}{1}$.
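2. Prove the following formulas for the coefficient of correlation r (in the usual notation)
a) $r = 1 - \dfrac{1}{2n}\sum \left(\dfrac{X_i}{\sigma_x} - \dfrac{Y_i}{\sigma_y}\right)^2$, $\quad r = -1 + \dfrac{1}{2n}\sum \left(\dfrac{X_i}{\sigma_x} + \dfrac{Y_i}{\sigma_y}\right)^2$.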
3. The following table shows the ages x and the systolic pressures of 12 persons.
Age (x) 56 42 72 36 63 47 55 49 38 42 68 60
Blood Pressure (y) 147 125 160 118 149 128 150 145 115 140 152 155
Calculate the coefficient of correlation between x and y. Estimate the blood pressure
of a person whose age is 45 years.
Ans. r = 0.8961, y = 80.78 + 1.138 x , when x = 45, y = 132.
4. The height (inches) and weight (pounds) of baseball players are given below:
(76, 212), (76, 224), (72, 180), (74, 210), (75, 215), (71, 200), (77, 235), (78, 235),
(77, 194), (76, 185).
(i) Estimate the coefficient of correlation between weight and height of baseball
players.
(ii) Find the regression line between weight and height. Use the regression equation to
find the weight of a baseball player that is 68 inches tall.
Ans. r = 0.5529, y = 4.737x - 147.227, x = 0.064y + 61.712; when x = 68, y =
174.89.
5. The equations of regression lines of two variables x and y are 4 x – 5y + 33 = 0 and
20x - 9y = 107, Find the correlation coefficient and the means of x and y.
Ans. r = 0.6, Mean of x = 13 and Mean of y = 17.
6. If the tangent of the angle between the lines of regression of y on x and x on y is 0.6
and the standard deviation of y is twice the standard deviation of x, find the coefficient
of correlation between x and y.
Ans. r = 0.5.
Resources:
1. https://nptel.ac.in/courses/111105042/
2. http://www.nptelvideos.in/2012/12/regression-analysis.html
3. https://nptel.ac.in/courses/111104074/