DSRP Unit-II Notes
2.1 INTRODUCTION
In unit 1 of this block, we have explained how to present data in the form of
tables and graphs. A more complete understanding of data can be attained by
summarizing the data using statistical measures. The present unit deals with
various measures of central tendency and dispersion in a variable. It also
explains how to measure correlation between two variables. As the computation
of these measures is different for ungrouped and grouped data, we present
some measures for both ungrouped and grouped data.
If the values of the observations in the data are denoted by x1, x2, ..., xn, then the
arithmetic mean is given by
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]
where n is the number of observations. In this formula, the Greek letter \(\Sigma\) (sigma)
denotes summation of all the values, i.e., \(\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n\).
Example 2.1 Suppose we have the following data on the minimum temperature
(°C) of New Delhi for 10 days.
19 17 21 11 15 17 12 17 15 18.
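For finding the average temperature, we have total number of observations n = 10
and \(\sum x_i = 162\), so
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{162}{10} = 16.2 \]
Thus the average minimum temperature is 16.2 °C.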
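As a quick check of Example 2.1, the mean can be computed directly from the ten temperatures. This is only an illustrative sketch; the function name `arithmetic_mean` is an assumption and does not come from the notes.

```python
# Minimum temperatures (°C) of New Delhi for 10 days, from Example 2.1
temps = [19, 17, 21, 11, 15, 17, 12, 17, 15, 18]

def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

mean_temp = arithmetic_mean(temps)
print(mean_temp)  # 16.2
```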
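For a frequency distribution, in which the value xi occurs with frequency fi, the
arithmetic mean is given by
\[ \bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{N} \]
where \(N = \sum_{i=1}^{n} f_i\) = total number of observations.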
Example 2.2 Let us consider the data given in Table 1.3 of Unit 1 and compute
the mean.
Let us compute the arithmetic mean of the data given in the above table.
\[ \bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{N} = \frac{0 \times 6 + 1 \times 5 + \dots + 6 \times 3}{40} = \frac{109}{40} = 2.725 \]
Class Interval   Mid value (xi)   Frequency (fi)   fi xi
15-25            20               9                180
25-35            30               12               360
35-45            40               21               840
45-55            50               15               750
55-65            60               11               660
65-75            70               7                490
Total                             N = 75           Σ fi xi = 3280
Thus, the mean of the age of 75 persons is
\[ \bar{x} = \frac{\sum f_i x_i}{N} = \frac{3280}{75} = 43.73 \approx 44 \text{ years} \]
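The grouped mean above can be reproduced with a short sketch that takes the mid value of each class as xi. The tuple layout `(lower limit, upper limit, frequency)` and the name `grouped_mean` are assumptions made for illustration.

```python
# (lower limit, upper limit, frequency) for each age class of the 75 persons
age_classes = [(15, 25, 9), (25, 35, 12), (35, 45, 21),
               (45, 55, 15), (55, 65, 11), (65, 75, 7)]

def grouped_mean(classes):
    """Mean of a grouped frequency distribution using class mid values."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total_fx / total_f

mean_age = grouped_mean(age_classes)
print(round(mean_age, 2))  # 43.73
```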
2.2.2 Median
Median is a positional average. Median is the middlemost value of the set of
observations which divides the data set into two equal parts, where all the
observations are arranged in either ascending or descending order. So there
are 50 per cent observations below the median and the remaining 50 per cent
are above the median.
Calculation of Median from Raw Data
For calculation of median from raw (unorganised) data you should take the
following three steps.
a) Arrange the data either in ascending or in descending order of magnitude
(both methods give the same value for median).
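b) If there is an odd number of observations (n), the median is calculated by
\[ \text{Median} = \text{value of the } \left(\frac{n+1}{2}\right)^{\text{th}} \text{ observation} \]
where n = number of observations.
c) If there is an even number of observations, the median is calculated by
\[ \text{Median} = \frac{\text{value of the } \left(\frac{n}{2}\right)^{\text{th}} \text{ observation} + \text{value of the } \left(\frac{n}{2}+1\right)^{\text{th}} \text{ observation}}{2} \]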
Example 2.4 The following is the data of the hemoglobin level of 11 women
in gm/dL:
12.1 13.6 14.2 12.4 14.3 13.2 12.8 14.6 13.9 13.8 12.4
For finding the median of the above data, we arrange the values of hemoglobin
in ascending order as follows:
12.1 12.4 12.4 12.8 13.2 13.6 13.8 13.9 14.2 14.3 14.6
Here, the number of observations is odd, since n = 11.
The median hemoglobin level is given by
\[ \text{Median} = \text{value of the } \left(\frac{n+1}{2}\right)^{\text{th}} \text{ observation} = \text{value of the } \left(\frac{11+1}{2}\right)^{\text{th}} \text{ observation} \]
= value of the 6th observation = 13.6 gm/dL
(Arrange the above data in descending order and calculate the median. You should obtain the
same value, i.e., 13.6.)
Example 2.5 The following is the data of the hemoglobin level of 12 women
in gm/dL:
12.1 13.6 14.2 12.4 14.3 13.2 12.8 14.6 13.9 13.8 12.4 14.8
For finding the median of the above data, we arrange the values of hemoglobin
in ascending order as follows:
12.1 12.4 12.4 12.8 13.2 13.6 13.8 13.9 14.2 14.3 14.6 14.8
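Here, the number of observations is even, i.e., n = 12.
The median hemoglobin level is given by
\[ \text{Median} = \frac{\text{value of the } \left(\frac{n}{2}\right)^{\text{th}} \text{ observation} + \text{value of the } \left(\frac{n}{2}+1\right)^{\text{th}} \text{ observation}}{2} \]
\[ = \frac{\text{value of the } \left(\frac{12}{2}\right)^{\text{th}} \text{ observation} + \text{value of the } \left(\frac{12}{2}+1\right)^{\text{th}} \text{ observation}}{2} \]
\[ = \frac{\text{value of the 6th observation} + \text{value of the 7th observation}}{2} \]
\[ = \frac{13.6 + 13.8}{2} = \frac{27.4}{2} = 13.7 \text{ gm/dL} \]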
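The odd and even cases of Examples 2.4 and 2.5 can be sketched in a few lines. The function name `median_raw` is hypothetical; the logic simply follows steps (a) to (c) for raw data.

```python
def median_raw(values):
    """Median of raw data: sort, then take the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n+1)/2)th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of (n/2)th and (n/2+1)th

hb_11 = [12.1, 13.6, 14.2, 12.4, 14.3, 13.2, 12.8, 14.6, 13.9, 13.8, 12.4]
hb_12 = hb_11 + [14.8]

median_11 = median_raw(hb_11)
median_12 = median_raw(hb_12)
print(median_11, round(median_12, 2))
```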
Calculation of Median from Ungrouped Frequency Distribution
a) First of all, arrange the data in ascending or descending order of magnitude.
b) Next, find the cumulative frequencies.
c) Apply the following formula:
\[ \text{Median} = \text{value of the } \left(\frac{N+1}{2}\right)^{\text{th}} \text{ observation} \]
where \(N = \sum f_i\) = total number of observations.
d) Finally, find the cumulative frequency which is either equal to or just
higher than \(\frac{N+1}{2}\); the value of the variable which corresponds to that
cumulative frequency is the required median.
Example 2.6 Consider the data given in Table 2.1 and calculate the median.
Table 2.4
No. of children (xi)   No. of families (frequency) fi   Cumulative frequency (less than type)
0                      6                                6
1                      5                                11
2                      8                                19
3                      7                                26
4                      6                                32
5                      5                                37
6                      3                                40
Total                  N = 40
In the above data set, the values of the number of children are already in ascending
order. In the third column of the table, we have calculated the cumulative
frequencies of less than type. Then
\[ \text{Median} = \text{value of the } \left(\frac{N+1}{2}\right)^{\text{th}} \text{ observation} = \text{value of the } \left(\frac{40+1}{2}\right)^{\text{th}} \text{ observation} \]
= value of the (20.5)th observation
Now the cumulative frequency which is either equal to or just higher than 20.5 is
26, and the corresponding value of the variable is 3. Thus the median value for
the above data set is 3 children per family.
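The cumulative-frequency search of steps (a) to (d) can be sketched as follows. The function name `median_from_frequencies` is an assumption, and the values are assumed to be supplied in ascending order.

```python
from itertools import accumulate

children = [0, 1, 2, 3, 4, 5, 6]   # xi, already in ascending order
families = [6, 5, 8, 7, 6, 5, 3]   # fi

def median_from_frequencies(values, freqs):
    """Return the value whose cumulative frequency first reaches (N + 1) / 2."""
    position = (sum(freqs) + 1) / 2
    for value, cum_f in zip(values, accumulate(freqs)):
        if cum_f >= position:
            return value

median_children = median_from_frequencies(children, families)
print(median_children)  # 3
```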
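Calculation of Median from Grouped Frequency Distribution
a) First of all, we find the value of \(\frac{N}{2}\), where \(N = \sum f_i\) = total number of
observations.
b) Next, we calculate the cumulative frequencies and identify the class interval
for which the cumulative frequency is either equal to or just higher than
\(\frac{N}{2}\). This class contains the median and is called the ‘median class’.
c) Finally, we use the following formula to compute the median:
\[ \text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times h \]
where L is the lower limit of the median class, c.f. is the cumulative frequency of the
class preceding the median class, f is the frequency of the median class and h is the
width of the median class.
For the age data of 75 persons, \(\frac{N}{2} = 37.5\) and the median class is 35-45, so
\[ \text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times h = 35 + \frac{37.5 - 21}{21} \times 10 = 35 + \frac{16.5}{21} \times 10 = 42.86 \]
The median age is 42.86 years.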
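A minimal sketch of the grouped-median formula, assuming the classes are given in ascending order as `(lower limit, upper limit, frequency)` tuples; the function name `grouped_median` is hypothetical.

```python
# (lower limit, upper limit, frequency) for the age data of 75 persons
age_classes = [(15, 25, 9), (25, 35, 12), (35, 45, 21),
               (45, 55, 15), (55, 65, 11), (65, 75, 7)]

def grouped_median(classes):
    """Median = L + ((N/2 - c.f.) / f) * h, using the first class whose
    cumulative frequency reaches N/2 as the median class."""
    half = sum(f for _, _, f in classes) / 2
    cum_f = 0  # cumulative frequency of the classes before the current one
    for lo, hi, f in classes:
        if cum_f + f >= half:
            return lo + (half - cum_f) / f * (hi - lo)
        cum_f += f

median_age = grouped_median(age_classes)
print(round(median_age, 2))  # 42.86
```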
2.2.3 Mode
Mode is the value of given data set which occurs maximum number of times,
i.e., the value which has the highest frequency. Mode is the most commonly
used measure of central tendency when we have to decide which is the most
fashionable (most demanded or most preferable) item at this time. For example,
to decide the most preferable size of shoes, clothes, etc., we find their mode.
Calculation of Mode from Raw Data
Example 2.8 Let us consider the temperature of 10 days in New Delhi, i.e.,
19, 17, 21, 11, 15, 17, 12, 17, 15, 18
In this data, the observation 17 occurs the maximum number of times (i.e., 3).
Hence the mode is 17 (Note that Mode is 17, not 3).
Calculation of Mode from Ungrouped Frequency Distribution
Example 2.9 Consider the data given in Table 2.1, and find out the mode.
Table 2.6
No. of Children No. of families
(xi) (frequency) fi
0 6
1 5
2 8
3 7
4 6
5 5
6 3
Total N=40
In this data set, the value 2 has the maximum frequency (i.e., 8). Thus 2 is the
most commonly occurring value. We can say that the modal value for the above
data is 2 children per family.
Calculation of Mode for Grouped Frequency Distribution
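i) First of all, we find the class having the maximum frequency, which is
called the modal class.
ii) Then, we calculate the mode by the following formula:
\[ \text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h \]
where L is the lower limit of the modal class, f1 is the frequency of the modal class,
f0 and f2 are the frequencies of the classes preceding and following the modal class,
and h is the width of the modal class.
For the age data of 75 persons, the modal class is 35-45, so
\[ \text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h = 35 + \frac{21 - 12}{2 \times 21 - 12 - 15} \times 10 = 35 + \frac{9}{15} \times 10 = 41 \]
Thus the modal value for the above data is 41 years.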
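The grouped-mode formula can be sketched in the same style. The function name `grouped_mode` is hypothetical, and treating a missing neighbouring class as having frequency 0 is an assumption of this sketch, not something stated in the notes.

```python
age_classes = [(15, 25, 9), (25, 35, 12), (35, 45, 21),
               (45, 55, 15), (55, 65, 11), (65, 75, 7)]

def grouped_mode(classes):
    """Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * h for the modal class."""
    freqs = [f for _, _, f in classes]
    k = freqs.index(max(freqs))                        # index of the modal class
    lo, hi, f1 = classes[k]
    f0 = freqs[k - 1] if k > 0 else 0                  # assumed 0 if no preceding class
    f2 = freqs[k + 1] if k + 1 < len(freqs) else 0     # assumed 0 if no following class
    return lo + (f1 - f0) / (2 * f1 - f0 - f2) * (hi - lo)

mode_age = grouped_mode(age_classes)
print(round(mode_age, 2))
```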
[Figure: panels (a), (b) and (c) comparing the positions of mean, median and mode; in one panel Mean = Median = Mode.]
If there are n observations, say x1, x2, ..., xn, of a variable under study and \(\bar{x}\) is
the mean of these n observations, then the mean deviation about the mean is given
by
\[ M.D = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \]
Here \(|x_i - \bar{x}|\) (read as mod \(x_i - \bar{x}\)) is the absolute value of the difference
between xi and \(\bar{x}\). For finding the absolute value of a number, we ignore the
minus sign. Thus |5| = 5.
Also |-5| = 5.
Example 2.11 Calculate the mean deviation from the following data of the
hemoglobin level of 10 women in gm/dL:
12.1 13.6 14.2 12.4 14.3 13.2 12.8 14.6 13.9 13.9
For computing mean deviation, we will prepare the following table:
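Table 2.9
xi       xi - 13.5    |xi - 13.5|
12.1     -1.4         1.4
13.6      0.1         0.1
14.2      0.7         0.7
12.4     -1.1         1.1
14.3      0.8         0.8
13.2     -0.3         0.3
12.8     -0.7         0.7
14.6      1.1         1.1
13.9      0.4         0.4
13.9      0.4         0.4
Σ xi = 135            Σ |xi - x̄| = 7
From the above table, n = 10 and \(\sum x_i = 135\).
Then,
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{135}{10} = 13.5 \]
The mean deviation about the mean is given by
\[ M.D = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| = \frac{7}{10} = 0.7 \]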
Thus the mean deviation of the above data on hemoglobin level is 0.7 gm/dL.
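The calculation of Example 2.11 can be checked with a short sketch; the function name `mean_deviation` is an assumption made for illustration.

```python
# Hemoglobin levels (gm/dL) of 10 women, from Example 2.11
hb = [12.1, 13.6, 14.2, 12.4, 14.3, 13.2, 12.8, 14.6, 13.9, 13.9]

def mean_deviation(values):
    """Average absolute deviation of the observations from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

md = mean_deviation(hb)
print(round(md, 2))  # 0.7
```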
For frequency distribution
Let xi (i = 1, 2, …, n) be the value of the ith observation in the data, occurring with
frequency fi (i = 1, 2, …, n). For the ungrouped frequency distribution the
mean deviation about the mean is given by
\[ M.D = \frac{1}{N} \sum_{i=1}^{n} f_i |x_i - \bar{x}| \]
where \(N = \sum f_i\)
and \(|x_i - \bar{x}|\) is the deviation from the mean after ignoring the minus sign.
In the fourth column we compute \(|x_i - \bar{x}|\) and in the fifth column we compute
\(f_i |x_i - \bar{x}|\).
\[ M.D = \frac{1}{N} \sum_{i=1}^{n} f_i |x_i - \bar{x}| = \frac{61.55}{40} = 1.539 \approx 1.54 \]
Thus the mean deviation is 1.54 children per family.
Example 2.13 Calculate the mean deviation from the data given in table 2.3.
We will construct the following table for computation of the mean deviation:
Table 2.11
(xi)   (fi)
\[ M.D = \frac{1}{N} \sum_{i=1}^{n} f_i |x_i - \bar{x}| = \frac{913.6}{75} = 12.181 \text{ years} \]
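\[ \text{Variance } (\sigma^2) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
We can rewrite it for computational convenience as
\[ \sigma^2 = \frac{1}{n} \sum x_i^2 - \bar{x}^2 \]
Or,
\[ \sigma^2 = \frac{1}{n} \sum x_i^2 - \left( \frac{\sum x_i}{n} \right)^2 \]
For a frequency distribution,
\[ \text{Variance } (\sigma^2) = \frac{1}{N} \sum f_i (x_i - \bar{x})^2 \]
Or,
\[ \sigma^2 = \frac{1}{N} \sum f_i x_i^2 - \bar{x}^2 \]
Or,
\[ \sigma^2 = \frac{1}{N} \sum f_i x_i^2 - \left( \frac{\sum f_i x_i}{N} \right)^2, \text{ where } N = \sum f_i \]
\[ S.D \; (\sigma) = \sqrt{\text{variance}} \]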
The three formulae given above will provide the same result. Thus you can use
any one of the above. For computation of variance we usually prepare a table
from the given data as per our requirements. As mentioned earlier standard
deviation is the positive square root of variance. In the case of a grouped
frequency distribution, we take xi as the mid value of the ith class interval.
Let us now consider the following examples to understand the computational
method of variance and standard deviation:-
Example 2.14: Calculate the variance and standard deviation from the data
set given in Example 2.11.
For computation of variance and standard deviation, we prepare the following
table:
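Table 2.12
xi       xi²
12.1     146.41
13.6     184.96
14.2     201.64
12.4     153.76
14.3     204.49
13.2     174.24
12.8     163.84
14.6     213.16
13.9     193.21
13.9     193.21
Σ xi = 135    Σ xi² = 1828.92
Then,
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{135}{10} = 13.5 \]
Variance is given by
\[ \sigma^2 = \frac{1}{n} \sum x_i^2 - \bar{x}^2 = \frac{1828.92}{10} - (13.5)^2 = 182.892 - 182.25 = 0.642 \]
\[ S.D = \sqrt{\text{variance}} = \sqrt{0.642} = 0.801 \]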
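Example 2.14 can be verified with the computational form of the variance. This is a sketch only; `population_variance` is a hypothetical name, and the divisor is n, matching the formula used in this example.

```python
import math

hb = [12.1, 13.6, 14.2, 12.4, 14.3, 13.2, 12.8, 14.6, 13.9, 13.9]

def population_variance(values):
    """Variance as (sum of squares / n) minus the square of the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean ** 2

variance = population_variance(hb)
sd = math.sqrt(variance)
print(round(variance, 3), round(sd, 3))  # 0.642 0.801
```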
Example 2.15 Calculate the variance and standard deviation from the data set
given in table 2.1
For the computation of variance and standard deviation, we have to construct
the following table.
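Table 2.13
xi      fi        fi xi             xi²          fi xi²
0       6         0                 0            0
1       5         5                 1            5
2       8         16                4            32
3       7         21                9            63
4       6         24                16           96
5       5         25                25           125
6       3         18                36           108
Total   N = 40    Σ fi xi = 109     Σ xi² = 91   Σ fi xi² = 429
From the above table, \(\bar{x} = \frac{\sum f_i x_i}{N} = \frac{109}{40} = 2.725\).
Variance is given by
\[ \sigma^2 = \frac{1}{N} \sum f_i x_i^2 - \bar{x}^2 = \frac{429}{40} - (2.725)^2 = 10.725 - 7.426 = 3.299 \]
Standard deviation is given by
\[ S.D = \sqrt{\text{variance}} = \sqrt{3.299} = 1.816 \]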
Example 2.16 Calculate the variance and standard deviation from the data
given in table 2.3
We construct the following table for the computation of variance and standard
deviation:
Table 2.14
Class Interval   Mid values (xi)   Frequency (fi)   fi xi             xi²             fi xi²
15-25            20                9                180               400             3600
25-35            30                12               360               900             10800
35-45            40                21               840               1600            33600
45-55            50                15               750               2500            37500
55-65            60                11               660               3600            39600
65-75            70                7                490               4900            34300
Total                              N = 75           Σ fi xi = 3280    Σ xi² = 13900   Σ fi xi² = 159400
From the above table,
\[ \bar{x} = \frac{\sum f_i x_i}{N} = \frac{3280}{75} = 43.733 \]
Variance is given by
\[ \sigma^2 = \frac{1}{N} \sum f_i x_i^2 - \bar{x}^2 = \frac{159400}{75} - (43.7333)^2 = 2125.333 - 1912.604 = 212.729 \]
Standard deviation is given by
\[ S.D = \sqrt{212.729} = 14.585 \]
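The frequency-weighted variance of Examples 2.15 and 2.16 can be checked with one helper. The name `frequency_variance` is hypothetical; for the grouped data the class mid values are passed as xi.

```python
import math

def frequency_variance(values, freqs):
    """Variance of a frequency distribution: (sum f*x^2)/N - (sum f*x / N)^2."""
    n_total = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n_total
    mean_of_squares = sum(f * x * x for x, f in zip(values, freqs)) / n_total
    return mean_of_squares - mean ** 2

# Example 2.15: number of children per family
var_children = frequency_variance([0, 1, 2, 3, 4, 5, 6], [6, 5, 8, 7, 6, 5, 3])
# Example 2.16: age classes represented by their mid values
var_age = frequency_variance([20, 30, 40, 50, 60, 70], [9, 12, 21, 15, 11, 7])

print(round(var_children, 3), round(math.sqrt(var_children), 3))  # 3.299 1.816
print(round(var_age, 3), round(math.sqrt(var_age), 3))            # 212.729 14.585
```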
\[ s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 \]
Or,
\[ s^2 = \frac{1}{n-1} \left( \sum x_i^2 - n\bar{x}^2 \right) \]
When we want to compare two or more data sets in respect to variability then
we will use coefficient of variation. The coefficient of variation is also useful
even in comparison of data sets having different measurement units because it
is a unit free measure. It is given by
\[ \text{Coefficient of variation} = \frac{S.D}{\text{mean}} \times 100 \]
Or,
\[ C.V. = \frac{\sigma}{\bar{x}} \times 100 \]
The data set for which the coefficient of variation is smaller is said to be more consistent,
more uniform or more homogeneous. For the above examples we can
calculate the coefficient of variation as:
For Example 2.14, \( C.V. = \frac{\sigma}{\bar{x}} \times 100 = \frac{0.801}{13.5} \times 100 = 5.93\,\% \)
For Example 2.15, \( C.V. = \frac{\sigma}{\bar{x}} \times 100 = \frac{1.816}{2.725} \times 100 = 66.64\,\% \)
For Example 2.16, \( C.V. = \frac{\sigma}{\bar{x}} \times 100 = \frac{14.585}{43.733} \times 100 = 33.35\,\% \)
Example 2.18: The following data gives the means and standard deviations of
the marks of two students in MA (Anthropology) examination.
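                          Student A   Student B
Mean (x̄)                  60          70
Standard Deviation (σ)    11          10
Coefficient of variation of student A is given by
\[ CV_A = \frac{\sigma_A}{\bar{x}_A} \times 100 = \frac{11}{60} \times 100 = 18.33\,\% \]
Coefficient of variation of student B is given by
\[ CV_B = \frac{\sigma_B}{\bar{x}_B} \times 100 = \frac{10}{70} \times 100 = 14.29\,\% \]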
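The comparison in Example 2.18 can be sketched as below; the smaller coefficient of variation marks the more consistent student. The function name `coefficient_of_variation` is an assumption.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

cv_a = coefficient_of_variation(11, 60)   # student A
cv_b = coefficient_of_variation(10, 70)   # student B
more_consistent = "A" if cv_a < cv_b else "B"

print(round(cv_a, 2), round(cv_b, 2), more_consistent)  # 18.33 14.29 B
```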
2.4 CORRELATION
So far we have dealt with a single characteristic of data. But, there may be
cases when we would be interested in analyzing more than one characteristic
at a time. For example, you may like to study the relationship between the age
and the number of books a person reads. Such data, having two characteristics
under study are called bivariate data. One of the measures to find out the extent
or degree of relationship between two variables is correlation coefficient.
An analysis of the covariation of two or more variables is usually called
correlation. If two characteristics vary in such a way that movement in one is
accompanied by movement in the other, these characteristics are correlated.
The relationships between age and blood pressure of
individuals, the price and demand of a product, the height and weight of a
person, and the number of hours devoted to study and performance in the
examination are some examples of correlated variables. The correlation
coefficient measures the strength and direction of the relationship between
two variables. The value of correlation coefficient (r) remains between -1 and
+1. A positive value of r indicates a positive relationship and negative value
indicates a negative relationship.
In order to have a rough idea about the nature of relationship between two
variables we plot the data on graph paper, called the scatter plot or scatter
diagram. In the case of quantitative variables we can however have a unique
value of the relation in the form of Karl Pearson’s Coefficient of Correlation.
In the case of ordinal data where ranks only are available we use Spearman’s
rank correlation method to obtain the degree of relationship.
a) Scatter Diagram
If we are interested in finding out the relationship between two variables, the
simplest way to visualize it is to prepare a dot chart called scatter diagram.
Using this method, the given data are plotted on a graph paper in the form of
dots. For example, for each pair of X and Y values, we put a dot and thus
obtain as many points as the number of observations. Now, by looking into the
scatter of various dots, we can ascertain whether the variables are related or
not. The greater the scatter of the plotted points on the chart, the lesser is the
relationship between the two variables. The more closely the points come to a
straight line, the higher the degree of relationship.
The following figures show the different types of correlation.
[Figure: scatter diagrams of Y against X showing perfect positive correlation (r = 1), perfect negative correlation (r = -1), high degree positive correlation, high degree negative correlation, and no correlation (r = 0).]
b) Karl Pearson’s Coefficient of Correlation
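Let X and Y be the two variables representing two characteristics which are
known to have some meaningful relationship.
Karl Pearson’s coefficient of correlation is given by
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
Or,
\[ r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right) \left( \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \right)}} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left( n \sum x_i^2 - (\sum x_i)^2 \right) \left( n \sum y_i^2 - (\sum y_i)^2 \right)}} \]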
Fig. 2.1
By looking at the scatter diagram, we can say that the height and weight are
correlated. It is clear from the above diagram that the correlation is positive because
the points rise from the lower left-hand corner to the upper
right-hand corner, and all the points are close to a line, so there is a high degree of
positive correlation.
For calculating Karl Pearson’s Correlation Coefficient, we will construct the
following table:
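Table 2.16
Σ xi = 653    Σ yi = 578    Σ xi² = 42863    Σ yi² = 33726    Σ xi yi = 37961
Here, n = 10
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{653}{10} = 65.3 \]
\[ \bar{y} = \frac{\sum y_i}{n} = \frac{578}{10} = 57.8 \]
\[ r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right) \left( \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \right)}} \]
\[ r = \frac{37961 - 10 \times 65.3 \times 57.8}{\sqrt{\left( 42863 - 10 \times 65.3^2 \right) \left( 33726 - 10 \times 57.8^2 \right)}} \]
\[ = \frac{37961 - 37743.4}{\sqrt{222.1 \times 317.6}} = \frac{217.6}{14.903 \times 17.821} = 0.819 \]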
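Since only the column totals of Table 2.16 are available here, the sketch below computes r from those sums using the equivalent form with n multiplied through. The function name `pearson_from_sums` and its argument names are assumptions.

```python
import math

def pearson_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Pearson's r from the usual column totals of a computation table."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

# Totals from Table 2.16 (height and weight data, n = 10)
r = pearson_from_sums(10, 653, 578, 42863, 33726, 37961)
print(round(r, 3))  # 0.819
```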
c) Spearman’s Rank Correlation
This is denoted by ρ (read as ‘rho’) instead of ‘r’. Here the raw data are
converted to their ranks. For example, suppose two examiners rank individual
students in a class according to their performance in a viva voce test. It may so
happen that both examiners will assign different ranks to a particular student.
If there is too much difference in ranks assigned by both the examiners, then
the evaluation of students may not be appropriate. Thus we need to study the
relationship between the ranks assigned by the examiners and the degree of
relationship will judge how appropriate the evaluation process has been. There
could be several similar situations where rank correlation can be applied.
In rank correlation method we take into account the difference in ranks assigned
to an observation. By considering such difference in ranks for all observations
we arrive at the rank correlation coefficient. The formula for rank correlation
is given by
\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]
where di is the difference in ranks assigned to an observation.
The Spearman’s rank correlation also ranges from +1 to -1. Thus, positive
values indicate direct relationship between the variables, while negative values
indicate inverse relationship. The value ρ = 0 indicates absence of association
between the variables.
Example 2.20: Given below are the ranks assigned by two examiners, A and
B, to a group of 10 students. Find out the degree of relationship between ranks
assigned by the examiners.
We prepare a table as given below and find out the difference in ranks assigned
by the examiners.
Next, we apply the formula
\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]
We find the value ρ = -0.175757575
Thus we can say that the Spearman’s rank correlation in the above case is
-0.18 (approx).
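The rank table of Example 2.20 is not reproduced in these notes, so the sketch below applies the same formula to a small set of hypothetical ranks (invented purely for illustration, with no tied ranks). The function name `spearman_rho` is also an assumption.

```python
def spearman_rho(ranks_a, ranks_b):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d = difference in ranks."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical ranks given by two examiners to five students
ranks_a = [1, 2, 3, 4, 5]
ranks_b = [2, 1, 4, 3, 5]

rho = spearman_rho(ranks_a, ranks_b)
print(round(rho, 2))  # 0.8
```

Identical rankings give ρ = 1 and exactly reversed rankings give ρ = -1, which matches the range stated above.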
In the simplest case of regression analysis there is one dependent variable and
one independent variable. Let us assume that consumption expenditure of a
household is related to the household income. For example, it can be postulated
that as household income increases, expenditure also increases. Here
consumption expenditure is the dependent variable and household income is
the independent variable.
The relationship between X and Y can take many forms. The general practice
is to express the relationship in terms of some mathematical equation. The
simplest of these equations is the linear equation. This means that the
relationship between X and Y is in the form of a straight line and is termed
linear regression. When the equation represents curves (not a straight line) it is
called non-linear regression.
Now the question arises, ‘How do we identify the equation form?’ There is no
hard and fast rule as such. The form of the equation depends upon the reasoning
and assumptions made by us. However, we may plot the X and Y variables on
a graph paper to prepare a scatter diagram. From the scatter diagram, the location
of the points on the graph paper helps in identifying the type of equation to be
fitted. If the points are more or less in a straight line, then linear equation is
assumed. On the other hand, if the points are not in a straight line and are in the
form of a curve, a suitable non-linear equation (which resembles the scatter) is
assumed.
You may by now be wondering why the term ‘regression’, which means
‘reduce’. This name is associated with a phenomenon that was observed in a
study on the relationship between the stature of father (x) and son (y). It was
observed that the average stature of sons of the tallest fathers has a tendency to
be less than the average stature of these fathers. On the other hand, the average
stature of sons of the shortest fathers has a tendency to be more than the average
stature of these fathers. This phenomenon was called regression towards the
mean. Although this appeared somewhat strange at that time, it was found
later that this is due to natural variation within subgroups of a group and the
same phenomenon occurred in most problems and data sets. The explanation
is that many tall men come from families with average stature due to vagaries
of natural variation and they produce sons who are shorter than them on the
whole. A similar phenomenon takes place at the lower end of the scale.
\[ Y_i = a + bX_i \qquad …(2.1) \]
Linear Regression
Let us consider the following data on the amount of rainfall and the agricultural
production for ten years.
Rainfall (X)   Production (Y)   Rainfall (X)   Production (Y)
60             33               75             45
62             37               81             49
65             38               85             52
71             42               88             55
73             42               90             57
We assume that rainfall is the cause (X) and agricultural production is the
effect (Y). We plot the data on a graph paper. The scatter diagram looks
something like Fig. 2.2. We observe from Fig. 2.2 that the points do not lie
strictly on a straight line. But they show an upward rising tendency where a
straight line can be fitted.
When we fit a straight line to the data there is some sort of error we are
committing – the observations are not on a straight line but we are forcing a
straight line. The vertical difference between the regression line and the
observations is the ‘error’. Our objective is to minimize the error values. This
is usually done by the method of ‘least squares’. We will not go into the details
of the method here. Instead, two equations derived on the basis of least squares
method and known as normal equations are given below.
These are:
\[ \sum Y = na + b \sum X \qquad …(1) \]
\[ \sum XY = a \sum X + b \sum X^2 \qquad …(2) \]
From our sample survey we have data on X and Y variables; we also know the
number of observations (n). The unknowns in the above two equations are ‘a’
and ‘b’; we estimate these two values.
Example 2.21: Estimate the regression equation from rainfall data given above.
We apply the normal equations to the rainfall data. For that purpose we prepare
a table as given below.
Table 9.2: Computation of Regression Line
Xi      Yi      Xi²     XiYi    Ŷi
60      33      3600    1980    33.85
62      37      3844    2294    35.34
65      38      4225    2470    37.57
71      42      5041    2982    42.03
73      42      5329    3066    43.51
75      45      5625    3375    45.00
81      49      6561    3969    49.46
85      52      7225    4420    52.43
88      55      7744    4840    54.66
90      57      8100    5130    56.15
Total   Σ Xi = 750   Σ Yi = 450   Σ Xi² = 57294   Σ XiYi = 34526   Σ Ŷi = 450
The estimated regression equation is
\[ \hat{Y}_i = -10.73 + 0.743 X_i \]
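Solving the two normal equations for a and b can be sketched as follows. The closed-form expressions for b and a are the standard solution of equations (1) and (2); the function name `fit_line` is hypothetical.

```python
X = [60, 62, 65, 71, 73, 75, 81, 85, 88, 90]   # rainfall
Y = [33, 37, 38, 42, 42, 45, 49, 52, 55, 57]   # agricultural production

def fit_line(xs, ys):
    """Solve the normal equations  sum(Y) = n*a + b*sum(X)  and
    sum(XY) = a*sum(X) + b*sum(X^2)  for the intercept a and slope b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b

a, b = fit_line(X, Y)
print(round(a, 2), round(b, 3))  # -10.75 0.743
```

The intercept comes out near -10.75 at full precision; the value -10.73 in the table appears to result from rounding b to 0.743 before computing a (45 - 0.743 × 75 = -10.725).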
Multiple Regression
In many cases you have more than one independent variable, which together
explain the dependent variable. Such models are termed ‘multiple
regression’. A typical example of a multiple regression is \(Y = a + bX_1 + cX_2\).
2.6 SUMMARY
Suggested Reading
Kothari, C. R. 1985. Research Methodology: Methods and Techniques. Delhi:
New Age International (P) Limited.
Nagar, A.L. and R.K. Dass, 1983, Basic Statistics, Oxford University Press,
Delhi.
Sundar Rao, P.S.S. and Richard, J. 1996. An Introduction to Biostatistics. New
Delhi: Prentice-Hall of India.
Sample Questions
1) Consider the following data set.
91 83 60 58 73 48 79 85 92 80.
On the basis of the above data
i) Calculate mean, median and mode.
ii) Calculate mean deviation, standard deviation and variance.
iii) Compute coefficient of variation.
2) The following are the number of injured persons in 50 accidents that took
place in New Delhi during 1st week of August.
3) Following are the data of hours worked by 50 workers for a period of a
month in a certain factory.
Hours worked Number of workers
(class interval) (Frequency)
40-60 2
60-80 2
80-100 5
100-120 5
120-140 12
140-160 10
160-180 10
180-200 4
Total 50
4) i) \(\bar{x}_A = 6.29\), \(\bar{x}_B = 5\)
Measures of Central Tendency & Dispersion
Measures that indicate the approximate center of a distribution are called measures of central tendency.
Measures that describe the spread of the data are measures of dispersion. These measures include the mean,
median, mode, range, upper and lower quartiles, variance, and standard deviation.
Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
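\[ \bar{x} = \frac{\sum x_i}{n} = \frac{17 + 10 + 9 + 14 + 13 + 17 + 12 + 20 + 14}{9} = \frac{126}{9} = 14 \]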
Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 2: Determine the absolute middle of the data. 9, 10, 12, 13, [14], 14, 17, 17, 20
Note: Since the number of data points is odd, choose the one in the very middle. The median is 14.
1. Put the data in order from smallest to largest, as you did to find your median.
2. Look for any value that occurs more than once.
3. Determine which of the values from Step 2 occurs most frequently.
Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 2: Look for any number that occurs more than once. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 3: Determine which of those occur most frequently. 14 and 17 both occur twice, so this data set has two modes: 14 and 17.
Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 2: Identify the lower half of your data (the values below the median). 9, 10, 12, 13
Step 3: Identify the upper half of your data (the values above the median). 14, 17, 17, 20
Step 4: For the lower half, find the median. 9, 10, 12, 13
Since there is an even number of data points in this half, you find the median by summing the
two in the center and dividing by two: (10 + 12) / 2 = 11. This is Q1.
Step 5: For the upper half, find the median. 14, 17, 17, 20
Since there is an even number of data points in this half, you find the median by summing the
two in the center and dividing by two: (17 + 17) / 2 = 17. This is Q3.
1. Identify the largest value in your data set. This is called the maximum.
2. Identify the lowest value in your data set. This is called the minimum.
3. Subtract the minimum from the maximum.
Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14
Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 2: Identify your maximum. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 3: Identify your minimum. 9, 10, 12, 13, 14, 14, 17, 17, 20
Step 4: Subtract the minimum from the maximum. 20 – 9 = 11
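1. Find the mean of the data (μ if calculating for a population or x̄ if using a sample).
2. Subtract the mean (μ or x̄) from each data value (xi).
3. Square each calculation from Step 2.
4. Add the values of the squares from Step 3.
5. Find the number of data points in your set, called n.
6. Divide the sum from Step 4 by the number n (if calculating for a population) or n – 1 (if using a
sample). This will give you the variance.
7. To find the standard deviation, square root this number.
Formulas:
Sample variance: \( s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \)        Population variance: \( \sigma^2 = \frac{\sum (x_i - \mu)^2}{n} \)
Sample standard deviation: \( s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \)        Population standard deviation: \( \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n} } \)
Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14 (treated as a sample, with mean 14)
Step 2: Subtract the mean from each value. 17 – 14 = 3; 10 – 14 = -4; 9 – 14 = -5; 14 – 14 = 0;
13 – 14 = -1; 17 – 14 = 3; 12 – 14 = -2; 20 – 14 = 6; 14 – 14 = 0
Step 3: Square these values. 3² = 9; (-4)² = 16; (-5)² = 25; 0² = 0; (-1)² = 1; 3² = 9; (-2)² = 4; 6² = 36; 0² = 0
Step 4: Add the squares. 9 + 16 + 25 + 0 + 1 + 9 + 4 + 36 + 0 = 100
Step 5: Divide by n – 1 = 8. 100 / 8 = 12.5. This is the variance.
Step 6: Square root this number to find your standard deviation. √12.5 = 3.536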
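The handout's worked data set can be summarised in one short sketch. The quartiles follow the halves method described above (the median itself is left out of both halves when n is odd); the helper name `median_sorted` and the variable names are assumptions.

```python
import math

data = [17, 10, 9, 14, 13, 17, 12, 20, 14]
ordered = sorted(data)
n = len(ordered)

def median_sorted(sorted_values):
    """Median of an already sorted list."""
    m = len(sorted_values) // 2
    if len(sorted_values) % 2 == 1:
        return sorted_values[m]
    return (sorted_values[m - 1] + sorted_values[m]) / 2

mean = sum(data) / n
median = median_sorted(ordered)
q1 = median_sorted(ordered[: n // 2])         # lower half of the data
q3 = median_sorted(ordered[(n + 1) // 2 :])   # upper half of the data
data_range = ordered[-1] - ordered[0]         # maximum minus minimum
sample_variance = sum((x - mean) ** 2 for x in data) / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(mean, median, q1, q3, data_range, sample_variance, round(sample_sd, 3))
```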
[Figure: diagram of the ordered data labelling the lower quartile, the median, and the values that could be subtracted to find the range.]
Hypothesis Testing
Null and Alternative Hypotheses
Hypothesis tests are tests about a population parameter (μ or p). We will do
hypothesis tests about the population mean μ and the population proportion p.
The null hypothesis (H0) is a statement involving equality (=, ≤, ≥) about a
population parameter. We assume the null hypothesis is true to do our analysis.
The exact statement of the null and alternative hypotheses depends on the claim
that you are testing.
Outcomes and the Type I and Type II Errors
Hypothesis tests are based on incomplete information, since a sample can never give
us complete information about a population. Therefore, there is always a chance that
our conclusion has been made in error.
Type I error = deciding to reject the null hypothesis when the null hypothesis is true
(incorrectly supporting the alternative).
The other possible error is if we conclude that the null hypothesis (our assumption)
seems reasonable (choosing not to believe the alternative hypothesis), when the null
hypothesis is really false. This is called a Type II error.
Type II error = failing to reject the null hypothesis when the null hypothesis is false
(incorrectly NOT supporting the alternative).
TYPE I and TYPE II ERROR IN TABULAR FORM
             Decision
             Accept H0           Reject H0
H0 True      Correct decision    Type I Error
H0 False     Type II Error       Correct decision
Type I Error: Rejecting the null hypothesis when the null hypothesis is true. It is called ‘α-error’.
Type II Error: Accepting the null hypothesis when the null hypothesis is false. It is called ‘β-error’.
It is important to be aware of the probability of getting each type
of error. The following notation is used:
α = P(Type I error) = probability of rejecting the null hypothesis when it is true
β = P(Type II error) = probability of not rejecting the null hypothesis when it is false
Distribution Needed for Hypothesis Testing
The sample statistic (the best point estimate for the population parameter, which
we use to decide whether or not to reject the null hypothesis) and distribution for
hypothesis tests are basically the same as for confidence intervals.
The only difference is that for hypothesis tests, we assume that the population
mean (or population proportion) is known: it is the value supplied by the null
hypothesis. (This is how we “assume the null hypothesis is true” when we are
testing if our sample data contradicts our assumption.)
When testing a claim about population mean μ, ONE of the following two
requirements must be met, so that the Central Limit Theorem applies and we can
assume the random variable x̅ is normally distributed:
− The sample size must be relatively large (many books recommend a sample size of at
least 30), OR
− the sample appears to come from a normally distributed population.
Stating Hypotheses
The first step in conducting a test of statistical significance is to state
the hypotheses.
A significance test starts with a careful statement of the claims being compared.
The claim tested by a statistical test is called the null hypothesis (H0). The test
is designed to assess the strength of the evidence against the null hypothesis.
The null hypothesis is a statement of “no difference.”
When conducting a test of significance, a null hypothesis is used. The term null
is used because this hypothesis assumes that there is no difference between the
two means or that the recorded difference is not significant.
Test Statistic
• It is a random variable that is calculated from sample data and used in a
hypothesis test.
• A test statistic compares your data with what is expected under the null
hypothesis.
• It is used to calculate P-Value.
• A test statistic measures the degree of agreement between a sample of the
data and the null hypothesis.
Different hypothesis tests use different test statistics based on the probability
model assumed in the null hypothesis. Common tests and their test statistics
are:
Hypothesis Test Test Statistics
Z-test Z-statistic
T-test T-statistic
ANOVA F-statistic
P-Value
The p-value is the probability, computed under the assumption that the null
hypothesis is true, of observing a value from the test statistic at least as
extreme as the one that was actually observed.
Thus, P-value is the chance that the presence of difference is concluded when
actually there is none.
When the p value is between 0.05 and 0.01 the result is usually called
significant.
When P value is less than 0.01, result is often called highly significant.
When p value is less than 0.001 and 0.005, result is taken as very highly
significant.
Statistical test
• These are intended to decide whether a hypothesis about distribution
of one or more populations should be rejected or accepted.
Parametric tests
• Used for Quantitative Data
• Used for continuous variables
• Used when data are measured on approximate interval or ratio scales
of measurements.
• Data should follow normal distribution.
Some parametric tests are:-
• t-test
• ANOVA (Analysis of variance)
• Pearson’s correlation coefficient (r)
• Z test for large samples (n > 30)
Student’s t- test
Developed by W. S. Gosset in 1908, who published statistical
papers under the pen name of “Student.” Thus the test is known as
Student’s “t” test.
One Sample t-test
Used to compare the mean of a single group of observations
with a specified value.
In the one sample t-test, we know the population mean. We draw
a random sample from the population, compare the
sample mean with the population mean, and make a statistical
decision as to whether or not the sample mean is different
from the population mean.
Formula:
\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]
where x̄ is the sample mean, μ is the population mean, s is the sample standard deviation and n is the sample size.
If the absolute value of “t” obtained is greater than the table value, then reject the
null hypothesis; if it is less than the table value, the null hypothesis may
be accepted.
Therefore, the rule for rejecting the null hypothesis is: reject H0 if |t| is greater than the table value of t.
T- test for unpaired two samples
• Used when the two independent random samples come from
normal populations having unknown but equal variances.
• We test the null hypothesis that the two population means are the same,
i.e., μ1 = μ2
T-test for paired two samples
Assumptions made for the test
1. Populations are distributed normally
2. Samples are drawn independently and at random
When the test applies
1. Samples are related with each other.
2. Sizes of the samples are small and equal.
3. Standard deviations in the populations are equal and not
known.
Null Hypothesis:
H0: μd = 0
Under H0, the test statistic is
\[ t = \frac{|\bar{d}|}{s / \sqrt{n}} \]
Where, d = difference between x1 and x2
d̅ = Average of d
s = Standard deviation of d
n = Sample size (number of pairs)
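A minimal sketch of the paired test statistic, using the standard form with s/√n in the denominator. The two samples below are hypothetical values invented for illustration, the function name `paired_t` is an assumption, and s is taken as the sample standard deviation of the differences (divisor n – 1), which the notes do not state explicitly.

```python
import math

def paired_t(x1, x2):
    """t = |mean of d| / (s / sqrt(n)), where d = x1 - x2 for each pair."""
    d = [a - b for a, b in zip(x1, x2)]
    n = len(d)
    d_bar = sum(d) / n
    # Sample standard deviation of the differences (n - 1 divisor; an assumption)
    s = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
    return abs(d_bar) / (s / math.sqrt(n))

# Hypothetical paired measurements on the same four subjects
x1 = [10, 12, 9, 11]
x2 = [8, 9, 8, 7]

t_value = paired_t(x1, x2)
print(round(t_value, 3))
```

The computed value would then be compared with the table value of t for n – 1 degrees of freedom.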
ANOVA (Analysis of Variance)
• Developed by R. A. Fisher.
• Analysis of Variance (ANOVA) is a collection of statistical models
used to analyze the differences between group means or variances.
• Compares multiple groups at one time.
One way ANOVA
Compares two or more unmatched groups when data are categorized in one factor.
Example :
1. Comparing a control group with three different doses of aspirin
2. Comparing the productivity of three or more employees based on
working hours in a company
Two way ANOVA
Used when data are categorized in two factors.
Example :
Comparing the employee productivity based on the working hours
and working conditions.
Assumptions of ANOVA :
• The samples are independent and selected randomly.
• Parent population from which samples are taken is of normal
distribution.
• Various treatment and environmental effects are additive in nature.
• The experimental errors are distributed normally with mean zero and
variance σ2
ANOVA compares variance by means of F-ratio
• F = variance between samples / variance within samples
• It again depends on experimental designs
Null hypothesis:
H0 = All population means are same
• If the computed Fc is greater than the F critical value, we are likely to
reject the null hypothesis.
• If the computed Fc is less than the F critical value, then the null
hypothesis is accepted.
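The F-ratio for a one way layout can be sketched as below. The three groups are hypothetical numbers chosen so the arithmetic is easy to follow, and the function name `one_way_f` is an assumption; the variances are the usual mean squares (sum of squares divided by degrees of freedom).

```python
def one_way_f(groups):
    """F = (between-samples mean square) / (within-samples mean square)."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical observations for three groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

f_computed = one_way_f(groups)
print(round(f_computed, 2))
```

Here the between-samples sum of squares is 26 on 2 degrees of freedom and the within-samples sum of squares is 6 on 6 degrees of freedom, so Fc = 13; it would be compared with the F critical value for (2, 6) degrees of freedom.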
Z test for large samples (n > 30)
Non- Parametric Test
Non-parametric tests can be applied when:
Chi Square test
• First used by Karl Pearson
• Simplest & most widely used non-parametric test in statistical
work.
• Calculated using the formula: χ² = Σ (O – E)² / E
Where,
O = observed frequencies
E = expected frequencies
• The greater the discrepancy between observed & expected frequencies, the greater
the value of χ².
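The chi-square formula can be sketched in a few lines. The observed and expected frequencies below are hypothetical, and the function name `chi_square` is an assumption.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical observed and expected frequencies for three categories
observed = [18, 22, 20]
expected = [20, 20, 20]

chi2 = chi_square(observed, expected)
print(round(chi2, 2))  # 0.4
```

When observed and expected frequencies agree exactly the statistic is 0, and it grows as the discrepancy grows.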
Chi Square test Cont…
Application of chi-square test :
• Test of association (smoking & cancer, treatment & outcome
of disease, vaccination & immunity).
Sources
1. These lecture notes are intended to be used with the open source textbook
“Introductory Statistics" by Barbara Illowsky and Susan Dean (OpenStax
College, 2013).
2. https://study.com/academy/lesson/what-is-a-hypothesis-definition-lesson-quiz.html
3. https://personal.utdallas.edu/~scniu/OPRE6301/documents/Hypothesis_Testing.pdf
4. http://isoconsultantpune.com/hypothesis-testing/
5. http://www.fosonline.org/wordpress/wp-content/uploads/2010/06/SalafaskyEtAl_ConsBiol_2002.pdf