DSRP Unit-II Notes

This document covers measures of central tendency and dispersion, including arithmetic mean, median, mode, mean deviation, variance, standard deviation, and correlation. It provides definitions, methods for calculating these measures from both raw and grouped data, and examples to illustrate the computations. The unit aims to equip learners with the ability to explain and compute these statistical measures effectively.

UNIT 2 MEASURES OF CENTRAL TENDENCY AND DISPERSION


Contents
2.1 Introduction
2.2 Measures of Central Tendency
2.2.1 Arithmetic Mean
2.2.2 Median
2.2.3 Mode
2.2.4 Relationship between Mean, Median and Mode
2.3 Measures of Dispersion
2.3.1 Mean Deviation
2.3.2 Variance and Standard Deviation
2.3.3 Coefficient of Variation
2.4 Correlation
2.5 Concept of Regression
2.6 Summary
Suggested Reading
Sample Questions
Answers or Hints
Learning Objectives
After going through this unit, you will be in a position to:
• explain the concepts of central tendency and dispersion;
• compute mean, median and mode from raw data as well as frequency distribution;
• compute mean deviation, variance, standard deviation and coefficient of variation from raw data and frequency distribution; and
• compute and interpret the coefficient of correlation.

2.1 INTRODUCTION
In Unit 1 of this block, we explained how to present data in the form of tables and graphs. A more complete understanding of data can be attained by summarising the data using statistical measures. The present unit deals with various measures of central tendency and dispersion in a variable. It also explains how to measure the correlation between two variables. As the computation of these measures is different for ungrouped and grouped data, we present some measures for both ungrouped and grouped data.

2.2 MEASURES OF CENTRAL TENDENCY


The most commonly investigated characteristics of a set of data are measures of central tendency. Measures of central tendency provide us with a summary that describes some central or middle point of the data. There are five important measures of central tendency, viz., i) arithmetic mean, ii) median, iii) mode, iv) geometric mean, and v) harmonic mean. Of these, the last two measures, viz., geometric mean and harmonic mean, have very specific uses and are thus less frequently used. Therefore, we will discuss the first three measures in this unit.

Remember that these measures may not all have the same value for a particular group of observations, because the formula is different for each measure. Which one of these measures should be used in a particular case depends upon the type of data and the way in which the observations in the group cluster around a point.
Before dealing with these measures, let us become familiar with certain notations which we will use. The standard notation is X, a variable that takes values X1, X2, X3, ..., Xn. Suppose we have data on the number of children in a family obtained from a household survey of 40 households. The total number of families (N), as we know from our survey, is 40. We present these data in the form of a frequency distribution such that 6 families have no child, five families have one child each, eight families have two children each, and so on (see Table 2.1). Here the number of children is our variable X and it takes seven values, viz., X1, X2, X3, ..., X7, such that X1 = 0, X2 = 1, X3 = 2, ..., X7 = 6. The corresponding frequencies are 6, 5, 8, 7, 6, 5 and 3. These are denoted as f1, f2, ..., f7.

Many times we refer to a typical observation; it could be any of the observations under consideration. We call the typical observation the 'ith observation' and denote it as xi, with corresponding frequency fi. Here 'i' is the subscript. For greater clarity we provide a range for the subscript; in Table 2.1 below, the summation index runs from 0 to 6.
Table 2.1: Number of Children in Families

No. of children (xi)    No. of families (frequency, fi)
0    6
1    5
2    8
3    7
4    6
5    5
6    3
Total    Σ fi = 40  (summing over i = 0 to 6)

In the case of a continuous variable we take the mid-values of the class intervals as x1, x2, ..., xn and the corresponding frequencies as f1, f2, ..., fn.

The sum of the frequencies is given below the table as Σ fi, with the summation index i running from 0 to 6. The symbol Σ (read as 'sigma') is used to denote the sum of a variable. The two numbers attached to Σ, i = 0 and 6, denote the lower and upper limits of the summation respectively. When there is no confusion in notation, we omit the subscripts and superscripts and just write Σ fi or simply Σ f.

2.2.1 Arithmetic Mean


The average or the arithmetic mean or simply the mean is the most commonly
used measure of central tendency. It is computed by dividing the sum of all
observations by the number of observations. It is denoted by x̄ (read as 'x bar'). We explain the methods of computing the arithmetic mean in the case of
ungrouped data and grouped data.
In the case of ungrouped data

If the values of the observations in the data are denoted by x1, x2, ..., xn, then the arithmetic mean is given by

x̄ = (x1 + x2 + ... + xn)/n = (Σ xi)/n

where n is the number of observations. In this formula, Σ xi (with the sum running over i = 1 to n) denotes the summation of all the values, i.e., Σ xi = x1 + x2 + ... + xn.

Example 2.1 Suppose we have the following data on the minimum temperature (°C) of New Delhi for 10 days:

19 17 21 11 15 17 12 17 15 18

For finding the average temperature, the total number of observations is n = 10.

The total of all these temperatures is Σ xi = 19 + 17 + 21 + ... + 18 = 162.

Therefore, the arithmetic mean is x̄ = (Σ xi)/n = 162/10 = 16.2 °C.
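For readers who want to check such calculations programmatically, here is a minimal Python sketch of the same computation; the data are those of Example 2.1 and the variable names are illustrative.

```python
# Arithmetic mean of ungrouped (raw) data, as in Example 2.1
temperatures = [19, 17, 21, 11, 15, 17, 12, 17, 15, 18]  # minimum temperatures (°C)

n = len(temperatures)        # number of observations
total = sum(temperatures)    # sum of all observations
mean = total / n             # arithmetic mean = sum / n

print(n, total, mean)        # 10 162 16.2
```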

In the case of grouped data


In the case of grouped data we are provided with a frequency distribution. Let
xi (i = 1, 2, …, n) be the value of ith observation in the data and it occurs with
frequency fi (i =1, 2, …., n). For the grouped frequency distribution the
arithmetic mean is given by

x̄ = (f1 x1 + f2 x2 + ... + fn xn)/N = (Σ fi xi)/N

where N = Σ fi = total number of observations.

Remember that in the case of a grouped frequency distribution, xi is the mid value of the ith class interval.
Example 2.2 Let us consider the data given in Table 1.3 of Unit 1 and compute the mean.

No. of children (xi)    No. of families (frequency, fi)    fi xi
0    6    0
1    5    5
2    8    16
3    7    21
4    6    24
5    5    25
6    3    18
Total    N = 40    Σ fi xi = 109

Let us compute the arithmetic mean of the data given in the above table:

x̄ = (Σ fi xi)/N = (0 × 6 + 1 × 5 + ... + 6 × 3)/40 = 109/40 = 2.725

Thus, the average number of children per family is 2.725.
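The same calculation for a frequency distribution multiplies each value by its frequency before summing. A minimal Python sketch, assuming the children data of Table 2.1:

```python
# Arithmetic mean from a frequency distribution (Table 2.1)
x = [0, 1, 2, 3, 4, 5, 6]    # number of children (x_i)
f = [6, 5, 8, 7, 6, 5, 3]    # number of families (f_i)

N = sum(f)                                       # total number of observations
mean = sum(fi * xi for fi, xi in zip(f, x)) / N  # (sum of f_i * x_i) / N

print(N, mean)                                   # 40 2.725
```

For a grouped distribution with class intervals, the list x would hold the class mid-values instead.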


Example 2.3 Consider the data given in Table 1.7 of Unit 1 and compute the
mean.
Table 2.2
Class interval (age in years)    Frequency (fi)
15-25 9
25-35 12
35-45 21
45-55 15
55-65 11
65-75 7
Total 75
For the computation of the mean, we have to construct the following table.
Table 2.3
Class interval    Mid value (xi)    Frequency (fi)    fi xi

15-25 20 9 180
25-35 30 12 360
35-45 40 21 840
45-55 50 15 750
55-65 60 11 660
65-75 70 7 490

Total        N = 75    Σ fi xi = 3280

Thus, the mean age of the 75 persons is

x̄ = (Σ fi xi)/N = 3280/75 = 43.73 ≈ 44 years
2.2.2 Median
Median is a positional average. Median is the middlemost value of the set of
observations which divides the data set into two equal parts, where all the
observations are arranged in either ascending or descending order. So there
are 50 per cent observations below the median and the remaining 50 per cent
are above the median.
Calculation of Median from Raw Data
For calculation of median from raw (unorganised) data you should take the
following three steps.
a) Arrange the data either in ascending or in descending order of magnitude
(both methods give the same value for median).
b) If there is an odd number of observations (n), the median is calculated as

Median = value of the ((n + 1)/2)th observation

where n = number of observations.

c) If there is an even number of observations, the median is calculated as

Median = [value of the (n/2)th observation + value of the (n/2 + 1)th observation] / 2
Example 2.4 The following are the data on the hemoglobin levels of 11 women in gm/dL:
12.1 13.6 14.2 12.4 14.3 13.2 12.8 14.6 13.9 13.8 12.4
For finding the median of the above data, we arrange the values of hemoglobin
in ascending order as follows:
12.1 12.4 12.4 12.8 13.2 13.6 13.8 13.9 14.2 14.3 14.6
Here, the number of observations is odd, since n = 11.

The median hemoglobin level is given by

Median = value of the ((n + 1)/2)th observation
       = value of the ((11 + 1)/2)th observation
       = value of the 6th observation = 13.6 gm/dL

(Arrange the above data in descending order and calculate the median. You should obtain the same value, i.e., 13.6.)
Example 2.5 The following are the data on the hemoglobin levels of 12 women in gm/dL:

12.1 13.6 14.2 12.4 14.3 13.2 12.8 14.6 13.9 13.8 12.4 14.8
For finding the median of the above data, we arrange the values of hemoglobin
in ascending order as follows:
12.1 12.4 12.4 12.8 13.2 13.6 13.8 13.9 14.2 14.3 14.6 14.8
Here, the number of observations is even, i.e., n = 12.

The median hemoglobin level is given by

Median = [value of the (n/2)th observation + value of the (n/2 + 1)th observation] / 2
       = [value of the 6th observation + value of the 7th observation] / 2
       = (13.6 + 13.8)/2 = 27.4/2 = 13.7 gm/dL
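The odd/even rule above translates directly into code. A minimal Python sketch, using the hemoglobin data of Examples 2.4 and 2.5 (Python's built-in statistics.median applies the same rule):

```python
def median_raw(values):
    """Median of raw data: sort, then take the middle value,
    or the average of the two middle values when n is even."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:                           # odd number of observations
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2   # even: average the two middle values

hb_11 = [12.1, 13.6, 14.2, 12.4, 14.3, 13.2, 12.8, 14.6, 13.9, 13.8, 12.4]
hb_12 = hb_11 + [14.8]

print(median_raw(hb_11))              # 13.6
print(round(median_raw(hb_12), 2))    # 13.7
```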
Calculation of Median from ungrouped frequency distribution
a) First of all, arrange the data in ascending or descending order of magnitude.
b) Next find the cumulative frequencies.
c) Apply the following formula:

Median = value of the ((N + 1)/2)th observation

where N = Σ fi = total number of observations.

d) Finally, we find the cumulative frequency which is either equal to or just higher than (N + 1)/2; the value of the variable which corresponds to that cumulative frequency is our required median.
Example 2.6 Consider the data given in Table 2.1 and calculate the median.
Table 2.4
No. of children (xi)    No. of families (frequency, fi)    Cumulative frequency (less than type)
0 6 6
1 5 11
2 8 19
3 7 26
4 6 32
5 5 37
6 3 40
Total    N = 40

In the above data set, the values of the number of children are already in ascending order. In the third column of the table, we have calculated the cumulative frequencies of the 'less than' type. Then

Median = value of the ((N + 1)/2)th observation
       = value of the ((40 + 1)/2)th observation
       = value of the 20.5th observation
Now the cumulative frequency which is either equal to or just higher than 20.5 is 26, and the corresponding value of the variable is 3. Thus the median value for the above data set is 3 children per family.
Calculation of Median from Grouped Frequency Distribution

a) First of all, we find the value of N/2, where N = Σ fi = total number of observations.

b) Next, we calculate the cumulative frequencies and identify the class interval for which the cumulative frequency is either equal to or just higher than N/2. This class contains the median and is called the 'median class'.

c) Finally, we use the following formula to compute the median:

Median = L + ((N/2 - c.f.)/f) × h

where L = lower limit of the median class,
c.f. = cumulative frequency of the class preceding the median class,
f = frequency of the median class,
h = class width of the median class.
Example 2.7 For the data set given in Table 2.2, calculate the median age.

Table 2.5
Class interval (age in years)    Frequency (f)    Cumulative frequency (less than type)
15-25    9    9
25-35    12    21 = c.f.
35-45    21 = f    42
45-55    15    57
55-65    11    68
65-75    7    75
Total    N = 75

For the frequency distribution of the ages of 75 persons, N/2 = 75/2 = 37.5.

Therefore, the above table indicates that the median lies in the class interval 35-45. Thus

L = 35, c.f. = 21, f = 21, h = 45 - 35 = 10

Hence, the median is given by

Median = L + ((N/2 - c.f.)/f) × h = 35 + ((37.5 - 21)/21) × 10 = 35 + (16.5/21) × 10 = 42.86

The median age is 42.86 years.
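The interpolation formula above can be sketched in Python as follows; the class boundaries and frequencies reproduce Table 2.2, and the function name is illustrative rather than from any library.

```python
def grouped_median(boundaries, freqs):
    """Median of a grouped frequency distribution:
    Median = L + ((N/2 - cf) / f) * h."""
    N = sum(freqs)
    half = N / 2
    cum = 0
    for i, f in enumerate(freqs):
        if cum + f >= half:                          # this class is the median class
            L = boundaries[i]                        # lower limit of the median class
            h = boundaries[i + 1] - boundaries[i]    # class width
            return L + (half - cum) / f * h          # cum = c.f. of preceding classes
        cum += f

age_bounds = [15, 25, 35, 45, 55, 65, 75]   # class limits for 15-25, ..., 65-75
age_freqs = [9, 12, 21, 15, 11, 7]

print(round(grouped_median(age_bounds, age_freqs), 2))   # 42.86
```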
2.2.3 Mode
Mode is the value in a given data set which occurs the maximum number of times, i.e., the value which has the highest frequency. Mode is the measure of central tendency commonly used when we have to decide which is the most fashionable (most demanded or most preferred) item at a given time. For example, to decide the most preferable size of shoes, clothes, etc., we find the mode.
Calculation of Mode from Raw Data
Example 2.8 Let us consider the temperature of 10 days in New Delhi, i.e.,
19, 17, 21, 11, 15, 17, 12, 17, 15, 18
In this data set, the observation 17 occurs the maximum number of times (i.e., 3 times). Hence the mode is 17 (note that the mode is 17, not 3).
Calculation of Mode from Ungrouped Frequency Distribution
Example 2.9 Consider the data given in Table 2.1, and find out the mode.
Table 2.6
No. of Children No. of families
(xi) (frequency) fi
0 6
1 5
2 8
3 7
4 6
5 5
6 3
Total N=40

In this data set, the value 2 has the maximum frequency (i.e., 8). Thus 2 is the most commonly occurring value. We can say that the modal value for the above data is 2 children per family.
Calculation of Mode for a Grouped Frequency Distribution
i) First of all, we will find the class having maximum frequency which is
called modal class.
ii) Then, we will calculate the mode by the following formula.
Mode = L + ((f1 - f0)/(2f1 - f0 - f2)) × h

where L = lower limit of the modal class,
f1 = frequency of the modal class,
f0 = frequency of the class preceding the modal class,
f2 = frequency of the class succeeding the modal class,
h = class width.
Example 2.10 Consider the data given in Table 2.2 and find out the mode.
Table 2.7
Age in years frequency (f)
Class interval
15-25 9
25-35 12
35-45 21
45-55 15
55-65 11
65-75 7
Total N=75
In the above data set, the modal class is 35-45 since it has the maximum frequency (i.e., 21).

We find that L = 35, f1 = 21, f0 = 12, f2 = 15, h = 45 - 35 = 10.

Therefore, the mode can be calculated as

Mode = L + ((f1 - f0)/(2f1 - f0 - f2)) × h
     = 35 + ((21 - 12)/(2 × 21 - 12 - 15)) × 10
     = 35 + (9/15) × 10 = 41

Thus the modal value for the above data is 41 years.
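A matching Python sketch for the grouped-mode formula, again using the Table 2.2 data and an illustrative helper name:

```python
def grouped_mode(boundaries, freqs):
    """Mode of a grouped frequency distribution:
    Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h."""
    i = freqs.index(max(freqs))                      # index of the modal class
    L = boundaries[i]                                # lower limit of the modal class
    h = boundaries[i + 1] - boundaries[i]            # class width
    f1 = freqs[i]                                    # frequency of the modal class
    f0 = freqs[i - 1] if i > 0 else 0                # preceding class frequency
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0   # succeeding class frequency
    return L + (f1 - f0) / (2 * f1 - f0 - f2) * h

age_bounds = [15, 25, 35, 45, 55, 65, 75]
age_freqs = [9, 12, 21, 15, 11, 7]

print(grouped_mode(age_bounds, age_freqs))   # 41.0
```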

2.2.4 Relationship between Mean, Median and Mode


If a distribution is symmetric, the values of mean, median and mode coincide (as in the case of a normal distribution). If a distribution is moderately asymmetrical or skewed (positively or negatively), the mean, median and mode have the following relationship, given by Karl Pearson:

(Mean - Mode) = 3 (Mean - Median)

[Figure 2.1: Relationship between mean, median, and mode in the case of (a) symmetric (Mean = Median = Mode), (b) positively skewed, and (c) negatively skewed distributions.]

2.3 MEASURES OF DISPERSION


The various measures of central tendency discussed in the previous section give us an idea about the concentration of the data around its central part. But these measures of central tendency alone cannot describe the data fully. The following data on the marks of 3 students show that a single measure of central tendency is not sufficient to describe the data, and that another kind of measure, called dispersion, is needed to get a complete picture of the data.
Table 2.8: Marks of the students A, B and C in 5 subjects
Subject Student A Student B Student C
1 30 50 10
2 40 50 20
3 50 50 30
4 60 50 90
5 70 50 100
Total 250 250 250
Mean 50 50 50
You can observe from the above table that the average marks of students A, B and C are the same. However, a closer examination shows that the distributions of the marks of the three students differ widely from one another. So it is necessary to use other measures, called measures of dispersion or variability (along with measures of central tendency), to get a complete idea about the distribution of the data.
Measures of dispersion quantify the extent of scatter, i.e., the degree to which the observations deviate from the central value. They give an idea about the homogeneity or heterogeneity of the distribution. In this section, we will discuss the commonly used measures of dispersion, viz., mean deviation, variance and standard deviation.
2.3.1 Mean Deviation
Mean deviation is the average of the absolute deviations of all the observations from their mean value. First of all we compute the deviations of the data values from the mean. Secondly, we obtain the 'absolute values' of these deviations (that is, we take only the numerical part of a number and ignore the minus sign). Finally, we calculate the average of these absolute deviations. The mean deviation is also called the average deviation.
For Raw Data

If there are n observations, say x1, x2, ..., xn, of a variable under study and x̄ is the mean of these n observations, then the mean deviation about the mean is given by

M.D. = (1/n) Σ |xi - x̄|

Here |xi - x̄| (read as 'mod xi - x̄') is the absolute value of the difference between xi and x̄. For finding the absolute value of a number, we ignore the minus sign. Thus |5| = 5 and also |-5| = 5.
Example 2.11 Calculate the mean deviation from the following data of the
hemoglobin level of 10 women in gm/dL:
12.1 13.6 14.2 12.4 14.3 13.2 12.8 14.6 13.9 13.9
For computing mean deviation, we will prepare the following table:
Table 2.9
xi    xi - 13.5    |xi - 13.5|
12.1    -1.4    1.4
13.6    0.1    0.1
14.2    0.7    0.7
12.4    -1.1    1.1
14.3    0.8    0.8
13.2    -0.3    0.3
12.8    -0.7    0.7
14.6    1.1    1.1
13.9    0.4    0.4
13.9    0.4    0.4
Σ xi = 135        Σ |xi - x̄| = 7

From the above table, n = 10 and Σ xi = 135.

Then x̄ = (Σ xi)/n = 135/10 = 13.5

The mean deviation about the mean is given by

M.D. = (1/n) Σ |xi - x̄| = 7/10 = 0.7

Thus the mean deviation of the above data on hemoglobin level is 0.7 gm/dL.
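The same computation in a few lines of Python, for the raw hemoglobin data of Example 2.11:

```python
# Mean deviation about the mean for raw data (Example 2.11)
hb = [12.1, 13.6, 14.2, 12.4, 14.3, 13.2, 12.8, 14.6, 13.9, 13.9]

n = len(hb)
mean = sum(hb) / n                               # 13.5
mean_dev = sum(abs(x - mean) for x in hb) / n    # average of |x_i - mean|

print(round(mean, 2), round(mean_dev, 2))        # 13.5 0.7
```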
For a Frequency Distribution

Let xi (i = 1, 2, ..., n) be the value of the ith observation in the data and let it occur with frequency fi (i = 1, 2, ..., n). For an ungrouped frequency distribution the mean deviation about the mean is given by

M.D. = (1/N) Σ fi |xi - x̄|

where N = Σ fi and |xi - x̄| is the deviation from the mean after ignoring the minus sign.

In the case of a grouped frequency distribution, we take xi as the mid value of the ith class interval.
Example 2.12 Calculate the mean deviation about mean from the data set
given in Table 2.1
For the computation of the mean deviation, we have to construct the following
table
Table 2.10

No. of No. of f i xi xi  2.725 xi  2.725 fi xi  2.725


Children ( xi ) families
fi
0 6 0 -2.725 2.725 16.350
1 5 5 -1.725 1.725 8.625
2 8 16 -0.725 0.725 5.800
3 7 21 0.275 0.275 1.925
4 6 24 1.275 1.275 7.650
5 5 25 2.275 2.275 11.375
6 3 18 3.275 3.275 9.825

46 Total N=40 109.000 61.550


Measures of Central
x
fx i i 109 Tendency and Dispersion
From the above table,   2.725
N 40

 
In the fourth column we compute xi  x and in the fifth column we compute

xi  x .

Mean deviation about mean is given by

1 n
M.D   fi x1  x
N i1

61.55
  1.539 1.54
40
Example 2.13 Calculate the mean deviation from the data given in table 2.3
Thus the mean deviation is 1.54 children per family
We will construct the following table for computation of the mean deviation:
Table 2.11
Mid value (xi)    Frequency (fi)    fi xi    xi - 43.733    |xi - 43.733|    fi |xi - 43.733|
20    9    180    -23.733    23.733    213.600
30    12    360    -13.733    13.733    164.800
40    21    840    -3.733    3.733    78.400
50    15    750    6.267    6.267    94.000
60    11    660    16.267    16.267    178.933
70    7    490    26.267    26.267    183.867
Total    N = 75    3280        90.000    913.600

From the above table, x̄ = (Σ fi xi)/N = 3280/75 = 43.733.

The mean deviation about the mean is given by

M.D. = (1/N) Σ fi |xi - x̄| = 913.6/75 = 12.181 years

2.3.2 Variance and Standard Deviation


The variance and the standard deviation are the most commonly used measures of dispersion. The average of the squared deviations from the mean is known as the variance, denoted by σ² (read as 'sigma square'). That is, first of all we compute the deviations of the data values from the mean, then we square these deviations, and finally we find the average of these squared values.

The positive square root of the variance is called the standard deviation. It is also known as the root mean square deviation because it is the square root of the mean of the squared deviations from the arithmetic mean. It is denoted by σ (read as 'sigma'). We can calculate the variance and standard deviation as follows:
For Raw Data

Variance: σ² = (1/n) Σ (xi - x̄)²

We can rewrite it for computational convenience as

σ² = (1/n) Σ xi² - x̄²

or σ² = (1/n) [Σ xi² - (Σ xi)²/n]

And the standard deviation is given by

S.D. = σ = √variance

Remember that the standard deviation is never negative.
For a Frequency Distribution

When we have a frequency distribution, we can calculate the variance and standard deviation by the following formulae:

Variance: σ² = (1/N) Σ fi (xi - x̄)²

or σ² = (1/N) Σ fi xi² - x̄²

or σ² = (1/N) [Σ fi xi² - (Σ fi xi)²/N], where N = Σ fi

S.D. = σ = √variance

The three formulae given above provide the same result, so you can use any one of them. For the computation of the variance we usually prepare a table from the given data as per our requirements. As mentioned earlier, the standard deviation is the positive square root of the variance. In the case of a grouped frequency distribution, we take xi as the mid value of the ith class interval.
Let us now consider the following examples to understand the computational method of variance and standard deviation.

Example 2.14: Calculate the variance and standard deviation from the data set given in Example 2.11.

For the computation of variance and standard deviation, we prepare the following table:

Table 2.12
xi    xi²
12.1    146.41
13.6    184.96
14.2    201.64
12.4    153.76
14.3    204.49
13.2    174.24
12.8    163.84
14.6    213.16
13.9    193.21
13.9    193.21
Σ xi = 135    Σ xi² = 1828.92

From the above table, n = 10, Σ xi = 135 and Σ xi² = 1828.92.

Then x̄ = (Σ xi)/n = 135/10 = 13.5

The variance is given by

σ² = (1/n) Σ xi² - x̄² = 1828.92/10 - (13.5)² = 182.892 - 182.25 = 0.642

The standard deviation is given by

S.D. = σ = √variance = √0.642 = 0.801
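A minimal Python sketch of the same calculation (dividing by n, as in the population formula above; Python's statistics.pvariance and statistics.pstdev use the same divisor):

```python
import math

# Population variance and standard deviation for raw data (Example 2.14)
hb = [12.1, 13.6, 14.2, 12.4, 14.3, 13.2, 12.8, 14.6, 13.9, 13.9]

n = len(hb)
mean = sum(hb) / n
variance = sum((x - mean) ** 2 for x in hb) / n   # sigma^2 = (1/n) * sum of (x_i - mean)^2
std_dev = math.sqrt(variance)

print(round(variance, 3), round(std_dev, 3))      # 0.642 0.801
```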

Example 2.15 Calculate the variance and standard deviation from the data set given in Table 2.1.

For the computation of variance and standard deviation, we construct the following table.

Table 2.13
No. of children (xi)    No. of families (fi)    fi xi    xi²    fi xi²
0    6    0    0    0
1    5    5    1    5
2    8    16    4    32
3    7    21    9    63
4    6    24    16    96
5    5    25    25    125
6    3    18    36    108
Total    N = 40    Σ fi xi = 109    91    Σ fi xi² = 429

From the above table, x̄ = (Σ fi xi)/N = 109/40 = 2.725.

The variance is given by

σ² = (1/N) Σ fi xi² - x̄² = 429/40 - (2.725)² = 10.725 - 7.426 = 3.299

The standard deviation is given by

S.D. = σ = √variance = √3.299 = 1.816
Example 2.16 Calculate the variance and standard deviation from the data given in Table 2.3.

We construct the following table for the computation of variance and standard deviation:

Table 2.14
Class interval    Mid value (xi)    Frequency (fi)    fi xi    xi²    fi xi²
15-25    20    9    180    400    3600
25-35    30    12    360    900    10800
35-45    40    21    840    1600    33600
45-55    50    15    750    2500    37500
55-65    60    11    660    3600    39600
65-75    70    7    490    4900    34300
Total        N = 75    Σ fi xi = 3280    13900    Σ fi xi² = 159400

From the above table, x̄ = (Σ fi xi)/N = 3280/75 = 43.733.

The variance is given by

σ² = (1/N) Σ fi xi² - x̄² = 159400/75 - (43.733)² = 2125.333 - 1912.604 = 212.729

The standard deviation is given by

S.D. = σ = √variance = √212.729 = 14.585

Note: To get an unbiased estimate of the population variance from a sample, we divide the quantity Σ (xi - x̄)² by (n - 1) instead of by n; this estimate is denoted by s². Thus the formula for the sample variance is

s² = (1/(n - 1)) Σ (xi - x̄)²

or s² = (1/(n - 1)) [Σ xi² - n x̄²]

2.3.3 Coefficient of Variation (C.V.)


The measures of central tendency and dispersion describe the characteristics of a data set. A limitation of these two measures is that they are not free from the unit of measurement. For example, if height is measured in centimetres instead of inches, we get different numerical values of the mean and standard deviation. In order to avoid this problem we often use the coefficient of variation.

When we want to compare two or more data sets with respect to variability, we use the coefficient of variation. The coefficient of variation is also useful for comparing data sets having different measurement units because it is a unit-free measure. It is given by

Coefficient of variation = (S.D./mean) × 100

or C.V. = (σ/x̄) × 100
The data set for which the coefficient of variation is smaller is said to be more consistent (or more uniform, or more homogeneous). For the above examples we can calculate the coefficient of variation as follows:

For Example 2.14: C.V. = (σ/x̄) × 100 = (0.801/13.5) × 100 = 5.93%

For Example 2.15: C.V. = (σ/x̄) × 100 = (1.816/2.725) × 100 = 66.64%

For Example 2.16: C.V. = (σ/x̄) × 100 = (14.585/43.733) × 100 = 33.35%
Example 2.18: The following data give the means and standard deviations of the marks of two students in an MA (Anthropology) examination.

                          Student A    Student B
Mean (x̄)                  60           70
Standard deviation (σ)    11           10

Which student is the better performer in the examination?

To find the better performer, we calculate their coefficients of variation. The student having the smaller coefficient of variation has the more consistent performance in the examination.

The coefficient of variation of student A is

CVA = (σA/x̄A) × 100 = (11/60) × 100 = 18.33%

The coefficient of variation of student B is

CVB = (σB/x̄B) × 100 = (10/70) × 100 = 14.29%

Since the coefficient of variation of student B is less than that of student A, student B is the better performer in the examination.
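The coefficient of variation is a one-line calculation once the mean and standard deviation are known. A minimal Python sketch comparing the two students of Example 2.18 (the helper name is illustrative):

```python
def coefficient_of_variation(std_dev, mean):
    """Coefficient of variation in per cent: (S.D. / mean) * 100."""
    return std_dev / mean * 100

cv_a = coefficient_of_variation(11, 60)   # student A
cv_b = coefficient_of_variation(10, 70)   # student B

# The smaller CV (student B here) indicates the more consistent performance.
print(round(cv_a, 2), round(cv_b, 2))     # 18.33 14.29
```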

2.4 CORRELATION

So far we have dealt with a single characteristic of data. But, there may be
cases when we would be interested in analyzing more than one characteristic
at a time. For example, you may like to study the relationship between the age
and the number of books a person reads. Such data, having two characteristics
under study are called bivariate data. One of the measures to find out the extent
or degree of relationship between two variables is correlation coefficient.
An analysis of the covariation of two or more variables is usually called correlation. If two characteristics vary in such a way that movement in one is accompanied by movement in the other, these characteristics are correlated. The age and blood pressure of individuals, the price and demand of a product, the height and weight of a person, and the number of hours devoted to study and performance in an examination are some examples of correlated variables. The correlation coefficient measures the strength and direction of the relationship between two variables. The value of the correlation coefficient (r) lies between -1 and +1. A positive value of r indicates a positive relationship and a negative value indicates a negative relationship.
In order to have a rough idea about the nature of relationship between two
variables we plot the data on graph paper, called the scatter plot or scatter
diagram. In the case of quantitative variables we can however have a unique
value of the relation in the form of Karl Pearson’s Coefficient of Correlation.
In the case of ordinal data where ranks only are available we use Spearman’s
rank correlation method to obtain the degree of relationship.
a) Scatter Diagram
If we are interested in finding out the relationship between two variables, the
simplest way to visualize it is to prepare a dot chart called scatter diagram.
Using this method, the given data are plotted on a graph paper in the form of
dots. For example, for each pair of X and Y values, we put a dot and thus
obtain as many point as the number of observations. Now, by looking into the
scatter of various dots, we can ascertain whether the variables are related or
not. The greater the scatter of the plotted points on the chart, the lesser is the
relationship between the two variables. The more closely the points come to a
straight line, the higher the degree of relationship.
The following figures show the different types of Correlation.

r =1
r = -1
Y Y

X X
Perfect Positive Correlation Perfect Negative Correlation

53
Statistical Analysis

Y Y

X X
High Degree Positive Correlation High Degree Negative Correlation

r=0
Y

X
No Correlation
b) Karl Pearson's Coefficient of Correlation

Let X and Y be two variables representing two characteristics which are known to have some meaningful relationship. Karl Pearson's coefficient of correlation is given by

r = Σ (xi - x̄)(yi - ȳ) / √[Σ (xi - x̄)² × Σ (yi - ȳ)²]

or, equivalently,

r = (Σ xi yi - n x̄ ȳ) / [√(Σ xi² - n x̄²) × √(Σ yi² - n ȳ²)]
  = (n Σ xi yi - Σ xi Σ yi) / [√(n Σ xi² - (Σ xi)²) × √(n Σ yi² - (Σ yi)²)]

The method of computation of the correlation coefficient will become clearer from the following example.
Example 2.19: Following are the heights and weights of 10 students.

Table 2.15
Height (in inches)    70    61    73    67    58    65    71    65    63    60
Weight (in kg)        64    50    64    66    50    54    60    61    54    55

i) Make a scatter diagram.
ii) Find the correlation coefficient between height and weight.

In the following figure, we take height on the X-axis and weight on the Y-axis and plot the corresponding points.

[Fig. 2.1: Scatter diagram of heights and weights]

Looking at the scatter diagram, we can say that height and weight are correlated. It is clear from the diagram that the correlation is positive, because the points rise from the lower left-hand corner to the upper right-hand corner, and since all the points lie close to a line there is a high degree of positive correlation.
For calculating Karl Pearson's correlation coefficient, we construct the following table:

Table 2.16
Height (xi)    Weight (yi)    xi²    yi²    xi yi
70    64    4900    4096    4480
61    50    3721    2500    3050
73    64    5329    4096    4672
67    66    4489    4356    4422
58    50    3364    2500    2900
65    54    4225    2916    3510
71    60    5041    3600    4260
65    61    4225    3721    3965
63    54    3969    2916    3402
60    55    3600    3025    3300
Σ xi = 653    Σ yi = 578    Σ xi² = 42863    Σ yi² = 33726    Σ xi yi = 37961

Here, n = 10

x̄ = (Σ xi)/n = 653/10 = 65.3
ȳ = (Σ yi)/n = 578/10 = 57.8

Karl Pearson's coefficient of correlation is given by

r = (Σ xi yi - n x̄ ȳ) / [√(Σ xi² - n x̄²) × √(Σ yi² - n ȳ²)]
  = (37961 - 10 × 65.3 × 57.8) / [√(42863 - 10 × 65.3²) × √(33726 - 10 × 57.8²)]
  = (37961 - 37743.4) / (√222.1 × √317.6)
  = 217.6 / (14.903 × 17.821) = 0.819
c) Spearman’s Rank Correlation

This is denoted by ρ (read as 'rho') instead of 'r'. Here the raw data are
converted to their ranks. For example, suppose two examiners rank individual
students in a class according to their performance in a viva voce test. It may so
happen that both examiners will assign different ranks to a particular student.
If there is too much difference in ranks assigned by both the examiners, then
the evaluation of students may not be appropriate. Thus we need to study the
relationship between the ranks assigned by the examiners and the degree of
relationship will judge how appropriate the evaluation process has been. There
could be several similar situations where rank correlation can be applied.

In rank correlation method we take into account the difference in ranks assigned
to an observation. By considering such difference in ranks for all observations
we arrive at the rank correlation coefficient. The formula for rank correlation
is given by

ρ = 1 - (6 Σ di²) / [n(n² - 1)]

where di is the difference in the ranks assigned to an observation.

Spearman's rank correlation also ranges from +1 to -1. Thus, positive values indicate a direct relationship between the variables, while negative values indicate an inverse relationship. The value ρ = 0 indicates absence of association between the variables.

Example 2.20: Given below are the ranks assigned by two examiners, A and B, to a group of 10 students. Find out the degree of relationship between the ranks assigned by the examiners.

We prepare a table as given below and find out the difference in the ranks assigned by the examiners.

Student    Rank by A    Rank by B    di    di²
1    1    1    0    0
2    2    6    -4    16
3    3    8    -5    25
4    4    7    -3    9
5    5    10    -5    25
6    6    9    -3    9
7    7    3    4    16
8    8    5    3    9
9    9    2    7    49
10    10    4    6    36
Total                194

Next, we apply the formula ρ = 1 - (6 Σ di²)/[n(n² - 1)] = 1 - (6 × 194)/(10 × 99).

We find the value ρ = -0.17576.

Thus we can say that Spearman's rank correlation in the above case is -0.18 (approx.).
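A minimal Python sketch of the rank-correlation formula (for untied ranks), using the ranks of Example 2.20:

```python
# Spearman's rank correlation: rho = 1 - 6*sum(d_i^2) / (n*(n^2 - 1))
rank_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rank_b = [1, 6, 8, 7, 10, 9, 3, 5, 2, 4]

n = len(rank_a)
d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))   # sum of d_i^2

rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))

print(d_squared, round(rho, 2))   # 194 -0.18
```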

2.5 CONCEPT OF REGRESSION


In regression analysis we have two types of variables: i) dependent (or
explained) variable, and ii) independent (or explanatory) variable. As the name
(explained and explanatory) suggests the dependent variable is explained by
the independent variable. Note that correlation coefficient does not reflect cause
and effect relationship whereas regression analysis assumes that one variable
(or more than one) is the cause and other is the effect.

In the simplest case of regression analysis there is one dependent variable and
one independent variable. Let us assume that consumption expenditure of a
household is related to the household income. For example, it can be postulated
that as household income increases, expenditure also increases. Here
consumption expenditure is the dependent variable and household income is
the independent variable.

Usually we denote the dependent variable as Y and the independent variable


as X. Suppose we took up a household survey and collected n pairs of
observations in X and Y. The next step is to find out the nature of relationship
between X and Y.

The relationship between X and Y can take many forms. The general practice
is to express the relationship in terms of some mathematical equation. The
simplest of these equations is the linear equation. This means that the
relationship between X and Y is in the form of a straight line and is termed
linear regression. When the equation represents curves (not a straight line) it is
called non-linear regression.

Now the question arises, ‘How do we identify the equation form?’ There is no
hard and fast rule as such. The form of the equation depends upon the reasoning
and assumptions made by us. However, we may plot the X and Y variables on
a graph paper to prepare a scatter diagram. From the scatter diagram, the location
of the points on the graph paper helps in identifying the type of equation to be
fitted. If the points are more or less in a straight line, then linear equation is
assumed. On the other hand, if the points are not in a straight line and are in the
form of a curve, a suitable non-linear equation (which resembles the scatter) is
assumed.

Regression analysis can be extended to cases where one dependent variable is


explained by a number of independent variables. Such a case is termed ‘multiple
regression’.

You may by now be wondering why the term ‘regression’, which means
‘reduce’. This name is associated with a phenomenon that was observed in a
study on the relationship between the stature of father (x) and son (y). It was
observed that the average stature of sons of the tallest fathers has a tendency to
be less than the average stature of these fathers. On the other hand, the average
stature of sons of the shortest fathers has a tendency to be more than the average
stature of these fathers. This phenomenon was called regression towards the
mean. Although this appeared somewhat strange at that time, it was found
later that this is due to natural variation within subgroups of a group and the
same phenomenon occurred in most problems and data sets. The explanation
is that many tall men come from families with average stature due to vagaries
of natural variation and they produce sons who are shorter than them on the
whole. A similar phenomenon takes place at the lower end of the scale.

The simplest relationship between X and Y could perhaps be a linear


deterministic function given by

Yi = a + bXi    ...(2.1)

In the above equation X is the independent variable or explanatory variable and Y is the dependent variable or explained variable. You may recall that the subscript i represents the observation number; i ranges from 1 to n. Thus Y1 is the first observation of the dependent variable, X5 is the fifth observation of the independent variable, and so on.

Equation (2.1) implies that Y is completely determined by X and the parameters


a and b. Suppose we have parameter values a = 3 and b = 0.75, then our linear
equation is Y = 3 + 0.75 X. From this equation we can find out the value of Y
for given values of X. For example, when X = 8, we find that Y = 9. Thus if we
have different values of X then we obtain corresponding Y values on the basis
of (2.1).

Linear Regression
Let us consider the following data on the amount of rainfall and the agricultural
production for ten years.

Rainfall (in mm)    Agricultural production (in tonnes)
60    33
62    37
65    38
71    42
73    42
75    45
81    49
85    52
88    55
90    57

We assume that rainfall is the cause (X) and agricultural production is the
effect (Y). We plot the data on a graph paper. The scatter diagram looks
something like Fig. 2.2. We observe from Fig. 2.2 that the points do not lie
strictly on a straight line. But they show an upward rising tendency where a
straight line can be fitted.

Fig. 2.2: Scatter Diagram

When we fit a straight line to the data there is some sort of error we are
committing – the observations are not on a straight line but we are forcing a
straight line. The vertical difference between the regression line and the
observations is the ‘error’. Our objective is to minimize the error values. This
is usually done by the method of ‘least squares’. We will not go into the details
of the method here. Instead, two equations derived on the basis of least squares
method and known as normal equations are given below.
These are:

Σ Y = na + b Σ X    ...(1)
Σ XY = a Σ X + b Σ X²    ...(2)

From our sample survey we have data on the X and Y variables; we also know the
number of observations (n). The unknowns in the above two equations are ‘a’
and ‘b’; we estimate these two values.
Example 2.21: Estimate the regression equation from rainfall data given above.
We apply the normal equations to the rainfall data. For that purpose we prepare
a table as given below.
Table 9.2: Computation of the Regression Line
Xi    Yi    Xi²    Xi Yi    Ŷi
60    33    3600    1980    33.85
62    37    3844    2294    35.34
65    38    4225    2470    37.57
71    42    5041    2982    42.03
73    42    5329    3066    43.51
75    45    5625    3375    45.00
81    49    6561    3969    49.46
85    52    7225    4420    52.43
88    55    7744    4840    54.66
90    57    8100    5130    56.15
Total: Σ Xi = 750, Σ Yi = 450, Σ Xi² = 57294, Σ Xi Yi = 34526, Σ Ŷi = 450

We obtain the normal equations as

450 = 10a + 750b
34526 = 750a + 57294b

By solving these two equations we obtain the values a = -10.73 and b = 0.743.

Thus the estimated regression equation is

Ŷi = -10.73 + 0.743 Xi
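The normal equations can also be solved directly in a short Python sketch; the data are the rainfall and production figures above, and the closed-form solution of the two equations is used.

```python
# Fit Y = a + b*X by least squares using the normal equations
#   sum(Y)  = n*a + b*sum(X)
#   sum(XY) = a*sum(X) + b*sum(X^2)
rainfall = [60, 62, 65, 71, 73, 75, 81, 85, 88, 90]     # X (mm)
production = [33, 37, 38, 42, 42, 45, 49, 52, 55, 57]   # Y (tonnes)

n = len(rainfall)
sum_x, sum_y = sum(rainfall), sum(production)
sum_x2 = sum(x * x for x in rainfall)
sum_xy = sum(x * y for x, y in zip(rainfall, production))

# Solving the two normal equations for b, then a:
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(round(a, 2), round(b, 3))   # -10.75 0.743 (the worked example rounds a to -10.73)
```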

Multiple Regression
In many cases you have more than one independent variable which together explain the dependent variable. Such models are termed 'multiple regression'. A typical example of a multiple regression equation is Y = a + bX1 + cX2.

In the above equation Y is the dependent variable while X1 and X2 are the independent variables.

These days many statistical software packages are available which can compute the estimates for you. Once you have the estimates, you can formulate the regression equation.

2.6 SUMMARY

In this unit we discussed methods of summarising data, particularly measures of central tendency and dispersion. These measures are summary statistics of a dataset. In addition, we dealt with correlation and regression, which indicate the relationship between two variables. The correlation coefficient is a summary statistic of the strength of the relationship between two variables. A major limitation of the correlation coefficient is that it does not show a cause and effect relationship; it just says that both variables move together, either in the same direction (positive correlation) or in opposite directions (negative correlation). Regression analysis shows a cause and effect relationship: changes in the independent variable cause changes in the dependent variable. These measures help us in the interpretation of data.

Suggested Reading
Kothari, C. R. 1985. Research Methodology: Methods and Techniques. Delhi:
New Age International (P) Limited.
Nagar, A.L. and R.K. Dass, 1983, Basic Statistics, Oxford University Press,
Delhi.
Sundar Rao, P.S.S. and Richard, J. 1996. An Introduction to Biostatistics. New
Delhi: Prentice-Hall of India.

Sample Questions
1) Consider the following data set.
91 83 60 58 73 48 79 85 92 80.
On the basis of the above data
i) Calculate mean, median and mode.
ii) Calculate mean deviation, standard deviation and variance.
iii) Compute coefficient of variation.
2) The following are the number of injured persons in 50 accidents that took
place in New Delhi during 1st week of August.

No. of injured persons (x) 0 1 2 3 4 5 6


No. of accidents (f) 9 10 10 6 8 3 4

i) On an average how many persons were injured in an accident?


ii) Calculate the median and modal number of injured persons.
iii) Calculate the mean deviation, standard deviation and variance of the number of injured persons.
iv) Find the coefficient of variation of the above data.

3) Following are the data on hours worked by 50 workers for a period of a
month in a certain factory.
Hours worked Number of workers
(class interval) (Frequency)
40-60 2
60-80 2
80-100 5
100-120 5
120-140 12
140-160 10
160-180 10
180-200 4
Total 50

i) Calculate mean, median and mode hours worked.


ii) Calculate mean deviation, standard deviation and variance of the hours
worked.
iii) Find coefficient of variation of hours worked.
4) Following are the data of the hours worked by two workers for seven days
in a factory.
Worker A 8 5 7 4 6 9 5
Worker B 2 8 4 3 7 6 5
i) Find the average hours of work done by both workers.
ii) Which worker is more consistent (hint: the worker with less variance)?
5) Ten persons were advised by their physicians to lose weight for health
reasons. They enrolled in a special weight loss program. The following
table gives the time spent in the program (in days) and weight lost after
completion of the program (in kg.).
Time Spent(x) 25 39 12 30 52 41 67 92 10 11
Weight Loss (y) 12 18 5 14 20 17 25 47 7 6
i) Present a scatter plot for the data.
ii) Compute the correlation coefficient between the number of days enrolled and the weight lost.
Answers and Hints
1) i) 74.9, 79.5, mode does not exist
ii) 12.12, 14.13, 199.69
iii) 18.87%
2) i) 2.38
ii) median = 2. The data is bi-modal (1 and 2)
iii) 1.56, 1.83, 3.35
iv) 76.89%
3) i) 135.2, 138.33, 135.56
ii) 28.608, 35.51, 1260.96
iii) 26.27%

4) i) x̄A = 6.29, x̄B = 5

ii) CVA=26.50%, CVB= 40% and worker A is more consistent


5) i) Scatter diagram can be plotted in the same manner as in Example 2.19
ii) 0.9704

Measures of Central Tendency & Dispersion
Measures that indicate the approximate center of a distribution are called measures of central tendency.
Measures that describe the spread of the data are measures of dispersion. These measures include the mean,
median, mode, range, upper and lower quartiles, variance, and standard deviation.

A. Finding the Mean


The mean of a set of data is the sum of all values in a data set divided by the number of values in the set. It is also often referred to as an arithmetic average. The Greek letter μ ('mu') is used as the symbol for the population mean and the symbol x̄ is used to represent the mean of a sample. To determine the mean of a data set:

1. Add together all of the data values.

2. Divide the sum from Step 1 by the number of data values in the set.

Formula: x̄ = (Σ x) / n  (for a population, μ = (Σ x) / N)

Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14

The mean of this data set is 14.

B. Finding the Median


The median of a set of data is the “middle element” when the data is arranged in ascending order. To
determine the median:

1. Put the data in order from smallest to largest.


2. Determine the number in the exact center.
i. If there are an odd number of data points, the median will be the number in the absolute
middle.
ii. If there is an even number of data points, the median is the mean of the two center data
points, meaning the two center values should be added together and divided by 2.

Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14

Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 2: Determine the absolute middle of the data. 9, 10, 12, 13, 14, 14, 17, 17, 20

Note: Since the number of data points is odd choose the one in the very middle.

The median of this data set is 14.


C. Finding the Mode
The mode is the most frequently occurring measurement in a data set. There may be one mode; multiple
modes, if more than one number occurs most frequently; or no mode at all, if every number occurs only
once. To determine the mode:

1. Put the data in order from smallest to largest, as you did to find your median.
2. Look for any value that occurs more than once.
3. Determine which of the values from Step 2 occurs most frequently.

Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14

Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 2: Look for any number that occurs more than once. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 3: Determine which of those occur most frequently. 14 and 17 both occur twice.

The modes of this data set are 14 and 17.

D. Finding the Upper and Lower Quartiles


The quartiles of a group of data are the medians of the upper and lower halves of that set. The lower
quartile, Q1, is the median of the lower half, while the upper quartile, Q3, is the median of the upper
half. If your data set has an odd number of data points, you do not consider your median when finding
these values, but if your data set contains an even number of data points, you will consider both middle
values that you used to find your median as parts of the upper and lower halves.

1. Put the data in order from smallest to largest.


2. Identify the upper and lower halves of your data.
3. Using the lower half, find Q1 by finding the median of that half.
4. Using the upper half, find Q3 by finding the median of that half.

Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14

Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 2: Identify the lower half of your data. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 3: Identify the upper half of your data. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 4: For the lower half, find the median. 9, 10, 12, 13
Since there are an even number of data points in this half, you will find the median by summing the
two in the center and dividing by two. This is Q1.

Step 5: For the upper half, find the median. 14, 17, 17, 20
Since there are an even number of data points in this half, you will find the median by summing the
two in the center and dividing by two. This is Q3.

Q1 of this data set is 11 and Q3 of this data set is 17.
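The "median of each half" rule described above can be sketched in Python as follows; note that other quartile conventions exist, so library functions may return slightly different values.

```python
def median_of(sorted_values):
    """Median of an already-sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves
    (the middle value is excluded from both halves when n is odd)."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]
    upper = data[(n + 1) // 2 :]
    return median_of(lower), median_of(upper)

print(quartiles([17, 10, 9, 14, 13, 17, 12, 20, 14]))   # (11.0, 17.0)
```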


E. Finding the Range
The range is the difference between the lowest and highest values in a data set. To determine the range:

1. Identify the largest value in your data set. This is called the maximum.
2. Identify the lowest value in your data set. This is called the minimum.
3. Subtract the minimum from the maximum.

Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14

Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 2: Identify your maximum. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 3: Identify your minimum. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 4: Subtract the minimum from the maximum. 20 – 9 = 11

The range of this data set is 11.

F. Finding the Variance and Standard Deviation


The variance and standard deviation are a measure based on the distance each data value is from the
mean.

1. Find the mean of the data (μ if calculating for a population, or x̄ if using a sample).
2. Subtract the mean (μ or x̄) from each data value (xi).
3. Square each result from Step 2.
4. Add the values of the squares from Step 3.
5. Find the number of data points in your set, called n.
6. Divide the sum from Step 4 by n (if calculating for a population) or by n – 1 (if using a sample). This gives you the variance.
7. To find the standard deviation, take the square root of this number.

Formulas:
Sample variance, s²:                s² = Σ(xi - x̄)² / (n - 1)
Population variance, σ²:            σ² = Σ(xi - μ)² / N
Sample standard deviation, s:       s = √[ Σ(xi - x̄)² / (n - 1) ]
Population standard deviation, σ:   σ = √[ Σ(xi - μ)² / N ]

Example: Calculate the sample variance and sample standard deviation


Consider the sample data set: 17, 10, 9, 14, 13, 17, 12, 20, 14.

Step 1: The mean of the data is 14, as shown previously in Section A.


Step 2: Subtract the mean from each data value. 17 – 14 = 3; 10 – 14 = -4; 9 – 14 = -5; 14 – 14 = 0

13 – 14 = -1; 17 – 14 = 3; 12 – 14 = -2; 20 – 14 = 6; 14 – 14 = 0

Step 3: Square these values. 3² = 9; (-4)² = 16; (-5)² = 25; 0² = 0; (-1)² = 1; 3² = 9; (-2)² = 4; 6² = 36; 0² = 0

Step 4: Add these values together. 9 + 16 + 25 + 0 + 1 + 9 + 4 + 36 + 0 = 100

Step 5: There are 9 values in our set, so we will divide by 9 – 1 = 8. 100/8 = 12.5

Note: This is your variance.

Step 6: Take the square root of this number to find your standard deviation. √12.5 = 3.536

The variance is 12.5 and the standard deviation is 3.536.
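The divide-by-n versus divide-by-(n - 1) distinction is exactly what separates Python's statistics.pvariance from statistics.variance. A minimal sketch using the data set above:

```python
import statistics

data = [17, 10, 9, 14, 13, 17, 12, 20, 14]

sample_var = statistics.variance(data)    # divides by n - 1
sample_sd = statistics.stdev(data)
pop_var = statistics.pvariance(data)      # divides by n
pop_sd = statistics.pstdev(data)

print(sample_var, round(sample_sd, 3))      # 12.5 3.536
print(round(pop_var, 3), round(pop_sd, 3))  # 11.111 3.333
```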

G. Using the TI-84

1. Press the STAT key and select Edit…
2. Enter the data values in the L1 column.
3. Press the STAT key again; under CALC, select 1-Var Stats.
4. Make sure the List is L1, then select Calculate.

The output lists the mean (x̄), the sum of all data values (Σx), the sample standard deviation (Sx), the population standard deviation (σx), the number of data values (n), the smallest data value (minX), the lower quartile (Q1), the median (Med), the upper quartile (Q3), and the largest data value (maxX). The smallest and largest data values can be subtracted to find the range.


Content
• Hypothesis testing:
  o t test
  o z test
  o Chi-square test
  o Analysis of Variance (ANOVA)

Hypothesis Testing

A hypothesis test is a formal way to make a decision based on statistical


analysis. A hypothesis test has the following general steps:

• Set up two contradictory hypotheses. One represents our


“assumption”.
• Perform an experiment to collect data. Analyze the data using the
appropriate distribution.
• Decide if the experimental data contradicts the assumption or not.
• Translate the decision into a clear, non-technical conclusion.

Null and Alternative Hypotheses
Hypothesis tests are tests about a population parameter ( μ or p). We will do
hypothesis tests about population mean and population proportion p.

The null hypothesis (H0) is a statement involving equality (=, ≤ or ≥) about a population parameter. We assume the null hypothesis is true to do our analysis.

The alternative hypothesis (Ha) is a statement that contradicts the null


hypothesis. The alternative hypothesis is what we conclude is true if the
experimental results lead us to conclude that the null hypothesis (our
assumption) is false.

The alternative hypothesis must not involve equality; it uses ≠, < or >.

The exact statement of the null and alternative hypotheses depend on the claim
that you are testing.

Outcomes and the Type I and Type II Errors

Hypothesis tests are based on incomplete information, since a sample can never give
us complete information about a population. Therefore, there is always a chance that
our conclusion has been made in error.

There are two possible types of error:


The first possible error is if we conclude that the null hypothesis (our assumption) is
invalid (choosing to believe the alternative hypothesis), when the null hypothesis is
really true. This is called a Type I error.

Type I error = deciding to reject the null when the null is true (incorrectly supporting the alternative).

The other possible error is if we conclude that the null hypothesis (our assumption)
seems reasonable (choosing not to believe the alternative hypothesis), when the null
hypothesis is really false. This is called a Type II error.

Type II error = failing to reject the null when the null is false (incorrectly NOT supporting the alternative).

TYPE I and TYPE II ERROR IN TABULAR FORM
                Decision: Accept H0    Decision: Reject H0
H0 True         Correct decision       Type I error
H0 False        Type II error          Correct decision

When a null hypothesis is tested, there are four possible outcomes:

I. The null hypothesis is true but our test rejects it.
II. The null hypothesis is false but our test accepts it.
III. The null hypothesis is true and our test accepts it.
IV. The null hypothesis is false and our test rejects it.

Type I Error : Rejecting Null Hypothesis when Null Hypothesis is true. It is called ‘α-error’.

Type II Error : Accepting Null Hypothesis when Null Hypothesis is false. It is called ‘β-error’.

Outcomes and the Type I and Type II Errors Cont…
It is important to be aware of the probability of getting each type
of error. The following notation is used: α = P(Type I error), which is the significance level, and β = P(Type II error); the power of the test is 1 - β.

Outcomes and the Type I and Type II Errors Cont…

The significance level α is the probability that we incorrectly reject the assumption (null) and support the alternative hypothesis. In practice, a data scientist chooses the significance level based on the severity of the consequence of incorrectly supporting the alternative. In our problems, the significance level will be provided.

The power of a test is the probability of correctly supporting a true


alternative hypothesis. Usually we are testing if there is statistical evidence to
support a claim represented by the alternative hypothesis. The power of the
test tells us how often we “get it right” when the claim is true.

Distribution Needed for Hypothesis Testing
The sample statistic (the best point estimate for the population parameter, which
we use to decide whether or not to reject the null hypothesis) and distribution for
hypothesis tests are basically the same as for confidence intervals.

The only difference is that for hypothesis tests, we assume that the population mean (or population proportion) is known: it is the value supplied by the null hypothesis. (This is how we "assume the null hypothesis is true" when we are testing whether our sample data contradicts our assumption.)

When testing a claim about population mean μ, ONE of the following two
requirements must be met, so that the Central Limit Theorem applies and we can
assume the random variable, x̅ is normally distributed:

− The sample size must be relatively large (many books recommend at least 30
samples), OR
− the sample appears to come from a normally distributed population.

It is very important to verify these requirements in real life. In the problems we


are usually told to assume the second condition holds if the sample size is small.

Stating Hypotheses
The first step in conducting a test of statistical significance is to state
the hypotheses.
A significance test starts with a careful statement of the claims being compared.
The claim tested by a statistical test is called the null hypothesis (H0). The test is designed to assess the strength of the evidence against the null hypothesis. The null hypothesis is a statement of "no difference."

when conducting a test of significance, a null hypothesis is used. The term null
is used because this hypothesis assumes that there is no difference between the
two means or that the recorded difference is not significant.

Null Hypotheses denoted by H0.

The opposite of a null hypothesis is called the alternative hypothesis. The alternative hypothesis is the claim that researchers are actually trying to prove is true.
The claim about the population that evidence is being sought for is the alternative hypothesis (Ha).

Test Statistic
• It is a random variable that is calculated from sample data and used in a hypothesis test.
• The test statistic compares your data with what is expected under the null hypothesis.
• It is used to calculate the p-value.
• A test statistic measures the degree of agreement between a sample of data and the null hypothesis.

Different hypothesis tests use different test statistics based on the probability
model assumed in the null hypothesis. Common tests and their test statistics
are:
Hypothesis Test        Test Statistic
Z-test                 Z-statistic
t-test                 t-statistic
ANOVA                  F-statistic
Chi-square tests       Chi-square statistic

P-Value
The p-value is the probability, computed under the assumption that the null
hypothesis is true, of observing a value from the test statistic at least as
extreme as the one that was actually observed.

In other words, the p-value tells us how likely it is that we would observe a difference at least this large purely by chance when, in reality, there is none.

• When the p-value is between 0.01 and 0.05, the result is usually called significant.
• When the p-value is less than 0.01, the result is often called highly significant.
• When the p-value is less than 0.001, the result is taken as very highly significant.
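For example, a minimal Python sketch (the observed statistic of 2.31 is an assumed value) of how a two-sided p-value is obtained from a z test statistic under the null hypothesis:

from scipy import stats

z_observed = 2.31                                  # assumed test statistic from some sample
p_two_sided = 2 * stats.norm.sf(abs(z_observed))   # P(|Z| >= 2.31) when H0 is true
print(round(p_two_sided, 4))                       # about 0.021 -> significant at the 5% level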

Statistical Tests
• These are intended to decide whether a hypothesis about the distribution of one or more populations should be rejected or accepted.

Statistical tests fall into two broad groups:

Parametric tests are applied when the data:
• follow a normal distribution
• are quantitative
• are measured on a metric scale (i.e., interval / ratio scale)
• are compared using means and standard deviations
Parametric tests are more powerful.

Non-parametric tests are applied when the data:
• follow a skewed (non-normal) distribution
• are qualitative
• are measured on a nominal or ordinal scale
• are compared in terms of percentages, proportions, etc.
Non-parametric tests are comparatively less powerful.

Parametric tests
• Used for quantitative data
• Used for continuous variables
• Used when data are measured on (approximately) interval or ratio scales of measurement
• The data should follow a normal distribution

Some parametric tests are:
• t-test
• ANOVA (Analysis of Variance)
• Pearson's r correlation (r denotes the correlation coefficient, not rank)
• Z-test for large samples (n > 30)

Student's t-test
Developed by Prof. W.S. Gosset in 1908, who published statistical papers under the pen name "Student"; thus the test is known as Student's t-test.

When is the test applied?
1. When the samples are small.
2. When the population variances are not known.

Assumptions made in the use of the t-test:
1. Samples are randomly selected.
2. The data used are quantitative.
3. The variables follow a normal distribution.
4. The sample variances are approximately the same in both groups under study.
5. The samples are small, mostly smaller than 30.

Student's t-test Cont…

The t-test compares the difference between the means of two groups to determine whether that difference is statistically significant.

It is used for several different purposes:
− t-test for one sample
− t-test for unpaired two samples
− t-test for paired two samples

One-sample t-test
• Used when comparing the mean of a single group of observations with a specified value.
• In a one-sample t-test, we know the population mean. We draw a random sample from the population, then compare the sample mean with the population mean and make a statistical decision as to whether or not the sample mean is different from the population mean.

Formula:
t = (x̄ − µ) / (s/√n)
where x̄ = sample mean, µ = population mean, and s/√n = standard error of the mean.

We then compare the calculated value with the table value at a certain level of significance (generally 5% or 1%).

One-sample t-test Cont…

If the absolute value of t obtained is greater than the table value, reject the null hypothesis; if it is less than the table value, the null hypothesis may be accepted.
Therefore, the rule for rejecting the null hypothesis is:

Reject H0 if t ≥ +ve tabulated value
or
Reject H0 if t ≤ −ve tabulated value
or, equivalently, reject H0 when p < 0.05.
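A minimal Python sketch of the same decision (the eight observations and the hypothesised mean of 70 are made-up example values; scipy's ttest_1samp returns the t statistic and the two-sided p-value):

import numpy as np
from scipy import stats

sample = np.array([68, 72, 75, 70, 69, 74, 71, 73])   # hypothetical observations
mu0 = 70                                               # value of the population mean under H0

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)
print(t_stat, p_value)
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")   # decision rule from the notes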

t-test for unpaired two samples

• Used when two independent random samples come from normal populations having unknown but equal variances.

• We test the null hypothesis that the two population means are the same, i.e., μ1 = μ2.

Assumptions made for use:
1. Populations are distributed normally.
2. Samples are drawn independently and at random.

When is the test applied?
1. The standard deviations in the populations are the same and not known.
2. The size of the samples is small.

t-test for unpaired two samples Cont…

If two independent samples xi (i = 1, 2, …, n1) and yj (j = 1, 2, …, n2) of sizes n1 and n2 have been drawn from two normal populations with means μ1 and μ2 respectively, we test the

Null hypothesis
H0: μ1 = μ2

Under H0, the test statistic is
t = |x̄ − ȳ| / (S √(1/n1 + 1/n2))
where S is the pooled standard deviation of the two samples.
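An illustrative Python sketch (both groups are hypothetical data; equal_var=True matches the equal-variance assumption stated above):

from scipy import stats

group_x = [23, 25, 28, 30, 27, 26]   # assumed sample from population 1
group_y = [31, 29, 33, 35, 30, 32]   # assumed sample from population 2

# Pooled (equal-variance) two-sample t-test
t_stat, p_value = stats.ttest_ind(group_x, group_y, equal_var=True)
print(t_stat, p_value)               # small p-value -> reject H0: mu1 = mu2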
t-test for paired two samples
Used when measurements are taken from the same subjects before and after some manipulation or treatment.

Example: to determine the significance of a difference in blood pressure before and after administration of an experimental pressure substance.

t-test for paired two samples Cont…
Assumptions made for the test:
1. Populations are distributed normally.
2. Samples are drawn independently and at random.
When is the test applied?
1. The samples are related to each other (paired observations).
2. The sizes of the samples are small and equal.
3. The standard deviations in the populations are equal and not known.
Null hypothesis:
H0: μd = 0

Under H0, the test statistic is
t = |d̄| / (s/√n)
where d = difference between the paired observations (x1 − x2),
d̄ = average of d,
s = standard deviation of the differences,
n = sample size (number of pairs).
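A minimal sketch of the paired test on hypothetical before/after blood-pressure readings (scipy's ttest_rel works on the pairwise differences, exactly as in the formula above):

from scipy import stats

before = [150, 142, 138, 160, 155, 147]   # assumed readings before treatment
after  = [144, 139, 136, 152, 150, 145]   # assumed readings after treatment

t_stat, p_value = stats.ttest_rel(before, after)
print(t_stat, p_value)                    # small p-value suggests the mean difference is not 0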
ANOVA (Analysis of Variance)
• Developed by R.A. Fisher.
• Analysis of Variance (ANOVA) is a collection of statistical models
used to analyze the differences between group means or variances.
• Compares multiple groups at one time.

One-way ANOVA

Compares two or more unmatched groups when the data are classified by a single factor.

Example :
1. Comparing a control group with three different doses of aspirin
2. Comparing the productivity of three or more employees based on
working hours in a company

Two-way ANOVA

• Used to determine the effect of two nominal predictor variables on a continuous outcome variable.
• It analyses the effect of the independent variables on the expected outcome, along with their relationship to the outcome itself.

Example :
Comparing the employee productivity based on the working hours
and working conditions.

Assumptions of ANOVA:
• The samples are independent and selected randomly.
• Parent population from which samples are taken is of normal
distribution.
• Various treatment and environmental effects are additive in nature.
• The experimental errors are distributed normally with mean zero and variance σ².

ANOVA compares variances by means of the F-ratio:
F = variance between samples / variance within samples
(The exact form again depends on the experimental design.)

Null hypothesis:
H0: all population means are the same.
• If the computed F value is greater than the F critical value, we reject the null hypothesis.
• If the computed F value is less than the F critical value, the null hypothesis is accepted.
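For example, a short Python sketch of a one-way ANOVA on three hypothetical dose groups; scipy reports the F statistic and its p-value, which takes the place of the manual comparison with the F critical value:

from scipy import stats

control = [4.1, 3.9, 4.3, 4.0]   # assumed measurements for the control group
dose_1  = [4.6, 4.8, 4.5, 4.7]   # assumed measurements for dose 1
dose_2  = [5.2, 5.0, 5.4, 5.1]   # assumed measurements for dose 2

f_stat, p_value = stats.f_oneway(control, dose_1, dose_2)
print(f_stat, p_value)           # p < 0.05 -> at least one group mean differs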

Z-test

Z-test for large samples (n > 30)

• The Z-test is a statistical test in which the normal distribution is applied; it is basically used for problems relating to large samples, when the sample size is greater than or equal to 30.
• It is used when the population standard deviation is known.

Assumptions made for use:
1. The population is normally distributed.
2. The sample is drawn at random.

When is the test applied?
• The population standard deviation σ is known.
• The size of the sample is large (say n > 30).

Z-test for large samples (n > 30) Cont…

Let x1, x2, …, xn be a random sample of size n from a normal population with mean μ and variance σ², and let x̅ be the sample mean of the sample of size n.

Null hypothesis:
The population mean (μ) is equal to a specified value μ0:
H0: μ = μ0

Under H0, the test statistic is
Z = |x̅ − μ0| / (σ/√n)

If the calculated value of Z is less than the table value of Z at the 5% level of significance, H0 is accepted, and hence we conclude that there is no significant difference between the population mean and the value specified in H0 as μ0.
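A minimal sketch of this z-test computed directly in Python (the sample mean, σ, and n are assumed values):

import numpy as np
from scipy import stats

x_bar, mu0, sigma, n = 52.3, 50, 8, 64     # sample mean, H0 mean, known sigma, sample size

z = abs(x_bar - mu0) / (sigma / np.sqrt(n))
p_value = 2 * stats.norm.sf(z)             # two-sided p-value
print(z, p_value)                          # compare z with 1.96 at the 5% level of significance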

Non-Parametric Tests

Non-parametric tests can be applied when:
− the data do not follow any specific distribution and no assumptions about the population are made;
− the data are measured on any scale.

Commonly used non-parametric tests are:
1. Chi-square test
2. Mann-Whitney U test
3. Kruskal-Wallis one-way ANOVA
4. Friedman ANOVA
5. The Spearman rank-order correlation test

Chi-Square Test
• First used by Karl Pearson.
• Simplest and most widely used non-parametric test in statistical work.
• Calculated using the formula: χ² = Σ (O − E)² / E
  where
  O = observed frequencies
  E = expected frequencies
• The greater the discrepancy between the observed and expected frequencies, the greater the value of χ².

• The calculated value of χ² is compared with the table value of χ² for the given degrees of freedom.
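A small goodness-of-fit sketch using hypothetical observed and expected frequencies (scipy's chisquare implements the same Σ(O − E)²/E formula and also returns the p-value, which replaces the manual table lookup):

from scipy import stats

observed = [48, 35, 17]    # assumed observed frequencies
expected = [50, 30, 20]    # assumed expected frequencies (same total as observed)

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p_value)       # large chi-square / small p-value -> observed differs from expected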

Chi-Square Test Cont…
Applications of the chi-square test:
• Test of association (smoking & cancer, treatment & outcome of disease, vaccination & immunity).
• Test of proportions (compare the frequencies of diabetics & non-diabetics in groups weighing 40-50 kg, 50-60 kg, 60-70 kg & >70 kg).
• Chi-square test for goodness of fit (determine whether the actual numbers are similar to the expected/theoretical numbers).

Sources

1. These lecture notes are intended to be used with the open source textbook "Introductory Statistics" by Barbara Illowsky and Susan Dean (OpenStax College, 2013).
2. https://study.com/academy/lesson/what-is-a-hypothesis-definition-lesson-quiz.html
3. https://personal.utdallas.edu/~scniu/OPRE6301/documents/Hypothesis_Testing.pdf
4. http://isoconsultantpune.com/hypothesis-testing/
5. http://www.fosonline.org/wordpress/wp-content/uploads/2010/06/SalafaskyEtAl_ConsBiol_2002.pdf
