CHAPTER 1
Introduction to Statistics
1.0 Objectives
1.1 Introduction
1.2 Definitions of Statistics
1.3 Importance of Statistics
1.4 Scope of Statistics
1.5 Summary
1.6 Check Your Progress - Answers
1.7 Questions for Self – Study
1.8 Suggested Readings
1.0 OBJECTIVES
In everyday life we make use of numbers or figures. Such numbers are information
expressed in numerical form and are generally referred to as data in statistics. Data
may also be presented in the form of tables. After studying this chapter you will be able to
• explain terms such as statistics and statistical methods.
• discuss and explain the need for and scope of statistics in different fields.
1.1 INTRODUCTION
The word statistics seems to be derived from the word statist, the known use of
which dates back to 1602, when Shakespeare used it in ‘Hamlet’. The
numerical information used by statists for the administration of the state
was termed statistics. At present the word statistics is used in two senses. It means
numerical data pertaining to some department of inquiry, and it also means the science
of Statistics, which includes a number of statistical methods such as collection,
classification, analysis and interpretation of numerical data.
1.2 DEFINITIONS OF STATISTICS
Different persons have defined statistics in different ways. Some of them have
defined statistics as numerical data and others have defined it as a science. Most of
them have described statistics as it appealed to them. Therefore none of the definitions
defines statistics quite comprehensively. Some of these definitions are given
below.
a) Webster defined statistics as ‘The classified facts representing the condition
of people in a state, especially those facts which can be stated in numbers or in tables
or in any tabular or classified arrangement.’ This definition is too narrow as it confines
the scope of statistics only to such facts and figures as represent the condition of the
people in a state. Since statistics may represent various other facts such as biological,
physical, commercial and others, this definition is quite inadequate.
b) Prof. Horace Secrist has given a more comprehensive definition which reads
as ‘statistics are the aggregates of facts affected to a marked extent by multiplicity of
causes, numerically expressed, enumerated or estimated according to reasonable
standards of accuracy, collected in a systematic manner for a predetermined purpose
and placed in relation to each other.’ This definition clearly points out the
characteristics which numerical data must possess in order that they can be called
statistics.
Introduction to Statistics / 153
Check Your Progress – 1.1-1.2
1. What is statistics?
______________________________________________________
______________________________________________________
2. How is statistics defined by Webster?
______________________________________________________
______________________________________________________
1.3 IMPORTANCE OF STATISTICS
The importance of Statistics as a science lies in the service it has rendered to
mankind. In recent years, the growth of statistics has made itself felt in almost every
phase of human activity. It no longer consists merely of collecting data and
presenting them in tables or diagrams. It is now considered to encompass the science of
making decisions in the face of uncertainty. This covers considerable ground since
uncertainties are met when we roll a die, a doctor treats a patient, an actuary
determines life insurance premiums, meteorologists make weather forecasts, a broker
makes predictions about prices of shares, a newspaper predicts an election, and so
on.
Statistics, in its present state of development, can handle most of the situations
involving uncertainties. It at least provides the models that are needed to study
situations involving uncertainties. Statements in any form become stronger, more precise
and more appealing when they are supported by relevant statistics. Statistical methods
provide tools to summarise complex numerical data and to present them in a
manner which is easily intelligible. Statistics has provided techniques like statistical
quality control which have made revolutionary changes in the field of industrial
production. Statistical methods are widely used in the field of agriculture in estimating
the yield of a crop, in testing the effectiveness of fertilizers, methods of irrigation and
water management, in developing new varieties of seeds etc.
To sum up, one can say that it is hardly possible to single out a department of
human activity where statistics has not crept in. It has rather become indispensable
in all phases of human endeavour.
1.4 SCOPE OF STATISTICS
The field of application of statistics is expanding very fast in modern times. It is
of immense value not only to the administrators of a state but also to economists,
businessmen, scientists and research workers in sociology and psychology.
In industry, statistical methods are used in estimating the future demand for a
product and the need for raw materials, labour, finances etc. In large scale
production, statistical quality control techniques are used to reduce rejection of the
product and wastage of raw materials, which increases profits by reducing the
cost of production.
Statistical techniques are widely used in Economics also. Economics is mainly
concerned with the production and distribution of wealth and with the consumption,
saving and investment of income. Statistics is also used in formulating taxation
policies. The economists have to depend on statistics to a great extent in solving
problems confronted in production and distribution of essential commodities. The
policies for reducing unemployment and poverty and for controlling rising prices also
depend on Statistics to a fairly good extent.
Management techniques are developing very fast in the twentieth century
due to the enormous growth of industry and business. Decision making is the prime
function of any management, and statistical information and statistical techniques
provide a sound basis for all sorts of decisions. Since the complexities of the business
environment make the process of decision making difficult, the decision maker cannot
Mathematics & Statistics / 154
rely entirely upon his observation, experience or evaluation to make a decision.
Decisions have to be based on data which show relationships, indicate trends and show
rates of change in various relevant variables. Statistics provides methods for collecting,
presenting, analysing and meaningfully interpreting such data, which helps in better
decision making. The various statistical tools guide a manager in selecting the best
course of action under given circumstances. The decisions relating to production,
pricing, purchasing and controlling various activities are rendered easier with the help
of statistics.
There is a wide scope for application of statistical methods in the field of basic
sciences like Biology, Astronomy, Meteorology, Physics, Chemistry etc. Research
work carried out in different branches of science proves that it is impossible to conduct
any research without the help of statistics. It is used as a scientific method in
development of different branches of science. Statistical methods are used in
establishing laws and principles in science and also in validating the same.
In the field of medical science almost all the conclusions are based on
observations and experimentation. Statements like 'smoking is injurious to health' and
'chewing tobacco causes cancer' are statistical conclusions based on systematically
collected data. Statistical methods are used in planning experiments and analysing
the results for testing the effectiveness of different medicines and their hazards. There
can be no research in the medical sciences in modern times without the support
of statistics.
The usefulness of statistics as a scientific method of studies in sociology and
psychology has been widely recognised in modern age. The sociological studies are
based on properly designed sample inquiries which involve planning of inquiry,
collection of data, analysis and interpretation of these data. Since these sciences are not
exact sciences, the observed facts can be handled more purposefully only by using
statistical techniques. These studies are useful for planning and execution of social
welfare activities to be undertaken by government or private agencies.
Check Your Progress – 1.3 & 1.4
1. List the uses of statistical methods in the field of agriculture.
______________________________________________________
______________________________________________________
2. List the uses of statistical methods in the field of medical science.
______________________________________________________
______________________________________________________
3. List the branches of basic science in which statistical methods are used.
______________________________________________________
______________________________________________________
1.5 SUMMARY
This chapter explains the meaning of statistics, its importance to mankind and its
scope in different fields. Thus we see that statistics is a very important subject and
is useful in almost all areas.
1.6 CHECK YOUR PROGRESS - ANSWERS
1.1 & 1.2
1. The numerical information used by the statists for the purpose of administration
of state was termed as statistics.
2. Webster defined statistics as 'The classified facts representing the condition of
people in a state, especially those facts which can be stated in numbers or in
tables or in any tabular or classified arrangement.'
1.3 & 1.4
1. Statistical methods are widely used in the field of agriculture in estimating yield
of a crop, in testing effectiveness of fertilizers, methods of irrigation and water
management, in developing new varieties of seeds etc.
2. In the field of medical science, statistical methods are used in planning experiments
and analysing the results for testing the effectiveness of different medicines and
their hazards. They are also used in medical research.
3. Statistical methods are applied in basic sciences like Biology,
Astronomy, Meteorology, Physics, Chemistry etc.
1.7 QUESTIONS FOR SELF - STUDY
1. Explain the different meanings of the word Statistics.
2. Give the definitions of statistics by Webster and Secrist.
3. What is the role of statistics in economic planning?
4. Describe the scope of statistics in business and industry.
5. Explain the role of statistics in social sciences.
6. Describe the importance of statistics in management science.
1.8 SUGGESTED READINGS
1. Mathematics and Statistics by M. L. Vaidya, M. K. Kelkar
2. Statistical Analysis by S. P. Azen and A. A. Afifi
Chapter 2
Statistical Data
2.0 Objectives
2.1 Introduction
2.2 Nature of Subject
2.3 Language of Statistics
2.3.1 Population
2.3.2 Variables
2.3.3 Size of Population
2.3.4 Discrete and Continuous Variables
2.4 Classification of data
2.4.1 Classification by attributes
2.4.2 Classification of variables
2.5 Graphical representation of data
2.5.1 Histogram
2.5.2 Frequency polygon
2.5.3 Ogive curves
2.6 Diagrammatic representation of data
2.6.1 Simple bar diagram
2.6.2 Subdivided bar diagram
2.6.3 Pie diagram
2.7 Summary
2.8 Check your Progress - Answers
2.9 Questions for Self - Study
2.10 Suggested Readings
2.0 OBJECTIVES
After studying this chapter you will be able to :-
• explain terms - attributes, variables, raw data, classification of data, population,
sample
• draw graphical representation of data (Histogram, Frequency polygon and
Ogive curves)
• describe diagrammatic representation of data (Simple bar diagram, Subdivided
bar diagram, Pie diagram)
2.1 INTRODUCTION
Statistics is concerned with scientific methods for collecting, organising,
summarising, presenting and analysing data as well as drawing conclusions and
making reasonable decisions on the basis of such analysis.
2.2 NATURE OF SUBJECT
Suppose we want to compare the performance in Mathematics of two
divisions in the examination. The first thing we have to do is to collect the
marks of the students. The marks so collected are called “Data”. Hence the first step
of statistics is to collect the data. But merely by looking at the mark lists we will
not get any idea about the performance of students in the two divisions under
consideration.
We have to find,
i) number of students who failed,
Statistical Data / 159
i.e. students getting less than 40 marks.
ii) number of students who passed,
i.e. students getting 40 marks or more.
Here again we have to consider the class obtained by the students, namely
II class – 40 to 59 marks
I class – 60 to 74 marks
Distinction – 75 marks and above
All this means that we must make a “Classification” of the collected data.
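The classification rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_marks` and the list `sample_marks` are hypothetical and are not part of the original text; the cut-off marks 40, 60 and 75 are the ones stated above.

```python
def classify_marks(marks):
    """Return the result class for one student's marks."""
    if marks < 40:
        return "Fail"
    if marks < 60:
        return "II class"
    if marks < 75:
        return "I class"
    return "Distinction"

# Hypothetical marks, used only for illustration.
sample_marks = [35, 40, 59, 60, 74, 75, 92]
classes = [classify_marks(m) for m in sample_marks]
print(classes)
```

Every mark falls into exactly one class, which is what a classification requires.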
Hence, broadly speaking we can say that,
1) Collection of data
2) Classification of data
3) Diagrammatic representation of data
4) Analysis
5) Inference
are the different aspects of this subject (statistics).
Sampling : When a population is very large or infinite, it is practically not
possible to collect the desired information on all the units of the population. This may
happen even in the case of a small or finite population when measurement of a variable is
costly or, in some cases, destructive in nature. In such cases we select a small group of
units from the population to carry out the investigation. This small group of units
drawn from the population is called a sample. e.g. to find the average height of students
studying in a college, instead of measuring the height of all students of the
college (population) we select a small group of students (sample) and take
measurements on the selected students. Similarly, to estimate the average life of
electric bulbs produced by a company, a sample of bulbs is taken from the large
population of bulbs produced and the life time of each bulb in the sample is
determined by actually burning out the bulbs. The number of units in the sample is called
the size of that sample. In real life there are many situations where we use a sample
from a population for making a judgement about the population.
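The idea of judging a population from a sample can be sketched as follows. The heights are generated artificially and are purely hypothetical; the population size of 500, the sample size of 25 and the seed are assumptions made only for this illustration.

```python
import random
from statistics import mean

# Hypothetical population: heights (in cm) of 500 students,
# generated artificially for illustration only.
rng = random.Random(1)
population = [round(rng.gauss(165, 8), 1) for _ in range(500)]

# Draw a sample of 25 students without replacement.
sample = rng.sample(population, 25)

print("size of population :", len(population))
print("size of sample     :", len(sample))
print("population mean    :", round(mean(population), 2))
print("sample mean        :", round(mean(sample), 2))
```

The sample mean will usually be close to, but not exactly equal to, the population mean.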
Following are a few examples.
(1) For judging the quality of rice in a bag we pick up a handful of rice and judge
the quality of the rice in the bag.
(2) The average yield of a crop can be estimated by selecting a sample of farms and
finding the mean yield per hectare for these farms.
(3) A housewife confirms whether the food is properly cooked or not with the help
of a few particles taken out of the container. Clearly the food in the container is the
population and the food taken out of the container for inspection is the sample.
(4) For testing the quality of milk a small quantity of milk is tested instead of the entire
bulk.
The concept of population and sample can be easily understood from the
following diagram.
2.3 LANGUAGE OF STATISTICS
Every subject has its own special terminology, in which special words are used
for special purposes. The terms used in statistics are given below.
2.3.1 Population
In common language the word ‘Population’ means the number of people living in a
particular area. But in statistics we use the word ‘Population’ for any ‘collection of
articles (items) under consideration for the purpose of our study.’
e.g.- 1) Students in class
2) Workers in Industry
3) Radio sets
4) T. V. sets
5) Varieties of mobile phones available nowadays.
6) Agricultural field yields.
Each member, object or observation of the population is called ‘an
individual’ or ‘member’ or ‘element’ of that population.
The population is also called “a Universe”.
2.3.2 Variables
Each individual in the population is studied for a certain character or
characteristic.
e.g. Height
Weight
Marks in a subject
Rain fall in a region
Yield of production of variety of crop
Production in factory.
Variables are of two types
I) Quantitative – When the character can be measured or counted and expressed
as a number, it is called quantitative.
e. g. Height and weight of a student.
Temperature recorded in the month of May
II) Qualitative – When the character is qualitative in nature and
hence not expressible in numerical form, it is called qualitative
(an attribute)
e. g. Sex, Religion, Mother tongue, Faculty, Nationality.
None of these can be expressed numerically, but each will divide a population
into two or more groups; as
Sex – gives the no. of males and the no. of females.
Mother tongue – Marathi, Hindi, Tamil, Gujarati.
2.3.3 Size of Population
The number of units constituting the population is called the size of that
population.
e. g. B. C. A. = 100 Students.
Size of Population = 100
Such a population is called a ‘Finite Population’.
The other type is an ‘Infinite Population.’
e.g. N = Set of ‘Natural Numbers’
= {1, 2, 3, …}
The number of elements is infinite, i.e. the elements cannot all be counted.
2.3.4 Discrete and Continuous Variables
I) Discrete variable – A variable is said to be discrete if it takes distinct and
isolated values.
e. g. Number of daily accidents in a city.
Number of family members.
Number of decayed teeth of a child.
II) Continuous variable – A variable is said to be continuous when it takes all
possible values in an interval.
e.g. Weights of persons in a group
Temperature of a certain place
2.4 CLASSIFICATION OF DATA
The raw data are very difficult to understand and we cannot draw any
conclusions from them unless we process them. The data obtained after processing are
called secondary data.
Suppose a collection of observations on a certain characteristic is given. Such a set of
numbers does not by itself help in drawing any conclusion about the data. The data can be
made more meaningful by an ordered arrangement or by dividing them into different groups
or classes. This process is called ‘Classification of Data.’
2.4.1 Classification by attributes
When the characteristic under consideration is of qualitative type, i.e. an
attribute, the simplest way of classification is to put all the items or units possessing
that attribute in one class and the remaining items in another class. Such classification is
called simple classification by attribute or dichotomy. e.g. we may classify a group of
persons into two classes, males and females, according to the attribute sex. Similarly a
group of individuals may be classified into smokers and non-smokers according to the
attribute smoking habit.
If we classify a group of items or units or individuals into more than two classes
then such classification is called manifold classification. e.g. a group of persons may
be classified according to their mother tongue into different classes such as persons
having mother tongue Marathi, Tamil, Telugu, Punjabi etc.
In any type of classification by attributes, either dichotomy or manifold, the
important thing is that the classes should be defined unambiguously. The classes
should be mutually exclusive and exhaustive. An item should belong to one and only
one class and no item should be left out of the classification.
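A dichotomy and a manifold classification can be sketched in Python as below. The five records are hypothetical and serve only as an illustration; `collections.Counter` from the standard library does the counting.

```python
from collections import Counter

# Hypothetical records: (person, smoking habit, mother tongue).
persons = [
    ("A", "smoker", "Marathi"),
    ("B", "non-smoker", "Tamil"),
    ("C", "non-smoker", "Marathi"),
    ("D", "smoker", "Punjabi"),
    ("E", "non-smoker", "Telugu"),
]

# Dichotomy: exactly two classes for the attribute "smoking habit".
by_habit = Counter(habit for _, habit, _ in persons)

# Manifold classification: more than two classes for "mother tongue".
by_tongue = Counter(tongue for _, _, tongue in persons)

print(dict(by_habit))
print(dict(by_tongue))

# Mutually exclusive and exhaustive: every person is counted exactly once.
print(sum(by_habit.values()) == len(persons))
```

If the class totals did not add up to the number of persons, some item would have been missed or counted twice.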
2.4.2 Classification of variables
When the character under study is of quantitative type, i.e. a variable, the
classification is done according to the values of the variable.
In the case of discrete variables like chest size of banians in the stock held by a
hosiery shop, the variable assumes only a few values like 60, 65, 70, 75, 80, …, 100
cms. Here each possible value of the variable forms a class. These classes are said to
form a discrete series of observations on that variable. The no. of children in a family,
no. of accidents in a day in a city, size of footwear etc. are some examples of variables
which can be classified in this way.
In general we study two types of variables, discrete and
continuous.
Suppose we have the following information about the number of accidents per day
recorded during a month in a certain city.
1 0 3 1 3 4 3 4 0 2
2 2 3 2 3 4 3 5 0 5
4 2 1 1 4 3 2 3 4 5
We observe that the recorded data are of discrete type: the minimum number of
accidents is ‘0’ and the maximum is ‘5’, and there is no case in which the number of
accidents is 1.5 or -1. This means the variable under consideration takes only
non-negative integer values.
So the variable is of discrete type and the ‘Distribution’ is called an
‘Ungrouped Frequency Distribution’
Using tally marks we write this information in tabular form as shown in the
table below.
How to prepare a frequency distribution –
i) We find that the minimum number of accidents is 0 and the maximum is 5.
So head the first column ‘Number of accidents’ and enter the values 0, 1, 2, 3, 4, 5.
ii) Next read each observation in the data and make a mark ‘|’, called a tally
mark, in the next column against its value. Read all the data and make such marks.
iii) Head the third column ‘Frequency’; count the tally marks and write the
counts as frequencies.
iv) Check the work: the total of the frequency column
should be equal to the total number of observations given in the data.
Frequency distribution for number of accidents
Number of accidents   Tally marks   Number of days (Frequency)
0                     |||           3
1                     ||||          4
2                     |||| |        6
3                     |||| |||      8
4                     |||| |        6
5                     |||           3
Total                               30
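The steps for preparing the ungrouped frequency distribution can be sketched in Python. The data are the 30 observations given above; the use of `collections.Counter` is only one possible way of counting, chosen here for brevity.

```python
from collections import Counter

# Number of accidents per day, as given in the data above.
accidents = [
    1, 0, 3, 1, 3, 4, 3, 4, 0, 2,
    2, 2, 3, 2, 3, 4, 3, 5, 0, 5,
    4, 2, 1, 1, 4, 3, 2, 3, 4, 5,
]

# Steps ii) and iii): count how often each value occurs.
frequency = Counter(accidents)

# Step i): the first column runs from the minimum to the maximum value.
for value in range(min(accidents), max(accidents) + 1):
    print(value, "|" * frequency[value], frequency[value])

# Step iv): the frequencies must add up to the number of observations.
total = sum(frequency.values())
print("Total", total)
```

The printed tally marks are plain strokes; grouping them in fives is a matter of presentation only.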
Grouped Frequency Distribution –
(I) Inclusive Method
(II) Exclusive Method
This type of classification is most popular in practice.
The weights in kg of 50 students in a class are given in the following data-
49 57 59 57 50 45 52 58 56 48
54 50 51 64 49 58 47 53 63 64
49 62 62 54 53 51 53 61 49 47
48 54 48 53 49 46 53 47 51 52
49 56 45 49 51 55 52 46 48 54
To solve the above example, we observe that the minimum value in the data is 45
and the maximum is 64. We now classify the data according to class intervals.
We shall divide the numbers into groups such as 45-47, 48-50, …
1) Class limits – Each class has two class limits, a lower class limit and an upper
class limit.
Class : 45-47
where 45 – lower class limit
47 – upper class limit.
2) Class intervals (classes)
45-47, 48-50, 51-53, …, 63-65 are called class intervals or classes.
3) Frequency – The number of observations included in a class is called the
‘frequency’ of that class.
4) Class width – The difference between the upper boundary and the lower
boundary of a class is called the class width.
∴ Class width = upper boundary – lower boundary
e. g. class : 63-65, whose class boundaries (explained below) are 62.5-65.5
class width = upper boundary – lower boundary
= 65.5 – 62.5
= 3
For classes formed by the exclusive method the class limits are themselves the
class boundaries, so the class width is upper limit – lower limit.
5) Class mark (mid value) – The arithmetic mean or average of the upper and
lower limits of a class is called the class mark or mid value of that class (class
interval)
∴ class mark (class mid value) = (lower limit + upper limit) / 2
e.g. class : 45-47
class mark = (45 + 47) / 2
= 46
(I) Inclusive Method – When the upper limit of a class is included in that class,
the method is called the ‘inclusive method of classification’.
e.g. class : 51-53
Here the values 51, 52, 53 are all included in the class.
(II) Exclusive Method – When the upper limit of a class is excluded from that class,
the method is called the ‘exclusive method of classification’.
e.g. class : 54 – 56. Students weighing 54 kg or more but less than 56 kg are
counted in this class; a student weighing exactly 56 kg is not counted here
but in the next class, 56-58.
Class boundaries – In our example we considered the weights of students in kg.
The first class interval is 45-47 and the second is 48-50. A student weighing
47.5 kg would fall between these two classes. To cover such values we have to
make the classes 44.5-47.5, 47.5-50.5, … so that all values in between are
covered. These end points are called class boundaries.
To get the class boundaries of a class, we add 0.5 to the upper limit
and subtract 0.5 from the lower limit, 0.5 being half of the gap of 1 between the
upper limit of one class and the lower limit of the next.
Example : Find class boundaries of the following classes.
(1) 100–104, 105–109, 110–114, 115–119, 120–124
(2) 4–6, 6–8, 8–12, 12–16, 16–20
Solution :
(1) Here inclusive method of classification is used.
Hence class boundaries are
99.5–104.5, 104.5–109.5, 109.5–114.5, 114.5–119.5, 119.5–124.5
(2) Here exclusive method of classification is used
Hence class boundaries are same as class limits.
4–6,6–8,8–12,12–16,16–20
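The rule for finding class boundaries can be sketched as a small Python function. The name `class_boundaries` is hypothetical, and the function assumes that the gap between consecutive classes is the same throughout, as it is in both parts of this example.

```python
def class_boundaries(classes):
    """Convert (lower limit, upper limit) pairs into class boundaries.

    Half of the gap between the upper limit of one class and the lower
    limit of the next is subtracted from every lower limit and added to
    every upper limit. Assumes the gap is the same for all classes.
    """
    gap = classes[1][0] - classes[0][1]  # 1 for part (1), 0 for part (2)
    half = gap / 2
    return [(low - half, high + half) for low, high in classes]

inclusive = [(100, 104), (105, 109), (110, 114), (115, 119), (120, 124)]
exclusive = [(4, 6), (6, 8), (8, 12), (12, 16), (16, 20)]

print(class_boundaries(inclusive))
print(class_boundaries(exclusive))
```

For the exclusive classes the gap is zero, so the boundaries come out equal to the class limits.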
Frequency Distribution of weights
Class interval   Tally marks      Frequency   Class boundaries   Class mark
45-47            |||| ||          7           44.5-47.5          46
48-50            |||| |||| |||    13          47.5-50.5          49
51-53            |||| |||| ||     12          50.5-53.5          52
54-56            |||| ||          7           53.5-56.5          55
57-59            ||||             5           56.5-59.5          58
60-62            |||              3           59.5-62.5          61
63-65            |||              3           62.5-65.5          64
Total                             50
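The grouped frequency distribution of the weights can be checked with a short Python sketch. The data are the 50 weights given above; the variable names and the choice to build the classes with `range` are assumptions made for this illustration.

```python
# Weights (in kg) of 50 students, as given in the data above.
weights = [
    49, 57, 59, 57, 50, 45, 52, 58, 56, 48,
    54, 50, 51, 64, 49, 58, 47, 53, 63, 64,
    49, 62, 62, 54, 53, 51, 53, 61, 49, 47,
    48, 54, 48, 53, 49, 46, 53, 47, 51, 52,
    49, 56, 45, 49, 51, 55, 52, 46, 48, 54,
]

width = 3   # each inclusive class covers three whole values, e.g. 45, 46, 47
start = 45  # lower limit of the first class

# Inclusive classes 45-47, 48-50, ..., 63-65.
classes = [(low, low + width - 1) for low in range(start, max(weights) + 1, width)]

table = []
for low, high in classes:
    frequency = sum(low <= w <= high for w in weights)  # upper limit included
    boundaries = (low - 0.5, high + 0.5)
    class_mark = (low + high) / 2
    table.append((f"{low}-{high}", frequency, boundaries, class_mark))
    print(f"{low}-{high}  {frequency:2d}  {boundaries[0]}-{boundaries[1]}  {class_mark}")

print("Total", sum(row[1] for row in table))
```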
e.g. Following is a frequency distribution of no. of students according to their
pocket money (in Rs.)
Pocket money No. of students
50–55 7
55–60 20
60–70 30
70–100 5
In the above frequency distribution variable under study is pocket money (in
Rs.) of a student and method of classification is exclusive method. No. of students
belonging to respective classes are the class frequencies of those classes.
Cumulative frequencies : Class frequency is the no. of observations in that
class. But many times we may be interested in knowing how many items have their
values less than (or more than) the given value. e.g. We may be interested in finding
no. of students having marks less than 60 or no. of students having marks more than
70. These numbers are also frequencies and are called cumulative frequencies.
The frequencies which give the no. of observations less than given value are
called "less than" cumulative frequencies and those giving the no. of items having
values more than the given value are called "more than" cumulative frequencies.
The computation of less than and more than cumulative frequencies is shown below.
Height (in cms)   No. of persons   Less than cumulative frequency   More than cumulative frequency
140–145           3                3                                67
145–150           13               16                               64
150–155           30               46                               51
155–160           16               62                               21
160–165           5                67                               5
The less than cumulative frequency of a given class is the no. of observations
having their values less than the upper boundary of the given class. In above example
less than cumulative frequency of class 150 –155 is 46 means there are 46 persons
having height less than 155 cms.
Similarly more than cumulative frequency of a given class is the no. of
observations having their values more than the lower boundary of the given class.
In above example the more than cumulative frequency of class 150 –155 is 51
means there are 51 persons having height more than 150 cms.
The table showing classes together with their cumulative frequencies is called
cumulative frequency distribution.
Example:
Following is a frequency distribution of no. of screws according to their length (in
cms).
Find the less than and more than cumulative frequency distributions.
Length (in cms) No. of screws
1.0 – 1.5 12
1.5 – 2.0 32
2.0 – 2.5 27
2.5 – 3.0 10
3.0 – 3.5 9
Solution :
Length (in cms)   No. of screws (frequency)   Less than cumulative frequency   More than cumulative frequency
1.0–1.5           12                          12                               90
1.5–2.0           32                          44                               78
2.0–2.5           27                          71                               46
2.5–3.0           10                          81                               19
3.0–3.5           9                           90                               9
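The two cumulative columns are running totals of the frequencies, taken from the top for "less than" and from the bottom for "more than". A minimal Python sketch using `itertools.accumulate` from the standard library is given below; the variable names are chosen only for this illustration.

```python
from itertools import accumulate

classes = ["1.0-1.5", "1.5-2.0", "2.0-2.5", "2.5-3.0", "3.0-3.5"]
frequencies = [12, 32, 27, 10, 9]

# Less than cumulative frequencies: running total from the first class.
less_than = list(accumulate(frequencies))

# More than cumulative frequencies: running total from the last class.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for row in zip(classes, frequencies, less_than, more_than):
    print(*row)
```

The last "less than" value and the first "more than" value both equal the total frequency, which gives a quick check on the arithmetic.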
Check your Progress – 2.4
Fill in the blanks
1. Colour of eyes is a ……………………. variable.
2. Marks of students is a ………………….. variable.
3. Selecting items at random from a population is called ………………… sampling.
4. The number of observations belonging to a particular class is called
…………………..
5. The midpoint of a class interval is called ……………………….
2.5 GRAPHICAL REPRESENTATION OF DATA
The frequency distribution itself brings out some important features of raw
data. However these features can be studied more conveniently if we represent it in
diagrammatic or graphical form. Many questions about the data can be answered by
means of these graphs. The various types of graphs used for presenting a frequency
distribution are (i) Histogram (ii) Frequency polygon (iii) Ogive curves.
2.5.1 Histogram
This is a simple method of representing a frequency distribution graphically. In
this graph classes are represented by a series of adjacent rectangles. The base of each
rectangle is the class interval of that class. The area of each rectangle is proportional
to the frequency of that class. Hence when the class intervals are uniform throughout
the distribution, the height of each rectangle is proportional to the frequency of the
corresponding class.
But when the class intervals are not uniform the height of each rectangle is
proportional to the frequency density of that class. Frequency density of a class is the
ratio of the class frequency to the class width.
The histogram shows clearly the class with the maximum
concentration of frequency. This class is identified by the rectangle with the maximum
height, irrespective of whether the class intervals are equal or not. The histogram can
also be used to determine the mode of the distribution.
In the case of a discrete frequency distribution the rectangles are reduced to vertical
lines as the class intervals are reduced to zero width. If class intervals are of the type 5–9,
10–14, 15–19 etc. they are converted into continuous intervals by finding class
boundaries as 4.5–9.5, 9.5–14.5, 14.5–19.5 respectively in order to have the
rectangles adjacent to each other.
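The role of frequency density for unequal class intervals can be sketched in Python. The classes and frequencies used here are those of the pocket money distribution given earlier in this chapter (50–55, 55–60, 60–70, 70–100); the variable names are assumptions made for this illustration.

```python
# Exclusive classes with unequal widths and their frequencies.
classes = [(50, 55), (55, 60), (60, 70), (70, 100)]
frequencies = [7, 20, 30, 5]

# Frequency density = class frequency / class width.
densities = [f / (high - low) for (low, high), f in zip(classes, frequencies)]

for (low, high), f, d in zip(classes, frequencies, densities):
    print(f"{low}-{high}  frequency {f:2d}  width {high - low:2d}  density {d:.2f}")

# The tallest rectangle belongs to the class with the highest density,
# which need not be the class with the highest frequency.
tallest = classes[densities.index(max(densities))]
most_frequent = classes[frequencies.index(max(frequencies))]
print("tallest rectangle:", tallest)
print("largest frequency:", most_frequent)
```

Here the class 60–70 has the largest frequency, but the class 55–60 has the tallest rectangle because its 20 observations are packed into a narrower interval.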
Example :
Following is a frequency distribution of no. of students according to their
marks in a test. Draw histogram for it.
Marks No. of students
10–20 5
20–30 13
30–40 37
40–50 14
50–60 6
Histogram :
2.5.2 Frequency Polygon
A frequency polygon is plotted by representing every class by a point on graph
paper. The class mark or mid value of the class interval is taken as the X-co-ordinate and
the frequency of the class as the Y-co-ordinate of the point representing the class.
Consider two imaginary classes, one at each end of the given distribution, with
frequency zero. These are represented by two points on the X-axis, one at each end. The
consecutive points are then connected by segments of straight lines. The figure
enclosed by these lines and the X-axis is in the form of a polygon and is called the
frequency polygon.
If the points representing different classes are joined by a smooth curve, the
curve is called a frequency curve.
From both of these graphs we can answer queries about the symmetry of the
distribution, the points of maximum concentration of the frequency and the nature of
the frequency distribution.
Example :
Represent following frequency distribution by means of frequency polygon.
Class Frequency
20–30 3
30–40 10
40–50 23
50–60 17
60–70 6
70–80 11
Solution :
First find the class marks of different classes.
Class Class Mark Frequency
20–30 25 3
30–40 35 10
40–50 45 23
50–60 55 17
60–70 65 6
70–80 75 11
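The points to be joined for this frequency polygon, including the two imaginary end classes with zero frequency, can be computed with the following Python sketch. The variable names are hypothetical; drawing the polygon itself is left to graph paper or any plotting tool.

```python
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
frequencies = [3, 10, 23, 17, 6, 11]

# One point per class: (class mark, frequency).
points = [((low + high) / 2, f) for (low, high), f in zip(classes, frequencies)]

# Two imaginary classes of the same width, one at each end, with frequency 0.
width = classes[0][1] - classes[0][0]
first_mark = points[0][0]
last_mark = points[-1][0]
polygon = [(first_mark - width, 0)] + points + [(last_mark + width, 0)]

for x, y in polygon:
    print(x, y)
```

Joining consecutive points by straight lines closes the figure on the X-axis at 15 and 85.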
2.5.3 Ogive curve
An Ogive curve is also called a cumulative frequency curve. It is a smooth free
hand curve passing through the points which have the upper (lower) class boundary as
X-co-ordinate and the less than (more than) cumulative frequency as Y-co-ordinate.
Accordingly the curve is called a less than or more than Ogive curve.
The less than Ogive curve goes on rising from left to right; on the other
hand the more than Ogive curve goes on declining from left to right. The Ogive curves
are very useful as we can determine partition values like the median, quartiles etc. from
them. We can also find the number and percentage of observations which lie between
two given values of the variable.
Example 1 : Draw less than Ogive curve for the following distribution of daily
wages (in Rs.) of workers in a small scale industry.
Wages     No. of workers   Less than cumulative frequency
10–15     8                8
15–20     17               25
20–25     27               52
25–30     13               65
30–35     3                68
35–40     2                70
Fig. 2.4 Less than Ogive curve (X-axis : wages, 1 unit = 5; Y-axis : less than cumulative frequency, 1 unit = 10)
Example 2 : Draw more than Ogive curve for the following data.
Class       0–10   10–20   20–30   30–40   40–50
Frequency   5      13      17      3       2
Solution :
Class     Frequency   More than cumulative frequency
0–10      5           40
10–20     13          35
20–30     17          22
30–40     3           5
40–50     2           2
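The points of both Ogive curves for this distribution, and a reading of the median from the less than curve, can be sketched in Python. The free hand curve is replaced here by straight line segments between the plotted points, which is an assumption made only to keep the sketch simple; the function name `value_at` is hypothetical.

```python
from itertools import accumulate

boundaries = [0, 10, 20, 30, 40, 50]
frequencies = [5, 13, 17, 3, 2]

# Cumulative frequencies at each class boundary.
less_than = [0] + list(accumulate(frequencies))   # observations below the boundary
total = less_than[-1]
more_than = [total - c for c in less_than]        # observations at or above the boundary

def value_at(cumulative_target):
    """Read off the x-value where the less than curve reaches the target.

    Uses straight line interpolation inside the class that contains it.
    """
    for i in range(1, len(boundaries)):
        if less_than[i] >= cumulative_target:
            lower, upper = boundaries[i - 1], boundaries[i]
            below = less_than[i - 1]
            share = (cumulative_target - below) / frequencies[i - 1]
            return lower + share * (upper - lower)
    raise ValueError("target exceeds the total frequency")

median = value_at(total / 2)
print(list(zip(boundaries, less_than)))
print(list(zip(boundaries, more_than)))
print("median is approximately", round(median, 2))
```

The more than values at the lower boundaries 0, 10, 20, 30 and 40 agree with the table in the solution.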
Check Your Progress - 2.5
Write True or False
1. Histogram is a simple method to represent a frequency distribution.
2. Frequency polygon is three dimensional graph.
3. Ogive curve is also called as cumulative frequency curve.
4. There are four types of Ogive curves.
5. Frequency polygon is a line graph.
2.6 DIAGRAMMATIC REPRESENTATION OF DATA
A frequency distribution can be represented by a graph. But categorical data
cannot be represented by graphs. e.g. the distribution of the population of a country
according to religion cannot be represented by a graph. Such data can be represented by
means of a diagram very attractively. Diagrams are easy to remember as they create a
longer lasting impression on the mind. Statistical data are made easily intelligible by means
of diagrams. Following are some commonly used diagrams to represent statistical data.
(1) Simple bar diagram
(2) Sub divided bar diagram
(3) Pie diagram
2.6.1 Simple Bar Diagram
This is the simplest way of presenting statistical data classified according
to a single characteristic. It can be used to present data such as the population of different
cities, the exports of different countries etc. It can be used to represent any single series,
but generally it is used to show a categorical series.
In drawing a simple bar diagram, quantities are represented by rectangular
vertical bars separated from each other by a uniform distance. The height of a bar is
proportional to the magnitude it represents. The width must be the same for
all bars as it does not have any significance. It is more convenient to use graph paper for
drawing a bar diagram. Usually the values of the variable are marked along the Y-axis and the
factor of classification or category is marked on the X-axis. The scale of the Y-axis must
have zero as its starting point.
This diagram is also known as a one dimensional diagram as it represents only
one characteristic. Unless the order of the bars has any significance, it is suggested that
the bars be arranged in increasing or decreasing order of the magnitudes
represented by them. This makes the diagram more attractive and also facilitates
comparison.
Example :
Present the following information by a bar diagram.
Country     Birth rate (per 1000)
Iran                30
Libya               20
Malaysia            40
Mexico              30
Sweden              15
Solution :
[Figure: simple bar diagram of birth rate (per 1000) by country]
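As a rough stand-in for the drawn diagram, the following Python sketch prints a text version of a simple bar diagram for the birth rate data. The helper name `text_bar_diagram` and the choice of one `#` per 5 units are assumptions made only for illustration. The bars are horizontal because that is easier to print, and they are arranged in decreasing order of magnitude as suggested in the text.

```python
# Birth rate (per 1000) by country, taken from the example.
birth_rates = {"Iran": 30, "Libya": 20, "Malaysia": 40, "Mexico": 30, "Sweden": 15}


def text_bar_diagram(data, unit=5):
    """Return one text bar per category; each '#' stands for `unit` units.

    Hypothetical helper, for illustration only.
    """
    # Arrange the bars in decreasing order of magnitude.
    ordered = sorted(data.items(), key=lambda item: item[1], reverse=True)
    width = max(len(name) for name in data)
    bars = []
    for name, value in ordered:
        bar = "#" * round(value / unit)  # bar length proportional to the value
        bars.append(f"{name:<{width}} | {bar} {value}")
    return bars


bars = text_bar_diagram(birth_rates)
print("\n".join(bars))
```

Every bar starts from zero and uses the same scale, so the bar lengths can be compared directly.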
2.6.2 Subdivided bar diagram
In many cases we have to represent a whole quantity and its subdivisions in
the same diagram. In that case we can use a bar diagram to represent the whole quantities,
and the subdivisions can be represented proportionally by dividing each bar into a
number of parts. This type of bar diagram is called a subdivided bar diagram.
The subdivided bar diagram is drawn using the following steps.
(1) Draw one bar for each whole quantity with its height proportional to the
magnitude it represents.
(2) Using the same scale, divide each bar into different parts proportionally. The order
of subdivisions from bottom to top should be the same in each bar.
(3) Use different notations like horizontal, vertical or slanting lines, dots or
colours for showing the subdivisions.
(4) Give the title, scale and explanation of the notation used at a suitable place in the
diagram.
A subdivided bar diagram can be used (1) to represent data on the population of
different states in a country with its subdivisions according to religion, (2) to represent
data on students studying in a college over a number of years with subdivisions according to
classes.
Example:
Represent the following data by a subdivided bar diagram.
             No. of Students
Year         FY     SY     TY     Total
1995–96     230    115     55      400
1996–97     210    120     60      390
1997–98     190    110     50      350
2.6.3 Pie diagram
A pie diagram is a special type of diagram in which a whole quantity is represented by a
circle and the subdivisions of the whole quantity are shown by sectors of that circle. The
whole circle is divided into different sectors, the areas of which are proportional to the
magnitudes they represent. It is very easy to divide a circle into sectors as the area of
each sector is proportional to the angle it subtends at the centre. Hence dividing the
circle into sectors reduces to dividing the angle of 360° into proportional parts. The angle
for a particular sector is given by the relation
$$\text{angle} = \frac{\text{partial quantity}}{\text{total quantity}} \times 360^\circ$$
This diagram is a two dimensional diagram because in this case the area of a sector
represents the quantity. A pie diagram can be used to represent the subdivisions of a total
budget, total expenditure, total income etc. The name pie diagram is derived
from the word pie, which means a cake or a slice of it with a layer of custard on it. For
drawing a pie diagram the first step is to convert all the sub quantities into angles
using the above formula. Then draw a circle of suitable size. Start measuring the angles from
some reference line drawn from the centre to the circumference. These angles divide the circle into
different sectors. We may mark the sectors by different signs such as dots, crosses,
parallel lines or different colours. The quantities represented by the sectors may be written
inside the sectors of the circle.
Example :
Draw Pie diagram for the following data on percentage of expenditure on
different items in an average family budget.
Items :           Food   Rent   Clothing   Fuel   Others
% expenditure :    40     20       15       10      15
Solution : The first step is to convert the quantities into proportional angles,
e.g. the angle for the item food is
$$\frac{40}{100} \times 360^\circ = 144^\circ$$
The angles for the remaining items are obtained as follows.
Item        % expenditure   Angle
Food              40         144°
Rent              20          72°
Clothing          15          54°
Fuel              10          36°
Others            15          54°
The Pie diagram representing these data is given below.
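The conversion of quantities into sector angles can be sketched in a few lines of Python. This is only an illustrative sketch; the function name `pie_angles` is an assumption and not part of the text.

```python
def pie_angles(quantities):
    """Return the sector angle in degrees for each item.

    angle = (partial quantity / total quantity) * 360
    """
    total = sum(quantities.values())
    return {item: value / total * 360 for item, value in quantities.items()}


# Percentage expenditure from the family budget example.
budget = {"Food": 40, "Rent": 20, "Clothing": 15, "Fuel": 10, "Others": 15}
angles = pie_angles(budget)

for item, angle in angles.items():
    print(f"{item:<9}{angle:6.1f} degrees")
```

Because each angle is computed from the total of the supplied quantities, the same function works whether the data are percentages or raw amounts, and the angles always add up to 360°.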
Illustrative examples 2.0
Example 1 Frequency distribution of scores of 80 candidates is given below.
Score No. of candidates
60–69 3
70–79 7
80–89 16
90–99 20
100–109 14
110–119 11
120–129 7
130–139 2
(1) Find class boundaries of all the classes.
(2) What is the lower class boundary of 4th class?
(3) What is width of 3rd class?
(4) What is the class mark of 6th class?
(5) Find less than cumulative frequencies.
Solution :
Class         Class boundaries    Frequency   Less than cumulative frequency
60 – 69        59.5 – 69.5            3                  3
70 – 79        69.5 – 79.5            7                 10
80 – 89        79.5 – 89.5           16                 26
90 – 99        89.5 – 99.5           20                 46
100 – 109      99.5 – 109.5          14                 60
110 – 119     109.5 – 119.5          11                 71
120 – 129     119.5 – 129.5           7                 78
130 – 139     129.5 – 139.5           2                 80
(1) Class boundaries are obtained in the above table.
(2) Lower boundary of the 4th class = 89.5
(3) Width of the 3rd class = 89.5 – 79.5 = 10
(4) Class mark of the 6th class = $\frac{110 + 119}{2}$ = 114.5
(5) Less than cumulative frequencies are obtained in the above table.
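The class boundaries, widths and class marks found above can be checked with a short Python sketch. It assumes inclusive class limits with a constant gap between consecutive classes (here 1); the variable names are chosen only for illustration.

```python
# Inclusive class limits from Example 1.
limits = [(60, 69), (70, 79), (80, 89), (90, 99),
          (100, 109), (110, 119), (120, 129), (130, 139)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = limits[1][0] - limits[0][1]

# Class boundaries: move each limit outward by half the gap.
boundaries = [(low - gap / 2, high + gap / 2) for low, high in limits]
# Class width = upper boundary - lower boundary.
widths = [upper - lower for lower, upper in boundaries]
# Class mark = midpoint of the class.
marks = [(low + high) / 2 for low, high in limits]

print(boundaries[3][0])  # lower boundary of the 4th class
print(widths[2])         # width of the 3rd class
print(marks[5])          # class mark of the 6th class
```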
Example 2 Draw a histogram and frequency polygon for the following data.
Size of farm (in hectares)   1–20   21–40   41–60   61–80   81–100
No. of farms                  13     38      16       5       3
Solution : Here the first step is to find the class boundaries.
Size of farm   No. of farms   Class boundaries   Class mark
1 – 20              13          0.5 – 20.5          10.5
21 – 40             38         20.5 – 40.5          30.5
41 – 60             16         40.5 – 60.5          50.5
61 – 80              5         60.5 – 80.5          70.5
81 – 100             3         80.5 – 100.5         90.5
Example 3 Draw less than and more than Ogive curves for the following data
Class     Frequency
20–25         4
25–30         9
30–35        13
35–40        18
40–45         6
45–50         3
50–55         2
Solution : First we shall find the less than and more than cumulative frequencies.
Class     Frequency   Less than C.F.   More than C.F.
20–25         4             4               55
25–30         9            13               51
30–35        13            26               42
35–40        18            44               29
40–45         6            50               11
45–50         3            53                5
50–55         2            55                2
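The two cumulative frequency columns, and the points to be plotted for each Ogive curve, can be generated as in the following Python sketch. It is a minimal illustration using the data of Example 3; the variable names are assumptions.

```python
from itertools import accumulate

# Classes and frequencies of Example 3.
classes = [(20, 25), (25, 30), (30, 35), (35, 40), (40, 45), (45, 50), (50, 55)]
freqs = [4, 9, 13, 18, 6, 3, 2]

# Less than c.f.: running total starting from the first class.
less_than_cf = list(accumulate(freqs))
# More than c.f.: running total starting from the last class.
more_than_cf = list(accumulate(reversed(freqs)))[::-1]

# Points to plot: upper boundaries for the less than Ogive,
# lower boundaries for the more than Ogive.
less_than_points = [(upper, cf) for (_, upper), cf in zip(classes, less_than_cf)]
more_than_points = [(lower, cf) for (lower, _), cf in zip(classes, more_than_cf)]

print(less_than_cf)
print(more_than_cf)
```

The first list rises from 4 to 55 and the second falls from 55 to 2, which matches the rising and declining shapes of the two curves.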
Example 4 : Draw a simple bar diagram for the following data on the no. of students
enrolled in a certain course in different years.
Year    No. of students
1994         140
1995         210
1996         170
1997         200
1998         180
Solution:
Example 5 : Present the following data using a suitable diagram.
Class    F.Y.   S.Y.   T.Y.
Pass     250    200    100
Fail     100    150     80
Total    350    350    180
Solution :
Example 6 : Draw Pie diagram for the following data
Country :                      India   Sri Lanka   USA   U.K.   Mexico
Population growth rate (%) :    2.2       1.8      2.0    1.8     3.2
Solution : Here the first step is to find the angles for the different countries.
Country      Population growth rate (%)   Angle (in degrees)
India                 2.2                 (2.2 / 11.0) × 360 = 72.00
Sri Lanka             1.8                 (1.8 / 11.0) × 360 = 58.91
USA                   2.0                 (2.0 / 11.0) × 360 = 65.45
UK                    1.8                 (1.8 / 11.0) × 360 = 58.91
Mexico                3.2                 (3.2 / 11.0) × 360 = 104.73
Total                11.0                 360.00
2.7 SUMMARY
Graphs are very strong statistical tools for presenting given frequency data.
We collect data by different methods, draw a sample from the data by
random sampling and then process the raw data. We then calculate measures of central
tendency and, depending on the type of variable, draw particular types of
graphs to read off values of the variable.
2.8 CHECK YOUR PROGRESS – ANSWERS
2.4
1. Qualitative
2. Quantitative
3. Simple random sampling
4. Class frequency
5. Class marks.
2.5
1. True
2. False
3. True
4. False
5. False
2.9 QUESTIONS FOR SELF - STUDY
1. Explain different methods of classification briefly. Give suitable examples.
2. Explain the following terms with illustrations.
(i) Attribute (ii) Variable (iii) Class limits (iv) Class width (v) Class frequency
(vi) Class mark (vii) less than and more than cumulative frequency
3. Following is a frequency distribution of heights in cm.
Height No. of persons
150–154 2
155–159 17
160–164 29
165–169 21
170–174 1
(i) Find class boundaries of each class.
(ii) Determine class width of each class
(iii) Find less than and more than cumulative frequencies
4. Write short notes on
(i) Histogram
(ii) frequency polygon
(iii) less than Ogive curve
(iv) more than Ogive curve
5. Draw histogram and frequency polygon for the following data
Class Frequency
5–10 8
10–15 16
15–20 20
20–25 26
25–30 10
30–35 5
6. Draw less than and more than Ogive curves for the following frequency
distribution.
Marks No. of students
0–20 2
20–40 18
40–60 42
60–80 28
80–100 5
7. Draw histogram for the following data
Weight (in Kg) No. of students
30–35 3
35–40 7
40–45 23
45–50 17
50–55 8
55–60 2
8. Frequency distribution of screws according to their length in cms is given
below
Length (in cm)   No. of screws
4.0–4.1 13
4.1–4.2 23
4.2–4.3 42
4.3–4.4 67
4.4–4.5 30
4.5–4.6 13
4.6–4.7 12
(i) Determine class boundaries of all the classes
(ii) What is the width of 4th class?
(iii) Find less than cumulative frequencies
(iv) Draw more than Ogive curve.
(v) Draw histogram
9. Draw suitable diagrams in each of the following cases
(i) Following are the results of a survey regarding viewership of different
programmes telecast by Doordarshan
Mahabharata 96 %
Hindi film 65 %
Chitrahar 55 %
Rangoli 36 %
Hindi Serials 50 %
Hindi News 35 %
English News 20 %
(ii) The following table shows the cost of goods produced in a factory in
different years
Year Cost of goods (in Rs.)
1995 20000
1996 14000
1997 26000
1998 21000
1999 34000
2000 37000
(iii) Category          Revenue in % to total
      Corporate tax          43.5
      Income tax             35.0
      Excise duty             9.5
      Custom                 12.0
(iv)  Year       No. of students in course
                 MCM    MCA    BCA    Total
      2008-09    120    125    105     350
      2009-10     90     75     90     225
      2010-11    100     90    120     310
      2011-12    130    120    150     350
(v)   Item          Food   Clothing   Recreation   House rent
      Expenditure    500     325         150          400
10. Write short notes on
(i) Simple bar diagram
(ii) Subdivided bar diagram
(iii) Pie- diagram
11. Construct subdivided bar diagram for the following data
Year Import Export
1990 25 23
1995 35 37
2000 31 30
2005 28 32
2010 32 30
12. By the Economic budget of Maharashtra state of 2013-14, ‘One Rupee comes
from and one Rupee goes to’ is given below-
Rupee comes from
No. Tax & Revenue Amount (Rs.)
1 Internal debt of the state 14.42%
2 State’s own tax revenue 55.23%
3 Loans and advance by state government 0.43%
4 Grants-in-aid from central government 9.59%
5 Share of central taxes 9.31%
6 State’s own nontax revenue 6.17%
7 Public account 3.43%
8 Loans from central government 1.42%
Rupee goes to –
No. Tax and Grants Amount (Rs.)
1 Social Service 37.08%
2 Grants-in-aid to Local bodies 0.81%
3 Loans and advances given by state government 0.64%
4 Repayment of public debt 6.77%
5 Interest payment and debt services 11.69%
6 Capital expenditure 12.31%
7 Economic Services 13.69%
8 General Services 17.00%
2.10 SUGGESTED READINGS
1. Mathematics and Statistics by M. L. Vaidya, M. K. Kelkar
2. Statistical Analysis by S. P. Azen and A. A. Afifi
Chapter 3
Measures of Central Tendency
3.0 Objectives
3.1 Introduction
3.2 Arithmetic mean
3.2.1 Properties of arithmetic mean
3.2.2 Merits and Demerits of mean
3.3 Median
3.3.1 Merits and Demerits of median
3.4 Mode
3.4.1 Merits and Demerits of mode
3.5 Summary
3.6 Check your Progress - Answers
3.7 Illustrative Examples
3.8 Questions for Self – Study
3.9 Suggested Readings
3.0 OBJECTIVES
After studying this chapter you will be able to –
• explain what the mean is
• discuss how the mode is calculated
• calculate the median
• discuss central tendency
• discuss the value of the central item
• explain values that occur again and again
3.1 INTRODUCTION
We have studied in the previous chapters that the first step towards
condensation of raw and large data into a compact form is to classify it and prepare a
frequency distribution. In the form of a frequency distribution it becomes easy
to understand many features of the data such as the pattern of variation of values, the region of
concentration of values, the symmetry of the distribution etc. It is a descriptive measure as it
depicts the pattern of behaviour of the variable. However, for further statistical
analysis we need the data to be condensed or summarized into a single number which
may be taken as the representative number of the whole group. Such a number is
called an average or central value.
In most data we note a tendency of the observations or values to concentrate
in a central part of the data. In other words, a large proportion of the observations are gathered
near a central value. This property of the observations in data is called the property of
central tendency. Naturally we select a representative observation from the central
part, and such an observation, around which a large no. of observations in the
data are concentrated, is called a measure of central tendency. In most data the average
is a centre of concentration of values. In that sense the average is called a measure of
central tendency. The average locates the centre of the data and in many cases the whole
distribution is identified by the average. The average is therefore also called a measure of
location. The average is a descriptive measure and it can focus attention more sharply on
various properties of the data.
There are many types of averages, each having particular properties and each
being typical or representative in some unique way. The most frequently used averages
are the arithmetic mean, the median and the mode.
3.2 ARITHMETIC MEAN OR MEAN
This is the most commonly used and widely applicable average. The mean is an average
familiar to the common man.
Definition : Mean is defined as the ratio of the sum of observations in the data to
the number of observations.
Computational Formula
In statistics, while computing different measures for the data on a variable, we
need to consider two types of data.
(i) Ungrouped data (discrete variables)
(ii) Grouped data or frequency distribution (continuous variable)
Accordingly, the computational formulae for these two types of data are
different.
(I) Ungrouped data
Suppose that $x_1, x_2, \ldots, x_n$ are the n given observations. Then the mean of these n
observations is denoted by $\bar{x}$ (read as x bar) and is given by
$$\bar{x} = \frac{\text{sum of n observations}}{\text{no. of observations}} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$
Example 1 :
Annual sales (in '000 Rs) of a company for 10 months are given below.
23, 47, 29, 32, 25, 30, 34, 32, 25, 35. Find the mean annual sales.
Solution :
Here there are n = 10 observations.
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{23 + 47 + \cdots + 35}{10} = \frac{312}{10} = 31.2$$
Mean annual sales = Rs. 31200
(II) Grouped data (frequency distribution)
In the case of a frequency distribution, suppose
k = no. of classes
$x_i$ = class mark of the ith class, i = 1, 2, ..., k
$f_i$ = frequency of the ith class
Then the mean of the frequency distribution is denoted by $\bar{x}$ and is given by
$$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_k f_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum f_i x_i}{\sum f_i}$$
Steps for finding mean
(1) Find the class marks $x_i$ of all the classes, i = 1, 2, ..., k.
(2) Find the values of $f_i x_i$, i = 1, 2, ..., k.
(3) Using the formula, find $\bar{x}$.
Example 2: Frequency distribution of marks obtained by 100 students is
given below
Marks No. of students
10–20 2
20–30 17
30–40 23
40–50 31
50–60 15
60–70 6
70–80 4
80–90 2
Find mean marks
Solution : First it is required to find the class marks of all the classes.
Marks     No. of students (f_i)   Class mark (x_i)       f_i x_i
10–20             2               (10 + 20)/2 = 15     2 × 15 = 30
20–30            17               (20 + 30)/2 = 25     17 × 25 = 425
30–40            23                      35                 805
40–50            31                      45                1395
50–60            15                      55                 825
60–70             6                      65                 390
70–80             4                      75                 300
80–90             2                      85                 170
Total           100                                        4340
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{4340}{100} = 43.4$$
The mean marks are 43.4.
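The steps for the mean of a frequency distribution can be sketched in Python as follows. This is a minimal illustration with the data of Example 2; the function name `grouped_mean` is an assumption.

```python
def grouped_mean(classes, freqs):
    """Mean of a frequency distribution: sum(f_i * x_i) / sum(f_i)."""
    marks = [(low + high) / 2 for low, high in classes]  # class marks x_i
    return sum(f * x for f, x in zip(freqs, marks)) / sum(freqs)


# Marks distribution of Example 2.
classes = [(10, 20), (20, 30), (30, 40), (40, 50),
           (50, 60), (60, 70), (70, 80), (80, 90)]
freqs = [2, 17, 23, 31, 15, 6, 4, 2]

mean_marks = grouped_mean(classes, freqs)
print(mean_marks)
```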
3.2.1 Properties of arithmetic mean
1) If we know the number of values n in the data and the mean $\bar{x}$, then we can find
the sum of the values in the data:
sum of values = $n \times \bar{x}$
2) The sum of deviations of the values in the data from their mean is equal to zero.
If $x_1, x_2, \ldots, x_n$ are n observations and $\bar{x}$ is their mean, then $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$
are the deviations of these observations from their mean, and
$$\sum (x_i - \bar{x}) = \sum x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$$
3) Effect of change of origin and scale
Let $x_1, x_2, \ldots, x_n$ be n given observations and $\bar{x}$ their mean.
If we transform $x_i$ to $u_i$, i = 1, 2, ..., n, using the change of origin and scale
$$u_i = \frac{x_i - a}{h}, \quad i = 1, 2, \ldots, n \quad (a, h \text{ are constants})$$
then the mean $\bar{u}$ of $u_1, u_2, \ldots, u_n$ is given by
$$\bar{u} = \frac{\bar{x} - a}{h} \quad \text{so that} \quad \bar{x} = a + h\bar{u}$$
This result is useful for simplifying the computation of the mean, particularly
when the observations are large and in the case of a frequency distribution.
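The first two properties can be checked numerically. The Python sketch below uses the sales figures of Example 1; it is only a check on one data set, not a proof, and the variable names are chosen for illustration.

```python
sales = [23, 47, 29, 32, 25, 30, 34, 32, 25, 35]  # data of Example 1
n = len(sales)
mean = sum(sales) / n

# Property 1: n times the mean gives back the sum of the values.
total_from_mean = n * mean
# Property 2: deviations from the mean add up to zero (up to rounding error).
deviation_sum = sum(x - mean for x in sales)

print(total_from_mean, deviation_sum)
```

Because the mean is stored as a floating point number, the sum of deviations may print as a very small number rather than exactly zero.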
Example 3 : The following are data on the no. of students in the first year in
different colleges.
105, 110, 98, 103, 105, 101, 112, 106
Calculate the mean no. of students.
Solution : Here the observations are large, so we transform them to new
observations by subtracting a suitable constant, 100, from each observation.
The new observations are
$$u_i = \frac{x_i - a}{h}, \quad i = 1, 2, \ldots, 8, \quad \text{with } h = 1,\ a = 100$$
u1 = 5, u2 = 10, u3 = –2, u4 = 3, u5 = 5, u6 = 1, u7 = 12, u8 = 6
$$\bar{u} = \frac{\sum u_i}{n} = \frac{40}{8} = 5.00$$
Mean $\bar{x} = a + h\bar{u}$ = 100 + 5.00 = 105.00
Example 4 :
For the following frequency distribution on heights of students compute mean
height.
Height (in cms) No. of students
120–130 13
130–140 22
140–150 10
150–160 8
160–170 7
Solution : We shall solve this problem by transforming the class marks ($x_i$) to $u_i$
using the change of origin and scale $u_i = (x_i - 145)/10$.
Height      No. of students (f_i)   Class mark (x_i)   u_i = (x_i – 145)/10   f_i u_i
120–130            13                    125                  –2               –26
130–140            22                    135                  –1               –22
140–150            10                    145                   0                 0
150–160             8                    155                   1                 8
160–170             7                    165                   2                14
Total              60                                                          –26
$$\bar{u} = \frac{\sum f_i u_i}{\sum f_i} = \frac{-26}{60} = -0.4333$$
$$\bar{x} = a + h\bar{u} = 145 + 10 \times (-0.4333) = 140.6667$$
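The change of origin and scale used in Example 4 can be compared with the direct formula in a short Python sketch. The function name `coded_mean` is an assumption made for illustration; a is the origin and h the scale, as in the example.

```python
def coded_mean(marks, freqs, a, h):
    """Mean via change of origin a and scale h: x_bar = a + h * u_bar."""
    u = [(x - a) / h for x in marks]  # u_i = (x_i - a) / h
    u_bar = sum(f * ui for f, ui in zip(freqs, u)) / sum(freqs)
    return a + h * u_bar


# Class marks and frequencies of Example 4 (heights in cm).
marks = [125, 135, 145, 155, 165]
freqs = [13, 22, 10, 8, 7]

x_bar = coded_mean(marks, freqs, a=145, h=10)
direct = sum(f * x for f, x in zip(freqs, marks)) / sum(freqs)
print(round(x_bar, 4), round(direct, 4))
```

Both routes give the same mean; the coded version only keeps the intermediate numbers small, which matters for hand computation.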
3.2.2 Merits and Demerits of mean
The concept of the mean is familiar and of long usage, and hence it seems to be the best
average or best measure of central tendency. However, there are certain limitations in
using it. The following are the merits and demerits of the mean as a measure of central tendency.
Merits : (1) It is rigidly defined and uniquely determined.
(2) It is familiar to the common man and easy to compute.
(3) It is based on all values in the data and therefore is more stable.
(4) It is capable of further algebraic treatment.
(5) It is least affected by sampling fluctuations.
(6) It is widely used in practice and is the most commonly used average in many
fields.
(7) Observations need not be arranged in order for the computation of the
mean.
Demerits : (1) The mean can be used only when the characteristic under study is a
variable. For an attribute type characteristic the mean cannot be determined.
(2) It is much affected by extreme observations, especially when the no. of observations
in the data is small.
(3) If a frequency distribution has open end classes, the mean cannot be
determined, because we cannot find the class mark for such open end
classes.
(4) It cannot be determined graphically.
3.3 THE MEDIAN
The median of data is the value of the central item or observation when the
observations are arranged in ascending order of magnitude. For most data the
median can serve as an average as it will always be located at the centre of the data. It is
a positional average. There are equal nos. of observations above and below the
median in the data. It divides the data into two equal parts. It is the most suitable
measure of central tendency for distributions like the income distribution or age
distribution, which are mostly non-symmetric.
Computational formula
(I) Ungrouped data : In the case of ungrouped data, when n observations are
given,
median = value of the $\left(\frac{n+1}{2}\right)$th observation when the observations are arranged in
increasing order (n is odd)
median = mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations when the observations are
arranged in ascending order (n is even)
Example 5 : Compute the median of the following observations in each set.
(i) 10, 12, 23, 17, 13, 9, 17
(ii) 37, 31, 42, 35, 27, 38, 18, 26
Solution
(i) Here the no. of observations n = 7, which is an odd no.
First we arrange the observations in ascending order.
9, 10, 12, 13, 17, 17, 23
median = value of the $\left(\frac{7+1}{2}\right)$th = 4th observation = 13
(ii) Here the no. of observations n = 8, which is an even no.
First we arrange the observations in ascending order.
18, 26, 27, 31, 35, 37, 38, 42
Median = mean of the $\left(\frac{n}{2}\right)$th = 4th and $\left(\frac{n}{2}+1\right)$th = 5th observations = $\frac{31 + 35}{2}$ = 33
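The rule for ungrouped data can be written as a small Python function. This is a minimal sketch; the function name `median` is chosen for illustration, and the two data sets are those of Example 5.

```python
def median(values):
    """Median of ungrouped data: the middle value for odd n,
    the mean of the two middle values for even n."""
    ordered = sorted(values)  # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the ((n + 1) / 2)th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # (n/2)th and (n/2 + 1)th


median_i = median([10, 12, 23, 17, 13, 9, 17])
median_ii = median([37, 31, 42, 35, 27, 38, 18, 26])
print(median_i, median_ii)
```

Python lists are indexed from zero, so the 4th observation is `ordered[3]`; this is why the indices in the code are one less than the positions in the formula.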
(II) Grouped data : (Frequency distribution)
In the case of grouped data the median is given by the following formula.
$$\text{Median} = L + \left(\frac{N}{2} - c.f.\right) \times \frac{h}{f}$$
L = lower class boundary of the median class
N = total frequency = $\sum f_i$
c.f. = less than cumulative frequency of the class preceding the median class
h = width of the median class
f = frequency of the median class
Median class : It is the class whose less than cumulative frequency is just greater
than or equal to N/2.
Steps (1) First find the less than cumulative frequencies of all the classes.
(2) Determine the median class.
(3) Determine the values of L, c.f., h, f.
(4) Using the formula, compute the median.
Example 6 : The frequency distribution of the ratable value of dwellings in a locality is given
below. Find the median ratable value.
Ratable value (Rs.)   No. of dwellings
0–1000                      27
1000–2000                   58
2000–3000                   85
3000–4000                   40
4000–5000                   10
Solution :
Ratable value (Rs.)   No. of dwellings   Less than cumulative frequency
0–1000                      27                    27
1000–2000                   58                    85
2000–3000                   85                   170
3000–4000                   40                   210
4000–5000                   10                   220
Total                      220
$$\frac{N}{2} = \frac{220}{2} = 110$$
For the class 2000–3000 the less than cumulative frequency is just greater than N/2 = 110.
(Note that for all classes from 2000–3000 onwards the less than cumulative frequency is greater
than 110.)
The median class is 2000–3000.
L = 2000, c.f. = 85, h = 1000, f = 85
$$\text{Median} = L + \left(\frac{N}{2} - c.f.\right) \times \frac{h}{f} = 2000 + (110 - 85) \times \frac{1000}{85} = 2000 + \frac{1000 \times 5}{17} = 2000 + 294.1176 = 2294.1176$$
Median = Rs. 2294.12
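The steps for the median of a frequency distribution can be sketched in Python as below. The function name `grouped_median` is an assumption, and the frequencies are those used in the solution table (27, 58, 85, 40, 10).

```python
def grouped_median(boundaries, freqs):
    """Median = L + (N/2 - c.f.) * h / f for a frequency distribution."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    cf = 0  # less than c.f. of the classes before the current one
    for (lower, upper), f in zip(boundaries, freqs):
        if cf + f >= n / 2:  # first class whose c.f. reaches N/2: median class
            return lower + (n / 2 - cf) * (upper - lower) / f
        cf += f
    raise ValueError("boundaries and freqs must have the same length")


# Ratable values (Rs.) of Example 6.
boundaries = [(0, 1000), (1000, 2000), (2000, 3000), (3000, 4000), (4000, 5000)]
freqs = [27, 58, 85, 40, 10]

median_value = grouped_median(boundaries, freqs)
print(round(median_value, 2))
```

Note that `cf` holds the cumulative frequency up to the class before the median class, which is the value 85 used in the worked solution.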
3.3.1 Merits and Demerits of median : Whenever the mean fails to be a good
measure of central tendency, the median in general is found to be a useful and
appropriate average. It has several advantages and also some limitations.
The following are the merits and demerits of the median.
Merits
(1) It is applicable to all kinds of data, on variables or attributes. In the case of
qualitative data the items can be arranged in a particular order according to a
qualitative character, and the quality of the central item gives the median or
average quality.
(2) For non-symmetric distributions like the age distribution or income distribution the
median is the most appropriate average.
(3) It is not affected much by extreme observations in the data.
(4) The concept of the median is easy to understand and is appealing.
(5) It can be determined even if there are open end classes in a frequency
distribution.
(6) It is least affected by the choice of class intervals.
(7) It is useful when the mean is either indeterminate or unsuitable.
(8) It can be determined graphically.
Demerits :
(1) It is not based on all the observations in the data.
(2) It is not as rigidly defined as the mean.
(3) It is not a suitable average for a small group of items.
(4) It is less capable of further mathematical treatment.
(5) It requires the observations to be arranged in ascending order.
3.4 THE MODE
The word mode means fashion. We say that wearing narrow bottom trousers is
the current fashion among youngsters. It means that the majority of youngsters wear that
type of trousers.
The mode Mo is thus defined as the value of the variable that occurs the
maximum no. of times in the data, i.e. more often than any other value. It is the most frequently
occurring value in the data.
Computation of mode: (I) Ungrouped data: In the case of ungrouped data of n
observations $x_1, x_2, \ldots, x_n$ the mode is that observation which occurs the maximum number
of times.
Example 7 : Calculate the mode of the following observations.
11, 13, 17, 20, 17, 15, 17, 13, 17, 19.
Solution : In the above observations, 17 occurs more times
than any other observation. Hence the mode is 17.
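Counting how often each observation occurs can be sketched with Python's standard `collections.Counter`. The function name `modes` is an assumption; it returns a list because two or more values may share the maximum count.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs the maximum number of times."""
    counts = Counter(values)   # value -> no. of occurrences
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


observations = [11, 13, 17, 20, 17, 15, 17, 13, 17, 19]
mode_values = modes(observations)
print(mode_values)
```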
(II) Grouped data (Frequency distribution) : In the case of a frequency
distribution the mode is given by the following formula.
$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$$
L = lower class boundary of the modal class
$f_0$ = frequency of the pre-modal class
$f_1$ = frequency of the modal class
$f_2$ = frequency of the post-modal class
h = width of the modal class
Modal class = the class having the maximum frequency
Steps (1) Determine the modal class.
(2) Determine the values of $f_0$, $f_1$, $f_2$, h, L.
(3) Using the formula, determine the value of the mode.
Example 8 : The marks obtained by 40 students in a certain test are given
below. Find the modal marks.
Marks     No. of students
0–10             3
10–20           11
20–30           16
30–40            8
40–50            2
Solution : Here the maximum frequency, 16, corresponds to the class 20–30, so the
modal class is 20–30.
L = 20, f0 = 11, f1 = 16, f2 = 8, h = 10
$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h = 20 + \frac{16 - 11}{32 - 11 - 8} \times 10 = 20 + \frac{5 \times 10}{13} = 23.8462$$
Mode = 23.85 marks (approximately)
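The formula for the mode of a frequency distribution can be sketched in Python as follows. The function name `grouped_mode` is an assumption. When the modal class is the first or the last class the sketch raises an error instead of guessing a neighbouring frequency, since the formula needs both a pre-modal and a post-modal class.

```python
def grouped_mode(boundaries, freqs):
    """Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * h."""
    i = freqs.index(max(freqs))  # modal class: first class with maximum frequency
    if i == 0 or i == len(freqs) - 1:
        raise ValueError("modal class is at one end; formula not applicable")
    f0, f1, f2 = freqs[i - 1], freqs[i], freqs[i + 1]
    lower, upper = boundaries[i]
    # Assumes the modal frequency exceeds at least one neighbour,
    # so that the denominator is not zero.
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * (upper - lower)


# Marks distribution of Example 8.
boundaries = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [3, 11, 16, 8, 2]

modal_marks = grouped_mode(boundaries, freqs)
print(round(modal_marks, 2))
```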
3.4.1 Merits and Demerits of mode
As compared to mean and median the mode has very limited utility. Following
are merits and demerits of mode.
Demerits :
(1) It is not based on all observations in the data and hence it is not sensitive to the
changes in extreme values in the data.
(2) It is not suitable average when the number of items in the data is very small.
(3) It is not suitable average for extremely non-symmetric distributions.
(4) It cannot be determined when maximum frequency is at one end of
distribution.
(5) It is affected to a great extent by the choice of class intervals.
Merits :
(1) It is applicable to both qualitative and quantitative type data.
(2) It is useful in some special type of situations only.
(3) It is not influenced by extreme values in the data.
(4) It can be determined graphically
Check Your Progress - 3.2 to 3.4
1. What is mean?
____________________________________________________
____________________________________________________
2. What is median?
____________________________________________________
____________________________________________________
3. What is mode?
____________________________________________________
____________________________________________________
4. Choose the correct alternative.
i. For a set of 101 distinct values, The median value happened to be 55. Later it
was observed that a value 74 was wrongly written as 64. With this correction
now
a) The median will undergo a change and gets increased.
b) The median will undergo a change and gets decreased.
c) The median will be unchanged.
d) The given information is insufficient for recalculation of median.
ii. For the following distribution, how would the mean compare with the
median?
a) The Mean would be less than the Median
b) The Mean would be equal to the Median
c) The Mean would be greater than the Median
d) None of the above
iii. If a constant value 50 is subtracted from each observation of a set, the mean
of the set is
a) increased by 50 b) decreased by 50
c) not affected d) 50 times the original value
iv. A distribution of 6 scores has a median of 21. If the highest score increases
by 3 points, the median will become
a) 21 b)21.5
c) 24 d) cannot be determined without additional information
v. The value of $\sum (x_i - \bar{x})/n$ is
a) zero if $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$   b) always zero
c) n – 1   d) none of the above
3.5 SUMMARY
– The mean is the average, i.e. the ratio of the sum of observations in the
data to the number of observations. It is denoted by $\bar{x}$ (read as x bar).
– The mean is calculated for two types of data:
Ungrouped
Grouped
For ungrouped data the formula is
$$\bar{x} = \frac{\sum x_i}{n}$$
where $\sum x_i = x_1 + x_2 + \cdots + x_n$
n = no. of observations
For grouped data the formula is
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$
where $\sum f_i x_i = x_1 f_1 + x_2 f_2 + \cdots + x_k f_k$
$\sum f_i = f_1 + f_2 + \cdots + f_k$
k = no. of classes
$x_i$ = class mark of the ith class, i = 1, 2, ..., k
$f_i$ = frequency of the ith class
– The median of data is the value of the central item or observation when the
observations are arranged in ascending order of magnitude.
– Formula of the median for
(I) Ungrouped data : In the case of ungrouped data, when n observations are given,
median = value of the $\left(\frac{n+1}{2}\right)$th observation when the observations are arranged in
increasing order (n is odd)
median = mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations when the observations are
arranged in ascending order (n is even)
(II) Grouped data : (Frequency distribution)
In the case of grouped data the median is given by the following formula.
$$\text{Median} = L + \left(\frac{N}{2} - c.f.\right) \times \frac{h}{f}$$
L = lower class boundary of the median class
N = total frequency = $\sum f_i$
c.f. = less than cumulative frequency of the class preceding the median class
h = width of the median class
f = frequency of the median class
Median class : It is the class whose less than cumulative frequency is just greater
than or equal to N/2.
Steps (1) First find the less than cumulative frequencies of all the classes.
(2) Determine the median class.
(3) Determine the values of L, c.f., h, f.
(4) Using the formula, compute the median.
– The mode is the value of the variable that occurs the maximum no. of
times in the data, i.e. more often than any other value.
Ungrouped data : In the case of ungrouped data of n observations $x_1, x_2, \ldots, x_n$ the mode
is that observation which occurs the maximum number of times.
Grouped data (Frequency distribution) : In the case of a frequency distribution the mode is
given by the following formula.
$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$$
L = lower class boundary of the modal class
$f_0$ = frequency of the pre-modal class
$f_1$ = frequency of the modal class
$f_2$ = frequency of the post-modal class
h = width of the modal class
Modal class = the class having the maximum frequency
Steps (1) Determine the modal class.
(2) Determine the values of $f_0$, $f_1$, $f_2$, h, L.
(3) Using the formula, determine the value of the mode.
3.6 CHECK YOUR PROGRESS - ANSWERS
3.2 to 3.4
1. Mean is nothing but average which is the ratio of sum of observations in the
data to the number of observations. It is denoted as x (read as x bar)
2. The Median of data is the value of central item of observations when the
observations are arranged in ascending order of magnitude.
3. The Mode is the value of the variable occurring more or maximum no. of
times in the data than any other value.
4. (i) – c, (ii) – b, (iii) – b, (iv) – a (v) - a
3.7 ILLUSTRATIVE EXAMPLES
Example 1 : The starting monthly salaries of 10 employees recruited in a firm are
Rs. 1500, 1750, 1680, 1820, 1850, 1750, 2000, 1725, 1575 and 1750
Find the mean, median and the mode
Solution : Let $x_1, x_2, \ldots, x_{10}$ be the 10 given observations.
Mean :
$$\bar{x} = \frac{\sum x_i}{n} = \frac{17400}{10} = 1740 \text{ Rs.}$$
n 10
Median : For finding median we arrange observations in ascending order.
1500, 1575, 1680, 1725, 1750, 1750, 1750, 1820, 1850, 2000
Here no. of observations n = 10
Median = mean of 5th and 6th observation = 1750 Rs.
Mode :
Observation 1750 is repeated maximum no. of times.
mode = 1750
Example 2 The distribution of life time in hrs. of 200 radio tubes is given
below.
Calculate the mean, median and the mode.
Life (in hrs)   No. of tubes
600–800              40
800–1000             55
1000–1200            60
1200–1400            25
1400–1600            20
Solution :
Life (in hrs)   No. of tubes (f_i)   Class mark (x_i)   Less than c.f.   u_i = (x_i – 1100)/200   f_i u_i
600–800               40                  700                40                  –2                 –80
800–1000              55                  900                95                  –1                 –55
1000–1200             60                 1100               155                   0                   0
1200–1400             25                 1300               180                   1                  25
1400–1600             20                 1500               200                   2                  40
Total              N = 200                 –                  –                   –                 –70
Mean :
$$\bar{u} = \frac{\sum f_i u_i}{\sum f_i} = \frac{-70}{200} = -0.350$$
$\bar{x} = a + h\bar{u}$, with a = 1100 and h = 200
$\bar{x}$ = 1100 – 200 × 0.350 = 1030 hrs.
Median :
From the less than cumulative frequencies observe that for the class 1000–1200 the
less than cumulative frequency is just greater than N/2 = 100.
The median class is 1000–1200.
L = 1000, c.f. = 95, h = 200, f = 60
$$\text{Median} = L + \left(\frac{N}{2} - c.f.\right) \times \frac{h}{f} = 1000 + (100 - 95) \times \frac{200}{60} = 1016.6667 \text{ hrs.}$$
Mode : The maximum frequency corresponds to the class 1000–1200.
The modal class is 1000–1200.
L = 1000, f0 = 55, f1 = 60, f2 = 25, h = 200
$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h = 1000 + \frac{60 - 55}{120 - 55 - 25} \times 200 = 1025 \text{ hrs.}$$
3.8 QUESTIONS FOR SELF - STUDY
1. Define mean and state its important properties.
2. Define median. State merits and demerits of median.
3. Define mode. State merits and demerits of mode.
4. What do you mean by central tendency of data? What is measure of central
tendency?
5. State merits and demerits of mean.
6. The lifetime in days of 8 small insects is given below.
15, 14, 28, 19, 26, 17, 15, 23
Find the mean and median lifetime.
7. An incomplete frequency distribution is given below. The total frequency is
230 and the median is 46. Find the missing frequencies.
Marks :     10–20   20–30   30–40   40–50   50–60   60–70
Students :  12      30      ?       65      ?       43
8. The frequency distribution of no. of tablets required to cure fever is given
below.
Find the mean, the median and the mode.
Tablets   No. of persons
4–8       11
8–12      13
12–16     16
16–20     14
20–24     8
24–28     5
28–32     3
9. The following is the age distribution of life insurance policy holders whose
mean age is 23.6 years. Find the missing frequency.
Age :      0–10   10–20   20–30   30–40   40–50
Persons :  7      12      ?       13      3
10. The monthly expenditure (in Rs.) of 10 families is given below. Find the mean,
median and mode.
700, 750, 700, 800, 750, 775, 800, 750, 720, 750
3.9 SUGGESTED READINGS
1. Mathematics and Statistics by M. L. Vaidya, M. K. Kelkar
2. Statistical Analysis by S. P. Azen and A. A. Afifi
CHAPTER 4
Measures of Dispersion
4.0 Objectives
4.1 Introduction
4.2 Range
4.3 Mean Deviation
4.4 Variance
4.5 Standard Deviation
4.6 Absolute and Relative Measure of Dispersion
4.7 Coefficient of Variation
4.8 Summary
4.9 Check Your Progress - Answers
4.10 Illustrative Example
4.11 Questions for Self - Study
4.12 Suggested Readings
4.0 OBJECTIVES
Friends, dispersion means the spread of the values in the given sample data. After studying this
chapter you will be able to
• explain range.
• discuss mean deviation.
• calculate variance.
4.1 INTRODUCTION
The average, or measure of central tendency, is a good descriptive measure of the
distribution of a variable. However, it cannot describe the distribution completely. It
gives us an idea only about the location of the centre of the distribution. For complete
knowledge of the distribution some additional information is required. One such piece of
information is the nature and extent of variation of the values in the data. The
average scores of two batsmen for a season may be equal or nearly equal, but their
consistency can be judged only by studying the variability of their scores. For example, if the scores
of one batsman are 40, 45, 50, 56, 59 and those of the other are 20, 35, 50, 60, 85, then
they do not differ in average (both average 50) but they differ in variation. Hence the average alone is not
sufficient for comparing the performance of two batsmen. This nature and extent of
variation of values in the data is known as dispersion.
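The batsmen example can be checked numerically. The following is a small sketch using Python's standard `statistics` module; the names `batsman_1` and `batsman_2` are introduced here for illustration. It prints the common average together with the deviation of each score from it.

```python
from statistics import mean

batsman_1 = [40, 45, 50, 56, 59]
batsman_2 = [20, 35, 50, 60, 85]

for name, scores in [("first", batsman_1), ("second", batsman_2)]:
    avg = mean(scores)
    # Signed distance of each score from the batsman's own average.
    deviations = [x - avg for x in scores]
    print(name, avg, deviations)
```

Both averages come out equal, while the deviations of the second batsman are much larger in size.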
The knowledge of dispersion helps in judging the reliability of the average.
The average will be more reliable, or more representative of the data, when the data
have less variability. The analysis of variation in the values of the data has a number of practical
applications in various fields such as agriculture, industry and medicine.
For the study of dispersion present in the data we need some measure of the
degree of dispersion, and it is called a measure of dispersion. In the remainder of this chapter
we are going to study some measures of dispersion.
4.2 RANGE
Range is the simplest among the several measures of dispersion. Range is
defined as the difference between the maximum and minimum observations in the data.
In the case of a frequency distribution, range may be defined as the difference between
the largest and smallest class boundaries.
Although range is the simplest measure to compute, it is a crude measure of
dispersion. The range is used in limited applications and it has certain defects. It
is greatly affected by an unusual value at either extreme. We cannot interpret the value
of the range properly without knowing the number of observations.
The range is useful in situations where one desires to know only the extent of
extreme dispersion. Stock market reports are frequently stated in terms of
range, by quoting the high and low prices of a stock over a period. In weather reports
also we find the maximum and minimum temperatures stated. The daily mean
temperature can be obtained by averaging these two temperatures. In quality control,
range is used as a measure of variation within the sample.
The range, being easy to compute and a common way of describing
dispersion, is often used in engineering and medical reports.
Illustration (1) : The following are the prices of the shares of a certain
company on the stock market for the last 10 days. Find the range.
123, 98, 96, 120, 115, 121, 117, 131, 101, 99
Answer : Here the minimum observation is 96 and the maximum observation is 131.
Hence the range is 131 – 96 = 35.
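The sensitivity of the range to a single extreme value can be seen in a short sketch. The function name `value_range` and the extra score 120 are hypothetical, introduced only for this illustration; the five scores are those of the first batsman in section 4.1.

```python
def value_range(data):
    """Range = maximum observation - minimum observation."""
    return max(data) - min(data)

scores = [40, 45, 50, 56, 59]
print(value_range(scores))

# Hypothetical extra score, added only to show the effect of one extreme value.
print(value_range(scores + [120]))
```

One unusual observation is enough to change the range from 19 to 80, although the other five scores are unchanged.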
4.3 MEAN DEVIATION
Range as a measure of dispersion does not take into consideration all
observations in the data. So it is a comparatively unstable and insensitive measure of
dispersion, and it is of little further use in the analysis of data. Mean deviation is a
measure of dispersion based on all observations in the data. By deviation we mean the result of
subtracting some constant from a given observation; it is called the deviation of that
observation from that constant, e.g. the deviation of x1 from an arbitrary constant a is x1 – a.
In mean deviation we consider the deviation of each observation from some
constant a. The mean of the absolute deviations of the observations from a is called the mean
deviation about 'a'.
Definition. (I) In case of n observations x1, x2, ……, xn the mean deviation about
a is given by
$$\text{MD about } a = \frac{\sum_{i=1}^{n} |x_i - a|}{n}$$
(II) In case of a frequency distribution with
xi = class mark of the ith class, i = 1, 2, ………, k
fi = frequency of the ith class
the mean deviation about a is given by
$$\text{MD about } a = \frac{\sum f_i |x_i - a|}{\sum f_i}$$
Usually we obtain the mean deviation about some central value such as the mean, the
median or the mode. Accordingly we get the mean deviation about mean, the mean deviation
about median or the mean deviation about mode.
Illustration : (1) Calculate the mean deviation about mean for the following
observations.
15, 17, 22, 18, 26, 13, 14, 20, 15, 10
Answer :
xi      |xi – x̄|
15      2
17      0
22      5
18      1
26      9
13      4
14      3
20      3
15      2
10      7
Total 170   36
$$\bar{x} = \text{mean} = \frac{\sum x_i}{n} = \frac{170}{10} = 17$$
$$\text{MD about mean} = \frac{\sum |x_i - \bar{x}|}{n} = \frac{36}{10} = 3.6$$
(2) The frequency distribution of the number of days of medical leave enjoyed by 30
employees is given below.
No. of Days   No. of Employees
0–10          5
10–20         6
20–30         10
30–40         5
40–50         4
Calculate the mean deviation about mean.
Solution :
No. of Days   No. of Employees fi   xi    fixi   |xi – x̄|   fi|xi – x̄|
0–10          5                     5     25     19          95
10–20         6                     15    90     9           54
20–30         10                    25    250    1           10
30–40         5                     35    175    11          55
40–50         4                     45    180    21          84
Total         30                    –     720                298
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{720}{30} = 24.0$$
$$\text{MD about mean} = \frac{\sum f_i |x_i - \bar{x}|}{\sum f_i} = \frac{298}{30} = 9.93 \text{ days}$$
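The calculation in illustration (2) can be reproduced with a few lines of code. This is a sketch only; the names `marks`, `freqs` and `md` are introduced here for illustration.

```python
# Class marks and frequencies from illustration (2): days of medical leave.
marks = [5, 15, 25, 35, 45]
freqs = [5, 6, 10, 5, 4]

n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, marks)) / n

# Mean deviation about the mean: frequency-weighted mean of |x_i - mean|.
md = sum(f * abs(x - mean) for f, x in zip(freqs, marks)) / n

print(mean, round(md, 2))
```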
4.4 VARIANCE
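Variance is the mean square deviation about the mean. Thus variance is defined as
the mean of the squares of the deviations taken from the arithmetic mean. Variance is a good
measure of dispersion and it has many desirable properties. It is denoted by $\sigma^2$ (sigma
squared).
(I) In case of n observations x1, x2, ……, xn the variance is defined as
$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n} \quad \text{where } \bar{x} = \text{mean} = \frac{\sum x_i}{n}$$
Computational formula
$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n} = \frac{\sum x_i^2}{n} - 2\bar{x}\,\frac{\sum x_i}{n} + \bar{x}^2$$
$$= \frac{\sum x_i^2}{n} - 2\bar{x}^2 + \bar{x}^2$$
$$= \frac{\sum x_i^2}{n} - \bar{x}^2$$
Thus
$$\sigma^2 = \frac{\sum x_i^2}{n} - \bar{x}^2$$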
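(II) In case of a frequency distribution,
xi = class mark of the ith class, i = 1, 2, ……, k
fi = frequency of the ith class
Variance is defined as
$$\sigma^2 = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i} \quad \text{where } \bar{x} = \text{mean} = \frac{\sum f_i x_i}{\sum f_i}$$
Computational formula
$$\sigma^2 = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i} = \frac{\sum f_i (x_i^2 - 2x_i\bar{x} + \bar{x}^2)}{\sum f_i}$$
$$= \frac{\sum f_i x_i^2}{\sum f_i} - 2\bar{x}\,\frac{\sum f_i x_i}{\sum f_i} + \bar{x}^2\,\frac{\sum f_i}{\sum f_i}$$
$$= \frac{\sum f_i x_i^2}{\sum f_i} - \bar{x}^2$$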
In the case of a frequency distribution as well as individual observations, the
calculation of variance can be simplified by making use of a change of origin and
scale.
Change of origin and scale :
Let $u_i = \frac{x_i - A}{h}$ be the new observations obtained from xi
by using a change of origin and scale, so that
$x_i = A + hu_i$ and $\bar{x} = A + h\bar{u}$. Then the variance of
u is given by
$$\sigma_u^2 = \frac{\sum u_i^2}{n} - \bar{u}^2 \quad \text{where } \bar{u} = \frac{\sum u_i}{n}$$
(in case of individual observations)
$$\sigma_u^2 = \frac{\sum f_i u_i^2}{\sum f_i} - \bar{u}^2 \quad \text{where } \bar{u} = \frac{\sum f_i u_i}{\sum f_i}$$
(in case of frequency distribution)
Then the variance of the original observations is $\sigma_x^2 = h^2 \sigma_u^2$.
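The relation between the variance of the coded values u and that of the original values x can be checked numerically. The sketch below reuses the medical-leave distribution of section 4.3; the function name `grouped_variance` and the choice A = 25, h = 10 are assumptions made only for this illustration.

```python
def grouped_variance(values, freqs):
    """Variance of a frequency distribution: sum(f*x^2)/sum(f) - mean^2."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, values)) / n
    return sum(f * x * x for f, x in zip(freqs, values)) / n - mean ** 2

marks = [5, 15, 25, 35, 45]        # class marks of 0-10, 10-20, ..., 40-50
freqs = [5, 6, 10, 5, 4]

A, h = 25, 10                      # assumed origin and scale for this sketch
u = [(x - A) / h for x in marks]   # coded values -2, -1, 0, 1, 2

var_u = grouped_variance(u, freqs)
var_x = grouped_variance(marks, freqs)

# h^2 times the variance of u should agree with the variance of x.
print(var_u, h ** 2 * var_u, var_x)
```

The change of origin A drops out of the variance altogether; only the scale h enters, through the factor h².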
Illustrations
(1) The following are the monthly savings (in Rs.) of 10 families.
1150, 750, 700, 1000, 800, 900, 720, 840, 980, 650.
Find the variance.
Solution :
x       x²
1150    1322500
750     562500
700     490000
1000    1000000
800     640000
900     810000
720     518400
840     705600
980     960400
650     422500
Total 8490   7431900
$$\bar{x} = \frac{8490}{10} = 849$$
$$\sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2 = 743190 - 720801 = 22389$$
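The definition of variance and the computational formula can be compared on this data with a short script. This is a sketch only: it uses the ten values as tabulated in the solution, the names `by_definition` and `by_shortcut` are introduced here, and `pvariance` from Python's `statistics` module (population variance, divisor n) serves as an independent check.

```python
from statistics import pvariance

# Monthly savings (in Rs.) of the 10 families, as tabulated in the solution.
savings = [1150, 750, 700, 1000, 800, 900, 720, 840, 980, 650]
n = len(savings)
mean = sum(savings) / n

# Definition: mean of squared deviations from the mean.
by_definition = sum((x - mean) ** 2 for x in savings) / n
# Computational formula: mean of squares minus square of the mean.
by_shortcut = sum(x * x for x in savings) / n - mean ** 2

print(mean, by_definition, by_shortcut, pvariance(savings))
```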
(2) Find the variance of the following distribution of percentage dividend paid by
50 companies.
Dividend   No. of Companies
0–6        8
6–12       10
12–18      15
18–24      12
24–30      5
Solution :
Dividend   No. of Companies (f)   Class mark x   u = (x – 15)/6   fu     fu²
0–6        8                      3              –2               –16    32
6–12       10                     9              –1               –10    10
12–18      15                     15             0                0      0
18–24      12                     21             1                12     12
24–30      5                      27             2                10     20
Total      50                                                     –4     74
$u_i = \frac{x_i - 15}{6}$, i.e. $x_i = 15 + 6u_i = A + hu_i$,
which gives A = 15 and h = 6.
$$\bar{u} = \frac{\sum fu}{\sum f} = \frac{-4}{50} = -0.08$$
$$\bar{x} = A + h\bar{u} = 15 + 6(-0.08) = 14.52$$
$$\sigma_x^2 = h^2 \sigma_u^2$$
$$\sigma_u^2 = \frac{\sum fu^2}{\sum f} - \bar{u}^2 = \frac{74}{50} - (-0.08)^2 = 1.48 - 0.0064 = 1.4736$$
$$\sigma_x^2 = 36 \times 1.4736 = 53.0496$$
4.5 STANDARD DEVIATION
Standard deviation is the most commonly used measure of dispersion. It has been
devised to remove the drawback of the variance, namely that it is a rather awkward value to
interpret. The units attached to variance are the squares of the units of the data, e.g. cm²,
Rs² etc. We therefore define standard deviation as the positive square root of the variance, i.e.
the root mean square deviation from the arithmetic mean. Due to this, standard
deviation is expressed in the same units as the original data. It also has all the
advantages of the variance as a measure of dispersion. However, from the magnitude of the
standard deviation alone we cannot immediately say whether a distribution has a small or high
degree of variability.
Standard deviation is denoted by $\sigma$ or SD.
The formulae for computing SD are as follows :
$$\sigma = \sqrt{\frac{\sum x_i^2}{n} - \bar{x}^2} \quad \text{where } \bar{x} = \frac{\sum x_i}{n}$$
(in case of observations x1, x2, ………, xn)
$$\sigma = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - \bar{x}^2} \quad \text{where } \bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$
(in case of frequency distribution)
4.6 ABSOLUTE AND RELATIVE MEASURES OF DISPERSION
Measures of dispersion like range, mean deviation, variance and standard
deviation measure the magnitude of dispersion and are called measures of
absolute dispersion. They are expressed in appropriate units. They are useful
for comparing the variability of two sets of data only when both are in the same units
and their central values, i.e. averages, are nearly equal. But in many practical
situations one or both of these conditions are not fulfilled. So we need measures of
dispersion which are independent of units. Such a measure can be obtained by taking the
ratio of an absolute measure of dispersion to some central value of the data. It is
called a measure of relative dispersion. The most commonly used measure of relative
dispersion is the coefficient of variation.
4.7 COEFFICIENT OF VARIATION
Coefficient of variation (CV) is a widely used relative measure of
dispersion. Whenever we need to compare the variability of sets of values we use
CV. It is defined as the ratio of the standard deviation of the series to its arithmetic
mean. It is always expressed as a percentage.
$$\text{CV} = \frac{\text{SD}}{\text{mean}} \times 100$$
The series which has the smaller CV is said to be more consistent or stable.
Check Your Progress - 4.1 to 4.6
1. Define the following terms.
(a) Range
__________________________________________________
__________________________________________________
(b) Mean deviation
__________________________________________________
__________________________________________________
(c) Variance
__________________________________________________
__________________________________________________
(d) Standard deviation
__________________________________________________
__________________________________________________
2. Choose the correct answer from given.
i) Let A = {–2, –1,0,1,2} and B = {–4, –2, 0, 2, 4}
Let m(.) and v(.) denote the mean and variance respectively of the set
mentioned.
Then identify the correct statement.
a) m(A) > m(B), v(A) < v(B)
b) m(A) > m(B), v(A) < v(B)
c) m(A) = m(B), v(B) = 4v(A)
d) m(A) = m(B), v(B) = 2v(A)
ii) Which of the following measures of dispersion does not depend on the units
of measurement?
a) S. D. b) Mean Deviation c) Range d) C. V.
iii) Mean deviation is minimum when the deviations are taken from
a) Mean b) Median
c) Mode d) Q4
iv) If you are told that a population has a mean of 25 and variance of zero what
must you conclude?
a) Someone has made a mistake
b) There is only one element in the population
c) There are no elements in the population
d) All the elements in the population are 25
v) The following set of scores is obtained on a test
X : 4, 6, 8, 9, 11, 13, 16, 24, 24, 24, 26. The teacher computes all of the
descriptive indices of central tendency and variability on these data, then he
discovered that an error was made and one of the 24's is actually a 17. Which
of the following will be changed from the original computation?
a) median b) range
c) S. D. d) None of the above
vi) If each observation of a series is multiplied by a constant C, the coefficient of
variation as compared to the original value
a) is increased by C
b) is decreased by C
c) remains unchanged
d) is C times the original value
4.8 SUMMARY
– Variation of the values in the data is known as dispersion.
– Measures of dispersion are range, mean deviation, variance and standard
deviation.
– Range is defined as the difference between the maximum and minimum observations.
– Mean deviation is a measure of dispersion based on all observations in the data.
It is the mean of the absolute deviations of the observations from some constant,
usually the mean, median or mode.
– Variance is the mean square deviation about the mean, i.e. the
mean of the squares of the deviations taken from the arithmetic mean.
– Standard deviation is the positive square root of the variance, i.e. the root mean
square deviation from the arithmetic mean.
– Range, mean deviation, variance and standard deviation are called measures of absolute
dispersion.
4.9 CHECK YOUR PROGRESS- ANSWERS
4.2 to 4.7
1. (a) Range is defined as the difference between the maximum and minimum observations.
(b) Mean deviation is a measure of dispersion based on all observations in the data.
It is the mean of the absolute deviations of the observations from some constant,
usually the mean, median or mode.
(c) Variance is the mean square deviation about the mean, i.e. the
mean of the squares of the deviations taken from the arithmetic mean.
(d) Standard deviation is the positive square root of the variance, i.e. the root mean
square deviation from the arithmetic mean.
2 (i)–c, (ii)–d, (iii)–b, (iv)–d, (v)–c, (vi)–c
4.10 ILLUSTRATIVE EXAMPLES
Example 1 : The scores of batsmen in a certain test are as given below :
35, 47, 52, 45, 61, 37, 40, 58
Find (i) variance (ii) coefficient of variation
Solution : Here the data given are 8 observations,
say x1, x2, ………….., x8.
We use a change of origin for simplifying the calculation, as
ui = xi – 45, i = 1, 2, ......, 8
so that the new observations and their squares are as follows :
xi      ui = xi – 45   ui²
35      –10            100
47      2              4
52      7              49
45      0              0
61      16             256
37      –8             64
40      –5             25
58      13             169
Total   15             667
$$\bar{u} = \frac{\sum u_i}{n} = \frac{15}{8} = 1.875$$
$$\bar{x} = \text{mean} = 45 + \bar{u} = 46.875$$
$$\sigma_u^2 = \text{variance of } u = \frac{\sum u_i^2}{n} - \bar{u}^2 = \frac{667}{8} - (1.875)^2 = 79.8594$$
$$\sigma_x^2 = \sigma_u^2 = 79.8594, \quad \sigma_x = +\sqrt{\sigma_x^2} = 8.9364$$
$$\text{CV} = \frac{\sigma_x}{\bar{x}} \times 100 = \frac{8.9364}{46.875} \times 100 = 19.06\,\%$$
Example 2 : The numbers of items of an industrial product sold by two
salesmen A and B in ten months of a year are given below. From these data
determine which salesman is more consistent.
Number of Items Sold
A 128 132 143 140 152 145 135 129 130 145
B 142 150 160 130 120 125 135 145 150 142
Solution : For judging which salesman is more consistent we have to compare
the variability of their performance. This can be done most appropriately by
comparing their coefficients of variation. So let us find the mean and the SD for
each of the two series. Here the given values are large in size, so we may use the
method of change of origin.
Number of items sold
Salesman A (x)   Salesman B (y)   u = x – 140   u²     v = y – 140   v²
128              142              –12           144    2             4
132              150              –8            64     10            100
143              160              3             9      20            400
140              130              0             0      –10           100
152              120              12            144    –20           400
145              125              5             25     –15           225
135              135              –5            25     –5            25
129              145              –11           121    5             25
130              150              –10           100    10            100
145              142              5             25     2             4
Total                             –21           657    –1            1383
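The computations of the means and standard deviations of the series are as follows :
$$\bar{u} = \frac{\sum u}{n} = \frac{-21}{10} = -2.1$$
$$\bar{v} = \frac{\sum v}{n} = \frac{-1}{10} = -0.1$$
$$\text{SD of } u = \sigma_u = \sqrt{\frac{\sum u^2}{n} - (\bar{u})^2} = \sqrt{\frac{657}{10} - (-2.1)^2} = \sqrt{65.7 - 4.41} = \sqrt{61.29} = 7.83$$
$$\text{SD of } v = \sigma_v = \sqrt{\frac{\sum v^2}{n} - (\bar{v})^2} = \sqrt{\frac{1383}{10} - (-0.1)^2} = \sqrt{138.3 - 0.01} = \sqrt{138.29} = 11.76$$
Since SD is invariant to change of origin,
$\sigma_x = \sigma_u = 7.83$ and $\sigma_y = \sigma_v = 11.76$
$\bar{x} = 140 + \bar{u} = 140 - 2.1 = 137.9$
$\bar{y} = 140 + \bar{v} = 140 - 0.1 = 139.9$
$$\text{CV of } x = \frac{\sigma_x}{\bar{x}} \times 100 = \frac{7.83}{137.9} \times 100 = 5.68\,\%$$
$$\text{CV of } y = \frac{\sigma_y}{\bar{y}} \times 100 = \frac{11.76}{139.9} \times 100 = 8.4\,\%$$
Since salesman A has the smaller CV, he is more consistent.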
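The comparison of the two salesmen can be sketched in code as follows. The function name `coefficient_of_variation` is introduced here for illustration, and the sketch assumes the population standard deviation (divisor n), which is what `pstdev` in Python's `statistics` module computes. The values for B are those used in the working table of the solution.

```python
from statistics import mean, pstdev

def coefficient_of_variation(data):
    """CV = (standard deviation / mean) * 100, using the SD with divisor n."""
    return pstdev(data) / mean(data) * 100

# Values as tabulated in the solution (ninth value for B taken as 150).
sales_a = [128, 132, 143, 140, 152, 145, 135, 129, 130, 145]
sales_b = [142, 150, 160, 130, 120, 125, 135, 145, 150, 142]

cv_a = coefficient_of_variation(sales_a)
cv_b = coefficient_of_variation(sales_b)

print(round(cv_a, 2), round(cv_b, 2))
print("A" if cv_a < cv_b else "B", "is more consistent")
```

No change of origin is needed here, because the computer handles the large values directly.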
4.11 QUESTIONS FOR SELF - STUDY
1. What is dispersion? Why is it necessary to take into consideration the
dispersion of the data?
2. Define range as the measure of dispersion. Discuss its advantages and
limitations. Mention some uses of range.
3. Define standard deviation. Establish its importance as a measure of variability?
4. What are the measures of absolute dispersion? Can they be used for
comparison of variability?
5. What are the relative measures of dispersion? In what respect are they superior
to the absolute measures?
6. Define coefficient of variation. In what situations is it useful?
7. A set of 10 observations has the sum of squares of deviations from the mean
equal to 120. Find its variance. If two more values, each equal to the mean, are
added, what will be the variance of the new set?
8. If all the observations in the data are of the same value, what will be its SD?
9. If xi (i = 1, 2, ..., n) are observations on X, show that
$$\sum x_i^2 \ge \frac{\left(\sum x_i\right)^2}{n}$$
10. For 20 observations on Y, $\sum y^2 = 500$. Show that the mean of the data cannot
exceed 5.
11. Are the data n = 10, $\sum x^2 = 500$, $\bar{x} = 8$ consistent? Give reasons for your
answer.
12. A variable takes values 1, 2, 3, ..., n. Find the mean and variance.
13. From the following distribution of milk co–operative societies according to
procurement of milk per day (in liters), compute standard deviation
Quantity of Milk 100–150 150–200 200–250 250–300 300–350
Societies : 10 20 35 25 10
14. A survey conducted to determine the distance travelled (in Km) per litre of
petrol by newly introduced moped yielded the following distribution.
Distance 40–45 45–50 50–55 55–60 60–65
No. of Moped 13 12 25 35 50
Find the standard deviation.
15. The polythene bags were supplied by two suppliers A and B. These bags were
tested for bursting pressure and the following data were obtained.
Pressure in Kg. : 30–40 40–50 50–60 60–70 70–80 80–90
Bags A : 4 6 15 25 20 10
Bags B : 6 14 20 25 10 5
Which supplier's bags have more consistency in bursting pressure?
16. Marks obtained by two students in ten different papers at an examination
are given below. Find who is more consistent.
Marks A : 50 60 35 40 70 90 65 70 75 75
Marks B : 60 65 78 72 80 55 45 65 75 80
17. The mean of 5 observations is 4.4 and the variance is 8.24. If three of the five
observations are 1, 2 and 6, find the other two.
18. The statistics of runs scored by the batsmen A and B in 10 innings are given
below.
Player A Player B
Mean 53 45
Standard deviation 40 16
Which of the two players is more consistent?
19. Find the standard deviation of the following frequency distribution.
x: 1 2 3 4 5
f: k 2k 3k 4k 5k
20. For a group of 10 observations on X,
$\sum x = 452$ and $\sum x^2 = 24270$. Find the standard deviation.
4.12 SUGGESTED READINGS
1. Mathematics and Statistics by M. L. Vaidya, M. K. Kelkar
2. Statistical Analysis by S. P. Azen and A. A. Afifi
3. Pre-degree Mathematics by Vaze, Gosavi
Chapter 5
Correlation
5.0 Objectives
5.1 Introduction
5.2 Correlation
5.2.1 Positive & Negative Correlation
5.3 Covariance
5.4 Coefficient of Correlation
5.4.1 Properties of Correlation Coefficient
5.4.2 Interpretation of the value of Correlation
Coefficient
5.4.3 Computing Correlation Coefficient For
Ungrouped Data
5.5 Summary
5.6 Check Your Progress - Answers
5.7 Illustrative Examples
5.8 Questions for Self - Study
5.9 Suggested Readings
5.0 OBJECTIVES
After studying this chapter you will be able to explain following -
Two variables
Relations between two variables
Positive relations
Negative relation
Reduce the negative relation
5.1 INTRODUCTION
Bivariate Data :
The data we have studied up to this stage consisted of observations on a
single variable and are called univariate data. But there are many situations in
which we are interested in observations on two variables for every unit in a sample or
a group of units. If we observe the consumption of coal X and the production of electricity Y
for 30 days in a month, we get pairs of values (xi, yi) for i = 1, 2, ….., 30. These data
on two variables constitute bivariate data. In short, a set of pairs of observations on
two variables is called bivariate data. For example, the observations on
(i) Age of husband and age of wife in several married couples,
(ii) Intelligence quotient and score in aptitude test of students in a class,
(iii) Supply and price of a commodity in a market on several days,
are some examples of bivariate data.
5.2 CORRELATION
The major interest in collection and study of bivariate data is in finding
whether there is any mutual relation between the two variables under consideration
or not. This mutual or joint relation between the two variables is called correlation
which can be ascertained by studying the joint variation of the two variables in the
data. For example, if we observe the data on consumption of coal and production of
electricity at a thermal power plant, we find that there is relation between these
variables because more consumption of coal is bound to produce more electricity and
shortage of coal is bound to result in shortage of electricity produced. In fact
consumption of coal is the cause of production of electricity. Unless there exists such
a logical relationship between the two variables the study of correlation will be
meaningless. There is no point in studying correlation between height and
intelligence quotient of a group of adults.
Thus two variables are said to be correlated when change in value of one
variable causes corresponding change in the value of the other variable. Population
of a town and number of vehicles in the town are correlated because towns with
larger population are bound to have larger number of vehicles.
5.2.1 Positive and Negative Correlation :
The Variables showing corresponding changes in their values are said to be
correlated. But these changes in different pairs of variables are not of the same kind.
In some cases the changes in values of both the variables are in the same direction.
Increase in value of one variable causes increase in value of the other variable.
Correlation between these variables is said to be positive. The consumption of coal
and the amount of electricity produced are positively correlated.
In some other cases the changes in the values of the two variables may be in
opposite direction. Increase in value of one variable may cause decrease in value of
the other variable. These variables are said to be negatively correlated. Since ample
supply of a commodity results in fall of price and scarcity results in rise of price the
variables supply of a commodity and its price have negative correlation between
them.
5.3 COVARIANCE
As stated above, the study of correlation between two variables is in some
sense a study of joint variation of the two variables which may be termed as
covariation. In order to ascertain the degree of correlation we need a measure of this
degree of covariation. Such a measure is provided by covariance which is defined as
the mean of products of deviations of the observed values of X and Y from their
respective means.
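Let us have a sample of n pairs of observations (xi, yi) on the variables X and
Y. Then the means of X and Y for the given sample are
$$\bar{x} = \frac{\sum x_i}{n} \quad \text{and} \quad \bar{y} = \frac{\sum y_i}{n}$$
Then the covariance of X and Y for the given sample is
$$\text{Cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{n}$$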
The interesting feature of this measure is that it may be negative, zero or
positive according to the nature of correlation between the variables. In case of data
on positively correlated variables the covariance is also positive.
Let us now study the effect of change of origin and scale. Let us change the
variables X and Y into u and v by the transformation
$$u = \frac{X - a}{h} \quad \text{and} \quad v = \frac{Y - b}{k}$$
Then $x_i = a + hu_i$ and $y_i = b + kv_i$,
$\bar{x} = a + h\bar{u}$ and $\bar{y} = b + k\bar{v}$,
$x_i - \bar{x} = h(u_i - \bar{u})$ and $y_i - \bar{y} = k(v_i - \bar{v})$.
Hence
$$\text{Cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n} = hk\,\frac{\sum (u_i - \bar{u})(v_i - \bar{v})}{n} = hk\,\text{Cov}(u, v)$$
Thus covariance is invariant to the change of origin but not to the change of scale.
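The definition of covariance and the effect of a change of origin and scale can be sketched in code as follows. The function name `covariance`, the paired data and the constants a, h, b, k are all hypothetical, chosen only for this illustration.

```python
def covariance(xs, ys):
    """Mean of the products of deviations from the respective means (divisor n)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

# Hypothetical paired observations, chosen only for illustration.
xs = [10, 11, 12, 13, 14]
ys = [43, 40, 37, 34, 31]

a, h = 12, 1     # assumed origin and scale for X
b, k = 37, 3     # assumed origin and scale for Y
us = [(x - a) / h for x in xs]
vs = [(y - b) / k for y in ys]

print(covariance(xs, ys))            # covariance of the original values
print(h * k * covariance(us, vs))    # hk * Cov(u, v), which should agree
```

Since y falls as x rises in this data, the covariance comes out negative.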
5.4 COEFFICIENT OF CORRELATION
For further study of correlation and comparison of correlation it is necessary to
measure the degree of correlation between the two variables under consideration.
Professor Karl Pearson has suggested a measure of a degree of correlation called
coefficient of correlation.
Karl Pearson's Coefficient:
It is defined as the ratio of covariance of two variables to the product of
standard deviations of these variables. It is also known as product moment
correlation coefficient. The coefficient of correlation computed for a sample from a
bivariate population is denoted by r.
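Let us have a sample of n pairs of observations (xi, yi) on variables X and Y.
Then the sample correlation coefficient is given by
$$r = \frac{\text{Cov}(x, y)}{\sigma_x \sigma_y}$$
Since
$$\text{Cov}(X, Y) = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{n}$$
and
$$\sigma_x = \sqrt{\frac{\sum x_i^2}{n} - \bar{x}^2}\,; \quad \sigma_y = \sqrt{\frac{\sum y_i^2}{n} - \bar{y}^2}$$
on simplification we get
$$r = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum x_i^2 - n\bar{x}^2\right)\left(\sum y_i^2 - n\bar{y}^2\right)}}$$
This form is more suitable for computation of the correlation coefficient.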
The Magnitude or numerical value of r measures the degree of correlation and
the algebraic sign of r indicates the type of correlation – positive or negative. Thus
the value of correlation coefficient gives us the complete idea about the correlation
between two variables.
5.4.1 Properties of Correlation Coefficient:
Let r be the coefficient of correlation between the variables X and Y computed
from the sample of n pairs (xi yi)
(i) The magnitude of the coefficient of correlation i.e|r| is invariant to the
change of origin and scale.
Let us denote the coefficient of correlation between X and Y by rxy and that
between u and v by ruv. Let us change the variables X and Y into u and v by the
transformation
$$u = \frac{x - a}{h} \quad \text{and} \quad v = \frac{y - b}{k}$$
Then X = a + hu and Y = b + kv, and |rxy| = |ruv|.
This shows that the numerical value of the correlation coefficient is invariant
to the change of origin and scale.
Further it can be concluded that (i) when h and k have the same algebraic sign, i.e.
when both are positive or both are negative, rxy = ruv and (ii) when h and k have
different algebraic signs, rxy = –ruv.
For example, if coefficient of correlation between X and Y is 0.8 that between
2X and 3Y is 0.8. Also the coefficient of correlation between –2X and –3Y is 0.8.
But the correlation between 2X and –3Y or that between –2X and 3Y is –0.8.
Further the coefficient of correlation between (2X + 5) and (3Y –10) is the
same as rxy but that between (2X + 5) and (–3y +10) is –rxy
ii) Karl Pearson's coefficient of correlation between two variables is
numerically less than or equal to unity.
5.4.2 Interpretation of the value of Correlation Coefficient:
The numerical value of the correlation coefficient measures the degree of
correlation between the two variables. The larger value of |r| indicates closer
relationship between the variables. When |r| > 0.8, it indicates correlation of very
high degree. When |r| lies between 0.3 and 0.7, one can say that there is significant or
considerable, correlation between the two variables. Correlation is said to be very
poor or insignificant when |r| < 0.3. The algebraic sign of r indicates whether the
correlation is positive or negative.
The value r = 0 means that the variables are uncorrelated. When r = 1 or r = –1,
there is perfect positive correlation or perfect negative correlation respectively
between the two variables. But these values of r are very uncommon. In real life
situations, the chance of occurrence of these values of r is very small.
In all these interpretations it is assumed that the sample from which r is
computed is sufficiently large.
5.4.3 Computing Correlation Coefficient For Ungrouped Data :
The data specifying all the pairs of observations (xi, yi), i = 1, 2, ..., n, on two
variables X and Y are called ungrouped data. The steps in computing the correlation
coefficient for these data are given below :
i) Compute the means of X and Y
$$\bar{x} = \frac{\sum x_i}{n} \quad \text{and} \quad \bar{y} = \frac{\sum y_i}{n}$$
ii) Compute the standard deviations of X and Y
$$\sigma_x = \sqrt{\frac{\sum x_i^2}{n} - (\bar{x})^2}, \quad \sigma_y = \sqrt{\frac{\sum y_i^2}{n} - (\bar{y})^2}$$
This needs computation of the sums of squares $\sum x_i^2$ and $\sum y_i^2$.
iii) Compute the sum of products $\sum x_i y_i$. Then
$$\text{Cov}(x, y) = \frac{\sum x_i y_i}{n} - \bar{x}\bar{y}$$
iv) Compute the coefficient of correlation between X and Y.
$$r = \frac{\text{Cov}(x, y)}{\sigma_x \sigma_y}$$
Note : If the values of X and Y in the data are inconveniently large, so as to
make computation of the sums of squares and the sum of products difficult, we may employ
the techniques of change of origin and/or change of scale to simplify the
computations. Usually the scaling factors are both positive, so the value of r remains
unaltered. For example, if we have the transformation
$$u = \frac{X - a}{h} \quad \text{and} \quad v = \frac{Y - b}{k}, \quad \text{where } h > 0,\ k > 0,$$
then rxy = ruv.
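Steps (i) to (iv) above can be sketched in code as follows. The function name `pearson_r`, the paired data and the constants used for the change of origin and scale are hypothetical, introduced only for this illustration; the divisor n is used throughout, as in the formulas above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's r computed by steps (i) to (iv), with divisor n throughout."""
    n = len(xs)
    x_bar = sum(xs) / n                                            # step (i)
    y_bar = sum(ys) / n
    sd_x = sqrt(sum(x * x for x in xs) / n - x_bar ** 2)           # step (ii)
    sd_y = sqrt(sum(y * y for y in ys) / n - y_bar ** 2)
    cov = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar   # step (iii)
    return cov / (sd_x * sd_y)                                     # step (iv)

# Hypothetical paired data, used only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(round(r, 4))

# A change of origin and scale with positive factors leaves r unchanged.
us = [(x - 3) / 2 for x in xs]
vs = [(y - 4) / 5 for y in ys]
print(round(pearson_r(us, vs), 4))
```

Both printed values agree, which illustrates the note above for h > 0 and k > 0.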
Check Your Progress. - 5.2 to 5.4
1. What is correlation?
_____________________________________________________
_____________________________________________________
2. What is positive correlation & negative correlation?
_____________________________________________________
_____________________________________________________
3. What is Covariance?
____________________________________________________
_____________________________________________________
4. What is Karl Pearson's coefficient of correlation?
_____________________________________________________
_____________________________________________________
5. Choose the correct answer from given.
(i) If X and Y are any two random variables then the covariance between aX +
b, cY + d is given by
a) cov(X, Y)
b) abcd cov(X, Y)
c) ac cov(X, Y)
d) ac cov(X,Y) + ab
(ii) The correlation coefficient between college entrance exam grades and the
final grades was computed to be –1.08. On the basis of this you would
recommend that:
a) the entrance exam is a good predictor of success
b) students who do worst in this exam will do best in the final
c) students at this school are not scholars
d) the correlation coefficient be recomputed
(iii) The correlation coefficient between X and Y is known to be zero, We then
conclude that
a) X and Y have standard distributions.
b) the variances of X and Y are equal.
c) there exists no relationship between X and Y
d) there exists no linear relationship between X and Y
(iv) Suppose the correlation coefficient between height as measured in feet and
weight as measured in pounds is 0.40. What is the correlation coefficient of
height measured in inches versus weight measured in ounces? (12 inches =
one foot, 16 ounces = one pound)
a) 0.40 b) 0.30
c) 0.533 d) cannot be determined from the information given
(v) Consider the following data.
x 10 11 12 13 14
y 43 40 37 34 31
which one of the following would be true?
(a) Correlation coefficient between X and Y is negative but not equal to –1.
(b) Correlation coefficient between X and Y is –1.
(c) Correlation coefficient between X and Y is 0.
(d) None of the above.
5.5 SUMMARY
– The mutual or joint relation between two variables is called correlation.
– Two variables are said to be correlated when a change in the value of one
variable causes a corresponding change in the value of the other variable. In
some cases the changes in the values of both variables are in the same
direction: an increase in the value of one variable causes an increase in the value of the
other variable. Such variables are called positively correlated variables. In
some other cases the changes in the values of the two variables may be in
opposite directions: an increase in the value of one variable may cause a decrease in the
value of the other variable. These variables are said to be negatively correlated.
– The joint variation of two variables is termed covariation.
– A measure of the degree of covariation is called covariance.
– Professor Karl Pearson suggested a measure of the degree of correlation
called the coefficient of correlation.
– It is defined as the ratio of the covariance of the two variables to the product of the
standard deviations of these variables. It is also known as the product moment
correlation coefficient.
5.6 CHECK YOUR PROGRESS– ANSWERS
5.2 to 5.4
1. The mutual or joint relation between two variables is called correlation.
2. When an increase in the value of one variable is accompanied by an increase
in the value of the other variable, the variables are called positively
correlated variables. When an increase in the value of one variable is
accompanied by a decrease in the value of the other variable, the variables
are said to be negatively correlated.
3. The study of correlation between two variables is termed covariation. A
measure of the degree of covariation is called covariance.
4. According to Professor Karl Pearson, a measure of the degree of correlation
is called the coefficient of correlation. It is defined as the ratio of the
covariance of the two variables to the product of the standard deviations of
these variables. It is also known as the product moment correlation coefficient.
5. (i) – c, (ii) – d, (iii) – d, (iv) – a, (v) – b
5.7 ILLUSTRATIVE EXAMPLE
Example : The following are the values of exports of raw cotton (X) and the
values of imports of manufactured cotton goods (Y) in crores of Rs. Compute the
coefficient of correlation between X and Y.
Table 5.1 : Computation of Coefficient of Correlation
  X      Y     u = X – 70    v = Y – 60     uv      u²      v²
  42     56       –28            –4         112     784      16
  44     49       –26           –11         286     676     121
  58     53       –12            –7          84     144      49
  55     58       –15            –2          30     225       4
  89     65        19             5          95     361      25
  98     76        28            16         448     784     256
  66     58        –4            –2           8      16       4
Total             –38            –5        1063    2990     475
Here the given values of X and Y are large. So we convert X and Y into u and
v by change of origin.
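Take u = X – 70 and v = Y – 60. Then
\[ \bar{u} = \frac{\sum u_i}{n} = \frac{-38}{7} = -5.4285, \qquad \bar{v} = \frac{\sum v_i}{n} = \frac{-5}{7} = -0.7143 \]
The standard deviations are
\[ \sigma_u = \sqrt{\frac{\sum u_i^2}{n} - \bar{u}^2} = \sqrt{\frac{2990}{7} - (-5.4285)^2} = 19.94 \]
\[ \sigma_v = \sqrt{\frac{\sum v_i^2}{n} - \bar{v}^2} = \sqrt{\frac{475}{7} - (-0.7143)^2} = 8.2065 \]
The covariance is
\[ \mathrm{Cov}(u, v) = \frac{\sum u_i v_i}{n} - \bar{u}\,\bar{v} = \frac{1063}{7} - (-5.4285)(-0.7143) = 151.857 - 3.8776 = 147.9794 \]
Hence
\[ r_{uv} = \frac{\mathrm{Cov}(u, v)}{\sigma_u \sigma_v} = \frac{147.9794}{19.94 \times 8.2065} = 0.9043 \]
Since the coefficient of correlation is invariant to change of origin we have
\[ r_{xy} = r_{uv} = 0.9043 \]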
This shows that there is correlation of high degree between the variables X and Y.
Note : Transformation of variables need not be used unless it significantly
facilitates computations. Here it is used only as an illustration.
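The computation in this example can be checked with a short Python sketch. It is only an illustration: the function name pearson_r is an assumption of this sketch, and it uses the same divisor-n formulas as the text. Because the program does not round intermediate values, the last digit may differ slightly from the hand computation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation as Cov(x, y) / (sd_x * sd_y), with divisor n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    sd_x = sqrt(sum(x * x for x in xs) / n - mean_x ** 2)
    sd_y = sqrt(sum(y * y for y in ys) / n - mean_y ** 2)
    return cov / (sd_x * sd_y)

# Data of the example: exports X and imports Y
X = [42, 44, 58, 55, 89, 98, 66]
Y = [56, 49, 53, 58, 65, 76, 58]

# Change of origin used in the text: u = X - 70, v = Y - 60
u = [x - 70 for x in X]
v = [y - 60 for y in Y]

r_xy = pearson_r(X, Y)
r_uv = pearson_r(u, v)
print(round(r_xy, 3), round(r_uv, 3))
```

Both calls should agree up to floating-point error, which illustrates the invariance of r to a change of origin.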
5.8 QUESTIONS FOR SELF - STUDY
1. Explain with an example the concept of bivariate data.
2. When are two variables said to be correlated? What do you mean by (i)
positive correlation and (ii) negative correlation? Give two examples of each
type.
3. Define Karl Pearson's correlation coefficient and state its properties.
4. Show that the correlation coefficient is numerically invariant to the change of
scale and origin.
5. Ten pairs of values of X and Y give the following results: $\sum x = 40$, $\sum y = 50$,
$\sum x^2 = 200$, $\sum y^2 = 500$ and $\sum xy = 160$. Find the correlation coefficient between
X and Y.
6. Twenty pairs of values of X and Y give $\bar{x} = 5$, $\bar{y} = 3$, $\sum x^2 = 680$, $\sum y^2 = 500$
and $\sum x_i (y_i - \bar{y}) = 120$. Find the coefficient of correlation.
7. From the data of 25 pairs of observations of X and Y a student got x = 100,
y = 1250, y2 = 1300, xy = 50. Are these result consistent?
8. Two series of X and Y with 50 observations have standard deviations 4.5 and
3.5 respectively. The sum of products of deviations of X and Y from their
respective means is 420.0. Find the coefficient of correlation between X and
Y.
9. From the following data of supply in quintals (X) and price in Rs. per quintal
(Y) of a certain commodity compute the correlation coefficient between price
and supply.
X : 80 82 86 91 83 85 89 96 93 90
Y: 145 140 130 124 133 127 120 110 116 130
10. From the following data of height X in cm and weight Y in kg. of 10 adults
find the correlation coefficient between X and Y.
X: 155 185 175 145 180 178 158 150 180 165
Y : 50 65 63 50 60 61 55 54 65 54
11. The mean soil temperature (X) and number of days (Y) required for
germination for winter wheat at 10 places are given below :
X 57 42 38 42 45 42 44 40 46 44
Y: 10 26 41 29 27 27 19 18 19 31
Compute the correlation coefficient between X and Y.
12. From the following data on water X (in ft) and yield of Alfalfa Y (in tons per
acre) calculate the correlation coefficient between X and Y.
X: 1.0 1.5 2.0 2.5 3.0 2.5 4.0
Y: 5.3 5.7 6.3 7.3 8.3 8.7 8.4
13. From the data of n pairs of observations of X and Y the following results are
obtained: x = 199, y = 94, $\sum (x - \bar{x})^2 = 1298$, $\sum (y - \bar{y})^2 = 600$ and
$\sum (x - \bar{x})(y - \bar{y}) = -262$. Find the coefficient of correlation.
14. Find n, if r = 0.5, $\sigma_y = 8$, $\sum (x_i - \bar{x})^2 = 90$ and $\sum (x_i - \bar{x})(y_i - \bar{y}) = 120$.
15. Given n = 20, $\sum x = 80$, $\sum y = 40$, $\sum x^2 = 1680$, $\sum y^2 = 320$, $\sum xy = 480$, find the
correlation coefficient between x and y.
16. Compute the correlation coefficient between x and y from the following :
n = 10, $\sum x = 100$, $\sum y = 150$, $\sum (x - 10)^2 = 180$, $\sum (y - 15)^2 = 150$,
$\sum (x - 10)(y - 15) = 60$
17. Given rxy = 0.75, find the correlation coefficient between
a) (x–10) and (y–15)
b) (2x–4)and(2–y)
c) x/2 and y/5
5.9 SUGGESTED READINGS
1. Mathematics and Statistics by M. L. Vaidya, M. K. Kelkar
2. Statistical Analysis by S. P. Azen and A. A. Afifi
3. Pre-degree Mathematics by Vaze, Gosavi
Chapter 6
Linear Regression
6.0 Objectives
6.1 Introduction
6.2 Line of Regression
6.3 Equation of Line of Regression by the Method of
Least Squares
6.4 Interpretation of Coefficient of Regression
6.5 Properties of Coefficient of Regression
6.6 Summary
6.7 Check Your Progress - Answers
6.8 Illustrative Examples
6.9 Questions for Self – Study
6.10 Suggested Readings
6.0 OBJECTIVES
After studying this chapter you will be able to discuss the following –
Regression
Two variable
Three variables
Quantitative evidence
Sophisticated Results
Proper equation
Interpretation of results
Use of results (decisions)
6.1 INTRODUCTION
In the preceding chapter we have studied methods of measuring the degree of
correlation between two variables by obtaining bivariate data on these variables.
If the bivariate data provide a quantitative evidence of existence of correlation or
association between the variables, our attempt would be to establish this association
in some functional form mathematically, that would enable us to estimate quite
accurately, on an average, the value of one variable on the basis of the value of other
variable. Such a mathematical relationship between two variables is called regression
equation or simply regression.
This estimation by association is quite sophisticated and very useful. This
procedure is actually that of prediction and prediction is the central function of
sciences. The main task of any scientific study is to discover the general relationships
between the observed variables and to state the nature of such relationships in
mathematical terms, so that the value of one variable can be predicted on the basis of
that of another. This is what we are going to attempt in this chapter. Generally, the
relationships between the variables under study, such as i) height and weight of adult
men, ii) number of infant deaths and number of births etc., are very blurred, vague and
imprecise. Ordinary mathematical methods are not useful in this case but statistical
methods are. The special contribution of statistics in this field is that of handling such
vague, blurred and imprecise relationships.
As stated above the mathematical relationship between the two variables under
study is called regression equation which is essentially a prediction equation. But the
term regression is well established in statistics and no attempt has been made to
replace it.
6.2 LINE OF REGRESSION
The simplest equation for expressing the relationship between the two
variables is a linear equation. In this case the regression is known as linear and the
equation is called the line of regression. Among the two variables under
consideration the regression equation expresses one variable in terms of the other. If
the equation expresses Y in terms of X, Y is called 'dependent' or 'explained' variable
and X the 'independent' or 'explanatory' variable. [Note that the term 'independent' is
not used in statistical sense.]
Thus the equation Y = a + bX is called line of regression of Y on X and is used
for prediction of Y for given X. Here a and b are constants for the given line. The
coefficient b of X, is called the regression coefficient of Y on X.
Likewise the equation X = a' + b'Y gives the line of regression of X on Y and
is used for prediction of X for given Y. The coefficient b' is the regression coefficient
of X on Y.
There is only one measure of degree of correlation between the two variables
X and Y. It is the correlation coefficient r. But for the same pair of variables we have
two lines of regression because we have two choices for dependent and independent
variables. The coefficient of correlation $r_{xy}$ is not different from $r_{yx}$. Hence there is
only one coefficient of correlation for the given pair of variables.
The constants in the regression equation that best fits the data are determined
by the principle of least squares.
6.3 EQUATION OF LINE OF REGRESSION BY THE METHOD OF
LEAST SQUARES
Let us have a sample of n pairs of observations $(x_i, y_i)$ on the variables X and
Y. Let the equation of line of regression of Y on X be
\[ Y = a + bX \qquad \ldots (1) \]
For the i-th observation, $y_i$ is the observed value of Y. The value of Y obtained from
equation (1) for $X = x_i$ is called the linear regression estimate of Y, denoted $\hat{y}_i$.
Thus
\[ \hat{y}_i = a + b x_i \qquad \ldots (2) \]
Now the constants a and b in equation (1) are evaluated so that the sum of
squares of deviations of the observed $y_i$ from their regression estimates $\hat{y}_i$ is the least.
This is known as the method of least squares.
Let the sample of n pairs $(x_i, y_i)$ have the means $\bar{x}$ and $\bar{y}$ and the variances $\sigma_x^2$
and $\sigma_y^2$ for X and Y respectively, and let $\mathrm{Cov}(X, Y) = m_{11}$ be the covariance between X
and Y for the sample.
Let $D = \sum (y_i - \hat{y}_i)^2$, which is the sum of squares of deviations of the observed $y_i$
from the linear regression estimates $\hat{y}_i$.
The constants a and b are found in such a way that D is minimum. These
values of a and b are given by the equations
\[ na + b \sum x_i = \sum y_i \qquad \ldots (3) \]
\[ a \sum x_i + b \sum x_i^2 = \sum x_i y_i \qquad \ldots (4) \]
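The sums $\sum x_i$, $\sum y_i$, $\sum x_i^2$ and $\sum x_i y_i$ are known from the data. Thus we have two
equations (3) and (4) in two unknowns a and b, which we can solve for a and b.
The equations (3) and (4) are called normal equations.
From equation (3) we have
\[ a = \frac{\sum y_i}{n} - b\,\frac{\sum x_i}{n} = \bar{y} - b\bar{x} \]
Substituting this value of a in (4), we get
\[ (\bar{y} - b\bar{x}) \sum x_i + b \sum x_i^2 = \sum x_i y_i \]
Now $\sum x_i = n\bar{x}$, so
\[ (\bar{y} - b\bar{x})(n\bar{x}) + b \sum x_i^2 = \sum x_i y_i \]
Dividing by n and collecting the terms in b gives
\[ b\left(\frac{\sum x_i^2}{n} - \bar{x}^2\right) = \frac{\sum x_i y_i}{n} - \bar{x}\,\bar{y}, \qquad \text{i.e.} \qquad b = \frac{\mathrm{Cov}(X, Y)}{\sigma_x^2} = \frac{m_{11}}{\sigma_x^2} \]
Substituting these values in (1) the equation of line of regression is written as
\[ Y = \bar{y} - b\bar{x} + bX \]
or
\[ Y = \bar{y} + b(X - \bar{x}) \qquad \ldots (5) \]
Likewise the equation of line of regression of X on Y obtained by the method
of least squares is
\[ X = \bar{x} + b'(Y - \bar{y}) \qquad \ldots (6) \]
where
\[ b' = \frac{\mathrm{Cov}(X, Y)}{\sigma_y^2} \]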
From the equations of lines of regression as given in (5) and (6) it is evident
that both the lines pass through the point $(\bar{x}, \bar{y})$. Thus $(\bar{x}, \bar{y})$ is the point of
intersection of the two lines of regression.
The lines of regression also can be expressed as
\[ (Y - \bar{y}) = b(X - \bar{x}) \]
which is the line of regression of Y on X, and
\[ (X - \bar{x}) = b'(Y - \bar{y}) \]
which is the line of regression of X on Y. In this form the equations are easy to memorize.
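The least squares solution derived in this section can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the function name fit_line and the small data set are assumptions made up for the sketch. The slope is computed as covariance over variance, the intercept as the mean of Y minus the slope times the mean of X, and the result is then substituted back into the two normal equations.

```python
def fit_line(xs, ys):
    """Least-squares line Y = a + bX: b = Cov(X, Y) / Var(X), a = mean_y - b * mean_x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    b = cov_xy / var_x
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data, used only to exercise the formulas
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)

# Left-hand sides of the normal equations (3) and (4) for the fitted a and b
n = len(xs)
lhs3 = n * a + b * sum(xs)
lhs4 = a * sum(xs) + b * sum(x * x for x in xs)
rhs3 = sum(ys)
rhs4 = sum(x * y for x, y in zip(xs, ys))
print(abs(lhs3 - rhs3) < 1e-9, abs(lhs4 - rhs4) < 1e-9)
```

The two comparisons confirm that the fitted constants satisfy the normal equations up to floating-point error.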
6.4 INTERPRETATION OF COEFFICIENT OF REGRESSION
Consider line of regression of Y on X. In this form the equation is of the form
Y = a + bX. Here b is the coefficient of regression of Y on X. From the equation of
line of regression it is clear that for unit change in value of X, the value of Y will
change by b units. This b is the rate of change of value of Y for unit change in X. If b
is positive the increase in value of X will be associated with increase in value of Y
i.e. the correlation between X and Y will be positive. On the contrary if b is negative
increase in value of X will correspond to decrease in value of Y, showing that there
is negative correlation between X and Y.
In general the coefficient of regression gives the rate of change of dependent
variable per unit change in value of independent variable and the algebraic sign of
the coefficient of regression determines whether the correlation is positive or
negative.
6.5 PROPERTIES OF COEFFICIENT OF REGRESSION
Let a sample of n pairs of values $(x_i, y_i)$ of the variables X and Y give the
variances $\sigma_x^2$ and $\sigma_y^2$ and the coefficient of correlation r. Let b and b' be the
coefficients of regression of Y on X and X on Y respectively.
Then we know that $\mathrm{Cov}(X, Y) = r \sigma_x \sigma_y$
and the regression coefficients b and b' are given by
\[ b = \frac{\mathrm{Cov}(X, Y)}{\sigma_x^2} \qquad \text{and} \qquad b' = \frac{\mathrm{Cov}(X, Y)}{\sigma_y^2} \]
Then
\[ b = \frac{\mathrm{Cov}(X, Y)}{\sigma_x^2} = \frac{r \sigma_x \sigma_y}{\sigma_x^2} = r\,\frac{\sigma_y}{\sigma_x} \]
and
\[ b' = \frac{\mathrm{Cov}(X, Y)}{\sigma_y^2} = \frac{r \sigma_x \sigma_y}{\sigma_y^2} = r\,\frac{\sigma_x}{\sigma_y} \]
a) Since $\sigma_x$ and $\sigma_y$ are always positive we can say that both the regression
coefficients have the same algebraic sign which is the same as that of
correlation coefficient. Also from the values of b and b' it follows that
\[ bb' = r\,\frac{\sigma_y}{\sigma_x} \cdot r\,\frac{\sigma_x}{\sigma_y} = r^2 \]
The product of regression coefficients is equal to $r^2$.
Thus the values of b and b' will be said to be consistent if (i) both have the
same algebraic sign and (ii) their product is less than or equal to unity, as
$r^2 \le 1$.
b) The regression coefficient is invariant to the change of origin but not to the
change of scale.
Since every regression coefficient is a ratio of covariance to variance it is
invariant to the change of origin. The reason for this is that both the covariance and
the variance are central moments, which are known to be invariant to the change of
origin.
Let us have
\[ u = \frac{X}{h} \qquad \text{and} \qquad v = \frac{Y}{k} \]
Then $\sigma_x^2 = h^2 \sigma_u^2$ and $\sigma_y^2 = k^2 \sigma_v^2$,
and $\mathrm{Cov}(X, Y) = hk\,\mathrm{Cov}(u, v)$.
Let $b_{vu}$ be the coefficient of regression of v on u. Then
\[ b_{vu} = \frac{\mathrm{Cov}(u, v)}{\sigma_u^2} \]
and the coefficient of regression of Y on X is then given by
\[ b = b_{yx} = \frac{\mathrm{Cov}(X, Y)}{\sigma_x^2} = \frac{hk\,\mathrm{Cov}(u, v)}{h^2 \sigma_u^2} = \frac{k}{h}\, b_{vu} \]
and
\[ b' = b_{xy} = \frac{h}{k}\, b_{uv} \]
From this it follows that if the scaling factors h and k for the two
transformations are equal the regression coefficients will be unaltered.
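The properties stated in this section can be checked numerically. The following Python sketch is an illustration only: it reuses the data of Table 5.1 from the preceding chapter, the function name regression_coefficients is an assumption of the sketch, and the scaling factors h = 2 and k = 5 are chosen arbitrarily.

```python
from math import sqrt

def regression_coefficients(xs, ys):
    """Return (b, b_prime, r): slope of Y on X, slope of X on Y, and correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / var_x, cov / var_y, cov / sqrt(var_x * var_y)

X = [42, 44, 58, 55, 89, 98, 66]
Y = [56, 49, 53, 58, 65, 76, 58]
b, b_prime, r = regression_coefficients(X, Y)

# Property (a): product of the regression coefficients equals r squared
product_matches = abs(b * b_prime - r ** 2) < 1e-12

# Property (b): a change of origin leaves b unchanged
b_shift, _, _ = regression_coefficients([x - 70 for x in X], [y - 60 for y in Y])

# Property (b): a change of scale u = X/h, v = Y/k gives b_yx = (k/h) * b_vu
h, k = 2.0, 5.0  # arbitrary scaling factors for illustration
b_vu, _, _ = regression_coefficients([x / h for x in X], [y / k for y in Y])

print(product_matches, abs(b - b_shift) < 1e-9, abs(b - (k / h) * b_vu) < 1e-9)
```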
Check Your Progress. - 6.2 to 6.5
1. What is Line of Regression?
_____________________________________________________
_____________________________________________________
2. What is Regression Co-efficient?
_____________________________________________________
_____________________________________________________
3. Choose the correct answer from the given.
i) Based on the data {(xi, yi), i = 1, 2, …, 20} the two regression lines are
y = 1/5 + 3/5 x and x = 1/5 + 3/5 y. Let mx, my denote sample means.
(a) The two lines actually collapse into one line and the correlation
coefficient is 1.
(b) correlation coefficient is 1/20
(c) mx = my = 1 (d) mx = my = 1/2
ii) Linear regression of Y on X is Y = 1 + X. Correlation coefficient between y
and x is 1/2. Then the regression coefficient bx,y of x on y is;
(a) 1 (b) ½ (c) 1/4 (d) 1/2
6.6 SUMMARY
– The mathematical relationship between the two variables under study is called
regression equation which is essentially a prediction equation.
– The simplest equation for expressing the relationship between the two
variables is a linear equation. In this case the regression is known as linear and the
equation is called the line of regression.
– The equation Y = a + bX is called line of regression of Y on X and is used for
prediction of Y for given X. Here a and b are constants for the given line. The
coefficient b of X, is called the regression coefficient of Y on X.
6.7 CHECK YOUR PROGRESS - ANSWERS
6.2 to 6.5
1. The simplest equation for expressing the relationship between the two
variables is a linear equation. In this case the regression is known as linear and the
equation is called the line of regression.
2. The equation Y = a + bX is called line of regression of Y on X and is used for
prediction of Y for given X. Here a and b are constants for the given line. The
coefficient b of X, is called the regression coefficient of Y on X.
3 (i) – d (ii) – c
6.8 ILLUSTRATIVE EXAMPLES
Example 1 : In an agricultural experiment on the study of effect of depth of
water in the soil (X) in ft. on the yield of a crop in lb. per plot (Y) the following
data were obtained.
X: 1.8 1.9 2.5 1.4 1.3 2.1 2.3
Y: 200 370 450 160 90 440 380
Obtain the equation of line of regression of Y on X and estimate the yield
when the depth of water in the soil is 2 ft.
Solution : The steps in the computations are as follows :
i) Find the sums, the sums of squares and the sum of products.
\[ \sum x_i = 13.3, \quad \sum y_i = 2090, \quad \sum x_i^2 = 26.45, \quad \sum y_i^2 = 751100, \quad \sum x_i y_i = 4327 \]
The number of observations n = 7.
ii) Compute the means of X and Y
\[ \bar{x} = \frac{13.3}{7} = 1.9 \qquad \text{and} \qquad \bar{y} = \frac{2090}{7} = 298.57 \]
iii) Compute the variance of X
\[ \sigma_x^2 = \frac{\sum x_i^2}{n} - \bar{x}^2 = \frac{26.45}{7} - (1.9)^2 = 3.7786 - 3.61 = 0.1685 \]
iv) Compute the covariance
\[ \mathrm{Cov}(X, Y) = \frac{\sum x_i y_i}{n} - \bar{x}\,\bar{y} = \frac{4327}{7} - (1.9)(298.57) = 618.143 - 567.28 = 50.863 \]
v) Let the equation of line of regression of Y on X be Y = a + bX. Then
\[ b = \frac{\mathrm{Cov}(X, Y)}{\sigma_x^2} = \frac{50.863}{0.1685} = 301.85 \]
\[ a = \bar{y} - b\bar{x} = 298.57 - (301.85)(1.9) = -274.945 \]
Hence the equation of line of regression is
Y = 301.85 X – 274.945
The regression estimate of Y when X = 2 is
Y = 301.85 × 2 – 274.945 = 328.76
Thus the linear regression estimate of yield of crop when the depth of water is
2 ft. is 328.76 lb.
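The steps of this solution can be reproduced with a short Python sketch. It is only an illustration under the same divisor-n formulas; the names fit_line, depth and yield_lb are assumptions of the sketch. Since the program carries full precision while the hand computation rounds at each step, the constants may differ slightly in the last digits.

```python
def fit_line(xs, ys):
    """Line of regression of Y on X: returns (a, b) for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum(x * x for x in xs) / n - mean_x ** 2                  # step iii
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y  # step iv
    b = cov_xy / var_x                                                 # step v
    a = mean_y - b * mean_x
    return a, b

depth = [1.8, 1.9, 2.5, 1.4, 1.3, 2.1, 2.3]   # X: depth of water in ft.
yield_lb = [200, 370, 450, 160, 90, 440, 380]  # Y: yield in lb. per plot

a, b = fit_line(depth, yield_lb)
estimate = a + b * 2.0  # regression estimate of yield at a depth of 2 ft.
print(f"Y = {a:.2f} + {b:.2f} X, estimate at X = 2: {estimate:.2f} lb.")
```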
Example 2 : The following table shows the means and the standard deviations
of prices of shares of two companies.
Company     Mean Price     Standard deviation
A           Rs. 39.50      Rs. 10.80
B           Rs. 47.50      Rs. 16.80
The coefficient of correlation between the prices of two shares is 0.42. Find
the most likely price of shares of company A when the price of share of company B
is Rs. 55.
Solution : Let the prices of shares of company A and company B in Rs. be X
and Y respectively. Then we are given that the means of X and Y are $\bar{x} = 39.50$ and
$\bar{y} = 47.50$ and the standard deviations are $\sigma_x = 10.80$ and $\sigma_y = 16.80$.
Also the coefficient of correlation is r = 0.42.
Now to estimate the price of shares of company A, i.e. the value of X, we are given
the price of shares of company B, i.e. Y = 55.
For this we have to use the equation of line of regression of X on Y.
Let this equation be X = a' + b'Y. Here
\[ b' = r\,\frac{\sigma_x}{\sigma_y} = 0.42 \times \frac{10.8}{16.8} = 0.27 \]
and
\[ a' = \bar{x} - b'\bar{y} = 39.50 - 0.27 \times 47.50 = 26.675 \approx 26.68 \]
The equation of line of regression is
X = 26.68 + 0.27 Y
The most likely price of shares of company A is the linear regression estimate
of X.
For Y = 55 this estimate of X is given by
\[ \hat{x} = 26.68 + 0.27 \times 55 = 41.53 \]
The most likely price of shares of company A is Rs. 41.53 when that of
company B is Rs. 55.
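When only the means, standard deviations and r are available, the estimate can be computed directly from the form X = x̄ + b'(Y – ȳ). The following Python sketch is an illustration; the function name estimate_x_from_y is an assumption of the sketch.

```python
def estimate_x_from_y(mean_x, mean_y, sd_x, sd_y, r, y):
    """Regression estimate of X for a given Y, using b' = r * sd_x / sd_y."""
    b_prime = r * sd_x / sd_y
    return mean_x + b_prime * (y - mean_y)

# Company A is X, company B is Y, as in the solution above
estimate = estimate_x_from_y(39.50, 47.50, 10.80, 16.80, 0.42, 55)
print(round(estimate, 2))
```

Working from the deviation form avoids rounding the intercept, so the result may differ from the hand-computed figure in the last decimal place.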
Example 3 : Given the two linear regression equations
8X – 10Y + 66 = 0 and 40X – 18Y = 214 and V(X) = 9,
find the means of X and Y, the correlation coefficient between X and Y and
V(Y).
Solution : We know that the coordinates of the point of intersection of the two
lines of regression are $(\bar{x}, \bar{y})$, the means of X and Y.
The regression equations are
8X – 10Y = –66
and 40X – 18Y = 214
Solving these equations we get X = 13 and Y = 17.
Hence the means of X and Y are $\bar{x} = 13$ and $\bar{y} = 17$.
Now to find the correlation coefficient we have to find the regression
coefficients b and b'.
For this we have to choose one of the lines as that of regression of Y on X and
the other is then the line of regression of X on Y.
Let 8X – 10Y + 66 = 0 be the line of regression of Y on X. This gives
\[ Y = \frac{8}{10}X + 6.6 = 0.8X + 6.6 \]
The coefficient of X in this equation is b = 0.8. Then the other equation is that
of line of regression of X on Y which can be written as
\[ X = \frac{18}{40}Y + \frac{214}{40} \]
Here the regression coefficient is
\[ b' = \frac{18}{40} = 0.45 \]
Now
\[ r^2 = bb' = 0.8 \times 0.45 = 0.36, \qquad r = \pm 0.6 \]
The correlation coefficient has the same algebraic sign as b, so r = 0.6.
[Note : We choose arbitrarily the lines as that of regression of Y on X or X on
Y. If the product b b' does not exceed unity, our choice is correct. Otherwise we have to
take the other choice. Fortunately there are only two choices.]
Now the coefficient of regression of Y on X is $b = r\,\sigma_y / \sigma_x$. Since V(X) = 9, we have
$\sigma_x = 3$, so
\[ b = r\,\frac{\sigma_y}{\sigma_x} = 0.6 \times \frac{\sigma_y}{3} = 0.8 \]
\[ \sigma_y = \frac{3 \times 0.8}{0.6} = 4, \qquad \text{hence } V(Y) = 16 \]
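The reasoning of this example can be traced in a short Python sketch. It is an illustration only: the helper solve_2x2 (Cramer's rule for two linear equations) and the variable names are assumptions of the sketch, and the assignment of the first line to the regression of Y on X follows the choice made in the solution.

```python
from math import sqrt, copysign

def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*X + b1*Y = c1 and a2*X + b2*Y = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det

# 8X - 10Y = -66 and 40X - 18Y = 214 intersect at the means of X and Y
mean_x, mean_y = solve_2x2(8, -10, -66, 40, -18, 214)

b_yx = 8 / 10    # slope of Y on X from 8X - 10Y + 66 = 0
b_xy = 18 / 40   # slope of X on Y from 40X - 18Y = 214
assert b_yx * b_xy <= 1  # consistency check on the choice of lines

# r has the same sign as the regression coefficients
r = copysign(sqrt(b_yx * b_xy), b_yx)

sd_x = sqrt(9)            # V(X) = 9
sd_y = b_yx * sd_x / r    # from b_yx = r * sd_y / sd_x
var_y = sd_y ** 2
print(mean_x, mean_y, round(r, 4), round(var_y, 4))
```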
6.9 QUESTIONS FOR SELF - STUDY
1. Explain the concept of regression and its utility.
2. Why do we have two lines of regression (i) of Y on X and (ii) X on Y?
3. What do you mean by a linear regression coefficient of Y on X? How will you
interpret the value of it?
4. If $b_{yx}$ and $b_{xy}$ are the coefficients of regression of Y on X and X on Y
respectively, show that $b_{yx} \cdot b_{xy} = r^2$.
5. Bring out the inconsistency, if any, in the following :
i) b = 1.6, b' = –0.5 ii) b = 3.2, b' = –0.5
iii) b = b' = 1.50 and r = –0.7
5. The following table gives the infant mortality rate (X) and birth rate (Y) for
eight years.
X: 22.9 17.8 20.8 21.3 20.7 20.9 17.5 13.9
Y: 44 46 56 42 32 47 38 45
Obtain the line of regression of birth rate on infant mortality rate and estimate
the birth rate for the infant mortality rate 15.
6. The following data give live weight of a pig (X) and weight of a side of bacon
(Y)
X: 125 155 190 203 217
Y: 36 46 51 65 72
Estimate the line of regression of weight of pig on weight of a side of bacon
and calculate the weight of pig if the weight of bacon is 101.
7. The number of defective items produced per unit time, Y, by a certain machine
is thought to vary directly with the speed of the machine, X, measured in r.p.m.
Observations for 10 hours selected at random from a month give the following
results.
X: 13.2 14.9 16.4 8.1 13.1 10.8 10.9 17.4 10.2 15.8
Y: 9.4 12.2 11.4 6.0 9.6 7.5 5.7 12.3 7.0 9.0
Estimate the line of regression of Y on X and the number of defectives per
hour when the speed of the machine is 10 r.p.m.
8. The following random sample gives the number of hours of study (X) for an
examination and the grades Y obtained by 12 students.
X: 3 3 3 4 4 5 5 5 6 6 7 8
Y: 45 60 55 60 75 70 80 75 90 80 75 85
Obtain the line of regression of grades on hours of study.
9. The average price of 200 shares was Rs. 150 and the average gain per share
was Rs. 7. The coefficient of regression of gain per share (Y) on the price (X)
was 0.50, Estimate the gain per share for the price Rs. 200.
10. Twelve observations on the price (X) of shares and the volume of sales (Y) at
Bombay stock exchange gave the following results.
$\sum x = 580$, $\sum y = 370$, $\sum xy = 11494$, $\sum x^2 = 41568$ and $\sum y^2 = 17206$.
Obtain the equation of line of regression of volume of sales on price of shares.
Predict the volume of sales (in thousands of shares) for shares of price Rs. 40/–.
11. Given the following data, obtain the linear regression estimate of X for Y = 10.
xi = 7.6, yi = 14.8, x = 3.6, y = 25, r = 0.8
12. The two regression lines are 2x – 3y = 0 and 4y – 5x + 7 = 0. Find the means
of X and Y. If standard deviation of X is 3, find that of Y.
13. Find the means of X and Y and the correlation between X and Y, if the
equations of lines of regression are 2y – x – 50 = 0 and 3y – 2x –10 = 0.
14. The equations of two lines of regression are
3X + 2Y – 26 = 0 and 6X + Y – 31 = 0
Find the means of X and Y. Estimate Y for X = 2.
15. Given the means of X and Y, 5 and 10. The line of regression of Y on X is
parallel to the line 20Y = 9X + 40 and correlation coefficient is 0.6. Estimate
the value of X when Y = 30.
16. In the regression analysis of a problem the equations of lines of regression
were found to be 10X – 4Y = 80 and 10Y – 9X + 40 = 0. The variance of Y
was 36. Find the means of X and Y, $\sigma_x^2$ and the coefficient of correlation.
6.10 SUGGESTED READINGS
1. Mathematics and Statistics by M. L. Vaidya, M. K. Kelkar
2. Statistical Analysis by S. P. Azen and A. A. Afifi
3. Pre-degree Mathematics by Vaze, Gosavi
Chapter 7
Index Numbers
7.0 Objectives
7.1 Introduction
7.2 Uses of Index Numbers
7.3 Price Index Numbers
7.4 Problems in Construction of an Index Number
7.5 Summary
7.6 Check Your Progress - Answers
7.7 Questions for Self - Study
7.8 Suggested Readings
7.0 OBJECTIVES
After studying this chapter you will be able to calculate and explain-
Index numbers
Types of index numbers
Construction of Index numbers
Uses of Index numbers
7.1 INTRODUCTION
Many a time we are interested in knowing the relative changes in values of
variables like population, prices, industrial production, agricultural production,
exports, imports etc. over a period. One of the ways of measuring these changes is
the index numbers. We are quite familiar with the wholesale price index, consumer's
price index, Bombay stock exchange index which give us the knowledge of the
degree of changes in corresponding variables.
An index number can be defined as the device used for measuring the relative
changes in value of a variable or of a group of related variables from one period to
other or from one place to another place.
7.2 USES OF INDEX NUMBERS
The price and quantity (of consumption) index numbers have in recent years
become important tools in the interpretation of the economic conditions of a state.
Rapid and erratic changes in the price index or the index of share prices indicate
unstable economic conditions, whereas stability in these index numbers indicates a
stable economy. In that sense these index numbers are called Economic Barometers.
Wages of employees are closely tied with the consumer's price index of that locality.
Revision of pay scales, fixation of dearness allowances, minimum wages, pension
policies are linked with price index numbers. So the price index numbers are closely
watched by the employers as well as employees. The index of industrial production is
of great interest to businessmen and the students of national economy because it
furnishes the information on the current position of national production. The price
index numbers are also useful in determining the purchasing power of money. It can
also be used for determining the real income or real wages of employees.
The index numbers like intelligence quotient are useful in Psychology and
Education. The population index is of interest to the students of sociology and
demography.
Although there can exist many index numbers for different purposes a
common man is more concerned with the price index numbers. So for the further
discussion let us restrict ourselves to the price index numbers only.
7.3 PRICE INDEX NUMBERS
For measuring the relative change in price of a single commodity or a group of
commodities we use price index number. Generally we are interested in measuring
changes in prices over a change of time.
The change is measured from some fixed period of reference known as the base
period. The period which is compared with this base period is called the current period.
Let us use the notation p0 and p1 for the prices in base period and current period
respectively. Then we use the ratio of prices for measuring the change because the
ratio is independent of units in which the prices are expressed.
The relative importance of different commodities can be considered by using
weights. The weights used are proportional to the quantities of consumption or the
value of goods and services in the series. Such index numbers are called weighted
Index numbers. Different systems of weights give rise to different formulae.
Some of these are given below.
(a) Laspeyre's Index:
This index number is the ratio of weighted aggregates of prices using the
quantities consumed in base year ($q_0$) as the weights.
\[ P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 \]
This can be interpreted as the ratio of the value of the basket of goods consumed
in the base year at current year prices to its value at base year prices. This index
number generally has an upward bias.
(b) Paasche's Index :
This index number is also a ratio of weighted aggregates of prices, using the
quantities consumed in current year ($q_1$) as weights.
\[ P_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100 \]
This index can be interpreted as the ratio of the value of goods consumed in the
current year at current year prices to their value at base year prices. This index,
contrary to Laspeyre's Index, is found to give a downward bias.
(c) Fisher's Index:
Since neither Laspeyre's formula nor Paasche's formula gives a correct idea of
change in price level, Irving Fisher suggested that the geometric mean of these two
index numbers will give the suitable index number. According to him the index
number is given by
\[ P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100 \]
This gives a more accurate price index but it lacks a simple interpretation. However,
it has many desirable properties and hence it is known as Fisher's Ideal Index
Number.
Example : Calculate the Laspeyre's, Paasche's and Fisher's index numbers for
prices in the year 1987 with 1982 as base year from the following data.
                  Base year (1982)             Current year (1987)
Commodity     Price (p0)   Quantity (q0)    Price (p1)   Quantity (q1)
Rice              4            15               6            20
Wheat             3            40               5            35
Jawar             5            20               5            25
Pulses            6            10               8            10
Solution : Computations for the required index numbers are shown in the following
table.
Commodity    p0    q0    p1    q1    p0q0    p0q1    p1q0    p1q1
Rice          4    15     6    20      60      80      90     120
Wheat         3    40     5    35     120     105     200     175
Jawar         5    20     5    25     100     125     100     125
Pulses        6    10     8    10      60      60      80      80
TOTAL                                 340     370     470     500
Laspeyre's Index
\[ P_L = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{470}{340} \times 100 = 138.235 \]
Paasche's Index
\[ P_a = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100 = \frac{500}{370} \times 100 = 135.135 \]
Fisher's Index
\[ P_F = \sqrt{P_L \times P_a} = \sqrt{138.235 \times 135.135} = 136.676 \]
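The three index numbers of this example can be computed with the following Python sketch. It is an illustration only; the function name price_indices and the layout of the data as (p0, q0, p1, q1) tuples are assumptions of the sketch.

```python
from math import sqrt

def price_indices(rows):
    """Return (Laspeyres, Paasche, Fisher) price indices from (p0, q0, p1, q1) rows."""
    sum_p0q0 = sum(p0 * q0 for p0, q0, p1, q1 in rows)
    sum_p0q1 = sum(p0 * q1 for p0, q0, p1, q1 in rows)
    sum_p1q0 = sum(p1 * q0 for p0, q0, p1, q1 in rows)
    sum_p1q1 = sum(p1 * q1 for p0, q0, p1, q1 in rows)
    laspeyres = 100 * sum_p1q0 / sum_p0q0   # base year quantities as weights
    paasche = 100 * sum_p1q1 / sum_p0q1     # current year quantities as weights
    fisher = sqrt(laspeyres * paasche)      # geometric mean of the two
    return laspeyres, paasche, fisher

# (p0, q0, p1, q1) for Rice, Wheat, Jawar, Pulses: 1982 base, 1987 current
rows = [(4, 15, 6, 20), (3, 40, 5, 35), (5, 20, 5, 25), (6, 10, 8, 10)]
laspeyres, paasche, fisher = price_indices(rows)
print(round(laspeyres, 3), round(paasche, 3), round(fisher, 3))
```

Fisher's index always lies between the other two, since it is their geometric mean.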
7.4 PROBLEMS IN CONSTRUCTION OF AN INDEX NUMBER
Construction of any index number is itself a difficult task, and the process poses
many problems. The following are the common problems which need careful and
thoughtful consideration while constructing an index number.
(i) Specification of the Purpose and Scope of Index Number :
Every index number is constructed with some definite purpose and its uses are
also limited. There does not exist an all-purpose index number. The wholesale
price index number cannot be used for comparing the retail price levels in two
periods. The consumer's price index for textile workers cannot be used for
comparing cost of living of higher income group. So it is very important and
necessary to specify the purpose at the outset only. It governs the further
details of construction of an index number. It also defines the proper use of
index number under consideration.
(ii) Selection of Items :
Selection of the items and their number is governed by the purpose itself. Only
relevant items should be included in the series which have direct influence on
the index number. The number of items to be included should be enough to
make it representative and it should not be too large also as it would create
difficulties in collecting the price data and information on weights. The items
should suit the tastes, habits and customs of the class of people for whom the
index number is constructed. Consideration should be given to the quality of
the items like rice, wheat, recreation as these differ from class to class.
(iii) Selection of Weights :
Generally, the index number is a weighted index numbers as it is more
Index Numbers / 233
realistic. Weights allow different items to influence the index number to
different extents. The weight of an item should be proportional to its relative
importance. The method in which we use such weights is called
explicit weighting. In some cases, we may include in the series more varieties
of an item which is more important and fewer varieties of items of less
importance; this is indirect or implicit weighting. In explicit weighting, we
generally use the quantity of consumption or the value of goods consumed as
weights. For collecting these data we have to conduct a sample enquiry.
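Explicit weighting as described above can be illustrated with a minimal sketch
in Python. The function name, the prices and the weights are all hypothetical
and are not taken from the text; the sketch assumes the index is formed as a
weighted arithmetic mean of price relatives.

```python
def weighted_price_index(base_prices, current_prices, weights):
    """Weighted arithmetic mean of price relatives, as a percentage."""
    # Price relative of an item = 100 * (current price / base price).
    relatives = [100 * p1 / p0 for p0, p1 in zip(base_prices, current_prices)]
    # Each relative counts in proportion to the weight of its item.
    weighted_total = sum(w * r for w, r in zip(weights, relatives))
    return weighted_total / sum(weights)


# Hypothetical data for three items (assumed for illustration only).
base_prices = [10.0, 20.0, 5.0]
current_prices = [12.0, 22.0, 10.0]
weights = [5, 3, 2]  # assumed relative importance of the items

weighted = weighted_price_index(base_prices, current_prices, weights)
unweighted = weighted_price_index(base_prices, current_prices, [1, 1, 1])
print(f"weighted index:   {weighted:.1f}")
print(f"unweighted index: {unweighted:.1f}")
```

With these assumed figures the weighted index is 133.0, while equal weights
give about 143.3. The third item, whose price doubles, carries the smallest
weight and so influences the weighted index less.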
(iv) Selection of Base Year:
The usefulness of the index number depends to some extent on the choice of
base year, so proper care should be taken in selecting it. The
base year should not be too distant in the past. The pattern of consumption is
likely to change with time. Some items may go out of use and some
new items may come into use. This may destroy the comparability of the periods
and necessitates revision of the base year from time to time. The other important
point in the choice of base year is that it must be a period of economic stability.
Events like wars, famines and epidemics are likely to create erratic changes in
prices which are bound to be temporary. They reflect the instability of the
economy, and these conditions do not prevail for a long time. So it is necessary
that the base year chosen be a year of economic stability.
(v) Selection of Sources of Price Data :
After the commodities have been selected, it is necessary to collect the data on
prices of these commodities for constructing a price index number. These prices
are collected from the markets or shops where the usual purchases are made.
Concessions or discounts should not be taken into consideration. The
prices should be for those qualities of the commodity which are commonly
consumed by the class of people under consideration. The prices may be
collected by inviting quotations from reliable sources and agents. To
ensure reliability, the quotations for some commodities may be invited from
two or more agents. The price data can also be obtained from published reports
of official agencies.
(vi) Selection of Average or Formula :
Since we are interested in constructing a single index which will summarise the
changes in the values of a number of related variables, we have to select the proper
average that will serve the purpose.
Among the various averages, the median and the mode are ruled out.
Only the arithmetic mean and the geometric mean can be used. Of these,
the A. M. is easier to understand and to compute. So in many cases we use
the weighted A. M. of price relatives for constructing the index number. In
practice, Laspeyre's formula is widely used as it uses the base year quantities
as weights. The data on base year quantities are more easily available than those on
current year quantities.
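The link between the weighted A. M. of price relatives and Laspeyre's formula
can be checked numerically. The sketch below is only an illustration: the
function names and the data are hypothetical, and the weights are assumed to be
the base year values p0 × q0. With those weights the weighted A. M. of price
relatives reduces to 100 × Σp1q0 / Σp0q0, which needs only base year quantities.

```python
def laspeyres_price_index(p0, q0, p1):
    """Laspeyre's price index: 100 * sum(p1*q0) / sum(p0*q0)."""
    current_cost = sum(p * q for p, q in zip(p1, q0))  # base basket at current prices
    base_cost = sum(p * q for p, q in zip(p0, q0))     # base basket at base prices
    return 100 * current_cost / base_cost


def weighted_mean_of_relatives(p0, p1, weights):
    """Weighted arithmetic mean of the price relatives 100 * p1 / p0."""
    relatives = [100 * b / a for a, b in zip(p0, p1)]
    return sum(w * r for w, r in zip(weights, relatives)) / sum(weights)


# Hypothetical data for three commodities (assumed for illustration only).
p0 = [4.0, 10.0, 2.0]   # base year prices
q0 = [10, 5, 20]        # base year quantities
p1 = [5.0, 12.0, 3.0]   # current year prices

base_values = [p * q for p, q in zip(p0, q0)]  # weights p0 * q0
laspeyres = laspeyres_price_index(p0, q0, p1)
mean_of_relatives = weighted_mean_of_relatives(p0, p1, base_values)
print(f"{laspeyres:.2f} {mean_of_relatives:.2f}")
```

Both figures come to about 130.77 for this assumed data. No current year
quantities appear anywhere in the calculation.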
7.2 to 7.4 Check Your Progress.
1. What is an index number?
______________________________________________________
______________________________________________________
2. What is a price index number?
______________________________________________________
______________________________________________________
3. List the common problems which need to be considered while constructing
an index number.
______________________________________________________
______________________________________________________
4. Choose the correct answer from given list.
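i) Laspeyre's index number is given by the formula
a) $\dfrac{\sum p_0 q_1}{\sum p_1 q_0} \times 100$
b) $\dfrac{\sum p_0 q_0}{\sum p_1 q_1} \times 100$
c) $\dfrac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$
d) $\dfrac{\sum p_1 q_1}{\sum p_0 q_0} \times 100$
ii) Paasche's index number is given by the formula
a) $\dfrac{\sum p_0 q_1}{\sum p_1 q_0} \times 100$
b) $\dfrac{\sum p_1 q_1}{\sum p_0 q_1} \times 100$
c) $\dfrac{\sum p_1 q_0}{\sum p_0 q_1} \times 100$
d) $\dfrac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$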
7.5 SUMMARY
An index number can be defined as a device used for measuring the relative
changes in the value of a variable or of a group of related variables from one
period to another or from one place to another. Index numbers are also
called economic barometers.
For measuring the relative change in the price of a single commodity or a group of
commodities we use a price index number.
Common problems in the construction of an index number are :
(i) Specification of the Purpose and Scope of Index Number.
(ii) Selection of Items.
(iii) Selection of Weights.
(iv) Selection of Base Year.
(v) Selection of Sources of Price Data.
(vi) Selection of Average or Formula.
7.6 CHECK YOUR PROGRESS - ANSWERS
7.2 to 7.4
1. An index number can be defined as a device used for measuring the relative
changes in the value of a variable or of a group of related variables from one
period to another or from one place to another. Index numbers are also
called economic barometers.
2. For measuring the relative change in the price of a single commodity or a group of
commodities we use a price index number.
3. Common problems in the construction of an index number are :
(i) Specification of the Purpose and Scope of Index Number
(ii) Selection of Items
(iii) Selection of Weights
(iv) Selection of Base Year
(v) Selection of Sources of Price Data
(vi) Selection of Average or Formula
4. (i) – c (ii) – b
7.7 QUESTIONS FOR SELF - STUDY
1. Explain the meaning and the utility of index numbers.
2. State Laspeyre's, Paasche's and Fisher's formulae of index numbers and
mention their specialties.
3. Give the interpretation of Laspeyre's and Paasche's index numbers of price.
4. Discuss various problems in construction of an Index number.
5. Compute Laspeyre's, Paasche's and Fisher's index numbers for price from the
following data.
Commodity    Base Year            Current Year
             Price    Quantity    Price    Quantity
A            8        50          10       60
B            10       40          12       50
C            5        100         9        70
D            6        10          8        20
6. Calculate Laspeyre's and Paasche's index numbers for price from the following
data and comment on your results.
(a)
Commodity p0 q0 p1 q1
A 1 10 1.5 8
B 5 12 6.0 9
C 8 5 12.0 3
(b)
Commodity p0 q0 p1 q1
1 2 7 3 8
2 5 9 4 7
3 3 5 6 9
(c)
Commodity p0 q0 p1 q1
I 19 3 6
II 3 8 4 3
III 2 7 6 5
7. Calculate appropriate Index number in each of the following.
(a)
Commodity A B C D
p0 5 7 8 4
p1 6 5 9 2
q0 6 8 10 5
(b)
Commodity I II III
p1 1 4 7
p0 2 5 5
q1 5 7 11
7.8 SUGGESTED READINGS
1. Mathematics and Statistics by M. L. Vaidya, M. K. Kelkar
2. Statistical Analysis by S. P. Azen and A. A. Afifi
3. Pre-degree Mathematics by Vaze, Gosavi