Processing of Data
Variable
Example
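[Figure: Types of Variables. Variables divide into Qualitative and Quantitative; Quantitative variables divide into Discrete (e.g. children in a family) and Continuous (e.g. weight of a student).]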
Types of Variables
1. Qualitative
2. Quantitative
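Qualitative Variable
When the characteristic being studied is non-numeric, the variable is called a qualitative variable or an attribute.
Examples
Qualitative variables are gender, religious affiliation, type of automobile owned, state of
birth and eye color.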
When the data are qualitative, we are usually interested in how many or what proportion
fall in each category. For example, what percent of the population has blue eyes? How
many Catholics and how many Protestants are there in the United States?
Quantitative Variable
When the variable studied can be reported numerically, the variable is called a
quantitative variable.
Examples
Quantitative variables are the balance in your checking account, the ages of company
presidents, the life of an automobile battery (such as 42 months) and the number of
children in a family.
i. Discrete Variables
Discrete variables can assume only certain values, and there are usually “gaps” between
the values. Examples of discrete variables are the number of bedrooms in a house (1, 2, 3, 4
etc.), the number of cars arriving, and the number of students in each section of a course (25 in
section A, 42 in section B and 18 in section C).
ii. Continuous Variables
Observations of a continuous variable can assume any value within a specific range.
Examples of continuous variables are the air pressure in a tire and the weight of a
shipment of tomatoes.
Classification of Data
After collection and editing of data an important step towards processing the data is
classification.
Types of Classification
i. Geographical Classification
Geographical classifications are usually listed in alphabetical order for easy reference.
Items may also be listed by size to emphasize the important areas as in ranking the states
by the population.
ii. Chronological Classification
When data are observed over a period of time the type of classification is known as
chronological classification. For example, the sales figures of a company listed year by
year form a chronological classification.
iii. Qualitative Classification
In qualitative classification, data are classified on the basis of some attribute or quality
such as sex, color of hair, literacy, religion etc. The point to note in this type of
classification is that the attribute cannot be measured numerically; we can only record
whether it is present or absent. For example, if the attribute under study is blindness, we
may find out how many persons are blind in a given population.
[Diagram: Population divided into Blind and Non-blind]
The process of preparing a frequency distribution for a discrete variable is very simple.
We have just to count the number of times a particular value is repeated, which is called
the frequency of that value. In one column, place all possible values of the variable from
the lowest to the highest. In order to facilitate counting, prepare a second column of
“tally” marks: for each observation, put a bar (vertical line) opposite the particular value
to which it relates.
We finally count the number of bars corresponding to each value of the variable and
place it in the column of the frequency.
Example
The number of refrigerators sold on 22 working days by a leading agency house:
23 30 20 26 30 20 23 40 40 26 20 30
23 40 28 26 23 40 28 28 30 30
Number sold   Tally     Frequency
20            |||       3
23            ||||      4
26            |||       3
28            |||       3
30            |||||     5
40            ||||      4
Total                   22
The table clearly shows that on 3 days 20 refrigerators were sold each day, on 4 days 23
refrigerators were sold each day etc.
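The tallying procedure above can be sketched in a few lines of Python. This is only an illustration: it uses the 22 daily sales figures from the example, and the variable names (`sales`, `frequency`) are chosen here for convenience rather than taken from the text.

```python
from collections import Counter

# Daily refrigerator sales for the 22 working days listed in the example.
sales = [23, 30, 20, 26, 30, 20, 23, 40, 40, 26, 20, 30,
         23, 40, 28, 26, 23, 40, 28, 28, 30, 30]

# Counter counts how many times each distinct value occurs (its frequency).
frequency = Counter(sales)

# List the values from lowest to highest with one tally bar per observation.
for value in sorted(frequency):
    count = frequency[value]
    print(f"{value:>3}  {'|' * count:<6} {count}")

# The frequencies must add up to the number of observations.
total = sum(frequency.values())
print("Total", total)
```

Checking that the total equals the number of observations (22 here) is a quick way to catch a missed or double-counted tally.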
This method of classification helps in condensing the data only where values are largely
repeated; otherwise there will be hardly any condensation. In order to make the series
more compact so that its characteristics can be easily studied, data may be classified
according to class intervals.
Cumulative Frequency
In some situations, we may be interested, not in the frequencies in various classes, but
rather in the frequencies or proportions of observation which are “less than” or “greater
than” a given value. This leads to a cumulative frequency distribution. This is derived
from a frequency distribution by forming a cumulative frequency column. This column is
computed by adding the successive class frequencies from top to bottom. The entry
corresponding to the top interval is the frequency of that class, the entry opposite the
second interval is the sum of the frequencies in the first and second class intervals, and so
on.
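As a rough illustration, the cumulative column can be built with a running sum. The sketch below reuses the refrigerator frequencies from the earlier example; because those values are discrete, the two cumulative columns are read as "at most" and "at least" a given value, which is an interpretation made here rather than wording from the text.

```python
from itertools import accumulate

# Values and frequencies from the refrigerator sales example.
values = [20, 23, 26, 28, 30, 40]
frequencies = [3, 4, 3, 3, 5, 4]

# Running total from top to bottom: days on which sales were at most each value.
at_most = list(accumulate(frequencies))

# Days on which sales were at least each value: total minus everything above it.
total = sum(frequencies)
at_least = [total - cum + f for cum, f in zip(at_most, frequencies)]

for v, f, lo, hi in zip(values, frequencies, at_most, at_least):
    print(f"{v:>3} {f:>3} {lo:>3} {hi:>3}")
```

The last entry of the "at most" column and the first entry of the "at least" column both equal the total number of observations.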
This type of classification is most popular in practice. The following technical terms are
important when data are classified according to class intervals:
i. Class limits
The class limits are the lowest and the highest values that can be included in the class.
For example, take the class 20-40. The lowest value of this class is 20 and the highest 40.
The two boundaries of a class are known as the lower limit and upper limit of the class.
The lower limit of a class is the value below which there can be no value in that class.
The upper limit of a class is the value above which no value can belong to that class. Of
the class 70-89, 70 is the lower limit and 89 is the upper limit, i.e. in this class there can
be no value which is less than 70 or more than 89. Similarly, if we take the class 90-109,
there can be no value in that class that is less than 90 or more than 109.
ii. Class intervals
The span of a class, that is, the difference between the upper limit and the lower limit, is
known as the class interval. For example, in the class 20-40, the class interval is 20 (i.e. 40
minus 20). The size of the class interval is determined by the number of classes and
the total range in the data.
iii. Mid-point
The mid-point is the value lying half-way between the lower and the upper class limits of a
class interval. The mid-point of a class is ascertained as follows:
\[ \text{Mid-point of a class} = \frac{\text{Upper limit of the class} + \text{Lower limit of the class}}{2} \]
There are two methods of classifying the data according to class intervals namely
a. Exclusive method
b. Inclusive method
a. Exclusive Method
When the class intervals are so fixed that the upper limit of one class is the lower limit of
the next class, it is known as the ‘Exclusive’ method of classification. The following data
are classified on this basis:
It is clear that ‘Exclusive method’ ensures continuity of data inasmuch as the upper limit
of one class is the lower limit of the next class. Thus in the above example, there are 50
persons whose income is between Tk. 1800 and Tk. 1899.99. A person who is getting
exactly Tk. 1900 would be included in the class 1900-2000.
Here, whenever this method is used it is necessary to give clear instructions in the
questionnaire. However, the reader should note that if class intervals are given like 0-10,
10-20, etc., it is always presumed that the upper limit is exclusive, i.e. an observation exactly
equal to the upper limit is not included in that class.
b. Inclusive method
Under the ‘Inclusive’ method of classification, the upper limit of one class is included in
that class itself.
Income (Tk.)    No. of Employees
800-899         50
900-999         100
1000-1099       200
In the class 800-899 we include persons whose income is between Tk. 800 and Tk. 899. A
person whose income is exactly Tk. 900 is included in the next class.
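The difference between the two methods can be sketched as follows. The function names `class_exclusive` and `class_inclusive` are hypothetical, classes are assumed to be of equal width, and the inclusive version assumes whole-number data (as in the income table above).

```python
def class_exclusive(value, first_lower, width):
    """Exclusive method: a class runs from its lower limit up to,
    but not including, its upper limit (e.g. 1800-1900, 1900-2000)."""
    start = first_lower + ((value - first_lower) // width) * width
    return (start, start + width)


def class_inclusive(value, first_lower, width):
    """Inclusive method for whole-number data: both limits belong
    to the class (e.g. 800-899, 900-999)."""
    start = first_lower + ((value - first_lower) // width) * width
    return (start, start + width - 1)


# A value equal to an upper limit under the exclusive method moves to the next class.
print(class_exclusive(1900, 1800, 100))   # (1900, 2000)

# Under the inclusive method the upper limit stays in its own class.
print(class_inclusive(899, 800, 100))     # (800, 899)
print(class_inclusive(900, 800, 100))     # (900, 999)
```

Both functions assign every observation to exactly one class; only the way the class is labelled and the treatment of the upper limit differ.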
Principles of Classification
It is difficult to lay down any hard and fast rules for classifying data. However, the
following general principles may be kept in mind:
The number of classes should preferably be between 5 and 15. However, there is
no rigidity about it. The classes can be more than 15 depending upon the total
number of observations in the series and the details required, but they should not
be less than five because in that case the classification may not reveal the
essential characteristics.
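Sturges suggested the following formula for determining the approximate number of
classes:
\[ K = 1 + 3.322 \log N \]
where K is the approximate number of classes and N is the total number of observations.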
However, the precise number of classes to be used for a given variable depends upon
personal judgment and other considerations such as the details required, the ease of
calculation in further statistical work etc.
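A minimal sketch of Sturges' rule, K = 1 + 3.322 log N, is given below. It assumes the logarithm is to base 10 (which is what reproduces the value 5.91 for N = 30 in the worked examples); the function name `sturges_classes` is made up for illustration.

```python
import math


def sturges_classes(n):
    """Approximate number of classes for n observations (Sturges' rule),
    assuming a base-10 logarithm."""
    return 1 + 3.322 * math.log10(n)


k = sturges_classes(30)
print(round(k, 2))   # 5.91
print(round(k))      # 6 classes after rounding
```

The result is only a starting point; as noted above, the final number of classes remains a matter of judgment.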
As far as possible one should avoid odd values of class intervals e.g. 3, 7, 11, 26,
39 etc. Preferably, one should have class intervals of either five or multiples of
five like 10, 20, 25, 100 etc.
The starting point, i.e. the lower limit of the first class, should either be zero or 5
or multiple of 5. For example, if the lowest value of the series is 63 and we have
taken a class interval of 10, then the first class should be 60-70, instead of 63-73.
Similarly, if the lowest value of the series is 76 and the class interval is 5 then the
first class should be 75 to 80 rather than 76 to 81.
Example
The profits (in lakhs of Tk.) of 30 Bangladeshi companies for the year 2005-2006 are
given below:
18 16 23 37 35 49 63 65 55
45 58 57 69 20 22 35 42 37
42 48 53 49 65 39 48 67 25
29 58 65
Solution
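Let us determine the suitable class interval with the help of the following formula:
\[ i = \frac{\text{Range}}{K} \]
where K = 1 + 3.322 log N and Range = Highest value - Lowest value.
\[ K = 1 + 3.322 \log 30 = 5.91 \approx 6, \qquad \text{Range} = 69 - 16 = 53 \]
\[ i = \frac{53}{5.91} = 8.97 \approx 9 \]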
Since values like 3, 7, 9 etc. should be avoided, we will take 10 as the class interval and
the first class will be 15-25.
Frequency Distribution of the profits
Profits (Tk. lakhs)   Tally        Frequency
15-25                 |||||        5
25-35                 ||           2
35-45                 ||||| ||     7
45-55                 ||||| |      6
55-65                 |||||        5
65-75                 |||||        5
Total                              30
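The whole solution can be checked with a short script. This is a sketch under the assumptions used in the example: a base-10 logarithm in Sturges' rule, a class interval rounded up to 10, a first class starting at 15, and the exclusive method (a value equal to an upper limit goes to the next class).

```python
import math

# Profits (in lakhs of Tk.) of the 30 companies in the example.
profits = [18, 16, 23, 37, 35, 49, 63, 65, 55,
           45, 58, 57, 69, 20, 22, 35, 42, 37,
           42, 48, 53, 49, 65, 39, 48, 67, 25,
           29, 58, 65]

# Suggested class interval: range divided by the Sturges number of classes.
k = 1 + 3.322 * math.log10(len(profits))
data_range = max(profits) - min(profits)
raw_interval = data_range / k
print(round(k, 2), data_range, round(raw_interval, 2))

# Choices made in the example: interval of 10, first class 15-25.
width = 10
first_lower = 15

# Prepare the classes, then tally each observation (exclusive method).
counts = {(lower, lower + width): 0
          for lower in range(first_lower, max(profits) + 1, width)}
for p in profits:
    lower = first_lower + ((p - first_lower) // width) * width
    counts[(lower, lower + width)] += 1

for (lower, upper), f in counts.items():
    print(f"{lower}-{upper}: {f}")
```

Note that 65 falls in the class 65-75, not 55-65, because the upper limit is excluded.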
Example
The following are the marks of the 30 students in statistics. Prepare a frequency
distribution taking a suitable class interval.
12 33 23 25 18 35 37 49 54 51 37 15
27 33 42 45 47 55 69 65 63 46 29 18
37 45 46 59 29 55
Solution
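Let us determine the suitable class interval with the help of the following formula:
\[ i = \frac{\text{Range}}{K} \]
where K = 1 + 3.322 log N and Range = Highest value - Lowest value.
\[ K = 1 + 3.322 \log 30 = 5.91 \approx 6, \qquad \text{Range} = 69 - 12 = 57 \]
\[ i = \frac{57}{5.91} = 9.64 \approx 10 \]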
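Taking 10 as the class interval and 10-20 as the first class, we obtain:
Marks     Tally        Frequency
10-20     ||||         4
20-30     |||||        5
30-40     ||||| |      6
40-50     ||||| ||     7
50-60     |||||        5
60-70     |||          3
Total                  30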
Tabulation of Data
One of the simplest and most revealing devices for summarizing data and presenting
them in meaningful fashion is the statistical table. A table is a systematic arrangement of
statistical data in columns and rows. Rows are horizontal arrangements, whereas columns
are vertical ones.
Parts of a table
The various parts of a table may vary from case to case depending upon the given data.
But a good table must contain at least the following parts:
Table number
Title of the table
Caption
Stub
Body of the table
Head note
Footnote
Table number
Each table should be numbered. There are different practices with regard to the place
where this number is to be given. The number may be given in the centre at the top
above the title, at the side of the table at the top, or at the bottom of the table on the
left-hand side.
Caption
Captions refer to the column headings. A caption explains what the column represents. It may
consist of one or more column headings. Under a column heading there may be
sub-heads.
Stub
As distinguished from caption, stubs are the designation of rows or row headings.
Body
The body of the table contains the numerical information. This is the most vital part of
the table.
Head note
It is used to explain certain points relating to the whole table that have not been included
in the title nor in the captions or stubs. For example, the unit of measurement is
frequently written as the head note, such as “in thousand” or “in millions” or “in crores”
etc.
Footnote
Anything in a table which the reader may find difficult to understand from the title,
captions and stubs should be explained in footnotes.
Types of Tables
In a simple (one-way) table only one characteristic is shown. This is the simplest type of
table.
A two-way table shows two characteristics and is formed when either the stub or the caption
is divided into two coordinate parts.
When three or more characteristics are represented in the same table, such a table is
called a higher order table.
General purpose tables, also known as reference tables or repository tables, provide
information for general use or reference.
Special purpose tables, also known as summary or analytical tables, provide information
for a particular discussion. They show the relationship between different groups of figures.
Example
Charting Data
A chart can take the shape of either a diagram or a graph. For the sake of clarity we will
discuss them under two separate heads:
Diagrams
Graphs
Diagrams
For representing data diagrams are more commonly used than graphs.
1. Title
Every diagram must be given a suitable title. The title should convey in as few words as
possible the main idea that the diagram is intended to portray.
2. Proportion between width and height
A proper proportion between the height and width of the diagram should be maintained.
If either the height or the width is too short or too long in proportion, the diagram will
look ugly.
3. Selection of appropriate scale
The scale showing the values should be in even numbers or in multiples of five or ten e.g.
25, 50, 75 or 20, 40, 60. Odd values like 1, 3, 5, and 7 should be avoided.
4. Footnotes
In order to clarify certain points about the diagrams footnotes may be given at the bottom
of the diagram.
5. Index
An index illustrating the different types of lines, shades or colors used should be given so
that the reader can easily make out the meaning of the diagram.
7. Simplicity
Diagrams should be as simple as possible so that the reader can understand their meaning
clearly.
Types of Diagrams
Bar diagrams are the most common type of diagrams used in practice. A bar is a thick
line whose width is shown merely for attention. They are called one-dimensional because
it is only the length of the bar that matters and not the width.
The gap between one bar and another bar should be uniform throughout.
Bars may be either horizontal or vertical. Vertical bars should be preferred
because they look better and also facilitate comparison.
While constructing the bar diagrams, it is desirable to write the respective figure
at the end of each bar so that the reader can know the precise value without
looking at the scale.
A simple bar diagram is used to represent only one variable. For example the figures of
sales, production, population etc, for various years may be shown by means of a simple
bar diagram. However, an important limitation of such diagrams is that they can present
only one classification or one category of data.
Example
The funds flow of Goodwill India Ltd from 1991-92 to 1995-96 is given below:
[Simple bar diagram: Funds Flow (vertical axis, 0 to 150) by Years (1991-92 to 1995-96). Data labels surviving in the source: 126.31, 109.61, 85.8.]
Sub-divided bar diagrams are used to represent various parts of the total. For example, the
number of employees in various departments of a company may be represented by a
sub-divided bar diagram. While constructing such a diagram the various components in each
bar should be kept in the same order. To distinguish between the different components, it is
useful to use different shades or colors. Sub-divided bar diagrams can be vertical as well
as horizontal.
Example
Represent the following data by sub-divided bar diagrams
In a multiple bar diagram two or more sets of inter-related data are represented. The
technique of drawing such a diagram is the same as that of a simple bar diagram. The only
difference is that since more than one phenomenon is represented, different shades, colors
or cross-hatching are used to distinguish between the bars.
Example
[Multiple bar diagram: Gross profits, Profits before tax and Profits after tax (vertical axis, 0 to 2500) by Year (1994-95 and 1995-96). Data labels surviving in the source: 1663, 1376, 1219, 982.]
Histogram
Frequency Polygon
Smoothed frequency curve
Cumulative frequency curves or ‘Ogives’.
Histogram
A histogram is a graphical method for presenting data, where the observations are located
on a horizontal axis (usually grouped into intervals) and the frequency of those
observations is depicted along the vertical axis.
While constructing histograms the variable (class interval) is always taken on the X-axis
and the frequencies depending on it on the Y axis. The distance for each rectangle on the
X-axis shall remain the same in case the class intervals are uniform throughout; if they
are different the width of the rectangles shall also vary. The Y axis represents the
frequencies of each class which constitute the height of its rectangle.
A histogram differs from a bar diagram in two ways. First, a histogram is used only for
representing a frequency distribution, whereas a bar diagram is not used for representing a
frequency distribution. Second, a bar diagram is one-dimensional, i.e. only the length of
the bar is material and not the width; a histogram is two-dimensional, that is, in a
histogram both the length and the width are important.
When class-intervals are equal, take frequency on the Y axis, the variable on the X-axis
and construct adjacent rectangles. In such a case the heights of the rectangles will be
proportional to the frequencies.
Example
[Histogram: Frequency (vertical axis, 0 to 25) against Size class (horizontal axis), with ten equal classes 0-10, 10-20, ..., 90-100.]
When class intervals are unequal, the frequencies must be adjusted before constructing the
histogram. For making the adjustment we take the class which has the smallest class
interval and adjust the frequencies of the other classes in the following manner. If one
class interval is twice as wide as the smallest class interval, we divide the height of
its rectangle by two; if it is three times as wide, we divide the height of its rectangle by
three.
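For unequal class intervals, the adjustment of rectangle heights can be sketched as below. The classes and frequencies are hypothetical values invented for illustration; they do not come from the text.

```python
# Hypothetical frequency distribution with unequal class intervals:
# (lower limit, upper limit, frequency).
classes = [(0, 10, 8), (10, 20, 12), (20, 40, 20), (40, 70, 15)]

# The narrowest class serves as the reference width.
min_width = min(upper - lower for lower, upper, _ in classes)

# Divide each frequency by how many times wider its class is than the narrowest one.
heights = []
for lower, upper, freq in classes:
    factor = (upper - lower) / min_width
    heights.append(freq / factor)
    print(f"{lower}-{upper}: frequency {freq}, width factor {factor:g}, height {freq / factor:g}")
```

With this adjustment the area of each rectangle, not its height, is proportional to the class frequency, so wide classes are not visually exaggerated.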
Example
Pie Diagram
This type of diagram enables us to show the partitioning of a total into component parts. A
very common use of the pie chart is to represent the division of a sum of money into its
components. For example, the entire circle, or pie, may represent the budget of a family
for a month and the sections may represent portions of the budget allotted to rent, food,
clothing and so on. Similarly, through a pie diagram we can show how a rupee spent by a
firm is distributed over various heads such as wages, raw materials, administration expenses
etc.
Example
Areas of continents of the world
The pie diagram is intended to compare the distinct components which together constitute
a whole. The whole is represented by a circle of arbitrary radius and the segments of the
circle represent the component parts. To construct such a diagram we use the fact that “the
whole” (51.5 in the above illustration) corresponds to the total number of degrees in the
circle, namely 360°. This 360° is then proportionately divided among the various
components of the whole. Thus, in the above illustration, the angle of the segment representing
a component is its value divided by 51.5 and multiplied by 360°.
This diagram should be sparingly used, especially if there are many segments.
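The proportional division of 360° among the components can be sketched as follows. The monthly family budget figures are hypothetical and are used only to illustrate the calculation.

```python
# Hypothetical monthly family budget (any currency unit).
budget = {"rent": 300, "food": 450, "clothing": 150, "other": 300}

# The whole corresponds to the full circle of 360 degrees.
total = sum(budget.values())

# Each component receives a share of 360 degrees proportional to its value.
angles = {item: amount / total * 360 for item, amount in budget.items()}

for item, angle in angles.items():
    print(f"{item:<9} {angle:6.1f} degrees")
```

The angles always add up to 360°, which is a useful check before drawing the segments.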
Line Diagram
If we are given values of a variable at different points of time, the set of values is known
as a time series. The line diagram is used to represent this type of data. In this diagram
time is represented along the X-axis and the variable is plotted along the Y-axis. Thus we
get a point for each time period, and successive points, when connected by straight lines,
give the desired diagram. Often a smooth curve is drawn through these points. This
diagram is called a line diagram or, alternatively, a time series graph.
Example
The production (in thousand quintals) of a sugar factory is given below:
[Line diagram: Production (vertical axis, 0 to 120) against Year (horizontal axis, 1998 to 2006). Data labels surviving in the source: 99, 92, 94, 92, 90, 80, 83.]
Graphs
Frequency polygon
In a frequency polygon the mid-values of the continuous class intervals are represented
along the X-axis and the frequencies corresponding to the class intervals are represented
along the Y-axis. The class frequencies are plotted against the mid-values of the
respective class intervals. These points are then joined by straight lines one after another.
The first and the last points are then brought down at each end to the X-axis by joining them
to the mid-value of the next outlying interval of zero frequency. The polygon thus
obtained is called a frequency polygon.
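The points of a frequency polygon can be computed as in the sketch below. The frequencies are those obtained by tallying the 30 marks of the earlier example into classes 10-20 to 60-70; equal class widths are assumed, and only the coordinates are computed, not the drawing itself.

```python
# Classes and frequencies for the marks of 30 students (class interval 10).
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
frequencies = [4, 5, 6, 7, 5, 3]
width = 10

# Mid-value of each class: half-way between the lower and upper limits.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# Plot each frequency against the mid-value of its class.
points = list(zip(midpoints, frequencies))

# Close the polygon on the X-axis using the mid-values of the
# neighbouring (outlying) classes, which have zero frequency.
points = [(midpoints[0] - width, 0)] + points + [(midpoints[-1] + width, 0)]

for x, y in points:
    print(x, y)
```

Joining these points in order with straight lines gives the polygon described above.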
Cumulative frequency polygon
In a cumulative frequency polygon the upper limits of the continuous class intervals are
represented along the X-axis and the cumulative frequencies are represented along the
Y-axis. A free-hand curve drawn to smooth a cumulative frequency polygon is called an ogive.
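The points of a "less than" cumulative frequency polygon can be sketched in the same way. The frequencies are again those of the marks example tallied into classes 10-20 to 60-70; starting the curve at the lower limit of the first class with a cumulative frequency of zero is a common convention assumed here, not something stated in the text.

```python
from itertools import accumulate

# Classes and frequencies for the marks of 30 students.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
frequencies = [4, 5, 6, 7, 5, 3]

# Cumulative frequency up to the upper limit of each class.
upper_limits = [upper for _, upper in classes]
cumulative = list(accumulate(frequencies))

# Start at the lower limit of the first class, where nothing has accumulated yet.
ogive_points = [(classes[0][0], 0)] + list(zip(upper_limits, cumulative))

for x, y in ogive_points:
    print(x, y)
```

The last point always has a cumulative frequency equal to the total number of observations (30 here).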