Fundamentals of Statistics (Lecture Note1)
Note: If we are interested in the heights of the students of B. Sc. in Bioinformatics Engineering, Level-2
Semester-2, of BAU, then a single value of the height of a student is called a datum, and the set of values
of the heights of two or more students is known as data.
Sir Ronald Aylmer Fisher (R. A. Fisher) (British) is called the Father of Statistics.
Types of Statistics
1. Descriptive Statistics are the methods of collecting, organizing, summarizing and presenting
data in an informative way. For example, a graph that shows the number of defective fans
produced during the night shift over a period of one month is a matter of
descriptive statistics.
2. Inferential Statistics provide the basis for predictions, forecasts and estimates used in decision
making about the population. For example, an estimate of the percentage of employees who
arrive late to work is a matter of inferential statistics.
Biostatistics
Biostatistics is the application of statistics to a wide range of topics in biology. It encompasses the design
of biological experiments; the collection, summarization, and analysis of data from those experiments;
and the interpretation of, and inference from, the results.
Population means an aggregate of all individual persons, objects or items possessing certain
characteristics of interest in a particular investigation or enquiry. An aggregate of all farmers who have
used power tillers for their cultivation of land can be considered as a population. The size of the
population is usually denoted by N.
There are two types of population viz. study population and target population. The population from which
the sample is drawn is known as study population and the population for which sample-based results are
generalized is called target population.
Sample is a subset or representative part of a population whose properties are studied to gain information
about the whole population. We are generally interested in knowing the properties of the population.
Sometimes it is impracticable or even impossible to study the whole population because of limited resources
such as time, funds, manpower, trained personnel, management capability, etc. That is why inferences about the
population are usually drawn on the basis of the sample. As an illustration, if we select 25 students from
a population of 650, we have a sample of size 25.
A random sample is one in which every individual person or object or item of a population has an equal
chance of being selected in the sample.
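The idea of equal chance of selection can be sketched in Python. This is only an illustration: the function name `draw_simple_random_sample`, the labelling of students as 1 to 650 and the fixed seed are assumptions made for the example, not part of the lecture note.

```python
import random

def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct unit labels without replacement.

    Every unit has the same chance of being selected.
    """
    rng = random.Random(seed)                 # seed only makes the run repeatable
    labels = range(1, population_size + 1)    # hypothetical labels 1..N
    return rng.sample(labels, sample_size)

# A sample of size 25 from a population of 650 students.
sample = draw_simple_random_sample(650, 25, seed=1)
print(sorted(sample))
```

Drawing without replacement guarantees that no student appears twice in the sample.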
Variable
[Diagram: classification of variables, including continuous variable and discrete variable.]
Quantitative Variable
Variables that can be expressed numerically are known as quantitative variables. Height or weight of
students, length or breadth of fishes, weight of tomato, number of grapes per bunch, number of grains per
panicle, etc. are some examples of quantitative variables.
Qualitative Variable
Variables that cannot be expressed numerically but can be classified or categorised into some mutually
exclusive categories are called qualitative variables. For example, merit of students, educational
attainment, type of farmers (big, medium, small), type of fishes (sea fish, river fish) etc. cannot be
numerically measured but can be grouped into classes or categories. Qualitative variables are also known
as attributes.
Frequency Distribution
An arrangement of observational data according to the frequencies of the observations is called a frequency
distribution. The arrangement should make the observations easy to understand.
Frequency distributions are constructed mainly to present the data in
condensed form and for easy understanding.
1. Finding the Range: In constructing a frequency distribution, the highest and the lowest values in the data
set are first identified and their difference is obtained. This difference between the highest value and the
lowest value is called the range, usually denoted by R.
2. Decision about the Number of Classes: After finding the range, it is necessary to decide the number
of classes into which the entire data set should be divided. The choice of the number of classes should be
realistic: the number should be neither so small that detail is lost nor so large that the
aim of constructing a frequency distribution (condensation) is defeated. It is generally expected to
limit the number of classes between 7 and 15. There is no hard and fast rule for choosing the number of
classes. However, H. A. Sturges' formula gives a guideline for the desired number of classes. The formula is
$k = 1 + 3.322 \log_{10} N$
where N is the total number of observations in the data set and k is the desired number of classes.
4. Counting of Frequencies: For convenience in counting the number of observations falling within each
class, tally marks are used; the frequency of each class is determined by counting the tally marks.
Sometimes it may be necessary to know the number of observations greater or smaller than a particular value or
class of values. For this, cumulative frequencies for each observation or class are obtained.
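The counting step and the two kinds of cumulative frequency can be sketched as follows. The helper name `frequency_table` and the small data set `toy` are hypothetical and chosen only for illustration; `Counter` plays the role of the tally marks.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(values):
    """Return rows of (value, frequency, cumulative from top, cumulative from bottom)."""
    counts = Counter(values)                   # tally each distinct value
    distinct = sorted(counts)
    freqs = [counts[v] for v in distinct]
    from_top = list(accumulate(freqs))         # observations <= this value
    from_bottom = list(accumulate(reversed(freqs)))[::-1]  # observations >= this value
    return list(zip(distinct, freqs, from_top, from_bottom))

toy = [3, 5, 3, 4, 5, 5, 2]   # hypothetical observations
table = frequency_table(toy)
for row in table:
    print(*row, sep="\t")
```

The last entry of the first cumulative column and the first entry of the second both equal the total number of observations, which is a quick check on the counting.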
Example 1.
Suppose the marks obtained by 50 students in an examination are as follows:
32 27 19 40 31 17 15 18 21 27 38 15 33 34 29
26 16 25 33 36 24 22 26 19 36 18 25 20 25 25
31 24 16 28 30 24 29 42 29 28 26 27 47 43 22
25 28 22 24 23
Here the variable is the marks obtained by the students. The data as shown above are called raw or
ungrouped data.
If the performance of the students needs to be described, this may be done in a number of ways.
We may list the marks of the students in either ascending or descending order; data so
arranged are said to form an array. Counting the number of times each value of the variable occurs,
we get a table of the following type:
Table 1. Frequency Distribution of Marks
Marks   Frequency           Cumulative   Marks   Frequency           Cumulative
        (No. of students)   frequency            (No. of students)   frequency
15      2                   2            28      3                   33
16      2                   4            29      3                   36
17      1                   5            30      1                   37
18      2                   7            31      2                   39
19      2                   9            32      1                   40
20      1                   10           33      2                   42
21      1                   11           34      1                   43
22      3                   14           36      2                   45
23      2                   16           38      1                   46
24      4                   20           40      1                   47
25      5                   25           42      1                   48
26      3                   28           43      1                   49
27      2                   30           47      1                   50
Such a table is known as a frequency table or frequency distribution. The above arrangement is an
improvement over the raw data, but to get a still better idea of the performance of the students we
reclassify the data into a grouped frequency distribution as shown below:
Table 2. Grouped Frequency Distribution of Marks
Marks   Tally   Frequency   Cumulative frequency   Cumulative frequency
                            (from top)             (from bottom)
45-50   |       1           50                     1
This type of classification of raw data is called grouped frequency distribution or simply frequency
distribution.
In the above example the highest value is 47, the lowest value is 15 and the range is R = 47 - 15 = 32.
According to Sturges' formula,
$k = 1 + 3.322 \log_{10} 50 \approx 6.64$
That is, 6 to 7 groups are appropriate in this case.
Again, the class interval is $C = R/k = 32/7 \approx 4.57$; accordingly 5 is taken as the class interval.
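The arithmetic of this example can be checked with a short Python sketch. The function name `sturges_classes` and the variable names are assumptions made for the illustration; rounding the number of classes and the class interval upward follows the choice made in the text (7 classes, interval 5).

```python
import math

def sturges_classes(n):
    """Guideline number of classes for n observations (Sturges' formula)."""
    return 1 + 3.322 * math.log10(n)

# Marks of the 50 students in Example 1.
marks = [
    32, 27, 19, 40, 31, 17, 15, 18, 21, 27, 38, 15, 33, 34, 29,
    26, 16, 25, 33, 36, 24, 22, 26, 19, 36, 18, 25, 20, 25, 25,
    31, 24, 16, 28, 30, 24, 29, 42, 29, 28, 26, 27, 47, 43, 22,
    25, 28, 22, 24, 23,
]

R = max(marks) - min(marks)          # range
k = sturges_classes(len(marks))      # guideline number of classes
n_classes = math.ceil(k)             # rounded up
C = R / n_classes                    # class interval before rounding
class_interval = math.ceil(C)        # rounded up to a convenient value

print(f"R = {R}, k = {k:.2f}, C = {C:.2f}, class interval = {class_interval}")
```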
Example 2.
The weights (in g) of tomatoes harvested from the kitchen garden are given below:
75 80 52 87 95 105 92 82 120 65
55 100 115 92 82 97 85 72 67 98
115 62 85 98 110 105 77 63 80 90
54 89 108 103 75 53 105 117 95 64
77 85 94 72 68 100 78 89 94 102
82 95 98 100 77 85 92 97 72 85
72 83 66 58 96 75 88 90 80 95
63 78 84 92 88 77 65 85 92 87
In constructing a frequency distribution the highest and the lowest observations are to be identified first.
In the present data set the highest value is 120 and the lowest value is 52. Therefore the range is
R = 120 - 52 = 68; N = 80
It will be convenient to take 10 as the class interval. As the variable here is continuous, the open interval
method is to be followed in grouping the data set. Though the lowest observation is 52, it is convenient to
start from 50. The classes or groups will, therefore, be 50-60, 60-70, 70-80, 80-90, 90-100, 100-110 and
110-120.
Class       Tally   Frequency   Cumulative frequency   Cumulative frequency
                                (from top)             (from bottom)
110-above   ||||    5           80                     5
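The grouping of the tomato weights can be sketched in Python. This is a minimal illustration under two assumptions: each class contains its lower limit but not its upper limit, and the last class is treated as open-ended ("110-above", as in the table row above) so that the highest observation, 120, is counted. The function name `group_exclusive` is hypothetical.

```python
from itertools import accumulate

def group_exclusive(values, start, width, n_classes):
    """Count values into classes [lower, upper); the last class is open-ended.

    Assumes every value is at least `start`.
    """
    freqs = [0] * n_classes
    for v in values:
        index = min(int((v - start) // width), n_classes - 1)
        freqs[index] += 1
    return freqs

# Weights of the 80 tomatoes in Example 2.
weights = [
    75, 80, 52, 87, 95, 105, 92, 82, 120, 65,
    55, 100, 115, 92, 82, 97, 85, 72, 67, 98,
    115, 62, 85, 98, 110, 105, 77, 63, 80, 90,
    54, 89, 108, 103, 75, 53, 105, 117, 95, 64,
    77, 85, 94, 72, 68, 100, 78, 89, 94, 102,
    82, 95, 98, 100, 77, 85, 92, 97, 72, 85,
    72, 83, 66, 58, 96, 75, 88, 90, 80, 95,
    63, 78, 84, 92, 88, 77, 65, 85, 92, 87,
]

freqs = group_exclusive(weights, start=50, width=10, n_classes=7)
cumulative = list(accumulate(freqs))
labels = ["50-60", "60-70", "70-80", "80-90", "90-100", "100-110", "110-above"]
for label, f, cf in zip(labels, freqs, cumulative):
    print(f"{label:<10}{f:>4}{cf:>5}")
```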
Example 3.
The following data show the number of grapes per bunch:
25 75 15 20 18 62 45 33 40 45
77 30 35 25 65 42 55 37 44 50
40 35 38 47 52 45 33 28 22 22
18 29 48 55 60 58 43 40 47 39
35 45 43 52 57 50 48 55 59 42
28 36 54 48 58 68 78 61 53 42
Here, N = 60, the highest value is 78 and the lowest value is 15. The variable is discrete in nature.
R = 78 - 15 = 63 and $k = 1 + 3.322 \log_{10} 60 \approx 6.91 \approx 7$.
The data may be classified in more or less 7 groups.
Again, $C = R/k = 63/7 = 9$
For convenience, however, 10 may be taken as the class interval.
Table 4. Frequency Distribution of No. of Grapes per Bunch
Class   Tally   Frequency   Cumulative frequency   Cumulative frequency
                            (from top)             (from bottom)
75-84   |||     3           60                     3
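For a discrete variable such as this one, both class limits belong to the class. The sketch below assumes classes of width 10 starting at the lowest value, 15 (that is, 15-24, 25-34, and so on up to 75-84, the class shown in the table row above), and computes the frequencies together with both cumulative columns. The variable names are chosen only for the illustration.

```python
from itertools import accumulate

# Number of grapes per bunch for the 60 bunches in Example 3.
grapes = [
    25, 75, 15, 20, 18, 62, 45, 33, 40, 45,
    77, 30, 35, 25, 65, 42, 55, 37, 44, 50,
    40, 35, 38, 47, 52, 45, 33, 28, 22, 22,
    18, 29, 48, 55, 60, 58, 43, 40, 47, 39,
    35, 45, 43, 52, 57, 50, 48, 55, 59, 42,
    28, 36, 54, 48, 58, 68, 78, 61, 53, 42,
]

START, WIDTH = 15, 10                               # assumed first lower limit and class interval
n_classes = (max(grapes) - START) // WIDTH + 1

freqs = [0] * n_classes
for g in grapes:
    freqs[(g - START) // WIDTH] += 1                # inclusive classes 15-24, 25-34, ...

labels = [f"{START + i * WIDTH}-{START + (i + 1) * WIDTH - 1}" for i in range(n_classes)]
from_top = list(accumulate(freqs))                  # observations up to and including the class
from_bottom = list(accumulate(freqs[::-1]))[::-1]   # observations in the class and above

for row in zip(labels, freqs, from_top, from_bottom):
    print(*row, sep="\t")
```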
Frequency distributions may be presented by graphs and charts in order to make them clearer and more
easily understandable and to allow distributions to be compared quickly. Graphs are also easy to understand
for people who cannot read and for people from different regions who speak different languages. Graphical
representation brings to light the salient features of the data at a glance. It is also useful in locating
some partition values.
The X-axis is used for the variable values and the Y-axis for the frequency; if we indicate the
frequency of each variable value by a dot, the resulting diagram is known as a dot frequency diagram.
Figure 1. Dot frequency diagram (X-axis: variable values; Y-axis: frequency).
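A dot frequency diagram can be approximated in plain text, which may help in seeing how it is built. The sketch below is only illustrative: the function name `dot_frequency_diagram` and the small data set are hypothetical, and one dot is placed above each variable value at the height of its frequency.

```python
from collections import Counter

def dot_frequency_diagram(values):
    """Return text lines: one dot per distinct value at the height of its frequency."""
    counts = Counter(values)
    xs = sorted(counts)
    top = max(counts.values())
    lines = []
    for level in range(top, 0, -1):                 # Y-axis: frequency, highest first
        row = "".join(" . " if counts[x] == level else "   " for x in xs)
        lines.append(f"{level:>2} |{row}".rstrip())
    lines.append("   +" + "---" * len(xs))          # X-axis line
    lines.append("    " + "".join(f"{x:^3}" for x in xs))  # variable values
    return lines

toy = [2, 3, 3, 4, 4, 4, 5]   # hypothetical observations
lines = dot_frequency_diagram(toy)
print("\n".join(lines))
```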
Histogram:
Class intervals are plotted along the X-axis and frequencies are plotted along the Y-axis. For each class or
group, a rectangle is drawn taking the class interval as the base and the class frequency as the height. For
a continuous variable, the rectangles so drawn are attached to the adjacent rectangles on both sides and the
resulting graph is known as a histogram.
[Figure: histogram with class interval (50 to 120) on the X-axis and frequency (0 to 20) on the Y-axis.]
For drawing histograms of frequency distributions having unequal class intervals, frequency density,
instead of frequency, is plotted along the Y-axis.
Frequency density is obtained as $fd = f/c$, c being the class interval.
Example 4.
Drawing the histogram of the distribution of members per family in a certain locality is described below:
Family size        No. of families   Class interval   Frequency density
(class interval)   (f)               (c)              (fd = f/c)
0-2                8                 2                4
2-4                14                2                7
4-8                16                4                4
8-12               20                4                5
12-20              8                 8                1
Total              66
Figure 3: Histogram for data with unequal class interval (X-axis: class interval, 0 to 20; Y-axis: frequency density, 0 to 7).
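The frequency densities of Example 4 can be reproduced with a few lines of Python. The data are those of the table above; the tuple layout and variable names are assumptions made for the illustration.

```python
# (lower limit, upper limit, number of families) for each class of Example 4.
classes = [(0, 2, 8), (2, 4, 14), (4, 8, 16), (8, 12, 20), (12, 20, 8)]

# Frequency density fd = f / c, where c = upper limit - lower limit.
densities = [f / (upper - lower) for lower, upper, f in classes]
total_families = sum(f for _, _, f in classes)

for (lower, upper, f), fd in zip(classes, densities):
    print(f"{lower}-{upper}\tf = {f}\tc = {upper - lower}\tfd = {fd:g}")
print("Total families:", total_families)
```

Plotting the density as the height makes the area of each rectangle, height times class interval, equal to the class frequency, so classes of different widths remain comparable.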
Bar Diagram:
A bar diagram is used mainly to represent discrete frequency distributions. The process of drawing a bar diagram
is similar to that of a histogram. For discrete variables a gap exists between the upper limit of a class and
the lower limit of the following class, so the adjacent rectangles are not attached to each other. The resulting graph
is known as a bar diagram.
Figure 4. Bar diagram for data in Table 4 (X-axis: class interval; Y-axis: frequency).
Besides discrete frequency distributions, bar diagrams can also be used to represent information based on
different times or places. For example, crop production in different years or rainfall in different countries
or in different regions of a country may be presented by bar diagrams.
Example 5. Data on annual rainfall at divisional cities of Bangladesh are as follows:
[Figure: bar diagram with divisions on the X-axis and rainfall (inches) on the Y-axis.]
Figure 6. Multiple bar diagram of pulses production (X-axis: year, 1991-92 to 1994-95).