
Fundamentals of Statistics (Lecture Note1)

The document provides an overview of biostatistics, including definitions, types of statistics, and the importance of data collection and analysis in decision-making. It explains the concepts of population, sample, parameters, and statistics, as well as the classification of variables into quantitative and qualitative types. Additionally, it outlines the process of constructing frequency distributions, illustrated with examples and tables.

Uploaded by

wasi78045

AAS2207:Biostatistics

Lecturer: Prof. Shankar Majumder


Lecture Note-1
Statistics
Statistics is the science of collecting, organizing, summarizing, presenting, analyzing and interpreting
data to assist in making better decisions. Data are information used as a basis for reasoning,
discussion, calculation, or analysis.

Note: If we are interested in the heights of the students of B. Sc. in Bioinformatics Engineering Level-2
Semester-2 of BAU, then a single value of the height of a student is called a datum, and the set of values
of the heights of two or more students is known as data.

Sir Ronald Aylmer Fisher (R. A. Fisher) (British) is called the Father of Statistics.

Types of Statistics
1. Descriptive Statistics are the methods of collecting, organizing, summarizing and presenting
data in an informative way. For example, a graph that shows the number of defective fans
produced during the night shift over a period of one month is an issue of
descriptive statistics.

2. Inferential Statistics provide the basis for predictions, forecasts and estimates used in making
decisions about the population. For example, estimating from a sample the percentage of employees who
arrive late to work is an issue of inferential statistics.

Characteristics or Features of Statistics:


➢ Statistics deals with aggregates of individuals rather than with individuals. A single number is not
statistics.
➢ Statistics deals with variation.
➢ Statistics deals only with numerically specified facts.
➢ Statistical inferences are drawn under uncertainty and are stated in terms of probability.

Biostatistics
Biostatistics is the application of statistics to a wide range of topics in biology. It encompasses the design
of biological experiments; the collection, summarization, and analysis of data from those experiments;
and the interpretation of, and inference from, the results.

Population and Sample

Population means an aggregate of all individual persons, objects or items possessing certain
characteristics of interest in a particular investigation or enquiry. An aggregate of all farmers who have
used power tillers for their cultivation of land can be considered as a population. The size of the
population is usually denoted by N.
There are two types of population viz. study population and target population. The population from which
the sample is drawn is known as study population and the population for which sample-based results are
generalized is called target population.

A sample is a subset or representative part of a population whose properties are studied to gain information
about the whole population. We are generally interested in the properties of the population.
Sometimes it is impracticable or even impossible to study the whole population because of limited resources such as
time, funds, manpower, trained personnel, management capability, etc. That is why inferences about the
population are usually drawn on the basis of a sample. As an illustration, if we select 25 students from
a population of 650, we have a sample of size 25.

A random sample is one in which every individual person or object or item of a population has an equal
chance of being selected in the sample.
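
As a rough illustration, the selection of 25 students out of 650 can be sketched in Python. This is only a sketch under stated assumptions: the ID numbers 1 to 650 and the seed value are hypothetical and are not part of the lecture. The standard-library call `random.sample` selects without replacement, so every individual has the same chance of entering the sample.

```python
import random

# Hypothetical sampling frame: ID numbers 1..650 stand in for the 650 students.
population = list(range(1, 651))

# A fixed seed is used only so that the illustration is reproducible.
rng = random.Random(2207)

# Draw 25 distinct individuals; every ID is equally likely to be chosen.
sample = rng.sample(population, 25)

print(len(population), len(sample))
print(sorted(sample))
```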

Parameter and Statistic


A parameter is a characteristic of a population. A statistic is a characteristic of a sample.

Population and Sample Characteristics and their Notations:

Characteristic             Parameter   Statistic
Mean                       µ           x̄
Standard deviation         σ           s
Variance                   σ²          s²
Correlation coefficient    ρ           r
Regression coefficient     β           b
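
The distinction between a parameter and a statistic can be sketched numerically. The heights below are made-up values used purely for illustration, and the sample is picked by hand rather than at random. The sketch uses Python's standard `statistics` module, in which `pstdev` divides by N (population) and `stdev` divides by n - 1 (sample).

```python
import statistics

# Hypothetical population of N = 8 heights in cm (made-up values).
population = [150, 152, 155, 158, 160, 163, 165, 169]
# A sample of n = 4 taken from that population (chosen by hand here).
sample = [152, 158, 163, 169]

mu = statistics.mean(population)       # parameter: population mean
sigma = statistics.pstdev(population)  # parameter: population standard deviation (divisor N)
x_bar = statistics.mean(sample)        # statistic: sample mean
s = statistics.stdev(sample)           # statistic: sample standard deviation (divisor n - 1)

print("mu =", mu, " sigma =", round(sigma, 2))
print("x_bar =", x_bar, " s =", round(s, 2))
```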

Book References:

1. Biostatistical Analysis.
   ---- Jerrold H. Zar
2. A Textbook of Agricultural Statistics, 2nd Edition
   ---- R. Rangaswamy
3. Introduction to Statistics
   ---- M Nurul Islam
4. Methods of Statistics, 6th Edition
   ---- Abdur Rashid Ahmed & Gong.
5. An Introduction to The Theory of Statistics.
   ---- RN Shill & SC Debnath

Variable
Measurable characteristics of the elements of a population that may vary from element to element either
in magnitude or in quality are known as variables. Variables are of two types - quantitative variable and
qualitative variable.

The classification of variables is shown below:

Variable
    Quantitative variable
        Continuous variable
        Discrete variable
    Qualitative variable

Quantitative Variable
Variables that can be expressed numerically are known as quantitative variables. Height or weight of
students, length or breadth of fishes, weight of tomato, number of grapes per bunch, number of grains per
panicle, etc. are some examples of quantitative variables.

Qualitative Variable
Variables that cannot be expressed numerically but can be classified or categorised into some mutually
exclusive categories are called qualitative variables. For example, merit of students, educational
attainment, type of farmers (big, medium, small), type of fishes (sea fish, river fish) etc. cannot be
numerically measured but can be grouped into classes or categories. Qualitative variables are also known
as attributes.

Quantitative variables are of two types - continuous and discrete.


A variable that can assume any value, integral or fractional, within specified limits is known as a
continuous variable. For example, height of students, weight of tomato, length of fish, height of trees,
weight of animals, etc. are continuous variables, which can take both integral and fractional values.
On the other hand, variables that can take only integral values or some isolated values, rather than every
value within a specified range, are known as discrete variables. Examples are number of grains per panicle,
number of students per class, number of fishes caught per unit time, etc.

Frequency Distribution
An arrangement of observational data according to the frequencies of the observations is called a frequency
distribution. A frequency distribution should arrange the observations so that they are easily
understood. Frequency distributions are constructed mainly to present the data in
condensed form and for easy understanding.

Construction of a Frequency Distribution


Steps in constructing a frequency distribution are discussed below:

1. Finding the Range: In constructing frequency distribution the highest and the lowest value in the data
set are first identified and their difference is obtained. This difference between the highest value and the
lowest value is called the range usually denoted by R.

Range = Highest value - Lowest value.

2. Decision about the Number of Classes: After finding the range, it is necessary to decide the number
of classes into which the entire data set should be divided. The choice of the number of classes should be
realistic: the number should be neither so small that detail is lost nor so large that the
aim of constructing a frequency distribution (condensation) is not achieved. It is generally recommended to
keep the number of classes between 7 and 15. There is no hard and fast rule for choosing the number of
classes. However, H. A. Sturges' formula gives a guideline for the desired number of classes. The formula is
k = 1 + 3.322 log10 N
where N is the total number of observations in the data set and k is the desired number of classes.
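
The range and the formula for k given above can be evaluated with a few lines of Python. This is a minimal sketch; the function names `data_range` and `sturges_classes` are illustrative choices, and rounding k up to the next whole number is one possible reading of how the guideline is applied.

```python
import math

def data_range(values):
    """Range R = highest value - lowest value."""
    return max(values) - min(values)

def sturges_classes(n):
    """Suggested number of classes k = 1 + 3.322 * log10(n) for n observations."""
    return 1 + 3.322 * math.log10(n)

# k for a few data-set sizes, with the next whole number alongside.
for n in (50, 60, 80):
    k = sturges_classes(n)
    print(n, round(k, 2), math.ceil(k))
```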

3. Choosing the Class Interval:


The next step in constructing a frequency distribution is the calculation of the class interval. Each class
has two limits, the lower limit (the lower value) and the upper limit (the higher value). The
difference between the upper limit and the lower limit of a class is known as the class interval, usually
denoted by c or h. If the range is divided by the number of classes, we get the class interval:

Class interval (c) = Range / No. of classes = R / k

The value of c is taken as the next integral value of the ratio R/k. In choosing the class limits,
there is no rigid rule that the exact end values of the data set must be used; convenient values near the
highest and lowest observations of the data set may be used instead. However, the class limits should be such that
the classes are distinct and separate from each other. Depending on the nature of the variable, two different
methods are used in choosing the class limits. If the variable is discrete, closed intervals like a ≤ x ≤ b are
used, so both the lower and upper limits are included (e.g., 0-4, 5-9, 10-14, etc.). On the other hand, if the
variable is continuous, the open interval system (a < x ≤ b or a ≤ x < b) is used; one of the class limits is
included and the other is excluded (usually the lower limit is included). In this case the classes will be
0-5, 5-10, 10-15, etc.
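
The class-interval rule and the "lower limit included, upper limit excluded" convention can be sketched as follows. The function names are hypothetical, and the values R = 68 and k = 8 are used only as sample inputs.

```python
import math

def class_width(data_range, k):
    """Class interval c: the ratio R/k rounded up to a whole number."""
    return math.ceil(data_range / k)

def class_limits(start, width, k):
    """(lower, upper) limits of k consecutive classes beginning at `start`."""
    return [(start + i * width, start + (i + 1) * width) for i in range(k)]

def in_class(x, lower, upper):
    """Continuous-variable convention: lower limit included, upper limit excluded."""
    return lower <= x < upper

print(class_width(68, 8))        # sample inputs R = 68, k = 8
print(class_limits(0, 5, 3))     # the classes 0-5, 5-10, 10-15
print(in_class(5, 0, 5), in_class(5, 5, 10))
```

Under this convention an observation equal to 5 is counted in the class 5-10 and not in 0-5, so no observation is counted twice.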

4. Counting of Frequencies: For convenience of counting the number of observations falling within each
class tally marks are used; frequency of each class is determined by counting the tally marks.

Sometimes it may be necessary to know the number of observations greater or smaller than a particular value or
class of values. For this, cumulative frequencies for each observation or class are obtained.

Example 1.
Suppose the marks obtained by 50 students in an examination are as follows :
32 27 19 40 31 17 15 18 21 27 38 15 33 34 29
26 16 25 33 36 24 22 26 19 36 18 25 20 25 25
31 24 16 28 30 24 29 42 29 28 26 27 47 43 22
25 28 22 24 23
Here the variable is the marks obtained by the students. The data as shown above are called raw or
ungrouped data.
If the performance of the students needs to be described, this may be done in a number of ways.
We may list the marks of the students in either ascending or descending order; data
arranged in this way are said to form an array. Counting the number of times each value of the variable occurs,
we get a table of the following type:
Table 1. Frequency Distribution of Marks

Marks   Frequency           Cumulative   Marks   Frequency           Cumulative
        (No. of students)   frequency            (No. of students)   frequency
15      2                   2            28      3                   33
16      2                   4            29      3                   36
17      1                   5            30      1                   37
18      2                   7            31      2                   39
19      2                   9            32      1                   40
20      1                   10           33      2                   42
21      1                   11           34      1                   43
22      3                   14           36      2                   45
23      2                   16           38      1                   46
24      4                   20           40      1                   47
25      5                   25           42      1                   48
26      3                   28           43      1                   49
27      2                   30           47      1                   50

Such a table is known as frequency table or frequency distribution. The above arrangement is an
improvement over the raw data, but to get a still better idea of the performance of the students we
reclassify the data into grouped frequency distribution as shown below :
Table 2. Grouped Frequency Distribution of Marks

Class      Tally mark            Frequency           Cumulative frequency
interval                         (No. of students)   Ascending   Descending
15-20      |||| ||||             9                   9           50
20-25      |||| |||| |           11                  20          41
25-30      |||| |||| |||| |      16                  36          30
30-35      |||| ||               7                   43          14
35-40      |||                   3                   46          7
40-45      |||                   3                   49          4
45-50      |                     1                   50          1

This type of classification of raw data is called grouped frequency distribution or simply frequency
distribution.
In the above example the highest value is 47, the lowest value is 15 and the range is R = 47 - 15 = 32.
According to Sturges' formula,
k = 1 + 3.322 log10 50 = 6.64
That is, 6 to 7 classes are appropriate in this case. Taking k = 7,
c = R/k = 32/7 = 4.57; accordingly 5 is taken as the class interval.

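The ascending and descending cumulative frequencies of Table 2 are running totals of the class frequencies, taken from the first class and from the last class respectively. A minimal Python sketch, using the class frequencies listed in Table 2 and the standard-library function `itertools.accumulate`:

```python
from itertools import accumulate

# Class frequencies as listed in Table 2 (classes 15-20, 20-25, ..., 45-50).
frequencies = [9, 11, 16, 7, 3, 3, 1]

# Ascending cumulative frequency: running total starting from the first class.
ascending = list(accumulate(frequencies))

# Descending cumulative frequency: running total starting from the last class.
descending = list(accumulate(reversed(frequencies)))[::-1]

print(ascending)
print(descending)
```

The last ascending value and the first descending value both equal the total number of observations.
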
Example 2.
The weight (in gm.) of tomato harvested from the kitchen garden is given below:
75 80 52 87 95 105 92 82 120 65
55 100 115 92 82 97 85 72 67 98
115 62 85 98 110 105 77 63 80 90
54 89 108 103 75 53 105 117 95 64
77 85 94 72 68 100 78 89 94 102
82 95 98 100 77 85 92 97 72 85
72 83 66 58 96 75 88 90 80 95
63 78 84 92 88 77 65 85 92 87

In constructing a frequency distribution the highest and the lowest observations are to be identified first.
In the present data set the highest value is 120 and the lowest value is 52. Therefore range is

R = 120 - 52 = 68; N = 80

According to Sturges' formula

k = 1 + 3.322 log10 80 = 1 + 3.322 × 1.903089987 = 7.322.

The next integral value of k is 8; the data set may be grouped in about 8 classes.

Now, c = R/k = 68/8 = 8.5 ≈ 9

It will be convenient to take 10 as the class interval. As the variable here is continuous, open interval
method is to be followed in grouping the data set. Though the lowest observation is 52, it is convenient to
start from 50. The classes or groups will, therefore, be 50-60, 60-70, 70-80, 80-90, 90-100, 100-110 and
110-120.

The frequency distribution will be -


Table 3. Frequency Distribution of Weight of Tomato

Class       Tally mark               Frequency   Cumulative frequency
interval                                         Ascending   Descending
50-60       ||||                     5           5           80
60-70       |||| ||||                9           14          75
70-80       |||| |||| |||            13          27          66
80-90       |||| |||| |||| ||||      20          47          53
90-100      |||| |||| |||| ||||      19          66          33
100-110     |||| ||||                9           75          14
110-above   ||||                     5           80          5
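
The counting step for Example 2 can be sketched in Python as a check on Table 3. The sketch assumes the convention stated earlier (lower limit included, upper limit excluded) and treats the last class as open-ended ("110-above"), so that the maximum value 120 falls into it. Variable names are illustrative.

```python
# Weights (gm) of the 80 tomatoes listed in Example 2.
weights = [
    75, 80, 52, 87, 95, 105, 92, 82, 120, 65,
    55, 100, 115, 92, 82, 97, 85, 72, 67, 98,
    115, 62, 85, 98, 110, 105, 77, 63, 80, 90,
    54, 89, 108, 103, 75, 53, 105, 117, 95, 64,
    77, 85, 94, 72, 68, 100, 78, 89, 94, 102,
    82, 95, 98, 100, 77, 85, 92, 97, 72, 85,
    72, 83, 66, 58, 96, 75, 88, 90, 80, 95,
    63, 78, 84, 92, 88, 77, 65, 85, 92, 87,
]

start, width = 50, 10
lower_limits = list(range(start, 120, width))   # 50, 60, ..., 110
frequencies = [0] * len(lower_limits)

for w in weights:
    # Integer division gives the class index; the last class is open-ended,
    # so anything at or above 110 is capped into it.
    index = min((w - start) // width, len(lower_limits) - 1)
    frequencies[index] += 1

for low, f in zip(lower_limits, frequencies):
    print(low, f)
```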

Example 3.
The following data show the number of grapes per bunch:
25 75 15 20 18 62 45 33 40 45
77 30 35 25 65 42 55 37 44 50
40 35 38 47 52 45 33 28 22 22
18 29 48 55 60 58 43 40 47 39
35 45 43 52 57 50 48 55 59 42
28 36 54 48 58 68 78 61 53 42

Here, N = 60, the highest value is 78 and the lowest value is 15. The variable is discrete in nature.
R = 78 - 15 = 63 and k = 1 + 3.322 log10 60 = 6.91 ≈ 7.
The data may be classified in more or less 7 groups.
Again, c = R/k = 63/7 = 9
For convenience, however, 10 may be taken as the class interval.
Table 4. Frequency Distribution of No. of Grapes per Bunch

Class      Tally mark            Frequency   Cumulative frequency
interval                                     Ascending   Descending
15-24      |||| |                6           6           60
25-34      |||| |||              8           14          54
35-44      |||| |||| ||||        15          29          46
45-54      |||| |||| |||| |      16          45          31
55-64      |||| ||||             10          55          15
65-74      ||                    2           57          5
75-84      |||                   3           60          3

Graphical Representation of Frequency Distribution:

Frequency distributions may be presented by graphs and charts in order to make them clearer and more
easily understandable, and to compare distributions quickly. Graphs are also easily understood by people
who cannot read and by people from different regions who speak different languages. Graphical representation
brings to light the salient features of the data at a glance. It is also useful in locating some partition values.

The following graphs are generally used in representing frequency distributions:


1. Dot Frequency Diagram
2. Histogram and Bar Diagram
3. Frequency Polygon and Frequency Curve
4. Cumulative Frequency Curve or Ogive
5. Pie Chart

Dot Frequency Diagram:

The X-axis is used for the variable values and the Y-axis is for the frequency; if we indicate the
frequencies of each variable value by dots, the resulting diagram is known as dot frequency diagram.
Figure 1. Dot frequency diagram (X-axis: variable values; Y-axis: frequency).

Histogram:
Class intervals are plotted along the X-axis and frequencies are plotted along the Y-axis. For each class or
group, a rectangle is drawn taking the class interval as the base and the class frequency as the height. For
a continuous variable, the rectangles so drawn touch the adjacent rectangles on both sides, and the
resulting graph is known as a histogram.

Figure 2. Histogram for data in Table 3 (X-axis: class interval, 50 to 120; Y-axis: frequency).

For drawing histograms of frequency distributions having unequal class intervals, frequency density,
instead of frequency, is plotted along the Y-axis.
Frequency density is obtained as fd = f/c, c being the class interval.
Example 4.
The histogram of the distribution of members per family in a certain locality is constructed as follows:

Family size        No. of families   Class interval   Frequency density
(class interval)   (f)               (c)              (fd = f/c)
0-2                8                 2                4
2-4                14                2                7
4-8                16                4                4
8-12               20                4                5
12-20              8                 8                1
Total              66

Figure 3: Histogram for data with unequal class interval (X-axis: class interval, with limits 0, 2, 4, 8, 12, 20; Y-axis: frequency density).
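
The frequency densities of Example 4 can be reproduced with a short Python sketch. The tuple layout (lower limit, upper limit, frequency) is an illustrative choice.

```python
# (lower limit, upper limit, number of families) for each class in Example 4.
classes = [(0, 2, 8), (2, 4, 14), (4, 8, 16), (8, 12, 20), (12, 20, 8)]

densities = []
for lower, upper, f in classes:
    c = upper - lower      # class interval (width of the class)
    fd = f / c             # frequency density
    densities.append(fd)
    print(f"{lower}-{upper}: f = {f}, c = {c}, fd = {fd}")

# Height (fd) times base (c) gives back the class frequency, so the
# rectangle areas add up to the total number of families.
total_area = sum(fd * (upper - lower) for fd, (lower, upper, _) in zip(densities, classes))
print(total_area)
```

Because each rectangle has height fd and base c, its area equals the class frequency, which keeps classes of different widths comparable.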

Bar Diagram:
A bar diagram is used mainly to represent discrete frequency distributions. It is drawn in much the same way
as a histogram. For discrete variables, however, a gap exists between the upper limit of a class and
the lower limit of the following class, so the adjacent rectangles are not attached to each other. The resulting
graph is known as a bar diagram.

Figure 4. Bar diagram for data in Table 4 (X-axis: class interval; Y-axis: frequency).
Besides discrete frequency distributions, bar diagrams can also be used to represent information relating to
different times or places. For example, crop production in different years, or rainfall in different countries
or in different regions of a country, may be presented by bar diagrams.
Example 5. Data on annual rainfall at divisional cities of Bangladesh are as follows:

Division      Annual rainfall (mm)
Dhaka         1540
Chittagong    2260
Khulna        1159
Rajshahi      1142
Sylhet        3568
Barisal       1200


Figure 5. Bar diagram of rainfall at different divisions (X-axis: divisions; Y-axis: rainfall in mm, 0 to 4000).
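
A rough text-mode version of such a bar diagram can be produced with the rainfall figures of Example 5. This is only a sketch: the scale of one `#` per 200 mm and the helper name `bar` are assumptions made for illustration.

```python
# Annual rainfall (mm) by division, from Example 5.
rainfall_mm = {
    "Dhaka": 1540, "Chittagong": 2260, "Khulna": 1159,
    "Rajshahi": 1142, "Sylhet": 3568, "Barisal": 1200,
}

MM_PER_SYMBOL = 200  # assumed scale: one '#' stands for about 200 mm

def bar(value, scale=MM_PER_SYMBOL):
    """Return a bar whose length is proportional to `value`."""
    return "#" * round(value / scale)

for division, mm in rainfall_mm.items():
    # One separate bar per division, as in a bar diagram.
    print(f"{division:<11}{bar(mm)} {mm}")
```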


Multiple Bar Diagram:
Data on several variables for different places or time-points may be represented by multiple bar
diagrams. The simple bars for the variables corresponding to a place or time-point are drawn side by
side (without gaps). The heights of these bars indicate the values of the respective variables. Usually the
bars at the same place or time-point are given different colours or markings for identification.
Example 6.
Data on production of different pulses (in '000 tons) in Bangladesh during the years from 1991-92 to
1994-95 and the corresponding multiple bar diagram are shown below:

Yield of Pulses ('000 tons)

Pulses            Year
                  1991-92   1992-93   1993-94   1994-95
Kheshari          185       172       188       189
Moshur            153       163       168       168
Mug & Mashkalai   82        82        82        85

Figure 6. Multiple bar diagram of pulses production (X-axis: year, 1991-92 to 1994-95; Y-axis: production in '000 tons; bars: Kheshari, Moshur, Mug & Mashkalai).

Comparison between Histogram and Bar diagram

1. Histogram: Histograms are used to represent continuous frequency distributions.
   Bar diagram: Bar diagrams are used to represent discrete frequency distributions.

2. Histogram: Histograms are used to represent frequency distributions only.
   Bar diagram: Besides frequency distributions, data on different places or time points can be represented
   by bar diagrams.

3. Histogram: In drawing the rectangles of a histogram, both the breadth and the length of the rectangles
   are considered.
   Bar diagram: In drawing bar diagrams, consideration of the breadth of the bars is not necessary. For decent
   pictorial presentation, bars of suitable breadth may be drawn.

4. Histogram: Frequency distributions having unequal class intervals may be represented by histograms; in
   such cases the breadths of the rectangles are unequal.
   Bar diagram: Bar diagrams are not usually drawn to represent frequency distributions having unequal
   class intervals.
