Unit - 1 Biostatistics Notes
INTRODUCTION
Applications of biostatistics
DEFINITIONS
--------------------------------------------------------------------------------------------------
VARIABLES
● Variables can be
❖ Independent
E.g. In an experimental study on a relaxation intervention for reducing hypertension (HTN), blood
pressure is the dependent variable, and relaxation training, age and gender are
independent variables.
----------------------------------------------------------------------------------------------------
SAMPLING
SCALES OF MEASUREMENT
● Four measurement scales are used: nominal, ordinal, interval and ratio.
● Each level has its own rules and restrictions.
● E.g. variables like time, length and weight are measured on ratio scales, but they can also be
measured using a nominal or ordinal scale.
The mathematical properties of interval and ratio scales are very similar, so the
statistical procedures are common to both scales.
Process of sampling
Population definition
Successful statistical practice is based on focused problem definition. In sampling,
this includes defining the population from which our sample is drawn. A
population can be defined as including all people or items with the characteristic
one wishes to understand. Because there is very rarely enough time or money to
gather information from everyone or everything in a population, the goal becomes
finding a representative sample (or subset) of that population.
Sometimes that which defines a population is obvious. For example, a
manufacturer needs to decide whether a batch of material from production is of
high enough quality to be released to the customer, or should be sentenced for
scrap or rework due to poor quality. In this case, the batch is the population.
Note also that the population from which the sample is drawn may not be the
same as the population about which we actually want information. Often there is
large but not complete overlap between these two groups due to frame issues etc.
(see below). Sometimes they may be entirely separate - for instance, we might
study rats in order to get a better understanding of human health, or we might study
records from people born in 2008 in order to make predictions about people born in
2009.
Sampling frame
In the most straightforward case, such as the sentencing of a batch of material from
production (acceptance sampling by lots), it is possible to identify and measure
every single item in the population and to include any one of them in our sample.
However, in the more general case this is not possible. There is no way to identify
all rats in the set of all rats
population. Information about the relationship between sample and population is
limited, making it difficult to extrapolate from the sample to the population.
Within any of the types of frame identified above, a variety of sampling methods
can be employed, individually or in combination. Factors commonly influencing
the choice between these designs include:
2. Systematic sampling
3. Stratified sampling
5. Cluster sampling
6. Quota sampling
8. Line-intercept sampling
9. Panel sampling
1. Simple random sampling
In a simple random sample ('SRS') of a given size, all such subsets of the
frame are given an equal probability. Each element of the frame thus has an equal
probability of selection: the frame is not subdivided or partitioned. Furthermore,
any given pair of elements has the same chance of selection as any other such pair
(and similarly for triples, and so on). This minimises bias and simplifies analysis of
results. In particular, the variance between individual results within the sample is a
good indicator of variance in the overall population, which makes it relatively easy
to estimate the accuracy of results.
However, SRS can be vulnerable to sampling error because the randomness
of the selection may result in a sample that doesn't reflect the makeup of the
population. For instance, a simple random sample of ten people from a given
country will on average produce five men and five women, but any given trial is
likely to overrepresent one sex and underrepresent the other. Systematic and
stratified techniques, discussed below, attempt to overcome this problem by using
information about the population to choose a more representative sample.
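The idea can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the frame of 50 men and 50 women, the function name simple_random_sample and the seed are all assumptions made for the example. Running it with different seeds shows how a single draw of ten can over-represent one sex.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct positions from the frame; every size-n subset is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical frame: 50 men ("M") and 50 women ("W").
frame = ["M"] * 50 + ["W"] * 50
sample = simple_random_sample(frame, 10, seed=1)
print(sample.count("M"), "men,", sample.count("W"), "women")
```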
2. Systematic sampling
Systematic sampling relies on arranging the target population according to
some ordering scheme and then selecting elements at regular intervals through that
ordered list. Systematic sampling involves a random start and then proceeds with
the selection of every kth element from then onwards. In this case, k=(population
size/sample size). It is important that the starting point is not automatically the first
in the list, but is instead randomly chosen from within the first to the kth element in
the list. A simple example would be to select every 10th name from the telephone
directory (an 'every 10th' sample, also referred to as 'sampling with a skip of 10').
As long as the starting point is randomized, systematic sampling is a type of
probability sampling. It is easy to implement and the stratification induced can
make it efficient, if the variable by which the list is ordered is correlated with the
variable of interest. 'Every 10th' sampling is especially useful for efficient sampling
from databases.
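The random start and the skip of k can be sketched as follows. This is an illustrative sketch only: the list of 100 names, the function name systematic_sample and the seed are assumptions, and the code assumes the sample size is no larger than the frame.

```python
import random

def systematic_sample(ordered_frame, n, seed=None):
    """Random start within the first k elements, then every kth element."""
    k = len(ordered_frame) // n        # skip interval: population size / sample size
    rng = random.Random(seed)
    start = rng.randrange(k)           # random start among positions 0 .. k-1
    return ordered_frame[start::k][:n]

# Hypothetical ordered list, e.g. names in a directory.
names = [f"name_{i:03d}" for i in range(100)]
picked = systematic_sample(names, 10, seed=7)
print(picked)
```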
3. Stratified sampling
Where the population embraces a number of distinct categories, the frame
can be organized by these categories into separate "strata." Each stratum is then
sampled as an independent sub-population, out of which individual elements can
be randomly selected.[1] There are several potential benefits to stratified sampling.
First, dividing the population into distinct, independent strata can enable
researchers to draw inferences about specific subgroups that may be lost in a more
generalized random sample.
Second, utilizing a stratified sampling method can lead to more efficient
statistical estimates (provided that strata are selected based upon relevance to the
criterion in question, instead of availability of the samples). Even if a stratified
sampling approach does not lead to increased statistical efficiency, such a tactic
will not result in less efficiency than would simple random sampling, provided that
the sample taken from each stratum is proportional to that stratum's size in the population.
Third, it is sometimes the case that data are more readily available for
individual, pre-existing strata within a population than for the overall population;
in such cases, using a stratified sampling approach may be more convenient than
aggregating data across groups (though this may potentially be at odds with the
previously noted importance of utilizing criterion-relevant strata).
Finally, since each stratum is treated as an independent population, different
sampling approaches can be applied to different strata, potentially enabling
researchers to use the approach best suited (or most cost-effective) for each
identified subgroup within the population.
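A minimal sketch of stratified sampling with proportional allocation is given below. The urban/rural frame, the function name stratified_sample and the seed are assumptions for illustration. Each stratum is sampled by simple random sampling here, although a different method could be used per stratum; note also that rounding the stratum sample sizes can make the total differ slightly from n in less tidy cases.

```python
import random
from collections import defaultdict

def stratified_sample(frame, stratum_of, n, seed=None):
    """Proportional allocation: sample each stratum independently by SRS."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for units in strata.values():
        n_h = round(n * len(units) / len(frame))   # stratum share of the sample
        sample.extend(rng.sample(units, n_h))
    return sample

# Hypothetical frame: 60 urban and 40 rural units.
frame = [("urban", i) for i in range(60)] + [("rural", i) for i in range(40)]
sample = stratified_sample(frame, lambda unit: unit[0], 10, seed=3)
print(sample)
```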
4. Probability proportional to size sampling
In some cases the sample designer has access to an "auxiliary variable" or
"size measure", believed to be correlated to the variable of interest, for each
element in the population. These data can be used to improve accuracy in sample
design. One option is to use the auxiliary variable as a basis for stratification, as
discussed above.
Another option is probability-proportional-to-size ('PPS') sampling, in which
the selection probability for each element is set to be proportional to its size
measure, up to a maximum of 1. In a simple PPS design, these selection
probabilities can then be used as the basis for Poisson sampling. However, this has
the drawback of variable sample size, and different portions of the population may
still be over- or under-represented due to chance variation in selections. To address
this problem, PPS may be combined with a systematic approach.
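The simple PPS design with Poisson sampling can be sketched as below. The size measures, the function name pps_poisson_sample and the target of two expected selections are assumptions for illustration. Each element is included independently with probability proportional to its size, capped at 1, so the realised sample size varies from run to run, which is the drawback noted above.

```python
import random

def pps_poisson_sample(sizes, expected_n, seed=None):
    """Poisson sampling with inclusion probability proportional to size, capped at 1."""
    rng = random.Random(seed)
    total = sum(sizes.values())
    probs = {unit: min(1.0, expected_n * size / total) for unit, size in sizes.items()}
    # Each unit is selected independently of the others.
    chosen = [unit for unit, p in probs.items() if rng.random() < p]
    return chosen, probs

# Hypothetical size measures, e.g. number of beds per hospital.
sizes = {"A": 500, "B": 300, "C": 150, "D": 50}
chosen, probs = pps_poisson_sample(sizes, expected_n=2, seed=5)
print(probs)
print(chosen)
```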
5. Cluster sampling
Sometimes it is more cost-effective to select respondents in groups
('clusters'). Sampling is often clustered by geography, or by time periods. (Nearly
all samples are in some sense 'clustered' in time - although this is rarely taken into
account in the analysis.) For instance, if surveying households within a city, we
might choose to select 100 city blocks and then interview every household within
the selected blocks.
Clustering can reduce travel and administrative costs. In the example above,
an interviewer can make a single trip to visit several households in one block,
rather than having to drive to a different block for each household.
It also means that one does not need a sampling frame listing all elements in
the target population. Instead, clusters can be chosen from a cluster-level frame,
with an element-level frame created only for the selected clusters. In the example
above, the sample only requires a block-level city map for initial selections, and
then a household-level map of the 100 selected blocks, rather than a
household-level map of the whole city.
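A one-stage cluster sample of this kind can be sketched as follows. The city of 20 blocks with 5 households each, the function name cluster_sample and the seed are assumptions for illustration. Only the block names are needed to make the selection; the household lists are used only for the chosen blocks.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick clusters at random, then keep every element inside the chosen clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # cluster-level frame only
    return {name: clusters[name] for name in chosen}

# Hypothetical city: 20 blocks with 5 households each.
city = {f"block_{b}": [f"hh_{b}_{h}" for h in range(5)] for b in range(20)}
selected = cluster_sample(city, 4, seed=11)
households = [hh for block in selected.values() for hh in block]
print(sorted(selected))
```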
6. Quota sampling
In quota sampling, the population is first segmented into mutually exclusive
sub-groups, just as in stratified sampling. Then judgment is used to select the
subjects or units from each segment based on a specified proportion. For example,
an interviewer may be told to sample 200 females and 300 males between the age
of 45 and 60.
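The quota idea can be sketched as below. This is only an illustration under stated assumptions: the stream of passers-by is invented, the quotas are scaled down from 200/300 to 2/3, and accepting people in the order they are met stands in for the interviewer's judgment. No random selection is involved, which is why quota sampling is a non-probability method.

```python
def quota_sample(respondents, quotas, min_age=45, max_age=60):
    """Accept eligible respondents in encounter order until every quota is full."""
    remaining = dict(quotas)
    accepted = []
    for person in respondents:
        eligible = min_age <= person["age"] <= max_age
        if eligible and remaining.get(person["sex"], 0) > 0:
            accepted.append(person)
            remaining[person["sex"]] -= 1
    return accepted

# Hypothetical stream of passers-by.
stream = [{"sex": "M" if i % 3 == 0 else "F", "age": 40 + i % 25} for i in range(200)]
chosen = quota_sample(stream, {"F": 2, "M": 3})
print(chosen)
```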
7. Convenience sampling or Accidental Sampling
Convenience sampling (sometimes known as grab or opportunity
sampling) is a type of nonprobability sampling which involves the sample being
drawn from that part of the population which is close to hand. That is, the sample
is selected because it is readily available and convenient. Participants may be
included simply because the researcher happens to meet them, or they may be
found through technological means such as the internet or the
phone. A researcher using such a sample cannot scientifically make
generalizations about the total population from it, because it would not be
representative enough. For example, if the interviewer were to conduct such a
survey at a shopping center early in the morning on a given day, the people that
he/she could interview would be limited to those present there at that time.
Their views would not represent those of other members of society in the area, who
might be reached if the survey were conducted at different times of day and several times per
week. This type of sampling is most useful for pilot testing.
8. Line-intercept sampling
Line-intercept sampling is a method of sampling elements in a region
whereby an element is sampled if a chosen line segment, called a "transect",
intersects the element.
9. Panel sampling
Panel sampling is the method of first selecting a group of participants
through a random sampling method and then asking that group for the same
information again several times over a period of time. Therefore, each participant
is given the same survey or interview at two or more time points; each period of
data collection is called a "wave".
Replacement of selected units
Sampling schemes may be without replacement ('WOR' - no element can be
selected more than once in the same sample) or with replacement ('WR' - an
element may appear multiple times in the one sample). For example, if we catch
fish, measure them, and immediately return them to the water before continuing
with the sample, this is a WR design, because we might end up catching and
measuring the same fish more than once. However, if we do not return the fish to
the water (e.g. if we eat the fish), this becomes a WOR design.
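The two schemes can be contrasted with Python's standard random module. The pond of five fish and the seed are assumptions for illustration. Drawing eight times with replacement from five fish must repeat at least one fish, while a draw without replacement never repeats.

```python
import random

rng = random.Random(42)
pond = [f"fish_{i}" for i in range(5)]   # hypothetical pond of 5 fish

wor = rng.sample(pond, 3)      # WOR: no fish can appear twice
wr = rng.choices(pond, k=8)    # WR: the same fish may be caught again

print("without replacement:", wor)
print("with replacement:   ", wr)
```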
Sample size
Formulas, tables, and power function charts are well known approaches to
determine sample size.
1. Selection bias: When the true selection probabilities differ from those
assumed in calculating the results.
2. Random sampling error: Random variation in the results due to the
elements in the sample being selected at random.
Non-sampling error
Non-sampling errors are caused by other problems in data collection and
processing. They include:
Sampling bias
If entire segments of the population are excluded from a sample, then there
are no adjustments that can produce estimates that are representative of the entire
population. But if some groups are underrepresented and the degree of
underrepresentation can be quantified, then sample weights can correct the bias.
For example, a hypothetical population might include 10 million men and 10
million women. Suppose that a biased sample of 100 patients included 20 men and
80 women. A researcher could correct for this imbalance by attaching a weight of
2.5 for each male and 0.625 for each female. This would adjust any estimates to
achieve the same expected value as a sample that included exactly 50 men and 50
women, unless men and women differed in their likelihood of taking part in the
survey.
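The weights in this example follow from dividing each group's share of the population by its share of the sample: 0.5 / 0.2 = 2.5 for men and 0.5 / 0.8 = 0.625 for women. A short sketch of that arithmetic, using the figures from the example above (the variable names are assumptions):

```python
population = {"men": 10_000_000, "women": 10_000_000}
sample = {"men": 20, "women": 80}

N = sum(population.values())
n = sum(sample.values())

# weight = (share of population) / (share of sample)
weights = {g: (population[g] / N) / (sample[g] / n) for g in sample}

# Weighted counts show what each group "counts for" after correction.
weighted_counts = {g: sample[g] * weights[g] for g in sample}

print(weights)
print(weighted_counts)
```

After weighting, the 20 men and the 80 women each count for 50 respondents, matching the balanced sample described in the text.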
----------------------------------------------------------------------------------------------------
Data
Some classifications divide data into two broad pairs of types: primary and
secondary, and qualitative and quantitative. In the classification used here, each
type is treated individually.
Primary data are data collected directly by the investigator rather than taken from an
existing source. They include data collected through direct interviews, surveys and
experiments. Essentially, primary data comprise the raw results of such surveys, to
which statistical operations have not yet been applied.
● Data Collected Through Personal Investigation:
The researcher or the research team may hire another person or a group
of people to conduct a survey and collect data.
Qualitative statistical data are data expressed in words rather
than in numbers. In other words, in qualitative data the
measurement of a category is expressed in words. For example, in qualitative data
a measurement of height is described as tall, short or medium.
Qualitative data tell us only the kind of something, not its extent.
For example, if we survey the climate of different cities of a country,
we may record the climate of each city as cold, warm or moderate. This
would simply summarize the kind of climate of the city; it would not tell us
the average maximum or minimum temperature, which would describe
the extent of the weather in the city.
4. Quantitative Statistical Data
Quantitative statistical data are data in which the measurements are
numerically expressed. For example, the temperature of a city would be
given as an exact measurement, such as 25 degrees C.
Quantitative data represent measurements taken with a scale. Variables
such as temperature, weight and size, which can be measured on a
precise scale, are expressed in their actual measurements.
Other numeric expressions in quantitative data include, for example, the number
of people in a town or the number of deaths occurring due to a disease. (Social
security numbers are also written in digits, but they identify citizens rather than
measure anything.)
Quantitative data are more precise than qualitative data, as they tell us
the extent of something.
● The ratio scale, which is used to measure variables such as age, money, height
and weight. Observations that can be counted are also measured on this scale.
● The interval scale, which is used to measure variables such as temperature in
degrees Celsius.
3. Mailed Questionnaire method:
4. Schedule Method:
5. From Local Agents:
1. Direct Personal observation:
Here the investigator directly contacts the informants, solicits their
cooperation and enumerates the data. The information is collected through direct
personal interviews. The method is not difficult for either the enumerator or the
informants, because both are present at the spot of data collection. It can provide
highly accurate information, as the investigator collects it personally. But as the
investigator alone is involved in the process, his or her personal bias may influence the
accuracy of the data. So it is necessary that the investigator be honest,
unbiased and experienced. In such cases the data collected may be fairly accurate.
However, the method is quite costly and time-consuming, so it should be
used when the scope of enquiry is small.
Secondary data are second-hand information. They are not collected from
the source, as primary data are. In other words, secondary data are those which have
already been collected. So they may be relatively less accurate than primary
data. Secondary data are generally used when the time available for the enquiry is short and the
accuracy of the enquiry can be compromised to some extent.
Because secondary data might have been collected
for some other specific purpose, they must be used with caution. It is generally very
difficult to verify such information and to find inconsistencies, errors, omissions
etc. Scrutiny of secondary data is therefore essential, because the data might be
inaccurate, unsuitable or inadequate. Thus it is very risky to use statistics collected
by other people unless they have been thoroughly edited and found reliable,
adequate and suitable for the purpose.
Pie charts, Bar charts and Pictograms show relative frequencies and
Histograms show relative frequencies of continuous data
1. Tables or Tabulation:
❖ Construction of a table:
most cases footnotes are used to mention the source of data especially in
case of secondary data.
Types of Tables:
(i) Simple table or one-way table. In this type of table only one characteristic is
shown. This is the simplest of tables. The following is the illustration of such a
table:
(ii) Two-way table. Such a table shows two characteristics and is formed when
either the stub or the caption is divided into two coordinate parts. The example
given on page 56 illustrates the nature of such a table:
2. GRAPHS
Besides formal tables, statistical data can also be presented in the form of
various types of graphs. Graphs are a useful way of conveying information very
quickly and briefly. With the same ease and efficiency, they help in comparing data
over time and space. They are visual aids and have a powerful impact on people.
It is often said, "a picture is worth a thousand words". They attract a
reader's attention to what they are supposed to convey about the data. Further, they
may help us to estimate some values at a glance, and serve as a pictorial check on
the accuracy of our solutions.
However, graphical presentation of data, although useful in the different ways
mentioned above, is only one method of describing data. It is not, and cannot be, a
substitute for other forms of presentation or for further statistical analysis. In the
following, we discuss some of the graphical methods of presentation.
Line graphs can be used to show how something changes over time. Line
graphs are good for plotting data that has peaks (ups) and valleys (downs), or that
was collected in a short time period. The following pages describe the different
parts of a line graph.
The Title
The title offers a short explanation of what is in your graph. This helps the
reader identify what they are about to look at. It can be creative or simple as long
as it tells what is in the graph. The title of this graph tells the reader that the graph
contains information about the changes in money spent on students of elementary
and secondary schools from 1961 to 2002.
The Legend
The legend tells what each line represents. Just like on a map, the legend
helps the reader understand what they are looking at. This legend tells us that the
green line represents the actual dollar amount spent on each child and the purple
line represents the amount spent when adjusted for inflation.
The Source
The source explains where you found the information that is in your graph.
It is important to give credit to those who collected your data! In this graph, the
source tells us that we found our information from NCES.
Y-Axis
In line graphs, the y-axis runs vertically (up and down). Typically, the y-axis
has numbers for the amount of stuff being measured. The y-axis usually starts
counting at 0 and can be divided into as many equal parts as you want. In this
line graph, the y-axis is measuring the amount of money spent on individual
students for public education.
The Data
The most important part of your graph is the information, or data, it
contains. Line graphs can present more than one group of data at a time. In this
graph, two sets of data are presented.
X-Axis
In line graphs, like the one above, the x-axis runs horizontally (flat).
Typically, the x-axis has numbers representing different time periods or names of
things being compared. In this line graph, the x-axis measures different school
years.
3. Graphical Representation of Data
INTRODUCTION
i) Bar Graph
Example - 01
The data below shows the number of students present in different classes on
a particular day :
Example – 02
Interpretation of Bar graphs
After drawing a bar graph, we can draw some conclusions, which is called
interpreting bar graphs.
Let us take some examples and do the same.
Example – 1
Example – 2
Bar Diagram
It is also called a columnar diagram. Bar diagrams are drawn using
columns of equal width. The following rules are observed while constructing a bar
diagram:
(a) The width of all the bars or columns is the same.
(b) All the bars are placed at equal intervals/distances.
(c) Bars are shaded with colours or patterns to make them distinct and attractive.
Three types of bar diagrams are used to represent different data sets:
Construction Steps:
Draw X and Y-axes on a graph paper. Take an interval and mark it on Y-axis
to plot data.
Divide X-axis into equal parts to draw bars. The actual values will be plotted
according to the selected scale.
⮚ Line and Bar Graph
The line and bar graphs as drawn separately may also be combined to depict
the data related to some of the closely associated characteristics such as the
climatic data of mean monthly temperatures and rainfall.
Construction:
a) Draw X and Y-axes of a suitable length and divide X-axis into parts to show
months in a year.
b) Select a suitable scale with equal intervals on the Y-axis for one variable and
label it at its right side.
c) Similarly, select a suitable scale with equal intervals on the Y-axis for the other
variable and label it at its left side.
d) Plot data using line graph and columnar diagram.
Multiple bar diagrams are constructed to represent two or more than two
variables for the purpose of comparison. For example, a multiple bar diagram may
be constructed to show proportion of males and females in the total, rural and
urban population or the share of canal, tube well and well irrigation in the total
irrigated area in different states.
Construction
a) Mark time series data on X-axis and variable data on Y-axis as per the selected
scale.
b) Plot the data in closed columns.
a) Arrange the data in ascending or descending order.
b) A single bar will depict the set of variables by dividing the total length of the
bar as per percentage.
Solution : For drawing a histogram we go through the steps similar to those of a
bar graph.
They are given below :
Step 1 : On a paper, we draw two perpendicular lines and call them horizontal and
vertical axes.
Step 2 : Along the horizontal axis, we take classes of equal width :
45–50, 50–55, ...... As the axis starts from 45–50, we take one interval 40–45
before it and put a kink on axis before that.
Step 3 : Choose a suitable scale on the vertical axis to represent the frequency. It
can run from 0 to 14, with a step of 2, i.e., 0, 2, 4, 6, ...., 12, 14.
Step 4 : Draw the rectangles as shown in the figure, which gives the required histogram.
Note : A frequency polygon has been shown in dotted lines, as explained in the
steps shown above.
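The class frequencies that give the rectangle heights can be counted with a short sketch like the one below. The data values, the function name class_frequencies and the number of classes are assumptions for illustration, not the data of the worked example. The sketch also assumes that the upper limit of each class is excluded, so a value of 50 would fall in 50–55 rather than 45–50.

```python
def class_frequencies(values, start, width, n_classes):
    """Count how many values fall in each class [lower, lower + width)."""
    counts = [0] * n_classes
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_classes:
            counts[idx] += 1
    return counts

# Hypothetical observations, grouped into classes 45-50, 50-55, ..., 65-70.
values = [46, 48, 51, 52, 53, 57, 58, 58, 59, 61, 63, 67]
freq = class_frequencies(values, start=45, width=5, n_classes=5)

# Text version of the histogram: one '#' per observation.
for i, f in enumerate(freq):
    lower = 45 + 5 * i
    print(f"{lower}-{lower + 5}: " + "#" * f)
```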
Example – 2
The histogram and frequency polygon representing the above data are given below
In addition to histograms and frequency polygons, we are sometimes faced with
graphs of other types. When a patient is admitted to a hospital with fever, the
doctors/nurses prepare a temperature-time graph, which can be consulted at any time
for reference. Similarly, the velocity-time graph and pressure-volume graph are in
day-to-day use. We shall learn to draw these graphs and interpret them in the
sections below :
Solution : The graph of the above data is given in Fig. 28.10. The graph has been
obtained by joining the points corresponding to pairs, like (7, 102), (9, 103), .........,
(23, 99) in the rectangular system of coordinates, by line-segments.
Note : While drawing the graph it has been assumed that between successive
readings the same trend continued.
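Joining two plotted points by a line-segment amounts to linear interpolation between the readings. A minimal sketch, using the first two pairs quoted in the solution, (7, 102) and (9, 103); the function name interpolate is an assumption:

```python
def interpolate(p0, p1, t):
    """Value at time t on the straight line joining points p0 and p1."""
    (t0, y0), (t1, y1) = p0, p1
    return y0 + (y1 - y0) * (t - t0) / (t1 - t0)

# Estimated reading midway between the readings at times 7 and 9.
estimate = interpolate((7, 102), (9, 103), 8)
print(estimate)
```

Such an estimate is only as good as the assumption that the trend between readings was steady.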
During a journey from one place to another, the speed of a vehicle keeps
changing according to traffic congestion. This can be shown very well by a
velocity-time graph. Let us illustrate it with the help of an example:
During a journey from city A to city B by car the following data regarding
the time and velocity of the car was recorded :
Solution : As before the graph can be obtained by plotting the ordered pairs (6,
60), (7, 60), ... (15, 65), ..., (17, 50) in the rectangular system of coordinates and
then by joining them by line-segments.
Pressure-Volume Graph
The graph is obtained by plotting the ordered pairs (60, 90), (90, 60), .....
(75, 72) and joining them by a freehand curve.
iv) Pie Diagram
If data is given in percentage form, the angles are calculated using the given
formulae.
Calculation of Angles
Precautions
a) The circle should neither be too big to fit in the space nor too small to be
illegible.
b) Starting with a bigger angle will lead to an accumulation of error, making the
smaller angles difficult to plot.
v) Pie Charts
Pie charts are useful to compare different parts of a whole amount. They
are often used to present financial information. E.g. A company's expenditure can
be shown to be the sum of its parts including different expense categories such as
salaries, borrowing interest, taxation and general running costs (i.e. rent, electricity,
heating etc).
A pie chart is a circular chart in which the circle is divided into sectors.
Each sector visually represents an item in a data set to match the amount of the item
as a percentage or fraction of the total data set.
Example
A family's weekly expenditure on its house mortgage, food and fuel is as follows:
We can find what percentage of the total expenditure each item equals.
Percentage of weekly expenditure on:
To draw a pie chart, divide the circle into 100 percentage parts. Then allocate
the number of percentage parts required for each item.
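The percentage and the sector angle for each item follow from percentage = amount / total × 100 and angle = amount / total × 360 degrees. The sketch below applies these to hypothetical weekly amounts; the figures are assumptions chosen so that food comes to 37.5% of the total, and they are not taken from the table of the example.

```python
# Hypothetical weekly amounts for illustration only.
spending = {"Mortgage": 300, "Food": 225, "Fuel": 75}
total = sum(spending.values())

percents = {item: 100 * amount / total for item, amount in spending.items()}
angles = {item: 360 * amount / total for item, amount in spending.items()}

for item in spending:
    print(f"{item}: {percents[item]}% of the total, sector angle {angles[item]} degrees")
```

The sector angles always add up to 360 degrees, which gives a quick check on the calculation before drawing.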
Note:
● It is simple to read a pie chart. Just look at the required sector representing
an item (or category) and read off the value. For example, the weekly
expenditure of the family on food is 37.5% of the total expenditure
measured.
● A pie chart is used to compare the different parts that make up a whole
amount.