Biostatistics Lecture Notes - 1 - 4
The need for accurate information about our patients, health care
systems and society is of great concern to any conscientious nurse, to health workers and to the
body of medical practitioners and allied professions. Such information
helps in taking decisions that are at least roughly right. This informs the use of statistics.
The word statistics is derived from the Italian word stato which means state and
statista refers to a person involved with the affairs of state. Therefore, statistics
originally means the collection of facts useful to statista (Aczel 1999:9). Statistics
in this sense was widely used across Europe and the world to gather information
not just about the state but also about other areas of human endeavour, such as the health care system.
Statistics, write Levine, Stephen, Krehbiel and Berenson (2013:32), is the branch
of mathematics that transforms numbers into useful information for decision
makers. Statistics lets you know about the risk associated with making a decision
and allows you to understand and reduce the variation in the decision-making
process. The Ethiopia Public Health Training Initiative (EPHTI) manual 2005
defines statistics in two senses: statistical data and statistical methods. Statistical data
refers to numerical descriptions of things. These descriptions may take the form of
counts or measurements. Thus, statistics of malaria cases in one of the malaria
detection and treatment posts of Ethiopia include fever cases, number of positives
obtained, sex and age distribution of positive cases, etc.
In the same vein, Afonja, Olubusoye, Ossai and Arinola (2014:3) opine that the use
of statistics has permeated almost every facet of human life. It is about making
important everyday decisions and choices, turning numbers into useful information
and understanding uncertainties and risks. Aminu (1999:1) posits that the word
statistics is a plural, meaning a collection of more than one figure. The singular of
the word is statistic. Statistics is therefore the science of collecting,
organizing, presenting, analyzing and interpreting data to assist in making better
decisions under conditions of uncertainty. Today's world is undoubtedly
complex, and there are basically two senses in which the word statistics is
commonly used: as a numerical record of pieces of information, or as a
discipline (subject) of study. In this lecture, our concern is numerical record
keeping in our hospitals. Statistics here is concerned with abstracting data,
classifying it and then comparing it with data obtained from similar sources so that
plans and control mechanisms can be implemented in our health care system.
There are two types of statistics: descriptive statistics and inferential statistics.
The branch of modern statistics that the EPHTI manual 2005 identifies as most
relevant to public health and clinical medicine is statistical inference. This branch
of statistics deals with techniques for drawing conclusions about the population.
Inferential statistics builds upon descriptive statistics. The inferences are drawn
from particular properties of a sample to particular properties of the population. These
are the types of statistics most commonly found in research publications.
It is evident from the discussion above that statistics is concerned about making
use of data. Data is a collection of pieces of information to help with the
administration of the state, hospitals, companies, etc. Its domain covers data
collection, data management, data processing, data analysis, and report writing.
This process is aptly captured in circular form by Afonja et al. (2014:9), as shown in
Figure 1.
[Figure 1: The data cycle. Stages shown: Data Collection; Data Checking and Verification; Data Transfer and Compilation; Data Analysis and Packaging; Data Quality Assessment; Data Dissemination and Reporting; Use of Data. Summary labels: Collection; Analysing (process); Decision/Conclusion (storage/dissemination/use).]
The National Cancer Institute defines biostatistics as “the science of collecting and
analyzing biologic or health data using statistical methods.” The use of statistics in
health care dates back more than a century to the earliest application of the
scientific method in medical research. Many health care decisions are based in
small or large part on the results of biostatistical research. What has changed in
recent years is the amount of health-related data available to researchers, the
technology available to translate the information into knowledge, and the need to
improve the quality and efficiency of health care.
(a) Time Series Data: Time series data, as the name suggests, are data that have
been collected over a period of time on one or more variables. Time series data
have associated with them a particular frequency of observation or collection of
data points. The frequency is simply a measure of the interval over, or the
regularity with which, the data are collected or recorded. Examples are hourly
injections for a patient, daily routine drug monitoring for patients, monthly checkups,
etc. It is generally required that all time series data used in a model be of the
same frequency of observation.
(b) Cross-sectional Data: Cross-sectional data are data on one or more variables
collected at a single point in time. For example, the data might come from a poll on the
uptake of vaccination against polio, measles, cholera, etc.
(c) Panel Data: Panel data have the dimensions of both time series and cross-
sections, e.g. the daily prices of a number of blue chip stocks over two years. The
estimation of panel regressions is an interesting and developing area.
2.1.2 Qualitative Data (or Categorical Data)
(a) Nominal Scale: Nominal variables are categorical variables that have two
or more possible levels with no natural ordering. In a nominal scale, no
quantitative information is conveyed and no ordering of the items is
implied. The nominal scale is the simplest; it involves only assignment
to classes and does not imply magnitude. It gives us the categories or
names with no specific order in mind. They are used as labels for groups
or classes. Examples are the classification of respondents into male and
female, and of nurses into trained and untrained. We can thus code the
various categories as Male 1, Female 0; Trained 1, Untrained 0. With
nominal scale data, the obvious and intuitive descriptive summary measure
is the proportion or percentage of subjects who exhibit the attribute. Table 1
shows a hypothetical example: the survival status of
patients following the administration of propranolol.
Table 1: Hypothetical Nominal Scale Data for Survival Status of
Propranolol-treated and Control Patients with Myocardial
Infarction (MI)
Status    Propranolol-treated    Control
Dead      11                     27
Alive     31                     15
Table 1 shows the hypothetical placement of the MI patients. Each individual is
simply placed in the proper category or group, and the number
in each category is counted. Each patient fits into exactly one category. The
survival rate is 35.71% (15 of 42) for the control patients and 73.81% (31 of 42) for the
experimental patients. The two groups are dichotomous: one group receives
propranolol while the other does not.
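The percentages quoted for Table 1 can be checked with a few lines of Python. This is only a sketch: the assignment of counts to the treated and control groups is inferred from the survival rates stated in the text, and the names `table1` and `survival_rate` are illustrative.

```python
# Counts from Table 1. Which column belongs to which group is an assumption
# inferred from the survival rates quoted in the text.
table1 = {
    "propranolol": {"dead": 11, "alive": 31},
    "control": {"dead": 27, "alive": 15},
}


def survival_rate(counts):
    """Return the percentage of patients in a group who are alive."""
    total = counts["dead"] + counts["alive"]
    return 100 * counts["alive"] / total


rates = {group: survival_rate(counts) for group, counts in table1.items()}
for group, rate in rates.items():
    print(f"{group}: {rate:.2f}% survived")
```

The proportion is the natural summary here because a nominal variable only tells us which category each patient falls into.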
(b) Ordinal Scale: Here, data elements are ordered according to their
relative quality or size. It is a level of measurement which classifies data into
categories that can be ranked. The sizes of the differences between ranks are not defined. The
ordinal scale is based on the natural order property of real numbers, which says that
one real number may be greater than, equal to or less than another real number. It
allows for classification and indication of size on some predefined basis. We can
rank patients according to the magnitude of their ailment. An example is the degree of
injury of patients in accidents. Equal intervals on the scale do
not represent equal quantities, e.g., fatal/very severe = 1, severe = 2, not too severe = 3,
minor = 4.
(c) Interval Scale: It is a level of measurement which classifies data that can be
ranked, and for which differences are meaningful. However, there is no meaningful zero, so
ratios are meaningless. The interval scale possesses the order property of the
ordinal scale. The magnitude of difference between adjacent intervals on the scale
is equal. Arithmetic operations permissible on this scale include all those allowed
on the ordinal scale; in addition, measurements can be added, subtracted, divided
and multiplied by a constant and still yield an interpretable result. Comparison between
intervals on this scale is meaningful, and is independent of the unit of
measurement or the system of assigning scores. An example is temperature measured in
degrees Celsius or Fahrenheit. Height and weight, by contrast, have a true zero and
therefore belong to the ratio scale.
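A short sketch can make the point about intervals and ratios concrete. It uses temperature, with three hypothetical readings chosen only for illustration, and the standard Celsius to Fahrenheit conversion.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Three hypothetical temperature readings in degrees Celsius.
a_c, b_c, c_c = 10.0, 20.0, 40.0
a_f, b_f, c_f = (c_to_f(t) for t in (a_c, b_c, c_c))

# The ratio of two raw readings changes with the unit, so it is not meaningful.
ratio_c = b_c / a_c
ratio_f = b_f / a_f

# The ratio of two intervals (differences) is the same in both units.
interval_ratio_c = (c_c - b_c) / (b_c - a_c)
interval_ratio_f = (c_f - b_f) / (b_f - a_f)

print(ratio_c, ratio_f)
print(interval_ratio_c, interval_ratio_f)
```

Because the zero point of each temperature scale is arbitrary, "twice as hot" depends on the unit, while "twice as large a change" does not.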
The term census means a complete enumeration of all units in the population. It is
a survey which includes every item or element in the population. It is sometimes
called a 100% count.
Apart from censuses of the human population, other types include censuses of houses, farms,
patients, road traffic, etc.
3.1.4 Questionnaire Design: The issues to consider relate to the number, type,
order and arrangement of the questions. The EPHTI Manual 2005 states that
“designing a good questionnaire always takes several drafts. In the first draft we
should concentrate on the content. In the second, we should look critically at the
formulation and sequencing of the questions. Then we should scrutinize the format
of the questionnaire. Finally, we should do a test-run to check whether the
questionnaire gives us the information we require and whether both the
respondents and we feel at ease with it. Usually the questionnaire will need some
further adaptation before we can use it for actual data collection”. These five
steps are shown in Figure 3.
[Figure 3: Steps in questionnaire design: Content (objective); Formulating Questions; Sequencing of Questions; Formatting the questions; Translation.]
3.1.6 Sample Design: Sample size determination and the selection procedure are
sometimes regarded as constituting the sample design.
3.1.7 Survey Size: Both the questionnaire size and the sample size make up the
survey size.
3.1.8 Sample Selection: The list of all the units of the population from which a
sample is to be selected is called a frame. The units are called sampling units.
A common example of a frame is the list of members of staff at the University of Port
Harcourt Teaching Hospital (UPTH), from which we can select doctors, nurses and
others.
Following from the above discussion, sample selection may be
broadly classified into two types: random and non-random sampling, or probability
and non-probability sampling. Any sampling procedure differing from random
sampling will be regarded as non-random. In particular, any such non-random
procedure will not employ or attempt to employ chance devices to make the selection
of units. We mention the more easily recognizable ones.
(i) Haphazard Selection, in which the selector thinks that he/she is making a
random selection. A good example is the sort of selection in public places often
made by various press agents. By stopping to interview people ‘at random’ without
following any prescribed sampling rules like those previously discussed, a
journalist claims to have a random sample of public opinion. You can think of
many possible biases that can come into such haphazard selection.
(ii) Systematic Sampling is one in which the sampling units are selected at fixed
intervals from the frame. Suppose it is decided to take a sample of 100 medical
practitioners using a medical directory as a frame. If the directory has 800 names in
it, then a systematic sample will be obtained by taking every eighth name.
Similarly, instead of selecting a random sample of ten students from your school,
you may select a systematic sample in which every ninth student on the school list
of ninety names is picked.
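The directory example can be sketched in Python as follows. The frame of 800 numbered names is hypothetical, and the optional random starting point is a common refinement assumed here rather than something stated in the text.

```python
import random


def systematic_sample(frame, sample_size, start=None):
    """Select every k-th unit from the frame, where k = len(frame) // sample_size."""
    k = len(frame) // sample_size          # sampling interval
    if start is None:
        start = random.randrange(k)        # assumed: random start within the first interval
    return frame[start::k][:sample_size]


# Hypothetical frame: a medical directory with 800 names.
directory = [f"practitioner_{i}" for i in range(1, 801)]

# start=7 is the eighth name (indexing begins at 0), then every eighth name after it.
sample = systematic_sample(directory, 100, start=7)
print(len(sample), sample[:3])
```

Calling `systematic_sample(list(range(1, 91)), 10)` gives the school example: ten students, taken at intervals of nine from a list of ninety names.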
(iii) Quota Sampling attempts a fair representation of the different classes that may
exist in a given population. It is commonly used in public opinion and market
research surveys. In such surveys, the interviewer is required to ensure that
a specified number of units in various classes, such as sex, age, income group and
geographic location, are included in the sample.
(iv) Expert Selection is the one that, in the opinion of the expert, produces a truly
representative sample. The procedure is totally devoid of any standard objective
rule and leaves too much to the personal judgment of the enumerator. It is hardly
recommended in the statistical world.
In any field of enquiry, the objective of the investigator determines the type and the
nature of data to be collected. This objective must be clearly stated. The statistical
objective therefore is to devise the best methods of collecting the data in order to
achieve the investigator’s objective bearing in mind: the cost of collecting the data;
practicability of the proposed methods of collection and representativeness of
samples, if sampling is done; time spent in collecting data and accuracy of
observations; and the ability to assess and minimize the various errors and biases
which could lead to uncertainty in the conclusion.
According to the EPHTI Manual 2005, data collection techniques allow us to
systematically collect data about our objects of study (people, objects, and
phenomena) and about the setting in which they occur. In the collection of data we
have to be systematic. If data are collected haphazardly, it will be difficult to
answer our research questions in a conclusive way. Various data collection
techniques can be used such as:
(c) Postal or mail method and telephone: Under this method, the investigator
prepares a questionnaire containing a number of questions pertaining to the field of
inquiry. The questionnaires are sent by post to the informants together with a polite
covering letter explaining the detail, the aims and objectives of collecting the
information, and requesting the respondents to cooperate by furnishing the correct
replies and returning the questionnaire duly filled in. In order to ensure quick
response, the return postage expenses are usually borne by the investigator (EPHTI
Manual 2005).
In sum, the EPHTI Manual 2005 states that face-to-face and telephone
interviews have many advantages. A good interviewer can stimulate and maintain
the respondent’s interest, and can create a rapport (understanding, concord) and
atmosphere conducive to the answering of questions. If anxiety is aroused, the
interviewer can allay it. If a question is not understood an interviewer can repeat it
and if necessary (and in accordance with guidelines decided in advance) provide an
explanation or alternative wording. Optional follow-up or probing questions that
are to be asked only if prior responses are inconclusive or inconsistent cannot
easily be built into self-administered questionnaires. In face-to-face interviews,
observations can be made as well. In general, apart from their expense, interviews
are preferable to self-administered questionnaires, with the important proviso that
they are conducted by skilled interviewers.
(d) Use of documentary sources: Clinical and other personal records, death
certificates, published mortality statistics, census publications, etc. Examples
include:
5. Records of hospitals or any health institutions. Although data from
documents are less time consuming and relatively cheap to obtain, care
should be taken over the quality and completeness of the data. There could be
differences in objectives between the primary author of the data and the user
(EPHTI Manual, 2005).
Data are facts, observations, and information that are obtained from investigations.
When a particular characteristic or an attribute is measured or observed in an
object, the resulting value is regarded as an observation. Such characteristic or
attribute is a variable. This is because it varies from one object to another. For
instance, the age of students in a class is a variable. The classification of data/variables
is illustrated in Figure 3. There are two kinds of data/variables:
quantitative and qualitative.
4.1.1 A quantitative variable is one which is measured on a numerical scale. Data
collected on a quantitative variable are often referred to as metric data. For
example, heights, weights, school marks, market prices, daily temperatures and
many others that can be measured on an object are quantitative. They can be
further divided into discrete and continuous. They are said to be discrete, if the
observed values take whole numbers only. An example of such is family size. On
the other hand, they are said to be continuous if they can take any value within a
range of values. An example is height. This category of variables could be
subjected to arithmetic operations such as addition, subtraction, division and
multiplication.
4.1.2 Qualitative variable involves non-numerical items which are classified into
groups or categories. A qualitative variable describes observations as belonging to
one of a set of categories. Qualitative data such as gender, eye colour, etc. of a
group of individuals are not computable by arithmetic relations.
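[Figure: Classification of data/variables. Data/Variable divides into Quantitative Data/Variable [Numerical] and Qualitative Data/Variable [Non-Numerical], e.g. gender, religion, colour. Quantitative Data/Variable divides into Discrete Data/Variable and Continuous Data/Variable, e.g. height, weight, temperature.]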
The first step in the use of data for evidence-based decision making is the search
for any available data. The data user may then decide to use some existing data or
collect fresh data. Existing data could be found in published documents or
unpublished administrative documents, while fresh data may come from censuses,
sample surveys, electronic sources such as the internet, and experiments.
(b) Published Sources: Published data are naturally more readily accessible than
unpublished ones. Listed below are the main sources of published data:
(c) Electronic Sources: Electronic sources are widely used because of the
tremendous growth in information technology (IT) globally. They
include the World Wide Web, internet/intranet, direct data capturing machines, Global
System for Mobile Communications (GSM) phones, etc. Huge amounts of data are now available for
download on the internet at little or no cost, with few or no restrictions. All these
are in vogue now, and they make the collection of data faster, more accurate and less
costly.
4.4.1 Classification
Classification can be qualitative when items are sorted in groups, each possessing
some attributes that cannot be expressed numerically, e.g. gender can be grouped
as male or female, etc. It can also be quantitative when items vary in respect of
some measurable characteristics. Most quantitative classifications form frequency
distributions.
There are geographical classifications (by state, regions and entities) which are
good for administrative purposes. Also, there are chronological classifications, or
time series, which give figures concerning a particular phenomenon at various
specified times. Table 3 is an example of time series on poverty incidence,
estimated population and population in poverty in Nigeria between 1980 and 2010.
We have seen from Table 2 above most of the desirable features of a good table.
The construction of statistical tables does not require expert knowledge or great
skill. All that is necessary is to pay attention to the more obvious and simple
points. The following guidelines for the drawing of tables should be noted:
i. The table should have a title, which should be short and self-explanatory.
ii. The table should be simple and unambiguous. It must be easily interpreted.
iii. It should present the data clearly, highlighting important details.
iv. It should save space but be attractively designed.
v. The table should be self-sufficient, in the sense of not requiring supporting tables or
evidence. This is because, quite often, a text table could easily be used by
another person as reference material.
vi. Abbreviations and symbols should be avoided as much as possible.
Approximations and omissions should be explained in a footnote.
vii. Always indicate the source(s) of the data at the bottom of the table so that
the origin of the data is known.
Based on the purpose for which the table is designed and the complexity of the
relationship, a table could be either a simple frequency table or a cross tabulation.
The simple frequency table is used when the individual observations involve only
a single variable, whereas the cross tabulation is used to obtain the frequency
distribution of one variable by the subsets of another variable. In addition to the
frequency counts, the relative frequency is used to depict clearly the distributional
pattern of the data. It shows a given frequency count as a percentage. For simple
frequency distributions (like Table 1), the denominator for the percentages is the
sum of all observed frequencies (EPHTI Manual, 2005).
On the other hand, EPHTI Manual, 2005 reports that in cross tabulated frequency
distributions where there are row and column totals, the decision for the
denominator is based on the variable of interest to be compared over the subset of
the other variable.
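The effect of the choice of denominator can be seen in a small sketch. It reuses the counts of Table 1 (with the column assignment inferred from the survival rates quoted earlier); the function names are illustrative only.

```python
# Cross tabulation of outcome (rows) by group (columns), using the counts of Table 1.
# The assignment of counts to groups is an assumption inferred from the text.
crosstab = {
    "dead": {"propranolol": 11, "control": 27},
    "alive": {"propranolol": 31, "control": 15},
}


def column_percentages(table):
    """Percentages with each column (group) total as the denominator."""
    columns = list(next(iter(table.values())))
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {r: {c: 100 * row[c] / totals[c] for c in columns} for r, row in table.items()}


def row_percentages(table):
    """Percentages with each row (outcome) total as the denominator."""
    return {
        r: {c: 100 * value / sum(row.values()) for c, value in row.items()}
        for r, row in table.items()
    }


col_pct = column_percentages(crosstab)   # compares outcomes between the groups
row_pct = row_percentages(crosstab)      # compares groups within each outcome
print(f"Alive among treated: {col_pct['alive']['propranolol']:.2f}%")
print(f"Treated among the dead: {row_pct['dead']['propranolol']:.2f}%")
```

Column totals are the sensible denominator when the groups are the variable of interest to be compared; row totals answer a different question, namely how each outcome is shared between the groups.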
In a set of data, the number of times that a particular value occurs is called the
frequency of that value. If the ages of ten students are 4, 5, 6, 6, 7, 2, 2, 2, 2 and 2,
the frequency of the value 6 is 2 and that of 2 is 5. If we construct a table with the
age values and frequencies as entries, we have a frequency table as Table 3.
Age    Frequency
4      1
5      1
6      2
7      1
2      5
We can thus define a frequency table as one in which the variable of interest forms
the basis for classification and the entries are frequencies. A frequency table is
also called a frequency distribution because such a table shows how the values of
the variable are distributed. However, we need not use exact values as the classes.
We could use the intervals as we shall see very soon.
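As a minimal sketch, the frequency table for the ten ages can be produced with Python's standard `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

ages = [4, 5, 6, 6, 7, 2, 2, 2, 2, 2]   # the ten ages given in the text
frequency = Counter(ages)               # maps each age to the number of times it occurs

print("Age  Frequency")
for age in sorted(frequency):
    print(f"{age:>3}  {frequency[age]:>9}")
```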
Frequency distribution is a tabular arrangement of data by classes together with the
corresponding class frequencies. The frequency distribution tables are of
fundamental importance in statistics. Apart from being a useful summary method
of presenting data, they form a basis for the discussion of probability ideas.
In medical research, diagrams and graphs are used to present information or data
in an attractive and colourful manner for the understanding of both users and
patients. They are used for clearer display of pieces of information regarding the
research process. Examples of graphs and diagrams are the pictograph, bar chart, pie
chart, etc. We look at each of them in the next sub-section.
(a) Pictograph/Ideograph
The pictograph presents data using pictures. In a
pictograph, the number or scale of the pictures bears a relationship to the
magnitude of the variable being presented. Its advantages are that it is easy
to understand and easy to draw. However, one of its major disadvantages is its
inaccuracy for complex statistical presentation and interpretation (Aminu,
1995:8).
Example 1: Suppose the following numbers of children suffered from pneumonia in
Obio/Akpor Local Government Area, Rivers State, over the last five years: 30, 60,
80, 40, 10. Present the information in the form of a pictograph.
Solution
1st Year 30
2nd Year 60
3rd Year 80
4th Year 40
5th Year 10
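Since the pictures themselves cannot be reproduced in plain text, the sketch below prints a text version of the pictograph, using the five yearly counts from the statement of Example 1. The symbol and the key (one symbol for 10 children) are assumptions made for illustration.

```python
cases = {"1st Year": 30, "2nd Year": 60, "3rd Year": 80, "4th Year": 40, "5th Year": 10}
SCALE = 10      # assumed key: one symbol stands for 10 children
SYMBOL = "*"    # assumed stand-in for the picture of a child


def pictograph_row(label, value):
    """Build one row of the pictograph with value // SCALE symbols."""
    return f"{label:<9} {SYMBOL * (value // SCALE)} ({value})"


rows = [pictograph_row(label, value) for label, value in cases.items()]
print("\n".join(rows))
print(f"Key: {SYMBOL} = {SCALE} children")
```

A scale that divides every value exactly, as 10 does here, avoids the need for partial symbols.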
Solution
Here, you can use either a horizontal or vertical bar chart.
Steps to follow
Step 1: Choose a desired scale for the axes
Step 2: Plot the simple bar chart
[Figure: Simple bar charts of the data for 2010 to 2014, one vertical (years on the horizontal axis, scale 0 to 35) and one horizontal (years on the vertical axis, scale 0 to 40).]
Step 3: Construct a pie chart using the angle (in degrees) for each year, as follows
Key
Year    Angle (degrees)    Colour
2010    33.96              Yellow
2011    50.94              Blue
2012    84.91              Red
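The angle of each sector of a pie chart is the category's share of the total multiplied by 360 degrees. Because the counts behind the key above are not reproduced in these notes, the sketch below applies the rule to the pneumonia counts of Example 1 instead; the function name `pie_angles` is illustrative.

```python
def pie_angles(values):
    """Return the sector angle in degrees for each category: value / total * 360."""
    total = sum(values.values())
    return {label: 360 * value / total for label, value in values.items()}


# Pneumonia counts from Example 1, used here only to illustrate the rule.
cases = {"1st Year": 30, "2nd Year": 60, "3rd Year": 80, "4th Year": 40, "5th Year": 10}
angles = pie_angles(cases)
for label, angle in angles.items():
    print(f"{label}: {angle:.2f} degrees")
```

The angles should always add up to 360 degrees, which is a quick check on the arithmetic before drawing the chart.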
Homework
Using the hypothetical distribution of Doctors and Nurses in UPTH in the
table below
Year Doctors Nurses
2015 40 80
2016 46 102
2017 60 120
2018 80 170
Use a pie chart to describe the data above.
4.6 Frequency Distributions
Aside from charts and pictures, another way to organize and group data
for decision making in the hospital is the frequency distribution table.
In Kalu’s (2013:39) view, a frequency distribution is used to compress data by
recording each observation and counting the number of times that
observation is repeated. This count is called the frequency, that is, how
often a value of the variable occurs in the data set. The data can thus be organized in
classes with a specific width or size.
Example 3: In a hypothetical survey of 50 families in Alakahia,
Obio/Akpor L.G.A, the number of children per family was recorded as:
10 80 15 70 15 10 30 40 50 20
15 20 30 40 10 5 10 20 15 30
50 30 20 40 50 79 38 71 10 80
40 5 15 20 15 65 43 68 15 42
30 10 10 15 30 45 55 70 19 45
You are required to represent the above data in the form of a frequency
distribution.
Solution
Step 1: To make it easier, we begin by arranging the data in an array
(i.e. in ascending or descending order of magnitude). This makes the
lowest and the highest numbers in the data easy to see. That is,
5 5 10 10 10 10 10 10 10 15
15 15 15 15 15 15 15 19 20 20
20 20 20 30 30 30 30 30 30 38
40 40 40 40 42 43 45 45 50 50
50 55 65 68 70 70 71 79 80 80
Step 2: Form the frequency distribution of the number of children.
Number of children    Frequency
5 2
10 7
15 8
19 1
20 5
30 6
38 1
40 4
42 1
43 1
45 2
50 3
55 1
65 1
68 1
70 2
71 1
79 1
80 2
Total 50
Note: the frequency distribution table above shows how often each value
occurs in the data.
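The two steps of Example 3 can be sketched in Python as follows. The last part groups the values into classes; the class width of 20 is an assumption chosen only to illustrate the idea of classes with a specific width.

```python
from collections import Counter

children = [
    10, 80, 15, 70, 15, 10, 30, 40, 50, 20,
    15, 20, 30, 40, 10, 5, 10, 20, 15, 30,
    50, 30, 20, 40, 50, 79, 38, 71, 10, 80,
    40, 5, 15, 20, 15, 65, 43, 68, 15, 42,
    30, 10, 10, 15, 30, 45, 55, 70, 19, 45,
]

# Step 1: arrange the data in ascending order.
ordered = sorted(children)

# Step 2: count how many times each value occurs.
frequency = Counter(ordered)

# Optional: group the values into classes. WIDTH = 20 is an illustrative assumption.
WIDTH = 20
grouped = Counter((value - 1) // WIDTH for value in ordered)
classes = {
    f"{index * WIDTH + 1}-{(index + 1) * WIDTH}": count
    for index, count in sorted(grouped.items())
}

for value in sorted(frequency):
    print(f"{value:>3}  {frequency[value]}")
print(classes)
```

Grouping shortens the table considerably, at the cost of no longer showing the individual values.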