Basic Concepts, Methods of Data Collection and Presentation
Stage 2: Organization of Data: This includes
Editing: checking the collected data for errors, omissions and inconsistencies.
Classification: grouping the data according to their similarities and differences.
Tabulation: arranging the data in rows and columns.
Stage 3: Presentation of Data: It is the process of showing the data in an understandable way.
Example: charts, graphs and tables.
Stage 4: Analysis of Data: It is the process of extracting useful characteristics from the
data.
Stage 5: Interpretation of Data (Inference): It is the process of drawing conclusions about the
whole population from sample data.
It is the most difficult and risky stage, and it requires professionals in statistics.
1.1.4 Definition of Some Basic Terms
Data: any set of recordable, interrelated observations.
Population: the totality of all individuals or items of the phenomenon under study.
Sample: a part of the population selected in a statistical manner in order to study the population.
Parameter: a statistical value that describes a characteristic of the population,
i.e. a result obtained from the population.
Statistic: a statistical value that describes a characteristic of the sample,
i.e. a result obtained from the sample.
Census: the process of studying every member of the population.
Example: A researcher wants to study the academic performance of first-year
students at MTU. Because of several constraints, he cannot enumerate all of the
students, so he randomly selects 500 students and finds their average GPA to
be 2.58.
a. Identify the population. b. Identify the sample. c. Identify the statistic.
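The difference between a parameter and a statistic can be illustrated with a small simulation. This is only a sketch: the population of 5,000 GPA values below is synthetic and hypothetical (the notes do not give the real population), and only the sample size of 500 comes from the example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: synthetic GPAs for 5,000 first-year students.
population = [round(random.uniform(1.0, 4.0), 2) for _ in range(5000)]

# Simple random sample of 500 students, drawn without replacement.
sample = random.sample(population, 500)

parameter = statistics.mean(population)  # computed from the whole population
statistic = statistics.mean(sample)      # computed from the sample only

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

The sample mean is usually close to, but not exactly equal to, the population mean, which is why conclusions drawn from a sample carry some risk.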
1.1.5 Uses, Applications and Limitation of Statistics
Uses of Statistics
a. It represents the facts in the form of numerical data.
b. It condenses and summarizes mass of data into a few presentable,
understandable and precise figures.
c. It facilitates comparison of data.
d. It helps in predicting future trends.
e. It helps in formulating policies.
Applications of Statistics
Statistics can be applied in almost all fields of study. Some of these are:
1. In health 2. In education 3. In agriculture etc
Limitations of Statistics
It is not suited to the study of qualitative phenomena.
Its results are true only on average; unlike the laws of physics, they do not state
exact facts.
It deals with a set (aggregate) of individuals, not with a single individual.
It can easily be misused.
Statistical interpretation requires a high degree of skill and understanding of the
subject.
1.1.6 Types of Variables and Level of Measurements
Types of variables: There are two types of variables.
1. Qualitative (Categorical) Variables: variables that can be placed into distinct categories
according to some characteristic. They are not numeric and cannot be counted or measured.
Example: gender, religion, color, etc.
2. Quantitative Variables: variables that are numerical in nature and can be measured
or counted.
Example: height, weight, number of students, GPA, etc.
Quantitative variables can be further divided into discrete and continuous variables.
Discrete Variables: variables whose values are determined by counting.
Example: number of students in a class.
Continuous Variables: variables whose values are determined by measuring rather than
counting.
Example: height of a person.
Exercise: Are the following variables discrete or continuous?
a. The number of correct answers on a true/false test.
b. The duration of effectiveness of a pain medication.
c. The weight of Sunday newspapers.
Measurement Scales (Levels)
There are four types of measurement scales. These are:
1. Nominal Scale
2. Ordinal Scale
3. Interval Scale
4. Ratio Scale
1. Nominal Scale: When the possible categories of a variable have no natural order, the
measurement is called a nominal scale.
We cannot apply mathematical operations or inequalities.
Example: Blood type (A, B, AB, O), sex (F, M), numbers assigned to regions (1, 2, 3, ...)
2. Ordinal Scale: When the possible categories of a variable have a natural order, the
measurement is called an ordinal scale.
We can apply mathematical inequalities, but we cannot apply mathematical
operations.
Example: Economic status (low, medium, high), education level (diploma, degree, master's).
3. Interval Scale: It is a scale with an arbitrary zero point; zero does not indicate a total absence
of the quantity being measured.
We can apply mathematical inequalities.
We can also apply addition and subtraction, but we cannot meaningfully perform
multiplication and division.
Example: a) The temperature of a certain area may be 0°C. This does not mean that
there is no heat at all; it simply indicates that it is very cold.
b) The temperatures of certain areas may be 63°F, 68°F, 110°F, 126°F and 131°F.
→ We can say that 68°F > 63°F, i.e. 68°F is warmer than 63°F.
→ 68°F − 63°F = 131°F − 126°F, since equal temperature differences are equal.
But we cannot say that 126°F is twice as hot as 63°F, even though 126/63 = 2.
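One way to see why ratios are not meaningful on an interval scale is to express the same temperatures in Celsius. The sketch below uses the standard Fahrenheit-to-Celsius conversion; the function and variable names are illustrative only.

```python
def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit reading to Celsius."""
    return (f - 32) * 5 / 9

temps_f = [63, 68, 110, 126, 131]
temps_c = [fahrenheit_to_celsius(t) for t in temps_f]

# Order is preserved: the readings are still ascending in Celsius.
still_ordered = temps_c == sorted(temps_c)

# Equal differences stay equal after the change of units.
diff_low_c = fahrenheit_to_celsius(68) - fahrenheit_to_celsius(63)
diff_high_c = fahrenheit_to_celsius(131) - fahrenheit_to_celsius(126)

# Ratios are NOT preserved: 2 in Fahrenheit, but not 2 in Celsius.
ratio_f = 126 / 63
ratio_c = fahrenheit_to_celsius(126) / fahrenheit_to_celsius(63)

print(still_ordered, round(diff_low_c, 3), round(diff_high_c, 3))
print(ratio_f, round(ratio_c, 2))
```

Because the ratio changes when the arbitrary zero point changes, "twice as hot" has no fixed meaning, while comparisons and differences do.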
1.2. METHOD OF DATA COLLECTION AND PRESENTATION
1.2.1 Source and Types of Data
There are two types of data:
a) Primary Data
Data collected by the investigator directly from the source.
Example: observing signs, measuring characteristics, recording symptoms and
interviewing respondents, etc.
Two activities are involved: planning and measuring.
Identify the source and elements of the data.
Decide whether to use a sample or a census.
If sampling is preferred, decide on the sample size, selection method, etc.
Decide measurement procedure.
Set up the necessary organizational structure.
b) Secondary Data
• Data gathered or compiled from published and unpublished sources or files.
Example: Hospital records, vital statistics and registers, etc.
• When our source is secondary data, check that:
The type and objective of the situation are understood.
The purpose for which the data were collected is compatible with the present problem.
The nature and classification of the data are appropriate to our problem.
There are no biases or misreporting in the published data.
Note: Data which are primary for one may be secondary for the other.
1.2.2 Methods of Data Collection
There are three major methods of data collection.
1. Observation or measurement.
2. Interview with questionnaires.
a. Face to face interview.
b. Telephone interview.
c. Self administered questionnaires returned by mail (mailed questionnaire).
3. The use of documentary sources
1. Observation or measurement (direct personal observation)
In this case data are obtained through direct observation or measurement. This requires
training and monitoring of the measurer to ensure the use of a standard procedure.
It provides accurate information, but it is expensive and inconvenient.
Example: physical examination, clinical measurements, laboratory tests etc.
2. Interview with questionnaires: Here one drafts a detailed questionnaire. The
questionnaires can either be mailed to the respondents for filling in and returning, or be put
in the charge of enumerators who go around and fill them in after obtaining the desired
information.
Questionnaires: written documents that instruct the reader or listener to answer
the questions written on them.
Respondents (Interviewees): the individuals who answer the questions
on the questionnaire.
Interviewers: the individuals who record the responses given by the
respondents.
a) Face to Face Interviews (questionnaires in charge of enumerators)
The interviewer knows exactly who is responding to the questionnaire.
Advantages
The interviewer can help the respondent if he/she has difficulty in
understanding the questions. The difficulty could be due to language,
concentration or limited intellectual capacity.
There is more flexibility in presenting the items; they can range from closed to
open.
It is possible to use skip patterns.
A skip pattern means skipping a question or a group of questions
that are not applicable.
Disadvantages
An untrained interviewer may distort the meaning of the questions.
Attributes of the interviewer may affect the responses, owing to:
a) the bias of the interviewer and b) his/her social or ethnic characteristics.
It is costly in terms of time and money.
b) Telephone Interviews
Advantages
It is less expensive in time and money compared with face to face interviews.
The interviewer is able to help the respondent if he/she doesn’t understand the
question (as seen with face to face interview)
Broadly representative samples can be obtained from those who have telephone lines.
Disadvantages
Under-representation of groups that do not have telephones.
Problems with telephone numbers that are not listed in the directory.
The intended respondent may be substituted by another person.
c) Self administered questionnaires returned by mail (mailed questionnaire)
Here the questionnaire is mailed to the respondents to be filled. Sometimes
it is known as self enumeration.
Advantages
These are the cheapest.
There is no need for trained interviewer.
There is no interviewer bias.
Disadvantages
Low response rate.
Incomplete questionnaires due to omissions or invalid responses.
No assurance that the questionnaire was answered by the right person.
Intense follow-up is needed to get a high response rate.
3. The use of documentary sources
Extracting information from existing sources (e.g. Hospital records) is much less expensive
than the other two methods. It can be an important source of data.
Limitation: It is difficult to get the information needed when records are compiled in
an unstandardized manner.
1.2.3 METHODS OF DATA PRESENTATION
After the data have been collected and edited, the next important step is to organize them,
that is, to present them in a readily comprehensible, condensed form that helps in drawing
inferences. It is also necessary that like items be separated from unlike ones.
M S D W D
S S M M M
W D S M M
W D D S S
S W W D D
Solution: Since the data are qualitative (categorical), discrete classes can be used. There are four
types of marital status M, S, D, and W. These types will be used as the classes for the distribution.
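A categorical frequency distribution for the 25 observations above can be tallied as in the following sketch. It uses Python's standard `collections.Counter`; the percent column is an added convenience and is not taken from the notes.

```python
from collections import Counter

# The 25 marital-status codes, row by row as listed above.
responses = (
    "M S D W D "
    "S S M M M "
    "W D S M M "
    "W D D S S "
    "S W W D D"
).split()

frequency = Counter(responses)
total = sum(frequency.values())

# Print one row per class, in the order M, S, D, W used in the solution.
print(f"{'Class':<6}{'Frequency':>10}{'Percent':>9}")
for status in ["M", "S", "D", "W"]:
    f = frequency[status]
    print(f"{status:<6}{f:>10}{100 * f / total:>8.0f}%")
print(f"{'Total':<6}{total:>10}")
```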
0 2 2 1 1 2
3 5 3 2 2 2
1 0 1 2 4 2
0 1 0 1 4 4
2 2 0 1 1 5
Solution: First arrange the data in order of magnitude (ascending order) and then count the
frequency of each value. The distinct values for these data are 0, 1, 2, 3, 4 and 5, which is a small number of distinct values.
No of cups Frequency (f)
0 5
1 8
2 10
3 2
4 3
5 2
Total 30
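The two steps of the solution (sort, then count) can be mirrored directly in code. This is a minimal sketch using the standard `itertools.groupby`, which groups equal neighbouring values and therefore needs the data to be sorted first.

```python
from itertools import groupby

# The 30 observations, row by row as listed above.
cups = [
    0, 2, 2, 1, 1, 2,
    3, 5, 3, 2, 2, 2,
    1, 0, 1, 2, 4, 2,
    0, 1, 0, 1, 4, 4,
    2, 2, 0, 1, 1, 5,
]

# Step 1: arrange the data in ascending order.
ordered = sorted(cups)

# Step 2: count how many times each distinct value occurs.
table = [(value, len(list(group))) for value, group in groupby(ordered)]

for value, f in table:
    print(f"{value:>3} {f:>3}")
print("Total", sum(f for _, f in table))
```

The counts agree with the table above, and the frequencies add up to the 30 observations.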
b) Grouped Frequency Distribution:
When the number of distinct values in the data is too large, the data must be grouped into
classes. We divide the values into groups or class intervals, and then count the number of data
values falling in each class interval.
Class intervals (CI): non-overlapping intervals such that each value in the set of
observations can be placed in one, and only one, of the intervals.
Then continue to add the class width to this upper limit to get the rest of
the upper class limits, i.e. $ucl_{i+1} = ucl_i + w$, $i = 1, 2, \dots, k-1$.
where $u$ is the unit of measurement, i.e. the smallest difference between the two nearest
observations in the data. It is usually taken as 1, 0.1, 0.01, ... when the data are given as whole
numbers, to one decimal place, to two decimal places, ... respectively.
6. Find the frequencies.
Class boundaries (CB): the exact (true) limits of the classes. They are called
lower and upper class boundaries.
o Lower class boundary (LCB): The lcb is obtained by subtracting half the unit
of measurement from the lcl of the class, i.e.
$lcb_i = lcl_i - \frac{u}{2}$. Note: $lcb_{i+1} = lcb_i + w$.
o Upper class boundary (UCB): The ucb is obtained by adding half the unit of
measurement to the ucl of the class, i.e.
$ucb_i = ucl_i + \frac{u}{2}$. Note: $ucb_{i+1} = ucb_i + w$.
Class marks (midpoints) (m): The class mark is the average of the lcl and ucl, or of the lcb and ucb, i.e.
$m_i = \frac{lcl_i + ucl_i}{2}$ or $m_i = \frac{lcb_i + ucb_i}{2}$. Note: $m_{i+1} = m_i + w$.
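The three formulas can be checked on a few classes. In the sketch below the class limits 50–54, 55–59 and 60–64 are used purely as an illustration, with the unit of measurement assumed to be u = 1 (whole-number data); the helper name `class_summary` is hypothetical.

```python
def class_summary(lcl, ucl, u=1):
    """Return (lcb, ucb, class mark) for one class with limits lcl and ucl."""
    lcb = lcl - u / 2        # lower class boundary
    ucb = ucl + u / 2        # upper class boundary
    mark = (lcl + ucl) / 2   # class mark (midpoint)
    return lcb, ucb, mark

# Illustrative class limits; u = 1 is an assumption (whole-number data).
limits = [(50, 54), (55, 59), (60, 64)]
summary = [class_summary(lo, hi) for lo, hi in limits]

for (lo, hi), (lcb, ucb, mark) in zip(limits, summary):
    print(f"{lo}-{hi}  {lcb}-{ucb}  {mark}")

# The class width is the distance between the two boundaries of a class.
w = summary[0][1] - summary[0][0]
```

Here the upper boundary of each class coincides with the lower boundary of the next one, and successive class marks differ by the class width w = 5, as the notes in the formulas state.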
Step 4: Find the class width: $w = \frac{R}{k} = \frac{33}{5} = 6.6 \approx 7$ (rounding up).
• Then continue adding $w$ to both boundaries to obtain the remaining boundaries. Doing so
gives the following classes.
Class boundary
5.5 – 12.5
12.5 – 19.5
19.5 – 26.5
26.5 – 33.5
33.5 – 40.5
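The class width and the boundaries can be generated mechanically. The sketch below takes R = 33 and k = 5 from Step 4; the lowest class limit of 6 and the unit u = 1 are assumptions inferred from the first boundary 5.5, since the raw data are not reproduced here.

```python
import math

data_range = 33   # R, from Step 4
k = 5             # number of classes, from Step 4
u = 1             # assumed unit of measurement (whole-number data)
first_lcl = 6     # assumed lowest class limit, implied by the boundary 5.5

# Class width: R / k = 6.6, rounded up to the next whole number.
w = math.ceil(data_range / k)

# Start from the first lower boundary and keep adding w.
boundaries = []
lcb = first_lcl - u / 2
for _ in range(k):
    boundaries.append((lcb, lcb + w))
    lcb += w

for lower, upper in boundaries:
    print(f"{lower} - {upper}")
```

Every class spans exactly w = 7, so the last class runs from 33.5 to 33.5 + 7 = 40.5.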
Step 7: Find the frequencies.
1.2.3.2 DIAGRAMMATIC PRESENTATION OF DATA
These are techniques for presenting data in visual displays using geometric figures and pictures.
Importance:
They have greater attraction.
They facilitate comparison.
They are easy to understand.
Diagrams are appropriate for presenting discrete data.
The two most commonly used diagrammatic presentations for discrete as well as
qualitative data are:
• Bar charts and • Pie charts
1. Bar chart
There are three types of bar charts. These are:
I) Simple bar chart II) Component bar chart III) Multiple bar chart
Year of report 1986 1987 1988 1989 1990 1991 1992 1993
Cases 2 17 87 190 448 885 3256 2814
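In a simple bar chart, each bar's length is proportional to the value it represents. The following sketch draws a rough text version of the chart for the table above; the 50-character maximum width is an arbitrary choice, and a plotting library would normally be used for a real chart.

```python
years = [1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993]
cases = [2, 17, 87, 190, 448, 885, 3256, 2814]

max_width = 50                    # arbitrary width of the longest bar
scale = max_width / max(cases)    # characters per reported case

# Bar length for each year, proportional to the number of cases.
bar_lengths = {year: round(count * scale) for year, count in zip(years, cases)}

for year, count in zip(years, cases):
    print(f"{year} | {'#' * bar_lengths[year]} {count}")
```

Very small values (such as the 2 cases in 1986) round to a bar of length zero at this scale, which is a limitation of the text rendering rather than of the method.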
b) Component Bar Chart
It is used to present data that have more than one variable. For each category the bars are
subdivided into components to allow comparison between the parts. The bars represent the total
value of a variable, with each total broken into its component parts, and different colors or
designs are used for identification.
Example
Construct a component bar chart for the number of children who were vaccinated with DPT,
Polio and BCG antigens in Mizan-Aman General Hospital in 1979 E.C.
                Sex
Antigen    Male    Female    Total
DPT        250     300       550
Polio      300     320       620
BCG        200     210       410
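To draw a component bar chart, each bar is stacked: the first component starts at zero and each later component starts where the previous one ends. The sketch below computes these segment positions for the table above; the data structure and names are illustrative only.

```python
vaccinated = {
    "DPT":   {"Male": 250, "Female": 300},
    "Polio": {"Male": 300, "Female": 320},
    "BCG":   {"Male": 200, "Female": 210},
}

# For each antigen, record (component, start, end) of every stacked segment.
segments = {}
for antigen, by_sex in vaccinated.items():
    bottom = 0
    segments[antigen] = []
    for sex, count in by_sex.items():
        segments[antigen].append((sex, bottom, bottom + count))
        bottom += count

for antigen, parts in segments.items():
    total = parts[-1][2]  # the top of the last segment is the bar's total height
    print(antigen, parts, "total =", total)
```

The top of each bar equals the Total column of the table, so the chart shows the totals and their breakdown by sex at the same time.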
2. Pie-Chart
It is used to show the partitioning of a total into its component parts using a circle.
The circle should be divided into sectors proportional to the frequencies of the
categories they represent.
Steps to draw a pie chart
1. Convert frequencies into percentage relative frequency.
2. Draw a circle of any radius.
3. Convert percentage relative frequencies into degree measures.
$\text{angle of a sector} = \frac{360^\circ \times \%rf}{100\%}$
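The steps above can be sketched in code. As an illustration only, the sketch reuses the vaccination totals from the component bar chart example (DPT 550, Polio 620, BCG 410); these are not the data of the pie chart example that follows.

```python
# Illustrative frequencies, borrowed from the vaccination table.
totals = {"DPT": 550, "Polio": 620, "BCG": 410}
grand_total = sum(totals.values())

angles = {}
for name, f in totals.items():
    percent_rf = 100 * f / grand_total     # step 1: percentage relative frequency
    angles[name] = 360 * percent_rf / 100  # step 3: sector angle in degrees

for name, angle in angles.items():
    print(f"{name:<6}{angle:7.1f} degrees")
```

A useful check is that the sector angles add up to 360 degrees, just as the percentage relative frequencies add up to 100%.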
Example
Draw the pie chart for the following data. First construct a table providing the central angles.
50-54 49.5-54.5 52 3
55-59 54.5-59.5 57 1
60-64 59.5-64.5 62 1
Histogram
b) Frequency polygon
It is a multi-sided figure drawn by plotting the class marks (midpoints) on the x-axis and
the frequencies on the y-axis. The points are then connected with straight lines, and the lines are extended
at both ends so that they reach the horizontal axis at the midpoints of the classes just below the first class and just above the last class. This allows the total
area to be enclosed.
Example: draw the frequency polygon for the following age data.
Note: The total area under the frequency polygon is equal to the area under the histogram.
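The construction of the polygon and the area property in the note can be checked numerically. The sketch below uses only three classes with class marks 52, 57, 62 and frequencies 3, 1, 1 as a small illustration (not a complete data set), and it assumes equal class widths of w = 5.

```python
marks = [52, 57, 62]   # class marks (midpoints) of three illustrative classes
freqs = [3, 1, 1]      # their frequencies
w = 5                  # common class width (equal widths are assumed)

# Extend the polygon to the horizontal axis one class width beyond each end.
xs = [marks[0] - w] + marks + [marks[-1] + w]
ys = [0] + freqs + [0]

# Area under the polygon: sum of the trapezoids between neighbouring points.
polygon_area = sum(
    (ys[i] + ys[i + 1]) / 2 * (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)
)

# Area under the histogram: each bar has width w and height f.
histogram_area = w * sum(freqs)

print(list(zip(xs, ys)))
print(polygon_area, histogram_area)
```

With equal class widths the two areas coincide, because each triangle cut off from a histogram bar by the polygon is matched by an equal triangle added next to it.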