Unit 2: Collection of Data
INTRODUCTION
Nowadays most executives and other decision makers base their decisions on research
findings. Research in most areas of study requires data in order to generate valuable
information that facilitates the decision-making process. Data are the raw material of research.
Moreover, the quality of the collected data largely determines the precision of the results
obtained from a specific investigation. Therefore, it is extremely important to know
the basics of data collection.
CONTENTS:
This block addresses two important points. First, the chapter discusses
the different sources of data. The methods of collecting data from primary sources
and the precautions to be taken when using data from secondary sources are the core
elements to be dealt with.
Second, the chapter covers the scope of statistical investigation and the different sampling
techniques, together with their merits and demerits.
At the end of this unit, you will be able to:
Identify sources of data together with their advantages and limitations.
Explain the various methods of collecting primary data.
Distinguish between random and non-random sampling techniques.
Discuss the various types of random and non-random sampling methods.
2.1 INTRODUCTION
The term “Data Collection” refers to all the issues related to data sources, scope of investigation
and sampling techniques. In this chapter, we start with the meaning
of data collection. Once the reader is acquainted with this meaning, the chapter
advances to the two sources of data, namely primary and secondary sources. In
addition, the different methods of collecting data from primary sources are discussed. Next, the
block addresses the meaning and the respective advantages and limitations of census and sample
survey, which are the two scopes of statistical investigation. Finally, the chapter closes by
discussing the two families of sampling techniques, namely random and non-random sampling
techniques.
Collection of data implies a systematic and meaningful assembly of information for the
accomplishment of the objective of a statistical investigation. It refers to the methods used in
gathering the required information from the units under investigation.
The quality of data greatly affects the final output of an investigation. Hence, utmost care should
be given to the data collection process, and every possible precaution should be taken to ensure
accuracy while collecting data. Otherwise, with inaccurate or inadequate data, the whole
analysis is likely to be faulty and the decisions based on it will be misleading.
In most cases, secondary data is obtained from sources such as census and survey reports, books,
official records, reported experimental results, previous research papers, bulletins, magazines,
newspapers, web sites, and other publications. Different organizations and government agencies
publish information (data) in the form of reports, periodicals, journals, etc. In
Ethiopia, the Central Statistics Authority (CSA) is the foremost publisher of such
information (secondary data).
The following are the major advantages of primary data over secondary data.
- Primary data gives more reliable, accurate and adequate information, which is
suitable to the objective and purpose of an investigation.
- A primary source usually shows data in greater detail.
- Primary data is free from the errors that may arise when figures are copied from
publications, as happens with secondary data.
Finally, it should be clear that primary data in the hands of one person might be secondary in the
hands of another. That is why it is often said, “the difference between primary and secondary
data is largely one of degree.”
After discussing the two sources of data, primary and secondary, it is logical to say a few words
about the methods employed in collecting data from its original or primary source.
Many authors commonly state three methods of collecting primary data. These are:
a. Personal Enquiry Method (Interview method)
b. Direct Observation
c. Questionnaire method
b. Direct Observation
In this approach, the investigator stays at the place of the survey and records the observations
personally. No enquiries are made in direct observation. For example, an investigator
studying the nutritional status of children may directly (physically) measure the weight,
height, and other required parameters himself/herself. Direct observation is more experimental
in nature and is usually applied in scientific studies. It is time-consuming and costly.
c. Questionnaire Method
Under this method, a list of questions related to the survey is prepared and sent to the various
respondents by post, web sites, e-mail, etc. It is a method that is often used in statistical
investigations. However, it cannot be used if the respondent is illiterate.
The following are the major points to take into account while preparing a
questionnaire.
- The number of questions should be small. Respondents are naturally not comfortable with
lengthy questionnaires, which usually bore them. Hence, fifteen to
twenty-five questions in a questionnaire are optimal. If a lengthy questionnaire is
unavoidable, it should preferably be divided into two or more parts.
- The questions should be short, clear, simple and unambiguous. Moreover, the questions must
be arranged in a logical order so that a natural and spontaneous reply to each is induced. For
instance, it is not appropriate to ask a person how many packets of cigarettes he/she smokes
before asking whether he/she smokes or not.
- Questions of a sensitive nature should be avoided. Sensitive questions are those that
are too personal or pecuniary, like “Sources of income”, “Drinking habit”, etc. The logic
here is that respondents do not willingly answer sensitive questions. Such information, if
necessary, may be gathered through an interview or through other indirect questions.
- Mail questionnaires should be accompanied by a covering letter, which should state the
purpose of the questionnaire, a promise of confidentiality of responses, etc.
Furthermore, the questions should preferably be designed in such a way that they can easily be
answered as yes/no.
A Sample Questionnaire
Suppose that it is required to identify the factors that affect the performance of students in their
first year (freshman) college life. The following questionnaire may be used to collect the
information that enables the achievement of the objective of the study.
i. General (Personal Background)
Name____________________
Age_________
Sex M F
Depending on their coverage, statistical investigations are usually carried out either in the form
of census or sample survey. These two approaches constitute the scope of statistical
investigation.
A census is an investigation in which all the units connected with the problem are taken into account.
Complete enumeration is the basic characteristic of a census. In the census approach, data is gathered
from each and every member of the population (universe).
On the other hand, in a sample survey only some selected representative units are studied. A sample
survey refers to the collection of information about a variable of interest from only a part (or
subset) of the population, called the sample. Sample elements are selected from a population through
one of several alternative sampling techniques. For example, suppose a quality controller wants to
know the average number of defective items in 10 batches of bottles produced, each batch
containing 1000 bottles. One possible approach is to check every one of the 1000 bottles in
all batches. Accordingly, the controller is going to check 10 × 1000 = 10,000 bottles. This is the
census approach. Alternatively, the controller may take a sample of two batches out of
the available 10 batches. In this case, he/she is going to check 2 × 1000 = 2000 bottles only.
Advantages of census
Information is available for each separate part of the universe,
The results obtained are likely to be more representative, accurate and reliable,
It is free from sampling error and therefore serves as a basis for various surveys,
It is easier to check and reduce coverage error.
Limitations of census
It requires a very large amount of effort, money and time,
Where the population is infinite, the census approach cannot be applied.
CYP 1
i) Discuss the difference between primary data and secondary data.
ii) What are parameters and Statistics?
In real-life problems a complete census, which is the enumeration of all units followed by analysis
of the characteristics of all units, may be impractical. This occurs for several reasons. The
population could be too large to manage with the available funds, time, and trained personnel. In
addition, the members of the population may be dispersed over different areas where
transportation, communication and other necessary facilities are not available. Furthermore, there
may be inaccessible areas, such as areas affected by war, epidemic disease, etc. In
such cases, complete enumeration or the census approach is not applicable, and the only
option to be considered is a sample survey. In the sample survey approach, members that represent the
whole population are selected using an appropriate sampling technique. We first introduce the
definitions of some important terms.
Sampling
Sampling is the process of taking a sample and making inferences about the whole population.
Elementary Units
Elementary units are elements or groups of elements in the population about which information
is required.
Sampling Units
Sampling unit is the unit in terms of which the enumerator collects the data. This unit may be a
geographical unit, a construction unit, or social groups or individuals.
The following table presents some examples of sampling units and corresponding possible
elementary units.
Note that there are a number of cases where the sampling unit and the elementary unit turn out to
be the same. The last example is a case in point.
Generally speaking, elementary units constitute sampling units. That is, a sampling unit is a
collection of one or more elementary units. Whenever that collection contains only one
elementary unit, the sampling unit and the elementary unit become identical. Borrowing terms
from set theory, we can conclude that elementary units are subsets, but not necessarily proper
subsets, of sampling units.
Sample Size
Sample size is the number of elements or observations in a sample.
Sampling Frame
Sampling frame is the listing of all sampling units in the population from which sample selection
is to be made at any stage of sampling.
The following are characteristics of a good sampling frame.
It should be exhaustive, covering the whole population,
It should be non-repetitive,
The units should be mutually exclusive,
There should be a clear and unambiguous demarcation between sampling units,
It should be up to date,
The units in the list must be traceable in the field.
Sampling error
Sampling error is the difference between the results obtained from a sample study and the results
that would have been obtained from an equally complete coverage (census), i.e., from an
investigation of the entire population conducted in exactly the same manner as the sample
study.
Sampling errors are those errors that occur only because we take a sample instead of studying the
whole population. The magnitude of these errors increases if the sample is not a good
representative of the population. Unfortunately, sampling errors are unavoidable in sample
surveys. As a result, the whole effort in sampling is to minimize them. Taking
a larger sample considerably reduces sampling errors. Bias of the enumerator,
bias in sample selection, bias in data collection, bias in analysis and interpretation, and
heterogeneity of the population are the major causes of sampling errors.
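The effect of sample size on sampling error can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population of 1000 incomes is made up, the statistic of interest is taken to be the mean, and the function names are hypothetical. It draws repeated simple random samples of several sizes and reports the average absolute difference between the sample mean and the population (census) mean.

```python
import random

def mean(values):
    return sum(values) / len(values)

def sampling_error(population, n, rng):
    """Sample mean minus population mean for one simple random sample of size n."""
    return mean(rng.sample(population, n)) - mean(population)

def mean_abs_error(population, n, rng, repeats=200):
    """Average size of the sampling error over repeated samples of size n."""
    return mean([abs(sampling_error(population, n, rng)) for _ in range(repeats)])

rng = random.Random(2024)  # fixed seed so that a run can be repeated
# Hypothetical population: 1000 made-up monthly incomes.
population = [rng.gauss(5000, 1200) for _ in range(1000)]

for n in (10, 100, 500, 1000):
    print(n, round(mean_abs_error(population, n, rng), 2))
```

When n equals the population size the sample is a census, so the error vanishes apart from floating-point rounding. Note that the simulation captures only the error due to sampling; bias in data collection or analysis is not modelled.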
Non-sampling errors
Non-sampling errors are those errors that can arise even in a census
(complete enumeration). Non-sampling errors often arise due to, among others, the following
factors:
Questions that are not worded properly and clearly,
Biases or mistakes on the part of the interviewers,
Inaccuracy of information furnished by respondents,
Cases of non-response, meaning that respondents may decline to give information, etc.
In the theory of sampling, there are two distinct types of sampling techniques. These are:
Probability Sampling and
Non-probability Sampling
Probability or random sampling is a kind of sampling technique in which each elementary unit in
the population has a known (pre-calculated) probability or chance of being included in the
sample. In probability sampling, items are chosen strictly at random. The selection process is
such that chance alone determines which items are included in the sample. Probability
sampling is further classified into two types, namely restricted and unrestricted random
sampling.
Lottery Method
Most of us use the lottery method in our day-to-day activities. In the lottery
method, each member of the population is represented by an identifiable disk. These disks are then
placed in an urn or bowl and mixed well. Small pieces of paper may be used instead of disks,
wrapped in such a way that one cannot be distinguished from another. A sample of the
required size is then drawn. Utmost care should be taken in carrying out the lottery method so
that it generates random sample elements.
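The lottery method can be imitated on a computer. The following is a minimal sketch, not a prescribed procedure: the function name `lottery_sample` and the seed are hypothetical, and the population of 180 numbered units is chosen only for illustration. Shuffling the list plays the role of mixing the bowl, and taking the first few slips corresponds to drawing without replacement.

```python
import random

def lottery_sample(population_ids, sample_size, seed=None):
    """Draw `sample_size` distinct ids, as if pulling slips from a well-mixed bowl."""
    rng = random.Random(seed)
    slips = list(population_ids)   # one slip per member of the population
    rng.shuffle(slips)             # mixing the bowl
    return slips[:sample_size]     # slips are drawn without replacement

# Hypothetical population of 180 units numbered 1 to 180.
drawn = lottery_sample(range(1, 181), 5, seed=7)
print(drawn)
```

Fixing the seed only makes the draw repeatable for checking; in an actual survey the draw would be left unseeded.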
Remark:
When all the numbers in a selected row or column are exhausted, we continue with the
immediately following row or column.
If the value of d is greater than five, we merge successive rows or columns so that
we obtain 10-digit, 15-digit, etc. numbers in each column or row.
Illustrative example
Consider a case where a population consists of 180 units and a simple random sample of size 5 is
to be taken from this population using a table of random numbers.
From the given random number table (see the table at the end of this text book) the 4th column
and 21st row are selected; the number at that point is 30193.
In our case,
N = 180, n = 5, N – 1 = 179
Since N – 1 = 179 is a three-digit number, we set d = 3.
Thus, we start from the number 30193 and read the last 3 digits row-wise or column-wise; let us read
column-wise. See it on the next page.
Therefore, the items or observations located at the 30th, 140th, 64th, 126th and 156th positions or
roll numbers will be taken as sample elements.
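The reading of a random number table can be sketched in code. The sketch rests on assumptions that should be checked against the table rules of the text: a position is accepted when its last d digits lie between 1 and N, and repeated positions are skipped. Only the starting entry 30193 comes from the example; the other entries are hypothetical stand-ins, chosen so that the walk-through yields the positions quoted above, and are not the entries of the book's table.

```python
def select_from_table(entries, population_size, sample_size):
    """Read the last d digits of each table entry and keep valid, unused positions."""
    d = len(str(population_size - 1))      # d = number of digits in N - 1
    chosen = []
    for entry in entries:
        position = entry % 10 ** d         # last d digits of the table entry
        # Assumed rule: accept positions from 1 to N and skip repeats.
        if 1 <= position <= population_size and position not in chosen:
            chosen.append(position)
        if len(chosen) == sample_size:
            break
    return chosen

# 30193 is the starting entry quoted in the text; the remaining entries are
# hypothetical stand-ins for the numbers read down the column.
column = [30193, 51030, 77140, 20964, 45064, 88126, 19156]
print(select_from_table(column, population_size=180, sample_size=5))
# prints [30, 140, 64, 126, 156]
```

Here 193 and 964 are rejected because they exceed N = 180, which is why more than five entries have to be read to obtain five positions.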
Stratified Sampling
Stratified sampling is a procedure that involves the division or stratification of a population by
partitioning the sampling frame into a number of homogeneous groups or strata on the basis of
certain characteristic(s) of the sampling units. A number of variables, including geographical,
demographic, social, economic, ethnic and political ones, may be used for stratification.
The selection of the appropriate variable(s) depends on the nature of the study. Sampling
is then performed separately within each stratum.
For example, in an opinion survey, the population may be divided into homogeneous groups
according to qualification, age, sex, size of family, etc. Specifically, if the population is
stratified according to the variable sex, then we have two separate strata,
namely males and females.
Each individual group formed after stratification is called a stratum (singular). A collection of
such groups is called strata (plural).
Proportional allocation method: The idea of the proportional allocation method is to fix the
sample size of a given stratum according to its share of the population. Suppose we have k strata
obtained from a population of size N, the ith stratum containing $N_i$ elements. In addition, let n
be the total sample size required, and let $n_i$ be the sample size to be taken from the ith stratum.
According to the proportional allocation method, the population proportion should be equal to the
sample proportion. That is,
$$\frac{N_i}{N} = \frac{n_i}{n} \quad \Longrightarrow \quad n_i = \frac{N_i\, n}{N}$$
where $N = N_1 + N_2 + \dots + N_k$
and $n = n_1 + n_2 + \dots + n_k$.
Observe that
$$\sum_{i=1}^{k} n_i = \sum_{i=1}^{k} \frac{N_i\, n}{N} = \frac{n}{N} \sum_{i=1}^{k} N_i = \frac{n}{N} \cdot N = n$$
In general, according to the proportional allocation method the sample size to be taken from the
ith stratum is given by $n_i = \dfrac{N_i\, n}{N}$.
Example (Hypothetical)
Dinkinesh Ethiopia Tour has 1000 employees placed in 4 departments: Accounting and Finance,
Personnel, Operations, and Marketing. A student wanted to carry out research in this
company and decided to take a stratified sample of 100 employees. It is known that
there are 50 employees in Personnel, 500 in Operations, 300 in Accounting and Finance, and 150
in Marketing. What number of employees should be taken from each of these 4 departments,
using the proportional allocation method?
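Solution:
Here N = 1000 and n = 100, so the sample size for each department is its size multiplied by n / N = 100 / 1000.
n1 = (300 × 100) / 1000 = 30 (Accounting and Finance)
n2 = (50 × 100) / 1000 = 5 (Personnel)
n3 = (150 × 100) / 1000 = 15 (Marketing)
n4 = (500 × 100) / 1000 = 50 (Operations)
or n4 = n – n1 – n2 – n3 = 100 – 30 – 5 – 15 = 50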
Note: Rounding of figures to an integer value is a customary action in determining sample sizes.
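The proportional allocation rule, in which each stratum receives its size multiplied by n / N, is easy to check by computer. The following is a minimal sketch using the figures of the example; the function name `proportional_allocation` and the dictionary layout are illustrative choices, not part of the text.

```python
def proportional_allocation(stratum_sizes, n):
    """Return n_i = N_i * n / N for each stratum, rounded to the nearest integer."""
    N = sum(stratum_sizes.values())
    return {name: round(N_i * n / N) for name, N_i in stratum_sizes.items()}

# Department sizes taken from the Dinkinesh Ethiopia Tour example.
departments = {
    "Accounting and Finance": 300,
    "Personnel": 50,
    "Marketing": 150,
    "Operations": 500,
}
allocation = proportional_allocation(departments, n=100)
print(allocation)
```

In this example every quotient is a whole number. When rounding is needed, the rounded stratum sizes may not add up exactly to n, so the total should be checked and adjusted by hand.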
Cluster Sampling
In simple random sampling and stratified random sampling, we have been considering the
smallest well-identifiable units of the population, called elementary units. That is, the
observations have been taken on these elementary units. For several reasons, however, such an
approach may sometimes not be applicable. Some of these reasons are:
- The sampling frame may not be available, or it may be prohibitively expensive to construct
in terms of resources like money, time and labor.
- Elementary units may be situated far apart from one another, so that surveying the selected
units would consume a lot of time and money.
- The elementary units may not be well identifiable and easily locatable. Specifically,
migratory elementary units, like birds, are not easily identifiable and locatable.
Thus, to cope with the above and other related problems in sampling elementary units, a
sampling plan known as cluster sampling or area sampling is used.
Cluster sampling is a sampling technique that is preferred when the population is subdivided into
groups or clusters. In most cases, the clusters are formed location wise.
Examples of clusters
Single-stage and multi-stage cluster sampling are the two types or plans of the cluster sampling
technique. However, the scope of this material is limited to the single-stage sampling plan.
In a single-stage sampling plan, clusters are chosen using simple random sampling and, within each
sampled cluster, all the elementary units are taken as sample units.
To illustrate the idea, suppose we want to take a sample of 1000 college students in Addis
Ababa. Our clusters can be the different colleges in Addis Ababa. According to the cluster
sampling procedure, we take a simple random sample of one college and consider all the students
in that college. To that end, let college A be selected from the 20 available colleges using the table of
random numbers. Assume also that there are 5000 students in each of these 20 (hypothetical)
colleges. Thereafter, since only 1000 students are required, take a simple random sample of
1000 students out of the 5000 students in college A.
Note that where the size of a cluster is less than the required sample size, we will be
forced to take two or more clusters by simple random sampling instead of one.
The basic premise of cluster sampling is that clusters are internally heterogeneous and externally
homogeneous.
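Single-stage cluster sampling, as defined above, can be sketched in a few lines. The frame below is hypothetical (20 colleges with 5000 made-up student ids each) and the function name is an illustrative choice. Whole clusters are picked by simple random sampling and every unit inside a picked cluster enters the sample.

```python
import random

def single_stage_cluster_sample(clusters, num_clusters, seed=None):
    """Pick whole clusters by simple random sampling and return every unit in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)   # random choice of cluster names
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

# Hypothetical frame: 20 colleges with 5000 student ids each.
colleges = {
    f"College {c}": [f"C{c}-S{s}" for s in range(1, 5001)]
    for c in range(1, 21)
}
chosen, students = single_stage_cluster_sample(colleges, num_clusters=1, seed=1)
print(chosen, len(students))
```

Only the list of colleges is needed to make the selection; a list of individual students is required only for the college that is actually chosen.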
CYP 2
i) What is the difference between stratified and cluster sampling techniques?
ii) What are the two methods of taking simple random samples? Which one is more reliable?
Systematic Sampling is a method of selecting units at a fixed interval from a list, starting from a
randomly selected point. It follows a step-by-step procedure.
Suppose a sample of size n is to be selected from a population of size N using the systematic
sampling technique. The following steps are followed in the sample selection
process.
Step 1. Calculate the interval size as $k = \dfrac{N}{n}$; k must be an integer.
Step 2. Select a random number between 1 and k inclusive. Let that random number be S.
Step 3. Select every kth index from the sampling frame, starting from S. That is,
Index of the 1st sample is S
Index of the 2nd sample is S + k
Index of the 3rd sample is S + 2k, etc.
Generally, the index of the ith sample element is
S + (i – 1)k
In particular, for the nth sample the index is S + (n – 1)k
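The three steps can be written out as a short function. This is a sketch under the assumptions stated in the steps (k = N / n must be an integer and S lies between 1 and k); the function name `systematic_indices` and the values N = 100, n = 20 and S = 3 are chosen only for illustration.

```python
import random

def systematic_indices(N, n, start=None, seed=None):
    """Return the 1-based indices S, S + k, ..., S + (n - 1)k with k = N / n."""
    if N % n != 0:
        raise ValueError("k = N / n must be an integer")
    k = N // n                                        # Step 1: interval size
    if start is None:
        start = random.Random(seed).randint(1, k)     # Step 2: random S in 1..k
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and k")
    return [start + (i - 1) * k for i in range(1, n + 1)]   # Step 3

indices = systematic_indices(N=100, n=20, start=3)
print(indices)
```

With these values the interval is k = 5, so the indices run 3, 8, 13 and so on up to 98. The returned numbers are positions in the sampling frame, not observations.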
Examples
Therefore, from our sampling frame of households, we will take the 60th, 166th, 272nd, 378th,
484th, 590th, 696th, 802nd, 908th and 1014th elements.
It is extremely important to note that these numbers are not our observations. They are simply
roll numbers (indices) at which our sample observations are located. In the first case, for instance,
the household located at the 60th place or position will be included in our sample.
2. In a systematic sample, it was found that the 2nd and 7th samples correspond to the indices 8
and 33 respectively. Find
a. the value of k (interval) and the index of the first sample.
b. the index of the 10th sample.
c. the sample size if the population size is 100.
Solution:
a. Given that the 2nd sample element has index 8 and the 7th sample element has index 33, the
formula S + (i – 1)k gives the following two equations for i = 2 and i = 7:
S + (2 – 1) k = 8
and S + (7 – 1) k = 33
Subtracting the first equation from the second gives 5k = 25, so
k = 25 / 5 = 5
and S + k = 8, so S = 8 – k = 8 – 5 = 3
Therefore, we have the values k = 5 and S = 3, where S = 3 is the index or serial number of the
first sample element.
b. i = 10
10th = S + (10 – 1)k = 3 + (9)(5) = 48
c. With N = 100 and k = 5, the sample size is n = N / k = 100 / 5 = 20.
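The same reasoning can be checked numerically. The sketch below solves the two equations S + (i – 1)k = index for any pair of known sample positions; the function name `solve_interval_and_start` is a hypothetical helper, not part of the text.

```python
def solve_interval_and_start(i, index_i, j, index_j):
    """Solve S + (i - 1)k = index_i and S + (j - 1)k = index_j for integer k and S."""
    diff, steps = index_j - index_i, j - i
    if steps == 0 or diff % steps != 0:
        raise ValueError("the two observations do not give an integer interval")
    k = diff // steps                 # subtracting the equations gives (j - i)k = diff
    S = index_i - (i - 1) * k         # back-substitute into the first equation
    return k, S

k, S = solve_interval_and_start(2, 8, 7, 33)
tenth = S + (10 - 1) * k
print(k, S, tenth)   # prints 5 3 48
```

The check on the remainder rejects pairs of indices that could not have come from a systematic sample with an integer interval.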
Unlike the probability (random) sampling techniques discussed in part “a” of this section, in
non-probability sampling techniques there is no predetermined probability or chance for a given
elementary unit of the population to be included in the sample. Non-probability sampling
techniques do not use randomization. In non-probability sampling, sample elements are selected
based on such factors as judgment, convenience, preference, intuition, quota, etc.
There are three common types of non-random (non-probability) sampling techniques. These are:
-Judgment Sampling
-Convenience Sampling
-Quota Sampling
Judgment Sampling
In Judgment Sampling, samples are selected solely on the basis of the judgment of the
investigator, who is believed to be skilled and experienced in such a practice. In simple words,
the investigator’s preference alone determines the inclusion or exclusion
of an element in or from the sample. Clearly, the investigator is the person responsible
for how well or badly the resulting sample represents the population.
Advantages
-It saves time and reduces cost,
-Sometimes it is used in solving economic and business problems.
Limitations
-It is highly subject to human bias,
-There is no objective (scientific) way of evaluating sample results.
Overall, the success of judgment sampling depends entirely on how well the investigator
knows the population.
According to the convenience sampling approach, sample elements are obtained by selecting
population units that are convenient for the investigator. It is a relatively easy way to select
a sample, but the sample will hardly be representative of the population. Put simply, what we
do in convenience sampling is take as a sample any elements of the population that are
convenient (preferable) to us in terms of cost, time, accessibility, suitability for interview,
etc. Nevertheless, it should be kept in mind that a poorly representative sample results
in less accurate outcomes. In other words, if convenience sampling is used without assuring
its appropriateness, the whole process will be “garbage in, garbage out”.
Advantages
It is simple and cost effective,
It requires relatively shorter time,
It is often used for pilot surveys
Limitations
Hardly representative of the population,
It is extremely exposed to human bias.
Quota sampling is a type of judgment sampling in which quotas are set up according to given
criteria and the selection of sample units within the prescribed quota is made according to the
personal judgment of the investigator.
Quota sampling can be viewed as a combination of the concepts of stratified and
judgment sampling. The stratification concept arises when we divide the population
into parts and assign an individual quota to each one of them. In other words, the quota assigned to
a given group may be viewed as the sample size allocated to a given stratum in
stratified sampling. The judgment-sampling concept, on the other hand, comes in when we
take the sample from each group: in quota sampling, samples are taken from each group using the
judgment of the investigator. Recall, however, that in stratified sampling the samples
from each stratum (group) are taken at random, not by judgment.
Advantages
Mostly applicable in social studies,
It occasionally provides satisfactory results if the remaining processes, like interviewing, are
carried out with utmost care.
Limitations
Slight negligence on the part of the interviewer may seriously distort the final
results.
2.7 SUMMARY
Data may be collected or obtained from two major sources, primary and secondary. Data
obtained from a primary (original) source is called primary data. On the other hand, data taken
from secondary sources, like magazines, newspapers and other publications, is called
secondary data. Primary data is collected through different methods including interview,
questionnaire, and direct or physical observation.
Census and sample survey are the two distinct scopes of statistical investigation. In the census
approach, data is collected from each and every member of the population. In a sample survey,
however, data is recorded or taken only from some portion of the population, called the sample.
The quality of the results obtained from a sample survey relies mainly on the
appropriateness of the selected sample as a representative of the population.
In order to assure the representativeness of a sample, one should carefully select the
most suitable sampling technique under the prevailing conditions and
circumstances. There are random (probability) sampling and non-random (non-probability)
sampling techniques. Within random sampling we have four sub-categories, namely
Simple Random Sampling (SRS), Stratified Sampling, Cluster Sampling and Systematic
Sampling. Within non-random sampling we have three sub-categories, namely
Judgment, Convenience, and Quota Sampling.
CYP 1
i) Primary data is data obtained or collected from the original sources using interview,
questionnaire or direct observation (physical measurement) methods. When data is collected or
obtained for the first time, it is called primary data.
On the other hand, secondary data is data obtained from secondary sources like web sites,
magazines, newspapers, annual reports, etc. Whenever we make use of data collected by another
agency for some other purpose, we are said to be secondary users, as we are not the owner of
the data.
ii) Parameters are results that refer to the population as a whole. That is, parameters are values
that determine the characteristics of a population. They include the population average (μ), standard
deviation (σ), correlation coefficient (ρ), etc. A population is characterized by its parameters.
Statistics are also results; however, they refer not to the whole of a population but only to a
part of it, called the sample. Any result calculated or obtained from a sample is called a statistic. We
can calculate or generate a number of results, like the average (x̄), standard deviation (s), etc., from
a single sample. These results are collectively called statistics. In short, parameters are
population values while statistics are sample values.
CYP 2
i) Both Stratified and Cluster Sampling techniques are parts of random sampling technique.
Nevertheless, they differ at least in the following points.
Stratified sampling                             Cluster sampling
* Stratification is based on some variable      * Partitioning of the population is mainly
  of interest.                                    based on area (location).
* Samples are taken from each stratum.          * Samples are taken from a few clusters only.
* Only a few elements from each stratum are     * All the elements within a selected cluster
  included in the sample.                         are included in the sample.
* Strata are internally homogeneous and         * Clusters are internally heterogeneous and
  externally heterogeneous.                       externally homogeneous.
* It requires a full sampling frame.            * It does not require a full sampling frame.
* Costly.                                       * Less costly.
* More accurate results can be obtained.        * Results are less accurate than those of
                                                  stratified sampling.
1) In a systematic random sampling, the 10th and 15th sample elements correspond to the indices
(serial numbers) 68 and 103 respectively. Find the index for the 5th systematic sample.
4) Unity University College has registered 12,000 students over the last four years. The college
administration would like to know the number of students who have participated in
co-curricular activities. For the purpose of the study, the administrator collected the names of
400 students from the files by taking a proportional number of students from each of the years
(batches) for interview.
Based on the above information, find
a. The variable of interest
b. The source of data (primary or secondary)
c. The population
d. The sample
e. The sampling technique used
2.10 GLOSSARY
2.11 REFERENCES
Agarwal, B.L. (1991). “Basic Statistics,” Second Edition. Wiley Eastern Limited, India.
Bluman, Allan G. (1992). “Elementary Statistics: A Step-by-Step Approach,” Second
Edition. Wm. C. Brown Communications, Inc., USA.
Cochran, W.G. (1977). “Sampling Techniques,” Third Edition. John Wiley and Sons Inc.,
India.
Gupta, C.B. (1995). “An Introduction to Statistical Methods,” Nineteenth Edition. Vikas
Publishing House Pvt. Ltd., New Delhi.
Gupta, S.P. (1991). “Statistical Methods,” Twenty-sixth Edition. Sultan Chand and Sons
Publishers, New Delhi.