Data Sampling
• Data sampling is a statistical analysis technique used to select, manipulate
and analyze a representative subset of data points to identify patterns and
trends in the larger data set being examined. It enables data scientists,
predictive modelers and other data analysts to work with a small,
manageable amount of data about a statistical population to build and run
analytical models more quickly, while still producing accurate findings.
• Sampling can be particularly useful with data sets that are too large to
efficiently analyze in full, for example in big data analytics applications
or surveys. Identifying and analyzing a representative sample is more
efficient and cost-effective than surveying the entirety of the data or
population.
Populations and Samples
• Population: A population is a group of elements that share common
characteristics. It is the collection of observations about
which we would like to make inferences.
• Sample: A sample is a subset of the population.
• Sampling: Sampling is the process of selecting a sample from the
population. Sampling units are non-overlapping collections of
elements from the population.
• An important consideration, though, is the size of the required data
sample and the possibility of introducing a sampling error. In some
cases, a small sample can reveal the most important information about a
data set. In others, using a larger sample can increase the likelihood of
accurately representing the data as a whole, even though the increased
size of the sample may impede ease of manipulation and interpretation.
Sampling Error
• Sampling error is the difference between an estimate computed from a
sample and the true value for the population.
• The core assumption of data sampling is that the sample is a
representative subset of the population, so the sample mean is
approximately equal to the mean of the population.
• The degree to which this does not hold is called the sampling error.
• We can reduce sampling error by following sampling best practices, such
as using a large enough sample size, choosing an appropriate sampling
method, and avoiding sampling bias.
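The effect of sample size on sampling error can be illustrated with a small simulation. This is only a sketch: the population is synthetic, and the names sampling_error and avg_errors are made up for this example. It draws repeated simple random samples of several sizes and reports the average gap between the sample mean and the population mean.

```python
import random
from statistics import fmean

def sampling_error(population, population_mean, sample_size, rng):
    """Absolute gap between one sample's mean and the population mean."""
    sample = rng.sample(population, sample_size)
    return abs(fmean(sample) - population_mean)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical population: 10,000 synthetic values, not real data.
population = [rng.gauss(50, 10) for _ in range(10_000)]
population_mean = fmean(population)

avg_errors = {}
for n in (10, 100, 1000):
    # Average over repeated draws so one lucky sample does not mislead.
    errors = [sampling_error(population, population_mean, n, rng)
              for _ in range(200)]
    avg_errors[n] = fmean(errors)
    print(f"n={n:5d}  average sampling error = {avg_errors[n]:.3f}")
```

In this setup the average error should shrink as the sample size grows, which is the reason a large enough sample is listed among the best practices.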
Data Sampling Methods
When taking a sample from a larger population, you must make sure
that the sample is of an appropriate size and free of bias.
There are two types of sampling
• Probability sampling
• Non-probability sampling
Probability Sampling:
Every element in the population has a known, nonzero chance of being
selected; in the simplest designs that chance is equal for every element.
A sampling method is biased if some members of the population are
systematically more or less likely to be in the sample than others.
Different types of probability sampling
• Simple random sampling
• Stratified sampling
• Systematic sampling
• Cluster sampling
Simple random sampling:
• It is a method of sampling in which every element of the universe has an
equal probability of being chosen, for example drawing a name in a
lottery. The advantage of this method is that it is free from personal
bias, and the universe is fairly represented by the sample.
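A minimal sketch of simple random sampling, using Python's standard random module. The function name simple_random_sample and the numbered "universe" of 100 elements are assumptions made for illustration.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct elements; each element is equally likely to be picked."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, k)

# Hypothetical universe: elements numbered 1 to 100.
universe = list(range(1, 101))
sample = simple_random_sample(universe, 10, seed=42)
print(sample)
```

random.Random.sample draws without replacement, so no element appears twice in the sample.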
Stratified sampling:
• The population is broken down into non-overlapping groups called strata
(elements are homogeneous within a stratum and heterogeneous between strata).
Random samples are then taken from each stratum, so that the entire population
is represented. The advantage of this method is that it covers every subgroup
of the population. However, there is a possibility of bias when the population
is classified into strata.
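The stratified procedure can be sketched as follows, assuming proportional allocation (the same fraction is drawn from every stratum), which is one common choice and not the only one. The function stratified_sample, its parameters, and the urban/rural records are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=None):
    """Draw the same fraction at random from every stratum."""
    rng = random.Random(seed)
    # Step 1: split the population into non-overlapping strata.
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    # Step 2: take a simple random sample inside each stratum.
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 80 urban and 20 rural records.
records = [("urban", i) for i in range(80)] + [("rural", i) for i in range(20)]
sample = stratified_sample(records, stratum_of=lambda r: r[0], fraction=0.1, seed=1)
print(sample)
```

With a 10% fraction the sample holds 8 urban and 2 rural records, so the strata keep the same proportions as in the population.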
Systematic sampling:
• Samples are selected from the population according to a predetermined
rule. In other words, every nth element is selected from the population
as a sample. Arrange all the elements in a sequence and then select
elements at regular intervals.
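A minimal sketch of systematic sampling. It assumes the elements are already arranged in a sequence, and it uses a random starting point within the first interval, which is a common convention rather than something stated above. The name systematic_sample is hypothetical.

```python
import random

def systematic_sample(elements, step, seed=None):
    """Pick every step-th element, starting at a random offset below step."""
    if step < 1:
        raise ValueError("step must be at least 1")
    rng = random.Random(seed)
    start = rng.randrange(step)  # random start inside the first interval
    return elements[start::step]

# Hypothetical ordered population of 100 elements.
elements = list(range(100))
sample = systematic_sample(elements, step=10, seed=3)
print(sample)
```

If the sequence has a periodic pattern that lines up with the step, the sample can be unrepresentative, so the ordering of the elements matters.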
Cluster sampling:
• The population is broken down into many different clusters, and then
whole clusters (subgroups) are randomly selected. For example, clusters
may be formed by location, age group or sex.
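A minimal sketch of cluster sampling, assuming one-stage sampling in which every member of a selected cluster is kept. The function cluster_sample and the city clusters are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names
    return {name: clusters[name] for name in chosen}

# Hypothetical clusters keyed by location.
clusters = {
    "north": ["n1", "n2", "n3"],
    "south": ["s1", "s2"],
    "east": ["e1", "e2", "e3", "e4"],
    "west": ["w1"],
}
selected = cluster_sample(clusters, n_clusters=2, seed=5)
print(selected)
```

The random choice is made among clusters, not among individual elements, which is what separates this method from stratified sampling.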
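Non-probability Sampling:
Samples are selected by non-random criteria such as judgment or
availability, so not every element has a known chance of being selected.
Different types of non-probability sampling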
• Purposive sampling
• Convenience sampling
• Quota sampling
• Snowball/referral sampling
Purposive sampling:
• Purposive sampling is also known as judgment sampling. Samples are
selected based on the purpose or intention of the research. The method is
flexible enough to allow the inclusion of items that are of special
significance.
Convenience sampling:
• Convenience sampling is one of the easiest sampling methods. Sample
selection is based on availability: the researcher selects the samples
that are most convenient to reach.
Quota sampling:
• It resembles stratified sampling, except that samples are collected
non-randomly in each subgroup until the desired quota is met. The
proportion of a subgroup in the sample need not match the proportion of
that subgroup in the population.
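Quota filling can be sketched as below. The sketch assumes respondents arrive in some fixed order and are accepted first come, first served until each subgroup's quota is full; the function quota_sample, the age groups, and the quota numbers are all hypothetical.

```python
def quota_sample(stream, group_of, quotas):
    """Accept items in arrival order until every group's quota is met."""
    remaining = dict(quotas)  # copy so the caller's quotas are untouched
    sample = []
    for item in stream:
        group = group_of(item)
        if remaining.get(group, 0) > 0:
            sample.append(item)
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # all quotas filled, stop collecting
    return sample

# Hypothetical arrivals: every third respondent is in the "30_plus" group.
arrivals = [("under_30" if i % 3 else "30_plus", i) for i in range(30)]
sample = quota_sample(arrivals, group_of=lambda r: r[0],
                      quotas={"under_30": 4, "30_plus": 2})
print(sample)
```

No random draw is involved, so who ends up in the sample depends entirely on the arrival order. That is why quota sampling counts as a non-probability method.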
Snowball/referral sampling:
• Snowball sampling, or referral sampling, is a method widely used in
medical and social science surveys where the population is unknown and
samples are difficult to obtain. Researchers therefore ask existing
participants to refer others who fit the population. Since it is based
on referrals, there is a chance of bias.
Kinds of Sampling Bias
Sampling bias occurs when samples are collected in such a way that some
elements of the intended population are less or more likely to be
selected than others.
The following are different types of sampling bias
• Response Bias: A response or data bias is a systematic bias that occurs
during data collection and influences the responses.
• Voluntary Response Bias: Occurs when individuals can choose whether to
participate.
• Non-response Bias: Occurs when units selected as part of the sampling
procedure do not respond, in whole or in part.
• Convenience Bias: Occurs when the sample is taken from individuals who
are conveniently available.
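Voluntary response bias can be illustrated with a small simulation. Everything here is an assumption made for the example: the satisfaction scores are synthetic, and the response model (unhappy people answer more often) is invented to show the effect, not taken from any real survey.

```python
import random
from statistics import fmean

rng = random.Random(7)  # fixed seed so the run is repeatable
# Hypothetical satisfaction scores from 1 (unhappy) to 5 (happy).
population = [rng.randint(1, 5) for _ in range(10_000)]

def volunteers(score):
    # Assumed response model: unhappy people are more likely to answer.
    # Score 1 responds with probability 0.5, score 5 with probability 0.1.
    return rng.random() < (6 - score) / 10

voluntary_sample = [s for s in population if volunteers(s)]
# A simple random sample of the same size, for comparison.
random_sample = rng.sample(population, len(voluntary_sample))

print(f"population mean       = {fmean(population):.2f}")
print(f"voluntary sample mean = {fmean(voluntary_sample):.2f}")
print(f"random sample mean    = {fmean(random_sample):.2f}")
```

Under this assumed model the voluntary sample understates average satisfaction, while the random sample of the same size stays close to the population mean.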