Rao 2020
https://doi.org/10.1007/s13571-020-00227-w
© Indian Statistical Institute 2020
Abstract
Survey samplers have long been using probability samples from one or more
sources in conjunction with census and administrative data to make valid and
efficient inferences on finite population parameters. This topic has received a lot of
attention more recently in the context of data from non-probability samples such
as transaction data, web surveys and social media data. In this paper, I will
provide a brief overview of probability sampling methods first and then discuss
some recent methods, based on models for the non-probability samples, which
could lead to useful inferences from a non-probability sample by itself or when
combined with a probability sample. I will also explain how big data may be used
as predictors in small area estimation, a topic of current interest because of the
growing demand for reliable local area statistics.
1 Introduction
Sample surveys have long been conducted to obtain reliable estimates of finite
population descriptive parameters, such as totals, means and quantiles, and asso-
ciated standard errors and normal theory confidence intervals with large enough
sample sizes. Probability sampling designs and repeated sampling inference, also
called the design-based approach, has played a dominant role, especially in the
production of official statistics, ever since the publication of the landmark paper by
Neyman (1934) which laid the theoretical foundations of the design-based ap-
proach. The design-based approach was almost universally accepted by practicing
official statisticians. The early landmark contributions to the design-based ap-
proach, outlined in Section 2.1, were mostly motivated by practical and efficiency
considerations. Methods for combining two or more probability samples were also
developed to increase the efficiency of estimators for a given cost (Section 3). Data
2 J. N. K. Rao
collection issues have received considerable attention in recent years to control costs
and maintain response rates using new modes of data collection (Section 4.1).
In spite of the efforts under the probability sampling setup to gather
designed data, response rates are decreasing and costs are rising (Williams
and Brick 2018). At the same time, due to technological innovations, large
amounts of inexpensive data, called big data or organic data, and data from
non-probability samples (especially online panels) are now accessible. Big data
include the following types: administrative data, transaction data, social media
data, internet-of-things data, data scraped from websites, sensor data and satellite
images. Big data and data from web panels have the potential of providing
estimates in near real time, unlike traditional data derived from probability
samples. Statistical agencies publishing official statistics are now taking mod-
ernization initiatives by finding new ways to integrate data from a variety of
sources and produce “reliable” near real-time official statistics. However, naïve
use of such data can lead to serious sample selection bias (Section 4.2) and,
without adjustment to reduce selection bias, it can lead to the "big data
paradox: the bigger the data, the surer we fool ourselves" (Meng 2018). Sections
5, 6 and 7 discuss methods for attempting to reduce sample selection bias under
different scenarios. These methods build on the techniques discussed in Sections
2, 3 and 4 for making efficient inferences from probability samples.
Big data has the potential of providing good predictors for models useful in
small area estimation (Section 8). In the case of small areas with very small sample
sizes within areas, direct estimators based on traditional probability sampling do
not provide adequate precision and it becomes necessary to use model-based
methods to “borrow strength” across related areas through linking models based
on suitable predictors, such as census data and administrative records. Big data
predictors can be useful as additional predictors in the linking models.
2 Probability Sampling
sampling may perform poorly if the underlying model assumptions are violated.
This result is of importance in the context of making inferences from non-
probability samples based on implicit or explicit models.
A general design-unbiased estimator of the population total is of the form
$\hat{Y} = \sum_{i\in A} d_i y_i$, where $d_i = \pi_i^{-1}$ is called the design weight (Horvitz and
Thompson 1952; Narain 1951). This estimator may be expressed as
$\hat{Y} = \sum_{i\in U} (a_i d_i) y_i$, where $a_i$ is the indicator variable for the inclusion of pop-
ulation unit $i$ in the sample and is equal to 1 with probability $\pi_i$ and 0 with
probability $1 - \pi_i$. In the case of stratified simple random sampling, the design
weights are equal to the inverses of sampling fractions within strata and they
may vary across strata. Design-unbiased estimators of the variance of $\hat{Y}$,
denoted by $v(\hat{Y}) = s^2(\hat{Y})$, can be obtained provided all the joint inclusion
probabilities are positive, as is the case under stratified simple random
sampling. Sampling practitioners often use the coefficient of variation,
$C(\hat{Y}) = s(\hat{Y})/\hat{Y}$, as a measure of precision of the estimator $\hat{Y}$ and con-
struct 95% normal theory confidence intervals on the total as
$\{\hat{Y} - 2s(\hat{Y}),\; \hat{Y} + 2s(\hat{Y})\}$. This assumes large samples and single-stage
sampling. Neyman noted that for large enough samples the frequency of
errors in the confidence statements based on all possible stratified simple
random samples that could be drawn does not exceed the limit prescribed in
advance "whatever the unknown properties of the finite population".
The above attractive features of probability sampling and design-based
inference were recognized soon after Neyman's paper appeared. This, in turn,
led to the use of probability sampling in a variety of sample surveys covering
large populations, and to theoretical developments of efficient sampling designs
minimizing cost subject to specified precision of the estimators, and associated
estimation theory. I will now list a few important post-Neyman theoretical
developments under the design-based approach.
Mahalanobis used probability sampling designs for surveys in India as early
as 1937. His classic 1944 paper (Mahalanobis 1944) rigorously formulated cost
and variance functions for the efficient design of sample surveys based on
probability sampling. He studied simple random sampling, stratified random
sampling, single-stage cluster sampling and stratified cluster sampling for the
efficient design of sample surveys of different crops in Bengal, India. He also
extended the theoretical setup to subsampling of sampled clusters (which he
named two-stage sampling). He was instrumental in establishing the Na-
tional Sample Survey (NSS) of India and the world-famous Indian Statistical
Institute. The NSS is the largest multi-subject continuing survey with full-time
staff using personal interviews for socio-economic surveys and physical
measurements for crop surveys. He attracted brilliant survey statisticians to
work with him, including D. B. Lahiri, M. N. Murthy and Des Raj. Hall (2003)
provides a scholarly historical account of the pioneering contributions of
Mahalanobis to the early developments of survey sampling theory and methods
in India. P. V. Sukhatme, who studied under Neyman, also made pioneering
contributions to the design and analysis of large-scale agricultural surveys in
India, using stratified multi-stage sampling.
Under the leadership of Morris Hansen, survey statisticians at the U. S.
Census Bureau made fundamental contributions to the theory and methods of
probability sampling and design-based inference, during the period 1940-1960.
The most significant contributions from the Census Bureau group include the
development of the basic theory of stratified two-stage cluster sampling with
one cluster (or primary sampling unit) within each stratum drawn with prob-
ability proportional to size (PPS), where size measures are obtained from
external data sources such as a recent census (Hansen and Hurwitz 1943).
Sampled clusters may then be subsampled at a rate to provide an overall self-
weighting (equal overall probability of selection) sample design. Such probabil-
ity sampling designs can lead to significant variance reduction by controlling the
variability arising from unequal cluster sizes. Another major contribution was
the introduction of rotation sampling with partial replacement of households to
handle response burden in surveys repeated over time, such as the monthly U. S.
Current Population Survey (CPS) for measuring monthly unemployment rates
and month-to-month changes in the labor force. Hansen et al. (1955) developed
efficient composite estimators of level and change under rotation sampling. This
methodology is widely used in large-scale rotating panel surveys. Yet another
significant contribution is the development of a unified approach for construct-
ing confidence intervals for quantiles, such as the median, applicable to general
probability sampling designs (Woodruff 1952). This method remains a corner-
stone for making inference on the population quantiles.
As described above, population information on auxiliary variables related to
a variable of interest (the study variable) is often used at the design stage for
stratification or PPS sampling or both. Use of information on auxiliary variables
at the estimation stage was also advocated. In particular, ratio estimation based
on a single auxiliary variable x correlated with y has been widely used. The value
xi is obtained for each unit in the sample and the population total X = ∑i ∈ Uxi
must be known. The sample values of x are either directly observed along with
the associated values of yor through
record linkage. A ratio estimator of the
b b b
total is of the form Y r ¼ Y =X X ¼ RX, b where X b ¼ ∑i∈A d i x i is the design-
unbiased estimator of the known total X. It is not necessary to know all the
individual population values xi to implement the ratio estimator, unlike the use
of xi at the design stage. The ratio estimator also enjoys the calibration property,
ON MAKING VALID INFERENCES BY... 5
namely it reduces to the known total $X$ when $y_i$ is replaced by $x_i$. It can lead to
considerable gain in efficiency when $y_i$ is roughly proportional to $x_i$. The above
desirable properties and the computational simplicity of the ratio estimator led
to extensive use in surveys based on probability sampling. The well-known
Hájek estimator of the total is a special case of the ratio estimator. It is given
by $\hat{Y}_H = (\hat{Y}/\hat{N})N$, where $\hat{N} = \sum_{i\in A} d_i$. The Hájek estimator of the popula-
tion mean is given by $\hat{\bar{Y}}_H = \hat{Y}/\hat{N}$; it is widely used in practice and it does
not require knowledge of $N$, unlike the unbiased estimator $\hat{Y}/N$.
The ratio estimator is not generally design unbiased but it is design consistent
for large samples. Survey researchers do not insist on design unbiasedness
(contrary to statements in some papers on inferential issues of sampling theory)
because it "often results in much larger MSE than necessary" (Hansen et al.
1983). A design consistent estimator of the variance of the ratio estimator is
simply obtained by replacing the variable $y_i$ in the variance estimator $v(\hat{Y})$ by
the residuals $e_i = y_i - \hat{R}x_i$ for $i \in A$. Regression estimation was also studied early
on but it was not widely used due to computational limitations in those days.
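A minimal sketch of the ratio estimator and its residual-based variance estimator, again for the special case of simple random sampling without replacement and a synthetic population (all numerical values hypothetical):

```python
import random

random.seed(1)

# Hypothetical population in which y is roughly proportional to a known x.
N = 2000
x = [random.uniform(10.0, 100.0) for _ in range(N)]
y = [2.0 * xi + random.gauss(0.0, 5.0) for xi in x]
X_total = sum(x)                       # known population total of x

n = 200
A = random.sample(range(N), n)         # SRSWOR, d_i = N/n for all i
d = N / n

Y_ht = d * sum(y[i] for i in A)        # basic design-unbiased estimator
X_ht = d * sum(x[i] for i in A)
R_hat = Y_ht / X_ht
Y_ratio = R_hat * X_total              # ratio estimator (Y_hat / X_hat) * X

# Design-consistent variance estimator: plug the residuals
# e_i = y_i - R_hat * x_i into the SRSWOR variance formula in place of y_i.
e = [y[i] - R_hat * x[i] for i in A]
ebar = sum(e) / n
s2_e = sum((ei - ebar) ** 2 for ei in e) / (n - 1)
v_ratio = N * N * (1 - n / N) * s2_e / n

# Variance estimate of the basic estimator, for comparison.
ybar = sum(y[i] for i in A) / n
s2_y = sum((y[i] - ybar) ** 2 for i in A) / (n - 1)
v_ht = N * N * (1 - n / N) * s2_y / n
```

Because $y$ is nearly proportional to $x$ here, the residual variance, and hence the estimated variance of the ratio estimator, is far below that of the basic estimator.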
In the early days of survey sampling, surveys were generally much simpler
than they are today and data were collected through personal interviews or
through mail questionnaires (possibly followed by personal interviews of a
subsample of non-respondents). Physical measurements, such as those advo-
cated by Mahalanobis for estimating crop yields in India, were also used.
Response rates were generally high, and the 1970s are generally regarded as the
"Golden Age of Survey Research" (Singer 2016).
Together with the consolidation of the basic design-based sampling theory,
attention turned to measurement or response errors in the 1940 U. S. census
(Hansen et al. 1951). Under additive response error models with minimal model
assumptions on the observed responses treated as random variables, the total
variance of the estimator Y b is decomposed into sampling variance, simple
response variance and correlated response variance (CRV) due to interviewers.
The CRV can dominate the total variance if the number of interviews per
interviewer is large. Partly for this reason, self-enumeration by mail was first
introduced in the 1960 U. S. Census to reduce CRV. This is indeed a success
story of theory influencing practice.
Prior to the work of Hansen et al. (1951), Mahalanobis (1946) proposed the
ingenious method of interpenetrating subsamples to assess both sampling errors
and variable interviewer errors in socio-economic surveys. He showed that both
the total variance and the interviewer variance component can be estimated by
assigning the subsamples at random to the interviewers. Kalton (2019) notes
that response errors remain a major concern although much research on total
survey error has been conducted since the pioneering work of Mahalanobis
(1946) and Hansen et al. (1951).
Nonresponse in probability surveys was also addressed in early survey
sampling development. Hansen and Hurwitz (1946) proposed two-phase sam-
pling for following up initial nonrespondents. In their application, the sample is
contacted by mail in the first phase and a subsample of nonrespondents is then
subjected to personal interview, assuming complete response or negligible
nonresponse at the second phase. This method is currently used in the Amer-
ican Community Survey and it can also be regarded as an early application of
what is now fashionable as a responsive or adaptive design (Tourangeau et al.
2017).
and the resulting calibration weights $w_i$ satisfy $\sum_{i\in A} w_i \hat{h}_i = \sum_{i\in U} \hat{h}_i$. The
GREG estimator $\hat{Y}_{gr}$ is a special case of (1) when $h(x_i, \beta) = x_i'\beta$. Breidt and
Opsomer (2017) review model-assisted estimation using modern prediction
methods, including non-parametric regression and machine learning.
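The model-assisted idea can be sketched generically: fit any working mean model on the sample, then add a design-weighted residual correction so that design consistency does not hinge on the model being correct. The simple linear fit and the population below are hypothetical stand-ins for whatever prediction method is used.

```python
import random

random.seed(7)

# Hypothetical population with a single auxiliary variable x known for all units.
N = 1000
x = [random.uniform(0.0, 10.0) for _ in range(N)]
y = [3.0 + 1.5 * xi + random.gauss(0.0, 1.0) for xi in x]

n = 100
A = random.sample(range(N), n)   # SRSWOR, d_i = N/n
d = N / n

# Fit a working mean model m(x) = a + b*x on the sample (plain least squares;
# a survey-weighted fit would normally be used).
xs = [x[i] for i in A]
ys = [y[i] for i in A]
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / \
    sum((xi - xbar) ** 2 for xi in xs)
a = ybar - b * xbar

def m(xi):
    return a + b * xi

# Model-assisted estimator: predicted total over the whole population U plus a
# design-weighted correction using the sample residuals; the correction term
# protects design consistency even when the working model is wrong.
Y_ma = sum(m(xi) for xi in x) + d * sum(y[i] - m(x[i]) for i in A)
```

Replacing the linear fit by a non-parametric or machine-learning predictor gives the more general model-assisted estimators reviewed by Breidt and Opsomer (2017).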
In practice, available auxiliary variables might include several categorical
variables with many categories and significant interactions may exist among
the variables. For example, the Quarterly Census of Employment and Wages
(QCEW) in the United States is used as a sampling frame for many establish-
ment surveys. The frame is compiled from administrative records and contains
several categorical variables such as industry code, firm ownership type, loca-
tion and size class of the establishment (McConville and Toth 2018). Clearly,
the linear regression working model including all categorical variables and their
two factor interactions as auxiliary variables can lead to unstable GREG
weights and even some negative weights. In this case, the GREG estimator
can be even less efficient than the basic estimator $\hat{Y}$. McConville and Toth
(2018) proposed a model-assisted regression tree approach that can automat-
ically account for relevant interactions, leading to significantly more efficient
design-consistent post-stratified estimators than the GREG estimator obtained
after variable selection from the pool of categorical variables. The GREG
estimator performed poorly when the pool for variable selection also included
all the two factor interactions. The tree weights are always positive and their
variability is considerably smaller than the variability of the GREG weights.
Regression trees use a recursive partitioning algorithm that partitions the space
of predictor variables into a set of boxes (post strata) B1, ..., BL such that the
sample units within a box are homogeneous with respect to the study variable.
The model-assisted estimator (1) with unspecified mean function reduces to the
regression tree estimator of the form
$$\hat{Y}_{rt} = \sum_{l=1}^{L} N_l \Big( \sum_{i\in A_l} d_i y_i \Big/ \sum_{i\in A_l} d_i \Big) \qquad (2)$$
where $A_l$ is the set of sample units and $N_l$ is the number of population units
belonging to poststratum $l$. It follows from (2) that the weight attached to a
sample unit belonging to poststratum $l$ is $N_l d_i / \sum_{i\in A_l} d_i$ and the regression tree
estimator (2) calibrates to the poststrata counts $N_l$. Note that all the population
values $x_i$ need to be known in order to obtain the poststrata counts $N_l$, $l =
1, \ldots, L$, unlike in the case of the GREG estimator which depends only on the
vector of population totals $X$. Also, the tree boxes $B_1, \ldots, B_L$ depend on the
values of the study variable in the sample. Establishment surveys typically
provide population values of the vector x at the unit level, unlike socio-
economic surveys. Regression tree methods can be very useful in the context
of utilizing multiple sources of auxiliary data provided they can be linked to the
sample data on the study variables. McConville and Toth (2018) also provide a
design consistent estimator of the variance of the regression tree estimator (2),
based on Taylor linearization.
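A sketch of the regression tree (post-stratified) estimator (2). The fixed x-intervals below are hypothetical stand-ins for the boxes a recursive partitioning algorithm would find from the sample; the population is synthetic.

```python
import random

random.seed(3)

# Hypothetical population: one auxiliary variable x known for every unit.
# Fixed x-intervals stand in for the boxes (post-strata) that recursive
# partitioning would produce.
N = 1200
x = [random.uniform(0.0, 4.0) for _ in range(N)]
y = [10.0 * int(xi) + random.gauss(0.0, 1.0) for xi in x]  # step-function mean

n = 120
A = random.sample(range(N), n)   # SRSWOR, so d_i = N/n for all i
d = N / n

boxes = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.1)]  # hypothetical boxes

# Estimator (2): known box count N_l times the design-weighted mean of y
# among sample units falling in box l.
Y_rt = 0.0
for lo, hi in boxes:
    N_l = sum(lo <= xi < hi for xi in x)        # population count in box l
    A_l = [i for i in A if lo <= x[i] < hi]     # sample units in box l
    if A_l:
        Y_rt += N_l * sum(d * y[i] for i in A_l) / sum(d for _ in A_l)
```

Because the weight of each sample unit is $N_l d_i / \sum_{i\in A_l} d_i$, the tree weights are always positive, in contrast to GREG weights.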
In some applications, the number of potential auxiliary variables could be
large, often correlated, and several of those variables may not have significant
relationship with the study variable. In such cases, it is desirable to select
significant auxiliary variables first and then apply GREG to the reduced
model. McConville et al. (2017) used a weighted lasso method which can
perform model selection and coefficient estimation simultaneously by shrinking
weakly related auxiliary variables to zero through a penalty function
(Tibshirani 1996). Note that the model selection depends on the choice of
tuning parameter(s) in the penalty function. The lasso GREG can also be
expressed as a weighted sum but the weights depend on the study variable
and all the population values xi must be known, as in the case of tree GREG.
They also established design consistency of the lasso GREG estimator,
assuming that the number of potential auxiliary variables is fixed as the
sample size increases. Ta et al. (2019) allowed the number of auxiliary variables
to increase as the sample size increases and established design consistency of the
lasso GREG estimator. McConville et al. (2017) conducted a simulation study
on a forestry population dataset with a marginal model based on 12 potential
covariates and a more complex model with 78 potential covariates based on
the main effects and interactions of those variables. As expected, lasso
GREG led to very large gains in efficiency over the customary GREG in
the second case and significant gains in the first case when the sample size is
not large. However, the efficiency of customary GREG can be improved in
the second case by doing variable selection first and then using GREG on the
selected variables.
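A sketch of a lasso-assisted GREG-type estimator under these assumptions. The plain (unweighted) coordinate-descent lasso below is a stand-in for the survey-weighted lasso of McConville et al. (2017), and the tuning parameter is chosen ad hoc rather than by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: p = 10 auxiliary variables, only two related to y.
N, p, n = 2000, 10, 200
X = rng.normal(size=(N, p))
beta_true = np.zeros(p)
beta_true[:2] = [2.0, -1.0]
y = X @ beta_true + rng.normal(scale=0.5, size=N)

A = rng.choice(N, size=n, replace=False)
d = N / n                                # SRSWOR design weights
Xs, ys = X[A], y[A]

def soft(z, t):
    """Soft-thresholding operator used by the lasso coordinate updates."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(Xm, yv, lam, iters=200):
    """Plain coordinate-descent lasso (stand-in for a survey-weighted lasso)."""
    nn, pp = Xm.shape
    b = np.zeros(pp)
    for _ in range(iters):
        for j in range(pp):
            r = yv - Xm @ b + Xm[:, j] * b[j]          # partial residual
            b[j] = soft(Xm[:, j] @ r / nn, lam) / (Xm[:, j] @ Xm[:, j] / nn)
    return b

b_lasso = lasso_cd(Xs, ys, lam=0.1)      # ad hoc tuning parameter

# GREG-type estimator: predicted total over U plus a design-weighted
# residual correction over the sample.
Y_lasso_greg = (X @ b_lasso).sum() + d * (ys - Xs @ b_lasso).sum()
```

The penalty shrinks the weakly related coefficients to exactly zero, so the estimator effectively uses only the selected variables.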
(Hidiroglou 2001). Note that the variance of $\hat{Y}_{2,gr}$ is asymptotically larger than
the variance of the corresponding GREG with known totals $Z$ because of the extra
variability due to estimating $Z$ from the larger sample. We refer the reader to
Guandalini and Tillé (2017) for more recent work on double sampling designs
involving two independent probability samples and relevant references to past work.
Kim and Rao (2012) studied a related problem of integrating two indepen-
dent probability samples. Here the primary interest was to create a single
synthetic data set of proxy values $\tilde{y}_i$ associated with $z_i$ in the larger sample
$A^{(1)}$ to produce a projection estimator of the total $Y$. The proxy values
$\tilde{y}_i, i \in A^{(1)}$, are generated by first fitting a working model relating $y$ to $z$ in the
smaller sample dataset $\{(y_i, z_i), i \in A^{(2)}\}$ and then predicting the $y_i$ associated
with $z_i, i \in A^{(1)}$. This approach facilitates the use of only the synthetic data and
associated design weights $d_{1i}$ reported in survey 1. Kim and Rao (2012) iden-
tified the conditions for the projection estimator $\hat{Y}_{1p} = \sum_{i\in A^{(1)}} d_{1i}\tilde{y}_i$ to be
asymptotically design consistent. Variance estimation is also considered.
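A sketch of the projection estimator under a hypothetical linear working model; for simplicity both samples are drawn here as simple random samples, and all numerical values are illustrative.

```python
import random

random.seed(5)

# Hypothetical setup: a large sample A1 observes only z; a smaller sample A2
# observes (y, z). Fit y on z in A2, then predict y for every unit in A1.
N = 5000
z = [random.gauss(0.0, 1.0) for _ in range(N)]
y = [1.0 + 2.0 * zi + random.gauss(0.0, 0.5) for zi in z]

n1, n2 = 800, 150
A1 = random.sample(range(N), n1)
A2 = random.sample(range(N), n2)
d1 = N / n1                              # design weights of survey 1

# Working model fitted on the smaller sample A2 (plain least squares sketch).
zb = sum(z[i] for i in A2) / n2
yb = sum(y[i] for i in A2) / n2
b = sum((z[i] - zb) * (y[i] - yb) for i in A2) / \
    sum((z[i] - zb) ** 2 for i in A2)
a = yb - b * zb

# Proxy values for survey 1 and the projection estimator of the total Y.
y_proxy = {i: a + b * z[i] for i in A1}
Y_proj = d1 * sum(y_proxy[i] for i in A1)
```

Only the proxy values and the design weights of survey 1 enter the estimator, which is what allows release of the synthetic data set alone.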
Schenker and Raghunathan (2007) reported several applications of the syn-
thetic data approach using a parametric model-based method to estimate the
total Y. In one application, sample A(2) observed both self-reporting health
measurements zi and clinical measurements from physical examination yi and
the much larger survey A(1) had only self-reported measurements zi. Only the
imputed, or synthetic, data from sample A(1) and associated survey weights
are released to the users of sample A(1), thus minimizing disclosure risk (Reiter
2008).
For a fixed $p$, the dual frame estimator $\hat{Y}_H$ can be written as a weighted sum
with weights not depending on the study variable; that is, the same weight is
used for all study variables. A simple choice is $p = 0.5$ and the resulting
estimator is called a multiplicity estimator. Hartley (1962) obtained an optimal
value of $p$ by minimizing the variance of the dual frame estimator $\hat{Y}_H$, but the
optimal $p$ will depend on the values of the study variable in the population. A
screening dual frame estimator is obtained by letting $p = 0$ in (3). In this case,
the data on units from sample A belonging to frame 2 are not collected and this
could lead to cost saving if it is cheaper to collect data from the incomplete
frame 2. For example, in agricultural surveys, farms belonging to the list frame
2 are removed from the area frame 1 before sampling commences and this could
lead to considerable cost saving since the list frame is cheaper to sample and it
contains the largest farms (Lohr 2011). An application of the screening dual
frame estimator to combining a non-probability sample with a probability
sample is given in Section 6.
Lohr (2011) provides an excellent account of dual frame sampling theory
and its extension to more than two frames. The relative merits of several
alternative estimators of the total are discussed. She also presents replication-
based variance estimators, in particular jackknife variance estimators. The
effect of errors in ascertaining to which frame a sample unit belongs is also
studied.
small area estimation (Section 8). Kalton (2019) notes that administrative data
may provide longitudinal data both for the period before and the period after
the survey data collection. Limitations of data from administrative sources
include possible incomplete coverage of the target population and confidential-
ity and privacy issues. Further, in the United States, sharing some administra-
tive records by federal agencies is not allowed. Also, for some records held by
states and localities, there is no mandate or incentive to share data with federal
statistical agencies. Administrative data should be subject to the same scrutiny
as survey data with regard to response errors, all the more so because they are
usually compiled by large teams of clerical staff with uncertain training in quality.
Given the current initiatives to modernize official statistics through exten-
sive use of nonprobability samples and big data, one might rightly ask the
question “Are probability surveys bound to disappear for the production of
official statistics?” (Beaumont 2019). To answer this question, we should
first examine the effects of sample selection bias in non-probability samples
(Section 4.2) and the use of models to reduce selection bias. If selection bias
cannot be reduced sufficiently through modeling of non-probability samples,
then we may examine the utility of combining non-probability samples with
probability samples in an attempt to address sample selection bias (Sections 5
and 6). Also, many major surveys conducted by federal agencies, such as
monthly labor force surveys or business surveys, will continue to use probability
sampling and efficient estimators based on model-assisted methods using aux-
iliary information. Kalton (2019) says “it is unlikely that social surveys will be
replaced by administrative records, although these data can be valuable addi-
tion to surveys” and “quality of estimates from internet surveys is a concern”.
Couper (2013) notes some limitations of big data, including that often only a
single variable and few covariates are observed, that the data are lean on
relationships, and that demographic information is often lacking for a large
proportion of big data cases. However, big data may be potentially useful in
providing predictors for models used for small area estimation (Section 8).
plays the key role in determining the bias and it is approximately zero on the
average under simple random sampling (assuming complete response and
coverage). Note that we do not have control on the participation mechanism
under non-probability sampling, unlike in the case of probability sampling.
We now turn to the model MSE of $\bar{y}_B$, assuming the indicators $\delta_i$ are
random with unknown nonzero participation probabilities $q_i$. Then, the model
MSE is given by
$$\mathrm{MSE}_\delta(\bar{y}_B) = E_\delta\big(\rho^2_{\delta,y}\big)\, f_B^{-1}(1-f_B)\, \sigma^2_y \qquad (4)$$
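The decomposition behind (4) holds exactly for every realized participation pattern: the error of the naive big data mean equals the data defect correlation $\rho_{\delta,y}$ times $\sqrt{(1-f_B)/f_B}$ times $\sigma_y$ (Meng 2018). The sketch below verifies this algebraic identity on a synthetic population in which the participation probability increases with $y$, so that the naive mean is biased upward; all constants are hypothetical.

```python
import math
import random

random.seed(11)

# Hypothetical population; participation probability q_i increases with y_i,
# so the self-selected "big data" mean is biased upward.
N = 10000
y = [random.gauss(50.0, 10.0) for _ in range(N)]
q = [min(0.9, max(0.05, 0.3 + 0.01 * (yi - 50.0))) for yi in y]
delta = [1 if random.random() < qi else 0 for qi in q]

nB = sum(delta)
f = nB / N                                   # realized participation fraction
Ybar = sum(y) / N
yB = sum(yi for yi, di in zip(y, delta) if di) / nB

# Finite-population (divide-by-N) moments and the data defect correlation.
sd_y = math.sqrt(sum((yi - Ybar) ** 2 for yi in y) / N)
sd_d = math.sqrt(f * (1 - f))
cov = sum(di * yi for yi, di in zip(y, delta)) / N - f * Ybar
rho = cov / (sd_d * sd_y)

# Exact error decomposition behind (4):
# ybar_B - Ybar = rho * sqrt((1 - f) / f) * sd_y, for every realization.
err_identity = rho * math.sqrt((1 - f) / f) * sd_y
err_actual = yB - Ybar
```

Taking the model expectation of the squared error over the random $\delta_i$ yields (4).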
$B$ is poor quality. Using the indicator variable $\delta_i$, we can express the total $Y$ as
$Y_B + Y_C$, where $Y_B = \sum_{i\in U} \delta_i y_i = \sum_{i\in B} y_i$ and $Y_C = \sum_{i\in U} (1-\delta_i) y_i$ are the to-
tals for the units in the sample $B$ and the units not belonging to $B$, respectively.
This situation can be thought of as a population composed of two strata: the
stratum of the sample $B$ that is completely enumerated and a stratum of the
elements not in $B$ from which a probability sample is selected. It immediately
follows that a design unbiased direct estimator of the total is given by $Y_B +
\hat{Y}_C$, where $\hat{Y}_C = \sum_{i\in A} d_i (1-\delta_i) y_i$ is the design unbiased estimator of $Y_C$
and the $d_i$ are the design weights associated with the probability sample. This
estimator is a special case of the dual frame screening estimator when all the
units in the incomplete frame 2 are observed.
Sometimes, it may be desirable to reduce the big data sample size by taking
a large random sample from an extremely large big data sample $B$, as in the
case of transaction data. In this case, we need methods of drawing random
samples from a very large big data file stored in the fast memory of the
computer and whose size NB may not be known in advance. McLeod and
Bellhouse (1983) proposed a convenient algorithm for drawing simple random
samples from big data files.
Since the sizes $N_B$ and $N_C = N - N_B$ are known, a more efficient post-stratified
estimator is given by $\hat{Y}_P = Y_B + N_C \hat{Y}_C / \hat{N}_C$, where $\hat{N}_C = \sum_{i\in A} d_i (1-\delta_i)$.
Kim and Tam (2018) showed that with a simple random sample $A$, the post-
stratified estimator achieves a large reduction in the design variance compared to
the design unbiased estimator $\hat{Y} = \sum_{i\in A} d_i y_i$ based only on the probability sample
$A$. In particular, if the sampling fraction $f = n/N$ is small and the population
variance $\sigma^2_y$ and the variance of the population units not belonging to $B$, denoted
$\sigma^2_{C,y}$, are roughly equal, $V(\hat{Y}_P)/V(\hat{Y}) \approx 1 - W_B$, where $W_B = N_B/N$ is the
proportion of units belonging to the non-probability sample $B$.
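A sketch of the combined estimator $Y_B + \hat{Y}_C$ and its post-stratified improvement on a synthetic population; the participation mechanism and all constants are hypothetical.

```python
import random

random.seed(21)

# Hypothetical population; a big data source B covers roughly half of it, with
# participation probability increasing in y (so B alone is upward biased).
N = 10000
y = [random.gauss(100.0, 20.0) for _ in range(N)]
delta = [1 if random.random() < min(0.95, 0.3 + 0.002 * yi) else 0 for yi in y]
NB = sum(delta)
NC = N - NB
YB = sum(yi for yi, di in zip(y, delta) if di)   # big data total, fully observed

# Independent probability sample A, with membership in B observable per unit.
n = 400
A = random.sample(range(N), n)
d = N / n

# Design unbiased combined estimator Y_B + Yhat_C ...
YC_hat = d * sum((1 - delta[i]) * y[i] for i in A)
Y_combined = YB + YC_hat

# ... and the more efficient post-stratified version using the known N_C.
NC_hat = d * sum(1 - delta[i] for i in A)
Y_post = YB + NC * YC_hat / NC_hat
```

The naive big-data-only estimate $N\,\bar{y}_B$ remains biased upward here, while both combined estimators track the true total.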
One could use calibration estimation by minimizing the chi-square
distance, $\sum_{i\in A} (w_i - d_i)^2/d_i$, between the design weights $d_i$ and the calibration
weights $w_i$ subject to the calibration constraints $\sum_{i\in A} w_i \delta_i = N_B$, $\sum_{i\in A} w_i (1-\delta_i) =
N_C$ and $\sum_{i\in A} w_i \delta_i y_i = Y_B$, where $N_B$, $N_C$ and $Y_B$ are known. The resulting
calibration estimator $\sum_{i\in A} w_i y_i$ is identical to the post-stratified estimator $\hat{Y}_P$
(Kim and Tam 2018). However, the main advantage of the calibration ap-
proach is that it permits the inclusion of other calibration constraints, if
available. One important application of the calibration approach is the estima-
tion of the total $Y$ when the study variable observed in the non-probability
sample $B$ is subject to measurement error, so that we observe $y^*_i$ instead of
$y_i$ for $i \in B$, with no measurement errors in the probability sample $A$. In this
case, we minimize the chi-square distance subject to the previous constraints
but replacing $y_i$ by $y^*_i$ for $i \in B$. The resulting calibration estimator is given by
If the population total $X$ is available from one or more external sources, such as
a census, then there is no need for the probability sample and we simply replace
the design unbiased estimator $\hat{X} = \sum_{i\in A} d_i x_i$ in (8) by the known total $X$
(Singh et al. 2017). Note that the Chen et al. (2018a) estimator of $q_i$ depends
on the design weights associated with the probability sample. Chen et al.
(2018a) also provided asymptotically valid variance estimators of the DR esti-
mators under the assumed models. Lee (2006) and Lee and Valliant (2009) earlier
proposed the above setup, but their method of estimating the participation
probabilities by letting δi = 1 if i ∈ B and δi = 0 if i ∈ A leads to biased estimates
of the participation probabilities, as observed by Valliant and Dever (2011).
Yang et al. (2019) extended the DR estimators of Chen et al. (2018a) to the
case of a large number of candidate covariates x. They used a two-step ap-
proach for variable selection and estimation of the finite population parameter,
using a generalization of lasso. Some of the commercial web panels and health
records may observe many covariates, but several of those covariates may be
weakly related to the study variable. In such cases, the lasso-based DR estima-
tors of Yang et al. (2019) might be useful.
A regression type estimator of the total is obtained by estimating the
unknown total of the predicted values $\hat{m}_i$, using the probability sample
(Chen et al. 2018a, b):
$$\hat{Y}_{REG} = \sum_{i\in A} d_i \hat{m}_i \qquad (9)$$
Note that (9) could be regarded as a mass imputed estimator with missing
values $y_i$ for $i \in A$ replaced by the corresponding imputed values $\hat{m}_i$. The
estimator (9) will be biased if the model for the study variable is incorrectly
specified. A ratio type version of (9) is obtained by multiplying (9) by the
scale factor $N/\hat{N}$, where $\hat{N} = \sum_{i\in A} d_i$ is the design unbiased estimator of $N$.
Rivers (2007) used a non-parametric mass imputation approach that avoids
the specification of the mean function $E_m(y_i \mid x_i)$. For each unit $i$ in the
probability sample $A$, a nearest neighbor (NN) to the associated $x_i$ is found
from the donor set $\{(i, x_i), i \in B\}$, say $x_l$, using Euclidean distance, and the
associated $y_l$ is used as the imputed value $y^*_i (= y_l)$, leading to the mass
imputed estimator
$$\hat{Y}_{RI} = \sum_{i\in A} d_i y^*_i \qquad (10)$$
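A sketch of nearest-neighbor mass imputation leading to (10). For illustration the donor sample $B$ is drawn at random here, whereas in practice $B$ is self-selected and the approach relies on the outcome model holding for $B$; the population is synthetic.

```python
import random

random.seed(9)

# Hypothetical data: a big sample B observes (x, y); the probability sample A
# observes x only. Impute y in A using the nearest x-neighbor in B.
N = 5000
x = [random.uniform(0.0, 10.0) for _ in range(N)]
y = [5.0 + 3.0 * xi + random.gauss(0.0, 1.0) for xi in x]

B = random.sample(range(N), 2000)    # donor sample (drawn as SRS for the sketch)
n = 200
A = random.sample(range(N), n)       # probability sample, SRSWOR
d = N / n

def nearest_neighbor_y(xi):
    """Return y of the unit in B whose x is closest to xi (1-D Euclidean)."""
    l = min(B, key=lambda j: abs(x[j] - xi))
    return y[l]

y_star = {i: nearest_neighbor_y(x[i]) for i in A}   # imputed values y*_i
Y_RI = d * sum(y_star[i] for i in A)                # mass imputed estimator (10)
```

With a multivariate $x$ the same idea applies with the Euclidean distance computed over all components, typically after scaling.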
We now turn to the scenario where only non-probability sample B data {(i, yi,
xi), i ∈ B} are available. In this case, we need some population information on the
auxiliary variables x. It is then possible to use a model-dependent or prediction
approach which was advocated for probability samples by Royall (1970) and
others. An advantage of the prediction approach for probability samples is that it
provides conditional model-based inferences referring to the particular sample of
units selected. Such conditional inferences may be more relevant and appealing
than the unconditional repeated sampling inferences used in the design-based
approach. Inferences do not depend on the sampling design if the assumed popu-
lation model holds for the sample or there is no sample selection bias.
In the case of probability sampling, Little (2015) notes that sample selection
bias may be reduced by explicitly incorporating design features, such as stratifica-
tion and sample selection probabilities, into the model to ensure design consis-
tency of the prediction estimators. For example, with disproportionate stratified
random sampling, stratum effects are included in the model, leading to a
prediction estimator that agrees with the standard design-based estimator.
Unfortunately, under non-probability sampling the above calibrated modeling
approach is not feasible and issues of selection bias are not resolved. Little
(2015) suggests incorporating auxiliary population information, such as
poststratification, to reduce the sample selection bias. Bethlehem (2016) dem-
onstrated the effectiveness of post-stratification in reducing the selection bias in
ON MAKING VALID INFERENCES BY... 21
the context of a web panel on voting (binary variable) when the panel is post-
stratified into cells defined by age (young, middle, old) and education (low,
high) and the population cell counts are known. Wang et al. (2015) proposed
multilevel regression and post-stratification (MRP) to forecast elections with
non-representative polls and demonstrated its effectiveness in forecasting U.S.
presidential elections, based on a large number of post-stratification cells with
known population cell counts. Smith (1983) examined the conditions for ignor-
ing non-random selection mechanisms with particular attention to post-
stratification and quota sampling.
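As a small numerical illustration of how post-stratification with known cell counts can reduce selection bias (fully simulated; the cells, counts and inclusion rates are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(7)

# Known population cell counts for age x education cells (an assumption here).
N_g = np.array([3000, 2000, 3000, 2000])   # young-low, young-high, old-low, old-high
p_g = np.array([0.30, 0.40, 0.60, 0.70])   # true P(vote = 1) within each cell

# A volunteer web panel over-represents the young: unequal inclusion rates.
incl = np.array([0.10, 0.10, 0.02, 0.02])
n_g = (N_g * incl).astype(int)
samples = [rng.binomial(1, p, n) for p, n in zip(p_g, n_g)]

naive = np.concatenate(samples).mean()     # panel mean ignoring selection bias
post = (N_g * np.array([s.mean() for s in samples])).sum() / N_g.sum()
true = (N_g * p_g).sum() / N_g.sum()       # population proportion, 0.49 here
```

The naive panel mean is pulled toward the over-represented young cells (around 0.39 in expectation), while the post-stratified estimate recovers the population proportion, because in this sketch the selection bias operates only through cell membership.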
Suppose the population model is given by the mean function Em(yi) =
h(xi, β), i ∈ U, and the model holds for the non-probability sample B. We fit
the model to the sample data to obtain an estimator β̂ and predictors
ĥi = h(xi, β̂) for i ∈ U, assuming the population values xi are known. In this
case, the prediction estimator of the total is given by

Ŷ_PR = Σ_{i∈B} yi + Σ_{i∈B̃} ĥi    (11)

where B̃ = U − B denotes the set of non-sampled units. In the special case of a
linear mean function, h(xi, β) = xi′β, the prediction estimator reduces to

Ŷ_PR = X′β̂    (12)

where β̂ is the ordinary least squares estimator of β. In this special case, the
prediction estimator (12) requires only the vector of population totals X.
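A compact sketch of the prediction estimator (11) and its linear-model special case (12) on simulated data (here B is drawn at random purely for illustration, so there is no selection bias; the point is only the algebra):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population with known x values for all N units.
N = 5000
x = rng.uniform(0, 10, size=N)
y = 2.0 + 1.5 * x + rng.normal(size=N)

B = rng.choice(N, size=400, replace=False)   # non-probability sample (toy)
Bc = np.setdiff1d(np.arange(N), B)           # non-sampled units, B-tilde

# Fit the linear mean model on B by OLS (intercept included).
X_B = np.column_stack([np.ones(B.size), x[B]])
beta = np.linalg.lstsq(X_B, y[B], rcond=None)[0]

# Prediction estimator (11): observed y over B plus predictions over B-tilde.
Y_PR = y[B].sum() + (beta[0] + beta[1] * x[Bc]).sum()

# Special case (12): only the population totals X = (N, sum of x) are needed.
Y_PR_linear = np.array([N, x.sum()]) @ beta
```

With an intercept in the model the OLS residuals over B sum to zero, which is exactly why (11) collapses to (12).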
A major challenge in making inference based only on a non-probability
sample B is how to account for sample selection bias. Chen et al. (2018b)
suggested using a large pool of predictors and then applying the lasso
(Tibshirani 1996) to implement both variable selection and estimation of model
parameters associated with the selected variables. Their simulation study
suggests that the resulting predictors might be able to account for sample
selection bias. However, the chosen setups do not reflect strong sample selection
bias. Also, it should be noted that the number of demographic variables
available for use as predictors is limited in the context of volunteer web surveys
and other nonprobability samples and not all variables may be available for all
the units in the sample (Couper 2013).
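The variable-selection idea can be sketched with a generic coordinate-descent lasso on simulated data (this is not the Chen et al. (2018b) implementation; the penalty `lam` and the data-generating process are arbitrary illustrative choices):

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Lasso by cyclic coordinate descent: min (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ms = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual without x_j
            rho = X[:, j] @ r_j / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ms[j]
    return beta

rng = np.random.default_rng(0)
n, p = 500, 6
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = lasso_cd(X, y, lam=0.1)
selected = np.flatnonzero(np.abs(beta_hat) > 1e-8)   # predictors retained
```

In this toy setting the soft-thresholding drives the irrelevant coefficients to (essentially) zero, while the retained coefficients are shrunk toward zero, the usual price of the penalty.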
results showed that the ARE of the direct estimates, averaged over all the areas,
is reduced from 33.9% to 14.7% through the use of the "optimal" model-based
estimates. For the 28 smallest areas, the reduction in ARE is more pronounced:
from 70.4% to 17.7%.
Ybarra and Lohr (2008) studied the case of area-level covariates subject to
sampling error. The covariates are obtained from a much larger survey, similar
to the double sampling scenario of Section 3.1. They developed “optimal”
estimators under this setup, assuming the sampling variances and covariances
of the area level covariates are known.
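The flavour of the Ybarra-Lohr estimator can be shown in a stylized simulation in which the regression coefficient, the model variance, the sampling variances and the covariate measurement-error variances are all treated as known (the actual method estimates these quantities):

```python
import numpy as np

rng = np.random.default_rng(11)
m = 200                        # number of small areas
sigma_v2 = 1.0                 # model (random effect) variance
psi = np.full(m, 2.0)          # sampling variances of the direct estimates
C = np.full(m, 0.5)            # measurement-error variances of the covariate
beta = 1.0

x = rng.normal(size=m)                                           # true covariate
theta = beta * x + rng.normal(scale=np.sqrt(sigma_v2), size=m)   # area parameters
y_dir = theta + rng.normal(scale=np.sqrt(psi))                   # direct estimates
x_hat = x + rng.normal(scale=np.sqrt(C))                         # error-prone covariate

# Weight on the direct estimate grows with the (inflated) model variance.
gamma = (sigma_v2 + beta**2 * C) / (psi + sigma_v2 + beta**2 * C)
theta_yl = gamma * y_dir + (1.0 - gamma) * beta * x_hat

mse_dir = ((y_dir - theta) ** 2).mean()
mse_yl = ((theta_yl - theta) ** 2).mean()
```

Ignoring the measurement error (i.e., using the usual Fay-Herriot weight σv²/(ψi + σv²)) would over-shrink toward the error-prone synthetic part; the Ybarra-Lohr weight adds β′Ciβ to the model variance to compensate.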
Use of area level big data as additional predictors in the area level
model has the potential of providing good predictors for modeling. We
mention three recent applications that have used big data covariates in
an area level model. Marchetti et al. (2015) studied the estimation of
poverty rates for local areas in the Tuscany region of Italy. In this
application, the big data covariate is a mobility index based on different
car journeys between locations automatically tracked with a GPS device.
Direct estimates of area poverty rates were obtained from a probability
sample. The big data covariate in this application is based on a
nonprobability sample which was treated as a simple random sample
and the Ybarra-Lohr method was used to estimate model-based poverty
rates. The second application analyzed relative change in the percent of
Spanish-speaking households in the eastern half of the United States
(Porter et al. 2014). Here direct estimates for the states (small areas)
were obtained from the American Community Survey (ACS) and a big
data covariate was extracted from Google Trends of commonly used
Spanish words available at the state level. In the third application,
Schmid et al. (2017) used mobile phone data as covariates in the basic
area level model to estimate literacy rate by gender at the commune
level in Senegal. Direct estimates of area literacy rates were obtained
from a Demographic and Health Survey based on a probability sample.
It is interesting that recent census data or social media data were not
available to use as covariates. The authors provide details regarding the
construction of the mobile phone covariates. In another application of big
data, Muhyi et al. (2019) obtained small area estimates of the electabil-
ity of a candidate as Central Java Governor in 2018, using predictor
variables extracted from Twitter data and direct estimates obtained from
a sample survey.
Turning to a basic unit level model, the unit level sample data are given
by {(yij, xij), j = 1, ..., ni; i = 1, ..., m}, where j denotes a sample unit
belonging to area i, and the population area means X̄i are assumed to be
known. In practice, either xij is observed along with the study variable yij or
obtained from external sources through linkage. In the latter case, the
observed data may be subject to linkage errors if unique unit identifiers
are not available (Chambers et al. 2019). Battese et al. (1988) proposed a
nested error linear regression model yij = xij′β + vi + eij for the case where
xij is observed along with yij, where eij is a unit-level error with mean zero
and variance σe², and vi is a random area effect with mean zero and variance
σv². The
optimal model-based estimator is again a weighted combination of a direct
estimator and a synthetic estimator. Battese et al. (1988) applied the
nested error regression model to estimate county crop areas using sample
survey data in conjunction with satellite information. Each county was
divided into area segments and the area under corn and soybeans, taken
as yij, was ascertained for a random sample of segments by interviewing
farm operators. The auxiliary variable xij, the number of pixels classified
as corn and soybeans, was obtained for all the area segments, including the
sample segments, in each county from LANDSAT satellite
readings. This is an application of big data in the form of satellite readings.
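A simplified simulation of the weighted combination under the nested error model (the variance components σv² and σe² are treated as known here; in practice they are estimated, e.g. by REML, and β would be fitted by GLS rather than pooled OLS):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n_i = 40, 10                          # areas and sampled units per area
sigma_v, sigma_e = 1.0, 2.0
beta = np.array([1.0, 2.0])              # intercept and slope of the true model

v = rng.normal(scale=sigma_v, size=m)    # random area effects
x = rng.uniform(0, 2, size=(m, n_i))
y = beta[0] + beta[1] * x + v[:, None] + rng.normal(scale=sigma_e, size=(m, n_i))

Xbar_pop = np.full(m, 1.0)               # known population means of x per area (assumed)

# Pooled OLS fit of the fixed part (a sketch; GLS would be used in practice).
Xmat = np.column_stack([np.ones(m * n_i), x.ravel()])
b = np.linalg.lstsq(Xmat, y.ravel(), rcond=None)[0]

# Weighted combination of synthetic and survey-regression components.
gamma = sigma_v**2 / (sigma_v**2 + sigma_e**2 / n_i)
synth = b[0] + b[1] * Xbar_pop
eblup = synth + gamma * (y.mean(axis=1) - (b[0] + b[1] * x.mean(axis=1)))

target = beta[0] + beta[1] * Xbar_pop + v     # model-based small area means
mse_eblup = ((eblup - target) ** 2).mean()
mse_synth = ((synth - target) ** 2).mean()
```

The shrinkage weight γi = σv²/(σv² + σe²/ni) moves toward the direct component as ni grows, so areas with larger samples lean less on the synthetic part.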
Chambers et al. (2019) studied the effect of linkage errors when xij is
obtained from another source such as a population register. They derived
model-based estimators of the area means, taking account of linkage errors.
Both Battese et al. (1988) and Chambers et al. (2019) assumed the absence
of sample selection bias in the sense that the population model holds for the
sample of units within an area. However, sampling becomes informative if
the known selection probability for the sample unit is related to the
associated yij given xij. In this case, the population model may not hold
for the sample and methods that ignore informative sampling can lead to
biased estimators of small area means. Pfeffermann and Sverchkov (2007)
studied small area estimation under informative probability sampling by
modeling the known selection probabilities for the sample as functions of
associated xij and yij . They developed bias-adjusted estimators under their
set-up. Verret et al. (2015) proposed augmented models by using a suitable
function of the known selection probability as an additional auxiliary
variable in the sample model to take account of the sample selection bias
and demonstrated that the resulting estimators of small area means can
lead to considerable reduction in MSE relative to methods that ignore
sample selection bias. However, neither method is applicable to data ob-
tained from a non-probability sample because the selection probabilities are
unknown.
In the case of sample surveys repeated over time, considerable gain in
efficiency can be achieved for small area estimation by borrowing strength
across both areas and time, using extensions of the basic cross-sectional area
level model. There is extensive literature on this important topic and we refer
the reader to Rao and Molina (2015, Section 8.3). Big data sources can be used
to construct covariates for use under such models by combining time series and
cross-sectional survey data. A referee noted that “Particularly, the high fre-
quency of big data sources can be used to improve the timeliness of survey
samples in nowcasting methods.”
9 Concluding remarks
samples. Couper (2013) says “We need other ways to quantify the risks of
selection bias or non-coverage in big data or non-probability surveys.” The
European Statistical System (2015) has published guidelines for reporting
about the quality of statistics calculated from non-probability samples and
administrative sources, as well as for statistical processes involving multiple
data sources. The U.S. Federal Committee on Statistical Methodology
(2018) has outlined some steps that might be taken toward more transpar-
ent reporting of data quality for integrated data.
Covariates extracted from big data have the potential of providing good
additional predictors in linking models used in small area estimation. We
can expect to see more applications using big data predictors in small area
estimation. In the time series context, big data has the potential of provid-
ing estimates for small areas over time that can improve the timeliness of
survey samples using nowcasting methods, as noted by a referee.
I have not discussed other practical issues related to big data and non-
probability samples, such as privacy, access and transparency, and I refer the
reader to the following overview and appraisal papers: Baker et al. (2013),
Brick (2011), Citro (2014), Couper (2013), Elliott and Valliant (2017), Groves
(2011), Kalton (2019), Keiding and Louis (2016), Lohr and Raghunathan
(2017), Mercer et al. (2017), Tam and Kim (2018) and Thompson (2019).
The report by the National Academies of Sciences, Engineering, and
Medicine (2017) extensively treated the privacy issue, in addition to method-
ology for integrating data from multiple sources.
It is unlikely that all surveys based on probability sampling, especially
large-scale surveys, will be replaced by big data, non-probability samples or
administrative data in the near future, because probability samples have
much wider scope, such as collecting multiple study variables to estimate
relationships. For some studies, data can only be obtained in person
(Kalton 2019). Of course, we should make improvements to probability
sampling, such as reducing survey length and respondent burden and
making increased use of technology (Couper 2013). Rao and Fuller (2017)
provide some future directions. Inevitably, non-probability samples will be
more widely used in the future, and we need to continue researching
methods for obtaining valid (or at least acceptable) inferences from them,
possibly in combination with probability samples as illustrated in this
paper. Falling response rates and increasing respondent burden are often
given as reasons for using non-probability samples, especially in socio-
economic surveys, but those reasons do not necessarily apply to traditional
sample surveys not involving people as respondents, such as agricultural
and natural resources surveys.
References
BAKER, R., BRICK, J. M., BATES, N. A., BATTAGLIA, M., COUPER, M. P., DEVER, J. A., GILE, K. J. AND
TOURANGEAU, R. (2013). Report of the AAPOR task force on non-probability sampling.
J. Surv. Statist. Methodol., 1, 90-143.
BATTESE, G. E., HARTER, R. M. AND FULLER, W. A. (1988). An error component model for prediction of
county crop areas using survey and satellite data. J. Am. Stat. Assoc., 83, 28-36.
BEAUMONT, J.-F. (2019). Are probability surveys bound to disappear for the production of official
statistics? Technical Report. Statistics Canada.
BETHLEHEM, J. (2016). Solving the nonresponse problem with sample matching. Soc. Sci. Comput.
Rev., 34, 59-77.
BIEMER, P. P. (2018). Quality of official statistics: present and future. Paper presented at the
International Methodology Symposium. Statistics Canada, Ottawa.
BOSE, C. (1943). Note on the sampling error in the method of double sampling. Sankhya, 6, 329-330.
BRAKEL VAN DEN, J. A. AND BETHLEHEM, J. (2008). Model-assisted estimators for official statistics.
Discussion Paper 09002, Statistics Netherlands.
BREIDT, F. J. AND OPSOMER, J. D. (2017). Model-assisted survey estimation with modern prediction
techniques. Stat. Sci., 32, 190-205.
BRICK, M. J. (2011). The future of survey sampling. Public Opin. Q., 75, 872-888.
CHAMBERS, R. L., FABRIZI, E. AND SALVATI, N. (2019). Small area estimation with linked data.
Technical report, arXiv:1904.00364v1.
CHAUDHURI, A. AND CHRISTOFIDES, T. (2013). Indirect Questioning in Sample Surveys. Springer, New York.
CHEN, S. AND HAZIZA, D. (2017). Multiply robust imputation procedures for the treatment of item
nonresponse in surveys. Biometrika, 104, 439-453.
HOLT, D. T. (2007). The official statistics Olympics challenge: Wider, deeper, quicker, better,
cheaper. The American Statistician, 61, 1-8. With commentary by G. Brackstone and J.
L. Norwood.
HORVITZ, D. G. AND THOMPSON, D. J. (1952). A generalization of sampling without replacement from
a finite universe. J. Am. Stat. Assoc., 47, 663-685.
KALTON, G. (2019). Developments in survey research over the past 60 years: A personal perspec-
tive. Int. Stat. Rev., 87, S10-S30.
KEIDING, N. AND LOUIS, T. A. (2016). Perils and potentials of self-selected entry to epidemiological
studies and surveys. J. R. Soc. Stat. Ser. A, 179, 319-376.
KIM, J. K. AND HAZIZA, D. (2014). Doubly robust inference with missing data in survey sampling.
Stat. Sin., 24, 375-394.
KIM, J. K. AND RAO, J. N. K. (2012). Combining data from independent surveys: model-assisted
approach. Biometrika, 99, 85-100.
KIM, J. K. AND TAM, S-M. (2018). Data integration by combining big data and survey sample data
for finite population inference. Submitted for publication.
KIM, J. K. AND WANG, Z. (2019). Sampling techniques for big data analysts. Int. Stat. Rev. (in press).
KIM, J. K., PARK, S., CHEN, Y. AND WU, C. (2019). Combining non-probability and probability survey
samples through mass imputation. Technical Report: arXiv:1812.10694v2 [stat.ME].
LEE, S. (2006). Propensity score adjustment as a weighting scheme for voluntary panel web
surveys. J. Off. Stat., 22, 329-349.
LEE, S. AND VALLIANT, R. (2009). Estimation for volunteer panel web surveys using propensity score
adjustment and calibration adjustment. Sociol. Methods Res., 37, 319-343.
LITTLE, R. J. (2015). Calibrated Bayes, an inferential paradigm for official statistics in the era of big
data. Stat. J. IAOS, 31, 555-563.
LOHR, S. L. (2011). Alternative survey sample designs: Sampling with multiple overlapping frames.
Surv. Methodol., 37, 197-213.
LOHR, S. L. AND RAGHUNATHAN, T. E. (2017). Combining survey data with other data sources. Stat.
Sci., 32, 293-312.
MAHALANOBIS, P. C. (1944). On large scale sample surveys. Philos. Trans. R. Soc. B, 231, 329-351.
MAHALANOBIS, P. C. (1946). Recent experiments in statistical sampling in the Indian Statistical
Institute. J. R. Stat. Soc., 109, 325-378.
MARCHETTI, S., GIUSTI, C., PRATESI, M., SALVATI, N., GIANNOTTI, F., PEDRESCHI, D., RINZIVILLO, S.,
PAPPALARDO, L. AND GABRIELLI, L. (2015). Small area model-based estimators using big data
sources. J. Off. Stat., 31, 263-281.
MCCONVILLE, K. S. AND TOTH, D. (2018). Automated selection of post-strata using a model-assisted
regression tree estimator. Scand. J. Stat. (in press).
MCCONVILLE, K. S., BREIDT, F. J., LEE, T. C. AND MOISEN, G. G. (2017). Model-assisted survey regression
estimation with the lasso. J. Surv. Statist. Methodol., 5, 131-158.
MCLEOD, A. I. AND BELLHOUSE, D. R. (1983). A convenient algorithm for drawing a simple random
sample. Applied Statistics, 32, 182-184.
MENG, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations,
big data paradox, and the 2016 US presidential election. Ann. Appl. Stat., 12, 685-726.
MERCER, A. W., KREUTER, F. AND STUART, E. A. (2017). Theory and practice in nonprobability surveys.
Public Opin. Q., 81, 250-279.
MUHYI, F. A., SARTONO, B., SULVIANTI, I. D. AND KURNIA, A. (2019). Twitter utilization in application of
small area estimation to estimate electability of candidate central java governor. IOP Conf.
Ser. Earth Environ. Sci., 299, 012033, 1-10.
NARAIN, R. D. (1951). On sampling without replacement with varying probabilities. J. Indian Soc.
Agric. Stat., 3, 169-174.
NATIONAL ACADEMIES OF SCIENCES, ENGINEERING, AND MEDICINE. (2017). Federal Statistics, Multiple
Data Sources, and Privacy Protection: Next Steps. Washington, DC: The National
Academies Press. https://doi.org/10.17226/24893.
NEYMAN, J. (1934). On the two different aspects of the representative method: The method of
stratified sampling and the method of purposive selection. J. R. Stat. Soc., 97, 558-606.
PFEFFERMANN, D. AND SVERCHKOV, M. (2007). Small-area estimation under informative probability
sampling of area and within the selected areas. J. Am. Stat. Assoc., 102, 1427-1439.
PORTER, A. T., HOLAN, S. H., WIKLE, C. K. AND CRESSIE, N. (2014). Spatial Fay-Herriot model for small
area estimation with functional covariates. Spat. Stat., 10, 27-42.
RAO, J. N. K. (1999). Some current trends in sample survey theory and methods (with discussion).
Sankhya Ser. B, 61, 1-57.
RAO, J. N. K. AND FULLER, W. A. (2017). Sample survey theory and methods: Past, present and future
directions (with discussion). Surv. Methodol., 43, 145-181.
RAO, J. N. K. AND MOLINA, I. (2015). Small Area Estimation. Wiley, Hoboken.
REITER, J. (2008). Multiple imputation when records used for imputation are not used or dissem-
inated for analysis. Biometrika, 95, 933-946.
RIVERS, D. (2007). Sampling for web surveys. In 2007 JSM Proceedings, ASA Section on Survey
Research Methods, American Statistical Association.
ROYALL, R. M. (1970). On finite population sampling under certain linear regression models.
Biometrika, 57, 377-387.
SARNDAL, C.-E. (2007). The calibration approach in survey theory and practice. Surv. Methodol.,
33, 99-119.
SCHENKER, N. AND RAGHUNATHAN, T. (2007). Combining information from multiple surveys to
enhance estimation of measures of health. Stat. Med., 26, 1802-1811.
SCHMID, T., BRUCKSCHEN, F., SALVATI, N. AND ZBIRANSKI, T. (2017). Constructing sociodemographic
indicators for national statistical institutes by using mobile phone data: estimating literacy
rates in Senegal. J. R. Stat. Soc. Ser. A, 180, 1163-1190.
SINGER, E. (2016). Reflections on surveys past and future. J. Surv. Statist. Methodol., 4, 463-
475.
SINGH, A. C., BERESOVSKY, V. AND YE, C. (2017). Estimation from purposive samples with the aid of
probability supplements but without data on the study variable. In 2017 JSM Proceedings,
ASA Section on Survey Research Methods, American Statistical Association.
SMITH, T. M. F. (1983). On the validity of inferences from non-random samples. J. R. Stat. Soc. Ser.
A, 146, 393-403.
TA, T., SHAO, J., LI, Q. AND WANG, L. (2019). Generalized regression estimators with high-dimensional
covariates. Stat. Sin. (in press).
TAM, S.-M. AND KIM, J. K. (2018). Big data, selection bias and ethics - an official statistician's
perspective. Stat. J. IAOS, 34, 577-588.
THOMPSON, S. K. (2002). Sampling. Wiley: New York.
THOMPSON, M. E. (2019). Combining data from new and traditional sources in population surveys.
Int. Stat. Rev., 87, S79-S89.
TIBSHIRANI, R. (1996). Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. Ser. B,
58, 267-288.
TOURANGEAU, R., BRICK, M. J., LOHR, S. AND LI, J. (2017). Adaptive and responsive survey designs: a
review and assessment. J. R. Stat. Soc. Ser. A, 180, 203-223.
VALLIANT, R. AND DEVER, J. A. (2011). Estimating propensity adjustments for volunteer web surveys.
Sociol. Methods Res., 40, 105-137.
VERRET, F., RAO, J. N. K. AND HIDIROGLOU, M. H. (2015). Model-based small area estimation under
informative sampling. Surv. Methodol., 41, 333-347.
WANG, W., ROTHSCHILD, D., GOEL, S. AND GELMAN, A. (2015). Forecasting elections with non-
representative polls. Int. J. Forecast., 31, 980-991.
WILLIAMS, D. AND BRICK, M. J. (2018). Trends in U. S. face-to-face household survey nonresponse and
level of effort. J. Surv. Statist. Methodol., 6, 186-211.
WOODRUFF, R. S. (1952). Confidence intervals for medians and other position measures. J. Am.
Stat. Assoc., 47, 635-646.
WU, C. AND SITTER, R. R. (2001). A model-calibrated approach to using complete auxiliary
information from survey data. J. Am. Stat. Assoc., 96, 185-193.
YANG, S., KIM, J. K. AND SONG, R. (2019). Doubly robust inference when combining probability and
non-probability samples with high-dimensional data. Technical Report: arXiv:1903.05212v1 [stat.ME].
YBARRA, L. M. R. AND LOHR, S. L. (2008). Small area estimation when auxiliary information is
measured with error. Biometrika, 95, 919-931.