
Sankhyā B: The Indian Journal of Statistics
https://doi.org/10.1007/s13571-020-00227-w
© Indian Statistical Institute 2020

On Making Valid Inferences by Integrating Data from Surveys and Other Sources

J. N. K. Rao
Carleton University, Ottawa, Canada

Abstract
Survey samplers have long been using probability samples from one or more
sources in conjunction with census and administrative data to make valid and
efficient inferences on finite population parameters. This topic has received a lot of
attention more recently in the context of data from non-probability samples such
as transaction data, web surveys and social media data. In this paper, I will
provide a brief overview of probability sampling methods first and then discuss
some recent methods, based on models for the non-probability samples, which
could lead to useful inferences from a non-probability sample by itself or when
combined with a probability sample. I will also explain how big data may be used
as predictors in small area estimation, a topic of current interest because of the
growing demand for reliable local area statistics.

Keywords. Big data, Dual frames, Probability sampling, Non-probability
sampling, Sample selection bias, Small area estimation

1 Introduction

Sample surveys have long been conducted to obtain reliable estimates of finite
population descriptive parameters, such as totals, means and quantiles, and asso-
ciated standard errors and normal theory confidence intervals with large enough
sample sizes. Probability sampling designs and repeated sampling inference, also
called the design-based approach, have played a dominant role, especially in the
production of official statistics, ever since the publication of the landmark paper by
Neyman (1934) which laid the theoretical foundations of the design-based ap-
proach. The design-based approach was almost universally accepted by practicing
official statisticians. The early landmark contributions to the design-based ap-
proach, outlined in Section 2.1, were mostly motivated by practical and efficiency
considerations. Methods for combining two or more probability samples were also
developed to increase the efficiency of estimators for a given cost (Section 3). Data
collection issues have received considerable attention in recent years to control costs
and maintain response rates using new modes of data collection (Section 4.1).
In spite of the efforts under the probability sampling setup to gather
designed data, response rates are decreasing and costs are rising (Williams
and Brick 2018). At the same time, due to technological innovations, large
amounts of inexpensive data, called big data or organic data, and data from
non-probability samples (especially online panels) are now accessible. Big data
include the following types: administrative data, transaction data, social media
data, internet of things data, data scraped from websites, sensor data and satellite
images. Big data and data from web panels have the potential of providing
estimates in near real time, unlike traditional data derived from probability
samples. Statistical agencies publishing official statistics are now taking mod-
ernization initiatives by finding new ways to integrate data from a variety of
sources and produce “reliable” near real-time official statistics. However, naïve
use of such data can lead to serious sample selection bias (Section 4.2) and
without adjustment to reduce selection bias it can lead to the "big data
paradox: the bigger the data, the surer we fool ourselves" (Meng 2018). Sections
5, 6 and 7 discuss methods for attempting to reduce sample selection bias under
different scenarios. These methods build on the techniques discussed in Sections
2, 3 and 4 for making efficient inferences from probability samples.
Big data has the potential of providing good predictors for models useful in
small area estimation (Section 8). In the case of small areas with very small sample
sizes within areas, direct estimators based on traditional probability sampling do
not provide adequate precision and it becomes necessary to use model-based
methods to “borrow strength” across related areas through linking models based
on suitable predictors, such as census data and administrative records. Big data
predictors can be useful as additional predictors in the linking models.

2 Probability Sampling

2.1 Some early landmark contributions

Probability sampling from a
finite population of units assumes that every unit i in a finite population U of
size N has a known non-zero probability of inclusion in the sample, denoted by
πi. Focusing on estimating the population total Y = ∑i∈U yi or the mean
Ȳ = Y/N of a variable of interest y from a probability sample A, Neyman (1934)
laid the theoretical foundations of the design-based approach. He introduced
the ideas of efficiency of design-unbiased estimators, depending on the inclusion
probabilities, and optimal sample size allocation in his theory of stratified
simple random sampling. He also demonstrated that balanced purposive
sampling may perform poorly if the underlying model assumptions are violated.
This result is of importance in the context of making inferences from non-
probability samples based on implicit or explicit models.
A general design-unbiased estimator of the population total is of the form
Ŷ = ∑i∈A di yi, where di = 1/πi is called the design weight (Horvitz and
Thompson 1952; Narain 1951). This estimator may be expressed as
Ŷ = ∑i∈U (ai di) yi, where ai is the indicator variable for the inclusion of
population unit i in the sample and is equal to 1 with probability πi and 0 with
probability 1 − πi. In the case of stratified simple random sampling, the design
weights are equal to the inverses of sampling fractions within strata and they
may vary across strata. Design-unbiased estimators of the variance of Ŷ,
denoted by v(Ŷ) = s²(Ŷ), can be obtained provided all the joint inclusion
probabilities are positive, as is the case under stratified simple random
sampling. Sampling practitioners often use the coefficient of variation,
C(Ŷ) = s(Ŷ)/Ŷ, as a measure of precision of the estimator Ŷ and construct
95% normal theory confidence intervals on the total as
{Ŷ − 2s(Ŷ), Ŷ + 2s(Ŷ)}. This assumes large samples and single-stage
sampling. Neyman noted that for large enough samples the frequency of
errors in the confidence statements based on all possible stratified simple
random samples that could be drawn does not exceed the limit prescribed in
advance "whatever the unknown properties of the finite population".
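The estimator, its variance estimator under stratified simple random sampling, the coefficient of variation C(Ŷ), and the normal-theory interval Ŷ ± 2s(Ŷ) can be sketched in a few lines of code. This is a minimal illustration with entirely made-up stratum sizes and y-values, not a general-purpose implementation:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical stratified population: two strata of sizes N_h,
# with n_h units drawn by simple random sampling within each stratum.
N_h = np.array([800, 200])
n_h = np.array([40, 20])

Y_hat = 0.0
var_hat = 0.0
for N, n in zip(N_h, n_h):
    y = rng.normal(50, 10, size=n)        # sampled y-values in this stratum
    d = N / n                             # design weight d_i = 1/pi_i
    Y_hat += d * y.sum()                  # stratum contribution to the total
    # Unbiased variance estimator of the stratum total under SRS without
    # replacement: N^2 (1 - n/N) s_y^2 / n.
    var_hat += N**2 * (1 - n / N) * y.var(ddof=1) / n

se = np.sqrt(var_hat)
cv = se / Y_hat                           # coefficient of variation C(Y_hat)
ci = (Y_hat - 2 * se, Y_hat + 2 * se)     # ~95% normal-theory interval
print(round(Y_hat), round(cv, 4))
```

The loop mirrors the text: weights are inverses of the within-stratum sampling fractions, and the interval uses the ±2 standard-error convention quoted above.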
The above attractive features of probability sampling and design-based
inference were recognized soon after Neyman's paper appeared. This, in turn,
led to the use of probability sampling in a variety of sample surveys covering
large populations, and to theoretical developments of efficient sampling designs
minimizing cost subject to specified precision of the estimators, and associated
estimation theory. I will now list a few important post-Neyman theoretical
developments under the design-based approach.
Mahalanobis used probability sampling designs for surveys in India as early
as 1937. His classic 1944 paper (Mahalanobis 1944) rigorously formulated cost
and variance functions for the efficient design of sample surveys based on
probability sampling. He studied simple random sampling, stratified random
sampling, single-stage cluster sampling and stratified cluster sampling for the
efficient design of sample surveys of different crops in Bengal, India. He also
extended the theoretical setup to subsampling of sampled clusters (which he
named two-stage sampling). He was instrumental in establishing the National
Sample Survey (NSS) of India and the world-famous Indian Statistical
Institute. The NSS is the largest multi-subject continuing survey, with full-time
staff using personal interviews for socio-economic surveys and physical
measurements for crop surveys. He attracted brilliant survey statisticians to
work with him, including D. B. Lahiri, M. N. Murthy and Des Raj. Hall (2003)
provides a scholarly historical account of the pioneering contributions of
Mahalanobis to the early developments of survey sampling theory and methods
in India. P. V. Sukhatme, who studied under Neyman, also made pioneering
contributions to the design and analysis of large-scale agricultural surveys in
India, using stratified multi-stage sampling.
Under the leadership of Morris Hansen, survey statisticians at the U. S.
Census Bureau made fundamental contributions to the theory and methods of
probability sampling and design-based inference, during the period 1940-1960.
The most significant contributions from the Census Bureau group include the
development of the basic theory of stratified two-stage cluster sampling with
one cluster (or primary sampling unit) within each stratum drawn with prob-
ability proportional to size (PPS), where size measures are obtained from
external data sources such as a recent census (Hansen and Hurwitz 1943).
Sampled clusters may then be subsampled at a rate to provide an overall self-
weighting (equal overall probability of selection) sample design. Such probabil-
ity sampling designs can lead to significant variance reduction by controlling the
variability arising from unequal cluster sizes. Another major contribution was
the introduction of rotation sampling with partial replacement of households to
handle response burden in surveys repeated over time, such as the monthly U. S.
Current Population Survey (CPS) for measuring monthly unemployment rates
and month-to-month changes in the labor force. Hansen et al. (1955) developed
efficient composite estimators of level and change under rotation sampling. This
methodology is widely used in large-scale rotating panel surveys. Yet another
significant contribution is the development of a unified approach for construct-
ing confidence intervals for quantiles, such as the median, applicable to general
probability sampling designs (Woodruff 1952). This method remains a corner-
stone for making inference on the population quantiles.
As described above, population information on auxiliary variables related to
a variable of interest (the study variable) is often used at the design stage for
stratification or PPS sampling or both. Use of information on auxiliary variables
at the estimation stage was also advocated. In particular, ratio estimation based
on a single auxiliary variable x correlated with y has been widely used. The value
xi is obtained for each unit in the sample and the population total X = ∑i ∈ Uxi
must be known. The sample values of x are either directly observed along with
the associated values of yor through
 record linkage. A ratio estimator of the
b b b
total is of the form Y r ¼ Y =X X ¼ RX, b where X b ¼ ∑i∈A d i x i is the design-
unbiased estimator of the known total X. It is not necessary to know all the
individual population values xi to implement the ratio estimator, unlike the use
of xi at the design stage. The ratio estimator also enjoys the calibration property,
ON MAKING VALID INFERENCES BY... 5

namely it reduces to the known total X when yi is replaced by xi. It can lead to
considerable gain in efficiency when yi is roughly proportional to xi . The above
desirable properties and the computational simplicity of the ratio estimator led
to extensive use in surveys based on probability sampling. The well-known

Hajek estimator  the total is a special case of the ratio estimator. It is given
of
b b b
by Y H ¼ Y =N N, where N b ¼ ∑i∈A d i . The Hájek estimator of the popula-
b
tion mean is given by Y H ¼ Y b =N
b and it is widely used in practice and it does
not require the knowledge of N, unlike the unbiased estimator Y b =N.
The ratio estimator is not generally design unbiased but it is design consis-
tent for large samples. Survey researchers do not insist on design unbiasedness
(contrary to statements in some papers on inferential issues of sampling theory)
because it “often results in much larger MSE than necessary (Hansen et al.
  is
1983)”. A design consistent estimator of the variance of the ratio estimator
simply obtained by replacing the variable yi in the variance estimator v Y b by
b
the residuals ei ¼ y i −Rx i for i ∈ A. Regression estimation was also studied early
on but it was not widely used due to computational limitations in those days.
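A minimal numerical sketch of the ratio estimator and its residual-based linearization variance estimator may help fix ideas. The population size, sample values, and known total X below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical SRS of size n from a population of size N_pop;
# x_i is observed for each sampled unit and the total X is known.
N_pop, n = 1000, 50
x = rng.uniform(10, 30, size=n)
y = 2.0 * x + rng.normal(0, 2, size=n)   # y roughly proportional to x
X_total = 20000.0                        # known population total of x
d = N_pop / n                            # common design weight under SRS

Y_hat = d * y.sum()                      # basic design-unbiased estimator
X_hat = d * x.sum()
R_hat = Y_hat / X_hat
Y_ratio = R_hat * X_total                # ratio estimator (Y_hat/X_hat) * X

# Linearization variance: replace y_i by the residuals
# e_i = y_i - R_hat * x_i in the SRS variance estimator of a total.
e = y - R_hat * x
v_ratio = N_pop**2 * (1 - n / N_pop) * e.var(ddof=1) / n
print(round(Y_ratio), round(np.sqrt(v_ratio)))
```

Replacing y by x in the same formulas reproduces the calibration property: the estimator then returns exactly X_total.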
In the early days of survey sampling, surveys were generally much simpler
than they are today and data were collected through personal interviews or
through mail questionnaires (possibly followed by personal interviews of a
subsample of non-respondents). Physical measurements, such as those advo-
cated by Mahalanobis for estimating crop yields in India, were also used.
Response rates were generally high and the 1970s is generally regarded as the
“Golden Age of Survey Research” (Singer 2016).
Together with the consolidation of the basic design-based sampling theory,
attention turned to measurement or response errors in the 1940 U. S. census
(Hansen et al. 1951). Under additive response error models with minimal model
assumptions on the observed responses treated as random variables, the total
variance of the estimator Ŷ is decomposed into sampling variance, simple
response variance and correlated response variance (CRV) due to interviewers.
The CRV can dominate the total variance if the number of interviews per
interviewer is large. Partly for this reason, self-enumeration by mail was first
introduced in the 1960 U. S. Census to reduce CRV. This is indeed a success
story of theory influencing practice.
Prior to the work of Hansen et al. (1951), Mahalanobis (1946) proposed the
ingenious method of interpenetrating subsamples to assess both sampling errors
and variable interviewer errors in socio-economic surveys. He showed that both
the total variance and the interviewer variance component can be estimated by
assigning the subsamples at random to the interviewers. Kalton (2019) notes
that response errors remain a major concern although much research on total
survey error has been conducted since the pioneering work of Mahalanobis
(1946) and Hansen et al. (1951).
Nonresponse in probability surveys was also addressed in early survey
sampling development. Hansen and Hurwitz (1946) proposed two-phase sam-
pling for following up initial nonrespondents. In their application, the sample is
contacted by mail in the first phase and a subsample of nonrespondents is then
subjected to personal interview, assuming complete response or negligible
nonresponse at the second phase. This method is currently used in the Amer-
ican Community Survey and it can also be regarded as an early application of
what is now fashionable as a responsive or adaptive design (Tourangeau et al.
2017).

2.2 Model-assisted calibration

Formal working linear regression models, relating the study variable yi to a
vector xi of auxiliary variables with known population total X, were studied in
later years to develop efficient estimators of the total Y, called generalized
regression estimators (GREGs). The working linear regression model is of the
form yi = xi′β + εi, i ∈ U, with model errors εi assumed to be uncorrelated with
mean zero and variance proportional to a known constant ci. The GREG
estimator under the above working model is given by

Ŷgr = Ŷ + β̂d′(X − X̂),

where β̂d = (∑i∈A di xi xi′/ci)⁻¹ (∑i∈A di xi yi/ci) is the survey-weighted least
squares estimator of the regression parameter β. The GREG estimator is
design-consistent regardless of the validity of the working model (Fuller
1975). It can lead to significant gain in efficiency over the basic design-unbiased
estimator Ŷ if the working model provides a good fit to the sample data. The
GREG estimator reduces to the ratio estimator Ŷr in the special case of a scalar
x and regression through the origin with error variance proportional to x.
The GREG estimator can also be expressed as a weighted sum ∑i∈A wi yi,
where the weight wi = di gi with gi = 1 + (X − X̂)′(∑i∈A di xi xi′/ci)⁻¹ xi/ci. The
adjustment factor gi ensures the calibration property ∑i∈A wi xi = X, which is
attractive to the user. Note that the GREG weights do not depend on the study
variable as long as the same linear regression model is used for all the study
variables y.
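The weighted-sum form can be sketched numerically. In this hypothetical example (made-up weights, auxiliary vectors, and totals), the g-adjusted weights are computed and compared against the regression form Ŷ + β̂d′(X − X̂); the two forms agree and the weights calibrate to X:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample with design weights d_i and a 2-dimensional
# auxiliary vector x_i = (1, x_i2) whose population total X is known.
n = 60
d = rng.uniform(10, 30, size=n)
x = np.column_stack([np.ones(n), rng.uniform(0, 5, size=n)])
y = x @ np.array([3.0, 2.0]) + rng.normal(0, 1, size=n)
X_total = np.array([1200.0, 3100.0])     # assumed known totals of (1, x)
c = np.ones(n)                           # variance constants c_i

T = (x * (d / c)[:, None]).T @ x         # sum_A d_i x_i x_i' / c_i
beta_d = np.linalg.solve(T, (x * (d / c)[:, None]).T @ y)

# Adjustment factors g_i = 1 + (X - X_hat)' T^{-1} x_i / c_i
g = 1.0 + (X_total - d @ x) @ np.linalg.solve(T, (x / c[:, None]).T)
w = d * g                                # GREG weights w_i = d_i g_i
Y_greg = w @ y

# Same estimator in regression form: Y_hat + beta_d'(X - X_hat)
Y_greg_alt = d @ y + beta_d @ (X_total - d @ x)
print(round(Y_greg, 2))
```

Checking `w @ x` against `X_total` verifies the calibration property numerically; the weights do not depend on y, so the same w would serve every study variable.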
The use of a common weight wi for all study variables may be justified
through a model-free calibration approach in which the weights wi for i ∈ A are
obtained by minimizing a chi-square distance measure between di and wi
subject to calibration constraints ∑i∈A wi xi = X. The resulting calibration
weights are in fact identical to the GREG weights wi based on the linear
regression model (Deville and Särndal 1992). Calibration estimation has
attracted the attention of users due to its model-free property and its ability
to produce a common set of weights. Särndal (2007) says "Calibration has
established itself as an important methodological instrument in large scale
production of statistics". Van den Brakel and Bethlehem (2008) note that the
use of common calibration weights for estimation in multipurpose surveys
makes the calibration method "very attractive to produce timely official sta-
tistics in a regular production environment". The calibration approach has the
potential for adjusting for selection bias due to non-probability sampling, as
shown later.
Study variable-specific working models have also been studied, by assuming
a mean function Em(yi | xi) = h(xi) for i ∈ U, where Em denotes model expecta-
tion. In the parametric case h(xi) = h(xi, β) for a known function h(·), where β is
the model parameter. Wu and Sitter (2001) estimated β from the sample data
and used ĥi = h(xi, β̂) as the predictor of h(xi, β) for i ∈ U, assuming all the
population values xi are known, where β̂ is the estimator of the model param-
eter. The resulting model-assisted estimator of the total is given by

Ŷma = ∑i∈A di (yi − ĥi) + ∑i∈U ĥi    (1)

and the resulting calibration weights wi satisfy ∑i∈A wi ĥi = ∑i∈U ĥi. The
GREG estimator Ŷgr is a special case of (1) when h(xi, β) = xi′β. Breidt and
Opsomer (2017) review model-assisted estimation using modern prediction
methods, including non-parametric regression and machine learning.
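Estimator (1) works with any fitted mean function. As a sketch, the following uses a crude binned-mean smoother as a stand-in for the modern prediction methods mentioned above; the population and sample are simulated, and the estimator follows the difference-correction form of (1) exactly:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical population with scalar auxiliary x known for all N units.
N = 500
x_pop = rng.uniform(0, 10, size=N)
y_pop = 5 * np.sin(x_pop) + x_pop + rng.normal(0, 1, size=N)

# SRS of size n, so d_i = N/n for all sampled units.
n = 100
idx = rng.choice(N, size=n, replace=False)
x_s, y_s = x_pop[idx], y_pop[idx]
d = N / n

# Fit a crude binned-mean smoother h_hat on the sample data
# (a stand-in for any flexible prediction method).
bins = np.linspace(0, 10, 11)
which = np.clip(np.digitize(x_s, bins) - 1, 0, 9)
bin_means = np.array([y_s[which == b].mean() if (which == b).any() else 0.0
                      for b in range(10)])

def h_hat(x):
    b = np.clip(np.digitize(x, bins) - 1, 0, 9)
    return bin_means[b]

# Model-assisted estimator (1): sum_A d_i (y_i - h_i) + sum_U h_i
Y_ma = d * (y_s - h_hat(x_s)).sum() + h_hat(x_pop).sum()
print(round(Y_ma, 1))
```

The design-weighted residual term corrects the population sum of predictions, which is what makes the estimator design-consistent even when the smoother is a poor model.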
In practice, available auxiliary variables might include several categorical
variables with many categories and significant interactions may exist among
the variables. For example, the Quarterly Census of Employment and Wages
(QCEW) in the United States is used as a sampling frame for many establish-
ment surveys. The frame is compiled from administrative records and contains
several categorical variables such as industry code, firm ownership type, loca-
tion and size class of the establishment (McConville and Toth 2018). Clearly,
the linear regression working model including all categorical variables and their
two factor interactions as auxiliary variables can lead to unstable GREG
weights and even some negative weights. In this case, the GREG estimator
can be even less efficient than the basic estimator Ŷ. McConville and Toth
(2018) proposed a model-assisted regression tree approach that can automat-
ically account for relevant interactions, leading to significantly more efficient
design-consistent post-stratified estimators than the GREG estimator obtained
after variable selection from the pool of categorical variables. The GREG
estimator performed poorly when the pool for variable selection also included
all the two factor interactions. The tree weights are always positive and their
variability is considerably smaller than the variability of the GREG weights.
Regression trees use a recursive partitioning algorithm that partitions the space
of predictor variables into a set of boxes (post strata) B1, ..., BL such that the
sample units within a box are homogeneous with respect to the study variable.
The model-assisted estimator (1) with unspecified mean function reduces to the
regression tree estimator of the form

Ŷrt = ∑l=1,…,L Nl (∑i∈Al di yi / ∑i∈Al di)    (2)

where Al is the set of sample units and Nl is the number of population units
belonging to poststratum l. It follows from (2) that the weight attached to a
sample unit belonging to poststratum l is Nl di / ∑i∈Al di, and the regression tree
estimator (2) calibrates to the poststrata counts Nl. Note that all the popula-
tion values xi need to be known in order to obtain the poststrata counts Nl,
l = 1, ..., L, unlike in the case of the GREG estimator which depends only on the
vector of population totals X. Also, the tree boxes B1, ..., BL depend on the
values of the study variable in the sample. Establishment surveys typically
provide population values of the vector x at the unit level, unlike socio-
economic surveys. Regression tree methods can be very useful in the context
of utilizing multiple sources of auxiliary data provided they can be linked to the
sample data on the study variables. McConville and Toth (2018) also provide a
design consistent estimator of the variance of the regression tree estimator (2),
based on Taylor linearization.
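Estimator (2) and its tree weights can be sketched directly; the poststrata (tree boxes), their population counts Nl, and the sampled values below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical poststrata (tree "boxes") B_1,...,B_L with known
# population counts N_l, and sample data with design weights d_i.
N_l = np.array([400, 350, 250])
L = len(N_l)

Y_rt = 0.0
weights = []
for l in range(L):
    n_l = 20
    y = rng.normal(10 * (l + 1), 2, size=n_l)  # sampled y-values in box l
    d = rng.uniform(5, 25, size=n_l)           # design weights in box l
    w = N_l[l] * d / d.sum()                   # tree weights N_l d_i / sum d_i
    weights.append(w)
    Y_rt += (w * y).sum()                      # box-l term of estimator (2)

print(round(Y_rt))
```

Summing each box's weights reproduces its count Nl (the calibration noted in the text), and the weights are positive by construction, in contrast to GREG weights.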
In some applications, the number of potential auxiliary variables could be
large, often correlated, and several of those variables may not have significant
relationship with the study variable. In such cases, it is desirable to select
significant auxiliary variables first and then apply GREG to the reduced
model. McConville et al. (2017) used a weighted lasso method which can
perform model selection and coefficient estimation simultaneously by shrinking
weakly related auxiliary variables to zero through a penalty function
(Tibshirani 1996). Note that the model selection depends on the choice of
tuning parameter(s) in the penalty function. The lasso GREG can also be
expressed as a weighted sum but the weights depend on the study variable
and all the population values xi must be known, as in the case of tree GREG.
They also established design consistency of the lasso GREG estimator,
assuming that the number of potential auxiliary variables is fixed as the
sample size increases. Ta et al. (2019) allowed the number of auxiliary variables
to increase as the sample size increases and established design consistency of the
lasso GREG estimator. McConville et al. (2017) conducted a simulation study
on a forestry population dataset with a marginal model based on 12 potential
covariates and a more complex model with 78 potential covariates based on
the main effects and interactions of those variables. As expected, lasso
GREG led to very large gains in efficiency over the customary GREG in
the second case and significant gains in the first case when the sample size is
not large. However, the efficiency of customary GREG can be improved in
the second case by doing variable selection first and then using GREG on the
selected variables.

3 Combining two independent probability samples

3.1 Double sampling

We now turn to making valid inferences by
combining two independent probability samples, possibly drawn from
different frames covering the same population. We focus on a simple case
where a large sample A(1) is drawn from the first frame and inexpensive
auxiliary variables z are observed and a much smaller sample A(2) is drawn
from the second frame and both the study variable y and z are observed. We
denote the design weights for the two samples by d1i, i ∈ A(1) and d2i, i ∈ A(2)
respectively, the corresponding basic design-unbiased estimators of the total Z
of the common variable z by Ẑ1 and Ẑ2 respectively, and the estimator of the
total of the study variable y from the sample A(2) by Ŷ2. The aim is to get
more efficient estimators of Y by taking advantage of Ẑ1 obtained from the
large sample A(1). C. Bose of the Indian Statistical Institute studied regression
estimation of the total Y using such double sampling designs 75 years ago (Bose
1943). However, it is only in recent years that those designs have received
considerable attention in the context of combining data from two or more
independent probability samples. Hidiroglou (2001) described a real application
of the double sampling design in Statistics Canada's Survey of Employment,
Payrolls and Hours (SEPH). In this application, the large sample A(1) was
selected from a Canada Customs and Revenue Agency administrative data file
and auxiliary variables, z, which include the number of employees and the total
amount of payroll, were collected. A much smaller sample A(2) was indepen-
dently selected from the Statistics Canada Business Register and the study
variables y, number of hours worked by employees and summarized earnings,
were collected in addition to the variables z. A GREG-type estimator is of the
form Ŷ2,gr = Ŷ2 + (Ẑ1 − Ẑ2)′β̂, where β̂ is estimated either by a weighted least
squares estimator or an "optimal" estimator minimizing the variance of Ŷ2,gr
(Hidiroglou 2001). Note that the variance of Ŷ2,gr is asymptotically larger than
the corresponding GREG with known totals Z because of the extra variability
due to estimating Z from the larger sample. We refer the reader to Guandalini
and Tillé (2017) for more recent work on double sampling designs involving two
independent probability samples and relevant references to past work.
Kim and Rao (2012) studied a related problem of integrating two indepen-
dent probability samples. Here the primary interest was to create a single
synthetic data set of proxy values ỹi associated with zi in the larger sample
A(1) to produce a projection estimator of the total Y. The proxy values ỹi,
i ∈ A(1), are generated by first fitting a working model relating y to z in the
smaller A(2) sample dataset {(yi, zi), i ∈ A(2)} and then predicting the yi
associated with zi, i ∈ A(1). This approach facilitates the use of only the
synthetic data and associated design weights d1i reported in survey 1. Kim and
Rao (2012) identified the conditions for the projection estimator
Ŷ1p = ∑i∈A(1) d1i ỹi to be asymptotically design consistent. Variance
estimation is also considered.
Schenker and Raghunathan (2007) reported several applications of the syn-
thetic data approach using a parametric model-based method to estimate the
total Y. In one application, sample A(2) observed both self-reporting health
measurements zi and clinical measurements from physical examination yi and
the much larger survey A(1) had only self-reported measurements zi. Only the
imputed, or synthetic, data from sample A(1) and associated survey weights
are released to the users of sample A(1), thus minimizing disclosure risk (Reiter
2008).
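The projection estimator can be sketched with a simulated linear working model fitted on A(2); all data below are invented, and the linear fit is just one possible working model:

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical samples: small A(2) observes (y, z); large A(1) has only z.
n1, n2, N = 500, 50, 20000
z2 = rng.uniform(0, 10, size=n2)
y2 = 1.5 + 2.0 * z2 + rng.normal(0, 1, size=n2)
z1 = rng.uniform(0, 10, size=n1)
d1 = np.full(n1, N / n1)           # design weights for A(1)

# Fit the working model y ~ z on A(2), then predict proxies for A(1).
coef = np.polyfit(z2, y2, deg=1)   # [slope, intercept]
y_tilde = np.polyval(coef, z1)     # proxy values for i in A(1)

# Projection estimator: weighted total of the synthetic values in A(1)
Y_1p = d1 @ y_tilde
print(round(Y_1p))
```

Only the synthetic values ỹi and the weights d1i are needed to compute Ŷ1p, which is what makes the released synthetic file sufficient for users, as noted in the text.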

3.2 Dual frame sampling designs

In this section, we give a brief account of combining data on a study variable
from samples drawn from
two different frames. There are two scenarios: (1) Both frames are incom-
plete but their union is a complete frame, as often happens with list frames;
and (2) frame 1 is complete (such as an area frame) and frame 2 is
incomplete (such as a list frame). We assume that we can correctly ascer-
tain to which frames a sample unit belongs. For simplicity, we focus on the
second case since it has relevance to combining data from a non-probability
sample with data from a probability sample drawn from the complete
frame (see Section 6). The dual frame approach is efficient when it is
cheaper to sample from frame 2 or when a large proportion of a rare
population of interest belongs to frame 2.
Denote the sample drawn from frame 1 as A and the sample from frame
2 as B, and the corresponding design weights as d1i and d2i, respectively. We
can express the population total Y as Y = ∑i ∈ U(1 − δi)yi + ∑i ∈ Uδiyi, where
δi = 1 if the unit i belongs to frame 2 and δi = 0 otherwise. Using this
representation, a class of estimators that combines data from the two
samples is given by

ŶH = ∑i∈A d1i (1 − δi) yi + p (∑i∈A δi d1i yi) + q (∑i∈B d2i yi),  p + q = 1.   (3)

For a fixed p, the dual frame estimator ŶH can be written as a weighted sum
with weights not depending on the study variable, that is, the same weight is
used for all study variables. A simple choice is p = 0.5 and the resulting
estimator is called a multiplicity estimator. Hartley (1962) obtained an optimal
value of p by minimizing the variance of the dual frame estimator ŶH, but the
optimal p will depend on the values of the study variable in the population. A
screening dual frame estimator is obtained by letting p = 0 in (3). In this case,
the data on units from sample A belonging to frame 2 are not collected and this
could lead to cost saving if it is cheaper to collect data from the incomplete
frame 2. For example, in agricultural surveys, farms belonging to the list frame
2 are removed from the area frame 1 before sampling commences and this could
lead to considerable cost saving since the list frame is cheaper to sample and it
contains the largest farms (Lohr 2011). An application of the screening dual
frame estimator to combining a non-probability sample with a probability
sample is given in Section 6.
Lohr (2011) provides an excellent account of dual frame sampling theory
and its extension to more than two frames. The relative merits of several
alternative estimators of the total are discussed. She also presents replication-
based variance estimators, in particular jackknife variance estimators. The
effect of errors in ascertaining to which frame a sample unit belongs is also
studied.

4 Inference from non-probability samples

4.1 Present and future of probability sampling

Many important advances in probability sampling theory and methodology have taken place
since the landmark contributions of the early period. Issues addressed include
efficient sample designs and analysis of survey data taking account of survey
design features. Mixed mode data collection methods, such as combining face-
to-face and web, have been proposed to control costs and maintain good
coverage and response rates (de Leeuw 2005). Also, responsive or adaptive
designs (Groves and Heeringa 2006) and paradata collection to increase re-
sponse rates and adjust for non-response bias and measurement errors have
been studied. Kalton (2019) notes that the new work on responsive designs and
paradata collection “have not had great success in counteracting falling
12 J. N. K. Rao

response rates and increasing costs”. Other significant contributions include


randomized response sampling methods to handle sensitive questions
(Chaudhuri and Christofides 2013), and adaptive or network sampling when
serviceable frames are not available (Thompson 2002).
Kalton (2019) mentions the use of quasi-probability sample designs that are
not strictly based on probability sampling. One example is quota sampling
which may be “described as stratified sampling with a more or less nonrandom
selection of units within strata” (Cochran 1977). It is often used in marketing
research to reduce cost and time. A requirement of quota sampling is that the
population counts in quota cells need to be known accurately. Other examples
are respondent driven sampling and venue-based sampling to access hard to
survey populations (Kalton 2019). Note that the data generated by probability
sampling or quasi-probability sampling are “designed data” collected to address
pre-specified purposes unlike “organic data” (Groves 2011) extracted from
social media (Facebook, Google and Twitter), unrestricted web surveys and
other sources such as commercial transactional data and administrative
records.
Steadily falling response rates, increasing costs and response burden associ-
ated with traditional sample surveys have become major concerns in highly
developed countries, especially for socio-economic surveys. On the other hand,
the low cost of obtaining non-probability samples of very large sizes through
self-reporting Internet web surveys, social media and other external sources and
the speed with which estimates can be produced from those data are very
attractive to statistical agencies producing official statistics. As a result, those
agencies are now taking modernization initiatives to find new ways to integrate
data from a variety of sources and to produce “real-time” official statistics.
Citro (2014) says “official statistical offices need to move from the probability
sample survey paradigm for the past 75 years to a mixed model data source
paradigm for the future”. Holt (2007) listed five formidable challenges for
official statistics: “wider, deeper, better, quicker and cheaper”. Citro (2014)
added “less burdensome” and “more relevant”. It is doubtful that we will even
come close to achieving those goals in the near future.
Administrative records are not primarily collected for statistical purposes
and there is no direct control by the statistical agency, unlike for survey data.
However, in some cases administrative data may be used as a substitute for
some items in surveys based on probability samples to lower response burden
and reduce cost. For example, in the Canada Income Survey and the Census of
Population, basic income questions are skipped by accessing tax records. Ad-
ministrative records are more current than census data and can be more
effective as auxiliary variables when combined with survey data, especially in

small area estimation (Section 8). Kalton (2019) notes that administrative data
may provide longitudinal data both for the period before and the period after
the survey data collection. Limitations of data from administrative sources
include possible incomplete coverage of the target population and confidential-
ity and privacy issues. Further, in the United States, sharing some administra-
tive records by federal agencies is not allowed. Also, for some records held by
states and localities, there is no mandate or incentive to share data with federal
statistical agencies. Administrative data should be subject to the same scrutiny
as survey data with regard to response errors, all the more so because they are
usually compiled by large clerical staffs with uncertain training in data quality.
Given the current initiatives to modernize official statistics through exten-
sive use of nonprobability samples and big data, one might rightly ask the
question “Are probability surveys bound to disappear for the production of
official statistics?” (Beaumont 2019). To answer this question, we should
first examine the effects of sample selection bias in non-probability samples
(Section 4.2) and the use of models to reduce selection bias. If selection bias
cannot be reduced sufficiently through modeling of non-probability samples,
then we may examine the utility of combining non-probability samples with
probability samples in an attempt to address sample selection bias (Sections 5
and 6). Also, many major surveys conducted by federal agencies, such as
monthly labor force surveys or business surveys, will continue to use probability
sampling and efficient estimators based on model-assisted methods using aux-
iliary information. Kalton (2019) says “it is unlikely that social surveys will be
replaced by administrative records, although these data can be valuable addi-
tion to surveys” and “quality of estimates from internet surveys is a concern”.
Couper (2013) notes some limitations of big data, including that often only a
single variable and few covariates are observed, data on relationships are lean,
and demographic information is lacking for a large proportion of the units. However, big data may be
potentially useful in providing predictors for models used for small area esti-
mation (Section 8).

4.2 Effect of selection bias

We first consider the case of a non-probability sample B of size $N_B$ with data
$\{(i, y_i), i \in B\}$, assuming no measurement errors. Let $\delta_i = 1$ if the
population unit i belongs to B and $\delta_i = 0$ otherwise. In the absence of
supplementary information, the estimator of the population mean $\bar{Y}$ is the
sample mean $\bar{y}_B = N_B^{-1}\sum_{i\in U}\delta_i y_i$ and, following Hartley
and Ross (1954), its estimation error $\bar{y}_B - \bar{Y}$ may be expressed as the
product of three terms: (1) $\mathrm{corr}(\delta, y) = \rho_{\delta,y}$, called
data quality; (2) the square root of $(1-f_B)/f_B$ with $f_B = N_B/N$, called data
quantity; and (3) the square root of the population variance $\sigma_y^2$, called
problem difficulty (Meng 2018). The data quality term

plays the key role in determining the bias and it is approximately zero on the
average under simple random sampling (assuming complete response and
coverage). Note that we do not have control on the participation mechanism
under non-probability sampling, unlike in the case of probability sampling.
We now turn to the model MSE of $\bar{y}_B$, assuming the indicators $\delta_i$ are
random with unknown nonzero participation probabilities $q_i$. Then, the model
MSE is given by

\[ \mathrm{MSE}_\delta(\bar{y}_B) = E_\delta\big(\rho^2_{\delta,y}\big)\cdot f_B^{-1}(1-f_B)\cdot \sigma_y^2, \tag{4} \]

where $E_\delta$ denotes the expectation with respect to the unknown participation
mechanism. Meng (2018) named the first term in (4) the data defect index,
$D_I = E_\delta(\rho^2_{\delta,y})$. Note that $N_B$ could be very large in the
context of big data; for example, $N_B$ = 5 million and $N$ = 10 million gives a
sampling fraction $f_B = 1/2$. Yet the MSE given by (4) depends only on the
sampling fraction $f_B$. As a result, only a relatively small simple random sample
of size n is needed to achieve the same MSE. In the above example, if the average
correlation $E_\delta(\rho_{\delta,y})$ is as small as 0.05, then the "effective"
sample size n of the big data is less than 400. Another important result is that
the width of the confidence interval, treating the non-probability sample as a
simple random sample, goes to zero as $N_B$ and N increase such that the ratio
$f_B$ tends to some positive limiting value; however, the interval has only a
small chance of covering the true value $\bar{Y}$ because it is centered at a
wrong value (Meng 2018). This phenomenon is familiar under probability sampling
when the ratio of bias to standard error is large. For example, Cochran (1977,
p. 14) reports a coverage rate of 68% for a nominal 95% interval when the ratio of
bias to standard error is 1.50. For a design biased, but design consistent,
estimator, such as the ratio estimator, the bias ratio goes to zero as the sample
size increases.
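The three-factor identity behind (4) can be verified numerically. The sketch below simulates a hypothetical population in which participation depends on y (all distributions and parameter values are assumed purely for illustration) and computes the three factors using population (ddof = 0) moments.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical finite population; participation (delta) depends on y,
# inducing selection bias in the naive big data mean.
N = 100_000
y = rng.normal(50.0, 10.0, size=N)
prob = 1 / (1 + np.exp(-(y - 50.0) / 10.0))   # higher y -> more likely to self-select
delta = (rng.random(N) < prob).astype(float)

f_B = delta.mean()                            # realized fraction N_B / N
error = y[delta == 1].mean() - y.mean()       # estimation error of the naive mean

# Meng's (2018) decomposition: data quality x data quantity x problem difficulty.
rho = np.corrcoef(delta, y)[0, 1]             # data quality
quantity = np.sqrt((1 - f_B) / f_B)           # data quantity
sigma_y = y.std()                             # problem difficulty (population sd)
```

The product `rho * quantity * sigma_y` reproduces `error` exactly (up to floating point), which is why a huge but selection-biased sample can have a tiny "effective" size.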
If the non-probability sample B is subject to measurement errors and we
observe $y_i^* = y_i + \varepsilon_i$, where $\varepsilon_i$ is the measurement
error, then the expected bias of the estimator $\bar{y}_B^*$ is unchanged provided
$E(\varepsilon_i) = 0$; otherwise the bias will contain an additional term
$B_\varepsilon$, assuming a constant expected bias $E(\varepsilon_i) = B_\varepsilon$.
On the other hand, the MSE will contain two additional terms: the measurement
error variance term $\sigma_\varepsilon^2/N_B + B_\varepsilon^2$ and the interaction
term $2B_\varepsilon[(1-f_B)/f_B]^{1/2}E_\delta(\rho_{\delta,y})$, where
$\sigma_\varepsilon^2$ is the variance of the measurement error (Biemer 2018). If
$B_\varepsilon = 0$, then the interaction term is zero and the measurement error
variance term is negligible for large $N_B$. As a result, the contribution from
sample selection bias dominates the total MSE. It is unlikely that the expected
bias will be small for "found data" from online sources, such as Facebook, where
people may actively lie.

The above facts need to be taken into consideration when using non-probability
samples without some adjustment for selection bias. Reducing bias through
weighting is a common practice used in the context of bias due to nonresponse.
In the latter case, we have some information on the nonrespondents, such as
covariates observed on all the sample units, unlike in the case of non-probability
sampling. This additional information is used to estimate response probabilities,
which are then used to reduce bias through weighting. We discuss methods of
estimating participation probabilities in the case of non-probability sampling
and using them through weighting in Section 6.
Meng (2018) studied the effect of using the weighted mean
$\bar{Y}_{B\omega} = \sum_{i\in U}\omega_i\delta_i y_i/\sum_{i\in U}\omega_i\delta_i$
with arbitrary weights $\omega_i$. In this case, letting $\tilde{\delta}_i = \delta_i\omega_i$,
the first term in the estimation error $\bar{Y}_{B\omega} - \bar{Y}$ is changed to
$\mathrm{corr}(\tilde{\delta}, y) = \rho_{\tilde{\delta},y}$, the second term is
inflated by the factor $\{1 + CV_\omega^2/(1-f_B)\}^{1/2} \ge 1$, where $CV_\omega$
is the coefficient of variation (CV) of the $\omega_i$ for $i \in B$, and the third
term remains unchanged. Hence, if the weighting does not lead to a reduction in the
first term, the estimation error is in fact larger than without weights, and this
inflation is significant if the CV of the weights is large, as in the case of
nonresponse adjustment.
Under probability sampling, the estimator $\bar{Y}_{B\omega}$ is approximately
design unbiased, noting that the inclusion probabilities $\pi_i$ are known,
$\omega_i = d_i = \pi_i^{-1}$ and $E_\delta(\rho_{\delta,y}) \approx 0$. However, in
the case of non-probability sampling, the participation probabilities
$P(\delta_i = 1) = q_i$ are unknown and we need to estimate them through models, by
combining the non-probability sample B with an independent probability sample A
that observes some auxiliary variables in common with the non-probability sample.
is also observed in the probability sample and the units in the probability
sample that do not belong to the non-probability sample can be identified,
then the problem is similar to the screening dual frame estimator mentioned in
Section 3.2 and no model assumptions are needed. If only the nonprobability
sample B is observed, then the population information on the auxiliary vari-
ables associated with B is needed to make inference through models relating the
study variable to the auxiliary variables.

5 Study variable observed in both samples

In the ideal case, the study variable y is observed in the non-probability


sample B as well as in a probability sample A of size n independently selected
from the target population. We make the assumption that the units in sample
A that do not belong to sample B can be identified. This assumption is not met
in many applications, especially when the demographic information in sample

B is of poor quality. Using the indicator variable $\delta_i$, we can express the
total Y as $Y_B + Y_C$, where $Y_B = \sum_{i\in U}\delta_i y_i = \sum_{i\in B} y_i$
and $Y_C = \sum_{i\in U}(1-\delta_i)y_i$ are the totals for the units in the
sample B and the units not belonging to B, respectively.
This situation can be thought of as a population composed of two strata: the
stratum of the sample B that is completely enumerated and a stratum of the
elements not in B from which a probability sample is selected. It immediately
follows that a design unbiased direct estimator of the total is given by
$Y_B + \hat{Y}_C$, where $\hat{Y}_C = \sum_{i\in A} d_i(1-\delta_i)y_i$ is the
design unbiased estimator of $Y_C$ and $d_i$ are the design weights associated
with the probability sample. This estimator is a special case of the dual frame
screening estimator when all the units in the incomplete frame 2 are observed.
Sometimes, it may be desirable to reduce the big data sample size by taking
a large random sample from an extremely large big data sample B, as in the
case of transaction data. In this case, we need methods of drawing random
samples from a very large big data file stored in the fast memory of the
computer and whose size NB may not be known in advance. McLeod and
Bellhouse (1983) proposed a convenient algorithm for drawing simple random
samples from big data files.
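The one-pass idea, drawing a simple random sample from a file whose size is not known in advance, can be sketched with the classic reservoir scheme. This is a generic illustration, not the exact McLeod-Bellhouse algorithm.

```python
import random

def reservoir_sample(stream, n, seed=None):
    """One-pass simple random sample of size n from a stream whose
    length is unknown in advance (classic reservoir scheme)."""
    rng = random.Random(seed)
    reservoir = []
    for t, record in enumerate(stream):
        if t < n:
            reservoir.append(record)
        else:
            # Keep the new record with probability n / (t + 1), replacing
            # a uniformly chosen current member of the reservoir.
            j = rng.randrange(t + 1)
            if j < n:
                reservoir[j] = record
    return reservoir

sample = reservoir_sample(range(1_000_000), n=100, seed=42)
```

At the end of the pass, every record has inclusion probability n/N without N ever being needed up front.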
Since the sizes $N_B$ and $N_C = N - N_B$ are known, a more efficient post-stratified
estimator is given by $\hat{Y}_P = Y_B + N_C\,\hat{Y}_C/\hat{N}_C$, where
$\hat{N}_C = \sum_{i\in A} d_i(1-\delta_i)$. Kim and Tam (2018) showed that with a
simple random sample A, the post-stratified estimator achieves a large reduction in
the design variance compared to the design unbiased estimator
$\hat{Y} = \sum_{i\in A} d_i y_i$ based only on the probability sample A. In
particular, if the sampling fraction $f = n/N$ is small and the population variance
$\sigma_y^2$ and the variance of the population units not belonging to B, denoted
$\sigma_{C,y}^2$, are roughly equal, then $V(\hat{Y}_P)/V(\hat{Y}) \approx 1 - W_B$,
where $W_B = N_B/N$ is the proportion of units belonging to the non-probability
sample B.
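A numerical sketch of the two estimators under a hypothetical population with a fully enumerated big data stratum B and an independent simple random sample A (all values and variable names are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population; delta marks membership in the big data sample B,
# which is completely enumerated.
N = 50_000
y = rng.lognormal(mean=3.0, sigma=0.5, size=N)
delta = (rng.random(N) < 0.4).astype(int)

Y_B = y[delta == 1].sum()                 # known: B is fully observed
N_B = delta.sum()
N_C = N - N_B                             # known stratum sizes

# Independent simple random sample A with design weight d = N / n.
n = 1_000
A = rng.choice(N, size=n, replace=False)
d = N / n

# Design unbiased estimator of Y_C from sample A, and the post-stratified version.
Y_C_hat = np.sum(d * (1 - delta[A]) * y[A])
N_C_hat = np.sum(d * (1 - delta[A]))

Y_direct = Y_B + Y_C_hat                  # design unbiased estimator
Y_post = Y_B + N_C * Y_C_hat / N_C_hat    # post-stratified estimator (Kim and Tam 2018)
```

Both estimators track the true total; over repeated samples the post-stratified version has the smaller design variance, roughly by the factor 1 − W_B described above.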
One could use calibration estimation by minimizing the chi-square distance
$\sum_{i\in A}(w_i - d_i)^2/d_i$ between the design weights $d_i$ and the
calibration weights $w_i$, subject to the calibration constraints
$\sum_{i\in A} w_i\delta_i = N_B$, $\sum_{i\in A} w_i(1-\delta_i) = N_C$ and
$\sum_{i\in A} w_i\delta_i y_i = Y_B$, where $N_B$, $N_C$ and $Y_B$ are known. The
resulting calibration estimator $\sum_{i\in A} w_i y_i$ is identical to the
post-stratified estimator $\hat{Y}_P$ (Kim and Tam 2018). However, the main
advantage of the calibration approach is that it permits the inclusion of other
calibration constraints, if available. One important application of the calibration
approach is the estimation of the total Y when the study variable observed in the
non-probability sample B is subject to measurement error, so that we observe
$y_i^*$ instead of $y_i$ for $i \in B$, with no measurement errors in the
probability sample A. In this case, we minimize the chi-square distance subject to
the previous constraints but with $y_i$ replaced by $y_i^*$ for $i \in B$. The
resulting calibration estimator is given by

$\sum_{i\in A} w_i y_i$. The case of measurement errors only in the probability sample A is


more complex and requires a measurement error model. Kim and Tam (2018)
also studied calibration estimation in the case of unit nonresponse in the
probability sample A, assuming a general response model allowing the proba-
bility of response to depend on the study variable.
Kim and Tam (2018) gave an interesting application of the above setup
in official statistics. In this application, the non-probability sample (or big
data) is the Australian Agricultural Census with 85% response rate and the
probability sample is the Rural Environment and Agricultural Commodi-
ties Survey. In this application, the study variable is subject to measure-
ment error in the probability sample while the true value is observed in the
non-probability sample.

6 Study variable not observed in the probability sample

We now turn to the case where a non-probability sample B observing the
study variable y and a reference probability sample A observing some other
study variables have common covariates, denoted by x. The available data are
denoted by {(i, yi, xi), i ∈ B} and {(i, xi), i ∈ A}. The above scenario is similar
to the double sampling case of section 3.1 but the roles of the large sample and
the small sample are reversed. Here, the study variable y is cheaper to collect
from sample B, unlike in the double sampling case where the study variable is
observed in a small probability sample.
First, we consider the case where the units in sample A that do not belong
to sample B can be identified, as in Section 5. We can then use the data
$\{(\delta_i, x_i), i \in A\}$ to fit a model for the participation probabilities,
or propensity scores, $P(\delta_i = 1 \mid x_i) = q(x_i, \theta) = q_i$ in sample B,
based on the missing at random (MAR) assumption and $q_i > 0$ for all $i \in U$.
For example, one could use a logistic regression model for the binary variable
$\delta_i$ and obtain estimators $\hat{q}_i = q(x_i, \hat\theta)$ for $i \in B$,
using the $x_i$ observed in the sample B. We then calculate a ratio estimator of
the total as $\hat{Y}_{r,q} = N(\sum_{i\in B}\omega_i y_i)/(\sum_{i\in B}\omega_i)$,
where $\omega_i = \hat{q}_i^{-1}$ (Kim and Wang 2019). This estimator will lead to
valid inferences provided the model for the participation probabilities is
correctly specified. The MAR assumption that the participation probabilities $q_i$
depend only on the observed $x_i$ is a strong assumption and difficult to validate
in practice.
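A sketch of this propensity-weighting approach: a logistic participation model is fit on the reference sample A and the estimated probabilities are applied to B. The simulated population, the Newton-Raphson routine `fit_logistic`, and all parameter values are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical population: participation in B depends on a covariate x (MAR).
N = 20_000
x = rng.normal(0.0, 1.0, size=N)
theta_true = np.array([-1.0, 0.8])                  # intercept, slope (assumed)
q = 1 / (1 + np.exp(-(theta_true[0] + theta_true[1] * x)))
delta = (rng.random(N) < q).astype(float)
y = 10.0 + 2.0 * x + rng.normal(0.0, 1.0, size=N)   # study variable

# Reference probability sample A observes (delta_i, x_i); B observes (y_i, x_i).
nA = 2_000
A = rng.choice(N, size=nA, replace=False)
B = np.flatnonzero(delta == 1)

def fit_logistic(X, z, iters=25):
    """Unweighted logistic regression by Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (z - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

XA = np.column_stack([np.ones(nA), x[A]])
theta_hat = fit_logistic(XA, delta[A])

# Estimated participation probabilities for units in B, then the ratio
# estimator Y_rq = N * sum(w_i y_i) / sum(w_i) with w_i = 1 / q_hat_i.
qB = 1 / (1 + np.exp(-(theta_hat[0] + theta_hat[1] * x[B])))
w = 1 / qB
Y_rq = N * np.sum(w * y[B]) / np.sum(w)

naive = N * y[B].mean()    # unadjusted big data estimator (selection biased)
```

With participation favoring large x, the naive estimator overstates the total, while the weighted ratio estimator largely removes the selection bias.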
Suppose a working population model, $E_m(y_i) = m(x_i, \beta) = m_i$ for $i \in U$,
is assumed to hold for the sample B, where $E_m$ denotes model expectation and the
mean function $m_i$ is specified. Using the data from the sample B we obtain an

estimator $\hat\beta$ which is consistent for $\beta$ if the model is correctly
specified. Then a doubly robust (DR) estimator of the total is given by

\[ \hat{Y}_{DR} = \sum_{i\in B}\omega_i\big(y_i - \hat{m}_i\big) + \sum_{i\in A} d_i\hat{m}_i, \tag{5} \]

where $\hat{m}_i = m(x_i, \hat\beta)$ denote the predicted values (Kim and Wang 2019). The
estimator (5) is DR in the sense that it is consistent if either the model for the
participation probabilities or the model for the study variable is correctly
specified. The assumption that the model holds for the nonprobability sample
is also a strong assumption. DR estimators have been used in the context of
nonresponse in a probability sample (Kim and Haziza 2014).
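The DR construction (5) can be sketched as follows. Here the outcome model is fit on B by least squares, and for brevity the true participation probabilities stand in for estimated ones (in practice they would be estimated as described in this section); everything shown is a hypothetical simulation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: B is self-selected (participation depends on x);
# A is a reference probability sample with design weights d_i.
N = 20_000
x = rng.normal(0.0, 1.0, size=N)
q = 1 / (1 + np.exp(-(-1.0 + 0.8 * x)))            # participation probs (assumed)
delta = rng.random(N) < q
y = 5.0 + 3.0 * x + rng.normal(0.0, 1.0, size=N)

B = np.flatnonzero(delta)
nA = 2_000
A = rng.choice(N, size=nA, replace=False)
d = np.full(nA, N / nA)                            # SRS design weights

# Step 1: working outcome model m(x, beta) = beta0 + beta1 * x, fit on B.
XB = np.column_stack([np.ones(B.size), x[B]])
beta_hat, *_ = np.linalg.lstsq(XB, y[B], rcond=None)
m_B = XB @ beta_hat
m_A = np.column_stack([np.ones(nA), x[A]]) @ beta_hat

# Step 2: inverse participation weights (true q used here for illustration).
w = 1 / q[B]

# Doubly robust estimator (5): weighted bias correction on B plus the
# design-weighted total of the predictions over A.
Y_DR = np.sum(w * (y[B] - m_B)) + np.sum(d * m_A)
```

Either a correct outcome model or correct participation probabilities is enough for consistency; here both hold by construction, so the estimate should sit close to the true total.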
Chen et al. (2018a) proposed DR estimators of the type (5) that do not require
matching the two samples; in fact, the set of units in sample A that also belong
to sample B may be empty. This scenario is more common in practice. Under this
approach, the values $\{(\delta_i, x_i), i \in U\}$ are first used under the
assumed participation probability model to obtain a population log likelihood
$l(\theta)$, which involves the unknown term $\sum_{i\in U}\log\{q(x_i,\theta)\}$;
this unknown term is then replaced by its design unbiased estimator based on the
probability sample A. This two-step procedure leads to the following pseudo log
likelihood:

\[ \hat{l}(\theta) = \sum_{i\in B}\mathrm{logit}\{q(x_i,\theta)\} + \sum_{i\in A} d_i\log\{1 - q(x_i,\theta)\}, \tag{6} \]

where $\mathrm{logit}(a) = \log\{a/(1-a)\}$.


For the commonly used logistic regression model,
$\mathrm{logit}\{q(x_i,\theta)\} = x_i'\theta$, the score equations corresponding
to (6) reduce to

\[ \hat{U}(\theta) = \sum_{i\in B} x_i - \sum_{i\in A} d_i\, q(x_i,\theta)\, x_i = 0. \tag{7} \]

Solving (7), we obtain an estimator $\hat\theta$ and the corresponding
$\hat{q}_i = q(x_i, \hat\theta)$. We simply use the resulting $\omega_i$ in (5) to
get a DR estimator of the total. Chen et al. (2018a) derived variance estimators
for the DR estimator of the total or mean.
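Since (6) is concave in θ for the logistic model, the score equations (7) can be solved by Newton iteration; the sketch below does so on a hypothetical simulation in which B and A need not overlap (all parameter values assumed).

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical setup: B is a non-probability sample, A an independent SRS.
N = 50_000
x = rng.normal(0.0, 1.0, size=N)
X = np.column_stack([np.ones(N), x])
theta_true = np.array([-1.5, 1.0])                 # assumed for the simulation
q_true = 1 / (1 + np.exp(-X @ theta_true))
B = np.flatnonzero(rng.random(N) < q_true)

nA = 2_000
A = rng.choice(N, size=nA, replace=False)
d = N / nA                                         # SRS design weight

# Newton iteration on the score equations (7):
#   U(theta) = sum_B x_i - sum_A d_i q(x_i, theta) x_i = 0.
theta = np.zeros(2)
for _ in range(50):
    qA = 1 / (1 + np.exp(-X[A] @ theta))
    U = X[B].sum(axis=0) - d * (qA[:, None] * X[A]).sum(axis=0)
    H = d * (X[A] * (qA * (1 - qA))[:, None]).T @ X[A]   # negative Jacobian
    step = np.linalg.solve(H, U)
    theta = theta + step
    if np.max(np.abs(step)) < 1e-10:
        break
```

The solution `theta` recovers the participation parameters up to sampling error, from which the weights for the DR estimator (5) follow as 1/q(x, theta).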
For general $q(x_i, \theta)$, a calibration-type estimator of the model parameter
$\theta$ may be obtained by solving

\[ \sum_{i\in B}\{q(x_i,\theta)\}^{-1}x_i = \sum_{i\in A} d_i x_i. \tag{8} \]

If the population total X is available from one or more external sources, such as
a census, then there is no need for the probability sample and we simply replace
the design unbiased estimator $\hat{X} = \sum_{i\in A} d_i x_i$ in (8) by the known
total X (Singh et al. 2017). Note that the Chen et al. (2018a) estimator of $q_i$ depends

on the design weights associated with the probability sample. Chen et al.
(2018a) also provided asymptotically valid variance estimators of the DR esti-
mators under the assumed models. Lee (2006) and Lee and Valliant (2009) earlier
proposed the above setup, but their method of estimating the participation
probabilities by letting δi = 1 if i ∈ B and δi = 0 if i ∈ A leads to biased estimates
of the participation probabilities, as observed by Valliant and Dever (2011).
Yang et al. (2019) extended the DR estimators of Chen et al. (2018a) to the
case of a large number of candidate covariates x. They used a two-step ap-
proach for variable selection and estimation of the finite population parameter,
using a generalization of lasso. Some of the commercial web panels and health
records may observe many covariates, but several of those covariates may be
weakly related to the study variable. In such cases, the lasso-based DR estima-
tors of Yang et al. (2019) might be useful.
A regression type estimator of the total is obtained by estimating the unknown
total of the predicted values $\hat{m}_i$, using the probability sample
(Chen et al. 2018a, b):

\[ \hat{Y}_{REG} = \sum_{i\in A} d_i\hat{m}_i. \tag{9} \]

Note that (9) could be regarded as a mass imputed estimator with the missing
values $y_i$ for $i \in A$ replaced by the corresponding imputed values
$\hat{m}_i$. The estimator (9) will be biased if the model for the study variable
is incorrectly specified. A ratio type version of (9) is obtained by multiplying
(9) by the scale factor $N/\hat{N}$, where $\hat{N} = \sum_{i\in A} d_i$ is the
design unbiased estimator of N.
Rivers (2007) used a non-parametric mass imputation approach that avoids the
specification of the mean function $E_m(y_i \mid x_i)$. For each unit i in the
probability sample A, a nearest neighbor (NN) to the associated $x_i$ is found
from the donor set $\{(i, x_i), i \in B\}$, say $x_l$, using Euclidean distance,
and the associated $y_l$ is used as the imputed value $y_i^* (= y_l)$, leading to
the mass imputed estimator

\[ \hat{Y}_{RI} = \sum_{i\in A} d_i y_i^*. \tag{10} \]

The estimator (10) is based on real values obtained from the donor set, unlike (9).
It is not design-model unbiased unless $E_m(y_i^* \mid x_i) = E_m(y_i \mid x_i)$. Singh et al.
(2017) proposed alternative mass imputed estimators based on NN. Kim et al.
(2019) studied mass imputation under a semi-parametric model for the non-
probability sample B with E(y| x) = m(x, β) for a known function m(., .) and
established design-model consistency of the mass imputed estimator. They also
derived variance estimators using either linearization or the bootstrap method.
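With a single covariate, Rivers' NN mass imputation (10) reduces to a sorted donor set and a binary search; the sketch below uses a hypothetical simulated population (the nonlinear mean function and all sizes are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(13)

# Hypothetical data: y observed only in the non-probability sample B;
# x observed in both B and the probability sample A.
N = 30_000
x = rng.uniform(0.0, 10.0, size=N)
y = 2.0 * x + np.sin(x) + rng.normal(0.0, 0.5, size=N)

B = np.flatnonzero(rng.random(N) < 0.2)            # donor set
nA = 1_500
A = rng.choice(N, size=nA, replace=False)
d = N / nA                                         # SRS design weight

# For each unit in A, find its nearest neighbour in B (distance on x)
# and impute that donor's observed y value.
order = np.argsort(x[B])
xB_sorted = x[B][order]
yB_sorted = y[B][order]
pos = np.clip(np.searchsorted(xB_sorted, x[A]), 1, xB_sorted.size - 1)
left_closer = (x[A] - xB_sorted[pos - 1]) <= (xB_sorted[pos] - x[A])
nn = np.where(left_closer, pos - 1, pos)
y_star = yB_sorted[nn]                             # imputed values y_i*

# Mass imputed estimator (10).
Y_RI = np.sum(d * y_star)
```

No mean function is specified anywhere; the donor set supplies real observed values, and with a dense donor set the NN matching bias is small.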

Chen et al. (2018a) conducted a simulation study on the performance of the
alternative estimators. They demonstrated that the DR estimator performs well in
terms of relative bias and mean squared error (MSE) when either the assumed
response model or the model for the study variable is correctly specified. On the
other hand, the estimator $\hat{Y}_{r,q}$, based only on the estimated $\hat{q}_i$,
performs poorly when the model for the participation probabilities $q_i$ is
incorrectly specified. Similarly, the regression estimator (9) performs poorly when
the model on the study variable is incorrectly specified. The authors did not
study the case when both models are incorrectly specified. It may be possible to
develop multiply robust (MR) estimators, along the lines of Chen and Haziza
(2017), by specifying multiple response models and multiple models for the
study variable. Under this scenario, an estimator is MR if it performs well when
at least one of the candidate models is correctly specified. Simulation results of
Chen and Haziza (2017) indicated that the MR estimators tend to perform well
even when all the candidate models are incorrectly specified.

7 Non-probability sample alone available

We now turn to the scenario where only non-probability sample B data {(i, yi,
xi), i ∈ B} are available. In this case, we need some population information on the
auxiliary variables x. It is then possible to use a model-dependent or prediction
approach which was advocated for probability samples by Royall (1970) and
others. An advantage of the prediction approach for probability samples is that it
provides conditional model-based inferences referring to the particular sample of
units selected. Such conditional inferences may be more relevant and appealing
than the unconditional repeated sampling inferences used in the design-based
approach. Inferences do not depend on the sampling design if the assumed
population model holds for the sample, that is, if there is no sample selection bias.
In the case of probability sampling, Little (2015) notes that sample selection
bias may be reduced by explicitly incorporating design features, such as
stratification and sample selection probabilities, into the model to ensure design
consistency of the prediction estimators. For example, with disproportionate
stratified random sampling, stratum effects are included in the model, leading to a
prediction estimator that agrees with the standard design-based estimator.
Unfortunately, under non-probability sampling the above calibrated modeling
approach is not feasible and issues of selection bias are not resolved. Little
(2015) suggests incorporating auxiliary population information, such as
poststratification, to reduce the sample selection bias. Bethlehem (2016) dem-
onstrated the effectiveness of post-stratification in reducing the selection bias in

the context of a web panel on voting (binary variable) when the panel is
post-stratified into cells defined by age (young, middle and old) and education
(low, high) and the population cell counts are known. Wang et al. (2015) proposed
multilevel regression and post-stratification (MRP) to forecast elections with
non-representative polls and demonstrated its effectiveness in forecasting U. S.
presidential elections, based on a large number of post-stratification cells with
known population cell counts. Smith (1983) examined the conditions for ignor-
ing non-random selection mechanisms with particular attention to post-
stratification and quota sampling.
Suppose the population model is given by the mean function
$E_m(y_i) = h(x_i, \beta)$, $i \in U$, and the model holds for the non-probability
sample B. We fit the model to the sample data to obtain an estimator $\hat\beta$
and predictors $\hat{h}_i = h(x_i, \hat\beta)$ for $i \in U$, assuming the
population values $x_i$ are known. In this case, the prediction estimator of the
total is given by

\[ \hat{Y}_{PR} = \sum_{i\in B} y_i + \sum_{i\in\tilde{B}} \hat{h}_i, \tag{11} \]

where $\tilde{B}$ denotes the set of population units not belonging to B. In
practice, it may not be possible to identify the units belonging to $\tilde{B}$ in
the context of non-probability sampling. However, if the model is a linear
regression model $h(x_i, \beta) = x_i'\beta$ with constant model variance and the
intercept term included in the model, then the prediction estimator (11) reduces to

\[ \hat{Y}_{PR} = X'\hat\beta, \tag{12} \]

where $\hat\beta$ is the ordinary least squares estimator of $\beta$. In this
special case, the prediction estimator (12) requires only the vector of population
totals X.
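A sketch of the prediction estimator (12) under a linear model with known population totals. The selection into B depends only on x, so the working model holds for B by construction; the population and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(17)

# Hypothetical population with known totals X = (N, sum of x)'.
N = 40_000
x = rng.normal(2.0, 1.0, size=N)
y = 4.0 + 1.5 * x + rng.normal(0.0, 1.0, size=N)
X_totals = np.array([N, x.sum()])                  # vector of population totals

# Non-probability sample B, self-selected with probability depending on x
# only; the linear model in x still holds for B (strong assumption in general).
B = np.flatnonzero(rng.random(N) < 1 / (1 + np.exp(-(x - 2.0))))

# Ordinary least squares fit on B, then the prediction estimator (12).
XB = np.column_stack([np.ones(B.size), x[B]])
beta_hat, *_ = np.linalg.lstsq(XB, y[B], rcond=None)
Y_PR = X_totals @ beta_hat
```

Only the vector of population totals is needed; no frame for the non-sampled units and no design weights enter the computation, which is exactly why a correctly specified model is essential.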
A major challenge in making inference based only on a non-probability
sample B is how to account for sample selection bias. Chen et al. (2018b)
suggested using a large pool of predictors and then applying the lasso
(Tibshirani 1996) to implement both variable selection and estimation of model
parameters associated with the selected variables. Their simulation study
suggests that the resulting predictors might be able to account for sample
selection bias. However, the chosen setups do not reflect strong sample selection
bias. Also, it should be noted that the number of demographic variables
available for use as predictors is limited in the context of volunteer web surveys
and other nonprobability samples and not all variables may be available for all
the units in the sample (Couper 2013).

8 Small area estimation

Reliable local area or subpopulation (denoted small area) statistics are
needed in formulating policies and programs, allocation of government
funds, regional planning and making business decisions. Traditional direct
estimators of small area totals or means, based on probability sampling and
area-specific sample data, do not lead to adequate precision due to small
sample sizes within small areas. As a result, it becomes necessary to
“borrow strength” across areas through linking models based on auxiliary
information such as recent census counts and administrative records.
Kalton (2019) says that opposition to using models has been overcome by
the growing demand for small area estimation (SAE). Rao and Molina
(2015) provide a comprehensive account of model-based methods for
SAE. Models used for SAE may be broadly classified into two categories:
(a) area level models that relate area level direct estimators to area level
covariates, and (b) unit level models that relate the observed values of the
study variable to unit-specific auxiliary variables.
For simplicity, assume that simple random samples of sizes $n_i$ ($i = 1, \ldots, m$)
are drawn from the m areas, that $n_i > 0$ for all i, and that the corresponding
direct estimators of the area means $\bar{Y}_i$ are the sample means $\bar{y}_i$.
Let the sampling model be $\bar{y}_i = \bar{Y}_i + e_i$, where $e_i$ is the sampling
error with mean zero and known variance $\psi_i$. A model linking the area means is
given by $\bar{Y}_i = z_i'\beta + v_i$, where $z_i$ is a vector of area level
covariates not subject to sampling or measurement errors and $v_i$ is a random area
effect with mean zero and variance $\sigma_v^2$. Combining the sampling model with
the linking model leads to the well-known Fay-Herriot model
$\bar{y}_i = z_i'\beta + v_i + e_i$ (Fay and Herriot 1979). The "optimal" estimator
of the mean $\bar{Y}_i$ derived under this model, assuming for simplicity that the
model parameters are known, is a weighted combination of the sample mean
$\bar{y}_i$ and a "synthetic" estimator $z_i'\beta$ with weights
$\gamma_i = \sigma_v^2/(\sigma_v^2 + \psi_i)$ and $1 - \gamma_i$. It follows that
the optimal estimator gives more weight to the synthetic estimator as the sampling
variance increases or the sample size $n_i$ decreases. The mean squared prediction
error (MSPE) of the optimal estimator is equal to $\gamma_i\psi_i$, which shows a
large reduction in MSPE compared to the variance $\psi_i$ of the direct estimator
$\bar{y}_i$ when $\gamma_i$ is small or the sampling variance is large.
Hidiroglou et al. (2019) applied the area level model
to estimate unemployment rates for cities (small areas) in Canada, using direct
estimates from the Canadian Labour Force Survey and the number of employ-
ment insurance beneficiaries as the area level covariates. They evaluated the
absolute relative error (ARE) by comparing the estimates to the unemploy-
ment rates from the 2016 long-form Census treated as the gold standard. Their

results showed that ARE of the direct estimates averaged over all the areas is
reduced from 33.9% to 14.7% through the use of the “optimal” model-based
estimates. For the 28 smallest areas, reduction in ARE is more pronounced:
70.4% to 17.7%.
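The composite form of the "optimal" estimator can be sketched directly. The areas, parameter values, and the simplifying assumption that the model parameters are known are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(21)

# Hypothetical areas: true means follow the linking model; direct estimates
# add sampling error with known variances psi_i.
m = 200
z = np.column_stack([np.ones(m), rng.uniform(0.0, 1.0, size=m)])
beta = np.array([10.0, 5.0])                       # known, for simplicity
sigma2_v = 1.0                                     # known area effect variance
psi = rng.uniform(0.5, 4.0, size=m)                # known sampling variances

Ybar = z @ beta + rng.normal(0.0, np.sqrt(sigma2_v), size=m)   # true area means
ybar = Ybar + rng.normal(0.0, np.sqrt(psi))                    # direct estimates

# Optimal (best predictor) estimator: gamma_i * direct + (1 - gamma_i) * synthetic.
gamma = sigma2_v / (sigma2_v + psi)
Y_opt = gamma * ybar + (1 - gamma) * (z @ beta)

mse_direct = np.mean((ybar - Ybar) ** 2)           # approx. mean of psi_i
mse_opt = np.mean((Y_opt - Ybar) ** 2)             # approx. mean of gamma_i * psi_i
```

Areas with large sampling variance get small gamma and lean heavily on the synthetic component, which is where the shrinkage gain over the direct estimator is greatest.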
Ybarra and Lohr (2008) studied the case of area-level covariates subject to
sampling error. The covariates are obtained from a much larger survey, similar
to the double sampling scenario of Section 3.1. They developed “optimal”
estimators under this setup, assuming the sampling variances and covariances
of the area level covariates are known.
Use of area level big data as additional predictors in the area level
model has the potential of providing good predictors for modeling. We
mention three recent applications that have used big data covariates in
an area level model. Marchetti et al. (2015) studied the estimation of
poverty rates for local areas in the Tuscany region of Italy. In this
application, the big data covariate is a mobility index based on different
car journeys between locations automatically tracked with a GPS device.
Direct estimates of area poverty rates were obtained from a probability
sample. The big data covariate in this application is based on a
nonprobability sample which was treated as a simple random sample
and the Ybarra-Lohr method was used to estimate model-based poverty
rates. The second application analyzed relative change in the percent of
Spanish-speaking households in the eastern half of the United States
(Porter et al. 2014). Here direct estimates for the states (small areas)
were obtained from the American Community Survey (ACS) and a big
data covariate was extracted from Google Trends of commonly used
Spanish words available at the state level. In the third application,
Schmid et al. (2017) used mobile phone data as covariates in the basic
area level model to estimate literacy rate by gender at the commune
level in Senegal. Direct estimates of area literacy rates were obtained
from a Demographic and Health Survey based on a probability sample.
It is interesting that recent census data or social media data were not
available to use as covariates. The authors provide details regarding the
construction of the mobile phone covariates. In another application of big
data, Muhyi et al. (2019) obtained small area estimates of the electabil-
ity of a candidate as Central Java Governor in 2018, using predictor
variables extracted from Twitter data and direct estimates obtained from
a sample survey.
Turning to a basic unit level model, the unit level sample data are given
by $\{(y_{ij}, \mathbf{x}_{ij}),\ j = 1, \ldots, n_i;\ i = 1, \ldots, m\}$, where $j$ denotes a sample unit be-
longing to area $i$, and the population area means $\bar{\mathbf{X}}_i$ are assumed to be
known. In practice, either $\mathbf{x}_{ij}$ is observed along with the study variable $y_{ij}$ or
obtained from external sources through linkage. In the latter case, the
observed data may be subject to linkage errors if unique unit identifiers
are not available (Chambers et al. 2019). Battese et al. (1988) proposed a
nested error linear regression model $y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + v_i + e_{ij}$ for the case where $\mathbf{x}_{ij}$ is
observed along with $y_{ij}$; here $e_{ij}$ is a unit-level error with mean zero and variance
$\sigma_e^2$, and $v_i$ is a random area effect with mean zero and variance $\sigma_v^2$. The
optimal model-based estimator is again a weighted combination of a direct
estimator and a synthetic estimator. Battese et al. (1988) applied the
nested error regression model to estimate county crop areas using sample
survey data in conjunction with satellite information. Each county was
divided into area segments and the area under corn and soybeans, taken
as yij, was ascertained for a random sample of segments by interviewing
farm operators. The auxiliary variable $x_{ij}$, the number of pixels
classified as corn and soybeans, was obtained for all the area segments,
including the sample segments, in each county using LANDSAT satellite
readings. This is an application of big data in the form of satellite readings.
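Under the nested error model with known variance components, the composite estimator of an area mean combines a survey-regression estimator with a synthetic estimator, with weight $\gamma_i = \sigma_v^2/(\sigma_v^2 + \sigma_e^2/n_i)$ on the former. A minimal sketch of this standard form (illustrative values, not the Battese et al. data):

```python
import numpy as np

def nested_error_area_mean(ybar_i, xbar_i, Xbar_i, beta,
                           sigma2_v, sigma2_e, n_i):
    """Composite estimator of an area mean under the nested error model
    y_ij = x_ij'beta + v_i + e_ij, variance components treated as known.

    gamma_i = sigma_v^2 / (sigma_v^2 + sigma_e^2 / n_i): the estimator
    shrinks toward the synthetic estimate Xbar_i'beta as n_i decreases.
    """
    beta = np.asarray(beta, dtype=float)
    gamma = sigma2_v / (sigma2_v + sigma2_e / n_i)
    synthetic = np.dot(np.asarray(Xbar_i, float), beta)   # Xbar_i' beta
    # Survey-regression estimator: sample mean adjusted for covariates
    survey_reg = synthetic + (ybar_i - np.dot(np.asarray(xbar_i, float), beta))
    return gamma * survey_reg + (1 - gamma) * synthetic

# Hypothetical area with one sampled unit, so gamma = 0.5 under equal variances:
# survey-regression estimate 2.0, synthetic estimate 1.0, composite 1.5
estimate = nested_error_area_mean(ybar_i=2.0, xbar_i=[1.0], Xbar_i=[1.0],
                                  beta=[1.0], sigma2_v=1.0, sigma2_e=1.0, n_i=1)
```

As $n_i$ grows, $\gamma_i \to 1$ and the composite estimate approaches the survey-regression estimator, mirroring the area level case.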
Chambers et al. (2019) studied the effect of linkage errors when xij is
obtained from another source such as a population register. They derived
model-based estimators of the area means, taking account of linkage errors.
Both Battese et al. (1988) and Chambers et al. (2019) assumed the absence
of sample selection bias in the sense that the population model holds for the
sample of units within an area. However, sampling becomes informative if
the known selection probability for the sample unit is related to the
associated yij given xij. In this case, the population model may not hold
for the sample and methods that ignore informative sampling can lead to
biased estimators of small area means. Pfeffermann and Sverchkov (2007)
studied small area estimation under informative probability sampling by
modeling the known selection probabilities for the sample as functions of
associated xij and yij . They developed bias-adjusted estimators under their
set-up. Verret et al. (2015) proposed augmented models by using a suitable
function of the known selection probability as an additional auxiliary
variable in the sample model to take account of the sample selection bias
and demonstrated that the resulting estimators of small area means can
lead to considerable reduction in MSE relative to methods that ignore
sample selection bias. However, neither method is applicable to data ob-
tained from a non-probability sample because the selection probabilities are
unknown.
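The augmentation idea of Verret et al. (2015) can be shown schematically: a function of the known selection probability, here $\log \pi_{ij}$ (one of several possible choices), is appended to the design matrix before fitting the sample model. Everything below (the simulated data, the model, the choice of function) is illustrative rather than the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)                        # observed covariate
y = 1.0 + 2.0 * x + rng.normal(size=n)        # study variable

# Known selection probabilities; made to depend on y, so sampling is informative
pi = 1.0 / (1.0 + np.exp(-0.2 * y))

# Augmented design matrix: intercept, x, and g(pi) = log(pi) as extra covariate
X_aug = np.column_stack([np.ones(n), x, np.log(pi)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)  # fit the augmented model
```

The extra column absorbs the dependence of the outcome on the selection mechanism, which is how the augmented model accounts for sample selection bias; with non-probability samples this device is unavailable because the $\pi_{ij}$ are unknown.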
In the case of sample surveys repeated over time, considerable gain in
efficiency can be achieved for small area estimation by borrowing strength
across both areas and time, using extensions of the basic cross-sectional area
level model. There is extensive literature on this important topic and we refer
the reader to Rao and Molina (2015, Section 8.3). Big data sources can be used
to construct covariates for use under such models by combining time series and
cross-sectional survey data. A referee noted that “Particularly, the high fre-
quency of big data sources can be used to improve the timeliness of survey
samples in nowcasting methods.”

9 Concluding remarks

In this paper, I have discussed the effect of selection bias on inference
from non-probability samples and model-based methods to reduce sample
selection bias. Models that have been used for participation probabilities
and for the study variable are based on strong assumptions. Understanding
those assumptions and validating them is a big challenge in making reliable
inferences from a non-probability sample alone, or from a non-probability
sample in combination with a probability sample collecting auxiliary var-
iables common to those observed in the nonprobability sample. The assur-
ance of success of model-based methods will increase with the availability
of covariates strongly related to the study variables in the non-probability
samples. The similarity of the methods for dealing with nonresponse in
probability sampling and selection bias in non-probability sampling should
be noted. The evidence to date is that the use of nonresponse and calibra-
tion adjustments to compensate for nonresponse in probability samples can
reduce nonresponse bias but it will not eliminate it. Applying such adjust-
ments with non-probability samples is likely to be less successful. The
dilemma for analysts of non-probability samples is to assess how large the
residual bias is and whether the survey estimators are “fit for purpose”.
A major concern in using estimates based on non-probability samples or
found data is whether they can provide statistics reliable enough for use as official
statistics. Since analyses of trends over time are widespread, another con-
cern is whether the non-probability samples are comparable across time.
Research on the quality of the responses obtained from administrative
records and from non-probability samples is needed in the same way that
it is needed for the responses obtained with probability samples. However,
quality measures such as total MSE, developed for the estimates derived
from probability samples and extensively used by statistical agencies, may
not be entirely relevant for the estimates derived from non-probability
samples. Couper (2013) says “We need other ways to quantify the risks of
selection bias or non-coverage in big data or non-probability surveys.” The
European Statistical System (2015) has published guidelines for reporting
about the quality of statistics calculated from non-probability samples and
administrative sources, as well as for statistical processes involving multiple
data sources. The U.S. Federal Committee on Statistical Methodology
(2018) has outlined some steps that might be taken toward more transpar-
ent reporting of data quality for integrated data.
Covariates extracted from big data have the potential of providing good
additional predictors in linking models used in small area estimation. We
can expect to see more applications using big data predictors in small area
estimation. In the time series context, big data has the potential of provid-
ing estimates for small areas over time that can improve the timeliness of
survey samples using nowcasting methods, as noted by a referee.
I have not discussed other practical issues related to big data and non-
probability samples, such as privacy, access and transparency, and I refer the
reader to the following overview and appraisal papers: Baker et al. (2013),
Brick (2011), Citro (2014), Couper (2013), Elliott and Valliant (2017), Groves
(2011), Kalton (2019) , Keiding and Louis (2016), Lohr and Raghunathan
(2017), Mercer et al. (2017), Tam and Kim (2018) and Thompson (2019).
The report by the National Academies of Sciences, Engineering, and
Medicine (2017) extensively treated the privacy issue, in addition to method-
ology for integrating data from multiple sources.
It is unlikely that all surveys based on probability sampling, especially
large-scale surveys, will be replaced by big data, non-probability samples or
administrative data in the near future because probability samples have
much wider scope such as collecting multiple study variables to estimate
relationships. For some studies, data can only be obtained in person
(Kalton 2019). Of course, we should make improvements to probability
sampling, such as reducing survey length and respondent burden and
making increased use of technology (Couper 2013). Rao and Fuller (2017)
provide some future directions. Inevitably, non-probability samples will be
more widely used in the future, and we need to continue researching
methods for obtaining valid (or at least acceptable) inferences from them,
possibly in combination with probability samples as illustrated in this
paper. Falling response rates and increasing respondent burden are often
given as reasons for using non-probability samples, especially in socio-
economic surveys, but those reasons do not necessarily apply to traditional
sample surveys not involving people as respondents, such as agricultural
and natural resources surveys.
I had many stimulating discussions on inferential issues in survey sampling
with the late Professor Jayanta Ghosh while I was a visiting professor at the
Indian Statistical Institute during the period 1968-69. He also contributed to
my overview Sankhya paper (Rao 1999) by acting as a discussant and provid-
ing valuable insights and comments. In the discussion, he suggested that more
complex modeling than simple linear regression models might mitigate the
failure of model-based methods for large probability samples. In particular, he
proposed nonlinear and nonparametric regression or linear regression models
whose dimension or complexity increases with the sample size. It now appears
that such methods might indeed be useful to adjust for sample selection bias
induced by non-probability samples.
Acknowledgement This research was supported by a grant from the Nat-
ural Sciences and Engineering Research Council of Canada. I thank Jean-
Francois Beaumont, Paul Biemer, Mike Brick, Wayne Fuller, Jack Gambino,
Graham Kalton, Jae Kim, Frauke Kreuter, Sharon Lohr and Jean Opsomer for
some useful comments and suggestions on my paper. I also thank two referees
for constructive comments.

References

BAKER, R., BRICK, J. M., BATES, N. A., BATTAGLIA, M., COUPER, M. P., DEVER, J. A., GILE, K. J. AND
TOURANGEAU, R. (2013). Report of the AAPOR task force on non-probability sampling.
J. Surv. Statist. Methodol., 1, 90-143.
BATTESE, G. E., HARTER, R. M. AND FULLER, W. A. (1988). An error component model for prediction of
county crop areas using survey and satellite data. J. Am. Stat. Assoc., 83, 28-36.
BEAUMONT, J. – F. (2019). Are probability surveys bound to disappear for the production of official
statistics? Technical Report. Statistics Canada.
BETHLEHEM, J. (2016). Solving the nonresponse problem with sample matching. Soc. Sci. Comput.
Rev., 34, 59-77.
BIEMER, P. P. (2018). Quality of official statistics: present and future. Paper presented at the
International Methodology Symposium. Statistics Canada, Ottawa.
BOSE, C. (1943). Note on the sampling error in the method of double sampling. Sankhya, 6, 329-330.
BRAKEL VAN DEN, J. A. AND BETHLEHEM, J. (2008). Model-assisted estimators for official statistics.
Discussion Paper 09002, Statistics Netherlands.
BREIDT, F. J. AND OPSOMER, J. D. (2017). Model-assisted survey estimation with modern prediction
techniques. Stat. Sci., 32, 190-205.
BRICK, M. J. (2011). The future of survey sampling. Public Opin. Q., 75, 872-888.
CHAMBERS, R. L., FABRIZI, E. AND SALVATI, N. (2019). Small area estimation with linked data.
Technical report appeared as arXiv: 1904.00364v1.
CHAUDHURI, A. AND CHRISTOFIDES, T. (2013). Indirect Questioning in Sample Surveys. Springer: New York.
CHEN, S. AND HAZIZA, D. (2017). Multiply robust imputation procedures for the treatment of item
nonresponse in surveys. Biometrika, 104, 439-453.
CHEN, Y., LI, P. AND WU, C. (2018a). Doubly robust inference with non-probability survey samples.
Technical Report: arXiv: 1805.06432v1 [stat.ME].
CHEN, J. K. T., VALLIANT, R. L. AND ELLIOTT, M. R. (2018b). Model-assisted calibration of non-
probability sample survey data using adaptive lasso. Surv. Methodol., 44, 117-144.
CITRO, C. (2014). From multiple modes for surveys to multiple data sources for estimates. Surv.
Methodol., 40, 137-161.
COCHRAN, W. G. (1977). Sampling Techniques, 3rd Edition, Wiley: New York.
COUPER, M. P. (2013). Is the sky falling? New technology changing media, and the future of
surveys. Surv. Res. Methods, 7, 145-156.
LEEUW, E. D. D. (2005). To mix or not to mix. Data collection modes for surveys. J. Off. Stat., 21,
233-255.
DEVILLE, J. C. AND SARNDAL, C. E. (1992). Calibration estimators in survey sampling. J. Am. Stat.
Assoc., 87, 376-382.
ELLIOTT, M. R. AND VALLIANT, R. (2017). Inference for nonprobability samples. Stat. Sci., 32, 249-264.
EUROPEAN STATISTICAL SYSTEM. (2015). ESS Handbook for Quality Reports, 2014 Edition.
Luxembourg: Publications Office of the European Union. Available at https://ec.europa.
eu/eurostat/documents/3859598/6651706/KS-GQ-15-003-EN-N.pdf.
FAY, R. E. AND HERRIOT, R. A. (1979). Estimation of income for small places: An application of
James-Stein procedures to census data. J. Am. Stat. Assoc., 74, 269-277.
FEDERAL COMMITTEE ON STATISTICAL METHODOLOGY. (2018). Transparent Quality Reporting in the
Integration of Multiple Data Sources: A Progress Report, 2017-2018. Washington, DC:
Federal Committee on Statistical Methodology. Available at https://nces.ed.
gov/FCSM/pdf/Quality_Integrated_Data.pdf.
FULLER, W. A. (1975). Regression analysis for sample survey. Sankhya Ser. C., 31, 117-132.
GROVES, R. M. (2011). Three eras of survey research. Public Opin. Q., 75, 861-871 (Special 75th
Anniversary Issue).
GROVES, R. M. AND HEERINGA, S. G. (2006). Responsive design for household surveys: Tools for
actively controlling survey errors and costs. J. R. Stat. Soc. Ser. A, 169, 439-457.
GUANDALINI, A. AND TILLE, Y. (2017). Design-based estimators calibrated on estimated totals from
multiple surveys. Int. Stat. Rev., 85, 250-269.
HALL, P. (2003). A short prehistory of the bootstrap. Stat. Sci., 18, 158-167.
HANSEN, M. H. AND HURWITZ, W. N. (1943). On the theory of sampling from finite populations. Ann.
Math. Stat., 14, 333-362.
HANSEN, M. H. AND HURWITZ, W. N. (1946). The problem of non-response in sample surveys. J. Am.
Stat. Assoc., 41, 517-529.
HANSEN, M. H., HURWITZ, W. N., MARKS, E. S. AND MAULDIN, W. P. (1951). Response errors in surveys.
J. Am. Stat. Assoc., 46, 147-190.
HANSEN, M. H., HURWITZ, W. N., NISSELSON, H. AND STEINBERG, J. (1955). The redesign of the census
current population survey. J. Am. Stat. Assoc., 50, 701-719.
HANSEN, M. H., MADOW, W.G. AND TEPPING, B. J. (1983). An evaluation of model-dependent and
probability sampling inferences in sample surveys. J. Am. Stat. Assoc., 78, 776-793.
HARTLEY, H. O. (1962). Multiple frame surveys. Proceedings of the Social Statistics Section,
American Statistical Association, 203-206.
HARTLEY, H. O. AND ROSS, A. (1954). Unbiased ratio estimators. Nature, 174, 270-271.
HIDIROGLOU, M. (2001). Double sampling. Surv. Methodol., 27, 143-154.
HIDIROGLOU, M., BEAUMONT, J.-F AND YUNG, W. (2019). Development of a small area estimation
system at Statistics Canada. Surv. Methodol., 45, 101-126.
HOLT, D. T. (2007). The official statistics Olympics challenge: Wider, deeper, quicker, better,
cheaper. The American Statistician, 61, 1-8. With commentary by G. Brackstone and J.
L. Norwood.
HORVITZ, D. G. AND THOMPSON, D. J. (1952). A generalization of sampling without replacement from
a finite universe. J. Am. Stat. Assoc., 47, 663-685.
KALTON, G. (2019). Developments in survey research over the past 60 years: A personal perspec-
tive. Int. Stat. Rev., 87, S10-S30.
KEIDING, N. AND LOUIS, T. A. (2016). Perils and potentials of self-selected entry to epidemiological
studies and surveys. J. R. Soc. Stat. Ser. A, 179, 319-376.
KIM, J. K. AND HAZIZA, D. (2014). Doubly robust inference with missing data in survey sampling.
Stat. Sin., 24, 375-394.
KIM, J. K. AND RAO, J. N. K. (2012). Combining data from independent surveys: model-assisted
approach. Biometrika, 99, 85-100.
KIM, J. K. AND TAM, S-M. (2018). Data integration by combining big data and survey sample data
for finite population inference. Submitted for publication.
KIM, J. K. AND WANG, Z. (2019). Sampling techniques for big data analysts. Int. Stat. Rev. (in press).
KIM, J. K., PARK, S., CHEN, Y. AND WU, C. (2019). Combining non-probability and probability survey
samples through mass imputation. Technical Report: arXiv: 1812. 10694v2 [stat.ME].
LEE, S. (2006). Propensity score adjustment as a weighting scheme for voluntary panel web
surveys. J. Off. Stat., 22, 329-349.
LEE, S. AND VALLIANT, R. (2009). Estimation for volunteer panel web surveys using propensity score
adjustment and calibration adjustment. Sociol. Methods Res., 37, 319-343.
LITTLE, R. J. (2015). Calibrated Bayes, an inferential paradigm for official statistics in the era of big
data. Stat. J. IAOS, 31, 555-563.
LOHR, S. L. (2011). Alternative survey sample designs: Sampling with multiple overlapping frames.
Surv. Methodol., 37, 197-213.
LOHR, S. L. AND RAGHUNATHAN, T. E. (2017). Combining survey data with other data sources. Stat.
Sci., 32, 293-312.
MAHALANOBIS, P. C. (1944). On large scale sample surveys. Philos. Trans. R. Soc. B, 231, 329-351.
MAHALANOBIS, P. C. (1946). Recent experiments in statistical sampling in the Indian Statistical
Institute. J. R. Stat. Soc., 109, 325-378.
MARCHETTI, S., GIUSTI, C., PRATESI, M., SALVATI, N., GIANNOTTI, F., PEDRESCHI, D., RINZIVILLO, S.,
PAPPALARDO, L. AND GABRIELLI, L. (2015). Small area model-based estimators using big data
sources. J. Off. Stat., 31, 263-281.
MCCONVILLE, K. S. AND TOTH, D. (2018). Automated selection of post-strata using a model-assisted
regression tree estimator. Scand. J. Stat. (in press).
MCCONVILLE, K. S., BREIDT, F. J., LEE, T. C. AND MOISEN, G. G. (2017). Model-assisted survey regression
estimation with the lasso. J. Surv. Statist. Methodol., 5, 131-158.
MCLEOD, A. I. AND BELLHOUSE, D. R. (1983). A convenient algorithm for drawing a simple random
sample. Applied Statistics, 32, 182-184.
MENG, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations,
big data paradox, and the 2016 US presidential election. Ann. Appl. Stat., 12, 685-726.
MERCER, A. W., KREUTER, F. AND STUART, E. A. (2017). Theory and practice in nonprobability surveys.
Public Opin. Q., 81, 250-279.
MUHYI, F. A., SARTONO, B., SULVIANTI, I. D. AND KURNIA, A. (2019). Twitter utilization in application of
small area estimation to estimate electability of candidate central java governor. IOP Conf.
Ser. Earth Environ. Sci., 299 012033, 1-10.
NARAIN, R. D. (1951). On sampling without replacement with varying probabilities. J. Indian Soc.
Agric. Stat., 3, 169-174.
NATIONAL ACADEMIES OF SCIENCES, ENGINEERING, AND MEDICINE. (2017). Federal Statistics, Multiple
Data Sources, and Privacy Protection: Next Steps. Washington, DC: The National
Academies Press. https://doi.org/10.17226/24893.
NEYMAN, J. (1934). On the two different aspects of the representative method. The method of
stratified sampling and the method of purposive selection. J. R. Stat. Soc., 97, 558-606.
PFEFFERMANN, D. AND SVERCHKOV, M. (2007). Small-area estimation under informative probability
sampling of area and within the selected areas. J. Am. Stat. Assoc., 102, 1427-1439.
PORTER, A. T., HOLAN, S. H., WIKLE, C. K. AND CRESSIE, N. (2014). Spatial Fay-Herriot model for small
area estimation with functional covariates. Spat. Stat., 10, 27-42.
RAO, J. N. K. (1999). Some current trends in sample survey theory and methods (with discussion).
Sankhya Ser. B, 61, 1-57.
RAO, J. N. K. AND FULLER, W. A. (2017). Sample survey theory and methods: Past, present and future
directions (with discussion). Surv. Methodol., 43, 145-181.
RAO, J. N.K. AND MOLINA, I. (2015). Small Area Estimation. Wiley, Hoboken.
REITER, J. (2008). Multiple imputation when records used for imputation are not used or dissem-
inated for analysis. Biometrika, 95, 933-946.
RIVERS, D. (2007). Sampling for web surveys. In 2007 JSM Proceedings, ASA Section on Survey
Research Methods, American Statistical Association.
ROYALL, R. M. (1970). On finite population sampling under certain linear regression models.
Biometrika, 57, 377-387.
SARNDAL, C.-E. (2007). The calibration approach in survey theory and practice. Surv. Methodol.,
33, 99-119.
SCHENKER, N. AND RAGHUNATHAN, T. (2007). Combining information from multiple surveys to
enhance estimation of measures of health. Stat. Med., 26, 1802-1811.
SCHMID, T., BRUCKSCHEN, F., SALVATI, N. AND ZBIRANSKI, T. (2017). Constructing sociodemographic
indicators for national statistical institutes by using mobile phone data: estimating literacy
rates in Senegal. J. R. Stat. Soc. Ser. A, 180, 1163-1190.
SINGER, E. (2016). Reflections on surveys past and future. J. Surv. Statist. Methodol., 4, 463-
475.
SINGH, A. C., BERESOVSKY, V. AND YE, C. (2017). Estimation from purposive samples with the aid of
probability supplements but without data on the study variable. In 2017 JSM Proceedings,
ASA Section on the Survey Research Method Section, American Statistical Association.
SMITH, T. M. F. (1983). On the validity of inferences from non-random samples. J. R. Stat. Soc. Ser.
A, 146, 393-403.
TA, T., SHAO, J., LI, Q. AND WANG, L. (2019). Generalized regression estimators with high-dimensional
covariates. Stat. Sin. (in press).
TAM, S.-M. AND KIM, J. K. (2018). Big data, selection bias and ethics – an official statistician's
perspective. Stat. J. IAOS, 34, 577-588.
THOMPSON, S. K. (2002). Sampling. Wiley: New York.
THOMPSON, M. E. (2019). Combining data from new and traditional sources in population surveys.
Int. Stat. Rev., 87, S79-S89.
TIBSHIRANI, R. (1996). Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. Ser. B,
58, 267-288.
TOURANGEAU, R., BRICK, M. J., LOHR, S. AND LI, J. (2017). Adaptive and responsive survey designs: a
review and assessment. J. R. Stat. Soc. Ser. A, 180, 203-223.
VALLIANT, R. AND DEVER, J. A. (2011). Estimating propensity adjustments for volunteer web surveys.
Sociol. Methods Res., 40, 105-137.
VERRET, F., RAO, J. N. K. AND HIDIROGLOU, M. H. (2015). Model-based small area estimation under
informative sampling. Surv. Methodol., 41, 333-347.
WANG, W., ROTHSCHILD, D., GOEL, S. AND GELMAN, A. (2015). Forecasting elections with non-
representative polls. Int. J. Forecast., 31, 980-991.
WILLIAMS, D. AND BRICK, M. J. (2018). Trends in U. S. face-to-face household survey nonresponse and
level of effort. J. Surv. Statist. Methodol., 6, 186-211.
WOODRUFF, R. S. (1952). Confidence intervals for medians and other position measures. J. Am.
Stat. Assoc., 47, 635-646.
WU, C. AND SITTER, R. R. (2001). A model-calibrated approach to using complete auxiliary
information from survey data. J. Am. Stat. Assoc., 96, 185-193.
YANG, S., KIM, J. K. AND SONG, R. (2019). Doubly robust inference when combining probability and
non-probability samples with high-dimensional data. Technical Report: arXiv:
1903.05212v1 [stat.ME].
YBARRA, L. M. R. AND LOHR, S. L. (2008). Small area estimation when auxiliary information is
measured with error. Biometrika, 95, 919-931.

Publisher’s Note. Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

J. N. K. RAO, CARLETON UNIVERSITY, OTTAWA, CANADA
