License: CC BY 4.0
arXiv:2604.23464v1 [stat.ME] 25 Apr 2026

On cross-validation for small area estimators

Qianyu Dong Department of Statistics, University of California, Santa Cruz,
Santa Cruz, CA, USA.
      Zehang Richard Li Department of Statistics, University of California, Santa Cruz,
Santa Cruz, CA, USA.
Abstract

Subnational monitoring of public health often relies on household surveys where data are sparse at the desired spatial resolution. Small area estimation (SAE) methods address this challenge by borrowing strength across areas and incorporating auxiliary information. However, comparing these estimators remains difficult in the absence of ground truth. We propose a cross-validation framework for evaluating small area estimators that accommodates complex survey designs. Our approach enables model-agnostic comparisons between area-level and unit-level models. Central to our framework is a decomposition of the cross-validated squared error in the context of SAE, which reveals both identifiable bias and unidentifiable components that can be bounded. Our theoretical results and simulation studies show that conventional approaches, such as leave-one-area-out cross-validation, can yield misleading model rankings, whereas the proposed approach offers more robust and interpretable model comparison with uncertainty quantification. We demonstrate the procedure for comparing SAE models for mapping the female literacy rate using Demographic and Health Surveys from Zambia.

keywords:
survey sampling; small area estimation; spatial smoothing; model checking; cross-validation.

1 Introduction

Small area estimation (SAE) refers to the process of estimating quantities of interest for specific geographic areas or subpopulations using survey data. When within-area sample sizes are too small for design-based direct estimation, model‐based methods borrow information across areas and incorporate auxiliary information to improve the precision of estimates (Rao & Molina, 2015). In most low- and middle-income countries (LMICs), census or administrative microdata are limited. Routinely collected household surveys serve as the only reliable sources for many key health and demographic indicators. Demographic and Health Surveys (DHS) (The DHS Program, 2026; Corsi et al., 2012) and the Multiple Indicator Cluster Surveys (MICS) (UNICEF, 2026; Khan & Hancioglu, 2019) are two prominent examples that have been implemented in many countries over several decades. Small area estimation methods have been developed and applied in this setting for outcomes including poverty, child mortality, fertility, vaccination, and HIV (see e.g., Chi et al., 2022; Mercer et al., 2015; Li et al., 2019; Wakefield et al., 2019; Wu et al., 2021; Saha et al., 2023; Dong & Wakefield, 2021; Wakefield et al., 2020).

SAE models are generally categorized into either area-level or unit-level models. The most widely used area-level model is the Fay-Herriot model (Fay & Herriot, 1979), where the direct estimates and their associated estimates of the sampling variances are used as response data in a smoothing model. In contrast, unit-level approaches model individual survey responses, often via generalized linear mixed models. No single modeling strategy performs uniformly well across different survey designs and spatial dependence structures. Model evaluation and selection, however, remains challenging for SAE models. Existing approaches are either restricted to comparing models within shared likelihood families or require external benchmark data or assumptions of the true data generating mechanism. There is a critical need for a unified, model-agnostic framework for comparing candidate estimators on a common and scientifically meaningful scale.

In this paper, we develop a cross-validation framework for comparing small area estimators fitted to complex survey data. Our approach evaluates estimators through held-out, design-based direct estimates constructed from sample splits that preserve the survey design. This strategy enables model-agnostic comparisons between area-level and unit-level models on a common scale. We derive a decomposition of conditional mean squared errors for small area estimators, leading to a consistent cross-validation score. In finite samples, we show that the unidentifiable bias in score difference between two models can be bounded, which provides a principled threshold for deciding whether an observed score difference is large enough to support model ranking. We also demonstrate that popular alternative model comparison tools, such as leave-one-area-out cross-validation, produce systematic bias in model ranking. All theoretical results are supported via design-based simulations based on the 2024 Zambia DHS.

The rest of the paper is organized as follows. Section 2 reviews current approaches to evaluating small area estimators in the literature and discusses their limitations. Section 3 develops the proposed scoring and cross-validation framework. Section 4 reports simulation results and applies the methodology to compare estimators of female literacy rate in Zambia. Section 5 concludes with a discussion of future directions. All proofs are deferred to the Supplementary Materials.

2 Current approaches

Existing model evaluation approaches fall broadly into three categories: benchmark-based, likelihood-based, and empirical simulation. None of the approaches provides a generally applicable, model-agnostic framework for comparing estimators. The first class of methods involves evaluating candidate estimators in terms of their agreement with some benchmark values, and is often model-agnostic. External validation against census data is widely regarded as the gold standard for model assessment (Merfeld et al., 2024, 2022), and when census is not available, validation against estimates from independent surveys has been proposed (Brown et al., 2001; Dorfman, 2018). However, external census or surveys are rarely available at the required spatial resolution. Alternatively, one may first aggregate SAE estimates to coarser levels, e.g., a single national estimate, and then compare them with external benchmark values. While being useful for falsifying estimators, aggregated evaluation changes the target estimand completely. Estimators that perform well at the aggregated level can still be significantly biased at the finer levels, especially with over-smoothed models. When no external data are available, direct estimates from the same sample are also sometimes used to assess small area estimators (Lahiri & Pramanik, 2019). However, such evaluation suffers from double dipping and systematically favors models that track the noise in direct estimates.

Traditional model assessment and validation tools, including goodness-of-fit tests (Rao & Scott, 1981, 1984; Thomas & Rao, 1987; Lumley & Scott, 2014, 2017), information criteria (Fabrizi & Lahiri, 2013; Lumley & Scott, 2015), and generalization error measures (Holbrook et al., 2020), have been developed for regression models that incorporate survey weights, with adjustments to account for complex sampling designs. Similar approaches for comparing Bayesian SAE models using measures such as DIC and WAIC have also been adopted in the literature (see, e.g., Trevisani et al., 2017; Gomez-Rubio et al., 2008; Franco & Maitra, 2023). While likelihood-based methods are powerful and widely used, they are generally limited to comparing models that share a comparable likelihood formulation. Consequently, they are not well suited for comparing unit-level with area-level models, or models that target the same estimand through fundamentally different constructions.

Lastly, empirical simulation approaches are also commonly used in the SAE literature (Bradley et al., 2015; Janicki et al., 2022). These approaches perturb observed direct estimates by adding noise, and then evaluate models fitted to the synthetic direct estimates against the original estimates. When some individual-level micro data from the population exist, design-based simulation may also be carried out by generating repeated surveys based on a synthetic population dataset (Alfons et al., 2010; Wieczorek & Franco, 2013). Both approaches evaluate model performance for a hypothetical population rather than the specific observed dataset at hand, thus implicitly changing the target of inference.

The cross-validation framework developed in Section 3 retains the model-agnostic advantage of benchmark-based approaches while resolving the lack of benchmark datasets. It is closely related to the recent work of Kawano et al. (2026), where they developed model evaluation approaches by splitting direct estimates into independent copies for model fitting and evaluation, using data thinning (Neufeld et al., 2024; Dharamshi et al., 2025). Kawano et al. (2026) is restricted to comparing area-level models with a Gaussian likelihood that enables data thinning. By splitting at the level of individual survey responses rather than area-level summaries, our framework enables comparison across models with fundamentally different likelihood structures, including between area-level and unit-level models, and extends to any estimators, including those constructed with machine learning or other black-box algorithms.

3 Evaluating small area estimators

3.1 Setup and notation

Let Ui{1,2,,N}U_{i}\subset\{1,2,...,N\} denote the set of indices for the units that belong to the ii-th subpopulation or area. We focus on binary outcomes in this paper, though the method can be generalized to other types of variables too. Let yj{0,1}y_{j}\in\{0,1\} denote the response value from the jj-th unit and s[j]{1,,M}s[j]\in\{1,...,M\} denote which subpopulation the jj-th unit belongs to. From a probabilistic survey, we obtain a sample SU1U2UMS\subset U_{1}\cup U_{2}\cup\cdots\cup U_{M} of the units, with their design weights wj=1/πjw_{j}=1/\pi_{j}, where πj\pi_{j} is the probability that the jj-th unit is sampled. To simplify the notation when comparing different sample splitting strategies, we denote a generic partition of a sample SS into a training set AA and a validation set B=SAB=S\setminus A, and use θ^A\hat{\theta}_{A} and θ^B\hat{\theta}_{B} to denote estimators based on the two subsamples. Specifically for KK-fold cross-validation, we use θ^(k)\hat{\theta}^{(k)} and θ^(k)\hat{\theta}^{(-k)} to denote the estimators based on the kk-th held-out dataset and the remaining samples respectively. As a working example, we consider the stratified multi-stage design adopted by most DHS surveys, where the primary sampling units (PSUs) are enumeration areas (EAs), or clusters, and within each PSU, the secondary sampling units (SSUs) are households.

Let θi\theta_{i} denote the true prevalence in area ii and 𝜽=(θ1,,θM)\bm{\theta}=(\theta_{1},...,\theta_{M}). When data are sufficient for direct estimation, the Hájek estimator (Hájek, 1971) for subpopulation prevalence is given by

θ^iw=jSUiwjyjjSUiwj,i=1,,M.\hat{\theta}_{i}^{w}=\frac{\sum_{j\in S\cap U_{i}}w_{j}y_{j}}{\sum_{j\in S\cap U_{i}}w_{j}},\;\;\;\;i=1,...,M. (1)

Area-level models then proceed by smoothing direct estimates 𝜽^w\hat{\bm{\theta}}^{w} and the associated design-based variance, whereas unit-level approaches specify a parametric distribution for yjy_{j} directly. Our objective is to evaluate the resulting estimates 𝜽^\hat{\bm{\theta}} regardless of the underlying model.

3.2 Scoring SAE estimates

For most SAE applications, a natural metric to score the estimator is the weighted squared error loss, L(𝜽^,𝜽)=i=1Mqi(θ^iθi)2,L(\hat{\bm{\theta}},\bm{\theta})=\sum_{i=1}^{M}q_{i}(\hat{\theta}_{i}-\theta_{i})^{2}, for some weights qiq_{i}. The choice of the aggregation weights is contextual. They may represent population proportions by area, or simply be set to 1/M1/M to weigh all areas equally. Evaluating 𝜽^\hat{\bm{\theta}} based on the loss function instead of model-induced likelihood enables model comparison across different model classes. Similar principles have been adopted by Kuh et al. (2024) in the context of multilevel regression and poststratification.

This squared error loss is closely related to the mean squared error (MSE), a topic extensively studied in the SAE literature. Specifically, our ultimate target of inference for model \mathcal{M} is the weighted average of area-specific conditional MSEs,

MSE=i=1Mqi𝔼[(θ^iθi)2],\mbox{MSE}^{\mathcal{M}}=\sum_{i=1}^{M}q_{i}\mathbb{E}[(\hat{\theta}_{i}^{\mathcal{M}}-\theta_{i})^{2}], (2)

where 𝜽\bm{\theta} is considered fixed and unknown, and the expectation is taken over the distribution of sampling units. The term 𝔼[(θ^iθi)2]\mathbb{E}[(\hat{\theta}_{i}^{\mathcal{M}}-\theta_{i})^{2}] is usually referred to as the conditional MSE in the literature (Rivest & Belmonte, 2000; Lohr & Rao, 2009; Rao & Molina, 2015). The majority of the theoretical results on estimating MSE, however, focus on the unconditional case where the expectation is further taken over the distribution of 𝜽\bm{\theta}, assuming area-level Gaussian random effects (Prasad & Rao, 1990; Datta & Lahiri, 2000). To the best of our knowledge, the score proposed in this paper is the first cross-validation-based estimator of the conditional MSE without imposing distributional assumptions on the random effects.

3.3 Stratified KK-fold cross-validation

Following the multi-stage survey design, we randomly assign SSUs, i.e., households, within each cluster to one of the KK folds. This creates KK folds of data each containing all clusters in the survey but only 1/K1/K of the households within each cluster. The survey design is preserved within each fold, and the selection probability for each unit only needs to be adjusted by 1/K1/K. Given a partition, the naive estimator of the squared error loss is

score^naive=i=1MqiKk=1K(θ^i(k)θ^i(k),w)2,\widehat{\mathrm{score}}_{\mathrm{naive}}=\sum_{i=1}^{M}\frac{q_{i}}{K}\sum_{k=1}^{K}\left(\hat{\theta}_{i}^{(-k)}-\hat{\theta}_{i}^{(k),w}\right)^{2}, (3)

where θ^i(k)\hat{\theta}_{i}^{(-k)} denotes the model-based prediction for area ii obtained using all data except the kk-th fold, and θ^i(k),w\hat{\theta}_{i}^{(k),w} denotes the direct estimate for area ii computed from the kk-th held-out fold. The partitioning step can also be repeated and averaged to stabilize the estimated score.

Stratified cross-validation under complex survey designs has been investigated in the survey literature, usually for the purpose of variable selection in regression models (Iparragirre et al., 2023; Wieczorek et al., 2022). We note that our goal here is different. Under KK-fold splitting, the estimators are compared with the direct estimates of the held-out samples, instead of predicting for individual survey responses. This allows the resulting score to be comparable for both area- and unit-level models.

Importantly, validation approaches based on sample splitting target the conditional MSE at a reduced sample size, i.e., MSEA=i=1Mqi𝔼[(θ^A,iθi)2]\text{MSE}_{A}^{\mathcal{M}}=\sum_{i=1}^{M}q_{i}\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})^{2}], where AA is a sample of size (K1)/K(K-1)/K of SS approximately, rather than MSE\text{MSE}^{\mathcal{M}} in Equation 2. To differentiate the two targets, we refer to MSE\text{MSE}^{\mathcal{M}} as the oracle full-sample MSE and MSEA\text{MSE}_{A}^{\mathcal{M}} the oracle training MSE. A similar gap in the target estimand was also observed in Kawano et al. (2026) and it generally cannot be estimated. For the purpose of model comparison, our goal is to preserve model ranking under the original design. It is therefore desirable to keep the training sets close to the full data distribution. For multi-stage designs, one may choose to stratify at different stages to form the partitions. In the Supplementary Materials, we include one alternative partitioning strategy where clusters are assigned to the KK folds instead of households, and show that such approach performs similarly to cross-validation of SSUs when sample size is large, but is more biased when sample size is small, due to a larger gap between the training oracle and full-sample oracle estimand.

Given a sample splitting strategy, the naive CV score is still biased for MSEA\text{MSE}_{A}, because the held-out direct estimate carries sampling variation. Theorem 3.1 identifies the bias structure for sample splitting estimators in general.

Theorem 3.1 (Unbiased estimation of oracle training MSE).

For the ii-th area, given a partition of SS into two sets AA and BB and model \mathcal{M}, let

m^S=𝔼[(θ^A,iθ^B,iw)2S]vibi2+2ci+2bidi\hat{m}_{S}=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\hat{\theta}^{w}_{B,i})^{2}\mid S]-v_{i}-b_{i}^{2}+2c_{i}^{\mathcal{M}}+2b_{i}d_{i}^{\mathcal{M}} (4)

where the additional terms are

vi\displaystyle v_{i} =var(θ^B,iwS),\displaystyle=\mbox{var}(\hat{\theta}_{B,i}^{w}\mid S), (5)
bi\displaystyle b_{i} =𝔼(θ^B,iwθiS),\displaystyle=\mathbb{E}(\hat{\theta}_{B,i}^{w}-\theta_{i}\mid S), (6)
di\displaystyle d_{i}^{\mathcal{M}} =𝔼(θ^A,iθiS),\displaystyle=\mathbb{E}(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i}\mid S), (7)
ci\displaystyle c_{i}^{\mathcal{M}} =cov(θ^A,i,θ^B,iwS),\displaystyle=\mbox{cov}(\hat{\theta}_{A,i}^{\mathcal{M}},\hat{\theta}_{B,i}^{w}\mid S), (8)

and all expectations are taken with respect to the resampling distribution of index set AA given a fixed survey sample SS. The estimator m^S\hat{m}_{S} is unbiased for the oracle training MSE, i.e., 𝔼(m^S)=𝔼[(θ^A,iθi)2]\mathbb{E}(\hat{m}_{S})=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})^{2}], where the expectation is taken over the distribution of SS.

Remark 3.2.

When the index sets AA and BB are from two independent surveys, instead of a random partition of the same survey, the above estimator simplifies to m^S=𝔼[(θ^A,iθ^B,iw)2]vi\hat{m}_{S}=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\hat{\theta}^{w}_{B,i})^{2}]-v_{i}, as 𝔼(θ^B,iwθi)=0\mathbb{E}(\hat{\theta}^{w}_{B,i}-\theta_{i})=0.

While m^S\hat{m}_{S} provides an unbiased estimator of the training MSE, the conditional biases, bib_{i} and did_{i}^{\mathcal{M}}, are not identifiable without knowing the true 𝜽\bm{\theta}. It is important to note that while the direct estimator is design unbiased, splitting a survey into two sets results in the realized error being carried over to all resulting sub-samples. Thus bib_{i} is nonzero given a finite sample SS. This leads to an unidentifiable bias when evaluating the difference of MSEs across models. Theorem 3.3 provides an adjusted score and characterizes the remainder bias term for cross-validation estimators.

Theorem 3.3 (Adjusted CV score).

For the ii-th area, define the fold-averaged means θ¯iw=K1k=1Kθ^(k),w\bar{\theta}^{w}_{i}=K^{-1}\sum_{k=1}^{K}\hat{\theta}^{(k),w} and θ¯i=K1k=1Kθ^(k),\bar{\theta}^{\mathcal{M}}_{i}=K^{-1}\sum_{k=1}^{K}\hat{\theta}^{(-k),\mathcal{M}}, the empirical variance of the held-out direct estimates

v^i=1Kk=1K(θ^i(k),wθ¯iw)2,\hat{v}_{i}=\frac{1}{K}\sum_{k=1}^{K}(\hat{\theta}^{(k),w}_{i}-\bar{\theta}^{w}_{i})^{2}, (9)

and the empirical covariance between the held-out direct estimates and the training-set model estimates

c^i=1Kk=1K(θ^i(k),wθ¯iw)(θ^i(k),θ¯i).\hat{c}_{i}^{\mathcal{M}}=\frac{1}{K}\sum_{k=1}^{K}(\hat{\theta}^{(k),w}_{i}-\bar{\theta}^{w}_{i})(\hat{\theta}^{(-k),\mathcal{M}}_{i}-\bar{\theta}^{\mathcal{M}}_{i}). (10)

The adjusted CV score for model \mathcal{M} is

score^i=1Kk=1K(θ^i(k)θ^i(k),w)2v^i+2c^i.\widehat{\mathrm{score}}_{i}^{\mathcal{M}}=\frac{1}{K}\sum_{k=1}^{K}\left(\hat{\theta}_{i}^{(-k)}-\hat{\theta}_{i}^{(k),w}\right)^{2}-\hat{v}_{i}+2\hat{c}_{i}^{\mathcal{M}}. (11)

For two models 1\mathcal{M}_{1} and 2\mathcal{M}_{2}, the expected score difference satisfies

𝔼(score^i1score^i2)=𝔼[(θ^A,i1θi)2(θ^A,i2θi)2]+ei,\mathbb{E}(\widehat{\mathrm{score}}_{i}^{\mathcal{M}_{1}}-\widehat{\mathrm{score}}_{i}^{\mathcal{M}_{2}})=\mathbb{E}\big[(\hat{\theta}_{A,i}^{\mathcal{M}_{1}}-\theta_{i})^{2}-(\hat{\theta}_{A,i}^{\mathcal{M}_{2}}-\theta_{i})^{2}\big]+e_{i}, (12)

where the expectation is over the sampling distribution of SS. The remainder term is

ei=2cov(bi,di1di2).e_{i}=-2\,\mbox{cov}(b_{i},\;d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}}). (13)

where bi=𝔼(θ^i(k),wθiS)b_{i}=\mathbb{E}(\hat{\theta}_{i}^{(k),w}-\theta_{i}\mid S) and di=𝔼(θ^i(k),θiS)d_{i}^{\mathcal{M}}=\mathbb{E}(\hat{\theta}_{i}^{(-k),\mathcal{M}}-\theta_{i}\mid S), and the covariance is over the distribution of samples SS.

Among the two correction terms, viv_{i} is a constant across models and cic_{i}^{\mathcal{M}} is analogous to the covariance penalty in Efron (2004), reflecting the shared survey structures between the training and validation samples. Theorem 3.3 also shows that the residual bias in the score difference is driven by the covariance between the conditional bias of the direct estimate in the validation set, bib_{i}, and the model difference in the training set, di1di2d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}}. The bias is large when, across hypothetical replicate surveys, the sampling error in the direct estimator for area ii is systematically associated with the difference in model biases. This occurs, for instance, when one model tracks the direct estimates more closely than the other. When there is only a single sample SS, the bias eie_{i} cannot be identified. Therefore, when two models have similar conditional MSE, the oracle MSE difference in Equation (12) is small and the remainder eie_{i} may dominate the score difference, making the comparison uninformative. To account for the unidentifiable bias, we establish a finite sample bound for |ei|e_{i} in Theorem 3.4, and show that ei0e_{i}\to 0 as data accumulates in Theorem 3.6.

Theorem 3.4 (Approximate error bound for adjusted CV score difference).

Under the approximations 𝔼(θ^B,iw|S)θ^S,iw\mathbb{E}(\hat{\theta}_{B,i}^{w}|S)\approx\hat{\theta}_{S,i}^{w} and 𝔼(θ^A,i|S)θ^S,i\mathbb{E}(\hat{\theta}_{A,i}^{\mathcal{M}}|S)\approx\hat{\theta}_{S,i}^{\mathcal{M}}, eie_{i} can be bounded by

|ei|22var(θ^S,iw)(var(θ^S,i1)+var(θ^S,i2))|e_{i}|\leq 2\sqrt{2\mbox{var}(\hat{\theta}_{S,i}^{w})\Big(\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}})+\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{2}})\Big)} (14)

where the subscript SS denotes models fitted on the full sample SS.

Theorem 3.4 gives rise to a plug-in estimator of the upper bound of the absolute bias. Under the stratified cross-validation scheme,

ti=22var^(θ^S,iw)(var^(θ^S,i1)+var^(θ^S,i2))t_{i}=2\sqrt{2\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{w})\Big(\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}})+\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{\mathcal{M}_{2}})\Big)} (15)

The first term var^(θ^S,iw)\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{w}) is the standard design-based variance estimator for the direct estimate based on the full sample. When considering Bayesian models, the posterior variance, varpost(θS,i)\mbox{var}_{\text{post}}(\theta_{S,i}^{\mathcal{M}}) provides a conservative plug-in estimator for var^(θ^S,i)\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{\mathcal{M}}), since it exceeds the sampling variance of the posterior mean by accounting for residual model uncertainty in addition to sampling variability.

For the estimation of the aggregated squared error, the aggregated adjusted CV score is score^q=i=1Mqiscore^i\widehat{\mbox{score}}_{q}^{\mathcal{M}}=\sum_{i=1}^{M}q_{i}\widehat{\mbox{score}}_{i}^{\mathcal{M}}, with the aggregated bias eq=i=1Mqieie_{q}=\sum_{i=1}^{M}q_{i}e_{i}, satisfying

|eq|i=1Mqi|ei|i=1Mqiti|e_{q}|\leq\sum_{i=1}^{M}q_{i}|e_{i}|\leq\sum_{i=1}^{M}q_{i}t_{i} (16)

Thus tq=i=1Mqitit_{q}=\sum_{i=1}^{M}q_{i}t_{i} serves as a threshold for deciding whether an observed aggregated score difference is large enough to support a ranking between 1\mathcal{M}_{1} and 2\mathcal{M}_{2}. The full procedure for KK-fold cross-validation of small area estimators is summarized in Algorithm 1.

Algorithm 1 KK-fold Stratified cross-validation for comparing small area estimators
Multi-stage stratified survey sample SS, candidate models 1\mathcal{M}_{1} and 2\mathcal{M}_{2}, aggregation weights q1,,qMq_{1},...,q_{M}.
  1. 1.

    Randomly partition SSUs within each cluster into KK folds and adjust for survey weights accordingly.

  2. 2.

    for k=1,,Kk=1,\ldots,K do

  3. 3.

    Compute the kk-th held-out direct estimates 𝜽^(k),w\hat{\bm{\theta}}^{(k),w}.

  4. 4.

    Fit 1\mathcal{M}_{1} and 2\mathcal{M}_{2} on the remaining samples to obtain 𝜽^(k),1\hat{\bm{\theta}}^{(-k),\mathcal{M}_{1}} and 𝜽^(k),2\hat{\bm{\theta}}^{(-k),\mathcal{M}_{2}}.

  5. 5.

    Compute adjusted CV scores for 1\mathcal{M}_{1} and 2\mathcal{M}_{2} using Equation 11.

  6. 6.

    Compute the error bounds tit_{i} using Equation 15 and the aggregated bound tq=i=1Mqitit_{q}=\sum_{i=1}^{M}q_{i}t_{i}.

  7. 7.

    If |score1score2|>tq|\mbox{score}^{\mathcal{M}_{1}}-\mbox{score}^{\mathcal{M}_{2}}|>t_{q} then return the model with the smaller score.

  8. 8.

    Else comparison is inconclusive.

Remark 3.5.

We focus on KK-fold cross-validation here, but the MSE decomposition in Theorem 3.1 and the error bound in Theorem 3.4 apply to a broad range of sample splitting schemes, as long as the two sets AA and BB are both representative of the full sample SS. The cross-validation procedure in Algorithm 1 can also be directly extended to repeated sample splitting with averaged estimators. A special case of such averaged estimator with K=2K=2 is provided in the Supplementary Materials.

Finally, the preceding results provide finite-sample tools for model comparison. A natural question is whether the remainder eie_{i} vanishes as data accumulate. Theorem 3.6 shows that the adjusted score is asymptotically consistent for the conditional MSE difference under mild conditions.

Theorem 3.6 (Consistency of the adjusted score).

With KK and MM fixed, assume var(θ^S,iw)0\mbox{var}(\hat{\theta}_{S,i}^{w})\to 0 as nn\to\infty and var(θ^S,i)\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}}) is bounded for each model \mathcal{M}, then ei0e_{i}\to 0 for each area i=1,,Mi=1,\ldots,M. Consequently, 𝔼(score^q1score^q2)(MSEA1MSEA2) 0.\mathbb{E}\big(\widehat{\mathrm{score}}_{q}^{\mathcal{M}_{1}}-\widehat{\mathrm{score}}_{q}^{\mathcal{M}_{2}}\big)-\big(\mathrm{MSE}_{A}^{\mathcal{M}_{1}}-\mathrm{MSE}_{A}^{\mathcal{M}_{2}}\big)\;\to\;0.

3.4 The failure of leave-one-area-out (LOAO) cross-validation

To illustrate the pitfalls of sample splitting strategies that fail to maintain the original survey design, we examine a popular validation scheme in the literature using leave-one-area-out (LOAO) cross-validation. Under LOAO, the model is fitted on data from M1M-1 areas and then used to predict the prevalence in the held-out area. Let θ^i(i)\hat{\theta}_{i}^{(-i)} denote the predicted prevalence for the ii-th area using the rest of the data, an analogous score evaluating the squared error loss is

score^LOAO=i=1Mqi(θ^i(i)θ^iw)2.\widehat{\mathrm{score}}_{\mathrm{LOAO}}=\sum_{i=1}^{M}q_{i}\left(\hat{\theta}_{i}^{(-i)}-\hat{\theta}_{i}^{w}\right)^{2}.

LOAO validation has been informally used in the SAE literature (Steorts et al., 2020; Wakefield et al., 2025). For unit-level models, where individual observations are nested within areas, LOAO is a form of block cross-validation. For area-level models, LOAO corresponds directly to the standard leave-one-out cross-validation (LOO-CV) and can be evaluated using Pareto smoothed importance sampling without refitting the model MM times (Vehtari et al., 2016). The computational efficiency makes it a widely used diagnostic tool to assess Bayesian area-level models .

Although conceptually simple and straightforward to implement, LOAO is an extrapolation estimator targeting a different estimand, 𝔼[(θ^i(i)θi)2]\mathbb{E}[(\hat{\theta}_{i}^{(-i)}-\theta_{i})^{2}]. The shifted target can lead to reversed rankings when comparing smoothing models in the typical SAE setting.

Proposition 3.7.

For the ii-th area, the difference between the extrapolation and smoothing MSE is

𝔼[(θ^i(i)θi)2]𝔼[(θ^iθi)2]=(var(θ^i(i))var(θ^i))+((Δi(i))2(Δi)2),\mathbb{E}[(\hat{\theta}_{i}^{(-i)}-\theta_{i})^{2}]-\mathbb{E}[(\hat{\theta}_{i}-\theta_{i})^{2}]=\Big(\mbox{var}(\hat{\theta}_{i}^{(-i)})-\mbox{var}(\hat{\theta}_{i})\Big)+\Big((\Delta_{i}^{(-i)})^{2}-(\Delta_{i})^{2}\Big),

where Δi=𝔼(θ^iθi)\Delta_{i}=\mathbb{E}(\hat{\theta}_{i}-\theta_{i}) and Δi(i)=𝔼(θ^i(i)θi)\Delta_{i}^{(-i)}=\mathbb{E}(\hat{\theta}_{i}^{(-i)}-\theta_{i}) are the conditional biases of the smoothing and extrapolation estimators, and the expectations are taken over the sampling distribution conditional on 𝛉\bm{\theta}.

The proof of Proposition 3.7 follows from standard MSE decomposition. The gap between the two MSE estimand manifests differently across models. It is easy to see that for a strongly over-smoothed model, the gap is close to 0 as the model estimates are not sensitive to dropping data from a single area. However, for a well calibrated model, both the estimation variance and bias of the estimator for the ii-th area are expected to increase after removing all data from the ii-th area, thus making the extrapolation MSE larger than the smoothing MSE. Therefore, LOAO systematically penalizes flexible models more than over-smoothed models. We numerically illustrate this bias in Section 4.2.

4 Numerical results

We use the 2024 Zambia DHS (Zambia Statistics Agency et al., 2024) throughout our numerical analysis. Section 4.1 describes the survey design and candidate models under comparison. Section 4.2 describes a design-based simulation framework to construct synthetic population and surveys similar to the 2024 Zambia DHS. Section 4.3 illustrates our procedure with the real survey data.

4.1 Data and candidate SAE models

The 2024 Zambia DHS adopted a two-stage stratified cluster design with a sampling frame based on the 2022 Census of Population and Housing of the Republic of Zambia. The population comprises 36,77036,770 enumeration areas (EAs). Zambia has 1010 provinces, which are further divided into 115115 districts. In the first stage, each of the province was stratified by urban and rural residence, yielding 2020 strata in total. Within each stratum, EAs were selected using probability proportional to size (PPS) sampling, resulting in 545545 clusters in total. In the second stage, within each selected cluster, households were selected through equal-probability systematic sampling, with a fixed take of 2525 households per cluster, yielding a target sample of 13,62513,625 households. As a working example, we consider estimating the prevalence of female literacy in Zambia. We focus our analysis on prevalence estimation at the province level. The Supplementary Materials contain additional numerical results comparing models at the district level.

To illustrate the proposed cross-validation framework with models commonly used in practice, we consider three candidate estimators. At the area level, we consider Fay-Herriot models with

ϕ^iwθiN(logit(θi),V^i),logit(θi)=α+ui,i=1,,M,\displaystyle\hat{\phi}_{i}^{w}\mid\theta_{i}\sim N(\mbox{logit}(\theta_{i}),\hat{V}_{i}),\;\;\;\;\mbox{logit}(\theta_{i})=\alpha+u_{i},\;\;\;\;i=1,...,M, (17)

where ϕ^iw=logit(θ^iw)\hat{\phi}_{i}^{w}=\mbox{logit}(\hat{\theta}_{i}^{w}) is the transformed direct estimates, and V^i\hat{V}_{i} is the associated asymptotic variance estimate of θ^iw\hat{\theta}_{i}^{w}, and uiiidN(0,σu2)u_{i}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,\sigma_{u}^{2}) are random effects.

At the unit level, Bernoulli likelihood for the binary survey responses is equivalent to a cluster-level binomial model. For a survey consisting of CC clusters, we let c[j]c[j] be the cluster index of the jj-th individual and i[c]i[c] be the area index of the cc-th cluster. Let YcY_{c} and ncn_{c} denote the number of positive responses and the number of sampled units from the cc-th cluster, respectively, i.e., Yc=j:c[j]=cykY_{c}=\sum_{j:c[j]=c}y_{k} and nc=j:c[j]=c1n_{c}=\sum_{j:c[j]=c}1. We consider a cluster-level model with a beta-binomial likelihood (Dong & Wakefield, 2021; Dong et al., 2026; Wakefield et al., 2025),

Ycpc\displaystyle Y_{c}\mid p_{c} BetaBinomial(nc,pc,d).\displaystyle\sim\mbox{BetaBinomial}(n_{c},p_{c},d). (18)

In this formulation, pcp_{c} is the latent prevalence for the cc-th cluster, and dd is the overdispersion parameter that captures the additional within-cluster variation. The cluster-level prevalence is then linked to area-level prevalence via logit(pc)=α+ui[c],\mbox{logit}(p_{c})=\alpha+u_{i[c]}, with random effects uiiidN(0,σu2)u_{i}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,\sigma_{u}^{2}).

For the rest of the section, we consider three candidate models that differ in their model class and prior specification. For all three models, we adopt the penalised complexity (PC) prior (Simpson et al., 2017), where PC(U,α)PC(U,\alpha) prior for the standard deviation σ\sigma specifies that Pr(σ>U)=αPr(\sigma>U)=\alpha.

  • 1\mathcal{M}_{1}: Unit-level Beta-Binomial model with a default PC(1,0.01)PC(1,0.01) prior on σu\sigma_{u}.

  • 2\mathcal{M}_{2}: Area-level Fay-Herriot model with the same priors for random effects as in 1\mathcal{M}_{1}.

  • 3\mathcal{M}_{3}: Same model as in 1\mathcal{M}_{1}, but with a highly informative PC(0.01,0.01)PC(0.01,0.01) prior on σu\sigma_{u}.

All models are fitted using the R package surveyPrev (Dong et al., 2024), employing INLA (Lindgren & Rue, 2015; Rue et al., 2009) in posterior computation. Figure 1 compares the direct estimates and estimates from the candidate models. These estimates will be formally evaluated using the proposed method in Section 4.3.

Refer to caption
Figure 1: Prevalence estimates and standard deviations of three candidate models for the percentage of women who are literate using 2024 Zambia DHS survey.

4.2 Design-based synthetic survey simulation

We first consider a design-based simulation study that mimics the 2024 Zambia DHS. Since the 2024 Zambia DHS was based on the sampling frame from the 2022 census, we constructed a 300m ×\times 300m population grid using the 2022 population surface from WorldPop (WorldPop, 2018). To construct the sampling frame, we ranked pixels by population within each province and retained the most populated pixels to match the provincial EA counts from the 2024 DHS sampling frame, yielding 36,77036{,}770 gridded pixels. We treat these pixels as the master sampling frame of our study. For simplicity, we ignore urban/rural status of the pixels in the simulation and consider synthetic surveys stratified by province.

While the analysis in this paper focuses on prevalence estimation at the province level, we construct the synthetic population in this simulation study using smoothed district-level estimates of women’s literacy rate, obtained by fitting a Fay-Herriot model on the 2024 Zambia DHS. This introduces realistic within-province heterogeneity in our simulated data. For a given cluster cc with population NcN_{c} in a district with assumed true prevalence pp, we generate event count YcY_{c} from a Beta-Binomial distribution with mean NcpN_{c}p and variance Ncp(1p)(1+(Nc1)d)N_{c}p(1-p)(1+(N_{c}-1)d), where d=0.00001d=0.00001 introduces additional overdispersion within the district. The simulated finite population prevalence in the ii-th province is then θi=ciYc/ciNc\theta_{i}=\sum_{c\in i}Y_{c}/\sum_{c\in i}N_{c}.

Given the synthetic population, we draw 50 replication surveys using a two-stage stratified design, where 5050 clusters are sampled in each stratum and 3030 households are sampled in each cluster. Figures 2 displays, for a single replicate, point estimates and 90% uncertainty intervals produced by the candidate models. Direct estimation, 1\mathcal{M}_{1}, and 2\mathcal{M}_{2} all produce estimates close to the population truth across all Admin-1 areas. By contrast, 3\mathcal{M}_{3} shows excessive shrinkage, though with overlapping interval estimates compared to others.

Refer to caption
Figure 2: Comparison of Admin-1 prevalence estimates and 90% credible intervals for three model specifications (1\mathcal{M}_{1}, 2\mathcal{M}_{2}, 3\mathcal{M}_{3}) and direct estimates against the synthetic population truth (red dots). Estimates are shown for a single simulated survey.

We computed the adjusted cross-validation scores of all models using 55-fold stratified cross-validation. Figure 3 summarizes the adjusted CV score comparisons of the three models across the 5050 synthetic surveys. When comparing 3\mathcal{M}_{3} with 1\mathcal{M}_{1}, the adjusted CV score was able to consistently identify that 3\mathcal{M}_{3} has larger MSE and the score differences all exceed the error bound, tqt_{q}, providing decisive evidence in favor of 1\mathcal{M}_{1}. The comparison between 2\mathcal{M}_{2} and 1\mathcal{M}_{1} is more difficult as the oracle differences are much smaller in scale. The adjusted CV score differences mostly have the same sign as the oracle difference, but in all replicates, the estimated score differences were between ±tq\pm t_{q}, thus suggesting the comparison between 2\mathcal{M}_{2} and 1\mathcal{M}_{1} is inconclusive, as the available information in the data does not overcome the unestimable bias components of the conditional MSE.

Refer to caption
Figure 3: Left: Adjusted CV score differences versus oracle full-sample MSE differences. The xx-axis shows the oracle full-sample MSE difference for 31\mathcal{M}_{3}-\mathcal{M}_{1} (top) and 21\mathcal{M}_{2}-\mathcal{M}_{1} (bottom). Each point represents one synthetic survey replicate. Points in the first and third quadrants indicate that the CV score difference has the same sign as the oracle full-sample MSE difference and thus correct ranking. Right: Adjusted CV score differences versus the error bound. The red shaded region is the inconclusive region where |score^qascore^qb|<tq|\widehat{\mbox{score}}_{q}^{\mathcal{M}a}-\widehat{\mbox{score}}_{q}^{\mathcal{M}b}|<t_{q}, i.e., scores differences are too small to support decisive order of the two models.

In this simulation setting, the relative performance of the candidate models remains unchanged between the full-sample and oracle training MSE. When sample sizes are smaller, the two estimands may lead to different model rankings, which we investigate in the Supplementary Materials. We also compare different sample-splitting strategies and the choice of KK in the Supplementary Materials.

To gain additional insight on the model performance, we also examine the area-specific scores in Figure 4. For the comparison between 3\mathcal{M}_{3} and 1\mathcal{M}_{1}, the over-smoothed 3\mathcal{M}_{3} departs noticeably from the other models in Copperbelt, Luapula, Lusaka, and Northern provinces, producing absolute score differences that are well above the error bound. For comparing 2\mathcal{M}_{2} with 1\mathcal{M}_{1}, area-specific score differences still remain tightly clustered near zero, with all points falling within the inconclusive region, except for Lusaka in a few occasions.

Refer to caption
Figure 4: Area-level adjusted CV score differences versus oracle error differences (left) and versus the error bound tit_{i} (right) under Setting 1, disaggregated by Admin-1 province. The top row shows 31\mathcal{M}_{3}-\mathcal{M}_{1} and the bottom row shows 21\mathcal{M}_{2}-\mathcal{M}_{1}. Each point represents one simulation replicate for one province, colored by province.

In summary, these results demonstrate that the CV score reliably ranks the models for the performance of subnational level prevalence estimation, while the error bound tqt_{q} provides the principled threshold for separating genuine differences from noise-dominated ones. Additional diagnostics and visualizations are provided in Supplementary Materials.

We also compare the LOAO strategy in Figure 5 for 1\mathcal{M}1 and 3\mathcal{M}3 under leave-one-area-out (LOAO) validation. As discussed in Section 3.4, LOAO penalizes the over-smoothed model less. Indeed in this comparison, the procedure reached the opposite conclusion and ranked 3\mathcal{M}3 as the preferred model in every replicate, illustrating the bias described in Section 3.4.

Refer to caption
Figure 5: Comparison of 1\mathcal{M}1 and 3\mathcal{M}3 under LOAO validation across 50 simulation replicates. Left: LOAO scores versus oracle full-sample MSE for the two models. Right: LOAO score difference, score^LOAO(3)score^LOAO(1)\widehat{\mathrm{score}}_{\mathrm{LOAO}}(\mathcal{M}3)-\widehat{\mathrm{score}}_{\mathrm{LOAO}}(\mathcal{M}1), versus oracle full-sample MSE difference, MSEoracle(3)MSEoracle(1)\mathrm{MSE}_{\mathrm{oracle}}(\mathcal{M}3)-\mathrm{MSE}_{\mathrm{oracle}}(\mathcal{M}1).

4.3 Analysis of household survey data

We now return to the 2024 Zambia DHS survey with the indicator given by the percentage of women who are literate. The stratified cross-validation was conducted with the 2020 strata defined by province crossed with urban/rural residence. Results from 5-fold cross-validation is reported in Figure 6. Estimates from 3\mathcal{M}_{3} is heavily shrunk to the overall mean. Between 1\mathcal{M}_{1} and 2\mathcal{M}_{2}, we observe slightly more shrinkage from 1\mathcal{M}_{1}, and the shrinkage is more obvious in Lusaka and Copperbelt. The cross-validation procedure again identifies that 3\mathcal{M}_{3} has larger aggregated score that exceeds the estimated threshold, compared with 2\mathcal{M}_{2}. The difference between 1\mathcal{M}_{1} and 2\mathcal{M}_{2}, on the other hand, is a lot smaller and within the inclusive region.

Refer to caption
(a)
Refer to caption
(b)
Figure 6: Analysis of female literacy rate using the 2024 Zambia DHS. Panel (a): Point estimates and 90% credible intervals for the three models with direct estimates and confidence interval. Panel (b): 5-fold adjusted CV score comparison by Admin-1 area and in aggregate. The top row compares the Fay-Herriot model (2\mathcal{M}_{2}) with the over-smoothed unit-level model (3\mathcal{M}_{3}), and the bottom row compares the Fay-Herriot model (2\mathcal{M}_{2}) with default unit-level model (1\mathcal{M}_{1}). In each scatter plot, points denote the 10 provinces and the asterisk denotes the aggregated result. The maps show the associated area-level score differences, grey area has |score^sascore^sb|<ti|\widehat{\mathrm{score}}_{s}^{\mathcal{M}_{a}}-\widehat{\mathrm{score}}_{s}^{\mathcal{M}_{b}}|<t_{i}.

A close examination of the area-specific scores in Figure 6 provides additional insight into the behavior of the models. When comparing 2\mathcal{M}_{2} and 3\mathcal{M}_{3}, the score differences are negative and exceed the area-specific threshold in magnitude across all ten provinces, yielding a uniform preference for 2\mathcal{M}_{2} over 3\mathcal{M}_{3}. When comparing 1\mathcal{M}_{1} and 2\mathcal{M}_{2}, while the overall score differences are smaller than the potential bias in magnitude, there are two provinces, Lusaka and Copperbelt, with score differences exceeding their area-specific thresholds. This is consistent with what we observe with the estimated prevalence difference between the two models, and can lead to M2M_{2} being preferred under another set of aggregation weights different from the population proportions in this analysis.

5 Discussion

The proposed cross-validation framework leaves several methodological questions open. First, although the aggregate score is defined for general area weights, the choice of the weighting scheme itself deserves further study. In some applications, one may prefer to use variance to aggregate the scores, so that discrepancies in areas with different uncertainty receive different emphasis. Such choices may be attractive when model comparison is intended to reflect not only average squared error, but also performance in areas where the estimator is most informative. At the same time, the theoretical consequences of variance-based weighting are not yet clear, especially when the weights depend on the fitted model through posterior uncertainty estimates.

The squared error loss considered in this paper evaluates only point estimates. When full predictive distributions are available, one may instead use the continuous ranked probability score (CRPS), where the full posterior predictive CDF is compared against held-out direct estimates. Establishing similar decomposition and finite-sample error bound for CRPS can be a promising future research direction.

As with all sample-splitting methods, the success of the proposed procedure depends on the availability of sufficient data. When the target of inference is at the stratification level, e.g., provinces of Zambia, the impact of data sparsity is usually not severe. While we demonstrate in the Supplementary Materials that the procedure can also be useful at the level of 115115 districts, it remains an open question how to deal with even more extreme data sparsity, e.g., when no samples are collected in many areas.

Lastly, the optimal choice of KK in the KK-fold cross-validation is another important question relevant to practitioners. As we demonstrate empirically in the Supplementary Materials, a larger KK reduces the gap between MSEAMSE_{A} and the oracle full-sample MSE, at the sacrifice of validation power due to the small sample sizes used for validation. Similar observations were made in Kawano et al. (2026). We leave these topics for future research.

Acknowledgement

We are grateful to the Space Time Analysis Bayes (STAB) working group for discussion and feedback on the paper. We also acknowledge the DHS for access and use of the data. The authors are supported by the National Institutes of Health [R01HD112421], and in part by the Gates Foundation. The findings and conclusions contained within are those of the authors and do not necessarily reflect positions or policies of the Gates Foundation.

References

  • Alfons et al. (2010) Alfons, A., Kraft, S., Templ, M. & Filzmoser, P. (2010). Simulation of synthetic population data for household surveys with application to EU-SILC. Research Report CS-2010-1, Department of Statistics and Probability Theory, Vienna University of Technology.
  • Besag et al. (1991) Besag, J., York, J. & Mollié, A. (1991). Bayesian image restoration with two applications in spatial statistics. Annals of the Institute of Statistics and Mathematics 43, 1–59.
  • Bradley et al. (2015) Bradley, J. R., Holan, S. H. & Wikle, C. K. (2015). Multivariate spatio-temporal models for high-dimensional areal data with application to longitudinal employer-household dynamics. The Annals of Applied Statistics 9, 1761–1791.
  • Brown et al. (2001) Brown, G., Chambers, R., Heady, P. & Heasman, D. (2001). Evaluation of small area estimation methods—an application to unemployment estimates from the uk lfs. In Proceedings of statistics Canada symposium, vol. 2001. Statistics Canada.
  • Chi et al. (2022) Chi, G., Fang, H., Chatterjee, S. & Blumenstock, J. E. (2022). Microestimates of wealth for all low-and middle-income countries. Proceedings of the National Academy of Sciences 119, e2113658119.
  • Corsi et al. (2012) Corsi, D. J., Neuman, M., Graham, W. B. & Subramanian, S. (2012). The Demographic and Health Surveys program: an overview. International Journal of Epidemiology 41, 1602–1613.
  • Datta & Lahiri (2000) Datta, G. S. & Lahiri, P. (2000). A unified measure of uncertainty of estimated best linear unbiased predictors in small area estimation problems. Statistica Sinica , 613–627.
  • Dharamshi et al. (2025) Dharamshi, A., Neufeld, A., Motwani, K., Gao, L. L., Witten, D. & Bien, J. (2025). Generalized data thinning using sufficient statistics. Journal of the American Statistical Association 120, 511–523.
  • Dong et al. (2024) Dong, Q., Li, Z. R., Wu, Y., Boskovic, A. & Wakefield, J. (2024). surveyPrev: Mapping the Prevalence of Binary Indicators using Survey Data in Small Areas. R package version 1.0.0.
  • Dong et al. (2026) Dong, Q., Wu, Y., Li, Z. R. & Wakefield, J. (2026). Toward a principled workflow for prevalence mapping using household survey data. Journal of Survey Statistics and Methodology , smaf048.
  • Dong & Wakefield (2021) Dong, T. & Wakefield, J. (2021). Modeling and presentation of health and demographic indicators in a low- and middle-income countries context. Vaccine 39, 2584–2594.
  • Dorfman (2018) Dorfman, A. H. (2018). Towards a routine external evaluation protocol for small area estimation. International Statistical Review 86, 259–274.
  • Efron (2004) Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association 99, 619–632.
  • Fabrizi & Lahiri (2013) Fabrizi, E. & Lahiri, P. (2013). A design-based approximation to the bayes information criterion in finite population sampling. Statistica 73, 289–301.
  • Fay & Herriot (1979) Fay, R. & Herriot, R. (1979). Estimates of income for small places: an application of James–Stein procedure to census data. Journal of the American Statistical Association 74, 269–277.
  • Franco & Maitra (2023) Franco, C. & Maitra, P. (2023). Combining surveys in small area estimation using area-level models. Wiley Interdisciplinary Reviews: Computational Statistics 15, e1613.
  • Gomez-Rubio et al. (2008) Gomez-Rubio, V., Best, N. & Richardson, S. (2008). A comparison of different methods for small area estimation .
  • Hájek (1971) Hájek, J. (1971). Discussion of, “An essay on the logical foundations of survey sampling, part I”, by D. Basu. In Foundations of Statistical Inference, V. Godambe & D. Sprott, eds. Toronto: Holt, Rinehart and Winston.
  • Holbrook et al. (2020) Holbrook, A., Lumley, T. & Gillen, D. (2020). Estimating prediction error for complex samples. Canadian Journal of Statistics 48, 204–221.
  • Iparragirre et al. (2023) Iparragirre, A., Lumley, T., Barrio, I. & Arostegui, I. (2023). Variable selection with lasso regression for complex survey data. Stat 12, e578.
  • Janicki et al. (2022) Janicki, R., Raim, A. M., Holan, S. H. & Maples, J. J. (2022). Bayesian nonparametric multivariate spatial mixture mixed effects models with application to american community survey special tabulations. The Annals of Applied Statistics 16, 144–168.
  • Kawano et al. (2026) Kawano, S., Parker, P. A. & Li, Z. R. (2026). On data thinning for model validation in small area estimation. arXiv preprint arXiv:2604.04141 .
  • Khan & Hancioglu (2019) Khan, S. & Hancioglu, A. (2019). Multiple indicator cluster surveys: Delivering robust data on children and women across the globe. Studies in Family Planning 50, 279–286.
  • Kuh et al. (2024) Kuh, S., Kennedy, L., Chen, Q. & Gelman, A. (2024). Using leave-one-out cross validation (loo) in a multilevel regression and poststratification (mrp) workflow: A cautionary tale. Statistics in Medicine 43, 953–982.
  • Lahiri & Pramanik (2019) Lahiri, P. & Pramanik, S. (2019). Evaluation of synthetic small-area estimators using design-based methods. Austrian Journal of Statistics 48, 43–57.
  • Li et al. (2019) Li, Z. R., Hsiao, Y., Godwin, J., Martin, B. D., Wakefield, J. & Clark, S. J. (2019). Changes in the spatial distribution of the under five mortality rate: small-area analysis of 122 DHS surveys in 262 subregions of 35 countries in Africa. PLoS One 14, e0210645.
  • Lindgren & Rue (2015) Lindgren, F. & Rue, H. (2015). Bayesian spatial modelling with R-INLA. Journal of Statistical Software 63, 1–25.
  • Lohr & Rao (2009) Lohr, S. L. & Rao, J. (2009). Jackknife estimation of mean squared error of small area predictors in nonlinear mixed models. Biometrika 96, 457–468.
  • Lumley & Scott (2014) Lumley, T. & Scott, A. (2014). Tests for regression models fitted to survey data. Australian & New Zealand Journal of Statistics 56, 1–14.
  • Lumley & Scott (2015) Lumley, T. & Scott, A. (2015). Aic and bic for modeling with complex survey data. Journal of Survey Statistics and Methodology 3, 1–18.
  • Lumley & Scott (2017) Lumley, T. & Scott, A. (2017). Fitting regression models to survey data. Statistical Science , 265–278.
  • Mercer et al. (2015) Mercer, L., Wakefield, J., Pantazis, A., Lutambi, A., Mosanja, H. & Clark, S. (2015). Small area estimation of childhood mortality in the absence of vital registration. Annals of Applied Statistics 9, 1889–1905.
  • Merfeld et al. (2024) Merfeld, J. D., Chen, H., Lahiri, P. & Newhouse, D. (2024). Small area estimation with geospatial data: A primer. Background Document ESA/STAT/AC.394/BG-3p, United Nations Statistical Commission. Prepared for the 56th Session of the United Nations Statistical Commission (March 2025) by the Inter-Secretariat Working Group on Household Surveys.
  • Merfeld et al. (2022) Merfeld, J. D., Newhouse, D. L., Weber, M. & Lahiri, P. (2022). Combining survey and geospatial data can significantly improve gender-disaggregated estimates of labor market outcomes .
  • Neufeld et al. (2024) Neufeld, A., Dharamshi, A., Gao, L. L. & Witten, D. (2024). Data thinning for convolution-closed distributions. Journal of Machine Learning Research 25, 1–35.
  • Prasad & Rao (1990) Prasad, N. & Rao, J. (1990). The estimation of the mean squared error of small-area estimators. The Annals of Statistics 85, 163–171.
  • Rao & Molina (2015) Rao, J. & Molina, I. (2015). Small Area Estimation, Second Edition. New York: John Wiley.
  • Rao & Scott (1981) Rao, J. N. & Scott, A. J. (1981). The analysis of categorical data from complex sample surveys: chi-squared tests for goodness of fit and independence in two-way tables. Journal of the American statistical association 76, 221–230.
  • Rao & Scott (1984) Rao, J. N. & Scott, A. J. (1984). On chi-squared tests for multiway contingency tables with cell proportions estimated from survey data. The Annals of statistics , 46–60.
  • Riebler et al. (2016) Riebler, A., Sørbye, S., Simpson, D. & Rue, H. (2016). An intuitive Bayesian spatial model for disease mapping that accounts for scaling. Statistical Methods in Medical Research 25, 1145–1165.
  • Rivest & Belmonte (2000) Rivest, L.-P. & Belmonte, E. (2000). A conditional mean squared error of small area estimators. Survey Methodology 26, 67–78.
  • Rue et al. (2009) Rue, H., Martino, S. & Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations (with discussion). Journal of the Royal Statistical Society, Series B 71, 319–392.
  • Saha et al. (2023) Saha, U. R., Das, S., Baffour, B. & Chandra, H. (2023). Small area estimation of age-specific and total fertility rates in Bangladesh. Spatial Demography 11, 2.
  • Simpson et al. (2017) Simpson, D., Rue, H., Riebler, A., Martins, T. & Sørbye, S. (2017). Penalising model component complexity: A principled, practical approach to constructing priors (with discussion). Statistical Science 32, 1–28.
  • Steorts et al. (2020) Steorts, R. C., Schmid, T. & Tzavidis, N. (2020). Smoothing and benchmarking for small area estimation. International Statistical Review 88, 580–598.
  • The DHS Program (2026) The DHS Program (2026). The Demographic and Health Surveys program. http://www.dhsprogram.com. Accessed: 2026-04-08.
  • Thomas & Rao (1987) Thomas, D. R. & Rao, J. (1987). Small-sample comparisons of level and power for simple goodness-of-fit statistics under cluster sampling. Journal of the American Statistical Association 82, 630–636.
  • Trevisani et al. (2017) Trevisani, M., Torelli, N. et al. (2017). A comparison of hierarchical bayesian models for small area estimation of counts. Open Journal of Statistics 7, 521–550.
  • UNICEF (2026) UNICEF (2026). Multiple indicator cluster surveys. http://mics.unicef.org. Accessed: 2026-04-08.
  • Vehtari et al. (2016) Vehtari, A., Mononen, T., Tolvanen, V., Sivula, T. & Winther, O. (2016). Bayesian leave-one-out cross-validation approximations for gaussian latent variable models. Journal of Machine Learning Research 17, 1–38.
  • Wakefield et al. (2019) Wakefield, J., Fuglstad, G.-A., Riebler, A., Godwin, J., Wilson, K. & Clark, S. (2019). Estimating under five mortality in space and time in a developing world context. Statistical Methods in Medical Research 28, 2614–2634.
  • Wakefield et al. (2025) Wakefield, J., Gao, P. A., Fuglstad, G.-A. & Li, Z. R. (2025). The two cultures of prevalence mapping: Small area estimation and model-based geostatistics. arXiv preprint arXiv:2110.09576 .
  • Wakefield et al. (2026) Wakefield, J., Jiang, J. & Wu, Y. (2026). Automatic variance adjustment for small area estimation. arXiv preprint arXiv:2602.14387 .
  • Wakefield et al. (2020) Wakefield, J., Okonek, T. & Pedersen, J. (2020). Small area estimation for disease prevalence mapping. International Statistical Review 88, 398–418.
  • Wieczorek et al. (2022) Wieczorek, J., Guerin, C. & McMahon, T. (2022). K-fold cross-validation for complex sample surveys. Stat 11, e454.
  • Wieczorek & Franco (2013) Wieczorek, O. & Franco, C. (2013). Small area estimation evaluation strategies: An application to the American Community Survey. In Proceedings of the Joint Statistical Meetings, Section on Survey Research Methods. American Statistical Association Alexandria, VA.
  • WorldPop (2018) WorldPop (2018). Global high resolution population denominators project - funded by the bill and melinda gates foundation (opp1134076).
  • Wu et al. (2021) Wu, Y., Li, Z. R., Mayala, B., Wang, H., Gao, P., Paige, J., Fuglstad, G.-A., Moe, C., Godwin, J., Donohue, R., Janocha, B., Croft, T. & Wakefield, J. (2021). Spatial Modeling for Subnational Administrative level 2 Small-Area Estimation. DHS Spatial Analysis Reports No. 21. Rockville, Maryland, USA.
  • Zambia Statistics Agency et al. (2024) Zambia Statistics Agency, Ministry of Health (MoH) [Zambia] & ICF (2024). Zambia demographic and health survey 2024. Tech. rep., Zambia Statistics Agency, MoH, and ICF, Lusaka, Zambia, and Rockville, Maryland, USA.

Supplementary material

Appendix A Proofs

Consider splitting a sample SS into two disjoint sets AA and B=SAB=S\setminus A following one of the sample splitting schemes discussed in the previous section. For a sequence of random variables {Xn}\{X_{n}\} and a sequence of constants {an}\{a_{n}\}, Xn=Op(an)X_{n}=O_{p}(a_{n}) indicates that Xn/anX_{n}/a_{n} is bounded in probability and Xn=op(an)X_{n}=o_{p}(a_{n}) indicates that Xn/anX_{n}/a_{n} converges in probability to zero. O()O(\cdot) and o()o(\cdot) denote the corresponding deterministic asymptotic bounds. We use XYX\lesssim Y to denote that XCYX\leq CY for some constant C>0C>0 that does not depend on the sample size nn.

A.1 Proof of Theorem 3.1

Proof A.1.

Consider the conditional expectation of the first term:

𝔼[(θ^A,iθ^B,iw)2S]\displaystyle\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\hat{\theta}^{w}_{B,i})^{2}\mid S] =𝔼[(θ^A,iθi(θ^B,iwθi))2S]\displaystyle=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i}-(\hat{\theta}^{w}_{B,i}-\theta_{i}))^{2}\mid S]
=𝔼[(θ^A,iθi)2S]+𝔼[(θ^B,iwθi)2S]2𝔼[(θ^A,iθi)(θ^B,iwθi)S].\displaystyle=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})^{2}\mid S]+\mathbb{E}[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}\mid S]-2\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})(\hat{\theta}^{w}_{B,i}-\theta_{i})\mid S].

Expanding the last two terms yields:

𝔼[(θ^B,iwθi)2S]\displaystyle\mathbb{E}[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}\mid S] =var(θ^B,iwS)+(𝔼[θ^B,iwθiS])2=vi+bi2,\displaystyle=\mbox{var}(\hat{\theta}_{B,i}^{w}\mid S)+(\mathbb{E}[\hat{\theta}^{w}_{B,i}-\theta_{i}\mid S])^{2}=v_{i}+b_{i}^{2},
𝔼[(θ^A,iθi)(θ^B,iwθi)S]\displaystyle\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})(\hat{\theta}^{w}_{B,i}-\theta_{i})\mid S] =cov(θ^A,i,θ^B,iwS)+𝔼(θ^A,iθiS)𝔼(θ^B,iwθiS)=ci+dibi.\displaystyle=\mbox{cov}(\hat{\theta}_{A,i}^{\mathcal{M}},\hat{\theta}_{B,i}^{w}\mid S)+\mathbb{E}(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i}\mid S)\mathbb{E}(\hat{\theta}_{B,i}^{w}-\theta_{i}\mid S)=c_{i}^{\mathcal{M}}+d_{i}^{\mathcal{M}}b_{i}.

Substituting these into the expression for 𝔼(m^SS)\mathbb{E}(\hat{m}_{S}\mid S) and taking the expectation over SS yields 𝔼(m^S)=𝔼[(θ^A,iθi)2]\mathbb{E}(\hat{m}_{S})=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})^{2}].

A.2 Proof of Theorem 3.3

Proof A.2.

From Theorem 3.1, 𝔼(scoreS)=𝔼[(θ^A,iθi)2]+𝔼(vi+bi2)2𝔼(bidi)\mathbb{E}(\mbox{score}_{S}^{\mathcal{M}})=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})^{2}]+\mathbb{E}(v_{i}+b_{i}^{2})-2\mathbb{E}(b_{i}d_{i}^{\mathcal{M}}). The difference is 𝔼[(θ^A,i1θi)2]𝔼[(θ^A,i2θi)2]2𝔼[bi(di1di2)]\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}_{1}}-\theta_{i})^{2}]-\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}_{2}}-\theta_{i})^{2}]-2\mathbb{E}[b_{i}(d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}})]. Since θ^w\hat{\theta}^{w} is design unbiased, 𝔼(bi)=0\mathbb{E}(b_{i})=0. Thus, 𝔼[bi(di1di2)]=cov(bi,di1di2)\mathbb{E}[b_{i}(d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}})]=\mbox{cov}(b_{i},d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}}).

A.3 Proof of Theorem 3.4

{condition}

[Linearization of the model-based estimator difference] For each area ii and each pair of models (1,2)(\mathcal{M}_{1},\mathcal{M}_{2}), there exist bounded functions Δψi,j\Delta\psi_{i,j} of jSj\in S, fixed given SS, such that for any subsample ASA\subseteq S with nA=|A|n_{A}=|A|,

θ^A,i1θ^A,i2=1nAjAΔψi,j+ΔrA,i,ΔrA,i=Op(nA1).\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}=\frac{1}{n_{A}}\sum_{j\in A}\Delta\psi_{i,j}+\Delta r_{A,i},\qquad\Delta r_{A,i}=O_{p}(n_{A}^{-1}). (19)
{condition}

[Linearization of the Hájek direct estimator] For each area ii, there exist bounded functions ψi,jw\psi^{w}_{i,j}, defined on units jUij\in U_{i} and not depending on the random partition, such that for any subset BSB\subseteq S with nB,i=|BUi|n_{B,i}=|B\cap U_{i}|,

θ^B,iwθi=1nB,ijBUiψi,jw+rB,iw,rB,iw=Op(nB,i1).\hat{\theta}^{w}_{B,i}-\theta_{i}=\frac{1}{n_{B,i}}\sum_{j\in B\cap U_{i}}\psi^{w}_{i,j}+r^{w}_{B,i},\qquad r^{w}_{B,i}=O_{p}(n_{B,i}^{-1}). (20)
Lemma A.3 (Conditional expectation of a subsample-based difference under linearization).

Let SS be a fixed sample of size n=|S|n=|S|, and let ASA\subset S be a random subsample under a mechanism that assigns each unit jSj\in S to AA with equal inclusion probability πA(0,1)\pi_{A}\in(0,1), with nA=πAnn_{A}=\pi_{A}n under balanced partitioning. For area ii and two estimators θ^A,i1\hat{\theta}^{\mathcal{M}_{1}}_{A,i} and θ^A,i2\hat{\theta}^{\mathcal{M}_{2}}_{A,i} of θi\theta_{i}, suppose their difference admits the linearization

θ^A,i1θ^A,i2=1nAjAΔψi,j+ΔrA,i,ΔrA,i=Op(nA1),\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}=\frac{1}{n_{A}}\sum_{j\in A}\Delta\psi_{i,j}+\Delta r_{A,i},\qquad\Delta r_{A,i}=O_{p}(n_{A}^{-1}), (21)

where Δψi,j\Delta\psi_{i,j} are bounded functions of unit jj, fixed given SS, and the same expansion holds at A=SA=S with remainder ΔrS,i=Op(n1)\Delta r_{S,i}=O_{p}(n^{-1}). Then

E(θ^A,i1θ^A,i2|S)=θ^S,i1θ^S,i2+Op(n1).E\!\left(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\;\big|\;S\right)=\hat{\theta}^{\mathcal{M}_{1}}_{S,i}-\hat{\theta}^{\mathcal{M}_{2}}_{S,i}+O_{p}(n^{-1}). (22)

Proof A.4.

Taking conditional expectation of (21) given SS and writing the sum using indicators,

E(θ^A,i1θ^A,i2S)=1nAjSP(jAS)Δψi,j+E[ΔrA,iS],E\!\left(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\mid S\right)=\frac{1}{n_{A}}\sum_{j\in S}P(j\in A\mid S)\,\Delta\psi_{i,j}+E[\Delta r_{A,i}\mid S], (23)

where the Δψi,j\Delta\psi_{i,j} come out as constants since they are fixed given SS. Under equal inclusion probability πA\pi_{A} and nA=πAnn_{A}=\pi_{A}n,

1nAjSP(jAS)Δψi,j=πAπAnjSΔψi,j=1njSΔψi,j.\frac{1}{n_{A}}\sum_{j\in S}P(j\in A\mid S)\,\Delta\psi_{i,j}=\frac{\pi_{A}}{\pi_{A}n}\sum_{j\in S}\Delta\psi_{i,j}=\frac{1}{n}\sum_{j\in S}\Delta\psi_{i,j}. (24)

Applying (21) at A=SA=S gives n1jSΔψi,j=θ^S,i1θ^S,i2ΔrS,in^{-1}\sum_{j\in S}\Delta\psi_{i,j}=\hat{\theta}^{\mathcal{M}_{1}}_{S,i}-\hat{\theta}^{\mathcal{M}_{2}}_{S,i}-\Delta r_{S,i}. Substituting,

E(θ^A,i1θ^A,i2S)=θ^S,i1θ^S,i2ΔrS,i+E[ΔrA,iS].E\!\left(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\mid S\right)=\hat{\theta}^{\mathcal{M}_{1}}_{S,i}-\hat{\theta}^{\mathcal{M}_{2}}_{S,i}-\Delta r_{S,i}+E[\Delta r_{A,i}\mid S]. (25)

Since nA=πAnn_{A}=\pi_{A}n with πA\pi_{A} bounded away from zero, Op(nA1)=Op(n1)O_{p}(n_{A}^{-1})=O_{p}(n^{-1}). The bounded influence function ensures that conditional expectation preserves the rate, so E[ΔrA,iS]=Op(n1)E[\Delta r_{A,i}\mid S]=O_{p}(n^{-1}). And ΔrS,i=Op(n1)\Delta r_{S,i}=O_{p}(n^{-1}), the remainder is Op(n1)O_{p}(n^{-1}), yielding (22).

Remark A.5.

The same argument applied to a single estimator θ^B,iw\hat{\theta}^{w}_{B,i} with influence function ψi,jw\psi^{w}_{i,j} supported on UiU_{i} (rather than a difference with influence function supported on all of SS) yields E(θ^B,iwθiS)=θ^S,iwθi+Op(ni1)E(\hat{\theta}^{w}_{B,i}-\theta_{i}\mid S)=\hat{\theta}^{w}_{S,i}-\theta_{i}+O_{p}(n_{i}^{-1}), where ni=|SUi|n_{i}=|S\cap U_{i}|. We invoke this special case for the Hájek direct estimator on BB in the proof of Theorem 3.4 below.

Proof A.6 (Proof of Theorem 3.4).

Applying the Cauchy–Schwarz inequality to eie_{i} in Theorem 3.3,

|ei|2var[E(θ^B,iwθiS)]var[E(θ^A,i1θ^A,i2S)].|e_{i}|\leq 2\sqrt{\mbox{var}[E(\hat{\theta}_{B,i}^{w}-\theta_{i}\mid S)]\;\mbox{var}[E(\hat{\theta}_{A,i}^{\mathcal{M}_{1}}-\hat{\theta}_{A,i}^{\mathcal{M}_{2}}\mid S)]}. (26)

Applying Lemma A.3 in its single-estimator form to the Hájek direct estimator on BB under Condition A.3,

E(θ^B,iwθiS)=θ^S,iwθi+Op(ni1).E(\hat{\theta}_{B,i}^{w}-\theta_{i}\mid S)=\hat{\theta}_{S,i}^{w}-\theta_{i}+O_{p}(n_{i}^{-1}). (27)

Applying Lemma A.3 directly to the model difference on AA under Condition A.3,

E(θ^A,i1θ^A,i2S)=θ^S,i1θ^S,i2+Op(n1).E\!\left(\hat{\theta}_{A,i}^{\mathcal{M}_{1}}-\hat{\theta}_{A,i}^{\mathcal{M}_{2}}\mid S\right)=\hat{\theta}_{S,i}^{\mathcal{M}_{1}}-\hat{\theta}_{S,i}^{\mathcal{M}_{2}}+O_{p}(n^{-1}). (28)

Since n1ni1n^{-1}\leq n_{i}^{-1}, both remainders are Op(ni1)O_{p}(n_{i}^{-1}). Using that θi\theta_{i} is non-random and each remainder has variance of smaller order than the leading variance term,

var[E(θ^B,iwθiS)]\displaystyle\mbox{var}[E(\hat{\theta}_{B,i}^{w}-\theta_{i}\mid S)] =var(θ^S,iw)+o(ni1),\displaystyle=\mbox{var}(\hat{\theta}_{S,i}^{w})+o(n_{i}^{-1}), (29)
var[E(θ^A,i1θ^A,i2S)]\displaystyle\mbox{var}[E(\hat{\theta}_{A,i}^{\mathcal{M}_{1}}-\hat{\theta}_{A,i}^{\mathcal{M}_{2}}\mid S)] =var(θ^S,i1θ^S,i2)+o(ni1).\displaystyle=\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}}-\hat{\theta}_{S,i}^{\mathcal{M}_{2}})+o(n_{i}^{-1}). (30)

Substituting into (26) yields the leading-order bound

|ei|2var(θ^S,iw)var(θ^S,i1θ^S,i2)+o(ni1).|e_{i}|\leq 2\sqrt{\mbox{var}(\hat{\theta}_{S,i}^{w})\,\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}}-\hat{\theta}_{S,i}^{\mathcal{M}_{2}})}+o(n_{i}^{-1}). (31)

The variance of the difference var(θ^S,i1θ^S,i2)\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}}-\hat{\theta}_{S,i}^{\mathcal{M}_{2}}) is not directly available from independent Bayesian fits of 1\mathcal{M}_{1} and 2\mathcal{M}_{2}. To obtain an implementable plug-in, we apply the inequality var(XY)2var(X)+2var(Y)\mbox{var}(X-Y)\leq 2\mbox{var}(X)+2\mbox{var}(Y), yielding

|ei|22var(θ^S,iw)[var(θ^S,i1)+var(θ^S,i2)]+o(ni1).|e_{i}|\leq 2\sqrt{2\,\mbox{var}(\hat{\theta}_{S,i}^{w})\,[\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}})+\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{2}})]}+o(n_{i}^{-1}). (32)

This bound is conservative whenever the cross-model covariance cov(θ^S,i1,θ^S,i2)\mbox{cov}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}},\hat{\theta}_{S,i}^{\mathcal{M}_{2}}) is non-negative, which is expected when 1\mathcal{M}_{1} and 2\mathcal{M}_{2} are smoothing estimators fitted to the same sample and targeting the same estimand.

The sampling variances on the right-hand side of (32) are not available from a single sample. For the direct estimator, we use the standard design-based variance estimator var^(θ^S,iw)\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{w}). For each model-based estimator, we use the posterior variance varpost(θi)\mbox{var}_{\mathrm{post}}(\theta_{i}^{\mathcal{M}}) as a plug-in approximation. The resulting plug-in threshold is

ti=22var^(θ^S,iw)[varpost(θi1)+varpost(θi2)].t_{i}=2\sqrt{2\,\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{w})\,[\mbox{var}_{\mathrm{post}}(\theta_{i}^{\mathcal{M}_{1}})+\mbox{var}_{\mathrm{post}}(\theta_{i}^{\mathcal{M}_{2}})]}. (33)

The posterior variance exceeds the sampling variance of the posterior mean by accounting for residual model uncertainty in addition to sampling variability, so tit_{i} is a conservative estimator of the right-hand side of (32). The resulting bound is confirmed empirically in the simulation study, as shown in Figure 9.

A.4 Proof of Theorem 3.6

Proof A.7.

The remainder term is ei=2cov(bi,di1di2)e_{i}=-2\mbox{cov}(b_{i},d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}}). By Cauchy–Schwarz inequality, |ei|2var(bi)var(di1di2)|e_{i}|\leq 2\sqrt{\mbox{var}(b_{i})\mbox{var}(d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}})}. Under the design regularity assumptions and law of total variance, var(bi)var(θ^B,iw)=O(ni1)0\mbox{var}(b_{i})\leq\mbox{var}(\hat{\theta}_{B,i}^{w})=O(n_{i}^{-1})\to 0. For the second term, var(di1di2)2var(di1)+2var(di2)\mbox{var}(d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}})\leq 2\mbox{var}(d_{i}^{\mathcal{M}_{1}})+2\mbox{var}(d_{i}^{\mathcal{M}_{2}}), and var(di)var(θ^A,i)\mbox{var}(d_{i}^{\mathcal{M}})\leq\mbox{var}(\hat{\theta}_{A,i}^{\mathcal{M}}) by the law of total variance. Thus given the assumption that var(θ^A,i)\mbox{var}(\hat{\theta}_{A,i}^{\mathcal{M}}) is bounded, var(di1di2)\mbox{var}(d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}}) is also bounded. Hence ei0e_{i}\to 0.

Appendix B Additional results for Section  4.2 of the main paper

To support the analysis presented in the main text, this section provides additional visual evaluation of model performance, error decomposition, and cross-validation reliability. Figure 7 initializes this assessment by illustrating the spatial distribution of estimation bias across Admin 1 regions, comparing direct estimates and three model-based approaches (1,2,3\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3}) against the known population truth. Figure 8 compares the naive and adjusted CV scores against the two oracle MSE estimands. Figure 9 validates the approximated error bounds against the true remainder terms. Finally, Figure 10 disaggregates the CV scores and show a distribution of the area-specific scores across the ten provinces, which clearly illustrates which provinces drive the differences in the scores.

Refer to caption
Figure 7: Admin 1 province maps of direct estimates and posterior mean estimates minus population truth for the direct estimators and three model-based estimators (1\mathcal{M}_{1}, 2\mathcal{M}_{2}, 3\mathcal{M}_{3}), for a single simulation under the moderate sample size setting.
Refer to caption
Figure 8: Scatter plots of CV scores versus oracle errors under moderate sample size setting for the three models 1\mathcal{M}_{1}, 2\mathcal{M}_{2}, and 3\mathcal{M}_{3}. Columns correspond to the naive and adjusted scores. The top row uses the full data oracle error, whereas the bottom row uses the Set A oracle error. Each point represents one simulation replicate, and the dashed x=yx=y line marks exact agreement between the CV score and the oracle error.
Refer to caption
Figure 9: Box plot of approximated error bounds over 50 replicates compared to the absolute value of true remainder term |e||e| (red dot).
Refer to caption
Figure 10: Area-level adjusted CV score distributions across ten Admin-1 provinces under moderate sample size setting, for models 1\mathcal{M}_{1}, 2\mathcal{M}_{2}, and 3\mathcal{M}_{3}. Each histogram summarises the distribution of area-specific adjusted CV scores over 50 simulation replicates. The score distributions of 1\mathcal{M}_{1} and 2\mathcal{M}_{2} are nearly identical within each province, while 3\mathcal{M}_{3} is consistently shifted toward larger values, confirming that the aggregate ranking reflects a uniform pattern across areas.

Appendix C Model comparison using naive CV scores

As an alternative to the proposed adjusted score comparison in the main paper, one may also evaluate the differences in naive CV scores of different models, and absorb the viv_{i} and 2ci2c_{i}^{\mathcal{M}} terms in the remainder bias term. We examine this approach in this section.

Similar to Theorem 3.4, we can derive the bias for the naive score difference, as detailed in Theorem C.1 below, and a bound on the bias is provided in Theorom C.3.

Theorem C.1 (Naive CV score difference).

For the ii-th area, the expected difference of naive CV scores satisfies

𝔼[score^naive,i1score^naive,i2]=𝔼[(θ^A,i1θi)2(θ^A,i2θi)2]+fi,\mathbb{E}\!\left[\widehat{\mathrm{score}}^{\mathcal{M}_{1}}_{\mathrm{naive},i}-\widehat{\mathrm{score}}^{\mathcal{M}_{2}}_{\mathrm{naive},i}\right]=\mathbb{E}\!\left[(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\theta_{i})^{2}-(\hat{\theta}^{\mathcal{M}_{2}}_{A,i}-\theta_{i})^{2}\right]+f_{i}, (34)

where the expectation is over the sampling distribution of SS and the remainder is

fi=2𝔼[(θ^B,iwθi)(θ^A,i1θ^A,i2)].f_{i}=-2\,\mathbb{E}\!\left[(\hat{\theta}^{w}_{B,i}-\theta_{i})\bigl(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\bigr)\right]. (35)

Proof C.2.

From Theorem 3.1, taking the difference between 1\mathcal{M}_{1} and 2\mathcal{M}_{2}, the model-free terms 𝔼[vi]\mathbb{E}[v_{i}] and 𝔼[bi2]\mathbb{E}[b_{i}^{2}] cancel exactly, leaving

𝔼[score^naive,i1score^naive,i2]=𝔼[(θ^A,i1θi)2(θ^A,i2θi)2]2𝔼[ci1ci2]2𝔼[bi(di1di2)].\mathbb{E}\!\left[\widehat{\mathrm{score}}^{\mathcal{M}_{1}}_{\mathrm{naive},i}-\widehat{\mathrm{score}}^{\mathcal{M}_{2}}_{\mathrm{naive},i}\right]=\mathbb{E}\!\left[(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\theta_{i})^{2}-(\hat{\theta}^{\mathcal{M}_{2}}_{A,i}-\theta_{i})^{2}\right]-2\mathbb{E}\!\left[c^{\mathcal{M}_{1}}_{i}-c^{\mathcal{M}_{2}}_{i}\right]-2\mathbb{E}\!\left[b_{i}(d^{\mathcal{M}_{1}}_{i}-d^{\mathcal{M}_{2}}_{i})\right].

Collecting the last two terms, and using the identity ci+bidi=𝔼[(θ^A,iθi)(θ^B,iwθi)S]c^{\mathcal{M}}_{i}+b_{i}d^{\mathcal{M}}_{i}=\mathbb{E}[(\hat{\theta}^{\mathcal{M}}_{A,i}-\theta_{i})(\hat{\theta}^{w}_{B,i}-\theta_{i})\mid S] from Theorem 3.1, the remainder becomes

fi=2𝔼[𝔼[(θ^B,iwθi)(θ^A,i1θ^A,i2)|S]],f_{i}=-2\,\mathbb{E}\!\left[\mathbb{E}\!\left[(\hat{\theta}^{w}_{B,i}-\theta_{i})\bigl(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\bigr)\,\Big|\,S\right]\right],

which equals 2𝔼[(θ^B,iwθi)(θ^A,i1θ^A,i2)]-2\,\mathbb{E}[(\hat{\theta}^{w}_{B,i}-\theta_{i})(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i})] by the law of iterated expectations.

Theorem C.3 (Error bound for naive CV score difference).

The remainder fif_{i} in Equation (35) satisfies

|fi|2𝔼[(θ^B,iwθi)2]𝔼[(θ^A,i1θ^A,i2)2].|f_{i}|\leq 2\sqrt{\mathbb{E}\!\left[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}\right]\cdot\mathbb{E}\!\left[\bigl(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\bigr)^{2}\right]}. (36)

For KK-fold CV, a plug-in estimator of this bound is

t^naive,i=2[1Kk=1Kvar^(θ^Bk,iw)][1Kk=1K(θ^Ak,i1θ^Ak,i2)2].\hat{t}_{\text{naive},i}=2\sqrt{\left[\frac{1}{K}\sum_{k=1}^{K}\widehat{\mathrm{var}}\!\left(\hat{\theta}^{w}_{B_{k},i}\right)\right]\left[\frac{1}{K}\sum_{k=1}^{K}\!\left(\hat{\theta}^{\mathcal{M}_{1}}_{A_{k},i}-\hat{\theta}^{\mathcal{M}_{2}}_{A_{k},i}\right)^{\!2}\right]}. (37)

Proof C.4.

Apply the Cauchy-Schwarz inequality to the conditional expectation given SS:

|𝔼[(θ^B,iwθi)(θ^A,i1θ^A,i2)|S]|𝔼[(θ^B,iwθi)2|S]𝔼[(θ^A,i1θ^A,i2)2|S].\left|\mathbb{E}\!\left[(\hat{\theta}^{w}_{B,i}-\theta_{i})\bigl(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\bigr)\,\Big|\,S\right]\right|\leq\sqrt{\mathbb{E}\!\left[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}\,\Big|\,S\right]\cdot\mathbb{E}\!\left[\bigl(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\bigr)^{2}\,\Big|\,S\right]}.

Take the outer expectation over SS and apply the Jensen’s inequality to that expectation:

𝔼[𝔼[(θ^B,iwθi)2S]𝔼[(θ^A,i1θ^A,i2)2S]]𝔼[𝔼[(θ^B,iwθi)2S]]𝔼[𝔼[(θ^A,i1θ^A,i2)2S]]\mathbb{E}\!\left[\sqrt{\mathbb{E}[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}\mid S]\cdot\mathbb{E}[(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i})^{2}\mid S]}\right]\leq\sqrt{\mathbb{E}\!\left[\mathbb{E}[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}\mid S]\right]\cdot\mathbb{E}\!\left[\mathbb{E}[(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i})^{2}\mid S]\right]}

For the plug-in estimator, the first factor in Equation (36) is 𝔼[(θ^B,iwθi)2]=𝔼[vi]+𝔼[bi2]\mathbb{E}[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}]=\mathbb{E}[v_{i}]+\mathbb{E}[b_{i}^{2}]. Since the Hájek estimator is approximately design-unbiased, 𝔼[bi2]\mathbb{E}[b_{i}^{2}] is negligible relative to 𝔼[vi]\mathbb{E}[v_{i}], and K1k=1Kvar^(θ^Bk,iw)K^{-1}\sum_{k=1}^{K}\widehat{\mathrm{var}}(\hat{\theta}^{w}_{B_{k},i}) is an estimator of 𝔼[vi]\mathbb{E}[v_{i}]. The second factor 𝔼[(θ^A,i1θ^A,i2)2]\mathbb{E}[(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i})^{2}] is the expected squared model difference. The fold average K1k(θ^Ak,i1θ^Ak,i2)2K^{-1}\sum_{k}(\hat{\theta}^{\mathcal{M}_{1}}_{A_{k},i}-\hat{\theta}^{\mathcal{M}_{2}}_{A_{k},i})^{2} estimates this second moment directly. Substituting both gives t^naive,i\hat{t}_{\text{naive},i} in Equation (37). The aggregation steps are the same as it for adjusted scores. Then in the full procedure, we compare the difference in naive scores to the error bound. If it exceeds the bound, we report the comparison based on the naive score.

Refer to caption
Figure 11: Results for province-level comparison under CV-SSU. Left: Naive CV score differences versus oracle full-sample MSE differences. The xx-axis shows the oracle full-sample MSE difference for 31\mathcal{M}_{3}-\mathcal{M}_{1} (top) and 21\mathcal{M}_{2}-\mathcal{M}_{1} (bottom). Each point represents one synthetic survey replicate. Points in the first and third quadrants indicate that the CV score difference has the same sign as the oracle full-sample MSE difference and thus correct ranking. Right: Naive CV score differences versus the error bound for naive CV scores. The red shaded region is the inconclusive region where |score^qascore^qb|<tq|\widehat{\mbox{score}}_{q}^{\mathcal{M}a}-\widehat{\mbox{score}}_{q}^{\mathcal{M}b}|<t_{q}, i.e., scores differences are too small to support decisive order of the two models.

Comparing to the error bound on adjusted score, t^naive,i\hat{t}_{\text{naive},i} is more conservative. Figure 11 illustrates the consequences. Here we used the moderate sample size setting as described in Section 4.2. For the comparison between 3\mathcal{M}3 and 1\mathcal{M}1 , the naive score differences are positive in nearly all replicates under naive scores, correctly signing the oracle difference, but the bound t^naive,i\hat{t}_{\text{naive},i} is relatively large to place some of replicates in the inconclusive region. For the comparison between 2\mathcal{M}_{2} and 1\mathcal{M}_{1}, the model differences are again very small in scale and within the inconclusive region of naive score error bound.

Appendix D Cross-validation by PSU

In this section, we consider an alternative sample splitting strategy for multi-stage stratified sampling design. Instead of randomly assigning SSUs within each PSU into KK folds, we randomly assign PSUs into KK folds. We refer to this strategy as CV-PSU, and the version presented in the main paper as CV-SSU.

In principle, both CV strategies maintain the structure of the survey design after sample splitting. Figure 12 presents results for CV-PSU under the setting of moderate sample size (50 clusters per stratum, and 30 households per cluster). CV-PSU leads to the same conclusion as CV-SSU: 1\mathcal{M}_{1} and 2\mathcal{M}_{2} cannot be ranked conclusively, whereas the score difference between 1\mathcal{M}_{1} and 3\mathcal{M}_{3} is sufficiently large to conclude that 1\mathcal{M}_{1} outperforms 3\mathcal{M}_{3}.

Refer to caption
Figure 12: CV-PSU: Simulation results under 50 clusters per stratum, 30 households per cluster, Left: Adjusted CV score differences versus oracle MSE differences. The xx-axis shows the oracle MSE difference 31\mathcal{M}_{3}-\mathcal{M}_{1} (top row) and 21\mathcal{M}_{2}-\mathcal{M}_{1} (bottom row); the yy-axis shows the corresponding adjusted CV score differences. Each point represents one synthetic survey replicate. Points in the first and third quadrants indicate that the CV score difference has the same sign as the oracle MSE difference, implying correct ranking of the two models. Right: Adjusted CV score differences versus the error bound. The red shaded region, bounded by the lines y=xy=x and y=xy=-x, is the inconclusive zone where |score^qascore^qb|<tq|\widehat{\mbox{score}}_{q}^{\mathcal{M}a}-\widehat{\mbox{score}}_{q}^{\mathcal{M}b}|<t_{q}; scores differences are too small to support decisive order of the two models.

Figure 13 highlights a key difference between CV-PSU and CV-SSU. Under 55-fold cross-validation, removing one fifth of the clusters from a model causes greater information loss than removing one fifth of households within each cluster. Thus for CV-PSU, the gap between the full sample oracel MSE and oracle training MSE increases faster than CV-SSU as sample size decreases. This can also be seen from Figure 14 and 14, where it shows that while the adjusted CV scores align relatively well with the training set oracle MSE, they could deviate from the oracle full-sample MSE when sample size is small. Therefore, in addition to the effect of the unidentifiable remainder term in estimating the training set oracle MSE, the performance of CV scores can also be impacted significantly by the gap between the two oracle MSE estimands. In contrast, the impact of reduced sample size on such gap is smaller for CV-SSU, as shown in Figure 16.

In summary, the oracle training MSE is a substantially poorer proxy for the full-sample oracle under CV-PSU than under CV-SSU, which can lead to more bias in model ranking.

Refer to caption
(a) CV-PSU
Refer to caption
(b) CV-SSU
Figure 13: Comparison of full-data oracle MSE and oracle training MSE under two cross-validation schemes. Left: 50 clusters per stratum, each with 30 households. Right: 40 clusters per stratum, each with 30 households. Top row: CV-PSU. Bottom row: CV-SSU.
Refer to caption
Figure 14: Naive (left) and adjusted (right) CV scores against oracle MSE under CV-PSU, with 50 clusters and 30 households per cluster. The top row uses the full-data oracle MSE as reference; the bottom row uses the oracle training MSE. Each point represents one sample
Refer to caption
Figure 15: Naive (left) and adjusted (right) CV scores against oracle MSE under CV-PSU, with 40 clusters and 30 households per cluster. The top row uses the full-data oracle MSE as reference; the bottom row uses the oracle training MSE. Each point represents one sample
Refer to caption
Figure 16: Naive (left) and adjusted (right) CV scores against oracle MSE under CV-SSU, with 40 clusters and 30 households per cluster. The top row uses the full-data oracle MSE as reference; the bottom row uses the oracle training MSE. Each point represents one sample

Appendix E Two-fold CV-SSU with resplitting

In this section, we discuss an alternative to 55-fold CV presented in the main paper. Under two-fold CV, the sample SS is randomly partitioned into two equal halves, with one serving as the training set AA and the other as the validation set BB. The partition can be repeated multiple times to reduce sensitivity to the choice of split. Because each training set retains only half the data, the dependence between θ^A,i\hat{\theta}^{\mathcal{M}}_{A,i} and θ^B,iw\hat{\theta}^{w}_{B,i} is weaker than under KK-fold CV, where the training set retains (K1)/K(K-1)/K of the sample. Consequently, the covariance term c^i\hat{c}^{\mathcal{M}}_{i} in Theorem 3.3 is smaller in magnitude, and the naive and adjusted scores are closer to each other than under 5-fold CV. The terms v^i\hat{v}_{i}, c^i\hat{c}^{\mathcal{M}}_{i}, and the naive squared errors are averaged over the repeated pairs before computing the adjusted score.

Refer to caption
(a) Moderate sample size: 50 clusters per stratum, 30 households per cluster.
Refer to caption
(b) Small sample size: 40 clusters per stratum, 30 households per cluster.
Figure 17: Scatter plots of 2-fold CV scores versus oracle errors for the three models 1\mathcal{M}_{1}, 2\mathcal{M}_{2}, and 3\mathcal{M}_{3}, under CV-PSU (top row of each panel) and CV-SSU (bottom row of each panel). Columns correspond to the naive and adjusted scores. The left half of each panel uses the full-data oracle error as the reference; the right half uses the Set A oracle error. Each point represents one simulation replicate, and the dashed x=yx=y line marks exact agreement between the CV score and the oracle error.

In this simulation, we repeat the partition five times and compare both CV-PSU and CV-SSU under the moderate-sample setting (50 clusters per stratum, 30 households per cluster) and the small-sample setting (40 clusters per stratum, 30 households per cluster). Under all settings, two-fold CV-SSU with resplitting correctly ranks the three models: 3\mathcal{M}_{3} is least favoured and 1\mathcal{M}_{1} and 2\mathcal{M}_{2} are very close. Additionally, both CV-PSU and CV-SSU estimate the training oracle error accurately.

The differences again lie in how closely the training oracle MSE tracks the full-data oracle in these scenarios. Under the moderate-sample setting, CV-SSU tracks the full-data oracle more closely than CV-PSU, consistent with the findings in Supplementary Section D. Under the small-sample setting (40 clusters), the oracle training MSE for 3\mathcal{M}_{3} departs from the full-data oracle under both strategies, but the divergence is more severe for CV-PSU. Taken together, two-fold CV-SSU is more reliable than two-fold CV-PSU for tracking the full-sample MSE, but less so than 5-fold CV-SSU at small sample sizes, because each training fold retains only half the data and the resulting oracle training MSE estimand is a noisier proxy for the full-sample target.

Appendix F Comparing SAE models for the 115115 districts in Zambia

In this section, we present results comparing models at a finer geographical level. Direct estimates are often unavailable for subnational areas below the stratification level of the survey design, so the validation set is restricted to areas with at least one sampled cluster. Although both splitting schemes apply at finer levels, the CV-PSU scheme can introduce additional missingness in direct estimates within the test set, whereas each fold of CV-SSU preserves the same pattern of missingness as in the full sample. We therefore consider only CV-SSU for comparison at finer geographical level.

F.1 Synthetic data

For Zambia, the administrative level below the 10 provinces consists of 115 districts. We use the synthetic population of women’s literacy rate and draw 50 replicate surveys under a two-stage stratified design, sampling 60 clusters per stratum and 30 households per cluster. Across the 50 replicates, the number of districts with micro data for direct estimates ranges from 89 to 99, with median 94. We compare the following models:

  • 1\mathcal{M}_{1} : Unit-level Beta-Binomial model with BYM2 random effects (Besag et al., 1991; Riebler et al., 2016). A PC(1,0.01)PC(1,0.01) prior is used for the precision parameter τ\tau of the BYM2 model.

  • 2\mathcal{M}_{2} : Area-level Fay-Herriot model, with automatic variance adjustment (Wakefield et al., 2026). The priors on random effects are the same as 1\mathcal{M}_{1}.

  • 3\mathcal{M}_{3} : Same model as in 1\mathcal{M}_{1}, but with a highly informative prior PC(0.005,0.01)PC(0.005,0.01) on τ\tau.

Refer to caption
Figure 18: Maps of for district-level direct estimates, 1\mathcal{M}_{1}, 2\mathcal{M}_{2} and 3\mathcal{M}_{3} under one simulation replicate.
Refer to caption
Figure 19: Results for district-level comparison under CV-SSU. Left: Adjusted CV score differences versus oracle full-sample MSE differences. The xx-axis shows the oracle full-sample MSE difference for 31\mathcal{M}_{3}-\mathcal{M}_{1} (top) and 21\mathcal{M}_{2}-\mathcal{M}_{1} (bottom). Each point represents one synthetic survey replicate. Points in the first and third quadrants indicate that the CV score difference has the same sign as the oracle full-sample MSE difference and thus correct ranking. Right: Adjusted CV score differences versus the error bound. The red shaded region is the inconclusive region where |score^qascore^qb|<tq|\widehat{\mbox{score}}_{q}^{\mathcal{M}a}-\widehat{\mbox{score}}_{q}^{\mathcal{M}b}|<t_{q}, i.e., scores differences are too small to support decisive order of the two models.

Figure 18 shows district-level posterior mean estimates under one replicate. The direct estimates exhibit substantial variability across districts, reflecting the sparse sample at this geographical level. 1\mathcal{M}_{1} and 2\mathcal{M}_{2} produce similar spatial patterns, while 3\mathcal{M}_{3} yields a notably smoother surface due to its informative prior on τ\tau. Figure 15 summarises the CV-SSU comparison across all 50 replicates. When comparing 3\mathcal{M}_{3} with 1\mathcal{M}_{1}, the adjusted CV score differences are positive in nearly all replicates, agreeing in sign with the oracle MSE differences; all points fall in the first quadrant of the left panel, indicating that CV-SSU correctly identifies 1\mathcal{M}_{1} as the better model. The right panel confirms that the absolute score differences exceed the error bound tqt_{q} in all replicates, so the CV-SSU yields a decisive selection.

In terms of the comparison between 2\mathcal{M}_{2} and 1\mathcal{M}_{1}, both oracle MSE differences and adjusted CV score differences are concentrated near zero. The right panel shows that nearly all replicates fall within the inconclusive region, |score^qascore^qb|<tq|\widehat{\mathrm{score}}_{q}^{\mathcal{M}_{a}}-\widehat{\mathrm{score}}_{q}^{\mathcal{M}_{b}}|<t_{q}, indicating that the error bound correctly abstains from making a decisive ranking when the two models provide very similar subnational estimates. Together, these two comparisons demonstrate that the proposed framework maintains the performance at a finer geographical level with sparser data.

F.2 Analysis of the 2024 Zambia DHS

We now demonstrate district-level comparison using real data. Figure 20 shows the point estimates with 90% intervals for direct estimates, 1\mathcal{M}_{1}, 2\mathcal{M}_{2}, and 3\mathcal{M}_{3}. Two districts have no sampled clusters and therefore no direct estimates. Similar to other results,1\mathcal{M}_{1} and 2\mathcal{M}_{2} are close to each other, while 3\mathcal{M}_{3} is over-smoothed. Figure 21 shows the district-level comparison. When comparing 1\mathcal{M}_{1} and 2\mathcal{M}_{2}, the majority of the districts fall in the inconclusive zone, and so is the aggregated score difference. When comparing 2\mathcal{M}_{2} and 3\mathcal{M}_{3}, there are more area-specific score differences that exceed the thresholds, and the overall score difference is also large enough to suggest that 2\mathcal{M}_{2} is the better model. Compared to province-level models, district-level estimation uses less data and therefore has larger uncertainty. This leads to larger and more conservative error bounds, which explains why we see more disagreement in the interval plot but fewer areas with conclusive comparisons.

Refer to caption
Figure 20: Interval plot for district-level direct estimates, Direct estimates, 1\mathcal{M}_{1}, 2\mathcal{M}_{2} and 3\mathcal{M}_{3} under real data.
Refer to caption
Figure 21: 5-fold adjusted CV score comparison by districts and in aggregate. The top row compares the Fay-Herriot model (2\mathcal{M}_{2}) with the over-smoothed unit-level model (3\mathcal{M}_{3}), and the bottom row compares the Fay-Herriot model (2\mathcal{M}_{2}) with default unit-level model (1\mathcal{M}_{1}). In each scatter plot, points denote the 115 districts and the asterisk denotes the aggregated result. The maps show the associated area-level score differences, grey area has |score^sascore^sb|<ti|\widehat{\mathrm{score}}_{s}^{\mathcal{M}_{a}}-\widehat{\mathrm{score}}_{s}^{\mathcal{M}_{b}}|<t_{i}.