On cross-validation for small area estimators

Qianyu Dong Department of Statistics, University of California, Santa Cruz,
Santa Cruz, CA, USA. Zehang Richard Li Department of Statistics, University of California, Santa Cruz,
Santa Cruz, CA, USA.

Abstract

Subnational monitoring of public health often relies on household surveys where data are sparse at the desired spatial resolution. Small area estimation (SAE) methods address this challenge by borrowing strength across areas and incorporating auxiliary information. However, comparing these estimators remains difficult in the absence of ground truth. We propose a cross-validation framework for evaluating small area estimators that accommodates complex survey designs. Our approach enables model-agnostic comparisons between area-level and unit-level models. Central to our framework is a decomposition of the cross-validated squared error in the context of SAE, which reveals both identifiable bias and unidentifiable components that can be bounded. Our theoretical results and simulation studies show that conventional approaches, such as leave-one-area-out cross-validation, can yield misleading model rankings, whereas the proposed approach offers more robust and interpretable model comparison with uncertainty quantification. We demonstrate the procedure for comparing SAE models for mapping the female literacy rate using Demographic and Health Surveys from Zambia.

keywords:

survey sampling; small area estimation; spatial smoothing; model checking; cross-validation.

1 Introduction

Small area estimation (SAE) refers to the process of estimating quantities of interest for specific geographic areas or subpopulations using survey data. When within-area sample sizes are too small for design-based direct estimation, model‐based methods borrow information across areas and incorporate auxiliary information to improve the precision of estimates (Rao & Molina, 2015). In most low- and middle-income countries (LMICs), census or administrative microdata are limited. Routinely collected household surveys serve as the only reliable sources for many key health and demographic indicators. Demographic and Health Surveys (DHS) (The DHS Program, 2026; Corsi et al., 2012) and the Multiple Indicator Cluster Surveys (MICS) (UNICEF, 2026; Khan & Hancioglu, 2019) are two prominent examples that have been implemented in many countries over several decades. Small area estimation methods have been developed and applied in this setting for outcomes including poverty, child mortality, fertility, vaccination, and HIV (see e.g., Chi et al., 2022; Mercer et al., 2015; Li et al., 2019; Wakefield et al., 2019; Wu et al., 2021; Saha et al., 2023; Dong & Wakefield, 2021; Wakefield et al., 2020).

SAE models are generally categorized into either area-level or unit-level models. The most widely used area-level model is the Fay-Herriot model (Fay & Herriot, 1979), where the direct estimates and their associated estimates of the sampling variances are used as response data in a smoothing model. In contrast, unit-level approaches model individual survey responses, often via generalized linear mixed models. No single modeling strategy performs uniformly well across different survey designs and spatial dependence structures. Model evaluation and selection, however, remains challenging for SAE models. Existing approaches are either restricted to comparing models within shared likelihood families or require external benchmark data or assumptions of the true data generating mechanism. There is a critical need for a unified, model-agnostic framework for comparing candidate estimators on a common and scientifically meaningful scale.

In this paper, we develop a cross-validation framework for comparing small area estimators fitted to complex survey data. Our approach evaluates estimators through held-out, design-based direct estimates constructed from sample splits that preserve the survey design. This strategy enables model-agnostic comparisons between area-level and unit-level models on a common scale. We derive a decomposition of conditional mean squared errors for small area estimators, leading to a consistent cross-validation score. In finite samples, we show that the unidentifiable bias in score difference between two models can be bounded, which provides a principled threshold for deciding whether an observed score difference is large enough to support model ranking. We also demonstrate that popular alternative model comparison tools, such as leave-one-area-out cross-validation, produce systematic bias in model ranking. All theoretical results are supported via design-based simulations based on the 2024 Zambia DHS.

The rest of the paper is organized as follows. Section 2 reviews current approaches to evaluating small area estimators in the literature and discusses their limitations. Section 3 develops the proposed scoring and cross-validation framework. Section 4 reports simulation results and applies the methodology to compare estimators of female literacy rate in Zambia. Section 5 concludes with a discussion of future directions. All proofs are deferred to the Supplementary Materials.

2 Current approaches

Existing model evaluation approaches fall broadly into three categories: benchmark-based, likelihood-based, and empirical simulation. None of the approaches provides a generally applicable, model-agnostic framework for comparing estimators. The first class of methods involves evaluating candidate estimators in terms of their agreement with some benchmark values, and is often model-agnostic. External validation against census data is widely regarded as the gold standard for model assessment (Merfeld et al., 2024, 2022), and when census is not available, validation against estimates from independent surveys has been proposed (Brown et al., 2001; Dorfman, 2018). However, external census or surveys are rarely available at the required spatial resolution. Alternatively, one may first aggregate SAE estimates to coarser levels, e.g., a single national estimate, and then compare them with external benchmark values. While being useful for falsifying estimators, aggregated evaluation changes the target estimand completely. Estimators that perform well at the aggregated level can still be significantly biased at the finer levels, especially with over-smoothed models. When no external data are available, direct estimates from the same sample are also sometimes used to assess small area estimators (Lahiri & Pramanik, 2019). However, such evaluation suffers from double dipping and systematically favors models that track the noise in direct estimates.

Traditional model assessment and validation tools, including goodness-of-fit tests (Rao & Scott, 1981, 1984; Thomas & Rao, 1987; Lumley & Scott, 2014, 2017), information criteria (Fabrizi & Lahiri, 2013; Lumley & Scott, 2015), and generalization error measures (Holbrook et al., 2020), have been developed for regression models that incorporate survey weights, with adjustments to account for complex sampling designs. Similar approaches for comparing Bayesian SAE models using measures such as DIC and WAIC have also been adopted in the literature (see, e.g., Trevisani et al., 2017; Gomez-Rubio et al., 2008; Franco & Maitra, 2023). While likelihood-based methods are powerful and widely used, they are generally limited to comparing models that share a comparable likelihood formulation. Consequently, they are not well suited for comparing unit-level with area-level models, or models that target the same estimand through fundamentally different constructions.

Lastly, empirical simulation approaches are also commonly used in the SAE literature (Bradley et al., 2015; Janicki et al., 2022). These approaches perturb observed direct estimates by adding noise, and then evaluate models fitted to the synthetic direct estimates against the original estimates. When some individual-level micro data from the population exist, design-based simulation may also be carried out by generating repeated surveys based on a synthetic population dataset (Alfons et al., 2010; Wieczorek & Franco, 2013). Both approaches evaluate model performance for a hypothetical population rather than the specific observed dataset at hand, thus implicitly changing the target of inference.

The cross-validation framework developed in Section 3 retains the model-agnostic advantage of benchmark-based approaches while resolving the lack of benchmark datasets. It is closely related to the recent work of Kawano et al. (2026), where they developed model evaluation approaches by splitting direct estimates into independent copies for model fitting and evaluation, using data thinning (Neufeld et al., 2024; Dharamshi et al., 2025). Kawano et al. (2026) is restricted to comparing area-level models with a Gaussian likelihood that enables data thinning. By splitting at the level of individual survey responses rather than area-level summaries, our framework enables comparison across models with fundamentally different likelihood structures, including between area-level and unit-level models, and extends to any estimators, including those constructed with machine learning or other black-box algorithms.

3 Evaluating small area estimators

3.1 Setup and notation

Let $U_{i}\subset\{1,2,...,N\}$ denote the set of indices for the units that belong to the $i$ -th subpopulation or area. We focus on binary outcomes in this paper, though the method can be generalized to other types of variables too. Let $y_{j}\in\{0,1\}$ denote the response value from the $j$ -th unit and $s[j]\in\{1,...,M\}$ denote which subpopulation the $j$ -th unit belongs to. From a probabilistic survey, we obtain a sample $S\subset U_{1}\cup U_{2}\cup\cdots\cup U_{M}$ of the units, with their design weights $w_{j}=1/\pi_{j}$ , where $\pi_{j}$ is the probability that the $j$ -th unit is sampled. To simplify the notation when comparing different sample splitting strategies, we denote a generic partition of a sample $S$ into a training set $A$ and a validation set $B=S\setminus A$ , and use $\hat{\theta}_{A}$ and $\hat{\theta}_{B}$ to denote estimators based on the two subsamples. Specifically for $K$ -fold cross-validation, we use $\hat{\theta}^{(k)}$ and $\hat{\theta}^{(-k)}$ to denote the estimators based on the $k$ -th held-out dataset and the remaining samples respectively. As a working example, we consider the stratified multi-stage design adopted by most DHS surveys, where the primary sampling units (PSUs) are enumeration areas (EAs), or clusters, and within each PSU, the secondary sampling units (SSUs) are households.

Let $\theta_{i}$ denote the true prevalence in area $i$ and $\bm{\theta}=(\theta_{1},...,\theta_{M})$ . When data are sufficient for direct estimation, the Hájek estimator (Hájek, 1971) for subpopulation prevalence is given by

\hat{\theta}_{i}^{w}=\frac{\sum_{j\in S\cap U_{i}}w_{j}y_{j}}{\sum_{j\in S\cap U_{i}}w_{j}},\;\;\;\;i=1,...,M.

(1)

Area-level models then proceed by smoothing direct estimates $\hat{\bm{\theta}}^{w}$ and the associated design-based variance, whereas unit-level approaches specify a parametric distribution for $y_{j}$ directly. Our objective is to evaluate the resulting estimates $\hat{\bm{\theta}}$ regardless of the underlying model.

3.2 Scoring SAE estimates

For most SAE applications, a natural metric to score the estimator is the weighted squared error loss, $L(\hat{\bm{\theta}},\bm{\theta})=\sum_{i=1}^{M}q_{i}(\hat{\theta}_{i}-\theta_{i})^{2},$ for some weights $q_{i}$ . The choice of the aggregation weights is contextual. They may represent population proportions by area, or simply be set to $1/M$ to weigh all areas equally. Evaluating $\hat{\bm{\theta}}$ based on the loss function instead of model-induced likelihood enables model comparison across different model classes. Similar principles have been adopted by Kuh et al. (2024) in the context of multilevel regression and poststratification.

This squared error loss is closely related to the mean squared error (MSE), a topic extensively studied in the SAE literature. Specifically, our ultimate target of inference for model $\mathcal{M}$ is the weighted average of area-specific conditional MSEs,

\mbox{MSE}^{\mathcal{M}}=\sum_{i=1}^{M}q_{i}\mathbb{E}[(\hat{\theta}_{i}^{\mathcal{M}}-\theta_{i})^{2}],

(2)

where $\bm{\theta}$ is considered fixed and unknown, and the expectation is taken over the distribution of sampling units. The term $\mathbb{E}[(\hat{\theta}_{i}^{\mathcal{M}}-\theta_{i})^{2}]$ is usually referred to as the conditional MSE in the literature (Rivest & Belmonte, 2000; Lohr & Rao, 2009; Rao & Molina, 2015). The majority of the theoretical results on estimating MSE, however, focus on the unconditional case where the expectation is further taken over the distribution of $\bm{\theta}$ , assuming area-level Gaussian random effects (Prasad & Rao, 1990; Datta & Lahiri, 2000). To the best of our knowledge, the score proposed in this paper is the first cross-validation-based estimator of the conditional MSE without imposing distributional assumptions on the random effects.

3.3 Stratified $K$ -fold cross-validation

Following the multi-stage survey design, we randomly assign SSUs, i.e., households, within each cluster to one of the $K$ folds. This creates $K$ folds of data each containing all clusters in the survey but only $1/K$ of the households within each cluster. The survey design is preserved within each fold, and the selection probability for each unit only needs to be adjusted by $1/K$ . Given a partition, the naive estimator of the squared error loss is

\widehat{\mathrm{score}}_{\mathrm{naive}}=\sum_{i=1}^{M}\frac{q_{i}}{K}\sum_{k=1}^{K}\left(\hat{\theta}_{i}^{(-k)}-\hat{\theta}_{i}^{(k),w}\right)^{2},

(3)

where $\hat{\theta}_{i}^{(-k)}$ denotes the model-based prediction for area $i$ obtained using all data except the $k$ -th fold, and $\hat{\theta}_{i}^{(k),w}$ denotes the direct estimate for area $i$ computed from the $k$ -th held-out fold. The partitioning step can also be repeated and averaged to stabilize the estimated score.

Stratified cross-validation under complex survey designs has been investigated in the survey literature, usually for the purpose of variable selection in regression models (Iparragirre et al., 2023; Wieczorek et al., 2022). We note that our goal here is different. Under $K$ -fold splitting, the estimators are compared with the direct estimates of the held-out samples, instead of predicting for individual survey responses. This allows the resulting score to be comparable for both area- and unit-level models.

Importantly, validation approaches based on sample splitting target the conditional MSE at a reduced sample size, i.e., $\text{MSE}_{A}^{\mathcal{M}}=\sum_{i=1}^{M}q_{i}\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})^{2}]$ , where $A$ is a sample of size $(K-1)/K$ of $S$ approximately, rather than $\text{MSE}^{\mathcal{M}}$ in Equation 2. To differentiate the two targets, we refer to $\text{MSE}^{\mathcal{M}}$ as the oracle full-sample MSE and $\text{MSE}_{A}^{\mathcal{M}}$ the oracle training MSE. A similar gap in the target estimand was also observed in Kawano et al. (2026) and it generally cannot be estimated. For the purpose of model comparison, our goal is to preserve model ranking under the original design. It is therefore desirable to keep the training sets close to the full data distribution. For multi-stage designs, one may choose to stratify at different stages to form the partitions. In the Supplementary Materials, we include one alternative partitioning strategy where clusters are assigned to the $K$ folds instead of households, and show that such approach performs similarly to cross-validation of SSUs when sample size is large, but is more biased when sample size is small, due to a larger gap between the training oracle and full-sample oracle estimand.

Given a sample splitting strategy, the naive CV score is still biased for $\text{MSE}_{A}$ , because the held-out direct estimate carries sampling variation. Theorem 3.1 identifies the bias structure for sample splitting estimators in general.

Theorem 3.1 (Unbiased estimation of oracle training MSE).

For the $i$ -th area, given a partition of $S$ into two sets $A$ and $B$ and model $\mathcal{M}$ , let

\hat{m}_{S}=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\hat{\theta}^{w}_{B,i})^{2}\mid S]-v_{i}-b_{i}^{2}+2c_{i}^{\mathcal{M}}+2b_{i}d_{i}^{\mathcal{M}}

(4)

where the additional terms are

$\displaystyle v_{i}$	$\displaystyle=\mbox{var}(\hat{\theta}_{B,i}^{w}\mid S),$	(5)
$\displaystyle b_{i}$	$\displaystyle=\mathbb{E}(\hat{\theta}_{B,i}^{w}-\theta_{i}\mid S),$	(6)
$\displaystyle d_{i}^{\mathcal{M}}$	$\displaystyle=\mathbb{E}(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i}\mid S),$	(7)
$\displaystyle c_{i}^{\mathcal{M}}$	$\displaystyle=\mbox{cov}(\hat{\theta}_{A,i}^{\mathcal{M}},\hat{\theta}_{B,i}^{w}\mid S),$	(8)

and all expectations are taken with respect to the resampling distribution of index set $A$ given a fixed survey sample $S$ . The estimator $\hat{m}_{S}$ is unbiased for the oracle training MSE, i.e., $\mathbb{E}(\hat{m}_{S})=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})^{2}]$ , where the expectation is taken over the distribution of $S$ .

Remark 3.2.

When the index sets $A$ and $B$ are from two independent surveys, instead of a random partition of the same survey, the above estimator simplifies to $\hat{m}_{S}=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\hat{\theta}^{w}_{B,i})^{2}]-v_{i}$ , as $\mathbb{E}(\hat{\theta}^{w}_{B,i}-\theta_{i})=0$ .

While $\hat{m}_{S}$ provides an unbiased estimator of the training MSE, the conditional biases, $b_{i}$ and $d_{i}^{\mathcal{M}}$ , are not identifiable without knowing the true $\bm{\theta}$ . It is important to note that while the direct estimator is design unbiased, splitting a survey into two sets results in the realized error being carried over to all resulting sub-samples. Thus $b_{i}$ is nonzero given a finite sample $S$ . This leads to an unidentifiable bias when evaluating the difference of MSEs across models. Theorem 3.3 provides an adjusted score and characterizes the remainder bias term for cross-validation estimators.

Theorem 3.3 (Adjusted CV score).

For the $i$ -th area, define the fold-averaged means $\bar{\theta}^{w}_{i}=K^{-1}\sum_{k=1}^{K}\hat{\theta}^{(k),w}$ and $\bar{\theta}^{\mathcal{M}}_{i}=K^{-1}\sum_{k=1}^{K}\hat{\theta}^{(-k),\mathcal{M}}$ , the empirical variance of the held-out direct estimates

\hat{v}_{i}=\frac{1}{K}\sum_{k=1}^{K}(\hat{\theta}^{(k),w}_{i}-\bar{\theta}^{w}_{i})^{2},

(9)

and the empirical covariance between the held-out direct estimates and the training-set model estimates

\hat{c}_{i}^{\mathcal{M}}=\frac{1}{K}\sum_{k=1}^{K}(\hat{\theta}^{(k),w}_{i}-\bar{\theta}^{w}_{i})(\hat{\theta}^{(-k),\mathcal{M}}_{i}-\bar{\theta}^{\mathcal{M}}_{i}).

(10)

The adjusted CV score for model $\mathcal{M}$ is

\widehat{\mathrm{score}}_{i}^{\mathcal{M}}=\frac{1}{K}\sum_{k=1}^{K}\left(\hat{\theta}_{i}^{(-k)}-\hat{\theta}_{i}^{(k),w}\right)^{2}-\hat{v}_{i}+2\hat{c}_{i}^{\mathcal{M}}.

(11)

For two models $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ , the expected score difference satisfies

\mathbb{E}(\widehat{\mathrm{score}}_{i}^{\mathcal{M}_{1}}-\widehat{\mathrm{score}}_{i}^{\mathcal{M}_{2}})=\mathbb{E}\big[(\hat{\theta}_{A,i}^{\mathcal{M}_{1}}-\theta_{i})^{2}-(\hat{\theta}_{A,i}^{\mathcal{M}_{2}}-\theta_{i})^{2}\big]+e_{i},

(12)

where the expectation is over the sampling distribution of $S$ . The remainder term is

e_{i}=-2\,\mbox{cov}(b_{i},\;d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}}).

(13)

where $b_{i}=\mathbb{E}(\hat{\theta}_{i}^{(k),w}-\theta_{i}\mid S)$ and $d_{i}^{\mathcal{M}}=\mathbb{E}(\hat{\theta}_{i}^{(-k),\mathcal{M}}-\theta_{i}\mid S)$ , and the covariance is over the distribution of samples $S$ .

Among the two correction terms, $v_{i}$ is a constant across models and $c_{i}^{\mathcal{M}}$ is analogous to the covariance penalty in Efron (2004), reflecting the shared survey structures between the training and validation samples. Theorem 3.3 also shows that the residual bias in the score difference is driven by the covariance between the conditional bias of the direct estimate in the validation set, $b_{i}$ , and the model difference in the training set, $d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}}$ . The bias is large when, across hypothetical replicate surveys, the sampling error in the direct estimator for area $i$ is systematically associated with the difference in model biases. This occurs, for instance, when one model tracks the direct estimates more closely than the other. When there is only a single sample $S$ , the bias $e_{i}$ cannot be identified. Therefore, when two models have similar conditional MSE, the oracle MSE difference in Equation (12) is small and the remainder $e_{i}$ may dominate the score difference, making the comparison uninformative. To account for the unidentifiable bias, we establish a finite sample bound for $|e_{i}$ in Theorem 3.4, and show that $e_{i}\to 0$ as data accumulates in Theorem 3.6.

Theorem 3.4 (Approximate error bound for adjusted CV score difference).

Under the approximations $\mathbb{E}(\hat{\theta}_{B,i}^{w}|S)\approx\hat{\theta}_{S,i}^{w}$ and $\mathbb{E}(\hat{\theta}_{A,i}^{\mathcal{M}}|S)\approx\hat{\theta}_{S,i}^{\mathcal{M}}$ , $e_{i}$ can be bounded by

|e_{i}|\leq 2\sqrt{2\mbox{var}(\hat{\theta}_{S,i}^{w})\Big(\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}})+\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{2}})\Big)}

(14)

where the subscript $S$ denotes models fitted on the full sample $S$ .

Theorem 3.4 gives rise to a plug-in estimator of the upper bound of the absolute bias. Under the stratified cross-validation scheme,

t_{i}=2\sqrt{2\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{w})\Big(\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}})+\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{\mathcal{M}_{2}})\Big)}

(15)

The first term $\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{w})$ is the standard design-based variance estimator for the direct estimate based on the full sample. When considering Bayesian models, the posterior variance, $\mbox{var}_{\text{post}}(\theta_{S,i}^{\mathcal{M}})$ provides a conservative plug-in estimator for $\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{\mathcal{M}})$ , since it exceeds the sampling variance of the posterior mean by accounting for residual model uncertainty in addition to sampling variability.

For the estimation of the aggregated squared error, the aggregated adjusted CV score is $\widehat{\mbox{score}}_{q}^{\mathcal{M}}=\sum_{i=1}^{M}q_{i}\widehat{\mbox{score}}_{i}^{\mathcal{M}}$ , with the aggregated bias $e_{q}=\sum_{i=1}^{M}q_{i}e_{i}$ , satisfying

|e_{q}|\leq\sum_{i=1}^{M}q_{i}|e_{i}|\leq\sum_{i=1}^{M}q_{i}t_{i}

(16)

Thus $t_{q}=\sum_{i=1}^{M}q_{i}t_{i}$ serves as a threshold for deciding whether an observed aggregated score difference is large enough to support a ranking between $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ . The full procedure for $K$ -fold cross-validation of small area estimators is summarized in Algorithm 1.

Algorithm 1

K

-fold Stratified cross-validation for comparing small area estimators

Multi-stage stratified survey sample

S

, candidate models

\mathcal{M}_{1}

and

\mathcal{M}_{2}

, aggregation weights

q_{1},...,q_{M}

1.

Randomly partition SSUs within each cluster into $K$ folds and adjust for survey weights accordingly.
2.

for $k=1,\ldots,K$ do
3.

Compute the $k$ -th held-out direct estimates $\hat{\bm{\theta}}^{(k),w}$ .
4.

Fit $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ on the remaining samples to obtain $\hat{\bm{\theta}}^{(-k),\mathcal{M}_{1}}$ and $\hat{\bm{\theta}}^{(-k),\mathcal{M}_{2}}$ .
5.

Compute adjusted CV scores for $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ using Equation 11.
6.

Compute the error bounds $t_{i}$ using Equation 15 and the aggregated bound $t_{q}=\sum_{i=1}^{M}q_{i}t_{i}$ .
7.

If $|\mbox{score}^{\mathcal{M}_{1}}-\mbox{score}^{\mathcal{M}_{2}}|>t_{q}$ then return the model with the smaller score.
8.

Else comparison is inconclusive.

Remark 3.5.

We focus on $K$ -fold cross-validation here, but the MSE decomposition in Theorem 3.1 and the error bound in Theorem 3.4 apply to a broad range of sample splitting schemes, as long as the two sets $A$ and $B$ are both representative of the full sample $S$ . The cross-validation procedure in Algorithm 1 can also be directly extended to repeated sample splitting with averaged estimators. A special case of such averaged estimator with $K=2$ is provided in the Supplementary Materials.

Finally, the preceding results provide finite-sample tools for model comparison. A natural question is whether the remainder $e_{i}$ vanishes as data accumulate. Theorem 3.6 shows that the adjusted score is asymptotically consistent for the conditional MSE difference under mild conditions.

Theorem 3.6 (Consistency of the adjusted score).

With $K$ and $M$ fixed, assume $\mbox{var}(\hat{\theta}_{S,i}^{w})\to 0$ as $n\to\infty$ and $\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}})$ is bounded for each model $\mathcal{M}$ , then $e_{i}\to 0$ for each area $i=1,\ldots,M$ . Consequently, $\mathbb{E}\big(\widehat{\mathrm{score}}_{q}^{\mathcal{M}_{1}}-\widehat{\mathrm{score}}_{q}^{\mathcal{M}_{2}}\big)-\big(\mathrm{MSE}_{A}^{\mathcal{M}_{1}}-\mathrm{MSE}_{A}^{\mathcal{M}_{2}}\big)\;\to\;0.$

3.4 The failure of leave-one-area-out (LOAO) cross-validation

To illustrate the pitfalls of sample splitting strategies that fail to maintain the original survey design, we examine a popular validation scheme in the literature using leave-one-area-out (LOAO) cross-validation. Under LOAO, the model is fitted on data from $M-1$ areas and then used to predict the prevalence in the held-out area. Let $\hat{\theta}_{i}^{(-i)}$ denote the predicted prevalence for the $i$ -th area using the rest of the data, an analogous score evaluating the squared error loss is

\widehat{\mathrm{score}}_{\mathrm{LOAO}}=\sum_{i=1}^{M}q_{i}\left(\hat{\theta}_{i}^{(-i)}-\hat{\theta}_{i}^{w}\right)^{2}.

LOAO validation has been informally used in the SAE literature (Steorts et al., 2020; Wakefield et al., 2025). For unit-level models, where individual observations are nested within areas, LOAO is a form of block cross-validation. For area-level models, LOAO corresponds directly to the standard leave-one-out cross-validation (LOO-CV) and can be evaluated using Pareto smoothed importance sampling without refitting the model $M$ times (Vehtari et al., 2016). The computational efficiency makes it a widely used diagnostic tool to assess Bayesian area-level models .

Although conceptually simple and straightforward to implement, LOAO is an extrapolation estimator targeting a different estimand, $\mathbb{E}[(\hat{\theta}_{i}^{(-i)}-\theta_{i})^{2}]$ . The shifted target can lead to reversed rankings when comparing smoothing models in the typical SAE setting.

Proposition 3.7.

For the $i$ -th area, the difference between the extrapolation and smoothing MSE is

\mathbb{E}[(\hat{\theta}_{i}^{(-i)}-\theta_{i})^{2}]-\mathbb{E}[(\hat{\theta}_{i}-\theta_{i})^{2}]=\Big(\mbox{var}(\hat{\theta}_{i}^{(-i)})-\mbox{var}(\hat{\theta}_{i})\Big)+\Big((\Delta_{i}^{(-i)})^{2}-(\Delta_{i})^{2}\Big),

where $\Delta_{i}=\mathbb{E}(\hat{\theta}_{i}-\theta_{i})$ and $\Delta_{i}^{(-i)}=\mathbb{E}(\hat{\theta}_{i}^{(-i)}-\theta_{i})$ are the conditional biases of the smoothing and extrapolation estimators, and the expectations are taken over the sampling distribution conditional on $\bm{\theta}$ .

The proof of Proposition 3.7 follows from standard MSE decomposition. The gap between the two MSE estimand manifests differently across models. It is easy to see that for a strongly over-smoothed model, the gap is close to $0$ as the model estimates are not sensitive to dropping data from a single area. However, for a well calibrated model, both the estimation variance and bias of the estimator for the $i$ -th area are expected to increase after removing all data from the $i$ -th area, thus making the extrapolation MSE larger than the smoothing MSE. Therefore, LOAO systematically penalizes flexible models more than over-smoothed models. We numerically illustrate this bias in Section 4.2.

4 Numerical results

We use the 2024 Zambia DHS (Zambia Statistics Agency et al., 2024) throughout our numerical analysis. Section 4.1 describes the survey design and candidate models under comparison. Section 4.2 describes a design-based simulation framework to construct synthetic population and surveys similar to the 2024 Zambia DHS. Section 4.3 illustrates our procedure with the real survey data.

4.1 Data and candidate SAE models

The 2024 Zambia DHS adopted a two-stage stratified cluster design with a sampling frame based on the 2022 Census of Population and Housing of the Republic of Zambia. The population comprises $36,770$ enumeration areas (EAs). Zambia has $10$ provinces, which are further divided into $115$ districts. In the first stage, each of the province was stratified by urban and rural residence, yielding $20$ strata in total. Within each stratum, EAs were selected using probability proportional to size (PPS) sampling, resulting in $545$ clusters in total. In the second stage, within each selected cluster, households were selected through equal-probability systematic sampling, with a fixed take of $25$ households per cluster, yielding a target sample of $13,625$ households. As a working example, we consider estimating the prevalence of female literacy in Zambia. We focus our analysis on prevalence estimation at the province level. The Supplementary Materials contain additional numerical results comparing models at the district level.

To illustrate the proposed cross-validation framework with models commonly used in practice, we consider three candidate estimators. At the area level, we consider Fay-Herriot models with

\displaystyle\hat{\phi}_{i}^{w}\mid\theta_{i}\sim N(\mbox{logit}(\theta_{i}),\hat{V}_{i}),\;\;\;\;\mbox{logit}(\theta_{i})=\alpha+u_{i},\;\;\;\;i=1,...,M,

(17)

where $\hat{\phi}_{i}^{w}=\mbox{logit}(\hat{\theta}_{i}^{w})$ is the transformed direct estimates, and $\hat{V}_{i}$ is the associated asymptotic variance estimate of $\hat{\theta}_{i}^{w}$ , and $u_{i}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,\sigma_{u}^{2})$ are random effects.

At the unit level, Bernoulli likelihood for the binary survey responses is equivalent to a cluster-level binomial model. For a survey consisting of $C$ clusters, we let $c[j]$ be the cluster index of the $j$ -th individual and $i[c]$ be the area index of the $c$ -th cluster. Let $Y_{c}$ and $n_{c}$ denote the number of positive responses and the number of sampled units from the $c$ -th cluster, respectively, i.e., $Y_{c}=\sum_{j:c[j]=c}y_{k}$ and $n_{c}=\sum_{j:c[j]=c}1$ . We consider a cluster-level model with a beta-binomial likelihood (Dong & Wakefield, 2021; Dong et al., 2026; Wakefield et al., 2025),

\displaystyle Y_{c}\mid p_{c}

\displaystyle\sim\mbox{BetaBinomial}(n_{c},p_{c},d).

(18)

In this formulation, $p_{c}$ is the latent prevalence for the $c$ -th cluster, and $d$ is the overdispersion parameter that captures the additional within-cluster variation. The cluster-level prevalence is then linked to area-level prevalence via $\mbox{logit}(p_{c})=\alpha+u_{i[c]},$ with random effects $u_{i}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,\sigma_{u}^{2})$ .

For the rest of the section, we consider three candidate models that differ in their model class and prior specification. For all three models, we adopt the penalised complexity (PC) prior (Simpson et al., 2017), where $PC(U,\alpha)$ prior for the standard deviation $\sigma$ specifies that $Pr(\sigma>U)=\alpha$ .

•

$\mathcal{M}_{1}$ : Unit-level Beta-Binomial model with a default $PC(1,0.01)$ prior on $\sigma_{u}$ .
•

$\mathcal{M}_{2}$ : Area-level Fay-Herriot model with the same priors for random effects as in $\mathcal{M}_{1}$ .
•

$\mathcal{M}_{3}$ : Same model as in $\mathcal{M}_{1}$ , but with a highly informative $PC(0.01,0.01)$ prior on $\sigma_{u}$ .

All models are fitted using the R package surveyPrev (Dong et al., 2024), employing INLA (Lindgren & Rue, 2015; Rue et al., 2009) in posterior computation. Figure 1 compares the direct estimates and estimates from the candidate models. These estimates will be formally evaluated using the proposed method in Section 4.3.

Refer to caption — Figure 1: Prevalence estimates and standard deviations of three candidate models for the percentage of women who are literate using 2024 Zambia DHS survey.

4.2 Design-based synthetic survey simulation

We first consider a design-based simulation study that mimics the 2024 Zambia DHS. Since the 2024 Zambia DHS was based on the sampling frame from the 2022 census, we constructed a 300m $\times$ 300m population grid using the 2022 population surface from WorldPop (WorldPop, 2018). To construct the sampling frame, we ranked pixels by population within each province and retained the most populated pixels to match the provincial EA counts from the 2024 DHS sampling frame, yielding $36{,}770$ gridded pixels. We treat these pixels as the master sampling frame of our study. For simplicity, we ignore urban/rural status of the pixels in the simulation and consider synthetic surveys stratified by province.

While the analysis in this paper focuses on prevalence estimation at the province level, we construct the synthetic population in this simulation study using smoothed district-level estimates of women’s literacy rate, obtained by fitting a Fay-Herriot model on the 2024 Zambia DHS. This introduces realistic within-province heterogeneity in our simulated data. For a given cluster $c$ with population $N_{c}$ in a district with assumed true prevalence $p$ , we generate event count $Y_{c}$ from a Beta-Binomial distribution with mean $N_{c}p$ and variance $N_{c}p(1-p)(1+(N_{c}-1)d)$ , where $d=0.00001$ introduces additional overdispersion within the district. The simulated finite population prevalence in the $i$ -th province is then $\theta_{i}=\sum_{c\in i}Y_{c}/\sum_{c\in i}N_{c}$ .

Given the synthetic population, we draw 50 replication surveys using a two-stage stratified design, where $50$ clusters are sampled in each stratum and $30$ households are sampled in each cluster. Figures 2 displays, for a single replicate, point estimates and 90% uncertainty intervals produced by the candidate models. Direct estimation, $\mathcal{M}_{1}$ , and $\mathcal{M}_{2}$ all produce estimates close to the population truth across all Admin-1 areas. By contrast, $\mathcal{M}_{3}$ shows excessive shrinkage, though with overlapping interval estimates compared to others.

We computed the adjusted cross-validation scores of all models using $5$ -fold stratified cross-validation. Figure 3 summarizes the adjusted CV score comparisons of the three models across the $50$ synthetic surveys. When comparing $\mathcal{M}_{3}$ with $\mathcal{M}_{1}$ , the adjusted CV score was able to consistently identify that $\mathcal{M}_{3}$ has larger MSE and the score differences all exceed the error bound, $t_{q}$ , providing decisive evidence in favor of $\mathcal{M}_{1}$ . The comparison between $\mathcal{M}_{2}$ and $\mathcal{M}_{1}$ is more difficult as the oracle differences are much smaller in scale. The adjusted CV score differences mostly have the same sign as the oracle difference, but in all replicates, the estimated score differences were between $\pm t_{q}$ , thus suggesting the comparison between $\mathcal{M}_{2}$ and $\mathcal{M}_{1}$ is inconclusive, as the available information in the data does not overcome the unestimable bias components of the conditional MSE.

In this simulation setting, the relative performance of the candidate models remains unchanged between the full-sample and oracle training MSE. When sample sizes are smaller, the two estimands may lead to different model rankings, which we investigate in the Supplementary Materials. We also compare different sample-splitting strategies and the choice of $K$ in the Supplementary Materials.

To gain additional insight on the model performance, we also examine the area-specific scores in Figure 4. For the comparison between $\mathcal{M}_{3}$ and $\mathcal{M}_{1}$ , the over-smoothed $\mathcal{M}_{3}$ departs noticeably from the other models in Copperbelt, Luapula, Lusaka, and Northern provinces, producing absolute score differences that are well above the error bound. For comparing $\mathcal{M}_{2}$ with $\mathcal{M}_{1}$ , area-specific score differences still remain tightly clustered near zero, with all points falling within the inconclusive region, except for Lusaka in a few occasions.

In summary, these results demonstrate that the CV score reliably ranks the models for the performance of subnational level prevalence estimation, while the error bound $t_{q}$ provides the principled threshold for separating genuine differences from noise-dominated ones. Additional diagnostics and visualizations are provided in Supplementary Materials.

We also compare the LOAO strategy in Figure 5 for $\mathcal{M}1$ and $\mathcal{M}3$ under leave-one-area-out (LOAO) validation. As discussed in Section 3.4, LOAO penalizes the over-smoothed model less. Indeed in this comparison, the procedure reached the opposite conclusion and ranked $\mathcal{M}3$ as the preferred model in every replicate, illustrating the bias described in Section 3.4.

4.3 Analysis of household survey data

We now return to the 2024 Zambia DHS survey with the indicator given by the percentage of women who are literate. The stratified cross-validation was conducted with the $20$ strata defined by province crossed with urban/rural residence. Results from 5-fold cross-validation is reported in Figure 6. Estimates from $\mathcal{M}_{3}$ is heavily shrunk to the overall mean. Between $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ , we observe slightly more shrinkage from $\mathcal{M}_{1}$ , and the shrinkage is more obvious in Lusaka and Copperbelt. The cross-validation procedure again identifies that $\mathcal{M}_{3}$ has larger aggregated score that exceeds the estimated threshold, compared with $\mathcal{M}_{2}$ . The difference between $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ , on the other hand, is a lot smaller and within the inclusive region.

A close examination of the area-specific scores in Figure 6 provides additional insight into the behavior of the models. When comparing $\mathcal{M}_{2}$ and $\mathcal{M}_{3}$ , the score differences are negative and exceed the area-specific threshold in magnitude across all ten provinces, yielding a uniform preference for $\mathcal{M}_{2}$ over $\mathcal{M}_{3}$ . When comparing $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ , while the overall score differences are smaller than the potential bias in magnitude, there are two provinces, Lusaka and Copperbelt, with score differences exceeding their area-specific thresholds. This is consistent with what we observe with the estimated prevalence difference between the two models, and can lead to $M_{2}$ being preferred under another set of aggregation weights different from the population proportions in this analysis.

5 Discussion

The proposed cross-validation framework leaves several methodological questions open. First, although the aggregate score is defined for general area weights, the choice of the weighting scheme itself deserves further study. In some applications, one may prefer to use variance to aggregate the scores, so that discrepancies in areas with different uncertainty receive different emphasis. Such choices may be attractive when model comparison is intended to reflect not only average squared error, but also performance in areas where the estimator is most informative. At the same time, the theoretical consequences of variance-based weighting are not yet clear, especially when the weights depend on the fitted model through posterior uncertainty estimates.

The squared error loss considered in this paper evaluates only point estimates. When full predictive distributions are available, one may instead use the continuous ranked probability score (CRPS), where the full posterior predictive CDF is compared against held-out direct estimates. Establishing similar decomposition and finite-sample error bound for CRPS can be a promising future research direction.

As with all sample-splitting methods, the success of the proposed procedure depends on the availability of sufficient data. When the target of inference is at the stratification level, e.g., provinces of Zambia, the impact of data sparsity is usually not severe. While we demonstrate in the Supplementary Materials that the procedure can also be useful at the level of $115$ districts, it remains an open question how to deal with even more extreme data sparsity, e.g., when no samples are collected in many areas.

Lastly, the optimal choice of $K$ in the $K$ -fold cross-validation is another important question relevant to practitioners. As we demonstrate empirically in the Supplementary Materials, a larger $K$ reduces the gap between $MSE_{A}$ and the oracle full-sample MSE, at the sacrifice of validation power due to the small sample sizes used for validation. Similar observations were made in Kawano et al. (2026). We leave these topics for future research.

Acknowledgement

We are grateful to the Space Time Analysis Bayes (STAB) working group for discussion and feedback on the paper. We also acknowledge the DHS for access and use of the data. The authors are supported by the National Institutes of Health [R01HD112421], and in part by the Gates Foundation. The findings and conclusions contained within are those of the authors and do not necessarily reflect positions or policies of the Gates Foundation.

References

Alfons et al. (2010) Alfons, A., Kraft, S., Templ, M. & Filzmoser, P. (2010). Simulation of synthetic population data for household surveys with application to EU-SILC. Research Report CS-2010-1, Department of Statistics and Probability Theory, Vienna University of Technology.
Besag et al. (1991) Besag, J., York, J. & Mollié, A. (1991). Bayesian image restoration with two applications in spatial statistics. Annals of the Institute of Statistics and Mathematics 43, 1–59.
Bradley et al. (2015) Bradley, J. R., Holan, S. H. & Wikle, C. K. (2015). Multivariate spatio-temporal models for high-dimensional areal data with application to longitudinal employer-household dynamics. The Annals of Applied Statistics 9, 1761–1791.
Brown et al. (2001) Brown, G., Chambers, R., Heady, P. & Heasman, D. (2001). Evaluation of small area estimation methods—an application to unemployment estimates from the uk lfs. In Proceedings of statistics Canada symposium, vol. 2001. Statistics Canada.
Chi et al. (2022) Chi, G., Fang, H., Chatterjee, S. & Blumenstock, J. E. (2022). Microestimates of wealth for all low-and middle-income countries. Proceedings of the National Academy of Sciences 119, e2113658119.
Corsi et al. (2012) Corsi, D. J., Neuman, M., Graham, W. B. & Subramanian, S. (2012). The Demographic and Health Surveys program: an overview. International Journal of Epidemiology 41, 1602–1613.
Datta & Lahiri (2000) Datta, G. S. & Lahiri, P. (2000). A unified measure of uncertainty of estimated best linear unbiased predictors in small area estimation problems. Statistica Sinica , 613–627.
Dharamshi et al. (2025) Dharamshi, A., Neufeld, A., Motwani, K., Gao, L. L., Witten, D. & Bien, J. (2025). Generalized data thinning using sufficient statistics. Journal of the American Statistical Association 120, 511–523.
Dong et al. (2024) Dong, Q., Li, Z. R., Wu, Y., Boskovic, A. & Wakefield, J. (2024). surveyPrev: Mapping the Prevalence of Binary Indicators using Survey Data in Small Areas. R package version 1.0.0.
Dong et al. (2026) Dong, Q., Wu, Y., Li, Z. R. & Wakefield, J. (2026). Toward a principled workflow for prevalence mapping using household survey data. Journal of Survey Statistics and Methodology , smaf048.
Dong & Wakefield (2021) Dong, T. & Wakefield, J. (2021). Modeling and presentation of health and demographic indicators in a low- and middle-income countries context. Vaccine 39, 2584–2594.
Dorfman (2018) Dorfman, A. H. (2018). Towards a routine external evaluation protocol for small area estimation. International Statistical Review 86, 259–274.
Efron (2004) Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association 99, 619–632.
Fabrizi & Lahiri (2013) Fabrizi, E. & Lahiri, P. (2013). A design-based approximation to the bayes information criterion in finite population sampling. Statistica 73, 289–301.
Fay & Herriot (1979) Fay, R. & Herriot, R. (1979). Estimates of income for small places: an application of James–Stein procedure to census data. Journal of the American Statistical Association 74, 269–277.
Franco & Maitra (2023) Franco, C. & Maitra, P. (2023). Combining surveys in small area estimation using area-level models. Wiley Interdisciplinary Reviews: Computational Statistics 15, e1613.
Gomez-Rubio et al. (2008) Gomez-Rubio, V., Best, N. & Richardson, S. (2008). A comparison of different methods for small area estimation .
Hájek (1971) Hájek, J. (1971). Discussion of, “An essay on the logical foundations of survey sampling, part I”, by D. Basu. In Foundations of Statistical Inference, V. Godambe & D. Sprott, eds. Toronto: Holt, Rinehart and Winston.
Holbrook et al. (2020) Holbrook, A., Lumley, T. & Gillen, D. (2020). Estimating prediction error for complex samples. Canadian Journal of Statistics 48, 204–221.
Iparragirre et al. (2023) Iparragirre, A., Lumley, T., Barrio, I. & Arostegui, I. (2023). Variable selection with lasso regression for complex survey data. Stat 12, e578.
Janicki et al. (2022) Janicki, R., Raim, A. M., Holan, S. H. & Maples, J. J. (2022). Bayesian nonparametric multivariate spatial mixture mixed effects models with application to american community survey special tabulations. The Annals of Applied Statistics 16, 144–168.
Kawano et al. (2026) Kawano, S., Parker, P. A. & Li, Z. R. (2026). On data thinning for model validation in small area estimation. arXiv preprint arXiv:2604.04141 .
Khan & Hancioglu (2019) Khan, S. & Hancioglu, A. (2019). Multiple indicator cluster surveys: Delivering robust data on children and women across the globe. Studies in Family Planning 50, 279–286.
Kuh et al. (2024) Kuh, S., Kennedy, L., Chen, Q. & Gelman, A. (2024). Using leave-one-out cross validation (loo) in a multilevel regression and poststratification (mrp) workflow: A cautionary tale. Statistics in Medicine 43, 953–982.
Lahiri & Pramanik (2019) Lahiri, P. & Pramanik, S. (2019). Evaluation of synthetic small-area estimators using design-based methods. Austrian Journal of Statistics 48, 43–57.
Li et al. (2019) Li, Z. R., Hsiao, Y., Godwin, J., Martin, B. D., Wakefield, J. & Clark, S. J. (2019). Changes in the spatial distribution of the under five mortality rate: small-area analysis of 122 DHS surveys in 262 subregions of 35 countries in Africa. PLoS One 14, e0210645.
Lindgren & Rue (2015) Lindgren, F. & Rue, H. (2015). Bayesian spatial modelling with R-INLA. Journal of Statistical Software 63, 1–25.
Lohr & Rao (2009) Lohr, S. L. & Rao, J. (2009). Jackknife estimation of mean squared error of small area predictors in nonlinear mixed models. Biometrika 96, 457–468.
Lumley & Scott (2014) Lumley, T. & Scott, A. (2014). Tests for regression models fitted to survey data. Australian & New Zealand Journal of Statistics 56, 1–14.
Lumley & Scott (2015) Lumley, T. & Scott, A. (2015). Aic and bic for modeling with complex survey data. Journal of Survey Statistics and Methodology 3, 1–18.
Lumley & Scott (2017) Lumley, T. & Scott, A. (2017). Fitting regression models to survey data. Statistical Science , 265–278.
Mercer et al. (2015) Mercer, L., Wakefield, J., Pantazis, A., Lutambi, A., Mosanja, H. & Clark, S. (2015). Small area estimation of childhood mortality in the absence of vital registration. Annals of Applied Statistics 9, 1889–1905.
Merfeld et al. (2024) Merfeld, J. D., Chen, H., Lahiri, P. & Newhouse, D. (2024). Small area estimation with geospatial data: A primer. Background Document ESA/STAT/AC.394/BG-3p, United Nations Statistical Commission. Prepared for the 56th Session of the United Nations Statistical Commission (March 2025) by the Inter-Secretariat Working Group on Household Surveys.
Merfeld et al. (2022) Merfeld, J. D., Newhouse, D. L., Weber, M. & Lahiri, P. (2022). Combining survey and geospatial data can significantly improve gender-disaggregated estimates of labor market outcomes .
Neufeld et al. (2024) Neufeld, A., Dharamshi, A., Gao, L. L. & Witten, D. (2024). Data thinning for convolution-closed distributions. Journal of Machine Learning Research 25, 1–35.
Prasad & Rao (1990) Prasad, N. & Rao, J. (1990). The estimation of the mean squared error of small-area estimators. The Annals of Statistics 85, 163–171.
Rao & Molina (2015) Rao, J. & Molina, I. (2015). Small Area Estimation, Second Edition. New York: John Wiley.
Rao & Scott (1981) Rao, J. N. & Scott, A. J. (1981). The analysis of categorical data from complex sample surveys: chi-squared tests for goodness of fit and independence in two-way tables. Journal of the American statistical association 76, 221–230.
Rao & Scott (1984) Rao, J. N. & Scott, A. J. (1984). On chi-squared tests for multiway contingency tables with cell proportions estimated from survey data. The Annals of statistics , 46–60.
Riebler et al. (2016) Riebler, A., Sørbye, S., Simpson, D. & Rue, H. (2016). An intuitive Bayesian spatial model for disease mapping that accounts for scaling. Statistical Methods in Medical Research 25, 1145–1165.
Rivest & Belmonte (2000) Rivest, L.-P. & Belmonte, E. (2000). A conditional mean squared error of small area estimators. Survey Methodology 26, 67–78.
Rue et al. (2009) Rue, H., Martino, S. & Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations (with discussion). Journal of the Royal Statistical Society, Series B 71, 319–392.
Saha et al. (2023) Saha, U. R., Das, S., Baffour, B. & Chandra, H. (2023). Small area estimation of age-specific and total fertility rates in Bangladesh. Spatial Demography 11, 2.
Simpson et al. (2017) Simpson, D., Rue, H., Riebler, A., Martins, T. & Sørbye, S. (2017). Penalising model component complexity: A principled, practical approach to constructing priors (with discussion). Statistical Science 32, 1–28.
Steorts et al. (2020) Steorts, R. C., Schmid, T. & Tzavidis, N. (2020). Smoothing and benchmarking for small area estimation. International Statistical Review 88, 580–598.
The DHS Program (2026) The DHS Program (2026). The Demographic and Health Surveys program. http://www.dhsprogram.com. Accessed: 2026-04-08.
Thomas & Rao (1987) Thomas, D. R. & Rao, J. (1987). Small-sample comparisons of level and power for simple goodness-of-fit statistics under cluster sampling. Journal of the American Statistical Association 82, 630–636.
Trevisani et al. (2017) Trevisani, M., Torelli, N. et al. (2017). A comparison of hierarchical bayesian models for small area estimation of counts. Open Journal of Statistics 7, 521–550.
UNICEF (2026) UNICEF (2026). Multiple indicator cluster surveys. http://mics.unicef.org. Accessed: 2026-04-08.
Vehtari et al. (2016) Vehtari, A., Mononen, T., Tolvanen, V., Sivula, T. & Winther, O. (2016). Bayesian leave-one-out cross-validation approximations for gaussian latent variable models. Journal of Machine Learning Research 17, 1–38.
Wakefield et al. (2019) Wakefield, J., Fuglstad, G.-A., Riebler, A., Godwin, J., Wilson, K. & Clark, S. (2019). Estimating under five mortality in space and time in a developing world context. Statistical Methods in Medical Research 28, 2614–2634.
Wakefield et al. (2025) Wakefield, J., Gao, P. A., Fuglstad, G.-A. & Li, Z. R. (2025). The two cultures of prevalence mapping: Small area estimation and model-based geostatistics. arXiv preprint arXiv:2110.09576 .
Wakefield et al. (2026) Wakefield, J., Jiang, J. & Wu, Y. (2026). Automatic variance adjustment for small area estimation. arXiv preprint arXiv:2602.14387 .
Wakefield et al. (2020) Wakefield, J., Okonek, T. & Pedersen, J. (2020). Small area estimation for disease prevalence mapping. International Statistical Review 88, 398–418.
Wieczorek et al. (2022) Wieczorek, J., Guerin, C. & McMahon, T. (2022). K-fold cross-validation for complex sample surveys. Stat 11, e454.
Wieczorek & Franco (2013) Wieczorek, O. & Franco, C. (2013). Small area estimation evaluation strategies: An application to the American Community Survey. In Proceedings of the Joint Statistical Meetings, Section on Survey Research Methods. American Statistical Association Alexandria, VA.
WorldPop (2018) WorldPop (2018). Global high resolution population denominators project - funded by the bill and melinda gates foundation (opp1134076).
Wu et al. (2021) Wu, Y., Li, Z. R., Mayala, B., Wang, H., Gao, P., Paige, J., Fuglstad, G.-A., Moe, C., Godwin, J., Donohue, R., Janocha, B., Croft, T. & Wakefield, J. (2021). Spatial Modeling for Subnational Administrative level 2 Small-Area Estimation. DHS Spatial Analysis Reports No. 21. Rockville, Maryland, USA.
Zambia Statistics Agency et al. (2024) Zambia Statistics Agency, Ministry of Health (MoH) [Zambia] & ICF (2024). Zambia demographic and health survey 2024. Tech. rep., Zambia Statistics Agency, MoH, and ICF, Lusaka, Zambia, and Rockville, Maryland, USA.

Supplementary material

Appendix A Proofs

Consider splitting a sample $S$ into two disjoint sets $A$ and $B=S\setminus A$ following one of the sample splitting schemes discussed in the previous section. For a sequence of random variables $\{X_{n}\}$ and a sequence of constants $\{a_{n}\}$ , $X_{n}=O_{p}(a_{n})$ indicates that $X_{n}/a_{n}$ is bounded in probability and $X_{n}=o_{p}(a_{n})$ indicates that $X_{n}/a_{n}$ converges in probability to zero. $O(\cdot)$ and $o(\cdot)$ denote the corresponding deterministic asymptotic bounds. We use $X\lesssim Y$ to denote that $X\leq CY$ for some constant $C>0$ that does not depend on the sample size $n$ .

A.1 Proof of Theorem 3.1

Proof A.1.

Consider the conditional expectation of the first term:

	$\displaystyle\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\hat{\theta}^{w}_{B,i})^{2}\mid S]$	$\displaystyle=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i}-(\hat{\theta}^{w}_{B,i}-\theta_{i}))^{2}\mid S]$
		$\displaystyle=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})^{2}\mid S]+\mathbb{E}[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}\mid S]-2\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})(\hat{\theta}^{w}_{B,i}-\theta_{i})\mid S].$

Expanding the last two terms yields:

	$\displaystyle\mathbb{E}[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}\mid S]$	$\displaystyle=\mbox{var}(\hat{\theta}_{B,i}^{w}\mid S)+(\mathbb{E}[\hat{\theta}^{w}_{B,i}-\theta_{i}\mid S])^{2}=v_{i}+b_{i}^{2},$
	$\displaystyle\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})(\hat{\theta}^{w}_{B,i}-\theta_{i})\mid S]$	$\displaystyle=\mbox{cov}(\hat{\theta}_{A,i}^{\mathcal{M}},\hat{\theta}_{B,i}^{w}\mid S)+\mathbb{E}(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i}\mid S)\mathbb{E}(\hat{\theta}_{B,i}^{w}-\theta_{i}\mid S)=c_{i}^{\mathcal{M}}+d_{i}^{\mathcal{M}}b_{i}.$

Substituting these into the expression for $\mathbb{E}(\hat{m}_{S}\mid S)$ and taking the expectation over $S$ yields $\mathbb{E}(\hat{m}_{S})=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})^{2}]$ .

A.2 Proof of Theorem 3.3

Proof A.2.

From Theorem 3.1, $\mathbb{E}(\mbox{score}_{S}^{\mathcal{M}})=\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}}-\theta_{i})^{2}]+\mathbb{E}(v_{i}+b_{i}^{2})-2\mathbb{E}(b_{i}d_{i}^{\mathcal{M}})$ . The difference is $\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}_{1}}-\theta_{i})^{2}]-\mathbb{E}[(\hat{\theta}_{A,i}^{\mathcal{M}_{2}}-\theta_{i})^{2}]-2\mathbb{E}[b_{i}(d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}})]$ . Since $\hat{\theta}^{w}$ is design unbiased, $\mathbb{E}(b_{i})=0$ . Thus, $\mathbb{E}[b_{i}(d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}})]=\mbox{cov}(b_{i},d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}})$ .

A.3 Proof of Theorem 3.4

{condition}

[Linearization of the model-based estimator difference] For each area $i$ and each pair of models $(\mathcal{M}_{1},\mathcal{M}_{2})$ , there exist bounded functions $\Delta\psi_{i,j}$ of $j\in S$ , fixed given $S$ , such that for any subsample $A\subseteq S$ with $n_{A}=|A|$ ,

\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}=\frac{1}{n_{A}}\sum_{j\in A}\Delta\psi_{i,j}+\Delta r_{A,i},\qquad\Delta r_{A,i}=O_{p}(n_{A}^{-1}).

(19)

{condition}

[Linearization of the Hájek direct estimator] For each area $i$ , there exist bounded functions $\psi^{w}_{i,j}$ , defined on units $j\in U_{i}$ and not depending on the random partition, such that for any subset $B\subseteq S$ with $n_{B,i}=|B\cap U_{i}|$ ,

\hat{\theta}^{w}_{B,i}-\theta_{i}=\frac{1}{n_{B,i}}\sum_{j\in B\cap U_{i}}\psi^{w}_{i,j}+r^{w}_{B,i},\qquad r^{w}_{B,i}=O_{p}(n_{B,i}^{-1}).

(20)

Lemma A.3 (Conditional expectation of a subsample-based difference under linearization).

Let $S$ be a fixed sample of size $n=|S|$ , and let $A\subset S$ be a random subsample under a mechanism that assigns each unit $j\in S$ to $A$ with equal inclusion probability $\pi_{A}\in(0,1)$ , with $n_{A}=\pi_{A}n$ under balanced partitioning. For area $i$ and two estimators $\hat{\theta}^{\mathcal{M}_{1}}_{A,i}$ and $\hat{\theta}^{\mathcal{M}_{2}}_{A,i}$ of $\theta_{i}$ , suppose their difference admits the linearization

\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}=\frac{1}{n_{A}}\sum_{j\in A}\Delta\psi_{i,j}+\Delta r_{A,i},\qquad\Delta r_{A,i}=O_{p}(n_{A}^{-1}),

(21)

where $\Delta\psi_{i,j}$ are bounded functions of unit $j$ , fixed given $S$ , and the same expansion holds at $A=S$ with remainder $\Delta r_{S,i}=O_{p}(n^{-1})$ . Then

E\!\left(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\;\big|\;S\right)=\hat{\theta}^{\mathcal{M}_{1}}_{S,i}-\hat{\theta}^{\mathcal{M}_{2}}_{S,i}+O_{p}(n^{-1}).

(22)

Proof A.4.

Taking conditional expectation of (21) given $S$ and writing the sum using indicators,

E\!\left(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\mid S\right)=\frac{1}{n_{A}}\sum_{j\in S}P(j\in A\mid S)\,\Delta\psi_{i,j}+E[\Delta r_{A,i}\mid S],

(23)

where the $\Delta\psi_{i,j}$ come out as constants since they are fixed given $S$ . Under equal inclusion probability $\pi_{A}$ and $n_{A}=\pi_{A}n$ ,

\frac{1}{n_{A}}\sum_{j\in S}P(j\in A\mid S)\,\Delta\psi_{i,j}=\frac{\pi_{A}}{\pi_{A}n}\sum_{j\in S}\Delta\psi_{i,j}=\frac{1}{n}\sum_{j\in S}\Delta\psi_{i,j}.

(24)

Applying (21) at $A=S$ gives $n^{-1}\sum_{j\in S}\Delta\psi_{i,j}=\hat{\theta}^{\mathcal{M}_{1}}_{S,i}-\hat{\theta}^{\mathcal{M}_{2}}_{S,i}-\Delta r_{S,i}$ . Substituting,

E\!\left(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\mid S\right)=\hat{\theta}^{\mathcal{M}_{1}}_{S,i}-\hat{\theta}^{\mathcal{M}_{2}}_{S,i}-\Delta r_{S,i}+E[\Delta r_{A,i}\mid S].

(25)

Since $n_{A}=\pi_{A}n$ with $\pi_{A}$ bounded away from zero, $O_{p}(n_{A}^{-1})=O_{p}(n^{-1})$ . The bounded influence function ensures that conditional expectation preserves the rate, so $E[\Delta r_{A,i}\mid S]=O_{p}(n^{-1})$ . And $\Delta r_{S,i}=O_{p}(n^{-1})$ , the remainder is $O_{p}(n^{-1})$ , yielding (22).

Remark A.5.

The same argument applied to a single estimator $\hat{\theta}^{w}_{B,i}$ with influence function $\psi^{w}_{i,j}$ supported on $U_{i}$ (rather than a difference with influence function supported on all of $S$ ) yields $E(\hat{\theta}^{w}_{B,i}-\theta_{i}\mid S)=\hat{\theta}^{w}_{S,i}-\theta_{i}+O_{p}(n_{i}^{-1})$ , where $n_{i}=|S\cap U_{i}|$ . We invoke this special case for the Hájek direct estimator on $B$ in the proof of Theorem 3.4 below.

Proof A.6 (Proof of Theorem 3.4).

Applying the Cauchy–Schwarz inequality to $e_{i}$ in Theorem 3.3,

|e_{i}|\leq 2\sqrt{\mbox{var}[E(\hat{\theta}_{B,i}^{w}-\theta_{i}\mid S)]\;\mbox{var}[E(\hat{\theta}_{A,i}^{\mathcal{M}_{1}}-\hat{\theta}_{A,i}^{\mathcal{M}_{2}}\mid S)]}.

(26)

Applying Lemma A.3 in its single-estimator form to the Hájek direct estimator on $B$ under Condition A.3,

E(\hat{\theta}_{B,i}^{w}-\theta_{i}\mid S)=\hat{\theta}_{S,i}^{w}-\theta_{i}+O_{p}(n_{i}^{-1}).

(27)

Applying Lemma A.3 directly to the model difference on $A$ under Condition A.3,

E\!\left(\hat{\theta}_{A,i}^{\mathcal{M}_{1}}-\hat{\theta}_{A,i}^{\mathcal{M}_{2}}\mid S\right)=\hat{\theta}_{S,i}^{\mathcal{M}_{1}}-\hat{\theta}_{S,i}^{\mathcal{M}_{2}}+O_{p}(n^{-1}).

(28)

Since $n^{-1}\leq n_{i}^{-1}$ , both remainders are $O_{p}(n_{i}^{-1})$ . Using that $\theta_{i}$ is non-random and each remainder has variance of smaller order than the leading variance term,

	$\displaystyle\mbox{var}[E(\hat{\theta}_{B,i}^{w}-\theta_{i}\mid S)]$	$\displaystyle=\mbox{var}(\hat{\theta}_{S,i}^{w})+o(n_{i}^{-1}),$		(29)
	$\displaystyle\mbox{var}[E(\hat{\theta}_{A,i}^{\mathcal{M}_{1}}-\hat{\theta}_{A,i}^{\mathcal{M}_{2}}\mid S)]$	$\displaystyle=\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}}-\hat{\theta}_{S,i}^{\mathcal{M}_{2}})+o(n_{i}^{-1}).$		(30)

Substituting into (26) yields the leading-order bound

|e_{i}|\leq 2\sqrt{\mbox{var}(\hat{\theta}_{S,i}^{w})\,\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}}-\hat{\theta}_{S,i}^{\mathcal{M}_{2}})}+o(n_{i}^{-1}).

(31)

The variance of the difference $\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}}-\hat{\theta}_{S,i}^{\mathcal{M}_{2}})$ is not directly available from independent Bayesian fits of $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ . To obtain an implementable plug-in, we apply the inequality $\mbox{var}(X-Y)\leq 2\mbox{var}(X)+2\mbox{var}(Y)$ , yielding

|e_{i}|\leq 2\sqrt{2\,\mbox{var}(\hat{\theta}_{S,i}^{w})\,[\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}})+\mbox{var}(\hat{\theta}_{S,i}^{\mathcal{M}_{2}})]}+o(n_{i}^{-1}).

(32)

This bound is conservative whenever the cross-model covariance $\mbox{cov}(\hat{\theta}_{S,i}^{\mathcal{M}_{1}},\hat{\theta}_{S,i}^{\mathcal{M}_{2}})$ is non-negative, which is expected when $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ are smoothing estimators fitted to the same sample and targeting the same estimand.

The sampling variances on the right-hand side of (32) are not available from a single sample. For the direct estimator, we use the standard design-based variance estimator $\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{w})$ . For each model-based estimator, we use the posterior variance $\mbox{var}_{\mathrm{post}}(\theta_{i}^{\mathcal{M}})$ as a plug-in approximation. The resulting plug-in threshold is

t_{i}=2\sqrt{2\,\widehat{\mbox{var}}(\hat{\theta}_{S,i}^{w})\,[\mbox{var}_{\mathrm{post}}(\theta_{i}^{\mathcal{M}_{1}})+\mbox{var}_{\mathrm{post}}(\theta_{i}^{\mathcal{M}_{2}})]}.

(33)

The posterior variance exceeds the sampling variance of the posterior mean by accounting for residual model uncertainty in addition to sampling variability, so $t_{i}$ is a conservative estimator of the right-hand side of (32). The resulting bound is confirmed empirically in the simulation study, as shown in Figure 9.

A.4 Proof of Theorem 3.6

Proof A.7.

The remainder term is $e_{i}=-2\mbox{cov}(b_{i},d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}})$ . By Cauchy–Schwarz inequality, $|e_{i}|\leq 2\sqrt{\mbox{var}(b_{i})\mbox{var}(d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}})}$ . Under the design regularity assumptions and law of total variance, $\mbox{var}(b_{i})\leq\mbox{var}(\hat{\theta}_{B,i}^{w})=O(n_{i}^{-1})\to 0$ . For the second term, $\mbox{var}(d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}})\leq 2\mbox{var}(d_{i}^{\mathcal{M}_{1}})+2\mbox{var}(d_{i}^{\mathcal{M}_{2}})$ , and $\mbox{var}(d_{i}^{\mathcal{M}})\leq\mbox{var}(\hat{\theta}_{A,i}^{\mathcal{M}})$ by the law of total variance. Thus given the assumption that $\mbox{var}(\hat{\theta}_{A,i}^{\mathcal{M}})$ is bounded, $\mbox{var}(d_{i}^{\mathcal{M}_{1}}-d_{i}^{\mathcal{M}_{2}})$ is also bounded. Hence $e_{i}\to 0$ .

Appendix B Additional results for Section 4.2 of the main paper

To support the analysis presented in the main text, this section provides additional visual evaluation of model performance, error decomposition, and cross-validation reliability. Figure 7 initializes this assessment by illustrating the spatial distribution of estimation bias across Admin 1 regions, comparing direct estimates and three model-based approaches ( $\mathcal{M}_{1},\mathcal{M}_{2},\mathcal{M}_{3}$ ) against the known population truth. Figure 8 compares the naive and adjusted CV scores against the two oracle MSE estimands. Figure 9 validates the approximated error bounds against the true remainder terms. Finally, Figure 10 disaggregates the CV scores and show a distribution of the area-specific scores across the ten provinces, which clearly illustrates which provinces drive the differences in the scores.

Appendix C Model comparison using naive CV scores

As an alternative to the proposed adjusted score comparison in the main paper, one may also evaluate the differences in naive CV scores of different models, and absorb the $v_{i}$ and $2c_{i}^{\mathcal{M}}$ terms in the remainder bias term. We examine this approach in this section.

Similar to Theorem 3.4, we can derive the bias for the naive score difference, as detailed in Theorem C.1 below, and a bound on the bias is provided in Theorom C.3.

Theorem C.1 (Naive CV score difference).

For the $i$ -th area, the expected difference of naive CV scores satisfies

\mathbb{E}\!\left[\widehat{\mathrm{score}}^{\mathcal{M}_{1}}_{\mathrm{naive},i}-\widehat{\mathrm{score}}^{\mathcal{M}_{2}}_{\mathrm{naive},i}\right]=\mathbb{E}\!\left[(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\theta_{i})^{2}-(\hat{\theta}^{\mathcal{M}_{2}}_{A,i}-\theta_{i})^{2}\right]+f_{i},

(34)

where the expectation is over the sampling distribution of $S$ and the remainder is

f_{i}=-2\,\mathbb{E}\!\left[(\hat{\theta}^{w}_{B,i}-\theta_{i})\bigl(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\bigr)\right].

(35)

Proof C.2.

From Theorem 3.1, taking the difference between $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ , the model-free terms $\mathbb{E}[v_{i}]$ and $\mathbb{E}[b_{i}^{2}]$ cancel exactly, leaving

\mathbb{E}\!\left[\widehat{\mathrm{score}}^{\mathcal{M}_{1}}_{\mathrm{naive},i}-\widehat{\mathrm{score}}^{\mathcal{M}_{2}}_{\mathrm{naive},i}\right]=\mathbb{E}\!\left[(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\theta_{i})^{2}-(\hat{\theta}^{\mathcal{M}_{2}}_{A,i}-\theta_{i})^{2}\right]-2\mathbb{E}\!\left[c^{\mathcal{M}_{1}}_{i}-c^{\mathcal{M}_{2}}_{i}\right]-2\mathbb{E}\!\left[b_{i}(d^{\mathcal{M}_{1}}_{i}-d^{\mathcal{M}_{2}}_{i})\right].

Collecting the last two terms, and using the identity $c^{\mathcal{M}}_{i}+b_{i}d^{\mathcal{M}}_{i}=\mathbb{E}[(\hat{\theta}^{\mathcal{M}}_{A,i}-\theta_{i})(\hat{\theta}^{w}_{B,i}-\theta_{i})\mid S]$ from Theorem 3.1, the remainder becomes

f_{i}=-2\,\mathbb{E}\!\left[\mathbb{E}\!\left[(\hat{\theta}^{w}_{B,i}-\theta_{i})\bigl(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\bigr)\,\Big|\,S\right]\right],

which equals $-2\,\mathbb{E}[(\hat{\theta}^{w}_{B,i}-\theta_{i})(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i})]$ by the law of iterated expectations.

Theorem C.3 (Error bound for naive CV score difference).

The remainder $f_{i}$ in Equation (35) satisfies

|f_{i}|\leq 2\sqrt{\mathbb{E}\!\left[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}\right]\cdot\mathbb{E}\!\left[\bigl(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\bigr)^{2}\right]}.

(36)

For $K$ -fold CV, a plug-in estimator of this bound is

\hat{t}_{\text{naive},i}=2\sqrt{\left[\frac{1}{K}\sum_{k=1}^{K}\widehat{\mathrm{var}}\!\left(\hat{\theta}^{w}_{B_{k},i}\right)\right]\left[\frac{1}{K}\sum_{k=1}^{K}\!\left(\hat{\theta}^{\mathcal{M}_{1}}_{A_{k},i}-\hat{\theta}^{\mathcal{M}_{2}}_{A_{k},i}\right)^{\!2}\right]}.

(37)

Proof C.4.

Apply the Cauchy-Schwarz inequality to the conditional expectation given $S$ :

\left|\mathbb{E}\!\left[(\hat{\theta}^{w}_{B,i}-\theta_{i})\bigl(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\bigr)\,\Big|\,S\right]\right|\leq\sqrt{\mathbb{E}\!\left[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}\,\Big|\,S\right]\cdot\mathbb{E}\!\left[\bigl(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i}\bigr)^{2}\,\Big|\,S\right]}.

Take the outer expectation over $S$ and apply the Jensen’s inequality to that expectation:

\mathbb{E}\!\left[\sqrt{\mathbb{E}[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}\mid S]\cdot\mathbb{E}[(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i})^{2}\mid S]}\right]\leq\sqrt{\mathbb{E}\!\left[\mathbb{E}[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}\mid S]\right]\cdot\mathbb{E}\!\left[\mathbb{E}[(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i})^{2}\mid S]\right]}

For the plug-in estimator, the first factor in Equation (36) is $\mathbb{E}[(\hat{\theta}^{w}_{B,i}-\theta_{i})^{2}]=\mathbb{E}[v_{i}]+\mathbb{E}[b_{i}^{2}]$ . Since the Hájek estimator is approximately design-unbiased, $\mathbb{E}[b_{i}^{2}]$ is negligible relative to $\mathbb{E}[v_{i}]$ , and $K^{-1}\sum_{k=1}^{K}\widehat{\mathrm{var}}(\hat{\theta}^{w}_{B_{k},i})$ is an estimator of $\mathbb{E}[v_{i}]$ . The second factor $\mathbb{E}[(\hat{\theta}^{\mathcal{M}_{1}}_{A,i}-\hat{\theta}^{\mathcal{M}_{2}}_{A,i})^{2}]$ is the expected squared model difference. The fold average $K^{-1}\sum_{k}(\hat{\theta}^{\mathcal{M}_{1}}_{A_{k},i}-\hat{\theta}^{\mathcal{M}_{2}}_{A_{k},i})^{2}$ estimates this second moment directly. Substituting both gives $\hat{t}_{\text{naive},i}$ in Equation (37). The aggregation steps are the same as it for adjusted scores. Then in the full procedure, we compare the difference in naive scores to the error bound. If it exceeds the bound, we report the comparison based on the naive score.

Comparing to the error bound on adjusted score, $\hat{t}_{\text{naive},i}$ is more conservative. Figure 11 illustrates the consequences. Here we used the moderate sample size setting as described in Section 4.2. For the comparison between $\mathcal{M}3$ and $\mathcal{M}1$ , the naive score differences are positive in nearly all replicates under naive scores, correctly signing the oracle difference, but the bound $\hat{t}_{\text{naive},i}$ is relatively large to place some of replicates in the inconclusive region. For the comparison between $\mathcal{M}_{2}$ and $\mathcal{M}_{1}$ , the model differences are again very small in scale and within the inconclusive region of naive score error bound.

Appendix D Cross-validation by PSU

In this section, we consider an alternative sample splitting strategy for multi-stage stratified sampling design. Instead of randomly assigning SSUs within each PSU into $K$ folds, we randomly assign PSUs into $K$ folds. We refer to this strategy as CV-PSU, and the version presented in the main paper as CV-SSU.

In principle, both CV strategies maintain the structure of the survey design after sample splitting. Figure 12 presents results for CV-PSU under the setting of moderate sample size (50 clusters per stratum, and 30 households per cluster). CV-PSU leads to the same conclusion as CV-SSU: $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ cannot be ranked conclusively, whereas the score difference between $\mathcal{M}_{1}$ and $\mathcal{M}_{3}$ is sufficiently large to conclude that $\mathcal{M}_{1}$ outperforms $\mathcal{M}_{3}$ .

Figure 13 highlights a key difference between CV-PSU and CV-SSU. Under $5$ -fold cross-validation, removing one fifth of the clusters from a model causes greater information loss than removing one fifth of households within each cluster. Thus for CV-PSU, the gap between the full sample oracel MSE and oracle training MSE increases faster than CV-SSU as sample size decreases. This can also be seen from Figure 14 and 14, where it shows that while the adjusted CV scores align relatively well with the training set oracle MSE, they could deviate from the oracle full-sample MSE when sample size is small. Therefore, in addition to the effect of the unidentifiable remainder term in estimating the training set oracle MSE, the performance of CV scores can also be impacted significantly by the gap between the two oracle MSE estimands. In contrast, the impact of reduced sample size on such gap is smaller for CV-SSU, as shown in Figure 16.

In summary, the oracle training MSE is a substantially poorer proxy for the full-sample oracle under CV-PSU than under CV-SSU, which can lead to more bias in model ranking.

Appendix E Two-fold CV-SSU with resplitting

In this section, we discuss an alternative to $5$ -fold CV presented in the main paper. Under two-fold CV, the sample $S$ is randomly partitioned into two equal halves, with one serving as the training set $A$ and the other as the validation set $B$ . The partition can be repeated multiple times to reduce sensitivity to the choice of split. Because each training set retains only half the data, the dependence between $\hat{\theta}^{\mathcal{M}}_{A,i}$ and $\hat{\theta}^{w}_{B,i}$ is weaker than under $K$ -fold CV, where the training set retains $(K-1)/K$ of the sample. Consequently, the covariance term $\hat{c}^{\mathcal{M}}_{i}$ in Theorem 3.3 is smaller in magnitude, and the naive and adjusted scores are closer to each other than under 5-fold CV. The terms $\hat{v}_{i}$ , $\hat{c}^{\mathcal{M}}_{i}$ , and the naive squared errors are averaged over the repeated pairs before computing the adjusted score.

In this simulation, we repeat the partition five times and compare both CV-PSU and CV-SSU under the moderate-sample setting (50 clusters per stratum, 30 households per cluster) and the small-sample setting (40 clusters per stratum, 30 households per cluster). Under all settings, two-fold CV-SSU with resplitting correctly ranks the three models: $\mathcal{M}_{3}$ is least favoured and $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ are very close. Additionally, both CV-PSU and CV-SSU estimate the training oracle error accurately.

The differences again lie in how closely the training oracle MSE tracks the full-data oracle in these scenarios. Under the moderate-sample setting, CV-SSU tracks the full-data oracle more closely than CV-PSU, consistent with the findings in Supplementary Section D. Under the small-sample setting (40 clusters), the oracle training MSE for $\mathcal{M}_{3}$ departs from the full-data oracle under both strategies, but the divergence is more severe for CV-PSU. Taken together, two-fold CV-SSU is more reliable than two-fold CV-PSU for tracking the full-sample MSE, but less so than 5-fold CV-SSU at small sample sizes, because each training fold retains only half the data and the resulting oracle training MSE estimand is a noisier proxy for the full-sample target.

Appendix F Comparing SAE models for the $115$ districts in Zambia

In this section, we present results comparing models at a finer geographical level. Direct estimates are often unavailable for subnational areas below the stratification level of the survey design, so the validation set is restricted to areas with at least one sampled cluster. Although both splitting schemes apply at finer levels, the CV-PSU scheme can introduce additional missingness in direct estimates within the test set, whereas each fold of CV-SSU preserves the same pattern of missingness as in the full sample. We therefore consider only CV-SSU for comparison at finer geographical level.

F.1 Synthetic data

For Zambia, the administrative level below the 10 provinces consists of 115 districts. We use the synthetic population of women’s literacy rate and draw 50 replicate surveys under a two-stage stratified design, sampling 60 clusters per stratum and 30 households per cluster. Across the 50 replicates, the number of districts with micro data for direct estimates ranges from 89 to 99, with median 94. We compare the following models:

•

$\mathcal{M}_{1}$ : Unit-level Beta-Binomial model with BYM2 random effects (Besag et al., 1991; Riebler et al., 2016). A $PC(1,0.01)$ prior is used for the precision parameter $\tau$ of the BYM2 model.
•

$\mathcal{M}_{2}$ : Area-level Fay-Herriot model, with automatic variance adjustment (Wakefield et al., 2026). The priors on random effects are the same as $\mathcal{M}_{1}$ .
•

$\mathcal{M}_{3}$ : Same model as in $\mathcal{M}_{1}$ , but with a highly informative prior $PC(0.005,0.01)$ on $\tau$ .

Figure 18 shows district-level posterior mean estimates under one replicate. The direct estimates exhibit substantial variability across districts, reflecting the sparse sample at this geographical level. $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ produce similar spatial patterns, while $\mathcal{M}_{3}$ yields a notably smoother surface due to its informative prior on $\tau$ . Figure 15 summarises the CV-SSU comparison across all 50 replicates. When comparing $\mathcal{M}_{3}$ with $\mathcal{M}_{1}$ , the adjusted CV score differences are positive in nearly all replicates, agreeing in sign with the oracle MSE differences; all points fall in the first quadrant of the left panel, indicating that CV-SSU correctly identifies $\mathcal{M}_{1}$ as the better model. The right panel confirms that the absolute score differences exceed the error bound $t_{q}$ in all replicates, so the CV-SSU yields a decisive selection.

In terms of the comparison between $\mathcal{M}_{2}$ and $\mathcal{M}_{1}$ , both oracle MSE differences and adjusted CV score differences are concentrated near zero. The right panel shows that nearly all replicates fall within the inconclusive region, $|\widehat{\mathrm{score}}_{q}^{\mathcal{M}_{a}}-\widehat{\mathrm{score}}_{q}^{\mathcal{M}_{b}}|<t_{q}$ , indicating that the error bound correctly abstains from making a decisive ranking when the two models provide very similar subnational estimates. Together, these two comparisons demonstrate that the proposed framework maintains the performance at a finer geographical level with sparser data.

F.2 Analysis of the 2024 Zambia DHS

We now demonstrate district-level comparison using real data. Figure 20 shows the point estimates with 90% intervals for direct estimates, $\mathcal{M}_{1}$ , $\mathcal{M}_{2}$ , and $\mathcal{M}_{3}$ . Two districts have no sampled clusters and therefore no direct estimates. Similar to other results, $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ are close to each other, while $\mathcal{M}_{3}$ is over-smoothed. Figure 21 shows the district-level comparison. When comparing $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ , the majority of the districts fall in the inconclusive zone, and so is the aggregated score difference. When comparing $\mathcal{M}_{2}$ and $\mathcal{M}_{3}$ , there are more area-specific score differences that exceed the thresholds, and the overall score difference is also large enough to suggest that $\mathcal{M}_{2}$ is the better model. Compared to province-level models, district-level estimation uses less data and therefore has larger uncertainty. This leads to larger and more conservative error bounds, which explains why we see more disagreement in the interval plot but fewer areas with conclusive comparisons.