License: arXiv.org perpetual non-exclusive license
arXiv:2604.23534v1 [stat.ME] 26 Apr 2026

Multivariate incremental effects for continuous treatments: Studying the health effects of environmental mixtures

Zhuochao Huang¹, Kejin Dong¹, Tuo Lin², Joseph Antonelli¹
¹ Department of Statistics, University of Florida. Corresponding author: Zhuochao Huang, zhuochao.huang@ufl.edu
² Department of Biostatistics, University of Florida.
Abstract

Evaluating the causal health effects of multivariate, continuous exposures, such as air pollution mixtures, is a critical public health challenge. A primary obstacle is the frequent violation of the positivity assumption, which renders the effects of standard deterministic interventions unidentified or heavily reliant on unreliable model extrapolation. In this paper, we develop a novel causal inference framework to address this challenge. We extend exponential tilting to multivariate exposures and address the critical question of how to compare different intervention directions fairly. This establishes a systematic framework for defining and evaluating various policy-relevant causal estimands, allowing researchers to address diverse scientific questions. We develop numerous methodological advancements, including efficient one-step estimation strategies, a Riemannian BFGS algorithm to solve a constrained manifold optimization problem, semiparametric efficiency bounds for causal estimands, minimax rates for estimators, and asymptotic normality results. We demonstrate our framework’s utility by applying it to a nationwide environmental health dataset to identify the optimal strategy for reducing adverse health outcomes associated with a PM2.5 chemical mixture.

1 Introduction

Understanding the causal effects of complex air exposure mixtures is crucial for effective public health policy, but traditional statistical methods face significant challenges. A key challenge is that individuals are exposed to a complex mixture of correlated substances. Single-pollutant analyses often fall short, as they fail to account for the confounding effect of pollutants and cannot capture potential interaction effects, leading to misleading policy recommendations (Bobb et al., 2015; Antonelli and Zigler, 2024). This necessitates a causal inference framework for multivariate continuous exposures. A primary obstacle in evaluating the causal effects of multivariate continuous exposures is the positivity assumption. This assumption requires that for any given set of covariates, the exposure levels of interest have nonzero density. In environmental mixtures, this assumption is frequently violated; certain combinations of pollutants may be physically implausible or absent from observational data (Peters et al., 2012). Consequently, estimating the effect of deterministic interventions, such as setting a pollutant mixture to a specific counterfactual level, becomes impossible without resorting to unreliable extrapolation (Antonelli and Zigler, 2024; Rudolph et al., 2025). In the presence of positivity violations, there are (at least) two potential solutions. The less frequently used approach is to modify the estimation method to be robust to extrapolation, such as “extrapolation-aware” inferential methods that explicitly bound or characterize the uncertainty when moving outside the data support (Pfister and Bühlmann, 2021). The more common approach, and the one we focus on throughout, is to modify the scientific question by redefining the estimand.

One useful strategy for modifying the estimand with continuous treatments involves identifying and estimating the derivative of the exposure-response function, then integrating this derivative to recover the full curve. By focusing on local effects first and developing bias-corrected, Neyman-orthogonal estimators for the derivative function, this “differentiate-then-integrate” approach can circumvent the need for a global positivity assumption and has been the subject of significant recent research (Zhang et al., 2023; Kallus and Oprescu, 2022; Rothenhäusler and Yu, 2021). Another alternative is to leverage instrumental variables, where recent advances have extended these methods to continuous treatments without relying on positivity to identify local effects (Dorn and Guo, 2024; Bruns and Kallus, 2024).

While these methods are powerful, a distinct set of flexible and policy-relevant approaches relies on stochastic interventions. These interventions generally define a policy that modifies the observed distribution of exposures rather than replacing it with a fixed value. Early applications appeared in epidemiology, motivated by the desire to define more realistic policies; one example is an intervention that truncates the exposure distribution, such as enforcing pollution levels below a certain cutoff (Taubman et al., 2009). This idea was later formalized in the statistical literature to provide a general framework for evaluating the population causal effect of shifting an entire exposure distribution (Díaz and van der Laan, 2012). Over the past decade, several distinct stochastic interventions have been developed. One approach is the shift intervention, which defines the post-intervention exposure as the natural exposure value shifted by a certain amount (Haneuse and Rotnitzky, 2013). A second, influential family of interventions is based on the natural value of treatment. These policies, often called Modified Treatment Policies (MTPs), define the counterfactual exposure as a function of the exposure that would have naturally occurred. This concept was pioneered in the context of dynamic treatment regimes (Robins et al., 2004) and later formalized for single time points (Young et al., 2011; Richardson and Robins, 2013). The framework has since been extended to handle longitudinal scenarios with continuous treatments (Díaz et al., 2023). There is ongoing research exploring the properties of generalized policies that depend on an individual’s natural value of treatment, for example using optimal transport to derive tighter bounds under unmeasured confounding (Kallus and Mbougwe, 2024). Generally speaking, this class of estimands relies on weaker positivity conditions than most deterministic estimands; however, these estimands still require positivity to hold for certain exposure values.

When positivity violations are a major concern, a potentially more robust alternative is the incremental intervention, which inherently respects the data’s support and therefore does not rely on positivity. This approach was initially developed for binary and longitudinal treatments through incremental shifts in propensity scores, and defines a new interventional distribution by tilting the exposure distribution with an exponential function in a specified direction (Kennedy, 2019). This concept has subsequently been generalized to univariate continuous treatments (Díaz and Hejazi, 2020; Schindl et al., 2024). Some recent work has criticized exponentially tilted estimands such as these (Schindl and Wasserman, 2025) for their less intuitive parameterization, their asymmetric reallocation of probability mass, and their less favorable asymptotic properties compared to certain alternatives. Regardless of their form, stochastic interventions typically present their own set of challenges. One key concern is that estimands identified under positivity violations may correspond to interventions that are not directly implementable, revealing a fundamental “interpretability-implementability tradeoff” (Rudolph et al., 2024). Furthermore, when comparing the effects of different stochastic interventions, one must ensure the comparisons are fair, a concept that related work in different contexts has begun to formalize to prevent misleading conclusions (McClean et al., 2024).

Nearly all work to date has focused on univariate treatments, which is insufficient for addressing problems raised in the analysis of environmental mixtures. In this work, we build on the existing literature by using exponentially tilted estimands within the multivariate treatment context. This extension to the multivariate setting is non-trivial and introduces several new, complex questions. First, the intervention parameter becomes a vector, creating an infinite space of possible intervention directions. This raises a variety of policy questions about how to shift all exposures at once and which direction is best. Second, a fair basis for comparing these different directional shifts is needed; interventions must be constrained to be of the same size for meaningful comparisons, and the definition of fair is not unique. We propose solutions to each of these issues and explore a variety of estimands within this framework that target different policy-relevant questions of interest in environmental epidemiology. We show that these different estimands vary in terms of how efficiently they can be estimated from the data, and we provide algorithms for finding optimal policies in terms of shifts in exposures that lead to the biggest reduction in adverse health outcomes. We provide theoretical support in terms of minimax rates that show how well the estimands can be estimated and how the difficulty of estimation intrinsically depends on the conditional covariance matrix of the exposures. We also develop efficient influence function based estimators that can achieve root-$n$ convergence and asymptotic normality while using complex machine learning estimators for nuisance function estimation.

2 Exposure Shifts under Exponential Tilts

2.1 Notation and Potential Outcomes under SUTVA

Let our observed data consist of $n$ independent and identically distributed samples $\{\boldsymbol{Z}_{i}\}_{i=1}^{n}$, drawn from some underlying distribution $P_{0}$. Each observation $\boldsymbol{Z}_{i}=(\boldsymbol{X}_{i},\boldsymbol{W}_{i},Y_{i})$ is composed of a $p$-dimensional vector of covariates $\boldsymbol{X}_{i}\in\mathcal{X}\subseteq\mathbb{R}^{p}$, a $q$-dimensional vector representing the environmental exposures or treatments, $\boldsymbol{W}_{i}=(W_{i1},\dots,W_{iq})\in\mathcal{W}\subseteq\mathbb{R}^{q}$, and a scalar outcome of interest $Y_{i}\in\mathbb{R}$. We denote the conditional density of the exposure mixture given the covariates as $f(\boldsymbol{w}\mid\boldsymbol{x})$. (Footnote 1: We use the terms treatment and exposure interchangeably throughout the manuscript. Also note that environmental mixture simply refers to a vector of environmental exposures.)

To formally define causal effects, we operate within the potential outcomes framework. For each individual $i$ and any potential exposure vector $\boldsymbol{w}\in\mathcal{W}$, we let $Y_{i}(\boldsymbol{w})$ denote the potential outcome that would have been observed had individual $i$ received exposure level $\boldsymbol{w}$. This notation relies on the Stable Unit Treatment Value Assumption (SUTVA; Rubin, 1980), which comprises two key principles:

  1. No interference: The potential outcome for one individual is unaffected by the exposure assignments of other individuals. That is, $Y_{i}(\boldsymbol{w})$ depends only on the exposure $\boldsymbol{w}$ assigned to individual $i$.

  2. No multiple versions of treatment: Treatment is well-defined in the sense that there are not two distinct treatments that lead to the same value of $\boldsymbol{W}_{i}$.

Importantly, this assumption ensures that an individual’s observed outcome corresponds to their potential outcome under their observed exposure. Formally, if an individual $i$ is observed to have exposure $\boldsymbol{W}_{i}$, their observed outcome is $Y_{i}=Y_{i}(\boldsymbol{W}_{i})$.

2.2 Estimands using Exponential Tilts

In this work, we extend the framework of exponential tilting incremental causal effects to the multivariate treatment setting. This type of stochastic treatment was first proposed for binary treatments (Kennedy, 2019) and later generalized to single continuous treatments (Díaz and Hejazi, 2020; Schindl et al., 2024). We adapt this formulation to define interventions on the entire $q$-dimensional exposure vector, $\boldsymbol{W}$. Given the conditional density of the exposure mixture, $f(\boldsymbol{w}\mid\boldsymbol{x})$, we define an exponentially tilted interventional density, $g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})$, indexed by a user-specified vector $\boldsymbol{\delta}\in\mathbb{R}^{q}$:

$$g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})=\frac{\exp(\boldsymbol{\delta}^{\top}\boldsymbol{w})f(\boldsymbol{w}\mid\boldsymbol{x})}{\int_{\mathcal{W}}\exp(\boldsymbol{\delta}^{\top}\boldsymbol{v})f(\boldsymbol{v}\mid\boldsymbol{x})\,d\boldsymbol{v}} \tag{1}$$

Here, the denominator is a normalizing constant that ensures $g_{\boldsymbol{\delta}}$ integrates to one, and $\boldsymbol{\delta}$ is a vector determining both the direction and magnitude of how the natural density is shifted.
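
To make the tilting operation concrete, the following is a minimal sketch that applies it to a discrete approximation of the conditional exposure density at a fixed covariate value. The function name `tilt_discrete` and the four-point toy support are illustrative assumptions and are not part of the paper; the sum over support points stands in for the integral in the denominator.

```python
import numpy as np

def tilt_discrete(support, probs, delta):
    """Exponentially tilt a discrete approximation of f(w | x).

    support: (m, q) array of exposure vectors
    probs:   (m,) baseline probability masses at those vectors
    Returns masses proportional to exp(delta^T w) * probs.
    """
    support = np.asarray(support, dtype=float)
    probs = np.asarray(probs, dtype=float)
    log_w = support @ np.asarray(delta, dtype=float)
    log_w -= log_w.max()              # stabilise the exponential
    tilted = np.exp(log_w) * probs
    return tilted / tilted.sum()      # discrete analogue of the normalizing constant

# Hypothetical toy support with q = 2 exposures.
support = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
probs = np.array([0.4, 0.3, 0.3, 0.0])   # the combination (1, 1) is never observed
tilted = tilt_discrete(support, probs, delta=[1.0, 0.0])
```

In this toy example the tilt moves mass toward larger values of the first exposure, while the unobserved combination keeps zero mass, since the tilt only reweights points already in the support.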

Our causal estimand of interest, the incremental effect, is the expected potential outcome under the stochastic intervention defined by the tilted density $g_{\boldsymbol{\delta}}$. We denote this estimand as $\psi(\boldsymbol{\delta})$:

$$\psi(\boldsymbol{\delta})=\mathbb{E}[Y^{g_{\boldsymbol{\delta}}}]$$

This represents the population average outcome if, for covariates $\boldsymbol{X}=\boldsymbol{x}$, each individual’s exposure were a random draw from the shifted distribution $g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})$. In order to identify this quantity from the observed data, we must make a standard no unmeasured confounding assumption: for all $\boldsymbol{w}\in\mathcal{W}$,

$$Y(\boldsymbol{w})\perp\boldsymbol{W}\mid\boldsymbol{X}.$$

We develop sensitivity analysis approaches to assess violations of this assumption in Section 5. Under SUTVA and the no unmeasured confounding assumption, this causal quantity is identified from the observed data distribution via:

$$\psi(\boldsymbol{\delta})=\int_{\mathcal{X}}\int_{\mathcal{W}}\mathbb{E}[Y\mid\boldsymbol{W}=\boldsymbol{w},\boldsymbol{X}=\boldsymbol{x}]\,g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}\,dP(\boldsymbol{x}) \tag{2}$$

where $P(\boldsymbol{x})$ denotes the marginal probability measure of the covariates $\boldsymbol{X}$. Note that because the support of $g_{\boldsymbol{\delta}}$ is identical to the support of $f$, we do not have to invoke any positivity assumptions and the intervention does not require us to estimate outcomes for exposure combinations that are never observed in the data.

As in Schindl et al. (2024), the parameter $\boldsymbol{\delta}$ can be interpreted as the constant gradient of the log-likelihood ratio between the interventional and observational densities:

$$\boldsymbol{\delta}=\frac{\partial}{\partial\boldsymbol{w}}\log\left(\frac{g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})}{f(\boldsymbol{w}\mid\boldsymbol{x})}\right).$$

This means that each component, $\delta_{j}$, quantifies the change in this log density ratio for an infinitesimal increase in the $j$-th exposure component, $w_{j}$. Intuitively, setting $\boldsymbol{\delta}=(1,0,\dots,0)^{\top}$ defines an intervention that tilts the distribution to favor higher values of the first exposure, $W_{1}$.

2.3 Different Exposure Shifts and Efficiency

Let $\mu(\boldsymbol{x},\boldsymbol{w})=\mathbb{E}[Y\mid\boldsymbol{X}=\boldsymbol{x},\boldsymbol{W}=\boldsymbol{w}]$. Following the same derivation as Schindl et al. (2024), but generalized to the multivariate setting, we obtain the following formula for the efficient influence function (EIF).

Proposition 1.

The efficient influence function of $\psi(\boldsymbol{\delta})$ under a nonparametric model is given by $\varphi(\boldsymbol{Z};\boldsymbol{\delta})=D_{Y}+D_{g,\mu}+D_{\psi}$ for

$$\begin{aligned}
D_{Y}&=\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{X})}\left(Y-\mu(\boldsymbol{X},\boldsymbol{W})\right)\\
D_{g,\mu}&=\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{X})}\left(\mu(\boldsymbol{X},\boldsymbol{W})-\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\right)\\
D_{\psi}&=\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]-\psi
\end{aligned}$$

Note that the subscript $g_{\boldsymbol{\delta}}$ denotes expectations with respect to the tilted exposure density. The variance of the EIF, $\mathrm{Var}(\varphi(\boldsymbol{Z};\boldsymbol{\delta}))$, is the semiparametric efficiency bound: no regular and asymptotically linear (RAL) estimator of $\psi(\boldsymbol{\delta})$ has a smaller asymptotic variance, and an efficient estimator attains it. A critical observation from the structure of $\varphi$ is the repeated appearance of the density ratio, $g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})/f(\boldsymbol{W}\mid\boldsymbol{X})$. When this ratio exhibits high variability, it inflates the variance of the first two components of the EIF, leading to less precise estimates of the causal effect. Therefore, to improve statistical efficiency for a given intervention strength, we could select an intervention direction $\boldsymbol{\delta}$ that minimizes the variance of this density ratio, $\mathrm{Var}(g_{\boldsymbol{\delta}}/f)$. The third term, on the other hand, represents the deviation between the average post-intervention outcome for a specific covariate group and the overall average post-intervention outcome. It quantifies how much higher or lower the expected post-intervention outcome for an individual with covariates $\boldsymbol{X}$ is compared to the overall population average.
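
As a rough illustration of how the EIF can be used, note that the first two terms combine so that the uncentered EIF is the density ratio times the residual $Y-\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]$, plus $\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]$. Averaging it gives a one-step-style estimate. The sketch below is not the paper's estimator: it assumes a hypothetical linear-Gaussian toy model (the names `A`, `alpha`, `beta` and all numerical values are made up) and plugs in the true nuisance functions, whereas in practice these would be estimated, for example with cross-fitting.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical toy model: scalar X, W | X ~ N(A * X, Sigma), linear outcome.
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
A = np.array([0.5, -0.3])
alpha, beta = 1.0, np.array([1.0, 2.0])
delta = np.array([0.3, -0.2])

X = rng.normal(size=n)
mu_x = np.outer(X, A)                                   # conditional mean of W
W = mu_x + rng.multivariate_normal(np.zeros(2), Sigma, size=n)
Y = alpha * X + W @ beta + rng.normal(size=n)

# Oracle nuisances for this Gaussian model (assumed known here).
ratio = np.exp((W - mu_x) @ delta - 0.5 * delta @ Sigma @ delta)   # g_delta / f
m_g = alpha * X + (mu_x + Sigma @ delta) @ beta                    # E_g[mu(X, W) | X]

# Uncentered EIF: D_Y + D_{g,mu} collapses to ratio * (Y - m_g).
eif_uncentered = ratio * (Y - m_g) + m_g
psi_hat = eif_uncentered.mean()
se_hat = eif_uncentered.std(ddof=1) / np.sqrt(n)

# In this toy model E[X] = 0, so the true value is beta^T Sigma delta.
psi_true = beta @ Sigma @ delta
```

The standard error here is the sample standard deviation of the uncentered EIF divided by $\sqrt{n}$, which reflects the role of $\mathrm{Var}(\varphi)$ described above.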

To gain analytical insight into this variance, we can let the exposures follow a multivariate normal distribution, $f(\boldsymbol{w}\mid\boldsymbol{x})\sim\mathcal{N}(\boldsymbol{\mu_{x}},\boldsymbol{\Sigma})$. Under this assumption, we have two key results. First, the tilted distribution $g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})$ is also normal, with a shifted mean:

$$g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})\sim\mathcal{N}(\boldsymbol{\mu_{x}}+\boldsymbol{\Sigma}\boldsymbol{\delta},\boldsymbol{\Sigma})$$

This provides a clear interpretation of the intervention: it is a shift of the mean of the exposure distribution in the direction $\boldsymbol{\Sigma}\boldsymbol{\delta}$. Second, the variance of the density ratio has a simple, closed-form expression:

$$\mathrm{Var}\left(\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{X})}\right)=\exp(\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta})-1$$

Minimizing this variance is equivalent to minimizing the quadratic form $\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta}$.
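
Both Gaussian results can be checked numerically. The sketch below is a Monte Carlo sanity check at a single covariate value, with arbitrary illustrative values for the mean, covariance, and tilt; it uses the fact that under normality the density ratio equals $\exp\{\boldsymbol{\delta}^{\top}(\boldsymbol{w}-\boldsymbol{\mu_{x}})-\tfrac{1}{2}\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta}\}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative (made-up) parameters for one covariate value x.
mu = np.array([1.0, -0.5])
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
delta = np.array([0.2, 0.1])

W = rng.multivariate_normal(mu, Sigma, size=200_000)   # draws from f
s2 = delta @ Sigma @ delta
ratio = np.exp((W - mu) @ delta - 0.5 * s2)            # g_delta / f at each draw

# Mean under g_delta via importance weighting versus the closed form.
tilted_mean_mc = (ratio[:, None] * W).mean(axis=0)
tilted_mean_cf = mu + Sigma @ delta

# Variance of the density ratio versus the closed form.
var_ratio_mc = ratio.var()
var_ratio_cf = np.exp(s2) - 1.0
```

Because the closed-form variance grows exponentially in $\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta}$, a check like this becomes noisy for large tilts, which is the same phenomenon that makes such interventions hard to estimate.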

The question of interest thus becomes an optimization problem: which direction $\boldsymbol{\delta}$ minimizes $\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta}$ subject to a constraint on the “strength” of the intervention? While various constraints are possible (e.g., fixing $\boldsymbol{\delta}^{\top}\boldsymbol{\delta}$), a particularly meaningful constraint is to fix the distance between the original and tilted distributions, for which the 2-Wasserstein distance provides a natural metric. We discuss the choice of this metric in the following section. For the multivariate normal case with the same covariance matrix, the squared Wasserstein distance has a simple closed-form expression: $d_{W}^{2}(f,g_{\boldsymbol{\delta}})=\|(\boldsymbol{\mu_{x}}+\boldsymbol{\Sigma}\boldsymbol{\delta})-\boldsymbol{\mu_{x}}\|^{2}=\|\boldsymbol{\Sigma}\boldsymbol{\delta}\|^{2}=\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}^{2}\boldsymbol{\delta}$.

If our goal then is to find a shift that we can estimate with a high degree of efficiency, we can solve the following:

$$\min_{\boldsymbol{\delta}}\quad\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta}\quad\text{subject to}\quad\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}^{2}\boldsymbol{\delta}=c^{2}$$

for some constant $c$ defining the size of the intervention. It is straightforward to show that this expression is minimized when $\boldsymbol{\delta}$ is chosen to be proportional to the eigenvector of $\boldsymbol{\Sigma}$ corresponding to its largest eigenvalue, $\lambda_{\max}$. Hence our finding demonstrates that for a fixed intervention strength (as measured by the Wasserstein distance), the most statistically efficient causal effect to estimate, at least in terms of this density ratio, is the one that corresponds to shifting the exposure mixture along its primary axis of variation. Note that this result only holds exactly under a multivariate normal distribution, but we have seen empirically that this choice of $\boldsymbol{\delta}$ typically leads to efficient estimates of $\psi(\boldsymbol{\delta})$ under a variety of exposure distributions.
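
One way to see the eigenvector result is the change of variables $\boldsymbol{u}=\boldsymbol{\Sigma}\boldsymbol{\delta}$: the problem becomes minimizing $\boldsymbol{u}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{u}$ subject to $\|\boldsymbol{u}\|=c$, which is solved by the eigenvector of $\boldsymbol{\Sigma}$ with the largest eigenvalue. The sketch below computes the resulting direction, assuming a positive definite covariance matrix; the function name `efficient_direction` and the example matrix are illustrative.

```python
import numpy as np

def efficient_direction(Sigma, c):
    """Minimise delta' Sigma delta subject to delta' Sigma^2 delta = c^2.

    Assumes Sigma is symmetric positive definite.
    """
    eigvals, eigvecs = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    lam_max, v_max = eigvals[-1], eigvecs[:, -1]
    return (c / lam_max) * v_max                    # scaled so that ||Sigma delta|| = c

# Illustrative covariance matrix and intervention size.
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
c = 0.5
delta_eff = efficient_direction(Sigma, c)
objective = delta_eff @ Sigma @ delta_eff           # equals c^2 / lambda_max
```

The attained objective is $c^{2}/\lambda_{\max}$, so a larger leading eigenvalue makes the most efficient direction easier to estimate for the same intervention size. The sign of the eigenvector is arbitrary, so $-\boldsymbol{\delta}$ attains the same variance.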

2.4 Fair Exposure Shifts under Fixed Gelbrich Distance

A primary motivation for developing our framework is to answer policy-relevant questions, such as identifying the optimal way to modify an environmental exposure mixture to achieve the greatest public health benefit. This naturally leads to an optimization problem: finding the intervention vector $\boldsymbol{\delta}$ that maximizes (or minimizes) the causal estimand $\psi(\boldsymbol{\delta})$. However, a naive comparison across all possible $\boldsymbol{\delta}$ vectors is misleading. For any intervention direction that yields a beneficial effect, one could simply increase the intervention’s strength, for example by scaling the magnitude of $\boldsymbol{\delta}$, to produce an arbitrarily larger (or smaller) value of $\psi(\boldsymbol{\delta})$. This would invariably lead to trivial solutions that correspond to extreme, unrealistic shifts in the exposure distribution. A more relevant policy question is “for a fixed amount of interventional effort, what is the best direction to apply that effort?”. To formalize this, we must first establish a fair basis for comparison, constraining our search to a set of interventions that are of the same size, where size captures the actual movement of the exposure values, not just the magnitude of the parameter $\boldsymbol{\delta}$. Note that the notion of fairness here is with respect to the size of the intervention being applied, which differs from typical notions of fairness in causal inference or related fields, such as recent work that defines a fairness criterion based on whether an estimand preserves the ordinal ranking of effects across all covariate subgroups (McClean et al., 2024).

In order to establish our notion of fairness between interventions, we can use the 2-Wasserstein distance, $d_{W}(f,g_{\boldsymbol{\delta}})$, which is widely studied and well suited to this purpose (Panaretos and Zemel, 2019). Intuitively, the Wasserstein distance measures the minimum cost of transporting the probability mass of one distribution to match another, akin to the cost of moving a pile of dirt. By fixing the Wasserstein distance, we ensure that the total amount of shift between the old and new distributions is fixed. The optimization problem thus becomes a meaningful search for the $\boldsymbol{\delta}$ that minimizes $\psi(\boldsymbol{\delta})$ among all possible intervention directions of a comparable magnitude. This distributional notion is widely used in both the operations research and statistics literatures (Mohajerin Esfahani and Kuhn, 2018; Blanchet and Murthy, 2019; Duchi and Namkoong, 2021; Gao and Kleywegt, 2023).

For analytical tractability, we rely on a well-established result providing a formula for the squared 2-Wasserstein distance $d_{W}^{2}$ based on the means ($\boldsymbol{\mu}_{1},\boldsymbol{\mu}_{2}$) and covariance matrices ($\boldsymbol{\Sigma}_{1},\boldsymbol{\Sigma}_{2}$) of two distributions $(P_{1},P_{2})$, which is referred to as the Gelbrich formula (Gelbrich, 1990). Crucially, this formula serves as a general lower bound for the true squared 2-Wasserstein distance between any two probability measures on $\mathbb{R}^{q}$ with finite second moments. Specifically, we have that

$$d_{W}^{2}\left(P_{1},P_{2}\right)\geq\left\|\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right\|^{2}+\operatorname{tr}\left(\boldsymbol{\Sigma}_{1}+\boldsymbol{\Sigma}_{2}-2\left(\boldsymbol{\Sigma}_{1}^{1/2}\boldsymbol{\Sigma}_{2}\boldsymbol{\Sigma}_{1}^{1/2}\right)^{1/2}\right):=d_{G}^{2}\left(P_{1},P_{2}\right)$$

Furthermore, this bound becomes an exact equality for any two distributions belonging to the same family of elliptically contoured distributions, a class which notably includes all multivariate normal distributions as well as uniform distributions on ellipsoids. The tractability of the Gelbrich formula has led many researchers to use it as a surrogate for the 2-Wasserstein distance, a method frequently employed to overcome the computational complexity associated with the Wasserstein metric (Kuhn et al., 2019; Hakobyan and Yang, 2024), since the empirical 2-Wasserstein distance lacks a closed-form formula and can only be approximated numerically (Panaretos and Zemel, 2020). Moreover, the lower bound it provides has been shown to be tight in a fairly general situation against an upper bound derived for the 2-Wasserstein distance (Biswas and Mackey, 2024; Papp and Sherlock, 2024), and is generally close to the 2-Wasserstein distance (Nguyen et al., 2021; Ye et al., 2024). We therefore believe that our use of the Gelbrich formula as an approximation to the true 2-Wasserstein distance is reasonable.
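
Because the Gelbrich formula depends only on means and covariance matrices, it is simple to compute. The following is a minimal sketch using a symmetric eigendecomposition for the matrix square roots; the helper names `psd_sqrt` and `gelbrich_sq` and the example values are illustrative and are not taken from the paper.

```python
import numpy as np

def psd_sqrt(M):
    """Square root of a symmetric positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh((M + M.T) / 2.0)      # symmetrise against round-off
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gelbrich_sq(mu1, Sigma1, mu2, Sigma2):
    """Squared Gelbrich distance from two means and two covariance matrices."""
    root1 = psd_sqrt(Sigma1)
    cross = psd_sqrt(root1 @ Sigma2 @ root1)
    mean_part = np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2)
    cov_part = np.trace(Sigma1 + Sigma2 - 2.0 * cross)
    return mean_part + cov_part

# Equal covariances: only the mean shift Sigma @ delta contributes.
mu = np.array([1.0, -0.5])
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
delta = np.array([0.2, 0.1])
shift = Sigma @ delta
size_sq = gelbrich_sq(mu, Sigma, mu + shift, Sigma)   # matches shift @ shift
```

In practice the means and covariances would be replaced by estimates from the baseline and tilted exposure distributions. When the two covariance matrices coincide, as in the Gaussian tilt above, the trace term vanishes and the distance reduces to the squared norm of the mean shift.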

This provides a computationally tractable and well-justified measure of intervention size. Accordingly, we measure intervention size through the Gelbrich formula applied to the marginal baseline and tilted laws of $\boldsymbol{W}$, and write the resulting quantity as $G(\boldsymbol{\delta})$. Comparing interventions on the level set $G(\boldsymbol{\delta})=c^{2}$ puts them on a common scale. Under a multivariate normal model this coincides with the squared 2-Wasserstein distance, while in more general settings it remains a rigorous lower bound. The optimization problem thus becomes a search for the $\boldsymbol{\delta}$ that minimizes $\psi(\boldsymbol{\delta})$ among all possible intervention directions of a comparable size, allowing us to disentangle the direction of an intervention from its size and enabling a principled exploration of which changes to an environmental exposure mixture are most beneficial or harmful.

2.5 Choice of Estimand

Having established a principled method for comparing interventions of equivalent magnitude, we are now positioned to define our estimands of interest. The framework of incremental effects, when extended to a multivariate setting, moves beyond simply estimating the effect of a single, pre-specified shift or simply examining how the causal effect depends on shift size. Instead, it allows us to explore the entire space of potential interventions to identify those that are most impactful.

2.5.1 Optimal Shifts

In environmental contexts, a key objective is to determine the most effective strategy for intervention given limited resources. For instance, how should regulators modify a complex mixture of air pollutants to achieve the greatest improvement in health outcomes? Our framework directly addresses this question by defining an estimand for the optimal policy shift. We define our primary estimand of interest, $\boldsymbol{\delta}^{*}_{c}$, as the intervention direction that minimizes the causal effect $\psi(\boldsymbol{\delta})$ over the set of all “fair shifts” of a given size $c$:

$$\boldsymbol{\delta}^{*}_{c}=\arg\min_{\boldsymbol{\delta}\in\mathcal{A}_{c}}\psi(\boldsymbol{\delta})$$

where

$$\mathcal{A}_{c}=\{\boldsymbol{\delta}:G(\boldsymbol{\delta})=c^{2}\}.$$

The corresponding value of this optimal policy is $\psi^{*}_{c}=\psi(\boldsymbol{\delta}^{*}_{c})$. To provide some intuition for the optimal $\boldsymbol{\delta}$, we can explore it analytically in a simplified setting. If we assume the outcome model is linear ($\mu(\boldsymbol{x},\boldsymbol{w})=\boldsymbol{\alpha}^{\top}\boldsymbol{x}+\boldsymbol{\beta}^{\top}\boldsymbol{w}$) and the exposure distribution is normal ($f\sim\mathcal{N}(\boldsymbol{\mu_{x}},\boldsymbol{\Sigma})$), the causal effect is minimized when $\boldsymbol{\delta}$ is proportional to $-\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}$. This result provides some intuition: the optimal direction is a balance between the direct effect of each pollutant on the outcome (the vector $\boldsymbol{\beta}$) and the correlation structure of the mixture (represented by $\boldsymbol{\Sigma}^{-1}$). This highlights a key insight of our multivariate approach: the covariance of the exposures is critical for determining not only which shifts are easiest to estimate, but also which shifts are most impactful. While the optimal policy shift is one primary goal, our framework is flexible and allows for the definition of other estimands that can be useful in answering other scientific questions of interest in environmental epidemiology. We now detail these in the following sections.
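
In the linear-Gaussian setting, $\psi(\boldsymbol{\delta})$ equals a constant plus $\boldsymbol{\beta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta}$, and the size constraint reduces to $\|\boldsymbol{\Sigma}\boldsymbol{\delta}\|=c$, so the minimizer moves the mean shift $\boldsymbol{\Sigma}\boldsymbol{\delta}$ directly against $\boldsymbol{\beta}$. The sketch below computes this direction under those assumptions; the function name `optimal_direction` and the numerical values are illustrative, and this closed form is not the general algorithm, which must handle non-Gaussian exposures and nonlinear outcome models.

```python
import numpy as np

def optimal_direction(Sigma, beta, c):
    """Minimise beta' Sigma delta subject to ||Sigma delta|| = c.

    Closed form for the linear-Gaussian special case only.
    """
    return -c * np.linalg.solve(Sigma, beta) / np.linalg.norm(beta)

# Illustrative values: two correlated exposures, both harmful (beta > 0).
Sigma = np.array([[1.0, 0.7], [0.7, 1.0]])
beta = np.array([1.0, 0.2])
c = 0.5
delta_opt = optimal_direction(Sigma, beta, c)
effect_change = beta @ Sigma @ delta_opt      # equals -c * ||beta||
```

With these numbers the second component of `delta_opt` is positive even though both exposures are harmful: because the exposures are positively correlated, tilting strongly against the first would also pull the second down, and the optimal tilt compensates so that the resulting mean shift is proportional to $-\boldsymbol{\beta}$.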

2.5.2 Single Exposure Shifts

A common goal in the analysis of environmental mixtures is to identify the components of the mixture that are most harmful. This has led to a wide range of statistical approaches aimed at performing exposure selection (Bobb et al., 2015; Antonelli et al., 2020; Ferrari and Dunson, 2020; Wei et al., 2020; Samanta and Antonelli, 2022). These approaches have largely disregarded the potential impacts of positivity violations when examining which exposures are most harmful, but we can adapt our approach here to target similar questions without the need for strong positivity assumptions. A natural choice is to let $\boldsymbol{\delta}$ be a vector of zeros with a single non-zero component, e.g., $\boldsymbol{\delta}_{j}=(0,\dots,t_{j},\dots,0)$. The magnitude $t_{j}$ would be chosen to satisfy the same fairness constraint, $G(\boldsymbol{\delta}_{j})=c^{2}$. While this estimand would encourage the $j$-th component of the exposures to increase more than others, in the presence of correlated exposures it will also shift the remaining exposures, and the extent to which this occurs is less clear. A distinct approach is to define a value of $\boldsymbol{\delta}$ such that the means of the exposures in the tilted distribution are unchanged, except for the $j$-th exposure, thereby isolating the impact of that single exposure. If the exposures follow a multivariate normal distribution, this is straightforward since the mean shift is given by $\boldsymbol{\Sigma}\boldsymbol{\delta}$. If we want to ensure that only the $j$-th exposure’s mean is shifted, then we could set $\boldsymbol{\delta}_{j}=\boldsymbol{\Sigma}^{-1}\boldsymbol{e}_{j}$ where $\boldsymbol{e}_{j}=(0,\dots,t_{j},\dots,0)$ and again $t_{j}$ is chosen to ensure a fair shift. Note that this analytical form only holds when the exposure follows a multivariate normal distribution, which will not be true in general. We can instead use this value of $\boldsymbol{\delta}_{j}$ as a starting point in a numerical algorithm that searches for the exponential tilt that only shifts the $j$-th component.
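
The difference between the two single-exposure constructions can be seen in a small Gaussian example, where the mean shift is $\boldsymbol{\Sigma}\boldsymbol{\delta}$ and the size constraint is $\|\boldsymbol{\Sigma}\boldsymbol{\delta}\|=c$. The sketch below is illustrative only: the function names and the covariance matrix are hypothetical, and the closed forms rely on the normality assumption.

```python
import numpy as np

def naive_single_shift(Sigma, j, c):
    """delta with one non-zero entry t_j, scaled so that ||Sigma delta|| = c."""
    unit = np.zeros(Sigma.shape[0])
    unit[j] = 1.0
    t_j = c / np.linalg.norm(Sigma[:, j])
    return t_j * unit

def isolated_single_shift(Sigma, j, c):
    """delta = Sigma^{-1} e_j with e_j = (0, ..., c, ..., 0): only mean j moves."""
    e_j = np.zeros(Sigma.shape[0])
    e_j[j] = c                       # here t_j = c gives a mean shift of norm c
    return np.linalg.solve(Sigma, e_j)

# Illustrative covariance for q = 3 correlated exposures.
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
c = 0.5
d_naive = naive_single_shift(Sigma, j=0, c=c)
d_iso = isolated_single_shift(Sigma, j=0, c=c)

mean_shift_naive = Sigma @ d_naive   # also moves exposures 2 and 3
mean_shift_iso = Sigma @ d_iso       # moves only exposure 1
```

Both tilts have the same size, but the first spreads its mean shift across all correlated exposures, while the second concentrates it on the targeted exposure by assigning non-zero tilt to the other components.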

Once these are obtained, we can calculate $\psi(\boldsymbol{\delta}_{j})$ for all $j=1,\dots,q$ to infer which exposures have the biggest effect on the outcome. This approach extends naturally to interactions or combined effects. For example, by setting two components of $\boldsymbol{\delta}$ to be non-zero, one could investigate whether simultaneously increasing two pollutants has a synergistic effect greater than the sum of their individual impacts. We point readers to recent work examining stochastic interventions to identify interactions (McCoy et al., 2023), as similar ideas could be applied within our framework, though we focus here on single-exposure effects.

2.5.3 Efficient Exposure Shifts

While the previous two estimands are arguably the most scientifically and policy-relevant in most applications, there are other considerations at play, such as how efficiently we can estimate the chosen estimand. Environmental applications in particular are known to have relatively small effect sizes, and therefore efficiency can be particularly important when sample sizes are not exceedingly large. As we established in Section 2.3, the statistical difficulty of estimating $\psi(\boldsymbol{\delta})$ is heavily driven by the variance of the density ratio, $\mathrm{Var}(g_{\boldsymbol{\delta}}/f)$. For a fixed intervention size, as defined by the Wasserstein distance, we showed that the most efficient intervention direction, in terms of minimizing this variance, is proportional to the first eigenvector of the exposure covariance matrix, $\boldsymbol{\Sigma}$. We call this direction $\boldsymbol{\delta}_{\text{eff}}$, and it is clear that this direction depends only on the correlation structure of the observed exposures. While this direction is only the most efficient one under normality of the exposures, we proceed with this choice in general, as we have found that it leads to efficient estimates in more general settings, and deriving the most efficient direction in general is a difficult task.

However, the optimal policy direction, $\boldsymbol{\delta}^{*}_{c}$, which maximizes the causal effect $\psi(\boldsymbol{\delta})$, depends on both the exposure distribution and the exposure-outcome relationship, $\mu(\boldsymbol{x},\boldsymbol{w})$. In simplified linear models, the direction of steepest ascent for the causal effect is proportional to $\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}$. In general, there is no reason for the direction of maximal statistical efficiency (related to the eigenvectors of $\boldsymbol{\Sigma}$) to be the same as the direction of the maximal causal effect (related to $\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}$). To see this, consider an intuitive example with two highly and positively correlated pollutants, where the first principal component is approximately in the $(1,1)$ direction, which corresponds to the direction that is “easiest” to estimate with the observed data. Now, suppose that only the second pollutant has a strong causal effect on the outcome (i.e., $\boldsymbol{\beta}\approx(0,\beta_{2})$). The optimal policy would primarily involve shifting the second pollutant. Our framework reveals an inherent tension: the policy we most want to evaluate (shifting the second pollutant alone) is statistically difficult because it moves in a direction against the data’s strong correlation structure, leading to a high-variance density ratio and thus high uncertainty in our estimate of its effect. In general, this presents a trade-off between interpretability and statistical efficiency, and users can decide based on features of their observed data which estimand to target.
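
The two-pollutant example can be checked numerically. The sketch below is an illustration under a Gaussian working model, which is our assumption for this example only: the exponential tilt of a $N(\boldsymbol{m},\boldsymbol{\Sigma})$ exposure is then a pure mean shift by $\boldsymbol{\Sigma}\boldsymbol{\delta}$, the intervention size reduces to $\|\boldsymbol{\Sigma}\boldsymbol{\delta}\|$, and $\mathrm{Var}(g_{\boldsymbol{\delta}}/f)=\exp(\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta})-1$. The correlation of 0.9, the size $c=0.5$, and the helper names are hypothetical choices.

```python
import numpy as np

Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])   # two highly, positively correlated pollutants
c = 0.5                                      # common intervention size

def ratio_variance(delta, Sigma):
    # Var(g_delta / f) for a Gaussian exposure with covariance Sigma
    return np.exp(delta @ Sigma @ delta) - 1.0

def scale_to_mean_shift(direction, Sigma, c):
    # rescale so that the Gaussian mean shift Sigma @ delta has Euclidean norm c
    return direction * c / np.linalg.norm(Sigma @ direction)

eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
delta_eff = scale_to_mean_shift(eigvecs[:, -1], Sigma, c)            # first principal direction
only_second = np.linalg.solve(Sigma, np.array([0.0, 1.0]))           # shifts the mean of pollutant 2 only
delta_second = scale_to_mean_shift(only_second, Sigma, c)

var_eff = ratio_variance(delta_eff, Sigma)
var_second = ratio_variance(delta_second, Sigma)
```

Under these assumptions the variance of the density ratio is $\exp(c^{2}/\lambda_{1})-1$ along the first principal direction, with $\lambda_{1}=1.9$, and $\exp\{c^{2}/(1-0.9^{2})\}-1$ when only the second pollutant is shifted. The same-sized intervention is therefore considerably harder to evaluate when it moves against the correlation structure.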

3 Estimation and Inference

3.1 Estimating $\psi(\boldsymbol{\delta})$

For a fixed intervention vector $\boldsymbol{\delta}$, the target estimand is the expected outcome under the tilted exposure distribution:

$$\psi(\boldsymbol{\delta})=\mathbb{E}\left[\int_{\mathcal{W}}\mu(\boldsymbol{X},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right],$$

where the outer expectation is over the marginal distribution of covariates $\boldsymbol{X}$. Therefore, a direct approach to estimating $\psi(\boldsymbol{\delta})$ is through a plug-in procedure. This method involves replacing each component in the expression above with a corresponding empirical estimate, which leads to the plug-in estimator:

$$\hat{\psi}_{\text{plugin}}(\boldsymbol{\delta})=\frac{1}{n}\sum_{i=1}^{n}\left[\int_{\mathcal{W}}\hat{\mu}(\boldsymbol{X}_{i},\boldsymbol{w})\hat{g}_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X}_{i})\,d\boldsymbol{w}\right].$$

However, this estimator faces certain practical and theoretical challenges. One issue is that the estimator’s performance is highly dependent on an accurate estimate of the multivariate conditional density $\hat{f}(\boldsymbol{w}\mid\boldsymbol{x})$, which is difficult to obtain for multivariate exposures and will likely converge at a slow rate. Furthermore, this approach does not leverage the structure of the efficient influence function and, as a result, is generally not statistically efficient. These limitations motivate alternative estimators with superior statistical properties.

3.1.1 One-step Estimation and Cross-fitting

To overcome the limitations of the plug-in estimator, we employ an approach rooted in semiparametric efficiency theory. This method uses the efficient influence function (EIF) to construct an estimator that is consistent under weaker conditions and achieves the optimal asymptotic variance. The EIF for the estimand $\psi(\boldsymbol{\delta})$ is given in Section 2 by $\varphi(\boldsymbol{Z};\boldsymbol{\delta})=D_{Y}+D_{g,\mu}+D_{\psi}$. In our notation, substituting the EIF components and algebraically combining the first two terms yields

$$\varphi(\boldsymbol{Z};\psi,\mu,f)=r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\Big\{Y-\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]\Big\}+\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]-\psi,$$

where $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})=g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})/f(\boldsymbol{w}\mid\boldsymbol{x})$ and expectations with subscript $g_{\boldsymbol{\delta}}$ are taken with respect to $g_{\boldsymbol{\delta}}(\cdot\mid\boldsymbol{X})$. The one-step estimation procedure corrects an initial plug-in estimate by adding the empirical average of the estimated EIF, which serves as a bias-correction term:

$$\hat{\psi}_{\text{onestep}}(\boldsymbol{\delta})=\tilde{\psi}_{\text{plugin}}(\boldsymbol{\delta})+\frac{1}{n}\sum_{i=1}^{n}\varphi\big(\boldsymbol{Z}_{i};\tilde{\psi}_{\text{plugin}},\hat{\mu},\hat{f}\big).$$

Because the plug-in term cancels with the $-\psi$ term of the EIF, the one-step estimator can be re-written in the following form:

$$\hat{\psi}_{\text{onestep}}(\boldsymbol{\delta})=\frac{1}{n}\sum_{i=1}^{n}\hat{r}_{\boldsymbol{\delta}}(\boldsymbol{W}_{i},\boldsymbol{X}_{i})\big[Y_{i}-\mathbb{E}_{\hat{g}_{\boldsymbol{\delta}}}[\hat{\mu}\mid\boldsymbol{X}_{i}]\big]+\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\hat{g}_{\boldsymbol{\delta}}}[\hat{\mu}\mid\boldsymbol{X}_{i}].$$

This estimator has a number of key features that we will describe in subsequent sections when we study the asymptotic properties of this estimator. To summarize, it allows the use of flexible machine learning methods for estimation of each of the nuisance functions, and it is asymptotically efficient given its construction based on the EIF. The practical performance of the estimator is highly dependent on an estimate of the conditional density of the exposures, given by $f$, which can be challenging with a moderate number of exposures. This density shows up in the expectations with respect to $g_{\boldsymbol{\delta}}$, but also in the ratio term, denoted by $r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})$. First, we detail how to estimate this quantity directly using estimates of $f$ and $\mu$, though in Section 3.1.2 we propose an approach to directly estimating this quantity using regression techniques that may not have the same theoretical properties, but can have good finite-sample performance when there are a moderate number of exposures.

To ensure the desirable asymptotic properties of our final estimator for $\psi(\boldsymbol{\delta})$, the entire estimation procedure is embedded within a cross-fitting framework. Cross-fitting is a technique designed to eliminate a key source of bias that arises when the same data are used both to train nuisance parameters and to evaluate the final parameter of interest. This prevents overfitting inherent in data-adaptive methods (Zheng and van der Laan, 2011; Chernozhukov et al., 2018) and has been shown to improve performance for related EIF-based estimators. The procedure is implemented as follows. The data are randomly partitioned into $K$ disjoint folds of approximately equal size. For each fold $k\in\{1,\dots,K\}$, we treat it as the evaluation set and use the remaining $K-1$ folds as the training set. On the training set, we fit our nuisance models, including the outcome regression $\hat{\mu}_{-k}(\boldsymbol{x},\boldsymbol{w})$ and the exposure density $\hat{f}_{-k}(\boldsymbol{w}\mid\boldsymbol{x})$. These models, trained on data not in fold $k$, are then used to compute the components of the efficient influence function for every observation $i$ only within the evaluation fold $k$. The final estimate, $\hat{\psi}(\boldsymbol{\delta})$, is then constructed by solving the estimating equation aggregated across all $K$ folds, ensuring that the nuisance function estimates for any given observation are always independent of that observation itself. This approach to nuisance function estimation has theoretical advantages, since it allows for less restrictive assumptions on the nuisance functions, and it has important finite-sample benefits, as it tends to reduce the bias in the estimated causal effects that is induced by overfitting.
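
The fold structure described above can be sketched as follows. This is a schematic illustration rather than our implementation: `cross_fit_onestep` and the callback `fit_nuisance` are hypothetical names, the nuisance learner is left to the user, and the Wald-type standard error based on the empirical variance of the estimated EIF is included as an assumption motivated by the asymptotic normality results discussed later.

```python
import numpy as np

def cross_fit_onestep(Y, W, X, fit_nuisance, K=5, seed=0):
    """Cross-fitted one-step estimate of psi(delta).

    fit_nuisance(Y_tr, W_tr, X_tr) must return a callable predict(W_ev, X_ev)
    that gives (r_hat, m_hat): the estimated density ratio and the estimated
    tilted conditional mean E_g[mu | X] at the evaluation points.
    """
    n = len(Y)
    folds = np.random.default_rng(seed).permutation(n) % K   # near-equal fold sizes
    phi = np.empty(n)                                        # uncentered EIF values
    for k in range(K):
        ev, tr = folds == k, folds != k
        predict = fit_nuisance(Y[tr], W[tr], X[tr])          # trained without fold k
        r_hat, m_hat = predict(W[ev], X[ev])                 # evaluated on fold k only
        phi[ev] = r_hat * (Y[ev] - m_hat) + m_hat
    psi_hat = phi.mean()                                     # solves the pooled estimating equation
    se = phi.std(ddof=1) / np.sqrt(n)                        # Wald-type standard error (assumption)
    return psi_hat, se

def null_tilt_nuisance(Y_tr, W_tr, X_tr):
    # Toy nuisance for delta = 0: the density ratio is 1 and the tilted mean is
    # crudely approximated by the training-fold mean of Y.
    y_bar = Y_tr.mean()
    return lambda W_ev, X_ev: (np.ones(len(W_ev)), np.full(len(W_ev), y_bar))

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
W = X @ np.array([[0.5, 0.2], [0.1, 0.4]]) + rng.normal(size=(500, 2))
Y = W.sum(axis=1) + rng.normal(size=500)
psi_hat, se = cross_fit_onestep(Y, W, X, null_tilt_nuisance)
```

With the null tilt the correction term cancels the crude outcome model exactly, so the estimate reduces to the sample mean of `Y`, which gives a simple sanity check of the fold bookkeeping. A real analysis would replace `null_tilt_nuisance` with learners for $\mu$ and $f$, or for the regression quantities of Section 3.1.2.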

3.1.2 Direct estimation using regression approaches

When there are a moderate number of exposures, estimation of $\psi(\boldsymbol{\delta})$ becomes increasingly challenging, even for the efficient one-step estimators described above, due to the inherent difficulty of estimating a multivariate conditional distribution $f(\boldsymbol{w}\mid\boldsymbol{x})$. For this reason we also explore an approach, first described in Schindl et al. (2024), that does not require estimation of the exposure density at all. This can be advantageous in finite samples, particularly when estimation of the density ratio $r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})$ is unstable. The first key insight is that the density ratio can be written as

$$r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})=\frac{\exp(\boldsymbol{\delta}^{\top}\boldsymbol{w})}{\int_{\mathcal{W}}\exp(\boldsymbol{\delta}^{\top}\boldsymbol{v})f(\boldsymbol{v}\mid\boldsymbol{x})\,d\boldsymbol{v}}=\frac{\exp(\boldsymbol{\delta}^{\top}\boldsymbol{w})}{\mathbb{E}[\exp(\boldsymbol{\delta}^{\top}\boldsymbol{W})\mid\boldsymbol{X}=\boldsymbol{x}]}=\frac{\exp(\boldsymbol{\delta}^{\top}\boldsymbol{w})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})},$$

which shows the density ratio can be estimated by estimating the conditional expectation $\nu_{\boldsymbol{\delta}}(\boldsymbol{X})$. This does not require the conditional density of the exposures and can be carried out using flexible, univariate regression techniques. Further, the other key component of our one-step estimator can be written as

$$\int_{\mathcal{W}}\mu(\boldsymbol{X},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}=\frac{\int_{\mathcal{W}}\mu(\boldsymbol{X},\boldsymbol{w})\exp(\boldsymbol{\delta}^{\top}\boldsymbol{w})f(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}}{\int_{\mathcal{W}}\exp(\boldsymbol{\delta}^{\top}\boldsymbol{v})f(\boldsymbol{v}\mid\boldsymbol{X})\,d\boldsymbol{v}}=\frac{\mathbb{E}[\exp(\boldsymbol{\delta}^{\top}\boldsymbol{W})\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]}{\nu_{\boldsymbol{\delta}}(\boldsymbol{X})}=\frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{X})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{X})}.$$

This shows that this quantity can be estimated by taking the ratio of two quantities, each of which can be estimated using flexible, univariate regression techniques. This was introduced in Schindl et al. (2024), though they did not implement this estimator, as they found that taking the ratio of the estimates of these two conditional expectations made it unstable. They were working, however, in the univariate exposure setting where estimating the conditional density of the exposures is more straightforward. In our setting, with multiple exposures, conditional density estimation can be very difficult, whereas this approach relies on univariate prediction models only. Additionally, note that one can take a third strategy, which is to still estimate the conditional density of the exposures and use it whenever calculating $\int_{\mathcal{W}}\mu(\boldsymbol{X},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}$, but then use the regression approach described above for estimating the density ratio to improve stability of our estimates. While potentially useful for estimation, these estimators that obviate the need to estimate the exposure density will not inherit the same theoretical properties as the one-step estimator that directly uses estimates of $f$ and $\mu$, which we study theoretically in Section 4. They may, however, produce better finite-sample performance, which we study in the simulation studies in Section 6 across a wide range of exposure distributions.
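
As a rough illustration of the regression route, the sketch below estimates $\nu_{\boldsymbol{\delta}}$ and $\eta_{\boldsymbol{\delta}}$ by least squares and plugs them into the one-step form. Several choices here are assumptions made for brevity and are not part of our method: a quadratic basis in a scalar covariate stands in for a flexible learner, the outcome model is linear, the fit is done in-sample without cross-fitting, and $\hat{\nu}_{\boldsymbol{\delta}}$ is floored at a small positive value as a safeguard. All function names are hypothetical.

```python
import numpy as np

def poly_basis(X):
    # hypothetical stand-in for a flexible learner: quadratic basis in a scalar covariate
    return np.column_stack([np.ones_like(X), X, X**2])

def ols_fit_predict(B, target):
    # least-squares fit of target on the columns of B, returning fitted values
    coef, *_ = np.linalg.lstsq(B, target, rcond=None)
    return B @ coef

def regression_onestep(Y, W, X, delta):
    tilt = np.exp(W @ delta)                               # exp(delta' W_i)
    D = np.column_stack([np.ones_like(X), X, W])           # design for a linear outcome model
    mu_hat = ols_fit_predict(D, Y)                         # mu_hat(X_i, W_i)
    B = poly_basis(X)
    nu_hat = np.maximum(ols_fit_predict(B, tilt), 1e-3)    # nu_delta(X_i); floor is a safeguard
    eta_hat = ols_fit_predict(B, tilt * mu_hat)            # eta_delta(X_i)
    r_hat = tilt / nu_hat                                  # density ratio without any density estimate
    m_hat = eta_hat / nu_hat                               # tilted conditional mean of mu
    return np.mean(r_hat * (Y - m_hat) + m_hat)

# Synthetic data: W | X ~ N(X b, I) and Y linear in (W, X), so psi(delta) = beta' delta
rng = np.random.default_rng(1)
n = 20000
X = rng.normal(size=n)
W = np.outer(X, [0.5, 0.5]) + rng.normal(size=(n, 2))
Y = W @ np.array([1.0, 2.0]) + X + rng.normal(size=n)
delta = np.array([0.2, 0.1])
psi_hat = regression_onestep(Y, W, X, delta)               # simulation truth: 0.4
psi_null = regression_onestep(Y, W, X, np.zeros(2))        # delta = 0 recovers the mean of Y
```

Only univariate regressions on $\boldsymbol{X}$ are fitted, which is the appeal of this route. The ratio $\hat{\eta}_{\boldsymbol{\delta}}/\hat{\nu}_{\boldsymbol{\delta}}$ can become unstable where $\hat{\nu}_{\boldsymbol{\delta}}$ is close to zero, which is the instability noted above.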

3.2 Optimizing over a Manifold

Our primary estimand, the optimal policy shift $\boldsymbol{\delta}^{*}_{c}$, is defined as the solution to

$$\min_{\boldsymbol{\delta}\in\mathcal{M}}\psi(\boldsymbol{\delta}),$$

where $\mathcal{M}:=\{\boldsymbol{\delta}\in\mathbb{R}^{d}:G(\boldsymbol{\delta})=c^{2}\}$ and $G$ is the Gelbrich quantity defined in Section 2.4. We assume $c^{2}$ is a regular value of $G$ (equivalently, $\nabla G(\boldsymbol{\delta})\neq\boldsymbol{0}$ for all $\boldsymbol{\delta}\in\mathcal{M}$), so that $\mathcal{M}$ is a smooth embedded hypersurface. This regularity holds generically and is verified for the Gelbrich constraint in Appendix B. Standard Euclidean optimization algorithms such as gradient descent are not directly applicable, as a gradient step taken from a point on the manifold will likely lead to a point outside of the feasible set.

To properly solve this problem, we must recognize that the constraint set forms a smooth, curved space known as a Riemannian manifold. Optimization on manifolds requires specialized techniques that generalize concepts from Euclidean optimization. The core idea is to perform optimization steps within the tangent space at each point on the manifold, which is a local linear approximation of the manifold, and then map the result back onto the manifold itself. For our specific constrained problem, we employ a Riemannian Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. This is a quasi-Newton method adapted for optimization on manifolds, which generally offers faster convergence than simpler first-order methods. In practice, the method is implemented with two computationally light ingredients: a projection-based retraction back to the level set, and an inexpensive projection-based vector transport between tangent spaces. The key components of this algorithm for our problem are:

  1. Preliminary definitions: To define the tangent space, take any smooth curve on $\mathcal{M}$ that passes through $\boldsymbol{\delta}$,

    $$\gamma:(-\varepsilon,\varepsilon)\rightarrow\mathcal{M},\quad\gamma(0)=\boldsymbol{\delta}.$$

    Then

    $$\gamma^{\prime}(0)\in\mathbb{R}^{d}$$

    is a valid tangent vector. The tangent space at a point $\boldsymbol{\delta}$ is the collection of all such vectors:

    $$T_{\boldsymbol{\delta}}\mathcal{M}=\left\{\gamma^{\prime}(0):\gamma(0)=\boldsymbol{\delta},\ \gamma(t)\in\mathcal{M}\text{ for all }t\right\}.$$

    For a level set $\mathcal{M}=\{\boldsymbol{\delta}:G(\boldsymbol{\delta})=c^{2}\}$, we also have the equivalent characterization $T_{\boldsymbol{\delta}}\mathcal{M}=\text{Null}(\nabla G(\boldsymbol{\delta})^{\top})$. The BFGS algorithm is a quasi-Newton method that finds an optimum by maintaining and iteratively updating an approximation $\mathbf{B}$ of the inverse Hessian matrix. In Euclidean space $\mathbb{R}^{d}$, the standard update formula is:

    $$\mathbf{B}_{k+1}=\left(\mathbf{I}-\rho_{k}\mathbf{a}_{k}\mathbf{b}_{k}^{\top}\right)\mathbf{B}_{k}\left(\mathbf{I}-\rho_{k}\mathbf{b}_{k}\mathbf{a}_{k}^{\top}\right)+\rho_{k}\mathbf{a}_{k}\mathbf{a}_{k}^{\top},\qquad\rho_{k}=\frac{1}{\mathbf{b}_{k}^{\top}\mathbf{a}_{k}},$$

    where:

    • $\mathbf{B}_{k+1}$ is the updated approximation of the inverse Hessian, which is the target of the update.

    • $\mathbf{B}_{k}$ is the current approximation of the inverse Hessian.

    • $\mathbf{a}_{k}$ is the change in position (step), defined as $\mathbf{a}_{k}=\boldsymbol{\delta}_{k+1}-\boldsymbol{\delta}_{k}$.

    • $\mathbf{b}_{k}$ is the change in gradient, defined as $\mathbf{b}_{k}=\nabla\psi(\boldsymbol{\delta}_{k+1})-\nabla\psi(\boldsymbol{\delta}_{k})$.

  2. Riemannian Gradient: The standard (Euclidean) gradient, $\nabla\psi(\boldsymbol{\delta})$, must be projected onto the tangent space at the current point $\boldsymbol{\delta}$ to find the direction of steepest ascent along the manifold. Let $\nabla G(\boldsymbol{\delta})$ denote the Euclidean normal to the level set at $\boldsymbol{\delta}$, and let $\boldsymbol{n}(\boldsymbol{\delta})=\nabla G(\boldsymbol{\delta})/\|\nabla G(\boldsymbol{\delta})\|$ be the corresponding unit normal vector. The orthogonal projection matrix onto the tangent space is defined as $P_{\boldsymbol{\delta}}:=I-\boldsymbol{n}(\boldsymbol{\delta})\boldsymbol{n}(\boldsymbol{\delta})^{\top}$. Consequently, the Riemannian gradient under the induced Euclidean metric is given by

    $$\operatorname{grad}\psi(\boldsymbol{\delta})=P_{\boldsymbol{\delta}}\nabla\psi(\boldsymbol{\delta}).$$

  3. Retraction: Even though a Riemannian gradient step lies in the tangent space at the current point (which is essentially a linear approximation of the manifold at that point), it still moves us off the manifold. After finding a search direction $\boldsymbol{v}$, we need a retraction to map the point from the tangent space back onto the manifold. For the level-set manifold $\mathcal{M}$, we take the projection retraction $R_{\boldsymbol{\delta}}(\boldsymbol{\xi})=\Pi(\boldsymbol{\delta}+\boldsymbol{\xi})$, where $\Pi$ is the orthogonal projection onto $\mathcal{M}$, which is a canonical construction for embedded manifolds. In our code, $\Pi$ is computed approximately by a single normal-direction correction, which can be interpreted as a projection-like update that preserves the local first-order accuracy required of a retraction while substantially reducing computational cost.

  4. Vector Transport: Vector transport is required to compare tangent vectors that live in different tangent spaces across iterates. For our embedded level-set manifold, we use the simple projection transport:

    $$\widetilde{\mathcal{T}}_{\boldsymbol{\delta}\to\boldsymbol{\delta}_{+}}(\boldsymbol{\zeta}):=P_{\boldsymbol{\delta}_{+}}\boldsymbol{\zeta},$$

    namely, orthogonally projecting an ambient vector onto the new tangent space. Such a transport is generally not an isometry, even though it is often computationally attractive. This transport is also used to move tangent-space quantities between iterates when forming the RBFGS update.

To summarize, the iterative process of the Riemannian BFGS algorithm is outlined as follows.

Algorithm 1 Riemannian BFGS for Optimal Policy Shift
Initialize: $k=0$, choose $\boldsymbol{\delta}_{0}$ on the manifold (s.t. $G(\boldsymbol{\delta}_{0})=c^{2}$).
Compute Euclidean gradient $\nabla\psi(\boldsymbol{\delta}_{0})$ and Riemannian gradient $\operatorname{grad}\psi(\boldsymbol{\delta}_{0})=P_{\boldsymbol{\delta}_{0}}\nabla\psi(\boldsymbol{\delta}_{0})$.
Initialize inverse Hessian approximation $\boldsymbol{B}_{0}=\boldsymbol{I}$.
while not converged (e.g., $\|\operatorname{grad}\psi(\boldsymbol{\delta}_{k})\|>\epsilon$) do
  Compute search direction: $\boldsymbol{p}_{k}=-\boldsymbol{B}_{k}\operatorname{grad}\psi(\boldsymbol{\delta}_{k})$.
  Perform a weak Wolfe line search along the retraction curve to choose the step size $\alpha_{k}$.
  Update point via retraction: $\boldsymbol{\delta}_{k+1}=R_{\boldsymbol{\delta}_{k}}(\alpha_{k}\boldsymbol{p}_{k})$.
  Compute new Riemannian gradient $\operatorname{grad}\psi(\boldsymbol{\delta}_{k+1})$.
  Update $\boldsymbol{B}_{k+1}$: apply the RBFGS update and transport $\boldsymbol{B}_{k}$ to the new tangent space as needed.
  $k\leftarrow k+1$.
end while
return $\boldsymbol{\delta}_{k}$
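
The loop in Algorithm 1 can be sketched in a few dozen lines. The following is a simplified illustration, not our implementation, and it departs from Algorithm 1 in ways we state explicitly: an Armijo backtracking search replaces the weak Wolfe search, the retraction repeats the normal-direction correction several times rather than once, and the BFGS update is skipped whenever the curvature condition fails. The constraint $G(\boldsymbol{\delta})=\boldsymbol{\delta}^{\top}\boldsymbol{A}\boldsymbol{\delta}$ and the linear objective are hypothetical stand-ins for the Gelbrich quantity and $\psi$, chosen because the constrained minimizer is then available in closed form.

```python
import numpy as np

def rbfgs_level_set(psi, grad_psi, G, grad_G, delta0, c2, tol=1e-6, max_iter=200):
    d = len(delta0)
    I = np.eye(d)

    def proj(delta):                         # P_delta = I - n n^T
        n = grad_G(delta)
        n = n / np.linalg.norm(n)
        return I - np.outer(n, n)

    def retract(delta, xi, n_corr=20):       # tangent step, then pull back to {G = c2}
        y = delta + xi
        for _ in range(n_corr):              # repeated normal-direction corrections
            g = grad_G(y)
            y = y - (G(y) - c2) / (g @ g) * g
        return y

    delta = retract(np.asarray(delta0, dtype=float), np.zeros(d))
    P = proj(delta)
    rg = P @ grad_psi(delta)                 # Riemannian gradient
    B = I.copy()                             # inverse-Hessian approximation
    for _ in range(max_iter):
        if np.linalg.norm(rg) < tol:
            break
        p = -P @ (B @ rg)
        if rg @ p >= 0:                      # safeguard: fall back to steepest descent
            B, p = I.copy(), -rg
        alpha, f0, slope = 1.0, psi(delta), rg @ p
        # Armijo backtracking (stand-in for the weak Wolfe search)
        while psi(retract(delta, alpha * p)) > f0 + 1e-4 * alpha * slope and alpha > 1e-12:
            alpha *= 0.5
        new = retract(delta, alpha * p)
        P_new = proj(new)
        rg_new = P_new @ grad_psi(new)
        s = P_new @ (alpha * p)              # projection transport of the step
        y = rg_new - P_new @ rg              # gradient change in the new tangent space
        B = P_new @ B @ P_new                # transport the inverse-Hessian approximation
        if y @ s > 1e-10 * np.linalg.norm(y) * np.linalg.norm(s):
            rho = 1.0 / (y @ s)
            V = I - rho * np.outer(s, y)
            B = V @ B @ V.T + rho * np.outer(s, s)   # inverse BFGS update
        delta, P, rg = new, P_new, rg_new
    return delta

# Toy problem: minimize b' delta subject to delta' A delta = c2
A = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]])
b = np.array([1.0, -2.0, 0.5])
c2 = 1.0
delta_hat = rbfgs_level_set(
    psi=lambda d: b @ d, grad_psi=lambda d: b,
    G=lambda d: d @ A @ d, grad_G=lambda d: 2 * A @ d,
    delta0=np.array([1 / np.sqrt(2), 0.0, 0.0]), c2=c2)
delta_star = -np.sqrt(c2) * np.linalg.solve(A, b) / np.sqrt(b @ np.linalg.solve(A, b))
```

Here `delta_star` is the closed-form minimizer obtained from the Lagrangian conditions, against which the iterate `delta_hat` can be compared. For the actual problem, `psi` and `grad_psi` would be replaced by estimates of $\psi(\boldsymbol{\delta})$ and its gradient, and `G` by the Gelbrich quantity.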

When the objective is to minimize $\psi(\boldsymbol{\delta})$ over the Gelbrich level set, convergence guarantees for RBFGS depend on the line-search conditions, the regularity of the objective along retraction curves, and the choice of vector transport. Under the assumptions stated in Appendix B, the global convergence result in Theorem 4.2 of Huang et al. (2018) yields $\liminf_{k\to\infty}\|\operatorname{grad}\psi(\boldsymbol{\delta}_{k})\|=0$ for the RBFGS scheme equipped with a suitable isometric vector transport and weak Wolfe line search. Appendix B verifies the geometric regularity of the Gelbrich level set, establishes objective regularity along the retraction curve, and invokes the existence of a weak Wolfe step. The proof therefore uses a fully rigorous projection retraction and the scaled isometric vector transport $\mathcal{T}^{S}$ (Huang et al., 2015), together with an explicit boundary condition on $G$ that yields compactness of the relevant level set. These stricter choices, however, are made only for the convergence proof. In the actual implementation we use a computationally cheaper projection-type retraction and the orthogonal projection transport $\widetilde{\mathcal{T}}$, together with safeguard checks on the RBFGS update. This separation between the proof device and the working algorithm is standard in large-scale Riemannian optimization (Ring and Wirth, 2012; Huang et al., 2018). In our numerical work, the simplified implementation converged stably to a local solution while substantially reducing computation time.

In embedded-manifold settings, the projection transport $\boldsymbol{\zeta}\mapsto P_{\boldsymbol{\delta}_{+}}\boldsymbol{\zeta}$ can significantly reduce computational time without degrading practical convergence behavior. This phenomenon is well documented in the RBFGS experiments of Qi et al. (2010), where projection-based transport on the Stiefel manifold reduces wall-clock time from 304.0 seconds to 24.0 seconds and also decreases the iteration count from 175 to 83 in the Procrustes problem. Moreover, Huang et al. (2018) compare combinations that satisfy sufficient conditions used in global convergence theory with cheaper combinations that do not, and find that the numbers of function and gradient evaluations (as well as vector transports) are not significantly affected by the choice of retraction and transport; as a result, lower-complexity retractions and transports can be markedly faster in terms of computational time. In our applications, the simplified implementation described above converges reliably and is computationally fast.

4 Semiparametric Efficiency Theory

This section explores the fundamental limits of estimation for the multivariate incremental effect, $\psi(\boldsymbol{\delta})$. The analysis characterizes how the statistical difficulty of this problem depends not only on the sample size $n$, but also on the intervention vector $\boldsymbol{\delta}$ and its interplay with the exposure covariance structure $\boldsymbol{\Sigma}$. We generalize and further develop the theoretical analysis for univariate continuous exposures of Schindl et al. (2024) to our multivariate setting.

4.1 Minimax Lower Bound

In this section, we establish a minimax lower bound for the incremental effect

$$\theta(\boldsymbol{\delta}):=\psi(\boldsymbol{\delta})-\psi(\boldsymbol{0}),$$

under a flexible nonparametric model. Minimax lower bounds benchmark what is statistically achievable without imposing additional structure: they show that no estimator can attain uniformly smaller risk over a given model class. Before presenting the main results, we must first define a number of important terms. First, we refer to the centered version of the exposures as

$$\tilde{\boldsymbol{W}}:=\boldsymbol{W}-\operatorname{E}[\boldsymbol{W}\mid\boldsymbol{X}].$$

To analyze the asymptotic variance of the effect difference $\theta(\boldsymbol{\delta})$, we decompose the problem into an incremental component $\boldsymbol{h}(\boldsymbol{Z})$ and a baseline influence function $\varphi_{0}(\boldsymbol{Z})$. We define

$$\boldsymbol{h}(\boldsymbol{Z}):=\tilde{\boldsymbol{W}}\Big(Y-\operatorname{E}[\mu\mid\boldsymbol{X}]\Big)-\operatorname{E}\big[\operatorname{E}[\mu\tilde{\boldsymbol{W}}\mid\boldsymbol{X}]\big],\qquad\varphi_{0}(\boldsymbol{Z}):=Y-\psi(\boldsymbol{0}).$$

Next, we calculate the covariance structure of $\boldsymbol{h}(\boldsymbol{Z})$. A key part of this variance arises from the outcome regression. To ensure this term represents a proper covariance (centered moment), we compute the raw second moment and subtract the outer product of the mean:

$$\boldsymbol{\Sigma}_{\mu,\mathrm{full}}:=\operatorname{E}\Big[\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top}\big(\mu-\operatorname{E}[\mu\mid\boldsymbol{X}]\big)^{2}\Big]-\operatorname{E}\big[\operatorname{E}[\mu\tilde{\boldsymbol{W}}\mid\boldsymbol{X}]\big]\operatorname{E}\big[\operatorname{E}[\mu\tilde{\boldsymbol{W}}\mid\boldsymbol{X}]\big]^{\top}.$$

Then, letting $\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}:=\operatorname{E}\big[\operatorname{Var}(Y\mid\boldsymbol{X},\boldsymbol{W})\,\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top}\big]$ be the residual variance component, we show in the appendix that we can write the full covariance matrix of $\boldsymbol{h}(\boldsymbol{Z})$ as

$$\boldsymbol{H}:=\operatorname{Cov}\big(\boldsymbol{h}(\boldsymbol{Z})\big)=\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}+\boldsymbol{\Sigma}_{\mu,\mathrm{full}}.$$

Finally, estimating the incremental effect $\theta(\boldsymbol{\delta})$ requires distinguishing the variation in the shift direction from the variation in the baseline. Since the gradient component $\boldsymbol{h}(\boldsymbol{Z})$ and the baseline influence $\varphi_{0}(\boldsymbol{Z})$ are typically correlated, the fundamental difficulty of estimating their difference is governed by the variability of $\boldsymbol{h}$ that cannot be explained by $\varphi_{0}$. Mathematically, this corresponds to the residual variance of $\boldsymbol{h}$ after linearly projecting it onto the space spanned by $\varphi_{0}$, yielding the Schur-complement covariance

$$\boldsymbol{\Gamma}:=\boldsymbol{H}-\frac{\operatorname{Cov}(\boldsymbol{h},\varphi_{0})\,\operatorname{Cov}(\boldsymbol{h},\varphi_{0})^{\top}}{\operatorname{Var}(\varphi_{0})}\succeq\boldsymbol{0}.$$
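
The Schur complement above has a simple sample analogue. The sketch below is an illustration on synthetic draws, not the true $\boldsymbol{h}(\boldsymbol{Z})$ and $\varphi_{0}(\boldsymbol{Z})$ from any data set, and `schur_gamma` is a hypothetical helper name. It computes $\boldsymbol{\Gamma}$ from $n$ draws and can be checked against the covariance of the residuals from regressing each coordinate of $\boldsymbol{h}$ on $\varphi_{0}$.

```python
import numpy as np

def schur_gamma(h, phi0):
    """Sample version of Gamma = Cov(h) - Cov(h, phi0) Cov(h, phi0)^T / Var(phi0)."""
    hc = h - h.mean(axis=0)              # centered gradient component, shape (n, d)
    pc = phi0 - phi0.mean()              # centered baseline influence function, shape (n,)
    n = len(phi0)
    H = hc.T @ hc / n                    # Cov(h)
    cov_h_phi = hc.T @ pc / n            # Cov(h, phi0)
    var_phi = pc @ pc / n                # Var(phi0)
    return H - np.outer(cov_h_phi, cov_h_phi) / var_phi

# Synthetic draws: the first coordinate of h is correlated with phi0, the second is not
rng = np.random.default_rng(2)
phi0 = rng.normal(size=4000)
h = np.column_stack([0.8 * phi0 + rng.normal(size=4000), rng.normal(size=4000)])
Gamma = schur_gamma(h, phi0)
H_full = np.cov(h, rowvar=False, bias=True)
```

In this synthetic example the first diagonal entry of `Gamma` is noticeably smaller than that of `H_full`, because part of the variability of the first coordinate of `h` is explained by `phi0`, while the second diagonal entry is essentially unchanged.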

The next result records how the efficiency bound depends on $\boldsymbol{\delta}$.

Lemma 1 (Variance of the efficient influence function).

Under (A1)–(A5) defined in the Appendix and for all $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|$ within a fixed small radius, there exist constants $0<c_{\mathrm{low}}\leq c_{\mathrm{up}}<\infty$ (depending only on the model constants and the chosen radius; explicit expressions are given in the Appendix) such that

$$c_{\mathrm{low}}\;\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}\ \leq\ \operatorname{Var}\big\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\big\}\ \leq\ c_{\mathrm{up}}\;\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta},\qquad\boldsymbol{H}=\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}+\boldsymbol{\Sigma}_{\mu,\mathrm{full}}.$$

Lemma 1 makes explicit that the nonparametric efficiency bound (the variance of the efficient influence function) grows quadratically in the tilt magnitude, with geometry determined by the matrix $\boldsymbol{H}$; this geometry blends the noise component $\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}$ and the outcome-regression component $\boldsymbol{\Sigma}_{\mu,\mathrm{full}}$. Since $\boldsymbol{H}=\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}+\boldsymbol{\Sigma}_{\mu,\mathrm{full}}\succeq\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}$, the lower bound immediately implies

$$\operatorname{Var}\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\}\ \geq\ c_{\mathrm{low}}\ \boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}\,\boldsymbol{\delta}.$$

As the direction of $\boldsymbol{\delta}$ changes, the bound can increase substantially, showing how the most efficient shifts are those proportional to the first eigenvector of $\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}$. Note that this matrix is not the same as the covariance matrix of the exposures, but is closely related to it, and therefore this result confirms our findings of Section 2 about which directions are most efficient to estimate. Now, we derive minimax bounds for the estimation error of $\theta(\boldsymbol{\delta})$.

Theorem 2 (Minimax lower bound).

Assume (A1)–(A5) and let $\|\boldsymbol{\delta}\|$ lie in the finite-tilt regime specified in the Appendix. There exists a universal constant $C>0$ (independent of $n$ and $\boldsymbol{\delta}$) such that, for any estimator $\widehat{\theta}$ based on a sample of size $n$,

$$\inf_{\widehat{\theta}}\ \sup_{P}\ \operatorname{E}_{P}\big[(\widehat{\theta}-\theta_{P}(\boldsymbol{\delta}))^{2}\big]\ \geq\ C\cdot\frac{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\,\boldsymbol{\delta}}{n}.$$

Theorem 2 shows that the best possible root mean-squared error obeys

$$\operatorname{RMSE}(\widehat{\theta})\ \gtrsim\ \sqrt{\frac{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\,\boldsymbol{\delta}}{n}},$$

revealing an effective sample size of order $n_{\mathrm{eff}}\asymp n/(\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta})$. Hence larger tilts and directions aligned with high-variance components of $\boldsymbol{\Gamma}$ are intrinsically harder, regardless of the estimation strategy. Conversely, if $\boldsymbol{\delta}$ lies near a low-variance direction of $\boldsymbol{\Gamma}$, faster convergence is attainable. Together with Lemma 1, the theorem clarifies why one must account for $\boldsymbol{\delta}$ when assessing estimation difficulty and designing procedures: the geometry induced by $\boldsymbol{\Gamma}$ determines how precision scales with both the size and the direction of the tilt.

4.2 Convergence and Normality

Under mild regularity conditions (boundedness, i.i.d. sampling, and a fixed finite tilt), we establish a finite-$\boldsymbol{\delta}$ central limit theorem for our cross-fitted one-step estimator. At first sight, the efficient influence function suggests that asymptotic linearity requires controlling the $\boldsymbol{\delta}$-specific nuisance functions

$$m_{\boldsymbol{\delta}}(\boldsymbol{x})=\mathbb{E}_{g_{\boldsymbol{\delta}}}\!\big[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}=\boldsymbol{x}\big],\qquad r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})=\frac{g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})}{f(\boldsymbol{w}\mid\boldsymbol{x})}.$$

Since both objects are indexed by the tilt parameter $\boldsymbol{\delta}$, and $m_{\boldsymbol{\delta}}$ is a nonlinear functional of the observed-data law, it is not obvious how to guarantee the $L_{2}$ rates needed for one-step asymptotics.

Our analysis shows that, for any fixed finite tilt $\boldsymbol{\delta}$, it is in fact sufficient to estimate only two familiar observed-data nuisance components:

$$\mu(\boldsymbol{x},\boldsymbol{w})=\mathbb{E}[Y\mid\boldsymbol{X}=\boldsymbol{x},\boldsymbol{W}=\boldsymbol{w}],\qquad f(\boldsymbol{w}\mid\boldsymbol{x}),$$

namely the outcome regression and the (generalized) propensity score. In the Appendix we show that, under bounded support for $\boldsymbol{W}$ and the finite-tilt condition $\|\boldsymbol{\delta}\|\leq\Delta$, the maps $(\mu,f)\mapsto r_{\boldsymbol{\delta}}$ and $(\mu,f)\mapsto m_{\boldsymbol{\delta}}$ are Lipschitz in $L_{2}(P)$: there exist fixed finite constants $C_{1}(\Delta),C_{2}(\Delta)$ such that for any estimators $(\widehat{\mu},\widehat{f})$,

$$\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\leq C_{1}(\Delta)\,\|\widehat{f}-f\|_{2},\qquad\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}\leq C_{2}(\Delta)\big(\|\widehat{\mu}-\mu\|_{2}+\|\widehat{f}-f\|_{2}\big),$$

where $(\widehat{m}_{\boldsymbol{\delta}},\widehat{r}_{\boldsymbol{\delta}})$ are obtained from $(\widehat{\mu},\widehat{f})$ by the same formulas as above; see Lemma 9. Consequently, the product condition

$$\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/2}),$$

which ensures that the second-order remainder is negligible, is implied by the more interpretable requirement that both $\widehat{\mu}$ and $\widehat{f}$ converge faster than $n^{-1/4}$ in $L_{2}(P)$, that is, at rate $o_{P}(n^{-1/4})$; see Corollary 3. These $L_{2}$ rates for $(\widehat{\mu},\widehat{f})$, and hence for $(\widehat{m}_{\boldsymbol{\delta}},\widehat{r}_{\boldsymbol{\delta}})$, can be guaranteed under standard smoothness and complexity conditions by the highly adaptive lasso (van der Laan, 2017), which attains near-optimal convergence rates for a broad nonparametric class.
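The step from the Lipschitz bounds to the product condition can be written out explicitly. The display below is a short sketch that reads the rate requirement as both nuisance errors being $o_{P}(n^{-1/4})$ in $L_{2}(P)$:

```latex
\begin{align*}
\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,
\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}
&\leq C_{1}(\Delta)\,C_{2}(\Delta)\,\|\widehat{f}-f\|_{2}
\big(\|\widehat{\mu}-\mu\|_{2}+\|\widehat{f}-f\|_{2}\big) \\
&= o_{P}(n^{-1/4})\cdot o_{P}(n^{-1/4}) = o_{P}(n^{-1/2}).
\end{align*}
```

Because $C_{1}(\Delta)$ and $C_{2}(\Delta)$ are fixed for a given $\Delta$, they do not affect the rate.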

Theorem 3 (Finite-$\boldsymbol{\delta}$ CLT).

Assume (A1)–(A3) and (C1)–(C5) in the Appendix. Fix $\Delta\in(0,\infty)$ and any tilt $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|\leq\Delta$. Then

$$\sqrt{n}\,\{\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\}\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big),$$

where the efficient influence function is

$$\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})=r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}+m_{\boldsymbol{\delta}}(\boldsymbol{X})-\psi(\boldsymbol{\delta}).$$

Consequently,

$$\sqrt{n}\,\{\widehat{\theta}(\boldsymbol{\delta})-\theta(\boldsymbol{\delta})\}\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big),\qquad\varphi_{\theta(\boldsymbol{\delta})}:=\varphi_{\psi(\boldsymbol{\delta})}-\varphi_{\psi(\mathbf{0})}.$$

Theorem 3 provides the basis for asymptotic inference, which proceeds in a standard influence-function manner. Let $\boldsymbol{Z}_{i}=(\boldsymbol{X}_{i},\boldsymbol{W}_{i},Y_{i})$, and define the empirical influence values for $\psi(\boldsymbol{\delta})$ by

$$\widehat{\varphi}_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}_{i}):=\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{W}_{i},\boldsymbol{X}_{i})\{Y_{i}-\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X}_{i})\}+\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X}_{i})-\widehat{\psi}(\boldsymbol{\delta}),$$

where $\widehat{r}_{\boldsymbol{\delta}}$ and $\widehat{m}_{\boldsymbol{\delta}}$ are constructed from the cross-fitted estimators $(\widehat{\mu},\widehat{f})$ as above. A consistent estimator of the asymptotic variance is the sample second moment

$$\widehat{\sigma}_{\psi(\boldsymbol{\delta})}^{\,2}:=\frac{1}{n}\sum_{i=1}^{n}\widehat{\varphi}_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}_{i})^{2},$$

and analogously for the incremental effect we use

$$\widehat{\varphi}_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z}_{i}):=\widehat{\varphi}_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}_{i})-\widehat{\varphi}_{\psi(\mathbf{0})}(\boldsymbol{Z}_{i}),\qquad\widehat{\sigma}_{\theta(\boldsymbol{\delta})}^{\,2}:=\frac{1}{n}\sum_{i=1}^{n}\widehat{\varphi}_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z}_{i})^{2}.$$

These yield Wald-type confidence intervals of the form

$$\widehat{\theta}(\boldsymbol{\delta})\ \pm\ z_{1-\alpha/2}\,\frac{\widehat{\sigma}_{\theta(\boldsymbol{\delta})}}{\sqrt{n}},$$

with a similar process for $\psi(\boldsymbol{\delta})$. When inference is required for several tilts $\boldsymbol{\delta}_{1},\ldots,\boldsymbol{\delta}_{J}$, a joint covariance estimator is obtained by the empirical covariance matrix of the vector

$$\big(\widehat{\varphi}_{\theta(\boldsymbol{\delta}_{1})}(\boldsymbol{Z}_{i}),\ldots,\widehat{\varphi}_{\theta(\boldsymbol{\delta}_{J})}(\boldsymbol{Z}_{i})\big)^{\top},\qquad i=1,\ldots,n,$$

which enables simultaneous confidence intervals or Wald tests via a multivariate normal approximation.
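The inference recipe above can be sketched in a few lines. The sketch assumes that the one-step estimate is the sample mean of the uncentered influence values, which is what the form of the efficient influence function suggests, and that the arrays `r_hat` and `m_hat` already hold cross-fitted nuisance evaluations. All function names and the synthetic inputs are hypothetical and serve only as illustration.

```python
import numpy as np
from statistics import NormalDist

def one_step_summary(y, r_hat, m_hat, alpha=0.05):
    """One-step estimate, influence values, and Wald interval for psi(delta).

    y, r_hat, m_hat are length-n arrays holding Y_i, r_hat(W_i, X_i) and
    m_hat(X_i); they are assumed to come from cross-fitted nuisance fits.
    """
    uncentered = r_hat * (y - m_hat) + m_hat
    psi_hat = uncentered.mean()
    phi = uncentered - psi_hat                 # empirical influence values
    se = np.sqrt(np.mean(phi ** 2) / len(y))   # sigma_hat / sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return psi_hat, phi, (psi_hat - z * se, psi_hat + z * se)

def joint_covariance(phi_matrix):
    """Empirical covariance of an n x J matrix of influence values."""
    return np.cov(phi_matrix, rowvar=False, bias=True)

# Synthetic inputs (illustration only, not data from the paper).
rng = np.random.default_rng(0)
n = 2000
y = rng.normal(size=n)
r_hat = np.exp(rng.normal(scale=0.3, size=n))
r_hat /= r_hat.mean()                          # density ratios average to one
m_hat = rng.normal(scale=0.1, size=n)

psi_hat, phi_delta, ci_psi = one_step_summary(y, r_hat, m_hat)
# At delta = 0 the ratio equals one, so the m_hat terms cancel.
psi0_hat, phi_zero, _ = one_step_summary(y, np.ones(n), np.zeros(n))

theta_hat = psi_hat - psi0_hat
phi_theta = phi_delta - phi_zero
se_theta = np.sqrt(np.mean(phi_theta ** 2) / n)
cov_joint = joint_covariance(np.column_stack([phi_delta, phi_theta]))
```

Stacking the influence values for several tilts column-wise and passing them to `joint_covariance` gives the matrix needed for simultaneous intervals.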

5 Sensitivity analysis to unmeasured confounding

In this section we develop a sensitivity analysis approach to assess the robustness of our causal estimates to the presence of unmeasured confounders. Throughout, we assume that there exist unmeasured confounders $\boldsymbol{U}$ such that

$$Y(\boldsymbol{w})\perp\!\!\!\!\perp\boldsymbol{W}\mid\boldsymbol{X},\boldsymbol{U},\quad\text{for all }\boldsymbol{w}.$$

For notational clarity, let $\boldsymbol{V}:=(\boldsymbol{X},\boldsymbol{U})$ denote the full adjustment set, reserving $\boldsymbol{Z}:=(\boldsymbol{X},\boldsymbol{W},Y)$ for the observed data throughout. We still consider incremental policies defined by exponential tilting of the observed conditional exposure density $f(\boldsymbol{w}\mid\boldsymbol{X})$:

$$g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})=\frac{\exp(\boldsymbol{\delta}^{\top}\boldsymbol{w})f(\boldsymbol{w}\mid\boldsymbol{X})}{\mathbb{E}[\exp(\boldsymbol{\delta}^{\top}\boldsymbol{W})\mid\boldsymbol{X}]}.$$
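As a worked special case (an illustration, not an assumption made by the framework), suppose the conditional exposure law is Gaussian, $f(\boldsymbol{w}\mid\boldsymbol{X})=\mathcal{N}(\boldsymbol{a}(\boldsymbol{X}),\boldsymbol{\Sigma})$, where the conditional mean $\boldsymbol{a}(\cdot)$ and covariance $\boldsymbol{\Sigma}$ are notation introduced only for this example. Completing the square shows that the tilt shifts the mean by $\boldsymbol{\Sigma}\boldsymbol{\delta}$ and leaves the covariance unchanged:

```latex
\begin{align*}
\mathbb{E}[\exp(\boldsymbol{\delta}^{\top}\boldsymbol{W})\mid\boldsymbol{X}]
&=\exp\!\big(\boldsymbol{\delta}^{\top}\boldsymbol{a}(\boldsymbol{X})
+\tfrac{1}{2}\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta}\big),\\
g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})
&=\mathcal{N}\big(\boldsymbol{w};\,\boldsymbol{a}(\boldsymbol{X})
+\boldsymbol{\Sigma}\boldsymbol{\delta},\,\boldsymbol{\Sigma}\big),\\
\frac{g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})}{f(\boldsymbol{w}\mid\boldsymbol{X})}
&=\exp\!\big(\boldsymbol{\delta}^{\top}\{\boldsymbol{w}-\boldsymbol{a}(\boldsymbol{X})\}
-\tfrac{1}{2}\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta}\big).
\end{align*}
```

In this special case a tilt in direction $\boldsymbol{\delta}$ moves every exposure that is correlated with the tilted coordinates, since the mean shift is $\boldsymbol{\Sigma}\boldsymbol{\delta}$ rather than $\boldsymbol{\delta}$.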

The full-data causal estimand is given by

$$\psi(\boldsymbol{\delta})=\mathbb{E}\left[\int_{\mathcal{W}}\mu(\boldsymbol{V},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right],$$

where $\mu(\boldsymbol{V},\boldsymbol{w})=\mathbb{E}[Y\mid\boldsymbol{V},\boldsymbol{W}=\boldsymbol{w}]$ is referred to as the long outcome regression. We further denote the estimand obtained by applying the same identification formula while omitting $\boldsymbol{U}$ as

$$\widetilde{\psi}(\boldsymbol{\delta})=\mathbb{E}\left[\int_{\mathcal{W}}\widetilde{\mu}(\boldsymbol{X},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right]=\mathbb{E}\left[\int_{\mathcal{W}}[\mu(\boldsymbol{X},\boldsymbol{w})+d(\boldsymbol{X},\boldsymbol{w})]g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right],$$

where $\widetilde{\mu}(\boldsymbol{X},\boldsymbol{w})=\mathbb{E}[Y\mid\boldsymbol{X},\boldsymbol{W}=\boldsymbol{w}]$ is referred to as the short outcome regression, and $d(\boldsymbol{X},\boldsymbol{w})$ captures confounding bias at exposure level $\boldsymbol{w}$. When there is unmeasured confounding, the identified short estimand $\widetilde{\psi}(\boldsymbol{\delta})$, which conditions only on observed covariates $\boldsymbol{X}$, diverges from the causal target $\psi(\boldsymbol{\delta})$ that is defined under conditional exchangeability given $\boldsymbol{V}=(\boldsymbol{X},\boldsymbol{U})$. We follow Chernozhukov et al. (2022) and leverage the geometry of Riesz representers to derive sharp bias bounds based on $L_{2}$ norms, yielding interpretable calibration of confounding strength without imposing restrictive structural assumptions on $\mu(\cdot,\cdot)$.

5.1 The Bias Representation

Let $f(\boldsymbol{w}\mid\boldsymbol{V})$ denote the true conditional exposure density given the full adjustment set. The parameter $\psi(\boldsymbol{\delta})$ can be written as a linear functional of the long regression evaluated under the observed distribution of $(\boldsymbol{V},\boldsymbol{W})$:

$$\psi(\boldsymbol{\delta})=\mathbb{E}\left[\mu(\boldsymbol{V},\boldsymbol{W})\alpha_{\boldsymbol{\delta}}(\boldsymbol{V},\boldsymbol{W})\right],\qquad\alpha_{\boldsymbol{\delta}}(\boldsymbol{V},\boldsymbol{W})=\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{V})}.$$

Analogously, the short estimand admits the representation

$$\widetilde{\psi}(\boldsymbol{\delta})=\mathbb{E}\left[\widetilde{\mu}(\boldsymbol{X},\boldsymbol{W})\alpha_{s,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W})\right],\qquad\alpha_{s,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W})=\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{X})}.$$

Under mild regularity conditions ensuring square-integrability of the relevant Riesz representers, Chernozhukov et al. (2022) show that the bias admits the exact representation

$$\widetilde{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})=\mathbb{E}\left[\Delta_{\mu}(\boldsymbol{V},\boldsymbol{W})\Delta_{\alpha}(\boldsymbol{V},\boldsymbol{W})\right],$$

where

$$\Delta_{\mu}(\boldsymbol{V},\boldsymbol{W})=\mu(\boldsymbol{V},\boldsymbol{W})-\widetilde{\mu}(\boldsymbol{X},\boldsymbol{W}),\qquad\Delta_{\alpha}(\boldsymbol{V},\boldsymbol{W})=\alpha_{\boldsymbol{\delta}}(\boldsymbol{V},\boldsymbol{W})-\alpha_{s,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W}).$$

This identity isolates two necessary sources of bias: the additional outcome variation explained by $\boldsymbol{U}$ beyond $(\boldsymbol{X},\boldsymbol{W})$ through $\Delta_{\mu}$, and the additional information about the exposure distribution provided by $\boldsymbol{U}$ beyond $\boldsymbol{X}$ through $\Delta_{\alpha}$. By Cauchy–Schwarz,

$$|\widetilde{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})|\leq\sqrt{\mathbb{E}[\Delta_{\mu}(\boldsymbol{V},\boldsymbol{W})^{2}]}\sqrt{\mathbb{E}[\Delta_{\alpha}(\boldsymbol{V},\boldsymbol{W})^{2}]}.$$

5.2 Sensitivity Parameters

To express the bound in terms of identifiable scale components and partial $R^{2}$-type sensitivity parameters, define the following identifiable parameters:

$$\sigma_{s}^{2}=\mathbb{E}\left[(Y-\widetilde{\mu}(\boldsymbol{X},\boldsymbol{W}))^{2}\right],\qquad\nu_{s}^{2}(\boldsymbol{\delta})=\mathbb{E}\left[\alpha_{s,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W})^{2}\right],\qquad S(\boldsymbol{\delta})=\sigma_{s}\nu_{s}(\boldsymbol{\delta}).$$

We parameterize the outcome component by the nonparametric partial $R^{2}$ of $\boldsymbol{U}$ with $Y$ given $(\boldsymbol{X},\boldsymbol{W})$,

$$R^{2}_{Y\sim\boldsymbol{U}\mid\boldsymbol{X},\boldsymbol{W}}=\frac{\mathbb{E}\left[\Delta_{\mu}(\boldsymbol{V},\boldsymbol{W})^{2}\right]}{\sigma_{s}^{2}},\qquad C_{Y}=\sqrt{R^{2}_{Y\sim\boldsymbol{U}\mid\boldsymbol{X},\boldsymbol{W}}}.$$

For the treatment component, we use the relative increase in $L_{2}$ variation of the Riesz representer induced by conditioning on $\boldsymbol{U}$:

$$C_{D}^{2}(\boldsymbol{\delta})=\frac{\mathbb{E}\left[\Delta_{\alpha}(\boldsymbol{V},\boldsymbol{W})^{2}\right]}{\nu_{s}^{2}(\boldsymbol{\delta})}.$$

Combining these definitions yields the sensitivity bound

$$|\widetilde{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})|\leq S(\boldsymbol{\delta})\cdot C_{Y}\cdot C_{D}(\boldsymbol{\delta}).$$
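Spelled out, the bound follows by substituting the two ratio definitions into the Cauchy–Schwarz inequality of Section 5.1:

```latex
\begin{align*}
\sqrt{\mathbb{E}[\Delta_{\mu}(\boldsymbol{V},\boldsymbol{W})^{2}]}\,
\sqrt{\mathbb{E}[\Delta_{\alpha}(\boldsymbol{V},\boldsymbol{W})^{2}]}
&=\sqrt{C_{Y}^{2}\,\sigma_{s}^{2}}\;
\sqrt{C_{D}^{2}(\boldsymbol{\delta})\,\nu_{s}^{2}(\boldsymbol{\delta})}\\
&=\sigma_{s}\,\nu_{s}(\boldsymbol{\delta})\,C_{Y}\,C_{D}(\boldsymbol{\delta})
=S(\boldsymbol{\delta})\,C_{Y}\,C_{D}(\boldsymbol{\delta}).
\end{align*}
```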

For the empirical analysis, we parameterize the strength of unmeasured confounding by

$$\eta_{Y}^{2}:=C_{Y}^{2},\qquad\eta_{\alpha}^{2}(\boldsymbol{\delta}):=1-R_{\alpha}^{2}(\boldsymbol{\delta}),\qquad C_{D}^{2}(\boldsymbol{\delta})=\frac{\eta_{\alpha}^{2}(\boldsymbol{\delta})}{1-\eta_{\alpha}^{2}(\boldsymbol{\delta})},$$

where $R_{\alpha}^{2}(\boldsymbol{\delta})$ is the squared correlation between the long and short Riesz representers. Here $\eta_{Y}^{2}$ is the fraction of residual outcome variation explainable by $\boldsymbol{U}$ given $(\boldsymbol{X},\boldsymbol{W})$, and $\eta_{\alpha}^{2}(\boldsymbol{\delta})$ is the fraction of Riesz representer (RR) variation explainable by $\boldsymbol{U}$ along the intervention path indexed by $\boldsymbol{\delta}$. These two quantities determine the width of the bias bound and, in turn, the width of the endpoint confidence bounds. The sensitivity parameters can be set subjectively or calibrated by formal benchmarking against observed covariates (Cinelli and Hazlett, 2020). The benchmarking procedure used in the empirical analysis is given in Appendix E.7.

5.3 Sensitivity parameters and confidence bounds for $\theta(\boldsymbol{\delta})$

Our empirical results focus on the incremental effect $\theta(\boldsymbol{\delta})=\psi(\boldsymbol{\delta})-\psi(\boldsymbol{0})$. As shown in Appendix E, the same omitted-variable-bias representation applies to this contrast after replacing the short Riesz representer by the Riesz representer contrast $\alpha_{s,\theta,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W})=\alpha_{s,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W})-\alpha_{s,\boldsymbol{0}}(\boldsymbol{X},\boldsymbol{W})$. Writing

$$\nu_{s,\theta}^{2}(\boldsymbol{\delta}):=\mathbb{E}\!\left[\alpha_{s,\theta,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W})^{2}\right],$$

the plug-in bound used in our analysis is

$$\widehat{B}(\boldsymbol{\delta})=\widehat{S}(\boldsymbol{\delta})\,\sqrt{\eta_{Y}^{2}}\,\sqrt{\frac{\eta_{\alpha}^{2}(\boldsymbol{\delta})}{1-\eta_{\alpha}^{2}(\boldsymbol{\delta})}},\qquad\widehat{S}(\boldsymbol{\delta}):=\sqrt{\widehat{\sigma}_{s}^{2}\,\widehat{\nu}_{s,\theta}^{2}(\boldsymbol{\delta})},$$

with $\widehat{\nu}_{s,\theta}^{2}(\boldsymbol{\delta})=n^{-1}\sum_{i=1}^{n}\widehat{\alpha}_{s,\theta,\boldsymbol{\delta}}(\boldsymbol{X}_{i},\boldsymbol{W}_{i})^{2}$. Therefore, for given values of the sensitivity parameters, the estimated identified set is

$$\theta(\boldsymbol{\delta})\in\Big[\,\widehat{\theta}(\boldsymbol{\delta})-\widehat{B}(\boldsymbol{\delta}),\;\widehat{\theta}(\boldsymbol{\delta})+\widehat{B}(\boldsymbol{\delta})\Big].$$

For formal sensitivity-adjusted inference, we construct confidence bounds for the estimated endpoints $\widehat{\theta}_{-}(\boldsymbol{\delta}):=\widehat{\theta}(\boldsymbol{\delta})-\widehat{B}(\boldsymbol{\delta})$ and $\widehat{\theta}_{+}(\boldsymbol{\delta}):=\widehat{\theta}(\boldsymbol{\delta})+\widehat{B}(\boldsymbol{\delta})$. Appendix E gives the endpoint expansions and the resulting standard errors. The empirical results report a sensitivity-adjusted $95\%$ interval obtained from a lower confidence bound for $\widehat{\theta}_{-}(\boldsymbol{\delta})$ and an upper confidence bound for $\widehat{\theta}_{+}(\boldsymbol{\delta})$:

$$\left[\widehat{\theta}_{-}(\boldsymbol{\delta})-z_{0.95}\,\widehat{\mathrm{se}}_{-}(\boldsymbol{\delta}),\;\widehat{\theta}_{+}(\boldsymbol{\delta})+z_{0.95}\,\widehat{\mathrm{se}}_{+}(\boldsymbol{\delta})\right].$$
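The plug-in bound and the resulting interval can be computed as in the following minimal sketch. The function names are hypothetical, the numeric inputs are made up for illustration, and the endpoint standard errors are taken as given because their formulas (Appendix E) are not reproduced here.

```python
import numpy as np
from statistics import NormalDist

def bias_bound(sigma2_s, nu2_s_theta, eta2_y, eta2_alpha):
    """Plug-in bound B = S * sqrt(eta_Y^2) * sqrt(eta_a^2 / (1 - eta_a^2))."""
    if not (0 <= eta2_y <= 1 and 0 <= eta2_alpha < 1):
        raise ValueError("eta2_y must be in [0, 1] and eta2_alpha in [0, 1)")
    s_hat = np.sqrt(sigma2_s * nu2_s_theta)
    return s_hat * np.sqrt(eta2_y) * np.sqrt(eta2_alpha / (1 - eta2_alpha))

def sensitivity_interval(theta_hat, bound, se_minus, se_plus, level=0.95):
    """Lower bound for the lower endpoint and upper bound for the upper endpoint."""
    z = NormalDist().inv_cdf(level)            # z_{0.95} by default
    lower = theta_hat - bound - z * se_minus
    upper = theta_hat + bound + z * se_plus
    return lower, upper

# Made-up inputs: residual variance 4, mean squared RR contrast 0.25.
B_hat = bias_bound(sigma2_s=4.0, nu2_s_theta=0.25, eta2_y=0.04, eta2_alpha=0.2)
# se_minus and se_plus are placeholders for the Appendix E standard errors.
lower, upper = sensitivity_interval(theta_hat=-1.0, bound=B_hat,
                                    se_minus=0.05, se_plus=0.05)
```

With these inputs $\widehat{S}=1$, $\sqrt{\eta_{Y}^{2}}=0.2$, and $\sqrt{0.2/0.8}=0.5$, so the bound equals $0.1$ and the interval widens the estimated identified set $[-1.1,-0.9]$ on both sides.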

6 Simulation studies

We conduct simulation studies to systematically evaluate the finite-sample performance of various nuisance-parameter estimation pipelines. Our goal is to assess how different modeling and estimation pipelines perform when the underlying exposure distribution exhibits complex features such as heavy tails, skewness, or multimodality.

6.1 Simulation design

In all simulations, we fix the sample size at $n=5,000$ and the dimensions of the covariates and exposures at $p=10$ and $q=6$, respectively. The results are averaged over 100 independent Monte Carlo repetitions. We generate the covariate vector $\boldsymbol{X}=(X_{1},\dots,X_{p})^{\top}$ from a multivariate normal distribution with an autoregressive correlation structure. Specifically, $\boldsymbol{X}\sim\mathcal{N}(0,\boldsymbol{\Sigma}_{X})$, where the $(j,k)$-th entry of the covariance matrix is given by $\Sigma_{X,jk}=0.5^{|j-k|}$. The exposure vector $\boldsymbol{W}$ is generated conditional on $\boldsymbol{X}$ using a linear location-shift model:

$$\boldsymbol{W}=\boldsymbol{B}^{\top}\boldsymbol{X}+\boldsymbol{\beta}_{0}+\boldsymbol{\mathcal{E}},$$

where $\boldsymbol{B}\in\mathbb{R}^{p\times q}$ is a sparse coefficient matrix with non-zero entries drawn from $\mathcal{N}(0,0.6^{2})$ (sparsity level $0.4$), and $\boldsymbol{\beta}_{0}$ is the intercept vector. The error term $\boldsymbol{\mathcal{E}}$ determines the distributional characteristics of the exposures. We consider three scenarios to test model robustness:

  1. Scenario I (Gaussian): The errors follow a multivariate normal distribution $\boldsymbol{\mathcal{E}}\sim\mathcal{N}(0,\boldsymbol{\Sigma}_{W})$, where $\boldsymbol{\Sigma}_{W}$ has an AR(1) structure with correlation parameter $\rho=0.6$.

  2. Scenario II (Skewed): The errors follow a multivariate skew-normal distribution, $\boldsymbol{\mathcal{E}}\sim\mathcal{SN}(0,\boldsymbol{\Sigma}_{W},\boldsymbol{\alpha})$, with a slant parameter vector $\boldsymbol{\alpha}=(4,\dots,4)^{\top}$. This scenario introduces substantial asymmetry and non-Gaussianity.

  3. Scenario III (Truncated contaminated normal): To investigate robustness against heavy-tail-like deviations while ensuring the finiteness of the normalizing constant required for exponential tilting, we consider a truncated contaminated normal distribution. Specifically, the error distribution follows a mixture of two centered Gaussians, $(1-\pi)\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_{W})+\pi\mathcal{N}(\mathbf{0},\omega^{2}\boldsymbol{\Sigma}_{W})$, with contamination rate $\pi=0.2$ and scale inflation $\omega=1.5$. The support is restricted to the region bounded by $6$ standard deviations of the baseline component. This setting introduces heavier tails relative to the standard Gaussian assumption while maintaining the required integrability conditions.
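The exposure-generating design can be sketched as follows. Several details are assumptions made only for this illustration because the text does not pin them down: each entry of $\boldsymbol{B}$ is taken to be non-zero with probability 0.4, the intercept $\boldsymbol{\beta}_{0}$ is set to zero, the truncation region is read as the coordinate-wise box of 6 baseline standard deviations, and the skew-normal draws use the standard sign-flip representation. All function names are hypothetical.

```python
import numpy as np

def ar1_cov(dim, rho):
    """Matrix with (j, k) entry rho^{|j-k|}."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def skew_normal(rng, n, omega, slant):
    """Draw SN(0, omega, slant) by sign-flipping (omega is a correlation matrix)."""
    d = omega.shape[0]
    dvec = omega @ slant / np.sqrt(1.0 + slant @ omega @ slant)
    aug = np.block([[np.ones((1, 1)), dvec[None, :]],
                    [dvec[:, None], omega]])
    draws = rng.multivariate_normal(np.zeros(d + 1), aug, size=n)
    sign = np.where(draws[:, 0] > 0, 1.0, -1.0)
    return sign[:, None] * draws[:, 1:]

def truncated_contaminated(rng, n, sigma, pi=0.2, omega=1.5, k=6.0):
    """Scale mixture of Gaussians, rejected outside the box |e_j| <= k * sd_j."""
    sd = np.sqrt(np.diag(sigma))
    out = np.empty((0, sigma.shape[0]))
    while out.shape[0] < n:
        base = rng.multivariate_normal(np.zeros(len(sd)), sigma, size=n)
        scale = np.where(rng.random(n) < pi, omega, 1.0)
        cand = base * scale[:, None]
        keep = np.all(np.abs(cand) <= k * sd, axis=1)
        out = np.vstack([out, cand[keep]])
    return out[:n]

def simulate_exposures(rng, n=5000, p=10, q=6, scenario="gaussian"):
    """Return covariates X (n x p) and exposures W (n x q)."""
    X = rng.multivariate_normal(np.zeros(p), ar1_cov(p, 0.5), size=n)
    # Assumption: each entry of B is non-zero with probability 0.4.
    B = rng.normal(0.0, 0.6, size=(p, q)) * (rng.random((p, q)) < 0.4)
    beta0 = np.zeros(q)  # assumption: the intercept is not specified in the text
    sigma_w = ar1_cov(q, 0.6)
    if scenario == "gaussian":
        E = rng.multivariate_normal(np.zeros(q), sigma_w, size=n)
    elif scenario == "skewed":
        E = skew_normal(rng, n, sigma_w, np.full(q, 4.0))
    elif scenario == "contaminated":
        E = truncated_contaminated(rng, n, sigma_w)
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    # Row-wise version of W = B'X + beta0 + E.
    return X, X @ B + beta0 + E
```

Calling `simulate_exposures(np.random.default_rng(1), scenario="skewed")` returns one simulated data set for Scenario II under these assumptions.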

We generate the outcome under two different scenarios that comprise differing levels of complexity.

  1. Scenario I (Simple Linear Model): We generate a continuous outcome $Y$ using a linear model:

    $$Y=\alpha_{0}+\boldsymbol{\alpha}^{\top}\boldsymbol{X}+\boldsymbol{\beta}^{\top}\boldsymbol{W}+\epsilon,\quad\epsilon\sim\mathcal{N}(0,1).$$

    Here, the regression coefficients are generated from normal distributions: entries of $\boldsymbol{\alpha}$ are drawn from $\mathcal{N}(0.5,1)$, and entries of $\boldsymbol{\beta}$ are drawn from $\mathcal{N}(2,1)$, ensuring a strong but different signal for both covariates and exposures.

  2. Scenario II (Complex Model): We evaluate a second, more complex DGP for the outcome $Y$ to incorporate structural nonlinearities and interactions. The continuous outcome $Y$ is generated using the following model:

    $$Y=\alpha_{0}+\boldsymbol{\alpha}^{\top}\boldsymbol{X}+c_{1}X_{2}^{2}+\boldsymbol{\beta}^{\top}\boldsymbol{W}+c_{2}W_{1}W_{2}+c_{3}X_{1}W_{1}+\epsilon,\quad\epsilon\sim\mathcal{N}(0,1).$$

    Here, the baseline linear confounding coefficients $\boldsymbol{\alpha}$ and the marginal independent effect coefficients $\boldsymbol{\beta}$ are drawn from $\mathcal{N}(0.5,1)$ and $\mathcal{N}(2,1)$, respectively. Beyond the linear main effects, the model explicitly introduces a quadratic effect for a specific covariate ($c_{1}X_{2}^{2}$, with $c_{1}=0.5$), a cross-product interaction between two exposures ($c_{2}W_{1}W_{2}$, with $c_{2}=1.0$), and a bilinear interaction between a covariate and an exposure ($c_{3}X_{1}W_{1}$, with $c_{3}=0.8$).

This more complex setting is designed to emulate mechanisms often encountered in the health effects of air pollution mixtures. The quadratic covariate term can be viewed as mimicking the classical U-shaped meteorological confounding effect of temperature. The cross-product term W1W2W_{1}W_{2} represents synergistic toxicity between two distinct pollutants, analogous to the combined effects of fine particulate matter (PM2.5) and ozone (O3). The interaction X1W1X_{1}W_{1} captures effect modification, reflecting how a vulnerability factor such as patient age can amplify the health impact of a specific exposure.
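The two outcome models can be sketched in a single function. The intercept $\alpha_{0}$ is not specified in the text and is set to zero here as an assumption, and the function name and its returned pair (outcome and conditional mean) are hypothetical conveniences for illustration.

```python
import numpy as np

def simulate_outcome(rng, X, W, complex_model=False, alpha0=0.0,
                     c1=0.5, c2=1.0, c3=0.8):
    """Draw Y from the linear model, or from the model with quadratic and interaction terms."""
    n, p = X.shape
    q = W.shape[1]
    alpha = rng.normal(0.5, 1.0, size=p)   # covariate coefficients, N(0.5, 1)
    beta = rng.normal(2.0, 1.0, size=q)    # exposure coefficients, N(2, 1)
    mean = alpha0 + X @ alpha + W @ beta
    if complex_model:
        mean = (mean
                + c1 * X[:, 1] ** 2        # quadratic effect of X_2
                + c2 * W[:, 0] * W[:, 1]   # exposure-by-exposure interaction
                + c3 * X[:, 0] * W[:, 0])  # covariate-by-exposure interaction
    y = mean + rng.normal(size=n)          # unit-variance Gaussian noise
    return y, mean
```

Returning the conditional mean alongside the outcome makes it straightforward to approximate the true value of $\psi(\boldsymbol{\delta})$ by Monte Carlo when evaluating estimators.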

6.2 Implemented estimators and method comparison

Our primary estimand is the tilted mean $\psi(\boldsymbol{\delta})$, evaluated at a fixed tilting parameter $\boldsymbol{\delta}$. We compare seven estimation pipelines that differ in how they approximate the nuisance structure induced by the tilted exposure law. For all methods, the outcome regression $\mu(\boldsymbol{X},\boldsymbol{W})=\mathbb{E}[Y\mid\boldsymbol{X},\boldsymbol{W}]$ is estimated using XGBoost with 5-fold cross-fitting. Our approaches can largely be categorized into two distinct types, which vary in how the exposure distribution is estimated.

  1. Semi-parametric location-shift working models.

    We model the exposure distribution using a semi-parametric location-shift decomposition, $W_{j}=\mu_{j}(\boldsymbol{X})+\sigma_{j}\varepsilon_{j}$. For each exposure dimension, the conditional mean $\mu_{j}(\boldsymbol{X})$ is flexibly estimated via XGBoost, while the scale parameter $\sigma_{j}$ is estimated as the empirical standard deviation of the residuals within the corresponding training fold. This specification accommodates nonlinear covariate-dependent shifts in the exposure mean while imposing a homoscedastic error structure across the covariate space.

    We evaluate three variations of this working model based on the specified marginal distributions of the standardized residuals $\varepsilon_{j}$. In all three variations, the joint dependence structure of $\boldsymbol{\varepsilon}=(\varepsilon_{1},\dots,\varepsilon_{q})^{\top}$ is modeled using a Gaussian copula:

    • Path 2 (Gaussian): Assumes the standardized residuals follow a Gaussian distribution.

    • Path 2 ($t$): Models each marginal residual using a scaled Student's $t$ distribution, $F_{j}=t_{df_{j}}$, where the degrees of freedom $df_{j}$ are estimated via maximum likelihood. This allows the model to accommodate heavy tails in specific exposure dimensions.

    • Path 2 (Empirical): Employs a fully nonparametric approach for the marginal residuals. The marginal cumulative distribution function $F_{j}$ for each residual dimension is flexibly estimated using log-spline density estimation.

  2. Direct estimation of nuisance parameters via regression.

    In addition to the standard outcome regression $\mu(\boldsymbol{x},\boldsymbol{w})$, the one-step estimator depends on nuisance functions that vary with the tilting parameter $\boldsymbol{\delta}$. These are given by $\nu_{\boldsymbol{\delta}}(\boldsymbol{x})$ and $\eta_{\boldsymbol{\delta}}(\boldsymbol{x})$, which are defined in Section 3.1.2. In this pipeline, we use 5-fold cross-fitted SoftBART regression (Linero and Yang, 2018) for estimation of these functions.

Within each of the three approaches to conditional density estimation, we explore 1) estimating the density ratio using the estimate of $f$, and 2) directly estimating the density ratio by estimating $\nu_{\boldsymbol{\delta}}(\boldsymbol{x})$. This leads to six estimators, though we consider a seventh estimator that involves directly estimating both $\nu_{\boldsymbol{\delta}}(\boldsymbol{x})$ and $\eta_{\boldsymbol{\delta}}(\boldsymbol{x})$ without ever needing to estimate the exposure density.
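For intuition about the plug-in route, the sketch below treats the simplest case, a Gaussian residual working model with a common covariance `Sigma`. Under that assumption the density ratio has a closed form, and the tilted conditional mean can be approximated by Monte Carlo draws from the shifted Gaussian. This is a simplified stand-in for illustration, not the authors' implementation: the copula-based $t$ and empirical variants would need numerical normalization instead, and `mu_hat`, `cond_mean`, and the function names are hypothetical.

```python
import numpy as np

def gaussian_density_ratio(W, cond_mean, Sigma, delta):
    """r_delta(w, x) = exp(delta'(w - a(x)) - delta' Sigma delta / 2) for Gaussian f."""
    return np.exp((W - cond_mean) @ delta - 0.5 * delta @ Sigma @ delta)

def tilted_mean_mc(mu_hat, X, cond_mean, Sigma, delta, rng, draws=200):
    """Monte Carlo approximation of m_delta(x_i) under the Gaussian working model.

    mu_hat(X, W) must return fitted outcome means for n rows at a time;
    cond_mean is the n x q matrix of fitted exposure means a(x_i).
    """
    n, q = cond_mean.shape
    chol = np.linalg.cholesky(Sigma)
    shifted = cond_mean + Sigma @ delta   # the tilt shifts the mean by Sigma delta
    total = np.zeros(n)
    for _ in range(draws):
        w_draw = shifted + rng.normal(size=(n, q)) @ chol.T
        total += mu_hat(X, w_draw)
    return total / draws
```

The outputs of these two functions are the per-observation inputs $\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{W}_{i},\boldsymbol{X}_{i})$ and $\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X}_{i})$ that a one-step estimator would consume, with cross-fitting handled outside the sketch.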

6.3 Simulation Results

Figure 1 displays the estimated tilted mean, $\psi(\boldsymbol{\delta})$, across 100 Monte Carlo repetitions for the seven pipelines under the six data-generating designs. Additionally, Table 1 complements the boxplots with numerical summaries aggregated over the six designs. We report the average signed bias together with the average absolute bias and the average RMSE.
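The summary metrics can be computed as in the following sketch. It assumes that the absolute bias is the absolute value of the per-design signed bias and that all three metrics are then averaged with equal weight over designs; the text does not spell out these details, and the function names are hypothetical.

```python
import numpy as np

def summarize_design(estimates, truth):
    """Signed bias, absolute bias, and RMSE of Monte Carlo estimates for one design."""
    err = np.asarray(estimates, dtype=float) - truth
    bias = float(err.mean())
    rmse = float(np.sqrt(np.mean(err ** 2)))
    return {"bias": bias, "abs_bias": abs(bias), "rmse": rmse}

def average_over_designs(per_design):
    """Equal-weight average of each metric over a list of per-design summaries."""
    return {key: float(np.mean([d[key] for d in per_design]))
            for key in per_design[0]}
```

Under this convention, biases of opposite sign across designs cancel in the mean bias but not in the mean absolute bias, which is why the two are reported together.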

Figure 1: Boxplots of the estimates of $\psi(\boldsymbol{\delta})$ across 100 repetitions under six data-generating designs. The dashed horizontal line marks the true value of the estimand. The labels on the horizontal axis identify, from left to right, the residual family, the method used for $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$, and the method used for estimating $m_{\boldsymbol{\delta}}(\boldsymbol{x})$. Throughout, MC represents simply plugging in estimates of $f$ directly, while SoftBART represents direct estimation as described in Section 3.1.2.
Table 1: Average performance over the six simulation designs.

| Estimation pipeline | Mean bias | Mean absolute bias | Mean RMSE |
|---|---|---|---|
| Fully direct SoftBART regression | 0.32 | 0.51 | 1.07 |
| Gaussian residual model | 0.78 | 0.82 | 1.05 |
| Gaussian residual model with SoftBART estimation of $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$ | 0.15 | 0.37 | 0.64 |
| $t$ residual model | 0.72 | 0.80 | 1.01 |
| $t$ residual model with SoftBART estimation of $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$ | 0.15 | 0.37 | 0.64 |
| Empirical residual model | 0.24 | 0.42 | 0.66 |
| Empirical residual model with SoftBART estimation of $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$ | 0.08 | 0.31 | 0.62 |

Across the six designs, the empirical residual working model is the most reliable default among the three distributional choices. The results also make clear that estimating the density ratio directly using SoftBART outperforms plugging in the estimate of $f$. Regardless of the distributional choice, direct modeling of the density ratio improves both bias and RMSE; the gain is largest for the Gaussian and $t$ residual models, whose mean RMSE falls from roughly 1.0 to 0.64, and smaller for the empirical residual model (0.66 to 0.62). Note, however, that the approach that does not model the exposure density at all, and instead estimates all nuisance functions through regression, performs poorly, with the highest mean RMSE (1.07). This highlights that taking a ratio of two estimated nuisance functions leads to additional, undesirable instability. Overall, the simulation results support a concrete practical conclusion: for the one-step estimator considered here, finite-sample performance is determined primarily by accurate estimation of the density ratio $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$, while the remaining choice among the Gaussian, $t$, and empirical residual families is secondary once that component is well estimated.

7 Application: Assessing Health Impacts of PM2.5 Component Mixtures

We now evaluate an important public health question regarding the health impacts of long-term exposure to fine particulate matter (PM2.5) and its complex chemical mixtures. Specifically, we constructed a county-level dataset across the United States for the year 2019 to estimate the exposure-response relationship between PM2.5 constituents and age-adjusted hospitalization rates per 10,000 people for chronic obstructive pulmonary disease (COPD). We obtained health records from the CDC Environmental Public Health Tracking Network (EPHTN) and CDC WONDER. High-resolution estimates of PM2.5 mass and its chemical constituents were derived from the Atmospheric Composition Analysis Group (Van Donkelaar et al., 2019). To adjust for potential confounding, we integrated a broad set of sociodemographic, behavioral, and clinical covariates compiled from the U.S. Census Bureau and CDC surveillance systems. Our analysis focuses on a $q=5$ dimensional PM2.5 component mixture: black carbon (BC), nitrates (NO3), organic matter (OM), sulfates (SO4), and ammonium (NH4). All reported curves use the cross-fitted one-step estimator from Section 3 together with the empirical residual model shown to perform best in the simulation studies.

We study three types of intervention paths: single exposure shifts, shifts of groups of exposures, and the optimal shift in terms of reducing overall hospitalization rates. For the single exposure shifts, we use shifts of the form $\boldsymbol{\delta}_{j}=(0,\dots,t_{j},\dots,0)$ when studying exposure $j$. The magnitude $t_{j}$ is chosen to satisfy the corresponding Gelbrich constraint. These paths are useful for comparing feasible directions on the intervention manifold, though they do not necessarily isolate single-pollutant causal effects because the tilted law remains multivariate and continues to alter the entire mixture distribution, not just exposure $j$. We take a different approach for studying groups of exposures, where we consider three groups: 1) BC+OM, 2) NO3+SO4+NH4, and 3) the full five-pollutant mixture. We use a numerical algorithm to find the $\boldsymbol{\delta}$ value that shifts the means of each of the exposures within a group while holding the means of the other exposures constant. We do not apply this idea to the single pollutant shifts, because finding such a shift is not feasible in many cases due to the high correlations among the exposures. Finally, for the optimal policy we run the Riemannian BFGS algorithm from 100 independent random starting values, since the optimization problem is non-convex and global optimality is not guaranteed.

Figure 2: Estimated incremental effect curves for the five single-pollutant grid paths, the three bundle paths, and the BFGS path. Black points and vertical bars show $\widehat{\theta}(\boldsymbol{\delta})$ and its EIF-based $95\%$ confidence interval. Colored ribbons show the formal sensitivity-adjusted $95\%$ confidence bounds for two fixed sensitivity settings. Dashed blue curves show the formal benchmark with $k_{Y}=1$ and $k_{\alpha}=1$.

Figure 2 shows the main empirical results. Among the five single pollutant analyses, all estimated curves are negative, indicating a harmful effect of the air pollution mixture. The steepest declines are for SO4 and NH4, followed by NO3 and OM, with BC producing a smaller but still negative curve. Because the single exposure shifts may also shift other exposures due to correlation among them, we interpret these as rankings of feasible intervention directions rather than pollutant-specific dose-response effects. The grouped exposure analyses are therefore arguably more interesting and informative. The BC+OM path remains close to zero throughout the displayed Gelbrich range, indicating that these two exposures do not strongly impact COPD hospitalizations. The NO3+SO4+NH4 group and the all-exposures group, however, show markedly more pronounced effects, suggesting that these three exposures are likely driving the adverse health effects seen.

Figure 3: Distribution of the BFGS objective values across 100 random-start paths for each Gelbrich target.
Figure 4: Distribution of the BFGS $\boldsymbol{\delta}$ values for each exposure across 100 random-start paths for each Gelbrich target.

As expected, the BFGS path has the largest estimated effect, showing the biggest reductions in COPD hospitalizations possible at each level of the Gelbrich constraint. To provide more intuition about the results from the optimal policy shift, Figure 3 shows the distribution of the causal effect across the 100 different starting values, and Figure 4 shows the distribution of the $\boldsymbol{\delta}$ values for each exposure in the optimal exposure shift across the different starting values. We can see there is some heterogeneity across starting values in both figures, though certain coherent patterns do emerge. The exposures with the most negative tilts assigned to them are SO4 and NO3, highlighting that they are potentially the most impactful of the exposures in the air pollution mixture.
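
The multi-start idea can be sketched on a toy problem. The code below is not the Riemannian BFGS algorithm used in the analysis: it substitutes plain gradient descent with Armijo backtracking, a rescaling retraction, and an ellipsoidal level set $\{\boldsymbol{\delta}:\boldsymbol{\delta}^{\top}A\boldsymbol{\delta}=c^{2}\}$ standing in for the Gelbrich manifold. The objective, the matrix `A`, and all function names are hypothetical and chosen only to show how endpoints from random starts can be collected and compared.

```python
import numpy as np

def retract(delta, A, c):
    """Rescale a nonzero point onto the ellipsoid {d : d' A d = c^2}."""
    return c * delta / np.sqrt(delta @ A @ delta)

def tangent_project(delta, egrad, A):
    """Remove the component of the Euclidean gradient normal to the level set."""
    normal = A @ delta
    normal = normal / np.linalg.norm(normal)
    return egrad - (egrad @ normal) * normal

def descend(obj, grad, start, A, c, iters=200, gtol=1e-12):
    """Gradient descent on the ellipsoid with Armijo backtracking."""
    delta = retract(start, A, c)
    for _ in range(iters):
        g = tangent_project(delta, grad(delta), A)
        g2 = float(g @ g)
        if g2 < gtol:
            break
        f0, step, moved = obj(delta), 1.0, False
        while step > 1e-12:
            trial = retract(delta - step * g, A, c)
            if obj(trial) <= f0 - 1e-4 * step * g2:   # sufficient decrease
                delta, moved = trial, True
                break
            step *= 0.5
        if not moved:
            break
    return delta

def multi_start(obj, grad, A, c, n_starts=100, seed=0):
    """Run the local search from random starts and keep every endpoint."""
    rng = np.random.default_rng(seed)
    ends = np.array([descend(obj, grad, rng.standard_normal(A.shape[0]), A, c)
                     for _ in range(n_starts)])
    vals = np.array([obj(d) for d in ends])
    return ends[vals.argmin()], ends, vals

# Toy problem (illustrative only): ellipsoid from a correlated covariance, wavy objective
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
A = Sigma @ Sigma
b = np.array([0.5, 1.0])

def obj(d):
    return float(b @ d + 0.3 * np.sin(4.0 * d[1]))

def grad(d):
    return b + np.array([0.0, 1.2 * np.cos(4.0 * d[1])])

best, ends, vals = multi_start(obj, grad, A, c=0.2, n_starts=25)
```

Keeping `ends` and `vals` for every start, rather than only the best endpoint, is what allows the spread of objective values and of the fitted $\boldsymbol{\delta}$ components across starts to be inspected, in the spirit of Figures 3 and 4.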

Figure 2 also summarizes the sensitivity of our results to unmeasured confounding bias. The shaded ribbons correspond to two fixed sensitivity parameter settings, while the dashed blue curves implement the formal benchmark procedure described in Appendix E.7. Intuitively, $k_{Y}$ scales how much residual outcome variation an omitted confounder may explain relative to the strongest observed covariate, and $k_{D}$ scales how much additional RR variation it may explain relative to the strongest observed covariate for a given intervention path. In our data, the strongest outcome-side benchmark is the covariate White, whereas the RR-side benchmark is selected separately for each scenario. Under the benchmark $k_{Y}=1$ and $k_{D}=1$, the BC, OM, SO4, NH4, NO3+SO4+NH4, all-pollutant, and BFGS curves remain significantly negative throughout the displayed Gelbrich range, showing moderate sensitivity to unmeasured confounding. The NO3 path is more sensitive at the smallest Gelbrich constraints, though it becomes more robust for larger Gelbrich distances. Figure 5 provides an assessment of how large the sensitivity parameters would have to become in order to make any of the results insignificant for each exposure (or group of exposures) examined. We refer to this as the least favorable point because it is the level of confounding required to make the result insignificant for at least one Gelbrich distance, not for all distances simultaneously. The NO3 contour lies closest to the origin, indicating that comparatively small confounding would suffice to remove significance for at least one value of the Gelbrich constraint. At the other end, SO4, the all-pollutant equal-mean path, and the BFGS path lie farthest from the origin, so they require materially stronger confounding to overturn the estimated negative effects. BC, OM, NH4, and the NO3+SO4+NH4 group fall between these extremes, still showing a modest degree of robustness to confounding.

Figure 5: Sensitivity contours at the least favorable Gelbrich target for each direction with a negative estimated effect. Each curve gives the combinations of $\eta_{Y}^{2}$ and $\eta_{\alpha}^{2}(\boldsymbol{\delta})$ that move the sensitivity-adjusted upper confidence bound to zero; contours farther from the origin therefore indicate greater robustness to unmeasured confounding.

8 Discussion

In this manuscript we developed methodology for estimating the health effects of multiple air pollutants simultaneously in a way that is robust to the presence of severe positivity violations. By examining stochastic interventions with tilted exposure distributions, we can study which exposures are most harmful without relying on model-based extrapolation. One critical issue in the multivariate setting is how to define a fair shift that corresponds to similar shifts in the exposure distribution, which we do via the 2-Wasserstein distance. We provide asymptotic theory and minimax estimation rates for our proposed estimands, and show in a national study of the health effects of air pollution that there are detrimental effects of the air pollution mixture, but that these are largely driven by nitrates and sulfates.

There are a number of directions for future work that could expand upon, and improve, the methodology seen here. For one, our estimators are applicable to any stochastic shift estimand, and future research could target shifts other than the exponentially tilted ones seen here that maintain public health relevance and are potentially more interpretable for practitioners. Additionally, one could expand on the sensitivity analyses developed here by incorporating recent results on sensitivity analysis for multiple exposures (Zheng et al., 2021). These incorporate moderate parametric assumptions in the multiple exposure setting and allow one to produce partial identification regions that could be tighter than those seen here, and allow one to incorporate additional assumptions or sources of information, such as negative control variables. Overall, we believe the proposed framework provides analysts, particularly those involved in the analysis of air pollution mixtures, with robust approaches to estimating causal effects of multivariate, continuous exposures.

References

  • J. Antonelli, M. Mazumdar, D. Bellinger, D. Christiani, R. Wright, and B. Coull (2020) Estimating the health effects of environmental mixtures using bayesian semiparametric regression and sparsity inducing priors. The Annals of Applied Statistics. Cited by: §2.5.2.
  • J. Antonelli and C. Zigler (2024) Causal analysis of air pollution mixtures: estimands, positivity, and extrapolation. American Journal of Epidemiology 193 (10), pp. 1392–1398. Cited by: §1.
  • N. Biswas and L. Mackey (2024) Bounding wasserstein distance with couplings. Journal of the American Statistical Association 119 (548), pp. 2947–2958. Cited by: §2.4.
  • J. Blanchet and K. Murthy (2019) Quantifying distributional model risk via optimal transport. Mathematics of Operations Research 44 (2), pp. 565–600. Cited by: §2.4.
  • J. F. Bobb, L. Valeri, B. Claus Henn, D. C. Christiani, R. O. Wright, M. Mazumdar, J. J. Godleski, and B. A. Coull (2015) Causal inference for mixtures. Statistical science 30 (4), pp. 514–530. Cited by: §1, §2.5.2.
  • C. Bruns and N. Kallus (2024) Local effects of continuous instruments without positivity. arXiv preprint arXiv:2403.06450. Cited by: §1.
  • V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018) Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21 (1), pp. C1–C68. Cited by: §3.1.1.
  • V. Chernozhukov, C. Cinelli, W. Newey, A. Sharma, and V. Syrgkanis (2022) Long story short: omitted variable bias in causal machine learning. Technical report National Bureau of Economic Research. Cited by: §E.7, Appendix E, §5.1, §5.
  • C. Cinelli and C. Hazlett (2020) Making sense of sensitivity: extending omitted variable bias. Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (1), pp. 39–67. Cited by: §5.2.
  • I. Díaz and N. S. Hejazi (2020) Causal mediation analysis for stochastic interventions. Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (3), pp. 661–683. Cited by: §1, §2.2.
  • I. Díaz and M. J. van der Laan (2012) Population intervention causal effects based on stochastic interventions. Biometrics 68 (2), pp. 541–549. Cited by: §1.
  • I. Díaz, N. Williams, K. L. Hoffman, and E. J. Schenck (2023) Nonparametric causal effects based on longitudinal modified treatment policies. Journal of the American Statistical Association 118 (542), pp. 846–857. Cited by: §1.
  • J. Dorn and J. Guo (2024) Nonparametric estimation of local treatment effects with continuous instruments. Journal of Business & Economic Statistics, pp. 1–14. Cited by: §1.
  • J. C. Duchi and H. Namkoong (2021) Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics 49 (3), pp. 1378–1406. Cited by: §2.4.
  • F. Ferrari and D. B. Dunson (2020) Identifying main effects and interactions among exposures using gaussian processes. The annals of applied statistics 14 (4), pp. 1743. Cited by: §2.5.2.
  • R. Gao and A. Kleywegt (2023) Distributionally robust stochastic optimization with wasserstein distance. Mathematics of Operations Research 48 (2), pp. 603–655. Cited by: §2.4.
  • M. Gelbrich (1990) On a formula for the $L^{2}$ wasserstein metric between measures on euclidean and hilbert spaces. Mathematische Nachrichten 147 (1), pp. 185–203. Cited by: §2.4.
  • A. Hakobyan and I. Yang (2024) Wasserstein distributionally robust control of partially observable linear stochastic systems. IEEE Transactions on Automatic Control 69 (9), pp. 6121–6136. Cited by: §2.4.
  • S. Haneuse and A. Rotnitzky (2013) Estimation of the effect of interventions that modify the received treatment. Statistics in medicine 32 (30), pp. 5260–5277. Cited by: §1.
  • W. Huang, P. Absil, and K. A. Gallivan (2018) A riemannian bfgs method without differentiated retraction for nonconvex optimization problems. SIAM Journal on Optimization 28 (1), pp. 470–495. Cited by: Appendix B, Appendix B, Appendix B, Appendix B, §3.2, §3.2.
  • W. Huang, K. A. Gallivan, and P. Absil (2015) A broyden class of quasi-newton methods for riemannian optimization. SIAM Journal on Optimization 25 (3), pp. 1660–1685. Cited by: Appendix B, Appendix B, Appendix B, Appendix B, Appendix B, §3.2.
  • N. Kallus and E. Mbougwe (2024) Stochastic interventions, sensitivity analysis, and optimal transport. The Annals of Statistics 52 (2), pp. 522–545. Cited by: §1.
  • N. Kallus and M. Oprescu (2022) Doubly robust inference on causal derivative effects for continuous treatments. arXiv preprint arXiv:2203.01878. Cited by: §1.
  • E. H. Kennedy (2019) Non-parametric causal effects based on incremental propensity score interventions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 81 (4), pp. 719–742. Cited by: §1, §2.2.
  • E. H. Kennedy (2024) Semiparametric doubly robust targeted double machine learning: a review. Handbook of statistical methods for precision medicine, pp. 207–236. Cited by: Appendix D.
  • D. Kuhn, P. M. Esfahani, V. A. Nguyen, and S. Shafieezadeh-Abadeh (2019) Wasserstein distributionally robust optimization: theory and applications in machine learning. In Operations research & management science in the age of analytics, pp. 130–166. Cited by: §2.4.
  • A. R. Linero and Y. Yang (2018) Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society Series B: Statistical Methodology 80 (5), pp. 1087–1110. Cited by: item 2.
  • L. Malagò, L. Montrucchio, and G. Pistone (2018) Wasserstein riemannian geometry of gaussian densities. Information Geometry 1 (2), pp. 137–179. Cited by: Appendix B.
  • A. McClean, Y. Li, S. Bae, M. A. McAdams-DeMarco, I. Díaz, and W. Wu (2024) Fair comparisons of causal parameters with many treatments and positivity violations. arXiv preprint arXiv:2410.13522. Cited by: §1, §2.4.
  • D. B. McCoy, A. E. Hubbard, A. Schuler, and M. J. van der Laan (2023) Semiparametric discovery and estimation of interaction in mixed exposures using stochastic interventions. arXiv preprint arXiv:2305.01849. Cited by: §2.5.2.
  • P. Mohajerin Esfahani and D. Kuhn (2018) Data-driven distributionally robust optimization using the wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming 171 (1), pp. 115–166. Cited by: §2.4.
  • V. A. Nguyen, S. Shafiee, D. Filipović, and D. Kuhn (2021) Mean-covariance robust risk measurement. arXiv preprint arXiv:2112.09959. Cited by: §2.4.
  • V. M. Panaretos and Y. Zemel (2019) Statistical aspects of wasserstein distances. Annual review of statistics and its application 6 (1), pp. 405–431. Cited by: §2.4.
  • V. M. Panaretos and Y. Zemel (2020) An invitation to statistics in wasserstein space. Springer Nature. Cited by: §2.4.
  • T. P. Papp and C. Sherlock (2024) Scalable couplings for the random walk metropolis algorithm. Journal of the Royal Statistical Society Series B: Statistical Methodology, pp. qkae113. Cited by: §2.4.
  • A. Peters, R. Lall, and F. Dominici (2012) Causal inference for observed sudden changes in the composition of a multi-pollutant mixture: application to the effects of the utah valley steel mill closure. Epidemiology (Cambridge, Mass.) 23 (4), pp. 559. Cited by: §1.
  • N. Pfister and P. Bühlmann (2021) Extrapolation-aware nonparametric statistical inference. Journal of the Royal Statistical Society Series B: Statistical Methodology 83 (5), pp. 915–941. Cited by: §1.
  • C. Qi, K. A. Gallivan, and P. Absil (2010) Riemannian bfgs algorithm with applications. In Recent Advances in Optimization and its Applications in Engineering: The 14th Belgian-French-German Conference on Optimization, pp. 183–192. Cited by: §3.2.
  • T. S. Richardson and J. M. Robins (2013) Single world intervention graphs (swigs): a unification of the counterfactual and graphical approaches to causality. Center for the Statistics and the Social Sciences, University of Washington Series. Working Paper 128. Cited by: §1.
  • W. Ring and B. Wirth (2012) Optimization methods on riemannian manifolds and their application to shape space. SIAM Journal on Optimization 22 (2), pp. 596–627. Cited by: Appendix B, Appendix B, §3.2.
  • J. M. Robins, M. A. Hernán, and U. Siebert (2004) Effects of multiple interventions. In Comparative quantification of health risks: Global and regional burden of disease attributable to selected major risk factors, Vol. 1, pp. 2191–2230. Cited by: §1.
  • D. Rothenhäusler and B. Yu (2021) Incremental causal effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 83 (3), pp. 578–605. Cited by: §1.
  • D. B. Rubin (1980) Randomization analysis of experimental data: the fisher randomization test comment. Journal of the American statistical association 75 (371), pp. 591–593. Cited by: §2.1.
  • K. E. Rudolph, N. S. Hejazi, and M. J. van der Laan (2024) Propensity score weighting across counterfactual worlds: longitudinal effects under positivity violations. Statistical Methods in Medical Research 33 (1), pp. 137–154. Cited by: §1.
  • K. E. Rudolph, S. Inose, N. Williams, I. Diaz, L. Calderon, J. M. Torres, and M. Kioumourtzoglou (2025) Everything all at once: on choosing an estimand for multi-component environmental exposures. arXiv preprint arXiv:2509.17960. Cited by: §1.
  • S. Samanta and J. Antonelli (2022) Estimation and false discovery control for the analysis of environmental mixtures. Biostatistics 23 (4), pp. 1039–1055. Cited by: §2.5.2.
  • K. Schindl, S. Shen, and E. H. Kennedy (2024) Incremental effects for continuous exposures. arXiv preprint arXiv:2409.11967. Cited by: §1, §2.2, §2.2, §2.3, §3.1.2, §3.1.2, §4.
  • K. Schindl and L. Wasserman (2025) Causal geodesy: counterfactual estimation along the path between correlation and causation. arXiv preprint arXiv:2508.08499. Cited by: §1.
  • S. L. Taubman, J. M. Robins, M. A. Mittleman, and M. A. Hernán (2009) Intervening on risk factors for coronary heart disease: an application of the parametric g-formula. International journal of epidemiology 38 (6), pp. 1599–1611. Cited by: §1.
  • M. J. van der Laan (2017) A generally efficient targeted minimum loss based estimator based on the highly adaptive lasso. The International Journal of Biostatistics 13 (2), pp. 20160075. External Links: Document Cited by: §4.2.
  • A. Van Donkelaar, R. V. Martin, C. Li, and R. T. Burnett (2019) Regional estimates of chemical composition of fine particulate matter using a combined geoscience-statistical method with information from satellites, models, and monitors. Environmental science & technology 53 (5), pp. 2595–2611. Cited by: §7.
  • R. Wei, B. J. Reich, J. A. Hoppin, and S. Ghosal (2020) Sparse bayesian additive nonparametric regression with application to health effects of pesticides mixtures. Statistica Sinica 30 (1), pp. 55–79. Cited by: §2.5.2.
  • Q. Ye, G. A. Hanasusanto, and W. Xie (2024) Distributionally fair stochastic optimization using wasserstein distance. arXiv preprint arXiv:2402.01872. Cited by: §2.4.
  • J. G. Young, M. A. Hernán, and J. M. Robins (2011) Identification, estimation and approximation of risk under interventions that depend on the natural value of treatment using observational data. In Statistical models and causal inference: a dialogue with the social sciences, pp. 103–128. Cited by: §1.
  • Y. Zhang, P. Han, X. Wu, and G. Diao (2023) Nonparametric inference on dose-response curves without the positivity condition. Biometrika 110 (1), pp. 219–236. Cited by: §1.
  • J. Zheng, A. D’Amour, and A. Franks (2021) Bayesian inference and partial identification in multi-treatment causal inference with unobserved confounding. arXiv preprint arXiv:2111.07973. Cited by: §8.
  • W. Zheng and M. J. van der Laan (2011) Cross-validated targeted minimum-loss-based estimation. In Targeted Learning, pp. 459–474. Cited by: §3.1.1.

Appendix A Neyman-orthogonality and robustness to misspecified outcome model

Setup.

Let $\mu(\boldsymbol{x},\boldsymbol{w})=\mathbb{E}[Y\mid\boldsymbol{X}=\boldsymbol{x},\boldsymbol{W}=\boldsymbol{w}]$ denote the outcome regression, let $f(\boldsymbol{w}\mid\boldsymbol{x})$ be the observed exposure density, and let $g(\boldsymbol{w}\mid\boldsymbol{x})$ be the tilted density with density ratio $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})=g(\boldsymbol{w}\mid\boldsymbol{x})/f(\boldsymbol{w}\mid\boldsymbol{x})$. We write $\mathbb{E}_{f}[\cdot\mid\boldsymbol{X}]$ and $\mathbb{E}_{g}[\cdot\mid\boldsymbol{X}]$ for conditional expectations with respect to $f(\cdot\mid\boldsymbol{X})$ and $g(\cdot\mid\boldsymbol{X})$, respectively. Assume standard regularity conditions. Consider the efficient influence function for a scalar parameter $\psi$:

$$\varphi(\boldsymbol{Z};\psi,\mu,r):=r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\Big\{Y-\mathbb{E}_{g}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\Big\}+\mathbb{E}_{g}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]-\psi.$$

We utilize the identities:

$$\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]=1,\qquad\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]=\mathbb{E}_{g}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}].$$
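
Both identities follow directly from the definition of the density ratio. As a short worked step, for any integrable function $h$ (a generic placeholder introduced here for illustration),

$$\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\,h(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}=\boldsymbol{x}]=\int\frac{g(\boldsymbol{w}\mid\boldsymbol{x})}{f(\boldsymbol{w}\mid\boldsymbol{x})}\,h(\boldsymbol{x},\boldsymbol{w})\,f(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}=\int h(\boldsymbol{x},\boldsymbol{w})\,g(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}=\mathbb{E}_{g}[h(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}=\boldsymbol{x}].$$

Taking $h\equiv 1$ gives the first identity and $h=\mu$ gives the second. The cancellation presumes that $g(\cdot\mid\boldsymbol{x})$ places no mass where $f(\cdot\mid\boldsymbol{x})$ vanishes, which holds for exponential tilting whenever the normalizing constant is finite.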

(a) Orthogonality with respect to $\mu$.

Fix $r$ and perturb $\mu$ along a path $\mu_{\varepsilon}=\mu+\varepsilon h$. Since $\mathbb{E}[rY]$ and $\psi$ do not depend on $\varepsilon$, we have:

$$\begin{aligned}
\frac{d}{d\varepsilon}\,\mathbb{E}\{\varphi(\boldsymbol{Z};\psi,\mu_{\varepsilon},r)\}\Big|_{\varepsilon=0}&=\mathbb{E}\!\left[-r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\,\mathbb{E}_{g}[h(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]+\mathbb{E}_{g}[h(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\right]\\
&=\mathbb{E}\!\left[\big\{1-\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]\big\}\,\mathbb{E}_{g}[h(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\right]\\
&=0.
\end{aligned}$$

Hence, the influence function is Neyman-orthogonal with respect to $\mu$.

(b) Sensitivity in $r$.

Fix $\mu$ and perturb $r$ along a normalized path $r_{\varepsilon}=r(1+\varepsilon v)$, where $v(\boldsymbol{w},\boldsymbol{x})$ is a measurable function satisfying the constraint $\mathbb{E}[r_{\varepsilon}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]=1$ for all $\varepsilon$. Differentiating this constraint at $\varepsilon=0$ yields $\mathbb{E}_{f}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})v(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]=0$, which is equivalent to $\mathbb{E}_{g}[v(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]=0$.

Let $m_{\varepsilon}(\boldsymbol{X}):=\mathbb{E}_{g_{\varepsilon}}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]$, where $g_{\varepsilon}$ is the tilted density corresponding to $r_{\varepsilon}$. Since $g_{\varepsilon}(\boldsymbol{w}\mid\boldsymbol{x})=r_{\varepsilon}(\boldsymbol{w},\boldsymbol{x})f(\boldsymbol{w}\mid\boldsymbol{x})$, we have:

$$\frac{d}{d\varepsilon}m_{\varepsilon}(\boldsymbol{X})\Big|_{\varepsilon=0}=\mathbb{E}_{g}[v(\boldsymbol{W},\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}].$$

Let $m^{\prime}(\boldsymbol{X}):=\mathbb{E}_{g}[v\mu\mid\boldsymbol{X}]$. The influence function along the path is:

$$\varphi(\boldsymbol{Z};\psi,\mu,r_{\varepsilon})=r_{\varepsilon}(\boldsymbol{W},\boldsymbol{X})\{Y-m_{\varepsilon}(\boldsymbol{X})\}+m_{\varepsilon}(\boldsymbol{X})-\psi.$$

Differentiating the expected influence function yields:

$$\begin{aligned}
\frac{d}{d\varepsilon}\,\mathbb{E}\{\varphi(\boldsymbol{Z};\psi,\mu,r_{\varepsilon})\}\Big|_{\varepsilon=0}&=\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})v(\boldsymbol{W},\boldsymbol{X})\{Y-m_{0}(\boldsymbol{X})\}\big]+\mathbb{E}\big[(1-r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X}))\,m^{\prime}(\boldsymbol{X})\big]\\
&=\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})v(\boldsymbol{W},\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\big]\\
&\quad-\mathbb{E}\big[m_{0}(\boldsymbol{X})\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})v(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]\big]+\mathbb{E}\big[(1-r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X}))\,m^{\prime}(\boldsymbol{X})\big]\\
&=\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})v(\boldsymbol{W},\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\big]-\underbrace{\mathbb{E}[m_{0}(\boldsymbol{X})\cdot 0]}_{=\ 0}+\underbrace{\mathbb{E}\big[(1-r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X}))\,m^{\prime}(\boldsymbol{X})\big]}_{=\ 0}\\
&=\mathbb{E}\{\mathbb{E}_{g}[v(\boldsymbol{W},\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\}.
\end{aligned}$$

In general, $\mathbb{E}\{\mathbb{E}_{g}[v\mu\mid\boldsymbol{X}]\}\neq 0$. Therefore, the derivative does not vanish, implying that the influence function is not Neyman-orthogonal in the $r$ direction unless additional restrictive constraints are imposed on the perturbation $v$.

This non-orthogonality with respect to $r$ extends directly to the exposure density $f$. For a fixed target density $g$, any perturbation in $f$ induces a corresponding change in the density ratio $r=g/f$. Consider a perturbation path $f_{\varepsilon}=f(1+\varepsilon S)$, where $S(\boldsymbol{w}\mid\boldsymbol{x})$ is a standard score function satisfying $\mathbb{E}_{f}[S(\boldsymbol{W}\mid\boldsymbol{X})\mid\boldsymbol{X}]=0$. Specifically:

$$r_{\varepsilon}(\boldsymbol{w},\boldsymbol{x})=\frac{g(\boldsymbol{w}\mid\boldsymbol{x})}{f_{\varepsilon}(\boldsymbol{w}\mid\boldsymbol{x})}=\frac{g(\boldsymbol{w}\mid\boldsymbol{x})}{f(\boldsymbol{w}\mid\boldsymbol{x})(1+\varepsilon S)}=r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})(1+\varepsilon S)^{-1}.$$

Differentiating with respect to $\varepsilon$ at $\varepsilon=0$:

$$\frac{d}{d\varepsilon}r_{\varepsilon}\Big|_{\varepsilon=0}=-r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})S(\boldsymbol{w}\mid\boldsymbol{x}).$$

This corresponds to the perturbation $v=-S$ in the previous derivation for $r$. Substituting this into the derivative obtained in part (b), we get:

$$\begin{aligned}
\frac{d}{d\varepsilon}\,\mathbb{E}\{\varphi(\boldsymbol{Z};\psi,\mu,r_{\varepsilon})\}\Big|_{\varepsilon=0}&=\mathbb{E}\big\{\mathbb{E}_{g}[-S(\boldsymbol{W}\mid\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\big\}\\
&=-\mathbb{E}\big\{\mathbb{E}_{g}[S(\boldsymbol{W}\mid\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\big\}.
\end{aligned}$$

Since $\mathbb{E}_{g}[S\mu\mid\boldsymbol{X}]$ is generally non-zero (unless $\mu$ is constant or $S$ is orthogonal to $\mu$ under $g$), the derivative does not vanish. Thus, the influence function is not Neyman-orthogonal with respect to the exposure density $f$.

Robustness to misspecified outcome model.

Let $\mu_{0}$ be the true outcome regression and $r$ be the true density ratio. Define the target parameter:

$$\psi_{0}:=\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\,\mu_{0}(\boldsymbol{X},\boldsymbol{W})\big]=\mathbb{E}\Big[\mathbb{E}_{f}\{r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mu_{0}(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}\}\Big],$$

which represents the average outcome under the tilted distribution $g$. We now show that at $(\psi_{0},r)$, the influence function is globally robust to $\mu$. That is, for any measurable $\tilde{\mu}$:

$$\mathbb{E}\big[\varphi(\boldsymbol{Z};\psi_{0},\tilde{\mu},r)\big]=0.$$

Consequently, estimators based on the efficient influence function $\varphi$, such as the one-step estimator employed in this work, remain consistent for $\psi_{0}$ even under misspecification of the outcome regression model.

Proof.

Fix any measurable function $\tilde{\mu}$ and define:

$$\tilde{m}(\boldsymbol{X}):=\mathbb{E}_{f}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\tilde{\mu}(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}\big]=\mathbb{E}_{g}[\tilde{\mu}(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}].$$

Since $\tilde{m}$ depends only on $\boldsymbol{X}$ and $\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]=1$, we have:

$$\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\,\tilde{m}(\boldsymbol{X})\big]=\mathbb{E}\Big[\tilde{m}(\boldsymbol{X})\,\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]\Big]=\mathbb{E}\big[\tilde{m}(\boldsymbol{X})\big].$$

Thus, at $(\psi_{0},r)$:

$$\begin{aligned}
\mathbb{E}\big[\varphi(\boldsymbol{Z};\psi_{0},\tilde{\mu},r)\big]&=\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-\tilde{m}(\boldsymbol{X})\}\big]+\mathbb{E}[\tilde{m}(\boldsymbol{X})]-\psi_{0}\\
&=\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})Y]-\psi_{0}.
\end{aligned}$$

By the law of iterated expectations:

$$\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})Y]=\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\,\mu_{0}(\boldsymbol{X},\boldsymbol{W})\big]=\psi_{0}.$$

Therefore, $\mathbb{E}[\varphi(\boldsymbol{Z};\psi_{0},\tilde{\mu},r)]=0$ for all $\tilde{\mu}$. This confirms that consistent estimation of $\psi_{0}$ relies primarily on the consistency of $\widehat{r}_{\boldsymbol{\delta}}$, rendering the estimator robust to misspecification of $\mu$. ∎
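
The robustness result can be checked exactly on a small discrete model, where expectations are finite sums. The sketch below is a toy construction: the support, probabilities, tilt value, and function names are all illustrative assumptions rather than anything used in the paper. It evaluates the mean of the influence function once with a deliberately wrong outcome regression and the true density ratio, and once with the true outcome regression and a deliberately wrong density ratio.

```python
import math

# Toy finite-support model; every number below is an illustrative assumption.
P_X = {0: 0.4, 1: 0.6}                          # P(X = x)
F = {0: [0.5, 0.3, 0.2], 1: [0.2, 0.3, 0.5]}    # f(w | x) for w in {0, 1, 2}
DELTA = -0.7                                    # scalar tilt parameter
SUPPORT = range(3)

def mu0(x, w):
    """True outcome regression E[Y | X = x, W = w]."""
    return 1.0 + 0.5 * x - 0.3 * w + 0.2 * x * w

def tilted(x):
    """Tilted conditional law g(w | x), proportional to exp(DELTA * w) f(w | x)."""
    raw = [math.exp(DELTA * w) * F[x][w] for w in SUPPORT]
    total = sum(raw)
    return [v / total for v in raw]

G = {x: tilted(x) for x in P_X}

def true_ratio(w, x):
    """True density ratio g(w | x) / f(w | x)."""
    return G[x][w] / F[x][w]

PSI0 = sum(P_X[x] * G[x][w] * mu0(x, w) for x in P_X for w in SUPPORT)

def mean_phi(mu_tilde, ratio):
    """Exact E[phi(Z; PSI0, mu_tilde, ratio)], using E[Y | X, W] = mu0."""
    total = 0.0
    for x in P_X:
        # m(x): conditional mean of mu_tilde under the law implied by `ratio`
        m = sum(ratio(w, x) * F[x][w] * mu_tilde(x, w) for w in SUPPORT)
        for w in SUPPORT:
            total += P_X[x] * F[x][w] * (ratio(w, x) * (mu0(x, w) - m) + m - PSI0)
    return total

def bad_mu(x, w):
    """Deliberately misspecified outcome regression."""
    return 3.0 * w - x

def no_tilt_ratio(w, x):
    """Deliberately wrong density ratio that ignores the tilt."""
    return 1.0

bias_wrong_mu = mean_phi(bad_mu, true_ratio)    # zero up to rounding error
bias_wrong_r = mean_phi(mu0, no_tilt_ratio)     # not zero
```

In this toy model the first quantity vanishes for any choice of `bad_mu`, while the second equals the gap between the untilted and tilted mean outcomes, mirroring the asymmetry between $\mu$ and $r$ established above.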

Appendix B Validity for Riemannian BFGS on Gelbrich Constraint for exponential tilting

The convergence theory of Riemannian optimization algorithms, including Riemannian BFGS with Wolfe line search, is established in, e.g., Ring and Wirth (2012); Huang et al. (2015, 2018). We consider the deterministic level–set constraint

$$\mathcal{M}=\{\boldsymbol{\delta}\in\mathbb{R}^{d}:\,G(\boldsymbol{\delta})=c^{2}\},$$

where $G(\boldsymbol{\delta})$ is the squared Gelbrich distance between the marginal baseline distribution of $W$ and the marginal tilted distribution of $W$ induced by $\boldsymbol{\delta}$. The manifold $\mathcal{M}$ is equipped with the Riemannian metric induced by the Euclidean inner product on $\mathbb{R}^{d}$. Global convergence of cautious Riemannian BFGS in the sense that $\liminf_{k\to\infty}\|\operatorname{grad}\psi(\boldsymbol{\delta}_{k})\|=0$ follows from Huang et al. (2018, Theorem 4.2) under a compact level-set condition and Lipschitz continuous differentiability with respect to the chosen transport. We establish these properties for the marginal Gelbrich constraint by showing that the level set is a smooth embedded hypersurface, that a $C^{2}$ retraction and a continuous scaled transport with the pointwise norm-preserving property used in Huang et al. (2015, 2018) are available, and that $\psi$ is regular on a neighbourhood containing the line-search trial points.

Geometry of the Gelbrich constraint.

Let $f_{W}$ denote the marginal baseline law of $W$ on $\mathbb{R}^{d}$, and let

$$\boldsymbol{\mu}=\mathbb{E}_{f_{W}}[W]\in\mathbb{R}^{d},\qquad\boldsymbol{\Sigma}=\mathrm{Cov}_{f_{W}}(W)\in\mathbb{S}_{++}^{d}.$$

For $\boldsymbol{\delta}\in\mathcal{D}$ (an open set on which the marginal log–moment generating function is finite), define

$$M(\boldsymbol{\delta})=\mathbb{E}_{f_{W}}\!\left[e^{\boldsymbol{\delta}^{\top}W}\right],\qquad g_{\boldsymbol{\delta}}^{\mathrm{marg}}(w)=\exp\!\big(\boldsymbol{\delta}^{\top}w-\log M(\boldsymbol{\delta})\big)\,f_{W}(w).$$

Consequently, let $\boldsymbol{\mu}_{\boldsymbol{\delta}}=\mathbb{E}_{g_{\boldsymbol{\delta}}^{\mathrm{marg}}}[W]$ and $\boldsymbol{\Sigma}_{\boldsymbol{\delta}}=\mathrm{Cov}_{g_{\boldsymbol{\delta}}^{\mathrm{marg}}}(W)$. Equivalently,

$$\boldsymbol{\mu}_{\boldsymbol{\delta}}=\nabla_{\boldsymbol{\delta}}\log M(\boldsymbol{\delta})\in\mathbb{R}^{d},\qquad\boldsymbol{\Sigma}_{\boldsymbol{\delta}}=\nabla_{\boldsymbol{\delta}}^{2}\log M(\boldsymbol{\delta})\in\mathbb{S}_{++}^{d}.$$

The Gelbrich function is

$$G(\boldsymbol{\delta}):=\|\boldsymbol{\mu}_{\boldsymbol{\delta}}-\boldsymbol{\mu}\|_{2}^{2}+\mathrm{tr}\!\Big(\boldsymbol{\Sigma}+\boldsymbol{\Sigma}_{\boldsymbol{\delta}}-2\big(\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\boldsymbol{\Sigma}^{1/2}\big)^{1/2}\Big).$$
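
To make the definition concrete, consider the special case of a Gaussian baseline, $f_{W}=N(\boldsymbol{\mu},\boldsymbol{\Sigma})$. This is an assumption made purely for illustration and is not claimed for the application data. Then

$$\log M(\boldsymbol{\delta})=\boldsymbol{\delta}^{\top}\boldsymbol{\mu}+\tfrac{1}{2}\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta},\qquad\boldsymbol{\mu}_{\boldsymbol{\delta}}=\boldsymbol{\mu}+\boldsymbol{\Sigma}\boldsymbol{\delta},\qquad\boldsymbol{\Sigma}_{\boldsymbol{\delta}}=\boldsymbol{\Sigma},$$

so that $\big(\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\boldsymbol{\Sigma}^{1/2}\big)^{1/2}=\boldsymbol{\Sigma}$, the trace term vanishes, and

$$G(\boldsymbol{\delta})=\|\boldsymbol{\Sigma}\boldsymbol{\delta}\|_{2}^{2}=\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}^{2}\boldsymbol{\delta}.$$

In this case the level set $\{\boldsymbol{\delta}:G(\boldsymbol{\delta})=c^{2}\}$ is an ellipsoid, and a single-coordinate tilt $\boldsymbol{\delta}=t\,\boldsymbol{e}_{j}$, with $\boldsymbol{e}_{j}$ the $j$-th standard basis vector, meets the constraint when $|t|=c/\|\boldsymbol{\Sigma}\boldsymbol{e}_{j}\|_{2}$. Outside the Gaussian case the tilted covariance generally changes with $\boldsymbol{\delta}$, so the trace term contributes as well.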

Equip $\mathbb{R}^{d}$ with the Euclidean inner product and $\mathbb{S}^{d}$ with the Frobenius inner product $\langle\boldsymbol{A},\boldsymbol{B}\rangle=\mathrm{tr}(\boldsymbol{A}^{\top}\boldsymbol{B})$. For each $\boldsymbol{\delta}$, define the linear map $J_{\boldsymbol{\delta}}:\mathbb{R}^{d}\to\mathbb{S}^{d}$ by

\[ (J_{\boldsymbol{\delta}}\boldsymbol{e}_{k})_{ij}:=\frac{\partial(\boldsymbol{\Sigma}_{\boldsymbol{\delta}})_{ij}}{\partial\delta_{k}},\qquad k=1,\dots,d, \]

and extend by linearity so that $J_{\boldsymbol{\delta}}\boldsymbol{v}=\sum_{k=1}^{d}v_{k}J_{\boldsymbol{\delta}}\boldsymbol{e}_{k}$. Let $J_{\boldsymbol{\delta}}^{*}:\mathbb{S}^{d}\to\mathbb{R}^{d}$ be the adjoint of $J_{\boldsymbol{\delta}}$ with respect to these inner products, that is,

\[ \langle J_{\boldsymbol{\delta}}\boldsymbol{v},\boldsymbol{H}\rangle=\boldsymbol{v}^{\top}\big(J_{\boldsymbol{\delta}}^{*}\boldsymbol{H}\big)\qquad(\boldsymbol{v}\in\mathbb{R}^{d},\ \boldsymbol{H}\in\mathbb{S}^{d}), \]

equivalently, for any $\boldsymbol{v}\in\mathbb{R}^{d}$,

\[ \boldsymbol{v}^{\top}\big(J_{\boldsymbol{\delta}}^{*}\boldsymbol{H}\big)=\langle J_{\boldsymbol{\delta}}\boldsymbol{v},\boldsymbol{H}\rangle=\mathrm{tr}\!\big((J_{\boldsymbol{\delta}}\boldsymbol{v})^{\top}\boldsymbol{H}\big)=\sum_{k=1}^{d}v_{k}\,\mathrm{tr}\!\big((J_{\boldsymbol{\delta}}\boldsymbol{e}_{k})^{\top}\boldsymbol{H}\big). \]

Define

\[ \boldsymbol{T}_{\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\to\boldsymbol{\Sigma}}:=\boldsymbol{\Sigma}_{\boldsymbol{\delta}}^{-1/2}\big(\boldsymbol{\Sigma}_{\boldsymbol{\delta}}^{1/2}\boldsymbol{\Sigma}\,\boldsymbol{\Sigma}_{\boldsymbol{\delta}}^{1/2}\big)^{1/2}\boldsymbol{\Sigma}_{\boldsymbol{\delta}}^{-1/2}\in\mathbb{S}_{++}^{d}. \]

For $\boldsymbol{A}\in\mathbb{S}_{++}^{d}$, define the Lyapunov (Sylvester) operator $\mathcal{L}_{\boldsymbol{A}}(\boldsymbol{Y})=\boldsymbol{A}\boldsymbol{Y}+\boldsymbol{Y}\boldsymbol{A}$, and write $\mathcal{L}_{\boldsymbol{A}}^{-1}$ for its solution operator.
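Both objects just defined are straightforward to compute from an eigendecomposition. The sketch below is illustrative only: `spd_power`, `transport_map` and `lyapunov_solve` are hypothetical names, and the matrices are random positive definite stand-ins. It relies on the standard fact that the map $\boldsymbol{T}_{\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\to\boldsymbol{\Sigma}}$ satisfies $\boldsymbol{T}\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\boldsymbol{T}=\boldsymbol{\Sigma}$, which follows by direct substitution of the definition.

```python
import numpy as np

def spd_power(A, p):
    """A**p for a symmetric positive definite matrix, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(0.5 * (A + A.T))
    return (vecs * vals ** p) @ vecs.T

def transport_map(Sigma_d, Sigma):
    """T = Sigma_d^{-1/2} (Sigma_d^{1/2} Sigma Sigma_d^{1/2})^{1/2} Sigma_d^{-1/2}."""
    half = spd_power(Sigma_d, 0.5)
    inv_half = spd_power(Sigma_d, -0.5)
    return inv_half @ spd_power(half @ Sigma @ half, 0.5) @ inv_half

def lyapunov_solve(A, V):
    """Solve A Y + Y A = V for symmetric positive definite A."""
    vals, vecs = np.linalg.eigh(A)
    V_rot = vecs.T @ V @ vecs
    # In the eigenbasis of A the equation decouples entrywise.
    Y_rot = V_rot / (vals[:, None] + vals[None, :])
    return vecs @ Y_rot @ vecs.T

rng = np.random.default_rng(2)
d = 3
B1, B2, B3 = rng.normal(size=(3, d, d))
Sigma = B1 @ B1.T + np.eye(d)      # stand-in for the untilted covariance
Sigma_d = B2 @ B2.T + np.eye(d)    # stand-in for the tilted covariance
V = 0.5 * (B3 + B3.T)

T = transport_map(Sigma_d, Sigma)
Y = lyapunov_solve(Sigma, V)
```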

Lemma 2 (Exact gradient and generic regularity of Gelbrich level sets).

Assume $\log M(\boldsymbol{\delta})$ is finite on a nonempty open set $\mathcal{D}\subset\mathbb{R}^{d}$. Then $\log M(\boldsymbol{\delta})$ is real-analytic on $\mathcal{D}$, hence so are $\boldsymbol{\mu}_{\boldsymbol{\delta}}$ and $\boldsymbol{\Sigma}_{\boldsymbol{\delta}}$.

  1. (i)

    $G$ is $C^{1}$ on $\mathcal{D}$ and

    \[ \nabla G(\boldsymbol{\delta})=2\,\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\big(\boldsymbol{\mu}_{\boldsymbol{\delta}}-\boldsymbol{\mu}\big)+J_{\boldsymbol{\delta}}^{*}\Big(I-\boldsymbol{T}_{\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\to\boldsymbol{\Sigma}}\Big),\qquad\boldsymbol{\delta}\in\mathcal{D}. \]
  2. (ii)

    Let $\mathcal{C}=\{\boldsymbol{\delta}\in\mathcal{D}:\nabla G(\boldsymbol{\delta})=\boldsymbol{0}\}$ and $\mathcal{V}=G(\mathcal{C})\subset[0,\infty)$. Then $\mathcal{V}$ has Lebesgue measure zero and empty interior. In particular, for any $c>0$ with $c^{2}\notin\mathcal{V}$, the level set

    \[ \mathcal{M}_{c}:=\{\boldsymbol{\delta}\in\mathcal{D}:\ G(\boldsymbol{\delta})=c^{2}\} \]

    is a $C^{\infty}$ (indeed real-analytic) embedded hypersurface and $\nabla G(\boldsymbol{\delta})\neq\boldsymbol{0}$ for all $\boldsymbol{\delta}\in\mathcal{M}_{c}$.

Proof.

Write $G(\boldsymbol{\delta})=G_{\mathrm{mean}}(\boldsymbol{\delta})+G_{\mathrm{cov}}(\boldsymbol{\delta})$ with

\[ G_{\mathrm{mean}}(\boldsymbol{\delta})=\|\boldsymbol{\mu}_{\boldsymbol{\delta}}-\boldsymbol{\mu}\|_{2}^{2},\qquad G_{\mathrm{cov}}(\boldsymbol{\delta})=\Phi(\boldsymbol{\Sigma}_{\boldsymbol{\delta}}), \]

where

\[ \Phi(\boldsymbol{\Theta}):=\mathrm{tr}\!\Big(\boldsymbol{\Sigma}+\boldsymbol{\Theta}-2\big(\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Theta}\boldsymbol{\Sigma}^{1/2}\big)^{1/2}\Big),\qquad\boldsymbol{\Theta}\in\mathbb{S}_{++}^{d}. \]

Let $\boldsymbol{v}\in\mathbb{R}^{d}$ and denote the directional derivative by $D_{\boldsymbol{v}}$. Since $\mathrm{D}\boldsymbol{\mu}_{\boldsymbol{\delta}}[\boldsymbol{v}]=\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\boldsymbol{v}$ and $\boldsymbol{\Sigma}_{\boldsymbol{\delta}}$ is symmetric,

\[ D_{\boldsymbol{v}}G_{\mathrm{mean}}(\boldsymbol{\delta})=2\,\langle\boldsymbol{\mu}_{\boldsymbol{\delta}}-\boldsymbol{\mu},\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\boldsymbol{v}\rangle=2\,\boldsymbol{v}^{\top}\boldsymbol{\Sigma}_{\boldsymbol{\delta}}(\boldsymbol{\mu}_{\boldsymbol{\delta}}-\boldsymbol{\mu}), \]

hence $\nabla G_{\mathrm{mean}}(\boldsymbol{\delta})=2\,\boldsymbol{\Sigma}_{\boldsymbol{\delta}}(\boldsymbol{\mu}_{\boldsymbol{\delta}}-\boldsymbol{\mu})$.

For $\boldsymbol{H}\in\mathbb{S}^{d}$, the first Fréchet derivative of $\Phi$ is given by

\[ \mathrm{D}\Phi(\boldsymbol{\Theta})[\boldsymbol{H}]=\langle I-\boldsymbol{T}_{\boldsymbol{\Theta}\to\boldsymbol{\Sigma}},\boldsymbol{H}\rangle, \]

where $\boldsymbol{T}_{\boldsymbol{\Theta}\to\boldsymbol{\Sigma}}:=\boldsymbol{\Theta}^{-1/2}(\boldsymbol{\Theta}^{1/2}\boldsymbol{\Sigma}\boldsymbol{\Theta}^{1/2})^{1/2}\boldsymbol{\Theta}^{-1/2}$. This follows from the Fréchet derivative of the principal matrix square root, $\mathrm{D}(\boldsymbol{A}^{1/2})[\boldsymbol{V}]=\mathcal{L}_{\boldsymbol{A}^{1/2}}^{-1}(\boldsymbol{V})$ (Malagò et al., 2018, Eqs. (14)–(16)). In particular, for $\boldsymbol{A}\in\mathbb{S}_{++}^{d}$ and $\boldsymbol{V}\in\mathbb{S}^{d}$,

\[ \mathrm{D}\,\mathrm{tr}(\boldsymbol{A}^{1/2})[\boldsymbol{V}]=\tfrac{1}{2}\,\mathrm{tr}(\boldsymbol{A}^{-1/2}\boldsymbol{V}). \]

With $\boldsymbol{A}=\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Theta}\boldsymbol{\Sigma}^{1/2}$ and $\boldsymbol{V}=\boldsymbol{\Sigma}^{1/2}\boldsymbol{H}\boldsymbol{\Sigma}^{1/2}$,

\[ \mathrm{D}\Phi(\boldsymbol{\Theta})[\boldsymbol{H}]=\mathrm{tr}(\boldsymbol{H})-\mathrm{tr}\!\Big(\boldsymbol{\Sigma}^{1/2}\boldsymbol{A}^{-1/2}\boldsymbol{\Sigma}^{1/2}\boldsymbol{H}\Big)=\langle I-\boldsymbol{T}_{\boldsymbol{\Theta}\to\boldsymbol{\Sigma}},\,\boldsymbol{H}\rangle. \]

By the chain rule and the definition of $J_{\boldsymbol{\delta}}$,

\[ D_{\boldsymbol{v}}G_{\mathrm{cov}}(\boldsymbol{\delta})=\mathrm{D}\Phi(\boldsymbol{\Sigma}_{\boldsymbol{\delta}})\big[J_{\boldsymbol{\delta}}\boldsymbol{v}\big]=\langle I-\boldsymbol{T}_{\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\to\boldsymbol{\Sigma}},J_{\boldsymbol{\delta}}\boldsymbol{v}\rangle=\boldsymbol{v}^{\top}J_{\boldsymbol{\delta}}^{*}\big(I-\boldsymbol{T}_{\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\to\boldsymbol{\Sigma}}\big), \]

which implies $\nabla G_{\mathrm{cov}}(\boldsymbol{\delta})=J_{\boldsymbol{\delta}}^{*}\big(I-\boldsymbol{T}_{\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\to\boldsymbol{\Sigma}}\big)$. Summing these terms yields the gradient in (i).

For part (ii), $\log M(\boldsymbol{\delta})$ is real-analytic on $\mathcal{D}$, hence so are $\boldsymbol{\mu}_{\boldsymbol{\delta}}$ and $\boldsymbol{\Sigma}_{\boldsymbol{\delta}}$. Because the principal matrix square root is real-analytic on $\mathbb{S}_{++}^{d}$, $G$ is a composition of real-analytic maps and is therefore real-analytic on $\mathcal{D}$. In particular $G$ is $C^{\infty}$ on $\mathcal{D}$, and Sard's theorem yields that the set of critical values $\mathcal{V}$ has Lebesgue measure zero. Since $\mathcal{V}$ has Lebesgue measure zero, it has empty interior. For any $c>0$ with $c^{2}\notin\mathcal{V}$, every $\boldsymbol{\delta}\in\mathcal{M}_{c}$ satisfies $\nabla G(\boldsymbol{\delta})\neq\boldsymbol{0}$; the regular level-set theorem yields that $\mathcal{M}_{c}$ is a $C^{\infty}$ (indeed real-analytic) embedded hypersurface. ∎
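The Fréchet derivative formula for $\Phi$ used in the proof can be sanity-checked by central finite differences. The sketch below uses random positive definite stand-ins for $\boldsymbol{\Sigma}$ and $\boldsymbol{\Theta}$ and a random symmetric direction $\boldsymbol{H}$; the helper names are illustrative, not the authors' code.

```python
import numpy as np

def spd_power(A, p):
    """A**p for a symmetric positive definite matrix, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(0.5 * (A + A.T))
    return (vecs * vals ** p) @ vecs.T

def Phi(Theta, Sigma):
    """Covariance part of the Gelbrich function."""
    root = spd_power(Sigma, 0.5)
    return np.trace(Sigma + Theta - 2.0 * spd_power(root @ Theta @ root, 0.5))

def transport_map(Theta, Sigma):
    """T_{Theta -> Sigma} as defined in the text."""
    half, inv_half = spd_power(Theta, 0.5), spd_power(Theta, -0.5)
    return inv_half @ spd_power(half @ Sigma @ half, 0.5) @ inv_half

rng = np.random.default_rng(1)
d = 3
B1, B2, B3 = rng.normal(size=(3, d, d))
Sigma = B1 @ B1.T + np.eye(d)
Theta = B2 @ B2.T + np.eye(d)
H = 0.5 * (B3 + B3.T)              # symmetric perturbation direction

h = 1e-6
fd = (Phi(Theta + h * H, Sigma) - Phi(Theta - h * H, Sigma)) / (2 * h)
closed_form = np.trace((np.eye(d) - transport_map(Theta, Sigma)) @ H)
```

The two numbers `fd` and `closed_form` should agree up to finite-difference error.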

Projection retraction and scaled vector transport.

Fix $c>0$ such that $c^{2}\notin\mathcal{V}$ and write $\mathcal{M}_{c}=\{\boldsymbol{\delta}\in\mathcal{D}:\,G(\boldsymbol{\delta})=c^{2}\}$. For $\boldsymbol{\delta}\in\mathcal{M}_{c}$, let $\boldsymbol{n}(\boldsymbol{\delta}):=\nabla G(\boldsymbol{\delta})/\|\nabla G(\boldsymbol{\delta})\|$ be the unit normal vector and denote the orthogonal projection onto the tangent space $T_{\boldsymbol{\delta}}\mathcal{M}_{c}$ by

\[ P_{\boldsymbol{\delta}}:=I-\boldsymbol{n}(\boldsymbol{\delta})\boldsymbol{n}(\boldsymbol{\delta})^{\top}. \]

Let $\Pi:\mathcal{N}\to\mathcal{M}_{c}$ be the orthogonal projection from a tubular neighbourhood $\mathcal{N}$ of $\mathcal{M}_{c}$. For $\boldsymbol{\delta}\in\mathcal{M}_{c}$, define the retraction $R_{\boldsymbol{\delta}}$ on a sufficiently small ball $\mathcal{B}_{\boldsymbol{\delta}}\subset T_{\boldsymbol{\delta}}\mathcal{M}_{c}$ by

\[ R_{\boldsymbol{\delta}}(\boldsymbol{\xi}):=\Pi(\boldsymbol{\delta}+\boldsymbol{\xi}),\qquad\boldsymbol{\xi}\in\mathcal{B}_{\boldsymbol{\delta}}. \]

Define the differentiated-retraction transport

\[ T^{R}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta}):=\mathrm{D}R_{\boldsymbol{\delta}}(\boldsymbol{\xi})[\boldsymbol{\zeta}],\qquad\boldsymbol{\zeta}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}. \]

Following Huang et al. (2015, Eq. (4.3)), define the scaled transport

\[ T^{S}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta}):=\begin{cases}\dfrac{\|\boldsymbol{\zeta}\|}{\|T^{R}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta})\|}\,T^{R}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta}),&\boldsymbol{\zeta}\neq\boldsymbol{0},\\[6pt] \boldsymbol{0},&\boldsymbol{\zeta}=\boldsymbol{0},\end{cases} \]

which is norm-preserving and satisfies the requirements (2.5)–(2.8) therein. For a step $\boldsymbol{\xi}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}$ and $\boldsymbol{\delta}_{+}:=R_{\boldsymbol{\delta}}(\boldsymbol{\xi})$, define

\[ \mathcal{T}_{\boldsymbol{\delta}\to\boldsymbol{\delta}_{+}}(\boldsymbol{\zeta}):=T^{S}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta}),\qquad\boldsymbol{\zeta}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}. \]
Lemma 3 (Validity of transport).

The retraction $R$ and the transport $\mathcal{T}$ defined above satisfy:

  1. (a)

    there exists an open neighbourhood $\mathcal{U}\subset T\mathcal{M}_{c}$ of the zero section such that $R_{\boldsymbol{\delta}}$ and $\mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}$ are well defined for $(\boldsymbol{\delta},\boldsymbol{\xi})\in\mathcal{U}$ and are $C^{0}$ in all arguments;

  2. (b)

    $R_{\boldsymbol{\delta}}(\boldsymbol{0})=\boldsymbol{\delta}$ and $\mathrm{D}R_{\boldsymbol{\delta}}(\boldsymbol{0})=\mathrm{Id}_{T_{\boldsymbol{\delta}}\mathcal{M}_{c}}$, and for every $\boldsymbol{\zeta}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}$,

    \[ \mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}(\boldsymbol{\zeta})=T^{R}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta})+O(\|\boldsymbol{\xi}\|\,\|\boldsymbol{\zeta}\|)\quad(\boldsymbol{\xi}\to\boldsymbol{0}), \]

    and in particular $T^{R}_{\boldsymbol{\delta},\boldsymbol{0}}=\mathrm{Id}_{T_{\boldsymbol{\delta}}\mathcal{M}_{c}}$;

  3. (c)

    for every $\boldsymbol{\delta}\in\mathcal{M}_{c}$ and $\boldsymbol{\xi}\in\mathcal{B}_{\boldsymbol{\delta}}$,

    \[ \|\mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}(\boldsymbol{\zeta})\|=\|\boldsymbol{\zeta}\|\qquad(\boldsymbol{\zeta}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}), \]

    so for each fixed $(\boldsymbol{\delta},\boldsymbol{\xi})$, the map $\boldsymbol{\zeta}\mapsto\mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}(\boldsymbol{\zeta})$ is pointwise norm-preserving on $T_{\boldsymbol{\delta}}\mathcal{M}_{c}$, and $\mathcal{T}$ is uniformly bounded on compact subsets of $\mathcal{M}_{c}$ in the sense that

    \[ \|\mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}(\boldsymbol{\zeta})\|\leq\|\boldsymbol{\zeta}\| \]

    for all admissible $(\boldsymbol{\delta},\boldsymbol{\xi},\boldsymbol{\zeta})$.

Proof.

Property (a) follows from the tubular neighbourhood theorem and the $C^{1}$ smoothness of $\Pi$: $R_{\boldsymbol{\delta}}(\boldsymbol{\xi})=\Pi(\boldsymbol{\delta}+\boldsymbol{\xi})$ is well defined and $C^{1}$ for $\boldsymbol{\xi}$ small, and $\mathrm{D}R_{\boldsymbol{\delta}}(\boldsymbol{\xi})$ is a linear map between tangent spaces for $\|\boldsymbol{\xi}\|$ sufficiently small. The scaled transport $T^{S}$ satisfies the continuity and first-order conditions invoked in Huang et al. (2015), so $\mathcal{T}$ is continuous in all arguments.

For (b), $\Pi(\boldsymbol{\delta})=\boldsymbol{\delta}$ and $\mathrm{D}\Pi(\boldsymbol{\delta})=P_{\boldsymbol{\delta}}$, hence $R_{\boldsymbol{\delta}}(\boldsymbol{0})=\boldsymbol{\delta}$ and $\mathrm{D}R_{\boldsymbol{\delta}}(\boldsymbol{0})=P_{\boldsymbol{\delta}}$. Since $P_{\boldsymbol{\delta}}$ is the identity on $T_{\boldsymbol{\delta}}\mathcal{M}_{c}$, $\mathrm{D}R_{\boldsymbol{\delta}}(\boldsymbol{0})=\mathrm{Id}_{T_{\boldsymbol{\delta}}\mathcal{M}_{c}}$. The first-order agreement between $\mathcal{T}$ and the differentiated retraction $T^{R}$ follows from Huang et al. (2015, (2.5)–(2.7), Lem. 3.5).

For (c), by construction,

\[ \|T^{S}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta})\|=\|\boldsymbol{\zeta}\|\qquad(\boldsymbol{\zeta}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}), \]

so $\mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}=T^{S}_{\boldsymbol{\delta},\boldsymbol{\xi}}$ is pointwise norm-preserving:

\[ \|\mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}(\boldsymbol{\zeta})\|=\|\boldsymbol{\zeta}\|\qquad(\boldsymbol{\zeta}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}). \]

The stated uniform boundedness on compact subsets then follows with constant $1$. ∎
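To make the retraction and scaled transport concrete, the sketch below works on a stand-in level set where the orthogonal projection is available in closed form: it assumes $G(\boldsymbol{\delta})=\|\boldsymbol{\delta}\|^{2}$, so that $\mathcal{M}_{c}$ is the sphere of radius $c$. This choice is an assumption made purely for illustration; for the Gelbrich function the projection $\Pi$ would have to be computed numerically. All function names are hypothetical.

```python
import numpy as np

c = 1.5  # radius of the stand-in level set {delta : ||delta||^2 = c^2}

def project(x):
    """Orthogonal projection onto the sphere of radius c (requires x != 0)."""
    return c * x / np.linalg.norm(x)

def tangent_projector(delta):
    """P_delta = I - n n^T with n the unit normal at delta."""
    n = delta / np.linalg.norm(delta)
    return np.eye(delta.size) - np.outer(n, n)

def retract(delta, xi):
    """Projection retraction R_delta(xi) = Pi(delta + xi)."""
    return project(delta + xi)

def transport_R(delta, xi, zeta):
    """Differentiated retraction D R_delta(xi)[zeta] for the sphere."""
    x = delta + xi
    r = np.linalg.norm(x)
    u = x / r
    return (c / r) * (zeta - u * (u @ zeta))

def transport_S(delta, xi, zeta):
    """Scaled transport: rescale transport_R so that the norm of zeta is kept."""
    moved = transport_R(delta, xi, zeta)
    norm_moved = np.linalg.norm(moved)
    if norm_moved == 0.0:  # covers zeta = 0
        return np.zeros_like(zeta)
    return (np.linalg.norm(zeta) / norm_moved) * moved

delta = np.array([c, 0.0, 0.0])
P = tangent_projector(delta)
xi = P @ np.array([0.0, 0.3, -0.1])     # tangent step
zeta = P @ np.array([0.0, -0.2, 0.4])   # tangent vector to be transported
delta_plus = retract(delta, xi)
moved = transport_S(delta, xi, zeta)
```

In this example `moved` is tangent to the sphere at `delta_plus` and has the same norm as `zeta`, matching property (c) of Lemma 3.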

Objective regularity for the global target.

For $\boldsymbol{x}\in\mathcal{X}$ and $\boldsymbol{\delta}\in U$, define the local tilted conditional mean

\[ \eta_{\boldsymbol{\delta}}(\boldsymbol{x}):=\mathbb{E}_{f}[\mu(\boldsymbol{x},W)e^{\boldsymbol{\delta}^{\top}W}\mid X=\boldsymbol{x}],\qquad\nu_{\boldsymbol{\delta}}(\boldsymbol{x}):=\mathbb{E}_{f}[e^{\boldsymbol{\delta}^{\top}W}\mid X=\boldsymbol{x}], \]
\[ m_{\boldsymbol{\delta}}(\boldsymbol{x}):=\frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{x})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})},\qquad\psi(\boldsymbol{\delta}):=\mathbb{E}\big\{m_{\boldsymbol{\delta}}(X)\big\}. \]
Lemma 4 (Conditional regularity of the local tilted mean).

Assume $|\mu(x,w)|\leq C(1+\|w\|^{p})$ for some finite constants $C,p>0$. Let $U\subset\mathbb{R}^{d}$ be open and let $K\subset U$ be compact. Assume

\[ A(X):=\sup_{\boldsymbol{\delta}\in U}\mathbb{E}_{f}\!\big[e^{\boldsymbol{\delta}^{\top}W}(1+\|W\|^{p+2})\mid X\big]<\infty\quad\text{a.s.},\qquad\mathbb{E}\big[A(X)^{2}\big]<\infty, \tag{3} \]

and

\[ \inf_{\boldsymbol{\delta}\in K}\nu_{\boldsymbol{\delta}}(X)\geq D_{\min}>0\quad\text{a.s.} \tag{4} \]

Then for almost every $X$, the map $\boldsymbol{\delta}\mapsto m_{\boldsymbol{\delta}}(X)$ is $C^{2}$ on $K$. Moreover, there exist finite constants $C_{0},C_{1},C_{2}$, depending only on $C$, $p$, and $D_{\min}$, such that with

\[ B_{j}(X):=C_{j}\{1+A(X)+A(X)^{2}\},\qquad j=0,1,2, \]

we have almost surely

\[ \sup_{\boldsymbol{\delta}\in K}|m_{\boldsymbol{\delta}}(X)|\leq B_{0}(X),\qquad\sup_{\boldsymbol{\delta}\in K}\|\nabla m_{\boldsymbol{\delta}}(X)\|\leq B_{1}(X),\qquad\sup_{\boldsymbol{\delta}\in K}\|\nabla^{2}m_{\boldsymbol{\delta}}(X)\|\leq B_{2}(X), \]

and $\mathbb{E}[B_{j}(X)]<\infty$ for $j=0,1,2$.

Proof.

Fix $\boldsymbol{\delta}\in U$. By (3), the maps $\eta_{\boldsymbol{\delta}}(X)$ and $\nu_{\boldsymbol{\delta}}(X)$ are well defined almost surely, and differentiation under the conditional expectation is justified up to second order because $e^{\boldsymbol{\delta}^{\top}W}(1+\|W\|^{p+2})$ dominates the first- and second-order directional derivatives of both integrands. Thus, for almost every $X$,

\[ \nabla\eta_{\boldsymbol{\delta}}(X)=\mathbb{E}_{f}[\mu(X,W)We^{\boldsymbol{\delta}^{\top}W}\mid X],\qquad\nabla^{2}\eta_{\boldsymbol{\delta}}(X)=\mathbb{E}_{f}[\mu(X,W)WW^{\top}e^{\boldsymbol{\delta}^{\top}W}\mid X], \]
\[ \nabla\nu_{\boldsymbol{\delta}}(X)=\mathbb{E}_{f}[We^{\boldsymbol{\delta}^{\top}W}\mid X],\qquad\nabla^{2}\nu_{\boldsymbol{\delta}}(X)=\mathbb{E}_{f}[WW^{\top}e^{\boldsymbol{\delta}^{\top}W}\mid X]. \]

The polynomial growth bound on $\mu$ implies that there is a finite constant $C_{\eta}$, depending only on $C$ and $p$, such that almost surely

\[ \sup_{\boldsymbol{\delta}\in K}\Big(|\eta_{\boldsymbol{\delta}}(X)|+\|\nabla\eta_{\boldsymbol{\delta}}(X)\|+\|\nabla^{2}\eta_{\boldsymbol{\delta}}(X)\|+|\nu_{\boldsymbol{\delta}}(X)|+\|\nabla\nu_{\boldsymbol{\delta}}(X)\|+\|\nabla^{2}\nu_{\boldsymbol{\delta}}(X)\|\Big)\leq C_{\eta}A(X). \]

By (4), the quotient rule gives, for almost every $X$,

\[ \nabla m_{\boldsymbol{\delta}}(X)=\frac{\nabla\eta_{\boldsymbol{\delta}}(X)}{\nu_{\boldsymbol{\delta}}(X)}-\frac{\eta_{\boldsymbol{\delta}}(X)\nabla\nu_{\boldsymbol{\delta}}(X)}{\nu_{\boldsymbol{\delta}}(X)^{2}}, \]
\[ \nabla^{2}m_{\boldsymbol{\delta}}(X)=\frac{\nabla^{2}\eta_{\boldsymbol{\delta}}(X)}{\nu_{\boldsymbol{\delta}}(X)}-\frac{\nabla\eta_{\boldsymbol{\delta}}(X)\nabla\nu_{\boldsymbol{\delta}}(X)^{\top}+\nabla\nu_{\boldsymbol{\delta}}(X)\nabla\eta_{\boldsymbol{\delta}}(X)^{\top}+\eta_{\boldsymbol{\delta}}(X)\nabla^{2}\nu_{\boldsymbol{\delta}}(X)}{\nu_{\boldsymbol{\delta}}(X)^{2}}+\frac{2\eta_{\boldsymbol{\delta}}(X)\nabla\nu_{\boldsymbol{\delta}}(X)\nabla\nu_{\boldsymbol{\delta}}(X)^{\top}}{\nu_{\boldsymbol{\delta}}(X)^{3}}. \]

Therefore there exist finite constants $C_{0},C_{1},C_{2}$, depending only on $C$, $p$, and $D_{\min}$, such that

\[ \sup_{\boldsymbol{\delta}\in K}|m_{\boldsymbol{\delta}}(X)|\leq C_{0}\{1+A(X)\},\qquad\sup_{\boldsymbol{\delta}\in K}\|\nabla m_{\boldsymbol{\delta}}(X)\|\leq C_{1}\{1+A(X)+A(X)^{2}\}, \]
\[ \sup_{\boldsymbol{\delta}\in K}\|\nabla^{2}m_{\boldsymbol{\delta}}(X)\|\leq C_{2}\{1+A(X)+A(X)^{2}\}\qquad\text{a.s.} \]

Since $\mathbb{E}[A(X)^{2}]<\infty$, the envelopes $B_{j}(X)=C_{j}\{1+A(X)+A(X)^{2}\}$ are integrable. This proves the claim. ∎
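The quotient-rule gradient of $m_{\boldsymbol{\delta}}$ lends itself to a direct Monte Carlo check. The sketch below fixes one covariate value, replaces the conditional law of $W$ by a simulated sample, and uses a made-up outcome regression; both the sample and the regression are assumptions for illustration only, as is the function name `tilted_mean_and_grad`.

```python
import numpy as np

def tilted_mean_and_grad(mu_vals, W, delta):
    """m_delta = eta/nu and its gradient via the quotient rule, from a sample of W."""
    tilt = np.exp(W @ delta)
    nu = tilt.mean()
    eta = (mu_vals * tilt).mean()
    grad_nu = (W * tilt[:, None]).mean(axis=0)
    grad_eta = (W * (mu_vals * tilt)[:, None]).mean(axis=0)
    m = eta / nu
    grad_m = grad_eta / nu - eta * grad_nu / nu ** 2
    return m, grad_m

rng = np.random.default_rng(3)
W = rng.normal(size=(400, 2))                   # draws standing in for W | X = x
mu_vals = 1.0 + W[:, 0] - 0.5 * W[:, 1] ** 2    # hypothetical mu(x, W)
delta = np.array([0.2, -0.1])

m, grad_m = tilted_mean_and_grad(mu_vals, W, delta)

# Central finite differences in each coordinate of delta.
h = 1e-6
fd_grad = np.array([
    (tilted_mean_and_grad(mu_vals, W, delta + h * e)[0]
     - tilted_mean_and_grad(mu_vals, W, delta - h * e)[0]) / (2 * h)
    for e in np.eye(2)
])
```

Algebraically, `grad_m` equals the covariance between $\mu(x,W)$ and $W$ under the tilted weights, which is one way to read the quotient-rule expression.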

Corollary 1 (Objective regularity and line search on Gelbrich level sets).

Let $c>0$ satisfy $c^{2}\notin\mathcal{V}$ and set $\mathcal{M}=\{\boldsymbol{\delta}\in\mathbb{R}^{d}:\,G(\boldsymbol{\delta})=c^{2}\}$. Assume the boundary-separation condition

\[ G^{-}_{\infty}:=\liminf_{\|\boldsymbol{\delta}\|\to\infty,\ \boldsymbol{\delta}\in\mathcal{D}}G(\boldsymbol{\delta})>c^{2}. \tag{5} \]

Choose $R_{c}>0$ such that

\[ G(\boldsymbol{\delta})>c^{2}\qquad(\boldsymbol{\delta}\in\mathcal{D},\ \|\boldsymbol{\delta}\|\geq R_{c}), \tag{6} \]

and define $K_{c}:=\overline{B}(\boldsymbol{0},R_{c})$. Let $R$ and $\mathcal{T}$ be as defined above, and let the Riemannian metric be the Euclidean metric on $T_{\boldsymbol{\delta}}\mathcal{M}$. Assume the conditions of Lemma 4 on an open neighbourhood $U$ of $K_{c}$. Assume also the constraint regularity on $U$:

\[ \inf_{\boldsymbol{\delta}\in U\cap\mathcal{M}}\|\nabla G(\boldsymbol{\delta})\|\geq\epsilon_{\nabla}>0,\qquad\sup_{\boldsymbol{\delta}\in U}\|\nabla^{2}G(\boldsymbol{\delta})\|\leq M_{G}<\infty. \tag{7} \]

Then $\mathcal{M}$ is compact, and:

  1. (a)

    $\psi$ is $C^{2}$ on $K_{c}$, with

    \[ \nabla\psi(\boldsymbol{\delta})=\mathbb{E}\big[\nabla m_{\boldsymbol{\delta}}(X)\big],\qquad\nabla^{2}\psi(\boldsymbol{\delta})=\mathbb{E}\big[\nabla^{2}m_{\boldsymbol{\delta}}(X)\big], \]

    and

    \[ \sup_{\boldsymbol{\delta}\in K_{c}}\|\nabla^{2}\psi(\boldsymbol{\delta})\|\leq M_{\psi}<\infty. \]

    Consequently, $\nabla\psi$ is $M_{\psi}$-Lipschitz on $K_{c}$.

  2. (b)

    Writing $\boldsymbol{n}(\boldsymbol{\delta})=\nabla G(\boldsymbol{\delta})/\|\nabla G(\boldsymbol{\delta})\|$ and $P(\boldsymbol{\delta})=I-\boldsymbol{n}(\boldsymbol{\delta})\boldsymbol{n}(\boldsymbol{\delta})^{\top}$, the Riemannian gradient $\operatorname{grad}\psi(\boldsymbol{\delta})=P(\boldsymbol{\delta})\nabla\psi(\boldsymbol{\delta})$ is Lipschitz on $\mathcal{M}$, that is, there exists $L>0$ such that

    \[ \|\operatorname{grad}\psi(\boldsymbol{\delta})-\operatorname{grad}\psi(\boldsymbol{\delta}^{\prime})\|\leq L\|\boldsymbol{\delta}-\boldsymbol{\delta}^{\prime}\|\qquad(\boldsymbol{\delta},\boldsymbol{\delta}^{\prime}\in\mathcal{M}). \]
  3. (c)

    Fix $0<c_{1}<c_{2}<1$. For any $\boldsymbol{\delta}\in\mathcal{M}$ and any descent direction $\boldsymbol{\eta}\in T_{\boldsymbol{\delta}}\mathcal{M}$ with $\langle\operatorname{grad}\psi(\boldsymbol{\delta}),\boldsymbol{\eta}\rangle<0$, there exists a step size $t_{\ast}>0$ in the domain of $\gamma(t)=R_{\boldsymbol{\delta}}(t\boldsymbol{\eta})$ such that $\gamma$ satisfies the weak Wolfe conditions at $t_{\ast}$.

Proof.

By (6), every point of $\mathcal{M}$ lies in $K_{c}$. Since $G$ is continuous on $U$ and $K_{c}\subset U$, the set

\[ \mathcal{M}=K_{c}\cap G^{-1}(\{c^{2}\}) \]

is closed in the compact set $K_{c}$. Hence $\mathcal{M}$ is compact.

For part (a), Lemma 4 provides an integrable envelope $B_{2}(X)$ such that

\[ \sup_{\boldsymbol{\delta}\in K_{c}}\|\nabla^{2}m_{\boldsymbol{\delta}}(X)\|\leq B_{2}(X)\qquad\text{a.s.} \]

The same lemma yields analogous integrable envelopes for $m_{\boldsymbol{\delta}}(X)$ and $\nabla m_{\boldsymbol{\delta}}(X)$. Hence dominated differentiation applies to

\[ \psi(\boldsymbol{\delta})=\mathbb{E}\big\{m_{\boldsymbol{\delta}}(X)\big\}, \]

which gives

\[ \nabla\psi(\boldsymbol{\delta})=\mathbb{E}\big[\nabla m_{\boldsymbol{\delta}}(X)\big],\qquad\nabla^{2}\psi(\boldsymbol{\delta})=\mathbb{E}\big[\nabla^{2}m_{\boldsymbol{\delta}}(X)\big]. \]

Therefore

\[ \sup_{\boldsymbol{\delta}\in K_{c}}\|\nabla^{2}\psi(\boldsymbol{\delta})\|\leq\mathbb{E}\Big[\sup_{\boldsymbol{\delta}\in K_{c}}\|\nabla^{2}m_{\boldsymbol{\delta}}(X)\|\Big]\leq\mathbb{E}\big[B_{2}(X)\big]=:M_{\psi}<\infty. \]

The Lipschitz property of $\nabla\psi$ on $K_{c}$ follows from the mean value theorem.

For part (b), take $\boldsymbol{\delta},\boldsymbol{\delta}^{\prime}\in\mathcal{M}$. Then

\[ \operatorname{grad}\psi(\boldsymbol{\delta})-\operatorname{grad}\psi(\boldsymbol{\delta}^{\prime})=P(\boldsymbol{\delta})\big(\nabla\psi(\boldsymbol{\delta})-\nabla\psi(\boldsymbol{\delta}^{\prime})\big)+\big(P(\boldsymbol{\delta})-P(\boldsymbol{\delta}^{\prime})\big)\nabla\psi(\boldsymbol{\delta}^{\prime}). \]

Since $\|P(\boldsymbol{\delta})\|_{\mathrm{op}}\leq 1$,

\[ \|P(\boldsymbol{\delta})\big(\nabla\psi(\boldsymbol{\delta})-\nabla\psi(\boldsymbol{\delta}^{\prime})\big)\|\leq M_{\psi}\|\boldsymbol{\delta}-\boldsymbol{\delta}^{\prime}\|. \]

Let $\boldsymbol{n}(\boldsymbol{\delta})=\nabla G(\boldsymbol{\delta})/\|\nabla G(\boldsymbol{\delta})\|$. By (7), $\|\nabla G(\boldsymbol{\delta})\|\geq\epsilon_{\nabla}$ on $\mathcal{M}$, and $\|\nabla G(\boldsymbol{\delta})-\nabla G(\boldsymbol{\delta}^{\prime})\|\leq M_{G}\|\boldsymbol{\delta}-\boldsymbol{\delta}^{\prime}\|$ on $K_{c}$. Using

\[ \boldsymbol{n}(\boldsymbol{\delta})-\boldsymbol{n}(\boldsymbol{\delta}^{\prime})=\frac{\nabla G(\boldsymbol{\delta})-\nabla G(\boldsymbol{\delta}^{\prime})}{\|\nabla G(\boldsymbol{\delta})\|}+\Big(\frac{1}{\|\nabla G(\boldsymbol{\delta})\|}-\frac{1}{\|\nabla G(\boldsymbol{\delta}^{\prime})\|}\Big)\nabla G(\boldsymbol{\delta}^{\prime}), \]

one obtains

\[ \|\boldsymbol{n}(\boldsymbol{\delta})-\boldsymbol{n}(\boldsymbol{\delta}^{\prime})\|\leq\frac{2}{\epsilon_{\nabla}}\,\|\nabla G(\boldsymbol{\delta})-\nabla G(\boldsymbol{\delta}^{\prime})\|\leq\frac{2M_{G}}{\epsilon_{\nabla}}\,\|\boldsymbol{\delta}-\boldsymbol{\delta}^{\prime}\|. \]

Since $P(\boldsymbol{\delta})=I-\boldsymbol{n}(\boldsymbol{\delta})\boldsymbol{n}(\boldsymbol{\delta})^{\top}$,

\[ \|P(\boldsymbol{\delta})-P(\boldsymbol{\delta}^{\prime})\|\leq\|\boldsymbol{n}(\boldsymbol{\delta})-\boldsymbol{n}(\boldsymbol{\delta}^{\prime})\|\big(\|\boldsymbol{n}(\boldsymbol{\delta})\|+\|\boldsymbol{n}(\boldsymbol{\delta}^{\prime})\|\big)\leq\frac{4M_{G}}{\epsilon_{\nabla}}\,\|\boldsymbol{\delta}-\boldsymbol{\delta}^{\prime}\|. \]

Let $M_{1}=\sup_{\boldsymbol{\delta}\in K_{c}}\|\nabla\psi(\boldsymbol{\delta})\|<\infty$. Combining these bounds yields

\[ \|\operatorname{grad}\psi(\boldsymbol{\delta})-\operatorname{grad}\psi(\boldsymbol{\delta}^{\prime})\|\leq\Big(M_{\psi}+\frac{4M_{G}}{\epsilon_{\nabla}}M_{1}\Big)\,\|\boldsymbol{\delta}-\boldsymbol{\delta}^{\prime}\|, \]

so part (b) holds with $L=M_{\psi}+(4M_{G}/\epsilon_{\nabla})M_{1}$.

For part (c), define the one-dimensional line-search function $h(t)=\psi(R_{\boldsymbol{\delta}}(t\boldsymbol{\eta}))$. By part (a) and the $C^{1}$ regularity of $R$, the map $h$ is continuously differentiable on an interval containing $0$. Because $R_{\boldsymbol{\delta}}(t\boldsymbol{\eta})\in\mathcal{M}$ whenever the retraction is defined, and because $\psi$ is continuous on the compact set $\mathcal{M}$, the function $h$ is bounded below on its domain. Therefore there exists $t_{\ast}>0$ satisfying the weak Wolfe conditions along $\gamma(t)=R_{\boldsymbol{\delta}}(t\boldsymbol{\eta})$; see Ring and Wirth (2012, Proposition 1). ∎
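A step size satisfying the weak Wolfe conditions can be located by a simple bracketing and bisection scheme applied to the scalar function $h$. The sketch below is a generic one-dimensional routine, not the authors' implementation; the function name `weak_wolfe`, the default constants, and the quadratic test function are assumptions. In the setting above one would pass $h(t)=\psi(R_{\boldsymbol{\delta}}(t\boldsymbol{\eta}))$ and its derivative.

```python
import math

def weak_wolfe(h, dh, c1=1e-4, c2=0.9, t0=1.0, max_iter=100):
    """Find t > 0 with h(t) <= h(0) + c1*t*h'(0) and h'(t) >= c2*h'(0)."""
    h0, dh0 = h(0.0), dh(0.0)
    if dh0 >= 0.0:
        raise ValueError("not a descent direction: h'(0) must be negative")
    lo, hi, t = 0.0, math.inf, t0
    for _ in range(max_iter):
        if h(t) > h0 + c1 * t * dh0:
            hi = t      # sufficient decrease fails: the step is too long
        elif dh(t) < c2 * dh0:
            lo = t      # curvature condition fails: the step is too short
        else:
            return t
        # Bisect once an upper bracket exists, otherwise keep doubling.
        t = 0.5 * (lo + hi) if math.isfinite(hi) else 2.0 * lo
    raise RuntimeError("no weak Wolfe step found within max_iter iterations")

# Toy line-search function with minimiser at t = 0.1.
h = lambda t: (t - 0.1) ** 2
dh = lambda t: 2.0 * (t - 0.1)
t_star = weak_wolfe(h, dh)
```

The routine only needs $h$ to be continuously differentiable and bounded below, which is what part (c) establishes.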

For an accepted step $t_{\ast}$, define

\[ \boldsymbol{s}:=\mathcal{T}_{\boldsymbol{\delta}\to\boldsymbol{\delta}_{+}}(t_{\ast}\boldsymbol{\eta}),\qquad\boldsymbol{y}:=\operatorname{grad}\psi(\boldsymbol{\delta}_{+})-\mathcal{T}_{\boldsymbol{\delta}\to\boldsymbol{\delta}_{+}}\big(\operatorname{grad}\psi(\boldsymbol{\delta})\big),\qquad\boldsymbol{\delta}_{+}:=R_{\boldsymbol{\delta}}(t_{\ast}\boldsymbol{\eta}). \]

The cautious RBFGS update of Huang et al. (2018, Algorithm 1) is then applied whenever the prescribed curvature condition holds; otherwise the current approximation is transported to the new tangent space without updating the Hessian surrogate.
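The update-or-skip logic can be illustrated with a flat (Euclidean) analogue of a cautious BFGS step. The sketch below is not the algorithm of Huang et al. (2018): it omits the transport of the Hessian surrogate between tangent spaces, and the specific acceptance test $\boldsymbol{y}^{\top}\boldsymbol{s}>\varepsilon\,\|\operatorname{grad}\psi\|^{\alpha}\,\|\boldsymbol{s}\|^{2}$, together with the constants `eps` and `alpha`, is an assumed form of the curvature condition, which the text above does not spell out.

```python
import numpy as np

def cautious_bfgs_update(B, s, y, grad_norm, eps=1e-6, alpha=1.0):
    """BFGS update of the Hessian surrogate B, skipped if curvature is too small."""
    sy = float(s @ y)
    # Assumed cautious test: require enough curvature along s before updating.
    if sy <= eps * grad_norm ** alpha * float(s @ s):
        return B
    Bs = B @ s
    return B - np.outer(Bs, Bs) / float(s @ Bs) + np.outer(y, y) / sy

B = np.eye(2)
s = np.array([1.0, 0.0])

# Positive curvature: the update is applied and satisfies the secant equation.
y_good = np.array([2.0, 1.0])
B_new = cautious_bfgs_update(B, s, y_good, grad_norm=1.0)

# Negative curvature: the update is skipped and B is returned unchanged.
y_bad = np.array([-1.0, 0.0])
B_skip = cautious_bfgs_update(B, s, y_bad, grad_norm=1.0)
```

When the update is accepted, the new surrogate maps $\boldsymbol{s}$ to $\boldsymbol{y}$ and stays symmetric positive definite as long as $B$ was.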

Lemma 5 (Feasible initialization and containment).

Let $c>0$ satisfy $c^{2}\notin\mathcal{V}$ and set $\mathcal{M}=\{\boldsymbol{\delta}:G(\boldsymbol{\delta})=c^{2}\}$. Assume the boundary-separation condition (5). Assume that $G$ satisfies the following local quadratic control near the origin: there exist $\rho>0$, $\varepsilon\in(0,1)$ and a symmetric positive-definite matrix $\boldsymbol{H}_{G}(\boldsymbol{0})$ with smallest and largest eigenvalues $0<\lambda_{\min}\leq\lambda_{\max}$ such that, for all $\|\boldsymbol{\delta}\|\leq\rho$,

\[ \tfrac{1}{2}(1-\varepsilon)\lambda_{\min}\|\boldsymbol{\delta}\|^{2}\ \leq\ G(\boldsymbol{\delta})\ \leq\ \tfrac{1}{2}(1+\varepsilon)\lambda_{\max}\|\boldsymbol{\delta}\|^{2}. \tag{8} \]

Assume also the hypotheses of Corollary 1.

  1. (i)

    For every prescribed local radius $r_{0}\in(0,\rho]$, there exists $c_{0}>0$ such that for any $c\in(0,c_{0})$ one can select a unit vector $\boldsymbol{v}$ and $t^{\ast}\in(0,r_{0}]$ with $G(t^{\ast}\boldsymbol{v})=c^{2}$ and $\boldsymbol{\delta}_{0}:=t^{\ast}\boldsymbol{v}\in\mathcal{M}\cap\overline{B}(\boldsymbol{0},r_{0})$.

  2. (ii)

    Let $\mathcal{L}=\{\boldsymbol{\delta}\in\mathcal{M}:\ \psi(\boldsymbol{\delta})\leq\psi(\boldsymbol{\delta}_{0})\}$. Then $\mathcal{M}$ is compact, $\mathcal{L}$ is compact, and for any sequence $\{\boldsymbol{\delta}_{k}\}$ generated by Riemannian BFGS on $\mathcal{M}$ using $R$ and $\mathcal{T}$ above with a weak Wolfe line search, one has $\psi(\boldsymbol{\delta}_{k+1})\leq\psi(\boldsymbol{\delta}_{k})\leq\psi(\boldsymbol{\delta}_{0})$ and hence $\boldsymbol{\delta}_{k}\in\mathcal{L}$ for all $k\geq 0$.

Proof.

For part (i), fix $r_{0}\in(0,\rho]$. Fix a unit vector $\boldsymbol{v}$ and define $q(t)=G(t\boldsymbol{v})$ on $[0,r_{0}]$. Then $q$ is continuous, $q(0)=0$, and by (8),

\[ \tfrac{1}{2}(1-\varepsilon)\lambda_{\min}t^{2}\ \leq\ q(t)\ \leq\ \tfrac{1}{2}(1+\varepsilon)\lambda_{\max}t^{2}\qquad(0\leq t\leq r_{0}). \]

Let $c_{0}=\sqrt{\tfrac{1}{2}(1-\varepsilon)\lambda_{\min}}\,r_{0}$. For any $c\in(0,c_{0})$, there exists $t^{\ast}\in(0,r_{0}]$ with $q(t^{\ast})=c^{2}$ by the intermediate value theorem. Then $\boldsymbol{\delta}_{0}=t^{\ast}\boldsymbol{v}\in\mathcal{M}\cap\overline{B}(\boldsymbol{0},r_{0})$, proving part (i).

For part (ii), condition (5) implies the existence of $R_{c}>0$ such that $G(\boldsymbol{\delta})>c^{2}$ for all $\boldsymbol{\delta}\in\mathcal{D}$ with $\|\boldsymbol{\delta}\|\geq R_{c}$. Hence $\mathcal{M}\subset\overline{B}(\boldsymbol{0},R_{c})$. By Corollary 1, the level set $\mathcal{M}$ is compact and $\psi$ is continuous on $\mathcal{M}$. Hence $\mathcal{L}$ is a closed subset of the compact set $\mathcal{M}$, so $\mathcal{L}$ is compact. Now proceed by induction. Assume $\boldsymbol{\delta}_{k}\in\mathcal{L}$ and let $\boldsymbol{\eta}_{k}\in T_{\boldsymbol{\delta}_{k}}\mathcal{M}$ be the search direction. Let $\alpha_{k}>0$ be produced by the weak Wolfe line search applied to $\gamma(t)=R_{\boldsymbol{\delta}_{k}}(t\boldsymbol{\eta}_{k})$ and set $\boldsymbol{\delta}_{k+1}=R_{\boldsymbol{\delta}_{k}}(\alpha_{k}\boldsymbol{\eta}_{k})$. The Armijo condition yields

\[ \psi(\boldsymbol{\delta}_{k+1})\ \leq\ \psi(\boldsymbol{\delta}_{k})+c_{1}\alpha_{k}\,\langle\operatorname{grad}\psi(\boldsymbol{\delta}_{k}),\boldsymbol{\eta}_{k}\rangle\ <\ \psi(\boldsymbol{\delta}_{k})\ \leq\ \psi(\boldsymbol{\delta}_{0}), \]

hence $\boldsymbol{\delta}_{k+1}\in\mathcal{L}$. Therefore $\psi(\boldsymbol{\delta}_{k+1})\leq\psi(\boldsymbol{\delta}_{k})\leq\psi(\boldsymbol{\delta}_{0})$ and $\boldsymbol{\delta}_{k}\in\mathcal{L}$ for all $k\geq 0$. ∎
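The intermediate value argument in part (i) is constructive: a feasible starting point can be found by bisection along a ray. The sketch below assumes a quadratic stand-in $G(\boldsymbol{\delta})=\boldsymbol{\delta}^{\top}\boldsymbol{A}\boldsymbol{\delta}$ in place of the Gelbrich function, and the name `initial_point` and its tolerances are hypothetical.

```python
import numpy as np

def initial_point(G, v, c, r0, tol=1e-12, max_iter=200):
    """Bisection along the ray t*v, 0 <= t <= r0, for a point with G(t*v) = c^2."""
    v = v / np.linalg.norm(v)
    lo, hi = 0.0, r0
    if G(hi * v) < c ** 2:
        raise ValueError("G(r0 * v) < c^2: the ray does not reach the level set")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        # Invariant: G(lo * v) < c^2 <= G(hi * v).
        if G(mid * v) < c ** 2:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return hi * v

A = np.diag([2.0, 0.5])
G = lambda delta: float(delta @ A @ delta)   # stand-in with G(0) = 0
v = np.array([1.0, 1.0])
c, r0 = 0.5, 1.0
delta0 = initial_point(G, v, c, r0)
```

Along the normalised direction this stand-in satisfies $G(t\boldsymbol{v})=1.25\,t^{2}$, so the returned point has norm $\sqrt{0.2}$.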

Stationarity conclusion.

Under (5), Lemma 5 shows that the initial sublevel set $\mathcal{L}=\{\boldsymbol{\delta}\in\mathcal{M}:\psi(\boldsymbol{\delta})\leq\psi(\boldsymbol{\delta}_{0})\}$ is compact. Thus Assumption 4.1 in Huang et al. (2018) is satisfied for the RBFGS iterates started at $\boldsymbol{\delta}_{0}$. Together with Assumption 4.2, applied here with the scaled transport $\mathcal{T}$ above, Theorem 4.2 yields

\[ \liminf_{k\to\infty}\|\operatorname{grad}\psi(\boldsymbol{\delta}_{k})\|=0. \]

Since $\{\boldsymbol{\delta}_{k}\}\subset\mathcal{L}$ and $\mathcal{L}$ is compact, the sequence admits an accumulation point $\boldsymbol{\delta}_{\star}$. Continuity of $\operatorname{grad}\psi$ implies that any accumulation point along a subsequence with $\|\operatorname{grad}\psi(\boldsymbol{\delta}_{k_{j}})\|\to 0$ is Riemannian stationary.

Appendix C Detailed Proof for Minimax Lower Bound

Standing Assumptions and Notation

Let $\boldsymbol{Z}=(\boldsymbol{X},\boldsymbol{W},Y)$ with $\boldsymbol{X}\in\mathbb{R}^{p}$, $\boldsymbol{W}\in\mathbb{R}^{q}$, and $Y\in\mathbb{R}$. Write $f(\boldsymbol{w}\mid\boldsymbol{x})$ for the exposure density, and define the exponential tilt

\[
\begin{aligned}
g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x}) &:=\frac{\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\,f(\boldsymbol{w}\mid\boldsymbol{x})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})},\\
\nu_{\boldsymbol{\delta}}(\boldsymbol{x}) &:=\mathbb{E}_{f}\big[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\mid\boldsymbol{X}=\boldsymbol{x}\big].
\end{aligned}
\]

Let $\mu(\boldsymbol{x},\boldsymbol{w}):=\mathbb{E}[Y\mid\boldsymbol{X}=\boldsymbol{x},\boldsymbol{W}=\boldsymbol{w}]$, and denote

\[
\tilde{\boldsymbol{s}}(\boldsymbol{w},\boldsymbol{x}):=\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\mathbb{E}_{f}[\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}=\boldsymbol{x}].
\]

All bounds are derived for a general tilt and then specialized to $\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})=\boldsymbol{w}$.
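To make the notation concrete, the following minimal sketch evaluates the tilt, its normalizer, and the centered statistic on a finite exposure support at one fixed covariate value. The four-point support, the probabilities, and the helper name `tilt` are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def tilt(f, s, delta):
    """Exponentially tilt a discrete conditional density f(. | x).

    f     : probabilities f(w_j | x) on a finite support (illustrative assumption)
    s     : tilt statistics s(w_j, x), one tuple of length q per support point
    delta : tilt parameter, a tuple of length q
    Returns (g, nu, s_tilde): tilted probabilities, normalizer, centered statistics.
    """
    q = len(delta)
    expo = [math.exp(sum(d * v for d, v in zip(delta, sj))) for sj in s]
    nu = sum(fj * ej for fj, ej in zip(f, expo))          # nu_delta(x)
    g = [fj * ej / nu for fj, ej in zip(f, expo)]         # g_delta(w_j | x)
    mean_s = [sum(fj * sj[k] for fj, sj in zip(f, s)) for k in range(q)]
    s_tilde = [tuple(sj[k] - mean_s[k] for k in range(q)) for sj in s]
    return g, nu, s_tilde

# Hypothetical support of W given x in R^2, with s(w, x) = w.
support = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
f = [0.4, 0.3, 0.2, 0.1]

g, nu, s_tilde = tilt(f, support, delta=(0.5, -0.25))
total_mass = sum(g)                                        # g is a density
centered = [sum(fj * st[k] for fj, st in zip(f, s_tilde)) for k in range(2)]
g0, nu0, _ = tilt(f, support, delta=(0.0, 0.0))            # zero tilt returns f
zero_tilt_gap = max(abs(a - b) for a, b in zip(g0, f))
```

Setting `delta` to zero recovers the observed exposure distribution, which is why the effect $\theta_P(\boldsymbol{\delta})$ defined below is measured relative to $\boldsymbol{\delta}=\boldsymbol{0}$.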

Assumptions:

(A1) Bounded outcomes

There exists $M<\infty$ such that $|Y|\leq M$ a.s. Hence $|\mu(\boldsymbol{x},\boldsymbol{w})|\leq M$.

(A2) Nondegenerate noise

$0<\underline{\sigma}^{2}\leq\operatorname{Var}(Y\mid\boldsymbol{X},\boldsymbol{W})\leq\overline{\sigma}^{2}<\infty$ a.s.

(A3) Bounded tilt

The tilt statistic satisfies $\|\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\|\leq M_{s}<\infty$, and the tilt parameter is restricted to $\|\boldsymbol{\delta}\|\leq\Delta$ for some finite radius $\Delta\in(0,\infty)$. For the purpose of this proof, we assume there exists a fixed constant $C_{\delta}>0$ such that

\[
\|\boldsymbol{\delta}\|\ \leq\ \frac{C_{\delta}}{M_{s}}.
\]

(A4) Model richness

The class of distributions $\mathcal{P}$ is sufficiently rich. There exist constants $B,\epsilon_{0}>0$ such that for any $P_{0}\in\mathcal{P}$, if $\phi\in L_{\infty}(P_{0})$ with $\operatorname{E}_{P_{0}}[\phi(\boldsymbol{Z})]=0$ and $\|\phi\|_{\infty}\leq B$, then for all $|\epsilon|\leq\epsilon_{0}$, the density $p_{0}(1+\epsilon\phi)$ corresponds to a distribution $P_{1}\in\mathcal{P}$.

(A5) Pathwise differentiability

The functional $\psi(\cdot)$ is pathwise differentiable at any $P_{0}\in\mathcal{P}$. The remainder from the von Mises expansion admits a uniform quadratic bound: there exists a constant $K<\infty$ such that for any perturbation $\phi$ with $\|\phi\|_{\infty}\leq B$, the bound

\[
\big|\psi(P_{0}(1+\epsilon\phi))-\psi(P_{0})-\epsilon\operatorname{E}_{P_{0}}[\varphi(\boldsymbol{Z};P_{0})\phi(\boldsymbol{Z})]\big|\leq K\epsilon^{2}
\]

holds uniformly for all $P_{0}\in\mathcal{P}$ and all valid $\phi$.

Identification and target. Under standard consistency and conditional exchangeability,

\[
\begin{aligned}
\psi_{P}(\boldsymbol{\delta}) &=\mathbb{E}_{P}\Big[\int\mu(\boldsymbol{X},\boldsymbol{w})\,g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\Big],\\
\theta_{P}(\boldsymbol{\delta}) &:=\psi_{P}(\boldsymbol{\delta})-\psi_{P}(\boldsymbol{0}).
\end{aligned}
\]
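The identified functional can be evaluated exactly when the law is discrete. The sketch below does so for a hypothetical toy law with a binary covariate and three exposure points. The probabilities, the outcome regression `mu`, and the function names `psi` and `theta` are illustrative assumptions, not quantities from the paper.

```python
import math

# Hypothetical toy law: binary X, three exposure points in R^2, s(w, x) = w.
p_x = {0: 0.6, 1: 0.4}
support = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0)]
f = {0: [0.5, 0.3, 0.2], 1: [0.2, 0.3, 0.5]}   # f(w_j | x)

def mu(x, w):
    """Illustrative outcome regression mu(x, w) = E[Y | X = x, W = w]."""
    return 0.5 * x + w[0] - 0.5 * w[1] + w[0] * w[1]

def psi(delta):
    """psi_P(delta): average over X of the tilted conditional mean of mu."""
    total = 0.0
    for x, px in p_x.items():
        weights = [fj * math.exp(delta[0] * w[0] + delta[1] * w[1])
                   for fj, w in zip(f[x], support)]
        nu = sum(weights)                         # nu_delta(x)
        total += px * sum(wt / nu * mu(x, w) for wt, w in zip(weights, support))
    return total

def theta(delta):
    """theta_P(delta) = psi_P(delta) - psi_P(0)."""
    return psi(delta) - psi((0.0, 0.0))

# With no tilt, psi reduces to the mean of mu under the observed law.
observed_mean = sum(px * fj * mu(x, w)
                    for x, px in p_x.items() for fj, w in zip(f[x], support))
psi_zero = psi((0.0, 0.0))
theta_zero = theta((0.0, 0.0))
theta_up = theta((0.5, 0.0))   # tilt toward larger values of the first exposure
```

In this toy law `mu` increases with the first exposure coordinate on the support, so tilting in that direction raises the mean outcome and `theta_up` is positive.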

C.1 Uniform second-order bounds

This subsection derives uniform second-order $L_{2}(P_{0})$ bounds for (i) the density ratio $g_{\boldsymbol{\delta}}/f$ and (ii) the tilted conditional mean $\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]$, valid over a finite-tilt regime. The bounds provide an explicit quadratic remainder for the expansion around $\boldsymbol{\delta}=\boldsymbol{0}$ and uniform control of second-order terms in the Le Cam two-point construction underlying the minimax lower bound.

Lemma 6 (Uniform $L_{2}$ bounds with explicit constants).

Suppose (A1)–(A3). Let $\tau:=\|\boldsymbol{\delta}\|M_{s}\leq C_{\delta}$. Then for all such $\boldsymbol{\delta}$,

\[
\begin{aligned}
\Big\|\frac{g_{\boldsymbol{\delta}}}{f}-1-\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}\Big\|_{L_{2}(P_{0})} &\leq C_{g}\,\|\boldsymbol{\delta}\|^{2},\\
\big\|\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]-\mathbb{E}_{f}[\mu\mid\boldsymbol{X}]-\boldsymbol{\delta}^{\top}\mathbb{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big\|_{L_{2}(P_{0})} &\leq C_{\mu}\,\|\boldsymbol{\delta}\|^{2},
\end{aligned}
\]

where one can take

\[
\begin{aligned}
C_{g} &:=4\,e^{2C_{\max}}\,M_{s}^{2},\\
C_{\mu} &:=4\,e^{2C_{\max}}\,M\,M_{s}^{2}.
\end{aligned}
\]
Proof.

All vector norms are Euclidean, and all matrix norms are operator norms. Expectations and conditional expectations are under the baseline law $P_{0}$ unless explicitly indicated. Write $M_{s}:=\sup_{\boldsymbol{w},\boldsymbol{x}}\|\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\|<\infty$ by (A3), and $M:=\sup_{\boldsymbol{x},\boldsymbol{w}}|\mu(\boldsymbol{x},\boldsymbol{w})|<\infty$ by (A1). Let $\boldsymbol{\delta}\in\mathbb{R}^{q}$ and define

\[
\tau:=\|\boldsymbol{\delta}\|\,M_{s}.
\]

Under (A3),

\[
\|\boldsymbol{\delta}\|\ \leq\ \frac{C_{\delta}}{M_{s}}\quad\Longrightarrow\quad\tau\leq C_{\delta},
\]

which yields uniform bounds with explicit constants.

Define

\[
\begin{aligned}
r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x}) &:=\frac{g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})}{f(\boldsymbol{w}\mid\boldsymbol{x})}=\frac{\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})},\\
\nu_{\boldsymbol{\delta}}(\boldsymbol{x}) &:=\mathbb{E}_{f}\big[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\,\big|\,\boldsymbol{X}=\boldsymbol{x}\big].
\end{aligned}
\]

Since $|\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})|\leq\|\boldsymbol{\delta}\|\,\|\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\|\leq\tau\leq C_{\delta}$, we have

\[
e^{-\tau}\ \leq\ \exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\ \leq\ e^{\tau},\qquad e^{-\tau}\ \leq\ \nu_{\boldsymbol{\delta}}(\boldsymbol{x})\ \leq\ e^{\tau}.
\]

It follows that

\[
0<e^{-2\tau}\ \leq\ r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\ \leq\ e^{2\tau}\ \leq\ e^{2C_{\max}}.
\]
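The two-sided bound on the density ratio is easy to confirm numerically. The sketch below draws a hypothetical discrete exposure law with bounded statistic and checks that every value of $r_{\boldsymbol{\delta}}$ lies in $[e^{-2\tau},e^{2\tau}]$. The random support, the seed, and the chosen `delta` are illustrative assumptions.

```python
import math
import random

random.seed(0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

# Hypothetical discrete law for W given x, with s(w, x) = w in R^2.
s = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(50)]
raw = [random.random() for _ in s]
f = [v / sum(raw) for v in raw]

M_s = max(norm(sj) for sj in s)        # sup-norm bound on the statistic
delta = (0.7, -0.4)
tau = norm(delta) * M_s                # tau = ||delta|| * M_s

nu = sum(fj * math.exp(dot(delta, sj)) for fj, sj in zip(f, s))
r = [math.exp(dot(delta, sj)) / nu for sj in s]   # density ratio g_delta / f

lower_ok = min(r) >= math.exp(-2 * tau)
upper_ok = max(r) <= math.exp(2 * tau)
ratio_mean = sum(fj * rj for fj, rj in zip(f, r))  # E_f[r_delta | x] = 1
```

The bound is typically loose, since it only uses the worst-case size of $\boldsymbol{\delta}^{\top}\boldsymbol{s}$ in both the numerator and the normalizer.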

Second-order expansion of $r_{\boldsymbol{\delta}}$ and a uniform Hessian bound. Introduce the log-partition function

\[
\log\nu_{\boldsymbol{\delta}}(\boldsymbol{x})=\log\mathbb{E}_{f}\big[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\,\big|\,\boldsymbol{X}=\boldsymbol{x}\big].
\]

Then

\[
r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})=\exp\big\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\log\nu_{\boldsymbol{\delta}}(\boldsymbol{x})\big\}.
\]

Differentiating with respect to $\boldsymbol{\delta}$ yields

\[
\begin{aligned}
\nabla_{\boldsymbol{\delta}}\log\nu_{\boldsymbol{\delta}}(\boldsymbol{x}) &=\mathbb{E}_{g_{\boldsymbol{\delta}}}\big[\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\,\big|\,\boldsymbol{X}=\boldsymbol{x}\big]=:\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x}),\\
\nabla_{\boldsymbol{\delta}}^{2}\log\nu_{\boldsymbol{\delta}}(\boldsymbol{x}) &=\operatorname{Var}_{g_{\boldsymbol{\delta}}}\big(\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\,\big|\,\boldsymbol{X}=\boldsymbol{x}\big),
\end{aligned}
\]

and therefore

\[
\begin{aligned}
\nabla_{\boldsymbol{\delta}}\log r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x}) &=\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x}),\\
\nabla_{\boldsymbol{\delta}}^{2}\log r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x}) &=-\,\operatorname{Var}_{g_{\boldsymbol{\delta}}}\big(\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\,\big|\,\boldsymbol{X}=\boldsymbol{x}\big).
\end{aligned}
\]

Using $\nabla F=F\,\nabla\log F$ and $\nabla^{2}F=F\big[(\nabla\log F)(\nabla\log F)^{\top}+\nabla^{2}\log F\big]$, we obtain

\[
\begin{aligned}
\nabla_{\boldsymbol{\delta}}r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x}) &=r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\,\big[\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x})\big],\\
\nabla_{\boldsymbol{\delta}}^{2}r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x}) &=r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\,\Big(\big[\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x})\big]\big[\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x})\big]^{\top}\\
&\qquad\qquad\qquad\qquad-\operatorname{Var}_{g_{\boldsymbol{\delta}}}\big(\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\,\big|\,\boldsymbol{X}=\boldsymbol{x}\big)\Big).
\end{aligned}
\]
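The derivative formulas above rest on the standard exponential-family identities: the gradient of $\log\nu_{\boldsymbol{\delta}}$ is the tilted mean of $\boldsymbol{s}$ and its Hessian is the tilted covariance. The following sketch compares both with central finite differences on a hypothetical five-point support. The support, the probabilities, the step size `h`, and the helper names are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical discrete law for W given x, with s(w, x) = w in R^2.
s = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, -0.5)]
f = [0.3, 0.25, 0.2, 0.15, 0.1]

def log_nu(delta):
    """Log-partition function log nu_delta(x) at the fixed x."""
    return math.log(sum(fj * math.exp(dot(delta, sj)) for fj, sj in zip(f, s)))

def tilted_moments(delta):
    """Mean and covariance of s under the tilted law g_delta."""
    nu = math.exp(log_nu(delta))
    g = [fj * math.exp(dot(delta, sj)) / nu for fj, sj in zip(f, s)]
    mean = [sum(gj * sj[k] for gj, sj in zip(g, s)) for k in range(2)]
    cov = [[sum(gj * (sj[a] - mean[a]) * (sj[b] - mean[b]) for gj, sj in zip(g, s))
            for b in range(2)] for a in range(2)]
    return mean, cov

def shift(delta, k, h):
    """Return delta with coordinate k moved by h."""
    moved = list(delta)
    moved[k] += h
    return tuple(moved)

delta = (0.3, -0.6)
mean, cov = tilted_moments(delta)
h = 1e-4

# Central differences for the gradient and the Hessian of log nu.
fd_grad = [(log_nu(shift(delta, k, h)) - log_nu(shift(delta, k, -h))) / (2 * h)
           for k in range(2)]
fd_hess = [[(log_nu(shift(shift(delta, a, h), b, h))
             - log_nu(shift(shift(delta, a, h), b, -h))
             - log_nu(shift(shift(delta, a, -h), b, h))
             + log_nu(shift(shift(delta, a, -h), b, -h))) / (4 * h * h)
            for b in range(2)] for a in range(2)]

grad_err = max(abs(fd_grad[k] - mean[k]) for k in range(2))
hess_err = max(abs(fd_hess[a][b] - cov[a][b]) for a in range(2) for b in range(2))
```

Both errors should be at the level of finite-difference noise, which supports reading the Hessian of $r_{\boldsymbol{\delta}}$ as a centered outer product minus a tilted covariance.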

A uniform bound for the Hessian holds on the ball $\{\boldsymbol{\delta}:\ \|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}\}$. First,

\[
\begin{aligned}
\|\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x})\| &\leq\|\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\|+\|\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x})\|\\
&\leq M_{s}+\mathbb{E}_{g_{\boldsymbol{\delta}}}[\|\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\|\mid\boldsymbol{X}=\boldsymbol{x}]\leq 2M_{s},
\end{aligned}
\]

so

\[
\big\|\big[\boldsymbol{s}-\boldsymbol{\mu}_{g,s}\big]\big[\boldsymbol{s}-\boldsymbol{\mu}_{g,s}\big]^{\top}\big\|\leq\|\boldsymbol{s}-\boldsymbol{\mu}_{g,s}\|^{2}\leq 4M_{s}^{2}.
\]

Next, since $\|\boldsymbol{s}-\boldsymbol{\mu}_{g,s}\|\leq 2M_{s}$,

\[
\begin{aligned}
\big\|\operatorname{Var}_{g_{\boldsymbol{\delta}}}(\boldsymbol{s}\mid\boldsymbol{X}=\boldsymbol{x})\big\| &=\big\|\mathbb{E}_{g_{\boldsymbol{\delta}}}\big[(\boldsymbol{s}-\boldsymbol{\mu}_{g,s})(\boldsymbol{s}-\boldsymbol{\mu}_{g,s})^{\top}\mid\boldsymbol{X}=\boldsymbol{x}\big]\big\|\\
&\leq\mathbb{E}_{g_{\boldsymbol{\delta}}}\big[\|\boldsymbol{s}-\boldsymbol{\mu}_{g,s}\|^{2}\mid\boldsymbol{X}=\boldsymbol{x}\big]\leq 4M_{s}^{2}.
\end{aligned}
\]

Combining these bounds with $0<r_{\boldsymbol{\delta}}\leq e^{2\tau}\leq e^{2C_{\max}}$ yields the uniform Hessian bound

\[
\big\|\nabla_{\boldsymbol{\delta}}^{2}r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\big\|\leq r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\,\big(4M_{s}^{2}+4M_{s}^{2}\big)\leq 8\,e^{2C_{\max}}\,M_{s}^{2},\qquad\text{for all }(\boldsymbol{w},\boldsymbol{x})\text{ and all }\|\boldsymbol{\delta}\|\leq\tfrac{C_{\delta}}{M_{s}}. \tag{9}
\]

Proof of (i). By Taylor’s theorem with integral remainder for the scalar function $\boldsymbol{\delta}\mapsto r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$ around $\boldsymbol{\delta}=\boldsymbol{0}$,

\[
\begin{aligned}
r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x}) &=r_{\boldsymbol{0}}(\boldsymbol{w},\boldsymbol{x})+\nabla_{\boldsymbol{\delta}}r_{\boldsymbol{0}}(\boldsymbol{w},\boldsymbol{x})^{\top}\boldsymbol{\delta}\\
&\qquad+\boldsymbol{\delta}^{\top}\Big(\int_{0}^{1}(1-t)\,\nabla_{\boldsymbol{\delta}}^{2}r_{t\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\,dt\Big)\boldsymbol{\delta}.
\end{aligned}
\]

Since $\nu_{\boldsymbol{0}}(\boldsymbol{x})=1$,

\[
\begin{aligned}
r_{\boldsymbol{0}}(\boldsymbol{w},\boldsymbol{x}) &=1,\\
\nabla_{\boldsymbol{\delta}}r_{\boldsymbol{0}}(\boldsymbol{w},\boldsymbol{x}) &=\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\mathbb{E}_{f}[\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}=\boldsymbol{x}]=\tilde{\boldsymbol{s}}(\boldsymbol{w},\boldsymbol{x}).
\end{aligned}
\]

Therefore,

\[
r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})-1-\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{w},\boldsymbol{x})=\boldsymbol{\delta}^{\top}\Big(\int_{0}^{1}(1-t)\,\nabla_{\boldsymbol{\delta}}^{2}r_{t\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\,dt\Big)\boldsymbol{\delta}.
\]

Bounding the Hessian inside the integral by (9), and noting that $\int_{0}^{1}(1-t)\,dt=1/2$,

\[
\begin{aligned}
\big|r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})-1-\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{w},\boldsymbol{x})\big| &\leq\Big(\int_{0}^{1}(1-t)\,dt\Big)\,\Big(\sup_{\boldsymbol{\xi}:\ \|\boldsymbol{\xi}\|\leq\|\boldsymbol{\delta}\|}\big\|\nabla_{\boldsymbol{\delta}}^{2}r_{\boldsymbol{\xi}}(\boldsymbol{w},\boldsymbol{x})\big\|\Big)\,\|\boldsymbol{\delta}\|^{2}\\
&\leq 4\,e^{2C_{\max}}\,M_{s}^{2}\,\|\boldsymbol{\delta}\|^{2}.
\end{aligned}
\]

Hence,

\[
\Big\|r_{\boldsymbol{\delta}}-1-\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}\Big\|_{L_{2}(P_{0})}\ \leq\ 4\,e^{2C_{\max}}\,M_{s}^{2}\,\|\boldsymbol{\delta}\|^{2},
\]

which proves part (i) with $C_{g}:=4e^{2C_{\max}}M_{s}^{2}$.
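The quadratic behaviour of the remainder in part (i) can be seen numerically. The sketch below shrinks $\boldsymbol{\delta}$ along a fixed direction and compares the largest pointwise remainder with the bound $4e^{2\tau}M_{s}^{2}\|\boldsymbol{\delta}\|^{2}$. It uses $e^{2\tau}$ with the realized $\tau=\|\boldsymbol{\delta}\|M_{s}$ in place of the uniform constant, which the same Taylor argument supports. The random support, the seed, and the scaling grid are illustrative assumptions.

```python
import math
import random

random.seed(1)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

# Hypothetical discrete law for W given x, with s(w, x) = w in R^2.
s = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(40)]
raw = [random.random() for _ in s]
f = [v / sum(raw) for v in raw]
M_s = max(norm(sj) for sj in s)

mean_s = [sum(fj * sj[k] for fj, sj in zip(f, s)) for k in range(2)]
s_tilde = [(sj[0] - mean_s[0], sj[1] - mean_s[1]) for sj in s]

def max_remainder(delta):
    """Largest |r_delta - 1 - delta' s_tilde| over the support."""
    expo = [math.exp(dot(delta, sj)) for sj in s]
    nu = sum(fj * ej for fj, ej in zip(f, expo))
    return max(abs(ej / nu - 1.0 - dot(delta, st)) for ej, st in zip(expo, s_tilde))

rows = []
for c in (1.0, 0.5, 0.1, 0.01):
    delta = (0.8 * c, -0.6 * c)               # ||delta|| = c
    tau = norm(delta) * M_s
    bound = 4 * math.exp(2 * tau) * M_s ** 2 * norm(delta) ** 2
    rows.append((norm(delta), max_remainder(delta), bound))

all_ok = all(rem <= bound for _, rem, bound in rows)
# Remainder divided by ||delta||^2 should stay bounded as delta shrinks.
scaled = [rem / (size ** 2) for size, rem, _ in rows]
```

A pointwise bound of this form implies the stated $L_{2}(P_{0})$ bound, since the $L_{2}$ norm never exceeds the supremum.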

Proof of (ii). By definition of the tilt,

\[
\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]=\mathbb{E}_{f}\big[\mu(\boldsymbol{X},\boldsymbol{W})\,r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}\big].
\]

Therefore,

\[
\begin{aligned}
\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]-\mathbb{E}_{f}[\mu\mid\boldsymbol{X}] &=\mathbb{E}_{f}\big[\mu(\boldsymbol{X},\boldsymbol{W})\,\{r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})-1\}\mid\boldsymbol{X}\big]\\
&=\boldsymbol{\delta}^{\top}\,\mathbb{E}_{f}\big[\mu(\boldsymbol{X},\boldsymbol{W})\,\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}\big]+\mathbb{E}_{f}\big[\mu(\boldsymbol{X},\boldsymbol{W})\,r_{2}(\boldsymbol{W},\boldsymbol{X};\boldsymbol{\delta})\mid\boldsymbol{X}\big],
\end{aligned}
\]

where

\[
r_{2}(\boldsymbol{w},\boldsymbol{x};\boldsymbol{\delta}):=r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})-1-\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{w},\boldsymbol{x}).
\]

Define

\[
R_{\mu}(\boldsymbol{X};\boldsymbol{\delta}):=\mathbb{E}_{f}\big[\mu(\boldsymbol{X},\boldsymbol{W})\,r_{2}(\boldsymbol{W},\boldsymbol{X};\boldsymbol{\delta})\mid\boldsymbol{X}\big].
\]

By the conditional Cauchy–Schwarz (Jensen) inequality and $|\mu|\leq M$,

\[
\begin{aligned}
\|R_{\mu}(\cdot;\boldsymbol{\delta})\|_{L_{2}(P_{0})}^{2} &=\mathbb{E}\Big[\,\big(\mathbb{E}_{f}\big[\mu\,r_{2}\mid\boldsymbol{X}\big]\big)^{2}\,\Big]\leq\mathbb{E}\Big[\,\mathbb{E}_{f}\big[\mu^{2}\,r_{2}^{2}\mid\boldsymbol{X}\big]\,\Big]\\
&=\mathbb{E}\big[\mu^{2}r_{2}^{2}\big]\leq M^{2}\,\|r_{2}\|_{L_{2}(P_{0})}^{2}.
\end{aligned}
\]

From part (i), $\|r_{2}\|_{L_{2}(P_{0})}\leq C_{g}\,\|\boldsymbol{\delta}\|^{2}$ with $C_{g}=4\,e^{2C_{\max}}\,M_{s}^{2}$. Hence

\[
\|R_{\mu}(\cdot;\boldsymbol{\delta})\|_{L_{2}(P_{0})}\leq M\,C_{g}\,\|\boldsymbol{\delta}\|^{2}=4\,e^{2C_{\max}}\,M\,M_{s}^{2}\,\|\boldsymbol{\delta}\|^{2}.
\]

Thus

\[
\Big\|\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]-\mathbb{E}_{f}[\mu\mid\boldsymbol{X}]-\boldsymbol{\delta}^{\top}\mathbb{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\Big\|_{L_{2}(P_{0})}\leq 4\,e^{2C_{\max}}\,M\,M_{s}^{2}\,\|\boldsymbol{\delta}\|^{2},
\]

which proves part (ii) with $C_{\mu}:=4e^{2C_{\max}}MM_{s}^{2}$. ∎
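Part (ii) says that the tilted conditional mean is linear in $\boldsymbol{\delta}$ up to a quadratic error. The sketch below checks this at a single covariate value for a hypothetical discrete law, again using $e^{2\tau}$ with the realized $\tau$ in the bound. The outcome regression values `mu`, the support, the seed, and the helper name `gap_and_bound` are illustrative assumptions.

```python
import math
import random

random.seed(2)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

n = 40
# Hypothetical discrete law for W given x, with s(w, x) = w in R^2.
s = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
raw = [random.random() for _ in range(n)]
f = [v / sum(raw) for v in raw]
mu = [random.uniform(-2, 2) for _ in range(n)]   # mu(x, w_j), bounded by M

M = max(abs(v) for v in mu)
M_s = max(norm(sj) for sj in s)
mean_s = [sum(fj * sj[k] for fj, sj in zip(f, s)) for k in range(2)]
s_tilde = [(sj[0] - mean_s[0], sj[1] - mean_s[1]) for sj in s]

E_f_mu = sum(fj * mj for fj, mj in zip(f, mu))
E_f_mu_s = [sum(fj * mj * st[k] for fj, mj, st in zip(f, mu, s_tilde))
            for k in range(2)]

def gap_and_bound(delta):
    """Return (|E_g[mu] - E_f[mu] - delta' E_f[mu s_tilde]|, quadratic bound)."""
    expo = [math.exp(dot(delta, sj)) for sj in s]
    nu = sum(fj * ej for fj, ej in zip(f, expo))
    E_g_mu = sum(fj * ej / nu * mj for fj, ej, mj in zip(f, expo, mu))
    gap = abs(E_g_mu - E_f_mu - dot(delta, E_f_mu_s))
    tau = norm(delta) * M_s
    return gap, 4 * math.exp(2 * tau) * M * M_s ** 2 * norm(delta) ** 2

results = [gap_and_bound((0.6 * c, 0.8 * c)) for c in (1.0, 0.3, 0.1, 0.03)]
all_ok = all(gap <= bound for gap, bound in results)
```

Averaging the same expansion over $\boldsymbol{X}$ gives the first-order behaviour of $\psi_{P}(\boldsymbol{\delta})$ around $\boldsymbol{\delta}=\boldsymbol{0}$ that is used in the next subsection.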

C.2 First-order expansion of the EIF and its covariance

The second-order expansions above imply a first-order expansion of the efficient influence function for $\psi(\boldsymbol{\delta})$ at $\boldsymbol{\delta}=\boldsymbol{0}$, with leading term $\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})$ and a quadratic remainder.

Lemma 7 (EIF expansion from the three-component decomposition).

Let

\[
\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})=D_{Y}(\boldsymbol{Z})+D_{g,\mu}(\boldsymbol{Z})+D_{\psi}(\boldsymbol{Z}).
\]

Recall $\tilde{\boldsymbol{s}}(\boldsymbol{w},\boldsymbol{x}):=\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\operatorname{E}_{f}[\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}=\boldsymbol{x}]$, and define

\[
\boldsymbol{h}(\boldsymbol{Z}):=\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\Big(Y-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\Big)\;-\;\operatorname{E}\big[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big].
\]

Assume (A1)–(A3) and $\|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}$. Then, with the constants from Lemma 6,

\[
\begin{aligned}
C_{g} &=4e^{2C_{\max}}M_{s}^{2},\\
C_{\mu} &=4e^{2C_{\max}}MM_{s}^{2},
\end{aligned}
\]

we have the expansion

\[
\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})=\underbrace{Y-\psi_{P}(\boldsymbol{0})}_{\varphi_{\psi(\boldsymbol{0})}(\boldsymbol{Z})}+\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})+R_{\varphi}(\boldsymbol{Z};\boldsymbol{\delta}),
\]

and an explicit quadratic $L_{2}$ bound

\[
\begin{aligned}
\|R_{\varphi}(\cdot;\boldsymbol{\delta})\|_{L_{2}(P_{0})} &\leq C_{\varphi}\,\|\boldsymbol{\delta}\|^{2},\\
C_{\varphi} &:=4M_{s}^{2}\Big(e^{2C_{\max}}\overline{\sigma}+M\big(e^{4C_{\max}}+(4+2C_{\max})e^{2C_{\max}}+1\big)\Big).
\end{aligned}
\]

Consequently,

\[
\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z}):=\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})-\varphi_{\psi(\boldsymbol{0})}(\boldsymbol{Z})=\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})+R_{\varphi}(\boldsymbol{Z};\boldsymbol{\delta}).
\]

We have $\mathbb{E}[\boldsymbol{h}(\boldsymbol{Z})]=\boldsymbol{0}$. Moreover, the leading covariance of the influence function satisfies

\[
\begin{aligned}
\operatorname{Cov}\big(\boldsymbol{h}(\boldsymbol{Z})\big) &=\operatorname{E}\Big[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})^{\top}\operatorname{Var}(Y\mid\boldsymbol{X},\boldsymbol{W})\Big]\\
&\quad+\operatorname{Cov}\Big(\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\big(\mu(\boldsymbol{X},\boldsymbol{W})-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\big)\Big)=:\boldsymbol{\Sigma}_{\varepsilon,s}+\boldsymbol{\Sigma}_{\mu,\mathrm{full}},
\end{aligned}
\]

where

\[
\begin{aligned}
\boldsymbol{\Sigma}_{\varepsilon,s} &:=\operatorname{E}\big[\operatorname{Var}(Y\mid\boldsymbol{X},\boldsymbol{W})\,\tilde{\boldsymbol{s}}\tilde{\boldsymbol{s}}^{\top}\big],\\
\boldsymbol{C} &:=\operatorname{E}\big[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big],\\
\boldsymbol{\Sigma}_{\mu,\mathrm{full}} &:=\operatorname{E}\Big[\tilde{\boldsymbol{s}}\tilde{\boldsymbol{s}}^{\top}\big(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\big)^{2}\Big]-\boldsymbol{C}\boldsymbol{C}^{\top}.
\end{aligned}
\]

In particular,

\[
\boldsymbol{\Sigma}_{\mu,\mathrm{full}}\succeq\underbrace{\operatorname{Cov}\big(\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big)}_{=:\ \boldsymbol{\Sigma}_{\mu}}.
\]

Equality $\boldsymbol{\Sigma}_{\mu,\mathrm{full}}=\boldsymbol{\Sigma}_{\mu}$ holds if and only if $\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\{\mu(\boldsymbol{X},\boldsymbol{W})-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\}$ is conditionally degenerate given $\boldsymbol{X}$.

Proof.

Preliminaries. Recall

\[
r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X}):=\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{X})}.
\]

By Lemma 6, for $\|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}$ there exist remainders $r_{2}(\boldsymbol{W},\boldsymbol{X};\boldsymbol{\delta})$ and $r_{\mu}(\boldsymbol{X};\boldsymbol{\delta})$ such that

\[
\begin{aligned}
r_{\boldsymbol{\delta}} &=1+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}+r_{2},\qquad\|r_{2}\|_{L_{2}(P_{0})}\leq C_{g}\|\boldsymbol{\delta}\|^{2},\\
\operatorname{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}] &=\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]+\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]+r_{\mu},\qquad\|r_{\mu}\|_{L_{2}(P_{0})}\leq C_{\mu}\|\boldsymbol{\delta}\|^{2}.
\end{aligned}
\]

The tilt radius implies $0<r_{\boldsymbol{\delta}}\leq e^{2\tau}\leq e^{2C_{\max}}$ with $\tau=\|\boldsymbol{\delta}\|M_{s}\leq C_{\delta}$, hence $\|r_{\boldsymbol{\delta}}\|_{\infty}\leq e^{2C_{\max}}$. We also use $\|\tilde{\boldsymbol{s}}\|_{\infty}\leq 2M_{s}$, $|\mu|\leq M$, and $\|\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\|_{\infty}\leq 2MM_{s}$.

Also,

\[
\begin{aligned}
\psi_{P}(\boldsymbol{\delta}) &=\mathbb{E}\left[\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]\right]\\
&=\psi_{P}(\boldsymbol{0})+\boldsymbol{\delta}^{\top}\,\mathbb{E}\left[\mathbb{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\right]+\mathbb{E}\left[r_{\mu}(\boldsymbol{X};\boldsymbol{\delta})\right].
\end{aligned}
\]

Expansion of EIF components. Using the displays above,

\[
\begin{aligned}
D_{Y} &=r_{\boldsymbol{\delta}}\,(Y-\mu)=(1+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}+r_{2})(Y-\mu)\\
&=(Y-\mu)+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}\,(Y-\mu)+r_{2}\,(Y-\mu),\\
D_{g,\mu} &=r_{\boldsymbol{\delta}}\big(\mu-\operatorname{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]\big)\\
&=(1+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}+r_{2})\Big(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]-\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]-r_{\mu}\Big)\\
&=(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}])+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}\,(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}])-\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]+R_{2},\\
D_{\psi} &=\operatorname{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]-\psi_{P}(\boldsymbol{\delta})\\
&=\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]-\psi_{P}(\boldsymbol{0})+\boldsymbol{\delta}^{\top}\Big(\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]-\operatorname{E}[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]]\Big)+(r_{\mu}-\operatorname{E}[r_{\mu}]),
\end{aligned}
\]

where the remainder $R_{2}$ collects all quadratic and higher-order terms:

\[
\begin{aligned}
R_{2} &:=r_{2}(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}])-(1+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}+r_{2})\,r_{\mu}\\
&\quad-(\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}})\big(\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big)-r_{2}\,\big(\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big).
\end{aligned}
\]

Collection of zero and first-order terms. Adding $D_{Y}+D_{g,\mu}+D_{\psi}$ gives

\[
\begin{aligned}
\varphi_{\psi(\boldsymbol{\delta})} &=\Big[Y-\psi_{P}(\boldsymbol{0})\Big]\\
&\quad+\boldsymbol{\delta}^{\top}\Big[\tilde{\boldsymbol{s}}\big(Y-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\big)-\operatorname{E}\big[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big]\Big]\\
&\quad+\underbrace{\Big\{r_{2}(Y-\mu)+R_{2}+(r_{\mu}-\operatorname{E}[r_{\mu}])\Big\}}_{=:R_{\varphi}}.
\end{aligned}
\]

Hence the first-order term is $\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})$.

Explicit $L_{2}$ bounds for remainder terms. Since $r_{2}$ is a function of $(\boldsymbol{W},\boldsymbol{X})$ only, conditioning on $(\boldsymbol{X},\boldsymbol{W})$ and applying the variance bound in (A2) gives

\[
\|r_{2}(Y-\mu)\|_{L_{2}}\ \leq\ \overline{\sigma}\,\|r_{2}\|_{L_{2}}\ \leq\ C_{g}\,\overline{\sigma}\,\|\boldsymbol{\delta}\|^{2}.
\]

Since $\|\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\|_{\infty}\leq 2M$,

\[
\|r_{2}(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}])\|_{L_{2}}\ \leq\ 2M\,\|r_{2}\|_{L_{2}}\ \leq\ 2M\,C_{g}\,\|\boldsymbol{\delta}\|^{2}.
\]

Using $\|r_{\boldsymbol{\delta}}\|_{\infty}\leq e^{2C_{\max}}$,

\[
\|(1+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}+r_{2})\,r_{\mu}\|_{L_{2}}=\|r_{\boldsymbol{\delta}}r_{\mu}\|_{L_{2}}\leq e^{2C_{\max}}\,C_{\mu}\,\|\boldsymbol{\delta}\|^{2}.
\]

Moreover,

\[
\|r_{\mu}-\operatorname{E}[r_{\mu}]\|_{L_{2}}\ \leq\ \|r_{\mu}\|_{L_{2}}+\|\operatorname{E}[r_{\mu}]\|_{L_{2}}\ \leq\ 2C_{\mu}\,\|\boldsymbol{\delta}\|^{2}.
\]

For the quadratic product,

\[
\begin{aligned}
\big\|(\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}})\big(\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big)\big\|_{L_{2}} &\leq\|\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\|_{\infty}\,\|\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}\|_{L_{2}}\\
&\leq(2MM_{s}\|\boldsymbol{\delta}\|)\,(2M_{s}\|\boldsymbol{\delta}\|)=4MM_{s}^{2}\,\|\boldsymbol{\delta}\|^{2}.
\end{aligned}
\]

Finally,

\[
\begin{aligned}
\big\|r_{2}\,(\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}])\big\|_{L_{2}} &\leq\|r_{2}\|_{L_{2}}\,\|\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\|_{\infty}\\
&\leq C_{g}\|\boldsymbol{\delta}\|^{2}\cdot(2MM_{s}\|\boldsymbol{\delta}\|)\\
&\leq 2C_{\max}\,M\,C_{g}\,\|\boldsymbol{\delta}\|^{2},
\end{aligned}
\]

where we used $\|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}$. Combining all displays yields

\[
\begin{aligned}
\|R_{\varphi}\|_{L_{2}} &\leq C_{g}\,\overline{\sigma}\,\|\boldsymbol{\delta}\|^{2}+2M\,C_{g}\,\|\boldsymbol{\delta}\|^{2}+e^{2C_{\max}}\,C_{\mu}\,\|\boldsymbol{\delta}\|^{2}+2C_{\mu}\,\|\boldsymbol{\delta}\|^{2}\\
&\quad+4MM_{s}^{2}\,\|\boldsymbol{\delta}\|^{2}+2C_{\max}\,M\,C_{g}\,\|\boldsymbol{\delta}\|^{2}\leq C_{\varphi}\,\|\boldsymbol{\delta}\|^{2},
\end{aligned}
\]

with $C_{\varphi}$ as stated.
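As a check on the stated constant, the six coefficients in the last display can be summed explicitly. The worked arithmetic below substitutes $C_{g}=4e^{2C_{\max}}M_{s}^{2}$ and $C_{\mu}=4e^{2C_{\max}}MM_{s}^{2}$ from Lemma 6 and uses no other input.

```latex
\[
\begin{aligned}
&C_{g}\,\overline{\sigma}+2M\,C_{g}+e^{2C_{\max}}C_{\mu}+2C_{\mu}+4MM_{s}^{2}+2C_{\max}\,M\,C_{g}\\
&\quad=4M_{s}^{2}\Big(e^{2C_{\max}}\overline{\sigma}+2Me^{2C_{\max}}+Me^{4C_{\max}}+2Me^{2C_{\max}}+M+2C_{\max}Me^{2C_{\max}}\Big)\\
&\quad=4M_{s}^{2}\Big(e^{2C_{\max}}\overline{\sigma}+M\big(e^{4C_{\max}}+(4+2C_{\max})e^{2C_{\max}}+1\big)\Big)=C_{\varphi}.
\end{aligned}
\]
```

The sum therefore equals $C_{\varphi}$ exactly, so the final inequality in the display holds with equality between the summed coefficients and the stated constant.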

Covariance of the gradient. Write

\[
\begin{aligned}
m(\boldsymbol{X}) &=\operatorname{E}_{f}[\mu\mid\boldsymbol{X}],\\
\boldsymbol{C} &=\operatorname{E}\big[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big],\\
\boldsymbol{h}(\boldsymbol{Z}) &=\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\big(Y-m(\boldsymbol{X})\big)-\boldsymbol{C}.
\end{aligned}
\]

Set

\[
\begin{aligned}
\boldsymbol{U} &:=\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\big(Y-\mu(\boldsymbol{X},\boldsymbol{W})\big),\\
\boldsymbol{V} &:=\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\big(\mu(\boldsymbol{X},\boldsymbol{W})-m(\boldsymbol{X})\big),
\end{aligned}
\]

so that $\boldsymbol{h}=\boldsymbol{U}+(\boldsymbol{V}-\boldsymbol{C})$. Because $\mathbb{E}[\boldsymbol{U}\mid\boldsymbol{X},\boldsymbol{W}]=\boldsymbol{0}$ and $\boldsymbol{V}-\boldsymbol{C}$ is measurable with respect to $(\boldsymbol{X},\boldsymbol{W})$, $\mathrm{Cov}(\boldsymbol{U},\boldsymbol{V}-\boldsymbol{C})=\boldsymbol{0}$, hence

\[
\operatorname{Cov}(\boldsymbol{h})=\operatorname{Cov}(\boldsymbol{U})+\operatorname{Cov}(\boldsymbol{V}-\boldsymbol{C}).
\]

Moreover, conditioning on $(\boldsymbol{X},\boldsymbol{W})$ gives

\[
\operatorname{Cov}(\boldsymbol{U})=\operatorname{E}\Big[\tilde{\boldsymbol{s}}\tilde{\boldsymbol{s}}^{\top}\operatorname{Var}(Y\mid\boldsymbol{X},\boldsymbol{W})\Big]=:\boldsymbol{\Sigma}_{\varepsilon,s},
\]

and

\[
\begin{aligned}
\operatorname{Cov}(\boldsymbol{V}-\boldsymbol{C}) &=\operatorname{Cov}(\boldsymbol{V})=\operatorname{E}[\boldsymbol{V}\boldsymbol{V}^{\top}]-\boldsymbol{C}\boldsymbol{C}^{\top}\\
&=\operatorname{E}\Big[\tilde{\boldsymbol{s}}\tilde{\boldsymbol{s}}^{\top}\big(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\big)^{2}\Big]-\boldsymbol{C}\boldsymbol{C}^{\top}=:\boldsymbol{\Sigma}_{\mu,\mathrm{full}}.
\end{aligned}
\]

Thus $\operatorname{Cov}(\boldsymbol{h})=\boldsymbol{\Sigma}_{\varepsilon,s}+\boldsymbol{\Sigma}_{\mu,\mathrm{full}}$.

Finally, the law of total covariance yields

\[
\begin{aligned}
\boldsymbol{\Sigma}_{\mu,\mathrm{full}} &=\operatorname{E}\big[\operatorname{Var}(\boldsymbol{V}\mid\boldsymbol{X})\big]+\operatorname{Cov}\big(\operatorname{E}[\boldsymbol{V}\mid\boldsymbol{X}]\big)\succeq\operatorname{Cov}\big(\operatorname{E}[\boldsymbol{V}\mid\boldsymbol{X}]\big)\\
&=\operatorname{Cov}\big(\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big)=:\boldsymbol{\Sigma}_{\mu}.
\end{aligned}
\]

Mean zero property of $\boldsymbol{h}$. Let $m(\boldsymbol{X})=\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]$ and $\boldsymbol{C}=\operatorname{E}[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]]$. Using the definition of $\boldsymbol{h}$,

\[
\begin{aligned}
\boldsymbol{h}(\boldsymbol{Z}) &= \tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\,\big(Y-m(\boldsymbol{X})\big)-\boldsymbol{C} \\
&= \underbrace{\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\,\big(Y-\mu(\boldsymbol{X},\boldsymbol{W})\big)}_{=:\boldsymbol{U}}+\underbrace{\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\,\big(\mu(\boldsymbol{X},\boldsymbol{W})-m(\boldsymbol{X})\big)}_{=:\boldsymbol{V}}-\boldsymbol{C}.
\end{aligned}
\]

The expectation decomposes as $\mathbb{E}[\boldsymbol{h}]=\mathbb{E}[\boldsymbol{U}]+\mathbb{E}[\boldsymbol{V}]-\boldsymbol{C}$.

First,

\[
\begin{aligned}
\mathbb{E}[\boldsymbol{U}] &= \mathbb{E}\!\left[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\,\mathbb{E}\big[Y-\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X},\boldsymbol{W}\big]\right] \\
&= \mathbb{E}\!\left[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\cdot 0\right]=\boldsymbol{0}.
\end{aligned}
\]

Second, for $\boldsymbol{V}$,

\[
\mathbb{E}[\boldsymbol{V}] = \mathbb{E}\!\left[\mathbb{E}\big[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\,\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}\big]\right]-\mathbb{E}\!\left[m(\boldsymbol{X})\,\mathbb{E}\big[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}\big]\right].
\]

By definition $\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})=\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})-\mathbb{E}_{f}[\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]$, hence

\[
\mathbb{E}\big[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}\big]=\boldsymbol{0},
\]

so

\[
\mathbb{E}[\boldsymbol{V}]=\mathbb{E}\!\left[\mathbb{E}_{f}[\mu\,\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\right]=\boldsymbol{C}.
\]

Putting the pieces together,

\[
\mathbb{E}[\boldsymbol{h}]=\mathbb{E}[\boldsymbol{U}]+\mathbb{E}[\boldsymbol{V}]-\boldsymbol{C}=\boldsymbol{0}+\boldsymbol{C}-\boldsymbol{C}=\boldsymbol{0}.
\]
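The mean-zero property can be sanity-checked by simulation. The sketch below is only an illustration under an assumed toy model that does not come from the paper: scalar $X\sim N(0,1)$, $W\mid X\sim N(X,1)$, $\mu(x,w)=x+w$, and $s(w,x)=w$, so that $\tilde{s}=W-X$, $m(X)=2X$ and $C=\operatorname{Var}(W\mid X)=1$. The function name `simulate_h` is hypothetical.

```python
import numpy as np

def simulate_h(n=200_000, seed=0):
    """Draw h(Z) = s~(W, X) * (Y - m(X)) - C in an assumed toy scalar model."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    w = x + rng.normal(size=n)      # W | X ~ N(X, 1)
    mu = x + w                      # assumed outcome regression mu(X, W)
    y = mu + rng.normal(size=n)     # Y = mu + independent noise
    s_tilde = w - x                 # s(w, x) = w, centred at E[W | X] = X
    m = 2.0 * x                     # m(X) = E[mu | X] = 2X in this model
    c = 1.0                         # C = E[E[mu * s~ | X]] = Var(W | X) = 1
    return s_tilde * (y - m) - c

h = simulate_h()
# The sample mean should be close to zero, up to Monte Carlo error.
print(abs(h.mean()))
```

In this toy model $h=e(e+\varepsilon)-1$ with independent standard normal $e$ and $\varepsilon$, so $\operatorname{Var}(h)=3$ and the sample mean has standard error of roughly $\sqrt{3/n}$.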

Corollary 2.

Specializing Lemma 7 to $\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})=\boldsymbol{w}$ (so $\tilde{\boldsymbol{s}}=\tilde{\boldsymbol{W}}$), let

\[
\boldsymbol{H}:=\operatorname{Cov}\!\big(\boldsymbol{h}(\boldsymbol{Z})\big)=\boldsymbol{\Sigma}_{\varepsilon,s}+\boldsymbol{\Sigma}_{\mu,\mathrm{full}}\succeq\boldsymbol{0}.
\]

Then, for all $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}$,

\[
\big(\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}}-C_{\varphi}\|\boldsymbol{\delta}\|^{2}\big)_{+}^{\,2}\ \leq\ \operatorname{Var}\!\big\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\big\}\ \leq\ \big(\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}}+C_{\varphi}\|\boldsymbol{\delta}\|^{2}\big)^{2},
\]

where $(x)_{+}=\max\{x,0\}$ and $C_{\varphi}$ is the constant in Lemma 7. Moreover, if $\lambda_{\min}(\boldsymbol{H})>0$ and $\|\boldsymbol{\delta}\|\leq r_{0}$ for some

\[
r_{0}\ <\ \frac{\sqrt{\lambda_{\min}(\boldsymbol{H})}}{C_{\varphi}}\,,
\]

then there exist constants

\[
\begin{aligned}
c_{\mathrm{low}} &:= \Big(1-\frac{C_{\varphi}\,r_{0}}{\sqrt{\lambda_{\min}(\boldsymbol{H})}}\Big)^{2}, \\
c_{\mathrm{up}} &:= \Big(1+\frac{C_{\varphi}\,r_{0}}{\sqrt{\lambda_{\min}(\boldsymbol{H})}}\Big)^{2},
\end{aligned}
\]

depending only on $(\boldsymbol{H},C_{\varphi},r_{0})$ but not on $\boldsymbol{\delta}$, such that

\[
c_{\mathrm{low}}\ \boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}\ \leq\ \operatorname{Var}\!\big\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\big\}\ \leq\ c_{\mathrm{up}}\ \boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}\qquad\text{for all }\ \|\boldsymbol{\delta}\|\leq r_{0}.
\]

Since

\[
\boldsymbol{H}=\boldsymbol{\Sigma}_{\varepsilon,s}+\boldsymbol{\Sigma}_{\mu,\mathrm{full}}\succeq\boldsymbol{\Sigma}_{\varepsilon,s},
\]

the lower bound immediately implies

\[
\operatorname{Var}\!\big\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\big\}\ \geq\ c_{\mathrm{low}}\ \boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}_{\varepsilon,s}\,\boldsymbol{\delta}.
\]
Proof.

By Lemma 7,

\[
\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})=\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})+R_{\varphi}(\boldsymbol{Z};\boldsymbol{\delta}),\qquad\|R_{\varphi}(\cdot;\boldsymbol{\delta})\|_{L_{2}}\leq C_{\varphi}\,\|\boldsymbol{\delta}\|^{2},\qquad\operatorname{E}\{\boldsymbol{h}(\boldsymbol{Z})\}=\boldsymbol{0}.
\]

Write $A:=\boldsymbol{\delta}^{\top}\boldsymbol{h}$ and $R:=R_{\varphi}(\cdot;\boldsymbol{\delta})$, and center the remainder by $R_{c}:=R-\operatorname{E}[R]$. Then

\[
\operatorname{Var}\!\big\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\big\}=\operatorname{Var}(A+R)=\operatorname{Var}(A+R_{c})=\|A+R_{c}\|_{2}^{2}.
\]

Because $\operatorname{E}[\boldsymbol{h}]=\boldsymbol{0}$, we have $\|A\|_{2}^{2}=\operatorname{Var}(A)=\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}$. Also $\|R_{c}\|_{2}=\sqrt{\operatorname{Var}(R)}\leq\|R\|_{2}\leq C_{\varphi}\|\boldsymbol{\delta}\|^{2}$, since $\operatorname{Var}(R)=\operatorname{E}[R^{2}]-(\operatorname{E}[R])^{2}\leq\operatorname{E}[R^{2}]$.

By the triangle inequality and its reverse in $L_{2}$,

\[
\big|\ \|A\|_{2}-\|R_{c}\|_{2}\ \big|\ \leq\ \|A+R_{c}\|_{2}\ \leq\ \|A\|_{2}+\|R_{c}\|_{2}.
\]

Using $\|A\|_{2}=\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{H}\boldsymbol{\delta}}$ and $\|R_{c}\|_{2}\leq C_{\varphi}\|\boldsymbol{\delta}\|^{2}$ and squaring both sides yields the additive bracket.

Suppose $\lambda_{\min}(\boldsymbol{H})>0$. Then, for all $\boldsymbol{\delta}$,

\[
\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}}\ \geq\ \sqrt{\lambda_{\min}(\boldsymbol{H})}\,\|\boldsymbol{\delta}\|.
\]

Hence, for any $\boldsymbol{\delta}\neq\boldsymbol{0}$ with $\|\boldsymbol{\delta}\|\leq r_{0}$,

\[
\begin{aligned}
\frac{\|R_{c}\|_{2}}{\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}}} &\leq \frac{C_{\varphi}\|\boldsymbol{\delta}\|^{2}}{\sqrt{\lambda_{\min}(\boldsymbol{H})}\,\|\boldsymbol{\delta}\|} \\
&\leq \frac{C_{\varphi}\,r_{0}}{\sqrt{\lambda_{\min}(\boldsymbol{H})}}=:\ r_{*}\ <\ 1.
\end{aligned}
\]

Dividing the additive bracket by $\boldsymbol{\delta}^{\top}\boldsymbol{H}\boldsymbol{\delta}$ yields the multiplicative squeeze with

\[
c_{\mathrm{low}}=(1-r_{*})^{2},\qquad c_{\mathrm{up}}=(1+r_{*})^{2}.
\]

Finally, because $\boldsymbol{H}\succeq\boldsymbol{\Sigma}_{\varepsilon,s}$, the coarser lower bound follows. ∎
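The additive bracket is a triangle inequality in $L_{2}$, so it also holds exactly under an empirical measure. The following sketch checks this on synthetic data; the helper name `variance_bracket`, the Gaussian draws, and the choice of remainder are all assumptions made for illustration. It uses $\|R_{c}\|_{2}$ itself rather than the upper bound $C_{\varphi}\|\boldsymbol{\delta}\|^{2}$, which gives a bracket at least as tight as the one in the corollary.

```python
import numpy as np

def variance_bracket(a, r):
    """Return (lower, variance, upper) for Var(a + r) under the empirical measure."""
    a_c = a - a.mean()                      # centred leading term A
    r_c = r - r.mean()                      # centred remainder R_c
    norm_a = np.sqrt(np.mean(a_c ** 2))     # ||A||_2
    norm_r = np.sqrt(np.mean(r_c ** 2))     # ||R_c||_2
    lower = max(norm_a - norm_r, 0.0) ** 2  # (||A|| - ||R_c||)_+^2
    upper = (norm_a + norm_r) ** 2          # (||A|| + ||R_c||)^2
    return lower, np.var(a + r), upper

rng = np.random.default_rng(0)
h = rng.normal(size=(10_000, 2))            # synthetic stand-in for h(Z)
delta = np.array([0.10, -0.05])
a = h @ delta                               # leading term delta^T h
r = (delta @ delta) * np.tanh(h[:, 0] * h[:, 1])  # a remainder of order ||delta||^2
lower, var_phi, upper = variance_bracket(a, r)
print(lower, var_phi, upper)
```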

Lemma 8 (Hardest direction after orthogonalization).

Let $\varphi_{0}(\boldsymbol{Z}):=\varphi_{\psi(\boldsymbol{0})}(\boldsymbol{Z})=Y-\psi(\boldsymbol{0})$ and

\[
\begin{aligned}
\tilde{\phi}_{0}(\boldsymbol{Z}) &:= \frac{\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})}{\sqrt{\operatorname{Var}\!\big(\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})\big)}}, \\
\sigma_{h}^{2} &:= \operatorname{Var}\!\big(\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})\big).
\end{aligned}
\]

Define

\[
\begin{aligned}
\rho &:= \text{Corr}(\tilde{\phi}_{0},\varphi_{0})=\frac{\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]}{\sqrt{\operatorname{Var}(\varphi_{0})}}, \\
\phi(\boldsymbol{Z}) &:= \frac{\tilde{\phi}_{0}(\boldsymbol{Z})-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]\cdot\frac{\varphi_{0}(\boldsymbol{Z})}{\operatorname{Var}(\varphi_{0})}}{\sqrt{\,1-\frac{\big(\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]\big)^{2}}{\operatorname{Var}(\varphi_{0})}\,}}.
\end{aligned}
\]

Then $\mathbb{E}[\phi]=0$, $\operatorname{Var}(\phi)=1$, and $\mathbb{E}[\phi\,\varphi_{0}]=0$. Let Assumption (A4) hold with constants $(B,\varepsilon_{0})$. Consider $P_{1}$ with density $p_{1}=p_{0}(1+\varepsilon\phi)$, where $|\varepsilon|\leq\min\{\varepsilon_{0},\tfrac{1}{2B}\}$. Then $p_{1}\in\mathcal{P}$, $\int p_{1}=1$, and $p_{1}\geq\tfrac{1}{2}p_{0}$. Moreover, for $\|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}$,

\[
\begin{aligned}
\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta}) &= \varepsilon\,\mathbb{E}\big[\phi\,\varphi_{\theta(\boldsymbol{\delta})}\big]+r_{\mathrm{A5}}(\varepsilon) \\
&= \varepsilon\Big(\sqrt{\operatorname{Var}(\boldsymbol{\delta}^{\top}\boldsymbol{h})}\cdot\sqrt{1-\rho^{2}}\Big)+\varepsilon\,\mathbb{E}[\phi\,R_{\varphi}(\cdot;\boldsymbol{\delta})]+r_{\mathrm{A5}}(\varepsilon),
\end{aligned}
\]

where $R_{\varphi}(\cdot;\boldsymbol{\delta})$ is the remainder in the linear expansion $\varphi_{\theta(\boldsymbol{\delta})}=\boldsymbol{\delta}^{\top}\boldsymbol{h}+R_{\varphi}(\cdot;\boldsymbol{\delta})$, and the remainders satisfy

\[
\begin{aligned}
\big|\mathbb{E}[\phi\,R_{\varphi}(\cdot;\boldsymbol{\delta})]\big| &\leq \|R_{\varphi}(\cdot;\boldsymbol{\delta})\|_{L_{2}}\leq C_{\varphi}\,\|\boldsymbol{\delta}\|^{2}, \\
|r_{\mathrm{A5}}(\varepsilon)| &\leq 2K\,\varepsilon^{2},
\end{aligned}
\]

with $C_{\varphi}$ as in Lemma 7 and $K$ the differentiability modulus from (A5). Finally,

\[
\chi^{2}(P_{1},P_{0})=\varepsilon^{2}\qquad\text{and}\qquad D_{\mathrm{KL}}(P_{1}\|P_{0})\leq\log(1+\varepsilon^{2})\leq\varepsilon^{2}.
\]
Proof.

Properties of $\phi$. Write $\sigma_{0}^{2}:=\operatorname{Var}(\varphi_{0})$, so $\rho=\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]/\sigma_{0}$. By construction,

\[
\phi=\frac{\tilde{\phi}_{0}-(\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]/\sigma_{0}^{2})\,\varphi_{0}}{\sqrt{\,1-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}\,}}.
\]

Since $\mathbb{E}[\tilde{\phi}_{0}]=0$ and $\mathbb{E}[\varphi_{0}]=0$, we have $\mathbb{E}[\phi]=0$. Next,

\[
\begin{aligned}
\operatorname{Var}(\phi) &= \frac{\operatorname{Var}(\tilde{\phi}_{0})+(\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]/\sigma_{0}^{2})^{2}\operatorname{Var}(\varphi_{0})-2(\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]/\sigma_{0}^{2})\operatorname{Cov}(\tilde{\phi}_{0},\varphi_{0})}{1-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}} \\
&= \frac{1+\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}-2\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}}{1-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}}=1.
\end{aligned}
\]

Also,

\[
\mathbb{E}[\phi\,\varphi_{0}]=\frac{\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]-(\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]/\sigma_{0}^{2})\,\sigma_{0}^{2}}{\sqrt{\,1-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}\,}}=0.
\]

Finally,

\[
\mathbb{E}[\phi\,\tilde{\phi}_{0}] = \frac{\mathbb{E}[\tilde{\phi}_{0}^{2}]-(\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]/\sigma_{0}^{2})\,\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]}{\sqrt{\,1-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}\,}}=\sqrt{\,1-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}\,}=\sqrt{1-\rho^{2}}.
\]

Because $\tilde{\phi}_{0}=\sigma_{h}^{-1}\boldsymbol{\delta}^{\top}\boldsymbol{h}$, we obtain

\[
\mathbb{E}\big[\phi\,\boldsymbol{\delta}^{\top}\boldsymbol{h}\big]=\sigma_{h}\,\mathbb{E}[\phi\,\tilde{\phi}_{0}]=\sigma_{h}\sqrt{1-\rho^{2}}.
\]
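The construction of $\phi$ is a single Gram-Schmidt step followed by renormalization, so the four identities above hold exactly when expectations are replaced by sample averages. The sketch below illustrates this on synthetic data; the function name `hardest_direction` and the simulated inputs are assumptions made for illustration, not part of the paper.

```python
import numpy as np

def hardest_direction(phi_tilde0, phi0):
    """Orthogonalize phi_tilde0 against phi0 and rescale to unit variance.

    Assumes both inputs have sample mean zero and phi_tilde0 has sample variance one.
    """
    sigma0_sq = np.mean(phi0 ** 2)                 # Var(phi_0)
    cross = np.mean(phi_tilde0 * phi0)             # E[phi~_0 phi_0]
    rho = cross / np.sqrt(sigma0_sq)               # Corr(phi~_0, phi_0)
    phi = (phi_tilde0 - cross * phi0 / sigma0_sq) / np.sqrt(1.0 - rho ** 2)
    return phi, rho

rng = np.random.default_rng(1)
n = 5000
phi0 = rng.normal(size=n)
phi0 -= phi0.mean()                                # centred, as phi_0 = Y - psi(0)
raw = 0.6 * phi0 + rng.normal(size=n)              # correlated with phi_0
phi_tilde0 = (raw - raw.mean()) / raw.std()        # standardized leading direction
phi, rho = hardest_direction(phi_tilde0, phi0)

print(phi.mean(), np.mean(phi ** 2), np.mean(phi * phi0))
print(np.mean(phi * phi_tilde0), np.sqrt(1.0 - rho ** 2))
```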

Validity of $p_{1}$. By the definition of $\varepsilon_{0}$ in (A4), for every $\phi\in L_{\infty}$ with $\mathbb{E}[\phi]=0$ and $\|\phi\|_{\infty}\leq B$, the measure with density $p_{0}(1+\varepsilon\phi)$ lies in $\mathcal{P}$ for all $|\varepsilon|\leq\varepsilon_{0}$. Moreover, $\int p_{1}=\int p_{0}(1+\varepsilon\phi)=1+\varepsilon\,\mathbb{E}[\phi]=1$. If $|\varepsilon|\leq(2B)^{-1}$ then $1+\varepsilon\phi\geq 1-|\varepsilon|\,\|\phi\|_{\infty}\geq 1/2$, so $p_{1}\geq\tfrac{1}{2}p_{0}$.

First-order expansion of $\theta$ via (A5). By (A5), for $|\varepsilon|\leq\varepsilon_{0}$,

\[
\begin{aligned}
\psi_{P_{0}(1+\varepsilon\phi)}(\boldsymbol{\delta})-\psi_{P_{0}}(\boldsymbol{\delta}) &= \varepsilon\,\mathbb{E}_{P_{0}}\!\big[\varphi_{\psi(\boldsymbol{\delta})}\,\phi\big]+r_{1}(\varepsilon),\qquad|r_{1}(\varepsilon)|\leq K\varepsilon^{2}, \\
\psi_{P_{0}(1+\varepsilon\phi)}(\boldsymbol{0})-\psi_{P_{0}}(\boldsymbol{0}) &= \varepsilon\,\mathbb{E}_{P_{0}}\!\big[\varphi_{\psi(\boldsymbol{0})}\,\phi\big]+r_{0}(\varepsilon),\qquad|r_{0}(\varepsilon)|\leq K\varepsilon^{2}.
\end{aligned}
\]

Subtracting the second line from the first, with $r_{\mathrm{A5}}(\varepsilon):=r_{1}(\varepsilon)-r_{0}(\varepsilon)$, gives

\[
\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})=\varepsilon\,\mathbb{E}\big[\phi\,\varphi_{\theta(\boldsymbol{\delta})}\big]+r_{\mathrm{A5}}(\varepsilon),\qquad|r_{\mathrm{A5}}(\varepsilon)|\leq 2K\varepsilon^{2}.
\]

Evaluation of the leading inner product and remainder bound. By the linearization,

\[
\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})=\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})+R_{\varphi}(\boldsymbol{Z};\boldsymbol{\delta}),\qquad\|R_{\varphi}(\cdot;\boldsymbol{\delta})\|_{L_{2}}\leq C_{\varphi}\,\|\boldsymbol{\delta}\|^{2},
\]

for all $\|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}$. Hence

\[
\begin{aligned}
\mathbb{E}\big[\phi\,\varphi_{\theta(\boldsymbol{\delta})}\big] &= \mathbb{E}[\phi\,\boldsymbol{\delta}^{\top}\boldsymbol{h}]+\mathbb{E}[\phi\,R_{\varphi}] \\
&= \sigma_{h}\sqrt{1-\rho^{2}}+\mathbb{E}[\phi\,R_{\varphi}],\qquad|\mathbb{E}[\phi\,R_{\varphi}]|\leq C_{\varphi}\,\|\boldsymbol{\delta}\|^{2},
\end{aligned}
\]

where the last bound follows from Cauchy–Schwarz and $\|\phi\|_{L_{2}}=1$. Finally, $\chi^{2}(P_{1},P_{0})=\int(p_{1}-p_{0})^{2}/p_{0}=\varepsilon^{2}\,\mathbb{E}[\phi^{2}]=\varepsilon^{2}$, and the KL bound follows from the standard inequality $D_{\mathrm{KL}}(P_{1}\|P_{0})\leq\log\big(1+\chi^{2}(P_{1},P_{0})\big)$.

C.3 Minimax Lower Bound

Recall

\[
\begin{aligned}
\boldsymbol{H} &:= \operatorname{Cov}\!\big(\boldsymbol{h}(\boldsymbol{Z})\big), \\
\sigma_{0}^{2} &:= \operatorname{Var}(\varphi_{0})=\operatorname{Var}(Y-\psi(\boldsymbol{0})).
\end{aligned}
\]

Define the projected covariance (Schur complement) of the linear regression of $\boldsymbol{h}$ onto the one-dimensional span of $\varphi_{0}$:

\[
\boldsymbol{\Gamma} := \boldsymbol{H}-\frac{\operatorname{Cov}\big(\boldsymbol{h},\varphi_{0}\big)\,\operatorname{Cov}\big(\boldsymbol{h},\varphi_{0}\big)^{\top}}{\sigma_{0}^{2}}\succeq\boldsymbol{0}.
\]

Theorem 4 (Minimax lower bound).

Assume (A1)–(A5). Fix some $\kappa\in(0,1)$ and assume

\[
\|\boldsymbol{\delta}\|\ \leq\ \min\Big\{\frac{C_{\delta}}{M_{s}},\ (1-\kappa)\frac{\sqrt{\lambda_{\min}(\boldsymbol{\Gamma})}}{C_{\varphi}}\Big\}.
\]

If $\boldsymbol{\Gamma}$ is positive definite, then there exists a constant $C>0$ that depends only on $\kappa$ and the constants in (A1)–(A5), but not on $\boldsymbol{\delta}$, such that

\[
\liminf_{n\to\infty}\ \inf_{\hat{\theta}}\ \sup_{P\in\mathcal{P}}\ \frac{n\,\mathbb{E}_{P}\big[(\hat{\theta}-\theta_{P}(\boldsymbol{\delta}))^{2}\big]}{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\,\boldsymbol{\delta}}\ \geq\ C.
\]

In particular, the lower bound is expressed in terms of $\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}$ and generally cannot be strengthened by replacing $\boldsymbol{\Gamma}$ with a larger matrix.

Proof of Theorem 4.

Notation.

Let $\sigma_{h}^{2}:=\operatorname{Var}(\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z}))$, $\sigma_{0}^{2}:=\operatorname{Var}(\varphi_{0}(\boldsymbol{Z}))$, and $\tilde{\phi}_{0}:=\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})/\sigma_{h}$.

Define $\rho:=\text{Corr}(\tilde{\phi}_{0},\varphi_{0}(\boldsymbol{Z}))=\dfrac{\operatorname{Cov}(\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z}),\varphi_{0}(\boldsymbol{Z}))}{\sigma_{h}\,\sigma_{0}}$.

Two-point construction. Fix $P_{0}\in\mathcal{P}$ and let $\phi$ be as in Lemma 8. For a given $n$, choose

\[
\begin{aligned}
\alpha_{0} &:= \min\Big\{\varepsilon_{0},\ \frac{1}{2B},\ \frac{1}{2}\Big\}, \\
\varepsilon &:= \frac{\alpha_{0}}{\sqrt{n}}.
\end{aligned}
\]

Then $|\varepsilon|\leq\min\{\varepsilon_{0},(2B)^{-1}\}$, and the perturbed model $P_{1}$ defined by $p_{1}=p_{0}(1+\varepsilon\phi)$ lies in $\mathcal{P}$, integrates to $1$, and satisfies $p_{1}\geq\tfrac{1}{2}p_{0}$.

Parameter gap and the Schur complement. By Lemma 8,

\[
\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta}) = \varepsilon\Big(\sigma_{h}\sqrt{1-\rho^{2}}\Big)+\varepsilon\,\mathbb{E}[\phi\,R_{\varphi}(\cdot;\boldsymbol{\delta})]+r_{\mathrm{A5}}(\varepsilon),
\]

with

\[
|\mathbb{E}[\phi\,R_{\varphi}(\cdot;\boldsymbol{\delta})]|\leq C_{\varphi}\|\boldsymbol{\delta}\|^{2},\qquad|r_{\mathrm{A5}}(\varepsilon)|\leq 2K\varepsilon^{2}.
\]

Moreover,

\[
\sigma_{h}\sqrt{1-\rho^{2}} = \sqrt{\operatorname{Var}(\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z}))-\frac{\operatorname{Cov}(\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z}),\varphi_{0}(\boldsymbol{Z}))^{2}}{\operatorname{Var}(\varphi_{0}(\boldsymbol{Z}))}}=\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\,\boldsymbol{\delta}}.
\]

Therefore

\[
\Big|\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})\Big| \geq |\varepsilon|\Big(\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}-C_{\varphi}\|\boldsymbol{\delta}\|^{2}\Big)-2K\varepsilon^{2}.
\]

Using $\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}\geq\sqrt{\lambda_{\min}(\boldsymbol{\Gamma})}\,\|\boldsymbol{\delta}\|$ and the assumption $\|\boldsymbol{\delta}\|\leq(1-\kappa)\sqrt{\lambda_{\min}(\boldsymbol{\Gamma})}/C_{\varphi}$, we have

\[
\frac{C_{\varphi}\|\boldsymbol{\delta}\|^{2}}{\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}} \leq \frac{C_{\varphi}\|\boldsymbol{\delta}\|}{\sqrt{\lambda_{\min}(\boldsymbol{\Gamma})}}\leq 1-\kappa,
\]

hence

\[
\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}-C_{\varphi}\|\boldsymbol{\delta}\|^{2}\geq\kappa\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}.
\]

Thus

\[
\Big|\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})\Big| \geq |\varepsilon|\,\kappa\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}-2K\varepsilon^{2}.
\]

With $\varepsilon=\alpha_{0}/\sqrt{n}$,

\[
\Big|\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})\Big|\geq\frac{\alpha_{0}}{\sqrt{n}}\,\kappa\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}-\frac{2K\alpha_{0}^{2}}{n}.
\]

KL and TV control. From Lemma 8,

\[
\begin{aligned}
D_{\mathrm{KL}}(P_{1}\|P_{0}) &\leq \varepsilon^{2}=\frac{\alpha_{0}^{2}}{n}, \\
D_{\mathrm{KL}}(P_{1}^{\otimes n}\|P_{0}^{\otimes n}) &= n\,D_{\mathrm{KL}}(P_{1}\|P_{0})\leq\alpha_{0}^{2}.
\end{aligned}
\]

By Pinsker’s inequality,

\[
\mathrm{TV}(P_{0}^{\otimes n},P_{1}^{\otimes n}) \leq \sqrt{\frac{1}{2}D_{\mathrm{KL}}(P_{1}^{\otimes n}\|P_{0}^{\otimes n})}\leq\frac{\alpha_{0}}{\sqrt{2}},
\]

so

\[
1-\mathrm{TV}(P_{0}^{\otimes n},P_{1}^{\otimes n}) \geq 1-\frac{\alpha_{0}}{\sqrt{2}}=:\ c_{\mathrm{TV}}>0.
\]

Le Cam two-point inequality and asymptotic lower bound. Le Cam’s two-point inequality yields, for any estimator $\hat{\theta}$,

\[
\sup_{P\in\{P_{0},P_{1}\}}\mathbb{E}_{P}\!\big[(\hat{\theta}-\theta_{P}(\boldsymbol{\delta}))^{2}\big] \geq \frac{1}{8}\Big(\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})\Big)^{2}\Big(1-\mathrm{TV}(P_{0}^{\otimes n},P_{1}^{\otimes n})\Big).
\]

Using the TV bound above,

\[
\sup_{P\in\{P_{0},P_{1}\}}\mathbb{E}_{P}\!\big[(\hat{\theta}-\theta_{P}(\boldsymbol{\delta}))^{2}\big] \geq \frac{c_{\mathrm{TV}}}{8}\Big(\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})\Big)^{2}.
\]

Now apply the parameter gap bound, which is nonnegative for all sufficiently large $n$, and divide by $\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}/n$:

\[
\frac{n}{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}\sup_{P\in\{P_{0},P_{1}\}}\mathbb{E}_{P}\!\big[(\hat{\theta}-\theta_{P}(\boldsymbol{\delta}))^{2}\big] \geq \frac{c_{\mathrm{TV}}}{8}\cdot\frac{n}{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}\Big(\frac{\alpha_{0}}{\sqrt{n}}\kappa\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}-\frac{2K\alpha_{0}^{2}}{n}\Big)^{2}.
\]

The term $2K\alpha_{0}^{2}/n$ is $o(n^{-1/2})$: after multiplication by $\sqrt{n/(\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta})}$ it is of order $n^{-1/2}$ for fixed $\boldsymbol{\delta}$, so it does not contribute to the limit. Therefore,

\[
\liminf_{n\to\infty}\frac{n}{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}\sup_{P\in\{P_{0},P_{1}\}}\mathbb{E}_{P}\!\big[(\hat{\theta}-\theta_{P}(\boldsymbol{\delta}))^{2}\big] \geq \frac{c_{\mathrm{TV}}}{8}\cdot\alpha_{0}^{2}\kappa^{2}.
\]

Since $\{P_{0},P_{1}\}\subset\mathcal{P}$, the supremum over $\mathcal{P}$ is at least the supremum over $\{P_{0},P_{1}\}$, and the bound holds uniformly over estimators $\hat{\theta}$. Taking $\inf_{\hat{\theta}}$ and $\sup_{P\in\mathcal{P}}$ therefore yields the same lower bound, proving the theorem with $C=(c_{\mathrm{TV}}/8)\,\alpha_{0}^{2}\kappa^{2}$. ∎
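To see how the constants in the proof combine, the following sketch evaluates $\alpha_{0}$, $c_{\mathrm{TV}}$ and $C$, together with the normalized finite-$n$ two-point bound from the display above. The function names and the numerical inputs for $\varepsilon_{0}$, $B$, $K$, $\kappa$ and $\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}$ are placeholders chosen for illustration; they are not values from the paper.

```python
import math

def le_cam_constants(eps0, B, kappa):
    """Return (alpha_0, c_TV, C) as defined in the proof of the lower bound."""
    alpha0 = min(eps0, 1.0 / (2.0 * B), 0.5)
    c_tv = 1.0 - alpha0 / math.sqrt(2.0)          # Pinsker-based bound on 1 - TV
    C = (c_tv / 8.0) * alpha0 ** 2 * kappa ** 2   # limiting constant
    return alpha0, c_tv, C

def finite_n_bound(n, gamma_quad, K, eps0, B, kappa):
    """Normalized two-point bound at sample size n; gamma_quad is delta^T Gamma delta."""
    alpha0, c_tv, _ = le_cam_constants(eps0, B, kappa)
    gap = alpha0 / math.sqrt(n) * kappa * math.sqrt(gamma_quad) - 2.0 * K * alpha0 ** 2 / n
    gap = max(gap, 0.0)                           # the gap bound is only useful when nonnegative
    return (c_tv / 8.0) * (n / gamma_quad) * gap ** 2

alpha0, c_tv, C = le_cam_constants(eps0=0.3, B=2.0, kappa=0.5)
bounds = [finite_n_bound(n, gamma_quad=0.04, K=1.0, eps0=0.3, B=2.0, kappa=0.5)
          for n in (10**2, 10**4, 10**8)]
print(C, bounds)
```

The finite-$n$ values increase toward $C$ as $n$ grows, which reflects the vanishing contribution of the $2K\alpha_{0}^{2}/n$ term.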

On positive definiteness of $\boldsymbol{\Gamma}$. Recall

\[
\begin{aligned}
\boldsymbol{H} &= \operatorname{Cov}(\boldsymbol{h}), \\
\boldsymbol{\Gamma} &:= \boldsymbol{H}-\frac{\operatorname{Cov}(\boldsymbol{h},\varphi_{0})\,\operatorname{Cov}(\boldsymbol{h},\varphi_{0})^{\top}}{\sigma_{0}^{2}}, \\
\sigma_{0}^{2} &:= \operatorname{Var}(\varphi_{0}).
\end{aligned}
\]

Let

\[
\begin{aligned}
\boldsymbol{\beta} &:= \frac{\operatorname{Cov}(\boldsymbol{h},\varphi_{0})}{\sigma_{0}^{2}}, \\
\boldsymbol{h}_{\perp} &:= \boldsymbol{h}-\boldsymbol{\beta}\,\varphi_{0}.
\end{aligned}
\]

Then, writing $\Pi_{\mathrm{span}(\varphi_{0})}$ for the $L_{2}$ projection onto the span of $\varphi_{0}$,

\[
\boldsymbol{\Gamma} = \operatorname{Cov}(\boldsymbol{h}_{\perp})=\operatorname{Cov}\!\big(\boldsymbol{h}-\Pi_{\mathrm{span}(\varphi_{0})}\boldsymbol{h}\big)\succeq\boldsymbol{0}.
\]

The matrix $\boldsymbol{\Gamma}$ is positive definite if and only if the residual $\boldsymbol{h}_{\perp}$ is nondegenerate, that is

\[
\forall\,\boldsymbol{\delta}\neq\boldsymbol{0}:\quad\operatorname{Var}\!\big(\boldsymbol{\delta}^{\top}\boldsymbol{h}_{\perp}\big) = \operatorname{Var}(\boldsymbol{\delta}^{\top}\boldsymbol{h})-\frac{\operatorname{Cov}(\boldsymbol{\delta}^{\top}\boldsymbol{h},\varphi_{0})^{2}}{\operatorname{Var}(\varphi_{0})}\;>\;0,
\]

equivalently $\operatorname{Var}(\boldsymbol{\delta}^{\top}\boldsymbol{h})>0$ and $\text{Corr}(\boldsymbol{\delta}^{\top}\boldsymbol{h},\varphi_{0})\neq\pm 1$ for every nonzero $\boldsymbol{\delta}$.

If $\text{Corr}(\boldsymbol{\delta}^{\top}\boldsymbol{h},\varphi_{0})=\pm 1$ for some $\boldsymbol{\delta}\neq\boldsymbol{0}$, then there exist scalars $a,b$ such that

\[
\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})=a\,\varphi_{0}(\boldsymbol{Z})+b\quad\text{a.s.}
\]

Recall $\varphi_{0}(\boldsymbol{Z})=Y-\operatorname{E}[Y]$ and $\boldsymbol{h}(\boldsymbol{Z})=\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\big(Y-m(\boldsymbol{X})\big)-\boldsymbol{C}$ with $m(\boldsymbol{X})=\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]$ and $\boldsymbol{C}=\operatorname{E}[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]]$. Collecting coefficients of $Y$ yields

\[
\big(\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})-a\big)\,Y = \boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\,m(\boldsymbol{X})+\boldsymbol{\delta}^{\top}\boldsymbol{C}-a\,\operatorname{E}[Y]+b.
\]

Conditionally on $(\boldsymbol{X},\boldsymbol{W})$ the right-hand side is constant, while by (A2) $\operatorname{Var}(Y\mid\boldsymbol{X},\boldsymbol{W})>0$ a.s. The coefficient of $Y$ must therefore vanish:

\[
\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})-a\;=\;0\quad\text{a.s.}
\]

Taking conditional expectation given $\boldsymbol{X}$ and using $\operatorname{E}[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]=\boldsymbol{0}$ gives $a=0$ and thus

\[
\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\equiv 0\quad\text{a.s.}
\]

Thus, failure of $\boldsymbol{\Gamma}\succ 0$ requires a nontrivial direction $\boldsymbol{\delta}$ along which $\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})$ is almost surely degenerate.
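The identity $\boldsymbol{\Gamma}=\operatorname{Cov}(\boldsymbol{h}_{\perp})$ and the effect of a degenerate direction can be illustrated numerically. In the sketch below the data are generic synthetic draws rather than the paper's $\boldsymbol{h}$ and $\varphi_{0}$, and the helper name `projected_covariance` is hypothetical. The second call forces one coordinate to be an exact multiple of $\varphi_{0}$, so that $\text{Corr}(\boldsymbol{\delta}^{\top}\boldsymbol{h},\varphi_{0})=1$ along $\boldsymbol{\delta}=\boldsymbol{e}_{1}$ and the smallest eigenvalue of $\boldsymbol{\Gamma}$ collapses to zero.

```python
import numpy as np

def projected_covariance(h, phi0):
    """Empirical Gamma computed two ways: Schur-complement formula and Cov(h_perp)."""
    h_c = h - h.mean(axis=0)
    p_c = phi0 - phi0.mean()
    n = p_c.size
    H = h_c.T @ h_c / n                       # Cov(h)
    cov_hp = h_c.T @ p_c / n                  # Cov(h, phi_0)
    sigma0_sq = p_c @ p_c / n                 # Var(phi_0)
    gamma = H - np.outer(cov_hp, cov_hp) / sigma0_sq
    beta = cov_hp / sigma0_sq                 # regression coefficient of h on phi_0
    h_perp = h_c - np.outer(p_c, beta)        # residual h - beta * phi_0
    return gamma, h_perp.T @ h_perp / n

rng = np.random.default_rng(2)
n = 4000
phi0 = rng.normal(size=n)
h = rng.normal(size=(n, 3)) + 0.5 * phi0[:, None]
gamma, gamma_resid = projected_covariance(h, phi0)

h_deg = h.copy()
h_deg[:, 0] = 2.0 * phi0                      # first coordinate perfectly correlated with phi_0
gamma_deg, _ = projected_covariance(h_deg, phi0)

min_eig = np.linalg.eigvalsh(gamma).min()
min_eig_deg = np.linalg.eigvalsh(gamma_deg).min()
print(min_eig, min_eig_deg)
```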

Linear Incremental Effect

By explicitly specializing the general tilt function to the linear form $\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})=\boldsymbol{w}$, the projected covariance $\boldsymbol{\Gamma}$ recovers the exact geometric structure defined in Section 4.1, thereby completing the minimax lower bound for the linear incremental effect $\theta(\boldsymbol{\delta})$ evaluated in the main text.

Appendix D On convergence and normality

We first establish a key auxiliary result under the same assumptions and notation used in the proof of the minimax lower bound. We clarify how our rate conditions on the tilted nuisances $(m_{\boldsymbol{\delta}},r_{\boldsymbol{\delta}})$ translate into standard $L_{2}$ rates for the outcome regression $\mu$ and the exposure density $f$.

Lemma 9 (Reduction to outcome regression and exposure density).

Assume (A1)–(A3), and let $f(\boldsymbol{w}\mid\boldsymbol{x})$ denote the conditional density of $\boldsymbol{W}$ given $\boldsymbol{X}=\boldsymbol{x}$, with support $\mathcal{W}\subset\mathbb{R}^{q}$ of finite Lebesgue measure $|\mathcal{W}|<\infty$. Suppose there exist constants $0<f_{\min}\leq f_{\max}<\infty$ such that

\[
f_{\min}\;\leq\;f(\boldsymbol{w}\mid\boldsymbol{x})\;\leq\;f_{\max}\qquad\text{for all $(\boldsymbol{x},\boldsymbol{w})$ in the support of $(\boldsymbol{X},\boldsymbol{W})$.}
\]

Let $\mu(\boldsymbol{x},\boldsymbol{w})=\mathbb{E}[Y\mid\boldsymbol{X}=\boldsymbol{x},\boldsymbol{W}=\boldsymbol{w}]$, and fix $\Delta<\infty$. For any $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|\leq\Delta$ define

\[
\begin{aligned}
\nu_{\boldsymbol{\delta}}(\boldsymbol{x}) &:= \int_{\mathcal{W}}\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\,f(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}, \\
\eta_{\boldsymbol{\delta}}(\boldsymbol{x}) &:= \int_{\mathcal{W}}\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\,\mu(\boldsymbol{x},\boldsymbol{w})\,f(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w},
\end{aligned}
\]

and

\[
\begin{aligned}
m_{\boldsymbol{\delta}}(\boldsymbol{x}) &:= \frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{x})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})}, \\
r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x}) &:= \frac{\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})}.
\end{aligned}
\]

Let $\widehat{\mu},\widehat{f}$ be any estimators of $\mu,f$, and construct

\[
\begin{aligned}
\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x}) &:= \int_{\mathcal{W}}\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\,\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}, \\
\widehat{\eta}_{\boldsymbol{\delta}}(\boldsymbol{x}) &:= \int_{\mathcal{W}}\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\,\widehat{\mu}(\boldsymbol{x},\boldsymbol{w})\,\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}.
\end{aligned}
\]

To ensure the denominators in $\widehat{m}_{\boldsymbol{\delta}}$ and $\widehat{r}_{\boldsymbol{\delta}}$ are strictly bounded away from zero, we define the truncated estimator $\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}$ below, where $\tau_{\Delta}<\infty$ is the constant, fixed in the proof, that bounds the tilt via $\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\leq e^{\tau_{\Delta}}$. Later we show this regularization enforces positivity constraints consistent with the true $\nu_{\boldsymbol{\delta}}$ without compromising the $L_{2}$ convergence rates inherited from the nuisance estimators.

\[
\begin{aligned}
\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x}) &:= \min\Big\{\max\big(\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x}),\ e^{-\tau_{\Delta}}/2\big),\ 2e^{\tau_{\Delta}}\Big\}, \\
\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{x}) &:= \frac{\widehat{\eta}_{\boldsymbol{\delta}}(\boldsymbol{x})}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})}, \\
\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x}) &:= \frac{\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})}.
\end{aligned}
\]

Then there exist finite constants $C_{1}(\Delta),C_{2}(\Delta)<\infty$, depending only on $(\Delta,f_{\min},f_{\max},|\mathcal{W}|,M)$ and the bound on $\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})$, such that

\[
\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2} \leq C_{1}(\Delta)\,\|\widehat{f}-f\|_{2}, \tag{10}
\]

\[
\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2} \leq C_{2}(\Delta)\,\big(\|\widehat{\mu}-\mu\|_{2}+\|\widehat{f}-f\|_{2}\big), \tag{11}
\]

where $\|\cdot\|_{2}$ denotes the $L_{2}(P)$-norm with respect to the law of $(\boldsymbol{X},\boldsymbol{W})$.
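The plug-in construction in the lemma can be sketched numerically for a scalar exposure. The example below assumes $\mathcal{W}=[0,1]$, $s(w,x)=w$, a single covariate value, and a midpoint rule in place of the exact integrals; the function name `tilted_nuisances`, the density estimate and the outcome regression are all hypothetical choices made for illustration.

```python
import numpy as np

def tilted_nuisances(delta, w_mid, dw, f_hat, mu_hat, tau):
    """Midpoint-rule versions of nu_hat, its truncation, m_hat and r_hat at one x."""
    tilt = np.exp(delta * w_mid)                        # exp{delta * s(w, x)} with s(w, x) = w
    nu_hat = np.sum(tilt * f_hat) * dw                  # approximates nu_hat_delta(x)
    eta_hat = np.sum(tilt * mu_hat * f_hat) * dw        # approximates eta_hat_delta(x)
    # Truncate to [exp(-tau)/2, 2 exp(tau)] so the denominator stays away from 0.
    nu_dagger = min(max(nu_hat, np.exp(-tau) / 2.0), 2.0 * np.exp(tau))
    return nu_hat, nu_dagger, eta_hat / nu_dagger, tilt / nu_dagger

k = 1000
dw = 1.0 / k
w_mid = (np.arange(k) + 0.5) * dw                       # midpoints of a grid on [0, 1]
delta = 0.8
tau = abs(delta) * 1.0                                  # |delta * w| <= tau on [0, 1]
f_hat = 1.0 + 0.5 * (w_mid - 0.5)                       # assumed density estimate on [0, 1]
mu_hat = 2.0 * w_mid                                    # assumed outcome regression estimate

nu_hat, nu_dagger, m_hat, r_hat = tilted_nuisances(delta, w_mid, dw, f_hat, mu_hat, tau)
# A degenerate density estimate triggers the lower truncation level.
_, nu_floor, _, _ = tilted_nuisances(delta, w_mid, dw, np.zeros(k), mu_hat, tau)
print(nu_hat, nu_dagger, m_hat, nu_floor)
```

When the density estimate is well behaved the truncation is inactive, and $\widehat{r}_{\boldsymbol{\delta}}\widehat{f}$ integrates to one, so the tilted density is properly normalized.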

Proof.

Since $\|\boldsymbol{\delta}\|\leq\Delta$ and $\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})$ is bounded, there exists $\tau_{\Delta}<\infty$ such that $|\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})|\leq\tau_{\Delta}$, and hence $e^{-\tau_{\Delta}}\leq\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\leq e^{\tau_{\Delta}}$, almost surely. Because $f(\cdot\mid\boldsymbol{x})$ integrates to one, for all such $\boldsymbol{\delta}$,

\[
e^{-\tau_{\Delta}}\;\leq\;\nu_{\boldsymbol{\delta}}(\boldsymbol{x})\;\leq\;e^{\tau_{\Delta}}\qquad\text{for all $\boldsymbol{x}$.}
\]

In particular, $\nu_{\boldsymbol{\delta}}$ is bounded away from $0$ and $\infty$; the same holds for $\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}$ by construction.

Control of $\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}$. By definition,

\[
\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x})-\nu_{\boldsymbol{\delta}}(\boldsymbol{x})=\int_{\mathcal{W}}\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\big\{\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big\}\,d\boldsymbol{w}.
\]

Taking absolute values and using the bound on the exponential tilt gives

\[
|\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x})-\nu_{\boldsymbol{\delta}}(\boldsymbol{x})|\leq e^{\tau_{\Delta}}\int_{\mathcal{W}}\big|\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big|\,d\boldsymbol{w}.
\]

By Cauchy–Schwarz and finiteness of $|\mathcal{W}|$,

\[
\int_{\mathcal{W}}\big|\widehat{f}-f\big|\,d\boldsymbol{w}\leq|\mathcal{W}|^{1/2}\Big(\int_{\mathcal{W}}\big|\widehat{f}-f\big|^{2}\,d\boldsymbol{w}\Big)^{1/2}.
\]

Combining the last two displays, squaring both sides, and integrating over $\boldsymbol{x}$ with respect to $P_{\boldsymbol{X}}$ gives

\[
\int\{\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x})-\nu_{\boldsymbol{\delta}}(\boldsymbol{x})\}^{2}\,dP_{\boldsymbol{X}}(\boldsymbol{x}) \leq e^{2\tau_{\Delta}}|\mathcal{W}|\int\!\!\int_{\mathcal{W}}\big|\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big|^{2}\,d\boldsymbol{w}\,dP_{\boldsymbol{X}}(\boldsymbol{x}).
\]

By definition,

f^f22=𝒲|f^(𝒘𝒙)f(𝒘𝒙)|2f(𝒘𝒙)d𝒘dP𝑿(𝒙).\|\widehat{f}-f\|_{2}^{2}=\int\!\!\int_{\mathcal{W}}\big|\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big|^{2}f(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}\,dP_{\boldsymbol{X}}(\boldsymbol{x}).

Under the boundedness assumption on the exposure density, there exists $f_{\min}>0$ such that $f(\boldsymbol{w}\mid\boldsymbol{x})\geq f_{\min}$ for all $(\boldsymbol{x},\boldsymbol{w})$. Hence, for any non-negative integrand $h(\boldsymbol{x},\boldsymbol{w})$ we have

\[
\begin{aligned}
\int\!\!\int_{\mathcal{W}}h(\boldsymbol{x},\boldsymbol{w})\,d\boldsymbol{w}\,dP_{\boldsymbol{X}}(\boldsymbol{x})
&=\int\!\!\int_{\mathcal{W}}\frac{h(\boldsymbol{x},\boldsymbol{w})}{f(\boldsymbol{w}\mid\boldsymbol{x})}\,f(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}\,dP_{\boldsymbol{X}}(\boldsymbol{x})\\
&\leq f_{\min}^{-1}\int\!\!\int_{\mathcal{W}}h(\boldsymbol{x},\boldsymbol{w})\,f(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}\,dP_{\boldsymbol{X}}(\boldsymbol{x}).
\end{aligned}
\]

Applying this with h(𝒙,𝒘)=|f^(𝒘𝒙)f(𝒘𝒙)|2h(\boldsymbol{x},\boldsymbol{w})=\big|\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big|^{2} yields

𝒲|f^(𝒘𝒙)f(𝒘𝒙)|2d𝒘dP𝑿(𝒙)fmin1f^f22.\int\!\!\int_{\mathcal{W}}\big|\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big|^{2}\,d\boldsymbol{w}\,dP_{\boldsymbol{X}}(\boldsymbol{x})\leq f_{\min}^{-1}\,\|\widehat{f}-f\|_{2}^{2}.

Substituting into the previous display, and noting that for any function depending only on 𝑿\boldsymbol{X} we have

ν^𝜹ν𝜹22={ν^𝜹(𝒙)ν𝜹(𝒙)}2𝑑P𝑿(𝒙),\|\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}\|_{2}^{2}=\int\{\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x})-\nu_{\boldsymbol{\delta}}(\boldsymbol{x})\}^{2}\,dP_{\boldsymbol{X}}(\boldsymbol{x}),

we obtain

ν^𝜹ν𝜹22e2τΔ|𝒲|fmin1f^f22,\|\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}\|_{2}^{2}\leq e^{2\tau_{\Delta}}|\mathcal{W}|\,f_{\min}^{-1}\,\|\widehat{f}-f\|_{2}^{2},

and hence

ν^𝜹ν𝜹2eτΔ|𝒲|1/2fmin1/2f^f2.\|\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}\|_{2}\leq e^{\tau_{\Delta}}|\mathcal{W}|^{1/2}f_{\min}^{-1/2}\,\|\widehat{f}-f\|_{2}.

Moreover, since $\nu_{\boldsymbol{\delta}}(\boldsymbol{x})\in[e^{-\tau_{\Delta}},e^{\tau_{\Delta}}]\subset[e^{-\tau_{\Delta}}/2,2e^{\tau_{\Delta}}]$ for all $\boldsymbol{x}$, the truncation map onto $[e^{-\tau_{\Delta}}/2,2e^{\tau_{\Delta}}]$ leaves $\nu_{\boldsymbol{\delta}}$ unchanged. The map is also $1$-Lipschitz, so we have

\[
\|\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}-\nu_{\boldsymbol{\delta}}\|_{2}\leq\|\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}\|_{2}.
\]

Control of $\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}$. We have

\[
\begin{aligned}
\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})-r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})
&=\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\Big\{\frac{1}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})}-\frac{1}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})}\Big\}\\
&=\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\,\frac{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})-\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})\,\nu_{\boldsymbol{\delta}}(\boldsymbol{x})}.
\end{aligned}
\]

Using the finite-Δ\Delta bounds on the numerator and denominators, there exists a constant Cr(Δ)C_{r}(\Delta) such that

|r^𝜹r𝜹|Cr(Δ)|ν^𝜹ν𝜹|,|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}|\leq C_{r}(\Delta)\,|\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}-\nu_{\boldsymbol{\delta}}|,

so that

\[
\begin{aligned}
\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}
&\leq C_{r}(\Delta)\,\|\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}-\nu_{\boldsymbol{\delta}}\|_{2}\leq C_{r}(\Delta)\,\|\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}\|_{2}\\
&\leq C_{1}(\Delta)\,\|\widehat{f}-f\|_{2},
\end{aligned}
\]

with $C_{1}(\Delta):=C_{r}(\Delta)\,e^{\tau_{\Delta}}|\mathcal{W}|^{1/2}f_{\min}^{-1/2}$, which proves (10).
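The chain of inequalities used above to control $\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}$, together with the truncation step, can be checked numerically. The following is a minimal sketch under assumptions that are not part of the paper: a one-dimensional exposure on $[0,1]$ (so $|\mathcal{W}|=1$), score $s(w)=w$, a single suppressed covariate value, a hypothetical true density `f_true`, and a deliberately crude estimate `f_hat`. Integrals are replaced by means over a midpoint grid, for which each inequality holds exactly.

```python
import numpy as np

# Hypothetical one-dimensional setup (not from the paper): W in [0, 1], so |W| = 1,
# score s(w) = w, and a single covariate value x that is suppressed in the notation.
delta, tau = 0.8, 0.8                      # |delta * s(w)| <= tau on [0, 1]
n_grid = 10_000
w = (np.arange(n_grid) + 0.5) / n_grid     # midpoint grid; integrals become grid means

f_true = 0.5 + w                           # true density, bounded below by f_min = 0.5
f_hat = np.ones_like(w)                    # crude "estimate": the uniform density
f_min, vol_W = 0.5, 1.0

tilt = np.exp(delta * w)
nu_true = np.mean(tilt * f_true)           # nu_delta
nu_hat = np.mean(tilt * f_hat)             # plug-in estimate of nu_delta

err = np.abs(f_hat - f_true)
lhs = abs(nu_hat - nu_true)
# Bound the tilt by exp(tau), then apply Cauchy-Schwarz, then reweight by f >= f_min.
bound_l1 = np.exp(tau) * np.mean(err)
bound_l2 = np.exp(tau) * np.sqrt(vol_W) * np.sqrt(np.mean(err**2))
bound_weighted = np.exp(tau) * np.sqrt(vol_W / f_min) * np.sqrt(np.mean(err**2 * f_true))

# Truncation to [exp(-tau)/2, 2*exp(tau)] never moves a candidate away from nu_true.
lower, upper = np.exp(-tau) / 2.0, 2.0 * np.exp(tau)
candidates = np.array([0.01, nu_hat, 10.0])
truncated = np.clip(candidates, lower, upper)
```

In this toy setting `lhs`, `bound_l1`, `bound_l2`, and `bound_weighted` appear in increasing order, mirroring the displays above, and each entry of `truncated` is at least as close to `nu_true` as the corresponding entry of `candidates`.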

Control of η^𝛅η𝛅\widehat{\eta}_{\boldsymbol{\delta}}-\eta_{\boldsymbol{\delta}}. Using the product expansion μ^f^μf=(μ^μ)f^+μ(f^f)\widehat{\mu}\widehat{f}-\mu f=(\widehat{\mu}-\mu)\widehat{f}+\mu(\widehat{f}-f), we find

η^𝜹(𝒙)η𝜹(𝒙)\displaystyle\widehat{\eta}_{\boldsymbol{\delta}}(\boldsymbol{x})-\eta_{\boldsymbol{\delta}}(\boldsymbol{x}) =𝒲exp{𝜹𝒔(𝒘,𝒙)}[(μ^μ)(𝒙,𝒘)f^(𝒘𝒙)+μ(𝒙,𝒘){f^(𝒘𝒙)f(𝒘𝒙)}]𝑑𝒘.\displaystyle=\int_{\mathcal{W}}\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\big[(\widehat{\mu}-\mu)(\boldsymbol{x},\boldsymbol{w})\,\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})+\mu(\boldsymbol{x},\boldsymbol{w})\big\{\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big\}\big]\,d\boldsymbol{w}.

Assumption (A1) implies |μ(𝒙,𝒘)|M<|\mu(\boldsymbol{x},\boldsymbol{w})|\leq M<\infty, and the bounds on f,f^f,\widehat{f} give 0f^fmax+oP(1)0\leq\widehat{f}\leq f_{\max}+o_{P}(1). Hence

|η^𝜹(𝒙)η𝜹(𝒙)|\displaystyle|\widehat{\eta}_{\boldsymbol{\delta}}(\boldsymbol{x})-\eta_{\boldsymbol{\delta}}(\boldsymbol{x})| eτΔ[𝒲|(μ^μ)(𝒙,𝒘)|f^(𝒘𝒙)d𝒘+M𝒲|f^(𝒘𝒙)f(𝒘𝒙)|d𝒘].\displaystyle\leq e^{\tau_{\Delta}}\Big[\int_{\mathcal{W}}|(\widehat{\mu}-\mu)(\boldsymbol{x},\boldsymbol{w})|\,\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}+M\int_{\mathcal{W}}|\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})|\,d\boldsymbol{w}\Big].

On an event whose probability tends to one we have 0f^fmax+10\leq\widehat{f}\leq f_{\max}+1, and another application of Cauchy–Schwarz yields

η^𝜹η𝜹2Cη(Δ)(μ^μ2+f^f2)\|\widehat{\eta}_{\boldsymbol{\delta}}-\eta_{\boldsymbol{\delta}}\|_{2}\leq C_{\eta}(\Delta)\big(\|\widehat{\mu}-\mu\|_{2}+\|\widehat{f}-f\|_{2}\big)

for some finite constant Cη(Δ)C_{\eta}(\Delta).

Control of m^𝛅m𝛅\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}. By the algebraic identity a/bc/d=(ac)/b+c(1/b1/d)a/b-c/d=(a-c)/b+c(1/b-1/d),

m^𝜹(𝒙)m𝜹(𝒙)\displaystyle\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{x})-m_{\boldsymbol{\delta}}(\boldsymbol{x}) =η^𝜹(𝒙)η𝜹(𝒙)ν^𝜹(𝒙)+η𝜹(𝒙){1ν^𝜹(𝒙)1ν𝜹(𝒙)}.\displaystyle=\frac{\widehat{\eta}_{\boldsymbol{\delta}}(\boldsymbol{x})-\eta_{\boldsymbol{\delta}}(\boldsymbol{x})}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})}+\eta_{\boldsymbol{\delta}}(\boldsymbol{x})\Big\{\frac{1}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})}-\frac{1}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})}\Big\}.

Using again the uniform bounds on η𝜹,ν𝜹,ν^𝜹\eta_{\boldsymbol{\delta}},\nu_{\boldsymbol{\delta}},\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger} implied by (A1)–(A3), we obtain

|m^𝜹m𝜹|Cm(Δ)(|η^𝜹η𝜹|+|ν^𝜹ν𝜹|)|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}|\leq C_{m}(\Delta)\Big(|\widehat{\eta}_{\boldsymbol{\delta}}-\eta_{\boldsymbol{\delta}}|+|\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}-\nu_{\boldsymbol{\delta}}|\Big)

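for some finite constant $C_{m}(\Delta)$. Taking $L_{2}(P)$-norms and combining the bounds above gives (11) with

\[
C_{2}(\Delta):=C_{m}(\Delta)\big(C_{\eta}(\Delta)+e^{\tau_{\Delta}}|\mathcal{W}|^{1/2}f_{\min}^{-1/2}\big),
\]

where the factor $f_{\min}^{-1/2}$ is inherited from the bound on $\|\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}\|_{2}$ established above.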

Corollary 3 (Rate condition in terms of (μ,f)(\mu,f)).

Under the conditions of Lemma 9, if

μ^μ2=oP(n1/4)andf^f2=oP(n1/4),\|\widehat{\mu}-\mu\|_{2}=o_{P}(n^{-1/4})\qquad\text{and}\qquad\|\widehat{f}-f\|_{2}=o_{P}(n^{-1/4}),

then, for every fixed 𝛅\boldsymbol{\delta} with 𝛅Δ\|\boldsymbol{\delta}\|\leq\Delta,

r^𝜹r𝜹2=oP(n1/4),m^𝜹m𝜹2=oP(n1/4),\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/4}),\qquad\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/4}),

and hence the product condition r^𝛅r𝛅2m^𝛅m𝛅2=oP(n1/2)\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/2}) required in Theorem 5 holds.
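As a hedged illustration of the rate arithmetic, suppose (purely hypothetically, this is not claimed for any particular learner) that the nuisance errors are polynomial in $n$ with exponents $a,b>1/4$. Assuming the bounds (10) and (11) take the form used in the proof above, the product condition follows directly:

```latex
\[
\|\widehat{\mu}-\mu\|_{2}=O_{P}(n^{-a}),\quad
\|\widehat{f}-f\|_{2}=O_{P}(n^{-b})
\;\Longrightarrow\;
\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}=O_{P}(n^{-b}),\quad
\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}=O_{P}\big(n^{-\min(a,b)}\big),
\]
\[
\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,
\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}
=O_{P}\big(n^{-b-\min(a,b)}\big)=o_{P}(n^{-1/2})
\quad\text{since } b+\min(a,b)>\tfrac{1}{2}.
\]
```

For example, $a=b=1/3$ gives a product of order $n^{-2/3}$, which is $o_{P}(n^{-1/2})$.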

We now turn to the von Mises expansion and the second-order remainder bound that underpin the finite-𝜹\boldsymbol{\delta} central limit theorem in the main text. The results in this subsection are stated in terms of the tilted nuisances (m𝜹,r𝜹)(m_{\boldsymbol{\delta}},r_{\boldsymbol{\delta}}); together with Corollary 3, they imply Theorem 3.

We work with the cross-fitted one-step estimator written directly in density-ratio form,

ψ^(𝜹)=Pn[r^𝜹(𝑾,𝑿){Ym^𝜹(𝑿)}]+Pn[m^𝜹(𝑿)],θ^(𝜹):=ψ^(𝜹)ψ^(𝟎),\widehat{\psi}(\boldsymbol{\delta})=P_{n}\!\big[\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})\}\big]\;+\;P_{n}\!\big[\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})\big],\qquad\widehat{\theta}(\boldsymbol{\delta}):=\widehat{\psi}(\boldsymbol{\delta})-\widehat{\psi}(\boldsymbol{0}), (12)

where, for a fixed tilt $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|\leq\Delta$,

\[
\begin{aligned}
m_{\boldsymbol{\delta}}(\boldsymbol{x})
&:=\mathbb{E}_{g_{\boldsymbol{\delta}}}\!\left[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}=\boldsymbol{x}\right]\\
&=\frac{\mathbb{E}\!\left[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\,\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}=\boldsymbol{x}\right]}{\mathbb{E}\!\left[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\mid\boldsymbol{X}=\boldsymbol{x}\right]}\\
&=\frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{x})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})},
\end{aligned}
\]

and

\[
\begin{aligned}
\nu_{\boldsymbol{\delta}}(\boldsymbol{x})&:=\mathbb{E}\!\left[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\mid\boldsymbol{X}=\boldsymbol{x}\right],\\
\eta_{\boldsymbol{\delta}}(\boldsymbol{x})&:=\mathbb{E}\!\left[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\,\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}=\boldsymbol{x}\right],\\
\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})&:=\min\Big\{\max\big(\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x}),\ e^{-\tau_{\Delta}}/2\big),\ 2e^{\tau_{\Delta}}\Big\},\\
r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})&:=\frac{\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}}{\nu_{\boldsymbol{\delta}}(\boldsymbol{X})},\\
\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})&:=\frac{\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{X})},\\
\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})&:=\frac{\widehat{\eta}_{\boldsymbol{\delta}}(\boldsymbol{X})}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{X})}.
\end{aligned}
\]

Here ν^𝜹\widehat{\nu}_{\boldsymbol{\delta}} and η^𝜹\widehat{\eta}_{\boldsymbol{\delta}} are obtained from cross-fitted regressions of the transformed outcomes exp{𝜹𝒔(𝑾,𝑿)}\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\} and exp{𝜹𝒔(𝑾,𝑿)}μ(𝑿,𝑾)\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\mu(\boldsymbol{X},\boldsymbol{W}) on 𝑿\boldsymbol{X}, respectively.
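To make the density-ratio form of display (12) concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a toy model chosen so that the nuisances have closed forms: $W\sim\mathrm{Uniform}(0,1)$ independent of $X$, $s(W,X)=W$, and $\mu(x,w)=x+w$. The function names `one_step_psi` and `oracle_nuisances` are hypothetical, and oracle nuisances stand in for the cross-fitted regressions described above.

```python
import numpy as np

def one_step_psi(y, tilt, nu_hat, eta_hat, tau):
    """Mean and per-observation values of r_hat * (Y - m_hat) + m_hat, as in (12)."""
    # Truncate nu_hat to [exp(-tau)/2, 2*exp(tau)], matching the definition of nu-dagger.
    nu_dagger = np.clip(nu_hat, np.exp(-tau) / 2.0, 2.0 * np.exp(tau))
    r_hat = tilt / nu_dagger           # density-ratio weights
    m_hat = eta_hat / nu_dagger        # tilted outcome regression
    values = r_hat * (y - m_hat) + m_hat
    return values.mean(), values

def oracle_nuisances(d, x):
    """Closed-form nu_d(x) and eta_d(x) for the toy model (oracle, not cross-fitted)."""
    if d == 0.0:
        nu, e_w_tilt = 1.0, 0.5
    else:
        nu = (np.exp(d) - 1.0) / d                              # E[exp(d W)]
        e_w_tilt = (d * np.exp(d) - np.exp(d) + 1.0) / d**2     # E[W exp(d W)]
    return np.full_like(x, nu), x * nu + e_w_tilt

# Toy data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, delta = 200_000, 1.0
tau = abs(delta)                       # |delta * W| <= tau because W lies in [0, 1]
x = rng.normal(size=n)
w = rng.uniform(size=n)
y = x + w + rng.normal(size=n)

nu_d, eta_d = oracle_nuisances(delta, x)
nu_0, eta_0 = oracle_nuisances(0.0, x)
psi_delta, values_delta = one_step_psi(y, np.exp(delta * w), nu_d, eta_d, tau)
psi_zero, values_zero = one_step_psi(y, np.ones(n), nu_0, eta_0, tau)
theta_hat = psi_delta - psi_zero       # incremental effect estimate

# Truth in the toy model: E[X] + E[W exp(W)] / E[exp(W)] with E[X] = 0.
psi_true = 1.0 / (np.exp(1.0) - 1.0)
theta_true = psi_true - 0.5
```

With estimated nuisances one would replace `oracle_nuisances` by fold-specific fits and evaluate `one_step_psi` on the held-out fold; that step is omitted here.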

From the von Mises decomposition (see, e.g., Kennedy, 2024) and by adding and subtracting Pφψ(𝜹)P\varphi_{\psi(\boldsymbol{\delta})}, Pnφψ(𝜹)P_{n}\varphi_{\psi(\boldsymbol{\delta})}, and Pnφ^ψ(𝜹)P_{n}\widehat{\varphi}_{\psi(\boldsymbol{\delta})} on the right-hand side, we obtain

ψ^(𝜹)ψ(𝜹)=(PnP){φψ(𝜹)(𝒁)}+(PnP){φ^ψ(𝜹)(𝒁)φψ(𝜹)(𝒁)}+R2(P^,P;𝜹),\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})=(P_{n}-P)\{\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})\}+(P_{n}-P)\!\big\{\widehat{\varphi}_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})-\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})\big\}+R_{2}(\widehat{P},P;\boldsymbol{\delta}), (13)

where

φψ(𝜹)(𝒁)\displaystyle\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}) :=r𝜹(𝑾,𝑿){Ym𝜹(𝑿)}+m𝜹(𝑿)ψ(𝜹),\displaystyle:=r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}+m_{\boldsymbol{\delta}}(\boldsymbol{X})-\psi(\boldsymbol{\delta}),
φ^ψ(𝜹)(𝒁)\displaystyle\widehat{\varphi}_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}) :=r^𝜹(𝑾,𝑿){Ym^𝜹(𝑿)}+m^𝜹(𝑿)ψ^(𝜹),\displaystyle:=\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})\}+\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})-\widehat{\psi}(\boldsymbol{\delta}),

and the second-order remainder is

\[
R_{2}(\widehat{P},P;\boldsymbol{\delta}):=\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})+P\!\left[\widehat{\varphi}_{\psi(\boldsymbol{\delta})}\right].
\]

Lemma 10 (Second-order remainder).

For any fixed $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|\leq\Delta$,

\[
\big|R_{2}(\widehat{P},P;\boldsymbol{\delta})\big|\ \leq\ \|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}.
\]

Proof.

Starting from the estimator representation (12), the definition of $R_{2}$ gives

\[
\begin{aligned}
R_{2}
&=\mathbb{E}\big[\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})\}+\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})\big]\\
&\quad-\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}+m_{\boldsymbol{\delta}}(\boldsymbol{X})\big]\\
&=\mathbb{E}\big[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{Y-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}\big]-\mathbb{E}\big[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}\big],
\end{aligned}
\]

where the last equality uses $\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]=1$ to write $\mathbb{E}[\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}]=\mathbb{E}[r_{\boldsymbol{\delta}}\{\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\}]$.

The first term vanishes. Writing $Y-m_{\boldsymbol{\delta}}(\boldsymbol{X})=\{Y-\mu(\boldsymbol{X},\boldsymbol{W})\}+\{\mu(\boldsymbol{X},\boldsymbol{W})-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}$, iterated expectation gives

\[
\mathbb{E}\!\big[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{Y-\mu(\boldsymbol{X},\boldsymbol{W})\}\big]=\mathbb{E}\!\Big[\,\mathbb{E}\big[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{Y-\mu(\boldsymbol{X},\boldsymbol{W})\}\mid\boldsymbol{X},\boldsymbol{W}\big]\Big]=0,
\]

and

\[
\begin{aligned}
\mathbb{E}\big[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{\mu(\boldsymbol{X},\boldsymbol{W})-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}\mid\boldsymbol{X}\big]
&=\mathbb{E}[\widehat{r}_{\boldsymbol{\delta}}\mu\mid\boldsymbol{X}]-m_{\boldsymbol{\delta}}(\boldsymbol{X})\,\mathbb{E}[\widehat{r}_{\boldsymbol{\delta}}\mid\boldsymbol{X}]\\
&\quad-\mathbb{E}[r_{\boldsymbol{\delta}}\mu\mid\boldsymbol{X}]+m_{\boldsymbol{\delta}}(\boldsymbol{X})\,\mathbb{E}[r_{\boldsymbol{\delta}}\mid\boldsymbol{X}]\\
&=\frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{X})}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{X})}-\frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{X})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{X})}\\
&\quad-\frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{X})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{X})}\Big(\frac{\nu_{\boldsymbol{\delta}}(\boldsymbol{X})}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{X})}-1\Big)\\
&=0.
\end{aligned}
\]

Hence $R_{2}=-\mathbb{E}[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})(\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}})]$, and the Cauchy–Schwarz inequality yields

\[
\big|R_{2}(\widehat{P},P;\boldsymbol{\delta})\big|\leq\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}.
\]

Theorem 5 (Asymptotic normality at n\sqrt{n} rate).

Assume (A1)–(A3). Let $\boldsymbol{Z}:=(\boldsymbol{X},\boldsymbol{W},Y)$ and suppose $\{\boldsymbol{Z}_{i}\}_{i=1}^{n}$ are i.i.d. draws from $P$. Fix $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|\leq\Delta<\infty$, and use $K$-fold cross-fitting to obtain the nuisance estimators $(\widehat{r}_{\boldsymbol{\delta}},\widehat{m}_{\boldsymbol{\delta}})$. Suppose

r^𝜹r𝜹2=oP(1),m^𝜹m𝜹2=oP(1),\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}=o_{P}(1),\qquad\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}=o_{P}(1),

and the product condition

r^𝜹r𝜹2m^𝜹m𝜹2=oP(n1/2).\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/2}).

A convenient sufficient condition is

r^𝜹r𝜹2=oP(n1/4),m^𝜹m𝜹2=oP(n1/4),\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/4}),\qquad\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/4}),

which can be achieved by suitable cross-fitted learners under standard regularity conditions. Then

n{ψ^(𝜹)ψ(𝜹)}𝒩(0,Var{φψ(𝜹)(𝒁)}).\sqrt{n}\,\big\{\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\big\}\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big).

Consequently, for the incremental effect $\widehat{\theta}(\boldsymbol{\delta})=\widehat{\psi}(\boldsymbol{\delta})-\widehat{\psi}(\boldsymbol{0})$,

\[
\sqrt{n}\,\big\{\widehat{\theta}(\boldsymbol{\delta})-\theta(\boldsymbol{\delta})\big\}\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big),\qquad\varphi_{\theta(\boldsymbol{\delta})}:=\varphi_{\psi(\boldsymbol{\delta})}-\varphi_{\psi(\boldsymbol{0})}.
\]

Proof.

Control of the leading empirical process term. For any 𝜹Δ\|\boldsymbol{\delta}\|\leq\Delta, boundedness of 𝒔(𝑾,𝑿)\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X}) implies there exists τΔ<\tau_{\Delta}<\infty such that

eτΔexp{𝜹𝒔(𝑾,𝑿)}eτΔ,eτΔν𝜹(𝑿)eτΔ,e^{-\tau_{\Delta}}\leq\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\leq e^{\tau_{\Delta}},\qquad e^{-\tau_{\Delta}}\leq\nu_{\boldsymbol{\delta}}(\boldsymbol{X})\leq e^{\tau_{\Delta}},

so that $e^{-2\tau_{\Delta}}\leq r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\leq e^{2\tau_{\Delta}}$. Under (A1), $|Y|\leq M$ almost surely, and hence $|m_{\boldsymbol{\delta}}(\boldsymbol{X})|\leq M$ as well. Therefore

\[
\begin{aligned}
|\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})|
&\leq e^{2\tau_{\Delta}}\,|Y-m_{\boldsymbol{\delta}}(\boldsymbol{X})|+|m_{\boldsymbol{\delta}}(\boldsymbol{X})-\psi(\boldsymbol{\delta})|\\
&\leq 2Me^{2\tau_{\Delta}}+2M<\infty.
\end{aligned}
\]

Since {𝒁i}i=1n\{\boldsymbol{Z}_{i}\}_{i=1}^{n} are i.i.d., the classical Lindeberg–Feller CLT applies and yields

1ni=1nφψ(𝜹)(𝒁i)𝒩(0,Var{φψ(𝜹)(𝒁)}).\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}_{i})\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big).

Control of the estimated influence term. From the definitions,

φ^ψ(𝜹)(𝒁)\displaystyle\widehat{\varphi}_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}) =r^𝜹{Ym^𝜹}+m^𝜹ψ^(𝜹),\displaystyle=\widehat{r}_{\boldsymbol{\delta}}\{Y-\widehat{m}_{\boldsymbol{\delta}}\}+\widehat{m}_{\boldsymbol{\delta}}-\widehat{\psi}(\boldsymbol{\delta}),
φψ(𝜹)(𝒁)\displaystyle\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}) =r𝜹{Ym𝜹}+m𝜹ψ(𝜹),\displaystyle=r_{\boldsymbol{\delta}}\{Y-m_{\boldsymbol{\delta}}\}+m_{\boldsymbol{\delta}}-\psi(\boldsymbol{\delta}),

so that

φ^ψ(𝜹)φψ(𝜹)\displaystyle\widehat{\varphi}_{\psi(\boldsymbol{\delta})}-\varphi_{\psi(\boldsymbol{\delta})} =(r^𝜹r𝜹){Ym𝜹}+(1r^𝜹){m^𝜹m𝜹}{ψ^(𝜹)ψ(𝜹)}.\displaystyle=(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{Y-m_{\boldsymbol{\delta}}\}+\big(1-\widehat{r}_{\boldsymbol{\delta}}\big)\{\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\}-\{\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\}.

Taking PnPP_{n}-P of both sides and noting that (PnP){ψ^(𝜹)ψ(𝜹)}=0(P_{n}-P)\{\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\}=0, we obtain

(PnP){φ^ψ(𝜹)φψ(𝜹)}\displaystyle(P_{n}-P)\big\{\widehat{\varphi}_{\psi(\boldsymbol{\delta})}-\varphi_{\psi(\boldsymbol{\delta})}\big\} =(PnP)[(r^𝜹r𝜹){Ym𝜹}+(1r^𝜹){m^𝜹m𝜹}].\displaystyle=(P_{n}-P)\!\left[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{Y-m_{\boldsymbol{\delta}}\}+\big(1-\widehat{r}_{\boldsymbol{\delta}}\big)\{\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\}\right].

Because the nuisance estimators are cross-fitted, they are independent of the observations on which $P_{n}$ is evaluated. Using this together with $|Y|\leq M$, the finite-$\Delta$ bounds on $r_{\boldsymbol{\delta}}$ and $\widehat{r}_{\boldsymbol{\delta}}$, and the Cauchy–Schwarz inequality, it follows that

\[
(P_{n}-P)\big\{\widehat{\varphi}_{\psi(\boldsymbol{\delta})}-\varphi_{\psi(\boldsymbol{\delta})}\big\}
=O_{P}\!\left(\frac{\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}+\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}}{\sqrt{n}}\right)=o_{P}(n^{-1/2}),
\]

under the assumed $L_{2}$ rates. In particular,

\[
\sqrt{n}\,(P_{n}-P)\big\{\widehat{\varphi}_{\psi(\boldsymbol{\delta})}-\varphi_{\psi(\boldsymbol{\delta})}\big\}
=O_{P}\!\Big(\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}+\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}\Big)=o_{P}(1).
\]

Control of the remainder. By Lemma 10,

|R2(P^,P;𝜹)|r^𝜹r𝜹2m^𝜹m𝜹2,\big|R_{2}(\widehat{P},P;\boldsymbol{\delta})\big|\leq\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2},

and the product condition implies nR2(P^,P;𝜹)=oP(1)\sqrt{n}\,R_{2}(\widehat{P},P;\boldsymbol{\delta})=o_{P}(1).
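The exact identity behind Lemma 10, $R_{2}=-\mathbb{E}[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})(\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}})]$, can be verified by direct enumeration. The sketch below assumes a small hypothetical discrete model (binary $X$, three-valued $W$, $s(w,x)=w$, sums in place of integrals) with arbitrarily perturbed nuisances `f_hat` and `mu_hat`. Truncation is omitted because the identity holds for any positive normalizer that depends only on $X$.

```python
import numpy as np

# Hypothetical discrete toy model: X in {0, 1}, W in {0, 1, 2}, s(w, x) = w.
p_x = np.array([0.4, 0.6])
w_vals = np.array([0.0, 1.0, 2.0])
f = np.array([[0.5, 0.3, 0.2],          # f(w | x = 0)
              [0.2, 0.3, 0.5]])         # f(w | x = 1)
mu = np.array([[1.0, 2.0, 4.0],         # mu(x = 0, w)
               [0.5, 1.5, 2.0]])        # mu(x = 1, w)
f_hat = np.array([[0.4, 0.4, 0.2],      # perturbed density "estimate"
                  [0.3, 0.3, 0.4]])
mu_hat = mu + np.array([[0.2, -0.1, 0.3],
                        [-0.2, 0.1, 0.1]])
delta = 0.7
tilt = np.exp(delta * w_vals)           # shape (3,), broadcasts across x

def tilted_nuisances(dens, reg):
    """Return r(w, x) with shape (2, 3) and m(x) with shape (2,)."""
    nu = (tilt * dens).sum(axis=1)
    eta = (tilt * reg * dens).sum(axis=1)
    return tilt / nu[:, None], eta / nu

def expect(g):
    """E[g(X, W)] under the true law, for g given as a (2, 3) array."""
    return float((p_x[:, None] * f * g).sum())

def l2_norm(g):
    return np.sqrt(expect(g**2))

r, m = tilted_nuisances(f, mu)
r_hat, m_hat = tilted_nuisances(f_hat, mu_hat)

psi = float((p_x * m).sum())
# R_2 = E[r_hat (Y - m_hat) + m_hat] - psi, using E[Y | X, W] = mu.
R2 = expect(r_hat * (mu - m_hat[:, None]) + m_hat[:, None]) - psi
dm = (m_hat - m)[:, None] * np.ones_like(f)     # m_hat - m as a function of (x, w)
cross = expect((r_hat - r) * dm)
bound = l2_norm(r_hat - r) * l2_norm(dm)
```

Here `R2` equals `-cross` up to floating-point error, and `abs(R2)` does not exceed `bound`, which is the Cauchy–Schwarz step of the lemma.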

Conclusion. Combining (13) with the three displays above and applying Slutsky’s theorem yields

n{ψ^(𝜹)ψ(𝜹)}𝒩(0,Var{φψ(𝜹)(𝒁)}).\sqrt{n}\{\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\}\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big).

Moreover, applying the same argument at 𝜹=𝟎\boldsymbol{\delta}=\boldsymbol{0} and using the multivariate Lindeberg–Feller CLT gives the joint convergence

n(ψ^(𝜹)ψ(𝜹),ψ^(𝟎)ψ(𝟎))𝒩(𝟎,Cov(φψ(𝜹)(𝒁),φψ(𝟎)(𝒁))).\sqrt{n}\Big(\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta}),\ \widehat{\psi}(\boldsymbol{0})-\psi(\boldsymbol{0})\Big)\ \rightsquigarrow\ \mathcal{N}\!\Big(\boldsymbol{0},\ \operatorname{Cov}\big(\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}),\ \varphi_{\psi(\boldsymbol{0})}(\boldsymbol{Z})\big)\Big).

Therefore, by the continuous mapping theorem,

\[
\begin{aligned}
\sqrt{n}\big\{\widehat{\theta}(\boldsymbol{\delta})-\theta(\boldsymbol{\delta})\big\}
&=\sqrt{n}\Big(\big\{\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\big\}-\big\{\widehat{\psi}(\boldsymbol{0})-\psi(\boldsymbol{0})\big\}\Big)\\
&\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big).
\end{aligned}
\]

Combining Theorem 5 with Corollary 3 immediately yields the finite-𝜹\boldsymbol{\delta} CLT stated in Theorem 3 in the main text.
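In practice, the limiting normal distribution is typically used to form Wald-type intervals. The sketch below assumes the standard plug-in approach, in which $\mathrm{Var}\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\}$ is estimated by the sample variance of the estimated influence values; this variance estimator is an assumption of the sketch and is not stated in this section. The function name `wald_interval` and the simulated inputs are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def wald_interval(estimate, influence_values, level=0.95):
    """Normal-approximation interval: estimate +/- z * sd(influence values) / sqrt(n)."""
    phi = np.asarray(influence_values, dtype=float)
    se = phi.std(ddof=1) / np.sqrt(phi.size)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * se, estimate + z * se, se

# Illustrative inputs only: simulated stand-ins for the per-observation values of
# r_hat * (Y - m_hat) + m_hat at delta and at 0 (not output from the paper's data).
rng = np.random.default_rng(1)
vals_delta = rng.normal(loc=0.6, scale=2.0, size=5_000)
vals_zero = vals_delta - rng.normal(loc=0.1, scale=0.5, size=5_000)

theta_hat = vals_delta.mean() - vals_zero.mean()
# The influence function of theta is the difference of the two influence functions;
# centering constants do not affect the sample standard deviation.
lower, upper, se = wald_interval(theta_hat, vals_delta - vals_zero)
```

Taking the difference of influence values before computing the variance accounts for the correlation between $\widehat{\psi}(\boldsymbol{\delta})$ and $\widehat{\psi}(\boldsymbol{0})$, in line with the joint convergence used in the proof.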

Appendix E Sensitivity analysis for unmeasured confounding

This section assesses the robustness of our incremental policy estimands to violations of the no-unmeasured-confounding assumption, following the framework of Chernozhukov et al. (2022).

E.1 Setup: long vs. short worlds

Let the observed data be 𝐙i=(𝐗i,𝐖i,Yi)\mathbf{Z}_{i}=(\mathbf{X}_{i},\mathbf{W}_{i},Y_{i}) drawn i.i.d. from P0P_{0}. Assume there exists an unobserved confounder UU such that, with 𝐕:=(𝐗,U)\mathbf{V}:=(\mathbf{X},U), the potential outcomes {Y(𝐰):𝐰𝒲}\{Y(\mathbf{w}):\mathbf{w}\in\mathcal{W}\} satisfy

Y(𝐰)𝐖|𝐕,𝐰𝒲,Y(\mathbf{w})\ \perp\!\!\!\perp\ \mathbf{W}\ \big|\ \mathbf{V},\qquad\forall\mathbf{w}\in\mathcal{W}, (14)

together with SUTVA/consistency.

Define the long and short outcome regressions

μ(𝐯,𝐰):=E[Y𝐕=𝐯,𝐖=𝐰],μs(𝐱,𝐰):=E[Y𝐗=𝐱,𝐖=𝐰].\mu(\mathbf{v},\mathbf{w}):=\operatorname{E}[Y\mid\mathbf{V}=\mathbf{v},\mathbf{W}=\mathbf{w}],\qquad\mu_{s}(\mathbf{x},\mathbf{w}):=\operatorname{E}[Y\mid\mathbf{X}=\mathbf{x},\mathbf{W}=\mathbf{w}].

Let f(𝐰𝐯)f(\mathbf{w}\mid\mathbf{v}) denote the long conditional density of 𝐖\mathbf{W} given 𝐕\mathbf{V}, and f(𝐰𝐱)f(\mathbf{w}\mid\mathbf{x}) the short conditional density of 𝐖\mathbf{W} given 𝐗\mathbf{X} (i.e. the observed conditional law of the exposure mixture given 𝐗\mathbf{X}).

Stochastic intervention fixed from observed data.

We keep the intervention rule identical to Section 2: for any fixed 𝜹q\boldsymbol{\delta}\in\mathbb{R}^{q},

g𝜹(𝐰𝐱)=exp(𝜹𝐰)f(𝐰𝐱)𝒲exp(𝜹𝐯)f(𝐯𝐱)𝑑𝐯.g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{x})=\frac{\exp(\boldsymbol{\delta}^{\top}\mathbf{w})\,f(\mathbf{w}\mid\mathbf{x})}{\int_{\mathcal{W}}\exp(\boldsymbol{\delta}^{\top}\mathbf{v})\,f(\mathbf{v}\mid\mathbf{x})\,d\mathbf{v}}. (15)

Importantly, g𝜹(𝐱)g_{\boldsymbol{\delta}}(\cdot\mid\mathbf{x}) depends only on 𝐱\mathbf{x} and is therefore implementable; it is well-defined regardless of whether UU exists.

Target (long) estimand.

Let $Y^{g_{\boldsymbol{\delta}}}$ denote the counterfactual outcome under the intervention that draws $\mathbf{W}$ from $g_{\boldsymbol{\delta}}(\cdot\mid\mathbf{X})$ (independently of $U$ given $\mathbf{X}$). Under (14), the causal estimand is

\[
\psi(\boldsymbol{\delta}):=\operatorname{E}\big[Y^{g_{\boldsymbol{\delta}}}\big]=\operatorname{E}\Big[\int_{\mathcal{W}}\mu(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}\Big]. \tag{16}
\]

Identified (short) estimand under ignorability given $\mathbf{X}$.

If one incorrectly assumes ignorability given 𝐗\mathbf{X} only, the same g-formula yields the identified estimand

ψs(𝜹):=E[𝒲μs(𝐗,𝐰)g𝜹(𝐰𝐗)𝑑𝐰],\psi_{s}(\boldsymbol{\delta}):=\operatorname{E}\Big[\int_{\mathcal{W}}\mu_{s}(\mathbf{X},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}\Big], (17)

which is the estimand targeted by our estimators.

Our sensitivity analysis bounds the discrepancy ψs(𝜹)ψ(𝜹)\psi_{s}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta}) as a function of interpretable sensitivity parameters.

E.2 Linear functional form and Riesz representers

For a fixed 𝜹\boldsymbol{\delta}, define the functional on square-integrable functions h(𝐯,𝐰)L2(P𝐕,𝐖)h(\mathbf{v},\mathbf{w})\in L_{2}(P_{\mathbf{V},\mathbf{W}}):

𝒯𝜹(h):=E[𝒲h(𝐕,𝐰)g𝜹(𝐰𝐗)𝑑𝐰].\mathcal{T}_{\boldsymbol{\delta}}(h):=\operatorname{E}\Big[\int_{\mathcal{W}}h(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}\Big]. (18)

Then ψ(𝜹)=𝒯𝜹(μ)\psi(\boldsymbol{\delta})=\mathcal{T}_{\boldsymbol{\delta}}(\mu) and ψs(𝜹)=𝒯𝜹(μs)\psi_{s}(\boldsymbol{\delta})=\mathcal{T}_{\boldsymbol{\delta}}(\mu_{s}).

Proposition 6.

The functional 𝒯𝛅(h)\mathcal{T}_{\boldsymbol{\delta}}(h) is a linear functional.

Proof.

To establish linearity, we must verify that $\mathcal{T}_{\boldsymbol{\delta}}$ satisfies both additivity and homogeneity. Let $h,h_{1},h_{2}\in L_{2}(P_{\mathbf{V},\mathbf{W}})$ be arbitrary functions and let $c\in\mathbb{R}$ be a scalar constant.

1. Additivity.

By the linearity of the Lebesgue integral and of the expectation operator, we have:

\[
\begin{aligned}
\mathcal{T}_{\boldsymbol{\delta}}(h_{1}+h_{2})
&=\operatorname{E}\left[\int_{\mathcal{W}}(h_{1}+h_{2})(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}\right]\\
&=\operatorname{E}\left[\int_{\mathcal{W}}\big(h_{1}(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})+h_{2}(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\big)\,d\mathbf{w}\right]\\
&=\operatorname{E}\left[\int_{\mathcal{W}}h_{1}(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}+\int_{\mathcal{W}}h_{2}(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}\right]\\
&=\operatorname{E}\left[\int_{\mathcal{W}}h_{1}(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}\right]+\operatorname{E}\left[\int_{\mathcal{W}}h_{2}(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}\right]\\
&=\mathcal{T}_{\boldsymbol{\delta}}(h_{1})+\mathcal{T}_{\boldsymbol{\delta}}(h_{2}).
\end{aligned}
\]

2. Homogeneity.

\[
\begin{aligned}
\mathcal{T}_{\boldsymbol{\delta}}(c\cdot h)
&=\operatorname{E}\left[\int_{\mathcal{W}}c\cdot h(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}\right]\\
&=\operatorname{E}\left[c\int_{\mathcal{W}}h(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}\right]\\
&=c\,\operatorname{E}\left[\int_{\mathcal{W}}h(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}\right]\\
&=c\cdot\mathcal{T}_{\boldsymbol{\delta}}(h).
\end{aligned}
\]

Since both conditions are satisfied, 𝒯𝜹\mathcal{T}_{\boldsymbol{\delta}} is a linear functional. ∎

Weak overlap / continuity condition.

Assume $\mathcal{T}_{\boldsymbol{\delta}}$ is continuous on $L_{2}(P_{\mathbf{V},\mathbf{W}})$; a sufficient condition is the “weak overlap” requirement

E[(g𝜹(𝐖𝐗)f(𝐖𝐕))2]<.\operatorname{E}\!\left[\left(\frac{g_{\boldsymbol{\delta}}(\mathbf{W}\mid\mathbf{X})}{f(\mathbf{W}\mid\mathbf{V})}\right)^{2}\right]<\infty. (19)

This condition is untestable without UU, but it is the standard integrability requirement needed for the Riesz representation.

Riesz representation.

Under (19), by the Riesz–Fréchet representation theorem there exists a unique (long) Riesz representer $\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W})\in L_{2}(P_{\mathbf{V},\mathbf{W}})$ such that

$$\mathcal{T}_{\boldsymbol{\delta}}(h)=\operatorname{E}\big[h(\mathbf{V},\mathbf{W})\,\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W})\big],\quad\forall h\in L_{2}(P_{\mathbf{V},\mathbf{W}}).$$

Moreover, the representer has the Radon–Nikodym form

$$\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W})=\frac{g_{\boldsymbol{\delta}}(\mathbf{W}\mid\mathbf{X})}{f(\mathbf{W}\mid\mathbf{V})}. \tag{20}$$

Likewise, the short Riesz representer for $\mathcal{T}_{\boldsymbol{\delta}}$ over $L_{2}(P_{\mathbf{X},\mathbf{W}})$ is

$$\alpha_{s,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W})=\frac{g_{\boldsymbol{\delta}}(\mathbf{W}\mid\mathbf{X})}{f(\mathbf{W}\mid\mathbf{X})}=\operatorname{E}\big[\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W})\mid\mathbf{X},\mathbf{W}\big]. \tag{21}$$

Under the exponential tilt (15), $\alpha_{s,\boldsymbol{\delta}}$ simplifies to

$$\alpha_{s,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W})=\frac{\exp(\boldsymbol{\delta}^{\top}\mathbf{W})}{\nu_{\boldsymbol{\delta}}(\mathbf{X})},\qquad\nu_{\boldsymbol{\delta}}(\mathbf{x}):=\operatorname{E}\big[\exp(\boldsymbol{\delta}^{\top}\mathbf{W})\mid\mathbf{X}=\mathbf{x}\big]. \tag{22}$$

Thus our policy estimand is a continuous linear functional of the long regression $\mu$ with a well-defined RR.
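As a rough illustration of the closed form (22), the sketch below computes a plug-in short RR in the simplified case where the covariate is a single discrete group label, so that $\nu_{\boldsymbol{\delta}}(\mathbf{x})$ can be estimated by a within-group sample mean. The function name `short_rr_exponential_tilt`, the discrete-covariate simplification, and the simulated data are assumptions made for illustration only; they are not the estimator used in the paper.

```python
import numpy as np

def short_rr_exponential_tilt(W, groups, delta):
    """Plug-in version of (22): exp(delta'W) / nu_delta(X), with X a discrete label."""
    W = np.asarray(W, dtype=float)
    groups = np.asarray(groups)
    tilt = np.exp(W @ np.asarray(delta, dtype=float))  # exp(delta^T W_i)
    nu_hat = np.empty_like(tilt)
    for g in np.unique(groups):
        mask = groups == g
        # Within-group mean estimates E[exp(delta^T W) | X = g] (illustrative choice).
        nu_hat[mask] = tilt[mask].mean()
    return tilt / nu_hat

# Simulated (hypothetical) data: three covariate groups, two exposures.
rng = np.random.default_rng(0)
n = 500
groups = rng.integers(0, 3, size=n)
W = rng.normal(size=(n, 2)) + groups[:, None]
alpha_s = short_rr_exponential_tilt(W, groups, delta=[0.2, -0.1])
```

By construction these plug-in weights average to one within each group, mirroring $\operatorname{E}[\alpha_{s,\boldsymbol{\delta}}\mid\mathbf{X}]=1$, and they reduce to the constant one when $\boldsymbol{\delta}=\mathbf{0}$.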

E.3 Exact OVB identity and sharp bound

Define the outcome regression error and the RR error:

$$\Delta_{\mu}(\mathbf{V},\mathbf{W}):=\mu(\mathbf{V},\mathbf{W})-\mu_{s}(\mathbf{X},\mathbf{W}),\qquad\Delta_{\alpha}(\mathbf{V},\mathbf{W}):=\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W})-\alpha_{s,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W}).$$

Then the omitted-variable bias satisfies the exact identity

$$\psi_{s}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})=\operatorname{E}\big[\Delta_{\mu}(\mathbf{V},\mathbf{W})\,\Delta_{\alpha}(\mathbf{V},\mathbf{W})\big]. \tag{23}$$

By Cauchy–Schwarz,

$$\big|\psi_{s}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\big|\leq B(\boldsymbol{\delta}):=\sqrt{\operatorname{E}[\Delta_{\mu}^{2}]}\;\sqrt{\operatorname{E}[\Delta_{\alpha}^{2}]}. \tag{24}$$

It is often useful to isolate the “degree of adversity”

$$\varrho(\boldsymbol{\delta}):=\operatorname{Corr}\!\big(\Delta_{\mu}(\mathbf{V},\mathbf{W}),\Delta_{\alpha}(\mathbf{V},\mathbf{W})\big)\in[-1,1],$$

so that (23) yields $|\psi_{s}-\psi|^{2}=\varrho(\boldsymbol{\delta})^{2}\,B(\boldsymbol{\delta})^{2}\leq B(\boldsymbol{\delta})^{2}$. In our primary analysis and reported bounds, we focus on the worst-case scenario by considering adversarial confounding, which implicitly sets $|\varrho(\boldsymbol{\delta})|=1$. This allows us to establish a conservative bound and subsequently omit the correlation term from our final operational formulas.

E.4 Reparameterization by interpretable partial $R^{2}$

Following the general theory, the bound can be rewritten as a product of an identifiable scale and two sensitivity parameters with partial-$R^{2}$ interpretations.

Identifiable scale.

Let $\sigma_{s}^{2}:=\operatorname{E}\big[(Y-\mu_{s}(\mathbf{X},\mathbf{W}))^{2}\big]$ and $\nu_{s}^{2}(\boldsymbol{\delta}):=\operatorname{E}\big[\alpha_{s,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W})^{2}\big]$. Define

$$S(\boldsymbol{\delta})^{2}:=\sigma_{s}^{2}\;\nu_{s}^{2}(\boldsymbol{\delta}), \tag{25}$$

which depends only on the observed-data law of $(Y,\mathbf{W},\mathbf{X})$ and is therefore estimable.

Outcome-side sensitivity (partial $R^{2}$).

Define

$$C_{Y}^{2}:=\frac{\operatorname{E}[\Delta_{\mu}(\mathbf{V},\mathbf{W})^{2}]}{\operatorname{E}[(Y-\mu_{s}(\mathbf{X},\mathbf{W}))^{2}]}=\frac{\operatorname{Var}\!\big(\operatorname{E}[Y\mid\mathbf{X},\mathbf{W},U]\big)-\operatorname{Var}\!\big(\operatorname{E}[Y\mid\mathbf{X},\mathbf{W}]\big)}{\operatorname{Var}(Y)-\operatorname{Var}\!\big(\operatorname{E}[Y\mid\mathbf{X},\mathbf{W}]\big)}\in[0,1], \tag{26}$$

i.e. the nonparametric partial $R^{2}$ of $U$ with $Y$ given $(\mathbf{X},\mathbf{W})$.

Exposure/RR-side sensitivity.

Because $\alpha_{s,\boldsymbol{\delta}}$ is the $L_{2}$ projection of $\alpha_{\boldsymbol{\delta}}$ onto $(\mathbf{X},\mathbf{W})$, we have $\operatorname{E}[\Delta_{\alpha}^{2}]=\operatorname{E}[\alpha_{\boldsymbol{\delta}}^{2}]-\operatorname{E}[\alpha_{s,\boldsymbol{\delta}}^{2}]$. Define the (RR-space) $R^{2}$:

$$R_{\alpha}^{2}(\boldsymbol{\delta}):=\operatorname{Corr}^{2}\!\big(\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W}),\alpha_{s,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W})\big)=\frac{\operatorname{E}[\alpha_{s,\boldsymbol{\delta}}^{2}]}{\operatorname{E}[\alpha_{\boldsymbol{\delta}}^{2}]}\in(0,1],$$

and let

$$C_{D}^{2}(\boldsymbol{\delta}):=\frac{\operatorname{E}[\alpha_{\boldsymbol{\delta}}^{2}]-\operatorname{E}[\alpha_{s,\boldsymbol{\delta}}^{2}]}{\operatorname{E}[\alpha_{s,\boldsymbol{\delta}}^{2}]}=\frac{1-R_{\alpha}^{2}(\boldsymbol{\delta})}{R_{\alpha}^{2}(\boldsymbol{\delta})}. \tag{27}$$

Thus $1-R_{\alpha}^{2}(\boldsymbol{\delta})$ is the fraction of RR variation generated by the latent confounder.

Final bound in $R^{2}$ form.

Combining (24)–(27) yields

$$B(\boldsymbol{\delta})^{2}=S(\boldsymbol{\delta})^{2}\,C_{Y}^{2}\,C_{D}^{2}(\boldsymbol{\delta}),\qquad\big|\psi_{s}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\big|\leq S(\boldsymbol{\delta})\,C_{Y}\,C_{D}(\boldsymbol{\delta}). \tag{28}$$

Therefore, for any posited $(C_{Y}^{2},R_{\alpha}^{2}(\boldsymbol{\delta}))$, we obtain the sensitivity interval

$$\psi(\boldsymbol{\delta})\in\Big[\;\psi_{s}(\boldsymbol{\delta})-S(\boldsymbol{\delta})C_{Y}C_{D}(\boldsymbol{\delta}),\;\psi_{s}(\boldsymbol{\delta})+S(\boldsymbol{\delta})C_{Y}C_{D}(\boldsymbol{\delta})\;\Big]. \tag{29}$$
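For readers who want the intermediate algebra behind (28), the factorization can be spelled out as follows. The display uses only the definitions in (24)–(27) and the projection identity for $\operatorname{E}[\Delta_{\alpha}^{2}]$ stated above; no additional assumptions are introduced.

```latex
\begin{aligned}
\operatorname{E}[\Delta_{\mu}^{2}]
  &= C_{Y}^{2}\,\operatorname{E}\big[(Y-\mu_{s}(\mathbf{X},\mathbf{W}))^{2}\big]
   = C_{Y}^{2}\,\sigma_{s}^{2}, \\
\operatorname{E}[\Delta_{\alpha}^{2}]
  &= \operatorname{E}[\alpha_{\boldsymbol{\delta}}^{2}]-\operatorname{E}[\alpha_{s,\boldsymbol{\delta}}^{2}]
   = C_{D}^{2}(\boldsymbol{\delta})\,\operatorname{E}[\alpha_{s,\boldsymbol{\delta}}^{2}]
   = C_{D}^{2}(\boldsymbol{\delta})\,\nu_{s}^{2}(\boldsymbol{\delta}), \\
B(\boldsymbol{\delta})^{2}
  &= \operatorname{E}[\Delta_{\mu}^{2}]\,\operatorname{E}[\Delta_{\alpha}^{2}]
   = \sigma_{s}^{2}\,\nu_{s}^{2}(\boldsymbol{\delta})\,C_{Y}^{2}\,C_{D}^{2}(\boldsymbol{\delta})
   = S(\boldsymbol{\delta})^{2}\,C_{Y}^{2}\,C_{D}^{2}(\boldsymbol{\delta}).
\end{aligned}
```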

E.5 Incremental effect

If the scientific target is the incremental effect contrast $\theta(\boldsymbol{\delta}):=\psi(\boldsymbol{\delta})-\psi(\mathbf{0})$, then the difference of continuous linear functionals is again a continuous linear functional. The same analysis applies with the RR replaced by the RR contrast:

$$\alpha_{\theta,\boldsymbol{\delta}}(\mathbf{V},\mathbf{W}):=\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W})-\alpha_{\mathbf{0}}(\mathbf{V},\mathbf{W}),\qquad\alpha_{s,\theta,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W}):=\alpha_{s,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W})-\alpha_{s,\mathbf{0}}(\mathbf{X},\mathbf{W}),$$

and $\mu$ unchanged. One obtains an interval for $\theta(\boldsymbol{\delta})$ by applying (29) to $\theta$.

E.6 Implementation

For each fixed $\boldsymbol{\delta}$ considered in the paper, the sensitivity analysis requires three estimated quantities:

  1. $\widehat{\theta}(\boldsymbol{\delta})$: our EIF-based (cross-fitted) estimator of the incremental effect from Section 3.

  2. $\widehat{\sigma}_{s}^{2}:=n^{-1}\sum_{i=1}^{n}\{Y_{i}-\widehat{\mu}_{s}(\mathbf{X}_{i},\mathbf{W}_{i})\}^{2}$ using the same cross-fitted $\widehat{\mu}_{s}$ as in the main estimator.

  3. $\widehat{\nu}_{s,\theta}^{2}(\boldsymbol{\delta}):=n^{-1}\sum_{i=1}^{n}\widehat{\alpha}_{s,\theta,\boldsymbol{\delta}}(\mathbf{X}_{i},\mathbf{W}_{i})^{2}$, where $\widehat{\alpha}_{s,\theta,\boldsymbol{\delta}}=\widehat{\alpha}_{s,\boldsymbol{\delta}}-\widehat{\alpha}_{s,\boldsymbol{0}}=\widehat{\alpha}_{s,\boldsymbol{\delta}}-1$, and $\widehat{\alpha}_{s,\boldsymbol{\delta}}$ is obtained either as the estimated density ratio $\widehat{g}_{\boldsymbol{\delta}}/\widehat{f}$ or via the closed form (22) by estimating $\nu_{\boldsymbol{\delta}}(\mathbf{X})=\operatorname{E}[\exp(\boldsymbol{\delta}^{\top}\mathbf{W})\mid\mathbf{X}]$ with regression.

Then $\widehat{S}(\boldsymbol{\delta}):=\sqrt{\widehat{\sigma}_{s}^{2}\,\widehat{\nu}_{s,\theta}^{2}(\boldsymbol{\delta})}$.

Finally, for any posited sensitivity parameters

$$\eta_{Y}^{2}:=C_{Y}^{2}\in[0,1],\qquad\eta_{\alpha}^{2}(\boldsymbol{\delta}):=1-R_{\alpha}^{2}(\boldsymbol{\delta})\in[0,1),$$

the bias bound becomes

$$\widehat{B}(\boldsymbol{\delta};\eta_{Y}^{2},\eta_{\alpha}^{2})=\widehat{S}(\boldsymbol{\delta})\,\sqrt{\eta_{Y}^{2}}\,\sqrt{\frac{\eta_{\alpha}^{2}(\boldsymbol{\delta})}{1-\eta_{\alpha}^{2}(\boldsymbol{\delta})}}.$$

Accordingly, the estimated identified set for the incremental effect (the point bounds) is

$$\theta(\boldsymbol{\delta})\in\Big[\,\widehat{\theta}(\boldsymbol{\delta})-\widehat{B}(\boldsymbol{\delta};\eta_{Y}^{2},\eta_{\alpha}^{2}),\;\widehat{\theta}(\boldsymbol{\delta})+\widehat{B}(\boldsymbol{\delta};\eta_{Y}^{2},\eta_{\alpha}^{2})\Big].$$
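The bias bound and the resulting point bounds are simple to compute once the three estimated quantities are in hand. The following is a minimal sketch; the function names `bias_bound` and `identified_set` and the numeric inputs are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def bias_bound(S_hat, eta_y2, eta_alpha2):
    """B-hat = S-hat * sqrt(eta_Y^2) * sqrt(eta_alpha^2 / (1 - eta_alpha^2))."""
    if not 0.0 <= eta_y2 <= 1.0:
        raise ValueError("eta_y2 must lie in [0, 1]")
    if not 0.0 <= eta_alpha2 < 1.0:
        raise ValueError("eta_alpha2 must lie in [0, 1)")
    return S_hat * math.sqrt(eta_y2) * math.sqrt(eta_alpha2 / (1.0 - eta_alpha2))

def identified_set(theta_hat, S_hat, eta_y2, eta_alpha2):
    """Point bounds theta_hat -/+ B-hat for the incremental effect."""
    b = bias_bound(S_hat, eta_y2, eta_alpha2)
    return theta_hat - b, theta_hat + b

# Hypothetical inputs: S-hat = 2, eta_Y^2 = 0.04, eta_alpha^2 = 0.5 give B-hat = 0.4.
lower, upper = identified_set(theta_hat=-0.5, S_hat=2.0, eta_y2=0.04, eta_alpha2=0.5)
# lower and upper are approximately -0.9 and -0.1
```

Note that the bound grows without limit as `eta_alpha2` approaches one, which is why that parameter is restricted to $[0,1)$.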

When $\widehat{\theta}(\boldsymbol{\delta})\neq 0$, the ratio

$$\frac{\widehat{B}(\boldsymbol{\delta};\eta_{Y}^{2},\eta_{\alpha}^{2})}{|\widehat{\theta}(\boldsymbol{\delta})|}$$

compares the benchmarked bias half-width with the magnitude of the estimated incremental effect. Table 2 reports this ratio for the main-text benchmark with $k_{Y}=1$ and $k_{D}=1$ over the 10 positive Gelbrich targets on each of the 9 intervention paths. Values below one indicate that the benchmarked bias half-width is smaller than the magnitude of the estimated effect, so the point bounds exclude zero; values above one indicate that the half-width exceeds the estimated effect itself.

Table 2: Ratio $\widehat{B}(\boldsymbol{\delta})/|\widehat{\theta}(\boldsymbol{\delta})|$ under the formal benchmark with $k_{Y}=1$ and $k_{D}=1$. Columns correspond to the positive Gelbrich targets used in the application.

| Scenario | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 | 0.35 | 0.40 | 0.45 | 0.50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BC | 0.22 | 0.23 | 0.25 | 0.27 | 0.28 | 0.30 | 0.32 | 0.33 | 0.35 | 0.37 |
| NO3 | 0.42 | 0.62 | 0.60 | 0.53 | 0.47 | 0.41 | 0.36 | 0.32 | 0.29 | 0.26 |
| OM | 0.21 | 0.24 | 0.26 | 0.28 | 0.30 | 0.33 | 0.35 | 0.37 | 0.39 | 0.41 |
| SO4 | 0.10 | 0.10 | 0.09 | 0.08 | 0.07 | 0.07 | 0.06 | 0.06 | 0.05 | 0.05 |
| NH4 | 0.13 | 0.18 | 0.21 | 0.21 | 0.20 | 0.19 | 0.17 | 0.16 | 0.15 | 0.14 |
| BC+OM | 0.84 | 0.98 | 1.24 | 1.72 | 2.67 | 3.73 | 7.09 | 50.85 | 12.16 | 12.16 |
| NO3+SO4+NH4 | 0.17 | 0.19 | 0.22 | 0.27 | 0.32 | 0.35 | 0.36 | 0.35 | 0.32 | 0.32 |
| All | 0.07 | 0.08 | 0.10 | 0.12 | 0.14 | 0.17 | 0.22 | 0.24 | 0.25 | 0.25 |
| BFGS | 0.03 | 0.21 | 0.23 | 0.20 | 0.16 | 0.09 | 0.08 | 0.07 | 0.07 | 0.07 |

Confidence bounds for the estimated endpoints.

The empirical results in the main text report confidence bounds for the two estimated endpoints above. Let $\widehat{\varphi}_{\theta(\boldsymbol{\delta})}(\mathbf{Z}_{i})$ denote the cross-fitted EIF contribution of $\widehat{\theta}(\boldsymbol{\delta})$. Define the centered plugin signals

$$\begin{aligned}
\widehat{\varphi}_{\sigma,i} &:=\{Y_{i}-\widehat{\mu}_{s}(\mathbf{X}_{i},\mathbf{W}_{i})\}^{2}-\widehat{\sigma}_{s}^{2},\\
\widehat{\varphi}_{\nu,i}(\boldsymbol{\delta}) &:=\widehat{\alpha}_{s,\theta,\boldsymbol{\delta}}(\mathbf{X}_{i},\mathbf{W}_{i})^{2}-\widehat{\nu}_{s,\theta}^{2}(\boldsymbol{\delta}),
\end{aligned}$$

and write

$$\lambda(\boldsymbol{\delta}):=\sqrt{\eta_{Y}^{2}}\,\sqrt{\frac{\eta_{\alpha}^{2}(\boldsymbol{\delta})}{1-\eta_{\alpha}^{2}(\boldsymbol{\delta})}}.$$

The delta method gives

$$\begin{aligned}
\widehat{\varphi}_{S,i}(\boldsymbol{\delta}) &:=\frac{\widehat{\nu}_{s,\theta}^{2}(\boldsymbol{\delta})\,\widehat{\varphi}_{\sigma,i}+\widehat{\sigma}_{s}^{2}\,\widehat{\varphi}_{\nu,i}(\boldsymbol{\delta})}{2\,\widehat{S}(\boldsymbol{\delta})},\\
\widehat{\varphi}_{-,i}(\boldsymbol{\delta}) &:=\widehat{\varphi}_{\theta(\boldsymbol{\delta})}(\mathbf{Z}_{i})-\lambda(\boldsymbol{\delta})\,\widehat{\varphi}_{S,i}(\boldsymbol{\delta}),\\
\widehat{\varphi}_{+,i}(\boldsymbol{\delta}) &:=\widehat{\varphi}_{\theta(\boldsymbol{\delta})}(\mathbf{Z}_{i})+\lambda(\boldsymbol{\delta})\,\widehat{\varphi}_{S,i}(\boldsymbol{\delta}).
\end{aligned}$$

Writing

$$\begin{aligned}
\widehat{\theta}_{-}(\boldsymbol{\delta}) &:=\widehat{\theta}(\boldsymbol{\delta})-\widehat{B}(\boldsymbol{\delta};\eta_{Y}^{2},\eta_{\alpha}^{2}),\\
\widehat{\theta}_{+}(\boldsymbol{\delta}) &:=\widehat{\theta}(\boldsymbol{\delta})+\widehat{B}(\boldsymbol{\delta};\eta_{Y}^{2},\eta_{\alpha}^{2}),
\end{aligned}$$

the corresponding standard errors are

$$\begin{aligned}
\widehat{\mathrm{se}}_{-}(\boldsymbol{\delta}) &:=\sqrt{\frac{1}{n^{2}}\sum_{i=1}^{n}\widehat{\varphi}_{-,i}(\boldsymbol{\delta})^{2}},\\
\widehat{\mathrm{se}}_{+}(\boldsymbol{\delta}) &:=\sqrt{\frac{1}{n^{2}}\sum_{i=1}^{n}\widehat{\varphi}_{+,i}(\boldsymbol{\delta})^{2}}.
\end{aligned}$$

The sensitivity-adjusted 95% interval is obtained by combining a lower confidence bound for $\widehat{\theta}_{-}(\boldsymbol{\delta})$ and an upper confidence bound for $\widehat{\theta}_{+}(\boldsymbol{\delta})$:

$$\left[\widehat{\theta}_{-}(\boldsymbol{\delta})-z_{0.95}\,\widehat{\mathrm{se}}_{-}(\boldsymbol{\delta}),\;\widehat{\theta}_{+}(\boldsymbol{\delta})+z_{0.95}\,\widehat{\mathrm{se}}_{+}(\boldsymbol{\delta})\right].$$
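The endpoint construction above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function name `endpoint_confidence_bounds` and its argument names are hypothetical, and the inputs are assumed to be already computed per-observation arrays, namely the centered EIF contributions `phi_theta`, the squared residuals `sq_resid` $=\{Y_{i}-\widehat{\mu}_{s}\}^{2}$, and the squared RR contrasts `sq_alpha` $=\widehat{\alpha}_{s,\theta,\boldsymbol{\delta}}^{2}$.

```python
import numpy as np
from statistics import NormalDist

def endpoint_confidence_bounds(theta_hat, phi_theta, sq_resid, sq_alpha, lam, level=0.95):
    """Sensitivity-adjusted interval from delta-method influence terms.

    lam is lambda(delta) = sqrt(eta_Y^2) * sqrt(eta_alpha^2 / (1 - eta_alpha^2)).
    """
    phi_theta = np.asarray(phi_theta, dtype=float)
    sq_resid = np.asarray(sq_resid, dtype=float)
    sq_alpha = np.asarray(sq_alpha, dtype=float)
    n = phi_theta.size

    sigma2 = sq_resid.mean()           # sigma-hat_s^2
    nu2 = sq_alpha.mean()              # nu-hat_{s,theta}^2
    S = np.sqrt(sigma2 * nu2)          # S-hat; assumed strictly positive

    phi_sigma = sq_resid - sigma2      # centered plug-in signals
    phi_nu = sq_alpha - nu2
    phi_S = (nu2 * phi_sigma + sigma2 * phi_nu) / (2.0 * S)

    phi_minus = phi_theta - lam * phi_S
    phi_plus = phi_theta + lam * phi_S
    se_minus = np.sqrt(np.sum(phi_minus ** 2)) / n
    se_plus = np.sqrt(np.sum(phi_plus ** 2)) / n

    z = NormalDist().inv_cdf(level)    # z_{0.95} when level = 0.95
    theta_minus = theta_hat - lam * S
    theta_plus = theta_hat + lam * S
    return theta_minus - z * se_minus, theta_plus + z * se_plus
```

Setting `lam=0` recovers an ordinary Wald-type interval based on the EIF alone (with critical value $z_{0.95}$), which is a convenient sanity check.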
Selection of sensitivity parameters.

The parameters $\eta_{Y}^{2}$ and $\eta_{\alpha}^{2}(\boldsymbol{\delta})$ have direct interpretations: $\eta_{Y}^{2}$ is the maximal fraction of residual outcome variance explainable by $U$ given $(\mathbf{X},\mathbf{W})$, and $\eta_{\alpha}^{2}(\boldsymbol{\delta})$ is the maximal fraction of RR variation explainable by $U$ for the policy indexed by $\boldsymbol{\delta}$. These can be calibrated by subject-matter knowledge and by benchmarking against observed covariates, and then used in the bound and the endpoint confidence bounds above.

E.7 Formal benchmarking and calibration

Following Chernozhukov et al. (2022), we calibrate the sensitivity parameters on the $f^{2}$ scale induced by nested linear projections. This construction translates the observed contribution of a single covariate $X_{j}$ into a benchmark that is commensurate with the omitted-variable calibration in the bias bound. We carry out this comparison for each of the 22 observed covariates.

Outcome-side benchmark.

For each observed covariate $X_{j}$, let

$$\begin{aligned}
\widehat{\sigma}_{s}^{2} &:=\min_{a,\boldsymbol{b},\boldsymbol{c}}\frac{1}{n}\sum_{i=1}^{n}\Big\{Y_{i}-a-\boldsymbol{b}^{\top}\mathbf{W}_{i}-\boldsymbol{c}^{\top}\mathbf{X}_{i}\Big\}^{2},\\
\widehat{\sigma}_{s,-j}^{2} &:=\min_{a,\boldsymbol{b},\boldsymbol{c}}\frac{1}{n}\sum_{i=1}^{n}\Big\{Y_{i}-a-\boldsymbol{b}^{\top}\mathbf{W}_{i}-\boldsymbol{c}^{\top}\mathbf{X}_{i,-j}\Big\}^{2},
\end{aligned}$$

where $\mathbf{X}_{i,-j}$ denotes the observed covariate vector with $X_{ij}$ removed. The associated benchmark statistics are

$$\begin{aligned}
\widehat{\eta}_{Y,j}^{2} &:=\frac{\widehat{\sigma}_{s,-j}^{2}-\widehat{\sigma}_{s}^{2}}{\widehat{\sigma}_{s,-j}^{2}},\\
\widehat{f}_{Y,j}^{2} &:=\frac{\widehat{\eta}_{Y,j}^{2}}{1-\widehat{\eta}_{Y,j}^{2}}=\frac{\widehat{\sigma}_{s,-j}^{2}-\widehat{\sigma}_{s}^{2}}{\widehat{\sigma}_{s}^{2}}.
\end{aligned}$$

Thus $\widehat{\eta}_{Y,j}^{2}$ is the observed partial $R^{2}$ for $X_{j}$ after adjusting for $(\boldsymbol{W},\boldsymbol{X}_{-j})$, and $\widehat{f}_{Y,j}^{2}$ expresses the same gain relative to the residual variation in the full projection. Table 3 reports these quantities for all 22 covariates. The largest outcome-side benchmark is White, with $\widehat{\eta}_{Y,j}^{2}=0.0642$ and $\widehat{f}_{Y,j}^{2}=0.0686$.
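The nested-projection benchmark can be sketched with ordinary least squares. The helper names `residual_mse` and `outcome_benchmark` and the simulated data are assumptions for illustration; only the two formulas for $\widehat{\eta}_{Y,j}^{2}$ and $\widehat{f}_{Y,j}^{2}$ are taken from the text.

```python
import numpy as np

def residual_mse(y, Z):
    """Mean squared residual from the least-squares projection of y on [1, Z]."""
    design = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.mean((y - design @ coef) ** 2))

def outcome_benchmark(y, W, X, j):
    """Observed partial R^2 and f^2 for covariate j, given (W, X without column j)."""
    sigma2_full = residual_mse(y, np.column_stack([W, X]))
    sigma2_drop = residual_mse(y, np.column_stack([W, np.delete(X, j, axis=1)]))
    eta2 = (sigma2_drop - sigma2_full) / sigma2_drop
    f2 = (sigma2_drop - sigma2_full) / sigma2_full
    return eta2, f2

# Simulated (hypothetical) data: covariate 0 matters for y, covariate 2 does not.
rng = np.random.default_rng(0)
n = 400
W = rng.normal(size=(n, 2))
X = rng.normal(size=(n, 3))
y = W @ np.array([1.0, -1.0]) + 2.0 * X[:, 0] + rng.normal(size=n)
eta2_strong, f2_strong = outcome_benchmark(y, W, X, j=0)
eta2_weak, f2_weak = outcome_benchmark(y, W, X, j=2)
```

The same two helpers would apply to the RR-side benchmark if `y` were replaced by the fitted short RR and `W` were dropped from the design, matching the construction described for that side.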

RR-side benchmark.

For the RR-side, we first evaluate the fitted short RR $\widehat{\alpha}_{s,\boldsymbol{\delta}}(\mathbf{X}_{i},\mathbf{W}_{i})$ at every intervention target retained in the application analysis. For each $X_{j}$ and each target $\boldsymbol{\delta}$, we then compute the nested linear projections

$$\begin{aligned}
\widehat{r}_{\alpha,\mathrm{full},j}^{2}(\boldsymbol{\delta}) &:=\min_{a,\boldsymbol{b}}\frac{1}{n}\sum_{i=1}^{n}\Big\{\widehat{\alpha}_{s,\boldsymbol{\delta}}(\mathbf{X}_{i},\mathbf{W}_{i})-a-\boldsymbol{b}^{\top}\mathbf{X}_{i}\Big\}^{2},\\
\widehat{r}_{\alpha,\mathrm{red},j}^{2}(\boldsymbol{\delta}) &:=\min_{a,\boldsymbol{b}}\frac{1}{n}\sum_{i=1}^{n}\Big\{\widehat{\alpha}_{s,\boldsymbol{\delta}}(\mathbf{X}_{i},\mathbf{W}_{i})-a-\boldsymbol{b}^{\top}\mathbf{X}_{i,-j}\Big\}^{2}.
\end{aligned}$$

The corresponding pointwise benchmark statistics are

$$\begin{aligned}
\widehat{\eta}_{\alpha,j}^{2}(\boldsymbol{\delta}) &:=\frac{\widehat{r}_{\alpha,\mathrm{red},j}^{2}(\boldsymbol{\delta})-\widehat{r}_{\alpha,\mathrm{full},j}^{2}(\boldsymbol{\delta})}{\widehat{r}_{\alpha,\mathrm{red},j}^{2}(\boldsymbol{\delta})},\\
\widehat{f}_{\alpha,j}^{2}(\boldsymbol{\delta}) &:=\frac{\widehat{\eta}_{\alpha,j}^{2}(\boldsymbol{\delta})}{1-\widehat{\eta}_{\alpha,j}^{2}(\boldsymbol{\delta})}=\frac{\widehat{r}_{\alpha,\mathrm{red},j}^{2}(\boldsymbol{\delta})-\widehat{r}_{\alpha,\mathrm{full},j}^{2}(\boldsymbol{\delta})}{\widehat{r}_{\alpha,\mathrm{full},j}^{2}(\boldsymbol{\delta})}.
\end{aligned}$$

This is the direct analogue of the outcome-side construction, with the fitted short RR taking the place of the observed outcome.

The benchmark reported in the main text is attached to an intervention path rather than to a single target. For each scenario, we therefore rank the 22 covariates by the average of $\widehat{f}_{\alpha,j}^{2}(\boldsymbol{\delta})$ over the positive Gelbrich targets on that path. This average is the RR-side score used in the formal benchmark. It summarizes how much adding $X_{j}$ improves the linear approximation to the fitted short RR along the displayed path, and it keeps the benchmark tied to an intervention curve that is actually reported in the application. Table 4 reports these scenario-level mean $\widehat{f}_{\alpha,j}^{2}$ values for all 22 covariates.

With this aggregation, the selected RR-side benchmark covariates are Housing More People Units for BC, OM, and BC+OM; Cigarette Smoking for NO3, NH4, NO3+SO4+NH4, and All; Households Smartphone for SO4; and Housing Vacant for BFGS.
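The path-level aggregation described above amounts to averaging the pointwise $\widehat{f}_{\alpha,j}^{2}(\boldsymbol{\delta})$ values over the targets on a path and taking the covariate with the largest mean. A minimal sketch follows; the function name `select_rr_benchmark`, the covariate labels, and the numeric array are hypothetical and are not values from the application.

```python
import numpy as np

def select_rr_benchmark(f2_by_target, names):
    """Pick the covariate with the largest path-average f^2.

    f2_by_target has shape (n_covariates, n_targets): pointwise f^2 values
    for each covariate at each target on a single intervention path.
    """
    scores = np.asarray(f2_by_target, dtype=float).mean(axis=1)
    best = int(np.argmax(scores))
    return names[best], float(scores[best])

# Hypothetical pointwise values for three covariates at two targets.
names = ["covariate_a", "covariate_b", "covariate_c"]
f2 = [[0.001, 0.002],
      [0.010, 0.020],
      [0.005, 0.001]]
chosen, score = select_rr_benchmark(f2, names)
```

In this toy input the second covariate has the largest path-average and is therefore selected.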

Benchmark calibration.

For the selected outcome-side covariate,

$$\eta_{Y}^{2}=\frac{k_{Y}\,\widehat{f}_{Y,j}^{2}}{1+k_{Y}\,\widehat{f}_{Y,j}^{2}}.$$

For the selected RR-side covariate in a given scenario,

$$\eta_{\alpha}^{2}(\boldsymbol{\delta})=\frac{k_{D}\,\widehat{f}_{\alpha,j}^{2}(\boldsymbol{\delta})}{1+k_{D}\,\widehat{f}_{\alpha,j}^{2}(\boldsymbol{\delta})}.$$

These calibrated values are inserted directly into the point bounds and the endpoint confidence bounds above. The main application figure uses $k_{Y}=1$ and $k_{D}=1$, so the omitted confounder is calibrated to the observed benchmark strength on both the outcome side and the RR side.
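The calibration map from a benchmark $f^{2}$ value and a multiplier $k$ to a sensitivity parameter is a one-line transformation. The sketch below uses a hypothetical function name, `calibrate_eta2`, and checks it against the White row of Table 3 ($\widehat{f}_{Y,j}^{2}=0.0686$).

```python
def calibrate_eta2(f2_benchmark, k=1.0):
    """Map a benchmark f^2 and multiplier k to eta^2 = k f^2 / (1 + k f^2)."""
    if f2_benchmark < 0 or k < 0:
        raise ValueError("f2_benchmark and k must be nonnegative")
    scaled = k * f2_benchmark
    return scaled / (1.0 + scaled)

# White (Table 3): f^2 = 0.0686 with k_Y = 1 gives eta_Y^2 of roughly 0.0642,
# i.e. the observed partial R^2 reported in the same row.
eta_y2 = calibrate_eta2(0.0686, k=1.0)
```

With $k=1$ the map simply inverts the $f^{2}$ transform and returns the observed partial $R^{2}$; values of $k$ above one posit an omitted confounder stronger than the benchmark covariate.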

Table 3: Outcome-side benchmarking statistics for the 22 observed covariates. The table reports the observed partial $R^{2}$ and its $f^{2}$ transform from the nested linear projections of $Y$ on $(\boldsymbol{W},\boldsymbol{X})$ and $(\boldsymbol{W},\boldsymbol{X}_{-j})$.

| Covariate | $\widehat{\eta}_{Y,j}^{2}$ | $\widehat{f}_{Y,j}^{2}$ |
| --- | --- | --- |
| White | 0.0642 | 0.0686 |
| Poverty | 0.0341 | 0.0353 |
| Physical Activity | 0.0322 | 0.0333 |
| Housing More People Units | 0.0151 | 0.0154 |
| Binge Drinking | 0.0132 | 0.0134 |
| Households No Internet | 0.0123 | 0.0124 |
| Housing No Vehicle | 0.0112 | 0.0113 |
| Housing 10 Units | 0.0087 | 0.0087 |
| Median Income | 0.0082 | 0.0083 |
| Cigarette Smoking | 0.0052 | 0.0052 |
| Low Education Computer No Internet | 0.0042 | 0.0042 |
| Percentage No Insurance | 0.0037 | 0.0037 |
| Male | 0.0032 | 0.0032 |
| Households Smartphone | 0.0023 | 0.0023 |
| Housing Renter | 0.0011 | 0.0011 |
| Housing Vacant | 0.0010 | 0.0010 |
| HS Higher | 0.0006 | 0.0006 |
| Housing Mobile | 0.0001 | 0.0001 |
| Obesity | 0.0001 | 0.0001 |
| Unemployed | 0.0001 | 0.0001 |
| Households Low Income No Internet | 0.0001 | 0.0001 |
| Households Only Smartphone | 0.0000 | 0.0000 |

Table 4: RR-side benchmarking statistics for the 22 observed covariates. Entries are the scenario-level means of $\widehat{f}_{\alpha,j}^{2}(\boldsymbol{\delta})$ over the positive Gelbrich grid within each scenario, computed from nested linear projections of $\widehat{\alpha}_{s,\boldsymbol{\delta}}(\mathbf{X}_{i},\mathbf{W}_{i})$ on $\boldsymbol{X}$ and $\boldsymbol{X}_{-j}$. The main-text benchmark with $k_{D}=1$ selects the largest entry within each scenario.

| Covariate | BC | NO3 | OM | SO4 | NH4 | BC+OM | NO3+SO4+NH4 | All | BFGS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| White | 0.0003 | 0.0049 | 0.0029 | 0.0017 | 0.0005 | 0.0002 | 0.0012 | 0.0008 | 0.0024 |
| Poverty | 0.0011 | 0.0171 | 0.0005 | 0.0015 | 0.0081 | 0.0012 | 0.0023 | 0.0026 | 0.0089 |
| Physical Activity | 0.0014 | 0.0018 | 0.0011 | 0.0003 | 0.0007 | 0.0014 | 0.0014 | 0.0001 | 0.0002 |
| Housing More People Units | 0.0089 | 0.0017 | 0.0124 | 0.0026 | 0.0007 | 0.0043 | 0.0007 | 0.0003 | 0.0043 |
| Binge Drinking | 0.0003 | 0.0002 | 0.0027 | 0.0002 | 0.0005 | 0.0004 | 0.0017 | 0.0002 | 0.0021 |
| Households No Internet | 0.0003 | 0.0008 | 0.0003 | 0.0002 | 0.0009 | 0.0017 | 0.0003 | 0.0000 | 0.0004 |
| Housing No Vehicle | 0.0021 | 0.0006 | 0.0054 | 0.0001 | 0.0006 | 0.0027 | 0.0125 | 0.0022 | 0.0006 |
| Housing 10 Units | 0.0004 | 0.0004 | 0.0023 | 0.0002 | 0.0001 | 0.0003 | 0.0010 | 0.0003 | 0.0003 |
| Median Income | 0.0018 | 0.0210 | 0.0005 | 0.0016 | 0.0058 | 0.0020 | 0.0025 | 0.0028 | 0.0071 |
| Cigarette Smoking | 0.0005 | 0.0259 | 0.0015 | 0.0022 | 0.0167 | 0.0014 | 0.0174 | 0.0079 | 0.0036 |
| Low Education Computer No Internet | 0.0001 | 0.0001 | 0.0003 | 0.0004 | 0.0008 | 0.0005 | 0.0009 | 0.0009 | 0.0038 |
| Percentage No Insurance | 0.0001 | 0.0015 | 0.0004 | 0.0000 | 0.0016 | 0.0002 | 0.0003 | 0.0002 | 0.0015 |
| Male | 0.0000 | 0.0008 | 0.0020 | 0.0008 | 0.0007 | 0.0001 | 0.0002 | 0.0006 | 0.0017 |
| Households Smartphone | 0.0009 | 0.0042 | 0.0004 | 0.0037 | 0.0029 | 0.0004 | 0.0009 | 0.0015 | 0.0006 |
| Housing Renter | 0.0026 | 0.0004 | 0.0001 | 0.0001 | 0.0001 | 0.0013 | 0.0008 | 0.0000 | 0.0002 |
| Housing Vacant | 0.0007 | 0.0224 | 0.0006 | 0.0010 | 0.0035 | 0.0011 | 0.0157 | 0.0064 | 0.0138 |
| HS Higher | 0.0003 | 0.0005 | 0.0007 | 0.0006 | 0.0000 | 0.0004 | 0.0004 | 0.0005 | 0.0025 |
| Housing Mobile | 0.0008 | 0.0060 | 0.0000 | 0.0006 | 0.0002 | 0.0015 | 0.0145 | 0.0072 | 0.0031 |
| Obesity | 0.0003 | 0.0001 | 0.0012 | 0.0000 | 0.0011 | 0.0009 | 0.0001 | 0.0003 | 0.0006 |
| Unemployed | 0.0002 | 0.0026 | 0.0001 | 0.0005 | 0.0003 | 0.0000 | 0.0008 | 0.0006 | 0.0029 |
| Households Low Income No Internet | 0.0002 | 0.0010 | 0.0000 | 0.0013 | 0.0002 | 0.0000 | 0.0016 | 0.0009 | 0.0005 |
| Households Only Smartphone | 0.0002 | 0.0004 | 0.0003 | 0.0009 | 0.0003 | 0.0001 | 0.0003 | 0.0002 | 0.0016 |